Index pipeline
The index pipeline is uwwoe’s offline data side. It runs as a Go CLI
(cmd/uwscrape) on a developer machine or a CI runner. The backend
never participates.
Stages
- Scrape: fetch the public Waterloo Kuali catalog into raw snapshots under data/snapshots/. Requires UWSCRAPE_ALLOW_LIVE_WATERLOO=1; default is 0.
- Parse: turn raw snapshots into structured records (courses, credentials, prerequisite expressions, antirequisites, cross-listings). Unresolved fragments are preserved as UnparsedRequirement records, never silently dropped (ADR 0005).
- Patch: apply manual parser patches under data/patches/ for known catalog quirks. Required misses block release (ADR 0008).
- Validate: run schema, cross-reference, and structural checks. Produces validation-summary.json.
- Build index: assemble course-universe.sqlite plus the metadata files. Output goes to a chosen published directory.
- Release-gate: write release-decision.json with status approved, approved_with_warnings, or rejected. The backend only serves the first two (ADR 0007).
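The opt-in for live scraping can be pictured as a small guard that runs before any network access. This is a minimal sketch, not the actual cmd/uwscrape code: only the variable name UWSCRAPE_ALLOW_LIVE_WATERLOO comes from the stage description, while the function names and the error are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// envAllowLive is the variable named in the Scrape stage description.
const envAllowLive = "UWSCRAPE_ALLOW_LIVE_WATERLOO"

// errLiveScrapeDisabled is a hypothetical sentinel error for this sketch.
var errLiveScrapeDisabled = errors.New("live scraping disabled: set " + envAllowLive + "=1 to enable")

// liveScrapeAllowed reports whether the given value opts in to live scraping.
// Only the exact value "1" counts; unset, "0", or anything else is treated as off.
func liveScrapeAllowed(value string) bool {
	return value == "1"
}

// checkLiveScrape returns an error unless the environment opts in.
func checkLiveScrape() error {
	if !liveScrapeAllowed(os.Getenv(envAllowLive)) {
		return errLiveScrapeDisabled
	}
	return nil
}

func main() {
	if err := checkLiveScrape(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("live scrape permitted")
}
```

Treating every value other than "1" as off keeps the default safe: a typo in the environment cannot accidentally enable requests against the live catalog.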
The published artifact
A published index directory contains exactly:
| File | Purpose |
|---|---|
| course-universe.sqlite | The canonical runtime read model. |
| build-metadata.json | Build inputs, commit hashes, timestamps, catalog version. |
| validation-summary.json | Pass/fail counts and unresolved warnings. |
| release-decision.json | approved / approved_with_warnings / rejected, plus reasons. |
| build-report.md | Human-readable build narrative. |
The backend treats this directory as read-only; nothing writes to the index at runtime.
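As an illustration of how a consumer might honor the release gate, the sketch below parses a decision document and refuses anything that is not approved or approved_with_warnings. The JSON field names ("status", "reasons") and all function names are assumptions; the real layout of release-decision.json is defined by the pipeline spec, not by this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// releaseDecision mirrors an ASSUMED shape of release-decision.json.
// The field names "status" and "reasons" are guesses for illustration.
type releaseDecision struct {
	Status  string   `json:"status"`
	Reasons []string `json:"reasons"`
}

// servable reports whether a decision status may be served.
// Unknown or empty statuses are refused along with "rejected".
func servable(status string) bool {
	switch status {
	case "approved", "approved_with_warnings":
		return true
	default:
		return false
	}
}

// checkDecision parses the raw JSON and returns an error for anything not servable.
func checkDecision(raw []byte) error {
	var d releaseDecision
	if err := json.Unmarshal(raw, &d); err != nil {
		return fmt.Errorf("parse release decision: %w", err)
	}
	if !servable(d.Status) {
		return fmt.Errorf("index not servable: status %q", d.Status)
	}
	return nil
}

func main() {
	raw := []byte(`{"status":"rejected","reasons":["example reason"]}`)
	fmt.Println(checkDecision(raw))
}
```

The allow-list in servable means a missing or misspelled status is handled the same way as rejected, which matches the spirit of serving only the first two statuses.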
Determinism
The pipeline is deterministic from fixture inputs: running parse,
validate, and build-index against the committed test fixtures produces
identical index hashes on every run. CI checks this.
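A determinism check of this kind boils down to hashing two build outputs and comparing the digests. The sketch below assumes SHA-256 as the hash, which the text does not specify, and uses in-memory byte slices as stand-ins for the built index files; the function names are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// digest returns the hex-encoded SHA-256 of an artifact's bytes.
// SHA-256 is an assumption; the pipeline may use a different hash.
func digest(artifact []byte) string {
	sum := sha256.Sum256(artifact)
	return hex.EncodeToString(sum[:])
}

// sameBuild reports whether two build outputs hash identically.
func sameBuild(a, b []byte) bool {
	return digest(a) == digest(b)
}

func main() {
	// Stand-ins for the bytes of two index builds from the same fixtures.
	first := []byte("index bytes from run 1")
	second := []byte("index bytes from run 1")
	fmt.Println("deterministic:", sameBuild(first, second))
}
```

In a CI job, the two inputs would be the artifacts of two independent runs over the same committed fixtures, and any mismatch would fail the build.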
Schema versioning
The index has a mandatory index_metadata table with a semantic
version. Initial version is 0.1.0; the backend supports one major
version at a time. Startup against an unsupported schema fails closed.
(ADR 0003)
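The fail-closed startup rule can be sketched as a major-version comparison that treats anything unparseable as unsupported. The constant supportedMajor and both function names are hypothetical, and reading the version string out of the index_metadata table is left out of the example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// supportedMajor is the single major version this hypothetical backend accepts.
const supportedMajor = 0

// majorOf extracts the major component from a "MAJOR.MINOR.PATCH" string.
func majorOf(version string) (int, error) {
	parts := strings.Split(version, ".")
	if len(parts) != 3 {
		return 0, fmt.Errorf("malformed schema version %q", version)
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil || major < 0 {
		return 0, fmt.Errorf("malformed schema version %q", version)
	}
	return major, nil
}

// checkSchema fails closed: a malformed or unsupported version is an error,
// so the caller refuses to start instead of guessing.
func checkSchema(version string) error {
	major, err := majorOf(version)
	if err != nil {
		return err
	}
	if major != supportedMajor {
		return fmt.Errorf("unsupported schema major %d (want %d)", major, supportedMajor)
	}
	return nil
}

func main() {
	for _, v := range []string{"0.1.0", "0.4.2", "1.0.0", "garbage"} {
		fmt.Printf("%-8s -> %v\n", v, checkSchema(v))
	}
}
```

Only the major component is compared here, so minor and patch bumps within the supported major version pass the check.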
Catalog pinning
Every plan in the state store is pinned to a specific
catalog_version_id from the index it was built against. When a new
index is published, existing plans are not auto-migrated. Instead,
POST /api/v1/state/current/migration-preview previews the diff, and
the user decides whether to accept it. (ADR 0004)
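To make the pinning rule concrete, the sketch below finds plans whose pinned catalog version differs from the published one, without touching them. The plan struct, the string type of the version identifier, and the sample IDs are all hypothetical; only the idea that each plan carries a catalog_version_id and is never auto-migrated comes from the text.

```go
package main

import "fmt"

// plan is a hypothetical, trimmed-down view of a stored plan.
type plan struct {
	ID               string
	CatalogVersionID string // pinned when the plan was built
}

// stalePlans returns the plans pinned to a catalog version other than the
// published one. It only reads the input; no plan is rewritten, because
// migration is left to an explicit user decision.
func stalePlans(plans []plan, published string) []plan {
	var stale []plan
	for _, p := range plans {
		if p.CatalogVersionID != published {
			stale = append(stale, p)
		}
	}
	return stale
}

func main() {
	plans := []plan{
		{ID: "plan-a", CatalogVersionID: "catalog-1"},
		{ID: "plan-b", CatalogVersionID: "catalog-2"},
	}
	for _, p := range stalePlans(plans, "catalog-2") {
		fmt.Printf("%s is pinned to %s; offer a migration preview\n", p.ID, p.CatalogVersionID)
	}
}
```

A plan flagged this way would be a candidate for the migration-preview endpoint, with the pinned version left unchanged until the user accepts.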
Hot-swap is not supported
Index replacement is a deployment operation: stop the backend, swap the published directory, restart. Hot-swap is forbidden by ADR 0010; the runtime is built around a single, immutable index handle for the lifetime of the process.
Want the full spec?
- Scraper pipeline spec — every parser stage, fixture format, patch governance.
- Decision: Offline Kuali scraper — the locked architectural choice.
- Decision: Index artifact storage — directory layout, atomic publication.
- Decision: Release gate policy — what gets served, what doesn’t.