
Index pipeline

The index pipeline is uwwoe’s offline data side. It runs as a Go CLI (cmd/uwscrape) on a developer machine or a CI runner. The backend never participates.

Stages

  1. Scrape — fetch the public Waterloo Kuali catalog into raw snapshots under data/snapshots/. Requires UWSCRAPE_ALLOW_LIVE_WATERLOO=1; default is 0.
  2. Parse — turn raw snapshots into structured records (courses, credentials, prerequisite expressions, antirequisites, cross-listings). Unresolved fragments are preserved as UnparsedRequirement records, never silently dropped (ADR 0005).
  3. Patch — apply manual parser patches under data/patches/ for known catalog quirks. A required patch that fails to apply (a "required miss") blocks release (ADR 0008).
  4. Validate — run schema, cross-reference, and structural checks. Produces validation-summary.json.
  5. Build index — assemble course-universe.sqlite plus the metadata files. Output goes to a chosen published directory.
  6. Release-gate — write release-decision.json with status approved, approved_with_warnings, or rejected. The backend only serves the first two (ADR 0007).
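
The two gates at either end of the stage list can be sketched in Go as below. This is a minimal illustration, not the code in cmd/uwscrape: the function and type names are hypothetical, and treating any value other than "1" as "disabled" is an assumption. Only the variable name and the three status strings come from the text above.

```go
package main

import (
	"errors"
	"fmt"
)

// ReleaseStatus mirrors the three status values listed for release-decision.json.
type ReleaseStatus string

const (
	StatusApproved             ReleaseStatus = "approved"
	StatusApprovedWithWarnings ReleaseStatus = "approved_with_warnings"
	StatusRejected             ReleaseStatus = "rejected"
)

// ErrLiveScrapeDisabled is a hypothetical sentinel error for the scrape guard.
var ErrLiveScrapeDisabled = errors.New("live scrape disabled: set UWSCRAPE_ALLOW_LIVE_WATERLOO=1")

// checkLiveScrape takes the raw value of UWSCRAPE_ALLOW_LIVE_WATERLOO.
// Assumption: only the exact value "1" opts in; unset or anything else is
// treated like the default of 0.
func checkLiveScrape(envValue string) error {
	if envValue != "1" {
		return ErrLiveScrapeDisabled
	}
	return nil
}

// servable reports whether the backend may serve an index with this status.
// An unknown status is an error, so a typo cannot be mistaken for approval.
func servable(s ReleaseStatus) (bool, error) {
	switch s {
	case StatusApproved, StatusApprovedWithWarnings:
		return true, nil
	case StatusRejected:
		return false, nil
	default:
		return false, fmt.Errorf("unknown release status %q", s)
	}
}
```

A caller would typically pass `os.Getenv("UWSCRAPE_ALLOW_LIVE_WATERLOO")` to `checkLiveScrape` before the Scrape stage, and the status read from release-decision.json to `servable` before serving.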

The published artifact

A published index directory contains exactly:

File                     Purpose
course-universe.sqlite   The canonical runtime read model.
build-metadata.json      Build inputs, commit hashes, timestamps, catalog version.
validation-summary.json  Pass/fail counts and unresolved warnings.
release-decision.json    approved / approved_with_warnings / rejected + reasons.
build-report.md          Human-readable build narrative.

The backend treats this directory as read-only. Nothing writes to the index at runtime.
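
Because the directory is expected to hold exactly these five files, a consumer could verify the file set before opening anything. The sketch below is one way to do that. The helper name `checkFileSet` is hypothetical, and whether the real backend performs such a check is not stated here.

```go
package main

import "sort"

// expectedFiles lists the five artifacts of a published index directory.
var expectedFiles = []string{
	"course-universe.sqlite",
	"build-metadata.json",
	"validation-summary.json",
	"release-decision.json",
	"build-report.md",
}

// checkFileSet compares the file names found in a directory against the
// expected set. It returns the names that are missing and the names that
// are present but not expected, both sorted. Two empty results mean the
// directory contains exactly the published artifact.
func checkFileSet(found []string) (missing, unexpected []string) {
	want := make(map[string]bool, len(expectedFiles))
	for _, name := range expectedFiles {
		want[name] = true
	}
	seen := make(map[string]bool, len(found))
	for _, name := range found {
		seen[name] = true
		if !want[name] {
			unexpected = append(unexpected, name)
		}
	}
	for _, name := range expectedFiles {
		if !seen[name] {
			missing = append(missing, name)
		}
	}
	sort.Strings(missing)
	sort.Strings(unexpected)
	return missing, unexpected
}
```

The `found` slice would normally be built from the entry names returned by `os.ReadDir` on the published directory.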

Determinism

The pipeline is deterministic from fixture inputs: running parse, validate, and build-index against the committed test fixtures produces identical index hashes on every run. CI checks this.
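
A determinism check of this kind can be sketched as "build twice, compare digests". The example below assumes SHA-256 over the artifact bytes, which is a guess: the hash the project actually uses is not specified here. The `build` callback is a hypothetical stand-in for running the fixture pipeline.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// digest returns the hex-encoded SHA-256 of an artifact's bytes.
// Assumption: SHA-256 is used only for illustration.
func digest(artifact []byte) string {
	sum := sha256.Sum256(artifact)
	return hex.EncodeToString(sum[:])
}

// deterministic runs build twice and reports whether both runs produced
// artifacts with the same digest. build is a hypothetical stand-in for
// "parse, validate, build-index against the committed fixtures".
func deterministic(build func() ([]byte, error)) (bool, error) {
	first, err := build()
	if err != nil {
		return false, err
	}
	second, err := build()
	if err != nil {
		return false, err
	}
	return digest(first) == digest(second), nil
}
```

Anything that varies between runs, such as an embedded timestamp or unordered map iteration during serialization, would make this check return false.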

Schema versioning

The index has a mandatory index_metadata table with a semantic version. Initial version is 0.1.0; the backend supports one major version at a time. Startup against an unsupported schema fails closed. (ADR 0003)
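
The fail-closed startup check might look roughly like the following. It is a sketch under assumptions: the version string is taken to have already been read from the index_metadata table, the constant `supportedMajor` and the function name are hypothetical, and pre-release or build suffixes are not handled.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// supportedMajor is the single major schema version this sketch accepts.
// 0 matches the initial version 0.1.0 mentioned above.
const supportedMajor = 0

// checkSchemaVersion returns nil only when version is a MAJOR.MINOR.PATCH
// string whose major component equals supportedMajor. Every other input,
// including malformed strings, is an error, so startup fails closed.
func checkSchemaVersion(version string) error {
	parts := strings.Split(version, ".")
	if len(parts) != 3 {
		return fmt.Errorf("malformed schema version %q", version)
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil || major < 0 {
		return fmt.Errorf("malformed major version in %q", version)
	}
	if major != supportedMajor {
		return fmt.Errorf("unsupported schema major version %d (supported: %d)", major, supportedMajor)
	}
	return nil
}
```

The important design point is the default: the function has exactly one path that returns nil, so a missing or unreadable version can never be treated as compatible.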

Catalog pinning

Every plan in the state store is pinned to a specific catalog_version_id from the index it was built against. When a new index is published, existing plans are not auto-migrated. Instead, POST /api/v1/state/current/migration-preview previews the diff, and the user decides whether to accept it. (ADR 0004)
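
To make the pinning rule concrete, the sketch below finds plans whose pinned version differs from the currently published index. The `Plan` struct, its field names, and both helpers are hypothetical; only the idea of a catalog_version_id pinned per plan comes from the text. Nothing here mutates a plan, in keeping with the no-auto-migration rule.

```go
package main

// Plan is a hypothetical, minimal view of a stored plan. Only the pinned
// catalog version matters for this sketch.
type Plan struct {
	ID               string
	CatalogVersionID string
}

// needsMigrationPreview reports whether a plan is pinned to a catalog
// version other than the one in the currently published index.
func needsMigrationPreview(p Plan, currentCatalogVersionID string) bool {
	return p.CatalogVersionID != currentCatalogVersionID
}

// stalePlans returns the plans that would need a migration preview.
// It only selects; it never rewrites a plan's pinned version.
func stalePlans(plans []Plan, currentCatalogVersionID string) []Plan {
	var stale []Plan
	for _, p := range plans {
		if needsMigrationPreview(p, currentCatalogVersionID) {
			stale = append(stale, p)
		}
	}
	return stale
}
```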

Hot-swap is not supported

Index replacement is a deployment operation: stop the backend, swap the published directory, restart. Hot-swap is forbidden by ADR 0010; the runtime is built around a single, immutable index handle for the lifetime of the process.
