ADR 0001: Use an Offline Kuali Scraper and Versioned Index Pipeline
Status: Accepted for architectural direction.
Date: 2026-05-11.
Context
UWScrape needs a trustworthy index of Waterloo courses, requisites, antirequisites, cross-listings, programs, plans, minors, options, and specializations.
The current Waterloo undergraduate calendar is served through a Kuali Catalog JavaScript embed.
The observed Waterloo page embeds https://uwaterloocm.kuali.co as the Kuali subdomain.
The observed undergraduate catalog id is 67e557ed6ed2fe2bd3a38956.
The observed catalog title is 2026-2027 Undergraduate Studies Academic Calendar.
Observed Kuali Catalog endpoints can return catalog metadata, schemas, search results, course details, and program details.
Waterloo documentation says Kuali CM feeds Kuali Catalog and that direct calendar links use opaque item pids.
Kuali documentation says the Catalog API is not a public stable API and primarily exists to drive the UI.
Calendar course and program facts are relatively slow-moving compared with timetable data.
The product goal is pathway reasoning and visualization, not live timetable scheduling.
The user-facing app should remain responsive even when Waterloo or Kuali endpoints are unavailable.
The parser must preserve evidence because rendered Kuali JSON may contain HTML fragments for requirements.
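As a hedged illustration of this evidence-preservation concern, a decoder can hold the requirement fragment as json.RawMessage so the verbatim HTML survives even when structured parsing fails. The field names here are assumptions for illustration, not the real Kuali schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CourseRecord is a hypothetical shape; the real Kuali payload differs.
// RequirementsHTML keeps the original fragment verbatim so parsing can be
// re-run against it later without re-fetching.
type CourseRecord struct {
	Code             string          `json:"code"`
	Title            string          `json:"title"`
	RequirementsHTML json.RawMessage `json:"requirementsHtml"`
}

// decode unmarshals a raw course payload while preserving the unparsed
// requirement fragment byte-for-byte.
func decode(raw []byte) CourseRecord {
	var c CourseRecord
	if err := json.Unmarshal(raw, &c); err != nil {
		panic(err)
	}
	return c
}

func main() {
	raw := []byte(`{"code":"MATH 135","title":"Algebra","requirementsHtml":"<p>Prereq: none</p>"}`)
	c := decode(raw)
	fmt.Println(c.Code, string(c.RequirementsHTML))
}
```

The point of the sketch is the json.RawMessage field: the pipeline can improve its requirement parser over time and replay it against preserved fragments.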
Decision
UWScrape will use an offline scraper and index pipeline.
The scraper will not run inside normal user-facing request paths.
The scraper will fetch Kuali catalog data during a manual or scheduled maintenance process.
The scraper will store immutable raw snapshots before parsing.
The parser will derive structured course, credential, and requirement records from snapshots.
The parser will preserve unparsed requirement fragments as explicit UnparsedRequirement records.
The validator will produce a review report before an index is published.
The index builder will publish versioned runtime artifacts.
The Go backend will load published artifacts and serve query endpoints.
The frontend will consume backend APIs and graph projection data rather than live Kuali data.
Alternatives Considered
Alternative 1: Runtime Kuali Fetching
The backend could fetch Kuali data during user requests.
This would make the latest upstream data visible without a build step.
This would reduce up-front artifact management.
However, it would couple user experience to an undocumented UI-driving API.
It would make performance depend on Kuali latency and availability.
It would complicate caching, retries, and failure explanations.
It would make reproducible solver answers harder because source data could change between sessions.
It would also increase request volume against a third-party service.
This alternative is rejected for the first architecture.
Alternative 2: Frontend-Only Static App
The project could publish a fully static app with precomputed JSON.
This would simplify deployment.
This would remove backend operational burden.
This would work for read-only exploration.
However, it would make private student state, such as saved degree plans and progress, awkward to manage without a backend.
It would push solver complexity into the browser.
It would make exact query explanations harder to centralize.
It would make future state migration and server-side validation harder.
This alternative remains plausible for a demo but is not the preferred product architecture.
Alternative 3: Use UWFlow as Primary Source
The project could ingest UWFlow data rather than building an independent index.
This could accelerate early exploration.
UWFlow already models many course relationships.
However, the project goal is an independent index grounded in authoritative Waterloo calendar data.
UWFlow may still be useful as a comparison source.
UWFlow should not become the primary source of truth without a separate decision.
This alternative is rejected as the primary architecture.
Alternative 4: Manual Dataset First
The project could start with a hand-authored subset of Faculty of Mathematics data.
This would allow quick prototyping of UI concepts.
This would avoid scraper complexity at the start.
However, it would risk designing against a toy structure.
It would delay learning the real Kuali field and requirement shapes.
It would make correctness claims weak.
This alternative is acceptable only for tiny fixtures, not for the main index.
Consequences
The project gets reproducible index builds.
The project gets a clean boundary between source ingestion and runtime behavior.
The backend can be simple and robust.
The frontend can be designed around stable API contracts.
The scraper can fail without breaking active users.
The parser can improve over time while preserving raw evidence.
Manual parser patches become reviewable artifacts.
The system can support historical catalog versions later.
The system must maintain scraper code as Kuali changes.
The system must define artifact publication and rollback.
The system must decide whether raw snapshots are committed, archived, or stored externally.
The system must expose parser uncertainty to avoid false authority.
The first implementation spec should focus on the scraper pipeline.
The first implementation spec should define fetch commands, file layouts, parser interfaces, validation reports, and publish criteria.
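As a starting point for that spec, one illustrative on-disk layout, offered as an assumption to be revised rather than a decision:

```
data/
  snapshots/<fetch-timestamp>/   raw Kuali responses, immutable after write
  parsed/                        derived course, credential, and requirement records
  reports/                       validation review reports, one per build
  artifacts/<version>/           published runtime index artifacts, versioned
  artifacts/current              pointer to the active published version
```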