Offline Index and Runtime Architecture

Status: Draft architecture baseline.

Project: UWScrape.

Audience: project maintainers, future implementers, and design reviewers.

Scope: scraper boundary, index artifacts, Go backend, solver services, and frontend contract.

Last reviewed: 2026-05-11.

1. Thesis

UWScrape should treat Waterloo calendar data as a periodically imported source.

The scraper should not run as part of the user-facing runtime.

The user-facing runtime should load a versioned index artifact.

The backend should manage student state and answer exact query requests.

The frontend should provide the interactive visual workspace.

The source calendar changes slowly enough to justify offline indexing.

Course descriptions, requisites, cross-listings, and plan rules are calendar-versioned facts.

Term offerings, instructors, rooms, and section times are a different product boundary.

This architecture focuses on calendar-pathway reasoning rather than course scheduling.

The first durable asset is a trustworthy course and credential index.

The second durable asset is a repeatable import pipeline.

The third durable asset is a runtime query surface over that index.

The fourth durable asset is a visual interface that can evolve without redefining truth.

The geometry of the UI must remain a projection.

The index must remain the operational source of truth for the app.

2. Naming Policy

Implementation names should be plain and operational.

The formal paper may use mathematical vocabulary when helpful.

Code and schemas should avoid rhetorical names for durable components.

Use RequirementExpression for a parsed logical requirement tree.

Use RequirementCondition for a typed leaf-level condition.

Use RuleNode for a display or parser tree node when shape matters.

Use RequirementGroup for grouped all-of, any-of, or choose-k structures.

Use ParsedRequirement for a parsed representation with provenance.

Use UnparsedRequirement for a preserved opaque fragment.

Use CourseListing for a subject-number listing in a calendar.

Use CourseCredit for the internal identity used for credit equivalence.

Use CredentialRequirement for requirements attached to a credential.

Use StudentRecord for local user state.

Use CatalogVersion for a Waterloo calendar publication.

Avoid names that sound elegant but hide data shape.

Avoid names that imply more certainty than the parser has.

Prefer names that answer what the component stores or computes.

3. External Facts

Waterloo publishes the undergraduate calendar through a Kuali Catalog embed.

The public page is a JavaScript shell, not a static HTML calendar.

Waterloo embeds a Kuali subdomain in the page.

The observed Kuali subdomain is https://uwaterloocm.kuali.co.

Waterloo embeds a catalog id in the page.

The observed undergraduate catalog id is 67e557ed6ed2fe2bd3a38956.

The observed title is 2026-2027 Undergraduate Studies Academic Calendar.

The observed calendar metadata endpoint is /api/v1/catalog/public/catalogs/:catalogId.

The observed course schema endpoint is /api/v1/catalog/schema/:catalogId/courses.

The observed program schema endpoint is /api/v1/catalog/schema/:catalogId/programs.

The observed search endpoint is /api/v1/catalog/search/:catalogId.

The observed course item endpoint is /api/v1/catalog/course/:catalogId/:pid.

The observed program item endpoint is /api/v1/catalog/program/:catalogId/:pid.

The search endpoint exposes an item-count response header.

An empty course search enumerated undergraduate course records.

One observed empty course search reported item-count: 4342.
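
The observed endpoints above can be gathered into one small helper so that a change on the Kuali side touches a single file. The following is a minimal sketch, not the project's client: the function names and the EXAMPLE-PID value are assumptions, and search query parameters are left out because this document does not record them.

```go
package main

import (
	"fmt"
	"net/url"
)

// Observed values from this section. Kuali does not guarantee API stability,
// so a real client would probably load these from configuration.
const (
	kualiBase = "https://uwaterloocm.kuali.co"
	catalogID = "67e557ed6ed2fe2bd3a38956"
)

// MetadataURL builds the observed calendar metadata endpoint.
func MetadataURL(catalog string) string {
	return kualiBase + "/api/v1/catalog/public/catalogs/" + url.PathEscape(catalog)
}

// SchemaURL builds the observed schema endpoint; kind is "courses" or "programs".
func SchemaURL(catalog, kind string) string {
	return kualiBase + "/api/v1/catalog/schema/" + url.PathEscape(catalog) + "/" + url.PathEscape(kind)
}

// SearchURL builds the observed search endpoint without query parameters.
func SearchURL(catalog string) string {
	return kualiBase + "/api/v1/catalog/search/" + url.PathEscape(catalog)
}

// ItemURL builds the observed item endpoint; kind is "course" or "program".
func ItemURL(kind, catalog, pid string) string {
	return kualiBase + "/api/v1/catalog/" + url.PathEscape(kind) + "/" + url.PathEscape(catalog) + "/" + url.PathEscape(pid)
}

func main() {
	fmt.Println(MetadataURL(catalogID))
	fmt.Println(SchemaURL(catalogID, "courses"))
	fmt.Println(SearchURL(catalogID))
	fmt.Println(ItemURL("course", catalogID, "EXAMPLE-PID")) // hypothetical pid
}
```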

Kuali documentation says the Catalog API is not a stable public API.

Kuali documentation says the API primarily drives UI functionality.

Kuali documentation says backward compatibility is not guaranteed.

Waterloo documentation says Kuali CM feeds Kuali Catalog.

Waterloo documentation says direct calendar links use item pids.

Waterloo documentation distinguishes course, program or plan, and regulation links.

These facts support offline import rather than runtime coupling.

Source links should be retained near scraper implementation docs.

4. System Boundaries

The scraper is a build-time or maintenance-time tool.

The scraper fetches catalog data from Waterloo and Kuali surfaces.

The scraper stores raw responses without destructive rewriting.

The parser derives structured records from raw responses.

The validator checks derived records for consistency and parser gaps.

The index builder produces compact runtime artifacts.

The backend serves the active index and optional historical indexes.

The backend stores anonymous student state.

The backend runs exact query logic against the index and student state.

The frontend renders exploration, planning, and explanation views.

The frontend never owns canonical degree semantics.

The frontend may cache index slices for responsiveness.

The frontend may compute local visual highlighting.

The frontend may defer solver-grade answers to the backend.

The project does not initially own timetable data.

The project does not initially own instructor data.

The project does not initially own room or section availability data.

The project does not replace academic advising.

The project should expose provenance for every derived fact.

5. Repository Layout

The repository should separate documents, commands, packages, and data.

docs/architecture should contain system architecture narratives.

docs/ADRs should contain decision records.

docs/specs should contain approved implementation specifications.

cmd/uwscrape should contain the scraper CLI entrypoint.

cmd/uwserver should contain the backend service entrypoint.

internal/kuali should contain Kuali client code.

internal/snapshot should contain raw snapshot storage code.

internal/parser should contain parsing stages.

internal/model should contain shared domain types.

internal/indexbuild should contain index construction code.

internal/solver should contain query evaluation code.

internal/state should contain student state persistence code.

web should contain the frontend application.

data/raw should hold downloaded raw snapshots when checked in or staged.

data/parsed should hold derived intermediate records when useful.

data/index should hold generated runtime artifacts when useful.

Generated artifacts may be stored outside git if large.

Generated artifact policy should be explicit before the first large scrape.

6. Data Lifecycle

The import lifecycle begins with catalog discovery.

Catalog discovery finds catalog id, title, start date, end date, and settings.

The scraper fetches schemas for item types needed by the project.

The scraper enumerates course records through paginated search.

The scraper enumerates program records through paginated search.

The scraper may later enumerate policies, experiences, and specializations.

Each search result provides a pid and a stable item id.

The scraper should fetch a detailed item record for each pid.

Each raw response should be stored as received.

Each raw response should store URL, method, headers, status, and timestamp.

Each raw response should store catalog id and scraper version.

Each raw response should store a content hash.
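
The stored fields listed above could be carried by one record type. This is a sketch under assumptions: the type name RawResponse, its field names, the Capture helper, and the choice of SHA-256 for the content hash are illustrative and are not fixed by this document.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"time"
)

// RawResponse is a hypothetical record for one captured response.
type RawResponse struct {
	URL            string
	Method         string
	Headers        map[string][]string
	Status         int
	FetchedAt      time.Time
	CatalogID      string
	ScraperVersion string
	ContentSHA256  string
	Body           []byte // stored exactly as received
}

// ContentHash returns the hex SHA-256 digest of the unmodified body.
func ContentHash(body []byte) string {
	sum := sha256.Sum256(body)
	return hex.EncodeToString(sum[:])
}

// Capture fills a RawResponse without altering the body.
func Capture(method, url string, status int, headers map[string][]string, body []byte, catalogID, scraperVersion string) RawResponse {
	return RawResponse{
		URL:            url,
		Method:         method,
		Headers:        headers,
		Status:         status,
		FetchedAt:      time.Now().UTC(),
		CatalogID:      catalogID,
		ScraperVersion: scraperVersion,
		ContentSHA256:  ContentHash(body),
		Body:           body,
	}
}

func main() {
	// All argument values here are placeholders.
	r := Capture("GET", "https://example.invalid/item", 200,
		map[string][]string{"Content-Type": {"application/json"}},
		[]byte(`{"pid":"EXAMPLE"}`), "example-catalog", "dev")
	fmt.Println(r.Status, r.ContentSHA256)
}
```

Hashing the bytes as received, before any parsing, keeps the hash usable as a provenance reference from parser output back into the snapshot.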

The parser reads only from snapshots.

The parser should be deterministic.

Parser output should include provenance references into raw snapshots.

Parser output should include parser confidence where uncertainty remains.

The validator reads parser output and source metadata.

The index builder reads validated parser output.

The index builder emits a versioned runtime index.

The runtime never fetches Kuali during normal user sessions.

7. Raw Snapshot Design

Raw snapshots are evidence.

Raw snapshots should be immutable once captured.

Raw snapshots should use stable directory names.

A directory name should include catalog year and catalog type.

A directory name may include catalog id for disambiguation.

The snapshot manifest should list every request.

The snapshot manifest should list every response hash.

The snapshot manifest should list total item counts.

The snapshot manifest should list expected and actual fetched counts.

The snapshot manifest should list failed requests.

The snapshot manifest should list retries.

The snapshot manifest should list scraper binary version or commit.
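
A manifest with these fields makes an incomplete scrape detectable before parsing starts. The sketch below assumes hypothetical field names and a Problems helper; the expected count would come from a source such as the search item-count header.

```go
package main

import "fmt"

// Manifest is a hypothetical shape for the snapshot manifest.
type Manifest struct {
	CatalogID      string
	ScraperCommit  string
	ExpectedItems  int               // for example, the observed item-count header
	FetchedItems   int               // detail records actually stored
	ResponseHashes map[string]string // request URL -> content hash
	FailedRequests []string          // request URLs that never succeeded
	Retries        map[string]int    // request URL -> retry count
}

// Problems reports count mismatches and failed requests in plain text.
func (m Manifest) Problems() []string {
	var out []string
	if m.FetchedItems != m.ExpectedItems {
		out = append(out, fmt.Sprintf("fetched %d of %d expected items", m.FetchedItems, m.ExpectedItems))
	}
	if n := len(m.FailedRequests); n > 0 {
		out = append(out, fmt.Sprintf("%d failed requests", n))
	}
	return out
}

func main() {
	m := Manifest{
		CatalogID:      "example-catalog",
		ExpectedItems:  4342,
		FetchedItems:   4340,
		FailedRequests: []string{"https://example.invalid/a", "https://example.invalid/b"},
	}
	for _, p := range m.Problems() {
		fmt.Println(p)
	}
}
```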

Raw course detail JSON should be stored separately per item.

Raw program detail JSON should be stored separately per item.

Schemas should be stored with the snapshot.

Catalog metadata should be stored with the snapshot.

Search pages should be stored when practical.

Raw HTML fragments inside JSON should not be cleaned in place.

No parser stage should overwrite raw evidence.

This protects the project from later parser mistakes.

This also supports reproducible bug reports.

8. Parser Design

The parser should be a multi-stage pipeline.

Stage one reads JSON and maps known top-level fields.

Stage two parses HTML fragments into DOM trees.

Stage three converts DOM trees into display rule trees.

Stage four recognizes typed requirement expressions.

Stage five records unparsed fragments.

Stage six attaches provenance to every parsed result.

The JSON mapper should understand Kuali option objects.

The JSON mapper should preserve unknown fields.

The DOM parser should use an HTML parser, not regular expressions.

The display rule parser should understand list nesting.

The display rule parser should understand data-test rule markers.

The display rule parser should understand group headers.

The display rule parser should understand links to course records.

The display rule parser should understand links to program records.

The semantic parser should identify all-of groups.

The semantic parser should identify any-of groups.

The semantic parser should identify choose-k groups.

The semantic parser should identify course completion conditions.

The semantic parser should identify minimum grade conditions.

The semantic parser should identify academic progress conditions.

The semantic parser should identify enrolled-in-program conditions.

The semantic parser should identify unit pool requirements.

The semantic parser should identify subject and level range pools.

The semantic parser should identify antirequisite conflicts.

Any unmatched phrase should become an UnparsedRequirement.

Unparsed requirements are not parser failures.

Unparsed requirements are known unknowns.

Known unknowns must remain visible to validators and UI explanations.

9. Requirement Representation

Requirements should be represented as typed expression trees.

A requirement expression may be an all-of group.

A requirement expression may be an any-of group.

A requirement expression may be a choose-k group.

A requirement expression may be a unit threshold group.

A requirement expression may be an average threshold group.

A requirement expression may be a subject pool group.

A requirement expression may be a typed condition.

A requirement expression may be an unparsed requirement.

Each expression should have a stable local id.

Each expression should point to source item id.

Each expression should point to source field name.

Each expression should point to a DOM or text span when possible.

Each expression should store normalized display text.

Each condition should store referenced course ids when known.

Each condition should store referenced credential ids when known.

Each condition should store referenced subject codes when known.

Each condition should store numeric thresholds when known.

Each condition should store comparison operators when known.

Each condition should store whether it is required for enrollment or graduation.

Each condition should store whether it is positive or excluding.

Each expression should distinguish prerequisites from credential requirements.

Each expression should distinguish eligibility from completion.

Each expression should distinguish course credit from course listing.
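
One possible Go shape for these typed expression trees is sketched below. The type names follow the naming policy in section 2, but every field name, the kind strings, and the CountUnparsed helper are assumptions made for illustration, and only a subset of the listed expression kinds is shown.

```go
package main

import "fmt"

// ExprKind names the variant of an expression (illustrative subset).
type ExprKind string

const (
	KindAllOf     ExprKind = "all_of"
	KindAnyOf     ExprKind = "any_of"
	KindChooseK   ExprKind = "choose_k"
	KindCondition ExprKind = "condition"
	KindUnparsed  ExprKind = "unparsed"
)

// SourceRef points back to the source item and field.
type SourceRef struct {
	ItemID string
	Field  string
}

// RequirementCondition is a typed leaf; unknown values stay nil or empty.
type RequirementCondition struct {
	CourseIDs []string
	MinGrade  *int
	Excluding bool
}

// RequirementExpression is one node of the tree.
type RequirementExpression struct {
	ID          string
	Kind        ExprKind
	K           int // used only by choose_k groups
	Children    []RequirementExpression
	Condition   *RequirementCondition
	DisplayText string
	Source      SourceRef
}

// CountUnparsed walks the tree so unparsed fragments stay visible.
func CountUnparsed(e RequirementExpression) int {
	n := 0
	if e.Kind == KindUnparsed {
		n++
	}
	for _, c := range e.Children {
		n += CountUnparsed(c)
	}
	return n
}

func main() {
	src := SourceRef{ItemID: "example-item", Field: "prerequisites"}
	tree := RequirementExpression{
		ID: "r1", Kind: KindAllOf, Source: src,
		Children: []RequirementExpression{
			{ID: "r1.1", Kind: KindCondition, Source: src,
				Condition: &RequirementCondition{CourseIDs: []string{"example-course"}}},
			{ID: "r1.2", Kind: KindUnparsed, Source: src, DisplayText: "opaque fragment kept verbatim"},
		},
	}
	fmt.Println(CountUnparsed(tree))
}
```

Keeping the unparsed variant inside the same tree type means validators and explanations meet known unknowns on the same traversal as parsed conditions.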

10. Course and Credential Model

CourseListing represents a calendar-visible subject and number.

CourseListing includes title, description, units, level, and owning faculty.

CourseListing includes the Kuali pid and item id.

CourseListing includes catalog version.

CourseListing may point to one CourseCredit.

CourseCredit represents credit identity or equivalence grouping.

Cross-listed listings may share a credit identity.

Former course numbers may be connected by equivalence relations.

Antirequisites may be connected by conflict relations.

Credential represents a major, minor, option, specialization, plan, or degree-level requirement set.

Credential includes title, code, type, faculty, and catalog version.

Credential includes declaration and graduation fields where present.

CredentialRequirement connects credentials to requirement expressions.

CourseRequirement connects courses to requisite expressions.

SubjectCode represents a course subject and owning metadata.

CatalogVersion represents a published Waterloo calendar.

RequirementSource represents a field in a source item.

This model should not collapse all information into graph edges.

Graph edges are useful projections.

The persisted model should retain expression structure.

11. Index Artifact

The runtime index should be versioned.

The runtime index should be compact enough for fast server startup.

SQLite is the preferred first index format.

SQLite gives simple queries, portability, and inspectability.

A compressed JSON artifact can be emitted for frontend graph bootstrapping.

The SQLite artifact should include entity tables.

The SQLite artifact should include requirement expression tables.

The SQLite artifact should include provenance tables.

The SQLite artifact should include search tables.

The SQLite artifact should include derived relationship tables.

The derived relationship tables should be reproducible from expressions.

The artifact should include build metadata.

The artifact should include source catalog metadata.

The artifact should include parser version metadata.

The artifact should include validation summary metadata.

The artifact should include unresolved parser warnings.

The artifact should include content hashes for traceability.

The artifact should be replaceable as a whole.

The backend should not mutate the index artifact.

Student state should live in a separate database.

12. Backend Runtime

The backend should be a small Go service.

The backend should load one active index at startup.

The backend may load multiple historical indexes later.

The backend should expose read-only catalog endpoints.

The backend should expose query endpoints.

The backend should expose anonymous state endpoints.

The backend should not expose scraper endpoints initially.

The backend should not depend on live Kuali availability.

The backend should perform exact requirement evaluation.

The backend should produce explanation data with every solver answer.

The backend should avoid storing unnecessary personal information.

The backend should rate limit state and solver endpoints.

The backend should log errors without logging secret state tokens.

The backend should support local file-based deployment first.

The backend can use SQLite for both index and student state.

The index database should be read-only at runtime.

The state database should be writable.

The state database should be backed up independently.

The backend API should be stable enough for frontend iteration.

The backend API should remain honest about incomplete parser coverage.

13. Anonymous State

The project can avoid email-based accounts initially.

The system should generate a random secret token.

The token should have at least 256 bits of entropy.

The token should be encoded as base64url without padding.

The server should store only a hash of the token.

The raw token should be shown once to the user.

The user should be told that losing the token loses access.

The token functions like a password.

The token should not be called a hash in user-facing text.
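
The token rules above can be sketched with the Go standard library. This is an illustration under assumptions: the function names are hypothetical, and HMAC-SHA-256 with a server-side key is shown as one keyed hashing option rather than a decided strategy.

```go
package main

import (
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"encoding/base64"
	"encoding/hex"
	"fmt"
)

// NewToken returns 32 random bytes (256 bits) as unpadded base64url.
func NewToken() (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(buf), nil
}

// TokenDigest computes a keyed digest; only this value would be stored.
func TokenDigest(serverKey []byte, token string) string {
	mac := hmac.New(sha256.New, serverKey)
	mac.Write([]byte(token))
	return hex.EncodeToString(mac.Sum(nil))
}

// TokenMatches compares digests in constant time.
func TokenMatches(serverKey []byte, token, storedDigest string) bool {
	return hmac.Equal([]byte(TokenDigest(serverKey, token)), []byte(storedDigest))
}

func main() {
	key := []byte("placeholder-server-key") // hypothetical; load from secret storage
	tok, err := NewToken()
	if err != nil {
		panic(err)
	}
	stored := TokenDigest(key, tok)
	fmt.Println(len(tok), TokenMatches(key, tok, stored))
}
```

Thirty-two bytes encode to 43 base64url characters without padding, which is short enough to show once and copy by hand.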

The state record should store completed courses.

The state record should store grades when provided.

The state record should store planned courses.

The state record should store academic progress.

The state record should store academic standing separately.

The state record should store declared credentials.

The state record should store desired credentials.

The state record should store dismissed or pinned explanations.

The state record should store UI layout preferences only when useful.

State export should be possible early.

14. Query Services

The first query service should evaluate unlock status.

Unlock status should return satisfied, blocked, partial, unknown, or conflict.

The second query service should evaluate credential progress.

Credential progress should identify fulfilled requirements.

Credential progress should identify unmet requirements.

Credential progress should identify courses that do not contribute.

Credential progress should identify requirements blocked by missing prerequisites.

The third query service should evaluate what-if changes.

What-if changes should compare current and proposed student records.

The fourth query service should evaluate bounded course impact.

Course impact should explain what a course unlocks.

Course impact should remain relationship evidence, not an academic satisfaction status.

The fifth query service should evaluate credential gap summaries.

Credential gap summaries should remain conservative when requirements are unparsed.

Complete future-course feasibility should be reserved for a later bounded planning service.

Query responses should include machine-readable results.

Query responses should include human-readable explanations.

Query responses should include referenced source facts.

Query responses should include uncertainty flags.

Query responses should not silently discard unknown requirements.
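
A response type can make the last rule mechanical rather than a convention. The sketch below is hypothetical: the JSON field names and the Finalize helper are assumptions, and only the five unlock status values come from this section.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// The five unlock statuses named in this section.
var unlockStatuses = map[string]bool{
	"satisfied": true, "blocked": true, "partial": true, "unknown": true, "conflict": true,
}

// UnlockResponse is a hypothetical wire shape for an unlock query.
type UnlockResponse struct {
	CourseID    string   `json:"course_id"`
	Status      string   `json:"status"`
	Explanation string   `json:"explanation"`
	SourceRefs  []string `json:"source_refs"`
	Uncertain   bool     `json:"uncertain"`
	UnparsedIDs []string `json:"unparsed_requirement_ids"`
}

// Finalize rejects unknown statuses and forces the uncertainty flag
// whenever unparsed requirements were involved in the answer.
func Finalize(r UnlockResponse) (UnlockResponse, error) {
	if !unlockStatuses[r.Status] {
		return r, fmt.Errorf("unknown unlock status %q", r.Status)
	}
	if len(r.UnparsedIDs) > 0 {
		r.Uncertain = true
	}
	return r, nil
}

func main() {
	r, err := Finalize(UnlockResponse{
		CourseID:    "example-course",
		Status:      "unknown",
		Explanation: "One requirement fragment could not be parsed.",
		SourceRefs:  []string{"example-item#prerequisites"},
		UnparsedIDs: []string{"r1.2"},
	})
	if err != nil {
		panic(err)
	}
	out, err := json.MarshalIndent(r, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```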

15. Solver Strategy

The first solver can be direct Go evaluation.

Direct evaluation is enough for many unlock and progress queries.

A later solver can encode planning problems as CP-SAT or SMT.

Datalog can support reachability and dependency closure queries.

SAT-style encodings can support satisfiability over course choices.

SMT-style encodings can support numeric grade and average constraints.

Package-resolver patterns can support conflict explanation.

The architecture should not choose one solver for every query.

The backend should define solver-independent query contracts.

The solver package should receive an index view and student record.

The solver package should return explanations, not just booleans.

The solver package should expose incomplete-information behavior.

The solver package should not depend on frontend geometry.

The solver package should be tested with fixture catalogs.

The solver package should handle unparsed requirements explicitly.
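
Direct evaluation with explicit handling of unparsed requirements can be sketched as a three-valued tree walk. This is an illustration only: the Expr shape and kind strings are assumptions, and it returns just three of the five unlock statuses from section 14, leaving partial and conflict out.

```go
package main

import "fmt"

type Status string

const (
	Satisfied Status = "satisfied"
	Blocked   Status = "blocked"
	Unknown   Status = "unknown"
)

// Expr is a simplified requirement node for this sketch.
type Expr struct {
	Kind     string // "all_of", "any_of", "choose_k", "course", or "unparsed"
	K        int    // used only by choose_k
	Course   string // used only by course
	Text     string // used only by unparsed
	Children []Expr
}

// Result carries a status plus the reasons that led to it.
type Result struct {
	Status  Status
	Reasons []string
}

// Evaluate never treats an unparsed node as satisfied or as blocked.
func Evaluate(e Expr, completed map[string]bool) Result {
	var need int
	switch e.Kind {
	case "course":
		if completed[e.Course] {
			return Result{Satisfied, []string{"completed " + e.Course}}
		}
		return Result{Blocked, []string{"missing " + e.Course}}
	case "all_of":
		need = len(e.Children)
	case "any_of":
		need = 1
	case "choose_k":
		need = e.K
	default: // "unparsed" and any unrecognized kind
		return Result{Unknown, []string{"unparsed: " + e.Text}}
	}
	sat, unk := 0, 0
	var reasons []string
	for _, c := range e.Children {
		r := Evaluate(c, completed)
		reasons = append(reasons, r.Reasons...)
		switch r.Status {
		case Satisfied:
			sat++
		case Unknown:
			unk++
		}
	}
	switch {
	case sat >= need:
		return Result{Satisfied, reasons}
	case sat+unk >= need: // could still be met if the unknowns resolve favorably
		return Result{Unknown, reasons}
	default:
		return Result{Blocked, reasons}
	}
}

func main() {
	req := Expr{Kind: "all_of", Children: []Expr{
		{Kind: "course", Course: "AAA 100"}, // placeholder course codes
		{Kind: "any_of", Children: []Expr{
			{Kind: "course", Course: "BBB 200"},
			{Kind: "unparsed", Text: "opaque fragment"},
		}},
	}}
	r := Evaluate(req, map[string]bool{"AAA 100": true})
	fmt.Println(r.Status, r.Reasons)
}
```

A group is reported as unknown only when its unparsed children could change the outcome, so a requirement that is already met or already impossible still gets a definite answer.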

The initial implementation should favor correctness over clever optimization.

Performance work should follow measurement.

Solver decisions belong in later ADRs when concrete engines are chosen.

16. Frontend Runtime

The frontend is the main user experience.

The frontend should render a global exploration view.

The frontend should render course detail panels.

The frontend should render credential detail panels.

The frontend should render planning timelines.

The frontend should render requirement explanations.

The frontend should render uncertainty plainly but calmly.

The frontend should cache graph-ready data.

The frontend should request exact answers from the backend.

The frontend may compute local highlight states.

The frontend may use WebGL for the large global view.

The frontend may use WASM if profiling justifies it.

The atlas layout and spatial-indexing worker is the deliberate exception to this caution.

ADR 0023 requires Rust/WASM for the first atlas layout implementation because atlas projection is central to the product surface.

This exception is limited to atlas layout, spatial indexing, hit-test acceleration, label placement, and visible-subgraph filtering.

It does not make WASM a scraper, backend, parser, renderer, or academic-evaluation requirement.

The first performance target should be precomputed data.

The second performance target should be viewport culling and level-of-detail.

The third performance target should be local worker threads.

The fourth performance target should be WASM for non-atlas layout or graph algorithms only after measurement.

The visual model should never redefine academic semantics.

The visual model should receive explicit graph projections from the index.

The visual model should remain replaceable.

17. Graph Projection

The index should expose graph projections for the frontend.

A course dependency graph is one projection.

A credential contribution graph is another projection.

A prerequisite unlock graph is another projection.

A conflict graph is another projection.

A subject-level grouping graph is another projection.

Each projection should have a name and documented lossiness.

Each projection should include source ids.

Each projection should include edge types.

Each projection should include requirement group ids where relevant.

Each projection should include uncertainty flags.

Projection generation should be deterministic.

Projection generation should be testable.

Projection generation should be independent of 3D layout.

The frontend may choose positions.

The backend may provide initial layout hints.

Layout hints should not be treated as facts.

Node sizing should be derived from documented metrics.

Metrics should be reproducible from index relationships.

Metrics should include unlock count and unlock diversity.

Metrics should be tunable without changing source truth.
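
The metrics above can be recomputed from projection edges alone. In this sketch the Edge shape is hypothetical, and unlock diversity is defined as the number of distinct subjects among directly unlocked courses, which is an assumption because this document does not define it.

```go
package main

import (
	"fmt"
	"strings"
)

// Edge says From is a prerequisite of To in some named projection.
type Edge struct{ From, To string }

// Metrics holds the two node-sizing inputs named above.
type Metrics struct {
	UnlockCount     int // distinct courses directly unlocked
	UnlockDiversity int // distinct subjects among them (assumed definition)
}

// subject returns the part of a course code before the first space.
func subject(code string) string {
	s, _, _ := strings.Cut(code, " ")
	return s
}

// UnlockMetrics derives metrics from edges; duplicate edges count once.
func UnlockMetrics(edges []Edge) map[string]Metrics {
	targets := map[string]map[string]bool{}
	for _, e := range edges {
		if targets[e.From] == nil {
			targets[e.From] = map[string]bool{}
		}
		targets[e.From][e.To] = true
	}
	out := make(map[string]Metrics, len(targets))
	for from, tos := range targets {
		subjects := map[string]bool{}
		for to := range tos {
			subjects[subject(to)] = true
		}
		out[from] = Metrics{UnlockCount: len(tos), UnlockDiversity: len(subjects)}
	}
	return out
}

func main() {
	// Placeholder course codes, not real calendar data.
	edges := []Edge{
		{"AAA 100", "AAA 200"},
		{"AAA 100", "BBB 210"},
		{"AAA 100", "AAA 200"}, // duplicate edge
		{"AAA 200", "AAA 300"},
	}
	m := UnlockMetrics(edges)
	fmt.Println(m["AAA 100"], m["AAA 200"])
}
```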

18. Validation

Validation should run after parsing.

Validation should check expected item counts.

Validation should check for missing detail records.

Validation should check for duplicate course codes.

Validation should check for malformed subject numbers.

Validation should check for broken course links.

Validation should check for broken program links.

Validation should check cross-listing symmetry.

Validation should check antirequisite references.

Validation should check requirement expressions for empty groups.

Validation should check choose-k groups for impossible thresholds.

Validation should check unit thresholds for parseable numbers.

Validation should check academic progress values.

Validation should check grade threshold values.

Validation should report unparsed requirement counts.

Validation should report parser confidence distribution.

Validation should produce a human review report.

Validation should fail the build for structural corruption.

Validation should warn for expected unknowns.

Validation policy should become stricter over time.
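
Two of the checks above, empty groups and impossible choose-k thresholds, are sketched below together with the split between build-failing errors and warnings. The Group and Report types are hypothetical, and classifying these two findings as structural corruption is an assumption rather than settled policy.

```go
package main

import "fmt"

// Group is a flattened view of one requirement group for validation.
type Group struct {
	ID       string
	Kind     string // "all_of", "any_of", or "choose_k"
	K        int    // threshold for choose_k
	Children int    // number of child expressions
	Unparsed int    // how many children are unparsed
}

// Report separates findings that fail the build from expected unknowns.
type Report struct {
	Errors   []string
	Warnings []string
}

// FailsBuild is true when any structural error was found.
func (r Report) FailsBuild() bool { return len(r.Errors) > 0 }

// Validate checks each group independently.
func Validate(groups []Group) Report {
	var r Report
	for _, g := range groups {
		if g.Children == 0 {
			r.Errors = append(r.Errors, g.ID+": empty group")
			continue
		}
		if g.Kind == "choose_k" && (g.K < 1 || g.K > g.Children) {
			r.Errors = append(r.Errors,
				fmt.Sprintf("%s: choose %d of %d is impossible", g.ID, g.K, g.Children))
		}
		if g.Unparsed > 0 {
			r.Warnings = append(r.Warnings,
				fmt.Sprintf("%s: %d unparsed children", g.ID, g.Unparsed))
		}
	}
	return r
}

func main() {
	rep := Validate([]Group{
		{ID: "g1", Kind: "all_of", Children: 2},
		{ID: "g2", Kind: "choose_k", K: 3, Children: 2},
		{ID: "g3", Kind: "any_of", Children: 0},
		{ID: "g4", Kind: "any_of", Children: 2, Unparsed: 1},
	})
	fmt.Println(rep.FailsBuild(), len(rep.Errors), len(rep.Warnings))
}
```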

19. Deployment Model

The primary version 1 frontend deployment is a SvelteKit Node server beside or in front of the Go backend.

The single Go binary plus static frontend shape remains an optional fallback or local demo shape.

The backend can serve frontend assets only when that fallback mode is explicitly configured.

The backend can open a read-only SQLite index.

The backend can open a writable SQLite state database.

The backend can be hosted on a small VM or container.

The scraper can run on a maintainer machine.

The scraper can later run in CI on a schedule.

The scraper should publish artifacts through a reviewable process.

The runtime should switch indexes only through an explicit deploy.

Index deployment should be atomic.

Old indexes should remain available for rollback.

State records should store the catalog version they were created against.

When a new catalog is deployed, state migration may be advisory first.

The app should let users choose a catalog version later.

The app should default to the current undergraduate calendar.

No production path should require Kuali credentials.

No production path should scrape during a user request.

No production path should block on Waterloo availability.

20. Security and Privacy

The app should collect minimal personal data.

Anonymous state tokens should be treated as secrets.

Token hashes should use a password hashing or keyed hashing strategy.

The server should avoid logging raw tokens.

The server should avoid logging full student records by default.

Student records should be exportable.

Student records should be deletable by token holder.

The app should state its third-party status clearly in one dedicated information area.

The official University of Waterloo Academic Calendar remains the authoritative academic source.

The app should link back to authoritative Waterloo sources.

The app should prefer source references, stable links, hashes, and short snippets over broad republication of raw calendar text.

The app should not imply advising authority.

The scraper should respect reasonable request rates.

The scraper should store source attribution.

The scraper should tolerate endpoint changes gracefully.

The scraper should not bypass authentication-only systems.

The project should use only data visible through public calendar surfaces unless permissions change.

21. Initial Milestones

Milestone one is documentation and architecture alignment.

Milestone two is a tiny Kuali fetch spike.

The fetch spike should capture catalog metadata, schema, and a few items.

Milestone three is course enumeration.

Course enumeration should store raw snapshots and counts.

Milestone four is program enumeration.

Program enumeration should store raw snapshots and counts.

Milestone five is basic parser output.

Basic parser output should handle course listings and top-level metadata.

Milestone six is requirement HTML parsing.

Requirement parsing should support all-of, any-of, choose-k, and links.

Milestone seven is validation reporting.

Validation reporting should identify parser gaps before UI work depends on them.

Milestone eight is the first SQLite index.

Milestone nine is the first backend query endpoint.

Milestone ten is a small frontend proof of data navigation.

Global 3D exploration should wait until the index contract is stable.

22. Remaining and Resolved Questions

Remaining questions:

Whether raw snapshots live in git or in external artifact storage.
Whether parsed JSONL artifacts are committed.
Whether SQLite index artifacts are committed for development fixtures.
The first frontend framework.
Whether graph layout is server-precomputed or client-computed.
When to introduce WASM.
Whether to support graduate calendars.
How to represent historical requirement terms.
How strongly to validate against UWFlow-like secondary references.

Resolved by later docs:

Student state migration is governed by ADR 0004.
Manual patching is first-class but governed by ADR 0008.
Release gates are governed by ADR 0007.
The scraper implementation spec exists at docs/specifications/scraper-pipeline-spec/.

Secondary references should not become authoritative by accident.