ADR 0002: Index Artifact Storage and Publication Policy

Status: Accepted for architectural direction.

Date: 2026-05-11.

Context

UWScrape imports Waterloo calendar data through an offline scraper pipeline.

The scraper pipeline produces raw snapshots, parsed records, validation reports, SQLite indexes, and optional graph projection files.

The backend should load a published index artifact and should not run the scraper during user requests.

The project needs reproducibility, but it also needs repository hygiene.

Raw snapshots may be large.

Raw snapshots may contain substantial republished calendar content.

Parsed records are smaller but still derived from source calendar material.

SQLite index files are runtime artifacts.

Graph projection files are frontend boot artifacts.

Different artifact classes have different review and storage needs.

The project is early enough that local artifacts may suffice for the initial implementation.

The architecture should still define publication concepts before backend design.

Decision

UWScrape will distinguish build artifacts by trust and publication role.

Raw snapshots are evidence artifacts.

Parsed JSONL files are intermediate artifacts.

Validation reports are review artifacts.

SQLite index files are runtime artifacts.

Graph projection files are runtime-support artifacts.

The repository may contain documentation, schemas, small fixtures, and small reviewed examples.

The repository should not automatically commit full raw snapshots.

The repository should not automatically commit generated SQLite indexes.

The repository should not automatically commit large graph projection files.

Full generated artifacts should be stored in a dedicated artifact location once that location exists.

During early local development, generated artifacts may live under data/ and remain untracked.

If a generated artifact is intentionally committed later, that should be a separate explicit decision.

A published index is a directory containing at minimum:

course-universe.sqlite
build-metadata.json
validation-summary.json
release-decision.json
build-report.md

A published index may also contain:

release-decision.md when a human-readable companion is generated
graph-projection.json
source-reference-summary.json
parser-warning-summary.json
license-and-source-notes.md
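
The directory layout above can be checked mechanically before an index is promoted or loaded. The following is a minimal sketch, not part of the decision: the file names come from the lists above, while the constant and function names are hypothetical. It treats release-decision.md as optional because it is only present when a human-readable companion is generated.

```python
from pathlib import Path

# File names are taken from this ADR; everything else here is illustrative.
REQUIRED_FILES = (
    "course-universe.sqlite",
    "build-metadata.json",
    "validation-summary.json",
    "release-decision.json",
    "build-report.md",
)

OPTIONAL_FILES = (
    "release-decision.md",
    "graph-projection.json",
    "source-reference-summary.json",
    "parser-warning-summary.json",
    "license-and-source-notes.md",
)


def missing_required_files(index_dir: Path) -> list[str]:
    """Return the required file names that are absent from index_dir."""
    return [name for name in REQUIRED_FILES if not (index_dir / name).is_file()]
```

An empty result means the directory has the minimum shape of a published index. It says nothing about whether the contents are valid.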

The backend should receive a path to one published index directory.

The backend should treat that directory as read-only.

The backend should refuse to serve an index whose validation status is not acceptable.

The backend should not mutate published index artifacts.

Student state must be stored outside the index artifact directory.
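
The backend rules above (one directory path, read-only access, refusal on unacceptable validation status) could look roughly like the sketch below. It rests on assumptions this ADR does not make: that validation-summary.json has a top-level "status" field, that "pass" is the only acceptable value, and the names open_published_index and IndexNotServable, which are hypothetical. The only standard-library behavior relied on is SQLite's mode=ro URI parameter, which rejects writes through the connection.

```python
import json
import sqlite3
from pathlib import Path

# Assumption: the summary exposes a "status" field and "pass" is acceptable.
# The real schema and release gate are not defined in this ADR.
ACCEPTABLE_STATUSES = {"pass"}


class IndexNotServable(RuntimeError):
    """Raised when a published index must not be served."""


def open_published_index(index_dir: Path) -> sqlite3.Connection:
    """Open the index database read-only, or refuse if validation failed."""
    summary_path = index_dir / "validation-summary.json"
    summary = json.loads(summary_path.read_text(encoding="utf-8"))
    status = summary.get("status")
    if status not in ACCEPTABLE_STATUSES:
        raise IndexNotServable(f"validation status {status!r} is not acceptable")

    db_path = (index_dir / "course-universe.sqlite").resolve()
    # mode=ro opens the file read-only; write statements will fail.
    return sqlite3.connect(f"{db_path.as_uri()}?mode=ro", uri=True)
```

Student state would be opened through a separate connection to a database outside index_dir, so nothing in this function ever needs write access to the published directory.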

Artifact Classes

raw_snapshot means downloaded source evidence.

parsed_records means JSONL output derived from a raw snapshot and patch set.

validation_report means human-readable and machine-readable validation output.

runtime_index means SQLite and metadata loaded by the backend.

graph_projection means compressed visual graph data loaded by the backend or the frontend.

test_fixture means small source-derived or synthetic data used for tests.

manual_patch means a reviewed parser correction or annotation.

Manual patches are source-controlled unless later proven too large or sensitive.

Validation policies are source-controlled.

Schemas for UWScrape’s own artifact formats are source-controlled.

Small fixtures are source-controlled.

Large generated data is not source-controlled by default.
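
The artifact classes and their default storage policy can be written down as data so that tooling and reviewers share one vocabulary. This is a sketch under assumptions: the class identifiers are the ones defined above, but the Python names are hypothetical, and treating validation_report as not source-controlled by default is an interpretation of "large generated data", not something this ADR states explicitly.

```python
from enum import Enum


class ArtifactClass(Enum):
    """Artifact class identifiers as named in this ADR."""

    RAW_SNAPSHOT = "raw_snapshot"
    PARSED_RECORDS = "parsed_records"
    VALIDATION_REPORT = "validation_report"
    RUNTIME_INDEX = "runtime_index"
    GRAPH_PROJECTION = "graph_projection"
    TEST_FIXTURE = "test_fixture"
    MANUAL_PATCH = "manual_patch"


# Only small fixtures and manual patches are committed by default.
# Everything else is generated data and stays out of git unless a
# separate explicit decision says otherwise.
SOURCE_CONTROLLED_BY_DEFAULT = frozenset(
    {ArtifactClass.TEST_FIXTURE, ArtifactClass.MANUAL_PATCH}
)


def is_source_controlled_by_default(artifact: ArtifactClass) -> bool:
    """Return True if this artifact class is committed without a special decision."""
    return artifact in SOURCE_CONTROLLED_BY_DEFAULT
```

Validation policies and schemas are also source-controlled, but they are project source rather than artifact classes, so they are left out of the enum.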

Publication Flow

The scraper produces a raw snapshot.

The parser produces parsed records.

The validator produces validation reports.

The index builder produces runtime artifacts.

A reviewer checks build reports and validation summaries.

An approved runtime artifact directory is copied or promoted to a published index location.

The backend is configured to load a specific published index directory.

Publishing should be atomic from the backend perspective.

The previous published index should remain available for rollback.

Publication should not delete raw evidence.

Publication should not delete parsed records.

Publication should not modify student state.
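
One common way to get atomic publication with rollback is a symlink swap: each approved index lives in its own directory, and a "current" symlink is repointed with a single rename. The sketch below illustrates that technique only. This ADR does not choose a mechanism, the function name and directory layout are hypothetical, and the approach assumes a POSIX filesystem where renaming over an existing symlink is atomic.

```python
import os
from pathlib import Path


def promote(release_dir: Path, current_link: Path) -> None:
    """Point current_link at release_dir using one atomic rename."""
    tmp_link = current_link.with_name(current_link.name + ".tmp")
    # Clear a leftover temporary link from an interrupted earlier run.
    if tmp_link.is_symlink():
        tmp_link.unlink()
    os.symlink(release_dir.resolve(), tmp_link, target_is_directory=True)
    # os.replace renames the link itself, so readers see either the old
    # target or the new one, never a missing or half-written path.
    os.replace(tmp_link, current_link)
```

Rollback is the same call with the previous release directory as the argument. Because promotion only creates and renames a link, it never deletes raw evidence, parsed records, or earlier published indexes, and it never touches student state.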

Alternatives Considered

Alternative 1: Commit Every Artifact

The project could commit raw snapshots, parsed outputs, SQLite indexes, and graph projections.

This would make reproduction simple for every clone.

This would make diffs visible through git.

However, it would bloat the repository quickly.

It could raise redistribution questions, because raw snapshots may republish substantial calendar content.

It would mix source code review with generated data churn.

This alternative is rejected as the default.

Alternative 2: Commit Only SQLite Indexes

The project could commit runtime SQLite indexes but not raw snapshots.

This would make backend startup easy for all developers.

However, it would weaken traceability unless raw evidence is separately archived.

It would still add binary churn to git.

It would make index diffs hard to review.

This alternative is rejected as the default.

Alternative 3: Store Nothing Generated

The project could require every developer to run the scraper locally.

This would keep git clean.

However, it would make onboarding slower.

It would increase repeated requests against public endpoints.

It would make bug reproduction harder.

This alternative is acceptable for the earliest spike but not for a mature workflow.

Consequences

The backend can be designed around a read-only published index directory.

The scraper can evolve independently from runtime deployments.

The repository remains mostly source and docs.

Build reports become important review artifacts.

The project still needs to choose a long-term artifact store.

The project still needs to write .gitignore rules when implementation begins.

The project still needs publication tooling.

The project can defer artifact hosting until after the first local pipeline works.

Follow-Up

Define .gitignore policy before generating large local data.

Define the artifact store path before team usage.

Define the publish command or deployment process after the backend architecture is defined.

Release gate and patch governance are defined in ADR 0007 and ADR 0008.