Scraper Pipeline Specification

Status: Draft v0.1.

Project: UWScrape.

Directory: docs/specs.

Audience: scraper implementers, backend implementers, data reviewers, and future maintainers.

Last reviewed: 2026-05-11.

Primary decision record: docs/decisions/0001-offline-kuali-scraper-index-pipeline/.

Primary architecture document: docs/reference/architecture/offline-index-runtime-architecture/.

Primary mathematical model: docs/reference/academic/academic-course-universe-model/.

1. Purpose

This document specifies the first scraper pipeline for UWScrape.

The scraper pipeline imports Waterloo academic calendar data from public calendar surfaces.

The scraper pipeline stores raw evidence before parsing.

The scraper pipeline derives structured records from raw evidence.

The scraper pipeline validates the derived records.

The scraper pipeline builds runtime index artifacts.

The scraper pipeline must be reproducible.

The scraper pipeline must be inspectable.

The scraper pipeline must be conservative about uncertain parsing.

The scraper pipeline must preserve every unknown requirement fragment.

The scraper pipeline must support future parser improvement without requiring a fresh scrape.

The scraper pipeline must support future historical calendar imports.

The scraper pipeline must not run inside normal user-facing requests.

The scraper pipeline must not require Kuali or Waterloo availability at runtime.

The scraper pipeline is part of the data build system.

The scraper pipeline is not part of the interactive web application loop.

The scraper pipeline is not a timetable scraper.

The scraper pipeline is not a professor or room scraper.

The scraper pipeline is not an academic advising authority.

The scraper pipeline produces a versioned local index for the rest of the system.

2. Core Principle

The pipeline is lossless first.

The pipeline is increasingly structured second.

The scraper should store source responses with enough fidelity to reconstruct parser input exactly.

The parser should derive the highest-confidence structure it can.

The parser should not guess when a phrase is ambiguous.

The parser should not delete unknown fragments.

The parser should not convert a warning into silent success.

The validator should show what was parsed.

The validator should show what was not parsed.

The validator should show why unparsed fragments remain.

The index builder should include uncertainty metadata.

The backend should expose uncertainty to query responses.

The frontend should display uncertainty when it affects a pathway.

Accuracy comes from evidence, provenance, and repeatability.

Completeness improves through parser and patch iterations.

Completeness must not be faked.

3. Naming Policy

Durable names in the scraper should be plain and operational.

The code should avoid paper-style metaphor names for important data structures.

Use CatalogVersion for a published calendar.

Use SourceRequest for a recorded HTTP request.

Use SourceResponse for a recorded HTTP response.

Use SnapshotManifest for the snapshot inventory.

Use CourseListing for a course subject-number listing.

Use CourseCredit for credit identity or equivalence grouping.

Use Credential for plans, majors, minors, options, and specializations.

Use RequirementExpression for a parsed logical requirement tree.

Use RequirementGroup for grouped requirement expressions.

Use RequirementCondition for a typed leaf-level condition.

Use ParsedRequirement for a parsed requirement with provenance.

Use UnparsedRequirement for a preserved unknown source fragment.

Use SourceReference for a pointer back to raw evidence.

Use ParserWarning for a non-fatal parser issue.

Use ValidationFinding for validator output.

Avoid implementation names that sound elegant but hide shape.

Avoid implementation names that imply certainty where confidence is partial.

4. External Source Facts

The undergraduate academic calendar is served through Waterloo’s academic calendar pages (source page: "Undergraduate studies calendar").

Waterloo’s navigation guidance says the calendar content is organized into Programs and Plans, Regulations, and Courses, and that it presents requisites and requirements as logic-based rules with links where applicable (source page: "How to navigate the Undergraduate Studies Academic Calendar").

Waterloo’s Kuali CM resources state that Kuali Curriculum Management uses logic rules to build course requisites and program or plan requirement lists (source page: "All about building rules").

The public calendar page embeds a Kuali Catalog application.

The embedded Kuali Catalog application was observed loading data from a Kuali subdomain during the research pass.

The observed Kuali subdomain is https://uwaterloocm.kuali.co.

The observed undergraduate catalog id is 67e557ed6ed2fe2bd3a38956.

The observed catalog title is 2026-2027 Undergraduate Studies Academic Calendar.

The observed undergraduate catalog start date is 2026-04-02.

The observed undergraduate catalog end date is 2027-04-01.

The observed catalog metadata endpoint is /api/v1/catalog/public/catalogs/:catalogId.

The observed course schema endpoint is /api/v1/catalog/schema/:catalogId/courses.

The observed program schema endpoint is /api/v1/catalog/schema/:catalogId/programs.

The observed search endpoint is /api/v1/catalog/search/:catalogId.

The observed course detail endpoint is /api/v1/catalog/course/:catalogId/:pid.

The observed program detail endpoint is /api/v1/catalog/program/:catalogId/:pid.
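The endpoint templates above can be expanded in one place so that a change in endpoint shape fails clearly. The following is a minimal sketch, not the project's implementation. The function name expand and the constant names are hypothetical. The templates and example ids are the observed values listed above.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// Endpoint templates as observed during the research pass. They are not a
// stable public API, so they are kept together and recorded per run.
const (
	catalogTemplate      = "/api/v1/catalog/public/catalogs/:catalogId"
	courseDetailTemplate = "/api/v1/catalog/course/:catalogId/:pid"
)

// expand fills every ":name" path segment from params and returns an error
// if a placeholder has no value, so a template change cannot pass silently.
func expand(base, template string, params map[string]string) (string, error) {
	segments := strings.Split(template, "/")
	for i, seg := range segments {
		if !strings.HasPrefix(seg, ":") {
			continue
		}
		value, ok := params[seg[1:]]
		if !ok || value == "" {
			return "", fmt.Errorf("missing value for placeholder %q", seg)
		}
		segments[i] = url.PathEscape(value)
	}
	return strings.TrimRight(base, "/") + strings.Join(segments, "/"), nil
}

func main() {
	params := map[string]string{
		"catalogId": "67e557ed6ed2fe2bd3a38956",
		"pid":       "S1hvTVXYn",
	}
	for _, template := range []string{catalogTemplate, courseDetailTemplate} {
		u, err := expand("https://uwaterloocm.kuali.co", template, params)
		fmt.Println(u, err)
	}
}
```

Keeping templates as data also makes it straightforward to record them in the manifest.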

The observed search endpoint exposes an item-count response header.

An empty course search returned paginated course summaries.

An empty course search reported item-count: 4342 during the research pass.

The item count is an observed value, not a permanent constant.

The scraper must discover counts during each run.

Kuali documentation states that the Catalog API is not a stable public API.

Kuali documentation states that the Catalog API primarily drives UI behavior.

Kuali documentation states that backward compatibility is not guaranteed.

Waterloo documentation says Kuali CM feeds Kuali Catalog.

Waterloo documentation says direct calendar links use item pids.

Waterloo documentation distinguishes courses, programs or plans, and regulations.

Waterloo documentation describes cross-listed courses as sharing a Quest identification number and material content while allowing requisites and ownership to differ by offering (source page: "All about cross-listed courses").

The scraper should cite source URLs in build reports.

The scraper should treat endpoint shape as an implementation detail under watch.

The scraper should fail clearly if endpoint shape changes.

The scraper should not bypass authentication-only Kuali CM endpoints.

The scraper should use public calendar-facing data only unless project permissions change.

5. Data Scope for Version 1

Version 1 must import undergraduate course records.

Version 1 must import undergraduate program records.

Version 1 must import catalog metadata.

Version 1 must import course schema metadata.

Version 1 must import program schema metadata.

Version 1 should import public catalog settings.

Version 1 should import visible course fields configured for calendar display.

Version 1 should import visible program fields configured for calendar display.

Version 1 should parse course prerequisites.

Version 1 should parse course corequisites.

Version 1 should parse course antirequisites.

Version 1 should parse the cross-listed courses of each course.

Version 1 should parse program course requirements.

Version 1 should parse program graduation requirement text where possible.

Version 1 should parse program declaration requirement text where possible.

Version 1 should preserve all unparsed rich text fields.

Version 1 may ignore policies except for linked references.

Version 1 may ignore experiences.

Version 1 may ignore graduate-only item types.

Version 1 may ignore timetable offerings.

Version 1 may ignore instructors.

Version 1 may ignore rooms.

Version 1 may ignore section capacity.

Version 1 may ignore live enrollment restrictions not present in calendar data.

Version 1 must mark ignored data scope explicitly in the build report.

6. Future Data Scope

Future versions may import policies.

Future versions may import specializations as separate item types if exposed.

Future versions may import experiences.

Future versions may import graduate calendars.

Future versions may import archived catalogs.

Future versions may compare catalog versions.

Future versions may compute migration notes between calendar versions.

Future versions may cross-check with secondary indexes.

Future versions may add timetable data as a separate pipeline.

Future versions may add instructor data as a separate pipeline.

Future versions may add official Open Data sources if available and suitable.

Future versions should keep runtime index boundaries clear.

Future versions should not merge timetable facts into calendar requirement facts without explicit modeling.

Future versions should not treat secondary sources as authoritative by default.

7. Command Overview

The scraper should be exposed as a Go CLI.

The preferred binary name is uwscrape.

The CLI should support discover.

The CLI should support fetch.

The CLI should support parse.

The CLI should support validate.

The CLI should support build-index.

The CLI should support report.

The CLI should support run-all.

The CLI should support inspect.

The CLI may support diff-snapshots later.

The CLI may support apply-patches as a separate command later.

Each command should accept explicit input paths.

Each command should accept explicit output paths.

Commands should avoid hidden global state.

Commands should print concise progress to stderr.

Commands should write machine-readable artifacts to output paths.

Commands should return non-zero exit codes on failed gates.

Commands should return zero exit codes when warnings are expected and allowed.

Commands should support --strict where useful.

Commands should support --json-log later if needed.

8. Command: discover

discover identifies a Kuali catalog embedded in a Waterloo calendar page.

The command should accept --calendar-url.

The command should accept --out.

The command should accept --catalog-id as an override.

The command should accept --subdomain as an override.

The command should accept --catalog-kind as optional metadata.

The command should fetch the calendar page.

The command should locate window.subdomain.

The command should locate window.catalogId.

The command should fetch catalog metadata from the observed Kuali endpoint.

The command should write catalog.json.

The command should write discovery.json.

The command should write an initial manifest.json.

The command should validate that catalog metadata title is present.

The command should validate that catalog date range is present.

The command should validate that settings are present when returned.

The command should not fetch all course or program records.

The command should fail if neither page discovery nor overrides provide catalog id.

The command should fail if neither page discovery nor overrides provide subdomain.

The command should include source URL and fetch timestamp.
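The page discovery and override rules above can be sketched as follows. The exact form in which the calendar page assigns window.subdomain and window.catalogId is an assumption here (a quoted string assignment in an inline script), and the type and function names are hypothetical. The sample page is constructed for illustration, not captured output.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// Assumed page shape: window.<name> = '<value>' or "<value>".
var (
	subdomainPattern = regexp.MustCompile(`window\.subdomain\s*=\s*["']([^"']+)["']`)
	catalogIDPattern = regexp.MustCompile(`window\.catalogId\s*=\s*["']([^"']+)["']`)
)

// Discovery holds the two values discover must resolve.
type Discovery struct {
	Subdomain string
	CatalogID string
}

// firstMatch returns the first capture group, or "" when nothing matches.
func firstMatch(re *regexp.Regexp, page string) string {
	if m := re.FindStringSubmatch(page); m != nil {
		return m[1]
	}
	return ""
}

// discover prefers explicit overrides, falls back to values found in the
// page, and fails when neither source provides a value.
func discover(page string, overrides Discovery) (Discovery, error) {
	d := overrides
	if d.Subdomain == "" {
		d.Subdomain = firstMatch(subdomainPattern, page)
	}
	if d.CatalogID == "" {
		d.CatalogID = firstMatch(catalogIDPattern, page)
	}
	if d.Subdomain == "" {
		return Discovery{}, errors.New("discover: no subdomain from page or --subdomain")
	}
	if d.CatalogID == "" {
		return Discovery{}, errors.New("discover: no catalog id from page or --catalog-id")
	}
	return d, nil
}

func main() {
	page := `<script>window.subdomain = 'https://uwaterloocm.kuali.co'; window.catalogId = '67e557ed6ed2fe2bd3a38956';</script>`
	d, err := discover(page, Discovery{})
	fmt.Println(d.Subdomain, d.CatalogID, err)
}
```

If the real page uses a different assignment form, only the two patterns should need to change.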

9. Command: fetch

fetch downloads raw schema, search, and detail responses.

The command should accept --snapshot.

The command should accept --item-types.

The command should accept --page-size.

The command should accept --concurrency.

The command should accept --rate-limit.

The command should accept --retry.

The command should accept --resume.

The command should accept --force.

The command should fetch schemas for requested item types.

The command should enumerate requested item types through search.

The command should preserve every search page response.

The command should preserve every item detail response.

The command should store successful and failed requests in the manifest.

The command should support resuming incomplete snapshots.

The command should not overwrite completed item files unless --force is set.

The command should avoid excessive request concurrency by default.

The command should default to conservative request rates.

The command should record endpoint templates in the manifest.

The command should record item counts from response headers.

The command should compare expected count and collected summary count.

The command should fetch details by pid from search results.

The command should store detail files by item type and pid.

The command should record the item id found inside each detail response when present.

The command should detect duplicate pids within an item type.

The command should report duplicate pids as a failed validation precondition.

10. Command: parse

parse derives structured records from a raw snapshot.

The command should accept --snapshot.

The command should accept --out.

The command should accept --patches.

The command should accept --item-types.

The command should accept --strict.

The command should read only snapshot files.

The command should not perform network requests.

The command should produce JSONL files.

The command should produce a parser report.

The command should apply parser patches after raw parsing.

The command should record every applied patch.

The command should record every patch that did not match.

The command should record every parser warning.

The command should record parser version metadata.

The command should record input snapshot hash metadata.

The command should be deterministic for the same inputs.

The command should preserve unknown fields through source_references.jsonl or metadata.

The command should never silently drop an HTML fragment from a configured field.

The command should write all unparsed requirement fragments.

The command should write parsed requirements with source references.

11. Command: validate

validate checks raw and parsed outputs.

The command should accept --snapshot.

The command should accept --parsed.

The command should accept --policy.

The command should accept --out.

The command should accept --strict.

The command should read manifest item counts.

The command should read parsed record counts.

The command should validate course summary count.

The command should validate program summary count.

The command should validate detail file count.

The command should validate parse output shape.

The command should validate course code uniqueness policy.

The command should validate cross-listing references.

The command should validate antirequisite references.

The command should validate program course links.

The command should validate requirement group structure.

The command should validate choose-k thresholds.

The command should validate numeric grade thresholds.

The command should validate unit thresholds.

The command should validate academic progress values.

The command should detect and report empty requirement expressions.

The command should validate patch application results.

The command should produce validation-report.json.

The command should produce validation-report.md.

The command should fail on blocking findings.

The command should warn on known incomplete parser coverage.

The command should support a threshold for unparsed requirement percentage.

The initial threshold should be advisory rather than blocking.
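The unparsed-percentage threshold can be sketched as a check that emits an advisory finding by default. The Finding shape, the finding code, and the severity strings are hypothetical. The spec only requires that the initial threshold be advisory rather than blocking.

```go
package main

import "fmt"

// Finding is a hypothetical, reduced form of a ValidationFinding.
type Finding struct {
	Code     string
	Severity string // "warning" (advisory) or "blocking"
	Message  string
}

// unparsedFinding compares the share of unparsed requirement fragments to a
// threshold in percent. It returns false when no finding is needed.
func unparsedFinding(parsed, unparsed int, thresholdPct float64, blocking bool) (Finding, bool) {
	total := parsed + unparsed
	if total == 0 {
		return Finding{}, false
	}
	pct := float64(unparsed) * 100 / float64(total)
	if pct <= thresholdPct {
		return Finding{}, false
	}
	severity := "warning"
	if blocking {
		severity = "blocking"
	}
	return Finding{
		Code:     "unparsed_requirement_percentage",
		Severity: severity,
		Message:  fmt.Sprintf("%.1f%% of requirement fragments are unparsed (threshold %.1f%%)", pct, thresholdPct),
	}, true
}

func main() {
	// Illustrative counts only; real counts come from the parsed output.
	if f, ok := unparsedFinding(900, 100, 5, false); ok {
		fmt.Println(f.Severity+":", f.Message)
	}
}
```

Passing blocking as an explicit argument lets a later policy file tighten the gate without changing the check itself.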

12. Command: build-index

build-index creates runtime artifacts.

The command should accept --parsed.

The command should accept --validation.

The command should accept --out.

The command should accept --catalog-slug.

The command should accept --include-graph-json.

The command should accept --strict.

The command should refuse to build if validation has blocking findings.

The command should write a SQLite database.

The command should write graph projection JSON when requested.

The command should write build metadata.

The command should write a build report.

The command should include catalog version metadata.

The command should include parser version metadata.

The command should include validation summary metadata.

The command should include unresolved parser warning counts.

The command should include source hashes.

The command should not mutate parsed inputs.

The command should be deterministic for the same parsed inputs.

The command should allow replacing the output directory only with --force.

The command should produce artifacts ready for backend loading.

13. Command: report

report renders a human-readable pipeline summary.

The command should accept --snapshot.

The command should accept --parsed.

The command should accept --validation.

The command should accept --index.

The command should accept --out.

The command should summarize source catalog metadata.

The command should summarize item counts.

The command should summarize parser coverage.

The command should summarize unparsed fragments by field.

The command should summarize validation findings.

The command should summarize patch usage.

The command should summarize index artifact paths.

The command should include references to authoritative source URLs.

The command should include warnings about Kuali API stability.

The command should include a reviewer checklist.

The command may be folded into validate and build-index initially.

14. Command: run-all

run-all executes the full pipeline.

The command should accept --calendar-url.

The command should accept --slug.

The command should accept --workspace.

The command should accept --patches.

The command should accept --item-types.

The command should accept --page-size.

The command should accept --concurrency.

The command should accept --rate-limit.

The command should accept --strict.

The command should run discovery.

The command should run fetch.

The command should run parse.

The command should run validate.

The command should run build-index.

The command should run report.

The command should stop on blocking failures.

The command should leave partial artifacts inspectable.

The command should support resume for fetch.

The command should print final paths.

The command should not publish artifacts to production.

Publishing is a separate deployment decision.

15. Command: inspect

inspect helps debug one source item.

The command should accept --snapshot.

The command should accept --parsed.

The command should accept --item-type.

The command should accept --pid.

The command should accept --id.

The command should accept --course-code.

The command should print source metadata.

The command should print source fields.

The command should print parsed requirement expressions.

The command should print unparsed requirements.

The command should print applied patches.

The command should print validation findings for the item.

The command should support --json.

The command should support --field.

The command should be the default debugging tool for parser work.

The command should not perform network requests.

16. Directory Layout

The default workspace should be data.

Raw snapshots should live under data/raw.

Parsed outputs should live under data/parsed.

Validation outputs should live under data/validation.

Index outputs should live under data/index.

Patch files should live under data/patches.

Reports should live under data/reports or inside each output directory.

Temporary files should live under a command-specific temporary directory.

Temporary files should not be committed.

The default catalog slug should be explicit.

The slug should include calendar year and career when known.

Example slug: 2026-2027-undergraduate.

The slug should not depend on local time.

The slug should not include whitespace.

The slug should be stable across machines.

The slug may include catalog id if ambiguity exists.

The slug should be recorded in every manifest.
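The slug rules above can be sketched as a derivation from the catalog title plus a career label. This is one possible approach, not a requirement of this spec. The function name and the choice to derive the year range from the title are assumptions.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// yearRange finds a calendar year range such as "2026-2027".
	yearRange = regexp.MustCompile(`\b(\d{4})-(\d{4})\b`)
	// validSlug allows lowercase letters and digits joined by single hyphens.
	validSlug = regexp.MustCompile(`^[a-z0-9]+(-[a-z0-9]+)*$`)
)

// slugFromTitle builds "<start>-<end>-<career>". It uses only its inputs, so
// the result does not depend on local time or on the machine.
func slugFromTitle(title, career string) (string, error) {
	m := yearRange.FindStringSubmatch(title)
	if m == nil {
		return "", fmt.Errorf("no year range in catalog title %q", title)
	}
	slug := m[1] + "-" + m[2] + "-" + strings.ToLower(strings.TrimSpace(career))
	if !validSlug.MatchString(slug) {
		return "", fmt.Errorf("derived slug %q is not valid", slug)
	}
	return slug, nil
}

func main() {
	slug, err := slugFromTitle("2026-2027 Undergraduate Studies Academic Calendar", "undergraduate")
	fmt.Println(slug, err)
}
```

An explicit --slug or --catalog-slug value should still take precedence over any derived value.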

17. Raw Snapshot Layout

The raw snapshot directory should include manifest.json.

The raw snapshot directory should include discovery.json.

The raw snapshot directory should include catalog.json.

The raw snapshot directory should include schemas.

The raw snapshot directory should include search.

The raw snapshot directory should include items.

The raw snapshot directory should include errors when failures occur.

The raw snapshot directory should include http metadata when useful.

The raw snapshot directory should include README.generated.md.

The raw snapshot directory should be self-describing.

The raw snapshot directory should support archiving as a tarball.

The raw snapshot directory should support hash verification.

The raw snapshot directory should be read-only after completion.

The raw snapshot directory should not contain parsed output.

The raw snapshot directory should not contain index output.

The raw snapshot directory should not contain runtime student state.

18. Raw Snapshot Example

Example raw snapshot paths:

data/raw/2026-2027-undergraduate/manifest.json

data/raw/2026-2027-undergraduate/discovery.json

data/raw/2026-2027-undergraduate/catalog.json

data/raw/2026-2027-undergraduate/schemas/courses.json

data/raw/2026-2027-undergraduate/schemas/programs.json

data/raw/2026-2027-undergraduate/search/courses/page-000000.json

data/raw/2026-2027-undergraduate/search/courses/page-000001.json

data/raw/2026-2027-undergraduate/search/programs/page-000000.json

data/raw/2026-2027-undergraduate/items/courses/S1hvTVXYn.json

data/raw/2026-2027-undergraduate/items/courses/r1-YJF47K3.json

data/raw/2026-2027-undergraduate/items/programs/HkxPJk0Cj3.json

data/raw/2026-2027-undergraduate/items/programs/Sk7D1yRAs3.json

data/raw/2026-2027-undergraduate/errors/fetch-errors.jsonl

data/raw/2026-2027-undergraduate/README.generated.md

The exact pids above are examples from observed records.

The scraper should not hard-code those pids.

The scraper should discover pids from search results.

19. Manifest Fields

manifest_version identifies the manifest schema.

scraper_version identifies the scraper implementation.

created_at records snapshot creation time.

completed_at records snapshot completion time when complete.

status records in_progress, complete, or failed.

catalog_slug records the local slug.

catalog_id records the Kuali catalog id.

catalog_title records the Kuali catalog title.

catalog_start_date records catalog start date.

catalog_end_date records catalog end date.

calendar_url records the Waterloo calendar page URL.

kuali_subdomain records the Kuali subdomain.

item_types records requested item types.

endpoints records endpoint templates.

requests records request summaries.

counts records expected and actual counts.

hashes records file hashes.

failures records failed requests.

warnings records non-fatal scraper warnings.

environment records optional runtime metadata.

The manifest should be valid JSON.

The manifest should be pretty-printed.

The manifest should be stable enough for code review.

The manifest should not contain secrets.

20. Request Metadata

Each request summary should include request id.

Each request summary should include method.

Each request summary should include URL.

Each request summary should include item type when known.

Each request summary should include pid when known.

Each request summary should include response status.

Each request summary should include response content type.

Each request summary should include started timestamp.

Each request summary should include completed timestamp.

Each request summary should include duration.

Each request summary should include retry count.

Each request summary should include output path.

Each request summary should include response body hash.

Each request summary should include selected headers.

Selected headers should include item-count when present.

Selected headers should include etag when present.

Selected headers should include last-modified when present.

Selected headers should include rate-limit headers if present.

The scraper should not store cookies unless strictly required.

The scraper should not store authentication headers.

21. Fetch Pagination

The search endpoint supports limit.

The search endpoint supports skip.

The scraper should use a configurable page size.

The default page size should be conservative.

A default page size of 100 is reasonable for initial implementation.

The scraper should read item-count from the first page.

The scraper should fetch pages until collected summaries meet expected count.

The scraper should stop if a page returns zero records before expected count.

A premature empty page should become a blocking fetch finding.

The scraper should deduplicate summaries by pid.

The scraper should record duplicate summaries.

The scraper should preserve page responses even if duplicates exist.

The scraper should not assume stable ordering across separate runs.

The scraper should build detail fetch lists from collected summaries.

The scraper should sort detail fetch lists deterministically before fetching.

The scraper should sort by item type and pid.

22. Fetch Retry Policy

The scraper should retry transient network failures.

The scraper should retry HTTP 429.

The scraper should retry HTTP 500.

The scraper should retry HTTP 502.

The scraper should retry HTTP 503.

The scraper should retry HTTP 504.

The scraper should not retry HTTP 401 as a transient public endpoint failure.

The scraper should not retry HTTP 403 as a transient public endpoint failure by default.

The scraper may retry HTTP 404 once for detail endpoints.

The scraper should use exponential backoff.

The scraper should add jitter.

The scraper should cap maximum sleep.

The scraper should record every retry.

The scraper should record final failure status.

The scraper should continue fetching other items when one item fails.

The command should fail overall if required items fail.

The default retry count should be small.

Three attempts is a reasonable starting point.

23. Rate Limiting

The scraper should be polite by default.

The scraper should default to low concurrency.

The scraper should default to a small request-per-second rate.

The scraper should allow slower rates through flags.

The scraper should not encourage aggressive endpoint crawling.

The scraper should record the configured rate limit in the manifest.

The scraper should record the configured concurrency in the manifest.

The scraper should avoid retry storms.

The scraper should honor Retry-After when present.

The scraper should support cancellation.

The scraper should write partial state safely on cancellation.

The scraper should support resume after cancellation.

24. Parsed Output Layout

Parsed output should live under data/parsed/<catalog-slug>.

The parsed directory should include parse-manifest.json.

The parsed directory should include course_listings.jsonl.

The parsed directory should include course_credits.jsonl.

The parsed directory should include credentials.jsonl.

The parsed directory should include requirement_expressions.jsonl.

The parsed directory should include requirement_conditions.jsonl.

The parsed directory should include requirement_sources.jsonl.

The parsed directory should include unparsed_requirements.jsonl.

The parsed directory should include source_references.jsonl.

The parsed directory should include parser_warnings.jsonl.

The parsed directory should include patch_applications.jsonl.

The parsed directory should include parse-report.md.

The parsed directory may include debug artifacts when requested.

The parsed directory should not contain raw source response copies.

The parsed directory should reference raw files by path and hash.

The parsed directory should be rebuildable from raw snapshot plus patches.

25. JSONL Policy

JSONL files should contain one JSON object per line.

Every object should include an id.

Every object should include catalog_version_id when applicable.

Every object should include source_reference_ids when derived from source.

Every object should include created_by_parser_version.

Every object should avoid null fields when omission is clearer.

Every object should use stable enum strings.

Every object should use snake_case keys.

Every object should use ISO 8601 calendar dates (YYYY-MM-DD) for dates.

Every object should use RFC 3339 timestamps for timestamps.

Every object should store numeric thresholds as numbers where possible.

Every object should store display text separately from normalized text.

Every object should be deterministic in key ordering if practical.

The parser should make JSONL easy to diff.

The parser should not write nondeterministic map ordering if avoidable.

26. CourseListing Record

CourseListing should include id.

CourseListing should include catalog_version_id.

CourseListing should include source_item_type.

CourseListing should include source_item_id.

CourseListing should include source_pid.

CourseListing should include course_code.

CourseListing should include subject_code.

CourseListing should include number.

CourseListing should include title.

CourseListing should include description.

CourseListing should include units_min_x100.

CourseListing should include units_max_x100.

CourseListing should include units_display.

CourseListing should include course_level.

CourseListing should include faculty_name when present.

CourseListing should include academic_unit_name when present.

CourseListing should include date_start when present.

CourseListing should include catalog_activation_date when present.

CourseListing should include course_credit_id when known.

CourseListing should include requirement_source_ids.

The first implementation may derive course_credit_id conservatively.

The first implementation may use one credit record per listing before cross-listing grouping.

Cross-listing grouping should replace that provisional identity when known.
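This spec does not define the _x100 suffix on units_min_x100 and units_max_x100. The sketch below assumes it means a fixed-point integer in hundredths of a unit, so that a display value such as 0.50 is stored as 50 without floating-point rounding. The function names and the sample inputs are illustrative.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// allDigits reports whether s contains only ASCII digits (true for "").
func allDigits(s string) bool {
	for _, r := range s {
		if r < '0' || r > '9' {
			return false
		}
	}
	return true
}

// unitsX100 converts a plain decimal such as "0.50" into hundredths (50).
// It rejects signs, extra separators, and more than two decimal places
// rather than rounding, so uncertain input surfaces as an error.
func unitsX100(display string) (int, error) {
	s := strings.TrimSpace(display)
	whole, frac, _ := strings.Cut(s, ".")
	if whole == "" && frac == "" {
		return 0, fmt.Errorf("units %q: empty value", display)
	}
	if !allDigits(whole) || !allDigits(frac) {
		return 0, fmt.Errorf("units %q: not a plain decimal", display)
	}
	if len(frac) > 2 {
		return 0, fmt.Errorf("units %q: more than two decimal places", display)
	}
	padded := whole + frac + strings.Repeat("0", 2-len(frac))
	n, err := strconv.Atoi(padded)
	if err != nil {
		return 0, fmt.Errorf("units %q: %w", display, err)
	}
	return n, nil
}

func main() {
	for _, in := range []string{"0.50", "1", "0.25", "2.125"} {
		n, err := unitsX100(in)
		fmt.Println(in, n, err)
	}
}
```

The original text stays in units_display, so a rejected value is never lost.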

27. CourseCredit Record

CourseCredit should include id.

CourseCredit should include catalog_version_id.

CourseCredit should include listing_ids.

CourseCredit should include source_basis.

CourseCredit should include confidence.

source_basis may be single_listing.

source_basis may be cross_listed_courses_field.

source_basis may be manual_patch.

source_basis may be former_number_policy.

Version 1 should support single_listing.

Version 1 should support cross_listed_courses_field when source fields are clear.

Version 1 should not infer credit identity from title alone.

Version 1 should not infer credit identity from description alone.

Version 1 should not infer credit identity from antirequisites alone.

Antirequisites are conflict relations.

Cross-listing is stronger evidence for shared credit identity.
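Grouping listings into credit records from cross-listing pairs can be sketched with a small union-find. The credit id scheme ("credit:" plus the smallest listing id in the group), the listing ids, and the function name are hypothetical. Antirequisites are deliberately not an input.

```go
package main

import (
	"fmt"
	"sort"
)

// CourseCredit is a reduced form of the record described in this section.
type CourseCredit struct {
	ID          string
	ListingIDs  []string
	SourceBasis string
}

// groupCredits merges listings connected by cross-listing pairs. Pairs that
// mention an unknown listing are skipped here and left for the validator.
func groupCredits(listingIDs []string, crossListed [][2]string) []CourseCredit {
	parent := make(map[string]string, len(listingIDs))
	for _, id := range listingIDs {
		parent[id] = id
	}
	find := func(x string) string {
		for parent[x] != x {
			x = parent[x]
		}
		return x
	}
	for _, pair := range crossListed {
		_, okA := parent[pair[0]]
		_, okB := parent[pair[1]]
		if !okA || !okB {
			continue
		}
		a, b := find(pair[0]), find(pair[1])
		if a == b {
			continue
		}
		if b < a {
			a, b = b, a
		}
		parent[b] = a // the smaller id stays root, so ids are deterministic
	}
	members := map[string][]string{}
	for id := range parent {
		root := find(id)
		members[root] = append(members[root], id)
	}
	credits := make([]CourseCredit, 0, len(members))
	for root, ids := range members {
		sort.Strings(ids)
		basis := "single_listing"
		if len(ids) > 1 {
			basis = "cross_listed_courses_field"
		}
		credits = append(credits, CourseCredit{ID: "credit:" + root, ListingIDs: ids, SourceBasis: basis})
	}
	sort.Slice(credits, func(i, j int) bool { return credits[i].ID < credits[j].ID })
	return credits
}

func main() {
	listings := []string{"L1", "L2", "L3", "L4"}
	pairs := [][2]string{{"L1", "L3"}, {"L3", "L4"}}
	for _, c := range groupCredits(listings, pairs) {
		fmt.Println(c.ID, c.ListingIDs, c.SourceBasis)
	}
}
```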

28. Credential Record

Credential should include id.

Credential should include catalog_version_id.

Credential should include source_item_type.

Credential should include source_item_id.

Credential should include source_pid.

Credential should include code.

Credential should include title.

Credential should include description.

Credential should include credential_type.

Credential should include field_of_study when present.

Credential should include faculty_name when present.

Credential should include systems_of_study when present.

Credential should include date_start when present.

Credential should include catalog_activation_date when present.

Credential should include requirement_source_ids.

Credential should include linked_specialization_ids when parsed.

Credential should not collapse all text fields into one blob.

Credential should preserve field-level source references.

29. RequirementSource Record

RequirementSource identifies a source field that may contain requirements.

RequirementSource should include id.

RequirementSource should include catalog_version_id.

RequirementSource should include owner_type.

RequirementSource should include owner_id.

RequirementSource should include source_item_type.

RequirementSource should include source_item_id.

RequirementSource should include source_pid.

RequirementSource should include field_name.

RequirementSource should include field_label when known from schema.

RequirementSource should include field_type when known from schema.

RequirementSource should include raw_value_kind.

RequirementSource should include raw_text_hash.

RequirementSource should include has_html.

RequirementSource should include parser_status.

parser_status may be not_attempted.

parser_status may be parsed.

parser_status may be partially_parsed.

parser_status may be unparsed.

parser_status may be empty.

30. RequirementExpression Record

RequirementExpression represents requirement tree structure.

RequirementExpression should include id.

RequirementExpression should include catalog_version_id.

RequirementExpression should include requirement_source_id.

RequirementExpression should include parent_expression_id when nested.

RequirementExpression should include kind.

RequirementExpression should include position.

RequirementExpression should include label.

RequirementExpression should include display_text.

RequirementExpression should include normalized_text.

RequirementExpression should include condition_id when leaf-like.

RequirementExpression should include min_required when choose-like.

RequirementExpression should include max_allowed when choose-like.

RequirementExpression should include unit_threshold when unit-like.

RequirementExpression should include confidence.

RequirementExpression should include source_reference_ids.

kind may be all_of.

kind may be any_of.

kind may be choose_count.

kind may be units_from.

kind may be average_threshold.

kind may be condition.

kind may be unparsed.

31. RequirementCondition Record

RequirementCondition represents a typed requirement leaf.

RequirementCondition should include id.

RequirementCondition should include catalog_version_id.

RequirementCondition should include kind.

RequirementCondition should include polarity.

RequirementCondition should include display_text.

RequirementCondition should include normalized_text.

RequirementCondition should include course_listing_ids.

RequirementCondition should include course_credit_ids.

RequirementCondition should include credential_ids.

RequirementCondition should include subject_codes.

RequirementCondition should include course_ranges.

RequirementCondition should include grade_threshold.

RequirementCondition should include unit_threshold.

RequirementCondition should include academic_progress_value.

RequirementCondition should include operator.

RequirementCondition should include confidence.

RequirementCondition should include source_reference_ids.

kind may be course_completed.

kind may be course_grade_at_least.

kind may be academic_progress_at_least.

kind may be academic_progress_is.

kind may be enrolled_in_credential.

kind may be subject_range_completed.

kind may be units_from_subjects.

kind may be average_at_least.

kind may be not_for_credit.

kind may be antirequisite.

kind may be consent_required.

kind may be free_text.

32. UnparsedRequirement Record

UnparsedRequirement preserves an unknown requirement fragment.

UnparsedRequirement should include id.

UnparsedRequirement should include catalog_version_id.

UnparsedRequirement should include requirement_source_id.

UnparsedRequirement should include source_reference_id.

UnparsedRequirement should include owner_type.

UnparsedRequirement should include owner_id.

UnparsedRequirement should include field_name.

UnparsedRequirement should include raw_html.

UnparsedRequirement should include display_text.

UnparsedRequirement should include normalized_text.

UnparsedRequirement should include dom_path.

UnparsedRequirement should include nearest_rule_id.

UnparsedRequirement should include reason.

UnparsedRequirement should include parser_version.

UnparsedRequirement should include suggested_patch_key.

Reasons should be stable enum values.

Reason examples include unsupported_phrase.

Reason examples include ambiguous_course_range.

Reason examples include unknown_link_target.

Reason examples include unsupported_numeric_rule.

Reason examples include malformed_html.

Reason examples include empty_rule_text.

Unparsed requirements should be visible in validation summaries.

33. SourceReference Record

SourceReference points back to raw evidence.

SourceReference should include id.

SourceReference should include catalog_version_id.

SourceReference should include raw_file_path.

SourceReference should include raw_file_hash.

SourceReference should include json_pointer.

SourceReference should include field_name.

SourceReference should include dom_path.

SourceReference should include rule_id.

SourceReference should include text_start when available.

SourceReference should include text_end when available.

SourceReference should include source_url.

SourceReference should include source_item_id.

SourceReference should include source_pid.

Source references should support click-through debugging in tooling later.

Source references should make parser output auditable.

Source references should be retained in the SQLite index.

34. Parser Stage 1: JSON Mapping

JSON mapping reads item detail JSON.

JSON mapping should use typed Go structs where source shape is stable.

JSON mapping should use raw maps for unknown fields.

JSON mapping should preserve unknown fields in source references or debug metadata.

JSON mapping should extract course pid.

JSON mapping should extract course item id.

JSON mapping should extract course code.

JSON mapping should extract subject code.

JSON mapping should extract course number.

JSON mapping should extract title.

JSON mapping should extract description.

JSON mapping should extract units.

JSON mapping should extract course level.

JSON mapping should extract faculty.

JSON mapping should extract program pid.

JSON mapping should extract program item id.

JSON mapping should extract program code.

JSON mapping should extract credential type.

JSON mapping should extract field of study.

JSON mapping should extract systems of study.

JSON mapping should extract configured requirement fields.

JSON mapping should attach schema labels when possible.
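
The typed-struct-plus-raw-map approach described above can be sketched as follows. This is a minimal illustration, not the project's mapper: the JSON key names (pid, id, code, title), the CourseDetail type, and the MapCourseDetail function are assumptions made for the example, and real Kuali payloads may use different keys.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CourseDetail holds fields whose shape is assumed to be stable.
// The JSON key names are illustrative assumptions, not confirmed source keys.
type CourseDetail struct {
	PID   string `json:"pid"`
	ID    string `json:"id"`
	Code  string `json:"code"`
	Title string `json:"title"`
	// Unknown keeps every top-level key that is not mapped above, verbatim.
	Unknown map[string]json.RawMessage `json:"-"`
}

// knownKeys lists the keys consumed by the typed fields.
var knownKeys = []string{"pid", "id", "code", "title"}

// MapCourseDetail decodes the typed fields, then decodes the same bytes into
// a raw map and removes the known keys so only unknown fields remain.
func MapCourseDetail(raw []byte) (CourseDetail, error) {
	var c CourseDetail
	if err := json.Unmarshal(raw, &c); err != nil {
		return c, err
	}
	if err := json.Unmarshal(raw, &c.Unknown); err != nil {
		return c, err
	}
	for _, k := range knownKeys {
		delete(c.Unknown, k)
	}
	return c, nil
}

func main() {
	raw := []byte(`{"pid":"p1","id":"i1","code":"CS 135","title":"T","mystery":{"a":1}}`)
	c, err := MapCourseDetail(raw)
	fmt.Println(c.Code, len(c.Unknown), err)
}
```

Keeping unknown fields as json.RawMessage leaves their bytes undecoded, so a later parser version can map them without a fresh scrape.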

35. Parser Stage 2: HTML Fragment Parsing

HTML fragment parsing reads rich text fields.

HTML fragment parsing should use a proper HTML parser.

HTML fragment parsing should not use regular expressions as the primary parser.

HTML fragment parsing should tolerate invalid nesting.

HTML fragment parsing should tolerate extra wrapper elements.

HTML fragment parsing should preserve link href values.

HTML fragment parsing should preserve link text.

HTML fragment parsing should preserve list structure.

HTML fragment parsing should preserve group header text.

HTML fragment parsing should preserve data-test attributes.

HTML fragment parsing should convert HTML comments to empty text.

HTML fragment parsing should normalize whitespace for display text.

HTML fragment parsing should preserve raw HTML for source references.

HTML fragment parsing should record DOM paths for derived nodes.

HTML fragment parsing should detect empty fragments.

HTML fragment parsing should detect malformed fragments.

HTML fragment parsing should not fail the whole item for one malformed fragment.

Malformed fragments should create parser warnings.

36. Parser Stage 3: Display Rule Tree

The display rule tree captures Kuali-rendered list structure.

The display rule tree is not yet the final requirement expression.

The display rule tree should preserve nested ul and li groups.

The display rule tree should preserve Kuali rule ids from data-test.

The display rule tree should preserve group headers.

The display rule tree should preserve visible text.

The display rule tree should preserve course links.

The display rule tree should preserve program links.

The display rule tree should preserve policy links if present.

The display rule tree should preserve non-link text.

The display rule tree should preserve ordering.

The display rule tree should tolerate wrapper div nodes inside lists.

The display rule tree should identify parent-child relationships.

The display rule tree should identify repeated rule labels.

The display rule tree should support debugging through inspect.

The display rule tree may be emitted only in debug mode.

The semantic parser should consume the display rule tree.

37. Parser Stage 4: Requirement Expression Recognition

Requirement expression recognition converts display rule trees to operational structures.

The recognizer should identify Complete all of the following.

The recognizer should identify Complete 1 of the following.

The recognizer should identify Complete k of the following.

The recognizer should identify explicit lists of courses.

The recognizer should identify lists of credentials.

The recognizer should identify subject-code course ranges.

The recognizer should identify faculty-based pools.

The recognizer should identify unit thresholds.

The recognizer should identify grade thresholds.

The recognizer should identify academic progress rules.

The recognizer should identify enrolled-in-program rules.

The recognizer should identify antirequisite lists.

The recognizer should identify cross-listing fields.

The recognizer should identify consent fields.

The recognizer should not guess unsupported phrases.

The recognizer should emit confidence values.

The recognizer should emit parser warnings for partial recognition.

The recognizer should emit unparsed requirements for unsupported fragments.
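
A minimal sketch of group header recognition, under stated assumptions: the header text has already been extracted from the display rule tree, "Complete 1 of the following" is mapped to any_of, and larger counts are mapped to choose_count. The names GroupKind and RecognizeHeader are hypothetical. Anything that does not match a known phrase falls through to unparsed rather than being guessed.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// GroupKind mirrors a subset of the RequirementExpression kinds.
type GroupKind string

const (
	AllOf       GroupKind = "all_of"
	AnyOf       GroupKind = "any_of"
	ChooseCount GroupKind = "choose_count"
	Unparsed    GroupKind = "unparsed"
)

// completeN matches "complete <k> of the following" with an optional colon.
var completeN = regexp.MustCompile(`^complete (\d+) of the following:?$`)

// RecognizeHeader classifies a group header and returns the required count
// for choose-like groups. Unsupported phrases yield Unparsed.
func RecognizeHeader(header string) (GroupKind, int) {
	// Collapse whitespace and lowercase before matching.
	h := strings.ToLower(strings.Join(strings.Fields(header), " "))
	if strings.TrimSuffix(h, ":") == "complete all of the following" {
		return AllOf, 0
	}
	if m := completeN.FindStringSubmatch(h); m != nil {
		n, err := strconv.Atoi(m[1])
		if err != nil || n < 1 {
			return Unparsed, 0
		}
		if n == 1 {
			return AnyOf, 1
		}
		return ChooseCount, n
	}
	return Unparsed, 0
}

func main() {
	for _, h := range []string{
		"Complete all of the following",
		"Complete 1 of the following",
		"Complete 3 of the following:",
		"Complete some of these",
	} {
		kind, n := RecognizeHeader(h)
		fmt.Println(h, "=>", kind, n)
	}
}
```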

38. Supported Course Requisite Patterns

Version 1 should support all-of requisite groups.

Version 1 should support any-of requisite groups.

Version 1 should support choose-count requisite groups.

Version 1 should support course completion leaves.

Version 1 should support course grade threshold leaves.

Version 1 should support academic progress leaves.

Version 1 should support credential enrollment leaves.

Version 1 should support course antirequisite leaves.

Version 1 should support cross-listed course references.

Version 1 should support instructor consent display fields.

Version 1 should support department consent display fields.

Version 1 should support nested group labels.

Version 1 should support mixed groups with all-of containing any-of children.

Version 1 should support any-of containing all-of children.

Version 1 should support zero-unit course records.

Version 1 should support courses with no requisites.

Version 1 should support retired or future-dated courses if returned by the catalog.

Version 1 should record status and dates for such courses.

39. Supported Credential Requirement Patterns

Version 1 should support required course lists.

Version 1 should support choose-one course alternatives.

Version 1 should support choose-k course alternatives.

Version 1 should support subject-code ranges.

Version 1 should support unit requirements from subject groups.

Version 1 should support excluded course lists.

Version 1 should support additional constraints as rich text.

Version 1 should support linked specializations lists.

Version 1 should support graduation requirement bullet lists as requirement sources.

Version 1 should support declaration requirement bullet lists as requirement sources.

Version 1 should support minimum average text as requirement sources.

Version 1 should parse average thresholds when explicit and simple.

Version 1 should preserve complex average definitions as unparsed requirements.

Version 1 should preserve natural-language exceptions as unparsed requirements.

Version 1 should preserve advisor-permission clauses as unparsed requirements.

Version 1 should preserve ambiguous exclusions as unparsed requirements.

Version 1 should identify fields that affect degree satisfaction.

Version 1 should identify fields that affect declaration only.

40. Course Link Resolution

Course links may use #/courses/view/:id.

Course links may use #/courses/:pid.

Observed program requirement fields often link by item id.

Observed course search summaries include both item id and pid.

The parser should build lookup tables by item id.

The parser should build lookup tables by pid.

The parser should build lookup tables by course code.

A course link by item id should resolve to a CourseListing.

A course link by pid should resolve to a CourseListing.

A course text without link may resolve by exact course code.

Course code resolution should be case-insensitive.

Course code resolution should preserve canonical casing.

Ambiguous course code resolution should create a parser warning.

Unresolved course links should create validation findings.

Unresolved course links should not be silently dropped.

The source href should be preserved even when unresolved.
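
The lookup tables and href forms above can be sketched as follows. The types CourseListing and CourseIndex and their methods are hypothetical names for this example. The sketch assumes hrefs arrive as bare fragments such as #/courses/view/:id. Code lookup is case-insensitive, and the returned listing carries the canonical casing.

```go
package main

import (
	"fmt"
	"strings"
)

// CourseListing is a reduced listing record for this sketch.
type CourseListing struct {
	ItemID, PID, Code string
}

// CourseIndex holds the three lookup tables.
type CourseIndex struct {
	byItemID map[string]*CourseListing
	byPID    map[string]*CourseListing
	byCode   map[string][]*CourseListing // more than one entry means ambiguous
}

func normCode(code string) string {
	return strings.ToUpper(strings.TrimSpace(code))
}

// NewCourseIndex builds the tables once per catalog version.
func NewCourseIndex(listings []*CourseListing) *CourseIndex {
	idx := &CourseIndex{
		byItemID: map[string]*CourseListing{},
		byPID:    map[string]*CourseListing{},
		byCode:   map[string][]*CourseListing{},
	}
	for _, l := range listings {
		idx.byItemID[l.ItemID] = l
		idx.byPID[l.PID] = l
		idx.byCode[normCode(l.Code)] = append(idx.byCode[normCode(l.Code)], l)
	}
	return idx
}

// ResolveHref resolves "#/courses/view/:id" by item id and "#/courses/:pid"
// by pid. The caller keeps the original href even when ok is false.
func (idx *CourseIndex) ResolveHref(href string) (*CourseListing, bool) {
	const viewPrefix = "#/courses/view/"
	const pidPrefix = "#/courses/"
	switch {
	case strings.HasPrefix(href, viewPrefix):
		l, ok := idx.byItemID[strings.TrimPrefix(href, viewPrefix)]
		return l, ok
	case strings.HasPrefix(href, pidPrefix):
		l, ok := idx.byPID[strings.TrimPrefix(href, pidPrefix)]
		return l, ok
	}
	return nil, false
}

// ResolveCode returns the listing only when exactly one candidate matches.
// A count above one signals an ambiguity that should become a parser warning.
func (idx *CourseIndex) ResolveCode(text string) (*CourseListing, int) {
	c := idx.byCode[normCode(text)]
	if len(c) == 1 {
		return c[0], 1
	}
	return nil, len(c)
}

func main() {
	idx := NewCourseIndex([]*CourseListing{{ItemID: "i1", PID: "p1", Code: "CS 135"}})
	l, ok := idx.ResolveHref("#/courses/view/i1")
	fmt.Println(l, ok)
}
```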

41. Program Link Resolution

Program links may use #/programs/view/:id.

Program links may use #/programs/:pid.

The parser should build lookup tables by program item id.

The parser should build lookup tables by program pid.

The parser should build lookup tables by program code.

A program link by item id should resolve to a Credential.

A program link by pid should resolve to a Credential.

Program text without link may resolve by exact code when safe.

Program title resolution should be conservative.

Ambiguous program title resolution should create a parser warning.

Unresolved program links should create validation findings.

Unresolved program links should remain in source references.

Credential resolution should not depend on display order.

Credential resolution should support specializations later.

42. Subject and Range Parsing

Subject range parsing should support forms like CS340-CS398.

Subject range parsing should support comma-separated ranges.

Subject range parsing should support mixed explicit courses and ranges.

Subject range parsing should support CS440-CS489.

Subject range parsing should support course numbers with suffix letters.

Subject range parsing should preserve the original range string.

Subject range parsing should normalize numeric bounds.

Subject range parsing should record inclusive lower bound.

Subject range parsing should record inclusive upper bound.

Subject range parsing should record subject code.

Subject range parsing should reject cross-subject ranges unless explicitly supported.

Subject range parsing should preserve ambiguous ranges as unparsed requirements.

Subject range parsing should distinguish a range pool from enumerated known courses.

Subject range parsing may later expand ranges to known courses.

Range expansion should be a derived index operation.

Range expansion should preserve the original range requirement.
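
A minimal sketch of single-range parsing for forms like CS340-CS398, assuming uppercase subject codes and purely numeric course numbers. The names CourseRange and ParseRange are hypothetical. This sketch does not handle suffix letters or comma-separated lists. It rejects those inputs, along with cross-subject ranges, so the caller can preserve them as unparsed requirements.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strconv"
)

// CourseRange records one inclusive range within a single subject.
type CourseRange struct {
	Original string // the source range string, preserved as written
	Subject  string
	Lower    int // inclusive lower bound
	Upper    int // inclusive upper bound
}

// ErrUnsupportedRange signals that the text should stay unparsed.
var ErrUnsupportedRange = errors.New("unsupported or ambiguous course range")

var rangeRe = regexp.MustCompile(`^([A-Z]+) ?(\d{1,4})-([A-Z]+) ?(\d{1,4})$`)

// ParseRange parses one range and keeps the original string in every result.
func ParseRange(s string) (CourseRange, error) {
	m := rangeRe.FindStringSubmatch(s)
	if m == nil || m[1] != m[3] {
		// No match, or a cross-subject range such as CS340-MATH398.
		return CourseRange{Original: s}, ErrUnsupportedRange
	}
	lo, err1 := strconv.Atoi(m[2])
	hi, err2 := strconv.Atoi(m[4])
	if err1 != nil || err2 != nil || lo > hi {
		return CourseRange{Original: s}, ErrUnsupportedRange
	}
	return CourseRange{Original: s, Subject: m[1], Lower: lo, Upper: hi}, nil
}

func main() {
	for _, s := range []string{"CS340-CS398", "CS440-CS489", "CS340-MATH398"} {
		r, err := ParseRange(s)
		fmt.Printf("%+v err=%v\n", r, err)
	}
}
```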

43. Grade Threshold Parsing

Grade threshold parsing should support percentages.

Grade threshold parsing should support at least.

Grade threshold parsing should support or higher.

Grade threshold parsing should support explicit course-grade alternatives.

Grade threshold parsing should support thresholds attached to course completions.

Grade threshold parsing should preserve course references.

Grade threshold parsing should store threshold as a numeric percentage.

Grade threshold parsing should store comparison operator.

Grade threshold parsing should distinguish passing credit from threshold credit.

Grade threshold parsing should not infer thresholds from course difficulty.

Grade threshold parsing should preserve unknown grading basis text.

Grade threshold parsing should be tested on synthetic examples.

Grade threshold parsing should be tested on real examples when found.

Ambiguous grade text should become unparsed requirements.
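
A minimal sketch of threshold extraction for the "at least N%" and "N% or higher" phrasings, assuming the surrounding course reference is resolved elsewhere. The names GradeThreshold and ParseGradeThreshold are hypothetical. Text with no threshold, more than one threshold, or a value above 100 is reported as not parsed so it can become an unparsed requirement.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// GradeThreshold stores a numeric percentage and its comparison operator.
type GradeThreshold struct {
	Percent  int
	Operator string // always ">=" in this sketch
}

// gradeRe matches "at least 60%" or "60% or higher", case-insensitively.
var gradeRe = regexp.MustCompile(`(?i)\bat least (\d{1,3})%|\b(\d{1,3})% or higher`)

// ParseGradeThreshold returns ok=false when the text is missing a threshold
// or carries more than one, which this sketch treats as ambiguous.
func ParseGradeThreshold(text string) (GradeThreshold, bool) {
	matches := gradeRe.FindAllStringSubmatch(text, -1)
	if len(matches) != 1 {
		return GradeThreshold{}, false
	}
	digits := matches[0][1]
	if digits == "" {
		digits = matches[0][2]
	}
	p, err := strconv.Atoi(digits)
	if err != nil || p > 100 {
		return GradeThreshold{}, false
	}
	return GradeThreshold{Percent: p, Operator: ">="}, true
}

func main() {
	for _, s := range []string{
		"a grade of at least 60% in CS 136",
		"70% or higher",
		"a good grade",
	} {
		g, ok := ParseGradeThreshold(s)
		fmt.Println(s, "=>", g, ok)
	}
}
```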

44. Academic Progress Parsing

Academic progress parsing should support 1A.

Academic progress parsing should support 1B.

Academic progress parsing should support 2A.

Academic progress parsing should support 2B.

Academic progress parsing should support 3A.

Academic progress parsing should support 3B.

Academic progress parsing should support 4A.

Academic progress parsing should support 4B.

Academic progress parsing should distinguish progress from standing.

Academic progress parsing should not call progress academic standing.

Academic standing examples include good standing and probation.

Academic standing should be modeled separately when source data requires it.

The parser should store progress as ordered values.

The parser should support exact progress conditions.

The parser should support at-least progress conditions.

The parser should preserve ambiguous level text.

The validator should flag unknown progress values.
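
Storing progress as ordered values can be sketched with a rank lookup over the eight levels listed above. The names ProgressRank and ProgressAtLeast are hypothetical. Unknown values return an error rather than a guess, so the validator can flag them.

```go
package main

import (
	"fmt"
	"strings"
)

// progressOrder lists the supported levels from lowest to highest.
var progressOrder = []string{"1A", "1B", "2A", "2B", "3A", "3B", "4A", "4B"}

// ProgressRank returns the position of a level in the ordering.
func ProgressRank(level string) (int, bool) {
	want := strings.ToUpper(strings.TrimSpace(level))
	for i, v := range progressOrder {
		if v == want {
			return i, true
		}
	}
	return 0, false
}

// ProgressAtLeast evaluates an at-least condition. An exact condition would
// compare the two ranks for equality instead.
func ProgressAtLeast(student, required string) (bool, error) {
	s, ok1 := ProgressRank(student)
	r, ok2 := ProgressRank(required)
	if !ok1 || !ok2 {
		return false, fmt.Errorf("unknown progress value in %q or %q", student, required)
	}
	return s >= r, nil
}

func main() {
	ok, err := ProgressAtLeast("3A", "2B")
	fmt.Println(ok, err)
}
```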

45. Unit Requirement Parsing

Unit parsing should support decimal units.

Unit parsing should support 0.50.

Unit parsing should support 1.0.

Unit parsing should support 2.0.

Unit parsing should support minimum unit thresholds.

Unit parsing should support units from subject pools.

Unit parsing should support units from faculty pools when explicit.

Unit parsing should store numeric value as decimal.

Unit parsing should preserve display text.

Unit parsing should avoid binary floating point drift in Go.

Unit parsing should use integer hundredths or decimal types.

Unit parsing should validate nonnegative units.

Unit parsing should validate max units where present.

Unit parsing should preserve complex unit exceptions as unparsed requirements.
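
The integer-hundredths option can be sketched as follows, assuming unit values carry at most two decimal places. The names ParseUnits and FormatUnits are hypothetical. Parsing works on the digit strings directly, so no binary floating point value is ever created.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// ErrBadUnits is returned for negative, malformed, or over-precise values.
var ErrBadUnits = errors.New("invalid unit value")

// ParseUnits converts text such as "0.50" into integer hundredths (50).
func ParseUnits(s string) (int64, error) {
	whole, frac, found := strings.Cut(strings.TrimSpace(s), ".")
	if whole == "" || len(whole) > 9 || (found && frac == "") || len(frac) > 2 {
		return 0, ErrBadUnits
	}
	for len(frac) < 2 {
		frac += "0" // "1.5" becomes 1 and 50 hundredths
	}
	for _, r := range whole + frac {
		if r < '0' || r > '9' {
			return 0, ErrBadUnits // rejects signs, letters, and stray symbols
		}
	}
	// Both strings are short and all digits, so ParseInt cannot fail here.
	w, _ := strconv.ParseInt(whole, 10, 64)
	f, _ := strconv.ParseInt(frac, 10, 64)
	return w*100 + f, nil
}

// FormatUnits renders hundredths back to a two-decimal display string.
func FormatUnits(h int64) string {
	return fmt.Sprintf("%d.%02d", h/100, h%100)
}

func main() {
	total := int64(0)
	for _, s := range []string{"0.50", "1.0", "2.0"} {
		u, err := ParseUnits(s)
		if err != nil {
			fmt.Println("skip", s, err)
			continue
		}
		total += u
	}
	fmt.Println(FormatUnits(total))
}
```

Summing hundredths is exact, so a unit threshold comparison cannot be tipped by rounding.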

46. Average Requirement Parsing

Average parsing should support explicit minimum cumulative overall average.

Average parsing should support explicit minimum major average.

Average parsing should support explicit minimum course subset average.

Average parsing should preserve the population definition.

Average parsing should not compute average semantics unless the population is parsed.

Average parsing should store threshold percentage.

Average parsing should store average kind.

Average parsing should store included subject codes when clear.

Average parsing should store included course ranges when clear.

Average parsing should preserve exclusions.

Average parsing should mark partial confidence when threshold is parsed but population is not.

Complex average definitions should become unparsed or partially parsed requirements.

The solver should treat partially parsed averages conservatively.

47. Cross-Listing Parsing

Cross-listing parsing should read the configured cross-listed courses field.

Cross-listing parsing should resolve linked course references.

Cross-listing parsing should create listing-to-listing relationships.

Cross-listing parsing should create or update CourseCredit records.

Cross-listing parsing should validate symmetry when both sides are present.

Cross-listing parsing should warn when symmetry is missing.

Cross-listing parsing should not infer cross-listing from shared title alone.

Cross-listing parsing should not infer cross-listing from antirequisite alone.

Cross-listing parsing should preserve source text.

Cross-listing parsing should support multiple linked courses.

Cross-listing parsing should handle self-references defensively.

Cross-listing parsing should flag self-references as validation findings.
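
The symmetry and self-reference checks can be sketched over a map from listing id to the listing ids named in its cross-listed courses field. The Finding type, the finding kind strings, and CheckCrossListings are hypothetical names for this example. Symmetry is checked only when the target listing is itself present in the map, matching the "when both sides are present" rule above.

```go
package main

import (
	"fmt"
	"sort"
)

// Finding describes one cross-listing problem.
type Finding struct {
	Kind string // "self_reference" or "asymmetric_cross_listing"
	From string
	To   string
}

func contains(xs []string, x string) bool {
	for _, v := range xs {
		if v == x {
			return true
		}
	}
	return false
}

// CheckCrossListings reports self-references and one-sided links.
// Keys are visited in sorted order so the output is deterministic.
func CheckCrossListings(links map[string][]string) []Finding {
	ids := make([]string, 0, len(links))
	for id := range links {
		ids = append(ids, id)
	}
	sort.Strings(ids)

	var out []Finding
	for _, from := range ids {
		for _, to := range links[from] {
			if to == from {
				out = append(out, Finding{"self_reference", from, to})
				continue
			}
			back, present := links[to]
			if present && !contains(back, from) {
				out = append(out, Finding{"asymmetric_cross_listing", from, to})
			}
		}
	}
	return out
}

func main() {
	links := map[string][]string{"A": {"B", "A"}, "B": {}, "C": {"D"}}
	for _, f := range CheckCrossListings(links) {
		fmt.Printf("%+v\n", f)
	}
}
```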

48. Antirequisite Parsing

Antirequisite parsing should read the configured antirequisites field.

Antirequisite parsing should resolve linked course references.

Antirequisite parsing should create conflict relationships.

Antirequisite parsing should distinguish conflict from credit identity.

Antirequisite parsing should support course lists.

Antirequisite parsing should support program exclusion text as unparsed when unclear.

Antirequisite parsing should preserve text about former numbers.

Antirequisite parsing may later classify former-number equivalence.

Antirequisite parsing should not assume all conflicts are symmetric.

Validation should check symmetry as an informational finding.

Validation should not require symmetry unless source policy proves it.

Unresolved antirequisite links should be blocking only above a configured threshold.

49. Manual Patch Layer

Manual patches are a first-class part of the design.

Manual patches are not hidden scraper hacks.

Manual patches should live under data/patches.

Manual patches should be YAML or JSON.

YAML is easier for humans to review.

JSON is easier for machines to validate.

The first implementation may use YAML with schema validation.

Each patch should include a stable patch id.

Each patch should include a reason.

Each patch should include a source item type.

Each patch should include a source pid or item id.

Each patch should include a field name.

Each patch should include a rule id or DOM path.

Each patch should include an action.

Each patch should include replacement data when applicable.

Each patch should include reviewer notes.

Each patch should include date added.

Each patch should include author or maintainer initials if desired.

Patches should be deterministic.

Patches should be reported during parse.

50. Patch Actions

Patch action replace_requirement replaces one parsed or unparsed requirement.

Patch action add_requirement adds a derived requirement under a source field.

Patch action ignore_fragment marks a source fragment as intentionally ignored.

Patch action set_confidence adjusts parser confidence with rationale.

Patch action link_course resolves an unresolved course reference.

Patch action link_credential resolves an unresolved credential reference.

Patch action set_course_credit_group sets cross-listing or credit identity grouping.

Patch action mark_false_positive marks a validation finding as reviewed.

Patch action split_requirement splits one source fragment into multiple expressions.

Patch action annotate_requirement adds notes without changing semantics.

Patch actions should be narrow.

Patch actions should be auditable.

Patch actions should not mutate raw snapshot files.

Patch actions should not hide the original source text.

Patch actions should be represented in parsed output.

51. Patch File Example

Example patch shape:

patches:
  - id: cs-bmath-list1-range-2026
    action: replace_requirement
    reason: Parse explicit CS range pool from rendered free text.
    item_type: programs
    item_pid: HkxPJk0Cj3
    field: courseRequirementsNoUnits
    rule_id: ruleView-K
    replacement:
      kind: subject_range_completed
      count: 1
      subject_codes:
        - CS
      course_ranges:
        - lower: "340"
          upper: "398"
        - lower: "440"
          upper: "489"
    notes: Source says complete one additional CS course from listed ranges.

The example is illustrative.

The parser should validate patch schema before applying patches.

The parser should fail on duplicate patch ids.

The parser should warn when a patch target does not match.

Patch misses should be visible in parse reports.
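
A minimal sketch of pre-application patch checks, assuming the patch file has already been decoded into structs (YAML decoding needs a third-party library and is left out here). The Patch type and ValidatePatches are hypothetical names. The action list is taken from the Patch Actions section.

```go
package main

import "fmt"

// Patch is a reduced, already-decoded patch entry.
type Patch struct {
	ID, Action, Reason, ItemType, ItemPID, Field, RuleID string
}

// knownActions lists the patch actions named in this specification.
var knownActions = map[string]bool{
	"replace_requirement":     true,
	"add_requirement":         true,
	"ignore_fragment":         true,
	"set_confidence":          true,
	"link_course":             true,
	"link_credential":         true,
	"set_course_credit_group": true,
	"mark_false_positive":     true,
	"split_requirement":       true,
	"annotate_requirement":    true,
}

// ValidatePatches fails on the first schema problem or duplicate id.
// It runs before any patch is applied.
func ValidatePatches(patches []Patch) error {
	seen := make(map[string]bool)
	for i, p := range patches {
		switch {
		case p.ID == "":
			return fmt.Errorf("patch %d: missing id", i)
		case seen[p.ID]:
			return fmt.Errorf("patch %d: duplicate patch id %q", i, p.ID)
		case !knownActions[p.Action]:
			return fmt.Errorf("patch %q: unknown action %q", p.ID, p.Action)
		case p.Reason == "":
			return fmt.Errorf("patch %q: missing reason", p.ID)
		}
		seen[p.ID] = true
	}
	return nil
}

func main() {
	err := ValidatePatches([]Patch{
		{ID: "cs-bmath-list1-range-2026", Action: "replace_requirement", Reason: "r"},
		{ID: "cs-bmath-list1-range-2026", Action: "replace_requirement", Reason: "r"},
	})
	fmt.Println(err)
}
```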

52. Validation Finding Levels

Validation finding level blocker prevents index build.

Validation finding level error should fail strict validation.

Validation finding level warning should be reviewed.

Validation finding level info documents noteworthy facts.

blocker should mean the artifact is unsafe to publish.

error should mean the artifact may be publishable only with explicit override.

warning should mean the artifact is incomplete but usable.

info should mean no immediate action is needed.

The default pipeline should fail on blockers.

The default pipeline may allow warnings.

Strict mode should fail on errors.

Strict mode may fail on configured warning classes.

Validation reports should group findings by level.

Validation reports should group findings by item type.

Validation reports should group findings by field.

Validation reports should include examples.
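
The level semantics above can be sketched as an ordered enum with a gate function. The names Level, Finding, ShouldFail, and CountByLevel are hypothetical. The sketch assumes the default mode fails only on blockers and that strict mode also fails on errors. Configurable warning classes are not modeled.

```go
package main

import "fmt"

// Level orders findings from least to most severe.
type Level int

const (
	Info Level = iota
	Warning
	Error
	Blocker
)

// Finding is a reduced validation finding.
type Finding struct {
	Level   Level
	Message string
}

// ShouldFail reports whether the pipeline run must stop.
func ShouldFail(findings []Finding, strict bool) bool {
	threshold := Blocker
	if strict {
		threshold = Error
	}
	for _, f := range findings {
		if f.Level >= threshold {
			return true
		}
	}
	return false
}

// CountByLevel supports the grouped summary in validation reports.
func CountByLevel(findings []Finding) map[Level]int {
	counts := make(map[Level]int)
	for _, f := range findings {
		counts[f.Level]++
	}
	return counts
}

func main() {
	fs := []Finding{
		{Warning, "unresolved course link"},
		{Error, "unexpected record shape"},
	}
	fmt.Println(ShouldFail(fs, false), ShouldFail(fs, true), CountByLevel(fs))
}
```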

53. Blocking Validation Findings

Missing catalog metadata is blocking.

Missing schema for required item types is blocking.

Missing search count for required item types is blocking.

Fetched summary count below expected count is blocking.

Any missing detail record is blocking; the tolerance is zero.

Invalid JSON in raw detail files is blocking.

Duplicate pids within the same item type are blocking.

Unparseable manifest is blocking.

Invalid parsed JSONL is blocking.

Requirement expression cycles are blocking.

Requirement expression parent references to missing ids are blocking.

Expression references to missing condition ids are blocking.

Index build with failed validation input is blocking.

Patch schema errors are blocking.

Duplicate patch ids are blocking.

SQLite schema migration failure is blocking.

Graph projection generation failure is blocking when requested.
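
Two of the blocking checks above, parent references to missing ids and expression cycles, can be sketched over a map from expression id to parent_expression_id, with an empty string for roots. CheckExpressionParents is a hypothetical name. The cycle check walks upward from each node and treats a walk longer than the node count as proof of a cycle, so it also reports nodes that lead into a cycle.

```go
package main

import (
	"fmt"
	"sort"
)

// CheckExpressionParents returns ids whose direct parent is missing and ids
// whose ancestor chain never reaches a root. Both lists are sorted.
func CheckExpressionParents(parent map[string]string) (missing, cyclic []string) {
	ids := make([]string, 0, len(parent))
	for id := range parent {
		ids = append(ids, id)
	}
	sort.Strings(ids)

	for _, id := range ids {
		if p := parent[id]; p != "" {
			if _, ok := parent[p]; !ok {
				missing = append(missing, id)
				continue
			}
		}
		// An acyclic chain reaches a root in at most len(parent) steps.
		cur, steps := id, 0
		for cur != "" && steps <= len(parent) {
			next, ok := parent[cur]
			if !ok {
				break // a missing ancestor is reported on the ancestor's child
			}
			cur = next
			steps++
		}
		if steps > len(parent) {
			cyclic = append(cyclic, id)
		}
	}
	return missing, cyclic
}

func main() {
	parent := map[string]string{"a": "", "b": "a", "c": "d", "d": "c", "e": "zz"}
	missing, cyclic := CheckExpressionParents(parent)
	fmt.Println("missing parent:", missing, "in cycle:", cyclic)
}
```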

54. Warning Validation Findings

Unresolved course links are warnings initially.

Unresolved program links are warnings initially.

Asymmetric cross-listing relationships are warnings initially.

Asymmetric antirequisite relationships are warnings initially.

Unparsed requirement fragments are warnings initially.

Partial average parsing is a warning.

Ambiguous subject range parsing is a warning.

Unknown academic progress values are warnings unless common.

Unknown requirement phrases are warnings.

Unexpected source fields are warnings.

Unexpected missing optional fields are warnings.

Patch target misses are warnings unless patch is marked required.

High unparsed rate by field is a warning.

High unparsed rate by faculty is a warning.

Warnings should include review priority.

Warnings should include representative source references.

55. Parser Confidence

Parser confidence should be explicit.

Confidence should not be overused as decoration.

Confidence should represent parser certainty about semantic interpretation.

Confidence value high means the parser recognized a known pattern.

Confidence value medium means the parser recognized core structure with some ambiguity.

Confidence value low means the parser captured partial semantics only.

Confidence value unparsed means no safe semantic interpretation was made.

Manual patches may set confidence with rationale.

Confidence should be separate from validation severity.

A high-confidence parse can still point to a missing course link.

A low-confidence parse can still be useful for display.

The solver should treat low-confidence requirements conservatively.

The UI should expose low confidence where it affects decisions.

56. SQLite Index Goals

The SQLite index should support backend startup.

The SQLite index should support course lookup by code.

The SQLite index should support course lookup by pid.

The SQLite index should support credential lookup by pid.

The SQLite index should support credential lookup by code.

The SQLite index should support requirement tree traversal.

The SQLite index should support source reference lookup.

The SQLite index should support graph projection loading.

The SQLite index should support search by title and code.

The SQLite index should support source uncertainty queries.

The SQLite index should support validation summary lookup.

The SQLite index should be read-only at runtime.

The SQLite index should not store student state.

The SQLite index should include build metadata.

The SQLite index should be reproducible.

The SQLite index should be inspectable with standard tooling.

57. SQLite Tables

The first SQLite schema should include catalog_versions.

The first SQLite schema should include course_listings.

The first SQLite schema should include course_credits.

The first SQLite schema should include course_credit_listings.

The first SQLite schema should include credentials.

The first SQLite schema should include requirement_sources.

The first SQLite schema should include requirement_expressions.

The first SQLite schema should include requirement_conditions.

The first SQLite schema should include condition_courses.

The first SQLite schema should include condition_credentials.

The first SQLite schema should include condition_subject_ranges.

The first SQLite schema should include source_references.

The first SQLite schema should include unparsed_requirements.

The first SQLite schema should include validation_findings.

The first SQLite schema should include parser_warnings.

The first SQLite schema should include patch_applications.

The first SQLite schema should include graph_nodes if graph projections are stored inside SQLite.

The first SQLite schema should include graph_edges if graph projections are stored inside SQLite.

58. Graph Projection Artifact

The graph projection artifact should be generated from parsed requirements.

The graph projection artifact should be optional in v1.

The initial graph projection artifact is uncompressed JSON so it can be inspected, hashed, and compared during early release-gate work.

The graph projection artifact should gain a compressed companion when large.

The initial default path should be graph-projection.json.

The graph projection should include nodes.

The graph projection should include edges.

The graph projection should include edge types.

The graph projection should include source requirement ids.

The graph projection should include uncertainty flags.

The graph projection should include display labels.

The graph projection should include course level.

The graph projection should include subject code.

The graph projection should include credential references.

The graph projection should include derived importance metrics later.

The graph projection should not be the canonical data model.

The graph projection should document its lossiness.

The graph projection should be replaceable by later projection formats.

59. Source Fields for Courses

The course parser should inspect description.

The course parser should inspect credits.

The course parser should inspect crossListedCourses.

The course parser should inspect feeStatement.

The course parser should inspect notes.

The course parser should inspect specialGradingBasis.

The course parser should inspect totalCompletionsAllowed.

The course parser should inspect allowMultipleEnrollInATerm.

The course parser should inspect prerequisites.

The course parser should inspect corequisites.

The course parser should inspect antirequisites.

The course parser should inspect specialConsentRequiredToAdd.

The course parser should inspect specialConsentRequiredToDrop.

The course parser should derive the configured list from catalog settings when possible.

The course parser should not assume the field list is permanent.

The course parser should report configured fields missing from records.

The course parser should report unknown configured fields.

60. Source Fields for Programs

The program parser should inspect specialNotice.

The program parser should inspect systemsOfStudy.

The program parser should inspect optionIsAvailableForStudentsInTheFollowingDegrees.

The program parser should inspect specializationIsAvailableForStudentsInTheFollowingMajorsRules.

The program parser should inspect declarationAudience.

The program parser should inspect admissionRequirements.

The program parser should inspect declarationRequirements.

The program parser should inspect minimumAverageSRequired.

The program parser should inspect degreeRequirements.

The program parser should inspect graduationRequirements.

The program parser should inspect coOperativeRequirementsUndergraduate.

The program parser should inspect requiredCoursesTermByTerm.

The program parser should inspect requirements.

The program parser should inspect courseRequirementsNoUnits.

The program parser should inspect courseListsNew.

The program parser should inspect additionalConstraints.

The program parser should inspect specializationDetails.

The program parser should inspect specializationsList.

The program parser should inspect detailsAndNotes.

The program parser should derive the configured list from catalog settings when possible.

61. Parser Test Fixtures

Parser tests should include small synthetic HTML fragments.

Parser tests should include real captured fragments from raw snapshots.

Real captured fragments should be minimized when committed.

Real captured fragments should retain source references.

Parser tests should cover all-of groups.

Parser tests should cover any-of groups.

Parser tests should cover choose-count groups.

Parser tests should cover nested groups.

Parser tests should cover course links by item id.

Parser tests should cover program links by item id.

Parser tests should cover course code text without links.

Parser tests should cover subject ranges.

Parser tests should cover academic progress rules.

Parser tests should cover grade thresholds.

Parser tests should cover antirequisites.

Parser tests should cover unparsed fallback.

Parser tests should cover malformed HTML.

Parser tests should cover patch application.

62. Fetch Test Fixtures

Fetch tests should not depend on live Kuali by default.

Fetch tests should use local HTTP test servers.

Fetch tests should cover successful discovery.

Fetch tests should cover manual override discovery.

Fetch tests should cover paginated search.

Fetch tests should cover missing item-count.

Fetch tests should cover retryable failures.

Fetch tests should cover non-retryable failures.

Fetch tests should cover resume behavior.

Fetch tests should cover duplicate pid detection.

Fetch tests should cover interrupted runs.

Fetch tests should cover manifest writing.

Fetch tests should cover hash recording.

Live integration tests should be opt-in.

Live integration tests should be rate limited.

Live integration tests should not run in normal unit test suites.
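
As one possible shape for the retryable-failure case, the sketch below runs a local server with `net/http/httptest` that fails a fixed number of times before succeeding. The helper `fetchWithRetry`, the server `newFlakyServer`, and the JSON payload are assumptions made for illustration; a real fetcher would also add backoff and rate limiting.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// fetchWithRetry is a hypothetical helper. It retries transport errors and
// 5xx responses up to maxAttempts, and treats any other non-200 status as
// non-retryable. It returns the body and the number of attempts used.
func fetchWithRetry(client *http.Client, url string, maxAttempts int) ([]byte, int, error) {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		resp, err := client.Get(url)
		if err != nil {
			lastErr = err
			continue
		}
		body, readErr := io.ReadAll(resp.Body)
		resp.Body.Close()
		if readErr != nil {
			lastErr = readErr
			continue
		}
		if resp.StatusCode >= 500 {
			lastErr = fmt.Errorf("retryable status %d", resp.StatusCode)
			continue
		}
		if resp.StatusCode != http.StatusOK {
			return nil, attempt, fmt.Errorf("non-retryable status %d", resp.StatusCode)
		}
		return body, attempt, nil
	}
	return nil, maxAttempts, fmt.Errorf("gave up after %d attempts: %w", maxAttempts, lastErr)
}

// newFlakyServer answers the first `failures` requests with 503, then
// serves a made-up JSON payload.
func newFlakyServer(failures int32) *httptest.Server {
	var calls int32
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt32(&calls, 1) <= failures {
			http.Error(w, "busy", http.StatusServiceUnavailable)
			return
		}
		fmt.Fprint(w, `[{"pid":"abc"}]`)
	}))
}

func main() {
	srv := newFlakyServer(2)
	defer srv.Close()

	body, attempts, err := fetchWithRetry(srv.Client(), srv.URL, 3)
	fmt.Println(string(body), attempts, err)
}
```

Because the server is local, the same pattern can cover pagination, missing headers, and interrupted runs without touching live Kuali.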

63. Validation Tests

Validation tests should cover complete snapshots.

Validation tests should cover incomplete snapshots.

Validation tests should cover invalid JSONL.

Validation tests should cover missing detail files.

Validation tests should cover duplicate records.

Validation tests should cover unresolved links.

Validation tests should cover empty requirement groups.

Validation tests should cover impossible choose-count groups.

Validation tests should cover invalid grade thresholds.

Validation tests should cover invalid progress values.

Validation tests should cover patch misses.

Validation tests should cover warning thresholds.

Validation tests should cover strict mode.

Validation tests should cover report generation.
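
Two of the cases above, empty requirement groups and impossible choose-count groups, might be checked as follows. The `Group` and `Finding` types and the finding kind strings are hypothetical, and the check assumes each child contributes at most one selection, which would not hold for children that are subject range pools.

```go
package main

import "fmt"

// Group is a simplified, hypothetical requirement group.
type Group struct {
	ID       string
	Kind     string // "all_of", "any_of", or "choose"
	Count    int    // only meaningful when Kind == "choose"
	Children []string
}

// Finding is a validation finding. Findings are data, not Go errors.
type Finding struct {
	Kind   string
	Target string
}

// CheckGroup reports empty groups and choose-count groups that can never
// be satisfied because they ask for fewer than one or more than all children.
func CheckGroup(g Group) []Finding {
	var out []Finding
	if len(g.Children) == 0 {
		out = append(out, Finding{Kind: "empty_group", Target: g.ID})
	}
	if g.Kind == "choose" && (g.Count < 1 || g.Count > len(g.Children)) {
		out = append(out, Finding{Kind: "impossible_choose_count", Target: g.ID})
	}
	return out
}

func main() {
	groups := []Group{
		{ID: "g1", Kind: "choose", Count: 2, Children: []string{"CS135", "CS145", "CS115"}},
		{ID: "g2", Kind: "choose", Count: 3, Children: []string{"MATH135", "MATH145"}},
		{ID: "g3", Kind: "all_of"},
	}
	for _, g := range groups {
		fmt.Println(g.ID, CheckGroup(g))
	}
}
```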

64. Determinism Requirements

The same raw snapshot and patch set should produce the same parsed output.

The same parsed output should produce the same validation output.

The same validation output and parsed output should produce the same index.

Output ordering should be deterministic.

IDs should be deterministic where practical.

IDs may be content-addressed.

IDs may be derived from catalog id, item id, field, and local path.

Generated timestamps should be isolated to manifests and build metadata.

Parser output should not stamp the current time on individual records.

Index build output should not depend on map iteration order.

Reports should be stable enough for meaningful diffs.

The pipeline should support reproducibility checks.

The pipeline may later support uwscrape verify-reproducible.
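
Go randomizes map iteration order, so the ordering rules above usually come down to sorting keys before writing. A minimal sketch, assuming a hypothetical `Course` record and a JSONL writer named `WriteJSONL`:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"sort"
)

// Course is a hypothetical parsed record.
type Course struct {
	ID   string `json:"id"`
	Code string `json:"code"`
}

// WriteJSONL emits one JSON object per line, ordered by ID, so the bytes
// do not depend on map iteration order.
func WriteJSONL(byID map[string]Course) ([]byte, error) {
	ids := make([]string, 0, len(byID))
	for id := range byID {
		ids = append(ids, id)
	}
	sort.Strings(ids)

	var buf bytes.Buffer
	enc := json.NewEncoder(&buf) // Encode appends a newline after each value
	for _, id := range ids {
		if err := enc.Encode(byID[id]); err != nil {
			return nil, err
		}
	}
	return buf.Bytes(), nil
}

func main() {
	courses := map[string]Course{
		"b": {ID: "b", Code: "CS135"},
		"a": {ID: "a", Code: "MATH135"},
	}
	out, err := WriteJSONL(courses)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out))
}
```

A reproducibility check can then be as simple as running the same stage twice and comparing output hashes.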

65. ID Strategy

IDs should be stable across rebuilds from the same source.

Course listing ids may use catalog id plus source item id.

Credential ids may use catalog id plus source item id.

Requirement source ids may use owner id plus field name.

Requirement expression ids may use requirement source id plus tree path.

Requirement condition ids may use expression id plus condition kind.

Source reference ids may use raw file hash plus JSON pointer plus DOM path.

Patch application ids may use patch id plus target id.

Validation finding ids may use finding kind plus target id plus source reference.

IDs should avoid random values in parsed and index artifacts.

Random values are acceptable only for runtime student state.

ID strategy should be documented in code comments near implementation.
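
One way to get IDs that are stable and free of random values is to hash the ordered parts each rule names. The sketch below is an assumption about how that could look, not a decided scheme: `DeriveID`, the kind prefixes, and the 8-byte truncation are illustrative choices.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// DeriveID hashes ordered parts into a short stable ID with a kind prefix.
// Each part is length-prefixed so ("ab", "c") and ("a", "bc") cannot collide.
func DeriveID(kind string, parts ...string) string {
	var b strings.Builder
	for _, p := range parts {
		fmt.Fprintf(&b, "%d:%s;", len(p), p)
	}
	sum := sha256.Sum256([]byte(b.String()))
	return kind + "_" + hex.EncodeToString(sum[:8])
}

func main() {
	// Placeholder catalog and item ids, not real Kuali values.
	courseID := DeriveID("course", "catalog-123", "item-456")
	sourceID := DeriveID("reqsrc", courseID, "prerequisites")
	exprID := DeriveID("reqexpr", sourceID, "/0/1")

	fmt.Println(courseID)
	fmt.Println(sourceID)
	fmt.Println(exprID)
}
```

Chaining IDs this way means a child ID changes whenever its owner changes, which matches the owner-plus-field and source-plus-path rules above.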

66. Error Handling

Errors should be explicit.

Errors should include command name.

Errors should include path or URL when relevant.

Errors should include item type when relevant.

Errors should include pid when relevant.

Errors should include field name when relevant.

Errors should include source reference when available.

Errors should distinguish fetch errors from parse errors.

Errors should distinguish parse errors from validation findings.

Errors should distinguish validation findings from build failures.

Errors should not hide failed actions behind success wording.

Errors should not include secrets.

Errors should be readable without debug mode.

Debug mode may include stack traces.
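
The context fields listed above fit naturally into a single error type. The following is a sketch under assumptions: `PipelineError`, `Stage`, `IsStage`, and the message layout are hypothetical, and validation findings are deliberately left out because they are reported as data rather than as errors.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Stage separates fetch errors, parse errors, and build failures.
type Stage string

const (
	StageFetch Stage = "fetch"
	StageParse Stage = "parse"
	StageBuild Stage = "build"
)

// PipelineError carries context fields. Empty fields are omitted from the message.
type PipelineError struct {
	Stage    Stage
	Command  string
	Location string // path or URL
	ItemType string
	PID      string
	Field    string
	Err      error
}

func (e *PipelineError) Error() string {
	parts := []string{fmt.Sprintf("%s: %s error", e.Command, e.Stage)}
	add := func(key, value string) {
		if value != "" {
			parts = append(parts, key+"="+value)
		}
	}
	add("at", e.Location)
	add("type", e.ItemType)
	add("pid", e.PID)
	add("field", e.Field)
	msg := strings.Join(parts, " ")
	if e.Err != nil {
		msg += ": " + e.Err.Error()
	}
	return msg
}

// Unwrap exposes the underlying cause to errors.Is and errors.As.
func (e *PipelineError) Unwrap() error { return e.Err }

// IsStage reports whether the first PipelineError in err's chain has stage s.
func IsStage(err error, s Stage) bool {
	var pe *PipelineError
	return errors.As(err, &pe) && pe.Stage == s
}

func main() {
	err := &PipelineError{
		Stage: StageParse, Command: "uwscrape parse",
		ItemType: "course", PID: "abc123", Field: "prerequisites",
		Err: errors.New("unclosed <ul>"),
	}
	wrapped := fmt.Errorf("run failed: %w", err)
	fmt.Println(wrapped)
	fmt.Println("parse stage:", IsStage(wrapped, StageParse))
}
```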

67. Logging

The CLI should log progress to stderr.

The CLI should write artifacts to files.

The CLI should support quiet mode.

The CLI should support verbose mode.

The CLI may support structured logs later.

Logs should include item type progress.

Logs should include fetch page progress.

Logs should include detail fetch progress.

Logs should include parser phase progress.

Logs should include validation summary.

Logs should include final output paths.

Logs should not print huge raw JSON by default.

Logs should not print raw HTML fragments by default.

The inspect command may print raw fragments when explicitly asked to.

68. Configuration

The CLI should work with flags alone.

The CLI may support a config file later.

Config file format may be TOML or YAML.

Config should include default calendar URL.

Config should include default item types.

Config should include page size.

Config should include concurrency.

Config should include rate limit.

Config should include retry policy.

Config should include patch paths.

Config should include validation policy.

Config should include output workspace.

Flags should override config file values.

The manifest should record final effective configuration.
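
The precedence rule (defaults, then config file, then flags) can be sketched with the standard `flag` package, whose `FlagSet.Visit` walks only the flags that were actually set. The `Config` fields, flag names, and default values below are assumptions for illustration, and config file loading is left out.

```go
package main

import (
	"flag"
	"fmt"
)

// Config holds a few of the settings listed above. Field names are illustrative.
type Config struct {
	PageSize    int
	Concurrency int
	Workspace   string
}

// Resolve starts from built-in defaults, applies non-zero config file
// values, then applies only flags explicitly present in args.
func Resolve(fileCfg Config, args []string) (Config, error) {
	cfg := Config{PageSize: 100, Concurrency: 2, Workspace: "./workspace"} // assumed defaults
	if fileCfg.PageSize != 0 {
		cfg.PageSize = fileCfg.PageSize
	}
	if fileCfg.Concurrency != 0 {
		cfg.Concurrency = fileCfg.Concurrency
	}
	if fileCfg.Workspace != "" {
		cfg.Workspace = fileCfg.Workspace
	}

	fs := flag.NewFlagSet("uwscrape", flag.ContinueOnError)
	pageSize := fs.Int("page-size", 0, "search page size")
	concurrency := fs.Int("concurrency", 0, "parallel detail fetches")
	workspace := fs.String("workspace", "", "output workspace directory")
	if err := fs.Parse(args); err != nil {
		return Config{}, err
	}
	// Visit skips unset flags, so flag zero values never clobber config values.
	fs.Visit(func(f *flag.Flag) {
		switch f.Name {
		case "page-size":
			cfg.PageSize = *pageSize
		case "concurrency":
			cfg.Concurrency = *concurrency
		case "workspace":
			cfg.Workspace = *workspace
		}
	})
	return cfg, nil
}

func main() {
	fileCfg := Config{PageSize: 50, Workspace: "/data/uwscrape"}
	cfg, err := Resolve(fileCfg, []string{"-concurrency", "4"})
	if err != nil {
		panic(err)
	}
	// The resolved value is what the manifest would record as effective configuration.
	fmt.Printf("%+v\n", cfg)
}
```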

69. Security Considerations

The scraper should not require user secrets.

The scraper should not store credentials.

The scraper should not bypass authenticated systems.

The scraper should not write outside configured output directories.

The scraper should sanitize slugs used in paths.

The scraper should reject path traversal in slugs.

The scraper should not execute content from source responses.

The parser should treat HTML as data.

The parser should not load remote resources from HTML fragments.

The report renderer should escape HTML when producing debug pages.

The report renderer should avoid script injection in generated reports.

The generated SQLite index should contain only public source data.

The generated artifacts should not contain runtime student state.
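
The slug and output-directory rules can be enforced with an allow-list pattern plus a containment check. This is a conservative sketch: the `SafePath` name and the exact allowed character set are assumptions, and a stricter or looser pattern may suit real catalog slugs better.

```go
package main

import (
	"fmt"
	"path/filepath"
	"regexp"
	"strings"
)

// slugPattern allows lowercase letters, digits, dot, underscore, and hyphen,
// and requires the first character to be a letter or digit.
var slugPattern = regexp.MustCompile(`^[a-z0-9][a-z0-9._-]*$`)

// SafePath validates slug and returns a path that stays under root.
func SafePath(root, slug string) (string, error) {
	if !slugPattern.MatchString(slug) || strings.Contains(slug, "..") {
		return "", fmt.Errorf("rejected slug %q", slug)
	}
	full := filepath.Join(root, slug)
	// Defense in depth: confirm the joined path is still inside root.
	rel, err := filepath.Rel(root, full)
	if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", fmt.Errorf("slug %q escapes %q", slug, root)
	}
	return full, nil
}

func main() {
	for _, slug := range []string{"undergraduate-2026", "../etc/passwd", "a/b", ""} {
		p, err := SafePath("out/raw", slug)
		fmt.Printf("%q -> %q, err=%v\n", slug, p, err)
	}
}
```

Rejecting a bad slug outright, rather than rewriting it, keeps the mapping from source identifiers to file paths predictable.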

70. Performance Considerations

The scraper is offline, so correctness is more important than speed.

Fetch performance should still be reasonable.

The fetcher should stream or write responses without excessive memory use.

The parser should process item files incrementally when practical.

The validator should handle thousands of courses and programs comfortably.

The index builder should complete within minutes for undergraduate data.

The CLI should expose progress for long-running stages.

The pipeline should avoid quadratic scans where simple indexes suffice.

Lookup tables should be built for pids.

Lookup tables should be built for item ids.

Lookup tables should be built for course codes.

Lookup tables should be built for program codes.

Graph projection generation may be optimized later.

Performance claims should be backed by measurements before they drive optimization decisions.
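
The lookup tables named above replace repeated linear scans with one pass over the items. A minimal sketch, assuming a hypothetical `Item` shape with one code per item; it also surfaces duplicate pids instead of silently overwriting them.

```go
package main

import "fmt"

// Item is a hypothetical summary record.
type Item struct {
	PID    string
	ItemID string
	Code   string
}

// Lookup indexes items by each identifier.
type Lookup struct {
	ByPID    map[string]*Item
	ByItemID map[string]*Item
	ByCode   map[string]*Item
}

// BuildLookup indexes items in one pass. Items whose pid was already seen
// are skipped and their pids are returned in input order.
func BuildLookup(items []Item) (*Lookup, []string) {
	l := &Lookup{
		ByPID:    make(map[string]*Item, len(items)),
		ByItemID: make(map[string]*Item, len(items)),
		ByCode:   make(map[string]*Item, len(items)),
	}
	var dupPIDs []string
	for i := range items {
		it := &items[i]
		if _, seen := l.ByPID[it.PID]; seen {
			dupPIDs = append(dupPIDs, it.PID)
			continue
		}
		l.ByPID[it.PID] = it
		l.ByItemID[it.ItemID] = it
		l.ByCode[it.Code] = it
	}
	return l, dupPIDs
}

func main() {
	items := []Item{
		{PID: "p1", ItemID: "i1", Code: "MATH135"},
		{PID: "p2", ItemID: "i2", Code: "CS135"},
		{PID: "p1", ItemID: "i3", Code: "MATH135"},
	}
	l, dups := BuildLookup(items)
	fmt.Println(l.ByCode["CS135"].PID, dups)
}
```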

71. Build Reports

Every full run should produce a build report.

The build report should include catalog title.

The build report should include catalog id.

The build report should include catalog date range.

The build report should include source calendar URL.

The build report should include scrape timestamp.

The build report should include item counts.

The build report should include parse coverage.

The build report should include unparsed requirement counts.

The build report should include top unparsed fields.

The build report should include validation findings.

The build report should include patch summary.

The build report should include index artifact paths.

The build report should include source references.

The build report should include known limitations.

The build report should include reviewer checklist.

The build report should be Markdown.

The build report should have a matching JSON summary.

72. Reviewer Checklist

A reviewer should confirm catalog id.

A reviewer should confirm catalog title.

A reviewer should confirm catalog date range.

A reviewer should confirm expected item counts.

A reviewer should confirm that the failed request count is zero.

A reviewer should review parser coverage summary.

A reviewer should review unparsed requirement samples.

A reviewer should review patch applications.

A reviewer should review blocking validation findings.

A reviewer should review high-priority warnings.

A reviewer should spot-check a few course pages.

A reviewer should spot-check a few program pages.

A reviewer should spot-check cross-listings.

A reviewer should spot-check antirequisites.

A reviewer should spot-check a complex Faculty of Mathematics credential.

A reviewer should confirm index build metadata.

A reviewer should approve publication separately from build.

73. V1 Acceptance Criteria

V1 must discover the current undergraduate catalog from the Waterloo page.

V1 must support manual catalog id and subdomain override.

V1 must fetch catalog metadata.

V1 must fetch course schema.

V1 must fetch program schema.

V1 must enumerate all course summaries.

V1 must enumerate all program summaries.

V1 must fetch all course detail records.

V1 must fetch all program detail records.

V1 must store raw responses with manifest hashes.

V1 must parse course listings.

V1 must parse credential records from programs.

V1 must parse basic course prerequisite structures.

V1 must parse basic course antirequisite structures.

V1 must parse basic program course requirement structures.

V1 must preserve unparsed requirement fragments.

V1 must apply manual patches.

V1 must produce validation reports.

V1 must produce a SQLite index.

V1 must rebuild deterministically from the same raw snapshot and patches.

V1 must expose enough source references for debugging.

V1 must not require network during parse, validate, or build-index.

V1 must not integrate scraper execution into runtime user requests.

74. V1 Non-Acceptance

V1 does not need perfect parsing of every Waterloo requirement.

V1 does not need a full solver.

V1 does not need a frontend.

V1 does not need a production backend.

V1 does not need timetable offerings.

V1 does not need instructor data.

V1 does not need room data.

V1 does not need authentication.

V1 does not need archived calendars.

V1 does not need graduate calendars.

V1 does not need live automatic publishing.

V1 does not need to replace academic advising.

V1 does not need to consume UWFlow.

V1 does not need to parse every policy page.

V1 does not need to infer unstated equivalences.

75. Implementation Package Boundaries

internal/kuali should own endpoint clients.

internal/kuali should own request construction.

internal/kuali should own response decoding helpers.

internal/snapshot should own manifest writing.

internal/snapshot should own raw file path construction.

internal/snapshot should own hash calculation.

internal/parser should own JSON mapping.

internal/parser should own HTML fragment parsing.

internal/parser should own requirement expression recognition.

internal/parser should own unparsed requirement output.

internal/patches should own patch schema and application.

internal/validate should own validation checks.

internal/indexbuild should own SQLite output.

internal/report should own Markdown and JSON reports.

internal/model should own shared record types.

cmd/uwscrape should own CLI wiring.

Packages should avoid import cycles.

Packages should keep side effects narrow.

76. Go Library Choices

Use the standard net/http client for fetching.

Use context cancellation for command cancellation.

Use encoding/json for JSON.

Use a stable HTML parser such as golang.org/x/net/html.

Use a SQLite driver chosen by repository constraints.

Prefer pure Go SQLite only if it materially simplifies builds.

Use YAML only if patch readability is worth the dependency.

Use decimal or integer hundredths for unit values, not binary floating point.

Use deterministic sorting before writing records.

Use table-driven tests for parser recognition.

Use golden files sparingly.

Use golden files when they make parser regressions obvious.

Avoid adding a large framework before the first spike proves need.
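
The integer-hundredths choice for unit values can be illustrated with a small parser. `ParseUnits` and `FormatUnits` are hypothetical helpers, and the sketch assumes unit values are non-negative with at most two decimal places.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ParseUnits converts a decimal string such as "0.5" or "1.25" into integer
// hundredths (50, 125). It rejects signs, empty whole parts, and more than
// two decimal places.
func ParseUnits(s string) (int, error) {
	s = strings.TrimSpace(s)
	whole, frac, _ := strings.Cut(s, ".")
	if whole == "" || len(frac) > 2 {
		return 0, fmt.Errorf("invalid unit value %q", s)
	}
	frac += strings.Repeat("0", 2-len(frac)) // "5" becomes "50", "" becomes "00"
	w, err := strconv.ParseUint(whole, 10, 16)
	if err != nil {
		return 0, fmt.Errorf("invalid unit value %q: %w", s, err)
	}
	f, err := strconv.ParseUint(frac, 10, 16)
	if err != nil {
		return 0, fmt.Errorf("invalid unit value %q: %w", s, err)
	}
	return int(w)*100 + int(f), nil
}

// FormatUnits renders hundredths back to a fixed two-decimal string.
func FormatUnits(hundredths int) string {
	return fmt.Sprintf("%d.%02d", hundredths/100, hundredths%100)
}

func main() {
	total := 0
	for _, s := range []string{"0.5", "0.25", "1.25"} {
		h, err := ParseUnits(s)
		if err != nil {
			panic(err)
		}
		total += h
	}
	fmt.Println(FormatUnits(total))
}
```

Summing integers keeps unit totals exact, which matters when a requirement compares a total against a threshold.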

77. Live Endpoint Spike

Before full implementation, write a tiny fetch spike.

The spike should fetch the Waterloo calendar page.

The spike should extract Kuali subdomain.

The spike should extract catalog id.

The spike should fetch catalog metadata.

The spike should fetch course schema.

The spike should fetch program schema.

The spike should run one course search page.

The spike should fetch one course detail record.

The spike should run one program search page.

The spike should fetch one program detail record.

The spike should write temporary outputs.

The spike should confirm response shapes.

The spike should not become the final architecture by accident.

The spike should be deleted or folded into tested packages.

78. First Real Fixture Set

The first real fixture set should include AFM100 or a similarly simple course.

The first real fixture set should include MATH135 or a similar Math course.

The first real fixture set should include a course with nested prerequisite groups.

The first real fixture set should include a course with antirequisites.

The first real fixture set should include a course with cross-listing if available.

The first real fixture set should include CS Honours BMath.

The first real fixture set should include a minor.

The first real fixture set should include an option.

The first real fixture set should include a specialization.

The first real fixture set should include a program with subject range pools.

The first real fixture set should include a program with additional constraints.

The first real fixture set should be small enough for code review.

The first real fixture set should be stable enough for tests.

79. Parser Coverage Metrics

Coverage should be measured by source field.

Coverage should be measured by item type.

Coverage should be measured by faculty when available.

Coverage should be measured by requirement expression count.

Coverage should be measured by unparsed fragment count.

Coverage should be measured by source text character count.

Coverage should separate empty fields from unparsed fields.

Coverage should separate unsupported phrases from malformed source.

Coverage should separate parser warnings from validation findings.

Coverage should show top unsupported phrase patterns.

Coverage should show top fields by unparsed count.

Coverage should show top credentials by unparsed count.

Coverage should help choose the next parser improvements.

Coverage should not be used to hide important unknowns.
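
A per-field coverage summary that keeps empty fields apart from unparsed ones might look like the sketch below. The types `Observation` and `FieldCoverage` and the ratio definition are assumptions; breakdowns by item type, faculty, or character count would follow the same pattern with a different grouping key.

```go
package main

import (
	"fmt"
	"sort"
)

// FieldStatus classifies one source field occurrence on one item.
type FieldStatus int

const (
	Empty    FieldStatus = iota // field absent or blank in the source
	Parsed                      // fully recognized
	Unparsed                    // preserved as an unparsed fragment
)

// Observation is one (field, status) pair recorded by the parser.
type Observation struct {
	Field  string
	Status FieldStatus
}

// FieldCoverage holds counts for one source field.
type FieldCoverage struct {
	Field                   string
	Empty, Parsed, Unparsed int
}

// Ratio is parsed / (parsed + unparsed). Empty fields are excluded so they
// cannot inflate coverage. It returns 1 when there was nothing to parse.
func (c FieldCoverage) Ratio() float64 {
	total := c.Parsed + c.Unparsed
	if total == 0 {
		return 1
	}
	return float64(c.Parsed) / float64(total)
}

// Summarize aggregates by field, ordered by unparsed count descending and
// then by field name, so the worst fields come first and ties are stable.
func Summarize(obs []Observation) []FieldCoverage {
	byField := map[string]*FieldCoverage{}
	for _, o := range obs {
		c, ok := byField[o.Field]
		if !ok {
			c = &FieldCoverage{Field: o.Field}
			byField[o.Field] = c
		}
		switch o.Status {
		case Empty:
			c.Empty++
		case Parsed:
			c.Parsed++
		case Unparsed:
			c.Unparsed++
		}
	}
	out := make([]FieldCoverage, 0, len(byField))
	for _, c := range byField {
		out = append(out, *c)
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Unparsed != out[j].Unparsed {
			return out[i].Unparsed > out[j].Unparsed
		}
		return out[i].Field < out[j].Field
	})
	return out
}

func main() {
	obs := []Observation{
		{"prerequisites", Parsed}, {"prerequisites", Unparsed}, {"prerequisites", Empty},
		{"antirequisites", Parsed},
	}
	for _, c := range Summarize(obs) {
		fmt.Printf("%s parsed=%d unparsed=%d empty=%d ratio=%.2f\n",
			c.Field, c.Parsed, c.Unparsed, c.Empty, c.Ratio())
	}
}
```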

80. Publication Criteria

An index may be published only after successful validation.

An index may be published only after build report review.

An index may be published only after scraper source version is recorded.

An index may be published only after source catalog id is recorded.

An index may be published only after source counts are recorded.

The published index_id must identify the built runtime artifact, not only the upstream source snapshot.

The index_id must change when parser output, accepted patch applications, validation summary, build algorithm version, or other release-gated content that affects runtime academic/graph behavior changes.

The index_id must not depend on nondeterministic build timestamps.

An index may be published with warnings only when the release decision status is approved_with_warnings.

Accepted warnings should be tracked in release-decision.json and the build report.

Accepted warnings should not disappear silently.

release-decision.json is required for backend startup once release gates are enabled.

The release decision status must be approved, approved_with_warnings, or rejected.

Publication should be atomic.

Publication should preserve the previous index for rollback.

Publication should not alter student state.

Publication should not require the scraper in production.

Publication mechanics belong in a later deployment spec.
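
The index_id and release-status rules above can be read as two small functions. This is a sketch under assumptions: `IndexInputs`, its hash fields, the `idx_` prefix, and `CanPublish` are hypothetical, and the schema of release-decision.json is not defined here.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// IndexInputs lists content that affects runtime behavior.
// Build timestamps are deliberately absent.
type IndexInputs struct {
	ParsedOutputHash      string
	PatchApplicationsHash string
	ValidationSummaryHash string
	BuildAlgorithmVersion string
}

// IndexID derives an identifier that changes when any input changes.
// Parts are length-prefixed to keep field boundaries unambiguous.
func IndexID(in IndexInputs) string {
	h := sha256.New()
	for _, part := range []string{
		in.ParsedOutputHash,
		in.PatchApplicationsHash,
		in.ValidationSummaryHash,
		in.BuildAlgorithmVersion,
	} {
		fmt.Fprintf(h, "%d:%s;", len(part), part)
	}
	return "idx_" + hex.EncodeToString(h.Sum(nil)[:12])
}

// CanPublish applies the status rules: warnings are publishable only under
// approved_with_warnings, and rejected is never publishable.
func CanPublish(status string, warningCount int) (bool, error) {
	switch status {
	case "approved":
		return warningCount == 0, nil
	case "approved_with_warnings":
		return true, nil
	case "rejected":
		return false, nil
	default:
		return false, fmt.Errorf("unknown release decision status %q", status)
	}
}

func main() {
	in := IndexInputs{"h-parsed", "h-patches", "h-validation", "build-v1"}
	fmt.Println(IndexID(in))

	ok, err := CanPublish("approved", 3)
	fmt.Println(ok, err)
}
```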

81. Risks

Kuali endpoint shape may change.

Kuali search behavior may change.

Kuali item detail fields may change.

Waterloo catalog settings may change.

Rendered HTML fragments may change.

Rule text wording may change.

Course and program pids may change across catalog versions.

Program fields may include complex policy language.

Some requirements may remain unparsed for a long time.

Manual patches may become stale.

Validation thresholds may be too lax early on.

The team may be tempted to overfit the first Faculty of Mathematics examples.

The frontend may request graph shapes before parser semantics are stable.

The architecture must keep raw evidence and uncertainty visible.

82. Mitigations

Endpoint change risk is mitigated by offline snapshots.

Endpoint change risk is mitigated by discovery tests.

Parser change risk is mitigated by raw evidence preservation.

Parser change risk is mitigated by fixture tests.

Manual patch risk is mitigated by patch reports.

Manual patch risk is mitigated by target matching.

Manual patch risk is mitigated by stale patch warnings.

Unparsed requirement risk is mitigated by explicit records.

Unparsed requirement risk is mitigated by validation summaries.

Frontend overfit risk is mitigated by backend-owned semantics.

Graph overfit risk is mitigated by projection documentation.

Publication risk is mitigated by atomic index replacement.

User trust risk is mitigated by source references and uncertainty display.

83. Open Questions

Should raw snapshots be committed to git?

Should raw snapshots be archived outside git?

Should parsed JSONL be committed?

Should SQLite index artifacts be committed?

Which SQLite driver should the Go backend use?

Should patches be YAML or JSON?

What unparsed requirement threshold is acceptable for first publication?

Which Faculty of Mathematics credentials are required for first review?

Should policy pages be imported in v1 or v1.1?

Should specializations be separate records or program records in v1?

Should graph projection JSON be produced in v1?

Should live endpoint tests run manually or in CI?

Should publication be local-only before any hosted deployment?

These questions should be answered before production indexing.

They do not block the first scraper implementation spike.

84. References

Waterloo Kuali CM documentation: https://uwaterloo.ca/academic-calendar-curriculum-management-resources/kuali-cm

Waterloo academic calendar linking guidelines: https://uwaterloo.ca/academic-calendar-curriculum-management-resources/academic-calendar

Kuali Catalog API caveat: https://kuali-ccm.zendesk.com/hc/en-us/articles/37486826047003

Kuali JSON formatting for gadgets: https://kuali-ccm.zendesk.com/hc/en-us/articles/34949201417627

Kuali displaying content from Curriculum Management: https://kuali.zendesk.com/hc/en-us/articles/42733273922203

Architecture baseline: docs/reference/architecture/offline-index-runtime-architecture/.

Scraper ADR: docs/decisions/0001-offline-kuali-scraper-index-pipeline/.

Mathematical model paper: docs/reference/academic/academic-course-universe-model/.

85. Next Step

The next project step is to approve this specification or revise it.

After approval, implementation should begin with the live endpoint spike.

After the spike, implementation should define the Go package skeleton.

After the skeleton, implementation should add snapshot manifest writing.

After manifest writing, implementation should add course and program enumeration.

After enumeration, implementation should add raw detail fetching.

After raw fetching, implementation should add the first parser fixtures.

After parser fixtures, implementation should add requirement expression parsing.

After requirement parsing, implementation should add validation reports.

After validation reports, implementation should add SQLite index output.

The scraper should earn trust before the visualization depends on it.