# P10 Deploy Runbook — Fly.io single-machine
Locked architecture (CONTEXT D3, D7, D10):

- One Fly machine (`shared-cpu-1x`, 1 GB, region `yyz`).
- Single Go binary, distroless image, built with `-tags embed_frontend` so the SvelteKit static bundle ships inside the binary.
- One persistent volume mounted at `/data` (index + state DB + snapshots).
- `VACUUM INTO` snapshots every hour, retain 168 (one week); plus Fly daily volume snapshots — no Litestream, no remote replication. (See the sketch after this list.)
- Restart on index publish (zero-write semantics for the index).
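The hourly snapshot is produced by the server itself via SQLite's `VACUUM INTO`. As an illustration of the mechanism only (not the actual Go code path), the equivalent operation from the `sqlite3` CLI looks like this; the paths come from the recovery section below, and the timestamp format is an assumption:

```sh
# Illustration only: the server runs the equivalent of this internally.
# VACUUM INTO writes a compacted, transactionally consistent copy of the
# live state DB to a new file without taking the DB offline.
sqlite3 /data/state.sqlite \
  "VACUUM INTO '/data/snapshots/state-$(date -u +%Y%m%dT%H%M%SZ).db'"
```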
## First-time deploy
The binary fails fast if `/data/index/` is empty (`catalogstore` opens
required files at startup), so the index must land on the volume
before the app's image is deployed. `fly ssh sftp shell` needs a live
machine, so we use a one-shot bootstrap machine running `busybox sleep`
to hold the volume long enough for SFTP. Order matters:
```sh
fly apps create uwscrape
fly volumes create uwscrape_data --size 1 --region yyz
fly volumes update <volume-id> --snapshot-retention 14  # daily snapshots for 14d

# Token verifier key (writes to /etc/uwscrape/token-key inside the machine).
fly secrets set UWSCRAPE_TOKEN_KEY="$(openssl rand -hex 32)"

# Internal metrics bearer (optional but recommended).
# Without this, /internal/metrics returns 404 (route not registered).
fly secrets set UWSCRAPE_INTERNAL_TOKEN="$(openssl rand -hex 32)"

# Bootstrap machine: holds the volume + ssh open so SFTP works.
fly machine run busybox --command "sleep 3600" \
  --volume uwscrape_data:/data --region yyz

fly ssh sftp shell
> mkdir /data/index
> put .dev/published/math-engineering-support/* /data/index/

# Destroy the bootstrap machine before fly deploy so the app machine
# can take the volume.
fly machine destroy <bootstrap-machine-id> --force

fly deploy
```

Health check at `/api/v1/health`. Warmup typically completes in <5 s;
Fly waits up to 30 s (grace period in `fly.toml`) before sending traffic.
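To confirm warmup by hand after the first deploy, probe the same endpoint Fly's check uses:

```sh
# -f makes curl exit non-zero on a non-2xx response, so this doubles
# as a scriptable readiness probe.
curl -fsS https://uwscrape.fly.dev/api/v1/health
```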
## Distribution mechanism (locked)
The index artefact is not committed to git and not baked into
the Docker image. The chosen distribution path is `fly ssh sftp` upload to
the persistent `/data` volume, then `fly deploy` for a rolling restart.

Why this path (vs Git LFS, S3, GitHub Releases): the index is ~tens of MB
and changes only on catalog republish, so a per-deploy SFTP upload is
simpler than wiring an object-store fetch into the binary, and the binary
stays stateless. The trade-off is a manual operator step; a future P10.+
task may move this to a release-pipeline action.
## Required index files
The server reads these files from `UWSCRAPE_INDEX_DIR` at startup
(see `internal/catalogstore/sqlite.go` `requiredPublishedIndexFiles`):

- `build-metadata.json`
- `validation-summary.json`
- `validation-report.json`
- `validation-report.md`
- `release-decision.json`
- `release-decision.md`
- `build-report.md`
- `course-universe.sqlite`
- `graph-projection.json`
Uploading the full artefact directory (e.g. `put * /data/index/`) is
the recommended approach — extra files such as `parse-manifest.json`
and `patch_applications.jsonl` are harmless and useful for audit
trails. Any missing required file causes startup to fail closed.
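A pre-upload sanity check can catch a truncated artefact before it reaches the volume. This is a hypothetical helper, not part of the repo; the file list mirrors `requiredPublishedIndexFiles` above:

```sh
#!/usr/bin/env bash
# Hypothetical pre-flight check: verify all required index files exist
# locally before uploading, so the deploy cannot fail-closed on startup.
set -euo pipefail
dir="${1:?usage: preflight.sh <artefact-dir>}"
for f in build-metadata.json validation-summary.json validation-report.json \
         validation-report.md release-decision.json release-decision.md \
         build-report.md course-universe.sqlite graph-projection.json; do
  [ -f "$dir/$f" ] || { echo "missing required file: $f" >&2; exit 1; }
done
echo "all required index files present in $dir"
```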
## Publishing a new index
```sh
# 1. Build the candidate and publish it locally (validates + writes
#    .dev/published/<slug>/).
make publish-math-engineering-support-index

# 2. Upload to the Fly volume.
fly ssh sftp shell
> put .dev/published/<slug>/* /data/index/

# 3. Roll the machine so the new index is loaded.
fly deploy

# 4. Verify the loaded artefact matches what was uploaded.
curl -s https://uwscrape.fly.dev/api/v1/index | jq '{
  index_id,
  release_status,
  course_count,
  credential_count,
  graph_projection_course_count
}'
```

The `index_id` returned by `/api/v1/index` must match the
`build-metadata.json` `index_id` of the uploaded artefact. If it
matches the previous artefact, the rolling restart has not yet
swapped machines — wait for `/api/v1/health` to return 200 and recheck.
The rolling deploy swaps the running machine; new connections hit the
new machine once `/api/v1/health` returns 200 (warmup complete).
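Step 4's match can be scripted. A minimal sketch, assuming the artefact directory from step 1 is still on disk and `<slug>` is filled in as above:

```sh
# Hypothetical check: compare the served index_id against the artefact's.
expected=$(jq -r '.index_id' ".dev/published/<slug>/build-metadata.json")
actual=$(curl -s https://uwscrape.fly.dev/api/v1/index | jq -r '.index_id')
if [ "$expected" = "$actual" ]; then
  echo "index swap confirmed: $actual"
else
  echo "still serving old index (expected $expected, got $actual)" >&2
  exit 1
fi
```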
## Volume recovery (catastrophic)
If the primary volume is lost, restore from a Fly daily snapshot:
```sh
fly volumes snapshots list <old-volume-id>
fly volumes create uwscrape_data --snapshot-id <snapshot-id> --size 1 --region yyz
fly deploy
```

Loss window: up to 24 h (the last Fly daily snapshot).
If the volume is intact but the state DB is corrupted, restore from the
most recent `VACUUM INTO` snapshot inside the volume:
```sh
fly ssh console
$ cp /data/snapshots/state-<timestamp>.db /data/state.sqlite
$ exit
fly machine restart <machine-id>
```

Loss window: up to 1 h (the last hourly snapshot).
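Choosing `<timestamp>` by eye is error-prone. Inside `fly ssh console`, this illustrative one-liner (relying on `ls -t`'s newest-first ordering) copies the most recent snapshot into place:

```sh
# Inside the machine: restore from the most recently written snapshot.
cp "$(ls -t /data/snapshots/state-*.db | head -n 1)" /data/state.sqlite
```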
## Snapshot offload (operator chore)
The hourly `VACUUM INTO` snapshots live on the same volume. For additional
disaster-recovery margin, copy them off-site periodically:
```sh
fly ssh console
$ ls /data/snapshots/   # newest at the bottom
$ exit
fly ssh sftp shell
> get /data/snapshots/state-<timestamp>.db ./backups/
```

Document the cadence in your offsite-backup rotation; this is not automated.
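A small operator-run helper (still a manual chore, just fewer keystrokes) might look like the following; `fly ssh console -C` and `fly ssh sftp get` are real flyctl subcommands, but the snapshot naming and the `backups/` destination are assumptions:

```sh
# Hypothetical offload helper: pull the newest hourly snapshot locally.
latest=$(fly ssh console -C "ls -t /data/snapshots" | tr -d '\r' | head -n 1)
mkdir -p backups && cd backups
fly ssh sftp get "/data/snapshots/${latest}"
```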
## Concurrency / rate-limit knobs
All limits live in `internal/server/config.go` env defaults. Production
overrides go via `fly.toml` `[env]`; a deploy-time override sketch follows
the table.
| Setting | Default | Notes |
|---|---|---|
| `UWSCRAPE_RATE_LIMIT_CATALOG_PER_MINUTE` | 1200 | Global catalog reads |
| `UWSCRAPE_RATE_LIMIT_QUERY_PER_MINUTE` | 240 | Advisory/path search |
| `UWSCRAPE_RATE_LIMIT_STATE_PER_IP_PER_MINUTE` | 30 | Per-client mutations |
| `UWSCRAPE_STATE_SNAPSHOT_INTERVAL` | 1h | Set lower for ops testing |
| `UWSCRAPE_STATE_SNAPSHOT_RETENTION` | 168 | One week of hourly snapshots |
| `UWSCRAPE_INTERNAL_TOKEN` | empty | Bearer for `/internal/metrics`. Empty = route not registered. |
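For short-lived ops testing (e.g. exercising the snapshot path), a value can also ride along with a deploy instead of editing `fly.toml`. A hedged example using flyctl's `--env` deploy flag, with an illustrative 5-minute interval:

```sh
# Temporary override; persistent values belong in fly.toml [env].
fly deploy --env UWSCRAPE_STATE_SNAPSHOT_INTERVAL=5m
```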
## Internal metrics endpoint
When `UWSCRAPE_INTERNAL_TOKEN` is set, the process serves
`GET /internal/metrics` with a JSON snapshot of cumulative counters:
{ "timestamp_unix": 1747000000, "statestore_writes_total": 12345, "statestore_vacuum_failures_total": 0}Usage:
curl -s -H "Authorization: Bearer $UWSCRAPE_INTERNAL_TOKEN" \ https://uwscrape.fly.dev/internal/metricsADR 0027’s migration trigger (state_writes_per_second p99 > 50
sustained for 1 hour) is derived by sampling statestore_writes_total
at two points in time and dividing by elapsed seconds.
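A minimal sketch of that derivation, using only the endpoint and counter names above (the 60 s window is illustrative; the ADR trigger requires sustained sampling, not a single interval):

```sh
# Sample the counter twice and compute average writes/second between samples.
sample() {
  curl -s -H "Authorization: Bearer $UWSCRAPE_INTERNAL_TOKEN" \
    https://uwscrape.fly.dev/internal/metrics |
    jq -r '"\(.timestamp_unix) \(.statestore_writes_total)"'
}

read -r t0 w0 <<<"$(sample)"
sleep 60
read -r t1 w1 <<<"$(sample)"
echo "writes/sec over $((t1 - t0))s: $(( (w1 - w0) / (t1 - t0) ))"
```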