P10 Deploy Runbook — Fly.io single-machine

Locked architecture (CONTEXT D3, D7, D10):

  • One Fly machine (shared-cpu-1x, 1 GB, region yyz).
  • Single Go binary, distroless image, built with -tags embed_frontend so the SvelteKit static bundle ships inside the binary.
  • One persistent volume mounted at /data (index + state DB + snapshots).
  • VACUUM INTO snapshots every hour, retain 168 (one week); plus Fly daily volume snapshots — no Litestream, no remote replication.
  • Restart on index publish (zero-write semantics for the index).

First-time deploy

The binary fails fast if /data/index/ is empty (catalogstore opens required files at startup), so the index must land on the volume before the app image is deployed. fly sftp shell needs a live machine, so we use a one-shot bootstrap machine running busybox sleep to keep the volume attached long enough for SFTP. Order matters:

Terminal window
fly apps create uwscrape
fly volumes create uwscrape_data --size 1 --region yyz
fly volumes update <volume-id> --snapshot-retention 14 # daily snapshots for 14d
# Token verifier key (writes to /etc/uwscrape/token-key inside the machine).
fly secrets set UWSCRAPE_TOKEN_KEY="$(openssl rand -hex 32)"
# Internal metrics bearer (optional but recommended).
# Without this, /internal/metrics returns 404 (route not registered).
fly secrets set UWSCRAPE_INTERNAL_TOKEN="$(openssl rand -hex 32)"
# Bootstrap machine: holds the volume + ssh open so SFTP works.
fly machine run busybox --command "sleep 3600" \
--volume uwscrape_data:/data --region yyz
fly sftp shell
> mkdir /data/index
> put .dev/published/math-engineering-support/* /data/index/
# Destroy the bootstrap machine before fly deploy so the app machine
# can take the volume.
fly machine destroy <bootstrap-machine-id> --force
fly deploy

Health check at /api/v1/health. Warmup typically completes in <5 s; Fly waits up to 30 s (grace period in fly.toml) before sending traffic.
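The health gate can also be scripted for post-deploy verification. A minimal sketch; the helper name wait_healthy and the default 30 s timeout (mirroring the fly.toml grace period) are illustrative, not part of the repo:

```shell
# wait_healthy: poll a health URL until it returns HTTP 200 or the
# timeout (seconds, default 30) expires.
wait_healthy() {
  url="$1"
  deadline=$(( $(date +%s) + ${2:-30} ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    code="$(curl -s -o /dev/null -w '%{http_code}' "$url")"
    if [ "$code" = "200" ]; then
      echo "healthy"
      return 0
    fi
    sleep 1
  done
  echo "timeout"
  return 1
}

# Example: wait_healthy https://uwscrape.fly.dev/api/v1/health 30
```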

Distribution mechanism (locked)

The index artefact is not committed to git and not baked into the Docker image. The chosen distribution path is fly sftp upload to the persistent /data volume, then fly deploy for a rolling restart.

Why this path (vs Git LFS, S3, GitHub Releases): the index is ~tens of MB and changes only on catalog republish, so a per-deploy SFTP upload is simpler than wiring an object store fetch into the binary, and the binary stays stateless. The trade-off is a manual operator step; a future P10.+ task may move this to a release-pipeline action.

Required index files

The server reads these files from UWSCRAPE_INDEX_DIR at startup (see internal/catalogstore/sqlite.go requiredPublishedIndexFiles):

  • build-metadata.json
  • validation-summary.json
  • validation-report.json
  • validation-report.md
  • release-decision.json
  • release-decision.md
  • build-report.md
  • course-universe.sqlite
  • graph-projection.json

Uploading the full artefact directory (e.g. put * /data/index/) is the recommended approach; extra files such as parse-manifest.json and patch_applications.jsonl are harmless and useful for audit trails. If any required file is missing, startup fails closed.
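The required-file list above can be checked locally before uploading. A hedged sketch; the variable and helper names are illustrative, not part of the repo:

```shell
# Required files per internal/catalogstore/sqlite.go (see list above).
REQUIRED_INDEX_FILES="build-metadata.json validation-summary.json \
validation-report.json validation-report.md release-decision.json \
release-decision.md build-report.md course-universe.sqlite \
graph-projection.json"

# check_index_dir: print each missing required file; non-zero exit if any.
check_index_dir() {
  dir="$1"
  missing=0
  for f in $REQUIRED_INDEX_FILES; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $f"
      missing=1
    fi
  done
  return "$missing"
}

# Example: check_index_dir .dev/published/math-engineering-support
```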

Publishing a new index

Terminal window
# 1. Build the candidate and publish it locally (validates + writes
# .dev/published/<slug>/).
make publish-math-engineering-support-index
# 2. Upload to the Fly volume.
fly sftp shell
> put .dev/published/<slug>/* /data/index/
# 3. Roll the machine so the new index is loaded.
fly deploy
# 4. Verify the loaded artefact matches what was uploaded.
curl -s https://uwscrape.fly.dev/api/v1/index | jq '{
index_id,
release_status,
course_count,
credential_count,
graph_projection_course_count
}'

The index_id returned by /api/v1/index must match the build-metadata.json index_id of the uploaded artefact. If it matches the previous artefact, the rolling restart has not yet swapped machines — wait for /api/v1/health to return 200 and recheck.

The rolling deploy swaps the running machine; new connections hit the new machine once /api/v1/health returns 200 (warmup complete).
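Step 4's match check can be scripted. A sketch using a naive grep/sed extraction so it works without jq (jq is fine too, as above); index_id_of is a hypothetical helper:

```shell
# index_id_of: pull the index_id value out of a JSON file
# (build-metadata.json, or a saved /api/v1/index response).
index_id_of() {
  grep -o '"index_id"[[:space:]]*:[[:space:]]*"[^"]*"' "$1" \
    | head -n 1 | sed 's/.*"\(.*\)"$/\1/'
}

# Example:
#   curl -s https://uwscrape.fly.dev/api/v1/index > /tmp/live.json
#   [ "$(index_id_of /tmp/live.json)" = \
#     "$(index_id_of .dev/published/<slug>/build-metadata.json)" ] \
#     && echo "index swap confirmed"
```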

Volume recovery (catastrophic)

If the primary volume is lost, restore from a Fly daily snapshot:

Terminal window
fly volumes snapshots list <old-volume-id>
fly volumes create uwscrape_data --snapshot-id <snapshot-id> --size 1 --region yyz
fly deploy

Loss window: up to 24 h (the last Fly daily snapshot).

If the volume is intact but state DB is corrupted, restore from the most recent VACUUM INTO snapshot inside the volume:

Terminal window
fly ssh console
$ cp /data/snapshots/state-<timestamp>.db /data/state.sqlite
$ exit
fly machine restart <machine-id>

Loss window: up to 1 h (the last hourly snapshot).
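Picking the restore source can be scripted. This sketch assumes the snapshot <timestamp> is sortable (e.g. ISO-like), so lexicographic filename order matches chronological order; the helper name is illustrative:

```shell
# newest_snapshot: print the most recent VACUUM INTO snapshot in a
# directory, relying on lexicographic filename order.
newest_snapshot() {
  ls "$1"/state-*.db 2>/dev/null | sort | tail -n 1
}

# Example (inside fly ssh console):
#   cp "$(newest_snapshot /data/snapshots)" /data/state.sqlite
```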

Snapshot offload (operator chore)

The hourly VACUUM INTO snapshots live on the same volume. For additional disaster-recovery margin, copy them off-site periodically:

Terminal window
fly ssh console
$ ls /data/snapshots/ # newest at the bottom
$ exit
fly sftp shell
> get /data/snapshots/state-<timestamp>.db ./backups/

Document the cadence in your offsite-backup rotation; this is not automated.

Concurrency / rate-limit knobs

All limits live in internal/server/config.go env defaults. Production overrides go via fly.toml [env]:

Setting                                       Default   Notes
UWSCRAPE_RATE_LIMIT_CATALOG_PER_MINUTE        1200      Global catalog reads
UWSCRAPE_RATE_LIMIT_QUERY_PER_MINUTE          240       Advisory/path search
UWSCRAPE_RATE_LIMIT_STATE_PER_IP_PER_MINUTE   30        Per-client mutations
UWSCRAPE_STATE_SNAPSHOT_INTERVAL              1h        Set lower for ops testing
UWSCRAPE_STATE_SNAPSHOT_RETENTION             168       One week of hourly snapshots
UWSCRAPE_INTERNAL_TOKEN                       (empty)   Bearer for /internal/metrics; empty = route not registered

Internal metrics endpoint

When UWSCRAPE_INTERNAL_TOKEN is set, the process serves GET /internal/metrics with a JSON snapshot of cumulative counters:

{
  "timestamp_unix": 1747000000,
  "statestore_writes_total": 12345,
  "statestore_vacuum_failures_total": 0
}

Usage:

Terminal window
curl -s -H "Authorization: Bearer $UWSCRAPE_INTERNAL_TOKEN" \
https://uwscrape.fly.dev/internal/metrics

ADR 0027’s migration trigger (state_writes_per_second p99 > 50 sustained for 1 hour) is derived by sampling statestore_writes_total at two points in time and dividing the delta by the elapsed seconds.
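That derivation can be sketched as a small helper; the sampling itself is the curl call above, and the function name is illustrative:

```shell
# writes_per_second: given two samples of statestore_writes_total and the
# seconds elapsed between them, compute the average write rate.
writes_per_second() {
  # $1 = earlier sample, $2 = later sample, $3 = elapsed seconds
  awk -v a="$1" -v b="$2" -v t="$3" 'BEGIN { printf "%.2f\n", (b - a) / t }'
}

# Example: two samples taken an hour apart, right at the ADR 0027 threshold.
#   writes_per_second 12345 192345 3600   # prints 50.00
```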