Stageflow is the scanning platform behind this portfolio. It ingests URL lists or uploaded ZIP builds, orchestrates per-job Podman pods, runs scanners, and publishes an aggregated report with evidence (screenshots, raw JSON, HTML reports). Job status streams live to the UI via SSE.
At a glance (from the repo)
| Metric | What it looks like |
|---|---|
| Scale | ~62K LOC (active code, excluding archive/) |
| Isolation | Per-job Podman pods + per-job workspaces |
| Messaging | NATS JetStream streams (jobs, extraction, scan) with durable consumers |
| Status | SSE (server-sent events) from Platform API → Gateway → Browser |
| Artifacts | MinIO buckets (scanner-staging, scanner-artifacts) via presigned URLs |
Measured numbers (2026-01-16)
- `cloc` (excluding `archive`, `node_modules`, `dist`, `build`, `generated`, `.git`): 61,681 LOC across 566 files.
- Go workspace modules (`go.work`): 10.
- Built-in scanner manifests: 6.
- Go test files (excluding `archive/`): 107.
- Frontend test files (`portfolio/frontend/app/**/*.test.*`): 5.
Problem
Website audits are fragmented and brittle. A “full audit” often means stitching together mismatched tools, then trying to normalize the output into something you can actually act on.
I wanted one pipeline that can scan arbitrary URLs or uploaded builds safely, survive crashes, and produce a single report format with evidence (screenshots, traces, raw JSON) I can trust.
Constraints
- Single-host deployment target (one VPS).
- Untrusted inputs (ZIP uploads and arbitrary URLs) must be isolated per job.
- Crash recovery: partial work should not require restarting a job from scratch.
- Report viewing should not require authentication (public share links).
- Support both ZIP jobs (artifact scans) and URL jobs (live scans).
Solution
Stageflow is event-driven:
- Every job transition is an explicit event on NATS JetStream.
- Consumers use explicit ack, at-least-once delivery, and durable names.
- Workers run inside per-job pods; the orchestrator treats container exit codes as a last-resort failure signal.
Status streams to the UI via SSE, and scan artifacts are stored in MinIO and served via presigned URLs so “view report” does not require authentication.
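To make the streaming path concrete, here is a minimal sketch of an SSE relay in Gin (the gateway's framework). The `JobStatus` shape, the `subscribe` hook, and the route wiring are illustrative assumptions, not the actual Stageflow handler:

```go
package main

import (
	"io"

	"github.com/gin-gonic/gin"
)

// JobStatus is a hypothetical event shape; the real contract lives in
// the shared Go packages.
type JobStatus struct {
	JobID string `json:"job_id"`
	State string `json:"state"`
}

// streamJob relays status updates for one job as SSE frames.
// subscribe is an assumed hook yielding a per-job status channel.
func streamJob(subscribe func(jobID string) <-chan JobStatus) gin.HandlerFunc {
	return func(c *gin.Context) {
		updates := subscribe(c.Param("id"))
		c.Header("Cache-Control", "no-cache")
		c.Stream(func(w io.Writer) bool {
			select {
			case st, ok := <-updates:
				if !ok {
					return false // feed closed: job reached a terminal state
				}
				c.SSEvent("status", st) // one SSE frame per transition
				return true             // keep the connection open
			case <-c.Request.Context().Done():
				return false // client went away
			}
		})
	}
}

func main() {
	r := gin.Default()
	// Stub subscriber so the sketch compiles; the gateway would wire
	// this to the Platform API's status feed.
	r.GET("/jobs/:id/stream", streamJob(func(string) <-chan JobStatus {
		return make(chan JobStatus)
	}))
	_ = r.Run(":8080")
}
```

Because SSE is one-way, reconnects stay on the browser side (`EventSource` retries automatically), which is the "simpler reconnect semantics" claim in the key decisions below.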
Architecture
```mermaid
flowchart TB
U[User / Browser] -->|HTTPS| Caddy["Caddy<br/>TLS + routing"]
Caddy --> FE["Portfolio Frontend<br/>React Router"]
Caddy --> GW["Portfolio Gateway<br/>Go/Gin"]
FE <-->|SSE| GW
GW --> API["Platform API<br/>Go"]
API --> DB[(platform_api_status.db<br/>SQLite WAL)]
API <--> NATS[(NATS JetStream)]
NATS <--> ORCH["Orchestrator<br/>Go FSM"]
ORCH --> POD["Per-job Podman Pod"]
POD --> EX["Extractor (Go)"]
POD --> RUN["Scanner Runner (TS/Playwright)"]
EX --> STORE[(MinIO)]
RUN --> STORE
```

```mermaid
sequenceDiagram
actor User
participant UI as Frontend
participant GW as Gateway
participant JS as JetStream
participant OR as Orchestrator
participant POD as Job Pod
participant S3 as MinIO
User->>UI: Start scan
UI->>GW: POST /api/v1/jobs/*
UI-->>GW: Subscribe SSE /jobs/:id/stream
GW->>JS: publish jobs.events.created
JS-->>OR: deliver jobs.events.created (durable)
OR->>POD: spawn pod + workers
POD-->>JS: publish extraction/scan events
POD->>S3: upload artifacts
OR-->>JS: publish jobs.events.completed (or failed)
GW-->>UI: SSE status + progress
```

```mermaid
flowchart TB
User[Public Internet] -->|HTTPS| Caddy["Caddy (host)"]
subgraph VPS["Single VPS"]
Caddy --> Quad["systemd + Quadlets"]
Quad --> FE["portfolio-frontend"]
Quad --> GW["portfolio-gateway"]
Quad --> API["platform-api"]
Quad --> ORCH["orchestrator"]
Quad --> NATS[(NATS JetStream)]
Quad --> S3[(MinIO)]
API --> DB[(SQLite WAL)]
ORCH --> JOBS["ephemeral job pods"]
end
```
Evidence (placeholders)
- Screenshot (TODO): `case-studies/stageflow/playground-create-job.png`
  - Capture: `/playground` after selecting modules and entering URLs (the “create job” moment).
  - Alt text: “Stageflow scan setup form with selected scanner modules and URL inputs.”
  - Why it matters: supports the claim that users can configure and submit multi-scanner jobs.
- Screenshot (TODO): `case-studies/stageflow/job-stream.png`
  - Capture: `/scan/<job_id>` while the job is transitioning states (live updates visible).
  - Alt text: “Scan job status view showing state transitions and live progress updates.”
  - Why it matters: demonstrates SSE-driven status and the job state machine UX.
- Screenshot (TODO): `case-studies/stageflow/report-overview.png`
  - Capture: `/scan/<job_id>/report` for a completed scan (top of report, summary counts).
  - Alt text: “Aggregated scan report summary with counts grouped by scanner.”
  - Why it matters: supports the claim that results are normalized into a single report surface.
- Screenshot (TODO): `case-studies/stageflow/report-artifacts.png`
  - Capture: the report section where per-scanner artifacts (HTML/JSON/screenshots) are linked or listed.
  - Alt text: “Report artifacts list showing per-scanner outputs and downloadable evidence.”
  - Why it matters: proves artifacts are first-class and job-scoped.
The contract: events + state machine
Stageflow is strict about what the bus carries. The “architecture truth” is documented and implemented in `docs/ARCHITECTURE.md` and `packages/shared-go/*`:
JetStream streams and subjects
- `jobs`: `jobs.events.created`, `jobs.events.completed`, `jobs.events.failed`
- `extraction`: `extraction.events.ready`, `extraction.events.failed`
- `scan`: `scan.events.page.completed`, `scan.events.completed`, `scan.events.failed`
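As a sketch, declaring these streams with the `nats.go` JetStream API might look like the following; retention and limits are omitted, and the configuration shown is an assumption, not the repo's actual setup:

```go
package main

import (
	"context"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		panic(err)
	}
	defer nc.Drain()

	js, err := jetstream.New(nc)
	if err != nil {
		panic(err)
	}

	ctx := context.Background()
	// One stream per domain, each owning its subject hierarchy.
	for name, subjects := range map[string][]string{
		"jobs":       {"jobs.events.>"},
		"extraction": {"extraction.events.>"},
		"scan":       {"scan.events.>"},
	} {
		if _, err := js.CreateStream(ctx, jetstream.StreamConfig{
			Name:     name,
			Subjects: subjects,
		}); err != nil {
			panic(err)
		}
	}
}
```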
Durable consumer names (real values)
- Orchestrator:
  - `jobs.events.created` → `orchestrator-job-created`
  - `extraction.events.ready` → `orchestrator-extraction-ready`
  - `extraction.events.failed` → `orchestrator-extraction-failed`
  - `scan.events.completed` → `orchestrator-scan-completed`
  - `scan.events.failed` → `orchestrator-scan-failed`
- Platform API status projection:
  - `jobs.events.created` → `platform-api-job-created`
  - `extraction.events.ready` → `platform-api-extraction-ready`
  - `extraction.events.failed` → `platform-api-extraction-failed`
  - `scan.events.page.completed` → `platform-api-scan-page`
  - `scan.events.completed` → `platform-api-scan-completed`
  - `scan.events.failed` → `platform-api-scan-failed`
  - `jobs.events.completed` → `platform-api-job-completed`
  - `jobs.events.failed` → `platform-api-job-failed`
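A minimal sketch of one such consumer (the orchestrator's `jobs.events.created` subscription), assuming the `jobs` stream from the previous sketch; `handleJobCreated` is a hypothetical stand-in for the real handler:

```go
package main

import (
	"context"
	"log"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

// handleJobCreated is a hypothetical stand-in for the orchestrator's
// real event handler.
func handleJobCreated(data []byte) error {
	log.Printf("job created: %s", data)
	return nil
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := jetstream.New(nc)
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()
	cons, err := js.CreateOrUpdateConsumer(ctx, "jobs", jetstream.ConsumerConfig{
		Durable:       "orchestrator-job-created", // survives restarts
		FilterSubject: "jobs.events.created",
		AckPolicy:     jetstream.AckExplicitPolicy, // redeliver until acked
	})
	if err != nil {
		log.Fatal(err)
	}

	cc, err := cons.Consume(func(msg jetstream.Msg) {
		// At-least-once: a crash between processing and Ack() means
		// redelivery, so the handler must be idempotent.
		if err := handleJobCreated(msg.Data()); err != nil {
			_ = msg.Nak() // ask JetStream to redeliver
			return
		}
		_ = msg.Ack() // explicit ack only after successful processing
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cc.Stop()

	select {} // block forever (sketch only)
}
```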
Job lifecycle (actual allowed transitions)
The orchestrator FSM enforces:
- `PENDING → EXTRACTING | READY_TO_SCAN | FAILED`
- `EXTRACTING → READY_TO_SCAN | FAILED`
- `READY_TO_SCAN → SCANNING | FAILED`
- `SCANNING → COMPLETING | FAILED`
- `COMPLETING → DONE | FAILED`
There are two “tracks”:
- ZIP jobs: extraction + scan.
- URL jobs: skip extraction, go straight to scanning.
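A compact way to encode these rules is a transition table keyed by state; this is a sketch of the idea, not the orchestrator's actual code:

```go
package main

import "fmt"

type State string

const (
	Pending     State = "PENDING"
	Extracting  State = "EXTRACTING"
	ReadyToScan State = "READY_TO_SCAN"
	Scanning    State = "SCANNING"
	Completing  State = "COMPLETING"
	Done        State = "DONE"
	Failed      State = "FAILED"
)

// transitions mirrors the list above: every non-terminal state may fail.
var transitions = map[State][]State{
	Pending:     {Extracting, ReadyToScan, Failed}, // URL jobs skip extraction
	Extracting:  {ReadyToScan, Failed},
	ReadyToScan: {Scanning, Failed},
	Scanning:    {Completing, Failed},
	Completing:  {Done, Failed},
}

func canTransition(from, to State) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(Pending, ReadyToScan)) // true: URL-job fast path
	fmt.Println(canTransition(Scanning, Done))       // false: must pass COMPLETING
}
```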
Scanner modules (plugin-style)
Scanner metadata lives in `packages/shared-go/scannercatalog/manifests/*/manifest.json` and is embedded at build time (see the sketch after this list). The platform ships six built-ins:
- Axe (`axe`): WCAG accessibility via axe-core
- Lighthouse (`lighthouse`): performance + SEO
- Security headers (`security-headers`): CSP/HSTS/etc. checks
- SEO (`seo`): meta + structured data checks
- Link checker (`link-checker`): broken links and redirect chains
- AI navigator (`ai-navigator`): goal-driven exploratory scans (optional OpenRouter)
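The embedding itself can be done with Go's `embed` package. This sketch assumes the directory layout above; the `Manifest` fields are illustrative, not the repo's real types:

```go
package scannercatalog

import (
	"embed"
	"encoding/json"
	"io/fs"
	"path"
)

// All manifests are compiled into the binary at build time.
//go:embed manifests/*/manifest.json
var manifestFS embed.FS

// Manifest is a hypothetical subset of what a scanner declares.
type Manifest struct {
	ID          string `json:"id"`
	Name        string `json:"name"`
	Description string `json:"description"`
}

// Load returns every built-in manifest embedded in the binary.
func Load() ([]Manifest, error) {
	dirs, err := fs.ReadDir(manifestFS, "manifests")
	if err != nil {
		return nil, err
	}
	var out []Manifest
	for _, d := range dirs {
		if !d.IsDir() {
			continue
		}
		raw, err := manifestFS.ReadFile(path.Join("manifests", d.Name(), "manifest.json"))
		if err != nil {
			return nil, err
		}
		var m Manifest
		if err := json.Unmarshal(raw, &m); err != nil {
			return nil, err
		}
		out = append(out, m)
	}
	return out, nil
}
```

Embedding means a scanner cannot exist in the catalog without shipping its manifest, and there is no runtime filesystem dependency to drift.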
Tech stack
| Area | Choices |
|---|---|
| Backend services | Go (API, orchestrator, gateway), shared workspace modules |
| Execution | Podman pods (rootless), per-job workspaces |
| Messaging | NATS JetStream durable streams |
| State | SQLite + WAL projections |
| Storage | MinIO artifacts + presigned URLs |
| Scanning | Playwright automation, axe-core, Lighthouse |
Deep dive: artifacts are first-class, not an afterthought
Stageflow uses two MinIO buckets:
- `scanner-staging`: inbound ZIP uploads.
- `scanner-artifacts`: everything you need to view or debug a scan.
Artifact paths are deterministic and job-scoped:
- ZIP upload: `scanner-staging/staging/<job_id>/<filename>.zip`
- Provenance: `scanner-artifacts/<job_id>/provenance.json`
- Per-scanner results: `scanner-artifacts/<job_id>/<scanner_id>/results.json` and `report.html`
- Aggregated report: `scanner-artifacts/<job_id>/report.json` (contract version `2.0.0`)
This is why report viewing can be public: it’s presigned URLs, not privileged API reads.
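As a sketch with the MinIO Go SDK, minting a time-limited link for the aggregated report could look like this; the endpoint, credentials, and 15-minute expiry are placeholder assumptions:

```go
package main

import (
	"context"
	"fmt"
	"net/url"
	"time"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	// Placeholder endpoint and credentials.
	client, err := minio.New("minio.example.internal:9000", &minio.Options{
		Creds:  credentials.NewStaticV4("ACCESS_KEY", "SECRET_KEY", ""),
		Secure: false,
	})
	if err != nil {
		panic(err)
	}

	jobID := "job-123" // hypothetical job ID
	// Deterministic, job-scoped object key matching the layout above.
	object := fmt.Sprintf("%s/report.json", jobID)

	// The link itself carries the authorization; viewing it needs no API auth.
	u, err := client.PresignedGetObject(context.Background(),
		"scanner-artifacts", object, 15*time.Minute, url.Values{})
	if err != nil {
		panic(err)
	}
	fmt.Println(u.String())
}
```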
Key decisions
- JetStream for durability: durable consumers + explicit ack provide at-least-once delivery and replay.
- Podman per-job pods: job isolation maps cleanly to Podman’s “pod” model and stays rootless-by-default.
- SSE for status: one-way job status matches the UI needs and simplifies reconnect semantics.
- Artifacts via MinIO presigned URLs: report viewing stays “just links”, not privileged API reads.
- SQLite WAL projections: fast read-side status without a separate DB server.
Tradeoffs
- At-least-once delivery means consumers must tolerate duplicate events (see the sketch after this list).
- Per-job pods improve isolation but increase orchestration complexity and resource pressure at high concurrency.
- Public, presigned artifact links require careful bucket policy and lifecycle management.
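One way to get that duplicate tolerance on the read side is to make the SQLite projection a conditional upsert keyed by job ID and JetStream stream sequence, so replays and out-of-order redeliveries are no-ops. The schema, table names, and driver here are assumptions, not the platform-api's actual code:

```go
package main

import (
	"database/sql"

	_ "github.com/mattn/go-sqlite3" // assumed driver; any SQLite driver works
)

const schema = `
CREATE TABLE IF NOT EXISTS job_status (
    job_id TEXT PRIMARY KEY,
    state  TEXT NOT NULL,
    seq    INTEGER NOT NULL  -- stream sequence of the last applied event
);`

// applyEvent upserts a status row; duplicates and replays are harmless
// because the WHERE clause rejects events at or behind the stored sequence.
func applyEvent(db *sql.DB, jobID, state string, seq int64) error {
	_, err := db.Exec(`
        INSERT INTO job_status (job_id, state, seq) VALUES (?, ?, ?)
        ON CONFLICT(job_id) DO UPDATE
        SET state = excluded.state, seq = excluded.seq
        WHERE excluded.seq > job_status.seq`,
		jobID, state, seq)
	return err
}

func main() {
	db, err := sql.Open("sqlite3", "file:status.db?_journal_mode=WAL")
	if err != nil {
		panic(err)
	}
	defer db.Close()
	if _, err := db.Exec(schema); err != nil {
		panic(err)
	}
	_ = applyEvent(db, "job-123", "SCANNING", 42)
	_ = applyEvent(db, "job-123", "SCANNING", 42) // duplicate delivery: no-op
}
```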
Security and reliability
- Rootless Podman containers, per-job pods, and ephemeral workspaces reduce cross-job contamination (sketched after this list).
- JetStream durability provides crash recovery via message redelivery.
- Orchestrator watchdogs and exit-code detection prevent jobs from hanging indefinitely.
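The per-job isolation pattern is roughly "create pod, mount an ephemeral workspace, run the worker, tear everything down". This sketch drives the podman CLI from Go; the image name, mounts, and error handling are illustrative assumptions, and the real orchestrator is considerably more involved:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func runJob(jobID string) error {
	pod := fmt.Sprintf("job-%s", jobID)

	// Ephemeral per-job workspace, removed when the job ends.
	ws, err := os.MkdirTemp("", "stageflow-"+jobID+"-")
	if err != nil {
		return err
	}
	defer os.RemoveAll(ws)

	// Create the pod (rootless when podman itself runs rootless).
	if out, err := exec.Command("podman", "pod", "create", "--name", pod).CombinedOutput(); err != nil {
		return fmt.Errorf("pod create: %v: %s", err, out)
	}
	// Always tear the pod down, even on failure.
	defer exec.Command("podman", "pod", "rm", "-f", pod).Run()

	// Run a worker inside the pod with the workspace mounted.
	// "scanner-runner:latest" is a hypothetical image name.
	cmd := exec.Command("podman", "run", "--rm", "--pod", pod,
		"-v", ws+":/workspace", "scanner-runner:latest")
	if out, err := cmd.CombinedOutput(); err != nil {
		// Container exit codes surface here: the last-resort failure signal.
		return fmt.Errorf("worker failed: %v: %s", err, out)
	}
	return nil
}

func main() {
	if err := runJob("123"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```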
Testing and quality
- Go end-to-end tests live under `tests/e2e/` (ZIP scan flows).
- Frontend tests cover report utilities and markdown rendering for case studies.
Outcomes
- Production backend for the portfolio’s scanning experience.
- Concurrency with isolation: every job is its own pod/workspace.
- Durable, replayable job state via JetStream with explicit ack and durable consumers.
- Reports are actionable: aggregated JSON + per-scanner HTML + screenshots, all stored as artifacts.