
Stageflow

Event-driven web scanning platform with per-job isolation and real-time reporting.

Stageflow is the scanning platform behind this portfolio. It ingests URL lists or uploaded ZIP builds, orchestrates per-job Podman pods, runs scanners, and publishes an aggregated report with evidence (screenshots, raw JSON, HTML reports). Job status streams live to the UI via SSE.

At a glance (from the repo)

  • Scale: ~62K LOC (active code, excluding archive/)
  • Isolation: Per-job Podman pods + per-job workspaces
  • Messaging: NATS JetStream streams (jobs, extraction, scan) with durable consumers
  • Status: SSE (server-sent events) from Platform API → Gateway → Browser
  • Artifacts: MinIO buckets (scanner-staging, scanner-artifacts) via presigned URLs

Measured numbers (2026-01-16)

  • cloc (excluding archive,node_modules,dist,build,generated,.git): 61,681 LOC across 566 files.
  • Go workspace modules (go.work): 10.
  • Built-in scanner manifests: 6.
  • Go test files (excluding archive/): 107.
  • Frontend test files (portfolio/frontend/app/**/*.test.*): 5.

Problem

Website audits are fragmented and brittle. A “full audit” often means stitching together mismatched tools, then trying to normalize the output into something you can actually act on.

I wanted one pipeline that can scan arbitrary URLs or uploaded builds safely, survive crashes, and produce a single report format with evidence (screenshots, traces, raw JSON) I can trust.

Constraints

  • Single-host deployment target (one VPS).
  • Untrusted inputs (ZIP uploads and arbitrary URLs) must be isolated per job.
  • Crash recovery: partial work should not require restarting a job from scratch.
  • Report viewing should not require authentication (public share links).
  • Support both ZIP jobs (artifact scans) and URL jobs (live scans).

Solution

Stageflow is event-driven:

  • Every job transition is an explicit event on NATS JetStream.
  • Consumers use explicit ack, at-least-once delivery, and durable names.
  • Workers run inside per-job pods; the orchestrator treats container exit codes as a last-resort failure signal.

Status streams to the UI via SSE, and scan artifacts are stored in MinIO and served via presigned URLs so “view report” does not require authentication.
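As a rough illustration of the gateway side, here is a minimal Gin SSE handler for the /jobs/:id/stream route. The route shape matches the sequence diagram below; the handler body and the statusUpdates feed are illustrative stand-ins, not the actual gateway code.

```go
package main

import (
	"fmt"
	"io"

	"github.com/gin-gonic/gin"
)

// statusUpdates stands in for whatever feeds per-job status into the gateway
// (in Stageflow, the Platform API). Here it just emits a fixed sequence.
func statusUpdates(jobID string) <-chan string {
	ch := make(chan string, 3)
	ch <- `{"state":"PENDING"}`
	ch <- `{"state":"SCANNING"}`
	ch <- `{"state":"DONE"}`
	close(ch)
	return ch
}

func main() {
	r := gin.Default()

	// Relay job status to the browser as server-sent events.
	r.GET("/jobs/:id/stream", func(c *gin.Context) {
		updates := statusUpdates(c.Param("id"))

		c.Writer.Header().Set("Content-Type", "text/event-stream")
		c.Writer.Header().Set("Cache-Control", "no-cache")

		c.Stream(func(w io.Writer) bool {
			select {
			case msg, ok := <-updates:
				if !ok {
					return false // upstream closed: end the stream
				}
				fmt.Fprintf(w, "event: status\ndata: %s\n\n", msg)
				return true // keep the connection open
			case <-c.Request.Context().Done():
				return false // client disconnected
			}
		})
	})

	r.Run(":8080")
}
```

One-way SSE keeps reconnect semantics simple: the browser's EventSource reconnects on its own, and the gateway just resumes relaying the latest status.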

Architecture

System overview:

flowchart TB
  U[User / Browser] -->|HTTPS| Caddy["Caddy<br/>TLS + routing"]
  Caddy --> FE["Portfolio Frontend<br/>React Router"]
  Caddy --> GW["Portfolio Gateway<br/>Go/Gin"]
 
  FE <-->|SSE| GW
  GW --> API["Platform API<br/>Go"]
  API --> DB[(platform_api_status.db<br/>SQLite WAL)]
  API <--> NATS[(NATS JetStream)]
  NATS <--> ORCH["Orchestrator<br/>Go FSM"]
  ORCH --> POD["Per-job Podman Pod"]
 
  POD --> EX["Extractor (Go)"]
  POD --> RUN["Scanner Runner (TS/Playwright)"]
  EX --> STORE[(MinIO)]
  RUN --> STORE

Job flow (sequence):

sequenceDiagram
  actor User
  participant UI as Frontend
  participant GW as Gateway
  participant JS as JetStream
  participant OR as Orchestrator
  participant POD as Job Pod
  participant S3 as MinIO
 
  User->>UI: Start scan
  UI->>GW: POST /api/v1/jobs/*
  UI-->>GW: Subscribe SSE /jobs/:id/stream
  GW->>JS: publish jobs.events.created
  JS-->>OR: deliver jobs.events.created (durable)
  OR->>POD: spawn pod + workers
  POD-->>JS: publish extraction/scan events
  POD->>S3: upload artifacts
  OR-->>JS: publish jobs.events.completed (or failed)
  GW-->>UI: SSE status + progress

Deployment (single VPS):

flowchart TB
  User[Public Internet] -->|HTTPS| Caddy["Caddy (host)"]
 
  subgraph VPS["Single VPS"]
    Caddy --> Quad["systemd + Quadlets"]
    Quad --> FE["portfolio-frontend"]
    Quad --> GW["portfolio-gateway"]
    Quad --> API["platform-api"]
    Quad --> ORCH["orchestrator"]
    Quad --> NATS[(NATS JetStream)]
    Quad --> S3[(MinIO)]
    API --> DB[(SQLite WAL)]
    ORCH --> JOBS["ephemeral job pods"]
  end

Evidence (placeholders)

  • Screenshot (TODO): case-studies/stageflow/playground-create-job.png
    • Capture: /playground after selecting modules and entering URLs (the “create job” moment).
    • Alt text: “Stageflow scan setup form with selected scanner modules and URL inputs.”
    • Why it matters: supports the claim that users can configure and submit multi-scanner jobs.
  • Screenshot (TODO): case-studies/stageflow/job-stream.png
    • Capture: /scan/<job_id> while the job is transitioning states (live updates visible).
    • Alt text: “Scan job status view showing state transitions and live progress updates.”
    • Why it matters: demonstrates SSE-driven status and the job state machine UX.
  • Screenshot (TODO): case-studies/stageflow/report-overview.png
    • Capture: /scan/<job_id>/report for a completed scan (top of report, summary counts).
    • Alt text: “Aggregated scan report summary with counts grouped by scanner.”
    • Why it matters: supports the claim that results are normalized into a single report surface.
  • Screenshot (TODO): case-studies/stageflow/report-artifacts.png
    • Capture: the report section where per-scanner artifacts (HTML/JSON/screenshots) are linked or listed.
    • Alt text: “Report artifacts list showing per-scanner outputs and downloadable evidence.”
    • Why it matters: proves artifacts are first-class and job-scoped.

The contract: events + state machine

Stageflow is strict about what the bus carries. The contracts below are documented in docs/ARCHITECTURE.md and implemented in packages/shared-go/*:

JetStream streams and subjects

  • jobs: jobs.events.created, jobs.events.completed, jobs.events.failed
  • extraction: extraction.events.ready, extraction.events.failed
  • scan: scan.events.page.completed, scan.events.completed, scan.events.failed
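As a sketch of how these streams could be declared with the nats.go client: the stream names and subjects come from the list above, while the storage setting and the use of the client's original JetStream API are assumptions, not necessarily how Stageflow provisions them.

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// One stream per domain. The ">" wildcard also covers the two-token
	// scan.events.page.completed subject. File storage is an assumption.
	for _, cfg := range []*nats.StreamConfig{
		{Name: "jobs", Subjects: []string{"jobs.events.>"}, Storage: nats.FileStorage},
		{Name: "extraction", Subjects: []string{"extraction.events.>"}, Storage: nats.FileStorage},
		{Name: "scan", Subjects: []string{"scan.events.>"}, Storage: nats.FileStorage},
	} {
		if _, err := js.AddStream(cfg); err != nil {
			log.Fatalf("add stream %s: %v", cfg.Name, err)
		}
	}
}
```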

Durable consumer names (real values)

  • Orchestrator:
    • jobs.events.created → orchestrator-job-created
    • extraction.events.ready → orchestrator-extraction-ready
    • extraction.events.failed → orchestrator-extraction-failed
    • scan.events.completed → orchestrator-scan-completed
    • scan.events.failed → orchestrator-scan-failed
  • Platform API status projection:
    • jobs.events.created → platform-api-job-created
    • extraction.events.ready → platform-api-extraction-ready
    • extraction.events.failed → platform-api-extraction-failed
    • scan.events.page.completed → platform-api-scan-page
    • scan.events.completed → platform-api-scan-completed
    • scan.events.failed → platform-api-scan-failed
    • jobs.events.completed → platform-api-job-completed
    • jobs.events.failed → platform-api-job-failed
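For illustration, a durable, explicit-ack subscription for one of these consumers might look like the sketch below (nats.go; the connection setup and handler are placeholders, not Stageflow's actual orchestrator code).

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

// handleJobCreated is a placeholder for the orchestrator's real handler.
func handleJobCreated(data []byte) error {
	log.Printf("job created: %s", data)
	return nil
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Durable name matches the list above. At-least-once delivery means the
	// handler must tolerate duplicates; the ack is manual and explicit.
	_, err = js.Subscribe("jobs.events.created", func(m *nats.Msg) {
		if err := handleJobCreated(m.Data); err != nil {
			m.Nak() // ask JetStream to redeliver rather than drop the event
			return
		}
		m.Ack() // only now is the event considered handled
	},
		nats.Durable("orchestrator-job-created"),
		nats.ManualAck(),
		nats.AckExplicit(),
	)
	if err != nil {
		log.Fatal(err)
	}

	select {} // keep the process alive for the subscription
}
```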

Job lifecycle (actual allowed transitions)

The orchestrator FSM enforces:

  • PENDING → EXTRACTING | READY_TO_SCAN | FAILED
  • EXTRACTING → READY_TO_SCAN | FAILED
  • READY_TO_SCAN → SCANNING | FAILED
  • SCANNING → COMPLETING | FAILED
  • COMPLETING → DONE | FAILED

There are two “tracks”:

  • ZIP jobs: extraction + scan.
  • URL jobs: skip extraction, go straight to scanning.
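A minimal sketch of that transition table as a Go map: the states and edges mirror the list above, while the type and function names are illustrative rather than the orchestrator's actual identifiers.

```go
package fsm

import "fmt"

type JobState string

const (
	Pending     JobState = "PENDING"
	Extracting  JobState = "EXTRACTING"
	ReadyToScan JobState = "READY_TO_SCAN"
	Scanning    JobState = "SCANNING"
	Completing  JobState = "COMPLETING"
	Done        JobState = "DONE"
	Failed      JobState = "FAILED"
)

// allowed mirrors the transition table above. URL jobs skip extraction via
// PENDING -> READY_TO_SCAN; ZIP jobs pass through EXTRACTING first.
var allowed = map[JobState][]JobState{
	Pending:     {Extracting, ReadyToScan, Failed},
	Extracting:  {ReadyToScan, Failed},
	ReadyToScan: {Scanning, Failed},
	Scanning:    {Completing, Failed},
	Completing:  {Done, Failed},
}

// Transition returns an error if the FSM does not permit from -> to.
func Transition(from, to JobState) error {
	for _, next := range allowed[from] {
		if next == to {
			return nil
		}
	}
	return fmt.Errorf("illegal job transition: %s -> %s", from, to)
}
```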

Scanner modules (plugin-style)

Scanner metadata lives in packages/shared-go/scannercatalog/manifests/*/manifest.json and is embedded at build time. The platform ships six built-ins:

  • Axe (axe): WCAG accessibility via axe-core
  • Lighthouse (lighthouse): performance + SEO
  • Security headers (security-headers): CSP/HSTS/etc checks
  • SEO (seo): meta + structured data checks
  • Link checker (link-checker): broken links and redirect chains
  • AI navigator (ai-navigator): goal-driven exploratory scans (optional OpenRouter)
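A hedged sketch of the embed-at-build-time pattern: the go:embed glob matches the manifest layout described above, but the Manifest fields are assumptions about the schema, not the real contract.

```go
package scannercatalog

import (
	"embed"
	"encoding/json"
	"io/fs"
	"path"
)

//go:embed manifests/*/manifest.json
var manifestFS embed.FS

// Manifest holds the subset of fields a caller might need; the real
// manifest schema in Stageflow may differ.
type Manifest struct {
	ID          string `json:"id"`
	Name        string `json:"name"`
	Description string `json:"description"`
}

// Load parses every embedded manifest.json into the catalog.
func Load() ([]Manifest, error) {
	var catalog []Manifest
	dirs, err := fs.ReadDir(manifestFS, "manifests")
	if err != nil {
		return nil, err
	}
	for _, d := range dirs {
		raw, err := manifestFS.ReadFile(path.Join("manifests", d.Name(), "manifest.json"))
		if err != nil {
			return nil, err
		}
		var m Manifest
		if err := json.Unmarshal(raw, &m); err != nil {
			return nil, err
		}
		catalog = append(catalog, m)
	}
	return catalog, nil
}
```

Embedding the catalog means adding a scanner is a manifest file plus a rebuild, with no runtime filesystem dependency.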

Tech stack

  • Backend services: Go (API, orchestrator, gateway), shared workspace modules
  • Execution: Podman pods (rootless), per-job workspaces
  • Messaging: NATS JetStream durable streams
  • State: SQLite + WAL projections
  • Storage: MinIO artifacts + presigned URLs
  • Scanning: Playwright automation, axe-core, Lighthouse

Deep dive: artifacts are first-class, not an afterthought

Stageflow uses two MinIO buckets:

  • scanner-staging: inbound ZIP uploads.
  • scanner-artifacts: everything you need to view or debug a scan.

Artifact paths are deterministic and job-scoped:

  • ZIP upload: scanner-staging/staging/<job_id>/<filename>.zip
  • Provenance: scanner-artifacts/<job_id>/provenance.json
  • Per-scanner results: scanner-artifacts/<job_id>/<scanner_id>/results.json and report.html
  • Aggregated report: scanner-artifacts/<job_id>/report.json (contract version 2.0.0)

This is why report viewing can be public: it’s presigned URLs, not privileged API reads.
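For example, presigning the aggregated report with the minio-go client might look like this sketch; the endpoint, credentials, job ID, and expiry are placeholders, not Stageflow's configuration.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"net/url"
	"time"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	// Endpoint and credentials are placeholders; in Stageflow these come
	// from service configuration.
	client, err := minio.New("minio.internal:9000", &minio.Options{
		Creds: credentials.NewStaticV4("ACCESS_KEY", "SECRET_KEY", ""),
	})
	if err != nil {
		log.Fatal(err)
	}

	jobID := "job-123" // illustrative job ID
	object := jobID + "/report.json"

	// Presign a time-limited GET for the aggregated report. The browser
	// fetches this link directly, so no authenticated API call is needed.
	link, err := client.PresignedGetObject(context.Background(),
		"scanner-artifacts", object, 15*time.Minute, url.Values{})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(link)
}
```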

Key decisions

  • JetStream for durability: durable consumers + explicit ack provide at-least-once delivery and replay.
  • Podman per-job pods: job isolation maps cleanly to Podman’s “pod” model and stays rootless-by-default.
  • SSE for status: one-way job status matches the UI needs and simplifies reconnect semantics.
  • Artifacts via MinIO presigned URLs: report viewing stays “just links”, not privileged API reads.
  • SQLite WAL projections: fast read-side status without a separate DB server.
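A minimal sketch of the WAL-mode projection setup; the driver choice and the job_status schema are assumptions, only the database file name and WAL mode come from the points above.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	// WAL mode lets the read-side projection be queried while events are
	// still being applied, without a separate database server.
	db, err := sql.Open("sqlite3", "platform_api_status.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if _, err := db.Exec(`PRAGMA journal_mode=WAL;`); err != nil {
		log.Fatal(err)
	}

	// Illustrative projection table: one row per job, updated as events arrive.
	_, err = db.Exec(`CREATE TABLE IF NOT EXISTS job_status (
		job_id     TEXT PRIMARY KEY,
		state      TEXT NOT NULL,
		updated_at TEXT NOT NULL
	)`)
	if err != nil {
		log.Fatal(err)
	}
}
```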

Tradeoffs

  • At-least-once delivery means consumers must tolerate duplicate events.
  • Per-job pods improve isolation but increase orchestration complexity and resource pressure at high concurrency.
  • Public, presigned artifact links require careful bucket policy and lifecycle management.

Security and reliability

  • Rootless Podman containers, per-job pods, and ephemeral workspaces reduce cross-job contamination.
  • JetStream durability provides crash recovery via message redelivery.
  • Orchestrator watchdogs and exit-code detection prevent jobs from hanging indefinitely.
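As a rough sketch of the per-job pod lifecycle, an orchestrator could shell out to the podman CLI as below; the image name and flags are illustrative and deliberately minimal, not the actual Stageflow invocation.

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
)

// spawnJobPod creates a per-job pod, runs a worker container inside it, and
// tears the pod down afterwards so each job stays ephemeral and isolated.
func spawnJobPod(jobID string) error {
	pod := "job-" + jobID

	if out, err := exec.Command("podman", "pod", "create", "--name", pod).CombinedOutput(); err != nil {
		return fmt.Errorf("pod create: %v: %s", err, out)
	}
	// Remove the ephemeral pod even if the worker fails.
	defer exec.Command("podman", "pod", "rm", "-f", pod).Run()

	// Run the worker inside the pod; a non-zero exit code is the last-resort
	// failure signal mentioned earlier.
	if out, err := exec.Command("podman", "run", "--rm", "--pod", pod,
		"localhost/scanner-runner:latest").CombinedOutput(); err != nil {
		return fmt.Errorf("worker failed: %v: %s", err, out)
	}
	return nil
}

func main() {
	if err := spawnJobPod("example"); err != nil {
		log.Fatal(err)
	}
}
```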

Testing and quality

  • Go end-to-end tests live under tests/e2e/ (ZIP scan flows).
  • Frontend tests cover report utilities and markdown rendering for case studies.

Outcomes

  • Production backend for the portfolio’s scanning experience.
  • Concurrency with isolation: every job is its own pod/workspace.
  • Durable, replayable job state via JetStream with explicit ack and durable consumers.
  • Reports are actionable: aggregated JSON + per-scanner HTML + screenshots, all stored as artifacts.