
Sandbox Tests

A reproducible, evidence-based answer to the question:

Does Archon's install / update / sync / uninstall protocol actually work end-to-end on real projects, on every supported IDE and language?

Each sandbox test takes a clean fixture project (no .archon/, no binding directory), runs one Archon lifecycle command (via agent or CLI), and verifies the resulting tree against an expected outcome. Every run is recorded with date, manifest version, runner, and result so you can audit reality, not promises.
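Conceptually, each run reduces to two small steps: diff the resulting tree against the expected outcome, then write a run record. The sketch below is illustrative only; `verifyTree`, `recordRun`, and the record fields are hypothetical names, not the actual sandbox-run.mjs internals.

```javascript
// Illustrative sketch of the core sandbox assertion (hypothetical names,
// not the real sandbox-run.mjs API): compare the tree a lifecycle command
// produced against the expected outcome.
function verifyTree(actualPaths, expectedPaths) {
  const actual = new Set(actualPaths);
  const expected = new Set(expectedPaths);
  const missing = expectedPaths.filter((p) => !actual.has(p));
  const unexpected = actualPaths.filter((p) => !expected.has(p));
  return {
    pass: missing.length === 0 && unexpected.length === 0,
    missing,
    unexpected,
  };
}

// Every run is recorded with date, manifest version, runner, and result.
function recordRun(scenario, stage, runner, manifest, result) {
  return {
    scenario,
    stage,
    runner,   // e.g. "cli" or "manual"
    manifest, // e.g. "v0.1.0"
    result,   // "passing" | "failing" | "manual" | "pending"
    recorded: new Date().toISOString(),
  };
}
```

A scenario passes only when the produced tree contains every expected path and nothing extra; the record is what makes the run auditable later.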

How this differs from Contract Tests

| Layer | Asks | Lives in |
| --- | --- | --- |
| Contract Tests | "Are the framework files internally consistent?" (file shapes, cross-refs, line caps, forbidden substrings) | scripts/archon-check.py running against .archon/contracts/governance-contract.yaml |
| Sandbox Tests (this section) | "Does the install protocol produce a valid tree on a real fresh project, on this IDE / language?" | Scenario pages under /testing/sandbox/scenarios/ — each backed by a fixture in fixtures/ |

Both layers are required. Contract tests are static and run on every commit; sandbox tests are scenario-driven and run on every release (plus on demand when adding a new IDE / language target).

The 12-scenario matrix

The first matrix covers lifecycle stage × IDE × language with deliberate overlap on the most common stack (Cursor + Node + TS) so that update / sync / uninstall scenarios can chain on top of an install scenario.

| # | test-id | Stage | IDE | Language |
| --- | --- | --- | --- | --- |
| 01 | install-cursor-node | install | Cursor | Node + TS |
| 02 | install-claude-python | install | Claude Code | Python |
| 03 | install-codex-go | install | Codex CLI | Go |
| 04 | install-aider-rust | install | Aider | Rust |
| 05 | boot-cursor-node | boot | Cursor | Node + TS |
| 06 | boot-claude-python | boot | Claude Code | Python |
| 07 | update-cursor-node | update | Cursor | Node + TS |
| 08 | update-cli-without-cli | update + --without=cli | Cursor | Node + TS |
| 09 | sync-clean | sync (no drift) | Cursor | Node + TS |
| 10 | sync-modified | sync (drift detected) | Cursor | Node + TS |
| 11 | uninstall-preserve | uninstall (preserve ledgers) | Claude Code | Python |
| 12 | uninstall-archive | uninstall (archive ledgers) | Cursor | Node + TS |

See the Test Matrix page for the full grid with fixture / status columns, or jump to Test Fixtures for the project skeletons each scenario installs into.

Latest run summary

The table below is the single source of truth for "is Archon release-ready". A release does not ship until every row's most-recent run is passing against the candidate manifest version.

It is rendered live from runs/index.json, which is regenerated on every invocation of scripts/sandbox-run.mjs (local + GitHub Actions). To refresh after editing a scenario, run:

```bash
node scripts/sandbox-run.mjs --runnable=cli         # CLI scenarios
node scripts/sandbox-run.mjs --runnable=agent       # agent scenarios (currently → manual)
```
Index generated: 2026-05-06 10:24:38 UTC  ·  6 passing  ·  1 failing  ·  5 manual

| Scenario | Stage | Latest result | Manifest | Runner | Duration | Recorded |
| --- | --- | --- | --- | --- | --- | --- |
| install-cursor-node | install | ✅ passing | v0.1.0 | cli | 231 ms | 2026-05-06 10:24:35 |
| install-claude-python | install | ⏳ manual | v0.1.0 | manual (claude) | 2 ms | 2026-05-06 10:24:38 |
| install-codex-go | install | ⏳ manual | v0.1.0 | manual (codex) | 1 ms | 2026-05-06 10:24:38 |
| install-aider-rust | install | ⏳ manual | v0.1.0 | manual (aider) | 1 ms | 2026-05-06 10:24:38 |
| boot-cursor-node | boot | ⏳ manual | v0.1.0 | manual (cursor) | 224 ms | 2026-05-06 10:24:38 |
| boot-claude-python | boot | ⏳ manual | v0.1.0 | manual (claude) | 222 ms | 2026-05-06 10:24:38 |
| update-cursor-node | update | ✅ passing | v0.1.0 | cli | 353 ms | 2026-05-06 10:24:37 |
| update-cli-without-cli | update | ❌ failing | v0.1.0 | cli | 345 ms | 2026-05-06 10:24:37 |
| sync-clean | sync | ✅ passing | v0.1.0 | cli | 372 ms | 2026-05-06 10:24:35 |
| sync-modified | sync | ✅ passing | v0.1.0 | cli | 386 ms | 2026-05-06 10:24:36 |
| uninstall-preserve | uninstall | ✅ passing | v0.1.0 | cli | 367 ms | 2026-05-06 10:24:36 |
| uninstall-archive | uninstall | ✅ passing | v0.1.0 | cli | 357 ms | 2026-05-06 10:24:36 |

Status legend: ✅ passing · ❌ failing · ⏳ manual (no SDK adapter yet, see KNOWN-003) · pending (no run on record).

A failing row is not runner noise — it is either an authentic CLI regression or a scenario whose assertions need updating. Either way it blocks the release until resolved.
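That blocking rule can be expressed as a tiny gate over runs/index.json. The sketch below is a hypothetical check, not an existing script: the field names are assumed from the summary table, and treating ⏳ manual rows as non-blocking is a policy assumption, not something the protocol specifies.

```javascript
// Hypothetical release gate over runs/index.json (assumed field names).
// Policy assumption: only "failing" and "pending" block; "manual" rows
// are tolerated until an SDK adapter exists for their runner.
function releaseReady(index, candidateManifest) {
  const blocking = new Set(["failing", "pending"]);
  return index.runs
    .filter((run) => run.manifest === candidateManifest)
    .every((run) => !blocking.has(run.result));
}

// Example: a single failing row blocks the candidate manifest.
const index = {
  runs: [
    { scenario: "install-cursor-node", manifest: "v0.1.0", result: "passing" },
    { scenario: "update-cli-without-cli", manifest: "v0.1.0", result: "failing" },
  ],
};
console.log(releaseReady(index, "v0.1.0")); // false: the failing row blocks
```

Running a gate like this in CI would make "every row passing" mechanically enforceable instead of a manual checklist item.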

How to add a new scenario

  1. Pick the gap: a stage / IDE / language combination not yet covered.
  2. Pick (or add) a fixture under fixtures/ — see fixtures/README.md for conventions.
  3. Copy template.md into scenarios/<test-id>.md, fill front-matter + steps + expected outcome.
  4. Add the row to Test Matrix and to the Latest run summary table above (status pending).
  5. (When you actually execute it) record mp4 + cast, upload to docs/public/videos/<test-id>.mp4 and docs/public/asciinema/<test-id>.cast, flip status to passing in the same commit.
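As a concrete starting point, the front-matter from step 3 might look roughly like this; every field name here is illustrative (taken from the matrix and summary-table columns), not the actual template.md schema:

```yaml
# scenarios/install-cursor-node.md: illustrative front-matter, not the real template
test-id: install-cursor-node
stage: install
ide: Cursor
language: Node + TS
fixture: fixtures/<fixture-name>   # pick per fixtures/README.md conventions
runnable: cli                      # cli | agent
status: pending                    # flip to passing once a recorded run lands
```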

Why we keep pending rows visible

A scenario page that lives under "I'll write the test later" rots fast. By committing the page (with pending status, expected steps, and empty recording slots) before the run, three things happen:

  1. The matrix is honest about coverage gaps.
  2. The expected outcome is fixed before the run, removing the bias of writing the test to match whatever happened.
  3. Anyone (including future maintainers) can pick up a pending scenario and execute it without having to invent it.

Released under the Apache-2.0 License.