
Sandbox Tests

A reproducible, evidence-based answer to the question:

Does Archon's install / update / sync / uninstall protocol actually work end-to-end on real projects, on every supported IDE and language?

Each sandbox test takes a clean fixture project (no .archon/, no binding directory), runs one Archon lifecycle command (via agent or CLI), and verifies the resulting tree against an expected outcome. Every run is recorded with date, manifest version, runner, and result so you can audit reality, not promises.
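Conceptually, each run reduces to two small steps: diff the resulting tree against the expected outcome, then write a run record. The sketch below is illustrative only; `verifyTree`, `recordRun`, and the record fields are hypothetical names, not the actual sandbox-run.mjs internals.

```javascript
// Illustrative sketch of the core sandbox assertion (hypothetical names,
// not the real sandbox-run.mjs API): compare the tree a lifecycle command
// produced against the expected outcome.
function verifyTree(actualPaths, expectedPaths) {
  const actual = new Set(actualPaths);
  const expected = new Set(expectedPaths);
  const missing = expectedPaths.filter((p) => !actual.has(p));
  const unexpected = actualPaths.filter((p) => !expected.has(p));
  return {
    pass: missing.length === 0 && unexpected.length === 0,
    missing,
    unexpected,
  };
}

// Every run is recorded with date, manifest version, runner, and result.
function recordRun(scenario, stage, runner, manifest, result) {
  return {
    scenario,
    stage,
    runner,   // e.g. "cli" or "manual"
    manifest, // e.g. "v0.1.0"
    result,   // "passing" | "failing" | "manual" | "pending"
    recorded: new Date().toISOString(),
  };
}
```

A scenario passes only when the produced tree contains every expected path and nothing extra; the record is what makes the run auditable later.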

How this differs from Contract Tests

| Layer | Asks | Lives in |
| --- | --- | --- |
| Contract Tests | "Are the framework files internally consistent?" (file shapes, cross-refs, line caps, forbidden substrings) | scripts/archon-check.py running against .archon/contracts/governance-contract.yaml |
| Sandbox Tests (this section) | "Does the install protocol produce a valid tree on a real fresh project, on this IDE / language?" | Scenario pages under /testing/sandbox/scenarios/ — each backed by a fixture in fixtures/ |

Both layers are required. Contract tests are static and run on every commit; sandbox tests are scenario-driven and run on every release (plus on demand when adding a new IDE / language target).

The 12-scenario matrix

The first matrix covers lifecycle stage × IDE × language with deliberate overlap on the most common stack (Cursor + Node + TS) so that update / sync / uninstall scenarios can chain on top of an install scenario.

| # | test-id | Stage | IDE | Language |
| --- | --- | --- | --- | --- |
| 01 | install-cursor-node | install | Cursor | Node + TS |
| 02 | install-claude-python | install | Claude Code | Python |
| 03 | install-codex-go | install | Codex CLI | Go |
| 04 | install-aider-rust | install | Aider | Rust |
| 05 | boot-cursor-node | boot | Cursor | Node + TS |
| 06 | boot-claude-python | boot | Claude Code | Python |
| 07 | update-cursor-node | update | Cursor | Node + TS |
| 08 | update-cli-without-cli | update + --without=cli | Cursor | Node + TS |
| 09 | sync-clean | sync (no drift) | Cursor | Node + TS |
| 10 | sync-modified | sync (drift detected) | Cursor | Node + TS |
| 11 | uninstall-preserve | uninstall (preserve ledgers) | Claude Code | Python |
| 12 | uninstall-archive | uninstall (archive ledgers) | Cursor | Node + TS |

See the Test Matrix page for the full grid with fixture / status columns, or jump to Test Fixtures for the project skeletons each scenario installs into.

Latest run summary

The table below is the single source of truth for "is Archon release-ready". A release does not ship until every row's most-recent run is passing against the candidate manifest version.

It is rendered live from runs/index.json, which is regenerated on every invocation of scripts/sandbox-run.mjs (local + GitHub Actions). To refresh after editing a scenario, run:

```bash
node scripts/sandbox-run.mjs --runnable=cli         # CLI scenarios
node scripts/sandbox-run.mjs --runnable=agent       # agent scenarios (currently → manual)
```
Index generated: 2026-05-06 10:24:38 UTC  ·  6 passing  ·  1 failing  ·  5 manual

| Scenario | Stage | Latest result | Manifest | Runner | Duration | Recorded |
| --- | --- | --- | --- | --- | --- | --- |
| install-cursor-node | install | ✅ passing | v0.1.0 | cli | 231 ms | 2026-05-06 10:24:35 |
| install-claude-python | install | ⏳ manual | v0.1.0 | manual (claude) | 2 ms | 2026-05-06 10:24:38 |
| install-codex-go | install | ⏳ manual | v0.1.0 | manual (codex) | 1 ms | 2026-05-06 10:24:38 |
| install-aider-rust | install | ⏳ manual | v0.1.0 | manual (aider) | 1 ms | 2026-05-06 10:24:38 |
| boot-cursor-node | boot | ⏳ manual | v0.1.0 | manual (cursor) | 224 ms | 2026-05-06 10:24:38 |
| boot-claude-python | boot | ⏳ manual | v0.1.0 | manual (claude) | 222 ms | 2026-05-06 10:24:38 |
| update-cursor-node | update | ✅ passing | v0.1.0 | cli | 353 ms | 2026-05-06 10:24:37 |
| update-cli-without-cli | update | ❌ failing | v0.1.0 | cli | 345 ms | 2026-05-06 10:24:37 |
| sync-clean | sync | ✅ passing | v0.1.0 | cli | 372 ms | 2026-05-06 10:24:35 |
| sync-modified | sync | ✅ passing | v0.1.0 | cli | 386 ms | 2026-05-06 10:24:36 |
| uninstall-preserve | uninstall | ✅ passing | v0.1.0 | cli | 367 ms | 2026-05-06 10:24:36 |
| uninstall-archive | uninstall | ✅ passing | v0.1.0 | cli | 357 ms | 2026-05-06 10:24:36 |

Status legend: ✅ passing · ❌ failing · ⏳ manual (no SDK adapter yet, see KNOWN-003) · pending (no run on record).

A failing row is not runner noise — it is either an authentic CLI regression or a scenario whose assertions need updating. Either way it blocks the release until resolved.
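That blocking rule can be expressed as a tiny gate over runs/index.json. The sketch below is a hypothetical check, not an existing script: the field names are assumed from the summary table, and treating ⏳ manual rows as non-blocking is a policy assumption, not something the protocol specifies.

```javascript
// Hypothetical release gate over runs/index.json (assumed field names).
// Policy assumption: only "failing" and "pending" block; "manual" rows
// are tolerated until an SDK adapter exists for their runner.
function releaseReady(index, candidateManifest) {
  const blocking = new Set(["failing", "pending"]);
  return index.runs
    .filter((run) => run.manifest === candidateManifest)
    .every((run) => !blocking.has(run.result));
}

// Example: a single failing row blocks the candidate manifest.
const index = {
  runs: [
    { scenario: "install-cursor-node", manifest: "v0.1.0", result: "passing" },
    { scenario: "update-cli-without-cli", manifest: "v0.1.0", result: "failing" },
  ],
};
console.log(releaseReady(index, "v0.1.0")); // false: the failing row blocks
```

Running a gate like this in CI would make "every row passing" mechanically enforceable instead of a manual checklist item.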

How to add a new scenario

  1. Pick the gap: a stage / IDE / language combination not yet covered.
  2. Pick (or add) a fixture under fixtures/ — see fixtures/README.md for conventions.
  3. Copy template.md into scenarios/<test-id>.md, fill front-matter + steps + expected outcome.
  4. Add the row to Test Matrix and to the Latest run summary table above (status pending).
  5. (When you actually execute it) record mp4 + cast, upload to docs/public/videos/<test-id>.mp4 and docs/public/asciinema/<test-id>.cast, flip status to passing in the same commit.
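As a concrete starting point, the front-matter from step 3 might look roughly like this; every field name here is illustrative (taken from the matrix and summary-table columns), not the actual template.md schema:

```yaml
# scenarios/install-cursor-node.md: illustrative front-matter, not the real template
test-id: install-cursor-node
stage: install
ide: Cursor
language: Node + TS
fixture: fixtures/<fixture-name>   # pick per fixtures/README.md conventions
runnable: cli                      # cli | agent
status: pending                    # flip to passing once a recorded run lands
```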

Why we keep pending rows visible

A scenario page that lives under "I'll write the test later" rots fast. By committing the page (with pending status, expected steps, and empty recording slots) before the run, three things happen:

  1. The matrix is honest about coverage gaps.
  2. The expected outcome is fixed before the run, removing the bias of writing the test to match whatever happened.
  3. Anyone (including future maintainers) can pick up a pending scenario and execute it without having to invent it.

Released under the Apache-2.0 License.