The Executable Harness

Some agent rules belong in scripts rather than prose — filename suffixes that encode permission, and linters that keep the harness from rotting.

Some agent rules belong in scripts, not in AGENTS.md. Prose policy depends on the agent reading the right paragraph and remembering it through compaction. Script policy is enforced every time the script runs. The longer I worked alongside agents, the more I noticed that the rules I kept repeating in prose were the rules I should have been encoding in code.

ShowUp's scripts/ directory has about three dozen entries. Most are not generic developer helpers. They exist because an agent needed a safe way to act, or because the harness itself needed a way to check that it had not drifted.

Suffixes are part of the interface

When an agent reads a filename like prod-smoke-readonly.sh or production-reset-guarded.sh, the suffix tells it what kind of script this is before the file is opened. The suffixes are a small vocabulary, used consistently across the directory:

  • -readonly: touches external state but cannot mutate it, by construction.
  • -guarded: dangerous, but the dangerous path is gated by explicit confirmations and internal sanity checks. Default behavior is non-destructive.
  • -preflight: validates posture for a dangerous action. Does not perform the action.
  • -lint / -check / -audit: read repo or harness state and report findings. Do not change anything.

This is more useful than it sounds. The agent does not have to read every script's preamble to learn what it does — the filename already encodes the permission. A -readonly script gets run; a -guarded script gets dry-run first; a -preflight script gets run and its output read.
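
The suffix vocabulary above is small enough to express as a lookup table. This is a minimal sketch, not code from the repo: SUFFIX_POLICY, policy_for, and the fallback string are hypothetical names chosen for illustration, and the handling strings paraphrase the paragraph above.

```python
from pathlib import PurePath

# Hypothetical table: suffix -> how an agent treats the script.
SUFFIX_POLICY = {
    "readonly": "run",
    "guarded": "dry-run first",
    "preflight": "run and read output",
    "lint": "run and read output",
    "check": "run and read output",
    "audit": "run and read output",
}


def policy_for(script_name: str) -> str:
    """Derive the handling policy from the last dash-separated token of the filename."""
    stem = PurePath(script_name).stem          # drop the .sh / .py extension
    suffix = stem.rsplit("-", 1)[-1]           # text after the final dash
    # Assumption: an unrecognized suffix falls back to the cautious path.
    return SUFFIX_POLICY.get(suffix, "read before running")


if __name__ == "__main__":
    for name in ("prod-smoke-readonly.sh", "production-reset-guarded.sh",
                 "deploy-backend-preflight.sh", "agent-check-plan.py"):
        print(name, "->", policy_for(name))
```

The useful property is the fallback: a script without a known suffix earns no trust from its name alone.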

-readonly means provably read-only

prod-smoke-readonly.sh makes curl GETs against production endpoints and verifies expected status codes and body markers. It has retries, timeouts, and a clear exit code. It does not POST anywhere. Running it ten times in a row produces no side effects.

play-state-readonly.py is the interesting one. It needs to query Google Play artifact state — track versions, bundle versions, the highest versionCode. The Android Publisher API forces you to create a temporary edit object to read most of that. So the script creates the edit, reads state, and then deletes the edit in a finally block, printing temporary_edit=deleted on success. The "readonly" promise is kept by the code, not just claimed in the filename.
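
The create, read, delete-in-finally shape can be sketched without touching the real API. FakePlayClient below is a hypothetical in-memory stand-in, not the Android Publisher client, and its method names are assumptions made for illustration. Only the control flow mirrors the description above.

```python
class FakePlayClient:
    """Hypothetical in-memory stand-in for a Play publishing client."""

    def __init__(self):
        self.open_edits = set()
        self._next_id = 0

    def create_edit(self) -> str:
        self._next_id += 1
        edit_id = f"edit-{self._next_id}"
        self.open_edits.add(edit_id)
        return edit_id

    def list_version_codes(self, edit_id: str) -> list:
        if edit_id not in self.open_edits:
            raise KeyError(edit_id)
        return [41, 42]  # made-up data for the sketch

    def delete_edit(self, edit_id: str) -> None:
        self.open_edits.discard(edit_id)


def read_highest_version_code(client) -> int:
    """Read state through a temporary edit that never outlives the call."""
    edit_id = client.create_edit()
    try:
        return max(client.list_version_codes(edit_id))
    finally:
        # Runs on success and on failure, so no edit is left open.
        client.delete_edit(edit_id)
        print("temporary_edit=deleted")
```

Because the cleanup sits in finally, a failed read still deletes the edit before the exception propagates.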

github-deploy-prereq-readonly.sh has a subtler discipline. It checks that the required GitHub Actions secrets exist by name, that the deploy environment is configured, that branch protection is on. The script's docstring says it "prints only secret names and configuration status, never secret values." A readonly script that reads secrets must not leak them; otherwise the suffix is a lie.

The pattern is: a -readonly script is one whose worst case, if an agent runs it on its own, is wasted seconds. That is the bar.

-guarded means the danger is inside the script

production-reset-guarded.sh is the canonical example, and it is the only -guarded script in the repo — the suffix is reserved, not a habit. It can truncate runtime tables on the production Postgres database, optionally wipe Firebase Auth users, and restart services. It is also the kind of script that should make any operator nervous to invoke.

The danger lives inside, not in the operator's memory. Defaults are non-destructive: --dry-run is the default, and --execute requires --confirm-host showup-prod-1, --confirm-database showup, and --confirm-delete-postgres-runtime-data DELETE_SHOWUP_PROD_POSTGRES_RUNTIME_DATA. Each is the literal expected string, validated before anything destructive runs. The final guard is a phrase confirmation: --confirm-final-reset "RESET showup-prod-1 showup POSTGRES_ONLY". Mistype any part and the script aborts. Firebase deletion needs its own separate flags with the project name embedded in them.
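
To make the flag discipline concrete, here is a minimal sketch in Python. The real script is a shell script, so this is an illustration of the idea rather than its implementation. The flag names and expected literals come from the paragraph above. The parser layout, the function names, and the omission of the Firebase flags are assumptions.

```python
import argparse

# Expected literals as quoted in the text; each must match exactly.
EXPECTED = {
    "confirm_host": "showup-prod-1",
    "confirm_database": "showup",
    "confirm_delete_postgres_runtime_data": "DELETE_SHOWUP_PROD_POSTGRES_RUNTIME_DATA",
    "confirm_final_reset": "RESET showup-prod-1 showup POSTGRES_ONLY",
}


def build_parser() -> argparse.ArgumentParser:
    # allow_abbrev=False so a shortened flag name is rejected, not guessed.
    parser = argparse.ArgumentParser(prog="production-reset-guarded",
                                     allow_abbrev=False)
    parser.add_argument("--execute", action="store_true",
                        help="without this flag the run is a dry run")
    for dest in EXPECTED:
        parser.add_argument("--" + dest.replace("_", "-"), dest=dest, default="")
    return parser


def resolve_mode(argv: list) -> str:
    """Return 'dry-run' or 'execute'; abort if any confirmation is wrong."""
    args = build_parser().parse_args(argv)
    if not args.execute:
        return "dry-run"
    for dest, expected in EXPECTED.items():
        if getattr(args, dest) != expected:
            flag = "--" + dest.replace("_", "-")
            raise SystemExit(f"refusing to execute: {flag} does not match")
    return "execute"
```

Calling resolve_mode([]) returns "dry-run", which is the point: the safe path needs no arguments and the destructive path needs all of them.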

The internal guards are at least as important as the flag guards. Before truncating, the script verifies that the actual hostname is showup-prod-1, that the current database is showup, and that the alembic_version row count is exactly one. It refuses to run if the backend user count exceeds a configurable ceiling, if any unmanaged production table contains rows, or if any unmanaged table has a foreign key pointing into a reset table. That last check exists because TRUNCATE CASCADE could otherwise quietly wipe data the operator did not plan to touch.
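
The foreign-key guard reduces to a set question: does any table outside the reset set reference a table inside it? The sketch below assumes the foreign keys have already been read from the database as (child, parent) pairs. The function name and the table names in the usage lines are hypothetical.

```python
def unmanaged_references(reset_tables, foreign_keys):
    """Return (child, parent) pairs where a table outside the reset set
    has a foreign key pointing at a table inside it."""
    reset = set(reset_tables)
    return sorted(
        (child, parent)
        for child, parent in foreign_keys
        if parent in reset and child not in reset
    )


if __name__ == "__main__":
    # Hypothetical schema, for illustration only.
    reset = ["sessions", "checkins"]
    fks = [
        ("checkins", "sessions"),        # both reset: fine
        ("billing_ledger", "sessions"),  # unmanaged child: must block the reset
    ]
    offenders = unmanaged_references(reset, fks)
    if offenders:
        raise SystemExit(f"refusing to truncate: {offenders}")
```

Any non-empty result is a reason to abort before the truncate runs.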

If the destructive step commits but post-reset verification fails, the script leaves production writers stopped on purpose. A half-completed reset should not keep accepting traffic.

The agent does not need to know any of this. The agent invokes the script with the dry-run flag, reads the inventory, and asks for owner confirmation. The script enforces the rules whether the agent remembered them or not.

-preflight means validate, then print the next command

deploy-backend-preflight.sh is what an agent runs before suggesting a production deploy. It does not deploy. It checks that the current branch is main, then runs the infra template lint, the backend deploy validator, and the GitHub-side prerequisite check. It warns about a dirty working tree and about a local HEAD that does not match origin/main. Then it prints the exact gh workflow run command.

To actually trigger the deploy, you have to pass --run along with --confirm-delete-target /opt/showup/backend. Without those, the script's last output is the command the operator would have to type, printed to the terminal so they can decide whether to run it.
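
The validate-then-print shape is easy to sketch. This covers only the default path, where nothing is triggered. The function name, the check names, and the workflow file deploy-backend.yml in the usage lines are hypothetical and do not come from the real script.

```python
def preflight(checks, next_command):
    """Run named checks and return (ok, report_lines).

    checks is a list of (name, zero-argument callable returning truthy on pass).
    The next command is included in the report only if every check passed.
    """
    lines = []
    ok = True
    for name, check in checks:
        passed = bool(check())
        ok = ok and passed
        status = "ok" if passed else "FAIL"
        lines.append(f"{status}  {name}")
    if ok:
        lines.append("next command (not run):")
        lines.append("  " + next_command)
    else:
        lines.append("preflight failed; no command suggested")
    return ok, lines


if __name__ == "__main__":
    ok, report = preflight(
        [("branch is main", lambda: True), ("infra template lint", lambda: True)],
        "gh workflow run deploy-backend.yml --ref main",  # hypothetical workflow name
    )
    print("\n".join(report))
```

The function never executes the command it prints, so the decision stays with the operator.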

Preflight is the right shape for "this dangerous thing is about to happen, did we forget anything?" An agent that ends every deploy-adjacent session with the preflight output cannot quietly skip a check the harness already knows about.

The harness needs a linter

A harness is text and code that tells agents what to do. Like any other text and code, it rots. Conventions drift. Someone adds a new changelog entry and forgets the rollback section. Someone writes a payment-adjacent endpoint that uses FastAPI BackgroundTasks instead of durable Procrastinate. Someone pastes a service-account private key into a changelog because they did not realize the field they were copying contained one. Without enforcement, the harness slowly loses meaning.

changelog-lint.py enforces the changelog format. Every entry under changelog/ must have Status, Surface, Environment, Commit, Artifact/version/revision, Deployment mechanism, Deployed at, Operator, and Links fields, with values from a fixed allowlist for the first three. The required sections Summary, Changes, Verification, Rollback, and Notes must exist. The same script scans for accidentally pasted secrets: PEM private keys, service-account JSON bodies, ya29.* OAuth tokens, AIza* Google API keys, raw purchase tokens, exact GPS coordinate pairs, and raw env assignments containing SECRET, TOKEN, PASSWORD, or DATABASE_URL. An agent could not credibly remember all of those patterns. The linter just runs.
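
A secret scan of this kind is mostly a table of regular expressions. The sketch below covers four of the categories listed above. The patterns are my approximations, not the ones changelog-lint.py uses, and the names PATTERNS and find_secrets are hypothetical.

```python
import re

# Approximate patterns, for illustration; a real linter would tune these.
PATTERNS = {
    "pem_private_key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "oauth_access_token": re.compile(r"\bya29\.[0-9A-Za-z_-]+"),
    "google_api_key": re.compile(r"\bAIza[0-9A-Za-z_-]{35}"),
    "env_assignment": re.compile(
        r"^\s*\w*(?:SECRET|TOKEN|PASSWORD|DATABASE_URL)\w*\s*=\s*\S+",
        re.MULTILINE,
    ),
}


def find_secrets(text: str) -> list:
    """Return the sorted names of every pattern that matches the text."""
    return sorted(name for name, pattern in PATTERNS.items() if pattern.search(text))


if __name__ == "__main__":
    entry = "Summary\nRotated the key.\nAPI_TOKEN=abc123\n"
    findings = find_secrets(entry)
    if findings:
        raise SystemExit(f"changelog entry contains possible secrets: {findings}")
```

Reporting only the pattern name, never the matched text, keeps the linter's own output from becoming a second leak.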

repo-policy-lint.py enforces a small set of cross-cutting rules that do not live anywhere else. It walks every text file in the repo and flags:

  • while true loops in active source code
  • BackgroundTasks in payment-adjacent backend modules
  • positional Procrastinate defer_async calls (which once caused jobs to run with missing arguments)
  • shell scripts sourcing the entire /etc/showup/showup.env
  • destructive production commands without one of the known confirmation tokens
  • retired image paths
  • durable Android version constants stored in skills or AGENTS.md
  • market-default timezone strings in code that should resolve timezone from the spot

Each of those is the corpse of a past mistake.
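
One of these rules, the positional defer_async check, can be sketched with Python's ast module. This is a swapped-in technique: the text says the real linter walks text files, so it may well use plain pattern matching instead. The function name is hypothetical.

```python
import ast


def positional_defer_calls(source: str) -> list:
    """Return line numbers of <something>.defer_async(...) calls that pass
    at least one positional argument."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "defer_async"
            and node.args  # non-empty list means positional arguments
        ):
            hits.append(node.lineno)
    return sorted(hits)


if __name__ == "__main__":
    sample = (
        "async def enqueue(task, user_id):\n"
        "    await task.defer_async(user_id)\n"
        "    await task.defer_async(user_id=user_id)\n"
    )
    print(positional_defer_calls(sample))
```

Parsing the syntax tree avoids false positives from comments and strings, at the cost of handling only files that parse as Python.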

skill-validate.py checks the skill files themselves — that each .codex/skills/<name>/SKILL.md has the right frontmatter, that the folder name matches the declared name, that the description is not a placeholder, that the agent-interface YAML carries a display name and default prompt. The skills are what agents load to remember how to do a thing; the validator keeps those skills from shipping with TODO in their descriptions.
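
Two of those checks, the name match and the placeholder description, fit in a short function. This sketch assumes frontmatter delimited by --- lines with flat key: value pairs and parses it by hand. A real validator would likely use a YAML parser, and the agent-interface checks are omitted. The function name and error strings are hypothetical.

```python
def validate_skill(folder_name: str, skill_md: str) -> list:
    """Return a list of error strings for one SKILL.md; empty means valid."""
    lines = [line.strip() for line in skill_md.splitlines()]
    if not lines or lines[0] != "---" or "---" not in lines[1:]:
        return ["missing frontmatter"]
    end = lines[1:].index("---") + 1  # index of the closing delimiter

    # Simplified parsing: flat "key: value" pairs only.
    fields = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()

    errors = []
    if fields.get("name") != folder_name:
        errors.append("name does not match folder")
    description = fields.get("description", "")
    if not description or "TODO" in description:
        errors.append("description is a placeholder")
    return errors
```

Returning every error at once, instead of stopping at the first, lets a single run show all the problems in a skill.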

agent-check-plan.py is the one that pulls it all together. Given a set of changed files, it computes which repo skills to reopen, which source docs to reread, which commands the verification pass must run, and what the risk level is. Payment-adjacent paths force the showup-payments skill and a Play catalog summary. Backend changes pull in the ruff, mypy, and pytest command set. Migration files escalate risk to high and demand the local DB and Procrastinate bootstrap commands. AGENTS.md changes pull in skill-validate.py. It is, in effect, a function from "diff" to "the verification pass an agent should run." Prose instructions about what to verify get vague; this script produces a specific plan.
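
The "function from diff to plan" idea can be shown in miniature. The rules below paraphrase four of the mappings described above. The path conventions (backend/, /migrations/, the substring "payment"), the Plan fields, and the bare command names are assumptions for illustration, not the real script's logic.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    skills: set = field(default_factory=set)
    commands: list = field(default_factory=list)
    risk: str = "low"


def _add_once(commands: list, command: str) -> None:
    if command not in commands:
        commands.append(command)


def plan_for(changed_files) -> Plan:
    """Map a set of changed paths to skills, commands, and a risk level."""
    plan = Plan()
    for path in changed_files:
        if "payment" in path:                  # hypothetical path convention
            plan.skills.add("showup-payments")
        if path.startswith("backend/"):        # hypothetical path convention
            for tool in ("ruff", "mypy", "pytest"):
                _add_once(plan.commands, tool)
        if "/migrations/" in path:             # hypothetical path convention
            plan.risk = "high"
        if path.endswith("AGENTS.md"):
            _add_once(plan.commands, "skill-validate.py")
    return plan


if __name__ == "__main__":
    print(plan_for(["backend/app/payments/api.py", "AGENTS.md"]))
```

Because the output is data, not advice, an agent can run the listed commands without interpreting anything.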

What stops being prose

Every rule that lives in a script is a rule I no longer have to repeat in AGENTS.md, no longer have to hope survives compaction, no longer have to re-explain to a fresh session.

Prose policy is the right surface for intent and judgment: the reason a rule exists, the cases where it bends, the things not worth automating. Script policy is the right surface for "this must not happen" and "this is the safe way to do that." When in doubt, push the rule into a script. The harness gets smaller and harder to forget at the same time.