
The Agentic Harness in Practice

A walkthrough of one solo-operator codebase mapped onto the four-level harness model — the artifacts, scripts, and negative-space rules that make each level real.

The agentic project harness post sketched a model — four readiness levels, a folder shape, a completion protocol. This is what one real codebase actually has at each level, and which parts turned out to matter most.

The codebase is ShowUp: an Android accountability app for location-based habits. Native Kotlin and Compose on the client, FastAPI and PostgreSQL with Alembic and Procrastinate on the backend, infrastructure across Cloudflare and one Hetzner box. One operator, heavy agent assistance, more than forty deploy entries across backend, Android, and infra in the past two weeks. Everything below comes from a working repo, not a reference template.

Orientation — what the agent reads first

The orientation layer is the agent's first ten minutes. If it lands wrong, every later decision inherits the confusion.

AGENTS.md is a wrapper, not the truth. The file runs 345 lines and deliberately doesn't try to hold all product knowledge. It sets tone, names the operator, lists production hosts and services, and then defers product, payment, privacy, verification, and launch truth to one canonical doc, docs/prd.md. The PRD is 1,234 lines. When the agent asks "is X allowed?", AGENTS.md tells it where to look and the PRD answers. The two layers resist drift because AGENTS.md points at the PRD instead of restating it: they are separate files with different update cadences, and each fact lives in only one of them.

The PRD itself is the source of truth for anything that could legitimately be argued about — what counts as a payable market, what an admin can and can't do, what data is retained on account deletion, what the launch gate requires. The AGENTS.md guardrail is explicit: "Read docs/prd.md before product, payment, privacy, verification, or launch decisions." That one line removes a whole category of agent confusion.

Repo-local workflows live under .codex/skills/ — fifteen named skill folders, each with a SKILL.md, an agents/ reference set, and supporting references. The skills are an engineering categorization, not a product UX one: showup-backend, showup-payments, showup-ci-release, showup-launch-ops, showup-android-refactor, and so on. When the agent needs to review a payment change, it loads the payment skill instead of inventing a review pattern. The skill body carries the gotchas — idempotency keys, RTDN OIDC verification, voided-purchase handling — that an agent wouldn't reconstruct from code alone.

And scratch-pad/ is named negative space. AGENTS.md says explicitly: "owner-private. Do not read, write, move, delete, summarize, or stage anything under it unless Kuldeep explicitly names that path." Most harnesses describe what the agent should do. The do-not rules are at least as important, and writing them down is what makes them enforceable.

Local repeatability

The local layer makes development behave the same way every time. scripts/ holds named single-purpose helpers, a sampling of what gets used in a normal week:

| Script | What it does |
| --- | --- |
| bootstrap-local-db.sh | One command from a clean clone to a migrated Postgres |
| new-changelog-entry.py | Properly named, properly templated changelog entry |
| completion_check.sh | Pre-claim check before declaring work done |
| deploy-backend-preflight.sh | Local-side gate before triggering the deploy workflow |
| prod-smoke-readonly.sh | Read-only checks anyone can run safely against production |
| release-evidence-pack.sh | Bundles release artifacts and verification output |

Two rules govern the folder: each script does one thing, and read-only is named read-only. The agent can run any script whose name ends in -readonly (before the file extension) without confirmation. Anything else triggers an authorization conversation.
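The naming rule is simple enough to check mechanically. A minimal sketch, assuming the rule applies to the file name with its extension removed; the function name needs_confirmation is hypothetical and not something the repo is known to contain:

```python
from pathlib import PurePosixPath


def needs_confirmation(script_path: str) -> bool:
    """Return False only for scripts whose name (minus extension) ends in -readonly.

    Assumption: the suffix is checked on the stem, so prod-smoke-readonly.sh
    qualifies but readonly-wipe.sh does not.
    """
    stem = PurePosixPath(script_path).stem
    return not stem.endswith("-readonly")


if __name__ == "__main__":
    for name in ("scripts/prod-smoke-readonly.sh", "scripts/deploy-backend-preflight.sh"):
        verdict = "ask first" if needs_confirmation(name) else "safe to run"
        print(f"{name}: {verdict}")
```

Checking the suffix rather than searching for the word anywhere in the name keeps the safe set narrow: a script has to opt in by name.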

completion_check.sh is the small piece that makes the completion protocol operable. It prints dirty files, untracked files, conflict markers, and TODO/debug markers — the surface area an agent has to inspect before claiming work is done.
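The real completion_check.sh is a shell script whose contents aren't shown here. The following Python sketch only illustrates the kind of checks described, assuming input from `git status --porcelain` and a guessed set of TODO/debug patterns; the function names and the marker list are illustrative assumptions:

```python
from __future__ import annotations

import re

# Git conflict markers are exactly seven characters, then whitespace or end of line.
CONFLICT_RE = re.compile(r"^(<{7}|={7}|>{7})(\s|$)")
# Assumed marker set; the actual script may look for different patterns.
MARKER_RE = re.compile(r"\b(TODO|FIXME|XXX)\b|\bprint\(|console\.log\(")


def parse_porcelain(output: str) -> tuple[list[str], list[str]]:
    """Split `git status --porcelain` output into (dirty, untracked) paths.

    Each line is two status characters, a space, then the path.
    Rename entries ("old -> new") are kept as a single string in this sketch.
    """
    dirty: list[str] = []
    untracked: list[str] = []
    for line in output.splitlines():
        if len(line) < 4:
            continue
        status, path = line[:2], line[3:]
        (untracked if status == "??" else dirty).append(path)
    return dirty, untracked


def scan_text(text: str) -> dict[str, list[int]]:
    """Return 1-based line numbers of conflict markers and TODO/debug markers."""
    findings: dict[str, list[int]] = {"conflict": [], "marker": []}
    for number, line in enumerate(text.splitlines(), start=1):
        if CONFLICT_RE.match(line):
            findings["conflict"].append(number)
        elif MARKER_RE.search(line):
            findings["marker"].append(number)
    return findings
```

The point of the sketch is the output shape: a short, inspectable list of things the agent must look at before it claims the work is done, rather than a pass/fail verdict.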

Production discipline

This is the layer that decides whether the project can take real traffic without losing money or data.

The changelog is the deploy log and the cross-session memory. Every backend, Android, or infra deploy gets one Markdown file — UTC-named (YYYY-MM-DD-HHMMZ-<surface>-<slug>.md), surface-tagged, started from TEMPLATE.md. The template forces commit SHA or artifact version, the deployment mechanism, the verification that was actually run, and the exact rollback path. Status flows planned → deployed | failed | rolled_back | canceled and gets updated after the deploy, not before.
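The naming scheme and status values can be sketched in a few lines. This is not the contents of new-changelog-entry.py, which aren't shown; it assumes the surface tags are the lowercase strings backend, android, and infra, and that the slug is derived from a free-text title. Both are illustrative assumptions:

```python
import re
from datetime import datetime, timezone

# Assumed tag spellings; the post names the surfaces but not the exact strings.
SURFACES = {"backend", "android", "infra"}
# Status values as listed in the post: an entry starts as planned and ends as one of the rest.
FINAL_STATUSES = {"deployed", "failed", "rolled_back", "canceled"}


def changelog_filename(when: datetime, surface: str, title: str) -> str:
    """Build a YYYY-MM-DD-HHMMZ-<surface>-<slug>.md name from a timezone-aware datetime."""
    if surface not in SURFACES:
        raise ValueError(f"unknown surface: {surface!r}")
    if when.tzinfo is None:
        raise ValueError("datetime must be timezone-aware so it can be converted to UTC")
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%d-%H%MZ")
    # Lowercase, collapse anything that is not a-z or 0-9 into single hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    if not slug:
        raise ValueError("title produced an empty slug")
    return f"{stamp}-{surface}-{slug}.md"


def is_final_status(status: str) -> bool:
    """True once the entry has been updated after the deploy attempt."""
    return status in FINAL_STATUSES
```

Because the timestamp leads the name, a plain directory listing sorts entries in deploy order, which is what lets a later session rebuild the sequence from disk.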

The 45 entries from the past two weeks aren't a vanity log. Two days into a future session, the agent picking up the work can rebuild what shipped, when, with which verification, and what the rollback was — from disk, not from a compaction summary. This is the disk-anchored record the context compaction post argues every long-running agent needs.

db-restore-drill/ proves restore, not just backup. A backup that's never been restored is a backup you're guessing about. The folder holds timestamped subfolders for actual drill runs against pgbackrest snapshots restored from R2. Small distinction, but it's what separates L2 from L1: configuring a backup is L1; running and recording the restore drill is L2.

Deploys are mechanically gated. Backend production deploys run only from main, require a confirm_delete_target input that must match /opt/showup/backend exactly (the rsync uses --delete), and refuse any other ref. The Procrastinate schema is bootstrap-only — the deploy script does not run schema --apply unless an explicit one-time env var is set. The intent is that no single mistyped flag can produce a destructive deploy. The runbook in docs/backend-deploy-rollback-runbook.md exists before it's needed, not after.
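The gate logic described above reduces to two exact-match checks plus an opt-in flag. A minimal sketch, assuming the workflow passes the branch name as a plain string; the function names and the environment variable name are hypothetical, since the post doesn't name the real one:

```python
from typing import Mapping

REQUIRED_REF = "main"
REQUIRED_DELETE_TARGET = "/opt/showup/backend"
# Hypothetical variable name, used only for illustration.
BOOTSTRAP_ENV_VAR = "SHOWUP_BOOTSTRAP_PROCRASTINATE_SCHEMA"


def deploy_gate_errors(ref: str, confirm_delete_target: str) -> list:
    """Return every reason the deploy must be refused; an empty list means proceed."""
    errors = []
    if ref != REQUIRED_REF:
        errors.append(f"deploys run only from {REQUIRED_REF}, got {ref!r}")
    # Exact string match on purpose: rsync --delete makes a near-miss path dangerous.
    if confirm_delete_target != REQUIRED_DELETE_TARGET:
        errors.append("confirm_delete_target does not match the deploy directory exactly")
    return errors


def should_apply_schema(env: Mapping[str, str]) -> bool:
    """Apply the Procrastinate schema only when the one-time flag is explicitly set to "1"."""
    return env.get(BOOTSTRAP_ENV_VAR) == "1"
```

Note that a trailing slash fails the match. That strictness is the design choice: the operator has to type the destructive target, not something close to it.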

Product readiness

L3 is project-dependent, but a real-money app on a real app store has several pieces that aren't optional. docs/legal-pages-readiness.md, docs/backend-launch-readiness.md, docs/play-review/, and docs/global-regions-regional-pricing-plan.md are the launch artifacts. Each one is a single named doc listing the conditions for that surface to be considered ready, who owns each item, and what evidence proves it. AGENTS.md references each so the agent doesn't have to guess what "ready" means.

One pattern worth pulling out: the catalog is generated, not hand-maintained. The Android India product catalog is 294 Google Play SKUs. Maintaining that by hand would be a permanent source of drift between backend, client, and Play Console. Instead, backend/scripts/generate_android_in_skus.py produces the canonical list with a pricing_config_hash, and both client and server check the hash before launching billing. The broader lesson: anywhere the operator would be tempted to copy values between systems by hand, write a script and have the systems compare hashes.
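One way a hash like pricing_config_hash could be computed is to serialize the catalog canonically and digest it. The sketch below is an assumption about the approach, not the contents of generate_android_in_skus.py; the SKU field names and the choice of SHA-256 are illustrative:

```python
import hashlib
import json


def pricing_config_hash(skus: list) -> str:
    """Hash a list of SKU dicts so that list order and key order don't affect the result.

    Assumption: every entry has a unique "sku" key to sort on.
    """
    ordered = sorted(skus, key=lambda entry: entry["sku"])
    canonical = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def billing_may_start(local_hash: str, expected_hash: str) -> bool:
    """Client and server each compare their own hash to the canonical one before billing."""
    return local_hash == expected_hash


if __name__ == "__main__":
    catalog = [
        {"sku": "in_099", "price_micros": 99_000_000},
        {"sku": "in_049", "price_micros": 49_000_000},
    ]
    print(pricing_config_hash(catalog))
```

Canonical serialization matters here: without sorted keys and a fixed separator, two systems holding identical prices could produce different hashes and block billing for no reason.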

What the harness buys

The payoff isn't the harness itself. It's what becomes possible because the harness exists.

The clearest example sits in docs/backend-codebase-deep-dive-2026-05-17.md. The report runs multiple validation passes across the backend's money, verification, privacy, and ops surfaces, using parallel explorer subagents — one for payments, one for account and privacy, one for check-in and rate-limiting, one for ops and admin — and a proofread pass that re-reads the report against the current worktree after migrations have moved. It ends with an explicit risk split: 30% payment and verification correctness, 25% operational edge cases and drift, 20% privacy and account/device ownership drift, 15% maintainability risk from large services, 10% latent product-contract drift.

None of that works without the harness. The parallel explorers need named skill bodies to load. The validation pass needs a stable PRD to compare against. The risk split is meaningful only because the changelog and docs let the report check its claims against what actually shipped. On an unharnessed repo, the same audit would produce a more confident, more wrong document.

The negative space

The artifacts above are the positive layer — what the agent should read, run, and produce. There's also a negative layer, and it's at least as important because it's what prevents confident wrong moves. A few rules from this repo, in roughly the order they've saved time:

  • scratch-pad/ is off-limits.
  • docs/archive/ is retained evidence, not active TODOs. An archived design doc is not an acceptance gate for new work.
  • Old Alembic revisions are not current production state. "Do not treat historical Alembic revision 20260507_0004 as current production state" is written down because it's exactly the kind of thing a confident agent gets wrong.
  • Legacy Android mockups under docs/archive/android-screens-legacy/ are not a visual compatibility target.
  • The production sender domain is required and named. Falling back to a test sender is forbidden.
  • USA is in catalog-validation only and remains payment-disabled until the owner says otherwise. The agent cannot promote a market on its own.

Each is a one- or two-line rule in AGENTS.md or the PRD. Together they are the difference between an agent that asks before acting on ambiguity and one that confidently does the wrong thing.
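In this repo the rules are prose, but the two path-based ones could also be checked mechanically. A rough sketch, assuming normalized repo-relative paths; the function name, the return labels, and the idea of passing owner-named paths as a set are all hypothetical:

```python
from __future__ import annotations

from pathlib import PurePosixPath

OFF_LIMITS = PurePosixPath("scratch-pad")
ARCHIVE = PurePosixPath("docs/archive")


def classify_path(path: str, owner_named: frozenset[str] = frozenset()) -> str:
    """Classify a repo-relative path as "forbidden", "evidence-only", or "allowed".

    owner_named holds paths the owner has explicitly named for this task.
    Assumption: paths contain no ".." segments; this sketch does not resolve them.
    """
    p = PurePosixPath(path)
    if p == OFF_LIMITS or OFF_LIMITS in p.parents:
        # Off-limits unless the owner named this exact path.
        return "allowed" if str(p) in owner_named else "forbidden"
    if p == ARCHIVE or ARCHIVE in p.parents:
        # Readable as retained evidence, never as an acceptance gate or visual target.
        return "evidence-only"
    return "allowed"
```

Rules such as the Alembic revision caveat or the payment-disabled market don't reduce to path checks, which is part of why they are written down as sentences instead.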

What it costs, what it returns

Each item costs somewhere between a few minutes and half a day to put in place. The changelog template is half an hour. The PRD pointer is one line. The repo-local skills are a few hours each, encoded from real work. The restore drill is half a day. The negative-space rules get written down as the agent surfaces edge cases.

Collectively they buy back days a week. The agent doesn't re-derive what the PRD says, doesn't invent a payment-review workflow, doesn't need a human to recover state after a long run, doesn't stumble into archived material. The harness is the operator's compounding investment — it gets thicker every week, and every layer that gets thicker makes the next week cheaper.