Building a Project from Zero: My Agent Delivery Harness

This post is part of the AI Agents series. It focuses on the delivery harness around coding agents: the issue contract, worktree isolation, test sensors, subagent role boundaries, and review gates that make agent output easier to trust.

My Agent Delivery Harness started with a few articles I read earlier this year about coding-agent workflows. The practices in them felt immediately useful, so I tried applying them to projects I already had in hand. That experiment ran straight into the usual walls: an agent enthusiastically implemented an entire issue in one sweep, then added tests after the fact; sometimes it even softened the tests until they passed. Two tasks collided in the same checkout and overwrote each other's files. Halfway through a run, I would discover the local environment had never been prepared.

Most of those failures had very little to do with how smart the model was. A model can be excellent at writing code, but without a harness around the work it will still write in the wrong directory, guess through unclear requirements, and declare victory before anything has actually been verified.

I have since used and revised this setup across several internal projects. This post is for people who already use tools like Claude Code or the GitHub Copilot coding agent and are starting to feel the gap between "it can write code" and "I can trust the delivery process." If you are looking for a beginner introduction to AI coding tools, this one is probably too workflow-heavy.

The word "harness" here is close to the sense used in LangChain's "The Anatomy of an Agent Harness": Agent = Model + Harness. The model is only one part. The prompts, tools, state, execution environment, feedback loops, and constraints around it are what turn it into an agent that can get useful work done. Anthropic's "Effective harnesses for long-running agents" and Martin Fowler's "Harness engineering for coding agent users" are circling the same problem: how do you make an agent not merely act, but deliver in a way that is stable, traceable, and correctable?

The harness in this post is a set of repository-local scripts, Copilot instruction files, subagent prompts, and workflow constraints. It is not a framework that dictates your product architecture, and it is not an agent runtime that executes the model. It governs the delivery rhythm of a coding agent.

The harness does not create product features for you. It defines how an agent starts an issue, breaks down work, writes tests, verifies results, reviews changes, and opens a PR. Each task gets its own git worktree. Requirements are anchored in a GitHub issue. Local checks act as sensors. Implementation and verification are handed to different subagents. I will start with what one full issue lifecycle looks like, then walk through the repository structure, and finally show the shortest path from an empty project to a first PR.

What one issue looks like end to end

Start with the big picture. The core of the harness is a fixed issue lifecycle:

flowchart TD
    A["GitHub issue"] --> B["start-issue.sh<br/>preflight + worktree"]
    B --> P["Stage 1 planning<br/>planner writes plan + raises questions"]
    P --> H{"Human input gate<br/>all decisions resolved?"}
    H -- No --> Q["Human answers decisions"]
    Q --> H
    H -- Yes --> F["Conductor writes feature_list.json<br/>selects one passes:false feature"]
    F --> T1["tester creates/confirms RED sensor"]
    T1 --> I["implementer makes minimal product change"]
    I --> T2["tester verifies GREEN<br/>only then marks passes:true"]
    T2 --> More{"More unfinished features?"}
    More -- Yes --> F
    More -- No --> R["Stage 3 review<br/>code reviewer rereads diff fresh"]
    R --> C["Stage 4 closeout<br/>review-gate → create-pr → merge-pr → finish-issue"]

Four design choices sit underneath that diagram.

The GitHub issue is the requirements contract. Local notes and chat history do not replace it. What needs to be built and what counts as acceptance belong in the issue, fetched with gh issue view <N> --comments. If a human decision during planning changes requirements or acceptance criteria, it should go back to the issue as a comment. Execution notes such as handoffs, sensor results, and repair logs belong in progress.md. The order of authority is the GitHub issue and its comments first, then feature_list.json. The feature list is an execution checklist; it is not allowed to override the issue.

Each issue gets its own worktree. start-issue.sh creates a feature/issue-NN-<slug> branch and puts the checkout under ../<repo>-worktrees/issue-NN. Two issues can move in parallel without fighting over the same working directory.
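As a rough illustration of that naming scheme, here is a minimal sketch. The helper names (issue_id, slugify, issue_branch, issue_worktree) and the slug rules are assumptions made for this example; the real start-issue.sh may derive the branch and path differently.

```shell
# Hypothetical helpers, not taken from start-issue.sh.

# Zero-pad the issue number to two digits: 1 -> 01.
issue_id() {
  printf '%02d' "$1"
}

# Turn a title into a lowercase, hyphen-separated slug (assumed rule).
slugify() {
  printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/[^a-z0-9]\{1,\}/-/g' -e 's/^-//' -e 's/-$//'
}

# Branch name in the form feature/issue-NN-<slug>.
issue_branch() {
  printf 'feature/issue-%s-%s\n' "$(issue_id "$1")" "$(slugify "$2")"
}

# Worktree path in the form ../<repo>-worktrees/issue-NN.
issue_worktree() {
  printf '../%s-worktrees/issue-%s\n' "$1" "$(issue_id "$2")"
}

# Usage, from the main checkout (creates the branch and its worktree):
#   git worktree add -b "$(issue_branch 1 'Add gofmt check')" \
#     "$(issue_worktree myrepo 1)"
```

Because every issue resolves to a distinct directory, two agents working on issues 1 and 2 never share a working tree, even though they share one repository.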

Local scripts are sensors. There is nothing mystical about a sensor here. It is any repeatable check that reports truth through an exit code: shell tests, language tests, lint, typecheck, CI smoke checks. An agent can say the work is done, but if the sensor fails, it is not done.
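To make "truth through an exit code" concrete, here is a minimal sketch of a sensor wrapper. run_sensor is a hypothetical helper written for this post, not one of the harness scripts.

```shell
# run_sensor NAME COMMAND...
# Runs a check, prints PASS or FAIL, and returns the check's own exit code.
run_sensor() {
  sensor_name=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS $sensor_name"
  else
    sensor_status=$?
    echo "FAIL $sensor_name (exit $sensor_status)"
    return "$sensor_status"
  fi
}

# Usage:
#   run_sensor go-profile sh tests/scripts/test_go_profile.sh
```

The wrapper never interprets the output of the check. Whatever the agent reports in chat, the only thing that counts is whether the command returned zero.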

Implementation and verification belong to different roles. The conductor is the coding agent in the main chat. It manages the issue, chooses the next feature, records handoffs, commits, and opens the PR. The agents that write product code and write or verify tests are separate subagents. The point is deliberately practical: do not let the same agent write the implementation, weaken the tests, and then grade its own work as successful.

The cast of roles

For Stages 1 through 4, I use a few short names. This repository is oriented around GitHub Copilot: role prompts, skills, and instructions live under .copilot/. The lifecycle itself is not Copilot-specific, though. If you use Claude Code, you can move the same role split, feature list, and sensor discipline into Claude prompts, commands, or project documents. Each persona is just a file under .copilot/agents/, so you can inspect exactly what it was told to do by opening the corresponding .agent.md.

  • conductor: The coding agent in the main conversation. It owns the issue from start to finish. It selects the issue, breaks it into features, chooses the next step between stages, commits, and opens the PR. Once the issue workflow starts, it does not directly implement a specific feature's product code or tests. It has no separate prompt file; its behavior is defined in .copilot/instructions/workflow-tiers.instructions.md and AGENTS.md.
  • planner: The research-and-planning role. It reads existing code, finds local patterns to reuse, identifies files to touch, produces a plan, and lists questions that need human decisions. It does not write product code or feature_list.json. Its prompt is .copilot/agents/planning-subagent.agent.md.
  • tester: The verification role for one feature at a time. It first creates or confirms a RED sensor: a test or check that expresses the requirement and should currently fail. After implementation, it runs the same sensor to verify GREEN. Only when the check actually passes can it mark the feature as passes:true. Its prompt is .copilot/agents/test-subagent.agent.md.
  • implementer: The product-code role for one feature at a time. It makes the smallest product change needed to turn the RED sensor green. It is not allowed to write tests or mark passes:true. Its prompt is .copilot/agents/implementation-subagent.agent.md.
  • code reviewer: The final review role. It arrives with fresh context, receives the objective, acceptance criteria, and changed files, then rereads the workspace independently to judge spec compliance, test quality, code quality, and role-boundary violations. Its prompt is .copilot/agents/code-review-subagent.agent.md.

The easiest place to misunderstand this workflow is Stage 2. It is not "the implementer writes everything, then the tester adds tests afterward." The loop is: the conductor selects one passes:false feature, the tester creates or confirms the RED sensor, the implementer makes the smallest product change, and the tester verifies GREEN. That order is what cuts off the failure mode I kept seeing earlier.

Stage 0: start-issue.sh prepares the workspace

The whole harness is issue-driven. First, open a GitHub issue with clear acceptance criteria. Then start it from the main checkout:

./scripts/start-issue.sh 1
cd ../<repo>-worktrees/issue-01

start-issue.sh runs preflight first. Only after the environment is healthy does it create the branch and worktree. Then it lays down an empty feature_list.json template and progress.md under .copilot-tracking/issues/issue-NN/. At this point, the issue has not been decomposed yet. The script only sets the stage; decomposition waits until the planning stage that follows is complete and a human has answered the open questions.

Stage 1: planning and the human input gate

The planner goes first. It researches and plans, but writes no code: it reads the current codebase, looks for existing patterns, reasons through the files and steps involved, and writes a plan under .copilot-tracking/plans/.

The planner also has to list questions that require a human decision. If a choice should not be guessed, it has to be surfaced. Once the plan and question list are ready, the conductor pauses and relays those questions to me. The workflow stops at a human input gate. Until the questions are cleared, nobody starts the feature list.

After I answer the questions and confirm the direction, the conductor translates the approved plan into feature_list.json. That list becomes the execution index for the loop that follows, but the authoritative requirements still live in the GitHub issue.

A freshly decomposed feature_list.json looks roughly like this. Every feature starts as passes: false:

{
  "issue": 1,
  "features": [
    {
      "id": "F1",
      "title": "init.sh runs gofmt checks when go.mod is present",
      "steps": [
        "Add gofmt -l to the quality gate in go.profile.sh",
        "Make init.sh exit non-zero when unformatted files are present"
      ],
      "regression_sensor": "tests/scripts/test_go_profile.sh",
      "e2e_sensor": "",
      "passes": false,
      "verification": ""
    }
  ]
}

And passes:true cannot be just a claim. The closeout script check-feature-list.sh validates the JSON structure and requires every passes:true feature to have non-empty verification. It is a small rule, but it blocks the shortcut of marking work complete without a real check.
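The rule is small enough to sketch. The version below assumes jq is installed and uses a hypothetical function name, check_feature_list; the real check-feature-list.sh may be implemented differently and may validate more fields.

```shell
# check_feature_list FILE
# Assumption: jq is available. Fails when the JSON is malformed, when
# "features" is not an array, or when a passes:true feature has no evidence.
check_feature_list() {
  jq -e '.features | type == "array"' "$1" >/dev/null 2>&1 || {
    echo "invalid feature list: $1" >&2
    return 1
  }
  # Ids of features marked done with an empty or missing verification field.
  unverified=$(jq -r '.features[]
    | select(.passes == true and ((.verification // "") == ""))
    | .id' "$1")
  if [ -n "$unverified" ]; then
    echo "passes:true without verification: $unverified" >&2
    return 1
  fi
}

# Usage:
#   check_feature_list .copilot-tracking/issues/issue-01/feature_list.json
```

A feature with passes: false and an empty verification field is fine; only a claimed success without evidence is rejected.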

Stage 2: the implementation and verification loop

Stage 2 is the heart of the system. It handles exactly one passes:false feature at a time, then repeats until everything is green.

Each round goes like this:

  1. The conductor selects one feature. It prepares the issue context, feature steps, and declared sensor, then hands verification to the tester.
  2. The tester creates or confirms the RED sensor. The tester writes or updates the test and confirms it fails in the current state. This is about proving the requirement is observable, not about adding documentation after the fact.
  3. The implementer makes the smallest product change. The implementer touches product code only, with the sole goal of making the sensor turn green. It does not edit tests or declare completion.
  4. The tester verifies GREEN. The tester reruns the declared sensor. Only after it passes can the tester flip passes to true and write the verification evidence.
  5. The conductor records the handoff. Who did what, what ran, and what happened all get written to the issue progress record.
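
The ordering in steps 2 through 4 can be sketched as a single gate. In the harness these steps are carried out by separate subagents, so red_green below is only a hypothetical stand-in that shows the sequencing: the sensor has to fail first, and the same sensor has to pass afterward.

```shell
# red_green SENSOR_CMD IMPLEMENT_CMD
# Both arguments are shell command strings. They stand in for the tester's
# sensor and the implementer's change (illustrative only).
red_green() {
  rg_sensor=$1
  rg_implement=$2

  # RED: the sensor must fail before any product change is made.
  if sh -c "$rg_sensor" >/dev/null 2>&1; then
    echo "not RED: sensor already passes" >&2
    return 1
  fi

  # Implementer step: the product change.
  sh -c "$rg_implement" || return 1

  # GREEN: the same sensor must now pass.
  if sh -c "$rg_sensor" >/dev/null 2>&1; then
    echo "GREEN"
  else
    echo "still RED after implementation" >&2
    return 1
  fi
}

# Usage:
#   red_green "test -f build/output.txt" "mkdir -p build && touch build/output.txt"
```

A sensor that already passes before the change is rejected, because it proves nothing about the feature.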

Splitting the two subagents is intentional. The agent writing the implementation cannot casually weaken the test. The agent writing the test cannot quietly patch product code. A prompt is not a perfect sandbox, but it makes the responsibility boundaries explicit, and the later review and progress record check those boundaries directly.

Stage 3: whole-change review

Even after every feature is green, the workflow does not jump straight to a PR. The code reviewer arrives with fresh context. It has not seen the planning discussion or implementation loop. It only receives the issue objective, acceptance criteria, and changed files. Everything else has to come from rereading the workspace.

That fresh read is the point. The reviewer should not be carried by the same conversation that produced the diff. It checks four things at once: whether the change satisfies the specification, whether the sensors are meaningful, whether the code quality holds up, and whether any role boundaries were violated. Serious findings have to be fixed and re-reviewed before closeout.

Review-oriented skills belong here, not scattered through every role. find-brute-force catches hacks, swallowed errors, and hardcoded shortcuts. find-duplicates catches copy-pasted logic. find-over-design looks for over-abstracted solutions. dead-code-detection checks dead-code risk introduced by the diff. sync-docs catches documentation drift. public-exposure-audit checks public-repo risks such as secrets, personal data, cloud IDs, and customer material. These skills focus on issues introduced by the current diff; they are not an excuse for an unrelated full-repo cleanup.

Stage 4: closeout

The closeout stage is intentionally a little fussy, because this is where delivery workflows often leak.

First, ./scripts/review-gate.sh approve records the current HEAD as the reviewed commit. create-pr.sh checks that marker before opening a PR. If you commit again or rebase after approval, the marker no longer matches and review has to be approved again. That prevents the classic mismatch where commit A was reviewed but commit B is what gets sent.
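
The marker idea fits in a few lines. This is a sketch under assumptions: the marker file name (.review-approved) and the function names are invented for illustration, and the real review-gate.sh and create-pr.sh may store and compare the commit differently.

```shell
# Hypothetical marker location.
REVIEW_MARKER=${REVIEW_MARKER:-.review-approved}

# Record the commit that was reviewed.
review_approve() {
  printf '%s\n' "$1" > "$REVIEW_MARKER"
}

# Succeed only if the given commit is the one that was approved.
review_check() {
  [ -f "$REVIEW_MARKER" ] || { echo "no review recorded" >&2; return 1; }
  approved=$(cat "$REVIEW_MARKER")
  if [ "$approved" != "$1" ]; then
    echo "HEAD $1 differs from reviewed commit $approved" >&2
    return 1
  fi
}

# Usage inside a git checkout:
#   review_approve "$(git rev-parse HEAD)"
#   review_check   "$(git rev-parse HEAD)" && echo "ok to open PR"
```

Any new commit or rebase changes the value of git rev-parse HEAD, so the comparison fails until the review is approved again.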

PR creation is backed by a create-pr skill, so the title and body are structured, linked back to the issue, and include the acceptance criteria instead of opening an empty PR. After the PR is open, merge-pr.sh confirms the harness CI smoke workflow is green before merging. After merge, finish-issue.sh runs from the main checkout, removes the worktree, and verifies the feature list again.

I recommend adopting this in stages. When a repository first takes on the harness, let the agent open the PR and then stop for human review. After a few cycles, when you trust the feature list, sensors, and review gate, you can allow the agent to run merge-pr.sh when all gates and PR checks are green. The scripts support CI-green merging, but a team does not need to grant that level of autonomy on day one.

One clarification matters: the CI smoke workflow is not CI/CD delivery. It is not a deployment pipeline, and it is not auto-merge by itself. It only runs the harness's own shell sensors, checks that scripts can parse, runs shellcheck, and validates Copilot configuration front matter. It is a health check for the harness itself, and it is a hard prerequisite before merge.
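
As one example of what "scripts can parse" might look like, here is a minimal sketch that uses the shell's -n flag, which reads a script and reports syntax errors without executing it. parse_check is a hypothetical name; for Bash-specific scripts you would likely swap sh -n for bash -n, and the actual smoke workflow may do this differently.

```shell
# parse_check FILE...
# Returns non-zero if any script fails to parse. Nothing is executed.
parse_check() {
  parse_failed=0
  for script in "$@"; do
    if ! sh -n "$script" 2>/dev/null; then
      echo "syntax error: $script" >&2
      parse_failed=1
    fi
  done
  return "$parse_failed"
}

# Usage:
#   parse_check scripts/*.sh
```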

There is also a situational security-audit skill. If an issue touches auth, Azure provisioning, or data movement, closeout should include a pass over credential handling, secret exposure, permissions, and data classification. That kind of audit is an inferential review sensor rather than a deterministic shell script. The workflow gets its force from what happens next: Critical or High findings have to be fixed and the relevant checks rerun, not merely listed in a report.

The repository's three-layer structure

Once the lifecycle is clear, the repository layout is easier to read. I separated stable lifecycle machinery from the parts that vary by language, so adding support for a new stack does not require rewriting the core flow.

The first layer is the Core Harness. It owns the language-agnostic lifecycle: environment preflight, worktree creation, issue progress tracking, review gates, and PR closeout. Its behavior is pinned in docs/harness-contract.yml, and tests/scripts/test_harness_contract.sh watches that contract in CI. If a core script changes without a matching contract update, the test fails.

The second layer is Language Profiles. Each language gets a profiles/<id>.profile.sh file that tells init.sh how to detect that kind of project and which checks to run. The current built-ins cover Python, Go, Node.js, Java, and Ruby. The core does not know the details of any of them; it only loads the matching profile.

The third layer is called Framework Templates in the repository docs, but I think of it more directly as Project Conventions. Project-level choices such as FastAPI in Python or Spring Boot in Java should live in the adopting repository's own documents, not in the core harness. A profile can hint at framework conventions, but it should not force them.

From zero: connecting a project to the harness

Assume you have nothing yet and want to start a new project with this harness. The shortest path looks like this.

Step 1: get the harness

The simplest route is to clone this repository as the project root and put your application code under apps/, packages/, or infra/. Leave scripts/ and profiles/ where they are.

If you already have a project, treat this as a port rather than an install. You can copy only the harness pieces you need: scripts/, profiles/, tests/, .copilot/, the smoke workflow, and the lifecycle docs under docs/. Existing repositories usually already have their own CI, scripts, docs, and branch rules, so expect to adjust paths, gates, and team conventions. Do not expect a blind copy-paste to work unchanged.

Step 2: confirm prerequisites

The hard requirements are small: macOS or Linux, an authenticated GitHub CLI, and a working Git setup. Commit signing is not required for every open-source project; today init.sh warns when signing is not enabled rather than blocking. If your company or repository policy requires verified commits, though, configure that before allowing the agent to commit, push, or merge.

gh auth login          # required, everything later depends on it

Azure CLI is only needed once you start touching Terraform or cloud deployment; then use REQUIRE_AZ=1 ./scripts/init.sh. Language toolchains, such as Python's uv, are only needed once code for that language appears. Before that, preflight warns but does not block.

Step 3: choose a language

Pick the first language your project will use. Each language is detected by a marker file and gets its own checks:

| Language | Detection file | Checks |
| --- | --- | --- |
| Python | pyproject.toml | ruff format, ruff lint, mypy, pytest |
| Go | go.mod | gofmt, go vet, optional golangci-lint, go test |
| Node.js | package.json | prettier, eslint, optional tsc, test script |
| Java | pom.xml / build.gradle | optional Spotless, Checkstyle-family checks, tests |
| Ruby | Gemfile | standardrb or RuboCop, RSpec or Minitest |

A project can contain multiple languages. init.sh detects and checks each one. Terraform files also trigger terraform fmt and validate. If your repository only contains documentation, as this harness did at one point, the only required check is shellcheck for scripts.
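
Marker-file detection is easy to picture. The sketch below is a simplified, hypothetical detect_languages function; in the harness each profile carries its own detection logic, and the ids printed here (other than go, which matches go.profile.sh) are assumptions.

```shell
# detect_languages DIR
# Prints one language id per line, based on marker files in DIR.
detect_languages() {
  dir=${1:-.}
  [ -f "$dir/pyproject.toml" ] && echo python
  [ -f "$dir/go.mod" ] && echo go
  [ -f "$dir/package.json" ] && echo node
  if [ -f "$dir/pom.xml" ] || [ -f "$dir/build.gradle" ]; then
    echo java
  fi
  [ -f "$dir/Gemfile" ] && echo ruby
  return 0
}

# Usage:
#   for lang in $(detect_languages .); do
#     echo "would load profiles/$lang.profile.sh"
#   done
```

A directory with both go.mod and package.json yields two ids, which is how a multi-language repository ends up running both sets of checks.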

Step 4: run preflight

./scripts/init.sh

It checks GitHub authentication, Azure login state, commit-signing configuration, detected language surfaces, and quality gates. Hard failures exit with repair guidance. Soft checks, such as a missing language toolchain before any matching code exists, warn but let you continue.

Preflight is roughly six gates: required tools, GitHub authentication, Azure authentication, commit signing, project-language detection, and quality gates. If the early hard checks fail, the later quality gates do not run.
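
The split between hard and soft checks can be sketched as two small helpers. Everything here is illustrative: hard_check, soft_check, preflight, and the particular checks listed are assumptions for this example, not the contents of init.sh.

```shell
PREFLIGHT_WARNINGS=0

# hard_check LABEL COMMAND...: a failure should stop preflight.
hard_check() {
  label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "ok    $label"
  else
    echo "FAIL  $label" >&2
    return 1
  fi
}

# soft_check LABEL COMMAND...: a failure only counts as a warning.
soft_check() {
  label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "ok    $label"
  else
    echo "warn  $label" >&2
    PREFLIGHT_WARNINGS=$((PREFLIGHT_WARNINGS + 1))
  fi
}

# Hard gates run first; later checks are skipped if one of them fails.
preflight() {
  hard_check "git installed" command -v git || return 1
  hard_check "gh authenticated" gh auth status || return 1
  soft_check "uv available" command -v uv
  echo "preflight passed with $PREFLIGHT_WARNINGS warning(s)"
}
```

Putting the hard gates first means a missing GitHub login is reported once, instead of being buried under a wall of downstream lint failures.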

Step 5: open your first issue

Open a GitHub issue with clear acceptance criteria, then start it from the main checkout:

./scripts/start-issue.sh 1
cd ../<repo>-worktrees/issue-01

Now the workspace is ready. From there, follow Stages 1 through 4: plan, resolve questions, decompose into feature_list.json, then run the RED/GREEN loop one feature at a time.

Next: evaluating the harness itself

So far, quality is guarded by script sensors, role boundaries, and the review gate. But the harness itself is an agentic system, and ordinary unit tests are not enough. Whether skills trigger correctly, whether subagents respect their boundaries, and whether handoffs drift out of order should all become things we can evaluate.

My next step is to add evaluation around the harness. Deterministic script lifecycle tests will guard L0/L1 behavior. Model-judged parts can use LLM-as-judge or mutation evals to test role boundaries, prompt injection resistance, secret handling, and signing preservation. The full design starts in docs/evaluation/, with the first executable slice under docs/evaluation/l0-l1-solution/. That part is still a plan, not a finished system.

Why this is worth the trouble

The first time through this workflow, it feels like a lot of steps. My own experience is that it removes the fuzzy space between me and the agent. Ambiguous requirements? Put them in the issue. Environment not ready? Preflight catches it. Tests getting cut down after the fact? Role separation makes that harder to hide. Need another language? Add a profile without touching the core. Every gate maps back to a failure mode that actually hurt me before.

Stepping back, the big shift over the last couple of years is that coding agents can genuinely carry work now. Since late last year especially, tools like Claude Code have become capable enough that work I once wrote line by line can often be handed to an agent, and development speed jumps. But the ability to write code is not the same thing as permission to let go. The interesting problem is how to trust the result without watching every keystroke, and how to keep reducing the amount of human intervention required.

This harness is the result of reading what others have learned, then building something I could actually use. I have already used it to kick off several new internal projects. As an FDE, I often move between new projects and have to recreate the working rules each time. Harness engineering helps because it turns those rules into something portable. The cost of switching projects gets lower.

If you use AI agents for coding and keep feeling that they are "very capable but hard to trust," try a few pieces first: force requirements into issues, isolate tasks with worktrees, and split testing from implementation. Those changes alone can save a lot of cleanup time. The full repository is agent-delivery-harness, and its docs/getting-started.md and docs/HARNESS.md go deeper than this post.