← All articles

What is an AI agent harness? From test frameworks to coding agents

The word 'harness' is everywhere in AI now. Here's what it actually means, why coding agents like Claude Code and Codex are harnesses, and what you still need to build on top of them.

12 min read

If you've spent any time around AI in the last year, you've seen the word harness used in three different ways within the same paragraph. People talk about the "Claude Code harness," the "SWE-bench harness," "writing a harness for your own agent." The word does a lot of work and rarely gets defined.

It's worth pinning down, because the shape of your harness decides what your agent can actually do. A great model inside a weak harness is a chatbot. An average model inside a strong harness can write, test, and ship a feature while you're in a meeting.

This article walks through what a harness is, where the term came from, how it applies to modern coding agents, and what the current generation of harnesses still can't do on their own.

"Harness" before AI: a word with three jobs

The word wasn't invented for LLMs. Software engineers have been using it for decades, and the older meanings are still the clearest way to understand the new one.

1. The test harness

The most common pre-AI meaning. A test harness is the code that sets up fixtures, runs your test functions, collects results, and reports pass/fail. JUnit, pytest, Jest, Go's testing package — these are all test harnesses. Your tests are the payload. The harness is the machinery that puts the payload through its paces and records what happened.

The defining feature is that the harness drives the thing under test. You don't call the harness from your tests; the harness calls your tests. It owns the loop.

2. The benchmark harness

When researchers report "model X scores 62% on SWE-bench," they're reporting a score from a benchmark harness. The harness clones a real repository at a specific commit, feeds a bug description to the model, lets the model propose a patch, applies the patch, runs the project's test suite in a container, and scores the result. Different harnesses produce different scores for the same model on the same benchmark — a fact that surfaces every time a new leaderboard appears. The harness defines how many attempts are allowed, which tools the model can use, what the cleanup between attempts looks like. Change any of those and you change the number.

3. The physical harness

The original metaphor. A harness is what you strap onto a draft horse so its muscle power can be connected to a cart. The horse has the force. The harness makes that force usable — it channels, constrains, and directs.

Every software use of the word inherits from this image. The payload has the capability. The harness is what makes the capability usable for a specific job.


What an AI agent harness is

A modern language model, by itself, produces one thing: text. You give it a prompt, it gives you tokens. It can't read a file on your disk. It can't run a command. It can't check whether its suggested code compiles. It doesn't even remember what happened five minutes ago unless you pass that history back in on the next call.

A harness is everything you build around the model to turn "produces text" into "gets work done." The essential pieces are always the same:

  • The loop. Call the model, inspect the response, do something based on it, feed the result back into the next call. Repeat until the agent says it's finished or you decide to stop it. Without the loop, you have a single model call — which is an autocomplete, not an agent.
  • The tools. A set of functions the model is allowed to call. read_file(path), edit_file(path, old, new), run_bash(command), search_web(query). The model doesn't execute these — it asks for them, and the harness runs them and returns the result.
  • The system prompt. Instructions that ride at the top of every request. This is where "you are a coding agent working on repository X; you follow these conventions; here is the file layout" lives. The system prompt is a huge part of what makes two harnesses wrapping the same model behave completely differently.
  • The context window management. Real tasks produce more text than fits in a context window. The harness decides what to keep, what to summarize, what to throw away. Bad context management is one of the most common reasons an agent "forgets" what it was doing.
  • The exit conditions. When does the loop stop? On a specific tool call like finish()? When the model stops asking for tools? After N iterations? On an error? The harness owns this decision.

That's the whole shape. Loop, tools, prompt, memory, exits. Any system calling itself an "agent" is some combination of these pieces around a model.


The framework analogy is closer than it looks

Developers coming from web or backend work recognize this pattern immediately. A web framework like Rails or Express or Spring owns the main loop — it accepts the request, routes it to your handler, passes the response back to the client. You don't call Rails; Rails calls you. The technical name for this is inversion of control, and it's the defining feature of a framework as opposed to a library.

An AI harness is inversion of control with a language model as the handler. The harness owns the loop. On each iteration it calls the model — which is your "handler" — and the model returns either a tool call ("please read file X") or a final answer. The harness routes tool calls to their implementations and feeds the results back in. The model is plugged in; the harness drives.

The analogy holds even further:

  • Frameworks have conventions. Rails has "convention over configuration" — directory layouts, naming rules, implicit behavior. Harnesses have the same thing in the form of system prompts and tool descriptions. Two agents behave differently largely because of conventions the harness imposes, not because of the model.
  • Frameworks have middleware. A request in Express passes through auth middleware, logging, body parsing. A tool call in an agent harness often passes through permission checks, logging, retries, and rate limits.
  • Frameworks have a plugin ecosystem. The MCP (Model Context Protocol) spec that Claude Code, Codex, and others support is effectively a plugin system for agent tools — a way to register new capabilities without modifying the harness itself.

If you already know how to think about frameworks, you already know most of what you need to know about harnesses. Just replace "request" with "user prompt" and "handler" with "model."


Coding agent harnesses, concretely

Now apply this framing to tools you've actually used.

Claude Code

Claude Code is a harness around Anthropic's Claude models. Open the binary and you can see the shape: a terminal REPL runs the loop, a fixed set of tools (Read, Edit, Write, Bash, Grep, Glob, WebFetch, Task, TodoWrite, and a few more) is exposed to the model, a system prompt establishes that this is a software engineering session in the current directory, and MCP support lets you plug in additional tools. The model is the muscle. Claude Code is the cart.

Two things worth noting: you can swap the underlying Claude model (Opus, Sonnet, Haiku) without changing the harness. And you can't take Claude Code's tool set and use it with a different model family — the harness and the model vendor are bundled.

OpenAI Codex CLI

Different team, different language, same shape. An agent loop around GPT-class models, a similar tool inventory (shell, file edit, apply-patch), a system prompt tuned for code tasks, and an approval flow for destructive commands. The differences from Claude Code are in the details — how approvals work, how diffs are applied, what happens when a command times out — but the architecture is the same framework-around-model pattern.

Aider, Cursor, Windsurf, Continue

Each is a harness with a distinct take. Aider is a terminal harness with a strong opinion about git — every change is committed, and the loop is tuned for small, reviewable diffs. Cursor is a harness embedded in an IDE, where the tools include "see what the user is looking at" and "apply an edit the user can accept or reject." Windsurf and Continue are variations on the IDE-embedded theme.

The model under the hood is often the same. The product differences — what each is good at — are overwhelmingly harness differences: which tools exist, how the loop terminates, how the system prompt frames the task, how context is managed when files are large.

Home-grown harnesses

Plenty of teams write their own. A support team might build a harness whose tools are "search Zendesk," "draft reply," "look up customer history." A data team might build one whose tools are "run SQL," "describe schema," "write a notebook cell." They're not "coding agents," but they're the same architectural pattern — a loop around a model with a task-shaped tool kit.

This is why the term gets used loosely. Any framework around a model for a specific job is a harness.


The limits of a single-agent harness

Here's where the story gets interesting for anyone trying to get real work out of these tools.

A harness like Claude Code or Codex runs one agent, in one terminal, on one machine, in one directory. That's the scope it's designed for, and inside that scope it's good. Once your ambitions step outside that scope, you hit a wall that the harness alone can't knock down.

One agent at a time

You can't dispatch five Claude Code instances at five features and watch them work. You can technically run five terminals with five Claude Code processes, but they'll all edit the same working tree, fight over the same node_modules, start five dev servers on port 3000, and collide on the same git state. The harness has no notion of "another agent also exists." It assumes it owns the machine for the duration of the session.

Permission prompts block autonomy

Most harnesses default to interactive approval for destructive commands — rm, git push, migrations, anything that looks risky. That's the right default when the developer is watching. It's a disaster the moment you try to leave the agent unattended. Step away for 15 minutes and you come back to an agent that has done nothing for 14 of them because it was waiting for you to click yes. (We wrote a whole piece on this: why coding agents freeze on permission prompts.)

No real isolation

The harness runs on your host. Your filesystem, your keychain, your GitHub token, your SSH keys, your .env files. Good agents don't misuse these. "Good" is not a guarantee. The one time an agent's plan involves rm -rf something is the time you learn your harness doesn't sandbox its tools, just asks you nicely before running them.

No orchestration layer

There's no dashboard, no "list of runs," no way for a teammate to see what your agent is up to. If you want that, you build it. Which is where the next layer comes in.


The orchestration harness: a harness around harnesses

When you want to run many coding agents — in parallel, unattended, safely isolated — you need a different kind of harness. Not a replacement for Claude Code or Codex, but a layer above them that handles what a single-agent harness intentionally doesn't.

This is the role Trimo plays. It's a harness whose "payload" is another harness.

Concretely, Trimo takes on the pieces that a per-agent harness leaves unsolved:

  • Isolated execution environment. Every run happens inside a fresh Docker container with the agent of your choice pre-installed — Claude Code, Codex, or your own. The container has the repo, a writable branch, any sidecar services the project needs (Postgres, Redis, a headless browser), and nothing else. The agent can't reach your keychain, your host filesystem, or your other running projects. When the run ends, the container is torn down.
  • Service orchestration per run. Declare databases, caches, queues, or browsers once in a services.json file. Each run gets its own instances on its own private network. No shared Postgres where two parallel agents corrupt each other's test data. No port-5432 collisions.
  • Parallel dispatch. Start five runs at once against five different tasks, each on its own branch, each in its own container. The ceiling is your machine's memory, not the harness's single-tenant assumption.
  • No interactive prompts in the loop. Because the container is a sandbox, the agent can run destructive commands without stopping to ask. If it makes a bad call, you throw the container away. No damage escapes.
  • A real dashboard. You see which runs are active, what each agent is doing right now, and the full transcript. When a run finishes, open a terminal in the container and verify the result. You can course-correct an agent mid-run or kill it if it's off the rails.
  • Git and review flow. Every run produces a branch and, optionally, a PR. Review in your normal review tool. Merge what's good, discard what isn't. The agent's output enters your codebase only through the same gate human work goes through.

The important distinction: Trimo isn't competing with Claude Code or Codex. It's a layer those tools slot into. Claude Code is still the coding harness. Trimo is the harness around it — the thing that lets you have five of them running at once, safely, while you're doing something else.


How to think about this going forward

The model gets the attention. It's the part that advertises leaderboard scores. But when you actually use these tools to do real work, how the harness is shaped matters at least as much as which model you chose.

Three practical takeaways:

  • When you're comparing coding agents, you're comparing harnesses. "Claude Code vs Codex" is mostly a harness comparison — tool inventory, system prompt, context management, approval flow. The underlying model is a variable both sides can change.
  • When a harness feels annoying, don't fight it — look one layer up. If Claude Code's single-session model is limiting you, the fix isn't to write a wrapper script that spawns five Claude Code processes and hopes. It's to adopt an orchestration layer whose job is to manage those runs.
  • When you're building your own agent, remember the pattern. Loop, tools, system prompt, context management, exits. Good harnesses are surprisingly small once you see the shape. Bad ones are the ones that try to put agent behavior into the model and leave the harness to be an afterthought.

The word "harness" is going to stick around because it describes something real. An LLM is a capability the way a horse is a capability — enormous, underneath you, and useless without the rig that connects it to the work.

Try Trimo for orchestrating coding agents — the harness around your harness.


Related articles