The AI Multiplier Isn’t the Model: It’s the Harness

Every few months a more capable model arrives, and a familiar question follows: how much faster will this make us? I think it is the wrong question. The teams pulling ahead are not the ones with early access to the best model. They are the ones who built the scaffolding that turns any model into shipped work. The model matters, but it is the part you don’t control and didn’t build. The compounding happens somewhere else: in the harness you wrap around it.

What Actually Sped Up?

Start with the frontier. In a recent essay on what it calls recursive self-improvement, Anthropic describes AI systems taking over a growing share of AI development itself. The headline numbers are striking: a large majority of merged code now authored by the model, engineers shipping many times more per day than two years ago, and the duration of tasks an agent can complete reliably roughly doubling on a months-long cadence, per measurements from the research group METR.

Read quickly, that sounds like a story about models getting smarter. Read closely, it is a story about where the work goes when generation gets cheap. Anthropic is blunt about the consequence: as writing code stops being the constraint, “human code review has become a new bottleneck.” The doing got fast. The deciding did not.

Figure: Log-scale chart from METR showing the length of tasks that frontier AI agents can complete with 50% reliability, 2019 to 2025. The length of tasks (measured by how long they take human professionals) that generalist frontier model agents can complete autonomously with 50% reliability has been doubling approximately every 7 months for the last 6 years. The shaded region represents the 95% CI calculated by hierarchical bootstrap over task families, tasks, and task attempts. Source: METR, “Measuring AI Ability to Complete Long Tasks” (2025).
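To make the doubling claim concrete, here is a minimal arithmetic sketch. It assumes the trend is a clean exponential with a fixed seven-month doubling time, which is a simplification of the METR fit, and the function names and the 60-minute starting point are hypothetical, chosen only for illustration.

```python
def growth_factor(elapsed_months: float, doubling_months: float = 7.0) -> float:
    """Multiplicative growth after `elapsed_months` of steady doubling."""
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return 2.0 ** (elapsed_months / doubling_months)


def projected_task_minutes(start_minutes: float, elapsed_months: float,
                           doubling_months: float = 7.0) -> float:
    """Task length under pure extrapolation of the trend (an assumption)."""
    return start_minutes * growth_factor(elapsed_months, doubling_months)


if __name__ == "__main__":
    # Six years at a seven-month doubling time is a little over ten doublings.
    print(f"{growth_factor(72):,.0f}x over 72 months")
    # Hypothetical starting point: a 60-minute task, projected 14 months out.
    print(f"{projected_task_minutes(60, 14):.0f} minutes")
```

Ten-odd doublings compound to growth on the order of a thousandfold, which is why a cadence that sounds modest month to month reshapes where the bottleneck sits.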

That is the pattern worth holding onto. When generation approaches free, the scarce resource is no longer production. It is judgment, review, and integration: the work of deciding what is worth shipping and who is accountable for it. That matters because judgment does not speed up just because the model got better at writing. It speeds up because you built something around the model to absorb the new volume.

The Inflection Wasn’t a Model Release

We build infrastructure for managing AI-generated assets, which means we live inside our own argument. A small team ships a lot of software, and we use AI coding agents heavily to do it. So we went looking, in our own history, for the moment things changed.

What we expected to find was a step change lined up with a model release. A better model lands, output jumps. That is the intuitive story. It is not what the record shows. Our throughput inflected, and then stayed elevated, not at a model boundary but at the point where we built reusable scaffolding around the agent: codified procedures the agent could invoke, specialized sub-agents for multi-step workflows, and automated checks that ran at the edges of every session. The capability of the underlying model kept improving in the background, steadily and usefully. But the visible change in how much we shipped tracked the tooling we built, not the version number we happened to be running.

Our Monthly Throughput vs. Harness Build-Out

1. Hooks + guardrails (Jul 2025)
2. Skills + sub-agents (Oct 2025)
3. MCP integration live (Dec 2025)

Commits per month across our repository, a deliberately noisy throughput proxy. The sustained climb tracks successive harness build-out, not any single model release: automated hooks and guardrails (Jul 2025), reusable skills and sub-agents (Oct 2025), then our MCP integration coming online at year’s end (Dec 2025), giving agents structured access to real systems. Auto-enforced work-tracking and new skills and sub-agents have compounded every month since. The underlying model kept improving in the background throughout. June 2026 is omitted as an in-progress month.
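If you want to build the same proxy for your own repository, a minimal sketch follows. It assumes you already have one ISO date per commit, for example from `git log --date=short --format=%cd`. The function names are hypothetical and this is not the script behind the chart above.

```python
from __future__ import annotations

from collections import Counter
from datetime import date


def commits_per_month(commit_dates: list[str]) -> dict[str, int]:
    """Count commits per calendar month from ISO dates (YYYY-MM-DD)."""
    counts: Counter[str] = Counter()
    for raw in commit_dates:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines in piped input
        day = date.fromisoformat(raw)  # raises ValueError on malformed dates
        counts[f"{day.year:04d}-{day.month:02d}"] += 1
    return dict(sorted(counts.items()))


def drop_in_progress_month(counts: dict[str, int], today: date) -> dict[str, int]:
    """Omit the current, incomplete month so it does not read as a dip."""
    current = f"{today.year:04d}-{today.month:02d}"
    return {month: n for month, n in counts.items() if month != current}


if __name__ == "__main__":
    # Invented sample dates, not our real history.
    sample = ["2025-07-03", "2025-07-19", "2025-10-02", "2026-06-01"]
    print(drop_in_progress_month(commits_per_month(sample), date(2026, 6, 15)))
```

Dropping the in-progress month is the one design choice worth copying: a partial month always looks like a slowdown.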

I want to be careful here, because it is easy to overclaim. Throughput is a noisy measure; more commits can mean smaller commits. So we looked at a quality signal instead: how long a single focused agent session stayed productive before it needed a human to step in. That number rose as the scaffolding matured. The agent was not just doing more. It was doing more before hitting the wall where judgment was required.

The lesson generalizes past us. If your gains arrive only when a new model ships, you are renting your productivity from someone else’s release schedule. If your gains arrive when you build, you own the curve.

What a Harness Actually Is

So what is this scaffolding, concretely? It is less exotic than it sounds, and most of it is built from primitives now publicly available in tools like Claude Code. Think of it in three parts.

The first is reusable procedures. Anything you explain to an agent more than twice should be written down once, in a form the agent can pull in on demand. A procedure for scaffolding a new endpoint. A checklist for a database change. The house style for a commit message. Each one converts a recurring explanation into a callable capability, so the agent stops relearning your conventions every session.

The second is delegation. Some work is a single step; much of it is a chain: analyze, then design, then implement, then verify. Specialized sub-agents let you hand a whole chain to a context built for it, so the main thread stays clear and each phase carries the right instructions. This is where the Model Context Protocol (MCP), the emerging standard for connecting AI tools to real systems, earns its place: it gives agents structured access to your data and services rather than a copy-pasted approximation of them.

The third is guardrails. As agents do more, the failure modes shift from “wrote bad code” to “did something reasonable in the wrong place.” Automated checks that fire at the start and end of a working session catch those drifts before they become commits. This is the least glamorous layer and arguably the most important, because it is what lets you go faster without taking on proportionally more risk.
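One way to picture a guardrail for the “reasonable thing in the wrong place” failure is a scope check at the end of a session. The sketch below is an illustration under stated assumptions, not our hook code and not any tool’s hook API: the function names are hypothetical, the list of changed files is assumed to come from something like `git diff --name-only`, and how the check is wired to a session boundary is left out.

```python
from __future__ import annotations

from pathlib import PurePosixPath


def out_of_scope(changed_paths: list[str], allowed_dirs: list[str]) -> list[str]:
    """Return changed files that fall outside every allowed directory."""
    allowed = [PurePosixPath(d) for d in allowed_dirs]
    violations = []
    for raw in changed_paths:
        path = PurePosixPath(raw)
        # is_relative_to compares whole path components, so "src/api2/x.py"
        # is not treated as being inside "src/api".
        if not any(path.is_relative_to(root) for root in allowed):
            violations.append(raw)
    return violations


def session_end_check(changed_paths: list[str], allowed_dirs: list[str]) -> int:
    """Exit-code style result: 0 when the session stayed in scope, else 1."""
    violations = out_of_scope(changed_paths, allowed_dirs)
    for path in violations:
        print(f"out of scope: {path}")
    return 1 if violations else 0


if __name__ == "__main__":
    # Hypothetical session that was only meant to touch the API and its tests.
    changed = ["src/api/users.py", "tests/test_users.py", "infra/prod.tf"]
    raise SystemExit(session_end_check(changed, ["src/api", "tests"]))
```

The check knows nothing about code quality. It only asks whether the work landed where it was supposed to, which is exactly the class of drift that review tends to miss.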

What a Harness Is Made Of

Reusable procedures

Write down anything you explain twice; the agent invokes it on demand.

Delegation

Hand multi-step chains to sub-agents built for them.

Guardrails

Checks that fire around every session catch drift before it ships.

None of these three is a model feature. They are all things you build. And they compound: every procedure you codify, every workflow you delegate, every guardrail you add makes the next unit of work cheaper. That is the actual flywheel. Not a smarter model, a deeper harness.

One more piece of discipline holds the whole thing together: knowing where each kind of knowledge belongs. A procedure you keep repeating becomes a reusable skill, and you update it the moment the convention it encodes drifts. A multi-step chain that deserves its own focused context becomes a sub-agent. But not everything is a procedure. A decision whose rationale has to outlive the session (why you chose one database pattern over another) belongs in a decision record, not a skill. A hands-on operational sequence (how to cut a release or recover from a failure) belongs in a runbook. A repeatable go-to-market or review play belongs in a playbook. The discipline is matching each piece of knowledge to its right home and keeping it current, so the agents and the humans reach for the same source of truth instead of quietly reinventing it.

The Safety Rail You Can’t Skip

There is a precondition most enthusiasm skips over. You cannot safely let agents author most of your code unless you already have the discipline to catch their mistakes.

Tests. Continuous integration. Written records of why decisions were made. Clear rules about how work merges. None of this is new, and none of it is about AI. But it turns out to be the thing that makes high-volume AI generation trustworthy rather than terrifying. Anthropic notes that model-written code went from “somewhat worse” to “roughly at parity” with human work, a comparison that is only meaningful because there is a review and test gate measuring it.

In practice that gate is not one test but a stack of them, each catching a failure the others miss. The fastest checks run first: linting, type-checking, and deeper static analysis with SonarJS, so whole classes of mistake never reach review. Property-based tests probe the space of inputs instead of a handful of hand-picked cases. Scenario tests pin down the behavior that actually matters to a user. Database-level tests, with pgTAP, exercise the data layer on its own terms. Integration tests run the seams between components with external boundaries mocked, so they stay fast and deterministic. End-to-end tests drive the real interface in a browser the way a person would. None of these techniques is new, and that is the point. What is new is how much they are worth once an agent, not a human, writes most of what flows through them.
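Of the layers above, property-based testing is the one most often unfamiliar, so here is a minimal sketch of the idea using only the Python standard library. In practice you would more likely reach for a dedicated library such as Hypothesis (Python) or fast-check (JavaScript). The function under test, `normalize_tags`, and the input generator are hypothetical and exist only for illustration.

```python
from __future__ import annotations

import random


def normalize_tags(tags: list[str]) -> list[str]:
    """Hypothetical function under test: trim, lowercase, dedupe, sort."""
    cleaned = {tag.strip().lower() for tag in tags}
    cleaned.discard("")
    return sorted(cleaned)


def random_tags(rng: random.Random) -> list[str]:
    """Generate a small random input with messy whitespace and mixed case."""
    alphabet = "abAB  "
    return ["".join(rng.choice(alphabet) for _ in range(rng.randint(0, 4)))
            for _ in range(rng.randint(0, 6))]


def check_properties(trials: int = 500, seed: int = 0) -> None:
    """Assert invariants that must hold for every input, not chosen cases."""
    rng = random.Random(seed)  # fixed seed keeps any failure reproducible
    for _ in range(trials):
        tags = random_tags(rng)
        result = normalize_tags(tags)
        # Property 1: normalizing twice changes nothing (idempotence).
        assert normalize_tags(result) == result, f"not idempotent: {tags!r}"
        # Property 2: output is sorted and free of duplicates.
        assert result == sorted(set(result)), f"not sorted/unique: {tags!r}"
        # Property 3: every tag is non-empty, trimmed, and lowercase.
        assert all(t and t == t.strip().lower() for t in result), f"unclean: {tags!r}"


if __name__ == "__main__":
    check_properties()
    print("all properties held")
```

The point is the shape of the assertion: instead of “this input gives that output,” each check states something that must be true for all inputs, which is a better match for code you did not write by hand.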

Here is the connection that is easy to miss: the boring engineering discipline is not overhead you tolerate alongside the AI gains. It is the rail that lets you push the throttle at all. A team without tests that lets agents merge freely is not moving fast. It is accumulating undiscovered failure. The teams that can genuinely accelerate are the ones that invested in the unglamorous substrate first.

That rail is something you can measure, and we do. We hold the harness accountable to the standard signals of software delivery, the four the DORA research settled on: how often we ship, how long a change takes to reach production, how often a change breaks something, and how quickly we recover when it does. The first two describe throughput; the last two describe stability. We treat all four as constraints to optimize against at once rather than a single number to maximize. The distinction matters, because this is delivery performance, not generation speed: the rate at which finished, reviewed work reaches users, never the rate at which raw output is produced. Read that way, the stability pair is the governance the rest of this argument keeps pointing at: the thing that stops fast from quietly becoming broken.
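As a rough illustration of how the four signals fall out of ordinary deployment records, here is a minimal sketch. The `Deployment` record, its field names, and the choice of medians are assumptions made for this example. They are not DORA’s official definitions and not our reporting pipeline.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                  # first commit of the change
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it break something?
    restored_at: Optional[datetime] = None  # when service recovered, if it failed


def four_keys(deploys: list[Deployment], window_days: int) -> dict[str, object]:
    """Summarize the four delivery signals over a window of deployments."""
    if not deploys or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")
    failures = [d for d in deploys if d.failed]
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        # Throughput pair
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deploys),
        # Stability pair
        "change_failure_rate": len(failures) / len(deploys),
        "median_time_to_restore": median(restores) if restores else None,
    }
```

Returning all four together is deliberate. Reporting deploys per day without the failure rate beside it is how generation speed gets mistaken for delivery performance.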

Where the Humans Go

If generation is cheap and the bottleneck is judgment, the shape of a team changes. You don’t add people to write more code; the harness already multiplies that. You add them for the parts that don’t automate: deciding what is worth building, governing how the work flows, and owning the consequences when it ships.

I think of the human contributions as sitting on either side of generation. Upstream: which problem actually matters, and is this the right approach? Downstream: is this result correct, and who answers for it? The middle, the typing, is the part that compresses. Anthropic frames the frontier version of this as people moving “towards oversight, validation, and verification” of work increasingly done by machines. At a small-team scale, it looks the same.

There is a subtler cost worth naming. More agentic output doesn’t only need more review; it needs coordination. When several agents and several people are all changing things at once, the danger isn’t bad code. It is decision overhead: too many things in flight, and no one certain which ones should proceed. Speed without governance doesn’t compound. It congeals.

So the qualities we have come to value in people have shifted. Not “writes the most code,” but:

Judgment and taste: the instinct for which work is worth doing, and when an approach is a dead end.
Ownership: being accountable for an outcome, not just the completion of a task.
Governance: keeping concurrent work coherent so velocity doesn’t curdle into chaos.
Fluency with agents: directing AI tools well, and knowing where their judgment ends and yours begins.
Collaboration over heroics: visible shared work beats solo deep dives no one else can see.

And that matters because these are the capabilities that grow scarcer, not more abundant, as generation gets cheaper. We are an AI-native company: nearly every role here is AI-augmented, and we expect everyone to direct agents as part of the work. But the reason we bring on people at all is precisely the part the agents don’t do.

We have recently grown the team with exactly that in mind, adding senior people whose value is judgment and ownership rather than raw output: someone to own the “how” of how work flows and stays coherent, and a principal engineer who pairs deep technical judgment with full-stack range. The harness makes the team faster. The people make sure that faster is also right.

Key Takeaways

The compounding gains from AI aren’t in the model you use; they are in the scaffolding you build around it.
When generation gets cheap, the bottleneck moves to judgment, review, and integration. Build for that.
A harness is three things you build, not buy: reusable procedures, delegated multi-step workflows, and automated guardrails.
Conventional engineering discipline (tests, CI, and decision records) is the safety rail that makes high-volume AI generation trustworthy. Skip it and you are accumulating hidden failure, not moving fast.
As production approaches free, hire and organize for judgment, governance, and accountability, not throughput. Speed without coordination congeals rather than compounds.

Built This Way

Numonic is an AI-first digital asset management platform with provenance tracking, compliance infrastructure, and full audit history for every asset, built by a small team using the harness approach described here. If you are building your own, I would genuinely like to compare notes.
