The Agent Harness Engineering Paradox: Building Stuff to Delete Is Harder Than Keeping It

Somewhere in your codebase right now, there's a planner-executor-evaluator pipeline that nobody touches. It was built in 2024 because the frontier model of the day couldn't hold a multi-step plan in its head. By 2025, its successor could. The pipeline still runs. It still consumes tokens. It still has tests that pass. And nobody is going to remove it, because removing working infrastructure is the one engineering task that has no natural champion.

This is the harness engineering paradox: every team building agentic systems knows that scaffolding is temporary. They've read the Bitter Lesson. They've heard "build to delete." But the deletion never happens, and it never happens for reasons that have nothing to do with awareness and everything to do with how we've structured the work.

Everyone Quotes the Bitter Lesson. Nobody Follows It.

Rich Sutton's Bitter Lesson is the most-cited principle in harness engineering. The short version: general methods that scale with computation beat hand-engineered domain structure, every time, eventually. Applied to agent systems, it predicts that the orchestration layers we build today to compensate for model limitations will become overhead as models improve.

The evidence is already in. No-code workflow builders that chained together dozens of canvas nodes? Replaced by single long-horizon agents. Tool wrappers that translated API specs into model-digestible formats? Models now read OpenAPI specs directly and write missing helpers on the fly. Planner-executor scaffolds that decomposed tasks into discrete steps? Planning, action, and reflection now merge into continuous loops inside a single trace.

The operational directive is explicit: add structure for the level of compute you have, then remove it. Context-only improvements on Terminal Bench 2.0 jumped scores from 52.8% to 66.5% with no model change. That's a 13.7-percentage-point swing from harness engineering alone. The lever is real. And it's known to be temporary.
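As a quick check on that arithmetic, here is a minimal sketch that separates the percentage-point swing from the relative improvement, two numbers that are easy to conflate. The two scores come from the text above; the function names are illustrative.

```python
def percentage_point_gain(before: float, after: float) -> float:
    """Absolute difference between two percentage scores."""
    return after - before


def relative_gain(before: float, after: float) -> float:
    """Improvement expressed as a fraction of the starting score."""
    return (after - before) / before


before, after = 52.8, 66.5  # Terminal Bench 2.0 scores quoted above

print(f"{percentage_point_gain(before, after):.1f} percentage points")
print(f"{relative_gain(before, after):.1%} relative improvement")
```

The same change reads as 13.7 points in absolute terms and roughly a quarter better in relative terms, so it is worth stating which one a harness result refers to.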

So why does the scaffolding stay?

Addition Has a Champion. Deletion Doesn't.

The incentive asymmetry is structural, not cultural. When something breaks or a capability is needed, the trigger is clear: build a component, ship it, move on. The engineer who adds the planner-executor pipeline gets credit for unblocking the team. The work is legible, the impact is immediate, and the PR has a clear "before" and "after."

Deletion has none of this. The trigger is ambiguous: the model might handle this natively now. The impact is invisible: fewer tokens consumed, slightly less latency, one fewer thing that can break. The risk is asymmetric: if you remove a component and something degrades, you broke production. If you leave it in and the model improves around it, nothing visibly goes wrong. The dead code just sits there, a quiet tax on every request.

This is why "build to delete" is a principle everyone endorses and nobody practices. It tells you what to believe, not how to act. And the patterns we reach for when we try to act on it are, on inspection, insufficient.

Our Deletion Toolbox Solves the Wrong Problem

The standard advice for building temporary infrastructure looks reasonable. Loose coupling. Feature-flagged deprecation. Composability. Version-controlled harness config. These are good engineering practices. They're also oriented entirely toward making addition reversible, not making deletion certain.

Feature flags make rollback safe. They don't make removal inevitable. A feature flag that's been flipped to "off" for six months is dead code with a dashboard entry. Modular design makes components separable. It doesn't make them expendable. Well-isolated modules accumulate indefinitely because isolation removes the pain of having them around. Version control tracks what changed. It doesn't flag what should be removed. The git history of a harness is a chronicle of additions with scattered, reluctant deletions.

Each of the standard build-to-delete strategies is a precondition for deletion, not deletion itself. They answer "how do we make this removable?" None of them answers the harder questions: when should this component die? Who decides? What triggers the removal? What happens to the 40 tests that depend on it?

The toolbox is designed for a world where the question is "can we remove this safely?" The actual question in harness engineering is "should we remove this now, and will anyone actually do it?"

The Real Asymmetry: Known Inputs vs. Ambiguous Triggers

The deeper problem is that addition and deletion operate on fundamentally different information. When you add a component, the inputs are known: a failing benchmark, a capability gap, a specific error class. When you delete a component, the trigger is a judgment call about model capability that nobody wants to be wrong about.

Consider Anthropic's shift from direct MCP tool calling to code-generation sandboxes. The result was a 98.7% token reduction, from roughly 150,000 tokens to 2,000. But this wasn't a deletion. It was a replacement: an entirely new interface architecture, validated against production traffic, rolled out with a parallel run. The old system didn't quietly disappear. It was actively displaced by something measurably better.
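The reduction figure follows directly from the two token counts. A minimal sketch, using the rounded counts quoted above and an illustrative function name:

```python
def token_reduction(before: int, after: int) -> float:
    """Fraction of tokens eliminated when usage drops from before to after."""
    return (before - after) / before


# Rounded token counts quoted in the text above.
print(f"{token_reduction(150_000, 2_000):.1%}")
```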

This points to why removal feels so different from addition. Deletion isn't "remove lines of code." It means: validate that the model handles the task natively, update the tests that assumed the component existed, update the monitoring that watched it, communicate to the team that the capability now lives in the model instead of the harness, and accept the risk that you're wrong about the model's readiness.

Harness-only changes can swing agent performance by 13.7 percentage points. That's the leverage that makes harness engineering the highest-impact work in agent development. It's also the leverage that makes touching a working harness terrifying.

What a Deprecation Ceremony Actually Looks Like

The missing practice isn't more modularity or better feature flags. It's an explicit, recurring decision point where the team confronts the question: does this component still earn its place?

In traditional software, we have deprecation protocols. APIs get sunset schedules. RFCs move through stages. Package maintainers publish end-of-life timelines. These work because the trigger for deprecation is predictable: a new version ships, usage drops below a threshold, a security flaw is discovered.

Harness engineering has none of this. The trigger for deprecation isn't a version bump or a usage metric. It's model capability growth, which is exogenous and arrives on the lab's schedule, not yours. You can't plan around it. You can only audit against it.

Here's what that audit looks like in practice. Every quarter, the team reviews each harness component against three metrics: token overhead, latency contribution, and success rate with the component disabled. That last one is the key. Running your eval suite with a component bypassed tells you whether the model has absorbed the capability. If the success rate holds, the component is a candidate for removal. If it drops, the component still earns its keep and you check again next quarter.
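One way to sketch that review in code, under stated assumptions: `run_eval` is a hypothetical hook into your own eval suite that accepts a set of bypassed component names and returns a success rate, the 1% tolerance is an arbitrary placeholder, and `fake_eval` is a stand-in with made-up numbers so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Tuple

# Hypothetical hook: runs the eval suite with the named components bypassed
# and returns the success rate as a fraction in [0, 1].
EvalFn = Callable[[FrozenSet[str]], float]


@dataclass
class AuditRow:
    name: str
    token_overhead: int      # average extra tokens per request
    latency_ms: float        # average added latency per request
    baseline_success: float  # success rate with every component enabled
    ablated_success: float   # success rate with this component bypassed

    def removal_candidate(self, tolerance: float = 0.01) -> bool:
        """True when success holds, within tolerance, without the component."""
        return self.ablated_success >= self.baseline_success - tolerance


def quarterly_audit(
    costs: Dict[str, Tuple[int, float]],  # name -> (token_overhead, latency_ms)
    run_eval: EvalFn,
) -> List[AuditRow]:
    """Bypass one component at a time and compare against the full harness."""
    baseline = run_eval(frozenset())
    return [
        AuditRow(name, tokens, latency, baseline, run_eval(frozenset({name})))
        for name, (tokens, latency) in costs.items()
    ]


def fake_eval(bypassed: FrozenSet[str]) -> float:
    """Stand-in eval (invented numbers): only the context router still matters."""
    return 0.68 if "context_router" in bypassed else 0.80


if __name__ == "__main__":
    costs = {"planner": (1200, 350.0), "context_router": (400, 80.0)}
    for row in quarterly_audit(costs, fake_eval):
        verdict = (
            "candidate for removal"
            if row.removal_candidate()
            else "keep, recheck next quarter"
        )
        print(f"{row.name}: {verdict}")
```

A single eval run is noisy, so a real audit would likely repeat each ablation several times and set the tolerance from the observed variance rather than a fixed constant.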

This isn't process theater. It's the structural counterpart to sprint planning. Sprint planning asks "what should we build?" The deprecation review asks "what should we stop building around?" Both require dedicated time, clear criteria, and someone willing to make the call.

The quarterly cadence matters because model capabilities shift on a roughly similar timescale. Checking monthly wastes effort when nothing has changed. Checking annually guarantees you're carrying a year of dead scaffolding.

The Expiration Date Is the Point

Harness engineering is the most important skill in agent development right now. Context engineering alone moved one team from 30th to 5th on Terminal Bench 2.0. Multi-agent orchestrator-worker designs outperform single-agent approaches by 90.2% on research tasks. The leverage is massive and immediate.

It's also temporary. The Bitter Lesson doesn't predict a world where harness engineering becomes irrelevant next quarter. It predicts a world where the specific components you build today will be absorbed by the model within 6 to 18 months, and new components will take their place. Harnesses don't shrink. They shift. Older scaffolding drops away while new patterns emerge for the next generation of capability gaps.

The teams that will build the best agent systems are the ones that hold both halves of this sentence at once: harness engineering is the highest-leverage work available, and every piece of it has a countdown timer. They invest heavily now because the leverage is real. They architect every component for the day it gets pulled out. They run the quarterly audit not because it's virtuous, but because dead scaffolding is dead weight, and the model progress that made your 2024 planner obsolete is about to make your 2026 context router obsolete too.

The paradox doesn't resolve by choosing between addition and deletion. It resolves by treating them as the same motion. Every component you add should arrive with three things: the tests that prove it works, the metrics that measure its value, and the eval that will tell you when it's time to let go.
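That rule can be enforced mechanically rather than by convention. A minimal sketch, assuming a hypothetical in-house registry: `HarnessComponent`, `HarnessRegistry`, and every test, metric, and eval name below are invented for illustration and do not come from any existing library.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class HarnessComponent:
    name: str
    tests: Tuple[str, ...]    # test IDs that prove the component works
    metrics: Tuple[str, ...]  # metric names that measure its value
    removal_eval: str         # eval to run with the component bypassed


class HarnessRegistry:
    """Refuses any component that arrives without its removal criteria."""

    def __init__(self) -> None:
        self._components: Dict[str, HarnessComponent] = {}

    def register(self, component: HarnessComponent) -> None:
        missing = [
            label
            for label, value in (
                ("tests", component.tests),
                ("metrics", component.metrics),
                ("removal_eval", component.removal_eval),
            )
            if not value
        ]
        if missing:
            raise ValueError(f"{component.name} is missing: {', '.join(missing)}")
        self._components[component.name] = component

    def removal_evals(self) -> Dict[str, str]:
        """Map each registered component to the eval that decides its fate."""
        return {name: c.removal_eval for name, c in self._components.items()}


registry = HarnessRegistry()
registry.register(
    HarnessComponent(
        name="context_router",
        tests=("test_router_selects_relevant_docs",),
        metrics=("tokens_per_request", "task_success_rate"),
        removal_eval="eval_suite_with_router_bypassed",
    )
)
print(registry.removal_evals())
```

Registering a component with an empty `removal_eval` raises a `ValueError`, which moves the deletion question to the moment of addition instead of leaving it for a later cleanup that may never be scheduled.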