The AI Coding Landscape in 2026: Labs, Harnesses, and Cost

12 June 2026 | Eric

Table of content

The recent release of Fable, the latest Anthropic model, got me thinking about the cost of AI coding agents. I spent the last few weeks evaluating coding agents with the following questions in mind:

Few leaderboard ranking provide the full picture when it comes to coding agent. Three things matter more than the pure “Intelligence” score, in roughly this order: which harness wraps the model (determine token-efficiency), what task complexity you are actually facing, and only then the model itself. This post walks through each, with the numbers I found.

None of this is an argument against any particular tool. I use several of them. It is just the economics, laid out plainly, so you can pick the stack that fits your work.

The US and China model gap has mostly closed

In 2023 the performance gap between the best American and Chinese models was wide, somewhere between 17.5 and 31.6 percentage points on major benchmarks. By March 2026 that gap was about 2.7 points between the top American model and the best Chinese one (Stanford AI Index coverage).

The cost gap went the other way and widened. Chinese open-weight models run at roughly 15 to 34 times less than American equivalents for comparable capability (aimadetools), and a RAND analysis put Chinese systems at one quarter to one sixth of the cost (SemiAnalysis). DeepSeek trained V3 for around $6M against an estimated $100M for GPT-4. So the practical picture today is a few percent of capability for an order of magnitude or more in price.

The word “open-weight” is doing important work here, and it is worth being precise. Open-weight means the model parameters are published (DeepSeek under MIT, most Qwen variants under Apache 2.0 or a source-available license). The training data and pipeline are not open, so this is not open-source in the full sense. The privacy concern people raise about Chinese models is real, but it applies to the hosted API, where data is stored in China under Chinese law. A set of weights you download and run on your own GPU is a deterministic function, they do not “phone home”. Once self-hosted, they are isolated. Self-hosting can be cumbersome, but local providers can help with that.

American models cost more partly because the labs need to make money

It helps to remember that frontier model pricing is not set purely by inference cost. The American labs are funding a business that currently runs at a loss. OpenAI is projected to lose around $14B in 2026 and is not targeting profitability until the 2030s (Forbes). Anthropic’s gross margin was deeply negative in 2024 and improved through 2025, with profitability targeted around 2028 (Forbes).

The premium we pay for a top American model is partly capability and partly the cost of subsidizing a company that has promised investors a return. That is a perfectly normal thing for a business to do. It just means the sticker price is not a clean signal of how much the computation actually costs, and it explains why a Chinese open-weight model within a few points can be radically cheaper, and a massive challenge for American labs to compete with.

The harness matters as much as the model

This is the finding that reorganized how I think about the whole space. The harness is the tool that wraps the model: system prompt, tool definitions, context management, retry logic. The same model scores very differently depending on which harness it runs in.

The numbers are not subtle. In one comparison Claude Opus scored 77% inside Claude Code and 93% inside Cursor, a 16 point swing from harness alone. On CORE-Bench the same model went from 42% under a minimal scaffold to 78% under a full harness, a 36 point swing (thoughts.jock.pl).

Same modelHarness AHarness BGap
Claude Opus77% (Claude Code)93% (Cursor)+16
Claude Opus (CORE-Bench)42% (minimal)78% (full harness)+36

The takeaway is that you cannot compare two agents by asking which model they use. The harness is a multiplier on top of the model, and it is often the bigger lever. Similarly, the VILA-Lab study found that the harness is a bigger determinant of cost-efficiency than the model itself.

Only 1.6% of Claude Code’s codebase is AI decision logic. The other 98.4% is deterministic infrastructure

My own experience create RAG pipelines reflect that as well. Unfortunately, the last layer (AI summarization) often gets all the credit.

Token efficiency is a harness property, not a model property

Cost-efficiency follows directly from this. A harness that burns four times the tokens to reach the same result is a hidden cost multiplier, regardless of what you pay for the subscription. By the same source, Codex CLI used roughly three to four times fewer tokens than Claude Code for equivalent tasks. That is an architecture difference, not a model difference.

This is why I keep coming back to OpenCode. It is bring-your-own-key, so you pay the API directly and you are free to point it at whatever model is cheapest for the job. The harness and the model become independent choices, which is exactly what you want when you are optimizing cost. You can run a cheap open-weight model in an efficient harness and capture most of the value of an expensive stack.

ToolCost modelModel choice
OpenCodeBYOK, pay API directlyAny model
Cursor$20/mo, some BYOKRouted, partial freedom
Claude Code$20/mo (Pro), $100/mo (Max)Anthropic only

Anthropic is currently running the Cloud Provider playbook. They are not in a good position to compete with cheap Chinese models. They are running at a loss, so their priority is to capture market share and lock in customers by offering credits. I recommend companies to think carefully before locking in to a proprietary harness, and it doesn’t matter which one you pick even though, the cost of switching is still very low compared to switching cloud providers.

The expensive model is often not the cost-efficient one

When you look at cost per resolved problem rather than raw score, the ranking inverts. The models cluster within single digits of each other on SWE-bench while the price per problem spreads across more than an order of magnitude (AgentMarketCap).

ModelSWE-bench resolvedCost/problem
Claude Opus 4.665.3%$1.12
GPT-5.4 (medium)62.8%$0.63
Gemini 3.1 Pro62.3%$0.66
Claude Sonnet 4.660.7%$1.02
Step-3.5-Flash59.6%$0.14

Step-3.5-Flash sits about five points below Claude Opus and costs roughly an eighth as much per problem. Across the full range, the output-token price gap between Claude Opus and the budget models reaches about 62 times. For a lot of everyday work, five points of accuracy is not worth eight times the bill.

That said, the premium models earn their keep on hard work. A 100-task benchmark from SitePoint found Cursor more cost-efficient on simple, high-frequency tasks while Claude Code pulled ahead on full-feature, multi-file implementations, with the crossover around multi-file modules (SitePoint). So the honest answer is task-dependent: cheap models for the bulk of the volume, premium models for the genuinely complex pieces.

Benchmarks are one data point, not the answer

A few caveats keep me from trusting any single leaderboard. SWE-bench Verified is saturating near 90%, which means it has lost most of its discriminating power at the top. Harness effects mean the published number depends heavily on the scaffold. And some tools, like Cursor, report their own internal benchmark rather than the public one, which makes direct comparison harder.

The practical consequence is that the only benchmark that answers your question is your own codebase with your own task mix. Independent aggregators like Artificial Analysis are useful for triangulation, but they are a starting point, not a verdict. Though, special mention to Artificial Analysis for including harness efficiency in their scoring, this is getting interesting.

What I settled on

As of today, I essentially run on 2 distinct tools: OpenCode and Cursor.

OpenCode is an efficient bring-your-own-key harness that can use any cheap model like GLM 5.1, open-weight or otherwise, and pay only for what I use. For the hard multi-file/multi-repository work where the extra accuracy clearly pays off, I reach for a premium harness like Cursor, directly integrated with my IDE. Though, to be honest, it’s also because I am old-school and prefer the IDE experience. But the important point here is that the split follows the task complexity rather than the marketing.

If you take one thing from this: measure cost per finished task on your own work, not leaderboard score. The harness you choose and the task complexity you actually face will decide your bill far more than which lab trained the model. And remember that the American labs are merchants of tokens, not models. It’s in their interest to sell you as many tokens as possible, not the other way around.