My New Hobby: Challenging AI Code Assistants

I have a new hobby: making AI code assistants fight each other to solve the same problem. It’s quite interesting because it lets me evaluate each harness and compare their token efficiency and overall quality.

Currently under evaluation: OpenCode and Cursor. I already know that OpenCode is one of the best harnesses out there, if not the best; I’ve used it for a long time. I’ve also been evaluating Cursor for a few weeks to see how it stacks up.

The first thing that stands out is that Cursor is very verbose. In my experience it consumes roughly 5 to 6× more tokens than OpenCode (and than other harnesses like Claude Code). The Cursor team counteracts this by offering Composer 2.5, their own model built on the Kimi K2.5 base with additional reinforcement learning fine-tuning. The model is very fast and somewhat accurate. But the underlying issue with Cursor is really the harness: it just doesn’t seem to be meant for simple, tightly scoped tasks.

The Bug

So the test was this: this morning I tried to fix a small issue in one of my projects. I started by asking for a plan, using a somewhat OK prompt describing the issue. Composer 2.5 generated a proper plan and I reviewed it. This already cost me around 100k tokens, for some reason. The plan seemed correct, but a bit noisy. I then asked Cursor to fix the bug and also to fix other components that had similar issues. The result was a disaster. The bug was fixed, but Cursor added completely useless and unrelated code in different places, “fixed” bugs I never mentioned, and introduced new bugs of its own. In the end, I asked it to roll back everything.

Cursor disaster: 6.9M tokens consumed, $4.50 lost for zero value delivered. Don't call HR on me please.

Then I started OpenCode, gave it the same prompt with GLM-5.1, and it fixed the bug in a few seconds, consuming only 35k tokens.
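For the curious, here is the back-of-the-envelope arithmetic on those two runs. This is only a rough sketch: the token counts and the dollar amount are the figures quoted above, the function and variable names are made up for illustration, and the "blended price" assumes the whole $4.50 was spent on the 6.9M tokens.

```python
def token_ratio(tokens_a: int, tokens_b: int) -> float:
    """Return how many times more tokens run A used than run B."""
    if tokens_b <= 0:
        raise ValueError("tokens_b must be positive")
    return tokens_a / tokens_b


def blended_price_per_million(cost_usd: float, tokens: int) -> float:
    """Return the average cost in USD per one million tokens."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    return cost_usd / tokens * 1_000_000


# Figures quoted in this post (hypothetical constant names).
CURSOR_TOKENS = 6_900_000
CURSOR_COST_USD = 4.50
OPENCODE_TOKENS = 35_000

ratio = token_ratio(CURSOR_TOKENS, OPENCODE_TOKENS)
price = blended_price_per_million(CURSOR_COST_USD, CURSOR_TOKENS)

print(f"Cursor used about {ratio:.0f}x the tokens of the OpenCode run")
print(f"Blended Cursor price: about ${price:.2f} per million tokens")
```

That works out to roughly 197× the tokens for this one session, at around $0.65 per million tokens. The Cursor total presumably includes the planning step and the wider "fix similar issues" request, so treat it as a worst case for this bug rather than a typical gap.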

This is obviously not a 1:1 comparison since I didn’t use the same model, but it’s something I’ve witnessed a few times already with Cursor: it’s simply way too noisy and verbose, consuming 5 to 6× more tokens than other harnesses. The only reason it remains a viable solution is that their Composer 2.5 model is very cheap. The flip side is that you can very quickly waste a massive amount of time trimming the verbosity and cleaning up collateral damage.

In the end, this morning’s “quick fix” should have taken me 5 minutes. It took me 20, and I would have risked shipping a new bug if I hadn’t reviewed the code carefully.

I later asked OpenCode/GLM-5.1 to review the Cursor plan and it was able to identify the issue, the verbosity, and the unnecessary code. I then put the same question to Cursor/Sonnet 4.6, and it identified the same issues.

As much as I love Cursor’s integration, I think their harness is poor at best. It is very token-inefficient and I’m not sure how I could integrate it into my workflow at this point. Maybe it’s good for creating new features only — but looking at the way it over-engineered a simple bugfix, I doubt it can generate a new feature with the simplest code possible either.