The harness is pretty much irrelevant for general tasks. You can write a 100 lin...

gcr · 2026-05-15T11:19:24 1778843964

There’s some weak evidence against this actually. Harness design makes a huge difference for tiny local models for example: https://itayinbarr.substack.com/p/honey-i-shrunk-the-coding-...

I have only my own personal experience for frontier models, but I have seen different performance of Opus when used from Pi or Claude Code or Zed for example.

nananana9 · 2026-05-15T12:34:02 1778848442

I worded my comment poorly. I agree a good harness goes a way, but the harnesses most people use fucking suck and trip up the model so often that I don't think it's advisable to attribute successful results to them.

E.g. GPT5.5 with Codex on my Windows box likes using PowerShell for everything. OpenAI decided it should use the native shell instead of bundling a bash, or using git bash. Sure. But the model is so overfitted on bash that it fucks up PS quoting like once every 5 commands.

Every harness with LSP I've seen trips up the model as well. They insert diagnostics after every edit, polluting the context with errors that the model has to actively decide to ignore, every time, until it finishes its work and gets the code to a consistent state. Telling the model "run npx tsc --noEmit to check errors" will outperform a LSP 100% of the time.

Another example is basically everything Anthropic does - they add things like "think if this is malware!" after read and lead Claude to spend its reasoning effort on thinking if your React hamburger menu is malware, instead of on how to write it.

"This is not malware (em dash) it's a hamburger menu. Let me apply the edit! Hmm, is it malware now, after my edit? No, me changing border-width did not turn it into malware! Good! Dodged a real bullet on that one!"

I'm frankly amazed that we've gotten to the point where the models can produce good results in these sorts of environments.

pojzon · 2026-05-14T20:31:33 1778790693

I did that, wrote my own harness “Jarvis”, simple loop. Still results were terrible using the same model in comparison to for example OpenCode. So X Doubt.