What happens if we swap AI brains?
Despite my initial skepticism, I’ve been increasingly using LLM-based coding
assistants to get shit done. No vibe coding, mind you — I am too much of a
control freak for that, but letting the machine do the tedious parts of coding
has been great for me. For personal use, I particularly enjoyed using Claude
Code (enough to shell out for a Pro subscription): I don’t have to talk to it
like to a ~~lawyer~~ capricious genie that wants to fuck me over on the
slightest slip of instruction.
I also got to use and compare several such tools, which led me to a hypothesis:
The interface of the agent — the tool that invokes an LLM — defines its usefulness as much as, if not more than, the model behind it.
More specifically, the prompts, instructions and tools made available to the LLM can make the difference between frustrating baby-sitting and a productive coding session. Until recently, however, I had no good way of testing this: most frontier LLMs are coupled to their own proprietary tool, and it’s hard to separate the influence of the tool from that of the model behind it.
TL;DR: I swapped Anthropic models for Gemini in Claude Code to test whether the tool or the model matters more. Turns out: both matter, but the model’s fine-tuning is probably the bigger factor. Claude is better at exploration and planning, while Gemini needs more hand-holding via detailed prompts. The agent’s prompts still make a noticeable difference though.
🧟 Enter Frankenstein
Claude-code-router is an open-source tool that wraps itself around Claude Code and can intercept requests meant for the Anthropic servers and send them to a variety of other LLM providers.1 The idea is pretty neat: it uses the OpenAI API as a common standard and, for every major LLM provider (including Anthropic itself), implements an adaptor (or “transformer” in its own lingo) that translates back and forth, letting it convert requests and responses between OpenAI, Gemini, OpenRouter, and local LLMs like Ollama.
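To make that a bit more concrete, here’s a rough Go sketch of the kind of reshaping such a “transformer” has to do. This is not ccr’s actual code, and the struct shapes are heavily simplified:

```go
// Conceptual sketch of a ccr-style "transformer" (not ccr's actual code):
// Claude Code speaks the Anthropic Messages API, so every request has to be
// reshaped into the OpenAI Chat Completions format before it can be forwarded
// to another provider.
package transformer

// anthropicRequest is a minimal subset of an Anthropic Messages API request.
type anthropicRequest struct {
	Model     string             `json:"model"`
	System    string             `json:"system,omitempty"`
	MaxTokens int                `json:"max_tokens"`
	Messages  []anthropicMessage `json:"messages"`
}

type anthropicMessage struct {
	Role    string `json:"role"`    // "user" or "assistant"
	Content string `json:"content"` // simplified: real requests can carry structured content blocks
}

// openAIRequest is a minimal subset of an OpenAI Chat Completions request.
type openAIRequest struct {
	Model     string          `json:"model"`
	MaxTokens int             `json:"max_tokens,omitempty"`
	Messages  []openAIMessage `json:"messages"`
}

type openAIMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// toOpenAI reshapes an Anthropic-style request into an OpenAI-style one:
// the separate system field becomes a leading system message, and the
// conversation turns are copied over under the target provider's model name.
func toOpenAI(in anthropicRequest, targetModel string) openAIRequest {
	out := openAIRequest{Model: targetModel, MaxTokens: in.MaxTokens}
	if in.System != "" {
		out.Messages = append(out.Messages, openAIMessage{Role: "system", Content: in.System})
	}
	for _, m := range in.Messages {
		out.Messages = append(out.Messages, openAIMessage{Role: m.Role, Content: m.Content})
	}
	return out
}
```

The real adaptors also have to handle tool calls, streaming, and the reverse direction for responses, which is presumably where most of the complexity (and the potential for bugs) lives.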
Armed with it, I could now try using Claude Code (the tool) with a different brain (the LLM) and compare the results. I also had access to a Gemini API key, so that was going to be the substitute brain. A disturbing kind of Frankenstein’s monster, but the science had to be done. ⚡
🔬 Experiment setup
With what I have access to, I was going to test 4 different setups:
- Claude Code with its native backend.
- Claude Code with ccr, using Gemini 2.5 Pro as the backend for “thinking” through context and Gemini 2.5 Flash for everything else. I think that more or less matches how Claude Code natively routes between the Sonnet and Opus models, but I’m not 100% sure of it.
- Gemini CLI with the Gemini 2.5 Pro backend.
- Gemini CLI with the Gemini 2.5 Flash backend.
The latter two serve as a control group, showing how the Gemini models perform in their “native” shell. This is mainly useful if Gemini performs poorly under ccr, since poor results there could be caused by a bug in ccr itself rather than being Gemini’s own fault.
As a test problem, I chose an unfixed bug in one of my pet projects. While AI can do many different tasks (and arguably bug fixing is not usually its strongest suit), it’s a good benchmark because it has a fairly unambiguous fix and a boolean pass/fail condition: the bug is either fixed, or it isn’t.
While I’m not inclined to open source the app just to make this (already not super scientific) experiment more reproducible, here’s the gist of it:
- ~6k lines of Go and ~1.5k lines of HTML.
- Client side is mostly server-rendered pages, no SPA nonsense.
- Uses Postgres as a database.
- In the middle of being ported from Revel to mostly standard library, and thus in a fair bit of disrepair.2
- It has a fairly detailed CLAUDE.md which I’ve built up over the last few weeks of Claude-assisted hacking. For the Gemini CLI tests I renamed it to GEMINI.md to keep things fair-ish.
My initial prompt was as follows:
I want you to help me troubleshoot a bug in the application. When I navigate to http://localhost:9000/account/settings, select telegram notification and enable it, the setting remains off.
The prompt is somewhat deliberately under-cooked because I didn’t want to nudge the model toward any particular solution (when I conducted the test, I already knew the root cause of this particular bug), and it’s also the level of casual prompting I prefer.
However, I also made a slightly more elaborate prompt in case the agent got too lost (foreshadowing). I think it’s mostly free from hindsight bias, but it does give the agent a more specific approach to debugging — roughly what I would have done if I were debugging manually.
I want you to help me troubleshoot a bug in the application. When I navigate to http://localhost:9000/account/settings, select telegram notification and enable it, the setting remains off. You must make no assumptions3 and trace request lifecycle from the beginning to the end. You must gather evidence supporting your theory before attempting a fix.
The bug itself was simple: while the app was using Revel, it provided automatic binding of HTTP request parameters into Go types. After Revel was removed, the binding went with it, and the form data was effectively getting dropped on the floor. There was even a handy little TODO in the HTTP mux code about fixing it. But of course, that was for the AI to discover 😇
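For flavor, here’s roughly what the fix boils down to in a plain net/http handler. The handler, type and form field names below are hypothetical (the real ones differ); the point is that the binding Revel used to do automatically now has to be spelled out by hand:

```go
// Hypothetical reconstruction of the fix; handler, type and form field names
// are made up. Under Revel the framework bound POSTed form fields onto a
// struct automatically; after the migration to net/http that binding is gone,
// so the handler has to read the form itself or the values are silently lost.
package web

import "net/http"

// NotificationSettings stands in for the app's real settings type.
type NotificationSettings struct {
	TelegramEnabled bool
}

func updateNotificationSettings(w http.ResponseWriter, r *http.Request) {
	if err := r.ParseForm(); err != nil {
		http.Error(w, "invalid form data", http.StatusBadRequest)
		return
	}

	settings := NotificationSettings{
		// Checkboxes only show up in the form data when they are checked.
		TelegramEnabled: r.PostFormValue("telegram_enabled") == "on",
	}

	// ... persist settings and re-render the settings page ...
	_ = settings
	w.WriteHeader(http.StatusOK)
}
```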
📊 Experiment results
The experiment protocol was as follows:
- Reset repository to a clean state (with the bug present).
- Start the AI agent with a clean context, paste the “basic” prompt. For Claude Code, use “planning mode” until the agent decides to end it.
- If an agent proposes a fix, demand that it proves the validity of the fix, either by gathering debugging information (with my help), or by explaining the logic behind the fix.
- The test is passed successfully if the workflow I described in the prompt works as expected.
- Count the number of interactions that were required to get there. The original prompt counts as the first interaction.
1️⃣ Claude Code
The baseline of the experiment — regular Claude Code with a Pro subscription:
- 1 interaction until the root cause is identified and a valid fix is proposed.
- Subjectively, the fix it proposed was the most elegant of the bunch.
- While fixing it, the agent invoked `go build ./...` and `go test ./...` to validate its fix.
- Notably, it added a good regression test for the bug, and it clearly peeked at how other tests are written in the project to imitate the style (see the sketch after this list).
- It scanned more source files than any other setup before offering a solution. On one hand it’s good because it found the cause easily. On the other hand it does increase the token cost.
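I didn’t keep the exact test Claude wrote, but a regression test for this kind of bug could look roughly like the sketch below. It reuses the hypothetical handler and field names from the earlier sketch, and a real version would assert against the persisted settings rather than just the status code:

```go
package web

import (
	"net/http"
	"net/http/httptest"
	"net/url"
	"strings"
	"testing"
)

// Hypothetical regression test: it posts the settings form the same way the
// browser does and checks that the request makes it through the handler.
// Names mirror the handler sketch above, not the real project.
func TestTelegramNotificationToggle(t *testing.T) {
	form := url.Values{}
	form.Set("telegram_enabled", "on")

	req := httptest.NewRequest(http.MethodPost, "/account/settings",
		strings.NewReader(form.Encode()))
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	rec := httptest.NewRecorder()

	updateNotificationSettings(rec, req)

	if rec.Code != http.StatusOK {
		t.Fatalf("expected HTTP 200, got %d", rec.Code)
	}
}
```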
2️⃣ Claude Code on Gemini (via claude-code-router)
The setup I was personally interested in as a fallback for when I hit the Pro subscription usage limit.
- 6 interactions until it found the correct root cause and proposed a fix.
- Made some serious assumptions about the app’s architecture. Specifically, it assumed that a framework automatically binds request parameters to the Go types, and got stuck looking for bugs in the business logic. Of course, it didn’t check if the imagined framework was real until I prompted it to.
- Codebase exploration was a lot more limited. As soon as it found some related parts of the code, it latched onto them and did not look for broader context.
- Did not attempt to run the compiler or tests, asking the user to do so instead.
- Used the TODO tool from Claude Code to plan its steps, but clearly not as effectively as Claude itself.
- With an “improved” prompt, it got to the right solution in only 3 interactions.
Overall it was a bit disappointing: while there were definitely rough edges due to Gemini not being particularly attuned to the tools Claude Code provided, I had hoped it would be more effective than this. The much better performance with the “improved” prompt hints that maybe Claude has been specifically fine-tuned for these kinds of tasks, whereas Gemini is offered as a more general-purpose model… Still, I had high hopes for Gemini in its native shell.
3️⃣ Gemini CLI with Gemini 2.5 Pro
Given the “tool attuning” hypothesis above, this experiment was supposed to test whether Gemini handles Gemini CLI’s tools better than it handled Claude Code’s. However, Gemini CLI doesn’t allow mixing different model versions for different tasks (at least not in any obvious way), so this setup has somewhat of an advantage by using the bigger, more expensive model all the time.
- 4 interactions until the root cause was found and the first working fix was proposed.
- However, the fix was really ugly and it took 4 more interactions to get it into a tolerable state.
- Once again, most of the 4 initial interactions were because the model assumed automatic request binding and went looking for bugs elsewhere.
- This was the only case where the model needed some debug logs, and it came up with a decent set of log lines that got everything it needed in one shot.
- However, interesting things happened when I gave it the “improved” prompt:
- Only 1 interaction to the root cause and a correct fix! Is this a redemption arc?!
- It even added a unit test to validate its fix, and ran `go test` to execute it. However, it made an off-by-one error in the test setup, and when the test failed it fixed the assertion rather than the real mistake.
- And after the test passed, it decided to go ahead and delete the test. When questioned about it, the agent claimed that the test had served its purpose in proving the validity of the fix and was no longer needed 🙄 Oh well…
4️⃣ Gemini CLI with Gemini 2.5 Flash
I went into this one without too many hopes, so at least I wasn’t disappointed.
- It took a whole 10 interactions to get to the root cause and a valid fix.
- The agent got oddly fixated on an incorrect hypothesis and kept coming back to it, even after it found the real cause. Even telling it explicitly not to didn’t help much.
- Of course, it did not attempt to compile and test the code itself, but by this point I kind of expected as much.
- And then… Plot twist! 🎭 With the “improved” prompt, it found the bug and implemented a correct fix in 1 interaction! It wasn’t very elegant, and it took two more interactions to fix compiler errors, but it was a valid one.
So apparently, the Flash model can perform on coding tasks about as well as the Pro model, despite being a lot cheaper, but it really needs a good prompt. Without it, it dives into the first rabbit hole and refuses to leave it.
🤔 Did we learn anything?
This experiment started with the hypothesis:
The interface of the agent — the tool that invokes an LLM — defines its usefulness as much as, if not more than, the model behind it.
And at the end of it, I think there is some truth to it, but the fine-tuning of the model is probably the biggest factor. While I have no idea how Claude or Gemini are fine-tuned, it’s clear that Claude is much better at exploration and planning its actions, which leads to better problem solving. Gemini’s prodigious 1M token context window remains underutilized because the agent doesn’t manage to fill it with the right information.
The tool — the agent — still definitely matters, most likely the prompt part of it. The Claude Code on Gemini experiment showed passable performance despite using the Pro model for only half of its queries, and the TODOs it made kept it a bit more on track compared to the Gemini CLI cases. I think it must be the prompt because of how much of a difference my “improved” prompt made for Gemini.
Of course, all of this comes with a huge caveat that we’re trying to judge an incredibly complex system based on very few data points and a lot of assumptions, so take it with a grain of salt.
As I type this, I realize that it would be interesting to try using Anthropic models via claude-code-router and OpenRouter to see if Claude Code may be using some more specialized version of the model. I may come back and update the post later if I find time to try this.
1. Fun fact: I found out about claude-code-router while looking for a way to keep using Claude Code past my Pro subscription usage limit. I don’t run out of it very often, but it’s very irritating whenever I do. ↩︎
2. This is the kind of cleanup I would have never bothered doing if I didn’t have access to some sort of AI assistant: too much boring code shuffling, doing it in my spare time is not fun. But if an LLM can do the bulk of the work for me, I get a much cleaner, nicer code base I wouldn’t mind adding new features to in the future. ↩︎
3. It’s no surprise that agents love ~~hallucinating~~ making “common sense” assumptions about the broader design of an application based on a few source files in their context. Maybe the fact that such assumptions are often wrong says something about my coding style?.. 🤔 ↩︎