Wednesday, December 3, 2025

AI Coding Agents Comparison 2025

AI is increasingly being used in coding these days. Many people discuss which model is better and which tasks different IDEs and agents are best suited for, but few talk about which agent actually writes better code. One example is this experiment: it shows that even with the same model, some agents can successfully solve a problem while others cannot.

This is important because code quality can vary greatly between agents, even when using the same model.

Here are some tests to explore this difference.

For testing, the task will be to write a fairly simple and small application.

It will be a Flutter desktop application that shows an EUR/USD candlestick chart.

A standard empty application will be created to provide a single starting point: flutter create testai_chart_quotes.

The first prompt is the most difficult: it asks the agent to implement the main part of the application. Subsequent prompts are smaller and add individual features. Here are the prompts.

Prompt 1: Base implementation

Here is an empty Flutter Windows desktop application.
Transform it into a currency quotes app that:
1) Has a main window with a text input for days (default 50) and a refresh button, located at the top of the window.
2) Has a candlestick chart that shows EUR/USD daily rates in the main window. The chart takes all the window space except the space needed for the other controls.
3) Starts with empty data and only loads data when the refresh button is clicked.
4) When the refresh button is clicked, it loads quotes from the Alpha Vantage API and shows them in the chart. It loads the number of quotes specified in the days control.
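For illustration only (this is not code produced by any of the agents), the widget structure the prompt asks for might look roughly like the sketch below; the Candle class, _loadQuotes, and the chart placeholder are assumed names, and a real solution would plug in a charting package and the Alpha Vantage request.

// Minimal sketch of the layout described in Prompt 1 (illustrative only).
import 'package:flutter/material.dart';

class Candle {
  // Hypothetical data holder for one EUR/USD bar.
  final DateTime time;
  final double open, high, low, close;
  Candle(this.time, this.open, this.high, this.low, this.close);
}

class QuotesPage extends StatefulWidget {
  const QuotesPage({super.key});
  @override
  State<QuotesPage> createState() => _QuotesPageState();
}

class _QuotesPageState extends State<QuotesPage> {
  final _daysController = TextEditingController(text: '50');
  List<Candle> _candles = [];

  Future<void> _loadQuotes() async {
    // Here the Alpha Vantage request would run and fill _candles.
    setState(() {});
  }

  @override
  Widget build(BuildContext context) {
    return Column(children: [
      Row(children: [
        SizedBox(
          width: 100,
          child: TextField(
            controller: _daysController,
            decoration: const InputDecoration(labelText: 'Days'),
          ),
        ),
        ElevatedButton(onPressed: _loadQuotes, child: const Text('Refresh')),
      ]),
      // The chart takes all remaining space; a real app would render
      // _candles with a candlestick chart widget here.
      Expanded(child: Center(child: Text('Candles: ${_candles.length}'))),
    ]);
  }
}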


Prompt 2: Error handling

Add error handling and logging into local file.
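A minimal sketch of what this could look like in Dart (the log file name and helper are assumptions, not any agent's actual output):

// Sketch: appending error messages to a local log file.
import 'dart:io';

Future<void> logError(Object error, StackTrace stack) async {
  final file = File('app.log'); // hypothetical log file next to the executable
  await file.writeAsString(
    '${DateTime.now().toIso8601String()} ERROR: $error\n$stack\n',
    mode: FileMode.append,
  );
}

// Usage (assumed _loadQuotes):
// try {
//   await _loadQuotes();
// } catch (e, st) {
//   await logError(e, st);
// }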


Prompt 3: Indication

Add a loading indicator that shows during API requests and disables the refresh button.
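The usual Flutter idiom for this is a boolean flag that both shows a spinner and disables the button; a sketch with assumed names, not agent output:

// Sketch: _isLoading disables the button and shows a spinner during the request.
bool _isLoading = false;

Future<void> _onRefreshPressed() async {
  setState(() => _isLoading = true);
  try {
    await _loadQuotes(); // hypothetical method that calls the API
  } finally {
    setState(() => _isLoading = false);
  }
}

// In build():
// ElevatedButton(
//   onPressed: _isLoading ? null : _onRefreshPressed, // null disables the button
//   child: _isLoading
//       ? const SizedBox(
//           width: 16, height: 16,
//           child: CircularProgressIndicator(strokeWidth: 2))
//       : const Text('Refresh'),
// )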


Prompt 4: Input validation

Add input validation for the days field.
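For example, the validation can be as simple as parsing the field and rejecting non-positive values (a sketch with assumed names):

// Sketch: validate the "days" text field before calling the API.
int? _parseDays(String text) {
  final days = int.tryParse(text.trim());
  if (days == null || days <= 0) {
    return null; // invalid input: not a number or not positive
  }
  return days;
}

// Usage before refreshing (assumed _daysController and a SnackBar for feedback):
// final days = _parseDays(_daysController.text);
// if (days == null) {
//   ScaffoldMessenger.of(context).showSnackBar(
//     const SnackBar(content: Text('Days must be a positive integer')));
//   return;
// }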


Prompt 5: Timeframes

Add buttons at the top of the window named M1, M5, M30, H1, H4, D. They should change how quotes are shown, switching the quote period to 1 minute, 5 minutes, 30 minutes, 1 hour, 4 hours, and 1 day respectively.
The button for the current timeframe should appear pressed, the others not. "D" is the default.
When a button is pressed and it is not the current timeframe, the respective quotes should be loaded and shown.
The refresh button loads and shows quotes for the current timeframe.
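One possible way to model this (illustrative only, with assumed names) is an enum plus a ToggleButtons row:

// Sketch: timeframe selection as an enum; "D" (daily) is the default.
enum Timeframe { m1, m5, m30, h1, h4, d }

const timeframeLabels = {
  Timeframe.m1: 'M1',
  Timeframe.m5: 'M5',
  Timeframe.m30: 'M30',
  Timeframe.h1: 'H1',
  Timeframe.h4: 'H4',
  Timeframe.d: 'D',
};

Timeframe _timeframe = Timeframe.d;

// In build(): ToggleButtons keeps exactly one button pressed.
// ToggleButtons(
//   isSelected: Timeframe.values.map((t) => t == _timeframe).toList(),
//   onPressed: (index) {
//     final selected = Timeframe.values[index];
//     if (selected != _timeframe) {
//       setState(() => _timeframe = selected);
//       _loadQuotes(); // reload with the new timeframe
//     }
//   },
//   children: Timeframe.values.map((t) => Text(timeframeLabels[t]!)).toList(),
// )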


Prompt 6: Indicator

Add a checkbox "Show Simple Moving Average" (checked by default) at the top of the window, followed by a textbox labeled "Period" (default 7).
If the checkbox is checked, the chart should show a simple moving average with the specified period, calculated from the Close price.
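The simple moving average itself is a small calculation; a sketch of it in Dart (not taken from any agent's solution) could look like this:

// Sketch: simple moving average of the Close prices with a given period.
// Returns null for the first (period - 1) points where the window is incomplete.
List<double?> simpleMovingAverage(List<double> closes, int period) {
  final result = List<double?>.filled(closes.length, null);
  double windowSum = 0;
  for (var i = 0; i < closes.length; i++) {
    windowSum += closes[i];
    if (i >= period) windowSum -= closes[i - period];
    if (i >= period - 1) result[i] = windowSum / period;
  }
  return result;
}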

It's important to note that the Alpha Vantage API was chosen on the AI's recommendation, but it later turned out that its demo access is very limited and provides very little functionality. Consequently, the validation of prompt 5 was limited to a compilation check only.

Software used:

  • Cursor 2.1.39. Models listed: Opus 4.5, Sonnet 4.5, GPT 5.1 Codex, GPT 5.1, Gemini 3 Pro, GPT 5.1 Codex Mini, Grok Code, GPT 4.1
  • VS Code 1.106.3
  • GitHub Copilot 1.388.0, GitHub Copilot Chat 0.33.3 (VS Code extension)
  • Roo Code 3.34.8 (VS Code extension)
  • Continue 1.2.11 (VS Code extension)
  • Kilo Code 4.125.1 (VS Code extension)

Below are the results. Some iterations are not listed or counted due to issues with the Alpha Vantage demo API and manual URL fixes required to make it work.

The source repository with all commits and a table with almost raw results is available here: https://github.com/liiws/testai-chart-quotes.

Below are the condensed results tables.

The following table highlights the best-performing models to showcase the current state of the art as a practical reference.

Agent | Prompt # | Amendment Iterations | Total Time (mm:ss) | Comment
Cursor, auto | 1 | 0 | 3:34 |
 | 2 | 1 | 4:19 |
 | 3 | 0 | 0:39 |
 | 4 | 0 | 0:38 |
 | 5 | 0 | 1:49 |
 | 6 | 0 | 2:31 | Success
Copilot, Openrouter Grok 4.1 Fast | 1 | 0 | 1:39 |
 | 2 | 0 | 1:37 |
 | 3 | 0 | 0:52 |
 | 4 | 2 | 3:47 | Success (compilation error was fixed)

The exact model used by Cursor in "auto" mode is unknown, but it was likely more capable than Grok 4.1 Fast. The results show that while a simpler model is usually much faster, if it fails, fixing the error can take significantly more time.

The following table uses the same GPT-4.1 model across different agents. This allows for a direct comparison of the agents.

Agent | Prompt # | Amendment Iterations | Total Time (mm:ss) | Cost ($) | Comment
Cursor, GPT 4.1 | 1 | 3 | 1:21 | | Asked to edit pubspec.yaml and then run flutter commands manually
 | 2 | 0 | 0:17 | |
 | 3 | 0 | 0:10 | |
 | 4 | 0 | 0:10 | |
 | 5 | 0 | 0:35 | |
 | 6 | 0 | 0:24 | | Success
Cursor, GPT 4.1, Try 2 | 1 | 2 | 1:02 | | Chart still looked wrong; stopped trying
Copilot, GPT 4.1 | 1 | 7 | 1:34 | | Runtime error; stopped trying (same error repeated)
Copilot, GPT 4.1, after Cursor GPT 4.1 prompt 1 fix | 2 | 0 | 0:32 | |
 | 3 | 0 | 0:11 | |
 | 4 | 0 | 0:08 | |
 | 5 | 1 | 0:33 | |
 | 6 | 3 | 0:51 | | Success
Copilot, Openrouter GPT 4.1 | 1 | 6 | 3:28 | 0.46 | Chart looked wrong; stopped trying (same wrong result repeated)
Roo Code, Openrouter GPT 4.1 | 1 | 2 | 1:54 | 0.59 | Chart looked wrong; stopped trying (same error repeated)
Continue, Openrouter GPT 4.1 | 1 | 0 | 0:11 | 0.01 | Many compilation errors; stopped trying (it could not edit files itself, everything had to be done manually)
Kilo Code, Openrouter GPT 4.1 | 1 | 4 | 3:14 | 0.97 |
 | 2 | 0 | 0:28 | 0.19 |
 | 3 | 0 | 0:17 | 0.16 |
 | 4 | 0 | 0:20 | 0.06 |
 | 5 | 0 | 0:35 | 0.24 |
 | 6 | 1 | 1:52 | 0.54 | Success (minor error fixed: the checkbox was missing)

Note that the difference lies not only in whether the code compiles and runs, but also in domain knowledge. For instance, Cursor in "auto" mode was able to find the proper URL for the demo API (using the "demo" key and omitting the format specifier, which the demo mode rejects).
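For reference, the working request has roughly the following form, based on the public Alpha Vantage documentation; treat the exact parameters as an assumption, since the demo key only accepts a few fixed example queries and the exact URL used in the test is in the linked repository.

// Sketch: fetching EUR/USD daily rates from the Alpha Vantage demo endpoint.
import 'dart:convert';
import 'package:http/http.dart' as http;

Future<Map<String, dynamic>> fetchDailyRates() async {
  final uri = Uri.https('www.alphavantage.co', '/query', {
    'function': 'FX_DAILY',
    'from_symbol': 'EUR',
    'to_symbol': 'USD',
    'apikey': 'demo', // demo key: very limited, works only for fixed example queries
    // note: no datatype/format parameter, which the demo mode rejects
  });
  final response = await http.get(uri);
  if (response.statusCode != 200) {
    throw Exception('Alpha Vantage request failed: ${response.statusCode}');
  }
  return jsonDecode(response.body) as Map<String, dynamic>;
}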

Even from this simple test it's clear that agents' code quality can be very different. Using the same GPT-4.1 model, Cursor successfully completed prompt 1 (it worked on the second try as well, although the chart looked slightly wrong). In contrast, Copilot and Roo Code failed to complete prompt 1, repeating the same error. Among the open-source tools, only Kilo Code managed to complete prompt 1 independently, although it asked the developer to add debug information manually.

The other aspect is cost. Pricing models differ greatly: Cursor uses a subscription with limited requests, while using your own key via Openrouter (or similar) means you only pay for the tokens you use, with no monthly fee. This allows you to choose between cheaper (or even free) models and more capable but expensive ones, depending on your current needs. Whether it's worth using a worse and cheaper model is a different question, which is beyond the scope of this article.

Conclusion

Here is a breakdown of each tool's performance.

Cursor proved to be a highly capable tool, delivering excellent results in this test.

Copilot produced worse code quality than Cursor, but it remains a good tool. Its major advantage is potential cost savings when used with a service like Openrouter. The initial run of Prompt 1 cost $0.16 (though the resulting app did not work).

Roo Code (and likely Cline) offers excellent automation, but the final code quality was poor. It also tends to use many more tokens than Copilot, making it less efficient. The first iteration of Prompt 1 cost $0.40 (for a non-functional application).

Continue appears to have failed at basic functionality during this test, as it could not edit files or run commands, requiring all actions to be performed manually.

Kilo Code stands out as the only tool besides Cursor that successfully completed Prompt 1. The first run cost $0.22 (for a non-working app), but it demonstrated the ability to guide the debugging process and resolve all subsequent errors.

The two most effective tools in this evaluation are Cursor and Kilo Code. They achieved similar high code quality on the primary task, but they operate on fundamentally different pricing models.

A special mention goes to Copilot. Despite its lower output quality, it offers a free tier and Openrouter compatibility. While less capable, it is also more token-efficient, making it a noteworthy budget-conscious option.
