Lessons learned from building an agent that can code like Composer
How I built a coding agent that can code itself.
Since the release of GPT, I have kept a chat window open on the side to shape ideas, prototype, review, or improve code. Always present, the LLM felt like an engineer’s ultimate growth hack.
Coding agents take this to the next level. Agents now iterate, learn about the codebase, read the docs, and write more complex code. I’ve shipped features across multiple projects 3-4x faster.
Such impact is impressive for a technology that has been mainstream for barely half a year. Seeing it transform my own work, I wanted to push further.
I wanted to go deep and develop an intuition for designing a coding agent from scratch. SYNQ will soon write code to fix issues, and I wanted to know every little detail about what it takes to build an AI that writes code.
Learning to code with Cursor and Composer
I’ve been coding with GitHub Copilot for a while and with Cursor since September. I migrated mainly because of Cursor’s next-action prediction model. It went far beyond traditional autocomplete, suggesting larger edits, edits in other parts of the file, or keeping context across multiple files to help me finish work.
Magic.
Just as I started digging into agents, Cursor shipped Composer, its first agentic workflow, in November. Its first iteration worked only with Claude, and after weeks of my own experiments with OpenAI models, I realized how much better Claude was at this. That made it click for me how far modern LLMs can go.
With Composer + Claude on top of Cursor’s “TAB, TAB” autocomplete, I started to see significant improvements in my speed. At the same time, I began noticing limitations on our rather large codebases, which contain hundreds to thousands of files.
I also found the developer experience of a coding agent in the side panel of the editor limiting. I do a lot of work in the terminal, working with git, k8s, and logs, running type checks, tests, and more, with my hands mostly on the keyboard. The VS Code side-panel chat interface feels clunky to me.
At that point, it became apparent to me that agents would be a significant part of the future of software engineering. I had to build one to understand how they work.
I dug deeper into what made Composer different from a normal LLM chat.
AI developer toolkit
One of the key differences between humans and other primates is our ability to build and use tools. Tools amplify what we can do; they shaped us into who we are today.
Agents are the same.
Tools allow agents to do complex things beyond simple back-and-forth chat interactions. They plug the LLM into broader workflows.
When you think about an agent that writes code, the minimal set of tools is relatively simple:
List files to explore the structure of your project.
Read a file to understand existing code anywhere in your codebase.
Write a file to generate new code.
That’s it, at least for the initial version.
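To make this concrete, here is a simplified sketch in Go of what that minimal tool surface can look like. It is an illustration rather than my exact implementation; the point is that the name, description, and input schema are what the LLM sees, while the function behind each tool is plain deterministic code.

```go
// A simplified, illustrative sketch of the minimal tool set (not the exact code).
package agent

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// Tool is a unit of work the agent can invoke. Name, Description and
// InputSchema are what the LLM sees; Run is the logic behind it.
type Tool struct {
	Name        string
	Description string
	InputSchema map[string]any // JSON Schema for the tool's arguments
	Run         func(args map[string]any) (string, error)
}

// MinimalToolset returns the three tools that are enough for a first version.
// Error handling for missing or invalid arguments is trimmed for brevity.
func MinimalToolset() []Tool {
	return []Tool{
		{
			Name:        "list_files",
			Description: "List files under a directory to explore the project structure.",
			InputSchema: objectSchema("path"),
			Run: func(args map[string]any) (string, error) {
				entries, err := os.ReadDir(args["path"].(string))
				if err != nil {
					return "", err
				}
				names := make([]string, 0, len(entries))
				for _, e := range entries {
					names = append(names, e.Name())
				}
				return strings.Join(names, "\n"), nil
			},
		},
		{
			Name:        "read_file",
			Description: "Read the full content of a file anywhere in the codebase.",
			InputSchema: objectSchema("path"),
			Run: func(args map[string]any) (string, error) {
				content, err := os.ReadFile(args["path"].(string))
				return string(content), err
			},
		},
		{
			Name:        "write_file",
			Description: "Write (or overwrite) a file with the given content.",
			InputSchema: objectSchema("path", "content"),
			Run: func(args map[string]any) (string, error) {
				path := args["path"].(string)
				if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
					return "", err
				}
				err := os.WriteFile(path, []byte(args["content"].(string)), 0o644)
				return fmt.Sprintf("wrote %s", path), err
			},
		},
	}
}

// objectSchema builds a JSON Schema object with required string properties.
func objectSchema(fields ...string) map[string]any {
	properties := map[string]any{}
	for _, f := range fields {
		properties[f] = map[string]any{"type": "string"}
	}
	return map[string]any{"type": "object", "properties": properties, "required": fields}
}
```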
In my first iterations, I tried even less. I started with a tool that reads files and asked the LLM to generate code edits directly in its response, formatted as unified diffs inside markdown code fences. I used diffs because we often work with very long files (1,000+ lines), which would be expensive and slow to regenerate fully with foundation models priced around $3/MTok.
It didn’t work.
Generating code is a complex job, and if you ask an LLM to generate code AND put additional restrictions on how to formulate the output, it will make many mistakes. Sometimes it skips a comment, skips a new line, adds one, forgets to prefix a line with + or - to indicate the change, or messes up the location of the edit, because LLMs are notoriously bad at counting (lines). I spent some time trying to post-process such output and resolve the errors, but I quickly realized this was a dead end. I fixed ~90% of the issues, but the code was getting complex, and the occasional errors still broke the UX (or DX).
This is why the write-file tool is essential. Under the hood, it uses a relatively simple approach I learned from Cursor: a so-called fast-apply model, a much smaller model specializing in rewriting code, while the core agent model only outlines the edits. I tested several smaller models; ~25B-parameter models turned out to be precise enough to reliably apply code edits and regenerate file contents, and they do it much faster and much cheaper. The combination of Claude as a planner and a smaller, speedier apply model did the trick.
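Sketched out, the write-file flow looks roughly like this. The LLM interface and prompts below are simplified stand-ins rather than any specific provider's API; the point is that the planner only produces an edit outline, and a smaller, cheaper model regenerates the full file.

```go
// Illustrative sketch of the two-model write-file flow, not the exact implementation.
package agent

import (
	"context"
	"fmt"
	"os"
)

// LLM is a minimal stand-in for any chat-completion API.
type LLM interface {
	Complete(ctx context.Context, system, user string) (string, error)
}

// ApplyEdit takes the edit outline produced by the planner model (e.g. Claude)
// and asks a smaller, faster "apply" model to regenerate the full file content.
func ApplyEdit(ctx context.Context, applyModel LLM, path, editOutline string) error {
	original, err := os.ReadFile(path)
	if err != nil && !os.IsNotExist(err) {
		return err
	}
	system := "You rewrite source files. Output only the complete, final file content."
	user := fmt.Sprintf(
		"Original file %s:\n%s\n\nApply this edit and return the full updated file:\n%s",
		path, original, editOutline,
	)
	updated, err := applyModel.Complete(ctx, system, user)
	if err != nil {
		return err
	}
	return os.WriteFile(path, []byte(updated), 0o644)
}
```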
It was an important lesson:
Tool design for agents is very influential. The mental model I work with is a balance between deterministic and probabilistic systems.
A tool is a unit of work. Its inputs and outputs define its semantics for the agent. The tool therefore encapsulates logic, which can often be deterministic, exposing it through a clean interface to an agent whose core workflow is probabilistic.
Striking the right balance is key.
Expose too many broad tools to the agent, and things will frequently go wrong. Make workflows and tools static, and you will lose the power of the LLM and its ability to solve problems that are hard to solve with deterministic rules. Strike the balance, and you will have an agent that solves what was not possible before. Mastering tool design is the key.
By solving this challenge, I built the foundational set of tools for a coding agent, which started producing code edits on par with Composer in many scenarios.
Strengthening the feedback loop
As an engineer, coding rarely happens in “one shot.”
A typical coding session is a sequence of edits interleaved with running tools such as a type check, a build, or a test run.
To support this, I introduced a new tool: execute command. It allows the agent to run commands such as tests, type checks, cargo build, go build, or similar. Each of these runs fast and gives the agent feedback on its edits: a test was fixed or started failing, or a syntax error was introduced somewhere.
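Here is a sketch of such an execute-command tool; in the real thing you would also want sandboxing, an allow-list of commands, and truncation of very long output.

```go
// Illustrative sketch of the execute-command tool.
package agent

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// ExecuteCommand runs a command such as "go build ./...", "go test ./..." or
// "cargo build" in the project directory and returns the combined output, so
// the agent can react to failing tests or syntax errors.
func ExecuteCommand(ctx context.Context, dir, command string) (string, error) {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Minute)
	defer cancel()

	cmd := exec.CommandContext(ctx, "sh", "-c", command)
	cmd.Dir = dir
	out, err := cmd.CombinedOutput()
	if err != nil {
		// Return the output even on failure: the error text is the feedback.
		return string(out), fmt.Errorf("command failed: %w", err)
	}
	return string(out), nil
}
```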
I spent some time thinking about different approaches to code this multi-step workflow.
A popular way to do that is with tools like LangChain, which lets you chain multiple LLM interactions into workflows, or LangGraph, which treats workflow execution as a traversal of a graph: nodes are units of work, and edges are the possible ways to progress the workflow.
After the first few experiments, I found this approach problematic. First, it felt like overengineering. I don’t see much value in modeling workflows as graphs; I get it, it looks nice, but it feels unnecessary. Second, it felt rigid for my use case. There are edits for which running tests isn’t necessary, and I want the agent to decide whether to run the tests, the build, or another feedback loop.
Lastly, this multi-step process is slow. We need to pass a lot of context (file contents) to the LLM, and a multistage process that makes at least a few LLM calls to get one edit done takes time and costs more money. I quickly abandoned this approach and focused on one-shot planning with calls to code-edit tools, as outlined above.
I simplified the agent workflow into a for loop with tool calls. A tool can itself contain another agent or an LLM interaction. This design makes tools composable, much like software modules and functions, allowing us to combine deterministic and probabilistic code in one workflow.
The workflow lives inside the tools: if the agent decides to write a file, the tool under the hood invokes a different LLM to execute the code change before handing back control. The agent then loops over tool calls. The heart of the agent goes something like this:
1. Take user instructions + context
2. Decide which tool to use, or return control to the user
3. Use the tool
4. Review the result of the tool call and go back to step 2.
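In code, this is roughly the loop below. It is a simplified sketch: the Planner interface and its response shape stand in for a real tool-calling API, the Tool type is the one sketched earlier, and error handling and context-window management are left out.

```go
// Illustrative sketch of the agent's heart: a for loop over tool calls.
package agent

import "context"

// ToolCall is one tool invocation requested by the model.
type ToolCall struct {
	Name string
	Args map[string]any
}

// PlannerResponse is what the planner model returns on each turn.
type PlannerResponse struct {
	ToolCalls []ToolCall // empty when the model wants to hand control back
	Text      string     // the summary or clarification question for the user
}

// Planner stands in for a tool-calling LLM API.
type Planner interface {
	Next(ctx context.Context, history []string) (PlannerResponse, error)
}

func RunAgent(ctx context.Context, planner Planner, tools []Tool, task string) (string, error) {
	byName := map[string]Tool{}
	for _, t := range tools {
		byName[t.Name] = t
	}
	// Step 1: user instructions + context become the start of the history.
	history := []string{"task: " + task}

	for {
		// Step 2: ask the model which tool to use, or whether to return control.
		resp, err := planner.Next(ctx, history)
		if err != nil {
			return "", err
		}
		if len(resp.ToolCalls) == 0 {
			// No tool calls: the agent is done or needs clarification from the user.
			return resp.Text, nil
		}
		// Step 3: use the tools. Step 4: feed results back and loop to step 2.
		for _, call := range resp.ToolCalls {
			tool, ok := byName[call.Name]
			if !ok {
				history = append(history, "unknown tool: "+call.Name)
				continue
			}
			result, err := tool.Run(call.Args)
			if err != nil {
				result = "tool error: " + err.Error()
			}
			history = append(history, call.Name+" -> "+result)
		}
	}
}
```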
The agent has instructions to execute the task and use tools as necessary. When it doesn’t need more tools, it will write a summary of what it has done (as instructed in the system prompt) and return it to the user.
This way, the agent can make zero tool calls and ask the user for clarification, or execute one or many tool calls until it completes the task. Complex code edits can take anywhere from a few to dozens of iterations. The important part is that the agent knows what it has done and iterates. Sometimes it makes an edit and later realizes there is a better one. I explicitly call this out in the system prompt because I want an agent that revises its thoughts and iterates; that’s how engineering typically works. Current LLMs execute such reasoning well if you instruct them to.
The addition of testing was crucial. Cursor does this too with its linter, which you can see running now and then in Composer sessions. Without it, agents sometimes produce invalid code and finish, which is sub-optimal. With the linter and other tools, the agent can review and fix its errors before handing control back to the user. The DX is much better, as code edits are almost always valid.
Another benefit of implementing the feedback loop via a command-execution tool is that there is no strict workflow. The agent is encouraged in its system prompt to seek feedback but decides independently. It also chooses the actual command to run: it knows that in a Go project it should run a build to check for syntax errors or run the tests to verify that everything works.
After several iterations on system prompts for the agent and tweaking instructions and descriptions for the list, read, edit, and execute tools, I started to see results on par with Composer.
I also added code and file search and an ergonomic CLI UX that serves me well daily.
As a result, it writes ~90% of my code and accelerates development 3-4x. If you wonder why not 10x: I don’t implicitly trust the LLM. I review every line of code before a commit, sometimes instructing the agent to iterate. This also changes the ratio of the work; I spend most of my time thinking about the next step and reviewing code, not writing it.
Lessons learned
I will explore more techniques in the following posts, but I hope you’ve already learned something new or unexpected about what it takes to build agents that can code.
To finish this first post on the topic, let me summarise a few of my learnings, elevating them beyond coding agents to building agents in general:
Hardcoded workflows (LangChain) or complex graph representations of agents (LangGraph) are not always the best abstractions. You can model complex workflows via tool calls, where individual tools themselves interact with the LLM. The root agent decides what to do next, which makes the workflow more flexible. The latest generations of LLMs are very good at tool calling, and I expect that trend of improvement to continue.
Pay a lot of attention to the description and structure of each tool’s input and output. This is the difference between an agent that gets confused, constantly calling the wrong functions with the wrong parameters, and an agent that precisely executes complex workflows. At SYNQ especially, where we build agents that can debug data issues, we’ve spent a lot of energy hiding some of SYNQ’s internal concepts from the agents inside the tools, creating a simpler representation of the data observability world for the agent. It makes far fewer mistakes.
Think about the feedback loops. How does your agent know that it got the job done? How does it know it made a mistake and needs more work?
Focus on the UX around your agents. LLMs are commodities, and agents are not that hard to code; the magic will be in the UX. How do we embed agents into actual workflows, wider platforms, and the software we build? We should think beyond basic chat.
Think very hard about what “pre-prompt” you inject into your task. I don’t mean the system prompt that describes what the agent should do; I mean the context for the specific task at hand. In my case it’s a file edit; in our work at SYNQ, it’s information about newly detected issues, relevant code changes, or historical time-series data. The way this context gets passed to the agent, and consequently to the LLM, makes or breaks the workflow.
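To illustrate what I mean, here is a minimal sketch of assembling such a pre-prompt for a code-edit task. The field names are made up for the example; at SYNQ the context would carry issue details and time-series data instead of files.

```go
// Illustrative sketch of building the task-specific "pre-prompt".
package agent

import (
	"fmt"
	"strings"
)

// TaskContext carries whatever the specific task needs up front. For a code
// edit it is the relevant files; for a data issue it would be the detected
// issue, related code changes, and historical metrics instead.
type TaskContext struct {
	Instruction   string
	RelevantFiles map[string]string // path -> content, pre-selected for the task
	ExtraContext  []string          // e.g. issue description, recent diffs
}

// BuildPrePrompt turns the task context into the first message the agent sees.
// How well this is scoped and formatted makes or breaks the workflow.
func BuildPrePrompt(tc TaskContext) string {
	var b strings.Builder
	fmt.Fprintf(&b, "Task:\n%s\n\n", tc.Instruction)
	for path, content := range tc.RelevantFiles {
		fmt.Fprintf(&b, "File %s:\n%s\n\n", path, content)
	}
	for _, extra := range tc.ExtraContext {
		fmt.Fprintf(&b, "Additional context:\n%s\n\n", extra)
	}
	return b.String()
}
```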
Six months into the agent-building journey, I’ve learned a lot, and I’m incredibly excited to work with a coding agent that can code, and sometimes even code itself.
PS: I will spend the next few posts sharing practical experience with building AI systems and bring it back to data and analytics as we build these capabilities into SYNQ. I can’t wait to share more about what we’re building.
Image by Roman Synkevych on Unsplash