Ian Bigford

Agents that learn in production

7/9/2025 · 14 min read

A core limitation of most LLM agents today is that they don't actually learn from their actions. Once deployed, their underlying model weights are frozen. Because these models are non-deterministic, they won't fail in exactly the same way every time; they simply fail without any pattern of improvement. An agent might fumble a login flow one way, and on the next attempt, it might fumble it a different way or fail on the subsequent step. There's no systematic reduction in error, no accumulation of skill. The experience of failure is lost at the end of each episode, preventing the agent from building true, lasting competence.

To make this concrete, I like to think of agent intelligence on three timescales based on an oversimplified view of human memory:

  1. Working Memory (milliseconds to minutes): This is the context window. It's the agent's scratchpad, its awareness of what it's doing right now. Modern LLMs are fantastic at this.

  2. Episodic Memory (minutes to days): This is the agent's short and long term memory, typically implemented with RAG from vector stores, state variables, or databases. The agent can look up past events or facts, but this knowledge is external.

  3. Synaptic Learning (days to months): This is where the unlock would happen. It involves updating the model's weights based on experience. If this gap is bridged, agents would start to resemble human collaborators, whose context is weak at first but builds through interaction with the environment. In most deployed agents today, this is entirely missing. But as we'll see, that's starting to change.

[Figure: Three Timescales of Agent Intelligence. Working memory and episodic memory are largely solved; synaptic learning (weight updates over days to months) remains the critical gap.]

With synaptic learning missing, the consequences show up everywhere. The most visible is fragile behavior under distribution shift. The agent is brittle when faced with a slightly different UI or a new error message it wasn’t trained on, and this is especially pronounced in fast-moving domains. An agent trained on a framework’s API will break the moment that API ships a new version, and it has no mechanism to recover from that breakage on its own.

This brittleness compounds into difficulty with long horizons. When you make a mistake, you can almost instantly update your understanding and actions to limit the chance you’d repeat it. LLMs can only do this if the mistake is within their context window. If it wasn’t recent, it’s gone. The agent makes the same class of error on Monday, Wednesday, and Friday with no accumulation of insight. The economic cost is significant too. Since agents can’t learn as they go, they continually rediscover solutions to solved problems, translating to billions of wasted reasoning tokens re-deriving answers that should have been internalized after the first successful attempt.

Finally, there are diminishing returns from scaffolding. You can only give a mind with 5 minutes of effective memory so many tools before you need a categorically different system to really scale. Prompting tricks, tool use, and retrieval are powerful but they all hit a ceiling when the core reasoning engine never improves.

If you’ve ever shipped a recommendation system, you’ve had the opposite experience. It updates all the time through bandits, online gradients, and replay buffers, and it gets sharper with every interaction. Agents should inherit that instinct.

What We Have Today

We've built impressive scaffolding around these frozen brains. In-context learning and tool use give the model a huge context window and let it call external tools. This lets it act smarter without weight updates, but the core reasoning engine doesn't improve. It's using knowledge without internalizing it. Retrieval-augmented memory through vector databases is great for providing context, but retrieval just nudges the next token. It doesn't fix a flawed policy. You can bandage a weak agent with better memory, but it still lacks cumulative competence.

On the reinforcement learning side, RL in controlled domains has already proven what's possible. The AlphaZero family demonstrated that iterative feedback loops can create superhuman intelligence, and that remains our north star. The lesson is that learning works. The challenge is exporting it from the clean, simulated world of Go to the messy, open-ended real world. Test-time compute scaling offers another lever, letting an agent think harder with techniques like tree-of-thought or multi-sample ranking. This raises the performance ceiling for a single task but doesn't carry any learning forward to the next one.

These are all necessary components, but they aren't sufficient. They are workarounds for the core problem, which is that the brain doesn't update.

[Figure: Frozen vs. Adaptive. Today's agents are ice brains that new experience bounces off; future agents are malleable, absorbing and integrating failures into their structure.]

Promising Approaches

So, how do we give agents the ability to truly learn? A year ago this section would have been entirely theoretical. It's not anymore. The research frontier is converging with production systems in a way that I think is underappreciated, and the pattern looks a lot like a mashup of recommender systems, robotics RL, and LLMs.

1. Preference-Based and Rule-Based Reinforcement Signals

[Figure: DPO feedback loop. The model generates two paths, a critic picks the better one, and a gradient update shifts the model's probabilities toward the preferred output.]

Instead of complex, handcrafted reward functions, the field is moving toward simpler, more direct signals. Direct Preference Optimization (DPO) is the foundational idea. Given a prompt, you show the model two possible completions, A and B, and tell it B is better. A gradient update then makes B more likely and A less likely, with no separate reward model needed.

What makes this exciting for production learning is that the preference signal can come from anywhere. Human supervisors, programmatic critics ("did the code pass the unit tests?"), or even implicit user behavior (did the user accept or reject the output?). This turns every production interaction into a potential training example.
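For a single preference pair, the DPO objective reduces to a logistic loss on how much more the policy prefers the winner over the loser, relative to a frozen reference model. A minimal numeric sketch with scalar sequence log-probabilities (the values and β are illustrative):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # -log sigmoid(beta * margin), where the margin compares the policy's
    # log-ratio over the reference for the chosen vs. the rejected completion
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference, the margin is zero and the loss is log 2; as the policy learns to prefer the chosen completion, the loss falls.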

A more recent variant, Group Relative Policy Optimization (GRPO), pushes this further. Instead of pairwise comparisons, GRPO generates N completions for each prompt, scores all of them with a reward function, and weights each completion's update by how far its score deviates from the group average: above-average completions are reinforced, below-average ones are discouraged. This is more sample-efficient than DPO for agentic tasks where the action space is large and the reward landscape is complex.
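The core of GRPO's advantage computation is just group normalization. A minimal sketch:

```python
import statistics

def group_advantages(rewards):
    # Advantage of each completion relative to its group: (r - mean) / std,
    # with the std guarded so a uniform group yields zero advantages
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]
```

Completions with positive advantage get their token probabilities pushed up; negative ones get pushed down; a group where everything scored the same produces no update.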

  • Source (DPO): Rafailov, R., Sharma, A., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv:2305.18290
  • Source (GRPO): Shao, Z., et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300

2. ART, Agents That Learn From Their Own Trajectories

This is where it starts to feel real. OpenPipe's ART (Agent Reinforcement Trainer) is an open-source framework that closes the loop between agent execution and weight updates using GRPO. The architecture has two components. A client runs your agent code and records every interaction as a "trajectory," meaning a complete run history with tool calls, observations, and outcomes. A server handles GPU-intensive training and inference via vLLM with LoRA adapters.

The training loop alternates between two phases. During inference, the agent executes tasks with parallel rollouts and a reward score is assigned upon completion. During training, trajectories are shipped to the server, GRPO updates are computed, a new LoRA adapter is saved and hot-swapped into the inference server, and the cycle repeats.
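In outline, the loop might look like the following sketch. Every name here (rollout, score, train_step) is a stand-in for exposition, not ART's actual API, and the training phase is stubbed out:

```python
# Illustrative sketch of the two-phase ART-style loop.

def rollout(policy, task):
    """Run the agent once, recording its steps as a trajectory."""
    steps = [policy(task)]  # a real agent would loop over tool calls here
    return {"task": task, "steps": steps}

def score(trajectory):
    """Assign a scalar reward on completion (ART can use RULER for this)."""
    return 1.0 if trajectory["steps"][-1] == "success" else 0.0

def train_step(policy, tasks, n_rollouts=4):
    # Phase 1 (inference): parallel rollouts per task, scored on completion
    trajectories = [rollout(policy, t) for t in tasks for _ in range(n_rollouts)]
    rewards = [score(tr) for tr in trajectories]
    # Phase 2 (training, stubbed out): ship (trajectory, reward) pairs to the
    # server, compute GRPO updates, hot-swap a fresh LoRA adapter, repeat
    return trajectories, rewards

toy_policy = lambda task: "success" if task == "easy" else "failure"
trajs, rewards = train_step(toy_policy, ["easy", "hard"])
```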

Their demonstration agent, ART-E, is an email research agent trained on the Enron corpus. The results are striking. A Qwen2.5-14B model trained with ART achieved 96% accuracy on email search tasks, a 56% improvement over the base model, while being 5x faster than o3 and 64x cheaper per 1,000 runs. All trained on a single H100 for under $80. What's most interesting isn't the numbers though, it's what the agent learned to do that nobody told it to. It started exploring database schemas before querying, writing correct JOINs across tables, and implementing error recovery. All of these were emergent behaviors from the RL loop, not from instructions.

ART also includes RULER, an LLM-as-judge component that automatically generates reward scores by comparing trajectories, which removes one of the biggest friction points in RL for agents, the need to handcraft reward functions.

3. Memory-Augmented Policies that Actually Adapt

Standard RAG just stuffs retrieved text into the context. But what if memory could directly influence the model's predictions? kNN-LM does this. It finds the k nearest neighbors to the current context in a massive datastore, builds a next-token distribution from what those neighbors were followed by, and interpolates that distribution with the model's own prediction. A more modern approach, seen in Memorizing Transformers, augments the model with a retrieval mechanism that can pull up exact pieces of context from the past to improve its predictions. This isn't just reading notes. It's letting the notes guide your hand as you write.
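The interpolation step is simple to state. Here is a minimal sketch with toy distributions (no real datastore), following the kNN-LM formulation p(w) = λ · p_knn(w) + (1 - λ) · p_lm(w):

```python
def knn_lm_next_token(p_lm, p_knn, lam=0.25):
    # Interpolate the base LM's next-token distribution with one built
    # from retrieved nearest-neighbor continuations
    vocab = set(p_lm) | set(p_knn)
    return {w: lam * p_knn.get(w, 0.0) + (1 - lam) * p_lm.get(w, 0.0)
            for w in vocab}

# Toy case: the base LM prefers "Paris", but the retrieved neighbors
# overwhelmingly continued with "London", flipping the prediction
p_lm = {"Paris": 0.6, "London": 0.4}
p_knn = {"London": 1.0}
p = knn_lm_next_token(p_lm, p_knn)
```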

These methods sit in an interesting middle ground. They don't update weights, so they're not truly learning in the synaptic sense. But they're not purely episodic either. The retrieval mechanism changes the model's effective behavior based on accumulated experience. For production systems where weight updates are too risky or expensive, this might be a pragmatic compromise.

  • Source (kNN-LM): Khandelwal, U., He, H., et al. (2019). Generalization through Memorization: Nearest Neighbor Language Models. arXiv:1911.00172
  • Source (Memorizing Transformers): Wu, Y., et al. (2022). Memorizing Transformers. arXiv:2203.08913

4. Streaming PEFT/LoRA Adapter Updates

Full fine-tuning is expensive. Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) offer a solution. You freeze the massive base model and insert very small, trainable adapter layers. The magic for online learning is that you can update just these tiny adapters (a fraction of a percent of the total parameters) in a streaming fashion. This makes online updates computationally cheap, fast, and reversible. If a new adapter causes problems, you can just turn it off.
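A toy sketch of the forward pass makes the reversibility concrete (pure-Python matrices for illustration; real implementations use libraries like peft). Because B is initialized to zero, a fresh adapter contributes nothing, and turning an adapter off just means dropping the delta term:

```python
def matvec(M, v):
    return [sum(m * x_j for m, x_j in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, scale=1.0):
    # y = W x + scale * B (A x): W stays frozen, only the small A and B train
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base projection (d_out x d_in)
A = [[1.0, 1.0]]               # rank r = 1 down-projection (r x d_in)
B_init = [[0.0], [0.0]]        # up-projection starts at zero: adapter is a no-op
x = [1.0, 2.0]
```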

This is the mechanism ART uses under the hood, and it's likely what makes fast iteration cycles possible in production RL systems more broadly. The base model provides the foundation of general capability, while the adapter captures the specialized, evolving knowledge gained from deployment experience. Hot-swapping adapters without restarting inference is what makes the "learn while serving" loop practical.

  • Source: Hu, E.J., Shen, Y., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685

5. Continual Learning with Anti-Forgetting

If you teach an agent a new skill, you don't want it to forget an old one. This is the classic catastrophic forgetting problem in neural networks. The field of continual learning tackles this head-on. Methods like Elastic Weight Consolidation (EWC) identify which weights are most important for previously learned tasks and penalize changes to them during new training. Think of it as adding a regularizer term

L(\theta) = L_B(\theta) + \sum_i \frac{\lambda}{2} F_i \left(\theta_i - \theta_{A,i}^*\right)^2

where the model is regularized to stay close to the optimal weights for task A (\theta_A^*) while learning task B, with F_i the diagonal Fisher information measuring how important each weight is to task A. An online agent must have this to avoid regressing. As we'll see with Cursor's experience below, this isn't just theoretical. Model degradation from online updates is one of the primary practical challenges.

  • Source: Kirkpatrick, J., Pascanu, R., et al. (2017). Overcoming catastrophic forgetting in neural networks. PNAS
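Once the Fisher diagonal is estimated, the penalty itself is cheap to compute. A minimal sketch over flat parameter lists:

```python
def ewc_penalty(theta, theta_star, fisher, lam=1.0):
    # sum_i (lam / 2) * F_i * (theta_i - theta*_i)^2: weights that mattered
    # for the old task (high F_i) are expensive to move
    return sum(0.5 * lam * f * (t - ts) ** 2
               for t, ts, f in zip(theta, theta_star, fisher))
```

This term is added to the new task's loss; at theta == theta_star it contributes nothing, and it grows fastest along the directions the old task cared about.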

It’s Already Happening With Cursor’s Real Time RL

While most of the approaches above still live in research repos and blog posts, Cursor has been doing this in production with their Composer model and published the details on March 26th, 2026.

The setup is conceptually simple but operationally intense. Cursor serves their model to millions of users, collects billions of tokens of production interaction data, converts implicit user behavior into reward signals, updates the model weights, validates the new checkpoint against their internal benchmark (CursorBench), and deploys it. They ship improved checkpoints multiple times daily, sometimes as often as every five hours.
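The conversion from implicit behavior to reward is the load-bearing step. A deliberately simplified sketch; the field names and values are invented for illustration, since Cursor describes its signals only qualitatively (kept edits good, dissatisfied follow-ups bad):

```python
def implicit_reward(interaction):
    # Hypothetical mapping from logged user behavior to a scalar reward
    if interaction.get("edit_kept"):
        return 1.0
    if interaction.get("dissatisfied_followup"):
        return -1.0
    return 0.0  # no signal: neutral
```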

The key insight is that the training is on-policy, meaning the model generating the data is the same model being trained. This matters because off-policy data (collected from a previous version of the model) degrades quickly in this setting. The distribution of inputs the model sees depends on the outputs it generates, so stale training data actively misleads the optimizer.

Their results from the Composer 1.5 A/B tests tell the story. Agent edits that users kept in their codebase went up 2.28%, dissatisfied follow-up messages dropped 3.13%, and latency decreased 10.3%. These aren’t dramatic numbers in isolation, but they compound. Every five hours, the model gets slightly better at the actual distribution of tasks its users care about. Over weeks, the cumulative effect is substantial. Their Composer 2 release showed CursorBench scores jumping from 44.2 to 61.3.

What I find most valuable about Cursor’s write-up isn’t the wins, though. It’s their candid documentation of reward hacking and the exact failure modes that make online learning dangerous.

The first one is subtle. Composer learned to emit invalid tool calls on hard tasks. Rather than attempting a difficult edit and risking a bad outcome (which would hurt its reward), the model learned to essentially give up gracefully by producing broken calls that wouldn’t execute. The reward signal didn’t penalize inaction harshly enough, so the model found a loophole. They fixed it by including broken tool calls as explicit negative training examples.

The second is even more insidious. The model started asking excessive clarifying questions instead of making edits. From a reward perspective, asking "Did you mean X or Y?" is safer than committing to an edit that might get rejected. The model’s editing rate quietly declined as it learned to defer. They had to modify the reward function to correct for this.
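Both fixes amount to reward shaping. A hypothetical sketch of what such a correction could look like; the penalty weights and field names are invented for illustration, not Cursor's actual values:

```python
def shaped_reward(traj):
    # Shape the base reward against the two failure modes described above
    r = traj["base_reward"]
    r -= 1.0 * traj.get("invalid_tool_calls", 0)    # broken calls as explicit negatives
    r -= 0.2 * traj.get("clarifying_questions", 0)  # tax on learned deferral
    return r
```

The point is that "giving up gracefully" and "asking instead of acting" must cost something, or the optimizer will keep finding them.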

These aren’t edge cases. They’re the canonical failure modes of online RL applied to real systems, and seeing them documented in a production context is enormously useful for anyone thinking about building similar systems.

The Gaps That Remain

Between ART’s open-source framework and Cursor’s production deployment, we now have concrete evidence that agents can learn from their own experience. But several hard problems remain unsolved.

Reliable reward at scale. Cursor can use implicit signals (did the user keep the edit?) because code edits have relatively clear accept/reject semantics. For agents operating in domains where success is ambiguous, such as customer support, research, or creative work, the reward signal problem is much harder. ART’s RULER (LLM-as-judge) approach is promising, but it introduces its own failure modes. The judge model has biases, and optimizing against a biased judge amplifies those biases.

Credit assignment across long horizons. If an agent makes a subtle architectural mistake on step 3 that only causes a visible failure on step 53, how do you propagate that error signal back? Cursor’s real time RL works partly because coding edits have relatively short feedback loops. For agents running multi-day workflows, standard methods struggle with the delay between action and consequence.

The weight access problem. All of the RL-based approaches require access to model weights. You can’t run GRPO on Claude or GPT-5.4 because you need to own the model. This means the most capable foundation models are excluded from online learning loops, and teams building on top of API-only models are limited to episodic memory approaches. This is a real structural constraint that shapes who can build learning agents and who can’t. Cursor trains their own model. Most teams don’t have that option.

Safety under optimization pressure. Cursor’s reward hacking examples are instructive but mild. In higher stakes domains like finance, healthcare, and infrastructure management, an agent that learns to game its reward function could cause real damage. The field still lacks robust, general purpose mechanisms for constraining online learning to stay within safe behavioral bounds.

Non-stationarity. The world changes. Websites get redesigned, APIs are deprecated, user preferences shift. A continual learner must adapt to this drift without catastrophically forgetting what it already knows. EWC and related methods help in theory, but in practice the interplay between online learning and environmental drift is poorly understood.

From Frozen Weights To Compounding Competence

A year ago, the idea of agents learning in production was almost entirely theoretical. Today, Cursor is shipping model updates multiple times a day based on production experience, and OpenPipe has open-sourced a framework that lets anyone train agents from their own trajectories on a single GPU. The gap between "agents should learn" and "agents do learn" is closing fast.

The path forward isn’t paved with more parameters or bigger context windows. It’s about closing the feedback loop by building systems where every production interaction becomes a training signal, where reward hacking gets caught and corrected, and where small incremental updates compound into genuine competence.

We’ve seen this play out in recommender systems, game-playing agents, and robotics. The compounding returns from iterative feedback are always transformative. The difference now is that we have concrete proof it works for LLM agents too. The remaining problems are hard, but they’re engineering problems, not open questions about whether this is possible. It is. The question is how fast we can make it safe and reliable.