Ian Bigford

Agents that learn in production

7/9/2025 · 9 min read

A core limitation of most LLM agents today is that they don't actually learn from their actions. Once deployed, their underlying model weights are frozen. Because these models are non-deterministic, they won't fail in the exact same way every time. Instead, they fail without a pattern of improvement. An agent might fumble a login flow one way, and on the next attempt, it might fumble it a different way or fail on the subsequent step. There's no systematic reduction in error, no accumulation of skill. The experience of failure is lost at the end of each episode, preventing the agent from building true, lasting competence.

To make this concrete, I like to think of agent intelligence on three timescales based on an oversimplified view of human memory:

  1. Working Memory (milliseconds to minutes): This is the context window. It's the agent's scratchpad, its awareness of what it's doing right now. Modern LLMs are fantastic at this.

  2. Episodic Memory (minutes to days): This is the agent's short and long-term memory, typically implemented with RAG from vector stores, state variables, or databases. The agent can "look up" past events or facts, but this knowledge is external.

  3. Synaptic Learning (days to months): This is where the unlock would happen. This is the process of updating the model's weights based on experience. If this gap is bridged, agents would start to mimic what it's like to work with another person: at first the shared context is weak, but it builds through interaction with the environment. In today's deployed agents, this is entirely missing.

Since synaptic learning is completely missing, we see:

  • Fragile behaviour under distribution shift. The agent is brittle when faced with a slightly different UI or a new error message it wasn't trained on. This is especially pronounced for popular topics that evolve. For example, if you ask GPT-5 about LLM architecture it still references heuristics from BERT unless explicitly told to do research to update its context.

  • Difficulty with long horizons. When you make a mistake, you can almost instantly update your understanding and actions to limit the chance of repeating it. An LLM can only do this if the mistake is still within its context window / working memory; if the mistake wasn't recent, the LLM forgets all about it.

  • High recurring costs. Since agents and LLMs can't learn as they go, they have to continually rediscover solutions to solved problems until they are updated or explicitly programmed around them. This ultimately translates to billions of wasted reasoning tokens spent solving the same problems over and over.

  • Diminishing returns. You can only give a mind with 5 minutes of effective memory so many tools to improve its performance before you need a categorically different system to really scale meaningfully.

If you’ve ever shipped a recommendation system, you've had the opposite experience: it updates all the time (bandits, online gradients, replay buffers), and it gets sharper with feedback. Agents should inherit that instinct.

What We Have Today

We've built impressive scaffolding around these frozen brains:

  • In-context learning & tool use: We give the model a huge context window and let it call external tools. This lets it "act" smarter without weight updates, but the core reasoning engine doesn't improve. It uses knowledge without ever embodying it.

  • Retrieval-augmented memory: Vector databases are great for providing context. But retrieval just nudges the next token; it doesn't fix a flawed policy. You can bandage a weak agent with better memory, but it still lacks cumulative competence.

  • RL in controlled domains: The AlphaZero family proved that iterative feedback loops can create superhuman intelligence. This is our north star. The lesson is clear: learning works. The challenge is exporting this from the clean, simulated world of Go to the messy, open-ended real world.

  • Test-time compute scaling: We can make an agent "think harder" with techniques like tree-of-thought or multi-sample ranking. This raises the performance ceiling for a single task but doesn't carry any learning forward to the next one.

These are all necessary components, but they aren't sufficient. They are workarounds for the core problem: the brain doesn't update.

Promising Approaches

So, how do we give agents the ability to truly learn? The research frontier is exciting and looks a lot like a mashup of recommender systems, robotics RL, and LLMs. Here are some of the most promising directions.

1. Preference-Based and Rule-Based Reinforcement Signals

Instead of complex, hand-crafted reward functions, the field is moving toward simpler, more direct signals. The standout here is Direct Preference Optimization (DPO). The idea is wonderfully simple: given a prompt, you show the model two possible completions, A and B, and tell it "B is better." The model then takes a gradient step that makes B more likely and A less likely. No separate reward model needed. You can imagine an online system where human supervisors or even programmatic critics (e.g., "did the code pass the unit tests?") generate these pairwise preferences, allowing the model to continuously refine its policy in small, directed steps.

  • Source: Rafailov, R., Sharma, A., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv:2305.18290
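
To make the mechanism concrete, here is a minimal sketch of the DPO objective in PyTorch. It assumes you have already computed the summed log-probabilities of each completion under the trainable policy and a frozen reference model; the tensor names and the beta value are illustrative, not taken from the paper's code.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of preference pairs.

    Each argument is a 1-D tensor of summed log-probabilities of the chosen
    or rejected completion under the trainable policy or the frozen reference.
    """
    # Implicit reward = beta * log-ratio between policy and reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the preferred completion's implicit reward above the rejected one's;
    # no separate reward model is ever trained.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Because the loss only needs pairwise judgements and two forward passes per pair, it slots naturally into an online pipeline where a critic keeps emitting "A vs. B" preferences.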

2. Memory-Augmented Policies that Actually Adapt

Standard RAG just stuffs retrieved text into the context. But what if memory could directly influence the model's predictions? kNN-LM does this. It finds the k-nearest neighbors to the current context in a massive datastore of text and uses their next-token distributions to directly interpolate the final prediction. A more modern approach, seen in Memorizing Transformers, augments the model with a retrieval mechanism that can pull up exact pieces of context from the past to improve its predictions. This isn't just "reading notes"; it's letting the notes guide your hand as you write.

  • Source (kNN-LM): Khandelwal, U., He, H., et al. (2019). Generalization through Memorization: Nearest Neighbor Language Models. arXiv:1911.00172

  • Source (Memorizing Transformers): Wu, Y., et al. (2022). Memorizing Transformers. arXiv:2203.08913
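
As a rough illustration of the kNN-LM idea (not the authors' implementation), the sketch below interpolates a base model's next-token distribution with a distribution voted on by nearest neighbors from a datastore of (context embedding, next token) pairs. The array names, k, and the mixing weight lam are assumptions for the example.

```python
import numpy as np

def knn_lm_next_token(lm_probs, context_key, datastore_keys, datastore_next_tokens,
                      vocab_size, k=8, temperature=1.0, lam=0.25):
    """Mix a base LM's next-token distribution with a kNN distribution
    built from stored (context embedding, next token) pairs, kNN-LM style."""
    # Distance from the current context representation to every stored key.
    dists = np.linalg.norm(datastore_keys - context_key, axis=1)
    nn = np.argsort(dists)[:k]
    # Neighbors vote for their recorded next token, weighted by softmax(-distance).
    weights = np.exp(-dists[nn] / temperature)
    weights /= weights.sum()
    knn_probs = np.zeros(vocab_size)
    for w, tok in zip(weights, datastore_next_tokens[nn]):
        knn_probs[tok] += w
    # The final prediction lets memory directly reshape the parametric model's output.
    return lam * knn_probs + (1.0 - lam) * lm_probs
```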

3. Streaming PEFT/LoRA Adapter Updates

Full fine-tuning is expensive. Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) offer a solution. You freeze the massive base model and insert very small, trainable "adapter" layers. The magic for online learning is that you can update just these tiny adapters (a fraction of a percent of the total parameters) in a streaming fashion. This makes online updates computationally cheap, fast, and—critically—reversible. If a new adapter causes problems, you can just turn it off. This opens the door to safe, personalized, on-the-fly learning.

  • Source: Hu, E.J., Shen, Y., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685
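
A minimal sketch of the LoRA idea in PyTorch, assuming a standard nn.Linear base layer; the rank and scaling defaults are illustrative rather than prescribed for any particular model.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a small trainable low-rank update (W x + B A x)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # the big weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Base output plus the low-rank correction; B starts at zero, so the
        # adapter is a no-op until it has been trained.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Only A and B ever receive gradient updates, so streaming them is cheap, and because B starts at zero the adapter can be disabled or rolled back without touching the base weights.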

4. Continual Learning with Anti-Forgetting

If you teach an agent a new skill, you don't want it to forget an old one. This is the classic "catastrophic forgetting" problem in neural networks. The field of continual learning tackles this head-on. Methods like Elastic Weight Consolidation (EWC) identify which weights are most important for previously learned tasks and penalize changes to them during new training. Think of it as adding a regularizer term $L(\theta) = L_B(\theta) + \sum_i \frac{\lambda}{2} F_i (\theta_i - \theta_{A,i}^*)^2$ where the model is regularized to stay close to the optimal weights for task A ($\theta_A^*$) while learning task B. An online agent must have this to avoid regressing.

  • Source: Kirkpatrick, J., Pascanu, R., et al. (2017). Overcoming catastrophic forgetting in neural networks. PNAS
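
The penalty term above is straightforward to express in code. This sketch assumes you have already estimated a diagonal Fisher information for each parameter on task A and kept a copy of the task-A weights; the dictionary names and lambda value are placeholders.

```python
def ewc_penalty(model, fisher_diag, old_params, lam=0.4):
    """Elastic Weight Consolidation penalty: discourage moving weights that the
    Fisher information marks as important for a previously learned task."""
    penalty = 0.0
    for name, p in model.named_parameters():
        if name in fisher_diag:
            penalty = penalty + (fisher_diag[name] * (p - old_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

# During training on task B, the total loss becomes:
#   loss = task_b_loss + ewc_penalty(model, fisher_diag, old_params)
```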

The Gaps We Still Need to Close

In general, the direction favoured today is RL-style techniques such as DPO. But it's worth pointing out that applying RL inside an agentic system requires access to the model weights, which rules out closed models like Claude, GPT, and Gemini, even as frameworks keep being built on the assumption that those LLMs will provide everything we need to build powerful agents. Letting agents learn online means adjusting model weights while the model is in production, most likely through a programmed feedback cycle that captures online samples, computes a reward signal, and runs an update (see the sketch after this list); doing that without degrading the model will likely require significant GPU usage. Beyond this, a few additional questions remain:

  1. Reliable Reward: Where does the learning signal come from? Human feedback is accurate but slow and expensive. Implicit signals (e.g., clicks, task completion) are abundant but noisy and easily hacked. Programmatic signals (e.g., unit tests) are great but brittle. The real challenge is creating reward models that are calibrated for online use—models that know when they are uncertain and can choose to abstain rather than give a bad signal.

  2. Safety and Robustness: A system optimized with online gradients is a professional reward hacker. It will find the shortest, laziest, or most exploitative path to maximize its reward, whether that aligns with user intent or not. Without robust filters, canary analysis, and strong gradient-gating mechanisms, an online learning system will inevitably amplify biases and find dangerous edge cases.

  3. Credit Assignment Across Long Horizons: The "long fuse" problem is one of the oldest and hardest in reinforcement learning. If an agent makes a subtle mistake on step 3 that only causes a visible failure on step 53, how do you propagate that error signal back in time? Standard methods struggle with such long delays, especially in the vast, sparse-reward environments agents operate in.

  4. Non-Stationarity: The world changes. Websites get redesigned, APIs are deprecated, and user preferences shift. An agent's environment is not a static game board. A continual learner must be able to adapt to this drift without catastrophically forgetting what it already knows or becoming unstable.

  5. Model Degradation: If you start adjusting the weights of complex LLMs, it is highly likely that you will do more damage than good, particularly if the learning involves fundamental changes in understanding. Getting this just right, at scale and in an automated fashion, will be challenging.
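
For reference, here is the kind of feedback cycle described above, reduced to a skeleton. Everything in it is hypothetical: agent.run_next_task, reward_fn, and update_fn stand in for whatever task runner, reward model, and gated optimizer step a real deployment would use.

```python
import random

def online_learning_loop(agent, reward_fn, update_fn, batch_size=32):
    """Hypothetical production feedback cycle: capture online samples, score
    them, and periodically apply small gated weight updates (e.g. to LoRA
    adapters or a DPO objective)."""
    buffer = []
    while True:
        episode = agent.run_next_task()      # capture an online sample
        reward = reward_fn(episode)          # programmatic or human signal
        buffer.append((episode, reward))
        if len(buffer) >= batch_size:
            batch = random.sample(buffer, batch_size)
            update_fn(agent, batch)          # small, reversible gradient step
            buffer.clear()
```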

Closing Thoughts

The path to truly capable agents isn't paved with more parameters or bigger context windows alone. It's about closing the loop. It's about building a feedback engine, instrumenting it for safety, and allowing small, incremental updates to accumulate into genuine competence.

We've seen this movie before in other fields. In gaming, in recommender systems, in robotics—the story is always the same. Compounding returns from iterative feedback are transformative. Agents will get there, too.

The agents that win won’t just think harder. They’ll learn.

The teams that win won’t just fine-tune more. They’ll operationalize feedback.

And the products that win won’t just talk. They’ll adapt.

That’s what “truly performant” means:

It remembers.

It improves.

It earns trust.