Some (Messy) Notes On Continual Learning
update: added a few readings and directions, fixed typos and sentence structure
I became enormously interested in this aspect of LLMs recently due to the simple question of how models might behave when they are adapted to new data, tasks, and knowledge. This question came up after watching my colleagues talk about Test-Time Adaptation (TTA) on LLMs, and I immediately wondered how it affects the alignment measures that are in place in the base model. However, after further review, I realized that TTA makes little to no sense (at least in the context of what we might imagine a continuously adapting agent would do) and seems lazily motivated (meant just to extend TTA into the LLM domain).
I then decided to look at the fundamental problem first rather than the safety aspect: how can we adequately phrase the problem, and how can we implement a system that continuously learns? It's quite hard to even make a toy model of continual learning. I was inspired in part by discussions held with Aether and by reading Jessy Lin's wonderful blogpost and paper (whose arguments I probably regurgitate a lot here). These are meant as very early notes with only a vague idea of what kind of problem to tackle. Hopefully this will be updated every now and again!
Problem at Hand
- Main: LLMs/agents that continuously learn are the missing part of AGI (or at the very least of any agentic system that's going to be useful), but current models don't really learn, reflect, or adapt (both state-wise and memory-wise), and it's hard to get right → catastrophic forgetting, inefficient incorporation of knowledge, etc. This definition of continual learning may contradict work done circa 2020, but I think it makes sense to want a system that remembers and acts upon previous experience, just like what we imagine a human would do!
- Recent works that build frameworks to continuously teach models new things fail to achieve what I had in mind (I don't quite like the direction they're going in):
- SEAL: tries to self-adapt by doing “RL” on data it generates and evaluates itself → has horrible forgetting issues + is very expensive to run. In fairness, the idea is indeed interesting, but it still suffers from the fact that the current LLM architecture and optimization framework don't play well with continuous training. It doesn't make sense for an LLM to update all of its weights during continual training.
- TTL for LLMs: proposes to adapt LLMs to match the style of the input text by minimizing perplexity on the input (weird motivation, but ok). I have several problems with it: perplexity is a weird thing to optimize against, and it's unclear whether this raises scores on common benchmarks (reasoning, math, coding, knowledge QA, etc.). Furthermore, their argument against forgetting is just that they use LoRA?
- Currently accepted ideas include methods that change the architecture, use memory, etc.:
- 4 main things:
- Build a really long context (hard to do for anyone not in a frontier lab)
- Use a state space model to keep memory in a constant vector
- Train on saved context periodically (e.g. every night just like a human would)
- RAG combined with summarization and reflection in the background
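The fourth option (RAG plus background summarization) can be sketched in a few lines. This is just a toy skeleton under my own assumptions: `llm()` is a stand-in for a real model call (stubbed here so the sketch runs), and retrieval is naive keyword overlap instead of a real embedding index.

```python
from collections import deque

def llm(prompt: str) -> str:
    # stand-in for a real model call; just tags the text as a summary
    return f"summary({prompt})"

class ReflectiveMemory:
    def __init__(self, window: int = 4):
        self.recent = deque(maxlen=window)   # raw episodes (working memory)
        self.summaries = []                  # consolidated long-term notes

    def add(self, episode: str):
        if len(self.recent) == self.recent.maxlen:
            # "background" consolidation: compress the episode about to be evicted
            self.summaries.append(llm("summarize: " + self.recent[0]))
        self.recent.append(episode)

    def retrieve(self, query: str, k: int = 2):
        # naive keyword overlap in place of a real embedding retriever
        pool = self.summaries + list(self.recent)
        pool.sort(key=lambda m: -len(set(query.split()) & set(m.split())))
        return pool[:k]
```

The point of the sketch is the division of labor: raw context stays cheap and bounded, while consolidation happens off the hot path, which is roughly what the "every night, like a human" option above also assumes.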
Main Readings
- Reflexion: This paper guides LLMs by doing “RL” through trial and error on some task. Responses are evaluated by external agents, and using the sparse feedback signal generated (pass/fail, etc.), the model self-reflects on its previous actions (what went wrong, etc.). The self-reflections are saved as episodic memory and fed into the next trial as context.
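The loop described above is simple enough to write down. This is a rough sketch of the control flow only, not the paper's implementation: `act`, `evaluate`, and `reflect` are hypothetical callables standing in for the LLM, the external evaluator, and the self-reflection prompt.

```python
def reflexion(task, evaluate, act, reflect, max_trials=3):
    episodic_memory = []  # verbal reflections, fed back in as context
    for _ in range(max_trials):
        attempt = act(task, episodic_memory)
        if evaluate(attempt):              # external, sparse pass/fail signal
            return attempt, episodic_memory
        # on failure, self-reflect and remember it for the next trial
        episodic_memory.append(reflect(task, attempt))
    return None, episodic_memory
```

Note that nothing here touches weights: all the "learning" lives in `episodic_memory` as context, which is exactly why it works without training infrastructure (and also why it doesn't persist beyond the context window).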
- Pink, et al. Position Paper: Argues that “memory” is not a vague concept but must be operationalized to enable true long-term agents. Current LLMs often possess semantic memory (general facts) or working memory (context window), but lack true “episodic memory”. Inspired by biological systems, the paper differentiates memory into encoding (short-term), retrieval (pulling past episodes into context), and consolidation (integrating memory into weights).
- LifeLongAgentBench: Probably the closest thing to what I'm thinking of, where agents must accumulate and transfer skills across sequential, interdependent tasks (in database, OS, and knowledge-graph environments). The authors introduce a group self-consistency mechanism: instead of stuffing all retrieved history into one prompt, past experiences are partitioned into smaller groups to generate multiple reasoning paths. The benchmark part is great, but work remains in making it work really well for a single model.
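The group self-consistency idea reduces to: partition the history, answer once per partition, majority-vote. A minimal sketch, assuming a hypothetical `answer_with(group)` LLM call (any callable works):

```python
from collections import Counter

def group_self_consistency(history, answer_with, n_groups=3):
    # partition past experiences into n_groups interleaved groups
    groups = [history[i::n_groups] for i in range(n_groups)]
    # one reasoning path (answer) per non-empty group
    answers = [answer_with(g) for g in groups if g]
    # majority vote across the paths
    return Counter(answers).most_common(1)[0][0]
```

The attraction is that each prompt stays short (one group of history instead of all of it), while the vote recovers some robustness against any single group being misleading.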
- Provable Benefit of In-Tool Learning: A bit of motivation on how in-tool learning > fine-tuning
- CRITIC
- LaMP-QA
- BeSpoke
- Self Improving Agents at Test Time
Thoughts on what the ideal system architecture should be
- Weight updates are necessary, especially for super-specialized models. The efficiency of weight updates compared to internalizing things from context is reason enough
- Weight updates should be targeted. Acquired knowledge will realistically either occupy a new portion of the knowledge space or change something already known. This prevents forgetting while still efficiently incorporating the knowledge itself.
- This necessitates a new paradigm for how post-post-training is done (where current and future memory is saved architecturally, what objective, if any, the model should be optimized against, and which set of parameters to update)
- Self-reflection is necessary. We want a model that has an internalized representation of its past efforts and incorporates them into its weights.
- Small updates (LoRA-esque): we can imagine a plug-and-play aspect to this, as well as the ability to aggregate memory from multiple learners.
- Adding context itself is equivalent to parameter updates!
- etc..
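The "small, targeted updates" point can be made numerically. Below is a toy LoRA-style sketch under my own assumptions (shapes and values are arbitrary; this is not any specific method from the readings): the base weights `W` are frozen and only a low-rank adapter `B @ A` is trainable.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                          # model dim, adapter rank (r << d)
W = rng.normal(size=(d, d))          # frozen base weights (d*d = 64 params)
A = rng.normal(size=(r, d))          # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-init

def forward(x):
    return W @ x + B @ (A @ x)       # base output + low-rank correction

x = rng.normal(size=d)
base_out = W @ x
# zero-initialized adapter is a no-op: plugging it in can't disturb the base model
assert np.allclose(forward(x), base_out)

# a toy "targeted update": only the 2*r*d = 32 adapter params move, never W,
# which is the mechanism that's supposed to limit forgetting
B += 0.1 * rng.normal(size=(d, r))
assert not np.allclose(forward(x), base_out)
```

The plug-and-play/aggregation point falls out of the same structure: each learner's "memory" is just a small `(B, A)` pair that can be swapped out, merged into the weights as `W + B @ A`, or combined across learners.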
Some Possible Directions
- Looking at how to integrate memory as personalization tokens → serve memory as context that can change over time
- Focus on efficient continual learning that can work in a deployed mobile setting
- My personal favourite: look into how agents learn to use new tools, as learning this is provably more difficult than changing the weights
- We can use ToolAlpaca to model a continuous workflow (fake APIs that aren't in current training data)
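A minimal harness for this direction might look like the following. Everything here is invented for illustration (the `flux_quota` endpoint is made up precisely so it can't be in pretraining data, and the stub agent just exercises the harness); a real setup would plug an LLM agent into `score_agent`.

```python
def make_fake_api():
    # simulated backend for a made-up endpoint
    state = {"u1": 7}
    def flux_quota(user_id):
        return state.get(user_id, 0)
    return {"flux_quota": flux_quota}

def score_agent(agent, api, cases):
    # fraction of queries where the agent proposes a tool call
    # that exists and returns the expected result
    hits = 0
    for query, expected in cases:
        name, arg = agent(query)       # agent proposes (tool_name, argument)
        if name in api and api[name](arg) == expected:
            hits += 1
    return hits / len(cases)

# a stub agent that always proposes the same call, just to run the harness
stub = lambda query: ("flux_quota", "u1")
cases = [("How much flux quota does u1 have?", 7)]
```

The continuous-workflow version would then add new fake endpoints over time and measure whether skill on earlier tools survives, which connects this direction back to the forgetting problem above.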