
I agree that what people probably actually want is continual training; I disagree that continual training is the only way to get persistent state. The GP is (explicitly) talking about long-term memory alone, both in the claim and in the examples. If you have, e.g., a 10-trillion-token context, then you have long-term memory, which can enable long-term goals and affect actions across tasks as listed, even without continual training.

Continual training would remove the need for the context to provide that persistent state, and it would add capabilities beyond what an enormous context (or other methods of persistent state) alone would give, but that doesn't mean it's the only way to get persistent state as described.
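
As a crude illustration of persistent state living entirely in the context rather than in the weights, here is a minimal sketch: notes written during one task get prepended to the prompt of the next. The file name, note format, and helper names are made up for illustration; the point is only that nothing about the model changes between tasks.

    import json, pathlib

    # Persistent state lives outside the weights: notes written on one task are
    # prepended to the context of the next. (File name and structure are
    # illustrative, not any real framework's API.)
    MEMORY = pathlib.Path("agent_memory.jsonl")

    def remember(note: str) -> None:
        with MEMORY.open("a") as f:
            f.write(json.dumps({"note": note}) + "\n")

    def build_prompt(task: str) -> str:
        notes = [json.loads(line)["note"] for line in MEMORY.open()] if MEMORY.exists() else []
        return "Long-term memory:\n" + "\n".join(notes) + "\n\nCurrent task:\n" + task

    remember("User prefers terse answers.")
    remember("Goal: ship the migration by Friday.")
    print(build_prompt("Draft the status update."))

With a large enough context window, the same trick scales from a few notes to effectively everything the agent has ever seen, which is the sense in which a huge context gives you long-term memory without any weight updates.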



A giant, even infinite, context cannot overcome a model's fundamental limitations - the limitations in processing come from the "shape" of the weights in latent space, not from how the context steers navigation through that space during inference.

The easiest way to understand the problem is this: if a model has a mode collapse, like only producing watch and clock faces with the hands at 10:10, you can sometimes use prompt engineering to get an occasional output showing some other specified time, but 99% of the time it's going to be accompanied by weird artifacts, distortions, and abject failures to align with whatever the appropriate output might be.

All of a model's knowledge is encoded in the weights. All of the weights are interconnected, with links between concepts, hierarchies, sequences, and processes embedded within - there are concepts related to clocks and watches that are accurate, yet when a prompt routes through the distorted, "mode collapsed" region of latent space, it fundamentally distorts and corrupts the output that follows. In an RL context, you quickly get a doom cycle, with the output getting worse, faster and faster.

Let's say you use CFG or a painstakingly handcrafted LoRA and you precisely modify the weights that deal with a known mode collapse - your model can now display all times: 10:10, 3:15, 5:00, etc. But the secondary networks that depended on the corrupted/collapsed values, now "corrected" by your modification, are themselves skewed, with chaotic and complex downstream consequences.
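
To make the shape of that kind of edit concrete, here's a minimal sketch of a LoRA-style low-rank patch to a single weight matrix. The dimensions, rank, and scales are made up; the point is that even a tiny, targeted edit perturbs every activation flowing through that layer, and every downstream layer then sees inputs it was never trained against.

    import torch

    torch.manual_seed(0)
    d = 512
    W = torch.randn(d, d) / d**0.5       # stands in for a frozen pretrained weight

    # Low-rank "fix": W' = W + B @ A, with rank r << d
    r = 4
    A = torch.randn(r, d) * 0.01
    B = torch.randn(d, r) * 0.01
    W_patched = W + B @ A

    # Every activation that used to flow through W is now perturbed...
    x = torch.randn(8, d)                 # a batch of hidden states
    h_before = x @ W.T
    h_after = x @ W_patched.T

    # ...and every downstream layer that consumed h_before now sees this drift,
    # even though only 2*r*d parameters were touched.
    print("parameters changed:", 2 * r * d)
    print("mean activation drift:", (h_after - h_before).abs().mean().item())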

You absolutely, 100% need realtime learning to update the engrams in harmony with the percepts, at the scale of the entire model - the more sparse, hierarchical, and symbol-like the internal representation, the easier it will be to maintain updates, but with these massive multibillion-parameter models, even simple updates are going to be spread across tens or hundreds of millions of parameters across dozens of layers.
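
You can see that spread even in a toy model: one example and one gradient step already touch essentially every weight in every layer. This is just an illustrative sketch with made-up sizes, not a claim about any particular architecture.

    import torch, torch.nn as nn

    torch.manual_seed(0)
    # A toy stack of layers standing in for a large model (sizes are arbitrary)
    model = nn.Sequential(*[nn.Linear(256, 256) for _ in range(12)])

    # One "simple update": a single example, a single backward pass
    x, target = torch.randn(1, 256), torch.randn(1, 256)
    loss = nn.functional.mse_loss(model(x), target)
    loss.backward()

    # Even this one fact produces nonzero gradients almost everywhere
    for i, layer in enumerate(model):
        touched = (layer.weight.grad.abs() > 0).float().mean().item()
        print(f"layer {i:2d}: {touched:.0%} of weights receive a nonzero gradient")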

Long contexts are great, and they can make up for some of the shortcomings caused by the lack of realtime, online learning, but static engrams have consequences beyond simply managing something like an episodic memory. Fundamental knowledge representation has to be dynamic, contextual, and able to handle counterfactuals, and it has to meet those requirements without being brittle or subject to mode collapse.

There is only one way to get that sort of persistent memory, and that's through continuous learning. There's been a lot of progress in that realm over the last 2 years, but nobody has cracked it yet.

That might be the underlying function of consciousness, by the way - a meta-model that processes everything the model is "experiencing" and everything it "knows" at each step, arising from the need to stabilize the continuous learning process. Changes at that level propagate out through the entirety of the network. Subjective experience might be an epiphenomenal consequence of that meta-model.

It might not be necessary, which would be nice to be able to verify - purely functional, non-subjective AI vs. suffering AI is a distinction we'd want to get right.

At any rate, static model weights create problems that cannot be solved with long, or even infinite, contexts - not with recursion in the context stream, complex registers, or any other manipulation at the input level. The actual weights have to be dynamic and adaptive in an intelligent way.


You explain the limitations in learning and long-term memory (for lack of a better word) of current models in a much more knowledgeable and insightful way than I ever could. I'm going to save these comments in case I need to explain the current limitations we face to others in the future.



