A new paper argues that current LLMs are fundamentally broken because they’re completely static. The authors call it “anterograde amnesia”, which is honestly spot on. A model gets pre-trained, and from that moment on, its weights are frozen. It can’t actually learn anything new. Sure, it has a context window, but that’s just short-term memory: the model can’t take new information from its context and permanently update its own parameters. The knowledge in its MLP layers is stuck in the past, and the attention mechanism, the only part that’s live, forgets everything the moment it leaves the context.

The paper introduces what it terms Nested Learning to fix this. The whole idea is to stop thinking of a model as one big, deep stack of layers that all update at the same time. Instead, the authors take inspiration from the brain, which has all kinds of update cycles running at different speeds in the form of brain waves. They represent the model as a set of nested optimization problems, where each level has its own update frequency. Instead of just deep layers, you have levels defined by how often they learn.
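A minimal sketch of the levels idea, assuming made-up level names, update periods, and a toy gradient (none of this comes from the paper): each level owns its own parameters and only touches them when the token count hits a multiple of its period.

```python
import numpy as np

# Toy illustration of "levels defined by how often they learn".
# Level names, periods, sizes, and the toy gradient are assumptions
# made for this sketch, not the paper's actual formulation.
levels = [
    {"name": "fast",   "period": 1,       "params": np.zeros(8)},  # like attention: updates every token
    {"name": "medium", "period": 1_000,   "params": np.zeros(8)},  # consolidates occasionally
    {"name": "slow",   "period": 100_000, "params": np.zeros(8)},  # effectively frozen on short streams
]

def toy_gradient(params, token):
    # Stand-in for a real gradient: pull params toward the current token.
    return params - token

def process_stream(tokens, lr=0.01):
    for step, token in enumerate(tokens, start=1):
        for level in levels:
            # The nesting: a level only updates its own parameters when the
            # step counter reaches a multiple of its update period.
            if step % level["period"] == 0:
                level["params"] -= lr * toy_gradient(level["params"], token)

process_stream(np.random.randn(5_000, 8))
```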

This idea of levels is then used to extend the standard Transformer, which only has two: a fast attention level that updates at every token, and slow MLP layers that update only during pre-training. There’s no in-between.

The paper presents a Hierarchical Optimizers and Parallel Extensible (HOPE) model with additional levels. You might have a mid-frequency level that updates its own weights every, say, 1,000 tokens it processes, a slower level that updates every 100,000 tokens, and so on. The result is a model that can actually consolidate new information it sees after pre-training. It can learn new facts from a long document and bake them into that mid-level memory, all while the deep, core knowledge in the slowest level stays stable. This creates a proper gradient of memory from short-term to long-term, letting the model finally learn on the fly without forgetting everything instantly or suffering catastrophic forgetting.
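Here’s roughly what that consolidation could look like in toy form. Everything here (the chunk size, the Hebbian-style outer-product update, the way the three memories are combined at read time) is a stand-in I made up to illustrate the idea, not HOPE’s actual machinery: the fast context gets distilled into mid-level weights every CHUNK tokens, while the slow weights never change after pre-training.

```python
import numpy as np

DIM, CHUNK = 16, 1_000
rng = np.random.default_rng(0)

slow_weights = rng.standard_normal((DIM, DIM))  # "pre-trained", frozen forever
mid_memory   = np.zeros((DIM, DIM))             # consolidates new facts every CHUNK tokens
context      = []                               # fast, per-token short-term memory

def read(x):
    # A prediction can draw on all three memories; only their update rates differ.
    ctx = np.mean(context, axis=0) if context else np.zeros(DIM)
    return slow_weights @ x + mid_memory @ x + ctx

def write(x, lr=1e-3):
    context.append(x)
    if len(context) == CHUNK:
        # Consolidation step: distill the chunk into mid-level memory with a
        # simple Hebbian-style outer-product update (a stand-in for a real
        # gradient step), then clear the fast buffer. slow_weights is untouched.
        chunk = np.stack(context)
        mid_memory += lr * (chunk.T @ chunk) / CHUNK
        context.clear()

for token in rng.standard_normal((3 * CHUNK, DIM)):
    _ = read(token)
    write(token)
```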

      • Maeve@kbin.earth

        I was wondering if this is a first step to actual “learning” vs. mimicry.

        • ☆ Yσɠƚԋσʂ ☆@lemmygrad.ml (OP)

          It’s an important component, and another key aspect is establishing a feedback loop that provides both positive and negative reinforcement. I expect embodied intelligence will be the necessary path toward creating genuine AI. An organism’s brain maintains homeostasis by constantly balancing internal body signals with those from the external environment, making decisions to regulate its internal state. It’s a continuous feedback loop that allows the brain to evaluate the usefulness of its actions, which facilitates reinforcement learning. An embodied AI could use this same mechanism to learn about and interact with the world effectively. Furthermore, equipping it with an internal world model would enable meaningful communication with us. We would move beyond merely stringing text tokens together. Here, words would map to underlying representations that are fundamentally similar to our own.

          • Maeve@kbin.earth

            I can see it in my mind’s eye! If we can somehow teach it the best of our heuristics without the worst, that feedback loop could be so much more mutually beneficial! But it must be available to all people, everywhere. We might even convince ourselves to work across borders for the survival of more species and begin undoing the damage of insatiable greed.

  • PM_ME_VINTAGE_30S [he/him]@lemmy.sdf.org

    Sounds like a promising framework, but does the table at the bottom suggest that the HOPE architecture they came up with to demonstrate the framework is only incrementally better than the others?

      • PM_ME_VINTAGE_30S [he/him]@lemmy.sdf.org

        Yeah, definitely exciting news even if the empirical result was only incrementally better, because it demonstrates that the new framework recovers SOTA performance. Sounds like this framework might be helpful for analyzing and controlling dynamical systems, e.g. wrapping a system with an NL (Nested Learning) network that continuously improves as the system evolves. But I’m a dynamical systems guy so I’m gonna be a bit biased 😆
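
        A toy version of that “wrap a dynamical system” idea, just to make it concrete: the linear plant, the normalized-LMS update, and every constant here are stand-ins I picked for the sketch; an actual NL network would replace the linear estimator. The point is that the model of the dynamics keeps refining itself online while the system it wraps keeps evolving.

```python
import numpy as np

# Toy version of "wrap a dynamical system with a model that keeps improving".
# The plant, the NLMS update, and all constants are assumptions for this
# sketch; a Nested Learning network would replace the linear estimator.
rng = np.random.default_rng(1)
A_true = np.array([[0.9, 0.1], [-0.1, 0.95]])   # unknown plant dynamics
A_hat  = np.zeros((2, 2))                       # online model of the plant

x = rng.standard_normal(2)
for t in range(20_000):
    x_next = A_true @ x + 0.05 * rng.standard_normal(2)  # the system evolves
    error  = x_next - A_hat @ x                           # prediction error
    A_hat += 0.02 * np.outer(error, x) / (x @ x + 1e-8)   # normalized LMS step
    x = x_next

print(np.round(A_hat, 2))  # settles near A_true (up to noise) as data streams in
```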

        • ☆ Yσɠƚԋσʂ ☆@lemmygrad.ml (OP)

          Yeah, that would be a great application, but I think this is just a better approach in general. In any context where you’re extrapolating likely future states from the current state, having a system that can automatically tune itself is incredibly valuable. You can basically throw it at a data stream and have it learn to analyze it over time.