LLMs totally choke on long context because of an O(n²) scaling nightmare. It’s the core scaling problem for almost all modern LLMs, and it comes straight from their self-attention mechanism.

In simple terms, for every single token in the input, the attention mechanism has to look at and score every other token in that same input.

So, if you have a sequence with n tokens, the first token compares itself to all n tokens. The second token also compares itself to all n tokens… and so on. This means you end up doing n × n, or n², calculations.

This is a nightmare because the cost doesn’t grow nicely. If you double your context length, you’re not doing 2× the work; you’re doing 2² = 4× the work. If you 10× the context, you’re doing 10² = 100× the work. This explodes the amount of computation and, more importantly, the GPU memory needed to store all those scores. This is the fundamental bottleneck that stops you from just feeding a whole book into a model.
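
To make that concrete, here’s a minimal NumPy sketch (purely illustrative, not any particular model’s attention) that just materializes the pairwise score matrix vanilla self-attention needs:

```python
# Naive self-attention scores: every token against every other token.
import numpy as np

def attention_scores(x: np.ndarray) -> np.ndarray:
    """x: (n, d) token embeddings -> (n, n) pairwise scores."""
    n, d = x.shape
    # Real models use learned Q/K projections; identity is enough to show the cost.
    scores = x @ x.T / np.sqrt(d)   # n*n dot products: O(n^2 * d) compute, O(n^2) memory
    return scores

for n in (1_000, 2_000, 10_000):
    print(f"{n} tokens -> {n * n:,} score entries")
# Doubling n quadruples the entries; 10x the tokens means 100x the entries.
```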

Well, DeepSeek came up with a novel solution to just stop feeding the model text tokens. Instead, you render the text as an image and feed the model the picture. It sounds wild, but the whole point is that a huge wall of text can be “optically compressed” into way, way fewer vision tokens.

To do this, they built a new thing called DeepEncoder. It’s a clever stack that uses a SAM-base for local perception, then a 16× convolutional compressor to crush the token count, and then a CLIP model to capture the global meaning. This whole pipeline means it can handle high-res images without the GPU melting under activation memory.
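
For intuition about that middle stage, here’s a hedged sketch of what a 16× convolutional token compressor could look like; the layer choices and channel width are placeholders I’ve made up, not DeepSeek’s published DeepEncoder:

```python
# Sketch of the idea only: treat vision tokens as a 2D grid and downsample
# 4x along each axis (two stride-2 convs), giving 4*4 = 16x fewer tokens.
import torch
import torch.nn as nn

class TokenCompressor16x(nn.Module):
    def __init__(self, dim: int = 1024):   # channel width is a made-up placeholder
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        b, n, d = tokens.shape                     # (batch, h*w, dim) from the local stage
        grid = tokens.transpose(1, 2).reshape(b, d, h, w)
        grid = self.down(grid)                     # each spatial dim shrinks 4x
        return grid.flatten(2).transpose(1, 2)     # (batch, h*w/16, dim)

x = torch.randn(1, 64 * 64, 1024)                  # 4096 "local perception" tokens
print(TokenCompressor16x()(x, 64, 64).shape)       # torch.Size([1, 256, 1024]) -> 16x fewer
```

The payoff of compressing before the global (CLIP-style) stage is that the expensive global attention only ever sees the reduced token count.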

And the results are pretty insane. At a 10× compression ratio, the model can look at the image and “decompress” the original text with about 97% precision. It still gets 60% accuracy even at a crazy 20× compression. As a bonus, this thing is now a SOTA OCR model: it beats models like MinerU2.0 while using fewer than 800 vision tokens where the other guy needs almost 7,000. It can also parse charts into HTML, read chemical formulas, and handles around 100 languages.

The real kicker is what this means for the future. The authors are basically proposing this as an LLM forgetting mechanism. You could have a super long chat where the recent messages are crystal clear, but older messages get rendered into blurrier, lower-token images. It’s a path to unlimited context by letting the model’s memory fade, just like a human’s.
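
Nothing like this ships today, but a toy sketch of the “fading memory” policy might look like the following, where the age-to-budget mapping is entirely invented:

```python
# Speculative sketch of the forgetting idea: older chat turns get re-rendered
# at lower resolution, so they occupy progressively fewer vision tokens.
def vision_token_budget(turns_ago: int, base_tokens: int = 400) -> int:
    """Halve the budget every ~10 turns of age; floor at 25 tokens. Numbers are invented."""
    return max(base_tokens >> (turns_ago // 10), 25)

for age in (0, 5, 15, 30, 60):
    print(f"{age:>2} turns ago -> render at ~{vision_token_budget(age)} vision tokens")
# Recent turns stay sharp (400), turn 15 is blurrier (200), turn 30 drops to 50,
# and anything past ~60 turns bottoms out at the 25-token floor.
```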

  • 7bicycles [he/him]@hexbear.net · 3 days ago

    Well, DeepSeek came up with a novel solution to just stop feeding the model text tokens. Instead, you render the text as an image and feed the model the picture. It sounds wild, but the whole point is that a huge wall of text can be “optically compressed” into way, way fewer vision tokens.

    This is bullshit, man. This is computer alchemy. I detest this, it should not work.

    • 7bicycles [he/him]@hexbear.net · 3 days ago

      I don’t see DeepSeek really having much sway over the western AI bubble in the short term. The initial hit was like “oh shit, the backwater hellhole China can do this?” and that shakes investors, but then every government scrambled to just ban its usage because the Chinese are going to steal all your data, and that’s that.

      See also: Chinese EVs (including, but not only, cars)

  • hello_hello [comrade/them]@hexbear.net · 3 days ago

    Well, DeepSeek came up with a novel solution to just stop feeding the model text tokens. Instead, you render the text as an image and feed the model the picture. It sounds wild, but the whole point is that a huge wall of text can be “optically compressed” into way, way fewer vision tokens.

    I am impressed that this actually works. Was this ever done with western models, or is DeepSeek the first to really pioneer it?

    Also, this means the DeepSeek service would become even cheaper then; wouldn’t that be a death knell for the western AI business model?

    • Not necessarily cheaper, when we’ve seen these things just balloon to meet demand. Besides, from my own experience DeepSeek’s free models are sorta middle-of-the-road nowadays, but I don’t really use LLMs more than is ABSOLUTELY necessary to navigate the slop left behind by other LLMs.

    • ☆ Yσɠƚԋσʂ ☆@lemmygrad.ml (OP) · 3 days ago

      As far as I know this is a completely novel approach, and yeah, this should make DeepSeek cheaper and able to work on large documents or code projects, which is currently a problem for most models. I do expect western companies will start implementing this idea as well to keep up.

      • hello_hello [comrade/them]@hexbear.net · 3 days ago

        If you can answer, I wonder how far I could go with just $20? Is that like months’ worth of constant use? I want to put the price in perspective because it’s hard for me to wrap my mind around it.

        • BountifulEggnog [she/her]@hexbear.net · 3 days ago

          Chat.deepseek.com is free. No paid tier at all, and it runs their best model. API pricing, eh, it depends on use. That’s what the 40x refers to.

          I don’t think it’s worth bothering with the 1650 Super. 4 GB of VRAM is very little; you could run 4B models, but they are not good for standard use.

          edit: API pricing is 28 cents per 1M input tokens and 42 cents per 1M output tokens. Assuming requests are 10k tokens in and you get 5k out (a lot, imo), you’d get about 200 requests per dollar.
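
          A quick sanity check of that estimate, using the prices quoted above:

          ```python
          # Using the quoted prices: $0.28 per 1M input tokens, $0.42 per 1M output tokens.
          price_in, price_out = 0.28 / 1_000_000, 0.42 / 1_000_000   # dollars per token
          cost = 10_000 * price_in + 5_000 * price_out                # 10k in, 5k out per request
          print(f"${cost:.4f} per request, ~{1 / cost:.0f} requests per dollar")
          # -> $0.0049 per request, ~204 requests per dollar, i.e. roughly 200
          ```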

        • LangleyDominos [none/use name]@hexbear.net · 3 days ago

          Do you have an NVIDIA GPU? If so, it’s fairly easy to run these things locally and you won’t have to pay at all (except for electricity). Hugging Face has DeepSeek-OCR:

          https://huggingface.co/deepseek-ai/DeepSeek-OCR

          Ollama lets you run the model while using a browser as an interface

          https://ollama.com/

          Download and install Ollama. Then you have to download the DeepSeek-OCR tensors and place them in the correct folder (see the Ollama documentation). You might have to install CUDA for your NVIDIA card. There are tons of videos and written instructions out there.
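
          If you’d rather skip Ollama, a rough sketch of the Hugging Face route looks like this; it’s just the standard transformers pattern for checkpoints that ship custom code, so treat it as a starting point and follow the model card for the actual OCR call:

          ```python
          # Rough sketch, assuming the Hugging Face checkpoint linked above.
          import torch
          from transformers import AutoModel, AutoTokenizer

          name = "deepseek-ai/DeepSeek-OCR"
          tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
          model = AutoModel.from_pretrained(name, trust_remote_code=True)
          model = model.eval().to("cuda", dtype=torch.bfloat16)   # needs an NVIDIA GPU + CUDA

          # The repo's custom code provides the OCR entry point; see the usage example at
          # https://huggingface.co/deepseek-ai/DeepSeek-OCR for the prompt and image arguments.
          ```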

          • hello_hello [comrade/them]@hexbear.net · 3 days ago

            Is a 1650 Super enough? The prebuilt I bought like 5 years ago also only has 8 GB of RAM. Relying on CUDA isn’t appealing to me either.

            In any case, I do want to run the higher-parameter models (I’m able to run the 8B models with Ollama on the MacBook Air M1 just through software).

            Ollama is getting Vulkan support, which I hope might help. But I don’t mind delegating it all to DeepSeek as a service.

  • JoeByeThen [he/him, they/them]@hexbear.net · 3 days ago

    Oh shit, I thought about trying something like that with RNNs years ago when I learned that there were folks doing audio and brainwave processing networks with CNNs. My life blew up and I never got to try it. Nifty!

  • Moidialectica [he/him, comrade/them]@hexbear.net · 3 days ago

    I wonder if it can be used with RAG to capture the most closely connected chunks with more clarity, and those with lower scores with less clarity. It wouldn’t matter much when a good dataset makes RAG retrieval almost always accurate, but with worse models it could let it pick out the chunks that are certain and still keep the ones that are just ‘maybe’.

    • ☆ Yσɠƚԋσʂ ☆@lemmygrad.ml (OP) · 3 days ago

      Yeah, I imagine it would be relatively easy to track the original text, and then you could use the image encoding to zero in on the concrete part of the context you want to recall. Even if it’s fuzzy, it would cut down the amount of search you have to do on retrieval.

      • Moidialectica [he/him, comrade/them]@hexbear.net · 3 days ago

        For me, it’s really good, especially the compression. Even Gemini models struggle with 200 thousand tokens, but with DeepSeek-OCR it should be possible to input 500k tokens and have it function like it’s 50k. This is gonna be helpful once it’s properly ready.

        • ☆ Yσɠƚԋσʂ ☆@lemmygrad.ml (OP) · 3 days ago

          Indeed, I think it’ll be really handy for coding tasks as well. It’ll be able to load large projects into context and find things in them much more easily now.

  • NuraShiny [any]@hexbear.net · 3 days ago

    I really need to just block this sub, because the stupid hype for LLMs is so disgusting it makes my skin crawl.