- cross-posted to:
- technology@lemmy.ml
LLMs totally choke on long context because of that O(n²) scaling nightmare. It’s the core scaling problem for almost all modern LLMs because of their self-attention mechanism.
In simple terms, for every single token in the input, the attention mechanism has to look at and calculate a score against every other single token in that same input.
So, if you have a sequence with n tokens, the first token compares itself to all n tokens. The second token also compares itself to all n tokens… and so on. This means you end up doing n×n, or n², calculations.
This is a nightmare because the cost doesn’t grow nicely. If you double your context length, you’re not doing 2× the work; you’re doing 2² = 4× the work. If you 10× the context, you’re doing 10² = 100× the work. This explodes the amount of computation and, more importantly, the GPU memory needed to store all those scores. This is the fundamental bottleneck that stops you from just feeding a whole book into a model.
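To make the blow-up concrete, here’s a rough back-of-the-envelope sketch in Python (toy numbers, nothing to do with any real model’s config):

```python
# Toy illustration of why attention compute/memory scales as n^2:
# every token attends to every other token, so you get an n x n score matrix.

def attention_score_count(n_tokens: int) -> int:
    return n_tokens * n_tokens

for n in (1_000, 2_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {attention_score_count(n):>18,} pairwise scores")

# Doubling n quadruples the scores; 10x-ing n gives 100x.
```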
Well, DeepSeek came up with a novel solution to just stop feeding the model text tokens. Instead, you render the text as an image and feed the model the picture. It sounds wild, but the whole point is that a huge wall of text can be “optically compressed” into way, way fewer vision tokens.
To do this, they built a new thing called DeepEncoder. It’s a clever stack that uses a SAM-base model for local perception, then a 16× convolutional compressor to crush the token count, and then a CLIP model to capture the global meaning. This whole pipeline means it can handle high-res images without activation memory melting the GPU.
And the results are pretty insane. At a 10x compression ratio, the model can look at the image and “decompress” the original text with about 97% precision. It still gets 60% accuracy even at a crazy 20x compression. As a bonus, this thing is now a SOTA OCR model. It beats other models like MinerU2.0 while using fewer than 800 tokens when the other guy needs almost 7,000. It can also parse charts into HTML, read chemical formulas, and understands like 100 languages.
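My loose mental model of the token math, with made-up numbers (the patch size, image size, and text-token count here are assumptions for illustration, not the paper’s exact setup):

```python
# Rough sketch of the "optical compression" accounting.

patch_size = 16           # ViT-style patches (assumed)
image_side = 1024         # render the text into a 1024x1024 image (assumed)

patch_tokens = (image_side // patch_size) ** 2   # 4096 raw patch tokens
vision_tokens = patch_tokens // 16               # 16x conv compressor -> 256 tokens

text_tokens = 2500        # pretend the rendered page held ~2500 text tokens
print(f"{text_tokens} text tokens squeezed into {vision_tokens} vision tokens "
      f"(~{text_tokens / vision_tokens:.0f}x compression)")
```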
The real kicker is what this means for the future. The authors are basically proposing this as an LLM forgetting mechanism. You could have a super long chat where the recent messages are crystal clear, but older messages get rendered into blurrier, lower-token images. It’s a path to unlimited context by letting the model’s memory fade, just like a human’s.
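If you wanted to play with that idea, a fading-memory policy could be as simple as this (entirely hypothetical, not something from the paper’s code):

```python
# Toy "fading memory" policy: older turns get a smaller vision-token budget.

def render_budget(age_in_turns: int, base_tokens: int = 256) -> int:
    # Halve the budget every 10 turns, never dropping below 16 tokens.
    return max(16, base_tokens >> (age_in_turns // 10))

for age in (0, 10, 20, 40):
    print(f"turn age {age:>2}: render at ~{render_budget(age)} vision tokens")
```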
This is bullshit, man. This is computer alchemy. I detest this; it should not work.
🤣
Information is information, I guess
Welp, time to watch the AI stocks crash once again
I don’t see DeepSeek really having much sway over the western AI bubble in the short term. The initial hit was like “oh shit, the backwater hellhole China can do this?” and that shakes investors, but then every government scrambled to just ban its usage because the Chinese are going to steal all your data, and that’s that.
See also: Chinese EVs (including, but not only, cars)
Tariffs are pretty effective against everything China does right now in the US, but once the rest of the world is lapping the US with cheaper, more effective tools and products, it’s no longer sustainable.
define “rest of the world” here
the countries not in green, basically
I am impressed that this actually works. Was this ever done with western models, or is DeepSeek the first to really pioneer it?
Also, this means the DeepSeek service would become even cheaper. Wouldn’t that be a death knell for the western AI business model?
Not necessarily cheaper, when we’ve seen these things just balloon to meet demand. Besides, from my own experience DeepSeek’s free models are sorta middle-of-the-road nowadays, but I don’t really use LLMs more than is ABSOLUTELY necessary to navigate the slop left behind by other LLMs.
As far as I know this is a completely novel approach, and yeah, this should make DeepSeek cheaper and able to work on large documents or code projects, which is currently a problem for most models. I do expect that western companies will start implementing this idea as well to keep up.
Deepseek is already 40x cheaper than Claude right now. I don’t think there is a tipping point here.
If you can answer, I wonder how far I could go with just $20? Is that like months’ worth of constant use? I want to put the price in perspective because it’s hard for me to wrap my mind around it.
I bought $2 of tokens in July and use it fairly heavily for my Obsidian setup, nvim, and opencode. Still haven’t run out yet.
Chat.deepseek.com is free. No paid tier at all, running their best model. API pricing, eh, it depends on use. That’s what the 40x refers to.
I don’t think it’s worth bothering with the 1650 Super. 4 GB of VRAM is very little; you could run 4B models, but they’re not good for standard use.
edit: API pricing is $0.28 per 1M input tokens and $0.42 per 1M output tokens. Assuming requests are 10k tokens in and 5k out (a lot imo), you’d get about 200 requests per dollar.
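The arithmetic, if anyone wants to check it (the request sizes are my own guesses, as above):

```python
# Back-of-envelope on the quoted API pricing (USD per 1M tokens).
price_in, price_out = 0.28, 0.42
tokens_in, tokens_out = 10_000, 5_000    # one fairly heavy request

cost = tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out
print(f"~${cost:.4f} per request, ~{1 / cost:.0f} requests per dollar")
```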
Do you have an Nvidia GPU? If so, it’s fairly easy to run these things locally and you won’t have to pay at all (except for electricity). Hugging Face has DeepSeek OCR:
https://huggingface.co/deepseek-ai/DeepSeek-OCR
Ollama lets you run the model while using a browser as an interface
Download and install Ollama. Then you have to download the DeepSeek OCR tensors and place them in the correct folder (see the Ollama documentation). You might have to install CUDA for your Nvidia card. There are tons of videos and written instructions out there.
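Once it’s registered in Ollama you can hit the local REST API from anything. Here’s a rough Python sketch; the model tag “deepseek-ocr” is a placeholder for whatever name you gave the weights:

```python
# Minimal sketch: send an image to a locally running Ollama model.
import base64
import json
import urllib.request

with open("page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = json.dumps({
    "model": "deepseek-ocr",               # placeholder tag, adjust to your setup
    "prompt": "Transcribe the text in this image.",
    "images": [image_b64],
    "stream": False,
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/generate",  # Ollama's default local endpoint
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```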
Is a 1650 Super enough? The prebuilt I bought like 5 years ago also only has 8 GB of RAM. Also, relying on CUDA isn’t appealing to me either.
In any case, I do want to run the higher-parameter models (I’m able to run the 8B models on Ollama on a MacBook Air M1 just through software).
Ollama is getting a Vulkan backend, which I hope might help. But I don’t mind delegating it all to DeepSeek as a service.
I can’t find a solid answer, but it probably does want more than 8 GB. Maybe a quantized model will come out soon.
So this works because a picture is worth a thousand words?
turns out that’s no longer just a metaphor :)
Oh shit, I thought about trying something like that with RNNs years ago when I learned that there were folks doing audio and brainwave processing networks with CNNs. My life blew up and I never got to try it. Nifty!
Yeah, it’s a really clever trick. Always neat when you think of something and then get validated :)
Lol in my head right now:
I’M WICKED SMAHT!!!
AI shut down the Amazon servers earlier today, I knew it
they’re trying to save us from ourselves. please let the whole internet officially die next time
I wonder if it can be used with RAG to capture the most closely connected chunks with more clarity, and those with lower scores with less clarity. It wouldn’t matter much since a good dataset makes RAG retrieval almost always accurate, but with worse models it could let it pick out the chunks it’s certain about and still keep the ones that are just ‘maybe’.
Yeah, I imagine it would be relatively easy to track the original text, and then you could use the image encoding to zero in on the concrete part of the context you want to recall. Even if it’s fuzzy, it would cut down the amount of search you have to do on retrieval.
For me, it’s really good, especially the compression. Even Gemini models struggle with 200 thousand tokens, but with DeepSeek OCR it should be possible to input 500k tokens and have it function like it’s 50k. This is gonna be helpful once it’s properly ready.
Indeed, I think it’ll be really handy for coding tasks as well. It’ll be able to load large projects into context, and find things in them much easier now.
I really need to just block this sub, because the stupid hype for LLMs is so disgusting it makes my skin crawl.
Why do people feel the need to announce that they’re going to block a sub because they don’t like what other people are interested in? Just do what you need to do, and let the rest of us enjoy things.
The constant moaning and whining from people not liking things other people like never gets old.
truly, it’s like people have a protagonist complex
Hey now this is about more than LLM hype.
This is about DeepSeek crashing the nvidia stocks and causing the AI bubble to pop.
this is not an airport, no need to announce your departure