
DeepSeek has once again set the AI world buzzing with its new model, DeepSeek-OCR.
At first glance, DeepSeek-OCR looks like just another vision language model (VLM). But as their research paper shows, it is much more than that. DeepSeek-OCR introduces a completely new way to think about how AI models store, process, and compress information.
Rather than simply improving OCR, it challenges a fundamental assumption: LLMs must process information as long sequences of text tokens. DeepSeek-OCR shows that this doesn’t have to be the case. A model can “see” information instead of just reading it, achieving the same understanding with a fraction of the computation.
In this blog post, we’ll look at why DeepSeek-OCR matters, how its architecture works, how it performs on benchmarks, and what it could mean for the future of LLMs.
LLMs today face a growing computational bottleneck when handling long pieces of text. Every token they process consumes resources: floating-point operations (FLOPs), memory, time, and energy. A 10,000-token article means 10,000 discrete processing steps. That’s like forcing a model to read every single word in sequence, even when much of the content is repetitive or predictable.
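To make that scaling concrete, here is a rough, back-of-the-envelope cost model. The hidden size and layer count are hypothetical (not any particular model’s); the point is only that the linear term grows with prompt length while self-attention grows with its square.

```python
# Rough, back-of-the-envelope cost model for prefilling a prompt of n tokens.
# The hidden size and layer count are hypothetical, chosen only to show the
# trend: the linear term grows with n, self-attention grows with n^2.

def prompt_flops(n_tokens: int, d_model: int = 4096, n_layers: int = 32) -> float:
    # Projections + MLP: roughly O(n * d_model^2) per layer.
    linear = n_layers * n_tokens * 12 * d_model ** 2
    # Self-attention score and value mixing: roughly O(n^2 * d_model) per layer.
    attention = n_layers * 2 * n_tokens ** 2 * d_model
    return float(linear + attention)

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> ~{prompt_flops(n):.2e} FLOPs")
```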
DeepSeek-OCR rethinks what a token can be with Contexts Optical Compression. Instead of treating long text sequences as endless strings of small, low-information text tokens, it uses the visual modality as a more efficient compression channel for textual information.
In this framework, it compresses the same content into a smaller set of dense visual tokens. Each visual token carries much richer information, such as typography, layout, and spatial relationships between words. This allows the model to encode and understand entire chunks of text at once.
The result: the model achieves the same semantic understanding with an order of magnitude fewer computation steps. What once required 1,000 text tokens might now be represented by just 100 visual ones. This can reduce processing time and cost dramatically while preserving context.
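As a toy illustration of why the ratio can be so favorable, the sketch below compares a crude text-token estimate for a page against the number of vision tokens a patch-based encoder would emit. The characters-per-token figure, patch size, and downsampling factor are assumptions for this sketch, not DeepSeek-OCR’s actual settings.

```python
# Toy comparison: how many tokens does one page cost as raw text vs. as an image?
# The characters-per-token figure, patch size, and downsampling factor are
# assumptions for this sketch, not DeepSeek-OCR's actual settings.

def text_token_count(text: str, chars_per_token: float = 4.0) -> int:
    """Crude text-token estimate (~4 characters per token for English)."""
    return max(1, round(len(text) / chars_per_token))

def vision_token_count(img_w: int, img_h: int, patch: int = 16, downsample: int = 4) -> int:
    """Patchify the page image, then apply an extra spatial downsampling step,
    mimicking an encoder that compresses patches before the decoder sees them."""
    patches = (img_w // patch) * (img_h // patch)
    return patches // (downsample ** 2)

page_text = "word " * 1000                               # ~1,000 words of content
print("text tokens  :", text_token_count(page_text))     # -> 1250
print("vision tokens:", vision_token_count(1024, 1024))  # 4096 patches -> 256 tokens
```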
DeepSeek-OCR is thus more than just an open-source OCR model. It is a proof of concept for a new paradigm in AI efficiency: let models see information instead of merely reading it.
Even Andrej Karpathy noted that DeepSeek-OCR raises a deeper question: are pixels better inputs to LLMs than text? He suggests that text tokens might be inherently wasteful, a piece of “historical baggage” that could eventually be replaced by visual inputs for efficiency.

DeepSeek-OCR features a unified end-to-end VLM architecture built around two brains that work together: a visual encoder and a language decoder.

The DeepEncoder is where the compression happens. It handles high-resolution inputs while keeping activation memory and vision-token counts low.
The DeepSeek team built it from the ground up because no existing open-source encoder met their requirements. They needed a model that could process high-resolution document pages, keep activation memory under control at those resolutions, and emit only a small number of vision tokens.
To meet these conditions, the team designed a 380M-parameter encoder that achieves high compression ratios and outputs a manageable number of vision tokens. It combines three main components: a SAM-based vision backbone (roughly 80M parameters) that applies window attention for local perception, a small convolutional module that downsamples the visual tokens by 16×, and a CLIP-based component (roughly 300M parameters) that applies global attention to aggregate visual knowledge.
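Here is a minimal sketch of that three-stage flow, using standard PyTorch building blocks as stand-ins. The layer sizes, patch size, and compressor wiring are illustrative assumptions, not the released DeepSeek-OCR implementation.

```python
import torch
import torch.nn as nn

class DeepEncoderSketch(nn.Module):
    """Stand-in for the three-stage flow: local attention over patches,
    a convolutional 16x token compressor, then global attention."""

    def __init__(self, d_local: int = 768, d_global: int = 1024):
        super().__init__()
        # Stage 1: patchify the page and run a local-attention stand-in.
        self.patchify = nn.Conv2d(3, d_local, kernel_size=16, stride=16)
        self.local_attn = nn.TransformerEncoderLayer(d_local, nhead=8, batch_first=True)
        # Stage 2: convolutional compressor, 4x per side -> 16x fewer tokens.
        self.compress = nn.Sequential(
            nn.Conv2d(d_local, d_global, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(d_global, d_global, kernel_size=3, stride=2, padding=1),
        )
        # Stage 3: global attention over the much smaller token set.
        self.global_attn = nn.TransformerEncoderLayer(d_global, nhead=8, batch_first=True)

    def forward(self, image: torch.Tensor) -> torch.Tensor:  # image: (B, 3, 1024, 1024)
        x = self.patchify(image)                              # (B, d_local, 64, 64)
        b, c, h, w = x.shape
        x = self.local_attn(x.flatten(2).transpose(1, 2))     # (B, 4096, d_local)
        x = x.transpose(1, 2).reshape(b, c, h, w)
        x = self.compress(x)                                  # (B, d_global, 16, 16)
        x = x.flatten(2).transpose(1, 2)                      # (B, 256, d_global)
        return self.global_attn(x)                            # 256 vision tokens

vision_tokens = DeepEncoderSketch()(torch.randn(1, 3, 1024, 1024))
print(vision_tokens.shape)                                    # torch.Size([1, 256, 1024])
```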
Once the encoder compresses the input into visual tokens, the DeepSeek-3B-MoE Decoder turns them back into text.
This decoder uses an MoE design. During inference, it activates only 6 of its 64 routed experts plus 2 shared ones, for roughly 570M activated parameters. This gives it the capacity of a 3B model at the inference cost of one under 600M, a strong balance between performance and efficiency.
The decoder receives the vision tokens along with the prompt and generates the final output. Beyond plain text, this can include structured results such as chemical formulas and planar geometric figures parsed from the document.
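The routing pattern is easier to see in code. Below is a minimal MoE layer sketch that runs 2 shared experts on every token plus each token’s top-6 of 64 routed experts; the hidden sizes and softmax gating are assumptions for illustration, not DeepSeek’s implementation.

```python
import torch
import torch.nn as nn

class MoELayerSketch(nn.Module):
    """One MoE feed-forward layer: 2 shared experts see every token,
    and each token is also routed to its top-6 of 64 specialist experts."""

    def __init__(self, d_model: int = 1280, d_ff: int = 1024,
                 n_experts: int = 64, top_k: int = 6, n_shared: int = 2):
        super().__init__()
        def expert():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))
        self.experts = nn.ModuleList(expert() for _ in range(n_experts))
        self.shared = nn.ModuleList(expert() for _ in range(n_shared))
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (n_tokens, d_model)
        out = sum(e(x) for e in self.shared)                # shared experts, every token
        weights = self.router(x).softmax(dim=-1)            # (n_tokens, n_experts)
        top_w, top_idx = weights.topk(self.top_k, dim=-1)
        routed = []
        for t in range(x.size(0)):                          # naive per-token dispatch
            routed.append(sum(w * self.experts[int(i)](x[t])
                              for w, i in zip(top_w[t], top_idx[t])))
        return out + torch.stack(routed)

layer = MoELayerSketch()
print(layer(torch.randn(4, 1280)).shape)                    # torch.Size([4, 1280])
```

Only the selected experts run for each token, which is why the activated parameter count (and thus the inference cost) stays far below the full 3B parameter budget.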
DeepSeek-OCR demonstrates strong efficiency and accuracy across benchmarks.
At a 10× compression ratio, it retains around 97% accuracy. Even at higher compression levels (up to 20×), it can still produce usable results at roughly 60% accuracy.
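For reference, the compression ratio here is the number of ground-truth text tokens on a page divided by the number of vision tokens it is encoded into. The tiny helper below ties the quoted figures together; treating 10× and 20× as hard thresholds is a simplification for illustration.

```python
# Compression ratio = ground-truth text tokens / vision tokens for the page.
# The accuracy regimes echo the benchmark figures quoted above; the hard
# thresholds are an illustrative simplification.

def compression_ratio(n_text_tokens: int, n_vision_tokens: int) -> float:
    return n_text_tokens / n_vision_tokens

def expected_regime(ratio: float) -> str:
    if ratio <= 10:
        return "~97% accuracy regime"
    if ratio <= 20:
        return "degraded but usable (~60% accuracy)"
    return "beyond the reported operating range"

r = compression_ratio(n_text_tokens=1000, n_vision_tokens=100)
print(r, "->", expected_regime(r))   # 10.0 -> ~97% accuracy regime
```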

On OmniDocBench, a leading benchmark for document understanding, DeepSeek-OCR outperforms established baselines such as GOT-OCR 2.0 and MinerU 2.0, achieving higher accuracy with far fewer tokens.

In deployment, it also proves highly practical. With a single A100 40GB GPU, DeepSeek-OCR can process more than 200K pages per day. This makes it a viable solution for large-scale document processing and training data generation for LLMs and VLMs.
By turning text into visual representations, DeepSeek-OCR shows that LLMs don’t always have to process information through text tokens. A single image can carry far more meaning with far fewer tokens. This idea opens up a promising direction for building long-context and more efficient LLMs.
What this means: long-context processing could become dramatically cheaper, a single GPU can generate large-scale training data for LLMs and VLMs, and optical compression hints at a new way to handle model memory.
For the last point, the paper introduces a fascinating concept inspired by human memory. As human beings, we gradually forget details while keeping what’s important. DeepSeek-OCR proposes something similar: it uses optical compression to shrink and progressively blur older conversation history, so recent context stays sharp while older context fades.

This “visual forgetting” mechanism could enable models to manage ultra-long conversations more efficiently. It preserves what matters most while letting less relevant details fade naturally. It’s an early but promising step toward more powerful multimodal AI and theoretically unlimited context architectures.
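A minimal sketch of how such a scheme could budget vision tokens across a conversation follows, assuming a hypothetical pipeline where older turns are stored as progressively smaller rendered images. The resolutions, halving schedule, and patch/token accounting are assumptions for illustration, not the paper’s recipe.

```python
# Sketch of a token budget for "visual forgetting": older turns are stored as
# rendered images, and the further back a turn is, the smaller its image (and
# the fewer vision tokens it costs). Resolutions, the halving schedule, and the
# patch/token accounting are assumptions, not the paper's recipe.

def vision_tokens_for(side_px: int, patch: int = 16, downsample: int = 4) -> int:
    return (side_px // patch) ** 2 // (downsample ** 2)

def history_token_budget(n_turns: int, newest_side: int = 1024, min_side: int = 256) -> list[int]:
    """Give each past turn an image resolution that halves every two turns,
    so older turns occupy progressively fewer vision tokens."""
    budget = []
    for age in range(n_turns):                            # age 0 = most recent turn
        side = max(min_side, newest_side >> (age // 2))   # halve the side every 2 turns
        budget.append(vision_tokens_for(side))
    return budget

print(history_token_budget(8))   # [256, 256, 64, 64, 16, 16, 16, 16]
```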
DeepSeek-OCR is more than an OCR breakthrough. It’s a glimpse into a new way of thinking about efficiency and memory in AI systems. By replacing text tokens with compact visual representations, it redefines how models can scale, reason, and remember.
Learn more: