Chinese artificial intelligence startup DeepSeek has introduced DeepSeek-OCR, an open-source model accompanied by a research paper that pioneers a novel "optical compression" method for reducing the token processing requirements of large language models (LLMs). The technique lets LLMs handle lengthy texts while consuming up to 20 times fewer tokens, and correspondingly less memory, addressing a significant bottleneck in AI performance.
Tackling memory constraints
Limited memory capacity often restricts LLMs' ability to process ultra-long documents effectively, because token limits force older context to be dropped. DeepSeek's innovation converts text into images that undergo optical compression, preserving semantic content while drastically lowering computational demands. By condensing earlier context into compact "image memories" that occupy far fewer tokens, the approach uses resources more efficiently than processing the raw text directly.
DeepSeek's solution consists of two interconnected components: DeepEncoder, a roughly 380-million-parameter encoder that converts image data into condensed visual features, and DeepSeek3B-MoE-A570M, a mixture-of-experts decoder with about 3 billion total parameters of which roughly 570 million are active per token, which reconstructs text from the compressed features. Tests indicate that the model retains 97% accuracy at compression ratios below 10 times, with roughly 60% of the information preserved even at 20-times compression. This advancement suggests AI systems could efficiently handle documents spanning millions of words without a proportional increase in computational cost.
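To make that trade-off concrete, the minimal sketch below treats the compression ratio as the number of original text tokens divided by the vision tokens that replace them, and maps it onto the accuracy bands quoted in the tests above. The function names and the example token counts are illustrative assumptions, not DeepSeek's code.

```python
# Minimal sketch relating compression ratio to the accuracy figures quoted above.
# Function names and example token counts are illustrative, not DeepSeek's code.

def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Original text tokens divided by the vision tokens that replace them."""
    return text_tokens / vision_tokens

def reported_fidelity(ratio: float) -> str:
    """Accuracy bands from the reported tests: ~97% below 10x, ~60% near 20x."""
    if ratio < 10:
        return "about 97% accuracy"
    if ratio <= 20:
        return "roughly 60% of information preserved"
    return "outside the reported range"

# Hypothetical example: a 2,000-token passage rendered as an image and
# encoded into 250 vision tokens gives an 8x compression ratio.
ratio = compression_ratio(2000, 250)
print(f"{ratio:.0f}x compression -> {reported_fidelity(ratio)}")
```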
DeepSeek-OCR can also generate more than 200,000 pages of training data per day on a single Nvidia A100-40G GPU, offering significant reductions in data production costs. The model and related resources are currently available on platforms such as Hugging Face and GitHub.
Building on OpenAI and Meta foundations
DeepSeek-OCR extends traditional Optical Character Recognition (OCR) technology by incorporating components from open-source models released by OpenAI and Meta Platforms Inc. Its multi-stage compression pipeline uses Meta's Segment Anything Model (SAM) to segment images into zones such as text, charts, and tables, helping the system understand document structure, while OpenAI's CLIP model creates semantic links between text and images, enriching the model's comprehension capabilities.
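As a rough mental model of that pipeline, the sketch below strings the stages together in the order described above. All function names (segment_regions, encode_vision_tokens, compress_tokens, decode_text) are hypothetical placeholders for the SAM-based segmentation, CLIP-style encoding, compression, and MoE decoding stages, not the APIs of SAM, CLIP, or DeepSeek-OCR.

```python
# Conceptual sketch only of the multi-stage pipeline described above.
# All names are hypothetical placeholders, not DeepSeek's published implementation.
from dataclasses import dataclass
from typing import List

@dataclass
class Region:
    kind: str        # e.g. "text", "chart", "table"
    pixels: bytes    # cropped image data for the region

def segment_regions(page_image: bytes) -> List[Region]:
    """Stand-in for the SAM-based stage that splits a page into zones."""
    raise NotImplementedError

def encode_vision_tokens(regions: List[Region]) -> List[float]:
    """Stand-in for the CLIP-style stage linking visual content to semantics."""
    raise NotImplementedError

def compress_tokens(vision_tokens: List[float], ratio: int = 16) -> List[float]:
    """Placeholder for the compression step that shrinks the token sequence."""
    return vision_tokens[::ratio]  # keep 1 of every `ratio` tokens (illustrative only)

def decode_text(compressed_tokens: List[float]) -> str:
    """Stand-in for the MoE decoder that reconstructs text from compressed features."""
    raise NotImplementedError

def ocr_pipeline_sketch(page_image: bytes) -> str:
    """Conceptual data flow only: segment -> encode -> compress -> decode."""
    regions = segment_regions(page_image)             # SAM-style zone segmentation
    vision_tokens = encode_vision_tokens(regions)     # CLIP-style semantic encoding
    compressed = compress_tokens(vision_tokens, 16)   # ~16x fewer tokens
    return decode_text(compressed)                    # decoder reconstructs the text
```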
Through this pipeline, a standard 1,024 by 1,024-pixel image can be compressed from 4,096 tokens down to just 256 after 16-fold compression, substantially reducing computational overhead. Beyond conventional image tagging and object recognition, DeepSeek-OCR demonstrates proficiency in parsing complex visual data like financial reports, chemical formulas, and geometric figures. This adaptability positions the technology as a promising tool for sectors requiring high-density information processing, including finance, scientific research, and education, where classic OCR solutions often fall short.
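The token arithmetic in that example can be checked directly: 4,096 patch tokens for a 1,024 by 1,024-pixel image implies 16 by 16-pixel patches, and a 16-fold compressor then brings them down to 256. The short sketch below reproduces those numbers; the patch size is inferred from the figures cited in the article rather than quoted from the paper.

```python
# Check the token figures cited above. The 16-pixel patch size is inferred from
# the article's numbers (1024x1024 image -> 4096 patch tokens), not quoted directly.
image_w = image_h = 1024
patch_size = 16                                  # inferred: 1024 / 16 = 64 patches per side
patch_tokens = (image_w // patch_size) * (image_h // patch_size)
assert patch_tokens == 4096

compression_factor = 16
vision_tokens = patch_tokens // compression_factor
assert vision_tokens == 256

print(f"{patch_tokens} patch tokens -> {vision_tokens} vision tokens "
      f"after {compression_factor}x compression")
```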
Article edited by Jerry Chen