Open-source OCR Model Chandra OCR 2: Structured Conversion for Complex Documents
### GitHubDaily @GitHub_Daily
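When extracting text from PDFs and images, traditional OCR often produces jumbled output and loses the layout entirely, especially on documents that mix tables, formulas, and handwriting.
Today I came across another new open-source OCR model: Chandra OCR 2. It converts images and PDFs directly into structured Markdown, HTML, or JSON while preserving the original layout.
It supports more than 90 languages, ranked first overall on a multilingual OCR benchmark, and performs strongly on math formulas, complex tables, and handwritten content.
GitHub: github.com/datalab-to/cha…
What makes it practical is that it accurately reconstructs form structure, including details such as checkboxes, and it can extract charts and figures from a document and automatically generate descriptions for them. A single command processes either one file or an entire directory.
If you often need to convert scans, papers, or multilingual documents into an editable format, this tool is worth trying. Its Chinese recognition accuracy is close to 89%.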
Mar 28, 2026, 1:30 PM
0 Replies
19 Retweets
62 Likes
3,394 Views
One Sentence Summary
Introducing Chandra OCR 2, an open-source OCR model that converts PDFs and images into structured Markdown, HTML, or JSON while preserving original formatting.
Summary
This tweet recommends the open-source project Chandra OCR 2, an OCR model designed for complex documents containing tables, formulas, and handwriting. It supports more than 90 languages and, according to the tweet, ranked first overall on a multilingual OCR benchmark, with Chinese recognition accuracy close to 89%. It converts documents into structured Markdown, HTML, or JSON while preserving the original layout and form structure, including checkboxes. The tweet recommends it for anyone who frequently converts scanned files, research papers, or multilingual documents into editable formats.
AI Score
80
Influence Score 27
Published At Today
Language
Chinese
Tags
OCR
Open Source
Chandra OCR
Document Processing
GitHub