What matters above everything else when evaluating OCR for most business applications is accuracy.
Some results look plausible but are just plain wrong. That is worse than useless.
Example: the "Table" sample document contains chemical substances and their properties. How many numbers did the LLM output and associate correctly? That is all that matters. There is no "preference" aspect that is relevant until the data is correct. Nicely formatted incorrect data is still incorrect.
I reviewed the output from Qwen3-VL-8B on this document. It mixes up the rows, so many values end up associated with the wrong substance. Using its output for any real purpose would be incredibly dangerous, and this model should not be used for such a task. There is no redeeming aspect to it. Does another model produce even worse results? Then both models should be avoided at all costs.
Are there models available that are accurate enough for this purpose? I don't know; it is very time-consuming to evaluate. This particular table seems pretty legible, so a real production-grade OCR solution would probably need a 100% score on this example before it could be adopted. The output of such a table is not something humans are good at reviewing: it is difficult to spot errors. It either needs to be entirely correct, or the OCR has failed completely.
I am confident we'll reach a point where a mix of traditional OCR and LLM models can produce correct and usable output. I would welcome a benchmark where (objective) correctness is rated separately from the (subjective) output structure.
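To make that concrete, here is a minimal sketch of what a correctness-only scorer could look like, separate from any judgment about formatting. It assumes both the ground truth and the model output have already been parsed into (substance, property) → value maps; all names here are hypothetical, not taken from any existing benchmark.

```python
# Minimal sketch of cell-level correctness scoring, independent of output
# formatting. Assumes both tables are parsed into dicts keyed by
# (substance, property) -> raw value string; the names are hypothetical.

def normalize(value: str) -> float | None:
    """Reduce a cell to a comparable number, so '1.001E-11' == '1.001e-11'."""
    try:
        return float(value.replace(",", "").strip())
    except ValueError:
        return None  # non-numeric cell; would need text comparison instead

def score_cells(truth: dict, extracted: dict) -> dict:
    """Exact-match every ground-truth cell. A value attached to the wrong
    substance or missing entirely counts as an error, same as misread digits."""
    correct = sum(
        1
        for key, value in truth.items()
        if key in extracted and normalize(extracted[key]) == normalize(value)
    )
    return {
        "correct": correct,
        "total": len(truth),
        "accuracy": correct / len(truth),
        "usable": correct == len(truth),  # all-or-nothing: 100% or failed
    }

truth = {("Argon", "viscosity_at_T_max"): "1.001E-11"}
model_output = {("Argon", "viscosity_at_T_max"): "1.001E-04"}
print(score_cells(truth, model_output))
# {'correct': 0, 'total': 1, 'accuracy': 0.0, 'usable': False}
```

The `usable` flag encodes the all-or-nothing point above: for a table humans cannot realistically proofread, anything below 100% cell accuracy is a failure, not a partial score.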
Edit: Just checked a few other models for errors on this example.
* GPT 5.1 is confused by the column labelled "C4" and mismatches the last four columns entirely; almost all of the numbers in the last column are wrong.
* olmOCR 2 omits the single value in column "C4" from the table.
* Gemini 3 produces "1.001E-04" instead of "1.001E-11" as the viscosity at T_max for Argon. Off by 7 orders of magnitude! There is zero ambiguity in the original table. On the second try it got it right, which is interesting! I want to see this kind of instability in a benchmark (see the sketch after this list).
There might be more errors! I don't know, I'd like to see them!
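The Gemini retry behavior suggests one cheap addition to any benchmark: run the same model on the same page several times and flag cells whose values disagree across runs. A rough sketch, where `extract_table(page)` is a hypothetical stand-in for whatever model call produces the same (substance, property) → value map as above:

```python
# Rough sketch of a repeat-run consistency check. Cells whose value changes
# between runs are exactly the ones a human reviewer is unlikely to catch,
# so a benchmark should surface them. extract_table() is a hypothetical
# stand-in for the model call being evaluated.
from collections import defaultdict

def flag_unstable_cells(extract_table, page, runs: int = 3) -> dict:
    """Run extraction several times; report cells with conflicting values."""
    seen = defaultdict(set)
    for _ in range(runs):
        for key, value in extract_table(page).items():
            seen[key].add(value)
    return {key: values for key, values in seen.items() if len(values) > 1}

# A cell like Argon's viscosity that reads "1.001E-04" on one run and
# "1.001E-11" on the next shows up here even without any ground truth.
```

This catches instability without needing a labelled ground truth at all, which would make it practical to run over many more documents than a hand-checked accuracy score.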