What matters above everything else when evaluating OCR for most business applications is accuracy.
Some results look plausible but are just plain wrong. That is worse than useless.
Example: the "Table" sample document contains chemical substances and their properties. How many numbers did the LLM output and associate correctly? That is all that matters. There is no "preference" aspect that is relevant until the data is correct. Nicely formatted incorrect data is still incorrect.
I reviewed the output from Qwen3-VL-8B on this document. It mixes up the rows, so many values end up associated with the wrong substance. Using its output for any real purpose would be incredibly dangerous, and this model should not be used for such a task. There is no redeeming aspect to it. Does another model produce even worse results? Then both models should be avoided at all costs.
Are there models available that are accurate enough for this purpose? I don't know; it is very time-consuming to evaluate. This particular table seems pretty legible, so a real production-grade OCR solution would probably need a 100% score on this example before it could be adopted. The output of such a table is not something humans are good at reviewing: it is difficult to spot errors. It either needs to be entirely correct, or the OCR has failed completely.
I am confident we'll reach a point where a mix of traditional OCR and LLM models can produce correct and usable output. I would welcome a benchmark where (objective) correctness is rated separately from the (subjective) output structure.
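To make that concrete, here is a minimal sketch of what a correctness-only scorer could look like, separate from any judgment about formatting. It assumes both the ground truth and the model output have already been parsed into (substance, property) → value maps; all names here are hypothetical, not taken from any existing benchmark.

```python
# Minimal sketch of cell-level correctness scoring, independent of output
# formatting. Assumes both tables are parsed into dicts keyed by
# (substance, property) -> raw value string; the names are hypothetical.

def normalize(value: str) -> float | None:
    """Reduce a cell to a comparable number, so '1.001E-11' == '1.001e-11'."""
    try:
        return float(value.replace(",", "").strip())
    except ValueError:
        return None  # non-numeric cell; would need text comparison instead

def score_cells(truth: dict, extracted: dict) -> dict:
    """Exact-match every ground-truth cell. A value attached to the wrong
    substance or missing entirely counts as an error, same as misread digits."""
    correct = sum(
        1
        for key, value in truth.items()
        if key in extracted and normalize(extracted[key]) == normalize(value)
    )
    return {
        "correct": correct,
        "total": len(truth),
        "accuracy": correct / len(truth),
        "usable": correct == len(truth),  # all-or-nothing: 100% or failed
    }

truth = {("Argon", "viscosity_at_T_max"): "1.001E-11"}
model_output = {("Argon", "viscosity_at_T_max"): "1.001E-04"}
print(score_cells(truth, model_output))
# {'correct': 0, 'total': 1, 'accuracy': 0.0, 'usable': False}
```

The `usable` flag encodes the all-or-nothing point above: for a table humans cannot realistically proofread, anything below 100% cell accuracy is a failure, not a partial score.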
Edit: Just checked a few other models for errors on this example.
* GPT 5.1 is confused by the column labelled "C4" and mismatches the last four columns entirely; almost all of the numbers in the last column are wrong.
* olmOCR 2 omits the single value in column "C4" from the table.
* Gemini 3 produces "1.001E-04" instead of "1.001E-11" as the viscosity at T_max for Argon. Off by 7 orders of magnitude! There is zero ambiguity in the original table. On the second try it got it right, which is interesting! I want to see this kind of instability in a benchmark (see the sketch after this list).
There might be more errors! I don't know, I'd like to see them!
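The Gemini retry behavior suggests one cheap addition to any benchmark: run the same model on the same page several times and flag cells whose values disagree across runs. A rough sketch, where `extract_table(page)` is a hypothetical stand-in for whatever model call produces the same (substance, property) → value map as above:

```python
# Rough sketch of a repeat-run consistency check. Cells whose value changes
# between runs are exactly the ones a human reviewer is unlikely to catch,
# so a benchmark should surface them. extract_table() is a hypothetical
# stand-in for the model call being evaluated.
from collections import defaultdict

def flag_unstable_cells(extract_table, page, runs: int = 3) -> dict:
    """Run extraction several times; report cells with conflicting values."""
    seen = defaultdict(set)
    for _ in range(runs):
        for key, value in extract_table(page).items():
            seen[key].add(value)
    return {key: values for key, values in seen.items() if len(values) > 1}

# A cell like Argon's viscosity that reads "1.001E-04" on one run and
# "1.001E-11" on the next shows up here even without any ground truth.
```

This catches instability without needing a labelled ground truth at all, which would make it practical to run over many more documents than a hand-checked accuracy score.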