
IBM Granite Vision tops the chart for small models in document understanding

The latest IBM Granite Vision 3.3 2B model recently debuted at number two on the OCRBench leaderboard, making it the most performant multimodal model under 7B parameters.

Last week was sound, and now it’s vision: Each of Granite’s senses is starting to prove its mettle.  

So much of the data we interact with at work is inherently visual. Employees around the world perform countless time-consuming tasks each day that could be automated away by large AI models that can interpret the visual world: understanding the information in charts and tables, parsing the contents of images in presentations and websites, or deciphering notes a colleague has written out by hand.

But this requires a multimodal AI model that can understand both text and how documents are laid out, so that it can interpret the complex forms, charts, tables, and invoices employees see every day in a way a text-only model cannot. 

IBM recently popped up near the top of the OCRBench leaderboard with its open-source Granite Vision 3.3 2B model. The multimodal model currently sits in second place overall, and is head-and-shoulders above any other small model under 7B parameters.

| Rank | Name | Language Model | Open Source | Final Score |
|------|------|----------------|-------------|-------------|
| 1 | MiniCPM-V 2.6 | Qwen2-7B | Yes | 852 |
| 2 | granite-vision-3.3-2b-instruct | granite-3.1-2b-instruct | Yes | 824 |
| 3 | Mini-Monkey | internlm2-chat-1.8b | Yes | 806 |
| 4 | H2OVL-Mississippi-2B | H2O-Danube2-1.8B | Yes | 782 |
| 5 | InternVL2-1B | Qwen2-0.5B-Instruct | Yes | 779 |
| 6 | InternVL2-4B | Phi-3-mini-128k-instruct | Yes | 776 |
| 7 | InternVL2-2B | internlm2-chat-1.8b | Yes | 768 |
| 8 | H2OVL-Mississippi-0.8B | H2O-Danube3-0.5B | Yes | 751 |
| 9 | Qwen-VL-Max | - | Yes | 723 |
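Because the model is open source, trying it on a document image of your own is straightforward. Below is a minimal sketch using the Hugging Face transformers chat-template flow; the model identifier, image URL, and prompt are illustrative assumptions rather than values confirmed in this article, so check the official model card for the exact id and recommended settings.

```python
# Minimal sketch: ask Granite Vision a question about a document image.
# The model id below is an assumption; confirm it on the model card.
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "ibm-granite/granite-vision-3.3-2b"  # assumed Hugging Face id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, device_map="auto")

# One user turn containing an image plus a question about it.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/invoice.png"},  # placeholder
            {"type": "text", "text": "What is the total amount on this invoice?"},
        ],
    }
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```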

OCRBench is a wide-ranging benchmark used by the AI industry to assess how effective vision and multimodal models are at tasks that require the fundamental ability to read text in challenging scenarios. While the concept of machines reading printed or handwritten text is not new — it’s a field in which IBM itself has a long history of innovation — building AI systems that can take that vision capability, discern what is being displayed, and generate something useful with that information is a field of cutting-edge exploration right now.  

Within OCRBench, there are five components that each model is judged against: text recognition, key information extraction, handwritten mathematical expression recognition, and question answering on both scene images and documents. The test includes 1,000 question-and-answer pairs, with each answer containing at least four symbols to lower the potential for false positives in the results.
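The scoring rule itself is simple: a response generally counts as correct when the ground-truth answer string appears in the model's output, and the four-symbol minimum keeps short strings from matching by accident. Here is an illustrative sketch of that style of check, with made-up sample data rather than the benchmark's actual code:

```python
# Illustrative OCRBench-style scorer: a prediction counts as correct if the
# ground-truth answer appears anywhere in the model's output (case-insensitive).
# The sample data is invented; the real benchmark ships 1,000 curated pairs.

samples = [
    {"prediction": "The total due is $1,284.50.", "answer": "1,284.50"},
    {"prediction": "The sign reads 'Grand Opening'.", "answer": "grand opening"},
    # Whitespace differs, so a plain substring match scores this one incorrect.
    {"prediction": "x^2 + 3x - 4", "answer": "x^2+3x-4"},
]

def is_correct(prediction: str, answer: str) -> bool:
    # Substring match after lowercasing; answers of four or more symbols
    # make accidental matches unlikely.
    return answer.lower() in prediction.lower()

score = sum(is_correct(s["prediction"], s["answer"]) for s in samples)
print(f"{score}/{len(samples)} correct")  # prints "2/3 correct"
```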

When tested across the five parts of the benchmark, Granite Vision 3.3 2B received the second-highest score overall, with particular aptitude in recognizing handwritten math expressions and answering questions about specific scenes. The model also beat out some other heavy hitters in the industry, including Google's Gemini, OpenAI's GPT-4V, and several models based on Meta's Llama.

“Our high score is due to our training data,” said Eli Schwartz, a researcher from the IBM Granite Vision team. “We deliberately trained the model on a dataset of low-quality documents, which made it exceptionally resilient and accurate on the kind of real-world images found in the benchmark.” 

For this latest version of Granite Vision, the team behind the model made several tweaks compared to the 3.2 version — these likely helped account for its success on the leaderboard. The goal was to create a compact model that would work effectively and dependably, while making cutting-edge AI more accessible and cost-effective. 

IBM Researcher Rogerio Feris, who works on these vision models, said that among other enhancements, the team dropped in a new encoder for this version, and added in more layers of document training than they had with previous models. The team focused on creating high-quality training data on the tasks that IBM’s own use-cases would benefit most from, Schwartz added.  

For image recognition, the big shift in recent years has been moving from AI systems that could recognize one type of image (such as the scanners in bank apps that recognize checks) to foundation models that can be used for many different tasks out of the box, and be easily adapted for new ones. But that doesn’t mean we’ve completely cracked computer vision.  

The next big leap, according to the team, will come when these models can act and reason without explicit instructions. With more high-quality data, more reinforcement learning, and some time, they expect that models in the future will be able to power agentic workflows that can execute complex business tasks on their own. 

And the team sees this model as a stepping stone to even greater advances in the future. “The biggest surprise was just how much powerful performance we could extract from such a small model,” Schwartz said. “We were also impressed that it continued to improve as we added more data, showing that we haven't yet hit the ceiling for what these efficient models can do.” 
