
Document Parsing with OmniParser and GPT4o Vision

Ferry Djaja
8 min read · Nov 18, 2024


In this blog, we’ll explore how to use Microsoft’s OmniParser as input to GPT-4o’s vision capabilities to improve document parsing results.

OmniParser is a general-purpose screen parsing tool that extracts information from UI screenshots into structured bounding boxes and labels, which enhances GPT-4V’s performance at action prediction across a variety of user tasks.

We will apply OmniParser together with GPT-4o Vision to extract information from the document.
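As a rough sketch of how that combination works (the helper name and prompt below are my own, not from OmniParser), the labeled screenshot that OmniParser produces can be embedded as a base64 data URL in a GPT-4o vision request, assuming the openai Python client:

```python
import base64


def build_vision_messages(image_bytes: bytes, prompt: str) -> list:
    """Build a chat messages payload pairing a text prompt with an image.

    The image is embedded as a base64 data URL, the format the OpenAI
    chat.completions vision endpoint accepts.
    """
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{b64}"},
                },
            ],
        }
    ]


# The payload would then be sent with something like:
# from openai import OpenAI
# client = OpenAI()
# response = client.chat.completions.create(
#     model="gpt-4o",
#     messages=build_vision_messages(
#         open("labeled_doc.png", "rb").read(),
#         "Extract the line items from this document.",
#     ),
# )
```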

Sample document to extract
Extraction flow

In my previous post, I extracted line items from a document using GPT-4 Vision, which required a lengthy and detailed prompt. Let’s explore the same task with OmniParser and see whether this approach gives accurate results.

In this document, I was unable to get correct results when parsing the line items (item numbers 2 and 1).

I am hoping that with OmniParser I will be able to extract the line items correctly.

Let’s get started!

Clone the OmniParser Git repo.

git clone https://github.com/microsoft/OmniParser

Install the environment.

conda create -n "omni" python==3.12
conda activate omni
pip install -r requirements.txt

Download the model checkpoint files from https://huggingface.co/microsoft/OmniParser and place them under the weights folder.

Convert the safetensors checkpoint to a .pt file.

python weights/convert_safetensor_to_pt.py

I modified the get_som_labeled_img function in utils.py to include bounding box…
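For illustration only (the function and field names here are my own sketch, not the actual utils.py code), a post-processing step that returns each detected label together with its bounding box in pixel coordinates might look like this, assuming the boxes come out in normalized xyxy form:

```python
def attach_pixel_boxes(labels, norm_boxes, img_w, img_h):
    """Pair each detected label with its bounding box scaled to pixel coords.

    labels:     list of text labels, one per detected element
    norm_boxes: list of (x1, y1, x2, y2) tuples, normalized to [0, 1]
    img_w/img_h: dimensions of the source image in pixels
    """
    results = []
    for label, (x1, y1, x2, y2) in zip(labels, norm_boxes):
        results.append({
            "label": label,
            "bbox": (
                round(x1 * img_w), round(y1 * img_h),
                round(x2 * img_w), round(y2 * img_h),
            ),
        })
    return results
```

Returning the coordinates alongside the labels lets the GPT-4o prompt reference each element’s position on the page, which is the extra signal plain OCR text lacks.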
