Member-only story

Extract Information from Non-English PDFs Using GPT-4o and LangGraph

Ferry Djaja
6 min readOct 20, 2024

--

In this blog post, I want to show you how to get information from PDF files that have content in languages other than English. While it’s usually easy to extract data from English documents, this article goes further by showing how to extract data from non-English documents, focusing on those in Burmese and other languages. P.S. I am not familiar with Burmese languages.

Myanmar Utility Bill Invoice (Burmese)
Indonesia Utility Bill Invoice (Bahasa)

To get information from PDF files, we’ll use GPT-4o from OpenAI. First, we’ll convert the PDF pages into images like I did in my previous blogs here.

Then ask GPT-4o to translate it into English and present it in a JSON format without any additional explanation. Let’s dive into the coding. We’ll be using LangGraph for this task.

We’ll start by importing the necessary Python libraries.

import pypdfium2 as pdfium
import backoff
import asyncio
import os
import base64
from io import BytesIO
from IPython.display import display, Image, HTML

from typing import List
from typing_extensions import TypedDict

from openai import OpenAIError
from openai import AsyncOpenAI

from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import PromptTemplate

from typing_extensions import…

--

--

No responses yet