Extract Information from Non-English PDFs Using GPT-4o and LangGraph

Ferry Djaja
6 min readOct 20, 2024

In this blog post, I want to show you how to get information from PDF files that have content in languages other than English. While it’s usually easy to extract data from English documents, this article goes further by showing how to extract data from non-English documents, focusing on those in Burmese and other languages. P.S. I am not familiar with Burmese languages.

Myanmar Utility Bill Invoice (Burmese)
Indonesia Utility Bill Invoice (Bahasa)

To get information from PDF files, we’ll use GPT-4o from OpenAI. First, we’ll convert the PDF pages into images like I did in my previous blogs here.

--

--

No responses yet