Member-only story

Extract Information from Non-English PDFs Using GPT-4o and LangGraph

6 min readOct 20, 2024

--

In this blog post, I want to show you how to get information from PDF files that have content in languages other than English. While it’s usually easy to extract data from English documents, this article goes further by showing how to extract data from non-English documents, focusing on those in Burmese and other languages. P.S. I am not familiar with Burmese languages.

Myanmar Utility Bill Invoice (Burmese)

Indonesia Utility Bill Invoice (Bahasa)

To get information from PDF files, we’ll use GPT-4o from OpenAI. First, we’ll convert the PDF pages into images like I did in my previous blogs here.

RAG with Complex PDF Structure

In this blog, I’ll outline how I developed a Retrieval Augmented Generation to analyze complex PDFs and answer…

djajafer.medium.com

Written by Ferry Djaja

https://www.linkedin.com/in/ferrydjaja/

No responses yet

Write a response

What are your thoughts?

Also publish to my profile

Help
Status
About
Careers
Press
Blog
Privacy
Terms
Text to speech
Teams