Member-only story
Extract Information from Non-English PDFs Using GPT-4o and LangGraph
In this blog post, I want to show you how to get information from PDF files that have content in languages other than English. While it’s usually easy to extract data from English documents, this article goes further by showing how to extract data from non-English documents, focusing on those in Burmese and other languages. P.S. I am not familiar with Burmese languages.
To get information from PDF files, we’ll use GPT-4o from OpenAI. First, we’ll convert the PDF pages into images like I did in my previous blogs here.
Then ask GPT-4o to translate it into English and present it in a JSON format without any additional explanation. Let’s dive into the coding. We’ll be using LangGraph for this task.
We’ll start by importing the necessary Python libraries.
import pypdfium2 as pdfium
import backoff
import asyncio
import os
import base64
from io import BytesIO
from IPython.display import display, Image, HTML
from typing import List
from typing_extensions import TypedDict
from openai import OpenAIError
from openai import AsyncOpenAI
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import PromptTemplate
from typing_extensions import…