RAG with Complex PDF Structure
In this blog post, I’ll outline how I built a Retrieval-Augmented Generation (RAG) pipeline to analyze complex PDFs and answer questions about them. The process converts each PDF page into an image, then uses the GPT-4o model to extract the page content into a JSON format.
Why do we need to convert pages to images first? PDFs often have complex structures, featuring tables, images, and text in varied layouts. Simply extracting the raw text may not yield accurate results, because that approach fails to account for the document’s visual layout and formatting.
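To make the page-to-image step concrete, here is a minimal sketch. It assumes the third-party `pdf2image` package (which requires poppler); the helper names `image_to_data_url` and `pdf_pages_to_data_urls` are my own illustration, not part of any library.

```python
import base64
from io import BytesIO


def image_to_data_url(png_bytes: bytes) -> str:
    """Wrap raw PNG bytes in a base64 data URL, the format the
    GPT-4o vision API accepts for inline images."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return f"data:image/png;base64,{encoded}"


def pdf_pages_to_data_urls(pdf_path: str, dpi: int = 200) -> list[str]:
    """Render each PDF page to a PNG and return one data URL per page.
    Requires the third-party pdf2image package (assumed installed)."""
    from pdf2image import convert_from_path  # assumed dependency

    urls = []
    for page in convert_from_path(pdf_path, dpi=dpi):
        buf = BytesIO()
        page.save(buf, format="PNG")
        urls.append(image_to_data_url(buf.getvalue()))
    return urls
```

A higher `dpi` yields sharper images (helpful for small table text) at the cost of larger payloads per request.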
Let’s use the SAP HANA Business Use Cases sample PDF as a test case. Its medium complexity, with a mix of images, charts, and text, makes it an ideal candidate for our analysis.
Here’s a simplified diagram illustrating my approach, which leverages the FAISS vector database, OpenAI’s GPT-4o, and LangChain for this use case.
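The extraction step of that pipeline amounts to sending each page image to GPT-4o and asking for JSON back. The sketch below builds such a request; the prompt wording and the `build_extraction_request` helper are illustrative assumptions, while the message shape follows the OpenAI chat-completions vision format.

```python
EXTRACTION_PROMPT = (
    "Extract all text, tables, and figure descriptions from this page. "
    "Return the result as a single JSON object."
)


def build_extraction_request(data_url: str) -> dict:
    """Build a chat-completions payload that sends one page image to
    GPT-4o and asks for structured JSON back."""
    return {
        "model": "gpt-4o",
        "response_format": {"type": "json_object"},
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": EXTRACTION_PROMPT},
                    {"type": "image_url", "image_url": {"url": data_url}},
                ],
            }
        ],
    }


# Actually sending the request is a network call, so it is only sketched:
#   import json
#   from openai import OpenAI
#   client = OpenAI()
#   resp = client.chat.completions.create(**build_extraction_request(url))
#   page = json.loads(resp.choices[0].message.content)
```

The extracted JSON for each page can then be chunked, embedded, and stored in FAISS via LangChain for retrieval.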
Let’s use the following table on page 49 to test with this question:

“What is the estimated reduction in days of inventory under the supply chain category, considered a likely recurring benefit?”