PDF Table Extraction with Keras-RetinaNet

Build a parser to extract tables from PDF documents with RetinaNet

Ferry Djaja
5 min read · Jul 14, 2020

I was inspired to build another deep-learning-based PDF table extractor after reading the great blog post PDFs’ parsing using YOLOv3. Since I had already done object detection and localization with RetinaNet, why not try the same approach with Keras-RetinaNet?

We will build this in two parts. In the first part, I will go through how to train Keras-RetinaNet to detect tables in PDF files and save the weights. In the second part, I’ll go through how to load the weights, run the prediction to detect tables in a given PDF file, and extract them into a pandas DataFrame.

Let’s get started with Part 1.

Part 1 — Train RetinaNet to Detect Table

We can separate the PDF files into two classes:

  • Text-based files: containing text that can be copied and pasted
  • Image-based files: containing images, such as scanned documents

In this tutorial, I will focus on the first class, text-based files. You can apply the same method to the second class; the only difference is using the appropriate libraries to handle image-based files.

All code in this tutorial is executed in Google Colab with GPU enabled.

Install Keras-RetinaNet

Start by cloning Keras-RetinaNet from its repository by running the following command in Colab.

git clone https://github.com/fizyr/keras-retinanet

Install the library:

pip install .
python setup.py build_ext --inplace

Table Annotation

Create a training dataset with these steps:

  • Get sample PDF files for training and convert them to JPG files with this Colab notebook PDF2Img. The more samples you have, the better the prediction results.
Figure 1
  • Save the JPG files in the images folder.
  • Open the JPG files and annotate the tables with labelImg. You need to install it locally. Alternatively, you can use the online tool makesense.ai.
  • Save the annotations in PASCAL VOC format.
Figure 2
  • Save the XML annotation files in the annotations folder.
  • Create the lists of PDF files for training and testing and put them in train.txt and test.txt.
Figure 3
  • Run build_logos.py to convert the annotations into the CSV files that Keras-RetinaNet expects:
python build_logos.py
Figure 4
  • After you run the above command, you will get retinanet_classes.csv, retinanet_test.csv and retinanet_train.csv (see the example formats after Figure 5 below).
  • In retinanet_classes.csv, we only get one class, class 0 (table), as we only identify tables within the PDF document. You can try adding more classes to identify headers and footers.
tabel,0
  • The complete file list and folder structure are shown in Figure 5 below.
Figure 5
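For reference, Keras-RetinaNet’s CSV format lists one bounding box per line as path,x1,y1,x2,y2,class_name, and the classes file maps each class name to an id. The generated files therefore look roughly like this (the image paths and coordinates are made-up examples):

# retinanet_train.csv (one annotated box per line)
images/doc_001.jpg,52,110,540,380,tabel
images/doc_002.jpg,48,95,560,700,tabel

# retinanet_classes.csv (class name, class id)
tabel,0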

Training

  • Upload retinanet_classes.csv, retinanet_test.csv, retinanet_train.csv, train.txt and test.txt to the root folder of keras-retinanet as shown in Figure 6.
  • Upload the JPG files to the images folder.
Figure 6
  • Run the Colab notebook TrainOCR. I ran it for 10 epochs; you can adjust this parameter depending on how many JPG files you have for training.
  • Once training is complete, you will get the weights file output.h5 as shown in Figure 7. Download and save this file to your local machine; we will use it to run the prediction later.
  • A small note: in the Colab, I uploaded the files to Git and did a git clone; you can upload the files whichever way works best for you. Also, follow the instructions in the Colab to complete the training (a sketch of the underlying training commands is shown after Figure 7).
Figure 7
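For reference, training with Keras-RetinaNet’s CSV generator boils down to commands like the following. This is only a sketch; the exact arguments, epoch count and snapshot file name may differ from what the TrainOCR Colab uses.

# Train on the generated CSV annotations (the epochs/steps values are examples)
python keras_retinanet/bin/train.py --epochs 10 --steps 500 csv retinanet_train.csv retinanet_classes.csv --val-annotations retinanet_test.csv

# Convert the training snapshot into an inference model before running predictions
python keras_retinanet/bin/convert_model.py snapshots/resnet50_csv_10.h5 output.h5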

Part 2 — Run Prediction

Now we continue with the prediction part using the same library, Keras-RetinaNet. We also need a few other libraries, mainly for PDF processing.

2.1 Install Libraries

2.1.1. PyPDF2

PyPDF2 is a Python library that enables us to extract document information, crop pages, and so on. We will use it to read the PDF page and crop it.

Install this library with this command:

pip install PyPDF2
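For illustration, here is a minimal sketch of reading a PDF page and cropping it with PyPDF2 (the file name and crop coordinates are just examples):

# Crop the first page of a PDF with PyPDF2
from PyPDF2 import PdfFileReader, PdfFileWriter

reader = PdfFileReader(open('input.pdf', 'rb'))
page = reader.getPage(0)

# Page size in PDF points; the PDF origin is at the bottom-left corner
width = float(page.mediaBox.getUpperRight_x())
height = float(page.mediaBox.getUpperRight_y())

# Keep only the lower half of the page, for example
page.cropBox.lowerLeft = (0, 0)
page.cropBox.upperRight = (width, height / 2)

writer = PdfFileWriter()
writer.addPage(page)
with open('cropped.pdf', 'wb') as f:
    writer.write(f)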

2.1.2. Camelot

Camelot is a Python library specialized in parsing tables from PDF pages.

Install this library with this command:

pip install camelot-py[cv]

I also needed to install python-tk and ghostscript, as I was getting an error in Colab while installing Camelot.

apt install python-tk ghostscript
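Before wiring Camelot up to the detector, a quick sanity check on a text-based PDF looks like this (a minimal sketch; the file name is just an example):

import camelot

# Parse the tables Camelot finds on page 1
tables = camelot.read_pdf('input.pdf', pages='1')
print(tables[0].df)  # each detected table exposes a pandas DataFrame via .df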

2.1.3. PDF2IMG

pdf2image is a Python library that converts a PDF into PIL Image objects.

Install this library with this command:

pip install pdf2image

You also need to install the poppler-utils.

apt-get install -y poppler-utils
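A minimal usage sketch (the file name and DPI are assumptions):

from pdf2image import convert_from_path

# Render each PDF page as a PIL Image and save the first page as JPG
pages = convert_from_path('input.pdf', dpi=300)
pages[0].save('page_1.jpg', 'JPEG')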

Before we run the prediction, we need to load the model with the weights file output.h5 that we got from training and define the labels from retinanet_classes.csv, which contains the single class 0 (table).

# Load the trained Keras-RetinaNet model and the class labels
from keras_retinanet import models

model_path = 'output.h5'
model = models.load_model(model_path, backbone_name='resnet50')

# Build a {class_id: class_name} mapping from the classes CSV
labels = 'retinanet_classes.csv'
LABELS = open(labels).read().strip().split('\n')
LABELS = {int(L.split(',')[1]): L.split(',')[0] for L in LABELS}
print(LABELS)
{0: 'tabel'}

Now we are ready to run the prediction.
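Here is a minimal sketch of the detection step, following the standard Keras-RetinaNet inference example (the page image name and the 0.5 score threshold are my own assumptions):

import numpy as np
from keras_retinanet.utils.image import read_image_bgr, preprocess_image, resize_image

# Load and preprocess one page image rendered from the PDF
image = read_image_bgr('page_1.jpg')
image = preprocess_image(image)
image, scale = resize_image(image)

# Run the detection and map the boxes back to the original image size
boxes, scores, labels = model.predict_on_batch(np.expand_dims(image, axis=0))
boxes /= scale

for box, score, label in zip(boxes[0], scores[0], labels[0]):
    if score < 0.5:  # detections are sorted by score, so we can stop here
        break
    x1, y1, x2, y2 = box.astype(int)
    print(LABELS[label], score, (x1, y1, x2, y2))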

If a table is detected, we can see the result as a bounding box with a score, as shown below.

Figure 8

We feed the bounding box coordinates (x1, y1, x2, y2) into Camelot’s read_pdf function, with table_areas set to the bounding box normalized to PDF coordinates.

Then we convert the output to a pandas DataFrame for further processing (a sketch of this step follows Figure 9).

Figure 9
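Here is a minimal sketch of this step. It assumes pdf_width and pdf_height hold the page size in PDF points (for example from PyPDF2’s mediaBox), img_width and img_height hold the rendered image size in pixels, and x1, y1, x2, y2 come from the detection above; these variable names and the file name are hypothetical:

import camelot

# Scale the detected box from image pixels to PDF points.
# The PDF origin is at the bottom-left corner, so the y-axis must be flipped.
x1_pdf = x1 * pdf_width / img_width
x2_pdf = x2 * pdf_width / img_width
y1_pdf = pdf_height - y1 * pdf_height / img_height
y2_pdf = pdf_height - y2 * pdf_height / img_height

# table_areas expects "x1,y1,x2,y2" strings: top-left and bottom-right corners in PDF coordinates
table_area = '{},{},{},{}'.format(x1_pdf, y1_pdf, x2_pdf, y2_pdf)
tables = camelot.read_pdf('input.pdf', pages='1', flavor='stream', table_areas=[table_area])

df = tables[0].df  # pandas DataFrame of the extracted table
print(df.head())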
