Python Libraries for Extracting Tables from PDFs

When dealing with PDF text extraction, you’ll eventually need to pull table data from the PDFs. These five Python libraries simplify the task. Each offers unique features, making them suitable for different use cases.

1. Camelot

Camelot is designed specifically for extracting tables from PDFs. It is lightweight and focused solely on this task.

Key Features:

Table Extraction Modes:

  • Stream Mode: Works well with tables that lack explicit borders but have consistent spacing.
  • Lattice Mode: Ideal for tables with visible cell boundaries, like grid lines.

Output Formats:

  • CSV
  • JSON
  • Pandas DataFrame
  • Excel

Accurate Parsing:

  • Detects multiple tables on a single page.
  • Handles complex layouts efficiently.

Visualization:

  • Visualizes extraction processes and table boundaries.

Basic Usage Example:

import camelot

tables = camelot.read_pdf("example.pdf", pages="1", flavor="stream")
df = tables[0].df
print(df)
tables[0].to_csv("table.csv")
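
To give a feel for what stream mode has to do with borderless tables, the toy sketch below infers column boundaries purely from whitespace that is blank on every line. It is an illustration of the idea only, not Camelot's actual algorithm, and the function names and sample data are made up for this example.

```python
def find_column_spans(lines):
    """Return (start, end) character spans that contain text on at least one line."""
    width = max((len(line) for line in lines), default=0)
    padded = [line.ljust(width) for line in lines]
    # A character position is "occupied" if any line has a non-space there.
    occupied = [any(row[i] != " " for row in padded) for i in range(width)]

    spans, start = [], None
    for i, filled in enumerate(occupied):
        if filled and start is None:
            start = i                      # a column begins
        elif not filled and start is not None:
            spans.append((start, i))       # a column ends at the first blank position
            start = None
    if start is not None:
        spans.append((start, width))
    return spans


def split_into_cells(lines):
    """Cut every line at the shared column spans and strip the padding."""
    spans = find_column_spans(lines)
    return [[line[a:b].strip() for a, b in spans] for line in lines]


# Hypothetical text rows with consistent spacing, as a borderless table might appear.
header_and_rows = [("Item", "Qty", "Price"), ("Widget", "2", "3.50"), ("Gadget", "10", "12.00")]
sample = [f"{name:<10}{qty:<6}{price}" for name, qty, price in header_and_rows]

rows = split_into_cells(sample)
for row in rows:
    print(row)
```

Real PDFs are messier than this: cells can overlap horizontally, and text positions come as coordinates rather than character offsets, which is why a dedicated library is usually the better choice.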

2. PDFPlumber

PDFPlumber provides a versatile approach to PDF data extraction. It supports tables, text, and images.

Key Features:

Comprehensive Extraction:

  • Extracts tables, text, and images.
  • Retrieves metadata.

Customizable Table Parsing:

  • Offers control over how tables are detected and processed.

Output Formats:

  • Provides lists of rows and columns.
  • Outputs Pandas-compatible structures.

Basic Usage Example:

import pdfplumber

with pdfplumber.open("example.pdf") as pdf:
    # extract_table() returns a list of rows, or None if no table is found
    table = pdf.pages[0].extract_table()

for row in table or []:
    print(row)
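
Because the extracted table is a plain list of rows, a little cleanup is often needed before saving it. The sketch below assumes the first row is the header and that empty cells come back as None; the helper names and sample data are hypothetical, and only the standard library is used.

```python
import csv
import io


def table_to_records(table):
    """Turn a list of rows (first row = header) into a list of dicts."""
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        # Replace None with "", and flatten line breaks inside a cell.
        cells = [(cell or "").replace("\n", " ").strip() for cell in row]
        if not any(cells):
            continue  # skip rows that are completely empty
        records.append(dict(zip(header, cells)))
    return records


def records_to_csv(records):
    """Serialize the records to CSV text."""
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Sample data shaped like the output of extract_table()
sample_table = [
    ["Name", "Qty", "Price"],
    ["Widget", "2", "3.50"],
    [None, None, None],
    ["Gadget\nPro", "10", None],
]

records = table_to_records(sample_table)
csv_text = records_to_csv(records)
print(csv_text)
```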

3. Tabula-py

Tabula-py is a Python wrapper for the Tabula Java library (tabula-java), so it needs a Java runtime installed. It is a reliable choice for extracting tables, especially if you already use the Tabula desktop tool.

Key Features:

Automated Table Detection:

  • Identifies and extracts multiple tables in one operation.

Output Formats:

  • CSV
  • JSON
  • Pandas DataFrame

Customization:

  • Users can specify areas of interest on a page.

Basic Usage Example:

from tabula import read_pdf

# read_pdf returns a list of DataFrames, one per detected table
dfs = read_pdf("example.pdf", pages=1)
if dfs:
    print(dfs[0])
    dfs[0].to_csv("table.csv", index=False)
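
To restrict extraction to an area of interest, read_pdf accepts an area argument given as top, left, bottom, right, measured in PDF points by default (72 points per inch). The sketch below is one way to build that list from inches; the helper names and the coordinates are made up for illustration, and the tabula import is deferred so the conversion helper can be tried without Java installed.

```python
POINTS_PER_INCH = 72


def inches_to_area(top, left, bottom, right):
    """Convert a page region in inches to [top, left, bottom, right] in points."""
    if bottom <= top or right <= left:
        raise ValueError("bottom must be below top and right must be beyond left")
    return [round(value * POINTS_PER_INCH, 2) for value in (top, left, bottom, right)]


def read_region(path, page, area):
    """Extract tables from one region of one page (requires tabula-py and Java)."""
    from tabula import read_pdf  # imported lazily on purpose

    return read_pdf(path, pages=page, area=area)


# Hypothetical region: 1 inch from the top, 0.5 inch from the left edge
area = inches_to_area(1.0, 0.5, 5.0, 8.0)
print(area)

# tables = read_region("example.pdf", 1, area)
```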

4. PyPDF2

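PyPDF2 is a general-purpose PDF library. While it doesn’t specialize in tables, it extracts text effectively. Note that the project now continues under the name pypdf, and PyPDF2 itself is no longer actively developed.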

Key Features:

Text Extraction:

  • Extracts raw text for post-processing.

PDF Manipulation:

  • Splits, merges, and encrypts PDFs.

Output Formats:

  • Plain text.

Basic Usage Example:

from PyPDF2 import PdfReader

reader = PdfReader("example.pdf")
text = reader.pages[0].extract_text()
print(text)
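
Since the result is raw text, any table structure has to be rebuilt afterwards. A minimal sketch of one such post-processing step follows, assuming the columns in the extracted text are separated by runs of two or more spaces. That assumption often does not hold, because text extraction may collapse spacing, so treat this as a starting point. The function name and sample text are hypothetical.

```python
import re


def split_text_rows(text, min_gap=2):
    """Split each non-empty line into cells wherever min_gap or more whitespace characters occur."""
    gap = re.compile(r"\s{%d,}" % min_gap)
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(gap.split(line))
    return rows


# Sample text standing in for the output of extract_text()
separator = " " * 3
sample_cells = [("Item", "Qty", "Price"), ("Widget", "2", "3.50"), ("Gadget Pro", "10", "12.00")]
sample_text = "\n".join(separator.join(cells) for cells in sample_cells)

rows = split_text_rows(sample_text)
for row in rows:
    print(row)
```

Single spaces inside a cell ("Gadget Pro") are kept because only wider gaps count as column separators.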

5. pdf2image

pdf2image converts PDF pages to images. While not directly for tables, it is useful for preprocessing image-based PDFs.

Key Features:

Image Conversion:

  • Converts PDF pages to high-quality images.

Compatibility:

  • Works with OCR libraries like Tesseract for table extraction.

Output Formats:

  • PNG
  • JPEG

Basic Usage Example:

from pdf2image import convert_from_path

images = convert_from_path("example.pdf", first_page=1, last_page=1)
images[0].save("page1.png")

Choosing the Right Library

For structured tables, Camelot or Tabula-py is ideal. Use PDFPlumber for flexible extraction or text-focused tasks. PyPDF2 works for basic text extraction, and pdf2image handles image-based preprocessing. Each library has its strengths, so select the one that suits your task.

When dealing with image-based PDFs or scanned documents, OCR (Optical Character Recognition) becomes essential for extracting table data. OCR tools identify and convert text embedded in images into editable and machine-readable formats, enabling table extraction from non-searchable documents.

Here are some popular OCR tools and libraries for pulling table data from scanned PDFs:

1. Tesseract OCR

Tesseract is an open-source OCR engine, originally developed at Hewlett-Packard and later sponsored by Google for many years. It is highly customizable and works well for text extraction from images, including tables.

Key Features:

  • Converts image-based PDFs to text.
  • Multilingual support with various pre-trained models.
  • Supports layout analysis for better table extraction.

Basic Usage Example:

from pytesseract import image_to_string
from pdf2image import convert_from_path

# Convert PDF page to an image
images = convert_from_path("scanned_table.pdf", first_page=1, last_page=1)
image = images[0]

# Perform OCR on the image
text = image_to_string(image, lang='eng')
print(text)

Enhancement: Use image_to_data for extracting structured text with bounding boxes, useful for identifying table structures.
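
As a rough sketch of that idea, the function below groups words into table rows by their vertical position. It expects the dictionary returned by pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT), which includes the keys text, left, top, and conf. The function name, the pixel tolerance, and the sample data are assumptions made for this example.

```python
def group_words_into_rows(data, tolerance=10, min_conf=0):
    """Cluster OCR words into rows: words whose 'top' values are close share a row."""
    words = []
    for text, left, top, conf in zip(data["text"], data["left"], data["top"], data["conf"]):
        # Structural entries have empty text and a confidence of -1.
        if text.strip() and float(conf) >= min_conf:
            words.append((top, left, text.strip()))
    words.sort()  # top to bottom

    rows, row_top = [], None
    for top, left, text in words:
        if row_top is None or top - row_top > tolerance:
            rows.append([])   # start a new row
            row_top = top
        rows[-1].append((left, text))

    # Within each row, order the words left to right.
    return [[text for _, text in sorted(row)] for row in rows]


# Synthetic data in the same shape as image_to_data's dictionary output
sample_data = {
    "text": ["", "Item", "Qty", "Widget", "2"],
    "left": [0, 10, 200, 12, 203],
    "top": [0, 50, 52, 90, 88],
    "conf": ["-1", 96, 95, 91, 93],
}

rows = group_words_into_rows(sample_data)
print(rows)
```

The tolerance depends on the image resolution and font size, so it usually needs tuning per document.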

2. Textract

Textract is a Python library for extracting text from various document formats, including PDFs, images, and scanned documents. It wraps OCR engines like Tesseract.

Key Features:

  • Works with image-based PDFs and other formats.
  • Simplifies text extraction workflows.

Basic Usage Example:

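import textract

# Extract text from a scanned PDF; method="tesseract" forces OCR,
# since the default PDF method only reads an existing text layer
text = textract.process("scanned_table.pdf", method="tesseract")
print(text.decode('utf-8'))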

3. EasyOCR

EasyOCR is a deep-learning OCR library built on PyTorch that recognizes text in images. It is particularly useful for its simple API and accuracy.

Key Features:

  • Multilingual support.
  • Works well with tables in images.

Basic Usage Example:

import easyocr

# Initialize the EasyOCR reader
reader = easyocr.Reader(['en'])

# Perform OCR on an image
results = reader.readtext('scanned_table_image.png')

# Each result is a (bounding_box, text, confidence) tuple
for bbox, text, confidence in results:
    print(text, confidence)
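
Each readtext result carries a bounding box (four corner points, starting at the top-left), the recognized text, and a confidence score. The sketch below shows one possible way to drop low-confidence detections and put the rest in reading order. The threshold, the row_height bucket size, and the sample results are assumptions for illustration.

```python
def confident_text(results, min_confidence=0.5, row_height=20):
    """Keep confident detections and order them top to bottom, then left to right."""
    kept = [(bbox, text) for bbox, text, confidence in results if confidence >= min_confidence]

    def reading_order(item):
        x, y = item[0][0]  # top-left corner of the bounding box
        # Bucket y so that small vertical jitter does not reorder words in one row.
        return (y // row_height, x)

    kept.sort(key=reading_order)
    return [text for _, text in kept]


# Synthetic results in the same shape as reader.readtext() output
sample_results = [
    ([[200, 52], [240, 52], [240, 70], [200, 70]], "Qty", 0.98),
    ([[10, 50], [60, 50], [60, 70], [10, 70]], "Item", 0.99),
    ([[12, 90], [80, 90], [80, 110], [12, 110]], "Widget", 0.95),
    ([[203, 91], [215, 91], [215, 110], [203, 110]], "2", 0.30),
]

texts = confident_text(sample_results)
print(texts)
```

Fixed buckets can still split a row whose words straddle a bucket boundary, so a tolerance-based grouping may be more robust for real scans.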

4. OCRmyPDF

OCRmyPDF adds an OCR layer to PDFs, making them searchable while preserving the original document.

Key Features:

  • Integrates OCR directly into PDFs.
  • Outputs searchable PDFs with text overlays.

Basic Usage Example:

ocrmypdf input.pdf output.pdf

Once the PDF is searchable, you can use libraries like Camelot, PDFPlumber, or Tabula-py to extract table data.
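
If you want to run the same step from a Python script, one simple option is to call the command-line tool through subprocess. The sketch below builds the argument list using the --language, --deskew, and --skip-text options of the ocrmypdf CLI; the helper names and defaults are made up for this example.

```python
import subprocess


def build_ocrmypdf_command(input_pdf, output_pdf, language="eng", deskew=True, skip_text=True):
    """Build the ocrmypdf argument list without running it."""
    command = ["ocrmypdf", "--language", language]
    if deskew:
        command.append("--deskew")      # straighten crooked scans before OCR
    if skip_text:
        command.append("--skip-text")   # leave pages that already contain text untouched
    command += [input_pdf, output_pdf]
    return command


def run_ocr(input_pdf, output_pdf, **options):
    """Run ocrmypdf; raises CalledProcessError if the tool reports a failure."""
    subprocess.run(build_ocrmypdf_command(input_pdf, output_pdf, **options), check=True)


command = build_ocrmypdf_command("input.pdf", "output.pdf")
print(" ".join(command))

# run_ocr("input.pdf", "output.pdf")  # requires ocrmypdf to be installed
```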

5. Google Cloud Vision API

Google Cloud Vision API is a robust OCR solution for large-scale, high-accuracy text extraction.

Key Features:

  • Advanced text detection in images, including tables.
  • Returns bounding boxes for blocks, paragraphs, and words, which you can use to reconstruct table rows and columns.
  • Scalable for enterprise use.

Basic Usage Example:

from google.cloud import vision

client = vision.ImageAnnotatorClient()

# Load image file
with open("table_image.png", "rb") as image_file:
    content = image_file.read()

# google-cloud-vision 2.x exposes Image directly on the vision module
image = vision.Image(content=content)

# Perform text detection
response = client.text_detection(image=image)
text = response.full_text_annotation.text
print(text)

Enhancing Table Extraction

When using OCR for tables, raw text output may need additional processing to reconstruct rows and columns. Libraries like Pandas and OpenCV can help:

  • Pandas: Organize extracted text into DataFrames.
  • OpenCV: Detect lines and contours to identify table boundaries.
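
For the Pandas part, a minimal sketch might look like the following. It assumes the OCR text has already been split into rows of equal length with the header first; the function name and the sample rows are hypothetical.

```python
import pandas as pd


def rows_to_dataframe(rows, numeric_columns=()):
    """Build a DataFrame from OCR rows and convert the chosen columns to numbers."""
    header, *body = rows
    df = pd.DataFrame(body, columns=header)
    for column in numeric_columns:
        # Values that OCR misread (and so cannot be parsed) become NaN instead of raising.
        df[column] = pd.to_numeric(df[column], errors="coerce")
    return df


# Sample rows; "1O" contains a letter O, a typical OCR misread of "10"
ocr_rows = [
    ["Item", "Qty", "Price"],
    ["Widget", "2", "3.50"],
    ["Gadget", "1O", "12.00"],
]

df = rows_to_dataframe(ocr_rows, numeric_columns=["Qty", "Price"])
print(df)
```

Coercing to NaN makes misread cells easy to find afterwards with df.isna().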

Example with OpenCV:

import cv2

# Load the image as grayscale
image = cv2.imread('table_image.png', cv2.IMREAD_GRAYSCALE)

# Threshold the image
_, binary = cv2.threshold(image, 128, 255, cv2.THRESH_BINARY_INV)

# Detect horizontal and vertical lines
horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 1))
horizontal_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, horizontal_kernel)

vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 25))
vertical_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, vertical_kernel)

# Combine detected lines
table_lines = cv2.add(horizontal_lines, vertical_lines)

# Display the table boundaries until a key is pressed
cv2.imshow('Table', table_lines)
cv2.waitKey(0)
cv2.destroyAllWindows()
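
Once you have a line mask such as table_lines, a possible next step is to turn it into cell boxes that can be cropped and passed to OCR one by one. The sketch below uses only NumPy and assumes the image is cropped to a single table whose grid lines run across most of its width and height; the function names and the min_fraction threshold are made up for this example.

```python
import numpy as np


def line_positions(mask, axis, min_fraction=0.5):
    """Return the center index of each line in a binary mask.

    axis=1 counts pixels across each row (finds horizontal lines);
    axis=0 counts pixels down each column (finds vertical lines).
    """
    length = mask.shape[axis]
    filled = (mask > 0).sum(axis=axis) >= min_fraction * length
    positions, start = [], None
    for i, on in enumerate(filled):
        if on and start is None:
            start = i                                # a line begins
        elif not on and start is not None:
            positions.append((start + i - 1) // 2)   # middle of a thick line
            start = None
    if start is not None:
        positions.append((start + len(filled) - 1) // 2)
    return positions


def cell_boxes(mask):
    """Return (x0, y0, x1, y1) boxes between neighboring grid lines, row by row."""
    ys = line_positions(mask, axis=1)
    xs = line_positions(mask, axis=0)
    return [(x0, y0, x1, y1) for y0, y1 in zip(ys, ys[1:]) for x0, x1 in zip(xs, xs[1:])]


# Synthetic 2x2 grid standing in for the table_lines mask
mask = np.zeros((61, 101), dtype=np.uint8)
mask[[0, 30, 60], :] = 255   # three horizontal lines
mask[:, [0, 50, 100]] = 255  # three vertical lines

boxes = cell_boxes(mask)
print(boxes)
```

Tables with merged cells or partial lines need a more careful approach, for example contour detection on the mask.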

Choosing an OCR Solution

  • Use Tesseract for lightweight and customizable OCR tasks.
  • Use EasyOCR for simplicity and speed.
  • Use Google Cloud Vision API for high-accuracy enterprise solutions.
  • Use OCRmyPDF if you need searchable PDFs for post-OCR extraction with other libraries.

Thank you for reading this article. I hope you found it helpful and informative. If you have any questions, or if you would like to suggest new Python code examples or topics for future tutorials, please feel free to reach out. Your feedback and suggestions are always welcome!

Happy coding!
Py-Core.com Python Programming
