When dealing with PDF text extraction, you’ll eventually need to pull table data from the PDFs. These five Python libraries simplify the task. Each offers unique features, making them suitable for different use cases.
1. Camelot
Camelot is designed specifically for extracting tables from PDFs. It is lightweight and focused solely on this task.
Key Features:
Table Extraction Modes:
- Stream Mode: Works well with tables that lack explicit borders but have consistent spacing.
- Lattice Mode: Ideal for tables with visible cell boundaries, like grid lines.
Output Formats:
- CSV
- JSON
- Pandas DataFrame
- Excel
Accurate Parsing:
- Detects multiple tables on a single page.
- Handles complex layouts efficiently.
Visualization:
- Visualizes extraction processes and table boundaries.
Basic Usage Example:
import camelot
tables = camelot.read_pdf("example.pdf", pages="1", flavor="stream")
df = tables[0].df
print(df)
tables[0].to_csv("table.csv")
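Since the right mode depends on whether the table has ruling lines, one option is to try Lattice first and fall back to Stream when nothing is found. The sketch below is only an illustration: `extract_with_fallback` and its `read_pdf` parameter are hypothetical names introduced here, and the injectable reader exists only so the logic can be exercised without a real PDF.

```python
def extract_with_fallback(path, pages="1", read_pdf=None):
    """Try Camelot's lattice flavor first, then stream if no tables were found.

    `read_pdf` is a hypothetical injection point; by default it is camelot.read_pdf.
    """
    if read_pdf is None:
        import camelot  # imported lazily so the helper can be tested without Camelot
        read_pdf = camelot.read_pdf

    for flavor in ("lattice", "stream"):
        tables = read_pdf(path, pages=pages, flavor=flavor)
        if len(tables) > 0:
            return flavor, tables
    return None, []


if __name__ == "__main__":
    flavor, tables = extract_with_fallback("example.pdf")
    if flavor is None:
        print("No tables found")
    else:
        print(f"{len(tables)} table(s) found with flavor={flavor}")
        print(tables[0].df.head())
```

Each Camelot table also carries a `parsing_report` dictionary, which can be a useful sanity check before trusting the fallback result.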
2. PDFPlumber
PDFPlumber provides a versatile approach to PDF data extraction. It supports tables, text, and images.
Key Features:
Comprehensive Extraction:
- Extracts tables, text, and images.
- Retrieves metadata.
Customizable Table Parsing:
- Offers control over how tables are detected and processed.
Output Formats:
- Provides lists of rows and columns.
- Outputs Pandas-compatible structures.
Basic Usage Example:
import pdfplumber
with pdfplumber.open("example.pdf") as pdf:
    # extract_table() returns None when no table is found on the page
    table = pdf.pages[0].extract_table()
    for row in table or []:
        print(row)
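The list of rows that `extract_table()` returns uses `None` for empty cells, so a small cleanup step is often needed before further processing. Here is a minimal sketch, assuming the first row is the header; `rows_to_records` is a hypothetical helper name, and the sample data is made up for illustration.

```python
def rows_to_records(table):
    """Turn a list of rows (first row = header) into a list of dicts.

    Empty cells (None) become empty strings; blank header cells get a
    placeholder name such as "col_2".
    """
    if not table:
        return []
    header = [(cell or "").strip() or f"col_{i}" for i, cell in enumerate(table[0])]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


# Made-up sample in the shape extract_table() produces
sample = [
    ["Item", "Qty", None],
    ["Widget", " 4 ", "ok"],
    ["Gadget", None, ""],
]
for record in rows_to_records(sample):
    print(record)
```

The resulting list of dicts can be passed straight to `pandas.DataFrame(...)` if you want a DataFrame.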
3. Tabula-py
Tabula-py is a Python wrapper for the Tabula Java library (tabula-java), so it needs a Java runtime installed. It is a reliable choice for extracting tables, especially if you already use the Tabula desktop tool.
Key Features:
Automated Table Detection:
- Identifies and extracts multiple tables in one operation.
Output Formats:
- CSV
- JSON
- Pandas DataFrame
Customization:
- Users can specify areas of interest on a page.
Basic Usage Example:
from tabula import read_pdf
# read_pdf returns a list of DataFrames, one per detected table
dfs = read_pdf("example.pdf", pages=1)
print(dfs[0])
dfs[0].to_csv("table.csv", index=False)
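To restrict extraction to an area of interest, `read_pdf` accepts an `area` argument given as top, left, bottom, right in PDF points. The sketch below shows one way to build that list from measurements in inches; `area_from_inches` and `read_region` are hypothetical helper names, and the coordinates are made up.

```python
POINTS_PER_INCH = 72  # PDF coordinates are expressed in points


def area_from_inches(top, left, bottom, right):
    """Convert a region measured in inches to [top, left, bottom, right] in points."""
    if bottom <= top or right <= left:
        raise ValueError("bottom/right must be greater than top/left")
    return [round(value * POINTS_PER_INCH, 2) for value in (top, left, bottom, right)]


def read_region(path, page, area):
    """Read only the given area of one page with tabula-py."""
    from tabula import read_pdf  # imported lazily; needs tabula-py and Java
    return read_pdf(path, pages=page, area=area)


if __name__ == "__main__":
    region = area_from_inches(1.0, 0.5, 5.5, 8.0)
    for df in read_region("example.pdf", 1, region):
        print(df)
```

If you prefer percentages of the page over absolute points, tabula-py also has a `relative_area` option.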
4. PyPDF2
PyPDF2 is a general-purpose PDF library (its development now continues under the name pypdf). While it doesn’t specialize in tables, it extracts text effectively.
Key Features:
Text Extraction:
- Extracts raw text for post-processing.
PDF Manipulation:
- Splits, merges, and encrypts PDFs.
Output Formats:
- Plain text.
Basic Usage Example:
from PyPDF2 import PdfReader
reader = PdfReader("example.pdf")
text = reader.pages[0].extract_text()
print(text)
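Because PyPDF2 only returns raw text, rebuilding a table is up to you. One simple heuristic is to split each line on runs of two or more spaces. This is a rough sketch that assumes the extracted text keeps wide gaps between columns, which is not guaranteed for every PDF; `split_columns` is a hypothetical helper and the sample text is made up.

```python
import re


def split_columns(text, min_gap=2):
    """Split each non-empty line into cells on runs of `min_gap` or more whitespace."""
    gap = re.compile(r"\s{%d,}" % min_gap)
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(gap.split(line))
    return rows


# Made-up text standing in for the output of extract_text()
sample_text = "Name        Qty   Price\nWidget A    4     9.99\n\nGadget      12    0.50"
for row in split_columns(sample_text):
    print(row)
```

Single spaces inside a cell (such as "Widget A") are kept, which is why the gap threshold is two rather than one.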
5. pdf2image
pdf2image converts PDF pages to images. It wraps Poppler's command-line tools, so Poppler must be installed separately. While not directly for tables, it is useful for preprocessing image-based PDFs.
Key Features:
Image Conversion:
- Converts PDF pages to high-quality images.
Compatibility:
- Works with OCR libraries like Tesseract for table extraction.
Output Formats:
- PNG
- JPEG
Basic Usage Example:
from pdf2image import convert_from_path
images = convert_from_path("example.pdf", first_page=1, last_page=1)
images[0].save("page1.png")
Choosing the Right Library
For structured tables, Camelot or Tabula-py is ideal. Use PDFPlumber for flexible extraction or text-focused tasks. PyPDF2 works for basic text extraction, and pdf2image handles image-based preprocessing. Each library has its strengths — select the one that suits your task.
When dealing with image-based PDFs or scanned documents, OCR (Optical Character Recognition) becomes essential for extracting table data. OCR tools identify and convert text embedded in images into editable and machine-readable formats, enabling table extraction from non-searchable documents.
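Before reaching for OCR, it can be worth checking which pages actually lack a text layer. The sketch below flags pages whose extracted text is suspiciously short; `needs_ocr`, `page_texts_from_pdf`, and the 20-character threshold are assumptions chosen for illustration, not a standard rule.

```python
def needs_ocr(page_texts, min_chars=20):
    """Return the indexes of pages whose extracted text is too short to be a real text layer."""
    return [
        index
        for index, text in enumerate(page_texts)
        if len((text or "").strip()) < min_chars
    ]


def page_texts_from_pdf(path):
    """Collect the extracted text of every page using pdfplumber."""
    import pdfplumber  # imported lazily so needs_ocr can be used on its own
    with pdfplumber.open(path) as pdf:
        return [page.extract_text() for page in pdf.pages]


if __name__ == "__main__":
    scanned_pages = needs_ocr(page_texts_from_pdf("example.pdf"))
    print("Pages that probably need OCR:", scanned_pages)
```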
Here are some popular OCR tools and libraries for pulling table data from scanned PDFs:
1. Tesseract OCR
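Tesseract is an open-source OCR engine, originally developed at Hewlett-Packard and later sponsored by Google. It is highly customizable and works well for text extraction from images, including tables. In Python it is usually driven through the pytesseract wrapper.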
Key Features:
- Converts image-based PDFs to text.
- Multilingual support with various pre-trained models.
- Supports layout analysis for better table extraction.
Basic Usage Example:
from pytesseract import image_to_string
from pdf2image import convert_from_path
# Convert PDF page to an image
images = convert_from_path("scanned_table.pdf", first_page=1, last_page=1)
image = images[0]
# Perform OCR on the image
text = image_to_string(image, lang='eng')
print(text)
Enhancement: Use image_to_data for extracting structured text with bounding boxes, useful for identifying table structures.
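As a rough illustration of that enhancement, the dictionary returned by `image_to_data` (with `output_type=Output.DICT`) can be grouped into text lines using its `block_num`, `par_num`, and `line_num` fields. `words_to_lines` and `ocr_lines` are hypothetical helper names; this sketch recovers rows only and makes no attempt at column detection.

```python
from collections import defaultdict


def words_to_lines(data, min_conf=0):
    """Group image_to_data output into lines of words, ordered left to right."""
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        if not word.strip():
            continue  # skip empty entries for blocks, paragraphs, and lines
        if float(data["conf"][i]) < min_conf:
            continue  # skip low-confidence words
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append((data["left"][i], word))
    return [[word for _, word in sorted(words)] for _, words in sorted(lines.items())]


def ocr_lines(image):
    """Run Tesseract on a PIL image and return its words grouped by line."""
    import pytesseract  # imported lazily; needs the Tesseract binary installed
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    return words_to_lines(data)
```

Each inner list is one OCR line; the `left` and `width` values in the same dictionary could then be used to assign words to columns.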
2. Textract
Textract is a Python library for extracting text from various document formats, including PDFs, images, and scanned documents. It wraps OCR engines like Tesseract.
Key Features:
- Works with image-based PDFs and other formats.
- Simplifies text extraction workflows.
Basic Usage Example:
import textract
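# Extract text from a scanned PDF
# method="tesseract" requests OCR; the default PDF method only reads an existing text layer
text = textract.process("scanned_table.pdf", method="tesseract")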
print(text.decode('utf-8'))
3. EasyOCR
EasyOCR is a PyTorch-based OCR library that supports text recognition from images. It is particularly useful for its simple API and accuracy.
Key Features:
- Multilingual support.
- Works well with tables in images.
Basic Usage Example:
import easyocr
# Initialize the EasyOCR reader
reader = easyocr.Reader(['en'])
# Perform OCR on an image
results = reader.readtext('scanned_table_image.png')
# Print detected text; each result is a (bounding_box, text, confidence) tuple
for bbox, text, confidence in results:
    print(text)
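Since each EasyOCR result includes a bounding box, table rows can be approximated by clustering detections with similar vertical positions. The following is a sketch under simple assumptions (unrotated text and a fixed pixel tolerance); `group_into_rows` and the 10-pixel default are hypothetical choices, and the sample detections are made up.

```python
def group_into_rows(results, y_tolerance=10):
    """Cluster (bbox, text, confidence) results into rows by vertical center.

    Each bbox is a list of four [x, y] corner points, as returned by readtext().
    """
    items = []
    for bbox, text, _confidence in results:
        xs = [point[0] for point in bbox]
        ys = [point[1] for point in bbox]
        items.append((sum(ys) / len(ys), min(xs), text))
    items.sort()  # top to bottom, then left to right

    rows = []
    for y_center, x_left, text in items:
        if rows and abs(y_center - rows[-1]["y"]) <= y_tolerance:
            rows[-1]["cells"].append((x_left, text))
        else:
            rows.append({"y": y_center, "cells": [(x_left, text)]})
    return [[text for _, text in sorted(row["cells"])] for row in rows]


# Made-up detections in the shape readtext() returns
sample_results = [
    ([[200, 10], [260, 10], [260, 30], [200, 30]], "Qty", 0.99),
    ([[10, 12], [90, 12], [90, 32], [10, 32]], "Item", 0.98),
    ([[10, 50], [90, 50], [90, 70], [10, 70]], "Widget", 0.97),
    ([[200, 52], [260, 52], [260, 72], [200, 72]], "4", 0.95),
]
for row in group_into_rows(sample_results):
    print(row)
```

The tolerance usually needs tuning to the scan resolution and font size.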
4. OCRmyPDF
OCRmyPDF adds an OCR layer to PDFs, making them searchable while preserving the original document.
Key Features:
- Integrates OCR directly into PDFs.
- Outputs searchable PDFs with text overlays.
Basic Usage Example:
ocrmypdf input.pdf output.pdf
Once the PDF is searchable, you can use libraries like Camelot, PDFPlumber, or Tabula-py to extract table data.
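If you want to drive this step from a Python script, one option is to call the command-line tool through `subprocess`. This sketch assumes `ocrmypdf` is installed and on the PATH; `build_ocrmypdf_command` and `run_ocr` are hypothetical helper names, and the defaults shown are only one reasonable choice.

```python
import subprocess


def build_ocrmypdf_command(src, dst, language="eng", skip_text=True, deskew=False):
    """Build the ocrmypdf command line as a list of arguments."""
    command = ["ocrmypdf", "-l", language]
    if skip_text:
        command.append("--skip-text")  # leave pages that already have text untouched
    if deskew:
        command.append("--deskew")  # straighten crooked scans before OCR
    command += [src, dst]
    return command


def run_ocr(src, dst, **options):
    """Run ocrmypdf and raise CalledProcessError if it fails."""
    subprocess.run(build_ocrmypdf_command(src, dst, **options), check=True)


if __name__ == "__main__":
    run_ocr("input.pdf", "output.pdf", deskew=True)
```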
5. Google Cloud Vision API
Google Cloud Vision API is a robust OCR solution for large-scale, high-accuracy text extraction.
Key Features:
- Advanced text detection in images, including tables.
- Returns bounding boxes for blocks, paragraphs, and words, which can help reconstruct rows and columns in tables.
- Scalable for enterprise use.
Basic Usage Example:
from google.cloud import vision
client = vision.ImageAnnotatorClient()
# Load image file
with open("table_image.png", "rb") as image_file:
    content = image_file.read()
# vision.types was removed in google-cloud-vision 2.0; use vision.Image directly
image = vision.Image(content=content)
# Perform text detection
response = client.text_detection(image=image)
text = response.full_text_annotation.text
print(text)
Enhancing Table Extraction
When using OCR for tables, raw text output may need additional processing to reconstruct rows and columns. Libraries like Pandas and OpenCV can help:
- Pandas: Organize extracted text into DataFrames.
- OpenCV: Detect lines and contours to identify table boundaries.
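For the Pandas side, OCR output often yields rows of uneven length, so it usually helps to pad them before building a DataFrame. A minimal sketch, assuming the first row holds the column names; `pad_rows` and `rows_to_frame` are hypothetical helpers.

```python
def pad_rows(rows, fill=""):
    """Pad every row with `fill` so all rows have the same number of cells."""
    width = max((len(row) for row in rows), default=0)
    return [list(row) + [fill] * (width - len(row)) for row in rows]


def rows_to_frame(rows):
    """Build a DataFrame from OCR rows, treating the first row as the header."""
    import pandas as pd  # imported lazily so pad_rows works without pandas
    padded = pad_rows(rows)
    if not padded:
        return pd.DataFrame()
    return pd.DataFrame(padded[1:], columns=padded[0])


if __name__ == "__main__":
    ocr_rows = [["Item", "Qty", "Price"], ["Widget", "4"], ["Gadget"]]
    print(rows_to_frame(ocr_rows))
```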
Example with OpenCV:
import cv2
import numpy as np
# Load the image
image = cv2.imread('table_image.png', cv2.IMREAD_GRAYSCALE)
# Threshold the image
_, binary = cv2.threshold(image, 128, 255, cv2.THRESH_BINARY_INV)
# Detect horizontal and vertical lines
horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 1))
horizontal_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, horizontal_kernel)
vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 25))
vertical_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, vertical_kernel)
# Combine detected lines
table_lines = cv2.add(horizontal_lines, vertical_lines)
# Display the table boundaries
cv2.imshow('Table', table_lines)
cv2.waitKey(0)
cv2.destroyAllWindows()
Choosing an OCR Solution
- Use Tesseract for lightweight and customizable OCR tasks.
- Use EasyOCR for simplicity and speed.
- Use Google Cloud Vision API for high-accuracy enterprise solutions.
- Use OCRmyPDF if you need searchable PDFs for post-OCR extraction with other libraries.
Thank you for reading this article. I hope you found it helpful and informative. If you have any questions, or if you would like to suggest new Python code examples or topics for future tutorials, please feel free to reach out. Your feedback and suggestions are always welcome!
Happy coding!
Py-Core.com Python Programming