PDFs often contain images, logos, charts, and scanned pages. If you need to extract those images for digital archiving, content analysis, or OCR preprocessing, several Python libraries can help. Each has features suited to different use cases.
1. PyMuPDF (fitz)
PyMuPDF, imported in code as fitz, is a fast and efficient library for working with PDFs. It provides straightforward methods for extracting images from documents.
Install: pip install pymupdf
Key Features
- High-Speed Processing: Extracts images efficiently, even from large PDFs.
- Preserves Image Quality: Extracts images without losing resolution.
- Supports Various Image Formats: Exports images in PNG, JPEG, and other formats.
- Metadata Extraction: Retrieves DPI, color space, and image type.
Basic Usage Example
import os
import fitz  # PyMuPDF

def extract_images_from_pdf(pdf_path, output_folder):
    # Create the output folder if it does not exist yet.
    os.makedirs(output_folder, exist_ok=True)
    doc = fitz.open(pdf_path)
    for page_index in range(len(doc)):
        # Each entry describes one image on the page; its first item is the xref.
        for img_index, img in enumerate(doc[page_index].get_images(full=True)):
            xref = img[0]
            base_image = doc.extract_image(xref)
            image_bytes = base_image["image"]  # bytes in the embedded format
            image_ext = base_image["ext"]      # matching file extension
            with open(f"{output_folder}/image_{page_index}_{img_index}.{image_ext}", "wb") as f:
                f.write(image_bytes)
    doc.close()

pdf_path = "sample.pdf"
output_folder = "extracted_images"
extract_images_from_pdf(pdf_path, output_folder)
Why Use PyMuPDF?
- Works well with high-resolution images.
- Maintains the original image format.
- Efficient for batch processing large PDFs.
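One thing the basic loop does not account for is that an image reused on several pages, such as a header logo, typically shares a single xref and would be written once per page. The following is a minimal sketch of deduplicating by xref. The helper names collect_unique_xrefs and extract_unique_images are made up for illustration, and the sketch assumes the same get_images and extract_image calls used in the example above.

```python
from pathlib import Path


def collect_unique_xrefs(doc):
    """Return image xrefs in first-seen order, skipping repeats across pages."""
    seen = set()
    ordered = []
    for page in doc:
        for img in page.get_images(full=True):
            xref = img[0]  # first item of each entry is the image's xref
            if xref not in seen:
                seen.add(xref)
                ordered.append(xref)
    return ordered


def extract_unique_images(pdf_path, output_folder):
    """Write each distinct embedded image once, named after its xref."""
    # Imported here so the helper above can be used without PyMuPDF installed.
    import fitz

    out_dir = Path(output_folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    with fitz.open(pdf_path) as doc:
        for xref in collect_unique_xrefs(doc):
            info = doc.extract_image(xref)
            target = out_dir / f"image_xref{xref}.{info['ext']}"
            target.write_bytes(info["image"])
```

Calling extract_unique_images("sample.pdf", "extracted_images") would then write one file per distinct image rather than one per occurrence.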
2. pdf2image
pdf2image converts PDF pages into images. Rather than pulling out embedded images one by one, it renders each page in full, which is useful for scanned documents or when an image only makes sense as part of the rendered page.
Install: pip install pdf2image
Additionally, pdf2image requires Poppler to be installed for PDF rendering.
- Windows: Download Poppler and add its bin directory to your system PATH.
- macOS: Install via Homebrew: brew install poppler
- Linux (Debian/Ubuntu): sudo apt install poppler-utils
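Before running a conversion, it can help to confirm that Poppler is actually reachable. The snippet below is a small sketch using only the standard library. It assumes that the pdftoppm and pdfinfo executables are the Poppler tools that matter here, and the function name poppler_available is made up for illustration.

```python
import shutil


def poppler_available():
    """Return True if Poppler's pdftoppm and pdfinfo are both on PATH."""
    return all(shutil.which(tool) is not None for tool in ("pdftoppm", "pdfinfo"))


if poppler_available():
    print("Poppler found on PATH.")
else:
    print("Poppler not found; install it or add its bin directory to PATH.")
```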
Key Features
- Renders Entire Pages: Converts PDF pages into high-resolution images.
- Multi-Page Support: Handles PDFs with multiple pages easily.
- Format Flexibility: Outputs images as PNG, JPEG, or TIFF.
- Compatible with OCR: Prepares PDFs for Optical Character Recognition (OCR).
Basic Usage Example
import os
from pdf2image import convert_from_path

def convert_pdf_to_images(pdf_path, output_folder):
    # Create the output folder if it does not exist yet.
    os.makedirs(output_folder, exist_ok=True)
    # Render every page at 300 DPI; each item is a PIL image.
    images = convert_from_path(pdf_path, dpi=300)
    for i, image in enumerate(images):
        image.save(f"{output_folder}/page_{i}.png", "PNG")

pdf_path = "scanned_document.pdf"
output_folder = "page_images"
convert_pdf_to_images(pdf_path, output_folder)
Why Use pdf2image?
- Ideal for scanned PDFs and OCR preprocessing.
- Handles entire page extraction.
- Works well with high-DPI image conversion.
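Rendering every page of a long document at high DPI in a single call keeps all page images in memory at once. One way to limit that, sketched below, is to render a few pages at a time using the first_page and last_page arguments of convert_from_path, with pdfinfo_from_path supplying the page count. The helper names page_batches and convert_pdf_in_batches and the default batch size of 10 are assumptions for illustration only.

```python
import os


def page_batches(total_pages, batch_size):
    """Split pages 1..total_pages into inclusive (first, last) ranges."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        (first, min(first + batch_size - 1, total_pages))
        for first in range(1, total_pages + 1, batch_size)
    ]


def convert_pdf_in_batches(pdf_path, output_folder, dpi=300, batch_size=10):
    """Render a PDF a few pages at a time to limit peak memory use."""
    # Imported here so page_batches can be used without pdf2image installed.
    from pdf2image import convert_from_path, pdfinfo_from_path

    os.makedirs(output_folder, exist_ok=True)
    total_pages = pdfinfo_from_path(pdf_path)["Pages"]
    for first, last in page_batches(total_pages, batch_size):
        images = convert_from_path(
            pdf_path, dpi=dpi, first_page=first, last_page=last
        )
        for offset, image in enumerate(images):
            # File names use 1-based page numbers, matching Poppler's numbering.
            name = f"page_{first + offset}.png"
            image.save(os.path.join(output_folder, name), "PNG")
```

Note that this sketch numbers output files from 1, whereas the basic example above numbers them from 0.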
3. pdfminer.six
pdfminer.six is widely known for text extraction, but it also supports embedded image extraction.
Install: pip install pdfminer.six
Key Features
- Extracts Embedded Images: Pulls out images inside PDF objects.
- Detailed PDF Analysis: Parses PDF structure for image objects.
- Flexible Image Export: Saves extracted images in multiple formats.
Basic Usage Example
import os
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTContainer, LTImage

def find_images(layout_obj):
    # Images usually sit inside LTFigure containers, so walk the layout tree.
    if isinstance(layout_obj, LTImage):
        yield layout_obj
    elif isinstance(layout_obj, LTContainer):
        for child in layout_obj:
            yield from find_images(child)

def extract_images_from_pdf(pdf_path, output_folder):
    os.makedirs(output_folder, exist_ok=True)
    with open(pdf_path, "rb") as fp:
        parser = PDFParser(fp)
        doc = PDFDocument(parser)
        rsrcmgr = PDFResourceManager()
        device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
        # The interpreter (not PDFDevice) is what processes each page.
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page_index, page in enumerate(PDFPage.create_pages(doc)):
            interpreter.process_page(page)
            layout = device.get_result()
            for img_index, obj in enumerate(find_images(layout)):
                # The raw stream is a valid .jpg file only for JPEG-encoded images.
                with open(f"{output_folder}/image_{page_index}_{img_index}.jpg", "wb") as f:
                    f.write(obj.stream.get_rawdata())

pdf_path = "sample.pdf"
output_folder = "extracted_images"
extract_images_from_pdf(pdf_path, output_folder)
Why Use pdfminer.six?
- Exposes each embedded image together with its position in the page layout.
- Works when images are embedded inside text-based PDFs.
- Provides in-depth PDF analysis capabilities.
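Because the raw stream of an embedded image is not always a JPEG, saving every stream with a .jpg extension can produce files that viewers refuse to open. A rough sketch of checking the leading bytes before choosing an extension is shown below. The function name guess_image_extension and the "bin" fallback are assumptions for illustration, and the signature list covers only a few common formats.

```python
def guess_image_extension(data):
    """Guess a file extension from the leading bytes of an image stream."""
    signatures = [
        (b"\xff\xd8\xff", "jpg"),                       # JPEG
        (b"\x89PNG\r\n\x1a\n", "png"),                  # PNG
        (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "jp2"),     # JPEG 2000 container
        (b"II*\x00", "tif"),                            # TIFF, little-endian
        (b"MM\x00*", "tif"),                            # TIFF, big-endian
    ]
    for magic, ext in signatures:
        if data.startswith(magic):
            return ext
    # Unknown or still-compressed pixel data: keep the bytes but flag them.
    return "bin"
```

In the extraction loop, the result could replace the hard-coded extension, for example by calling guess_image_extension(obj.stream.get_rawdata()) when building the file name.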
4. PyPDF2
PyPDF2 is a versatile library for manipulating PDFs. While it primarily handles merging, splitting, and text extraction, it also supports image extraction. Note that PyPDF2 is no longer actively developed; the project continues under the name pypdf.
Install: pip install PyPDF2
Key Features
- Simple Image Extraction: Finds and saves embedded images.
- PDF Manipulation: Merges and splits PDFs easily.
- Compatible with Other Libraries: Works well with pdfminer.six and pdf2image.
Basic Usage Example
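import os
import PyPDF2

def extract_images_from_pypdf(pdf_path, output_folder):
    os.makedirs(output_folder, exist_ok=True)
    with open(pdf_path, "rb") as f:
        reader = PyPDF2.PdfReader(f)
        for i, page in enumerate(reader.pages):
            if "/XObject" not in page["/Resources"]:
                continue
            x_objects = page["/Resources"]["/XObject"].get_object()
            for name in x_objects:
                image = x_objects[name]
                # Only JPEG-encoded (DCTDecode) streams are valid .jpg files as-is;
                # images stored with other filters are skipped here.
                if image["/Subtype"] == "/Image" and image.get("/Filter") == "/DCTDecode":
                    # name looks like "/Im0"; including it avoids overwriting
                    # when a page holds several images.
                    with open(f"{output_folder}/image_{i}_{name[1:]}.jpg", "wb") as img_file:
                        img_file.write(image._data)

pdf_path = "sample.pdf"
output_folder = "extracted_images"
extract_images_from_pypdf(pdf_path, output_folder)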
Why Use PyPDF2?
- Lightweight and easy to use.
- Useful when working with multi-purpose PDF tasks.
- Extracts images while preserving structure.
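Walking the /XObject dictionary by hand is not the only route. The sketch below assumes PyPDF2 3.x, where each page exposes an images property whose items carry a name and data attribute, and it assumes Pillow is installed since that property relies on it. The helper names save_page_images and extract_images_with_page_property are made up for illustration.

```python
from pathlib import Path


def save_page_images(pages, output_folder):
    """Save every image exposed by each page's `images` property.

    `pages` is any iterable of page objects, for example `reader.pages`.
    Returns the list of paths written.
    """
    out_dir = Path(output_folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for page_index, page in enumerate(pages):
        for image in page.images:
            # Prefix with the page index so names that repeat on several
            # pages (such as "Im0.jpg") do not overwrite each other.
            target = out_dir / f"page{page_index}_{Path(image.name).name}"
            target.write_bytes(image.data)
            written.append(target)
    return written


def extract_images_with_page_property(pdf_path, output_folder):
    """Open a PDF with PyPDF2 and save all images found on its pages."""
    # Assumption: PyPDF2 3.x with Pillow available; imported here so the
    # helper above can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    return save_page_images(reader.pages, output_folder)
```

With this approach the library picks the file extension, so images that are not JPEG-encoded are saved as well.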
5. pdfplumber
pdfplumber is great for extracting text, tables, and images from PDFs. It provides fine-grained control over the extraction process.
Install: pip install pdfplumber
Key Features
- Granular Image Extraction: Pulls out specific images from each page.
- Works with Tables and Text: Extracts multiple data types.
- Supports Page Cropping: Isolates image regions efficiently.
Basic Usage Example
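import os
import pdfplumber

def extract_images_from_pdfplumber(pdf_path, output_folder):
    os.makedirs(output_folder, exist_ok=True)
    with pdfplumber.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages):
            for img_number, img in enumerate(page.images):
                data = img["stream"].get_data()
                # Only JPEG streams are complete image files as-is; other
                # encodings decode to raw pixel data, saved here as .bin.
                ext = "jpg" if data.startswith(b"\xff\xd8\xff") else "bin"
                with open(f"{output_folder}/image_{page_number}_{img_number}.{ext}", "wb") as f:
                    f.write(data)

pdf_path = "sample.pdf"
output_folder = "extracted_images"
extract_images_from_pdfplumber(pdf_path, output_folder)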
Why Use pdfplumber?
- Good for structured document analysis.
- Works well when combined with text and table extraction.
- Provides detailed control over image regions.
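The stream behind an embedded image is often not a ready-made image file, so an alternative that leans on pdfplumber's cropping is to render the region each image occupies. The sketch below crops the page to the image's bounding box and saves the rendering as a PNG. The helper names clamp_bbox and render_image_regions and the default resolution of 150 are assumptions for illustration, and the output is a re-rendering at that resolution rather than the original image data.

```python
import os


def clamp_bbox(bbox, page_bbox):
    """Shrink (x0, top, x1, bottom) so it lies within the page's bounding box."""
    x0, top, x1, bottom = bbox
    px0, ptop, px1, pbottom = page_bbox
    return (max(x0, px0), max(top, ptop), min(x1, px1), min(bottom, pbottom))


def render_image_regions(pdf_path, output_folder, resolution=150):
    """Render the page area covered by each image and save it as a PNG."""
    # Imported here so clamp_bbox can be used without pdfplumber installed.
    import pdfplumber

    os.makedirs(output_folder, exist_ok=True)
    with pdfplumber.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages):
            for img_number, img in enumerate(page.images):
                raw_bbox = (img["x0"], img["top"], img["x1"], img["bottom"])
                # Images can extend past the page edge, which crop() rejects.
                bbox = clamp_bbox(raw_bbox, page.bbox)
                if bbox[0] >= bbox[2] or bbox[1] >= bbox[3]:
                    continue  # image lies entirely outside the page
                region = page.crop(bbox).to_image(resolution=resolution)
                name = f"region_{page_number}_{img_number}.png"
                region.save(os.path.join(output_folder, name), format="PNG")
```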
Thank you for reading this article. I hope you found it helpful and informative. If you have any questions, or if you would like to suggest new Python code examples or topics for future tutorials, please feel free to reach out. Your feedback and suggestions are always welcome!
Happy coding!
Py-Core.com Python Programming
You can also find this article at Medium.com