Python Libraries for Extracting Images from PDFs

PDFs often contain images, logos, charts, and scanned pages. If you need to extract images from PDFs for digital archiving, content analysis, or OCR preprocessing, several Python libraries are available. Each offers features suited to different use cases.

1. PyMuPDF (fitz)

PyMuPDF, imported under the name fitz, is a fast and efficient library for working with PDFs. It provides straightforward methods for extracting embedded images from documents.

Install: pip install pymupdf

Key Features

High-Speed Processing: Extracts images efficiently, even from large PDFs.
Preserves Image Quality: Extracts images without losing resolution.
Supports Various Image Formats: Exports images in PNG, JPEG, and other formats.
Metadata Extraction: Retrieves DPI, color space, and image type.

Basic Usage Example

import os
import fitz  # PyMuPDF

def extract_images_from_pdf(pdf_path, output_folder):
    os.makedirs(output_folder, exist_ok=True)  # create the output folder if needed
    with fitz.open(pdf_path) as doc:
        for page_index in range(len(doc)):
            for img_index, img in enumerate(doc[page_index].get_images(full=True)):
                xref = img[0]  # cross-reference number of the image object
                base_image = doc.extract_image(xref)
                image_bytes = base_image["image"]
                image_ext = base_image["ext"]  # extension matching the stored format
                out_path = os.path.join(output_folder, f"image_{page_index}_{img_index}.{image_ext}")
                with open(out_path, "wb") as f:
                    f.write(image_bytes)

pdf_path = "sample.pdf"
output_folder = "extracted_images"
extract_images_from_pdf(pdf_path, output_folder)

Why Use PyMuPDF?

Works well with high-resolution images.
Maintains the original image format.
Efficient for batch processing large PDFs.
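
The same image (a logo in a page header, for example) can be referenced from many pages, so a per-page loop may write it to disk repeatedly. The sketch below shows one way to skip exact duplicates by hashing the image bytes. It assumes the bytes come from something like base_image["image"]; the helper name unique_images and the sample byte strings are made up for illustration.

```python
import hashlib

def unique_images(image_blobs):
    """Yield (digest, data) for each distinct blob, skipping exact repeats."""
    seen = set()
    for data in image_blobs:
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            continue  # identical bytes were already yielded
        seen.add(digest)
        yield digest, data

# Stand-in bytes; in practice each item would be the extracted image data.
blobs = [b"logo-bytes", b"chart-bytes", b"logo-bytes"]
kept = [data for _, data in unique_images(blobs)]
print(len(kept))  # 2
```

The digest also makes a convenient collision-free file name if page and image indexes are not needed.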

2. pdf2image

pdf2image converts whole PDF pages into images rather than pulling out individual embedded images. That makes it useful for scanned documents, or for PDFs where the picture you want only exists as part of the rendered page.

Install: pip install pdf2image

Additionally, pdf2image requires Poppler to be installed for PDF rendering.

Windows: Download a Poppler build for Windows and add its bin directory to your system PATH.

macOS: Install via Homebrew

brew install poppler

Linux (Debian/Ubuntu)

sudo apt install poppler-utils
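
Before calling pdf2image, it can help to confirm that the Poppler command-line tools are actually on the PATH. The following is a minimal sketch using only the standard library. It assumes the tools of interest are pdftoppm and pdfinfo, and the helper name poppler_available is hypothetical.

```python
import shutil

def poppler_available(tools=("pdftoppm", "pdfinfo")):
    """Return True if every named Poppler executable is found on PATH."""
    return all(shutil.which(tool) is not None for tool in tools)

if not poppler_available():
    print("Poppler tools not found on PATH; install Poppler before using pdf2image.")
```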

Key Features

Renders Entire Pages: Converts PDF pages into high-resolution images.
Multi-Page Support: Handles PDFs with multiple pages easily.
Format Flexibility: Outputs images as PNG, JPEG, or TIFF.
Compatible with OCR: Prepares PDFs for Optical Character Recognition (OCR).

Basic Usage Example

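import os
from pdf2image import convert_from_path

def convert_pdf_to_images(pdf_path, output_folder):
    os.makedirs(output_folder, exist_ok=True)  # create the output folder if needed
    # Render every page to a PIL image at 300 DPI (requires Poppler)
    images = convert_from_path(pdf_path, dpi=300)
    for i, image in enumerate(images):
        image.save(os.path.join(output_folder, f"page_{i}.png"), "PNG")

pdf_path = "scanned_document.pdf"
output_folder = "page_images"
convert_pdf_to_images(pdf_path, output_folder)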

Why Use pdf2image?

Ideal for scanned PDFs and OCR preprocessing.
Handles entire page extraction.
Works well with high-DPI image conversion.
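
Higher DPI values produce larger images and use more memory. PDF page sizes are measured in points (1/72 inch), so the pixel size of a rendered page can be estimated ahead of time. The sketch below is a rough estimate only; the renderer may round slightly differently, and the helper names rendered_size and rgb_bytes are made up for illustration.

```python
def rendered_size(width_pt, height_pt, dpi):
    """Approximate pixel size of a page given its size in points and a DPI."""
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)

def rgb_bytes(width_px, height_px):
    """Uncompressed size of an 8-bit RGB image, in bytes."""
    return width_px * height_px * 3

# US Letter is 612 x 792 points.
width, height = rendered_size(612, 792, 300)
print(width, height)             # 2550 3300
print(rgb_bytes(width, height))  # 25245000
```

At roughly 25 MB per uncompressed page, a long document rendered at 300 DPI can use a lot of memory, which is a reason to lower the DPI or process pages in batches.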

3. pdfminer.six

pdfminer.six is widely known for text extraction, but it also supports embedded image extraction.

Install: pip install pdfminer.six

Key Features

Extracts Embedded Images: Pulls out images inside PDF objects.
Detailed PDF Analysis: Parses PDF structure for image objects.
Flexible Image Export: Saves extracted images in multiple formats.

Basic Usage Example

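import os
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTFigure, LTImage

def iter_images(container):
    # Images usually sit inside LTFigure containers, so search recursively.
    for obj in container:
        if isinstance(obj, LTImage):
            yield obj
        elif isinstance(obj, LTFigure):
            yield from iter_images(obj)

def extract_images_from_pdf(pdf_path, output_folder):
    os.makedirs(output_folder, exist_ok=True)  # create the output folder if needed
    with open(pdf_path, "rb") as fp:
        parser = PDFParser(fp)
        doc = PDFDocument(parser)
        parser.set_document(doc)

        rsrcmgr = PDFResourceManager()
        device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
        # The page interpreter (not PDFDevice) is what processes page content.
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page_index, page in enumerate(PDFPage.create_pages(doc)):
            interpreter.process_page(page)
            layout = device.get_result()
            for obj in iter_images(layout):
                # get_rawdata() returns the stream as stored in the PDF, so the
                # .jpg extension is only correct for JPEG-encoded images.
                with open(f"{output_folder}/image_{page_index}_{obj.name}.jpg", "wb") as f:
                    f.write(obj.stream.get_rawdata())

pdf_path = "sample.pdf"
output_folder = "extracted_images"
extract_images_from_pdf(pdf_path, output_folder)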

Why Use pdfminer.six?

Good for locating embedded images together with their position in the page layout.
Works when images are embedded inside text-based PDFs.
Provides in-depth PDF analysis capabilities.
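
Raw image streams pulled from a PDF are not always JPEG data, so a fixed .jpg extension can mislabel files. One way to hedge against that is to inspect the first bytes for a known file signature. This is a minimal sketch covering only a few common signatures; the helper name guess_extension and the "bin" fallback are assumptions for illustration, and raw pixel data with no file header will always fall through to the default.

```python
# (signature bytes, file extension) pairs for a few common image formats
SIGNATURES = [
    (b"\xff\xd8\xff", "jpg"),                         # JPEG
    (b"\x89PNG\r\n\x1a\n", "png"),                    # PNG
    (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "jp2"),       # JPEG 2000
    (b"II*\x00", "tif"),                              # TIFF, little-endian
    (b"MM\x00*", "tif"),                              # TIFF, big-endian
]

def guess_extension(data, default="bin"):
    """Return an extension based on the leading bytes, or the default."""
    for magic, ext in SIGNATURES:
        if data.startswith(magic):
            return ext
    return default

print(guess_extension(b"\xff\xd8\xff\xe0\x00\x10JFIF"))  # jpg
print(guess_extension(b"raw pixel data"))                 # bin
```

A file saved with the "bin" extension signals that the stream needs decoding (for example with an imaging library) before it can be viewed.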

4. PyPDF2

PyPDF2 is a versatile library for manipulating PDFs. While it primarily handles merging, splitting, and text extraction, it can also reach embedded images. Note that PyPDF2 is no longer actively developed; the project continues under the name pypdf.

Install: pip install pypdf2

Key Features

Simple Image Extraction: Finds and saves embedded images.
PDF Manipulation: Merges and splits PDFs easily.
Compatible with Other Libraries: Works well with pdfminer.six and pdf2image.

Basic Usage Example

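import os
import PyPDF2

def extract_images_from_pypdf(pdf_path, output_folder):
    os.makedirs(output_folder, exist_ok=True)  # create the output folder if needed
    with open(pdf_path, "rb") as f:
        reader = PyPDF2.PdfReader(f)
        for i, page in enumerate(reader.pages):
            if "/XObject" in page["/Resources"]:
                xObject = page["/Resources"]["/XObject"].get_object()
                for obj in xObject:
                    if xObject[obj]["/Subtype"] == "/Image":
                        # _data holds the stream as stored in the PDF (still encoded),
                        # so the .jpg extension is only right for JPEG images.
                        img_data = xObject[obj]._data
                        # obj is a name such as "/Im0"; including it keeps several
                        # images on one page from overwriting each other.
                        with open(f"{output_folder}/image_{i}_{obj[1:]}.jpg", "wb") as img_file:
                            img_file.write(img_data)

pdf_path = "sample.pdf"
output_folder = "extracted_images"
extract_images_from_pypdf(pdf_path, output_folder)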

Why Use PyPDF2?

Lightweight and easy to use.
Useful when working with multi-purpose PDF tasks.
Extracts images while preserving structure.

5. pdfplumber

pdfplumber is great for extracting text, tables, and images from PDFs. It provides fine-grained control over the extraction process.

Install: pip install pdfplumber

Key Features

Granular Image Extraction: Pulls out specific images from each page.
Works with Tables and Text: Extracts multiple data types.
Supports Page Cropping: Isolates image regions efficiently.

Basic Usage Example

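import os
import pdfplumber

def extract_images_from_pdfplumber(pdf_path, output_folder, resolution=150):
    os.makedirs(output_folder, exist_ok=True)  # create the output folder if needed
    with pdfplumber.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages):
            for img_number, img in enumerate(page.images):
                # The image's stream bytes are not a PNG file, so render the
                # image's region of the page instead of writing them directly.
                px0, ptop, px1, pbottom = page.bbox
                bbox = (max(img["x0"], px0), max(img["top"], ptop),
                        min(img["x1"], px1), min(img["bottom"], pbottom))
                region = page.crop(bbox).to_image(resolution=resolution)
                region.save(f"{output_folder}/image_{page_number}_{img_number}.png")

pdf_path = "sample.pdf"
output_folder = "extracted_images"
extract_images_from_pdfplumber(pdf_path, output_folder)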

Why Use pdfplumber?

Good for structured document analysis.
Works well when combined with text and table extraction.
Provides detailed control over image regions.
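
Because each entry in page.images is a dictionary carrying the image's position on the page, it is easy to select only the images you care about before saving anything. The sketch below filters out small decorative images such as icons and bullets. It assumes the dictionaries have x0, x1, top, and bottom keys measured in points; the helper name large_images and the 50-point thresholds are arbitrary choices for illustration.

```python
def large_images(images, min_width=50, min_height=50):
    """Keep only image entries at least min_width x min_height points in size."""
    return [
        im for im in images
        if im["x1"] - im["x0"] >= min_width
        and im["bottom"] - im["top"] >= min_height
    ]

# Stand-in entries shaped like the dictionaries in page.images.
sample = [
    {"x0": 10, "x1": 20, "top": 10, "bottom": 20},     # 10 x 10 icon
    {"x0": 72, "x1": 372, "top": 100, "bottom": 300},  # 300 x 200 chart
]
print(len(large_images(sample)))  # 1
```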

Thank you for reading this article. I hope you found it helpful and informative. If you have any questions, or if you would like to suggest new Python code examples or topics for future tutorials, please feel free to reach out. Your feedback and suggestions are always welcome!

Happy coding!
Py-Core.com Python Programming

You can also find this article at Medium.com

By admin

Updated May 31, 2025