Python Multiprocessing with a Real-World Example

Python’s multiprocessing library enables developers to speed up applications by distributing work across cores. Multiprocessing lets a computer handle multiple tasks simultaneously by giving each process its own memory space and resources. This setup allows each process to run independently, which makes it great for handling heavy, computation-intensive work. If you’re diving into training large language models (LLMs) or working with AI-heavy Python code, you’ll definitely come across terms like multiprocessing and multithreading.

Here’s a quick breakdown of the two: with multiprocessing, each process operates separately and independently, with the operating system allocating its own memory space to each one. This setup is ideal when tasks need to be completely independent and when you want to fully use multiple CPU cores.

In contrast, multithreading runs multiple threads within the same process. Threads share memory, which makes them faster to start but less isolated than processes, making multithreading more suitable for tasks that involve a lot of waiting, like reading files or making network requests.

Let’s dive into when and why to use multiprocessing, the key concepts, and how to handle processes. By the end, you’ll be equipped with a sample Python script illustrating these ideas in action.

What is Multiprocessing?

Multiprocessing allows tasks to be split across multiple CPU cores. Because of the Global Interpreter Lock (GIL), a standard Python process executes bytecode in only one thread at a time, so CPU-heavy work effectively gets handled one task after another. Multiprocessing creates a separate process for each task, allowing them to run on different CPU cores simultaneously. This is different from multithreading because threads share the same memory space, whereas processes do not, which avoids common shared-state problems such as race conditions and deadlocks, at the cost of having to pass data between processes explicitly.
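
As a quick sanity check, the minimal sketch below simply prints how many CPU cores the multiprocessing library can spread work across on your machine:

import multiprocessing

# Number of CPU cores available for scheduling processes
print(f"Available CPU cores: {multiprocessing.cpu_count()}")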

Why and When Should You Use Multiprocessing?

You should consider multiprocessing when your program requires heavy computations or you’re working with independent tasks that can run in parallel. CPU-bound tasks, such as number crunching, data processing, or working with large files, can benefit from multiprocessing. For I/O-bound tasks, such as reading files or making network requests, multithreading may be more effective because these tasks spend most of their time waiting for external input/output rather than using the CPU. Multiprocessing, however, is ideal when tasks involve significant computation and can run independently.
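
To make the distinction concrete, here is a small sketch (the busy_work function and its workload are made up purely for illustration) that times a CPU-bound loop run twice sequentially versus once in each of two processes; on a multi-core machine the parallel version should finish noticeably faster:

import multiprocessing
import time

def busy_work(n):
    """CPU-bound loop used purely for illustration."""
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    n = 10_000_000

    # Sequential: two calls, one after the other, on a single core
    start = time.perf_counter()
    busy_work(n)
    busy_work(n)
    print(f"Sequential: {time.perf_counter() - start:.2f}s")

    # Parallel: two processes, ideally scheduled on two different cores
    start = time.perf_counter()
    p1 = multiprocessing.Process(target=busy_work, args=(n,))
    p2 = multiprocessing.Process(target=busy_work, args=(n,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    print(f"Two processes: {time.perf_counter() - start:.2f}s")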

Key Concepts in Multiprocessing

Processes

Each process runs independently and has its own memory space. In Python’s multiprocessing library, creating a new process is straightforward: define a function, then pass it as the target when constructing a Process object.
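
For example, a minimal sketch (the greet function is just an illustrative placeholder) of creating and running a single process:

import multiprocessing

def greet(name):
    """Simple task to run in a separate process."""
    print(f"Hello from another process, {name}!")

if __name__ == "__main__":
    # Bind the function and its arguments to a new process object
    process = multiprocessing.Process(target=greet, args=("Alice",))
    process.start()   # launch the process
    process.join()    # wait for it to finish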

Starting, Stopping, and Exiting Processes

Starting a process is easy: just call the start() method. To stop a process immediately, use the terminate() method, which kills it without waiting. If instead you want the process to finish its task and exit on its own, call the join() method, which makes the main program wait until the process has completed. This way, the process exits naturally without being killed forcibly.
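
The sketch below (the worker function is a placeholder) shows both paths: waiting for one process with join() and forcibly stopping another with terminate():

import multiprocessing
import time

def worker(seconds):
    """Sleep to simulate a task of a given length."""
    time.sleep(seconds)
    print(f"Finished after {seconds}s")

if __name__ == "__main__":
    short_task = multiprocessing.Process(target=worker, args=(1,))
    long_task = multiprocessing.Process(target=worker, args=(60,))

    short_task.start()
    long_task.start()

    short_task.join()        # wait for the short task to exit naturally
    long_task.terminate()    # kill the long task immediately
    long_task.join()         # reap the terminated process
    print("Done; the long task was terminated, the short one exited on its own.")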

Getting Values from Processes

When you need values from multiple processes, you have options like Queue and Pipe. These provide ways to communicate between processes and collect outputs.

Waiting for a Process

To ensure a process completes before moving forward, join() is helpful. It pauses the main thread until the specified process has finished executing. This helps maintain order, especially when one task depends on the output of another.
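
As a small illustration (the step functions are placeholders), the second step below is only launched after join() confirms the first one has finished:

import multiprocessing
import time

def prepare():
    """Simulate a slow preparation step."""
    time.sleep(1)
    print("Step 1: data prepared")

def analyze():
    """Depends on step 1 having completed."""
    print("Step 2: analysis started")

if __name__ == "__main__":
    step1 = multiprocessing.Process(target=prepare)
    step1.start()
    step1.join()   # block the main program until step 1 finishes

    step2 = multiprocessing.Process(target=analyze)
    step2.start()
    step2.join()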

Queue

Queue allows multiple processes to communicate. Think of it as a data pipe where one process can place data, and another can retrieve it. This helps when you need multiple processes to contribute to a shared dataset without sharing memory directly.
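
Here is a minimal producer/consumer sketch (the function names and the sentinel value are illustrative choices, not part of the library):

import multiprocessing

def producer(queue):
    """Put a few items into the shared queue."""
    for item in ["a", "b", "c"]:
        queue.put(item)
    queue.put(None)  # sentinel: tells the consumer to stop

def consumer(queue):
    """Read items until the sentinel arrives."""
    while True:
        item = queue.get()
        if item is None:
            break
        print(f"Consumed: {item}")

if __name__ == "__main__":
    queue = multiprocessing.Queue()
    p1 = multiprocessing.Process(target=producer, args=(queue,))
    p2 = multiprocessing.Process(target=consumer, args=(queue,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()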

Pipe

Pipe is another form of inter-process communication (IPC), but it connects exactly two endpoints. It’s useful when two specific processes need to exchange data back and forth. By default, a Pipe is bidirectional (duplex), so both ends can send and receive, which makes it a lightweight choice for simple point-to-point communication; pass duplex=False if you only need one-way traffic.
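
A short sketch of a two-process exchange over a Pipe (the worker function and the messages are made up for illustration):

import multiprocessing

def worker(conn):
    """Receive a message on one end of the pipe and send a reply back."""
    message = conn.recv()
    conn.send(f"Got your message: {message!r}")
    conn.close()

if __name__ == "__main__":
    # Pipe() returns two connected endpoints; by default both can send and receive
    parent_conn, child_conn = multiprocessing.Pipe()
    process = multiprocessing.Process(target=worker, args=(child_conn,))
    process.start()

    parent_conn.send("hello")   # parent sends
    print(parent_conn.recv())   # parent receives the reply
    process.join()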

Pool

Pool allows you to manage multiple worker processes, which is helpful for parallelizing a function across many inputs. You create a pool and assign it tasks, then let Python distribute these tasks across worker processes.
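
For example, a minimal sketch that uses Pool.map to square a list of numbers across up to four worker processes:

import multiprocessing

def square(number):
    """Return the square of a number."""
    return number * number

if __name__ == "__main__":
    numbers = [1, 2, 3, 4, 5]
    # Distribute the calls across up to 4 worker processes
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(square, numbers)
    print(results)  # [1, 4, 9, 16, 25]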

Example Code: Setting Up a Multiprocessing System

In this example, we’ll create a Python script that uses multiprocessing to calculate the square of numbers in a list. The script will use a virtual environment and work on both Windows and Linux.

File Structure:

  • Project directory: multiprocessing_example/
  • Main script: multiprocessing_example/multiprocessing_script.py
  • Virtual environment (create in the same directory): multiprocessing_example/venv/

Creating the Virtual Environment

Run the following command based on your OS:

Windows:

python -m venv venv
.\venv\Scripts\activate

Linux:

python3 -m venv venv
source venv/bin/activate

The Code: multiprocessing_script.py

Below is the complete script. It calculates squares and uses Queue to get results from child processes.

import multiprocessing
import time

def square(number, queue):
    """Calculate the square of a number and store it in a queue."""
    result = number * number
    print(f"Square of {number}: {result}")
    queue.put(result)

if __name__ == "__main__":
    # List of numbers to square
    numbers = [1, 2, 3, 4, 5]

    # Queue to store results
    queue = multiprocessing.Queue()

    # List to hold the process objects
    processes = []

    # Start each process
    for number in numbers:
        process = multiprocessing.Process(target=square, args=(number, queue))
        processes.append(process)
        process.start()

    # Wait for all processes to complete
    for process in processes:
        process.join()

    # Retrieve results from the queue (order depends on when each process finished)
    results = [queue.get() for _ in numbers]
    print("Squares:", results)

Explanation of the Code

  • Function square: This function takes a number, calculates its square, and puts the result in a Queue.
  • Process Setup: Each number in the list numbers is assigned to its own Process object. These processes are stored in the processes list.
  • Starting Processes: We use a loop to start each process.
  • Waiting for Processes: The join() method ensures each process completes before moving to the next section.
  • Retrieving Results: After all processes are complete, we retrieve results from the queue using queue.get(). Note that the order of the results can differ from the order of the input numbers, because it reflects the order in which the processes finished.

Running the Code

Navigate to the directory:

cd multiprocessing_example

Activate the virtual environment:

source venv/bin/activate  # Linux
.\venv\Scripts\activate  # Windows

Run the script:

python multiprocessing_script.py

What’s Happening in the Code?

The code demonstrates a few concepts:

  • Creating processes: Each process handles a unique computation.
  • Queue: We use Queue for storing results, allowing each process to put data into a shared resource.
  • Process Synchronization: With join(), the script waits for each process to complete.

Thank you for following along with this tutorial. We hope you found it helpful and informative. If you have any questions, or if you would like to suggest new Python code examples or topics for future tutorials and articles, please feel free to join the discussion and leave a comment. Your feedback and suggestions are always welcome!

You can find the same tutorial on Medium.com.
