
Python Multiprocessing

Introduction

When your Python program needs to perform multiple tasks simultaneously or leverage multiple CPU cores for better performance, the multiprocessing module comes to the rescue. Unlike threading, which is limited by Python's Global Interpreter Lock (GIL), multiprocessing allows you to bypass this limitation by creating separate Python processes that can run truly in parallel.

In this tutorial, we'll explore how to use Python's multiprocessing module to distribute workloads across multiple CPU cores, significantly improving performance for CPU-bound tasks.

Understanding Multiprocessing Basics

Multiprocessing is a technique that allows your program to create multiple processes, each running independently with its own Python interpreter and memory space. This approach enables true parallel execution across multiple CPU cores.

When to Use Multiprocessing

Multiprocessing is particularly effective for:

  • CPU-bound tasks: Calculations, data processing, and other operations that heavily use the CPU
  • Tasks that need to utilize multiple CPU cores
  • Operations that can be split into independent parts

Multiprocessing vs. Threading

Before diving deeper, it's important to understand the key difference:

  • Threading: Multiple threads share the same memory space but are limited by the GIL, making them ideal for I/O-bound tasks
  • Multiprocessing: Multiple processes with separate memory spaces that can run truly in parallel, ideal for CPU-bound tasks (see the benchmark sketch below)
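
To make the contrast concrete, here is a minimal benchmark sketch (the cpu_bound and run_with helpers are illustrative names, not library functions). On a typical multi-core machine the threaded run takes roughly as long as a single worker because of the GIL, while the process-based run scales with the number of cores:

python
import multiprocessing
import threading
import time

def cpu_bound(n):
    # Pure computation that keeps the CPU busy the whole time
    total = 0
    for i in range(n):
        total += i * i
    return total

def run_with(worker_cls, n_workers, n):
    # worker_cls is threading.Thread or multiprocessing.Process;
    # both accept the same target/args constructor arguments
    workers = [worker_cls(target=cpu_bound, args=(n,)) for _ in range(n_workers)]
    start = time.time()
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return time.time() - start

if __name__ == "__main__":
    N = 5_000_000
    print(f"4 threads:   {run_with(threading.Thread, 4, N):.2f} s")
    print(f"4 processes: {run_with(multiprocessing.Process, 4, N):.2f} s")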

Getting Started with Multiprocessing

Let's start with a basic example of how to create and run a process:

python
import multiprocessing
import time

def process_function():
    print(f"Process ID: {multiprocessing.current_process().pid}")
    print("Process running...")
    time.sleep(2)
    print("Process completed.")

if __name__ == "__main__":
    # Create a process
    process = multiprocessing.Process(target=process_function)

    # Start the process
    print("Starting the process...")
    process.start()

    # Wait for the process to complete
    process.join()

    print("Main program continues...")

Output:

Starting the process...
Process ID: 14582
Process running...
Process completed.
Main program continues...

In this example:

  1. We import the multiprocessing module
  2. Define a function process_function that will run in a separate process
  3. Create a Process object with our function as the target
  4. Start the process using start()
  5. Wait for it to finish using join()

The if __name__ == "__main__" Guard

You might notice the if __name__ == "__main__": statement in our code. This guard is crucial when using multiprocessing in Python. On platforms that start child processes with the spawn method (the default on Windows and macOS), each child re-imports the main module; without the guard, that re-import would try to create new processes again, leading to runaway process creation or a RuntimeError.
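
A minimal sketch of why the guard matters (the worker function is just for illustration). On a spawn-based platform, module-level code runs again inside every child process, so process creation has to live inside the guard:

python
import multiprocessing

def worker():
    print("Child doing work")

# Module-level code: on spawn-based platforms this line also runs in every child
print("Module is being imported/run")

if __name__ == "__main__":
    # Only the original script reaches this block, so processes are created exactly once
    p = multiprocessing.Process(target=worker)
    p.start()
    p.join()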

Processing Multiple Items with Pool

For most practical scenarios, you'll want to process multiple items in parallel. The Pool class provides a convenient way to do this:

python
import multiprocessing
import time

def process_item(item):
    print(f"Processing {item} in process {multiprocessing.current_process().pid}")
    time.sleep(1)  # Simulate processing time
    return item * 2

if __name__ == "__main__":
    items = [1, 2, 3, 4, 5, 6, 7, 8]

    # Time the sequential execution
    start_time = time.time()
    sequential_result = [process_item(item) for item in items]
    sequential_time = time.time() - start_time

    # Time the parallel execution
    start_time = time.time()
    with multiprocessing.Pool() as pool:
        parallel_result = pool.map(process_item, items)
    parallel_time = time.time() - start_time

    print(f"Sequential result: {sequential_result}")
    print(f"Parallel result: {parallel_result}")
    print(f"Sequential execution time: {sequential_time:.2f} seconds")
    print(f"Parallel execution time: {parallel_time:.2f} seconds")
    print(f"Speedup: {sequential_time / parallel_time:.2f}x")

Output (abridged; the sequential run's Processing lines are omitted, and the exact PIDs and timings will vary; this run used a pool of four worker processes):

Processing 1 in process 14583
Processing 2 in process 14584
Processing 3 in process 14585
Processing 4 in process 14586
Processing 5 in process 14583
Processing 6 in process 14584
Processing 7 in process 14585
Processing 8 in process 14586
Sequential result: [2, 4, 6, 8, 10, 12, 14, 16]
Parallel result: [2, 4, 6, 8, 10, 12, 14, 16]
Sequential execution time: 8.01 seconds
Parallel execution time: 2.05 seconds
Speedup: 3.91x

This example demonstrates:

  1. How to use Pool to distribute workload across multiple processes
  2. The performance advantage of parallel execution (speedup depends on your CPU cores)

Sharing Data Between Processes

Unlike threads, processes don't share memory by default. Python's multiprocessing module provides several ways to share data:

Using Queue

python
import multiprocessing

def producer(queue):
    for i in range(5):
        queue.put(i)
        print(f"Produced: {i}")

    # Signal the end of data
    queue.put(None)

def consumer(queue):
    while True:
        data = queue.get()
        if data is None:  # Check for the end signal
            break
        print(f"Consumed: {data}")

if __name__ == "__main__":
    # Create a shared queue
    queue = multiprocessing.Queue()

    # Create processes
    p1 = multiprocessing.Process(target=producer, args=(queue,))
    p2 = multiprocessing.Process(target=consumer, args=(queue,))

    # Start processes
    p1.start()
    p2.start()

    # Wait for processes to finish
    p1.join()
    p2.join()

Output (the Produced and Consumed lines may interleave):

Produced: 0
Produced: 1
Produced: 2
Produced: 3
Produced: 4
Consumed: 0
Consumed: 1
Consumed: 2
Consumed: 3
Consumed: 4

Using Value and Array for Shared Memory

python
import multiprocessing
import time

def increment_counter(counter):
    for _ in range(100):
        with counter.get_lock():  # The increment is a read-modify-write, so guard it
            counter.value += 1
        time.sleep(0.01)

if __name__ == "__main__":
    # Create a shared counter
    counter = multiprocessing.Value('i', 0)  # 'i' indicates integer type

    # Create processes
    processes = [
        multiprocessing.Process(target=increment_counter, args=(counter,))
        for _ in range(4)
    ]

    # Start processes
    for p in processes:
        p.start()

    # Wait for processes to finish
    for p in processes:
        p.join()

    print(f"Final counter value: {counter.value}")

Output:

Final counter value: 400

In this example:

  1. We create a shared Value object that can be accessed by multiple processes
  2. Each process increments the counter 100 times; because counter.value += 1 is a read-modify-write operation, it is wrapped in the Value's built-in lock (counter.get_lock()) so no update is lost
  3. The final value of 400 confirms that every increment was applied
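
The section heading also mentions Array, which shares a whole sequence the same way Value shares a single number. Here is a minimal sketch (the square_slot helper is illustrative) in which each process writes to its own slot of the shared array:

python
import multiprocessing

def square_slot(shared_array, index):
    # Each process squares the value stored at its own index
    shared_array[index] = shared_array[index] ** 2

if __name__ == "__main__":
    numbers = multiprocessing.Array('i', [1, 2, 3, 4, 5])  # 'i' indicates integer type

    processes = [
        multiprocessing.Process(target=square_slot, args=(numbers, i))
        for i in range(len(numbers))
    ]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

    print(list(numbers))  # Expected: [1, 4, 9, 16, 25]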

Process Synchronization

Just like with threading, process synchronization is important to prevent race conditions:

python
import multiprocessing
import time

def deposit(balance, lock, amount):
    for _ in range(100):
        with lock:
            balance.value += amount
        time.sleep(0.001)

def withdraw(balance, lock, amount):
    for _ in range(100):
        with lock:
            balance.value -= amount
        time.sleep(0.001)

if __name__ == "__main__":
    balance = multiprocessing.Value('d', 1000.0)  # 'd' indicates double/float
    lock = multiprocessing.Lock()

    # Create processes
    p1 = multiprocessing.Process(target=deposit, args=(balance, lock, 10))
    p2 = multiprocessing.Process(target=withdraw, args=(balance, lock, 10))

    # Start processes
    p1.start()
    p2.start()

    # Wait for processes to finish
    p1.join()
    p2.join()

    print(f"Final balance: ${balance.value:.2f}")

Output:

Final balance: $1000.00

This example shows how to use a Lock to ensure that only one process can modify the shared balance at a time.

Practical Example: Image Processing

Let's see a real-world application of multiprocessing - image processing. In this example, we'll apply a simple blur effect to multiple images in parallel.

python
from PIL import Image, ImageFilter
import multiprocessing
import time
import os

def blur_image(image_path):
    try:
        img_name = os.path.basename(image_path)
        print(f"Processing {img_name}...")

        # Open image
        img = Image.open(image_path)

        # Apply blur
        blurred = img.filter(ImageFilter.GaussianBlur(radius=10))

        # Create output filename
        output_path = f"blurred_{img_name}"

        # Save blurred image
        blurred.save(output_path)

        return f"{img_name} processed successfully"
    except Exception as e:
        return f"Error processing {image_path}: {str(e)}"

if __name__ == "__main__":
    # List of image paths (replace with your actual image paths)
    image_paths = [
        "image1.jpg",
        "image2.jpg",
        "image3.jpg",
        "image4.jpg"
    ]

    # Process sequentially
    start_time = time.time()
    for path in image_paths:
        blur_image(path)
    sequential_time = time.time() - start_time

    # Process in parallel
    start_time = time.time()
    with multiprocessing.Pool() as pool:
        results = pool.map(blur_image, image_paths)
    parallel_time = time.time() - start_time

    print(f"Sequential processing time: {sequential_time:.2f} seconds")
    print(f"Parallel processing time: {parallel_time:.2f} seconds")
    print(f"Speedup: {sequential_time / parallel_time:.2f}x")

This example would need actual image files to run, but it demonstrates how to use multiprocessing to speed up image processing tasks.

Advanced Features

Process Pools with Different Methods

The Pool class offers several methods for different parallel processing needs:

python
import multiprocessing
import time

def process_item(item):
    time.sleep(1)  # Simulate processing time
    return item * 2

if __name__ == "__main__":
    with multiprocessing.Pool(processes=4) as pool:
        # map - process items in order
        results = pool.map(process_item, [1, 2, 3, 4])
        print(f"map results: {results}")

        # apply_async - process a single item asynchronously
        result = pool.apply_async(process_item, (10,))
        print(f"apply_async result: {result.get()}")

        # map_async - process items asynchronously
        result = pool.map_async(process_item, [5, 6, 7, 8])
        print(f"map_async results: {result.get()}")

        # imap - returns an iterator over the results
        for result in pool.imap(process_item, [9, 10, 11]):
            print(f"imap result: {result}")

Output:

map results: [2, 4, 6, 8]
apply_async result: 20
map_async results: [10, 12, 14, 16]
imap result: 18
imap result: 20
imap result: 22

Process Communication with Pipes

python
import multiprocessing

def sender(conn):
    conn.send("Hello from the sender process!")
    conn.close()

def receiver(conn):
    message = conn.recv()
    print(f"Received: {message}")
    conn.close()

if __name__ == "__main__":
    # Create a pipe
    parent_conn, child_conn = multiprocessing.Pipe()

    # Create processes
    p1 = multiprocessing.Process(target=sender, args=(child_conn,))
    p2 = multiprocessing.Process(target=receiver, args=(parent_conn,))

    # Start processes
    p1.start()
    p2.start()

    # Wait for processes to finish
    p1.join()
    p2.join()

Output:

Received: Hello from the sender process!

Best Practices and Considerations

  1. Memory Usage: Each process has its own memory space, which can increase the overall memory usage.

  2. Process Creation Overhead: Creating processes is more expensive than creating threads.

  3. Data Serialization: Data passed between processes needs to be serializable.

  4. Number of Processes: A common practice is to match the number of processes with the number of CPU cores:

python
import multiprocessing

# Get the number of available CPU cores
num_cores = multiprocessing.cpu_count()
print(f"Number of CPU cores: {num_cores}")

# Create a pool with optimal number of processes
# Create a pool with the optimal number of processes
with multiprocessing.Pool(processes=num_cores) as pool:
    # Your parallel code here
    pass

  5. Process Termination: Always make sure to properly terminate processes to avoid zombie processes; a minimal sketch follows.
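
A minimal sketch of explicit termination (the long_task function here is illustrative): terminate() stops a process forcefully and should always be followed by join() so the process is fully reaped.

python
import multiprocessing
import time

def long_task():
    # Simulate a worker that never finishes on its own
    while True:
        time.sleep(0.1)

if __name__ == "__main__":
    p = multiprocessing.Process(target=long_task)
    p.start()

    time.sleep(1)   # Let it run briefly
    p.terminate()   # Forcefully stop the process
    p.join()        # Join afterwards so no zombie process is left behind
    print(f"Still alive? {p.is_alive()}")  # Expected: False

In practice, prefer letting workers exit on their own (or manage them with a Pool context manager) and keep terminate() for cleaning up stuck processes.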

Summary

Python's multiprocessing module is a powerful tool for parallel processing that allows you to:

  • Create and manage multiple processes
  • Distribute workloads across CPU cores using Pool
  • Share data between processes using Queue, Value, and Array
  • Synchronize processes using Lock and other primitives
  • Significantly improve performance for CPU-bound tasks

By bypassing the Global Interpreter Lock (GIL), multiprocessing enables true parallel execution, making it the ideal choice when you need to leverage multiple CPU cores to speed up processing-intensive tasks.

Exercises

  1. Create a multiprocessing program that calculates the square of numbers from 1 to 1,000,000 and compare its performance with a sequential approach.

  2. Build a parallel web scraper that uses multiple processes to download and process multiple web pages simultaneously.

  3. Implement a producer-consumer pattern using multiprocessing to process items from a queue.

  4. Create an image processing application that applies different filters to images in parallel.

  5. Implement a parallel sorting algorithm using multiprocessing and compare its performance with Python's built-in sort.

Happy parallel processing!


