Python Multiprocessing
Introduction
When your Python program needs to perform multiple tasks simultaneously or leverage multiple CPU cores for better performance, the multiprocessing module comes to the rescue. Unlike threading, which is limited by Python's Global Interpreter Lock (GIL), multiprocessing allows you to bypass this limitation by creating separate Python processes that can run truly in parallel.
In this tutorial, we'll explore how to use Python's multiprocessing module to distribute workloads across multiple CPU cores, significantly improving performance for CPU-bound tasks.
Understanding Multiprocessing Basics
Multiprocessing is a technique that allows your program to create multiple processes, each running independently with its own Python interpreter and memory space. This approach enables true parallel execution across multiple CPU cores.
When to Use Multiprocessing
Multiprocessing is particularly effective for:
- CPU-bound tasks: Calculations, data processing, and other operations that heavily use the CPU
- Tasks that need to utilize multiple CPU cores
- Operations that can be split into independent parts
Multiprocessing vs. Threading
Before diving deeper, it's important to understand the key difference:
- Threading: Multiple threads share the same memory space but are limited by the GIL, making them best suited to I/O-bound tasks
- Multiprocessing: Multiple processes with separate memory spaces that can run truly in parallel, ideal for CPU-bound tasks (the sketch below makes the difference concrete)
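To make that difference concrete, here is a small sketch (not from the original text) that runs the same CPU-bound countdown with two threads and then with two processes. On a multi-core machine the process version should finish in roughly half the time, while the threaded version does not:
import time
from threading import Thread
from multiprocessing import Process

def count_down(n):
    # Pure-Python CPU work: decrement n to zero
    while n > 0:
        n -= 1

if __name__ == "__main__":
    N = 20_000_000

    # Two threads: the GIL lets only one of them run Python bytecode at a time
    start = time.time()
    threads = [Thread(target=count_down, args=(N,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"2 threads:   {time.time() - start:.2f} seconds")

    # Two processes: each has its own interpreter and GIL, so they run in parallel
    start = time.time()
    processes = [Process(target=count_down, args=(N,)) for _ in range(2)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    print(f"2 processes: {time.time() - start:.2f} seconds")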
Getting Started with Multiprocessing
Let's start with a basic example of how to create and run a process:
import multiprocessing
import time
def process_function():
    print(f"Process ID: {multiprocessing.current_process().pid}")
    print("Process running...")
    time.sleep(2)
    print("Process completed.")

if __name__ == "__main__":
    # Create a process
    process = multiprocessing.Process(target=process_function)

    # Start the process
    print("Starting the process...")
    process.start()

    # Wait for the process to complete
    process.join()

    print("Main program continues...")
Output:
Starting the process...
Process ID: 14582
Process running...
Process completed.
Main program continues...
In this example:
- We import the multiprocessing module
- Define a function process_function that will run in a separate process
- Create a Process object with our function as the target
- Start the process using start()
- Wait for it to finish using join()
The if __name__ == "__main__" Guard
You might notice the if __name__ == "__main__": statement in our code. This guard is crucial when using multiprocessing in Python: under the spawn start method (the default on Windows and macOS), each child process re-imports the main module, and the guard keeps the process-creation code from running again in every child.
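As a minimal sketch of what the guard protects against (assuming the spawn start method), note that everything at module level runs again in each child, so unguarded process creation would be re-executed there too:
import multiprocessing
import os

# Under the 'spawn' start method, this line runs once in the main process
# and again in every child process that re-imports the module.
print(f"Module-level code running in process {os.getpid()}")

def work():
    print("Child doing work")

if __name__ == "__main__":
    # Without this guard, the process creation below would also execute while a
    # child re-imports the module, which Python rejects with a bootstrapping error.
    p = multiprocessing.Process(target=work)
    p.start()
    p.join()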
Processing Multiple Items with Pool
For most practical scenarios, you'll want to process multiple items in parallel. The Pool class provides a convenient way to do this:
import multiprocessing
import time
def process_item(item):
    print(f"Processing {item} in process {multiprocessing.current_process().pid}")
    time.sleep(1)  # Simulate processing time
    return item * 2

if __name__ == "__main__":
    items = [1, 2, 3, 4, 5, 6, 7, 8]

    # Time the sequential execution
    start_time = time.time()
    sequential_result = [process_item(item) for item in items]
    sequential_time = time.time() - start_time

    # Time the parallel execution
    start_time = time.time()
    with multiprocessing.Pool() as pool:
        parallel_result = pool.map(process_item, items)
    parallel_time = time.time() - start_time

    print(f"Sequential result: {sequential_result}")
    print(f"Parallel result: {parallel_result}")
    print(f"Sequential execution time: {sequential_time:.2f} seconds")
    print(f"Parallel execution time: {parallel_time:.2f} seconds")
    print(f"Speedup: {sequential_time / parallel_time:.2f}x")
Output (abridged to the parallel run's messages; times and process IDs will vary based on your CPU):
Processing 1 in process 14583
Processing 2 in process 14584
Processing 3 in process 14585
Processing 4 in process 14586
Processing 5 in process 14587
Processing 6 in process 14588
Processing 7 in process 14589
Processing 8 in process 14590
Sequential result: [2, 4, 6, 8, 10, 12, 14, 16]
Parallel result: [2, 4, 6, 8, 10, 12, 14, 16]
Sequential execution time: 8.01 seconds
Parallel execution time: 2.05 seconds
Speedup: 3.91x
This example demonstrates:
- How to use Pool to distribute the workload across multiple worker processes
- The performance advantage of parallel execution (the speedup depends on your number of CPU cores)
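Pool.map passes exactly one argument to the worker function. If your function takes several arguments, Pool.starmap unpacks each tuple of arguments for you. A minimal sketch (the scale function and its arguments are hypothetical, not part of the example above):
import multiprocessing

def scale(item, factor):
    # A worker that needs two arguments
    return item * factor

if __name__ == "__main__":
    args = [(1, 10), (2, 10), (3, 10), (4, 10)]
    with multiprocessing.Pool() as pool:
        # starmap unpacks each tuple into scale(item, factor)
        results = pool.starmap(scale, args)
    print(results)  # [10, 20, 30, 40]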
Sharing Data Between Processes
Unlike threads, processes don't share memory by default. Python's multiprocessing module provides several ways to share data:
Using Queue
import multiprocessing
def producer(queue):
    for i in range(5):
        queue.put(i)
        print(f"Produced: {i}")
    # Signal the end of data
    queue.put(None)

def consumer(queue):
    while True:
        data = queue.get()
        if data is None:  # Check for the end signal
            break
        print(f"Consumed: {data}")

if __name__ == "__main__":
    # Create a shared queue
    queue = multiprocessing.Queue()

    # Create processes
    p1 = multiprocessing.Process(target=producer, args=(queue,))
    p2 = multiprocessing.Process(target=consumer, args=(queue,))

    # Start processes
    p1.start()
    p2.start()

    # Wait for processes to finish
    p1.join()
    p2.join()
Output:
Produced: 0
Produced: 1
Produced: 2
Produced: 3
Produced: 4
Consumed: 0
Consumed: 1
Consumed: 2
Consumed: 3
Consumed: 4
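The same pattern scales to several consumers. The usual trick (sketched below, not part of the original example) is to put one sentinel per consumer on the queue so each consumer gets its own shutdown signal:
import multiprocessing

NUM_CONSUMERS = 3

def producer(queue):
    for i in range(10):
        queue.put(i)
    # One end-of-data signal per consumer
    for _ in range(NUM_CONSUMERS):
        queue.put(None)

def consumer(queue, name):
    while True:
        item = queue.get()
        if item is None:
            break
        print(f"{name} consumed {item}")

if __name__ == "__main__":
    queue = multiprocessing.Queue()
    consumers = [
        multiprocessing.Process(target=consumer, args=(queue, f"consumer-{i}"))
        for i in range(NUM_CONSUMERS)
    ]
    for c in consumers:
        c.start()
    producer_process = multiprocessing.Process(target=producer, args=(queue,))
    producer_process.start()
    producer_process.join()
    for c in consumers:
        c.join()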
Using Value and Array for Shared Memory
import multiprocessing
import time
def increment_counter(counter):
    for _ in range(100):
        counter.value += 1
        time.sleep(0.01)

if __name__ == "__main__":
    # Create a shared counter
    counter = multiprocessing.Value('i', 0)  # 'i' indicates integer type

    # Create processes
    processes = [
        multiprocessing.Process(target=increment_counter, args=(counter,))
        for _ in range(4)
    ]

    # Start processes
    for p in processes:
        p.start()

    # Wait for processes to finish
    for p in processes:
        p.join()

    print(f"Final counter value: {counter.value}")
Output:
Final counter value: 400
In this example:
- We create a shared Value object that can be accessed by multiple processes
- Each process increments this counter 100 times
- The final value shows that all processes updated the same shared object
One caveat: counter.value += 1 is not an atomic operation, so this example only avoids lost updates because the sleep between increments makes collisions unlikely. For reliable results, guard the update with a lock, as shown in the next section. The related Array type works the same way for fixed-size sequences, as the sketch below shows.
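Here is a minimal Array sketch (the slicing scheme is just for illustration) in which two processes each square their own half of a shared integer array in place:
import multiprocessing

def square_slice(shared, start, end):
    # Each worker updates only its own slice, so no lock is needed here
    for i in range(start, end):
        shared[i] = shared[i] * shared[i]

if __name__ == "__main__":
    shared = multiprocessing.Array('i', range(8))  # shared integers 0..7
    mid = len(shared) // 2

    p1 = multiprocessing.Process(target=square_slice, args=(shared, 0, mid))
    p2 = multiprocessing.Process(target=square_slice, args=(shared, mid, len(shared)))
    p1.start()
    p2.start()
    p1.join()
    p2.join()

    print(shared[:])  # [0, 1, 4, 9, 16, 25, 36, 49]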
Process Synchronization
Just like with threading, process synchronization is important to prevent race conditions:
import multiprocessing
import time
def deposit(balance, lock, amount):
    for _ in range(100):
        with lock:
            balance.value += amount
        time.sleep(0.001)

def withdraw(balance, lock, amount):
    for _ in range(100):
        with lock:
            balance.value -= amount
        time.sleep(0.001)

if __name__ == "__main__":
    balance = multiprocessing.Value('d', 1000.0)  # 'd' indicates double/float
    lock = multiprocessing.Lock()

    # Create processes
    p1 = multiprocessing.Process(target=deposit, args=(balance, lock, 10))
    p2 = multiprocessing.Process(target=withdraw, args=(balance, lock, 10))

    # Start processes
    p1.start()
    p2.start()

    # Wait for processes to finish
    p1.join()
    p2.join()

    print(f"Final balance: ${balance.value:.2f}")
Output:
Final balance: $1000.00
This example shows how to use a Lock to ensure that only one process can modify the shared balance at a time.
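As a side note, a Value created with the default settings already carries its own lock, so for the earlier counter example you could use counter.get_lock() instead of a separate Lock object. A small sketch under that assumption:
import multiprocessing

def increment(counter):
    for _ in range(1000):
        # Use the lock that comes with the Value itself
        with counter.get_lock():
            counter.value += 1

if __name__ == "__main__":
    counter = multiprocessing.Value('i', 0)
    processes = [
        multiprocessing.Process(target=increment, args=(counter,))
        for _ in range(4)
    ]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    print(counter.value)  # 4000, with no explicit Lock object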
Practical Example: Image Processing
Let's look at a real-world application of multiprocessing: image processing. In this example, we'll apply a simple blur effect to multiple images in parallel.
from PIL import Image, ImageFilter
import multiprocessing
import time
import os
def blur_image(image_path):
    try:
        img_name = os.path.basename(image_path)
        print(f"Processing {img_name}...")

        # Open image
        img = Image.open(image_path)

        # Apply blur
        blurred = img.filter(ImageFilter.GaussianBlur(radius=10))

        # Create output filename
        output_path = f"blurred_{img_name}"

        # Save blurred image
        blurred.save(output_path)

        return f"{img_name} processed successfully"
    except Exception as e:
        return f"Error processing {image_path}: {str(e)}"

if __name__ == "__main__":
    # List of image paths (replace with your actual image paths)
    image_paths = [
        "image1.jpg",
        "image2.jpg",
        "image3.jpg",
        "image4.jpg"
    ]

    # Process sequentially
    start_time = time.time()
    for path in image_paths:
        blur_image(path)
    sequential_time = time.time() - start_time

    # Process in parallel
    start_time = time.time()
    with multiprocessing.Pool() as pool:
        results = pool.map(blur_image, image_paths)
    parallel_time = time.time() - start_time

    print(f"Sequential processing time: {sequential_time:.2f} seconds")
    print(f"Parallel processing time: {parallel_time:.2f} seconds")
    print(f"Speedup: {sequential_time / parallel_time:.2f}x")
This example would need actual image files to run, but it demonstrates how to use multiprocessing to speed up image processing tasks.
Advanced Features
Process Pools with Different Methods
The Pool class offers several methods for different parallel processing needs:
import multiprocessing
import time
def process_item(item):
    time.sleep(1)  # Simulate processing time
    return item * 2

if __name__ == "__main__":
    with multiprocessing.Pool(processes=4) as pool:
        # map - process items in order
        results = pool.map(process_item, [1, 2, 3, 4])
        print(f"map results: {results}")

        # apply_async - process a single item asynchronously
        result = pool.apply_async(process_item, (10,))
        print(f"apply_async result: {result.get()}")

        # map_async - process items asynchronously
        result = pool.map_async(process_item, [5, 6, 7, 8])
        print(f"map_async results: {result.get()}")

        # imap - returns an iterator over the results, in input order
        for result in pool.imap(process_item, [9, 10, 11]):
            print(f"imap result: {result}")
Output:
map results: [2, 4, 6, 8]
apply_async result: 20
map_async results: [10, 12, 14, 16]
imap result: 18
imap result: 20
imap result: 22
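Pool also provides imap_unordered, which yields each result as soon as a worker finishes rather than in input order; this is handy when items take uneven amounts of time. A small sketch (not part of the example above):
import multiprocessing
import random
import time

def slow_double(item):
    time.sleep(random.uniform(0.1, 1.0))  # simulate variable processing time
    return item * 2

if __name__ == "__main__":
    with multiprocessing.Pool(processes=4) as pool:
        # Results arrive in completion order, not submission order
        for result in pool.imap_unordered(slow_double, range(8)):
            print(f"imap_unordered result: {result}")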
Process Communication with Pipes
import multiprocessing
def sender(conn):
    conn.send("Hello from the sender process!")
    conn.close()

def receiver(conn):
    message = conn.recv()
    print(f"Received: {message}")
    conn.close()

if __name__ == "__main__":
    # Create a pipe
    parent_conn, child_conn = multiprocessing.Pipe()

    # Create processes
    p1 = multiprocessing.Process(target=sender, args=(child_conn,))
    p2 = multiprocessing.Process(target=receiver, args=(parent_conn,))

    # Start processes
    p1.start()
    p2.start()

    # Wait for processes to finish
    p1.join()
    p2.join()
Output:
Received: Hello from the sender process!
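A Pipe is two-way by default, so the same pair of connections can carry a request and its reply. A minimal request/response sketch (not from the original example):
import multiprocessing

def worker(conn):
    request = conn.recv()           # receive a request from the parent
    conn.send(request.upper())      # reply on the same connection
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = multiprocessing.Pipe()  # duplex by default
    p = multiprocessing.Process(target=worker, args=(child_conn,))
    p.start()
    parent_conn.send("ping")
    print(parent_conn.recv())       # PING
    p.join()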
Best Practices and Considerations
- Memory Usage: Each process has its own memory space, which increases the program's overall memory footprint.
- Process Creation Overhead: Creating processes is more expensive than creating threads.
- Data Serialization: Data passed between processes must be picklable, since it is serialized on its way to the other process.
- Number of Processes: A common practice is to match the number of processes to the number of CPU cores:
import multiprocessing

# Get the number of available CPU cores
num_cores = multiprocessing.cpu_count()
print(f"Number of CPU cores: {num_cores}")

# Create a pool with one worker per core
with multiprocessing.Pool(processes=num_cores) as pool:
    # Your parallel code here
    pass
- Process Termination: Always join your processes (or terminate them if they hang) so they don't linger as zombie processes; see the sketch below.
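For the termination point, here is a small sketch (the 2-second timeout is arbitrary) of waiting for a child with a timeout and cleaning it up if it has not finished:
import multiprocessing
import time

def long_task():
    time.sleep(60)  # stands in for work that may hang or run too long

if __name__ == "__main__":
    p = multiprocessing.Process(target=long_task)
    p.start()
    p.join(timeout=2)      # wait up to 2 seconds
    if p.is_alive():
        p.terminate()      # forcefully stop the child
        p.join()           # reap it so it doesn't linger as a zombie
    print(f"Exit code: {p.exitcode}")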
Summary
Python's multiprocessing module is a powerful tool for parallel processing that allows you to:
- Create and manage multiple processes
- Distribute workloads across CPU cores using Pool
- Share data between processes using Queue, Value, and Array
- Synchronize processes using Lock and other primitives
- Significantly improve performance for CPU-bound tasks
By bypassing the Global Interpreter Lock (GIL), multiprocessing enables true parallel execution, making it the ideal choice when you need to leverage multiple CPU cores to speed up processing-intensive tasks.
Exercises
- Create a multiprocessing program that calculates the square of numbers from 1 to 1,000,000 and compare its performance with a sequential approach.
- Build a parallel web scraper that uses multiple processes to download and process multiple web pages simultaneously.
- Implement a producer-consumer pattern using multiprocessing to process items from a queue.
- Create an image processing application that applies different filters to images in parallel.
- Implement a parallel sorting algorithm using multiprocessing and compare its performance with Python's built-in sort.
Additional Resources
- Python Multiprocessing Official Documentation
- Python Concurrency: Getting Started With Concurrent Programming
- Speed Up Your Python Program With Concurrency
- Multiprocessing vs Threading in Python: What You Need to Know
Happy parallel processing!