Python Yield Statement
Introduction
When working with large datasets or sequences in Python, memory efficiency becomes crucial. The yield statement offers an elegant solution by enabling you to create generators - special iterators that produce values on the fly instead of storing them all in memory at once.
In this tutorial, you'll learn how the yield statement works, how it differs from regular return statements, and how to leverage generators to write more memory-efficient and cleaner code.
What is the Yield Statement?
The yield statement is used within a function to turn it into a generator function. Unlike a regular function that returns a value and terminates, a generator function:
- Returns a generator object
- Pauses execution when it reaches a yield statement
- Saves its state (local variables, position in code)
- Resumes from where it left off when called again
This "pause and resume" behavior makes generators perfect for working with large sequences or infinite streams of data.
Basic Syntax of Yield
Here's the basic syntax of a generator function using yield:
def generator_function():
    # Some code
    yield value1
    # More code
    yield value2
    # And so on
Yield vs Return: Understanding the Difference
To understand what makes yield special, let's compare it with the regular return statement:
# Function with return
def return_numbers():
    numbers = []
    for i in range(1, 6):
        numbers.append(i)
    return numbers

# Function with yield
def yield_numbers():
    for i in range(1, 6):
        yield i

# Using return function
print("Using return:")
result = return_numbers()
print(result)  # All numbers at once

# Using yield function
print("\nUsing yield:")
gen = yield_numbers()
print(next(gen))  # One number at a time
print(next(gen))
print(next(gen))
Output:
Using return:
[1, 2, 3, 4, 5]
Using yield:
1
2
3
Key differences:
- The return function builds the entire list in memory before returning
- The yield function generates each value on demand
- The generator object maintains its state between calls to next() (see the note below)
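In practice, you rarely call next() by hand; a for loop (or a constructor like list()) drives the generator and handles its state for you. Here's a quick illustration using yield_numbers from above:

for n in yield_numbers():
    print(n, end=" ")  # Prints: 1 2 3 4 5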
How Generators Work Behind the Scenes
When you call a generator function, it doesn't execute the function body immediately. Instead, it returns a generator object that implements the iterator protocol:
def simple_generator():
    print("First yield")
    yield 1
    print("Second yield")
    yield 2
    print("Third yield")
    yield 3

# Create generator object
gen = simple_generator()
print(type(gen))

# Nothing is printed until we start iterating
print("\nStarting iteration:")
print(next(gen))  # Executes until first yield
print(next(gen))  # Continues from previous position
print(next(gen))  # Continues again
Output:
<class 'generator'>
Starting iteration:
First yield
1
Second yield
2
Third yield
3
If you call next(gen) one more time, you'll get a StopIteration exception, which is how Python signals the end of an iterator.
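Here's a minimal sketch of that behavior, continuing from the exhausted gen above; note that a for loop catches StopIteration for you automatically:

try:
    next(gen)  # gen is exhausted at this point
except StopIteration:
    print("Generator exhausted")

# A for loop handles StopIteration behind the scenes:
for value in simple_generator():
    print(value)  # Prints 1, 2, 3 (interleaved with the "... yield" messages)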
Common Use Cases for Yield
1. Processing Large Files
Generators are perfect for reading large files line by line without loading the entire file into memory:
def read_large_file(file_path):
    with open(file_path, 'r') as file:
        for line in file:
            yield line.strip()

# Usage example
def count_lines(file_path):
    count = 0
    for line in read_large_file(file_path):
        count += 1
    return count

# This processes a large file efficiently without loading it all into memory
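To try this without a real large file, here's a small self-contained demo; the temporary-file setup is just an assumption for illustration:

import os
import tempfile

# Write a tiny sample file (an illustrative stand-in for a large file)
with tempfile.NamedTemporaryFile('w', delete=False, suffix='.txt') as tmp:
    tmp.write("first line\nsecond line\nthird line\n")
    sample_path = tmp.name

print(count_lines(sample_path))  # 3
os.remove(sample_path)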
2. Infinite Sequences
You can create infinite sequences that would be impossible to store in memory:
def fibonacci():
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

# Get first 10 fibonacci numbers
fib = fibonacci()
for _ in range(10):
    print(next(fib), end=" ")
Output:
0 1 1 2 3 5 8 13 21 34
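Because an infinite generator never stops on its own, it's often paired with a bounded consumer. One common pattern, shown here as a sketch, uses itertools.islice to take a fixed-size slice of the stream:

from itertools import islice

# Take the first 10 values of the infinite stream as a list
first_ten = list(islice(fibonacci(), 10))
print(first_ten)  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]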
3. Pipelining Data Processing
Generators can be used to create data processing pipelines:
def read_data(file_path):
    with open(file_path, 'r') as f:
        for line in f:
            yield line.strip()

def parse_data(lines):
    for line in lines:
        # Assume comma-separated values
        yield line.split(',')

def filter_data(records):
    for record in records:
        if len(record) >= 2 and record[1].isdigit() and int(record[1]) > 30:
            yield record

# Usage example (assuming you have a data.csv file)
# pipeline = filter_data(parse_data(read_data('data.csv')))
# for record in pipeline:
#     print(record)
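If you'd like to run the pipeline without creating data.csv, you can feed it any iterable of strings in place of read_data; the sample records below are made up for illustration:

# Hypothetical in-memory stand-in for read_data('data.csv')
sample_lines = ["alice,34", "bob,28", "carol,45"]
for record in filter_data(parse_data(sample_lines)):
    print(record)  # ['alice', '34'] then ['carol', '45']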
Generator Expressions
Just as list comprehensions offer a concise way to build lists, Python provides a concise way to create generators: generator expressions.
# List comprehension (creates entire list in memory)
numbers_list = [x*x for x in range(1000000)]

# Generator expression (creates generator object)
numbers_gen = (x*x for x in range(1000000))

# Compare memory usage
import sys
print(f"List size: {sys.getsizeof(numbers_list)} bytes")
print(f"Generator size: {sys.getsizeof(numbers_gen)} bytes")

# We can iterate over the generator
for i, num in enumerate(numbers_gen):
    if i < 5:
        print(num, end=" ")
    else:
        break
Output (sizes may vary):
List size: 8448728 bytes
Generator size: 112 bytes
0 1 4 9 16
The memory difference is significant!
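One handy detail: when a generator expression is the sole argument to a function call, you can drop the extra parentheses:

# Sum of squares without ever building the full list in memory
total = sum(x*x for x in range(1000000))
print(total)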
Advanced Generator Features
Sending Values to Generators with .send()
Generators can receive values from outside using the .send() method:
def echo_generator():
    while True:
        received = yield
        print(f"Received: {received}")

g = echo_generator()
next(g)  # Prime the generator
g.send("Hello")
g.send("World")
Output:
Received: Hello
Received: World
Two-way Communication
You can both receive and send values with generators:
def compute_average():
    count = 0
    total = 0
    average = 0
    while True:
        # Yield current average, then receive next value
        value = yield average
        if value is not None:
            count += 1
            total += value
            average = total / count

# Use the generator
avg_gen = compute_average()
next(avg_gen)  # Start the generator (returns 0)
print(avg_gen.send(10))  # Send 10, get average
print(avg_gen.send(20))  # Send 20, get new average
print(avg_gen.send(30))  # Send 30, get new average
Output:
10.0
15.0
20.0
Real-World Example: Batch Processing
Here's a practical example of using generators for batch processing:
def data_source(items):
    """Simulate fetching data from a source"""
    for item in items:
        yield item

def process_batches(data, batch_size=3):
    """Process data in batches"""
    batch = []
    for item in data:
        batch.append(item)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    # Don't forget the last incomplete batch
    if batch:
        yield batch

# Sample data
all_data = range(1, 11)  # Numbers 1-10

# Create processing pipeline
source = data_source(all_data)
batches = process_batches(source, batch_size=3)

# Process each batch
for i, batch in enumerate(batches, 1):
    print(f"Processing batch {i}: {batch}")
    # Do something with the batch
Output:
Processing batch 1: [1, 2, 3]
Processing batch 2: [4, 5, 6]
Processing batch 3: [7, 8, 9]
Processing batch 4: [10]
This approach is memory-efficient even with very large datasets, as it only keeps one batch in memory at a time.
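As an aside, if you're on Python 3.12 or newer, the standard library ships itertools.batched, which implements the same idea (it yields tuples rather than lists):

from itertools import batched  # Available in Python 3.12+

for i, batch in enumerate(batched(range(1, 11), 3), 1):
    print(f"Processing batch {i}: {list(batch)}")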
Best Practices for Using Yield
- Use generators for large sequences: When dealing with large amounts of data, generators help reduce memory usage
- Use generators for calculated sequences: When each item requires calculation, generators compute values on demand
- Use generators for infinite sequences: For potentially infinite streams of data (like monitoring systems)
- Keep generator functions focused: Each generator should have a single responsibility
- Consider using generator expressions for simple cases where a full function isn't needed
Common Pitfalls to Avoid
- Trying to reuse generators: Once a generator is exhausted, you need to recreate it to use it again (see the sketch below)
- Accessing generator values by index: Generators don't support indexing - you have to iterate through them
- Forgetting that generators are single-use: You can't reset or rewind a generator to the beginning
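A quick sketch of the first two pitfalls:

gen = (x for x in range(3))
print(list(gen))  # [0, 1, 2]
print(list(gen))  # [] - already exhausted, so nothing is produced
# gen[0] would raise TypeError: 'generator' object is not subscriptable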
Summary
The yield statement is a powerful tool in Python that enables you to:
- Create generator functions that produce values on-demand
- Process large datasets efficiently with minimal memory usage
- Build data pipelines that process information incrementally
- Create infinite sequences that would be impossible with regular collections
Generators represent a fundamental shift in how data is processed: from "compute everything at once" to "compute only what you need, when you need it."
Exercises
- Write a generator function that produces the first n prime numbers
- Create a generator that reads a CSV file and yields each row as a dictionary
- Build a data processing pipeline using multiple generators that:
  - Reads numbers from a file
  - Filters out non-numeric values
  - Converts them to integers
  - Yields only the even numbers
- Implement a windowing generator that, given a list and window size, yields overlapping sublists of the specified window size
Additional Resources
- Python Documentation on Generators
- PEP 255 -- Simple Generators
- Book: "Fluent Python" by Luciano Ramalho (has excellent chapters on generators)
- Python Generator Tricks Documentation
Understanding generators and the yield statement opens up new possibilities for efficient data processing and can dramatically improve the performance and readability of your code when working with sequences and streams.