Python Collections Module
Introduction
The Python Collections module is a built-in module that implements specialized container data types providing alternatives to Python's general-purpose built-in containers such as lists, dictionaries, and tuples. If you've been working with Python's standard data structures, the Collections module offers more specialized, efficient tools that can make your code more readable and performant.
In this tutorial, we'll explore the most commonly used data structures from the Collections module:
Counter
: counts occurrences of elementsdefaultdict
: dictionary with default valuesnamedtuple
: tuple with named fieldsdeque
: double-ended queueOrderedDict
: dictionary that remembers insertion order
Let's dive in and see how these specialized data structures can simplify your code and make it more efficient.
Getting Started with Collections
To use any of the data structures from the Collections module, you first need to import them:
from collections import Counter, defaultdict, namedtuple, deque, OrderedDict
You can also import the entire module and access the classes using dot notation:
import collections
my_counter = collections.Counter()
Counter
What is Counter?
Counter
is a dict subclass designed for counting hashable objects. It's essentially a dictionary where elements are stored as keys and their counts as values.
Basic Usage
Let's see how to create and use a Counter:
from collections import Counter
# Create a Counter from a list
fruits = ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']
fruit_counter = Counter(fruits)
print(fruit_counter)
Output:
Counter({'apple': 3, 'banana': 2, 'orange': 1})
Common Methods
Counter objects have several useful methods:
# Most common elements
print(fruit_counter.most_common(2)) # Top 2 most common
# Total of all counts
print(sum(fruit_counter.values()))
# Convert to a list of elements
print(list(fruit_counter.elements()))
# Update with new elements
fruit_counter.update(['apple', 'mango', 'mango'])
print(fruit_counter)
# Subtract occurrences
fruit_counter.subtract(['apple', 'apple'])
print(fruit_counter)
Output:
[('apple', 3), ('banana', 2)]
6
['apple', 'apple', 'apple', 'banana', 'banana', 'orange']
Counter({'apple': 4, 'banana': 2, 'orange': 1, 'mango': 2})
Counter({'apple': 2, 'banana': 2, 'orange': 1, 'mango': 2})
Practical Example: Word Frequency Analysis
Let's count the most frequent words in a text:
from collections import Counter
import re
def count_word_frequency(text):
# Convert to lowercase and split into words
words = re.findall(r'\w+', text.lower())
# Count the words
word_counts = Counter(words)
# Return the 5 most common words
return word_counts.most_common(5)
sample_text = """
Python is an interpreted high-level general-purpose programming language.
Python's design philosophy emphasizes code readability with its notable use of significant indentation.
Python is dynamically-typed and garbage-collected.
"""
print(count_word_frequency(sample_text))
Output:
[('python', 3), ('is', 3), ('high', 1), ('level', 1), ('general', 1)]
defaultdict
What is defaultdict?
defaultdict
is a dictionary subclass that calls a factory function to supply missing values. This is extremely useful when working with data that might not have all possible keys defined.
Basic Usage
from collections import defaultdict
# Create a defaultdict with int as the default factory
fruit_count = defaultdict(int)
# Add items to the dictionary
fruits = ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']
for fruit in fruits:
fruit_count[fruit] += 1
print(dict(fruit_count)) # Convert to regular dict for display
Output:
{'apple': 3, 'banana': 2, 'orange': 1}
Notice how we didn't need to check if the key existed before incrementing its value. The defaultdict
took care of that for us!
Default Factory Functions
You can use any callable (function that returns a value) as a default factory:
# defaultdict with list as default factory
grouped_items = defaultdict(list)
# Group items by their first letter
items = ["apple", "banana", "cherry", "avocado", "blueberry"]
for item in items:
grouped_items[item[0]].append(item)
print(dict(grouped_items))
Output:
{'a': ['apple', 'avocado'], 'b': ['banana', 'blueberry'], 'c': ['cherry']}
Practical Example: Grouping Data
Let's see how defaultdict
can simplify grouping data by a category:
from collections import defaultdict
def group_students_by_grade(student_data):
# Create a defaultdict that will hold lists
graded_students = defaultdict(list)
for name, grade in student_data:
graded_students[grade].append(name)
return dict(graded_students) # Convert to regular dict for return
students = [
("Alice", "A"),
("Bob", "B"),
("Charlie", "A"),
("David", "C"),
("Eva", "B")
]
print(group_students_by_grade(students))
Output:
{'A': ['Alice', 'Charlie'], 'B': ['Bob', 'Eva'], 'C': ['David']}
namedtuple
What is namedtuple?
namedtuple
creates tuple subclasses with named fields. They're immutable like regular tuples, but you can access fields by name instead of just by index.
Basic Usage
from collections import namedtuple
# Create a namedtuple type called 'Point'
Point = namedtuple('Point', ['x', 'y'])
# Create a Point instance
p = Point(10, 20)
print(p.x, p.y) # Access by name
print(p[0], p[1]) # Access by index
print(p) # Print the tuple
Output:
10 20
10 20
Point(x=10, y=20)
Creating namedtuples
There are multiple ways to define the fields:
# Different ways to specify field names
Person1 = namedtuple('Person', ['name', 'age', 'city'])
Person2 = namedtuple('Person', 'name age city') # Space-separated string
Person3 = namedtuple('Person', 'name,age,city') # Comma-separated string
# Creating instances
john = Person1('John', 30, 'New York')
alice = Person2('Alice', 25, 'Boston')
print(john)
print(alice)
Output:
Person(name='John', age=30, city='New York')
Person(name='Alice', age=25, city='Boston')
Practical Example: Representing Data Records
namedtuple
is perfect for representing fixed data records:
from collections import namedtuple
from datetime import date
# Define a Book type
Book = namedtuple('Book', 'title author isbn publication_date price')
# Create book instances
books = [
Book('Python Crash Course', 'Eric Matthes', '1593279280',
date(2019, 5, 3), 39.99),
Book('Fluent Python', 'Luciano Ramalho', '1491946008',
date(2015, 8, 20), 49.99),
Book('Learning Python', 'Mark Lutz', '1449355730',
date(2013, 7, 6), 59.99)
]
# Using the named fields
for book in books:
print(f"{book.title} by {book.author}, published on {book.publication_date}")
# Find the most expensive book
most_expensive = max(books, key=lambda book: book.price)
print(f"Most expensive book: {most_expensive.title} (${most_expensive.price})")
Output:
Python Crash Course by Eric Matthes, published on 2019-05-03
Fluent Python by Luciano Ramalho, published on 2015-08-20
Learning Python by Mark Lutz, published on 2013-07-06
Most expensive book: Learning Python ($59.99)
deque
What is deque?
deque
(pronounced "deck") stands for "double-ended queue" and is designed to have fast appends and pops from both ends. Unlike lists, which are optimized for fast operations at the end, deques are optimized for operations at both the beginning and end.
Basic Usage
from collections import deque
# Create a deque
queue = deque(['a', 'b', 'c'])
print(queue)
# Add to right side
queue.append('d')
print(queue)
# Add to left side
queue.appendleft('e')
print(queue)
# Remove from right
right_item = queue.pop()
print(f"Popped from right: {right_item}")
print(queue)
# Remove from left
left_item = queue.popleft()
print(f"Popped from left: {left_item}")
print(queue)
Output:
deque(['a', 'b', 'c'])
deque(['a', 'b', 'c', 'd'])
deque(['e', 'a', 'b', 'c', 'd'])
Popped from right: d
deque(['e', 'a', 'b', 'c'])
Popped from left: e
deque(['a', 'b', 'c'])
Additional Methods
deque
provides many other useful methods:
# Create a new deque
d = deque([1, 2, 3, 4, 5])
# Rotate the deque (positive is right, negative is left)
d.rotate(1) # Rotate right by 1
print(f"After rotating right: {d}")
d.rotate(-2) # Rotate left by 2
print(f"After rotating left: {d}")
# Add multiple elements at once
d.extend([6, 7, 8]) # Add to right
print(f"After extend: {d}")
d.extendleft([0, -1, -2]) # Add to left (note the order gets reversed)
print(f"After extendleft: {d}")
# Limit the size of the deque
limited_deque = deque(maxlen=3)
for i in range(5):
limited_deque.append(i)
print(f"Added {i}: {limited_deque}")
Output:
After rotating right: deque([5, 1, 2, 3, 4])
After rotating left: deque([2, 3, 4, 5, 1])
After extend: deque([2, 3, 4, 5, 1, 6, 7, 8])
After extendleft: deque([-2, -1, 0, 2, 3, 4, 5, 1, 6, 7, 8])
Added 0: deque([0])
Added 1: deque([0, 1])
Added 2: deque([0, 1, 2])
Added 3: deque([1, 2, 3])
Added 4: deque([2, 3, 4])
Practical Example: Recent Items History
A common use case for deque
is maintaining a fixed-size list of "recent" items:
from collections import deque
class BrowsingHistory:
def __init__(self, max_history=5):
self.history = deque(maxlen=max_history)
def visit_page(self, page):
self.history.append(page)
def get_history(self):
return list(self.history)
def go_back(self):
if len(self.history) > 1:
# Remove the current page
self.history.pop()
# Return the previous page (now the current one)
return self.history[-1]
else:
return None
# Create a browsing history with max 3 items
browser = BrowsingHistory(max_history=3)
# Visit some pages
pages = ["Home", "Products", "About", "Contact"]
for page in pages:
browser.visit_page(page)
print(f"Visited: {page}, History: {browser.get_history()}")
# Go back in history
previous_page = browser.go_back()
print(f"Went back to: {previous_page}, History: {browser.get_history()}")
Output:
Visited: Home, History: ['Home']
Visited: Products, History: ['Home', 'Products']
Visited: About, History: ['Home', 'Products', 'About']
Visited: Contact, History: ['Products', 'About', 'Contact']
Went back to: About, History: ['Products', 'About']
OrderedDict
What is OrderedDict?
OrderedDict
is a dictionary subclass that remembers the order in which keys were inserted. While regular dictionaries in Python 3.7+ also maintain insertion order, OrderedDict
provides additional behaviors like the move_to_end
method.
Basic Usage
from collections import OrderedDict
# Create an OrderedDict
colors = OrderedDict()
colors['red'] = '#FF0000'
colors['green'] = '#00FF00'
colors['blue'] = '#0000FF'
print(colors)
# Regular dictionary also maintains order in Python 3.7+
regular_dict = {}
regular_dict['red'] = '#FF0000'
regular_dict['green'] = '#00FF00'
regular_dict['blue'] = '#0000FF'
print(regular_dict)
Output:
OrderedDict([('red', '#FF0000'), ('green', '#00FF00'), ('blue', '#0000FF')])
{'red': '#FF0000', 'green': '#00FF00', 'blue': '#0000FF'}
Special Methods
OrderedDict
provides some methods not available in regular dictionaries:
# Create an OrderedDict
colors = OrderedDict([
('red', '#FF0000'),
('green', '#00FF00'),
('blue', '#0000FF')
])
print("Original order:", list(colors.items()))
# Move 'red' to the end
colors.move_to_end('red')
print("After move_to_end('red'):", list(colors.items()))
# Move 'blue' to the beginning (last=False)
colors.move_to_end('blue', last=False)
print("After move_to_end('blue', last=False):", list(colors.items()))
# popitem (default is last=True which removes the last item)
last_item = colors.popitem()
print(f"Removed last item: {last_item}")
print("After popitem():", list(colors.items()))
# popitem with last=False (removes the first item)
first_item = colors.popitem(last=False)
print(f"Removed first item: {first_item}")
print("After popitem(last=False):", list(colors.items()))
Output:
Original order: [('red', '#FF0000'), ('green', '#00FF00'), ('blue', '#0000FF')]
After move_to_end('red'): [('green', '#00FF00'), ('blue', '#0000FF'), ('red', '#FF0000')]
After move_to_end('blue', last=False): [('blue', '#0000FF'), ('green', '#00FF00'), ('red', '#FF0000')]
Removed last item: ('red', '#FF0000')
After popitem(): [('blue', '#0000FF'), ('green', '#00FF00')]
Removed first item: ('blue', '#0000FF')
After popitem(last=False): [('green', '#00FF00')]
Practical Example: LRU Cache
One common use of OrderedDict
is implementing a Least Recently Used (LRU) cache:
from collections import OrderedDict
class LRUCache:
def __init__(self, capacity):
self.capacity = capacity
self.cache = OrderedDict()
def get(self, key):
if key not in self.cache:
return -1
# Move the accessed key to the end (most recently used)
self.cache.move_to_end(key)
return self.cache[key]
def put(self, key, value):
# If key exists, update it and move to the end
if key in self.cache:
self.cache[key] = value
self.cache.move_to_end(key)
return
# If cache is full, remove the least recently used item
if len(self.cache) >= self.capacity:
self.cache.popitem(last=False)
# Add the new key-value pair
self.cache[key] = value
def display(self):
return dict(self.cache)
# Create an LRU cache with capacity 3
cache = LRUCache(3)
# Add some values
cache.put("A", 1)
cache.put("B", 2)
cache.put("C", 3)
print(f"Cache after initial adds: {cache.display()}")
# Access a value (moves it to "most recently used")
print(f"Value for key 'A': {cache.get('A')}")
print(f"Cache after accessing 'A': {cache.display()}")
# Add a new value that will evict the least recently used (now 'B')
cache.put("D", 4)
print(f"Cache after adding 'D': {cache.display()}")
# Check that 'B' was evicted
print(f"Value for key 'B': {cache.get('B')}")
Output:
Cache after initial adds: {'A': 1, 'B': 2, 'C': 3}
Value for key 'A': 1
Cache after accessing 'A': {'B': 2, 'C': 3, 'A': 1}
Cache after adding 'D': {'C': 3, 'A': 1, 'D': 4}
Value for key 'B': -1
Summary
The Python Collections module provides specialized data structures that extend the functionality of Python's built-in containers:
- Counter: A dict subclass for counting hashable objects
- defaultdict: A dict subclass that calls a factory function to supply missing values
- namedtuple: Factory function for creating tuple subclasses with named fields
- deque: List-like container with fast appends and pops on both ends
- OrderedDict: Dict subclass that remembers the order entries were added
These specialized data structures can significantly simplify your code and improve its efficiency in specific use cases. By choosing the right container for your task, you can write cleaner, more intuitive, and more efficient Python code.
Additional Resources
To deepen your understanding of the Collections module:
Practice Exercises
- Use
Counter
to find the most common words in a text file. - Implement a multi-level dictionary using
defaultdict
. - Create a
namedtuple
to represent a student record with fields for name, ID, and grades. - Use a
deque
to implement a simple task scheduler that rotates through tasks. - Build a simple "most recently used" tracker using
OrderedDict
.
Happy coding with Python's Collections module!
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)