Pandas Query Method
Introduction
When working with data in Pandas, selecting and filtering data is one of the most common operations. While there are multiple ways to filter data in Pandas (such as using boolean indexing or the .loc
accessor), the .query()
method provides a more intuitive and readable way to filter DataFrames using string expressions.
The query()
method allows you to write filtering conditions as strings, making your code more concise and often easier to understand, especially for complex filtering operations. This method was introduced in pandas version 0.13.0 and has become a popular tool for data selection.
Basic Syntax
The basic syntax of the query()
method is:
df.query(expr, inplace=False, **kwargs)
Where:
expr
: A string expression representing the filtering conditioninplace
: If True, the filtering operation is performed on the original DataFrame; otherwise, a new filtered DataFrame is returned**kwargs
: Additional parameters passed to the underlying evaluation engine
Simple Examples
Let's start with some basic examples to understand how .query()
works. First, we'll create a sample DataFrame:
import pandas as pd
import numpy as np
# Create a sample DataFrame
data = {
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'age': [25, 30, 35, 40, 45],
'salary': [50000, 60000, 75000, 90000, 110000],
'department': ['HR', 'IT', 'Finance', 'IT', 'Marketing']
}
df = pd.DataFrame(data)
print(df)
Output:
name age salary department
0 Alice 25 50000 HR
1 Bob 30 60000 IT
2 Charlie 35 75000 Finance
3 David 40 90000 IT
4 Eva 45 110000 Marketing
Simple Filtering
Now let's filter the DataFrame to show only people older than 30:
# Using query method
result = df.query('age > 30')
print(result)
Output:
name age salary department
2 Charlie 35 75000 Finance
3 David 40 90000 IT
4 Eva 45 110000 Marketing
Multiple Conditions
You can combine multiple conditions using logical operators like &
(and) and |
(or):
# Employees in IT department with salary greater than 70000
result = df.query('department == "IT" & salary > 70000')
print(result)
Output:
name age salary department
3 David 40 90000 IT
Comparing with Traditional Methods
Let's compare the query()
method with traditional filtering methods:
# Using query method
query_result = df.query('age > 30 & salary < 100000')
# Equivalent using boolean indexing
boolean_result = df[(df['age'] > 30) & (df['salary'] < 100000)]
print("Using query():")
print(query_result)
print("\nUsing boolean indexing:")
print(boolean_result)
Output:
Using query():
name age salary department
2 Charlie 35 75000 Finance
3 David 40 90000 IT
Using boolean indexing:
name age salary department
2 Charlie 35 75000 Finance
3 David 40 90000 IT
Both methods produce the same result, but the query()
method is generally more readable, especially for complex filtering conditions.
Advanced Features of query()
Using Local Variables in Queries
You can use variables from your local Python environment within your queries by prefixing them with @
:
min_age = 30
max_salary = 90000
# Filter using local variables
result = df.query('age > @min_age & salary <= @max_salary')
print(result)
Output:
name age salary department
2 Charlie 35 75000 Finance
3 David 40 90000 IT
Working with Column Names Containing Spaces
If your column names contain spaces or special characters, you can use backticks (`
) to reference them:
# Create a DataFrame with a column containing spaces
data2 = {
'employee name': ['Alice', 'Bob', 'Charlie'],
'yearly salary': [50000, 60000, 75000]
}
df2 = pd.DataFrame(data2)
# Query with column containing spaces
result = df2.query('`employee name` == "Bob"')
print(result)
Output:
employee name yearly salary
1 Bob 60000
Using in and not in Operators
You can use in
and not in
operators to check if values are in a set:
# Filter for specific departments
departments = ['IT', 'HR']
result = df.query('department in @departments')
print(result)
Output:
name age salary department
0 Alice 25 50000 HR
1 Bob 30 60000 IT
3 David 40 90000 IT
String Methods in Queries
You can use string methods like startswith
, endswith
, and contains
in your queries:
# Names starting with specific letters
result = df.query('name.str.startswith("A") | name.str.startswith("D")')
print(result)
Output:
name age salary department
0 Alice 25 50000 HR
3 David 40 90000 IT
Real-world Example: Data Analysis Workflow
Let's demonstrate how query()
can be used in a real-world data analysis workflow. We'll use a larger dataset and perform multiple filtering operations:
# Create a more realistic dataset
import numpy as np
np.random.seed(0)
# Generate sample employee data
n = 1000
departments = ['Sales', 'Marketing', 'IT', 'HR', 'Finance', 'Operations']
locations = ['New York', 'San Francisco', 'Chicago', 'Boston', 'Seattle']
employees = pd.DataFrame({
'employee_id': range(1, n+1),
'age': np.random.randint(22, 65, size=n),
'salary': np.random.randint(30000, 150000, size=n),
'years_of_experience': np.random.randint(0, 30, size=n),
'department': np.random.choice(departments, size=n),
'location': np.random.choice(locations, size=n),
'performance_score': np.random.randint(1, 6, size=n)
})
# Add a derived column
employees['salary_per_year_experience'] = employees['salary'] / (employees['years_of_experience'] + 1)
# Display sample
print(employees.head())
Now let's perform some analysis using query()
:
# 1. Find high-performing IT employees in specific locations
high_performers = employees.query('department == "IT" & performance_score >= 4 & location in ["New York", "San Francisco"]')
print(f"Number of high-performing IT employees in NY or SF: {len(high_performers)}")
# 2. Find employees with high salary relative to their experience
experienced_well_paid = employees.query('salary > 100000 & years_of_experience < 10')
print(f"Number of well-paid employees with less than 10 years experience: {len(experienced_well_paid)}")
# 3. Find young employees with exceptional salary per year of experience
young_exceptional = employees.query('age < 30 & salary_per_year_experience > 20000')
print(f"Number of young employees with exceptional salary/experience ratio: {len(young_exceptional)}")
# 4. Complex filtering for talent management
talent_pool = employees.query('(performance_score >= 4 & years_of_experience >= 5) | '
'(salary < 50000 & performance_score >= 4 & age < 30)')
print(f"Size of identified talent pool: {len(talent_pool)}")
# Display sample of talent pool
print("\nSample of the talent pool:")
print(talent_pool.sample(5))
This demonstrates how query()
can be used to perform complex filtering operations in a data analysis workflow, making your code more readable and maintainable.
Performance Considerations
While query()
offers improved readability, it's worth noting that it may not always be the most performant option for very large datasets. For simple filtering operations, traditional boolean indexing might be slightly faster.
However, the query method becomes increasingly valuable when:
- Your filtering conditions are complex
- You need to use variables from the local namespace
- Code readability is a priority
Summary
The Pandas query()
method provides a powerful and intuitive way to filter DataFrames using string expressions. Its key advantages include:
- Readability: Conditions expressed as strings are often easier to read and understand
- Conciseness: Complex filtering operations can be written with less code
- Integration with local variables: Using the
@
prefix allows you to incorporate Python variables into your queries - Support for column names with spaces: Using backticks allows querying on columns with special characters
By mastering the query()
method, you'll be able to write more elegant and maintainable code for data selection tasks in Pandas.
Additional Resources
Exercises
-
Create a DataFrame with sales data (product, category, price, quantity) and use the
query()
method to find products with high sales volume (quantity > 100) and price less than $50. -
Use the
query()
method with local variables to filter a DataFrame based on user inputs (e.g., minimum and maximum values for numerical columns). -
Create a complex query that combines string operations (like
startswith
orcontains
) with numerical comparisons, and compare its readability to the equivalent boolean indexing approach. -
Use the
query()
method to filter a time-series DataFrame to show only entries that fall between two dates provided as variables.
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)