Skip to main content

Pandas Query Method

Introduction

When working with data in Pandas, selecting and filtering data is one of the most common operations. While there are multiple ways to filter data in Pandas (such as using boolean indexing or the .loc accessor), the .query() method provides a more intuitive and readable way to filter DataFrames using string expressions.

The query() method allows you to write filtering conditions as strings, making your code more concise and often easier to understand, especially for complex filtering operations. This method was introduced in pandas version 0.13.0 and has become a popular tool for data selection.

Basic Syntax

The basic syntax of the query() method is:

python
df.query(expr, inplace=False, **kwargs)

Where:

  • expr: A string expression representing the filtering condition
  • inplace: If True, the filtering operation is performed on the original DataFrame; otherwise, a new filtered DataFrame is returned
  • **kwargs: Additional parameters passed to the underlying evaluation engine

Simple Examples

Let's start with some basic examples to understand how .query() works. First, we'll create a sample DataFrame:

python
import pandas as pd
import numpy as np

# Create a sample DataFrame
data = {
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'age': [25, 30, 35, 40, 45],
'salary': [50000, 60000, 75000, 90000, 110000],
'department': ['HR', 'IT', 'Finance', 'IT', 'Marketing']
}

df = pd.DataFrame(data)
print(df)

Output:

      name  age  salary department
0 Alice 25 50000 HR
1 Bob 30 60000 IT
2 Charlie 35 75000 Finance
3 David 40 90000 IT
4 Eva 45 110000 Marketing

Simple Filtering

Now let's filter the DataFrame to show only people older than 30:

python
# Using query method
result = df.query('age > 30')
print(result)

Output:

      name  age  salary department
2 Charlie 35 75000 Finance
3 David 40 90000 IT
4 Eva 45 110000 Marketing

Multiple Conditions

You can combine multiple conditions using logical operators like & (and) and | (or):

python
# Employees in IT department with salary greater than 70000
result = df.query('department == "IT" & salary > 70000')
print(result)

Output:

    name  age  salary department
3 David 40 90000 IT

Comparing with Traditional Methods

Let's compare the query() method with traditional filtering methods:

python
# Using query method
query_result = df.query('age > 30 & salary < 100000')

# Equivalent using boolean indexing
boolean_result = df[(df['age'] > 30) & (df['salary'] < 100000)]

print("Using query():")
print(query_result)
print("\nUsing boolean indexing:")
print(boolean_result)

Output:

Using query():
name age salary department
2 Charlie 35 75000 Finance
3 David 40 90000 IT

Using boolean indexing:
name age salary department
2 Charlie 35 75000 Finance
3 David 40 90000 IT

Both methods produce the same result, but the query() method is generally more readable, especially for complex filtering conditions.

Advanced Features of query()

Using Local Variables in Queries

You can use variables from your local Python environment within your queries by prefixing them with @:

python
min_age = 30
max_salary = 90000

# Filter using local variables
result = df.query('age > @min_age & salary <= @max_salary')
print(result)

Output:

      name  age  salary department
2 Charlie 35 75000 Finance
3 David 40 90000 IT

Working with Column Names Containing Spaces

If your column names contain spaces or special characters, you can use backticks (`) to reference them:

python
# Create a DataFrame with a column containing spaces
data2 = {
'employee name': ['Alice', 'Bob', 'Charlie'],
'yearly salary': [50000, 60000, 75000]
}
df2 = pd.DataFrame(data2)

# Query with column containing spaces
result = df2.query('`employee name` == "Bob"')
print(result)

Output:

  employee name  yearly salary
1 Bob 60000

Using in and not in Operators

You can use in and not in operators to check if values are in a set:

python
# Filter for specific departments
departments = ['IT', 'HR']
result = df.query('department in @departments')
print(result)

Output:

    name  age  salary department
0 Alice 25 50000 HR
1 Bob 30 60000 IT
3 David 40 90000 IT

String Methods in Queries

You can use string methods like startswith, endswith, and contains in your queries:

python
# Names starting with specific letters
result = df.query('name.str.startswith("A") | name.str.startswith("D")')
print(result)

Output:

    name  age  salary department
0 Alice 25 50000 HR
3 David 40 90000 IT

Real-world Example: Data Analysis Workflow

Let's demonstrate how query() can be used in a real-world data analysis workflow. We'll use a larger dataset and perform multiple filtering operations:

python
# Create a more realistic dataset
import numpy as np
np.random.seed(0)

# Generate sample employee data
n = 1000
departments = ['Sales', 'Marketing', 'IT', 'HR', 'Finance', 'Operations']
locations = ['New York', 'San Francisco', 'Chicago', 'Boston', 'Seattle']

employees = pd.DataFrame({
'employee_id': range(1, n+1),
'age': np.random.randint(22, 65, size=n),
'salary': np.random.randint(30000, 150000, size=n),
'years_of_experience': np.random.randint(0, 30, size=n),
'department': np.random.choice(departments, size=n),
'location': np.random.choice(locations, size=n),
'performance_score': np.random.randint(1, 6, size=n)
})

# Add a derived column
employees['salary_per_year_experience'] = employees['salary'] / (employees['years_of_experience'] + 1)

# Display sample
print(employees.head())

Now let's perform some analysis using query():

python
# 1. Find high-performing IT employees in specific locations
high_performers = employees.query('department == "IT" & performance_score >= 4 & location in ["New York", "San Francisco"]')
print(f"Number of high-performing IT employees in NY or SF: {len(high_performers)}")

# 2. Find employees with high salary relative to their experience
experienced_well_paid = employees.query('salary > 100000 & years_of_experience < 10')
print(f"Number of well-paid employees with less than 10 years experience: {len(experienced_well_paid)}")

# 3. Find young employees with exceptional salary per year of experience
young_exceptional = employees.query('age < 30 & salary_per_year_experience > 20000')
print(f"Number of young employees with exceptional salary/experience ratio: {len(young_exceptional)}")

# 4. Complex filtering for talent management
talent_pool = employees.query('(performance_score >= 4 & years_of_experience >= 5) | '
'(salary < 50000 & performance_score >= 4 & age < 30)')
print(f"Size of identified talent pool: {len(talent_pool)}")

# Display sample of talent pool
print("\nSample of the talent pool:")
print(talent_pool.sample(5))

This demonstrates how query() can be used to perform complex filtering operations in a data analysis workflow, making your code more readable and maintainable.

Performance Considerations

While query() offers improved readability, it's worth noting that it may not always be the most performant option for very large datasets. For simple filtering operations, traditional boolean indexing might be slightly faster.

However, the query method becomes increasingly valuable when:

  1. Your filtering conditions are complex
  2. You need to use variables from the local namespace
  3. Code readability is a priority

Summary

The Pandas query() method provides a powerful and intuitive way to filter DataFrames using string expressions. Its key advantages include:

  • Readability: Conditions expressed as strings are often easier to read and understand
  • Conciseness: Complex filtering operations can be written with less code
  • Integration with local variables: Using the @ prefix allows you to incorporate Python variables into your queries
  • Support for column names with spaces: Using backticks allows querying on columns with special characters

By mastering the query() method, you'll be able to write more elegant and maintainable code for data selection tasks in Pandas.

Additional Resources

Exercises

  1. Create a DataFrame with sales data (product, category, price, quantity) and use the query() method to find products with high sales volume (quantity > 100) and price less than $50.

  2. Use the query() method with local variables to filter a DataFrame based on user inputs (e.g., minimum and maximum values for numerical columns).

  3. Create a complex query that combines string operations (like startswith or contains) with numerical comparisons, and compare its readability to the equivalent boolean indexing approach.

  4. Use the query() method to filter a time-series DataFrame to show only entries that fall between two dates provided as variables.



If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)