Pandas Label Selection
One of the most powerful features of pandas is the ability to select data by labels (names) rather than just numeric positions. This allows for more intuitive and readable code, especially when working with meaningful column and index names. In this tutorial, we'll explore how to select data using labels in pandas.
Introduction to Label Selection
Label-based selection in pandas uses the .loc
accessor, which allows you to select data by referencing the index and column names directly. This is different from position-based indexing (using .iloc
), which we'll cover in another tutorial.
Let's start by creating a simple DataFrame to work with:
import pandas as pd
import numpy as np
# Create a sample DataFrame
data = {
'Name': ['John', 'Emma', 'Michael', 'Sophia', 'William'],
'Age': [28, 24, 32, 27, 36],
'Profession': ['Engineer', 'Designer', 'Doctor', 'Analyst', 'Teacher'],
'Salary': [75000, 65000, 120000, 72000, 68000],
'Experience': [3, 2, 7, 4, 10]
}
df = pd.DataFrame(data)
df.set_index('Name', inplace=True)
print(df)
This will output:
Age Profession Salary Experience
Name
John 28 Engineer 75000 3
Emma 24 Designer 65000 2
Michael 32 Doctor 120000 7
Sophia 27 Analyst 72000 4
William 36 Teacher 68000 10
Basic Label Selection with .loc
The .loc
accessor is used for selecting data by label. Let's look at some examples:
Selecting a Single Row by Label
# Select the row with label 'Emma'
emma_data = df.loc['Emma']
print(emma_data)
Output:
Age 24
Profession Designer
Salary 65000
Experience 2
Name: Emma, dtype: object
Notice that the result is a pandas Series object when selecting a single row.
Selecting Multiple Rows by Label
# Select rows with labels 'John' and 'Sophia'
selected_rows = df.loc[['John', 'Sophia']]
print(selected_rows)
Output:
Age Profession Salary Experience
Name
John 28 Engineer 75000 3
Sophia 27 Analyst 72000 4
Selecting Specific Columns for a Row
# Select the 'Age' and 'Profession' columns for 'Michael'
michael_info = df.loc['Michael', ['Age', 'Profession']]
print(michael_info)
Output:
Age 32
Profession Doctor
Name: Michael, dtype: object
Slicing with Labels
.loc
also supports slicing by labels, which is very useful for selecting ranges of data:
# Select rows from 'Emma' to 'Sophia' (inclusive)
slice_data = df.loc['Emma':'Sophia']
print(slice_data)
Output:
Age Profession Salary Experience
Name
Emma 24 Designer 65000 2
Michael 32 Doctor 120000 7
Sophia 27 Analyst 72000 4
Note that unlike normal Python slicing, when using .loc
, the end label ('Sophia' in this case) is included in the results.
Combining Row and Column Selection
You can select both rows and columns simultaneously with .loc
:
# Select 'Age' and 'Salary' columns for 'John' and 'William'
subset = df.loc[['John', 'William'], ['Age', 'Salary']]
print(subset)
Output:
Age Salary
Name
John 28 75000
William 36 68000
Boolean Indexing with Labels
.loc
can also be used with boolean arrays, allowing you to filter data based on conditions:
# Select rows where Age > 30
experienced = df.loc[df['Age'] > 30]
print(experienced)
Output:
Age Profession Salary Experience
Name
Michael 32 Doctor 120000 7
William 36 Teacher 68000 10
Modifying Data Using Label Selection
.loc
can also be used to modify values in a DataFrame:
# Give everyone a 10% salary increase
df.loc[:, 'Salary'] = df['Salary'] * 1.1
print(df)
Output:
Age Profession Salary Experience
Name
John 28 Engineer 82500.0 3
Emma 24 Designer 71500.0 2
Michael 32 Doctor 132000.0 7
Sophia 27 Analyst 79200.0 4
William 36 Teacher 74800.0 10
Practical Example: Employee Data Analysis
Let's use what we've learned to analyze some employee data:
# Create a more complex DataFrame
dates = pd.date_range('20220101', periods=5)
employee_data = {
'EmployeeID': ['E001', 'E002', 'E003', 'E004', 'E005'],
'Name': ['John Smith', 'Emma Davis', 'Michael Brown', 'Sophia Wilson', 'William Taylor'],
'Department': ['Engineering', 'Design', 'Medical', 'Analytics', 'Education'],
'HireDate': dates,
'Salary': [75000, 65000, 120000, 72000, 68000],
'Performance': ['Good', 'Excellent', 'Good', 'Average', 'Excellent']
}
employees = pd.DataFrame(employee_data)
employees.set_index('EmployeeID', inplace=True)
print(employees)
Output:
Name Department HireDate Salary Performance
EmployeeID
E001 John Smith Engineering 2022-01-01 75000 Good
E002 Emma Davis Design 2022-01-02 65000 Excellent
E003 Michael Brown Medical 2022-01-03 120000 Good
E004 Sophia Wilson Analytics 2022-01-04 72000 Average
E005 William Taylor Education 2022-01-05 68000 Excellent
Now, let's perform some data analysis tasks:
1. Find employees with excellent performance
excellent_performers = employees.loc[employees['Performance'] == 'Excellent']
print("Excellent Performers:")
print(excellent_performers)
Output:
Excellent Performers:
Name Department HireDate Salary Performance
EmployeeID
E002 Emma Davis Design 2022-01-02 65000 Excellent
E005 William Taylor Education 2022-01-05 68000 Excellent
2. Calculate bonuses based on performance
# Create a bonus mapping
bonus_mapping = {'Excellent': 0.15, 'Good': 0.10, 'Average': 0.05}
# Calculate bonuses
employees['Bonus'] = employees['Performance'].map(bonus_mapping) * employees['Salary']
# Select specific columns for reporting
bonus_report = employees.loc[:, ['Name', 'Performance', 'Salary', 'Bonus']]
print("\nBonus Report:")
print(bonus_report)
Output:
Bonus Report:
Name Performance Salary Bonus
EmployeeID
E001 John Smith Good 75000 7500.0
E002 Emma Davis Excellent 65000 9750.0
E003 Michael Brown Good 120000 12000.0
E004 Sophia Wilson Average 72000 3600.0
E005 William Taylor Excellent 68000 10200.0
3. Get details of specific employees
# Get information about specific employees
specific_employees = employees.loc[['E001', 'E003', 'E005'], ['Department', 'HireDate', 'Performance']]
print("\nSelected Employee Details:")
print(specific_employees)
Output:
Selected Employee Details:
Department HireDate Performance
EmployeeID
E001 Engineering 2022-01-01 Good
E003 Medical 2022-01-03 Good
E005 Education 2022-01-05 Excellent
Label Selection with MultiIndex
Pandas also supports MultiIndex (hierarchical indexing), where .loc
becomes even more powerful:
# Create a DataFrame with MultiIndex
multi_idx = pd.MultiIndex.from_tuples([
('Engineering', 'Senior'),
('Engineering', 'Junior'),
('Design', 'Senior'),
('Design', 'Junior'),
('Medical', 'Senior')
], names=['Department', 'Level'])
multi_df = pd.DataFrame({
'Employee': ['John', 'Emma', 'Sophia', 'William', 'Michael'],
'Salary': [85000, 65000, 78000, 62000, 120000]
}, index=multi_idx)
print("DataFrame with MultiIndex:")
print(multi_df)
Output:
DataFrame with MultiIndex:
Employee Salary
Department Level
Engineering Senior John 85000
Junior Emma 65000
Design Senior Sophia 78000
Junior William 62000
Medical Senior Michael 120000
You can select data at different levels:
# Select all employees in the Engineering department
engineering_staff = multi_df.loc['Engineering']
print("\nEngineering Staff:")
print(engineering_staff)
# Select all senior level employees across departments
senior_staff = multi_df.loc[(slice(None), 'Senior')]
print("\nSenior Level Staff:")
print(senior_staff)
Output:
Engineering Staff:
Employee Salary
Level
Senior John 85000
Junior Emma 65000
Senior Level Staff:
Employee Salary
Department Level
Engineering Senior John 85000
Design Senior Sophia 78000
Medical Senior Michael 120000
Summary
In this tutorial, we've explored pandas label-based selection using .loc
. Here's what we've covered:
- Basic row and column selection by labels
- Selecting multiple rows and columns simultaneously
- Label-based slicing
- Combining row and column selection
- Boolean indexing with labels
- Practical examples with employee data
- Working with MultiIndex
Label-based selection with .loc
provides a more intuitive and readable way to work with your data, especially when your DataFrames have meaningful index and column names.
Exercises
To practice your label selection skills, try these exercises:
-
Create a DataFrame of students with columns for Name, Grade, and scores in different subjects. Use student names as the index. Select:
- All information for a specific student
- All students with grades above a certain threshold
- The Math scores for a subset of students
-
Create a MultiIndex DataFrame representing sales data across different regions and product categories. Practice selecting:
- All sales from a specific region
- All sales of a specific product category across all regions
- The top-performing product in each region
-
Take the employees DataFrame from our example and find:
- The average salary by department
- The employees hired in the first quarter of 2022
- The department with the highest average performance rating
Additional Resources
- Pandas Documentation: Indexing and Selecting Data
- 10 Minutes to Pandas: Selection
- Pandas Cheat Sheet: Data Selection
Happy data analysis!
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)