Skip to main content

Pandas Label Selection

One of the most powerful features of pandas is the ability to select data by labels (names) rather than just numeric positions. This allows for more intuitive and readable code, especially when working with meaningful column and index names. In this tutorial, we'll explore how to select data using labels in pandas.

Introduction to Label Selection

Label-based selection in pandas uses the .loc accessor, which allows you to select data by referencing the index and column names directly. This is different from position-based indexing (using .iloc), which we'll cover in another tutorial.

Let's start by creating a simple DataFrame to work with:

python
import pandas as pd
import numpy as np

# Create a sample DataFrame
data = {
'Name': ['John', 'Emma', 'Michael', 'Sophia', 'William'],
'Age': [28, 24, 32, 27, 36],
'Profession': ['Engineer', 'Designer', 'Doctor', 'Analyst', 'Teacher'],
'Salary': [75000, 65000, 120000, 72000, 68000],
'Experience': [3, 2, 7, 4, 10]
}

df = pd.DataFrame(data)
df.set_index('Name', inplace=True)

print(df)

This will output:

         Age Profession  Salary  Experience
Name
John 28 Engineer 75000 3
Emma 24 Designer 65000 2
Michael 32 Doctor 120000 7
Sophia 27 Analyst 72000 4
William 36 Teacher 68000 10

Basic Label Selection with .loc

The .loc accessor is used for selecting data by label. Let's look at some examples:

Selecting a Single Row by Label

python
# Select the row with label 'Emma'
emma_data = df.loc['Emma']
print(emma_data)

Output:

Age                24
Profession Designer
Salary 65000
Experience 2
Name: Emma, dtype: object

Notice that the result is a pandas Series object when selecting a single row.

Selecting Multiple Rows by Label

python
# Select rows with labels 'John' and 'Sophia'
selected_rows = df.loc[['John', 'Sophia']]
print(selected_rows)

Output:

       Age Profession  Salary  Experience
Name
John 28 Engineer 75000 3
Sophia 27 Analyst 72000 4

Selecting Specific Columns for a Row

python
# Select the 'Age' and 'Profession' columns for 'Michael'
michael_info = df.loc['Michael', ['Age', 'Profession']]
print(michael_info)

Output:

Age                32
Profession Doctor
Name: Michael, dtype: object

Slicing with Labels

.loc also supports slicing by labels, which is very useful for selecting ranges of data:

python
# Select rows from 'Emma' to 'Sophia' (inclusive)
slice_data = df.loc['Emma':'Sophia']
print(slice_data)

Output:

         Age Profession  Salary  Experience
Name
Emma 24 Designer 65000 2
Michael 32 Doctor 120000 7
Sophia 27 Analyst 72000 4

Note that unlike normal Python slicing, when using .loc, the end label ('Sophia' in this case) is included in the results.

Combining Row and Column Selection

You can select both rows and columns simultaneously with .loc:

python
# Select 'Age' and 'Salary' columns for 'John' and 'William'
subset = df.loc[['John', 'William'], ['Age', 'Salary']]
print(subset)

Output:

         Age  Salary
Name
John 28 75000
William 36 68000

Boolean Indexing with Labels

.loc can also be used with boolean arrays, allowing you to filter data based on conditions:

python
# Select rows where Age > 30
experienced = df.loc[df['Age'] > 30]
print(experienced)

Output:

         Age Profession  Salary  Experience
Name
Michael 32 Doctor 120000 7
William 36 Teacher 68000 10

Modifying Data Using Label Selection

.loc can also be used to modify values in a DataFrame:

python
# Give everyone a 10% salary increase
df.loc[:, 'Salary'] = df['Salary'] * 1.1
print(df)

Output:

         Age Profession   Salary  Experience
Name
John 28 Engineer 82500.0 3
Emma 24 Designer 71500.0 2
Michael 32 Doctor 132000.0 7
Sophia 27 Analyst 79200.0 4
William 36 Teacher 74800.0 10

Practical Example: Employee Data Analysis

Let's use what we've learned to analyze some employee data:

python
# Create a more complex DataFrame
dates = pd.date_range('20220101', periods=5)
employee_data = {
'EmployeeID': ['E001', 'E002', 'E003', 'E004', 'E005'],
'Name': ['John Smith', 'Emma Davis', 'Michael Brown', 'Sophia Wilson', 'William Taylor'],
'Department': ['Engineering', 'Design', 'Medical', 'Analytics', 'Education'],
'HireDate': dates,
'Salary': [75000, 65000, 120000, 72000, 68000],
'Performance': ['Good', 'Excellent', 'Good', 'Average', 'Excellent']
}

employees = pd.DataFrame(employee_data)
employees.set_index('EmployeeID', inplace=True)

print(employees)

Output:

             Name   Department   HireDate  Salary Performance
EmployeeID
E001 John Smith Engineering 2022-01-01 75000 Good
E002 Emma Davis Design 2022-01-02 65000 Excellent
E003 Michael Brown Medical 2022-01-03 120000 Good
E004 Sophia Wilson Analytics 2022-01-04 72000 Average
E005 William Taylor Education 2022-01-05 68000 Excellent

Now, let's perform some data analysis tasks:

1. Find employees with excellent performance

python
excellent_performers = employees.loc[employees['Performance'] == 'Excellent']
print("Excellent Performers:")
print(excellent_performers)

Output:

Excellent Performers:
Name Department HireDate Salary Performance
EmployeeID
E002 Emma Davis Design 2022-01-02 65000 Excellent
E005 William Taylor Education 2022-01-05 68000 Excellent

2. Calculate bonuses based on performance

python
# Create a bonus mapping
bonus_mapping = {'Excellent': 0.15, 'Good': 0.10, 'Average': 0.05}

# Calculate bonuses
employees['Bonus'] = employees['Performance'].map(bonus_mapping) * employees['Salary']

# Select specific columns for reporting
bonus_report = employees.loc[:, ['Name', 'Performance', 'Salary', 'Bonus']]
print("\nBonus Report:")
print(bonus_report)

Output:

Bonus Report:
Name Performance Salary Bonus
EmployeeID
E001 John Smith Good 75000 7500.0
E002 Emma Davis Excellent 65000 9750.0
E003 Michael Brown Good 120000 12000.0
E004 Sophia Wilson Average 72000 3600.0
E005 William Taylor Excellent 68000 10200.0

3. Get details of specific employees

python
# Get information about specific employees
specific_employees = employees.loc[['E001', 'E003', 'E005'], ['Department', 'HireDate', 'Performance']]
print("\nSelected Employee Details:")
print(specific_employees)

Output:

Selected Employee Details:
Department HireDate Performance
EmployeeID
E001 Engineering 2022-01-01 Good
E003 Medical 2022-01-03 Good
E005 Education 2022-01-05 Excellent

Label Selection with MultiIndex

Pandas also supports MultiIndex (hierarchical indexing), where .loc becomes even more powerful:

python
# Create a DataFrame with MultiIndex
multi_idx = pd.MultiIndex.from_tuples([
('Engineering', 'Senior'),
('Engineering', 'Junior'),
('Design', 'Senior'),
('Design', 'Junior'),
('Medical', 'Senior')
], names=['Department', 'Level'])

multi_df = pd.DataFrame({
'Employee': ['John', 'Emma', 'Sophia', 'William', 'Michael'],
'Salary': [85000, 65000, 78000, 62000, 120000]
}, index=multi_idx)

print("DataFrame with MultiIndex:")
print(multi_df)

Output:

DataFrame with MultiIndex:
Employee Salary
Department Level
Engineering Senior John 85000
Junior Emma 65000
Design Senior Sophia 78000
Junior William 62000
Medical Senior Michael 120000

You can select data at different levels:

python
# Select all employees in the Engineering department
engineering_staff = multi_df.loc['Engineering']
print("\nEngineering Staff:")
print(engineering_staff)

# Select all senior level employees across departments
senior_staff = multi_df.loc[(slice(None), 'Senior')]
print("\nSenior Level Staff:")
print(senior_staff)

Output:

Engineering Staff:
Employee Salary
Level
Senior John 85000
Junior Emma 65000

Senior Level Staff:
Employee Salary
Department Level
Engineering Senior John 85000
Design Senior Sophia 78000
Medical Senior Michael 120000

Summary

In this tutorial, we've explored pandas label-based selection using .loc. Here's what we've covered:

  • Basic row and column selection by labels
  • Selecting multiple rows and columns simultaneously
  • Label-based slicing
  • Combining row and column selection
  • Boolean indexing with labels
  • Practical examples with employee data
  • Working with MultiIndex

Label-based selection with .loc provides a more intuitive and readable way to work with your data, especially when your DataFrames have meaningful index and column names.

Exercises

To practice your label selection skills, try these exercises:

  1. Create a DataFrame of students with columns for Name, Grade, and scores in different subjects. Use student names as the index. Select:

    • All information for a specific student
    • All students with grades above a certain threshold
    • The Math scores for a subset of students
  2. Create a MultiIndex DataFrame representing sales data across different regions and product categories. Practice selecting:

    • All sales from a specific region
    • All sales of a specific product category across all regions
    • The top-performing product in each region
  3. Take the employees DataFrame from our example and find:

    • The average salary by department
    • The employees hired in the first quarter of 2022
    • The department with the highest average performance rating

Additional Resources

Happy data analysis!



If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)