Pandas Label Selection

One of the most powerful features of pandas is the ability to select data by labels (names) rather than just numeric positions. This allows for more intuitive and readable code, especially when working with meaningful column and index names. In this tutorial, we'll explore how to select data using labels in pandas.

Introduction to Label Selection

Label-based selection in pandas uses the .loc accessor, which allows you to select data by referencing the index and column names directly. This is different from position-based indexing (using .iloc), which we'll cover in another tutorial.

Let's start by creating a simple DataFrame to work with:

import pandas as pd
import numpy as np

# Create a sample DataFrame
data = {
    'Name': ['John', 'Emma', 'Michael', 'Sophia', 'William'],
    'Age': [28, 24, 32, 27, 36],
    'Profession': ['Engineer', 'Designer', 'Doctor', 'Analyst', 'Teacher'],
    'Salary': [75000, 65000, 120000, 72000, 68000],
    'Experience': [3, 2, 7, 4, 10]
}

df = pd.DataFrame(data)
df.set_index('Name', inplace=True)

print(df)

This will output:

         Age Profession  Salary  Experience
Name                                       
John      28   Engineer   75000          3
Emma      24   Designer   65000          2
Michael   32     Doctor  120000          7
Sophia    27    Analyst   72000          4
William   36    Teacher   68000         10

Basic Label Selection with `.loc`

The .loc accessor is used for selecting data by label. Let's look at some examples:

Selecting a Single Row by Label

# Select the row with label 'Emma'
emma_data = df.loc['Emma']
print(emma_data)

Output:

Age                24
Profession    Designer
Salary          65000
Experience          2
Name: Emma, dtype: object

Notice that the result is a pandas Series object when selecting a single row.

Selecting Multiple Rows by Label

# Select rows with labels 'John' and 'Sophia'
selected_rows = df.loc[['John', 'Sophia']]
print(selected_rows)

Output:

       Age Profession  Salary  Experience
Name                                     
John    28   Engineer   75000          3
Sophia  27    Analyst   72000          4

Selecting Specific Columns for a Row

# Select the 'Age' and 'Profession' columns for 'Michael'
michael_info = df.loc['Michael', ['Age', 'Profession']]
print(michael_info)

Output:

Age                32
Profession     Doctor
Name: Michael, dtype: object

Slicing with Labels

.loc also supports slicing by labels, which is very useful for selecting ranges of data:

# Select rows from 'Emma' to 'Sophia' (inclusive)
slice_data = df.loc['Emma':'Sophia']
print(slice_data)

Output:

         Age Profession  Salary  Experience
Name                                       
Emma      24   Designer   65000          2
Michael   32     Doctor  120000          7
Sophia    27    Analyst   72000          4

Note that unlike normal Python slicing, when using .loc, the end label ('Sophia' in this case) is included in the results.

Combining Row and Column Selection

You can select both rows and columns simultaneously with .loc:

# Select 'Age' and 'Salary' columns for 'John' and 'William'
subset = df.loc[['John', 'William'], ['Age', 'Salary']]
print(subset)

Output:

         Age  Salary
Name               
John     28   75000
William  36   68000

Boolean Indexing with Labels

.loc can also be used with boolean arrays, allowing you to filter data based on conditions:

# Select rows where Age > 30
experienced = df.loc[df['Age'] > 30]
print(experienced)

Output:

         Age Profession  Salary  Experience
Name                                       
Michael   32     Doctor  120000          7
William   36    Teacher   68000         10

Modifying Data Using Label Selection

.loc can also be used to modify values in a DataFrame:

# Give everyone a 10% salary increase
df.loc[:, 'Salary'] = df['Salary'] * 1.1
print(df)

Output:

         Age Profession   Salary  Experience
Name                                        
John      28   Engineer  82500.0          3
Emma      24   Designer  71500.0          2
Michael   32     Doctor 132000.0          7
Sophia    27    Analyst  79200.0          4
William   36    Teacher  74800.0         10

Practical Example: Employee Data Analysis

Let's use what we've learned to analyze some employee data:

# Create a more complex DataFrame
dates = pd.date_range('20220101', periods=5)
employee_data = {
    'EmployeeID': ['E001', 'E002', 'E003', 'E004', 'E005'],
    'Name': ['John Smith', 'Emma Davis', 'Michael Brown', 'Sophia Wilson', 'William Taylor'],
    'Department': ['Engineering', 'Design', 'Medical', 'Analytics', 'Education'],
    'HireDate': dates,
    'Salary': [75000, 65000, 120000, 72000, 68000],
    'Performance': ['Good', 'Excellent', 'Good', 'Average', 'Excellent']
}

employees = pd.DataFrame(employee_data)
employees.set_index('EmployeeID', inplace=True)

print(employees)

Output:

             Name   Department   HireDate  Salary Performance
EmployeeID                                                   
E001     John Smith  Engineering 2022-01-01  75000        Good
E002     Emma Davis       Design 2022-01-02  65000   Excellent
E003  Michael Brown      Medical 2022-01-03 120000        Good
E004  Sophia Wilson   Analytics 2022-01-04  72000     Average
E005 William Taylor   Education 2022-01-05  68000   Excellent

Now, let's perform some data analysis tasks:

1. Find employees with excellent performance

excellent_performers = employees.loc[employees['Performance'] == 'Excellent']
print("Excellent Performers:")
print(excellent_performers)

Output:

Excellent Performers:
             Name Department   HireDate  Salary Performance
EmployeeID                                                
E002     Emma Davis     Design 2022-01-02  65000   Excellent
E005 William Taylor  Education 2022-01-05  68000   Excellent

2. Calculate bonuses based on performance

# Create a bonus mapping
bonus_mapping = {'Excellent': 0.15, 'Good': 0.10, 'Average': 0.05}

# Calculate bonuses
employees['Bonus'] = employees['Performance'].map(bonus_mapping) * employees['Salary']

# Select specific columns for reporting
bonus_report = employees.loc[:, ['Name', 'Performance', 'Salary', 'Bonus']]
print("\nBonus Report:")
print(bonus_report)

Output:

Bonus Report:
             Name Performance   Salary    Bonus
EmployeeID                                    
E001     John Smith        Good  75000   7500.0
E002     Emma Davis   Excellent  65000   9750.0
E003  Michael Brown        Good 120000  12000.0
E004  Sophia Wilson     Average  72000   3600.0
E005 William Taylor   Excellent  68000  10200.0

3. Get details of specific employees

# Get information about specific employees
specific_employees = employees.loc[['E001', 'E003', 'E005'], ['Department', 'HireDate', 'Performance']]
print("\nSelected Employee Details:")
print(specific_employees)

Output:

Selected Employee Details:
           Department   HireDate Performance
EmployeeID                                  
E001      Engineering 2022-01-01        Good
E003          Medical 2022-01-03        Good
E005        Education 2022-01-05   Excellent

Label Selection with MultiIndex

Pandas also supports MultiIndex (hierarchical indexing), where .loc becomes even more powerful:

# Create a DataFrame with MultiIndex
multi_idx = pd.MultiIndex.from_tuples([
    ('Engineering', 'Senior'), 
    ('Engineering', 'Junior'),
    ('Design', 'Senior'),
    ('Design', 'Junior'),
    ('Medical', 'Senior')
], names=['Department', 'Level'])

multi_df = pd.DataFrame({
    'Employee': ['John', 'Emma', 'Sophia', 'William', 'Michael'],
    'Salary': [85000, 65000, 78000, 62000, 120000]
}, index=multi_idx)

print("DataFrame with MultiIndex:")
print(multi_df)

Output:

DataFrame with MultiIndex:
                   Employee  Salary
Department  Level                  
Engineering Senior     John   85000
           Junior     Emma   65000
Design      Senior   Sophia   78000
           Junior  William   62000
Medical     Senior  Michael  120000

You can select data at different levels:

# Select all employees in the Engineering department
engineering_staff = multi_df.loc['Engineering']
print("\nEngineering Staff:")
print(engineering_staff)

# Select all senior level employees across departments
senior_staff = multi_df.loc[(slice(None), 'Senior')]
print("\nSenior Level Staff:")
print(senior_staff)

Output:

Engineering Staff:
       Employee  Salary
Level                  
Senior     John   85000
Junior     Emma   65000

Senior Level Staff:
                   Employee  Salary
Department  Level                  
Engineering Senior     John   85000
Design      Senior   Sophia   78000
Medical     Senior  Michael  120000

Summary

In this tutorial, we've explored pandas label-based selection using .loc. Here's what we've covered:

Basic row and column selection by labels
Selecting multiple rows and columns simultaneously
Label-based slicing
Combining row and column selection
Boolean indexing with labels
Practical examples with employee data
Working with MultiIndex

Label-based selection with .loc provides a more intuitive and readable way to work with your data, especially when your DataFrames have meaningful index and column names.

Exercises

To practice your label selection skills, try these exercises:

Create a DataFrame of students with columns for Name, Grade, and scores in different subjects. Use student names as the index. Select:
- All information for a specific student
- All students with grades above a certain threshold
- The Math scores for a subset of students
Create a MultiIndex DataFrame representing sales data across different regions and product categories. Practice selecting:
- All sales from a specific region
- All sales of a specific product category across all regions
- The top-performing product in each region
Take the employees DataFrame from our example and find:
- The average salary by department
- The employees hired in the first quarter of 2022
- The department with the highest average performance rating

Additional Resources

Happy data analysis!

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

Introduction to Label Selection​

Basic Label Selection with .loc​

Selecting a Single Row by Label​

Selecting Multiple Rows by Label​

Selecting Specific Columns for a Row​

Slicing with Labels​

Combining Row and Column Selection​

Boolean Indexing with Labels​

Modifying Data Using Label Selection​

Practical Example: Employee Data Analysis​

1. Find employees with excellent performance​

2. Calculate bonuses based on performance​

3. Get details of specific employees​

Label Selection with MultiIndex​

Summary​

Exercises​

Additional Resources​

Introduction to Label Selection

Basic Label Selection with `.loc`

Selecting a Single Row by Label

Selecting Multiple Rows by Label

Selecting Specific Columns for a Row

Slicing with Labels

Combining Row and Column Selection

Boolean Indexing with Labels

Modifying Data Using Label Selection

Practical Example: Employee Data Analysis

1. Find employees with excellent performance

2. Calculate bonuses based on performance

3. Get details of specific employees

Label Selection with MultiIndex

Summary

Exercises

Additional Resources