Pandas Indexing Basics
Indexing is one of the most fundamental skills you need when working with Pandas. It allows you to access, filter, and manipulate specific parts of your data. In this tutorial, we'll explore the various ways to index and select data from Pandas DataFrames and Series.
Introduction
When working with data in Pandas, you'll often need to extract specific portions of your data for analysis, visualization, or transformation. Pandas offers multiple ways to select data:
- Selecting by label (
.loc
) - Selecting by position (
.iloc
) - Boolean indexing (filtering with conditions)
- Basic indexing using
[]
operator
Let's explore each of these methods with clear examples.
Setting Up Our Example Data
First, let's create a sample DataFrame that we'll use throughout this tutorial:
import pandas as pd
import numpy as np
# Create a sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
'Age': [24, 27, 22, 32, 29],
'City': ['New York', 'Boston', 'Chicago', 'Denver', 'San Francisco'],
'Salary': [72000, 65000, 53000, 98000, 84000],
'Department': ['Marketing', 'IT', 'Sales', 'Engineering', 'Product']
}
df = pd.DataFrame(data)
print(df)
This will output:
Name Age City Salary Department
0 Alice 24 New York 72000 Marketing
1 Bob 27 Boston 65000 IT
2 Charlie 22 Chicago 53000 Sales
3 David 32 Denver 98000 Engineering
4 Emily 29 San Francisco 84000 Product
Basic Indexing with []
The simplest way to select data is using the []
operator:
Selecting Columns
To select a single column:
# Select the 'Name' column
names = df['Name']
print(names)
Output:
0 Alice
1 Bob
2 Charlie
3 David
4 Emily
Name: Name, dtype: object
To select multiple columns:
# Select 'Name' and 'Age' columns
subset = df[['Name', 'Age']]
print(subset)
Output:
Name Age
0 Alice 24
1 Bob 27
2 Charlie 22
3 David 32
4 Emily 29
Selecting Rows by Position
You can also use slicing to select rows by their position:
# Select first 3 rows
first_three = df[0:3]
print(first_three)
Output:
Name Age City Salary Department
0 Alice 24 New York 72000 Marketing
1 Bob 27 Boston 65000 IT
2 Charlie 22 Chicago 53000 Sales
Selecting with .loc
(Label-based)
The .loc
accessor is used for label-based indexing. It allows you to select data based on row and column labels.
Select by Row Label
By default, row labels are numeric indices (0, 1, 2, etc.):
# Select row with label 2
row_2 = df.loc[2]
print(row_2)
Output:
Name Charlie
Age 22
City Chicago
Salary 53000
Department Sales
Name: 2, dtype: object
Select by Row and Column Labels
You can select specific cells by specifying both row and column labels:
# Select the Age of Charlie (row 2, column 'Age')
charlie_age = df.loc[2, 'Age']
print(f"Charlie's age: {charlie_age}")
Output:
Charlie's age: 22
Select Range of Rows and Columns
You can also select ranges using slices:
# Select rows 1-3 and columns 'Name' and 'City'
subset = df.loc[1:3, ['Name', 'City']]
print(subset)
Output:
Name City
1 Bob Boston
2 Charlie Chicago
3 David Denver
Selecting with .iloc
(Position-based)
The .iloc
accessor is used for position-based (integer-based) indexing. It works with integer positions only.
Select by Row Position
# Select the third row (index 2)
third_row = df.iloc[2]
print(third_row)
Output:
Name Charlie
Age 22
City Chicago
Salary 53000
Department Sales
Name: 2, dtype: object
Select by Row and Column Position
# Select the element at row 1, column 2 (Bob's city)
element = df.iloc[1, 2]
print(f"Element at row 1, column 2: {element}")
Output:
Element at row 1, column 2: Boston
Select Range of Rows and Columns by Position
# Select rows 1-3 and columns 1-3
subset = df.iloc[1:4, 1:4]
print(subset)
Output:
Age City Salary
1 27 Boston 65000
2 22 Chicago 53000
3 32 Denver 98000
Boolean Indexing (Filtering)
One of the most powerful ways to select data is by using boolean conditions:
Filter by a Single Condition
# Select people who are older than 25
older_than_25 = df[df['Age'] > 25]
print(older_than_25)
Output:
Name Age City Salary Department
1 Bob 27 Boston 65000 IT
3 David 32 Denver 98000 Engineering
4 Emily 29 San Francisco 84000 Product
Filter by Multiple Conditions
# Select people who are older than 25 AND earn more than 70000
filtered = df[(df['Age'] > 25) & (df['Salary'] > 70000)]
print(filtered)
Output:
Name Age City Salary Department
3 David 32 Denver 98000 Engineering
4 Emily 29 San Francisco 84000 Product
Setting a Custom Index
So far we've used the default integer index, but you can set any column as the index:
# Set 'Name' as the index
df_indexed = df.set_index('Name')
print(df_indexed)
Output:
Age City Salary Department
Name
Alice 24 New York 72000 Marketing
Bob 27 Boston 65000 IT
Charlie 22 Chicago 53000 Sales
David 32 Denver 98000 Engineering
Emily 29 San Francisco 84000 Product
Now you can select rows directly using names:
# Get data for Bob
bob_data = df_indexed.loc['Bob']
print(bob_data)
Output:
Age 27
City Boston
Salary 65000
Department IT
Name: Bob, dtype: object
Practical Application: Data Analysis
Let's see how these indexing techniques can be used in a real data analysis scenario:
# 1. Calculate average salary by age group
def age_group(age):
if age < 25:
return 'Young'
elif age < 30:
return 'Mid'
else:
return 'Senior'
# Add age group column
df['AgeGroup'] = df['Age'].apply(age_group)
# Group by age group and calculate average salary
avg_salary = df.groupby('AgeGroup')['Salary'].mean()
print("Average salary by age group:")
print(avg_salary)
# 2. Find highest paid employee in each department
highest_paid = df.loc[df.groupby('Department')['Salary'].idxmax()]
print("\nHighest paid employee in each department:")
print(highest_paid[['Department', 'Name', 'Salary']])
Output:
Average salary by age group:
AgeGroup
Mid 74500.0
Senior 98000.0
Young 62500.0
Name: Salary, dtype: float64
Highest paid employee in each department:
Department Name Salary
3 Engineering David 98000
1 IT Bob 65000
0 Marketing Alice 72000
4 Product Emily 84000
2 Sales Charlie 53000
Summary
In this tutorial, we've covered the basics of Pandas indexing:
- Basic indexing using
[]
for column selection and row slicing - Label-based indexing with
.loc
for selecting by row and column labels - Position-based indexing with
.iloc
for selecting by integer position - Boolean indexing for filtering data based on conditions
- Setting custom indices for easier data access
These indexing techniques form the foundation of data manipulation in Pandas and are essential skills for any data analysis work.
Exercises
To reinforce your learning, try these exercises:
- Create a DataFrame with your own data and practice all indexing methods.
- Select all employees in the Marketing or IT departments.
- Find all employees whose name starts with a letter after 'C' in the alphabet.
- Create a new column called "Bonus" that is 10% of salary for people in Engineering and 5% for everyone else.
- Reset the index of
df_indexed
back to the default integer index.
Additional Resources
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)