Skip to main content

Pandas Indexing Basics

Indexing is one of the most fundamental skills you need when working with Pandas. It allows you to access, filter, and manipulate specific parts of your data. In this tutorial, we'll explore the various ways to index and select data from Pandas DataFrames and Series.

Introduction

When working with data in Pandas, you'll often need to extract specific portions of your data for analysis, visualization, or transformation. Pandas offers multiple ways to select data:

  • Selecting by label (.loc)
  • Selecting by position (.iloc)
  • Boolean indexing (filtering with conditions)
  • Basic indexing using [] operator

Let's explore each of these methods with clear examples.

Setting Up Our Example Data

First, let's create a sample DataFrame that we'll use throughout this tutorial:

python
import pandas as pd
import numpy as np

# Create a sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
'Age': [24, 27, 22, 32, 29],
'City': ['New York', 'Boston', 'Chicago', 'Denver', 'San Francisco'],
'Salary': [72000, 65000, 53000, 98000, 84000],
'Department': ['Marketing', 'IT', 'Sales', 'Engineering', 'Product']
}

df = pd.DataFrame(data)
print(df)

This will output:

      Name  Age           City  Salary  Department
0 Alice 24 New York 72000 Marketing
1 Bob 27 Boston 65000 IT
2 Charlie 22 Chicago 53000 Sales
3 David 32 Denver 98000 Engineering
4 Emily 29 San Francisco 84000 Product

Basic Indexing with []

The simplest way to select data is using the [] operator:

Selecting Columns

To select a single column:

python
# Select the 'Name' column
names = df['Name']
print(names)

Output:

0      Alice
1 Bob
2 Charlie
3 David
4 Emily
Name: Name, dtype: object

To select multiple columns:

python
# Select 'Name' and 'Age' columns
subset = df[['Name', 'Age']]
print(subset)

Output:

      Name  Age
0 Alice 24
1 Bob 27
2 Charlie 22
3 David 32
4 Emily 29

Selecting Rows by Position

You can also use slicing to select rows by their position:

python
# Select first 3 rows
first_three = df[0:3]
print(first_three)

Output:

      Name  Age     City  Salary Department
0 Alice 24 New York 72000 Marketing
1 Bob 27 Boston 65000 IT
2 Charlie 22 Chicago 53000 Sales

Selecting with .loc (Label-based)

The .loc accessor is used for label-based indexing. It allows you to select data based on row and column labels.

Select by Row Label

By default, row labels are numeric indices (0, 1, 2, etc.):

python
# Select row with label 2
row_2 = df.loc[2]
print(row_2)

Output:

Name           Charlie
Age 22
City Chicago
Salary 53000
Department Sales
Name: 2, dtype: object

Select by Row and Column Labels

You can select specific cells by specifying both row and column labels:

python
# Select the Age of Charlie (row 2, column 'Age')
charlie_age = df.loc[2, 'Age']
print(f"Charlie's age: {charlie_age}")

Output:

Charlie's age: 22

Select Range of Rows and Columns

You can also select ranges using slices:

python
# Select rows 1-3 and columns 'Name' and 'City'
subset = df.loc[1:3, ['Name', 'City']]
print(subset)

Output:

      Name     City
1 Bob Boston
2 Charlie Chicago
3 David Denver

Selecting with .iloc (Position-based)

The .iloc accessor is used for position-based (integer-based) indexing. It works with integer positions only.

Select by Row Position

python
# Select the third row (index 2)
third_row = df.iloc[2]
print(third_row)

Output:

Name           Charlie
Age 22
City Chicago
Salary 53000
Department Sales
Name: 2, dtype: object

Select by Row and Column Position

python
# Select the element at row 1, column 2 (Bob's city)
element = df.iloc[1, 2]
print(f"Element at row 1, column 2: {element}")

Output:

Element at row 1, column 2: Boston

Select Range of Rows and Columns by Position

python
# Select rows 1-3 and columns 1-3
subset = df.iloc[1:4, 1:4]
print(subset)

Output:

   Age     City  Salary
1 27 Boston 65000
2 22 Chicago 53000
3 32 Denver 98000

Boolean Indexing (Filtering)

One of the most powerful ways to select data is by using boolean conditions:

Filter by a Single Condition

python
# Select people who are older than 25
older_than_25 = df[df['Age'] > 25]
print(older_than_25)

Output:

    Name  Age           City  Salary  Department
1 Bob 27 Boston 65000 IT
3 David 32 Denver 98000 Engineering
4 Emily 29 San Francisco 84000 Product

Filter by Multiple Conditions

python
# Select people who are older than 25 AND earn more than 70000
filtered = df[(df['Age'] > 25) & (df['Salary'] > 70000)]
print(filtered)

Output:

    Name  Age           City  Salary  Department
3 David 32 Denver 98000 Engineering
4 Emily 29 San Francisco 84000 Product

Setting a Custom Index

So far we've used the default integer index, but you can set any column as the index:

python
# Set 'Name' as the index
df_indexed = df.set_index('Name')
print(df_indexed)

Output:

         Age           City  Salary  Department
Name
Alice 24 New York 72000 Marketing
Bob 27 Boston 65000 IT
Charlie 22 Chicago 53000 Sales
David 32 Denver 98000 Engineering
Emily 29 San Francisco 84000 Product

Now you can select rows directly using names:

python
# Get data for Bob
bob_data = df_indexed.loc['Bob']
print(bob_data)

Output:

Age                27
City Boston
Salary 65000
Department IT
Name: Bob, dtype: object

Practical Application: Data Analysis

Let's see how these indexing techniques can be used in a real data analysis scenario:

python
# 1. Calculate average salary by age group
def age_group(age):
if age < 25:
return 'Young'
elif age < 30:
return 'Mid'
else:
return 'Senior'

# Add age group column
df['AgeGroup'] = df['Age'].apply(age_group)

# Group by age group and calculate average salary
avg_salary = df.groupby('AgeGroup')['Salary'].mean()
print("Average salary by age group:")
print(avg_salary)

# 2. Find highest paid employee in each department
highest_paid = df.loc[df.groupby('Department')['Salary'].idxmax()]
print("\nHighest paid employee in each department:")
print(highest_paid[['Department', 'Name', 'Salary']])

Output:

Average salary by age group:
AgeGroup
Mid 74500.0
Senior 98000.0
Young 62500.0
Name: Salary, dtype: float64

Highest paid employee in each department:
Department Name Salary
3 Engineering David 98000
1 IT Bob 65000
0 Marketing Alice 72000
4 Product Emily 84000
2 Sales Charlie 53000

Summary

In this tutorial, we've covered the basics of Pandas indexing:

  • Basic indexing using [] for column selection and row slicing
  • Label-based indexing with .loc for selecting by row and column labels
  • Position-based indexing with .iloc for selecting by integer position
  • Boolean indexing for filtering data based on conditions
  • Setting custom indices for easier data access

These indexing techniques form the foundation of data manipulation in Pandas and are essential skills for any data analysis work.

Exercises

To reinforce your learning, try these exercises:

  1. Create a DataFrame with your own data and practice all indexing methods.
  2. Select all employees in the Marketing or IT departments.
  3. Find all employees whose name starts with a letter after 'C' in the alphabet.
  4. Create a new column called "Bonus" that is 10% of salary for people in Engineering and 5% for everyone else.
  5. Reset the index of df_indexed back to the default integer index.

Additional Resources



If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)