Pandas Indexing Basics

Indexing is one of the most fundamental skills you need when working with Pandas. It allows you to access, filter, and manipulate specific parts of your data. In this tutorial, we'll explore the various ways to index and select data from Pandas DataFrames and Series.

Introduction

When working with data in Pandas, you'll often need to extract specific portions of your data for analysis, visualization, or transformation. Pandas offers multiple ways to select data:

Selecting by label (.loc)
Selecting by position (.iloc)
Boolean indexing (filtering with conditions)
Basic indexing using [] operator

Let's explore each of these methods with clear examples.

Setting Up Our Example Data

First, let's create a sample DataFrame that we'll use throughout this tutorial:

python
import pandas as pd
import numpy as np

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
    'Age': [24, 27, 22, 32, 29],
    'City': ['New York', 'Boston', 'Chicago', 'Denver', 'San Francisco'],
    'Salary': [72000, 65000, 53000, 98000, 84000],
    'Department': ['Marketing', 'IT', 'Sales', 'Engineering', 'Product']
}

df = pd.DataFrame(data)
print(df)

This will output:

      Name  Age           City  Salary  Department
  Alice   24       New York   72000   Marketing
    Bob   27         Boston   65000          IT
Charlie   22        Chicago   53000       Sales
  David   32         Denver   98000  Engineering
  Emily   29  San Francisco   84000     Product

Basic Indexing with `[]`

The simplest way to select data is using the [] operator:

Selecting Columns

To select a single column:

python
# Select the 'Name' column
names = df['Name']
print(names)

Output:

    Alice
      Bob
  Charlie
    David
    Emily
Name: Name, dtype: object

To select multiple columns:

python
# Select 'Name' and 'Age' columns
subset = df[['Name', 'Age']]
print(subset)

Output:

      Name  Age
  Alice   24
    Bob   27
Charlie   22
  David   32
  Emily   29

Selecting Rows by Position

You can also use slicing to select rows by their position:

python
# Select first 3 rows
first_three = df[0:3]
print(first_three)

Output:

      Name  Age     City  Salary Department
  Alice   24  New York   72000  Marketing
    Bob   27    Boston   65000         IT
Charlie   22   Chicago   53000      Sales

Selecting with `.loc` (Label-based)

The .loc accessor is used for label-based indexing. It allows you to select data based on row and column labels.

Select by Row Label

By default, row labels are numeric indices (0, 1, 2, etc.):

python
# Select row with label 2
row_2 = df.loc[2]
print(row_2)

Output:

Name           Charlie
Age                 22
City           Chicago
Salary           53000
Department       Sales
Name: 2, dtype: object

Select by Row and Column Labels

You can select specific cells by specifying both row and column labels:

python
# Select the Age of Charlie (row 2, column 'Age')
charlie_age = df.loc[2, 'Age']
print(f"Charlie's age: {charlie_age}")

Output:

Charlie's age: 22

Select Range of Rows and Columns

You can also select ranges using slices:

python
# Select rows 1-3 and columns 'Name' and 'City'
subset = df.loc[1:3, ['Name', 'City']]
print(subset)

Output:

      Name     City
    Bob   Boston
Charlie  Chicago
  David   Denver

Selecting with `.iloc` (Position-based)

The .iloc accessor is used for position-based (integer-based) indexing. It works with integer positions only.

Select by Row Position

python
# Select the third row (index 2)
third_row = df.iloc[2]
print(third_row)

Output:

Name           Charlie
Age                 22
City           Chicago
Salary           53000
Department       Sales
Name: 2, dtype: object

Select by Row and Column Position

python
# Select the element at row 1, column 2 (Bob's city)
element = df.iloc[1, 2]
print(f"Element at row 1, column 2: {element}")

Output:

Element at row 1, column 2: Boston

Select Range of Rows and Columns by Position

python
# Select rows 1-3 and columns 1-3
subset = df.iloc[1:4, 1:4]
print(subset)

Output:

   Age     City  Salary
 27   Boston   65000
 22  Chicago   53000
 32   Denver   98000

Boolean Indexing (Filtering)

One of the most powerful ways to select data is by using boolean conditions:

Filter by a Single Condition

python
# Select people who are older than 25
older_than_25 = df[df['Age'] > 25]
print(older_than_25)

Output:

    Name  Age           City  Salary  Department
  Bob   27         Boston   65000          IT
David   32         Denver   98000  Engineering
Emily   29  San Francisco   84000     Product

Filter by Multiple Conditions

python
# Select people who are older than 25 AND earn more than 70000
filtered = df[(df['Age'] > 25) & (df['Salary'] > 70000)]
print(filtered)

Output:

    Name  Age           City  Salary  Department
3  David   32         Denver   98000  Engineering
4  Emily   29  San Francisco   84000     Product

Setting a Custom Index

So far we've used the default integer index, but you can set any column as the index:

python
# Set 'Name' as the index
df_indexed = df.set_index('Name')
print(df_indexed)

Output:

         Age           City  Salary  Department
Name                                          
Alice     24       New York   72000   Marketing
Bob       27         Boston   65000          IT
Charlie   22        Chicago   53000       Sales
David     32         Denver   98000  Engineering
Emily     29  San Francisco   84000     Product

Now you can select rows directly using names:

python
# Get data for Bob
bob_data = df_indexed.loc['Bob']
print(bob_data)

Output:

Age                27
City           Boston
Salary          65000
Department         IT
Name: Bob, dtype: object

Practical Application: Data Analysis

Let's see how these indexing techniques can be used in a real data analysis scenario:

python
# 1. Calculate average salary by age group
def age_group(age):
    if age < 25:
        return 'Young'
    elif age < 30:
        return 'Mid'
    else:
        return 'Senior'

# Add age group column
df['AgeGroup'] = df['Age'].apply(age_group)

# Group by age group and calculate average salary
avg_salary = df.groupby('AgeGroup')['Salary'].mean()
print("Average salary by age group:")
print(avg_salary)

# 2. Find highest paid employee in each department
highest_paid = df.loc[df.groupby('Department')['Salary'].idxmax()]
print("\nHighest paid employee in each department:")
print(highest_paid[['Department', 'Name', 'Salary']])

Output:

Average salary by age group:
AgeGroup
Mid        74500.0
Senior     98000.0
Young      62500.0
Name: Salary, dtype: float64

Highest paid employee in each department:
    Department   Name  Salary
3  Engineering  David   98000
1           IT    Bob   65000
0    Marketing  Alice   72000
4      Product  Emily   84000
2        Sales  Charlie   53000

Summary

In this tutorial, we've covered the basics of Pandas indexing:

Basic indexing using [] for column selection and row slicing
Label-based indexing with .loc for selecting by row and column labels
Position-based indexing with .iloc for selecting by integer position
Boolean indexing for filtering data based on conditions
Setting custom indices for easier data access

These indexing techniques form the foundation of data manipulation in Pandas and are essential skills for any data analysis work.

Exercises

To reinforce your learning, try these exercises:

Create a DataFrame with your own data and practice all indexing methods.
Select all employees in the Marketing or IT departments.
Find all employees whose name starts with a letter after 'C' in the alphabet.
Create a new column called "Bonus" that is 10% of salary for people in Engineering and 5% for everyone else.
Reset the index of df_indexed back to the default integer index.

Additional Resources

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

Introduction​

Setting Up Our Example Data​

Basic Indexing with []​

Selecting Columns​

Selecting Rows by Position​

Selecting with .loc (Label-based)​

Select by Row Label​

Select by Row and Column Labels​

Select Range of Rows and Columns​

Selecting with .iloc (Position-based)​

Select by Row Position​

Select by Row and Column Position​

Select Range of Rows and Columns by Position​

Boolean Indexing (Filtering)​

Filter by a Single Condition​

Filter by Multiple Conditions​

Setting a Custom Index​

Practical Application: Data Analysis​

Summary​

Exercises​

Additional Resources​

Introduction

Setting Up Our Example Data

Basic Indexing with `[]`

Selecting Columns

Selecting Rows by Position

Selecting with `.loc` (Label-based)

Select by Row Label

Select by Row and Column Labels

Select Range of Rows and Columns

Selecting with `.iloc` (Position-based)

Select by Row Position

Select by Row and Column Position

Select Range of Rows and Columns by Position

Boolean Indexing (Filtering)

Filter by a Single Condition

Filter by Multiple Conditions

Setting a Custom Index

Practical Application: Data Analysis

Summary

Exercises

Additional Resources