Pandas Column Selection

In data analysis with pandas, one of the most common operations is selecting and working with specific columns in a DataFrame. Knowing how to properly select columns will help you efficiently analyze and transform your data.

Introduction

A pandas DataFrame is a two-dimensional data structure with rows and columns, similar to a table in a database or a spreadsheet. Column selection is the process of accessing specific columns of data within your DataFrame, allowing you to focus on particular variables or features.

In this tutorial, we'll explore various methods to select columns in pandas DataFrames, from basic to more advanced techniques.

Basic Column Selection

Selecting a Single Column

The simplest way to select a single column is by using the bracket notation with the column name:

import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['John', 'Emma', 'Sam', 'Lisa', 'Tom'],
    'Age': [28, 32, 24, 35, 29],
    'City': ['New York', 'Boston', 'Chicago', 'Denver', 'Seattle'],
    'Salary': [65000, 72000, 54000, 80000, 69000]
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Select the 'Name' column
names = df['Name']
print("\nSelected 'Name' column:")
print(names)

Output:

Original DataFrame:
   Name  Age      City  Salary
0  John   28  New York   65000
1  Emma   32    Boston   72000
2   Sam   24   Chicago   54000
3  Lisa   35    Denver   80000
4   Tom   29   Seattle   69000

Selected 'Name' column:
0    John
1    Emma
2     Sam
3    Lisa
4     Tom
Name: Name, dtype: object

When you select a single column, pandas returns a Series object (a one-dimensional array with axis labels).
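
As a minimal sketch of what "a Series with axis labels" means in practice, the snippet below inspects a selected column. It uses a shortened, three-row version of the sample data, which is an assumption made here for brevity.

```python
import pandas as pd

# Shortened version of the tutorial's sample data (assumption for brevity)
df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Sam'],
    'Age': [28, 32, 24],
})

names = df['Name']

print(type(names).__name__)   # Series
print(names.name)             # Name (the column label becomes the Series name)
print(names.index.tolist())   # [0, 1, 2] (same row labels as the DataFrame)
print(names.iloc[0])          # John
```

The Series keeps the DataFrame's row index, so values selected this way can still be aligned with the other columns later.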

Using Dot Notation

If a column name is a valid Python identifier (no spaces or special characters) and does not clash with an existing DataFrame attribute or method, you can also use dot notation:

# Using dot notation
ages = df.Age
print("\nSelected 'Age' column using dot notation:")
print(ages)

Output:

Selected 'Age' column using dot notation:
0    28
1    32
2    24
3    35
4    29
Name: Age, dtype: int64

Caution

Dot notation is more concise, but bracket notation [] is the safer default because:

It works with column names containing spaces or special characters
It avoids conflicts with DataFrame method names
It's less prone to errors when column names match Python keywords
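
The pitfalls listed above can be reproduced with a small sketch. The column names 'count' and 'unit price' are hypothetical and were chosen only to trigger each problem.

```python
import pandas as pd

# Hypothetical columns chosen to trigger the pitfalls
df = pd.DataFrame({
    'count': [3, 5, 2],               # same name as the DataFrame.count() method
    'unit price': [9.5, 4.0, 7.25],   # contains a space
})

# Attribute access finds the method first, not the column
print(callable(df.count))          # True

# Bracket notation always returns the column
print(df['count'].tolist())        # [3, 5, 2]

# df.unit price would be a SyntaxError; brackets handle the space
print(df['unit price'].tolist())   # [9.5, 4.0, 7.25]

# A column named after a Python keyword (for example 'class') has the same
# problem: df.class is a SyntaxError, while df['class'] works.
```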

Selecting Multiple Columns

Using a List of Column Names

To select multiple columns, pass a list of column names inside the brackets:

# Select multiple columns
selected_columns = df[['Name', 'Salary']]
print("\nSelected 'Name' and 'Salary' columns:")
print(selected_columns)

Output:

Selected 'Name' and 'Salary' columns:
   Name  Salary
0  John   65000
1  Emma   72000
2   Sam   54000
3  Lisa   80000
4   Tom   69000

Notice that when selecting multiple columns, pandas returns a DataFrame (not a Series).
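
The return type depends on the kind of key, not on how many columns come back. A short sketch, using a two-row frame assumed here for illustration, shows that a one-item list still yields a DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Emma'], 'Salary': [65000, 72000]})

as_series = df['Salary']     # string key -> Series
as_frame = df[['Salary']]    # list key   -> DataFrame with one column

print(type(as_series).__name__, as_series.shape)   # Series (2,)
print(type(as_frame).__name__, as_frame.shape)     # DataFrame (2, 1)
```

This matters when a later step expects two-dimensional input, since the one-column DataFrame keeps its column axis and the Series does not.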

Reordering Columns

You can also use column selection to reorder columns in your DataFrame:

# Reorder columns
reordered_df = df[['Salary', 'Name', 'Age', 'City']]
print("\nDataFrame with reordered columns:")
print(reordered_df)

Output:

DataFrame with reordered columns:
   Salary  Name  Age      City
0   65000  John   28  New York
1   72000  Emma   32    Boston
2   54000   Sam   24   Chicago
3   80000  Lisa   35    Denver
4   69000   Tom   29   Seattle
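
Typing every column name by hand gets tedious for wide DataFrames. One possible approach, sketched here with a two-row version of the sample data (an assumption for brevity), is to build the column list programmatically:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma'],
    'Age': [28, 32],
    'City': ['New York', 'Boston'],
    'Salary': [65000, 72000],
})

# Move 'Salary' to the front and keep the other columns in their original order
front = ['Salary']
rest = [col for col in df.columns if col not in front]
salary_first = df[front + rest]
print(salary_first.columns.tolist())   # ['Salary', 'Name', 'Age', 'City']

# Sort the columns alphabetically
alphabetical = df[sorted(df.columns)]
print(alphabetical.columns.tolist())   # ['Age', 'City', 'Name', 'Salary']
```

Both selections return new DataFrames; the column order of df itself is unchanged.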

Advanced Column Selection Techniques

Column Selection Using loc and iloc

Pandas provides two powerful indexers: loc (label-based) and iloc (integer position-based). While they're often used for row selection, they can also select columns:

# Select columns using loc
cols_loc = df.loc[:, ['Name', 'City']]
print("\nColumns selected using loc:")
print(cols_loc)

# Select columns using iloc (by position)
cols_iloc = df.iloc[:, [0, 2]]  # First and third columns
print("\nColumns selected using iloc:")
print(cols_iloc)

Output:

Columns selected using loc:
   Name      City
0  John  New York
1  Emma    Boston
2   Sam   Chicago
3  Lisa    Denver
4   Tom   Seattle

Columns selected using iloc:
   Name      City
0  John  New York
1  Emma    Boston
2   Sam   Chicago
3  Lisa    Denver
4   Tom   Seattle
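
Both indexers also accept slices, with one difference that is easy to miss: label slices in loc include the end label, while position slices in iloc exclude the end position. A minimal sketch, assuming a two-row version of the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma'],
    'Age': [28, 32],
    'City': ['New York', 'Boston'],
    'Salary': [65000, 72000],
})

# Label slice: both endpoints are included
label_slice = df.loc[:, 'Name':'City']
print(label_slice.columns.tolist())   # ['Name', 'Age', 'City']

# Position slice: the end position is excluded, as in normal Python slicing
pos_slice = df.iloc[:, 1:3]
print(pos_slice.columns.tolist())     # ['Age', 'City']
```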

Selecting Columns by Data Type

Sometimes you might want to select columns based on their data type:

# Select all numeric columns
numeric_cols = df.select_dtypes(include=['number'])
print("\nNumeric columns:")
print(numeric_cols)

# Select all string (object) columns
string_cols = df.select_dtypes(include=['object'])
print("\nString columns:")
print(string_cols)

Output:

Numeric columns:
   Age  Salary
0   28   65000
1   32   72000
2   24   54000
3   35   80000
4   29   69000

String columns:
   Name      City
0  John  New York
1  Emma    Boston
2   Sam   Chicago
3  Lisa    Denver
4   Tom   Seattle
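
select_dtypes() also takes an exclude argument, and include accepts several types at once. The sketch below adds a hypothetical boolean column, 'Remote', which is not part of the tutorial's data, to show that 'number' does not match booleans:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Sam'],
    'Age': [28, 32, 24],
    'Salary': [65000.0, 72000.0, 54000.0],
    'Remote': [True, False, True],   # hypothetical column, for illustration only
})

# Everything that is not numeric (booleans are not matched by 'number')
non_numeric = df.select_dtypes(exclude=['number'])
print(non_numeric.columns.tolist())       # ['Name', 'Remote']

# Several types in one call
numeric_or_bool = df.select_dtypes(include=['number', 'bool'])
print(numeric_or_bool.columns.tolist())   # ['Age', 'Salary', 'Remote']
```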

Filtering Columns with Conditional Logic

You can use conditional logic to filter columns based on their names:

# Get DataFrame's column names
col_names = df.columns.tolist()
print("\nColumn names:", col_names)

# Select columns that start with 'S'
s_cols = [col for col in df.columns if col.startswith('S')]
selected_s_cols = df[s_cols]
print("\nColumns starting with 'S':")
print(selected_s_cols)

Output:

Column names: ['Name', 'Age', 'City', 'Salary']

Columns starting with 'S':
   Salary
0   65000
1   72000
2   54000
3   80000
4   69000
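
Besides a list comprehension, pandas offers built-in ways to select columns by name pattern. The sketch below shows two alternatives, a boolean mask over df.columns and DataFrame.filter(), on a two-row version of the sample data (an assumption for brevity):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma'],
    'Age': [28, 32],
    'City': ['New York', 'Boston'],
    'Salary': [65000, 72000],
})

# Boolean mask built from the column index, passed to loc
s_mask = df.columns.str.startswith('S')
masked = df.loc[:, s_mask]
print(masked.columns.tolist())       # ['Salary']

# filter(like=...) keeps columns whose name contains the substring
contains_a = df.filter(like='a')
print(contains_a.columns.tolist())   # ['Name', 'Salary']

# filter(regex=...) keeps columns whose name matches the pattern
ends_with_y = df.filter(regex='y$')
print(ends_with_y.columns.tolist())  # ['City', 'Salary']
```

Note that the substring match is case-sensitive, which is why 'Age' is not selected by like='a'.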

Practical Examples

Example 1: Data Analysis on Specific Features

When analyzing data, you'll often want to focus on a few specific features:

# Create a more complex dataset
data = {
    'Name': ['John', 'Emma', 'Sam', 'Lisa', 'Tom'],
    'Age': [28, 32, 24, 35, 29],
    'Experience': [5, 8, 3, 10, 7],
    'Department': ['Sales', 'Engineering', 'Marketing', 'Engineering', 'Sales'],
    'Salary': [65000, 72000, 54000, 80000, 69000],
    'Bonus': [5000, 7000, 3000, 8000, 5500]
}

employee_df = pd.DataFrame(data)

# Analyze compensation columns only
compensation_df = employee_df[['Salary', 'Bonus']]
total_comp = compensation_df.sum(axis=1)
employee_df['Total_Comp'] = total_comp

print("Employee compensation analysis:")
print(employee_df[['Name', 'Salary', 'Bonus', 'Total_Comp']])

Output:

Employee compensation analysis:
   Name  Salary  Bonus  Total_Comp
0  John   65000   5000       70000
1  Emma   72000   7000       79000
2   Sam   54000   3000       57000
3  Lisa   80000   8000       88000
4   Tom   69000   5500       74500
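
Column selection also combines with grouping. As a hedged extension of this example (the per-department average is an illustration added here, not part of the original analysis), a list of columns can be selected after groupby() so that only those columns are aggregated:

```python
import pandas as pd

employee_df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Sam', 'Lisa', 'Tom'],
    'Department': ['Sales', 'Engineering', 'Marketing', 'Engineering', 'Sales'],
    'Salary': [65000, 72000, 54000, 80000, 69000],
    'Bonus': [5000, 7000, 3000, 8000, 5500],
})

# Select the compensation columns after grouping, then average per department
avg_comp = employee_df.groupby('Department')[['Salary', 'Bonus']].mean()

print(avg_comp.index.tolist())             # ['Engineering', 'Marketing', 'Sales']
print(avg_comp.loc['Sales', 'Salary'])     # 67000.0
print(avg_comp.loc['Engineering', 'Bonus'])  # 7500.0
```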

Example 2: Feature Engineering

Feature engineering usually starts by selecting the columns that a derived feature depends on:

# Creating derived features from selected columns
engineering_features = employee_df[['Age', 'Experience']]
employee_df['Experience_Ratio'] = engineering_features['Experience'] / engineering_features['Age']

print("\nFeature engineering example:")
print(employee_df[['Name', 'Age', 'Experience', 'Experience_Ratio']])

Output:

Feature engineering example:
   Name  Age  Experience  Experience_Ratio
0  John   28           5          0.178571
1  Emma   32           8          0.250000
2   Sam   24           3          0.125000
3  Lisa   35          10          0.285714
4   Tom   29           7          0.241379
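
If you would rather not modify the original DataFrame, one alternative is DataFrame.assign(), which returns a copy with the new column added. This is a sketch of that variant, not the approach used above; the data is a three-row subset assumed for brevity.

```python
import pandas as pd

employee_df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Sam'],
    'Age': [28, 32, 24],
    'Experience': [5, 8, 3],
})

# assign() returns a new DataFrame and leaves employee_df untouched
with_ratio = employee_df.assign(
    Experience_Ratio=employee_df['Experience'] / employee_df['Age']
)

print('Experience_Ratio' in employee_df.columns)   # False
print(with_ratio['Experience_Ratio'].tolist()[1:]) # [0.25, 0.125]
```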

Summary

In this tutorial, we've covered various methods to select columns in pandas:

Basic selection with brackets df['column_name'] and dot notation df.column_name
Selecting multiple columns with lists df[['col1', 'col2']]
Advanced selection with loc and iloc
Selecting columns by data type using select_dtypes()
Filtering columns using conditional logic
Practical applications in data analysis and feature engineering

Mastering column selection techniques allows you to efficiently work with your data and focus on the variables that matter for your analysis.

Exercises

Create a DataFrame with at least 5 columns of different data types, then select only the numeric columns.
Write code to select columns whose names are longer than 5 characters.
Create a DataFrame and practice reordering its columns in reverse order.
Select columns based on their position (first and last columns only) using iloc.
Take a real dataset (like the Titanic dataset from seaborn) and practice selecting different columns for analysis.

Happy data analyzing with pandas!