Pandas Column Selection
In data analysis with pandas, one of the most common operations is selecting and working with specific columns in a DataFrame. Knowing how to properly select columns will help you efficiently analyze and transform your data.
Introduction
A pandas DataFrame is a two-dimensional data structure with rows and columns, similar to a table in a database or a spreadsheet. Column selection is the process of accessing specific columns of data within your DataFrame, allowing you to focus on particular variables or features.
In this tutorial, we'll explore various methods to select columns in pandas DataFrames, from basic to more advanced techniques.
Basic Column Selection
Selecting a Single Column
The simplest way to select a single column is by using the bracket notation with the column name:
import pandas as pd
# Create a sample DataFrame
data = {
'Name': ['John', 'Emma', 'Sam', 'Lisa', 'Tom'],
'Age': [28, 32, 24, 35, 29],
'City': ['New York', 'Boston', 'Chicago', 'Denver', 'Seattle'],
'Salary': [65000, 72000, 54000, 80000, 69000]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Select the 'Name' column
names = df['Name']
print("\nSelected 'Name' column:")
print(names)
Output:
Original DataFrame:
Name Age City Salary
0 John 28 New York 65000
1 Emma 32 Boston 72000
2 Sam 24 Chicago 54000
3 Lisa 35 Denver 80000
4 Tom 29 Seattle 69000
Selected 'Name' column:
0 John
1 Emma
2 Sam
3 Lisa
4 Tom
Name: Name, dtype: object
When you select a single column, pandas returns a Series object (a one-dimensional array with axis labels).
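The difference between a Series and a one-column DataFrame is easy to miss, so here is a minimal sketch of it. The small DataFrame below is a shortened stand-in for the sample data, and the variable names as_series and as_frame are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Sam'],
    'Age': [28, 32, 24],
})

# Single brackets with one name: a one-dimensional Series
as_series = df['Name']

# Double brackets (a one-item list): a DataFrame with a single column
as_frame = df[['Name']]

print(type(as_series).__name__)         # Series
print(type(as_frame).__name__)          # DataFrame
print(as_series.shape, as_frame.shape)  # (3,) (3, 1)
```

Use the double-bracket form when later code expects a two-dimensional object, even if only one column is involved.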
Using Dot Notation
For columns whose names are valid Python identifiers (no spaces or special characters) and don't clash with an existing DataFrame attribute or method, you can also use dot notation:
# Using dot notation
ages = df.Age
print("\nSelected 'Age' column using dot notation:")
print(ages)
Output:
Selected 'Age' column using dot notation:
0 28
1 32
2 24
3 35
4 29
Name: Age, dtype: int64
While dot notation is more concise, bracket notation [] is recommended because:
- It works with column names containing spaces or special characters
- It avoids conflicts with DataFrame method names
- It's less prone to errors when column names match Python keywords
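The first two points can be seen in a small sketch. The column names 'count' and 'unit price' are hypothetical, chosen only because they trigger the problems described.

```python
import pandas as pd

# Hypothetical column names chosen to trigger each pitfall
df = pd.DataFrame({
    'count': [3, 5, 2],               # same name as the DataFrame.count method
    'unit price': [1.5, 2.0, 4.25],   # contains a space
})

# Dot notation finds the method, not the column
print(callable(df.count))         # True

# Bracket notation returns the column in both cases
print(df['count'].tolist())       # [3, 5, 2]
print(df['unit price'].tolist())  # [1.5, 2.0, 4.25]
```

A column named after a Python keyword, such as class, cannot be reached with dot notation at all, because df.class is a syntax error.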
Selecting Multiple Columns
Using a List of Column Names
To select multiple columns, pass a list of column names inside the brackets:
# Select multiple columns
selected_columns = df[['Name', 'Salary']]
print("\nSelected 'Name' and 'Salary' columns:")
print(selected_columns)
Output:
Selected 'Name' and 'Salary' columns:
Name Salary
0 John 65000
1 Emma 72000
2 Sam 54000
3 Lisa 80000
4 Tom 69000
Notice that when selecting multiple columns, pandas returns a DataFrame (not a Series).
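One thing to watch for with list selection: if any name in the list is not a column, pandas raises a KeyError. The sketch below shows that behavior and one possible guard. The wanted list and the 'Bonus' name are assumptions made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma'],
    'Age': [28, 32],
    'Salary': [65000, 72000],
})

wanted = ['Name', 'Bonus', 'Salary']  # 'Bonus' is not a column of df

# A list containing an unknown name raises KeyError
try:
    df[wanted]
except KeyError as exc:
    print("KeyError:", exc)

# One way to guard: keep only the names that exist, in the requested order
present = [col for col in wanted if col in df.columns]
subset = df[present]
print(subset.columns.tolist())  # ['Name', 'Salary']
```

Whether silently skipping missing names is appropriate depends on the task. In many pipelines, letting the KeyError surface is the safer choice.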
Reordering Columns
You can also use column selection to reorder columns in your DataFrame:
# Reorder columns
reordered_df = df[['Salary', 'Name', 'Age', 'City']]
print("\nDataFrame with reordered columns:")
print(reordered_df)
Output:
DataFrame with reordered columns:
Salary Name Age City
0 65000 John 28 New York
1 72000 Emma 32 Boston
2 54000 Sam 24 Chicago
3 80000 Lisa 35 Denver
4 69000 Tom 29 Seattle
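Typing every column name gets tedious on wide DataFrames. As a rough sketch, the new order can be built as a list first. The helper names first and rest are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma'],
    'Age': [28, 32],
    'City': ['New York', 'Boston'],
    'Salary': [65000, 72000],
})

# Put 'Salary' first and keep the remaining columns in their current order
first = ['Salary']
rest = [col for col in df.columns if col not in first]
reordered = df[first + rest]
print(reordered.columns.tolist())     # ['Salary', 'Name', 'Age', 'City']

# Sort the columns alphabetically by name
alphabetical = df[sorted(df.columns)]
print(alphabetical.columns.tolist())  # ['Age', 'City', 'Name', 'Salary']
```

In both cases the result is a new DataFrame, and df keeps its original column order.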
Advanced Column Selection Techniques
Column Selection Using loc and iloc
Pandas provides two powerful indexers: loc (label-based) and iloc (integer position-based). While they're often used for row selection, they can also select columns:
# Select columns using loc
cols_loc = df.loc[:, ['Name', 'City']]
print("\nColumns selected using loc:")
print(cols_loc)
# Select columns using iloc (by position)
cols_iloc = df.iloc[:, [0, 2]] # First and third columns
print("\nColumns selected using iloc:")
print(cols_iloc)
Output:
Columns selected using loc:
Name City
0 John New York
1 Emma Boston
2 Sam Chicago
3 Lisa Denver
4 Tom Seattle
Columns selected using iloc:
Name City
0 John New York
1 Emma Boston
2 Sam Chicago
3 Lisa Denver
4 Tom Seattle
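Both indexers also accept slices, and the two treat the end of a slice differently. The sketch below uses a two-row version of the sample data to show this.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma'],
    'Age': [28, 32],
    'City': ['New York', 'Boston'],
    'Salary': [65000, 72000],
})

# loc slices by label and INCLUDES the end label
by_label = df.loc[:, 'Name':'City']
print(by_label.columns.tolist())     # ['Name', 'Age', 'City']

# iloc slices by position and EXCLUDES the end position
by_position = df.iloc[:, 0:2]
print(by_position.columns.tolist())  # ['Name', 'Age']

# A single label (not a list) returns a Series, as with df['Age']
age = df.loc[:, 'Age']
print(type(age).__name__)            # Series
```

Label slices depend on the current column order, so the same slice can return different columns after a reorder.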
Selecting Columns by Data Type
Sometimes you might want to select columns based on their data type:
# Select all numeric columns
numeric_cols = df.select_dtypes(include=['number'])
print("\nNumeric columns:")
print(numeric_cols)
# Select all string (object) columns
string_cols = df.select_dtypes(include=['object'])
print("\nString columns:")
print(string_cols)
Output:
Numeric columns:
Age Salary
0 28 65000
1 32 72000
2 24 54000
3 35 80000
4 29 69000
String columns:
Name City
0 John New York
1 Emma Boston
2 Sam Chicago
3 Lisa Denver
4 Tom Seattle
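select_dtypes also takes an exclude argument, and include accepts several types at once. The sketch below assumes a small DataFrame with an extra boolean column, 'Active', which is not part of the original sample data.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma'],
    'Age': [28, 32],
    'Salary': [65000.0, 72000.0],
    'Active': [True, False],   # hypothetical boolean column
})

# Everything except numeric columns (booleans do not count as 'number')
non_numeric = df.select_dtypes(exclude=['number'])
print(non_numeric.columns.tolist())      # ['Name', 'Active']

# Only boolean columns
flags = df.select_dtypes(include=['bool'])
print(flags.columns.tolist())            # ['Active']

# Several types at once; the original column order is kept
numeric_or_bool = df.select_dtypes(include=['number', 'bool'])
print(numeric_or_bool.columns.tolist())  # ['Age', 'Salary', 'Active']
```

Using exclude=['number'] to find text columns can be more robust than include=['object'], since the dtype pandas uses for strings may differ between versions and settings.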
Filtering Columns with Conditional Logic
You can use conditional logic to filter columns based on their names:
# Get DataFrame's column names
col_names = df.columns.tolist()
print("\nColumn names:", col_names)
# Select columns that start with 'S'
s_cols = [col for col in df.columns if col.startswith('S')]
selected_s_cols = df[s_cols]
print("\nColumns starting with 'S':")
print(selected_s_cols)
Output:
Column names: ['Name', 'Age', 'City', 'Salary']
Columns starting with 'S':
Salary
0 65000
1 72000
2 54000
3 80000
4 69000
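The list comprehension is not the only way to filter by name. As an alternative sketch, the same kind of filter can be written with a boolean mask on df.columns or with DataFrame.filter. The patterns used here are arbitrary examples.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma'],
    'Age': [28, 32],
    'City': ['New York', 'Boston'],
    'Salary': [65000, 72000],
})

# Boolean mask built from the column index, applied with loc
starts_with_s = df.loc[:, df.columns.str.startswith('S')]
print(starts_with_s.columns.tolist())  # ['Salary']

# DataFrame.filter with a regular expression matched against column names
ends_with_e = df.filter(regex='e$')
print(ends_with_e.columns.tolist())    # ['Name', 'Age']

# DataFrame.filter with a case-sensitive substring match
contains_a = df.filter(like='a')
print(contains_a.columns.tolist())     # ['Name', 'Salary']
```

Note that DataFrame.filter works on labels only; it never looks at the values inside the columns.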
Practical Examples
Example 1: Data Analysis on Specific Features
When analyzing data, you'll often want to focus on specific features:
# Create a more complex dataset
data = {
'Name': ['John', 'Emma', 'Sam', 'Lisa', 'Tom'],
'Age': [28, 32, 24, 35, 29],
'Experience': [5, 8, 3, 10, 7],
'Department': ['Sales', 'Engineering', 'Marketing', 'Engineering', 'Sales'],
'Salary': [65000, 72000, 54000, 80000, 69000],
'Bonus': [5000, 7000, 3000, 8000, 5500]
}
employee_df = pd.DataFrame(data)
# Analyze compensation columns only
compensation_df = employee_df[['Salary', 'Bonus']]
total_comp = compensation_df.sum(axis=1)
employee_df['Total_Comp'] = total_comp
print("Employee compensation analysis:")
print(employee_df[['Name', 'Salary', 'Bonus', 'Total_Comp']])
Output:
Employee compensation analysis:
Name Salary Bonus Total_Comp
0 John 65000 5000 70000
1 Emma 72000 7000 79000
2 Sam 54000 3000 57000
3 Lisa 80000 8000 88000
4 Tom 69000 5500 74500
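To make the row-wise sum concrete, the sketch below checks on a three-row subset of the data that sum(axis=1) over the selected columns gives the same totals as adding the two columns directly. The names via_sum and via_add are illustrative.

```python
import pandas as pd

employee_df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Sam'],
    'Salary': [65000, 72000, 54000],
    'Bonus': [5000, 7000, 3000],
})

# Row-wise sum over the selected columns
via_sum = employee_df[['Salary', 'Bonus']].sum(axis=1)

# The same totals computed by adding the two Series directly
via_add = employee_df['Salary'] + employee_df['Bonus']

print(via_sum.tolist())         # [70000, 79000, 57000]
print(via_sum.equals(via_add))  # True
```

The two approaches differ when values are missing: sum(axis=1) skips NaN by default, while + returns NaN for any row where either value is missing.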
Example 2: Feature Engineering
Column selection is crucial for feature engineering in data science projects:
# Creating derived features from selected columns
engineering_features = employee_df[['Age', 'Experience']]
employee_df['Experience_Ratio'] = engineering_features['Experience'] / engineering_features['Age']
print("\nFeature engineering example:")
print(employee_df[['Name', 'Age', 'Experience', 'Experience_Ratio']])
Output:
Feature engineering example:
Name Age Experience Experience_Ratio
0 John 28 5 0.178571
1 Emma 32 8 0.250000
2 Sam 24 3 0.125000
3 Lisa 35 10 0.285714
4 Tom 29 7 0.241379
Summary
In this tutorial, we've covered various methods to select columns in pandas:
- Basic selection with brackets df['column_name'] and dot notation df.column_name
- Selecting multiple columns with lists df[['col1', 'col2']]
- Advanced selection with loc and iloc
- Selecting columns by data type using select_dtypes()
- Filtering columns using conditional logic
- Practical applications in data analysis and feature engineering
Mastering column selection techniques allows you to efficiently work with your data and focus on the variables that matter for your analysis.
Exercises
- Create a DataFrame with at least 5 columns of different data types, then select only the numeric columns.
- Write code to select columns whose names are longer than 5 characters.
- Create a DataFrame and practice reordering its columns in reverse order.
- Select columns based on their position (first and last columns only) using iloc.
- Take a real dataset (like the Titanic dataset from seaborn) and practice selecting different columns for analysis.
Additional Resources
- Pandas Documentation: Indexing and Selecting Data
- 10 Minutes to pandas: Selection
- Pandas Cheat Sheet on Column Selection
Happy data analyzing with pandas!