Pandas Column Selection

In data analysis with pandas, one of the most common operations is selecting and working with specific columns in a DataFrame. Knowing how to properly select columns will help you efficiently analyze and transform your data.

Introduction

A pandas DataFrame is a two-dimensional data structure with rows and columns, similar to a table in a database or a spreadsheet. Column selection is the process of accessing specific columns of data within your DataFrame, allowing you to focus on particular variables or features.

In this tutorial, we'll explore various methods to select columns in pandas DataFrames, from basic to more advanced techniques.

Basic Column Selection

Selecting a Single Column

The simplest way to select a single column is by using the bracket notation with the column name:

python
import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['John', 'Emma', 'Sam', 'Lisa', 'Tom'],
    'Age': [28, 32, 24, 35, 29],
    'City': ['New York', 'Boston', 'Chicago', 'Denver', 'Seattle'],
    'Salary': [65000, 72000, 54000, 80000, 69000]
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Select the 'Name' column
names = df['Name']
print("\nSelected 'Name' column:")
print(names)

Output:

Original DataFrame:
   Name  Age      City  Salary
0  John   28  New York   65000
1  Emma   32    Boston   72000
2   Sam   24   Chicago   54000
3  Lisa   35    Denver   80000
4   Tom   29   Seattle   69000

Selected 'Name' column:
0    John
1    Emma
2     Sam
3    Lisa
4     Tom
Name: Name, dtype: object

When you select a single column, pandas returns a Series object (a one-dimensional array with axis labels).
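
The difference between a Series and a one-column DataFrame is easy to check directly. The following is a minimal sketch on a small made-up frame (the data is illustrative, not the tutorial's full dataset); it compares single brackets with a string against brackets around a one-item list:

```python
import pandas as pd

# Small illustrative frame (values are made up for this sketch)
df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Sam'],
    'Age': [28, 32, 24],
})

as_series = df['Name']    # single brackets with a string -> Series
as_frame = df[['Name']]   # brackets around a one-item list -> DataFrame

print(type(as_series).__name__, as_series.shape)  # Series (3,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (3, 1)
```

The one-column DataFrame form is handy when later code expects two-dimensional input.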

Using Dot Notation

For columns whose names are valid Python identifiers (no spaces or special characters) and do not clash with an existing DataFrame attribute or method, you can also use dot notation:

python
# Using dot notation
ages = df.Age
print("\nSelected 'Age' column using dot notation:")
print(ages)

Output:

Selected 'Age' column using dot notation:
0    28
1    32
2    24
3    35
4    29
Name: Age, dtype: int64

Caution

While the dot notation is more concise, it's recommended to use bracket notation [] because:

  1. It works with column names containing spaces or special characters
  2. It avoids conflicts with DataFrame method names
  3. It still works when a column name is a Python keyword (such as class), where dot notation is a syntax error
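
To make these three points concrete, here is a small sketch using a hypothetical DataFrame whose column names were chosen to trigger each problem: 'count' (also a DataFrame method), 'unit price' (contains a space), and 'class' (a Python keyword). None of these columns appear in the tutorial's dataset.

```python
import pandas as pd

# Hypothetical columns picked to show where dot notation breaks down
df = pd.DataFrame({
    'count': [3, 5, 2],               # same name as the DataFrame.count method
    'unit price': [1.5, 2.0, 0.75],   # contains a space
    'class': ['a', 'b', 'a'],         # Python keyword
})

# df.count resolves to the method, not the column
method_wins = callable(df.count)
print(method_wins)                    # True

# Bracket notation reaches all three columns without trouble
counts = df['count'].tolist()
total_price = df['unit price'].sum()
classes = df['class'].tolist()
print(counts)                         # [3, 5, 2]
print(total_price)                    # 4.25
print(classes)                        # ['a', 'b', 'a']

# df.unit price and df.class cannot even be written: both are syntax errors
```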

Selecting Multiple Columns

Using a List of Column Names

To select multiple columns, pass a list of column names inside the brackets:

python
# Select multiple columns
selected_columns = df[['Name', 'Salary']]
print("\nSelected 'Name' and 'Salary' columns:")
print(selected_columns)

Output:

Selected 'Name' and 'Salary' columns:
   Name  Salary
0  John   65000
1  Emma   72000
2   Sam   54000
3  Lisa   80000
4   Tom   69000

Notice that when selecting multiple columns, pandas returns a DataFrame (not a Series).
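
One thing to watch for with list selection: if any name in the list is not a column, current pandas versions raise a KeyError. The sketch below assumes a hypothetical wish list (wanted) that includes a 'Bonus' column the frame does not have, and shows one possible way to keep only the names that exist:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma'],
    'Age': [28, 32],
    'Salary': [65000, 72000],
})

# Hypothetical wish list; 'Bonus' is deliberately missing from df
wanted = ['Name', 'Salary', 'Bonus']

try:
    df[wanted]
    raised = False
except KeyError as exc:
    raised = True
    print("Missing column(s):", exc)

# Keep only the requested names that are actually present
present = [col for col in wanted if col in df.columns]
subset = df[present]
print(subset.columns.tolist())  # ['Name', 'Salary']
```

Whether silently dropping missing names is appropriate depends on the analysis; in many cases letting the KeyError surface is the safer choice.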

Reordering Columns

You can also use column selection to reorder columns in your DataFrame:

python
# Reorder columns
reordered_df = df[['Salary', 'Name', 'Age', 'City']]
print("\nDataFrame with reordered columns:")
print(reordered_df)

Output:

DataFrame with reordered columns:
   Salary  Name  Age      City
0   65000  John   28  New York
1   72000  Emma   32    Boston
2   54000   Sam   24   Chicago
3   80000  Lisa   35    Denver
4   69000   Tom   29   Seattle
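
Typing out every column name gets tedious on wide DataFrames. As a rough sketch, the new order can be built from df.columns instead; here one column is moved to the front while the rest keep their original order (the variable names front and rest are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma'],
    'Age': [28, 32],
    'City': ['New York', 'Boston'],
    'Salary': [65000, 72000],
})

# Columns to move to the front, followed by everything else in original order
front = ['Salary']
rest = [col for col in df.columns if col not in front]
salary_first = df[front + rest]

print(salary_first.columns.tolist())  # ['Salary', 'Name', 'Age', 'City']
print(df.columns.tolist())            # ['Name', 'Age', 'City', 'Salary']
```

Selection returns a new DataFrame, so the original column order in df is unchanged.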

Advanced Column Selection Techniques

Column Selection Using loc and iloc

Pandas provides two powerful indexers: loc (label-based) and iloc (integer position-based). While they're often used for row selection, they can also select columns:

python
# Select columns using loc
cols_loc = df.loc[:, ['Name', 'City']]
print("\nColumns selected using loc:")
print(cols_loc)

# Select columns using iloc (by position)
cols_iloc = df.iloc[:, [0, 2]] # First and third columns
print("\nColumns selected using iloc:")
print(cols_iloc)

Output:

Columns selected using loc:
   Name      City
0  John  New York
1  Emma    Boston
2   Sam   Chicago
3  Lisa    Denver
4   Tom   Seattle

Columns selected using iloc:
   Name      City
0  John  New York
1  Emma    Boston
2   Sam   Chicago
3  Lisa    Denver
4   Tom   Seattle
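
Both indexers also accept slices, and the two treat the end of the slice differently: loc includes the end label, while iloc excludes the end position, like ordinary Python slicing. A short sketch on a reduced version of the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma'],
    'Age': [28, 32],
    'City': ['New York', 'Boston'],
    'Salary': [65000, 72000],
})

# Label slice: both endpoints are included
by_label = df.loc[:, 'Name':'City']
print(by_label.columns.tolist())     # ['Name', 'Age', 'City']

# Position slice: the end position is excluded
by_position = df.iloc[:, 0:2]
print(by_position.columns.tolist())  # ['Name', 'Age']
```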

Selecting Columns by Data Type

Sometimes you might want to select columns based on their data type:

python
# Select all numeric columns
numeric_cols = df.select_dtypes(include=['number'])
print("\nNumeric columns:")
print(numeric_cols)

# Select all string (object) columns
string_cols = df.select_dtypes(include=['object'])
print("\nString columns:")
print(string_cols)

Output:

Numeric columns:
   Age  Salary
0   28   65000
1   32   72000
2   24   54000
3   35   80000
4   29   69000

String columns:
   Name      City
0  John  New York
1  Emma    Boston
2   Sam   Chicago
3  Lisa    Denver
4   Tom   Seattle
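
select_dtypes also takes an exclude argument, which helps when it is easier to say what you do not want. The sketch below uses a made-up frame with a float column and a boolean column added. It prints df.dtypes first because the dtype reported for text columns can differ between pandas versions, so checking before filtering on a dtype name is a reasonable habit.

```python
import pandas as pd

# Illustrative frame; 'Score' and 'Active' are not part of the tutorial's data
df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Sam'],
    'Age': [28, 32, 24],
    'Score': [88.5, 92.0, 79.25],
    'Active': [True, False, True],
})

# Inspect the dtypes before selecting on them
print(df.dtypes)

# Everything that is not numeric (booleans are not counted as 'number')
non_numeric = df.select_dtypes(exclude=['number'])
print(non_numeric.columns.tolist())  # ['Name', 'Active']

# Only floating-point columns
floats_only = df.select_dtypes(include=['float64'])
print(floats_only.columns.tolist())  # ['Score']
```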

Filtering Columns with Conditional Logic

You can use conditional logic to filter columns based on their names:

python
# Get DataFrame's column names
col_names = df.columns.tolist()
print("\nColumn names:", col_names)

# Select columns that start with 'S'
s_cols = [col for col in df.columns if col.startswith('S')]
selected_s_cols = df[s_cols]
print("\nColumns starting with 'S':")
print(selected_s_cols)

Output:

Column names: ['Name', 'Age', 'City', 'Salary']

Columns starting with 'S':
   Salary
0   65000
1   72000
2   54000
3   80000
4   69000
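
The list comprehension above is one option; pandas also offers vectorized string methods on df.columns and the DataFrame.filter method for name-based selection. The following sketch shows both on a reduced version of the sample data. It is meant as an alternative to consider, not as the preferred approach.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma'],
    'Age': [28, 32],
    'City': ['New York', 'Boston'],
    'Salary': [65000, 72000],
})

# Boolean mask over the column names, used with loc
s_mask = df.columns.str.startswith('S')
starts_with_s = df.loc[:, s_mask]
print(starts_with_s.columns.tolist())  # ['Salary']

# filter(like=...) keeps columns whose name contains the substring
contains_a = df.filter(like='a')
print(contains_a.columns.tolist())     # ['Name', 'Salary']

# filter(regex=...) keeps columns whose name matches the pattern
regex_s = df.filter(regex='^S')
print(regex_s.columns.tolist())        # ['Salary']
```

Note that the like match is case-sensitive, which is why 'Age' is not selected by like='a'.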

Practical Examples

Example 1: Data Analysis on Specific Features

When analyzing data, you'll often want to focus on a few specific features:

python
# Create a more complex dataset
data = {
    'Name': ['John', 'Emma', 'Sam', 'Lisa', 'Tom'],
    'Age': [28, 32, 24, 35, 29],
    'Experience': [5, 8, 3, 10, 7],
    'Department': ['Sales', 'Engineering', 'Marketing', 'Engineering', 'Sales'],
    'Salary': [65000, 72000, 54000, 80000, 69000],
    'Bonus': [5000, 7000, 3000, 8000, 5500]
}

employee_df = pd.DataFrame(data)

# Analyze compensation columns only
compensation_df = employee_df[['Salary', 'Bonus']]
total_comp = compensation_df.sum(axis=1)
employee_df['Total_Comp'] = total_comp

print("Employee compensation analysis:")
print(employee_df[['Name', 'Salary', 'Bonus', 'Total_Comp']])

Output:

Employee compensation analysis:
   Name  Salary  Bonus  Total_Comp
0  John   65000   5000       70000
1  Emma   72000   7000       79000
2   Sam   54000   3000       57000
3  Lisa   80000   8000       88000
4   Tom   69000   5500       74500
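
The axis=1 argument is what makes sum add across the selected columns for each row. Without it, sum totals each column instead. A minimal sketch with the first two employees' figures shows the difference:

```python
import pandas as pd

# First two rows of the compensation columns from the example above
comp = pd.DataFrame({
    'Salary': [65000, 72000],
    'Bonus': [5000, 7000],
})

# axis=1: one total per row (Salary + Bonus for each employee)
per_row = comp.sum(axis=1)
print(per_row.tolist())     # [70000, 79000]

# Default axis=0: one total per column
per_column = comp.sum()
print(per_column.to_dict())  # {'Salary': 137000, 'Bonus': 12000}
```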

Example 2: Feature Engineering

Column selection is crucial for feature engineering in data science projects:

python
# Creating derived features from selected columns
engineering_features = employee_df[['Age', 'Experience']]
employee_df['Experience_Ratio'] = engineering_features['Experience'] / engineering_features['Age']

print("\nFeature engineering example:")
print(employee_df[['Name', 'Age', 'Experience', 'Experience_Ratio']])

Output:

Feature engineering example:
   Name  Age  Experience  Experience_Ratio
0  John   28           5          0.178571
1  Emma   32           8          0.250000
2   Sam   24           3          0.125000
3  Lisa   35          10          0.285714
4   Tom   29           7          0.241379

Summary

In this tutorial, we've covered various methods to select columns in pandas:

  • Basic selection with brackets df['column_name'] and dot notation df.column_name
  • Selecting multiple columns with lists df[['col1', 'col2']]
  • Advanced selection with loc and iloc
  • Selecting columns by data type using select_dtypes()
  • Filtering columns using conditional logic
  • Practical applications in data analysis and feature engineering

Mastering column selection techniques allows you to efficiently work with your data and focus on the variables that matter for your analysis.

Exercises

  1. Create a DataFrame with at least 5 columns of different data types, then select only the numeric columns.
  2. Write code to select columns whose names are longer than 5 characters.
  3. Create a DataFrame and practice reordering its columns in reverse order.
  4. Select columns based on their position (first and last columns only) using iloc.
  5. Take a real dataset (like the Titanic dataset from seaborn) and practice selecting different columns for analysis.

Happy data analyzing with pandas!