Pandas Column Operations
When working with data in Pandas, column operations are among the most common tasks you'll perform. Whether you're selecting specific columns for analysis, creating new columns from existing data, or transforming column values, mastering column operations is essential for effective data manipulation.
Introduction to Pandas Columns
In a Pandas DataFrame, columns represent variables or features of your dataset. Each column has a name (label) and contains data of a specific type. Understanding how to work with columns efficiently is crucial for data analysis and preprocessing.
Let's start by importing Pandas and creating a sample DataFrame:
import pandas as pd
import numpy as np
# Create a sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
'Age': [24, 27, 22, 32, 28],
'City': ['New York', 'Boston', 'Chicago', 'Seattle', 'Miami'],
'Salary': [72000, 65000, 58000, 90000, 76000]
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age City Salary
0 Alice 24 New York 72000
1 Bob 27 Boston 65000
2 Charlie 22 Chicago 58000
3 David 32 Seattle 90000
4 Emma 28 Miami 76000
Basic Column Selection
Selecting a Single Column
You can select a single column from a DataFrame in two ways:
# Method 1: Using square brackets - returns a pandas Series
ages = df['Age']
print("Type:", type(ages))
print(ages)
# Method 2: Using dot notation (only works for simple column names without spaces or special characters)
ages_dot = df.Age
print("Type:", type(ages_dot))
print(ages_dot)
Output:
Type: <class 'pandas.core.series.Series'>
0 24
1 27
2 22
3 32
4 28
Name: Age, dtype: int64
Type: <class 'pandas.core.series.Series'>
0 24
1 27
2 22
3 32
4 28
Name: Age, dtype: int64
Selecting Multiple Columns
To select multiple columns, you can provide a list of column names:
# Select Name and Salary columns
name_salary = df[['Name', 'Salary']]
print(type(name_salary))
print(name_salary)
Output:
<class 'pandas.core.frame.DataFrame'>
Name Salary
0 Alice 72000
1 Bob 65000
2 Charlie 58000
3 David 90000
4 Emma 76000
Adding and Creating New Columns
Adding a New Column with a Constant Value
# Add a department column with the same value for all rows
df['Department'] = 'Engineering'
print(df)
Output:
Name Age City Salary Department
0 Alice 24 New York 72000 Engineering
1 Bob 27 Boston 65000 Engineering
2 Charlie 22 Chicago 58000 Engineering
3 David 32 Seattle 90000 Engineering
4 Emma 28 Miami 76000 Engineering
Creating a New Column Based on Existing Columns
You can create new columns by performing operations on existing columns:
# Calculate bonus as 10% of salary
df['Bonus'] = df['Salary'] * 0.1
print(df)
Output:
Name Age City Salary Department Bonus
0 Alice 24 New York 72000 Engineering 7200.0
1 Bob 27 Boston 65000 Engineering 6500.0
2 Charlie 22 Chicago 58000 Engineering 5800.0
3 David 32 Seattle 90000 Engineering 9000.0
4 Emma 28 Miami 76000 Engineering 7600.0
Using apply()
to Create Columns
For more complex operations, you can use the apply()
function:
# Create a new column with custom logic
def calculate_experience(age):
return max(0, age - 21)
df['Experience'] = df['Age'].apply(calculate_experience)
print(df)
Output:
Name Age City Salary Department Bonus Experience
0 Alice 24 New York 72000 Engineering 7200.0 3
1 Bob 27 Boston 65000 Engineering 6500.0 6
2 Charlie 22 Chicago 58000 Engineering 5800.0 1
3 David 32 Seattle 90000 Engineering 9000.0 11
4 Emma 28 Miami 76000 Engineering 7600.0 7
Using Lambda Functions with Apply
You can also use lambda functions for simpler operations:
# Create a column for total compensation
df['Total_Comp'] = df.apply(lambda row: row['Salary'] + row['Bonus'], axis=1)
print(df[['Name', 'Salary', 'Bonus', 'Total_Comp']])
Output:
Name Salary Bonus Total_Comp
0 Alice 72000 7200.0 79200.0
1 Bob 65000 6500.0 71500.0
2 Charlie 58000 5800.0 63800.0
3 David 90000 9000.0 99000.0
4 Emma 76000 7600.0 83600.0
Renaming Columns
Using the rename()
Method
# Rename specific columns
renamed_df = df.rename(columns={'Name': 'Employee', 'City': 'Location'})
print(renamed_df.columns)
Output:
Index(['Employee', 'Age', 'Location', 'Salary', 'Department', 'Bonus', 'Experience',
'Total_Comp'],
dtype='object')
Renaming All Columns at Once
# Create a smaller dataframe for demo
mini_df = df[['Name', 'Age', 'Salary']].copy()
# Rename all columns
mini_df.columns = ['Employee', 'Years', 'Annual_Pay']
print(mini_df)
Output:
Employee Years Annual_Pay
0 Alice 24 72000
1 Bob 27 65000
2 Charlie 22 58000
3 David 32 90000
4 Emma 28 76000
Deleting Columns
Using drop()
Method
# Remove the Department column
df_no_dept = df.drop('Department', axis=1)
print(df_no_dept.columns)
# Remove multiple columns
df_simplified = df.drop(['Bonus', 'Experience', 'Total_Comp'], axis=1)
print(df_simplified.columns)
Output:
Index(['Name', 'Age', 'City', 'Salary', 'Bonus', 'Experience', 'Total_Comp'], dtype='object')
Index(['Name', 'Age', 'City', 'Salary', 'Department'], dtype='object')
Using del
Statement
# Create a copy to avoid modifying our main DataFrame
df_copy = df.copy()
# Delete a column
del df_copy['Bonus']
print(df_copy.columns)
Output:
Index(['Name', 'Age', 'City', 'Salary', 'Department', 'Experience', 'Total_Comp'], dtype='object')
Column Operations and Transformations
Applying Math Operations to Columns
# Increase all salaries by 5%
df['New_Salary'] = df['Salary'] * 1.05
print(df[['Name', 'Salary', 'New_Salary']])
Output:
Name Salary New_Salary
0 Alice 72000 75600.0
1 Bob 65000 68250.0
2 Charlie 58000 60900.0
3 David 90000 94500.0
4 Emma 76000 79800.0
String Operations on Text Columns
Pandas provides .str
accessor for string operations:
# Convert names to uppercase
df['Name_Upper'] = df['Name'].str.upper()
# Get the first 3 characters of each city
df['City_Short'] = df['City'].str[:3]
print(df[['Name', 'Name_Upper', 'City', 'City_Short']])
Output:
Name Name_Upper City City_Short
0 Alice ALICE New York New
1 Bob BOB Boston Bos
2 Charlie CHARLIE Chicago Chi
3 David DAVID Seattle Sea
4 Emma EMMA Miami Mia
Conditional Column Operations
You can create columns based on conditions using numpy.where()
or pandas.Series.map()
:
# Using numpy.where
df['Salary_Category'] = np.where(df['Salary'] >= 70000, 'High', 'Standard')
# More complex conditions using multiple where statements
conditions = [
(df['Age'] < 25),
(df['Age'] >= 25) & (df['Age'] < 30),
(df['Age'] >= 30)
]
choices = ['Junior', 'Mid-level', 'Senior']
df['Experience_Level'] = np.select(conditions, choices, default='Unknown')
print(df[['Name', 'Age', 'Salary', 'Salary_Category', 'Experience_Level']])
Output:
Name Age Salary Salary_Category Experience_Level
0 Alice 24 72000 High Junior
1 Bob 27 65000 Standard Mid-level
2 Charlie 22 58000 Standard Junior
3 David 32 90000 High Senior
4 Emma 28 76000 High Mid-level
Using map()
and replace()
# Create a mapping dictionary
city_region = {
'New York': 'East',
'Boston': 'East',
'Chicago': 'Midwest',
'Seattle': 'West',
'Miami': 'South'
}
# Apply mapping
df['Region'] = df['City'].map(city_region)
print(df[['City', 'Region']])
Output:
City Region
0 New York East
1 Boston East
2 Chicago Midwest
3 Seattle West
4 Miami South
Real-World Example: Customer Data Analysis
Let's look at a practical example where we analyze customer data to derive insights:
# Create a customer dataset
customer_data = {
'customer_id': range(1001, 1011),
'name': ['John Smith', 'Mary Johnson', 'Robert Brown', 'Patricia Davis',
'Michael Miller', 'Jennifer Wilson', 'William Moore', 'Elizabeth Taylor',
'David Anderson', 'Susan Thomas'],
'order_total': [120.50, 55.75, 310.00, 45.25, 78.90, 210.30, 15.50, 180.75, 90.20, 325.15],
'order_date': ['2023-01-15', '2023-01-17', '2023-01-20', '2023-01-22', '2023-01-25',
'2023-02-02', '2023-02-05', '2023-02-10', '2023-02-15', '2023-02-20'],
'payment_method': ['Credit Card', 'PayPal', 'Credit Card', 'Debit Card', 'Credit Card',
'PayPal', 'Debit Card', 'Credit Card', 'PayPal', 'Credit Card']
}
customers_df = pd.DataFrame(customer_data)
# Convert date string to datetime objects
customers_df['order_date'] = pd.to_datetime(customers_df['order_date'])
# Extract month from the date
customers_df['order_month'] = customers_df['order_date'].dt.month_name()
# Calculate tax (8%)
customers_df['tax'] = customers_df['order_total'] * 0.08
# Calculate total with tax
customers_df['total_with_tax'] = customers_df['order_total'] + customers_df['tax']
# Create discount column based on order amount
conditions = [
(customers_df['order_total'] < 50),
(customers_df['order_total'] >= 50) & (customers_df['order_total'] < 200),
(customers_df['order_total'] >= 200)
]
discount_rates = [0.00, 0.05, 0.10]
customers_df['discount_rate'] = np.select(conditions, discount_rates)
# Calculate final amount after discount
customers_df['final_amount'] = customers_df['total_with_tax'] * (1 - customers_df['discount_rate'])
# Round currency values
for col in ['tax', 'total_with_tax', 'final_amount']:
customers_df[col] = customers_df[col].round(2)
# Display results
print(customers_df[['name', 'order_total', 'order_month', 'tax', 'discount_rate', 'final_amount']])
Output:
name order_total order_month tax discount_rate final_amount
0 John Smith 120.50 January 9.64 0.05 123.88
1 Mary Johnson 55.75 January 4.46 0.05 57.20
2 Robert Brown 310.00 January 24.80 0.10 301.32
3 Patricia Davis 45.25 January 3.62 0.00 48.87
4 Michael Miller 78.90 January 6.31 0.05 80.97
5 Jennifer Wilson 210.30 February 16.82 0.10 204.52
6 William Moore 15.50 February 1.24 0.00 16.74
7 Elizabeth Taylor 180.75 February 14.46 0.05 185.20
8 David Anderson 90.20 February 7.22 0.05 92.45
9 Susan Thomas 325.15 February 26.01 0.10 315.74
Summary
In this tutorial, you've learned how to perform various operations on columns in Pandas DataFrames:
- Basic Column Selection: Selecting single and multiple columns
- Adding and Creating Columns: Adding constant values and creating derived columns
- Renaming Columns: Using
rename()
and direct assignment tocolumns
attribute - Deleting Columns: Using
drop()
anddel
- Column Transformations:
- Applying mathematical operations
- String manipulations with
.str
accessor - Conditional operations with
np.where()
andnp.select()
- Mapping values with
map()
andreplace()
These operations form the foundation of data manipulation in Pandas and will help you efficiently transform and analyze data in your projects.
Exercises
To reinforce your learning, try these exercises:
- Create a DataFrame with student data (name, scores in different subjects) and calculate the average score for each student.
- Using the customer DataFrame from our example, add a new column that categorizes customers into 'New' (February) and 'Returning' (January) based on their order month.
- Create a column that extracts the first name only from the customer names.
- Add a 'preferred_customer' column that marks customers as True if they spent more than $200 and used Credit Card, otherwise False.
- Create a 'days_since_order' column that calculates the number of days between the order date and February 28, 2023.
Additional Resources
- Pandas Documentation: Working with Text Data
- Pandas Documentation: Indexing and Selecting Data
- 10 Minutes to Pandas - Column Operations
- Real Python: Pandas DataFrame Tutorial
With these skills, you'll be well-equipped to handle data manipulation tasks in your data analysis projects!
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)