Skip to main content

Pandas Column Operations

When working with data in Pandas, column operations are among the most common tasks you'll perform. Whether you're selecting specific columns for analysis, creating new columns from existing data, or transforming column values, mastering column operations is essential for effective data manipulation.

Introduction to Pandas Columns

In a Pandas DataFrame, columns represent variables or features of your dataset. Each column has a name (label) and contains data of a specific type. Understanding how to work with columns efficiently is crucial for data analysis and preprocessing.

Let's start by importing Pandas and creating a sample DataFrame:

python
import pandas as pd
import numpy as np

# Create a sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
'Age': [24, 27, 22, 32, 28],
'City': ['New York', 'Boston', 'Chicago', 'Seattle', 'Miami'],
'Salary': [72000, 65000, 58000, 90000, 76000]
}

df = pd.DataFrame(data)
print(df)

Output:

      Name  Age     City  Salary
0 Alice 24 New York 72000
1 Bob 27 Boston 65000
2 Charlie 22 Chicago 58000
3 David 32 Seattle 90000
4 Emma 28 Miami 76000

Basic Column Selection

Selecting a Single Column

You can select a single column from a DataFrame in two ways:

python
# Method 1: Using square brackets - returns a pandas Series
ages = df['Age']
print("Type:", type(ages))
print(ages)

# Method 2: Using dot notation (only works for simple column names without spaces or special characters)
ages_dot = df.Age
print("Type:", type(ages_dot))
print(ages_dot)

Output:

Type: <class 'pandas.core.series.Series'>
0 24
1 27
2 22
3 32
4 28
Name: Age, dtype: int64
Type: <class 'pandas.core.series.Series'>
0 24
1 27
2 22
3 32
4 28
Name: Age, dtype: int64

Selecting Multiple Columns

To select multiple columns, you can provide a list of column names:

python
# Select Name and Salary columns
name_salary = df[['Name', 'Salary']]
print(type(name_salary))
print(name_salary)

Output:

<class 'pandas.core.frame.DataFrame'>
Name Salary
0 Alice 72000
1 Bob 65000
2 Charlie 58000
3 David 90000
4 Emma 76000

Adding and Creating New Columns

Adding a New Column with a Constant Value

python
# Add a department column with the same value for all rows
df['Department'] = 'Engineering'
print(df)

Output:

      Name  Age     City  Salary   Department
0 Alice 24 New York 72000 Engineering
1 Bob 27 Boston 65000 Engineering
2 Charlie 22 Chicago 58000 Engineering
3 David 32 Seattle 90000 Engineering
4 Emma 28 Miami 76000 Engineering

Creating a New Column Based on Existing Columns

You can create new columns by performing operations on existing columns:

python
# Calculate bonus as 10% of salary
df['Bonus'] = df['Salary'] * 0.1
print(df)

Output:

      Name  Age     City  Salary   Department    Bonus
0 Alice 24 New York 72000 Engineering 7200.0
1 Bob 27 Boston 65000 Engineering 6500.0
2 Charlie 22 Chicago 58000 Engineering 5800.0
3 David 32 Seattle 90000 Engineering 9000.0
4 Emma 28 Miami 76000 Engineering 7600.0

Using apply() to Create Columns

For more complex operations, you can use the apply() function:

python
# Create a new column with custom logic
def calculate_experience(age):
return max(0, age - 21)

df['Experience'] = df['Age'].apply(calculate_experience)
print(df)

Output:

      Name  Age     City  Salary   Department    Bonus  Experience
0 Alice 24 New York 72000 Engineering 7200.0 3
1 Bob 27 Boston 65000 Engineering 6500.0 6
2 Charlie 22 Chicago 58000 Engineering 5800.0 1
3 David 32 Seattle 90000 Engineering 9000.0 11
4 Emma 28 Miami 76000 Engineering 7600.0 7

Using Lambda Functions with Apply

You can also use lambda functions for simpler operations:

python
# Create a column for total compensation
df['Total_Comp'] = df.apply(lambda row: row['Salary'] + row['Bonus'], axis=1)
print(df[['Name', 'Salary', 'Bonus', 'Total_Comp']])

Output:

      Name  Salary    Bonus  Total_Comp
0 Alice 72000 7200.0 79200.0
1 Bob 65000 6500.0 71500.0
2 Charlie 58000 5800.0 63800.0
3 David 90000 9000.0 99000.0
4 Emma 76000 7600.0 83600.0

Renaming Columns

Using the rename() Method

python
# Rename specific columns
renamed_df = df.rename(columns={'Name': 'Employee', 'City': 'Location'})
print(renamed_df.columns)

Output:

Index(['Employee', 'Age', 'Location', 'Salary', 'Department', 'Bonus', 'Experience',
'Total_Comp'],
dtype='object')

Renaming All Columns at Once

python
# Create a smaller dataframe for demo
mini_df = df[['Name', 'Age', 'Salary']].copy()
# Rename all columns
mini_df.columns = ['Employee', 'Years', 'Annual_Pay']
print(mini_df)

Output:

  Employee  Years  Annual_Pay
0 Alice 24 72000
1 Bob 27 65000
2 Charlie 22 58000
3 David 32 90000
4 Emma 28 76000

Deleting Columns

Using drop() Method

python
# Remove the Department column
df_no_dept = df.drop('Department', axis=1)
print(df_no_dept.columns)

# Remove multiple columns
df_simplified = df.drop(['Bonus', 'Experience', 'Total_Comp'], axis=1)
print(df_simplified.columns)

Output:

Index(['Name', 'Age', 'City', 'Salary', 'Bonus', 'Experience', 'Total_Comp'], dtype='object')
Index(['Name', 'Age', 'City', 'Salary', 'Department'], dtype='object')

Using del Statement

python
# Create a copy to avoid modifying our main DataFrame
df_copy = df.copy()
# Delete a column
del df_copy['Bonus']
print(df_copy.columns)

Output:

Index(['Name', 'Age', 'City', 'Salary', 'Department', 'Experience', 'Total_Comp'], dtype='object')

Column Operations and Transformations

Applying Math Operations to Columns

python
# Increase all salaries by 5%
df['New_Salary'] = df['Salary'] * 1.05
print(df[['Name', 'Salary', 'New_Salary']])

Output:

      Name  Salary  New_Salary
0 Alice 72000 75600.0
1 Bob 65000 68250.0
2 Charlie 58000 60900.0
3 David 90000 94500.0
4 Emma 76000 79800.0

String Operations on Text Columns

Pandas provides .str accessor for string operations:

python
# Convert names to uppercase
df['Name_Upper'] = df['Name'].str.upper()
# Get the first 3 characters of each city
df['City_Short'] = df['City'].str[:3]
print(df[['Name', 'Name_Upper', 'City', 'City_Short']])

Output:

      Name Name_Upper     City City_Short
0 Alice ALICE New York New
1 Bob BOB Boston Bos
2 Charlie CHARLIE Chicago Chi
3 David DAVID Seattle Sea
4 Emma EMMA Miami Mia

Conditional Column Operations

You can create columns based on conditions using numpy.where() or pandas.Series.map():

python
# Using numpy.where
df['Salary_Category'] = np.where(df['Salary'] >= 70000, 'High', 'Standard')

# More complex conditions using multiple where statements
conditions = [
(df['Age'] < 25),
(df['Age'] >= 25) & (df['Age'] < 30),
(df['Age'] >= 30)
]
choices = ['Junior', 'Mid-level', 'Senior']
df['Experience_Level'] = np.select(conditions, choices, default='Unknown')

print(df[['Name', 'Age', 'Salary', 'Salary_Category', 'Experience_Level']])

Output:

      Name  Age  Salary Salary_Category Experience_Level
0 Alice 24 72000 High Junior
1 Bob 27 65000 Standard Mid-level
2 Charlie 22 58000 Standard Junior
3 David 32 90000 High Senior
4 Emma 28 76000 High Mid-level

Using map() and replace()

python
# Create a mapping dictionary
city_region = {
'New York': 'East',
'Boston': 'East',
'Chicago': 'Midwest',
'Seattle': 'West',
'Miami': 'South'
}

# Apply mapping
df['Region'] = df['City'].map(city_region)
print(df[['City', 'Region']])

Output:

      City  Region
0 New York East
1 Boston East
2 Chicago Midwest
3 Seattle West
4 Miami South

Real-World Example: Customer Data Analysis

Let's look at a practical example where we analyze customer data to derive insights:

python
# Create a customer dataset
customer_data = {
'customer_id': range(1001, 1011),
'name': ['John Smith', 'Mary Johnson', 'Robert Brown', 'Patricia Davis',
'Michael Miller', 'Jennifer Wilson', 'William Moore', 'Elizabeth Taylor',
'David Anderson', 'Susan Thomas'],
'order_total': [120.50, 55.75, 310.00, 45.25, 78.90, 210.30, 15.50, 180.75, 90.20, 325.15],
'order_date': ['2023-01-15', '2023-01-17', '2023-01-20', '2023-01-22', '2023-01-25',
'2023-02-02', '2023-02-05', '2023-02-10', '2023-02-15', '2023-02-20'],
'payment_method': ['Credit Card', 'PayPal', 'Credit Card', 'Debit Card', 'Credit Card',
'PayPal', 'Debit Card', 'Credit Card', 'PayPal', 'Credit Card']
}

customers_df = pd.DataFrame(customer_data)

# Convert date string to datetime objects
customers_df['order_date'] = pd.to_datetime(customers_df['order_date'])

# Extract month from the date
customers_df['order_month'] = customers_df['order_date'].dt.month_name()

# Calculate tax (8%)
customers_df['tax'] = customers_df['order_total'] * 0.08

# Calculate total with tax
customers_df['total_with_tax'] = customers_df['order_total'] + customers_df['tax']

# Create discount column based on order amount
conditions = [
(customers_df['order_total'] < 50),
(customers_df['order_total'] >= 50) & (customers_df['order_total'] < 200),
(customers_df['order_total'] >= 200)
]
discount_rates = [0.00, 0.05, 0.10]
customers_df['discount_rate'] = np.select(conditions, discount_rates)

# Calculate final amount after discount
customers_df['final_amount'] = customers_df['total_with_tax'] * (1 - customers_df['discount_rate'])

# Round currency values
for col in ['tax', 'total_with_tax', 'final_amount']:
customers_df[col] = customers_df[col].round(2)

# Display results
print(customers_df[['name', 'order_total', 'order_month', 'tax', 'discount_rate', 'final_amount']])

Output:

              name  order_total order_month    tax  discount_rate  final_amount
0 John Smith 120.50 January 9.64 0.05 123.88
1 Mary Johnson 55.75 January 4.46 0.05 57.20
2 Robert Brown 310.00 January 24.80 0.10 301.32
3 Patricia Davis 45.25 January 3.62 0.00 48.87
4 Michael Miller 78.90 January 6.31 0.05 80.97
5 Jennifer Wilson 210.30 February 16.82 0.10 204.52
6 William Moore 15.50 February 1.24 0.00 16.74
7 Elizabeth Taylor 180.75 February 14.46 0.05 185.20
8 David Anderson 90.20 February 7.22 0.05 92.45
9 Susan Thomas 325.15 February 26.01 0.10 315.74

Summary

In this tutorial, you've learned how to perform various operations on columns in Pandas DataFrames:

  1. Basic Column Selection: Selecting single and multiple columns
  2. Adding and Creating Columns: Adding constant values and creating derived columns
  3. Renaming Columns: Using rename() and direct assignment to columns attribute
  4. Deleting Columns: Using drop() and del
  5. Column Transformations:
    • Applying mathematical operations
    • String manipulations with .str accessor
    • Conditional operations with np.where() and np.select()
    • Mapping values with map() and replace()

These operations form the foundation of data manipulation in Pandas and will help you efficiently transform and analyze data in your projects.

Exercises

To reinforce your learning, try these exercises:

  1. Create a DataFrame with student data (name, scores in different subjects) and calculate the average score for each student.
  2. Using the customer DataFrame from our example, add a new column that categorizes customers into 'New' (February) and 'Returning' (January) based on their order month.
  3. Create a column that extracts the first name only from the customer names.
  4. Add a 'preferred_customer' column that marks customers as True if they spent more than $200 and used Credit Card, otherwise False.
  5. Create a 'days_since_order' column that calculates the number of days between the order date and February 28, 2023.

Additional Resources

With these skills, you'll be well-equipped to handle data manipulation tasks in your data analysis projects!



If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)