Pandas Column Operations

When working with data in Pandas, column operations are among the most common tasks you'll perform. Whether you're selecting specific columns for analysis, creating new columns from existing data, or transforming column values, mastering column operations is essential for effective data manipulation.

Introduction to Pandas Columns

In a Pandas DataFrame, columns represent variables or features of your dataset. Each column has a name (label) and contains data of a specific type. Understanding how to work with columns efficiently is crucial for data analysis and preprocessing.

Let's start by importing Pandas and creating a sample DataFrame:

import pandas as pd
import numpy as np

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
    'Age': [24, 27, 22, 32, 28],
    'City': ['New York', 'Boston', 'Chicago', 'Seattle', 'Miami'],
    'Salary': [72000, 65000, 58000, 90000, 76000]
}

df = pd.DataFrame(data)
print(df)

Output:

      Name  Age     City  Salary
  Alice   24  New York   72000
    Bob   27    Boston   65000
Charlie   22   Chicago   58000
  David   32   Seattle   90000
   Emma   28     Miami   76000

Basic Column Selection

Selecting a Single Column

You can select a single column from a DataFrame in two ways:

# Method 1: Using square brackets - returns a pandas Series
ages = df['Age']
print("Type:", type(ages))
print(ages)

# Method 2: Using dot notation (only works for simple column names without spaces or special characters)
ages_dot = df.Age
print("Type:", type(ages_dot))
print(ages_dot)

Output:

Type: <class 'pandas.core.series.Series'>
  24
  27
  22
  32
  28
Name: Age, dtype: int64
Type: <class 'pandas.core.series.Series'>
  24
  27
  22
  32
  28
Name: Age, dtype: int64

Selecting Multiple Columns

To select multiple columns, you can provide a list of column names:

# Select Name and Salary columns
name_salary = df[['Name', 'Salary']]
print(type(name_salary))
print(name_salary)

Output:

<class 'pandas.core.frame.DataFrame'>
      Name  Salary
  Alice   72000
    Bob   65000
Charlie   58000
  David   90000
   Emma   76000

Adding and Creating New Columns

Adding a New Column with a Constant Value

# Add a department column with the same value for all rows
df['Department'] = 'Engineering'
print(df)

Output:

      Name  Age     City  Salary   Department
  Alice   24  New York   72000  Engineering
    Bob   27    Boston   65000  Engineering
Charlie   22   Chicago   58000  Engineering
  David   32   Seattle   90000  Engineering
   Emma   28     Miami   76000  Engineering

Creating a New Column Based on Existing Columns

You can create new columns by performing operations on existing columns:

# Calculate bonus as 10% of salary
df['Bonus'] = df['Salary'] * 0.1
print(df)

Output:

      Name  Age     City  Salary   Department    Bonus
  Alice   24  New York   72000  Engineering   7200.0
    Bob   27    Boston   65000  Engineering   6500.0
Charlie   22   Chicago   58000  Engineering   5800.0
  David   32   Seattle   90000  Engineering   9000.0
   Emma   28     Miami   76000  Engineering   7600.0

Using `apply()` to Create Columns

For more complex operations, you can use the apply() function:

# Create a new column with custom logic
def calculate_experience(age):
    return max(0, age - 21)

df['Experience'] = df['Age'].apply(calculate_experience)
print(df)

Output:

      Name  Age     City  Salary   Department    Bonus  Experience
  Alice   24  New York   72000  Engineering   7200.0           3
    Bob   27    Boston   65000  Engineering   6500.0           6
Charlie   22   Chicago   58000  Engineering   5800.0           1
  David   32   Seattle   90000  Engineering   9000.0          11
   Emma   28     Miami   76000  Engineering   7600.0           7

Using Lambda Functions with Apply

You can also use lambda functions for simpler operations:

# Create a column for total compensation
df['Total_Comp'] = df.apply(lambda row: row['Salary'] + row['Bonus'], axis=1)
print(df[['Name', 'Salary', 'Bonus', 'Total_Comp']])

Output:

      Name  Salary    Bonus  Total_Comp
  Alice   72000   7200.0      79200.0
    Bob   65000   6500.0      71500.0
Charlie   58000   5800.0      63800.0
  David   90000   9000.0      99000.0
   Emma   76000   7600.0      83600.0

Renaming Columns

Using the `rename()` Method

# Rename specific columns
renamed_df = df.rename(columns={'Name': 'Employee', 'City': 'Location'})
print(renamed_df.columns)

Output:

Index(['Employee', 'Age', 'Location', 'Salary', 'Department', 'Bonus', 'Experience',
       'Total_Comp'],
      dtype='object')

Renaming All Columns at Once

# Create a smaller dataframe for demo
mini_df = df[['Name', 'Age', 'Salary']].copy()
# Rename all columns
mini_df.columns = ['Employee', 'Years', 'Annual_Pay']
print(mini_df)

Output:

  Employee  Years  Annual_Pay
  Alice     24       72000
    Bob     27       65000
Charlie     22       58000
  David     32       90000
   Emma     28       76000

Deleting Columns

Using `drop()` Method

# Remove the Department column
df_no_dept = df.drop('Department', axis=1)
print(df_no_dept.columns)

# Remove multiple columns
df_simplified = df.drop(['Bonus', 'Experience', 'Total_Comp'], axis=1)
print(df_simplified.columns)

Output:

Index(['Name', 'Age', 'City', 'Salary', 'Bonus', 'Experience', 'Total_Comp'], dtype='object')
Index(['Name', 'Age', 'City', 'Salary', 'Department'], dtype='object')

Using `del` Statement

# Create a copy to avoid modifying our main DataFrame
df_copy = df.copy()
# Delete a column
del df_copy['Bonus']
print(df_copy.columns)

Output:

Index(['Name', 'Age', 'City', 'Salary', 'Department', 'Experience', 'Total_Comp'], dtype='object')

Column Operations and Transformations

Applying Math Operations to Columns

# Increase all salaries by 5%
df['New_Salary'] = df['Salary'] * 1.05
print(df[['Name', 'Salary', 'New_Salary']])

Output:

      Name  Salary  New_Salary
  Alice   72000     75600.0
    Bob   65000     68250.0
Charlie   58000     60900.0
  David   90000     94500.0
   Emma   76000     79800.0

String Operations on Text Columns

Pandas provides .str accessor for string operations:

# Convert names to uppercase
df['Name_Upper'] = df['Name'].str.upper()
# Get the first 3 characters of each city
df['City_Short'] = df['City'].str[:3]
print(df[['Name', 'Name_Upper', 'City', 'City_Short']])

Output:

      Name Name_Upper     City City_Short
  Alice     ALICE  New York        New
    Bob       BOB    Boston        Bos
Charlie   CHARLIE   Chicago        Chi
  David     DAVID   Seattle        Sea
   Emma      EMMA     Miami        Mia

Conditional Column Operations

You can create columns based on conditions using numpy.where() or pandas.Series.map():

# Using numpy.where
df['Salary_Category'] = np.where(df['Salary'] >= 70000, 'High', 'Standard')

# More complex conditions using multiple where statements
conditions = [
    (df['Age'] < 25),
    (df['Age'] >= 25) & (df['Age'] < 30),
    (df['Age'] >= 30)
]
choices = ['Junior', 'Mid-level', 'Senior']
df['Experience_Level'] = np.select(conditions, choices, default='Unknown')

print(df[['Name', 'Age', 'Salary', 'Salary_Category', 'Experience_Level']])

Output:

      Name  Age  Salary Salary_Category Experience_Level
  Alice   24   72000           High          Junior
    Bob   27   65000       Standard       Mid-level
Charlie   22   58000       Standard          Junior
  David   32   90000           High          Senior
   Emma   28   76000           High       Mid-level

Using `map()` and `replace()`

# Create a mapping dictionary
city_region = {
    'New York': 'East',
    'Boston': 'East',
    'Chicago': 'Midwest',
    'Seattle': 'West',
    'Miami': 'South'
}

# Apply mapping
df['Region'] = df['City'].map(city_region)
print(df[['City', 'Region']])

Output:

      City  Region
New York    East
  Boston    East
 Chicago Midwest
 Seattle    West
   Miami   South

Real-World Example: Customer Data Analysis

Let's look at a practical example where we analyze customer data to derive insights:

# Create a customer dataset
customer_data = {
    'customer_id': range(1001, 1011),
    'name': ['John Smith', 'Mary Johnson', 'Robert Brown', 'Patricia Davis', 
             'Michael Miller', 'Jennifer Wilson', 'William Moore', 'Elizabeth Taylor', 
             'David Anderson', 'Susan Thomas'],
    'order_total': [120.50, 55.75, 310.00, 45.25, 78.90, 210.30, 15.50, 180.75, 90.20, 325.15],
    'order_date': ['2023-01-15', '2023-01-17', '2023-01-20', '2023-01-22', '2023-01-25', 
                   '2023-02-02', '2023-02-05', '2023-02-10', '2023-02-15', '2023-02-20'],
    'payment_method': ['Credit Card', 'PayPal', 'Credit Card', 'Debit Card', 'Credit Card', 
                       'PayPal', 'Debit Card', 'Credit Card', 'PayPal', 'Credit Card']
}

customers_df = pd.DataFrame(customer_data)

# Convert date string to datetime objects
customers_df['order_date'] = pd.to_datetime(customers_df['order_date'])

# Extract month from the date
customers_df['order_month'] = customers_df['order_date'].dt.month_name()

# Calculate tax (8%)
customers_df['tax'] = customers_df['order_total'] * 0.08

# Calculate total with tax
customers_df['total_with_tax'] = customers_df['order_total'] + customers_df['tax']

# Create discount column based on order amount
conditions = [
    (customers_df['order_total'] < 50),
    (customers_df['order_total'] >= 50) & (customers_df['order_total'] < 200),
    (customers_df['order_total'] >= 200)
]
discount_rates = [0.00, 0.05, 0.10]
customers_df['discount_rate'] = np.select(conditions, discount_rates)

# Calculate final amount after discount
customers_df['final_amount'] = customers_df['total_with_tax'] * (1 - customers_df['discount_rate'])

# Round currency values
for col in ['tax', 'total_with_tax', 'final_amount']:
    customers_df[col] = customers_df[col].round(2)

# Display results
print(customers_df[['name', 'order_total', 'order_month', 'tax', 'discount_rate', 'final_amount']])

Output:

              name  order_total order_month    tax  discount_rate  final_amount
    John Smith       120.50    January   9.64          0.05       123.88
  Mary Johnson        55.75    January   4.46          0.05        57.20
  Robert Brown       310.00    January  24.80          0.10       301.32
 Patricia Davis        45.25    January   3.62          0.00        48.87
 Michael Miller        78.90    January   6.31          0.05        80.97
Jennifer Wilson       210.30   February  16.82          0.10       204.52
  William Moore        15.50   February   1.24          0.00        16.74
Elizabeth Taylor      180.75   February  14.46          0.05       185.20
  David Anderson       90.20   February   7.22          0.05        92.45
   Susan Thomas       325.15   February  26.01          0.10       315.74

Summary

In this tutorial, you've learned how to perform various operations on columns in Pandas DataFrames:

Basic Column Selection: Selecting single and multiple columns
Adding and Creating Columns: Adding constant values and creating derived columns
Renaming Columns: Using rename() and direct assignment to columns attribute
Deleting Columns: Using drop() and del
Column Transformations:
- Applying mathematical operations
- String manipulations with .str accessor
- Conditional operations with np.where() and np.select()
- Mapping values with map() and replace()

These operations form the foundation of data manipulation in Pandas and will help you efficiently transform and analyze data in your projects.

Exercises

To reinforce your learning, try these exercises:

Create a DataFrame with student data (name, scores in different subjects) and calculate the average score for each student.
Using the customer DataFrame from our example, add a new column that categorizes customers into 'New' (February) and 'Returning' (January) based on their order month.
Create a column that extracts the first name only from the customer names.
Add a 'preferred_customer' column that marks customers as True if they spent more than $200 and used Credit Card, otherwise False.
Create a 'days_since_order' column that calculates the number of days between the order date and February 28, 2023.

Additional Resources

With these skills, you'll be well-equipped to handle data manipulation tasks in your data analysis projects!

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

Introduction to Pandas Columns​

Basic Column Selection​

Selecting a Single Column​

Selecting Multiple Columns​

Adding and Creating New Columns​

Adding a New Column with a Constant Value​

Creating a New Column Based on Existing Columns​

Using apply() to Create Columns​

Using Lambda Functions with Apply​

Renaming Columns​

Using the rename() Method​

Renaming All Columns at Once​

Deleting Columns​

Using drop() Method​

Using del Statement​

Column Operations and Transformations​

Applying Math Operations to Columns​

String Operations on Text Columns​

Conditional Column Operations​

Using map() and replace()​

Real-World Example: Customer Data Analysis​

Summary​

Exercises​

Additional Resources​

Introduction to Pandas Columns

Basic Column Selection

Selecting a Single Column

Selecting Multiple Columns

Adding and Creating New Columns

Adding a New Column with a Constant Value

Creating a New Column Based on Existing Columns

Using `apply()` to Create Columns

Using Lambda Functions with Apply

Renaming Columns

Using the `rename()` Method

Renaming All Columns at Once

Deleting Columns

Using `drop()` Method

Using `del` Statement

Column Operations and Transformations

Applying Math Operations to Columns

String Operations on Text Columns

Conditional Column Operations

Using `map()` and `replace()`

Real-World Example: Customer Data Analysis

Summary

Exercises

Additional Resources