
Pandas Basic Operations

Introduction

Pandas is one of the most powerful and flexible data analysis libraries in Python. It provides data structures and functions designed to make working with structured data fast, easy, and expressive. In this tutorial, we'll explore the basic operations that form the foundation of data manipulation with pandas.

Before we dive in, make sure you have pandas installed:

python
# Install pandas if you haven't already
# pip install pandas

# Import pandas
import pandas as pd
import numpy as np

By the end of this tutorial, you'll be comfortable performing common data operations that are essential for any data analysis workflow.

Creating Data Structures

Pandas provides two primary data structures: Series (1-dimensional) and DataFrame (2-dimensional).

Creating a Series

A Series is a one-dimensional labeled array that can hold any data type:

python
# Creating a Series from a list
s = pd.Series([1, 3, 5, 7, 9])
print(s)

Output:

0    1
1    3
2    5
3    7
4    9
dtype: int64

You can specify custom index labels:

python
# Creating a Series with custom indices
s = pd.Series([1, 3, 5, 7, 9], index=['a', 'b', 'c', 'd', 'e'])
print(s)

Output:

a    1
b    3
c    5
d    7
e    9
dtype: int64

Creating a DataFrame

A DataFrame is a 2-dimensional labeled data structure with columns that can be of different types:

python
# Creating a DataFrame from a dictionary
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London']
}

df = pd.DataFrame(data)
print(df)

Output:

    Name  Age      City
0   John   28  New York
1   Anna   34     Paris
2  Peter   29    Berlin
3  Linda   32    London

You can also create a DataFrame from a list of lists or arrays:

python
# Creating a DataFrame from a list of lists
data = [
    ['John', 28, 'New York'],
    ['Anna', 34, 'Paris'],
    ['Peter', 29, 'Berlin'],
    ['Linda', 32, 'London']
]

df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(df)

Viewing Data

Let's learn how to inspect our data:

python
# Sample DataFrame for demonstration
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'Bob', 'Sarah', 'Mike', 'Emma', 'Tom', 'Laura'],
    'Age': [28, 34, 29, 32, 45, 27, 36, 31, 42, 39],
    'City': ['New York', 'Paris', 'Berlin', 'London', 'Tokyo', 'Sydney', 'Toronto', 'Madrid', 'Rome', 'Chicago'],
    'Salary': [65000, 72000, 58000, 79000, 95000, 67000, 82000, 71000, 88000, 76000]
}

df = pd.DataFrame(data)

Getting Basic Information

python
# View the first 5 rows
print("First 5 rows:")
print(df.head())

# View the last 5 rows
print("\nLast 5 rows:")
print(df.tail())

# View basic statistics of numeric columns
print("\nSummary statistics:")
print(df.describe())

# Get DataFrame information (info() prints directly and returns None,
# so there's no need to wrap it in print())
print("\nDataFrame info:")
df.info()

Output (partial):

First 5 rows:
    Name  Age      City  Salary
0   John   28  New York   65000
1   Anna   34     Paris   72000
2  Peter   29    Berlin   58000
3  Linda   32    London   79000
4    Bob   45     Tokyo   95000

Last 5 rows:
    Name  Age     City  Salary
5  Sarah   27   Sydney   67000
6   Mike   36  Toronto   82000
7   Emma   31   Madrid   71000
8    Tom   42     Rome   88000
9  Laura   39  Chicago   76000

Accessing Columns

You can access columns in multiple ways:

python
# Get the 'Name' column
print(df['Name'])

# Alternative: attribute access, e.g. df.Age (works only for column
# names without spaces or special characters)
print(df.Age)

Output:

0     John
1     Anna
2    Peter
3    Linda
4      Bob
5    Sarah
6     Mike
7     Emma
8      Tom
9    Laura
Name: Name, dtype: object

0    28
1    34
2    29
3    32
4    45
5    27
6    36
7    31
8    42
9    39
Name: Age, dtype: int64
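
A third common pattern, not shown above: passing a list of column names selects several columns at once and returns a DataFrame rather than a Series. A minimal sketch, using a smaller frame with the same columns:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter'],
    'Age': [28, 34, 29],
    'City': ['New York', 'Paris', 'Berlin'],
})

# A list of column names returns a DataFrame (note the double brackets)
subset = df[['Name', 'Age']]
print(subset)

# A single label returns a Series; a one-element list returns a 1-column DataFrame
print(type(df['Name']).__name__)    # Series
print(type(df[['Name']]).__name__)  # DataFrame
```

This distinction matters when a later step expects a DataFrame (for example, the `X` argument of many scikit-learn estimators).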

Accessing Rows

Access rows using iloc (integer position) or loc (label):

python
# Get row by position using iloc
print("Row at position 2:")
print(df.iloc[2])

# Get row by label using loc
print("\nRow with label 3:")
print(df.loc[3])

# Get specific rows and columns
print("\nSpecific data using iloc:")
print(df.iloc[1:4, 1:3]) # Rows 1-3, columns 1-2

Output:

Row at position 2:
Name       Peter
Age           29
City      Berlin
Salary     58000
Name: 2, dtype: object

Row with label 3:
Name       Linda
Age           32
City      London
Salary     79000
Name: 3, dtype: object

Specific data using iloc:
   Age    City
1   34   Paris
2   29  Berlin
3   32  London
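
One subtlety worth remembering: unlike ordinary Python slicing, a loc slice includes the end label, while an iloc slice excludes the end position. A small sketch illustrating the difference:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
})

# loc includes the end label: labels 1 through 3 -> 3 rows
print(df.loc[1:3, ['Name', 'City']])

# iloc excludes the end position: positions 1 and 2 -> 2 rows
print(df.iloc[1:3])

print(len(df.loc[1:3]), len(df.iloc[1:3]))  # 3 2
```

With a default integer index the two look interchangeable, but once the index is reordered or relabeled (e.g. after sorting) they select different rows.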

Basic Operations

Filtering Data

You can filter data based on conditions:

python
# Filter rows where Age > 30
older_than_30 = df[df['Age'] > 30]
print("People older than 30:")
print(older_than_30)

# Filter multiple conditions
high_earners_in_big_cities = df[(df['Salary'] > 70000) &
                                (df['City'].isin(['New York', 'London', 'Tokyo']))]
print("\nHigh earners in major cities:")
print(high_earners_in_big_cities)

Output:

People older than 30:
    Name  Age     City  Salary
1   Anna   34    Paris   72000
3  Linda   32   London   79000
4    Bob   45    Tokyo   95000
6   Mike   36  Toronto   82000
8    Tom   42     Rome   88000
9  Laura   39  Chicago   76000

High earners in major cities:
    Name  Age    City  Salary
3  Linda   32  London   79000
4    Bob   45   Tokyo   95000

Adding and Removing Columns

python
# Add a new column
df['Experience'] = [5, 8, 4, 7, 15, 3, 9, 6, 12, 10]
print("DataFrame with new column:")
print(df.head())

# Remove a column
df_no_city = df.drop('City', axis=1)
print("\nDataFrame without City column:")
print(df_no_city.head())

Output:

DataFrame with new column:
    Name  Age      City  Salary  Experience
0   John   28  New York   65000           5
1   Anna   34     Paris   72000           8
2  Peter   29    Berlin   58000           4
3  Linda   32    London   79000           7
4    Bob   45     Tokyo   95000          15

DataFrame without City column:
    Name  Age  Salary  Experience
0   John   28   65000           5
1   Anna   34   72000           8
2  Peter   29   58000           4
3  Linda   32   79000           7
4    Bob   45   95000          15
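
New columns don't have to come from literal lists; they are often derived from existing columns with vectorized arithmetic, which is both shorter and faster than row-by-row loops. A minimal sketch (the Salary_K column name is just illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter'],
    'Salary': [65000, 72000, 58000],
})

# Vectorized arithmetic operates on the whole column at once
df['Salary_K'] = df['Salary'] / 1000
print(df)

# drop() also accepts a list, so several columns can be removed in one call
trimmed = df.drop(['Salary'], axis=1)
print(trimmed)
```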

Sorting Data

python
# Sort by Age in ascending order
sorted_by_age = df.sort_values('Age')
print("Sorted by Age (ascending):")
print(sorted_by_age.head())

# Sort by Salary in descending order
sorted_by_salary_desc = df.sort_values('Salary', ascending=False)
print("\nSorted by Salary (descending):")
print(sorted_by_salary_desc.head())

# Sort by multiple columns
sorted_multiple = df.sort_values(['City', 'Age'])
print("\nSorted by City then Age:")
print(sorted_multiple.head())

Output:

Sorted by Age (ascending):
    Name  Age      City  Salary  Experience
5  Sarah   27    Sydney   67000           3
0   John   28  New York   65000           5
2  Peter   29    Berlin   58000           4
3  Linda   32    London   79000           7
1   Anna   34     Paris   72000           8
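
When sorting by multiple columns you can also mix directions by passing a list to ascending, one flag per column. A small sketch with a hypothetical two-column frame:

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['Paris', 'Paris', 'Berlin'],
    'Age': [34, 28, 29],
})

# City ascending, Age descending within each city
result = df.sort_values(['City', 'Age'], ascending=[True, False])
print(result)
```

Note that sorting preserves the original index labels, which is why the Sorted by Age output above shows the rows with their old index values.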

Handling Missing Values

Let's create a DataFrame with some missing values:

python
# Creating data with missing values
data_missing = {
    'Name': ['John', 'Anna', None, 'Linda', 'Bob'],
    'Age': [28, None, 29, 32, None],
    'City': ['New York', 'Paris', None, 'London', 'Tokyo'],
    'Salary': [65000, 72000, None, 79000, 95000]
}

df_missing = pd.DataFrame(data_missing)
print("DataFrame with missing values:")
print(df_missing)

Output:

DataFrame with missing values:
    Name   Age      City   Salary
0   John  28.0  New York  65000.0
1   Anna   NaN     Paris  72000.0
2   None  29.0      None      NaN
3  Linda  32.0    London  79000.0
4    Bob   NaN     Tokyo  95000.0

Check for missing values:

python
# Check for missing values
print("\nMissing value count by column:")
print(df_missing.isnull().sum())

# Fill missing values
df_filled = df_missing.fillna({
    'Name': 'Unknown',
    'Age': df_missing['Age'].mean(),
    'City': 'Unknown',
    'Salary': df_missing['Salary'].median()
})
print("\nDataFrame with filled missing values:")
print(df_filled)

# Drop rows with missing values
df_dropped = df_missing.dropna()
print("\nDataFrame after dropping rows with any missing values:")
print(df_dropped)

Output:

Missing value count by column:
Name      1
Age       2
City      1
Salary    1
dtype: int64

DataFrame with filled missing values:
      Name        Age      City   Salary
0     John  28.000000  New York  65000.0
1     Anna  29.666667     Paris  72000.0
2  Unknown  29.000000   Unknown  75500.0
3    Linda  32.000000    London  79000.0
4      Bob  29.666667     Tokyo  95000.0

DataFrame after dropping rows with any missing values:
    Name   Age      City   Salary
0   John  28.0  New York  65000.0
3  Linda  32.0    London  79000.0
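
dropna is more flexible than the all-or-nothing call above: subset restricts the missing-value check to certain columns, and thresh keeps rows that have at least a given number of non-missing values. A sketch with data shaped like df_missing:

```python
import pandas as pd

df_missing = pd.DataFrame({
    'Name': ['John', 'Anna', None, 'Linda', 'Bob'],
    'Age': [28, None, 29, 32, None],
    'Salary': [65000, 72000, None, 79000, 95000],
})

# Drop a row only when 'Salary' is missing
by_salary = df_missing.dropna(subset=['Salary'])

# Keep rows with at least 2 non-missing values
at_least_two = df_missing.dropna(thresh=2)

print(len(by_salary), len(at_least_two))
```

Both options keep far more data than a bare dropna() when missingness is concentrated in a few rows or columns.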

Data Transformation

Applying Functions

You can apply functions to transform your data:

python
# Apply a function to a single column
df['Name_Length'] = df['Name'].apply(len)
print("Names with their lengths:")
print(df[['Name', 'Name_Length']])

# Apply a custom function to create a new column
def calculate_bonus(row):
    if row['Experience'] > 10:
        return row['Salary'] * 0.15
    elif row['Experience'] > 5:
        return row['Salary'] * 0.10
    else:
        return row['Salary'] * 0.05

df['Bonus'] = df.apply(calculate_bonus, axis=1)
print("\nBonus calculation:")
print(df[['Name', 'Salary', 'Experience', 'Bonus']])

Output:

Names with their lengths:
    Name  Name_Length
0   John            4
1   Anna            4
2  Peter            5
3  Linda            5
4    Bob            3
5  Sarah            5
6   Mike            4
7   Emma            4
8    Tom            3
9  Laura            5

Bonus calculation:
    Name  Salary  Experience    Bonus
0   John   65000           5   3250.0
1   Anna   72000           8   7200.0
2  Peter   58000           4   2900.0
3  Linda   79000           7   7900.0
4    Bob   95000          15  14250.0
5  Sarah   67000           3   3350.0
6   Mike   82000           9   8200.0
7   Emma   71000           6   7100.0
8    Tom   88000          12  13200.0
9  Laura   76000          10   7600.0

GroupBy Operations

GroupBy allows you to split your data into groups and apply functions to each group:

python
# Group by Age range and calculate average salary
df['Age_Group'] = pd.cut(df['Age'], bins=[20, 30, 40, 50],
                         labels=['20s', '30s', '40s'])

age_group_stats = df.groupby('Age_Group').agg({
    'Salary': ['mean', 'min', 'max', 'count'],
    'Experience': 'mean'
})

print("Statistics by Age Group:")
print(age_group_stats)

# Let's do a more practical example - average experience by city
city_experience = df.groupby('City')['Experience'].mean().sort_values(ascending=False)
print("\nAverage experience by city:")
print(city_experience)

Output:

Statistics by Age Group:
             Salary                      Experience
               mean    min    max count       mean
Age_Group
20s        63333.33  58000  67000     3   4.000000
30s        76000.00  71000  82000     5   8.000000
40s        91500.00  88000  95000     2  13.500000

Average experience by city:
City
Tokyo       15.0
Rome        12.0
Toronto      9.0
Paris        8.0
London       7.0
Chicago      7.0
Madrid       6.0
New York     5.0
Berlin       4.0
Sydney       3.0
Name: Experience, dtype: float64
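
A related tool is transform, which returns a result aligned to the original rows instead of one row per group; this is handy for writing a group-level statistic back as a new column. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['Paris', 'Paris', 'Berlin'],
    'Salary': [70000, 80000, 60000],
})

# transform broadcasts each group's mean back to every row in that group
df['City_Avg_Salary'] = df.groupby('City')['Salary'].transform('mean')
print(df)
```

From here, a per-row deviation such as `df['Salary'] - df['City_Avg_Salary']` takes one more line, which an agg result would require a merge to compute.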

Real-World Application: Data Analysis

Let's put everything together in a real-world example. We'll analyze a dataset containing employee information:

python
# Creating a more comprehensive dataset
np.random.seed(42) # For reproducibility

departments = ['IT', 'Marketing', 'Finance', 'HR', 'Operations']
locations = ['New York', 'Chicago', 'San Francisco', 'Boston', 'Seattle']

data = {
    'Employee_ID': range(1001, 1031),
    'Name': ['Employee_' + str(i) for i in range(1, 31)],
    'Department': np.random.choice(departments, 30),
    'Salary': np.random.randint(50000, 150000, 30),
    'Years_Experience': np.random.randint(1, 20, 30),
    'Location': np.random.choice(locations, 30),
    'Performance_Score': np.random.uniform(2.0, 5.0, 30).round(1)
}

employees = pd.DataFrame(data)

# Display the first few rows
print("Employee Dataset:")
print(employees.head())

Now, let's analyze this dataset:

python
# 1. Basic statistics about the data
print("\nBasic Statistics:")
print(employees.describe())

# 2. Average salary by department
avg_salary_dept = employees.groupby('Department')['Salary'].mean().sort_values(ascending=False)
print("\nAverage Salary by Department:")
print(avg_salary_dept)

# 3. Count employees by location
location_count = employees.groupby('Location').size().sort_values(ascending=False)
print("\nEmployee Count by Location:")
print(location_count)

# 4. Find top performers (score > 4.5) and their details
top_performers = employees[employees['Performance_Score'] > 4.5]
print("\nTop Performers:")
print(top_performers[['Name', 'Department', 'Performance_Score', 'Salary']])

# 5. Calculate a bonus based on performance and experience
def bonus_calculator(row):
    base_percent = (row['Performance_Score'] - 2) * 10  # 0-30%
    experience_boost = min(row['Years_Experience'] * 0.5, 10)  # 0-10%
    return int(row['Salary'] * (base_percent + experience_boost) / 100)

employees['Bonus'] = employees.apply(bonus_calculator, axis=1)
print("\nEmployees with calculated bonus:")
print(employees[['Name', 'Performance_Score', 'Years_Experience', 'Salary', 'Bonus']].head(10))

# 6. Identify departments with above-average performance
dept_performance = employees.groupby('Department')['Performance_Score'].mean().sort_values(ascending=False)
above_avg = dept_performance[dept_performance > employees['Performance_Score'].mean()]
print("\nDepartments with Above-Average Performance:")
print(above_avg)

# 7. Correlation between years of experience and salary
correlation = employees['Years_Experience'].corr(employees['Salary'])
print(f"\nCorrelation between Years Experience and Salary: {correlation:.2f}")

This analysis demonstrates how to:

  • Calculate summary statistics
  • Group and aggregate data
  • Filter data based on conditions
  • Apply custom functions to create derived metrics
  • Identify patterns and correlations in the data

Summary

In this tutorial, we've covered the fundamental operations in pandas that you'll use in almost every data analysis task:

  1. Creating data structures: Series and DataFrames
  2. Viewing data: Inspecting the shape and content of your data
  3. Basic operations: Filtering, sorting, and manipulating data
  4. Handling missing values: Identifying and filling or removing missing data
  5. Transforming data: Applying functions to create new columns
  6. GroupBy operations: Aggregating data by categories
  7. Real-world application: Performing comprehensive data analysis

These basic operations form the foundation of data manipulation with pandas. As you gain experience, you'll be able to combine these operations to solve increasingly complex data problems.

Practice Exercises

  1. Create a DataFrame with student information (name, grades for different subjects, attendance)
  2. Calculate the average grade for each student
  3. Identify students with attendance below 80%
  4. Group students by grade ranges (A: 90-100, B: 80-89, etc.)
  5. Create a new column that calculates if a student passed (average grade > 70)
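
As one possible starting point for exercises 1, 2, 3, and 5 (the names, subjects, and values below are purely illustrative, not the only valid setup):

```python
import pandas as pd

# Exercise 1: hypothetical student data
students = pd.DataFrame({
    'Name': ['Ana', 'Ben', 'Cara'],
    'Math': [92, 78, 85],
    'Science': [88, 74, 91],
    'Attendance': [0.95, 0.75, 0.88],
})

# Exercise 2: average grade per student (row-wise mean over subject columns)
students['Average'] = students[['Math', 'Science']].mean(axis=1)

# Exercise 3: students with attendance below 80%
low_attendance = students[students['Attendance'] < 0.80]

# Exercise 5: pass flag based on the average grade
students['Passed'] = students['Average'] > 70

print(students)
print(low_attendance['Name'].tolist())
```

Exercise 4 can then reuse the pd.cut pattern from the GroupBy section to bin the Average column into letter grades.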


Happy data analyzing!


