
Pandas Basic Operations

Introduction

Pandas is one of the most powerful and flexible data analysis libraries in Python. It provides data structures and functions designed to make working with structured data fast, easy, and expressive. In this tutorial, we'll explore the basic operations that form the foundation of data manipulation with pandas.

Before we dive in, make sure you have pandas installed:

python
# Install pandas if you haven't already
# pip install pandas

# Import pandas
import pandas as pd
import numpy as np

By the end of this tutorial, you'll be comfortable performing common data operations that are essential for any data analysis workflow.

Creating Data Structures

Pandas provides two primary data structures: Series (1-dimensional) and DataFrame (2-dimensional).

Creating a Series

A Series is a one-dimensional labeled array that can hold any data type:

python
# Creating a Series from a list
s = pd.Series([1, 3, 5, 7, 9])
print(s)

Output:

0    1
1    3
2    5
3    7
4    9
dtype: int64

You can specify custom index labels:

python
# Creating a Series with custom indices
s = pd.Series([1, 3, 5, 7, 9], index=['a', 'b', 'c', 'd', 'e'])
print(s)

Output:

a    1
b    3
c    5
d    7
e    9
dtype: int64

Creating a DataFrame

A DataFrame is a 2-dimensional labeled data structure with columns that can be of different types:

python
# Creating a DataFrame from a dictionary
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London']
}

df = pd.DataFrame(data)
print(df)

Output:

    Name  Age      City
0   John   28  New York
1   Anna   34     Paris
2  Peter   29    Berlin
3  Linda   32    London

You can also create a DataFrame from a list of lists or arrays:

python
# Creating a DataFrame from a list of lists
data = [
    ['John', 28, 'New York'],
    ['Anna', 34, 'Paris'],
    ['Peter', 29, 'Berlin'],
    ['Linda', 32, 'London']
]

df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(df)

Viewing Data

Let's learn how to inspect our data:

python
# Sample DataFrame for demonstration
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'Bob', 'Sarah', 'Mike', 'Emma', 'Tom', 'Laura'],
    'Age': [28, 34, 29, 32, 45, 27, 36, 31, 42, 39],
    'City': ['New York', 'Paris', 'Berlin', 'London', 'Tokyo', 'Sydney', 'Toronto', 'Madrid', 'Rome', 'Chicago'],
    'Salary': [65000, 72000, 58000, 79000, 95000, 67000, 82000, 71000, 88000, 76000]
}

df = pd.DataFrame(data)

Getting Basic Information

python
# View the first 5 rows
print("First 5 rows:")
print(df.head())

# View the last 5 rows
print("\nLast 5 rows:")
print(df.tail())

# View basic statistics of numeric columns
print("\nSummary statistics:")
print(df.describe())

# Get DataFrame information (info() prints directly and returns None,
# so there's no need to wrap it in print())
print("\nDataFrame info:")
df.info()

Output (partial):

First 5 rows:
    Name  Age      City  Salary
0   John   28  New York   65000
1   Anna   34     Paris   72000
2  Peter   29    Berlin   58000
3  Linda   32    London   79000
4    Bob   45     Tokyo   95000

Last 5 rows:
    Name  Age     City  Salary
5  Sarah   27   Sydney   67000
6   Mike   36  Toronto   82000
7   Emma   31   Madrid   71000
8    Tom   42     Rome   88000
9  Laura   39  Chicago   76000

Accessing Columns

You can access columns in multiple ways:

python
# Get the 'Name' column
print(df['Name'])

# Alternative: attribute access, e.g. df.Age (works only for column
# names without spaces or special characters)
print(df.Age)

Output:

0     John
1     Anna
2    Peter
3    Linda
4      Bob
5    Sarah
6     Mike
7     Emma
8      Tom
9    Laura
Name: Name, dtype: object

0    28
1    34
2    29
3    32
4    45
5    27
6    36
7    31
8    42
9    39
Name: Age, dtype: int64
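
A third common pattern, not shown above: passing a list of column names selects several columns at once and returns a DataFrame rather than a Series. A minimal sketch, using a smaller frame with the same columns:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter'],
    'Age': [28, 34, 29],
    'City': ['New York', 'Paris', 'Berlin'],
})

# A list of column names returns a DataFrame (note the double brackets)
subset = df[['Name', 'Age']]
print(subset)

# A single label returns a Series; a one-element list returns a 1-column DataFrame
print(type(df['Name']).__name__)    # Series
print(type(df[['Name']]).__name__)  # DataFrame
```

This distinction matters when a later step expects a DataFrame (for example, the `X` argument of many scikit-learn estimators).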

Accessing Rows

Access rows using iloc (integer position) or loc (label):

python
# Get row by position using iloc
print("Row at position 2:")
print(df.iloc[2])

# Get row by label using loc
print("\nRow with label 3:")
print(df.loc[3])

# Get specific rows and columns
print("\nSpecific data using iloc:")
print(df.iloc[1:4, 1:3]) # Rows 1-3, columns 1-2

Output:

Row at position 2:
Name       Peter
Age           29
City      Berlin
Salary     58000
Name: 2, dtype: object

Row with label 3:
Name       Linda
Age           32
City      London
Salary     79000
Name: 3, dtype: object

Specific data using iloc:
   Age    City
1   34   Paris
2   29  Berlin
3   32  London
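
One subtlety worth remembering: unlike ordinary Python slicing, a loc slice includes the end label, while an iloc slice excludes the end position. A small sketch illustrating the difference:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
})

# loc includes the end label: labels 1 through 3 -> 3 rows
print(df.loc[1:3, ['Name', 'City']])

# iloc excludes the end position: positions 1 and 2 -> 2 rows
print(df.iloc[1:3])

print(len(df.loc[1:3]), len(df.iloc[1:3]))  # 3 2
```

With a default integer index the two look interchangeable, but once the index is reordered or relabeled (e.g. after sorting) they select different rows.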

Basic Operations

Filtering Data

You can filter data based on conditions:

python
# Filter rows where Age > 30
older_than_30 = df[df['Age'] > 30]
print("People older than 30:")
print(older_than_30)

# Filter multiple conditions
high_earners_in_big_cities = df[(df['Salary'] > 70000) &
                                (df['City'].isin(['New York', 'London', 'Tokyo']))]
print("\nHigh earners in major cities:")
print(high_earners_in_big_cities)

Output:

People older than 30:
    Name  Age     City  Salary
1   Anna   34    Paris   72000
3  Linda   32   London   79000
4    Bob   45    Tokyo   95000
6   Mike   36  Toronto   82000
8    Tom   42     Rome   88000
9  Laura   39  Chicago   76000

High earners in major cities:
    Name  Age    City  Salary
3  Linda   32  London   79000
4    Bob   45   Tokyo   95000

Adding and Removing Columns

python
# Add a new column
df['Experience'] = [5, 8, 4, 7, 15, 3, 9, 6, 12, 10]
print("DataFrame with new column:")
print(df.head())

# Remove a column
df_no_city = df.drop('City', axis=1)
print("\nDataFrame without City column:")
print(df_no_city.head())

Output:

DataFrame with new column:
    Name  Age      City  Salary  Experience
0   John   28  New York   65000           5
1   Anna   34     Paris   72000           8
2  Peter   29    Berlin   58000           4
3  Linda   32    London   79000           7
4    Bob   45     Tokyo   95000          15

DataFrame without City column:
    Name  Age  Salary  Experience
0   John   28   65000           5
1   Anna   34   72000           8
2  Peter   29   58000           4
3  Linda   32   79000           7
4    Bob   45   95000          15
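
New columns don't have to come from literal lists; they are often derived from existing columns with vectorized arithmetic, which is both shorter and faster than row-by-row loops. A minimal sketch (the Salary_K column name is just illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter'],
    'Salary': [65000, 72000, 58000],
})

# Vectorized arithmetic operates on the whole column at once
df['Salary_K'] = df['Salary'] / 1000
print(df)

# drop() also accepts a list, so several columns can be removed in one call
trimmed = df.drop(['Salary'], axis=1)
print(trimmed)
```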

Sorting Data

python
# Sort by Age in ascending order
sorted_by_age = df.sort_values('Age')
print("Sorted by Age (ascending):")
print(sorted_by_age.head())

# Sort by Salary in descending order
sorted_by_salary_desc = df.sort_values('Salary', ascending=False)
print("\nSorted by Salary (descending):")
print(sorted_by_salary_desc.head())

# Sort by multiple columns
sorted_multiple = df.sort_values(['City', 'Age'])
print("\nSorted by City then Age:")
print(sorted_multiple.head())

Output:

Sorted by Age (ascending):
    Name  Age      City  Salary  Experience
5  Sarah   27    Sydney   67000           3
0   John   28  New York   65000           5
2  Peter   29    Berlin   58000           4
3  Linda   32    London   79000           7
1   Anna   34     Paris   72000           8
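
When sorting by multiple columns you can also mix directions by passing a list to ascending, one flag per column. A small sketch with a hypothetical two-column frame:

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['Paris', 'Paris', 'Berlin'],
    'Age': [34, 28, 29],
})

# City ascending, Age descending within each city
result = df.sort_values(['City', 'Age'], ascending=[True, False])
print(result)
```

Note that sorting preserves the original index labels, which is why the Sorted by Age output above shows the rows with their old index values.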

Handling Missing Values

Let's create a DataFrame with some missing values:

python
# Creating data with missing values
data_missing = {
    'Name': ['John', 'Anna', None, 'Linda', 'Bob'],
    'Age': [28, None, 29, 32, None],
    'City': ['New York', 'Paris', None, 'London', 'Tokyo'],
    'Salary': [65000, 72000, None, 79000, 95000]
}

df_missing = pd.DataFrame(data_missing)
print("DataFrame with missing values:")
print(df_missing)

Output:

DataFrame with missing values:
    Name   Age      City   Salary
0   John  28.0  New York  65000.0
1   Anna   NaN     Paris  72000.0
2   None  29.0      None      NaN
3  Linda  32.0    London  79000.0
4    Bob   NaN     Tokyo  95000.0

Check for missing values:

python
# Check for missing values
print("\nMissing value count by column:")
print(df_missing.isnull().sum())

# Fill missing values
df_filled = df_missing.fillna({
    'Name': 'Unknown',
    'Age': df_missing['Age'].mean(),
    'City': 'Unknown',
    'Salary': df_missing['Salary'].median()
})
print("\nDataFrame with filled missing values:")
print(df_filled)

# Drop rows with missing values
df_dropped = df_missing.dropna()
print("\nDataFrame after dropping rows with any missing values:")
print(df_dropped)

Output:

Missing value count by column:
Name      1
Age       2
City      1
Salary    1
dtype: int64

DataFrame with filled missing values:
      Name        Age      City   Salary
0     John  28.000000  New York  65000.0
1     Anna  29.666667     Paris  72000.0
2  Unknown  29.000000   Unknown  75500.0
3    Linda  32.000000    London  79000.0
4      Bob  29.666667     Tokyo  95000.0

DataFrame after dropping rows with any missing values:
    Name   Age      City   Salary
0   John  28.0  New York  65000.0
3  Linda  32.0    London  79000.0
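
dropna is more flexible than the all-or-nothing call above: subset restricts the missing-value check to certain columns, and thresh keeps rows that have at least a given number of non-missing values. A sketch with data shaped like df_missing:

```python
import pandas as pd

df_missing = pd.DataFrame({
    'Name': ['John', 'Anna', None, 'Linda', 'Bob'],
    'Age': [28, None, 29, 32, None],
    'Salary': [65000, 72000, None, 79000, 95000],
})

# Drop a row only when 'Salary' is missing
by_salary = df_missing.dropna(subset=['Salary'])

# Keep rows with at least 2 non-missing values
at_least_two = df_missing.dropna(thresh=2)

print(len(by_salary), len(at_least_two))
```

Both options keep far more data than a bare dropna() when missingness is concentrated in a few rows or columns.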

Data Transformation

Applying Functions

You can apply functions to transform your data:

python
# Apply a function to a single column
df['Name_Length'] = df['Name'].apply(len)
print("Names with their lengths:")
print(df[['Name', 'Name_Length']])

# Apply a custom function to create a new column
def calculate_bonus(row):
    if row['Experience'] > 10:
        return row['Salary'] * 0.15
    elif row['Experience'] > 5:
        return row['Salary'] * 0.10
    else:
        return row['Salary'] * 0.05

df['Bonus'] = df.apply(calculate_bonus, axis=1)
print("\nBonus calculation:")
print(df[['Name', 'Salary', 'Experience', 'Bonus']])

Output:

Names with their lengths:
    Name  Name_Length
0   John            4
1   Anna            4
2  Peter            5
3  Linda            5
4    Bob            3
5  Sarah            5
6   Mike            4
7   Emma            4
8    Tom            3
9  Laura            5

Bonus calculation:
    Name  Salary  Experience    Bonus
0   John   65000           5   3250.0
1   Anna   72000           8   7200.0
2  Peter   58000           4   2900.0
3  Linda   79000           7   7900.0
4    Bob   95000          15  14250.0
5  Sarah   67000           3   3350.0
6   Mike   82000           9   8200.0
7   Emma   71000           6   7100.0
8    Tom   88000          12  13200.0
9  Laura   76000          10   7600.0

GroupBy Operations

GroupBy allows you to split your data into groups and apply functions to each group:

python
# Group by Age range and calculate average salary
df['Age_Group'] = pd.cut(df['Age'], bins=[20, 30, 40, 50],
                         labels=['20s', '30s', '40s'])

age_group_stats = df.groupby('Age_Group').agg({
    'Salary': ['mean', 'min', 'max', 'count'],
    'Experience': 'mean'
})

print("Statistics by Age Group:")
print(age_group_stats)

# Let's do a more practical example - average experience by city
city_experience = df.groupby('City')['Experience'].mean().sort_values(ascending=False)
print("\nAverage experience by city:")
print(city_experience)

Output:

Statistics by Age Group:
             Salary                      Experience
               mean    min    max count       mean
Age_Group
20s        63333.33  58000  67000     3   4.000000
30s        76000.00  71000  82000     5   8.000000
40s        91500.00  88000  95000     2  13.500000

Average experience by city:
City
Tokyo       15.0
Rome        12.0
Toronto      9.0
Paris        8.0
London       7.0
Chicago      7.0
Madrid       6.0
New York     5.0
Berlin       4.0
Sydney       3.0
Name: Experience, dtype: float64
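
A related tool is transform, which returns a result aligned to the original rows instead of one row per group; this is handy for writing a group-level statistic back as a new column. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['Paris', 'Paris', 'Berlin'],
    'Salary': [70000, 80000, 60000],
})

# transform broadcasts each group's mean back to every row in that group
df['City_Avg_Salary'] = df.groupby('City')['Salary'].transform('mean')
print(df)
```

From here, a per-row deviation such as `df['Salary'] - df['City_Avg_Salary']` takes one more line, which an agg result would require a merge to compute.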

Real-World Application: Data Analysis

Let's put everything together in a real-world example. We'll analyze a dataset containing employee information:

python
# Creating a more comprehensive dataset
np.random.seed(42) # For reproducibility

departments = ['IT', 'Marketing', 'Finance', 'HR', 'Operations']
locations = ['New York', 'Chicago', 'San Francisco', 'Boston', 'Seattle']

data = {
    'Employee_ID': range(1001, 1031),
    'Name': ['Employee_' + str(i) for i in range(1, 31)],
    'Department': np.random.choice(departments, 30),
    'Salary': np.random.randint(50000, 150000, 30),
    'Years_Experience': np.random.randint(1, 20, 30),
    'Location': np.random.choice(locations, 30),
    'Performance_Score': np.random.uniform(2.0, 5.0, 30).round(1)
}

employees = pd.DataFrame(data)

# Display the first few rows
print("Employee Dataset:")
print(employees.head())

Now, let's analyze this dataset:

python
# 1. Basic statistics about the data
print("\nBasic Statistics:")
print(employees.describe())

# 2. Average salary by department
avg_salary_dept = employees.groupby('Department')['Salary'].mean().sort_values(ascending=False)
print("\nAverage Salary by Department:")
print(avg_salary_dept)

# 3. Count employees by location
location_count = employees.groupby('Location').size().sort_values(ascending=False)
print("\nEmployee Count by Location:")
print(location_count)

# 4. Find top performers (score > 4.5) and their details
top_performers = employees[employees['Performance_Score'] > 4.5]
print("\nTop Performers:")
print(top_performers[['Name', 'Department', 'Performance_Score', 'Salary']])

# 5. Calculate a bonus based on performance and experience
def bonus_calculator(row):
    base_percent = (row['Performance_Score'] - 2) * 10  # 0-30%
    experience_boost = min(row['Years_Experience'] * 0.5, 10)  # 0-10%
    return int(row['Salary'] * (base_percent + experience_boost) / 100)

employees['Bonus'] = employees.apply(bonus_calculator, axis=1)
print("\nEmployees with calculated bonus:")
print(employees[['Name', 'Performance_Score', 'Years_Experience', 'Salary', 'Bonus']].head(10))

# 6. Identify departments with above-average performance
dept_performance = employees.groupby('Department')['Performance_Score'].mean().sort_values(ascending=False)
above_avg = dept_performance[dept_performance > employees['Performance_Score'].mean()]
print("\nDepartments with Above-Average Performance:")
print(above_avg)

# 7. Correlation between years of experience and salary
correlation = employees['Years_Experience'].corr(employees['Salary'])
print(f"\nCorrelation between Years Experience and Salary: {correlation:.2f}")

This analysis demonstrates how to:

  • Calculate summary statistics
  • Group and aggregate data
  • Filter data based on conditions
  • Apply custom functions to create derived metrics
  • Identify patterns and correlations in the data

Summary

In this tutorial, we've covered the fundamental operations in pandas that you'll use in almost every data analysis task:

  1. Creating data structures: Series and DataFrames
  2. Viewing data: Inspecting the shape and content of your data
  3. Basic operations: Filtering, sorting, and manipulating data
  4. Handling missing values: Identifying and filling or removing missing data
  5. Transforming data: Applying functions to create new columns
  6. GroupBy operations: Aggregating data by categories
  7. Real-world application: Performing comprehensive data analysis

These basic operations form the foundation of data manipulation with pandas. As you gain experience, you'll be able to combine these operations to solve increasingly complex data problems.

Practice Exercises

  1. Create a DataFrame with student information (name, grades for different subjects, attendance)
  2. Calculate the average grade for each student
  3. Identify students with attendance below 80%
  4. Group students by grade ranges (A: 90-100, B: 80-89, etc.)
  5. Create a new column that calculates if a student passed (average grade > 70)
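
As one possible starting point for exercises 1, 2, 3, and 5 (the names, subjects, and values below are purely illustrative, not the only valid setup):

```python
import pandas as pd

# Exercise 1: hypothetical student data
students = pd.DataFrame({
    'Name': ['Ana', 'Ben', 'Cara'],
    'Math': [92, 78, 85],
    'Science': [88, 74, 91],
    'Attendance': [0.95, 0.75, 0.88],
})

# Exercise 2: average grade per student (row-wise mean over subject columns)
students['Average'] = students[['Math', 'Science']].mean(axis=1)

# Exercise 3: students with attendance below 80%
low_attendance = students[students['Attendance'] < 0.80]

# Exercise 5: pass flag based on the average grade
students['Passed'] = students['Average'] > 70

print(students)
print(low_attendance['Name'].tolist())
```

Exercise 4 can then reuse the pd.cut pattern from the GroupBy section to bin the Average column into letter grades.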


Happy data analyzing!


