Python Pandas Basics
Introduction
Pandas is one of the most essential libraries for data manipulation and analysis in Python. It provides powerful data structures and functions that make working with structured data intuitive and efficient. Whether you're analyzing financial data, processing survey results, or preparing datasets for machine learning, Pandas is likely to be a core tool in your Python data science workflow.
In this tutorial, we'll cover the fundamentals of Pandas, focusing on:
- What Pandas is and why it's important
- Core Pandas data structures: Series and DataFrame
- Basic data operations and manipulations
- Reading and writing data with Pandas
- Essential data analysis functionality
Getting Started with Pandas
Installation
If you haven't already installed Pandas, you can do so using pip:
pip install pandas
For data visualization capabilities that we'll use in some examples, you might also want to install Matplotlib:
pip install matplotlib
Importing Pandas
To use Pandas in your Python script or notebook, import it with the conventional alias pd:
import pandas as pd
import numpy as np # Often used alongside Pandas
Core Pandas Data Structures
Series
A Series is a one-dimensional labeled array that can hold any data type. Think of it as a single column of data with an index.
Creating a Series
# Creating a Series from a list
s = pd.Series([10, 20, 30, 40])
print(s)
Output:
0    10
1    20
2    30
3    40
dtype: int64
You can also create a Series with a custom index:
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s)
Output:
a    10
b    20
c    30
d    40
dtype: int64
Accessing Elements in a Series
# Access by index label
print(s['a'])
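# Access by position (use .iloc; plain s[0] is deprecated in recent pandas versions when the index holds labels)
print(s.iloc[0])
# Multiple elements
print(s[['a', 'c']])
# Slicing by label (the end label 'c' is included)
print(s['a':'c'])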
Output:
10
10
a    10
c    30
dtype: int64
a    10
b    20
c    30
dtype: int64
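To make the difference between label-based and position-based access explicit, here is a minimal sketch using the same Series as above. The variable names by_label and by_position are only illustrative.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# .loc always works with index labels; a label slice includes the end label
by_label = s.loc['a':'c']

# .iloc always works with integer positions; a position slice excludes the end
by_position = s.iloc[0:2]

print(by_label.tolist())     # [10, 20, 30]
print(by_position.tolist())  # [10, 20]
```

Using .loc and .iloc explicitly avoids any ambiguity about whether a key is meant as a label or a position.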
DataFrame
A DataFrame is a two-dimensional labeled data structure with columns that can be of different types. It's similar to a spreadsheet or SQL table.
Creating a DataFrame
# Creating a DataFrame from a dictionary
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London']
}
df = pd.DataFrame(data)
print(df)
Output:
    Name  Age      City
0   John   28  New York
1   Anna   34     Paris
2  Peter   29    Berlin
3  Linda   32    London
You can also specify the row indices:
df = pd.DataFrame(data, index=['p1', 'p2', 'p3', 'p4'])
print(df)
Output:
     Name  Age      City
p1   John   28  New York
p2   Anna   34     Paris
p3  Peter   29    Berlin
p4  Linda   32    London
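One way to connect the two data structures: each column of a DataFrame is itself a Series that shares the DataFrame's row index. The short sketch below illustrates this with the same data; the variable name ages is only illustrative.

```python
import pandas as pd

data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
}
df = pd.DataFrame(data, index=['p1', 'p2', 'p3', 'p4'])

# Selecting one column returns a Series with the same row labels
ages = df['Age']
print(type(ages).__name__)  # Series
print(list(ages.index))     # ['p1', 'p2', 'p3', 'p4']

# Each column keeps its own data type
print(df.dtypes)
```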
Basic DataFrame Information
# View the first 2 rows
print(df.head(2))
# Get basic statistics
print(df.describe())
# Get information about DataFrame (info() prints directly and returns None)
df.info()
# Get dimensions (rows, columns)
print(df.shape)
Example output for df.describe():
             Age
count   4.000000
mean   30.750000
std     2.753785
min    28.000000
25%    28.750000
50%    30.500000
75%    32.500000
max    34.000000
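The output above lists only Age because describe() summarizes numeric columns by default when any are present. As a small sketch, passing include='all' also summarizes the text columns; the variable names here are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
})

# Default: numeric columns only
numeric_summary = df.describe()

# include='all' adds rows such as unique, top and freq for text columns
full_summary = df.describe(include='all')

print(list(numeric_summary.columns))  # ['Age']
print(list(full_summary.columns))     # ['Name', 'Age', 'City']
print(df.shape)                       # (4, 3)
```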
Working with Data
Reading Data from Files
One of the most common ways to create a DataFrame is by reading data from files:
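# The file names below are placeholders; separate variable names keep the earlier df intact
# Read from CSV
csv_df = pd.read_csv('data.csv')
# Read from Excel (requires an Excel engine such as openpyxl)
excel_df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# Read from JSON
json_df = pd.read_json('data.json')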
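Writing works the other way around, with methods such as to_csv. The sketch below round-trips a small DataFrame through CSV using an in-memory buffer, so it runs without any files on disk. The buffer stands in for a real path such as 'output.csv', which is a hypothetical name.

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [28, 34]})

# Write to CSV; index=False leaves out the row index column
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV text back into a new DataFrame
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```

With a real file, you would pass the path instead of the buffer, for example df.to_csv('output.csv', index=False).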
Selecting and Filtering Data
Pandas provides multiple ways to select and filter data. The examples below use the DataFrame with the row labels p1 to p4 created earlier:
# Select a single column (returns a Series)
print(df['Name'])
# Select multiple columns (returns a DataFrame)
print(df[['Name', 'City']])
# Select rows by index label
print(df.loc['p1'])
# Select rows by integer position
print(df.iloc[0])
# Select rows and columns by label
print(df.loc['p1', 'Name'])
# Select rows and columns by position
print(df.iloc[0, 0])
# Filtering with conditions
print(df[df['Age'] > 30])
Output of df[df['Age'] > 30]:
     Name  Age    City
p2   Anna   34   Paris
p4  Linda   32  London
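Conditions can also be combined. The following sketch uses the same data. Note that each condition needs its own parentheses, and that Pandas uses & (and) and | (or) rather than Python's and/or keywords.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
}, index=['p1', 'p2', 'p3', 'p4'])

# Both conditions must hold: older than 28 AND not living in Paris
older_not_paris = df[(df['Age'] > 28) & (df['City'] != 'Paris')]
print(older_not_paris['Name'].tolist())  # ['Peter', 'Linda']

# isin() keeps rows whose value matches any entry in a list
in_cities = df[df['City'].isin(['Paris', 'London'])]
print(in_cities['Name'].tolist())        # ['Anna', 'Linda']
```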
Adding and Modifying Data
# Add a new column
df['Country'] = ['USA', 'France', 'Germany', 'UK']
# Modify existing values
df.loc['p1', 'Age'] = 29
# Add a new row
df.loc['p5'] = ['Michael', 35, 'Toronto', 'Canada']
print(df)
Output:
       Name  Age      City  Country
p1     John   29  New York      USA
p2     Anna   34     Paris   France
p3    Peter   29    Berlin  Germany
p4    Linda   32    London       UK
p5  Michael   35   Toronto   Canada
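Assigning through .loc is convenient for a single row. When several rows need to be added at once, one common alternative is pd.concat, sketched below. The extra person 'Sara' and the label 'p6' are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['John', 'Anna'], 'Age': [29, 34]},
    index=['p1', 'p2'],
)

# New rows go into their own DataFrame with matching column names
new_rows = pd.DataFrame(
    {'Name': ['Michael', 'Sara'], 'Age': [35, 27]},  # 'Sara' is hypothetical
    index=['p5', 'p6'],
)

# concat stacks the rows and returns a new DataFrame; df itself is unchanged
combined = pd.concat([df, new_rows])
print(combined)
```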
Basic Operations
# Calculate the mean age
print("Average age:", df['Age'].mean())
# Count occurrences
print(df['City'].value_counts())
# Sort by Age
print(df.sort_values(by='Age', ascending=False))
# Group by and aggregate
print(df.groupby('Country')['Age'].mean())
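Grouping becomes more informative when groups contain several rows. The sketch below adds a hypothetical 'Team' column (not part of the data above) and computes two aggregates per group in one call, using named aggregation.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Team': ['X', 'Y', 'X', 'Y'],  # hypothetical grouping column
    'Age': [28, 34, 30, 32],
})

# Each keyword becomes an output column: new_name=(source_column, function)
team_stats = df.groupby('Team').agg(
    mean_age=('Age', 'mean'),
    members=('Name', 'count'),
)
print(team_stats)

# Sorting returns a new DataFrame ordered by Age, oldest first
oldest_first = df.sort_values(by='Age', ascending=False)
print(oldest_first['Name'].tolist())  # ['Anna', 'Linda', 'Peter', 'John']
```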
Data Cleaning and Preparation
Handling Missing Values
Missing values are common in real-world datasets. Pandas represents them as NaN (Not a Number).
# Detecting missing values
print(df.isna().sum())
# Filling missing values
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Dropping rows with missing values
df_clean = df.dropna()
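The DataFrame built earlier has no missing values, so here is a small self-contained sketch with one missing age to show what these calls do. The data is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter'],
    'Age': [28, np.nan, 30],  # Anna's age is missing
})

# Count missing values per column
missing = df.isna().sum()
print(missing['Age'])    # 1

# Replace the missing age with the mean of the known ages (28 and 30)
filled = df['Age'].fillna(df['Age'].mean())
print(filled.tolist())   # [28.0, 29.0, 30.0]

# Alternatively, drop every row that has a missing value
df_clean = df.dropna()
print(len(df_clean))     # 2
```

Whether to fill or drop depends on the dataset: filling keeps the row but introduces an estimated value, while dropping loses the row entirely.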
Removing Duplicates
# Check for duplicates
print(df.duplicated().sum())
# Remove duplicates
df_unique = df.drop_duplicates()
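As a brief sketch of how duplicate detection behaves, the example below uses made-up data with one repeated row. The subset and keep parameters control which columns are compared and which copy survives.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'John'],
    'City': ['New York', 'Paris', 'New York'],
})

# duplicated() marks rows that repeat an earlier row
flags = df.duplicated()
print(flags.tolist())            # [False, False, True]

# By default the first occurrence is kept
df_unique = df.drop_duplicates()
print(list(df_unique.index))     # [0, 1]

# Compare only the Name column and keep the last occurrence instead
keep_last = df.drop_duplicates(subset=['Name'], keep='last')
print(list(keep_last.index))     # [1, 2]
```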
Data Type Conversion
# Check data types
print(df.dtypes)
# Convert data types (astype(int) raises an error if the column still contains NaN)
df['Age'] = df['Age'].astype(int)
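Real data often stores numbers as text, sometimes mixed with entries that are not numbers at all. The sketch below, on made-up data, uses pd.to_numeric with errors='coerce' to turn unparseable entries into missing values, then converts to the nullable 'Int64' type, which can hold integers alongside missing values.

```python
import pandas as pd

raw_ages = pd.Series(['28', '34', 'unknown'])  # made-up input data

# Entries that cannot be parsed become NaN instead of raising an error
numeric_ages = pd.to_numeric(raw_ages, errors='coerce')
print(numeric_ages.isna().tolist())  # [False, False, True]

# 'Int64' (capital I) is the nullable integer type; plain int cannot hold NaN
nullable_ages = numeric_ages.astype('Int64')
print(nullable_ages.dtype)           # Int64
```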
Practical Example: Data Analysis Workflow
Let's walk through a complete example to illustrate a basic Pandas workflow:
# Step 1: Import libraries
import pandas as pd
import matplotlib.pyplot as plt
# Step 2: Load the data (we'll create sample data for this example)
data = {
    'Date': pd.date_range(start='2023-01-01', periods=10),
    'Product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C'],
    'Sales': [150, 200, 130, 340, 220, 160, 310, 190, 140, 280],
    'Revenue': [1500, 3000, 1300, 4080, 3300, 1600, 3720, 2850, 1400, 3360]
}
df = pd.DataFrame(data)
print("Original Data:")
print(df.head())
# Step 3: Data exploration
print("\nData Info:")
df.info()  # info() prints directly; wrapping it in print() would also print None
print("\nSummary Statistics:")
print(df.describe())
# Step 4: Data preparation
# Convert Date column to datetime (already datetime here, but needed when dates are loaded as text)
df['Date'] = pd.to_datetime(df['Date'])
# Extract month and day
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
# Step 5: Data analysis
# Calculate average revenue per product
avg_revenue = df.groupby('Product')['Revenue'].mean().reset_index()
print("\nAverage Revenue per Product:")
print(avg_revenue)
# Calculate total sales and revenue by product
product_summary = df.groupby('Product').agg({
    'Sales': 'sum',
    'Revenue': 'sum'
}).reset_index()
print("\nProduct Summary:")
print(product_summary)
# Step 6: Visualization (basic example)
plt.figure(figsize=(10, 6))
plt.bar(product_summary['Product'], product_summary['Revenue'])
plt.title('Total Revenue by Product')
plt.xlabel('Product')
plt.ylabel('Revenue')
# In a real script, you would add: plt.show()
This example demonstrates:
- Loading data into a DataFrame
- Exploring and understanding the data
- Preparing and transforming the data
- Analyzing the data through grouping and aggregation
- Visualizing the results (though the visualization won't display in this tutorial)
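As a possible extension of the analysis step, the sketch below derives a new column from the aggregated totals. It assumes that Sales counts units sold, so Revenue divided by Sales gives a price per unit. That interpretation and the column name 'UnitPrice' are assumptions, not part of the example above.

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C'],
    'Sales': [150, 200, 130, 340, 220, 160, 310, 190, 140, 280],
    'Revenue': [1500, 3000, 1300, 4080, 3300, 1600, 3720, 2850, 1400, 3360],
})

# Total sales and revenue per product, as in the workflow above
product_summary = df.groupby('Product').agg({
    'Sales': 'sum',
    'Revenue': 'sum',
}).reset_index()

# Column arithmetic is element-wise: one result per product row
# 'UnitPrice' is a hypothetical column name (assumes Sales = units sold)
product_summary['UnitPrice'] = product_summary['Revenue'] / product_summary['Sales']

print(product_summary)
```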
Summary
In this tutorial, we've covered the fundamentals of Pandas, including:
- Creating and working with Series and DataFrames
- Reading and manipulating data
- Selecting and filtering data
- Handling missing values and duplicates
- Basic data analysis operations
- A practical workflow example
Pandas is a vast library with many more capabilities than we can cover in a single tutorial. However, these basics will give you a solid foundation to build upon as you dive deeper into data science with Python.
Additional Resources
To continue learning Pandas, check out these resources:
- Official Pandas Documentation
- 10 Minutes to Pandas - A quick introduction from the official docs
- Pandas Cookbook - Practical recipes for data analysis
Exercises
To reinforce your learning, try these exercises:
- Create a DataFrame with student information (name, age, grade) and perform basic operations on it.
- Read a CSV file of your choice and perform data cleaning tasks:
  - Handle missing values
  - Remove duplicates
  - Convert data types as needed
- Group and aggregate data to answer questions like:
  - What is the average value per category?
  - Which category has the highest/lowest total?
- Create a new column based on calculations from existing columns (e.g., BMI from height and weight).
- Practice filtering data with multiple conditions (e.g., age > 25 AND city = 'New York').
Happy data analyzing with Pandas!