Python Pandas Basics

Introduction

Pandas is one of the most essential libraries for data manipulation and analysis in Python. It provides powerful data structures and functions that make working with structured data intuitive and efficient. Whether you're analyzing financial data, processing survey results, or preparing datasets for machine learning, Pandas is likely to be a core tool in your Python data science workflow.

In this tutorial, we'll cover the fundamentals of Pandas, focusing on:

- What Pandas is and why it's important
- Core Pandas data structures: Series and DataFrame
- Basic data operations and manipulations
- Reading and writing data with Pandas
- Essential data analysis functionality

Getting Started with Pandas

Installation

Before we begin, you'll need to install Pandas. If you haven't already installed it, you can do so using pip:

bash
pip install pandas

The workflow example near the end of this tutorial draws a simple chart, so you may also want to install Matplotlib:

bash
pip install matplotlib

Importing Pandas

To use Pandas in your Python script or notebook, import it with the conventional alias pd:

python
import pandas as pd
import numpy as np  # Often used alongside Pandas

Core Pandas Data Structures

Series

A Series is a one-dimensional labeled array that can hold any data type. Think of it as a single column of data with an index.

Creating a Series

python
# Creating a Series from a list
s = pd.Series([10, 20, 30, 40])
print(s)

Output:

0    10
1    20
2    30
3    40
dtype: int64

You can also create a Series with a custom index:

python
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s)

Output:

a    10
b    20
c    30
d    40
dtype: int64

Accessing Elements in a Series

python
# Access by index label
print(s['a'])

# Access by position (use .iloc; plain s[0] on a label-indexed Series is deprecated)
print(s.iloc[0])

# Multiple elements
print(s[['a', 'c']])

# Slicing
print(s['a':'c'])

Output:

10
10
a    10
c    30
dtype: int64
a    10
b    20
c    30
dtype: int64
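
Note that the label slice above includes its end label, which differs from ordinary Python slicing. The short sketch below contrasts the two styles using the same Series; the variable names `by_label` and `by_position` are only illustrative.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# Label-based slice: both endpoints ('a' and 'c') are included
by_label = s.loc['a':'c']

# Position-based slice: the end position (2) is excluded, as in plain Python
by_position = s.iloc[0:2]

print(by_label.tolist())     # [10, 20, 30]
print(by_position.tolist())  # [10, 20]
```

Using .loc for labels and .iloc for positions makes it explicit which rule applies.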

DataFrame

A DataFrame is a two-dimensional labeled data structure with columns that can be of different types. It's similar to a spreadsheet or SQL table.

Creating a DataFrame

python
# Creating a DataFrame from a dictionary
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London']
}

df = pd.DataFrame(data)
print(df)

Output:

    Name  Age      City
0   John   28  New York
1   Anna   34     Paris
2  Peter   29    Berlin
3  Linda   32    London

You can also specify the row indices:

python
df = pd.DataFrame(data, index=['p1', 'p2', 'p3', 'p4'])
print(df)

Output:

     Name  Age      City
p1   John   28  New York
p2   Anna   34     Paris
p3  Peter   29    Berlin
p4  Linda   32    London

Basic DataFrame Information

python
# View the first 2 rows
print(df.head(2))

# Get basic statistics
print(df.describe())

# Get information about the DataFrame
# (info() prints its report directly and returns None, so no print() is needed)
df.info()

# Get dimensions (rows, columns)
print(df.shape)

Example output for df.describe():

             Age
count   4.000000
mean   30.750000
std     2.753785
min    28.000000
25%    28.750000
50%    30.500000
75%    32.500000
max    34.000000

Working with Data

Reading Data from Files

One of the most common ways to create a DataFrame is by reading data from files:

python
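# The file names below are placeholders; replace them with your own files.
# Separate variable names keep the df built earlier intact for the examples that follow.

# Read from CSV
df_csv = pd.read_csv('data.csv')

# Read from Excel (reading .xlsx files requires the openpyxl package)
df_excel = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# Read from JSON
df_json = pd.read_json('data.json')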
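
Writing data works the same way in reverse, through methods such as to_csv. The sketch below is a minimal round trip that writes to an in-memory buffer so it runs without any files on disk; the sample data and the names `df_people`, `buffer`, and `df_roundtrip` are assumptions made for illustration.

```python
import io

import pandas as pd

# Small sample DataFrame (illustrative data)
df_people = pd.DataFrame({
    'Name': ['John', 'Anna'],
    'Age': [28, 34],
})

# Write CSV text to an in-memory buffer; index=False omits the row labels
buffer = io.StringIO()
df_people.to_csv(buffer, index=False)

# Rewind the buffer and read the same data back
buffer.seek(0)
df_roundtrip = pd.read_csv(buffer)

print(df_roundtrip)
```

To write to disk instead, pass a file path such as 'output.csv' to to_csv in place of the buffer.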

Selecting and Filtering Data

Pandas provides multiple ways to select and filter data:

python
# Select a single column (returns a Series)
print(df['Name'])

# Select multiple columns (returns a DataFrame)
print(df[['Name', 'City']])

# Select rows by index label
print(df.loc['p1'])

# Select rows by integer position
print(df.iloc[0])

# Select rows and columns by label
print(df.loc['p1', 'Name'])

# Select rows and columns by position
print(df.iloc[0, 0])

# Filtering with conditions
print(df[df['Age'] > 30])

Output of df[df['Age'] > 30]:

     Name  Age    City
p2   Anna   34   Paris
p4  Linda   32  London
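
Conditions can also be combined. The following sketch rebuilds the same sample data under the hypothetical name `df_people` and shows AND, OR, and membership filters; each condition needs its own parentheses because & and | bind more tightly than comparisons.

```python
import pandas as pd

df_people = pd.DataFrame(
    {
        'Name': ['John', 'Anna', 'Peter', 'Linda'],
        'Age': [28, 34, 29, 32],
        'City': ['New York', 'Paris', 'Berlin', 'London'],
    },
    index=['p1', 'p2', 'p3', 'p4'],
)

# AND: both conditions must hold
over_30_in_paris = df_people[(df_people['Age'] > 30) & (df_people['City'] == 'Paris')]

# OR: at least one condition must hold
young_or_london = df_people[(df_people['Age'] < 29) | (df_people['City'] == 'London')]

# isin(): keep rows whose value appears in a list
in_paris_or_berlin = df_people[df_people['City'].isin(['Paris', 'Berlin'])]

print(over_30_in_paris.index.tolist())    # ['p2']
print(young_or_london.index.tolist())     # ['p1', 'p4']
print(in_paris_or_berlin.index.tolist())  # ['p2', 'p3']
```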

Adding and Modifying Data

python
# Add a new column
df['Country'] = ['USA', 'France', 'Germany', 'UK']

# Modify existing values
df.loc['p1', 'Age'] = 29

# Add a new row
df.loc['p5'] = ['Michael', 35, 'Toronto', 'Canada']

print(df)

Output:

       Name  Age      City  Country
p1     John   29  New York      USA
p2     Anna   34     Paris   France
p3    Peter   29    Berlin  Germany
p4    Linda   32    London       UK
p5  Michael   35   Toronto   Canada
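
New columns are often computed from existing ones rather than typed in by hand. Here is a small sketch using the names and ages from the output above; the column names `AgeInMonths` and `Over30` and the variable `with_flag` are made up for this example.

```python
import pandas as pd

df_people = pd.DataFrame(
    {
        'Name': ['John', 'Anna', 'Peter', 'Linda', 'Michael'],
        'Age': [29, 34, 29, 32, 35],
    },
    index=['p1', 'p2', 'p3', 'p4', 'p5'],
)

# Arithmetic on a column applies to every row at once
df_people['AgeInMonths'] = df_people['Age'] * 12

# assign() returns a new DataFrame and leaves the original unchanged
with_flag = df_people.assign(Over30=df_people['Age'] > 30)

print(with_flag)
```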

Basic Operations

python
# Calculate the mean age
print("Average age:", df['Age'].mean())

# Count occurrences
print(df['City'].value_counts())

# Sort by Age
print(df.sort_values(by='Age', ascending=False))

# Group by and aggregate
print(df.groupby('Country')['Age'].mean())
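
Grouping is easier to see when a group contains more than one row. The sketch below uses hypothetical sample data (`df_visits`) in which two countries appear twice, and computes several aggregates in one call.

```python
import pandas as pd

# Hypothetical sample data with repeated countries
df_visits = pd.DataFrame({
    'Name': ['John', 'Mary', 'Anna', 'Louis', 'Linda'],
    'Age': [20, 30, 40, 50, 25],
    'Country': ['USA', 'USA', 'France', 'France', 'UK'],
})

# How many rows each country has
country_counts = df_visits['Country'].value_counts()

# Rows ordered from oldest to youngest
oldest_first = df_visits.sort_values(by='Age', ascending=False)

# Several aggregates per group; the result is indexed by Country
age_stats = df_visits.groupby('Country')['Age'].agg(['mean', 'max', 'count'])

print(age_stats)
```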

Data Cleaning and Preparation

Handling Missing Values

Missing values are common in real-world datasets. Pandas marks them as NaN (Not a Number) in numeric columns, and isna() also treats None and NaT (missing dates) as missing.

python
# Detecting missing values
print(df.isna().sum())

# Filling missing values
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Dropping rows with missing values
df_clean = df.dropna()
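
The sample DataFrame used so far has no gaps, so the calls above have nothing to act on. The following sketch builds a small hypothetical DataFrame, `df_missing`, with one missing name and one missing age to show what each call does.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column
df_missing = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', None],
    'Age': [28.0, np.nan, 30.0, 32.0],
})

# Count missing values per column
missing_per_column = df_missing.isna().sum()

# Replace the missing age with the mean of the known ages (28, 30, 32)
age_filled = df_missing['Age'].fillna(df_missing['Age'].mean())

# Drop every row that has at least one missing value
df_clean = df_missing.dropna()

print(missing_per_column.to_dict())  # {'Name': 1, 'Age': 1}
print(age_filled.tolist())           # [28.0, 30.0, 30.0, 32.0]
print(df_clean.index.tolist())       # [0, 2]
```

Filling with the mean keeps every row but invents a value, while dropna keeps only complete rows; which is appropriate depends on the analysis.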

Removing Duplicates

python
# Check for duplicates
print(df.duplicated().sum())

# Remove duplicates
df_unique = df.drop_duplicates()
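
By default a row counts as a duplicate only if every column matches an earlier row. This sketch, on a hypothetical `df_dupes` DataFrame, shows the default behavior alongside the subset and keep parameters of drop_duplicates.

```python
import pandas as pd

# Hypothetical data: row 2 repeats row 0 exactly; row 3 repeats only the name in row 1
df_dupes = pd.DataFrame({
    'Name': ['John', 'Anna', 'John', 'Anna'],
    'City': ['New York', 'Paris', 'New York', 'Lyon'],
})

# Number of rows that fully repeat an earlier row
n_full_duplicates = int(df_dupes.duplicated().sum())

# Default: compare all columns and keep the first occurrence
df_unique = df_dupes.drop_duplicates()

# Compare only the Name column and keep the last occurrence of each name
df_unique_names = df_dupes.drop_duplicates(subset='Name', keep='last')

print(n_full_duplicates)               # 1
print(df_unique.index.tolist())        # [0, 1, 3]
print(df_unique_names.index.tolist())  # [2, 3]
```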

Data Type Conversion

python
# Check data types
print(df.dtypes)

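# Convert data types
# (astype(int) raises an error if the column still contains NaN, so fill or drop missing values first)
df['Age'] = df['Age'].astype(int)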
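
Real data often arrives as text with a few unparseable entries. One possible approach, sketched here on a hypothetical Series named `raw`, is to coerce bad values to NaN with pd.to_numeric and then use the nullable Int64 dtype, which can hold integers alongside missing values.

```python
import pandas as pd

# Hypothetical raw input: numbers stored as text, plus one bad entry
raw = pd.Series(['28', '34', 'unknown', '32'])

# errors='coerce' turns values that cannot be parsed into NaN
ages = pd.to_numeric(raw, errors='coerce')

# 'Int64' (capital I) is the nullable integer dtype; NaN becomes <NA>
ages_int = ages.astype('Int64')

print(ages.dtype)      # float64
print(ages_int.dtype)  # Int64
print(ages_int)
```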

Practical Example: Data Analysis Workflow

Let's walk through a complete example to illustrate a basic Pandas workflow:

python
# Step 1: Import libraries
import pandas as pd
import matplotlib.pyplot as plt

# Step 2: Load the data (we'll create sample data for this example)
data = {
    'Date': pd.date_range(start='2023-01-01', periods=10),
    'Product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C'],
    'Sales': [150, 200, 130, 340, 220, 160, 310, 190, 140, 280],
    'Revenue': [1500, 3000, 1300, 4080, 3300, 1600, 3720, 2850, 1400, 3360]
}

df = pd.DataFrame(data)
print("Original Data:")
print(df.head())

# Step 3: Data exploration
print("\nData Info:")
df.info()  # info() prints its report itself and returns None

print("\nSummary Statistics:")
print(df.describe())

# Step 4: Data preparation
# Convert Date column to datetime if needed
df['Date'] = pd.to_datetime(df['Date'])
# Extract month and day
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day

# Step 5: Data analysis
# Calculate average revenue per product
avg_revenue = df.groupby('Product')['Revenue'].mean().reset_index()
print("\nAverage Revenue per Product:")
print(avg_revenue)

# Calculate total sales and revenue by product
product_summary = df.groupby('Product').agg({
    'Sales': 'sum',
    'Revenue': 'sum'
}).reset_index()
print("\nProduct Summary:")
print(product_summary)

# Step 6: Visualization (basic example)
plt.figure(figsize=(10, 6))
plt.bar(product_summary['Product'], product_summary['Revenue'])
plt.title('Total Revenue by Product')
plt.xlabel('Product')
plt.ylabel('Revenue')
# In a real script, you would add: plt.show()

This example demonstrates:

- Loading data into a DataFrame
- Exploring and understanding the data
- Preparing and transforming the data
- Analyzing the data through grouping and aggregation
- Visualizing the results (though the visualization won't display in this tutorial)
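
As a possible extension of the analysis step, the sketch below derives a per-unit price from the same sample data. The column name `UnitPrice` and the variable names `df_sales` and `unit_price` are assumptions added for this example and are not part of the workflow above.

```python
import pandas as pd

# Same sample data as in the workflow example
data = {
    'Date': pd.date_range(start='2023-01-01', periods=10),
    'Product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C'],
    'Sales': [150, 200, 130, 340, 220, 160, 310, 190, 140, 280],
    'Revenue': [1500, 3000, 1300, 4080, 3300, 1600, 3720, 2850, 1400, 3360],
}
df_sales = pd.DataFrame(data)

# Revenue per unit sold, computed row by row
df_sales['UnitPrice'] = df_sales['Revenue'] / df_sales['Sales']

# Average unit price for each product
unit_price = df_sales.groupby('Product')['UnitPrice'].mean()

print(unit_price.to_dict())  # {'A': 10.0, 'B': 15.0, 'C': 12.0}
```

In this sample every row of a product has the same unit price, so the mean simply recovers it.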

Summary

In this tutorial, we've covered the fundamentals of Pandas, including:

- Creating and working with Series and DataFrames
- Reading and manipulating data
- Selecting and filtering data
- Handling missing values and duplicates
- Basic data analysis operations
- A practical workflow example

Pandas is a vast library with many more capabilities than we can cover in a single tutorial. However, these basics will give you a solid foundation to build upon as you dive deeper into data science with Python.

Additional Resources

To continue learning Pandas, check out these resources:

- Official Pandas Documentation
- 10 Minutes to Pandas: a quick introduction from the official docs
- Pandas Cookbook: practical recipes for data analysis

Exercises

To reinforce your learning, try these exercises:

- Create a DataFrame with student information (name, age, grade) and perform basic operations on it.
- Read a CSV file of your choice and perform data cleaning tasks:
  - Handle missing values
  - Remove duplicates
  - Convert data types as needed
- Group and aggregate data to answer questions like:
  - What is the average value per category?
  - Which category has the highest/lowest total?
- Create a new column based on calculations from existing columns (e.g., BMI from height and weight).
- Practice filtering data with multiple conditions (e.g., age > 25 AND city = 'New York').

Happy data analyzing with Pandas!
