
Python Pandas Basics

Introduction

Pandas is one of the most widely used libraries for data manipulation and analysis in Python. It provides data structures and functions that make working with structured data intuitive and efficient. Whether you're analyzing financial data, processing survey results, or preparing datasets for machine learning, Pandas is likely to be a core tool in your Python data science workflow.

In this tutorial, we'll cover the fundamentals of Pandas, focusing on:

  • What Pandas is and why it's important
  • Core Pandas data structures: Series and DataFrame
  • Basic data operations and manipulations
  • Reading and writing data with Pandas
  • Essential data analysis functionality

Getting Started with Pandas

Installation

Before we begin, you'll need to install Pandas. If you haven't already installed it, you can do so using pip:

bash
pip install pandas

For data visualization capabilities that we'll use in some examples, you might also want to install Matplotlib:

bash
pip install matplotlib

Importing Pandas

To use Pandas in your Python script or notebook, import it with the conventional alias pd:

python
import pandas as pd
import numpy as np # Often used alongside Pandas

Core Pandas Data Structures

Series

A Series is a one-dimensional labeled array that can hold any data type. Think of it as a single column of data with an index.

Creating a Series

python
# Creating a Series from a list
s = pd.Series([10, 20, 30, 40])
print(s)

Output:

0    10
1    20
2    30
3    40
dtype: int64

You can also create a Series with a custom index:

python
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s)

Output:

a    10
b    20
c    30
d    40
dtype: int64

Accessing Elements in a Series

python
# Access by index label
print(s['a'])

# Access by position (use .iloc; plain s[0] on a labeled Series is deprecated)
print(s.iloc[0])

# Multiple elements
print(s[['a', 'c']])

# Slicing
print(s['a':'c'])

Output:

10
10
a    10
c    30
dtype: int64
a    10
b    20
c    30
dtype: int64
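
Because a Series can be indexed either by label or by position, it can help to make the choice explicit. The sketch below reuses the values from the example above and contrasts `.loc` (labels) with `.iloc` (positions); the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# Label-based access with .loc
first_by_label = s.loc['a']

# Position-based access with .iloc
first_by_position = s.iloc[0]

# Label slices include the end label; position slices exclude the end position
label_slice = s.loc['a':'c']    # rows a, b, c
position_slice = s.iloc[0:2]    # rows a, b

print(first_by_label, first_by_position)
print(label_slice)
print(position_slice)
```

Note the asymmetry: `s.loc['a':'c']` returns three rows, while `s.iloc[0:2]` returns two.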

DataFrame

A DataFrame is a two-dimensional labeled data structure with columns that can be of different types. It's similar to a spreadsheet or SQL table.

Creating a DataFrame

python
# Creating a DataFrame from a dictionary
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London']
}

df = pd.DataFrame(data)
print(df)

Output:

    Name  Age      City
0   John   28  New York
1   Anna   34     Paris
2  Peter   29    Berlin
3  Linda   32    London

You can also specify the row indices:

python
df = pd.DataFrame(data, index=['p1', 'p2', 'p3', 'p4'])
print(df)

Output:

     Name  Age      City
p1   John   28  New York
p2   Anna   34     Paris
p3  Peter   29    Berlin
p4  Linda   32    London
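
To make the "columns of different types" idea concrete, the following sketch rebuilds the same DataFrame and inspects it. Each column is a Series with its own dtype, and all columns share the DataFrame's row index. The variable name `ages` is illustrative only.

```python
import pandas as pd

data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London']
}
df = pd.DataFrame(data, index=['p1', 'p2', 'p3', 'p4'])

# Each column carries its own dtype (text columns vs. an integer column)
print(df.dtypes)

# Selecting one column gives a Series that shares the DataFrame's index
ages = df['Age']
print(type(ages).__name__)
print(list(ages.index))
```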

Basic DataFrame Information

python
# View the first 2 rows
print(df.head(2))

# Get basic statistics
print(df.describe())

# Get information about the DataFrame (info() prints directly and returns None)
df.info()

# Get dimensions (rows, columns)
print(df.shape)

Example output for df.describe():

             Age
count   4.000000
mean   30.750000
std     2.753785
min    28.000000
25%    28.750000
50%    30.500000
75%    32.500000
max    34.000000

Working with Data

Reading Data from Files

One of the most common ways to create a DataFrame is by reading data from files:

python
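# The file names below are placeholders; replace them with your own paths.
# Separate variable names keep the df from the earlier examples intact.

# Read from CSV
df_csv = pd.read_csv('data.csv')

# Read from Excel (requires an Excel engine such as openpyxl)
df_excel = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# Read from JSON
df_json = pd.read_json('data.json')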
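
Writing works the same way in the other direction, through methods such as `to_csv`, `to_json`, and `to_excel` (the last one needs an Excel engine such as openpyxl installed). Below is a minimal round-trip sketch that writes to an in-memory text buffer instead of a real file, so it runs without touching the disk. The sample rows and variable names are illustrative.

```python
import io

import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna'],
    'Age': [28, 34],
})

# Write CSV text into a buffer; index=False leaves out the row index column
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read the same text back into a new DataFrame
df_roundtrip = pd.read_csv(io.StringIO(csv_text))
print(df_roundtrip)
```

With a real file you would pass a path instead of the buffer, for example `df.to_csv('output.csv', index=False)`, where `output.csv` is a placeholder name.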

Selecting and Filtering Data

Pandas provides multiple ways to select and filter data:

python
# Select a single column (returns a Series)
print(df['Name'])

# Select multiple columns (returns a DataFrame)
print(df[['Name', 'City']])

# Select rows by index label
print(df.loc['p1'])

# Select rows by integer position
print(df.iloc[0])

# Select rows and columns by label
print(df.loc['p1', 'Name'])

# Select rows and columns by position
print(df.iloc[0, 0])

# Filtering with conditions
print(df[df['Age'] > 30])

Output of df[df['Age'] > 30]:

     Name  Age    City
p2   Anna   34   Paris
p4  Linda   32  London
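
Conditions can also be combined. The sketch below, using the same sample data, shows the usual pattern: wrap each condition in parentheses and join them with `&` (and) or `|` (or), since Python's `and`/`or` keywords do not work element-wise on a Series. The result variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London']
}, index=['p1', 'p2', 'p3', 'p4'])

# AND: both conditions must hold
older_in_paris = df[(df['Age'] > 30) & (df['City'] == 'Paris')]

# OR: at least one condition must hold
young_or_london = df[(df['Age'] < 29) | (df['City'] == 'London')]

# Membership test against a list of allowed values
in_selected_cities = df[df['City'].isin(['Paris', 'Berlin'])]

print(older_in_paris)
print(young_or_london)
print(in_selected_cities)
```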

Adding and Modifying Data

python
# Add a new column
df['Country'] = ['USA', 'France', 'Germany', 'UK']

# Modify existing values
df.loc['p1', 'Age'] = 29

# Add a new row
df.loc['p5'] = ['Michael', 35, 'Toronto', 'Canada']

print(df)

Output:

       Name  Age      City  Country
p1     John   29  New York      USA
p2     Anna   34     Paris   France
p3    Peter   29    Berlin  Germany
p4    Linda   32    London       UK
p5  Michael   35   Toronto   Canada
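
New columns are often computed from existing ones rather than typed in by hand. The following sketch shows two common approaches on a trimmed copy of the sample data: direct assignment, which modifies the DataFrame in place, and `assign`, which returns a new DataFrame. The column names `AgeInFiveYears`, `Over30`, and `Decade` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [29, 34, 29, 32],
}, index=['p1', 'p2', 'p3', 'p4'])

# Vectorized arithmetic applies to every row at once
df['AgeInFiveYears'] = df['Age'] + 5

# A comparison produces a boolean column
df['Over30'] = df['Age'] > 30

# assign() leaves df untouched and returns a new DataFrame
df_with_decade = df.assign(Decade=(df['Age'] // 10) * 10)

print(df)
print(df_with_decade)
```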

Basic Operations

python
# Calculate the mean age
print("Average age:", df['Age'].mean())

# Count occurrences
print(df['City'].value_counts())

# Sort by Age
print(df.sort_values(by='Age', ascending=False))

# Group by and aggregate
print(df.groupby('Country')['Age'].mean())
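
Grouping is more informative when groups contain more than one row. The sketch below uses a slightly extended, made-up sample (the extra `Sara` row and the repeated countries are assumptions for illustration) and computes several aggregates at once with named aggregation, then sorts by two keys.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'Michael', 'Sara'],
    'Age': [29, 34, 29, 32, 35, 41],
    'Country': ['USA', 'France', 'Germany', 'UK', 'USA', 'France'],
})

# Named aggregation: output_column=(input_column, function)
summary = people.groupby('Country').agg(
    mean_age=('Age', 'mean'),
    oldest=('Age', 'max'),
    people=('Name', 'count'),
)
print(summary)

# Sort by Age descending, breaking ties by Name ascending
oldest_first = people.sort_values(by=['Age', 'Name'], ascending=[False, True])
print(oldest_first)
```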

Data Cleaning and Preparation

Handling Missing Values

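Missing values are common in real-world datasets. Pandas usually represents them as NaN (Not a Number) in numeric columns; datetime columns use NaT, and None or pd.NA can also appear. Methods such as isna() detect all of these.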

python
# Detecting missing values
print(df.isna().sum())

# Filling missing values
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Dropping rows with missing values
df_clean = df.dropna()
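
The DataFrame used so far has no gaps, so the calls above have nothing to act on. Here is a self-contained sketch with deliberately missing entries (the sample values are made up) showing what each call does.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, np.nan, 29, 32],
    'City': ['New York', 'Paris', None, 'London'],
})

# Count missing values per column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Fill the missing age with the mean of the known ages
filled = df.copy()
filled['Age'] = filled['Age'].fillna(filled['Age'].mean())

# Drop rows with any missing value, or only rows missing a specific column
dropped_any = df.dropna()
dropped_age_only = df.dropna(subset=['Age'])

print(filled)
print(dropped_any)
print(dropped_age_only)
```

Filling with the mean is only one option; whether it is appropriate depends on the data.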

Removing Duplicates

python
# Check for duplicates
print(df.duplicated().sum())

# Remove duplicates
df_unique = df.drop_duplicates()
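
By default a row counts as a duplicate only when every column matches an earlier row. The sketch below, on made-up sample rows, shows that default alongside the `subset` and `keep` parameters, which restrict the comparison to chosen columns and control which occurrence survives.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'John', 'Anna'],
    'City': ['New York', 'Paris', 'New York', 'Berlin'],
})

# Row 2 repeats row 0 exactly; row 3 differs from row 1 in City
full_duplicates = df.duplicated()
print(full_duplicates)

# Remove exact duplicates (keeps the first occurrence)
unique_rows = df.drop_duplicates()

# Compare on Name only and keep the last occurrence of each name
unique_names = df.drop_duplicates(subset=['Name'], keep='last')

print(unique_rows)
print(unique_names)
```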

Data Type Conversion

python
# Check data types
print(df.dtypes)

# Convert data types (astype(int) raises an error if the column still contains NaN)
df['Age'] = df['Age'].astype(int)
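
Real data often arrives as text with a few entries that are not valid numbers. One way to handle this, sketched below on made-up values, is `pd.to_numeric` with `errors='coerce'`, which turns unparseable entries into missing values instead of raising. The nullable `Int64` dtype can then hold integers alongside missing values.

```python
import pandas as pd

raw = pd.DataFrame({'Age': ['28', '34', 'unknown', '32']})

# 'unknown' cannot be parsed, so it becomes a missing value
ages = pd.to_numeric(raw['Age'], errors='coerce')
print(ages)

# Nullable integer dtype: integers plus a missing-value marker
ages_nullable = ages.astype('Int64')
print(ages_nullable)
```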

Practical Example: Data Analysis Workflow

Let's walk through a complete example to illustrate a basic Pandas workflow:

python
# Step 1: Import libraries
import pandas as pd
import matplotlib.pyplot as plt

# Step 2: Load the data (we'll create sample data for this example)
data = {
    'Date': pd.date_range(start='2023-01-01', periods=10),
    'Product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C'],
    'Sales': [150, 200, 130, 340, 220, 160, 310, 190, 140, 280],
    'Revenue': [1500, 3000, 1300, 4080, 3300, 1600, 3720, 2850, 1400, 3360]
}

df = pd.DataFrame(data)
print("Original Data:")
print(df.head())

# Step 3: Data exploration
print("\nData Info:")
df.info()  # info() prints directly and returns None

print("\nSummary Statistics:")
print(df.describe())

# Step 4: Data preparation
# Convert Date column to datetime if needed
df['Date'] = pd.to_datetime(df['Date'])
# Extract month and day
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day

# Step 5: Data analysis
# Calculate average revenue per product
avg_revenue = df.groupby('Product')['Revenue'].mean().reset_index()
print("\nAverage Revenue per Product:")
print(avg_revenue)

# Calculate total sales and revenue by product
product_summary = df.groupby('Product').agg({
    'Sales': 'sum',
    'Revenue': 'sum'
}).reset_index()
print("\nProduct Summary:")
print(product_summary)

# Step 6: Visualization (basic example)
plt.figure(figsize=(10, 6))
plt.bar(product_summary['Product'], product_summary['Revenue'])
plt.title('Total Revenue by Product')
plt.xlabel('Product')
plt.ylabel('Revenue')
# In a real script, you would add: plt.show()

This example demonstrates:

  1. Loading data into a DataFrame
  2. Exploring and understanding the data
  3. Preparing and transforming the data
  4. Analyzing the data through grouping and aggregation
  5. Visualizing the results (though the visualization won't display in this tutorial)
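
As a possible follow-up to the analysis step, the sketch below reuses the sample sales figures to derive revenue per unit sold for each product and to look up the product with the highest total revenue. The column name `RevenuePerUnit` and the variable names are illustrative and not part of the workflow above.

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C'],
    'Sales': [150, 200, 130, 340, 220, 160, 310, 190, 140, 280],
    'Revenue': [1500, 3000, 1300, 4080, 3300, 1600, 3720, 2850, 1400, 3360],
})

# Total units and revenue per product (Product becomes the index)
totals = df.groupby('Product')[['Sales', 'Revenue']].sum()

# Derived column: revenue earned per unit sold
totals['RevenuePerUnit'] = totals['Revenue'] / totals['Sales']

# Index label of the row with the largest total revenue
top_product = totals['Revenue'].idxmax()

print(totals)
print("Top product by revenue:", top_product)
```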

Summary

In this tutorial, we've covered the fundamentals of Pandas, including:

  • Creating and working with Series and DataFrames
  • Reading and manipulating data
  • Selecting and filtering data
  • Handling missing values and duplicates
  • Basic data analysis operations
  • A practical workflow example

Pandas is a vast library with many more capabilities than we can cover in a single tutorial. However, these basics will give you a solid foundation to build upon as you dive deeper into data science with Python.

Additional Resources

To continue learning Pandas, check out these resources:

  1. Official Pandas Documentation
  2. 10 Minutes to Pandas - A quick introduction from the official docs
  3. Pandas Cookbook - Practical recipes for data analysis

Exercises

To reinforce your learning, try these exercises:

  1. Create a DataFrame with student information (name, age, grade) and perform basic operations on it.
  2. Read a CSV file of your choice and perform data cleaning tasks:
    • Handle missing values
    • Remove duplicates
    • Convert data types as needed
  3. Group and aggregate data to answer questions like:
    • What is the average value per category?
    • Which category has the highest/lowest total?
  4. Create a new column based on calculations from existing columns (e.g., BMI from height and weight).
  5. Practice filtering data with multiple conditions (e.g., age > 25 AND city = 'New York').

Happy data analyzing with Pandas!


