Pandas Structure Overview
Introduction
Pandas is one of the most powerful and flexible data analysis libraries in Python. At its core, Pandas provides two primary data structures that you'll use for almost all data manipulation tasks: Series and DataFrames. Understanding these structures is fundamental to working effectively with data in Python.
In this tutorial, we'll explore these structures, learn how they're organized, and see how they work together to make data analysis more intuitive and efficient.
Pandas Core Data Structures
1. Series: One-dimensional labeled arrays
A Pandas Series is a one-dimensional labeled array that can hold any data type (integers, strings, floating-point numbers, Python objects, etc.). It's similar to a column in a spreadsheet or a single variable in your dataset.
Creating a Series
Let's start by creating a simple Series:
import pandas as pd
# Create a Series from a list
simple_series = pd.Series([10, 20, 30, 40, 50])
print(simple_series)
Output:
0 10
1 20
2 30
3 40
4 50
dtype: int64
Notice how Pandas automatically assigns an index (0 through 4) to our values.
We can also specify our own index labels:
# Create a Series with custom indices
fruit_series = pd.Series([30, 25, 40, 10],
                         index=['apples', 'bananas', 'oranges', 'grapes'])
print(fruit_series)
Output:
apples 30
bananas 25
oranges 40
grapes 10
dtype: int64
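A Series can also be built from a dictionary, in which case the keys become the index labels. The following is a minimal sketch; the names `stock` and `mixed` and their values are made up for illustration. It also shows that mixing Python types in one Series falls back to the generic `object` dtype.

```python
import pandas as pd

# Build a Series from a dict: the keys become the index labels (hypothetical data)
stock = pd.Series({'apples': 30, 'bananas': 25, 'oranges': 40})
print(stock.index.tolist())   # ['apples', 'bananas', 'oranges']
print(stock.dtype)            # int64

# Mixed Python types fall back to the generic object dtype
mixed = pd.Series([1, 'two', 3.0])
print(mixed.dtype)            # object
```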
Series Attributes and Methods
Series have useful attributes and methods:
# Get basic information about our Series
print("Series values:", fruit_series.values)
print("Series index:", fruit_series.index)
print("Series data type:", fruit_series.dtype)
print("Series shape:", fruit_series.shape)
# Using methods
print("Mean value:", fruit_series.mean())
print("Max value:", fruit_series.max())
print("Description:", fruit_series.describe())
Output:
Series values: [30 25 40 10]
Series index: Index(['apples', 'bananas', 'oranges', 'grapes'], dtype='object')
Series data type: int64
Series shape: (4,)
Mean value: 26.25
Max value: 40
Description: count 4.00
mean 26.25
std 12.50
min 10.00
25% 21.25
50% 27.50
75% 32.50
max 40.00
dtype: float64
Accessing Series Elements
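You can access elements by index label or by position. For positional access, use .iloc: plain integer indexing such as fruit_series[0] on a label-indexed Series is deprecated in recent pandas versions.
# Access by label
print("Number of apples:", fruit_series['apples'])
# Access by position
print("First element:", fruit_series.iloc[0])
# Slicing by position
print("First two elements:\n", fruit_series.iloc[:2])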
Output:
Number of apples: 30
First element: 30
First two elements:
apples 30
bananas 25
dtype: int64
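Label-based and position-based slices behave slightly differently at the end of the range. The sketch below rebuilds the fruit Series so that it runs on its own; the variable names `by_label` and `by_position` are illustrative.

```python
import pandas as pd

fruit_series = pd.Series([30, 25, 40, 10],
                         index=['apples', 'bananas', 'oranges', 'grapes'])

# .loc selects by label; a label slice includes BOTH endpoints
by_label = fruit_series.loc['apples':'oranges']

# .iloc selects by integer position; the stop position is excluded
by_position = fruit_series.iloc[0:2]

print(by_label.index.tolist())     # ['apples', 'bananas', 'oranges']
print(by_position.index.tolist())  # ['apples', 'bananas']
```

Using .loc and .iloc explicitly makes it clear to the reader whether a key is meant as a label or as a position.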
2. DataFrame: Two-dimensional labeled data structure
A DataFrame is a 2D labeled data structure, similar to a spreadsheet or SQL table. It's like a collection of Series objects that share the same index.
Creating a DataFrame
Let's create a simple DataFrame:
# Create a DataFrame from a dictionary
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
    'Salary': [65000, 70000, 80000, 75000]
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age City Salary
0 John 28 New York 65000
1 Anna 24 Paris 70000
2 Peter 35 Berlin 80000
3 Linda 32 London 75000
You can also create a DataFrame from a list of lists, a NumPy array, or other formats.
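As a rough illustration of those alternatives, the sketch below builds one DataFrame from a list of lists and another from a 2D NumPy array. The variable names and column labels are made up for this example.

```python
import numpy as np
import pandas as pd

# From a list of lists: each inner list becomes one row
rows = [['John', 28], ['Anna', 24]]
df_from_rows = pd.DataFrame(rows, columns=['Name', 'Age'])

# From a 2D NumPy array: the array shape is (rows, columns)
arr = np.array([[1, 2, 3], [4, 5, 6]])
df_from_array = pd.DataFrame(arr, columns=['a', 'b', 'c'])

print(df_from_rows.shape)   # (2, 2)
print(df_from_array.shape)  # (2, 3)
```

In both cases the column names must be passed explicitly; without the columns argument, Pandas labels the columns 0, 1, 2, and so on.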
DataFrame Attributes and Methods
DataFrames have many useful attributes and methods:
# Basic DataFrame information
print("DataFrame shape:", df.shape)
print("DataFrame columns:", df.columns)
print("DataFrame index:", df.index)
print("DataFrame data types:\n", df.dtypes)
# Statistical summary
print("\nSummary statistics:\n", df.describe())
# Top and bottom rows
print("\nFirst 2 rows:\n", df.head(2))
print("\nLast 2 rows:\n", df.tail(2))
Output:
DataFrame shape: (4, 4)
DataFrame columns: Index(['Name', 'Age', 'City', 'Salary'], dtype='object')
DataFrame index: RangeIndex(start=0, stop=4, step=1)
DataFrame data types:
Name object
Age int64
City object
Salary int64
dtype: object
Summary statistics:
Age Salary
count 4.000000 4.000000
mean 29.750000 72500.000000
std 4.787136 6454.972244
min 24.000000 65000.000000
25% 27.000000 68750.000000
50% 30.000000 72500.000000
75% 32.750000 76250.000000
max 35.000000 80000.000000
First 2 rows:
Name Age City Salary
0 John 28 New York 65000
1 Anna 24 Paris 70000
Last 2 rows:
Name Age City Salary
2 Peter 35 Berlin 80000
3 Linda 32 London 75000
Accessing DataFrame Elements
There are multiple ways to access data in a DataFrame:
# Access a single column (returns a Series)
print("Names:\n", df['Name'])
# Access multiple columns
print("\nNames and Ages:\n", df[['Name', 'Age']])
# Access a row by position using iloc
print("\nSecond row:\n", df.iloc[1])
# Access a row by label using loc
print("\nRow with index 2:\n", df.loc[2])
# Access a specific cell
print("\nLinda's salary:", df.loc[3, 'Salary'])
# Boolean indexing
print("\nPeople older than 30:\n", df[df['Age'] > 30])
Output:
Names:
0 John
1 Anna
2 Peter
3 Linda
Name: Name, dtype: object
Names and Ages:
Name Age
0 John 28
1 Anna 24
2 Peter 35
3 Linda 32
Second row:
Name Anna
Age 24
City Paris
Salary 70000
Name: 1, dtype: object
Row with index 2:
Name Peter
Age 35
City Berlin
Salary 80000
Name: 2, dtype: object
Linda's salary: 75000
People older than 30:
Name Age City Salary
2 Peter 35 Berlin 80000
3 Linda 32 London 75000
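Boolean indexing can combine several conditions and can be paired with .loc to pick columns at the same time. The sketch below recreates the example DataFrame so that it runs on its own; `mask` and `result` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
    'Salary': [65000, 70000, 80000, 75000]
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
mask = (df['Age'] > 30) & (df['Salary'] < 80000)

# .loc accepts a boolean mask for the rows and a list of labels for the columns
result = df.loc[mask, ['Name', 'Salary']]
print(result)
```

Only Linda satisfies both conditions here, because Peter's salary is exactly 80000.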
The Relationship Between Series and DataFrames
Each column of a DataFrame is itself a Series, and all columns share the DataFrame's index. Selecting a single column returns that Series:
# Extract the Age column as a Series
age_series = df['Age']
print("Type of age_series:", type(age_series))
print(age_series)
Output:
Type of age_series: <class 'pandas.core.series.Series'>
0 28
1 24
2 35
3 32
Name: Age, dtype: int64
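The relationship also works in the other direction: a DataFrame can be assembled from several Series. The sketch below uses made-up Series (`ages`, `cities`) whose labels only partly overlap, to show how Pandas aligns them on the index.

```python
import pandas as pd

# Two Series with overlapping but not identical labels (hypothetical data)
ages = pd.Series({'John': 28, 'Anna': 24, 'Peter': 35})
cities = pd.Series({'Anna': 'Paris', 'Peter': 'Berlin', 'Linda': 'London'})

# The columns are aligned on the union of the index labels;
# a label missing from one Series gets a missing value (NaN) in that column
people = pd.DataFrame({'Age': ages, 'City': cities})
print(people)
```

Because Linda has no age, the Age column contains a missing value and is stored as floating point rather than integer.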
Real-World Example: Analyzing Sales Data
Let's apply these structures to a more practical task, analyzing a small set of sales data:
# Create a DataFrame with sales data
sales_data = {
    'Date': ['2023-01-15', '2023-01-16', '2023-01-17', '2023-01-18', '2023-01-19'],
    'Product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone'],
    'Units_Sold': [5, 10, 7, 8, 12],
    'Revenue': [5000, 6000, 2100, 8000, 7200]
}
sales_df = pd.DataFrame(sales_data)
# Convert Date column to datetime format
sales_df['Date'] = pd.to_datetime(sales_df['Date'])
print(sales_df)
# Calculate the revenue per unit
sales_df['Price_Per_Unit'] = sales_df['Revenue'] / sales_df['Units_Sold']
print("\nSales data with price per unit:\n", sales_df)
# Group by product: total units sold, total revenue, and mean price per unit
product_summary = sales_df.groupby('Product').agg({
    'Units_Sold': 'sum',
    'Revenue': 'sum',
    'Price_Per_Unit': 'mean'
})
print("\nProduct summary:\n", product_summary)
# Sort by revenue to find top performing products
print("\nProducts sorted by revenue:\n", product_summary.sort_values('Revenue', ascending=False))
Output:
Date Product Units_Sold Revenue
0 2023-01-15 Laptop 5 5000
1 2023-01-16 Phone 10 6000
2 2023-01-17 Tablet 7 2100
3 2023-01-18 Laptop 8 8000
4 2023-01-19 Phone 12 7200
Sales data with price per unit:
Date Product Units_Sold Revenue Price_Per_Unit
0 2023-01-15 Laptop 5 5000 1000.000
1 2023-01-16 Phone 10 6000 600.000
2 2023-01-17 Tablet 7 2100 300.000
3 2023-01-18 Laptop 8 8000 1000.000
4 2023-01-19 Phone 12 7200 600.000
Product summary:
Units_Sold Revenue Price_Per_Unit
Product
Laptop 13 13000 1000.000
Phone 22 13200 600.000
Tablet 7 2100 300.000
Products sorted by revenue:
Units_Sold Revenue Price_Per_Unit
Product
Phone 22 13200 600.000
Laptop 13 13000 1000.000
Tablet 7 2100 300.000
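One possible variation on the grouping step is named aggregation, which lets you choose the output column names. This is a sketch rather than a replacement for the code above; the column names `total_units`, `total_revenue`, and `revenue_per_unit` are invented for this example.

```python
import pandas as pd

sales_df = pd.DataFrame({
    'Product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone'],
    'Units_Sold': [5, 10, 7, 8, 12],
    'Revenue': [5000, 6000, 2100, 8000, 7200]
})

# Named aggregation: output_column=(input_column, aggregation function)
summary = sales_df.groupby('Product').agg(
    total_units=('Units_Sold', 'sum'),
    total_revenue=('Revenue', 'sum'),
)

# Derive the price per unit from the totals, so it is weighted by units sold
summary['revenue_per_unit'] = summary['total_revenue'] / summary['total_units']
print(summary.sort_values('total_revenue', ascending=False))
```

Dividing the totals gives a unit-weighted price, whereas averaging the per-row prices treats every sale equally. The two agree for this data because each product sells at a constant price, but they can differ when prices vary between rows.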
Summary
In this tutorial, we've covered the two core data structures of the Pandas library:
- Series: One-dimensional labeled arrays that can hold any data type
- DataFrames: Two-dimensional labeled data structures similar to tables
We've learned how to:
- Create Series and DataFrames from different data sources
- Access elements using various indexing methods
- Use basic attributes and methods to explore our data
- Apply these concepts to analyze real-world data
Understanding these structures forms the foundation for all your data analysis work in Pandas. As you become more familiar with them, you'll find that Pandas makes complex data tasks simpler and more intuitive.
Additional Resources and Exercises
Further Reading
- Official Pandas Documentation
- 10 Minutes to Pandas (official tutorial)
- Pandas Cheat Sheet
Practice Exercises
- Create a Series of 5 different countries and their populations. Calculate the total population.
- Create a DataFrame with information about 5 books (title, author, year, pages, genre). Then:
  - Add a column indicating whether the book is "long" (more than 300 pages)
  - Find the oldest and newest books
  - Calculate the average page count by genre
- Using the sales data example above:
  - Add a column for the day of the week
  - Find which day had the highest average sales
  - Create a pivot table showing Products vs. Days
By practicing these exercises, you'll strengthen your understanding of Pandas' core structures and be ready to tackle more complex data analysis tasks.