Pandas Structure Overview

Introduction

Pandas is a widely used Python library for data analysis and manipulation. At its core, it provides two primary data structures that you'll use for almost all data manipulation tasks: Series and DataFrames. Understanding these structures is fundamental to working effectively with data in Python.

In this tutorial, we'll explore these structures, learn how they're organized, and see how they work together to make data analysis more intuitive and efficient.

Pandas Core Data Structures

1. Series: One-dimensional labeled arrays

A Pandas Series is a one-dimensional labeled array that can hold any data type (integers, strings, floating-point numbers, Python objects, etc.). It's similar to a column in a spreadsheet or a single variable in your dataset.

Creating a Series

Let's start by creating a simple Series:

python
import pandas as pd

# Create a Series from a list
simple_series = pd.Series([10, 20, 30, 40, 50])
print(simple_series)

Output:

0    10
1    20
2    30
3    40
4    50
dtype: int64

Notice how Pandas automatically assigns an index (0 through 4) to our values.

We can also specify our own index labels:

python
# Create a Series with custom indices
fruit_series = pd.Series([30, 25, 40, 10],
                         index=['apples', 'bananas', 'oranges', 'grapes'])
print(fruit_series)

Output:

apples     30
bananas    25
oranges    40
grapes     10
dtype: int64
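
A Series can also be built from a dictionary, in which case the keys supply the index labels. The following is a minimal sketch; the `stock` dictionary and the name `stock_series` are illustrative and simply mirror the fruit values used above.

```python
import pandas as pd

# Hypothetical data mirroring the fruit example above
stock = {'apples': 30, 'bananas': 25, 'oranges': 40, 'grapes': 10}

# Dictionary keys become the index labels; dictionary values become the data
stock_series = pd.Series(stock)

print(stock_series.index.tolist())
print(stock_series['oranges'])
```

This form is convenient when the data already arrives as key-value pairs, since there is no separate `index` argument to keep in sync with the values.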

Series Attributes and Methods

Series have useful attributes and methods:

python
# Get basic information about our Series
print("Series values:", fruit_series.values)
print("Series index:", fruit_series.index)
print("Series data type:", fruit_series.dtype)
print("Series shape:", fruit_series.shape)

# Using methods
print("Mean value:", fruit_series.mean())
print("Max value:", fruit_series.max())
print("Description:", fruit_series.describe())

Output:

Series values: [30 25 40 10]
Series index: Index(['apples', 'bananas', 'oranges', 'grapes'], dtype='object')
Series data type: int64
Series shape: (4,)
Mean value: 26.25
Max value: 40
Description: count     4.00
mean     26.25
std      12.50
min      10.00
25%      21.25
50%      27.50
75%      32.50
max      40.00
dtype: float64

Accessing Series Elements

You can access elements by their index label or position:

python
# Access by label
print("Number of apples:", fruit_series['apples'])

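# Access by position (use .iloc; plain integer keys on a
# label-indexed Series are deprecated in recent pandas versions)
print("First element:", fruit_series.iloc[0])

# Slicing by position
print("First two elements:\n", fruit_series.iloc[:2])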

Output:

Number of apples: 30
First element: 30
First two elements:
apples     30
bananas    25
dtype: int64
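
Because an index can itself contain integers, plain square brackets can be ambiguous about whether a key is a label or a position. The sketch below shows the explicit accessors `.loc` (labels) and `.iloc` (positions) on the same fruit data; the variable names are illustrative.

```python
import pandas as pd

fruit_series = pd.Series([30, 25, 40, 10],
                         index=['apples', 'bananas', 'oranges', 'grapes'])

# .loc always interprets its argument as an index label
by_label = fruit_series.loc['bananas']

# .iloc always interprets its argument as an integer position
by_position = fruit_series.iloc[1]

# Label slices include the end label; position slices exclude the end position
label_slice = fruit_series.loc['apples':'oranges']   # apples, bananas, oranges
position_slice = fruit_series.iloc[0:2]              # apples, bananas

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```

Note the difference in slice endpoints: a label slice keeps the last label, while a position slice follows the usual Python convention and stops before it.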

2. DataFrame: Two-dimensional labeled data structure

A DataFrame is a 2D labeled data structure, similar to a spreadsheet or SQL table. It's like a collection of Series objects that share the same index.

Creating a DataFrame

Let's create a simple DataFrame:

python
# Create a DataFrame from a dictionary
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
    'Salary': [65000, 70000, 80000, 75000]
}

df = pd.DataFrame(data)
print(df)

Output:

    Name  Age      City  Salary
0   John   28  New York   65000
1   Anna   24     Paris   70000
2  Peter   35    Berlin   80000
3  Linda   32    London   75000

You can also create a DataFrame from a list of lists, a NumPy array, or other formats.
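
As a rough illustration of those alternatives, the sketch below builds one DataFrame from a list of lists and another from a NumPy array. The sample values and the column names `a`, `b`, `c` are made up for the example.

```python
import numpy as np
import pandas as pd

# From a list of lists: each inner list is one row
rows = [
    ['John', 28],
    ['Anna', 24],
]
df_from_rows = pd.DataFrame(rows, columns=['Name', 'Age'])

# From a 2D NumPy array: rows and columns follow the array's shape
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
df_from_array = pd.DataFrame(matrix, columns=['a', 'b', 'c'])

print(df_from_rows)
print(df_from_array)
```

In both cases the `columns` argument supplies the labels; without it, Pandas falls back to integer column labels starting at 0.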

DataFrame Attributes and Methods

DataFrames have many useful attributes and methods:

python
# Basic DataFrame information
print("DataFrame shape:", df.shape)
print("DataFrame columns:", df.columns)
print("DataFrame index:", df.index)
print("DataFrame data types:\n", df.dtypes)

# Statistical summary
print("\nSummary statistics:\n", df.describe())

# Top and bottom rows
print("\nFirst 2 rows:\n", df.head(2))
print("\nLast 2 rows:\n", df.tail(2))

Output:

DataFrame shape: (4, 4)
DataFrame columns: Index(['Name', 'Age', 'City', 'Salary'], dtype='object')
DataFrame index: RangeIndex(start=0, stop=4, step=1)
DataFrame data types:
Name      object
Age        int64
City      object
Salary     int64
dtype: object

Summary statistics:
             Age        Salary
count   4.000000      4.000000
mean   29.750000  72500.000000
std     4.787136   6454.972244
min    24.000000  65000.000000
25%    27.000000  68750.000000
50%    30.000000  72500.000000
75%    32.750000  76250.000000
max    35.000000  80000.000000

First 2 rows:
   Name  Age      City  Salary
0  John   28  New York   65000
1  Anna   24     Paris   70000

Last 2 rows:
    Name  Age    City  Salary
2  Peter   35  Berlin   80000
3  Linda   32  London   75000

Accessing DataFrame Elements

There are multiple ways to access data in a DataFrame:

python
# Access a single column (returns a Series)
print("Names:\n", df['Name'])

# Access multiple columns
print("\nNames and Ages:\n", df[['Name', 'Age']])

# Access a row by position using iloc
print("\nSecond row:\n", df.iloc[1])

# Access a row by label using loc
print("\nRow with index 2:\n", df.loc[2])

# Access a specific cell
print("\nLinda's salary:", df.loc[3, 'Salary'])

# Boolean indexing
print("\nPeople older than 30:\n", df[df['Age'] > 30])

Output:

Names:
0     John
1     Anna
2    Peter
3    Linda
Name: Name, dtype: object

Names and Ages:
    Name  Age
0   John   28
1   Anna   24
2  Peter   35
3  Linda   32

Second row:
Name       Anna
Age          24
City      Paris
Salary    70000
Name: 1, dtype: object

Row with index 2:
Name       Peter
Age           35
City      Berlin
Salary     80000
Name: 2, dtype: object

Linda's salary: 75000

People older than 30:
    Name  Age    City  Salary
2  Peter   35  Berlin   80000
3  Linda   32  London   75000
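
Boolean indexing extends naturally to several conditions and to selecting specific columns at the same time. The following is a small sketch on the same people data; the thresholds and the names `mask` and `selected` are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
    'Salary': [65000, 70000, 80000, 75000],
})

# Each condition needs its own parentheses; use & for "and", | for "or"
mask = (df['Age'] > 25) & (df['Salary'] >= 70000)

# .loc accepts a boolean mask for the rows and a list of labels for the columns
selected = df.loc[mask, ['Name', 'Salary']]
print(selected)
```

The parentheses matter because `&` binds more tightly than comparison operators in Python, so omitting them changes the meaning of the expression.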

The Relationship Between Series and DataFrames

Each column in a DataFrame is a Series, and all of the columns share the DataFrame's index. Selecting a single column therefore returns a Series:

python
# Extract the Age column as a Series
age_series = df['Age']
print("Type of age_series:", type(age_series))
print(age_series)

Output:

Type of age_series: <class 'pandas.core.series.Series'>
0    28
1    24
2    35
3    32
Name: Age, dtype: int64
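
The relationship also works in the other direction: a DataFrame can be assembled from Series, and Pandas aligns them on their index labels. The sketch below uses two small, made-up Series (`ages` and `cities`) whose indexes only partly overlap, to show how missing combinations are filled.

```python
import pandas as pd

# Two hypothetical Series with partly overlapping index labels
ages = pd.Series([28, 24], index=['John', 'Anna'])
cities = pd.Series(['Paris', 'Berlin'], index=['Anna', 'Peter'])

# Each Series becomes a column; rows are matched by label, not by position
people = pd.DataFrame({'Age': ages, 'City': cities})
print(people)

# Labels missing from one Series show up as NaN in that column
print(people['Age'].isna().sum())
```

Only Anna appears in both Series, so hers is the only row with both values filled in. The Age column is stored as floating point here because NaN is a float value.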

Real-World Example: Analyzing Sales Data

Let's use a more practical example of analyzing some sales data:

python
# Create a DataFrame with sales data
sales_data = {
    'Date': ['2023-01-15', '2023-01-16', '2023-01-17', '2023-01-18', '2023-01-19'],
    'Product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone'],
    'Units_Sold': [5, 10, 7, 8, 12],
    'Revenue': [5000, 6000, 2100, 8000, 7200]
}

sales_df = pd.DataFrame(sales_data)

# Convert Date column to datetime format
sales_df['Date'] = pd.to_datetime(sales_df['Date'])
print(sales_df)

# Calculate the revenue per unit
sales_df['Price_Per_Unit'] = sales_df['Revenue'] / sales_df['Units_Sold']
print("\nSales data with price per unit:\n", sales_df)

# Group by product: total units and revenue, average price per unit
product_summary = sales_df.groupby('Product').agg({
    'Units_Sold': 'sum',
    'Revenue': 'sum',
    'Price_Per_Unit': 'mean'
})
print("\nProduct summary:\n", product_summary)

# Sort by revenue to find top performing products
print("\nProducts sorted by revenue:\n", product_summary.sort_values('Revenue', ascending=False))

Output:

        Date Product  Units_Sold  Revenue
0 2023-01-15  Laptop           5     5000
1 2023-01-16   Phone          10     6000
2 2023-01-17  Tablet           7     2100
3 2023-01-18  Laptop           8     8000
4 2023-01-19   Phone          12     7200

Sales data with price per unit:
        Date Product  Units_Sold  Revenue  Price_Per_Unit
0 2023-01-15  Laptop           5     5000          1000.0
1 2023-01-16   Phone          10     6000           600.0
2 2023-01-17  Tablet           7     2100           300.0
3 2023-01-18  Laptop           8     8000          1000.0
4 2023-01-19   Phone          12     7200           600.0

Product summary:
         Units_Sold  Revenue  Price_Per_Unit
Product
Laptop           13    13000          1000.0
Phone            22    13200           600.0
Tablet            7     2100           300.0

Products sorted by revenue:
         Units_Sold  Revenue  Price_Per_Unit
Product
Phone            22    13200           600.0
Laptop           13    13000          1000.0
Tablet            7     2100           300.0
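
As an alternative to passing a dictionary to `agg`, Pandas also supports named aggregation, which lets you choose the output column names directly. The sketch below applies it to the same sales figures; the output names `total_units`, `total_revenue`, and `avg_price` are illustrative choices, not part of the original example.

```python
import pandas as pd

sales_df = pd.DataFrame({
    'Product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone'],
    'Units_Sold': [5, 10, 7, 8, 12],
    'Revenue': [5000, 6000, 2100, 8000, 7200],
})

# Named aggregation: output_name=(input_column, aggregation_function)
summary = sales_df.groupby('Product').agg(
    total_units=('Units_Sold', 'sum'),
    total_revenue=('Revenue', 'sum'),
)

# Overall price per unit: total revenue divided by total units sold
summary['avg_price'] = summary['total_revenue'] / summary['total_units']

print(summary.sort_values('total_revenue', ascending=False))
```

Dividing the totals gives a unit-weighted average price, which in general can differ from the plain mean of per-row prices; with this data the two happen to coincide because each product sold at a constant price.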

Summary

In this tutorial, we've covered the fundamental structures that make up the Pandas library:

  1. Series: One-dimensional labeled arrays that can hold any data type
  2. DataFrames: Two-dimensional labeled data structures similar to tables

We've learned how to:

  • Create Series and DataFrames from different data sources
  • Access elements using various indexing methods
  • Use basic attributes and methods to explore our data
  • Apply these concepts to analyze real-world data

Understanding these structures forms the foundation for all your data analysis work in Pandas. As you become more familiar with them, you'll find that Pandas makes complex data tasks simpler and more intuitive.

Additional Resources and Exercises

Practice Exercises

  1. Create a Series of 5 different countries and their populations. Calculate the total population.
  2. Create a DataFrame with information about 5 books (title, author, year, pages, genre). Then:
    • Add a column indicating whether the book is "long" (more than 300 pages)
    • Find the oldest and newest books
    • Calculate the average page count by genre
  3. Using the sales data example above:
    • Add a column for the day of the week
    • Find which day had the highest average sales
    • Create a pivot table showing Products vs. Days

By practicing these exercises, you'll strengthen your understanding of Pandas' core structures and be ready to tackle more complex data analysis tasks.


