Pandas Structure Overview

Introduction

Pandas is a widely used Python library for data analysis. At its core, Pandas provides two primary data structures that you'll use for almost all data manipulation tasks: Series and DataFrame. Understanding these structures is fundamental to working effectively with data in Python.

In this tutorial, we'll explore these structures, learn how they're organized, and see how they work together to make data analysis more intuitive and efficient.

Pandas Core Data Structures

1. Series: One-dimensional labeled arrays

A Pandas Series is a one-dimensional labeled array that can hold any data type (integers, strings, floating-point numbers, Python objects, etc.). It's similar to a column in a spreadsheet or a single variable in your dataset.

Creating a Series

Let's start by creating a simple Series:

python
import pandas as pd

# Create a Series from a list
simple_series = pd.Series([10, 20, 30, 40, 50])
print(simple_series)

Output:

0    10
1    20
2    30
3    40
4    50
dtype: int64

Notice how Pandas automatically assigns an index (0 through 4) to our values.

We can also specify our own index labels:

python
# Create a Series with custom indices
fruit_series = pd.Series([30, 25, 40, 10], 
                         index=['apples', 'bananas', 'oranges', 'grapes'])
print(fruit_series)

Output:

apples     30
bananas    25
oranges    40
grapes     10
dtype: int64
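As a small sketch of the point that a Series can hold different data types, the snippet below builds a few Series and inspects their dtype. The variable names are illustrative assumptions and not part of the tutorial's running example. It also shows that a Series built from a dictionary takes its index labels from the dictionary keys.

```python
import pandas as pd

# Illustrative names, not part of the running example
int_series = pd.Series([1, 2, 3])
float_series = pd.Series([1.5, 2.5, 3.5])
stock = pd.Series({'apples': 30, 'bananas': 25})  # dict keys become index labels

print(int_series.dtype)    # an integer dtype
print(float_series.dtype)  # a floating-point dtype
print(list(stock.index))   # labels taken from the dict keys
```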

Series Attributes and Methods

Series have useful attributes and methods:

python
# Get basic information about our Series
print("Series values:", fruit_series.values)
print("Series index:", fruit_series.index)
print("Series data type:", fruit_series.dtype)
print("Series shape:", fruit_series.shape)

# Using methods
print("Mean value:", fruit_series.mean())
print("Max value:", fruit_series.max())
print("Description:", fruit_series.describe())

Output:

Series values: [30 25 40 10]
Series index: Index(['apples', 'bananas', 'oranges', 'grapes'], dtype='object')
Series data type: int64
Series shape: (4,)
Mean value: 26.25
Max value: 40
Description: count     4.000000
mean     26.250000
std      12.513326
min      10.000000
25%      21.250000
50%      27.500000
75%      32.500000
max      40.000000
dtype: float64

Accessing Series Elements

You can access elements by their index label or position:

python
# Access by label
print("Number of apples:", fruit_series['apples'])

# Access by position (use .iloc; plain [0] on a label-indexed Series is deprecated in recent pandas versions)
print("First element:", fruit_series.iloc[0])

# Slicing
print("First two elements:\n", fruit_series[:2])

Output:

Number of apples: 30
First element: 30
First two elements:
 apples     30
bananas    25
dtype: int64
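The difference between label-based and position-based access can be made explicit with the `.loc` and `.iloc` accessors. The following is a minimal sketch that rebuilds `fruit_series` so it runs on its own; the other variable names are illustrative.

```python
import pandas as pd

fruit_series = pd.Series([30, 25, 40, 10],
                         index=['apples', 'bananas', 'oranges', 'grapes'])

# .loc selects by label, .iloc selects by integer position
by_label = fruit_series.loc['oranges']
by_position = fruit_series.iloc[2]
print(by_label, by_position)

# Label slices include the end label; position slices exclude the end position
label_slice = fruit_series.loc['apples':'oranges']
position_slice = fruit_series.iloc[0:2]
print(len(label_slice), len(position_slice))
```

Being explicit about `.loc` versus `.iloc` avoids ambiguity when the index labels are themselves integers.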

2. DataFrame: Two-dimensional labeled data structure

A DataFrame is a 2D labeled data structure, similar to a spreadsheet or SQL table. It's like a collection of Series objects that share the same index.

Creating a DataFrame

Let's create a simple DataFrame:

python
# Create a DataFrame from a dictionary
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
    'Salary': [65000, 70000, 80000, 75000]
}

df = pd.DataFrame(data)
print(df)

Output:

    Name  Age      City  Salary
0   John   28  New York   65000
1   Anna   24     Paris   70000
2  Peter   35    Berlin   80000
3  Linda   32    London   75000

You can also create a DataFrame from a list of lists, a NumPy array, or other formats.
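As a brief sketch of those alternatives, the snippet below builds one DataFrame from a list of lists and another from a 2D NumPy array. The data and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a list of lists: each inner list is one row
rows = [['John', 28], ['Anna', 24]]
df_from_rows = pd.DataFrame(rows, columns=['Name', 'Age'])

# From a 2D NumPy array: column names are supplied separately
arr = np.array([[1, 2], [3, 4], [5, 6]])
df_from_array = pd.DataFrame(arr, columns=['a', 'b'])

print(df_from_rows.shape)
print(df_from_array.shape)
```

In both cases the `columns` argument names the columns; without it, Pandas falls back to integer column labels.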

DataFrame Attributes and Methods

DataFrames have many useful attributes and methods:

python
# Basic DataFrame information
print("DataFrame shape:", df.shape)
print("DataFrame columns:", df.columns)
print("DataFrame index:", df.index)
print("DataFrame data types:\n", df.dtypes)

# Statistical summary
print("\nSummary statistics:\n", df.describe())

# Top and bottom rows
print("\nFirst 2 rows:\n", df.head(2))
print("\nLast 2 rows:\n", df.tail(2))

Output:

DataFrame shape: (4, 4)
DataFrame columns: Index(['Name', 'Age', 'City', 'Salary'], dtype='object')
DataFrame index: RangeIndex(start=0, stop=4, step=1)
DataFrame data types:
 Name      object
Age        int64
City      object
Salary     int64
dtype: object

Summary statistics:
              Age        Salary
count   4.000000      4.000000
mean   29.750000  72500.000000
std     4.787136   6454.972244
min    24.000000  65000.000000
25%    27.000000  68750.000000
50%    30.000000  72500.000000
75%    32.750000  76250.000000
max    35.000000  80000.000000

First 2 rows:
   Name  Age      City  Salary
0  John   28  New York   65000
1  Anna   24     Paris   70000

Last 2 rows:
    Name  Age     City  Salary
2  Peter   35    Berlin   80000
3  Linda   32    London   75000

Accessing DataFrame Elements

There are multiple ways to access data in a DataFrame:

python
# Access a single column (returns a Series)
print("Names:\n", df['Name'])

# Access multiple columns
print("\nNames and Ages:\n", df[['Name', 'Age']])

# Access a row by position using iloc
print("\nSecond row:\n", df.iloc[1])

# Access a row by label using loc
print("\nRow with index 2:\n", df.loc[2])

# Access a specific cell
print("\nLinda's salary:", df.loc[3, 'Salary'])

# Boolean indexing
print("\nPeople older than 30:\n", df[df['Age'] > 30])

Output:

Names:
 0     John
1     Anna
2    Peter
3    Linda
Name: Name, dtype: object

Names and Ages:
    Name  Age
0   John   28
1   Anna   24
2  Peter   35
3  Linda   32

Second row:
 Name      Anna
Age         24
City      Paris
Salary    70000
Name: 1, dtype: object

Row with index 2:
 Name      Peter
Age          35
City      Berlin
Salary     80000
Name: 2, dtype: object

Linda's salary: 75000

People older than 30:
    Name  Age     City  Salary
2  Peter   35    Berlin   80000
3  Linda   32    London   75000
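With the default integer index, `loc` and `iloc` can look interchangeable. The sketch below uses a small hypothetical DataFrame with string row labels to show where they differ, and also shows how to combine two conditions in boolean indexing.

```python
import pandas as pd

people = pd.DataFrame(
    {'Age': [28, 24, 35], 'City': ['New York', 'Paris', 'Berlin']},
    index=['John', 'Anna', 'Peter'],  # hypothetical custom row labels
)

# loc uses the row label and column name
city_by_label = people.loc['Anna', 'City']
# iloc uses the row position and column position
city_by_position = people.iloc[1, 1]
print(city_by_label, city_by_position)

# Combine conditions with & (and) or | (or); each condition needs parentheses
older_not_paris = people[(people['Age'] > 25) & (people['City'] != 'Paris')]
print(list(older_not_paris.index))
```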

The Relationship Between Series and DataFrames

A DataFrame is essentially a collection of Series objects that share the same index. Each column in a DataFrame is a Series:

python
# Extract the Age column as a Series
age_series = df['Age']
print("Type of age_series:", type(age_series))
print(age_series)

Output:

Type of age_series: <class 'pandas.core.series.Series'>
0    28
1    24
2    35
3    32
Name: Age, dtype: int64
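The relationship also works in the other direction: Series that share an index can be assembled into a DataFrame. The following is a minimal sketch using `pd.concat` with made-up data; the variable names are illustrative.

```python
import pandas as pd

ages = pd.Series([28, 24], index=['John', 'Anna'], name='Age')
cities = pd.Series(['New York', 'Paris'], index=['John', 'Anna'], name='City')

# Series with the same index line up row by row; each Series name becomes a column name
combined = pd.concat([ages, cities], axis=1)
print(combined)

# Pulling a column back out gives a Series with the same values and index
print(combined['Age'].equals(ages))
```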

Real-World Example: Analyzing Sales Data

Let's use a more practical example of analyzing some sales data:

python
# Create a DataFrame with sales data
sales_data = {
    'Date': ['2023-01-15', '2023-01-16', '2023-01-17', '2023-01-18', '2023-01-19'],
    'Product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone'],
    'Units_Sold': [5, 10, 7, 8, 12],
    'Revenue': [5000, 6000, 2100, 8000, 7200]
}

sales_df = pd.DataFrame(sales_data)

# Convert Date column to datetime format
sales_df['Date'] = pd.to_datetime(sales_df['Date'])
print(sales_df)

# Calculate the revenue per unit
sales_df['Price_Per_Unit'] = sales_df['Revenue'] / sales_df['Units_Sold']
print("\nSales data with price per unit:\n", sales_df)

# Group by product: total units sold, total revenue, and average price per unit
product_summary = sales_df.groupby('Product').agg({
    'Units_Sold': 'sum',
    'Revenue': 'sum',
    'Price_Per_Unit': 'mean'
})
print("\nProduct summary:\n", product_summary)

# Sort by revenue to find top performing products
print("\nProducts sorted by revenue:\n", product_summary.sort_values('Revenue', ascending=False))

Output:

        Date Product  Units_Sold  Revenue
0 2023-01-15  Laptop           5     5000
1 2023-01-16   Phone          10     6000
2 2023-01-17  Tablet           7     2100
3 2023-01-18  Laptop           8     8000
4 2023-01-19   Phone          12     7200

Sales data with price per unit:
         Date Product  Units_Sold  Revenue  Price_Per_Unit
0 2023-01-15  Laptop           5     5000          1000.0
1 2023-01-16   Phone          10     6000           600.0
2 2023-01-17  Tablet           7     2100           300.0
3 2023-01-18  Laptop           8     8000          1000.0
4 2023-01-19   Phone          12     7200           600.0

Product summary:
         Units_Sold  Revenue  Price_Per_Unit
Product                                     
Laptop           13    13000          1000.0
Phone            22    13200           600.0
Tablet            7     2100           300.0

Products sorted by revenue:
         Units_Sold  Revenue  Price_Per_Unit
Product                                     
Phone            22    13200           600.0
Laptop           13    13000          1000.0
Tablet            7     2100           300.0

Summary

In this tutorial, we've covered the fundamental structures that make up the Pandas library:

- Series: One-dimensional labeled arrays that can hold any data type
- DataFrames: Two-dimensional labeled data structures similar to tables

We've learned how to:

- Create Series and DataFrames from different data sources
- Access elements using various indexing methods
- Use basic attributes and methods to explore our data
- Apply these concepts to analyze real-world data

Understanding these structures forms the foundation for all your data analysis work in Pandas. As you become more familiar with them, you'll find that Pandas makes complex data tasks simpler and more intuitive.

Additional Resources and Exercises

Practice Exercises

1. Create a Series of 5 different countries and their populations. Calculate the total population.
2. Create a DataFrame with information about 5 books (title, author, year, pages, genre). Then:
   - Add a column indicating whether the book is "long" (more than 300 pages)
   - Find the oldest and newest books
   - Calculate the average page count by genre
3. Using the sales data example above:
   - Add a column for the day of the week
   - Find which day had the highest average sales
   - Create a pivot table showing Products vs. Days

By practicing these exercises, you'll strengthen your understanding of Pandas' core structures and be ready to tackle more complex data analysis tasks.
