Pandas Complex Data Types
Introduction
When working with real-world data in Pandas, you'll often encounter scenarios where the simple string, number, or boolean data types aren't sufficient. Pandas provides support for a variety of complex data types that allow you to work with more sophisticated data structures. In this tutorial, we'll explore how to work with these complex data types effectively.
Complex data types in Pandas include:
- Lists, dictionaries, and other Python objects stored within DataFrame cells
- JSON and nested data structures
- Custom data types and extension arrays
- Categorical data
- Time-related data types
Understanding these complex data types will significantly enhance your data manipulation capabilities in Pandas.
Basic Complex Data Types in Pandas
Lists and Dictionaries in DataFrame Cells
One of the most common complex-data scenarios you'll encounter is Python objects such as lists or dictionaries stored directly in the cells of a DataFrame.
import pandas as pd
import numpy as np
# Creating a DataFrame with lists in cells
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie'],
'scores': [[85, 90, 78], [92, 88, 95], [76, 85, 91]],
'preferences': [{'color': 'blue', 'food': 'pizza'},
{'color': 'green', 'food': 'pasta'},
{'color': 'red', 'food': 'burger'}]
})
print(df)
Output:
name scores preferences
0 Alice [85, 90, 78] {'color': 'blue', 'food': 'pizza'}
1 Bob [92, 88, 95] {'color': 'green', 'food': 'pasta'}
2 Charlie [76, 85, 91] {'color': 'red', 'food': 'burger'}
Accessing Elements within Complex Types
To access elements within these complex types, you can use standard Python indexing and methods:
# Accessing the first score of each student
print(df['scores'].apply(lambda x: x[0]))
# Accessing the color preference of each student
print(df['preferences'].apply(lambda x: x['color']))
Output:
0 85
1 92
2 76
Name: scores, dtype: int64
0 blue
1 green
2 red
Name: preferences, dtype: object
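If you want each dictionary key as its own column, a common pattern (sketched here, reusing the df defined above) is to expand the dict column into a separate DataFrame:
# Expand the 'preferences' dict column into one column per key
prefs = pd.DataFrame(df['preferences'].tolist(), index=df.index)
print(df[['name']].join(prefs))
Output:
name color food
0 Alice blue pizza
1 Bob green pasta
2 Charlie red burger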
Working with JSON and Nested Data
Pandas provides functions to work with nested JSON data, which is common in web API responses.
# Sample JSON data
json_data = [
{
"user": {
"name": "Alice",
"age": 30,
"skills": ["Python", "SQL", "Tableau"]
},
"activity": {
"logins": 45,
"projects": 12
}
},
{
"user": {
"name": "Bob",
"age": 25,
"skills": ["Java", "JavaScript", "CSS"]
},
"activity": {
"logins": 32,
"projects": 8
}
}
]
# Convert JSON to DataFrame
df_nested = pd.json_normalize(json_data)
print(df_nested)
Output:
user.name user.age user.skills activity.logins activity.projects
0 Alice 30 [Python, SQL, Tableau] 45 12
1 Bob 25 [Java, JavaScript, CSS] 32 8
Notice how Pandas flattens the nested JSON structure using dot notation for column names.
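json_normalize also accepts parameters that control the flattening. For example, sep changes the separator used to build column names (a brief sketch reusing json_data from above):
# Use underscores instead of dots in the flattened column names
df_underscore = pd.json_normalize(json_data, sep='_')
print(df_underscore.columns.tolist())
Output:
['user_name', 'user_age', 'user_skills', 'activity_logins', 'activity_projects']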
Categorical Data Type
Categorical data is common in data analysis, and Pandas provides a dedicated dtype for it. Using the categorical type can save substantial memory and speed up operations when a column has a limited set of possible values.
# Creating a DataFrame with categorical data
df = pd.DataFrame({
'id': range(1000),
'department': np.random.choice(['HR', 'Sales', 'IT', 'Marketing'], 1000)
})
# Check memory usage
print(f"Original size: {df['department'].memory_usage(deep=True)} bytes")
# Convert to categorical
df['department'] = df['department'].astype('category')
# Check memory usage after conversion
print(f"Categorical size: {df['department'].memory_usage(deep=True)} bytes")
# View the categories
print(df['department'].cat.categories)
Output (exact byte counts vary by platform and Pandas version):
Original size: ~61600 bytes
Categorical size: ~1300 bytes
Index(['HR', 'IT', 'Marketing', 'Sales'], dtype='object')
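The savings come from how categoricals are stored: each row holds a small integer code that points into a single shared array of categories. You can inspect this directly (a short illustrative sketch):
# Each row stores an integer code instead of a full string;
# with only 4 categories, the codes fit in int8
print(df['department'].cat.codes.dtype)
Output:
int8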
Working with Categorical Data
Categorical data types provide special methods for manipulation:
# Reordering categories
df['department'] = df['department'].cat.reorder_categories(['IT', 'HR', 'Sales', 'Marketing'])
# Adding a new category
df['department'] = df['department'].cat.add_categories('Finance')
# Renaming categories
df['department'] = df['department'].cat.rename_categories({
'HR': 'Human Resources',
'IT': 'Information Technology'
})
print(df['department'].cat.categories)
Output:
Index(['Information Technology', 'Human Resources', 'Sales', 'Marketing', 'Finance'], dtype='object')
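Categories can also carry an explicit order, which enables comparisons and meaningful sorting. Here is a minimal sketch using pd.CategoricalDtype:
# Ordered categoricals compare and sort by category order, not alphabetically
size_type = pd.CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True)
sizes = pd.Series(['large', 'small', 'medium'], dtype=size_type)
print(sizes.sort_values())
print(sizes > 'small')
Output:
1 small
2 medium
0 large
dtype: category
Categories (3, object): ['small' < 'medium' < 'large']
0 True
1 False
2 True
dtype: bool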
Time-Related Data Types
Pandas has powerful capabilities for working with dates, times, and timedeltas.
DateTime Data
# Create a DataFrame with date information
df_dates = pd.DataFrame({
'date_string': ['2023-01-15', '2023-02-20', '2023-03-25']
})
# Convert to datetime
df_dates['date'] = pd.to_datetime(df_dates['date_string'])
# Extract components
df_dates['year'] = df_dates['date'].dt.year
df_dates['month'] = df_dates['date'].dt.month
df_dates['day'] = df_dates['date'].dt.day
df_dates['weekday'] = df_dates['date'].dt.day_name()
print(df_dates)
Output:
date_string date year month day weekday
0 2023-01-15 2023-01-15 2023 1 15 Sunday
1 2023-02-20 2023-02-20 2023 2 20 Monday
2 2023-03-25 2023-03-25 2023 3 25 Saturday
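pd.to_datetime is also forgiving about messy input: passing errors='coerce' turns unparseable strings into NaT instead of raising (a quick sketch):
# Unparseable values become NaT (not-a-time) rather than raising an error
print(pd.to_datetime(['2023-01-15', 'not a date'], errors='coerce'))
Output:
DatetimeIndex(['2023-01-15', 'NaT'], dtype='datetime64[ns]', freq=None)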
Timedelta Data
A Timedelta represents the difference between two datetime values:
# Creating TimeDelta
df_dates['next_week'] = df_dates['date'] + pd.Timedelta(weeks=1)
df_dates['diff'] = df_dates['next_week'] - df_dates['date']
print(df_dates[['date', 'next_week', 'diff']])
Output:
date next_week diff
0 2023-01-15 2023-01-22 7 days
1 2023-02-20 2023-02-27 7 days
2 2023-03-25 2023-04-01 7 days
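Timedelta columns have their own .dt accessor for extracting components:
# Extract components from the timedelta column
print(df_dates['diff'].dt.days)
print(df_dates['diff'].dt.total_seconds())
Output:
0 7
1 7
2 7
Name: diff, dtype: int64
0 604800.0
1 604800.0
2 604800.0
Name: diff, dtype: float64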
Extension Arrays and Custom Data Types
Pandas supports extension arrays that enable custom data types. One common example is IntegerArray (the nullable Int64 dtype), which allows integer columns to contain missing values (NA).
# A regular integer Series is upcast to float64 when it contains NaN
standard_series = pd.Series([1, 2, np.nan, 4])
print(f"Standard series dtype: {standard_series.dtype}")
print(standard_series)
# Integer array preserves integer type while supporting NA
integer_series = pd.Series([1, 2, np.nan, 4], dtype="Int64") # Note the capital "I"
print(f"Integer array dtype: {integer_series.dtype}")
print(integer_series)
Output:
Standard series dtype: float64
0 1.0
1 2.0
2 NaN
3 4.0
dtype: float64
Integer array dtype: Int64
0 1
1 2
2 <NA>
3 4
dtype: Int64
Other useful extension array types include:
# Boolean with NA values
bool_series = pd.Series([True, False, np.nan, True], dtype="boolean")
print(bool_series)
# String data type
string_series = pd.Series(["apple", "banana", None, "date"], dtype="string")
print(string_series)
Output:
0 True
1 False
2 <NA>
3 True
dtype: boolean
0 apple
1 banana
2 <NA>
3 date
dtype: string
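A useful property of these nullable dtypes is that operations propagate pd.NA rather than silently changing the dtype (a quick sketch using integer_series from above):
# Arithmetic keeps the Int64 dtype and propagates <NA>
print(integer_series + 1)
# Comparisons return the nullable boolean dtype
print(integer_series > 2)
Output:
0 2
1 3
2 <NA>
3 5
dtype: Int64
0 False
1 False
2 <NA>
3 True
dtype: boolean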
Real-World Applications
Let's look at some practical examples where complex data types are useful.
Analyzing Survey Data
Survey data often includes multiple-choice questions, ratings, and text responses:
# Sample survey data
survey_data = pd.DataFrame({
'respondent_id': range(1, 6),
'age_group': pd.Categorical(['18-24', '35-44', '25-34', '45-54', '18-24'],
categories=['18-24', '25-34', '35-44', '45-54', '55+']),
'responses': [
{'q1': 5, 'q2': 4, 'q3': 3},
{'q1': 3, 'q2': 3, 'q3': 2},
{'q1': 4, 'q2': 5, 'q3': 4},
{'q1': 2, 'q2': 2, 'q3': 1},
{'q1': 5, 'q2': 4, 'q3': 5}
],
'selected_options': [
['option1', 'option3'],
['option2'],
['option1', 'option2', 'option3'],
['option3'],
['option1', 'option2']
]
})
# Extract nested data for analysis
survey_data['q1_score'] = survey_data['responses'].apply(lambda x: x['q1'])
survey_data['selected_option_count'] = survey_data['selected_options'].apply(len)
# Analyze by age group (observed=False keeps unobserved categories like '55+' in the result)
print(survey_data.groupby('age_group', observed=False)['q1_score'].mean())
Output:
age_group
18-24 5.0
25-34 4.0
35-44 3.0
45-54 2.0
55+ NaN
Name: q1_score, dtype: float64
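A natural next step for the multiple-choice column is explode(), which gives each list element its own row so that standard tools like value_counts() apply (a hedged sketch):
# One row per selected option, then count how often each was chosen
option_counts = survey_data.explode('selected_options')['selected_options'].value_counts()
print(option_counts)  # each of option1/option2/option3 appears 3 times in this sample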
Processing Log Data with Timestamps
Working with server logs often involves timestamp processing:
# Sample log data
log_data = pd.DataFrame({
'timestamp': pd.to_datetime([
'2023-01-15 08:30:45',
'2023-01-15 09:15:32',
'2023-01-15 10:45:10',
'2023-01-16 08:05:23',
'2023-01-16 12:30:15'
]),
'user_id': [101, 102, 101, 103, 102],
'action': ['login', 'view_page', 'logout', 'login', 'purchase']
})
# Set timestamp as index
log_data.set_index('timestamp', inplace=True)
# Resample by hour to count actions (newer Pandas versions prefer freq='h' over 'H')
hourly_actions = log_data.groupby(pd.Grouper(freq='H')).count()
print(hourly_actions['user_id'])
# Calculate session durations for user 101
user_101 = log_data[log_data['user_id'] == 101]
login_time = user_101[user_101['action'] == 'login'].index[0]
logout_time = user_101[user_101['action'] == 'logout'].index[0]
session_duration = logout_time - login_time
print(f"\nUser 101 session duration: {session_duration}")
Output:
timestamp
2023-01-15 08:00:00 1
2023-01-15 09:00:00 1
2023-01-15 10:00:00 1
2023-01-15 11:00:00 0
2023-01-15 12:00:00 0
2023-01-15 13:00:00 0
...
2023-01-16 08:00:00 1
2023-01-16 09:00:00 0
2023-01-16 10:00:00 0
2023-01-16 11:00:00 0
2023-01-16 12:00:00 1
Name: user_id, dtype: int64
User 101 session duration: 2:14:25
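The same hourly counts can be written more concisely with resample, which is equivalent to grouping on pd.Grouper here:
# Equivalent hourly counts via resample (size() counts rows per bin)
print(log_data.resample('H').size())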
Summary
In this tutorial, we've explored Pandas' handling of complex data types:
- Lists and dictionaries within DataFrame cells
- Working with nested JSON data
- Categorical data for memory efficiency and specialized operations
- Datetime, Timedelta, and other time-related operations
- Extension arrays for specialized data types like Int64 with NA support
- Real-world applications demonstrating how to leverage these data types
Understanding complex data types in Pandas allows you to work with more sophisticated real-world datasets and perform more advanced analyses. These capabilities become particularly valuable when dealing with data from web APIs, survey responses, log files, and other complex sources.
Exercises
- Exploratory Exercise: Create a DataFrame containing product information where each product has multiple categories stored as a list. Extract and analyze the most common categories.
- JSON Processing: Download a JSON dataset from a public API (like the GitHub API or a weather API) and practice normalizing and analyzing the nested data.
- Time Series Challenge: Create a DataFrame with datetime data spanning multiple years. Calculate various time-based metrics like week-over-week growth, monthly averages, and business days between events.
- Survey Analysis: Create a more complex survey dataset with multiple question types (single choice stored as categories, multiple choice stored as lists, and ratings stored in a nested dictionary). Write functions to extract and visualize insights from this data.
- Memory Optimization: Create a large DataFrame (with at least 100,000 rows) containing text columns with repetitive values. Experiment with converting these columns to categorical and measure the memory savings.
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)