Pandas Complex Data Types

Introduction

When working with real-world data in Pandas, you'll often encounter scenarios where the simple string, number, or boolean data types aren't sufficient. Pandas provides support for a variety of complex data types that allow you to work with more sophisticated data structures. In this tutorial, we'll explore how to work with these complex data types effectively.

Complex data types in Pandas include:

  • Lists, dictionaries, and other Python objects stored within DataFrame cells
  • JSON and nested data structures
  • Custom data types and extension arrays
  • Categorical data
  • Time-related data types

Understanding these complex data types will significantly enhance your data manipulation capabilities in Pandas.

Basic Complex Data Types in Pandas

Lists and Dictionaries in DataFrame Cells

One of the most common patterns you'll encounter is Python objects such as lists or dictionaries stored within the cells of a DataFrame.

python
import pandas as pd
import numpy as np

# Creating a DataFrame with lists in cells
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'scores': [[85, 90, 78], [92, 88, 95], [76, 85, 91]],
    'preferences': [{'color': 'blue', 'food': 'pizza'},
                    {'color': 'green', 'food': 'pasta'},
                    {'color': 'red', 'food': 'burger'}]
})

print(df)

Output:

      name        scores                          preferences
0    Alice  [85, 90, 78]   {'color': 'blue', 'food': 'pizza'}
1      Bob  [92, 88, 95]  {'color': 'green', 'food': 'pasta'}
2  Charlie  [76, 85, 91]   {'color': 'red', 'food': 'burger'}

Accessing Elements within Complex Types

To access elements within these complex types, you can use standard Python indexing and methods:

python
# Accessing the first score of each student
print(df['scores'].apply(lambda x: x[0]))

# Accessing the color preference of each student
print(df['preferences'].apply(lambda x: x['color']))

Output:

0    85
1    92
2    76
Name: scores, dtype: int64

0     blue
1    green
2      red
Name: preferences, dtype: object
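When a column holds lists, DataFrame.explode is often cleaner than chaining apply calls: it expands each list element into its own row, repeating the other columns. A minimal sketch using the df defined above:

python
# Expand the list column so each score becomes its own row
exploded = df.explode('scores')
exploded['scores'] = exploded['scores'].astype(int)  # explode keeps object dtype

# Per-student average computed on the flattened frame
print(exploded.groupby('name')['scores'].mean())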

Working with JSON and Nested Data

Pandas provides functions to work with nested JSON data, which is common in web API responses.

python
# Sample JSON data
json_data = [
    {
        "user": {
            "name": "Alice",
            "age": 30,
            "skills": ["Python", "SQL", "Tableau"]
        },
        "activity": {
            "logins": 45,
            "projects": 12
        }
    },
    {
        "user": {
            "name": "Bob",
            "age": 25,
            "skills": ["Java", "JavaScript", "CSS"]
        },
        "activity": {
            "logins": 32,
            "projects": 8
        }
    }
]

# Convert JSON to DataFrame
df_nested = pd.json_normalize(json_data)
print(df_nested)

Output:

  user.name  user.age              user.skills  activity.logins  activity.projects
0     Alice        30   [Python, SQL, Tableau]               45                 12
1       Bob        25  [Java, JavaScript, CSS]               32                  8

Notice how Pandas flattens the nested JSON structure using dot notation for column names.
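json_normalize also takes parameters that control the flattening. For instance, sep changes the separator used in column names, and record_path together with meta turns a nested list into one row per element while carrying parent fields along. A short sketch on the json_data above:

python
# Use underscores instead of dots in flattened column names
df_flat = pd.json_normalize(json_data, sep='_')
print(df_flat.columns.tolist())

# One row per skill, keeping the owner's name alongside it
df_skills = pd.json_normalize(json_data,
                              record_path=['user', 'skills'],
                              meta=[['user', 'name']])
print(df_skills)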

Categorical Data Type

Categorical data is common in data analysis, and Pandas provides a dedicated data type for it. Using the categorical data type can save memory and improve performance when a column has a limited set of possible values.

python
# Creating a DataFrame with categorical data
df = pd.DataFrame({
    'id': range(1000),
    'department': np.random.choice(['HR', 'Sales', 'IT', 'Marketing'], 1000)
})

# Check memory usage
print(f"Original size: {df['department'].memory_usage(deep=True)} bytes")

# Convert to categorical
df['department'] = df['department'].astype('category')

# Check memory usage after conversion
print(f"Categorical size: {df['department'].memory_usage(deep=True)} bytes")

# View the categories
print(df['department'].cat.categories)

Output:

Original size: 8000 bytes
Categorical size: 1124 bytes
Index(['HR', 'IT', 'Marketing', 'Sales'], dtype='object')

Working with Categorical Data

Categorical data types provide special methods for manipulation:

python
# Reordering categories
df['department'] = df['department'].cat.reorder_categories(['IT', 'HR', 'Sales', 'Marketing'])

# Adding a new category
df['department'] = df['department'].cat.add_categories('Finance')

# Renaming categories
df['department'] = df['department'].cat.rename_categories({
    'HR': 'Human Resources',
    'IT': 'Information Technology'
})

print(df['department'].cat.categories)

Output:

Index(['Information Technology', 'Human Resources', 'Sales', 'Marketing', 'Finance'], dtype='object')
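Categories can also carry an order (ordered=True), which enables comparisons and rank-aware sorting. A brief sketch with hypothetical size data:

python
# An ordered categorical sorts and compares by category rank, not alphabetically
sizes = pd.Series(['small', 'large', 'medium'], dtype='category')
sizes = sizes.cat.set_categories(['small', 'medium', 'large'], ordered=True)

print(sizes.sort_values())  # small, medium, large
print(sizes > 'small')      # element-wise comparison against a category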

Date and Time Data Types

Pandas has powerful capabilities for working with dates, times, and timedeltas.

DateTime Data

python
# Create a DataFrame with date information
df_dates = pd.DataFrame({
    'date_string': ['2023-01-15', '2023-02-20', '2023-03-25']
})

# Convert to datetime
df_dates['date'] = pd.to_datetime(df_dates['date_string'])

# Extract components
df_dates['year'] = df_dates['date'].dt.year
df_dates['month'] = df_dates['date'].dt.month
df_dates['day'] = df_dates['date'].dt.day
df_dates['weekday'] = df_dates['date'].dt.day_name()

print(df_dates)

Output:

  date_string       date  year  month  day   weekday
0  2023-01-15 2023-01-15  2023      1   15    Sunday
1  2023-02-20 2023-02-20  2023      2   20    Monday
2  2023-03-25 2023-03-25  2023      3   25  Saturday
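pd.to_datetime infers ISO-formatted strings like these automatically; for other layouts you can pass an explicit strftime pattern via format, which is also faster on large columns. A short sketch with hypothetical US-style dates:

python
# Non-ISO strings parse reliably with an explicit format
us_dates = pd.to_datetime(['01/15/2023', '02/20/2023'], format='%m/%d/%Y')
print(us_dates)

# errors='coerce' turns unparseable values into NaT instead of raising
mixed = pd.to_datetime(['2023-01-15', 'not a date'], errors='coerce')
print(mixed)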

TimeDelta Data

TimeDelta represents the difference between two datetime values:

python
# Creating TimeDelta
df_dates['next_week'] = df_dates['date'] + pd.Timedelta(weeks=1)
df_dates['diff'] = df_dates['next_week'] - df_dates['date']

print(df_dates[['date', 'next_week', 'diff']])

Output:

        date  next_week   diff
0 2023-01-15 2023-01-22 7 days
1 2023-02-20 2023-02-27 7 days
2 2023-03-25 2023-04-01 7 days
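Timedelta columns have their own dt accessor for turning durations into numbers, for example dt.days and dt.total_seconds(). Applied to the diff column above:

python
# Numeric views of the timedelta column
print(df_dates['diff'].dt.days)             # whole days as integers
print(df_dates['diff'].dt.total_seconds())  # full duration in seconds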

Extension Arrays and Custom Data Types

Pandas supports extension arrays, which enable custom data types. A common example is the nullable integer type, which stores integers while still supporting missing values (NA).

python
# Regular int arrays convert NaN to float
standard_series = pd.Series([1, 2, np.nan, 4])
print(f"Standard series dtype: {standard_series.dtype}")
print(standard_series)

# Integer array preserves integer type while supporting NA
integer_series = pd.Series([1, 2, np.nan, 4], dtype="Int64") # Note the capital "I"
print(f"Integer array dtype: {integer_series.dtype}")
print(integer_series)

Output:

Standard series dtype: float64
0    1.0
1    2.0
2    NaN
3    4.0
dtype: float64

Integer array dtype: Int64
0       1
1       2
2    <NA>
3       4
dtype: Int64

Other useful extension array types include:

python
# Boolean with NA values
bool_series = pd.Series([True, False, np.nan, True], dtype="boolean")
print(bool_series)

# String data type
string_series = pd.Series(["apple", "banana", None, "date"], dtype="string")
print(string_series)

Output:

0     True
1    False
2     <NA>
3     True
dtype: boolean

0     apple
1    banana
2      <NA>
3      date
dtype: string
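Rather than setting nullable dtypes column by column, DataFrame.convert_dtypes() (available since pandas 1.0) picks the best extension type for each column in one pass. A minimal sketch:

python
# convert_dtypes() upgrades columns to nullable extension types at once
df_mixed = pd.DataFrame({
    'ints': [1, 2, np.nan],        # float64 -> Int64
    'flags': [True, False, None],  # object  -> boolean
    'text': ['a', None, 'c']       # object  -> string
})
print(df_mixed.convert_dtypes().dtypes)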

Real-World Applications

Let's look at some practical examples where complex data types are useful.

Analyzing Survey Data

Survey data often includes multiple-choice questions, ratings, and text responses:

python
# Sample survey data
survey_data = pd.DataFrame({
    'respondent_id': range(1, 6),
    'age_group': pd.Categorical(['18-24', '35-44', '25-34', '45-54', '18-24'],
                                categories=['18-24', '25-34', '35-44', '45-54', '55+']),
    'responses': [
        {'q1': 5, 'q2': 4, 'q3': 3},
        {'q1': 3, 'q2': 3, 'q3': 2},
        {'q1': 4, 'q2': 5, 'q3': 4},
        {'q1': 2, 'q2': 2, 'q3': 1},
        {'q1': 5, 'q2': 4, 'q3': 5}
    ],
    'selected_options': [
        ['option1', 'option3'],
        ['option2'],
        ['option1', 'option2', 'option3'],
        ['option3'],
        ['option1', 'option2']
    ]
})

# Extract nested data for analysis
survey_data['q1_score'] = survey_data['responses'].apply(lambda x: x['q1'])
survey_data['selected_option_count'] = survey_data['selected_options'].apply(len)

# Analyze by age group
print(survey_data.groupby('age_group')['q1_score'].mean())

Output:

age_group
18-24    5.0
25-34    4.0
35-44    3.0
45-54    2.0
55+      NaN
Name: q1_score, dtype: float64
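The list-valued selected_options column can be analyzed the same way: explode it into one row per option, then count. A sketch:

python
# One row per selected option, then tally popularity
option_counts = (survey_data.explode('selected_options')['selected_options']
                 .value_counts())
print(option_counts)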

Processing Log Data with Timestamps

Working with server logs often involves timestamp processing:

python
# Sample log data
log_data = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2023-01-15 08:30:45',
        '2023-01-15 09:15:32',
        '2023-01-15 10:45:10',
        '2023-01-16 08:05:23',
        '2023-01-16 12:30:15'
    ]),
    'user_id': [101, 102, 101, 103, 102],
    'action': ['login', 'view_page', 'logout', 'login', 'purchase']
})

# Set timestamp as index
log_data.set_index('timestamp', inplace=True)

# Resample by hour to count actions
hourly_actions = log_data.groupby(pd.Grouper(freq='H')).count()
print(hourly_actions['user_id'])

# Calculate session durations for user 101
user_101 = log_data[log_data['user_id'] == 101]
login_time = user_101[user_101['action'] == 'login'].index[0]
logout_time = user_101[user_101['action'] == 'logout'].index[0]
session_duration = logout_time - login_time

print(f"\nUser 101 session duration: {session_duration}")

Output:

timestamp
2023-01-15 08:00:00    1
2023-01-15 09:00:00    1
2023-01-15 10:00:00    1
2023-01-15 11:00:00    0
2023-01-15 12:00:00    0
2023-01-15 13:00:00    0
...
2023-01-16 08:00:00    1
2023-01-16 09:00:00    0
2023-01-16 10:00:00    0
2023-01-16 11:00:00    0
2023-01-16 12:00:00    1
Name: user_id, dtype: int64

User 101 session duration: 0 days 02:14:25
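When users have more than one session, a common pattern is to sort by timestamp and take per-user differences with groupby(...).diff() instead of indexing login/logout pairs by hand. A sketch against the log_data above (whose timestamps were moved into the index by set_index):

python
# Time elapsed between consecutive events for each user
events = log_data.reset_index().sort_values('timestamp')
events['gap'] = events.groupby('user_id')['timestamp'].diff()
print(events[['timestamp', 'user_id', 'action', 'gap']])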

Summary

In this tutorial, we've explored Pandas' handling of complex data types:

  • Lists and dictionaries within DataFrame cells
  • Working with nested JSON data
  • Categorical data for memory efficiency and specialized operations
  • DateTime, TimeDelta, and time-related operations
  • Extension arrays for specialized data types like Int64 with NA support
  • Real-world applications demonstrating how to leverage these data types

Understanding complex data types in Pandas allows you to work with more sophisticated real-world datasets and perform more advanced analyses. These capabilities become particularly valuable when dealing with data from web APIs, survey responses, log files, and other complex sources.

Exercises

  1. Exploratory Exercise: Create a DataFrame containing product information where each product has multiple categories stored as a list. Extract and analyze the most common categories.

  2. JSON Processing: Download a JSON dataset from a public API (like the GitHub API or a weather API) and practice normalizing and analyzing the nested data.

  3. Time Series Challenge: Create a DataFrame with datetime data spanning multiple years. Calculate various time-based metrics like week-over-week growth, monthly averages, and business days between events.

  4. Survey Analysis: Create a more complex survey dataset with multiple question types (single choice stored as categories, multiple choice stored as lists, and ratings stored in a nested dictionary). Write functions to extract and visualize insights from this data.

  5. Memory Optimization: Create a large DataFrame (with at least 100,000 rows) containing text columns with repetitive values. Experiment with converting these columns to categorical and measure the memory savings.


