Pandas Complex Data Types
Introduction
When working with real-world data in Pandas, you'll often encounter scenarios where the simple string, number, or boolean data types aren't sufficient. Pandas provides support for a variety of complex data types that allow you to work with more sophisticated data structures. In this tutorial, we'll explore how to work with these complex data types effectively.
Complex data types in Pandas include:
- Lists, dictionaries, and other Python objects stored within DataFrame cells
- JSON and nested data structures
- Custom data types and extension arrays
- Categorical data
- Time-related data types
Understanding these complex data types will significantly enhance your data manipulation capabilities in Pandas.
Basic Complex Data Types in Pandas
Lists and Dictionaries in DataFrame Cells
One of the most common complex-data scenarios you'll encounter is Python objects such as lists or dictionaries stored directly in the cells of a DataFrame.
import pandas as pd
import numpy as np
# Creating a DataFrame with lists in cells
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie'],
'scores': [[85, 90, 78], [92, 88, 95], [76, 85, 91]],
'preferences': [{'color': 'blue', 'food': 'pizza'},
{'color': 'green', 'food': 'pasta'},
{'color': 'red', 'food': 'burger'}]
})
print(df)
Output:
name scores preferences
0 Alice [85, 90, 78] {'color': 'blue', 'food': 'pizza'}
1 Bob [92, 88, 95] {'color': 'green', 'food': 'pasta'}
2 Charlie [76, 85, 91] {'color': 'red', 'food': 'burger'}
Accessing Elements within Complex Types
To access elements within these complex types, you can use standard Python indexing and methods:
# Accessing the first score of each student
print(df['scores'].apply(lambda x: x[0]))
# Accessing the color preference of each student
print(df['preferences'].apply(lambda x: x['color']))
Output:
0 85
1 92
2 76
Name: scores, dtype: int64
0 blue
1 green
2 red
Name: preferences, dtype: object
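If you want each dictionary key as its own column, a common pattern (sketched here, reusing the df defined above) is to expand the dict column into a separate DataFrame:
# Expand the 'preferences' dict column into one column per key
prefs = pd.DataFrame(df['preferences'].tolist(), index=df.index)
print(df[['name']].join(prefs))
Output:
name color food
0 Alice blue pizza
1 Bob green pasta
2 Charlie red burger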
Working with JSON and Nested Data
Pandas provides functions to work with nested JSON data, which is common in web API responses.
# Sample JSON data
json_data = [
{
"user": {
"name": "Alice",
"age": 30,
"skills": ["Python", "SQL", "Tableau"]
},
"activity": {
"logins": 45,
"projects": 12
}
},
{
"user": {
"name": "Bob",
"age": 25,
"skills": ["Java", "JavaScript", "CSS"]
},
"activity": {
"logins": 32,
"projects": 8
}
}
]
# Convert JSON to DataFrame
df_nested = pd.json_normalize(json_data)
print(df_nested)
Output:
user.name user.age user.skills activity.logins activity.projects
0 Alice 30 [Python, SQL, Tableau] 45 12
1 Bob 25 [Java, JavaScript, CSS] 32 8
Notice how Pandas flattens the nested JSON structure using dot notation for column names.
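json_normalize also accepts parameters that control the flattening. For example, sep changes the separator used to build column names (a brief sketch reusing json_data from above):
# Use underscores instead of dots in the flattened column names
df_underscore = pd.json_normalize(json_data, sep='_')
print(df_underscore.columns.tolist())
Output:
['user_name', 'user_age', 'user_skills', 'activity_logins', 'activity_projects']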
Categorical Data Type
Categorical data is common in data analysis, and Pandas provides a dedicated dtype for it. Using the categorical type can save substantial memory and speed up operations when a column has a limited set of possible values.
# Creating a DataFrame with categorical data
df = pd.DataFrame({
'id': range(1000),
'department': np.random.choice(['HR', 'Sales', 'IT', 'Marketing'], 1000)
})
# Check memory usage
print(f"Original size: {df['department'].memory_usage(deep=True)} bytes")
# Convert to categorical
df['department'] = df['department'].astype('category')
# Check memory usage after conversion
print(f"Categorical size: {df['department'].memory_usage(deep=True)} bytes")
# View the categories
print(df['department'].cat.categories)
Output (exact byte counts vary by platform and Pandas version):
Original size: ~61600 bytes
Categorical size: ~1300 bytes
Index(['HR', 'IT', 'Marketing', 'Sales'], dtype='object')
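The savings come from how categoricals are stored: each row holds a small integer code that points into a single shared array of categories. You can inspect this directly (a short illustrative sketch):
# Each row stores an integer code instead of a full string;
# with only 4 categories, the codes fit in int8
print(df['department'].cat.codes.dtype)
Output:
int8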
Working with Categorical Data
Categorical data types provide special methods for manipulation:
# Reordering categories
df['department'] = df['department'].cat.reorder_categories(['IT', 'HR', 'Sales', 'Marketing'])
# Adding a new category
df['department'] = df['department'].cat.add_categories('Finance')
# Renaming categories
df['department'] = df['department'].cat.rename_categories({
'HR': 'Human Resources',
'IT': 'Information Technology'
})
print(df['department'].cat.categories)
Output:
Index(['Information Technology', 'Human Resources', 'Sales', 'Marketing', 'Finance'], dtype='object')
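Categories can also carry an explicit order, which enables comparisons and meaningful sorting. Here is a minimal sketch using pd.CategoricalDtype:
# Ordered categoricals compare and sort by category order, not alphabetically
size_type = pd.CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True)
sizes = pd.Series(['large', 'small', 'medium'], dtype=size_type)
print(sizes.sort_values())
print(sizes > 'small')
Output:
1 small
2 medium
0 large
dtype: category
Categories (3, object): ['small' < 'medium' < 'large']
0 True
1 False
2 True
dtype: bool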
Time-Related Data Types
Pandas has powerful capabilities for working with dates, times, and timedeltas.
DateTime Data
# Create a DataFrame with date information
df_dates = pd.DataFrame({
'date_string': ['2023-01-15', '2023-02-20', '2023-03-25']
})
# Convert to datetime
df_dates['date'] = pd.to_datetime(df_dates['date_string'])
# Extract components
df_dates['year'] = df_dates['date'].dt.year
df_dates['month'] = df_dates['date'].dt.month
df_dates['day'] = df_dates['date'].dt.day
df_dates['weekday'] = df_dates['date'].dt.day_name()
print(df_dates)
Output:
date_string date year month day weekday
0 2023-01-15 2023-01-15 2023 1 15 Sunday
1 2023-02-20 2023-02-20 2023 2 20 Monday
2 2023-03-25 2023-03-25 2023 3 25 Saturday
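pd.to_datetime is also forgiving about messy input: passing errors='coerce' turns unparseable strings into NaT instead of raising (a quick sketch):
# Unparseable values become NaT (not-a-time) rather than raising an error
print(pd.to_datetime(['2023-01-15', 'not a date'], errors='coerce'))
Output:
DatetimeIndex(['2023-01-15', 'NaT'], dtype='datetime64[ns]', freq=None)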
Timedelta Data
A Timedelta represents the difference between two datetime values:
# Creating TimeDelta
df_dates['next_week'] = df_dates['date'] + pd.Timedelta(weeks=1)
df_dates['diff'] = df_dates['next_week'] - df_dates['date']
print(df_dates[['date', 'next_week', 'diff']])
Output:
date next_week diff
0 2023-01-15 2023-01-22 7 days
1 2023-02-20 2023-02-27 7 days
2 2023-03-25 2023-04-01 7 days
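Timedelta columns have their own .dt accessor for extracting components:
# Extract components from the timedelta column
print(df_dates['diff'].dt.days)
print(df_dates['diff'].dt.total_seconds())
Output:
0 7
1 7
2 7
Name: diff, dtype: int64
0 604800.0
1 604800.0
2 604800.0
Name: diff, dtype: float64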
Extension Arrays and Custom Data Types
Pandas supports extension arrays that enable custom data types. One common example is IntegerArray (the nullable Int64 dtype), which allows integer columns to contain missing values (NA).
# A regular integer Series is upcast to float64 when it contains NaN
standard_series = pd.Series([1, 2, np.nan, 4])
print(f"Standard series dtype: {standard_series.dtype}")
print(standard_series)
# Integer array preserves integer type while supporting NA
integer_series = pd.Series([1, 2, np.nan, 4], dtype="Int64") # Note the capital "I"
print(f"Integer array dtype: {integer_series.dtype}")
print(integer_series)
Output:
Standard series dtype: float64
0 1.0
1 2.0
2 NaN
3 4.0
dtype: float64
Integer array dtype: Int64
0 1
1 2
2 <NA>
3 4
dtype: Int64
Other useful extension array types include:
# Boolean with NA values
bool_series = pd.Series([True, False, np.nan, True], dtype="boolean")
print(bool_series)
# String data type
string_series = pd.Series(["apple", "banana", None, "date"], dtype="string")
print(string_series)
Output:
0 True
1 False
2 <NA>
3 True
dtype: boolean
0 apple
1 banana
2 <NA>
3 date
dtype: string
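A useful property of these nullable dtypes is that operations propagate pd.NA rather than silently changing the dtype (a quick sketch using integer_series from above):
# Arithmetic keeps the Int64 dtype and propagates <NA>
print(integer_series + 1)
# Comparisons return the nullable boolean dtype
print(integer_series > 2)
Output:
0 2
1 3
2 <NA>
3 5
dtype: Int64
0 False
1 False
2 <NA>
3 True
dtype: boolean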
Real-World Applications
Let's look at some practical examples where complex data types are useful.
Analyzing Survey Data
Survey data often includes multiple-choice questions, ratings, and text responses:
# Sample survey data
survey_data = pd.DataFrame({
'respondent_id': range(1, 6),
'age_group': pd.Categorical(['18-24', '35-44', '25-34', '45-54', '18-24'],
categories=['18-24', '25-34', '35-44', '45-54', '55+']),
'responses': [
{'q1': 5, 'q2': 4, 'q3': 3},
{'q1': 3, 'q2': 3, 'q3': 2},
{'q1': 4, 'q2': 5, 'q3': 4},
{'q1': 2, 'q2': 2, 'q3': 1},
{'q1': 5, 'q2': 4, 'q3': 5}
],
'selected_options': [
['option1', 'option3'],
['option2'],
['option1', 'option2', 'option3'],
['option3'],
['option1', 'option2']
]
})
# Extract nested data for analysis
survey_data['q1_score'] = survey_data['responses'].apply(lambda x: x['q1'])
survey_data['selected_option_count'] = survey_data['selected_options'].apply(len)
# Analyze by age group (observed=False keeps unobserved categories like '55+' in the result)
print(survey_data.groupby('age_group', observed=False)['q1_score'].mean())
Output:
age_group
18-24 5.0
25-34 4.0
35-44 3.0
45-54 2.0
55+ NaN
Name: q1_score, dtype: float64
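A natural next step for the multiple-choice column is explode(), which gives each list element its own row so that standard tools like value_counts() apply (a hedged sketch):
# One row per selected option, then count how often each was chosen
option_counts = survey_data.explode('selected_options')['selected_options'].value_counts()
print(option_counts)  # each of option1/option2/option3 appears 3 times in this sample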
Processing Log Data with Timestamps
Working with server logs often involves timestamp processing:
# Sample log data
log_data = pd.DataFrame({
'timestamp': pd.to_datetime([
'2023-01-15 08:30:45',
'2023-01-15 09:15:32',
'2023-01-15 10:45:10',
'2023-01-16 08:05:23',
'2023-01-16 12:30:15'
]),
'user_id': [101, 102, 101, 103, 102],
'action': ['login', 'view_page', 'logout', 'login', 'purchase']
})
# Set timestamp as index
log_data.set_index('timestamp', inplace=True)
# Resample by hour to count actions (newer Pandas versions prefer freq='h' over 'H')
hourly_actions = log_data.groupby(pd.Grouper(freq='H')).count()
print(hourly_actions['user_id'])
# Calculate session durations for user 101
user_101 = log_data[log_data['user_id'] == 101]
login_time = user_101[user_101['action'] == 'login'].index[0]
logout_time = user_101[user_101['action'] == 'logout'].index[0]
session_duration = logout_time - login_time
print(f"\nUser 101 session duration: {session_duration}")
Output:
timestamp
2023-01-15 08:00:00 1
2023-01-15 09:00:00 1
2023-01-15 10:00:00 1
2023-01-15 11:00:00 0
2023-01-15 12:00:00 0
2023-01-15 13:00:00 0
...
2023-01-16 08:00:00 1
2023-01-16 09:00:00 0
2023-01-16 10:00:00 0
2023-01-16 11:00:00 0
2023-01-16 12:00:00 1
Name: user_id, dtype: int64
User 101 session duration: 2:14:25
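The same hourly counts can be written more concisely with resample, which is equivalent to grouping on pd.Grouper here:
# Equivalent hourly counts via resample (size() counts rows per bin)
print(log_data.resample('H').size())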
Summary
In this tutorial, we've explored Pandas' handling of complex data types:
- Lists and dictionaries within DataFrame cells
- Working with nested JSON data
- Categorical data for memory efficiency and specialized operations
- Datetime, Timedelta, and other time-related operations
- Extension arrays for specialized data types like Int64 with NA support
- Real-world applications demonstrating how to leverage these data types
Understanding complex data types in Pandas allows you to work with more sophisticated real-world datasets and perform more advanced analyses. These capabilities become particularly valuable when dealing with data from web APIs, survey responses, log files, and other complex sources.
Exercises
- Exploratory Exercise: Create a DataFrame containing product information where each product has multiple categories stored as a list. Extract and analyze the most common categories.
- JSON Processing: Download a JSON dataset from a public API (like the GitHub API or a weather API) and practice normalizing and analyzing the nested data.
- Time Series Challenge: Create a DataFrame with datetime data spanning multiple years. Calculate various time-based metrics like week-over-week growth, monthly averages, and business days between events.
- Survey Analysis: Create a more complex survey dataset with multiple question types (single choice stored as categories, multiple choice stored as lists, and ratings stored in a nested dictionary). Write functions to extract and visualize insights from this data.
- Memory Optimization: Create a large DataFrame (with at least 100,000 rows) containing text columns with repetitive values. Experiment with converting these columns to categorical and measure the memory savings.
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)