Pandas Data Structure Conversion
Data structure conversion is a fundamental skill when working with Pandas. In real-world data analysis, you'll frequently need to transform data between different formats to accomplish various tasks. This guide will walk you through the common conversion operations in Pandas.
Introduction to Data Structure Conversion
Pandas offers powerful and flexible data structures like Series and DataFrame, but working with data often requires converting between different formats. You might need to:
- Convert a Python list or dictionary to a Pandas Series or DataFrame
- Transform a DataFrame to a NumPy array
- Extract data from a DataFrame as a Python dictionary
- Change a Series to a list or array
Understanding these conversion techniques will make your data manipulation workflow smoother and more efficient.
Converting to Pandas Data Structures
Lists/Arrays to Series
Let's start with converting Python lists and NumPy arrays to Pandas Series:
import pandas as pd
import numpy as np
# From a list to Series
my_list = [10, 20, 30, 40, 50]
series_from_list = pd.Series(my_list)
print("Series from list:")
print(series_from_list)
print()
# From a NumPy array to Series
my_array = np.array([10, 20, 30, 40, 50])
series_from_array = pd.Series(my_array)
print("Series from NumPy array:")
print(series_from_array)
Output:
Series from list:
0    10
1    20
2    30
3    40
4    50
dtype: int64
Series from NumPy array:
0    10
1    20
2    30
3    40
4    50
dtype: int64
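By default, Pandas assigns a 0-based integer index. Both constructors also accept an index argument so you can attach your own labels at creation time. A quick sketch (the labels here are chosen purely for illustration):

```python
import pandas as pd

# Supplying custom labels at construction time
scores = pd.Series([10, 20, 30], index=['x', 'y', 'z'])

print(scores.loc['y'])   # label-based access -> 20
print(scores.iloc[0])    # position-based access -> 10
```

Label-based and position-based access coexist, which is one of the main reasons to prefer a Series over a plain list.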
Dictionary to Series
Dictionaries are particularly useful when you want to define your own index:
# From dictionary to Series
my_dict = {'a': 100, 'b': 200, 'c': 300, 'd': 400, 'e': 500}
series_from_dict = pd.Series(my_dict)
print("Series from dictionary:")
print(series_from_dict)
Output:
Series from dictionary:
a    100
b    200
c    300
d    400
e    500
dtype: int64
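If you also pass an index argument, Pandas reorders the values to match those labels, and any label missing from the dictionary comes through as NaN (which forces the dtype to float64). A small sketch with a deliberately absent key 'z':

```python
import pandas as pd

my_dict = {'a': 100, 'b': 200, 'c': 300}

# 'z' is not a key in my_dict, so its value becomes NaN
reordered = pd.Series(my_dict, index=['c', 'a', 'z'])
print(reordered)
```

This behavior is handy for aligning data to a known set of labels, but watch for the silent NaNs it can introduce.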
Creating a DataFrame
There are multiple ways to create a DataFrame:
# From a dictionary of lists
data_dict = {
'Name': ['John', 'Anna', 'Peter', 'Linda'],
'Age': [28, 34, 29, 32],
'City': ['New York', 'Paris', 'Berlin', 'London']
}
df_from_dict = pd.DataFrame(data_dict)
print("DataFrame from dictionary of lists:")
print(df_from_dict)
print()
# From a list of dictionaries
data_list = [
{'Name': 'John', 'Age': 28, 'City': 'New York'},
{'Name': 'Anna', 'Age': 34, 'City': 'Paris'},
{'Name': 'Peter', 'Age': 29, 'City': 'Berlin'},
{'Name': 'Linda', 'Age': 32, 'City': 'London'}
]
df_from_list_of_dicts = pd.DataFrame(data_list)
print("DataFrame from list of dictionaries:")
print(df_from_list_of_dicts)
Output:
DataFrame from dictionary of lists:
    Name  Age      City
0   John   28  New York
1   Anna   34     Paris
2  Peter   29    Berlin
3  Linda   32    London
DataFrame from list of dictionaries:
    Name  Age      City
0   John   28  New York
1   Anna   34     Paris
2  Peter   29    Berlin
3  Linda   32    London
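The DataFrame constructor also takes columns and index arguments, which let you select or reorder columns and replace the default integer row labels. A brief sketch (the row labels 'r1'/'r2' are illustrative):

```python
import pandas as pd

data_dict = {
    'Name': ['John', 'Anna'],
    'Age': [28, 34],
    'City': ['New York', 'Paris']
}

# `columns` picks and orders which keys to use; `index` sets the row labels
df = pd.DataFrame(data_dict, columns=['City', 'Name'], index=['r1', 'r2'])
print(df)
```

Note that any dictionary key not listed in columns (here, 'Age') is simply dropped.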
NumPy Array to DataFrame
You can convert a NumPy array to a DataFrame and specify column names:
# From NumPy array to DataFrame
array_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df_from_array = pd.DataFrame(array_2d, columns=['A', 'B', 'C'])
print("DataFrame from NumPy array:")
print(df_from_array)
Output:
DataFrame from NumPy array:
   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9
Converting from Pandas Data Structures
Series to List, Array, and Dictionary
Let's see how to convert a Series back to Python native data types:
# Create a sample Series
series = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
print("Original Series:")
print(series)
print()
# Convert Series to list
series_to_list = series.tolist() # or series.values.tolist()
print("Series to list:")
print(series_to_list)
print()
# Convert Series to NumPy array
series_to_array = series.to_numpy() # or series.values
print("Series to NumPy array:")
print(series_to_array)
print()
# Convert Series to dictionary
series_to_dict = series.to_dict()
print("Series to dictionary:")
print(series_to_dict)
Output:
Original Series:
a    10
b    20
c    30
d    40
e    50
dtype: int64
Series to list:
[10, 20, 30, 40, 50]
Series to NumPy array:
[10 20 30 40 50]
Series to dictionary:
{'a': 10, 'b': 20, 'c': 30, 'd': 40, 'e': 50}
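One thing to keep in mind: tolist() and to_numpy() drop the index. If you need the labels too, grab them separately or iterate over label/value pairs:

```python
import pandas as pd

series = pd.Series([10, 20], index=['a', 'b'])

# The index survives as its own object and can be converted independently
labels = series.index.tolist()   # ['a', 'b']
pairs = list(series.items())     # [('a', 10), ('b', 20)]
print(labels)
print(pairs)
```

to_dict() is usually the simplest choice when both labels and values matter, but items() is useful when you need ordered pairs.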
DataFrame to Various Formats
Now let's explore how to convert a DataFrame to other formats:
# Create a sample DataFrame
df = pd.DataFrame({
'Name': ['John', 'Anna', 'Peter'],
'Age': [28, 34, 29],
'City': ['New York', 'Paris', 'Berlin']
})
print("Original DataFrame:")
print(df)
print()
# Convert DataFrame to NumPy array
df_to_array = df.to_numpy()
print("DataFrame to NumPy array:")
print(df_to_array)
print()
# Convert DataFrame to dictionary
df_to_dict = df.to_dict()
print("DataFrame to dictionary (column-oriented):")
print(df_to_dict)
print()
# Convert DataFrame to dictionary (record-oriented)
df_to_dict_records = df.to_dict(orient='records')
print("DataFrame to list of dictionaries (records):")
print(df_to_dict_records)
print()
# Convert DataFrame to list of lists
df_to_list = df.to_numpy().tolist()  # df.values.tolist() also works, but to_numpy() is the recommended API
print("DataFrame to list of lists:")
print(df_to_list)
Output:
Original DataFrame:
    Name  Age      City
0   John   28  New York
1   Anna   34     Paris
2  Peter   29    Berlin
DataFrame to NumPy array:
[['John' 28 'New York']
 ['Anna' 34 'Paris']
 ['Peter' 29 'Berlin']]
DataFrame to dictionary (column-oriented):
{'Name': {0: 'John', 1: 'Anna', 2: 'Peter'}, 'Age': {0: 28, 1: 34, 2: 29}, 'City': {0: 'New York', 1: 'Paris', 2: 'Berlin'}}
DataFrame to list of dictionaries (records):
[{'Name': 'John', 'Age': 28, 'City': 'New York'}, {'Name': 'Anna', 'Age': 34, 'City': 'Paris'}, {'Name': 'Peter', 'Age': 29, 'City': 'Berlin'}]
DataFrame to list of lists:
[['John', 28, 'New York'], ['Anna', 34, 'Paris'], ['Peter', 29, 'Berlin']]
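to_dict supports several other orientations besides the column-oriented default and 'records'. Two that come up often are 'list' (one plain list per column) and 'index' (one inner dictionary per row, keyed by the row label):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [28, 34]})

# One list per column
print(df.to_dict(orient='list'))
# {'Name': ['John', 'Anna'], 'Age': [28, 34]}

# One dict per row, keyed by the row label
print(df.to_dict(orient='index'))
# {0: {'Name': 'John', 'Age': 28}, 1: {'Name': 'Anna', 'Age': 34}}
```

Pick the orientation that matches the shape the consuming code expects; 'records' is the most common choice for JSON-style APIs.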
Converting Between Series and DataFrame
You can convert between Series and DataFrame objects:
# Series to DataFrame
series = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
df_from_series = series.to_frame(name='Values')
print("Series to DataFrame:")
print(df_from_series)
print()
# DataFrame column to Series
df = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6],
'C': [7, 8, 9]
})
series_from_df_col = df['B'] # Extract column as Series
print("DataFrame column to Series:")
print(series_from_df_col)
Output:
Series to DataFrame:
   Values
a      10
b      20
c      30
d      40
DataFrame column to Series:
0    4
1    5
2    6
Name: B, dtype: int64
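Note that df['B'] with single brackets returns a Series, while df[['B']] with double brackets keeps a one-column DataFrame. The squeeze() method collapses such a single-column frame back into a Series:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

one_col_df = df[['B']]                 # still a DataFrame
back_to_series = one_col_df.squeeze()  # collapsed to a Series

print(type(one_col_df).__name__)       # DataFrame
print(type(back_to_series).__name__)   # Series
```

This distinction matters because many operations behave differently on a Series than on a one-column DataFrame.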
Advanced Conversions
Data Type Conversion
Changing the data types of columns is a common operation:
# Create a DataFrame with mixed types
df = pd.DataFrame({
'A': ['1', '2', '3'],
'B': [4, 5, 6],
'C': ['7.1', '8.2', '9.3']
})
print("Original DataFrame:")
print(df)
print(df.dtypes)
print()
# Convert column A from string to integer
df['A'] = df['A'].astype(int)
# Convert column C from string to float
df['C'] = df['C'].astype(float)
print("DataFrame with converted types:")
print(df)
print(df.dtypes)
Output:
Original DataFrame:
   A  B    C
0  1  4  7.1
1  2  5  8.2
2  3  6  9.3
A     object
B      int64
C     object
dtype: object
DataFrame with converted types:
   A  B    C
0  1  4  7.1
1  2  5  8.2
2  3  6  9.3
A      int64
B      int64
C    float64
dtype: object
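astype raises an error if any value can't be converted. For messy real-world columns, pd.to_numeric with errors='coerce' is a more forgiving alternative: unparseable entries become NaN instead of aborting the conversion. A small sketch with a deliberately bad value:

```python
import pandas as pd

raw = pd.Series(['1', '2', 'oops', '4'])

# astype(int) would raise a ValueError on 'oops'; coercing yields NaN instead
parsed = pd.to_numeric(raw, errors='coerce')
print(parsed)
print(parsed.dtype)  # float64, because NaN forces a float column
```

After coercing, it's worth checking parsed.isna().sum() to see how many values failed to parse.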
Converting to/from Other File Formats
Pandas makes it easy to convert DataFrames to various file formats:
import io
# Create a DataFrame
df = pd.DataFrame({
'Name': ['John', 'Anna', 'Peter'],
'Age': [28, 34, 29],
'City': ['New York', 'Paris', 'Berlin']
})
# To CSV (in memory for this example)
csv_buffer = io.StringIO()
df.to_csv(csv_buffer, index=False)
csv_data = csv_buffer.getvalue()
print("DataFrame to CSV:")
print(csv_data)
print()
# From CSV (reading from the string we just created)
csv_buffer.seek(0) # Reset buffer position
df_from_csv = pd.read_csv(csv_buffer)
print("DataFrame from CSV:")
print(df_from_csv)
print()
# To JSON
json_data = df.to_json(orient='records')
print("DataFrame to JSON:")
print(json_data)
print()
# From JSON
df_from_json = pd.read_json(io.StringIO(json_data), orient='records')  # recent pandas versions expect a file-like object, not a raw string
print("DataFrame from JSON:")
print(df_from_json)
Output:
DataFrame to CSV:
Name,Age,City
John,28,New York
Anna,34,Paris
Peter,29,Berlin
DataFrame from CSV:
    Name  Age      City
0   John   28  New York
1   Anna   34     Paris
2  Peter   29    Berlin
DataFrame to JSON:
[{"Name":"John","Age":28,"City":"New York"},{"Name":"Anna","Age":34,"City":"Paris"},{"Name":"Peter","Age":29,"City":"Berlin"}]
DataFrame from JSON:
    Name  Age      City
0   John   28  New York
1   Anna   34     Paris
2  Peter   29    Berlin
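The 'records' orientation discards the index. If you need a lossless round trip, orient='split' stores index, columns, and data as separate JSON fields; a quick sketch:

```python
import io

import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [28, 34]})

# 'split' serializes index, columns, and data separately
json_split = df.to_json(orient='split')
df_back = pd.read_json(io.StringIO(json_split), orient='split')

print(df_back.equals(df))  # True
```

The StringIO wrapper is needed because recent pandas versions expect a file-like object rather than a raw JSON string.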
Real-World Example: Working with Multiple Data Sources
Let's look at a practical example where we need to combine data from different sources:
# Data from a CSV file (simulated)
csv_data = """id,product,quantity
1,Widget A,10
2,Widget B,5
3,Widget C,15"""
# Data from an API (simulated as a list of dictionaries)
api_data = [
{"product_id": 1, "price": 9.99, "category": "Tools"},
{"product_id": 2, "price": 15.49, "category": "Household"},
{"product_id": 3, "price": 5.99, "category": "Tools"}
]
# Load CSV data
csv_io = io.StringIO(csv_data)
df_inventory = pd.read_csv(csv_io)
print("Inventory Data:")
print(df_inventory)
print()
# Convert API data to DataFrame
df_pricing = pd.DataFrame(api_data)
print("Pricing Data:")
print(df_pricing)
print()
# Merge the two data sources
df_pricing = df_pricing.rename(columns={"product_id": "id"}) # Rename column for merging
df_combined = pd.merge(df_inventory, df_pricing, on="id")
# Calculate total value
df_combined['total_value'] = df_combined['quantity'] * df_combined['price']
print("Combined Data:")
print(df_combined)
print()
# Export to different formats
print("Summary as dictionary:")
summary = {
'total_items': df_combined['quantity'].sum(),
'total_value': df_combined['total_value'].sum(),
'by_category': df_combined.groupby('category')['total_value'].sum().to_dict()
}
print(summary)
Output:
Inventory Data:
   id   product  quantity
0   1  Widget A        10
1   2  Widget B         5
2   3  Widget C        15
Pricing Data:
   product_id  price   category
0           1   9.99      Tools
1           2  15.49  Household
2           3   5.99      Tools
Combined Data:
   id   product  quantity  price   category  total_value
0   1  Widget A        10   9.99      Tools        99.90
1   2  Widget B         5  15.49  Household        77.45
2   3  Widget C        15   5.99      Tools        89.85
Summary as dictionary:
{'total_items': 30, 'total_value': 267.2, 'by_category': {'Tools': 189.75, 'Household': 77.45}}
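One caveat about the merge step: pd.merge defaults to an inner join, which silently drops any row whose key has no match on the other side. If the inventory should stay complete regardless, pass how='left'. A sketch with made-up ids, where id 4 has no price:

```python
import pandas as pd

inventory = pd.DataFrame({'id': [1, 2, 4], 'quantity': [10, 5, 7]})
pricing = pd.DataFrame({'id': [1, 2, 3], 'price': [9.99, 15.49, 5.99]})

# how='left' keeps every inventory row; unmatched ids get NaN for price
combined = pd.merge(inventory, pricing, on='id', how='left')
print(combined)
```

With the default inner join, the row for id 4 would have disappeared from the result entirely.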
Summary
In this guide, we've covered the essential techniques for converting between different data structures when working with Pandas:
- Creating Pandas Series and DataFrames from Python lists, dictionaries, and NumPy arrays
- Converting Pandas objects back to Python native types and NumPy arrays
- Transforming between Series and DataFrames
- Converting data types within Pandas objects
- Working with file formats like CSV and JSON
- Combining data from multiple sources with different formats
These conversion skills are fundamental for any data analysis workflow, allowing you to efficiently work with data in whatever format is most appropriate for a given task.
Additional Resources and Exercises
Practice Exercises
- Basic Conversion: Create a Series from a list of numbers, then convert it back to a list and a NumPy array. Verify they're identical.
- Working with Dictionaries: Create a nested dictionary representing student grades for different subjects. Convert it to a DataFrame and calculate each student's average grade.
- Data Type Challenge: Read a CSV file containing numeric data stored as strings. Convert all appropriate columns to numeric types and perform basic statistical analysis.
- Integration Exercise: Download JSON data from a public API, convert it to a DataFrame, extract specific information, and export the results as both CSV and Excel files.
- Data Transformation: Create a DataFrame from a list of dictionaries representing website traffic data. Group by date, calculate daily averages, and convert the results to a different format for visualization.
By practicing these conversions, you'll become more fluent in manipulating data with Pandas, which is an essential skill for any data analysis project.