# Pandas JSON Import

## Introduction
JSON (JavaScript Object Notation) is one of the most common data interchange formats used in modern web applications and APIs. When working with data analysis in Python, you'll frequently need to import JSON data into pandas DataFrames for manipulation and analysis.
In this tutorial, you'll learn how to:
- Import JSON data from files and strings
- Handle different JSON structures (nested objects and arrays)
- Normalize and flatten complex JSON data
- Work with real-world JSON from APIs
## Prerequisites

Before we begin, make sure you have the following libraries installed:

```python
from io import StringIO  # wraps raw JSON strings for read_json (required in pandas >= 2.1)

import pandas as pd
import json
import requests  # For API examples
```
## Basic JSON Import

Let's start with a simple example of importing JSON data into a pandas DataFrame.

### From a JSON String

The simplest way to import JSON data is from a string using `pd.read_json()`:
```python
# Simple JSON string
json_string = '''
{
    "name": ["John", "Anna", "Peter", "Linda"],
    "age": [28, 24, 35, 32],
    "city": ["New York", "Paris", "Berlin", "London"]
}
'''

# Parse the JSON string into a DataFrame
# (pandas >= 2.1 expects a path or file-like object, so wrap the string in StringIO)
df = pd.read_json(StringIO(json_string))
print(df)
```
Output:

```
    name  age      city
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin
3  Linda   32    London
```
### From a JSON File

More commonly, you'll import JSON from a file:

```python
# Assuming you have a 'data.json' file
df = pd.read_json('data.json')

# If your file uses a specific encoding
df = pd.read_json('data.json', encoding='utf-8')
```
## Working with Different JSON Structures

### Records-Oriented JSON

JSON can be structured in different ways. The "records" format represents each record as its own object:

```python
# Records-oriented JSON
records_json = '''
[
    {"name": "John", "age": 28, "city": "New York"},
    {"name": "Anna", "age": 24, "city": "Paris"},
    {"name": "Peter", "age": 35, "city": "Berlin"},
    {"name": "Linda", "age": 32, "city": "London"}
]
'''

df_records = pd.read_json(StringIO(records_json))
print(df_records)
```
Output:

```
    name  age      city
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin
3  Linda   32    London
```
### Split-Oriented JSON

The "split" format separates column names and data:

```python
# Split-oriented JSON
split_json = '''
{
    "columns": ["name", "age", "city"],
    "data": [
        ["John", 28, "New York"],
        ["Anna", 24, "Paris"],
        ["Peter", 35, "Berlin"],
        ["Linda", 32, "London"]
    ]
}
'''

df_split = pd.read_json(StringIO(split_json), orient='split')
print(df_split)
```
Output:

```
    name  age      city
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin
3  Linda   32    London
```
### Specifying Orientation

The `orient` parameter tells pandas how your JSON is structured:

```python
# Different orientations
df = pd.read_json(StringIO(json_string), orient='columns')   # Default
df = pd.read_json(StringIO(records_json), orient='records')
df = pd.read_json(StringIO(split_json), orient='split')
```
Common orientations include:

- `'split'`: dict like `{index -> [index], columns -> [columns], data -> [values]}`
- `'records'`: list like `[{column -> value}, ..., {column -> value}]`
- `'index'`: dict like `{index -> {column -> value}}`
- `'columns'`: dict like `{column -> {index -> value}}` (default)
- `'values'`: just the values array
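For example, the `'index'` orientation (not demonstrated above) treats the outer keys as row labels. A small sketch, using hypothetical row labels `row1` and `row2`:

```python
from io import StringIO

import pandas as pd

# Index-oriented JSON: outer keys become the row index
index_json = '''
{
    "row1": {"name": "John", "age": 28},
    "row2": {"name": "Anna", "age": 24}
}
'''

df_index = pd.read_json(StringIO(index_json), orient='index')
print(df_index)
```

The resulting DataFrame is indexed by `row1` and `row2` rather than by the default integer positions.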
## Handling Nested JSON

Real-world JSON often contains nested structures that need special handling.

### Using json_normalize()

The `json_normalize()` function is powerful for flattening nested JSON:
```python
# Nested JSON
nested_json = '''
[
    {
        "id": 1,
        "name": "John",
        "contact": {
            "email": "[email protected]",
            "phone": "555-1234"
        },
        "hobbies": ["reading", "hiking"]
    },
    {
        "id": 2,
        "name": "Anna",
        "contact": {
            "email": "[email protected]",
            "phone": "555-5678"
        },
        "hobbies": ["painting", "yoga", "travel"]
    }
]
'''

# Parse the string first
nested_data = json.loads(nested_json)

# Normalize the nested data
df_nested = pd.json_normalize(nested_data)
print(df_nested)
```
Output:

```
   id  name      contact.email contact.phone                   hobbies
0   1  John  [email protected]      555-1234         [reading, hiking]
1   2  Anna  [email protected]      555-5678  [painting, yoga, travel]
```
### Customizing the Path Separator

You can customize how nested fields are named:

```python
df_nested = pd.json_normalize(nested_data, sep='_')
print(df_nested)
```

Output:

```
   id  name      contact_email contact_phone                   hobbies
0   1  John  [email protected]      555-1234         [reading, hiking]
1   2  Anna  [email protected]      555-5678  [painting, yoga, travel]
```
### Accessing Specific Nested Paths

For complex JSON, `record_path` tells `json_normalize()` which nested list to expand into rows, while `meta` carries along fields from the enclosing object. Note that `record_path` must point to a list (like `hobbies` here), not a plain dict like `contact`:

```python
df_hobbies = pd.json_normalize(
    nested_data,
    record_path=['hobbies'],
    meta=['id', 'name']
)
print(df_hobbies)
```

Each hobby becomes its own row, with the person's `id` and `name` repeated alongside it.
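`json_normalize()` also accepts a `max_level` parameter when you only want to flatten the first level (or two) of nesting. A small sketch with made-up data:

```python
import pandas as pd

# One record with two levels of nesting under "contact"
data = [{
    "id": 1,
    "contact": {
        "email": "john@example.com",
        "social": {"twitter": "@john"}
    }
}]

# max_level=1 flattens "contact" but leaves "contact.social" as a raw dict
df_shallow = pd.json_normalize(data, max_level=1)
print(df_shallow.columns.tolist())
```

This is useful when a deeply nested blob should stay intact as a single column.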
## Real-world Example: Importing JSON from an API

Let's see a practical example of importing JSON data from a public API:

```python
# Get data from a public API (JSONPlaceholder)
response = requests.get('https://jsonplaceholder.typicode.com/users')

# Check if the request was successful
if response.status_code == 200:
    # Parse the JSON response
    users_data = response.json()

    # Convert to a DataFrame
    users_df = pd.json_normalize(users_data)

    # Select just a few columns for display
    users_sample = users_df[['id', 'name', 'email', 'address.city', 'company.name']]
    print(users_sample.head())
else:
    print(f"Error fetching data: {response.status_code}")
```
Output:

```
   id              name              email   address.city        company.name
0   1     Leanne Graham  [email protected]    Gwenborough     Romaguera-Crona
1   2      Ervin Howell  [email protected]    Wisokyburgh        Deckow-Crist
2   3  Clementine Bauch  [email protected]  McKenziehaven  Romaguera-Jacobson
3   4  Patricia Lebsack  [email protected]    South Elvis       Robel-Corkery
4   5  Chelsey Dietrich  [email protected]     Roscoeview         Keebler LLC
```
## Working with JSON Lines (JSONL)

The JSON Lines format has one JSON object per line. Pandas can read this format too:

```python
# JSON Lines example (the backslash avoids a leading blank line)
jsonl_data = '''\
{"name": "John", "age": 28, "city": "New York"}
{"name": "Anna", "age": 24, "city": "Paris"}
{"name": "Peter", "age": 35, "city": "Berlin"}
{"name": "Linda", "age": 32, "city": "London"}
'''

# Save to a file for demonstration
with open('data.jsonl', 'w') as f:
    f.write(jsonl_data)

# Read the JSONL file
df_jsonl = pd.read_json('data.jsonl', lines=True)
print(df_jsonl)
```
Output:

```
    name  age      city
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin
3  Linda   32    London
```
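For files too large to load at once, `read_json()` also accepts a `chunksize` (together with `lines=True`), which yields the data as a sequence of smaller DataFrames. A sketch, writing a hypothetical `data_chunks.jsonl` just for the demo:

```python
import pandas as pd

# Write a small JSON Lines file for the demonstration
with open('data_chunks.jsonl', 'w') as f:
    f.write('{"name": "John", "age": 28}\n')
    f.write('{"name": "Anna", "age": 24}\n')
    f.write('{"name": "Peter", "age": 35}\n')

# chunksize returns an iterator of DataFrames instead of one big frame
total_rows = 0
for chunk in pd.read_json('data_chunks.jsonl', lines=True, chunksize=2):
    total_rows += len(chunk)

print(total_rows)
```

Processing chunk by chunk keeps memory usage bounded no matter how large the file is.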
## Converting a DataFrame Back to JSON

You can also convert your DataFrame back to JSON using `to_json()`:
```python
# Create a simple DataFrame
df = pd.DataFrame({
    'name': ['John', 'Anna', 'Peter'],
    'age': [28, 24, 35],
    'city': ['New York', 'Paris', 'Berlin']
})

# Convert to JSON strings with different orientations
json_string = df.to_json()                    # Default (columns orientation)
json_records = df.to_json(orient='records')   # Records orientation
json_split = df.to_json(orient='split')       # Split orientation

print("Default JSON:")
print(json_string)
print("\nRecords format:")
print(json_records)

# Write to a file
df.to_json('output_data.json', orient='records')
```
Output:

```
Default JSON:
{"name":{"0":"John","1":"Anna","2":"Peter"},"age":{"0":28,"1":24,"2":35},"city":{"0":"New York","1":"Paris","2":"Berlin"}}

Records format:
[{"name":"John","age":28,"city":"New York"},{"name":"Anna","age":24,"city":"Paris"},{"name":"Peter","age":35,"city":"Berlin"}]
```
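A handy property of the `'split'` orientation is that it round-trips cleanly: serializing and re-reading reproduces the original DataFrame, including column order. A quick sketch:

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'Anna'],
    'age': [28, 24]
})

# Serialize with orient='split', then read it straight back
round_trip = pd.read_json(StringIO(df.to_json(orient='split')), orient='split')

print(round_trip.equals(df))
```

This makes `'split'` a good choice when you need to persist a DataFrame and restore it faithfully later.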
## Best Practices and Common Issues

### Handling Missing Data

JSON may contain `null` values or missing fields:

```python
# JSON with nulls and missing values
json_with_nulls = '''
[
    {"name": "John", "age": 28, "city": "New York"},
    {"name": "Anna", "age": null, "city": "Paris"},
    {"name": "Peter", "city": "Berlin"}
]
'''

df_nulls = pd.read_json(StringIO(json_with_nulls))
print(df_nulls)

# Check for missing values
print("\nMissing values:")
print(df_nulls.isnull().sum())
```
Output:

```
    name   age      city
0   John  28.0  New York
1   Anna   NaN     Paris
2  Peter   NaN    Berlin

Missing values:
name    0
age     2
city    0
dtype: int64
```
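Once loaded, the missing entries are ordinary `NaN` values, so the usual pandas tools apply. For example, filling them with a default or dropping incomplete rows (a minimal sketch):

```python
from io import StringIO

import pandas as pd

json_with_nulls = '''
[
    {"name": "John", "age": 28},
    {"name": "Anna", "age": null}
]
'''

df = pd.read_json(StringIO(json_with_nulls))

# Fill missing ages with a default value
df_filled = df.fillna({'age': 0})

# Or drop rows that have any missing values
df_dropped = df.dropna()

print(df_filled)
print(df_dropped)
```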
### Type Inference

By default, pandas infers data types when reading JSON, which means numeric-looking strings are converted to numbers:

```python
# JSON where every value is a string
json_types = '''
[
    {"id": "1", "value": "100"},
    {"id": "2", "value": "200"},
    {"id": "3", "value": "300"}
]
'''

# Default reading: numeric strings are converted to integers
df_types1 = pd.read_json(StringIO(json_types))
print(df_types1.dtypes)

# Pass dtype=False to disable inference and keep the original strings
df_types2 = pd.read_json(StringIO(json_types), dtype=False)
print(df_types2.dtypes)
```

Output:

```
id       int64
value    int64
dtype: object
id       object
value    object
dtype: object
```

Watch out for this with values like zip codes or IDs: type inference will silently turn "01234" into 1234 unless you disable it.
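If you want strings on import but numbers for one specific column, a sketch combining the two approaches:

```python
from io import StringIO

import pandas as pd

# Hypothetical data where "id" has leading zeros worth preserving
json_types = '''
[
    {"id": "007", "value": "100"},
    {"id": "008", "value": "200"}
]
'''

# Keep everything as strings on import, then convert only "value"
df = pd.read_json(StringIO(json_types), dtype=False)
df['value'] = df['value'].astype(int)

print(df.dtypes)
```

`read_json` also accepts a dict for `dtype` (e.g. `dtype={'value': 'int64'}`) if you prefer to declare the types up front.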
## Summary

In this tutorial, you learned how to:

- Import JSON data from strings and files using `pd.read_json()`
- Work with different JSON structures using the `orient` parameter
- Handle nested JSON data with `pd.json_normalize()`
- Extract data from real-world APIs and convert it to DataFrames
- Deal with the JSON Lines format
- Convert DataFrames back to JSON
- Handle common issues like missing data and data types
JSON is a versatile format, and pandas provides powerful tools for working with it. As you work with more data sources, these skills will help you efficiently import and prepare data for analysis.
## Exercises
- Find a public API that returns JSON data and create a DataFrame from it.
- Practice normalizing a complex nested JSON structure with multiple levels.
- Convert a DataFrame with multilevel columns to JSON and back again.
- Create a function that reads a JSON file and returns only specific fields as a DataFrame.
- Import a large JSON dataset and practice optimizing the import for memory usage.