
Pandas JSON Import

Introduction

JSON (JavaScript Object Notation) is one of the most common data interchange formats used in modern web applications and APIs. When working with data analysis in Python, you'll frequently need to import JSON data into pandas DataFrames for manipulation and analysis.

In this tutorial, you'll learn how to:

  • Import JSON data from files and strings
  • Handle different JSON structures (nested objects and arrays)
  • Normalize and flatten complex JSON data
  • Work with real-world JSON from APIs

Prerequisites

Before we begin, make sure you have the following libraries installed:

python
import pandas as pd
import json
import requests # For API examples
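
If you don't have them yet, pandas and requests can be installed with pip install pandas requests; the json module is part of the Python standard library.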

Basic JSON Import

Let's start with a simple example of importing JSON data into a pandas DataFrame.

From a JSON String

The simplest way to import JSON data is from a string using pd.read_json():

python
# Simple JSON string
json_string = '''
{
"name": ["John", "Anna", "Peter", "Linda"],
"age": [28, 24, 35, 32],
"city": ["New York", "Paris", "Berlin", "London"]
}
'''

# Parse JSON string into a DataFrame
df = pd.read_json(json_string)
print(df)

Output:

    name  age      city
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin
3  Linda   32    London
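
Note: newer pandas versions (2.1 and later) emit a FutureWarning when a literal JSON string is passed to pd.read_json() and expect it to be wrapped in a file-like object instead. A minimal sketch of the wrapped form, reusing json_string from above:

python
from io import StringIO

# Wrapping the literal string avoids the FutureWarning on pandas 2.1+
df = pd.read_json(StringIO(json_string))

The later string-based examples in this tutorial can be wrapped the same way.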

From a JSON File

More commonly, you'll import JSON from a file:

python
# Assuming you have a 'data.json' file
df = pd.read_json('data.json')

# You can pass the encoding explicitly if needed (UTF-8 is the default)
df = pd.read_json('data.json', encoding='utf-8')
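
pd.read_json() also accepts a URL, so remote JSON can be loaded directly. Here is a quick sketch using the JSONPlaceholder endpoint that appears later in this tutorial; note that nested objects stay as dictionary columns until you flatten them:

python
# read_json accepts URLs as well as local paths
df_remote = pd.read_json('https://jsonplaceholder.typicode.com/users')

# Nested fields such as 'address' and 'company' remain dictionaries here;
# pd.json_normalize() (covered below) flattens them into separate columns
print(df_remote[['id', 'name', 'email']].head())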

Working with Different JSON Structures

Records Oriented JSON

JSON can be structured in different ways. In the "records" format, each row is a separate object in a list:

python
# Records-oriented JSON
records_json = '''
[
{"name": "John", "age": 28, "city": "New York"},
{"name": "Anna", "age": 24, "city": "Paris"},
{"name": "Peter", "age": 35, "city": "Berlin"},
{"name": "Linda", "age": 32, "city": "London"}
]
'''

df_records = pd.read_json(records_json)
print(df_records)

Output:

    name  age      city
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin
3  Linda   32    London

Split Oriented JSON

The "split" format separates column names and data:

python
# Split-oriented JSON
split_json = '''
{
"columns": ["name", "age", "city"],
"data": [
["John", 28, "New York"],
["Anna", 24, "Paris"],
["Peter", 35, "Berlin"],
["Linda", 32, "London"]
]
}
'''

df_split = pd.read_json(split_json, orient='split')
print(df_split)

Output:

    name  age      city
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin
3  Linda   32    London

Specifying Orientation

The orient parameter helps pandas understand your JSON structure:

python
# Different orientations
df = pd.read_json(json_string, orient='columns') # Default
df = pd.read_json(records_json, orient='records')
df = pd.read_json(split_json, orient='split')

Common orientations include:

  • 'split': dict like {index -> [index], columns -> [columns], data -> [values]}
  • 'records': list like [{column -> value}, ... , {column -> value}]
  • 'index': dict like {index -> {column -> value}}
  • 'columns': dict like {column -> {index -> value}} (default)
  • 'values': just the values array
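
For example, the "index" orientation maps row labels to records (values reused from the examples above):

python
# Index-oriented JSON: {row label -> {column -> value}}
index_json = '''
{
    "0": {"name": "John", "age": 28, "city": "New York"},
    "1": {"name": "Anna", "age": 24, "city": "Paris"}
}
'''

df_index = pd.read_json(index_json, orient='index')
print(df_index)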

Handling Nested JSON

Real-world JSON often contains nested structures that need special handling.

Using json_normalize()

The json_normalize() function is powerful for flattening nested JSON:

python
# Nested JSON
nested_json = '''
[
    {
        "id": 1,
        "name": "John",
        "contact": {
            "email": "[email protected]",
            "phone": "555-1234"
        },
        "hobbies": ["reading", "hiking"]
    },
    {
        "id": 2,
        "name": "Anna",
        "contact": {
            "email": "[email protected]",
            "phone": "555-5678"
        },
        "hobbies": ["painting", "yoga", "travel"]
    }
]
'''

# json_normalize() expects parsed Python objects, so decode the string first
nested_data = json.loads(nested_json)

# Normalize the nested data
df_nested = pd.json_normalize(nested_data)
print(df_nested)

Output:

   id  name         contact.email  contact.phone                   hobbies
0   1  John  [email protected]       555-1234         [reading, hiking]
1   2  Anna    [email protected]       555-5678  [painting, yoga, travel]

Customizing Path Separator

You can customize how nested fields are named:

python
df_nested = pd.json_normalize(nested_data, sep='_')
print(df_nested)

Output:

   id  name         contact_email  contact_phone                   hobbies
0   1  John  [email protected]       555-1234         [reading, hiking]
1   2  Anna    [email protected]       555-5678  [painting, yoga, travel]
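
If you only want to flatten the top level of a deeply nested structure, json_normalize() also accepts a max_level argument. A minimal sketch using the same nested_data:

python
# Only flatten down to the given depth; deeper structures stay as dict values
df_shallow = pd.json_normalize(nested_data, max_level=0)
print(df_shallow.columns.tolist())
# With max_level=0 the 'contact' dictionary stays in a single column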

Accessing Specific Nested Paths

If a record contains a nested list, record_path tells json_normalize() which list to expand into rows, while meta keeps identifying fields from the parent record:

python
# Expand the 'hobbies' list into one row per hobby,
# keeping 'id' and 'name' from the parent record
df_hobbies = pd.json_normalize(
    nested_data,
    record_path=['hobbies'],
    meta=['id', 'name']
)
print(df_hobbies)

Note that record_path must point to a list: pointing it at a plain nested object such as 'contact' raises a TypeError. The hobby strings end up in a column named 0, alongside the id and name columns.

Real-world Example: Importing JSON from an API

Let's see a practical example of importing JSON data from a public API:

python
# Get data from a public API (JSONPlaceholder)
response = requests.get('https://jsonplaceholder.typicode.com/users')

# Check whether the request was successful
if response.status_code == 200:
    # Parse the JSON response
    users_data = response.json()

    # Convert to a DataFrame, flattening nested objects
    users_df = pd.json_normalize(users_data)

    # Select just a few columns for display
    users_sample = users_df[['id', 'name', 'email', 'address.city', 'company.name']]
    print(users_sample.head())
else:
    print(f"Error fetching data: {response.status_code}")

Output:

   id              name                      email   address.city        company.name
0   1     Leanne Graham          [email protected]    Gwenborough     Romaguera-Crona
1   2      Ervin Howell          [email protected]    Wisokyburgh        Deckow-Crist
2   3  Clementine Bauch         [email protected]  McKenziehaven  Romaguera-Jacobson
3   4  Patricia Lebsack  [email protected]    South Elvis       Robel-Corkery
4   5  Chelsey Dietrich   [email protected]     Roscoeview         Keebler LLC

Working with JSON Lines (JSONL)

JSON Lines format has one JSON object per line. Pandas can read this format too:

python
# JSON Lines example
jsonl_data = '''
{"name": "John", "age": 28, "city": "New York"}
{"name": "Anna", "age": 24, "city": "Paris"}
{"name": "Peter", "age": 35, "city": "Berlin"}
{"name": "Linda", "age": 32, "city": "London"}
'''

# Save to a file for demonstration
with open('data.jsonl', 'w') as f:
    f.write(jsonl_data)

# Read the JSONL file
df_jsonl = pd.read_json('data.jsonl', lines=True)
print(df_jsonl)

Output:

    name  age      city
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin
3  Linda   32    London
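
For large JSONL files, you can combine lines=True with chunksize to stream the file instead of loading it all at once; read_json then returns an iterator of DataFrames. A sketch, assuming the data.jsonl file created above:

python
# chunksize is only supported together with lines=True
reader = pd.read_json('data.jsonl', lines=True, chunksize=2)
for chunk in reader:
    print(chunk.shape)  # each chunk is a small DataFrame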

Converting a DataFrame Back to JSON

You can also convert your DataFrame back to JSON using to_json():

python
# Create a simple DataFrame
df = pd.DataFrame({
    'name': ['John', 'Anna', 'Peter'],
    'age': [28, 24, 35],
    'city': ['New York', 'Paris', 'Berlin']
})

# Convert to JSON string with different orientations
json_string = df.to_json() # Default (columns orientation)
json_records = df.to_json(orient='records') # Records orientation
json_split = df.to_json(orient='split') # Split orientation

print("Default JSON:")
print(json_string)
print("\nRecords format:")
print(json_records)

# Write to a file
df.to_json('output_data.json', orient='records')

Output:

Default JSON:
{"name":{"0":"John","1":"Anna","2":"Peter"},"age":{"0":28,"1":24,"2":35},"city":{"0":"New York","1":"Paris","2":"Berlin"}}

Records format:
[{"name":"John","age":28,"city":"New York"},{"name":"Anna","age":24,"city":"Paris"},{"name":"Peter","age":35,"city":"Berlin"}]

Best Practices and Common Issues

Handling Missing Data

JSON may contain null values or missing fields:

python
# JSON with nulls and missing values
json_with_nulls = '''
[
{"name": "John", "age": 28, "city": "New York"},
{"name": "Anna", "age": null, "city": "Paris"},
{"name": "Peter", "city": "Berlin"}
]
'''

df_nulls = pd.read_json(json_with_nulls)
print(df_nulls)

# Check for missing values
print("\nMissing values:")
print(df_nulls.isnull().sum())

Output:

    name   age      city
0   John  28.0  New York
1   Anna   NaN     Paris
2  Peter   NaN    Berlin

Missing values:
name    0
age     2
city    0
dtype: int64
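
Once the data is in a DataFrame, the usual pandas tools for missing values apply. Two common options, as a sketch:

python
# Option 1: fill missing ages with a default value
df_filled = df_nulls.fillna({'age': 0})

# Option 2: drop rows that contain any missing values
df_dropped = df_nulls.dropna()

print(df_filled)
print(df_dropped)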

Type Inference

Pandas infers column data types when reading JSON, which means numeric-looking strings are converted to numbers by default:

python
# JSON where the numbers are stored as strings
json_types = '''
[
    {"id": "1", "value": "100"},
    {"id": "2", "value": "200"},
    {"id": "3", "value": "300"}
]
'''

# Default reading: pandas infers dtypes, so the numeric strings become integers
df_types1 = pd.read_json(json_types)
print(df_types1.dtypes)

# Disable dtype inference to keep the original string values
df_types2 = pd.read_json(json_types, dtype=False)
print(df_types2.dtypes)

Output:

id        int64
value     int64
dtype: object

id       object
value    object
dtype: object
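
You can also control types per column by passing a dict to the dtype parameter, for example to keep id as a string while letting value become numeric (a sketch reusing json_types from above):

python
# Per-column dtypes: keep 'id' as text, force 'value' to integers
df_types3 = pd.read_json(json_types, dtype={'id': str, 'value': int})
print(df_types3.dtypes)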

Summary

In this tutorial, you learned how to:

  • Import JSON data from strings and files using pd.read_json()
  • Work with different JSON structures using the orient parameter
  • Handle nested JSON data with pd.json_normalize()
  • Extract data from real-world APIs and convert it to DataFrames
  • Deal with JSON Lines format
  • Convert DataFrames back to JSON
  • Handle common issues like missing data and data types

JSON is a versatile format, and pandas provides powerful tools for working with it. As you work with more data sources, these skills will help you efficiently import and prepare data for analysis.


Exercises

  1. Find a public API that returns JSON data and create a DataFrame from it.
  2. Practice normalizing a complex nested JSON structure with multiple levels.
  3. Convert a DataFrame with multilevel columns to JSON and back again.
  4. Create a function that reads a JSON file and returns only specific fields as a DataFrame.
  5. Import a large JSON dataset and practice optimizing the import for memory usage.

