
Pandas JSON Import

Introduction

JSON (JavaScript Object Notation) is one of the most common data interchange formats used in modern web applications and APIs. When working with data analysis in Python, you'll frequently need to import JSON data into pandas DataFrames for manipulation and analysis.

In this tutorial, you'll learn how to:

  • Import JSON data from files and strings
  • Handle different JSON structures (nested objects and arrays)
  • Normalize and flatten complex JSON data
  • Work with real-world JSON from APIs

Prerequisites

Before we begin, make sure you have the following libraries installed:

python
import pandas as pd
import json
import requests # For API examples
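
If you don't have them yet, pandas and requests can be installed with pip install pandas requests; the json module is part of the Python standard library.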

Basic JSON Import

Let's start with a simple example of importing JSON data into a pandas DataFrame.

From a JSON String

The simplest way to import JSON data is from a string using pd.read_json():

python
# Simple JSON string
json_string = '''
{
"name": ["John", "Anna", "Peter", "Linda"],
"age": [28, 24, 35, 32],
"city": ["New York", "Paris", "Berlin", "London"]
}
'''

# Parse JSON string into a DataFrame
df = pd.read_json(json_string)
print(df)

Output:

    name  age      city
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin
3  Linda   32    London
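
Note: newer pandas versions (2.1 and later) emit a FutureWarning when a literal JSON string is passed to pd.read_json() and expect it to be wrapped in a file-like object instead. A minimal sketch of the wrapped form, reusing json_string from above:

python
from io import StringIO

# Wrapping the literal string avoids the FutureWarning on pandas 2.1+
df = pd.read_json(StringIO(json_string))

The later string-based examples in this tutorial can be wrapped the same way.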

From a JSON File

More commonly, you'll import JSON from a file:

python
# Assuming you have a 'data.json' file
df = pd.read_json('data.json')

# You can pass the encoding explicitly if needed (UTF-8 is the default)
df = pd.read_json('data.json', encoding='utf-8')
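
pd.read_json() also accepts a URL, so remote JSON can be loaded directly. Here is a quick sketch using the JSONPlaceholder endpoint that appears later in this tutorial; note that nested objects stay as dictionary columns until you flatten them:

python
# read_json accepts URLs as well as local paths
df_remote = pd.read_json('https://jsonplaceholder.typicode.com/users')

# Nested fields such as 'address' and 'company' remain dictionaries here;
# pd.json_normalize() (covered below) flattens them into separate columns
print(df_remote[['id', 'name', 'email']].head())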

Working with Different JSON Structures

Records Oriented JSON

JSON can be structured in different ways. In the "records" format, each row is a separate object in a list:

python
# Records-oriented JSON
records_json = '''
[
{"name": "John", "age": 28, "city": "New York"},
{"name": "Anna", "age": 24, "city": "Paris"},
{"name": "Peter", "age": 35, "city": "Berlin"},
{"name": "Linda", "age": 32, "city": "London"}
]
'''

df_records = pd.read_json(records_json)
print(df_records)

Output:

    name  age      city
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin
3  Linda   32    London

Split Oriented JSON

The "split" format separates column names and data:

python
# Split-oriented JSON
split_json = '''
{
"columns": ["name", "age", "city"],
"data": [
["John", 28, "New York"],
["Anna", 24, "Paris"],
["Peter", 35, "Berlin"],
["Linda", 32, "London"]
]
}
'''

df_split = pd.read_json(split_json, orient='split')
print(df_split)

Output:

    name  age      city
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin
3  Linda   32    London

Specifying Orientation

The orient parameter helps pandas understand your JSON structure:

python
# Different orientations
df = pd.read_json(json_string, orient='columns') # Default
df = pd.read_json(records_json, orient='records')
df = pd.read_json(split_json, orient='split')

Common orientations include:

  • 'split': dict like {index -> [index], columns -> [columns], data -> [values]}
  • 'records': list like [{column -> value}, ... , {column -> value}]
  • 'index': dict like {index -> {column -> value}}
  • 'columns': dict like {column -> {index -> value}} (default)
  • 'values': just the values array
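
For example, the "index" orientation maps row labels to records (values reused from the examples above):

python
# Index-oriented JSON: {row label -> {column -> value}}
index_json = '''
{
    "0": {"name": "John", "age": 28, "city": "New York"},
    "1": {"name": "Anna", "age": 24, "city": "Paris"}
}
'''

df_index = pd.read_json(index_json, orient='index')
print(df_index)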

Handling Nested JSON

Real-world JSON often contains nested structures that need special handling.

Using json_normalize()

The json_normalize() function is powerful for flattening nested JSON:

python
# Nested JSON
nested_json = '''
[
    {
        "id": 1,
        "name": "John",
        "contact": {
            "email": "[email protected]",
            "phone": "555-1234"
        },
        "hobbies": ["reading", "hiking"]
    },
    {
        "id": 2,
        "name": "Anna",
        "contact": {
            "email": "[email protected]",
            "phone": "555-5678"
        },
        "hobbies": ["painting", "yoga", "travel"]
    }
]
'''

# json_normalize() expects parsed Python objects, so decode the string first
nested_data = json.loads(nested_json)

# Normalize the nested data
df_nested = pd.json_normalize(nested_data)
print(df_nested)

Output:

   id  name         contact.email  contact.phone                   hobbies
0   1  John  [email protected]       555-1234         [reading, hiking]
1   2  Anna    [email protected]       555-5678  [painting, yoga, travel]

Customizing Path Separator

You can customize how nested fields are named:

python
df_nested = pd.json_normalize(nested_data, sep='_')
print(df_nested)

Output:

   id  name         contact_email  contact_phone                   hobbies
0   1  John  [email protected]       555-1234         [reading, hiking]
1   2  Anna    [email protected]       555-5678  [painting, yoga, travel]
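
If you only want to flatten the top level of a deeply nested structure, json_normalize() also accepts a max_level argument. A minimal sketch using the same nested_data:

python
# Only flatten down to the given depth; deeper structures stay as dict values
df_shallow = pd.json_normalize(nested_data, max_level=0)
print(df_shallow.columns.tolist())
# With max_level=0 the 'contact' dictionary stays in a single column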

Accessing Specific Nested Paths

If a record contains a nested list, record_path tells json_normalize() which list to expand into rows, while meta keeps identifying fields from the parent record:

python
# Expand the 'hobbies' list into one row per hobby,
# keeping 'id' and 'name' from the parent record
df_hobbies = pd.json_normalize(
    nested_data,
    record_path=['hobbies'],
    meta=['id', 'name']
)
print(df_hobbies)

Note that record_path must point to a list: pointing it at a plain nested object such as 'contact' raises a TypeError. The hobby strings end up in a column named 0, alongside the id and name columns.

Real-world Example: Importing JSON from an API

Let's see a practical example of importing JSON data from a public API:

python
# Get data from a public API (JSONPlaceholder)
response = requests.get('https://jsonplaceholder.typicode.com/users')

# Check whether the request was successful
if response.status_code == 200:
    # Parse the JSON response
    users_data = response.json()

    # Convert to a DataFrame, flattening nested objects
    users_df = pd.json_normalize(users_data)

    # Select just a few columns for display
    users_sample = users_df[['id', 'name', 'email', 'address.city', 'company.name']]
    print(users_sample.head())
else:
    print(f"Error fetching data: {response.status_code}")

Output:

   id              name                      email   address.city        company.name
0   1     Leanne Graham          [email protected]    Gwenborough     Romaguera-Crona
1   2      Ervin Howell          [email protected]    Wisokyburgh        Deckow-Crist
2   3  Clementine Bauch         [email protected]  McKenziehaven  Romaguera-Jacobson
3   4  Patricia Lebsack  [email protected]    South Elvis       Robel-Corkery
4   5  Chelsey Dietrich   [email protected]     Roscoeview         Keebler LLC

Working with JSON Lines (JSONL)

JSON Lines format has one JSON object per line. Pandas can read this format too:

python
# JSON Lines example
jsonl_data = '''
{"name": "John", "age": 28, "city": "New York"}
{"name": "Anna", "age": 24, "city": "Paris"}
{"name": "Peter", "age": 35, "city": "Berlin"}
{"name": "Linda", "age": 32, "city": "London"}
'''

# Save to a file for demonstration
with open('data.jsonl', 'w') as f:
    f.write(jsonl_data)

# Read the JSONL file
df_jsonl = pd.read_json('data.jsonl', lines=True)
print(df_jsonl)

Output:

    name  age      city
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin
3  Linda   32    London
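
For large JSONL files, you can combine lines=True with chunksize to stream the file instead of loading it all at once; read_json then returns an iterator of DataFrames. A sketch, assuming the data.jsonl file created above:

python
# chunksize is only supported together with lines=True
reader = pd.read_json('data.jsonl', lines=True, chunksize=2)
for chunk in reader:
    print(chunk.shape)  # each chunk is a small DataFrame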

Converting a DataFrame Back to JSON

You can also convert your DataFrame back to JSON using to_json():

python
# Create a simple DataFrame
df = pd.DataFrame({
    'name': ['John', 'Anna', 'Peter'],
    'age': [28, 24, 35],
    'city': ['New York', 'Paris', 'Berlin']
})

# Convert to JSON string with different orientations
json_string = df.to_json() # Default (columns orientation)
json_records = df.to_json(orient='records') # Records orientation
json_split = df.to_json(orient='split') # Split orientation

print("Default JSON:")
print(json_string)
print("\nRecords format:")
print(json_records)

# Write to a file
df.to_json('output_data.json', orient='records')

Output:

Default JSON:
{"name":{"0":"John","1":"Anna","2":"Peter"},"age":{"0":28,"1":24,"2":35},"city":{"0":"New York","1":"Paris","2":"Berlin"}}

Records format:
[{"name":"John","age":28,"city":"New York"},{"name":"Anna","age":24,"city":"Paris"},{"name":"Peter","age":35,"city":"Berlin"}]

Best Practices and Common Issues

Handling Missing Data

JSON may contain null values or missing fields:

python
# JSON with nulls and missing values
json_with_nulls = '''
[
{"name": "John", "age": 28, "city": "New York"},
{"name": "Anna", "age": null, "city": "Paris"},
{"name": "Peter", "city": "Berlin"}
]
'''

df_nulls = pd.read_json(json_with_nulls)
print(df_nulls)

# Check for missing values
print("\nMissing values:")
print(df_nulls.isnull().sum())

Output:

    name   age      city
0   John  28.0  New York
1   Anna   NaN     Paris
2  Peter   NaN    Berlin

Missing values:
name    0
age     2
city    0
dtype: int64
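
Once the data is in a DataFrame, the usual pandas tools for missing values apply. Two common options, as a sketch:

python
# Option 1: fill missing ages with a default value
df_filled = df_nulls.fillna({'age': 0})

# Option 2: drop rows that contain any missing values
df_dropped = df_nulls.dropna()

print(df_filled)
print(df_dropped)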

Type Inference

Pandas infers column data types when reading JSON, which means numeric-looking strings are converted to numbers by default:

python
# JSON where the numbers are stored as strings
json_types = '''
[
    {"id": "1", "value": "100"},
    {"id": "2", "value": "200"},
    {"id": "3", "value": "300"}
]
'''

# Default reading: pandas infers dtypes, so the numeric strings become integers
df_types1 = pd.read_json(json_types)
print(df_types1.dtypes)

# Disable dtype inference to keep the original string values
df_types2 = pd.read_json(json_types, dtype=False)
print(df_types2.dtypes)

Output:

id        int64
value     int64
dtype: object

id       object
value    object
dtype: object
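
You can also control types per column by passing a dict to the dtype parameter, for example to keep id as a string while letting value become numeric (a sketch reusing json_types from above):

python
# Per-column dtypes: keep 'id' as text, force 'value' to integers
df_types3 = pd.read_json(json_types, dtype={'id': str, 'value': int})
print(df_types3.dtypes)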

Summary

In this tutorial, you learned how to:

  • Import JSON data from strings and files using pd.read_json()
  • Work with different JSON structures using the orient parameter
  • Handle nested JSON data with pd.json_normalize()
  • Extract data from real-world APIs and convert it to DataFrames
  • Deal with JSON Lines format
  • Convert DataFrames back to JSON
  • Handle common issues like missing data and data types

JSON is a versatile format, and pandas provides powerful tools for working with it. As you work with more data sources, these skills will help you efficiently import and prepare data for analysis.


Exercises

  1. Find a public API that returns JSON data and create a DataFrame from it.
  2. Practice normalizing a complex nested JSON structure with multiple levels.
  3. Convert a DataFrame with multilevel columns to JSON and back again.
  4. Create a function that reads a JSON file and returns only specific fields as a DataFrame.
  5. Import a large JSON dataset and practice optimizing the import for memory usage.

