Pandas STATA Import

Introduction

STATA is a powerful statistical software package widely used in research fields like economics, sociology, political science, and epidemiology. If you're transitioning from STATA to Python or need to work with STATA files in a Python environment, pandas provides an excellent way to import and manipulate STATA data files (.dta format).

In this tutorial, you'll learn how to:

Import STATA files into pandas DataFrames
Handle STATA-specific metadata
Work with different STATA file versions
Apply various import options to customize your data import

Prerequisites

Before we begin, ensure you have the following installed:

Python 3.x
pandas

pandas reads and writes .dta files on its own, so no additional package is needed for STATA support. You can install it using pip:

pip install pandas

Basic STATA Import with Pandas

The primary function for importing STATA files in pandas is pd.read_stata(). This function loads STATA files directly into a pandas DataFrame.

Simple Example

Here's a basic example of importing a STATA file:

import pandas as pd

# Load a STATA file
df = pd.read_stata('example.dta')

# Display the first few rows
print(df.head())

Output:

   id    name  age  income   education
0   1   Alice   24   45000    Bachelor
1   2     Bob   32   65000     Masters
2   3  Claire   45   78000    Doctoral
3   4   David   19   12000  Highschool
4   5     Eva   38   92000     Masters
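If no .dta file is at hand, one can be generated with pandas itself. The following is a minimal round-trip sketch, assuming a temporary directory and made-up sample values (neither comes from a real dataset): it writes a small DataFrame with DataFrame.to_stata() and reads it back with pd.read_stata().

```python
import os
import tempfile

import pandas as pd

# Made-up sample data, only for illustration
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Claire"],
    "age": [24, 32, 45],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.dta")
    # write_index=False keeps the DataFrame index out of the file
    sample.to_stata(path, write_index=False)
    # Read the file back into a new DataFrame
    df = pd.read_stata(path)

print(df)
```

Writing the file yourself is also a convenient way to experiment with the import options covered later without needing a STATA installation.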

Understanding STATA Metadata

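STATA files contain rich metadata, including variable labels, value labels, and dataset notes. pd.read_stata() returns a plain DataFrame, which does not carry these labels. Variable labels and value labels are available from the StataReader object that pd.read_stata() returns when you pass iterator=True.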

Accessing Variable Labels

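import pandas as pd

# Open the file as a StataReader so the metadata stays accessible
with pd.read_stata('example.dta', iterator=True) as reader:
    df = reader.read()                          # the data as a DataFrame
    variable_labels = reader.variable_labels()  # dict: column name -> label

# Print each variable with its label
print("Variable Labels:")
for var, label in variable_labels.items():
    print(f"{var}: {label}")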

Output:

Variable Labels:
id: Participant ID
name: Participant Name
age: Age in years
income: Annual income in USD
education: Highest degree attained
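To see where such labels come from, it can help to write them yourself. The sketch below assumes a throwaway file in a temporary directory and made-up labels: it stores variable labels through the variable_labels argument of DataFrame.to_stata() and reads them back through a StataReader.

```python
import os
import tempfile

import pandas as pd

# Made-up data and labels, only for illustration
sample = pd.DataFrame({"id": [1, 2], "age": [24, 32]})
labels = {"id": "Participant ID", "age": "Age in years"}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "labelled.dta")
    # Store the labels in the file alongside the data
    sample.to_stata(path, write_index=False, variable_labels=labels)

    # iterator=True returns a StataReader instead of a DataFrame
    with pd.read_stata(path, iterator=True) as reader:
        df = reader.read()
        read_back = reader.variable_labels()

print(read_back)
```

Because the labels live on the reader and not on the DataFrame, keep the returned dictionary around if you need the labels later, for example to title plots.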

Working with Value Labels

Value labels in STATA map numeric codes to descriptive strings. Here's how to access them:

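import pandas as pd

# Value labels are read from the StataReader, not from the DataFrame
with pd.read_stata('survey.dta', iterator=True) as reader:
    df = reader.read()
    value_labels = reader.value_labels()  # dict: label name -> {code: text}

# Print the value labels
print(value_labels)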

Output:

{'education': {1: 'No formal education', 2: 'Primary', 3: 'Secondary', 4: 'Undergraduate', 5: 'Graduate'}}
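Value labels can be produced from pandas as well, which makes the mapping easy to inspect. In this sketch (made-up categories, temporary file), a categorical column is written with to_stata(), which stores its categories as a value-label set named after the column, and the mapping is then read back with StataReader.value_labels(). Note that pandas numbers the categories from 0, whereas labels defined in STATA often start at 1.

```python
import os
import tempfile

import pandas as pd

# Made-up ordered categories, only for illustration
sample = pd.DataFrame({
    "education": pd.Categorical(
        ["Primary", "Graduate", "Secondary"],
        categories=["Primary", "Secondary", "Graduate"],
        ordered=True,
    )
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "survey.dta")
    # Categorical columns are written as integer codes plus value labels
    sample.to_stata(path, write_index=False)

    with pd.read_stata(path, iterator=True) as reader:
        reader.read()
        value_labels = reader.value_labels()

# Normalise the keys to plain Python ints for display
education_labels = {int(code): text for code, text in value_labels["education"].items()}
print(education_labels)
```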

Advanced STATA Import Options

Pandas provides several options to customize how STATA files are imported.

Selecting Specific Columns

To load only specific columns from a large STATA file:

import pandas as pd

# Load only selected columns
df = pd.read_stata('large_dataset.dta', columns=['id', 'age', 'income'])

print(df.head())

Output:

   id  age  income
0   1   24   45000
1   2   32   65000
2   3   45   78000
3   4   19   12000
4   5   38   92000
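The effect of the columns argument can be checked on a small file. The sketch below assumes a temporary file with made-up values: it writes four columns and reads back only two of them.

```python
import os
import tempfile

import pandas as pd

# Made-up data with more columns than we want to load
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "age": [24, 32, 45],
    "income": [45000, 65000, 78000],
    "score": [1.5, 2.5, 3.5],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "large_dataset.dta")
    sample.to_stata(path, write_index=False)
    # Only the listed columns are returned
    df = pd.read_stata(path, columns=["id", "income"])

print(df.columns.tolist())
```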

Handling Different STATA Versions

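STATA's .dta format has changed across releases. pd.read_stata() has no version parameter: it reads the format version from the file header and handles it automatically, so files written by different STATA releases are imported with the same call:

import pandas as pd

# Import a file created in STATA 12; the format version is detected automatically
df = pd.read_stata('stata12_file.dta')

# View the DataFrame information (info() prints its report directly)
df.info()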
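If you are curious which format a file uses, the header can be inspected directly. The sketch below is illustrative only: dta_format_version is a hypothetical helper written for this example, not a pandas function, and it relies on formats 117 and newer starting with an XML-like header while older formats store the format number in the first byte. It writes the same made-up data in two formats and reads both with an identical pd.read_stata() call.

```python
import os
import tempfile

import pandas as pd


def dta_format_version(path):
    """Hypothetical helper: return the .dta format number stored in the file header."""
    xml_prefix = b"<stata_dta><header><release>"
    with open(path, "rb") as fh:
        head = fh.read(len(xml_prefix) + 3)
    if head.startswith(xml_prefix):
        # Formats 117 and newer: the number follows the opening tags as text
        return int(head[len(xml_prefix):].decode("ascii"))
    # Older formats: the first byte is the format number
    return head[0]


sample = pd.DataFrame({"id": [1, 2, 3]})
detected = {}
frames = {}

with tempfile.TemporaryDirectory() as tmp:
    for version in (114, 118):
        path = os.path.join(tmp, f"v{version}.dta")
        sample.to_stata(path, write_index=False, version=version)
        detected[version] = dta_format_version(path)
        # The same call reads both formats
        frames[version] = pd.read_stata(path)

print(detected)
```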

Converting Categorical Data

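STATA often stores categorical variables as numeric codes with value labels attached. The convert_categoricals parameter controls whether pandas applies those labels (the default) or returns the raw codes: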

import pandas as pd

# Option 1: Convert STATA categorical data to pandas categorical
df1 = pd.read_stata('example.dta', convert_categoricals=True)

# Option 2: Don't convert categories, keep original encodings
df2 = pd.read_stata('example.dta', convert_categoricals=False)

# Compare the results
print("With categorical conversion:")
print(df1['education'].head())
print("\nWithout categorical conversion:")
print(df2['education'].head())

Output:

With categorical conversion:
0     Bachelor
1      Masters
2     Doctoral
3    Highschool
4      Masters
Name: education, dtype: category
Categories (4, object): ['Bachelor', 'Doctoral', 'Highschool', 'Masters']

Without categorical conversion, the second printout shows the underlying integer codes (with an integer dtype) in place of the label text. The exact codes depend on how the value labels were defined in the file.
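The difference between the two settings is easiest to see on a file whose codes are known. This sketch assumes a temporary file and a made-up categorical column: it reads the same file once with and once without conversion.

```python
import os
import tempfile

import pandas as pd

# Made-up categorical column; pandas writes it as codes 0, 1, 2 plus value labels
sample = pd.DataFrame({
    "education": pd.Categorical(
        ["Bachelor", "Masters", "Doctoral", "Masters"],
        categories=["Bachelor", "Masters", "Doctoral"],
        ordered=True,
    )
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.dta")
    sample.to_stata(path, write_index=False)

    # Labels applied: the column comes back as a pandas categorical
    labelled = pd.read_stata(path, convert_categoricals=True)
    # Labels ignored: the column comes back as the stored integer codes
    coded = pd.read_stata(path, convert_categoricals=False)

print(labelled["education"].astype(str).tolist())
print(coded["education"].tolist())
```

Reading the raw codes is mainly useful when the numeric values themselves matter, for example when merging with another source that uses the same coding scheme.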

Handling Missing Values

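STATA distinguishes several missing-value codes (. and .a through .z). By default, pd.read_stata() converts all of them to NaN. Passing convert_missing=True keeps them as StataMissingValue objects instead, so the distinct codes survive, at the cost of turning the affected columns into object dtype:

import pandas as pd

# Default: STATA missing values become NaN
df = pd.read_stata('missing_data.dta')
print(df.isna().sum())

# Keep the distinct STATA missing codes as StataMissingValue objects
df_codes = pd.read_stata('missing_data.dta', convert_missing=True)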
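A small experiment shows what each setting returns. The sketch below assumes a temporary file and a made-up numeric column with one gap: the default read yields NaN, while convert_missing=True yields a StataMissingValue object in an object-dtype column.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up column with one missing value
sample = pd.DataFrame({"score": [1.5, np.nan, 3.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "missing_data.dta")
    # NaN is written as STATA's system missing value
    sample.to_stata(path, write_index=False)

    as_nan = pd.read_stata(path)                          # missing -> NaN
    as_codes = pd.read_stata(path, convert_missing=True)  # missing -> StataMissingValue

print(as_nan["score"].isna().sum())
print(as_codes["score"].dtype)
print(type(as_codes["score"].iloc[1]).__name__)
```

For most analyses the default is the better choice, since NaN works with isna(), dropna(), and numeric aggregation, while object columns do not.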

Real-World Examples

Example 1: Analyzing Economic Data

Let's work with a hypothetical economic dataset from STATA:

import pandas as pd
import matplotlib.pyplot as plt

# Import economic data from STATA
econ_data = pd.read_stata('economic_indicators.dta')

# Basic exploratory analysis
print("Dataset shape:", econ_data.shape)
print("\nBasic statistics:")
print(econ_data.describe())

# Plot GDP growth over time
plt.figure(figsize=(12, 6))
plt.plot(econ_data['year'], econ_data['gdp_growth'], marker='o')
plt.title('GDP Growth Over Time')
plt.xlabel('Year')
plt.ylabel('GDP Growth (%)')
plt.grid(True)
plt.savefig('gdp_growth.png')
plt.close()

# Calculate correlation between variables
correlation = econ_data[['gdp_growth', 'inflation', 'unemployment']].corr()
print("\nCorrelation between economic indicators:")
print(correlation)

Example 2: Working with Survey Data

Survey data is commonly stored in STATA format. Here's how to process it with pandas:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Import survey data
survey = pd.read_stata('survey_data.dta', convert_categoricals=True)

# Print the structure of the data (info() prints its report directly)
print("Survey data structure:")
survey.info()

# Get a summary of responses by category
response_counts = survey['satisfaction_level'].value_counts().sort_index()
print("\nSatisfaction level distribution:")
print(response_counts)

# Create a visualization
plt.figure(figsize=(10, 6))
sns.countplot(data=survey, x='satisfaction_level', hue='gender')
plt.title('Satisfaction Levels by Gender')
plt.xlabel('Satisfaction Level')
plt.ylabel('Count')
plt.savefig('satisfaction_by_gender.png')
plt.close()

# Calculate average values by group (select several columns with a list)
avg_by_group = survey.groupby('age_group')[['income', 'satisfaction_score']].mean()
print("\nAverage income and satisfaction by age group:")
print(avg_by_group)

Handling Large STATA Files

Reading a large STATA file in one call can exceed available memory. The chunksize parameter lets you read and process the file a fixed number of rows at a time:

import pandas as pd

# Import a large STATA file with chunking
chunks = []

# With chunksize set, read_stata returns an iterator of DataFrames;
# the with-block closes the file once the loop is done
with pd.read_stata('very_large_file.dta', chunksize=10000) as reader:
    # Process the file in chunks
    for chunk in reader:
        # Process or filter each chunk
        processed_chunk = chunk[chunk['age'] > 25]  # Example filter
        chunks.append(processed_chunk)

# Combine all processed chunks
df = pd.concat(chunks)
print(f"Loaded {len(df)} rows after filtering")
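The chunked pattern can be tried on a tiny file to confirm that filtering per chunk gives the same rows as filtering the whole table. The sketch below assumes made-up ages, a temporary file, and a deliberately small chunksize of 4.

```python
import os
import tempfile

import pandas as pd

# Made-up data: ten rows, read back four rows at a time
sample = pd.DataFrame({
    "id": list(range(1, 11)),
    "age": [18, 30, 22, 41, 27, 19, 55, 26, 24, 33],
})

kept = []
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "very_large_file.dta")
    sample.to_stata(path, write_index=False)

    with pd.read_stata(path, chunksize=4) as reader:
        for chunk in reader:
            # Keep only the rows of interest from each chunk
            kept.append(chunk[chunk["age"] > 25])

# Combine the filtered chunks into one DataFrame
df = pd.concat(kept, ignore_index=True)
print(len(df))
```

Filtering inside the loop keeps memory use bounded by the chunk size plus the rows you retain, which is the point of reading in chunks.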

Exporting Pandas DataFrames to STATA

Although this tutorial focuses on importing, it's also useful to know how to export DataFrames back to STATA format:

import pandas as pd
import numpy as np

# Create a sample DataFrame
data = {
    'id': range(1, 101),
    'name': [f'Person_{i}' for i in range(1, 101)],
    'age': np.random.randint(18, 65, 100),
    'income': np.random.normal(50000, 15000, 100).astype(int),
    'group': np.random.choice(['A', 'B', 'C', 'D'], 100)
}

df = pd.DataFrame(data)

# Export to STATA format (.dta format 114, readable by STATA 10 and later;
# use version=118 for the STATA 14+ format)
df.to_stata('exported_data.dta', write_index=False, version=114)

print("Data successfully exported to STATA format!")
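Date columns deserve a little care on export. As a sketch under stated assumptions (made-up visit dates, a temporary file), the convert_dates argument of to_stata() can store a datetime column as a STATA daily date ('td'), and pd.read_stata() then restores it as datetime on import.

```python
import os
import tempfile

import pandas as pd

# Made-up data with a datetime column
sample = pd.DataFrame({
    "id": [1, 2],
    "visit": pd.to_datetime(["2021-03-15", "2022-11-02"]),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "exported_data.dta")
    # 'td' stores the column as a STATA daily date
    sample.to_stata(path, write_index=False, convert_dates={"visit": "td"}, version=118)
    back = pd.read_stata(path)

print(back["visit"].dt.strftime("%Y-%m-%d").tolist())
```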

Common Issues and Solutions

Issue 1: UnicodeDecodeError

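When working with files that contain non-ASCII characters, note that current pandas versions no longer accept an encoding argument in pd.read_stata(); passing one raises a TypeError. pandas chooses the text encoding from the file's format version instead: Latin-1 for format 117 and older, UTF-8 for format 118 and newer.

import pandas as pd

# The encoding is picked from the file's format version
df = pd.read_stata('international_data.dta')

# Inspect the string columns to confirm the text was decoded as expected
print(df.select_dtypes(include='object').head())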
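If strings from an older file come back garbled, one possible cause is that the text was written in a code page other than Latin-1. The sketch below illustrates a repair under that assumption, with cp1251 chosen purely as an example encoding and the sample word made up: because Latin-1 maps every byte to a character, the wrongly decoded text can be turned back into the original bytes and decoded again.

```python
import pandas as pd

original = "Привет"

# Simulate cp1251 bytes that were decoded as Latin-1
garbled = pd.Series([original.encode("cp1251").decode("latin-1")])

# Undo the wrong decoding, then decode with the encoding the text was written in
repaired = garbled.str.encode("latin-1").str.decode("cp1251")

print(repaired.iloc[0])
```

This only helps when you know, or can guess, the encoding the file was created with; applying the wrong one produces different garbage or a decoding error.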

Issue 2: ValueError with Date Formats

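STATA stores dates as numbers counted from 1 January 1960. pd.read_stata() converts variables that carry a STATA date format to datetime automatically (convert_dates=True is the default). A daily date stored as a plain number without a date format arrives as a numeric column and has to be converted by hand:

import pandas as pd

# Variables with a STATA date format are converted automatically
df = pd.read_stata('dates_data.dta')

# A daily date stored as a plain number counts days since 1960-01-01
if 'interview_date' in df.columns:
    if pd.api.types.is_numeric_dtype(df['interview_date']):
        df['interview_date'] = pd.to_datetime(df['interview_date'], unit='D', origin='1960-01-01')
    print(df[['interview_date']].head())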
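The day-count arithmetic can be checked without any file. In this small sketch the day numbers are made up for illustration: 0 is the STATA epoch, 366 is one (leap) year later, and 21915 is sixty years later.

```python
import pandas as pd

# Made-up STATA daily dates: days elapsed since 1960-01-01
stata_days = pd.Series([0, 366, 21915])

# unit='D' tells pandas the numbers are days; origin sets the STATA epoch
dates = pd.to_datetime(stata_days, unit="D", origin="1960-01-01")

print(dates.dt.strftime("%Y-%m-%d").tolist())
# ['1960-01-01', '1961-01-01', '2020-01-01']
```

Leaving out unit='D' would make pandas interpret the numbers as nanoseconds, which is why the conversion needs both arguments.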

Summary

In this tutorial, you've learned how to:

Import STATA files into pandas DataFrames using pd.read_stata()
Access and utilize STATA metadata such as variable labels and value labels
Customize the import process with various parameters
Handle different STATA file versions
Work with categorical data and missing values
Process large STATA files efficiently
Export pandas DataFrames back to STATA format
Troubleshoot common issues when working with STATA files

The ability to import STATA files seamlessly into pandas enables researchers and data scientists to leverage Python's powerful data analysis ecosystem while maintaining compatibility with STATA-based workflows.

Additional Resources

Practice Exercises

Import a STATA file and create a summary report showing basic statistics for all numeric variables.
Load a STATA file with categorical data and create visualizations showing the distribution of categories.
Write a function that converts value labels in a STATA file to human-readable categories in pandas.
Import a large STATA file using chunking and perform a complex aggregation operation.
Create a pandas DataFrame and export it to STATA format with custom variable and value labels.

These exercises will help you build practical skills in working with STATA data in the pandas environment.
