Pandas STATA Import
Introduction
STATA is a powerful statistical software package widely used in research fields like economics, sociology, political science, and epidemiology. If you're transitioning from STATA to Python or need to work with STATA files in a Python environment, pandas provides an excellent way to import and manipulate STATA data files (.dta format).
In this tutorial, you'll learn how to:
- Import STATA files into pandas DataFrames
- Handle STATA-specific metadata
- Work with different STATA file versions
- Apply various import options to customize your data import
Prerequisites
Before we begin, ensure you have the following installed:
- Python 3.x
- pandas
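- matplotlib and seaborn (only for the plotting examples later in this tutorial)
pd.read_stata() is part of pandas itself and needs no additional packages. You can install everything using pip:
pip install pandas matplotlib seaborn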
Basic STATA Import with Pandas
The primary function for importing STATA files in pandas is pd.read_stata(). This function loads STATA files directly into a pandas DataFrame.
Simple Example
Here's a basic example of importing a STATA file:
import pandas as pd
# Load a STATA file
df = pd.read_stata('example.dta')
# Display the first few rows
print(df.head())
Output:
   id    name  age  income   education
0   1   Alice   24   45000    Bachelor
1   2     Bob   32   65000     Masters
2   3  Claire   45   78000    Doctoral
3   4   David   19   12000  Highschool
4   5     Eva   38   92000     Masters
Understanding STATA Metadata
STATA files contain rich metadata, including variable labels, value labels, and dataset notes. The DataFrame returned by pd.read_stata() does not carry this metadata as attributes. To read labels, open the file as a StataReader by passing iterator=True.
Accessing Variable Labels
import pandas as pd
# Open the file as a StataReader to reach its metadata
with pd.read_stata('example.dta', iterator=True) as reader:
    variable_labels = reader.variable_labels()
    df = reader.read()
# Print each variable name with its label
print("Variable Labels:")
for var, label in variable_labels.items():
    print(f"{var}: {label}")
Output:
Variable Labels:
id: Participant ID
name: Participant Name
age: Age in years
income: Annual income in USD
education: Highest degree attained
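If you have no labeled .dta file at hand, the following sketch may help you try this out. It writes a small, made-up DataFrame to a temporary file with to_stata(variable_labels=...) and reads the labels back through a StataReader. The column names, labels, and file name are assumptions chosen for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data standing in for a real .dta file
sample = pd.DataFrame({"id": [1, 2, 3], "age": [24, 32, 45]})
labels = {"id": "Participant ID", "age": "Age in years"}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "labeled.dta")
    # to_stata() stores the labels in the file's metadata
    sample.to_stata(path, write_index=False, variable_labels=labels)

    # iterator=True returns a StataReader instead of a DataFrame
    with pd.read_stata(path, iterator=True) as reader:
        read_labels = reader.variable_labels()
        data = reader.read()

for var, label in read_labels.items():
    print(f"{var}: {label}")
```

Reading the labels and the data inside the same with block means the file is opened only once.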
Working with Value Labels
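Value labels in STATA map numeric codes to descriptive strings. Like variable labels, they are read through a StataReader rather than from the DataFrame:
import pandas as pd
# Open the file as a StataReader, then read the data and the value labels
with pd.read_stata('survey.dta', iterator=True) as reader:
    df = reader.read()
    value_labels = reader.value_labels()
# Print the code-to-label mapping of each label set
print(value_labels)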
Output:
{'education': {1: 'No formal education', 2: 'Primary', 3: 'Secondary', 4: 'Undergraduate', 5: 'Graduate'}}
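As a rough illustration of how such a mapping can be applied by hand, the sketch below writes a made-up categorical column to a temporary file, reads the raw codes with convert_categoricals=False, and maps them back to labels. The helper apply_value_labels is hypothetical, and it assumes each label set is named after its variable, which is how pandas writes them; files produced by STATA itself may use different label set names.

```python
import os
import tempfile

import pandas as pd

# Hypothetical survey column; a categorical is written as codes plus value labels
levels = ["Primary", "Secondary", "Graduate"]
survey = pd.DataFrame({
    "education": pd.Categorical(
        ["Secondary", "Primary", "Graduate"], categories=levels, ordered=True
    )
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "survey.dta")
    survey.to_stata(path, write_index=False)

    with pd.read_stata(path, iterator=True, convert_categoricals=False) as reader:
        codes = reader.read()                 # numeric codes only
        value_labels = reader.value_labels()  # {label set name: {code: label}}


def apply_value_labels(frame, label_sets):
    """Replace codes with labels for columns that share a name with a label set."""
    out = frame.copy()
    for column in out.columns:
        if column in label_sets:
            # Use plain Python ints as keys so the lookup does not depend on dtype
            mapping = {int(code): text for code, text in label_sets[column].items()}
            out[column] = out[column].map(mapping)
    return out


labeled = apply_value_labels(codes, value_labels)
print(labeled)
```

In everyday use, convert_categoricals=True (the default) performs this conversion for you; the manual version is mainly useful when you need both the codes and the labels.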
Advanced STATA Import Options
Pandas provides several options to customize how STATA files are imported.
Selecting Specific Columns
To load only specific columns from a large STATA file:
import pandas as pd
# Load only selected columns
df = pd.read_stata('large_dataset.dta', columns=['id', 'age', 'income'])
print(df.head())
Output:
   id  age  income
0   1   24   45000
1   2   32   65000
2   3   45   78000
3   4   19   12000
4   5   38   92000
Handling Different STATA Versions
STATA has evolved over the years, and the .dta file format has changed with it. pd.read_stata() detects the format version from the file header, so it takes no version argument; the same call works for every format pandas supports:
import pandas as pd
# Import a file created in STATA 12; the format is detected automatically
df = pd.read_stata('stata12_file.dta')
# View the DataFrame information
df.info()
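If you are curious which format a given file uses, the format number can be read from the first bytes of the file. The sketch below is one way to do that, under the assumption that formats 117 and later start with a `<stata_dta>` header containing a `<release>` tag, while older formats store the number in the first byte. The helper sniff_dta_version is hypothetical and not part of pandas.

```python
import os
import re
import tempfile

import pandas as pd


def sniff_dta_version(path):
    """Return the .dta format number stored at the start of the file."""
    with open(path, "rb") as f:
        head = f.read(64)
    if head.startswith(b"<stata_dta>"):
        # Formats 117 and later open with an XML-like header
        match = re.search(rb"<release>(\d+)</release>", head)
        return int(match.group(1)) if match else None
    # Older formats store the format number in the first byte
    return head[0] if head else None


sample = pd.DataFrame({"id": [1, 2, 3]})
found = {}
read_back = {}

with tempfile.TemporaryDirectory() as tmp:
    for version in (114, 118):
        path = os.path.join(tmp, f"v{version}.dta")
        sample.to_stata(path, write_index=False, version=version)
        found[version] = sniff_dta_version(path)
        # Reading needs no version argument
        read_back[version] = pd.read_stata(path)

print(found)
```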
Converting Categorical Data
STATA often uses categorical data. You can control how these are imported:
import pandas as pd
# Option 1: Convert STATA categorical data to pandas categorical
df1 = pd.read_stata('example.dta', convert_categoricals=True)
# Option 2: Don't convert categories, keep original encodings
df2 = pd.read_stata('example.dta', convert_categoricals=False)
# Compare the results
print("With categorical conversion:")
print(df1['education'].head())
print("\nWithout categorical conversion:")
print(df2['education'].head())
Output:
With categorical conversion:
0 Bachelor
1 Masters
2 Doctoral
3 Highschool
4 Masters
Name: education, dtype: category
Categories (4, object): ['Bachelor', 'Doctoral', 'Highschool', 'Masters']
Without categorical conversion (the codes below are illustrative; the actual values and dtype depend on how the file was coded):
0 2
1 3
2 4
3 1
4 3
Name: education, dtype: int8
Handling Missing Values
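STATA represents missing values with the system missing value (.) and the extended codes .a through .z. By default, pd.read_stata() converts all of them to NaN. Passing convert_missing=True instead keeps them as StataMissingValue objects, which preserves the distinction between codes but stores the affected columns with object dtype:
import pandas as pd
# Default behavior: STATA missing values become NaN
df = pd.read_stata('missing_data.dta')
print(df.isna().sum())
# Keep the original STATA missing-value codes (., .a, ..., .z)
df_codes = pd.read_stata('missing_data.dta', convert_missing=True)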
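To see the effect of convert_missing on a concrete file, here is a small sketch that writes a made-up column containing one NaN to a temporary .dta file and reads it back both ways. The column name and values are assumptions for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical data with one missing score
sample = pd.DataFrame({"score": [1.5, np.nan, 3.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "missing.dta")
    # NaN is written as a STATA missing value
    sample.to_stata(path, write_index=False)

    as_nan = pd.read_stata(path)                          # missing values become NaN
    as_codes = pd.read_stata(path, convert_missing=True)  # missing values kept as objects

print(as_nan["score"].isna().sum())
print(as_codes["score"].dtype)
print(as_codes["score"].tolist())
```

For most analyses the default is the convenient choice, because NaN works with isna(), dropna(), and numeric aggregation.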
Real-World Examples
Example 1: Analyzing Economic Data
Let's work with a hypothetical economic dataset from STATA:
import pandas as pd
import matplotlib.pyplot as plt
# Import economic data from STATA
econ_data = pd.read_stata('economic_indicators.dta')
# Basic exploratory analysis
print("Dataset shape:", econ_data.shape)
print("\nBasic statistics:")
print(econ_data.describe())
# Plot GDP growth over time
plt.figure(figsize=(12, 6))
plt.plot(econ_data['year'], econ_data['gdp_growth'], marker='o')
plt.title('GDP Growth Over Time')
plt.xlabel('Year')
plt.ylabel('GDP Growth (%)')
plt.grid(True)
plt.savefig('gdp_growth.png')
plt.close()
# Calculate correlation between variables
correlation = econ_data[['gdp_growth', 'inflation', 'unemployment']].corr()
print("\nCorrelation between economic indicators:")
print(correlation)
Example 2: Working with Survey Data
Survey data is commonly stored in STATA format. Here's how to process it with pandas:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Import survey data
survey = pd.read_stata('survey_data.dta', convert_categoricals=True)
# Print the structure of the data
print("Survey data structure:")
survey.info()  # info() prints its report directly and returns None
# Get a summary of responses by category
response_counts = survey['satisfaction_level'].value_counts().sort_index()
print("\nSatisfaction level distribution:")
print(response_counts)
# Create a visualization
plt.figure(figsize=(10, 6))
sns.countplot(data=survey, x='satisfaction_level', hue='gender')
plt.title('Satisfaction Levels by Gender')
plt.xlabel('Satisfaction Level')
plt.ylabel('Count')
plt.savefig('satisfaction_by_gender.png')
plt.close()
# Calculate average values by group
avg_by_group = survey.groupby('age_group')[['income', 'satisfaction_score']].mean()
print("\nAverage income and satisfaction by age group:")
print(avg_by_group)
Handling Large STATA Files
When working with large STATA files, you may need to optimize the import process:
import pandas as pd
# Import a large STATA file with chunking
chunks = []
reader = pd.read_stata('very_large_file.dta', chunksize=10000)
# Process the file in chunks
for chunk in reader:
    # Process or filter each chunk
    processed_chunk = chunk[chunk['age'] > 25]  # Example filter
    chunks.append(processed_chunk)
# Combine all processed chunks
df = pd.concat(chunks)
print(f"Loaded {len(df)} rows after filtering")
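When even the filtered rows would not fit in memory, an alternative is to keep only running totals per chunk. The following sketch computes a group mean that way on a tiny, made-up file; the column names and the chunk size of 2 are assumptions chosen so the example runs quickly.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data standing in for a very large file
sample = pd.DataFrame({
    "group": ["A", "B", "A", "B", "A"],
    "income": [10.0, 20.0, 30.0, 40.0, 80.0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "large.dta")
    sample.to_stata(path, write_index=False)

    running = None
    # chunksize makes read_stata() return an iterable StataReader
    with pd.read_stata(path, chunksize=2) as reader:
        for chunk in reader:
            # Per-chunk sums and counts are enough to rebuild the overall mean
            partial = chunk.groupby("group")["income"].agg(["sum", "count"])
            if running is None:
                running = partial
            else:
                running = running.add(partial, fill_value=0)

means = running["sum"] / running["count"]
print(means)
```

Only one chunk and one small table of totals are held in memory at any time, at the cost of restricting yourself to aggregations that can be combined across chunks.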
Exporting Pandas DataFrames to STATA
Although this tutorial focuses on importing, it's also useful to know how to export DataFrames back to STATA format:
import pandas as pd
import numpy as np
# Create a sample DataFrame
data = {
    'id': range(1, 101),
    'name': [f'Person_{i}' for i in range(1, 101)],
    'age': np.random.randint(18, 65, 100),
    'income': np.random.normal(50000, 15000, 100).astype(int),
    'group': np.random.choice(['A', 'B', 'C', 'D'], 100)
}
df = pd.DataFrame(data)
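# Export to STATA format (version=118 is the format used by STATA 14 and later;
# the default, version=114, targets older STATA releases)
df.to_stata('exported_data.dta', write_index=False, version=118)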
print("Data successfully exported to STATA format!")
Common Issues and Solutions
Issue 1: UnicodeDecodeError
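Text in a .dta file is encoded according to its format version: Latin-1 for format 117 and earlier, UTF-8 for format 118 (STATA 14) and later. pd.read_stata() picks the encoding from the file header. Current pandas versions do not accept an encoding argument, so passing one raises a TypeError instead of fixing the problem:
# The encoding is chosen from the file's format version
df = pd.read_stata('international_data.dta')
# Inspect the data to confirm that non-ASCII characters were read correctly
print(df.head())
If characters still look wrong, re-saving the file from a recent STATA release (format 118 or later) stores the text as UTF-8, which usually resolves the issue.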
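As a quick check that non-ASCII text survives a round trip, the sketch below writes a few made-up city names in format 118, which stores strings as UTF-8, and reads them back. The column name and values are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical text column with characters outside ASCII
cities = ["Zürich", "São Paulo", "Łódź"]
sample = pd.DataFrame({"city": cities})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "international.dta")
    # Format 118 stores strings as UTF-8
    sample.to_stata(path, write_index=False, version=118)
    restored = pd.read_stata(path)

print(restored["city"].tolist())
```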
Issue 2: ValueError with Date Formats
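STATA and pandas handle dates differently: STATA stores dates as numbers counted from 1960-01-01 (days for the %td format). By default (convert_dates=True), pd.read_stata() already converts variables that have a STATA date format to datetime64. Manual conversion is only needed for columns that hold raw day counts without a date format:
# Import with the default date conversion
df = pd.read_stata('dates_data.dta')
# Convert raw day counts (days since 1960-01-01) to pandas datetime
if 'interview_date' in df.columns and pd.api.types.is_numeric_dtype(df['interview_date']):
    df['interview_date'] = pd.to_datetime(df['interview_date'], unit='D', origin='1960-01-01')
print(df[['interview_date']].head())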
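The relationship between STATA's day counts and pandas datetimes can be checked with a small round trip. This sketch writes a made-up date column as a STATA daily date, reads it back once with the default conversion and once with convert_dates=False, and converts the raw numbers by hand. The column name and dates are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical interview dates
dates = pd.to_datetime(["2020-01-15", "2021-06-30", "2022-12-01"])
sample = pd.DataFrame({"interview_date": dates})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dates.dta")
    # Store the column as a STATA daily date (%td)
    sample.to_stata(path, write_index=False, convert_dates={"interview_date": "td"})

    converted = pd.read_stata(path)                 # default: already datetime64
    raw = pd.read_stata(path, convert_dates=False)  # day counts since 1960-01-01

# Manual conversion of the raw day counts
manual = pd.to_datetime(raw["interview_date"], unit="D", origin="1960-01-01")

print(raw["interview_date"].tolist())
print(manual.tolist())
```

Without unit='D', pd.to_datetime() would interpret the numbers as nanoseconds, so the unit matters as much as the origin.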
Summary
In this tutorial, you've learned how to:
- Import STATA files into pandas DataFrames using pd.read_stata()
- Access and utilize STATA metadata such as variable labels and value labels
- Customize the import process with various parameters
- Handle different STATA file versions
- Work with categorical data and missing values
- Process large STATA files efficiently
- Export pandas DataFrames back to STATA format
- Troubleshoot common issues when working with STATA files
The ability to import STATA files seamlessly into pandas enables researchers and data scientists to leverage Python's powerful data analysis ecosystem while maintaining compatibility with STATA-based workflows.
Additional Resources
- Pandas Official Documentation on STATA Reader
- UCLA's Guide to Using STATA with Python
- Comparing STATA and Python Analysis Workflows
Practice Exercises
- Import a STATA file and create a summary report showing basic statistics for all numeric variables.
- Load a STATA file with categorical data and create visualizations showing the distribution of categories.
- Write a function that converts value labels in a STATA file to human-readable categories in pandas.
- Import a large STATA file using chunking and perform a complex aggregation operation.
- Create a pandas DataFrame and export it to STATA format with custom variable and value labels.
These exercises will help you build practical skills in working with STATA data in the pandas environment.