
Pandas STATA Import

Introduction

STATA is a powerful statistical software package widely used in research fields like economics, sociology, political science, and epidemiology. If you're transitioning from STATA to Python or need to work with STATA files in a Python environment, pandas provides an excellent way to import and manipulate STATA data files (.dta format).

In this tutorial, you'll learn how to:

  • Import STATA files into pandas DataFrames
  • Handle STATA-specific metadata
  • Work with different STATA file versions
  • Apply various import options to customize your data import

Prerequisites

Before we begin, ensure you have the following installed:

  • Python 3.x
  • pandas

Reading and writing .dta files is built into pandas, so no additional package is required for STATA import. You can install pandas using pip:

bash
pip install pandas

Basic STATA Import with Pandas

The primary function for importing STATA files in pandas is pd.read_stata(). This function loads STATA files directly into a pandas DataFrame.

Simple Example

Here's a basic example of importing a STATA file:

python
import pandas as pd

# Load a STATA file
df = pd.read_stata('example.dta')

# Display the first few rows
print(df.head())

Output:

   id    name  age  income   education
0   1   Alice   24   45000    Bachelor
1   2     Bob   32   65000     Masters
2   3  Claire   45   78000    Doctoral
3   4   David   19   12000  Highschool
4   5     Eva   38   92000     Masters
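
If no STATA file is at hand, one way to try the import is to write a small DataFrame with DataFrame.to_stata() and read it back. The following is a minimal sketch; the file name sample.dta and the values are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data standing in for a real STATA dataset
sample = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "age": [24, 32, 45, 19, 38],
    "income": [45000, 65000, 78000, 12000, 92000],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.dta")  # assumed file name

    # write_index=False keeps the DataFrame index out of the file
    sample.to_stata(path, write_index=False)

    # Read the file back into a new DataFrame
    df = pd.read_stata(path)

print(df.head())
print(df.dtypes)  # integer columns may come back narrower, for example int32
```

The values survive the round trip, but the integer width can change because STATA has its own storage types.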

Understanding STATA Metadata

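STATA files contain rich metadata, including variable labels, value labels, and a dataset label. The DataFrame returned by pd.read_stata() does not carry this metadata as attributes; pandas exposes it through the StataReader object, which you get by passing iterator=True.

Accessing Variable Labels

python
import pandas as pd

# Open the file as a StataReader so the metadata stays accessible
with pd.read_stata('example.dta', iterator=True) as reader:
    df = reader.read()
    variable_labels = reader.variable_labels()

# Print each variable name with its label
print("Variable Labels:")
for var, label in variable_labels.items():
    print(f"{var}: {label}")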

Output:

Variable Labels:
id: Participant ID
name: Participant Name
age: Age in years
income: Annual income in USD
education: Highest degree attained
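
A DataFrame returned by read_stata() does not hold the labels itself; in current pandas releases they are read through a StataReader. Here is a minimal, self-contained sketch of the round trip, assuming a throwaway file named labelled.dta and made-up labels:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "age": [24, 32, 45]})

# Hypothetical labels, passed to to_stata() as a column -> label mapping
labels = {"id": "Participant ID", "age": "Age in years"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "labelled.dta")
    df.to_stata(path, write_index=False, variable_labels=labels)

    # iterator=True returns a StataReader instead of a DataFrame
    with pd.read_stata(path, iterator=True) as reader:
        variable_labels = reader.variable_labels()
        data = reader.read()

for var, label in variable_labels.items():
    print(f"{var}: {label}")
```

Because the reader is used as a context manager, the file is closed as soon as the block ends.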

Working with Value Labels

Value labels in STATA map numeric codes to descriptive strings. Here's how to access them:

python
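import pandas as pd

# Open the file as a StataReader to reach the value labels
with pd.read_stata('survey.dta', iterator=True) as reader:
    df = reader.read()
    value_labels = reader.value_labels()

# Dictionary of label name -> {code: label text}
print(value_labels)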

Output:

{'education': {1: 'No formal education', 2: 'Primary', 3: 'Secondary', 4: 'Undergraduate', 5: 'Graduate'}}
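
To see where value labels come from, it can help to write a categorical column and inspect what pandas stores. This is a sketch under assumptions: the file name and the categories are invented, and the 0-based codes are what pandas assigns when it writes a Categorical, not necessarily what a file created in STATA would contain.

```python
import os
import tempfile

import pandas as pd

# An ordered categorical column; pandas writes its categories as value labels
education = pd.Categorical(
    ["Primary", "Graduate", "Secondary"],
    categories=["Primary", "Secondary", "Graduate"],
    ordered=True,
)
df = pd.DataFrame({"id": [1, 2, 3], "education": education})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "survey_sample.dta")
    df.to_stata(path, write_index=False)

    with pd.read_stata(path, iterator=True) as reader:
        data = reader.read()
        value_labels = reader.value_labels()

# Normalise the keys to plain ints for printing
education_labels = {int(code): text for code, text in value_labels["education"].items()}
print(education_labels)
```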

Advanced STATA Import Options

Pandas provides several options to customize how STATA files are imported.

Selecting Specific Columns

To load only specific columns from a large STATA file:

python
import pandas as pd

# Load only selected columns
df = pd.read_stata('large_dataset.dta', columns=['id', 'age', 'income'])

print(df.head())

Output:

   id  age  income
0   1   24   45000
1   2   32   65000
2   3   45   78000
3   4   19   12000
4   5   38   92000
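
The effect of columns= can be checked on a small file written for the purpose. The sketch below uses an invented four-column dataset and reads back only two of the columns:

```python
import os
import tempfile

import pandas as pd

# Hypothetical dataset with more columns than we want to load
wide = pd.DataFrame({
    "id": [1, 2, 3],
    "age": [24, 32, 45],
    "income": [45000, 65000, 78000],
    "height_cm": [170.5, 182.0, 165.2],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "wide.dta")
    wide.to_stata(path, write_index=False)

    # Only the listed columns are returned
    subset = pd.read_stata(path, columns=["id", "income"])

print(subset)
```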

Handling Different STATA Versions

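STATA has evolved over the years, and so has the .dta file format. pd.read_stata() has no version parameter: it detects the format from the file header, so a file saved by STATA 12 is read the same way as one saved by a newer release:

python
import pandas as pd

# The .dta format version is detected automatically from the file header
df = pd.read_stata('stata12_file.dta')

# View the DataFrame information (info() prints directly and returns None)
df.info()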
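
One way to convince yourself that the format is detected automatically is to write the same data in several .dta format versions and read each file back without any version hint. This is a sketch with made-up data; 114, 117 and 118 are format versions that to_stata() accepts.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "score": [1.5, 2.5, 3.5]})

frames = {}
with tempfile.TemporaryDirectory() as tmp_dir:
    for version in (114, 117, 118):
        path = os.path.join(tmp_dir, f"format_{version}.dta")
        df.to_stata(path, write_index=False, version=version)

        # No version argument here: the reader inspects the file header
        frames[version] = pd.read_stata(path)

for version, frame in frames.items():
    print(version, frame["score"].tolist())
```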

Converting Categorical Data

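STATA stores categorical variables as numeric codes with attached value labels. The convert_categoricals parameter (default True) controls whether pandas applies those labels on import: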

python
import pandas as pd

# Option 1: Convert STATA categorical data to pandas categorical
df1 = pd.read_stata('example.dta', convert_categoricals=True)

# Option 2: Don't convert categories, keep original encodings
df2 = pd.read_stata('example.dta', convert_categoricals=False)

# Compare the results
print("With categorical conversion:")
print(df1['education'].head())
print("\nWithout categorical conversion:")
print(df2['education'].head())

Output:

With categorical conversion:
0      Bachelor
1       Masters
2      Doctoral
3    Highschool
4       Masters
Name: education, dtype: category
Categories (4, object): ['Highschool' < 'Bachelor' < 'Masters' < 'Doctoral']

Without categorical conversion:
0    2
1    3
2    4
3    1
4    3
Name: education, dtype: int8

The codes shown assume the file labels 1 as Highschool, 2 as Bachelor, 3 as Masters, and 4 as Doctoral. The exact integer dtype depends on the STATA storage type of the variable.
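
The difference between the two settings is easiest to see on a file whose codes are known. In this sketch the categories are hypothetical, and the 0-based codes are the ones pandas assigns when it writes a Categorical column:

```python
import os
import tempfile

import pandas as pd

# Hypothetical ordered categories; their position becomes the stored code
levels = ["Highschool", "Bachelor", "Masters", "Doctoral"]
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "education": pd.Categorical(
        ["Bachelor", "Masters", "Doctoral", "Highschool", "Masters"],
        categories=levels,
        ordered=True,
    ),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "categories.dta")
    df.to_stata(path, write_index=False)

    labelled = pd.read_stata(path, convert_categoricals=True)
    coded = pd.read_stata(path, convert_categoricals=False)

print(labelled["education"].dtype)   # categorical column with text labels
print(coded["education"].tolist())   # underlying integer codes
```

Keeping the raw codes can be useful when the numeric values carry meaning of their own, such as a Likert scale.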

Handling Missing Values

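STATA has several missing-value codes (. and .a through .z). By default, pandas converts all of them to NaN. Passing convert_missing=True does the opposite: it keeps them as StataMissingValue objects so the individual codes stay distinguishable:

python
import pandas as pd

# Default: STATA missing values become NaN
df = pd.read_stata('missing_data.dta')
print(df.isna().sum())

# Keep the original STATA missing codes as StataMissingValue objects
df_codes = pd.read_stata('missing_data.dta', convert_missing=True)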
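
A short round trip shows how the two settings of convert_missing differ. This sketch uses an invented file with a single NaN, which pandas writes as the STATA system missing value:

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "score": [3.5, np.nan, 4.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "missing_sample.dta")
    # NaN is written as the STATA system missing value "."
    df.to_stata(path, write_index=False)

    as_nan = pd.read_stata(path)                          # default behaviour
    as_codes = pd.read_stata(path, convert_missing=True)  # keep STATA codes

print(as_nan["score"].isna().sum())       # count of NaN values
print(as_codes["score"].dtype)            # object column holding mixed values
print(repr(as_codes["score"].iloc[1]))    # the preserved missing-value object
```

For most analyses the default is the more convenient choice, since NaN works with isna(), dropna() and fillna().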

Real-World Examples

Example 1: Analyzing Economic Data

Let's work with a hypothetical economic dataset from STATA:

python
import pandas as pd
import matplotlib.pyplot as plt

# Import economic data from STATA
econ_data = pd.read_stata('economic_indicators.dta')

# Basic exploratory analysis
print("Dataset shape:", econ_data.shape)
print("\nBasic statistics:")
print(econ_data.describe())

# Plot GDP growth over time
plt.figure(figsize=(12, 6))
plt.plot(econ_data['year'], econ_data['gdp_growth'], marker='o')
plt.title('GDP Growth Over Time')
plt.xlabel('Year')
plt.ylabel('GDP Growth (%)')
plt.grid(True)
plt.savefig('gdp_growth.png')
plt.close()

# Calculate correlation between variables
correlation = econ_data[['gdp_growth', 'inflation', 'unemployment']].corr()
print("\nCorrelation between economic indicators:")
print(correlation)

Example 2: Working with Survey Data

Survey data is commonly stored in STATA format. Here's how to process it with pandas:

python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Import survey data
survey = pd.read_stata('survey_data.dta', convert_categoricals=True)

# Print the structure of the data
print("Survey data structure:")
survey.info()  # info() prints directly and returns None

# Get a summary of responses by category
response_counts = survey['satisfaction_level'].value_counts().sort_index()
print("\nSatisfaction level distribution:")
print(response_counts)

# Create a visualization
plt.figure(figsize=(10, 6))
sns.countplot(data=survey, x='satisfaction_level', hue='gender')
plt.title('Satisfaction Levels by Gender')
plt.xlabel('Satisfaction Level')
plt.ylabel('Count')
plt.savefig('satisfaction_by_gender.png')
plt.close()

# Calculate average values by group
avg_by_group = survey.groupby('age_group')[['income', 'satisfaction_score']].mean()
print("\nAverage income and satisfaction by age group:")
print(avg_by_group)

Handling Large STATA Files

When working with large STATA files, you may need to optimize the import process:

python
import pandas as pd

# Import a large STATA file with chunking
chunks = []

# With chunksize set, read_stata returns an iterator of DataFrames
with pd.read_stata('very_large_file.dta', chunksize=10000) as reader:
    for chunk in reader:
        # Process or filter each chunk
        processed_chunk = chunk[chunk['age'] > 25]  # Example filter
        chunks.append(processed_chunk)

# Combine all processed chunks
df = pd.concat(chunks, ignore_index=True)
print(f"Loaded {len(df)} rows after filtering")
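
The chunked pattern can be tried on a tiny file before pointing it at a large one. In this sketch the data, the file name, and the chunk size of 4 are arbitrary choices for illustration:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "id": list(range(1, 11)),
    "age": [18, 30, 22, 41, 27, 19, 55, 25, 33, 60],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "chunked_sample.dta")
    df.to_stata(path, write_index=False)

    kept = []
    chunk_sizes = []
    # Each iteration yields a DataFrame with at most 4 rows
    with pd.read_stata(path, chunksize=4) as reader:
        for chunk in reader:
            chunk_sizes.append(len(chunk))
            kept.append(chunk[chunk["age"] > 25])

# Only the filtered rows are held in memory at the end
result = pd.concat(kept, ignore_index=True)
print(chunk_sizes)
print(result)
```

Filtering inside the loop is what keeps memory use low: each chunk is reduced before the next one is read.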

Exporting Pandas DataFrames to STATA

Although this tutorial focuses on importing, it's also useful to know how to export DataFrames back to STATA format:

python
import pandas as pd
import numpy as np

# Create a sample DataFrame
data = {
    'id': range(1, 101),
    'name': [f'Person_{i}' for i in range(1, 101)],
    'age': np.random.randint(18, 65, 100),
    'income': np.random.normal(50000, 15000, 100).astype(int),
    'group': np.random.choice(['A', 'B', 'C', 'D'], 100)
}

df = pd.DataFrame(data)

# Export in .dta format 118, which STATA 14 and later can read
# (the default, version=114, targets STATA 10 and later)
df.to_stata('exported_data.dta', write_index=False, version=118)

print("Data successfully exported to STATA format!")
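
Metadata can be attached at export time as well. The sketch below passes a dataset label and variable labels to to_stata() and reads them back through a StataReader. All names and label texts are invented, and the categorical column is included only to show that pandas writes value labels for it.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "income": [45000.0, 65000.0, 78000.0],
    "grp": pd.Categorical(["A", "B", "A"]),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "exported_sample.dta")
    df.to_stata(
        path,
        write_index=False,
        version=118,  # .dta format readable by STATA 14 and later
        data_label="Hypothetical income sample",
        variable_labels={"id": "Participant ID", "income": "Annual income in USD"},
    )

    # Read the metadata back to confirm it was stored
    with pd.read_stata(path, iterator=True) as reader:
        data_label = reader.data_label
        variable_labels = reader.variable_labels()
        value_labels = reader.value_labels()

print(data_label)
print(variable_labels)
print(value_labels)
```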

Common Issues and Solutions

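Issue 1: Garbled Non-ASCII Text

Current pandas releases no longer accept an encoding argument in read_stata(). Files in format 117 or older are decoded as latin-1 and newer files as UTF-8, so text written under another code page usually comes out garbled rather than raising UnicodeDecodeError. One workaround is to re-decode the affected column after import:

python
import pandas as pd

df = pd.read_stata('international_data.dta')

# Re-decode a garbled text column; 'name' and cp1251 are placeholders
# for the affected column and the file's actual code page
df['name'] = df['name'].str.encode('latin1').str.decode('cp1251')
print("Column re-decoded from cp1251")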

Issue 2: ValueError with Date Formats

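STATA stores dates as numbers counted from 1960-01-01. read_stata() converts them to pandas datetimes by default (convert_dates=True), so manual conversion is only needed when a file is read with convert_dates=False. In that case a daily (%td) date is the number of days since 1960-01-01:

python
import pandas as pd

# Read the raw numeric date values instead of converted datetimes
df = pd.read_stata('dates_data.dta', convert_dates=False)

# Convert a daily STATA date (days since 1960-01-01) to pandas datetime
if 'interview_date' in df.columns:
    df['interview_date'] = pd.to_datetime(df['interview_date'], unit='D', origin='1960-01-01')

print(df[['interview_date']].head())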
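
The relationship between STATA day counts and pandas datetimes can be checked with a round trip. This is a sketch under assumptions: the dates and the file name are made up, and the column is written as a daily (%td) date through the convert_dates argument of to_stata().

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "interview_date": pd.to_datetime(["2020-01-15", "2021-06-30", "2022-12-01"]),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dates_sample.dta")
    # Store the column as a daily (%td) STATA date
    df.to_stata(path, write_index=False, convert_dates={"interview_date": "td"})

    converted = pd.read_stata(path)                 # datetimes, the default
    raw = pd.read_stata(path, convert_dates=False)  # days since 1960-01-01

# Manual conversion of the raw day counts
manual = pd.to_datetime(
    raw["interview_date"], unit="D", origin=pd.Timestamp("1960-01-01")
)

print(raw["interview_date"].tolist())
print(manual.dt.strftime("%Y-%m-%d").tolist())
```

Other STATA date formats (weekly, monthly, quarterly and so on) count different units, so the unit="D" conversion applies only to daily dates.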

Summary

In this tutorial, you've learned how to:

  • Import STATA files into pandas DataFrames using pd.read_stata()
  • Access and utilize STATA metadata such as variable labels and value labels
  • Customize the import process with various parameters
  • Handle different STATA file versions
  • Work with categorical data and missing values
  • Process large STATA files efficiently
  • Export pandas DataFrames back to STATA format
  • Troubleshoot common issues when working with STATA files

The ability to import STATA files seamlessly into pandas enables researchers and data scientists to leverage Python's powerful data analysis ecosystem while maintaining compatibility with STATA-based workflows.

Practice Exercises

  1. Import a STATA file and create a summary report showing basic statistics for all numeric variables.
  2. Load a STATA file with categorical data and create visualizations showing the distribution of categories.
  3. Write a function that converts value labels in a STATA file to human-readable categories in pandas.
  4. Import a large STATA file using chunking and perform a complex aggregation operation.
  5. Create a pandas DataFrame and export it to STATA format with custom variable and value labels.

These exercises will help you build practical skills in working with STATA data in the pandas environment.


