Skip to main content

Pandas SAS Import

Introduction

SAS (Statistical Analysis System) is a widely used software suite for advanced analytics, business intelligence, and data management. When transitioning from SAS to Python or working in environments that use both tools, you'll often need to import SAS data files into pandas.

Pandas provides functionality to read SAS data files directly, making it easy to incorporate existing SAS data into your Python-based data analysis workflow. In this guide, we'll explore how to import different types of SAS files into pandas DataFrames.

Prerequisites

Before we begin, make sure you have the required libraries installed:

bash
pip install pandas
pip install pyreadstat # For better SAS file support

Reading SAS Data Files

Importing SAS7BDAT Files

The most common SAS data file format is .sas7bdat. Pandas provides a read_sas() function to import these files.

Basic Usage

python
import pandas as pd

# Read a SAS7BDAT file
df = pd.read_sas('dataset.sas7bdat')

# Display the first few rows
print(df.head())

Example with Output

Let's see how this works with a sample SAS dataset:

python
import pandas as pd

# Read a sample SAS7BDAT file
df = pd.read_sas('sample_sales.sas7bdat')

# Display the first 5 rows
print(df.head())

Output:

   region  product  year   quarter  sales
0 North Widget 2020 Q1 45000
1 North Widget 2020 Q2 52000
2 North Gadget 2020 Q1 12000
3 South Widget 2020 Q1 28000
4 South Gadget 2020 Q2 31000

Customizing Import Options

The read_sas() function offers several parameters to customize how SAS files are imported:

python
import pandas as pd

# Read with additional options
df = pd.read_sas(
'dataset.sas7bdat',
encoding='latin-1', # Specify encoding if needed
chunksize=10000, # Read file in chunks for large files
format='sas7bdat', # Explicitly specify format
iterator=True # Return an iterator for processing large files
)

# When using iterator=True, you need to read chunks:
chunk = df.read(5) # Read first 5 rows
print(chunk)

Reading SAS XPORT Files

SAS XPORT format (.xpt) is another common SAS file format, especially used for data exchange. Pandas can read these files too:

python
import pandas as pd

# Read a SAS XPORT file
df = pd.read_sas('dataset.xpt', format='xport')

# Show basic information
print(df.info())
print(df.head())

Working with SAS File Encodings

SAS files created on different systems might use different character encodings. You can specify the encoding when importing:

python
import pandas as pd

# Try different encodings if you encounter character issues
df_latin = pd.read_sas('dataset.sas7bdat', encoding='latin-1')
df_utf8 = pd.read_sas('dataset.sas7bdat', encoding='utf-8')

Handling Large SAS Files

SAS datasets can be very large. For these cases, you can process the file in chunks:

python
import pandas as pd

# Create an iterator
sas_file = pd.read_sas('large_dataset.sas7bdat', chunksize=10000, iterator=True)

# Process the file in chunks
total_rows = 0
for chunk in sas_file:
# Process each chunk
total_rows += len(chunk)
# Example operation: Calculate mean of a column
print(f"Chunk mean of 'value' column: {chunk['value'].mean()}")

print(f"Total rows processed: {total_rows}")

Real-world Applications

Example 1: Analyzing Healthcare Data

Many healthcare organizations use SAS for data management. When transitioning to Python analytics:

python
import pandas as pd
import matplotlib.pyplot as plt

# Import patient data from SAS
patients = pd.read_sas('patient_records.sas7bdat')

# Basic data exploration
print(f"Dataset shape: {patients.shape}")
print("\nColumn names:")
print(patients.columns.tolist())

# Summary statistics
print("\nSummary statistics for numerical columns:")
print(patients.describe())

# Example analysis: Age distribution
plt.figure(figsize=(10, 6))
patients['age'].hist(bins=20)
plt.title('Patient Age Distribution')
plt.xlabel('Age')
plt.ylabel('Count')
plt.grid(False)
plt.show()

Example 2: Converting and Merging Multiple SAS Files

Often you'll need to combine data from multiple SAS files:

python
import pandas as pd
import os

# Directory containing SAS files
sas_dir = 'data/sas_files/'

# List to hold DataFrames
dfs = []

# Read all SAS7BDAT files in the directory
for file in os.listdir(sas_dir):
if file.endswith('.sas7bdat'):
print(f"Reading {file}...")
file_path = os.path.join(sas_dir, file)
df = pd.read_sas(file_path)

# Add a source column to track the file origin
df['source_file'] = file
dfs.append(df)

# Combine all DataFrames
combined_df = pd.concat(dfs, ignore_index=True)

# Save as a CSV for future use
combined_df.to_csv('combined_data.csv', index=False)

print(f"Combined {len(dfs)} SAS files into one DataFrame with {combined_df.shape[0]} rows and {combined_df.shape[1]} columns")

Using PyReadStat for Enhanced SAS Import

For more robust SAS file handling, you can use the pyreadstat library:

python
import pandas as pd
import pyreadstat

# Read SAS file with pyreadstat
df, meta = pyreadstat.read_sas7bdat('dataset.sas7bdat')

# The metadata contains valuable information
print("Variable labels:")
print(meta.column_names_to_labels)

print("\nValue labels for categorical variables:")
for var, labels in meta.value_labels.items():
print(f"\n{var}:")
for value, label in labels.items():
print(f" {value}: {label}")

# Work with the DataFrame as usual
print("\nFirst few rows:")
print(df.head())

Common Challenges and Solutions

Missing Variable Labels

SAS files often contain variable labels that get lost during import:

python
import pandas as pd
import pyreadstat

# Import with metadata
df, meta = pyreadstat.read_sas7bdat('dataset.sas7bdat')

# Apply variable labels as column descriptions
for col in df.columns:
if col in meta.column_names_to_labels:
df[col].attrs['label'] = meta.column_names_to_labels[col]

# Display the data with labels
for col in df.columns:
label = df[col].attrs.get('label', 'No label')
print(f"{col}: {label}")
print(df[col].head())
print()

Handling SAS Formats and Informats

SAS uses formats and informats to control how values are displayed and read:

python
import pandas as pd
import pyreadstat

# Read with format information
df, meta = pyreadstat.read_sas7bdat('dataset.sas7bdat')

# Display format information
print("SAS Formats:")
for col, fmt in zip(meta.column_names, meta.column_formats):
print(f"{col}: {fmt}")

# Example: Convert SAS date format to pandas datetime
# SAS dates are days since January 1, 1960
import numpy as np
from pandas import Timestamp

def sas_date_to_datetime(sas_date):
if np.isnan(sas_date):
return pd.NaT
return Timestamp('1960-01-01') + pd.Timedelta(days=int(sas_date))

# Apply conversion if needed
if 'date_column' in df.columns:
df['date_column_converted'] = df['date_column'].apply(sas_date_to_datetime)

Summary

In this guide, we've covered how to:

  • Import SAS7BDAT and SAS XPORT files into pandas DataFrames
  • Handle different encodings in SAS files
  • Process large SAS files efficiently using chunks
  • Work with SAS metadata including variable labels and formats
  • Apply pandas and PyReadStat for enhanced SAS file handling

Being able to import SAS files into pandas allows you to leverage Python's data science ecosystem while working with existing SAS data, creating a bridge between traditional statistical software and modern data science tools.

Additional Resources

Exercises

  1. Import a SAS7BDAT file and explore its contents, identifying the data types and summary statistics.
  2. Use the chunk processing approach to calculate running statistics on a large SAS file.
  3. Import multiple SAS files and join them based on a common key.
  4. Extract and apply SAS formats to format numerical values appropriately in pandas.
  5. Convert a pandas DataFrame to a SAS file using the to_sas method for interoperability with SAS users.


If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)