Pandas SAS Import

Introduction

SAS (Statistical Analysis System) is a widely used software suite for advanced analytics, business intelligence, and data management. When transitioning from SAS to Python or working in environments that use both tools, you'll often need to import SAS data files into pandas.

Pandas provides functionality to read SAS data files directly, making it easy to incorporate existing SAS data into your Python-based data analysis workflow. In this guide, we'll explore how to import different types of SAS files into pandas DataFrames.

Prerequisites

Before we begin, make sure you have the required libraries installed:

pip install pandas
pip install pyreadstat  # For better SAS file support

Reading SAS Data Files

Importing SAS7BDAT Files

The most common SAS data file format is .sas7bdat. Pandas provides a read_sas() function to import these files.

Basic Usage

import pandas as pd

# Read a SAS7BDAT file
df = pd.read_sas('dataset.sas7bdat')

# Display the first few rows
print(df.head())

Example with Output

Let's see how this works with a sample SAS dataset:

import pandas as pd

# Read a sample SAS7BDAT file
df = pd.read_sas('sample_sales.sas7bdat')

# Display the first 5 rows
print(df.head())

Output:

   region  product  year   quarter  sales
North   Widget   2020   Q1       45000
North   Widget   2020   Q2       52000
North   Gadget   2020   Q1       12000
South   Widget   2020   Q1       28000
South   Gadget   2020   Q2       31000

Customizing Import Options

The read_sas() function offers several parameters to customize how SAS files are imported:

import pandas as pd

# Read with additional options
df = pd.read_sas(
    'dataset.sas7bdat',
    encoding='latin-1',        # Specify encoding if needed
    chunksize=10000,           # Read file in chunks for large files
    format='sas7bdat',         # Explicitly specify format
    iterator=True              # Return an iterator for processing large files
)

# When using iterator=True, you need to read chunks:
chunk = df.read(5)  # Read first 5 rows
print(chunk)

Reading SAS XPORT Files

SAS XPORT format (.xpt) is another common SAS file format, especially used for data exchange. Pandas can read these files too:

import pandas as pd

# Read a SAS XPORT file
df = pd.read_sas('dataset.xpt', format='xport')

# Show basic information
print(df.info())
print(df.head())

Working with SAS File Encodings

SAS files created on different systems might use different character encodings. You can specify the encoding when importing:

import pandas as pd

# Try different encodings if you encounter character issues
df_latin = pd.read_sas('dataset.sas7bdat', encoding='latin-1')
df_utf8 = pd.read_sas('dataset.sas7bdat', encoding='utf-8')

Handling Large SAS Files

SAS datasets can be very large. For these cases, you can process the file in chunks:

import pandas as pd

# Create an iterator
sas_file = pd.read_sas('large_dataset.sas7bdat', chunksize=10000, iterator=True)

# Process the file in chunks
total_rows = 0
for chunk in sas_file:
    # Process each chunk
    total_rows += len(chunk)
    # Example operation: Calculate mean of a column
    print(f"Chunk mean of 'value' column: {chunk['value'].mean()}")

print(f"Total rows processed: {total_rows}")

Real-world Applications

Example 1: Analyzing Healthcare Data

Many healthcare organizations use SAS for data management. When transitioning to Python analytics:

import pandas as pd
import matplotlib.pyplot as plt

# Import patient data from SAS
patients = pd.read_sas('patient_records.sas7bdat')

# Basic data exploration
print(f"Dataset shape: {patients.shape}")
print("\nColumn names:")
print(patients.columns.tolist())

# Summary statistics
print("\nSummary statistics for numerical columns:")
print(patients.describe())

# Example analysis: Age distribution
plt.figure(figsize=(10, 6))
patients['age'].hist(bins=20)
plt.title('Patient Age Distribution')
plt.xlabel('Age')
plt.ylabel('Count')
plt.grid(False)
plt.show()

Example 2: Converting and Merging Multiple SAS Files

Often you'll need to combine data from multiple SAS files:

import pandas as pd
import os

# Directory containing SAS files
sas_dir = 'data/sas_files/'

# List to hold DataFrames
dfs = []

# Read all SAS7BDAT files in the directory
for file in os.listdir(sas_dir):
    if file.endswith('.sas7bdat'):
        print(f"Reading {file}...")
        file_path = os.path.join(sas_dir, file)
        df = pd.read_sas(file_path)
        
        # Add a source column to track the file origin
        df['source_file'] = file
        dfs.append(df)

# Combine all DataFrames 
combined_df = pd.concat(dfs, ignore_index=True)

# Save as a CSV for future use
combined_df.to_csv('combined_data.csv', index=False)

print(f"Combined {len(dfs)} SAS files into one DataFrame with {combined_df.shape[0]} rows and {combined_df.shape[1]} columns")

Using PyReadStat for Enhanced SAS Import

For more robust SAS file handling, you can use the pyreadstat library:

import pandas as pd
import pyreadstat

# Read SAS file with pyreadstat
df, meta = pyreadstat.read_sas7bdat('dataset.sas7bdat')

# The metadata contains valuable information
print("Variable labels:")
print(meta.column_names_to_labels)

print("\nValue labels for categorical variables:")
for var, labels in meta.value_labels.items():
    print(f"\n{var}:")
    for value, label in labels.items():
        print(f"  {value}: {label}")

# Work with the DataFrame as usual
print("\nFirst few rows:")
print(df.head())

Common Challenges and Solutions

Missing Variable Labels

SAS files often contain variable labels that get lost during import:

import pandas as pd
import pyreadstat

# Import with metadata
df, meta = pyreadstat.read_sas7bdat('dataset.sas7bdat')

# Apply variable labels as column descriptions
for col in df.columns:
    if col in meta.column_names_to_labels:
        df[col].attrs['label'] = meta.column_names_to_labels[col]

# Display the data with labels
for col in df.columns:
    label = df[col].attrs.get('label', 'No label')
    print(f"{col}: {label}")
    print(df[col].head())
    print()

Handling SAS Formats and Informats

SAS uses formats and informats to control how values are displayed and read:

import pandas as pd
import pyreadstat

# Read with format information
df, meta = pyreadstat.read_sas7bdat('dataset.sas7bdat')

# Display format information
print("SAS Formats:")
for col, fmt in zip(meta.column_names, meta.column_formats):
    print(f"{col}: {fmt}")

# Example: Convert SAS date format to pandas datetime
# SAS dates are days since January 1, 1960
import numpy as np
from pandas import Timestamp

def sas_date_to_datetime(sas_date):
    if np.isnan(sas_date):
        return pd.NaT
    return Timestamp('1960-01-01') + pd.Timedelta(days=int(sas_date))

# Apply conversion if needed
if 'date_column' in df.columns:
    df['date_column_converted'] = df['date_column'].apply(sas_date_to_datetime)

Summary

In this guide, we've covered how to:

Import SAS7BDAT and SAS XPORT files into pandas DataFrames
Handle different encodings in SAS files
Process large SAS files efficiently using chunks
Work with SAS metadata including variable labels and formats
Apply pandas and PyReadStat for enhanced SAS file handling

Being able to import SAS files into pandas allows you to leverage Python's data science ecosystem while working with existing SAS data, creating a bridge between traditional statistical software and modern data science tools.

Additional Resources

Exercises

Import a SAS7BDAT file and explore its contents, identifying the data types and summary statistics.
Use the chunk processing approach to calculate running statistics on a large SAS file.
Import multiple SAS files and join them based on a common key.
Extract and apply SAS formats to format numerical values appropriately in pandas.
Convert a pandas DataFrame to a SAS file using the to_sas method for interoperability with SAS users.

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

Introduction​

Prerequisites​

Reading SAS Data Files​

Importing SAS7BDAT Files​

Basic Usage​

Example with Output​

Customizing Import Options​

Reading SAS XPORT Files​

Working with SAS File Encodings​

Handling Large SAS Files​

Real-world Applications​

Example 1: Analyzing Healthcare Data​

Example 2: Converting and Merging Multiple SAS Files​

Using PyReadStat for Enhanced SAS Import​

Common Challenges and Solutions​

Missing Variable Labels​

Handling SAS Formats and Informats​

Summary​

Additional Resources​

Exercises​