Pandas Data Profiling
Data profiling is a crucial first step in any data cleaning or analysis project. It helps you understand the structure, content, and quality of your dataset before diving into deeper analysis. In this tutorial, we'll explore various techniques to profile your pandas DataFrames effectively.
What is Data Profiling?
Data profiling is the process of examining, analyzing, and creating useful summaries of data. The goal is to:
- Understand the structure of your data
- Identify quality issues (missing values, outliers)
- Discover patterns and relationships
- Get a general feel for your dataset
Let's learn how to profile pandas DataFrames using both built-in methods and external libraries.
Basic Data Profiling with Pandas
Creating a Sample Dataset
First, let's create a sample dataset to work with throughout this tutorial:
import pandas as pd
import numpy as np
# Set random seed for reproducibility
np.random.seed(42)
# Create a sample dataset
data = {
    'age': np.random.randint(18, 70, 100),
    'income': np.random.randint(20000, 100000, 100),
    'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], 100),
    'satisfaction': np.random.choice([1, 2, 3, 4, 5], 100),
    'customer_since': pd.date_range(start='2018-01-01', periods=100, freq='D'),
    'last_purchase': pd.date_range(start='2022-01-01', periods=100, freq='D')
}
df = pd.DataFrame(data)
# Add some missing values (replace=False guarantees exactly this many distinct rows)
df.loc[np.random.choice(df.index, 10, replace=False), 'income'] = np.nan
df.loc[np.random.choice(df.index, 5, replace=False), 'education'] = np.nan
df.loc[np.random.choice(df.index, 8, replace=False), 'satisfaction'] = np.nan
print(df.head())
Output:
   age   income education  satisfaction customer_since last_purchase
0   61  69394.0  Bachelor           3.0     2018-01-01    2022-01-01
1   37  21711.0  Bachelor           1.0     2018-01-02    2022-01-02
2   59  87004.0       PhD           4.0     2018-01-03    2022-01-03
3   67      NaN  Bachelor           5.0     2018-01-04    2022-01-04
4   36  42675.0  Bachelor           1.0     2018-01-05    2022-01-05
Basic Information Methods
Pandas provides several built-in methods for quick data profiling:
1. df.info() - General Information
This method provides a concise summary of the DataFrame:
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             100 non-null    int64
 1   income          90 non-null     float64
 2   education       95 non-null     object
 3   satisfaction    92 non-null     float64
 4   customer_since  100 non-null    datetime64[ns]
 5   last_purchase   100 non-null    datetime64[ns]
dtypes: datetime64[ns](2), float64(2), int64(1), object(1)
memory usage: 4.8+ KB
The info() method shows:
- Number of non-null entries in each column
- Data types of each column
- Memory usage of the DataFrame
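Note that the trailing "+" in "4.8+ KB" means the figure is only a lower bound: by default pandas estimates, rather than measures, the memory used by object (string) columns. To measure the true footprint, pass memory_usage='deep' (slower on large frames):
# Measure actual memory usage, including object (string) columns
df.info(memory_usage='deep')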
2. df.describe() - Statistical Summary
This method generates descriptive statistics for numerical columns:
df.describe()
Output:
              age       income  satisfaction
count  100.000000     90.00000     92.000000
mean    42.380000  59273.30000      2.956522
std     15.383842  24020.29832      1.430841
min     18.000000  20126.00000      1.000000
25%     29.750000  40291.25000      2.000000
50%     42.000000  58674.50000      3.000000
75%     54.500000  79582.25000      4.000000
max     69.000000  99568.00000      5.000000
For categorical data, you can use:
df['education'].describe()
Output:
count 95
unique 4
top Bachelor
freq 37
Name: education, dtype: object
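To summarize numeric and non-numeric columns in a single call, pass include='all' (statistics that don't apply to a column are shown as NaN):
# Summarize every column at once
df.describe(include='all')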
3. df.nunique() - Counting Unique Values
To see how many unique values exist in each column:
df.nunique()
Output:
age 50
income 90
education 4
satisfaction 5
customer_since 100
last_purchase 100
dtype: int64
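Unique-value counts are also a quick way to spot problem columns. As a small sketch (the thresholds here are assumptions, adjust them to your data): a column with a single unique value is constant and carries no information, while a column with as many unique values as rows is probably an identifier:
# Flag constant and identifier-like columns based on unique counts
n_unique = df.nunique()
print("Constant columns:", n_unique[n_unique == 1].index.tolist())
print("Likely ID columns:", n_unique[n_unique == len(df)].index.tolist())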
4. Looking for Missing Values
# Count of missing values per column
print(df.isnull().sum())
# Percentage of missing values
print((df.isnull().sum() / len(df)) * 100)
Output:
age 0
income 10
education 5
satisfaction 8
customer_since 0
last_purchase 0
dtype: int64
age 0.0
income 10.0
education 5.0
satisfaction 8.0
customer_since 0.0
last_purchase 0.0
dtype: float64
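It is often handy to combine both views into one table sorted by severity; a minimal helper using only pandas:
# Build a missing-value summary table, worst columns first
missing_summary = pd.DataFrame({
    'missing_count': df.isnull().sum(),
    'missing_percent': df.isnull().mean() * 100,
}).sort_values('missing_percent', ascending=False)
print(missing_summary)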
5. Value Distributions
For categorical variables, let's check the value distributions:
df['education'].value_counts()
Output:
Bachelor 37
High School 27
Master 16
PhD 15
Name: education, dtype: int64
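By default, value_counts() ignores missing values. To see proportions instead of counts and keep the NaN bucket visible, use:
df['education'].value_counts(normalize=True, dropna=False)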
For numeric variables, we can create histograms (pandas plots via matplotlib, so import it to display the figure):
import matplotlib.pyplot as plt
df['age'].hist(bins=10, figsize=(8, 4))
plt.show()
Advanced Profiling with pandas-profiling
While pandas' built-in methods are useful, the pandas-profiling library takes data profiling to another level by generating comprehensive HTML reports of your DataFrame.
Installation
First, you need to install the library:
pip install pandas-profiling
Note: the project has since been renamed to ydata-profiling, so on recent setups you may need pip install ydata-profiling instead, along with the matching import shown below.
Basic Usage
from pandas_profiling import ProfileReport  # on newer versions: from ydata_profiling import ProfileReport
# Generate the report
profile = ProfileReport(df, title="Dataset Profiling Report", explorative=True)
# Save the report as HTML
profile.to_file("data_profile_report.html")
# In Jupyter notebook, you can display it directly:
# profile.to_notebook_iframe()
The generated HTML report includes:
- Overview statistics
- Variables details (type, unique values, missing values, etc.)
- Interactions between variables
- Correlations
- Missing values patterns
- And much more!
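On large datasets the full report can take a while to compute. One common workaround (a sketch, not a feature of the library itself) is to profile a random sample instead:
# Profile a sample to keep report generation fast on large data
sample = df.sample(n=min(len(df), 10_000), random_state=0)
ProfileReport(sample, title="Sample Profiling Report").to_file("sample_report.html")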
Customizing the Report
You can customize the profiling by passing additional parameters:
profile = ProfileReport(
    df,
    title="Custom Profiling Report",
    minimal=False,        # Set to True for a faster, minimal report
    explorative=True,     # Enables correlations, etc.
    html={'style': {'full_width': True}},
    correlations={
        'pearson': {'calculate': True},
        'spearman': {'calculate': True},
        'kendall': {'calculate': True},
    }
)
Creating Custom Profiling Functions
Sometimes you might need custom profiling functions for specific analytical needs:
def custom_profile(dataframe):
    """
    Custom function to profile a pandas DataFrame.
    """
    profile = {
        "shape": dataframe.shape,
        "column_types": dataframe.dtypes,
        "missing_values": dataframe.isnull().sum(),
        "missing_percentage": (dataframe.isnull().sum() / len(dataframe)) * 100,
        "unique_values": dataframe.nunique(),
    }

    # Add numeric column statistics
    numeric_columns = dataframe.select_dtypes(include=['int64', 'float64']).columns
    if len(numeric_columns) > 0:
        profile["numeric_stats"] = dataframe[numeric_columns].describe()
        profile["skewness"] = dataframe[numeric_columns].skew()
        profile["kurtosis"] = dataframe[numeric_columns].kurt()

    # Add categorical column statistics
    cat_columns = dataframe.select_dtypes(include=['object', 'category']).columns
    if len(cat_columns) > 0:
        profile["categorical_stats"] = {col: dataframe[col].value_counts() for col in cat_columns}

    return profile
# Use the custom profile function
profile_info = custom_profile(df)
# Print some profile information
print(f"Dataset shape: {profile_info['shape']}")
print("\nMissing value percentage:")
print(profile_info['missing_percentage'])
if 'skewness' in profile_info:
    print("\nSkewness (detects asymmetry in distribution):")
    print(profile_info['skewness'])
Output:
Dataset shape: (100, 6)
Missing value percentage:
age 0.0
income 10.0
education 5.0
satisfaction 8.0
customer_since 0.0
last_purchase 0.0
dtype: float64
Skewness (detects asymmetry in distribution):
age 0.006354
income 0.047287
satisfaction 0.061190
dtype: float64
Real-World Example: Profiling Customer Data
Let's implement a more complete real-world example by profiling customer data and making recommendations based on findings:
def analyze_customer_dataset(df):
    """
    Analyze a customer dataset and provide recommendations for data cleaning.
    """
    print("===== CUSTOMER DATASET ANALYSIS =====\n")

    # Basic information
    print(f"Dataset dimensions: {df.shape[0]} rows and {df.shape[1]} columns")
    print(f"Memory usage: {df.memory_usage().sum() / 1024:.2f} KB")

    # Missing values analysis
    missing = df.isnull().sum()
    missing_percent = (missing / len(df)) * 100
    print("\n----- Missing Values Analysis -----")
    for col, count in missing.items():
        if count > 0:
            print(f"{col}: {count} missing values ({missing_percent[col]:.1f}%)")

    # Data type analysis
    print("\n----- Data Type Analysis -----")
    for col, dtype in df.dtypes.items():
        print(f"{col}: {dtype}")

    # Value distribution analysis
    print("\n----- Value Distribution -----")

    # For categorical columns
    cat_columns = df.select_dtypes(include=['object', 'category']).columns
    for col in cat_columns:
        value_counts = df[col].value_counts()
        print(f"\n{col} distribution:")
        for val, count in value_counts.items():
            print(f"  - {val}: {count} ({count/len(df):.1%})")

    # For numeric columns - basic stats
    num_columns = df.select_dtypes(include=['int64', 'float64']).columns
    print("\nNumeric columns statistics:")
    for col in num_columns:
        stats = df[col].describe()
        print(f"\n{col}:")
        print(f"  Range: {stats['min']:.2f} to {stats['max']:.2f}")
        print(f"  Mean: {stats['mean']:.2f}")
        print(f"  Median: {stats['50%']:.2f}")
        print(f"  Std Dev: {stats['std']:.2f}")

        # Check for potential outliers using the IQR method
        Q1 = stats['25%']
        Q3 = stats['75%']
        IQR = Q3 - Q1
        outliers = df[(df[col] < (Q1 - 1.5 * IQR)) | (df[col] > (Q3 + 1.5 * IQR))]
        if len(outliers) > 0:
            print(f"  Potential outliers: {len(outliers)} values")

    # Date columns analysis
    date_columns = df.select_dtypes(include=['datetime64']).columns
    if len(date_columns) > 0:
        print("\n----- Date Analysis -----")
        for col in date_columns:
            print(f"\n{col}:")
            print(f"  Range: {df[col].min()} to {df[col].max()}")
            print(f"  Time span: {(df[col].max() - df[col].min()).days} days")

    # Recommendations based on findings
    print("\n===== RECOMMENDATIONS =====")

    # Missing values recommendations
    cols_with_missing = [col for col in df.columns if missing[col] > 0]
    if cols_with_missing:
        print("\n1. Missing Values:")
        for col in cols_with_missing:
            if missing_percent[col] < 5:
                print(f"  - {col}: Low missing rate ({missing_percent[col]:.1f}%). Consider imputation.")
            elif missing_percent[col] < 20:
                print(f"  - {col}: Moderate missing rate ({missing_percent[col]:.1f}%). Impute or use models that handle missing data.")
            else:
                print(f"  - {col}: High missing rate ({missing_percent[col]:.1f}%). Consider if this column should be kept.")

    # Data type recommendations (substring match, so names like
    # 'customer_since' or 'last_purchase' are also caught)
    print("\n2. Data Types:")
    for col in df.columns:
        name = col.lower()
        if any(key in name for key in ['age', 'year', 'count', 'number']):
            if df[col].dtype != 'int64':
                print(f"  - {col}: Consider converting to integer type.")
        elif any(key in name for key in ['date', 'time', 'since', 'until', 'purchase']):
            if not pd.api.types.is_datetime64_dtype(df[col]):
                print(f"  - {col}: Consider converting to datetime format.")

    # Column-specific recommendations
    print("\n3. Column-Specific Insights:")
    if 'age' in df.columns:
        if df['age'].min() < 0 or df['age'].max() > 120:
            print("  - age: Contains implausible values. Clean or filter.")
    if 'income' in df.columns:
        if df['income'].min() < 0:
            print("  - income: Contains negative values which may need correction.")

    return "Analysis complete!"
# Run the analysis on our sample data
analyze_customer_dataset(df)
This real-world example demonstrates how to:
- Get a comprehensive view of your dataset
- Identify potential data quality issues
- Generate recommendations for data cleaning based on the profile
Summary
Data profiling is an essential first step in any data analysis project. It helps you understand your data's structure, identify quality issues, and plan your cleaning strategy. In this tutorial, we covered:
- Basic pandas profiling methods (info(), describe(), nunique())
- Advanced profiling using the pandas-profiling library
- Custom profiling functions for specific needs
- A real-world example of profiling customer data
By mastering these techniques, you'll be able to approach any dataset with confidence and develop a solid plan for cleaning and analysis.
Additional Resources
- Pandas Documentation
- pandas-profiling GitHub Repository
- Great Expectations - A more advanced data validation framework
Exercises
1. Download a public dataset of your choice and create a comprehensive profile using both pandas built-in methods and pandas-profiling.
2. Create a custom profiling function that detects potential data quality issues like:
   - Inconsistent date formats
   - Text fields with numeric values
   - Unexpected categorical values
3. Modify the analyze_customer_dataset function to include correlation analysis between variables.
4. Create a visualization dashboard that shows key profiling metrics from a dataset using Matplotlib or Seaborn.
5. Compare the performance (speed and memory usage) of pandas built-in profiling versus pandas-profiling on a large dataset (>100,000 rows).