
Pandas Data Profiling

Data profiling is a crucial first step in any data cleaning or analysis project. It helps you understand the structure, content, and quality of your dataset before diving into deeper analysis. In this tutorial, we'll explore various techniques to profile your pandas DataFrames effectively.

What is Data Profiling?

Data profiling is the process of examining, analyzing, and creating useful summaries of data. The goal is to:

  • Understand the structure of your data
  • Identify quality issues (missing values, outliers)
  • Discover patterns and relationships
  • Get a general feel for your dataset

Let's learn how to profile pandas DataFrames using both built-in methods and external libraries.

Basic Data Profiling with Pandas

Creating a Sample Dataset

First, let's create a sample dataset to work with throughout this tutorial:

python
import pandas as pd
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

# Create a sample dataset
data = {
    'age': np.random.randint(18, 70, 100),
    'income': np.random.randint(20000, 100000, 100),
    'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], 100),
    'satisfaction': np.random.choice([1, 2, 3, 4, 5], 100),
    'customer_since': pd.date_range(start='2018-01-01', periods=100, freq='D'),
    'last_purchase': pd.date_range(start='2022-01-01', periods=100, freq='D')
}

df = pd.DataFrame(data)

# Add some missing values (replace=False guarantees exactly 10, 5, and 8 distinct rows)
df.loc[np.random.choice(df.index, 10, replace=False), 'income'] = np.nan
df.loc[np.random.choice(df.index, 5, replace=False), 'education'] = np.nan
df.loc[np.random.choice(df.index, 8, replace=False), 'satisfaction'] = np.nan

print(df.head())

Output:

   age   income education  satisfaction customer_since last_purchase
0   61  69394.0  Bachelor           3.0     2018-01-01    2022-01-01
1   37  21711.0  Bachelor           1.0     2018-01-02    2022-01-02
2   59  87004.0       PhD           4.0     2018-01-03    2022-01-03
3   67      NaN  Bachelor           5.0     2018-01-04    2022-01-04
4   36  42675.0  Bachelor           1.0     2018-01-05    2022-01-05

Basic Information Methods

Pandas provides several built-in methods for quick data profiling:

1. df.info() - General Information

This method provides a concise summary of the DataFrame:

python
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             100 non-null    int64
 1   income          90 non-null     float64
 2   education       95 non-null     object
 3   satisfaction    92 non-null     float64
 4   customer_since  100 non-null    datetime64[ns]
 5   last_purchase   100 non-null    datetime64[ns]
dtypes: datetime64[ns](2), float64(2), int64(1), object(1)
memory usage: 4.8+ KB

The info() method shows:

  • Number of non-null entries in each column
  • Data types of each column
  • Memory usage of the DataFrame
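
The memory figure that info() reports is only an estimate for object (string) columns, which is why it ends in a +. To measure the actual bytes consumed, you can pass deep=True to memory_usage():

python
# deep=True inspects each object value instead of estimating
print(df.memory_usage(deep=True))
print(f"Total: {df.memory_usage(deep=True).sum() / 1024:.1f} KB")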

2. df.describe() - Statistical Summary

This method generates descriptive statistics for numerical columns:

python
df.describe()

Output:

              age        income  satisfaction
count  100.000000     90.000000     92.000000
mean    42.380000  59273.300000      2.956522
std     15.383842  24020.298320      1.430841
min     18.000000  20126.000000      1.000000
25%     29.750000  40291.250000      2.000000
50%     42.000000  58674.500000      3.000000
75%     54.500000  79582.250000      4.000000
max     69.000000  99568.000000      5.000000

For categorical data, you can use:

python
df['education'].describe()

Output:

count           95
unique           4
top       Bachelor
freq            37
Name: education, dtype: object
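
To summarize every column at once, numeric and non-numeric alike, pass include='all'; statistics that don't apply to a given column simply show up as NaN:

python
# Describe all columns in one table; inapplicable statistics appear as NaN
print(df.describe(include='all'))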

3. df.nunique() - Counting Unique Values

To see how many unique values exist in each column:

python
df.nunique()

Output:

age                50
income             90
education           4
satisfaction        5
customer_since    100
last_purchase     100
dtype: int64
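
A useful heuristic on top of nunique() (a rule of thumb, not a pandas feature) is the ratio of unique values to row count: ratios near 1.0 hint at identifier-like columns, while very low ratios hint at near-constant ones. A minimal sketch:

python
# Ratio of unique values to total rows for each column
uniqueness = df.nunique() / len(df)
print(uniqueness.sort_values())

# Columns at either extreme deserve a closer look
print(uniqueness[(uniqueness < 0.05) | (uniqueness > 0.95)])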

4. Looking for Missing Values

python
# Count of missing values per column
print(df.isnull().sum())

# Percentage of missing values
print((df.isnull().sum() / len(df)) * 100)

Output:

age               0
income           10
education         5
satisfaction      8
customer_since    0
last_purchase     0
dtype: int64

age                0.0
income            10.0
education          5.0
satisfaction       8.0
customer_since     0.0
last_purchase      0.0
dtype: float64
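
Both views can also be combined into a single summary table, which is easier to scan for wide DataFrames:

python
# One table with both missing counts and percentages, worst columns first
missing_summary = pd.DataFrame({
    'missing_count': df.isnull().sum(),
    'missing_pct': (df.isnull().sum() / len(df)) * 100
}).sort_values('missing_pct', ascending=False)

print(missing_summary)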

5. Value Distributions

For categorical variables, let's check the value distributions:

python
df['education'].value_counts()

Output:

Bachelor       37
High School    27
Master         16
PhD            15
Name: education, dtype: int64
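
Passing normalize=True returns proportions instead of raw counts, which makes distributions easier to compare across datasets of different sizes:

python
# Proportions instead of counts
print(df['education'].value_counts(normalize=True))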

For numeric variables, we can create histograms:

python
import matplotlib.pyplot as plt

df['age'].hist(bins=10, figsize=(8, 4))
plt.show()
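
A boxplot is another quick way to inspect spread: pandas draws the whiskers at 1.5 times the IQR, so points beyond them are potential outliers (the same rule used later in this tutorial):

python
# Points beyond the whiskers (1.5 * IQR) are potential outliers
df.boxplot(column='income', figsize=(6, 4))
plt.show()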

Advanced Profiling with pandas-profiling

While pandas' built-in methods are useful, the pandas-profiling library (renamed to ydata-profiling as of version 4.0) takes data profiling to another level by generating a comprehensive HTML report of your DataFrame.

Installation

First, you need to install the library:

bash
pip install pandas-profiling

For version 4.0 and later, the package is published as ydata-profiling (pip install ydata-profiling); apart from the import, the API shown below is unchanged.

Basic Usage

python
from pandas_profiling import ProfileReport
# On ydata-profiling 4.x, import instead with:
# from ydata_profiling import ProfileReport

# Generate the report
profile = ProfileReport(df, title="Dataset Profiling Report", explorative=True)

# Save the report as HTML
profile.to_file("data_profile_report.html")

# In Jupyter notebook, you can display it directly:
# profile.to_notebook_iframe()

The generated HTML report includes:

  • Overview statistics
  • Variables details (type, unique values, missing values, etc.)
  • Interactions between variables
  • Correlations
  • Missing values patterns
  • And much more!

Customizing the Report

You can customize the profiling by passing additional parameters:

python
profile = ProfileReport(
    df,
    title="Custom Profiling Report",
    minimal=False,            # Set to True for a faster, minimal report
    explorative=True,         # Enables correlations, interactions, etc.
    html={'style': {'full_width': True}},
    correlations={
        'pearson': {'calculate': True},
        'spearman': {'calculate': True},
        'kendall': {'calculate': True},
    }
)

Creating Custom Profiling Functions

Sometimes you might need custom profiling functions for specific analytical needs:

python
def custom_profile(dataframe):
    """
    Custom function to profile a pandas DataFrame.
    """
    profile = {
        "shape": dataframe.shape,
        "column_types": dataframe.dtypes,
        "missing_values": dataframe.isnull().sum(),
        "missing_percentage": (dataframe.isnull().sum() / len(dataframe)) * 100,
        "unique_values": dataframe.nunique(),
    }

    # Add numeric column statistics
    numeric_columns = dataframe.select_dtypes(include=['int64', 'float64']).columns
    if len(numeric_columns) > 0:
        profile["numeric_stats"] = dataframe[numeric_columns].describe()
        profile["skewness"] = dataframe[numeric_columns].skew()
        profile["kurtosis"] = dataframe[numeric_columns].kurt()

    # Add categorical column statistics
    cat_columns = dataframe.select_dtypes(include=['object', 'category']).columns
    if len(cat_columns) > 0:
        profile["categorical_stats"] = {col: dataframe[col].value_counts() for col in cat_columns}

    return profile

# Use the custom profile function
profile_info = custom_profile(df)

# Print some profile information
print(f"Dataset shape: {profile_info['shape']}")
print("\nMissing value percentage:")
print(profile_info['missing_percentage'])
if 'skewness' in profile_info:
    print("\nSkewness (detects asymmetry in distribution):")
    print(profile_info['skewness'])

Output:

Dataset shape: (100, 6)

Missing value percentage:
age                0.0
income            10.0
education          5.0
satisfaction       8.0
customer_since     0.0
last_purchase      0.0
dtype: float64

Skewness (detects asymmetry in distribution):
age             0.006354
income          0.047287
satisfaction    0.061190
dtype: float64
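
As a common rule of thumb (a heuristic, not a hard threshold), an absolute skewness above 1 signals a strongly asymmetric distribution that may benefit from a transformation before modeling. A quick check on the profile above:

python
# Flag strongly skewed numeric columns (|skew| > 1 is a common heuristic)
skewed = profile_info['skewness'][profile_info['skewness'].abs() > 1]
print(skewed if not skewed.empty else "No strongly skewed numeric columns")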

Real-World Example: Profiling Customer Data

Let's implement a more complete real-world example by profiling customer data and making recommendations based on findings:

python
def analyze_customer_dataset(df):
    """
    Analyze a customer dataset and provide recommendations for data cleaning.
    """
    print("===== CUSTOMER DATASET ANALYSIS =====\n")

    # Basic information
    print(f"Dataset dimensions: {df.shape[0]} rows and {df.shape[1]} columns")
    print(f"Memory usage: {df.memory_usage().sum() / 1024:.2f} KB")

    # Missing values analysis
    missing = df.isnull().sum()
    missing_percent = (missing / len(df)) * 100
    print("\n----- Missing Values Analysis -----")
    for col, count in missing.items():
        if count > 0:
            print(f"{col}: {count} missing values ({missing_percent[col]:.1f}%)")

    # Data type analysis
    print("\n----- Data Type Analysis -----")
    for col, dtype in df.dtypes.items():
        print(f"{col}: {dtype}")

    # Value distribution analysis
    print("\n----- Value Distribution -----")

    # For categorical columns
    cat_columns = df.select_dtypes(include=['object', 'category']).columns
    for col in cat_columns:
        value_counts = df[col].value_counts()
        print(f"\n{col} distribution:")
        for val, count in value_counts.items():
            print(f"  - {val}: {count} ({count/len(df):.1%})")

    # For numeric columns - basic stats
    num_columns = df.select_dtypes(include=['int64', 'float64']).columns
    print("\nNumeric columns statistics:")
    for col in num_columns:
        stats = df[col].describe()
        print(f"\n{col}:")
        print(f"  Range: {stats['min']:.2f} to {stats['max']:.2f}")
        print(f"  Mean: {stats['mean']:.2f}")
        print(f"  Median: {stats['50%']:.2f}")
        print(f"  Std Dev: {stats['std']:.2f}")

        # Check for potential outliers using the IQR method
        Q1 = stats['25%']
        Q3 = stats['75%']
        IQR = Q3 - Q1
        outliers = df[(df[col] < (Q1 - 1.5 * IQR)) | (df[col] > (Q3 + 1.5 * IQR))]
        if len(outliers) > 0:
            print(f"  Potential outliers: {len(outliers)} values")

    # Date columns analysis
    date_columns = df.select_dtypes(include=['datetime64']).columns
    if len(date_columns) > 0:
        print("\n----- Date Analysis -----")
        for col in date_columns:
            print(f"\n{col}:")
            print(f"  Range: {df[col].min()} to {df[col].max()}")
            print(f"  Time span: {(df[col].max() - df[col].min()).days} days")

    # Recommendations based on findings
    print("\n===== RECOMMENDATIONS =====")

    # Missing values recommendations
    cols_with_missing = [col for col in df.columns if missing[col] > 0]
    if cols_with_missing:
        print("\n1. Missing Values:")
        for col in cols_with_missing:
            if missing_percent[col] < 5:
                print(f"  - {col}: Low missing rate ({missing_percent[col]:.1f}%). Consider imputation.")
            elif missing_percent[col] < 20:
                print(f"  - {col}: Moderate missing rate ({missing_percent[col]:.1f}%). Impute or use models that handle missing data.")
            else:
                print(f"  - {col}: High missing rate ({missing_percent[col]:.1f}%). Consider if this column should be kept.")

    # Data type recommendations (substring match so names like 'customer_since' are caught)
    print("\n2. Data Types:")
    for col in df.columns:
        if any(key in col.lower() for key in ['age', 'year', 'count', 'number']):
            if df[col].dtype != 'int64':
                print(f"  - {col}: Consider converting to integer type.")
        elif any(key in col.lower() for key in ['date', 'time', 'since', 'until', 'purchase']):
            if not pd.api.types.is_datetime64_dtype(df[col]):
                print(f"  - {col}: Consider converting to datetime format.")

    # Column-specific recommendations
    print("\n3. Column-Specific Insights:")
    if 'age' in df.columns:
        if df['age'].min() < 0 or df['age'].max() > 120:
            print("  - age: Contains implausible values. Clean or filter.")

    if 'income' in df.columns:
        if df['income'].min() < 0:
            print("  - income: Contains negative values which may need correction.")

    return "Analysis complete!"

# Run the analysis on our sample data
analyze_customer_dataset(df)

This real-world example demonstrates how to:

  1. Get a comprehensive view of your dataset
  2. Identify potential data quality issues
  3. Generate recommendations for data cleaning based on the profile

Summary

Data profiling is an essential first step in any data analysis project. It helps you understand your data's structure, identify quality issues, and plan your cleaning strategy. In this tutorial, we covered:

  1. Basic pandas profiling methods (info(), describe(), nunique())
  2. Advanced profiling using the pandas-profiling library
  3. Custom profiling functions for specific needs
  4. A real-world example of profiling customer data

By mastering these techniques, you'll be able to approach any dataset with confidence and develop a solid plan for cleaning and analysis.


Exercises

  1. Download a public dataset of your choice and create a comprehensive profile using both pandas built-in methods and pandas-profiling.

  2. Create a custom profiling function that detects potential data quality issues like:

    • Inconsistent date formats
    • Text fields with numeric values
    • Unexpected categorical values

  3. Modify the analyze_customer_dataset function to include correlation analysis between variables.

  4. Create a visualization dashboard that shows key profiling metrics from a dataset using Matplotlib or Seaborn.

  5. Compare the performance (speed and memory usage) of pandas built-in profiling versus pandas-profiling on a large dataset (>100,000 rows).


