Pandas Data Profiling
Data profiling is a crucial first step in any data cleaning or analysis project. It helps you understand the structure, content, and quality of your dataset before diving into deeper analysis. In this tutorial, we'll explore various techniques to profile your pandas DataFrames effectively.
What is Data Profiling?
Data profiling is the process of examining, analyzing, and creating useful summaries of data. The goal is to:
- Understand the structure of your data
- Identify quality issues (missing values, outliers)
- Discover patterns and relationships
- Get a general feel for your dataset
Let's learn how to profile pandas DataFrames using both built-in methods and external libraries.
Basic Data Profiling with Pandas
Creating a Sample Dataset
First, let's create a sample dataset to work with throughout this tutorial:
import pandas as pd
import numpy as np
# Set random seed for reproducibility
np.random.seed(42)
# Create a sample dataset
data = {
    'age': np.random.randint(18, 70, 100),
    'income': np.random.randint(20000, 100000, 100),
    'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], 100),
    'satisfaction': np.random.choice([1, 2, 3, 4, 5], 100),
    'customer_since': pd.date_range(start='2018-01-01', periods=100, freq='D'),
    'last_purchase': pd.date_range(start='2022-01-01', periods=100, freq='D')
}
df = pd.DataFrame(data)
# Add some missing values (replace=False guarantees exactly this many distinct rows)
df.loc[np.random.choice(df.index, 10, replace=False), 'income'] = np.nan
df.loc[np.random.choice(df.index, 5, replace=False), 'education'] = np.nan
df.loc[np.random.choice(df.index, 8, replace=False), 'satisfaction'] = np.nan
print(df.head())
Output:
   age   income education  satisfaction customer_since last_purchase
0   61  69394.0  Bachelor           3.0     2018-01-01    2022-01-01
1   37  21711.0  Bachelor           1.0     2018-01-02    2022-01-02
2   59  87004.0       PhD           4.0     2018-01-03    2022-01-03
3   67      NaN  Bachelor           5.0     2018-01-04    2022-01-04
4   36  42675.0  Bachelor           1.0     2018-01-05    2022-01-05
Basic Information Methods
Pandas provides several built-in methods for quick data profiling:
1. df.info() - General Information
This method provides a concise summary of the DataFrame:
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             100 non-null    int64
 1   income          90 non-null     float64
 2   education       95 non-null     object
 3   satisfaction    92 non-null     float64
 4   customer_since  100 non-null    datetime64[ns]
 5   last_purchase   100 non-null    datetime64[ns]
dtypes: datetime64[ns](2), float64(2), int64(1), object(1)
memory usage: 4.8+ KB
The info() method shows:
- Number of non-null entries in each column
- Data types of each column
- Memory usage of the DataFrame
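Note that the trailing "+" in "4.8+ KB" means the figure is only a lower bound: by default pandas estimates, rather than measures, the memory used by object (string) columns. To measure the true footprint, pass memory_usage='deep' (slower on large frames):
# Measure actual memory usage, including object (string) columns
df.info(memory_usage='deep')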
2. df.describe() - Statistical Summary
This method generates descriptive statistics for numerical columns:
df.describe()
Output:
              age       income  satisfaction
count  100.000000     90.00000     92.000000
mean    42.380000  59273.30000      2.956522
std     15.383842  24020.29832      1.430841
min     18.000000  20126.00000      1.000000
25%     29.750000  40291.25000      2.000000
50%     42.000000  58674.50000      3.000000
75%     54.500000  79582.25000      4.000000
max     69.000000  99568.00000      5.000000
For categorical data, you can use:
df['education'].describe()
Output:
count 95
unique 4
top Bachelor
freq 37
Name: education, dtype: object
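To summarize numeric and non-numeric columns in a single call, pass include='all' (statistics that don't apply to a column are shown as NaN):
# Summarize every column at once
df.describe(include='all')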
3. df.nunique() - Counting Unique Values
To see how many unique values exist in each column:
df.nunique()
Output:
age 50
income 90
education 4
satisfaction 5
customer_since 100
last_purchase 100
dtype: int64
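Unique-value counts are also a quick way to spot problem columns. As a small sketch (the thresholds here are assumptions, adjust them to your data): a column with a single unique value is constant and carries no information, while a column with as many unique values as rows is probably an identifier:
# Flag constant and identifier-like columns based on unique counts
n_unique = df.nunique()
print("Constant columns:", n_unique[n_unique == 1].index.tolist())
print("Likely ID columns:", n_unique[n_unique == len(df)].index.tolist())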
4. Looking for Missing Values
# Count of missing values per column
print(df.isnull().sum())
# Percentage of missing values
print((df.isnull().sum() / len(df)) * 100)
Output:
age 0
income 10
education 5
satisfaction 8
customer_since 0
last_purchase 0
dtype: int64
age 0.0
income 10.0
education 5.0
satisfaction 8.0
customer_since 0.0
last_purchase 0.0
dtype: float64
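It is often handy to combine both views into one table sorted by severity; a minimal helper using only pandas:
# Build a missing-value summary table, worst columns first
missing_summary = pd.DataFrame({
    'missing_count': df.isnull().sum(),
    'missing_percent': df.isnull().mean() * 100,
}).sort_values('missing_percent', ascending=False)
print(missing_summary)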
5. Value Distributions
For categorical variables, let's check the value distributions:
df['education'].value_counts()
Output:
Bachelor 37
High School 27
Master 16
PhD 15
Name: education, dtype: int64
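By default, value_counts() ignores missing values. To see proportions instead of counts and keep the NaN bucket visible, use:
df['education'].value_counts(normalize=True, dropna=False)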
For numeric variables, we can create histograms (pandas plots via matplotlib, so import it to display the figure):
import matplotlib.pyplot as plt
df['age'].hist(bins=10, figsize=(8, 4))
plt.show()
Advanced Profiling with pandas-profiling
While pandas' built-in methods are useful, the pandas-profiling library takes data profiling to another level by generating comprehensive HTML reports of your DataFrame.
Installation
First, you need to install the library:
pip install pandas-profiling
Note: the project has since been renamed to ydata-profiling, so on recent setups you may need pip install ydata-profiling instead, along with the matching import shown below.
Basic Usage
from pandas_profiling import ProfileReport  # on newer versions: from ydata_profiling import ProfileReport
# Generate the report
profile = ProfileReport(df, title="Dataset Profiling Report", explorative=True)
# Save the report as HTML
profile.to_file("data_profile_report.html")
# In Jupyter notebook, you can display it directly:
# profile.to_notebook_iframe()
The generated HTML report includes:
- Overview statistics
- Variables details (type, unique values, missing values, etc.)
- Interactions between variables
- Correlations
- Missing values patterns
- And much more!
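On large datasets the full report can take a while to compute. One common workaround (a sketch, not a feature of the library itself) is to profile a random sample instead:
# Profile a sample to keep report generation fast on large data
sample = df.sample(n=min(len(df), 10_000), random_state=0)
ProfileReport(sample, title="Sample Profiling Report").to_file("sample_report.html")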
Customizing the Report
You can customize the profiling by passing additional parameters:
profile = ProfileReport(
    df,
    title="Custom Profiling Report",
    minimal=False,        # Set to True for a faster, minimal report
    explorative=True,     # Enables correlations, etc.
    html={'style': {'full_width': True}},
    correlations={
        'pearson': {'calculate': True},
        'spearman': {'calculate': True},
        'kendall': {'calculate': True},
    }
)
Creating Custom Profiling Functions
Sometimes you might need custom profiling functions for specific analytical needs:
def custom_profile(dataframe):
    """
    Custom function to profile a pandas DataFrame.
    """
    profile = {
        "shape": dataframe.shape,
        "column_types": dataframe.dtypes,
        "missing_values": dataframe.isnull().sum(),
        "missing_percentage": (dataframe.isnull().sum() / len(dataframe)) * 100,
        "unique_values": dataframe.nunique(),
    }

    # Add numeric column statistics
    numeric_columns = dataframe.select_dtypes(include=['int64', 'float64']).columns
    if len(numeric_columns) > 0:
        profile["numeric_stats"] = dataframe[numeric_columns].describe()
        profile["skewness"] = dataframe[numeric_columns].skew()
        profile["kurtosis"] = dataframe[numeric_columns].kurt()

    # Add categorical column statistics
    cat_columns = dataframe.select_dtypes(include=['object', 'category']).columns
    if len(cat_columns) > 0:
        profile["categorical_stats"] = {col: dataframe[col].value_counts() for col in cat_columns}

    return profile
# Use the custom profile function
profile_info = custom_profile(df)
# Print some profile information
print(f"Dataset shape: {profile_info['shape']}")
print("\nMissing value percentage:")
print(profile_info['missing_percentage'])
if 'skewness' in profile_info:
    print("\nSkewness (detects asymmetry in distribution):")
    print(profile_info['skewness'])
Output:
Dataset shape: (100, 6)
Missing value percentage:
age 0.0
income 10.0
education 5.0
satisfaction 8.0
customer_since 0.0
last_purchase 0.0
dtype: float64
Skewness (detects asymmetry in distribution):
age 0.006354
income 0.047287
satisfaction 0.061190
dtype: float64
Real-World Example: Profiling Customer Data
Let's implement a more complete real-world example by profiling customer data and making recommendations based on findings:
def analyze_customer_dataset(df):
    """
    Analyze a customer dataset and provide recommendations for data cleaning.
    """
    print("===== CUSTOMER DATASET ANALYSIS =====\n")

    # Basic information
    print(f"Dataset dimensions: {df.shape[0]} rows and {df.shape[1]} columns")
    print(f"Memory usage: {df.memory_usage().sum() / 1024:.2f} KB")

    # Missing values analysis
    missing = df.isnull().sum()
    missing_percent = (missing / len(df)) * 100
    print("\n----- Missing Values Analysis -----")
    for col, count in missing.items():
        if count > 0:
            print(f"{col}: {count} missing values ({missing_percent[col]:.1f}%)")

    # Data type analysis
    print("\n----- Data Type Analysis -----")
    for col, dtype in df.dtypes.items():
        print(f"{col}: {dtype}")

    # Value distribution analysis
    print("\n----- Value Distribution -----")

    # For categorical columns
    cat_columns = df.select_dtypes(include=['object', 'category']).columns
    for col in cat_columns:
        value_counts = df[col].value_counts()
        print(f"\n{col} distribution:")
        for val, count in value_counts.items():
            print(f"  - {val}: {count} ({count/len(df):.1%})")

    # For numeric columns - basic stats
    num_columns = df.select_dtypes(include=['int64', 'float64']).columns
    print("\nNumeric columns statistics:")
    for col in num_columns:
        stats = df[col].describe()
        print(f"\n{col}:")
        print(f"  Range: {stats['min']:.2f} to {stats['max']:.2f}")
        print(f"  Mean: {stats['mean']:.2f}")
        print(f"  Median: {stats['50%']:.2f}")
        print(f"  Std Dev: {stats['std']:.2f}")

        # Check for potential outliers using the IQR method
        Q1 = stats['25%']
        Q3 = stats['75%']
        IQR = Q3 - Q1
        outliers = df[(df[col] < (Q1 - 1.5 * IQR)) | (df[col] > (Q3 + 1.5 * IQR))]
        if len(outliers) > 0:
            print(f"  Potential outliers: {len(outliers)} values")

    # Date columns analysis
    date_columns = df.select_dtypes(include=['datetime64']).columns
    if len(date_columns) > 0:
        print("\n----- Date Analysis -----")
        for col in date_columns:
            print(f"\n{col}:")
            print(f"  Range: {df[col].min()} to {df[col].max()}")
            print(f"  Time span: {(df[col].max() - df[col].min()).days} days")

    # Recommendations based on findings
    print("\n===== RECOMMENDATIONS =====")

    # Missing values recommendations
    cols_with_missing = [col for col in df.columns if missing[col] > 0]
    if cols_with_missing:
        print("\n1. Missing Values:")
        for col in cols_with_missing:
            if missing_percent[col] < 5:
                print(f"  - {col}: Low missing rate ({missing_percent[col]:.1f}%). Consider imputation.")
            elif missing_percent[col] < 20:
                print(f"  - {col}: Moderate missing rate ({missing_percent[col]:.1f}%). Impute or use models that handle missing data.")
            else:
                print(f"  - {col}: High missing rate ({missing_percent[col]:.1f}%). Consider if this column should be kept.")

    # Data type recommendations (substring match, so names like
    # 'customer_since' or 'last_purchase' are also caught)
    print("\n2. Data Types:")
    for col in df.columns:
        name = col.lower()
        if any(key in name for key in ['age', 'year', 'count', 'number']):
            if df[col].dtype != 'int64':
                print(f"  - {col}: Consider converting to integer type.")
        elif any(key in name for key in ['date', 'time', 'since', 'until', 'purchase']):
            if not pd.api.types.is_datetime64_dtype(df[col]):
                print(f"  - {col}: Consider converting to datetime format.")

    # Column-specific recommendations
    print("\n3. Column-Specific Insights:")
    if 'age' in df.columns:
        if df['age'].min() < 0 or df['age'].max() > 120:
            print("  - age: Contains implausible values. Clean or filter.")
    if 'income' in df.columns:
        if df['income'].min() < 0:
            print("  - income: Contains negative values which may need correction.")

    return "Analysis complete!"
# Run the analysis on our sample data
analyze_customer_dataset(df)
This real-world example demonstrates how to:
- Get a comprehensive view of your dataset
- Identify potential data quality issues
- Generate recommendations for data cleaning based on the profile
Summary
Data profiling is an essential first step in any data analysis project. It helps you understand your data's structure, identify quality issues, and plan your cleaning strategy. In this tutorial, we covered:
- Basic pandas profiling methods (info(), describe(), nunique())
- Advanced profiling using the pandas-profiling library
- Custom profiling functions for specific needs
- A real-world example of profiling customer data
By mastering these techniques, you'll be able to approach any dataset with confidence and develop a solid plan for cleaning and analysis.
Additional Resources
- Pandas Documentation
- pandas-profiling GitHub Repository
- Great Expectations - A more advanced data validation framework
Exercises
1. Download a public dataset of your choice and create a comprehensive profile using both pandas built-in methods and pandas-profiling.
2. Create a custom profiling function that detects potential data quality issues like:
   - Inconsistent date formats
   - Text fields with numeric values
   - Unexpected categorical values
3. Modify the analyze_customer_dataset function to include correlation analysis between variables.
4. Create a visualization dashboard that shows key profiling metrics from a dataset using Matplotlib or Seaborn.
5. Compare the performance (speed and memory usage) of pandas built-in profiling versus pandas-profiling on a large dataset (>100,000 rows).