Pandas Missing Values
Missing values are a common challenge in real-world datasets. Whether it's due to data entry errors, processing problems, or just the absence of information, knowing how to handle these gaps effectively is essential for any data analysis task.
Introduction to Missing Values
In Pandas, missing values are typically represented as NaN
(Not a Number), which is a special floating-point value. These values can cause issues when performing calculations or training machine learning models, which is why we need strategies to identify and handle them.
Let's start by understanding how missing values appear in Pandas:
import pandas as pd
import numpy as np
# Creating a DataFrame with missing values
data = {
'Name': ['John', 'Anna', 'Peter', None, 'Linda'],
'Age': [28, 34, np.nan, 52, 30],
'Salary': [50000, np.nan, 65000, 75000, np.nan],
'Department': ['IT', 'HR', 'IT', np.nan, 'Marketing']
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age Salary Department
0 John 28.0 50000.0 IT
1 Anna 34.0 NaN HR
2 Peter NaN 65000.0 IT
3 None 52.0 75000.0 None
4 Linda 30.0 NaN Marketing
Notice that missing values can be represented by NaN
or None
in a DataFrame.
Detecting Missing Values
Pandas provides several methods to identify missing values in your dataset:
1. Using isna()
or isnull()
These functions return a DataFrame of the same shape with boolean values indicating whether each value is missing:
# Check for missing values
print(df.isna()) # isnull() works exactly the same way
Output:
Name Age Salary Department
0 False False False False
1 False False True False
2 False True False False
3 True False False True
4 False False True False
2. Counting missing values
You can count missing values per column:
# Count missing values in each column
print(df.isna().sum())
# For the entire DataFrame
print(f"Total missing values: {df.isna().sum().sum()}")
Output:
Name 1
Age 1
Salary 2
Department 1
dtype: int64
Total missing values: 5
3. Checking if a DataFrame contains any missing values
# Check if there are any missing values
print(f"Has missing values: {df.isna().any().any()}")
# Check which columns have missing values
print(df.isna().any())
Output:
Has missing values: True
Name True
Age True
Salary True
Department True
dtype: bool
Handling Missing Values
There are several strategies for dealing with missing values:
1. Removing Missing Values
Removing rows with any missing values
# Drop rows with any missing values
df_dropped = df.dropna()
print("\nDataFrame after dropping rows with missing values:")
print(df_dropped)
Output:
DataFrame after dropping rows with missing values:
Name Age Salary Department
0 John 28.0 50000.0 IT
Removing rows with all missing values
# Drop rows where all values are missing
df_dropped_all = df.dropna(how='all')
print("\nDataFrame after dropping rows with all missing values:")
print(df_dropped_all) # In this case, no rows have all values missing
Output:
DataFrame after dropping rows with all missing values:
Name Age Salary Department
0 John 28.0 50000.0 IT
1 Anna 34.0 NaN HR
2 Peter NaN 65000.0 IT
3 None 52.0 75000.0 None
4 Linda 30.0 NaN Marketing
Removing columns with missing values
# Drop columns with any missing values
df_dropped_columns = df.dropna(axis=1)
print("\nDataFrame after dropping columns with missing values:")
print(df_dropped_columns) # All columns have at least one missing value
Output:
DataFrame after dropping columns with missing values:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]
Removing rows with a threshold of missing values
# Drop rows with at least 2 missing values
df_threshold = df.dropna(thresh=3)
print("\nDataFrame after dropping rows with at least 2 missing values:")
print(df_threshold)
Output:
DataFrame after dropping rows with at least 2 missing values:
Name Age Salary Department
0 John 28.0 50000.0 IT
1 Anna 34.0 NaN HR
2 Peter NaN 65000.0 IT
4 Linda 30.0 NaN Marketing
2. Filling Missing Values
Filling with a specific value
# Fill all missing values with a specific value
df_filled = df.fillna(0)
print("\nDataFrame after filling missing values with 0:")
print(df_filled)
Output:
DataFrame after filling missing values with 0:
Name Age Salary Department
0 John 28.0 50000.0 IT
1 Anna 34.0 0.0 HR
2 Peter 0.0 65000.0 IT
3 0 52.0 75000.0 0
4 Linda 30.0 0.0 Marketing
Filling with different values for each column
# Fill missing values with different values for each column
fill_values = {
'Name': 'Unknown',
'Age': df['Age'].mean(),
'Salary': df['Salary'].median(),
'Department': 'Not Specified'
}
df_filled_dict = df.fillna(fill_values)
print("\nDataFrame after filling with specific values per column:")
print(df_filled_dict)
Output:
DataFrame after filling with specific values per column:
Name Age Salary Department
0 John 28.0 50000.0 IT
1 Anna 34.0 65000.0 HR
2 Peter 36.0 65000.0 IT
3 Unknown 52.0 75000.0 Not Specified
4 Linda 30.0 65000.0 Marketing
Forward and backward filling
# Forward fill (use the previous value)
df_ffill = df.fillna(method='ffill')
print("\nDataFrame after forward fill:")
print(df_ffill)
# Backward fill (use the next value)
df_bfill = df.fillna(method='bfill')
print("\nDataFrame after backward fill:")
print(df_bfill)
Output:
DataFrame after forward fill:
Name Age Salary Department
0 John 28.0 50000.0 IT
1 Anna 34.0 50000.0 HR
2 Peter 34.0 65000.0 IT
3 Peter 52.0 75000.0 IT
4 Linda 30.0 75000.0 Marketing
DataFrame after backward fill:
Name Age Salary Department
0 John 28.0 50000.0 IT
1 Anna 34.0 65000.0 HR
2 Peter 52.0 65000.0 IT
3 Linda 52.0 75000.0 Marketing
4 Linda 30.0 NaN Marketing
3. Interpolation
Interpolation is a method of estimating values between known data points:
# Create a sample time series with missing values
ts = pd.Series([10, 20, np.nan, 40, 50, np.nan, 70], index=pd.date_range('20230101', periods=7))
print("\nTime Series with missing values:")
print(ts)
# Linear interpolation
print("\nLinear interpolation:")
print(ts.interpolate(method='linear'))
# Polynomial interpolation
print("\nPolynomial interpolation:")
print(ts.interpolate(method='polynomial', order=2))
Output:
Time Series with missing values:
2023-01-01 10.0
2023-01-02 20.0
2023-01-03 NaN
2023-01-04 40.0
2023-01-05 50.0
2023-01-06 NaN
2023-01-07 70.0
Freq: D, dtype: float64
Linear interpolation:
2023-01-01 10.0
2023-01-02 20.0
2023-01-03 30.0
2023-01-04 40.0
2023-01-05 50.0
2023-01-06 60.0
2023-01-07 70.0
Freq: D, dtype: float64
Polynomial interpolation:
2023-01-01 10.0
2023-01-02 20.0
2023-01-03 30.0
2023-01-04 40.0
2023-01-05 50.0
2023-01-06 60.0
2023-01-07 70.0
Freq: D, dtype: float64