Pandas Missing Values
Missing values are a common challenge in real-world datasets. Whether they stem from data entry errors, processing problems, or simply the absence of information, knowing how to handle these gaps effectively is essential for any data analysis task.
Introduction to Missing Values
In Pandas, missing values are typically represented as NaN (Not a Number), which is a special floating-point value. These values can cause issues when performing calculations or training machine learning models, which is why we need strategies to identify and handle them.
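Because NaN is a floating-point value, a single missing entry is enough to turn an integer column into a float column, and NaN propagates through element-wise arithmetic while most Pandas aggregations skip it by default. A minimal sketch (the variable names are just illustrative):
import numpy as np
import pandas as pd
s = pd.Series([1, 2, np.nan])  # integer inputs become float64 because of the NaN
print(s.dtype)   # float64
print(s + 1)     # NaN propagates: 2.0, 3.0, NaN
print(s.sum())   # 3.0 -- aggregations skip NaN by default (skipna=True)
print(s.mean())  # 1.5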
Let's start by understanding how missing values appear in Pandas:
import pandas as pd
import numpy as np
# Creating a DataFrame with missing values
data = {
'Name': ['John', 'Anna', 'Peter', None, 'Linda'],
'Age': [28, 34, np.nan, 52, 30],
'Salary': [50000, np.nan, 65000, 75000, np.nan],
'Department': ['IT', 'HR', 'IT', np.nan, 'Marketing']
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age Salary Department
0 John 28.0 50000.0 IT
1 Anna 34.0 NaN HR
2 Peter NaN 65000.0 IT
3 None 52.0 75000.0 None
4 Linda 30.0 NaN Marketing
Notice that missing values can be represented by either NaN or None in a DataFrame.
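Both forms are detected the same way: pd.isna() reports None and np.nan as missing (and in numeric columns, None is stored as NaN). A quick check on the DataFrame above:
# None and np.nan are both treated as missing
print(pd.isna(None), pd.isna(np.nan))  # True True
print(df['Name'].isna())               # the None in 'Name' is flagged as missing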
Detecting Missing Values
Pandas provides several methods to identify missing values in your dataset:
1. Using isna() or isnull()
These functions return a DataFrame of the same shape with boolean values indicating whether each value is missing:
# Check for missing values
print(df.isna()) # isnull() works exactly the same way
Output:
Name Age Salary Department
0 False False False False
1 False False True False
2 False True False False
3 True False False True
4 False False True False
2. Counting missing values
You can count missing values per column:
# Count missing values in each column
print(df.isna().sum())
# For the entire DataFrame
print(f"Total missing values: {df.isna().sum().sum()}")
Output:
Name 1
Age 1
Salary 2
Department 1
dtype: int64
Total missing values: 5
3. Checking if a DataFrame contains any missing values
# Check if there are any missing values
print(f"Has missing values: {df.isna().any().any()}")
# Check which columns have missing values
print(df.isna().any())
Output:
Has missing values: True
Name True
Age True
Salary True
Department True
dtype: bool
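Beyond counting, it is often useful to look at the actual rows that contain missing data. A small sketch using boolean masking with the methods shown above:
# Select only the rows that have at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing)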
Handling Missing Values
There are several strategies for dealing with missing values:
1. Removing Missing Values
Removing rows with any missing values
# Drop rows with any missing values
df_dropped = df.dropna()
print("\nDataFrame after dropping rows with missing values:")
print(df_dropped)
Output:
DataFrame after dropping rows with missing values:
Name Age Salary Department
0 John 28.0 50000.0 IT
Removing rows with all missing values
# Drop rows where all values are missing
df_dropped_all = df.dropna(how='all')
print("\nDataFrame after dropping rows with all missing values:")
print(df_dropped_all) # In this case, no rows have all values missing
Output:
DataFrame after dropping rows with all missing values:
Name Age Salary Department
0 John 28.0 50000.0 IT
1 Anna 34.0 NaN HR
2 Peter NaN 65000.0 IT
3 None 52.0 75000.0 None
4 Linda 30.0 NaN Marketing
Removing columns with missing values
# Drop columns with any missing values
df_dropped_columns = df.dropna(axis=1)
print("\nDataFrame after dropping columns with missing values:")
print(df_dropped_columns) # All columns have at least one missing value
Output:
DataFrame after dropping columns with missing values:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]
Removing rows below a threshold of non-missing values
# Keep only rows with at least 3 non-missing values
# (with 4 columns, this drops any row that has 2 or more missing values)
df_threshold = df.dropna(thresh=3)
print("\nDataFrame after dropping rows with fewer than 3 non-missing values:")
print(df_threshold)
Output:
DataFrame after dropping rows with fewer than 3 non-missing values:
Name Age Salary Department
0 John 28.0 50000.0 IT
1 Anna 34.0 NaN HR
2 Peter NaN 65000.0 IT
4 Linda 30.0 NaN Marketing
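dropna() also accepts a subset parameter, so rows are dropped only when values are missing in columns you actually care about. A brief sketch:
# Drop rows only if 'Salary' is missing; rows with a missing Name or Department are kept
df_salary_only = df.dropna(subset=['Salary'])
print(df_salary_only)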
2. Filling Missing Values
Filling with a specific value
# Fill all missing values with a specific value
df_filled = df.fillna(0)
print("\nDataFrame after filling missing values with 0:")
print(df_filled)
Output:
DataFrame after filling missing values with 0:
Name Age Salary Department
0 John 28.0 50000.0 IT
1 Anna 34.0 0.0 HR
2 Peter 0.0 65000.0 IT
3 0 52.0 75000.0 0
4 Linda 30.0 0.0 Marketing
Filling with different values for each column
# Fill missing values with different values for each column
fill_values = {
'Name': 'Unknown',
'Age': df['Age'].mean(),
'Salary': df['Salary'].median(),
'Department': 'Not Specified'
}
df_filled_dict = df.fillna(fill_values)
print("\nDataFrame after filling with specific values per column:")
print(df_filled_dict)
Output:
DataFrame after filling with specific values per column:
Name Age Salary Department
0 John 28.0 50000.0 IT
1 Anna 34.0 65000.0 HR
2 Peter 36.0 65000.0 IT
3 Unknown 52.0 75000.0 Not Specified
4 Linda 30.0 65000.0 Marketing
Forward and backward filling
# Forward fill (propagate the previous value forward)
# Note: fillna(method='ffill') is deprecated in recent pandas versions; use ffill()
df_ffill = df.ffill()
print("\nDataFrame after forward fill:")
print(df_ffill)
# Backward fill (use the next value)
df_bfill = df.bfill()
print("\nDataFrame after backward fill:")
print(df_bfill)
Output:
DataFrame after forward fill:
Name Age Salary Department
0 John 28.0 50000.0 IT
1 Anna 34.0 50000.0 HR
2 Peter 34.0 65000.0 IT
3 Peter 52.0 75000.0 IT
4 Linda 30.0 75000.0 Marketing
DataFrame after backward fill:
Name Age Salary Department
0 John 28.0 50000.0 IT
1 Anna 34.0 65000.0 HR
2 Peter 52.0 65000.0 IT
3 Linda 52.0 75000.0 Marketing
4 Linda 30.0 NaN Marketing
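Note that the backward fill leaves the last Salary as NaN because there is no later value to copy. A common pattern, shown here as a sketch, is to chain both directions or to cap how far a value is propagated with the limit parameter:
# Backward fill first, then forward fill to cover gaps at both ends
df_both = df.bfill().ffill()
print(df_both)
# Limit forward filling to at most 1 consecutive missing value
df_limited = df.ffill(limit=1)
print(df_limited)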
3. Interpolation
Interpolation is a method of estimating values between known data points:
# Create a sample time series with missing values
ts = pd.Series([10, 20, np.nan, 40, 50, np.nan, 70], index=pd.date_range('20230101', periods=7))
print("\nTime Series with missing values:")
print(ts)
# Linear interpolation
print("\nLinear interpolation:")
print(ts.interpolate(method='linear'))
# Polynomial interpolation (order 2; requires SciPy to be installed)
print("\nPolynomial interpolation:")
print(ts.interpolate(method='polynomial', order=2))
Output:
Time Series with missing values:
2023-01-01 10.0
2023-01-02 20.0
2023-01-03 NaN
2023-01-04 40.0
2023-01-05 50.0
2023-01-06 NaN
2023-01-07 70.0
Freq: D, dtype: float64
Linear interpolation:
2023-01-01 10.0
2023-01-02 20.0
2023-01-03 30.0
2023-01-04 40.0
2023-01-05 50.0
2023-01-06 60.0
2023-01-07 70.0
Freq: D, dtype: float64
Polynomial interpolation:
2023-01-01 10.0
2023-01-02 20.0
2023-01-03 30.0
2023-01-04 40.0
2023-01-05 50.0
2023-01-06 60.0
2023-01-07 70.0
Freq: D, dtype: float64
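interpolate() also works directly on the numeric columns of a DataFrame, and limit_direction controls whether gaps at the start or end of a column are filled as well. A brief sketch on the earlier df (Age and Salary are its numeric columns):
# Interpolate the numeric columns; limit_direction='both' also fills leading/trailing gaps
print(df[['Age', 'Salary']].interpolate(method='linear', limit_direction='both'))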
Real-World Applications
Example 1: Cleaning Customer Data
Let's say we have a customer dataset with missing information:
# Create a customer dataset
customer_data = {
'CustomerID': [1, 2, 3, 4, 5],
'Name': ['John Smith', 'Jane Doe', None, 'Robert Johnson', 'Emily Wilson'],
'Age': [34, np.nan, 45, 52, np.nan],
'Email': ['[email protected]', '[email protected]', None, None, '[email protected]'],
'PurchaseAmount': [125.50, 89.99, np.nan, 210.75, 55.30]
}
customers = pd.DataFrame(customer_data)
print("Original Customer Data:")
print(customers)
# Identify missing values
print("\nMissing Values per Column:")
print(customers.isna().sum())
Output:
Original Customer Data:
CustomerID Name Age Email PurchaseAmount
0 1 John Smith 34.0 [email protected] 125.50
1 2 Jane Doe NaN [email protected] 89.99
2 3 None 45.0 None NaN
3 4 Robert Johnson 52.0 None 210.75
4 5 Emily Wilson NaN [email protected] 55.30
Missing Values per Column:
CustomerID 0
Name 1
Age 2
Email 2
PurchaseAmount 1
dtype: int64
Now let's clean this data appropriately:
# Step 1: For critical identifying data (CustomerID), we might choose to drop rows with missing values
# In this case, all CustomerIDs are present
# Step 2: For categorical/text data, use a meaningful placeholder
customers['Name'] = customers['Name'].fillna('Unknown')
customers['Email'] = customers['Email'].fillna('[email protected]')
# Step 3: For numerical data, use mean, median, or mode based on distribution
customers['Age'] = customers['Age'].fillna(customers['Age'].mean())
customers['PurchaseAmount'] = customers['PurchaseAmount'].fillna(customers['PurchaseAmount'].median())
print("\nCleaned Customer Data:")
print(customers)
Output:
Cleaned Customer Data:
CustomerID Name Age Email PurchaseAmount
0 1 John Smith 34.000000 [email protected] 125.500
1 2 Jane Doe 43.666667 [email protected] 89.990
2 3 Unknown 45.000000 [email protected] 107.745
3 4 Robert Johnson 52.000000 [email protected] 210.750
4 5 Emily Wilson 43.666667 [email protected] 55.300
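A quick sanity check confirms the cleaning left no gaps behind:
# Verify that no missing values remain
print(customers.isna().sum().sum())  # 0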
Example 2: Cleaning Time Series Data
Time series data often has missing values that are better handled with specialized approaches:
# Create a simulated daily sales dataset with missing values
date_range = pd.date_range(start='2023-01-01', end='2023-01-15', freq='D')
sales_data = [150, 165, np.nan, 140, 180, 195, np.nan, np.nan, 210, 205, 190, np.nan, 220, 215, 230]
sales = pd.Series(sales_data, index=date_range)
print("Original Sales Time Series:")
print(sales)
# Visualize missing values pattern
missing_days = sales.index[sales.isna()]
print(f"\nMissing values on: {[d.strftime('%Y-%m-%d') for d in missing_days]}")
# Fill missing values using time series interpolation
sales_filled = sales.interpolate(method='time') # Time-based interpolation
print("\nSales after time-based interpolation:")
print(sales_filled)
# Alternative: fill gaps with a trailing 3-day moving average
# (the rolling mean at each date uses that date and the two preceding days, skipping NaNs)
sales_ma = sales.fillna(sales.rolling(window=3, min_periods=1).mean())
print("\nSales after filling with moving average:")
print(sales_ma)
Output:
Original Sales Time Series:
2023-01-01 150.0
2023-01-02 165.0
2023-01-03 NaN
2023-01-04 140.0
2023-01-05 180.0
2023-01-06 195.0
2023-01-07 NaN
2023-01-08 NaN
2023-01-09 210.0
2023-01-10 205.0
2023-01-11 190.0
2023-01-12 NaN
2023-01-13 220.0
2023-01-14 215.0
2023-01-15 230.0
Freq: D, dtype: float64
Missing values on: ['2023-01-03', '2023-01-07', '2023-01-08', '2023-01-12']
Sales after time-based interpolation:
2023-01-01 150.000000
2023-01-02 165.000000
2023-01-03 152.500000
2023-01-04 140.000000
2023-01-05 180.000000
2023-01-06 195.000000
2023-01-07 200.000000
2023-01-08 205.000000
2023-01-09 210.000000
2023-01-10 205.000000
2023-01-11 190.000000
2023-01-12 205.000000
2023-01-13 220.000000
2023-01-14 215.000000
2023-01-15 230.000000
Freq: D, dtype: float64
Sales after filling with moving average:
2023-01-01 150.000000
2023-01-02 165.000000
2023-01-03 157.500000
2023-01-04 140.000000
2023-01-05 180.000000
2023-01-06 195.000000
2023-01-07 187.500000
2023-01-08 195.000000
2023-01-09 210.000000
2023-01-10 205.000000
2023-01-11 190.000000
2023-01-12 197.500000
2023-01-13 220.000000
2023-01-14 215.000000
2023-01-15 230.000000
Freq: D, dtype: float64
Summary
Handling missing values is an essential step in the data cleaning process. In this guide, we've covered:
- Detecting missing values: Using functions like isna() and isnull() to identify missing values in your DataFrame.
- Removing missing values: Dropping rows or columns with missing values using dropna().
- Filling missing values: Using techniques like:
  - Filling with constants (fillna())
  - Forward and backward filling
  - Using statistical measures like mean, median, or mode
  - Interpolation for time series or ordered data
- Real-world applications: Applying these techniques to customer data and time series data.
The strategy you choose should depend on:
- The importance of the data
- The amount of missing values
- The nature of your dataset
- The specific requirements of your analysis or model
Remember that handling missing values is not just a technical step but also an analytical decision that can significantly impact your results.
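As a quick way to gauge how much data is missing before settling on a strategy, a per-column percentage report (a small sketch built from the detection methods above) can help:
# Percentage of missing values per column, sorted from most to least affected
missing_report = (df.isna().mean() * 100).sort_values(ascending=False)
print(missing_report)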
Exercises
- Basic Exercise: Create a DataFrame with missing values and practice different methods of handling them.
- Intermediate Exercise:
  - Download a dataset from Kaggle with missing values
  - Analyze the pattern of missing values
  - Apply appropriate techniques to handle them
  - Compare the results of different approaches
- Advanced Exercise:
  - Create a function that automatically detects and handles missing values in any DataFrame
  - The function should choose the appropriate method based on the data type and distribution of each column
  - Include options for reporting on the missing data and the actions taken
Happy data cleaning!