
TensorFlow Model Analysis

Introduction

TensorFlow Model Analysis (TFMA) is a powerful library within the TensorFlow Extended (TFX) ecosystem that enables developers to evaluate and analyze machine learning models. When building ML pipelines, it's essential not only to train effective models but also to thoroughly evaluate their performance across different data slices and metrics. TFMA provides the tools necessary for this deep evaluation.

In this tutorial, we'll explore how TFMA works, its key features, and how to integrate it into your TFX pipelines. By the end, you'll be able to perform comprehensive model analysis to ensure your models are performing as expected before deployment.

What is TensorFlow Model Analysis?

TensorFlow Model Analysis is a library for evaluating TensorFlow models. It allows you to:

  • Calculate and visualize evaluation metrics across data slices
  • Compare model performance across different model versions
  • Analyze model fairness and bias
  • Track model performance over time

TFMA works well with other TFX components and can be used either within a complete TFX pipeline or as a standalone library.

Getting Started with TFMA

Installation

First, let's install TFMA:

bash
pip install tensorflow-model-analysis
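
To verify the installation, import the library and print its version (TFMA also requires a compatible TensorFlow installation):

python
import tensorflow_model_analysis as tfma

# Confirm the import works and check which version is installed
print('TFMA version: {}'.format(tfma.__version__))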

Basic Usage

Let's start with a simple example of how to use TFMA:

python
import tensorflow as tf
import tensorflow_model_analysis as tfma

# Define the evaluation configuration
eval_config = tfma.EvalConfig(
    model_specs=[tfma.ModelSpec(label_key='label')],
    slicing_specs=[tfma.SlicingSpec()],
    metrics_specs=[
        tfma.MetricsSpec(metrics=[
            tfma.MetricConfig(class_name='BinaryAccuracy'),
            tfma.MetricConfig(class_name='AUC')
        ])
    ]
)

# Define the location for evaluation results
output_path = 'tfma_results'

# Run the evaluation against a previously saved model
eval_result = tfma.run_model_analysis(
    eval_shared_model=tfma.default_eval_shared_model(
        eval_saved_model_path='my_model',
        eval_config=eval_config
    ),
    eval_config=eval_config,
    data_location='eval_data.tfrecord',
    output_path=output_path
)

# View the results (renders a widget in a Jupyter notebook)
tfma.view.render_slicing_metrics(eval_result)

The above example:

  1. Configures an evaluation with binary accuracy and AUC metrics
  2. Points TFMA at a previously saved model
  3. Runs the evaluation on the provided dataset (sketched below)
  4. Renders the metrics for visualization
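
The evaluation data is expected to be serialized tf.train.Example records. As a minimal sketch of what eval_data.tfrecord might contain (the feature names here are assumptions chosen to match the config above):

python
import tensorflow as tf

# Write a few serialized tf.train.Example records to eval_data.tfrecord.
# 'label' matches the label_key in the EvalConfig above; 'age' is a
# hypothetical model input feature.
with tf.io.TFRecordWriter('eval_data.tfrecord') as writer:
    for age, label in [(25, 1), (40, 0), (33, 1)]:
        example = tf.train.Example(features=tf.train.Features(feature={
            'age': tf.train.Feature(int64_list=tf.train.Int64List(value=[age])),
            'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
        }))
        writer.write(example.SerializeToString())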

Key Concepts in TFMA

Evaluation Configuration

The EvalConfig is the central component that defines how your model will be evaluated:

python
eval_config = tfma.EvalConfig(
    model_specs=[
        # Define the model name and where to find the label
        tfma.ModelSpec(name='my_model', label_key='label')
    ],
    slicing_specs=[
        # Define how to slice your data
        tfma.SlicingSpec(),                     # Overall metrics
        tfma.SlicingSpec(feature_keys=['age'])  # Metrics for each age value
    ],
    metrics_specs=[
        # Define which metrics to compute
        tfma.MetricsSpec(metrics=[
            tfma.MetricConfig(class_name='BinaryAccuracy'),
            tfma.MetricConfig(class_name='AUC'),
            tfma.MetricConfig(class_name='Precision'),
            tfma.MetricConfig(class_name='Recall')
        ])
    ]
)
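
Since EvalConfig is a protocol buffer, the same configuration can also be written as a text proto and parsed, a style you will often see in TFX examples. A minimal sketch of the equivalent config:

python
from google.protobuf import text_format
import tensorflow_model_analysis as tfma

# The same configuration expressed as a text proto
eval_config = text_format.Parse("""
  model_specs { name: "my_model" label_key: "label" }
  slicing_specs {}
  slicing_specs { feature_keys: ["age"] }
  metrics_specs {
    metrics { class_name: "BinaryAccuracy" }
    metrics { class_name: "AUC" }
  }
""", tfma.EvalConfig())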

Data Slicing

One of the powerful features of TFMA is the ability to evaluate model performance across different slices of your data. This helps identify if your model performs poorly for specific subgroups:

python
slicing_specs = [
    # Overall metrics (no slicing)
    tfma.SlicingSpec(),

    # Slice by individual feature values
    tfma.SlicingSpec(feature_keys=['gender']),

    # Slice by a feature cross (combination of features)
    tfma.SlicingSpec(feature_keys=['gender', 'age_group']),

    # Slice by a specific feature value
    tfma.SlicingSpec(feature_values={'country': 'USA'})
]
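
After an evaluation runs, each slice appears as a separate entry in the loaded result. As a rough sketch of programmatic access (the exact nesting of the metrics dict can vary slightly across TFMA versions):

python
import tensorflow_model_analysis as tfma

# slicing_metrics is a list of (slice_key, metrics) tuples; the slice key
# is a tuple of (feature_name, feature_value) pairs, with () for overall.
eval_result = tfma.load_eval_result('tfma_results')
for slice_key, metrics in eval_result.slicing_metrics:
    # metrics is a nested dict keyed by output name, sub key, then metric name
    print(slice_key)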

Metrics

TFMA supports a wide range of metrics for model evaluation:

python
metrics_specs = [
    tfma.MetricsSpec(
        metrics=[
            # Classification metrics
            tfma.MetricConfig(class_name='BinaryAccuracy'),
            tfma.MetricConfig(class_name='AUC'),
            # Evaluate precision/recall at several decision thresholds by
            # passing the metric's kwargs as a JSON config string
            tfma.MetricConfig(class_name='Precision',
                              config='{"thresholds": [0.3, 0.5, 0.7]}'),
            tfma.MetricConfig(class_name='Recall',
                              config='{"thresholds": [0.3, 0.5, 0.7]}'),

            # Calibration metrics
            tfma.MetricConfig(class_name='MeanLabel'),
            tfma.MetricConfig(class_name='MeanPrediction'),

            # Fairness metrics
            tfma.MetricConfig(class_name='FairnessIndicators',
                              config='{"thresholds": [0.3, 0.5, 0.7]}')
        ]
    )
]
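
If you already have Keras metric objects, TFMA can derive the specs for you with tfma.metrics.specs_from_metrics, which is often more convenient than writing MetricConfig entries by hand:

python
import tensorflow as tf
import tensorflow_model_analysis as tfma

# Build MetricsSpecs directly from Keras metric instances
metrics_specs = tfma.metrics.specs_from_metrics([
    tf.keras.metrics.BinaryAccuracy(name='accuracy'),
    tf.keras.metrics.AUC(name='auc'),
])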

Working with TFMA in a TFX Pipeline

In a complete TFX pipeline, TFMA runs inside the Evaluator component, which naturally fits after the Trainer component. Here's how to integrate it:

python
from tfx import v1 as tfx

# Create the evaluator component (it runs TFMA under the hood)
evaluator = tfx.components.Evaluator(
    examples=example_gen.outputs['examples'],
    model=trainer.outputs['model'],
    eval_config=eval_config
)

# Add it to your pipeline
pipeline = tfx.dsl.Pipeline(
    pipeline_name='my_pipeline',
    pipeline_root=pipeline_root,
    components=[
        example_gen,
        statistics_gen,
        schema_gen,
        example_validator,
        trainer,
        evaluator,  # TFMA component
        pusher
    ],
    enable_cache=True
)
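
The Evaluator also emits a "blessing" artifact that downstream components can use as a deployment gate. A typical pattern (sketched here with the TFX v1 API; serving_model_dir is an assumed variable) is to wire it into the Pusher so only validated models get deployed:

python
# Push the model only if the Evaluator blessed it
pusher = tfx.components.Pusher(
    model=trainer.outputs['model'],
    model_blessing=evaluator.outputs['blessing'],
    push_destination=tfx.proto.PushDestination(
        filesystem=tfx.proto.PushDestination.Filesystem(
            base_directory=serving_model_dir)))  # assumed output directory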

Visualizing Results

TFMA provides various visualization options to interpret evaluation results:

python
# Load evaluation results
eval_result = tfma.load_eval_result(output_path)

# View metrics for all slices
tfma.view.render_slicing_metrics(eval_result)

# View metrics broken down by a specific feature
tfma.view.render_slicing_metrics(eval_result, slicing_column='gender')

# Compare results from two runs (e.g. a baseline and a candidate model)
eval_results = tfma.load_eval_results(
    [baseline_output_path, candidate_output_path])
tfma.view.render_time_series(eval_results)

# Show detailed plots (e.g. calibration, ROC) for a specific slice
tfma.view.render_plot(
    eval_result,
    tfma.SlicingSpec(feature_values={'gender': 'female'}))
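
Note that these visualizations render as widgets in a Jupyter notebook. In the classic Jupyter Notebook you may first need to enable the TFMA extension (depending on your environment, a --sys-prefix flag may also be needed):

bash
jupyter nbextension enable --py widgetsnbextension
jupyter nbextension enable --py tensorflow_model_analysis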

Practical Example: Movie Recommendation Evaluation

Let's walk through a more complete example of evaluating a movie recommendation model:

python
import tensorflow as tf
import tensorflow_model_analysis as tfma
from tensorflow_model_analysis.addons.fairness.view import widget_view

# Step 1: Define the evaluation configuration
eval_config = tfma.EvalConfig(
    model_specs=[
        tfma.ModelSpec(
            name='movie_recommender',
            label_key='watched',
            example_weight_key='weight'
        )
    ],
    slicing_specs=[
        # Overall metrics
        tfma.SlicingSpec(),
        # Slice by user demographic groups
        tfma.SlicingSpec(feature_keys=['user_age_group']),
        tfma.SlicingSpec(feature_keys=['user_gender']),
        # Slice by movie genre
        tfma.SlicingSpec(feature_keys=['movie_genre']),
        # Cross-slice by user gender and movie genre
        tfma.SlicingSpec(feature_keys=['user_gender', 'movie_genre']),
    ],
    metrics_specs=[
        tfma.MetricsSpec(
            metrics=[
                tfma.MetricConfig(class_name='AUC'),
                tfma.MetricConfig(class_name='Precision'),
                tfma.MetricConfig(class_name='Recall'),
                tfma.MetricConfig(class_name='ExampleCount'),
                # Fairness indicators to check for bias
                tfma.MetricConfig(
                    class_name='FairnessIndicators',
                    config='{"thresholds": [0.1, 0.3, 0.5, 0.7, 0.9]}'
                )
            ]
        )
    ]
)

# Step 2: Run the evaluation
eval_result = tfma.run_model_analysis(
    eval_shared_model=tfma.default_eval_shared_model(
        eval_saved_model_path='path/to/movie_recommender_model',
        eval_config=eval_config,
        tags=['serve']
    ),
    data_location='gs://movie_recommender/eval_data*',
    file_format='tfrecords',
    output_path='gs://movie_recommender/tfma_output',
    eval_config=eval_config
)

# Step 3: Visualize the results
# Overall metrics
tfma.view.render_slicing_metrics(eval_result)

# Check whether recommendation quality differs across user genders
tfma.view.render_slicing_metrics(
    eval_result,
    slicing_column='user_gender'
)

# Check whether certain movie genres are recommended fairly
tfma.view.render_slicing_metrics(
    eval_result,
    slicing_column='movie_genre'
)

# Step 4: Analyze fairness across demographic groups
widget_view.render_fairness_indicator(
    eval_result,
    slicing_column='user_gender'
)

This example demonstrates:

  1. Setting up a comprehensive evaluation configuration
  2. Running the evaluation on a movie recommendation model
  3. Analyzing both performance metrics and fairness across different user groups
  4. Looking for potential biases in the recommendation system
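
Beyond the widgets, you can also quantify slice gaps programmatically. A minimal sketch, assuming an 'auc' metric and the nested result layout used by recent TFMA versions (outer keys are output name and sub key, both empty strings for a simple single-output model):

python
import tensorflow_model_analysis as tfma

# Compare AUC across user_gender slices to quantify the gap
eval_result = tfma.load_eval_result('gs://movie_recommender/tfma_output')
aucs = {}
for slice_key, metrics in eval_result.slicing_metrics:
    # Single-feature slice keys look like (('user_gender', 'F'),)
    if len(slice_key) == 1 and slice_key[0][0] == 'user_gender':
        aucs[slice_key[0][1]] = metrics['']['']['auc']['doubleValue']
if aucs:
    print('AUC gap across genders: {:.3f}'.format(
        max(aucs.values()) - min(aucs.values())))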

Model Validation

TFMA can also validate whether a model meets certain criteria before deployment:

python
# Compare a candidate model against a baseline to gate deployment
baseline_path = 'gs://models/baseline'
candidate_path = 'gs://models/candidate'

eval_config = tfma.EvalConfig(
    model_specs=[
        tfma.ModelSpec(name='candidate', label_key='label'),
        tfma.ModelSpec(name='baseline', label_key='label', is_baseline=True)
    ],
    slicing_specs=[tfma.SlicingSpec()],
    metrics_specs=[
        tfma.MetricsSpec(
            metrics=[
                tfma.MetricConfig(
                    class_name='AUC',
                    # Validation thresholds are attached per metric
                    threshold=tfma.MetricThreshold(
                        # The candidate's AUC must be at least 0.5
                        value_threshold=tfma.GenericValueThreshold(
                            lower_bound={'value': 0.5}),
                        # ...and must not be worse than the baseline's
                        # (small negative slack allows for rounding)
                        change_threshold=tfma.GenericChangeThreshold(
                            direction=tfma.MetricDirection.HIGHER_IS_BETTER,
                            absolute={'value': -1e-10})
                    )
                ),
                tfma.MetricConfig(class_name='BinaryAccuracy')
            ]
        )
    ]
)

# Run the evaluation with both models
output_path = 'gs://models/tfma_validation'
eval_result = tfma.run_model_analysis(
    eval_shared_model=[
        tfma.default_eval_shared_model(
            eval_saved_model_path=candidate_path,
            model_name='candidate', eval_config=eval_config),
        tfma.default_eval_shared_model(
            eval_saved_model_path=baseline_path,
            model_name='baseline', eval_config=eval_config)
    ],
    eval_config=eval_config,
    data_location='gs://data/eval_data*',
    output_path=output_path
)

# Check if the candidate passes validation
validation_result = tfma.load_validation_result(output_path)
if validation_result.validation_ok:
    print("Model validated successfully! Ready for deployment.")
else:
    print("Model validation failed. Check metrics.")

The validation ensures that:

  1. The candidate model's AUC is at least 0.5 (a minimum performance requirement)
  2. The candidate model's AUC is no worse than the baseline's (the change threshold)

Only if both criteria are met will the model be considered valid for deployment.
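
The change threshold above only requires the candidate not to regress. To demand an improvement margin instead (say, AUC at least 0.01 higher than the baseline), tighten the threshold:

python
# Require the candidate to beat the baseline by at least 0.01 AUC
change_threshold = tfma.GenericChangeThreshold(
    direction=tfma.MetricDirection.HIGHER_IS_BETTER,
    absolute={'value': 0.01})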

Summary

TensorFlow Model Analysis (TFMA) is a crucial component in the TFX ecosystem that enables thorough model evaluation before deployment. It allows developers to:

  • Evaluate models on multiple metrics
  • Analyze performance across different data slices
  • Compare model versions
  • Detect potential biases in model predictions
  • Validate models against defined criteria

By using TFMA in your machine learning pipelines, you can ensure that only high-quality, fair, and well-performing models make it to production.

Exercises

  1. Implement TFMA in an image classification model and slice the data by image categories to identify any categories where the model underperforms.

  2. Set up model validation criteria for a text classification model that ensures the new model version has at least 5% better accuracy than the previous version.

  3. Use TFMA to evaluate a recommendation system and check for demographic parity across different user groups.

  4. Create a complete TFX pipeline including a TFMA component that analyzes model performance over time as new training data becomes available.

  5. Implement custom metrics in TFMA for a specific business use case and visualize the results across different data slices.
