
TensorFlow Model Analysis

Introduction

TensorFlow Model Analysis (TFMA) is a powerful library within the TensorFlow Extended (TFX) ecosystem that enables developers to evaluate and analyze machine learning models. When building ML pipelines, it's essential not only to train effective models but also to thoroughly evaluate their performance across different data slices and metrics. TFMA provides the tools necessary for this deep evaluation.

In this tutorial, we'll explore how TFMA works, its key features, and how to integrate it into your TFX pipelines. By the end, you'll be able to perform comprehensive model analysis to ensure your models are performing as expected before deployment.

What is TensorFlow Model Analysis?

TensorFlow Model Analysis is a library for evaluating TensorFlow models. It allows you to:

  • Calculate and visualize evaluation metrics across data slices
  • Compare model performance across different model versions
  • Analyze model fairness and bias
  • Track model performance over time

TFMA works well with other TFX components and can be used either within a complete TFX pipeline or as a standalone library.

Getting Started with TFMA

Installation

First, let's install TFMA:

bash
pip install tensorflow-model-analysis
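
To verify the installation, import the library and print its version (TFMA also requires a compatible TensorFlow installation):

python
import tensorflow_model_analysis as tfma

# Confirm the import works and check which version is installed
print('TFMA version: {}'.format(tfma.__version__))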

Basic Usage

Let's start with a simple example of how to use TFMA:

python
import tensorflow as tf
import tensorflow_model_analysis as tfma

# Define the evaluation configuration
eval_config = tfma.EvalConfig(
    model_specs=[tfma.ModelSpec(label_key='label')],
    slicing_specs=[tfma.SlicingSpec()],
    metrics_specs=[
        tfma.MetricsSpec(metrics=[
            tfma.MetricConfig(class_name='BinaryAccuracy'),
            tfma.MetricConfig(class_name='AUC')
        ])
    ]
)

# Define the location for evaluation results
output_path = 'tfma_results'

# Run the evaluation against a previously saved model
eval_result = tfma.run_model_analysis(
    eval_shared_model=tfma.default_eval_shared_model(
        eval_saved_model_path='my_model',
        eval_config=eval_config
    ),
    eval_config=eval_config,
    data_location='eval_data.tfrecord',
    output_path=output_path
)

# View the results (renders a widget in a Jupyter notebook)
tfma.view.render_slicing_metrics(eval_result)

The above example:

  1. Configures an evaluation with binary accuracy and AUC metrics
  2. Points TFMA at a previously saved model
  3. Runs the evaluation on the provided dataset (sketched below)
  4. Renders the metrics for visualization
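
The evaluation data is expected to be serialized tf.train.Example records. As a minimal sketch of what eval_data.tfrecord might contain (the feature names here are assumptions chosen to match the config above):

python
import tensorflow as tf

# Write a few serialized tf.train.Example records to eval_data.tfrecord.
# 'label' matches the label_key in the EvalConfig above; 'age' is a
# hypothetical model input feature.
with tf.io.TFRecordWriter('eval_data.tfrecord') as writer:
    for age, label in [(25, 1), (40, 0), (33, 1)]:
        example = tf.train.Example(features=tf.train.Features(feature={
            'age': tf.train.Feature(int64_list=tf.train.Int64List(value=[age])),
            'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
        }))
        writer.write(example.SerializeToString())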

Key Concepts in TFMA

Evaluation Configuration

The EvalConfig is the central component that defines how your model will be evaluated:

python
eval_config = tfma.EvalConfig(
    model_specs=[
        # Define the model name and where to find the label
        tfma.ModelSpec(name='my_model', label_key='label')
    ],
    slicing_specs=[
        # Define how to slice your data
        tfma.SlicingSpec(),                     # Overall metrics
        tfma.SlicingSpec(feature_keys=['age'])  # Metrics for each age value
    ],
    metrics_specs=[
        # Define which metrics to compute
        tfma.MetricsSpec(metrics=[
            tfma.MetricConfig(class_name='BinaryAccuracy'),
            tfma.MetricConfig(class_name='AUC'),
            tfma.MetricConfig(class_name='Precision'),
            tfma.MetricConfig(class_name='Recall')
        ])
    ]
)
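
Since EvalConfig is a protocol buffer, the same configuration can also be written as a text proto and parsed, a style you will often see in TFX examples. A minimal sketch of the equivalent config:

python
from google.protobuf import text_format
import tensorflow_model_analysis as tfma

# The same configuration expressed as a text proto
eval_config = text_format.Parse("""
  model_specs { name: "my_model" label_key: "label" }
  slicing_specs {}
  slicing_specs { feature_keys: ["age"] }
  metrics_specs {
    metrics { class_name: "BinaryAccuracy" }
    metrics { class_name: "AUC" }
  }
""", tfma.EvalConfig())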

Data Slicing

One of the powerful features of TFMA is the ability to evaluate model performance across different slices of your data. This helps identify if your model performs poorly for specific subgroups:

python
slicing_specs = [
    # Overall metrics (no slicing)
    tfma.SlicingSpec(),

    # Slice by individual feature values
    tfma.SlicingSpec(feature_keys=['gender']),

    # Slice by a feature cross (combination of features)
    tfma.SlicingSpec(feature_keys=['gender', 'age_group']),

    # Slice by a specific feature value
    tfma.SlicingSpec(feature_values={'country': 'USA'})
]
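
After an evaluation runs, each slice appears as a separate entry in the loaded result. As a rough sketch of programmatic access (the exact nesting of the metrics dict can vary slightly across TFMA versions):

python
import tensorflow_model_analysis as tfma

# slicing_metrics is a list of (slice_key, metrics) tuples; the slice key
# is a tuple of (feature_name, feature_value) pairs, with () for overall.
eval_result = tfma.load_eval_result('tfma_results')
for slice_key, metrics in eval_result.slicing_metrics:
    # metrics is a nested dict keyed by output name, sub key, then metric name
    print(slice_key)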

Metrics

TFMA supports a wide range of metrics for model evaluation:

python
metrics_specs = [
    tfma.MetricsSpec(
        metrics=[
            # Classification metrics
            tfma.MetricConfig(class_name='BinaryAccuracy'),
            tfma.MetricConfig(class_name='AUC'),
            # Evaluate precision/recall at several decision thresholds by
            # passing the metric's kwargs as a JSON config string
            tfma.MetricConfig(class_name='Precision',
                              config='{"thresholds": [0.3, 0.5, 0.7]}'),
            tfma.MetricConfig(class_name='Recall',
                              config='{"thresholds": [0.3, 0.5, 0.7]}'),

            # Calibration metrics
            tfma.MetricConfig(class_name='MeanLabel'),
            tfma.MetricConfig(class_name='MeanPrediction'),

            # Fairness metrics
            tfma.MetricConfig(class_name='FairnessIndicators',
                              config='{"thresholds": [0.3, 0.5, 0.7]}')
        ]
    )
]
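
If you already have Keras metric objects, TFMA can derive the specs for you with tfma.metrics.specs_from_metrics, which is often more convenient than writing MetricConfig entries by hand:

python
import tensorflow as tf
import tensorflow_model_analysis as tfma

# Build MetricsSpecs directly from Keras metric instances
metrics_specs = tfma.metrics.specs_from_metrics([
    tf.keras.metrics.BinaryAccuracy(name='accuracy'),
    tf.keras.metrics.AUC(name='auc'),
])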

Working with TFMA in a TFX Pipeline

In a complete TFX pipeline, TFMA runs inside the Evaluator component, which naturally fits after the Trainer component. Here's how to integrate it:

python
from tfx import v1 as tfx

# Create the evaluator component (it runs TFMA under the hood)
evaluator = tfx.components.Evaluator(
    examples=example_gen.outputs['examples'],
    model=trainer.outputs['model'],
    eval_config=eval_config
)

# Add it to your pipeline
pipeline = tfx.dsl.Pipeline(
    pipeline_name='my_pipeline',
    pipeline_root=pipeline_root,
    components=[
        example_gen,
        statistics_gen,
        schema_gen,
        example_validator,
        trainer,
        evaluator,  # TFMA component
        pusher
    ],
    enable_cache=True
)
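
The Evaluator also emits a "blessing" artifact that downstream components can use as a deployment gate. A typical pattern (sketched here with the TFX v1 API; serving_model_dir is an assumed variable) is to wire it into the Pusher so only validated models get deployed:

python
# Push the model only if the Evaluator blessed it
pusher = tfx.components.Pusher(
    model=trainer.outputs['model'],
    model_blessing=evaluator.outputs['blessing'],
    push_destination=tfx.proto.PushDestination(
        filesystem=tfx.proto.PushDestination.Filesystem(
            base_directory=serving_model_dir)))  # assumed output directory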

Visualizing Results

TFMA provides various visualization options to interpret evaluation results:

python
# Load evaluation results
eval_result = tfma.load_eval_result(output_path)

# View metrics for all slices
tfma.view.render_slicing_metrics(eval_result)

# View metrics broken down by a specific feature
tfma.view.render_slicing_metrics(eval_result, slicing_column='gender')

# Compare results from two runs (e.g. a baseline and a candidate model)
eval_results = tfma.load_eval_results(
    [baseline_output_path, candidate_output_path])
tfma.view.render_time_series(eval_results)

# Show detailed plots (e.g. calibration, ROC) for a specific slice
tfma.view.render_plot(
    eval_result,
    tfma.SlicingSpec(feature_values={'gender': 'female'}))
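
Note that these visualizations render as widgets in a Jupyter notebook. In the classic Jupyter Notebook you may first need to enable the TFMA extension (depending on your environment, a --sys-prefix flag may also be needed):

bash
jupyter nbextension enable --py widgetsnbextension
jupyter nbextension enable --py tensorflow_model_analysis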

Practical Example: Movie Recommendation Evaluation

Let's walk through a more complete example of evaluating a movie recommendation model:

python
import tensorflow as tf
import tensorflow_model_analysis as tfma
from tensorflow_model_analysis.addons.fairness.view import widget_view

# Step 1: Define the evaluation configuration
eval_config = tfma.EvalConfig(
    model_specs=[
        tfma.ModelSpec(
            name='movie_recommender',
            label_key='watched',
            example_weight_key='weight'
        )
    ],
    slicing_specs=[
        # Overall metrics
        tfma.SlicingSpec(),
        # Slice by user demographic groups
        tfma.SlicingSpec(feature_keys=['user_age_group']),
        tfma.SlicingSpec(feature_keys=['user_gender']),
        # Slice by movie genre
        tfma.SlicingSpec(feature_keys=['movie_genre']),
        # Cross-slice by user gender and movie genre
        tfma.SlicingSpec(feature_keys=['user_gender', 'movie_genre']),
    ],
    metrics_specs=[
        tfma.MetricsSpec(
            metrics=[
                tfma.MetricConfig(class_name='AUC'),
                tfma.MetricConfig(class_name='Precision'),
                tfma.MetricConfig(class_name='Recall'),
                tfma.MetricConfig(class_name='ExampleCount'),
                # Fairness indicators to check for bias
                tfma.MetricConfig(
                    class_name='FairnessIndicators',
                    config='{"thresholds": [0.1, 0.3, 0.5, 0.7, 0.9]}'
                )
            ]
        )
    ]
)

# Step 2: Run the evaluation
eval_result = tfma.run_model_analysis(
    eval_shared_model=tfma.default_eval_shared_model(
        eval_saved_model_path='path/to/movie_recommender_model',
        eval_config=eval_config,
        tags=['serve']
    ),
    data_location='gs://movie_recommender/eval_data*',
    file_format='tfrecords',
    output_path='gs://movie_recommender/tfma_output',
    eval_config=eval_config
)

# Step 3: Visualize the results
# Overall metrics
tfma.view.render_slicing_metrics(eval_result)

# Check whether recommendation quality differs across user genders
tfma.view.render_slicing_metrics(
    eval_result,
    slicing_column='user_gender'
)

# Check whether certain movie genres are recommended fairly
tfma.view.render_slicing_metrics(
    eval_result,
    slicing_column='movie_genre'
)

# Step 4: Analyze fairness across demographic groups
widget_view.render_fairness_indicator(
    eval_result,
    slicing_column='user_gender'
)

This example demonstrates:

  1. Setting up a comprehensive evaluation configuration
  2. Running the evaluation on a movie recommendation model
  3. Analyzing both performance metrics and fairness across different user groups
  4. Looking for potential biases in the recommendation system
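
Beyond the widgets, you can also quantify slice gaps programmatically. A minimal sketch, assuming an 'auc' metric and the nested result layout used by recent TFMA versions (outer keys are output name and sub key, both empty strings for a simple single-output model):

python
import tensorflow_model_analysis as tfma

# Compare AUC across user_gender slices to quantify the gap
eval_result = tfma.load_eval_result('gs://movie_recommender/tfma_output')
aucs = {}
for slice_key, metrics in eval_result.slicing_metrics:
    # Single-feature slice keys look like (('user_gender', 'F'),)
    if len(slice_key) == 1 and slice_key[0][0] == 'user_gender':
        aucs[slice_key[0][1]] = metrics['']['']['auc']['doubleValue']
if aucs:
    print('AUC gap across genders: {:.3f}'.format(
        max(aucs.values()) - min(aucs.values())))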

Model Validation

TFMA can also validate whether a model meets certain criteria before deployment:

python
# Compare a candidate model against a baseline to gate deployment
baseline_path = 'gs://models/baseline'
candidate_path = 'gs://models/candidate'

eval_config = tfma.EvalConfig(
    model_specs=[
        tfma.ModelSpec(name='candidate', label_key='label'),
        tfma.ModelSpec(name='baseline', label_key='label', is_baseline=True)
    ],
    slicing_specs=[tfma.SlicingSpec()],
    metrics_specs=[
        tfma.MetricsSpec(
            metrics=[
                tfma.MetricConfig(
                    class_name='AUC',
                    # Validation thresholds are attached per metric
                    threshold=tfma.MetricThreshold(
                        # The candidate's AUC must be at least 0.5
                        value_threshold=tfma.GenericValueThreshold(
                            lower_bound={'value': 0.5}),
                        # ...and must not be worse than the baseline's
                        # (small negative slack allows for rounding)
                        change_threshold=tfma.GenericChangeThreshold(
                            direction=tfma.MetricDirection.HIGHER_IS_BETTER,
                            absolute={'value': -1e-10})
                    )
                ),
                tfma.MetricConfig(class_name='BinaryAccuracy')
            ]
        )
    ]
)

# Run the evaluation with both models
output_path = 'gs://models/tfma_validation'
eval_result = tfma.run_model_analysis(
    eval_shared_model=[
        tfma.default_eval_shared_model(
            eval_saved_model_path=candidate_path,
            model_name='candidate', eval_config=eval_config),
        tfma.default_eval_shared_model(
            eval_saved_model_path=baseline_path,
            model_name='baseline', eval_config=eval_config)
    ],
    eval_config=eval_config,
    data_location='gs://data/eval_data*',
    output_path=output_path
)

# Check if the candidate passes validation
validation_result = tfma.load_validation_result(output_path)
if validation_result.validation_ok:
    print("Model validated successfully! Ready for deployment.")
else:
    print("Model validation failed. Check metrics.")

The validation ensures that:

  1. The candidate model's AUC is at least 0.5 (a minimum performance requirement)
  2. The candidate model's AUC is no worse than the baseline's (the change threshold)

Only if both criteria are met will the model be considered valid for deployment.
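
The change threshold above only requires the candidate not to regress. To demand an improvement margin instead (say, AUC at least 0.01 higher than the baseline), tighten the threshold:

python
# Require the candidate to beat the baseline by at least 0.01 AUC
change_threshold = tfma.GenericChangeThreshold(
    direction=tfma.MetricDirection.HIGHER_IS_BETTER,
    absolute={'value': 0.01})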

Summary

TensorFlow Model Analysis (TFMA) is a crucial component in the TFX ecosystem that enables thorough model evaluation before deployment. It allows developers to:

  • Evaluate models on multiple metrics
  • Analyze performance across different data slices
  • Compare model versions
  • Detect potential biases in model predictions
  • Validate models against defined criteria

By using TFMA in your machine learning pipelines, you can ensure that only high-quality, fair, and well-performing models make it to production.

Exercises

  1. Implement TFMA in an image classification model and slice the data by image categories to identify any categories where the model underperforms.

  2. Set up model validation criteria for a text classification model that ensures the new model version has at least 5% better accuracy than the previous version.

  3. Use TFMA to evaluate a recommendation system and check for demographic parity across different user groups.

  4. Create a complete TFX pipeline including a TFMA component that analyzes model performance over time as new training data becomes available.

  5. Implement custom metrics in TFMA for a specific business use case and visualize the results across different data slices.
