TensorFlow Feature Columns
Introduction
When building machine learning models with TensorFlow, one of the most critical aspects is handling different types of input data properly. TensorFlow Feature Columns bridge the gap between raw data and what machine learning algorithms expect. They are the primary way to prepare and encode features for training models, especially when working with TensorFlow's high-level APIs like tf.estimator.
Feature columns are the intermediaries that transform various types of raw data (numerical values, categorical data, text, etc.) into formats that can be efficiently used by machine learning models. They help you create more sophisticated models without worrying about the underlying data transformations.
Why Use Feature Columns?
- Data Transformation: Convert raw data into formats suitable for ML models
- Feature Engineering: Create new features or transform existing ones
- Handling Diverse Data: Process different data types (numeric, categorical, etc.)
- Standardized Interface: Provide a consistent way to handle various data inputs
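Here is a minimal sketch of that round trip, using a made-up "fare" feature: the feature column describes the raw input, and a DenseFeatures layer applies the transformation to a batch of values.

import tensorflow as tf

# One feature column in, one dense tensor out
fare = tf.feature_column.numeric_column("fare")
layer = tf.keras.layers.DenseFeatures([fare])
print(layer({"fare": tf.constant([[12.5], [7.0]])}).numpy())
# [[12.5]
#  [ 7. ]]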
Types of Feature Columns
TensorFlow offers several types of feature columns to handle different kinds of data:
1. Numeric Columns
Numeric columns are the simplest type and are used for continuous numerical data.
import tensorflow as tf
# Create a numeric feature column
age = tf.feature_column.numeric_column("age")
# Creating a feature column with custom shape
multi_dim_feature = tf.feature_column.numeric_column("measurements", shape=[4])
# You can also normalize numeric columns
normalized_age = tf.feature_column.numeric_column(
    "age", normalizer_fn=lambda x: (x - 50) / 30
)
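As a quick sanity check (on a made-up batch of ages), you can run the normalized column through a DenseFeatures layer and confirm that normalizer_fn is applied:

# (20 - 50) / 30 = -1.0 and (80 - 50) / 30 = 1.0
layer = tf.keras.layers.DenseFeatures([normalized_age])
print(layer({"age": tf.constant([[20.0], [80.0]])}).numpy())
# [[-1.]
#  [ 1.]]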
2. Categorical Columns with Vocabulary
For categorical data where you know all possible values in advance:
# Categorical column with vocabulary list
education_level = tf.feature_column.categorical_column_with_vocabulary_list(
    "education_level",
    ["High School", "Bachelors", "Masters", "PhD"]
)
# You can also control how values outside the vocabulary ("OOV" values) are handled.
# Note: num_oov_buckets and a custom default_value are mutually exclusive.
education_with_oov = tf.feature_column.categorical_column_with_vocabulary_list(
    "education_level",
    ["High School", "Bachelors", "Masters", "PhD"],
    dtype=tf.string,
    num_oov_buckets=2  # Hash unseen values into two extra "out of vocabulary" buckets
)
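To see where unseen values land (a toy batch, not from the example above), wrap the column in an indicator column: the four vocabulary entries take indices 0-3, and the two OOV buckets take indices 4-5.

layer = tf.keras.layers.DenseFeatures(
    [tf.feature_column.indicator_column(education_with_oov)]
)
batch = {"education_level": tf.constant([["PhD"], ["Bootcamp"]])}
print(layer(batch).numpy())
# Row 0 is one-hot at index 3 ("PhD"); row 1 lights up index 4 or 5 (an OOV bucket)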
3. Categorical Columns with Hash Bucket
When you don't know all categories in advance or have too many unique values:
# Hash bucket for strings
city = tf.feature_column.categorical_column_with_hash_bucket(
    "city", hash_bucket_size=100  # Number of buckets to hash values into
)
# Hash bucket for integers
zip_code = tf.feature_column.categorical_column_with_hash_bucket(
    "zip_code",
    hash_bucket_size=1000,
    dtype=tf.int64
)
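Hashed columns never reject a value: anything, seen or unseen, is deterministically hashed into one of the buckets. A quick sketch with made-up city names; note that a categorical column must be wrapped in an indicator or embedding column before DenseFeatures can consume it:

city_indicator = tf.feature_column.indicator_column(city)
layer = tf.keras.layers.DenseFeatures([city_indicator])
batch = {"city": tf.constant([["Lisbon"], ["Oslo"]])}  # cities never declared anywhere
print(layer(batch).shape)  # (2, 100): one-hot over the 100 hash buckets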
4. Bucketized Columns
Convert continuous values into categorical ranges:
# Create a numeric column first
raw_age = tf.feature_column.numeric_column("age")
# Then bucketize it into ranges
age_buckets = tf.feature_column.bucketized_column(
    raw_age,
    boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65]  # 10 boundaries create 11 buckets
)
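The resulting column is one-hot over the buckets. On a toy input, 22 falls into bucket 1 (18 <= 22 < 25), so the output is an 11-dimensional vector with a 1 at index 1:

layer = tf.keras.layers.DenseFeatures([age_buckets])
print(layer({"age": tf.constant([[22.0]])}).numpy())
# [[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]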
5. Embedding Columns
Convert categorical data into dense vectors of fixed size:
# First create a categorical column
education = tf.feature_column.categorical_column_with_vocabulary_list(
    "education", ["High School", "Bachelors", "Masters", "PhD"]
)
# Then create an embedding column based on it
education_embedding = tf.feature_column.embedding_column(
    education, dimension=8  # Output embedding dimension
)
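Unlike one-hot encodings, the values inside the embedding are learned during training. A quick shape check on a toy batch:

layer = tf.keras.layers.DenseFeatures([education_embedding])
print(layer({"education": tf.constant([["PhD"]])}).shape)
# (1, 8): one 8-dimensional dense vector per example, with trainable values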
6. Indicator Columns (One-Hot Encoding)
Convert categorical columns to one-hot encoded vectors:
# Create a categorical column
marital_status = tf.feature_column.categorical_column_with_vocabulary_list(
    "marital_status", ["Single", "Married", "Divorced", "Widowed"]
)
# Convert to indicator column (one-hot)
marital_status_indicator = tf.feature_column.indicator_column(marital_status)
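On a toy batch, each row comes back as a one-hot vector in vocabulary order ("Single" is index 0, "Married" is index 1, and so on):

layer = tf.keras.layers.DenseFeatures([marital_status_indicator])
batch = {"marital_status": tf.constant([["Married"], ["Single"]])}
print(layer(batch).numpy())
# [[0. 1. 0. 0.]
#  [1. 0. 0. 0.]]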
Practical Example: Building a Model with Feature Columns
Let's build a simple model that predicts income based on various factors:
import tensorflow as tf
import pandas as pd
import numpy as np
# Create some sample data
data = {
    'age': [22, 35, 48, 30, 40, 55],
    'education': ['Bachelors', 'Masters', 'PhD', 'High School', 'Bachelors', 'Masters'],
    'city': ['New York', 'San Francisco', 'Boston', 'Chicago', 'Seattle', 'Austin'],
    'years_experience': [1, 8, 15, 5, 10, 20],
    'income': [35000, 75000, 120000, 40000, 80000, 110000]
}
df = pd.DataFrame(data)
# Define feature columns
age = tf.feature_column.numeric_column('age')
education = tf.feature_column.categorical_column_with_vocabulary_list(
    'education', ['High School', 'Bachelors', 'Masters', 'PhD']
)
city = tf.feature_column.categorical_column_with_hash_bucket('city', 10)
years_experience = tf.feature_column.numeric_column('years_experience')
# Create derived feature columns
age_buckets = tf.feature_column.bucketized_column(age, [25, 35, 45, 55])
education_onehot = tf.feature_column.indicator_column(education)
city_embedding = tf.feature_column.embedding_column(city, dimension=5)
# Combine all features
feature_columns = [
    age_buckets,
    education_onehot,
    city_embedding,
    years_experience
]
# Create the feature layer
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)
# Create an input function
def input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((
        {
            'age': df['age'].values,
            'education': df['education'].values,
            'city': df['city'].values,
            'years_experience': df['years_experience'].values
        },
        df['income'].values
    ))
    dataset = dataset.shuffle(6).batch(2)
    return dataset
# Build a model
model = tf.keras.Sequential([
    feature_layer,
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer='adam',
              loss=tf.keras.losses.MeanSquaredError(),
              metrics=['mean_absolute_error'])
# Train the model
model.fit(input_fn(), epochs=10)
# Sample output:
# Epoch 1/10
# 3/3 [==============================] - 1s 332ms/step - loss: 5602003968.0000 - mean_absolute_error: 55373.9141
# Epoch 2/10
# 3/3 [==============================] - 0s 8ms/step - loss: 5274835968.0000 - mean_absolute_error: 53892.9141
# ...
# Epoch 10/10
# 3/3 [==============================] - 0s 9ms/step - loss: 3011952.2500 - mean_absolute_error: 1208.2947
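Once trained, the model accepts the same dictionary-of-features format for inference. A sketch with a made-up applicant; note that the hashed city column happily accepts a city that never appeared in training:

sample = {
    'age': tf.constant([30.0]),
    'education': tf.constant(['Masters']),
    'city': tf.constant(['Denver']),  # unseen city: hashed into one of the 10 buckets
    'years_experience': tf.constant([7.0]),
}
print(model.predict(sample))  # one predicted income value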
Crossed Feature Columns
One powerful aspect of feature columns is the ability to combine multiple columns to create interaction features:
# Create two categorical columns
department = tf.feature_column.categorical_column_with_vocabulary_list(
    'department', ['engineering', 'sales', 'marketing', 'support']
)
job_role = tf.feature_column.categorical_column_with_vocabulary_list(
    'job_role', ['junior', 'senior', 'manager', 'director']
)
# Create a crossed column
department_x_role = tf.feature_column.crossed_column(
    ['department', 'job_role'], hash_bucket_size=16
)
# Convert to indicator or embedding
department_x_role_indicator = tf.feature_column.indicator_column(department_x_role)
# OR
department_x_role_embedding = tf.feature_column.embedding_column(department_x_role, dimension=8)
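A crossed column is itself categorical (each department/role pair is hashed into one of the 16 buckets), so like any other categorical column it must be wrapped before DenseFeatures can use it. A toy check:

layer = tf.keras.layers.DenseFeatures([department_x_role_indicator])
batch = {
    "department": tf.constant([["engineering"]]),
    "job_role": tf.constant([["senior"]]),
}
print(layer(batch).shape)  # (1, 16): one-hot over the 16 hash buckets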
Real-World Application: Housing Price Prediction
Let's apply feature columns to a more realistic example: predicting housing prices based on house features.
import tensorflow as tf
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Create sample housing data
housing_data = {
    'area_sqft': np.random.randint(800, 4500, 1000),
    'bedrooms': np.random.randint(1, 6, 1000),
    'bathrooms': np.random.choice([1, 1.5, 2, 2.5, 3, 3.5, 4], 1000),
    'zip_code': np.random.randint(10000, 99999, 1000),
    'year_built': np.random.randint(1950, 2023, 1000),
    'property_type': np.random.choice(['Single Family', 'Condo', 'Townhouse', 'Multi Family'], 1000),
    'price': []
}
# Generate prices based on features with some noise
for i in range(1000):
    base_price = 150000
    base_price += housing_data['area_sqft'][i] * 100
    base_price += housing_data['bedrooms'][i] * 20000
    base_price += housing_data['bathrooms'][i] * 25000
    if housing_data['property_type'][i] == 'Condo':
        base_price *= 0.9
    elif housing_data['property_type'][i] == 'Townhouse':
        base_price *= 1.1
    elif housing_data['property_type'][i] == 'Multi Family':
        base_price *= 1.3
    # Age discount
    age = 2023 - housing_data['year_built'][i]
    base_price -= age * 1000
    # Add random noise
    noise = np.random.normal(0, 50000)
    price = max(100000, base_price + noise)  # Ensure minimum price
    housing_data['price'].append(price)
housing_df = pd.DataFrame(housing_data)
# Split data
train_df, test_df = train_test_split(housing_df, test_size=0.2, random_state=42)
# Define feature columns
numeric_features = ['area_sqft', 'year_built']
numeric_columns = []
# Create numeric columns with normalization
for feature in numeric_features:
    # Compute mean and standard deviation for each feature in training data
    mean = train_df[feature].mean()
    std = train_df[feature].std()
    # Create normalized feature column
    numeric_columns.append(tf.feature_column.numeric_column(
        feature,
        normalizer_fn=lambda x, mean=mean, std=std: (x - mean) / std
    ))
# Simple numeric columns
bedrooms = tf.feature_column.numeric_column('bedrooms')
bathrooms = tf.feature_column.numeric_column('bathrooms')
# Categorical columns
property_type = tf.feature_column.categorical_column_with_vocabulary_list(
    'property_type',
    ['Single Family', 'Condo', 'Townhouse', 'Multi Family']
)
property_type_indicator = tf.feature_column.indicator_column(property_type)
# Hash bucket for zip code (the raw values are integers, so set dtype explicitly)
zip_code = tf.feature_column.categorical_column_with_hash_bucket(
    'zip_code', hash_bucket_size=100, dtype=tf.int64
)
zip_code_embedding = tf.feature_column.embedding_column(zip_code, dimension=10)
# Create year buckets
year_built = tf.feature_column.numeric_column('year_built')
year_built_buckets = tf.feature_column.bucketized_column(
    year_built, boundaries=[1960, 1970, 1980, 1990, 2000, 2010, 2020]
)
# Create crossed features
bedrooms_property = tf.feature_column.crossed_column(
    ['bedrooms', 'property_type'], hash_bucket_size=24
)
bedrooms_property_indicator = tf.feature_column.indicator_column(bedrooms_property)
# Combine all feature columns
feature_columns = numeric_columns + [
    bedrooms,
    bathrooms,
    property_type_indicator,
    zip_code_embedding,
    year_built_buckets,
    bedrooms_property_indicator
]
# Create feature layer for Keras model
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)
# Create input function
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    df = dataframe.copy()
    labels = df.pop('price')
    ds = tf.data.Dataset.from_tensor_slices((dict(df), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    return ds
train_ds = df_to_dataset(train_df)
test_ds = df_to_dataset(test_df, shuffle=False)
# Build and train model
model = tf.keras.Sequential([
    feature_layer,
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.MeanSquaredError(),
    metrics=[tf.keras.metrics.MeanAbsoluteError()]
)
history = model.fit(
    train_ds,
    validation_data=test_ds,
    epochs=20,
    verbose=1
)
# Evaluate model
loss, mae = model.evaluate(test_ds)
print(f"\nTest Mean Absolute Error: ${mae:.2f}")
Best Practices for Feature Columns
- Normalize numeric data: Always normalize your numeric data when using feature columns to help models converge faster.
- Choose appropriate embedding dimensions: For categorical data, a good rule of thumb is an embedding dimension of approximately min(50, (n_categories + 1) / 2), where n_categories is the number of unique categories (see the sketch after this list).
- Use crossed columns sparingly: While powerful, crossed columns can increase model complexity. Use them only when you expect meaningful interactions.
- Hash bucket sizes: When using hash buckets, choose a size large enough to minimize collisions but not so large that it creates sparsity issues.
- Combine feature column types: Mix different types of feature columns to capture various aspects of your data.
- Check for data leakage: When creating derived features, be careful not to introduce data leakage from your test set.
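The embedding-dimension rule of thumb above is easy to turn into code. A tiny helper, sketched here (not part of TensorFlow itself):

def embedding_dim(n_categories):
    # Rule of thumb: min(50, (n_categories + 1) // 2)
    return min(50, (n_categories + 1) // 2)

print(embedding_dim(4))     # 2  -- e.g. the four-entry education vocabulary above
print(embedding_dim(1000))  # 50 -- capped for very large vocabularies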
Common Patterns and Use Cases
Feature Type | Feature Column Approach | Example
---|---|---
Age (continuous) | Numeric or bucketized | numeric_column("age") or bucketized_column(numeric_column("age"), [18, 30, 50, 65])
Gender (binary) | Categorical with vocabulary | categorical_column_with_vocabulary_list("gender", ["M", "F"])
City (many categories) | Hash bucket with embedding | embedding_column(categorical_column_with_hash_bucket("city", 100), 8)
Price ranges | Bucketized | bucketized_column(numeric_column("price"), [10, 25, 50, 100, 250])
Text data | Feature crosses or text embeddings | crossed_column(["category", "subcategory"], 100)
Date/time | Extract components into multiple features | Year, month, and day-of-week features (see the sketch after this table)
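The date/time row deserves a short sketch of its own. Assuming a hypothetical signup_date column, extract the components with pandas first, then give each component an appropriate feature column:

import pandas as pd
import tensorflow as tf

# Split a timestamp into model-friendly components
df = pd.DataFrame({"signup_date": pd.to_datetime(["2021-03-15", "2022-11-02"])})
df["signup_year"] = df["signup_date"].dt.year      # numeric
df["signup_month"] = df["signup_date"].dt.month    # categorical, 1-12
df["signup_dow"] = df["signup_date"].dt.dayofweek  # categorical, 0-6

signup_year = tf.feature_column.numeric_column("signup_year")
signup_month = tf.feature_column.indicator_column(
    tf.feature_column.categorical_column_with_identity("signup_month", num_buckets=13)
)
signup_dow = tf.feature_column.indicator_column(
    tf.feature_column.categorical_column_with_identity("signup_dow", num_buckets=7)
)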
Summary
TensorFlow Feature Columns provide a powerful and flexible way to handle different types of input data for machine learning models. They allow you to:
- Transform raw data into formats suitable for machine learning
- Handle various data types (numeric, categorical, etc.)
- Create feature interactions and derived features
- Standardize your data preprocessing pipeline
By mastering feature columns, you can build more sophisticated models that better capture the underlying patterns in your data, leading to improved model performance and more accurate predictions.
Additional Resources
- TensorFlow Feature Columns Official Documentation
- Feature Column Guide on TensorFlow.org
- Structured Data Classification Tutorial
Exercises
1. Create a feature column setup for a dataset with user information including age, income, education level, and location to predict whether they'll click on an advertisement.
2. Implement crossed columns to capture interactions between day of week and purchase behavior in a retail dataset.
3. Build a model for restaurant recommendation using feature columns for cuisine type, price range, user's past ratings, and geographic location.
4. Compare the performance of models using different representations of the same categorical data: indicator columns vs. embedding columns.
5. Create a regression model using feature columns to predict car prices based on make, model, year, mileage, and features.