TensorFlow Feature Columns
Introduction
When building machine learning models with TensorFlow, one of the most critical aspects is handling different types of input data properly. TensorFlow Feature Columns bridge the gap between raw data and what machine learning algorithms expect. They are the primary way to prepare and encode features for training models, especially when working with TensorFlow's high-level APIs like tf.estimator.
Feature columns are the intermediaries that transform various types of raw data (numerical values, categorical data, text, etc.) into formats that can be efficiently used by machine learning models. They help you create more sophisticated models without worrying about the underlying data transformations.
Why Use Feature Columns?
- Data Transformation: Convert raw data into formats suitable for ML models
- Feature Engineering: Create new features or transform existing ones
- Handling Diverse Data: Process different data types (numeric, categorical, etc.)
- Standardized Interface: Provide a consistent way to handle various data inputs
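Here is a minimal sketch of that round trip, using a made-up "fare" feature: the feature column describes the raw input, and a DenseFeatures layer applies the transformation to a batch of values.

import tensorflow as tf

# One feature column in, one dense tensor out
fare = tf.feature_column.numeric_column("fare")
layer = tf.keras.layers.DenseFeatures([fare])
print(layer({"fare": tf.constant([[12.5], [7.0]])}).numpy())
# [[12.5]
#  [ 7. ]]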
Types of Feature Columns
TensorFlow offers several types of feature columns to handle different kinds of data:
1. Numeric Columns
Numeric columns are the simplest type and are used for continuous numerical data.
import tensorflow as tf
# Create a numeric feature column
age = tf.feature_column.numeric_column("age")
# Creating a feature column with custom shape
multi_dim_feature = tf.feature_column.numeric_column("measurements", shape=[4])
# You can also normalize numeric columns
normalized_age = tf.feature_column.numeric_column(
    "age", normalizer_fn=lambda x: (x - 50) / 30
)
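As a quick sanity check (on a made-up batch of ages), you can run the normalized column through a DenseFeatures layer and confirm that normalizer_fn is applied:

# (20 - 50) / 30 = -1.0 and (80 - 50) / 30 = 1.0
layer = tf.keras.layers.DenseFeatures([normalized_age])
print(layer({"age": tf.constant([[20.0], [80.0]])}).numpy())
# [[-1.]
#  [ 1.]]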
2. Categorical Columns with Vocabulary
For categorical data where you know all possible values in advance:
# Categorical column with vocabulary list
education_level = tf.feature_column.categorical_column_with_vocabulary_list(
    "education_level",
    ["High School", "Bachelors", "Masters", "PhD"]
)
# You can also control how values outside the vocabulary ("OOV" values) are handled.
# Note: num_oov_buckets and a custom default_value are mutually exclusive.
education_with_oov = tf.feature_column.categorical_column_with_vocabulary_list(
    "education_level",
    ["High School", "Bachelors", "Masters", "PhD"],
    dtype=tf.string,
    num_oov_buckets=2  # Hash unseen values into two extra "out of vocabulary" buckets
)
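To see where unseen values land (a toy batch, not from the example above), wrap the column in an indicator column: the four vocabulary entries take indices 0-3, and the two OOV buckets take indices 4-5.

layer = tf.keras.layers.DenseFeatures(
    [tf.feature_column.indicator_column(education_with_oov)]
)
batch = {"education_level": tf.constant([["PhD"], ["Bootcamp"]])}
print(layer(batch).numpy())
# Row 0 is one-hot at index 3 ("PhD"); row 1 lights up index 4 or 5 (an OOV bucket)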
3. Categorical Columns with Hash Bucket
When you don't know all categories in advance or have too many unique values:
# Hash bucket for strings
city = tf.feature_column.categorical_column_with_hash_bucket(
    "city", hash_bucket_size=100  # Number of buckets to hash values into
)
# Hash bucket for integers
zip_code = tf.feature_column.categorical_column_with_hash_bucket(
    "zip_code",
    hash_bucket_size=1000,
    dtype=tf.int64
)
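Hashed columns never reject a value: anything, seen or unseen, is deterministically hashed into one of the buckets. A quick sketch with made-up city names; note that a categorical column must be wrapped in an indicator or embedding column before DenseFeatures can consume it:

city_indicator = tf.feature_column.indicator_column(city)
layer = tf.keras.layers.DenseFeatures([city_indicator])
batch = {"city": tf.constant([["Lisbon"], ["Oslo"]])}  # cities never declared anywhere
print(layer(batch).shape)  # (2, 100): one-hot over the 100 hash buckets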
4. Bucketized Columns
Convert continuous values into categorical ranges:
# Create a numeric column first
raw_age = tf.feature_column.numeric_column("age")
# Then bucketize it into ranges
age_buckets = tf.feature_column.bucketized_column(
    raw_age,
    boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65]  # 10 boundaries create 11 buckets
)
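The resulting column is one-hot over the buckets. On a toy input, 22 falls into bucket 1 (18 <= 22 < 25), so the output is an 11-dimensional vector with a 1 at index 1:

layer = tf.keras.layers.DenseFeatures([age_buckets])
print(layer({"age": tf.constant([[22.0]])}).numpy())
# [[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]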
5. Embedding Columns
Convert categorical data into dense vectors of fixed size:
# First create a categorical column
education = tf.feature_column.categorical_column_with_vocabulary_list(
    "education", ["High School", "Bachelors", "Masters", "PhD"]
)
# Then create an embedding column based on it
education_embedding = tf.feature_column.embedding_column(
    education, dimension=8  # Output embedding dimension
)
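Unlike one-hot encodings, the values inside the embedding are learned during training. A quick shape check on a toy batch:

layer = tf.keras.layers.DenseFeatures([education_embedding])
print(layer({"education": tf.constant([["PhD"]])}).shape)
# (1, 8): one 8-dimensional dense vector per example, with trainable values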
6. Indicator Columns (One-Hot Encoding)
Convert categorical columns to one-hot encoded vectors:
# Create a categorical column
marital_status = tf.feature_column.categorical_column_with_vocabulary_list(
    "marital_status", ["Single", "Married", "Divorced", "Widowed"]
)
# Convert to indicator column (one-hot)
marital_status_indicator = tf.feature_column.indicator_column(marital_status)
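On a toy batch, each row comes back as a one-hot vector in vocabulary order ("Single" is index 0, "Married" is index 1, and so on):

layer = tf.keras.layers.DenseFeatures([marital_status_indicator])
batch = {"marital_status": tf.constant([["Married"], ["Single"]])}
print(layer(batch).numpy())
# [[0. 1. 0. 0.]
#  [1. 0. 0. 0.]]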
Practical Example: Building a Model with Feature Columns
Let's build a simple model that predicts income based on various factors:
import tensorflow as tf
import pandas as pd
import numpy as np
# Create some sample data
data = {
    'age': [22, 35, 48, 30, 40, 55],
    'education': ['Bachelors', 'Masters', 'PhD', 'High School', 'Bachelors', 'Masters'],
    'city': ['New York', 'San Francisco', 'Boston', 'Chicago', 'Seattle', 'Austin'],
    'years_experience': [1, 8, 15, 5, 10, 20],
    'income': [35000, 75000, 120000, 40000, 80000, 110000]
}
df = pd.DataFrame(data)
# Define feature columns
age = tf.feature_column.numeric_column('age')
education = tf.feature_column.categorical_column_with_vocabulary_list(
    'education', ['High School', 'Bachelors', 'Masters', 'PhD']
)
city = tf.feature_column.categorical_column_with_hash_bucket('city', 10)
years_experience = tf.feature_column.numeric_column('years_experience')
# Create derived feature columns
age_buckets = tf.feature_column.bucketized_column(age, [25, 35, 45, 55])
education_onehot = tf.feature_column.indicator_column(education)
city_embedding = tf.feature_column.embedding_column(city, dimension=5)
# Combine all features
feature_columns = [
    age_buckets,
    education_onehot,
    city_embedding,
    years_experience
]
# Create the feature layer
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)
# Create an input function
def input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((
        {
            'age': df['age'].values,
            'education': df['education'].values,
            'city': df['city'].values,
            'years_experience': df['years_experience'].values
        },
        df['income'].values
    ))
    dataset = dataset.shuffle(6).batch(2)
    return dataset
# Build a model
model = tf.keras.Sequential([
    feature_layer,
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer='adam',
              loss=tf.keras.losses.MeanSquaredError(),
              metrics=['mean_absolute_error'])
# Train the model
model.fit(input_fn(), epochs=10)
# Sample output:
# Epoch 1/10
# 3/3 [==============================] - 1s 332ms/step - loss: 5602003968.0000 - mean_absolute_error: 55373.9141
# Epoch 2/10
# 3/3 [==============================] - 0s 8ms/step - loss: 5274835968.0000 - mean_absolute_error: 53892.9141
# ...
# Epoch 10/10
# 3/3 [==============================] - 0s 9ms/step - loss: 3011952.2500 - mean_absolute_error: 1208.2947
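Once trained, the model accepts the same dictionary-of-features format for inference. A sketch with a made-up applicant; note that the hashed city column happily accepts a city that never appeared in training:

sample = {
    'age': tf.constant([30.0]),
    'education': tf.constant(['Masters']),
    'city': tf.constant(['Denver']),  # unseen city: hashed into one of the 10 buckets
    'years_experience': tf.constant([7.0]),
}
print(model.predict(sample))  # one predicted income value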
Crossed Feature Columns
One powerful aspect of feature columns is the ability to combine multiple columns to create interaction features:
# Create two categorical columns
department = tf.feature_column.categorical_column_with_vocabulary_list(
    'department', ['engineering', 'sales', 'marketing', 'support']
)
job_role = tf.feature_column.categorical_column_with_vocabulary_list(
    'job_role', ['junior', 'senior', 'manager', 'director']
)
# Create a crossed column
department_x_role = tf.feature_column.crossed_column(
    ['department', 'job_role'], hash_bucket_size=16
)
# Convert to indicator or embedding
department_x_role_indicator = tf.feature_column.indicator_column(department_x_role)
# OR
department_x_role_embedding = tf.feature_column.embedding_column(department_x_role, dimension=8)
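A crossed column is itself categorical (each department/role pair is hashed into one of the 16 buckets), so like any other categorical column it must be wrapped before DenseFeatures can use it. A toy check:

layer = tf.keras.layers.DenseFeatures([department_x_role_indicator])
batch = {
    "department": tf.constant([["engineering"]]),
    "job_role": tf.constant([["senior"]]),
}
print(layer(batch).shape)  # (1, 16): one-hot over the 16 hash buckets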
Real-World Application: Housing Price Prediction
Let's apply feature columns to a more realistic example: predicting housing prices based on house features.
import tensorflow as tf
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Create sample housing data
housing_data = {
    'area_sqft': np.random.randint(800, 4500, 1000),
    'bedrooms': np.random.randint(1, 6, 1000),
    'bathrooms': np.random.choice([1, 1.5, 2, 2.5, 3, 3.5, 4], 1000),
    'zip_code': np.random.randint(10000, 99999, 1000),
    'year_built': np.random.randint(1950, 2023, 1000),
    'property_type': np.random.choice(['Single Family', 'Condo', 'Townhouse', 'Multi Family'], 1000),
    'price': []
}
# Generate prices based on features with some noise
for i in range(1000):
    base_price = 150000
    base_price += housing_data['area_sqft'][i] * 100
    base_price += housing_data['bedrooms'][i] * 20000
    base_price += housing_data['bathrooms'][i] * 25000
    if housing_data['property_type'][i] == 'Condo':
        base_price *= 0.9
    elif housing_data['property_type'][i] == 'Townhouse':
        base_price *= 1.1
    elif housing_data['property_type'][i] == 'Multi Family':
        base_price *= 1.3
    # Age discount
    age = 2023 - housing_data['year_built'][i]
    base_price -= age * 1000
    # Add random noise
    noise = np.random.normal(0, 50000)
    price = max(100000, base_price + noise)  # Ensure minimum price
    housing_data['price'].append(price)
housing_df = pd.DataFrame(housing_data)
# Split data
train_df, test_df = train_test_split(housing_df, test_size=0.2, random_state=42)
# Define feature columns
numeric_features = ['area_sqft', 'year_built']
numeric_columns = []
# Create numeric columns with normalization
for feature in numeric_features:
    # Compute mean and standard deviation for each feature in training data
    mean = train_df[feature].mean()
    std = train_df[feature].std()
    # Create normalized feature column
    numeric_columns.append(tf.feature_column.numeric_column(
        feature,
        normalizer_fn=lambda x, mean=mean, std=std: (x - mean) / std
    ))
# Simple numeric columns
bedrooms = tf.feature_column.numeric_column('bedrooms')
bathrooms = tf.feature_column.numeric_column('bathrooms')
# Categorical columns
property_type = tf.feature_column.categorical_column_with_vocabulary_list(
    'property_type',
    ['Single Family', 'Condo', 'Townhouse', 'Multi Family']
)
property_type_indicator = tf.feature_column.indicator_column(property_type)
# Hash bucket for zip code (the raw values are integers, so set dtype explicitly)
zip_code = tf.feature_column.categorical_column_with_hash_bucket(
    'zip_code', hash_bucket_size=100, dtype=tf.int64
)
zip_code_embedding = tf.feature_column.embedding_column(zip_code, dimension=10)
# Create year buckets
year_built = tf.feature_column.numeric_column('year_built')
year_built_buckets = tf.feature_column.bucketized_column(
    year_built, boundaries=[1960, 1970, 1980, 1990, 2000, 2010, 2020]
)
# Create crossed features
bedrooms_property = tf.feature_column.crossed_column(
    ['bedrooms', 'property_type'], hash_bucket_size=24
)
bedrooms_property_indicator = tf.feature_column.indicator_column(bedrooms_property)
# Combine all feature columns
feature_columns = numeric_columns + [
    bedrooms,
    bathrooms,
    property_type_indicator,
    zip_code_embedding,
    year_built_buckets,
    bedrooms_property_indicator
]
# Create feature layer for Keras model
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)
# Create input function
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    df = dataframe.copy()
    labels = df.pop('price')
    ds = tf.data.Dataset.from_tensor_slices((dict(df), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    return ds
train_ds = df_to_dataset(train_df)
test_ds = df_to_dataset(test_df, shuffle=False)
# Build and train model
model = tf.keras.Sequential([
    feature_layer,
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.MeanSquaredError(),
    metrics=[tf.keras.metrics.MeanAbsoluteError()]
)
history = model.fit(
    train_ds,
    validation_data=test_ds,
    epochs=20,
    verbose=1
)
# Evaluate model
loss, mae = model.evaluate(test_ds)
print(f"\nTest Mean Absolute Error: ${mae:.2f}")
Best Practices for Feature Columns
- Normalize numeric data: Always normalize your numeric data when using feature columns to help models converge faster.
- Choose appropriate embedding dimensions: For categorical data, a good rule of thumb is an embedding dimension of approximately min(50, (n_categories + 1) / 2), where n_categories is the number of unique categories (see the sketch after this list).
- Use crossed columns sparingly: While powerful, crossed columns can increase model complexity. Use them only when you expect meaningful interactions.
- Hash bucket sizes: When using hash buckets, choose a size large enough to minimize collisions but not so large that it creates sparsity issues.
- Combine feature column types: Mix different types of feature columns to capture various aspects of your data.
- Check for data leakage: When creating derived features, be careful not to introduce data leakage from your test set.
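The embedding-dimension rule of thumb above is easy to turn into code. A tiny helper, sketched here (not part of TensorFlow itself):

def embedding_dim(n_categories):
    # Rule of thumb: min(50, (n_categories + 1) // 2)
    return min(50, (n_categories + 1) // 2)

print(embedding_dim(4))     # 2  -- e.g. the four-entry education vocabulary above
print(embedding_dim(1000))  # 50 -- capped for very large vocabularies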
Common Patterns and Use Cases
Feature Type | Feature Column Approach | Example
---|---|---
Age (continuous) | Numeric or bucketized | numeric_column("age") or bucketized_column(numeric_column("age"), [18, 30, 50, 65])
Gender (binary) | Categorical with vocabulary | categorical_column_with_vocabulary_list("gender", ["M", "F"])
City (many categories) | Hash bucket with embedding | embedding_column(categorical_column_with_hash_bucket("city", 100), 8)
Price ranges | Bucketized | bucketized_column(numeric_column("price"), [10, 25, 50, 100, 250])
Text data | Feature crosses or text embeddings | crossed_column(["category", "subcategory"], 100)
Date/time | Extract components into multiple features | Year, month, and day-of-week features (see the sketch after this table)
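The date/time row deserves a short sketch of its own. Assuming a hypothetical signup_date column, extract the components with pandas first, then give each component an appropriate feature column:

import pandas as pd
import tensorflow as tf

# Split a timestamp into model-friendly components
df = pd.DataFrame({"signup_date": pd.to_datetime(["2021-03-15", "2022-11-02"])})
df["signup_year"] = df["signup_date"].dt.year      # numeric
df["signup_month"] = df["signup_date"].dt.month    # categorical, 1-12
df["signup_dow"] = df["signup_date"].dt.dayofweek  # categorical, 0-6

signup_year = tf.feature_column.numeric_column("signup_year")
signup_month = tf.feature_column.indicator_column(
    tf.feature_column.categorical_column_with_identity("signup_month", num_buckets=13)
)
signup_dow = tf.feature_column.indicator_column(
    tf.feature_column.categorical_column_with_identity("signup_dow", num_buckets=7)
)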
Summary
TensorFlow Feature Columns provide a powerful and flexible way to handle different types of input data for machine learning models. They allow you to:
- Transform raw data into formats suitable for machine learning
- Handle various data types (numeric, categorical, etc.)
- Create feature interactions and derived features
- Standardize your data preprocessing pipeline
By mastering feature columns, you can build more sophisticated models that better capture the underlying patterns in your data, leading to improved model performance and more accurate predictions.
Additional Resources
- TensorFlow Feature Columns Official Documentation
- Feature Column Guide on TensorFlow.org
- Structured Data Classification Tutorial
Exercises
1. Create a feature column setup for a dataset with user information including age, income, education level, and location to predict whether they'll click on an advertisement.
2. Implement crossed columns to capture interactions between day of week and purchase behavior in a retail dataset.
3. Build a model for restaurant recommendation using feature columns for cuisine type, price range, user's past ratings, and geographic location.
4. Compare the performance of models using different representations of the same categorical data: indicator columns vs. embedding columns.
5. Create a regression model using feature columns to predict car prices based on make, model, year, mileage, and features.