Dimension Tables
Introduction
Dimension tables are a fundamental component of data warehouse design. They provide the descriptive attributes that help analyze and filter the measurements stored in fact tables. If you think of a data warehouse as a way to answer business questions, dimension tables provide the "who," "what," "where," "when," and "how" context for your numerical data.
In this guide, we'll explore what dimension tables are, why they're important, how to design them effectively, and see examples of their implementation in real-world scenarios.
What are Dimension Tables?
Dimension tables contain descriptive attributes that provide context for the numerical measurements stored in fact tables. Together, fact tables and dimension tables form the core of dimensional modeling, which is a technique used in data warehouse design.
Key Characteristics of Dimension Tables
- Descriptive: They contain textual descriptions and attributes rather than measures
- Denormalized: Unlike normalized transactional databases, dimension tables often contain redundant data
- Relatively small: Compared to fact tables, dimension tables typically have fewer rows
- Slowly changing: Their data changes less frequently than fact tables
- Referenced by fact tables: Fact tables link to dimension tables through foreign keys
Types of Dimension Tables
Standard Dimensions
These are the most common dimension tables that describe business entities like customers, products, or time periods.
Conformed Dimensions
Conformed dimensions are standard dimensions that are shared across multiple fact tables or even multiple data marts, ensuring consistency across the data warehouse.
Role-playing Dimensions
A single dimension table that is referenced multiple times by a fact table, each time with a different semantic meaning.
For example, a date dimension might be referenced as order_date, ship_date, and delivery_date in an orders fact table.
Junk Dimensions
A collection of random, low-cardinality flags and indicators consolidated into a single dimension table.
Degenerate Dimensions
These are dimension attributes stored in the fact table rather than in a separate dimension table. Order numbers or transaction IDs are common examples.
Anatomy of a Dimension Table
Let's look at a typical structure of a dimension table:
CREATE TABLE dim_product (
product_key INT PRIMARY KEY, -- Surrogate key
product_id VARCHAR(50), -- Natural/business key
product_name VARCHAR(100),
brand VARCHAR(50),
category VARCHAR(50),
subcategory VARCHAR(50),
size VARCHAR(20),
color VARCHAR(20),
weight DECIMAL(10,2),
price_tier VARCHAR(20),
is_active BOOLEAN,
effective_date DATE, -- SCD tracking
expiration_date DATE, -- SCD tracking
current_flag BOOLEAN -- SCD tracking
);
Key Components
- Surrogate Key: A system-generated unique identifier (often an auto-incrementing integer)
- Natural Key: The original business identifier from the source system
- Descriptive Attributes: Textual and categorical data that describe the dimension entity
- Hierarchical Attributes: Attributes that form natural hierarchies (e.g., category > subcategory)
- SCD Tracking Fields: Fields used to track changes over time in slowly changing dimensions
Slowly Changing Dimensions (SCD)
Dimensions can change over time, and these changes are managed through different SCD types:
Type 1 SCD (Overwrite)
The simplest approach where new data overwrites existing data. Historical values are not preserved.
-- Example of Type 1 SCD update
UPDATE dim_customer
SET address = '123 New Street',
city = 'New City',
state = 'NS',
zipcode = '12345'
WHERE customer_id = 'CUST001';
Type 2 SCD (Add New Row)
Preserves historical data by adding a new row with the updated attributes while marking the old row as expired.
-- Example of Type 2 SCD update
-- First, expire the current record
UPDATE dim_customer
SET expiration_date = CURRENT_DATE - 1,
current_flag = FALSE
WHERE customer_id = 'CUST001' AND current_flag = TRUE;
-- Then, insert the new record
INSERT INTO dim_customer (
customer_key, customer_id, customer_name, address, city, state, zipcode,
effective_date, expiration_date, current_flag
)
VALUES (
(SELECT MAX(customer_key) + 1 FROM dim_customer),
'CUST001', 'John Doe', '123 New Street', 'New City', 'NS', '12345',
CURRENT_DATE, NULL, TRUE
);
Type 3 SCD (Add New Fields)
Maintains limited history by adding new columns to store previous values.
-- Example of Type 3 SCD update
UPDATE dim_customer
SET previous_address = address,
address = '123 New Street',
address_change_date = CURRENT_DATE
WHERE customer_id = 'CUST001';
Dimension Hierarchies
Dimensions often contain natural hierarchies that are important for data analysis and drill-down operations.
These hierarchies enable users to analyze data at different levels of granularity and drill down from summary information to detailed data.
Best Practices for Dimension Table Design
- Use Surrogate Keys: Always use system-generated surrogate keys as primary keys
- Include Natural Keys: Always preserve the business/natural keys from source systems
- Use Descriptive Names: Make attribute names clear and business-oriented
- Denormalize Where Appropriate: Don't be afraid to denormalize to improve query performance
- Plan for Change: Implement appropriate SCD types based on business requirements
- Add Descriptive Attributes: Include as many descriptive attributes as may be useful for analysis
- Ensure Data Quality: Clean and standardize dimension data during ETL processes
- Document Business Rules: Clearly document rules for handling special cases and changes
Real-World Example: Retail Data Warehouse
Let's design a simple star schema for a retail data warehouse with dimension tables:
-- Time Dimension
CREATE TABLE dim_date (
date_key INT PRIMARY KEY,
date_actual DATE,
day_of_week VARCHAR(10),
day_number_in_month INT,
day_number_in_year INT,
week_number_in_year INT,
month_name VARCHAR(10),
month_number INT,
quarter INT,
year INT,
is_weekend BOOLEAN,
is_holiday BOOLEAN,
holiday_name VARCHAR(50)
);
-- Product Dimension
CREATE TABLE dim_product (
product_key INT PRIMARY KEY,
product_id VARCHAR(20),
product_name VARCHAR(100),
brand VARCHAR(50),
category VARCHAR(50),
subcategory VARCHAR(50),
unit_price DECIMAL(10,2),
unit_cost DECIMAL(10,2),
current_flag BOOLEAN,
effective_date DATE,
expiration_date DATE
);
-- Store Dimension
CREATE TABLE dim_store (
store_key INT PRIMARY KEY,
store_id VARCHAR(20),
store_name VARCHAR(100),
address VARCHAR(200),
city VARCHAR(50),
state VARCHAR(50),
country VARCHAR(50),
postal_code VARCHAR(20),
phone VARCHAR(20),
store_type VARCHAR(50),
square_footage INT,
opening_date DATE
);
-- Customer Dimension
CREATE TABLE dim_customer (
customer_key INT PRIMARY KEY,
customer_id VARCHAR(20),
first_name VARCHAR(50),
last_name VARCHAR(50),
email VARCHAR(100),
phone VARCHAR(20),
address VARCHAR(200),
city VARCHAR(50),
state VARCHAR(50),
country VARCHAR(50),
postal_code VARCHAR(20),
gender VARCHAR(10),
birth_date DATE,
join_date DATE,
customer_segment VARCHAR(50)
);
-- Sales Fact Table
CREATE TABLE fact_sales (
sales_key INT PRIMARY KEY,
date_key INT REFERENCES dim_date(date_key),
product_key INT REFERENCES dim_product(product_key),
store_key INT REFERENCES dim_store(store_key),
customer_key INT REFERENCES dim_customer(customer_key),
transaction_id VARCHAR(20),
quantity INT,
unit_price DECIMAL(10,2),
discount_amount DECIMAL(10,2),
sales_amount DECIMAL(10,2),
cost_amount DECIMAL(10,2),
profit_amount DECIMAL(10,2)
);
Example Query Using Dimension Tables
Here's how you might query this data warehouse to answer a business question:
-- Query: What are the total sales by product category and month for the year 2023?
SELECT
p.category,
d.month_name,
SUM(f.sales_amount) as total_sales
FROM
fact_sales f
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_date d ON f.date_key = d.date_key
WHERE
d.year = 2023
GROUP BY
p.category,
d.month_name
ORDER BY
p.category,
d.month_number;
Expected output:
category | month_name | total_sales
------------+------------+------------
Electronics | January | 125000.00
Electronics | February | 118000.00
Electronics | March | 132000.00
...
Furniture | January | 95000.00
Furniture | February | 88000.00
...
ETL Process for Dimension Tables
Loading and maintaining dimension tables involves several steps:
- Extract: Pull data from source systems
- Transform: Clean, standardize, and structure the data
- Load: Populate dimension tables according to SCD rules
Here's a simplified Python example using pandas for a Type 2 SCD load:
import pandas as pd
import sqlalchemy as sa
from datetime import datetime
# Connect to the data warehouse
engine = sa.create_engine("postgresql://username:password@localhost:5432/datawarehouse")
# Load existing dimension data
existing_dim = pd.read_sql("SELECT * FROM dim_product WHERE current_flag = TRUE", engine)
# Load new source data
source_data = pd.read_sql("SELECT * FROM source_products", engine)
# Identify changes (simplified)
existing_dim['source_key'] = existing_dim['product_id']
merged = source_data.merge(
existing_dim[['product_key', 'source_key', 'product_name', 'category', 'price']],
on='source_key',
how='left',
suffixes=('_source', '_existing')
)
# Find records that need to be updated (Type 2 SCD)
changes = merged[
(merged['product_key'].notnull()) &
((merged['product_name_source'] != merged['product_name_existing']) |
(merged['category_source'] != merged['category_existing']) |
(merged['price_source'] != merged['price_existing']))
]
# Find new records
new_records = merged[merged['product_key'].isnull()]
# Expire old records
if not changes.empty:
update_sql = """
UPDATE dim_product
SET expiration_date = CURRENT_DATE,
current_flag = FALSE
WHERE product_key IN ({})
""".format(','.join(changes['product_key'].astype(str)))
with engine.connect() as conn:
conn.execute(update_sql)
# Insert new versions of changed records
if not changes.empty:
# Get the next available key
next_key_query = "SELECT COALESCE(MAX(product_key), 0) + 1 FROM dim_product"
next_key = pd.read_sql(next_key_query, engine).iloc[0, 0]
# Prepare the data
inserts = []
for _, row in changes.iterrows():
inserts.append({
'product_key': next_key,
'product_id': row['source_key'],
'product_name': row['product_name_source'],
'category': row['category_source'],
'price': row['price_source'],
'effective_date': datetime.now().date(),
'expiration_date': None,
'current_flag': True
})
next_key += 1
# Insert the data
inserts_df = pd.DataFrame(inserts)
inserts_df.to_sql('dim_product', engine, if_exists='append', index=False)
# Insert completely new records
if not new_records.empty:
# Similar process for new records
# ...
Common Challenges and Solutions
High Cardinality Dimensions
When dimensions have a very large number of members:
Solution: Consider creating mini-dimensions and junk dimensions to split large dimensions into manageable pieces.
Changing Hierarchies
When hierarchical relationships in dimensions change:
Solution: Use bridge tables or implement Type 2 SCD for the entire hierarchy path.
Late Arriving Dimensions
When facts arrive before their corresponding dimension data:
Solution: Create placeholder dimension records that can be updated later when the full dimension data arrives.
Summary
Dimension tables are critical components of a data warehouse that provide context and descriptive attributes for analyzing fact data. Key points to remember:
- Dimension tables contain descriptive attributes that give meaning to fact measurements
- They support different types of slowly changing dimensions (SCD) to track historical changes
- Well-designed dimension tables improve query performance and analytical capabilities
- Common types include standard dimensions, conformed dimensions, role-playing dimensions, and junk dimensions
- Best practices include using surrogate keys, preserving natural keys, and denormalizing for performance
With properly designed dimension tables, your data warehouse will provide a solid foundation for business intelligence, reporting, and data analysis.
Additional Resources
- Practice creating dimension tables for different business domains
- Experiment with implementing different SCD types
- Learn how to optimize dimension tables for query performance
- Explore advanced dimensional modeling techniques like snowflaking and outriggers
Exercises
- Design a time dimension table with appropriate attributes for fiscal year reporting
- Create a product dimension with at least three hierarchical levels
- Implement Type 2 SCD tracking for a customer dimension when addresses change
- Design a star schema for a simple order processing system with at least three dimensions
- Write SQL queries to analyze data using multiple dimension tables
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)