Dimension Tables

Introduction

Dimension tables are a fundamental component of data warehouse design. They provide the descriptive attributes that help analyze and filter the measurements stored in fact tables. If you think of a data warehouse as a way to answer business questions, dimension tables provide the "who," "what," "where," "when," and "how" context for your numerical data.

In this guide, we'll explore what dimension tables are, why they're important, how to design them effectively, and see examples of their implementation in real-world scenarios.

What are Dimension Tables?

Dimension tables contain descriptive attributes that provide context for the numerical measurements stored in fact tables. Together, fact tables and dimension tables form the core of dimensional modeling, which is a technique used in data warehouse design.

Key Characteristics of Dimension Tables

Descriptive: They contain textual descriptions and attributes rather than measures
Denormalized: Unlike normalized transactional databases, dimension tables often contain redundant data
Relatively small: Compared to fact tables, dimension tables typically have fewer rows
Slowly changing: Their data changes less frequently than fact tables
Referenced by fact tables: Fact tables link to dimension tables through foreign keys

Types of Dimension Tables

Standard Dimensions

These are the most common dimension tables that describe business entities like customers, products, or time periods.

Conformed Dimensions

Conformed dimensions are standard dimensions that are shared across multiple fact tables or even multiple data marts, ensuring consistency across the data warehouse.

Role-playing Dimensions

A single dimension table that is referenced multiple times by a fact table, each time with a different semantic meaning.

For example, a date dimension might be referenced as order_date, ship_date, and delivery_date in an orders fact table.

Junk Dimensions

A collection of random, low-cardinality flags and indicators consolidated into a single dimension table.

Degenerate Dimensions

These are dimension attributes stored in the fact table rather than in a separate dimension table. Order numbers or transaction IDs are common examples.

Anatomy of a Dimension Table

Let's look at a typical structure of a dimension table:

CREATE TABLE dim_product (
    product_key INT PRIMARY KEY,  -- Surrogate key
    product_id VARCHAR(50),       -- Natural/business key
    product_name VARCHAR(100),
    brand VARCHAR(50),
    category VARCHAR(50),
    subcategory VARCHAR(50),
    size VARCHAR(20),
    color VARCHAR(20),
    weight DECIMAL(10,2),
    price_tier VARCHAR(20),
    is_active BOOLEAN,
    effective_date DATE,          -- SCD tracking
    expiration_date DATE,         -- SCD tracking
    current_flag BOOLEAN          -- SCD tracking
);

Key Components

Surrogate Key: A system-generated unique identifier (often an auto-incrementing integer)
Natural Key: The original business identifier from the source system
Descriptive Attributes: Textual and categorical data that describe the dimension entity
Hierarchical Attributes: Attributes that form natural hierarchies (e.g., category > subcategory)
SCD Tracking Fields: Fields used to track changes over time in slowly changing dimensions

Slowly Changing Dimensions (SCD)

Dimensions can change over time, and these changes are managed through different SCD types:

Type 1 SCD (Overwrite)

The simplest approach where new data overwrites existing data. Historical values are not preserved.

-- Example of Type 1 SCD update
UPDATE dim_customer
SET address = '123 New Street',
    city = 'New City',
    state = 'NS',
    zipcode = '12345'
WHERE customer_id = 'CUST001';

Type 2 SCD (Add New Row)

Preserves historical data by adding a new row with the updated attributes while marking the old row as expired.

-- Example of Type 2 SCD update
-- First, expire the current record
UPDATE dim_customer
SET expiration_date = CURRENT_DATE - 1,
    current_flag = FALSE
WHERE customer_id = 'CUST001' AND current_flag = TRUE;

-- Then, insert the new record
INSERT INTO dim_customer (
    customer_key, customer_id, customer_name, address, city, state, zipcode,
    effective_date, expiration_date, current_flag
)
VALUES (
    (SELECT MAX(customer_key) + 1 FROM dim_customer),
    'CUST001', 'John Doe', '123 New Street', 'New City', 'NS', '12345',
    CURRENT_DATE, NULL, TRUE
);

Type 3 SCD (Add New Fields)

Maintains limited history by adding new columns to store previous values.

-- Example of Type 3 SCD update
UPDATE dim_customer
SET previous_address = address,
    address = '123 New Street',
    address_change_date = CURRENT_DATE
WHERE customer_id = 'CUST001';

Dimension Hierarchies

Dimensions often contain natural hierarchies that are important for data analysis and drill-down operations.

These hierarchies enable users to analyze data at different levels of granularity and drill down from summary information to detailed data.

Best Practices for Dimension Table Design

Use Surrogate Keys: Always use system-generated surrogate keys as primary keys
Include Natural Keys: Always preserve the business/natural keys from source systems
Use Descriptive Names: Make attribute names clear and business-oriented
Denormalize Where Appropriate: Don't be afraid to denormalize to improve query performance
Plan for Change: Implement appropriate SCD types based on business requirements
Add Descriptive Attributes: Include as many descriptive attributes as may be useful for analysis
Ensure Data Quality: Clean and standardize dimension data during ETL processes
Document Business Rules: Clearly document rules for handling special cases and changes

Real-World Example: Retail Data Warehouse

Let's design a simple star schema for a retail data warehouse with dimension tables:

-- Time Dimension
CREATE TABLE dim_date (
    date_key INT PRIMARY KEY,
    date_actual DATE,
    day_of_week VARCHAR(10),
    day_number_in_month INT,
    day_number_in_year INT,
    week_number_in_year INT,
    month_name VARCHAR(10),
    month_number INT,
    quarter INT,
    year INT,
    is_weekend BOOLEAN,
    is_holiday BOOLEAN,
    holiday_name VARCHAR(50)
);

-- Product Dimension
CREATE TABLE dim_product (
    product_key INT PRIMARY KEY,
    product_id VARCHAR(20),
    product_name VARCHAR(100),
    brand VARCHAR(50),
    category VARCHAR(50),
    subcategory VARCHAR(50),
    unit_price DECIMAL(10,2),
    unit_cost DECIMAL(10,2),
    current_flag BOOLEAN,
    effective_date DATE,
    expiration_date DATE
);

-- Store Dimension
CREATE TABLE dim_store (
    store_key INT PRIMARY KEY,
    store_id VARCHAR(20),
    store_name VARCHAR(100),
    address VARCHAR(200),
    city VARCHAR(50),
    state VARCHAR(50),
    country VARCHAR(50),
    postal_code VARCHAR(20),
    phone VARCHAR(20),
    store_type VARCHAR(50),
    square_footage INT,
    opening_date DATE
);

-- Customer Dimension
CREATE TABLE dim_customer (
    customer_key INT PRIMARY KEY,
    customer_id VARCHAR(20),
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    email VARCHAR(100),
    phone VARCHAR(20),
    address VARCHAR(200),
    city VARCHAR(50),
    state VARCHAR(50),
    country VARCHAR(50),
    postal_code VARCHAR(20),
    gender VARCHAR(10),
    birth_date DATE,
    join_date DATE,
    customer_segment VARCHAR(50)
);

-- Sales Fact Table
CREATE TABLE fact_sales (
    sales_key INT PRIMARY KEY,
    date_key INT REFERENCES dim_date(date_key),
    product_key INT REFERENCES dim_product(product_key),
    store_key INT REFERENCES dim_store(store_key),
    customer_key INT REFERENCES dim_customer(customer_key),
    transaction_id VARCHAR(20),
    quantity INT,
    unit_price DECIMAL(10,2),
    discount_amount DECIMAL(10,2),
    sales_amount DECIMAL(10,2),
    cost_amount DECIMAL(10,2),
    profit_amount DECIMAL(10,2)
);

Example Query Using Dimension Tables

Here's how you might query this data warehouse to answer a business question:

-- Query: What are the total sales by product category and month for the year 2023?
SELECT 
    p.category,
    d.month_name,
    SUM(f.sales_amount) as total_sales
FROM 
    fact_sales f
    JOIN dim_product p ON f.product_key = p.product_key
    JOIN dim_date d ON f.date_key = d.date_key
WHERE 
    d.year = 2023
GROUP BY 
    p.category,
    d.month_name
ORDER BY 
    p.category,
    d.month_number;

Expected output:

category    | month_name | total_sales
------------+------------+------------
Electronics | January    | 125000.00
Electronics | February   | 118000.00
Electronics | March      | 132000.00
...
Furniture   | January    | 95000.00
Furniture   | February   | 88000.00
...

ETL Process for Dimension Tables

Loading and maintaining dimension tables involves several steps:

Extract: Pull data from source systems
Transform: Clean, standardize, and structure the data
Load: Populate dimension tables according to SCD rules

Here's a simplified Python example using pandas for a Type 2 SCD load:

import pandas as pd
import sqlalchemy as sa
from datetime import datetime

# Connect to the data warehouse
engine = sa.create_engine("postgresql://username:password@localhost:5432/datawarehouse")

# Load existing dimension data
existing_dim = pd.read_sql("SELECT * FROM dim_product WHERE current_flag = TRUE", engine)

# Load new source data
source_data = pd.read_sql("SELECT * FROM source_products", engine)

# Identify changes (simplified)
existing_dim['source_key'] = existing_dim['product_id']
merged = source_data.merge(
    existing_dim[['product_key', 'source_key', 'product_name', 'category', 'price']],
    on='source_key',
    how='left',
    suffixes=('_source', '_existing')
)

# Find records that need to be updated (Type 2 SCD)
changes = merged[
    (merged['product_key'].notnull()) & 
    ((merged['product_name_source'] != merged['product_name_existing']) | 
     (merged['category_source'] != merged['category_existing']) |
     (merged['price_source'] != merged['price_existing']))
]

# Find new records
new_records = merged[merged['product_key'].isnull()]

# Expire old records
if not changes.empty:
    update_sql = """
    UPDATE dim_product
    SET expiration_date = CURRENT_DATE,
        current_flag = FALSE
    WHERE product_key IN ({})
    """.format(','.join(changes['product_key'].astype(str)))
    
    with engine.connect() as conn:
        conn.execute(update_sql)

# Insert new versions of changed records
if not changes.empty:
    # Get the next available key
    next_key_query = "SELECT COALESCE(MAX(product_key), 0) + 1 FROM dim_product"
    next_key = pd.read_sql(next_key_query, engine).iloc[0, 0]
    
    # Prepare the data
    inserts = []
    for _, row in changes.iterrows():
        inserts.append({
            'product_key': next_key,
            'product_id': row['source_key'],
            'product_name': row['product_name_source'],
            'category': row['category_source'],
            'price': row['price_source'],
            'effective_date': datetime.now().date(),
            'expiration_date': None,
            'current_flag': True
        })
        next_key += 1
    
    # Insert the data
    inserts_df = pd.DataFrame(inserts)
    inserts_df.to_sql('dim_product', engine, if_exists='append', index=False)

# Insert completely new records
if not new_records.empty:
    # Similar process for new records
    # ...

Common Challenges and Solutions

High Cardinality Dimensions

When dimensions have a very large number of members:

Solution: Consider creating mini-dimensions and junk dimensions to split large dimensions into manageable pieces.

Changing Hierarchies

When hierarchical relationships in dimensions change:

Solution: Use bridge tables or implement Type 2 SCD for the entire hierarchy path.

Late Arriving Dimensions

When facts arrive before their corresponding dimension data:

Solution: Create placeholder dimension records that can be updated later when the full dimension data arrives.

Summary

Dimension tables are critical components of a data warehouse that provide context and descriptive attributes for analyzing fact data. Key points to remember:

Dimension tables contain descriptive attributes that give meaning to fact measurements
They support different types of slowly changing dimensions (SCD) to track historical changes
Well-designed dimension tables improve query performance and analytical capabilities
Common types include standard dimensions, conformed dimensions, role-playing dimensions, and junk dimensions
Best practices include using surrogate keys, preserving natural keys, and denormalizing for performance

With properly designed dimension tables, your data warehouse will provide a solid foundation for business intelligence, reporting, and data analysis.

Additional Resources

Practice creating dimension tables for different business domains
Experiment with implementing different SCD types
Learn how to optimize dimension tables for query performance
Explore advanced dimensional modeling techniques like snowflaking and outriggers

Exercises

Design a time dimension table with appropriate attributes for fiscal year reporting
Create a product dimension with at least three hierarchical levels
Implement Type 2 SCD tracking for a customer dimension when addresses change
Design a star schema for a simple order processing system with at least three dimensions
Write SQL queries to analyze data using multiple dimension tables

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

Introduction​

What are Dimension Tables?​

Key Characteristics of Dimension Tables​

Types of Dimension Tables​

Standard Dimensions​

Conformed Dimensions​

Role-playing Dimensions​

Junk Dimensions​

Degenerate Dimensions​

Anatomy of a Dimension Table​

Key Components​

Slowly Changing Dimensions (SCD)​

Type 1 SCD (Overwrite)​

Type 2 SCD (Add New Row)​

Type 3 SCD (Add New Fields)​

Dimension Hierarchies​

Best Practices for Dimension Table Design​

Real-World Example: Retail Data Warehouse​

Example Query Using Dimension Tables​

ETL Process for Dimension Tables​

Common Challenges and Solutions​

High Cardinality Dimensions​

Changing Hierarchies​

Late Arriving Dimensions​

Summary​

Additional Resources​

Exercises​