Data Warehouse Architecture
Introduction
A data warehouse is a centralized repository that stores structured data from various sources to support business intelligence activities such as reporting, analysis, and data mining. The architecture of a data warehouse determines how data flows in, how it is processed and stored, and how end users ultimately access it.
In this guide, we'll explore the fundamental components and popular architectural patterns of data warehouses. Whether you're setting up your first data system or looking to understand enterprise-level analytics infrastructure, understanding these architecture principles will help you make informed decisions about your data strategy.
Core Components of Data Warehouse Architecture
Before diving into specific architectural models, let's understand the essential components that make up most data warehouse systems:
1. Source Systems
Source systems are the origin points of data that will eventually flow into your data warehouse:
- Operational databases: Transactional systems like CRM, ERP, or e-commerce platforms
- External data sources: Third-party APIs, partner data feeds, or purchased datasets
- Flat files: CSV, Excel, or other structured file formats
- IoT devices: Sensors and connected devices generating data streams
2. Data Integration Layer (ETL/ELT)
This layer is responsible for moving data from source systems into the warehouse through processes known as:
- Extract: Retrieving data from source systems
- Transform: Converting, cleaning, and structuring the data
- Load: Writing the processed data into the warehouse
Modern data pipelines might use either ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) depending on the specific requirements and tools.
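To make the difference concrete, here is a minimal SQL sketch of the ELT pattern, where the transformation step runs inside the warehouse itself. The table and column names (stg_orders, clean_orders) are illustrative assumptions, not part of any particular system:

-- ELT: raw rows are loaded first, then transformed inside the warehouse.
-- (In ETL, the cleanup below would happen in an external tool before loading.)
CREATE TABLE stg_orders (
    order_id   VARCHAR(20),
    order_date VARCHAR(20),  -- still raw text, not yet validated
    amount     VARCHAR(20)   -- still raw text, may contain bad values
);

CREATE TABLE clean_orders (
    order_id   VARCHAR(20) NOT NULL,
    order_date DATE NOT NULL,
    amount     DECIMAL(10,2) NOT NULL
);

-- The "T" step is plain SQL running on the warehouse's own engine
INSERT INTO clean_orders (order_id, order_date, amount)
SELECT order_id,
       CAST(order_date AS DATE),
       CAST(amount AS DECIMAL(10,2))
FROM stg_orders
WHERE order_id IS NOT NULL;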
3. Data Storage Layer
The central repository where data is stored, typically consisting of:
- Staging area: Temporary storage for raw data before processing
- Data warehouse database: The main storage system optimized for queries
- Data marts: Subject-specific subsets of the warehouse for particular business areas
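One rough way to picture these zones is as separate schemas within a single database. A minimal sketch, with schema and view names (staging, warehouse, mart_sales) as illustrative assumptions, anticipating the fact and dimension tables built in the implementation example later in this guide:

-- Separate schemas keep the storage zones distinct
CREATE SCHEMA staging;     -- temporary landing area for raw data
CREATE SCHEMA warehouse;   -- the query-optimized central store
CREATE SCHEMA mart_sales;  -- subject-specific data mart

-- A simple data mart can be little more than a view over warehouse tables
CREATE VIEW mart_sales.monthly_revenue AS
SELECT dt.year, dt.month, SUM(fs.sales_amount) AS revenue
FROM warehouse.fact_sales fs
JOIN warehouse.dim_time dt ON fs.time_key = dt.time_key
GROUP BY dt.year, dt.month;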
4. Metadata Repository
Metadata ("data about data") helps track and manage all aspects of the warehouse:
- Technical metadata: Schema definitions, data lineage, ETL mappings
- Business metadata: Business definitions, ownership information, data quality metrics
- Operational metadata: Processing logs, usage statistics, performance metrics
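The metadata repository itself is often just a set of ordinary tables. As a minimal sketch of operational metadata, assuming a PostgreSQL-style database (the schema is an illustrative assumption, not a standard):

-- One row per pipeline run: what ran, what it touched, and how it went
CREATE TABLE etl_run_log (
    run_id        SERIAL PRIMARY KEY,    -- PostgreSQL-style auto-increment
    pipeline_name VARCHAR(100) NOT NULL,
    source_system VARCHAR(100),          -- lineage: where the data came from
    target_table  VARCHAR(100),          -- lineage: where it was written
    rows_loaded   INT,
    started_at    TIMESTAMP,
    finished_at   TIMESTAMP,
    status        VARCHAR(20)            -- e.g. 'success' or 'failed'
);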
5. Access Layer
Tools and interfaces that allow users to interact with the warehoused data:
- Reporting tools: For creating standard or custom reports
- BI platforms: For interactive dashboards and visualizations
- Query tools: For ad-hoc analysis
- Data mining applications: For discovering patterns and insights
Common Data Warehouse Architectural Patterns
1. Traditional Three-Tier Architecture
The classic approach to data warehousing consists of three tiers:
- Bottom Tier: Source data systems
- Middle Tier: ETL processes and the data warehouse itself, often including data marts
- Top Tier: Front-end client tools for querying, reporting, and analysis
This architecture provides a clear separation of concerns but can become complex in large implementations.
2. Kimball's Dimensional Modeling Architecture
Ralph Kimball's approach focuses on building data marts first, using consistent dimensional models.
Key aspects of Kimball's architecture:
- Star schema model: Facts surrounded by dimension tables
- Conformed dimensions: Shared dimensions across data marts (e.g., time, customer)
- Business process focus: Data marts are organized around business processes
- Integration through bus architecture: Data marts connect through shared dimensions
This approach delivers faster results by building individual data marts that incrementally form the complete warehouse.
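The bus architecture is easiest to see in DDL: fact tables from different data marts stay integrated because they reference the same physical dimension tables. A minimal sketch (fact_returns_mart is a hypothetical second mart; the dimension tables match those built in the implementation example later in this guide):

-- Both business processes conform to the same dimensions, so "customer"
-- and "time" mean exactly the same thing in every mart.
CREATE TABLE fact_sales_mart (
    time_key     INT REFERENCES dim_time(time_key),
    customer_key INT REFERENCES dim_customer(customer_key),
    sales_amount DECIMAL(10,2)
);

CREATE TABLE fact_returns_mart (
    time_key      INT REFERENCES dim_time(time_key),          -- same dim_time
    customer_key  INT REFERENCES dim_customer(customer_key),  -- same dim_customer
    return_amount DECIMAL(10,2)
);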
3. Inmon's Corporate Information Factory (CIF)
Bill Inmon advocates a top-down enterprise approach.
Key aspects of Inmon's architecture:
- Enterprise-wide approach: Build the complete data warehouse first
- Normalized data model: 3NF (Third Normal Form) for the central repository
- Top-down design: Enterprise warehouse feeds departmental data marts
- Subject-oriented organization: Data organized by subject areas
This approach provides strong data consistency but may take longer to implement initially.
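To see how this differs from a star schema, here is a minimal 3NF sketch of the same customer data that the dimensional example later in this guide flattens into a single dim_customer table (table names are illustrative assumptions):

-- In 3NF, each fact lives in exactly one place: segments and addresses
-- get their own tables instead of being repeated on every customer row.
CREATE TABLE customer_segment (
    segment_id   INT PRIMARY KEY,
    segment_name VARCHAR(50)
);

CREATE TABLE customer (
    customer_id INT PRIMARY KEY,
    name        VARCHAR(100),
    segment_id  INT REFERENCES customer_segment(segment_id)
);

CREATE TABLE customer_address (
    address_id  INT PRIMARY KEY,
    customer_id INT REFERENCES customer(customer_id),
    city        VARCHAR(50),
    country     VARCHAR(50)
);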
4. Data Vault Architecture
A more flexible and scalable approach designed for enterprise-scale data warehouses.
Key aspects of Data Vault architecture:
- Hubs: Business keys that identify entities
- Links: Relationships between hubs
- Satellites: Descriptive attributes that change over time
- Highly adaptable: Designed to handle changing business requirements
- Auditability: Preserves full history and lineage of all data
This architecture excels in environments with frequent changes and high compliance requirements.
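A minimal Data Vault sketch for customers and orders follows. The naming (hub_, link_, sat_ prefixes, hash keys, load metadata) reflects common Data Vault conventions, but the exact layout is an illustrative assumption:

-- Hub: only the business key plus load metadata
CREATE TABLE hub_customer (
    customer_hk   CHAR(32) PRIMARY KEY,   -- hash of the business key
    customer_id   VARCHAR(20) NOT NULL,   -- the business key itself
    load_date     TIMESTAMP NOT NULL,
    record_source VARCHAR(50) NOT NULL
);

CREATE TABLE hub_order (
    order_hk      CHAR(32) PRIMARY KEY,
    order_id      VARCHAR(20) NOT NULL,
    load_date     TIMESTAMP NOT NULL,
    record_source VARCHAR(50) NOT NULL
);

-- Link: the relationship between the two hubs
CREATE TABLE link_customer_order (
    customer_order_hk CHAR(32) PRIMARY KEY,
    customer_hk       CHAR(32) REFERENCES hub_customer(customer_hk),
    order_hk          CHAR(32) REFERENCES hub_order(order_hk),
    load_date         TIMESTAMP NOT NULL,
    record_source     VARCHAR(50) NOT NULL
);

-- Satellite: descriptive attributes, kept as full history
CREATE TABLE sat_customer_details (
    customer_hk CHAR(32) REFERENCES hub_customer(customer_hk),
    load_date   TIMESTAMP NOT NULL,
    name        VARCHAR(100),
    email       VARCHAR(100),
    PRIMARY KEY (customer_hk, load_date)  -- every change becomes a new row
);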
5. Modern Cloud Data Warehouse Architecture
Contemporary cloud-based data warehouses introduce new architectural patterns.
Key aspects of modern cloud data warehouse architecture:
- ELT instead of ETL: Load raw data first, transform within the warehouse
- Separation of storage and compute: Scalable independently
- Data lake integration: Structured and unstructured data together
- Serverless operations: Automated scaling without infrastructure management
- Self-service analytics: Democratized access to data
This approach offers flexibility, cost efficiency, and scalability that are difficult to match with traditional on-premises architectures.
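As a rough illustration of ELT with semi-structured data, raw JSON can be landed untouched and reshaped with SQL afterwards. This sketch uses PostgreSQL's JSONB type as a stand-in for a cloud warehouse's semi-structured column type (such as VARIANT in Snowflake); the table and field names are illustrative:

-- "E" and "L": land raw JSON events exactly as they arrive
CREATE TABLE raw_order_events (
    payload   JSONB NOT NULL,
    loaded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- "T": transform on read, inside the warehouse
CREATE VIEW order_events AS
SELECT payload->>'order_id'                AS order_id,
       (payload->>'order_date')::DATE      AS order_date,
       (payload->>'amount')::DECIMAL(10,2) AS amount
FROM raw_order_events;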
Implementation Example: Setting Up a Simple Data Warehouse
Let's walk through a simplified example of setting up a basic data warehouse using a star schema for an e-commerce business. We'll use SQL for this example.
Step 1: Create Dimension Tables
First, we'll create dimension tables to represent the core business entities:
-- Time dimension
CREATE TABLE dim_time (
    time_key INT PRIMARY KEY,
    date DATE NOT NULL,
    day_of_week VARCHAR(10),
    month VARCHAR(10),
    quarter VARCHAR(2),
    year INT
);

-- Customer dimension
CREATE TABLE dim_customer (
    customer_key INT PRIMARY KEY,
    customer_id VARCHAR(20) NOT NULL,
    name VARCHAR(100),
    email VARCHAR(100),
    city VARCHAR(50),
    state VARCHAR(50),
    country VARCHAR(50),
    customer_since DATE,
    customer_segment VARCHAR(20)
);

-- Product dimension
CREATE TABLE dim_product (
    product_key INT PRIMARY KEY,
    product_id VARCHAR(20) NOT NULL,
    product_name VARCHAR(100),
    category VARCHAR(50),
    subcategory VARCHAR(50),
    price DECIMAL(10,2),
    cost DECIMAL(10,2),
    brand VARCHAR(50)
);
Step 2: Create Fact Table
Next, we'll create the fact table that links to the dimensions and stores the measures:
-- Sales fact table: foreign keys to the dimensions plus the measures
CREATE TABLE fact_sales (
    sales_key INT PRIMARY KEY,
    time_key INT REFERENCES dim_time(time_key),
    customer_key INT REFERENCES dim_customer(customer_key),
    product_key INT REFERENCES dim_product(product_key),
    order_id VARCHAR(20),
    quantity INT,
    sales_amount DECIMAL(10,2),
    discount_amount DECIMAL(10,2),
    profit_amount DECIMAL(10,2)
);
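Fact tables are queried almost entirely through their foreign keys, so on a conventional relational database it is usually worth indexing those columns (many cloud warehouses manage this differently or skip indexes entirely). A sketch, assuming a PostgreSQL-style system:

-- Speed up joins from the fact table to its dimensions
CREATE INDEX idx_fact_sales_time     ON fact_sales (time_key);
CREATE INDEX idx_fact_sales_customer ON fact_sales (customer_key);
CREATE INDEX idx_fact_sales_product  ON fact_sales (product_key);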
Step 3: Create a Simple ETL Process
Here's a basic Python script that demonstrates the ETL process for loading data into our warehouse:
import pandas as pd
import sqlalchemy
from datetime import datetime

# Create database connection (placeholder credentials; PostgreSQL via
# SQLAlchemy also requires a driver such as psycopg2)
engine = sqlalchemy.create_engine('postgresql://username:password@localhost:5432/data_warehouse')

# 1. Extract data from source systems (simulated with CSV files whose
#    columns are assumed to match the dimension tables above)
orders_df = pd.read_csv('source_data/orders.csv')
products_df = pd.read_csv('source_data/products.csv')
customers_df = pd.read_csv('source_data/customers.csv')

# 2. Transform data

# Process time dimension: one row per distinct order date
unique_dates = orders_df['order_date'].unique()
time_dim = []
for date_str in unique_dates:
    date = datetime.strptime(date_str, '%Y-%m-%d')
    time_dim.append({
        'time_key': int(date.strftime('%Y%m%d')),  # e.g. 2023-01-05 -> 20230105
        'date': date,
        'day_of_week': date.strftime('%A'),
        'month': date.strftime('%B'),
        'quarter': f"Q{(date.month - 1) // 3 + 1}",
        'year': date.year
    })
time_df = pd.DataFrame(time_dim)

# Process customer dimension: rename source columns, assign surrogate keys
customer_df = customers_df.rename(columns={
    'id': 'customer_id',
    'registration_date': 'customer_since'
})
customer_df['customer_key'] = customer_df.index + 1

# Process product dimension
product_df = products_df.rename(columns={
    'id': 'product_id',
    'name': 'product_name'
})
product_df['product_key'] = product_df.index + 1

# Process sales fact: look up each order's surrogate keys in the dimensions
sales_facts = []
for _, row in orders_df.iterrows():
    time_key = int(datetime.strptime(row['order_date'], '%Y-%m-%d').strftime('%Y%m%d'))
    customer_key = customer_df[customer_df['customer_id'] == row['customer_id']]['customer_key'].iloc[0]
    product_key = product_df[product_df['product_id'] == row['product_id']]['product_key'].iloc[0]
    sales_facts.append({
        'sales_key': len(sales_facts) + 1,
        'time_key': time_key,
        'customer_key': customer_key,
        'product_key': product_key,
        'order_id': row['order_id'],
        'quantity': row['quantity'],
        'sales_amount': row['price'] * row['quantity'],
        'discount_amount': row['discount'],
        'profit_amount': (row['price'] - row['cost']) * row['quantity'] - row['discount']
    })
sales_df = pd.DataFrame(sales_facts)

# 3. Load data into the warehouse (append assumes the tables start empty;
#    rerunning as-is would duplicate rows and violate the primary keys)
time_df.to_sql('dim_time', engine, if_exists='append', index=False)
customer_df.to_sql('dim_customer', engine, if_exists='append', index=False)
product_df.to_sql('dim_product', engine, if_exists='append', index=False)
sales_df.to_sql('fact_sales', engine, if_exists='append', index=False)

print("ETL process completed successfully!")
Step 4: Query the Data Warehouse
Once data is loaded, we can perform business analysis using SQL queries:
-- Query to get monthly sales by product category
SELECT
    dt.year,
    dt.month,
    dp.category,
    SUM(fs.sales_amount) AS total_sales,
    SUM(fs.profit_amount) AS total_profit
FROM fact_sales fs
JOIN dim_time dt ON fs.time_key = dt.time_key
JOIN dim_product dp ON fs.product_key = dp.product_key
GROUP BY dt.year, dt.month, dp.category
-- dt.month holds names like 'January', so sort by an actual date
-- to keep the months in calendar order rather than alphabetical order
ORDER BY dt.year, MIN(dt.date), total_sales DESC;
Sample output:
year | month | category | total_sales | total_profit
------+----------+--------------+-------------+--------------
2023 | January | Electronics | 124560.50 | 42980.25
2023 | January | Clothing | 89270.75 | 32150.60
2023 | January | Home Goods | 65432.25 | 18765.40
2023 | February | Electronics | 135670.80 | 48790.35
2023 | February | Clothing | 92345.60 | 35670.20
2023 | February | Home Goods | 68790.35 | 21345.75
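Ad-hoc questions follow the same join pattern. For instance, a hypothetical query ranking customer segments by profitability:

-- Which customer segments generate the most profit?
SELECT
    dc.customer_segment,
    COUNT(DISTINCT fs.order_id) AS orders,
    SUM(fs.sales_amount)        AS total_sales,
    SUM(fs.profit_amount)       AS total_profit
FROM fact_sales fs
JOIN dim_customer dc ON fs.customer_key = dc.customer_key
GROUP BY dc.customer_segment
ORDER BY total_profit DESC;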
Real-World Application: E-commerce Analytics Platform
Let's examine how an e-commerce company might implement its data warehouse architecture:
Business Requirements
- Track customer purchasing patterns and behaviors
- Analyze product performance across different categories
- Monitor sales performance by geographical regions
- Forecast inventory requirements
- Support marketing campaign analysis
Architecture Implementation
The company adopts a hybrid approach, combining the Kimball methodology with cloud data warehouse technology:
1. Source Systems:
- Transactional database (orders, customers, products)
- Web analytics platform (clickstream data)
- Marketing automation tools (campaign data)
- Inventory management system
2. Data Integration:
- Cloud-based ETL tool extracts data daily
- Real-time streaming for high-priority data (orders, inventory)
- Data quality rules applied during transformation
3. Data Warehouse:
- Star schema design with conformed dimensions
- Key fact tables: sales, website_visits, inventory_levels
- Key dimension tables: customers, products, time, geography, campaigns
4. Data Access:
- Self-service BI tool for business analysts
- Executive dashboards for KPI tracking
- Data API for application integration
- Machine learning models for personalization
Business Outcomes
With this architecture in place, the company can:
- Reduce time to generate business reports from days to minutes
- Increase accuracy of inventory forecasts by 35%
- Improve marketing ROI through targeted campaign analysis
- Enable personalized customer experiences through data-driven insights
- Support rapid business decisions with near real-time data
Comparison of Data Warehouse Architectures
| Architecture | Pros | Cons | Best For |
|---|---|---|---|
| Kimball (Dimensional) | Faster implementation time; business-process focused; easier for business users to understand | May lead to data silos; requires careful dimension design | Organizations needing quick wins and departmental solutions |
| Inmon (Enterprise) | Enterprise-wide consistency; strong data integrity; comprehensive historical data | Longer implementation time; more complex initial design; higher upfront cost | Large enterprises requiring an integrated, consistent view across all data |
| Data Vault | Highly adaptable to change; excellent auditability; scales well for big data | Steeper learning curve; more complex queries; performance overhead | Organizations with regulatory requirements and frequent business changes |
| Cloud-based | Lower infrastructure costs; faster time to market; scalability on demand | Potential vendor lock-in; data transfer costs; security considerations | Companies seeking agility, startups, and organizations with varying workloads |
Best Practices for Data Warehouse Architecture
Regardless of which architecture you choose, follow these best practices:
- Begin with the end in mind: Understand business requirements before designing
- Start small but think big: Implement incrementally but plan for growth
- Design for flexibility: Business needs will change over time
- Prioritize data quality: Clean data is more valuable than more data
- Implement robust metadata management: Document everything
- Consider performance from day one: Design with query patterns in mind
- Balance standardization and innovation: Create guidelines but allow for experimentation
- Plan for data governance: Consider security, privacy, and compliance
- Automate wherever possible: Reduce manual interventions in data pipelines
- Monitor and measure: Track usage, performance, and business impact
Summary
Data warehouse architecture provides the foundation for an organization's business intelligence and analytics capabilities. We've explored:
- The core components that make up data warehouse systems
- Traditional three-tier architecture that separates sources, storage, and presentation
- Kimball's dimensional approach focusing on business processes and star schemas
- Inmon's enterprise-wide normalized data model
- Data Vault's flexible and auditable design
- Modern cloud-based architectures leveraging separation of storage and compute
Each architectural pattern has its strengths and weaknesses, and the right choice depends on your specific business requirements, existing technology stack, and organizational culture.
As data volumes grow and business needs evolve, your data warehouse architecture may need to adapt. Starting with a clear understanding of these foundational patterns will help you make informed decisions as you build and scale your data analytics capabilities.
Further Learning
To deepen your understanding of data warehouse architecture, consider:
1. Practice Exercises:
- Design a star schema for a retail business
- Create a simple ETL pipeline for loading data
- Write SQL queries to analyze warehoused data
2. Advanced Topics to Explore:
- Slowly Changing Dimensions (SCD) techniques (a Type 2 sketch follows this list)
- Data modeling patterns for specific industries
- Real-time data warehousing approaches
- Data mesh and data fabric architectures
3. Tools to Experiment With:
- Open-source data pipeline tools like Apache Airflow
- Cloud data warehouses like Snowflake, BigQuery, or Redshift
- Data modeling tools like dbt (data build tool)
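As a starting point for the SCD topic above, here is a minimal sketch of the Type 2 technique applied to the dim_customer table from the implementation example. The specific key, customer ID, and dates are hypothetical values chosen for illustration:

-- SCD Type 2: instead of overwriting a changed attribute, close out the
-- old row and insert a new one, preserving history.
ALTER TABLE dim_customer ADD COLUMN valid_from DATE;
ALTER TABLE dim_customer ADD COLUMN valid_to DATE;  -- NULL = still current
ALTER TABLE dim_customer ADD COLUMN is_current BOOLEAN;

-- A customer moves from the 'Standard' to the 'Premium' segment:
UPDATE dim_customer
SET valid_to = '2023-06-30', is_current = FALSE
WHERE customer_id = 'C-1001' AND is_current = TRUE;

INSERT INTO dim_customer (customer_key, customer_id, customer_segment,
                          valid_from, is_current)
VALUES (90001, 'C-1001', 'Premium', '2023-07-01', TRUE);  -- hypothetical values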