Data Warehouse Design: Architecture, Schemas, and Where to Actually Start

A data warehouse is a centralized repository that stores large volumes of structured, historical data from multiple sources. Unlike a standard database built for daily transactions, a data warehouse is optimized specifically for analysis and business intelligence. By separating analytical workloads from operational ones, companies can run complex queries across massive datasets to identify long-term trends without slowing down their primary applications.

The design decisions you make early – schema type, ETL strategy, layer architecture – will determine whether your warehouse stays performant and maintainable at scale or becomes a source of technical debt within two years.

Data Warehouse Architecture: The Three Layers

| Layer | Name | Purpose | Tools |
|---|---|---|---|
| Source layer | Raw / Staging | Raw data ingested from source systems as-is | Fivetran, Airbyte, Stitch |
| Storage layer | Data Warehouse Core | Cleaned, modeled, organized data | Snowflake, BigQuery, Redshift |
| Presentation layer | Data Marts / Reports | Subject-specific views for end users | dbt, Looker, Tableau |

A well-designed warehouse keeps these layers clearly separated. The raw layer preserves original data (critical for debugging and reprocessing). The core layer applies business logic. The presentation layer delivers curated views to business users.
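The layer separation can be sketched in a few lines of Python, using an in-memory SQLite database as a stand-in for the warehouse. All table and column names here are illustrative: raw data lands untouched, the core layer applies typing and cleaning rules, and the presentation layer exposes a curated view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Source layer: land raw data exactly as received (everything as text).
conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, ordered_at TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("1", "19.99", "2024-01-05"), ("2", "5.00", "2024-01-05"), ("3", "bad", "2024-01-06")],
)

# Storage layer: apply typing and cleaning rules; filter unparseable rows.
conn.execute("""
    CREATE TABLE core_orders AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           CAST(amount AS REAL)      AS amount,
           DATE(ordered_at)          AS order_date
    FROM raw_orders
    WHERE amount GLOB '[0-9]*'
""")

# Presentation layer: a curated view for business users.
conn.execute("""
    CREATE VIEW daily_revenue AS
    SELECT order_date, SUM(amount) AS revenue
    FROM core_orders
    GROUP BY order_date
""")

rows = conn.execute("SELECT * FROM daily_revenue ORDER BY order_date").fetchall()
```

Because `raw_orders` is preserved as-is, the cleaning rule in the core layer can be changed later and the whole pipeline replayed without going back to the source systems.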

Schema Design: Star vs Snowflake vs Data Vault

This is where most warehouse design conversations begin – and where many teams get stuck.

| Schema | Structure | Pros | Cons | Best for |
|---|---|---|---|---|
| Star schema | Fact table + denormalized dimensions | Simple queries; fast aggregations; easy for BI tools | Data redundancy in dimensions | Most analytics workloads |
| Snowflake schema | Fact table + normalized dimensions | Less redundancy; consistent hierarchies | More joins; harder for non-technical users | Complex hierarchies, strict normalization |
| Data Vault | Hubs, Links, Satellites | Highly auditable; handles schema changes well | Complex; steep learning curve | Enterprise DWH; regulatory compliance |

The practical recommendation: start with a star schema. The query simplicity and BI tool compatibility outweigh snowflake's normalization benefits for most teams. Move to Data Vault only if audit requirements or extremely complex historical tracking demand it.

The Fact Table: Heart of the Star Schema

The fact table stores measurable business events – sales transactions, website clicks, inventory movements. Each row represents one event.

What goes in a fact table:

  • Foreign keys to dimension tables
  • Numeric measures (revenue, quantity, duration)
  • Date keys (joining to date dimension)

What doesn’t belong:

  • Descriptive attributes (those go in dimensions)
  • Text fields (bad for aggregation)
  • Calculated fields that can be derived at query time
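The split above can be made concrete with a small SQLite sketch (table and column names are hypothetical): foreign keys and numeric measures live in the fact table, descriptive text lives in the dimension, and a derivable value like revenue is computed at query time rather than stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Dimension: descriptive attributes only.
conn.execute("CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT)")

# Fact: foreign keys + numeric measures, one row per sale event.
conn.execute("""
    CREATE TABLE fact_sales (
        date_key    INTEGER,   -- FK to the date dimension
        product_key INTEGER,   -- FK to dim_product
        quantity    INTEGER,   -- numeric measure
        unit_price  REAL       -- numeric measure
        -- no stored 'total' column: revenue is derived at query time
    )
""")

conn.execute("INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware')")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                 [(20240105, 1, 3, 10.0), (20240106, 1, 2, 10.0)])

# Derived measure computed at query time, sliced by a dimension attribute.
revenue = conn.execute("""
    SELECT p.category, SUM(f.quantity * f.unit_price) AS revenue
    FROM fact_sales f
    JOIN dim_product p USING (product_key)
    GROUP BY p.category
""").fetchone()
```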

Dimension Tables: The Context Around Facts

Dimension tables provide the “who, what, where, when” context for your facts.

| Dimension | What it describes | Example columns |
|---|---|---|
| Date dimension | Calendar hierarchy | Date, day, month, quarter, year, fiscal period |
| Customer dimension | Customer attributes | Name, segment, region, join date |
| Product dimension | Product attributes | Name, category, SKU, price |
| Geography dimension | Location hierarchy | City, state, country, region |
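A date dimension is usually pre-built rather than derived on the fly. A minimal generator in plain Python might look like this; the column choices mirror the table above, and the `YYYYMMDD` surrogate key is one common convention, not the only one.

```python
from datetime import date, timedelta

def build_date_dimension(start: date, end: date) -> list[dict]:
    """Generate one row per calendar day between start and end, inclusive."""
    rows = []
    d = start
    while d <= end:
        rows.append({
            "date_key": int(d.strftime("%Y%m%d")),   # surrogate key, e.g. 20240101
            "date": d.isoformat(),
            "day": d.day,
            "month": d.month,
            "quarter": (d.month - 1) // 3 + 1,
            "year": d.year,
        })
        d += timedelta(days=1)
    return rows

# Pre-build a full year (2024 is a leap year, so 366 rows).
dim_date = build_date_dimension(date(2024, 1, 1), date(2024, 12, 31))
```

In practice you would generate several years at once and load the result into the warehouse as a regular dimension table.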

Critical concept: Slowly Changing Dimensions (SCD)

What happens when a customer moves to a different state or a product changes its category? How you handle this determines whether your historical data is accurate:

| SCD type | How it works | When to use |
|---|---|---|
| Type 1 | Overwrite old value | History doesn’t matter |
| Type 2 | Add new row with effective dates | Need full historical accuracy |
| Type 3 | Add new column for current/previous | Only current and one prior state needed |

Type 2 is the most common – it preserves history by adding a new dimension row with start/end dates whenever an attribute changes.
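The Type 2 mechanics can be sketched as follows, with a dimension held as a plain list of dicts and illustrative field names (`start_date`/`end_date`, with `end_date = None` marking the current row):

```python
from datetime import date

def apply_scd2(dim_rows: list[dict], customer_id: int, new_state: str, as_of: date) -> list[dict]:
    """Close the current row for this customer and append a new version."""
    current = next(r for r in dim_rows
                   if r["customer_id"] == customer_id and r["end_date"] is None)
    if current["state"] == new_state:
        return dim_rows                        # attribute unchanged: nothing to do
    current["end_date"] = as_of                # close the old version
    dim_rows.append({                          # open the new version
        "customer_id": customer_id,
        "state": new_state,
        "start_date": as_of,
        "end_date": None,                      # None marks the current row
    })
    return dim_rows

# Customer 42 moves from CA to TX: history is preserved as two rows.
dim_customer = [{"customer_id": 42, "state": "CA",
                 "start_date": date(2023, 1, 1), "end_date": None}]
apply_scd2(dim_customer, 42, "TX", date(2024, 6, 1))
```

Facts dated before June 2024 join to the CA row via the effective dates; later facts join to the TX row, which is what keeps historical reports accurate.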

Modern Data Warehouse Stack (2024-2025)

| Component | Leading tools |
|---|---|
| Data ingestion (ELT) | Fivetran, Airbyte, Stitch |
| Storage and compute | Snowflake, BigQuery, Databricks, Redshift |
| Transformation layer | dbt (industry standard for transformations) |
| Orchestration | Airflow, Prefect, Dagster |
| BI / visualization | Tableau, Looker, Power BI, Metabase |
| Data catalog | dbt docs, Atlan, Alation |

The modern shift is from ETL (transform before loading) to ELT (load raw data, then transform in the warehouse). Cloud warehouses now have the compute power to transform at scale, so transformation logic moves into dbt models rather than pre-warehouse pipelines.

Common Data Warehouse Design Mistakes

| Mistake | What happens | Fix |
|---|---|---|
| Skipping the raw/staging layer | No source data to reprocess when logic changes | Always land raw data first |
| One giant fact table | Unmanageable; mixed granularity | One fact table per business process |
| No date dimension | Date-based queries become painful | Pre-build a date dimension spanning years |
| Business logic in the BI layer | Reports become inconsistent across tools | Define metrics in the warehouse / dbt |
| Ignoring grain definition | Queries return wrong aggregations | Define exactly what one row represents |
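The grain mistake in particular is easy to demonstrate. In this sketch (illustrative names, SQLite as a stand-in), a fact table at order grain is joined to order lines: the join fans out, repeating each order total once per line, and the sum is silently wrong.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_orders (order_id INTEGER, order_total REAL)")   # grain: one row per order
conn.execute("CREATE TABLE order_lines (order_id INTEGER, line_no INTEGER)")    # grain: one row per line
conn.execute("INSERT INTO fact_orders VALUES (1, 100.0)")
conn.executemany("INSERT INTO order_lines VALUES (?, ?)", [(1, 1), (1, 2), (1, 3)])

# Wrong: the join repeats order_total once per line (grain mismatch).
wrong = conn.execute("""
    SELECT SUM(o.order_total)
    FROM fact_orders o
    JOIN order_lines l USING (order_id)
""").fetchone()[0]

# Right: aggregate at the fact table's own grain (one row per order).
right = conn.execute("SELECT SUM(order_total) FROM fact_orders").fetchone()[0]
```

Here `wrong` triples the real revenue while `right` returns it exactly, which is why defining the grain of every fact table up front matters.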

The Bottom Line

Data warehouse design is fundamentally about separating concerns: raw data from transformed data, facts from dimensions, source logic from business logic. Start with a star schema, build a robust staging layer, adopt dbt for transformations, and define your grain before writing a single table. The teams that get this right early avoid the painful rewrites that slow every analytics team down at scale.