data-warehousing
Use this skill when designing data warehouses, building star or snowflake schemas, implementing slowly changing dimensions (SCDs), writing analytical SQL for Snowflake or BigQuery, creating fact and dimension tables, or planning ETL/ELT pipelines for analytics. Triggers on dimensional modeling, surrogate keys, conformed dimensions, warehouse architecture, data vault, partitioning strategies, materialized views, and any task requiring OLAP schema design or warehouse query optimization.
data-warehousing is a production-ready AI agent skill for claude-code, gemini-cli, and openai-codex. It covers designing data warehouses, building star or snowflake schemas, implementing slowly changing dimensions (SCDs), writing analytical SQL for Snowflake or BigQuery, creating fact and dimension tables, and planning ETL/ELT pipelines for analytics.
Quick Facts
| Field | Value |
|---|---|
| Category | data |
| Version | 0.1.0 |
| Platforms | claude-code, gemini-cli, openai-codex |
| License | MIT |
How to Install
- Make sure you have Node.js installed on your machine.
- Run the following command in your terminal:
npx skills add AbsolutelySkilled/AbsolutelySkilled --skill data-warehousing
- The data-warehousing skill is now available in your AI coding agent (Claude Code, Gemini CLI, OpenAI Codex, etc.).
Overview
A practical framework for designing, building, and optimizing analytical data warehouses using dimensional modeling. This skill covers star and snowflake schema design, slowly changing dimension (SCD) patterns, and platform-specific guidance for Snowflake and BigQuery. The focus is on making the right modeling decisions that balance query performance, storage cost, and maintainability for downstream analytics consumers.
Tags
data-warehouse star-schema snowflake bigquery dimensional-modeling scd
Platforms
- claude-code
- gemini-cli
- openai-codex
Frequently Asked Questions
What is data-warehousing?
Use this skill when designing data warehouses, building star or snowflake schemas, implementing slowly changing dimensions (SCDs), writing analytical SQL for Snowflake or BigQuery, creating fact and dimension tables, or planning ETL/ELT pipelines for analytics. Triggers on dimensional modeling, surrogate keys, conformed dimensions, warehouse architecture, data vault, partitioning strategies, materialized views, and any task requiring OLAP schema design or warehouse query optimization.
How do I install data-warehousing?
Run npx skills add AbsolutelySkilled/AbsolutelySkilled --skill data-warehousing in your terminal. The skill will be immediately available in your AI coding agent.
What AI agents support data-warehousing?
This skill works with claude-code, gemini-cli, openai-codex. Install it once and use it across any supported AI coding agent.
Maintainers
Generated from AbsolutelySkilled
SKILL.md
Data Warehousing
A practical framework for designing, building, and optimizing analytical data warehouses using dimensional modeling. This skill covers star and snowflake schema design, slowly changing dimension (SCD) patterns, and platform-specific guidance for Snowflake and BigQuery. The focus is on making the right modeling decisions that balance query performance, storage cost, and maintainability for downstream analytics consumers.
When to use this skill
Trigger this skill when the user:
- Designs a star schema or snowflake schema for analytical workloads
- Implements slowly changing dimensions (Type 1, 2, 3, or hybrid)
- Builds fact tables (transactional, periodic snapshot, or accumulating snapshot)
- Writes analytical SQL targeting Snowflake or BigQuery
- Plans ETL/ELT pipelines that load data into a warehouse
- Creates conformed dimensions shared across multiple fact tables
- Optimizes warehouse query performance (clustering, partitioning, materialized views)
- Chooses between Snowflake and BigQuery for a new project
Do NOT trigger this skill for:
- OLTP schema design or transactional database tuning (use database-engineering)
- Data pipeline orchestration tools like Airflow or dbt (those have their own skills)
Key principles
Model for the query, not the source - Warehouse schemas exist to make analytical queries fast and intuitive. Denormalize aggressively compared to OLTP. If analysts need to join seven tables to answer a basic question, the model is wrong.
Grain is the single most important decision - Every fact table must have a clearly declared grain (one row = one transaction, one day per customer, etc.). Mixing grains in a single fact table causes double-counting and broken aggregations that are extremely hard to debug.
Conformed dimensions enable cross-process analysis - A shared dim_customer or dim_date table used across all fact tables lets analysts drill across business processes without reconciliation headaches. Build conformed dimensions first.
Slowly changing dimensions must be an explicit design choice - Every dimension attribute changes over time. Decide upfront whether to overwrite (Type 1), track history (Type 2), or store the previous value (Type 3). Defaulting to Type 1 and later needing history is a painful migration.
Partition and cluster for your access patterns - Cloud warehouses charge by data scanned. Partitioning by date and clustering by high-cardinality filter columns can reduce costs and query times by 10-100x. Design these at table creation time.
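The conformed-dimension principle can be sketched with a tiny in-memory example: two fact tables share one dim_date, so a single query drills across both business processes. The tables and values here are illustrative, not part of the skill itself.

```python
import sqlite3

# Two fact tables conformed on a shared date dimension.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_sk INT PRIMARY KEY, full_date TEXT);
CREATE TABLE fct_sales (date_sk INT, net_amount REAL);
CREATE TABLE fct_returns (date_sk INT, refund_amount REAL);
INSERT INTO dim_date VALUES (20260101, '2026-01-01'), (20260102, '2026-01-02');
INSERT INTO fct_sales VALUES (20260101, 100.0), (20260101, 50.0), (20260102, 75.0);
INSERT INTO fct_returns VALUES (20260102, 20.0);
""")

# Drill across both processes through the conformed dim_date key.
rows = conn.execute("""
SELECT d.full_date,
       COALESCE(s.total_sales, 0)   AS total_sales,
       COALESCE(r.total_refunds, 0) AS total_refunds
FROM dim_date d
LEFT JOIN (SELECT date_sk, SUM(net_amount) AS total_sales
           FROM fct_sales GROUP BY date_sk) s ON d.date_sk = s.date_sk
LEFT JOIN (SELECT date_sk, SUM(refund_amount) AS total_refunds
           FROM fct_returns GROUP BY date_sk) r ON d.date_sk = r.date_sk
ORDER BY d.full_date
""").fetchall()
```

Because both facts resolve to the same date_sk values, no reconciliation between the two processes is needed.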
Core concepts
Dimensional modeling entities
| Entity | Role | Example |
|---|---|---|
| Fact table | Stores measurable business events (metrics) | fct_orders, fct_page_views |
| Dimension table | Stores descriptive context for facts | dim_customer, dim_product, dim_date |
| Surrogate key | Warehouse-generated integer/hash PK for dimensions | customer_sk (vs natural key customer_id) |
| Degenerate dimension | Dimension attribute stored directly on the fact table | order_number on fct_order_items |
| Conformed dimension | A dimension shared identically across multiple fact tables | dim_date, dim_geography |
Star schema vs snowflake schema
A star schema has fact tables at the center with denormalized dimension tables
radiating outward - one join from fact to any dimension. A snowflake schema
normalizes dimensions into sub-dimensions (e.g., dim_product -> dim_category ->
dim_department). Star schemas are preferred for most analytical workloads because they
minimize joins and are easier for BI tools to consume. Snowflake schemas save storage
but add join complexity - only use them when dimension tables are extremely large
(100M+ rows) and share sub-dimensions across many parents.
Fact table types
| Type | Grain | Example | When to use |
|---|---|---|---|
| Transaction | One row per event | fct_orders | Most common; captures atomic events |
| Periodic snapshot | One row per entity per period | fct_daily_inventory | Regular status measurements |
| Accumulating snapshot | One row per process lifetime | fct_order_fulfillment | Track milestones (ordered, shipped, delivered) |
| Factless fact | No measures, only dimension keys | fct_student_attendance | Record that an event occurred |
Slowly changing dimensions (SCD)
| Type | Behavior | Trade-off |
|---|---|---|
| Type 0 | Never changes (fixed attributes) | Use for birth date, original sign-up date |
| Type 1 | Overwrite old value | Simple but loses history |
| Type 2 | Add new row with version tracking | Preserves full history; most common for analytics |
| Type 3 | Add column for previous value | Tracks one prior value only; rarely sufficient |
| Type 6 (hybrid 1+2+3) | Type 2 rows + current value column | Best of both: history + easy current-state queries |
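The Type 2 behavior in the table above (expire the old version, append a new one) can be sketched language-neutrally. This is a hypothetical helper over a list of dicts, not code from the skill; only `segment` is tracked for change detection.

```python
from datetime import date, timedelta

HIGH_DATE = date(9999, 12, 31)  # sentinel expiry date for the current version

def apply_scd2(dim_rows, incoming, today):
    """Expire the current version and append a new one when `segment` changes."""
    for row in dim_rows:
        if row["is_current"] and row["customer_id"] == incoming["customer_id"]:
            if row["segment"] == incoming["segment"]:
                return dim_rows  # no change: nothing to do
            row["expiry_date"] = today - timedelta(days=1)  # expire old version
            row["is_current"] = False
    dim_rows.append({
        "customer_id": incoming["customer_id"],
        "segment": incoming["segment"],
        "effective_date": today,
        "expiry_date": HIGH_DATE,
        "is_current": True,
    })
    return dim_rows

dim = [{"customer_id": "C123", "segment": "SMB",
        "effective_date": date(2025, 1, 1), "expiry_date": HIGH_DATE, "is_current": True}]
dim = apply_scd2(dim, {"customer_id": "C123", "segment": "Enterprise"}, date(2026, 3, 14))
```

After the change, the customer has two versions: an expired SMB row and a current Enterprise row, which is exactly what a point-in-time join needs.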
Common tasks
Design a star schema
Model a retail sales domain with conformed dimensions.
-- Date dimension (conformed - used by all fact tables)
CREATE TABLE dim_date (
date_sk INT PRIMARY KEY, -- YYYYMMDD integer
full_date DATE NOT NULL,
day_of_week VARCHAR(10) NOT NULL,
month_name VARCHAR(10) NOT NULL,
quarter INT NOT NULL,
fiscal_year INT NOT NULL,
is_weekend BOOLEAN NOT NULL,
is_holiday BOOLEAN NOT NULL
);
-- Customer dimension
CREATE TABLE dim_customer (
customer_sk INT PRIMARY KEY, -- surrogate key
customer_id VARCHAR(50) NOT NULL, -- natural key
name VARCHAR(200),
segment VARCHAR(50),
region VARCHAR(100),
-- SCD Type 2 tracking
effective_date DATE NOT NULL,
expiry_date DATE NOT NULL DEFAULT '9999-12-31',
is_current BOOLEAN NOT NULL DEFAULT TRUE
);
-- Product dimension
CREATE TABLE dim_product (
product_sk INT PRIMARY KEY,
product_id VARCHAR(50) NOT NULL,
product_name VARCHAR(200),
category VARCHAR(100),
subcategory VARCHAR(100),
brand VARCHAR(100),
unit_cost DECIMAL(12,2)
);
-- Sales fact table (transaction grain: one row per line item)
CREATE TABLE fct_sales (
sale_sk BIGINT PRIMARY KEY,
date_sk INT NOT NULL REFERENCES dim_date(date_sk),
customer_sk INT NOT NULL REFERENCES dim_customer(customer_sk),
product_sk INT NOT NULL REFERENCES dim_product(product_sk),
quantity INT NOT NULL,
unit_price DECIMAL(12,2) NOT NULL,
discount_amt DECIMAL(12,2) NOT NULL DEFAULT 0,
net_amount DECIMAL(12,2) NOT NULL,
order_number VARCHAR(50) NOT NULL -- degenerate dimension
);
Declare the grain explicitly in a comment or documentation: "One row per order line item per day." Every team member must agree on the grain before building downstream reports.
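A declared grain can be enforced with a uniqueness test over the grain columns. A minimal sketch, assuming the grain columns are order_number plus product_sk (an assumption for illustration; pick whatever columns define your grain):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fct_sales (order_number TEXT, product_sk INT, net_amount REAL)")
conn.executemany("INSERT INTO fct_sales VALUES (?, ?, ?)",
                 [("ORD-1", 10, 99.0), ("ORD-1", 11, 25.0), ("ORD-1", 10, 99.0)])  # last row violates the grain

# Grain test: zero rows returned means one row per (order_number, product_sk) holds.
violations = conn.execute("""
SELECT order_number, product_sk, COUNT(*) AS n
FROM fct_sales
GROUP BY order_number, product_sk
HAVING COUNT(*) > 1
""").fetchall()
```

Run this after every load; a non-empty result means double-counting is already possible downstream.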
Implement SCD Type 2 in Snowflake
Track full history of customer attribute changes using MERGE.
-- Snowflake MERGE for SCD Type 2
MERGE INTO dim_customer AS target
USING staging_customers AS source
ON target.customer_id = source.customer_id
AND target.is_current = TRUE
-- Existing row where attributes changed: expire it
WHEN MATCHED
AND (target.segment != source.segment
OR target.region != source.region)
THEN UPDATE SET
target.expiry_date = CURRENT_DATE - 1,
target.is_current = FALSE
-- No match: brand new customer
WHEN NOT MATCHED THEN INSERT (
customer_sk, customer_id, name, segment, region,
effective_date, expiry_date, is_current
) VALUES (
dim_customer_seq.NEXTVAL,
source.customer_id, source.name, source.segment, source.region,
CURRENT_DATE, '9999-12-31', TRUE
);
-- Second pass: insert the new current row for changed records
INSERT INTO dim_customer
SELECT dim_customer_seq.NEXTVAL,
s.customer_id, s.name, s.segment, s.region,
CURRENT_DATE, '9999-12-31', TRUE
FROM staging_customers s
JOIN dim_customer d
ON s.customer_id = d.customer_id
AND d.expiry_date = CURRENT_DATE - 1
AND d.is_current = FALSE;
Implement SCD Type 2 in BigQuery
A single MERGE cannot both expire the old row and insert its replacement for the same source row, so BigQuery (like Snowflake above) uses a MERGE followed by a second INSERT.
-- BigQuery SCD Type 2 using MERGE
MERGE `project.dataset.dim_customer` AS target
USING `project.dataset.staging_customers` AS source
ON target.customer_id = source.customer_id
AND target.is_current = TRUE
WHEN MATCHED
AND (target.segment != source.segment
OR target.region != source.region)
THEN UPDATE SET
expiry_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY),
is_current = FALSE
WHEN NOT MATCHED BY TARGET THEN INSERT (
customer_sk, customer_id, name, segment, region,
effective_date, expiry_date, is_current
) VALUES (
GENERATE_UUID(),
source.customer_id, source.name, source.segment, source.region,
CURRENT_DATE(), DATE '9999-12-31', TRUE
);
-- Insert new current rows for changed records
INSERT INTO `project.dataset.dim_customer`
SELECT GENERATE_UUID(), s.customer_id, s.name, s.segment, s.region,
CURRENT_DATE(), DATE '9999-12-31', TRUE
FROM `project.dataset.staging_customers` s
INNER JOIN `project.dataset.dim_customer` d
ON s.customer_id = d.customer_id
AND d.expiry_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
Optimize Snowflake performance
-- Cluster keys for large fact tables (order matters: most filtered first)
ALTER TABLE fct_sales CLUSTER BY (date_sk, customer_sk);
-- Monitor clustering depth
SELECT SYSTEM$CLUSTERING_DEPTH('fct_sales');
-- Values close to 1.0 = well clustered; > 5 = recluster needed
-- Use result caching and warehouse sizing
ALTER WAREHOUSE analytics_wh SET
WAREHOUSE_SIZE = 'MEDIUM'
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE;
-- Materialized views for expensive aggregations
CREATE MATERIALIZED VIEW mv_daily_sales AS
SELECT date_sk,
SUM(net_amount) AS total_sales,
COUNT(*) AS transaction_count
FROM fct_sales
GROUP BY date_sk;
Snowflake charges per second of compute. Use AUTO_SUSPEND = 60 to avoid paying for idle warehouses. Use separate warehouses for ETL and analytics to prevent contention.
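The impact of auto-suspend is easy to estimate. A back-of-envelope sketch, assuming roughly $3 per credit (an assumed placeholder rate; actual credit prices vary by Snowflake edition and region) and the credits-per-hour sizes listed later in this document:

```python
# Assumed rates -- placeholders for illustration, not Snowflake's published prices.
CREDITS_PER_HOUR = {"XSMALL": 1, "SMALL": 2, "MEDIUM": 4, "LARGE": 8}
PRICE_PER_CREDIT = 3.00

def monthly_cost(size, active_hours_per_day):
    """Cost for a warehouse that suspends when idle, over a 30-day month."""
    return CREDITS_PER_HOUR[size] * active_hours_per_day * 30 * PRICE_PER_CREDIT

always_on = monthly_cost("MEDIUM", 24)  # never suspends
suspended = monthly_cost("MEDIUM", 4)   # auto-suspends outside ~4 busy hours/day
```

Under these assumptions a MEDIUM warehouse that never suspends costs six times as much per month as one that is only active four hours a day, which is why AUTO_SUSPEND on non-production warehouses matters.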
Optimize BigQuery performance
-- Partition by date (reduces bytes scanned = lower cost)
CREATE TABLE `project.dataset.fct_sales`
(
sale_sk STRING NOT NULL,
sale_date DATE NOT NULL,
customer_sk STRING NOT NULL,
product_sk STRING NOT NULL,
quantity INT64 NOT NULL,
net_amount NUMERIC NOT NULL
)
PARTITION BY sale_date
CLUSTER BY customer_sk, product_sk;
-- Check bytes scanned before running expensive queries
-- Use dry run: bq query --dry_run --use_legacy_sql=false 'SELECT ...'
-- Materialized view with automatic refresh
CREATE MATERIALIZED VIEW `project.dataset.mv_daily_sales`
AS
SELECT sale_date,
SUM(net_amount) AS total_sales,
COUNT(*) AS transaction_count
FROM `project.dataset.fct_sales`
GROUP BY sale_date;
BigQuery charges per TB scanned. Always partition by the primary date filter column and cluster by up to four frequently filtered columns. Use INFORMATION_SCHEMA.JOBS to monitor cost per query.
Build a date dimension
Every warehouse needs a pre-populated date dimension. Generate it once.
-- BigQuery: generate a date spine
CREATE TABLE `project.dataset.dim_date` AS
WITH date_spine AS (
SELECT date
FROM UNNEST(
GENERATE_DATE_ARRAY('2020-01-01', '2030-12-31', INTERVAL 1 DAY)
) AS date
)
SELECT
CAST(FORMAT_DATE('%Y%m%d', date) AS INT64) AS date_sk,
date AS full_date,
FORMAT_DATE('%A', date) AS day_of_week,
FORMAT_DATE('%B', date) AS month_name,
EXTRACT(QUARTER FROM date) AS quarter,
EXTRACT(YEAR FROM date) AS fiscal_year,
EXTRACT(DAYOFWEEK FROM date) IN (1, 7) AS is_weekend
FROM date_spine;
Anti-patterns / common mistakes
| Mistake | Why it's wrong | What to do instead |
|---|---|---|
| Mixed grain in a fact table | Adding daily and monthly aggregates in one table causes double-counting when summed | Create separate fact tables per grain |
| Using natural keys as fact table foreign keys | Natural keys change (email, product code); joins break silently | Use surrogate keys for all dimension references |
| No date dimension (joining on raw dates) | Loses fiscal calendar, holiday flags, and forces repeated date logic in every query | Build a shared dim_date and join all facts to it |
| Defaulting everything to SCD Type 1 | Loses history; cannot answer "what segment was this customer in last quarter?" | Choose SCD type per attribute explicitly during design |
| No partitioning on large fact tables | Full table scans on every query; cloud costs explode | Partition by date and cluster by top filter columns |
| Over-normalizing dimensions (deep snowflake) | Adds join complexity; BI tools struggle with 5+ join paths | Flatten to star schema unless dimension is 100M+ rows |
Gotchas
Mixed grain in a fact table is the hardest bug to find - Adding a row-level sale and a daily summary row into the same fact table looks harmless until someone sums net_amount across a date range and gets 2x the correct revenue. Grain violations produce numerically plausible but wrong answers. Declare the grain in a comment on the CREATE TABLE statement and enforce it with a uniqueness test.
SCD Type 2 MERGE leaving orphaned "current" rows - A MERGE statement that expires old rows but fails (network timeout, transaction rollback) after the UPDATE but before the INSERT leaves the dimension in an inconsistent state: the old row is expired but no new current row exists. Always wrap SCD Type 2 updates in a transaction and verify row counts after each MERGE + INSERT pair.
Natural keys as fact table foreign keys break on source changes - Using customer_email instead of customer_sk as the foreign key on a fact table means a customer who changes their email address breaks all historical joins. Surrogate keys are immune to natural key changes. Always generate surrogate keys for dimension foreign key references.
BigQuery partition filter not used in JOIN conditions - BigQuery only prunes partitions when the partition column appears in a WHERE clause, not in a JOIN ON condition. A query joining a partitioned table only on customer_id performs a full table scan on the partitioned table, negating the cost benefit. Always include the partition column in the WHERE clause.
Snowflake warehouse not auto-suspending in development - A development or analytics warehouse left running with AUTO_SUSPEND disabled (or set to a long interval) accumulates compute costs continuously. Set AUTO_SUSPEND = 60 on all non-production warehouses, and use separate warehouses for ETL loads vs. analyst queries to prevent compute contention.
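The orphaned-current-row gotcha suggests a post-load consistency check: every natural key should have exactly one current row. A sqlite sketch of that test (table shape assumed to match the SCD Type 2 columns used in this document):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_sk INT, customer_id TEXT, is_current INT)")
conn.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)", [
    (1, "C1", 0), (2, "C1", 1),  # healthy: one expired + one current row
    (3, "C2", 0),                # orphaned: expired but never replaced
])

# Any customer_id without exactly one current row indicates a failed MERGE + INSERT pair.
bad = conn.execute("""
SELECT customer_id, SUM(is_current) AS current_rows
FROM dim_customer
GROUP BY customer_id
HAVING SUM(is_current) != 1
""").fetchall()
```

Running this immediately after each SCD load turns a silent inconsistency into a loud failure.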
References
For detailed implementation patterns and platform-specific guidance, load the relevant
file from references/:
- references/snowflake-patterns.md - Snowflake-specific features: stages, streams, tasks, time travel, zero-copy cloning, and warehouse sizing strategies
- references/bigquery-patterns.md - BigQuery-specific features: nested/repeated fields, federated queries, BI Engine, slots vs on-demand pricing, and scheduled queries
- references/scd-patterns.md - Deep dive on all SCD types with complete SQL implementations, hybrid patterns, and migration strategies between types
Only load a references file if the current task requires it - they are long and will consume context.
References
bigquery-patterns.md
BigQuery Patterns
Architecture overview
BigQuery is a serverless columnar warehouse. There are no clusters to manage - you submit SQL and Google allocates compute automatically. Storage and compute are decoupled. Pricing is either on-demand (per TB scanned) or capacity-based (reserved slots). All data is stored in Capacitor columnar format, replicated across zones.
Pricing models
| Model | How it works | Best for |
|---|---|---|
| On-demand | $6.25/TB scanned (first 1 TB/month free) | Exploration, low-volume analytics |
| Standard edition | $0.04/slot-hour (autoscaled) | Predictable workloads |
| Enterprise edition | $0.06/slot-hour (with advanced features) | Large orgs needing governance |
Key insight: On-demand charges are based on bytes scanned, not rows returned. A
SELECT * on a 500 GB table costs ~$3.13 even if the result is 10 rows. Partitioning
and clustering directly reduce bytes scanned and therefore cost.
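The bytes-scanned arithmetic above is easy to sanity-check in code. This uses the document's simple decimal-TB math; actual BigQuery billing rounds per query and uses binary units, so treat it as an approximation:

```python
ON_DEMAND_PER_TB = 6.25  # $/TB scanned, per the pricing table above

def query_cost_usd(bytes_scanned):
    """Approximate on-demand cost: billed on bytes scanned, not rows returned."""
    return bytes_scanned / 1e12 * ON_DEMAND_PER_TB

full_scan = query_cost_usd(500e9)  # SELECT * over a 500 GB table
pruned = query_cost_usd(5e9)       # same query after partition pruning to 5 GB
```

Pruning the scan from 500 GB to 5 GB cuts the query cost by 100x, which is the whole argument for partitioning and clustering on this pricing model.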
Partitioning
-- Partition by date column (most common)
CREATE TABLE `project.dataset.fct_events`
(
event_id STRING NOT NULL,
event_date DATE NOT NULL,
user_id STRING NOT NULL,
event_type STRING NOT NULL,
properties JSON
)
PARTITION BY event_date
OPTIONS (
partition_expiration_days = 365, -- auto-delete old partitions
require_partition_filter = TRUE -- prevent full scans
);
-- Partition by ingestion time (when source has no reliable date column)
CREATE TABLE `project.dataset.fct_raw_events`
(
payload STRING
)
PARTITION BY _PARTITIONDATE;
-- Integer range partitioning
CREATE TABLE `project.dataset.fct_transactions`
(
txn_id INT64,
customer_id INT64,
amount NUMERIC
)
PARTITION BY RANGE_BUCKET(customer_id, GENERATE_ARRAY(0, 10000000, 100000));
require_partition_filter = TRUE is a guardrail that prevents queries without a partition filter from running. Use it on all large fact tables.
Clustering
-- Cluster by up to 4 columns (order matters: most filtered first)
CREATE TABLE `project.dataset.fct_sales`
(
sale_date DATE NOT NULL,
region STRING NOT NULL,
customer_id STRING NOT NULL,
product_id STRING NOT NULL,
amount NUMERIC NOT NULL
)
PARTITION BY sale_date
CLUSTER BY region, customer_id;
-- BigQuery auto-reclusters in the background at no cost
-- Unlike Snowflake, there is no manual recluster command
Clustering column selection: Choose columns that appear frequently in WHERE clauses, JOIN conditions, or GROUP BY. High-cardinality columns (like customer_id) benefit most.
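A toy model shows why clustering helps: when rows are sorted by the cluster column before being packed into storage blocks, a filter touches far fewer blocks. Block size and data here are invented for illustration, not BigQuery internals:

```python
import random

random.seed(0)
rows = [{"region": random.choice(["EU", "US", "APAC"]), "amount": 1} for _ in range(9000)]
BLOCK_SIZE = 300  # rows per storage block (stand-in for a columnar block)

def blocks_scanned(rows, predicate):
    """Count blocks containing at least one matching row (block-level pruning)."""
    blocks = [rows[i:i + BLOCK_SIZE] for i in range(0, len(rows), BLOCK_SIZE)]
    return sum(1 for block in blocks if any(predicate(r) for r in block))

# Unsorted data: EU rows are scattered, so nearly every block must be read.
unclustered = blocks_scanned(rows, lambda r: r["region"] == "EU")

# Sorted ("clustered") by region: EU rows are contiguous, so most blocks are skipped.
clustered = blocks_scanned(sorted(rows, key=lambda r: r["region"]),
                           lambda r: r["region"] == "EU")
```

With scattered data every block contains at least one EU row; after sorting, only the contiguous EU run is touched, which is the pruning effect clustering buys.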
Nested and repeated fields (STRUCT and ARRAY)
BigQuery natively supports nested data, which avoids joins.
-- Schema with nested and repeated fields
CREATE TABLE `project.dataset.orders_nested`
(
order_id STRING NOT NULL,
order_date DATE NOT NULL,
customer STRUCT<
id STRING,
name STRING,
segment STRING
>,
items ARRAY<STRUCT<
product_id STRING,
quantity INT64,
unit_price NUMERIC
>>
);
-- Query nested fields with dot notation
SELECT
order_id,
customer.name,
customer.segment
FROM `project.dataset.orders_nested`
WHERE order_date = '2026-03-14';
-- Unnest arrays to flatten
SELECT
o.order_id,
item.product_id,
item.quantity * item.unit_price AS line_total
FROM `project.dataset.orders_nested` o,
UNNEST(o.items) AS item;
Use nested/repeated fields when the child data is always accessed with the parent. This eliminates join costs and scans less data. Do not use them when the child needs independent access patterns (keep it as a separate table).
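UNNEST is just a nested-to-flat transform. In Python terms, over a hypothetical order payload shaped like the orders_nested table above:

```python
# One nested order flattened into line-item rows, mirroring UNNEST semantics.
order = {
    "order_id": "O-1",
    "items": [
        {"product_id": "P-1", "quantity": 2, "unit_price": 10.0},
        {"product_id": "P-2", "quantity": 1, "unit_price": 5.0},
    ],
}

flat = [
    {"order_id": order["order_id"],          # parent fields repeat per child row
     "product_id": item["product_id"],
     "line_total": item["quantity"] * item["unit_price"]}
    for item in order["items"]
]
```

Each child element becomes its own row carrying the parent's keys, exactly like the comma-join with UNNEST in the SQL above.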
Materialized views
CREATE MATERIALIZED VIEW `project.dataset.mv_daily_revenue`
AS
SELECT
sale_date,
region,
SUM(amount) AS total_revenue,
COUNT(*) AS txn_count
FROM `project.dataset.fct_sales`
GROUP BY sale_date, region;
-- BigQuery automatically uses the MV when it can answer a query
-- No need to query the MV directly (smart tuning / automatic rewriting)
-- Refresh is automatic and incremental
-- Force refresh if needed:
CALL BQ.REFRESH_MATERIALIZED_VIEW('project.dataset.mv_daily_revenue');
Limitations: MVs cannot use non-deterministic functions, OUTER JOINs, or window functions. They work best for simple aggregations on partitioned/clustered tables.
BI Engine
BI Engine is an in-memory analysis service that accelerates queries from BI tools (Looker, Data Studio, Tableau). Reserve BI Engine capacity in the BigQuery console.
- Queries that fit in the BI Engine reservation are served from memory (~1 second)
- No code changes needed - it transparently accelerates compatible queries
- Best for dashboards with repetitive query patterns on a stable dataset
Scheduled queries
-- Schedule a daily aggregation (runs as a scheduled query in BigQuery UI or API)
-- In the BigQuery console: More > Schedule > set schedule
-- Or via bq CLI:
-- bq mk --transfer_config \
-- --project_id=myproject \
-- --data_source=scheduled_query \
-- --target_dataset=dataset \
-- --display_name='Daily Sales Agg' \
-- --schedule='every 24 hours' \
-- --params='{"query":"INSERT INTO dataset.daily_sales SELECT ..."}'
Federated queries
Query external data without loading it into BigQuery.
-- Query Cloud SQL (PostgreSQL) directly from BigQuery
SELECT * FROM EXTERNAL_QUERY(
'projects/myproject/locations/us/connections/my-cloudsql',
'SELECT id, email, created_at FROM users WHERE created_at > NOW() - INTERVAL 1 DAY'
);
-- External tables over Cloud Storage (Parquet, CSV, JSON)
CREATE EXTERNAL TABLE `project.dataset.ext_events`
OPTIONS (
format = 'PARQUET',
uris = ['gs://my-bucket/events/*.parquet']
);
Federated queries are slower than native tables. Use them for ad-hoc exploration or infrequent joins with external systems. For regular analytics, load data into BigQuery native tables.
Cost optimization checklist
- Always partition fact tables by date and set require_partition_filter = TRUE
- Cluster by top 2-4 filter/group columns
- Avoid SELECT *; list only needed columns (columnar storage = less scan)
- Use dry runs (bq query --dry_run) to estimate cost before executing
- Set custom cost controls in project settings (max bytes billed per query)
- Monitor with INFORMATION_SCHEMA.JOBS to find expensive queries
- Consider editions pricing when monthly spend exceeds ~$2,000 on-demand
- Use table expiration and partition expiration to auto-delete old data
scd-patterns.md
SCD Patterns
Slowly Changing Dimensions (SCDs) handle the reality that dimension attributes change over time. The type you choose determines whether history is preserved and how it is queried. This reference covers all standard types with complete SQL implementations.
SCD Type 0 - Fixed attribute
The attribute never changes after initial load. Use for truly immutable data.
-- Examples: birth_date, original_signup_date, first_order_date
-- Implementation: simply never update these columns
-- In MERGE statements, exclude Type 0 columns from the WHEN MATCHED clause
SCD Type 1 - Overwrite
Replace the old value with the new value. No history is preserved.
-- Snowflake Type 1 MERGE
MERGE INTO dim_product AS target
USING staging_products AS source
ON target.product_id = source.product_id
WHEN MATCHED
AND (target.category != source.category
OR target.brand != source.brand)
THEN UPDATE SET
target.category = source.category,
target.brand = source.brand,
target.updated_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN INSERT (
product_sk, product_id, product_name, category, brand, updated_at
) VALUES (
dim_product_seq.NEXTVAL,
source.product_id, source.product_name, source.category, source.brand,
CURRENT_TIMESTAMP()
);
When to use: Corrections (fixing a typo), attributes where history is irrelevant (e.g., updating a product description), or when storage/complexity must be minimized.
Trade-off: Simple and fast but you cannot answer "what was the category last month?"
SCD Type 2 - Add new row
Create a new row for each change, preserving full history. This is the most common type for analytical warehouses.
Required columns
| Column | Purpose |
|---|---|
| surrogate_key | Unique per row (version), not per entity |
| natural_key | Business identifier (same across all versions) |
| effective_date | When this version became active |
| expiry_date | When this version was superseded (9999-12-31 for current) |
| is_current | Boolean flag for easy filtering of current records |
Snowflake implementation
-- Step 1: Expire changed rows
MERGE INTO dim_customer AS target
USING staging_customers AS source
ON target.customer_id = source.customer_id
AND target.is_current = TRUE
WHEN MATCHED
AND (target.segment != source.segment
OR target.region != source.region
OR target.tier != source.tier)
THEN UPDATE SET
target.expiry_date = CURRENT_DATE - 1,
target.is_current = FALSE;
-- Step 2: Insert new current rows for changed and new records
INSERT INTO dim_customer (
customer_sk, customer_id, name, segment, region, tier,
effective_date, expiry_date, is_current
)
SELECT
dim_customer_seq.NEXTVAL,
s.customer_id, s.name, s.segment, s.region, s.tier,
CURRENT_DATE, '9999-12-31', TRUE
FROM staging_customers s
WHERE NOT EXISTS (
SELECT 1 FROM dim_customer d
WHERE d.customer_id = s.customer_id
AND d.is_current = TRUE
AND d.segment = s.segment
AND d.region = s.region
AND d.tier = s.tier
);
BigQuery implementation
-- Step 1: Expire changed rows
MERGE `project.dataset.dim_customer` AS target
USING `project.dataset.staging_customers` AS source
ON target.customer_id = source.customer_id
AND target.is_current = TRUE
WHEN MATCHED
AND (target.segment != source.segment
OR target.region != source.region
OR target.tier != source.tier)
THEN UPDATE SET
expiry_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY),
is_current = FALSE;
-- Step 2: Insert new current rows
INSERT INTO `project.dataset.dim_customer` (
customer_sk, customer_id, name, segment, region, tier,
effective_date, expiry_date, is_current
)
SELECT
GENERATE_UUID(),
s.customer_id, s.name, s.segment, s.region, s.tier,
CURRENT_DATE(), DATE '9999-12-31', TRUE
FROM `project.dataset.staging_customers` s
WHERE NOT EXISTS (
SELECT 1 FROM `project.dataset.dim_customer` d
WHERE d.customer_id = s.customer_id
AND d.is_current = TRUE
AND d.segment = s.segment
AND d.region = s.region
AND d.tier = s.tier
);
Querying Type 2 dimensions
-- Get current state of all customers
SELECT * FROM dim_customer WHERE is_current = TRUE;
-- Point-in-time lookup: what segment was customer C123 in on 2025-06-15?
SELECT segment
FROM dim_customer
WHERE customer_id = 'C123'
AND effective_date <= '2025-06-15'
AND expiry_date >= '2025-06-15';
-- Join fact to dimension for historical accuracy
SELECT f.sale_date, d.segment, SUM(f.net_amount)
FROM fct_sales f
JOIN dim_customer d
ON f.customer_sk = d.customer_sk -- surrogate key join: gets the version active at sale time
GROUP BY f.sale_date, d.segment;
SCD Type 3 - Previous value column
Store the current and one previous value as separate columns. Limited history.
CREATE TABLE dim_customer_type3 (
customer_sk INT PRIMARY KEY,
customer_id VARCHAR(50) NOT NULL,
name VARCHAR(200),
current_segment VARCHAR(50),
previous_segment VARCHAR(50),
segment_change_date DATE
);
-- Update: shift current to previous
UPDATE dim_customer_type3
SET previous_segment = current_segment,
current_segment = 'Enterprise',
segment_change_date = CURRENT_DATE
WHERE customer_id = 'C123';
When to use: Rarely. Only when you need exactly one prior value and never more. Most analytical needs are better served by Type 2.
SCD Type 6 - Hybrid (1 + 2 + 3)
Combines Type 2 history rows with a Type 1 overwrite column for the current value. This gives you both full history and easy access to the current state on every row.
CREATE TABLE dim_customer_type6 (
customer_sk INT PRIMARY KEY,
customer_id VARCHAR(50) NOT NULL,
name VARCHAR(200),
-- Type 2 historical value
historical_segment VARCHAR(50),
-- Type 1 current value (same on ALL rows for this customer)
current_segment VARCHAR(50),
-- Type 2 tracking
effective_date DATE NOT NULL,
expiry_date DATE NOT NULL DEFAULT '9999-12-31',
is_current BOOLEAN NOT NULL DEFAULT TRUE
);
-- When segment changes:
-- 1. Expire the current row (Type 2)
UPDATE dim_customer_type6
SET expiry_date = CURRENT_DATE - 1,
is_current = FALSE,
current_segment = 'Enterprise' -- Type 1: update on ALL rows
WHERE customer_id = 'C123' AND is_current = TRUE;
-- 2. Insert new current row (Type 2)
INSERT INTO dim_customer_type6 VALUES (
dim_customer_type6_seq.NEXTVAL, 'C123', 'Acme Corp', -- assumes a sequence for the surrogate key
'Enterprise', -- historical_segment = the new value
'Enterprise', -- current_segment = the new value
CURRENT_DATE, '9999-12-31', TRUE
);
-- 3. Backfill current_segment on all historical rows (Type 1 overwrite)
UPDATE dim_customer_type6
SET current_segment = 'Enterprise'
WHERE customer_id = 'C123';
Querying Type 6
-- "What segment is the customer in NOW?" (use current_segment on any row)
SELECT DISTINCT customer_id, current_segment
FROM dim_customer_type6
WHERE customer_id = 'C123';
-- "What segment were they in when this sale happened?" (use historical_segment)
SELECT f.sale_date, d.historical_segment, SUM(f.net_amount)
FROM fct_sales f
JOIN dim_customer_type6 d ON f.customer_sk = d.customer_sk
GROUP BY f.sale_date, d.historical_segment;When to use: When analysts frequently need both "current state" and "historical state" queries and you want to avoid a self-join to the current row.
Choosing the right SCD type
| Need | Recommended type |
|---|---|
| Attribute never changes | Type 0 |
| Only current value matters, no history needed | Type 1 |
| Full history required for point-in-time analysis | Type 2 |
| Only need one previous value | Type 3 |
| Full history + easy current value access | Type 6 |
Default recommendation: Start with Type 2 for any attribute where there is even a possibility of needing historical analysis. The storage cost is minimal in cloud warehouses. Migrating from Type 1 to Type 2 retroactively is painful because the history is already lost.
Migration: Type 1 to Type 2
When you realize you need history after starting with Type 1:
-- Add tracking columns
ALTER TABLE dim_customer ADD COLUMN effective_date DATE DEFAULT '1900-01-01';
ALTER TABLE dim_customer ADD COLUMN expiry_date DATE DEFAULT '9999-12-31';
ALTER TABLE dim_customer ADD COLUMN is_current BOOLEAN DEFAULT TRUE;
-- Set effective_date to the best available timestamp
UPDATE dim_customer
SET effective_date = COALESCE(last_modified_at, created_at, '1900-01-01');
-- From this point forward, use Type 2 MERGE logic
-- Note: all history before this migration is lost - you only have the current state
This migration preserves only the current snapshot. Any changes that occurred before the migration are permanently lost. This is why defaulting to Type 2 from the start is strongly recommended for analytics-critical attributes.
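The "Type 2 MERGE logic" referenced above boils down to: for each incoming record, if a tracked attribute changed, expire the current row and append a new version. A warehouse-independent Python sketch of that comparison (the `TRACKED` attribute list and row shape are illustrative assumptions):

```python
from datetime import date, timedelta

TRACKED = ("segment", "region")   # attributes that trigger a new version

def scd2_upsert(dim_rows, incoming, today):
    """Expire changed current rows and append new versions (Type 2)."""
    current = {r["customer_id"]: r for r in dim_rows if r["is_current"]}
    for src in incoming:
        cur = current.get(src["customer_id"])
        if cur and any(cur[a] != src[a] for a in TRACKED):
            cur["expiry_date"] = today - timedelta(days=1)    # expire old row
            cur["is_current"] = False
        if cur is None or not cur["is_current"]:
            dim_rows.append({**src, "effective_date": today,  # new version
                             "expiry_date": date(9999, 12, 31),
                             "is_current": True})
    return dim_rows

dim = [{"customer_id": "C1", "segment": "SMB", "region": "EU",
        "effective_date": date(2024, 1, 1),
        "expiry_date": date(9999, 12, 31), "is_current": True}]
scd2_upsert(dim, [{"customer_id": "C1", "segment": "Enterprise", "region": "EU"}],
            date(2026, 3, 13))
# dim now holds two rows: the expired SMB version and a current Enterprise one
```

Unchanged incoming records fall through both branches and produce no writes, which keeps the dimension append-mostly.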
snowflake-patterns.md
Snowflake Patterns
Architecture overview
Snowflake separates storage, compute, and services into three independent layers. Storage is managed in a columnar format on cloud object storage. Compute is provided by virtual warehouses (clusters of nodes) that can be resized, suspended, and resumed independently. The services layer handles authentication, metadata, query parsing, and optimization.
This separation means you can scale compute without affecting storage costs and vice versa.
Virtual warehouses
Sizing guidelines
| Size | Credits/hr | Best for |
|---|---|---|
| X-Small | 1 | Development, light queries |
| Small | 2 | Single-user analytics, small ETL |
| Medium | 4 | Team analytics, moderate ETL |
| Large | 8 | Complex joins on large tables |
| X-Large+ | 16+ | Heavy ETL, large data loads |
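Credits-per-hour translates directly into spend. A rough sketch of what the sizing table implies, assuming $3 per credit (the actual rate varies by edition, cloud, and region):

```python
def monthly_cost(credits_per_hour: float, active_hours_per_day: float,
                 days: int = 30, usd_per_credit: float = 3.0) -> float:
    """Estimate monthly spend for a warehouse that auto-suspends when idle."""
    return credits_per_hour * active_hours_per_day * days * usd_per_credit

# A Medium warehouse (4 credits/hr) active ~6 hours a day:
print(monthly_cost(4, 6))    # 4 * 6 * 30 * 3.0 = 2160.0
# The same warehouse left running 24/7 with no auto-suspend:
print(monthly_cost(4, 24))   # 8640.0
```

The gap between the two numbers is why auto-suspend appears first in the cost checklist later in this document's source material: idle hours are pure waste.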
Multi-cluster warehouses
-- Auto-scale for concurrent query load
ALTER WAREHOUSE analytics_wh SET
WAREHOUSE_SIZE = 'MEDIUM'
MIN_CLUSTER_COUNT = 1
MAX_CLUSTER_COUNT = 3
SCALING_POLICY = 'STANDARD'
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE;
Best practice: Separate warehouses for ETL (etl_wh) and analytics (analytics_wh) to prevent loading jobs from starving interactive queries.
Clustering keys
Snowflake stores table data in micro-partitions, each holding 50-500 MB of uncompressed data (considerably smaller on disk after columnar compression). Clustering keys control how rows are distributed across these micro-partitions, which determines how effectively Snowflake can prune partitions at query time.
-- Cluster by the most common filter columns, ordered from lower to higher cardinality
ALTER TABLE fct_sales CLUSTER BY (date_sk, region);
-- Check clustering quality
SELECT SYSTEM$CLUSTERING_INFORMATION('fct_sales');
-- average_depth close to 1 = well clustered
-- average_depth > 5 = needs reclustering
-- Snowflake automatically reclusters in the background (costs credits)
-- Monitor with:
SELECT * FROM TABLE(INFORMATION_SCHEMA.AUTOMATIC_CLUSTERING_HISTORY(
TABLE_NAME => 'fct_sales',
DATE_RANGE_START => DATEADD(DAY, -7, CURRENT_DATE())
));
When to cluster: Tables > 1 TB with predictable filter patterns. Do not cluster small tables - the overhead exceeds the benefit.
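Clustering depth measures how many micro-partitions overlap for a given key value. A toy sketch of the idea (a simplification of what SYSTEM$CLUSTERING_INFORMATION reports): for each probe value, count how many partitions' min/max ranges contain it.

```python
def average_depth(partitions, probes):
    """partitions: list of (min_key, max_key) ranges. The depth of a value
    is the number of partitions whose range contains it; average over probes."""
    depths = [sum(lo <= v <= hi for lo, hi in partitions) for v in probes]
    return sum(depths) / len(depths)

well_clustered = [(0, 9), (10, 19), (20, 29)]      # disjoint key ranges
poorly_clustered = [(0, 29), (0, 29), (0, 29)]     # fully overlapping ranges

print(average_depth(well_clustered, range(30)))    # 1.0 -> pruning works
print(average_depth(poorly_clustered, range(30)))  # 3.0 -> every scan hits all
```

This is why an average depth near 1 means a point filter touches one partition, while a high depth means the same filter must scan many.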
Time travel and fail-safe
-- Query data as it existed 30 minutes ago
SELECT * FROM fct_sales AT(OFFSET => -1800);
-- Query data as it existed at a specific timestamp
SELECT * FROM fct_sales AT(TIMESTAMP => '2026-03-13 10:00:00'::TIMESTAMP);
-- Restore a dropped table
UNDROP TABLE fct_sales;
-- Set retention period (default 1 day; Enterprise edition: up to 90 days)
ALTER TABLE fct_sales SET DATA_RETENTION_TIME_IN_DAYS = 30;
Time travel storage is billed separately. Set retention to the minimum you need for operational recovery. Fail-safe adds 7 days of additional protection beyond the retention period but is not user-accessible - Snowflake support must restore it.
Zero-copy cloning
-- Clone an entire database for development (instant, no storage cost until divergence)
CREATE DATABASE dev_warehouse CLONE prod_warehouse;
-- Clone a single table
CREATE TABLE fct_sales_test CLONE fct_sales;
-- Clone at a point in time (combines with time travel)
CREATE TABLE fct_sales_snapshot CLONE fct_sales
AT(TIMESTAMP => '2026-03-13 08:00:00'::TIMESTAMP);
Clones share the underlying micro-partitions. Storage cost only accrues when data in the clone diverges from the source. Use clones for safe testing of schema changes and ETL.
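The copy-on-write economics can be sketched with a simplified model (real micro-partition management is far more involved): a clone reuses the source's physical partitions, and storage is billed once per distinct physical partition regardless of how many tables reference it.

```python
import itertools

_ids = itertools.count()

def new_partition(size_mb):
    return (next(_ids), size_mb)   # (unique physical partition id, size)

def clone(table):
    return list(table)             # instant: references the same physical ids

def billed_storage_mb(*tables):
    """Each distinct physical partition is billed once (zero-copy sharing)."""
    seen = {pid: size for t in tables for pid, size in t}
    return sum(seen.values())

prod = [new_partition(100), new_partition(100), new_partition(100)]
dev = clone(prod)
print(billed_storage_mb(prod, dev))   # 300: the clone adds no storage

dev[0] = new_partition(100)           # a write to the clone rewrites a partition
print(billed_storage_mb(prod, dev))   # 400: only the diverged partition costs extra
```

The same model explains why dropping a clone is cheap: only partitions unique to the clone are reclaimed.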
Streams and tasks (CDC pipeline)
Streams capture change data on tables. Tasks schedule SQL execution.
-- Create a stream to track changes on the staging table
CREATE STREAM stg_customers_stream ON TABLE staging_customers;
-- Create a task to process changes every 5 minutes
CREATE TASK scd2_customer_task
WAREHOUSE = etl_wh
SCHEDULE = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('stg_customers_stream')
AS
MERGE INTO dim_customer AS target
USING stg_customers_stream AS source
ON target.customer_id = source.customer_id
AND target.is_current = TRUE
WHEN MATCHED
AND (target.segment != source.segment
OR target.region != source.region)
THEN UPDATE SET
expiry_date = CURRENT_DATE - 1,
is_current = FALSE
WHEN NOT MATCHED THEN INSERT (
customer_sk, customer_id, name, segment, region,
effective_date, expiry_date, is_current
) VALUES (
dim_customer_seq.NEXTVAL,
source.customer_id, source.name, source.segment, source.region,
CURRENT_DATE, '9999-12-31', TRUE
);
-- Caveat: WHEN NOT MATCHED fires only for brand-new customer_ids, so this
-- MERGE expires changed rows but does not insert their replacement version.
-- Pair it with a follow-up INSERT for changed customers, or wrap both
-- statements in a stored procedure called by the task.
-- Resume the task (tasks are created in suspended state)
ALTER TASK scd2_customer_task RESUME;
Stages and data loading
-- External stage pointing to S3
CREATE STAGE s3_raw_data
URL = 's3://my-bucket/raw/'
STORAGE_INTEGRATION = my_s3_integration
FILE_FORMAT = (TYPE = 'PARQUET');
-- List files in stage
LIST @s3_raw_data;
-- COPY INTO for bulk loading (most efficient method)
COPY INTO staging_customers
FROM @s3_raw_data/customers/
FILE_FORMAT = (TYPE = 'PARQUET')
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
-- Snowpipe for continuous loading (serverless, event-driven)
CREATE PIPE customer_pipe
AUTO_INGEST = TRUE
AS
COPY INTO staging_customers
FROM @s3_raw_data/customers/
FILE_FORMAT = (TYPE = 'PARQUET');
Cost optimization checklist
- Auto-suspend all warehouses at 60 seconds (or 300 for heavy workloads)
- Use resource monitors to set credit quotas per warehouse per month
- Cluster only large tables (> 1 TB) with clear filter patterns
- Minimize time travel retention to what you actually need
- Use transient tables for staging/temp data (no fail-safe, lower storage cost)
- Monitor with ACCOUNT_USAGE views: WAREHOUSE_METERING_HISTORY, STORAGE_USAGE, QUERY_HISTORY
Frequently Asked Questions
What is data-warehousing?
Use this skill when designing data warehouses, building star or snowflake schemas, implementing slowly changing dimensions (SCDs), writing analytical SQL for Snowflake or BigQuery, creating fact and dimension tables, or planning ETL/ELT pipelines for analytics. Triggers on dimensional modeling, surrogate keys, conformed dimensions, warehouse architecture, data vault, partitioning strategies, materialized views, and any task requiring OLAP schema design or warehouse query optimization.
How do I install data-warehousing?
Run npx skills add AbsolutelySkilled/AbsolutelySkilled --skill data-warehousing in your terminal. The skill will be immediately available in your AI coding agent.
What AI agents support data-warehousing?
data-warehousing works with claude-code, gemini-cli, openai-codex. Install it once and use it across any supported AI coding agent.