analytics-engineering
Use this skill when building dbt models, designing semantic layers, defining metrics, creating self-serve analytics, or structuring a data warehouse for analyst consumption. Triggers on dbt project setup, model layering (staging, intermediate, marts), ref() and source() usage, YAML schema definitions, metrics definitions, semantic layer configuration, dimensional modeling, slowly changing dimensions, data testing, and any task requiring analytics engineering best practices.
analytics-engineering
analytics-engineering is a production-ready AI agent skill for claude-code, gemini-cli, and openai-codex. It covers building dbt models, designing semantic layers, defining metrics, creating self-serve analytics, and structuring a data warehouse for analyst consumption.
Quick Facts
| Field | Value |
|---|---|
| Category | data |
| Version | 0.1.0 |
| Platforms | claude-code, gemini-cli, openai-codex |
| License | MIT |
How to Install
- Make sure you have Node.js installed on your machine.
- Run the following command in your terminal:
npx skills add AbsolutelySkilled/AbsolutelySkilled --skill analytics-engineering
- The analytics-engineering skill is now available in your AI coding agent (Claude Code, Gemini CLI, OpenAI Codex, etc.).
Overview
A disciplined framework for building trustworthy, well-tested data transformation pipelines using dbt and modern analytics engineering practices. This skill covers dbt model layering, semantic layer design, metrics definitions, dimensional modeling, and self-serve analytics patterns. It is opinionated about dbt Core/Cloud but the modeling principles apply to any SQL-based transformation tool. The goal is to help you build a data warehouse that analysts can trust and navigate without engineering support.
Tags
dbt analytics metrics semantic-layer data-warehouse self-serve
Platforms
- claude-code
- gemini-cli
- openai-codex
Frequently Asked Questions
What is analytics-engineering?
Use this skill when building dbt models, designing semantic layers, defining metrics, creating self-serve analytics, or structuring a data warehouse for analyst consumption. Triggers on dbt project setup, model layering (staging, intermediate, marts), ref() and source() usage, YAML schema definitions, metrics definitions, semantic layer configuration, dimensional modeling, slowly changing dimensions, data testing, and any task requiring analytics engineering best practices.
How do I install analytics-engineering?
Run npx skills add AbsolutelySkilled/AbsolutelySkilled --skill analytics-engineering in your terminal. The skill will be immediately available in your AI coding agent.
What AI agents support analytics-engineering?
This skill works with claude-code, gemini-cli, openai-codex. Install it once and use it across any supported AI coding agent.
Maintainers
Generated from AbsolutelySkilled
SKILL.md
Analytics Engineering
A disciplined framework for building trustworthy, well-tested data transformation pipelines using dbt and modern analytics engineering practices. This skill covers dbt model layering, semantic layer design, metrics definitions, dimensional modeling, and self-serve analytics patterns. It is opinionated about dbt Core/Cloud but the modeling principles apply to any SQL-based transformation tool. The goal is to help you build a data warehouse that analysts can trust and navigate without engineering support.
When to use this skill
Trigger this skill when the user:
- Sets up a new dbt project or restructures an existing one
- Designs the model layer hierarchy (staging, intermediate, marts)
- Writes or reviews dbt models using ref(), source(), or macros
- Defines metrics in YAML (dbt Metrics, MetricFlow, or Cube)
- Builds a semantic layer for self-serve analytics
- Implements slowly changing dimensions (SCD Type 1, 2, 3)
- Writes dbt tests (generic, singular, or custom) and data contracts
- Configures sources, exposures, or freshness checks
- Asks about dimensional modeling (star schema, snowflake schema, OBT)
Do NOT trigger this skill for:
- Data pipeline orchestration (Airflow, Dagster) unrelated to dbt models
- Raw data ingestion or ELT tool configuration (Fivetran, Airbyte connectors)
Key principles
Layer your models deliberately - Use a three-layer architecture: staging (1:1 with source tables, rename and cast only), intermediate (business logic joins and filters), and marts (wide, denormalized tables ready for analysts). Every model lives in exactly one layer. No skipping layers.
One source of truth per grain - Each mart model must have a clearly defined grain (one row = one what?). Document it in the YAML schema. If two mart models have the same grain, one of them should not exist.
Test everything that matters, nothing that doesn't - Test primary keys with unique and not_null. Test foreign keys with relationships. Test business rules with custom singular tests. Do not write tests that duplicate what the warehouse already enforces.
Metrics are code, not queries - Define metrics in version-controlled YAML, not in BI tool calculated fields. This ensures a single definition that every consumer (dashboard, ad-hoc query, API) shares. Disagreements about numbers end when metric definitions are in the repo.
Build for self-serve, not for tickets - Every mart should be understandable by a non-engineer. Use clear column names (no abbreviations), add descriptions to every column in the YAML schema, and expose models as documented datasets in the BI tool. If analysts file tickets asking what a column means, the model is incomplete.
Core concepts
Model layer architecture
| Layer | Prefix | Purpose | Example |
|---|---|---|---|
| Staging | stg_ | 1:1 with source, rename + cast + basic cleaning | stg_stripe__payments |
| Intermediate | int_ | Business logic, joins across staging models | int_orders__pivoted_payments |
| Marts | fct_ / dim_ | Analyst-facing, denormalized, documented | fct_orders, dim_customers |
Staging models should be views (no materialization cost). Intermediate models are tables or ephemeral depending on reuse. Marts are always tables (or incremental).
Dimensional modeling
Fact tables (fct_) contain measurable events at a specific grain - orders,
payments, page views. They hold foreign keys to dimension tables and numeric measures.
Dimension tables (dim_) contain descriptive attributes - customers, products,
dates. They provide the "who, what, where, when" context for facts.
One Big Table (OBT) is a pre-joined wide table combining facts and dimensions. Use OBT for BI tools that perform poorly with joins. It trades storage for query simplicity.
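As an illustrative sketch (the model filename is hypothetical; the refs and columns are borrowed from the examples later in this skill), an OBT is typically just a mart-layer join of a fact to its dimensions:

```sql
-- models/marts/finance/obt_orders.sql (hypothetical example)
-- One Big Table: fact columns plus pre-joined dimension attributes,
-- so BI tools never need to perform the join themselves
select
    orders.order_id,
    orders.order_date,
    orders.status,
    orders.total_amount_cents,
    customers.customer_id,
    customers.full_name,
    customers.customer_segment
from {{ ref('fct_orders') }} as orders
left join {{ ref('dim_customers') }} as customers
    on orders.customer_id = customers.customer_id
```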
The semantic layer
A semantic layer sits between the data warehouse and consumers (BI tools, notebooks, APIs). It defines metrics, dimensions, and entities in a declarative format so that every consumer gets the same answers. dbt's MetricFlow, Cube, and Looker's LookML are implementations of this pattern. The semantic layer eliminates "which number is right?" debates by making metric logic authoritative and centralized.
Incremental models
For large fact tables, use dbt incremental models to process only new/changed rows
instead of rebuilding the entire table. The is_incremental() macro gates the WHERE
clause to filter for rows since the last run. Always define a unique_key to handle
late-arriving or updated records via merge behavior.
Common tasks
Set up dbt project structure
my_project/
dbt_project.yml
models/
staging/
stripe/
_stripe__models.yml # source + model definitions
_stripe__sources.yml # source freshness config
stg_stripe__payments.sql
stg_stripe__customers.sql
shopify/
_shopify__models.yml
_shopify__sources.yml
stg_shopify__orders.sql
intermediate/
int_orders__pivoted_payments.sql
marts/
finance/
_finance__models.yml
fct_orders.sql
dim_customers.sql
marketing/
_marketing__models.yml
fct_ad_spend.sql
tests/
singular/
assert_order_total_positive.sql
macros/
cents_to_dollars.sql
Use underscores for filenames, double underscores to separate source system from entity (e.g. stg_stripe__payments). Group staging models by source system, marts by business domain.
Write a staging model
Staging models rename, cast, and apply minimal cleaning. No joins, no business logic.
-- models/staging/stripe/stg_stripe__payments.sql
with source as (
select * from {{ source('stripe', 'payments') }}
),
renamed as (
select
id as payment_id,
order_id,
cast(amount as integer) as amount_cents,
cast(created as timestamp) as created_at,
status,
lower(currency) as currency
from source
)
select * from renamed
Build a mart fact table
-- models/marts/finance/fct_orders.sql
{{
config(
materialized='incremental',
unique_key='order_id',
on_schema_change='sync_all_columns'
)
}}
with orders as (
select * from {{ ref('stg_shopify__orders') }}
),
payments as (
select * from {{ ref('int_orders__pivoted_payments') }}
),
final as (
select
orders.order_id,
orders.customer_id,
orders.order_date,
orders.status,
payments.total_amount_cents,
payments.payment_method,
payments.total_amount_cents / 100.0 as total_amount_dollars
from orders
left join payments on orders.order_id = payments.order_id
{% if is_incremental() %}
where orders.updated_at > (select max(updated_at) from {{ this }})
{% endif %}
)
select * from final
Define metrics in YAML (MetricFlow)
# models/marts/finance/_finance__models.yml
semantic_models:
- name: orders
defaults:
agg_time_dimension: order_date
model: ref('fct_orders')
entities:
- name: order_id
type: primary
- name: customer_id
type: foreign
dimensions:
- name: order_date
type: time
type_params:
time_granularity: day
- name: status
type: categorical
measures:
- name: order_count
agg: count
expr: order_id
- name: total_revenue_cents
agg: sum
expr: total_amount_cents
- name: average_order_value_cents
agg: average
expr: total_amount_cents
metrics:
- name: revenue
type: derived
label: "Total Revenue"
description: "Sum of all order payments in dollars"
type_params:
expr: total_revenue_cents / 100
metrics:
- name: total_revenue_cents
- name: order_count
type: simple
label: "Order Count"
type_params:
measure: order_count
Write dbt tests and data contracts
# models/marts/finance/_finance__models.yml
models:
- name: fct_orders
description: "One row per order. Grain: order_id."
config:
contract:
enforced: true
columns:
- name: order_id
data_type: varchar
description: "Primary key - unique order identifier"
tests:
- unique
- not_null
- name: customer_id
description: "FK to dim_customers"
tests:
- not_null
- relationships:
to: ref('dim_customers')
field: customer_id
- name: total_amount_cents
data_type: integer
description: "Total order value in cents"
tests:
- not_null
- dbt_utils.accepted_range:
min_value: 0
-- tests/singular/assert_order_total_positive.sql
-- Returns rows that violate the rule (should return 0 rows to pass)
select order_id, total_amount_cents
from {{ ref('fct_orders') }}
where total_amount_cents < 0
Configure source freshness
# models/staging/stripe/_stripe__sources.yml
sources:
- name: stripe
database: raw
schema: stripe
loaded_at_field: _loaded_at
freshness:
warn_after: { count: 12, period: hour }
error_after: { count: 24, period: hour }
tables:
- name: payments
description: "Raw Stripe payment events"
columns:
- name: id
tests:
- unique
- not_null
Run dbt source freshness in CI to catch stale source data before it propagates into marts.
Build a self-serve dimension table
-- models/marts/finance/dim_customers.sql
with customers as (
select * from {{ ref('stg_shopify__customers') }}
),
orders as (
select * from {{ ref('fct_orders') }}
),
customer_metrics as (
select
customer_id,
count(*) as lifetime_order_count,
sum(total_amount_cents) as lifetime_value_cents,
min(order_date) as first_order_date,
max(order_date) as most_recent_order_date
from orders
group by customer_id
),
final as (
select
customers.customer_id,
customers.full_name,
customers.email,
customers.created_at as customer_since,
coalesce(customer_metrics.lifetime_order_count, 0)
as lifetime_order_count,
coalesce(customer_metrics.lifetime_value_cents, 0)
as lifetime_value_cents,
customer_metrics.first_order_date,
customer_metrics.most_recent_order_date,
case
when customer_metrics.lifetime_order_count >= 5
then 'high_value'
when customer_metrics.lifetime_order_count >= 2
then 'returning'
else 'new'
end as customer_segment
from customers
left join customer_metrics
on customers.customer_id = customer_metrics.customer_id
)
select * from final
Every column has a clear, human-readable name. Analysts should never need to ask what lv_cents means - call it lifetime_value_cents.
Anti-patterns / common mistakes
| Mistake | Why it's wrong | What to do instead |
|---|---|---|
| Business logic in staging models | Staging should be a clean 1:1 mirror; mixing logic here makes debugging impossible | Move all joins, filters, and calculations to intermediate or mart layers |
| Metrics defined in BI tool only | Multiple dashboards will define "revenue" differently, causing trust erosion | Define metrics in YAML (MetricFlow/Cube) and expose through the semantic layer |
| No grain documentation | Without a stated grain, analysts build incorrect aggregations (double-counting) | Add "Grain: one row per X" to every mart model's YAML description |
| Skipping the intermediate layer | Mart models become 300+ line monsters with 8 CTEs and nested joins | Extract reusable transformations into int_ models that marts can ref() |
| Using SELECT * in models | Schema changes upstream silently add/remove columns, breaking downstream | Explicitly list every column in staging models |
| Hardcoded filter values | WHERE status != 'test' in 12 models; when the value changes, half get missed | Create a macro or a staging-layer filter applied once at the source boundary |
| No incremental strategy for large tables | Full table rebuilds take hours and spike warehouse costs | Use incremental models with a reliable updated_at or event timestamp |
Gotchas
Incremental models with a broken unique_key silently duplicate rows - If the unique_key doesn't match how the source system generates IDs (e.g., composite keys, NULL-able columns), the merge strategy falls back to appending and your fact table will have duplicate rows. Always test with a unique dbt test on the mart's primary key after the first incremental run.
ref() creates a compile-time dependency but not a runtime guarantee - dbt's ref() ensures build order, but if a staging model's source table is empty or missing, the downstream mart builds with zero rows and no error. Add not_null and row count tests to staging models so silent empty builds surface in CI.
Metrics defined in both the semantic layer and the BI tool diverge - If analysts can also create calculated fields in Looker/Tableau/Power BI, they will. Within months there will be two definitions of "revenue" and no one knows which is correct. Enforce a semantic-layer-first policy and audit BI tool custom fields quarterly.
SELECT * in staging models breaks on upstream schema changes - When a source table adds or removes a column, SELECT * staging models silently change shape, potentially breaking downstream marts. Explicitly list every column in staging models so schema changes cause a compile error rather than silent breakage.
Hardcoded dates in incremental WHERE clauses don't survive full refreshes - An incremental model that filters with where created_at > '2024-01-01' will drop historical data on a --full-refresh. Use {{ this }} to reference the current max timestamp, and document what happens on a forced full refresh.
References
For detailed patterns and implementation guidance, load the relevant file from
references/:
- references/dbt-patterns.md - Advanced dbt patterns including macros, packages, hooks, custom materializations, and CI/CD integration
- references/semantic-layer.md - Deep dive into MetricFlow configuration, Cube setup, dimension/measure types, and BI tool integration
- references/self-serve-analytics.md - Patterns for building analyst-friendly data platforms, documentation strategies, and data catalog integration
Only load a references file if the current task requires it - they are long and will consume context.
References
dbt-patterns.md
Advanced dbt Patterns
Macros
Macros are reusable Jinja functions. Use them to DRY up repeated SQL patterns.
Utility macro: cents to dollars
-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name, precision=2) %}
round({{ column_name }} / 100.0, {{ precision }})
{% endmacro %}
Usage in a model:
select
order_id,
amount_cents,
{{ cents_to_dollars('amount_cents') }} as amount_dollars
from {{ ref('stg_stripe__payments') }}
Generate schema name macro
Override the default schema naming to control where models land:
-- macros/generate_schema_name.sql
{% macro generate_schema_name(custom_schema_name, node) -%}
{%- set default_schema = target.schema -%}
{%- if custom_schema_name is none -%}
{{ default_schema }}
{%- elif target.name == 'prod' -%}
{{ custom_schema_name | trim }}
{%- else -%}
{{ default_schema }}_{{ custom_schema_name | trim }}
{%- endif -%}
{%- endmacro %}
This ensures schema: finance resolves to finance in prod but dev_jsmith_finance in development - preventing dev runs from writing to production schemas.
Pivot macro
-- macros/pivot.sql
{% macro pivot(column, values, alias=True, agg='sum', then_value=1, else_value=0) %}
{% for value in values %}
{{ agg }}(
case when {{ column }} = '{{ value }}'
then {{ then_value }}
else {{ else_value }}
end
) {% if alias %} as {{ column }}_{{ value | replace(' ', '_') | lower }} {% endif %}
{% if not loop.last %},{% endif %}
{% endfor %}
{% endmacro %}
Essential packages
Add to packages.yml:
packages:
- package: dbt-labs/dbt_utils
version: [">=1.0.0", "<2.0.0"]
- package: calogica/dbt_expectations
version: [">=0.10.0", "<0.11.0"]
- package: dbt-labs/codegen
version: [">=0.12.0", "<0.13.0"]
dbt_utils highlights
| Macro | Purpose |
|---|---|
| generate_surrogate_key(['col1', 'col2']) | Generate deterministic hash keys for dimensions |
| star(from=ref('model'), except=['col']) | Select all columns except specified ones |
| date_spine(datepart, start_date, end_date) | Generate a continuous date dimension |
| pivot(column, values, then_value) | Pivot rows to columns |
| union_relations([ref('a'), ref('b')]) | UNION ALL with column alignment |
| accepted_range(min_value, max_value) | Test that values fall within a range |
| at_least_one | Test that a column has at least one non-null value |
dbt_expectations highlights
Statistical and distribution tests:
columns:
- name: amount_cents
tests:
- dbt_expectations.expect_column_values_to_be_between:
min_value: 0
max_value: 10000000
- dbt_expectations.expect_column_mean_to_be_between:
min_value: 1000
max_value: 50000
Hooks
Post-hook: grant access after model build
# dbt_project.yml
models:
my_project:
marts:
+post-hook:
- "grant select on {{ this }} to role analyst_readonly"
On-run-end: log run completion
on-run-end:
- "{{ log('Run completed at ' ~ run_started_at, info=True) }}"
Custom materializations
Incremental with delete+insert strategy
For warehouses that do not support MERGE (e.g., Redshift without MERGE support):
{{
config(
materialized='incremental',
incremental_strategy='delete+insert',
unique_key='event_id',
on_schema_change='append_new_columns'
)
}}
Snapshot (SCD Type 2)
-- snapshots/snap_customers.sql
{% snapshot snap_customers %}
{{
config(
target_schema='snapshots',
unique_key='customer_id',
strategy='timestamp',
updated_at='updated_at',
invalidate_hard_deletes=True
)
}}
select * from {{ source('shopify', 'customers') }}
{% endsnapshot %}
This creates dbt_valid_from, dbt_valid_to, and dbt_updated_at columns automatically. A row with dbt_valid_to IS NULL is the current version.
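Building on that, a common downstream pattern is a model that keeps only the current version of each row. A sketch, assuming the snap_customers snapshot above (the model filename is hypothetical):

```sql
-- models/staging/shopify/stg_shopify__customers_current.sql (hypothetical)
-- Keep only the currently-valid row per customer from the SCD Type 2 snapshot
select *
from {{ ref('snap_customers') }}
where dbt_valid_to is null
```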
CI/CD integration
Slim CI with state comparison
# Only run models that changed since last production run
dbt build --select state:modified+ --state ./prod-manifest/
CI pipeline steps
# .github/workflows/dbt-ci.yml (simplified)
steps:
- name: Install deps
run: dbt deps
- name: Check source freshness
run: dbt source freshness --target ci
- name: Build changed models
run: dbt build --select state:modified+ --state ./prod-manifest/ --target ci
- name: Run all tests on changed models
run: dbt test --select state:modified+ --state ./prod-manifest/ --target ci
- name: Generate docs
run: dbt docs generate
Production run pattern
# Full refresh on schedule (weekly or as needed)
dbt build --full-refresh --target prod
# Daily incremental run
dbt build --target prod
# Source freshness check before build
dbt source freshness --target prod && dbt build --target prod
Model selection syntax
| Selector | Meaning |
|---|---|
| dbt run -s my_model | Run one model |
| dbt run -s my_model+ | Run model and all downstream |
| dbt run -s +my_model | Run model and all upstream |
| dbt run -s +my_model+ | Run full lineage (upstream + downstream) |
| dbt run -s tag:finance | Run all models tagged "finance" |
| dbt run -s path:models/marts/finance | Run all models in a directory |
| dbt run -s state:modified+ | Run changed models and downstream (CI) |
| dbt run --exclude stg_legacy__* | Run everything except legacy staging |
Project configuration best practices
# dbt_project.yml
name: my_analytics
version: '1.0.0'
profile: my_analytics
vars:
my_analytics:
start_date: '2020-01-01'
models:
my_analytics:
staging:
+materialized: view
+schema: staging
intermediate:
+materialized: ephemeral
marts:
+materialized: table
+schema: analytics
finance:
+tags: ['finance', 'daily']
marketing:
+tags: ['marketing', 'daily']
seeds:
my_analytics:
+schema: seeds
snapshots:
my_analytics:
+target_schema: snapshots
Key rules:
- Staging as views (no warehouse cost, always fresh)
- Intermediate as ephemeral (inlined into downstream queries, no table created)
- Marts as tables (fast for analyst queries)
- Override per-folder when needed (e.g., a heavy intermediate that should be a table)
self-serve-analytics.md
Self-Serve Analytics Patterns
The self-serve spectrum
Self-serve analytics is not binary. Teams operate on a spectrum:
| Level | Who queries | What they need | Analytics eng. effort |
|---|---|---|---|
| 1. Dashboard consumers | Everyone | Pre-built dashboards, no SQL | Low - build once, maintain |
| 2. Guided exploration | Analysts | Curated datasets + BI tool | Medium - model + document |
| 3. SQL self-serve | Data-literate staff | Clean marts + docs | High - full modeling + semantic layer |
| 4. Full autonomy | Data engineers | Raw + modeled data | Minimal - provide access + catalog |
Most teams should target Level 2-3. Level 4 is for data teams querying their own warehouse. Level 1 alone creates a dashboard factory with infinite ticket queues.
Building analyst-friendly marts
Naming conventions
| Pattern | Example | Rule |
|---|---|---|
| Fact tables | fct_orders | Prefix with fct_, use plural noun |
| Dimension tables | dim_customers | Prefix with dim_, use plural noun |
| Date columns | order_date, created_at | Suffix with _date (date) or _at (timestamp) |
| Boolean columns | is_active, has_subscription | Prefix with is_ or has_ |
| Amount columns | total_amount_cents | Include unit in name (_cents, _usd, _seconds) |
| Count columns | lifetime_order_count | Suffix with _count |
| ID columns | customer_id | Suffix with _id, match the dimension table name |
Never use abbreviations. cust_id saves 4 characters and costs every analyst a
lookup. Use customer_id.
Column descriptions in YAML
Every column in every mart model must have a description:
models:
- name: fct_orders
description: >
One row per order. Grain: order_id. Includes payment totals joined from
the payments intermediate model. Updated incrementally on each dbt run.
columns:
- name: order_id
description: "Unique order identifier from Shopify. Primary key."
- name: customer_id
description: "FK to dim_customers. The customer who placed this order."
- name: order_date
description: "Date the order was placed (UTC, date only, no time)."
- name: status
description: >
Current order status. Values: pending, confirmed, shipped, delivered,
cancelled, refunded.
- name: total_amount_cents
description: >
Total order value in US cents (integer). Divide by 100 for dollars.
Includes tax, excludes shipping.
- name: is_first_order
description: >
True if this is the customer's first order by order_date. Useful for
new vs returning customer analysis.
Grain documentation pattern
Every mart model's YAML description must state the grain explicitly:
Grain: one row per <entity>
Examples:
- fct_orders: "Grain: one row per order"
- fct_order_items: "Grain: one row per order line item"
- dim_customers: "Grain: one row per customer (current state)"
- fct_daily_active_users: "Grain: one row per user per day"
If you cannot state the grain in one sentence, the model is likely mixing grains and should be split.
Documentation strategies
dbt docs
Generate and host dbt docs for the entire project:
dbt docs generate
dbt docs serve --port 8080
For production, deploy to a static hosting service (S3 + CloudFront, Netlify, etc.) or use dbt Cloud's hosted docs.
What to document beyond column descriptions
- Model-level description - What this model is, its grain, key assumptions
- Source freshness expectations - How often data arrives, acceptable lag
- Known limitations - "Does not include orders from the legacy system before 2021"
- Business rules - "An order is considered 'completed' when status = 'delivered' AND payment_status = 'captured'"
- Metric definitions - Link to the semantic layer metric, not a BI dashboard
Data dictionary template
For teams that need a document outside of dbt docs:
## fct_orders
| Column | Type | Description | Example |
|---|---|---|---|
| order_id | varchar | Unique order ID from Shopify | ord_abc123 |
| customer_id | varchar | FK to dim_customers | cust_xyz789 |
| order_date | date | Date order was placed (UTC) | 2024-03-15 |
| status | varchar | Order status enum | delivered |
| total_amount_cents | integer | Total in cents, incl. tax, excl. shipping | 4999 |
**Grain**: One row per order
**Refresh**: Incremental, every 6 hours
**Owner**: Data Engineering (@data-eng)
Data catalog integration
Popular catalog tools
| Tool | Type | Best for |
|---|---|---|
| dbt Docs | Built-in | Teams already using dbt |
| Atlan | SaaS catalog | Enterprise, governance-heavy |
| DataHub (LinkedIn) | Open source | Large orgs, custom metadata |
| Amundsen (Lyft) | Open source | Discovery-focused |
| Select Star | SaaS | Automated lineage |
| Monte Carlo | SaaS | Data observability + catalog |
Catalog integration pattern
- Run dbt docs generate to produce manifest.json and catalog.json
- Push these artifacts to your catalog tool's API
- The catalog ingests model descriptions, column descriptions, lineage, and tests
- Analysts search the catalog, find the right table, and start querying
# Example: push dbt artifacts to DataHub
datahub ingest -c dbt_recipe.yml
# dbt_recipe.yml (DataHub)
source:
type: dbt
config:
manifest_path: target/manifest.json
catalog_path: target/catalog.json
target_platform: snowflake
Access patterns for self-serve
Role-based access
-- Create analyst role with read-only access to marts
CREATE ROLE analyst_readonly;
GRANT USAGE ON SCHEMA analytics TO analyst_readonly;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics TO analyst_readonly;
ALTER DEFAULT PRIVILEGES IN SCHEMA analytics
GRANT SELECT ON TABLES TO analyst_readonly;
-- Staging and raw schemas are NOT granted to analysts
-- They should only query marts
Query guardrails
Prevent runaway queries from consuming warehouse resources:
-- Snowflake: warehouse-level query timeout
ALTER WAREHOUSE analyst_wh SET
STATEMENT_TIMEOUT_IN_SECONDS = 300;
-- BigQuery: use BI Engine for fast, bounded queries
-- Redshift: WLM queue with memory and concurrency limits
Onboarding analysts to self-serve
The self-serve onboarding checklist
- Share the dbt docs URL (or data catalog link)
- Walk through the 3-5 most important mart tables and their grains
- Show how to find metric definitions in the semantic layer
- Provide 5-10 example queries covering common analysis patterns
- Set up a Slack channel (#data-questions) for questions with a 24h SLA
- Review their first 3 queries/dashboards for correctness
- After 2 weeks, check: are they filing fewer data tickets? If not, the models or documentation need improvement, not the analyst.
Example starter queries for analysts
-- Revenue by month
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(total_amount_cents) / 100.0 AS revenue_dollars
FROM analytics.fct_orders
WHERE status = 'delivered'
GROUP BY 1
ORDER BY 1;
-- Customer segmentation breakdown
SELECT
customer_segment,
COUNT(*) AS customer_count,
AVG(lifetime_value_cents) / 100.0 AS avg_ltv_dollars
FROM analytics.dim_customers
GROUP BY 1
ORDER BY 3 DESC;
-- New vs returning customers by week
SELECT
DATE_TRUNC('week', o.order_date) AS week,
COUNT(CASE WHEN o.is_first_order THEN 1 END) AS new_customers,
COUNT(CASE WHEN NOT o.is_first_order THEN 1 END) AS returning_customers
FROM analytics.fct_orders o
WHERE o.status != 'cancelled'
GROUP BY 1
ORDER BY 1;
Measuring self-serve success
Track these metrics to know if self-serve is working:
| Metric | Target | How to measure |
|---|---|---|
| Data ticket volume | Decreasing month-over-month | Count tickets tagged "data request" |
| Time to first query | < 1 week for new analysts | Warehouse audit logs |
| Query error rate | < 10% of analyst queries fail | Warehouse query history |
| Dashboard trust score | > 4/5 in quarterly survey | Survey stakeholders |
| Metric definition coverage | > 80% of KPIs in semantic layer | Audit metrics YAML vs BI tools |
semantic-layer.md
Semantic Layer Deep Dive
What is a semantic layer?
A semantic layer is a declarative abstraction between your data warehouse tables and the consumers (BI tools, notebooks, APIs, LLM agents). It defines:
- Entities - the join keys and relationships between tables
- Dimensions - the columns you group by or filter on
- Measures - the aggregations (sum, count, average) applied to columns
- Metrics - business-level calculations composed from measures
The semantic layer compiles these definitions into optimized SQL at query time. Every consumer gets the same metric logic, regardless of which tool they use.
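As a sketch of what consuming the layer looks like from the command line, MetricFlow's open-source CLI can compile a metric request into warehouse SQL. The metric and dimension names here assume the examples below and a configured dbt project:

```shell
# Query a metric grouped by month; MetricFlow compiles this to warehouse SQL
mf query --metrics revenue --group-by metric_time__month

# Show the generated SQL instead of executing it
mf query --metrics revenue --group-by metric_time__month --explain
```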
MetricFlow (dbt Semantic Layer)
MetricFlow is the engine behind the dbt Semantic Layer (dbt Cloud). It uses
semantic_models and metrics defined in YAML.
Semantic model anatomy
semantic_models:
- name: orders
description: "Order fact table"
model: ref('fct_orders')
defaults:
agg_time_dimension: order_date
entities:
- name: order_id
type: primary
- name: customer_id
type: foreign
- name: product_id
type: foreign
dimensions:
- name: order_date
type: time
type_params:
time_granularity: day
- name: status
type: categorical
- name: channel
type: categorical
measures:
- name: order_count
agg: count
expr: order_id
description: "Count of orders"
- name: total_revenue
agg: sum
expr: total_amount_cents
description: "Sum of order amounts in cents"
- name: avg_order_value
agg: average
expr: total_amount_cents
Entity types
| Type | Meaning | Example |
|---|---|---|
| primary | Unique identifier for this semantic model | order_id in orders |
| foreign | References a primary entity in another model | customer_id in orders |
| unique | Unique but not the primary grain | email in customers |
| natural | Non-unique identifier (used for joins) | session_id shared across events |
Dimension types
| Type | Use for | Granularity options |
|---|---|---|
| categorical | String/enum columns you group by | N/A |
| time | Timestamp/date columns for time-series | day, week, month, quarter, year |
Measure aggregation types
| Agg | SQL equivalent | Notes |
|---|---|---|
| `sum` | `SUM(expr)` | Most common for revenue, quantities |
| `count` | `COUNT(expr)` | Count of non-null values |
| `count_distinct` | `COUNT(DISTINCT expr)` | Unique customers, sessions |
| `average` | `AVG(expr)` | Use with caution - averages of averages are wrong |
| `min` / `max` | `MIN(expr)` / `MAX(expr)` | First/last dates, extremes |
| `median` | `PERCENTILE_CONT(0.5)` | Warehouse support varies |
| `sum_boolean` | `SUM(CASE WHEN expr THEN 1 ELSE 0 END)` | Count of true values |
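The semantics of the counting aggregations differ in ways that matter for metrics. A quick sanity check on a toy column, using Python stand-ins for the SQL expressions above:

```python
# Toy column with a repeated value and a NULL, mimicking an order
# status column in the warehouse.
values = ["completed", "completed", "refunded", None]

count = sum(1 for v in values if v is not None)             # COUNT(expr): skips NULLs -> 3
count_distinct = len({v for v in values if v is not None})  # COUNT(DISTINCT expr) -> 2
sum_boolean = sum(1 for v in values if v == "completed")    # SUM(CASE WHEN ...) -> 2
print(count, count_distinct, sum_boolean)
```

If you need "number of completed orders", `sum_boolean` over a boolean expression is the right tool; a plain `count` over the status column counts every non-null row regardless of status.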
Metric types
Simple metric
Direct reference to a single measure:
```yaml
metrics:
  - name: total_orders
    type: simple
    label: "Total Orders"
    description: "Count of all orders"
    type_params:
      measure: order_count
```
Derived metric
Calculation across multiple measures:
```yaml
metrics:
  - name: average_revenue_per_customer
    type: derived
    label: "Revenue per Customer"
    description: "Total revenue divided by unique customer count"
    type_params:
      expr: total_revenue / unique_customers
      metrics:
        - name: total_revenue
        - name: unique_customers
```
Ratio metric
Special case of derived for common ratio patterns:
```yaml
metrics:
  - name: order_conversion_rate
    type: ratio
    label: "Order Conversion Rate"
    type_params:
      numerator: completed_orders
      denominator: total_orders
```
Cumulative metric
Running totals over time:
```yaml
metrics:
  - name: cumulative_revenue
    type: cumulative
    label: "Cumulative Revenue"
    type_params:
      measure: total_revenue
      window: 1 month          # rolling 1-month window; alternatively use
      # grain_to_date: month   # period-to-date (mutually exclusive with window)
```
Note that `window` (rolling total) and `grain_to_date` (period-to-date) are alternatives: specify one or the other, not both.
Cube semantic layer
Cube is an alternative semantic layer that works with any BI tool via REST/GraphQL API.
Cube schema file
```javascript
// schema/Orders.js
cube('Orders', {
  sql_table: 'analytics.fct_orders',

  joins: {
    Customers: {
      relationship: 'many_to_one',
      sql: `${CUBE}.customer_id = ${Customers}.customer_id`
    }
  },

  dimensions: {
    orderId: {
      sql: 'order_id',
      type: 'string',
      primary_key: true
    },
    status: {
      sql: 'status',
      type: 'string'
    },
    orderDate: {
      sql: 'order_date',
      type: 'time'
    }
  },

  measures: {
    count: {
      type: 'count'
    },
    totalRevenue: {
      sql: 'total_amount_cents',
      type: 'sum'
    },
    averageOrderValue: {
      sql: 'total_amount_cents',
      type: 'avg'
    }
  }
});
```
Cube vs MetricFlow comparison
| Feature | MetricFlow (dbt) | Cube |
|---|---|---|
| Definition format | YAML | JavaScript/YAML |
| Query API | dbt Cloud Semantic Layer API | REST + GraphQL |
| Caching | Warehouse-level | Built-in pre-aggregations |
| BI integrations | Tableau, Hex, Mode (via dbt Cloud) | Any tool via API |
| Join handling | Entity-based auto-joins | Explicit join definitions |
| Best for | dbt-centric stacks | Multi-tool, API-first stacks |
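Cube's query API takes a JSON query object naming measures, dimensions, and time dimensions from the cube schema. The sketch below builds such a payload against the `Orders` cube defined earlier; the endpoint path and auth shown in the comments are deployment-specific, and exact request shape may vary by Cube version, so treat this as an assumption-laden outline rather than a definitive client.

```python
import json

# A Cube REST query referencing the measures/dimensions declared in
# the Orders cube above. Host and JWT are deployment-specific.
query = {
    "measures": ["Orders.totalRevenue", "Orders.count"],
    "dimensions": ["Orders.status"],
    "timeDimensions": [
        {"dimension": "Orders.orderDate", "granularity": "month"}
    ],
    "limit": 100,
}

# A real request would look roughly like:
#   POST https://<cube-host>/cubejs-api/v1/load
#   Authorization: <JWT>
#   Body: {"query": <query>}
payload = json.dumps({"query": query})
print(payload)
```

Because the query names cube members rather than tables and columns, any consumer (BI tool, notebook, agent) gets the same `totalRevenue` logic.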
BI tool integration patterns
Exposures (dbt)
Document which BI assets depend on which models:
```yaml
# models/exposures.yml
exposures:
  - name: weekly_revenue_dashboard
    type: dashboard
    maturity: high
    url: https://bi.company.com/dashboards/42
    description: "Executive revenue dashboard, refreshed daily"
    depends_on:
      - ref('fct_orders')
      - ref('dim_customers')
    owner:
      name: Data Team
      email: data@company.com
```
Exposures show up in the dbt docs DAG, making it visible which dashboards break when a model changes.
Metric query patterns
When querying metrics through the semantic layer API:
```python
# dbt Cloud Semantic Layer (Python SDK)
from dbt_sl_sdk import SemanticLayerClient

client = SemanticLayerClient(
    environment_id=12345,
    auth_token="dbt_cloud_token"
)

result = client.query(
    metrics=["revenue", "order_count"],
    group_by=["metric_time__month", "customer__segment"],
    where=["{{ Dimension('order__status') }} = 'completed'"],
    order_by=["-metric_time__month"],
    limit=100
)
```
Common semantic layer pitfalls
| Pitfall | Impact | Fix |
|---|---|---|
| Averaging an average | Mathematically incorrect results | Use sum / count as a derived metric instead |
| Missing time dimension | Cannot do time-series analysis on the metric | Every semantic model needs at least one time dimension |
| Fanout joins | Measures inflate when joining one-to-many | Define entities correctly; use count_distinct not count |
| No metric descriptions | Analysts cannot discover or trust metrics | Every metric must have a description and label |
| Too many metrics | Decision paralysis, conflicting definitions | Curate 15-25 core metrics; archive the rest |
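The first pitfall is easy to demonstrate with a toy dataset: average order value computed correctly (total revenue over total orders) versus averaging each customer's own average. The numbers here are synthetic, purely to show the divergence:

```python
# Two customers with very different order counts.
orders = [
    {"customer": "a", "amount": 100},
    {"customer": "a", "amount": 100},
    {"customer": "a", "amount": 100},
    {"customer": "b", "amount": 400},
]

# Correct AOV: total revenue / total orders = 700 / 4
correct = sum(o["amount"] for o in orders) / len(orders)  # 175.0

# Wrong: average each customer's AOV, then average those
per_customer = {}
for o in orders:
    per_customer.setdefault(o["customer"], []).append(o["amount"])
avg_of_avgs = sum(sum(v) / len(v) for v in per_customer.values()) / len(per_customer)  # 250.0

print(correct, avg_of_avgs)
```

The average-of-averages weights each customer equally regardless of order volume, which is why the fix is to define the metric as a derived `sum / count` rather than averaging a pre-averaged column.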
Frequently Asked Questions
What is analytics-engineering?
Use this skill when building dbt models, designing semantic layers, defining metrics, creating self-serve analytics, or structuring a data warehouse for analyst consumption. Triggers on dbt project setup, model layering (staging, intermediate, marts), ref() and source() usage, YAML schema definitions, metrics definitions, semantic layer configuration, dimensional modeling, slowly changing dimensions, data testing, and any task requiring analytics engineering best practices.
How do I install analytics-engineering?
Run `npx skills add AbsolutelySkilled/AbsolutelySkilled --skill analytics-engineering` in your terminal. The skill will be immediately available in your AI coding agent.
What AI agents support analytics-engineering?
analytics-engineering works with claude-code, gemini-cli, openai-codex. Install it once and use it across any supported AI coding agent.