analytics-engineering
Use this skill when building dbt models, designing semantic layers, defining metrics, creating self-serve analytics, or structuring a data warehouse for analyst consumption. Triggers on dbt project setup, model layering (staging, intermediate, marts), ref() and source() usage, YAML schema definitions, metrics definitions, semantic layer configuration, dimensional modeling, slowly changing dimensions, data testing, and any task requiring analytics engineering best practices.
analytics-engineering
analytics-engineering is a production-ready AI agent skill for claude-code, gemini-cli, and openai-codex. It covers building dbt models, designing semantic layers, defining metrics, creating self-serve analytics, and structuring a data warehouse for analyst consumption.
Quick Facts
| Field | Value |
|---|---|
| Category | data |
| Version | 0.1.0 |
| Platforms | claude-code, gemini-cli, openai-codex |
| License | MIT |
How to Install
- Make sure you have Node.js installed on your machine.
- Run the following command in your terminal:
npx skills add AbsolutelySkilled/AbsolutelySkilled --skill analytics-engineering
- The analytics-engineering skill is now available in your AI coding agent (Claude Code, Gemini CLI, OpenAI Codex, etc.).
Overview
A disciplined framework for building trustworthy, well-tested data transformation pipelines using dbt and modern analytics engineering practices. This skill covers dbt model layering, semantic layer design, metrics definitions, dimensional modeling, and self-serve analytics patterns. It is opinionated about dbt Core/Cloud but the modeling principles apply to any SQL-based transformation tool. The goal is to help you build a data warehouse that analysts can trust and navigate without engineering support.
Tags
dbt analytics metrics semantic-layer data-warehouse self-serve
Platforms
- claude-code
- gemini-cli
- openai-codex
Frequently Asked Questions
What is analytics-engineering?
Use this skill when building dbt models, designing semantic layers, defining metrics, creating self-serve analytics, or structuring a data warehouse for analyst consumption. Triggers on dbt project setup, model layering (staging, intermediate, marts), ref() and source() usage, YAML schema definitions, metrics definitions, semantic layer configuration, dimensional modeling, slowly changing dimensions, data testing, and any task requiring analytics engineering best practices.
How do I install analytics-engineering?
Run npx skills add AbsolutelySkilled/AbsolutelySkilled --skill analytics-engineering in your terminal. The skill will be immediately available in your AI coding agent.
What AI agents support analytics-engineering?
This skill works with claude-code, gemini-cli, openai-codex. Install it once and use it across any supported AI coding agent.
Maintainers
Generated from AbsolutelySkilled
SKILL.md
Analytics Engineering
A disciplined framework for building trustworthy, well-tested data transformation pipelines using dbt and modern analytics engineering practices. This skill covers dbt model layering, semantic layer design, metrics definitions, dimensional modeling, and self-serve analytics patterns. It is opinionated about dbt Core/Cloud but the modeling principles apply to any SQL-based transformation tool. The goal is to help you build a data warehouse that analysts can trust and navigate without engineering support.
When to use this skill
Trigger this skill when the user:
- Sets up a new dbt project or restructures an existing one
- Designs the model layer hierarchy (staging, intermediate, marts)
- Writes or reviews dbt models using ref(), source(), or macros
- Defines metrics in YAML (dbt Metrics, MetricFlow, or Cube)
- Builds a semantic layer for self-serve analytics
- Implements slowly changing dimensions (SCD Type 1, 2, 3)
- Writes dbt tests (generic, singular, or custom) and data contracts
- Configures sources, exposures, or freshness checks
- Asks about dimensional modeling (star schema, snowflake schema, OBT)
Do NOT trigger this skill for:
- Data pipeline orchestration (Airflow, Dagster) unrelated to dbt models
- Raw data ingestion or ELT tool configuration (Fivetran, Airbyte connectors)
Key principles
Layer your models deliberately - Use a three-layer architecture: staging (1:1 with source tables, rename and cast only), intermediate (business logic joins and filters), and marts (wide, denormalized tables ready for analysts). Every model lives in exactly one layer. No skipping layers.
One source of truth per grain - Each mart model must have a clearly defined grain (one row = one what?). Document it in the YAML schema. If two mart models have the same grain, one of them should not exist.
Test everything that matters, nothing that doesn't - Test primary keys with unique and not_null. Test foreign keys with relationships. Test business rules with custom singular tests. Do not write tests that duplicate what the warehouse already enforces.
Metrics are code, not queries - Define metrics in version-controlled YAML, not in BI tool calculated fields. This ensures a single definition that every consumer (dashboard, ad-hoc query, API) shares. Disagreements about numbers end when metric definitions are in the repo.
Build for self-serve, not for tickets - Every mart should be understandable by a non-engineer. Use clear column names (no abbreviations), add descriptions to every column in the YAML schema, and expose models as documented datasets in the BI tool. If analysts file tickets asking what a column means, the model is incomplete.
Core concepts
Model layer architecture
| Layer | Prefix | Purpose | Example |
|---|---|---|---|
| Staging | stg_ | 1:1 with source, rename + cast + basic cleaning | stg_stripe__payments |
| Intermediate | int_ | Business logic, joins across staging models | int_orders__pivoted_payments |
| Marts | fct_ / dim_ | Analyst-facing, denormalized, documented | fct_orders, dim_customers |
Staging models should be views (no materialization cost). Intermediate models are tables or ephemeral depending on reuse. Marts are always tables (or incremental).
Dimensional modeling
Fact tables (fct_) contain measurable events at a specific grain - orders,
payments, page views. They hold foreign keys to dimension tables and numeric measures.
Dimension tables (dim_) contain descriptive attributes - customers, products,
dates. They provide the "who, what, where, when" context for facts.
One Big Table (OBT) is a pre-joined wide table combining facts and dimensions. Use OBT for BI tools that perform poorly with joins. It trades storage for query simplicity.
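As an illustrative sketch (the model filename is hypothetical; the refs and columns are borrowed from the examples later in this skill), an OBT is typically just a mart-layer join of a fact to its dimensions:

```sql
-- models/marts/finance/obt_orders.sql (hypothetical example)
-- One Big Table: fact columns plus pre-joined dimension attributes,
-- so BI tools never need to perform the join themselves
select
    orders.order_id,
    orders.order_date,
    orders.status,
    orders.total_amount_cents,
    customers.customer_id,
    customers.full_name,
    customers.customer_segment
from {{ ref('fct_orders') }} as orders
left join {{ ref('dim_customers') }} as customers
    on orders.customer_id = customers.customer_id
```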
The semantic layer
A semantic layer sits between the data warehouse and consumers (BI tools, notebooks, APIs). It defines metrics, dimensions, and entities in a declarative format so that every consumer gets the same answers. dbt's MetricFlow, Cube, and Looker's LookML are implementations of this pattern. The semantic layer eliminates "which number is right?" debates by making metric logic authoritative and centralized.
Incremental models
For large fact tables, use dbt incremental models to process only new/changed rows
instead of rebuilding the entire table. The is_incremental() macro gates the WHERE
clause to filter for rows since the last run. Always define a unique_key to handle
late-arriving or updated records via merge behavior.
Common tasks
Set up dbt project structure
my_project/
dbt_project.yml
models/
staging/
stripe/
_stripe__models.yml # source + model definitions
_stripe__sources.yml # source freshness config
stg_stripe__payments.sql
stg_stripe__customers.sql
shopify/
_shopify__models.yml
_shopify__sources.yml
stg_shopify__orders.sql
intermediate/
int_orders__pivoted_payments.sql
marts/
finance/
_finance__models.yml
fct_orders.sql
dim_customers.sql
marketing/
_marketing__models.yml
fct_ad_spend.sql
tests/
singular/
assert_order_total_positive.sql
macros/
cents_to_dollars.sql
Use underscores for filenames, double underscores to separate source system from entity (e.g. stg_stripe__payments). Group staging models by source system, marts by business domain.
Write a staging model
Staging models rename, cast, and apply minimal cleaning. No joins, no business logic.
-- models/staging/stripe/stg_stripe__payments.sql
with source as (
select * from {{ source('stripe', 'payments') }}
),
renamed as (
select
id as payment_id,
order_id,
cast(amount as integer) as amount_cents,
cast(created as timestamp) as created_at,
status,
lower(currency) as currency
from source
)
select * from renamed
Build a mart fact table
-- models/marts/finance/fct_orders.sql
{{
config(
materialized='incremental',
unique_key='order_id',
on_schema_change='sync_all_columns'
)
}}
with orders as (
select * from {{ ref('stg_shopify__orders') }}
),
payments as (
select * from {{ ref('int_orders__pivoted_payments') }}
),
final as (
select
orders.order_id,
orders.customer_id,
orders.order_date,
orders.status,
payments.total_amount_cents,
payments.payment_method,
payments.total_amount_cents / 100.0 as total_amount_dollars
from orders
left join payments on orders.order_id = payments.order_id
{% if is_incremental() %}
where orders.updated_at > (select max(updated_at) from {{ this }})
{% endif %}
)
select * from final
Define metrics in YAML (MetricFlow)
# models/marts/finance/_finance__models.yml
semantic_models:
- name: orders
defaults:
agg_time_dimension: order_date
model: ref('fct_orders')
entities:
- name: order_id
type: primary
- name: customer_id
type: foreign
dimensions:
- name: order_date
type: time
type_params:
time_granularity: day
- name: status
type: categorical
measures:
- name: order_count
agg: count
expr: order_id
- name: total_revenue_cents
agg: sum
expr: total_amount_cents
- name: average_order_value_cents
agg: average
expr: total_amount_cents
metrics:
- name: revenue
type: derived
label: "Total Revenue"
description: "Sum of all order payments in dollars"
type_params:
expr: total_revenue_cents / 100
metrics:
- name: total_revenue_cents
- name: order_count
type: simple
label: "Order Count"
type_params:
measure: order_count
Write dbt tests and data contracts
# models/marts/finance/_finance__models.yml
models:
- name: fct_orders
description: "One row per order. Grain: order_id."
config:
contract:
enforced: true
columns:
- name: order_id
data_type: varchar
description: "Primary key - unique order identifier"
tests:
- unique
- not_null
- name: customer_id
description: "FK to dim_customers"
tests:
- not_null
- relationships:
to: ref('dim_customers')
field: customer_id
- name: total_amount_cents
data_type: integer
description: "Total order value in cents"
tests:
- not_null
- dbt_utils.accepted_range:
min_value: 0
-- tests/singular/assert_order_total_positive.sql
-- Returns rows that violate the rule (should return 0 rows to pass)
select order_id, total_amount_cents
from {{ ref('fct_orders') }}
where total_amount_cents < 0
Configure source freshness
# models/staging/stripe/_stripe__sources.yml
sources:
- name: stripe
database: raw
schema: stripe
loaded_at_field: _loaded_at
freshness:
warn_after: { count: 12, period: hour }
error_after: { count: 24, period: hour }
tables:
- name: payments
description: "Raw Stripe payment events"
columns:
- name: id
tests:
- unique
- not_null
Run dbt source freshness in CI to catch stale source data before it propagates into marts.
Build a self-serve dimension table
-- models/marts/finance/dim_customers.sql
with customers as (
select * from {{ ref('stg_shopify__customers') }}
),
orders as (
select * from {{ ref('fct_orders') }}
),
customer_metrics as (
select
customer_id,
count(*) as lifetime_order_count,
sum(total_amount_cents) as lifetime_value_cents,
min(order_date) as first_order_date,
max(order_date) as most_recent_order_date
from orders
group by customer_id
),
final as (
select
customers.customer_id,
customers.full_name,
customers.email,
customers.created_at as customer_since,
coalesce(customer_metrics.lifetime_order_count, 0)
as lifetime_order_count,
coalesce(customer_metrics.lifetime_value_cents, 0)
as lifetime_value_cents,
customer_metrics.first_order_date,
customer_metrics.most_recent_order_date,
case
when customer_metrics.lifetime_order_count >= 5
then 'high_value'
when customer_metrics.lifetime_order_count >= 2
then 'returning'
else 'new'
end as customer_segment
from customers
left join customer_metrics
on customers.customer_id = customer_metrics.customer_id
)
select * from final
Every column has a clear, human-readable name. Analysts should never need to ask what lv_cents means - call it lifetime_value_cents.
Anti-patterns / common mistakes
| Mistake | Why it's wrong | What to do instead |
|---|---|---|
| Business logic in staging models | Staging should be a clean 1:1 mirror; mixing logic here makes debugging impossible | Move all joins, filters, and calculations to intermediate or mart layers |
| Metrics defined in BI tool only | Multiple dashboards will define "revenue" differently, causing trust erosion | Define metrics in YAML (MetricFlow/Cube) and expose through the semantic layer |
| No grain documentation | Without a stated grain, analysts build incorrect aggregations (double-counting) | Add "Grain: one row per X" to every mart model's YAML description |
| Skipping the intermediate layer | Mart models become 300+ line monsters with 8 CTEs and nested joins | Extract reusable transformations into int_ models that marts can ref() |
| Using SELECT * in models | Schema changes upstream silently add/remove columns, breaking downstream | Explicitly list every column in staging models |
| Hardcoded filter values | WHERE status != 'test' in 12 models; when the value changes, half get missed | Create a macro or a staging-layer filter applied once at the source boundary |
| No incremental strategy for large tables | Full table rebuilds take hours and spike warehouse costs | Use incremental models with a reliable updated_at or event timestamp |
Gotchas
Incremental models with a broken unique_key silently duplicate rows - If the unique_key doesn't match how the source system generates IDs (e.g., composite keys, NULL-able columns), the merge strategy falls back to appending and your fact table will have duplicate rows. Always test with a unique dbt test on the mart's primary key after the first incremental run.
ref() creates a compile-time dependency but not a runtime guarantee - dbt's ref() ensures build order, but if a staging model's source table is empty or missing, the downstream mart builds with zero rows and no error. Add not_null and row count tests to staging models so silent empty builds surface in CI.
Metrics defined in both the semantic layer and the BI tool diverge - If analysts can also create calculated fields in Looker/Tableau/Power BI, they will. Within months there will be two definitions of "revenue" and no one knows which is correct. Enforce a semantic-layer-first policy and audit BI tool custom fields quarterly.
SELECT * in staging models breaks on upstream schema changes - When a source table adds or removes a column, SELECT * staging models silently change shape, potentially breaking downstream marts. Explicitly list every column in staging models so schema changes cause a compile error rather than silent breakage.
Hardcoded dates in incremental WHERE clauses don't survive full refreshes - An incremental model that filters with where created_at > '2024-01-01' will drop historical data on a --full-refresh. Use {{ this }} to reference the current max timestamp, and document what happens on a forced full refresh.
References
For detailed patterns and implementation guidance, load the relevant file from
references/:
- references/dbt-patterns.md - Advanced dbt patterns including macros, packages, hooks, custom materializations, and CI/CD integration
- references/semantic-layer.md - Deep dive into MetricFlow configuration, Cube setup, dimension/measure types, and BI tool integration
- references/self-serve-analytics.md - Patterns for building analyst-friendly data platforms, documentation strategies, and data catalog integration
Only load a references file if the current task requires it - they are long and will consume context.
References
dbt-patterns.md
Advanced dbt Patterns
Macros
Macros are reusable Jinja functions. Use them to DRY up repeated SQL patterns.
Utility macro: cents to dollars
-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name, precision=2) %}
round({{ column_name }} / 100.0, {{ precision }})
{% endmacro %}
Usage in a model:
select
order_id,
amount_cents,
{{ cents_to_dollars('amount_cents') }} as amount_dollars
from {{ ref('stg_stripe__payments') }}
Generate schema name macro
Override the default schema naming to control where models land:
-- macros/generate_schema_name.sql
{% macro generate_schema_name(custom_schema_name, node) -%}
{%- set default_schema = target.schema -%}
{%- if custom_schema_name is none -%}
{{ default_schema }}
{%- elif target.name == 'prod' -%}
{{ custom_schema_name | trim }}
{%- else -%}
{{ default_schema }}_{{ custom_schema_name | trim }}
{%- endif -%}
{%- endmacro %}
This ensures schema: finance resolves to finance in prod but dev_jsmith_finance in development - preventing dev runs from writing to production schemas.
Pivot macro
-- macros/pivot.sql
{% macro pivot(column, values, alias=True, agg='sum', then_value=1, else_value=0) %}
{% for value in values %}
{{ agg }}(
case when {{ column }} = '{{ value }}'
then {{ then_value }}
else {{ else_value }}
end
) {% if alias %} as {{ column }}_{{ value | replace(' ', '_') | lower }} {% endif %}
{% if not loop.last %},{% endif %}
{% endfor %}
{% endmacro %}
Essential packages
Add to packages.yml:
packages:
- package: dbt-labs/dbt_utils
version: [">=1.0.0", "<2.0.0"]
- package: calogica/dbt_expectations
version: [">=0.10.0", "<0.11.0"]
- package: dbt-labs/codegen
version: [">=0.12.0", "<0.13.0"]
dbt_utils highlights
| Macro | Purpose |
|---|---|
| generate_surrogate_key(['col1', 'col2']) | Generate deterministic hash keys for dimensions |
| star(from=ref('model'), except=['col']) | Select all columns except specified ones |
| date_spine(datepart, start_date, end_date) | Generate a continuous date dimension |
| pivot(column, values, then_value) | Pivot rows to columns |
| union_relations([ref('a'), ref('b')]) | UNION ALL with column alignment |
| accepted_range(min_value, max_value) | Test that values fall within a range |
| at_least_one | Test that a column has at least one non-null value |
dbt_expectations highlights
Statistical and distribution tests:
columns:
- name: amount_cents
tests:
- dbt_expectations.expect_column_values_to_be_between:
min_value: 0
max_value: 10000000
- dbt_expectations.expect_column_mean_to_be_between:
min_value: 1000
max_value: 50000
Hooks
Post-hook: grant access after model build
# dbt_project.yml
models:
my_project:
marts:
+post-hook:
- "grant select on {{ this }} to role analyst_readonly"
On-run-end: log run completion
on-run-end:
- "{{ log('Run completed at ' ~ run_started_at, info=True) }}"
Custom materializations
Incremental with delete+insert strategy
For warehouses that do not support MERGE (e.g., Redshift without MERGE support):
{{
config(
materialized='incremental',
incremental_strategy='delete+insert',
unique_key='event_id',
on_schema_change='append_new_columns'
)
}}
Snapshot (SCD Type 2)
-- snapshots/snap_customers.sql
{% snapshot snap_customers %}
{{
config(
target_schema='snapshots',
unique_key='customer_id',
strategy='timestamp',
updated_at='updated_at',
invalidate_hard_deletes=True
)
}}
select * from {{ source('shopify', 'customers') }}
{% endsnapshot %}
This creates dbt_valid_from, dbt_valid_to, and dbt_updated_at columns automatically. A row with dbt_valid_to IS NULL is the current version.
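Building on that, a common downstream pattern is a model that keeps only the current version of each row. A sketch, assuming the snap_customers snapshot above (the model filename is hypothetical):

```sql
-- models/staging/shopify/stg_shopify__customers_current.sql (hypothetical)
-- Keep only the currently-valid row per customer from the SCD Type 2 snapshot
select *
from {{ ref('snap_customers') }}
where dbt_valid_to is null
```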
CI/CD integration
Slim CI with state comparison
# Only run models that changed since last production run
dbt build --select state:modified+ --state ./prod-manifest/
CI pipeline steps
# .github/workflows/dbt-ci.yml (simplified)
steps:
- name: Install deps
run: dbt deps
- name: Check source freshness
run: dbt source freshness --target ci
- name: Build changed models
run: dbt build --select state:modified+ --state ./prod-manifest/ --target ci
- name: Run all tests on changed models
run: dbt test --select state:modified+ --state ./prod-manifest/ --target ci
- name: Generate docs
run: dbt docs generate
Production run pattern
# Full refresh on schedule (weekly or as needed)
dbt build --full-refresh --target prod
# Daily incremental run
dbt build --target prod
# Source freshness check before build
dbt source freshness --target prod && dbt build --target prod
Model selection syntax
| Selector | Meaning |
|---|---|
| dbt run -s my_model | Run one model |
| dbt run -s my_model+ | Run model and all downstream |
| dbt run -s +my_model | Run model and all upstream |
| dbt run -s +my_model+ | Run full lineage (upstream + downstream) |
| dbt run -s tag:finance | Run all models tagged "finance" |
| dbt run -s path:models/marts/finance | Run all models in a directory |
| dbt run -s state:modified+ | Run changed models and downstream (CI) |
| dbt run --exclude stg_legacy__* | Run everything except legacy staging |
Project configuration best practices
# dbt_project.yml
name: my_analytics
version: '1.0.0'
profile: my_analytics
vars:
my_analytics:
start_date: '2020-01-01'
models:
my_analytics:
staging:
+materialized: view
+schema: staging
intermediate:
+materialized: ephemeral
marts:
+materialized: table
+schema: analytics
finance:
+tags: ['finance', 'daily']
marketing:
+tags: ['marketing', 'daily']
seeds:
my_analytics:
+schema: seeds
snapshots:
my_analytics:
+target_schema: snapshots
Key rules:
- Staging as views (no warehouse cost, always fresh)
- Intermediate as ephemeral (inlined into downstream queries, no table created)
- Marts as tables (fast for analyst queries)
- Override per-folder when needed (e.g., a heavy intermediate that should be a table)
self-serve-analytics.md
Self-Serve Analytics Patterns
The self-serve spectrum
Self-serve analytics is not binary. Teams operate on a spectrum:
| Level | Who queries | What they need | Analytics eng. effort |
|---|---|---|---|
| 1. Dashboard consumers | Everyone | Pre-built dashboards, no SQL | Low - build once, maintain |
| 2. Guided exploration | Analysts | Curated datasets + BI tool | Medium - model + document |
| 3. SQL self-serve | Data-literate staff | Clean marts + docs | High - full modeling + semantic layer |
| 4. Full autonomy | Data engineers | Raw + modeled data | Minimal - provide access + catalog |
Most teams should target Level 2-3. Level 4 is for data teams querying their own warehouse. Level 1 alone creates a dashboard factory with infinite ticket queues.
Building analyst-friendly marts
Naming conventions
| Pattern | Example | Rule |
|---|---|---|
| Fact tables | fct_orders | Prefix with fct_, use plural noun |
| Dimension tables | dim_customers | Prefix with dim_, use plural noun |
| Date columns | order_date, created_at | Suffix with _date (date) or _at (timestamp) |
| Boolean columns | is_active, has_subscription | Prefix with is_ or has_ |
| Amount columns | total_amount_cents | Include unit in name (_cents, _usd, _seconds) |
| Count columns | lifetime_order_count | Suffix with _count |
| ID columns | customer_id | Suffix with _id, match the dimension table name |
Never use abbreviations. cust_id saves 4 characters and costs every analyst a
lookup. Use customer_id.
Column descriptions in YAML
Every column in every mart model must have a description:
models:
- name: fct_orders
description: >
One row per order. Grain: order_id. Includes payment totals joined from
the payments intermediate model. Updated incrementally on each dbt run.
columns:
- name: order_id
description: "Unique order identifier from Shopify. Primary key."
- name: customer_id
description: "FK to dim_customers. The customer who placed this order."
- name: order_date
description: "Date the order was placed (UTC, date only, no time)."
- name: status
description: >
Current order status. Values: pending, confirmed, shipped, delivered,
cancelled, refunded.
- name: total_amount_cents
description: >
Total order value in US cents (integer). Divide by 100 for dollars.
Includes tax, excludes shipping.
- name: is_first_order
description: >
True if this is the customer's first order by order_date. Useful for
new vs returning customer analysis.
Grain documentation pattern
Every mart model's YAML description must state the grain explicitly:
Grain: one row per <entity>
Examples:
- fct_orders: "Grain: one row per order"
- fct_order_items: "Grain: one row per order line item"
- dim_customers: "Grain: one row per customer (current state)"
- fct_daily_active_users: "Grain: one row per user per day"
If you cannot state the grain in one sentence, the model is likely mixing grains and should be split.
Documentation strategies
dbt docs
Generate and host dbt docs for the entire project:
dbt docs generate
dbt docs serve --port 8080
For production, deploy to a static hosting service (S3 + CloudFront, Netlify, etc.) or use dbt Cloud's hosted docs.
What to document beyond column descriptions
- Model-level description - What this model is, its grain, key assumptions
- Source freshness expectations - How often data arrives, acceptable lag
- Known limitations - "Does not include orders from the legacy system before 2021"
- Business rules - "An order is considered 'completed' when status = 'delivered' AND payment_status = 'captured'"
- Metric definitions - Link to the semantic layer metric, not a BI dashboard
Data dictionary template
For teams that need a document outside of dbt docs:
## fct_orders
| Column | Type | Description | Example |
|---|---|---|---|
| order_id | varchar | Unique order ID from Shopify | ord_abc123 |
| customer_id | varchar | FK to dim_customers | cust_xyz789 |
| order_date | date | Date order was placed (UTC) | 2024-03-15 |
| status | varchar | Order status enum | delivered |
| total_amount_cents | integer | Total in cents, incl. tax, excl. shipping | 4999 |
**Grain**: One row per order
**Refresh**: Incremental, every 6 hours
**Owner**: Data Engineering (@data-eng)
Data catalog integration
Popular catalog tools
| Tool | Type | Best for |
|---|---|---|
| dbt Docs | Built-in | Teams already using dbt |
| Atlan | SaaS catalog | Enterprise, governance-heavy |
| DataHub (LinkedIn) | Open source | Large orgs, custom metadata |
| Amundsen (Lyft) | Open source | Discovery-focused |
| Select Star | SaaS | Automated lineage |
| Monte Carlo | SaaS | Data observability + catalog |
Catalog integration pattern
- Run dbt docs generate to produce manifest.json and catalog.json
- Push these artifacts to your catalog tool's API
- The catalog ingests model descriptions, column descriptions, lineage, and tests
- Analysts search the catalog, find the right table, and start querying
# Example: push dbt artifacts to DataHub
datahub ingest -c dbt_recipe.yml
# dbt_recipe.yml (DataHub)
source:
type: dbt
config:
manifest_path: target/manifest.json
catalog_path: target/catalog.json
target_platform: snowflake
Access patterns for self-serve
Role-based access
-- Create analyst role with read-only access to marts
CREATE ROLE analyst_readonly;
GRANT USAGE ON SCHEMA analytics TO analyst_readonly;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics TO analyst_readonly;
ALTER DEFAULT PRIVILEGES IN SCHEMA analytics
GRANT SELECT ON TABLES TO analyst_readonly;
-- Staging and raw schemas are NOT granted to analysts
-- They should only query marts
Query guardrails
Prevent runaway queries from consuming warehouse resources:
-- Snowflake: warehouse-level query timeout
ALTER WAREHOUSE analyst_wh SET
STATEMENT_TIMEOUT_IN_SECONDS = 300;
-- BigQuery: use BI Engine for fast, bounded queries
-- Redshift: WLM queue with memory and concurrency limits
Onboarding analysts to self-serve
The self-serve onboarding checklist
- Share the dbt docs URL (or data catalog link)
- Walk through the 3-5 most important mart tables and their grains
- Show how to find metric definitions in the semantic layer
- Provide 5-10 example queries covering common analysis patterns
- Set up a Slack channel (#data-questions) for questions with a 24h SLA
- Review their first 3 queries/dashboards for correctness
- After 2 weeks, check: are they filing fewer data tickets? If not, the models or documentation need improvement, not the analyst.
Example starter queries for analysts
-- Revenue by month
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(total_amount_cents) / 100.0 AS revenue_dollars
FROM analytics.fct_orders
WHERE status = 'delivered'
GROUP BY 1
ORDER BY 1;
-- Customer segmentation breakdown
SELECT
customer_segment,
COUNT(*) AS customer_count,
AVG(lifetime_value_cents) / 100.0 AS avg_ltv_dollars
FROM analytics.dim_customers
GROUP BY 1
ORDER BY 3 DESC;
-- New vs returning customers by week
SELECT
DATE_TRUNC('week', o.order_date) AS week,
COUNT(CASE WHEN o.is_first_order THEN 1 END) AS new_customers,
COUNT(CASE WHEN NOT o.is_first_order THEN 1 END) AS returning_customers
FROM analytics.fct_orders o
WHERE o.status != 'cancelled'
GROUP BY 1
ORDER BY 1;
Measuring self-serve success
Track these metrics to know if self-serve is working:
| Metric | Target | How to measure |
|---|---|---|
| Data ticket volume | Decreasing month-over-month | Count tickets tagged "data request" |
| Time to first query | < 1 week for new analysts | Warehouse audit logs |
| Query error rate | < 10% of analyst queries fail | Warehouse query history |
| Dashboard trust score | > 4/5 in quarterly survey | Survey stakeholders |
| Metric definition coverage | > 80% of KPIs in semantic layer | Audit metrics YAML vs BI tools |
semantic-layer.md
Semantic Layer Deep Dive
What is a semantic layer?
A semantic layer is a declarative abstraction between your data warehouse tables and the consumers (BI tools, notebooks, APIs, LLM agents). It defines:
- Entities - the join keys and relationships between tables
- Dimensions - the columns you group by or filter on
- Measures - the aggregations (sum, count, average) applied to columns
- Metrics - business-level calculations composed from measures
The semantic layer compiles these definitions into optimized SQL at query time. Every consumer gets the same metric logic, regardless of which tool they use.
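As a sketch of what consuming the layer looks like from the command line, MetricFlow's open-source CLI can compile a metric request into warehouse SQL. The metric and dimension names here assume the examples below and a configured dbt project:

```shell
# Query a metric grouped by month; MetricFlow compiles this to warehouse SQL
mf query --metrics revenue --group-by metric_time__month

# Show the generated SQL instead of executing it
mf query --metrics revenue --group-by metric_time__month --explain
```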
MetricFlow (dbt Semantic Layer)
MetricFlow is the engine behind the dbt Semantic Layer (dbt Cloud). It uses
semantic_models and metrics defined in YAML.
Semantic model anatomy
semantic_models:
- name: orders
description: "Order fact table"
model: ref('fct_orders')
defaults:
agg_time_dimension: order_date
entities:
- name: order_id
type: primary
- name: customer_id
type: foreign
- name: product_id
type: foreign
dimensions:
- name: order_date
type: time
type_params:
time_granularity: day
- name: status
type: categorical
- name: channel
type: categorical
measures:
- name: order_count
agg: count
expr: order_id
description: "Count of orders"
- name: total_revenue
agg: sum
expr: total_amount_cents
description: "Sum of order amounts in cents"
- name: avg_order_value
agg: average
expr: total_amount_cents
Entity types
| Type | Meaning | Example |
|---|---|---|
| primary | Unique identifier for this semantic model | order_id in orders |
| foreign | References a primary entity in another model | customer_id in orders |
| unique | Unique but not the primary grain | email in customers |
| natural | Non-unique identifier (used for joins) | session_id shared across events |
Dimension types
| Type | Use for | Granularity options |
|---|---|---|
| categorical | String/enum columns you group by | N/A |
| time | Timestamp/date columns for time-series | day, week, month, quarter, year |
Measure aggregation types
| Agg | SQL equivalent | Notes |
|---|---|---|
| `sum` | `SUM(expr)` | Most common for revenue, quantities |
| `count` | `COUNT(expr)` | Count of non-null values |
| `count_distinct` | `COUNT(DISTINCT expr)` | Unique customers, sessions |
| `average` | `AVG(expr)` | Use with caution - averages of averages are wrong |
| `min` / `max` | `MIN(expr)` / `MAX(expr)` | First/last dates, extremes |
| `median` | `PERCENTILE_CONT(0.5)` | Warehouse support varies |
| `sum_boolean` | `SUM(CASE WHEN expr THEN 1 ELSE 0 END)` | Count of true values |
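The semantics of the counting aggregations differ in ways that matter for metrics. A quick sanity check on a toy column, using Python stand-ins for the SQL expressions above:

```python
# Toy column with a repeated value and a NULL, mimicking an order
# status column in the warehouse.
values = ["completed", "completed", "refunded", None]

count = sum(1 for v in values if v is not None)             # COUNT(expr): skips NULLs -> 3
count_distinct = len({v for v in values if v is not None})  # COUNT(DISTINCT expr) -> 2
sum_boolean = sum(1 for v in values if v == "completed")    # SUM(CASE WHEN ...) -> 2
print(count, count_distinct, sum_boolean)
```

If you need "number of completed orders", `sum_boolean` over a boolean expression is the right tool; a plain `count` over the status column counts every non-null row regardless of status.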
Metric types
Simple metric
Direct reference to a single measure:
```yaml
metrics:
  - name: total_orders
    type: simple
    label: "Total Orders"
    description: "Count of all orders"
    type_params:
      measure: order_count
```
Derived metric
Calculation across multiple measures:
```yaml
metrics:
  - name: average_revenue_per_customer
    type: derived
    label: "Revenue per Customer"
    description: "Total revenue divided by unique customer count"
    type_params:
      expr: total_revenue / unique_customers
      metrics:
        - name: total_revenue
        - name: unique_customers
```
Ratio metric
Special case of derived for common ratio patterns:
```yaml
metrics:
  - name: order_conversion_rate
    type: ratio
    label: "Order Conversion Rate"
    type_params:
      numerator: completed_orders
      denominator: total_orders
```
Cumulative metric
Running totals over time:
```yaml
metrics:
  - name: cumulative_revenue
    type: cumulative
    label: "Cumulative Revenue"
    type_params:
      measure: total_revenue
      window: 1 month          # rolling 1-month window; alternatively use
      # grain_to_date: month   # period-to-date (mutually exclusive with window)
```
Note that `window` (rolling total) and `grain_to_date` (period-to-date) are alternatives: specify one or the other, not both.
Cube semantic layer
Cube is an alternative semantic layer that works with any BI tool via REST/GraphQL API.
Cube schema file
```javascript
// schema/Orders.js
cube('Orders', {
  sql_table: 'analytics.fct_orders',

  joins: {
    Customers: {
      relationship: 'many_to_one',
      sql: `${CUBE}.customer_id = ${Customers}.customer_id`
    }
  },

  dimensions: {
    orderId: {
      sql: 'order_id',
      type: 'string',
      primary_key: true
    },
    status: {
      sql: 'status',
      type: 'string'
    },
    orderDate: {
      sql: 'order_date',
      type: 'time'
    }
  },

  measures: {
    count: {
      type: 'count'
    },
    totalRevenue: {
      sql: 'total_amount_cents',
      type: 'sum'
    },
    averageOrderValue: {
      sql: 'total_amount_cents',
      type: 'avg'
    }
  }
});
```
Cube vs MetricFlow comparison
| Feature | MetricFlow (dbt) | Cube |
|---|---|---|
| Definition format | YAML | JavaScript/YAML |
| Query API | dbt Cloud Semantic Layer API | REST + GraphQL |
| Caching | Warehouse-level | Built-in pre-aggregations |
| BI integrations | Tableau, Hex, Mode (via dbt Cloud) | Any tool via API |
| Join handling | Entity-based auto-joins | Explicit join definitions |
| Best for | dbt-centric stacks | Multi-tool, API-first stacks |
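Cube's query API takes a JSON query object naming measures, dimensions, and time dimensions from the cube schema. The sketch below builds such a payload against the `Orders` cube defined earlier; the endpoint path and auth shown in the comments are deployment-specific, and exact request shape may vary by Cube version, so treat this as an assumption-laden outline rather than a definitive client.

```python
import json

# A Cube REST query referencing the measures/dimensions declared in
# the Orders cube above. Host and JWT are deployment-specific.
query = {
    "measures": ["Orders.totalRevenue", "Orders.count"],
    "dimensions": ["Orders.status"],
    "timeDimensions": [
        {"dimension": "Orders.orderDate", "granularity": "month"}
    ],
    "limit": 100,
}

# A real request would look roughly like:
#   POST https://<cube-host>/cubejs-api/v1/load
#   Authorization: <JWT>
#   Body: {"query": <query>}
payload = json.dumps({"query": query})
print(payload)
```

Because the query names cube members rather than tables and columns, any consumer (BI tool, notebook, agent) gets the same `totalRevenue` logic.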
BI tool integration patterns
Exposures (dbt)
Document which BI assets depend on which models:
```yaml
# models/exposures.yml
exposures:
  - name: weekly_revenue_dashboard
    type: dashboard
    maturity: high
    url: https://bi.company.com/dashboards/42
    description: "Executive revenue dashboard, refreshed daily"
    depends_on:
      - ref('fct_orders')
      - ref('dim_customers')
    owner:
      name: Data Team
      email: data@company.com
```
Exposures show up in the dbt docs DAG, making it visible which dashboards break when a model changes.
Metric query patterns
When querying metrics through the semantic layer API:
```python
# dbt Cloud Semantic Layer (Python SDK)
from dbt_sl_sdk import SemanticLayerClient

client = SemanticLayerClient(
    environment_id=12345,
    auth_token="dbt_cloud_token"
)

result = client.query(
    metrics=["revenue", "order_count"],
    group_by=["metric_time__month", "customer__segment"],
    where=["{{ Dimension('order__status') }} = 'completed'"],
    order_by=["-metric_time__month"],
    limit=100
)
```
Common semantic layer pitfalls
| Pitfall | Impact | Fix |
|---|---|---|
| Averaging an average | Mathematically incorrect results | Use sum / count as a derived metric instead |
| Missing time dimension | Cannot do time-series analysis on the metric | Every semantic model needs at least one time dimension |
| Fanout joins | Measures inflate when joining one-to-many | Define entities correctly; use count_distinct not count |
| No metric descriptions | Analysts cannot discover or trust metrics | Every metric must have a description and label |
| Too many metrics | Decision paralysis, conflicting definitions | Curate 15-25 core metrics; archive the rest |
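The first pitfall is easy to demonstrate with a toy dataset: average order value computed correctly (total revenue over total orders) versus averaging each customer's own average. The numbers here are synthetic, purely to show the divergence:

```python
# Two customers with very different order counts.
orders = [
    {"customer": "a", "amount": 100},
    {"customer": "a", "amount": 100},
    {"customer": "a", "amount": 100},
    {"customer": "b", "amount": 400},
]

# Correct AOV: total revenue / total orders = 700 / 4
correct = sum(o["amount"] for o in orders) / len(orders)  # 175.0

# Wrong: average each customer's AOV, then average those
per_customer = {}
for o in orders:
    per_customer.setdefault(o["customer"], []).append(o["amount"])
avg_of_avgs = sum(sum(v) / len(v) for v in per_customer.values()) / len(per_customer)  # 250.0

print(correct, avg_of_avgs)
```

The average-of-averages weights each customer equally regardless of order volume, which is why the fix is to define the metric as a derived `sum / count` rather than averaging a pre-averaged column.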
Frequently Asked Questions
What is analytics-engineering?
Use this skill when building dbt models, designing semantic layers, defining metrics, creating self-serve analytics, or structuring a data warehouse for analyst consumption. Triggers on dbt project setup, model layering (staging, intermediate, marts), ref() and source() usage, YAML schema definitions, metrics definitions, semantic layer configuration, dimensional modeling, slowly changing dimensions, data testing, and any task requiring analytics engineering best practices.
How do I install analytics-engineering?
Run `npx skills add AbsolutelySkilled/AbsolutelySkilled --skill analytics-engineering` in your terminal. The skill will be immediately available in your AI coding agent.
What AI agents support analytics-engineering?
analytics-engineering works with claude-code, gemini-cli, openai-codex. Install it once and use it across any supported AI coding agent.