ml-ops
Use this skill when deploying ML models to production, setting up model monitoring, implementing A/B testing for models, or managing feature stores. Triggers on model deployment, model serving, ML pipelines, feature engineering, model versioning, data drift detection, model registry, experiment tracking, and any task requiring machine learning operations infrastructure.
ml-ops is a production-ready AI agent skill for claude-code, gemini-cli, and openai-codex. It covers deploying ML models to production, setting up model monitoring, implementing A/B testing for models, and managing feature stores.
Quick Facts
| Field | Value |
|---|---|
| Category | ai-ml |
| Version | 0.1.0 |
| Platforms | claude-code, gemini-cli, openai-codex |
| License | MIT |
How to Install
- Make sure you have Node.js installed on your machine.
- Run the following command in your terminal:
npx skills add AbsolutelySkilled/AbsolutelySkilled --skill ml-ops
- The ml-ops skill is now available in your AI coding agent (Claude Code, Gemini CLI, OpenAI Codex, etc.).
Overview
A production engineering framework for the full machine learning lifecycle. MLOps bridges the gap between model experimentation and reliable production systems by applying software engineering discipline to ML workloads. This skill covers model deployment strategies, experiment tracking, feature stores, drift monitoring, A/B testing, and versioning - the infrastructure that makes models trustworthy over time. Think of it as DevOps for models: automate everything, measure what matters, and treat reproducibility as a first-class constraint.
Tags
mlops model-deployment monitoring feature-store ml-pipelines
Platforms
- claude-code
- gemini-cli
- openai-codex
Frequently Asked Questions
What is ml-ops?
Use this skill when deploying ML models to production, setting up model monitoring, implementing A/B testing for models, or managing feature stores. Triggers on model deployment, model serving, ML pipelines, feature engineering, model versioning, data drift detection, model registry, experiment tracking, and any task requiring machine learning operations infrastructure.
How do I install ml-ops?
Run npx skills add AbsolutelySkilled/AbsolutelySkilled --skill ml-ops in your terminal. The skill will be immediately available in your AI coding agent.
What AI agents support ml-ops?
This skill works with claude-code, gemini-cli, openai-codex. Install it once and use it across any supported AI coding agent.
Maintainers
Generated from AbsolutelySkilled
SKILL.md
ML Ops
A production engineering framework for the full machine learning lifecycle. MLOps bridges the gap between model experimentation and reliable production systems by applying software engineering discipline to ML workloads. This skill covers model deployment strategies, experiment tracking, feature stores, drift monitoring, A/B testing, and versioning - the infrastructure that makes models trustworthy over time. Think of it as DevOps for models: automate everything, measure what matters, and treat reproducibility as a first-class constraint.
When to use this skill
Trigger this skill when the user:
- Deploys a trained model to a production serving endpoint
- Sets up experiment tracking for training runs (parameters, metrics, artifacts)
- Implements canary or shadow deployments for a new model version
- Designs or integrates a feature store for online/offline feature serving
- Sets up monitoring for data drift, prediction drift, or model degradation
- Runs A/B or champion/challenger tests across model versions in production
- Versions models, datasets, or pipelines with DVC or a model registry
- Builds or migrates to an automated training/retraining pipeline
Do NOT trigger this skill for:
- Core model research, architecture design, or hyperparameter search (use an ML research skill instead - MLOps starts after a candidate model exists)
- General software observability (logs, metrics, traces for non-ML services - use the backend-engineering skill)
Key principles
Reproducibility is non-negotiable - Every training run must be reproducible from scratch: fixed seeds, pinned dependency versions, tracked data splits, and logged hyperparameters. If you cannot reproduce a model, you cannot debug it, audit it, or roll back to it safely.
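As a minimal sketch of this discipline (stdlib only; a real pipeline would also seed NumPy and the training framework, and pin dependency versions in a lockfile), a single seeding entry point might look like:

```python
import os
import random

def set_global_seed(seed: int = 42) -> None:
    """Fix the seed sources we control so a training run can be replayed exactly."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    # In a real pipeline, also seed numpy (np.random.seed(seed)) and the DL
    # framework, e.g. torch.manual_seed(seed) plus
    # torch.use_deterministic_algorithms(True) for full determinism.

set_global_seed(7)
first = [random.random() for _ in range(3)]
set_global_seed(7)
second = [random.random() for _ in range(3)]
assert first == second  # same seed -> identical "training" randomness
```

Call this once at the top of every training entry point, and log the seed alongside the hyperparameters so the run can be reproduced later.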
Automate the training pipeline - Manual training is a one-way door to undocumented models. Build an automated pipeline (data ingestion -> preprocessing -> training -> evaluation -> registration) from day one. Humans should only approve a model for promotion, not run the steps.
Monitor data, not just models - Model metrics degrade because the input data changes. Track feature distributions in production against training baselines. Data drift is usually the root cause; model drift is the symptom.
Version everything - Models, datasets, feature definitions, pipeline code, and environment configs all deserve version control. An unversioned artifact is a liability. Use DVC for data/models, a model registry for lifecycle state, and git for code.
Treat ML code like production code - Tests, code review, CI/CD, and on-call rotation apply to training pipelines and serving code. The "it works in the notebook" standard is not a production standard.
Core concepts
ML lifecycle describes the end-to-end journey of a model:
Experiment -> Train -> Validate -> Deploy -> Monitor -> (retrain if drift)
Each stage has gates: an experiment produces a candidate; training on full data with tracked params produces an artifact; validation gates on held-out metrics; deployment chooses a serving strategy; monitoring decides when retraining is needed.
Model registry is the source of truth for model lifecycle state. A model moves through stages: Staging -> Production -> Archived. The registry stores metadata, metrics, lineage, and the artifact URI. MLflow Model Registry, Vertex AI Model Registry, and SageMaker Model Registry are the main options.
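The Staging -> Production -> Archived transitions can be illustrated with a toy in-memory registry. This is purely illustrative: real registries such as MLflow expose equivalent operations through their client APIs, and the classes below are not a real SDK.

```python
from dataclasses import dataclass, field

@dataclass
class ModelVersion:
    version: int
    stage: str = "Staging"  # Staging -> Production -> Archived

@dataclass
class Registry:
    """Toy in-memory registry illustrating promotion semantics only."""
    versions: list = field(default_factory=list)

    def register(self) -> ModelVersion:
        # New versions always enter the lifecycle in Staging.
        mv = ModelVersion(version=len(self.versions) + 1)
        self.versions.append(mv)
        return mv

    def promote(self, version: int) -> None:
        # Promoting a version archives whatever currently serves Production,
        # which is what makes rollback (restore from Archived) possible.
        for mv in self.versions:
            if mv.stage == "Production":
                mv.stage = "Archived"
        self.versions[version - 1].stage = "Production"

reg = Registry()
reg.register()   # v1 in Staging
reg.promote(1)   # v1 -> Production
reg.register()   # v2 in Staging
reg.promote(2)   # v2 -> Production, v1 -> Archived
assert reg.versions[0].stage == "Archived"
assert reg.versions[1].stage == "Production"
```

The key design point is that promotion is a single atomic operation: the registry, not the deployment script, owns the invariant that exactly one version is in Production.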
Feature stores decouple feature computation from model training and serving. They have two serving paths: an offline store (columnar, batch-oriented, used for training and batch inference) and an online store (low-latency key-value lookup, used at prediction time). The critical guarantee is point-in-time correctness - training features must only use data available before the label timestamp to prevent target leakage.
Data drift occurs when the statistical distribution of input features in production diverges from the training distribution. Concept drift occurs when the relationship between features and labels changes even if feature distributions are stable (e.g., user behavior shifts after a product change).
Shadow deployment runs the new model in parallel with the live model, receiving the same traffic, but its predictions are not served to users. Used to compare behavior before any real traffic exposure.
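A shadow deployment can be sketched as a request handler that serves the live model while logging the candidate's output for offline comparison. The `live_model` and `shadow_model` functions below are hypothetical stand-ins, not a real serving API:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("shadow")

def live_model(x: float) -> float:    # stand-in for the current production model
    return 2.0 * x

def shadow_model(x: float) -> float:  # stand-in for the candidate under evaluation
    return 2.1 * x

def predict(x: float) -> float:
    """Serve the live model; run the shadow on the same input and only log its output."""
    served = live_model(x)
    try:
        shadowed = shadow_model(x)
        # The delta is logged for offline comparison, never returned to users.
        log.info("shadow_delta=%.4f", shadowed - served)
    except Exception:
        # A failing shadow must never affect user-facing predictions.
        log.exception("shadow model failed")
    return served

assert predict(10.0) == 20.0  # users only ever see the live model's output
```

In production you would run the shadow call asynchronously (or sample a fraction of traffic) so it adds no latency to the serving path.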
Common tasks
Design an ML pipeline
Structure pipelines as discrete, testable stages with explicit inputs/outputs:
Data ingestion -> Validation -> Preprocessing -> Training -> Evaluation -> Registration
      |              |              |             |            |
  raw data      schema check   feature eng      model       go/no-go
  versioned       + stats       artifact       artifact       gate
Orchestration choices:
| Need | Tool |
|---|---|
| Python-native, simple DAGs | Prefect, Apache Airflow |
| Kubernetes-native, reproducible | Kubeflow Pipelines, Argo Workflows |
| Managed, minimal infra | Vertex AI Pipelines, SageMaker Pipelines |
| Git-driven, code-first | ZenML, Metaflow |
Gate evaluation: define a go/no-go threshold before training starts. A model that does not beat baseline (or the current production model) should never reach the registry.
Set up experiment tracking
Track every training run with: parameters (hyperparams, data version), metrics (loss curves, eval metrics), artifacts (model weights, plots), and environment (library versions, hardware).
MLflow pattern:
import mlflow

mlflow.set_experiment("fraud-detection-v2")

with mlflow.start_run(run_name="xgboost-baseline"):
    mlflow.log_params({
        "max_depth": 6,
        "learning_rate": 0.1,
        "n_estimators": 200,
        "data_version": "2024-03-01"
    })
    model = train(X_train, y_train)
    mlflow.log_metrics({
        "auc_roc": evaluate_auc(model, X_val, y_val),
        "precision_at_k": precision_at_k(model, X_val, y_val, k=100)
    })
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="fraud-detector"
    )
Key discipline: log the data version (or dataset hash) as a parameter. Without it, you cannot reproduce the run.
Compare runs on the same held-out test set. Never tune on the test set. Use validation for selection, test set for final reporting only.
Deploy a model with canary rollout
Choose a serving infrastructure before choosing a rollout strategy:
| Serving option | Best for | Trade-off |
|---|---|---|
| REST microservice (FastAPI + Docker) | Low latency, flexible | You own the infra |
| Managed endpoint (Vertex AI, SageMaker) | Reduced ops burden | Cost, vendor lock-in |
| Batch prediction job | High throughput, no latency SLA | Not real-time |
| Feature-flag-driven (server-side) | A/B testing with business metrics | Needs experimentation platform |
Canary rollout stages:
v1: 100% traffic
-> v2 shadow: 0% served, 100% shadowed (compare outputs)
-> v2 canary: 5% traffic -> monitor error rate + latency
-> v2 staged: 25% -> 50% -> 100% with automated rollback triggers
Define rollback triggers before deploying: error rate > X%, prediction latency p99 > Y ms, or business metric (e.g., conversion rate) drops > Z%.
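The pre-registered X/Y/Z triggers can be encoded as a single guard evaluated on each canary monitoring tick. The thresholds below are illustrative placeholders, not recommendations:

```python
def should_rollback(error_rate: float,
                    latency_p99_ms: float,
                    conversion_drop_pct: float,
                    max_error_rate: float = 0.02,        # X: placeholder threshold
                    max_latency_p99_ms: float = 250.0,   # Y: placeholder threshold
                    max_conversion_drop_pct: float = 1.0 # Z: placeholder threshold
                    ) -> bool:
    """Return True if any pre-registered canary guardrail is breached.
    Thresholds must be agreed on BEFORE the rollout starts."""
    return (error_rate > max_error_rate
            or latency_p99_ms > max_latency_p99_ms
            or conversion_drop_pct > max_conversion_drop_pct)

assert should_rollback(0.05, 120.0, 0.0) is True    # error-rate breach
assert should_rollback(0.01, 120.0, 0.2) is False   # all guardrails healthy
```

Wiring this into the deployment pipeline (rather than a human watching dashboards) is what makes the "automated rollback" stage of the canary real.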
Implement model monitoring
Monitor three layers - input data, predictions, and business outcomes:
| Layer | Signal | Method |
|---|---|---|
| Input data | Feature distribution drift | PSI, KS test, chi-squared |
| Predictions | Output distribution drift | PSI on prediction histogram |
| Business outcome | Actual vs expected labels | Delayed feedback loop |
Population Stability Index (PSI) thresholds:
PSI < 0.1 -> No significant change, model stable
PSI 0.1-0.2 -> Moderate drift, investigate
PSI > 0.2 -> Significant drift, retrain or escalate
Monitoring setup pattern:
# On each prediction batch, compute and log feature stats
baseline_stats = load_training_stats()  # saved during training
production_stats = compute_stats(current_batch_features)

for feature in monitored_features:
    psi = compute_psi(baseline_stats[feature], production_stats[feature])
    metrics.gauge(f"drift.psi.{feature}", psi)
    if psi > 0.2:
        alert(f"Significant drift on feature: {feature}")
Set up scheduled monitoring jobs (hourly/daily depending on traffic volume) rather than per-prediction to avoid overhead. Load references/tool-landscape.md for monitoring platform options.
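The compute_psi helper referenced above is assumed rather than shown; one minimal stdlib implementation over pre-binned histograms is:

```python
import math

def compute_psi(expected_counts, actual_counts, eps: float = 1e-6) -> float:
    """Population Stability Index over pre-binned histograms.

    expected_counts: per-bin counts from the training baseline
    actual_counts:   per-bin counts from the production window
    PSI = sum over bins of (actual% - expected%) * ln(actual% / expected%)
    """
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    psi = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, eps)  # clip empty bins to avoid log(0)
        a_pct = max(a / a_total, eps)
        psi += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return psi

# Identical proportions -> PSI ~ 0 (stable)
assert compute_psi([100, 200, 300], [10, 20, 30]) < 1e-9
# A reversed distribution -> PSI well above the 0.2 "significant drift" threshold
assert compute_psi([100, 200, 300], [300, 200, 100]) > 0.2
```

Binning strategy matters: use the same bin edges (typically training-set deciles) for both baseline and production, or the PSI values will not be comparable over time.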
Build a feature store
Separate feature computation from model code to enable reuse and prevent leakage.
Architecture:
Raw data sources
|
Feature computation (Spark, dbt, Flink)
|
+-----------> Offline store (Parquet/BigQuery) -> Training jobs
|
+-----------> Online store (Redis, DynamoDB) -> Real-time serving
Point-in-time correctness - the most critical correctness property:
# WRONG: uses future data at training time (target leakage)
features = feature_store.get_features(entity_id=user_id)

# CORRECT: fetch features as they existed at the event timestamp
features = feature_store.get_historical_features(
    entity_df=events_df,  # includes entity_id + event_timestamp
    feature_refs=["user:age", "user:30d_spend", "user:country"]
)
Feature naming convention: <entity>:<feature_name> (e.g., user:30d_spend, product:avg_rating_7d). Version feature definitions in a registry (Feast, Tecton, Vertex Feature Store). Never hardcode feature transformations in training scripts.
A/B test models in production
A/B testing models requires statistical rigor. A "better offline metric" does not guarantee better business outcomes.
Setup:
- Define the primary metric (business metric, not model metric) and a guardrail metric before the test
- Calculate required sample size for desired power (typically 80%) and significance level (typically 5%)
- Randomly assign users/sessions to treatment/control - sticky assignment (same user always gets the same model) prevents contamination
- Run for full business cycles (minimum 1-2 weeks for weekly seasonality)
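The sample-size step above can be sketched with the standard two-proportion z-test approximation. The z values are hard-coded for the typical choices mentioned (5% two-sided significance, 80% power):

```python
import math

def samples_per_arm(p_control: float, p_treatment: float,
                    z_alpha: float = 1.96,  # two-sided 5% significance
                    z_beta: float = 0.84    # 80% power
                    ) -> int:
    """Approximate per-arm sample size for detecting p_control -> p_treatment
    with a two-proportion z-test."""
    p_bar = (p_control + p_treatment) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p_control * (1 - p_control)
                                      + p_treatment * (1 - p_treatment))) ** 2
    return math.ceil(numerator / (p_treatment - p_control) ** 2)

# Detecting a 5.0% -> 5.5% conversion lift needs tens of thousands of users per arm
n = samples_per_arm(0.050, 0.055)
assert n > 10_000
```

This is why small relative lifts on low base rates require long experiments; divide the per-arm size by daily eligible traffic to get the minimum runtime, then round up to whole business cycles.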
Traffic splitting options:
Option A: Load balancer routing (simple %, stateless)
Option B: User-ID hashing (sticky, consistent assignment)
Option C: Experimentation platform (Statsig, Optimizely, LaunchDarkly)
Stopping criteria: Do not peek at p-values daily. Pre-register the minimum runtime and only stop early for clearly harmful outcomes (guardrail breach). Use sequential testing methods (e.g., mSPRT) if early stopping is required by business needs.
A model that improves AUC by 2% but reduces revenue is not a better model. Always tie model tests to business metrics.
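Sticky assignment via user-ID hashing (Option B among the traffic-splitting choices above) can be sketched as:

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_pct: int = 50) -> str:
    """Deterministic, sticky assignment: the same user always gets the same model.
    Salting the hash with the experiment name decorrelates assignments across
    concurrent experiments."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # map the hash to a bucket in [0, 100)
    return "treatment" if bucket < treatment_pct else "control"

# Sticky: repeated calls never flip a user's arm
assert assign_variant("u-123", "fraud-model-v2") == assign_variant("u-123", "fraud-model-v2")

# Roughly balanced split across many users at 50%
arms = [assign_variant(f"u-{i}", "fraud-model-v2") for i in range(1000)]
assert 400 < arms.count("treatment") < 600
```

Because assignment is a pure function of (experiment, user_id), any service can compute it independently with no shared state, and ramping traffic is just raising treatment_pct.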
Version models and datasets
Dataset versioning with DVC:
# Track a dataset in DVC
dvc add data/training/users_2024q1.parquet
git add data/training/users_2024q1.parquet.dvc .gitignore
git commit -m "Track Q1 2024 training dataset"

# Push dataset to remote storage
dvc push

# Reproduce dataset at a specific git commit
git checkout <commit-hash>
dvc pull
Model registry lifecycle:
Training pipeline produces artifact
-> Registers as version N in "Staging"
-> QA + validation passes
-> Promoted to "Production" (previous Production -> "Archived")
-> On rollback: restore previous version from "Archived"
Lineage tracking: A model version should link to: the training dataset version, the pipeline code commit, the feature definitions version, and the evaluation report. Without lineage, auditing and debugging become guesswork.
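One way to make lineage concrete is a small immutable record attached to each registry version as metadata or tags. The field names and example values below are illustrative, not a standard schema:

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ModelLineage:
    """Everything needed to reconstruct and audit one model version."""
    model_version: int
    dataset_version: str       # e.g. a DVC tag or dataset content hash
    code_commit: str           # git SHA of the training pipeline
    feature_defs_version: str  # feature registry version
    eval_report_uri: str       # where the evaluation artifact lives

# Hypothetical example values for illustration only
lineage = ModelLineage(
    model_version=7,
    dataset_version="2024-03-01",
    code_commit="a1b2c3d",
    feature_defs_version="v12",
    eval_report_uri="s3://bucket/reports/v7.html",
)
record = asdict(lineage)  # attach this dict as registry metadata/tags
assert record["dataset_version"] == "2024-03-01"
```

Making the record frozen and writing it at registration time (not later, by hand) is what keeps lineage trustworthy.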
Anti-patterns / common mistakes
| Mistake | Why it's wrong | What to do instead |
|---|---|---|
| Training and serving skew | Features computed differently at train vs serve time - silent accuracy loss | Share feature computation code; use a feature store for consistency |
| No baseline comparison | Deploying a new model without comparing to the current production model or a simple baseline | Always register the current production model as the benchmark; gate on relative improvement |
| Testing on test data during development | Inflated metrics, model does not generalize; test set is contaminated | Use train/validation/test splits; touch test set only for final reporting |
| Monitoring only model metrics, not inputs | Drift in input data causes silent degradation - you notice it in business metrics weeks later | Monitor feature distributions against training baseline as a first-class signal |
| Manual deployment steps | Undocumented, unrepeatable process; impossible to roll back reliably | Automate the full promote-to-production flow in CI/CD; humans approve, machines execute |
| A/B testing without sufficient sample size | Statistically underpowered tests produce false positives; teams ship regressions confidently | Calculate sample size upfront using power analysis; commit to minimum runtime before launch |
Gotchas
Training-serving skew is silent and deadly - If the feature engineering code that runs during training differs even slightly from what runs at inference time (different library versions, different null handling, different normalization order), the model receives inputs it was never trained on. The model silently produces worse predictions. Share the exact same feature transformation code between training and serving; a feature store enforces this by design.
PSI drift alerts fire on expected seasonal changes, not just real drift - A retail model will always show PSI > 0.2 on Black Friday vs. a July training baseline. Alerting on raw PSI without seasonality context produces alert fatigue and trains teams to ignore drift signals. Baseline your monitoring against the same calendar period from the prior year, or use rolling baselines updated monthly.
DVC pull on a different machine requires remote storage credentials - dvc pull fetches data from the configured remote (S3, GCS, Azure). A teammate who clones the repo and runs dvc pull without configuring remote credentials gets a cryptic access-denied error that looks like a DVC bug. Document remote storage setup in the repo's README and use environment-based credential configuration.
MLflow autologging captures too much and inflates experiment storage - mlflow.autolog() is convenient for notebooks but logs every parameter, metric, and artifact from every library it supports. In training pipelines running thousands of experiments, this creates massive metadata storage and slow UI queries. Enable autologging selectively with mlflow.sklearn.autolog(log_models=False) or log manually with mlflow.log_params/metrics.
A/B tests on models need sticky user assignment, not session assignment - If a user is randomly assigned to the control or treatment model on each request, they experience inconsistent behavior within the same session. This contaminates the experiment (users implicitly see both models) and inflates variance. Hash on the user ID to ensure consistent model assignment for the duration of the experiment.
References
For detailed platform comparisons and tool selection guidance, read the relevant
file from the references/ folder:
- references/tool-landscape.md - MLflow vs W&B vs Vertex AI vs SageMaker, feature store comparison, model serving options
Load references/tool-landscape.md when the task involves selecting or comparing
MLOps platforms - it is detailed and will consume context, so only load it when
needed.
References
tool-landscape.md
MLOps Tool Landscape
Choosing MLOps tooling is a two-dimensional decision: how much infrastructure you want to own (self-hosted vs fully managed) and how tightly coupled you want to be to a cloud vendor. This reference compares the major platforms across the four core MLOps domains: experiment tracking, model registry, feature stores, and model serving.
1. Experiment Tracking and Model Registry
MLflow
What it is: Open-source, self-hosted experiment tracker and model registry. The most widely deployed option in on-premise and multi-cloud environments.
Strengths:
- No vendor lock-in; runs anywhere (local, Kubernetes, Databricks-managed)
- Native support for sklearn, PyTorch, TensorFlow, XGBoost, HuggingFace, and more
- Unified API: tracking + registry + model serving (MLflow Models) in one library
- Strong community; integrations with most ML frameworks
Weaknesses:
- UI is functional but not polished; limited collaboration features
- Managed hosting options (Databricks) require a Databricks subscription
- Scaling the tracking server and artifact store is your problem on self-hosted
Best for: Teams that need full data sovereignty, multi-cloud flexibility, or are already on Databricks.
# Minimal MLflow tracking example
import mlflow

with mlflow.start_run():
    mlflow.log_param("lr", 0.001)
    mlflow.log_metric("val_loss", 0.234)
    mlflow.pytorch.log_model(model, "model")
Weights & Biases (W&B)
What it is: Fully managed experiment tracking, artifact versioning, and collaboration platform. SaaS-first with a strong focus on team workflows.
Strengths:
- Best-in-class UI: interactive charts, side-by-side run comparisons, reports
- Artifacts API handles datasets, models, and arbitrary files with lineage tracking
- W&B Sweeps: built-in hyperparameter search with Bayesian/grid/random strategies
- W&B Launch: submit training jobs to cloud compute from within W&B
- Strong for research teams and teams that collaborate on experiments heavily
Weaknesses:
- SaaS only (W&B Server for self-hosted is a separate, expensive enterprise SKU)
- Data leaves your environment unless using the enterprise deployment
- Pricing scales with data ingestion and seats
Best for: Research-heavy teams, ML teams that do a lot of collaborative experiment analysis, or startups comfortable with SaaS.
import wandb

wandb.init(project="fraud-detection", config={"lr": 0.001})
wandb.log({"train_loss": loss, "val_auc": auc})
wandb.finish()
Vertex AI (Google Cloud)
What it is: Google Cloud's fully managed MLOps platform. Covers experiment tracking (Vertex Experiments), model registry, pipelines (Vertex Pipelines), and serving (Vertex AI Endpoints).
Strengths:
- Fully managed: no infra to operate; scales automatically
- Tight integration with BigQuery (feature store, training data), GCS, and Dataflow
- Vertex AI Pipelines is built on Kubeflow Pipelines - portable, container-native
- Vertex Feature Store handles online/offline serving natively
- Strong for teams already on GCP with large data in BigQuery
Weaknesses:
- GCP lock-in; difficult to migrate off
- Vertex Experiments tracking is less mature than MLflow or W&B
- Cost unpredictability; managed endpoints can be expensive at low usage
- Less flexible for custom training environments compared to self-hosted
Best for: Teams on GCP, especially those with data already in BigQuery and who want to minimize infra management.
Amazon SageMaker (AWS)
What it is: AWS's fully managed ML platform. Covers experiment tracking (SageMaker Experiments), model registry, pipelines (SageMaker Pipelines), and serving (SageMaker Endpoints).
Strengths:
- Deepest managed training infrastructure: managed spot instances, distributed training, automatic model tuning (hyperparameter optimization)
- SageMaker Feature Store: online + offline with point-in-time correct queries
- Tight integration with S3, Glue, Redshift, and AWS ecosystem
- SageMaker Model Monitor: built-in data quality and drift monitoring
- Most mature managed MLOps platform (longest track record)
Weaknesses:
- AWS lock-in; SageMaker SDK is verbose and AWS-specific
- Steep learning curve; abstraction layers can obscure what's actually happening
- Pipelines DSL is more constrained than Kubeflow/Argo
- Cost management is complex; easy to incur charges from idle endpoints
Best for: Teams deeply invested in AWS infrastructure who want managed training and serving without managing Kubernetes.
Head-to-Head Comparison
| Capability | MLflow | W&B | Vertex AI | SageMaker |
|---|---|---|---|---|
| Experiment tracking | Excellent | Excellent | Good | Good |
| Model registry | Good | Good | Good | Excellent |
| Pipeline orchestration | Basic (Projects) | Limited | Good (KFP) | Good |
| Feature store | None (use Feast) | None | Native | Native |
| Model serving | Basic (MLflow Models) | None | Native | Native |
| Drift monitoring | None (use Evidently) | None | Basic | Good (Model Monitor) |
| Collaboration UI | Basic | Best | Good | Basic |
| Vendor lock-in | None | SaaS | GCP | AWS |
| Self-hosted option | Yes | Enterprise | No | No |
| Cost model | OSS + infra | Per seat/usage | Per usage | Per usage |
2. Feature Stores
Feature stores are specialized. The right choice depends on scale, latency requirements, and cloud affinity.
Feast (Open Source)
What it is: The leading open-source feature store. Orchestrates feature computation, stores features in online/offline stores of your choice, and handles point-in-time correct retrieval.
Strengths:
- Cloud-agnostic; works with any online store (Redis, DynamoDB, Cassandra) and offline store (BigQuery, Snowflake, Redshift, Parquet)
- Strong community; actively maintained by Tecton alumni and community
- Point-in-time correct historical retrieval with get_historical_features
Weaknesses:
- No managed option; you operate everything
- Feature transformation is compute-agnostic (you bring Spark/dbt), Feast only manages storage and retrieval
- Monitoring and feature quality checks require external tooling
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Offline retrieval for training (point-in-time correct)
training_df = store.get_historical_features(
    entity_df=events_df,
    features=["user_stats:30d_spend", "user_stats:country"]
).to_df()

# Online retrieval for serving (low-latency)
feature_vector = store.get_online_features(
    features=["user_stats:30d_spend", "user_stats:country"],
    entity_rows=[{"user_id": "u-123"}]
).to_dict()
Tecton
What it is: Fully managed feature platform built by the team that created Uber's Michelangelo feature store. The most capable commercial feature store option.
Strengths:
- Manages the full feature lifecycle: definition, computation, storage, serving, monitoring, and lineage
- Supports batch, streaming (Spark Streaming, Flink), and real-time feature pipelines
- Built-in feature monitoring, data quality, and SLOs
- Point-in-time correct historical retrieval
Weaknesses:
- Most expensive option; pricing is not public
- Vendor lock-in (though feature definitions are Python)
- Overkill for small teams or simple feature sets
Best for: Enterprise ML teams with complex real-time feature requirements and budget for a managed platform.
Vertex AI Feature Store
Best for: Teams already on GCP who want zero infra management. Online serving uses Bigtable under the hood. Point-in-time queries against BigQuery offline store. Limitation: GCP lock-in, and the API is more constrained than Feast or Tecton.
SageMaker Feature Store
Best for: Teams on AWS. Tight integration with SageMaker training jobs. Online store backed by DynamoDB, offline store in S3 + Glue catalog. Limitation: AWS lock-in, and feature transformation must happen outside the feature store.
Feature Store Comparison
| Capability | Feast | Tecton | Vertex Feature Store | SageMaker Feature Store |
|---|---|---|---|---|
| Managed | No | Yes | Yes | Yes |
| Real-time features | Via Redis/Cassandra | Yes (streaming) | Limited | Limited |
| Point-in-time correct | Yes | Yes | Yes | Yes |
| Built-in monitoring | No | Yes | Basic | Basic |
| Cloud agnostic | Yes | Mostly | No (GCP) | No (AWS) |
| Cost | Infra only | Enterprise | Per usage | Per usage |
3. Model Serving
Frameworks
| Tool | Type | Best for |
|---|---|---|
| BentoML | Framework | Packaging any model as a containerized service; strong for custom logic |
| Seldon Core | Kubernetes-native | Complex serving graphs, A/B testing, explainability on K8s |
| KServe | Kubernetes-native | Standard model serving on K8s; successor to KFServing |
| Ray Serve | Python-native | High-throughput, composable serving; integrates with Ray training |
| TorchServe | PyTorch-specific | Serving PyTorch models with batching and versioning |
| TF Serving | TF-specific | Serving TensorFlow SavedModels at scale |
Managed Endpoints
| Platform | Managed serving option | Key feature |
|---|---|---|
| GCP | Vertex AI Endpoints | Auto-scaling, traffic split for A/B, built-in monitoring |
| AWS | SageMaker Endpoints | Real-time + batch transform, auto-scaling, Model Monitor |
| Azure | Azure ML Online Endpoints | Managed K8s-backed, traffic split |
| Self-hosted | Seldon + KServe | Full control, cloud-agnostic |
Serving Decision Flowchart
Do you need real-time (<100ms) predictions?
NO -> Batch prediction job (BigQuery ML, SageMaker Batch Transform)
YES -> What is your latency SLA?
>200ms -> REST endpoint (BentoML, Flask, FastAPI)
<50ms -> Consider feature pre-computation + cached lookup
gRPC? -> KServe / Seldon for gRPC protocol support
Are you on a cloud provider?
GCP -> Vertex AI Endpoint (easiest path)
AWS -> SageMaker Endpoint
Azure -> Azure ML Endpoint
Multi-cloud / on-prem -> BentoML + Kubernetes (KServe)
4. Drift Monitoring Tools
| Tool | What it monitors | Integration |
|---|---|---|
| Evidently AI | Data drift, prediction drift, data quality; generates HTML reports | OSS, works with any serving setup |
| WhyLabs | Statistical profiles, data quality, model performance | Managed SaaS with OSS SDK (whylogs) |
| Arize AI | Model performance, drift, explainability | Managed SaaS |
| SageMaker Model Monitor | Data quality, model quality, bias, feature attribution drift | AWS-native only |
| Vertex AI Model Monitoring | Feature skew and drift detection | GCP-native only |
Recommendation for most teams: Start with Evidently AI (OSS) for data and prediction drift. It generates shareable HTML reports and integrates with any serving infrastructure. Move to WhyLabs or Arize when you need a managed dashboard and alerting at scale.
# Evidently drift report example
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=training_df, current_data=production_batch_df)
report.save_html("drift_report.html")
5. Summary: Recommended Stack by Team Size
Small team (1-5 ML engineers), startup
- Experiment tracking: MLflow (self-hosted on a cheap VM or Databricks Community)
- Feature store: Skip it - use a shared Pandas utility library until you have 5+ features reused across models
- Model serving: BentoML or a simple FastAPI container
- Monitoring: Evidently AI scheduled batch reports
Mid-size team (5-20 ML engineers), growth company
- Experiment tracking: W&B (collaboration features pay off at this size)
- Feature store: Feast + Redis (online) + Snowflake/BigQuery (offline)
- Model serving: Kubernetes + KServe or Vertex AI / SageMaker endpoints
- Monitoring: WhyLabs or Evidently with automated alerting
Large team (20+ ML engineers), enterprise
- Experiment tracking: MLflow on Databricks or W&B Enterprise
- Feature store: Tecton or SageMaker Feature Store (if AWS)
- Model serving: Seldon Core or managed cloud endpoints with traffic splits
- Monitoring: Arize AI or WhyLabs with SLO-based alerting