computer-vision
Use this skill when building computer vision applications, implementing image classification, object detection, or segmentation pipelines. Triggers on image classification, object detection, YOLO, semantic segmentation, image preprocessing, data augmentation, transfer learning, CNN architectures, vision transformers, and any task requiring visual recognition or image analysis.
computer-vision
computer-vision is a production-ready AI agent skill for claude-code, gemini-cli, openai-codex. It covers building computer vision applications and implementing image classification, object detection, and segmentation pipelines.
Quick Facts
| Field | Value |
|---|---|
| Category | ai-ml |
| Version | 0.1.0 |
| Platforms | claude-code, gemini-cli, openai-codex |
| License | MIT |
How to Install
- Make sure you have Node.js installed on your machine.
- Run the following command in your terminal:
```shell
npx skills add AbsolutelySkilled/AbsolutelySkilled --skill computer-vision
```
- The computer-vision skill is now available in your AI coding agent (Claude Code, Gemini CLI, OpenAI Codex, etc.).
Overview
Computer vision enables machines to interpret and reason about visual data - images, video, and multi-modal inputs. Modern CV pipelines are built on deep neural networks pretrained on large datasets (ImageNet, COCO, ADE20K) and fine-tuned for specific domains. PyTorch and its ecosystem (torchvision, timm, ultralytics, albumentations) cover the full stack from data loading through deployment. Foundation models like SAM, DINOv2, and OpenCLIP have shifted best practice toward prompt-based and zero-shot approaches before committing to full training runs.
Tags
computer-vision deep-learning object-detection segmentation cnn
Platforms
- claude-code
- gemini-cli
- openai-codex
Frequently Asked Questions
What is computer-vision?
Use this skill when building computer vision applications, implementing image classification, object detection, or segmentation pipelines. Triggers on image classification, object detection, YOLO, semantic segmentation, image preprocessing, data augmentation, transfer learning, CNN architectures, vision transformers, and any task requiring visual recognition or image analysis.
How do I install computer-vision?
Run npx skills add AbsolutelySkilled/AbsolutelySkilled --skill computer-vision in your terminal. The skill will be immediately available in your AI coding agent.
What AI agents support computer-vision?
This skill works with claude-code, gemini-cli, openai-codex. Install it once and use it across any supported AI coding agent.
Maintainers
Generated from AbsolutelySkilled
SKILL.md
Computer Vision
Computer vision enables machines to interpret and reason about visual data - images, video, and multi-modal inputs. Modern CV pipelines are built on deep neural networks pretrained on large datasets (ImageNet, COCO, ADE20K) and fine-tuned for specific domains. PyTorch and its ecosystem (torchvision, timm, ultralytics, albumentations) cover the full stack from data loading through deployment. Foundation models like SAM, DINOv2, and OpenCLIP have shifted best practice toward prompt-based and zero-shot approaches before committing to full training runs.
When to use this skill
Trigger this skill when the user:
- Trains or fine-tunes an image classifier on a custom dataset
- Runs inference with YOLO, DETR, or other detection models
- Builds a semantic or instance segmentation pipeline
- Implements data augmentation for CV training
- Preprocesses images for model ingestion (resize, normalize, batch)
- Exports a vision model to ONNX or optimizes with TensorRT
- Evaluates a vision model (mAP, confusion matrix, per-class metrics)
- Implements a U-Net, DeepLabV3, or similar segmentation architecture
Do NOT trigger this skill for:
- Pure NLP tasks with no visual component (use a language-model skill instead)
- 3D point-cloud processing or LiDAR-only pipelines (overlap is limited; check domain)
Key principles
- Start with pretrained models - Fine-tune ImageNet/COCO weights before training from scratch. Even a frozen backbone with a new head beats random init on small datasets.
- Augment data aggressively - Real-world distribution shifts are unavoidable. Use albumentations with geometric, color, and noise transforms. Task-aware augmentations (mosaic, copy-paste) matter especially for detection.
- Validate on representative data - Always hold out data from the exact deployment distribution. Benchmark on in-distribution AND out-of-distribution splits separately.
- Optimize inference separately from training - Training precision (FP32/AMP) and inference precision (INT8/FP16) have different tradeoffs. Profile, export to ONNX, then apply TensorRT or OpenVINO post-training quantization.
- Monitor for distribution shift - Production images drift from training data (lighting changes, new object classes, compression artifacts). Log prediction confidence distributions and trigger retraining pipelines when they degrade.
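The last principle can be made concrete with a confidence-histogram monitor. The sketch below is a minimal illustration, not part of the skill's API; the PSI threshold of 0.2 is a common heuristic, and the beta-distributed sample data is purely synthetic.

```python
import numpy as np

def confidence_drift(baseline_conf: np.ndarray, recent_conf: np.ndarray,
                     bins: int = 10) -> float:
    """Population Stability Index between two prediction-confidence
    distributions. PSI > 0.2 is a common (heuristic) retraining trigger."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    p, _ = np.histogram(baseline_conf, bins=edges)
    q, _ = np.histogram(recent_conf, bins=edges)
    p = np.clip(p / p.sum(), 1e-6, None)  # avoid log(0) in empty bins
    q = np.clip(q / q.sum(), 1e-6, None)
    return float(np.sum((q - p) * np.log(q / p)))

rng = np.random.default_rng(0)
base = rng.beta(8, 2, 5000)     # confident predictions at deploy time
same = rng.beta(8, 2, 5000)     # same distribution -> PSI near 0
shifted = rng.beta(3, 3, 5000)  # drifted production confidences
print(confidence_drift(base, same) < 0.05)   # True
print(confidence_drift(base, shifted) > 0.2) # True
```

Log the per-window PSI alongside model version and input source so a drift alert points at the batch that caused it.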
Core concepts
Task taxonomy
| Task | Output | Typical metric |
|---|---|---|
| Classification | Single label per image | Top-1 / Top-5 accuracy |
| Detection | Bounding boxes + labels | mAP@0.5, mAP@0.5:0.95 |
| Semantic segmentation | Per-pixel class mask | mIoU |
| Instance segmentation | Per-object mask + label | mask AP |
| Generation / synthesis | New images | FID, LPIPS |
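The detection metrics in the table are all built on box IoU. As a quick reference, here is a minimal pairwise IoU in plain Python (a hypothetical helper, boxes in [x1, y1, x2, y2] format):

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes in [x1, y1, x2, y2] format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(box_iou([0, 0, 10, 10], [0, 0, 10, 10]))  # 1.0
print(box_iou([0, 0, 10, 10], [5, 0, 15, 10]))  # 0.3333...
```

mAP@0.5 counts a prediction as correct when this value exceeds 0.5; mAP@0.5:0.95 averages over thresholds from 0.5 to 0.95 in steps of 0.05.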
Backbone architectures
| Backbone | Strengths | Typical use |
|---|---|---|
| ResNet-50/101 | Stable, well-understood | Classification baseline, feature extractor |
| EfficientNet-B0..B7 | Accuracy/FLOP Pareto front | Mobile + server classification |
| ViT-B/16, ViT-L/16 | Strong with large data, attention maps | High-accuracy classification, zero-shot |
| ConvNeXt-T/B | CNN with transformer-like training recipe | Drop-in ResNet replacement |
| DINOv2 (ViT) | Strong self-supervised features | Few-shot, feature extraction |
Anchor-free vs anchor-based detection
- Anchor-based (YOLOv5, Faster R-CNN) - predefined box aspect ratios per grid cell. Fast training convergence, tuning required for unusual object scales.
- Anchor-free (YOLO11/v8, FCOS, DETR) - predict box center + offsets directly. Cleaner training, no anchor hyperparameter search, now the default for new projects.
Loss functions
| Loss | Used for |
|---|---|
| Cross-entropy | Classification (multi-class), segmentation pixel-wise |
| Focal loss | Detection classification head - down-weights easy negatives |
| IoU / GIoU / CIoU / DIoU | Bounding box regression |
| Dice loss | Segmentation - handles class imbalance better than cross-entropy |
| Binary cross-entropy | Multi-label classification, mask prediction |
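To illustrate the Dice entry above, here is a minimal soft Dice loss for binary masks. This is a sketch, not the skill's canonical implementation; the smoothing constant eps is a common but arbitrary choice.

```python
import torch

def dice_loss(logits: torch.Tensor, targets: torch.Tensor,
              eps: float = 1.0) -> torch.Tensor:
    """Soft Dice loss for binary segmentation.
    logits: [N, 1, H, W] raw scores; targets: [N, 1, H, W] in {0, 1}."""
    probs = torch.sigmoid(logits)
    inter = (probs * targets).sum(dim=(1, 2, 3))
    denom = probs.sum(dim=(1, 2, 3)) + targets.sum(dim=(1, 2, 3))
    dice = (2 * inter + eps) / (denom + eps)
    return 1.0 - dice.mean()

# Perfect prediction -> loss near 0; inverted prediction -> loss near 1
target = torch.zeros(1, 1, 8, 8)
target[..., :4, :] = 1.0
good = (target * 2 - 1) * 10  # large logits matching the mask
print(dice_loss(good, target).item() < 0.01)  # True
print(dice_loss(-good, target).item() > 0.9)  # True
```

Because Dice normalizes by mask area, it does not reward predicting all-background on imbalanced masks the way pixel-wise cross-entropy can; a common recipe is a weighted sum of Dice and BCE.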
Common tasks
Fine-tune an image classifier

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

# 1. Data transforms
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
val_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
train_ds = datasets.ImageFolder("data/train", transform=train_tf)
val_ds = datasets.ImageFolder("data/val", transform=val_tf)
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True, num_workers=4)
val_loader = DataLoader(val_ds, batch_size=64, shuffle=False, num_workers=4)

# 2. Load pretrained backbone, replace head
NUM_CLASSES = len(train_ds.classes)
model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.DEFAULT)
model.classifier[1] = nn.Linear(model.classifier[1].in_features, NUM_CLASSES)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# 3. Two-phase training: freeze backbone and train head, then unfreeze
for p in model.features.parameters():
    p.requires_grad = False
optimizer = torch.optim.AdamW(model.classifier.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=5)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

def train_one_epoch(loader):
    model.train()
    for imgs, labels in loader:
        imgs, labels = imgs.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(imgs), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()

# Phase 1 - head only (5 epochs)
for epoch in range(5):
    train_one_epoch(train_loader)

# Phase 2 - unfreeze everything with lower LR
for p in model.parameters():
    p.requires_grad = True
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)
for epoch in range(10):
    train_one_epoch(train_loader)

torch.save(model.state_dict(), "classifier.pth")
```

Run object detection with YOLO
```python
from ultralytics import YOLO

# --- Inference ---
model = YOLO("yolo11n.pt")  # nano; swap for yolo11s/m/l/x for accuracy
results = model.predict("image.jpg", conf=0.25, iou=0.45, device=0)
for r in results:
    for box in r.boxes:
        cls = int(box.cls[0])
        label = model.names[cls]
        conf = float(box.conf[0])
        xyxy = box.xyxy[0].tolist()  # [x1, y1, x2, y2]
        print(f"{label}: {conf:.2f} {xyxy}")

# --- Fine-tune on custom dataset ---
# Expects data.yaml with train/val paths and class names
model = YOLO("yolo11s.pt")
results = model.train(
    data="data.yaml",
    epochs=100,
    imgsz=640,
    batch=16,
    device=0,
    optimizer="AdamW",
    lr0=1e-3,
    weight_decay=0.0005,
    cos_lr=True,
    patience=20,  # early stopping
    project="runs/detect",
    name="custom_v1",
)
# Mosaic/mixup/copy-paste augmentation is on by default; tune via the
# mosaic, mixup, and copy_paste train hyperparameters if needed.
print(results.results_dict)  # mAP50, mAP50-95, precision, recall
```

Implement a data augmentation pipeline
```python
import albumentations as A
from albumentations.pytorch import ToTensorV2
import numpy as np

# Classification pipeline
# (note: albumentations >= 2.0 replaces height=/width= with size=(h, w))
clf_transform = A.Compose([
    A.RandomResizedCrop(height=224, width=224, scale=(0.6, 1.0)),
    A.HorizontalFlip(p=0.5),
    A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.1, rotate_limit=15, p=0.5),
    A.OneOf([
        A.GaussNoise(var_limit=(10, 50)),
        A.GaussianBlur(blur_limit=3),
        A.MotionBlur(blur_limit=3),
    ], p=0.3),
    A.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.2, hue=0.05, p=0.5),
    A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
    ToTensorV2(),
])

# Detection pipeline - bbox-aware transforms
det_transform = A.Compose([
    A.RandomResizedCrop(height=640, width=640, scale=(0.5, 1.0)),
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.4),
    A.HueSaturationValue(p=0.3),
    A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
    ToTensorV2(),
], bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]))

# Usage
image = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)
out = clf_transform(image=image)["image"]  # torch.Tensor [3, 224, 224]
```

Build an image preprocessing pipeline
```python
import torch
from torchvision.transforms import v2 as T
from PIL import Image

# Production preprocessing - deterministic, no augmentation
preprocess = T.Compose([
    # shortest side to 256 (preserves aspect ratio), matching the val transform
    T.Resize(256, interpolation=T.InterpolationMode.BILINEAR, antialias=True),
    T.CenterCrop(224),
    T.ToImage(),
    T.ToDtype(torch.float32, scale=True),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def load_batch(paths: list[str], device: torch.device) -> torch.Tensor:
    """Load, preprocess, and batch a list of image paths."""
    tensors = []
    for p in paths:
        img = Image.open(p).convert("RGB")
        tensors.append(preprocess(img))
    return torch.stack(tensors).to(device)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch = load_batch(["a.jpg", "b.jpg", "c.jpg"], device)
print(batch.shape)  # [3, 3, 224, 224]
```

Deploy a vision model
```python
import numpy as np
import onnxruntime as ort
import torch
import torch.nn as nn
from torchvision import models

# --- Export to ONNX ---
# Rebuild the architecture, then load the saved state_dict
# (torch.save(model.state_dict(), ...) stores weights only, not the module)
NUM_CLASSES = 10  # must match training
model = models.efficientnet_b0()
model.classifier[1] = nn.Linear(model.classifier[1].in_features, NUM_CLASSES)
model.load_state_dict(torch.load("classifier.pth", map_location="cpu"))
model.eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy,
    "classifier.onnx",
    input_names=["image"],
    output_names=["logits"],
    dynamic_axes={"image": {0: "batch"}, "logits": {0: "batch"}},
    opset_version=17,
)

# --- ONNX Runtime inference (CPU or CUDA EP) ---
providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
session = ort.InferenceSession("classifier.onnx", providers=providers)
input_name = session.get_inputs()[0].name

def infer_onnx(batch_np: np.ndarray) -> np.ndarray:
    return session.run(None, {input_name: batch_np})[0]

# --- TensorRT optimization (requires tensorrt package) ---
# Run once offline to build the engine:
#   trtexec --onnx=classifier.onnx --saveEngine=classifier.trt \
#     --fp16 --minShapes=image:1x3x224x224 \
#     --optShapes=image:8x3x224x224 \
#     --maxShapes=image:32x3x224x224
```

Evaluate model performance
```python
import torch
from torchmetrics.classification import (
    MulticlassAccuracy,
    MulticlassConfusionMatrix,
    MulticlassPrecision,
    MulticlassRecall,
    MulticlassF1Score,
)
from torchmetrics.detection import MeanAveragePrecision

# --- Classification metrics ---
def evaluate_classifier(model, loader, num_classes, device):
    model.eval()
    metrics = {
        "acc": MulticlassAccuracy(num_classes=num_classes, top_k=1).to(device),
        "prec": MulticlassPrecision(num_classes=num_classes, average="macro").to(device),
        "rec": MulticlassRecall(num_classes=num_classes, average="macro").to(device),
        "f1": MulticlassF1Score(num_classes=num_classes, average="macro").to(device),
        "cm": MulticlassConfusionMatrix(num_classes=num_classes).to(device),
    }
    with torch.no_grad():
        for imgs, labels in loader:
            imgs, labels = imgs.to(device), labels.to(device)
            preds = model(imgs)  # raw logits are accepted directly
            for m in metrics.values():
                m.update(preds, labels)
    return {k: v.compute() for k, v in metrics.items()}

# --- Detection metrics (COCO mAP) ---
map_metric = MeanAveragePrecision(iou_type="bbox")
# preds and targets follow the torchmetrics dict format
preds = [{"boxes": torch.tensor([[10, 20, 100, 200]]), "scores": torch.tensor([0.9]), "labels": torch.tensor([0])}]
tgts = [{"boxes": torch.tensor([[12, 22, 102, 202]]), "labels": torch.tensor([0])}]
map_metric.update(preds, tgts)
result = map_metric.compute()
print(f"mAP@0.5: {result['map_50']:.4f}  mAP@0.5:0.95: {result['map']:.4f}")
```

Implement semantic segmentation
```python
import torch
import torch.nn as nn
from torchvision.models.segmentation import deeplabv3_resnet50, DeepLabV3_ResNet50_Weights

# --- DeepLabV3 fine-tuning ---
NUM_CLASSES = 21  # e.g. PASCAL VOC
model = deeplabv3_resnet50(weights=DeepLabV3_ResNet50_Weights.DEFAULT)
model.classifier[4] = nn.Conv2d(256, NUM_CLASSES, kernel_size=1)
model.aux_classifier[4] = nn.Conv2d(256, NUM_CLASSES, kernel_size=1)

# Training step
def seg_train_step(model, imgs, masks, optimizer, device):
    model.train()
    imgs, masks = imgs.to(device), masks.long().to(device)
    out = model(imgs)
    # main loss + auxiliary loss
    loss = nn.functional.cross_entropy(out["out"], masks)
    loss += 0.4 * nn.functional.cross_entropy(out["aux"], masks)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Inference - returns per-pixel class index
def seg_predict(model, img_tensor, device):
    model.eval()
    with torch.no_grad():
        out = model(img_tensor.unsqueeze(0).to(device))
    return out["out"].argmax(dim=1).squeeze(0).cpu()  # [H, W]

# --- Lightweight U-Net-style architecture (custom) ---
class DoubleConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)

class UNet(nn.Module):
    def __init__(self, in_channels=3, num_classes=2, features=(64, 128, 256, 512)):
        super().__init__()
        self.downs = nn.ModuleList()
        self.ups = nn.ModuleList()
        self.pool = nn.MaxPool2d(2, 2)
        ch = in_channels
        for f in features:
            self.downs.append(DoubleConv(ch, f))
            ch = f
        self.bottleneck = DoubleConv(features[-1], features[-1] * 2)
        for f in reversed(features):
            self.ups.append(nn.ConvTranspose2d(f * 2, f, 2, 2))
            self.ups.append(DoubleConv(f * 2, f))
        self.head = nn.Conv2d(features[0], num_classes, 1)

    def forward(self, x):
        skips = []
        for down in self.downs:
            x = down(x)
            skips.append(x)
            x = self.pool(x)
        x = self.bottleneck(x)
        for i in range(0, len(self.ups), 2):
            x = self.ups[i](x)
            skip = skips[-(i // 2 + 1)]
            if x.shape[2:] != skip.shape[2:]:  # spatial mismatch from odd sizes
                x = nn.functional.interpolate(x, size=skip.shape[2:])
            x = self.ups[i + 1](torch.cat([skip, x], dim=1))
        return self.head(x)
```

Anti-patterns / common mistakes
| Anti-pattern | What goes wrong | Correct approach |
|---|---|---|
| Training from scratch on small datasets | Model memorizes noise, poor generalization | Always start from pretrained weights; freeze backbone initially |
| Normalizing with wrong mean/std | Silent accuracy drop when ImageNet stats misapplied to non-ImageNet data | Compute dataset statistics or use the exact stats that match the pretrained model |
| Leaking augmentation into validation | Inflated validation metrics; surprises in production | Apply only deterministic transforms (resize, normalize) to val/test splits |
| Skipping anchor/stride tuning for custom scale objects | Model misses very small or very large objects | Analyse object scale distribution; adjust anchor sizes or use anchor-free models |
| Exporting to ONNX without dynamic axes | Batch-size-1 locked model; crashes on larger batches in production | Always set dynamic_axes for batch dimension (and optionally spatial dims) |
| Evaluating detection with IoU threshold 0.5 only | Misses regression quality; mAP@0.5:0.95 is 2-3x harder | Report both mAP@0.5 and mAP@0.5:0.95, per COCO convention |
Gotchas
- Normalizing with wrong mean/std silently degrades accuracy - If you fine-tune from ImageNet weights but normalize with different mean/std at inference, predictions silently degrade. The values [0.485, 0.456, 0.406] / [0.229, 0.224, 0.225] are ImageNet-specific; compute your own stats if your data is not RGB photos (e.g., medical images, satellite imagery).
- Augmentation leaking into validation - Applying RandomResizedCrop or ColorJitter to the validation split inflates metrics. Only deterministic transforms (resize, center crop, normalize) belong in the val/test transforms.
- ONNX export without dynamic axes locks batch size - Exporting with a fixed batch size of 1 causes runtime crashes in production when the batch size changes. Always set dynamic_axes={"image": {0: "batch"}} during export.
- Anchor tuning for unusual object scales - If your objects are very small (satellite imagery, cell microscopy) or very large relative to the image, default anchor sizes will miss them. Analyse the object scale distribution (YOLOv5's AutoAnchor check does this at train start), or use anchor-free models for unusual scale distributions.
References
For detailed content on model selection and architecture comparisons, read:
- references/model-zoo.md - backbone and detector architecture comparison, pretrained weight sources, speed/accuracy tradeoffs, hardware considerations
References
model-zoo.md
Computer Vision Model Zoo
Image Classification Backbones
| Model | Params | ImageNet Top-1 | Latency (A100, bs=1) | Pretrained weights | Best for |
|---|---|---|---|---|---|
| ResNet-50 | 25 M | 76.1 % | ~3 ms | torchvision / timm | Baseline, feature extractor |
| ResNet-101 | 45 M | 77.4 % | ~5 ms | torchvision / timm | Higher accuracy vs R50 |
| EfficientNet-B0 | 5.3 M | 77.1 % | ~2 ms | torchvision / timm | Mobile, low FLOP |
| EfficientNet-B4 | 19 M | 83.4 % | ~7 ms | torchvision / timm | Accuracy/speed sweet spot |
| EfficientNet-B7 | 66 M | 84.4 % | ~20 ms | torchvision / timm | Max accuracy, constrained deploy |
| ConvNeXt-Tiny | 28 M | 82.1 % | ~4 ms | torchvision / timm | Modern CNN, easy fine-tuning |
| ConvNeXt-Base | 89 M | 83.8 % | ~9 ms | torchvision / timm | Strong general baseline |
| ViT-B/16 | 86 M | 81.1 % | ~6 ms | timm / HuggingFace | Attention maps, large data |
| ViT-L/16 | 307 M | 82.5 % | ~18 ms | timm / HuggingFace | Highest accuracy, data-hungry |
| DINOv2-ViT-B/14 | 86 M | 84.5 % (linear) | ~7 ms | HuggingFace facebook/dinov2-base | Few-shot, dense features |
| DINOv2-ViT-L/14 | 307 M | 86.3 % (linear) | ~20 ms | HuggingFace facebook/dinov2-large | Best self-supervised features |
Loading pretrained weights

```python
import timm

# List available models
timm.list_models("efficientnet*", pretrained=True)

# Load any timm model
model = timm.create_model("efficientnet_b4", pretrained=True, num_classes=0)  # feature extractor
cfg = model.default_cfg  # contains input size, mean, std
```

Object Detection Models
| Model | Backbone | COCO mAP | FPS (A100) | Weights source | Notes |
|---|---|---|---|---|---|
| YOLOv5n | CSPDarknet | 28.0 | 1200 | Ultralytics | Smallest YOLO, edge deploy |
| YOLOv5s | CSPDarknet | 37.4 | 600 | Ultralytics | |
| YOLOv5m | CSPDarknet | 45.4 | 300 | Ultralytics | |
| YOLOv5l | CSPDarknet | 49.0 | 180 | Ultralytics | |
| YOLOv8n | C2f | 37.3 | 1300 | Ultralytics | Anchor-free, cleaner API |
| YOLOv8s | C2f | 44.9 | 700 | Ultralytics | |
| YOLOv8m | C2f | 50.2 | 300 | Ultralytics | |
| YOLOv8l | C2f | 52.9 | 180 | Ultralytics | |
| YOLO11n | C3k2 | 39.5 | 1400 | Ultralytics | Latest generation, default choice |
| YOLO11s | C3k2 | 47.0 | 750 | Ultralytics | |
| YOLO11m | C3k2 | 51.5 | 350 | Ultralytics | |
| YOLO11l | C3k2 | 53.4 | 200 | Ultralytics | |
| YOLO11x | C3k2 | 54.7 | 100 | Ultralytics | |
| RT-DETR-L | ResNet-101 | 53.0 | 110 | Ultralytics / HuggingFace | Transformer, no NMS needed |
| DETR | ResNet-50 | 42.0 | 60 | HuggingFace | Foundational transformer detector |
| Faster R-CNN R50 | ResNet-50 | 37.0 | 50 | torchvision | Two-stage, high-precision |
Model selection heuristics
- Edge / mobile (Jetson Nano, mobile CPU): YOLO11n or YOLOv5n; use INT8 TensorRT export.
- Server real-time (>20 FPS on single GPU): YOLO11s or YOLO11m.
- Maximum accuracy, offline: YOLO11x or RT-DETR-L.
- Unusual aspect ratios or dense small objects: Consider tiled inference with YOLO.
- No NMS tuning wanted: RT-DETR removes post-processing sensitivity to IoU threshold.
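Tiled inference, mentioned in the heuristics above, starts with an overlapping-tile plan. The helper below is a hypothetical sketch that only computes tile origins; crop each tile, run the detector on it, then shift the resulting boxes by the tile's origin and de-duplicate with NMS across tiles.

```python
def tile_origins(width: int, height: int, tile: int = 640, overlap: float = 0.2):
    """Top-left corners of overlapping tiles covering a width x height image."""
    stride = max(1, int(tile * (1 - overlap)))
    xs = list(range(0, max(width - tile, 0) + 1, stride))
    ys = list(range(0, max(height - tile, 0) + 1, stride))
    # Ensure the right and bottom edges are covered
    if xs[-1] + tile < width:
        xs.append(width - tile)
    if ys[-1] + tile < height:
        ys.append(height - tile)
    return [(x, y) for y in ys for x in xs]

origins = tile_origins(1920, 1080, tile=640, overlap=0.2)
print(len(origins))              # 8 tiles cover 1080p with ~20% overlap
print(origins[0], origins[-1])   # (0, 0) (1280, 440)
```

The overlap keeps objects that straddle a tile boundary fully inside at least one tile; 20% is a reasonable starting point when objects are much smaller than the tile.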
Semantic Segmentation Models
| Model | Backbone | mIoU (Cityscapes) | mIoU (ADE20K) | Weights source | Notes |
|---|---|---|---|---|---|
| DeepLabV3 R50 | ResNet-50 | 73.5 | - | torchvision | Solid baseline |
| DeepLabV3+ R101 | ResNet-101 | 78.9 | - | torchvision | Atrous spatial pyramid |
| SegFormer-B0 | MiT-B0 | 76.2 | 37.4 | HuggingFace nvidia/segformer-b0 | Lightweight transformer |
| SegFormer-B2 | MiT-B2 | 81.0 | 46.5 | HuggingFace nvidia/segformer-b2 | Best efficiency/accuracy |
| SegFormer-B5 | MiT-B5 | 84.0 | 51.8 | HuggingFace nvidia/segformer-b5 | Highest accuracy |
| Mask2Former | Swin-B | - | 53.9 | HuggingFace facebook/mask2former-swin-base-ade-semantic | Universal segmentation |
| U-Net (custom) | ResNet/EfficientNet | varies | varies | timm encoder | Medical / satellite, custom scale |
Instance Segmentation Models
| Model | COCO mask AP | Weights source | Notes |
|---|---|---|---|
| YOLO11n-seg | 30.7 | Ultralytics | Fastest, edge |
| YOLO11m-seg | 40.8 | Ultralytics | Balanced |
| YOLO11x-seg | 43.8 | Ultralytics | Best accuracy |
| Mask R-CNN R50 | 34.6 | torchvision | Classic two-stage |
| SAM (ViT-B) | - | Meta / HuggingFace | Promptable, zero-shot masks |
| SAM (ViT-H) | - | Meta / HuggingFace | Highest quality, slow |
SAM quickstart

```python
import torch
from PIL import Image
from transformers import SamModel, SamProcessor

model = SamModel.from_pretrained("facebook/sam-vit-base").to("cuda")
processor = SamProcessor.from_pretrained("facebook/sam-vit-base")

image_pil = Image.open("image.jpg").convert("RGB")
# Point prompt at pixel (x, y)
x, y = 320, 240
inputs = processor(images=image_pil, input_points=[[[x, y]]], return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model(**inputs)
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(), inputs["original_sizes"].cpu(), inputs["reshaped_input_sizes"].cpu()
)
```

Foundation / Zero-shot Models
| Model | Task | Weights source | Notes |
|---|---|---|---|
| CLIP ViT-B/32 | Image-text matching | openai/clip-vit-base-patch32 | Zero-shot classification |
| CLIP ViT-L/14 | Image-text matching | openai/clip-vit-large-patch14 | Better accuracy |
| OpenCLIP ViT-H/14 | Image-text matching | HuggingFace laion/CLIP-ViT-H-14-laion2B | Open weights, LAION trained |
| Grounding DINO | Open-vocab detection | HuggingFace IDEA-Research/grounding-dino-base | Text-prompted detection |
| OWL-ViT | Open-vocab detection | HuggingFace google/owlvit-base-patch32 | Few-shot detection |
CLIP zero-shot classification

```python
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image_pil = Image.open("image.jpg").convert("RGB")
labels = ["a cat", "a dog", "a car"]
inputs = processor(text=labels, images=image_pil, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # [1, num_labels]
probs = logits.softmax(dim=1)
predicted = labels[probs.argmax()]
```

Pretrained Weight Sources
| Source | URL | Notes |
|---|---|---|
| torchvision | torchvision.models | Official PyTorch models |
| timm | pip install timm / HuggingFace | 1000+ models, consistent API |
| Ultralytics | ultralytics.com / pip install ultralytics | YOLO family |
| HuggingFace Hub | huggingface.co/models | SegFormer, SAM, CLIP, DINOv2 |
| Meta Research | GitHub releases | SAM, DINOv2 native checkpoints |
| ONNX Model Zoo | github.com/onnx/models | Ready-to-deploy ONNX weights |
Hardware Considerations
| Scenario | Recommended approach |
|---|---|
| Training on single GPU (<=8 GB) | EfficientNet-B0/B2 or YOLO11n/s; use AMP (torch.cuda.amp) |
| Training on multi-GPU | torch.nn.parallel.DistributedDataParallel; YOLO11 device=0,1,2,3 |
| Inference on CPU only | Export to ONNX; use OpenVINO for Intel, XNNPACK for ARM |
| Inference on Jetson (edge GPU) | Export to TensorRT FP16/INT8 with trtexec or torch2trt |
| Inference on Apple Silicon | Use mps device (torch.device("mps")); CoreML export for on-device |
| Cloud serving (throughput) | TensorRT on T4/A10G; batch size 8-32; dynamic shape engines |
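The AMP recommendation in the first row looks like this in practice. The sketch is device-agnostic (on CPU the gradient scaler is disabled and autocast falls back to bfloat16); the tiny model is a placeholder, not part of the skill.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10),
).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16

def amp_step(imgs, labels):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device, dtype=amp_dtype):
        loss = nn.functional.cross_entropy(model(imgs), labels)
    scaler.scale(loss).backward()  # scale to avoid FP16 gradient underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()

imgs = torch.randn(4, 3, 32, 32, device=device)
labels = torch.randint(0, 10, (4,), device=device)
print(amp_step(imgs, labels) > 0)  # True - CE loss of a random-init model is positive
```

AMP roughly halves activation memory and speeds up training on tensor-core GPUs, usually with no accuracy cost for these architectures.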