computer-vision
Use this skill when building computer vision applications, implementing image classification, object detection, or segmentation pipelines. Triggers on image classification, object detection, YOLO, semantic segmentation, image preprocessing, data augmentation, transfer learning, CNN architectures, vision transformers, and any task requiring visual recognition or image analysis.
computer-vision
computer-vision is a production-ready AI agent skill for claude-code, gemini-cli, openai-codex. It covers building computer vision applications and implementing image classification, object detection, and segmentation pipelines.
Quick Facts
| Field | Value |
|---|---|
| Category | ai-ml |
| Version | 0.1.0 |
| Platforms | claude-code, gemini-cli, openai-codex |
| License | MIT |
How to Install
- Make sure you have Node.js installed on your machine.
- Run the following command in your terminal:
```shell
npx skills add AbsolutelySkilled/AbsolutelySkilled --skill computer-vision
```
- The computer-vision skill is now available in your AI coding agent (Claude Code, Gemini CLI, OpenAI Codex, etc.).
Overview
Computer vision enables machines to interpret and reason about visual data - images, video, and multi-modal inputs. Modern CV pipelines are built on deep neural networks pretrained on large datasets (ImageNet, COCO, ADE20K) and fine-tuned for specific domains. PyTorch and its ecosystem (torchvision, timm, ultralytics, albumentations) cover the full stack from data loading through deployment. Foundation models like SAM, DINOv2, and OpenCLIP have shifted best practice toward prompt-based and zero-shot approaches before committing to full training runs.
Tags
computer-vision deep-learning object-detection segmentation cnn
Platforms
- claude-code
- gemini-cli
- openai-codex
Frequently Asked Questions
What is computer-vision?
Use this skill when building computer vision applications, implementing image classification, object detection, or segmentation pipelines. Triggers on image classification, object detection, YOLO, semantic segmentation, image preprocessing, data augmentation, transfer learning, CNN architectures, vision transformers, and any task requiring visual recognition or image analysis.
How do I install computer-vision?
Run npx skills add AbsolutelySkilled/AbsolutelySkilled --skill computer-vision in your terminal. The skill will be immediately available in your AI coding agent.
What AI agents support computer-vision?
This skill works with claude-code, gemini-cli, openai-codex. Install it once and use it across any supported AI coding agent.
Maintainers
Generated from AbsolutelySkilled
SKILL.md
Computer Vision
Computer vision enables machines to interpret and reason about visual data - images, video, and multi-modal inputs. Modern CV pipelines are built on deep neural networks pretrained on large datasets (ImageNet, COCO, ADE20K) and fine-tuned for specific domains. PyTorch and its ecosystem (torchvision, timm, ultralytics, albumentations) cover the full stack from data loading through deployment. Foundation models like SAM, DINOv2, and OpenCLIP have shifted best practice toward prompt-based and zero-shot approaches before committing to full training runs.
When to use this skill
Trigger this skill when the user:
- Trains or fine-tunes an image classifier on a custom dataset
- Runs inference with YOLO, DETR, or other detection models
- Builds a semantic or instance segmentation pipeline
- Implements data augmentation for CV training
- Preprocesses images for model ingestion (resize, normalize, batch)
- Exports a vision model to ONNX or optimizes with TensorRT
- Evaluates a vision model (mAP, confusion matrix, per-class metrics)
- Implements a U-Net, DeepLabV3, or similar segmentation architecture
Do NOT trigger this skill for:
- Pure NLP tasks with no visual component (use a language-model skill instead)
- 3D point-cloud processing or LiDAR-only pipelines (overlap is limited; check domain)
Key principles
- Start with pretrained models - Fine-tune ImageNet/COCO weights before training from scratch. Even a frozen backbone with a new head beats random init on small datasets.
- Augment data aggressively - Real-world distribution shifts are unavoidable. Use albumentations with geometric, color, and noise transforms. Task-aware augmentations (mosaic, copy-paste) matter especially for detection.
- Validate on representative data - Always hold out data from the exact deployment distribution. Benchmark on in-distribution AND out-of-distribution splits separately.
- Optimize inference separately from training - Training precision (FP32/AMP) and inference precision (INT8/FP16) have different tradeoffs. Profile, export to ONNX, then apply TensorRT or OpenVINO post-training quantization.
- Monitor for distribution shift - Production images drift from training data (lighting changes, new object classes, compression artifacts). Log prediction confidence distributions and trigger retraining pipelines when they degrade.
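The last principle can be made concrete with a confidence-histogram monitor. The sketch below is a minimal illustration, not part of the skill's API; the PSI threshold of 0.2 is a common heuristic, and the beta-distributed sample data is purely synthetic.

```python
import numpy as np

def confidence_drift(baseline_conf: np.ndarray, recent_conf: np.ndarray,
                     bins: int = 10) -> float:
    """Population Stability Index between two prediction-confidence
    distributions. PSI > 0.2 is a common (heuristic) retraining trigger."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    p, _ = np.histogram(baseline_conf, bins=edges)
    q, _ = np.histogram(recent_conf, bins=edges)
    p = np.clip(p / p.sum(), 1e-6, None)  # avoid log(0) in empty bins
    q = np.clip(q / q.sum(), 1e-6, None)
    return float(np.sum((q - p) * np.log(q / p)))

rng = np.random.default_rng(0)
base = rng.beta(8, 2, 5000)     # confident predictions at deploy time
same = rng.beta(8, 2, 5000)     # same distribution -> PSI near 0
shifted = rng.beta(3, 3, 5000)  # drifted production confidences
print(confidence_drift(base, same) < 0.05)   # True
print(confidence_drift(base, shifted) > 0.2) # True
```

Log the per-window PSI alongside model version and input source so a drift alert points at the batch that caused it.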
Core concepts
Task taxonomy
| Task | Output | Typical metric |
|---|---|---|
| Classification | Single label per image | Top-1 / Top-5 accuracy |
| Detection | Bounding boxes + labels | mAP@0.5, mAP@0.5:0.95 |
| Semantic segmentation | Per-pixel class mask | mIoU |
| Instance segmentation | Per-object mask + label | mask AP |
| Generation / synthesis | New images | FID, LPIPS |
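The detection metrics in the table are all built on box IoU. As a quick reference, here is a minimal pairwise IoU in plain Python (a hypothetical helper, boxes in [x1, y1, x2, y2] format):

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes in [x1, y1, x2, y2] format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(box_iou([0, 0, 10, 10], [0, 0, 10, 10]))  # 1.0
print(box_iou([0, 0, 10, 10], [5, 0, 15, 10]))  # 0.3333...
```

mAP@0.5 counts a prediction as correct when this value exceeds 0.5; mAP@0.5:0.95 averages over thresholds from 0.5 to 0.95 in steps of 0.05.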
Backbone architectures
| Backbone | Strengths | Typical use |
|---|---|---|
| ResNet-50/101 | Stable, well-understood | Classification baseline, feature extractor |
| EfficientNet-B0..B7 | Accuracy/FLOP Pareto front | Mobile + server classification |
| ViT-B/16, ViT-L/16 | Strong with large data, attention maps | High-accuracy classification, zero-shot |
| ConvNeXt-T/B | CNN with transformer-like training recipe | Drop-in ResNet replacement |
| DINOv2 (ViT) | Strong self-supervised features | Few-shot, feature extraction |
Anchor-free vs anchor-based detection
- Anchor-based (YOLOv5, Faster R-CNN) - predefined box aspect ratios per grid cell. Fast training convergence, tuning required for unusual object scales.
- Anchor-free (YOLO11/v8, FCOS, DETR) - predict box center + offsets directly. Cleaner training, no anchor hyperparameter search, now the default for new projects.
Loss functions
| Loss | Used for |
|---|---|
| Cross-entropy | Classification (multi-class), segmentation pixel-wise |
| Focal loss | Detection classification head - down-weights easy negatives |
| IoU / GIoU / CIoU / DIoU | Bounding box regression |
| Dice loss | Segmentation - handles class imbalance better than cross-entropy |
| Binary cross-entropy | Multi-label classification, mask prediction |
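To illustrate the Dice entry above, here is a minimal soft Dice loss for binary masks. This is a sketch, not the skill's canonical implementation; the smoothing constant eps is a common but arbitrary choice.

```python
import torch

def dice_loss(logits: torch.Tensor, targets: torch.Tensor,
              eps: float = 1.0) -> torch.Tensor:
    """Soft Dice loss for binary segmentation.
    logits: [N, 1, H, W] raw scores; targets: [N, 1, H, W] in {0, 1}."""
    probs = torch.sigmoid(logits)
    inter = (probs * targets).sum(dim=(1, 2, 3))
    denom = probs.sum(dim=(1, 2, 3)) + targets.sum(dim=(1, 2, 3))
    dice = (2 * inter + eps) / (denom + eps)
    return 1.0 - dice.mean()

# Perfect prediction -> loss near 0; inverted prediction -> loss near 1
target = torch.zeros(1, 1, 8, 8)
target[..., :4, :] = 1.0
good = (target * 2 - 1) * 10  # large logits matching the mask
print(dice_loss(good, target).item() < 0.01)  # True
print(dice_loss(-good, target).item() > 0.9)  # True
```

Because Dice normalizes by mask area, it does not reward predicting all-background on imbalanced masks the way pixel-wise cross-entropy can; a common recipe is a weighted sum of Dice and BCE.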
Common tasks
Fine-tune an image classifier

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

# 1. Data transforms
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
val_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
train_ds = datasets.ImageFolder("data/train", transform=train_tf)
val_ds = datasets.ImageFolder("data/val", transform=val_tf)
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True, num_workers=4)
val_loader = DataLoader(val_ds, batch_size=64, shuffle=False, num_workers=4)

# 2. Load pretrained backbone, replace head
NUM_CLASSES = len(train_ds.classes)
model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.DEFAULT)
model.classifier[1] = nn.Linear(model.classifier[1].in_features, NUM_CLASSES)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# 3. Two-phase training: freeze backbone and train head, then unfreeze
for p in model.features.parameters():
    p.requires_grad = False
optimizer = torch.optim.AdamW(model.classifier.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=5)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

def train_one_epoch(loader):
    model.train()
    for imgs, labels in loader:
        imgs, labels = imgs.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(imgs), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()

# Phase 1 - head only (5 epochs)
for epoch in range(5):
    train_one_epoch(train_loader)

# Phase 2 - unfreeze everything with lower LR
for p in model.parameters():
    p.requires_grad = True
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)
for epoch in range(10):
    train_one_epoch(train_loader)

torch.save(model.state_dict(), "classifier.pth")
```

Run object detection with YOLO
```python
from ultralytics import YOLO

# --- Inference ---
model = YOLO("yolo11n.pt")  # nano; swap for yolo11s/m/l/x for accuracy
results = model.predict("image.jpg", conf=0.25, iou=0.45, device=0)
for r in results:
    for box in r.boxes:
        cls = int(box.cls[0])
        label = model.names[cls]
        conf = float(box.conf[0])
        xyxy = box.xyxy[0].tolist()  # [x1, y1, x2, y2]
        print(f"{label}: {conf:.2f} {xyxy}")

# --- Fine-tune on custom dataset ---
# Expects data.yaml with train/val paths and class names
model = YOLO("yolo11s.pt")
results = model.train(
    data="data.yaml",
    epochs=100,
    imgsz=640,
    batch=16,
    device=0,
    optimizer="AdamW",
    lr0=1e-3,
    weight_decay=0.0005,
    cos_lr=True,
    patience=20,  # early stopping
    project="runs/detect",
    name="custom_v1",
)
# Mosaic/mixup/copy-paste augmentation is on by default; tune via the
# mosaic, mixup, and copy_paste train hyperparameters if needed.
print(results.results_dict)  # mAP50, mAP50-95, precision, recall
```

Implement a data augmentation pipeline
```python
import albumentations as A
from albumentations.pytorch import ToTensorV2
import numpy as np

# Classification pipeline
# (note: albumentations >= 2.0 replaces height=/width= with size=(h, w))
clf_transform = A.Compose([
    A.RandomResizedCrop(height=224, width=224, scale=(0.6, 1.0)),
    A.HorizontalFlip(p=0.5),
    A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.1, rotate_limit=15, p=0.5),
    A.OneOf([
        A.GaussNoise(var_limit=(10, 50)),
        A.GaussianBlur(blur_limit=3),
        A.MotionBlur(blur_limit=3),
    ], p=0.3),
    A.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.2, hue=0.05, p=0.5),
    A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
    ToTensorV2(),
])

# Detection pipeline - bbox-aware transforms
det_transform = A.Compose([
    A.RandomResizedCrop(height=640, width=640, scale=(0.5, 1.0)),
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.4),
    A.HueSaturationValue(p=0.3),
    A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
    ToTensorV2(),
], bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]))

# Usage
image = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)
out = clf_transform(image=image)["image"]  # torch.Tensor [3, 224, 224]
```

Build an image preprocessing pipeline
```python
import torch
from torchvision.transforms import v2 as T
from PIL import Image

# Production preprocessing - deterministic, no augmentation
preprocess = T.Compose([
    # shortest side to 256 (preserves aspect ratio), matching the val transform
    T.Resize(256, interpolation=T.InterpolationMode.BILINEAR, antialias=True),
    T.CenterCrop(224),
    T.ToImage(),
    T.ToDtype(torch.float32, scale=True),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def load_batch(paths: list[str], device: torch.device) -> torch.Tensor:
    """Load, preprocess, and batch a list of image paths."""
    tensors = []
    for p in paths:
        img = Image.open(p).convert("RGB")
        tensors.append(preprocess(img))
    return torch.stack(tensors).to(device)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch = load_batch(["a.jpg", "b.jpg", "c.jpg"], device)
print(batch.shape)  # [3, 3, 224, 224]
```

Deploy a vision model
```python
import numpy as np
import onnxruntime as ort
import torch
import torch.nn as nn
from torchvision import models

# --- Export to ONNX ---
# Rebuild the architecture, then load the saved state_dict
# (torch.save(model.state_dict(), ...) stores weights only, not the module)
NUM_CLASSES = 10  # must match training
model = models.efficientnet_b0()
model.classifier[1] = nn.Linear(model.classifier[1].in_features, NUM_CLASSES)
model.load_state_dict(torch.load("classifier.pth", map_location="cpu"))
model.eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy,
    "classifier.onnx",
    input_names=["image"],
    output_names=["logits"],
    dynamic_axes={"image": {0: "batch"}, "logits": {0: "batch"}},
    opset_version=17,
)

# --- ONNX Runtime inference (CPU or CUDA EP) ---
providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
session = ort.InferenceSession("classifier.onnx", providers=providers)
input_name = session.get_inputs()[0].name

def infer_onnx(batch_np: np.ndarray) -> np.ndarray:
    return session.run(None, {input_name: batch_np})[0]

# --- TensorRT optimization (requires tensorrt package) ---
# Run once offline to build the engine:
#   trtexec --onnx=classifier.onnx --saveEngine=classifier.trt \
#     --fp16 --minShapes=image:1x3x224x224 \
#     --optShapes=image:8x3x224x224 \
#     --maxShapes=image:32x3x224x224
```

Evaluate model performance
```python
import torch
from torchmetrics.classification import (
    MulticlassAccuracy,
    MulticlassConfusionMatrix,
    MulticlassPrecision,
    MulticlassRecall,
    MulticlassF1Score,
)
from torchmetrics.detection import MeanAveragePrecision

# --- Classification metrics ---
def evaluate_classifier(model, loader, num_classes, device):
    model.eval()
    metrics = {
        "acc": MulticlassAccuracy(num_classes=num_classes, top_k=1).to(device),
        "prec": MulticlassPrecision(num_classes=num_classes, average="macro").to(device),
        "rec": MulticlassRecall(num_classes=num_classes, average="macro").to(device),
        "f1": MulticlassF1Score(num_classes=num_classes, average="macro").to(device),
        "cm": MulticlassConfusionMatrix(num_classes=num_classes).to(device),
    }
    with torch.no_grad():
        for imgs, labels in loader:
            imgs, labels = imgs.to(device), labels.to(device)
            preds = model(imgs)  # raw logits are accepted directly
            for m in metrics.values():
                m.update(preds, labels)
    return {k: v.compute() for k, v in metrics.items()}

# --- Detection metrics (COCO mAP) ---
map_metric = MeanAveragePrecision(iou_type="bbox")
# preds and targets follow the torchmetrics dict format
preds = [{"boxes": torch.tensor([[10, 20, 100, 200]]), "scores": torch.tensor([0.9]), "labels": torch.tensor([0])}]
tgts = [{"boxes": torch.tensor([[12, 22, 102, 202]]), "labels": torch.tensor([0])}]
map_metric.update(preds, tgts)
result = map_metric.compute()
print(f"mAP@0.5: {result['map_50']:.4f}  mAP@0.5:0.95: {result['map']:.4f}")
```

Implement semantic segmentation
```python
import torch
import torch.nn as nn
from torchvision.models.segmentation import deeplabv3_resnet50, DeepLabV3_ResNet50_Weights

# --- DeepLabV3 fine-tuning ---
NUM_CLASSES = 21  # e.g. PASCAL VOC
model = deeplabv3_resnet50(weights=DeepLabV3_ResNet50_Weights.DEFAULT)
model.classifier[4] = nn.Conv2d(256, NUM_CLASSES, kernel_size=1)
model.aux_classifier[4] = nn.Conv2d(256, NUM_CLASSES, kernel_size=1)

# Training step
def seg_train_step(model, imgs, masks, optimizer, device):
    model.train()
    imgs, masks = imgs.to(device), masks.long().to(device)
    out = model(imgs)
    # main loss + auxiliary loss
    loss = nn.functional.cross_entropy(out["out"], masks)
    loss += 0.4 * nn.functional.cross_entropy(out["aux"], masks)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Inference - returns per-pixel class index
def seg_predict(model, img_tensor, device):
    model.eval()
    with torch.no_grad():
        out = model(img_tensor.unsqueeze(0).to(device))
    return out["out"].argmax(dim=1).squeeze(0).cpu()  # [H, W]

# --- Lightweight U-Net-style architecture (custom) ---
class DoubleConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)

class UNet(nn.Module):
    def __init__(self, in_channels=3, num_classes=2, features=(64, 128, 256, 512)):
        super().__init__()
        self.downs = nn.ModuleList()
        self.ups = nn.ModuleList()
        self.pool = nn.MaxPool2d(2, 2)
        ch = in_channels
        for f in features:
            self.downs.append(DoubleConv(ch, f))
            ch = f
        self.bottleneck = DoubleConv(features[-1], features[-1] * 2)
        for f in reversed(features):
            self.ups.append(nn.ConvTranspose2d(f * 2, f, 2, 2))
            self.ups.append(DoubleConv(f * 2, f))
        self.head = nn.Conv2d(features[0], num_classes, 1)

    def forward(self, x):
        skips = []
        for down in self.downs:
            x = down(x)
            skips.append(x)
            x = self.pool(x)
        x = self.bottleneck(x)
        for i in range(0, len(self.ups), 2):
            x = self.ups[i](x)
            skip = skips[-(i // 2 + 1)]
            if x.shape[2:] != skip.shape[2:]:  # spatial mismatch from odd sizes
                x = nn.functional.interpolate(x, size=skip.shape[2:])
            x = self.ups[i + 1](torch.cat([skip, x], dim=1))
        return self.head(x)
```

Anti-patterns / common mistakes
| Anti-pattern | What goes wrong | Correct approach |
|---|---|---|
| Training from scratch on small datasets | Model memorizes noise, poor generalization | Always start from pretrained weights; freeze backbone initially |
| Normalizing with wrong mean/std | Silent accuracy drop when ImageNet stats misapplied to non-ImageNet data | Compute dataset statistics or use the exact stats that match the pretrained model |
| Leaking augmentation into validation | Inflated validation metrics; surprises in production | Apply only deterministic transforms (resize, normalize) to val/test splits |
| Skipping anchor/stride tuning for custom scale objects | Model misses very small or very large objects | Analyse object scale distribution; adjust anchor sizes or use anchor-free models |
| Exporting to ONNX without dynamic axes | Batch-size-1 locked model; crashes on larger batches in production | Always set dynamic_axes for batch dimension (and optionally spatial dims) |
| Evaluating detection with IoU threshold 0.5 only | Misses regression quality; mAP@0.5:0.95 is 2-3x harder | Report both mAP@0.5 and mAP@0.5:0.95, per COCO convention |
Gotchas
- Normalizing with wrong mean/std silently degrades accuracy - If you fine-tune from ImageNet weights but normalize with different mean/std at inference, predictions silently degrade. The values [0.485, 0.456, 0.406] / [0.229, 0.224, 0.225] are ImageNet-specific; compute your own stats if your data is not RGB photos (e.g., medical images, satellite imagery).
- Augmentation leaking into validation - Applying RandomResizedCrop or ColorJitter to the validation split inflates metrics. Only deterministic transforms (resize, center crop, normalize) belong in the val/test transforms.
- ONNX export without dynamic axes locks batch size - Exporting with a fixed batch size of 1 causes runtime crashes in production when the batch size changes. Always set dynamic_axes={"image": {0: "batch"}} during export.
- Anchor tuning for unusual object scales - If your objects are very small (satellite imagery, cell microscopy) or very large relative to the image, default anchor sizes will miss them. Analyse the object scale distribution (YOLOv5's AutoAnchor check does this at train start), or use anchor-free models for unusual scale distributions.
References
For detailed content on model selection and architecture comparisons, read:
- references/model-zoo.md - backbone and detector architecture comparison, pretrained weight sources, speed/accuracy tradeoffs, hardware considerations
References
model-zoo.md
Computer Vision Model Zoo
Image Classification Backbones
| Model | Params | ImageNet Top-1 | Latency (A100, bs=1) | Pretrained weights | Best for |
|---|---|---|---|---|---|
| ResNet-50 | 25 M | 76.1 % | ~3 ms | torchvision / timm | Baseline, feature extractor |
| ResNet-101 | 45 M | 77.4 % | ~5 ms | torchvision / timm | Higher accuracy vs R50 |
| EfficientNet-B0 | 5.3 M | 77.1 % | ~2 ms | torchvision / timm | Mobile, low FLOP |
| EfficientNet-B4 | 19 M | 83.4 % | ~7 ms | torchvision / timm | Accuracy/speed sweet spot |
| EfficientNet-B7 | 66 M | 84.4 % | ~20 ms | torchvision / timm | Max accuracy, constrained deploy |
| ConvNeXt-Tiny | 28 M | 82.1 % | ~4 ms | torchvision / timm | Modern CNN, easy fine-tuning |
| ConvNeXt-Base | 89 M | 83.8 % | ~9 ms | torchvision / timm | Strong general baseline |
| ViT-B/16 | 86 M | 81.1 % | ~6 ms | timm / HuggingFace | Attention maps, large data |
| ViT-L/16 | 307 M | 82.5 % | ~18 ms | timm / HuggingFace | Highest accuracy, data-hungry |
| DINOv2-ViT-B/14 | 86 M | 84.5 % (linear) | ~7 ms | HuggingFace facebook/dinov2-base | Few-shot, dense features |
| DINOv2-ViT-L/14 | 307 M | 86.3 % (linear) | ~20 ms | HuggingFace facebook/dinov2-large | Best self-supervised features |
Loading pretrained weights

```python
import timm

# List available models
timm.list_models("efficientnet*", pretrained=True)

# Load any timm model
model = timm.create_model("efficientnet_b4", pretrained=True, num_classes=0)  # feature extractor
cfg = model.default_cfg  # contains input size, mean, std
```

Object Detection Models
| Model | Backbone | COCO mAP | FPS (A100) | Weights source | Notes |
|---|---|---|---|---|---|
| YOLOv5n | CSPDarknet | 28.0 | 1200 | Ultralytics | Smallest YOLO, edge deploy |
| YOLOv5s | CSPDarknet | 37.4 | 600 | Ultralytics | |
| YOLOv5m | CSPDarknet | 45.4 | 300 | Ultralytics | |
| YOLOv5l | CSPDarknet | 49.0 | 180 | Ultralytics | |
| YOLOv8n | C2f | 37.3 | 1300 | Ultralytics | Anchor-free, cleaner API |
| YOLOv8s | C2f | 44.9 | 700 | Ultralytics | |
| YOLOv8m | C2f | 50.2 | 300 | Ultralytics | |
| YOLOv8l | C2f | 52.9 | 180 | Ultralytics | |
| YOLO11n | C3k2 | 39.5 | 1400 | Ultralytics | Latest generation, default choice |
| YOLO11s | C3k2 | 47.0 | 750 | Ultralytics | |
| YOLO11m | C3k2 | 51.5 | 350 | Ultralytics | |
| YOLO11l | C3k2 | 53.4 | 200 | Ultralytics | |
| YOLO11x | C3k2 | 54.7 | 100 | Ultralytics | |
| RT-DETR-L | ResNet-101 | 53.0 | 110 | Ultralytics / HuggingFace | Transformer, no NMS needed |
| DETR | ResNet-50 | 42.0 | 60 | HuggingFace | Foundational transformer detector |
| Faster R-CNN R50 | ResNet-50 | 37.0 | 50 | torchvision | Two-stage, high-precision |
Model selection heuristics
- Edge / mobile (Jetson Nano, mobile CPU): YOLO11n or YOLOv5n; use INT8 TensorRT export.
- Server real-time (>20 FPS on single GPU): YOLO11s or YOLO11m.
- Maximum accuracy, offline: YOLO11x or RT-DETR-L.
- Unusual aspect ratios or dense small objects: Consider tiled inference with YOLO.
- No NMS tuning wanted: RT-DETR removes post-processing sensitivity to IoU threshold.
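Tiled inference, mentioned in the heuristics above, starts with an overlapping-tile plan. The helper below is a hypothetical sketch that only computes tile origins; crop each tile, run the detector on it, then shift the resulting boxes by the tile's origin and de-duplicate with NMS across tiles.

```python
def tile_origins(width: int, height: int, tile: int = 640, overlap: float = 0.2):
    """Top-left corners of overlapping tiles covering a width x height image."""
    stride = max(1, int(tile * (1 - overlap)))
    xs = list(range(0, max(width - tile, 0) + 1, stride))
    ys = list(range(0, max(height - tile, 0) + 1, stride))
    # Ensure the right and bottom edges are covered
    if xs[-1] + tile < width:
        xs.append(width - tile)
    if ys[-1] + tile < height:
        ys.append(height - tile)
    return [(x, y) for y in ys for x in xs]

origins = tile_origins(1920, 1080, tile=640, overlap=0.2)
print(len(origins))              # 8 tiles cover 1080p with ~20% overlap
print(origins[0], origins[-1])   # (0, 0) (1280, 440)
```

The overlap keeps objects that straddle a tile boundary fully inside at least one tile; 20% is a reasonable starting point when objects are much smaller than the tile.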
Semantic Segmentation Models
| Model | Backbone | mIoU (Cityscapes) | mIoU (ADE20K) | Weights source | Notes |
|---|---|---|---|---|---|
| DeepLabV3 R50 | ResNet-50 | 73.5 | - | torchvision | Solid baseline |
| DeepLabV3+ R101 | ResNet-101 | 78.9 | - | torchvision | Atrous spatial pyramid |
| SegFormer-B0 | MiT-B0 | 76.2 | 37.4 | HuggingFace nvidia/segformer-b0 | Lightweight transformer |
| SegFormer-B2 | MiT-B2 | 81.0 | 46.5 | HuggingFace nvidia/segformer-b2 | Best efficiency/accuracy |
| SegFormer-B5 | MiT-B5 | 84.0 | 51.8 | HuggingFace nvidia/segformer-b5 | Highest accuracy |
| Mask2Former | Swin-B | - | 53.9 | HuggingFace facebook/mask2former-swin-base-ade-semantic | Universal segmentation |
| U-Net (custom) | ResNet/EfficientNet | varies | varies | timm encoder | Medical / satellite, custom scale |
Instance Segmentation Models
| Model | COCO mask AP | Weights source | Notes |
|---|---|---|---|
| YOLO11n-seg | 30.7 | Ultralytics | Fastest, edge |
| YOLO11m-seg | 40.8 | Ultralytics | Balanced |
| YOLO11x-seg | 43.8 | Ultralytics | Best accuracy |
| Mask R-CNN R50 | 34.6 | torchvision | Classic two-stage |
| SAM (ViT-B) | - | Meta / HuggingFace | Promptable, zero-shot masks |
| SAM (ViT-H) | - | Meta / HuggingFace | Highest quality, slow |
SAM quickstart

```python
import torch
from PIL import Image
from transformers import SamModel, SamProcessor

model = SamModel.from_pretrained("facebook/sam-vit-base").to("cuda")
processor = SamProcessor.from_pretrained("facebook/sam-vit-base")

image_pil = Image.open("image.jpg").convert("RGB")
# Point prompt at pixel (x, y)
x, y = 320, 240
inputs = processor(images=image_pil, input_points=[[[x, y]]], return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model(**inputs)
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(), inputs["original_sizes"].cpu(), inputs["reshaped_input_sizes"].cpu()
)
```

Foundation / Zero-shot Models
| Model | Task | Weights source | Notes |
|---|---|---|---|
| CLIP ViT-B/32 | Image-text matching | openai/clip-vit-base-patch32 | Zero-shot classification |
| CLIP ViT-L/14 | Image-text matching | openai/clip-vit-large-patch14 | Better accuracy |
| OpenCLIP ViT-H/14 | Image-text matching | HuggingFace laion/CLIP-ViT-H-14-laion2B | Open weights, LAION trained |
| Grounding DINO | Open-vocab detection | HuggingFace IDEA-Research/grounding-dino-base | Text-prompted detection |
| OWL-ViT | Open-vocab detection | HuggingFace google/owlvit-base-patch32 | Few-shot detection |
CLIP zero-shot classification

```python
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image_pil = Image.open("image.jpg").convert("RGB")
labels = ["a cat", "a dog", "a car"]
inputs = processor(text=labels, images=image_pil, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # [1, num_labels]
probs = logits.softmax(dim=1)
predicted = labels[probs.argmax()]
```

Pretrained Weight Sources
| Source | URL | Notes |
|---|---|---|
| torchvision | torchvision.models | Official PyTorch models |
| timm | pip install timm / HuggingFace | 1000+ models, consistent API |
| Ultralytics | ultralytics.com / pip install ultralytics | YOLO family |
| HuggingFace Hub | huggingface.co/models | SegFormer, SAM, CLIP, DINOv2 |
| Meta Research | GitHub releases | SAM, DINOv2 native checkpoints |
| ONNX Model Zoo | github.com/onnx/models | Ready-to-deploy ONNX weights |
Hardware Considerations
| Scenario | Recommended approach |
|---|---|
| Training on single GPU (<=8 GB) | EfficientNet-B0/B2 or YOLO11n/s; use AMP (torch.cuda.amp) |
| Training on multi-GPU | torch.nn.parallel.DistributedDataParallel; YOLO11 device=0,1,2,3 |
| Inference on CPU only | Export to ONNX; use OpenVINO for Intel, XNNPACK for ARM |
| Inference on Jetson (edge GPU) | Export to TensorRT FP16/INT8 with trtexec or torch2trt |
| Inference on Apple Silicon | Use mps device (torch.device("mps")); CoreML export for on-device |
| Cloud serving (throughput) | TensorRT on T4/A10G; batch size 8-32; dynamic shape engines |
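The AMP recommendation in the first row looks like this in practice. The sketch is device-agnostic (on CPU the gradient scaler is disabled and autocast falls back to bfloat16); the tiny model is a placeholder, not part of the skill.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10),
).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16

def amp_step(imgs, labels):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device, dtype=amp_dtype):
        loss = nn.functional.cross_entropy(model(imgs), labels)
    scaler.scale(loss).backward()  # scale to avoid FP16 gradient underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()

imgs = torch.randn(4, 3, 32, 32, device=device)
labels = torch.randint(0, 10, (4,), device=device)
print(amp_step(imgs, labels) > 0)  # True - CE loss of a random-init model is positive
```

AMP roughly halves activation memory and speeds up training on tensor-core GPUs, usually with no accuracy cost for these architectures.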