nlp-engineering

Use this skill when building NLP pipelines, implementing text classification, semantic search, embeddings, or summarization. Triggers on text preprocessing, tokenization, embeddings, vector search, named entity recognition, sentiment analysis, text classification, summarization, and any task requiring natural language processing.

What is nlp-engineering?

Quick Start

Open your terminal or command prompt
Run: npx skills add AbsolutelySkilled/AbsolutelySkilled --skill nlp-engineering
Start your AI coding agent (Claude Code, Cursor, Gemini CLI, or any supported agent)
The nlp-engineering skill is now active and ready to use

Overview Files

nlp-engineering

nlp-engineering is a production-ready AI agent skill for claude-code, gemini-cli, openai-codex. Building NLP pipelines, implementing text classification, semantic search, embeddings, or summarization.

Quick Facts

Field	Value
Category	ai-ml
Version	0.1.0
Platforms	claude-code, gemini-cli, openai-codex
License	MIT

How to Install

Make sure you have Node.js installed on your machine.
Run the following command in your terminal:

npx skills add AbsolutelySkilled/AbsolutelySkilled --skill nlp-engineering

The nlp-engineering skill is now available in your AI coding agent (Claude Code, Gemini CLI, OpenAI Codex, etc.).

Overview

A practical framework for building production NLP systems. This skill covers the full stack of natural language processing - from raw text ingestion through tokenization, embedding, retrieval, classification, and generation - with an emphasis on making the right architectural choices at each layer. Designed for engineers who know Python and ML basics and need opinionated guidance on building reliable, scalable text processing pipelines.

Platforms

claude-code
gemini-cli
openai-codex

Related Skills

Pair nlp-engineering with these complementary skills:

Frequently Asked Questions

What is nlp-engineering?

How do I install nlp-engineering?

Run npx skills add AbsolutelySkilled/AbsolutelySkilled --skill nlp-engineering in your terminal. The skill will be immediately available in your AI coding agent.

What AI agents support nlp-engineering?

This skill works with claude-code, gemini-cli, openai-codex. Install it once and use it across any supported AI coding agent.

Maintainers

@maddhruv

Generated from AbsolutelySkilled

SKILL.md

NLP Engineering

When to use this skill

Trigger this skill when the user:

Builds a text preprocessing or cleaning pipeline
Generates or stores embeddings for documents or queries
Implements semantic search or similarity-based retrieval
Classifies text into categories (sentiment, intent, topic, etc.)
Extracts named entities, relationships, or structured data from text
Summarizes long documents (extractive or abstractive)
Chunks documents for RAG (Retrieval-Augmented Generation) pipelines
Tunes tokenization strategies (BPE, wordpiece, whitespace)

Do NOT trigger this skill for:

Pure LLM prompt engineering or chain-of-thought with no text processing pipeline
Speech-to-text or image captioning (separate modalities with different toolchains)

Key principles

Preprocessing is load-bearing - Garbage in, garbage out. Inconsistent casing, stray HTML, and unicode noise degrade every downstream component. Invest in a reproducible cleaning pipeline before touching a model.
Match the model to the task - A 66M-parameter sentence-transformer is often better than GPT-4 embeddings for a narrow domain retrieval task, and 100x cheaper. Pick the smallest model that hits your quality bar.
Embed offline, search online - Pre-compute embeddings at index time. Doing embedding + vector search in the request path is an avoidable latency sink. Only re-embed at write time (new docs) or on model upgrade.
Chunk with overlap, not just length - Fixed-length chunking without overlap splits sentences at boundaries and degrades retrieval recall. Always use a sliding window with 10-20% overlap and respect sentence boundaries.
Evaluate before you ship - Define offline metrics (precision@k, NDCG, ROUGE, F1) before building. An NLP system without evals is a system you cannot improve or regress-test.

Core concepts

Tokenization

Tokenization converts raw text into a sequence of tokens a model can process. Modern models use subword tokenizers (BPE, WordPiece, SentencePiece) rather than whitespace splitting, allowing them to handle out-of-vocabulary words gracefully by decomposing them into known subword units.

Key considerations: token budget (LLMs have context windows), language coverage (multilingual text needs a multilingual tokenizer), and domain vocabulary (medical/legal/code text may have poor tokenization with general-purpose tokenizers).

Embeddings

An embedding is a dense vector representation of text that encodes semantic meaning. Similar texts produce vectors with high cosine similarity. Embeddings are the foundation of semantic search, clustering, and classification.

Two categories: encoding models (sentence-transformers, E5, BGE) are fast, cheap, and purpose-built for retrieval. LLM embeddings (OpenAI text-embedding-3, Cohere Embed) are convenient API calls but cost money per token and introduce external latency.

Attention and transformers

Transformers process the full token sequence in parallel using self-attention, letting every token attend to every other token. This gives transformer-based models long-range context understanding that recurrent models lacked. For NLP tasks, you almost never need to implement attention from scratch - use HuggingFace transformers and fine-tune a pretrained checkpoint.

Vector similarity

Three distance metrics dominate:

Metric	Formula (conceptual)	Best for
Cosine similarity	angle between vectors	Normalized embeddings, most retrieval
Dot product	magnitude + angle	When vector magnitude carries information
Euclidean distance	straight-line distance	Rare; prefer cosine for NLP

Most vector stores (Pinecone, Weaviate, pgvector, FAISS) default to cosine or dot product. Normalize your embeddings before storing them to make cosine and dot product equivalent.

Common tasks

Text preprocessing pipeline

Build a reproducible cleaning pipeline before any modeling step. Apply in this order: decode -> strip HTML -> normalize unicode -> lowercase -> remove noise -> normalize whitespace.

import re
import unicodedata
from bs4 import BeautifulSoup

def preprocess(text: str, lowercase: bool = True) -> str:
    # 1. Decode HTML entities and strip tags
    text = BeautifulSoup(text, "html.parser").get_text(separator=" ")

    # 2. Normalize unicode (NFD -> NFC, remove combining chars if needed)
    text = unicodedata.normalize("NFC", text)

    # 3. Lowercase
    if lowercase:
        text = text.lower()

    # 4. Remove URLs, emails, special tokens
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"\S+@\S+\.\S+", " ", text)

    # 5. Collapse whitespace
    text = re.sub(r"\s+", " ", text).strip()

    return text

# Usage
clean = preprocess("<p>Visit https://example.com for more info.</p>")
# -> "visit for more info."

Persist the preprocessing config (lowercase flag, regex patterns) alongside your model so training and inference use identical transformations.

Generate embeddings

Use sentence-transformers for local, cost-free embeddings or the OpenAI API for convenience. Always batch your calls.

# Option A: sentence-transformers (local, free, fast on GPU)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

documents = ["The quick brown fox", "Machine learning is fun", "NLP rocks"]

# encode() handles batching internally; show_progress_bar for large corpora
embeddings = model.encode(documents, normalize_embeddings=True, show_progress_bar=True)
# -> numpy array, shape (3, 384)

# Option B: OpenAI embeddings API
from openai import OpenAI

client = OpenAI()

def embed_batch(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    # Strip newlines - they degrade embedding quality per OpenAI docs
    texts = [t.replace("\n", " ") for t in texts]
    response = client.embeddings.create(input=texts, model=model)
    return [item.embedding for item in response.data]

Build semantic search

Index embeddings into a vector store and retrieve by cosine similarity at query time. This example uses FAISS for local search and pgvector for PostgreSQL.

import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

# --- Indexing ---
docs = ["Python is a programming language.", "The Eiffel Tower is in Paris.", ...]
doc_embeddings = model.encode(docs, normalize_embeddings=True).astype("float32")

# Inner product on normalized vectors = cosine similarity
index = faiss.IndexFlatIP(doc_embeddings.shape[1])
index.add(doc_embeddings)

# --- Retrieval ---
def search(query: str, top_k: int = 5) -> list[tuple[str, float]]:
    q_emb = model.encode([query], normalize_embeddings=True).astype("float32")
    scores, indices = index.search(q_emb, top_k)
    return [(docs[i], float(scores[0][j])) for j, i in enumerate(indices[0])]

results = search("programming languages for data science")
# -> [("Python is a programming language.", 0.87), ...]

For production, use faiss.IndexIVFFlat (approximate, faster) or a managed vector store (pgvector, Pinecone, Weaviate) rather than exact IndexFlatIP.

Text classification with transformers

Fine-tune a pretrained encoder for sequence classification. HuggingFace transformers + datasets is the standard stack.

from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)
import torch

MODEL_ID = "distilbert-base-uncased"
LABELS = ["negative", "neutral", "positive"]
id2label = {i: l for i, l in enumerate(LABELS)}
label2id = {l: i for i, l in enumerate(LABELS)}

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID, num_labels=len(LABELS), id2label=id2label, label2id=label2id
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

# train_data: list of {"text": str, "label": int}
train_ds = Dataset.from_list(train_data).map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="./sentiment-model",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    evaluation_strategy="epoch",
    save_strategy="best",
    load_best_model_at_end=True,
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()

Use distilbert or roberta-base for most classification tasks. Only escalate to larger models if the smaller ones underperform after fine-tuning.

NER pipeline

Use spaCy for fast rule-augmented NER or a HuggingFace token classification model for custom entity types.

import spacy
from transformers import pipeline

# Option A: spaCy (fast, battle-tested for standard entities)
nlp = spacy.load("en_core_web_sm")

def extract_entities(text: str) -> list[dict]:
    doc = nlp(text)
    return [
        {"text": ent.text, "label": ent.label_, "start": ent.start_char, "end": ent.end_char}
        for ent in doc.ents
    ]

entities = extract_entities("Apple Inc. was founded by Steve Jobs in Cupertino.")
# -> [{"text": "Apple Inc.", "label": "ORG", ...}, {"text": "Steve Jobs", "label": "PERSON", ...}]

# Option B: HuggingFace token classification (custom entities, higher accuracy)
ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple",  # merges B-/I- tokens into spans
)
results = ner("OpenAI released GPT-4 in San Francisco.")

Extractive and abstractive summarization

Choose extractive for faithfulness (no hallucination risk) and abstractive for fluency.

# --- Extractive: rank sentences by TF-IDF centrality ---
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def extractive_summary(text: str, n_sentences: int = 3) -> str:
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim_matrix = cosine_similarity(tfidf)
    scores = sim_matrix.sum(axis=1)
    top_indices = np.argsort(scores)[-n_sentences:][::-1]
    return ". ".join(sentences[i] for i in sorted(top_indices)) + "."

# --- Abstractive: seq2seq model ---
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def abstractive_summary(text: str, max_length: int = 130) -> str:
    # BART has a 1024-token context window - chunk long documents first
    result = summarizer(text, max_length=max_length, min_length=30, do_sample=False)
    return result[0]["summary_text"]

Chunking strategies for long documents

Chunking is critical for RAG quality. Poor chunking is the single most common cause of poor retrieval recall.

from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_document(
    text: str,
    chunk_size: int = 512,
    chunk_overlap: int = 64,
) -> list[dict]:
    """
    Recursive splitter tries paragraph -> sentence -> word boundaries in order.
    chunk_overlap ensures context continuity across chunk boundaries.
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    chunks = splitter.split_text(text)
    return [{"text": chunk, "chunk_index": i, "total_chunks": len(chunks)} for i, chunk in enumerate(chunks)]

# Semantic chunking (group sentences by embedding similarity instead of length)
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

semantic_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",  # split where similarity drops sharply
    breakpoint_threshold_amount=95,
)
semantic_chunks = semantic_splitter.create_documents([text])

Rule of thumb: chunk_size 256-512 tokens for precise retrieval, 512-1024 for richer context. Always store chunk metadata (source doc ID, page, position) alongside the embedding.

Anti-patterns / common mistakes

Mistake	Why it's wrong	What to do instead
Embedding raw HTML or markdown	Markup tokens poison the semantic space	Strip all markup in preprocessing before embedding
Fixed-size chunks with no overlap	Splits sentences at boundaries, breaks coherence	Use recursive splitter with 10-20% overlap
Re-embedding at query time if corpus is static	Unnecessary latency on every request	Pre-compute all embeddings offline; embed only on writes
Using Euclidean distance for text similarity	Less meaningful than cosine for high-dimensional sparse-ish vectors	Normalize embeddings and use cosine/dot product
Fine-tuning a large model before trying a small pretrained one	Expensive, slow, often unnecessary	Benchmark a frozen small model first; fine-tune only if quality gap exists
Ignoring tokenizer mismatch between training and inference	Token boundaries differ, degrading model accuracy	Use the same tokenizer class and vocab for train and serve

Gotchas

Embedding model upgrades invalidate the entire index - Switching from BAAI/bge-small-en-v1.5 to text-embedding-3-small (or any other model) produces vectors in a different semantic space. Mixing embeddings from two different models in the same index causes meaningless similarity scores. When upgrading embedding models, you must re-embed and re-index every document in the corpus before the new model can be used in production.
Preprocessing applied at index time must be applied identically at query time - If you lowercase and strip HTML when building the index but forget to apply the same preprocessing to the query string, queries produce poor recall because the normalized index vectors don't match un-normalized query vectors. Encapsulate preprocessing in a shared function called by both the indexing pipeline and the query path.
FAISS IndexFlatIP does exact search but does not scale past ~1M vectors - For production corpora above ~500K documents, exact search latency becomes unacceptable. Use IndexIVFFlat (inverted file index, approximate) or a managed vector store. The tradeoff is recall (95-99% instead of 100%) for 10-100x faster search. Benchmark recall vs. latency before committing to an ANN approach.
HuggingFace pipeline() loads the full model on every call in scripts - Calling pipeline("token-classification", model="...") inside a request handler or loop re-loads the model weights from disk on every invocation, causing massive latency. Instantiate the pipeline once at module load time (or application startup) and reuse the same instance across all requests.
Sentence boundary detection matters more than chunk size for retrieval quality - Splitting text every N characters without checking for sentence boundaries creates chunks that start or end mid-sentence. These partial-sentence chunks retrieve poorly because their vectors average semantically incomplete text. Use RecursiveCharacterTextSplitter with sentence-aware separators (["\n\n", "\n", ". ", " "]) rather than a character-count-only splitter.

References

For detailed comparison tables and implementation guidance on specific topics, read the relevant file from the references/ folder:

references/embedding-models.md - comparison of OpenAI, Cohere, sentence-transformers, E5, BGE with dimensions, benchmarks, and cost

Only load a references file if the current task requires it - they are long and will consume context.

References

embedding-models.md

Embedding Models Reference

Opinionated comparison of production embedding models as of 2024-2025. When in doubt, start with BAAI/bge-small-en-v1.5 locally or text-embedding-3-small via API, then benchmark against your actual retrieval task before upgrading.

Quick decision guide

Need	Pick
Free, fast, local, English only	`BAAI/bge-small-en-v1.5`
Free, local, multilingual	`intfloat/multilingual-e5-small`
Best local English quality	`BAAI/bge-large-en-v1.5` or `mixedbread-ai/mxbai-embed-large-v1`
API convenience, cost-sensitive	`text-embedding-3-small` (OpenAI)
API, best retrieval quality	`text-embedding-3-large` (OpenAI)
API, long context (up to 128k tokens)	`embed-english-v3.0` (Cohere)
Multilingual API	`embed-multilingual-v3.0` (Cohere)

Model comparison table

Model	Provider	Dimensions	Max Tokens	MTEB Score*	Params	Cost
`text-embedding-3-small`	OpenAI	1536 (matryoshka)	8191	62.3	-	$0.02 / 1M tokens
`text-embedding-3-large`	OpenAI	3072 (matryoshka)	8191	64.6	-	$0.13 / 1M tokens
`text-embedding-ada-002`	OpenAI	1536	8191	61.0	-	$0.10 / 1M tokens (legacy)
`embed-english-v3.0`	Cohere	1024	512 (default) / 128k	64.5	-	$0.10 / 1M tokens
`embed-multilingual-v3.0`	Cohere	1024	512 (default) / 128k	62.1	-	$0.10 / 1M tokens
`BAAI/bge-large-en-v1.5`	BGE (HuggingFace)	1024	512	64.2	335M	Free (self-hosted)
`BAAI/bge-small-en-v1.5`	BGE (HuggingFace)	384	512	62.2	33M	Free (self-hosted)
`BAAI/bge-m3`	BGE (HuggingFace)	1024	8192	63.5	568M	Free (self-hosted)
`intfloat/e5-large-v2`	E5 (HuggingFace)	1024	512	62.2	335M	Free (self-hosted)
`intfloat/multilingual-e5-large`	E5 (HuggingFace)	1024	512	61.5	560M	Free (self-hosted)
`intfloat/multilingual-e5-small`	E5 (HuggingFace)	384	512	59.3	117M	Free (self-hosted)
`mixedbread-ai/mxbai-embed-large-v1`	Mixedbread	1024	512	64.7	335M	Free (self-hosted)
`all-MiniLM-L6-v2`	SBERT	384	256	56.3	22M	Free (self-hosted)

*MTEB (Massive Text Embedding Benchmark) average across retrieval tasks. Higher is better. Scores shift slightly between benchmark runs; use as a relative guide.

Provider deep-dives

OpenAI

Best for: Teams already using the OpenAI API who want zero infrastructure overhead and a single billing relationship.

Key characteristics:

text-embedding-3-small and text-embedding-3-large support matryoshka representation learning - you can truncate the embedding to a smaller dimension (e.g., 256 from 1536) with minimal quality loss. Use this to reduce vector storage costs.
ada-002 is legacy; migrate to text-embedding-3-small - same price, better quality.
No batch size limit stated, but recommended max ~2048 inputs per API call.
Rate limits apply per organization tier; large indexing jobs need queuing.

Matryoshka truncation example:

from openai import OpenAI
import numpy as np

client = OpenAI()

def embed_truncated(texts: list[str], dimensions: int = 256) -> list[list[float]]:
    """Get embeddings at reduced dimensions to save storage cost."""
    response = client.embeddings.create(
        input=texts,
        model="text-embedding-3-small",
        dimensions=dimensions,  # built-in truncation, no quality hack
    )
    return [item.embedding for item in response.data]

When NOT to use: High-volume indexing (millions of docs), latency-sensitive paths, or air-gapped/on-premise environments.

Cohere

Best for: Applications needing long-context embedding (research papers, legal docs) or strong multilingual retrieval.

Key characteristics:

Unique 128k token context window via the input_type parameter set to "search_document" - the only API-hosted model supporting this at scale.
Exposes four input_type values: "search_document", "search_query", "classification", "clustering". Always set this - it conditions the model to produce better vectors for each use case.
embed-multilingual-v3.0 covers 100+ languages in a shared embedding space, enabling cross-lingual retrieval (query in English, match French documents).
Supports binary and int8 quantized embeddings to reduce storage/memory costs.

Usage pattern:

import cohere

co = cohere.Client("YOUR_API_KEY")

def embed_documents(texts: list[str]) -> list[list[float]]:
    response = co.embed(
        texts=texts,
        model="embed-english-v3.0",
        input_type="search_document",  # ALWAYS set input_type
        embedding_types=["float"],
    )
    return response.embeddings.float

def embed_query(query: str) -> list[float]:
    response = co.embed(
        texts=[query],
        model="embed-english-v3.0",
        input_type="search_query",  # different from document type
        embedding_types=["float"],
    )
    return response.embeddings.float[0]

When NOT to use: Simple English-only tasks where cost matters - bge-small is free and nearly as good.

sentence-transformers (SBERT)

Best for: The default local embedding stack. Handles most use cases with zero API cost.

Key characteristics:

all-MiniLM-L6-v2 is the classic starter - fast, 22M params, 256 token limit. Good for short sentences, weak for long passages.
BAAI/bge-small-en-v1.5 is now the recommended starter - better MTEB score, 384 tokens, still very fast.
Always pass normalize_embeddings=True to encode() so cosine similarity equals dot product, enabling FAISS IndexFlatIP and cheaper ANN indexes.
Supports prompt_name parameter on newer models for task-specific prefixing.

from sentence_transformers import SentenceTransformer

# For retrieval: use BGE small (recommended starter)
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

# BGE models benefit from a query prefix at inference time
query_prefix = "Represent this sentence for searching relevant passages: "

def embed_query(query: str) -> list[float]:
    return model.encode(query_prefix + query, normalize_embeddings=True).tolist()

def embed_documents(docs: list[str]) -> list[list[float]]:
    # No prefix for documents
    return model.encode(docs, normalize_embeddings=True, batch_size=64).tolist()

When NOT to use: Multilingual corpora with more than 3-4 languages (use multilingual-e5 or Cohere instead), or when you need >512 token context.

E5 (intfloat)

Best for: Teams wanting a well-documented, research-backed open model family with strong multilingual support.

Key characteristics:

E5 models use instruction prefixes: prepend "query: " to queries and "passage: " to documents. Missing this prefix degrades retrieval quality.
multilingual-e5-large covers 100 languages; multilingual-e5-small trades quality for speed.
e5-mistral-7b-instruct (7B params) achieves state-of-the-art scores but requires significant GPU memory - only viable for high-budget setups.
Apache 2.0 license on all models.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-large-v2")

def embed_query(query: str) -> list[float]:
    return model.encode("query: " + query, normalize_embeddings=True).tolist()

def embed_documents(docs: list[str]) -> list[list[float]]:
    prefixed = ["passage: " + d for d in docs]
    return model.encode(prefixed, normalize_embeddings=True, batch_size=32).tolist()

BGE (BAAI)

Best for: The highest-quality open embedding models available as of 2024. bge-large-en-v1.5 and mxbai-embed-large-v1 are the go-to local models when quality matters and you have GPU.

Key characteristics:

BAAI/bge-m3 supports three retrieval modes simultaneously: dense (embedding), sparse (BM25-like), and multi-vector (ColBERT-style late interaction). Hybrid search with one model.
bge-reranker-large is a companion cross-encoder reranker - retrieve top-100 with embedding search, then rerank with the cross-encoder for top-10 quality.
bge-small-en-v1.5 is the best speed/quality tradeoff at 33M params.

from sentence_transformers import SentenceTransformer, CrossEncoder

# Retrieval model
bi_encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")

# Reranker (run after retrieval on top-k candidates)
cross_encoder = CrossEncoder("BAAI/bge-reranker-base")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    pairs = [(query, c) for c in candidates]
    scores = cross_encoder.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [text for text, _ in ranked[:top_n]]

Dimension and storage trade-offs

Higher dimensions = better quality, more storage, slower ANN index build.

Dimensions	Relative storage (float32)	Notes
384	1.5 KB / vector	Fine for most retrieval tasks, fits in memory at scale
768	3 KB / vector	Good quality/cost balance
1024	4 KB / vector	Most large open models
1536	6 KB / vector	OpenAI ada-002 / text-embedding-3-small default
3072	12 KB / vector	text-embedding-3-large; only if quality gap is proven

At 1M vectors, 384-dim float32 = ~1.5 GB RAM. Use int8 quantization or binary embeddings (via Cohere or FAISS IndexBinaryFlat) to reduce by 4-32x with modest quality loss.

MTEB benchmark context

MTEB (Massive Text Embedding Benchmark) covers 58 datasets across 8 task types: retrieval, clustering, classification, reranking, STS, summarization, bitext mining, and pair classification. The leaderboard is at huggingface.co/spaces/mteb/leaderboard.

Important caveats:

MTEB covers general English text. Domain-specific corpora (medical, legal, code) may rank models differently.
The best MTEB score does not always win on your specific task. Always run offline evaluation on a sample of your actual queries and documents.
Newer models are added regularly. Check the leaderboard before finalizing a model choice.

Selecting a model: decision checklist

Language - English only or multilingual? Multilingual narrows to E5-multilingual, BGE-M3, or Cohere multilingual.
Deployment - Can you self-host? Local models (BGE, E5, SBERT) are free at scale. API models have per-token cost.
Context window - Docs longer than 512 tokens? BGE-M3 (8192), Cohere (128k), or chunk first.
Latency - Embedding in the request path? Use small models (33M-117M params) or API with batching.
Quality bar - Run BEIR or your own retrieval benchmark on a sample. Start small, upgrade only when the gap is proven.
Reranking - If retrieval quality is borderline, add a bge-reranker cross-encoder before expanding the embedding model.

Frequently Asked Questions

What is nlp-engineering?

How do I install nlp-engineering?

Run npx skills add AbsolutelySkilled/AbsolutelySkilled --skill nlp-engineering in your terminal. The skill will be immediately available in your AI coding agent.

What AI agents support nlp-engineering?

nlp-engineering works with claude-code, gemini-cli, openai-codex. Install it once and use it across any supported AI coding agent.

Is nlp-engineering free?

Yes, nlp-engineering is completely free and open source under the MIT license. Install it with a single command and start using it immediately.

What is the difference between nlp-engineering and similar tools?

nlp-engineering is an AI agent skill that teaches your coding agent specialized ai & machine learning knowledge. Unlike standalone tools, it integrates directly into claude-code, gemini-cli, openai-codex and other AI agents.

Can I use nlp-engineering with Cursor or Windsurf?

nlp-engineering works with any AI coding agent that supports the skills protocol, including Claude Code, Cursor, Windsurf, GitHub Copilot, Gemini CLI, and 40+ more.

nlp-engineering

What is nlp-engineering?

Quick Start

nlp-engineering

Quick Facts

How to Install

Overview

Tags

Platforms

Related Skills

Frequently Asked Questions

What is nlp-engineering?

How do I install nlp-engineering?

What AI agents support nlp-engineering?

Maintainers

SKILL.md

NLP Engineering

When to use this skill

Key principles

Core concepts

Tokenization

Embeddings

Attention and transformers

Vector similarity

Common tasks

Text preprocessing pipeline

Generate embeddings

Build semantic search

Text classification with transformers

NER pipeline

Extractive and abstractive summarization

Chunking strategies for long documents

Anti-patterns / common mistakes

Gotchas

References

References

embedding-models.md

Embedding Models Reference

Quick decision guide

Model comparison table

Provider deep-dives

OpenAI

Cohere

sentence-transformers (SBERT)

E5 (intfloat)

BGE (BAAI)

Dimension and storage trade-offs

MTEB benchmark context

Selecting a model: decision checklist

Frequently Asked Questions

What is nlp-engineering?

How do I install nlp-engineering?

What AI agents support nlp-engineering?

Is nlp-engineering free?

What is the difference between nlp-engineering and similar tools?

Can I use nlp-engineering with Cursor or Windsurf?