llm-app-development
Use this skill when building production LLM applications, implementing guardrails, evaluating model outputs, or deciding between prompting and fine-tuning. Triggers on LLM app architecture, AI guardrails, output evaluation, model selection, embedding pipelines, vector databases, fine-tuning, function calling, tool use, and any task requiring production AI application design.
llm-app-development
llm-app-development is a production-ready AI agent skill for claude-code, gemini-cli, and openai-codex. It covers building production LLM applications, implementing guardrails, evaluating model outputs, and deciding between prompting and fine-tuning.
Quick Facts
| Field | Value |
|---|---|
| Category | ai-ml |
| Version | 0.1.0 |
| Platforms | claude-code, gemini-cli, openai-codex |
| License | MIT |
How to Install
- Make sure you have Node.js installed on your machine.
- Run the following command in your terminal:
npx skills add AbsolutelySkilled/AbsolutelySkilled --skill llm-app-development
- The llm-app-development skill is now available in your AI coding agent (Claude Code, Gemini CLI, OpenAI Codex, etc.).
Overview
Building production LLM applications requires more than prompt engineering - it demands the same reliability, observability, and safety thinking applied to any critical system. This skill covers the full stack: architecture, guardrails, evaluation pipelines, RAG, function calling, streaming, and cost optimization. It emphasizes when patterns apply and what to do when they fail, not just happy-path implementation.
Tags
llm ai-apps guardrails evaluation fine-tuning production
Platforms
- claude-code
- gemini-cli
- openai-codex
Frequently Asked Questions
What is llm-app-development?
Use this skill when building production LLM applications, implementing guardrails, evaluating model outputs, or deciding between prompting and fine-tuning. Triggers on LLM app architecture, AI guardrails, output evaluation, model selection, embedding pipelines, vector databases, fine-tuning, function calling, tool use, and any task requiring production AI application design.
How do I install llm-app-development?
Run npx skills add AbsolutelySkilled/AbsolutelySkilled --skill llm-app-development in your terminal. The skill will be immediately available in your AI coding agent.
What AI agents support llm-app-development?
This skill works with claude-code, gemini-cli, openai-codex. Install it once and use it across any supported AI coding agent.
Maintainers
Generated from AbsolutelySkilled
SKILL.md
LLM App Development
Building production LLM applications requires more than prompt engineering - it demands the same reliability, observability, and safety thinking applied to any critical system. This skill covers the full stack: architecture, guardrails, evaluation pipelines, RAG, function calling, streaming, and cost optimization. It emphasizes when patterns apply and what to do when they fail, not just happy-path implementation.
When to use this skill
Trigger this skill when the user:
- Designs the architecture for a new LLM-powered application or feature
- Implements content filtering, PII detection, or schema validation on model I/O
- Builds or improves an evaluation pipeline (automated evals, human review, A/B tests)
- Sets up a RAG pipeline (chunking, embedding, retrieval, reranking)
- Adds function calling or tool use to an agent or chat interface
- Streams LLM responses to a client (SSE, token-by-token rendering)
- Optimizes inference cost or latency (caching, model routing, prompt compression)
- Decides whether to fine-tune a model or improve prompting instead
Do NOT trigger this skill for:
- Pure ML research, model training from scratch, or academic benchmarking
- Questions about a specific AI framework API (use the framework's own skill, e.g., mastra)
Key principles
Evaluate before you ship - A feature without evals is a feature you cannot safely iterate on. Define success metrics and build automated checks before the first production deployment.
Guardrails are non-negotiable - Validate both input and output on every production request. Content filtering, PII scrubbing, and schema validation belong in your request path, not as optional post-processing.
Start with prompting before fine-tuning - Fine-tuning is expensive, slow to iterate, and hard to maintain. Exhaust systematic prompt engineering, few-shot examples, and RAG before considering fine-tuning.
Design for failure and fallback - LLM calls fail: timeouts, rate limits, malformed outputs, hallucinations. Every integration needs retry logic, output validation, and a fallback response.
Cost-optimize from day one - Track token usage per feature. Cache deterministic outputs. Route cheap queries to smaller models. Set hard budget limits.
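The design-for-failure principle above can be sketched as a small retry wrapper. `withRetry`, `callFn`, and `fallback` are illustrative names, not part of any SDK; tune the attempt count and base delay to your provider's rate limits.

```typescript
// Sketch: retry with exponential backoff, then degrade to a static fallback.
async function withRetry<T>(
  callFn: () => Promise<T>,
  fallback: T,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await callFn()
    } catch {
      if (attempt === maxAttempts - 1) return fallback // exhausted: degrade gracefully
      // exponential backoff: 500ms, 1000ms, 2000ms, ...
      await new Promise(r => setTimeout(r, baseDelayMs * 2 ** attempt))
    }
  }
  return fallback
}
```

Wrap every production model call this way so a rate limit or timeout becomes a degraded answer rather than a user-facing 500.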
Core concepts
LLM app stack
User input
-> Input guardrails (safety, PII, token limits)
-> Prompt construction (system prompt, context, few-shots, retrieved docs)
-> Model call (streaming or batch)
-> Output guardrails (schema validation, content check, hallucination detection)
-> Post-processing (formatting, citations, structured extraction)
-> Response to user
Every layer is an independent failure point and must be observable.
Embedding / vector DB architecture
Documents are chunked into overlapping segments, embedded into dense vectors, and stored in a vector database. At query time the user message is embedded, similar chunks are retrieved via ANN search, optionally reranked by a cross-encoder, and injected into the context window. Chunk quality determines retrieval quality more than model choice.
Caching strategies
| Layer | What to cache | TTL |
|---|---|---|
| Exact cache | Identical prompt+params hash | Hours to days |
| Semantic cache | Fuzzy-match on embedding similarity | Minutes to hours |
| Embedding cache | Vectors for known documents | Until doc changes |
| KV prefix cache | Shared system prompt prefix (provider-side) | Session |
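The semantic-cache row in the table above can be sketched roughly as follows. `SemanticCache` and the injected `embedFn` are hypothetical names, and a linear scan stands in for a real ANN index; the 0.97 threshold is a conservative starting point for factual queries.

```typescript
// Sketch: fuzzy cache lookup keyed on embedding cosine similarity.
interface SemanticEntry { embedding: number[]; value: string }

function cosineSim(a: number[], b: number[]): number {
  const dot = a.reduce((s, v, i) => s + v * b[i], 0)
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0))
  return dot / (norm(a) * norm(b))
}

class SemanticCache {
  private entries: SemanticEntry[] = []
  constructor(
    private embedFn: (text: string) => Promise<number[]>, // e.g. wraps your embeddings API
    private threshold = 0.97,
  ) {}

  async get(prompt: string): Promise<string | undefined> {
    const q = await this.embedFn(prompt)
    let best: { score: number; value: string } | undefined
    for (const e of this.entries) {
      const score = cosineSim(q, e.embedding)
      if (score >= this.threshold && (!best || score > best.score))
        best = { score, value: e.value }
    }
    return best?.value // undefined = cache miss
  }

  async set(prompt: string, value: string): Promise<void> {
    this.entries.push({ embedding: await this.embedFn(prompt), value })
  }
}
```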
Common tasks
Design LLM app architecture
Key decisions before writing code:
| Decision | Options | Guide |
|---|---|---|
| Context strategy | Long context vs RAG | RAG if >50% of context is static documents |
| Output mode | Free text, structured JSON, tool calls | Use structured output for any downstream processing |
| State | Stateless, session, persistent memory | Default stateless; add memory only when proven necessary |
import OpenAI from 'openai'
const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })
async function callLLM(systemPrompt: string, userMessage: string, model = 'gpt-4o-mini'): Promise<string> {
const controller = new AbortController()
const timeout = setTimeout(() => controller.abort(), 30_000)
try {
const res = await client.chat.completions.create(
{ model, max_tokens: 1024, messages: [{ role: 'system', content: systemPrompt }, { role: 'user', content: userMessage }] },
{ signal: controller.signal },
)
return res.choices[0].message.content ?? ''
} finally {
clearTimeout(timeout)
}
}
Implement input/output guardrails
import { z } from 'zod'
const PII_PATTERNS = [
/\b\d{3}-\d{2}-\d{4}\b/g, // SSN
/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b/gi, // email
/\b(?:\d{4}[ -]?){3}\d{4}\b/g, // credit card
]
function scrubPII(text: string): string {
return PII_PATTERNS.reduce((t, re) => t.replace(re, '[REDACTED]'), text)
}
function validateInput(text: string): { ok: boolean; reason?: string } {
if (text.split(/\s+/).length > 4000) return { ok: false, reason: 'Input too long' }
return { ok: true }
}
const SummarySchema = z.object({
summary: z.string().min(10).max(500),
keyPoints: z.array(z.string()).min(1).max(10),
confidence: z.number().min(0).max(1),
})
async function getSummaryWithGuardrails(text: string) {
const v = validateInput(text)
if (!v.ok) throw new Error(`Input rejected: ${v.reason}`)
const raw = await callLLM('Respond only with valid JSON.', `Summarize as JSON: ${scrubPII(text)}`)
return SummarySchema.parse(JSON.parse(raw)) // throws ZodError if schema invalid
}
Build an evaluation pipeline
interface EvalCase {
id: string
input: string
expectedContains?: string[]
expectedNotContains?: string[]
scoreThreshold?: number // 0-1 for LLM-as-judge
}
async function runEval(ec: EvalCase, modelFn: (input: string) => Promise<string>) {
const output = await modelFn(ec.input)
for (const s of ec.expectedContains ?? [])
if (!output.includes(s)) return { id: ec.id, passed: false, details: `Missing: "${s}"` }
for (const s of ec.expectedNotContains ?? [])
if (output.includes(s)) return { id: ec.id, passed: false, details: `Forbidden: "${s}"` }
if (ec.scoreThreshold !== undefined) {
const score = await judgeOutput(ec.input, output)
if (score < ec.scoreThreshold) return { id: ec.id, passed: false, details: `Score ${score} < ${ec.scoreThreshold}` }
}
return { id: ec.id, passed: true, details: 'OK' }
}
async function judgeOutput(input: string, output: string): Promise<number> {
const score = await callLLM(
'You are a strict evaluator. Reply with only a number from 0.0 to 1.0.',
`Input: ${input}\n\nOutput: ${output}\n\nScore quality (0.0=poor, 1.0=excellent):`,
'gpt-4o',
)
const n = parseFloat(score)
return Number.isNaN(n) ? 0 : Math.min(1, Math.max(0, n)) // clamp; unparseable judge output scores 0
}
Load references/evaluation-framework.md for metrics, benchmarks, and human-in-the-loop protocols.
Implement RAG with vector search
import OpenAI from 'openai'
const client = new OpenAI()
function chunkText(text: string, size = 512, overlap = 64): string[] {
const words = text.split(/\s+/)
const chunks: string[] = []
for (let i = 0; i < words.length; i += size - overlap) {
chunks.push(words.slice(i, i + size).join(' '))
if (i + size >= words.length) break
}
return chunks
}
async function embedTexts(texts: string[]): Promise<number[][]> {
const res = await client.embeddings.create({ model: 'text-embedding-3-small', input: texts })
return res.data.map(d => d.embedding)
}
function cosine(a: number[], b: number[]): number {
const dot = a.reduce((s, v, i) => s + v * b[i], 0)
return dot / (Math.sqrt(a.reduce((s, v) => s + v * v, 0)) * Math.sqrt(b.reduce((s, v) => s + v * v, 0)))
}
interface DocChunk { text: string; embedding: number[] }
async function ragQuery(question: string, store: DocChunk[], topK = 5): Promise<string> {
const [qEmbed] = await embedTexts([question])
const context = store
.map(c => ({ text: c.text, score: cosine(qEmbed, c.embedding) }))
.sort((a, b) => b.score - a.score).slice(0, topK).map(r => r.text)
return callLLM(
'Answer using only the provided context. If not found, say "I don\'t know."',
`Context:\n${context.join('\n---\n')}\n\nQuestion: ${question}`,
)
}
Add function calling / tool use
import OpenAI from 'openai'
const client = new OpenAI()
type ToolHandlers = Record<string, (args: Record<string, unknown>) => Promise<string>>
const tools: OpenAI.ChatCompletionTool[] = [{
type: 'function',
function: {
name: 'get_weather',
description: 'Get current weather for a city.',
parameters: {
type: 'object',
properties: { city: { type: 'string' }, units: { type: 'string', enum: ['celsius', 'fahrenheit'] } },
required: ['city'],
},
},
}]
async function runWithTools(userMessage: string, handlers: ToolHandlers): Promise<string> {
const messages: OpenAI.ChatCompletionMessageParam[] = [{ role: 'user', content: userMessage }]
for (let step = 0; step < 5; step++) { // cap tool-use loops to prevent infinite recursion
const res = await client.chat.completions.create({ model: 'gpt-4o', tools, messages })
const choice = res.choices[0]
messages.push(choice.message)
if (choice.finish_reason === 'stop') return choice.message.content ?? ''
for (const tc of choice.message.tool_calls ?? []) {
const fn = handlers[tc.function.name]
if (!fn) throw new Error(`Unknown tool: ${tc.function.name}`)
const result = await fn(JSON.parse(tc.function.arguments) as Record<string, unknown>)
messages.push({ role: 'tool', tool_call_id: tc.id, content: result })
}
}
throw new Error('Tool call loop exceeded max steps')
}
Implement streaming responses
import OpenAI from 'openai'
import type { Response } from 'express'
const client = new OpenAI()
async function streamToResponse(prompt: string, res: Response): Promise<void> {
res.setHeader('Content-Type', 'text/event-stream')
res.setHeader('Cache-Control', 'no-cache')
res.setHeader('Connection', 'keep-alive')
const stream = await client.chat.completions.create({
model: 'gpt-4o-mini', stream: true,
messages: [{ role: 'user', content: prompt }],
})
let fullText = ''
for await (const chunk of stream) {
const token = chunk.choices[0]?.delta?.content
if (token) { fullText += token; res.write(`data: ${JSON.stringify({ token })}\n\n`) }
}
runOutputGuardrails(fullText) // validate after stream completes
res.write('data: [DONE]\n\n')
res.end()
}
// Client-side consumption
function consumeStream(url: string, onToken: (t: string) => void): void {
const es = new EventSource(url)
es.onmessage = (e) => {
if (e.data === '[DONE]') { es.close(); return }
onToken((JSON.parse(e.data) as { token: string }).token)
}
}
function runOutputGuardrails(_text: string): void { /* content policy / schema checks */ }
Optimize cost and latency
import crypto from 'crypto'
const cache = new Map<string, { value: string; expiresAt: number }>()
async function cachedLLMCall(prompt: string, model = 'gpt-4o-mini', ttlMs = 3_600_000): Promise<string> {
const key = crypto.createHash('sha256').update(`${model}:${prompt}`).digest('hex')
const cached = cache.get(key)
if (cached && cached.expiresAt > Date.now()) return cached.value
const result = await callLLM('', prompt, model)
cache.set(key, { value: result, expiresAt: Date.now() + ttlMs })
return result
}
// Route to cheaper model based on prompt complexity
function routeModel(prompt: string): string {
const words = prompt.split(/\s+/).length
if (words < 50) return 'gpt-4o-mini'
if (words < 300) return 'gpt-4o-mini'
return 'gpt-4o'
}
// Strip redundant whitespace to reduce token count
const compressPrompt = (p: string): string => p.replace(/\s{2,}/g, ' ').trim()
Anti-patterns / common mistakes
| Anti-pattern | Problem | Fix |
|---|---|---|
| No input validation | Prompt injection, jailbreaks, oversized inputs | Enforce max tokens, topic filters, and PII scrubbing before every call |
| Trusting raw model output | JSON parse errors, hallucinated fields break downstream code | Always validate output against a Zod or JSON Schema |
| Fine-tuning as first resort | Weeks of work, costly, hard to update; usually unnecessary | Exhaust few-shot prompting and RAG first |
| Ignoring token costs in dev | Small test prompts hide 10x token usage in production | Log token counts per call from day one; set usage alerts |
| Single monolithic prompt | Hard to test or improve any individual step | Decompose into a pipeline of smaller, testable prompt steps |
| No fallback on LLM failure | Rate limits or downtime = user-facing 500 errors | Retry with exponential backoff; fall back to smaller model or cached response |
Gotchas
Streaming guardrails can only run post-completion - You cannot validate a streamed response mid-stream for content policy or schema compliance. The full text is only available after the last token. Run output guardrails after the stream ends, and design your client to handle a late rejection (e.g., replace streamed content with an error state) rather than assuming the stream is always valid.
JSON mode does not guarantee valid JSON on all providers - OpenAI's response_format: { type: "json_object" } reduces but does not eliminate parse errors, especially on long outputs that hit max_tokens. Always wrap JSON.parse() in a try/catch and treat a parse failure as a retriable error, not a crash.
RAG retrieval quality is dominated by chunk boundaries, not embedding models - Switching from text-embedding-3-small to text-embedding-3-large rarely fixes poor retrieval. Poor recall almost always traces to chunks that split mid-sentence or mid-concept. Fix chunking strategy (overlapping windows, semantic boundaries) before upgrading the embedding model.
Tool call loops can exceed maxSteps silently on some SDKs - If the model keeps calling tools without emitting a stop finish reason, some SDK wrappers will retry indefinitely. Always set an explicit maxSteps cap and treat a loop-exceeded condition as a hard error, not a retry.
Semantic caches can return stale or incorrect answers for slightly rephrased queries - A semantic cache that matches "What is the capital of France?" to "Tell me the capital of France" is fine. But caches with broad similarity thresholds can match unrelated questions with similar wording. Set cosine similarity thresholds conservatively (0.97+) for factual queries; use exact caching only for truly deterministic prompts.
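The JSON-mode gotcha above suggests treating a parse failure as retriable rather than a crash. A minimal sketch, with illustrative names (`RetriableParseError`, `parseModelJSON`, `generateJSON`) and an injected `generate` function standing in for the model call:

```typescript
// Sketch: parse failures become a typed, retriable error instead of a crash.
class RetriableParseError extends Error {}

function parseModelJSON<T>(raw: string): T {
  try {
    return JSON.parse(raw) as T
  } catch {
    throw new RetriableParseError(`Model returned invalid JSON: ${raw.slice(0, 80)}`)
  }
}

async function generateJSON<T>(generate: () => Promise<string>, maxAttempts = 2): Promise<T> {
  let lastErr: unknown
  for (let i = 0; i < maxAttempts; i++) {
    try {
      return parseModelJSON<T>(await generate())
    } catch (err) {
      if (!(err instanceof RetriableParseError)) throw err // non-parse errors propagate
      lastErr = err // parse failure: retry the generation
    }
  }
  throw lastErr
}
```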
References
For detailed content on specific sub-domains, load the relevant reference file:
references/evaluation-framework.md - metrics, benchmarks, human eval protocols, automated testing, A/B testing, eval dataset design
Only load a reference file when the task specifically requires it - they are long and will consume significant context.
References
evaluation-framework.md
LLM Evaluation Framework
Evaluation is the discipline of measuring whether your LLM application does what you intend - reliably, safely, and at the quality level your users expect. There is no single metric; a complete eval strategy combines automated checks, model-based scoring, and structured human review.
Why evals matter
Without evals you cannot:
- Know whether a prompt change improved or regressed quality
- Catch regressions before they reach users
- Build confidence that guardrails are actually working
- Make data-driven decisions about model upgrades or fine-tuning
Build your eval suite before your first production deployment, not after.
Eval types
| Type | When to use | Latency | Cost |
|---|---|---|---|
| Deterministic string checks | Known phrases, citations, forbidden content | <1 ms | Free |
| Regex / structural checks | Format, JSON schema, URL patterns | <1 ms | Free |
| LLM-as-judge | Fluency, helpfulness, coherence, tone | ~1 s | Moderate |
| Human eval | Ambiguous quality, safety edge cases, ground truth labeling | Days | High |
| A/B / shadow testing | Comparing two model versions on real traffic | Real-time | Low |
| Embedding similarity | Semantic equivalence when wording varies | ~10 ms | Low |
Use deterministic checks as the first gate. Add LLM-as-judge only when deterministic checks are insufficient.
Core metrics
Faithfulness (RAG)
Does the answer contain only information present in the retrieved context, or does it hallucinate facts?
const FAITHFULNESS_PROMPT = `
You are a strict fact-checker.
Context provided to the model:
{context}
Model answer:
{answer}
Does every factual claim in the answer appear in the context above?
Reply with JSON: { "faithful": true|false, "violations": ["..."] }
`
Score: 0 (hallucination) to 1 (fully grounded). Target > 0.95 for production RAG.
Answer relevance
Does the answer address what the user actually asked?
async function scoreRelevance(question: string, answer: string): Promise<number> {
const prompt = `
Question: ${question}
Answer: ${answer}
Rate how well the answer addresses the question on a scale of 0.0 to 1.0.
Reply with only a number.
`
const score = await callJudge(prompt)
return parseFloat(score)
}
Context precision and recall (RAG)
- Precision: what fraction of retrieved chunks were actually useful?
- Recall: did retrieval include all the chunks needed to answer?
High recall requires tuning topK and chunk size. High precision requires reranking. In practice, optimize recall first, then add a reranker to improve precision.
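Once retrieved chunk IDs are compared against a labeled set of relevant chunks, both metrics are direct ratios. A minimal sketch (`retrievalMetrics` is an illustrative helper):

```typescript
// Sketch: precision/recall over retrieved chunk IDs vs. a labeled relevant set.
function retrievalMetrics(retrieved: string[], relevant: string[]) {
  const relevantSet = new Set(relevant)
  const hits = retrieved.filter(id => relevantSet.has(id)).length
  return {
    precision: retrieved.length ? hits / retrieved.length : 0, // useful fraction of what was retrieved
    recall: relevant.length ? hits / relevant.length : 0,      // fraction of needed chunks found
  }
}
```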
Toxicity / safety
Run every output through a classifier (OpenAI moderation API, Perspective API, or a fine-tuned classifier). Track the rate of flagged outputs per model version.
async function checkToxicity(text: string): Promise<boolean> {
const response = await openai.moderations.create({ input: text })
return response.results[0].flagged
}
Latency percentiles
Track p50, p95, p99 end-to-end latency (user request to first token and to complete response). LLM latency is highly variable - p99 matters more than mean.
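A nearest-rank percentile over recorded latency samples is enough to get started; a minimal sketch (`percentile` is an illustrative helper, not a library API):

```typescript
// Sketch: nearest-rank percentile over latency samples in milliseconds.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b)
  const rank = Math.ceil((p / 100) * sorted.length) // nearest-rank method
  return sorted[Math.max(0, rank - 1)]
}
```

Feed it per-request timings and report p50/p95/p99 per model version.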
Cost per query
function estimateCost(promptTokens: number, completionTokens: number, model: string): number {
const rates: Record<string, { input: number; output: number }> = {
'gpt-4o': { input: 0.0000025, output: 0.00001 }, // per token
'gpt-4o-mini': { input: 0.00000015, output: 0.0000006 },
}
const rate = rates[model] ?? rates['gpt-4o-mini']
return promptTokens * rate.input + completionTokens * rate.output
}
Eval dataset design
A good eval dataset has:
- Coverage: happy path, edge cases, adversarial inputs, domain-specific fixtures
- Ground truth: known-correct answers (for extractive tasks) or human-labeled quality scores (for generative tasks)
- Diversity: different user intents, lengths, languages, and phrasings
- Freshness: rotate in real production queries that were flagged or escalated
Minimum viable eval set: 50-100 cases. Production-grade: 500+ with stratified sampling.
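Stratified sampling by category can be sketched as follows. `stratifiedSample` is an illustrative helper that takes the first N per stratum; a real pipeline would shuffle within each stratum first.

```typescript
// Sketch: group eval cases by category so rare strata are not crowded out.
function stratifiedSample<T>(items: T[], keyOf: (t: T) => string, perStratum: number): T[] {
  const groups = new Map<string, T[]>()
  for (const item of items) {
    const k = keyOf(item)
    if (!groups.has(k)) groups.set(k, [])
    groups.get(k)!.push(item)
  }
  // take up to perStratum from each stratum
  return [...groups.values()].flatMap(g => g.slice(0, perStratum))
}
```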
Example eval case structure
interface EvalCase {
id: string
category: 'happy-path' | 'edge-case' | 'adversarial' | 'regression'
input: {
userMessage: string
context?: string // for RAG evals
}
expected: {
contains?: string[]
notContains?: string[]
schema?: Record<string, unknown> // JSON Schema
minScore?: number // for LLM-as-judge
}
tags: string[]
}
LLM-as-judge patterns
Single-answer scoring
const JUDGE_PROMPT = `
You are an expert evaluator. Score the following response on a scale of 0-10.
Criteria:
- Accuracy: is the information correct? (0-4 points)
- Helpfulness: does it address the user's need? (0-3 points)
- Conciseness: is it appropriately brief without losing substance? (0-3 points)
User question: {question}
Model response: {response}
Reply with JSON: { "score": <0-10>, "accuracy": <0-4>, "helpfulness": <0-3>, "conciseness": <0-3>, "reasoning": "..." }
`
Pairwise comparison (A/B eval)
const PAIRWISE_PROMPT = `
Compare two responses to the same question. Which is better?
Question: {question}
Response A: {responseA}
Response B: {responseB}
Reply with JSON: { "winner": "A"|"B"|"tie", "reasoning": "..." }
`
Aggregate wins across your eval set to compare model versions. Prefer pairwise eval over absolute scoring - it is more reliable.
Calibration
LLM judges are biased toward:
- Longer responses (verbosity bias)
- Responses that appear first (position bias)
- Their own outputs (self-preference bias)
Mitigate by: running both A/B and B/A orderings and averaging, penalizing length explicitly in the rubric, and using a different model family as judge.
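The ordering mitigation can be sketched as a wrapper that judges both orderings and only accepts a verdict when they agree. `judgePair` is an injected judge function (not a real API) that returns a verdict for the pair in the order shown.

```typescript
// Sketch: cancel position bias by judging A/B and B/A and requiring agreement.
type Verdict = 'A' | 'B' | 'tie'

async function debiasedCompare(
  a: string,
  b: string,
  judgePair: (first: string, second: string) => Promise<Verdict>,
): Promise<Verdict> {
  const forward = await judgePair(a, b)     // A shown first
  const reversedRaw = await judgePair(b, a) // B shown first
  // map the reversed verdict back into A/B terms
  const reversed: Verdict = reversedRaw === 'A' ? 'B' : reversedRaw === 'B' ? 'A' : 'tie'
  if (forward === reversed) return forward  // orderings agree: trust the verdict
  return 'tie'                              // disagreement suggests position bias
}
```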
Human evaluation protocols
When to use human eval
- Ground truth labeling for a new eval set
- Calibrating your LLM-as-judge (check its agreement rate with humans)
- Safety review for content policy edge cases
- Validating a significant model or prompt change before launch
Annotation guidelines
- Define rubrics precisely - "helpful" is ambiguous; "answers the specific question without unnecessary caveats" is not
- Use at least 2 annotators per item; measure inter-annotator agreement (Cohen's kappa)
- Target kappa > 0.6 for subjective quality; > 0.8 for safety/factuality
- Include calibration examples at the start of every annotation session
- Rotate annotators to avoid fatigue-induced drift
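Cohen's kappa for two annotators can be computed directly from their label lists; a minimal sketch assuming categorical labels and equal-length annotations (`cohensKappa` is an illustrative helper):

```typescript
// Sketch: Cohen's kappa = (observed agreement - chance agreement) / (1 - chance agreement).
function cohensKappa(a: string[], b: string[]): number {
  const n = a.length
  const labels = [...new Set([...a, ...b])]
  const observed = a.filter((v, i) => v === b[i]).length / n
  // chance agreement from each annotator's marginal label frequencies
  const expected = labels.reduce((s, l) => {
    const pa = a.filter(v => v === l).length / n
    const pb = b.filter(v => v === l).length / n
    return s + pa * pb
  }, 0)
  return expected === 1 ? 1 : (observed - expected) / (1 - expected)
}
```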
Annotation template
Item ID: ___________
Annotator: ___________
Date: ___________
User message:
[text]
Model response:
[text]
Scores (1-5):
- Accuracy: [ ] 1 [ ] 2 [ ] 3 [ ] 4 [ ] 5
- Helpfulness: [ ] 1 [ ] 2 [ ] 3 [ ] 4 [ ] 5
- Safety: [ ] 1 [ ] 2 [ ] 3 [ ] 4 [ ] 5
Flags:
[ ] Hallucination [ ] PII leak [ ] Harmful content [ ] Off-topic
Notes:
[free text]
A/B and shadow testing
Shadow mode
Run the new model in parallel with the production model. Log both outputs. Do not show the new output to users yet. Compare metrics offline.
async function shadowCall(prompt: string): Promise<{ production: string; shadow: string }> {
const [production, shadow] = await Promise.all([
callLLM('', prompt, 'gpt-4o-mini'),
callLLM('', prompt, 'gpt-4o'),
])
logShadowComparison({ prompt, production, shadow })
return { production, shadow }
}
Gradual rollout
- Shadow: 0% of users see new model; compare metrics
- Canary: 5% of users; watch error rates and user feedback signals
- Ramp: 25% -> 50% -> 100% if metrics hold
Never jump from shadow to 100%.
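A deterministic percentage rollout can be sketched by hashing the user ID into a stable bucket, so a given user always sees the same variant during a canary phase. `inCanary` is an illustrative helper, not part of any framework.

```typescript
// Sketch: stable percentage-based canary assignment from a user ID hash.
import crypto from 'crypto'

function inCanary(userId: string, rolloutPercent: number): boolean {
  const h = crypto.createHash('sha256').update(userId).digest()
  const bucket = h.readUInt32BE(0) % 100 // stable bucket in [0, 100)
  return bucket < rolloutPercent
}
```

Ramp by raising `rolloutPercent` (5 -> 25 -> 50 -> 100) while watching error rates at each step.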
Regression testing in CI
Run your eval suite on every prompt change, dependency upgrade, or model version bump.
// eval.test.ts (Jest / Vitest)
import { describe, it, expect } from 'vitest'
import evalCases from './eval-cases.json'
import { runEval } from './eval-runner'
import { myModelFn } from '../src/llm'
describe('LLM eval suite', () => {
for (const evalCase of evalCases) {
it(`${evalCase.id}: ${evalCase.description}`, async () => {
const result = await runEval(evalCase, myModelFn)
expect(result.passed).toBe(true)
}, 30_000)
}
})
CI budget tip: use a fast, cheap model (gpt-4o-mini) for deterministic checks in every PR. Reserve expensive judge calls for nightly runs or pre-release gates.
Benchmarks and external references
| Benchmark | What it measures | Use when |
|---|---|---|
| MMLU | Broad knowledge across 57 subjects | Comparing general-purpose models |
| HumanEval / MBPP | Code generation correctness | Choosing a model for coding tasks |
| TruthfulQA | Tendency to hallucinate common misconceptions | RAG and knowledge-retrieval apps |
| MT-Bench | Multi-turn conversation quality | Chat and assistant applications |
| RAGAS | RAG-specific: faithfulness, relevance, recall | Building or tuning RAG pipelines |
| HellaSwag | Common-sense reasoning | Reasoning-heavy pipelines |
External benchmarks give relative model comparisons. They do NOT replace task-specific evals on your own data. Always build domain evals alongside benchmarks.