video-audio-design
Use this skill when adding audio to programmatic videos - generating narration with ElevenLabs TTS, sourcing royalty-free background music, creating SFX with FFmpeg, implementing audio ducking, or mixing multiple audio layers in Remotion. Triggers on ElevenLabs, text-to-speech, voice generation, background music, sound effects, audio mixing, and volume ducking.
What is video-audio-design?
video-audio-design is a production-ready AI agent skill for claude-code, gemini-cli, and openai-codex. It covers adding audio to programmatic videos: generating narration with ElevenLabs TTS, sourcing royalty-free background music, creating SFX with FFmpeg, implementing audio ducking, and mixing multiple audio layers in Remotion.
Quick Facts
| Field | Value |
|---|---|
| Category | video |
| Version | 0.1.0 |
| Platforms | claude-code, gemini-cli, openai-codex |
| License | MIT |
How to Install
- Make sure you have Node.js installed on your machine.
- Run the following command in your terminal:
npx skills add AbsolutelySkilled/AbsolutelySkilled --skill video-audio-design
- The video-audio-design skill is now available in your AI coding agent (Claude Code, Gemini CLI, OpenAI Codex, etc.).
Overview
Video audio design is the practice of layering narration, sound effects, and background music into programmatic video compositions. Great audio transforms a slide-deck video into a polished production - narration guides the viewer, music sets the emotional tone, and SFX punctuate key moments. This skill covers generating speech with ElevenLabs and alternative TTS providers, creating synthetic sound effects with FFmpeg, sourcing royalty-free background music, implementing audio ducking so speech stays intelligible, and mixing all layers together in Remotion compositions with frame-accurate timing.
Tags
elevenlabs tts audio-design sfx background-music audio-mixing
Platforms
- claude-code
- gemini-cli
- openai-codex
Frequently Asked Questions
What is video-audio-design?
Use this skill when adding audio to programmatic videos - generating narration with ElevenLabs TTS, sourcing royalty-free background music, creating SFX with FFmpeg, implementing audio ducking, or mixing multiple audio layers in Remotion. Triggers on ElevenLabs, text-to-speech, voice generation, background music, sound effects, audio mixing, and volume ducking.
How do I install video-audio-design?
Run npx skills add AbsolutelySkilled/AbsolutelySkilled --skill video-audio-design in your terminal. The skill will be immediately available in your AI coding agent.
What AI agents support video-audio-design?
This skill works with claude-code, gemini-cli, openai-codex. Install it once and use it across any supported AI coding agent.
Maintainers
Generated from AbsolutelySkilled
SKILL.md
Video Audio Design
Video audio design is the practice of layering narration, sound effects, and background music into programmatic video compositions. Great audio transforms a slide-deck video into a polished production - narration guides the viewer, music sets the emotional tone, and SFX punctuate key moments. This skill covers generating speech with ElevenLabs and alternative TTS providers, creating synthetic sound effects with FFmpeg, sourcing royalty-free background music, implementing audio ducking so speech stays intelligible, and mixing all layers together in Remotion compositions with frame-accurate timing.
When to use this skill
Trigger this skill when the user:
- Wants to add narration or voiceover to a programmatic video
- Needs to generate speech with ElevenLabs, OpenAI TTS, or Edge TTS
- Asks about voice selection, voice settings, or voice cloning
- Wants to add background music or needs royalty-free music sources
- Asks about creating sound effects programmatically
- Wants to implement audio ducking (lowering music during speech)
- Needs to mix multiple audio layers in Remotion
- Asks about audio timing, volume levels, or frame-based audio sync
Do NOT trigger this skill for:
- Video scripting or storyboarding - use the video-scriptwriting skill
- Remotion component architecture or rendering - use the remotion-video skill
- Professional audio production in a DAW (Ableton, Logic, Pro Tools)
- Music composition or MIDI programming
Key principles
Layered audio architecture - Every video has three audio layers: narration on top (loudest), SFX in the middle (accent volume), and background music at the base (lowest).
Narration drives timing - Generate narration first, measure its duration, then set scene timing to match. Never fit narration into arbitrary scene lengths.
Duck music during speech - Background music must drop 50-60% when narration plays. Use smooth ramps (10-15 frames) to avoid jarring jumps.
SFX as accents, not distractions - Keep SFX short (under 0.5s), subtle in volume, and relevant to on-screen action.
Test audio in context - Always preview the full mix with all layers together. Listen for muddy speech, volume spikes, or dead silence.
Core concepts
3-layer audio architecture
| Layer | Role | Base Volume | During Narration |
|---|---|---|---|
| Narration | Conveys information, drives pacing | 0.8-1.0 | N/A (top layer) |
| SFX | Accents transitions and actions | 0.3-0.5 | 0.3-0.5 (unchanged) |
| Background Music | Sets emotional tone, fills silence | 0.3-0.5 | 0.15-0.25 (ducked) |
ElevenLabs API model
ElevenLabs provides neural TTS via a REST API. The core flow:
- Pick a voice (pre-made or cloned) - each has a voice_id
- Send text + voice settings to /v1/text-to-speech/{voice_id}
- Receive raw audio bytes (mp3 by default)
- Write to file and measure duration for scene timing
Voice settings:
| Setting | Range | Low | High | Recommended |
|---|---|---|---|---|
| stability | 0-1 | More expressive, variable | More consistent, monotone | 0.4-0.6 |
| similarity_boost | 0-1 | More creative | Closer to original voice | 0.6-0.8 |
| style | 0-1 | Neutral delivery | Exaggerated style | 0.3-0.6 |
Audio ducking concept
Audio ducking reduces background music volume when narration starts and
restores it when narration ends. In Remotion, use interpolate():
Music volume: 0.4 ---\              /--- 0.4
                      \            /
                0.15   \__________/
                    narration start → end
Ramps should take 10-15 frames (~0.3-0.5s at 30fps).
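The envelope above is just piecewise-linear interpolation; here is a framework-free sketch of the same ramp (the function name and default values are illustrative, mirroring the numbers in the diagram):

```typescript
// Piecewise-linear duck envelope matching the diagram above.
// Returns the music volume at `frame` for a single narration segment.
function duckedMusicVolume(
  frame: number,
  narrationStart: number,
  narrationEnd: number,
  base = 0.4,    // music volume outside narration
  ducked = 0.15, // music volume during narration
  ramp = 10      // ramp length in frames
): number {
  if (frame <= narrationStart - ramp || frame >= narrationEnd + ramp) return base;
  if (frame >= narrationStart && frame <= narrationEnd) return ducked;
  if (frame < narrationStart) {
    // Ramping down just before narration starts
    const t = (frame - (narrationStart - ramp)) / ramp;
    return base + (ducked - base) * t;
  }
  // Ramping back up just after narration ends
  const t = (frame - narrationEnd) / ramp;
  return ducked + (base - ducked) * t;
}
```

In Remotion this collapses into a single interpolate() call, as shown in the implementation tasks later in this skill.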
Frame-based audio sync in Remotion
- useCurrentFrame() returns the current frame number
- interpolate() maps frame ranges to value ranges (e.g., volume)
- <Sequence from={frame}> places audio at a specific frame
- <Audio volume={fn}> accepts a static number or a per-frame function
Convert seconds to frames: frames = seconds * fps.
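The conversion is worth wrapping in helpers so rounding is handled consistently; a minimal sketch (the helper names are illustrative, not Remotion APIs):

```typescript
// Illustrative helpers for the seconds <-> frames conversion above.
// Math.ceil ensures the frame count is long enough to hold the full audio;
// rounding a fractional frame down would clip the tail of a clip.
function secondsToFrames(seconds: number, fps: number): number {
  return Math.ceil(seconds * fps);
}

function framesToSeconds(frames: number, fps: number): number {
  return frames / fps;
}
```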
Common tasks
1. Set up ElevenLabs API key and generate narration
import fs from 'fs';
const ELEVENLABS_API_URL = 'https://api.elevenlabs.io/v1';
async function generateNarration(
text: string,
voiceId: string,
outputPath: string
): Promise<void> {
const response = await fetch(
`${ELEVENLABS_API_URL}/text-to-speech/${voiceId}`,
{
method: 'POST',
headers: {
'Content-Type': 'application/json',
'xi-api-key': process.env.ELEVEN_LABS_API_KEY!,
},
body: JSON.stringify({
text,
model_id: 'eleven_multilingual_v2',
voice_settings: {
stability: 0.5,
similarity_boost: 0.75,
style: 0.5,
use_speaker_boost: true,
},
}),
}
);
if (!response.ok) {
const error = await response.text();
throw new Error(`ElevenLabs API error ${response.status}: ${error}`);
}
const buffer = Buffer.from(await response.arrayBuffer());
fs.writeFileSync(outputPath, buffer);
}
2. Select and configure voice settings
Voice selection questions: gender, age range, accent, energy level, warmth.
interface VoiceSettings {
stability: number;
similarity_boost: number;
style: number;
use_speaker_boost: boolean;
}
const presets: Record<string, VoiceSettings> = {
explainer: { stability: 0.6, similarity_boost: 0.75, style: 0.4, use_speaker_boost: true },
promo: { stability: 0.3, similarity_boost: 0.7, style: 0.7, use_speaker_boost: true },
tutorial: { stability: 0.7, similarity_boost: 0.8, style: 0.2, use_speaker_boost: false },
};
3. Generate narration per scene from a script
import { execSync } from 'child_process';
import path from 'path';
interface Scene { id: string; narrationText: string; }
interface SceneWithAudio extends Scene {
audioPath: string;
durationMs: number;
durationFrames: number;
}
function getAudioDurationMs(filePath: string): number {
const output = execSync(
`ffprobe -v error -show_entries format=duration -of csv=p=0 "${filePath}"`
).toString().trim();
return Math.round(parseFloat(output) * 1000);
}
async function generateSceneNarrations(
scenes: Scene[], voiceId: string, outputDir: string, fps: number
): Promise<SceneWithAudio[]> {
const results: SceneWithAudio[] = [];
for (const scene of scenes) {
const audioPath = path.join(outputDir, `${scene.id}.mp3`);
await generateNarration(scene.narrationText, voiceId, audioPath);
const durationMs = getAudioDurationMs(audioPath);
results.push({
...scene, audioPath, durationMs,
durationFrames: Math.ceil((durationMs / 1000) * fps),
});
}
return results;
}
4. Source background music
Royalty-free music sources:
- Pixabay Audio: https://pixabay.com/music/ (free, no attribution)
- Freesound: https://freesound.org/ (CC0/CC-BY)
- YouTube Audio Library: download from YouTube Studio
- Local files: place in public/audio/ for Remotion's staticFile()
5. Generate SFX with FFmpeg
# Click sound - short sine burst
ffmpeg -f lavfi -i "sine=frequency=800:duration=0.05" \
-af "afade=t=out:st=0.02:d=0.03" click.wav
# Keyboard typing - filtered noise burst
ffmpeg -f lavfi -i "anoisesrc=d=0.08:c=white:a=0.3" \
-af "highpass=f=2000,lowpass=f=8000,afade=t=out:st=0.04:d=0.04" type.wav
# Whoosh - frequency sweep
ffmpeg -f lavfi -i "sine=frequency=200:duration=0.4" \
-af "vibrato=f=8:d=0.5,afade=t=in:d=0.1,afade=t=out:st=0.2:d=0.2,lowpass=f=1000" \
whoosh.wav
# Ding/chime - bell synthesis
ffmpeg -f lavfi -i "sine=frequency=1200:duration=0.6" \
-af "afade=t=out:st=0.1:d=0.5,aecho=0.8:0.88:40:0.4" ding.wav
# Pop - impulse
ffmpeg -f lavfi -i "sine=frequency=400:duration=0.08" \
-af "afade=t=out:st=0.02:d=0.06,lowpass=f=600" pop.wav
# Transition swoosh
ffmpeg -f lavfi -i "sine=frequency=300:duration=0.3" \
-af "vibrato=f=12:d=0.8,afade=t=in:d=0.05,afade=t=out:st=0.15:d=0.15,bandpass=f=500:w=400" \
swoosh.wav
6. Implement audio ducking in Remotion
import React from 'react';
import { Audio, useCurrentFrame, interpolate, Sequence } from 'remotion';
const AudioMixer: React.FC<{
narrationSrc: string;
musicSrc: string;
narrationStart: number;
narrationDuration: number;
}> = ({ narrationSrc, musicSrc, narrationStart, narrationDuration }) => {
const frame = useCurrentFrame();
const duckRampFrames = 10;
const musicVolume = interpolate(
frame,
[
narrationStart - duckRampFrames,
narrationStart,
narrationStart + narrationDuration,
narrationStart + narrationDuration + duckRampFrames,
],
[0.4, 0.15, 0.15, 0.4],
{ extrapolateLeft: 'clamp', extrapolateRight: 'clamp' }
);
return (
<>
<Audio src={musicSrc} volume={musicVolume} />
<Sequence from={narrationStart} durationInFrames={narrationDuration}>
<Audio src={narrationSrc} volume={0.9} />
</Sequence>
</>
);
};
export default AudioMixer;
7. Mix 3 audio layers in a Remotion composition
import React from 'react';
import { Audio, Sequence, useCurrentFrame, interpolate } from 'remotion';
interface NarrationSegment { src: string; startFrame: number; durationFrames: number; }
interface SfxEvent { src: string; frame: number; }
const FullAudioMix: React.FC<{
narrations: NarrationSegment[];
sfxEvents: SfxEvent[];
musicSrc: string;
}> = ({ narrations, sfxEvents, musicSrc }) => {
const frame = useCurrentFrame();
const duckRamp = 10;
let musicVolume = 0.4;
for (const seg of narrations) {
const duck = interpolate(
frame,
[seg.startFrame - duckRamp, seg.startFrame,
seg.startFrame + seg.durationFrames, seg.startFrame + seg.durationFrames + duckRamp],
[1, 0.375, 0.375, 1],
{ extrapolateLeft: 'clamp', extrapolateRight: 'clamp' }
);
musicVolume = musicVolume * duck;
}
return (
<>
<Audio src={musicSrc} volume={musicVolume} loop />
{sfxEvents.map((sfx, i) => (
<Sequence key={i} from={sfx.frame} durationInFrames={30}>
<Audio src={sfx.src} volume={0.4} />
</Sequence>
))}
{narrations.map((seg, i) => (
<Sequence key={i} from={seg.startFrame} durationInFrames={seg.durationFrames}>
<Audio src={seg.src} volume={0.9} />
</Sequence>
))}
</>
);
};
export default FullAudioMix;
8. Use alternative TTS providers
OpenAI TTS - good quality, simple API, six built-in voices:
import OpenAI from 'openai';
import fs from 'fs';
const openai = new OpenAI();
async function generateWithOpenAI(
text: string,
outputPath: string,
voice: 'alloy' | 'echo' | 'fable' | 'onyx' | 'nova' | 'shimmer' = 'alloy'
): Promise<void> {
const mp3 = await openai.audio.speech.create({
model: 'tts-1-hd',
voice,
input: text,
});
const buffer = Buffer.from(await mp3.arrayBuffer());
fs.writeFileSync(outputPath, buffer);
}
Edge TTS - free, many voices, uses Microsoft Edge's TTS service:
pip install edge-tts
edge-tts --voice en-US-AriaNeural --text "Hello world" --write-media output.mp3
edge-tts --list-voices
Anti-patterns / common mistakes
| Mistake | Why it is wrong | What to do instead |
|---|---|---|
| Music same volume during narration | Speech becomes unintelligible | Implement audio ducking - drop music 50-60% during speech |
| Hardcoding ElevenLabs API key | Key leaks into version control | Use environment variables: process.env.ELEVEN_LABS_API_KEY |
| Using TTS without measuring duration | Scene timing wrong, narration cut off | Measure audio duration with ffprobe after generation |
| SFX louder than narration | Distracts from content | SFX at 0.3-0.5, narration at 0.8-1.0 |
| No fade on music start/end | Abrupt start/stop sounds like a bug | Add 0.5-1s fade-in at start and fade-out at end |
| Using low-quality TTS model | Robotic voice undermines quality | Use eleven_multilingual_v2 or tts-1-hd |
| Ignoring audio file format | Some formats add silence padding | Use MP3 for narration, WAV for SFX |
Gotchas
ElevenLabs rate limits and character quotas - The free tier has a monthly character limit. Cache generated audio aggressively and only regenerate when text changes. Use a hash of the text as the cache key.
MP3 encoder padding adds silence - MP3 files often have 20-50ms of silence at the start. Trim with ffmpeg -af silenceremove=1:0:-50dB or account for the offset in frame timing.
Remotion Audio volume is per-component, not global - Two <Audio> components at volume 1.0 can clip. Keep total volume across simultaneous layers under 1.0.
FFmpeg SFX sound different across systems - Always specify -ar 44100 -sample_fmt s16 for consistent output across machines.
Voice consistency across scenes - ElevenLabs can produce different tones for the same settings with varying text. Use stability >= 0.5 for multi-scene narration.
References
For detailed patterns on specific audio sub-domains, read the relevant file
from the references/ folder:
- references/elevenlabs-api.md - advanced ElevenLabs API patterns including voice cloning, streaming TTS, websocket API, pronunciation dictionaries, and quota management
- references/audio-mixing-patterns.md - advanced mixing patterns including multi-segment ducking, crossfades between scenes, volume automation curves, and mastering the final mix
- references/sfx-generation.md - comprehensive SFX generation with FFmpeg including complex synthesis, layering multiple generators, and building a reusable SFX library
Only load a references file if the current task requires it - they are long and will consume context.
References
audio-mixing-patterns.md
Audio Mixing Patterns
Advanced audio mixing patterns for Remotion video compositions. Load this file when the task requires multi-segment ducking, crossfades, volume automation, or mastering techniques.
Multi-Segment Ducking
When a video has multiple narration segments, the music must duck independently for each one. Calculate a combined duck factor:
import React from 'react';
import { Audio, Sequence, useCurrentFrame, interpolate } from 'remotion';
interface NarrationSegment {
src: string;
startFrame: number;
durationFrames: number;
}
function calculateDuckedVolume(
frame: number,
segments: NarrationSegment[],
baseVolume: number,
duckedVolume: number,
rampFrames: number
): number {
let duckFactor = 1.0;
for (const seg of segments) {
const segDuck = interpolate(
frame,
[
seg.startFrame - rampFrames,
seg.startFrame,
seg.startFrame + seg.durationFrames,
seg.startFrame + seg.durationFrames + rampFrames,
],
[1, duckedVolume / baseVolume, duckedVolume / baseVolume, 1],
{ extrapolateLeft: 'clamp', extrapolateRight: 'clamp' }
);
duckFactor = Math.min(duckFactor, segDuck);
}
return baseVolume * duckFactor;
}
const MultiSegmentMix: React.FC<{
narrations: NarrationSegment[];
musicSrc: string;
}> = ({ narrations, musicSrc }) => {
const frame = useCurrentFrame();
const musicVolume = calculateDuckedVolume(frame, narrations, 0.4, 0.15, 10);
return (
<>
<Audio src={musicSrc} volume={musicVolume} loop />
{narrations.map((seg, i) => (
<Sequence key={i} from={seg.startFrame} durationInFrames={seg.durationFrames}>
<Audio src={seg.src} volume={0.9} />
</Sequence>
))}
</>
);
};
export default MultiSegmentMix;
Crossfade Between Scenes
Smooth audio transitions between scenes using overlapping fade-out and fade-in:
import React from 'react';
import { Audio, Sequence, useCurrentFrame, interpolate } from 'remotion';
interface SceneAudio {
src: string;
startFrame: number;
durationFrames: number;
}
const CrossfadeAudio: React.FC<{
scenes: SceneAudio[];
crossfadeFrames: number;
}> = ({ scenes, crossfadeFrames }) => {
const frame = useCurrentFrame();
return (
<>
{scenes.map((scene, i) => {
const isFirst = i === 0;
const isLast = i === scenes.length - 1;
// Fade in at the start (except first scene)
const fadeIn = isFirst
? 1
: interpolate(
frame,
[scene.startFrame, scene.startFrame + crossfadeFrames],
[0, 1],
{ extrapolateLeft: 'clamp', extrapolateRight: 'clamp' }
);
// Fade out at the end (except last scene)
const endFrame = scene.startFrame + scene.durationFrames;
const fadeOut = isLast
? 1
: interpolate(
frame,
[endFrame - crossfadeFrames, endFrame],
[1, 0],
{ extrapolateLeft: 'clamp', extrapolateRight: 'clamp' }
);
const volume = 0.9 * fadeIn * fadeOut;
return (
<Sequence
key={i}
from={scene.startFrame}
durationInFrames={scene.durationFrames}
>
<Audio src={scene.src} volume={volume} />
</Sequence>
);
})}
</>
);
};
export default CrossfadeAudio;
Volume Automation Curves
Create custom volume envelopes for music that respond to video content:
import React from 'react';
import { Audio, useCurrentFrame, interpolate } from 'remotion';
interface VolumeKeyframe {
frame: number;
volume: number;
}
function volumeFromKeyframes(
frame: number,
keyframes: VolumeKeyframe[]
): number {
if (keyframes.length === 0) return 0;
if (keyframes.length === 1) return keyframes[0].volume;
const frames = keyframes.map((k) => k.frame);
const volumes = keyframes.map((k) => k.volume);
return interpolate(frame, frames, volumes, {
extrapolateLeft: 'clamp',
extrapolateRight: 'clamp',
});
}
const AutomatedMusic: React.FC<{
musicSrc: string;
keyframes: VolumeKeyframe[];
}> = ({ musicSrc, keyframes }) => {
const frame = useCurrentFrame();
const volume = volumeFromKeyframes(frame, keyframes);
return <Audio src={musicSrc} volume={volume} loop />;
};
// Usage example:
// <AutomatedMusic
// musicSrc={staticFile('audio/music/bg.mp3')}
// keyframes={[
// { frame: 0, volume: 0 }, // Start silent
// { frame: 30, volume: 0.4 }, // Fade in over 1s
// { frame: 90, volume: 0.15 }, // Duck for narration
// { frame: 300, volume: 0.15 }, // Stay ducked
// { frame: 310, volume: 0.4 }, // Restore after narration
// { frame: 570, volume: 0.4 }, // Maintain level
// { frame: 600, volume: 0 }, // Fade out at end
// ]}
// />
export default AutomatedMusic;
Intro and Outro Music Patterns
Add distinct music for intro and outro sections with smooth transitions:
import React from 'react';
import {
Audio,
Sequence,
useCurrentFrame,
interpolate,
useVideoConfig,
} from 'remotion';
const IntroOutroMusic: React.FC<{
introMusicSrc: string;
mainMusicSrc: string;
outroMusicSrc: string;
introFrames: number;
outroFrames: number;
}> = ({ introMusicSrc, mainMusicSrc, outroMusicSrc, introFrames, outroFrames }) => {
const frame = useCurrentFrame();
const { durationInFrames } = useVideoConfig();
const outroStart = durationInFrames - outroFrames;
const crossfade = 15;
// Intro music: full volume then fade out
const introVolume = interpolate(
frame,
[0, introFrames - crossfade, introFrames],
[0.5, 0.5, 0],
{ extrapolateLeft: 'clamp', extrapolateRight: 'clamp' }
);
// Main music: fade in after intro, fade out before outro
const mainVolume = interpolate(
frame,
[introFrames - crossfade, introFrames, outroStart - crossfade, outroStart],
[0, 0.35, 0.35, 0],
{ extrapolateLeft: 'clamp', extrapolateRight: 'clamp' }
);
// Outro music: fade in at end
const outroVolume = interpolate(
frame,
[outroStart - crossfade, outroStart, durationInFrames - 15, durationInFrames],
[0, 0.5, 0.5, 0],
{ extrapolateLeft: 'clamp', extrapolateRight: 'clamp' }
);
return (
<>
<Sequence from={0} durationInFrames={introFrames}>
<Audio src={introMusicSrc} volume={introVolume} />
</Sequence>
<Sequence from={introFrames - crossfade} durationInFrames={outroStart - introFrames + 2 * crossfade}>
<Audio src={mainMusicSrc} volume={mainVolume} loop />
</Sequence>
<Sequence from={outroStart - crossfade} durationInFrames={outroFrames + crossfade}>
<Audio src={outroMusicSrc} volume={outroVolume} />
</Sequence>
</>
);
};
export default IntroOutroMusic;
SFX Timing Patterns
Align sound effects with visual events using a declarative timeline:
import React from 'react';
import { Audio, Sequence, staticFile } from 'remotion';
interface SfxEvent {
type: 'click' | 'whoosh' | 'ding' | 'pop' | 'type' | 'swoosh';
frame: number;
volume?: number;
}
const SFX_DURATION: Record<string, number> = {
click: 3,
whoosh: 12,
ding: 18,
pop: 3,
type: 3,
swoosh: 9,
};
const SfxTimeline: React.FC<{ events: SfxEvent[] }> = ({ events }) => {
return (
<>
{events.map((event, i) => (
<Sequence
key={i}
from={event.frame}
durationInFrames={SFX_DURATION[event.type] || 10}
>
<Audio
src={staticFile(`audio/sfx/${event.type}.wav`)}
volume={event.volume ?? 0.4}
/>
</Sequence>
))}
</>
);
};
// Usage:
// <SfxTimeline events={[
// { type: 'whoosh', frame: 0 }, // Intro transition
// { type: 'click', frame: 45 }, // Button press
// { type: 'type', frame: 90 }, // Typing animation
// { type: 'ding', frame: 200 }, // Success notification
// { type: 'swoosh', frame: 350 }, // Scene transition
// ]} />
export default SfxTimeline;
Final Mix Checklist
Before rendering the final video, verify the audio mix:
- Peak levels - No individual frame should have combined volume > 1.0
- Narration clarity - Play each narration segment with music and verify speech is clearly intelligible
- Duck timing - Ramps should start before narration (pre-duck) so music is already low when speech begins
- SFX placement - Every SFX should correspond to a visible action on screen. Remove any that feel random
- Silence gaps - Brief silence (0.3-0.5s) between scenes feels natural. Continuous non-stop audio is fatiguing
- Fade in/out - Video should start and end with audio fades, never abrupt silence-to-sound or sound-to-silence
- Consistent volume - Narration volume should be uniform across all scenes. Variations feel like a bug
Headroom and Limiting
Keep total volume under 1.0 to prevent digital clipping:
function safeMixVolume(layers: number[]): number[] {
const total = layers.reduce((sum, v) => sum + v, 0);
if (total <= 1.0) return layers;
// Scale all layers proportionally to fit under 1.0
const headroom = 0.95; // Leave 5% headroom
const scale = headroom / total;
return layers.map((v) => v * scale);
}
// Example: three layers that would clip
const [narration, sfx, music] = safeMixVolume([0.9, 0.4, 0.4]);
// Result: [0.502, 0.223, 0.223] - total = 0.95
This is a safety net. Proper mixing should keep layers within budget from the start using the volume reference table in the main skill file.
elevenlabs-api.md
ElevenLabs API - Advanced Patterns
Deep-dive reference for ElevenLabs TTS API usage in programmatic video pipelines. Load this file only when the task involves advanced ElevenLabs features beyond basic text-to-speech generation.
API Authentication
All requests require the xi-api-key header. Store the key in environment
variables and never commit it to version control.
const headers = {
'Content-Type': 'application/json',
'xi-api-key': process.env.ELEVEN_LABS_API_KEY!,
};
Check quota before batch generation:
async function checkQuota(): Promise<{
characterCount: number;
characterLimit: number;
remaining: number;
}> {
const response = await fetch('https://api.elevenlabs.io/v1/user/subscription', {
headers: { 'xi-api-key': process.env.ELEVEN_LABS_API_KEY! },
});
const data = await response.json();
return {
characterCount: data.character_count,
characterLimit: data.character_limit,
remaining: data.character_limit - data.character_count,
};
}
Voice Listing and Selection
Fetch all available voices to pick the right one programmatically:
interface ElevenLabsVoice {
voice_id: string;
name: string;
category: string;
labels: Record<string, string>;
preview_url: string;
}
async function listVoices(): Promise<ElevenLabsVoice[]> {
const response = await fetch('https://api.elevenlabs.io/v1/voices', {
headers: { 'xi-api-key': process.env.ELEVEN_LABS_API_KEY! },
});
const data = await response.json();
return data.voices;
}
// Filter voices by attributes
async function findVoice(criteria: {
gender?: string;
accent?: string;
age?: string;
}): Promise<ElevenLabsVoice | undefined> {
const voices = await listVoices();
return voices.find((v) => {
const labels = v.labels;
if (criteria.gender && labels.gender !== criteria.gender) return false;
if (criteria.accent && labels.accent !== criteria.accent) return false;
if (criteria.age && labels.age !== criteria.age) return false;
return true;
});
}
Streaming TTS
For long narrations, stream audio chunks instead of waiting for the full response. This reduces time-to-first-byte and enables progressive processing:
import fs from 'fs';
async function streamNarration(
text: string,
voiceId: string,
outputPath: string
): Promise<void> {
const response = await fetch(
`https://api.elevenlabs.io/v1/text-to-speech/${voiceId}/stream`,
{
method: 'POST',
headers: {
'Content-Type': 'application/json',
'xi-api-key': process.env.ELEVEN_LABS_API_KEY!,
},
body: JSON.stringify({
text,
model_id: 'eleven_multilingual_v2',
voice_settings: {
stability: 0.5,
similarity_boost: 0.75,
style: 0.5,
use_speaker_boost: true,
},
}),
}
);
if (!response.ok || !response.body) {
throw new Error(`Stream error: ${response.status}`);
}
const writer = fs.createWriteStream(outputPath);
const reader = response.body.getReader();
while (true) {
const { done, value } = await reader.read();
if (done) break;
writer.write(Buffer.from(value));
}
writer.end();
}
WebSocket API for Real-time TTS
Use WebSockets for lowest-latency generation. Useful when previewing narration during development:
import WebSocket from 'ws';
import fs from 'fs';
async function realtimeTTS(
text: string,
voiceId: string,
outputPath: string
): Promise<void> {
return new Promise((resolve, reject) => {
const modelId = 'eleven_multilingual_v2';
const wsUrl = `wss://api.elevenlabs.io/v1/text-to-speech/${voiceId}/stream-input?model_id=${modelId}`;
const ws = new WebSocket(wsUrl);
const chunks: Buffer[] = [];
ws.on('open', () => {
// Begin stream with settings
ws.send(JSON.stringify({
text: ' ',
voice_settings: {
stability: 0.5,
similarity_boost: 0.75,
},
xi_api_key: process.env.ELEVEN_LABS_API_KEY!,
}));
// Send text
ws.send(JSON.stringify({ text }));
// Signal end of input
ws.send(JSON.stringify({ text: '' }));
});
ws.on('message', (data: Buffer) => {
try {
const json = JSON.parse(data.toString());
if (json.audio) {
chunks.push(Buffer.from(json.audio, 'base64'));
}
} catch {
// Binary data
chunks.push(Buffer.from(data));
}
});
ws.on('close', () => {
const audioBuffer = Buffer.concat(chunks);
fs.writeFileSync(outputPath, audioBuffer);
resolve();
});
ws.on('error', reject);
});
}
Pronunciation Dictionaries
Control how specific words are pronounced using SSML phoneme tags or the pronunciation dictionary API:
// Inline SSML approach - wrap specific words
function applyPronunciation(
text: string,
dictionary: Record<string, string>
): string {
let result = text;
for (const [word, ipa] of Object.entries(dictionary)) {
const regex = new RegExp(`\\b${word}\\b`, 'gi');
result = result.replace(
regex,
`<phoneme alphabet="ipa" ph="${ipa}">${word}</phoneme>`
);
}
return result;
}
// Common tech pronunciation overrides
const techPronunciations: Record<string, string> = {
'API': 'eI.piː.aI',
'CLI': 'siː.ɛl.aI',
'npm': 'ɛn.piː.ɛm',
'SQL': 'ɛs.kjuː.ɛl',
'OAuth': 'oʊ.ɔːθ',
'YAML': 'jæm.əl',
'nginx': 'ɛn.dʒɪnks',
};
Caching and Quota Management
Avoid regenerating audio for unchanged text. Use content-based hashing:
import crypto from 'crypto';
import fs from 'fs';
import path from 'path';
interface CacheKey {
text: string;
voiceId: string;
modelId: string;
stability: number;
similarityBoost: number;
}
function getCacheHash(key: CacheKey): string {
const content = JSON.stringify(key);
return crypto.createHash('sha256').update(content).digest('hex').slice(0, 16);
}
function getCachePath(cacheDir: string, hash: string): string {
return path.join(cacheDir, `${hash}.mp3`);
}
async function generateWithCache(
text: string,
voiceId: string,
cacheDir: string,
generateFn: (text: string, voiceId: string, output: string) => Promise<void>
): Promise<string> {
const hash = getCacheHash({
text,
voiceId,
modelId: 'eleven_multilingual_v2',
stability: 0.5,
similarityBoost: 0.75,
});
const cachePath = getCachePath(cacheDir, hash);
if (fs.existsSync(cachePath)) {
return cachePath;
}
await generateFn(text, voiceId, cachePath);
return cachePath;
}
Model Selection
| Model | Quality | Speed | Languages | Best for |
|---|---|---|---|---|
| eleven_multilingual_v2 | Highest | Slower | 28+ | Production narration |
| eleven_turbo_v2_5 | High | Fast | 32+ | Previews, iteration |
| eleven_monolingual_v1 | Good | Fast | English only | Simple English TTS |
Use eleven_turbo_v2_5 during development for faster iteration, then switch
to eleven_multilingual_v2 for the final render.
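The development/production split above can be captured in a tiny helper; a sketch (passing the environment as a string flag is an illustrative convention, not part of the ElevenLabs API):

```typescript
// Model choice as a function of environment. The two model IDs come from the
// table above; the env-string convention is illustrative.
type ElevenLabsModel = 'eleven_multilingual_v2' | 'eleven_turbo_v2_5';

function pickModel(env: string): ElevenLabsModel {
  // Fast, cheaper model while iterating; highest quality for the final render.
  return env === 'production' ? 'eleven_multilingual_v2' : 'eleven_turbo_v2_5';
}
```

Callers would typically pass something like process.env.NODE_ENV, so preview renders stay fast without touching the render pipeline itself.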
Error Handling
async function safeGenerate(
text: string,
voiceId: string,
outputPath: string,
maxRetries: number = 3
): Promise<void> {
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
await generateNarration(text, voiceId, outputPath);
return;
} catch (error) {
const message = error instanceof Error ? error.message : String(error);
if (message.includes('401')) {
throw new Error('Invalid API key. Check ELEVEN_LABS_API_KEY.');
}
if (message.includes('429')) {
const waitMs = Math.pow(2, attempt) * 1000;
console.warn(`Rate limited. Waiting ${waitMs}ms before retry...`);
await new Promise((r) => setTimeout(r, waitMs));
continue;
}
if (message.includes('422')) {
throw new Error(`Invalid request. Check voice_id "${voiceId}" exists.`);
}
if (attempt === maxRetries) throw error;
console.warn(`Attempt ${attempt} failed: ${message}. Retrying...`);
}
}
}
Batch Generation Pipeline
Generate narration for all scenes efficiently:
interface BatchScene {
id: string;
text: string;
}
interface BatchResult {
id: string;
audioPath: string;
durationMs: number;
cached: boolean;
}
async function batchGenerate(
scenes: BatchScene[],
voiceId: string,
outputDir: string,
cacheDir: string,
concurrency: number = 2
): Promise<BatchResult[]> {
const results: BatchResult[] = [];
// Process in batches to respect rate limits
for (let i = 0; i < scenes.length; i += concurrency) {
const batch = scenes.slice(i, i + concurrency);
const batchResults = await Promise.all(
batch.map(async (scene) => {
const hash = getCacheHash({
text: scene.text,
voiceId,
modelId: 'eleven_multilingual_v2',
stability: 0.5,
similarityBoost: 0.75,
});
const cachePath = getCachePath(cacheDir, hash);
const outputPath = path.join(outputDir, `${scene.id}.mp3`);
const cached = fs.existsSync(cachePath);
if (cached) {
fs.copyFileSync(cachePath, outputPath);
} else {
await safeGenerate(scene.text, voiceId, outputPath);
fs.copyFileSync(outputPath, cachePath);
}
const durationMs = getAudioDurationMs(outputPath);
return { id: scene.id, audioPath: outputPath, durationMs, cached };
})
);
results.push(...batchResults);
}
return results;
}
sfx-generation.md
SFX Generation with FFmpeg
Comprehensive reference for generating sound effects programmatically using FFmpeg's lavfi audio generators. Load this file when the task involves creating custom SFX, building a sound library, or understanding FFmpeg audio synthesis.
FFmpeg Audio Generators
FFmpeg's lavfi (libavfilter virtual input) provides several audio sources that can be combined to create sound effects without any input files:
| Generator | Description | Key Parameters |
|---|---|---|
| sine | Pure sine wave tone | frequency, duration |
| anoisesrc | White/pink/brown noise | duration, color, amplitude |
| aevalsrc | Custom math expressions | exprs, duration |
| anullsrc | Silence generator | duration, sample_rate |
Basic SFX Recipes
UI Sounds
# Click - short sine burst (good for buttons)
ffmpeg -y -f lavfi -i "sine=frequency=800:duration=0.05" \
-af "afade=t=out:st=0.02:d=0.03" \
-ar 44100 click.wav
# Soft click - lower frequency, gentler
ffmpeg -y -f lavfi -i "sine=frequency=500:duration=0.04" \
-af "afade=t=out:st=0.01:d=0.03,lowpass=f=1000" \
-ar 44100 soft-click.wav
# Toggle on - rising pitch
ffmpeg -y -f lavfi -i "aevalsrc=exprs=sin(2*PI*(600+400*t/0.1)*t):d=0.1" \
-af "afade=t=out:st=0.05:d=0.05" \
-ar 44100 toggle-on.wav
# Toggle off - falling pitch
ffmpeg -y -f lavfi -i "aevalsrc=exprs=sin(2*PI*(1000-400*t/0.1)*t):d=0.1" \
-af "afade=t=out:st=0.05:d=0.05" \
-ar 44100 toggle-off.wav
# Hover - subtle high-frequency blip
ffmpeg -y -f lavfi -i "sine=frequency=2000:duration=0.03" \
-af "afade=t=in:d=0.01,afade=t=out:st=0.01:d=0.02,volume=0.3" \
-ar 44100 hover.wav
Keyboard and Typing
# Single keypress
ffmpeg -y -f lavfi -i "anoisesrc=d=0.08:c=white:a=0.3" \
-af "highpass=f=2000,lowpass=f=8000,afade=t=out:st=0.04:d=0.04" \
-ar 44100 type.wav
# Mechanical key - louder with more body
ffmpeg -y -f lavfi -i "anoisesrc=d=0.12:c=white:a=0.5" \
-af "highpass=f=1000,lowpass=f=6000,afade=t=out:st=0.06:d=0.06" \
-ar 44100 mech-key.wav
# Spacebar - deeper, longer
ffmpeg -y -f lavfi -i "anoisesrc=d=0.15:c=white:a=0.4" \
-af "highpass=f=500,lowpass=f=3000,afade=t=out:st=0.08:d=0.07" \
-ar 44100 spacebar.wav
# Enter key - satisfying thunk
ffmpeg -y -f lavfi -i "anoisesrc=d=0.18:c=brown:a=0.5" \
-af "highpass=f=300,lowpass=f=2000,afade=t=out:st=0.08:d=0.1" \
-ar 44100 enter.wav
Transitions
# Whoosh - frequency sweep
ffmpeg -y -f lavfi -i "sine=frequency=200:duration=0.4" \
-af "vibrato=f=8:d=0.5,afade=t=in:d=0.1,afade=t=out:st=0.2:d=0.2,lowpass=f=1000" \
-ar 44100 whoosh.wav
# Swoosh - faster, higher pitch
ffmpeg -y -f lavfi -i "sine=frequency=300:duration=0.3" \
-af "vibrato=f=12:d=0.8,afade=t=in:d=0.05,afade=t=out:st=0.15:d=0.15,bandpass=f=500:w=400" \
-ar 44100 swoosh.wav
# Slide in - rising tone with noise
ffmpeg -y -f lavfi -i "aevalsrc=exprs=sin(2*PI*(100+800*t/0.3)*t)*0.3:d=0.3" \
-af "afade=t=in:d=0.05,afade=t=out:st=0.2:d=0.1,lowpass=f=2000" \
-ar 44100 slide-in.wav
# Slide out - falling tone
ffmpeg -y -f lavfi -i "aevalsrc=exprs=sin(2*PI*(900-800*t/0.3)*t)*0.3:d=0.3" \
-af "afade=t=in:d=0.05,afade=t=out:st=0.2:d=0.1,lowpass=f=2000" \
-ar 44100 slide-out.wav
Notification Sounds
# Ding/chime - bell synthesis
ffmpeg -y -f lavfi -i "sine=frequency=1200:duration=0.6" \
-af "afade=t=out:st=0.1:d=0.5,aecho=0.8:0.88:40:0.4" \
-ar 44100 ding.wav
# Success - two-tone ascending
ffmpeg -y -f lavfi \
  -i "aevalsrc=exprs='sin(2*PI*800*t)*lt(t,0.15)+sin(2*PI*1200*t)*gte(t,0.15)':d=0.3" \
-af "afade=t=out:st=0.15:d=0.15" \
-ar 44100 success.wav
# Error - low buzzy tone
ffmpeg -y -f lavfi -i "sine=frequency=200:duration=0.4" \
-af "vibrato=f=20:d=0.5,afade=t=out:st=0.2:d=0.2" \
-ar 44100 error.wav
# Pop - impulse
ffmpeg -y -f lavfi -i "sine=frequency=400:duration=0.08" \
-af "afade=t=out:st=0.02:d=0.06,lowpass=f=600" \
-ar 44100 pop.wav
# Bubble pop - higher, rounder
ffmpeg -y -f lavfi -i "aevalsrc=exprs=sin(2*PI*(800-400*t/0.1)*t)*0.5:d=0.1" \
-af "afade=t=out:st=0.04:d=0.06,lowpass=f=1500" \
-ar 44100 bubble.wav
Audio Filters Reference
Key FFmpeg audio filters used in SFX generation:
| Filter | Purpose | Example |
|---|---|---|
| afade | Fade in/out | afade=t=out:st=0.1:d=0.2 |
| lowpass | Remove high frequencies | lowpass=f=1000 |
| highpass | Remove low frequencies | highpass=f=2000 |
| bandpass | Keep frequency range | bandpass=f=500:w=200 |
| vibrato | Add pitch wobble | vibrato=f=8:d=0.5 |
| aecho | Add echo/reverb | aecho=0.8:0.88:40:0.4 |
| volume | Adjust volume | volume=0.5 |
| atempo | Change speed | atempo=1.5 |
| areverse | Reverse audio | areverse |
| chorus | Add richness | chorus=0.5:0.9:50:0.4:0.25:2 |
Chain filters with commas: -af "filter1,filter2,filter3"
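When a script assembles these commands (as the library builder below does), it can help to build the -af chain from an array rather than hand-concatenating strings. A small hypothetical helper:

```typescript
// Join individual filter expressions into one -af chain, skipping blanks.
function buildAfChain(filters: string[]): string {
  return filters
    .map((f) => f.trim())
    .filter((f) => f.length > 0)
    .join(',');
}

// e.g. the keypress chain from the typing recipe above:
const typeChain = buildAfChain([
  'highpass=f=2000',
  'lowpass=f=8000',
  'afade=t=out:st=0.04:d=0.04',
]);
```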
Building a Reusable SFX Library
Create a build script that generates all SFX in one pass:
import { execSync } from 'child_process';
import fs from 'fs';
import path from 'path';
interface SfxDefinition {
name: string;
command: string;
}
const SFX_LIBRARY: SfxDefinition[] = [
{
name: 'click',
command: '-f lavfi -i "sine=frequency=800:duration=0.05" -af "afade=t=out:st=0.02:d=0.03"',
},
{
name: 'type',
command: '-f lavfi -i "anoisesrc=d=0.08:c=white:a=0.3" -af "highpass=f=2000,lowpass=f=8000,afade=t=out:st=0.04:d=0.04"',
},
{
name: 'whoosh',
command: '-f lavfi -i "sine=frequency=200:duration=0.4" -af "vibrato=f=8:d=0.5,afade=t=in:d=0.1,afade=t=out:st=0.2:d=0.2,lowpass=f=1000"',
},
{
name: 'ding',
command: '-f lavfi -i "sine=frequency=1200:duration=0.6" -af "afade=t=out:st=0.1:d=0.5,aecho=0.8:0.88:40:0.4"',
},
{
name: 'pop',
command: '-f lavfi -i "sine=frequency=400:duration=0.08" -af "afade=t=out:st=0.02:d=0.06,lowpass=f=600"',
},
{
name: 'swoosh',
command: '-f lavfi -i "sine=frequency=300:duration=0.3" -af "vibrato=f=12:d=0.8,afade=t=in:d=0.05,afade=t=out:st=0.15:d=0.15,bandpass=f=500:w=400"',
},
{
name: 'success',
command: '-f lavfi -i "aevalsrc=exprs=\'sin(2*PI*800*t)*lt(t,0.15)+sin(2*PI*1200*t)*gte(t,0.15)\':d=0.3" -af "afade=t=out:st=0.15:d=0.15"',
},
{
name: 'error',
command: '-f lavfi -i "sine=frequency=200:duration=0.4" -af "vibrato=f=20:d=0.5,afade=t=out:st=0.2:d=0.2"',
},
];
function buildSfxLibrary(outputDir: string): void {
if (!fs.existsSync(outputDir)) {
fs.mkdirSync(outputDir, { recursive: true });
}
for (const sfx of SFX_LIBRARY) {
const outputPath = path.join(outputDir, `${sfx.name}.wav`);
const cmd = `ffmpeg -y ${sfx.command} -ar 44100 "${outputPath}"`;
try {
execSync(cmd, { stdio: 'pipe' });
console.log(`Generated: ${sfx.name}.wav`);
} catch (error) {
console.error(`Failed to generate ${sfx.name}:`, error);
}
}
}
// Run: buildSfxLibrary('./public/audio/sfx')
Combining Multiple Generators
Layer two generators for richer sounds using FFmpeg's amix filter:
# Rich notification: sine + noise burst
ffmpeg -y \
-f lavfi -i "sine=frequency=1000:duration=0.3" \
-f lavfi -i "anoisesrc=d=0.05:c=white:a=0.2" \
-filter_complex "[0]afade=t=out:st=0.1:d=0.2[a];[1]afade=t=out:st=0.02:d=0.03[b];[a][b]amix=inputs=2:duration=longest" \
-ar 44100 rich-ding.wav
# Laser: two detuned sines
ffmpeg -y \
-f lavfi -i "aevalsrc=exprs=sin(2*PI*(2000-1500*t/0.2)*t):d=0.2" \
-f lavfi -i "aevalsrc=exprs=sin(2*PI*(2100-1600*t/0.2)*t)*0.5:d=0.2" \
-filter_complex "[0][1]amix=inputs=2:duration=shortest,afade=t=out:st=0.1:d=0.1" \
-ar 44100 laser.wav
Converting SFX for Remotion
Remotion works best with specific audio formats. Convert generated WAV files for optimal compatibility:
# WAV to MP3 (smaller file size for music)
ffmpeg -y -i input.wav -codec:a libmp3lame -b:a 192k output.mp3
# Ensure consistent sample rate
ffmpeg -y -i input.wav -ar 44100 -ac 2 output.wav
# Normalize volume to prevent clipping
ffmpeg -y -i input.wav -af "loudnorm=I=-16:LRA=11:TP=-1.5" output.wav
# Trim silence from start and end
ffmpeg -y -i input.wav \
-af "silenceremove=start_periods=1:start_silence=0.01:start_threshold=-50dB,areverse,silenceremove=start_periods=1:start_silence=0.01:start_threshold=-50dB,areverse" \
output.wav
Troubleshooting
| Problem | Cause | Fix |
|---|---|---|
| SFX sounds different on CI | Different FFmpeg version or defaults | Pin -ar 44100 -sample_fmt s16 |
| Click sounds too harsh | High frequency, no envelope | Add afade=t=out and lowpass |
| Silence at start of WAV | Default encoder behavior | Use silenceremove filter |
| Playback too quiet in Remotion | WAV peaks low | Normalize with loudnorm filter |
| SFX not playing at all | Wrong file path | Use staticFile() with correct relative path |
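Several of the fixes above (pinned sample rate and format, loudness normalization) can be rolled into one post-processing pass. The sketch below only builds the command string so it can be checked without running ffmpeg; ffmpeg itself is assumed to be on PATH when you execute the result:

```typescript
// Build an ffmpeg command applying the troubleshooting fixes in one pass:
// loudnorm against quiet playback, pinned rate/format for CI reproducibility.
function buildPostProcessCmd(input: string, output: string): string {
  return [
    'ffmpeg -y',
    `-i "${input}"`,
    '-af "loudnorm=I=-16:LRA=11:TP=-1.5"',
    '-ar 44100 -sample_fmt s16',
    `"${output}"`,
  ].join(' ');
}
```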
Frequently Asked Questions
What is video-audio-design?
Use this skill when adding audio to programmatic videos - generating narration with ElevenLabs TTS, sourcing royalty-free background music, creating SFX with FFmpeg, implementing audio ducking, or mixing multiple audio layers in Remotion. Triggers on ElevenLabs, text-to-speech, voice generation, background music, sound effects, audio mixing, and volume ducking.
How do I install video-audio-design?
Run npx skills add AbsolutelySkilled/AbsolutelySkilled --skill video-audio-design in your terminal. The skill will be immediately available in your AI coding agent.
What AI agents support video-audio-design?
video-audio-design works with claude-code, gemini-cli, openai-codex. Install it once and use it across any supported AI coding agent.