video-audio-design
Use this skill when adding audio to programmatic videos - generating narration with ElevenLabs TTS, sourcing royalty-free background music, creating SFX with FFmpeg, implementing audio ducking, or mixing multiple audio layers in Remotion. Triggers on ElevenLabs, text-to-speech, voice generation, background music, sound effects, audio mixing, and volume ducking.
What is video-audio-design?
video-audio-design is a production-ready AI agent skill for claude-code, gemini-cli, and openai-codex. It covers adding audio to programmatic videos: generating narration with ElevenLabs TTS, sourcing royalty-free background music, creating SFX with FFmpeg, implementing audio ducking, and mixing multiple audio layers in Remotion.
Quick Facts
| Field | Value |
|---|---|
| Category | video |
| Version | 0.1.0 |
| Platforms | claude-code, gemini-cli, openai-codex |
| License | MIT |
How to Install
- Make sure you have Node.js installed on your machine.
- Run the following command in your terminal:
npx skills add AbsolutelySkilled/AbsolutelySkilled --skill video-audio-design
- The video-audio-design skill is now available in your AI coding agent (Claude Code, Gemini CLI, OpenAI Codex, etc.).
Overview
Video audio design is the practice of layering narration, sound effects, and background music into programmatic video compositions. Great audio transforms a slide-deck video into a polished production - narration guides the viewer, music sets the emotional tone, and SFX punctuate key moments. This skill covers generating speech with ElevenLabs and alternative TTS providers, creating synthetic sound effects with FFmpeg, sourcing royalty-free background music, implementing audio ducking so speech stays intelligible, and mixing all layers together in Remotion compositions with frame-accurate timing.
Tags
elevenlabs tts audio-design sfx background-music audio-mixing
Platforms
- claude-code
- gemini-cli
- openai-codex
Frequently Asked Questions
What is video-audio-design?
Use this skill when adding audio to programmatic videos - generating narration with ElevenLabs TTS, sourcing royalty-free background music, creating SFX with FFmpeg, implementing audio ducking, or mixing multiple audio layers in Remotion. Triggers on ElevenLabs, text-to-speech, voice generation, background music, sound effects, audio mixing, and volume ducking.
How do I install video-audio-design?
Run npx skills add AbsolutelySkilled/AbsolutelySkilled --skill video-audio-design in your terminal. The skill will be immediately available in your AI coding agent.
What AI agents support video-audio-design?
This skill works with claude-code, gemini-cli, openai-codex. Install it once and use it across any supported AI coding agent.
Maintainers
Generated from AbsolutelySkilled
SKILL.md
Video Audio Design
Video audio design is the practice of layering narration, sound effects, and background music into programmatic video compositions. Great audio transforms a slide-deck video into a polished production - narration guides the viewer, music sets the emotional tone, and SFX punctuate key moments. This skill covers generating speech with ElevenLabs and alternative TTS providers, creating synthetic sound effects with FFmpeg, sourcing royalty-free background music, implementing audio ducking so speech stays intelligible, and mixing all layers together in Remotion compositions with frame-accurate timing.
When to use this skill
Trigger this skill when the user:
- Wants to add narration or voiceover to a programmatic video
- Needs to generate speech with ElevenLabs, OpenAI TTS, or Edge TTS
- Asks about voice selection, voice settings, or voice cloning
- Wants to add background music or needs royalty-free music sources
- Asks about creating sound effects programmatically
- Wants to implement audio ducking (lowering music during speech)
- Needs to mix multiple audio layers in Remotion
- Asks about audio timing, volume levels, or frame-based audio sync
Do NOT trigger this skill for:
- Video scripting or storyboarding - use the video-scriptwriting skill
- Remotion component architecture or rendering - use the remotion-video skill
- Professional audio production in a DAW (Ableton, Logic, Pro Tools)
- Music composition or MIDI programming
Key principles
Layered audio architecture - Every video has three audio layers: narration on top (loudest), SFX in the middle (accent volume), and background music at the base (lowest).
Narration drives timing - Generate narration first, measure its duration, then set scene timing to match. Never fit narration into arbitrary scene lengths.
Duck music during speech - Background music must drop 50-60% when narration plays. Use smooth ramps (10-15 frames) to avoid jarring jumps.
SFX as accents, not distractions - Keep SFX short (under 0.5s), subtle in volume, and relevant to on-screen action.
Test audio in context - Always preview the full mix with all layers together. Listen for muddy speech, volume spikes, or dead silence.
Core concepts
3-layer audio architecture
| Layer | Role | Base Volume | During Narration |
|---|---|---|---|
| Narration | Conveys information, drives pacing | 0.8-1.0 | N/A (top layer) |
| SFX | Accents transitions and actions | 0.3-0.5 | 0.3-0.5 (unchanged) |
| Background Music | Sets emotional tone, fills silence | 0.3-0.5 | 0.15-0.25 (ducked) |
ElevenLabs API model
ElevenLabs provides neural TTS via a REST API. The core flow:
- Pick a voice (pre-made or cloned) - each has a voice_id
- Send text + voice settings to /v1/text-to-speech/{voice_id}
- Receive raw audio bytes (mp3 by default)
- Write to file and measure duration for scene timing
Voice settings:
| Setting | Range | Low | High | Recommended |
|---|---|---|---|---|
| stability | 0-1 | More expressive, variable | More consistent, monotone | 0.4-0.6 |
| similarity_boost | 0-1 | More creative | Closer to original voice | 0.6-0.8 |
| style | 0-1 | Neutral delivery | Exaggerated style | 0.3-0.6 |
Audio ducking concept
Audio ducking reduces background music volume when narration starts and
restores it when narration ends. In Remotion, use interpolate():
Music volume: 0.4 ---\              /--- 0.4
                      \            /
                0.15   \__________/
                    narration start → end
Ramps should take 10-15 frames (~0.3-0.5s at 30fps).
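The envelope above is just piecewise-linear interpolation; here is a framework-free sketch of the same ramp (the function name and default values are illustrative, mirroring the numbers in the diagram):

```typescript
// Piecewise-linear duck envelope matching the diagram above.
// Returns the music volume at `frame` for a single narration segment.
function duckedMusicVolume(
  frame: number,
  narrationStart: number,
  narrationEnd: number,
  base = 0.4,    // music volume outside narration
  ducked = 0.15, // music volume during narration
  ramp = 10      // ramp length in frames
): number {
  if (frame <= narrationStart - ramp || frame >= narrationEnd + ramp) return base;
  if (frame >= narrationStart && frame <= narrationEnd) return ducked;
  if (frame < narrationStart) {
    // Ramping down just before narration starts
    const t = (frame - (narrationStart - ramp)) / ramp;
    return base + (ducked - base) * t;
  }
  // Ramping back up just after narration ends
  const t = (frame - narrationEnd) / ramp;
  return ducked + (base - ducked) * t;
}
```

In Remotion this collapses into a single interpolate() call, as shown in the implementation tasks later in this skill.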
Frame-based audio sync in Remotion
- useCurrentFrame() returns the current frame number
- interpolate() maps frame ranges to value ranges (e.g., volume)
- <Sequence from={frame}> places audio at a specific frame
- <Audio volume={fn}> accepts a static number or a per-frame function
Convert seconds to frames: frames = seconds * fps.
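The conversion is worth wrapping in helpers so rounding is handled consistently; a minimal sketch (the helper names are illustrative, not Remotion APIs):

```typescript
// Illustrative helpers for the seconds <-> frames conversion above.
// Math.ceil ensures the frame count is long enough to hold the full audio;
// rounding a fractional frame down would clip the tail of a clip.
function secondsToFrames(seconds: number, fps: number): number {
  return Math.ceil(seconds * fps);
}

function framesToSeconds(frames: number, fps: number): number {
  return frames / fps;
}
```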
Common tasks
1. Set up ElevenLabs API key and generate narration
import fs from 'fs';
const ELEVENLABS_API_URL = 'https://api.elevenlabs.io/v1';
async function generateNarration(
text: string,
voiceId: string,
outputPath: string
): Promise<void> {
const response = await fetch(
`${ELEVENLABS_API_URL}/text-to-speech/${voiceId}`,
{
method: 'POST',
headers: {
'Content-Type': 'application/json',
'xi-api-key': process.env.ELEVEN_LABS_API_KEY!,
},
body: JSON.stringify({
text,
model_id: 'eleven_multilingual_v2',
voice_settings: {
stability: 0.5,
similarity_boost: 0.75,
style: 0.5,
use_speaker_boost: true,
},
}),
}
);
if (!response.ok) {
const error = await response.text();
throw new Error(`ElevenLabs API error ${response.status}: ${error}`);
}
const buffer = Buffer.from(await response.arrayBuffer());
fs.writeFileSync(outputPath, buffer);
}
2. Select and configure voice settings
Voice selection questions: gender, age range, accent, energy level, warmth.
interface VoiceSettings {
stability: number;
similarity_boost: number;
style: number;
use_speaker_boost: boolean;
}
const presets: Record<string, VoiceSettings> = {
explainer: { stability: 0.6, similarity_boost: 0.75, style: 0.4, use_speaker_boost: true },
promo: { stability: 0.3, similarity_boost: 0.7, style: 0.7, use_speaker_boost: true },
tutorial: { stability: 0.7, similarity_boost: 0.8, style: 0.2, use_speaker_boost: false },
};
3. Generate narration per scene from a script
import { execSync } from 'child_process';
import path from 'path';
interface Scene { id: string; narrationText: string; }
interface SceneWithAudio extends Scene {
audioPath: string;
durationMs: number;
durationFrames: number;
}
function getAudioDurationMs(filePath: string): number {
const output = execSync(
`ffprobe -v error -show_entries format=duration -of csv=p=0 "${filePath}"`
).toString().trim();
return Math.round(parseFloat(output) * 1000);
}
async function generateSceneNarrations(
scenes: Scene[], voiceId: string, outputDir: string, fps: number
): Promise<SceneWithAudio[]> {
const results: SceneWithAudio[] = [];
for (const scene of scenes) {
const audioPath = path.join(outputDir, `${scene.id}.mp3`);
await generateNarration(scene.narrationText, voiceId, audioPath);
const durationMs = getAudioDurationMs(audioPath);
results.push({
...scene, audioPath, durationMs,
durationFrames: Math.ceil((durationMs / 1000) * fps),
});
}
return results;
}
4. Source background music
Royalty-free music sources:
- Pixabay Audio: https://pixabay.com/music/ (free, no attribution)
- Freesound: https://freesound.org/ (CC0/CC-BY)
- YouTube Audio Library: download from YouTube Studio
- Local files: place in public/audio/ for Remotion's staticFile()
5. Generate SFX with FFmpeg
# Click sound - short sine burst
ffmpeg -f lavfi -i "sine=frequency=800:duration=0.05" \
-af "afade=t=out:st=0.02:d=0.03" click.wav
# Keyboard typing - filtered noise burst
ffmpeg -f lavfi -i "anoisesrc=d=0.08:c=white:a=0.3" \
-af "highpass=f=2000,lowpass=f=8000,afade=t=out:st=0.04:d=0.04" type.wav
# Whoosh - frequency sweep
ffmpeg -f lavfi -i "sine=frequency=200:duration=0.4" \
-af "vibrato=f=8:d=0.5,afade=t=in:d=0.1,afade=t=out:st=0.2:d=0.2,lowpass=f=1000" \
whoosh.wav
# Ding/chime - bell synthesis
ffmpeg -f lavfi -i "sine=frequency=1200:duration=0.6" \
-af "afade=t=out:st=0.1:d=0.5,aecho=0.8:0.88:40:0.4" ding.wav
# Pop - impulse
ffmpeg -f lavfi -i "sine=frequency=400:duration=0.08" \
-af "afade=t=out:st=0.02:d=0.06,lowpass=f=600" pop.wav
# Transition swoosh
ffmpeg -f lavfi -i "sine=frequency=300:duration=0.3" \
-af "vibrato=f=12:d=0.8,afade=t=in:d=0.05,afade=t=out:st=0.15:d=0.15,bandpass=f=500:w=400" \
swoosh.wav
6. Implement audio ducking in Remotion
import React from 'react';
import { Audio, useCurrentFrame, interpolate, Sequence } from 'remotion';
const AudioMixer: React.FC<{
narrationSrc: string;
musicSrc: string;
narrationStart: number;
narrationDuration: number;
}> = ({ narrationSrc, musicSrc, narrationStart, narrationDuration }) => {
const frame = useCurrentFrame();
const duckRampFrames = 10;
const musicVolume = interpolate(
frame,
[
narrationStart - duckRampFrames,
narrationStart,
narrationStart + narrationDuration,
narrationStart + narrationDuration + duckRampFrames,
],
[0.4, 0.15, 0.15, 0.4],
{ extrapolateLeft: 'clamp', extrapolateRight: 'clamp' }
);
return (
<>
<Audio src={musicSrc} volume={musicVolume} />
<Sequence from={narrationStart} durationInFrames={narrationDuration}>
<Audio src={narrationSrc} volume={0.9} />
</Sequence>
</>
);
};
export default AudioMixer;
7. Mix 3 audio layers in a Remotion composition
import React from 'react';
import { Audio, Sequence, useCurrentFrame, interpolate } from 'remotion';
interface NarrationSegment { src: string; startFrame: number; durationFrames: number; }
interface SfxEvent { src: string; frame: number; }
const FullAudioMix: React.FC<{
narrations: NarrationSegment[];
sfxEvents: SfxEvent[];
musicSrc: string;
}> = ({ narrations, sfxEvents, musicSrc }) => {
const frame = useCurrentFrame();
const duckRamp = 10;
let musicVolume = 0.4;
for (const seg of narrations) {
const duck = interpolate(
frame,
[seg.startFrame - duckRamp, seg.startFrame,
seg.startFrame + seg.durationFrames, seg.startFrame + seg.durationFrames + duckRamp],
[1, 0.375, 0.375, 1],
{ extrapolateLeft: 'clamp', extrapolateRight: 'clamp' }
);
musicVolume = musicVolume * duck;
}
return (
<>
<Audio src={musicSrc} volume={musicVolume} loop />
{sfxEvents.map((sfx, i) => (
<Sequence key={i} from={sfx.frame} durationInFrames={30}>
<Audio src={sfx.src} volume={0.4} />
</Sequence>
))}
{narrations.map((seg, i) => (
<Sequence key={i} from={seg.startFrame} durationInFrames={seg.durationFrames}>
<Audio src={seg.src} volume={0.9} />
</Sequence>
))}
</>
);
};
export default FullAudioMix;
8. Use alternative TTS providers
OpenAI TTS - good quality, simple API, six built-in voices:
import OpenAI from 'openai';
import fs from 'fs';
const openai = new OpenAI();
async function generateWithOpenAI(
text: string,
outputPath: string,
voice: 'alloy' | 'echo' | 'fable' | 'onyx' | 'nova' | 'shimmer' = 'alloy'
): Promise<void> {
const mp3 = await openai.audio.speech.create({
model: 'tts-1-hd',
voice,
input: text,
});
const buffer = Buffer.from(await mp3.arrayBuffer());
fs.writeFileSync(outputPath, buffer);
}
Edge TTS - free, many voices, uses Microsoft Edge's TTS service:
pip install edge-tts
edge-tts --voice en-US-AriaNeural --text "Hello world" --write-media output.mp3
edge-tts --list-voices
Anti-patterns / common mistakes
| Mistake | Why it is wrong | What to do instead |
|---|---|---|
| Music same volume during narration | Speech becomes unintelligible | Implement audio ducking - drop music 50-60% during speech |
| Hardcoding ElevenLabs API key | Key leaks into version control | Use environment variables: process.env.ELEVEN_LABS_API_KEY |
| Using TTS without measuring duration | Scene timing wrong, narration cut off | Measure audio duration with ffprobe after generation |
| SFX louder than narration | Distracts from content | SFX at 0.3-0.5, narration at 0.8-1.0 |
| No fade on music start/end | Abrupt start/stop sounds like a bug | Add 0.5-1s fade-in at start and fade-out at end |
| Using low-quality TTS model | Robotic voice undermines quality | Use eleven_multilingual_v2 or tts-1-hd |
| Ignoring audio file format | Some formats add silence padding | Use MP3 for narration, WAV for SFX |
Gotchas
ElevenLabs rate limits and character quotas - The free tier has a monthly character limit. Cache generated audio aggressively and only regenerate when text changes. Use a hash of the text as the cache key.
MP3 encoder padding adds silence - MP3 files often have 20-50ms of silence at the start. Trim with ffmpeg -af silenceremove=1:0:-50dB or account for the offset in frame timing.
Remotion Audio volume is per-component, not global - Two <Audio> components at volume 1.0 can clip. Keep total volume across simultaneous layers under 1.0.
FFmpeg SFX sound different across systems - Always specify -ar 44100 -sample_fmt s16 for consistent output across machines.
Voice consistency across scenes - ElevenLabs can produce different tones for the same settings with varying text. Use stability >= 0.5 for multi-scene narration.
References
For detailed patterns on specific audio sub-domains, read the relevant file
from the references/ folder:
- references/elevenlabs-api.md - advanced ElevenLabs API patterns including voice cloning, streaming TTS, websocket API, pronunciation dictionaries, and quota management
- references/audio-mixing-patterns.md - advanced mixing patterns including multi-segment ducking, crossfades between scenes, volume automation curves, and mastering the final mix
- references/sfx-generation.md - comprehensive SFX generation with FFmpeg including complex synthesis, layering multiple generators, and building a reusable SFX library
Only load a references file if the current task requires it - they are long and will consume context.
References
audio-mixing-patterns.md
Audio Mixing Patterns
Advanced audio mixing patterns for Remotion video compositions. Load this file when the task requires multi-segment ducking, crossfades, volume automation, or mastering techniques.
Multi-Segment Ducking
When a video has multiple narration segments, the music must duck independently for each one. Calculate a combined duck factor:
import React from 'react';
import { Audio, Sequence, useCurrentFrame, interpolate } from 'remotion';
interface NarrationSegment {
src: string;
startFrame: number;
durationFrames: number;
}
function calculateDuckedVolume(
frame: number,
segments: NarrationSegment[],
baseVolume: number,
duckedVolume: number,
rampFrames: number
): number {
let duckFactor = 1.0;
for (const seg of segments) {
const segDuck = interpolate(
frame,
[
seg.startFrame - rampFrames,
seg.startFrame,
seg.startFrame + seg.durationFrames,
seg.startFrame + seg.durationFrames + rampFrames,
],
[1, duckedVolume / baseVolume, duckedVolume / baseVolume, 1],
{ extrapolateLeft: 'clamp', extrapolateRight: 'clamp' }
);
duckFactor = Math.min(duckFactor, segDuck);
}
return baseVolume * duckFactor;
}
const MultiSegmentMix: React.FC<{
narrations: NarrationSegment[];
musicSrc: string;
}> = ({ narrations, musicSrc }) => {
const frame = useCurrentFrame();
const musicVolume = calculateDuckedVolume(frame, narrations, 0.4, 0.15, 10);
return (
<>
<Audio src={musicSrc} volume={musicVolume} loop />
{narrations.map((seg, i) => (
<Sequence key={i} from={seg.startFrame} durationInFrames={seg.durationFrames}>
<Audio src={seg.src} volume={0.9} />
</Sequence>
))}
</>
);
};
export default MultiSegmentMix;
Crossfade Between Scenes
Smooth audio transitions between scenes using overlapping fade-out and fade-in:
import React from 'react';
import { Audio, Sequence, useCurrentFrame, interpolate } from 'remotion';
interface SceneAudio {
src: string;
startFrame: number;
durationFrames: number;
}
const CrossfadeAudio: React.FC<{
scenes: SceneAudio[];
crossfadeFrames: number;
}> = ({ scenes, crossfadeFrames }) => {
const frame = useCurrentFrame();
return (
<>
{scenes.map((scene, i) => {
const isFirst = i === 0;
const isLast = i === scenes.length - 1;
// Fade in at the start (except first scene)
const fadeIn = isFirst
? 1
: interpolate(
frame,
[scene.startFrame, scene.startFrame + crossfadeFrames],
[0, 1],
{ extrapolateLeft: 'clamp', extrapolateRight: 'clamp' }
);
// Fade out at the end (except last scene)
const endFrame = scene.startFrame + scene.durationFrames;
const fadeOut = isLast
? 1
: interpolate(
frame,
[endFrame - crossfadeFrames, endFrame],
[1, 0],
{ extrapolateLeft: 'clamp', extrapolateRight: 'clamp' }
);
const volume = 0.9 * fadeIn * fadeOut;
return (
<Sequence
key={i}
from={scene.startFrame}
durationInFrames={scene.durationFrames}
>
<Audio src={scene.src} volume={volume} />
</Sequence>
);
})}
</>
);
};
export default CrossfadeAudio;
Volume Automation Curves
Create custom volume envelopes for music that respond to video content:
import React from 'react';
import { Audio, useCurrentFrame, interpolate } from 'remotion';
interface VolumeKeyframe {
frame: number;
volume: number;
}
function volumeFromKeyframes(
frame: number,
keyframes: VolumeKeyframe[]
): number {
if (keyframes.length === 0) return 0;
if (keyframes.length === 1) return keyframes[0].volume;
const frames = keyframes.map((k) => k.frame);
const volumes = keyframes.map((k) => k.volume);
return interpolate(frame, frames, volumes, {
extrapolateLeft: 'clamp',
extrapolateRight: 'clamp',
});
}
const AutomatedMusic: React.FC<{
musicSrc: string;
keyframes: VolumeKeyframe[];
}> = ({ musicSrc, keyframes }) => {
const frame = useCurrentFrame();
const volume = volumeFromKeyframes(frame, keyframes);
return <Audio src={musicSrc} volume={volume} loop />;
};
// Usage example:
// <AutomatedMusic
// musicSrc={staticFile('audio/music/bg.mp3')}
// keyframes={[
// { frame: 0, volume: 0 }, // Start silent
// { frame: 30, volume: 0.4 }, // Fade in over 1s
// { frame: 90, volume: 0.15 }, // Duck for narration
// { frame: 300, volume: 0.15 }, // Stay ducked
// { frame: 310, volume: 0.4 }, // Restore after narration
// { frame: 570, volume: 0.4 }, // Maintain level
// { frame: 600, volume: 0 }, // Fade out at end
// ]}
// />
export default AutomatedMusic;
Intro and Outro Music Patterns
Add distinct music for intro and outro sections with smooth transitions:
import React from 'react';
import {
Audio,
Sequence,
useCurrentFrame,
interpolate,
useVideoConfig,
} from 'remotion';
const IntroOutroMusic: React.FC<{
introMusicSrc: string;
mainMusicSrc: string;
outroMusicSrc: string;
introFrames: number;
outroFrames: number;
}> = ({ introMusicSrc, mainMusicSrc, outroMusicSrc, introFrames, outroFrames }) => {
const frame = useCurrentFrame();
const { durationInFrames } = useVideoConfig();
const outroStart = durationInFrames - outroFrames;
const crossfade = 15;
// Intro music: full volume then fade out
const introVolume = interpolate(
frame,
[0, introFrames - crossfade, introFrames],
[0.5, 0.5, 0],
{ extrapolateLeft: 'clamp', extrapolateRight: 'clamp' }
);
// Main music: fade in after intro, fade out before outro
const mainVolume = interpolate(
frame,
[introFrames - crossfade, introFrames, outroStart - crossfade, outroStart],
[0, 0.35, 0.35, 0],
{ extrapolateLeft: 'clamp', extrapolateRight: 'clamp' }
);
// Outro music: fade in at end
const outroVolume = interpolate(
frame,
[outroStart - crossfade, outroStart, durationInFrames - 15, durationInFrames],
[0, 0.5, 0.5, 0],
{ extrapolateLeft: 'clamp', extrapolateRight: 'clamp' }
);
return (
<>
<Sequence from={0} durationInFrames={introFrames}>
<Audio src={introMusicSrc} volume={introVolume} />
</Sequence>
<Sequence from={introFrames - crossfade} durationInFrames={outroStart - introFrames + 2 * crossfade}>
<Audio src={mainMusicSrc} volume={mainVolume} loop />
</Sequence>
<Sequence from={outroStart - crossfade} durationInFrames={outroFrames + crossfade}>
<Audio src={outroMusicSrc} volume={outroVolume} />
</Sequence>
</>
);
};
export default IntroOutroMusic;
SFX Timing Patterns
Align sound effects with visual events using a declarative timeline:
import React from 'react';
import { Audio, Sequence, staticFile } from 'remotion';
interface SfxEvent {
type: 'click' | 'whoosh' | 'ding' | 'pop' | 'type' | 'swoosh';
frame: number;
volume?: number;
}
const SFX_DURATION: Record<string, number> = {
click: 3,
whoosh: 12,
ding: 18,
pop: 3,
type: 3,
swoosh: 9,
};
const SfxTimeline: React.FC<{ events: SfxEvent[] }> = ({ events }) => {
return (
<>
{events.map((event, i) => (
<Sequence
key={i}
from={event.frame}
durationInFrames={SFX_DURATION[event.type] || 10}
>
<Audio
src={staticFile(`audio/sfx/${event.type}.wav`)}
volume={event.volume ?? 0.4}
/>
</Sequence>
))}
</>
);
};
// Usage:
// <SfxTimeline events={[
// { type: 'whoosh', frame: 0 }, // Intro transition
// { type: 'click', frame: 45 }, // Button press
// { type: 'type', frame: 90 }, // Typing animation
// { type: 'ding', frame: 200 }, // Success notification
// { type: 'swoosh', frame: 350 }, // Scene transition
// ]} />
export default SfxTimeline;
Final Mix Checklist
Before rendering the final video, verify the audio mix:
- Peak levels - No individual frame should have combined volume > 1.0
- Narration clarity - Play each narration segment with music and verify speech is clearly intelligible
- Duck timing - Ramps should start before narration (pre-duck) so music is already low when speech begins
- SFX placement - Every SFX should correspond to a visible action on screen. Remove any that feel random
- Silence gaps - Brief silence (0.3-0.5s) between scenes feels natural. Continuous non-stop audio is fatiguing
- Fade in/out - Video should start and end with audio fades, never abrupt silence-to-sound or sound-to-silence
- Consistent volume - Narration volume should be uniform across all scenes. Variations feel like a bug
Headroom and Limiting
Keep total volume under 1.0 to prevent digital clipping:
function safeMixVolume(layers: number[]): number[] {
const total = layers.reduce((sum, v) => sum + v, 0);
if (total <= 1.0) return layers;
// Scale all layers proportionally to fit under 1.0
const headroom = 0.95; // Leave 5% headroom
const scale = headroom / total;
return layers.map((v) => v * scale);
}
// Example: three layers that would clip
const [narration, sfx, music] = safeMixVolume([0.9, 0.4, 0.4]);
// Result: [0.502, 0.223, 0.223] - total = 0.95
This is a safety net. Proper mixing should keep layers within budget from the start using the volume reference table in the main skill file.
elevenlabs-api.md
ElevenLabs API - Advanced Patterns
Deep-dive reference for ElevenLabs TTS API usage in programmatic video pipelines. Load this file only when the task involves advanced ElevenLabs features beyond basic text-to-speech generation.
API Authentication
All requests require the xi-api-key header. Store the key in environment
variables and never commit it to version control.
const headers = {
'Content-Type': 'application/json',
'xi-api-key': process.env.ELEVEN_LABS_API_KEY!,
};
Check quota before batch generation:
async function checkQuota(): Promise<{
characterCount: number;
characterLimit: number;
remaining: number;
}> {
const response = await fetch('https://api.elevenlabs.io/v1/user/subscription', {
headers: { 'xi-api-key': process.env.ELEVEN_LABS_API_KEY! },
});
const data = await response.json();
return {
characterCount: data.character_count,
characterLimit: data.character_limit,
remaining: data.character_limit - data.character_count,
};
}
Voice Listing and Selection
Fetch all available voices to pick the right one programmatically:
interface ElevenLabsVoice {
voice_id: string;
name: string;
category: string;
labels: Record<string, string>;
preview_url: string;
}
async function listVoices(): Promise<ElevenLabsVoice[]> {
const response = await fetch('https://api.elevenlabs.io/v1/voices', {
headers: { 'xi-api-key': process.env.ELEVEN_LABS_API_KEY! },
});
const data = await response.json();
return data.voices;
}
// Filter voices by attributes
async function findVoice(criteria: {
gender?: string;
accent?: string;
age?: string;
}): Promise<ElevenLabsVoice | undefined> {
const voices = await listVoices();
return voices.find((v) => {
const labels = v.labels;
if (criteria.gender && labels.gender !== criteria.gender) return false;
if (criteria.accent && labels.accent !== criteria.accent) return false;
if (criteria.age && labels.age !== criteria.age) return false;
return true;
});
}
Streaming TTS
For long narrations, stream audio chunks instead of waiting for the full response. This reduces time-to-first-byte and enables progressive processing:
import fs from 'fs';
async function streamNarration(
text: string,
voiceId: string,
outputPath: string
): Promise<void> {
const response = await fetch(
`https://api.elevenlabs.io/v1/text-to-speech/${voiceId}/stream`,
{
method: 'POST',
headers: {
'Content-Type': 'application/json',
'xi-api-key': process.env.ELEVEN_LABS_API_KEY!,
},
body: JSON.stringify({
text,
model_id: 'eleven_multilingual_v2',
voice_settings: {
stability: 0.5,
similarity_boost: 0.75,
style: 0.5,
use_speaker_boost: true,
},
}),
}
);
if (!response.ok || !response.body) {
throw new Error(`Stream error: ${response.status}`);
}
const writer = fs.createWriteStream(outputPath);
const reader = response.body.getReader();
while (true) {
const { done, value } = await reader.read();
if (done) break;
writer.write(Buffer.from(value));
}
writer.end();
}
WebSocket API for Real-time TTS
Use WebSockets for lowest-latency generation. Useful when previewing narration during development:
import WebSocket from 'ws';
import fs from 'fs';
async function realtimeTTS(
text: string,
voiceId: string,
outputPath: string
): Promise<void> {
return new Promise((resolve, reject) => {
const modelId = 'eleven_multilingual_v2';
const wsUrl = `wss://api.elevenlabs.io/v1/text-to-speech/${voiceId}/stream-input?model_id=${modelId}`;
const ws = new WebSocket(wsUrl);
const chunks: Buffer[] = [];
ws.on('open', () => {
// Begin stream with settings
ws.send(JSON.stringify({
text: ' ',
voice_settings: {
stability: 0.5,
similarity_boost: 0.75,
},
xi_api_key: process.env.ELEVEN_LABS_API_KEY!,
}));
// Send text
ws.send(JSON.stringify({ text }));
// Signal end of input
ws.send(JSON.stringify({ text: '' }));
});
ws.on('message', (data: Buffer) => {
try {
const json = JSON.parse(data.toString());
if (json.audio) {
chunks.push(Buffer.from(json.audio, 'base64'));
}
} catch {
// Binary data
chunks.push(Buffer.from(data));
}
});
ws.on('close', () => {
const audioBuffer = Buffer.concat(chunks);
fs.writeFileSync(outputPath, audioBuffer);
resolve();
});
ws.on('error', reject);
});
}
Pronunciation Dictionaries
Control how specific words are pronounced using SSML phoneme tags or the pronunciation dictionary API:
// Inline SSML approach - wrap specific words
function applyPronunciation(
text: string,
dictionary: Record<string, string>
): string {
let result = text;
for (const [word, ipa] of Object.entries(dictionary)) {
const regex = new RegExp(`\\b${word}\\b`, 'gi');
result = result.replace(
regex,
`<phoneme alphabet="ipa" ph="${ipa}">${word}</phoneme>`
);
}
return result;
}
// Common tech pronunciation overrides
const techPronunciations: Record<string, string> = {
'API': 'eI.piː.aI',
'CLI': 'siː.ɛl.aI',
'npm': 'ɛn.piː.ɛm',
'SQL': 'ɛs.kjuː.ɛl',
'OAuth': 'oʊ.ɔːθ',
'YAML': 'jæm.əl',
'nginx': 'ɛn.dʒɪnks',
};
Caching and Quota Management
Avoid regenerating audio for unchanged text. Use content-based hashing:
import crypto from 'crypto';
import fs from 'fs';
import path from 'path';
interface CacheKey {
text: string;
voiceId: string;
modelId: string;
stability: number;
similarityBoost: number;
}
function getCacheHash(key: CacheKey): string {
const content = JSON.stringify(key);
return crypto.createHash('sha256').update(content).digest('hex').slice(0, 16);
}
function getCachePath(cacheDir: string, hash: string): string {
return path.join(cacheDir, `${hash}.mp3`);
}
async function generateWithCache(
text: string,
voiceId: string,
cacheDir: string,
generateFn: (text: string, voiceId: string, output: string) => Promise<void>
): Promise<string> {
const hash = getCacheHash({
text,
voiceId,
modelId: 'eleven_multilingual_v2',
stability: 0.5,
similarityBoost: 0.75,
});
const cachePath = getCachePath(cacheDir, hash);
if (fs.existsSync(cachePath)) {
return cachePath;
}
await generateFn(text, voiceId, cachePath);
return cachePath;
}
Model Selection
| Model | Quality | Speed | Languages | Best for |
|---|---|---|---|---|
| eleven_multilingual_v2 | Highest | Slower | 28+ | Production narration |
| eleven_turbo_v2_5 | High | Fast | 32+ | Previews, iteration |
| eleven_monolingual_v1 | Good | Fast | English only | Simple English TTS |
Use eleven_turbo_v2_5 during development for faster iteration, then switch
to eleven_multilingual_v2 for the final render.
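The development/production split above can be captured in a tiny helper; a sketch (passing the environment as a string flag is an illustrative convention, not part of the ElevenLabs API):

```typescript
// Model choice as a function of environment. The two model IDs come from the
// table above; the env-string convention is illustrative.
type ElevenLabsModel = 'eleven_multilingual_v2' | 'eleven_turbo_v2_5';

function pickModel(env: string): ElevenLabsModel {
  // Fast, cheaper model while iterating; highest quality for the final render.
  return env === 'production' ? 'eleven_multilingual_v2' : 'eleven_turbo_v2_5';
}
```

Callers would typically pass something like process.env.NODE_ENV, so preview renders stay fast without touching the render pipeline itself.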
Error Handling
async function safeGenerate(
text: string,
voiceId: string,
outputPath: string,
maxRetries: number = 3
): Promise<void> {
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
await generateNarration(text, voiceId, outputPath);
return;
} catch (error) {
const message = error instanceof Error ? error.message : String(error);
if (message.includes('401')) {
throw new Error('Invalid API key. Check ELEVEN_LABS_API_KEY.');
}
if (message.includes('429')) {
const waitMs = Math.pow(2, attempt) * 1000;
console.warn(`Rate limited. Waiting ${waitMs}ms before retry...`);
await new Promise((r) => setTimeout(r, waitMs));
continue;
}
if (message.includes('422')) {
throw new Error(`Invalid request. Check voice_id "${voiceId}" exists.`);
}
if (attempt === maxRetries) throw error;
console.warn(`Attempt ${attempt} failed: ${message}. Retrying...`);
}
}
}
Batch Generation Pipeline
Generate narration for all scenes efficiently:
interface BatchScene {
id: string;
text: string;
}
interface BatchResult {
id: string;
audioPath: string;
durationMs: number;
cached: boolean;
}
async function batchGenerate(
scenes: BatchScene[],
voiceId: string,
outputDir: string,
cacheDir: string,
concurrency: number = 2
): Promise<BatchResult[]> {
const results: BatchResult[] = [];
// Process in batches to respect rate limits
for (let i = 0; i < scenes.length; i += concurrency) {
const batch = scenes.slice(i, i + concurrency);
const batchResults = await Promise.all(
batch.map(async (scene) => {
const hash = getCacheHash({
text: scene.text,
voiceId,
modelId: 'eleven_multilingual_v2',
stability: 0.5,
similarityBoost: 0.75,
});
const cachePath = getCachePath(cacheDir, hash);
const outputPath = path.join(outputDir, `${scene.id}.mp3`);
const cached = fs.existsSync(cachePath);
if (cached) {
fs.copyFileSync(cachePath, outputPath);
} else {
await safeGenerate(scene.text, voiceId, outputPath);
fs.copyFileSync(outputPath, cachePath);
}
const durationMs = getAudioDurationMs(outputPath);
return { id: scene.id, audioPath: outputPath, durationMs, cached };
})
);
results.push(...batchResults);
}
return results;
}
sfx-generation.md
SFX Generation with FFmpeg
Comprehensive reference for generating sound effects programmatically using FFmpeg's lavfi audio generators. Load this file when the task involves creating custom SFX, building a sound library, or understanding FFmpeg audio synthesis.
FFmpeg Audio Generators
FFmpeg's lavfi (libavfilter virtual input) provides several audio sources that can be combined to create sound effects without any input files:
| Generator | Description | Key Parameters |
|---|---|---|
| sine | Pure sine wave tone | frequency, duration |
| anoisesrc | White/pink/brown noise | duration, color, amplitude |
| aevalsrc | Custom math expressions | exprs, duration |
| anullsrc | Silence generator | duration, sample_rate |
Basic SFX Recipes
UI Sounds
# Click - short sine burst (good for buttons)
ffmpeg -y -f lavfi -i "sine=frequency=800:duration=0.05" \
-af "afade=t=out:st=0.02:d=0.03" \
-ar 44100 click.wav
# Soft click - lower frequency, gentler
ffmpeg -y -f lavfi -i "sine=frequency=500:duration=0.04" \
-af "afade=t=out:st=0.01:d=0.03,lowpass=f=1000" \
-ar 44100 soft-click.wav
# Toggle on - rising pitch
ffmpeg -y -f lavfi -i "aevalsrc=exprs=sin(2*PI*(600+400*t/0.1)*t):d=0.1" \
-af "afade=t=out:st=0.05:d=0.05" \
-ar 44100 toggle-on.wav
# Toggle off - falling pitch
ffmpeg -y -f lavfi -i "aevalsrc=exprs=sin(2*PI*(1000-400*t/0.1)*t):d=0.1" \
-af "afade=t=out:st=0.05:d=0.05" \
-ar 44100 toggle-off.wav
# Hover - subtle high-frequency blip
ffmpeg -y -f lavfi -i "sine=frequency=2000:duration=0.03" \
-af "afade=t=in:d=0.01,afade=t=out:st=0.01:d=0.02,volume=0.3" \
-ar 44100 hover.wav
Keyboard and Typing
# Single keypress
ffmpeg -y -f lavfi -i "anoisesrc=d=0.08:c=white:a=0.3" \
-af "highpass=f=2000,lowpass=f=8000,afade=t=out:st=0.04:d=0.04" \
-ar 44100 type.wav
# Mechanical key - louder with more body
ffmpeg -y -f lavfi -i "anoisesrc=d=0.12:c=white:a=0.5" \
-af "highpass=f=1000,lowpass=f=6000,afade=t=out:st=0.06:d=0.06" \
-ar 44100 mech-key.wav
# Spacebar - deeper, longer
ffmpeg -y -f lavfi -i "anoisesrc=d=0.15:c=white:a=0.4" \
-af "highpass=f=500,lowpass=f=3000,afade=t=out:st=0.08:d=0.07" \
-ar 44100 spacebar.wav
# Enter key - satisfying thunk
ffmpeg -y -f lavfi -i "anoisesrc=d=0.18:c=brown:a=0.5" \
-af "highpass=f=300,lowpass=f=2000,afade=t=out:st=0.08:d=0.1" \
-ar 44100 enter.wav
Transitions
# Whoosh - frequency sweep
ffmpeg -y -f lavfi -i "sine=frequency=200:duration=0.4" \
-af "vibrato=f=8:d=0.5,afade=t=in:d=0.1,afade=t=out:st=0.2:d=0.2,lowpass=f=1000" \
-ar 44100 whoosh.wav
# Swoosh - faster, higher pitch
ffmpeg -y -f lavfi -i "sine=frequency=300:duration=0.3" \
-af "vibrato=f=12:d=0.8,afade=t=in:d=0.05,afade=t=out:st=0.15:d=0.15,bandpass=f=500:w=400" \
-ar 44100 swoosh.wav
# Slide in - rising tone with noise
ffmpeg -y -f lavfi -i "aevalsrc=exprs=sin(2*PI*(100+800*t/0.3)*t)*0.3:d=0.3" \
-af "afade=t=in:d=0.05,afade=t=out:st=0.2:d=0.1,lowpass=f=2000" \
-ar 44100 slide-in.wav
# Slide out - falling tone
ffmpeg -y -f lavfi -i "aevalsrc=exprs=sin(2*PI*(900-800*t/0.3)*t)*0.3:d=0.3" \
-af "afade=t=in:d=0.05,afade=t=out:st=0.2:d=0.1,lowpass=f=2000" \
-ar 44100 slide-out.wav
Notification Sounds
# Ding/chime - bell synthesis
ffmpeg -y -f lavfi -i "sine=frequency=1200:duration=0.6" \
-af "afade=t=out:st=0.1:d=0.5,aecho=0.8:0.88:40:0.4" \
-ar 44100 ding.wav
# Success - two-tone ascending
ffmpeg -y -f lavfi \
  -i "aevalsrc=exprs='sin(2*PI*800*t)*lt(t,0.15)+sin(2*PI*1200*t)*gte(t,0.15)':d=0.3" \
-af "afade=t=out:st=0.15:d=0.15" \
-ar 44100 success.wav
# Error - low buzzy tone
ffmpeg -y -f lavfi -i "sine=frequency=200:duration=0.4" \
-af "vibrato=f=20:d=0.5,afade=t=out:st=0.2:d=0.2" \
-ar 44100 error.wav
# Pop - impulse
ffmpeg -y -f lavfi -i "sine=frequency=400:duration=0.08" \
-af "afade=t=out:st=0.02:d=0.06,lowpass=f=600" \
-ar 44100 pop.wav
# Bubble pop - higher, rounder
ffmpeg -y -f lavfi -i "aevalsrc=exprs=sin(2*PI*(800-400*t/0.1)*t)*0.5:d=0.1" \
-af "afade=t=out:st=0.04:d=0.06,lowpass=f=1500" \
-ar 44100 bubble.wav
Audio Filters Reference
Key FFmpeg audio filters used in SFX generation:
| Filter | Purpose | Example |
|---|---|---|
| afade | Fade in/out | afade=t=out:st=0.1:d=0.2 |
| lowpass | Remove high frequencies | lowpass=f=1000 |
| highpass | Remove low frequencies | highpass=f=2000 |
| bandpass | Keep frequency range | bandpass=f=500:w=200 |
| vibrato | Add pitch wobble | vibrato=f=8:d=0.5 |
| aecho | Add echo/reverb | aecho=0.8:0.88:40:0.4 |
| volume | Adjust volume | volume=0.5 |
| atempo | Change speed | atempo=1.5 |
| areverse | Reverse audio | areverse |
| chorus | Add richness | chorus=0.5:0.9:50:0.4:0.25:2 |
Chain filters with commas: -af "filter1,filter2,filter3"
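When a script assembles these commands (as the library builder below does), it can help to build the -af chain from an array rather than hand-concatenating strings. A small hypothetical helper:

```typescript
// Join individual filter expressions into one -af chain, skipping blanks.
function buildAfChain(filters: string[]): string {
  return filters
    .map((f) => f.trim())
    .filter((f) => f.length > 0)
    .join(',');
}

// e.g. the keypress chain from the typing recipe above:
const typeChain = buildAfChain([
  'highpass=f=2000',
  'lowpass=f=8000',
  'afade=t=out:st=0.04:d=0.04',
]);
```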
Building a Reusable SFX Library
Create a build script that generates all SFX in one pass:
import { execSync } from 'child_process';
import fs from 'fs';
import path from 'path';
interface SfxDefinition {
name: string;
command: string;
}
const SFX_LIBRARY: SfxDefinition[] = [
{
name: 'click',
command: '-f lavfi -i "sine=frequency=800:duration=0.05" -af "afade=t=out:st=0.02:d=0.03"',
},
{
name: 'type',
command: '-f lavfi -i "anoisesrc=d=0.08:c=white:a=0.3" -af "highpass=f=2000,lowpass=f=8000,afade=t=out:st=0.04:d=0.04"',
},
{
name: 'whoosh',
command: '-f lavfi -i "sine=frequency=200:duration=0.4" -af "vibrato=f=8:d=0.5,afade=t=in:d=0.1,afade=t=out:st=0.2:d=0.2,lowpass=f=1000"',
},
{
name: 'ding',
command: '-f lavfi -i "sine=frequency=1200:duration=0.6" -af "afade=t=out:st=0.1:d=0.5,aecho=0.8:0.88:40:0.4"',
},
{
name: 'pop',
command: '-f lavfi -i "sine=frequency=400:duration=0.08" -af "afade=t=out:st=0.02:d=0.06,lowpass=f=600"',
},
{
name: 'swoosh',
command: '-f lavfi -i "sine=frequency=300:duration=0.3" -af "vibrato=f=12:d=0.8,afade=t=in:d=0.05,afade=t=out:st=0.15:d=0.15,bandpass=f=500:w=400"',
},
{
name: 'success',
command: '-f lavfi -i "aevalsrc=exprs=\'sin(2*PI*800*t)*lt(t,0.15)+sin(2*PI*1200*t)*gte(t,0.15)\':d=0.3" -af "afade=t=out:st=0.15:d=0.15"',
},
{
name: 'error',
command: '-f lavfi -i "sine=frequency=200:duration=0.4" -af "vibrato=f=20:d=0.5,afade=t=out:st=0.2:d=0.2"',
},
];
function buildSfxLibrary(outputDir: string): void {
if (!fs.existsSync(outputDir)) {
fs.mkdirSync(outputDir, { recursive: true });
}
for (const sfx of SFX_LIBRARY) {
const outputPath = path.join(outputDir, `${sfx.name}.wav`);
const cmd = `ffmpeg -y ${sfx.command} -ar 44100 "${outputPath}"`;
try {
execSync(cmd, { stdio: 'pipe' });
console.log(`Generated: ${sfx.name}.wav`);
} catch (error) {
console.error(`Failed to generate ${sfx.name}:`, error);
}
}
}
// Run: buildSfxLibrary('./public/audio/sfx')
Combining Multiple Generators
Layer two generators for richer sounds using FFmpeg's amix filter:
# Rich notification: sine + noise burst
ffmpeg -y \
-f lavfi -i "sine=frequency=1000:duration=0.3" \
-f lavfi -i "anoisesrc=d=0.05:c=white:a=0.2" \
-filter_complex "[0]afade=t=out:st=0.1:d=0.2[a];[1]afade=t=out:st=0.02:d=0.03[b];[a][b]amix=inputs=2:duration=longest" \
-ar 44100 rich-ding.wav
# Laser: two detuned sines
ffmpeg -y \
-f lavfi -i "aevalsrc=exprs=sin(2*PI*(2000-1500*t/0.2)*t):d=0.2" \
-f lavfi -i "aevalsrc=exprs=sin(2*PI*(2100-1600*t/0.2)*t)*0.5:d=0.2" \
-filter_complex "[0][1]amix=inputs=2:duration=shortest,afade=t=out:st=0.1:d=0.1" \
-ar 44100 laser.wav
Converting SFX for Remotion
Remotion works best with specific audio formats. Convert generated WAV files for optimal compatibility:
# WAV to MP3 (smaller file size for music)
ffmpeg -y -i input.wav -codec:a libmp3lame -b:a 192k output.mp3
# Ensure consistent sample rate
ffmpeg -y -i input.wav -ar 44100 -ac 2 output.wav
# Normalize volume to prevent clipping
ffmpeg -y -i input.wav -af "loudnorm=I=-16:LRA=11:TP=-1.5" output.wav
# Trim silence from start and end
ffmpeg -y -i input.wav \
-af "silenceremove=start_periods=1:start_silence=0.01:start_threshold=-50dB,areverse,silenceremove=start_periods=1:start_silence=0.01:start_threshold=-50dB,areverse" \
output.wav
Troubleshooting
| Problem | Cause | Fix |
|---|---|---|
| SFX sounds different on CI | Different FFmpeg version or defaults | Pin -ar 44100 -sample_fmt s16 |
| Click sounds too harsh | High frequency, no envelope | Add afade=t=out and lowpass |
| Silence at start of WAV | Default encoder behavior | Use silenceremove filter |
| Playback too quiet in Remotion | WAV peaks low | Normalize with loudnorm filter |
| SFX not playing at all | Wrong file path | Use staticFile() with correct relative path |
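Several of the fixes above (pinned sample rate and format, loudness normalization) can be rolled into one post-processing pass. The sketch below only builds the command string so it can be checked without running ffmpeg; ffmpeg itself is assumed to be on PATH when you execute the result:

```typescript
// Build an ffmpeg command applying the troubleshooting fixes in one pass:
// loudnorm against quiet playback, pinned rate/format for CI reproducibility.
function buildPostProcessCmd(input: string, output: string): string {
  return [
    'ffmpeg -y',
    `-i "${input}"`,
    '-af "loudnorm=I=-16:LRA=11:TP=-1.5"',
    '-ar 44100 -sample_fmt s16',
    `"${output}"`,
  ].join(' ');
}
```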
Frequently Asked Questions
What is video-audio-design?
Use this skill when adding audio to programmatic videos - generating narration with ElevenLabs TTS, sourcing royalty-free background music, creating SFX with FFmpeg, implementing audio ducking, or mixing multiple audio layers in Remotion. Triggers on ElevenLabs, text-to-speech, voice generation, background music, sound effects, audio mixing, and volume ducking.
How do I install video-audio-design?
Run npx skills add AbsolutelySkilled/AbsolutelySkilled --skill video-audio-design in your terminal. The skill will be immediately available in your AI coding agent.
What AI agents support video-audio-design?
video-audio-design works with claude-code, gemini-cli, openai-codex. Install it once and use it across any supported AI coding agent.