Case study

AI-Powered Video Editing Pipeline

Python + Remotion · VO → YouTube

Python · WhisperX · LLM · Remotion · React · TypeScript · FFmpeg · EDL JSON

Key outcomes

Dual-stack: Python (WhisperX, LLM, asset sourcing) + Remotion (React/TS) for rendering
EDL JSON pivot format connecting intelligence and compositing
Animated subtitles, overlays, transitions, audio ducking, and -14 LUFS loudness normalisation

The Problem

Video editing is one of the most time-intensive parts of content production. A 30-minute documentary-style YouTube video can take 20–40 hours to edit: transcription, B-roll selection, subtitle sync, colour grade, audio normalisation, chapter markers.

For high-volume channels—or creators who want to focus on ideas and speaking, not post-production—this is a bottleneck. The pipeline eliminates it.

What Was Built

The pipeline takes a single voiceover audio file (and optionally a text script) and produces a fully edited YouTube video in documentary style: B-roll, synced subtitles, transitions, chapter markers, and loudness-normalised audio.

The output is production-ready. Target benchmarks: Johnny Harris, Veritasium, ColdFusion.

Dual-Stack Architecture

The system is split into two runtimes that communicate via an EDL JSON (Edit Decision List) pivot format:

Python stack — intelligence layer:

  • WhisperX for word-level transcription (faster than vanilla Whisper, with alignment for more accurate word timestamps)
  • LLM analysis of transcript → scene segmentation, B-roll keywords, subtitle groupings
  • Asset sourcing from Pexels and Pixabay with caching and deduplication
  • EDL JSON v3 generation: one record per segment with timing, B-roll asset, subtitle text, transition type
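Asset caching and deduplication hinge on deterministic file naming, so the same sourced clip is never downloaded twice across runs or queries. A minimal sketch of one way to derive such a path; the naming scheme and function name here are illustrative assumptions, not the pipeline's actual implementation:

```python
import hashlib
from pathlib import Path

def cached_asset_path(cache_dir: Path, provider: str, asset_id: str,
                      url: str) -> Path:
    """Deterministic cache path for a sourced B-roll asset.

    Keying on provider + asset ID means the same Pexels/Pixabay asset
    resolves to the same file regardless of which search query found it,
    which gives both caching and dedup for free.
    """
    key = hashlib.sha256(f"{provider}:{asset_id}".encode()).hexdigest()[:16]
    suffix = Path(url.split("?")[0]).suffix or ".bin"  # strip query string
    return cache_dir / f"{provider}_{key}{suffix}"
```

Because the path is a pure function of provider and asset ID, a pre-download existence check is all the cache logic needed.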

Remotion stack (React/TypeScript) — compositing layer:

  • Reads the EDL JSON and renders the video frame-by-frame
  • Word-synced subtitle animation (Hormozi-style: large, high-contrast, bold)
  • B-roll composited over the voiceover with Ken Burns motion on stills
  • @remotion/transitions for cut types (hard cut, cross-dissolve, wipe)
  • Light leak overlays and motion graphics for visual rhythm
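The Ken Burns effect on stills reduces to interpolating a zoom and a pan across the segment's frames. A sketch of the underlying maths, assuming linear easing; the actual presets, easing curves, and parameter values in the Remotion layer are not shown in the source and are illustrative here:

```python
def ken_burns(frame: int, duration_frames: int,
              start_scale: float = 1.0, end_scale: float = 1.12,
              pan_x_px: float = -40.0) -> tuple[float, float]:
    """Interpolate (scale, translate_x_px) for one frame of a still.

    The compositing layer would apply these as a CSS transform, e.g.
    `scale(s) translateX(tx px)`, over the segment's duration.
    """
    t = frame / max(duration_frames - 1, 1)  # progress in [0, 1]
    scale = start_scale + (end_scale - start_scale) * t
    translate_x = pan_x_px * t
    return scale, translate_x
```

Deriving pan direction from image-content keywords (as the pipeline does) would simply flip the sign or axis of `pan_x_px`.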

FFmpeg handles the final pass: audio ducking (B-roll music under VO), loudness normalisation to -14 LUFS, and mux.
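That final pass can be sketched as a single `ffmpeg` invocation: sidechain compression ducks the music bed under the voiceover, the two are mixed, and `loudnorm` brings the result to -14 LUFS before muxing with the rendered video. The filter parameters below are illustrative, not the pipeline's exact values:

```python
def final_pass_cmd(video: str, vo: str, music: str, out: str) -> list[str]:
    """Build the ffmpeg command for ducking, normalisation, and mux."""
    filter_graph = (
        # Duck the music (input 2) using the VO (input 1) as sidechain.
        "[2:a][1:a]sidechaincompress=threshold=0.05:ratio=8[ducked];"
        # Mix VO with the ducked music bed.
        "[1:a][ducked]amix=inputs=2:duration=first[mix];"
        # Normalise the mix to -14 LUFS integrated loudness.
        "[mix]loudnorm=I=-14:TP=-1.5:LRA=11[aout]"
    )
    return [
        "ffmpeg", "-y",
        "-i", video,   # 0: rendered Remotion video
        "-i", vo,      # 1: voiceover
        "-i", music,   # 2: B-roll music bed
        "-filter_complex", filter_graph,
        "-map", "0:v", "-map", "[aout]",
        "-c:v", "copy", "-c:a", "aac",
        out,
    ]
```

Copying the video stream (`-c:v copy`) keeps the pass fast: only audio is re-encoded.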

The EDL JSON Format

The EDL JSON is the architectural decision that makes the dual-stack approach work. It is the contract between intelligence and compositing.

Each entry contains: segment_id, start_ms, end_ms, vo_text, subtitle_chunks, broll_asset, broll_type (video/image), transition_in, transition_out, motion_preset.

This format is human-readable and editable. A creator can open the EDL JSON, change a B-roll selection or subtitle grouping, and re-render without re-running the AI analysis. Render-only changes are therefore fast; the intelligence layer re-runs only when the analysis itself needs to change.
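A single segment entry might look like the following; all field values are illustrative, only the field names come from the format description above:

```json
{
  "segment_id": 12,
  "start_ms": 48200,
  "end_ms": 53900,
  "vo_text": "The real bottleneck isn't ideas, it's post-production.",
  "subtitle_chunks": [
    { "text": "The real bottleneck", "start_ms": 48200, "end_ms": 49400 },
    { "text": "isn't ideas", "start_ms": 49400, "end_ms": 50600 },
    { "text": "it's post-production", "start_ms": 50600, "end_ms": 53900 }
  ],
  "broll_asset": "assets/pexels_851112.mp4",
  "broll_type": "video",
  "transition_in": "hard_cut",
  "transition_out": "cross_dissolve",
  "motion_preset": null
}
```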

Professional Video Rules Baked In

The LLM is prompted with specific editorial rules:

  • Shot length: 3–7 seconds per B-roll clip (faster cuts feel more energetic)
  • Cut ratio: 90% hard cuts, 10% transitions (over-using transitions is a beginner mistake)
  • Subtitle grouping: 3–5 words per card (readability at speed)
  • Ken Burns: applied to stills only, with scale and pan direction derived from image content keywords
  • Audio ducking: B-roll music (if provided) ducks to -18 LUFS under the VO, then returns

These rules produce a consistent output style without requiring per-video prompt tuning.
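The 3–5-words-per-card rule can be sketched as a pure function over WhisperX-style word timestamps. This is a simplified greedy version with an orphan-avoidance step; in the real pipeline the LLM picks breaks on phrase boundaries, which this sketch does not attempt:

```python
def group_subtitles(words: list[dict], max_words: int = 5) -> list[dict]:
    """Group word-level timestamps into subtitle cards of 3-5 words.

    `words` is a list of dicts with 'word', 'start', and 'end' keys,
    as produced by word-level alignment.
    """
    cards = [words[i:i + max_words] for i in range(0, len(words), max_words)]
    # Avoid a one- or two-word orphan card: borrow from the previous card.
    if len(cards) > 1 and len(cards[-1]) < 3:
        need = 3 - len(cards[-1])
        cards[-1] = cards[-2][-need:] + cards[-1]
        cards[-2] = cards[-2][:-need]
    return [
        {"text": " ".join(w["word"] for w in card),
         "start": card[0]["start"], "end": card[-1]["end"]}
        for card in cards
    ]
```

Each card carries the start of its first word and the end of its last, which is what drives the word-synced animation downstream.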

Outputs

Each run produces:

  • final.mp4 — the complete rendered video
  • edl.json — the full edit decision list (inspect, edit, re-render)
  • chapters.txt — YouTube chapter timestamps derived from scene segmentation
  • assets/ — sourced and cached B-roll assets with metadata
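Deriving chapters.txt from scene segmentation is a small formatting step: YouTube expects one `M:SS Title` line per chapter, with the first chapter at 0:00. A sketch, with illustrative scene data:

```python
def chapters_txt(scenes: list[tuple[int, str]]) -> str:
    """Render YouTube chapter lines from (start_ms, title) segments.

    Timestamps are M:SS, or H:MM:SS once the video passes an hour.
    """
    lines = []
    for start_ms, title in scenes:
        total_s = start_ms // 1000
        h, rem = divmod(total_s, 3600)
        m, s = divmod(rem, 60)
        stamp = f"{h}:{m:02d}:{s:02d}" if h else f"{m}:{s:02d}"
        lines.append(f"{stamp} {title}")
    return "\n".join(lines)
```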

Scope

  • WhisperX transcription + LLM analysis → EDL JSON v3; optional text script for accuracy.
  • Asset sourcing (Pexels/Pixabay) with caching, dedup, and per-segment manifests.
  • Remotion: compositing, word-synced subtitles, motion graphics, @remotion/transitions, light leaks.
  • FFmpeg: ducking, loudness normalisation, final mux. Outputs: final.mp4, edl.json, chapters.txt.
  • Professional rules: 3–7s per shot, 90% hard cuts, Hormozi-style subtitles, Ken Burns on B-roll.

Waqas Raza

AI-Native Full-Stack Engineer. Top Rated on Upwork · $180K+ earned · 93% job success. I build production AI agents, LLM systems, Web3 platforms, and full-stack applications.
