Self-hosted tool · Designed & built end-to-end

Burn styled captions
without the round-trip.

Drop a video into the browser, let Whisper transcribe it, polish the segments with Claude if needed, drag the overlays to taste, and burn the result through FFmpeg. What you see in the editor is exactly what gets embedded in the file.

Prototype · walkthrough

See it in action

Screenshots of the real editor UI, the timeline, drag-to-position overlays, and the live caption preview. Source on GitHub.

View prototype

Reference

Design system

Editor-dark tokens, timeline-blue accent, Inter for tool UI, JetBrains Mono for timecodes - the full, code-truthful spec.

View system

Engineering

Under the hood

Python · FastAPI · Whisper · Claude · FFmpeg · libass. Bundled into the YouTube Downloader's Caption tab via WKWebView.

See the build

The problem

Burning styled captions is a slow round-trip through Premiere or paid SaaS.

Export to Premiere, wait, tweak the style, re-export. Or pay a subscription for a cloud captioner that mis-syncs every third sentence. YouTube's auto-translate is famously out of sync too. I wanted a self-hosted tool that handles the entire pipeline - from raw video to burned-in subtitles - in one sitting, without leaving the browser. And I wanted the preview to match what gets burned, to the pixel.

The flow

One pipeline, five beats. Drop in, burn out.

I sketched the pipeline on paper first - confirming that each step had a real job before writing a line of Python.

From raw video to burned captions without leaving the browser:

1. Drop video · drag & drop or browse
2. Whisper transcribes · word-level timestamps
3. Claude refines · grammar & fillers (optional)
4. Edit & position · timeline + drag overlay
5. FFmpeg burns · libass, pixel-perfect match

The Claude step is optional. Preview = burn, always.

Three design bets

The decisions that shaped the whole experience.

"Self-hosted" is easy to say and hard to make feel good. Three early bets turned it from a CLI script into a tool people would actually reach for.

The preview must match the burn exactly.

The in-browser preview renders caption overlays using CSS - but the burned file uses libass. Getting these to agree required pinning PlayResY=288, pre-wrapping line breaks server-side, and using an inline-block anchor for CSS position. What you drag in the editor is exactly where the subtitle lands in the exported video.
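A minimal sketch of the scaling that makes the two renderers agree, assuming the preview simply maps ASS script units (height pinned to 288, as above) onto the displayed height of the video element. The function names are hypothetical, and the sketch ignores font-metric differences between libass and the browser.

```python
PLAY_RES_Y = 288  # script height pinned by the burn, per the text above


def css_font_px(font_size: float, preview_h: float) -> float:
    """Scale an ASS font size (script units) to CSS pixels in the preview."""
    return font_size * preview_h / PLAY_RES_Y


def script_y_to_css(y_script: float, preview_h: float) -> float:
    """Map a vertical script coordinate onto the preview element's height."""
    return y_script * preview_h / PLAY_RES_Y


# A 16-unit font shown in a 576px-tall preview renders at twice the script size.
print(css_font_px(16, 576))
```

Because both sides divide by the same fixed 288, resizing the preview changes only `preview_h` and the overlay stays proportionally in place.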

~~Approximate preview~~ → Pixel-accurate match

Word-level timestamps, not segment-level.

Whisper's default segment timestamps are loose - often pinned to 0.0 even when the first word starts at 3 seconds. The app transcribes with word_timestamps=True and snaps each segment's start/end to the actual word boundaries. Captions hit when the speaker speaks.
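The snapping step could look roughly like this. It is a sketch, not the app's actual code: it assumes segments shaped like the `segments` list that openai-whisper returns when called with word_timestamps=True (each segment carrying a `words` list with `start` and `end`), and the function name is hypothetical.

```python
def snap_to_words(segments: list[dict]) -> list[dict]:
    """Return copies of the segments with start/end moved to word boundaries."""
    snapped = []
    for seg in segments:
        words = seg.get("words") or []
        if words:
            # Use the first word's onset and the last word's offset.
            start, end = words[0]["start"], words[-1]["end"]
        else:
            # No word data (e.g. silence): keep the segment's own timestamps.
            start, end = seg["start"], seg["end"]
        snapped.append({**seg, "start": start, "end": end})
    return snapped


example = [{
    "start": 0.0, "end": 5.0, "text": " hello world",
    "words": [
        {"word": " hello", "start": 3.0, "end": 3.4},
        {"word": " world", "start": 3.5, "end": 4.1},
    ],
}]
print(snap_to_words(example)[0]["start"])
```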

~~Segment start ≈ 0.0~~ → Snapped to word onset

Two-pass burn for box + outline together.

libass can give you a translucent box or a per-glyph outline - not both in one pass. The pipeline chains two subtitles filters: pass 1 renders the box only, pass 2 layers text and outline on top. The user just toggles two controls; the complexity is invisible.
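As an illustration of the chaining, here is a sketch that builds the FFmpeg arguments for the two-pass case. It assumes two subtitle files have already been written (one styled for the box only, one for text and outline); the file names and helper names are hypothetical, and paths are assumed to contain no characters that need escaping inside a filter string.

```python
def build_burn_filter(box_subs: str, text_subs: str) -> str:
    """Chain two subtitles filters: the box pass first, text and outline on top."""
    return f"subtitles={box_subs},subtitles={text_subs}"


def build_burn_command(ffmpeg: str, src: str, dst: str,
                       box_subs: str, text_subs: str) -> list[str]:
    """Assemble an FFmpeg argument list that burns both passes in one run."""
    return [
        ffmpeg, "-y",
        "-i", src,
        "-vf", build_burn_filter(box_subs, text_subs),
        "-c:a", "copy",  # audio is untouched, so copy it through
        dst,
    ]


cmd = build_burn_command("ffmpeg", "in.mp4", "out.mp4", "box.ass", "text.ass")
print(" ".join(cmd))
```

Filters in a comma-separated chain run left to right, so the box lands on the frame before the glyphs are drawn over it.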

~~Box OR outline~~ → Box AND outline, two passes

UX choices that matter

Small calls that change how the whole thing feels.

Most of what makes a tool feel reliable is a pile of small, invisible decisions. Here are the ones worth defending.

Dragging a caption updates only the active segment.

Click a segment on the timeline, then drag the overlay in the live preview. Only that segment stores a custom position. The rest default to the global alignment. Right-click a segment to clear its override and snap it back to the stylesheet default.
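The override model can be sketched with plain segment dicts. The pos_x / pos_y keys come from the app; the helper names, and the idea that a missing key means "use the global default", are assumptions made for illustration.

```python
def set_position(segments: list[dict], active: int, x: float, y: float) -> None:
    """Store a custom position on the active segment only."""
    segments[active]["pos_x"] = x
    segments[active]["pos_y"] = y


def clear_position(segments: list[dict], index: int) -> None:
    """Drop the override so the segment falls back to the global alignment."""
    segments[index].pop("pos_x", None)
    segments[index].pop("pos_y", None)


def has_custom_position(seg: dict) -> bool:
    return seg.get("pos_x") is not None and seg.get("pos_y") is not None


def effective_position(seg: dict, default: tuple) -> tuple:
    """Resolve where a segment is drawn: its override, else the default."""
    if has_custom_position(seg):
        return (seg["pos_x"], seg["pos_y"])
    return default


segs = [{"text": "first"}, {"text": "second"}]
set_position(segs, 1, 0.5, 0.8)
print(effective_position(segs[0], (0.5, 0.9)), effective_position(segs[1], (0.5, 0.9)))
```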

The wrap-width slider mirrors the burn's exact glyph budget.

The server estimates character width as font_size × (video_h/288) × 0.55. The browser JS uses the same formula to re-wrap text as the slider moves, so preview and burn no longer disagree on line counts.
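A sketch of that glyph budget in Python. The width formula is the one quoted above; treating the slider value as a width in video pixels and wrapping with the standard library's textwrap are assumptions, as are the function names.

```python
import textwrap

PLAY_RES_Y = 288          # script height shared by preview and burn
CHAR_WIDTH_RATIO = 0.55   # average glyph width as a fraction of font size


def char_width_px(font_size: float, video_h: float) -> float:
    """Estimated width of one character in video pixels."""
    return font_size * (video_h / PLAY_RES_Y) * CHAR_WIDTH_RATIO


def wrap_caption(text: str, font_size: float, video_h: float,
                 wrap_width_px: float) -> list[str]:
    """Pre-wrap a caption so no line exceeds the character budget."""
    budget = max(1, int(wrap_width_px // char_width_px(font_size, video_h)))
    return textwrap.wrap(text, width=budget)


lines = wrap_caption("the quick brown fox jumps over the lazy dog", 16, 1080, 700)
print(lines)
```

Running the same arithmetic on both sides is what keeps the line count stable: the browser never measures rendered text, it only counts characters against the same budget.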

Claude is genuinely optional, not a locked feature.

A toggle clearly labelled "Polish with Claude (grammar / fillers)" defaults to off. Whisper-only output ships fast. When on, the contribution is auditable - segment text shows before and after the refinement so nothing slips through invisibly.
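One way to keep the refinement auditable is to store the original text next to the refined text. In this sketch the `refine` callable stands in for the Claude request (it is not a real API call), and the `original_text` and `changed` keys are hypothetical.

```python
from typing import Callable


def refine_segments(segments: list[dict], refine: Callable[[str], str],
                    enabled: bool = False) -> list[dict]:
    """Optionally refine each segment, keeping the original text for review."""
    out = []
    for seg in segments:
        original = seg["text"]
        refined = refine(original) if enabled else original
        out.append({
            **seg,
            "text": refined,
            "original_text": original,       # shown as the "before" in the UI
            "changed": refined != original,  # lets the UI flag edited segments
        })
    return out


def fake_refine(text: str) -> str:
    """Stand-in for the model call: strips one filler word."""
    return text.replace("um, ", "")


segs = [{"text": "um, hello there"}, {"text": "second line"}]
for seg in refine_segments(segs, fake_refine, enabled=True):
    print(seg["original_text"], "->", seg["text"])
```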

The dark editor palette comes from the app itself.

The app's own CSS uses #0b0d10 backgrounds and #5b8def accents - a video-timeline feel. The case study inherits those same values so the design language is coherent between the tool and its documentation.

Segments with a custom position show green in the timeline.

The timeline bar turns green for segments where pos_x / pos_y have been set. You can see at a glance which captions have been manually placed and which are still using the global defaults - no hunting through a settings panel.

Bundled into the YouTube Downloader as the Caption tab.

The FastAPI server ships inside the macOS YouTube Downloader app, and the Caption tab opens the editor in an embedded WKWebView pointed at that local server. One install, one place to go - no context-switching to a separate browser tab.

Behind the design

One dark system, used everywhere.

Auto-Caption runs on a "focused editor / timeline" system - deep navy backgrounds, a crisp blue accent drawn directly from the app's own stylesheet, and JetBrains Mono for anything that smells like a timecode. Here's the short version; the full spec lives on the Design page.

Color · editor-dark surfaces · one blue accent

The accent comes straight from the app's own CSS.

The app uses #5B8DEF for every interactive element - active timeline segments, button fills, focus rings. The case study inherits this exact value as --brand-primary. The surface chain runs from #0D1117 to the card's #161E2C, matching the editor panel hierarchy.

Blue · #5B8DEF · Brand accent, active segments, CTAs
Purple · #8B7DEF · Claude / AI-refine indicator
Green · #4CB87A · Positioned segment, success
Amber · #F0A030 · Timeline playhead, warning
Panel · #161E2C · Card / panel background
Text · #E8EAED · Primary text on dark

Type · tool UI needs crisp, not friendly

Inter for labels, JetBrains Mono for anything time-related.

Inter 700 · Display · headings · panel titles

Captions ready.

Clean, neutral, tool-grade. Inter at weight 700 reads as "professional software" without the warmth of a consumer app - the right register for an editor.

Inter 400/500 · Body · UI labels · settings

Drop a video, adjust style, hit Burn - get back the same file with captions embedded.

Weight 400–500 for body copy and panel labels. Inherits from the root; nothing overrides it unless it needs to feel system-like.

JetBrains Mono · Timecodes · eyebrows · token labels

00:01.420 → 00:04.890 · model: base · #5B8DEF

Only where something must read as "data" - start/end times, hex values in the design spec, section eyebrows. Not used for body copy.
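For reference, the MM:SS.mmm timecode shape shown in the sample can be produced with a few lines of Python. This is an illustrative helper with a hypothetical name, not code taken from the app.

```python
def format_timecode(seconds: float) -> str:
    """Format seconds as MM:SS.mmm, e.g. for timeline labels."""
    total_ms = round(seconds * 1000)          # work in whole milliseconds
    minutes, rest_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(rest_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


print(format_timecode(1.42), "→", format_timecode(4.89))
```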

Shape · crisp editor corners

Less soft than a consumer app, never truly sharp.

An editor tool benefits from slightly tighter radii. Four steps, from pill status badges down to the timeline segment chips that need to be visibly rectangular.

Pill · 999px · Status badges, eyebrow chips
Card · 16px · Panels, upload zone, preview frame
Button · 12px · Action buttons, segment editor
Chip · 6px · Timeline segments, inline badges

Identity · the icon

A timeline with a playhead - the whole app in one frame.

Auto-Caption

Video · FFmpeg · libass

The icon shows a video frame with two caption lines and a timeline bar beneath it - a playhead resting mid-track. At thumbnail size it reads as "video tool with captions." In-product icons are inline SVG line icons at 2px stroke with round caps, matching the app's existing glyph set.

Motion · 4 patterns

Purposeful feedback, not decoration.

Segment activate · 280ms ease-out · timeline block highlights
Burn progress · button pulses while FFmpeg is running
Panel open · editor area fades and scales in, 220ms
Error shake · input shakes when validation fails, 450ms

See the full design spec · View on GitHub

What's deliberately missing

The cuts I'd defend.

A solo tool for personal use has to be ruthless about scope. Here's what was left out on purpose - each omission keeps the tool fast, offline-capable, and maintainable by one person.

No cloud

No remote server, no account, no upload to a third party. The whole pipeline runs on localhost. This was the original motivation for building the tool - Premiere and cloud SaaS were the alternatives, and neither fit a self-hosted workflow where the video never leaves the machine.

Not real-time

On the base model, Whisper transcription takes roughly as long as the video itself. This is a deliberate trade: accuracy over speed. The UI shows a progress log rather than a spinner, so the wait is legible.

Global style

Style controls (font, size, color) apply globally, not per-segment. Position is per-segment because that's what makes the drag-and-drop editor worth having. Per-segment color and font are deferred, not forgotten - the segment dict schema already supports them via ASS override tags.
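As a sketch of what a per-segment color would involve: ASS override tags take colors in blue-green-red order, so a CSS-style hex value has to be reordered before it goes into a \c tag. The helper names here are hypothetical, and nothing below is wired into the app.

```python
def hex_to_ass_color(hex_color: str) -> str:
    """Convert '#RRGGBB' to the '&HBBGGRR&' form used by ASS override tags."""
    h = hex_color.lstrip("#")
    if len(h) != 6:
        raise ValueError(f"expected #RRGGBB, got {hex_color!r}")
    r, g, b = h[0:2], h[2:4], h[4:6]
    return f"&H{b}{g}{r}&".upper()


def with_color_override(text: str, hex_color: str) -> str:
    """Prefix caption text with a primary-color override tag."""
    return "{\\c" + hex_to_ass_color(hex_color) + "}" + text


print(with_color_override("Captions ready.", "#5B8DEF"))
```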

No translation

Whisper's multilingual capability is not exposed as a translation feature. The tool targets captioning in the source language. Cross-lingual translation adds model complexity and post-processing that would need its own design treatment - deliberately out of scope for v1.

Under the hood

Python all the way down.

One FastAPI server handles upload, transcription, refinement, and burn. The browser is just a view into that server - no separate frontend build, no Node.js. Bundled into the YouTube Downloader's "Caption" tab via an embedded WKWebView.

Pipeline

Python · FastAPI · Whisper

OpenAI Whisper with word_timestamps=True for word-accurate sync
Segment timestamps snapped to actual word boundaries, never raw Whisper output
Optional Claude Sonnet pass for grammar and filler-word cleanup
Server-side line-break pre-wrapping so preview and burn agree

Burn

FFmpeg · libass

Bundled imageio-ffmpeg static binary - Homebrew FFmpeg lacks libass
Two-pass render when box + outline are both requested (libass limitation)
ASS format for any per-segment \pos override; SRT otherwise
PlayResY=288 locked so font sizes match between SRT and ASS paths
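The format choice and the \pos override in the list above can be sketched as follows. Several details are assumptions for illustration: pos_x / pos_y are treated as fractions of the frame (0 to 1), the script width is passed in by the caller, the style is named Default, and the helper names are hypothetical.

```python
PLAY_RES_Y = 288  # locked script height, per the list above


def needs_ass(segments: list[dict]) -> bool:
    """ASS is required as soon as any segment carries a position override."""
    return any(s.get("pos_x") is not None and s.get("pos_y") is not None
               for s in segments)


def ass_time(seconds: float) -> str:
    """Format seconds as H:MM:SS.cc, the timestamp form used in ASS events."""
    total_cs = round(seconds * 100)
    hours, rest = divmod(total_cs, 360_000)
    minutes, rest = divmod(rest, 6_000)
    secs, cs = divmod(rest, 100)
    return f"{hours}:{minutes:02d}:{secs:02d}.{cs:02d}"


def dialogue_line(seg: dict, play_res_x: int) -> str:
    """Build one ASS Dialogue event, adding \\pos when the segment has one."""
    text = seg["text"].replace("\n", "\\N")  # ASS hard line break
    if seg.get("pos_x") is not None and seg.get("pos_y") is not None:
        x = round(seg["pos_x"] * play_res_x)
        y = round(seg["pos_y"] * PLAY_RES_Y)
        text = f"{{\\pos({x},{y})}}{text}"
    return (f"Dialogue: 0,{ass_time(seg['start'])},{ass_time(seg['end'])},"
            f"Default,,0,0,0,,{text}")


seg = {"start": 1.42, "end": 4.89, "text": "Hello", "pos_x": 0.5, "pos_y": 0.75}
print(dialogue_line(seg, 512))
```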

Editor

Embedded HTML · WKWebView

Single-page HTML/CSS/JS embedded in app.py - no build step
Timeline drag updates pos_x/pos_y on the active segment only
Live caption overlay using the same wrap-budget formula as the server
Bundled as the "Caption" tab in the macOS YouTube Downloader via WKWebView