Burn styled captions
without the round-trip.
Drop a video into the browser, let Whisper transcribe it, polish the segments with Claude if needed, drag the overlays to taste, and burn the result through FFmpeg. What you see in the editor is exactly what gets embedded in the file.
See it in action
Screenshots of the real editor UI, the timeline, drag-to-position overlays, and the live caption preview. Source on GitHub.
Design system
Editor-dark tokens, timeline-blue accent, Inter for tool UI, JetBrains Mono for timecodes — the full, code-truthful spec.
Under the hood
Python · FastAPI · Whisper · Claude · FFmpeg · libass. Bundled into the YouTube Downloader's Caption tab via WKWebView.
Burning styled captions is a slow round-trip through Premiere or paid SaaS.
Export to Premiere, wait, tweak the style, re-export. Or pay a subscription for a cloud captioner that mis-syncs every third sentence. YouTube's auto-translate is famously out of sync too. I wanted a self-hosted tool that handles the entire pipeline — from raw video to burned-in subtitles — in one sitting, without leaving the browser. And I wanted the preview to match what gets burned, to the pixel.
One pipeline, five beats. Drop in, burn out.
I sketched the pipeline on paper first — confirming that each step had a real job before writing a line of Python.
or browse
transcribes
timestamps
refines
fillers
(optional)
position
drag overlay
burns
pixel-perfect
match
The decisions that shaped the whole experience.
"Self-hosted" is easy to say and hard to make feel good. Three early bets turned it from a CLI script into a tool people would actually reach for.
The preview must match the burn exactly.
The in-browser preview renders caption overlays using CSS — but the burned file uses libass. Getting these to agree required pinning PlayResY=288, pre-wrapping line breaks server-side, and using an inline-block anchor for CSS position. What you drag in the editor is exactly where the subtitle lands in the exported video.
Word-level timestamps, not segment-level.
Whisper's default segment timestamps are loose — often pinned to 0.0 even when the first word starts at 3 seconds. The app transcribes with word_timestamps=True and snaps each segment's start/end to the actual word boundaries. Captions hit when the speaker speaks.
Two-pass burn for box + outline together.
libass can give you a translucent box or a per-glyph outline — not both in one pass. The pipeline chains two subtitles filters: pass 1 renders the box only, pass 2 layers text and outline on top. The user just toggles two controls; the complexity is invisible.
Small calls that change how the whole thing feels.
Most of what makes a tool feel reliable is a pile of small, invisible decisions. Here are the ones worth defending.
Dragging a caption updates only the active segment.
Click a segment on the timeline, then drag the overlay in the live preview. Only that segment stores a custom position. The rest default to the global alignment. Right-click a segment to clear its override and snap it back to the stylesheet default.
The wrap-width slider mirrors the burn's exact glyph budget.
The server uses font_size × (video_h/288) × 0.55 to estimate character width. The browser JS uses the same formula to re-wrap text as the slider moves. Different line counts between preview and burn no longer happen.
Claude is genuinely optional, not a locked feature.
A toggle clearly labelled "Polish with Claude (grammar / fillers)" defaults to off. Whisper-only output ships fast. When on, the contribution is auditable — segment text shows before and after the refinement so nothing slips through invisibly.
The dark editor palette comes from the app itself.
The app's own CSS uses #0b0d10 backgrounds and #5b8def accents — a video-timeline feel. The case study inherits those same values so the design language is coherent between the tool and its documentation.
Segments with a custom position show green in the timeline.
The timeline bar turns green for segments where pos_x / pos_y have been set. You can see at a glance which captions have been manually placed and which are still using the global defaults — no hunting through a settings panel.
Bundled into the YouTube Downloader as the Caption tab.
The FastAPI server is embedded in a macOS YouTube downloader app via an embedded WKWebView. The Caption tab opens the editor pointing at a local server. One install, one place to go — no context-switching to a separate browser tab.
One dark system, used everywhere.
Auto-Caption runs on a "focused editor / timeline" system — deep navy backgrounds, a crisp blue accent drawn directly from the app's own stylesheet, and JetBrains Mono for anything that smells like a timecode. Here's the short version; the full spec lives on the Design page.
The accent comes straight from the app's own CSS.
The app uses #5B8DEF for every interactive element — active timeline segments, button fills, focus rings. The case study inherits this exact value as --brand-primary. The surface chain runs from #0D1117 to the card's #161E2C, matching the editor panel hierarchy.
Inter for labels, JetBrains Mono for anything time-related.
Clean, neutral, tool-grade. Inter at weight 700 reads as "professional software" without the warmth of a consumer app — the right register for an editor.
Weight 400–500 for body copy and panel labels. Inherits from the root; nothing overrides it unless it needs to feel system-like.
Only where something must read as "data" — start/end times, hex values in the design spec, section eyebrows. Not used for body copy.
Less soft than a consumer app, never truly sharp.
An editor tool benefits from slightly tighter radii. Four steps, from pill status badges down to the timeline segment chips that need to be visibly rectangular.
A timeline with a playhead — the whole app in one frame.
The icon shows a video frame with two caption lines and a timeline bar beneath it — a playhead resting mid-track. At thumbnail size it reads as "video tool with captions." In-product icons are inline SVG line icons at 2px stroke with round caps, matching the app's existing glyph set.
Purposeful feedback, not decoration.
The cuts I'd defend.
A solo tool for personal use has to be ruthless about scope. Here's what was left out on purpose — each omission keeps the tool fast, offline-capable, and maintainable by one person.
No remote server, no account, no upload to a third party. The whole pipeline runs on localhost. This was the original motivation for building the tool — Premiere and cloud SaaS were the alternatives, and neither fit a self-hosted workflow where the video never leaves the machine.
Whisper transcription takes roughly the same wall-clock time as the video's duration on the base model. This is a deliberate trade: accuracy over speed. The UI shows a progress log so the wait is legible, not a spinner.
Style controls (font, size, color) apply globally, not per-segment. Position is per-segment because that's what makes the drag-and-drop editor worth having. Per-segment color and font are deferred, not forgotten — the segment dict schema already supports them via ASS override tags.
Whisper's multilingual capability is not exposed as a translation feature. The tool targets captioning in the source language. Cross-lingual translation adds model complexity and post-processing that would need its own design treatment — deliberately out of scope for v1.
Python all the way down.
One FastAPI server handles upload, transcription, refinement, and burn. The browser is just a view into that server — no separate frontend build, no Node.js. Bundled into the YouTube Downloader's "Caption" tab via an embedded WKWebView.