Size and position by platform.
The biggest mistake is using the same caption style across every platform. Phone feed and desktop feed are different reading contexts. Captions that look fine on YouTube desktop are unreadable at 9:16 on a phone, and captions sized for TikTok look oversized on YouTube.
Working defaults used across creator communities (these are general conventions, not platform-published specs; adjust for your audience):
Font weight matters more than font choice. Bold sans-serif (Inter Bold, Montserrat SemiBold, Helvetica Black) reads against busy backgrounds on mobile; thin fonts disappear. White text with a thick black stroke works on almost anything. LinkedIn's own ad guidance recommends 4:5 vertical or 1:1 square for feed performance, so the 1:1 sizing above is the safer default for professional audiences.
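For anyone burning captions in outside a dedicated editor, the bold white text with black stroke described above can be sketched as an ffmpeg command. This is a minimal sketch, not any tool's actual export path: it assumes an ffmpeg build with libass, and the function name `burn_in_command`, the file names, and the stroke thickness of 3 are illustrative choices. Font size is left out on purpose, because libass sizes are in script units rather than output pixels.

```python
def burn_in_command(video: str, srt: str, output: str, font: str = "Inter") -> list[str]:
    """Build an ffmpeg argument list that burns styled captions into a video.

    Hypothetical helper: the style values are illustrative, not a platform spec.
    """
    style = ",".join([
        f"FontName={font}",
        "Bold=1",                    # heavy weight so text survives busy backgrounds
        "PrimaryColour=&H00FFFFFF",  # white fill (ASS colours are &HAABBGGRR)
        "OutlineColour=&H00000000",  # black stroke
        "BorderStyle=1",             # outline style rather than an opaque box
        "Outline=3",                 # stroke thickness, assumed value
        "Shadow=0",
    ])
    return [
        "ffmpeg", "-i", video,
        "-vf", f"subtitles={srt}:force_style='{style}'",
        "-c:a", "copy",              # audio is untouched, only video is re-encoded
        output,
    ]


if __name__ == "__main__":
    print(" ".join(burn_in_command("input.mp4", "captions.srt", "output.mp4")))
```

Pass the list to `subprocess.run(cmd, check=True)` to run it. Subtitle paths containing colons, commas, or quotes need extra escaping inside the filter string, which this sketch does not handle.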
Burned-in vs toggleable captions.
Most platforms support soft captions: toggleable closed captions the viewer has to enable. Most creators still burn captions into the video, for three specific reasons:
Given that a substantial majority of social video is watched with sound off, burning captions in trades flexibility for guaranteed display.
One exception: YouTube long-form. Upload soft captions (a .srt file) for accessibility and SEO, since YouTube indexes the transcript for search. Still burn in visual captions for the first 30 seconds, because that's the retention-critical window and you can't count on viewers having closed captions enabled.
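If your captioning tool hands you timed text but no .srt file, the format is simple enough to write by hand. The sketch below assumes you already have cues as start time, end time, and text in seconds; the `Cue` class and function names are hypothetical, chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    start: float  # seconds from the start of the video
    end: float
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp SRT expects."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues: list[Cue]) -> str:
    """Render cues as SRT: a 1-based index, a time range, the text, a blank line."""
    blocks = []
    for index, cue in enumerate(cues, start=1):
        time_range = f"{srt_timestamp(cue.start)} --> {srt_timestamp(cue.end)}"
        blocks.append(f"{index}\n{time_range}\n{cue.text}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    cues = [Cue(0.0, 1.5, "Size and position"), Cue(1.5, 3.0, "by platform")]
    with open("captions.srt", "w", encoding="utf-8") as f:
        f.write(to_srt(cues))
```

Note the comma before the milliseconds: SRT uses a comma there, while the similar WebVTT format uses a period.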
What caption style says about you.
Captions are the most visible design element in most creator videos. If yours look templated (the default Submagic bouncy word-by-word, the default CapCut font), viewers register that you used a tool. That's not necessarily bad, but it's not distinctive. Captions that match your brand (your font, your color, your rhythm) read as considered in a way default presets don't.
Animation is separate from style. The word-by-word animated captions that dominate TikTok are a short-form convention: they look right on short-form and overdesigned on everything else. If you're publishing across platforms from the same source, static styled captions travel better than animated ones.
How to do it in Sapari.
Upload the recording
Captions are generated in the same analysis pass that handles silence, false starts, and audio cleanup.
Pick the aspect ratio first
Defaults adapt automatically to each format.
Override if needed
Font, color, position, background, and per-word highlighting are editable.
Export with captions burned in
SRT export is on the roadmap; today captions render into the video.
Captions regenerate automatically after every cut, so silence removal and false start dismissal don't break timing.
Common questions.
Should I keep "um" in the captions if I cut it from the audio?
No. If a word is cut from the audio, it should be cut from the caption too. Tools that run silence removal and captioning in separate passes sometimes fall out of sync here. A good pipeline regenerates captions after the edit.
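To make "regenerates captions after the edit" concrete, here is a minimal sketch of the timing fix, assuming you have word-level timestamps and a list of removed time ranges. It is one straightforward way to do it, not a description of how any particular tool works; `Word` and `apply_cuts` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds in the original recording
    end: float


def apply_cuts(words: list[Word], cuts: list[tuple[float, float]]) -> list[Word]:
    """Drop words inside removed ranges and shift the rest earlier.

    `cuts` holds (start, end) ranges removed from the recording,
    assumed not to overlap each other.
    """
    kept = []
    for word in words:
        # A word that overlaps any removed range was cut from the audio.
        if any(word.start < c_end and word.end > c_start for c_start, c_end in cuts):
            continue
        # Everything removed before this word pulls it earlier on the timeline.
        removed = sum(c_end - c_start for c_start, c_end in cuts if c_end <= word.start)
        kept.append(Word(word.text, word.start - removed, word.end - removed))
    return kept


if __name__ == "__main__":
    words = [Word("So", 0.0, 0.3), Word("um", 0.4, 0.9), Word("captions", 1.0, 1.6)]
    for w in apply_cuts(words, cuts=[(0.35, 0.95)]):
        print(f"{w.text}: {w.start:.2f} to {w.end:.2f}")
```

In the usage example the "um" falls inside the removed range, so it disappears from the captions and "captions" moves 0.6 seconds earlier to stay aligned with the edited audio.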
What about non-English audio?
Caption it in the spoken language. Auto-translate is a separate decision. Platforms handle multilingual audiences differently and the right answer depends on your audience.
Do I need to worry about line breaks?
For short-form, yes: one to three words per line reads better than wrapping. For long-form, automatic line breaks at phrase boundaries are fine.
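The short-form rule above is easy to apply mechanically. As a rough sketch, the hypothetical `chunk_words` function below caps each caption line at three words and also breaks early at punctuation, which is an added assumption on top of the one-to-three-word guideline.

```python
def chunk_words(text: str, max_words: int = 3) -> list[str]:
    """Split text into short caption lines of at most `max_words` words.

    A line also ends early when a word closes with punctuation, so a
    caption never runs across a sentence or clause boundary.
    """
    lines, current = [], []
    for word in text.split():
        current.append(word)
        if len(current) == max_words or word[-1] in ".,!?;:":
            lines.append(" ".join(current))
            current = []
    if current:  # flush any leftover words
        lines.append(" ".join(current))
    return lines


if __name__ == "__main__":
    for line in chunk_words("Font weight matters more than font choice. Thin fonts disappear."):
        print(line)
```

A production version would also need word timings so each line appears with its audio; this only shows the grouping.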
What's the smallest caption I can get away with?
On a phone feed at 9:16, 56px is the practical floor. Below that, older viewers can't read the captions comfortably.