"Um" and "uh" aren't always worth cutting.
Filler words do two things. Most of the time they're verbal throat-clearing (the brain catching up to the mouth), and cutting them tightens the delivery. But sometimes they're doing real work. The "um" before a sensitive topic signals thoughtfulness. The "uh" mid-sentence gives the listener a beat to register the last point. Research on conversational pauses has found that gaps of 200–300ms between turns are the norm in English conversation, and the brain expects them. Cutting every instance flattens cadence into robot-TED-talk voice, which reads as over-produced.
Cut the obvious repetitions (three "um"s clustered at the start of a sentence) and keep the ones that function as pauses. The fix is the review step, not the algorithm.
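As a rough illustration of that split, the sketch below flags only runs of consecutive fillers and leaves isolated ones for a human to judge. It assumes a word-level transcript with timestamps. The `Word` type, the `FILLERS` set, and the `min_run` parameter are hypothetical names chosen for this example, not part of any particular tool.

```python
from dataclasses import dataclass

# Assumed filler vocabulary; a real transcript may label fillers differently.
FILLERS = {"um", "uh", "er", "erm"}


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def is_filler(word: Word) -> bool:
    """True if the word, stripped of punctuation, is in the filler set."""
    return word.text.lower().strip(".,!?") in FILLERS


def find_filler_clusters(words: list[Word], min_run: int = 3) -> list[tuple[int, int]]:
    """Return (first_index, last_index) for each run of at least
    min_run consecutive fillers. Shorter runs are left untouched."""
    clusters = []
    i = 0
    while i < len(words):
        if not is_filler(words[i]):
            i += 1
            continue
        j = i
        while j + 1 < len(words) and is_filler(words[j + 1]):
            j += 1
        if j - i + 1 >= min_run:
            clusters.append((i, j))
        i = j + 1
    return clusters


words = [
    Word("Um,", 0.0, 0.2), Word("um,", 0.3, 0.5), Word("uh,", 0.6, 0.8),
    Word("so", 0.9, 1.1), Word("we", 1.1, 1.3),
    Word("um", 1.8, 2.0),  # isolated: may be working as a pause
    Word("ship", 2.1, 2.5),
]
clusters = find_filler_clusters(words)
```

Here only the opening run of three is flagged. The lone "um" before "ship" is not, so whether it stays is still a review decision.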
False starts are a different problem.
A filler word is one or two sounds. A false start is an entire phrase or sentence you abandoned and restarted:
So the way I think about this is — actually, the way I think about this is you need to decide what matters first.
The first fragment is the abandoned attempt; the second is the restart. Filler words are sounds and pauses are beats, but a false start is a redundant thought the viewer never needed, so it has to come out entirely. Finding these manually means scrubbing the timeline and reading the transcript, which is exactly the post-production tax you're trying to avoid. Most silence removers miss false starts because there is no silence to find: the speaker keeps talking through the restart. The redundancy is in the words, not the signal.
The tools that do catch them run the transcript through a language model that understands when a speaker restarted the same thought. Sapari does this. Some other editors do a version of it, and most silence removers don't. Detection quality varies with recording type (clear monologue is easier than overlapping conversation), and the better tools give you a confidence threshold to control aggressiveness.
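To make "the redundancy is in the words" concrete, here is a deliberately simple stand-in for that kind of detection. It uses no language model and is not how Sapari or any other editor works. It only looks for a phrase that repeats almost immediately and marks the first copy, plus any bridge word such as "actually", for removal. The function names and the `min_repeat`, `max_gap`, and `max_span` parameters are assumptions for this sketch; `min_repeat` plays the role of a crude aggressiveness control.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep only word characters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def find_false_starts(tokens: list[str], min_repeat: int = 3,
                      max_gap: int = 2, max_span: int = 15) -> list[tuple[int, int, int]]:
    """Return (cut_start, cut_end, repeat_len) for each suspected false start.

    A span is flagged when tokens[i:i+m] reappears at position j, with
    m >= min_repeat and at most max_gap bridge words between the copies.
    The suggested cut is tokens[cut_start:cut_end].
    """
    cuts = []
    i = 0
    while i < len(tokens):
        found = None
        for j in range(i + min_repeat, min(len(tokens), i + max_span + 1)):
            m = 0
            while i + m < j and j + m < len(tokens) and tokens[i + m] == tokens[j + m]:
                m += 1
            if m >= min_repeat and j - (i + m) <= max_gap:
                found = (i, j, m)
                break
        if found:
            cuts.append(found)
            i = found[1]  # resume scanning at the restart
        else:
            i += 1
    return cuts


def apply_cuts(tokens: list[str], cuts: list[tuple[int, int, int]]) -> list[str]:
    """Drop every flagged span and return the remaining tokens."""
    kept, pos = [], 0
    for start, end, _ in cuts:
        kept.extend(tokens[pos:start])
        pos = end
    kept.extend(tokens[pos:])
    return kept


line = ("So the way I think about this is - actually, the way I think "
        "about this is you need to decide what matters first.")
tokens = tokenize(line)
cuts = find_false_starts(tokens)
cleaned = " ".join(apply_cuts(tokens, cuts))
```

On the example sentence this flags the first "the way I think about this is" plus "actually". A repeat heuristic like this misses restarts that rephrase rather than repeat, which is the gap a language model is meant to cover.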
The restarts worth keeping.
Not every restart is a mistake. Some are deliberate, and a language model that flags a deliberate restart as a false start is technically correct but creatively wrong.
Dismiss them in review. Better tools make this easy (card-based interface, one click to keep). Worse ones bury the decision in the transcript view.
How to do it in Sapari.
Upload the recording
MP4, MOV, or any common video format.
Set the false start slider
Conservative catches obvious restarts only. Moderate catches restarts plus clear stumbles. Aggressive catches everything the model suspects.
Review the cards
Each detected range is a purple card with the transcript excerpt visible. Dismiss the intentional ones.
Run silence removal in the same pass
Filler words short enough to be caught by silence thresholds get cut at Balanced or higher pacing.
A 45-minute recording typically has 15–25 false starts. Detection runs as part of the main analysis; review takes another 5–10 minutes.
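The review step amounts to turning the cards you did not dismiss into a list of ranges to keep. The sketch below shows that bookkeeping under stated assumptions: the `Card` type and `keep_ranges` function are hypothetical, and Sapari's internal representation is not documented here.

```python
from dataclasses import dataclass


@dataclass
class Card:
    start: float             # seconds
    end: float               # seconds
    dismissed: bool = False  # True means "keep this restart in the video"


def keep_ranges(duration: float, cards: list[Card]) -> list[tuple[float, float]]:
    """Return the (start, end) ranges that survive after cutting every
    card that was not dismissed. Overlapping cuts are merged."""
    cuts = sorted((c.start, c.end) for c in cards if not c.dismissed)
    kept = []
    cursor = 0.0
    for start, end in cuts:
        if start > cursor:
            kept.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < duration:
        kept.append((cursor, duration))
    return kept


cards = [
    Card(5.0, 8.0),
    Card(20.0, 24.5, dismissed=True),  # intentional restart, kept by the reviewer
    Card(40.0, 42.0),
]
ranges = keep_ranges(60.0, cards)
```

The dismissed card simply drops out of the cut list, so the 20.0 to 24.5 second span stays inside the second kept range.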
Common questions.
Should I cut all my "um"s?
No. Cut the ones that cluster (three in a row) and keep the ones that function as pauses. If you're unsure, leave it: a leftover "um" sounds more natural than an over-edited sentence.
Does false start detection work for non-English audio?
Yes, in the languages the transcription supports (in Sapari: English, Spanish, Portuguese, and French). English is the most extensively tested.
What about accents or fast speech?
Detection quality degrades with heavy accents or very fast speech because the transcription itself degrades. Running audio cleanup first helps the transcription, which helps detection.
Can I just listen through the whole recording?
You can. A 45-minute recording with 20 false starts is roughly an hour of careful listening. An AI detection pass surfaces the candidates in a fraction of that, and review takes another 5–10 minutes.