April 23, 2024, 7:18 p.m. | /u/Mindless-Ordinary485

Machine Learning www.reddit.com

I've had a lot of success with Whisper for transcription, but its word-level timestamps seem to be slightly inaccurate. From my understanding, this is a known limitation: "Whisper cannot provide reliable word timestamps, because the END-TO-END models like Transformer using cross-entropy training criterion are not designed for reliably estimating word timestamps." ([https://www.youtube.com/watch?v=H576iCWt1Co&t=192s](https://www.youtube.com/watch?v=H576iCWt1Co&t=192s)) For my use case, I need precise word-level timestamps, because I'm inserting audio after specific words. This becomes problematic when I do an insertion and the back part …
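To make the use case concrete, here is a minimal sketch of the insertion step, not the poster's actual code. It assumes the audio is already decoded into a plain list of samples and that each word is a dict with `word`, `start`, and `end` keys (times in seconds). The function name `insert_after_word` and all parameter names are hypothetical.

```python
from typing import Dict, List, Tuple

Word = Dict[str, object]  # assumed shape: {"word": str, "start": float, "end": float}


def insert_after_word(
    samples: List[float],
    clip: List[float],
    sample_rate: int,
    words: List[Word],
    target: str,
) -> Tuple[List[float], List[Word]]:
    """Splice `clip` into `samples` right after the first occurrence of `target`.

    Returns the new sample list and a copy of `words` in which every word
    after the insertion point is shifted by the clip's duration.
    """
    for index, word in enumerate(words):
        # Ignore surrounding whitespace and case when matching the word text.
        if str(word["word"]).strip().lower() == target.strip().lower():
            break
    else:
        raise ValueError(f"word {target!r} not found")

    # Convert the reported end time to a sample index, clamped to the audio length.
    cut = min(len(samples), max(0, round(float(word["end"]) * sample_rate)))
    new_samples = samples[:cut] + clip + samples[cut:]

    # Later words now start and end later by the duration of the inserted clip.
    shift = len(clip) / sample_rate
    shifted = [dict(w) for w in words[: index + 1]]
    for w in words[index + 1:]:
        shifted.append(
            {**w, "start": float(w["start"]) + shift, "end": float(w["end"]) + shift}
        )
    return new_samples, shifted
```

The cut lands exactly where the reported `end` time points, so any error in that timestamp moves the splice by the same amount.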

