A combination of engines generally gets the best WER, at additional cost. Hosted Whisper + Gemini 2.5 Flash Lite, with custom deconfliction based on what each one does best, is a reasonable path. Gemini handles general conversation and silence better than Whisper large-v3, but Whisper large-v3 does better on specialty vocab. Both before and after the merge, common transcription errors get fixed with a dictionary-based lookup (one that preserves punctuation, etc.). This combo stays multilingual and is pretty cheap, but it is complex. There are better single-source transcription vendors out there, but they generally fail to provide multilingual support or timing info, or they are ridiculously expensive, and so on. I think the next gen of multimodal models will make this all moot, as they will likely crush transcription. Gemini shows that direction right now. OpenAI does a bad job of it but is in the game. Anthropic is surprisingly not really engaged in this yet (but they did just announce real-time audio, so they have to be thinking about it).
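For anyone curious what the dictionary pass might look like, here is a minimal sketch in Python. It is not the actual pipeline: the `CORRECTIONS` table, its entries, and the `fix_transcript` name are all made up for illustration. The idea is to match only on word boundaries, so surrounding punctuation and spacing pass through untouched.

```python
import re

# Hypothetical correction table (illustrative entries only):
# lowercase misrecognition -> preferred spelling.
CORRECTIONS = {
    "kubernetties": "Kubernetes",
    "post gress": "Postgres",
}

# Try longer keys first so multi-word entries win over shorter overlapping ones.
_KEYS = sorted(CORRECTIONS, key=len, reverse=True)
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in _KEYS) + r")\b",
    re.IGNORECASE,
)


def fix_transcript(text: str) -> str:
    """Replace known misrecognitions; anything outside a match is left as-is."""
    return _PATTERN.sub(lambda m: CORRECTIONS[m.group(1).lower()], text)


print(fix_transcript("We run kubernetties, and post gress!"))
```

Running the same function on each engine's output before the merge and again on the merged text would match the "before and after" step described above. A real table would probably need per-language entries and some care with words that are only sometimes wrong.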