OmniDub rebuilds a video's dialogue in any language while every speaker keeps their voice, their emotion, and the original timing — powered entirely by Xiaomi MiMo-V2.5.
The speaker is shouting, crying, whispering. The dub comes out flat like a news anchor reading a weather report.
Three speakers in the video? All three collapse into the same AI narrator. Viewers can't tell who's talking.
The translated line runs 30% longer than the original shot, so the dub talks right over the next cut. Unwatchable.
For every sentence, MiMo-V2.5 video-understanding inspects a frame strip and classifies the delivery as (angry), (whispering), (sighing), and so on. That tag is injected inline into the TTS prompt using MiMo's native natural-language control syntax.
Uses mimo-v2.5 + mimo-v2.5-tts
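As a rough illustration of the tag injection described above, the sketch below prefixes a line with an inline delivery tag in the `(tag)text` form. The helper name `tag_line` and the `ALLOWED_TAGS` whitelist are hypothetical, not OmniDub's actual code. Only the three tags named above are listed, and the real set is likely larger.

```python
from typing import Optional

# Hypothetical whitelist: only the tags named in the text above.
ALLOWED_TAGS = {"angry", "whispering", "sighing"}


def tag_line(text: str, emotion: Optional[str]) -> str:
    """Prefix a line with an inline delivery tag, e.g. '(whispering)Hello'."""
    if emotion is None:
        return text
    # Accept both 'whispering' and '(whispering)' from the classifier.
    tag = emotion.strip().strip("()").lower()
    if tag not in ALLOWED_TAGS:
        # Unknown label: fall back to untagged (neutral) delivery.
        return text
    return f"({tag}){text}"


print(tag_line("The deadline is tomorrow.", "(whispering)"))
```

Falling back to the untagged line for unknown labels is a design assumption here: it keeps a misclassified frame strip from producing a malformed prompt.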
The longest clean clip of each speaker is extracted, base64-encoded, and passed as that speaker's voice handle to voice cloning. Three speakers in = three distinct voices out, each matching the timbre and cadence of the original.
Uses mimo-v2.5-tts-voiceclone
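The reference-clip selection above could look roughly like the following sketch. The `Segment` fields, including the `clean` flag, are assumptions about what a diarization step might return, and how the base64 string is attached to the voice-cloning request is not shown because the source does not specify it.

```python
import base64
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Segment:
    speaker: str
    start: float  # seconds
    end: float    # seconds
    clean: bool   # hypothetical flag: no overlapping speech, music, or noise


def longest_clean_segments(segments: Iterable[Segment]) -> Dict[str, Segment]:
    """Return the longest clean segment for each speaker."""
    best: Dict[str, Segment] = {}
    for seg in segments:
        if not seg.clean:
            continue
        current = best.get(seg.speaker)
        if current is None or (seg.end - seg.start) > (current.end - current.start):
            best[seg.speaker] = seg
    return best


def encode_reference(audio_bytes: bytes) -> str:
    """Base64-encode raw audio bytes so they can travel in a JSON payload."""
    return base64.b64encode(audio_bytes).decode("ascii")
```

A speaker with no clean segment simply does not appear in the result, which leaves room for a fallback voice.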
The reasoning model translates each line with the target clip duration as a hard constraint. If the first draft would run too long, it reasons explicitly (for example, "shorten by dropping 'you know'") and retries up to three times until the line fits within ±8% of the original shot's duration.
Uses mimo-v2.5-pro
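The retry loop described above can be sketched as follows. The `translate` and `estimate_s` callables are hypothetical stand-ins for the model call and a duration estimator, and treating "up to three times" as three attempts in total is an assumption.

```python
from typing import Callable, Optional, Tuple

TOLERANCE = 0.08   # within 8% of the original shot, per the text above
MAX_ATTEMPTS = 3   # assumption: three attempts in total


def fits(duration_s: float, target_s: float, tol: float = TOLERANCE) -> bool:
    """True if the draft's duration is within tol of the target duration."""
    return abs(duration_s - target_s) <= tol * target_s


def translate_to_fit(
    line: str,
    target_s: float,
    translate: Callable[[str, float, Optional[str]], str],
    estimate_s: Callable[[str], float],
) -> Tuple[str, bool]:
    """Retry translation with duration feedback; return (draft, fits_target)."""
    feedback: Optional[str] = None
    draft = ""
    for _ in range(MAX_ATTEMPTS):
        draft = translate(line, target_s, feedback)
        duration = estimate_s(draft)
        if fits(duration, target_s):
            return draft, True
        action = "shorten" if duration > target_s else "lengthen"
        feedback = (
            f"Your previous draft was {duration:.1f}s, "
            f"the shot is {target_s:.1f}s; {action} it."
        )
    # No draft fit within the budget: return the last one and flag it.
    return draft, False
```

Returning the last draft with a `False` flag, rather than raising, lets the caller decide how to handle lines that never fit, for example by time-stretching the audio.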
When the speaker's mouth is open on an "O" shape, the translator is told to prefer Indonesian/Chinese syllables with matching vowels at that time index. The dub doesn't perfectly lip-sync — no one's does — but it stops looking obviously wrong.
Uses mimo-v2.5 vision + mimo-v2.5-pro
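One way to hand the mouth-shape observations above to the translator is as plain-text hints, sketched below. The shape labels and the `SHAPE_TO_VOWEL` mapping are illustrative assumptions. The source only mentions the "O" case and does not describe how OmniDub actually encodes these hints.

```python
from typing import List, Tuple

# Hypothetical mapping from a detected mouth shape to a preferred vowel.
SHAPE_TO_VOWEL = {"O": "o", "A": "a", "E": "e"}


def build_vowel_hints(mouth_shapes: List[Tuple[float, str]]) -> List[str]:
    """Turn (time_in_seconds, shape) observations into hint lines.

    Shapes without a known vowel (e.g. a closed mouth) produce no hint.
    """
    hints = []
    for time_s, shape in mouth_shapes:
        vowel = SHAPE_TO_VOWEL.get(shape)
        if vowel is not None:
            hints.append(f"t={time_s:.2f}s: prefer a syllable with vowel '{vowel}'")
    return hints


for hint in build_vowel_hints([(0.4, "O"), (1.1, "closed")]):
    print(hint)
```

The hints are soft preferences in this sketch: they can be appended to the translation prompt without overriding the duration constraint.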
Source: two-speaker English clip · Target: Mandarin 中文 · Voice cloned per speaker
Sign up at platform.xiaomimimo.com. Token Plan subscribers get preferential rates and a full Credits reset each month.
git clone https://github.com/kohkoh099-boop/omnidub
cd omnidub
cp .env.example .env
# paste your MIMO_API_KEY into .env
docker compose up --build
Open http://localhost:8080. Done.
Every arrow in the dataflow touches api.xiaomimimo.com/v1. No Whisper, no ElevenLabs, no pyannote. ASR, diarization, emotion classification, reasoning translation, and TTS — all MiMo.
MiMo-V2.5-TTS is the only major TTS that accepts natural-language delivery tags inline, for example "(whispering)The deadline is tomorrow." We push the classified emotion straight into the prompt, with no post-hoc prosody hacks.
Dubbing translation is a constraint satisfaction problem, not a pure NMT task. MiMo-V2.5-Pro can actually reason about "your previous draft was 2.3s, the shot is 1.8s, drop the filler" and self-correct.
Short clips that don't have enough clean audio for cloning gracefully degrade to the built-in Chloe voice (or any voice the user picks), so the pipeline never fails on edge cases.
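The fallback described above might be implemented along these lines. The `MIN_CLEAN_SECONDS` threshold and the shape of the returned dictionary are assumptions for illustration, since the source does not say how much clean audio cloning needs or how the choice is passed to the TTS call.

```python
from typing import Dict, Optional

MIN_CLEAN_SECONDS = 3.0   # assumed threshold; not stated in the source
DEFAULT_VOICE = "Chloe"   # built-in voice named in the text above


def choose_voice(
    clean_seconds: float,
    reference_b64: Optional[str],
    user_voice: Optional[str] = None,
) -> Dict[str, str]:
    """Clone when enough clean audio exists, otherwise use a preset voice."""
    if reference_b64 and clean_seconds >= MIN_CLEAN_SECONDS:
        return {"mode": "clone", "reference": reference_b64}
    # Too little clean audio: use the user's pick, else the built-in default.
    return {"mode": "preset", "voice": user_voice or DEFAULT_VOICE}


print(choose_voice(1.2, "YWJj"))
```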