Built for Xiaomi MiMo Orbit 100T · MIT · Open source

Dub any video,
without losing the soul of it.

OmniDub rebuilds a video's dialogue in any language while every speaker keeps their voice, their emotion, and the original timing — powered entirely by Xiaomi MiMo-V2.5.

The problem

Auto-dubbing today loses three things.

😐

Emotion

The speaker is shouting, crying, whispering. The dub comes out flat like a news anchor reading a weather report.

👥

Voice identity

Three speakers in the video? All three collapse into the same AI narrator. Viewers can't tell who's talking.

⏱️

Timing

The translated line runs 30% longer than the original shot, so the dub talks right over the next cut. Unwatchable.

How OmniDub fixes it

Every problem maps to a MiMo-V2.5 capability.

01

Emotion preserved

For every sentence, MiMo-V2.5 video-understanding inspects a frame strip and classifies the delivery as (angry), (whispering), (sighing), and so on. That tag is injected inline into the TTS prompt using MiMo's native natural-language control syntax.

Uses mimo-v2.5 + mimo-v2.5-tts
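A minimal sketch of the tag-injection step, assuming the classifier returns a plain label such as "whispering". The function name `tag_line` and the `ALLOWED_TAGS` set are hypothetical, and the set lists only the three tags named above. The output format follows the inline `(tag)text` style shown elsewhere on this page.

```python
from typing import Optional

# Illustrative subset only: the three delivery tags named in the text above.
ALLOWED_TAGS = {"angry", "whispering", "sighing"}


def tag_line(text: str, emotion: Optional[str]) -> str:
    """Prefix a line with an inline delivery tag, e.g. '(whispering)Hello.'.

    Unknown or missing labels leave the text untouched, so a bad
    classification never corrupts the TTS prompt.
    """
    if not emotion:
        return text
    tag = emotion.strip().lower().strip("()")
    if tag not in ALLOWED_TAGS:
        return text
    return f"({tag}){text}"


if __name__ == "__main__":
    print(tag_line("The deadline is tomorrow.", "whispering"))
```

Dropping unrecognized labels instead of passing them through is a design choice in this sketch, not a documented behavior of OmniDub.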

02

Voice identity preserved

The longest clean clip of each speaker is extracted, base64-encoded, and passed as that speaker's voice handle to voice cloning. Three speakers in = three distinct voices out, each matching the timbre and cadence of the original.

Uses mimo-v2.5-tts-voiceclone
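The clip-selection step can be sketched roughly as follows. The `Segment` type, its `clean` flag, and both function names are assumptions made for illustration; how OmniDub actually decides that a clip is "clean" is not described here.

```python
import base64
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Segment:
    speaker: str
    start: float  # seconds
    end: float    # seconds
    clean: bool   # hypothetical flag: no overlapping speech or music

    @property
    def duration(self) -> float:
        return self.end - self.start


def longest_clean_segments(segments: Iterable[Segment]) -> Dict[str, Segment]:
    """Return the longest clean segment for each speaker."""
    best: Dict[str, Segment] = {}
    for seg in segments:
        if not seg.clean:
            continue
        current = best.get(seg.speaker)
        if current is None or seg.duration > current.duration:
            best[seg.speaker] = seg
    return best


def encode_clip(audio_bytes: bytes) -> str:
    """Base64-encode raw audio bytes so they can travel in a JSON request."""
    return base64.b64encode(audio_bytes).decode("ascii")
```

A speaker with no clean segment is simply absent from the returned mapping, which gives the caller a natural point to fall back to a built-in voice.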

03

Timing preserved

The reasoning model translates each line with the target clip duration as a hard constraint. If the first draft would be too long, it reasons explicitly (for example, "shorten by dropping 'you know'") and retries, up to three attempts in total, until the line fits within ±8% of the original shot duration.

Uses mimo-v2.5-pro
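The retry loop described above might look like the sketch below. The `translate` and `estimate` callables are stand-ins (hypothetical, injected by the caller) for the mimo-v2.5-pro call and for whatever duration estimate the pipeline uses. Only the three-attempt limit and the ±8% tolerance come from the text.

```python
from typing import Callable, Optional, Tuple

TOLERANCE = 0.08   # ±8% of the original shot duration
MAX_ATTEMPTS = 3   # up to three drafts per line


def fits(duration_s: float, target_s: float, tol: float = TOLERANCE) -> bool:
    """True if the draft duration is within ±tol of the target duration."""
    return abs(duration_s - target_s) <= tol * target_s


def translate_to_fit(
    line: str,
    target_s: float,
    translate: Callable[[str, float, Optional[str]], str],
    estimate: Callable[[str], float],
) -> Tuple[str, int, bool]:
    """Return (draft, attempts_used, fitted).

    `translate` receives the source line, the target duration, and feedback
    about the previous draft (None on the first attempt).
    """
    feedback: Optional[str] = None
    draft = ""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        draft = translate(line, target_s, feedback)
        est = estimate(draft)
        if fits(est, target_s):
            return draft, attempt, True
        feedback = f"your previous draft was {est:.1f}s, the shot is {target_s:.1f}s"
    # Out of attempts: hand back the last draft and let the caller decide.
    return draft, MAX_ATTEMPTS, False
```

Returning the last draft with `fitted=False` rather than raising is an assumption of this sketch. What the real pipeline does when no draft fits is not stated here.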

04

Lip-shape aware

When the speaker's mouth is open on an "O" shape, the translator is told to prefer Indonesian/Chinese syllables with matching vowels at that time index. The dub doesn't perfectly lip-sync — no one's does — but it stops looking obviously wrong.

Uses mimo-v2.5 vision + mimo-v2.5-pro
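One way to hand mouth-shape observations to the translator is as plain-text hints appended to the prompt. The sketch below assumes the vision step yields `(timestamp, shape)` pairs. The function name and the hint wording are hypothetical and are not OmniDub's actual prompt.

```python
from typing import Iterable, Tuple


def lip_hints(observations: Iterable[Tuple[float, str]], clip_start: float) -> str:
    """Turn (absolute_time_s, mouth_shape) pairs into prompt hints.

    Times are rewritten relative to the start of the clip so the translator
    can line them up with positions inside the sentence.
    """
    hints = []
    for t, shape in sorted(observations):
        offset = t - clip_start
        hints.append(
            f"at {offset:.1f}s the mouth shape is '{shape}'; "
            "prefer a syllable with a matching vowel"
        )
    return "\n".join(hints)
```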

Demo

See the pipeline end-to-end. 62 seconds.

Source: two-speaker English clip · Target: Mandarin 中文 · Voice cloned per speaker

omnidub.local
1 · Drop a video
clip_interview.mp4 12.8 MB · 01:42
2 · Target language
🇮🇩 Indonesia 🇨🇳 中文 🇬🇧 English 🇯🇵 日本語
3 · Voice
🧬 Clone each speaker 🎙️ Built-in
Single-page UI, designed mobile-first — one thumb, no dropdowns.
omnidub.local · processing
Dubbing in progress
  1. Extracting audio
  2. Transcribing + diarizing · mimo-v2.5
  3. Fingerprinting speakers · voiceclone
  4. Tagging emotion per line · mimo-v2.5
  5. Timing-aware translation · mimo-v2.5-pro
  6. Rendering voice · mimo-v2.5-tts
  7. Muxing MP4
Real-time progress over Server-Sent Events — you see every MiMo call as it happens.
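For readers unfamiliar with Server-Sent Events, a progress update is just a small text frame on a long-lived HTTP response. The sketch below formats one frame per pipeline step. The event name `progress` and the payload fields are assumptions, not OmniDub's actual wire format.

```python
import json

# Step labels as listed in the progress view above.
STEPS = [
    "Extracting audio",
    "Transcribing + diarizing",
    "Fingerprinting speakers",
    "Tagging emotion per line",
    "Timing-aware translation",
    "Rendering voice",
    "Muxing MP4",
]


def sse_event(step: int, status: str) -> str:
    """Format one SSE frame for a 1-based pipeline step.

    An SSE frame is a set of 'field: value' lines terminated by a blank line.
    """
    payload = {"step": step, "label": STEPS[step - 1], "status": status}
    return f"event: progress\ndata: {json.dumps(payload)}\n\n"


if __name__ == "__main__":
    print(sse_event(2, "running"), end="")
```

On the browser side, a standard `EventSource` listener for the `progress` event would receive the JSON string in `event.data`.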
Install

One command. Two minutes.

  1. Grab a MiMo API key

    Sign up at platform.xiaomimimo.com. Token Plan subscribers get preferential rates and a full Credits reset each month.

  2. Clone & configure
    git clone https://github.com/kohkoh099-boop/omnidub
    cd omnidub
    cp .env.example .env
    # paste your MIMO_API_KEY into .env
  3. Run
    docker compose up --build

    Open http://localhost:8080. Done.

Why MiMo

The only stack that could pull this off.

Four models, one API key, zero external ML

Every arrow in the dataflow touches api.xiaomimimo.com/v1. No Whisper, no ElevenLabs, no pyannote. ASR, diarization, emotion classification, reasoning translation, and TTS — all MiMo.

mimo-v2.5 mimo-v2.5-pro mimo-v2.5-tts mimo-v2.5-tts-voiceclone

Emotion as a first-class TTS input

MiMo-V2.5-TTS is the only major TTS that accepts natural-language delivery tags inline, for example "(whispering)The deadline is tomorrow." We push the classified emotion straight into the prompt, with no post-hoc prosody hacks.

Reasoning model for dubbing translation

Dubbing translation is a constraint satisfaction problem, not a pure NMT task. MiMo-V2.5-Pro can actually reason about "your previous draft was 2.3s, the shot is 1.8s, drop the filler" and self-correct.

Built-in voices for fallback

Speakers without enough clean audio for cloning fall back to the built-in Chloe voice (or any voice the user picks), so a short clip still produces a complete dub instead of failing.
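The fallback decision can be sketched as a single function. The 3-second threshold, the function name, and the returned field names are all assumptions for illustration. Only the model names and the Chloe default come from this page.

```python
from typing import Dict, Optional

MIN_CLONE_SECONDS = 3.0  # assumed threshold, not a documented value


def pick_voice(
    clean_seconds: float,
    clone_handle: Optional[str],
    user_voice: Optional[str] = None,
) -> Dict[str, str]:
    """Choose the cloned voice when enough clean audio exists, else a built-in."""
    if clone_handle and clean_seconds >= MIN_CLONE_SECONDS:
        return {"model": "mimo-v2.5-tts-voiceclone", "voice": clone_handle}
    # Fall back to the user's pick, or the built-in Chloe voice by default.
    return {"model": "mimo-v2.5-tts", "voice": user_voice or "Chloe"}
```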