OmniDub — Emotion-aware video dubbing · Powered by Xiaomi MiMo-V2.5

The problem

Auto-dubbing today loses three things.

😐

Emotion

The speaker is shouting, crying, whispering. The dub comes out flat, like a news anchor reading a weather report.

👥

Voice identity

Three speakers in the video? All three collapse into the same AI narrator. Viewers can't tell who's talking.

⏱️

Timing

The translated line runs 30% longer than the original shot, so the dub talks right over the next cut. Unwatchable.

How OmniDub fixes it

Every problem maps to a MiMo-V2.5 capability.

01

Emotion preserved

For every sentence, MiMo-V2.5 video-understanding inspects a frame strip and classifies the delivery as (angry), (whispering), (sighing), and so on. That tag is injected inline into the TTS prompt using MiMo's native natural-language control syntax.

Uses mimo-v2.5 + mimo-v2.5-tts
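
As a rough illustration of the tag injection described above, the sketch below prefixes a line with a parenthesised delivery tag. The function name `tag_line` and the `KNOWN_TAGS` set are hypothetical; the text names only three labels and says "and so on", so the real label set may well be larger.

```python
from typing import Optional

# Hypothetical tag set: only the three labels named in the text.
KNOWN_TAGS = {"angry", "whispering", "sighing"}

def tag_line(text: str, emotion: Optional[str]) -> str:
    """Prefix a line with an inline delivery tag, e.g. "(angry)Get out."."""
    if not emotion:
        return text
    tag = emotion.strip().lower()
    if tag not in KNOWN_TAGS:
        return text  # unknown label: fall back to untagged delivery
    return f"({tag}){text}"

print(tag_line("The deadline is tomorrow.", "whispering"))
# (whispering)The deadline is tomorrow.
```

Dropping unknown labels rather than passing them through is a design choice in this sketch, so a misclassified frame strip cannot inject arbitrary text into the prompt.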

02

Voice identity preserved

The longest clean clip of each speaker is extracted, base64-encoded, and passed as that speaker's voice handle to voice cloning. Three speakers in = three distinct voices out, each matching the timbre and cadence of the original.

Uses mimo-v2.5-tts-voiceclone
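
A minimal sketch of the clip selection step, assuming diarization has already produced per-speaker segments that were filtered for clean audio. The `Clip` type and `reference_clips` function are illustrative names, not part of OmniDub or the MiMo API.

```python
import base64
from dataclasses import dataclass

@dataclass
class Clip:
    speaker: str
    start: float   # seconds
    end: float     # seconds
    audio: bytes   # encoded audio for this segment (assumed already clean)

def reference_clips(clips: list[Clip]) -> dict[str, str]:
    """Return {speaker: base64 audio} using each speaker's longest clip."""
    longest: dict[str, Clip] = {}
    for clip in clips:
        best = longest.get(clip.speaker)
        if best is None or (clip.end - clip.start) > (best.end - best.start):
            longest[clip.speaker] = clip
    return {
        speaker: base64.b64encode(clip.audio).decode("ascii")
        for speaker, clip in longest.items()
    }
```

Each value in the returned mapping is what would be sent as that speaker's reference audio; how the cloning endpoint expects it to be wrapped is not specified here.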

03

Timing preserved

The reasoning model translates each line with the target clip duration as a hard constraint. If the first draft would be too long, it explicitly reasons (for example, "shorten by dropping 'you know'") and iterates up to three times until the line fits within ±8% of the original shot's duration.

Uses mimo-v2.5-pro
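
The retry loop above could be sketched roughly as follows. The `translate` and `estimate_s` callables are hypothetical stand-ins for the model call and a duration estimator, and the sketch assumes "up to three times" means three attempts in total.

```python
from typing import Callable, Optional

TOLERANCE = 0.08   # ±8% of the shot duration, from the text above
MAX_ATTEMPTS = 3   # assumption: three attempts in total

def fits(draft_s: float, shot_s: float, tol: float = TOLERANCE) -> bool:
    """True when the draft duration is within tol of the shot duration."""
    return abs(draft_s - shot_s) <= tol * shot_s

def translate_to_fit(
    line: str,
    shot_s: float,
    translate: Callable[[str, float, Optional[str]], str],
    estimate_s: Callable[[str], float],
) -> str:
    """Ask for a translation, feeding back the overshoot until it fits."""
    feedback: Optional[str] = None
    draft = ""
    for _ in range(MAX_ATTEMPTS):
        draft = translate(line, shot_s, feedback)
        draft_s = estimate_s(draft)
        if fits(draft_s, shot_s):
            return draft
        direction = "shorten" if draft_s > shot_s else "lengthen"
        feedback = (
            f"Previous draft was {draft_s:.1f}s, the shot is {shot_s:.1f}s; "
            f"{direction} it."
        )
    return draft  # best effort after the final attempt
```

Returning the last draft instead of raising keeps the pipeline moving; a caller that needs a strict fit can re-check the result with `fits`.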

04

Lip-shape aware

When the speaker's mouth is open on an "O" shape, the translator is told to prefer Indonesian/Chinese syllables with matching vowels at that time index. The dub doesn't perfectly lip-sync — no one's does — but it stops looking obviously wrong.

Uses mimo-v2.5 vision + mimo-v2.5-pro
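
One way to picture the hand-off between the vision pass and the translator is a list of timed vowel hints. This is only a sketch: the `MouthShape` record, the `lip_hints` function, and the 0.3 s spacing threshold are all assumptions, and the actual format of the vision output is not documented here.

```python
from dataclasses import dataclass

@dataclass
class MouthShape:
    t: float     # seconds from the start of the line
    vowel: str   # vowel suggested by the visible mouth shape, e.g. "o"

def lip_hints(shapes: list[MouthShape], min_gap: float = 0.3) -> list[str]:
    """Turn mouth-shape observations into translator hints.

    Observations closer than min_gap seconds to the previous kept one are
    dropped so the translator is not over-constrained.
    """
    hints: list[str] = []
    last_t = None
    for shape in sorted(shapes, key=lambda s: s.t):
        if last_t is not None and shape.t - last_t < min_gap:
            continue
        hints.append(
            f"At {shape.t:.1f}s, prefer a syllable with the vowel '{shape.vowel}'."
        )
        last_t = shape.t
    return hints
```

The hints are phrased as preferences, matching the text's point that the goal is to avoid obviously wrong mouth shapes rather than to achieve exact lip sync.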

Demo

See the pipeline end-to-end. 62 seconds.

Source: two-speaker English clip · Target: Mandarin 中文 · Voice cloned per speaker

omnidub.local

1 · Drop a video

⇪

clip_interview.mp4 · 12.8 MB · 01:42

2 · Target language

🇮🇩 Indonesia · 🇨🇳 中文 · 🇬🇧 English · 🇯🇵 日本語

3 · Voice

🧬 Clone each speaker · 🎙️ Built-in

Single-page UI, designed mobile-first — one thumb, no dropdowns.

omnidub.local · processing

Dubbing in progress

Extracting audio
Transcribing + diarizing · mimo-v2.5
Fingerprinting speakers · voiceclone
Tagging emotion per line · mimo-v2.5
Timing-aware translation · mimo-v2.5-pro
Rendering voice · mimo-v2.5-tts
Muxing MP4

Real-time progress over Server-Sent Events — you see every MiMo call as it happens.
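
For readers unfamiliar with Server-Sent Events, the sketch below shows how progress frames like these might be serialized. The `event:` / `data:` lines followed by a blank line are the standard SSE wire format; the event name `progress`, the payload fields, and the function names are assumptions made for illustration.

```python
import json
from typing import Iterator, Optional

# Stage labels and model tags as they appear in the progress list above.
STAGES: list[tuple[str, Optional[str]]] = [
    ("Extracting audio", None),
    ("Transcribing + diarizing", "mimo-v2.5"),
    ("Fingerprinting speakers", "voiceclone"),
    ("Tagging emotion per line", "mimo-v2.5"),
    ("Timing-aware translation", "mimo-v2.5-pro"),
    ("Rendering voice", "mimo-v2.5-tts"),
    ("Muxing MP4", None),
]

def sse_event(event: str, data: dict) -> str:
    """Serialize one SSE frame: an event line, a data line, a blank line."""
    payload = json.dumps(data, ensure_ascii=False)  # stays on a single line
    return f"event: {event}\ndata: {payload}\n\n"

def progress_stream(stages: list[tuple[str, Optional[str]]]) -> Iterator[str]:
    """Yield one frame per pipeline stage, in order."""
    for step, (label, model) in enumerate(stages, start=1):
        yield sse_event(
            "progress",
            {"step": step, "total": len(stages), "label": label, "model": model},
        )
```

A web framework would send these strings in a response with the `text/event-stream` content type, and the browser would read them with `EventSource`.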

Install

One command. Two minutes.

Grab a MiMo API key

Sign up at platform.xiaomimimo.com. Token Plan subscribers get preferential rates and a full Credits reset each month.

Clone & configure

```
git clone https://github.com/kohkoh099-boop/omnidub
cd omnidub
cp .env.example .env
# paste your MIMO_API_KEY into .env
```

Run
```
docker compose up --build
```
Open http://localhost:8080. Done.

Why MiMo

The only stack that could pull this off.

Four models, one API key, zero external ML

Every arrow in the dataflow touches api.xiaomimimo.com/v1. No Whisper, no ElevenLabs, no pyannote. ASR, diarization, emotion classification, reasoning translation, and TTS — all MiMo.

mimo-v2.5 · mimo-v2.5-pro · mimo-v2.5-tts · mimo-v2.5-tts-voiceclone

Emotion as a first-class TTS input

MiMo-V2.5-TTS accepts natural-language delivery tags inline, which few major TTS systems do. Example: (whispering)The deadline is tomorrow. We push the classified emotion straight into the prompt, with no post-hoc prosody hacks.

Reasoning model for dubbing translation

Dubbing translation is a constraint satisfaction problem, not a pure NMT task. MiMo-V2.5-Pro can actually reason about "your previous draft was 2.3s, the shot is 1.8s, drop the filler" and self-correct.

Built-in voices for fallback

Short clips that don't have enough clean audio for cloning gracefully degrade to the built-in Chloe voice (or any voice the user picks), so the pipeline never fails on edge cases.
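
The fallback decision might look something like the sketch below. The `MIN_CLONE_SECONDS` threshold and the `choose_voice` function are hypothetical; the text does not say how much clean audio cloning needs. "Chloe" is the built-in default named above.

```python
from typing import Optional

MIN_CLONE_SECONDS = 3.0   # hypothetical threshold, not taken from the source
DEFAULT_VOICE = "Chloe"   # built-in fallback voice named in the text

def choose_voice(clean_seconds: float, user_voice: Optional[str] = None) -> dict:
    """Clone when there is enough clean audio, else use a built-in voice."""
    if clean_seconds >= MIN_CLONE_SECONDS:
        return {"mode": "clone"}
    return {"mode": "builtin", "voice": user_voice or DEFAULT_VOICE}

print(choose_voice(0.8))
# {'mode': 'builtin', 'voice': 'Chloe'}
```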