*Chengyao Wang, Zhisheng Zhong, Bohao Peng, Senqiao Yang, Yuqi Liu, Bei Yu, Jiaya Jia
*Equal Contribution
Github: https://github.com/dvlab-research/MGM-Omni
Demo: https://huggingface.co/spaces/wcy1122/MGM-Omni
Model: https://huggingface.co/collections/wcy1122/mgm-omni-6896075e97317a88825032e1
— August 18, 2025
MGM-Omni is an omni-chatbot that processes text, image, video, and speech inputs and generates both text and speech responses. Building on MiniGemini and MiniGemini v2 (Lyra), MiniGemini-Omni (MGM-Omni) supports long-form speech understanding and generation, as well as voice cloning in both Chinese and English.
We provide examples of [Voice Clone], [Long Speech Generation], and [Long Audio Understanding]. If you would like to explore more, feel free to try our web demo!
Figure 1. Model overview: MGM-Omni decouples audio understanding and generation into an MLLM and a SpeechLM. We integrate a Flow Matching model to further improve speech synthesis quality.
MGM-Omni is an advanced omni model designed to handle text, image, video, and speech inputs, and to generate both text and speech. Inputs from different modalities are processed by modality-specific encoders, and the extracted features are fed into the MLLM. The text generated by the MLLM is then passed to the SpeechLM, which produces speech tokens using a chunk-based parallel decoding strategy. These speech tokens are transformed into Mel-spectrograms by a Flow Matching model, and the final audio is synthesized by a vocoder. For the audio decoding stage, MGM-Omni adopts the speech tokenizer and Flow Matching model from CosyVoice2 and uses HiFi-GAN as the vocoder.
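To make the data flow concrete, below is a minimal sketch of this decoupled generation pipeline. It is not the MGM-Omni implementation: every function name, vocabulary size, upsampling ratio, and hop length here is a hypothetical placeholder, and the MLLM, SpeechLM, Flow Matching model, and HiFi-GAN vocoder are stubbed with toy tensors purely to illustrate the stage-by-stage shapes and ordering.

```python
# Hypothetical sketch of the decoupled pipeline: MLLM text -> SpeechLM tokens
# (chunk by chunk) -> Flow Matching mel spectrogram -> vocoder waveform.
# All names, sizes, and ratios below are illustrative assumptions, not the
# actual MGM-Omni / CosyVoice2 / HiFi-GAN APIs.
import torch


def mllm_generate_text(prompt: str) -> str:
    # Stub: the MLLM consumes encoded multimodal features and emits a text response.
    return "Hello, this is a speech response."


def speechlm_generate_tokens(text: str, chunk_size: int = 25) -> list[torch.Tensor]:
    # Stub: the SpeechLM maps text to discrete speech tokens, emitted in chunks
    # (standing in for chunk-based parallel decoding).
    num_tokens = 4 * len(text.split())                 # toy token budget
    tokens = torch.randint(0, 4096, (num_tokens,))     # assumed token vocabulary size
    return list(tokens.split(chunk_size))


def flow_matching_to_mel(speech_tokens: torch.Tensor) -> torch.Tensor:
    # Stub: the Flow Matching model maps speech tokens to a mel spectrogram.
    frames_per_token = 2                               # assumed token-to-frame ratio
    return torch.randn(80, speech_tokens.numel() * frames_per_token)


def vocoder_to_wave(mel: torch.Tensor, hop_length: int = 256) -> torch.Tensor:
    # Stub: a HiFi-GAN-style vocoder renders the mel spectrogram to a waveform.
    return torch.randn(mel.shape[-1] * hop_length)


if __name__ == "__main__":
    text = mllm_generate_text("Describe this image and read it aloud.")
    waveform_chunks = []
    for chunk in speechlm_generate_tokens(text):
        mel = flow_matching_to_mel(chunk)              # speech tokens -> mel spectrogram
        waveform_chunks.append(vocoder_to_wave(mel))   # mel spectrogram -> audio chunk
    audio = torch.cat(waveform_chunks)
    print(f"text: {text!r}, audio samples: {audio.numel()}")
```

Because each chunk of speech tokens can be turned into audio as soon as it is produced, this decoupled design lets speech synthesis proceed while the SpeechLM is still decoding, which is what makes long-form generation practical.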