*Chengyao Wang, Zhisheng Zhong, Bohao Peng, Senqiao Yang, Yuqi Liu, Bei Yu, Jiaya Jia
*Equal Contribution
Github: https://github.com/dvlab-research/MGM-Omni
Demo: https://huggingface.co/spaces/wcy1122/MGM-Omni
Model: https://huggingface.co/collections/wcy1122/mgm-omni-6896075e97317a88825032e1
— August 18, 2025
MGM-Omni is an omni-chatbot that processes text, image, video, and speech inputs and generates both text and speech responses. Building on MiniGemini and MiniGemini v2 (Lyra), MiniGemini-Omni (MGM-Omni) supports long-form speech understanding and generation, as well as voice cloning in both Chinese and English.
We provide examples of [Voice Clone], [Long Speech Generation], and [Long Audio Understanding]. If you would like to explore more, feel free to try our web demo!
Figure 1. Model overview: MGM-Omni decouples audio understanding and generation into an MLLM and a SpeechLM. We integrate a Flow Matching model to further improve speech synthesis quality.
MGM-Omni is an advanced omni model designed to handle text, image, video, and speech inputs, and to generate both text and speech. Inputs from different modalities are processed by modality-specific encoders, and the extracted features are fed into the MLLM. The text generated by the MLLM is then passed to the SpeechLM, which produces speech tokens using a chunk-based parallel decoding strategy. These speech tokens are transformed into Mel-spectrograms by a Flow Matching model, and the final audio is synthesized by a vocoder. For the audio decoding stage, MGM-Omni adopts the speech tokenizer and Flow Matching model from CosyVoice2 and uses HiFi-GAN as the vocoder.
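To make the data flow concrete, below is a minimal sketch of this decoupled generation pipeline. It is not the MGM-Omni implementation: every function name, vocabulary size, upsampling ratio, and hop length here is a hypothetical placeholder, and the MLLM, SpeechLM, Flow Matching model, and HiFi-GAN vocoder are stubbed with toy tensors purely to illustrate the stage-by-stage shapes and ordering.

```python
# Hypothetical sketch of the decoupled pipeline: MLLM text -> SpeechLM tokens
# (chunk by chunk) -> Flow Matching mel spectrogram -> vocoder waveform.
# All names, sizes, and ratios below are illustrative assumptions, not the
# actual MGM-Omni / CosyVoice2 / HiFi-GAN APIs.
import torch


def mllm_generate_text(prompt: str) -> str:
    # Stub: the MLLM consumes encoded multimodal features and emits a text response.
    return "Hello, this is a speech response."


def speechlm_generate_tokens(text: str, chunk_size: int = 25) -> list[torch.Tensor]:
    # Stub: the SpeechLM maps text to discrete speech tokens, emitted in chunks
    # (standing in for chunk-based parallel decoding).
    num_tokens = 4 * len(text.split())                 # toy token budget
    tokens = torch.randint(0, 4096, (num_tokens,))     # assumed token vocabulary size
    return list(tokens.split(chunk_size))


def flow_matching_to_mel(speech_tokens: torch.Tensor) -> torch.Tensor:
    # Stub: the Flow Matching model maps speech tokens to a mel spectrogram.
    frames_per_token = 2                               # assumed token-to-frame ratio
    return torch.randn(80, speech_tokens.numel() * frames_per_token)


def vocoder_to_wave(mel: torch.Tensor, hop_length: int = 256) -> torch.Tensor:
    # Stub: a HiFi-GAN-style vocoder renders the mel spectrogram to a waveform.
    return torch.randn(mel.shape[-1] * hop_length)


if __name__ == "__main__":
    text = mllm_generate_text("Describe this image and read it aloud.")
    waveform_chunks = []
    for chunk in speechlm_generate_tokens(text):
        mel = flow_matching_to_mel(chunk)              # speech tokens -> mel spectrogram
        waveform_chunks.append(vocoder_to_wave(mel))   # mel spectrogram -> audio chunk
    audio = torch.cat(waveform_chunks)
    print(f"text: {text!r}, audio samples: {audio.numel()}")
```

Because each chunk of speech tokens can be turned into audio as soon as it is produced, this decoupled design lets speech synthesis proceed while the SpeechLM is still decoding, which is what makes long-form generation practical.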