Alibaba (NYSE:BABA) unveiled Qwen3-Omni, an open-source multimodal large language model that can process text, images, audio, and video.
The model accepts all four modalities as input and delivers real-time streaming responses in both text and natural speech.
“Introducing Qwen3-Omni — the first natively end-to-end omni-modal AI unifying text, image, audio & video in one model — no modality trade-offs,” said Alibaba’s Qwen in a post on X, formerly Twitter.
Qwen3-Omni supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages, according to a blog post by the company.
The Chinese tech giant noted that Qwen3-Omni can be freely adapted via system prompts to modify response styles, personas, and behavioral attributes. The model also supports function calling, enabling seamless integration with external tools and services, as sketched below.
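The snippet below is a minimal, illustrative sketch of those two capabilities, assuming an OpenAI-compatible chat endpoint; the endpoint URL, hosted model identifier, and the get_weather tool are placeholders for illustration, not values documented by Alibaba.

```python
# Illustrative sketch: steering persona via a system prompt and exposing a tool
# through function calling, using the OpenAI Python client against an
# OpenAI-compatible endpoint. Endpoint URL, model id, and the tool are assumptions.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",  # placeholder
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",  # assumed endpoint
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Look up the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="qwen3-omni-30b-a3b-instruct",  # assumed hosted model id
    messages=[
        # System prompt adjusts style and persona, per Alibaba's description.
        {"role": "system", "content": "You are a terse, friendly travel assistant."},
        {"role": "user", "content": "Should I pack an umbrella for Hangzhou tomorrow?"},
    ],
    tools=tools,  # function calling lets the model request external data
)
print(response.choices[0].message)
```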
Alibaba added that Qwen3-Omni-30B-A3B-Captioner, a detailed, low-hallucination universal audio captioning model, fills a gap in the open-source community.
“We’ve open-sourced Qwen3-Omni-30B-A3B-Instruct, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner, to empower developers to explore a variety of applications from instruction-following to creative tasks,” Qwen said on X.
Alibaba noted that across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source SOTA [State-of-the-Art] on 32 benchmarks and overall SOTA on 22, outperforming strong closed-source models such as Alphabet’s (GOOG) (GOOGL) Gemini-2.5-Pro, Seed-ASR, and OpenAI’s GPT-4o-Transcribe.
In May 2024, Microsoft (MSFT)-backed OpenAI unveiled GPT-4o, starting the trend of ‘omni’ models, and made it available for free to all users. Google’s Gemini 2.5 Pro, introduced in March 2025, can also analyze video but, like GPT-4o, requires payment to use.
Meanwhile, Qwen3-Omni can be downloaded, modified, and deployed for free under an enterprise-friendly Apache 2.0 license, even for commercial applications, according to a report by VentureBeat.
Google’s open-source, Apache 2.0-licensed Gemma 3n, introduced in May, is probably a close rival, as it also accepts video, audio, text, and images as input but outputs only text, the report added.
Alibaba said Qwen3-Omni adopts a Thinker-Talker architecture: Thinker handles text generation, while Talker generates streaming speech tokens from the high-level representations it receives directly from Thinker. A toy sketch of that dataflow follows.
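The sketch below only illustrates the described dataflow; the class names, fields, and token values are invented for clarity and do not reflect Alibaba's actual implementation.

```python
# Toy illustration of the Thinker-Talker dataflow described above; all names
# and shapes are invented for clarity, not taken from Qwen3-Omni's code.
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class ThinkerOutput:
    text: str                   # generated text response
    hidden_states: List[float]  # high-level representations handed to Talker


class Thinker:
    def generate(self, multimodal_inputs: dict) -> ThinkerOutput:
        # In the real model this would be an autoregressive text decoder over
        # fused text/image/audio/video features; here it returns dummy values.
        return ThinkerOutput(text="Hello!", hidden_states=[0.1, 0.2, 0.3])


class Talker:
    def stream_speech(self, thinker_out: ThinkerOutput) -> Iterator[int]:
        # Consumes Thinker's high-level representations and emits speech tokens
        # incrementally, which is what enables low-latency streaming audio.
        for i, _ in enumerate(thinker_out.hidden_states):
            yield i  # stand-in for a discrete speech codec token


thinker, talker = Thinker(), Talker()
out = thinker.generate({"text": "Say hi", "audio": None})
print(out.text, list(talker.stream_speech(out)))
```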