Alibaba’s (BABA) latest flagship reasoning AI model, Qwen3-Max-Thinking, outperforms several rivals on multiple benchmarks, the company said.
The Qwen family of large language models is developed by the Alibaba Cloud division.
“By scaling up model parameters and leveraging substantial computational resources for reinforcement learning, Qwen3-Max-Thinking achieves significant performance improvements across multiple dimensions, including factual knowledge, complex reasoning, instruction following, alignment with human preferences, and agent capabilities,” Alibaba said. “On 19 established benchmarks, it demonstrates performance comparable to leading models such as GPT-5.2-Thinking (OPENAI), Claude-Opus-4.5 (ANTHRO), and Gemini 3 Pro (GOOG)(GOOGL).”
The new model features two innovations: adaptive tool-use capabilities that enable on-demand retrieval and code interpreter invocation, and advanced test-time scaling techniques that increase reasoning performance.
Alibaba said Qwen3-Max-Thinking with test-time scaling techniques surpassed DeepSeek-V3.2 (DEEPSEEK), Claude-Opus-4.5, GPT-5.2, and Gemini 3 Pro on the GPQA Diamond, IMO-AnswerBench, LiveCodeBench, and Humanity’s Last Exam benchmarks.