Alibaba's New AI Models: A Leap in Maths-Specific AI
Alibaba Group Holding has introduced its latest artificial intelligence (AI) innovation: maths-specific large language models (LLMs) named Qwen2-Math. According to Alibaba, these models surpass renowned AI systems such as OpenAI's GPT-4o and Google's LLMs, specifically at solving mathematical problems.
The Qwen team, a branch of Alibaba's cloud computing unit, stated, "Over the past year, we have dedicated significant efforts to researching and enhancing the reasoning capabilities of large language models, with a particular focus on their ability to solve arithmetic and mathematical problems."
The Qwen2-Math Models: Structure and Performance
The newly launched models are built on the foundation of the Qwen2 LLMs released in June. The models are differentiated by their number of parameters, the internal variables a machine-learning system adjusts during training. These parameters are crucial because they determine how the AI maps data inputs to its outputs.
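As a loose illustration of what a parameter count measures (this toy example is not Qwen's actual architecture), a single dense layer in a neural network already carries one learned weight per input-output pair, plus one bias per output:

```python
def parameter_count(n_inputs: int, n_outputs: int) -> int:
    """A single dense layer has n_inputs * n_outputs weights plus n_outputs biases."""
    return n_inputs * n_outputs + n_outputs

# One hypothetical layer mapping 4,096 inputs to 4,096 outputs:
print(parameter_count(4096, 4096))  # 16,781,312 parameters in a single layer
```

Stacking dozens of such layers is how model sizes reach the tens of billions of parameters, as in the 72-billion-parameter model discussed below.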
Among the models, the Qwen2-Math-72B-Instruct stands out with its extensive parameter count, outperforming several US-developed LLMs, including GPT-4o, Anthropic’s Claude 3.5 Sonnet, Google’s Gemini 1.5 Pro, and Meta Platforms’ Llama-3.1-405B, on various mathematical benchmarks.
Testing and Benchmarking
The Qwen2-Math models were rigorously tested using both English and Chinese mathematics benchmarks. These included GSM8K, a comprehensive dataset of 8,500 diverse grade school maths problems; OlympiadBench, a high-level bilingual scientific benchmark; and the gaokao, the challenging Chinese university entrance examination.
Despite these achievements, the Qwen team acknowledged that the models currently support English only. Plans are underway to release bilingual models shortly, with multilingual capabilities on the horizon.
Open Source and Community Involvement
Alibaba's initiative, Tongyi Qianwen, has been available to third-party developers for over a year. Because the project is open source, the public can access its source code, enabling developers to modify or enhance the system's design and capabilities.
In July, Qwen2-72B-Instruct was ranked just behind GPT-4o and Claude 3.5 Sonnet in the LLM rankings by SuperClue, a benchmarking platform that assesses models based on various metrics, including mathematical calculations, logical reasoning, coding, and text comprehension.
The Competition in AI Development
SuperClue highlighted that the gap between Chinese and US AI models is closing rapidly, with significant progress made by Chinese developers in the first half of the year. A separate evaluation by LMSYS, an AI research organisation supported by the University of California, Berkeley, ranked Qwen2-72B 20th, while models from OpenAI, Anthropic, and Google dominated the top positions.