Chinese AI startup DeepSeek released a research paper claiming that its R1 model was trained at a far lower cost than what U.S. competitors have reported.
The training of DeepSeek’s reasoning-focused R1 model cost $294,000 and used slightly more than 500 Nvidia (NASDAQ:NVDA) H800 GPUs, according to a research paper published by the startup in the journal Nature.
DeepSeek’s claims about R1’s development costs and the technology behind it were questioned by several people in the technology space earlier this year. In January, Tesla (TSLA) CEO Elon Musk seemed to agree with Scale AI CEO Alexandr Wang’s suggestion that China’s DeepSeek had access to roughly 50,000 Nvidia H100 chips, which it could not discuss because of U.S. export controls.
AI-related stocks, including Nvidia, saw their market caps drop sharply after the release of R1 in January. The stocks have since recouped those losses.
A100 use
In a supplementary paper, DeepSeek admitted for the first time that it owned Nvidia’s A100 chips and had used them in preparatory stages of development.
“Regarding our research on DeepSeek-R1, we utilized the A100 GPUs to prepare for the experiments with a smaller model (30B parameters),” DeepSeek researchers wrote in supplementary information to the paper. “The results from this smaller model have been promising, which has allowed us to confidently scale up to 660B R1-Zero and R1.”
DeepSeek CEO Liang Wenfeng is listed as one of the authors of the paper.
According to the paper, DeepSeek’s R1 model incentivizes reasoning in large language models through reinforcement learning.
The researchers also said they used 64 of Nvidia’s H800 GPUs to train the DeepSeek-R1-Zero model, a process that took approximately 198 hours; 512 H800 GPUs were used in total. The paper also compared DeepSeek-R1-Zero with DeepSeek-R1, which took roughly 80 hours to train, the researchers added.
Researchers showed that the reasoning abilities of LLMs can be incentivized through pure reinforcement learning, or RL, removing the need for human-labelled reasoning trajectories.
The trained model achieved superior performance on verifiable tasks such as mathematics, coding competitions, and STEM-field problems, surpassing counterparts trained via conventional supervised learning on human demonstrations.
RL is a machine learning method where an agent learns to make optimal decisions by interacting with an environment through trial and error.
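The trial-and-error loop described above can be illustrated with a minimal tabular Q-learning sketch. This is a generic textbook example of reinforcement learning, not DeepSeek’s training setup; the toy corridor environment, reward scheme, and hyperparameters are arbitrary choices for illustration.

```python
import random

random.seed(0)

N_STATES = 5                      # corridor states 0..4; reaching state 4 gives reward 1
ACTIONS = [-1, +1]                # step left or step right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

# Q-table: estimated future reward for each (state, action) pair
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(200):
    s = 0
    while s != N_STATES - 1:
        # epsilon-greedy: mostly exploit current estimates, occasionally explore
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), N_STATES - 1)
        reward = 1.0 if s_next == N_STATES - 1 else 0.0
        # temporal-difference update toward reward plus discounted future value
        best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
        Q[(s, a)] += ALPHA * (reward + GAMMA * best_next - Q[(s, a)])
        s = s_next

# After training, the greedy policy should step right (+1) from every state
policy = {s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES - 1)}
print(policy)
```

The agent starts knowing nothing about the environment; purely by acting and observing rewards, its value estimates converge until the greedy policy always moves toward the goal. Systems like R1 apply this same feedback principle at vastly larger scale, rewarding correct answers to verifiable problems instead of corridor steps.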