Microsoft (MSFT) said it has achieved a new AI inference record, with its Azure ND GB300 v6 virtual machines processing 1.1 million tokens per second on a single rack powered by Nvidia (NVDA) GB300 GPUs.
The performance test was conducted using the Llama 2 70B generative text model and Nvidia's open-source TensorRT-LLM library for optimizing large language model inference.
The test marked a roughly 27% per-GPU speedup, from 12,022 tokens/s on a previous-generation Nvidia Blackwell GPU to 15,200 tokens/s on a Blackwell Ultra GPU. At the rack level, it beat the previous Azure ND GB200 v6 record of 865,000 tokens/s by about 27%.
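A quick sanity check of the reported percentages, using only the throughput figures cited above:

```python
# Verify the ~27% gains from the figures reported in the article.
per_gpu_prev = 12_022      # tokens/s per Blackwell (GB200) GPU
per_gpu_new = 15_200       # tokens/s per Blackwell Ultra (GB300) GPU
rack_prev = 865_000        # tokens/s, prior ND GB200 v6 rack record
rack_new = 1_100_000       # tokens/s, new ND GB300 v6 rack record

per_gpu_gain = (per_gpu_new / per_gpu_prev - 1) * 100
rack_gain = (rack_new / rack_prev - 1) * 100

print(f"Per-GPU gain:  {per_gpu_gain:.1f}%")   # -> 26.4%
print(f"Per-rack gain: {rack_gain:.1f}%")      # -> 27.2%
```

Both ratios land in the 26–27% range, consistent with the round "27%" figure quoted in the announcement.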
Microsoft CEO Satya Nadella said the result “sets an industry record made possible by our co-innovation with NVIDIA and Azure’s expertise in running AI at production scale.”