Microsoft Azure hits 1.1 million tokens/sec AI inference record

Microsoft (MSFT) said it has achieved a new AI inference record, with its Azure ND GB300 v6 virtual machines processing 1.1 million tokens per second on a single rack powered by Nvidia (NVDA) GB300 GPUs.

The performance test was conducted using the Llama 2 70B generative text model and Nvidia's TensorRT-LLM, an open-source library for optimizing large language model inference.

Per GPU, throughput rose from 12,022 tokens/sec on the previous-generation Nvidia Blackwell GPU to 15,200 tokens/sec on the Blackwell Ultra, an improvement of roughly 26%. The full-rack result also beat the previous Azure ND GB200 v6 record of 865,000 tokens/sec by 27%.
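A quick way to check the percentage claims is to compute them directly from the reported figures (all tokens/sec values below are those stated in the article):

```python
# Sanity-check the reported throughput improvements.
# Figures are the tokens/sec values reported by Microsoft.

def pct_gain(old: float, new: float) -> float:
    """Percentage improvement going from old to new."""
    return (new / old - 1) * 100

# Blackwell -> Blackwell Ultra, per GPU
per_gpu = pct_gain(12_022, 15_200)
# ND GB200 v6 -> ND GB300 v6, per rack
per_rack = pct_gain(865_000, 1_100_000)

print(f"Per-GPU gain:  {per_gpu:.1f}%")   # ~26.4%
print(f"Per-rack gain: {per_rack:.1f}%")  # ~27.2%
```

The per-GPU figure works out to about 26%, while the rack-level record is a ~27% improvement.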

Microsoft CEO Satya Nadella said the result “sets an industry record made possible by our co-innovation with NVIDIA and Azure’s expertise in running AI at production scale.”
