Nvidia unveils AI model for music, audio that can modify voices, generate new sounds
Nvidia (NASDAQ:NVDA) unveiled a new AI model called Fugatto for generating music and audio, aimed at people producing music, films and video games.
Fugatto (Foundational Generative Audio Transformer Opus) generates or transforms any mix of music, voices and sounds described with prompts using any combination of text and audio files, according to the company.
For example, the AI model can create a music snippet based on a text prompt, add instruments to or remove them from an existing song, change the accent or emotion in a voice, and even let people produce sounds never heard before, the company said in a blog post on Monday.
“We wanted to create a model that understands and generates sound like humans do,” said Rafael Valle, a manager of applied audio research at Nvidia, and also an orchestral conductor and composer.
Nvidia noted that an ad agency could apply Fugatto to quickly adapt an existing campaign for several regions, applying different accents and emotions to voiceovers. In addition, video game developers could use the AI model to modify prerecorded assets in their titles to fit the changing action as users play the game.
Fugatto can make a trumpet bark or a saxophone meow. With fine-tuning and small amounts of singing data, researchers found it could handle tasks it was not pretrained on, such as generating a high-quality singing voice from a text prompt, the company added.
Nvidia said Fugatto’s full version uses 2.5 billion parameters and was trained on a bank of NVIDIA DGX systems packing 32 NVIDIA H100 Tensor Core GPUs. The overall work on the model took more than a year.
Fugatto could compete with similar technologies from startups such as Runway and bigger companies like Meta Platforms (META). In October, the Facebook owner unveiled its AI model called Movie Gen, which can create realistic-seeming video and audio clips based on user prompts.
In February, ChatGPT-maker OpenAI introduced Sora, which can create realistic and imaginative scenes from text instructions. The text-to-video model from the Microsoft (MSFT)-backed company has not been released to the public yet.