Google (GOOG) (GOOGL) on Tuesday unveiled Gemini Embedding 2, the tech giant’s newest multimodal artificial intelligence model, which maps text, images, video, audio, and documents into a single embedding space.
“Gemini Embedding 2 maps text, images, videos, audio and documents into a single, unified embedding space, and captures semantic intent across over 100 languages,” Google said in a blog post. “This simplifies complex pipelines and enhances a wide variety of multimodal downstream tasks—from Retrieval-Augmented Generation (RAG) and semantic search to sentiment analysis and data clustering.”
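To illustrate what a single, unified embedding space makes possible, the sketch below ranks mixed-media items against one text query using the same similarity metric for every modality. The vectors are stand-ins for model output, and the file names are hypothetical; this is not code from Google.

```python
# Illustrative only: cross-modal semantic search in a shared embedding space.
# The vectors below stand in for embeddings the model would produce for a text
# query and for mixed-media items (an image, a video clip, a PDF).
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in embeddings (in practice these come from the embedding model).
query_embedding = np.random.default_rng(0).normal(size=768)
item_embeddings = {
    "product_photo.png": np.random.default_rng(1).normal(size=768),
    "demo_clip.mp4": np.random.default_rng(2).normal(size=768),
    "spec_sheet.pdf": np.random.default_rng(3).normal(size=768),
}

# Because every modality lands in the same vector space, one query can be
# ranked against images, video, and documents with a single similarity score.
ranked = sorted(
    item_embeddings.items(),
    key=lambda kv: cosine_similarity(query_embedding, kv[1]),
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```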
The latest addition to the Gemini family of AI models supports up to 8,192 input tokens for text; processes up to 6 images per request in PNG and JPEG formats; handles up to 120 seconds of video in MP4 and MOV formats; ingests and embeds audio without the need for transcription; and directly embeds PDFs of up to 6 pages, as sketched in the example below.
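The blog post does not spell out the developer interface. As a rough sketch of how such a request might look through Google's google-genai Python SDK, the snippet below uses the SDK's existing embed_content call; the model identifier "gemini-embedding-2" and the assumption that embed_content accepts image and PDF parts are inferred from the capabilities described above, not confirmed by the announcement.

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

# "gemini-embedding-2" is an assumed identifier; whether embed_content accepts
# image and PDF parts alongside text is likewise an assumption based on the
# capabilities the article describes.
result = client.models.embed_content(
    model="gemini-embedding-2",
    contents=[
        "A text passage of up to 8,192 tokens.",
        types.Part.from_bytes(
            data=open("photo.jpg", "rb").read(), mime_type="image/jpeg"
        ),
        types.Part.from_bytes(
            data=open("report.pdf", "rb").read(), mime_type="application/pdf"
        ),
    ],
)

# One embedding vector per input, all in the same vector space.
for embedding in result.embeddings:
    print(len(embedding.values))
```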
“Gemini Embedding 2 doesn’t just improve on legacy models,” Google added, comparing it with similar offerings from Amazon (AMZN), Voyage, and other Google models. “It establishes a new performance standard for multimodal depth, introducing strong speech capabilities and outperforming leading models in text, image, and video tasks. This measurable improvement and unique multimodal coverage give developers exactly what they need for their diverse embedding needs.”
Google shares were up fractionally in midday trading on Tuesday.