OpenAI introduces new benchmark to measure expert-level scientific reasoning

OpenAI (OPENAI) has introduced FrontierScience, a new benchmark designed to measure expert-level scientific reasoning across biology, chemistry, and physics.

“FrontierScience is written and verified by experts across physics, chemistry, and biology, and consists of hundreds of questions designed to be difficult, original, and meaningful,” Microsoft-backed (MSFT) OpenAI said in a blog post.

The benchmark has two tracks: Olympiad, which measures Olympiad-style scientific reasoning, and Research, which measures real-world scientific research abilities. OpenAI’s latest model, GPT-5.2, scored 77% on the Olympiad track and 25% on the Research track.

In comparison, Anthropic’s (ANTHRO) Claude Opus 4.5 scored 71.4% in Olympiad and 17.5% in Research. Google’s (GOOG)(GOOGL) Gemini Pro 3 recorded 76.1% in Olympiad and 12.4% in Research. xAI’s (X.AI) Grok 4 scored 66.2% in Olympiad and 15.9% in Research.

“Looking ahead, we expect progress in scientific reasoning to come from both better general-purpose reasoning systems and focused effort on improving scientific capabilities,” OpenAI said. “Benchmarks like FrontierScience help us understand the weaknesses of today’s AI systems to focus our work on making models reliable partners in scientific discovery.”
