A startup analysis by Oumi found Google’s Gemini 2 and Gemini 3–powered AI Overviews misstate facts frequently enough that, at projected search volumes, the feature could generate hundreds of thousands of false answers per minute. Oumi reviewed 8,652 AI Overviews using the SimpleQA benchmark and reported 85% accuracy for Gemini 2 and 91% for Gemini 3, but flagged growing problems with “ungrounded” claims and poor source citations. Examples include incorrect dates for public figures and claims contradicted by cited links. Publishers and the News/Media Alliance say AI Overviews both siphon traffic and undercut journalistic accuracy; Google disputes the study’s methodology. The findings matter for search reliability, publisher revenue, and AI accountability.
A startup analysis by Oumi found Google’s AI Overviews, powered by Gemini 2 and Gemini 3, were accurate on 85% and 91% of sampled answers, respectively, but could still produce millions of incorrect or unverified responses as Google scales toward trillions of searches. Oumi tested 4,326 results per model using the SimpleQA benchmark and flagged errors ranging from wrong dates to false claims about public figures, plus a rising share of “ungrounded” answers whose cited links don’t support the summary. Publishers and the News/Media Alliance warn Overviews siphon traffic and ad revenue while spreading misinformation; Google disputes Oumi’s methods. The findings raise concerns about trust, citation quality, and the impact of generative AI on news ecosystems.
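For readers unfamiliar with how a SimpleQA-style accuracy figure like the 85% and 91% above is produced, here is a minimal sketch of grading short factual answers and computing an accuracy rate. The questions, gold answers, and the string-match grader are illustrative assumptions only; the real benchmark grades responses with an LLM judge, and this is not Oumi’s actual harness.

```python
# Minimal sketch of SimpleQA-style accuracy scoring.
# NOTE: the triples below are made-up placeholders, not benchmark data, and the
# substring grader stands in for the LLM-based judge that real harnesses use.
import re

def normalize(text: str) -> str:
    """Lowercase and strip punctuation/extra whitespace for a loose comparison."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def grade(model_answer: str, gold_answer: str) -> bool:
    """Count an answer correct if the gold fact appears in the model's response."""
    return normalize(gold_answer) in normalize(model_answer)

# Hypothetical (question, gold answer, model answer) triples.
samples = [
    ("Who wrote 'Middlemarch'?", "George Eliot", "It was written by George Eliot."),
    ("In what year did Apollo 11 land on the Moon?", "1969", "Apollo 11 landed in 1969."),
    ("What is the capital of Australia?", "Canberra", "The capital is Sydney."),  # a miss
]

correct = sum(grade(model, gold) for _question, gold, model in samples)
accuracy = correct / len(samples)
print(f"accuracy: {accuracy:.1%} ({correct}/{len(samples)})")
# A real run iterates over thousands of prompts (Oumi reports 4,326 per model).
```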
A New York Times analysis with startup Oumi finds that Google’s Gemini-powered AI Overviews answer SimpleQA benchmark questions correctly about 90% of the time, meaning roughly 1 in 10 responses is incorrect, which could scale to tens of millions of wrong answers daily across Google searches. Oumi ran SimpleQA against different Gemini versions, seeing accuracy rise from 85% with Gemini 2.5 to 91% after Gemini 3, but highlighted concrete factual errors and misattributed sources. Google disputes the test’s validity, saying SimpleQA contains errors and that it uses its own vetted SimpleQA Verified set and multiple models (Gemini Flash vs. Pro) to balance speed and accuracy. The report underscores challenges in benchmarking, grounding, and deploying large language models at web scale.
The New York Times, working with AI startup Oumi, tested Google’s Gemini-powered AI Overviews using the SimpleQA benchmark and found the feature answered about 91% of questions correctly, meaning roughly 1 in 10 responses are incorrect. Oumi ran the benchmark across Gemini updates (85% on Gemini 2.5, 91% after Gemini 3), and extrapolated that AI Overviews could be producing tens of millions of wrong answers daily given search volume. Google disputes the methodology, saying SimpleQA contains errors and that it prefers more vetted tests like SimpleQA Verified; it also notes AI Overviews uses different internal models (Gemini Flash vs. Pro) for speed. The story matters because search-integrated generative AI can amplify factual errors at massive scale, highlighting evaluation challenges and trade-offs between latency, cost, and accuracy.
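To see how a single-digit error rate becomes “tens of millions of wrong answers daily,” here is a back-of-envelope sketch of the extrapolation. Only the 85% and 91% accuracy figures come from the reporting; the daily count of searches that show an AI Overview is a placeholder assumption chosen purely for illustration.

```python
# Back-of-envelope: scaling a benchmark error rate to search volume.
# ASSUMPTION: overviews_per_day is an illustrative placeholder, not a reported
# figure; only the 85% / 91% accuracy numbers come from the study itself.
overviews_per_day = 500_000_000  # hypothetical daily count of searches showing an AI Overview

for label, accuracy in [("earlier Gemini", 0.85), ("Gemini 3", 0.91)]:
    error_rate = 1.0 - accuracy
    wrong_per_day = overviews_per_day * error_rate
    wrong_per_minute = wrong_per_day / (24 * 60)
    print(f"{label}: ~{wrong_per_day / 1e6:.0f} million wrong answers/day "
          f"(~{wrong_per_minute:,.0f}/minute)")
```

The point is less the exact totals, which shift with whatever volume one assumes, than how quickly even a 9% error rate compounds at web scale.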