Just how unreliable are generative AI models? Results of a new GAIA test at Princeton.
GAIA explained here:
https://hal.cs.princeton.edu/reliability/benchmark/gaia/
GAIA (General AI Assistants) is a benchmark designed to evaluate AI agents on real-world question-answering tasks that require multi-step reasoning, tool use, web browsing, and file manipulation. Questions are organized into three difficulty levels: Level 1 tasks typically require a single tool or a short chain of reasoning, Level 2 tasks demand combining multiple tools and reasoning over several steps, and Level 3 tasks involve long-horizon plans with many intermediate actions. Agents are evaluated on exact-match accuracy against annotated ground-truth answers. Because each question has a unique, verifiable answer, GAIA is well-suited for measuring not only correctness but also the reliability of the problem-solving process: consistency across repeated runs, calibration of expressed confidence, and robustness to perturbations in task formatting.
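To make "exact-match accuracy" and "consistency across repeated runs" concrete, here is a minimal Python sketch. The scoring function, field names, and example data are illustrative assumptions, not the actual Princeton evaluation harness:

    from collections import Counter

    def exact_match(prediction: str, ground_truth: str) -> bool:
        # GAIA-style scoring: normalized string equality with the annotated answer
        return prediction.strip().lower() == ground_truth.strip().lower()

    def score_runs(runs: list[str], ground_truth: str) -> dict:
        # runs: the same task attempted several times by the same model
        correct = [exact_match(r, ground_truth) for r in runs]
        counts = Counter(r.strip().lower() for r in runs)
        modal_count = counts.most_common(1)[0][1]
        return {
            "accuracy": sum(correct) / len(runs),     # fraction of runs matching ground truth
            "consistency": modal_count / len(runs),   # fraction agreeing with the modal answer
            "unique_answers": len(counts),            # distinct answers across repetitions
        }

    # Five repetitions of one task: the agent is right 3 times out of 5
    # and disagrees with itself, producing two distinct answers.
    print(score_runs(["Paris", "paris", "Lyon", "Paris", "Lyon"], "Paris"))
    # {'accuracy': 0.6, 'consistency': 0.6, 'unique_answers': 2}

A perfectly reliable agent would score a consistency of 1.0 on every task, even when its answer is wrong; the analysis below is about what happens when it doesn't.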
Analysis here:
https://hal.cs.princeton.edu/reliability/benchmark/gaia/analysis/
GAIA: Reliability Failure Analysis
How do frontier AI agents fail when given the same task multiple times? We ran Claude Opus 4.5, Gemini 2.5 Pro, and GPT 5.4 on GAIA's 165 real-world tasks with multiple repetitions per model, then examined cases where agents gave wrong answers, disagreed with themselves, or broke under tool failures and input perturbations. Below are the most instructive examples.
A note on ambiguity. Several of the failures below stem from genuinely ambiguous questions or inputs: tasks where the correct answer depends on an interpretation the benchmark authors likely assumed was obvious but isn't. GAIA was designed to test general-purpose assistant capabilities, not to stress-test edge cases in question wording, and some ambiguity is inevitable in a benchmark of this scope.
That said, ambiguity turns out to be a useful lens for reliability. A well-calibrated agent encountering a question with competing valid interpretations should recognize the ambiguity and lower its confidence accordingly, or flag the competing readings, rather than silently committing to one. In the examples below, models almost never do this. They resolve ambiguity nondeterministically across runs, report high confidence regardless of which interpretation they chose, and give no signal that the question admitted more than one reading.
The issue isn't that the models get the wrong answer on an ambiguous question; it's that they don't behave differently when a question is ambiguous versus when it isn't.
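That calibration failure is easy to express as a check. A rough sketch in Python, with hypothetical answer/confidence pairs rather than the study's actual data:

    def flag_overconfident_ambiguity(runs: list[tuple[str, float]],
                                     conf_threshold: float = 0.9) -> bool:
        # runs: (answer, self-reported confidence) pairs from repeated attempts at one task
        distinct_answers = {answer.strip().lower() for answer, _ in runs}
        mean_confidence = sum(conf for _, conf in runs) / len(runs)
        # Red flag: the model disagrees with itself across runs yet stays uniformly
        # confident, giving no signal about the competing readings it chose between.
        return len(distinct_answers) > 1 and mean_confidence >= conf_threshold

    # A different interpretation each run, ~95% reported confidence every time
    print(flag_overconfident_ambiguity([("June 2018", 0.95),
                                        ("July 2018", 0.96),
                                        ("June 2018", 0.94)]))
    # True

On a genuinely ambiguous question, a well-calibrated agent would keep this flag from firing by lowering its confidence, or would surface the competing readings explicitly.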
-snip-
4 replies
Just how unreliable are generative AI models? Results of a new GAIA test at Princeton. (Original Post)
highplainsdem (61,769 posts), 12 hrs ago
1. Nevilledog (55,046 posts): Anybody relying on AI without fact-checking is reckless.
2. highplainsdem (61,769 posts): Researcher Stephen Rabanser's long thread about this starts here on X (no Bluesky account, sorry)
3. highplainsdem: This message was self-deleted by its author. (Response to the original post)
4. Fiendish Thingy (23,001 posts): AI is increasing the world's stupidity. Nt