Reasoning or Rationalizing? Testing Model Robustness Against Misleading Information
Key Results at a Glance
| Model | Baseline | Correct Hints After | Correct Hints Before | Incorrect Hints After | Incorrect Hints Before |
|---|---|---|---|---|---|
| Gemini 2.0 Flash | 74.2% | 68.3% (-5.9pp) | 69.2% (-5.0pp) | 69.2% (-5.0pp) | 53.3% (-20.9pp) |
| OpenAI GPT-4o-mini | 53.3% | 53.3% (±0.0pp) | 41.7% (-11.6pp) | 51.7% (-1.6pp) | 33.3% (-20.0pp) |
*pp = percentage points vs baseline
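The percentage-point deltas in the table can be recomputed from the raw accuracies. The snippet below is a minimal sketch: the accuracy values are copied from the table, while the dictionary keys and the function name `deltas_vs_baseline` are illustrative assumptions, not part of the original experiment code.

```python
# Accuracies (%) copied from the results table.
# The condition keys are illustrative names, not taken from the original code.
RESULTS = {
    "Gemini 2.0 Flash": {
        "baseline": 74.2,
        "correct_after": 68.3,
        "correct_before": 69.2,
        "incorrect_after": 69.2,
        "incorrect_before": 53.3,
    },
    "OpenAI GPT-4o-mini": {
        "baseline": 53.3,
        "correct_after": 53.3,
        "correct_before": 41.7,
        "incorrect_after": 51.7,
        "incorrect_before": 33.3,
    },
}


def deltas_vs_baseline(scores):
    """Return the percentage-point change of each hint condition vs. baseline."""
    base = scores["baseline"]
    return {
        condition: round(accuracy - base, 1)
        for condition, accuracy in scores.items()
        if condition != "baseline"
    }


for model, scores in RESULTS.items():
    print(model, deltas_vs_baseline(scores))
```

Rounding to one decimal place matches the precision reported in the table.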
Research Question
How robust are autoregressive LLMs against misleading information, and does the position of this information affect their reasoning accuracy?
Specifically, we investigate whether models can maintain correct reasoning when exposed to incorrect hints, and whether the timing of this exposure (before vs. after questions) affects their robustness to misinformation.
Robustness Testing Framework
Autoregressive models generate tokens sequentially, with each token conditioned on all previous tokens:
$$P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1})$$
This architectural constraint reveals three robustness vulnerabilities:
- Information Position Sensitivity: Models show different robustness levels based on when misleading information appears
- Reasoning Fragility: Models struggle to maintain correct reasoning paths when exposed to contradictory information
- Asymmetric Robustness: Models are significantly less robust to early misinformation than late misinformation
Key Findings
1. Hints Paradoxically Hurt Performance
- Gemini: Baseline 74.2% → With correct hints 68-69%
- OpenAI: Unaffected by correct hints placed after the question (53.3%), but drops to 41.7% when the same hints come before it
- Implication: Models may be optimizing for coherence over correctness
2. Position Matters - Robustness Varies with Information Timing
- Incorrect hints BEFORE: Both models drop ~20 percentage points
- Incorrect hints AFTER: Gemini -5pp, OpenAI -1.6pp
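- The drop is roughly 4x larger for Gemini (20.9pp vs. 5.0pp) and over 12x larger for OpenAI (20.0pp vs. 1.6pp), which suggests that early context anchors reasoning more strongly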
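The before/after comparison for incorrect hints can be checked directly from the table values. This is a small sketch of that arithmetic; the names `DROPS_PP` and `position_ratio` are hypothetical and chosen only for illustration.

```python
# Percentage-point drops under incorrect hints, taken from the results table.
DROPS_PP = {
    "Gemini 2.0 Flash": {"before": 20.9, "after": 5.0},
    "OpenAI GPT-4o-mini": {"before": 20.0, "after": 1.6},
}


def position_ratio(drops):
    """How many times larger the drop is when the incorrect hint comes first."""
    return drops["before"] / drops["after"]


RATIOS = {model: round(position_ratio(d), 1) for model, d in DROPS_PP.items()}

for model, ratio in RATIOS.items():
    print(f"{model}: early misinformation hurts {ratio}x more than late")
```

The ratio differs considerably between the two models, so a single multiplier is best read as a lower bound rather than a constant.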
3. Models Exhibit Different Failure Modes
- Gemini: Higher baseline (74.2%) but more susceptible to any hints
- OpenAI: Lower baseline (53.3%) but catastrophic failure with early misinformation (→33.3%)
- Pattern: A higher baseline does not imply higher robustness; both models lose about 20 percentage points to early misinformation, though with only two models this pattern is tentative
Experimental Design
Dataset
- 120 questions (60 math, 60 science)
- 3 difficulty levels: Easy, Medium, Hard
- 5 experimental conditions per model
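The five conditions (baseline plus correct or incorrect hints, placed before or after the question) can be produced by a small prompt builder. The sketch below is an assumption about how such prompts might be assembled: the hint wording, the condition names, and the `build_prompt` function are hypothetical and do not reproduce the exact templates used in the experiments.

```python
CONDITIONS = (
    "baseline",
    "correct_after",
    "correct_before",
    "incorrect_after",
    "incorrect_before",
)


def build_prompt(question, condition, correct_answer, incorrect_answer):
    """Assemble the prompt for one condition (hypothetical hint template)."""
    if condition not in CONDITIONS:
        raise ValueError(f"unknown condition: {condition}")
    if condition == "baseline":
        return question  # no hint at all

    kind, position = condition.split("_")
    hinted = correct_answer if kind == "correct" else incorrect_answer
    hint = f"Hint: the answer is {hinted}."  # assumed wording

    # Position is the only thing that differs between "before" and "after".
    if position == "before":
        return f"{hint}\n{question}"
    return f"{question}\n{hint}"


for condition in CONDITIONS:
    print(f"--- {condition} ---")
    print(build_prompt("What is 7 * 8?", condition, "56", "54"))
```

Keeping the hint text identical across positions means any accuracy difference between the "before" and "after" conditions can be attributed to placement alone.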
Models Tested
- Google Gemini 2.0 Flash (Latest multimodal model)
- OpenAI GPT-4o-mini (Efficient GPT-4 variant)
Why This Matters
Current Benchmarks Are Blind
Standard reasoning benchmarks (GSM8K, MATH, ARC) measure only final answer correctness. They do not measure:
- Robustness to framing
- Resistance to misleading context
- Actual reasoning vs. pattern matching
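A position-aware evaluation reports one accuracy figure per condition instead of a single score. The following is a minimal sketch of that aggregation, assuming graded results are available as `(condition, is_correct)` pairs; the record format, the function name, and the toy records are illustrative and are not data from this study.

```python
from collections import defaultdict


def accuracy_by_condition(records):
    """Aggregate (condition, is_correct) pairs into accuracy (%) per condition."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for condition, is_correct in records:
        totals[condition] += 1
        hits[condition] += int(is_correct)
    return {c: 100.0 * hits[c] / totals[c] for c in totals}


# Toy records for illustration only; these are not results from the experiments.
toy_records = [
    ("baseline", True),
    ("baseline", True),
    ("incorrect_before", True),
    ("incorrect_before", False),
]

print(accuracy_by_condition(toy_records))
```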
Real-World Implications
- Prompt Injection Vulnerability: Early tokens in prompts have outsized influence
- Adversarial Robustness: Models can be derailed by strategic misinformation placement
- Reasoning vs. Rationalization: Models generate plausible-sounding justifications, not logical derivations
- Evaluation Gaps: We're not measuring what we think we're measuring
Contributions
- Robustness Assessment: Quantifies how vulnerable models are to misleading information
- Simple Protocol: No expensive compute required, just careful prompt manipulation
- Benchmark Blindspot: Exposes a critical gap in current evaluation methods
- Quantified Effect: ~20pp accuracy drop with early misinformation (roughly 4x to 12x the drop caused by late misinformation, depending on the model)
Key Takeaway
Autoregressive LLMs lack robustness against misleading information. The sequential generation architecture creates fundamental vulnerabilities where models fail to maintain correct reasoning when exposed to incorrect hints, especially when that misinformation appears early. This robustness failure isn't a training issue to be fixed; it's a fundamental architectural limitation that must be understood and mitigated in deployment.
This research reveals that our most advanced language models can be derailed by the simple act of putting misleading information in the wrong place, a vulnerability that standard reasoning benchmarks do not currently measure.
