Reasoning or Rationalizing? Testing Model Robustness Against Misleading Information

Key Results at a Glance

| Model | Baseline | Correct Hints After | Correct Hints Before | Incorrect Hints After | Incorrect Hints Before |
|---|---|---|---|---|---|
| Gemini 2.0 Flash | 74.2% | 68.3% (-5.9pp) | 69.2% (-5.0pp) | 69.2% (-5.0pp) | 53.3% (-20.9pp) |
| OpenAI GPT-4o-mini | 53.3% | 53.3% (±0.0pp) | 41.7% (-11.6pp) | 51.7% (-1.6pp) | 33.3% (-20.0pp) |

*pp = percentage points vs baseline
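
The percentage-point deltas can be recomputed directly from the reported accuracies. The following is a minimal sketch; the dictionary layout and the names `RESULTS` and `pp_deltas` are illustrative and not taken from the research code.

```python
# Accuracies (in percent) copied from the results table above.
# The key names are hypothetical labels for the five conditions.
RESULTS = {
    "Gemini 2.0 Flash": {
        "baseline": 74.2,
        "correct_after": 68.3,
        "correct_before": 69.2,
        "incorrect_after": 69.2,
        "incorrect_before": 53.3,
    },
    "OpenAI GPT-4o-mini": {
        "baseline": 53.3,
        "correct_after": 53.3,
        "correct_before": 41.7,
        "incorrect_after": 51.7,
        "incorrect_before": 33.3,
    },
}


def pp_deltas(scores):
    """Return each condition's change vs. baseline, in percentage points."""
    base = scores["baseline"]
    return {
        cond: round(acc - base, 1)
        for cond, acc in scores.items()
        if cond != "baseline"
    }


for model, scores in RESULTS.items():
    print(model, pp_deltas(scores))
```

A percentage-point delta is a plain difference of two percentages, not a relative change: 53.3% against a 74.2% baseline is -20.9pp.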

Research Question

How robust are autoregressive LLMs against misleading information, and does the position of this information affect their reasoning accuracy?

Specifically, we investigate whether models can maintain correct reasoning when exposed to incorrect hints, and whether the timing of this exposure (before vs. after questions) affects their robustness to misinformation.

Robustness Testing Framework

Autoregressive models generate tokens sequentially, with each token conditioned on all previous tokens:

P(response) = P(t₁) × P(t₂|t₁) × P(t₃|t₁,t₂) × ... × P(tₙ|t₁...tₙ₋₁)

Because every token is conditioned on everything that precedes it, this sequential structure points to three robustness vulnerabilities:

  1. Information Position Sensitivity: Models show different robustness levels depending on where misleading information appears in the prompt
  2. Reasoning Fragility: Models struggle to maintain correct reasoning paths when exposed to contradictory information
  3. Asymmetric Robustness: Models are substantially less robust to early misinformation than to late misinformation
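
The chain-rule factorization above can be sketched in a few lines. This is a toy illustration only: `toy_cond_prob` is a made-up conditional distribution over a two-token vocabulary ("a", "b"), not a real language model, and `sequence_log_prob` is a hypothetical helper name.

```python
import math


def sequence_log_prob(tokens, cond_prob):
    """Sum log P(t_i | t_1..t_{i-1}) over the sequence (chain rule)."""
    total = 0.0
    for i, tok in enumerate(tokens):
        prefix = tuple(tokens[:i])  # everything generated so far
        total += math.log(cond_prob(tok, prefix))
    return total


def toy_cond_prob(token, prefix):
    """Made-up distribution over the vocabulary {"a", "b"}."""
    if not prefix:
        return 0.5  # first token: uniform
    # Later tokens: repeating the previous token is four times as likely.
    return 0.8 if token == prefix[-1] else 0.2


# P("a", "a", "b") = 0.5 * 0.8 * 0.2
print(math.exp(sequence_log_prob(["a", "a", "b"], toy_cond_prob)))
```

Each factor depends on the full prefix, so text placed early in the context enters every later conditional, while text placed late can only affect the tokens generated after it.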

Key Findings

1. Hints Paradoxically Hurt Performance

  • Gemini: Baseline 74.2% → With correct hints 68-69%
  • OpenAI: Unaffected by correct hints placed after the question (53.3%), but drops to 41.7% when the same hints come before it
  • Implication: Models may be optimizing for coherence over correctness

2. Position Matters - Robustness Varies with Information Timing

  • Incorrect hints BEFORE: Both models drop ~20 percentage points
  • Incorrect hints AFTER: Gemini -5pp, OpenAI -1.6pp
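  • The drop from early hints is about 4x the drop from late hints for Gemini (20.9pp vs. 5.0pp) and about 12x for OpenAI (20.0pp vs. 1.6pp), which suggests that early context anchors reasoning more strongly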
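
The before/after asymmetry can be checked with a short calculation on the reported accuracies. This is a sketch; `drop_ratio` is a hypothetical helper name, and the inputs are the baseline, incorrect-hints-after, and incorrect-hints-before accuracies from the results table.

```python
def drop_ratio(baseline, after, before):
    """Ratio of the accuracy drop with hints before vs. after the question."""
    drop_after = baseline - after
    drop_before = baseline - before
    return drop_before / drop_after


# Incorrect-hint conditions, accuracies in percent.
print(round(drop_ratio(74.2, 69.2, 53.3), 1))  # Gemini 2.0 Flash: 4.2
print(round(drop_ratio(53.3, 51.7, 33.3), 1))  # OpenAI GPT-4o-mini: 12.5
```

The ratio is sensitive to small denominators: GPT-4o-mini's 1.6pp drop with late hints corresponds to only about two questions out of 120, so its ratio should be read loosely.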

3. Models Exhibit Different Failure Modes

  • Gemini: Higher baseline (74.2%) but more susceptible to any hints
  • OpenAI: Lower baseline (53.3%) but catastrophic failure with early misinformation (→33.3%)
  • Pattern: Higher-performing models may show LOWER robustness to misleading information

Experimental Design

Dataset

  • 120 questions (60 math, 60 science)
  • 3 difficulty levels: Easy, Medium, Hard
  • 5 experimental conditions per model
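
The five conditions differ only in whether a hint is present, whether it is correct, and where it sits relative to the question. The following is a minimal sketch of how such prompts could be assembled. The condition labels, the hint wording, and the `build_prompt` function are assumptions made for illustration; the actual templates used in the study are not given here.

```python
# Hypothetical labels for the five experimental conditions.
CONDITIONS = (
    "baseline",
    "correct_hint_before",
    "correct_hint_after",
    "incorrect_hint_before",
    "incorrect_hint_after",
)


def build_prompt(question, condition, correct_answer, wrong_answer):
    """Assemble one prompt; the hint wording is a placeholder assumption."""
    if condition not in CONDITIONS:
        raise ValueError(f"unknown condition: {condition}")
    if condition == "baseline":
        return question  # no hint at all
    # Pick the hinted answer according to the condition label.
    answer = correct_answer if condition.startswith("correct") else wrong_answer
    hint = f"Hint: the answer is {answer}."
    if condition.endswith("before"):
        return f"{hint}\n\n{question}"  # hint precedes the question
    return f"{question}\n\n{hint}"  # hint follows the question


for cond in CONDITIONS:
    print(f"--- {cond} ---")
    print(build_prompt("What is 7 * 8?", cond, "56", "54"))
```

Keeping the question text and hint text identical across conditions means that any accuracy difference between the "before" and "after" variants can be attributed to position alone.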

Models Tested

  • Google Gemini 2.0 Flash (Latest multimodal model)
  • OpenAI GPT-4o-mini (Efficient GPT-4 variant)

Why This Matters

Current Benchmarks Are Blind

Standard reasoning benchmarks (GSM8K, MATH, ARC) measure only final answer correctness. They do not measure:

  • Robustness to framing
  • Resistance to misleading context
  • Actual reasoning vs. pattern matching

Real-World Implications

  1. Prompt Injection Vulnerability: Early tokens in prompts have outsized influence
  2. Adversarial Robustness: Models can be derailed by strategic misinformation placement
  3. Reasoning vs. Rationalization: Models generate plausible-sounding justifications, not logical derivations
  4. Evaluation Gaps: We're not measuring what we think we're measuring

Contributions

  • Robustness Assessment: Quantifies how vulnerable models are to misleading information
  • Simple Protocol: No expensive compute required, just careful prompt manipulation
  • Benchmark Blindspot: Exposes a critical gap in current evaluation methods
  • Quantified Effect: ~20pp accuracy drop with early misinformation (roughly 4x the late-misinformation drop for Gemini and 12x for OpenAI)

Key Takeaway

Autoregressive LLMs lack robustness against misleading information. The sequential generation architecture creates fundamental vulnerabilities where models fail to maintain correct reasoning when exposed to incorrect hints, especially when that misinformation appears early. This robustness failure isn't a training issue to be fixed; it's a fundamental architectural limitation that must be understood and mitigated in deployment.

This research shows that widely used language models can be derailed by the simple act of putting misleading information in the wrong place, a vulnerability that standard reasoning benchmarks do not measure.
