
Reasoning or Rationalizing? Testing Model Robustness Against Misleading Information

Key Results at a Glance

Model	Baseline	Correct Hints After	Correct Hints Before	Incorrect Hints After	Incorrect Hints Before
Gemini 2.0 Flash	74.2%	68.3% (-5.9pp)	69.2% (-5.0pp)	69.2% (-5.0pp)	53.3% (-20.9pp)
OpenAI GPT-4o-mini	53.3%	53.3% (±0.0pp)	41.7% (-11.6pp)	51.7% (-1.6pp)	33.3% (-20.0pp)

*pp = percentage points vs baseline
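
The deltas in parentheses can be reproduced directly from the accuracy figures. The short sketch below does that arithmetic; the dictionary keys and the function name are illustrative choices, not taken from the research code.

```python
# Accuracy (%) per condition, copied from the results table.
RESULTS = {
    "Gemini 2.0 Flash": {
        "baseline": 74.2,
        "correct_after": 68.3,
        "correct_before": 69.2,
        "incorrect_after": 69.2,
        "incorrect_before": 53.3,
    },
    "OpenAI GPT-4o-mini": {
        "baseline": 53.3,
        "correct_after": 53.3,
        "correct_before": 41.7,
        "incorrect_after": 51.7,
        "incorrect_before": 33.3,
    },
}


def deltas_pp(scores):
    """Return each condition's change from baseline, in percentage points."""
    base = scores["baseline"]
    return {
        cond: round(acc - base, 1)
        for cond, acc in scores.items()
        if cond != "baseline"
    }


for model, scores in RESULTS.items():
    print(model, deltas_pp(scores))
```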

Research Question

How robust are autoregressive LLMs against misleading information, and does the position of this information affect their reasoning accuracy?

Specifically, we investigate whether models can maintain correct reasoning when exposed to incorrect hints, and whether the timing of this exposure (before vs. after questions) affects their robustness to misinformation.

Robustness Testing Framework

Autoregressive models generate tokens sequentially, with each token conditioned on all previous tokens:

P(response) = P(t₁) × P(t₂|t₁) × P(t₃|t₁,t₂) × ... × P(tₙ|t₁...tₙ₋₁)
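
To make the factorization concrete, here is a minimal sketch that scores a token sequence with the chain rule. The `cond_prob` callable and the two-token toy model are hypothetical stand-ins for a real language model; the point is only that each term depends on the entire prefix, so text placed earlier in the context enters every later term.

```python
import math
from typing import Callable, Sequence

# Hypothetical interface: cond_prob(prefix, token) -> P(token | prefix).
CondProb = Callable[[Sequence[str], str], float]


def sequence_log_prob(tokens: Sequence[str], cond_prob: CondProb) -> float:
    """Sum log P(t_i | t_1 .. t_{i-1}) over the sequence (chain rule)."""
    total = 0.0
    for i, token in enumerate(tokens):
        total += math.log(cond_prob(tokens[:i], token))
    return total


def toy_cond_prob(prefix: Sequence[str], token: str) -> float:
    """Toy model over the vocabulary {"a", "b"}: repeating the last token is likelier."""
    if not prefix:
        return 0.5
    return 0.8 if token == prefix[-1] else 0.2


# P("a", "a", "b") = P(a) * P(a | a) * P(b | a, a) = 0.5 * 0.8 * 0.2
prob = math.exp(sequence_log_prob(["a", "a", "b"], toy_cond_prob))
print(f"{prob:.4f}")
```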

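Because every generated token is conditioned on the full preceding context, misleading text in the prompt can shape everything that follows. We examine three related robustness vulnerabilities: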

Information Position Sensitivity: Models show different robustness levels based on when misleading information appears
Reasoning Fragility: Models struggle to maintain correct reasoning paths when exposed to contradictory information
Asymmetric Robustness: Models are far less robust to misinformation placed before the question than to misinformation placed after it

Key Findings

1. Hints Paradoxically Hurt Performance

Gemini: Baseline 74.2% → 68.3% (hint after) and 69.2% (hint before) with correct hints
OpenAI: Unchanged with a correct hint after the question (53.3%), but falls to 41.7% when the same correct hint comes before it
Implication: Models may be optimizing for coherence over correctness

2. Position Matters - Robustness Varies with Information Timing

Incorrect hints BEFORE: Both models drop ~20 percentage points
Incorrect hints AFTER: Gemini -5pp, OpenAI -1.6pp
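Gemini's drop is about 4x larger with early hints (20.9pp vs. 5.0pp) and OpenAI's about 12x larger (20.0pp vs. 1.6pp), which suggests that early context anchors reasoning more strongly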
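
As a check on the size of the position effect, the ratio between the two drops can be recomputed per model from the results table. The names below are illustrative only.

```python
# Drop from baseline (percentage points) under incorrect hints, from the results table.
DROPS_PP = {
    "Gemini 2.0 Flash": {"before": 20.9, "after": 5.0},
    "OpenAI GPT-4o-mini": {"before": 20.0, "after": 1.6},
}


def before_after_ratio(drops):
    """How many times larger the drop is when the incorrect hint comes first."""
    return drops["before"] / drops["after"]


for model, drops in DROPS_PP.items():
    print(f"{model}: {before_after_ratio(drops):.1f}x")
```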

3. Models Exhibit Different Failure Modes

Gemini: Higher baseline (74.2%) but loses accuracy in all four hint conditions
OpenAI: Lower baseline (53.3%) and largely unaffected by hints after the question, but catastrophic failure with early misinformation (→33.3%)
Pattern: The higher-performing model is not the more robust one here; with only two models, this is suggestive rather than conclusive

Experimental Design

Dataset

120 questions (60 math, 60 science)
3 difficulty levels: Easy, Medium, Hard
5 experimental conditions per model
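
The five conditions correspond to the columns of the results table: a baseline with no hint, plus a correct or incorrect hint placed either before or after the question. The sketch below shows one way such prompts could be assembled. The hint wording, the `Question` fields, and the function names are assumptions for illustration; the actual templates used in the study are not given here.

```python
from dataclasses import dataclass


@dataclass
class Question:
    text: str
    correct_answer: str
    wrong_answer: str  # hypothetical: a plausible distractor used for incorrect hints


def hint_text(answer: str) -> str:
    # Hypothetical hint wording, not the study's actual template.
    return f"Hint: the answer is {answer}."


def build_prompts(q: Question) -> dict:
    """Return one prompt per experimental condition for a single question."""
    good = hint_text(q.correct_answer)
    bad = hint_text(q.wrong_answer)
    return {
        "baseline": q.text,
        "correct_hint_after": f"{q.text}\n{good}",
        "correct_hint_before": f"{good}\n{q.text}",
        "incorrect_hint_after": f"{q.text}\n{bad}",
        "incorrect_hint_before": f"{bad}\n{q.text}",
    }


example = Question(text="What is 7 * 8?", correct_answer="56", wrong_answer="54")
for condition, prompt in build_prompts(example).items():
    print(f"[{condition}]\n{prompt}\n")
```

Holding the question and hint text fixed and varying only their order isolates position as the manipulated variable.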

Models Tested

Google Gemini 2.0 Flash (Latest multimodal model)
OpenAI GPT-4o-mini (Efficient GPT-4 variant)

Why This Matters

Current Benchmarks Are Blind

Standard reasoning benchmarks (GSM8K, MATH, ARC) measure only:

Final answer correctness

They do not measure:

Robustness to framing

Resistance to misleading context

Actual reasoning vs. pattern matching

Real-World Implications

Prompt Injection Vulnerability: Early tokens in prompts have outsized influence
Adversarial Robustness: Models can be derailed by strategic misinformation placement
Reasoning vs. Rationalization: Models generate plausible-sounding justifications, not logical derivations
Evaluation Gaps: We're not measuring what we think we're measuring

Contributions

Robustness Assessment

Quantifies how vulnerable models are to misleading information

Simple Protocol

No expensive compute required--just careful prompt manipulation

Benchmark Blindspot

Exposes critical gap in current evaluation methods

Quantified Effect

~20pp accuracy drop with early misinformation (roughly 4x to 12x the drop from late misinformation, depending on the model)
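
One way to put these effect sizes in context is to estimate the sampling uncertainty of an accuracy measured on 120 questions. The sketch below uses a normal-approximation 95% interval and treats each condition as an independent sample. That is a simplifying assumption: the conditions reuse the same questions, so a paired analysis would be more appropriate for a formal test.

```python
import math

N_QUESTIONS = 120  # dataset size stated in the Experimental Design section


def ci_half_width_pp(accuracy_pct: float, n: int = N_QUESTIONS, z: float = 1.96) -> float:
    """Approximate 95% half-width, in percentage points, for an accuracy on n items."""
    p = accuracy_pct / 100.0
    return 100.0 * z * math.sqrt(p * (1.0 - p) / n)


for name, acc in [("Gemini baseline", 74.2), ("GPT-4o-mini baseline", 53.3)]:
    print(f"{name}: {acc}% +/- {ci_half_width_pp(acc):.1f}pp")
```

Under these assumptions the half-widths come out at roughly 8 to 9pp, so the ~20pp drops sit well outside the noise band, while shifts of 5 to 6pp are harder to separate from sampling variation.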

Key Takeaway

Both autoregressive LLMs we tested lack robustness against misleading information. Sequential generation creates vulnerabilities where models fail to maintain correct reasoning when exposed to incorrect hints, especially when that misinformation appears early. In our view, this robustness failure isn't a training issue to be fixed; it's a fundamental architectural limitation that must be understood and mitigated in deployment.

This research reveals that our most advanced language models can be derailed by the simple act of putting misleading information in the wrong place--a vulnerability that no benchmark currently measures.
