Counterfactual Rotation: Dual-Path Attention for Robust Prediction
Extending Transformer Attention with Dual-Path Counterfactuals — Poster by Andre Kramer (andrekramermsc.substack.com).
Column 1: Motivation & Equation
Transformers perform well in-distribution but collapse on out-of-distribution (OOD) inputs.
Negations and counterfactuals expose brittle factual attention.
Idea: Rotate probability mass toward counterfactual evidence.
Key equation (architectural inspiration; a code sketch follows below):
p′(a) ∝ p(a) + sqrt(p(a)(1 − p(a))) [ α sin θ A(a) − i β cos θ N(a) ]
A(a): factual attention
N(a): counterfactual negation
Gate g = sin θ balances experts
Override β = cos θ ensures safe fusion
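A literal numerical reading of the equation can be sketched in a few lines. The NumPy function below is an assumption about how to resolve the imaginary term (take the magnitude of the complex mass, then renormalise); the poster presents the equation as architectural inspiration rather than a literal operation, and the name counterfactual_rotation is hypothetical.

```python
import numpy as np

def counterfactual_rotation(p, A, N, theta, alpha=1.0, beta=1.0):
    """Sketch of p'(a) ∝ p(a) + sqrt(p(a)(1 - p(a))) [α sinθ A(a) − i β cosθ N(a)].

    p     : baseline answer probabilities, shape (K,), sums to 1
    A     : factual attention scores over answers, shape (K,)
    N     : counterfactual (negation) attention scores, shape (K,)
    theta : rotation angle; the poster identifies the gate with sin(theta)
            and the override with cos(theta)
    The imaginary term is resolved by taking the complex magnitude before
    renormalising; this is an assumption, not something the poster specifies.
    """
    spread = np.sqrt(p * (1.0 - p))                           # sqrt(p(1 - p)) envelope
    delta = alpha * np.sin(theta) * A - 1j * beta * np.cos(theta) * N
    rotated = np.abs(p + spread * delta)                      # magnitude of complex mass
    return rotated / rotated.sum()                            # renormalise to a distribution

# Example: factual attention favours answer 0, negation evidence favours answer 2.
p = np.array([0.5, 0.3, 0.2])
A = np.array([0.8, 0.1, 0.1])
N = np.array([0.1, 0.1, 0.8])
print(counterfactual_rotation(p, A, N, theta=np.pi / 4))
```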
Column 2: Architecture & Method
From equation → network
Two experts: factual p_f and counterfactual p_c.
Gate g controls mixture.
Override β provides safety cap.
Jensen–Shannon (JS) divergence tracks expert disagreement (proxy for the “rotation angle”; sketched in code below).
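A minimal PyTorch sketch of one way these pieces could be wired together: two linear expert heads, a gated convex mixture, a β cap on how far the gate can move mass toward the counterfactual expert, and a Jensen–Shannon disagreement signal. The class name DualPathHead, the sigmoid parameterisation of g and β, and the exact fusion rule are assumptions; the actual implementation is in the linked dual_attention_heads.py.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def js_divergence(p: torch.Tensor, q: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Jensen-Shannon divergence between two categorical distributions (last dim)."""
    m = 0.5 * (p + q)
    kl_pm = (p * (torch.log(p + eps) - torch.log(m + eps))).sum(dim=-1)
    kl_qm = (q * (torch.log(q + eps) - torch.log(m + eps))).sum(dim=-1)
    return 0.5 * (kl_pm + kl_qm)

class DualPathHead(nn.Module):
    """Factual and counterfactual experts fused by a gate g and an override beta.

    Fusion rule used here (an assumption, not necessarily the repo's exact rule):
        p_fused = (1 - g * beta) * p_f + (g * beta) * p_c
    so beta caps how much probability mass the gate can rotate toward the
    counterfactual expert.
    """

    def __init__(self, d_model: int, n_classes: int):
        super().__init__()
        self.factual = nn.Linear(d_model, n_classes)         # factual expert p_f
        self.counterfactual = nn.Linear(d_model, n_classes)  # counterfactual expert p_c
        self.gate = nn.Linear(d_model, 1)                    # gate g in (0, 1)
        self.override = nn.Linear(d_model, 1)                # override beta in (0, 1)

    def forward(self, h: torch.Tensor):
        p_f = F.softmax(self.factual(h), dim=-1)
        p_c = F.softmax(self.counterfactual(h), dim=-1)
        g = torch.sigmoid(self.gate(h))            # mixture weight
        beta = torch.sigmoid(self.override(h))     # safety cap
        w = g * beta
        p_fused = (1.0 - w) * p_f + w * p_c
        js = js_divergence(p_f, p_c)               # disagreement, proxy for the rotation angle
        return p_fused, g, beta, js

# Usage: fused prediction plus interpretable gate / override / disagreement signals.
head = DualPathHead(d_model=64, n_classes=10)
p_fused, g, beta, js = head(torch.randn(4, 64))
```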
Training recipe
Counterfactual warm-up
OOD rehearsal with fused supervision
Split calibration (ID: vector scaling; OOD: temperature scaling; sketched below)
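A sketch of the split-calibration step, assuming held-out validation logits and labels for each split: a single temperature T fitted on OOD data and a per-class scale and bias (vector scaling) fitted on ID data. The function names and optimiser settings are illustrative, not the poster's exact recipe.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Temperature scaling (OOD split): fit a single scalar T by minimising NLL."""
    log_t = torch.zeros(1, requires_grad=True)      # optimise log T so T stays positive
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=100)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_t.exp().detach()

def fit_vector_scaling(logits: torch.Tensor, labels: torch.Tensor, steps: int = 500):
    """Vector scaling (ID split): fit a per-class scale w and bias b by minimising NLL."""
    n_classes = logits.shape[1]
    w = torch.ones(n_classes, requires_grad=True)
    b = torch.zeros(n_classes, requires_grad=True)
    opt = torch.optim.Adam([w, b], lr=0.01)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(logits * w + b, labels)
        loss.backward()
        opt.step()
    return w.detach(), b.detach()

# Split calibration: apply fit_vector_scaling to ID validation logits and
# fit_temperature to OOD validation logits, then rescale test-time logits accordingly.
```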
Column 3: Results
Headline improvements
ID acc: 0.23 → 0.65
OOD acc: 0.06 → 0.58
Calibration (expected calibration error, ECE): stable at ~0.12–0.13 (computation sketched below)
Entropy reduced (more confident, but still safe on OOD).
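The quoted ECE values follow the standard binned definition of expected calibration error; a generic sketch is below (an illustration, not the poster's evaluation script). The same per-bin confidences and accuracies also trace out the reliability curves listed next.

```python
import numpy as np

def expected_calibration_error(probs: np.ndarray, labels: np.ndarray, n_bins: int = 15) -> float:
    """Binned ECE: sample-weighted average |accuracy - confidence| gap per confidence bin.

    probs  : predicted class probabilities, shape (N, K)
    labels : true class indices, shape (N,)
    """
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)

    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(accuracies[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap          # weight by fraction of samples in the bin
    return float(ece)
```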
Plots (visual story)
Reliability curves (ID vs OOD) → “Confidence matches accuracy even OOD.”
Confidence & entropy histograms (separation between ID/OOD) → “Clearer separation ID vs OO.”
Gate g and β distributions (bounded, interpretable)
Scatter: β vs JS (correct vs wrong separation) → “β override correlates with correctness.”
Confusion matrix (fused predictions, errors diffuse not clustered) → “Errors diffuse, not clustered—safer failure.”
Takeaway
Dual-path architecture is an empirical approximation of the rotation equation.
Provides OOD robustness without loss of interpretability.
Negation as counterfactual — “if not X, then Y” — provides a simple, general mechanism for robustness across domains.
Possibility of separating prediction and perception into dual heads, grounding one in values or training signals, making safety a structural feature of the system rather than an afterthought.
Small architectural tweak → large gains.
Crux: “Our dual-path design lets models hedge between factual and counterfactual evidence. This rotation improves out-of-distribution robustness without losing interpretability. Small tweak, big gains.”
Footer
Code on GitHub: https://github.com/andrekramer/chevron/blob/main/dual_attention_heads.py
Co-authored with ChatGPT-5.
What kinds of robustness, reliability, and safety benefits might it confer over standard ML approaches?
Seeking collaborators to explore extensions, theoretical grounding, benchmarks, interpretability, and safety.












