Activation Patching and Causal Mediation Put LLM Explanations on Trial
Inside LIBERTy’s representation-level probes that separate mechanisms from rationalizations
Plausible explanations can be wrong—and in language models, they often are. Years of interpretability work warn that eye-catching highlights and coherent rationales may not reflect what actually caused a model’s answer, especially when attention maps are treated as explanations without interventions [1,24,25,30]. LIBERTy, a 2026-ready evaluation framework, tackles this head-on by elevating representation-level causal tests—activation and path patching, targeted ablation/editing, and causal mediation/abstraction—from optional diagnostics to first-class evidence. The bet is simple: change the internal causes and watch the output; if it moves as predicted, the explanation earns credibility.
This article drills into LIBERTy’s “white-box when possible” methodology: how hypotheses about attention heads, MLP features, and circuits are turned into experiments; how counterfactual activations are substituted to verify influence; how targeted ablations falsify spurious pathways; how causal mediation quantifies direct and indirect effects; and how sparse feature dictionaries enable semantic, not just token-level, intervention. We also cover the scoring and controls that make these tests comparable across model families. Readers will learn how LIBERTy converts plausibility into testable causal claims, what to measure, how to run robust internal experiments, and where the failure modes lurk.
Architecture/Implementation Details
Why plausibility isn’t faithfulness—and why interventions are the missing proof
Saliency maps, textual rationales, and even chain-of-thought often look persuasive. But without interventional tests, they remain correlational and vulnerable to confounds [1,30]. Attention, in particular, is a hypothesis generator, not a verdict: tests must manipulate the purported mediators and verify the predicted effect on outputs [24,25]. LIBERTy therefore treats representation-level interventions as the gold standard for confirming (or falsifying) explanatory claims, and it triangulates them with counterfactual inputs and environment-shift robustness to avoid being fooled by off-manifold artifacts or spurious cues [5,9,14].
From hypotheses to tests: locating candidate mediators
LIBERTy operationalizes a pipeline from explanation to experiment:
- Hypothesize mediators. Candidate loci include attention heads, MLP neurons/features, and circuits implicated by attribution or mechanistic analyses [24,25].
- Ground hypotheses in structure. Causal abstraction provides a formal language for proposing pathway structures to test. Tracr’s compiled transformers offer a controlled lab where known circuits can be probed end-to-end.
- Select counterfactual pairs. Use minimal semantic edits or contrast sets to isolate a single causal factor at the input level [12,47].
- Design internal interventions. Choose activation/path patching or ablation/editing at the suspected mediators; align interventions to semantic units where possible (see SAEs below) [27,41,42]. A minimal sketch of how such a hypothesis can be recorded follows this list.
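As a concrete starting point, the sketch below shows one way to record a mediator hypothesis together with its counterfactual pair before any intervention is run. The class and field names are our own illustration, not LIBERTy's schema.

```python
from dataclasses import dataclass, field
from typing import Literal

# Illustrative record types; field names are assumptions, not LIBERTy's schema.

@dataclass
class CounterfactualPair:
    original: str          # e.g., "The actor was not happy with the script."
    counterfactual: str    # minimal edit flipping a single factor
    edited_factor: str     # the semantic factor changed, e.g., "negation"

@dataclass
class MediatorHypothesis:
    claim: str                                   # plain-language explanation under test
    site: Literal["head", "mlp", "sae_feature"]  # kind of internal locus
    layer: int
    index: int                                   # head index, neuron index, or feature id
    predicted_direction: Literal["toward_counterfactual", "no_change"]
    pairs: list[CounterfactualPair] = field(default_factory=list)

# Example: "sensitivity to negation is carried by attention head 7 in layer 9"
hypothesis = MediatorHypothesis(
    claim="Negation sensitivity is mediated by L9.H7",
    site="head", layer=9, index=7,
    predicted_direction="toward_counterfactual",
    pairs=[CounterfactualPair(
        original="The actor was not happy with the script.",
        counterfactual="The actor was happy with the script.",
        edited_factor="negation",
    )],
)
```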
Activation and path patching: counterfactual substitution to verify influence
Activation patching substitutes internal activations from a counterfactual example into a target example at hypothesized mediators (e.g., specific layers, heads, or features). If the explanation correctly named the mediator, the model’s output should shift toward the counterfactual outcome [27,42]. LIBERTy records the direction and magnitude of these changes and aggregates them into average causal effects (ACE) of the patched sites, attributing causal weight to the implicated pathways. To guard against distribution shift and leakage, LIBERTy pairs patching with on-manifold counterfactuals and insertion tests that complement deletions.
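To make the mechanics concrete, here is a minimal activation-patching sketch, assuming a GPT-2-style Hugging Face model whose blocks are reachable as `model.transformer.h[i]`; the layer, token position, and example texts are illustrative choices, not mediators identified by LIBERTy.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tok = AutoTokenizer.from_pretrained("gpt2")

original_text = "The movie was not good. The review is"
counterfactual_text = "The movie was good. The review is"
LAYER, POS = 9, -1          # hypothesized mediator: block 9, final token position
cache = {}

def save_hook(module, inputs, output):
    # Block outputs are tuples (hidden_states, ...); cache the hidden states.
    hs = output[0] if isinstance(output, tuple) else output
    cache["h"] = hs.detach().clone()

def patch_hook(module, inputs, output):
    hs = output[0] if isinstance(output, tuple) else output
    hs = hs.clone()
    hs[:, POS, :] = cache["h"][:, POS, :]      # substitute the counterfactual activation
    return (hs,) + output[1:] if isinstance(output, tuple) else hs

block = model.transformer.h[LAYER]

# 1) Run the counterfactual input and cache the mediator's activation.
handle = block.register_forward_hook(save_hook)
with torch.no_grad():
    model(**tok(counterfactual_text, return_tensors="pt"))
handle.remove()

# 2) Re-run the original input with the cached activation patched in.
handle = block.register_forward_hook(patch_hook)
with torch.no_grad():
    patched_logits = model(**tok(original_text, return_tensors="pt")).logits
handle.remove()

# Comparing patched_logits at the answer position against an unpatched run of the
# original input gives the per-item effect; averaging over pairs estimates the ACE.
```

If the hypothesized mediator really carries the negation signal, the patched run should shift toward the counterfactual completion.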
Path patching extends this idea to multi-hop routes—testing whether a chain of components jointly carries influence. By patching along a hypothesized path, evaluators can contrast single-node and multi-node ACEs to estimate whether interactions are necessary for the observed behavior, a key step toward pathway-level attribution rather than isolated hotspots.
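One hedged way to formalize this contrast (our notation, not necessarily LIBERTy's exact scoring): for a set of mediator sites $S$, a distribution of counterfactual pairs $(x, x')$, and an output metric $f$ such as the answer logit,

$$\mathrm{ACE}(S) = \mathbb{E}_{(x,\,x')}\Big[\, f\big(x;\ \mathrm{do}(h_S \leftarrow h_S(x'))\big) - f(x) \,\Big],$$

where $h_S(x')$ denotes the activations at sites $S$ under the counterfactual input. A simple interaction check then compares the joint patch to the sum of single-site patches,

$$\Delta_{\mathrm{int}} = \mathrm{ACE}(\{m_1, m_2\}) - \mathrm{ACE}(\{m_1\}) - \mathrm{ACE}(\{m_2\}),$$

with $\Delta_{\mathrm{int}}$ far from zero indicating that the pathway's effect is not explained by its nodes in isolation.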
Targeted ablation and editing: falsifying spurious pathways and confirming necessity
Where patching asks “is this mediator sufficient to carry the counterfactual?”, ablation asks “is it necessary?” LIBERTy deploys targeted removal or editing of activations at the suspected sites and measures the resulting output degradation. This complements input-level erasure and deletion–insertion curves and directly challenges explanations that over-index on visually salient but causally inert components [9,10]. To reduce the confound that models can reweight remaining features after removal, LIBERTy integrates ROAR-style remove-and-retrain evidence, strengthening necessity claims when performance drops persist even after retraining. Editing methods that localize factual associations further allow precise tests of whether the cited memory trace actually drives the answer.
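A minimal necessity-style check, under the same GPT-2-style assumptions as the patching sketch above (layer, neuron indices, and probe token are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tok = AutoTokenizer.from_pretrained("gpt2")

LAYER, NEURONS = 9, [42, 137]   # hypothesized MLP neurons to ablate

def zero_neurons(module, inputs, output):
    # Zero the pre-activations of the selected neurons; GELU(0) = 0, so they
    # contribute nothing (beyond the output bias) downstream.
    out = output.clone()
    out[..., NEURONS] = 0.0
    return out

inputs = tok("The movie was not good. The review is", return_tensors="pt")
with torch.no_grad():
    base = model(**inputs).logits[0, -1]

handle = model.transformer.h[LAYER].mlp.c_fc.register_forward_hook(zero_neurons)
with torch.no_grad():
    ablated = model(**inputs).logits[0, -1]
handle.remove()

# Necessity evidence: how much does the hypothesized answer's logit drop?
token_id = tok(" negative", add_special_tokens=False).input_ids[0]
print("logit drop:", (base[token_id] - ablated[token_id]).item())
```

A drop that persists after ROAR-style retraining is what upgrades this from a removal artifact to a necessity claim.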
Causal mediation and abstraction: estimating direct/indirect effects and testing structure
Beyond point interventions, LIBERTy estimates direct and indirect effects through mediation analysis aligned with causal abstraction hypotheses. Concretely, experimenters specify a structural mapping from input factors to internal mediators and outputs, then combine patching and ablation to estimate how much of an output change flows through the named pathway versus alternative routes. This moves the evaluation from “what lights up?” to “what fraction of the effect does this pathway explain?”, enabling principled reporting of variance explained by identified mediators.
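In patching terms, a standard decomposition from causal mediation analysis can be written as follows (the exact estimator LIBERTy uses is not specified in the source). Let $y(x, h)$ be the output when the input is $x$ and the mediator is forced to value $h$ via patching, and let $h(x)$ be the mediator's natural value on input $x$. For a counterfactual pair $(x, x')$:

$$\mathrm{TE} = y\big(x', h(x')\big) - y\big(x, h(x)\big) \quad \text{(total effect of the input edit)}$$

$$\mathrm{IE} = y\big(x, h(x')\big) - y\big(x, h(x)\big) \quad \text{(indirect effect: only the mediator changes)}$$

$$\mathrm{DE} = y\big(x', h(x)\big) - y\big(x, h(x)\big) \quad \text{(direct effect: input changes, mediator held fixed)}$$

The share $\mathrm{IE}/\mathrm{TE}$ (when $\mathrm{TE} \neq 0$) is one way to report how much of the effect the named pathway explains.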
Feature-level alignment with sparse feature dictionaries
Token- or neuron-level manipulations can be coarse. Recent mechanistic interpretability advances use sparse autoencoders to disentangle interpretable features in LLM activations, yielding feature dictionaries that align with semantic factors. LIBERTy leverages these to patch or ablate at the level of a putative concept (e.g., negation, quantifier) rather than a raw token position, reducing concept conflation and sharpening causal tests. When a feature-level patch flips the output in the predicted direction, the explanation earns stronger credit for causal specificity.
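A hedged sketch of what a feature-level patch can look like; `TinySAE` stands in for a trained feature dictionary, and `FEATURE_ID`/`feature_patch` are illustrative names rather than any published SAE library's API:

```python
import torch
import torch.nn as nn

class TinySAE(nn.Module):
    """Stand-in for a trained sparse autoencoder over one layer's activations."""
    def __init__(self, d_model=768, d_dict=8192):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def encode(self, h):               # activations -> sparse feature codes
        return torch.relu(self.enc(h))

    def decode(self, f):               # feature codes -> reconstructed activations
        return self.dec(f)

sae = TinySAE()
FEATURE_ID = 1234                      # hypothesized concept feature (e.g., negation)

def feature_patch(h_orig, h_cf):
    """Swap one concept-level feature from the counterfactual into the original.

    h_orig and h_cf are activations at aligned positions (e.g., the final token)
    from the original and counterfactual runs.
    """
    f_orig, f_cf = sae.encode(h_orig), sae.encode(h_cf)
    f_patched = f_orig.clone()
    f_patched[..., FEATURE_ID] = f_cf[..., FEATURE_ID]
    # Add back the SAE's reconstruction error so only the named feature changes.
    recon_err = h_orig - sae.decode(f_orig)
    return sae.decode(f_patched) + recon_err

# In practice, feature_patch would be applied inside a forward hook at the target
# layer, exactly like the activation-patching hook earlier, before resuming the pass.
```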
Designing robust internal experiments: black-box vs. white-box, repeatability, variance controls
Representation-level protocols require internal access, which not every deployment grants; LIBERTy therefore supports both settings:
- White-box settings: full activation instrumentation enables layer/head/feature targeting and causal mediation estimates [27,37,41,42].
- Black-box settings: input-level counterfactual edits, deletion–insertion, and environment-shift stress tests provide complementary constraints; internal claims are qualified accordingly [5,9,14].
To ensure repeatability, LIBERTy follows HELM-style transparency and variance controls: fixed prompt templates, standardized decoding grids, multi-seed trials, bootstrap confidence intervals, and mixed-effects models for inference, with preregistered hypotheses and power analyses [32,36]. When stochastic decoding is necessary, variance is explicitly modeled and reported.
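For example, a per-item effect estimate can be wrapped in a bootstrap confidence interval along these lines; the effect values and the `bootstrap_ci` helper are placeholders, not LIBERTy code:

```python
import numpy as np

rng = np.random.default_rng(0)
effects = rng.normal(loc=0.8, scale=0.5, size=200)   # stand-in per-item causal effects

def bootstrap_ci(values, n_boot=10_000, alpha=0.05, rng=rng):
    n = len(values)
    means = np.array([
        values[rng.integers(0, n, size=n)].mean()    # resample items with replacement
        for _ in range(n_boot)
    ])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return values.mean(), lo, hi

ace, lo, hi = bootstrap_ci(effects)
print(f"ACE = {ace:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Seeds and decoding runs can be added as a second resampling level, with mixed-effects models handling task- and model-level grouping.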
Quantifying effects and reporting
LIBERTy’s mediation and pathway scoring reports:
- Average causal effect (ACE) of patched/ablated mediators on outputs (faithfulness-aligned scale)
- Proportion of variance explained by identified pathways
- Pathway attribution: single-node vs. multi-node contributions under path patching
- Uncertainty bands via bootstrap over items and seeds, with multiple comparisons controlled (e.g., BH-FDR)

Each metric is normalized to 0–100 for comparability and macro-averaged with confidence intervals; sensitivity analyses probe robustness to prompt and decoding choices.
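As an illustration of these last two reporting steps, here is a minimal sketch using statsmodels for BH-FDR; the raw scores, p-values, and min-max rescaling are placeholders, and LIBERTy's exact normalization is not specified in the source:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

raw = np.array([0.12, 0.47, 0.83, 0.05])                    # e.g., per-task ACE on a raw scale
norm = 100 * (raw - raw.min()) / (raw.max() - raw.min())    # one possible 0-100 rescaling

pvals = np.array([0.001, 0.03, 0.04, 0.20])                 # one p-value per probed mediator
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(norm.round(1), reject, p_adj.round(3))
```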
Model family comparability
LIBERTy applies identical intervention protocols across closed and open model families where interfaces allow—GPT‑4‑class successors, Claude, Gemini, and leading open models (Llama, Mixtral, Gemma, Qwen, DeepSeek, Grok) [49–57]. When white-box access is unavailable, LIBERTy falls back to input-level and environment-shift tests and reports mediation claims only where representation-level evidence exists, preserving apples-to-apples comparisons across systems [14,32].
Failure modes and safeguards
Representation-level tests are not immune to pitfalls. LIBERTy defends against common threats by:
- Combining deletion and insertion to avoid off-manifold artifacts
- Using ROAR to counter model adaptivity after erasure
- Running sanity checks to detect uninformative attributions
- Stress-testing under environment shifts to expose spurious pathways
- Treating attention visualizations as hypotheses to be falsified via interventions, not as causal proof [24,25]
- Preferring feature-level manipulations via SAEs to reduce concept confounds
🔬 The guiding principle: intervene on the mechanism you claim, predict the direction of change, and quantify the effect with uncertainty.
Comparison Tables
Internal intervention techniques at a glance
| Technique | What it tests | Inputs needed | Granularity | Strengths | Key risks/mitigations |
|---|---|---|---|---|---|
| Activation patching | Sufficiency of hypothesized mediator/pathway via counterfactual substitution | Counterfactual example; access to activations | Layer/head/feature | Directly measures causal influence; supports path-level tests | Off-manifold risk mitigated by on-manifold edits and insertion tests [5,27,42] |
| Targeted ablation/editing | Necessity of mediator; falsifies spurious routes | Access to activations/parameters | Neuron/feature/circuit | Tests flip/drop under removal; ROAR strengthens causality | Model reweighting; address via remove-and-retrain (ROAR) [4,10,27] |
| Causal mediation/abstraction | Direct/indirect effects; structural hypotheses | Structural mapping + interventions | Pathway-level | Quantifies variance explained; tests multi-hop chains | Mis-specified structure; validate with Tracr or process supervision [37,40] |
| SAE feature-level patching | Semantic-unit interventions (concept-aligned) | Sparse feature dictionary | Concept-level feature | Reduces concept confounds; sharper causal claims | Feature misalignment; requires validated dictionaries |
Best Practices
- Pre-register mediator hypotheses, counterfactual pairs, intervention sites, metrics, and power targets; publish code, logs, and seeds for HELM-style transparency [32,36].
- Pair deletion with insertion and use human-validated counterfactual edits to minimize off-manifold artifacts.
- Use ROAR-style retraining when claiming necessity from removal; report with and without retraining.
- Prefer feature-level interventions via SAEs when available; otherwise, localize to minimal layers/heads to reduce spread.
- Estimate ACE with bootstrap confidence intervals; use mixed-effects models for cross-task, cross-model inference; control multiplicity (e.g., BH-FDR).
- In black-box settings, qualify causal claims and triangulate with counterfactual robustness and environment-shift tests (e.g., WILDS-style splits).
- Treat attention maps and TCAV-like concept links as hypotheses; insist on interventional confirmation before causal claims [24,25,28,29].
Practical Examples
While specific implementation details are not publicly available beyond the framework description, LIBERTy outlines several canonical internal experiments and where they apply:
- Counterfactual NLI mediator test: Construct minimal pairs that change a single semantic factor (e.g., negation). Hypothesize that a specific feature or head mediates sensitivity to that factor. Substitute activations from the counterfactual example at that mediator (activation patching) and measure whether the output flips or shifts as predicted; report ACE with uncertainty. Pair with insertion tests to check that adding the factor into a neutral context produces a corresponding change, mitigating deletion artifacts.
- Chain-of-thought (CoT) necessity checks: On process-supervised math/logic tasks (GSM8K, MATH), identify the token positions and layers associated with a particular reasoning step [20,21,22,38]. Perform targeted ablation at those internal states; if the step is necessary, downstream intermediate states or final answers should degrade. Where feasible, patch in the correct step’s activations to test sufficiency. Report step-level accuracy, infidelity, and effect sizes of ablations.
- Pathway-level mediation in compiled transformers: Use Tracr to obtain a transformer with known circuits for an algorithmic task. Specify a causal abstraction that maps input factors to internal subcircuits and outputs. Run path patching along the hypothesized chain and estimate direct/indirect effects; compare to single-node patches to assess interactions. This provides a groundable reference for pathway attribution and validates the mediation protocol end-to-end.
- Feature-level patching with sparse autoencoders: Train or adopt a sparse feature dictionary that disentangles interpretable features in LLM activations. For a target concept (e.g., quantifiers), patch the corresponding feature from a counterfactual example into the original context. If the explanation is concept-causal, the output should change in the predicted direction; ablate the feature to test necessity. Report concept-level ACE and discuss alignment quality.
- Black-box comparability fallback: For closed models where internal access is unavailable, run the same counterfactual and environment-shift tests and report deletion–insertion AUC, counterfactual flip rates, and attribution stability. Reserve mediation scores for models where activation/feature interventions were possible, and clearly distinguish evidence tiers in the LIBERTy report [14,32] (a sketch of these fallback metrics follows this list).
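A minimal sketch of the two fallback metrics named in the last item; the prediction arrays and deletion scores are placeholders that would in practice come from querying the closed model:

```python
import numpy as np

# Counterfactual flip rate: how often does a minimal input edit change the prediction?
preds_original = np.array([1, 0, 1, 1, 0])
preds_counterfactual = np.array([0, 0, 0, 1, 1])
flip_rate = float(np.mean(preds_original != preds_counterfactual))

# Deletion curve: model score for the original answer after removing the k
# most-attributed tokens, at evenly spaced removal fractions from 0 to 1.
deletion_scores = np.array([0.92, 0.80, 0.55, 0.31, 0.20, 0.12])
adjacent_means = (deletion_scores[:-1] + deletion_scores[1:]) / 2
deletion_auc = float(adjacent_means.mean())   # trapezoidal AUC over the unit interval

print(f"flip rate = {flip_rate:.2f}, deletion AUC = {deletion_auc:.3f}")
```

Lower deletion AUC indicates that removing the tokens an attribution names really does remove the evidence the model relies on.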
In all cases, LIBERTy emphasizes seeds, decoding grids, and bootstrap CIs; mixed-effects modeling accommodates variability across tasks and models, and multiplicity control prevents over-claiming from multiple probes. Specific metric values beyond these protocols are not provided in the report.
Conclusion
LIBERTy’s central claim is that explanation faithfulness must be earned through interventions, not aesthetics. By turning attention heads, MLP features, and circuits into manipulable hypotheses—then validating them with activation/path patching, ablation/editing, and mediation—LIBERTy replaces plausibility with causal evidence. Feature-level alignment via sparse autoencoders lifts interventions to semantic units, while rigorous variance controls and HELM-style transparency keep comparisons honest across open and closed model families. The result is a framework that can say, with statistical backing, which explanations reflect internal mechanisms and which are merely rationalizations.
Key takeaways:
- Plausibility ≠ faithfulness; attention and saliency are hypotheses until interventional tests confirm them [1,24,25,30].
- Activation/path patching and targeted ablation/editing provide complementary tests of sufficiency and necessity at the representation level [4,27,42].
- Causal mediation/abstraction quantifies direct/indirect effects and variance explained by pathways.
- Sparse autoencoders enable concept-aligned, feature-level interventions that reduce confounds.
- Robust reporting requires HELM-style transparency, multi-seed variance modeling, and principled inference [32,36].
Next steps for practitioners:
- Pre-register mediator hypotheses and protocols; implement patch/ablate experiments with uncertainty reporting.
- Build minimal, human-validated counterfactual datasets tailored to the factors your explanations cite [5,12].
- Invest in feature dictionaries (SAEs) to align interventions to semantic units.
- Where internals are inaccessible, use counterfactual and environment-shift tests and clearly qualify causal claims.
Looking forward, broader adoption of representation-level interventions—paired with standardized reporting—should sharpen the field’s understanding of how modern LLMs actually compute, and which explanations we can trust.