Activation Patching and Causal Mediation Put LLM Explanations on Trial
Inside LIBERTy’s representation-level probes that separate mechanisms from rationalizations
Plausible explanations can be wrong—and in language models, they often are. Years of interpretability work warn that eye-catching highlights and coherent rationales may not reflect what actually caused a model’s answer, especially when attention maps are treated as explanations without interventions [1,24,25,30]. LIBERTy, a 2026-ready evaluation framework, tackles this head-on by elevating representation-level causal tests—activation and path patching, targeted ablation/editing, and causal mediation/abstraction—from optional diagnostics to first-class evidence. The bet is simple: change the internal causes and watch the output; if it moves as predicted, the explanation earns credibility.
This article drills into LIBERTy’s “white-box when possible” methodology: how hypotheses about attention heads, MLP features, and circuits are turned into experiments; how counterfactual activations are substituted to verify influence; how targeted ablations falsify spurious pathways; how causal mediation quantifies direct and indirect effects; and how sparse feature dictionaries enable semantic, not just token-level, intervention. We also cover the scoring and controls that make these tests comparable across model families. Readers will learn how LIBERTy converts plausibility into testable causal claims, what to measure, how to run robust internal experiments, and where the failure modes lurk.
Architecture/Implementation Details
Why plausibility isn’t faithfulness—and why interventions are the missing proof
Saliency maps, textual rationales, and even chain-of-thought often look persuasive. But without interventional tests, they remain correlational and vulnerable to confounds [1,30]. Attention, in particular, is a hypothesis generator, not a verdict: tests must manipulate the purported mediators and verify the predicted effect on outputs [24,25]. LIBERTy therefore treats representation-level interventions as the gold standard for confirming (or falsifying) explanatory claims, and it triangulates them with counterfactual inputs and environment-shift robustness to avoid being fooled by off-manifold artifacts or spurious cues [5,9,14].
From hypotheses to tests: locating candidate mediators
LIBERTy operationalizes a pipeline from explanation to experiment:
- Hypothesize mediators. Candidate loci include attention heads, MLP neurons/features, and circuits implicated by attribution or mechanistic analyses [24,25].
- Ground hypotheses in structure. Causal abstraction provides a formal language for proposing pathway structures to test. Tracr’s compiled transformers offer a controlled lab where known circuits can be probed end-to-end.
- Select counterfactual pairs. Use minimal semantic edits or contrast sets to isolate a single causal factor at the input level [12,47].
- Design internal interventions. Choose activation/path patching or ablation/editing at the suspected mediators; align interventions to semantic units where possible (see SAEs below) [27,41,42]. A minimal sketch of how such a hypothesis can be recorded follows this list.
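As a concrete starting point, the sketch below shows one way to record a mediator hypothesis together with its counterfactual pair before any intervention is run. The class and field names are our own illustration, not LIBERTy's schema.

```python
from dataclasses import dataclass, field
from typing import Literal

# Illustrative record types; field names are assumptions, not LIBERTy's schema.

@dataclass
class CounterfactualPair:
    original: str          # e.g., "The actor was not happy with the script."
    counterfactual: str    # minimal edit flipping a single factor
    edited_factor: str     # the semantic factor changed, e.g., "negation"

@dataclass
class MediatorHypothesis:
    claim: str                                   # plain-language explanation under test
    site: Literal["head", "mlp", "sae_feature"]  # kind of internal locus
    layer: int
    index: int                                   # head index, neuron index, or feature id
    predicted_direction: Literal["toward_counterfactual", "no_change"]
    pairs: list[CounterfactualPair] = field(default_factory=list)

# Example: "sensitivity to negation is carried by attention head 7 in layer 9"
hypothesis = MediatorHypothesis(
    claim="Negation sensitivity is mediated by L9.H7",
    site="head", layer=9, index=7,
    predicted_direction="toward_counterfactual",
    pairs=[CounterfactualPair(
        original="The actor was not happy with the script.",
        counterfactual="The actor was happy with the script.",
        edited_factor="negation",
    )],
)
```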
Activation and path patching: counterfactual substitution to verify influence
Activation patching substitutes internal activations from a counterfactual example into a target example at hypothesized mediators (e.g., specific layers, heads, or features). If the explanation correctly named the mediator, the model’s output should shift toward the counterfactual outcome [27,42]. LIBERTy records the direction and magnitude of these changes and aggregates them into average causal effects (ACE) of the patched sites, attributing causal weight to the implicated pathways. To guard against distribution shift and leakage, LIBERTy pairs patching with on-manifold counterfactuals and insertion tests that complement deletions.
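To make the mechanics concrete, here is a minimal activation-patching sketch, assuming a GPT-2-style Hugging Face model whose blocks are reachable as `model.transformer.h[i]`; the layer, token position, and example texts are illustrative choices, not mediators identified by LIBERTy.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tok = AutoTokenizer.from_pretrained("gpt2")

original_text = "The movie was not good. The review is"
counterfactual_text = "The movie was good. The review is"
LAYER, POS = 9, -1          # hypothesized mediator: block 9, final token position
cache = {}

def save_hook(module, inputs, output):
    # Block outputs are tuples (hidden_states, ...); cache the hidden states.
    hs = output[0] if isinstance(output, tuple) else output
    cache["h"] = hs.detach().clone()

def patch_hook(module, inputs, output):
    hs = output[0] if isinstance(output, tuple) else output
    hs = hs.clone()
    hs[:, POS, :] = cache["h"][:, POS, :]      # substitute the counterfactual activation
    return (hs,) + output[1:] if isinstance(output, tuple) else hs

block = model.transformer.h[LAYER]

# 1) Run the counterfactual input and cache the mediator's activation.
handle = block.register_forward_hook(save_hook)
with torch.no_grad():
    model(**tok(counterfactual_text, return_tensors="pt"))
handle.remove()

# 2) Re-run the original input with the cached activation patched in.
handle = block.register_forward_hook(patch_hook)
with torch.no_grad():
    patched_logits = model(**tok(original_text, return_tensors="pt")).logits
handle.remove()

# Comparing patched_logits at the answer position against an unpatched run of the
# original input gives the per-item effect; averaging over pairs estimates the ACE.
```

If the hypothesized mediator really carries the negation signal, the patched run should shift toward the counterfactual completion.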
Path patching extends this idea to multi-hop routes—testing whether a chain of components jointly carries influence. By patching along a hypothesized path, evaluators can contrast single-node and multi-node ACEs to estimate whether interactions are necessary for the observed behavior, a key step toward pathway-level attribution rather than isolated hotspots.
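One hedged way to formalize this contrast (our notation, not necessarily LIBERTy's exact scoring): for a set of mediator sites $S$, a distribution of counterfactual pairs $(x, x')$, and an output metric $f$ such as the answer logit,

$$\mathrm{ACE}(S) = \mathbb{E}_{(x,\,x')}\Big[\, f\big(x;\ \mathrm{do}(h_S \leftarrow h_S(x'))\big) - f(x) \,\Big],$$

where $h_S(x')$ denotes the activations at sites $S$ under the counterfactual input. A simple interaction check then compares the joint patch to the sum of single-site patches,

$$\Delta_{\mathrm{int}} = \mathrm{ACE}(\{m_1, m_2\}) - \mathrm{ACE}(\{m_1\}) - \mathrm{ACE}(\{m_2\}),$$

with $\Delta_{\mathrm{int}}$ far from zero indicating that the pathway's effect is not explained by its nodes in isolation.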
Targeted ablation and editing: falsifying spurious pathways and confirming necessity
Where patching asks “is this mediator sufficient to carry the counterfactual?”, ablation asks “is it necessary?” LIBERTy deploys targeted removal or editing of activations at the suspected sites and measures the resulting output degradation. This complements input-level erasure and deletion–insertion curves and directly challenges explanations that over-index on visually salient but causally inert components [9,10]. To reduce the confound that models can reweight remaining features after removal, LIBERTy integrates ROAR-style remove-and-retrain evidence, strengthening necessity claims when performance drops persist even after retraining. Editing methods that localize factual associations further allow precise tests of whether the cited memory trace actually drives the answer.
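A minimal necessity-style check, under the same GPT-2-style assumptions as the patching sketch above (layer, neuron indices, and probe token are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tok = AutoTokenizer.from_pretrained("gpt2")

LAYER, NEURONS = 9, [42, 137]   # hypothesized MLP neurons to ablate

def zero_neurons(module, inputs, output):
    # Zero the pre-activations of the selected neurons; GELU(0) = 0, so they
    # contribute nothing (beyond the output bias) downstream.
    out = output.clone()
    out[..., NEURONS] = 0.0
    return out

inputs = tok("The movie was not good. The review is", return_tensors="pt")
with torch.no_grad():
    base = model(**inputs).logits[0, -1]

handle = model.transformer.h[LAYER].mlp.c_fc.register_forward_hook(zero_neurons)
with torch.no_grad():
    ablated = model(**inputs).logits[0, -1]
handle.remove()

# Necessity evidence: how much does the hypothesized answer's logit drop?
token_id = tok(" negative", add_special_tokens=False).input_ids[0]
print("logit drop:", (base[token_id] - ablated[token_id]).item())
```

A drop that persists after ROAR-style retraining is what upgrades this from a removal artifact to a necessity claim.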
Causal mediation and abstraction: estimating direct/indirect effects and testing structure
Beyond point interventions, LIBERTy estimates direct and indirect effects through mediation analysis aligned with causal abstraction hypotheses. Concretely, experimenters specify a structural mapping from input factors to internal mediators and outputs, then combine patching and ablation to estimate how much of an output change flows through the named pathway versus alternative routes. This moves the evaluation from “what lights up?” to “what fraction of the effect does this pathway explain?”, enabling principled reporting of variance explained by identified mediators.
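In patching terms, a standard decomposition from causal mediation analysis can be written as follows (the exact estimator LIBERTy uses is not specified in the source). Let $y(x, h)$ be the output when the input is $x$ and the mediator is forced to value $h$ via patching, and let $h(x)$ be the mediator's natural value on input $x$. For a counterfactual pair $(x, x')$:

$$\mathrm{TE} = y\big(x', h(x')\big) - y\big(x, h(x)\big) \quad \text{(total effect of the input edit)}$$

$$\mathrm{IE} = y\big(x, h(x')\big) - y\big(x, h(x)\big) \quad \text{(indirect effect: only the mediator changes)}$$

$$\mathrm{DE} = y\big(x', h(x)\big) - y\big(x, h(x)\big) \quad \text{(direct effect: input changes, mediator held fixed)}$$

The share $\mathrm{IE}/\mathrm{TE}$ (when $\mathrm{TE} \neq 0$) is one way to report how much of the effect the named pathway explains.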
Feature-level alignment with sparse feature dictionaries
Token- or neuron-level manipulations can be coarse. Recent mechanistic interpretability advances use sparse autoencoders to disentangle interpretable features in LLM activations, yielding feature dictionaries that align with semantic factors. LIBERTy leverages these to patch or ablate at the level of a putative concept (e.g., negation, quantifier) rather than a raw token position, reducing concept conflation and sharpening causal tests. When a feature-level patch flips the output in the predicted direction, the explanation earns stronger credit for causal specificity.
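A hedged sketch of what a feature-level patch can look like; `TinySAE` stands in for a trained feature dictionary, and `FEATURE_ID`/`feature_patch` are illustrative names rather than any published SAE library's API:

```python
import torch
import torch.nn as nn

class TinySAE(nn.Module):
    """Stand-in for a trained sparse autoencoder over one layer's activations."""
    def __init__(self, d_model=768, d_dict=8192):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def encode(self, h):               # activations -> sparse feature codes
        return torch.relu(self.enc(h))

    def decode(self, f):               # feature codes -> reconstructed activations
        return self.dec(f)

sae = TinySAE()
FEATURE_ID = 1234                      # hypothesized concept feature (e.g., negation)

def feature_patch(h_orig, h_cf):
    """Swap one concept-level feature from the counterfactual into the original.

    h_orig and h_cf are activations at aligned positions (e.g., the final token)
    from the original and counterfactual runs.
    """
    f_orig, f_cf = sae.encode(h_orig), sae.encode(h_cf)
    f_patched = f_orig.clone()
    f_patched[..., FEATURE_ID] = f_cf[..., FEATURE_ID]
    # Add back the SAE's reconstruction error so only the named feature changes.
    recon_err = h_orig - sae.decode(f_orig)
    return sae.decode(f_patched) + recon_err

# In practice, feature_patch would be applied inside a forward hook at the target
# layer, exactly like the activation-patching hook earlier, before resuming the pass.
```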
Designing robust internal experiments: black-box vs. white-box, repeatability, variance controls
Representation-level protocols require internal access, which not every deployment grants; LIBERTy therefore supports both settings:
- White-box settings: full activation instrumentation enables layer/head/feature targeting and causal mediation estimates [27,37,41,42].
- Black-box settings: input-level counterfactual edits, deletion–insertion, and environment-shift stress tests provide complementary constraints; internal claims are qualified accordingly [5,9,14].
To ensure repeatability, LIBERTy follows HELM-style transparency and variance controls: fixed prompt templates, standardized decoding grids, multi-seed trials, bootstrap confidence intervals, and mixed-effects models for inference, with preregistered hypotheses and power analyses [32,36]. When stochastic decoding is necessary, variance is explicitly modeled and reported.
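For example, a per-item effect estimate can be wrapped in a bootstrap confidence interval along these lines; the effect values and the `bootstrap_ci` helper are placeholders, not LIBERTy code:

```python
import numpy as np

rng = np.random.default_rng(0)
effects = rng.normal(loc=0.8, scale=0.5, size=200)   # stand-in per-item causal effects

def bootstrap_ci(values, n_boot=10_000, alpha=0.05, rng=rng):
    n = len(values)
    means = np.array([
        values[rng.integers(0, n, size=n)].mean()    # resample items with replacement
        for _ in range(n_boot)
    ])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return values.mean(), lo, hi

ace, lo, hi = bootstrap_ci(effects)
print(f"ACE = {ace:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Seeds and decoding runs can be added as a second resampling level, with mixed-effects models handling task- and model-level grouping.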
Quantifying effects and reporting
LIBERTy’s mediation and pathway scoring reports:
- Average causal effect (ACE) of patched/ablated mediators on outputs (faithfulness-aligned scale)
- Proportion of variance explained by identified pathways
- Pathway attribution: single-node vs. multi-node contributions under path patching
- Uncertainty bands via bootstrap over items and seeds, with multiple comparisons controlled (e.g., BH-FDR)

Each metric is normalized to 0–100 for comparability and macro-averaged with confidence intervals; sensitivity analyses probe robustness to prompt and decoding choices.
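As an illustration of these last two reporting steps, here is a minimal sketch using statsmodels for BH-FDR; the raw scores, p-values, and min-max rescaling are placeholders, and LIBERTy's exact normalization is not specified in the source:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

raw = np.array([0.12, 0.47, 0.83, 0.05])                    # e.g., per-task ACE on a raw scale
norm = 100 * (raw - raw.min()) / (raw.max() - raw.min())    # one possible 0-100 rescaling

pvals = np.array([0.001, 0.03, 0.04, 0.20])                 # one p-value per probed mediator
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(norm.round(1), reject, p_adj.round(3))
```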
Model family comparability
LIBERTy applies identical intervention protocols across closed and open model families where interfaces allow—GPT‑4‑class successors, Claude, Gemini, and leading open models (Llama, Mixtral, Gemma, Qwen, DeepSeek, Grok) [49–57]. When white-box access is unavailable, LIBERTy falls back to input-level and environment-shift tests and reports mediation claims only where representation-level evidence exists, preserving apples-to-apples comparisons across systems [14,32].
Failure modes and safeguards
Representation-level tests are not immune to pitfalls. LIBERTy defends against common threats by:
- Combining deletion and insertion to avoid off-manifold artifacts
- Using ROAR to counter model adaptivity after erasure
- Running sanity checks to detect uninformative attributions
- Stress-testing under environment shifts to expose spurious pathways
- Treating attention visualizations as hypotheses to be falsified via interventions, not as causal proof [24,25]
- Preferring feature-level manipulations via SAEs to reduce concept confounds
🔬 The guiding principle: intervene on the mechanism you claim, predict the direction of change, and quantify the effect with uncertainty.
Comparison Tables
Internal intervention techniques at a glance
| Technique | What it tests | Inputs needed | Granularity | Strengths | Key risks/mitigations |
|---|---|---|---|---|---|
| Activation patching | Sufficiency of hypothesized mediator/pathway via counterfactual substitution | Counterfactual example; access to activations | Layer/head/feature | Directly measures causal influence; supports path-level tests | Off-manifold risk mitigated by on-manifold edits and insertion tests [5,27,42] |
| Targeted ablation/editing | Necessity of mediator; falsifies spurious routes | Access to activations/parameters | Neuron/feature/circuit | Tests flip/drop under removal; ROAR strengthens causality | Model reweighting; address via remove-and-retrain (ROAR) [4,10,27] |
| Causal mediation/abstraction | Direct/indirect effects; structural hypotheses | Structural mapping + interventions | Pathway-level | Quantifies variance explained; tests multi-hop chains | Mis-specified structure; validate with Tracr or process supervision [37,40] |
| SAE feature-level patching | Semantic-unit interventions (concept-aligned) | Sparse feature dictionary | Concept-level feature | Reduces concept confounds; sharper causal claims | Feature misalignment; requires validated dictionaries |
Best Practices
- Pre-register mediator hypotheses, counterfactual pairs, intervention sites, metrics, and power targets; publish code, logs, and seeds for HELM-style transparency [32,36].
- Pair deletion with insertion and use human-validated counterfactual edits to minimize off-manifold artifacts.
- Use ROAR-style retraining when claiming necessity from removal; report with and without retraining.
- Prefer feature-level interventions via SAEs when available; otherwise, localize to minimal layers/heads to reduce spread.
- Estimate ACE with bootstrap confidence intervals; use mixed-effects models for cross-task, cross-model inference; control multiplicity (e.g., BH-FDR).
- In black-box settings, qualify causal claims and triangulate with counterfactual robustness and environment-shift tests (e.g., WILDS-style splits).
- Treat attention maps and TCAV-like concept links as hypotheses; insist on interventional confirmation before causal claims [24,25,28,29].
Practical Examples
While specific implementation details are not publicly available beyond the framework description, LIBERTy outlines several canonical internal experiments and where they apply:
- Counterfactual NLI mediator test: Construct minimal pairs that change a single semantic factor (e.g., negation). Hypothesize that a specific feature or head mediates sensitivity to that factor. Substitute activations from the counterfactual example at that mediator (activation patching) and measure whether the output flips or shifts as predicted; report ACE with uncertainty. Pair with insertion tests to check that adding the factor into a neutral context produces a corresponding change, mitigating deletion artifacts.
- Chain-of-thought (CoT) necessity checks: On process-supervised math/logic tasks (GSM8K, MATH), identify the token positions and layers associated with a particular reasoning step [20,21,22,38]. Perform targeted ablation at those internal states; if the step is necessary, downstream intermediate states or final answers should degrade. Where feasible, patch in the correct step’s activations to test sufficiency. Report step-level accuracy, infidelity, and effect sizes of ablations.
- Pathway-level mediation in compiled transformers: Use Tracr to obtain a transformer with known circuits for an algorithmic task. Specify a causal abstraction that maps input factors to internal subcircuits and outputs. Run path patching along the hypothesized chain and estimate direct/indirect effects; compare to single-node patches to assess interactions. This provides a groundable reference for pathway attribution and validates the mediation protocol end-to-end.
- Feature-level patching with sparse autoencoders: Train or adopt a sparse feature dictionary that disentangles interpretable features in LLM activations. For a target concept (e.g., quantifiers), patch the corresponding feature from a counterfactual example into the original context. If the explanation is concept-causal, the output should change in the predicted direction; ablate the feature to test necessity. Report concept-level ACE and discuss alignment quality.
- Black-box comparability fallback: For closed models where internal access is unavailable, run the same counterfactual and environment-shift tests and report deletion–insertion AUC, counterfactual flip rates, and attribution stability. Reserve mediation scores for models where activation/feature interventions were possible, and clearly distinguish evidence tiers in the LIBERTy report [14,32] (a sketch of these fallback metrics follows this list).
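A minimal sketch of the two fallback metrics named in the last item; the prediction arrays and deletion scores are placeholders that would in practice come from querying the closed model:

```python
import numpy as np

# Counterfactual flip rate: how often does a minimal input edit change the prediction?
preds_original = np.array([1, 0, 1, 1, 0])
preds_counterfactual = np.array([0, 0, 0, 1, 1])
flip_rate = float(np.mean(preds_original != preds_counterfactual))

# Deletion curve: model score for the original answer after removing the k
# most-attributed tokens, at evenly spaced removal fractions from 0 to 1.
deletion_scores = np.array([0.92, 0.80, 0.55, 0.31, 0.20, 0.12])
adjacent_means = (deletion_scores[:-1] + deletion_scores[1:]) / 2
deletion_auc = float(adjacent_means.mean())   # trapezoidal AUC over the unit interval

print(f"flip rate = {flip_rate:.2f}, deletion AUC = {deletion_auc:.3f}")
```

Lower deletion AUC indicates that removing the tokens an attribution names really does remove the evidence the model relies on.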
In all cases, LIBERTy emphasizes seeds, decoding grids, and bootstrap CIs; mixed-effects modeling accommodates variability across tasks and models, and multiplicity control prevents over-claiming from multiple probes. Specific metric values beyond these protocols are not provided in the report.
Conclusion
LIBERTy’s central claim is that explanation faithfulness must be earned through interventions, not aesthetics. By turning attention heads, MLP features, and circuits into manipulable hypotheses—then validating them with activation/path patching, ablation/editing, and mediation—LIBERTy replaces plausibility with causal evidence. Feature-level alignment via sparse autoencoders lifts interventions to semantic units, while rigorous variance controls and HELM-style transparency keep comparisons honest across open and closed model families. The result is a framework that can say, with statistical backing, which explanations reflect internal mechanisms and which are merely rationalizations.
Key takeaways:
- Plausibility ≠ faithfulness; attention and saliency are hypotheses until interventional tests confirm them [1,24,25,30].
- Activation/path patching and targeted ablation/editing provide complementary tests of sufficiency and necessity at the representation level [4,27,42].
- Causal mediation/abstraction quantifies direct/indirect effects and variance explained by pathways.
- Sparse autoencoders enable concept-aligned, feature-level interventions that reduce confounds.
- Robust reporting requires HELM-style transparency, multi-seed variance modeling, and principled inference [32,36].
Next steps for practitioners:
- Pre-register mediator hypotheses and protocols; implement patch/ablate experiments with uncertainty reporting.
- Build minimal, human-validated counterfactual datasets tailored to the factors your explanations cite [5,12].
- Invest in feature dictionaries (SAEs) to align interventions to semantic units.
- Where internals are inaccessible, use counterfactual and environment-shift tests and clearly qualify causal claims.
Looking forward, broader adoption of representation-level interventions—paired with standardized reporting—should sharpen the field’s understanding of how modern LLMs actually compute, and which explanations we can trust.