Run a LIBERTy Evaluation in 30 Days
The gap between persuasive model explanations and truly faithful ones is now too consequential to ignore. The interpretability literature warns that plausibility is not faithfulness and that attention heatmaps alone are not causally diagnostic without interventions. LIBERTy—an end-to-end framework for 2026—meets this moment by prescribing rigorous, interventional tests, transparent reporting, and statistical power for large-scale evaluation of causal faithfulness across tasks, models, and explanation types. This article is a practical playbook: how to staff and scope, what to run each week, which datasets and metrics to start with, how to adapt to black-box versus white-box access, and what to deliver at the end.
In 30 days, you’ll preregister hypotheses and prompts; run pilots to size your study; execute a battery of input, counterfactual, and representation-level tests; and ship a replication-ready package with disaggregated results and cost-per-point accounting. You’ll learn how to choose explanation types (chain-of-thought, textual rationales, saliency maps, tool/program traces), pick datasets that actually enable causal tests, and assemble a metrics toolbox—from ERASER to deletion–insertion to retrain-after-removal and activation patching—that triangulates faithfulness while mitigating known validity threats.
Architecture/Implementation Details
Team, scope, and success criteria
- Define faithfulness upfront: explanations must track the causal factors the model actually uses, not merely human-plausible rationalizations. Do not assume attention weights are explanatory without interventional confirmation.
- Pick explanation types and endpoints:
  - Chain-of-thought (CoT): step-level correctness and intervention sensitivity.
  - Textual rationales: evidence-grounded spans, ERASER-style tests.
  - Saliency/attribution maps: deletion–insertion AUC, ROAR, infidelity/sensitivity.
  - Tool-use/program traces: ablate steps or counterfactually edit tool outputs; use ReAct histories or Tracr-compiled programs as groundable references.
- Choose primary endpoints by causal property: counterfactual dependence, minimal sufficiency/necessity, invariance to spurious features, mediation/pathways.
Success looks like preregistered metrics with adequate statistical power, convergent positive results across complementary tests, and uncertainty reporting that supports fair cross-model comparison; the sketch below shows one way to record these choices before Week 1 begins.
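It helps to write these choices down as data rather than prose. The sketch below is one illustrative way to encode explanation types, causal properties, and endpoints; the field and metric names are our own stand-ins, not part of LIBERTy itself.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExplanationEndpoint:
    """Illustrative record tying an explanation type to its preregistered endpoints."""
    explanation_type: str     # e.g., "cot", "textual_rationale", "saliency", "tool_trace"
    causal_property: str      # e.g., "counterfactual_dependence", "sufficiency_necessity", "mediation"
    primary_metric: str       # the preregistered primary endpoint
    secondary_metrics: tuple = ()

REGISTRY = [
    ExplanationEndpoint("textual_rationale", "sufficiency_necessity",
                        "eraser_comprehensiveness", ("eraser_sufficiency",)),
    ExplanationEndpoint("saliency", "counterfactual_dependence",
                        "deletion_insertion_auc", ("roar_accuracy_drop", "infidelity")),
    ExplanationEndpoint("cot", "mediation",
                        "step_patching_ace", ("step_correctness",)),
    ExplanationEndpoint("tool_trace", "counterfactual_dependence",
                        "tool_output_edit_flip_rate", ()),
]
```

Checking a registry like this into version control alongside the preregistration makes the Week 1 lock-in auditable.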
Week 1: preregistration, fixed prompts, metric definitions, and sample size planning
- Preregister hypotheses, datasets, prompt templates, decoding grids, metrics, and primary/secondary endpoints (HELM-style transparency; “Show Your Work” reporting).
- Lock prompts and decoding parameters (e.g., standardized temperatures such as 0.0, 0.3, 0.7; k-sample self-consistency where appropriate) to control variance.
- Define metrics per property:
  - ERASER comprehensiveness/sufficiency for rationale removal/isolation.
  - Deletion–insertion curves/AUC; include insertion to mitigate off-manifold issues.
  - ROAR retrain-after-removal for stronger necessity claims.
  - Counterfactual flip rates on minimal edits; align edits with attribution shifts.
  - Environment-level attribution stability under WILDS-like shifts.
  - Mediation via activation/path patching; estimate average causal effects (ACE) for hypothesized mediators.
- Plan power: use pilot variance and bootstrap CIs; adopt hierarchical mixed-effects models across tasks/models; control multiplicity (BH-FDR). Final sample sizes cannot be fixed until pilot variance is in hand; the sketch below shows the calculation once it is.
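A minimal power-planning sketch under a two-sample z-approximation, assuming per-item scores (for example, comprehensiveness) whose pilot standard deviation you have already estimated; the numbers below are hypothetical stand-ins.

```python
import numpy as np
from scipy import stats

def n_per_group(pilot_sd, min_detectable_diff, alpha=0.05, power=0.8):
    """Two-sample z-approximation: items needed per configuration to detect
    `min_detectable_diff` in a per-item score whose pilot SD is `pilot_sd`."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    return int(np.ceil(2 * ((z_alpha + z_beta) * pilot_sd / min_detectable_diff) ** 2))

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of a per-item metric."""
    rng = np.random.default_rng(seed)
    means = rng.choice(scores, size=(n_boot, len(scores)), replace=True).mean(axis=1)
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

# Hypothetical pilot numbers: per-item SD of 0.3 on comprehensiveness, 0.05 minimum detectable gap.
print(n_per_group(pilot_sd=0.3, min_detectable_diff=0.05))  # ≈ 566 items per configuration

pilot_scores = np.clip(np.random.default_rng(1).normal(0.6, 0.3, size=40), 0.0, 1.0)  # stand-in pilot data
print(bootstrap_ci(pilot_scores))
```

For binary endpoints such as flip rates, swap in a proportions-based calculation; the bootstrap helper doubles as the CI machinery for Week 4 reporting.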
Week 2: pilot runs for variance estimation and dataset sanity checks
- Run small batches per model–task–metric to estimate variance and refine sample sizes and decoding grids.
- Perform sanity checks for attribution method degeneracies; verify on-manifold, fluent counterfactual edits to reduce deletion artifacts.
- Validate dataset supervision signals: gold evidence spans (ERASER tasks, FEVER, HotpotQA) and step-level process supervision (GSM8K, MATH) should behave as expected on a small subset.
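One way to run the gold-evidence sanity check, assuming a hypothetical `predict_prob(text, label)` wrapper around your model API that returns the probability of the gold label, and examples carrying character-level evidence spans.

```python
import random

def remove_spans(text, spans):
    """Delete the given (start, end) character spans from the text."""
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + text[end:]
    return text

def random_spans_like(text, spans, seed=0):
    """Random spans with the same lengths as the gold spans (crude control condition)."""
    rng = random.Random(seed)
    out = []
    for start, end in spans:
        length = end - start
        s = rng.randrange(0, max(1, len(text) - length))
        out.append((s, s + length))
    return out

def supervision_sanity_check(examples, predict_prob):
    """Gold-evidence removal should hurt the gold-label probability more than
    removing random spans of the same total length."""
    gold_drops, rand_drops = [], []
    for ex in examples:  # each ex: {"text": ..., "label": ..., "evidence_spans": [(s, e), ...]}
        full = predict_prob(ex["text"], ex["label"])
        gold_drops.append(full - predict_prob(remove_spans(ex["text"], ex["evidence_spans"]), ex["label"]))
        rand = random_spans_like(ex["text"], ex["evidence_spans"])
        rand_drops.append(full - predict_prob(remove_spans(ex["text"], rand), ex["label"]))
    return sum(gold_drops) / len(gold_drops), sum(rand_drops) / len(rand_drops)
```

If the mean gold-span drop does not clearly exceed the random-span drop on the pilot subset, revisit the dataset or the wrapper before scaling up.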
Week 3: perturbations, counterfactual robustness, and environment splits
- Input- and feature-level tests: compute ERASER comprehensiveness/sufficiency, deletion–insertion AUC; prepare ROAR datasets for retraining.
- Counterfactual robustness: use minimally edited pairs (Counterfactual NLI; CheckList) to measure flip rates and whether attribution and outputs move in the expected direction.
- Environment robustness: evaluate attribution stability and accuracy across WILDS-style shifts; relate de-emphasis of spurious cues to performance stability.
- Representation-level probes (white-box only): activation/path patching and targeted ablations at hypothesized mediators; consider SAE-disentangled features for more semantically aligned interventions.
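For white-box models, a minimal activation-patching sketch with TransformerLens; gpt2 is a stand-in for whatever open model you are evaluating, and the layer/position choice is an illustrative mediator hypothesis, not a recommendation.

```python
import torch
from transformer_lens import HookedTransformer
import transformer_lens.utils as utils

# Stand-in open model; swap in the model you are actually evaluating.
model = HookedTransformer.from_pretrained("gpt2")

clean_tokens = model.to_tokens("The capital of France is")
corrupted_tokens = model.to_tokens("The capital of Italy is")
paris = model.to_single_token(" Paris")

# Cache activations on the clean run.
_, clean_cache = model.run_with_cache(clean_tokens)

layer, pos = 6, 4  # illustrative mediator hypothesis: layer-6 residual stream at the subject token

def patch_resid(resid, hook):
    # Overwrite the corrupted run's residual stream at one position with the clean value.
    resid[:, pos, :] = clean_cache[hook.name][:, pos, :]
    return resid

with torch.no_grad():
    baseline = model(corrupted_tokens)[0, -1, paris].item()
    patched = model.run_with_hooks(
        corrupted_tokens,
        fwd_hooks=[(utils.get_act_name("resid_pre", layer), patch_resid)],
    )[0, -1, paris].item()

print(baseline, patched)  # a large rise in the " Paris" logit suggests this site mediates the prediction
```

Sweeping `layer` and `pos` and averaging effects over many clean/corrupted pairs produces the mediation estimates reported in Week 4.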
Week 4: full-scale runs, retraining-based controls, and uncertainty reporting
- Execute the full matrix across models (closed and open families listed in contemporary reports), tasks, and explanation types with multi-seed trials and standardized decoding.
- Run ROAR (remove-and-retrain) to strengthen necessity claims, mitigating model adaptivity and feature-interaction confounds (a toy sketch follows this list).
- Summarize with means, standard deviations, and 95% bootstrap confidence intervals per configuration; fit mixed-effects models for inference with random intercepts for tasks/models; control for multiple comparisons.
- Account for compute: report parameter counts where disclosed, context lengths, per-item generation budgets, wall-clock times, and cost-per-point statistics per metric; normalize by matching sample counts and decoding parameters for fair cross-model comparisons. Where providers do not disclose FLOPs, report the gap rather than estimating.
- Release a replication package: versioned datasets/splits, prompts, generation logs, seeds, metric scripts, containers; include model cards, datasheets, and data statements.
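ROAR requires retraining, which is costly for large models; the toy sketch below illustrates the protocol's logic on a synthetic tabular task with a retrainable classifier, using coefficient magnitude as a stand-in attribution. In the actual study you would mask the tokens or features flagged by the method under test and retrain or fine-tune the evaluated model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=2000) > 0).astype(int)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

base = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
baseline_acc = base.score(Xte, yte)

# Stand-in "attribution": rank features by |coefficient|; in practice use the method under test.
top_k = np.argsort(np.abs(base.coef_[0]))[::-1][:2]

def mask(X, idx):
    Xm = X.copy()
    Xm[:, idx] = 0.0  # replace removed features with an uninformative value
    return Xm

# ROAR: drop the top-attributed features from train AND test data, retrain, re-measure accuracy.
roar_acc = LogisticRegression(max_iter=1000).fit(mask(Xtr, top_k), ytr).score(mask(Xte, top_k), yte)

# Control: drop the same number of randomly chosen features.
rand_idx = rng.choice(X.shape[1], size=2, replace=False)
rand_acc = LogisticRegression(max_iter=1000).fit(mask(Xtr, rand_idx), ytr).score(mask(Xte, rand_idx), yte)

# Faithful attributions should drive roar_acc well below the random-removal control (on average over repeats).
print(baseline_acc, roar_acc, rand_acc)
```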
Tooling stack
- Dataset loaders with evidence/process supervision and counterfactual/environment splits: ERASER suite, FEVER, HotpotQA, GSM8K, MATH, Counterfactual NLI, CheckList, WILDS; multimodal extensions where needed.
- Attribution baselines: Integrated Gradients (axiomatic), LIME and SHAP (model-agnostic), RISE and occlusion (perturbation-based).
- Perturbation pipelines: deletion–insertion, comprehensiveness/sufficiency, on-manifold edit validators (a deletion-curve sketch follows this list).
- Representation-level interventions: TransformerLens for activation/patching workflows; SAE-based feature editing when available.
- Statistics: bootstrap CI scripts, mixed-effects modeling, BH-FDR control, variance logging across seeds/generations.
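A deletion-curve sketch for the perturbation pipeline, assuming a hypothetical `predict_prob(tokens)` callable that returns the model's probability for its original prediction and a mask token appropriate to your tokenizer; the insertion curve is the same loop run from a fully masked input.

```python
import numpy as np

def deletion_score(tokens, importances, predict_prob, mask_token="[MASK]", steps=20):
    """Delete tokens in decreasing order of attributed importance and track the probability
    of the original prediction; the mean over the curve approximates a normalized AUC
    (lower is better for deletion)."""
    order = np.argsort(importances)[::-1]
    working = list(tokens)
    probs = [predict_prob(working)]
    step_size = max(1, len(order) // steps)
    for i in range(0, len(order), step_size):
        for idx in order[i:i + step_size]:
            working[idx] = mask_token
        probs.append(predict_prob(working))
    return float(np.mean(probs))
```

Pair this with the insertion variant and report both, since deletion alone is the more off-manifold of the two.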
Black-box versus white-box execution
- Black-box only: emphasize input-level perturbations (ERASER, deletion–insertion), counterfactual flip tests, environment robustness, and method sanity checks.
- White-box: add activation/path patching, targeted ablations, and mediation analysis; use Tracr for ground-truth circuits and ReAct traces for tool-use causality tests where applicable.
- In both modes: triangulate across complementary methods to mitigate validity threats—off-manifold perturbations, attribution instability across methods/seeds, and attention-as-explanation pitfalls.
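A simple check for the attribution-instability threat named above: compare token-importance rankings across seeds (or across methods) with a rank correlation. `spearmanr` is from SciPy; the attributions here are random stand-ins for real attribution runs.

```python
import numpy as np
from scipy.stats import spearmanr

def attribution_stability(attr_runs):
    """Mean pairwise Spearman correlation of token-importance rankings across seeds or
    across attribution methods; values near zero flag instability worth reporting."""
    corrs = []
    for i in range(len(attr_runs)):
        for j in range(i + 1, len(attr_runs)):
            rho, _ = spearmanr(attr_runs[i], attr_runs[j])
            corrs.append(rho)
    return float(np.mean(corrs))

# Hypothetical: attributions over 30 tokens from three seeds of the same method.
runs = [np.random.default_rng(seed).normal(size=30) for seed in range(3)]
print(attribution_stability(runs))
```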
Comparison Tables
Datasets to start with and the properties they test
| Dataset category | Examples | Supervision signal | Primary properties tested |
|---|---|---|---|
| Evidence-grounded QA / verification | HotpotQA; FEVER | Gold supporting facts/evidence spans | Minimal sufficiency/necessity; counterfactual dependence via edits to cited facts |
| Process-supervised math/logic | GSM8K; MATH | Step-level solutions | CoT step correctness; counterfactual edits to steps; mediation via patching step positions |
| Counterfactual pairs / behavioral tests | Counterfactual NLI; CheckList; Contrast Sets | Minimal semantic edits | Counterfactual flip rates; attribution shift alignment |
| Shift suites | WILDS; CIFAR-10.1 | Environment/subgroup splits | Invariance to spurious features; attribution stability vs. accuracy under shift |
| Multimodal justification | VQA-X/ACT-X; ScienceQA; A-OKVQA; VCR; FEVEROUS | Gold justifications or process-like signals | Localized occlusion effects; evidence grounding across modalities |
Metrics toolbox at a glance
| Metric / protocol | What it measures | Notes |
|---|---|---|
| ERASER comprehensiveness/sufficiency | Necessity/sufficiency of rationale spans | Standard for textual rationales |
| Deletion–insertion curves (AUC) | Output sensitivity to prioritized features | Pair with insertion to reduce off-manifold artifacts |
| ROAR (remove-and-retrain) | Feature necessity under retraining | Mitigates reweighting confound |
| Infidelity / sensitivity | Consistency between perturbations, output, and explanation | Diagnostic for explanation stability |
| Counterfactual flip rate | Dependence on edited factors | Use CNLI/CheckList/contrast sets |
| Activation/path patching; mediation | Causal impact of hypothesized mediators | White-box only; ACE estimation |
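To make the first row of the table concrete, here is a minimal sketch of the two ERASER scores, assuming a hypothetical `predict_prob(tokens, label)` callable and a boolean rationale mask aligned to the tokens.

```python
def comprehensiveness(tokens, rationale_mask, label, predict_prob):
    """Confidence drop when the rationale tokens are removed; higher suggests necessity."""
    without_rationale = [t for t, in_r in zip(tokens, rationale_mask) if not in_r]
    return predict_prob(tokens, label) - predict_prob(without_rationale, label)

def sufficiency(tokens, rationale_mask, label, predict_prob):
    """Confidence drop when only the rationale tokens are kept; lower suggests sufficiency."""
    rationale_only = [t for t, in_r in zip(tokens, rationale_mask) if in_r]
    return predict_prob(tokens, label) - predict_prob(rationale_only, label)
```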
Black-box vs. white-box: which tests fit
| Access | Feasible tests | Limitations |
|---|---|---|
| Black-box | ERASER, deletion–insertion, counterfactual tests, WILDS shifts, sanity checks | No activation-level mediation; rely on input perturbations |
| White-box | All black-box tests plus activation/patching, ablation, causal abstraction | Requires safe instrumentation; security considerations apply |
Best Practices
- Preregister everything: datasets, prompts, metrics, endpoints, and power targets; publish code, data, seeds, and containers (HELM-style; “Show Your Work”).
- Triangulate across complementary methods to counter validity threats: use deletion and insertion; counterfactual edits validated for fluency; ROAR to address adaptivity; representation-level interventions to confirm attribution hypotheses.
- Treat attention maps as hypotheses to falsify or confirm via targeted interventions—not as explanations by default.
- Prioritize datasets with gold evidence or process supervision; where only plausibility labels exist (e-SNLI), qualify interpretations and emphasize causal tests.
- Evaluate invariance: test explanation stability and performance across predefined environments/subgroups; analyze spurious correlation de-emphasis.
- Control variance: fixed prompts; standardized decoding grids; multi-seed runs; bootstrap CIs; mixed-effects models; BH-FDR for multiple comparisons (see the inference sketch after this list).
- Document responsibly: model cards, datasheets, and data statements for sources, demographics, risks, and limitations.
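A sketch of the inference step referenced above, using statsmodels; the long-format results table here is synthetic and the column names are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.multitest import multipletests

# Synthetic stand-in for a long-format results table: one row per item x model x task.
rng = np.random.default_rng(0)
results = pd.DataFrame({
    "score": rng.uniform(0.0, 1.0, size=600),
    "model": np.tile(["model_a", "model_b", "model_c"], 200),
    "task": np.repeat(["fever", "hotpotqa", "gsm8k"], 200),
})

# Mixed-effects model: fixed effect of model, random intercept per task.
fit = smf.mixedlm("score ~ C(model)", data=results, groups=results["task"]).fit()
print(fit.summary())

# BH-FDR over the model contrasts (in practice, pool p-values across all preregistered endpoints).
contrast_p = fit.pvalues.filter(like="C(model)")
reject, p_adj, _, _ = multipletests(contrast_p, alpha=0.05, method="fdr_bh")
print(dict(zip(contrast_p.index, p_adj)))
```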
Practical Examples
While specific quantitative results depend on your models and budgets, here’s how the 30-day plan plays out with the datasets and metrics specified in LIBERTy.
- Evidence-grounded QA (HotpotQA/FEVER): In Week 1, preregister ERASER-style endpoints (comprehensiveness/sufficiency) with deletion–insertion AUC as secondary. In Week 2 pilots, verify that removing human-labeled supporting facts degrades predictions more than removing random spans (sanity check). In Week 3, add counterfactual edits to cited facts and measure flip rates, while ensuring edits are fluent/on-manifold. If you have white-box access, patch activations corresponding to supporting sentences from counterfactual documents to test mediator hypotheses. In Week 4, run ROAR by retraining models with important spans removed to strengthen necessity claims.
- Process-supervised math (GSM8K/MATH): Define CoT endpoints: step-level correctness, sensitivity to counterfactual step edits, and effects of removing or substituting steps. In pilots, estimate variance of step correctness under self-consistency decoding. In Week 3, ablate or patch activations at step-associated token positions to test whether specific steps causally mediate final answers (white-box). Report mediation ACE and uncertainty in Week 4.
- Counterfactual robustness (CNLI/CheckList/Contrast Sets): Predefine minimal edits (negation, quantifiers, entity swaps) and measure counterfactual flip rates and attribution shift alignment. Use insertion tests alongside deletion to reduce off-manifold confounds. A flip-rate sketch follows this list.
- Environment-level invariance (WILDS; CIFAR-10.1): Partition evaluations by environment/subgroup and measure whether attribution stability predicts performance stability under shift; evaluate whether attributions de-emphasize known spurious cues.
- Multimodal justification (VQA-X/ACT-X; ScienceQA; A-OKVQA; VCR; FEVEROUS): Pair pointing-and-justification checks with cross-modal occlusion; confirm that evidence grounding correlates with localized occlusion effects and counterfactual flip rates.
- Black-box versus white-box runs: For closed models (e.g., GPT-4-class, Claude, Gemini), rely on input-level and environment tests with comprehensive uncertainty reporting. For open models (Llama, Mixtral, Gemma, Qwen, DeepSeek, Grok), add activation patching/ablation and SAE-based feature interventions where feasible. In both cases, apply HELM-style harnessing and cost-per-point accounting.
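For the counterfactual-robustness bullet, a flip-rate sketch assuming a hypothetical `predict_label(text)` wrapper and (original, edited) pairs; the attribution-shift helper assumes edits that preserve token alignment, such as single-token substitutions.

```python
def flip_rate(pairs, predict_label):
    """Fraction of minimally edited pairs whose predicted label changes.
    Each pair: (original_text, edited_text)."""
    flips = sum(predict_label(original) != predict_label(edited) for original, edited in pairs)
    return flips / len(pairs)

def attribution_shift(attr_original, attr_edited, edited_indices):
    """Change in attribution mass on the edited tokens; a faithful method should move
    this mass in the direction that explains the output change."""
    before = sum(attr_original[i] for i in edited_indices)
    after = sum(attr_edited[i] for i in edited_indices)
    return after - before
```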
These examples illustrate the LIBERTy principle: measure causal faithfulness through convergent, interventional tests matched to supervision signals, and report with enough transparency and power to support credible comparisons.
Conclusion
In a month, ML teams can move beyond plausible-sounding explanations to causally faithful ones by following LIBERTy’s reproducible blueprint. Anchor evaluations in evidence-grounded or process-supervised data, combine input-level perturbations with counterfactual robustness and representation-level mediation, and report with HELM-style transparency and statistical rigor. Whether you have black-box APIs or full white-box access, the framework provides feasible, scalable routes to credible claims about what your model’s explanations actually mean.
Key takeaways:
- Faithfulness requires interventions; plausibility and attention maps are insufficient without causal tests.
- Start with datasets that enable causal evaluation: ERASER-style evidence, process supervision, counterfactual pairs, and environment splits.
- Triangulate metrics: ERASER, deletion–insertion, ROAR, counterfactual flip rates, and mediation via activation patching.
- Control variance and power: preregister, standardize prompts/decoding, bootstrap CIs, and use mixed-effects models.
- Ship a full replication package with model/data cards, disaggregated results, and cost-per-point tables.
Next steps: Draft your preregistration this week; assemble datasets with evidence/process supervision; build your perturbation and patching pipelines; run a 2-day pilot for variance; and schedule Week 3’s counterfactual and environment tests. Looking ahead, mechanistic advances like sparse autoencoders and libraries such as TransformerLens will make pathway-level mediation tests more precise, further narrowing the gap between explanation and cause.