
Run a LIBERTy Evaluation in 30 Days

A practical, reproducible playbook for ML teams to measure causal faithfulness at scale

By AI Research Team

The gap between persuasive model explanations and truly faithful ones is now too consequential to ignore. The interpretability literature warns that plausibility is not faithfulness and that attention heatmaps alone are not causally diagnostic without interventions. LIBERTy—an end-to-end framework for 2026—meets this moment by prescribing rigorous, interventional tests, transparent reporting, and statistical power for large-scale evaluation of causal faithfulness across tasks, models, and explanation types. This article is a practical playbook: how to staff and scope, what to run each week, which datasets and metrics to start with, how to adapt to black-box versus white-box access, and what to deliver at the end.

In 30 days, you’ll preregister hypotheses and prompts; run pilots to size your study; execute a battery of input, counterfactual, and representation-level tests; and ship a replication-ready package with disaggregated results and cost-per-point accounting. You’ll learn how to choose explanation types (chain-of-thought, textual rationales, saliency maps, tool/program traces), pick datasets that actually enable causal tests, and assemble a metrics toolbox—from ERASER to deletion–insertion to retrain-after-removal and activation patching—that triangulates faithfulness while mitigating known validity threats.

Architecture/Implementation Details

Team, scope, and success criteria

  • Define faithfulness upfront: explanations must track causal factors actually used by the model, not merely human-plausible rationalizations. Avoid assumptions that attention weights are explanatory without interventional confirmation.
  • Pick explanation types and endpoints (a configuration sketch follows this list):
    • Chain-of-thought (CoT): step-level correctness and intervention sensitivity.
    • Textual rationales: evidence-grounded spans, ERASER-style tests.
    • Saliency/attribution maps: deletion–insertion AUC, ROAR, infidelity/sensitivity.
    • Tool-use/program traces: ablate steps or counterfactually edit tool outputs; use ReAct histories or Tracr-compiled programs as groundable references.
  • Choose primary endpoints by causal property: counterfactual dependence, minimal sufficiency/necessity, invariance to spurious features, mediation/pathways.
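To make the scoping step concrete, here is a minimal sketch of how a team might encode the explanation-type-by-endpoint matrix before preregistration. The class and field names are illustrative placeholders, not a schema defined by LIBERTy.

```python
# Hypothetical preregistration spec; names and fields are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Endpoint:
    name: str             # e.g. "eraser_comprehensiveness"
    causal_property: str  # e.g. "necessity", "counterfactual_dependence", "mediation"
    primary: bool = False

@dataclass
class EvalCell:
    explanation_type: str   # "cot" | "textual_rationale" | "saliency" | "tool_trace"
    dataset: str            # e.g. "hotpotqa", "gsm8k"
    endpoints: list = field(default_factory=list)

PREREG = [
    EvalCell("textual_rationale", "hotpotqa",
             [Endpoint("eraser_comprehensiveness", "necessity", primary=True),
              Endpoint("deletion_insertion_auc", "counterfactual_dependence")]),
    EvalCell("cot", "gsm8k",
             [Endpoint("step_flip_rate", "counterfactual_dependence", primary=True),
              Endpoint("mediation_ace", "mediation")]),
]
```

Week 1 then freezes this structure alongside prompts, decoding grids, and power targets.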

Success looks like preregistered metrics with power, convergent positive results across complementary tests, and uncertainty reporting that supports fair cross-model comparison.

Week 1: preregistration, fixed prompts, metric definitions, and sample size planning

  • Preregister hypotheses, datasets, prompt templates, decoding grids, metrics, and primary/secondary endpoints (HELM-style transparency; “Show Your Work” reporting).
  • Lock prompts and decoding parameters (e.g., standardized temperatures such as 0.0, 0.3, 0.7; k-sample self-consistency where appropriate) to control variance.
  • Define metrics per property:
    • ERASER comprehensiveness/sufficiency for rationale removal/isolation.
    • Deletion–insertion curves/AUC; include insertion to mitigate off-manifold issues.
    • ROAR retrain-after-removal for stronger necessity claims.
    • Counterfactual flip rates on minimal edits; align edits with attribution shifts.
    • Environment-level attribution stability under WILDS-like shifts.
    • Mediation via activation/path patching; estimate average causal effects (ACE) for hypothesized mediators.
  • Plan power: use pilot variance and bootstrap CIs; adopt hierarchical mixed-effects models across tasks/models; control multiplicity with BH-FDR. Final per-cell sample sizes remain provisional until Week 2 pilot variance is in hand (see the power-planning sketch after this list).
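The sketch below shows one way to turn pilot variance into per-cell sample sizes and to compute bootstrap CIs. It assumes a simple two-sided z-approximation for a difference in mean metric values, which is an illustrative choice rather than a LIBERTy requirement.

```python
# Minimal power-planning sketch, assuming a two-sided z-approximation for a
# difference in per-item metric means; sigma_pilot comes from Week 2 pilots.
import numpy as np
from scipy import stats

def items_per_cell(sigma_pilot, min_effect, alpha=0.05, power=0.8):
    """Approximate items needed per model-task cell to detect `min_effect`
    (difference in mean metric) given pilot standard deviation `sigma_pilot`."""
    z_a = stats.norm.ppf(1 - alpha / 2)
    z_b = stats.norm.ppf(power)
    return int(np.ceil(2 * ((z_a + z_b) * sigma_pilot / min_effect) ** 2))

def bootstrap_ci(scores, n_boot=10_000, level=0.95, seed=0):
    """Percentile bootstrap CI for a mean metric over items."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores)
    means = rng.choice(scores, size=(n_boot, len(scores)), replace=True).mean(axis=1)
    lo, hi = np.percentile(means, [(1 - level) / 2 * 100, (1 + level) / 2 * 100])
    return scores.mean(), (lo, hi)

# Example: pilot SD of 0.18 on comprehensiveness, want to detect a 0.05 gap.
print(items_per_cell(sigma_pilot=0.18, min_effect=0.05))  # -> 204 items per cell
```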

Week 2: pilot runs for variance estimation and dataset sanity checks

  • Run small batches per model–task–metric to estimate variance and refine sample sizes and decoding grids.
  • Perform sanity checks for attribution method degeneracies (a randomization-test sketch follows this list); verify on-manifold, fluent counterfactual edits to reduce deletion artifacts.
  • Validate dataset supervision signals: gold evidence spans (ERASER tasks, FEVER, HotpotQA) and step-level process supervision (GSM8K, MATH) should behave as expected on a small subset.
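One such degeneracy check, in the spirit of the parameter-randomization test from Sanity Checks for Saliency Maps, is sketched below. Here `saliency_fn` is a placeholder for whichever attribution method you are auditing, and the full-model reinitialization is a simplification of the original layer-wise protocol.

```python
# Sketch of a parameter-randomization sanity check for attribution methods.
import copy
import torch
from scipy.stats import spearmanr

def randomization_check(model, saliency_fn, inputs):
    """Compare attributions from the trained model against a weight-randomized
    copy; rank correlations near 1 suggest the method ignores learned weights."""
    randomized = copy.deepcopy(model)
    for p in randomized.parameters():
        torch.nn.init.normal_(p, std=0.02)  # discard learned parameters
    rhos = []
    for x in inputs:
        a_trained = saliency_fn(model, x).flatten().detach().cpu().numpy()
        a_random = saliency_fn(randomized, x).flatten().detach().cpu().numpy()
        rhos.append(spearmanr(a_trained, a_random)[0])
    return float(sum(rhos) / len(rhos))  # near 0 is healthy; near 1 is suspect
```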

Week 3: perturbations, counterfactual robustness, and environment splits

  • Input- and feature-level tests: compute ERASER comprehensiveness/sufficiency and deletion–insertion AUC (a deletion–insertion sketch follows this list); prepare ROAR datasets for retraining.
  • Counterfactual robustness: use minimally edited pairs (Counterfactual NLI; CheckList) to measure flip rates and whether attribution and outputs move in the expected direction.
  • Environment robustness: evaluate attribution stability and accuracy across WILDS-style shifts; relate de-emphasis of spurious cues to performance stability.
  • Representation-level probes (white-box only): activation/path patching and targeted ablations at hypothesized mediators; consider SAE-disentangled features for more semantically aligned interventions.
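For the input-level battery, a deletion–insertion curve needs nothing more than repeated forward passes. In this sketch, `predict_proba` is an assumed callable returning the probability of the model's original prediction, and `baseline` is whatever stands in for a removed feature (a mask token id, a mean pixel, etc.).

```python
import numpy as np

def deletion_insertion_auc(x, attribution, predict_proba, baseline, steps=20):
    """Deletion: remove features in decreasing attribution order.
    Insertion: add them back to a fully-baselined input in the same order."""
    order = np.argsort(-attribution)                 # most important first
    del_curve, ins_curve = [], []
    for k in np.linspace(0, len(order), steps, dtype=int):
        x_del = x.copy(); x_del[order[:k]] = baseline
        x_ins = np.full_like(x, baseline); x_ins[order[:k]] = x[order[:k]]
        del_curve.append(predict_proba(x_del))
        ins_curve.append(predict_proba(x_ins))
    # Mean of the curve approximates the AUC over equally spaced removal fractions:
    # lower deletion AUC and higher insertion AUC indicate more faithful attributions.
    return float(np.mean(del_curve)), float(np.mean(ins_curve))
```

Pairing deletion with insertion, as preregistered in Week 1, reduces the chance that scores merely reflect off-manifold artifacts.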

Week 4: full-scale runs, retraining-based controls, and uncertainty reporting

  • Execute the full matrix across models (closed and open families listed in contemporary reports), tasks, and explanation types with multi-seed trials and standardized decoding.
  • Run ROAR (remove-and-retrain) to strengthen necessity claims, mitigating model adaptivity and feature interactions.
  • Summarize with means, standard deviations, and 95% bootstrap confidence intervals per configuration; fit mixed-effects models for inference, with random intercepts for tasks/models; control for multiple comparisons (see the reporting sketch after this list).
  • Compute accounting: report parameter counts where disclosed, context lengths, per-item generation budgets, wall-clock times, and cost-per-point statistics per metric; normalize by matching sample counts and decoding parameters for fair cross-model comparisons. Where providers do not disclose FLOPs or parameter counts, flag the gap rather than estimating.
  • Release a replication package: versioned datasets/splits, prompts, generation logs, seeds, metric scripts, containers; include model cards, datasheets, and data statements.
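The statistical summary and multiplicity control can be scripted directly from the per-item results table. The sketch below assumes a long-format CSV with model, task, seed, metric, and score columns; that layout and file name are our own illustration, not a format LIBERTy specifies.

```python
# Reporting sketch: mixed-effects fits with random intercepts for tasks,
# plus BH-FDR control over per-metric p-values.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.multitest import multipletests

results = pd.read_csv("liberty_item_scores.csv")  # columns: model, task, seed, metric, score

fits, pvals = {}, []
for metric, df in results.groupby("metric"):
    # Random intercept per task; model identity as fixed effect.
    fit = smf.mixedlm("score ~ C(model)", df, groups=df["task"]).fit()
    fits[metric] = fit
    # Crude headline contrast per metric; replace with your preregistered contrast.
    pvals.append(fit.pvalues.drop("Intercept").min())

reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
for (metric, _), ok, p in zip(fits.items(), reject, p_adj):
    print(f"{metric}: adjusted p={p:.3f}, significant={ok}")
```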

Tooling stack

  • Dataset loaders with evidence/process supervision and counterfactual/environment splits: ERASER suite, FEVER, HotpotQA, GSM8K, MATH, Counterfactual NLI, CheckList, WILDS; multimodal extensions where needed.
  • Attribution baselines: Integrated Gradients (axiomatic), LIME and SHAP (model-agnostic), RISE and occlusion (perturbation-based).
  • Perturbation pipelines: deletion–insertion, comprehensiveness/sufficiency, on-manifold edit validators.
  • Representation-level interventions: TransformerLens for activation/patching workflows (see the patching sketch after this list); SAE-based feature editing when available.
  • Statistics: bootstrap CI scripts, mixed-effects modeling, BH-FDR control, variance logging across seeds/generations.
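As a white-box illustration of the patching workflow, the sketch below uses TransformerLens to patch the residual stream of a corrupted prompt with clean-run activations, layer by layer. The "gpt2" checkpoint and the factual-recall prompts are stand-ins for whatever open model and task you actually evaluate; only the final token position is patched here.

```python
# Minimal activation-patching sketch with TransformerLens (white-box only).
import torch
from transformer_lens import HookedTransformer
import transformer_lens.utils as utils

torch.set_grad_enabled(False)
model = HookedTransformer.from_pretrained("gpt2")

clean = model.to_tokens("The Eiffel Tower is located in the city of")
corrupt = model.to_tokens("The Colosseum is located in the city of")
target = model.to_single_token(" Paris")

_, clean_cache = model.run_with_cache(clean)
baseline = model(corrupt)[0, -1, target].item()

def patch_resid(resid, hook, pos=-1):
    # Overwrite the corrupt run's residual stream at the final position
    # with the corresponding clean-run activation.
    resid[:, pos, :] = clean_cache[hook.name][:, pos, :]
    return resid

for layer in range(model.cfg.n_layers):
    name = utils.get_act_name("resid_post", layer)
    patched = model.run_with_hooks(corrupt, fwd_hooks=[(name, patch_resid)])
    delta = patched[0, -1, target].item() - baseline
    print(f"layer {layer:2d}: change in ' Paris' logit = {delta:+.3f}")
```

Large logit changes at specific layers are the kind of mediator evidence the Week 3 representation-level probes are after.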

Black-box versus white-box execution

  • Black-box only: emphasize input-level perturbations (ERASER, deletion–insertion), counterfactual flip tests (a flip-rate sketch follows this list), environment robustness, and method sanity checks.
  • White-box: add activation/path patching, targeted ablations, and mediation analysis; use Tracr for ground-truth circuits and ReAct traces for tool-use causality tests where applicable.
  • In both modes: triangulate across complementary methods to mitigate validity threats—off-manifold perturbations, attribution instability across methods/seeds, and attention-as-explanation pitfalls.
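In the black-box setting, the counterfactual flip test reduces to paired queries. In this sketch, `query_model` is a placeholder for your API client, and the pairs follow the CNLI/contrast-set pattern of minimal edits with a known expected label change.

```python
def flip_rate(pairs, query_model):
    """pairs: iterable of (original_text, edited_text, should_flip), where
    should_flip says whether the gold label changes under the minimal edit."""
    realized, expected = 0, 0
    for original, edited, should_flip in pairs:
        if not should_flip:
            continue                       # only label-changing edits count here
        expected += 1
        realized += query_model(original) != query_model(edited)
    return realized / max(expected, 1)     # fraction of intended flips the model realizes

# Hypothetical usage with any client that returns a label string:
# rate = flip_rate(cnli_pairs, query_model=lambda text: classify_with_api(text))
```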

Comparison Tables

Datasets to start with and the properties they test

| Dataset category | Examples | Supervision signal | Primary properties tested |
| --- | --- | --- | --- |
| Evidence-grounded QA / verification | HotpotQA; FEVER | Gold supporting facts/evidence spans | Minimal sufficiency/necessity; counterfactual dependence via edits to cited facts |
| Process-supervised math/logic | GSM8K; MATH | Step-level solutions | CoT step correctness; counterfactual edits to steps; mediation via patching step positions |
| Counterfactual pairs / behavioral tests | Counterfactual NLI; CheckList; Contrast Sets | Minimal semantic edits | Counterfactual flip rates; attribution shift alignment |
| Shift suites | WILDS; CIFAR-10.1 | Environment/subgroup splits | Invariance to spurious features; attribution stability vs. accuracy under shift |
| Multimodal justification | VQA-X/ACT-X; ScienceQA; A-OKVQA; VCR; FEVEROUS | Gold justifications or process-like signals | Localized occlusion effects; evidence grounding across modalities |

Metrics toolbox at a glance

| Metric / protocol | What it measures | Notes |
| --- | --- | --- |
| ERASER comprehensiveness/sufficiency | Necessity/sufficiency of rationale spans | Standard for textual rationales |
| Deletion–insertion curves (AUC) | Output sensitivity to prioritized features | Pair with insertion to reduce off-manifold artifacts |
| ROAR (remove-and-retrain) | Feature necessity under retraining | Mitigates reweighting confound |
| Infidelity / sensitivity | Consistency between perturbations, output, and explanation | Diagnostic for explanation stability |
| Counterfactual flip rate | Dependence on edited factors | Use CNLI/CheckList/contrast sets |
| Activation/path patching; mediation | Causal impact of hypothesized mediators | White-box only; ACE estimation |
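The first two rows of the toolbox are cheap to implement once you can score the model's probability for its original prediction. The sketch below assumes `prob_fn(text)` returns that probability and that rationales are given as character spans; both are illustrative conventions.

```python
# ERASER-style comprehensiveness/sufficiency sketch.
def comprehensiveness(text, rationale_spans, prob_fn):
    """Drop the rationale: a large probability drop means the spans were necessary."""
    return prob_fn(text) - prob_fn(remove(text, rationale_spans))

def sufficiency(text, rationale_spans, prob_fn):
    """Keep only the rationale: a small drop means the spans were sufficient."""
    return prob_fn(text) - prob_fn(keep_only(text, rationale_spans))

def remove(text, spans):
    # Delete character spans (start, end), right to left so offsets stay valid.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + text[end:]
    return text

def keep_only(text, spans):
    # Crude isolation: concatenate the rationale spans in document order.
    return " ".join(text[start:end] for start, end in sorted(spans))
```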

Black-box vs. white-box: which tests fit

| Access | Feasible tests | Limitations |
| --- | --- | --- |
| Black-box | ERASER, deletion–insertion, counterfactual tests, WILDS shifts, sanity checks | No activation-level mediation; rely on input perturbations |
| White-box | All black-box tests plus activation/patching, ablation, causal abstraction | Requires safe instrumentation; security considerations apply |

Best Practices

  • Preregister everything: datasets, prompts, metrics, endpoints, and power targets; publish code, data, seeds, and containers (HELM-style; “Show Your Work”).
  • Triangulate across complementary methods to counter validity threats: use deletion and insertion; counterfactual edits validated for fluency; ROAR to address adaptivity; representation-level interventions to confirm attribution hypotheses.
  • Treat attention maps as hypotheses to falsify or confirm via targeted interventions—not as explanations by default.
  • Prioritize datasets with gold evidence or process supervision; where only plausibility labels exist (e-SNLI), qualify interpretations and emphasize causal tests.
  • Evaluate invariance: test explanation stability and performance across predefined environments/subgroups; analyze spurious correlation de-emphasis.
  • Control variance: fixed prompts; standardized decoding grids; multi-seed runs; bootstrap CIs; mixed-effects models; BH-FDR for multiple comparisons.
  • Document responsibly: model cards, datasheets, and data statements for sources, demographics, risks, and limitations.

Practical Examples

While specific quantitative results depend on your models and budgets, here’s how the 30-day plan plays out with the datasets and metrics specified in LIBERTy.

  • Evidence-grounded QA (HotpotQA/FEVER): In Week 1, preregister ERASER-style endpoints (comprehensiveness/sufficiency) with deletion–insertion AUC as secondary. In Week 2 pilots, verify that removing human-labeled supporting facts degrades predictions more than removing random spans (sanity check). In Week 3, add counterfactual edits to cited facts and measure flip rates, while ensuring edits are fluent/on-manifold. If you have white-box access, patch activations corresponding to supporting sentences from counterfactual documents to test mediator hypotheses. In Week 4, run ROAR by retraining models with important spans removed to strengthen necessity claims.

  • Process-supervised math (GSM8K/MATH): Define CoT endpoints: step-level correctness, sensitivity to counterfactual step edits, and effects of removing or substituting steps. In pilots, estimate variance of step correctness under self-consistency decoding. In Week 3, ablate or patch activations at step-associated token positions to test whether specific steps causally mediate final answers (white-box). Report mediation ACE and uncertainty in Week 4.
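A minimal sketch of the step-level intervention follows, assuming a `generate` callable that continues a prompt and returns the model's final answer string; the prompt layout is an illustration, not a fixed GSM8K protocol.

```python
def step_ablation_effects(question, cot_steps, generate):
    """For each reasoning step, drop it and check whether the final answer changes."""
    full_prompt = question + "\n" + "\n".join(cot_steps) + "\nFinal answer:"
    reference = generate(full_prompt)
    effects = []
    for i in range(len(cot_steps)):
        ablated = cot_steps[:i] + cot_steps[i + 1:]
        prompt = question + "\n" + "\n".join(ablated) + "\nFinal answer:"
        effects.append({"dropped_step": i, "answer_changed": generate(prompt) != reference})
    return effects
```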

  • Counterfactual robustness (CNLI/CheckList/Contrast Sets): Predefine minimal edits (negation, quantifiers, entity swaps) and measure counterfactual flip rates and attribution shift alignment. Use insertion tests alongside deletion to reduce off-manifold confounds.

  • Environment-level invariance (WILDS; CIFAR-10.1): Partition evaluations by environment/subgroup and measure whether attribution stability predicts performance stability under shift; evaluate whether attributions de-emphasize known spurious cues.

  • Multimodal justification (VQA-X/ACT-X; ScienceQA; A-OKVQA; VCR; FEVEROUS): Pair pointing-and-justification checks with cross-modal occlusion; confirm that evidence grounding correlates with localized occlusion effects and counterfactual flip rates.

  • Black-box versus white-box runs: For closed models (e.g., GPT-4-class, Claude, Gemini), rely on input-level and environment tests with comprehensive uncertainty reporting. For open models (Llama, Mixtral, Gemma, Qwen, DeepSeek, Grok), add activation patching/ablation and SAE-based feature interventions where feasible. In both cases, apply HELM-style harnessing and cost-per-point accounting.

These examples illustrate the LIBERTy principle: measure causal faithfulness through convergent, interventional tests matched to supervision signals, and report with enough transparency and power to support credible comparisons.

Conclusion

In a month, ML teams can move beyond plausible-sounding explanations to causally faithful ones by following LIBERTy’s reproducible blueprint. Anchor evaluations in evidence-grounded or process-supervised data, combine input-level perturbations with counterfactual robustness and representation-level mediation, and report with HELM-style transparency and statistical rigor. Whether you have black-box APIs or full white-box access, the framework provides feasible, scalable routes to credible claims about what your model’s explanations actually mean.

Key takeaways:

  • Faithfulness requires interventions; plausibility and attention maps are insufficient without causal tests.
  • Start with datasets that enable causal evaluation: ERASER-style evidence, process supervision, counterfactual pairs, and environment splits.
  • Triangulate metrics: ERASER, deletion–insertion, ROAR, counterfactual flip rates, and mediation via activation patching.
  • Control variance and power: preregister, standardize prompts/decoding, bootstrap CIs, and use mixed-effects models.
  • Ship a full replication package with model/data cards, disaggregated results, and cost-per-point tables 📦.

Next steps: Draft your preregistration this week; assemble datasets with evidence/process supervision; build your perturbation and patching pipelines; run a 2-day pilot for variance; and schedule Week 3’s counterfactual and environment tests. Looking ahead, mechanistic advances like sparse autoencoders and libraries such as TransformerLens will make pathway-level mediation tests more precise, further narrowing the gap between explanation and cause.

Sources & References

  • Towards Faithfully Interpretable NLP Systems (arxiv.org). Defines the distinction between plausibility and faithfulness that motivates LIBERTy's causal evaluation approach.
  • ERASER: A Benchmark to Evaluate Rationalized NLP Predictions (arxiv.org). Provides evidence-grounded datasets and rationale-based metrics (comprehensiveness/sufficiency) used in the playbook.
  • On the (In)fidelity and Sensitivity of Explanations (arxiv.org). Introduces infidelity/sensitivity metrics for checking consistency of explanations under perturbations.
  • A Benchmark for Interpretability Methods in Deep Neural Networks (ROAR) (arxiv.org). Supplies the retrain-after-removal protocol to strengthen causal necessity claims for features.
  • Interpretable Explanations of Black Boxes by Meaningful Perturbations (arxiv.org). Supports on-manifold perturbation design and insertion tests to mitigate deletion artifacts.
  • Axiomatic Attribution for Deep Networks (Integrated Gradients) (arxiv.org). Serves as a principled attribution baseline in the tooling stack.
  • A Unified Approach to Interpreting Model Predictions (SHAP) (arxiv.org). Provides a model-agnostic attribution baseline for black-box settings.
  • "Why Should I Trust You?" Explaining the Predictions of Any Classifier (LIME) (arxiv.org). Adds a widely used model-agnostic attribution method for comparison and sanity checks.
  • RISE: Randomized Input Sampling for Explanation of Black-box Models (arxiv.org). Supports deletion–insertion curve methodology for saliency evaluation.
  • Sanity Checks for Saliency Maps (arxiv.org). Warns about attribution degeneracies; informs Week 2 sanity checks and triangulation guidance.
  • Learning the Difference That Makes a Difference with Counterfactual Examples in NLI (arxiv.org). Provides counterfactual pairs for measuring flip rates and attribution shifts.
  • Invariant Risk Minimization (arxiv.org). Conceptually grounds evaluation of invariance to spurious features across environments.
  • WILDS: A Benchmark of in-the-Wild Distribution Shifts (arxiv.org). Supplies environment/subgroup splits to test explanation stability under distribution shift.
  • e-SNLI: Natural Language Inference with Natural Language Explanations (arxiv.org). Shows plausibility-only rationales that require caution for faithfulness claims.
  • Multimodal Explanations: Justifying Decisions and Pointing to the Evidence (arxiv.org). Provides multimodal datasets with justifications for cross-modal faithfulness tests.
  • HotpotQA: A Dataset for Diverse, Explainable Multi-hop QA (arxiv.org). Evidence-grounded QA dataset for sufficiency/necessity and counterfactual tests.
  • FEVER: a Large-scale Dataset for Fact Extraction and VERification (arxiv.org). Evidence-grounded fact verification dataset used for rationale tests.
  • Training Verifiers to Solve Math Word Problems (GSM8K) (arxiv.org). Process-supervised math benchmark for step-level CoT evaluation.
  • Measuring Mathematical Problem Solving With the MATH Dataset (arxiv.org). Another process-supervised math dataset to evaluate reasoning steps.
  • Chain-of-Thought Prompting Elicits Reasoning in LMs (arxiv.org). Motivates CoT explanations and step-level evaluation protocols.
  • Self-Consistency Improves Chain of Thought Reasoning (arxiv.org). Supports standardized decoding with k-sample self-consistency in variance controls.
  • Attention is not Explanation (arxiv.org). Cautions against interpreting attention as explanation without interventions.
  • Attention is not not Explanation (arxiv.org). Nuances attention as hypothesis rather than definitive explanation, motivating interventional tests.
  • Locating and Editing Factual Associations in GPT (arxiv.org). Backs representation-level interventions (activation/patching) to test mediators.
  • Interpretability Beyond Feature Attribution: Quantitative Testing with TCAV (arxiv.org). Supports concept-level analysis that requires interventional confirmation for causal claims.
  • Network Dissection: Quantifying Interpretability of Deep Visual Representations (arxiv.org). Provides concept-level interpretability tools to bridge features and human concepts.
  • Holistic Evaluation of Language Models (HELM) (arxiv.org). Informs transparent evaluation harnessing, fixed prompts, and reproducible reporting.
  • Model Cards for Model Reporting (arxiv.org). Guides documentation of model capabilities and risks in deliverables.
  • Datasheets for Datasets (arxiv.org). Guides dataset documentation and transparency in the replication package.
  • Data Statements for NLP: Towards Mitigating System Bias and Enabling Better Science (aclanthology.org). Adds standardized data documentation practices for disaggregated reporting.
  • Show Your Work: Improved Reporting of Experimental Results (arxiv.org). Supports power analyses, variance reporting, and mixed-effects modeling practices.
  • Causal Abstractions of Neural Networks (arxiv.org). Provides formal grounding for mediation and pathway analyses in white-box settings.
  • Improving Mathematical Reasoning with Process Supervision (openai.com). Motivates step-level supervision and interventions for evaluating CoT.
  • ReAct: Synergizing Reasoning and Acting in Language Models (arxiv.org). Supports evaluation of tool-use traces via ablation and counterfactual editing.
  • Tracr: Compiled Transformers as a Laboratory for Interpretability (arxiv.org). Offers ground-truth circuits for representational faithfulness tests.
  • Towards Monosemanticity: Decomposing Language Models With Superposition (transformer-circuits.pub). Introduces SAEs to enable feature-level, semantically aligned interventions.
  • TransformerLens (github.com). Provides practical tooling for activation patching and mechanistic probes.
  • ScienceQA: A Large-scale Multi-modal Science Question Answering Dataset (arxiv.org). Supplies multimodal tasks with explanations for cross-modal faithfulness tests.
  • A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge (arxiv.org). Adds multimodal justification tasks to evaluate cross-modal explanations.
  • Visual Commonsense Reasoning (VCR) (arxiv.org). Provides multimodal rationales for evaluating explanation grounding.
  • FEVEROUS: Fact Extraction and VERification Over Unstructured and Structured information (arxiv.org). Extends evidence-grounded verification to tables and text for multimodal evaluation.
  • Contrast Sets: A Test Suite for the NLP Community (arxiv.org). Offers minimally edited pairs to directly test counterfactual dependence.
  • CIFAR-10.1 (github.com). Provides matched-distribution test sets for probing generalization and spurious reliance.
  • GPT-4 Technical Report (arxiv.org). Represents the class of closed models included in the comparative experimental matrix.
  • Anthropic Claude models (www.anthropic.com). Represents a closed-model family considered in LIBERTy's comparative evaluation.
  • Google Gemini models (ai.google.dev). Represents a closed-model family evaluated under the framework.
  • Meta Llama 3 announcement (ai.meta.com). Represents an open-model family included in comparative experiments.
  • Mistral/Mixtral models (mistral.ai). Represents an open-model family considered in experiments.
  • Google Gemma models (ai.google.dev). Represents an open-model family in the comparative matrix.
  • Qwen2 models (github.com). Represents an open-model family included in the LIBERTy evaluation scope.
  • DeepSeek LLM (open models) (github.com). Represents an open-model family for white-box/black-box adaptations.
  • xAI Grok-1 (x.ai). Represents an open-model family potentially evaluated under LIBERTy.
