Explain AI Models: Methods and Opportunities in
Explainable AI, Data-Centric AI, and Mechanistic Interpretability

NeurIPS 2025 Tutorial

Shichang Zhang
Harvard University

Himabindu Lakkaraju
Harvard University

Julius Adebayo
Guide Labs

Overview

Understanding AI system behavior has become critical for safety, trust, and effective deployment across diverse applications. Three major research communities have emerged to address this challenge through interpretability methods: Explainable AI focuses on feature attribution to understand which input features drive model decisions; Data-Centric AI emphasizes data attribution to analyze how training examples shape model behavior; and Mechanistic Interpretability examines component attribution to understand how internal model components contribute to outputs. These three branches share the goal of better understanding AI systems and differ primarily in perspective rather than technique.

The tutorial begins with foundational concepts and historical context: why explainability matters and how the field has evolved since its early days. The first technical deep dive covers post hoc explanation methods, data-centric explanation techniques, and mechanistic interpretability approaches, then presents a unified framework showing that these methods share fundamental building blocks such as perturbations, gradients, and local linear approximations. The second technical deep dive explores inherently interpretable models, clarifying how reasoning (chain-of-thought) LLMs and self-explanatory LLMs relate to explainability, and covering techniques for building inherently interpretable LLMs. We also showcase open source tools that make these methods accessible to practitioners.

Finally, we highlight promising future directions in interpretability and the broader opportunities they open up in AI, with applications in model editing, steering, and regulation. Through comprehensive coverage of algorithms, real-world case studies, and practical guidance, attendees will gain both a deep technical understanding of state-of-the-art methods and the practical skills to apply interpretability techniques effectively in AI applications.
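
To make the unified view concrete, below is a minimal sketch (in PyTorch, with an illustrative toy model and made-up names; it is not code from the tutorial materials) that computes feature attributions for the same model in two of the ways named above: once from gradients and once from a perturbation-based local linear surrogate.

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 1))
x = torch.randn(1, 4)  # a single input to explain

# (1) Gradient-based attribution: differentiate the model output with
# respect to the input and weight by the input ("gradient x input").
x_g = x.clone().requires_grad_(True)
model(x_g).sum().backward()
grad_attr = (x_g.grad * x_g).detach().squeeze()

# (2) Perturbation-based attribution: sample points around x, query the
# model as a black box, and fit a proximity-weighted linear surrogate;
# its coefficients act as local feature attributions.
Z = x + 0.1 * torch.randn(500, 4)              # local perturbations
w = torch.exp(-((Z - x) ** 2).sum(dim=1))      # proximity weights
A = torch.cat([Z, torch.ones(500, 1)], dim=1)  # design matrix with bias
sw = w.sqrt().unsqueeze(1)
with torch.no_grad():
    y = model(Z)                               # black-box model queries
coef = torch.linalg.lstsq(sw * A, sw * y).solution  # weighted least squares

print("gradient x input:    ", grad_attr)
print("local linear weights:", coef[:4].squeeze())

Both routes yield one importance score per input feature; the technical deep dives discuss how these same primitives recur across feature, data, and component attribution.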

Schedule

  • Introduction: Why Explainability (5 minutes)
  • History and Pre-2015 Research (10 minutes)
  • Technical Deep Dive 1 (50 minutes)
  • Post Hoc Explanation
    • Data-Centric Explanation
    • Mechanistic Interpretability
    • A Unified View of Explainability
  • Break and Q&A (15 minutes)
  • Technical Deep Dive 2 (40 minutes)
    • Inherently Interpretable Models
  • Reasoning (Chain-of-Thought) LLMs and "Self-Explanatory" LLMs
    • Inherently Interpretable LLMs
  • Open Source Tools (10 minutes)
  • Conclusion and Future Directions (10 minutes)

Citation

If you find this tutorial useful, please cite:

@misc{zhang2025xai,
    title={Explain AI Models: Methods and Opportunities in Explainable AI, Data-Centric AI, and Mechanistic Interpretability},
    author={Zhang, Shichang and Lakkaraju, Himabindu and Adebayo, Julius},
    year={2025},
    howpublished={NeurIPS 2025 Tutorial},
    url={https://your-tutorial-website.com}
}