References

Explainable AI and Feature Attribution

  • Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I, pages 818–833. Springer, 2014.
  • Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4768–4777, 2017.
  • Piotr Dabkowski and Yarin Gal. Real-time image saliency for black box classifiers. In Advances in Neural Information Processing Systems, pages 6970–6979, 2017.
  • Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.
  • Sushant Agarwal, Shahin Jabbari, Chirag Agarwal, Sohini Upadhyay, Steven Wu, and Himabindu Lakkaraju. Towards the unification and robustness of perturbation and gradient based explanations. In International Conference on Machine Learning, pages 110–119. PMLR, 2021.

Data-centric AI and Data Attribution

  • Amirata Ghorbani and James Zou. Data Shapley: Equitable valuation of data for machine learning. In International Conference on Machine Learning, pages 2242–2251. PMLR, 2019.
  • Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In International Conference on Machine Learning, pages 1885–1894. PMLR, 2017.
  • Andrew Ilyas, Sung Min Park, Logan Engstrom, Guillaume Leclerc, and Aleksander Madry. Datamodels: Predicting predictions from training data. In International Conference on Machine Learning, pages 9525–9587. PMLR, 2022.
  • Juhan Bae, Wu Lin, Jonathan Lorraine, and Roger Baker Grosse. Training data attribution via approximate unrolling. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

Mechanistic Interpretability and Component Attribution

  • Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 35:17359–17372, 2022.
  • Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. Advances in Neural Information Processing Systems, 36:16318–16352, 2023.
  • Steven Cao, Victor Sanh, and Alexander Rush. Low-complexity probing via finding subnetworks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 960–966. Association for Computational Linguistics, 2021.
  • Harshay Shah, Andrew Ilyas, and Aleksander Madry. Decomposing and editing predictions by modeling model computation. In International Conference on Machine Learning, pages 44244–44292. PMLR, 2024.

Inherently Interpretable Models

  • John Hewitt, John Thickstun, Christopher D. Manning, and Percy Liang. Backpack language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9103–9125, Toronto, Canada, July 2023. Association for Computational Linguistics. URL https://aclanthology.org/2023.acl-long.506.
  • Aya Abdelsalam Ismail, Julius Adebayo, Hector Corrada Bravo, Stephen Ra, and Kyunghyun Cho. Concept bottleneck generative models. In ICLR, 2024.
  • Been Kim, Martin Wattenberg, Justin Gilmer, Carrie J. Cai, James Wexler, Fernanda B. Viégas, and Rory Sayres. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In ICML, 2018.
  • Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. Concept bottleneck models. In ICML, 2020.
  • Sukrut Rao, Sweta Mahajan, Moritz Böhle, and Bernt Schiele. Discover-then-name: Task-agnostic concept bottlenecks via automated concept discovery. In ECCV, 2024.
  • Chung-En Sun, Tuomas Oikarinen, Berk Ustun, and Tsui-Wei Weng. Concept bottleneck large language models. In ICLR, 2025.
  • Emanuele Marconato, Andrea Passerini, and Stefano Teso. GlanceNets: Interpretable, leak-proof concept-based models. In NeurIPS, 2022.

Others: Explanation Evaluation, Model Editing and Steering, and Explanation Frameworks

  • Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. Sanity checks for saliency maps. Advances in Neural Information Processing Systems, 31, 2018.
  • Blair Bilodeau, Natasha Jaques, Pang Wei Koh, and Been Kim. Impossibility theorems for feature attribution. Proceedings of the National Academy of Sciences, 121(2):e2304406120, 2024.
  • Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405, 2023.
  • Tessa Han, Suraj Srinivas, and Himabindu Lakkaraju. Which explanation should I choose? A function approximation perspective to characterizing post hoc explanations. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
  • Prasad Chalasani, Jiefeng Chen, Amrita Roy Chowdhury, Xi Wu, and Somesh Jha. Concise explanations of neural networks using adversarial training. In International Conference on Machine Learning. PMLR, 2020.
  • Jean-Stanislas Denain and Jacob Steinhardt. Auditing visualizations: Transparency methods struggle to detect anomalous behavior. arXiv preprint arXiv:2206.13498, 2022.
  • Ann-Kathrin Dombrowski. A geometrical perspective on explanations for deep neural networks.
  • Thomas Fel, David Vigouroux, Rémi Cadène, and Thomas Serre. How good is your explanation? Algorithmic stability measures to assess the quality of explanations for deep neural networks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 720–730, 2022.
  • Sara Hooker, Dumitru Erhan, Pieter-Jan Kindermans, and Been Kim. Evaluating feature importance estimates, 2018.
  • Suraj Srinivas, Kyle Matoba, Himabindu Lakkaraju, and François Fleuret. Efficient training of low-curvature neural networks. Advances in Neural Information Processing Systems, 35:25951–25964, 2022.
  • Suraj Srinivas, Sebastian Bordt, and Hima Lakkaraju. Which models have perceptually-aligned gradients? An explanation via off-manifold robustness. arXiv preprint arXiv:2305.19101, 2023.
  • Harshay Shah, Prateek Jain, and Praneeth Netrapalli. Do input gradients highlight discriminative features? Advances in Neural Information Processing Systems, 34:2046–2059, 2021.
  • Andrew Ross and Finale Doshi-Velez. Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.