Interpretability of Machine Learning Models
Neural network models such as transformers already have an enormous impact on the world, and they will likely be deployed in many more areas over the next decade. A key concern is that, despite delivering impressive results, we do not understand *how* these models work. This matters from an application perspective: if we want to deploy these models in sensitive domains like healthcare or the justice system, it is important to understand how they make decisions. Moreover, understanding how they work might also enable us to mitigate undesirable behavior such as hallucination or overfitting.
Recently, tools from category theory, quantum field theory, and causal inference have been proposed to address this issue. Broadly, the aim of this reading course is to gain an understanding of both the practical and the theoretical foundations of interpretability. In particular, the main focus of this group will be on mechanistic interpretability, both from a theoretical standpoint (using causal inference, category theory, and geometry) and from a practical viewpoint (running a few experiments). We would also like to explore other mathematical tools in interpretable AI, including QFT-based neural networks and neural tangent kernels.
Schedule:
- Thursdays, 10am – 12pm (October 9th – December 11th)
- Location: unless stated otherwise, we will primarily be based in Bayes Center 2.56; however, on several weeks we will have to relocate, so check beforehand!
- Week 1 (09/10/25 – 19 George Square, G.23): Introduction to ML for mathematicians, Tobias Cheung.
– Depending on the background of participants, the first one or two weeks will introduce the necessary ML concepts. In principle, we only assume a background in linear algebra and some basic calculus.
– The first week will focus on basic concepts such as how neural networks are built and trained (see the Week 1 sketch after the schedule).
- Week 2 (16/10/25 – Bayes Center, 3.55): Transformers & LLMs, Djordje Mihajlovic.
– The second ML background week will focus on transformers, the architecture underlying modern LLMs among other systems. We will introduce the attention mechanism on which modern ML systems are built, and discuss how it has enabled the successful development of the LLMs we know today (see the attention sketch after the schedule).
- Week 3: An Introduction to Interpretability, Siddharth Setlur.
– This week will focus on interpreting how transformers model the features of the training data (in the case of LLMs this is often just a corpus of text). A key concept here is the *principle of superposition*, which posits that each neuron encodes multiple features. In order to see how the model encodes features, one needs to deal with this superposition, and one approach is to train a *sparse autoencoder* (see the sketch after the schedule). Recent work has successfully isolated features using this framework, and initial results show that the learned features of LLMs cluster and possibly even lie on higher-dimensional manifolds.
- Week 4:
– There has not been much work on building a theoretical foundation for interpretability, and a mathematical language for interpretability would be useful for developing theoretical results. One approach to this is via causal abstraction, a concept from causal inference where one attempts to understand a fine-grained model with many variables using a coarse-grained model with fewer variables. In our setting the fine-grained model would be the neural network with all its neurons, while a coarse-grained model could be built from human-interpretable concepts (see the interchange-intervention sketch after the schedule). The primary reference for this is
- Week 5:
– Continuation of topic from previous week.
- Week 6 (13/11/25 – 19 George Square, G.23):
– Another (somewhat related) theoretical approach to interpretability is via categorical probability. Here the focus is on separating the syntax of a model (its architecture) from its semantics. This can be modelled by string diagrams and corresponding functors into a semantics category (which represents an interpretation of the model); a toy illustration appears after the schedule. Weeks 6 and 7 will focus on this method, including time to introduce the necessary category theory concepts.
- Week 7:
– Continuation of the topic from the previous week.
- Week 8:
– Another approach to interpretability might be to look at the output of the models (in the case of LLMs, the text they generate). Recent work has attempted to study the *magnitude homology* of categories of text generated by LLMs (see the magnitude sketch after the schedule).
- Week 9: TBD (depending on the interest of attendees; current ideas are QFT-based NNs (see Jim Halverson's notes) or neural tangent kernel methods).
- Week 10 (11/12/25 – Bayes Center, 3.55): TBD
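Code sketches:
The short Python sketches below are illustrative toys written for this page, not excerpts from the papers or lecture notes we will follow.
- Week 1 (neural networks): a minimal sketch of a two-layer network trained by full-batch gradient descent on a toy regression problem, using only NumPy. The architecture, data, and hyperparameters are arbitrary illustrative choices.
```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: learn y = sin(x) on [-pi, pi].
X = rng.uniform(-np.pi, np.pi, size=(256, 1))
y = np.sin(X)

# Parameters of a 1 -> 32 -> 1 network with tanh activation (illustrative sizes).
W1 = rng.normal(0, 0.5, size=(1, 32))
b1 = np.zeros(32)
W2 = rng.normal(0, 0.5, size=(32, 1))
b2 = np.zeros(1)

lr = 0.05
for step in range(2000):
    # Forward pass.
    h = np.tanh(X @ W1 + b1)          # hidden activations, shape (256, 32)
    pred = h @ W2 + b2                # network output, shape (256, 1)
    loss = np.mean((pred - y) ** 2)   # mean squared error

    # Backward pass (chain rule written out by hand).
    d_pred = 2 * (pred - y) / len(X)  # dL/dpred
    dW2 = h.T @ d_pred
    db2 = d_pred.sum(axis=0)
    d_h = d_pred @ W2.T               # dL/dh
    d_pre = d_h * (1 - h ** 2)        # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T @ d_pre
    db1 = d_pre.sum(axis=0)

    # Gradient descent update.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

    if step % 500 == 0:
        print(f"step {step}: loss {loss:.4f}")
```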
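- Week 2 (attention): a minimal sketch of single-head scaled dot-product self-attention. Real transformers add multiple heads, masking, per-layer learned projections, and MLP blocks; all dimensions below are illustrative.
```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over token embeddings X of shape (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                # queries, keys, values: (n, d_k)
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # pairwise attention scores: (n, n)
    scores -= scores.max(axis=-1, keepdims=True)    # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # weighted sums of values: (n, d_k)

rng = np.random.default_rng(0)
n, d, d_k = 5, 8, 4                                 # 5 tokens, model dim 8, head dim 4
X = rng.normal(size=(n, d))                         # toy token embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d_k)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)          # (5, 4)
```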
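- Week 3 (sparse autoencoders): a minimal sketch of a sparse autoencoder with an overcomplete hidden layer and an L1 penalty on its activations, the kind of model used to disentangle superposed features. It is written in PyTorch and trained on random stand-in data; in practice the inputs would be activations collected from a real model, and the dimensions and penalty weight are illustrative.
```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=64, d_hidden=256):
        super().__init__()
        # Overcomplete hidden layer: many more dictionary features than input dimensions.
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        return self.decoder(f), f

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3

# Stand-in for activations collected from a model (e.g. its residual stream).
acts = torch.randn(1024, 64)

for step in range(200):
    recon, feats = sae(acts)
    # Reconstruction error plus an L1 sparsity penalty on the feature activations.
    loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```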
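- Weeks 4–5 (causal abstraction): a minimal sketch of an interchange intervention, the basic experiment used to test whether an intermediate value of a fine-grained model realises a variable of a coarse-grained causal model. Both models here are toy stand-ins chosen for illustration, not the models studied in the reference.
```python
# Low-level "network": computes (a + b) + c through an intermediate value s.
def low_level(a, b, c, patch_s=None):
    s = a + b if patch_s is None else patch_s   # intervene on s if requested
    return s + c

# High-level causal model: one coarse variable S (intended to abstract s), output S + c.
def high_level(S, c):
    return S + c

base, source = (1, 2, 3), (10, 20, 30)

# Run the low-level model on the base input, but with the intermediate s
# patched in from a run on the source input.
s_source = source[0] + source[1]
patched_out = low_level(*base, patch_s=s_source)

# The high-level model predicts the same counterfactual output when its S
# variable is set from the source input: evidence that s realises S.
assert patched_out == high_level(S=s_source, c=base[2])
print(patched_out)   # 33
```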
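- Weeks 6–7 (syntax vs. semantics): a toy illustration of separating a model's architecture from its interpretation. The "diagram" is a purely formal list of named boxes, and an interpretation assigns each box a concrete function, extended to the whole diagram by composition; this is only a very simple stand-in for a functor from a syntax category into a semantics category, and the box names and functions are made up.
```python
# Syntax: a sequential diagram is just a formal list of box names.
diagram = ["embed", "attend", "readout"]

# Semantics: one interpretation assigns each box a concrete function.
semantics = {
    "embed":   lambda x: [float(t) for t in x],       # tokens -> numbers (toy)
    "attend":  lambda v: [sum(v) / len(v)] * len(v),  # mix information across positions
    "readout": lambda v: v[-1],                       # read off the last position
}

def interpret(diagram, semantics):
    """Extend the assignment on boxes to the whole diagram by composition."""
    def run(x):
        for box in diagram:
            x = semantics[box](x)
        return x
    return run

model = interpret(diagram, semantics)
print(model([1, 2, 3]))   # 2.0
```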
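- Week 8 (magnitude): magnitude homology is hard to compute directly, so this sketch only computes the magnitude of a finite metric space, the numerical invariant that magnitude homology categorifies, via the standard weighting equation Z w = 1 with Z_ij = exp(-t·d(x_i, x_j)). The "point cloud" is random stand-in data; an actual experiment would use embeddings of LLM-generated text.
```python
import numpy as np

def magnitude(points, t=1.0):
    """Magnitude of the finite metric space given by `points`, scaled by t > 0."""
    diff = points[:, None, :] - points[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))          # pairwise Euclidean distances
    Z = np.exp(-t * D)                             # similarity matrix Z_ij = exp(-t d_ij)
    w = np.linalg.solve(Z, np.ones(len(points)))   # weighting: Z w = vector of ones
    return w.sum()                                 # magnitude = sum of the weights

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 16))             # stand-in for text embeddings

# As t grows, the magnitude increases from near 1 (points effectively merged)
# towards 50 (points effectively distinct).
for t in (0.1, 1.0, 10.0):
    print(t, magnitude(embeddings, t))
```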
Lecture notes:
- Lecture notes to be added.