In this talk, I propose to discuss the problem of building explainable AI through two approaches: causal vs. correlational.
I will talk about what mechanistic interpretability (mech interp) is in large language models like Gemma: a way to understand how models answer questions by looking inside them and checking which neurons activate, and when.
I will discuss circuit-tracer, a Python module open-sourced by Anthropic, and the Neuronpedia portal, which helps us find neurons linked to real-world concepts. We will examine specific prompts on transformers and trace the various paths and intermediate features that lead the model to its output. (I find this genuinely fascinating.)
I will also talk about my own work on mech interp tooling (modelrecon): the "activation cube" data structure (not a standard; I came up with it) as a means to share and visualize activation data, and the "counterfactual" library I am working on to correctly implement intervention testing.
I will also cover the basic problem of causal vs. correlational techniques and the limitations of correlational analysis.
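To make the "activation cube" idea concrete, here is a minimal sketch of one possible layout: a single 3-D array indexed as [layer, token position, neuron], so one forward pass's worth of activations can be saved, shared, and sliced along any axis. The layout and dimensions here are my own illustration of the concept, not the actual modelrecon format.

```python
import numpy as np

# Hypothetical "activation cube": one 3-D array holding every activation
# from a single forward pass, indexed [layer, token_position, neuron].
n_layers, n_tokens, d_model = 4, 6, 8
cube = np.zeros((n_layers, n_tokens, d_model), dtype=np.float32)

# Pretend we recorded activations during a forward pass.
rng = np.random.default_rng(0)
cube[:] = rng.standard_normal(cube.shape)

# Slicing answers common interpretability questions directly:
per_layer = cube[2]          # everything at layer 2 -> (tokens, neurons)
per_neuron = cube[:, :, 5]   # neuron 5 across every layer and token
per_token = cube[:, -1, :]   # the final token's state at every layer

print(per_layer.shape, per_neuron.shape, per_token.shape)
# (6, 8) (4, 6) (4, 8)
```

The appeal of a single dense array is that sharing activation data becomes a matter of saving one `.npy` file, and visualization tools can slice it in whichever direction a question requires.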
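The causal-vs-correlational distinction can be shown on a toy model: correlational analysis only observes that a unit fires alongside an output, while a causal test ablates the unit and checks whether the output actually changes. All weights below are made up for illustration; this is a sketch of the idea of intervention testing, not the counterfactual library's API.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.standard_normal((4, 8))
W2 = rng.standard_normal((8, 2))

def forward(x, ablate_unit=None):
    """Tiny one-hidden-layer model with an optional intervention."""
    h = np.maximum(x @ W1, 0)        # hidden activations (ReLU)
    if ablate_unit is not None:
        h = h.copy()
        h[ablate_unit] = 0.0         # intervention: zero out one unit
    return h @ W2, h

x = rng.standard_normal(4)
logits, h = forward(x)

# Correlational evidence: unit 3 is active alongside the output
# (suggestive only — it tells us nothing about influence).
print("unit 3 activation:", h[3])

# Causal evidence: re-run with unit 3 ablated and measure the effect.
ablated_logits, _ = forward(x, ablate_unit=3)
print("causal effect on logits:", logits - ablated_logits)
```

If the causal effect is near zero even though the unit fires strongly, the correlation was misleading, which is exactly the limitation of purely correlational techniques.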
## Why Explainability Matters (2-3 mins)
We need to understand why AI models make certain choices, not just what answers they give. Without this, the model feels like a black box. Here I will include an example from human behavior.
---
## What Transformers Hide (2-3 mins)
I will talk about the transformer's basic internal steps and the features that are hard to see, highlighting that tools only show the final output, not the thinking process. In fact, I will point out that it comes as a surprise to most people that we don't know how models "actually" arrive at specific answers.
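Those hidden internal steps can be sketched with a toy, untrained stand-in for a transformer: each block reads the residual stream, computes an update, and adds it back. Only the final logits are surfaced; every intermediate `resid` below is the hidden state that normal tooling never shows. The matrices here are random placeholders, not real attention or MLP layers.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_layers = 5, 16, 3
resid = rng.standard_normal((n_tokens, d_model))  # embedded input

hidden_states = [resid.copy()]  # what a user of the model never sees
for _ in range(n_layers):
    W_attn = rng.standard_normal((d_model, d_model)) * 0.1
    W_mlp = rng.standard_normal((d_model, d_model)) * 0.1
    resid = resid + resid @ W_attn                 # stand-in for attention
    resid = resid + np.maximum(resid @ W_mlp, 0)   # stand-in for the MLP
    hidden_states.append(resid.copy())

logits = resid @ rng.standard_normal((d_model, 10))  # unembedding
print(len(hidden_states), logits.shape)  # 4 (5, 10)
```

The point of the sketch is the shape of the computation: four versions of the residual stream exist along the way, but only the final logits reach the user.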
---
## How Circuit Tracer Helps (3-5 mins)
I will talk about how Anthropic's Circuit Tracer shows the model's inside connections.
It turns hidden activations into easy-to-understand features and shows how they link together. It's not that easy at first, but we can get used to the graphs (like the operator in The Matrix who could understand the world just by looking at the raw code). I will show some graphs and walk through a reasoning path in Colab.
---
## Seeing the Reasoning Path (10+ mins)
The tool draws a clear path from
input → inner features → final output
This lets everyone see which parts of the model caused the answer. This part should be fun, as the paths a model takes are sometimes weird.
---
## Why This Is Important (2 mins)
With this method, we can:
* check if the model is behaving safely
* fix mistakes inside the model
* build trust by seeing how it thinks
I will finish with a description of some of my work on activation cube data, the PyTorch hook mechanism, and other options for logging activation data.
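The hook-based logging mentioned above can be sketched in a few lines: `register_forward_hook` captures each module's output during a normal forward pass. The tiny model and layer names here are placeholders, not the real modelrecon setup.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4)
)

activations = {}  # name -> tensor, one entry per hooked module

def make_hook(name):
    def hook(module, inputs, output):
        activations[name] = output.detach().clone()
    return hook

handles = [m.register_forward_hook(make_hook(f"layer_{i}"))
           for i, m in enumerate(model)]

with torch.no_grad():
    model(torch.randn(1, 8))

for h in handles:   # always remove hooks when done logging
    h.remove()

print(sorted(activations))           # ['layer_0', 'layer_1', 'layer_2']
print(activations["layer_0"].shape)  # torch.Size([1, 16])
```

The captured tensors could then be stacked into a structure like the activation cube, or streamed to disk; hooks are just one logging option alongside patched forward methods or tracing.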
## Key Takeaways
* The idea of explainability in AI
* Tools and techniques available
* Current trends in XAI
* Open-source tooling available
My Slides:
https://docs.google.com/presentation/d/1FNd37jW3nB95lko2imfk6A7VGVG0S0H53hUgWocYJ1g/edit?usp=sharing
Code:
https://github.com/modelrecon