Talk
Advanced

XAI: Tracing Hooks to Peek inside Transformers

Rejected

Session Description

In this talk, I propose to discuss the problem of building explainable AI, contrasting two approaches: causal vs. correlational.

I will explain what mechanistic interpretability ("mech interp") means for large language models like Gemma: a way to understand how models answer questions by looking inside them and checking which neurons activate, and when.

I will discuss circuit-tracer, a Python module that Anthropic open-sourced, along with the Neuronpedia portal, which helps us find neurons linked to real-world concepts. We will examine specific prompts on transformers and trace the paths and intermediate features that lead the model to its output. (I find this part genuinely fascinating.)

I will also talk about my own work on mech interp tooling (modelrecon): an "activation cube" data structure (my own design, not a standard) for sharing and visualizing activation data, and the "counterfactual" library I am building to correctly implement intervention testing.
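To make the idea concrete, here is a minimal sketch of what an "activation cube" could look like. The axis order, shape, and dtype here are my illustrative assumptions for this sketch, not the actual modelrecon format:

```python
import numpy as np

# Illustrative "activation cube": a 3-D array indexed by
# (layer, token position, hidden unit). Axis order and dtype
# are assumptions for this sketch, not modelrecon's real format.
n_layers, n_tokens, d_model = 4, 6, 32
cube = np.zeros((n_layers, n_tokens, d_model), dtype=np.float32)

# Typical slices an interpretability tool might expose:
layer_slice = cube[2]           # all activations at layer 2   -> (6, 32)
token_column = cube[:, 0, :]    # first token across layers    -> (4, 32)
neuron_trace = cube[:, :, 17]   # one unit over layers/tokens  -> (4, 6)

print(layer_slice.shape, token_column.shape, neuron_trace.shape)
```

The appeal of a single dense array is that every question ("what fired at layer 2?", "how does unit 17 evolve across the prompt?") becomes a one-line slice.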

Finally, I will cover the basic problem of causal vs. correlational techniques, and the limitations of the correlational approach.
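A toy example of why correlational evidence can mislead, and why intervention testing fixes it (this example is my own illustration, not from the talk materials):

```python
import numpy as np

# Toy setup: the output depends causally on feature f0 only;
# f1 merely correlates with f0 in the observed data.
rng = np.random.default_rng(0)
f0 = rng.normal(size=1000)
f1 = f0 + 0.01 * rng.normal(size=1000)  # near-copy of f0
y = 3.0 * f0                            # true causal mechanism

# Correlational view: f1 looks just as "important" as f0.
print(np.corrcoef(f1, y)[0, 1])  # very close to 1.0

# Causal view: intervene on each feature independently.
def model(f0, f1):
    return 3.0 * f0  # f1 has no causal effect

print(model(1.0, 0.0) - model(0.0, 0.0))  # 3.0 -> f0 is causal
print(model(0.0, 1.0) - model(0.0, 0.0))  # 0.0 -> f1 is not
```

This is the same logic behind activation patching in mech interp: instead of asking "which neurons fire together with the answer?", we change a neuron's value and see whether the answer changes.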

## Why Explainability Matters (2-3 min)

We need to understand why AI models make certain choices, not just what answers they give. Without this, the model feels like a black box. Here I will include an analogy from human behavior.

---

## What Transformers Hide (2-3 min)

I will walk through a transformer's basic internal steps and the features that are hard to see, highlighting that most tools only show the final output, not the thinking process. In fact, it often comes as a surprise to non-specialists that we don't know how models "actually" arrive at specific answers.

---

## How Circuit Tracer Helps (3-5 min)

I will talk about how Anthropic's Circuit Tracer reveals the connections inside the model.

It turns hidden activations into easy-to-understand features and shows how they link together. It is not easy at first, but you can get used to reading the graphs (like the operators in The Matrix, who can read the scene straight from the falling code). I will show some graphs and walk through a reasoning path on Colab.

---

## Seeing the Reasoning Path (10+ min)

The tool draws a clear path from

input → inner features → final output

This lets everyone see which parts of the model caused the answer. This part should be fun, because the paths a model takes are sometimes weird.
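The idea of attributing an answer to inner features can be sketched with a toy one-hidden-layer model. This is my simplification for intuition; real circuit tracing works on learned transformer features, not a toy like this:

```python
import numpy as np

# Toy pipeline: input -> inner features -> output score.
rng = np.random.default_rng(1)
W1 = rng.normal(size=(5, 8))   # input (8 dims) -> 5 hidden features
W2 = rng.normal(size=5)        # hidden features -> scalar output

x = rng.normal(size=8)
h = np.maximum(W1 @ x, 0.0)    # inner feature activations
score = W2 @ h                 # final output

# With a linear readout, the score decomposes exactly into
# per-feature contributions, so we can rank "what caused the answer".
contrib = W2 * h
assert np.isclose(contrib.sum(), score)
top = int(np.argmax(np.abs(contrib)))
print(f"feature {top} contributed {contrib[top]:.3f} of {score:.3f}")
```

Circuit tracing generalizes this decomposition across many layers, producing the input → features → output graphs shown in the talk.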

---

## Why This Is Important (2 min)

With this method, we can:

* check if the model is behaving safely

* fix mistakes inside the model

* build trust by seeing how it thinks

I will finish by describing some of my work on activation-cube data, the PyTorch hook mechanism, and other options for logging activation data.
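The PyTorch hook mechanism mentioned above can be sketched with the standard `register_forward_hook` API. The toy model and layer names here are illustrative, not modelrecon's actual code:

```python
import torch
import torch.nn as nn

# Toy two-layer model standing in for a transformer block.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

captured = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Store a detached copy of this layer's output activation.
        captured[name] = output.detach().clone()
    return hook

# Register a forward hook on each submodule we want to trace.
handles = [layer.register_forward_hook(make_hook(f"layer_{i}"))
           for i, layer in enumerate(model)]

with torch.no_grad():
    model(torch.randn(2, 8))   # one forward pass fills `captured`

for h in handles:
    h.remove()                 # always detach hooks when done

print(sorted(captured))           # ['layer_0', 'layer_1', 'layer_2']
print(captured["layer_0"].shape)  # torch.Size([2, 16])
```

The activations collected this way are exactly the kind of data an activation-cube structure would store, one slice per hooked layer.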

Key Takeaways

  • The idea of explainability in AI

  • Tools and techniques available

  • Current trends in XAI

  • Open source tooling available

My Slides:

https://docs.google.com/presentation/d/1FNd37jW3nB95lko2imfk6A7VGVG0S0H53hUgWocYJ1g/edit?usp=sharing

Code:
https://github.com/modelrecon

Session Categories

Tutorial about using a FOSS project
Introducing a FOSS project or a new version of a popular project
Technology architecture

Speakers

Viraj Sharma
Student Presidium School Indirapuram
https://sharmaviraj.com

Reviews

Approvability: 0% (Approvals: 0, Rejections: 1, Not Sure: 0)

Reviewer #1 (Rejected): "I'm not sure how this fits at a FOSS conference"