In this talk, I will explain what circuit tracing is in large language models like Gemma. It's a new way to understand how models answer questions by looking inside them and seeing which neurons activate, and when. Anthropic open sourced a library called circuit-tracer,
which works together with the website Neuronpedia to help us find neurons linked to real-world concepts like "Texas" or "capital".
I'll show how this works with a live demo from their Jupyter notebook. We'll see what nodes and supernodes are, and how they connect to form reasoning paths in the model. This can help us debug models, understand them, and make them safer. We will see how AI can become more transparent with the help of FOSS tools!
Slides:
https://docs.google.com/presentation/d/1FNd37jW3nB95lko2imfk6A7VGVG0S0H53hUgWocYJ1g/edit?usp=sharing
Circuit tracing helps us understand what’s going on inside AI models.
Instead of guessing how a model arrives at an answer, we can now follow the reasoning path neuron by neuron.
Nodes and supernodes are like the Lego blocks of AI thinking.
A single neuron is a node, and a group of related nodes forms a supernode representing a concept, such as "Texas".
You can trace which parts of the model fire for each part of the prompt,
for example, how "Dallas" leads to "Austin" through internal circuits.
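To make that concrete, here is a minimal, purely illustrative Python sketch of nodes, supernodes, and a traced path from "Dallas" to "Austin". This is not the circuit-tracer API; every class, label, and edge below is invented just to show the shape of the idea, whereas the real tool computes an attribution graph from the model's activations.

```python
# Illustrative sketch only -- toy data structures, NOT the circuit-tracer API.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    """A single neuron/feature that fires at some layer of the model."""
    layer: int
    index: int
    label: str  # human-readable description, e.g. looked up on Neuronpedia

@dataclass
class Supernode:
    """A group of related nodes that together stand for one concept."""
    concept: str
    nodes: list = field(default_factory=list)

# Made-up supernodes for the "Dallas -> Austin" example.
texas = Supernode("Texas", [Node(8, 1234, "mentions of Texas"),
                            Node(9, 567, "the US state of Texas")])
capital = Supernode("say a capital", [Node(14, 4321, "capital cities")])
austin = Supernode("Austin", [Node(20, 99, "the word 'Austin'")])
supernodes = {s.concept: s for s in (texas, capital, austin)}

# Made-up edges of a tiny attribution graph: which concept feeds which.
edges = {"Dallas": "Texas", "Texas": "say a capital", "say a capital": "Austin"}

def trace(start, graph):
    """Follow the edges to sketch a reasoning path through the supernodes."""
    path = [start]
    while path[-1] in graph:
        path.append(graph[path[-1]])
    return path

print(" -> ".join(trace("Dallas", edges)))  # Dallas -> Texas -> say a capital -> Austin
print(len(supernodes["Texas"].nodes), "nodes grouped under the 'Texas' supernode")
```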
Anthropic’s circuit-tracer library
and the Neuronpedia website are powerful open source tools.
Anyone can use them to experiment and explore model internals.
FOSS makes advanced AI safety research accessible.
Now we can explore, patch, and visualize how LLMs work.
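"Patch" here refers to activation patching: run the model on a clean prompt and on a corrupted one, then splice the clean activations into the corrupted run to see which internal signal carries the answer. Below is a generic PyTorch sketch of that pattern on a toy two-layer network; it is not circuit-tracer code, and the model and inputs are made up for illustration.

```python
# Generic activation-patching pattern on a toy model (illustrative only).
import torch
import torch.nn as nn

torch.manual_seed(0)

# A stand-in "model": two linear layers with a nonlinearity in between.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

clean_input = torch.randn(1, 4)    # e.g. the prompt we care about
corrupt_input = torch.randn(1, 4)  # e.g. a counterfactual prompt

# 1) Record the hidden activation from the clean run.
cache = {}
def save_hook(module, inputs, output):
    cache["hidden"] = output.detach()

handle = model[1].register_forward_hook(save_hook)
clean_out = model(clean_input)
handle.remove()

# 2) Re-run on the corrupted input, but patch in the clean hidden activation.
def patch_hook(module, inputs, output):
    return cache["hidden"]  # returning a tensor replaces this layer's output

handle = model[1].register_forward_hook(patch_hook)
patched_out = model(corrupt_input)
handle.remove()

corrupt_out = model(corrupt_input)

# If the patched output matches the clean output, the patched layer
# carries the signal we were tracing.
print("clean:  ", clean_out)
print("corrupt:", corrupt_out)
print("patched:", patched_out)
```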
This technique helps with safety, debugging, and understanding model failures.
Imagine catching a wrong answer before it happens — by watching the circuit.
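As a toy illustration of that idea (not a feature of any existing tool), one could check whether the concepts we expect actually fired before trusting an answer; the activation scores below are made-up numbers.

```python
# Hypothetical "circuit watchdog": flag an answer if the expected
# supernodes never activated during the trace (illustrative only).
active_supernodes = {"Texas": 0.91, "say a capital": 0.05, "Austin": 0.04}

def missing_concepts(expected, activations, threshold=0.5):
    """Return the concepts that should have fired but did not."""
    return [c for c in expected if activations.get(c, 0.0) < threshold]

missing = missing_concepts(["Texas", "say a capital"], active_supernodes)
if missing:
    print("Warning: answer produced without activating:", missing)
```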
Even students and indie devs can now do deep interpretability research.
You don’t need a lab — just a notebook, model weights, and curiosity.
This is an interesting talk. Will people be able to follow along on their own computers?