200 Concrete Problems In Interpretability Spreadsheet

Note: A nicer-looking Coda document with filter support is now available here, thanks to Kunvar!

The purpose of this spreadsheet is to let people quickly browse problems, find problems they’re interested in, and see which ones are being worked on already. (To sort by difficulty, the most common use case I expect, select column C, then go to Data -> Sort Sheet)

The “Existing Work” column is for completed posts, papers, or other documents. The “Currently Working On” column is for drafts, brainstorms, and people who want to work on it but haven’t produced anything yet. Please add a date if you’re currently working on something so it’s clear if you expressed interest yesterday or two years ago. Write it up as a comment and I'll approve it fairly promptly - people will still see the comment until then.

This sequence is long, which means not all relevant information is contained in this spreadsheet. There is lots of great context for the sequence as a whole and for each section, including motivation and useful resources. Some problems are copied word for word; many are not.

If you are interested in a problem, please take a look at the problem in the original post before deciding to tackle it! Often it includes relevant context or links that didn’t make it into the spreadsheet for space reasons. I’d also recommend looking at the first part of the relevant 200 COP post, before seriously tackling one of its problems.

Huge thanks to Neel Nanda for his work in creating this sequence and building the field.
Category | Subcategory | Difficulty | Number | Problem | Existing Work | Currently Working On (include date!)
Toy Language Models | Understanding neurons | B-C | 1.1 | How far can you get deeply reverse engineering a neuron in a 1L model? 1L is particularly easy since each neuron's output adds directly to the logits.
Toy Language Models | Understanding neurons | B | 1.2 | Find an interesting neuron you think represents a feature. Can you fully reverse engineer which direction should activate that feature, and compare it to the neuron's input direction?
Toy Language Models | Understanding neurons | B | 1.3 | Look for trigram neurons in a 1L model and try to reverse engineer them (e.g. "ice cream" -> "sundae").
Toy Language Models | Understanding neurons | B | 1.4 | Check out the SoLU paper for more ideas on 1L neurons to find and reverse engineer.
Toy Language Models | Understanding neurons | C | 1.5 | How far can you get deeply reverse engineering a neuron in a 2+ layer model?
Toy Language Models | Understanding neurons | A-B | 1.6 | Hunt through Neuroscope for the toy models and look for interesting neurons to focus on.
Toy Language Models | Understanding neurons | A-B | 1.7 | Can you find any polysemantic neurons in Neuroscope? Explore this.
Toy Language Models | Understanding neurons | B | 1.8 | Are there neurons whose behaviour can be matched by a regex or other code? If so, run it on a ton of text and compare the outputs.
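A minimal sketch of one way to start on 1.8, not from the original sequence: it assumes TransformerLens, the toy model name "gelu-1l", and a made-up layer/neuron index and regex hypothesis; swap in whichever neuron and prediction rule you actually care about.

    # Compare a neuron's activations against a regex-based prediction of when it "should" fire.
    # The layer/neuron indices and the regex are placeholders.
    import re
    import torch
    from transformer_lens import HookedTransformer, utils

    model = HookedTransformer.from_pretrained("gelu-1l")  # assumed TransformerLens toy model name
    LAYER, NEURON = 0, 123               # hypothetical neuron of interest
    pattern = re.compile(r"^ ?[0-9]+$")  # hypothetical hypothesis: "fires on number tokens"

    text = "In 1997 the company sold 250 units for 3 dollars each."
    tokens = model.to_tokens(text)
    _, cache = model.run_with_cache(tokens)

    # MLP post-activations have shape [batch, pos, d_mlp]
    acts = cache[utils.get_act_name("post", LAYER)][0, :, NEURON]
    str_tokens = model.to_str_tokens(text)

    for tok, act in zip(str_tokens, acts):
        predicted = bool(pattern.match(tok))
        print(f"{tok!r:>12}  act={act.item():+.3f}  regex_says={predicted}")

In practice you'd run this over a large corpus and look at where the regex and the neuron disagree.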
Toy Language Models | How do larger models differ? | B-C | 1.9 | How do 3-layer and 4-layer attention-only models differ from 2L? (For instance, induction heads only appeared with 2L. Can you find something useful that only appears at 3L or higher?)
Toy Language Models | How do larger models differ? | B | 1.10 | How do 3-layer and 4-layer attention-only models differ from 2L? Look at composition scores - try to identify pairs of heads that compose a lot.
Toy Language Models | How do larger models differ? | B | 1.11 | How do 3-layer and 4-layer attention-only models differ from 2L? Look for evidence of composition.
Toy Language Models | How do larger models differ? | B | 1.12 | How do 3-layer and 4-layer attention-only models differ from 2L? Ablate a single head and run the model on a lot of text. Look at the change in performance. Do any heads matter a lot that aren't induction heads?
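A minimal sketch for 1.12, assuming TransformerLens and the toy model name "attn-only-3l"; the layer/head indices and the text are placeholders, and zero-ablation is used only because it keeps the sketch short (mean-ablation is usually a fairer baseline).

    # Zero-ablate one attention head and measure the change in loss on some text.
    import torch
    from transformer_lens import HookedTransformer, utils

    model = HookedTransformer.from_pretrained("attn-only-3l")  # assumed toy model name
    LAYER, HEAD = 1, 4  # hypothetical head to ablate

    text = "The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog."
    tokens = model.to_tokens(text)

    def ablate_head(z, hook):
        # z has shape [batch, pos, head_index, d_head]
        z[:, :, HEAD, :] = 0.0
        return z

    clean_loss = model(tokens, return_type="loss")
    ablated_loss = model.run_with_hooks(
        tokens,
        return_type="loss",
        fwd_hooks=[(utils.get_act_name("z", LAYER), ablate_head)],
    )
    print(f"clean loss {clean_loss.item():.4f} -> ablated loss {ablated_loss.item():.4f}")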
Toy Language Models | How do larger models differ? | B-C | 1.13 | Look for tasks that an n-layer model can't do but an (n+1)-layer model can, and look for a circuit that explains this. (Start by running both models on a bunch of text and looking for per-token probability differences.)
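A minimal sketch for the suggested starting point in 1.13, assuming TransformerLens, the toy model names "attn-only-2l" and "attn-only-3l", and that the two models share a tokenizer; the text is a placeholder.

    # Run an n-layer and an (n+1)-layer model on the same text and look for tokens
    # where the larger model's log prob is much higher.
    import torch
    from transformer_lens import HookedTransformer

    small = HookedTransformer.from_pretrained("attn-only-2l")
    large = HookedTransformer.from_pretrained("attn-only-3l")

    text = "Mr and Mrs Dursley of number four, Privet Drive, were proud to say that they were perfectly normal."
    tokens = small.to_tokens(text)

    def per_token_logprob(model, tokens):
        logits = model(tokens)  # [batch, pos, d_vocab]
        log_probs = logits.log_softmax(dim=-1)
        # log prob assigned to the *next* token at each position
        return log_probs[0, :-1].gather(-1, tokens[0, 1:, None]).squeeze(-1)

    diff = per_token_logprob(large, tokens) - per_token_logprob(small, tokens)
    str_tokens = small.to_str_tokens(text)[1:]  # predictions are for tokens after the first
    for tok, d in sorted(zip(str_tokens, diff.tolist()), key=lambda x: -x[1])[:10]:
        print(f"{tok!r:>12}  larger model better by {d:+.3f} nats")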
Toy Language Models | How do larger models differ? | B | 1.14 | How do 1L SoLU/GELU models differ from 1L attention-only?
Toy Language Models | How do larger models differ? | B | 1.15 | How do 2L SoLU models differ from 1L?
Toy Language Models | How do larger models differ? | B | 1.16 | How does 1L GELU differ from 1L SoLU?
Toy Language Models | How do larger models differ? | B | 1.17 | Analyse how a larger model "fixes the bugs" of a smaller model.
Toy Language Models | How do larger models differ? | B | 1.18 | Does a 1L MLP transformer fix the skip-trigram bugs of a 1L attention-only model? If so, how?
Toy Language Models | How do larger models differ? | B | 1.19 | Does a 3L attention-only model fix bugs in induction heads in a 2L attention-only model? Try looking at split-token induction, where the current token has a preceding space and is one token, but the earlier occurrence has no preceding space and is two tokens, e.g. " Claire" vs "Cl" + "aire".
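A tiny sketch for building split-token induction prompts (1.19): check how the tokenizer splits the two spellings. The model name is a placeholder, and the exact splits depend on the tokenizer, so treat the printed output as the ground truth rather than the comments.

    from transformer_lens import HookedTransformer

    model = HookedTransformer.from_pretrained("gpt2")
    print(model.to_str_tokens(" Claire", prepend_bos=False))  # often a single token
    print(model.to_str_tokens("Claire", prepend_bos=False))   # often splits, e.g. ['Cl', 'aire']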
Toy Language Models | How do larger models differ? | B | 1.20 | Does a 3L attention-only model fix bugs in induction heads in a 2L attention-only model? Look at misfiring when the previous token appears multiple times with different following tokens.
Toy Language Models | How do larger models differ? | B | 1.21 | Does a 3L attention-only model fix bugs in induction heads in a 2L attention-only model? Look at stopping induction on a token that likely marks the end of a repeated string (e.g. ".", "!", or '"').
Toy Language Models | How do larger models differ? | B | 1.22 | Does a 2L MLP model fix these bugs (1.19-1.21) too?
Toy Language Models | | A-C | 1.23 | Choose your own adventure: take a bunch of text with interesting patterns and run the models over it. Look for tokens they do really well on and try to reverse engineer what's going on!
Circuits In The Wild | Circuits in natural language | B | 2.1 | Look for the induction heads in GPT-2 Small that work with pointer arithmetic. Can you reverse engineer the weights?
Circuits In The Wild | Circuits in natural language | B | 2.2 | Continuing sequences that are common in natural language (e.g. "1 2 3 4" -> "5", "Monday\nTuesday\n" -> "Wednesday"). | Existing work: Related work on heads outputting the next sequence member was done in this post: https://www.lesswrong.com/posts/6tHNM2s6SWzFHv3Wo/mechanistically-interpreting-time-in-gpt-2-small and continued in this paper: https://arxiv.org/abs/2312.09230. Preliminary work on sequence continuation was done during a hackathon: https://alignmentjam.com/project/one-is-1-analyzing-activations-of-numerical-words-vs-digits
Circuits In The Wild | Circuits in natural language | B | 2.3 | A harder example would be numbers at the start of lines, like "1. Blah blah blah\n2. Blah blah blah\n" -> "3". Feels like it must be doing something induction-y!
Circuits In The Wild | Circuits in natural language | B | 2.4 | 3-letter acronyms, like "The Acrobatic Circus Group (ACG) and the Ringmaster Friendship Union (" -> "RFU". | Currently working on: jknowak (Discord) - July 2023
Circuits In The Wild | Circuits in natural language | B | 2.5 | Converting names to emails, like "Katy Johnson <" -> "katy_johnson".
Circuits In The Wild | Circuits in natural language | C | 2.6 | A harder version of 2.5 is constructing an email from a snippet, like "Name: Jess Smith, Email: last name dot first name k @ gmail".
Circuits In The Wild | Circuits in natural language | C | 2.7 | Interpret factual recall. Start with ROME's work with causal tracing, but how much more specific can you get? Heads? Neurons? | Existing work: https://arxiv.org/pdf/2304.14767.pdf
Circuits In The Wild | Circuits in natural language | B | 2.8 | Learning that words after full stops begin with capital letters. | Currently working on: Kyle Cox / 10 April 2024 / contact: kylecox2000@gmail.com
Circuits In The Wild | Circuits in natural language | B-C | 2.9 | Counting objects described in text (e.g. "I picked up an apple, a pear, and an orange. I was holding three fruits.").
Circuits In The Wild | Circuits in natural language | C | 2.10 | Interpreting memorisation. Sometimes GPT knows phone numbers. How?
Circuits In The Wild | Circuits in natural language | B | 2.11 | Reverse engineer an induction head in a non-toy model.
Circuits In The Wild | Circuits in natural language | B | 2.12 | Choosing the right pronouns (e.g. "Lina is a great friend, isn't"). | Existing work: https://cmathw.itch.io/identifying-a-preliminary-circuit-for-predicting-gendered-pronouns-in-gpt-2-smal | Currently working on: Alana Xiang - 5 May 2023
Circuits In The Wild | Circuits in natural language | A-C | 2.13 | Choose your own adventure! Try finding behaviours of your own related to natural language circuits.
Circuits In The Wild | Circuits in code models | B | 2.14 | Closing brackets. Bonus: tracking correct brackets - [, (, {, etc. | Currently working on: Mariam Ihab - 05/01/2024 - attempting to work on this as part of BlueDot's AI Safety Fundamentals project
Circuits In The Wild | Circuits in code models | B | 2.15 | Closing HTML tags.
Circuits In The Wild | Circuits in code models | C | 2.16 | Methods depend on object type (e.g. x.append for a list, x.update for a dictionary).
Circuits In The Wild | Circuits in code models | A-C | 2.17 | Choose your own adventure! Look for interesting patterns in how the model behaves on code and try to reverse engineer something. Algorithmic-flavoured tasks should be easiest.
Circuits In The Wild | Extensions to IOI paper | A | 2.18 | Understand IOI in the Stanford Mistral models. Does the same circuit arise? (You should be able to near-exactly copy Redwood's code for this.) | Existing work: https://docs.google.com/document/d/13bmvy2rhBL8DuZY-ZJYq_VJ37eibdTHy5KBLV11-S9w/edit?usp=sharing
Circuits In The Wild | Extensions to IOI paper | A | 2.19 | Do earlier heads in the circuit (duplicate token, induction, S-inhibition) have backup-style behaviour? If we ablate them, how much does this damage performance? Will other things compensate?
Circuits In The Wild | Extensions to IOI paper | B | 2.20 | Is there a general pattern for backup-ness? (Follows 2.19.)
Circuits In The Wild | Extensions to IOI paper | A | 2.21 | Can we deeply reverse engineer how duplicate token heads work? In particular, how does the QK circuit know to look for copies of the current token without activating on non-duplicates, since the current token is always a copy of itself?
Circuits In The Wild | Extensions to IOI paper | B | 2.22 | Understand IOI in GPT-Neo. It's the same size as GPT-2 Small, but seems to do IOI via MLP composition.
Circuits In The Wild | Extensions to IOI paper | C | 2.23 | What is the role of Negative/Backup/regular Name Mover heads outside IOI? Are there examples where Negative Name Movers contribute positively?
Circuits In The Wild | Extensions to IOI paper | C | 2.24 | Under what conditions does the compensation mechanism occur, where ablating a Name Mover doesn't reduce performance much? Is it due to dropout?
Circuits In The Wild | Extensions to IOI paper | B | 2.25 | GPT-Neo wasn't trained with dropout - check 2.24 on it.
Circuits In The Wild | Extensions to IOI paper | B | 2.26 | Reverse engineer L4H11, a really sharp previous-token head in GPT-2 Small, at the parameter level.
Circuits In The Wild | Extensions to IOI paper | C | 2.27 | MLP layers (beyond the first) seem to matter somewhat for the IOI task. What's up with this?
Circuits In The Wild | Extensions to IOI paper | C | 2.28 | Understand what's happening in the adversarial examples, most notably the S-Inhibition heads' attention patterns (hard).
Circuits In The Wild | Confusing things | B-C | 2.29 | Why do models have so many induction heads? How do they specialise, and why does the model need so many? | Currently working on: Jordan Taylor - 2023 November 7 (though not investigating very deeply) - contact: jordantensor@gmail.com
Circuits In The Wild | Confusing things | B | 2.30 | Why is GPT-2 Small's performance ruined if the first MLP layer is ablated?
Circuits In The Wild | Confusing things | B-C | 2.31 | Can we find evidence for the "residual stream as shared bandwidth" hypothesis?
Circuits In The Wild | Confusing things | B | 2.32 | Can we find evidence for the "residual stream as shared bandwidth" hypothesis? In particular, the idea that the model dedicates parameters to memory management and to cleaning up memory once it's used. Are there neurons with a high negative cosine similarity between their input and output directions (so the output erases the input feature)? Do they correspond to cleaning up specific features?
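A minimal sketch for the cosine-similarity check in 2.32, assuming TransformerLens; the model name is a placeholder.

    # For every MLP neuron, compute the cosine similarity between its input direction
    # (a column of W_in) and its output direction (a row of W_out). Strongly negative
    # values are candidate "memory management" neurons that erase what they read.
    import torch
    from transformer_lens import HookedTransformer

    model = HookedTransformer.from_pretrained("gpt2")  # placeholder; any model works

    # W_in: [n_layers, d_model, d_mlp], W_out: [n_layers, d_mlp, d_model]
    w_in = model.W_in.transpose(1, 2)   # -> [n_layers, d_mlp, d_model]
    w_out = model.W_out                 #    [n_layers, d_mlp, d_model]
    cos = torch.nn.functional.cosine_similarity(w_in, w_out, dim=-1)  # [n_layers, d_mlp]

    vals, idx = cos.flatten().sort()
    for flat in idx[:10]:  # ten most negative neurons
        layer, neuron = divmod(flat.item(), model.cfg.d_mlp)
        print(f"L{layer}N{neuron}: cos = {cos[layer, neuron].item():+.3f}")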
Circuits In The Wild | Confusing things | B | 2.33 | What happens to the memory in an induction circuit? (See 2.32.)
Circuits In The Wild | Studying larger models | B-C | 2.34 | GPT-J contains translation heads. Can you interpret how they work and what they do?
Circuits In The Wild | Studying larger models | C | 2.35 | Try to find and reverse engineer fancier induction heads like pattern-matching heads - try GPT-J or GPT-NeoX.
Circuits In The Wild | Studying larger models | C-D | 2.36 | What's up with few-shot learning? How does it work?
Circuits In The Wild | Studying larger models | C | 2.37 | How does addition work? (Focus on 2-digit.) | Existing work: Transformer addition is now explained in https://philipquirke.github.io/transformer-maths/2023/10/14/Understanding-Addition.html. Here are a couple of LessWrong posts about 2-digit subtraction - plus/minus algorithm: https://www.lesswrong.com/posts/pbj6tTZyakodxC9Ho/how-does-a-toy-2-digit-subtraction-transformer-predict-the and difference algorithm: https://www.lesswrong.com/posts/RABp7ZMw2FGwh4odq/how-does-a-toy-2-digit-subtraction-transformer-predict-the-1
Circuits In The Wild | Studying larger models | C | 2.38 | What's up with Tim Dettmers' emergent features in the residual stream? Do they map to anything interpretable? What if we look at max activating dataset examples?
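A minimal sketch for the max-activating-dataset-examples idea in 2.38, assuming TransformerLens; the model name, layer, residual dimension, and tiny inline "dataset" are all placeholders.

    # Find max activating dataset examples for one residual stream dimension.
    import torch
    from transformer_lens import HookedTransformer, utils

    model = HookedTransformer.from_pretrained("gpt2")
    LAYER, DIM = 6, 138  # hypothetical residual stream dimension to study

    dataset = [
        "The cat sat on the mat.",
        "import numpy as np",
        "To be or not to be, that is the question.",
    ]

    records = []
    for text in dataset:
        tokens = model.to_tokens(text)
        _, cache = model.run_with_cache(tokens)
        resid = cache[utils.get_act_name("resid_post", LAYER)][0, :, DIM]  # [pos]
        str_tokens = model.to_str_tokens(text)
        for pos, act in enumerate(resid.tolist()):
            records.append((act, str_tokens[pos], text))

    for act, tok, text in sorted(records, reverse=True)[:5]:
        print(f"{act:+.2f}  token={tok!r}  in: {text}")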
Interpreting Algorithmic Problems | Beginner problems | A | 3.1 | Sorting fixed-length lists. (Format: START 4 6 2 9 MID 2 4 6 9) | Existing work: https://github.com/MatthewBaggins/one-attention-head-is-all-you-need/
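A minimal sketch of data generation for 3.1 in the "START 4 6 2 9 MID 2 4 6 9" format; the vocabulary layout and list length are arbitrary choices, not something specified in the spreadsheet.

    import torch

    LIST_LEN = 4
    START, MID = 10, 11  # special token ids after the digit tokens 0-9

    def make_batch(batch_size: int) -> torch.Tensor:
        unsorted = torch.randint(0, 10, (batch_size, LIST_LEN))
        sorted_, _ = unsorted.sort(dim=-1)
        start = torch.full((batch_size, 1), START)
        mid = torch.full((batch_size, 1), MID)
        return torch.cat([start, unsorted, mid, sorted_], dim=-1)  # [batch, 2*LIST_LEN + 2]

    batch = make_batch(3)
    print(batch)
    # Train a small transformer on this, computing loss only on the positions after MID.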
Interpreting Algorithmic Problems | Beginner problems | A | 3.2 | Sorting variable-length lists. (What's the sorting algorithm? What's the longest list you can get it to sort? How does length affect accuracy?)
Interpreting Algorithmic Problems | Beginner problems | A | 3.3 | | Existing work: (July 10 2023) Max over a list: https://colab.research.google.com/drive/1WdvPyO-bB6l-iWq8SYjiovHp5R3834wN?usp=sharing and (July 10 2023) formalising in Coq: https://github.com/JasonGross/neural-net-coq-interp/blob/main/theories/max.v | Currently working on: Bart Bussmann (May 16th 2023)
Interpreting Algorithmic Problems | Beginner problems | A | 3.4 | Interpret a 1L transformer with MLPs trained to do modular subtraction (analogous to Neel's grokking work).
Interpreting Algorithmic Problems | Beginner problems | A | 3.5 | Taking the minimum or maximum of two ints. | Existing work: https://colab.research.google.com/drive/1N4iPEyBVuctveCA0Zre92SpfgH6nmHXY | Currently working on: July 10 2023 - max over a list; some work towards formalising in Coq
Interpreting Algorithmic Problems | Beginner problems | A | 3.6 | Permuting lists.
Interpreting Algorithmic Problems | Beginner problems | A | 3.7 | Calculating sequences with a Fibonacci-style recurrence (predicting the next element from the previous two).
Interpreting Algorithmic Problems | Harder problems | B | 3.8 | 5-digit addition/subtraction. | Existing work: Transformer addition is now explained in https://philipquirke.github.io/transformer-maths/2023/10/14/Understanding-Addition.html
Interpreting Algorithmic Problems | Harder problems | B | 3.9 | Predicting the output of a simple code function, e.g. problems like "a = 1 2 3. a[2] = 4. a -> 1 2 4". | Existing work: Code
Interpreting Algorithmic Problems | Harder problems | B | 3.10 | Graph theory problems like this. Unsure of the correct input format - try a bunch. See here.
Interpreting Algorithmic Problems | Harder problems | B | 3.11 | Train a model on multiple algorithmic tasks we understand (like modular addition and subtraction). Compare to a model trained on each task. Does it learn the same circuits? Is there superposition? | Existing work: This paper may tackle the problem of multiple tasks: https://arxiv.org/pdf/2402.16726.pdf | Currently working on: I (Evan Anders, evanhanders@ucsb.edu) am working on this with 5-digit addition and subtraction as of Nov 6 2023. Might add more operations, too. Happy to collaborate with anyone who's interested, just email :)
Interpreting Algorithmic Problems | Harder problems | B | 3.12 | Train models for automata tasks and interpret them. Do your results match the theory? | Existing work: https://arxiv.org/abs/2402.11917
Interpreting Algorithmic Problems | Harder problems | B | 3.13 | In-Context Linear Regression - the transformer gets a sequence (x_1, y_1, x_2, y_2, ...) where y_i = Ax_i + b. A and b are different for each prompt and need to be learned in-context. (Code here)
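A minimal sketch of prompt generation for 3.13, with a fresh A and b per prompt. The dimensions and sequence length are arbitrary; the code linked from the spreadsheet is the authoritative setup.

    import torch

    def make_prompt(n_points: int = 16, d_x: int = 4, d_y: int = 4) -> torch.Tensor:
        A = torch.randn(d_y, d_x)
        b = torch.randn(d_y)
        xs = torch.randn(n_points, d_x)
        ys = xs @ A.T + b
        # Interleave x_1, y_1, x_2, y_2, ... as one sequence of vectors.
        # If d_x != d_y you would need to pad them to a common width first.
        return torch.stack([xs, ys], dim=1).reshape(2 * n_points, d_x)

    prompt = make_prompt()
    print(prompt.shape)  # [32, 4]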
Interpreting Algorithmic Problems | Harder problems | C | 3.14 | Problems in the style of In-Context Linear Regression that are learned in-context. See 3.13.
Interpreting Algorithmic Problems | Harder problems | C | 3.15 | 5-digit (or binary) multiplication.
Interpreting Algorithmic Problems | Harder problems | B | 3.16 | Predict repeated subsequences in randomly generated tokens, and see if you can find and reverse engineer induction heads.
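A minimal sketch for 3.16, assuming TransformerLens and the toy model name "attn-only-2l"; it feeds a repeated random sequence to the model and scores each head on the classic induction attention pattern (attending from a token in the second half to the token after that token's first occurrence).

    import torch
    from transformer_lens import HookedTransformer

    model = HookedTransformer.from_pretrained("attn-only-2l")  # assumed toy model name
    SEQ_LEN = 50

    rand = torch.randint(100, model.cfg.d_vocab, (1, SEQ_LEN))
    # Assumes the tokenizer defines a BOS token; drop this if it doesn't.
    bos = torch.full((1, 1), model.tokenizer.bos_token_id)
    tokens = torch.cat([bos, rand, rand], dim=-1)  # BOS + sequence + same sequence again

    _, cache = model.run_with_cache(tokens)
    for layer in range(model.cfg.n_layers):
        pattern = cache["pattern", layer]  # [batch, head, dest_pos, src_pos]
        # Destination position i in the second half should attend to source i - SEQ_LEN + 1.
        induction_stripe = pattern[0].diagonal(offset=-(SEQ_LEN - 1), dim1=-2, dim2=-1)
        score = induction_stripe[:, -SEQ_LEN:].mean(dim=-1)  # average over second-half positions
        for head, s in enumerate(score.tolist()):
            print(f"L{layer}H{head}: induction score {s:.2f}")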
Interpreting Algorithmic Problems | Harder problems | B-C | 3.17 | Choose your own adventure! Find your own algorithmic problem. LeetCode easy is probably a good source.
Interpreting Algorithmic Problems | | B | 3.18 | Build a toy model of Indirect Object Identification - train a tiny attention-only model on an algorithmic task simulating IOI - and reverse engineer the learned solution. Compare it to the circuit found in GPT-2 Small.
Interpreting Algorithmic Problems | | C | 3.19 | Is 3.18 consistent across random seeds, or can other algorithms be learned? Can a 2L model learn this? What happens if you add more MLPs or more layers?
Interpreting Algorithmic Problems | | C | 3.20 | Reverse engineer Othello-GPT. Can you reverse engineer the algorithms it learns, or the features the probes find?
Interpreting Algorithmic Problems | Questions about language models | A | 3.21 | Train a 1L attention-only transformer with rotary embeddings to predict the previous token and reverse engineer how it does this. | Currently working on: 5/7/23: Eric (repo: https://github.com/DKdekes/rotary-interp)
Interpreting Algorithmic Problems | Questions about language models | B | 3.22 | Train a 3L attention-only transformer to perform the Indirect Object Identification task. Can it do the task? Does it learn the same circuit found in GPT-2 Small?
Interpreting Algorithmic Problems | Questions about language models | B | 3.23 | Redo Neel's modular addition analysis with GELU. Does it change things?
Interpreting Algorithmic Problems | Questions about language models | C | 3.24 | How does memorisation work? Try training a one-hidden-layer MLP to memorise random data, or training a transformer on a fixed set of random strings of tokens.
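A minimal sketch of the first suggestion in 3.24: train a one-hidden-layer MLP to memorise random labels on random inputs. All sizes and hyperparameters are arbitrary.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    N, D_IN, D_HIDDEN, N_CLASSES = 512, 32, 256, 10
    xs = torch.randn(N, D_IN)
    ys = torch.randint(0, N_CLASSES, (N,))  # random labels: nothing to generalise, only memorise

    mlp = nn.Sequential(nn.Linear(D_IN, D_HIDDEN), nn.ReLU(), nn.Linear(D_HIDDEN, N_CLASSES))
    opt = torch.optim.Adam(mlp.parameters(), lr=1e-3)

    for step in range(5000):
        loss = nn.functional.cross_entropy(mlp(xs), ys)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if step % 1000 == 0:
            acc = (mlp(xs).argmax(-1) == ys).float().mean()
            print(f"step {step}: loss {loss.item():.3f}, train acc {acc.item():.2%}")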
Interpreting Algorithmic Problems | Questions about language models | B-C | 3.25 | Compare different dimensionality reduction techniques on modular addition or on a problem you feel you understand.
Interpreting Algorithmic Problems | Questions about language models | B | 3.26 | In modular addition, look at what different dimensionality reduction techniques do on the different weight matrices. Can you identify which weights matter most? Which neurons form clusters for each frequency? Anything from the activations?
Interpreting Algorithmic Problems | Questions about language models | C | 3.27 | Is direct logit attribution always useful? Can you find examples where it's highly misleading?
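A crude sketch of direct logit attribution for 3.27, assuming TransformerLens; the prompt/answer pair is a placeholder. It dots each layer's attention and MLP output at the final position with the unembedding direction of the correct next token, and deliberately ignores the final LayerNorm's scaling; simplifications like that are exactly the sort of thing that can make DLA misleading, which is part of the exercise.

    import torch
    from transformer_lens import HookedTransformer, utils

    model = HookedTransformer.from_pretrained("gpt2")
    prompt, answer = "The Eiffel Tower is in the city of", " Paris"
    tokens = model.to_tokens(prompt)
    answer_token = model.to_tokens(answer, prepend_bos=False)[0, 0]

    _, cache = model.run_with_cache(tokens)
    logit_dir = model.W_U[:, answer_token]  # [d_model]

    for layer in range(model.cfg.n_layers):
        attn_out = cache[utils.get_act_name("attn_out", layer)][0, -1]
        mlp_out = cache[utils.get_act_name("mlp_out", layer)][0, -1]
        print(f"layer {layer}: attn {(attn_out @ logit_dir).item():+.2f}, "
              f"mlp {(mlp_out @ logit_dir).item():+.2f}")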
Interpreting Algorithmic Problems | Deep learning mysteries | D | 3.28 | Explore the Lottery Ticket Hypothesis.
Interpreting Algorithmic Problems | Deep learning mysteries | D | 3.29 | Explore Deep Double Descent.
Interpreting Algorithmic Problems | Extending Othello-GPT | A | 3.30 | Try one of Neel's concrete Othello-GPT projects.
Interpreting Algorithmic Problems | Extending Othello-GPT | B-C | 3.31 | Looking for modular circuits - try to find the circuits used to compute the world model and to use the world model to compute the next move. Try to understand each in isolation and use this to understand how they fit together. See what you can learn about finding modular circuits in general. | Existing work: Conceptually: https://www.alignmentforum.org/posts/nwLQt4e7bstCyPEXs/internal-interfaces-are-a-high-priority-interpretability
Interpreting Algorithmic Problems | Extending Othello-GPT | B-C | 3.32 | Neuron Interpretability and Studying Superposition - try to understand the model's MLP neurons, and explore what techniques do and don't work. Try to build our understanding of transformer MLPs in general.
Interpreting Algorithmic Problems | Extending Othello-GPT | B-C | 3.33 | Transformer Circuits Laboratory - explore and test other conjectures about transformer circuits, e.g. can we figure out how the model manages memory in the residual stream?
Exploring Polysemanticity and Superposition | Confusions to study in Toy Models of Superposition | A | 4.1 | Does dropout create a privileged basis? Put dropout on the hidden layer of the ReLU output model and study how this changes the results. | Existing work: Lewis Smith thinks the answer is yes: see this post for their results. | Currently working on: 14 April 2023: Kunvar (firstuserhere)
Exploring Polysemanticity and Superposition | Confusions to study in Toy Models of Superposition | B-C | 4.2 | Replicate their absolute value model and study some of the variants of the ReLU output models. | Currently working on: May 4, 2023 - Kunvar (firstuserhere)
Exploring Polysemanticity and Superposition | Confusions to study in Toy Models of Superposition | B | 4.3 | Explore neuron superposition by training their absolute value model on a more complex function like x -> x^2.
Exploring Polysemanticity and Superposition | Confusions to study in Toy Models of Superposition | B | 4.4 | What happens to their ReLU output model when there's non-uniform sparsity? E.g. one class of less sparse features and another of very sparse features.