200 Concrete Problems In Interpretability Spreadsheet

Note: A nicer-looking Coda document with filter support is now available here, thanks to Kunvar!

The purpose of this spreadsheet is to let people quickly browse problems, find problems they’re interested in, and see which ones are being worked on already. (To sort by difficulty, the most common use case I expect, select column C, then go to Data -> Sort Sheet)

The “Existing Work” column is for completed posts, papers, or other documents. The “Currently Working On” column is for drafts, brainstorms, and people who want to work on it but haven’t produced anything yet. Please add a date if you’re currently working on something so it’s clear if you expressed interest yesterday or two years ago. Write it up as a comment and I'll approve it fairly promptly - people will still see the comment until then.

This sequence is long, which means not all relevant information is contained in this spreadsheet. There is lots of great context for the sequence as a whole and for each section, including motivation and useful resources. Some problems are copied word for word; many are not.

If you are interested in a problem, please take a look at the problem in the original post before deciding to tackle it! Often it includes relevant context or links that didn’t make it into the spreadsheet for space reasons. I’d also recommend looking at the first part of the relevant 200 COP post, before seriously tackling one of its problems.

Huge thanks to Neel Nanda for his work in creating this sequence and building the field.
Category | Subcategory | Difficulty | Number | Problem | Existing Work | Currently Working On (include date!)
Toy Language Models | Understanding neurons | B-C | 1.1 | How far can you get deeply reverse engineering a neuron in a 1L model? 1L is particularly easy since each neuron's output adds directly to the logits.
Toy Language Models | Understanding neurons | B | 1.2 | Find an interesting neuron you think represents a feature. Can you fully reverse engineer which direction should activate that feature, and compare it to the neuron's input direction?
Toy Language Models | Understanding neurons | B | 1.3 | Look for trigram neurons in a 1L model and try to reverse engineer them (e.g. "ice cream" -> "sundae").
Toy Language Models | Understanding neurons | B | 1.4 | Check out the SoLU paper for more ideas on 1L neurons to find and reverse engineer.
Toy Language Models | Understanding neurons | C | 1.5 | How far can you get deeply reverse engineering a neuron in a 2+ layer model?
Toy Language Models | Understanding neurons | A-B | 1.6 | Hunt through Neuroscope for the toy models and look for interesting neurons to focus on.
Toy Language Models | Understanding neurons | A-B | 1.7 | Can you find any polysemantic neurons in Neuroscope? Explore this.
Toy Language Models | Understanding neurons | B | 1.8 | Are there neurons whose behaviour can be matched by a regex or other code? If so, run it on a ton of text and compare the outputs.
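A minimal sketch of one way to start on 1.8, not from the original sequence: it assumes TransformerLens, the toy model name "gelu-1l", and a made-up layer/neuron index and regex hypothesis; swap in whichever neuron and prediction rule you actually care about.

    # Compare a neuron's activations against a regex-based prediction of when it "should" fire.
    # The layer/neuron indices and the regex are placeholders.
    import re
    import torch
    from transformer_lens import HookedTransformer, utils

    model = HookedTransformer.from_pretrained("gelu-1l")  # assumed TransformerLens toy model name
    LAYER, NEURON = 0, 123               # hypothetical neuron of interest
    pattern = re.compile(r"^ ?[0-9]+$")  # hypothetical hypothesis: "fires on number tokens"

    text = "In 1997 the company sold 250 units for 3 dollars each."
    tokens = model.to_tokens(text)
    _, cache = model.run_with_cache(tokens)

    # MLP post-activations have shape [batch, pos, d_mlp]
    acts = cache[utils.get_act_name("post", LAYER)][0, :, NEURON]
    str_tokens = model.to_str_tokens(text)

    for tok, act in zip(str_tokens, acts):
        predicted = bool(pattern.match(tok))
        print(f"{tok!r:>12}  act={act.item():+.3f}  regex_says={predicted}")

In practice you'd run this over a large corpus and look at where the regex and the neuron disagree.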
Toy Language Models | How do larger models differ? | B-C | 1.9 | How do 3-layer and 4-layer attention-only models differ from 2L? (For instance, induction heads only appeared with 2L. Can you find something useful that only appears at 3L or higher?)
Toy Language Models | How do larger models differ? | B | 1.10 | How do 3-layer and 4-layer attention-only models differ from 2L? Look at composition scores - try to identify pairs of heads that compose a lot.
Toy Language Models | How do larger models differ? | B | 1.11 | How do 3-layer and 4-layer attention-only models differ from 2L? Look for evidence of composition.
Toy Language Models | How do larger models differ? | B | 1.12 | How do 3-layer and 4-layer attention-only models differ from 2L? Ablate a single head and run the model on a lot of text. Look at the change in performance. Do any heads matter a lot that aren't induction heads?
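A minimal sketch for 1.12, assuming TransformerLens and the toy model name "attn-only-3l"; the layer/head indices and the text are placeholders, and zero-ablation is used only because it keeps the sketch short (mean-ablation is usually a fairer baseline).

    # Zero-ablate one attention head and measure the change in loss on some text.
    import torch
    from transformer_lens import HookedTransformer, utils

    model = HookedTransformer.from_pretrained("attn-only-3l")  # assumed toy model name
    LAYER, HEAD = 1, 4  # hypothetical head to ablate

    text = "The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog."
    tokens = model.to_tokens(text)

    def ablate_head(z, hook):
        # z has shape [batch, pos, head_index, d_head]
        z[:, :, HEAD, :] = 0.0
        return z

    clean_loss = model(tokens, return_type="loss")
    ablated_loss = model.run_with_hooks(
        tokens,
        return_type="loss",
        fwd_hooks=[(utils.get_act_name("z", LAYER), ablate_head)],
    )
    print(f"clean loss {clean_loss.item():.4f} -> ablated loss {ablated_loss.item():.4f}")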
Toy Language Models | How do larger models differ? | B-C | 1.13 | Look for tasks that an n-layer model can't do but an (n+1)-layer model can, and look for a circuit that explains this. (Start by running both models on a bunch of text and looking for per-token probability differences.)
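A minimal sketch for the suggested starting point in 1.13, assuming TransformerLens, the toy model names "attn-only-2l" and "attn-only-3l", and that the two models share a tokenizer; the text is a placeholder.

    # Run an n-layer and an (n+1)-layer model on the same text and look for tokens
    # where the larger model's log prob is much higher.
    import torch
    from transformer_lens import HookedTransformer

    small = HookedTransformer.from_pretrained("attn-only-2l")
    large = HookedTransformer.from_pretrained("attn-only-3l")

    text = "Mr and Mrs Dursley of number four, Privet Drive, were proud to say that they were perfectly normal."
    tokens = small.to_tokens(text)

    def per_token_logprob(model, tokens):
        logits = model(tokens)  # [batch, pos, d_vocab]
        log_probs = logits.log_softmax(dim=-1)
        # log prob assigned to the *next* token at each position
        return log_probs[0, :-1].gather(-1, tokens[0, 1:, None]).squeeze(-1)

    diff = per_token_logprob(large, tokens) - per_token_logprob(small, tokens)
    str_tokens = small.to_str_tokens(text)[1:]  # predictions are for tokens after the first
    for tok, d in sorted(zip(str_tokens, diff.tolist()), key=lambda x: -x[1])[:10]:
        print(f"{tok!r:>12}  larger model better by {d:+.3f} nats")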
Toy Language Models | How do larger models differ? | B | 1.14 | How do 1L SoLU/GELU models differ from 1L attention-only?
Toy Language Models | How do larger models differ? | B | 1.15 | How do 2L SoLU models differ from 1L?
Toy Language Models | How do larger models differ? | B | 1.16 | How does 1L GELU differ from 1L SoLU?
Toy Language Models | How do larger models differ? | B | 1.17 | Analyse how a larger model "fixes the bugs" of a smaller model.
Toy Language Models | How do larger models differ? | B | 1.18 | Does a 1L MLP transformer fix the skip-trigram bugs of a 1L attention-only model? If so, how?
Toy Language Models | How do larger models differ? | B | 1.19 | Does a 3L attention-only model fix bugs in induction heads in a 2L attention-only model? Try looking at split-token induction, where the current token has a preceding space and is one token, but the earlier occurrence has no preceding space and is two tokens, e.g. " Claire" vs "Cl" + "aire".
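A tiny sketch for building split-token induction prompts (1.19): check how the tokenizer splits the two spellings. The model name is a placeholder, and the exact splits depend on the tokenizer, so treat the printed output as the ground truth rather than the comments.

    from transformer_lens import HookedTransformer

    model = HookedTransformer.from_pretrained("gpt2")
    print(model.to_str_tokens(" Claire", prepend_bos=False))  # often a single token
    print(model.to_str_tokens("Claire", prepend_bos=False))   # often splits, e.g. ['Cl', 'aire']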
Toy Language Models | How do larger models differ? | B | 1.20 | Does a 3L attention-only model fix bugs in induction heads in a 2L attention-only model? Look at misfiring when the previous token appears multiple times with different following tokens.
Toy Language Models | How do larger models differ? | B | 1.21 | Does a 3L attention-only model fix bugs in induction heads in a 2L attention-only model? Look at stopping induction on a token that likely marks the end of a repeated string (e.g. ".", "!", or '"').
Toy Language Models | How do larger models differ? | B | 1.22 | Does a 2L MLP model fix these bugs (1.19-1.21) too?
Toy Language Models | | A-C | 1.23 | Choose your own adventure: take a bunch of text with interesting patterns and run the models over it. Look for tokens they do really well on and try to reverse engineer what's going on!
Circuits In The Wild | Circuits in natural language | B | 2.1 | Look for the induction heads in GPT-2 Small that work with pointer arithmetic. Can you reverse engineer the weights?
Circuits In The Wild | Circuits in natural language | B | 2.2 | Continuing sequences that are common in natural language (e.g. "1 2 3 4" -> "5", "Monday\nTuesday\n" -> "Wednesday"). | Existing work: Related work on heads outputting the next sequence member was done in this post: https://www.lesswrong.com/posts/6tHNM2s6SWzFHv3Wo/mechanistically-interpreting-time-in-gpt-2-small and continued in this paper: https://arxiv.org/abs/2312.09230. Preliminary work on sequence continuation was done during a hackathon: https://alignmentjam.com/project/one-is-1-analyzing-activations-of-numerical-words-vs-digits
Circuits In The Wild | Circuits in natural language | B | 2.3 | A harder example would be numbers at the start of lines, like "1. Blah blah blah\n2. Blah blah blah\n" -> "3". Feels like it must be doing something induction-y!
Circuits In The Wild | Circuits in natural language | B | 2.4 | 3-letter acronyms, like "The Acrobatic Circus Group (ACG) and the Ringmaster Friendship Union (" -> "RFU". | Currently working on: jknowak (Discord) - July 2023
Circuits In The Wild | Circuits in natural language | B | 2.5 | Converting names to emails, like "Katy Johnson <" -> "katy_johnson".
Circuits In The Wild | Circuits in natural language | C | 2.6 | A harder version of 2.5 is constructing an email from a snippet, like "Name: Jess Smith, Email: last name dot first name k @ gmail".
Circuits In The Wild | Circuits in natural language | C | 2.7 | Interpret factual recall. Start with ROME's work with causal tracing, but how much more specific can you get? Heads? Neurons? | Existing work: https://arxiv.org/pdf/2304.14767.pdf
Circuits In The Wild | Circuits in natural language | B | 2.8 | Learning that words after full stops begin with capital letters. | Currently working on: Kyle Cox / 10 April 2024 / contact: kylecox2000@gmail.com
Circuits In The Wild | Circuits in natural language | B-C | 2.9 | Counting objects described in text (e.g. "I picked up an apple, a pear, and an orange. I was holding three fruits.").
Circuits In The Wild | Circuits in natural language | C | 2.10 | Interpreting memorisation. Sometimes GPT knows phone numbers. How?
Circuits In The Wild | Circuits in natural language | B | 2.11 | Reverse engineer an induction head in a non-toy model.
Circuits In The Wild | Circuits in natural language | B | 2.12 | Choosing the right pronouns (e.g. "Lina is a great friend, isn't"). | Existing work: https://cmathw.itch.io/identifying-a-preliminary-circuit-for-predicting-gendered-pronouns-in-gpt-2-smal | Currently working on: Alana Xiang - 5 May 2023
Circuits In The Wild | Circuits in natural language | A-C | 2.13 | Choose your own adventure! Try finding behaviours of your own related to natural language circuits.
Circuits In The Wild | Circuits in code models | B | 2.14 | Closing brackets. Bonus: tracking correct brackets - [, (, {, etc. | Currently working on: Mariam Ihab - 05/01/2024 - attempting to work on this as part of BlueDot's AI Safety Fundamentals project
Circuits In The Wild | Circuits in code models | B | 2.15 | Closing HTML tags.
Circuits In The Wild | Circuits in code models | C | 2.16 | Methods depend on object type (e.g. x.append for a list, x.update for a dictionary).
Circuits In The Wild | Circuits in code models | A-C | 2.17 | Choose your own adventure! Look for interesting patterns in how the model behaves on code and try to reverse engineer something. Algorithmic-flavoured tasks should be easiest.
Circuits In The Wild | Extensions to IOI paper | A | 2.18 | Understand IOI in the Stanford Mistral models. Does the same circuit arise? (You should be able to near-exactly copy Redwood's code for this.) | Existing work: https://docs.google.com/document/d/13bmvy2rhBL8DuZY-ZJYq_VJ37eibdTHy5KBLV11-S9w/edit?usp=sharing
Circuits In The Wild | Extensions to IOI paper | A | 2.19 | Do earlier heads in the circuit (duplicate token, induction, S-inhibition) have backup-style behaviour? If we ablate them, how much does this damage performance? Will other things compensate?
Circuits In The Wild | Extensions to IOI paper | B | 2.20 | Is there a general pattern for backup-ness? (Follows 2.19.)
Circuits In The Wild | Extensions to IOI paper | A | 2.21 | Can we deeply reverse engineer how duplicate token heads work? In particular, how does the QK circuit know to look for copies of the current token without activating on non-duplicates, since the current token is always a copy of itself?
Circuits In The Wild | Extensions to IOI paper | B | 2.22 | Understand IOI in GPT-Neo. It's the same size as GPT-2 Small, but seems to do IOI via MLP composition.
Circuits In The Wild | Extensions to IOI paper | C | 2.23 | What is the role of Negative/Backup/regular Name Mover heads outside IOI? Are there examples where Negative Name Movers contribute positively?
Circuits In The Wild | Extensions to IOI paper | C | 2.24 | Under what conditions does the compensation mechanism occur, where ablating a Name Mover doesn't reduce performance much? Is it due to dropout?
Circuits In The Wild | Extensions to IOI paper | B | 2.25 | GPT-Neo wasn't trained with dropout - check 2.24 on it.
Circuits In The Wild | Extensions to IOI paper | B | 2.26 | Reverse engineer L4H11, a really sharp previous-token head in GPT-2 Small, at the parameter level.
Circuits In The Wild | Extensions to IOI paper | C | 2.27 | MLP layers (beyond the first) seem to matter somewhat for the IOI task. What's up with this?
Circuits In The Wild | Extensions to IOI paper | C | 2.28 | Understand what's happening in the adversarial examples, most notably the S-Inhibition heads' attention patterns (hard).
Circuits In The Wild | Confusing things | B-C | 2.29 | Why do models have so many induction heads? How do they specialise, and why does the model need so many? | Currently working on: Jordan Taylor - 2023 November 7 (though not investigating very deeply) - contact: jordantensor@gmail.com
Circuits In The Wild | Confusing things | B | 2.30 | Why is GPT-2 Small's performance ruined if the first MLP layer is ablated?
Circuits In The Wild | Confusing things | B-C | 2.31 | Can we find evidence for the "residual stream as shared bandwidth" hypothesis?
Circuits In The Wild | Confusing things | B | 2.32 | Can we find evidence for the "residual stream as shared bandwidth" hypothesis? In particular, the idea that the model dedicates parameters to memory management and to cleaning up memory once it's used. Are there neurons with a high negative cosine similarity between their input and output directions (so the output erases the input feature)? Do they correspond to cleaning up specific features?
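A minimal sketch for the cosine-similarity check in 2.32, assuming TransformerLens; the model name is a placeholder.

    # For every MLP neuron, compute the cosine similarity between its input direction
    # (a column of W_in) and its output direction (a row of W_out). Strongly negative
    # values are candidate "memory management" neurons that erase what they read.
    import torch
    from transformer_lens import HookedTransformer

    model = HookedTransformer.from_pretrained("gpt2")  # placeholder; any model works

    # W_in: [n_layers, d_model, d_mlp], W_out: [n_layers, d_mlp, d_model]
    w_in = model.W_in.transpose(1, 2)   # -> [n_layers, d_mlp, d_model]
    w_out = model.W_out                 #    [n_layers, d_mlp, d_model]
    cos = torch.nn.functional.cosine_similarity(w_in, w_out, dim=-1)  # [n_layers, d_mlp]

    vals, idx = cos.flatten().sort()
    for flat in idx[:10]:  # ten most negative neurons
        layer, neuron = divmod(flat.item(), model.cfg.d_mlp)
        print(f"L{layer}N{neuron}: cos = {cos[layer, neuron].item():+.3f}")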
Circuits In The Wild | Confusing things | B | 2.33 | What happens to the memory in an induction circuit? (See 2.32.)
Circuits In The Wild | Studying larger models | B-C | 2.34 | GPT-J contains translation heads. Can you interpret how they work and what they do?
Circuits In The Wild | Studying larger models | C | 2.35 | Try to find and reverse engineer fancier induction heads like pattern-matching heads - try GPT-J or GPT-NeoX.
Circuits In The Wild | Studying larger models | C-D | 2.36 | What's up with few-shot learning? How does it work?
Circuits In The Wild | Studying larger models | C | 2.37 | How does addition work? (Focus on 2-digit.) | Existing work: Transformer addition is now explained in https://philipquirke.github.io/transformer-maths/2023/10/14/Understanding-Addition.html. Here are a couple of LessWrong posts about 2-digit subtraction - plus/minus algorithm: https://www.lesswrong.com/posts/pbj6tTZyakodxC9Ho/how-does-a-toy-2-digit-subtraction-transformer-predict-the and difference algorithm: https://www.lesswrong.com/posts/RABp7ZMw2FGwh4odq/how-does-a-toy-2-digit-subtraction-transformer-predict-the-1
Circuits In The Wild | Studying larger models | C | 2.38 | What's up with Tim Dettmers' emergent features in the residual stream? Do they map to anything interpretable? What if we look at max activating dataset examples?
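A minimal sketch for the max-activating-dataset-examples idea in 2.38, assuming TransformerLens; the model name, layer, residual dimension, and tiny inline "dataset" are all placeholders.

    # Find max activating dataset examples for one residual stream dimension.
    import torch
    from transformer_lens import HookedTransformer, utils

    model = HookedTransformer.from_pretrained("gpt2")
    LAYER, DIM = 6, 138  # hypothetical residual stream dimension to study

    dataset = [
        "The cat sat on the mat.",
        "import numpy as np",
        "To be or not to be, that is the question.",
    ]

    records = []
    for text in dataset:
        tokens = model.to_tokens(text)
        _, cache = model.run_with_cache(tokens)
        resid = cache[utils.get_act_name("resid_post", LAYER)][0, :, DIM]  # [pos]
        str_tokens = model.to_str_tokens(text)
        for pos, act in enumerate(resid.tolist()):
            records.append((act, str_tokens[pos], text))

    for act, tok, text in sorted(records, reverse=True)[:5]:
        print(f"{act:+.2f}  token={tok!r}  in: {text}")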
Interpreting Algorithmic Problems | Beginner problems | A | 3.1 | Sorting fixed-length lists. (Format: START 4 6 2 9 MID 2 4 6 9) | Existing work: https://github.com/MatthewBaggins/one-attention-head-is-all-you-need/
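A minimal sketch of data generation for 3.1 in the "START 4 6 2 9 MID 2 4 6 9" format; the vocabulary layout and list length are arbitrary choices, not something specified in the spreadsheet.

    import torch

    LIST_LEN = 4
    START, MID = 10, 11  # special token ids after the digit tokens 0-9

    def make_batch(batch_size: int) -> torch.Tensor:
        unsorted = torch.randint(0, 10, (batch_size, LIST_LEN))
        sorted_, _ = unsorted.sort(dim=-1)
        start = torch.full((batch_size, 1), START)
        mid = torch.full((batch_size, 1), MID)
        return torch.cat([start, unsorted, mid, sorted_], dim=-1)  # [batch, 2*LIST_LEN + 2]

    batch = make_batch(3)
    print(batch)
    # Train a small transformer on this, computing loss only on the positions after MID.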
Interpreting Algorithmic Problems | Beginner problems | A | 3.2 | Sorting variable-length lists. (What's the sorting algorithm? What's the longest list you can get it to sort? How does length affect accuracy?)
Interpreting Algorithmic Problems | Beginner problems | A | 3.3 | | Existing work: (July 10 2023) Max over a list: https://colab.research.google.com/drive/1WdvPyO-bB6l-iWq8SYjiovHp5R3834wN?usp=sharing and (July 10 2023) formalising in Coq: https://github.com/JasonGross/neural-net-coq-interp/blob/main/theories/max.v | Currently working on: Bart Bussmann (May 16th 2023)
Interpreting Algorithmic Problems | Beginner problems | A | 3.4 | Interpret a 1L transformer with MLPs trained to do modular subtraction (analogous to Neel's grokking work).
Interpreting Algorithmic Problems | Beginner problems | A | 3.5 | Taking the minimum or maximum of two ints. | Existing work: https://colab.research.google.com/drive/1N4iPEyBVuctveCA0Zre92SpfgH6nmHXY | Currently working on: July 10 2023 - max over a list; some work towards formalising in Coq
Interpreting Algorithmic Problems | Beginner problems | A | 3.6 | Permuting lists.
Interpreting Algorithmic Problems | Beginner problems | A | 3.7 | Calculating sequences with a Fibonacci-style recurrence (predicting the next element from the previous two).
Interpreting Algorithmic Problems | Harder problems | B | 3.8 | 5-digit addition/subtraction. | Existing work: Transformer addition is now explained in https://philipquirke.github.io/transformer-maths/2023/10/14/Understanding-Addition.html
Interpreting Algorithmic Problems | Harder problems | B | 3.9 | Predicting the output of a simple code function, e.g. problems like "a = 1 2 3. a[2] = 4. a -> 1 2 4". | Existing work: Code
Interpreting Algorithmic Problems | Harder problems | B | 3.10 | Graph theory problems like this. Unsure of the correct input format - try a bunch. See here.
Interpreting Algorithmic Problems | Harder problems | B | 3.11 | Train a model on multiple algorithmic tasks we understand (like modular addition and subtraction). Compare to a model trained on each task. Does it learn the same circuits? Is there superposition? | Existing work: This paper may tackle the problem of multiple tasks: https://arxiv.org/pdf/2402.16726.pdf | Currently working on: I (Evan Anders, evanhanders@ucsb.edu) am working on this with 5-digit addition and subtraction as of Nov 6 2023. Might add more operations, too. Happy to collaborate with anyone who's interested, just email :)
Interpreting Algorithmic Problems | Harder problems | B | 3.12 | Train models for automata tasks and interpret them. Do your results match the theory? | Existing work: https://arxiv.org/abs/2402.11917
Interpreting Algorithmic Problems | Harder problems | B | 3.13 | In-Context Linear Regression - the transformer gets a sequence (x_1, y_1, x_2, y_2, ...) where y_i = Ax_i + b. A and b are different for each prompt and need to be learned in-context. (Code here)
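A minimal sketch of prompt generation for 3.13, with a fresh A and b per prompt. The dimensions and sequence length are arbitrary; the code linked from the spreadsheet is the authoritative setup.

    import torch

    def make_prompt(n_points: int = 16, d_x: int = 4, d_y: int = 4) -> torch.Tensor:
        A = torch.randn(d_y, d_x)
        b = torch.randn(d_y)
        xs = torch.randn(n_points, d_x)
        ys = xs @ A.T + b
        # Interleave x_1, y_1, x_2, y_2, ... as one sequence of vectors.
        # If d_x != d_y you would need to pad them to a common width first.
        return torch.stack([xs, ys], dim=1).reshape(2 * n_points, d_x)

    prompt = make_prompt()
    print(prompt.shape)  # [32, 4]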
Interpreting Algorithmic Problems | Harder problems | C | 3.14 | Problems in the style of In-Context Linear Regression that are learned in-context. See 3.13.
Interpreting Algorithmic Problems | Harder problems | C | 3.15 | 5-digit (or binary) multiplication.
Interpreting Algorithmic Problems | Harder problems | B | 3.16 | Predict repeated subsequences in randomly generated tokens, and see if you can find and reverse engineer induction heads.
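A minimal sketch for 3.16, assuming TransformerLens and the toy model name "attn-only-2l"; it feeds a repeated random sequence to the model and scores each head on the classic induction attention pattern (attending from a token in the second half to the token after that token's first occurrence).

    import torch
    from transformer_lens import HookedTransformer

    model = HookedTransformer.from_pretrained("attn-only-2l")  # assumed toy model name
    SEQ_LEN = 50

    rand = torch.randint(100, model.cfg.d_vocab, (1, SEQ_LEN))
    # Assumes the tokenizer defines a BOS token; drop this if it doesn't.
    bos = torch.full((1, 1), model.tokenizer.bos_token_id)
    tokens = torch.cat([bos, rand, rand], dim=-1)  # BOS + sequence + same sequence again

    _, cache = model.run_with_cache(tokens)
    for layer in range(model.cfg.n_layers):
        pattern = cache["pattern", layer]  # [batch, head, dest_pos, src_pos]
        # Destination position i in the second half should attend to source i - SEQ_LEN + 1.
        induction_stripe = pattern[0].diagonal(offset=-(SEQ_LEN - 1), dim1=-2, dim2=-1)
        score = induction_stripe[:, -SEQ_LEN:].mean(dim=-1)  # average over second-half positions
        for head, s in enumerate(score.tolist()):
            print(f"L{layer}H{head}: induction score {s:.2f}")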
Interpreting Algorithmic Problems | Harder problems | B-C | 3.17 | Choose your own adventure! Find your own algorithmic problem. LeetCode easy is probably a good source.
Interpreting Algorithmic Problems | | B | 3.18 | Build a toy model of Indirect Object Identification - train a tiny attention-only model on an algorithmic task simulating IOI - and reverse engineer the learned solution. Compare it to the circuit found in GPT-2 Small.
Interpreting Algorithmic Problems | | C | 3.19 | Is 3.18 consistent across random seeds, or can other algorithms be learned? Can a 2L model learn this? What happens if you add more MLPs or more layers?
Interpreting Algorithmic Problems | | C | 3.20 | Reverse engineer Othello-GPT. Can you reverse engineer the algorithms it learns, or the features the probes find?
Interpreting Algorithmic Problems | Questions about language models | A | 3.21 | Train a 1L attention-only transformer with rotary embeddings to predict the previous token and reverse engineer how it does this. | Currently working on: 5/7/23: Eric (repo: https://github.com/DKdekes/rotary-interp)
Interpreting Algorithmic Problems | Questions about language models | B | 3.22 | Train a 3L attention-only transformer to perform the Indirect Object Identification task. Can it do the task? Does it learn the same circuit found in GPT-2 Small?
Interpreting Algorithmic Problems | Questions about language models | B | 3.23 | Redo Neel's modular addition analysis with GELU. Does it change things?
Interpreting Algorithmic Problems | Questions about language models | C | 3.24 | How does memorisation work? Try training a one-hidden-layer MLP to memorise random data, or training a transformer on a fixed set of random strings of tokens.
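A minimal sketch of the first suggestion in 3.24: train a one-hidden-layer MLP to memorise random labels on random inputs. All sizes and hyperparameters are arbitrary.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    N, D_IN, D_HIDDEN, N_CLASSES = 512, 32, 256, 10
    xs = torch.randn(N, D_IN)
    ys = torch.randint(0, N_CLASSES, (N,))  # random labels: nothing to generalise, only memorise

    mlp = nn.Sequential(nn.Linear(D_IN, D_HIDDEN), nn.ReLU(), nn.Linear(D_HIDDEN, N_CLASSES))
    opt = torch.optim.Adam(mlp.parameters(), lr=1e-3)

    for step in range(5000):
        loss = nn.functional.cross_entropy(mlp(xs), ys)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if step % 1000 == 0:
            acc = (mlp(xs).argmax(-1) == ys).float().mean()
            print(f"step {step}: loss {loss.item():.3f}, train acc {acc.item():.2%}")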
Interpreting Algorithmic Problems | Questions about language models | B-C | 3.25 | Compare different dimensionality reduction techniques on modular addition or on a problem you feel you understand.
Interpreting Algorithmic Problems | Questions about language models | B | 3.26 | In modular addition, look at what different dimensionality reduction techniques do on the different weight matrices. Can you identify which weights matter most? Which neurons form clusters for each frequency? Anything from the activations?
Interpreting Algorithmic Problems | Questions about language models | C | 3.27 | Is direct logit attribution always useful? Can you find examples where it's highly misleading?
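A crude sketch of direct logit attribution for 3.27, assuming TransformerLens; the prompt/answer pair is a placeholder. It dots each layer's attention and MLP output at the final position with the unembedding direction of the correct next token, and deliberately ignores the final LayerNorm's scaling; simplifications like that are exactly the sort of thing that can make DLA misleading, which is part of the exercise.

    import torch
    from transformer_lens import HookedTransformer, utils

    model = HookedTransformer.from_pretrained("gpt2")
    prompt, answer = "The Eiffel Tower is in the city of", " Paris"
    tokens = model.to_tokens(prompt)
    answer_token = model.to_tokens(answer, prepend_bos=False)[0, 0]

    _, cache = model.run_with_cache(tokens)
    logit_dir = model.W_U[:, answer_token]  # [d_model]

    for layer in range(model.cfg.n_layers):
        attn_out = cache[utils.get_act_name("attn_out", layer)][0, -1]
        mlp_out = cache[utils.get_act_name("mlp_out", layer)][0, -1]
        print(f"layer {layer}: attn {(attn_out @ logit_dir).item():+.2f}, "
              f"mlp {(mlp_out @ logit_dir).item():+.2f}")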
Interpreting Algorithmic Problems | Deep learning mysteries | D | 3.28 | Explore the Lottery Ticket Hypothesis.
Interpreting Algorithmic Problems | Deep learning mysteries | D | 3.29 | Explore Deep Double Descent.
Interpreting Algorithmic Problems | Extending Othello-GPT | A | 3.30 | Try one of Neel's concrete Othello-GPT projects.
Interpreting Algorithmic Problems | Extending Othello-GPT | B-C | 3.31 | Looking for modular circuits - try to find the circuits used to compute the world model and to use the world model to compute the next move. Try to understand each in isolation and use this to understand how they fit together. See what you can learn about finding modular circuits in general. | Existing work: Conceptually: https://www.alignmentforum.org/posts/nwLQt4e7bstCyPEXs/internal-interfaces-are-a-high-priority-interpretability
Interpreting Algorithmic Problems | Extending Othello-GPT | B-C | 3.32 | Neuron Interpretability and Studying Superposition - try to understand the model's MLP neurons, and explore what techniques do and don't work. Try to build our understanding of transformer MLPs in general.
Interpreting Algorithmic Problems | Extending Othello-GPT | B-C | 3.33 | Transformer Circuits Laboratory - explore and test other conjectures about transformer circuits, e.g. can we figure out how the model manages memory in the residual stream?
Exploring Polysemanticity and Superposition | Confusions to study in Toy Models of Superposition | A | 4.1 | Does dropout create a privileged basis? Put dropout on the hidden layer of the ReLU output model and study how this changes the results. | Existing work: Lewis Smith thinks the answer is yes: see this post for their results. | Currently working on: 14 April 2023: Kunvar (firstuserhere)
Exploring Polysemanticity and Superposition | Confusions to study in Toy Models of Superposition | B-C | 4.2 | Replicate their absolute value model and study some of the variants of the ReLU output models. | Currently working on: May 4, 2023 - Kunvar (firstuserhere)
Exploring Polysemanticity and Superposition | Confusions to study in Toy Models of Superposition | B | 4.3 | Explore neuron superposition by training their absolute value model on a more complex function like x -> x^2.
Exploring Polysemanticity and Superposition | Confusions to study in Toy Models of Superposition | B | 4.4 | What happens to their ReLU output model when there's non-uniform sparsity? E.g. one class of less sparse features and another of very sparse features.