Ask HN: Can someone ELI5 transformers and the "Attention Is All You Need" paper?
644 points by Sai_ 12 months ago | 228 comments
I have zero AI/ML knowledge but Steve Yegge on Medium thinks that the team behind Transformers deserves a Nobel.

Makes me want to better understand this tech.

Edit: thank you for some amazing top level responses and links to valuable content on this subject.




Okay, here's my attempt!

First, we take a sequence of words and represent it as a grid of numbers: each column of the grid is a separate word, and each row of the grid is a measurement of some property of that word. Words with similar meanings are likely to have similar numerical values on a row-by-row basis.

(During the training process, we create a dictionary of all possible words, with a column of numbers for each of those words. More on this later!)

This grid is called the "context". Typical systems will have a context that spans several thousand columns and several thousand rows. Right now, context length (column count) is rapidly expanding (1k to 2k to 8k to 32k to 100k+!!) while the dimensionality of each word in the dictionary (row count) is pretty static at around 4k to 8k...
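
To make the "grid of numbers" concrete, here's a tiny numpy sketch (the vocabulary, sizes, and random values are all made up; real models learn these numbers during training and use thousands of dimensions):

  import numpy as np

  # Toy "dictionary": 5 words, each with a 4-number column of properties.
  vocab = ["the", "cat", "sat", "on", "mat"]
  embedding = np.random.randn(4, len(vocab))  # rows = properties, columns = words

  # Build the "context" grid for a sentence: one column per word.
  sentence = ["the", "cat", "sat"]
  context = np.stack([embedding[:, vocab.index(w)] for w in sentence], axis=1)
  print(context.shape)  # (4, 3): 4 properties x 3 words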

Anyhow, the Transformer architecture takes that grid and passes it through a multi-layer transformation algorithm. The functionality of each layer is identical: receive the grid of numbers as input, then perform a mathematical transformation on the grid of numbers, and pass it along to the next layer.

Most systems these days have around 64 or 96 layers.

After the grid of numbers has passed through all the layers, we can use it to generate a new column of numbers that predicts the properties of some word that would maximize the coherence of the sequence if we add it to the end of the grid. We take that new column of numbers and comb through our dictionary to find the actual word that most-closely matches the properties we're looking for.

That word is the winner! We add it to the sequence as a new column, remove the first column, and run the whole process again! That's how we generate long text-completions one word at a time :D
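
Here's a hedged numpy sketch of that last "comb through the dictionary" step; the toy vocabulary and the random "predicted properties" vector are stand-ins for what a real model would produce:

  import numpy as np

  vocab = ["the", "cat", "sat", "on", "mat"]      # toy dictionary (made up)
  embedding = np.random.randn(4, len(vocab))      # 4 properties per word

  # Pretend the network produced a new column of predicted properties.
  predicted = np.random.randn(4)

  # Score every dictionary word against the prediction (dot products here;
  # real models produce one score per vocabulary entry and softmax over them).
  scores = embedding.T @ predicted
  print(vocab[int(np.argmax(scores))])            # the "winner"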

So the interesting bits are located within that stack of layers. This is why it's called "deep learning".

The mathematical transformation in each layer is called "self-attention", and it involves a lot of matrix multiplications and dot-product calculations with a learned set of "Query, Key and Value" matrices.

It can be hard to understand what these layers are doing linguistically, but we can use image-processing and computer-vision as a good metaphor, since images are also grids of numbers, and we've all seen how photo-filters can transform that entire grid in lots of useful ways...

You can think of each layer in the transformer as being like a "mask" or "filter" that selects various interesting features from the grid, and then tweaks the image with respect to those masks and filters.

In image processing, you might apply a color-channel mask (chroma key) to select all the green pixels in the background, so that you can erase the background and replace it with other footage. Or you might apply a "gaussian blur" that mixes each pixel with its nearest neighbors, to create a blurring effect. Or you might do the inverse of a gaussian blur, to create a "sharpening" operation that helps you find edges...

But the basic idea is that you have a library of operations that you can apply to a grid of pixels, in order to transform the image (or part of the image) for a desired effect. And you can stack these transforms to create arbitrarily-complex effects.

The same thing is true in a linguistic transformer, where a text sequence is modeled as a matrix.

The language-model has a library of "Query, Key and Value" matrixes (which were learned during training) that are roughly analogous to the "Masks and Filters" we use on images.

Each layer in the Transformer architecture attempts to identify some features of the incoming linguistic data, and then, having identified those features, it can subtract them from the matrix, so that the next layer sees only the transformation rather than the original.

We don't know exactly what each of these layers is doing in a linguistic model, but we can imagine it's probably doing things like: performing part-of-speech identification (in this context, is the word "ring" a noun or a verb?), reference resolution (who does the word "he" refer to in this sentence?), etc, etc.

And the "dot-product" calculations in each attention layer are there to make each word "entangled" with its neighbors, so that we can discover all the ways that each word is connected to all the other words in its context.

So... that's how we generate word-predictions (aka "inference") at runtime!

But why does it work?

To understand why it's so effective, you have to understand a bit about the training process.

During inference, data always flows in the same direction. That's why it's called a "feed-forward" network.

But during training, there's another step called "back-propagation".

For each document in our training corpus, we go through all the steps I described above, passing each word into our feed-forward neural network and making word-predictions. We start out with a completely randomized set of QKV matrices, so the results are often really bad!

During training, when we make a prediction, we KNOW what word is supposed to come next. And we have a numerical representation of each word (4096 numbers in a column!) so we can measure the error between our predictions and the actual next word. Those "error" measurements are also represented as columns of 4096 numbers (because we measure the error in every dimension).

So we take that error vector and pass it backward through the whole system! Each layer needs to take the back-propagated error matrix and perform tiny adjustments to its Query, Key, and Value matrices. Having compensated for those errors, it reverses its calculations based on the new QKV, and passes the resultant matrix backward to the previous layer. So we make tiny corrections on all 96 layers, and eventually to the word-vectors in the dictionary itself!
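
As a rough illustration of that forward-pass / error / backward-pass loop, here is a minimal PyTorch sketch. The "model" is just an embedding plus one linear layer standing in for the full 96-layer transformer, and all sizes are made up; the point is only the shape of the loop (predict, measure the error, back-propagate, nudge the weights):

  import torch

  vocab_size, dim = 100, 16
  embed = torch.nn.Embedding(vocab_size, dim)      # the "dictionary" of word vectors
  head = torch.nn.Linear(dim, vocab_size)          # stand-in for the transformer layers
  opt = torch.optim.SGD(list(embed.parameters()) + list(head.parameters()), lr=0.1)

  tokens = torch.randint(0, vocab_size, (32,))     # fake training text
  inputs, targets = tokens[:-1], tokens[1:]        # we KNOW the next token at each position

  logits = head(embed(inputs))                              # feed-forward pass
  loss = torch.nn.functional.cross_entropy(logits, targets) # error vs. the known next tokens
  loss.backward()    # back-propagation: error gradients flow backward through every layer
  opt.step()         # tiny adjustments to all learned weights (QKV matrices, embeddings, ...)
  opt.zero_grad()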

Like I said earlier, we don't know exactly what those layers are doing. But we know that they're performing a hierarchical decomposition of concepts.

Hope that helps!


This was a very helpful visualization, thank you!

The "entanglement" part intuitively makes sense to me, but one bit I always get caught up on the key, query, and value matrices. In every self-attention explanation I've read/watched they tend to get thrown out there and similar to what you did here but leave their usage/purpose a little vague.

Would you mind trying to explain those in more detail? I've heard the database analogy where you start with a query to get a set of keys which you then use to lookup a value, but that doesn't really compute with my mental model of neural networks.

Is it accurate to say that these separate QKV matrices are layers in the network? That doesn't seem exactly right since I think the self-attention layer as a whole contains these three different matrices. I would assume they got their names for a reason that should make it somewhat easy to explain their individual purposes and what they try to represent in the NN.


I'm still trying to get a handle on that part myself... But my ever-evolving understanding goes something like this:

The "Query" matrix is like a mask that is capable of selecting certain kinds of features from the context, while the "Key" matrix focuses the "Query" on specific locations in the context.

Using the Query + Key combination, we select and extract those features from the context matrix. And then we apply the "Value" matrix to those features in order to prepare them for feed-forward into the next layer.

There are multiple "Attention Heads" per layer (GPT-3 had 96 heads per layer), and each Head performs its own separate QKV operation. After applying those 96 Q+K->V attention operations per layer, the results are merged back into a single matrix so that they can be fed-forward into the next layer.
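
A minimal numpy sketch of the multi-head idea described above, with made-up sizes (4 heads instead of 96, and tokens as rows rather than columns); real implementations also apply an extra learned projection after concatenating the heads:

  import numpy as np

  def softmax(x, axis=-1):
      e = np.exp(x - x.max(axis=axis, keepdims=True))
      return e / e.sum(axis=axis, keepdims=True)

  seq, d_model, n_heads = 6, 32, 4
  d_head = d_model // n_heads
  x = np.random.randn(seq, d_model)                # one row per token here

  outputs = []
  for _ in range(n_heads):                         # each head has its own learned Q/K/V weights
      Wq, Wk, Wv = (np.random.randn(d_model, d_head) for _ in range(3))
      Q, K, V = x @ Wq, x @ Wk, x @ Wv
      attn = softmax(Q @ K.T / np.sqrt(d_head))    # how strongly each token attends to each other
      outputs.append(attn @ V)                     # (seq, d_head) output for this head

  merged = np.concatenate(outputs, axis=1)         # glue the heads back into (seq, d_model)
  print(merged.shape)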

Or something like that...

I'm still trying to grok it myself, and if anyone here shed more light on the details, I'd be very grateful!

I'm still trying to understand, for example, how many QKV matrices are actually stored in a model with a particular number of parameters. For example, in a GPT-NeoX-20B model (with 20 billion params) how many distinct Q, K, and V matrices are there, and what is their dimensionality?

EDIT:

I just read Imnimo's comment below, and it provides a much better explanation about QKV vectors. I learned a lot!


It's basically almost the same as convolution in image processing. For example, you take the 3-channel RGB value of a single pixel, do some math on it and on the values of the surrounding pixels, with weights, which gives you some value(s). Depending on the dimensions of everything, you can end up with a smaller-dimensional output, like a single 3-channel RGB value, or a higher-dimensional output (e.g. for a 5x5 kernel, you can end up with a 9x9 output).

The confusing part that doesn't get mentioned is that the input vectors (Q, K, V) are weighted, i.e. they are derived from the input with the standard linear transformation y = A*x + b, where x is the input word, A is the linear layer matrix, and b is the bias. Those weights are the things that are learned through the training process.
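
In code, that "Q, K, V are just linear transformations of the input" point looks roughly like this (numpy, with made-up sizes and random parameters standing in for the learned A and b):

  import numpy as np

  d_in, d_qk = 8, 4
  x = np.random.randn(d_in)                         # one input token's vector

  # One learned (A, b) pair per projection; random stand-ins here.
  Aq, Ak, Av = (np.random.randn(d_qk, d_in) for _ in range(3))
  bq, bk, bv = (np.random.randn(d_qk) for _ in range(3))

  q = Aq @ x + bq    # query: what this token is looking for
  k = Ak @ x + bk    # key:   what this token offers for matching
  v = Av @ x + bv    # value: what this token passes along when matched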


That was incredible. Thank you! If you made it into an article with images showing the mask/filter analogy, it might be one of the best/most unique explanations I've seen. Love the ground-up approach beginning with data's shape.

Reminded me of the style of a book on machine learning. If anyone liked this explanation, you may appreciate this book:

https://www.amazon.com/Applied-Machine-Learning-Engineers-Al...


If it only generates one word at a time and then repeat the process again, how does it know when to stop?

It feels like this method would create endless ramblings. But we all know you can ask ChatGPT to “summarize in one sentence” and it pulls it off. When speaking yourself, you sort of have to think about how to finish a sentence before you start it in order to explain something cohesively; surely there must be something similar in the AI?


One of the “words” is a stop token, which represents the ending of text. So you can say the thing that maximizes coherence is to stop right then.
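
A toy generation loop showing the stop-token idea; `fake_next_token_probs` is a hypothetical stand-in for the real model, which would return a probability for every token in the vocabulary:

  import numpy as np

  VOCAB = ["hello", "world", "again", "<stop>"]     # toy vocabulary with a stop token
  STOP = VOCAB.index("<stop>")

  def fake_next_token_probs(sequence):
      # Stand-in for the real model: just returns random probabilities.
      p = np.random.rand(len(VOCAB))
      return p / p.sum()

  sequence = ["hello"]
  while len(sequence) < 20:                         # safety limit
      probs = fake_next_token_probs(sequence)
      next_id = int(np.argmax(probs))               # most likely next "word"
      if next_id == STOP:                           # the model predicted the stop token
          break
      sequence.append(VOCAB[next_id])
  print(" ".join(sequence))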


I have a very dumb question, I'll just throw it here: I understand word embeddings and tokenisation- and the value of each; but how can the two work together? Are embeddings calculated for tokens, and in that case, how useful are they, given that each token is just a fragment of a word, often with little or no semantic meaning?


I've heard that nowadays subword/token embeddings are learned during the training phase, and that they are useful for reconstructing the embeddings of words that contain them, and in fact allow the model to handle typos like "aple" (instead of "apple").


The way transformers operate is by transforming the embedding space through each layer. You could say that all the "understanding" is happening in that high dimensional space - that of a single token, but multiplied by the number of tokens. Seeding the embedding space with some learned value for each token is helpful. Think of it as just a vector database: token -> vector.

Decoder-only architectures (such as GPT) mask the token embedding interaction matrix (attention) such that each token embedding and all its subsequent transformations only have access to preceding token embeddings (and their transforms). This means that on output, only the last transformed token embedding has the full information of the entire context - and only it is capable of making predictions for the next token.

This is done so that during training, you can simultaneously make 1000s (context size) of predictions - every final token embedding transform is predicting the next token. The alternative (Encoder architecture, where there is no masking and the first token can interact with the final token) would result in massively inefficient training for predicting the next token as each full context can only make a single prediction.
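
The masking described above is usually implemented by setting the "future" attention scores to minus infinity before the softmax, so each token can only look backwards; a small numpy sketch with made-up sizes:

  import numpy as np

  def softmax(x, axis=-1):
      e = np.exp(x - x.max(axis=axis, keepdims=True))
      return e / e.sum(axis=axis, keepdims=True)

  seq = 5
  scores = np.random.randn(seq, seq)                      # raw attention scores (query x key)

  mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)    # True above the diagonal = future positions
  scores[mask] = -np.inf                                  # forbid attending to later tokens
  attn = softmax(scores)                                  # each row sums to 1 over past tokens only
  print(np.round(attn, 2))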


Disclaimer - someone from Marqo here.

Marqo supports E5 models: https://github.com/marqo-ai/marqo


why are the words columns and the properties rows? seems counterintuitive


just tilt your head 90 degrees and it'll be fine.

this is rows/columns from a math/matrix/tensor perspective where they are the arbitrary first and second dimensions of a data-containing object.

it's not rows/columns from a database perspective where you expect columns to define a static schema and rows to be individual records.


Thank you for this.


The Yannic Kilcher review is quite good.

https://youtu.be/iDulhoQ2pro

I can't ELI5 but I can ELI-junior-dev. Tl;dw:

Transformers work by basically being a differentiable lookup/hash table. First your input is tokenized and (N) tokens (this constitutes the attention frame) are encoded both based on token identity and position in the attention frame.

Then there is an NxN matrix that is applied to your attention frame, "performing the lookup query" over all other tokens in the attention frame, so every token gets a "contextual semantic understanding" that takes in both all the other stuff in the attention frame and its relative position.

GPT is impressive because the N is really huge and it has many layers. A big N means you can potentially access information farther away. Each layer gives more opportunities to summarize and integrate long-range information in a fractal process.

Two key takeaways:

- differentiable hash tables

- encoding relative position using periodic functions

NB: the attention frame tokens are actually K-vectors (so the frame is a KxN matrix) and the query matrix is an NxNxK tensor IIRC but it's easier to describe it this way


I appreciate the explanation, but I don't know which junior dev would understand most of this. I may be just a web developer, but I couldn't understand most of it. I'd still have to read for 30m to grok it all.


Yeah sorry, it still requires math and probably some exposure to ML basics.


I think one hole in the description (for simplicity) is that "differentiable" is not an adjective that applies to hash tables.

Differentiable relative to what? What is (x) in the d(hashtable)/d(x) equation?


I think it applies because lookups can be done by multiplying one-hot vectors (or masks) with matrices; this is roughly analogous to what happens when we multiply Q with K^T in a self-attention head.

Read this: https://e2eml.school/transformers.html#table_lookup

And then read: https://e2eml.school/transformers.html#attention

Matrix multiplication is differentiable as it is continuous. Therefore you can calculate partial derivatives of these operations. The ability to do that is what allows gradient descent optimization via the chain rule.

  > Differentiable relative to what?
  > What is (x) in the d(hashtable)/d(x) equation?
I think the derivative we actually care about is `∂L/∂(hashtable-lookup)` but here the `hashtable-lookup` is actually the matrix multiplication mentioned above. We want to know how much the "loss" (or error) changes with respect to each of our "lookups". Knowing how each of our "lookups" causes our model to err from the output we expect, allows us to optimise it.

Note: it's not really a "lookup" in the normal sense of the word. Hashtable lookups are discontinuous since we either have a value for a particular key or we do not. Our matrix multiplication is sort of like a fuzzy, probabilistic, continuous form of lookup in which nearby keys get partially activated based on their similarity to the query, and in which a small change in this query causes continuous changes to the keys produced.

As far as my understanding of the self-attention equation (e.g. `softmax(QK^T / sqrt(d_k))V`) goes, it's actually quite important that we get this fuzzy output in which lots of keys get partially activated for a particular query. If it only picked the maximum similarity and ignored the rest, there would be less information propagating through the network and it'd be harder for the network to learn relationships/interactions between inputs. This is why we divide `QK^T` by `sqrt(d_k)`: to pass a tighter range of values into the `softmax()` function (which importantly generates probabilities that sum to 1, but contains exponentials which give it a tendency to over-emphasize the maximum value and ignore other values if they are too far apart).
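
For reference, the quoted expression maps almost line-for-line onto code; a minimal numpy sketch with made-up sizes:

  import numpy as np

  def attention(Q, K, V):
      d_k = K.shape[-1]
      scores = Q @ K.T / np.sqrt(d_k)                     # similarity of each query to each key
      weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
      weights /= weights.sum(axis=-1, keepdims=True)      # softmax: fuzzy weights summing to 1 per query
      return weights @ V                                  # weighted blend of values, not a hard lookup

  Q, K, V = (np.random.randn(4, 8) for _ in range(3))     # 4 tokens, d_k = 8 (arbitrary)
  print(attention(Q, K, V).shape)                         # (4, 8)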


That's exactly the point, though! It's surprising. A hashtable is a map from keys to values. Making it differentiable means that a small change in the key also makes a small change in the value!


differentiable relative to the model parameters. The attention mechanism does store(key(x), value(x)), followed by lookup(query(y)), where key(), value(), lookup(), query() are all designed to be differentiable.


One thing that might be worth pointing out is that the transformer architecture owes a great deal of its success to the fact that it can be implemented in a way that it can be massively parallelized in a very efficient manner.


Compared to RNNs... maybe? The big NxN is really a killer.

I don't know how to judge the parallelizability of different DNN models; you're comparing apples to oranges.


When you train a transformer, you're training what the next expected token is. You can train all positions of the sequence each in parallel rather than having to sequentially build up the memory state as you generate the sequence with an LSTM. Mind you the inference portion of a transformer is still sequentially bottlenecked since you don't know what the output sequence is supposed to be.


I believe the other big thing is it allowed you to parallelize the training as well, so you could split those colossal training sets across many machines in a way you could not do with LSTMs.


I don't really believe this because in theory it's possible to split up training of arbitrary neural nets using decoupled neural interfaces (not that it's easy mind you) but at the point where you have openAI scale resources the bottleneck to achieving that is the difficulty of Python.


What’s a neural interface?



If you train in parallel, how do you combine the weights generated from different trainings in parallel?


Believe it or not, it's as simple as averaging or adding the gradients of each training result before adding them to the model weights. The same thing happens when you train a model using batches of inputs.
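
A toy sketch of that idea, assuming three hypothetical workers that each computed a gradient on their own slice of data:

  import numpy as np

  weights = np.zeros(3)                     # shared model parameters (toy)
  lr = 0.1

  # Gradients computed independently by three workers on different data slices.
  worker_grads = [np.array([0.2, -0.1, 0.5]),
                  np.array([0.4,  0.0, 0.3]),
                  np.array([0.0, -0.2, 0.4])]

  avg_grad = np.mean(worker_grads, axis=0)  # combine by element-wise averaging
  weights -= lr * avg_grad                  # one synchronized update to the shared weights
  print(weights)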


It actually isn't. You have to have a synchronizer, batchsize one or else "strange things" can happen and you waste a lot of cycles. Alternatively you can do non-simple changes to your network structure to enable distributed training.


It really is that simple. Yes, there's many different approaches to this (which can become quite clever and complex, which is true of training in general), but it all really boils down to adding or averaging the gradients in most cases.


You take a huge hit.


What does it mean for a lookup/hash table to be differentiable?


I'm not a ML expert but I know a bit about math.

It's "differentiable" in the same way that e.g. the "jump function" (Heaviside step function) is differentiable (not as a function from real numbers to real numbers, but as a distribution). It's derivative is the "point impulse function" (Dirac delta function), which, again, is a distribution, not a real function.

Distributions are nicely defined in math, but can't really be operated with numerically (at least not in the same way as real/float functions), but you can approximate them using continuous functions. So instead of having a function jump from 0 to 1, you "spread" the jump and implement it as a continuous transition from e.g. `0-epsilon` to `0+epsilon` for some tiny epsilon. Then you can differentiate it as usual, even numerically.

Similarly, hash table lookup is a discontinuous function - the result of `hash.get(lookup)` is just `value` (or `null`). To make it continuous, you "spread" the value, so that nearby keys (for some definition of "nearby") will return nearby values.

One way to do this is to use the scalar product between `lookup` and all keys in the hashtable (normalized, the scalar product is close to 1 if the arguments are "nearby"), and use the result as weights to multiply with all values in the hashtable. That's what the transformer does.
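
A tiny numpy illustration of that "spread the lookup" idea (everything random and made up): instead of one key returning one value, every stored value contributes, weighted by how close its key is to the query:

  import numpy as np

  keys = np.random.randn(5, 4)        # 5 stored keys, each a 4-dim vector
  values = np.random.randn(5, 3)      # the value stored under each key
  query = np.random.randn(4)

  # Normalized dot products measure how "nearby" each key is to the query.
  scores = keys @ query / (np.linalg.norm(keys, axis=1) * np.linalg.norm(query))
  weights = np.exp(scores) / np.exp(scores).sum()         # softmax -> weights summing to 1

  result = weights @ values   # a blend of all values, heaviest on the closest keys
  print(result)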


Thanks for this explanation. I couldn't wrap my mind around the "differentiable hash table" analogy, but "distribution of keys" -> "distribution of values" starts to click.

I'm not an ML expert either but I have taken graduate level courses and published papers with "machine learning" in the title, so I feel like I should be able to understand these things better. The field just moves so fast. It's a lot of work to keep up. Easy-to-digest explanations like this are underrated.


>The field just moves so fast. It's a lot of work to keep up. Easy-to-digest explanations like this are underrated.

This is really the truth. I can't possibly understand how people in this field who are talented can still keep up. I have a binder full of seminal papers that I have to cull to make room for more recent and relevant research every few months. I feel there is a lot of potential in simplifying the details of the mechanisms that drive a lot of it, but nobody has time to stop, consolidate the information and publish it. And if they did, it would just be another outdated textbook in a few years.


thank you. This made it click.


Differentiable in this context, means that it can be learned by gradient descent, which uses derivatives to adjust a neural network model's parameters to minimise some error measure. As for how that applies to a hash function, I think the lookup gives some sort of weighting for each possible output, for each input, with the largest weightings corresponding to a "match". But tbh I'm not certain on that last part...


> As for how that applies to a hash function, I think the lookup gives some sort of weighting for each possible output, for each input, with the largest weightings corresponding to a "match"

Note that in general one "matches" more than one thing in which case (IIRC) the result will be a weighted linear combination of the query results.

I do think that this collapses to a datastructure equivalent in function to the "normal hash table" (but not in performance, lol) in the degenerate case where every input yields strictly a single result with zero activation in all other results, so it's not invalid to call it a differentiable lookup table.


You can take a (calculus) derivative of the mathematical construct that represents the lookup table. And yeah, it is very not obvious how a "lookup/hash table" could be differentiable, based on how it's implemented say, in a leetcode exercise. That's part of the genius of the transformer.


Going to go out on a limb and say they are probably referring to the gradient calculus required for updating the model.

https://en.wikipedia.org/wiki/Differentiable_programming

See automatic differentiation.


Correct, but note that if you subject a standard hash table algo to AD it won't magically become a transformer. (Hashes in the "normal construction" are discrete functions and thus aren't really continuous or differentiable, neither are lookup tables)


What actually happens is that each vector grabs a fraction of each other vector and adds it to itself (simplification omitting some transforms along the way). Equating this with the concept of a hash function does not make sense in my opinion but some authors do it anyway.

It's differentiable because how much of a fraction to grab is the result of a simple dot product followed by a softmax.


It's not really doing hashing, which is random, but instead it makes meaningful connections, like connecting an adjective to its noun, or a pronoun to a previous reference of the name. That is why it is called "Attention is all you need" and not "Hashing is all you need". Attention includes some meaning, it is attentive to some aspect or another.


Hashing does not have to be random, it just has to compress the domain. Modulo number is a perfectly valid hash for some use cases.

Edit: just looked it up, doesn't even have to compress the domain, identity hash is a thing.


The lookup/hashtable can be viewed as a matrix which is accessed by multiplying it with a vector.


I wanted to ask the same and especially I've always been wondering: How is the meaning of aforementioned 'differentiable' related to the same term in math?


Not an expert (so this could well be slightly off), but here is a physics analogy:

You're rolling a ball down a plank of wood and you want it to roll off the plank and continue to a certain spot on the ground. You know that if the ball rolls past the target, you have to angle the ramp lower. If it fails to reach the target, you angle the ramp higher. If it goes past by a lot, you make a big change to the ramp angle. If it goes past by a little, you make a small change to the ramp angle. In this way your error tells you something about the adjustment you need to make to the system to produce the desired output.

Think of a function from the ramp angle (network weights) to how far the ball lands from the target (the error). The derivative of this function tells you what kind of change in angle results in what kind of change in the error, and you can use that to update the ramp until you hit the target. To say that the model is differentiable is to say that it's possible to infer from the error what kind of changes you need to make to the weights.
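
Here's the ramp analogy as a handful of lines of Python (the "physics" is a made-up linear function; the point is just the update rule: nudge the angle in proportion to the error, using the derivative):

  # Toy gradient descent on one parameter (the ramp "angle").
  def landing_distance(angle):
      return 3.0 * angle + 1.0          # stand-in physics: distance grows with angle

  target = 10.0
  angle = 0.0
  lr = 0.05
  for _ in range(100):
      error = landing_distance(angle) - target
      gradient = 2 * error * 3.0        # d(error^2)/d(angle), via the chain rule
      angle -= lr * gradient            # adjust the angle opposite to the gradient
  print(angle, landing_distance(angle)) # converges to angle = 3, distance = 10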


Same thing. You need everything to be differentiable in order to run gradient descent. To first approximation, training a neural network is just gradient descent.


The neural net is just a math function, continuous even, fully differentiable in all input points. In order to "learn" anything we compute gradients towards the function parameters. They get "nudged" slightly towards a better response, and we do this billions of times. It's like carving a raw stone block into a complex scene. If you put your data into the system it flows towards the desired output because the right path has been engraved during training.

<rant> This explains a bit how neural nets work but from this to chatGPT is another whole leap. You'd have to assign some of the merits of the AI to the training data itself, it's not just the algorithm for learning, but what is being learned that matters. The neural net is the same, but using 1T tokens of text is making it smart. What's so magic about this data that it can turn a random init into a language programmable system? And the same language information make a baby into a modern human, instead of just another animal. </>


No, it's the loss function we differentiate. The inputs to the loss function are the network weights. The inputs to the network are samples, and those we do not differentiate.


While it's true that we don't differentiate the input samples, we do differentiate the loss function's output with respect to each of the network weights. We use the chain rule to calculate each of these "gradients" and that process is known as backpropagation.

(You might have intended to say this, in which cases I'm just trying to add clarity.)


Well, a hash table is basically a function.


Not a continuous function


I came here to post this video. It’s a great primer on the topic and it gives you ideas to prompt gpt and have it output more.

It’s how I got an understanding of beam search, a technique employed in some of the response building.


I had to look up "tl;dw" and realized it meant "too long; didn't watch" and not my first AI-laden instinct of "too long, didn't write" :-D


what is a differentiable hash table? I understand differentiation, I don't understand a differentiable hash table ... does that mean that for every tiny gradient of a key, a corresponding gradient in value can be expected?


Having read the paper myself, I'm impressed with the quality of your explanation. Well done!


How does N relate to the number of parameters that is frequently mentioned?


In my screed, N is the attention width (how many tokens it looks at at a time). The number of parameters is O(KxNxNxL), where K is the vector size of your tokens and L is the number of layers. There are other parameters floating around, like in the encoder and decoder matrices, but the NxN matrix dominates.


This is an awesome explanation. You guys are the real heroes


ELI5 is tricky as details have to be sacrificed, but I'll try.

An attention mechanism is when you want a neural network to learn the function of how much attention to allocate to each item in a sequence, to learn which items should be looked at.

Transformers is a self-attention mechanism, where you ask the neural network to 'transform' each element by looking at its potential combination with every other element and using this (learnable, trainable) attention function to decide which combination(s) to apply.

And it turns out that this very general mechanism, although compute-intensive (it considers everything linking with everything, so complexity is quadratic in sequence length) and data-intensive (it has lots and lots of parameters, so needs huge amounts of data to be useful), can actually represent many of the things we care about in a manner which can be trained with the deep learning algorithms we already had.

And, really, that's the two big things ML needs, a model structure where there exists some configuration of parameters which can actually represent the thing you want to calculate, and that this configuration can actually be determined from training data reasonably.


The Illustrated Transfomer ( https://jalammar.github.io/illustrated-transformer/ ) and Visualizing attention ( https://towardsdatascience.com/deconstructing-bert-part-2-vi... ), are both really good resources. For a more ELI5 approach this non-technical explainer ( https://www.parand.com/a-non-technical-explanation-of-chatgp... ) covers it at a high level.


Suppose someone asked you to complete the sentence:

“After I woke up and made breakfast, I drank a glass of …”

In America one might say the most likely next words are “orange juice”, or “apple juice” but not “sports car” which has nothing to do with the sentence.

Ultimately this is what language models do, given a sequence of data (in this case words) predict the most likely next word(s).

For attention, when you read the sentence, which words stood out as more important? Probably woke up, breakfast, and glass while the words after, I, and made were less important to completing the sentence.

That is, you paid more attention to the important words to understand how to complete the sentence.

The “attention mechanism” in language models is a way to let the models learn which words are important in sentences and pay more attention to them too when completing sentences, just like a person would do as in the example above.

Further, it turns out this attention mechanism lets the models do lots of interesting things even without other fancy model techniques. That is “attention is all you need”.


When one says "attention is all you need" the implication is that some believe that you need something more than just attention. What is that something which has been demonstrated as unneeded? Is it a theory of how language works?


Recurrence. Before transformers, attention was used in recurrent neural networks. "Attention is all you need" showed that you can just drop the recurrence and use attention alone, and the outcome is that you get a very nicely parallelizable architecture, allowing more efficient training.


Finally someone cut to the chase. Thank you!



Those Computerphile videos[0] by Rob Miles helped me understand transformers. He specifically references the "Attention is all you need" paper.

And for a deeper dive, Andrej Karpathy has this hands-on video[1] where he builds a transformer from scratch. You can check out his other videos on NLP as well; they are all excellent.

[0] https://youtu.be/rURRYI66E54, https://youtu.be/89A4jGvaaKk

[1] https://youtu.be/kCc8FmEb1nY


Well here is my (a bit cynical) take on it.

In the beginning, there was the matrix multiply. A simple neural network is a chain of matrix multiplies. Let's say you have your data A1 and weights W1 in a matrix. You produce A2 as A1xW1. Then you produce A3 as A2xW2, and so on. There are other operations in there like non-linearities (so that you can actually learn something interesting) and fancy batch norms, but let's forget about those for now.

The problem with this is, it's not very expressive. Let's say your A1 matrix has just 2 values, and you want the output to be their product. Can you learn a weight matrix that performs multiplication of these inputs? No you can't. Multiplication must be simulated by piecing together piecewise linear functions. To perform multiplication, the weight matrix W would also need to be produced by the network.

Transformers do basically that. In the product A*W you replace A with (AxW1), W with (AxW2), and multiply those together: (AxW1)x(AxW2). And then do it once more for good measure: (AxW1)x(AxW2)x(AxW3). Boom, Nobel prize. Now your network can multiply, not just add. OK, it's actually a bit more complicated; there is for example a softmax in the middle to perform normalisation, which in general helps during numerical optimisation: softmax((AxW1)x(AxW2))x(AxW3).

There are then fancy explanations that try to retrospectively justify this as a "differentiable lookup table" or somesuch nonsense, calling the 3 parts "key", "query" and "value", which helps make your paper more popular. But the basic idea is not so complicated. A Transformer then uses this operation as a building block (running them in parallel and in sequence) to build giant networks that can do really cool things. Maybe you can teach networks to divide next and then you get the next Nobel prize.


Are transformers

- a hack devised/stumbled upon by AI workers or

- a theoretical concept concocted by a mathematician who has been thinking about what NNs do or,

- a set of techniques pipe-lined together by clever programmers who work with NNs? Or...

- something else?

I mean, if transformers really do something rational, then there should be a straightforward rational mathematical statement of the problem, a clear and clean expression of what they do!

What I see instead is a lot of complex cumbersome description and terminological noise: no clear problem statement, lots of steps, lots of moving parts, and downhill from there.

Now I'd be the first to admit that, if I believed I understood and could reproduce intelligence or language [and is intelligence merely language? An argument to that effect can be made - see Helen Keller], then if required I'd be prone to provide dense, noisy and incorrect explanations aplenty to potential competitors and even to honest inquisitive people. I would do that b/c revelation of the truth would destroy my competitive advantage. IOW I see every reason for ChatGPT et al developers and corporations to guide outsiders astray at this time.

Developing something like ChatGPT is like running an exposed Manhattan Project - everyone wants Da Bomb and you don't want them to have it - instead you want to lead them completely astray. Seems to be succeeding: certainly as far as I'm concerned.

Here's a simple test: has anyone within reach of my words made a version of these systems from scratch that does anything like the fullblown ChatGPT does and that (s)he will reveal?

Current times would indicate that we won't learn how it all works until someone leaks it or someone else figures it out (like happened to Heisenberg and Schrodinger). My bet is on the latter. And that's the guy/gal who should get the Fields medal or Nobel Prize. [I'd bet a mathematician will do it].

Meanwhile as we twiddle our matrices much effort must be afoot to infiltrate the ChatGPT working groups and get the goods and also to keep the current ChatGPT worker bees from flying off to other hives and revealing secrets. This may be one of the few times when tech workers' jobs seriously shorten their lives.

- Wandering in the desert...


Are there any papers using more than 3 linearly transformed vectors?


It works like this:

First, convert the input text to a sequence of token numbers (2048 tokens with 50257 possible token values in GPT-3) by using a dictionary and for each token, create a vector with 1 at the token index and 0 elsewhere, transform it with a learned "embedding" matrix (50257x12288 in GPT-3) and sum it with a vector of sine and cosine functions with several different periodicities.

Then, for each layer, and each attention head (96 layers and 96 heads per layer in GPT-3), transform the input vector by query, key and value matrices (12288x128 in GPT-3) to obtain a query, key and value vector for each token. Then for each token, compute the dot product of its query vector with the key vectors of all previous tokens, scale by 1/sqrt of the vector dimension and normalize the results so they sum to 1 by using softmax (i.e. applying e^x and dividing by the sum), giving the attention coefficients; then, compute the attention head output by summing the value vectors of previous tokens weighted by the attention coefficients. Now, for each token, glue the outputs for all attention heads in the layer (each with its own key/query/value learned matrices), add the input and normalize (normalizing means that the vector values are biased and scaled so they have mean 0 and variance 1).

Next, for the feedforward layer, apply a learned matrix, add a learned vector and apply a ReLU (which is f(x) = x for positive x and f(x) = kx with k near 0 for negative x), and do that again (12288x49152 and 49152x12288 matrices in GPT-3, these actually account for around 70% of the parameters in GPT-3), then add the input before the feedforward layer and normalize.

Repeat the process for each layer, each with their own matrices, passing the output of the previous layer as input. Finally, apply the inverse of the initial embedding matrix and use softmax to get probabilities for the next token for each position. For training, train the network so that they are close to the actual next token in the text. For inference, output a next token according to the top K tokens in the probability distribution over a cutoff and repeat the whole thing to generate tokens until an end of text token is generated.
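
As a hedged sketch of just the very last step described there (un-embedding, softmax, and top-K sampling), with toy sizes and random numbers standing in for the real matrices:

  import numpy as np

  vocab_size, d_model, k = 1000, 64, 5              # toy sizes (GPT-3: 50257 and 12288)
  final_vector = np.random.randn(d_model)           # last token's output from the final layer
  unembed = np.random.randn(d_model, vocab_size)    # inverse/transpose of the embedding matrix

  logits = final_vector @ unembed                   # one score per vocabulary token
  top = np.argsort(logits)[-k:]                     # indices of the K best-scoring tokens
  probs = np.exp(logits[top] - logits[top].max())
  probs /= probs.sum()                              # softmax over just the top K
  next_token = int(np.random.choice(top, p=probs))  # sample the next token id
  print(next_token)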


I'll throw my hat in the ring.

A transformer is a type of neural network that, like many networks before, is composed of two parts: the "encoder" that receives a text and builds an internal representation of what the text "means"[1], and the "decoder" that uses the internal representation built by the encoder to generate an output text. Let's say you want to translate the sentence "The train is arriving" to Spanish.

Both the encoder and decoder are built like Lego, with identical layers stacked on top of each other. The lowest lever of the encoder looks at the input text and identifies the role of individual words and how they interact with each other. This is passed to the layer above, which does the same but at a higher level. In our example it would be as if the first layer identified that "train" and "arrive" are important, then the second one identifies that "the train" and "is arriving" are core concepts, the third one links both concepts together, and so on.

All of these internal representations are then passed to the decoder (all of them, not just the last ones), which uses them to generate a single word, in this case "El". This word is then fed back to the decoder, which now needs to generate an appropriate continuation for "El", which in this case would be "tren". You repeat this procedure over and over until the transformer says "I'm done", hopefully having generated "El tren está llegando" in the process.

The attention mechanism already existed before transformers, typically coupled with an RNN. The key concept of the transformer was building an architecture that removed the RNN completely. The negative side is that it is a computationally inefficient architecture as there are plenty of n^2 operations on the length of the input [2]. Luckily for us, a bunch of companies started releasing for free giant models trained on lots of data, researchers learned how to "fine tune" them to specific tasks using way less data than what it would have taken to train from scratch, and transformers exploded in popularity.

[1] I use "mean" in quotes here because the transformer can only learn from word co-occurrences. It knows that "grass" and "green" go well together, but it doesn't have the data to properly say why. The paper "Climbing towards NLU" is a nice read if you care about the topic, but be aware that some people disagree with this point of view.

[2] The transformer is less efficient than an LSTM in the total number of operations but, simultaneously, it is easier to parallelize. If you are Google, this is the kind of problem you can easily solve by throwing a data center or two at the problem.


> The negative side is that it is a computationally inefficient architecture as there are plenty of n^2 operations on the length of the input

Is this the reason for the limited token windows?


Yes, kinda. The transformer doesn't have a mechanism for dynamically adjusting its input size, so you need to strike a balance between the window being big enough for practical purposes but also small enough that you can still train the network.

Previous networks with RNNs could in theory receive inputs of arbitrary size, but in practice their performance decreased as the input got longer because they "forgot" the earlier input as they went on. The paper "Neural Machine Translation by Jointly Learning to Align and Translate" solved the forgetting problem by, you guessed it, adding attention to the model.

Eventually people realized that attention was all you needed (ha!), removed the RNN, and here we are.


I'm the author of https://jalammar.github.io/illustrated-transformer/ and have spent years since introducing people to Transformers and thinking of how best to communicate those concepts. I've found that different people need different kinds of introductions, and the thread here includes some often cited resources including:

https://peterbloem.nl/blog/transformers

https://e2eml.school/transformers.html

I would also add Luis Serrano's article here: https://txt.cohere.com/what-are-transformer-models/ (HN discussion: https://news.ycombinator.com/item?id=35576918).

Looking back at The Illustrated Transformer, when I introduce people to the topic now, I find I can hide some complexity by omitting the encoder-decoder architecture and focusing only on one. Decoders are great because now a lot of people come to Transformers having heard of GPT models (which are decoder only). So for me, my canonical intro to Transformers now only touches on a decoder model. You can see this narrative here: https://www.youtube.com/watch?v=MQnJZuBGmSQ


- You can develop a very deep understanding of a sequence by observing how each element interacts with each other over many sequences.

- This understanding can be encapsulated in "compressed" low dimensional vector representation of a sequences.

- You can use this understanding for many different downstream tasks, especially predicting the next item in a sequence.

- This approach scales really well with lots of GPUs and data and is super applicable to generating text.


Transformers are about converting some input data (usually text) to numeric representations, then modifying those representations through several layers to generate a target representation.

In LLMs, this means go from prompt to answer. I'll cover inference only, not training.

I can't quite ELI5, but process is roughly:

  - Write a prompt
  - Convert each token in the prompt (roughly a word) into numbers.  So "the" might map to the number 45.
  - Get a vector representation of each word - go from 45 to [.1, -1, -2, ...]. These vector representations are how a transformer understands words.  
  - Combine vectors into a matrix, so the transformer can "see" the whole prompt at once.
  - Repeat the following several times (once for each layer):
  - Multiply the vectors by the other vectors.  This is attention - it's the magic of transformers, that enables combining information from multiple tokens together.  This generates a new matrix.
  - Feed the matrix into a linear regression.  Basically multiply each number in each vector by another number, then add them all together.  This will generate a new matrix, but with "projected" values.
  - Apply a nonlinear transformation like relu.  This helps model more complex functions (like text input -> output!)
Note that I really oversimplified the last few steps, and the ordering.

At the end, you'll have a matrix. You then convert this back into numbers, then into text.


I don't think this description of attention is correct.


You mean "Multiply the vectors by the other vectors. This is attention - it's the magic of transformers, that enables combining information from multiple tokens together. This generates a new matrix."?

It's really oversimplified, as I mentioned. A more granular look is:

  - Project the vectors with a linear regression.  In decoder-only attention (what we usually use), we project the same vectors twice with different coefficients.   We call the first projection queries, and the second keys.  This transforms the vectors linearly.
  - Find the dot product of each query vector against the key vectors (multiply them)
  - (training only) Mask out future vectors, so a token can't look at tokens that come after it
  - At this point, you will have a matrix indicating how important each query vector considers each other vector (how important each token considers the other tokens)
  - Take the softmax, which both ensures all of the attention values for a vector sum to 1, and penalizes small attention values
  - Use the softmax values to get a weighted sum of tokens according to the attention calc.
  - This will turn one vector into the weighted sum of the other vectors it considers important.
The goal of this is to incorporate information from multiple tokens into a single representation.


Andrej Karpathy has a great course on this that is accessible to anyone who knows calculus and basic coding: https://karpathy.ai/zero-to-hero.html


Great course, still looking forward to understanding transformers and hopefully understanding the famous paper. On backprop ninja right now, hammering out the first exercise. I had to do other CS courses (freely shared online) to understand well enough.


One or both of these might help

Transformers, Explained: Understand the Model Behind GPT-3, BERT, and T5: https://daleonai.com/transformers-explained

Transformers from Scratch: https://e2eml.school/transformers.html

The first link covers Attention well.


Not sure I can give an ELI5 explanation, but for me this the closest to what you are after in two short videos on the topic. At least they helped me get up to speed fast.

Hope they will do the same for you ;-)

Large Language Models from scratch https://www.youtube.com/watch?v=lnA9DMvHtfI

Large Language Models: Part 2 https://www.youtube.com/watch?v=YDiSFS-yHwk


Technically speaking, the breakthrough was also the fact that it allowed for parallelization of running the computation. Instead of going word by word in a sequence, and optimizing for the next word, the approach shifted to looking at words independently and then applying the same statistical approach of finding the next word relative to that word or sequence. Then the final outcome was a weighted sum of these independent pieces.


Check this explanation from Karpathy which provides really good intuition https://youtu.be/9uw3F6rndnA

Basically tokens “talk” to each other and say: this is what I have, and this is what I look for.


Your summary of this is very good. It’s a really simple and intuitive way to express something very complex.


This might be dumbing it down very, very much, but it all boils down to having some sort of special lookup table: you give it a "query" and a bunch of "keys", and the "value" is the most likely next word. As you input something into the network, a network of these tables is consulted and you are given the most likely next word.

The novelty in this paper is this "query-key-value" relation that gets learned. A lot of previous work in this area was focused on learning a rough state machine to which you input a set of state transitions and it will give you the most likely next state. This will also work but training such networks is very slow and you also don't have the capability to train the network to "attend" to certain part of the inputs. This lookup based technique lets you do that plus this is also very compute efficient (compared to previous techniques).

I'm missing a lot of details but that's basically the intuition behind this.

These are excellent resources:

- https://www.youtube.com/watch?v=ptuGllU5SQQ&list=PLoROMvodv4...

- https://www.youtube.com/watch?v=OyFJWRnt_AY&pp=ygUfYXR0ZW50a...


From the Yegge post:

> LoRA makes LLMs composable, piecewise, mathematically, so that if there are 10,000 LLMs in the wild, they will all eventually converge on having the same knowledge. This is what Geoffrey Hinton was referring to on his SkyNet tour.

I don't think that's right at all. LoRA freezes most of the large model and wouldn't let you simply combine large models. Instead, I'm pretty sure Hinton is referring to data-parallel training with batching:

> DataParallel (DP) - the same setup is replicated multiple times, and each being fed a slice of the data. The processing is done in parallel and all setups are synchronized at the end of each training step.

https://huggingface.co/docs/transformers/v4.15.0/parallelism

You can have many instances of the model training on different bits of data, and then just average the modified weights back together at the end. This combining of weights is what Hinton means when he says parallel copies of brains can learn things independently and then recombine later at huge bandwidth, whereas humans are far more limited to sharing separate experiences verbally or with something like a multimedia presentation.



Prompt: "ELI5 the Attention is All We Need paper."

Imagine you have a box of toys. Some toys are more interesting to you than others, so you pay more attention to those toys and play with them more. The same thing happens in the "Attention is All You Need" paper, but instead of toys, we have words in a sentence.

Before this paper, when computers tried to translate one language to another, they would look at one word, translate it, then move to the next word. This works okay, but it's not great because sometimes the meaning of a word can depend on other words in the sentence.

The clever thing about the "Attention is All You Need" paper is that it taught the computer to pay attention to all the words in a sentence at the same time, but give more importance ("attention") to the words that matter most for understanding and translation.

This new way of translating languages using "attention" made computers much better at understanding and translating languages. It's like if you could become better at playing by understanding all your toys at once! This paper was a big deal in the field of machine learning and artificial intelligence because it improved how machines understand languages.


Prompt: “Why might people be hostile towards others posting simple copy-paste's of a chatgpt dialog in response to a question?”

Response:

“There could be several reasons why people might be hostile towards others who post simple copy-paste responses from ChatGPT in a dialog:

① Lack of originality: Copy-pasting a ChatGPT response without any personal input or effort can be seen as lazy or unoriginal. It shows a lack of engagement and genuine interaction with the question or conversation. People might value unique perspectives and insights, so when they see someone simply copying and pasting a response, they may perceive it as a low-quality contribution.

② […] ③ […] ④ […]

It's important to note that individual reactions may vary, and not everyone will respond with hostility in such situations. However, these reasons provide some insight into why certain individuals might have negative reactions towards simple copy-paste responses from ChatGPT.”


ELI5 - the big problem for computers working with language is that a computer doesn’t know what words mean or which words are related to each other.

The main useful idea from ML is just that we could learn meaning directly from the data—and we have a lot of data thanks to the internet. So there was a lot of work that went into ways to learn the meaning of every word and the relationships between words directly from text data—with some impressive successes.

But in almost all cases one of the biggest problems was learning how words affect each other when they’re far apart. That was a really hard problem because if you want to know how any two words affect each other then there are a lot of pairs you need to try. If you have 10 words in a sentence then there’s about 100 pairs; and if you have 1000 words then you have about 1 million pairs. For many years it seemed silly to even try that; computers are fast, but not _that_ fast…right?

But eventually hardware got powerful enough that someone decided to throw away all the cleverness and complexity; instead, they just did the most obvious thing: try _every_ pair of words. When you get down to it, that's really all that attention is: just test every pair of inputs to see how similar they are.
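
That "test every pair" step is literally one matrix product, which is also where the quadratic cost comes from; a quick numpy sketch with made-up sizes:

  import numpy as np

  n, d = 1000, 64                  # 1000 "words", 64 numbers each (arbitrary)
  words = np.random.randn(n, d)

  similarity = words @ words.T     # every pair compared at once: an n x n table
  print(similarity.shape)          # (1000, 1000) -> about a million pairs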

The title of the paper “Attention is All You Need” highlights that you can get rid of all the other tricks that people had been inventing to work around this problem of relating words that are far apart and learning the right meaning from the data. You don’t need to remember earlier words, you don’t need a fixed size context window or dynamic context or pre-trained word vectors or many, many other ideas. You _just_ need Attention to learn what words mean and solve the long distance problem.

Now, it didn’t _really_ solve the problem because the original transformer could only handle around 500 tokens. This is what folks mean when they talk about the “context length” or “context window” of a Transformer model. And it’s why everyone has been so surprised and impressed when the context window for GPT jumped to 2,000 (that’s 16x more memory than the original transformer), and now we see models with 30k or 100k context windows.

In any case, at this point we've learned that the (relatively) simple idea of Transformers is actually incredibly powerful, and remarkably general-purpose. There are actually only a handful of new ideas in Transformer-based models today compared to the original paper.


Uhm, nope, but here's a ~70-minute talk I gave about it: https://github.com/fkarg/things-to-talk-about/blob/master/tr...

Not sure how well the slides can be understood by themselves, though I tried to be accommodating for that


I strongly agree with Steve on the potential of a Nobel prize for the discovery, and I want to add that the team will probably win it within 10 years. For reference, Jack Kilby won his shared Nobel prize for the integrated circuit (IC), which transformed the electronics and computer industry, more than 40 years after the original invention.

Here is a very informative video and blog post by the Google Cloud Tech team explaining the game-changing nature of the self-attention proposed by the paper for understanding the context of the words used in a sentence [1],[2].

[1] Transformers, explained: Understand the model behind GPT, BERT, and T5:

https://youtu.be/SZorAJ4I-sA

[2] Corresponding blog post:

https://daleonai.com/transformers-explained


There's no Nobel prize for algorithms, though. There's no physics involved, unlike the IC. Even if LLMs turn out to be revolutionary, there's no way a Nobel can be given for Transformers.


I'm still confused by the term 'attention', because it implies that something else is actively attending, while it's more about self-similarity. We begin with a sequence of vectors and linearly transform it in 3 ways as Q, K and V (these transformations are learned). The "attention" is (Q.K)*V, so "amplify the parts of V according to the similarity between the two other projections Q and K". Somehow, by doing a lot of these self-similar transformations in parallel and stacking them in series, we get syntax modeling. It remains a mystery to me what the transformations are supposed to model and why this works so well. This paper might well be one of the most profound discoveries of this century.

Please suggest some paper that delves a bit more into the theory around the architecture.


There are some good explanations here of the self-attention architecture that makes Transformers unique.

However, most people gloss over other aspects of the "Attention is all you need" paper, which is in a sense mis-titled.

For example, Andrej Karpathy pointed out that the paper had another significant improvement hidden in it: during training the gradients can take a "shortcut" so that the bottom layers are trained faster than in typical deep learning architectures. This enables very large and deep models to be trained in a reasonable time. Without this trick, the huge LLMs seen these days would not have been possible!

Andrej talks about it here: https://youtu.be/9uw3F6rndnA?t=238


Without animated visuals, I don't think any non-math/non-ML person can ever get a good understanding of transformers.

You will need to watch videos.

Watch this playlist and you will understand: https://youtube.com/playlist?list=PLaJCKi8Nk1hwaMUYxJMiM3jTB...

Then watch this and you will understand even more: https://youtu.be/g2BRIuln4uc

Finally, watch this playlist: https://youtube.com/playlist?list=PL86uXYUJ7999zE8u2-97i4KG_...


Not really an ELI5, but it you really want to understand the topic, I've gathered the best resources I could find to help me do the same: https://blog.senko.net/learn-ai

If you're a programmer, start with Karpathy's video series. For a somewhat gentler intro, take a look at the MIT intro lectures first to build up on the fundamentals.

Then you're ready for The Illustrated Transformer, and afterwards, if you're brave enough, the Annotated Transformer.

It's a fascinating subject, more so when you have a solid grasp! And you'll be able to quickly spot people who kinda stumble their way through but have big gaps in understanding.


This link was posted here recently, and was the most understandable explanation I've found so far: https://e2eml.school/transformers.html


Discussed here:

Transformers from Scratch (2021) - https://news.ycombinator.com/item?id=35697627 - April 2023 (46 comments)

Transformers from Scratch - https://news.ycombinator.com/item?id=29315107 - Nov 2021 (17 comments)


There is no Nobel prize for math or computer science, though...

I worked on a few projects that were trying to develop foundation models for health care, aviation, and other domains. In particular I trained an LSTM model to write fake abstracts for clinical case reports.

We ran into many problems, but maybe the worst one with the LSTM is that a real document repeats itself a lot. For instance, somebody's name might turn up multiple times and the LSTM was very bad at that kind of copying. The LSTM community was arguing about solutions to this problem, but the attention mechanism in transformers makes it easy.


One prize for math is the Fields Medal; for CS it's the Turing Award... perhaps there should be a JMC prize for advances in A.I.

https://en.wikipedia.org/wiki/Fields_Medal

https://en.wikipedia.org/wiki/Turing_Award

https://en.wikipedia.org/wiki/John_McCarthy_(computer_scient...



There are already too many ELI5 explanations here. I will instead explain "attention" to those who are already familiar with seq2seq models.

In the context of seq2seq models, attention is a technique to compute a weighted average of hidden states of the encoder. When I first realized this simple fact, everything finally clicked.

In contrast to a vanilla seq2seq, which takes only the encoder's hidden state at the last timestep as the context vector, the context vector of a seq2seq with attention is a weighted average of the hidden states of every timestep. The weight of a hidden state of the encoder is a similarity score between the hidden state and the decoder's previous output (the decoder's current state). The similarity function can be as simple as a dot product, but there are various ways to do it.

Attention can improve a seq2seq model because the encoder's last hidden state no longer needs to represent the whole input sequence well, which is hard if the sequence is long. Now the decoder takes, at every timestep, all of the encoder's hidden states and computes an average of them with weights specific to that timestep. The weights at a timestep represent which input words are more important, and thus which to focus on, when the decoder outputs a word at that timestep.

More generally, attention takes a set of value vectors and a query vector and computes a weighted average (or more generally, a weighted sum, if the weights don't sum up to 1) of the values based on the query. In the context of seq2seq models, the values are the encoder's hidden states and the query is the decoder's previous output.
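As a rough sketch of that weighted average (made-up sizes, random states, dot product as the similarity function):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    T, d = 7, 32                          # 7 encoder timesteps, 32-dim hidden states
    enc_states = np.random.randn(T, d)    # the "values"
    dec_state = np.random.randn(d)        # the "query": the decoder's current state

    scores = enc_states @ dec_state       # similarity of the query to each hidden state
    weights = softmax(scores)             # normalize so the weights sum to 1
    context = weights @ enc_states        # the weighted average = the context vector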


Not ELI5 obviously but might help some.

Transformer is a patterning probabilistic machine for a sequence of identities[1]. These identities are fed to the transformer in lanes. The transformer is conditioned to shift lanes one position to the left until they make it to the output, and to make a prediction in the right-most lane that got freed up.

Attention adds an exponential amount of layer interconnectivity compared with simple densely connected layers. The attention mask serves as a high-dimensional dropout, without which it would be extremely easy for the Transformer to simply repeat the inputs (and then fail to generalize when making the prediction).

Each layer up until the vertical middle of the Transformer works with a higher contextual representation than the previous one, and this is again unwound back to lower contexts from the middle layer back to the original identities (integers) on the outputs. This means that you have raw identities on the input and output which span a certain width/window of the input sequence, but in comparison the middle-most layer has a sequence of high-level contexts spanning extreme lengths of the original input sequence, knowledge-wise.

[1] It's important to know that modification (learning) of the vector embeddings which represent the input/output identities/integers constitutes a big portion of the Transformer's power. The practical implication is that it's impractical to try to tell the Transformer that, e.g., some of our identities are similar or that there's some logical system in their similarity, because all the Transformer really cares about is the occurrence of these identities in the sequence we train it on, and it will figure out the similarities or any kind of logic in the sequence by itself.


Feedforward: y=Wx

Attention: y=W(x)x

W is a matrix; x and y are vectors. In the second case, W is a function of the input.


You must be from a planet with very long years!

There is no way I can even begin to digest what you have said in your comment.


Sorry, maybe I should have added more explanation. One way to think about attention, which is the main distinguishing element in a transformer, is as an adaptable matrix. A feedforward layer is a matrix with static entries that do not change at inference time (only during training). The attention mechanism offers a way to have adaptable weight matrices at inference time (this is implemented using three different matrices called K, Q and V, for key, query and value, in case you want to dig deeper).
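A toy numpy sketch of that contrast, for the self-attention case, with random matrices just to show that the mixing matrix depends on the input:

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    n, d = 4, 8
    x = np.random.randn(n, d)

    # Feedforward: the mixing matrix is fixed once training is done.
    W_static = np.random.randn(d, d)
    y_ff = x @ W_static                       # y = W x

    # (Self-)attention: the mixing matrix is computed *from* the input.
    Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
    W_of_x = softmax((x @ Wq) @ (x @ Wk).T)   # an n-by-n matrix that depends on x
    y_attn = W_of_x @ (x @ Wv)                # y = W(x) x, up to the K/Q/V details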


I think in your notation it should have been:

y=Wx_0

y=W(x)x_0


I guess I was more thinking about self attention, so yes. The more general case is covered by your notation!


The best way to understand transformers is to take Andrej Karpathy's course on YouTube, with a keyboard and a lot of focus time.


It is hard but so worth it. It is hard to overstate how good it is. The pedagogy, the charisma/style, the fact that he cofounded OpenAI and worked for Elon but is as modest as, say, a math tutor popping into your house to teach you some math!

There is a great discord community attached too which makes a big difference.

What is missing from this (and another course I did), and is very hard to find, is multivariate calculus on linear algebra. I feel motivated to create a resource on it because it's pretty hard. For example, how to differentiate matrix operations where broadcasting has been involved: not just the how, but really grokking it into working memory.


I've been working through [0]. Like a lot of math, the notation is daunting, but once you become familiar with it, it really is a nice tool for thought.

[0]: https://arxiv.org/abs/2207.09238


This! The best resource I've found to explain transformers, that made them clear to me. I wish all deep learning papers were written like this, using pseudocode.


Cosma Shalizi explains it very well.

"Again: Calling this "attention" at best a joke."

http://bactra.org/notebooks/nn-attention-and-transformers.ht...


Thanks a lot for sharing. I liked the explanation, although the match between attention and kernels doesn't seem perfect (as far as I understand). As I understand it, x_o = Qx and x_i = Qx and y_i = Vx, but that doesn't map to Wu dot Wv.

Anyway just curious do you or someone else have more sources in this format?

At least this way it is very helpful to think about the methods. I kind of agree that the formula does look very similar to well-known methods. But on the other hand, the author doesn't explain transformers in a similarly simple way, since it is not obvious why you could stack kernels and get better results.


What is ELI5?

The idea behind the Transformer is nice - but by far not Nobel prize deserving.

Don't believe the hype or people like Yegge, whoever that is - in a few years a new architecture will be the "Nobel candidate".

Also, the original Transformer paper, if you read it, is horribly written.


Something about tech this useful seems Nobel-worthy.


A Nobel prize for the neural network or the perceptron, maybe?


explain like i'm 5 (years old)


I argue that the attention mechanism is just Hopf algebra convolution. The QK mechanism selects possible candidates and the OV mechanism picks one of these possible candidates.

I have recently written a paper on this https://arxiv.org/abs/2302.01834

I also have a discord channel https://discord.cofunctional.ai.

Ironically, it's the same mechanism as what renormalization in QFT does. I'm getting increasingly convinced that it's also how the brain works.


Could you ELI5 (or maybe 15, with high-school maths) this concept? Particularly how it may relate to how our brains might work? Most interested.


Not quite ELI5 and there's a few partially overlapping answers around already but here goes.

The key part is the attention mechanism, as the title of the paper may have spoiled. It works more or less like this:

- Start with an input sequence X1, X2 ... Xn. These are all vectors.

- Map the input sequence X into 3 new sequences of vectors: query (Q), key (K), and value (V), all of the same length as the input X. This is done using learnable mappings for each of the sequences (so one for X->Q, another for X->K and one for X->V).

- Compare the similarity of every query with every key. This gives you a weight for each query/key pair. Call them W(Q1, K2) and so forth.

- Compute the output Z as the sum of every _value_ weighted by the weight for the respective query/key pair (so Z1 = V1*W(Q1,K1) + V2*W(Q1,K2) + ... + Vn*W(Q1,Kn), Z2 = V1*W(Q2,K1) + V2*W(Q2,K2) + ...)

- and that's about it!

As throwawaymaths mentions, this is quite similar to a learnable hash table with the notable difference that the value fetched is also changed, so that it doesn't fetch "input at an index like i" but "whatever is important at an index like i".

Now a few implementation details on top of this:

- The description is for a single "attention head". Normally several, each with their own mappings for Q/K/V, are used, so the transformer can look at different "things" simultaneously. 8 attention heads seems pretty common.

- The description doesn't take the position in the sequence into account (W(Q1,K1) and W(Q1,Kn) are treated perfectly equally). To account for ordering, "positional encoding" is normally used. Usually this is just adding a bunch of scaled sine/cosine waves to the input. Works surprisingly well.

- The transformer architecture has a number of these "attention layers" stacked one after the other and also 2 different stacks (encoder, decoder). The paper is about machine translation, so the encoder is for the input text and the decoder for the output. Attention layers work just fine in other configurations as well.

The rest of the architecture is fairly standard stuff
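To make the "several heads" point concrete, here's a rough numpy sketch; the dimensions are arbitrary and the random matrices stand in for the learned mappings (the positional encoding described above would be added to X first):

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def attention_head(X, Wq, Wk, Wv):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv              # the three learned mappings
        W = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # query/key similarity weights
        return W @ V                                  # weighted sum of the values

    n, d, heads, dh = 6, 64, 8, 8     # 8 heads, each working in 8 dimensions
    X = np.random.randn(n, d)         # input sequence (assume positions already encoded)

    outs = []
    for _ in range(heads):            # each head has its own Q/K/V mappings
        Wq, Wk, Wv = (np.random.randn(d, dh) for _ in range(3))
        outs.append(attention_head(X, Wq, Wk, Wv))
    out = np.concatenate(outs, axis=-1)   # concatenate the heads: back to shape (n, d)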


Simplest explanation: you start with a fully connected network of N layers. This already gives you the functionality. Then you remove a lot of unnecessary connections. This gives you the performance.


The use of words like "attention" or "hallucinate" as 'terms of art' concerns me because, while it's actually fine and normal to re-purpose ordinary English words to have a highly specific meaning in a field, when you use them outside the cognoscenti they carry meaning which is not strictly appropriate.

In this case, AGI proponents are using words which are highly loaded to mean "is a thinking, reasoning being" in some way.

I don't like it. I would prefer that they'd chosen words which were more neutral and not based on illusions of intelligence, or allusions to known intelligent behaviour.

"attention" is a thing, sure. But, if you use this word in formal session presenting to Congress, you're misleading them without conscious effort to believe you think "it's alive"

I don't like it. I think in hindsight calling the field AI was a huge mistake.

If you want something to hang this on, think about legal English and words like "real property": do you really know what a solicitor or lawyer or barrister or judge means when they say that? Or "without let or hindrance": what does the word "let" mean there?

Within legal contexts, using the jargon is a given. Misusing it outside the courtroom as a non-legal practitioner is a recipe for disaster. This is where "Sovereign Citizens" are playing: look how well that's going.


I don't think you really want a 5yo's version, because that's "computers can be smart". But for an SDE1-level explanation I made this video which has gotten tons of compliments and keeps people watching year after year:

https://www.youtube.com/watch?v=S27pHKBEp30

It's in the context of NLP, which is where transformers started of course.


There is an annotated version of the paper, with a link to a GitHub repo in there. Google “The Annotated Transformer”. It’s off nlp.seas.harvard.edu.


I hope that soon LLMs will generate such notebooks for any scientific paper you ask about. That would help eliminate papers with fake and unrealistic results.


It helps to start with recurrent neural networks first, since those were the previous standard way of doing next-token-prediction. They worked, but training them was extremely slow because it couldn't be parallelized. Transformers are a way of getting mostly the same capabilities as RNNs but with a parallelizable architecture so you can actually train it with huge parameter numbers in a reasonable amount of time.


I agree it's very useful to look at RNNs, even more so because attention mechanisms were already getting popular in LSTM/GRU models before the Transformers paper. One of the main insights of the paper is that you can do with just attention and do not need the recurrent part of the network anymore (hence the title "Attention is all you need").


ELI5: In neural networks we could take data numbers, multiply by a constant, add a constant, add them to each other, and then do the same to the results again and again. And you could learn the right constants using labeled data. But we couldn’t multiply data numbers against each other. So in transformers you can. But you must multiply all numbers against all numbers in a specific sample, so it’s heavy.


Here is an attempt that sacrifices technical precision for intuition.

We're going to represent words as vectors (a sequence of numbers). We would like it to be the case that the value of the numbers reflects the meaning of the words. Words that mean similar things should be near each other. We also want to represent higher level ideas, ideas that take multiple words to express, in the same way. You can think of all the possible vectors as the entire space of ideas.

To begin with, though, we just have a vector for each word. This is insufficient - does the word "bank" mean the edge of a river or a place to store money? Is it a noun or a verb? In order to figure out the correct vector for a particular instance of this word, we need to take into account its context.

A natural idea might be to look at the words next to it. This works okay, but it's not the best. In the sentence "I needed some money so I got in my car and took a drive down to the bank", the word that really tells me the most about "bank" is "money", even though it's far away in the sentence. What I really want is to find informative words based on their meaning.

This is what transformers and attention are for. The process works like this: For each word, I compose a "query" - in hand-wavy terms, this says "I'm looking for any other words out there that are X". X could be "related to money" or "near the end of the sentence" or "are adjectives". Next, for each word I also compute a "key", this is the counterpart of the query, and says "I have Y". For each query, I compare it to all the keys, and find which ones are most similar. This tells me which words (queries) should pay attention to which other words (keys). Finally, for each word I compute a "value". Whereas the "key" was sort of an advertisement saying what sort of information the word has, the "value" is the information itself. Under the hood, the "query", "key" and "value" are all just vectors. A query and a key match if their vectors are similar.

So, as an example, suppose that my sentence is "Steve has a green thumb". We want to understand the meaning of the word "thumb". Perhaps a useful step for understanding any noun would be to look for adjectives that modify it. We compute a "query" that says "I'm looking for words near the end of the sentence that are adjectives". When computing a "key" for the word green, maybe we compute "I'm near the end of the sentence, I'm a color, I'm an adjective or a noun". These match pretty well, so "thumb" attends to "green". We then compute a "value" for "green" that communicates its meaning.

By combining the information we got from the word "green" with the information for the word "thumb", we can have a better understanding of what it means in this particular sentence. If we repeat this process many times, we can build up stronger understanding of the whole sentence. We could also have a special empty word at the end that represents "what might come next?", and use that to generate more text.

But how did we know which queries, keys and values to compute? How did we know how to represent a word's meaning as numbers at all? These seemingly impossible questions are what is being "learned". How exactly that happens would require an equally big explanation of its own.

Keep in mind that this explanation is very fuzzy, and is only intended to convey the loose intuition of what is going on. It leaves out many technical details and even gets some details intentionally wrong to avoid confusion.


Thank you for this explanation. I've found that the QKV concepts are some of the most glossed over parts of attention and I'll be honest, some of the most confusing. Would you mind actually going into more detail on the questions you asked towards the end of your post? I vaguely understand how an embedding might get learned over time, but I don't understand how queries are "constructed" or how these three separate matrices get learned, or what kind of information is being encoded in them when they're learned. I also don't really understand how the value matrix itself is used.

Any further detail, even if it gets into some technical details would be very helpful and appreciated!


Thank you for asking this question and pushing for clarification. I literally have a tab open with a question to GPT asking it to explain to me the positional encoding and QKV concepts of transformers. After going through Karpathy's Zero to Hero and reading/watching a few other tutorials on modern NN architectures I feel I mostly have a grasp on the main topics (e.g. back propagation). But the Key/Query matrices just stick out like a sharp thorn. These are clearly the most important features of the transformer architecture and it is frustrating not to have an intuitive understanding of their function.


Sure, so to see how these things can be learned, we should be a little more precise about how they work.

Each token is a vector, and from that vector we compute three things - a query, a key and a value. Each of these is typically computed by multiplying the token's vector by a matrix (aka a linear projection). It's the values in these matrices that we need to learn.

When performing an attention step, for a given token we compare its "query" to every token's "key" (including its own key - a token can attend to itself). This gives us a score for how important we think that key is. We normalize those scores to sum to one (typically via a softmax operation). Essentially, we have one "unit" of attention, and we're going to spread it across all the tokens. Some we will pay a lot of attention to, and others very little.

But what does it mean to pay a lot of or a little attention to other tokens? At the end of this whole procedure, we're going to arrive at a new vector that represents our new understanding/meaning for the token we're working on. This vector will be computed as a weighted sum of the values from all the tokens we're attending to. The weights are our attention scores (determined by the query-key similarity scores).

So as a simple example, suppose I have three tokens, A, B and C, and let's focus on the attention operation for A. Say A's query vector is [1 2 -1]. A's key vector is [3 -1 0], B's key vector is [3 0 -1] and C's key vector is [0 1 -3]. This gives us raw attention scores of 1 for A (attending to itself), 4 for B, and 5 for C. Rather than take a messy softmax, let's just normalize these to 0.1, 0.4 and 0.5 for simplicity.

Now that we have our attention weights, we also need to know each token's value. Let's say they are [1 0 1] for A, [-1 2 0] for B, and [1 1 1] for C. So our final output for this attention step will be 0.1 * [1 0 1] + 0.4 * [-1 2 0] + 0.5 * [1 1 1]. This gives us [0.2 1.3 0.6] (assuming I eyeballed the math correctly), this will be our new representation of A for the next step. (in practice there are some additional network layers that do more processing).
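That toy example fits in a few lines of numpy, using the simple sum-to-one normalization from above instead of a softmax:

    import numpy as np

    q_A = np.array([1, 2, -1])                               # A's query
    keys = np.array([[3, -1, 0], [3, 0, -1], [0, 1, -3]])    # keys for A, B, C
    values = np.array([[1, 0, 1], [-1, 2, 0], [1, 1, 1]])    # values for A, B, C

    scores = keys @ q_A                 # raw attention scores: [1, 4, 5]
    weights = scores / scores.sum()     # normalized: [0.1, 0.4, 0.5]
    new_A = weights @ values            # weighted sum of values: [0.2, 1.3, 0.6]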

Okay, so how can we learn any of the matrices that go from a token vector to a query, a key and a value? The important thing is that all of this is just addition and multiplication - it's all nicely differentiable. And because the attention is "soft" (meaning we always attend at least a little bit to everything, as opposed to "hard" attention where we ignore some items entirely), we can even compute gradients through the attention scores.

Put more simply, I can ask "if I had included a bit more of A's value and a bit less of B's value, would my final output have been closer to the target?". To include a bit less of B, I need to make A's query and B's key a little further apart (lower dot product). And to make them a little further apart, I need to adjust the numbers in the matrices that produce them. Similarly, I can ask "if C's value had been a little larger in the first slot and a little smaller in the third, would my final output have been closer to the target?", and adjust the value matrix in the same way. Even if the attention to another token is very low, there's at least a small sliver of contribution from it, so we can still learn that we would've been better off having more (or even less) of it.

Learning the initial embeddings (the vectors that represent each word in the vocabulary, before any processing) is done in the same way - you trace all the way back through the network and ask "if the embedding for the word 'bank' had been a little more like this would my answer have been closer?", and adjust accordingly.

Understanding what exactly the queries keys and values represent is often very difficult - sometimes we can look and see which words attend to which other words and make up a convincing story. "Oh, in this layer, the verb is attending to the corresponding subject" or whatever. But in practice, the real meaning of a particular internal representation is going to be very fuzzy and not have a single clear concept behind it.

There is no explicit guidance to the network like "you should attend to this word because it's relevant to this other word." The only guidance is what the correct final output is (usually for LLMs the training task is to predict the next word, but it could be something else). And then the training algorithm adjusts all the parameters, including the embeddings, the QKV matrices, and all the other weights of the network, in whatever direction would make that correct output more likely.


This was an excellent explanation, thank you for taking the time to write it out!


I am not sure what kind of background you have, but with a bit of basic linear algebra you might enjoy this classic:

https://jalammar.github.io/illustrated-transformer/


Has ELI5 lost its literal meaning? What five-year-old would understand anything in this thread?


It was never meant to be taken literally, it's just an idiomatic way of asking that something be explained as simply as possible.

An answer to OP's question targeted at a five year old level would likely be impossible given the subject matter, or else need to be so general and simplistic as to be useless on this forum of not actual five year olds.


The current top comment managed it in a way that is actually useful: https://news.ycombinator.com/item?id=35981106


I don't know. It's simple and straightforward, but still seems like it would go over the head of an actual five year old.


Can someone please explain the "Scaled dot-product attention" function from the article?

Attention(Q, K, V) = softmax( (Q * TRANSPOSED(K)) / sqrt(d_k) ) * V

That's where i start to shake my head.


I don't get it. They tried an architecture, then couldn't be bothered to describe it well...

I remain disappointed at the staggeringly low quality of academic work. The writing here is appalling. No worked example provided, despite evidence of one... typical academic crap.

Wouldn't surprise me if you go to try it and it's wrong, and none of the real problems have been solved.


I do not understand what exactly you are complaining about.

It is true that the "Attention Is All You Need"[0] paper requires some level of basic understanding/experience of ML architectures and engineering, but the target audience for the paper certainly understands it. It would be unnecessary if every academic paper started with a first-principles explanation of what is considered common knowledge in the field.

> wouldn't surprise me if you go to try it and its wrong, and none of the real problems have been solved

But obviously attention based transformer architectures are solving real world problems (by being better than previous architectures in real world applications).

[0] https://arxiv.org/abs/1706.03762


> I have zero AI/ML knowledge

This may make it difficult to explain and I already see many incorrect explanations here and even more lazy ones (why post the first Google result? You're just adding noise)

> Steve Yegge on Medium thinks that the team behind Transformers deserves a Nobel

First, Yegge needs to be able to tell me what Attention and Transformers are. More importantly, he needs to tell me who invented them.

That actually gets to an important point, and why there are so many bad answers here and elsewhere: you're missing a lot of context, and the definitions themselves are murky. This is also what makes it difficult to ELI5. I'll try, then give you resources to get an actually good answer.

== Bad Answer (ELI5) ==

A transformer is an algorithm that considers the relationship of all parts of a piece of data. It does this through 4 mechanisms and in two parts. The first part is composed of a normalization block and an attention block. The normalization block scales the data and ensures that the data is not too large. Then the attention mechanism takes all the data handed to it and considers how it is all related to one another. This is called "self-attention" when we only consider one input, and it is called "cross-attention" when we have multiple inputs and compare them. Both of these create relationships, similar to a lookup table. The second part is also composed of a normalization block, followed by a linear layer. The linear layer reprocesses all the relationships it just learned and gives them context. But we haven't stated the 4th mechanism! This is called a residual layer or "skip" layer. It allows the data to pass right on by each of the above parts without being processed, and this little side path is key to getting things to train efficiently.
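In very rough code, the two parts and the skip path might look something like the sketch below (numpy, a single attention head, random matrices; a cartoon of the idea, not a faithful reproduction of any real model):

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def norm(x):   # normalization block: rescale so the numbers don't blow up
        return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

    def attention(x, Wq, Wk, Wv):   # relate every position to every other position
        Q, K, V = x @ Wq, x @ Wk, x @ Wv
        return softmax(Q @ K.T / np.sqrt(x.shape[-1])) @ V

    n, d = 8, 32
    x = np.random.randn(n, d)
    Wq, Wk, Wv, W1, W2 = (np.random.randn(d, d) for _ in range(5))

    # Part 1: normalize, attend, and let the input skip past via the residual path.
    x = x + attention(norm(x), Wq, Wk, Wv)
    # Part 2: normalize, then the linear layer(s), again with a skip path.
    x = x + np.maximum(0, norm(x) @ W1) @ W2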

Now, that ELI5 answer doesn't really do the work justice or give a good explanation of why or how things actually work. ELI5 isn't a good way to understand things for usage, but it is an okay place to start and learn abstract concepts. For the next level up I suggest Training Compact Transformers[0]. It'll give some illustrations and code to help you follow along. It is focused on vision transformers, but it is all the same. For the next level after that I suggest Karpathy's video on GPT[1], where you will build transformers and he goes into a bit more depth. Both of these are good for novices and people with little mathematical background. For more lore, and for understanding why we got here and the confusion over the definition of attention, I suggest Lilian Weng's blog[2] (everything she does is gold). For a lecture and more depth I suggest Pascal Poupart's class. Lecture 19[3] is the one on attention and transformers, but you need to at minimum watch Lecture 18 first; if you actually have no ML experience or knowledge then you should probably start from the beginning.

The truth is that not everything can be explained in simple terms, at least not if one wants an adequate understanding. That misquotation of Einstein (probably originating from Nelson) is far from accurate, and I wouldn't expect someone who introduced a highly abstract concept with complex mathematics (to such a degree that physicists argued he was a mathematician) to say something so silly. A lot is lost when distilling a concept, and neither the listener nor the speaker should fool themselves into believing this makes them knowledgeable (armchair expertise is a frustrating feature of the internet and has gotten our society in a lot of trouble).

[0] https://medium.com/pytorch/training-compact-transformers-fro...

[1] https://www.youtube.com/watch?v=kCc8FmEb1nY

[2] https://lilianweng.github.io/posts/2018-06-24-attention/

[3] https://www.youtube.com/watch?v=OyFJWRnt_AY


As usual, the most helpful answer is buried among desperate platitudes full of inaccuracies trying to pander to the absurd reddit-esque ELI5 notion you've dismantled.


How many responses here are generated entirely or aided by the technology they are explaining?

That is the real question.


Any sequence of on-topic responses that are not phrased as rebuttals or disagreements is almost assuredly generated.


So, if you got a time machine back to the year 2000 and knew how to code Transformers - could you do it? Your CPU/cloud/parallel processing hardware would be limited, but as I understand it there's nothing revolutionary being done here, just a lot of matrix math that produces results (and we still don't fully understand why it all works).


"Matrix math" itself is very old, but I'm guessing what you're referring to here is probably neural networks or Markov chains, both of which are... a half-century to a century old, if I remember right?

Transformers on the other hand are new, less than a decade old.


So many top-level comments and replies here are so laughably wrong.

Pro tip: if you want technical info on research-related topics, don't ask HN. Tech bros can't handle telling themselves they don't know something, so everyone will give their "take" on the subject at hand.


Thanks to everyone in this thread for all the great links and explanations!!


Let the model decide which words to focus on when producing a response


I'd like to read the article. What are the required maths?


Sure! Transformers are a type of deep learning model that have revolutionized the field of natural language processing (NLP) and achieved state-of-the-art performance on various tasks like machine translation, text summarization, and question answering.

The "Attention is All You Need" paper, published in 2017 by Vaswani et al., introduced the Transformer model architecture. The paper proposed a new way to process sequences of data, such as words in a sentence or time steps in a time series, without using recurrent neural networks (RNNs) or convolutional neural networks (CNNs). Instead, it introduced a mechanism called "self-attention."

Self-attention allows the model to weigh the importance of different words in a sentence when processing each word. This attention mechanism helps the model to focus on the relevant parts of the input sequence. In other words, it pays attention to different words based on their contextual significance for a given task.

To understand self-attention, let's consider an example. Suppose we have a sentence: "The cat sat on the mat." When processing the word "sat," self-attention enables the model to assign higher weights to words like "cat" and "the" and lower weights to words like "on" and "the mat." This way, the model can learn which words are more relevant to understanding the context of "sat."

The Transformer model consists of an encoder and a decoder. The encoder processes the input sequence, such as a sentence, while the decoder generates the output sequence, such as a translated sentence. Both the encoder and decoder are composed of multiple layers of self-attention and feed-forward neural networks. The self-attention layers allow the model to capture dependencies between different words in the sequence, while the feed-forward networks help in learning more complex patterns.

The "Attention is All You Need" paper demonstrated that Transformers achieved state-of-the-art performance on machine translation tasks while being more parallelizable and requiring less training time compared to traditional RNN-based models. Since then, Transformers have become the go-to architecture for many NLP tasks and have been further improved with variations like BERT, GPT, and T5.

In summary, the Transformer model introduced in the "Attention is All You Need" paper replaced traditional recurrent or convolutional neural networks with self-attention, allowing the model to capture contextual relationships between words more effectively. This innovation has had a significant impact on the field of NLP and has become the foundation for many subsequent advances in the field.


Lol. I mean, who here doesn't realize that this is a straight up chatGPT response? Not even edited in any way.

You know what would be great to add to HN? A chatGPT response below each comment. So nobody else finds the need to do this.


Here is an explanation for people who have basic familiarity with machine learning, that explains the Query/Key/Value computation. The ideas are relatively intuitive when you strip away the matrix manipulations.

Let's say that you are given a very good embedding of each English word as a vector of numbers. The idea of embeddings is that each dimension captures a different characteristic of the word. So for example, dimension 37 might capture gender and dimension 56 might capture how royal the word is. So "king" and "queen" will have very different scores in dimension 37 but both words will have a high score in dimension 56. These embeddings have been available for many years, eg word2vec.

The challenge is this: given a sentence with many words, how can you best encode the meaning of the sentence in a vector? The simplest approach is to take the embeddings for all the words and average them together to get a summary vector. This is a reasonable approach, and will work fine for simple tasks like assigning a positive or negative sentiment to the sentence. For example, it will do a good job of separating “I love this amazing product” and “I hate this terrible product”. This approach is analogous to the “bag of words” model.

This simple model is missing two big things. First, when interpreting the meaning of each word, it uses the original embedding of that word without any regard for the context around the word. So “bank” will be assigned the same meaning in the sentence “we got money from the bank” and “we sat by the river bank.” Second, the model does not take into account the ordering of the words, so that “the dog bit the man” and “the man bit the dog” will both get the same result.

Said another way, our simple model lacks the expressibility to distinguish meaningful differences between sentences. Transformers address these deficiencies by making the model more expressive, while keeping it computationally efficient and easy to train.

First, the transformer recognizes that we need to reinterpret each word based on the other words in the sentence. Each “layer” of the transformer can be seen as doing a reinterpretation of each word based on its context. Successive layers are applied to reach iteratively better reinterpretations.

In order to reinterpret the word “bank” in the sentence “we got money from the bank”, we first need to score all of the other words based on their relevance to “bank”. Obviously, “money” should get a higher relevance score than “from”. A natural approach to get a relevance score is to take the dot product of each other word’s embedding against the embedding for the word bank. (The dot product of two vectors is a common metric to gauge their similarity.)

However, this is not quite expressive enough. For example, in the sentence “the food tastes disgusting”, the meaning of the word “disgusting” is actually not very similar to the meaning of “food”, but clearly “disgusting” is very relevant to the interpretation of “food.” To take this into account and improve the expressiveness of the model, the idea is to maintain a separate set of embeddings for each word to be used in the relevance score dot product. These embeddings are called “keys”. So when reinterpreting the word “bank” in the sentence “we got money from the bank”, we grab the key embeddings for all the words, and dot product each one against the separate query embedding for the word “bank”. For example, multiplying the key for “money” against the query for “bank” will tell us how relevant the word “money” is for reinterpreting the word “bank.” Note that we need to separate key and query to break the symmetry of the dot product. In the phrase "Monopoly money", the word "Monopoly" significantly changes our interpretation of the word "money", but "money" does not significantly change our interpretation of "Monopoly."

Now that we have these relevance scores, we normalize them to sum to 1, and then we reinterpret “bank” as a relevance-weighted average of the value vectors of all of the other words. This is called the Attention mechanism, because when reinterpreting each word we selectively "pay attention" to the words that are most relevant to it.

There are a number of details omitted in this description, but hopefully it gives a general sense. The black magic of designing ML architectures is developing the right intuition for what is just expressive enough to capture meaningful relationships, while still being easy to compute and leveraging modern hardware.

It's a bit like deciding how many legs to put on a table. It's not so much that 4 legs is theoretically correct, but rather that 2 legs definitely doesn't work, 3 legs seems okay but feels a bit iffy if we put our weight in certain places and it's not too much more expensive to add a fourth leg anyway, and 5 legs definitely seems like overkill.

———————————————

Major omitted details:

- The word embeddings are not fixed, but learned from scratch as trainable parameters

- The query, key and value vectors for each word are actually the output of the input embedding times three different matrices Q, K and V. The reason for doing this is a bit complex. In order to have successive layers of reinterpretation, you cannot keep using the same query vector for each word in the subsequent layers because you have reinterpreted what it means. After the first layer, you no longer have the word "bank", you just have a reinterpreted vector of numbers so there is no way to do a lookup to get a query vector. Multiplying the new vector by three different learned matrices is a clever way to get around this.

- Positional information is encoded by adding a (learned) positional vector to the word embedding, so that the embedding for “bank” will look a little different if it is at the beginning of the sentence vs. the end of the sentence.
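Putting the two models described above side by side, here's a toy numpy sketch with made-up embeddings: the bag-of-words average versus a relevance-weighted reinterpretation of one word.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    rng = np.random.default_rng(0)
    emb = {w: rng.normal(size=16) for w in "we got money from the bank".split()}
    E = np.stack(list(emb.values()))       # one made-up embedding per word

    # Simple model: the sentence is just the average of its word embeddings.
    sentence_vec = E.mean(axis=0)

    # Attention-style reinterpretation of "bank": score every word's *key*
    # against bank's *query*, then take a weighted average of the *values*.
    Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
    query = emb["bank"] @ Wq
    keys, values = E @ Wk, E @ Wv
    weights = softmax(keys @ query)        # in a trained model, "money" would score high
    new_bank = weights @ values            # a context-aware vector for "bank"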


You guys are cracking me up lol most comments here must have the smartest damn 5-year olds in mind!

“It’s simple ! Just tokenize the context and allow differentiated hash maps to best map the embeddings matrix. Duh!”


Explain Like I'm Five... Years Into A Doctorate


I asked GPT (via Bing) for a five paragraph explanation. Here's what it wrote:

Sure, I'll try to explain what a transformer is in the context of AI tools like ChatGPT, and what attention references in this context. Here is a five paragraph explanation:

*Paragraph 1: What is a transformer?* A transformer is a type of neural network that can process sequential data, such as natural language text, by learning the context and meaning of the data elements. A neural network is a computer program that can learn from data and perform tasks such as classification, regression, generation, etc. A transformer can take an input sequence, such as a sentence or a paragraph, and convert it into another sequence, such as a translation or a summary. A transformer can also generate sequences from scratch, such as writing a poem or a story.

*Paragraph 2: How does a transformer work?* A transformer works by using two main components: encoders and decoders. An encoder takes an input sequence and converts it into a vector representation called an encoding. A decoder takes an encoding and converts it back into an output sequence. Both the encoder and the decoder use a technique called attention to focus on the relevant parts of the input and output sequences. Attention is a way of measuring how much each element in a sequence influences or depends on another element.

*Paragraph 3: What is attention?* Attention is a mathematical technique that allows a transformer to learn the relationships between the elements in a sequence. For example, in a machine translation task, attention allows the transformer to translate words like 'it' into the correct gender in another language by looking at all the relevant words in the original sentence. Attention also allows the transformer to handle long sequences without losing information or context. There are different types of attention, such as self-attention, cross-attention, global attention, local attention, etc.

*Paragraph 4: Why are transformers important?* Transformers are important because they are among the most powerful and versatile types of neural networks invented to date. They can handle many types of sequential data, such as text, speech, images, videos, etc. They can perform many tasks, such as machine translation, text summarization, text generation, image captioning, speech recognition, etc. They can also learn from large amounts of data and make accurate predictions that drive their wider use.

*Paragraph 5: What are some examples of transformers?* Some examples of transformers are:

- ChatGPT: A transformer that can generate realistic and engaging conversations based on a given topic or context. - BERT: A transformer that can understand natural language and answer questions or perform other tasks based on it. - GPT-3: A transformer that can generate text for any purpose or domain based on a few words or sentences. - DALL-E: A transformer that can generate images from text descriptions. - AlphaStar: A transformer that can play the strategy game StarCraft at a superhuman level.

I hope this explanation helps you understand what a transformer is and what attention references in this context.

Source: Conversation with Bing, 5/17/2023 (1) What Is a Transformer Model? | NVIDIA Blogs. https://blogs.nvidia.com/blog/2022/03/25/what-is-a-transform.... (2) Transformer Neural Network Definition | DeepAI. https://deepai.org/machine-learning-glossary-and-terms/trans.... (3) Generative AI: AI Transformers. https://lablab.ai/blog/generative-ai-ai-transformers. (4) The Ultimate Guide to Transformer Deep Learning - Turing. https://www.turing.com/kb/brief-introduction-to-transformers.... (5) How Transformers Work. Transformers are a type of neural… | by Giuliano .... https://towardsdatascience.com/transformers-141e32e69591.


I cannot do ELI5, but can do ELI14 for you.

Transformer is a building block (a part) of a language model. "Language model" is an algorithm that can predict words following given words. For example, you can give a text to a model and get a summary of this text, or an answer to the question in the text, or a translation of the text.

Language models are often made of two parts - encoder and decoder. The encoder reads input text (each word is encoded as a bunch of numbers, for example, as list of 512 floating-point numbers) and produces a "state" (also a large list of numbers) which is expected to encode the meaning of the text. Then the decoder reads the state and produces the output as words (to be exact, as probabilities for every possible word in the dictionary to be at a certain position in the output).

Before Transformers, people tended to use the so-called "recurrent neural network" architecture. With this approach, the encoder processes the text word by word and updates the state after every word:

    state = some initial state
    for word in text:
        state = model(state, word)

model(...) here is a complicated mathematical function, often with millions of operations and parameters.

As I have written above, after reading the text, the state should encode the meaning of the text.

But it turned out that this approach doesn't scale well with long or complicated texts, because the information from the beginning of the text gets lost. The model tends to "forget" what it had read before. So a new architecture, "Transformers", was proposed. The difference is that now we give the entire text (each word encoded as a bunch of numbers) to the model:

    state = model(input text)

Now the model processes the text at once. But implementing this naively would result in a very large model with too many parameters that would require too much memory and computing time. So the developers used a trick here - most of the time each input word is processed separately from the others (as in the recurrent model), but there are stages, called "attention", where the words are processed together (and those stages are relatively light), so it looks like this:

    # stage where all text is processed at once
    # using quick algorithm
    state1 = attention(input text)
    # stage where each part of state is processed independently
    # with lot of heavy calculations
    state2 = map(some function, state1)
    state3 = attention(state2)
    state4 = map(some function, state3)
    ...
To summarize, in Transformers the model processes the text all at once, but we have to employ tricks and split processing into stages to make the computation feasible. That is probably why some people believe the authors should receive an award for their work.

I think this explanation is as far as one can get without learning ML.


Also, I think this thread is a good place to complain about the paper. The model is not described clearly. For example, try to find the size of the input data vector for the model in the paper - it is not specified. There is also a misleading phrase:

    All sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.
which makes the reader think that each block (Transformer) gets a 512-dimensional vector as input and produces 512 numbers at the output. But this is wrong. 512 numbers is just a single word, not the entire text or internal state. I could not understand this from reading just the original paper.

It is also not written where the keys, queries and values for attention come from.


The "Attention is All You Need" paper introduced a new way for AI to read and understand language, much like how we might read a comic book.

As you read each panel of a comic book, you don't just look at the words in the speech bubbles, but you also pay attention to who's talking, what they're doing, and what happened in the previous panels. You might pay more attention to some parts than others. This is sort of like what the Transformer model does with text.

When the Transformer reads a sentence, it doesn't just look at one word at a time. It looks at all the words at once, and figures out which ones are most important to understand each other. This is called "attention." For example, in the sentence "The cat, which is black, sat on the mat," the Transformer model would understand that "cat" is connected to "black" and "sat on the mat."

The "attention" part is very helpful because, like in a comic book, understanding one part of a sentence often depends on understanding other parts. This makes the Transformer model really good at understanding and generating language.

Also, because the Transformer pays attention to all parts of the sentence at the same time, it can be faster than other models that read one word at a time. This is like being able to read a whole page of your comic book at once, instead of having to read each panel one by one.


Please don't post generated text into HN comments. HN threads are for human discussion and we ban accounts that violate this.


Apologies. Seems like a good policy.


Explaining it for a slightly older audience, a transformer is a type of artificial neural network designed for processing sequences, like sentences in a text. It's especially known for its use in natural language processing (NLP), which is the field of AI that deals with understanding and generating human language.

The Transformer is unique because it uses a mechanism called "attention" to understand the relationships between words in a sentence, which works like this:

(1) Encoding: First, the Transformer turns each word in a sentence into a list of numbers, called a vector. These vectors capture information about the word's meaning.

(2) Self-Attention: Next, for each word, the Transformer calculates a score for every other word in the sentence. These scores determine how much each word should contribute to the understanding of the current word. This is the "attention" part. For example, in the sentence "The cat, which is black, sat on the mat," the words "cat" and "black" would get high scores when trying to understand the word "black" because they are closely related.

(3) Aggregation: The Transformer then combines the vectors of all the words, weighted by their attention scores, to create a new vector for each word. This new vector captures both the meaning of the word itself and the context provided by the other words in the sentence.

(4) Decoding: Finally, in a task like translation, the Transformer uses the vectors from the encoding phase to generate a sentence in the target language. It again uses attention to decide which words in the original sentence are most relevant for each word it's trying to generate in the new sentence.

One key advantage of the Transformer is that it can calculate the attention scores for all pairs of words at the same time, rather than one at a time like previous models. This allows it to process sentences more quickly, which is important for large tasks like translating a whole book.


The importance of the "Attention is All You Need" paper by Vaswani et al., in 2017 is that it introduced the Transformer type of model architecture.

The model is so named because it "transforms" one sequence into another. For example, in a machine translation task, it can transform a sentence in one language into a sentence in another language.

The key innovation of the Transformer model is the use of self-attention mechanisms. This means that instead of processing the input sequence word by word, the model considers all the words in the sequence at the same time and learns to pay "attention" to the most important ones for the given task.

In essence, the Transformer model is a design for building network architectures that can process data in parallel and focus on different parts of the data depending on the task at hand. The Transformer model has proven to be highly effective and flexible, and has been adopted in many variants and applications, including BERT, GPT, T5, and many others.


Just a quick clarification, attention of the same sort transformers used was already being employed in RNNs for a while. Thus the name "attention is all you need", it turned out you can just remove the recurrent part which makes it hard to train the NN.


Innovation is "attention", not just "self-attention" (cross-attention for ie. translation <<encoder>>, self-attention for generation <<decoder>>).

It's general computation model, does't have to work on text only.

It's also general in the sense that you can mask it - ie. with lower triangular matrix so future doesn't influence past (decoder, generation); leave it unmasked (ie. in encoder, ie. in text translation you want attention to have access to full input text) or anything else really.


great explanation. thank you all for contributing to our learning!

> in a machine translation task, it transforms a sentence in one language into a sentence in another language.

here English is being translated to which language - I'm assuming vectors? might be a silly question, I'm assuming that's where the origin of the word "Transformer" lies


> in a machine translation task, it can transform a sentence in one language into a sentence in another language.

This means when it is translating between two human languages "a machine translation task".


Someone who read the paper pointed out to me recently that there's an aspect to transformers/attention that uses the sin or cos function to determine which words to pay attention to or the spacing between them (I'm probably not expressing this correctly, so please correct me if I'm wrong). It seems really unintuitive that sin and/or cos would be a factor in human language - can you explain this?


That sounds like a reference to the concept of cosine similarity.

Imagine that words are spread out in the space. Cosine similarity is a measure of similarity between two vectors (each word is encoded as a vector).

By measuring the cosine of the angle between the two vectors we can get:

1) whether 2 vectors have the same angle (2 words have the same meaning or close enough) when the cosine is close to 1

2) whether 2 vectors are perpendicular (2 words don't have anything to do with each other) when the cosine is close to zero

3) whether 2 vectors are opposite in direction (2 words have opposite meanings in some aspect) when the cosine is close to -1

Cosine similarity is like comparing two people's interests. If two people have similar interests, the angle between them is small, and the cosine similarity value will be high. If two people have completely different interests, the angle between them is large, and the cosine similarity value will be low. So, cosine similarity is a way to measure how similar two things are by looking at the angle between them.
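In code it's just the dot product divided by the vector lengths; the vectors below are made up purely for illustration:

    import numpy as np

    def cosine_similarity(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    king = np.array([0.9, 0.8, 0.1])        # toy vectors, not real embeddings
    queen = np.array([0.85, 0.75, 0.2])
    banana = np.array([-0.1, 0.2, 0.95])

    print(cosine_similarity(king, queen))   # close to 1: similar "meaning"
    print(cosine_similarity(king, banana))  # close to 0: mostly unrelated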


Not as much of an expert as others commenting here, but I believe the sine/cosine stuff comes in just because it’s a standard and very efficient way of comparing vectors.

(“Vector” is just an alternate way of talking about a coordinate point - you can say “a point is at (x, y)”, or equivalently you can say “turn some angle and then travel some distance forward”; either method gives enough information to find the point exactly.)

I don’t think sine and cosine are actually factors in human language - rather, the process of turning words into vectors captures whatever are the factors in human language, translates them into vectors, and in that translation process the factors get turned into something that sine/cosine measurements of vectors is good at picking up.

A toy example would be that arithmetic doesn’t seem to be a factor in political orientation, but if you assess everyone’s political orientation with some survey questions and then put them on a line from 0 to 10, then you could do some subtraction and multiplication operations to find numbers that are close together - ie doing arithmetic to find similar politics. The reason that works is not because arithmetic has anything to do with political orientation, it’s because your survey questions captured information about political orientation and transformed it into something that arithmetic works on.

I guess this explanation doesn’t do much except push the unintuitiveness into the embedding process (that’s the process of turning words into vectors in a way that captures their relationship to other words).


It is explained in the paper.

> In this work, we use sine and cosine functions of different frequencies:

> PE(pos,2i) = sin(pos/10000^{2i/d_model})

> PE(pos,2i+1) = cos(pos/10000^{2i/d_model})

> where pos is the position and i is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PE_{pos+k} can be represented as a linear function of PE_{pos}.
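For what it's worth, here's a rough sketch of those two formulas in code (toy sizes, PyTorch assumed; not the paper's reference implementation):

    import torch

    d_model, max_len = 8, 10                               # toy sizes
    pos = torch.arange(max_len).unsqueeze(1).float()       # positions 0..9, shape (10, 1)
    i = torch.arange(0, d_model, 2).float()                # the 2i values: 0, 2, 4, 6

    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos / 10000 ** (i / d_model))  # PE(pos, 2i)
    pe[:, 1::2] = torch.cos(pos / 10000 ** (i / d_model))  # PE(pos, 2i+1)

    # pe[p] gets added to the embedding of the token at position p
    print(pe.shape)  # torch.Size([10, 8])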


Someone else can explain this better; this is based on one of the videos suggested in the replies here. Sin and cos don't have any inherent properties specific to language; they were chosen because a simple linear function was needed at that step of the optimization. Other functions could fit the bill as well.


Sine and cosine were chosen explicitly because they are non-linear functions.


complete newbie here: what is the intuition behind the conclusion that "cat" is highly related to "black" as opposed to, say, "mat"?


Attention and the Transformer make it possible to recognize that the probability of "black" applying to the cat is much, much higher than of it applying to the mat, thanks to the phrasing "which is" between "cat" and "black".


Thank you. So this is based on the training data, I assume.


It is a lot harder to take the black out of the cat than it is to take the mat out from under it.


Humans know that; how does the transformer know that? Based on training data?


Sort of. Part of the training for a model includes telling it which parts of a sentence are important... a human points and clicks.


This is extremely important to know. That the relationships between words in the sentence are actually trained by human evaluation.


They are not.


No, that's incorrect. The connections are automatically deduced from the training data (which is just vast amounts of raw text).


> For example, in the sentence "The cat, which is black, sat on the mat," the words "cat" and "black" would get high scores when trying to understand the word "black" because they are closely related.

So what does that actually mean in terms of looking at new text? How does it know the relationships? Does it have to be bootstrapped on labeled data for a specific language up front?

Is that something done in the training process - providing example sentences and illustrating the connections between words - or is that earlier?


Is there a way to have recursively constructed attentional architectures? It would seem like the same process that you describe could be even more useful if it could be applied at the level of sentences, paragraphs, etc.


> each word in a sentence ...

Does each sentence stand alone, or is the meaning of the sentence, and the words in the sentence, influenced by the sentences that come before and after it?


Great explanation


It's ChatGPT-generated. It's even leaked part of the prompt in the intro.

I especially disagree with:

> natural language processing (NLP), which is the field of AI that deals with understanding and generating human language.


If GPT4 wrote this it did a great job, and highlights how incredible and useful it can be.

Although I'm sure "ELI5 what is a transformer" was one of the RLHF prompts which got handcrafted responses from OpenAI engineers whose bread and butter is transformers, so... still a great response.


Do you disagree with anything else? That sounds like a simplification and not too bad given the target audience.


how is "attention" different from using tokens > vector database > cosine similarity?


In the context of natural language processing, the attention mechanism used in Transformer models and the process of converting tokens to vectors and calculating cosine similarity have similarities but serve different purposes.

When you convert words (tokens) into vectors and calculate cosine similarity, you're typically doing what's called "word embedding". This process captures the semantic meaning of words in a high-dimensional space. Words that have similar meanings have vectors that are close to each other in this space. Cosine similarity is a measure of how similar two vectors are, which in this context equates to how similar the meanings of two words are.

On the other hand, the attention mechanism in Transformer models is a way to understand the relationships between words within a specific context. It determines how much each word in a sentence contributes to the understanding of every other word in the sentence. It's not just about the semantic similarity of words, but also about their grammatical and contextual relationships in the given sentence.

Here's an analogy: imagine you're trying to understand a conversation between a group of friends. Just knowing the meaning of their words (like word embeddings do) can help you understand some of what they're saying. But to fully understand the conversation, you also need to know who's speaking to whom, who's agreeing or disagreeing with whom, who's changing the topic, and so on. This is similar to what the attention mechanism does: it tells the model who's "talking" to whom within a sentence.

So while word embeddings and cosine similarity capture static word meanings, the attention mechanism captures dynamic word relationships within a specific context. Both are important for understanding and generating human language.


Just a guess: is this answer GPT output?


From another one of his responses that he has since deleted from this thread: “Despite these challenges, researchers have found that hierarchical attention can improve performance on tasks like document classification, summarization, and question answering, especially for longer texts. As of my last training cut-off in September 2021, this is an area of ongoing research and development.”


Probably, it is a good explanation, though.


If you're specifically focused on semantic similarity, I would say that attention adds to the dimensionality of the vector space. Distances between tokens can vary depending on context.


It's orthogonal, right? How do you go from tokens to vectors? A fully connected NN? An LSTM? Or a transformer?


I wrote a blog post with an overview of it, with code and explanations (more like ELI15): https://medium.com/analytics-vidhya/googles-t5-transformer-t...


I have some background but I'm probably not the best person in the world to explain.

The important thing about the transformers model is that it's the first one we have found which keeps unlocking more and more powerful and general cognitive abilities the more resources we throw at it (parameters, exaflops, datasets). I saw some interview with Ilya Sutskever where he says this; it almost certainly won't be the last or best one, but it was the first one.

--

Why was it the first one? How were these guys so clever and other ones couldn't figure it out?

OK so first you need some context. There is a lot of 'Newton standing on the shoulders of giants' going on here. If all of these giants were around in the 1970s, it probably would have been invented then. Heck for all we know something as good was invented in the 1970s but our computers were too smol to benefit from it. This is what John Carmack is currently looking into.

To really notice the scaling benefits of the transformer architecture, they needed to run billion parameter transformer models on linear-algebra-accelerating GPU chips using differentiable programming frameworks. These are some of the giants we are standing on. The research and development pipeline for these amazing GPUs like [thousands of tech companies -> ASML -> TSMC -> NVIDIA] didn't exist until not so long ago. The special properties of transformers wouldn't have been discovered so soon without this hardware stack.

Another giant we are standing on is the differentiable-programming linear algebra libraries and frameworks such as Theano, TensorFlow, PyTorch, or JAX. There have been things like this before under the name 'mathematical programming', like CPLEX, but they weren't as accessible. 'Differentiable programming' is a newish term for what used to be called 'automatic differentiation', where 'differentiation' means essentially the same thing as the calculus derivative. Informally, it means that these libraries can predict the effect of any tiny input change on the output as a computationally cheap side-effect of computing that output, even for complicated calculations. This capability makes optimization easier; in particular, it generalizes the 'backpropagation' algorithm of traditional artificial neural networks.
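A tiny example of that "cheap side-effect", assuming PyTorch (the other frameworks have equivalents):

    import torch

    x = torch.tensor(2.0, requires_grad=True)
    y = x ** 3 + 4 * x   # pretend this is some complicated calculation

    y.backward()         # the cheap side-effect: gradient of y w.r.t. the input
    print(x.grad)        # tensor(16.) == 3*x^2 + 4 evaluated at x = 2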

--

What is the transformer model, in more nerdy terms?

At one level, it's just a complicatedly parameterized function, where you can fit the parameters by training on data. This viewpoint puts the importance on the computational power applied to training the model, with the advantage of differentiable programming. Some will probably guess that the details of the model architecture don't really matter as long as it has a sickening amount of parameters and exaflops and dataset. Some version of this viewpoint is probably true, in my opinion.

More specifically, the transformer architecture is like a chain of black box differentiable 'soft' lookup tables. The soft queries and keys and values are each lists of floating point numbers (for example a single soft query is a list of numbers, called a vector) and these vectors are stacked into matrices and the soft lookup is processed quickly with fast matrix multiplication tricks. Importantly, all of this is happening inside of a differentiable programming framework which lets you cheaply answer questions about how any small change to the input will affect the output. This capability is used for training, by making trillions of billions of tiny changes to the floating point numbers in the multiplication matrices in the boxes. At the end, the fully trained chain of black box functions can be used to compute a probability distribution over the next token in the message, which lets you generate messages or translate between languages or whatever.
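That "soft lookup" can be written in a few lines; here's a sketch with made-up sizes (PyTorch assumed), ignoring the multi-head, masking and training details:

    import torch

    seq_len, d = 6, 16                  # made-up sizes
    Q = torch.randn(seq_len, d)         # one soft query per token
    K = torch.randn(seq_len, d)         # one soft key per token
    V = torch.randn(seq_len, d)         # one soft value per token

    # Compare every query against every key, turn the scores into weights
    # that sum to 1, then return a weighted mix of the values: a "soft" lookup.
    weights = torch.softmax(Q @ K.T / d ** 0.5, dim=-1)
    output = weights @ V                # shape (seq_len, d)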


"Transformers" and "Attention is All You Need" refer to an important development in machine learning and artificial intelligence, particularly in the field of natural language processing (NLP). I'll try to explain them in a simple way.

Think of a conversation you had with a friend. While they were talking, you were probably not just listening to the words they were saying right now, but also remembering what they said a few minutes ago. Your brain was connecting the dots between different parts of the conversation to understand the full meaning. Now, imagine if you could only understand each word in isolation and couldn't remember anything from a few seconds ago. Conversations would be pretty hard to understand, right?

In early NLP models, this was a big problem. They couldn't easily look at the "context" of a conversation or a sentence. They could only look at a few words at a time, so they were a bit like our forgetful person. They were good at understanding the meaning of individual words, but not so good at understanding how those words fit together to create meaning.


Did you use GPT to write this? (Not a bad thing! It's a decent answer)


I copy pasted the attention is all you need paper into ChatGPT4 and gave it the prompt "Explain like I'm 5 years old".

The Transformer is a new type of computer program that helps translate languages and understand sentences. It works by paying attention to different parts of a sentence at the same time, instead of looking at one word after another like older programs. This makes it faster and better at understanding complicated sentences. It has been tested on translating English to German and English to French and did a really good job.

(Edit)


Immediately thought this was gpt as well.

Assuming prompt was "Explain Transformers and 'Attention is all you need' in a simple way"


AFAIK Transformers and context size are orthogonal concepts. You could have large token contexts before. The transformer directs the “attention” to a specific word/token inside the context.


Enlightening example of having a conversation. Makes things clearer.


[flagged]


GPT-4 Prompt: Can you ELI5 Transformers and the “Attention is all we need” paper

Sure! Imagine you're in school, and your teacher asks you and your friends to work on a group project. Each of you has a different set of skills, and you need to work together to complete the project successfully. In this analogy, the group project is like a sentence or a piece of text that a language model is trying to understand.

Transformers are a type of model that helps computers understand and generate text, like in our group project. The key idea behind transformers is something called "attention." Attention helps the model figure out which words in a sentence are the most important to focus on, just like you and your friends pay attention to each other's skills to complete your project.

The "Attention is all we need" paper introduced transformers and showed that by using attention, the model can learn how to understand and generate text more effectively. Instead of focusing on each word one at a time, transformers can look at all the words together, decide which ones are the most important, and use that information to understand the text better.

So, in our group project example, transformers help the model work like a team, where everyone pays attention to each other's strengths and helps each other out. This makes the model more powerful and able to do a better job at understanding and generating text.


GPT-4 seems to take "explain like I'm five" quite seriously.


I just asked ChatGPT and this is what I got:

"Imagine you have a really smart computer friend that can understand and talk to you just like a person. This computer friend is called a language transformer.

The language transformer is like a super good reader. It reads lots and lots of books, stories, and articles to learn how people talk and write. It pays attention to how words go together and what they mean in different situations. It becomes really good at understanding what people are saying or asking.

Once the language transformer has learned so much, it can help with many things. For example, if you want to write a story, it can help you come up with ideas and write sentences that make sense. If you have a question, it can try to find the answer for you. It can also help translate words from one language to another, like when you want to understand what someone is saying in a different language.

The language transformer is like a big brain that knows a lot about words and how they fit together. It uses all that knowledge to talk and help you with different things."


Asked to provide some code samples:

""" Imagine you have a sentence: "The cat sat on the mat." In traditional models, each word is processed one by one, and the model might miss important relationships between words that are far apart. Transformers solve this problem using self-attention.

In self-attention, the model pays attention to all the words in the sentence simultaneously, giving more importance to different words depending on their relevance to each other. Here's a code snippet to illustrate self-attention:

    import torch
    import torch.nn as nn

    # Assume we have an input sentence with 5 words
    input_sentence = torch.tensor([[0.1, 0.2, 0.3, 0.4, 0.5]])

    # Define the self-attention layer
    self_attention = nn.MultiheadAttention(embed_dim=5, num_heads=1)

    # Apply self-attention to the input sentence
    attention_output, _ = self_attention(input_sentence, input_sentence, input_sentence)

    # The attention_output represents the transformed sentence with attention
    print(attention_output)

In this code, we use a simple example with one self-attention head. The input sentence is represented as a tensor with shape (1, 5), where each number represents the word's features. We pass the input sentence to the self-attention layer twice, using it as the query, key, and value inputs.

The self-attention layer computes the attention scores between each word in the sentence, determining how much attention each word should receive from the other words. It then combines the input sentence with these attention weights to produce the attention_output.

The attention_output contains the transformed sentence, where each word now carries information from other words based on their importance. This allows the model to capture relationships and dependencies between words effectively.

The "Attention is All You Need" paper introduced this self-attention mechanism as a fundamental building block in the Transformer model. The Transformer architecture then incorporates multiple layers of self-attention, along with other components like positional encoding, to process sequences and achieve impressive results in various NLP tasks.

Remember, this code snippet is just a simplified illustration to help you understand the concept of self-attention. The actual implementation of Transformers is more complex, involving multiple layers, feed-forward networks, and additional optimizations. """

See the nn.MultiheadAttention documentation [1]. The output of print(attention_output) is:

    tensor([[ 0.1756, -0.2273, -0.0787,  0.0383, -0.0779]], grad_fn=<SqueezeBackward1>)

If you badger ChatGPT, it will give you an example with different query, key, and value inputs:

    # Define distinct query, key, and value inputs
    query = torch.tensor([[0.2, 0.3, 0.4, 0.5, 0.6]])
    key = torch.tensor([[0.3, 0.4, 0.5, 0.6, 0.7]])
    value = torch.tensor([[0.4, 0.5, 0.6, 0.7, 0.8]])

    # Apply self-attention to the input sentence using distinct inputs
    attention_output, _ = self_attention(query, key, value)

outputting:

    tensor([[-0.2294,  0.1336, -0.1445, -0.7275, -0.1232]], grad_fn=<SqueezeBackward1>)

I then pushed it further to explain how nn.MultiheadAttention works:

""" In PyTorch, nn.MultiheadAttention is a module that implements the multi-head attention mechanism. Multi-head attention is a key component of the Transformer architecture, which is widely used in various natural language processing (NLP) tasks.

The nn.MultiheadAttention module allows you to apply multi-head attention to a set of queries, keys, and values. Here's a simplified explanation of what it does:

Inputs: The module takes three inputs: queries, keys, and values. These inputs are typically tensors that represent the encoded representations of the input sequence.

Linear Projections: Before applying attention, the module performs linear projections on the queries, keys, and values. This step projects the input tensors into different subspaces to capture different aspects of the data. These projections are learnable parameters.

Attention Scores: Next, the module computes attention scores between the queries and the keys. The attention scores measure the relevance or importance of each key with respect to each query. Higher attention scores indicate a stronger relationship between the query and the corresponding key.

Attention Weights: The attention scores are transformed into attention weights by applying a softmax function. The softmax operation normalizes the attention scores, ensuring that the weights sum up to 1 and represent a valid distribution.

Weighted Sum: The attention weights are used to compute a weighted sum of the values. This step aggregates the values based on their importance, as determined by the attention weights. The weighted sum produces the final output of the attention mechanism.

Multi-Head Attention: The module performs the above steps multiple times in parallel, each time using different sets of learned linear projections. These parallel attention operations are called "attention heads." The outputs of the attention heads are concatenated and linearly transformed to obtain the final output of the nn.MultiheadAttention module.

The multi-head attention mechanism allows the model to capture different types of relationships and dependencies in the input sequence. By using multiple attention heads with different linear projections, the model can attend to different parts of the input space and learn diverse representations.

Overall, nn.MultiheadAttention in PyTorch provides a convenient way to incorporate the multi-head attention mechanism into your neural network architecture, facilitating effective modeling of complex dependencies in sequence data. """

[1] https://pytorch.org/docs/stable/generated/torch.nn.Multihead...
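Those steps can also be sketched by hand. The following is a simplified toy version of multi-head attention (made-up sizes; not the actual nn.MultiheadAttention internals, which add dropout, masking, bias options, etc.):

    import torch
    import torch.nn as nn

    embed_dim, num_heads, seq_len = 8, 2, 5
    head_dim = embed_dim // num_heads
    x = torch.randn(seq_len, embed_dim)              # 5 tokens, 8 features each

    # 1. Learnable linear projections for queries, keys, and values
    wq, wk, wv = (nn.Linear(embed_dim, embed_dim) for _ in range(3))
    wo = nn.Linear(embed_dim, embed_dim)             # final output projection

    def split_heads(t):
        # (seq_len, embed_dim) -> (num_heads, seq_len, head_dim)
        return t.view(seq_len, num_heads, head_dim).transpose(0, 1)

    q, k, v = split_heads(wq(x)), split_heads(wk(x)), split_heads(wv(x))

    # 2. Attention scores, 3. softmax weights, 4. weighted sum of the values
    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
    weights = torch.softmax(scores, dim=-1)
    per_head = weights @ v                           # (num_heads, seq_len, head_dim)

    # 5. Concatenate the heads and apply the output projection
    out = wo(per_head.transpose(0, 1).reshape(seq_len, embed_dim))
    print(out.shape)                                 # torch.Size([5, 8])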


[flagged]


Absolute BS. Transformers were literally introduced by that paper.


Maybe ELI5 was the wrong level for my ask. I still want to understand it from a technical perspective.

Maybe ELI am a software engineer but not in AI/ML?


*Prompt:* I'm a software engineer who has never done anything with AI/ML. Explain to me transformers and the "Attention is all you need" paper.

*GPT4:* "Attention is All You Need" is a seminal paper published in 2017 by Vaswani et al. that introduced the Transformer architecture, a breakthrough in the field of AI and natural language processing (NLP). Transformers have since become the foundation for many state-of-the-art models, including GPT (Generative Pre-trained Transformer) series, BERT (Bidirectional Encoder Representations from Transformers), and more.

At a high level, the Transformer architecture is designed to handle sequential data, like text, in parallel rather than sequentially, as was common with earlier models like Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks. This parallelization allows for faster training and better scalability. The key innovation in Transformers is the attention mechanism, which is used to weigh the importance of different input tokens when generating an output token.

Here's a simplified explanation of the main components of a Transformer:

    Input Embeddings: Transformers take input data in the form of text tokens. These tokens are converted into continuous vectors through an embedding layer.

    Positional Encoding: Unlike RNNs and LSTMs, Transformers don't have an inherent sense of sequence order. To address this, positional encoding is added to the input embeddings to provide information about the position of each token in the sequence.

    Encoder and Decoder: The Transformer architecture consists of an encoder and a decoder. The encoder is a stack of identical layers, each containing two primary components: a multi-head self-attention mechanism and a position-wise feed-forward network. The decoder is also composed of a stack of identical layers, but it has an additional multi-head attention mechanism that attends to the encoder's output.

    Multi-Head Attention: The attention mechanism is the core innovation in the Transformer architecture. It computes a weighted average of input tokens to generate context-aware representations. Multi-head attention allows the model to focus on different aspects of the input by employing multiple attention heads, each with its own set of learnable parameters.

    Feed-Forward Networks: These are simple fully connected layers that follow the attention mechanism in each layer of the encoder and decoder.

    Output: Finally, the decoder generates the output sequence token by token. This is done by applying a linear layer followed by a softmax activation function to get the probabilities of each possible output token.

In summary, the Transformer architecture introduced in the "Attention is All You Need" paper revolutionized NLP by enabling more efficient and powerful models. The key innovation is the attention mechanism, which helps the model learn complex relationships between input tokens and produce context-aware representations. This has led to a wide array of applications, from machine translation and text summarization to sentiment analysis and language modeling.

*Me:* It's okay-ish, but you're probably interested in a more detailed, lower-level explanation. Also, transformers are widely used for computer vision nowadays (a bit forgivable because of ChatGPT's 2021 cutoff, but they had already taken off by then).
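To see roughly how the components listed above fit together, here's a toy single encoder layer (a sketch with made-up dimensions and PyTorch's built-in attention module; not the paper's exact configuration):

    import torch
    import torch.nn as nn

    class ToyEncoderLayer(nn.Module):
        """One encoder block: multi-head self-attention + feed-forward network,
        each followed by a residual connection and layer normalization."""
        def __init__(self, d_model=8, num_heads=2, d_ff=32):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
            self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                    nn.Linear(d_ff, d_model))
            self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

        def forward(self, x):
            attn_out, _ = self.attn(x, x, x)   # self-attention: x is query, key, value
            x = self.norm1(x + attn_out)       # add & norm
            x = self.norm2(x + self.ff(x))     # feed-forward, add & norm
            return x

    tokens = torch.randn(1, 5, 8)              # batch of 1 sentence, 5 tokens, 8 dims
    print(ToyEncoderLayer()(tokens).shape)     # torch.Size([1, 5, 8])

A real model stacks many of these layers (plus input embeddings and positional encoding), but the shape of each layer is essentially this.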


It’s not really something you need to understand unless you’re an ML researcher.

I guess the ELI5 (with a BUNCH of details left out) is “Transformers: what if you didn’t have to process sentences as a sequence of words, but rather as a picture of words.”


>t’s not really something you need to understand unless you’re an ML researcher

What kind of an answer is this? No, it's something you should understand if you are at all interested.


It’s like saying “I have no basis in differential equations or complex numbers or Newtonian mechanics or experimental findings on the wavelike nature of light but could somebody explain the Schrödinger equation derivation like I’m five?”

Like, there’s a reason we don’t explain things to five year olds.

Also I have a good ELI5 answer, so I don’t know what else you want dude lol so petty


Who is Yegge and why should I care? All I found on Google is that he wrote code for Google and blogs.


He's been a popular tech writer for a couple of decades. He's written some very influential pieces in the past - here's a copy of one of his most famous (I don't think the original is available any more): https://gist.github.com/chitchcock/1281611


> (I don't think the original is available any more)

That's because this one was not a blog post, it was a Google+ post that he intended to share internally with Google only. He had other great rants on his actual blog though, about (vague memory here) Emacs, game development, language design, etc.


He's an engineer who's been involved in (or been close to) some of the most influential technologies/products in use in the last 10 years, and his blog posts are quite celebrated and respected.

Think of him as an essayist. Like Paul Graham (pg on this site)


1 part pg and 3 parts _why maybe.


I first heard about him after an internal post/rant of his about Google+ got leaked and went viral.

My impression is that he is a Kardashian - famous for being famous - in the software world.



