I think there are two layers of the 'why' in machine learning.

When you look at a model architecture it is described as a series of operations that produces the result.

There is a lower-level why which, while far from easy to show, describes why these algorithms produce the required result. You can show why it's a good idea to use cosine similarity, or why cross entropy was chosen as the measurement. In Transformers you can show that the Q and K matrices transform the embeddings into spaces that allow different things to be closer, and that controlling the proportion of closeness allows you to make distinctions. This form of why is the explanation usually given in papers: it is possible to methodically show that you will get the benefits described from the techniques proposed.
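To make that lower-level why concrete, here's a minimal single-head attention sketch (NumPy; the names and dimensions are just illustrative, not from any particular paper): Q and K project the embeddings into spaces where dot-product closeness is meaningful, and the softmax-weighted sum accumulates V.

    import numpy as np

    def attention(X, Wq, Wk, Wv):
        # Project embeddings into "query" and "key" spaces where
        # dot-product closeness is meaningful, then accumulate V by closeness.
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(Q.shape[-1])            # all-pairs similarity
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)                 # row-wise softmax
        return w @ V

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 16))                           # 5 tokens, 16-dim embeddings
    Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
    out = attention(X, Wq, Wk, Wv)                         # shape (5, 16)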

The greater why is much, much harder: harder to identify and harder to prove. The first why can tell you that something works, but it can't really tell you why it works in a way that can inform other techniques.

In the Transformer, the intuition is that the 'why' is something along the lines of: the Q transforms embeddings into an encoding of what information the embedding needs to resolve its confusion, and the K transforms embeddings into an encoding of what information it can impart. When there's a match between 'what I want to know about' and 'what I know about', the V can be used as 'the things I know' to accumulate the information where it needs to be.

It's easy to see why this is the hard form. Once you get into higher semantic descriptions of what is happening, it is much harder to prove that this is actually what is happening, or that it gives the benefits you think it might. Maybe Transformers don't work like that. Sometimes semantic relationships appear to be present in a process when in fact an unobserved quirk of the mathematics makes the result coincidentally the same.

In a way I think of the maths of it as picking up a many-dimensional object in each hand and magically rotating and (linearly) squishing them differently until they look aligned enough to see the relationship I'm looking for, then poking those bits towards each other. I can't really think about that and the semantic "what things want to know about" at the same time, even though they are conceptualisations of the same operation.

The advantage of the lower why is that you can show that it works. The advantage of the upper why is that it can enable you to consider other mechanisms that might perform the same function. They may be mathematically different but achieve the same goal.

To take a much simpler example from computer graphics: there are many ways to draw a circle with simple loops processing mathematically provable descriptions of a circle. The Bresenham circle-drawing algorithm has a lower why that shows it produces a circle, but the "why do it that way" was informed by a greater understanding of what task was actually being performed.
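For reference, a minimal midpoint/Bresenham-style circle sketch in Python (integer arithmetic only, which is the "why do it that way"):

    def bresenham_circle(cx, cy, r):
        # Integer-only midpoint circle: trace one octant, mirror it eight ways.
        points = set()
        x, y, d = 0, r, 3 - 2 * r
        while x <= y:
            for sx, sy in ((x, y), (y, x), (-x, y), (-y, x),
                           (x, -y), (y, -x), (-x, -y), (-y, -x)):
                points.add((cx + sx, cy + sy))
            if d < 0:
                d += 4 * x + 6
            else:
                d += 4 * (x - y) + 10
                y -= 1
            x += 1
        return points

    print(sorted(bresenham_circle(0, 0, 3)))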



A lot of whys just don't make sense to me at the low level. It often feels like we need to address the issue in some way, so we make something up and brute-force it with gradient descent, large data and enough computational power. Whether each design choice is a good idea is unknown; it will work anyway.


I personally don't really see the point in giving meaning to the Q, K, V parts. It doesn't actually matter what Q, K, V do; it's the training algorithm's job to assign them roles automatically. It makes more sense to think of it as modeling capacity or representational space.

One of the biggest things people don't understand about machine learning is that there is a lot of information in the model that is only relevant to the training phase. It's similar to having test points for your probes on a production PCB or trace/debug logging that is disabled in production. This means that you could come up with an explanation that makes sense at training time, but is actually completely irrelevant at inference time.

For example, what you really want from attention is the pairing of all vectors in Q with all vectors in K. Why? Not necessarily because you need it for inference. It's so that you don't have to know or predict where the gradient will propagate in advance when designing your architecture. There are a lot of sparse attention variants that only really apply to inference time. They show you that transformers are doing a lot of redundant and unnecessary work, but you can only really know that after you're done training.
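As a rough illustration (top-k pruning here is just a stand-in for sparse attention generally, not any particular paper's method): training wants the full all-pairs score matrix so the gradient can reach any (query, key) pair, while at inference you could drop most of those pairs and compare how much the output actually moves.

    import numpy as np

    def attn_weights(Q, K):
        scores = Q @ K.T / np.sqrt(Q.shape[-1])     # all pairs: gradient can reach any (i, j)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        return w / w.sum(axis=-1, keepdims=True)

    def topk_attn_weights(Q, K, k=4):
        # Inference-only pruning: keep the k largest scores per query, drop the rest.
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        cutoff = np.sort(scores, axis=-1)[:, -k][:, None]
        scores = np.where(scores >= cutoff, scores, -np.inf)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        return w / w.sum(axis=-1, keepdims=True)

    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(32, 16)) for _ in range(3))
    dense = attn_weights(Q, K) @ V
    sparse = topk_attn_weights(Q, K) @ V
    print(np.abs(dense - sparse).max())   # how much the output moves when most pairs are dropped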

There is a pattern in LSTMs and Mamba models that is called gating [0], which in my opinion is a huge misnomer. Gating implies that the gate selectively stops the flow of information. What they really mean by it is element-wise multiplication.
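In code it is just this (a loose sketch in the spirit of an LSTM/GRU-style update, not the exact equations from either paper):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gated_update(x, h, W_gate, W_cand):
        gate = sigmoid(x @ W_gate)                  # values in (0, 1), one per element
        candidate = np.tanh(x @ W_cand)
        return gate * h + (1.0 - gate) * candidate  # "gating" = element-wise multiplication

    rng = np.random.default_rng(0)
    x, h = rng.normal(size=8), np.zeros(8)
    W_gate, W_cand = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
    h = gated_update(x, h, W_gate, W_cand)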

If you look at this from the concept of model capacity, what additional things does this multiplication allow you to represent? LSTMs are really good at modelling dynamical systems. Why is that the case? It's actually quite simple. Given a discretized linear dynamical system x_next = Ax + Bu, you run into a problem: you can only model linear systems.
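A minimal sketch of that discretized linear system, with A and B fixed (the numbers are arbitrary, purely illustrative):

    import numpy as np

    A = np.array([[1.0, 0.1],
                  [0.0, 1.0]])       # fixed state transition
    B = np.array([[0.0],
                  [0.1]])            # fixed input matrix

    def step(x, u):
        return A @ x + B @ u         # x_next = Ax + Bu: linear, A and B never change

    x = np.zeros(2)
    for u in np.sin(np.linspace(0, 3, 30)):
        x = step(x, np.array([u]))   # no way for the dynamics themselves to depend on x or u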

So, assuming we only had matrix multiplications with activation functions, we would be limited to modeling linear dynamical systems. The key problem is that the matrices in the neural network can model the A and B of the dynamical system, but they cannot give you a time-varying A and B. You could add additional components to x that carry the time-varying parameters of A, but you would not be able to use those components to model the non-linearity effectively.

Your linearization of a function f(x) might be written as g(x) = f(x_0) + f'(x_0)(x - x_0) = a_0 + a_1 * x. The problem is very apparent: while you can add an additional parameter to modify a_0, you can never modify a_1, since a_1 is baked into the matrix and multiplied with your x. What you want is a function like h(x) = (a_0 + m_0) + (a_1 + m_1) * x, where the m values are modifiers carried in x. In other words, we need the model to represent the multiplication m_1 * x, and it turns out that this is exactly what the gating mechanism in LSTM models provides.
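A toy numerical sketch of that argument (purely illustrative values): extra state can shift the offset a_0 either way, but only an element-wise multiplication lets it change the effective slope a_1.

    x, m0, m1 = 2.0, 0.5, 0.3
    a0, a1 = 1.0, 4.0                      # "baked in" weights

    # Linear-only: extra state can shift the output, but the slope stays a1.
    linear_only = (a0 + m0) + a1 * x

    # With an element-wise multiplication, state can modulate the slope itself.
    gated = (a0 + m0) + (a1 + m1) * x      # the m1 * x term is what gating buys you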

This might look like a small victory, but it is actually enough of a victory to essentially model most non-linear behavior. You can now model the derivative in the hidden state, which also means you can model the derivative of the derivative, or the derivative of the derivative of the derivative. Of course it's still going to be an approximation, but a pretty good one if you ask me.

[0] https://en.wikipedia.org/wiki/Gating_mechanism


>I personally don't really see the point in giving meaning to the Q, K, V parts. It doesn't actually matter what Q, K, V do; it's the training algorithm's job to assign them roles automatically.

I was under the impression that the names Q, K and V were historical more than anything. There is a definite sense of information flowing from the K side to the Q side, because the V that goes on to the next layer's Q comes from the same index as the K.

I agree that it's up to the training to assign the roles of the components, but there is still value in investigating the roles that are assigned. The higher-level insights you can gather can lead to entirely different mechanisms to perform those roles.

That's very much what most model architectures are: efficiency guides. A multi-layer perceptron with an input width of context_window*token_size would be capable of assigning roles better than anything else, but at the cost of being both unfeasibly slow and unfeasibly large.
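Back-of-envelope, with made-up but plausible numbers:

    context_window, token_size = 4096, 1024
    input_width = context_window * token_size           # 4,194,304 inputs

    # One square hidden layer alone:
    params = input_width ** 2                           # ~1.8e13 weights
    print(f"{params:,} parameters for a single layer")  # vs roughly 4e6 for a d=1024 attention block's projections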

I'm a little surprised that there isn't a tiny model that generates a V on demand when it is accumulated with the attention weights: a little model that takes the Q and K values and the embeddings they were generated from. That way, when a partial match between Q and K produces a decent attention value, it can use the information about which parts of Q and K match to decide what V information is appropriate to pass on. It would be slower, and caching seems to be going in the other direction, but it seems like there is significant information there that just isn't being used.
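Roughly what I mean, as a purely hypothetical sketch (none of these names correspond to an existing mechanism; the V is generated per (query, key) pair instead of once per position):

    import numpy as np

    def v_generator(q_i, k_j, x_i, x_j, W1, W2):
        # Hypothetical per-pair V: conditioned on the matching query/key and
        # the embeddings they came from, instead of a single V per position.
        z = np.concatenate([q_i, k_j, x_i, x_j])
        return np.tanh(z @ W1) @ W2

    def attention_with_generated_v(X, Wq, Wk, W1, W2):
        Q, K = X @ Wq, X @ Wk
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out = np.zeros_like(X)
        for i in range(len(X)):                  # O(n^2) generator calls: the cost of the idea
            for j in range(len(X)):
                out[i] += w[i, j] * v_generator(Q[i], K[j], X[i], X[j], W1, W2)
        return out

    rng = np.random.default_rng(0)
    n, d, h = 6, 8, 16
    X = rng.normal(size=(n, d))
    Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))
    W1, W2 = rng.normal(size=(4 * d, h)), rng.normal(size=(h, d))
    out = attention_with_generated_v(X, Wq, Wk, W1, W2)   # shape (6, 8)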



