Attention is a differentiable lookup.

In a classic lookup, we provide one exact key (K) and the table returns one exact value (V). It’s all or nothing: either the key exists and we get the corresponding value, or it doesn’t and we get an error. The query (Q) is simply the key we hope exists in the table.
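
A minimal sketch of that hard lookup in Python (the table and keys here are made up purely for illustration):

```python
# A hard lookup: the query must exactly match a stored key.
table = {"France": "Paris", "Japan": "Tokyo"}   # key -> value

print(table["France"])    # exact match -> "Paris"

try:
    table["Germany"]      # no such key -> all-or-nothing failure
except KeyError:
    print("KeyError: key not found")
```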

Attention relaxes this rigid lookup into a differentiable, probabilistic operation. Instead of requiring an exact match, attention takes a query, compares it against all available keys simultaneously (using dot products), passes the scores through a softmax, and returns a weighted average of the corresponding values. Instead of a one-to-one lookup, every value contributes something to the output, in proportion to how well its key matches the query.
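
A minimal NumPy sketch of that soft lookup for a single query, using scaled dot-product scores and a softmax (the vectors are random and purely illustrative):

```python
import numpy as np

def soft_lookup(q, K, V):
    """Scaled dot-product attention for one query.

    q: (d,) query, K: (n, d) keys, V: (n, d_v) values.
    Returns a weighted average of the rows of V.
    """
    scores = K @ q / np.sqrt(q.shape[0])      # how well each key matches the query
    weights = np.exp(scores - scores.max())   # softmax: turn scores into a distribution
    weights /= weights.sum()
    return weights @ V                        # every value contributes, weighted by its key's match

# Toy example: 3 key/value pairs, 4-dimensional vectors.
rng = np.random.default_rng(0)
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
q = rng.normal(size=4)
print(soft_lookup(q, K, V))
```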

This makes attention inherently position-agnostic. It’s a probabilistic search through a space of meaning. The dot product between a query for "capital" and a key for "France" yields the same score regardless of whether "France" appears first or last in the sequence. The meaning of “the man bit the dog” and “the dog bit the man” is indistinguishable to the model unless there’s some information about the order in which the words appear. That’s why a transformer injects positional information into each token’s representation before Q, K, and V are computed (or injects it directly into Q and K, as rotary embeddings do).
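
A minimal sketch of the first approach: sinusoidal positional encodings (as in the original transformer) added to the token embeddings before Q, K, and V are projected. The embeddings and dimensions here are illustrative:

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal positional encodings from the original transformer."""
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles)                     # even dimensions
    enc[:, 1::2] = np.cos(angles)                     # odd dimensions
    return enc

# Token embeddings for a 5-token sequence (random, for illustration).
x = np.random.default_rng(0).normal(size=(5, 8))
x = x + sinusoidal_positions(5, 8)   # position is baked in before the Q, K, V projections
```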

The table being queried in attention is dynamic and input-dependent: it is rewritten as the model processes a sequence. With multi-head attention, several such lookups run in parallel. At every layer, the table is rebuilt from the current token representations. Many attention variants are about making this lookup more efficient, sparse, or structured, but fundamentally it remains a lookup.
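
A minimal sketch of how that dynamic table is built: the keys and values are projected from the current token representations themselves, and multi-head attention runs several independent lookups in parallel. The projection matrices are random stand-ins for learned weights, and the final output projection is omitted:

```python
import numpy as np

def multi_head_attention(x, Wq, Wk, Wv, n_heads):
    """x: (n, d). The K/V 'table' is rebuilt from x itself on every call."""
    n, d = x.shape
    d_head = d // n_heads
    Q, K, V = x @ Wq, x @ Wk, x @ Wv                        # the table is input-dependent
    outputs = []
    for h in range(n_heads):                                # one independent lookup per head
        sl = slice(h * d_head, (h + 1) * d_head)
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_head)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)      # softmax over keys
        outputs.append(weights @ V[:, sl])
    return np.concatenate(outputs, axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                                  # 5 tokens, model dim 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))     # stand-ins for learned projections
print(multi_head_attention(x, Wq, Wk, Wv, n_heads=2).shape)  # (5, 8)
```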