Self-Attention, Kernel Methods and G-Metric Spaces

For some time, I’ve been thinking about how to generalize self-attention mechanisms. Most existing attention mechanisms rely on pairwise similarities (dot products) between query and key vectors, but higher-order relationships, involving triples or larger tuples of elements, could capture richer interactions. I later found that this idea is already being explored under the name “higher-order attention” [5]. It comes with a computational cost, however: traditional self-attention is O(n^2) in the sequence length n, while higher-order attention is even more expensive (naively O(n^3) for triple-wise interactions). In this post, I’d like to share my perspective on this topic, connecting it with kernel methods and generalized metric spaces. ...
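To make the pairwise nature of that cost concrete, here is a minimal NumPy sketch of standard scaled dot-product attention (an illustration of the baseline, not code from the post): the n × n matrix of query–key dot products is exactly where the O(n^2) scaling comes from.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Standard self-attention via pairwise query-key dot products.

    Q, K, V: arrays of shape (n, d). The score matrix has shape (n, n),
    which is the source of the O(n^2) compute and memory cost.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n, n) pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (n, d) weighted sum of values

# Toy usage: n = 4 tokens with d = 8 dimensional embeddings
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)
```

A third-order variant would replace the (n, n) score matrix with an (n, n, n) tensor of scores over triples, which is why the naive cost jumps to O(n^3).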

October 30, 2025 · 17 min · Daniel López Montero