HyperAttention: Long-context Attention in Near-Linear Time

Han, Insu; Jayaram, Rajesh; Karbasi, Amin; Mirrokni, Vahab; Woodruff, David P.; Zandieh, Amir

arXiv:2310.05869v2 (cs)

[Submitted on 9 Oct 2023 (v1), revised 11 Oct 2023 (this version, v2), latest version 1 Dec 2023 (v3)]

Abstract: We present an approximate attention mechanism named HyperAttention to address the computational challenges posed by the growing complexity of long contexts used in Large Language Models (LLMs). Recent work suggests that in the worst-case scenario, quadratic time is necessary unless the entries of the attention matrix are bounded or the matrix has low stable rank. We introduce two parameters which measure: (1) the max column norm in the normalized attention matrix, and (2) the ratio of row norms in the unnormalized attention matrix after detecting and removing large entries. We use these fine-grained parameters to capture the hardness of the problem. Despite previous lower bounds, we are able to achieve a linear-time sampling algorithm even when the matrix has unbounded entries or a large stable rank, provided the above parameters are small. HyperAttention features a modular design that easily accommodates integration of other fast low-level implementations, particularly FlashAttention. Empirically, employing Locality Sensitive Hashing (LSH) to identify large entries, HyperAttention outperforms existing methods, giving significant speed improvements compared to state-of-the-art solutions like FlashAttention. We validate the empirical performance of HyperAttention on a variety of long-context datasets. For example, HyperAttention makes the inference time of ChatGLM2 50% faster at 32k context length, while perplexity increases from 5.6 to 6.3. At larger context lengths, e.g., 131k, with causal masking, HyperAttention offers a 5-fold speedup on a single attention layer.
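
To make two ideas from the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation. It computes the max column norm of a row-normalized (softmax) attention matrix, and it uses a generic sign-random-projection hash to propose query/key pairs that may correspond to large attention entries. All function names, the choice of hash, the omission of the usual 1/sqrt(d) score scaling, and the parameters `n_bits` and `seed` are assumptions made for illustration; the paper's exact parameter definitions and LSH scheme may differ.

```python
import math
import random


def softmax_attention(queries, keys):
    """Row-normalized attention matrix softmax(Q K^T), as nested lists."""
    rows = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
        top = max(scores)  # subtract the row max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows


def max_column_norm(matrix):
    """Largest Euclidean norm over the columns of a matrix."""
    n_cols = len(matrix[0])
    return max(
        math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_cols)
    )


def sign_hash(vector, planes):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return tuple(
        sum(p * v for p, v in zip(plane, vector)) >= 0 for plane in planes
    )


def candidate_large_entries(queries, keys, n_bits=4, seed=0):
    """Propose (query_index, key_index) pairs that share a hash bucket.

    Assumption: vectors with a small angle between them tend to collide,
    so colliding pairs are candidates for large attention entries.
    """
    rng = random.Random(seed)
    dim = len(keys[0])
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

    # Group key indices by bucket so each query only inspects its own bucket.
    buckets = {}
    for j, k in enumerate(keys):
        buckets.setdefault(sign_hash(k, planes), []).append(j)

    pairs = []
    for i, q in enumerate(queries):
        for j in buckets.get(sign_hash(q, planes), []):
            pairs.append((i, j))
    return pairs
```

In this sketch, a small `max_column_norm` means that no single key dominates the attention of many queries at once. The exact attention matrix is built here only to measure that quantity on toy inputs; forming it costs quadratic time, which is the cost an approximate method is meant to avoid.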

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2310.05869 [cs.LG]
	(or arXiv:2310.05869v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2310.05869

Submission history

From: Insu Han
[v1] Mon, 9 Oct 2023 17:05:25 UTC (552 KB)
[v2] Wed, 11 Oct 2023 13:25:13 UTC (671 KB)
[v3] Fri, 1 Dec 2023 17:43:06 UTC (841 KB)
