RWKV-X Combines Sparse Attention and Recurrent Memory to Enable Efficient 1M-Token Decoding with Linear Complexity


LLMs built on Transformer architectures face significant scaling challenges due to their quadratic complexity in sequence length when processing long-context inputs. Methods such as Linear Attention models, State Space Models like Mamba, Linear RNNs like DeltaNet, and RWKV address this problem. However, these linear architectures struggle with long-context understanding. For instance, RWKV-7 (2.9B) achieves high accuracy on passkey retrieval up to 28K tokens but degrades rapidly beyond this point. Even with continual pretraining on 128K-length data, the long-context limitations persist. This issue extends beyond RWKV to other architectures such as Mamba, representing a fundamental challenge for this class of models.

Linear-complexity language models have emerged as alternatives to transformer-based architectures, which suffer from quadratic computational demands when processing long sequences. The RWKV model series combines transformer parallelizability during training with RNN-like recurrent state representation. RWKV has evolved through multiple iterations, from the foundational RWKV-4 to RWKV-5, RWKV-6, and RWKV-7. Hybrid language models, including Jamba, Zamba, and MiniMax, each enhance hybrid designs in their own way. Further, Native Sparse Attention organizes tokens into temporal blocks with three distinct attention paths: compressed coarse-grained tokens, selectively retained fine-grained tokens, and sliding windows for local contextual information. Other sparse attention work includes SeerAttention and Block Attention (MoBA).
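To make the chunk-selection idea concrete, here is a minimal, illustrative sketch (not the paper's or NSA's implementation) of block-level top-k selection attention in PyTorch: each query scores mean-pooled per-chunk key summaries, keeps only the top-k chunks, and attends to them at full resolution. The compression and sliding-window branches described above are omitted, and the function name, chunk size, and top-k value are assumptions.

```python
import torch
import torch.nn.functional as F

def topk_chunk_attention(q, k, v, chunk_size=64, top_k=4):
    """Illustrative block-sparse attention: each query attends only to its
    top-k key/value chunks, scored against mean-pooled chunk summaries.
    Shapes: q (B, Tq, D), k/v (B, Tk, D); Tk assumed divisible by chunk_size."""
    B, Tq, D = q.shape
    n_chunks = k.shape[1] // chunk_size

    # Coarse scores: query vs. mean-pooled summary of each key chunk.
    k_chunks = k.view(B, n_chunks, chunk_size, D)
    v_chunks = v.view(B, n_chunks, chunk_size, D)
    summaries = k_chunks.mean(dim=2)                        # (B, n_chunks, D)
    coarse = torch.einsum("btd,bcd->btc", q, summaries)     # (B, Tq, n_chunks)

    # Keep only the top-k chunks per query position.
    sel = coarse.topk(top_k, dim=-1).indices                # (B, Tq, top_k)
    batch_idx = torch.arange(B)[:, None, None]
    k_sel = k_chunks[batch_idx, sel]                        # (B, Tq, top_k, chunk, D)
    v_sel = v_chunks[batch_idx, sel]

    # Fine-grained attention restricted to the selected chunks.
    scores = torch.einsum("btd,btkcd->btkc", q, k_sel) / D ** 0.5
    attn = F.softmax(scores.flatten(2), dim=-1).view_as(scores)
    return torch.einsum("btkc,btkcd->btd", attn, v_sel)
```

Because each query touches only `top_k * chunk_size` keys rather than the full sequence, the cost grows linearly with the number of queries instead of quadratically with context length.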

Researchers from the Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, Hohai University, Nanjing, Shenzhen University, and Qinghai University, Xining, have proposed a new hybrid architecture called RWKV-X that combines RWKV’s efficiency for short-range modeling with a sparse attention mechanism designed to capture long-range context. Unlike previous hybrid approaches, RWKV-X achieves linear-time complexity during training and constant-time complexity during inference decoding. It shows near-perfect accuracy on the 64K passkey retrieval benchmark when continually pretrained on 64K-token sequences. The model consistently outperforms previous RWKV-7 models on long-context benchmarks while maintaining strong performance on short-context tasks.

RWKV-X is a hybrid architecture that integrates RWKV-7 blocks with sparse attention blocks. Rather than training from scratch, RWKV-X builds upon existing models using an interleaved block expansion approach and a zero-initialization mechanism inspired by LLaMA Pro (see the sketch after the list below). The training follows a two-stage process:

  • First, the model trains on short 1024-token contexts from the MiniPile dataset while freezing all parameters except the newly added blocks. 
  • The second stage involves long-context continual pretraining using the ProLong-64K dataset and a context length of 64K tokens, processing approximately 1 billion tokens in total. During this phase, all parameters are unfrozen and jointly optimized. The training employs the Long-context Cross-Entropy (LongCE) loss, which dynamically weights tokens based on their importance.
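As a rough illustration of the interleaved expansion, zero-initialization, and stage-1 freezing described above (a sketch under assumptions, not the released RWKV-X code), the snippet below inserts a new sparse-attention block after every few base blocks, zero-initializes its residual output projection so the expanded model initially reproduces the base model, and freezes everything except the new blocks. `SparseAttentionBlock`, `group_size`, and the module names are hypothetical placeholders.

```python
import torch
import torch.nn as nn

class SparseAttentionBlock(nn.Module):
    """Placeholder sparse-attention block with a residual output projection."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):
        h, _ = self.attn(x, x, x)
        return x + self.out_proj(h)   # zero out_proj => block acts as identity

def expand_with_sparse_blocks(base_blocks, dim, group_size=4):
    """Interleave zero-initialized sparse blocks after every `group_size`
    base blocks; freeze the base so only the new blocks train in stage 1."""
    expanded = nn.ModuleList()
    for i, block in enumerate(base_blocks):
        block.requires_grad_(False)              # stage 1: base model stays frozen
        expanded.append(block)
        if (i + 1) % group_size == 0:
            new_block = SparseAttentionBlock(dim)
            nn.init.zeros_(new_block.out_proj.weight)   # zero-init: expanded model
            nn.init.zeros_(new_block.out_proj.bias)     # initially matches the base
            expanded.append(new_block)
    return expanded
```

For the second stage, calling `expanded.requires_grad_(True)` would unfreeze all parameters for joint optimization.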
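The article does not spell out how LongCE measures per-token importance, so the following is only a generic sketch of a token-weighted cross-entropy: it assumes the importance weights are already computed by whatever scoring LongCE defines and simply reweights the standard loss accordingly.

```python
import torch
import torch.nn.functional as F

def weighted_cross_entropy(logits, targets, token_weights):
    """Generic token-weighted cross-entropy.
    logits: (B, T, V), targets: (B, T), token_weights: (B, T) precomputed
    importance weights (assumed to come from the LongCE scoring step)."""
    ce = F.cross_entropy(
        logits.flatten(0, 1), targets.flatten(), reduction="none"
    ).view_as(targets)                                   # per-token loss (B, T)
    return (token_weights * ce).sum() / token_weights.sum().clamp_min(1e-8)
```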

The short-context evaluation shows that RWKV-X maintains competitive performance across standard benchmarks. The smaller RWKV-X (0.22B) achieves an average score of 51.0, comparable to RWKV-7’s 51.8. At a larger scale, RWKV-X (3.6B) reaches 71.9, closely matching RWKV-7 (2.9B, 72.8) and Qwen2.5-3B (71.4), while surpassing LLaMA3.2-3B (69.7). These results confirm RWKV-X’s effectiveness as a general-purpose LLM backbone without sacrificing performance on shorter contexts. Moreover, the efficiency analysis demonstrates RWKV-X’s superior scaling characteristics for long sequences. At 128K tokens, RWKV-X achieves a 1.37x speedup over Flash-Attention v3, with the advantage expanding as context length increases.

In this paper, the researchers introduced RWKV-X, a hybrid language model that successfully combines RWKV’s efficiency for modeling short-range dependencies with a novel sparse attention mechanism designed specifically for long-range context modeling. While RWKV-X demonstrates strong performance and efficiency in long-context language modeling, several limitations remain. First, its sparse attention mechanism, which relies on top-k chunk selection, employs a heuristic approach that may overlook semantically relevant dependencies. Second, the current implementation shows sparse attention decoding running slower than vanilla RWKV, indicating that further engineering effort is needed to optimize performance.


Check out the Paper. Also, don’t forget to follow us on Twitter.


Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a tech enthusiast, he delves into the practical applications of AI, with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.
