SuperBPE: Advancing Language Models with Cross-Word Tokenization


Language models (LMs) face a fundamental challenge in how to understand textual data through tokenization. Current subword tokenizers segment text into vocabulary tokens that cannot span whitespace, adhering to an artificial constraint that treats space as a semantic boundary. This practice ignores the reality that meaning often extends beyond individual words: multi-word expressions like "a lot of" function as single semantic units, with English speakers mentally storing thousands of such phrases. Cross-linguistically, the same concept may be expressed as a single word or as multiple words, depending on the language. Notably, some languages like Chinese and Japanese use no whitespace at all, allowing tokens to span multiple words or sentences without apparent performance degradation.

Previous research has explored several approaches beyond traditional subword tokenization. Some studies investigated processing text at multiple granularity levels or creating multi-word tokens through frequency-based n-gram identification. Other researchers have explored multi-token prediction (MTP), allowing language models to predict several tokens in a single step, which confirms models' capacity to process more than one subword at a time. However, these approaches require architectural modifications and fix the number of tokens predicted per step. Some researchers have pursued tokenizer-free approaches, modeling text directly as byte sequences. However, this significantly increases sequence lengths and computational requirements, leading to complex architectural solutions.

Researchers from the University of Washington, NVIDIA, and the Allen Institute for AI have proposed SuperBPE, a tokenization algorithm that creates a vocabulary containing both traditional subword tokens and novel "superword" tokens that span multiple words. This approach enhances the popular byte-pair encoding (BPE) algorithm with a pretokenization curriculum: it initially maintains whitespace boundaries to learn subword tokens, then removes these constraints to allow superword tokens to form. While standard BPE quickly reaches diminishing returns and begins adding increasingly rare subwords as vocabulary size grows, SuperBPE continues discovering common multi-word sequences to encode as single tokens, improving encoding efficiency.
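The two-stage idea can be illustrated with a toy sketch. The code below is a minimal, simplified illustration of the pretokenization curriculum, not the authors' released implementation: during the first stage it simply refuses any merge that would contain a space (which approximates whitespace pretokenization), and after a chosen number of merges it lifts that restriction so frequent multi-word strings can become single tokens. All function and variable names here are invented for this example.

```python
from collections import Counter

def most_frequent_pair(seqs, allow_space):
    """Count adjacent token pairs; optionally skip pairs that would span a space."""
    counts = Counter()
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            if not allow_space and " " in a + b:
                continue  # stage 1: never form a token that crosses a word boundary
            counts[(a, b)] += 1
    return counts.most_common(1)[0][0] if counts else None

def apply_merge(seqs, pair):
    """Replace every adjacent occurrence of `pair` with the concatenated token."""
    merged_seqs = []
    for seq in seqs:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(seq[i] + seq[i + 1])
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged_seqs.append(out)
    return merged_seqs

def train_superbpe_sketch(texts, target_size, transition_point):
    """Toy two-stage BPE: whitespace-bounded merges first, then cross-word merges."""
    seqs = [list(t) for t in texts]  # start from individual characters
    merges = []
    while len(merges) < target_size:
        in_stage_one = len(merges) < transition_point
        pair = most_frequent_pair(seqs, allow_space=not in_stage_one)
        if pair is None:
            break
        seqs = apply_merge(seqs, pair)
        merges.append(pair)
    return merges
```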

SuperBPE operates through a two-stage training process that modifies the pretokenization step of traditional BPE, as described above. The approach intuitively builds semantic units first and then combines them into common sequences for greater efficiency. Setting t = T (where t is the transition point and T is the target vocabulary size) reproduces standard BPE, while t = 0 yields a naive whitespace-free BPE. Training SuperBPE requires more computational resources than standard BPE because, without whitespace pretokenization, the training data consists of extremely long "words" with minimal deduplication. However, this added training cost amounts to a few hours on 100 CPUs and is incurred only once, which is negligible compared to the resources required for language model pretraining.
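Continuing the toy sketch above (the corpus and vocabulary sizes below are made up for illustration), the two endpoint settings fall out directly from the transition point, and an intermediate value lets superword tokens emerge:

```python
# Reuses train_superbpe_sketch from the sketch above.
corpus = ["a lot of people read a lot", "a lot of the time", "a lot of work to do"]

standard_bpe = train_superbpe_sketch(corpus, target_size=30, transition_point=30)    # t = T
whitespace_free = train_superbpe_sketch(corpus, target_size=30, transition_point=0)  # t = 0
superbpe = train_superbpe_sketch(corpus, target_size=30, transition_point=15)        # 0 < t < T

# With t = T no merge ever contains a space; with t < T later merges can, so
# frequent multi-word strings such as "a lot of" can become single tokens.
print([a + b for a, b in superbpe if " " in (a + b)])
```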

SuperBPE shows impressive performance across 30 benchmarks spanning knowledge, reasoning, coding, reading comprehension, and more. All SuperBPE models outperform the BPE baseline, with the strongest 8B model achieving an average improvement of 4.0% and surpassing the baseline on 25 out of 30 individual tasks. Multiple-choice tasks show substantial gains, with a +9.7% improvement. The only statistically significant underperformance occurs on the LAMBADA task, where SuperBPE's final accuracy drops from 75.8% to 70.6%. Moreover, all reasonable transition points yield stronger results than the baseline, and the most encoding-efficient transition point delivers a +3.1% performance improvement while reducing inference compute by 35%.
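A back-of-the-envelope way to read the 35% figure (the token counts below are hypothetical and chosen only to reproduce that ratio, not taken from the paper): if per-token inference cost is roughly constant for a fixed model, compute for the same text scales with how many tokens the tokenizer produces.

```python
# Hypothetical token counts for the same document; only the ratio matters here.
standard_bpe_tokens = 1000
superbpe_tokens = 650  # a more encoding-efficient tokenizer needs fewer tokens

# Assuming inference cost is roughly proportional to the number of tokens processed,
# the relative compute saving is simply the reduction in token count.
saving = 1 - superbpe_tokens / standard_bpe_tokens
print(f"approximate inference-compute reduction: {saving:.0%}")  # -> 35%
```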

In conclusion, the researchers introduced SuperBPE, a more effective tokenization approach developed by enhancing the standard BPE algorithm to incorporate superword tokens. Although tokenization serves as the fundamental interface between language models and text, tokenization algorithms have remained relatively static. SuperBPE challenges this status quo by recognizing that tokens can extend beyond traditional subword boundaries to include multi-word expressions. SuperBPE tokenizers enable language models to achieve superior performance across many downstream tasks while reducing inference computational costs. These advantages require no modifications to the underlying model architecture, making SuperBPE a seamless replacement for traditional BPE in modern language model development pipelines.


Check out the Paper and Project Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 85k+ ML SubReddit.

Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.
