Huawei Introduces Pangu Ultra MoE: A 718B-Parameter Sparse Language Model Trained Efficiently on Ascend NPUs Using Simulation-Driven Architecture and System-Level Optimization

Sparse large language models (LLMs) based on the Mixture of Experts (MoE) paradigm have gained traction for their ability to scale efficiently by activating only a subset of parameters per token. This conditional sparsity allows MoE models to retain high representational capacity while limiting computation per token. However, with their increasing complexity and model sizes approaching trillions of parameters, training them efficiently requires algorithmic innovation and tightly integrated hardware-software optimization. These challenges are especially relevant when deploying models on non-standard AI accelerators such as Ascend NPUs, which require specific architectural alignment to deliver optimal performance.
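The per-token sparsity described above comes from a gating network that routes each token to only a few experts. As a minimal sketch (not Pangu's actual router), a top-k gate can be written with plain NumPy: softmax the gate logits, keep the k highest-scoring experts per token, and renormalize their weights.

```python
import numpy as np

def top_k_route(gate_logits, k=2):
    """Pick the k highest-scoring experts per token and renormalize
    their gate weights -- the core of MoE sparse activation."""
    # gate_logits: (num_tokens, num_experts)
    probs = np.exp(gate_logits - gate_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    top_idx = np.argsort(probs, axis=-1)[:, -k:]    # indices of chosen experts
    top_w = np.take_along_axis(probs, top_idx, axis=-1)
    top_w /= top_w.sum(axis=-1, keepdims=True)      # renormalize over the k chosen
    return top_idx, top_w

rng = np.random.default_rng(0)
idx, w = top_k_route(rng.normal(size=(4, 8)), k=2)
print(idx.shape, w.shape)  # (4, 2) (4, 2)
```

Only the experts listed in `idx` run a forward pass for each token, which is why compute per token stays roughly constant as the total expert count grows.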

A major technical challenge lies in the inefficient utilization of hardware resources while training sparse LLMs. Since only a fraction of parameters are active for each token, workloads across devices become unbalanced, leading to synchronization delays and underused processing power. This imbalance also affects memory utilization, as different experts process different numbers of tokens, sometimes exceeding capacity. These inefficiencies are compounded at large scale, such as across thousands of AI chips, where communication and memory-management bottlenecks significantly inhibit throughput. The inability to fully harness the computational promise of sparsity in practice restricts the deployment of such models on hardware systems like Ascend NPUs.
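The imbalance is easy to quantify: count how many tokens each expert receives and compare the busiest expert to the average. A toy sketch (illustrative numbers, not from the paper):

```python
import numpy as np

def expert_load(assignments, num_experts):
    """Count tokens routed to each expert; the max/mean ratio quantifies
    the device-level load imbalance that stalls synchronous training."""
    counts = np.bincount(assignments.ravel(), minlength=num_experts)
    return counts, counts.max() / counts.mean()

# Skewed routing: a few "hot" experts absorb most of the traffic.
rng = np.random.default_rng(0)
assignments = rng.choice(
    8, size=1024, p=[0.4, 0.2, 0.1, 0.1, 0.05, 0.05, 0.05, 0.05]
)
counts, imbalance = expert_load(assignments, 8)
```

In a synchronous step, every device waits for the one hosting the hottest expert, so an imbalance ratio of 3x means roughly two-thirds of the fleet's compute sits idle.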

Several strategies have been proposed to tackle these challenges. These include auxiliary losses that balance token distribution across experts, and drop-and-pad strategies that limit expert overload by discarding tokens exceeding capacity. However, these techniques either reduce model capacity or introduce inefficiencies in memory and computation. Other efforts include heuristic expert placement and conventional communication patterns such as All-to-All dispatching, but these often fail to scale well or sustain high throughput. Moreover, standard memory-saving techniques such as recomputation are usually coarse-grained, targeting entire layers rather than specific operations, leading to increased runtime without proportional memory savings.
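To make the auxiliary-loss idea concrete, here is one widely used variant (in the style of the Switch Transformer balancing loss, not necessarily what Pangu uses): penalize the product of each expert's routed-token fraction and its mean gate probability, which is minimized at 1.0 when routing is perfectly uniform.

```python
import numpy as np

def load_balance_loss(gate_probs, top1_idx):
    """Auxiliary balancing loss: E * sum_i f_i * P_i, where f_i is the
    fraction of tokens routed to expert i and P_i is the mean gate
    probability for expert i. Uniform routing gives the minimum, 1.0."""
    E = gate_probs.shape[-1]
    f = np.bincount(top1_idx, minlength=E) / len(top1_idx)
    P = gate_probs.mean(axis=0)
    return E * float((f * P).sum())

# Perfectly balanced case: uniform gates, tokens spread evenly.
E, T = 4, 8
uniform_probs = np.full((T, E), 1.0 / E)
balanced_idx = np.arange(T) % E
print(load_balance_loss(uniform_probs, balanced_idx))  # 1.0
```

Because the loss grows when both the gate probabilities and the hard assignments concentrate on few experts, adding it (scaled by a small coefficient) to the training objective nudges the router toward even token distribution without hard capacity drops.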

Researchers from the Pangu team at Huawei Cloud introduced a highly structured and optimized training approach for large MoE models tailored to Ascend NPUs. They developed Pangu Ultra MoE, a sparse LLM with 718 billion parameters, focusing on aligning model architecture and system design with the capabilities of the Ascend hardware. Their approach begins with a simulation-based model configuration process that evaluates thousands of architecture variants using metrics grounded in actual hardware behavior. These simulations inform design decisions before any physical training is undertaken, thus saving substantial computational resources and enabling informed tuning of model hyperparameters.
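The shape of such a search can be sketched in a few lines: enumerate candidate architectures, filter to those meeting a capacity target, and rank them with a cost model standing in for the hardware simulator. Both scoring functions below are toy stand-ins with made-up constants, not Huawei's simulator.

```python
from itertools import product

def toy_params(layers, hidden, experts, ffn_mult=4):
    """Very rough MoE parameter count (attention + expert FFNs);
    illustrative only, not the paper's actual accounting."""
    attn = 4 * hidden * hidden
    ffn = experts * 2 * hidden * (ffn_mult * hidden)
    return layers * (attn + ffn)

def toy_step_time(layers, hidden, experts):
    """Stand-in for the hardware simulator: a made-up cost model with
    a compute term and an All-to-All communication term."""
    compute = layers * hidden ** 2 * 1e-9
    comm = layers * experts * hidden * 1e-7
    return compute + comm

# Keep only candidates near a target capacity, then pick the cheapest.
TARGET = 5e11
candidates = [c for c in product([48, 61, 72], [6144, 7680, 8192], [128, 256])
              if toy_params(*c) >= TARGET]
best = min(candidates, key=lambda c: toy_step_time(*c))
```

The real simulator replaces `toy_step_time` with measured per-operator latencies on Ascend hardware, which is what lets thousands of variants be ranked without any physical training runs.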

The simulation method analyzes combinations of parameters such as the number of layers, hidden size, and expert count under a five-dimensional parallelism strategy that includes Pipeline Parallelism, Tensor Parallelism, Expert Parallelism, Data Parallelism, and Context Parallelism. The final model configuration adopted by Huawei included 256 experts, a hidden size of 7680, and 61 transformer layers. To further optimize performance, the researchers integrated an Adaptive Pipe Overlap mechanism to mask communication costs and used hierarchical All-to-All communication to reduce inter-node data transfer. They employed fine-grained recomputation, such as recomputing only key-value vectors in attention modules, and introduced tensor swapping to dynamically offload activation memory to host devices.
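The key-value recomputation trick can be illustrated with a small sketch (hypothetical class, not Pangu's implementation): instead of caching the K and V projections for the backward pass, keep only the smaller layer input and re-derive K and V on demand.

```python
import numpy as np

class KVRecompute:
    """Fine-grained recomputation sketch: store only the attention
    layer's input, and regenerate key/value projections when the
    backward pass needs them, instead of caching K and V."""
    def __init__(self, w_k, w_v):
        self.w_k, self.w_v = w_k, w_v
        self._saved_input = None

    def forward(self, x):
        self._saved_input = x                 # keep the input, drop K/V
        return x @ self.w_k, x @ self.w_v

    def recompute_kv(self):
        # Called during backward: trades two extra matmuls for not
        # holding K and V in memory across the whole forward pass.
        x = self._saved_input
        return x @ self.w_k, x @ self.w_v

rng = np.random.default_rng(0)
attn = KVRecompute(rng.normal(size=(16, 16)), rng.normal(size=(16, 16)))
k1, v1 = attn.forward(rng.normal(size=(4, 16)))
k2, v2 = attn.recompute_kv()  # bit-identical to the forward K, V
```

Because only the targeted operation is recomputed rather than an entire layer, the extra runtime stays proportional to the memory actually saved, which is the point of the fine-grained approach over coarse layer-level checkpointing.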

Pangu Ultra MoE achieved a Model Flops Utilization (MFU) of 30.0% and processed tokens at a rate of 1.46 million per second using 6,000 Ascend NPUs. The baseline was an MFU of 18.9% with 0.61 million tokens per second on 4,000 NPUs. The researchers also introduced dynamic expert placement strategies, improving device-level load balance and achieving a relative 10% MFU improvement. The model performed competitively on benchmark evaluations, attaining 81.3% on AIME2024, 97.4% on MATH500, 94.8% on CLUEWSC, and 91.5% on MMLU. In the healthcare domain, it outperformed DeepSeek R1 by scoring 87.1% on MedQA and 80.8% on MedMCQA, confirming its strength in domain-specific applications.
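A quick sanity check on these figures: if the model and sequence length are held fixed, per-chip throughput should scale with MFU, and the reported numbers line up almost exactly.

```python
# Reported figures from the article: tokens/sec, chip count, MFU (%).
base_tps, base_chips, base_mfu = 0.61e6, 4000, 18.9
new_tps, new_chips, new_mfu = 1.46e6, 6000, 30.0

per_chip_gain = (new_tps / new_chips) / (base_tps / base_chips)
mfu_gain = new_mfu / base_mfu
print(f"{per_chip_gain:.2f}x per-chip throughput, {mfu_gain:.2f}x MFU")
# → 1.60x per-chip throughput, 1.59x MFU
```

The two ratios agree to within about half a percent, so the throughput and MFU claims are internally consistent.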

This study illustrates how the Pangu team at Huawei effectively tackled the core difficulties of training massive MoE models on specialized hardware. Their systematic architecture search, efficient communication techniques, and tailored memory optimizations represent a strong framework for scalable AI training. The work demonstrates practical ways to unlock the performance potential of sparse models and sets a direction for future system-aware AI design.


Check out the Paper. All credit for this research goes to the researchers of this project.


Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.
