ByteDance Introduces QuaDMix: A Unified AI Framework for Data Quality and Diversity in LLM Pretraining

The pretraining efficiency and generalization of large language models (LLMs) are significantly influenced by the quality and diversity of the underlying training corpus. Traditional data curation pipelines often treat quality and diversity as separate objectives, applying quality filtering followed by domain balancing. This sequential optimization overlooks the complex interdependencies between these factors: high-quality datasets often exhibit domain biases, while diversified datasets may compromise quality. Under a fixed training budget, there is a critical need to optimize both dimensions simultaneously to maximize model performance. However, defining and jointly optimizing quality and diversity remain non-trivial challenges.

ByteDance Introduces QuaDMix

ByteDance presents QuaDMix, a unified data selection framework that systematically balances quality and diversity during LLM pretraining. QuaDMix evaluates each data sample against multiple quality criteria and domain classifications and determines its sampling probability through a parameterized function. The framework uses proxy model experiments combined with LightGBM-based regression to predict downstream performance, enabling efficient parameter optimization without exhaustive large-scale training. Experiments demonstrate that QuaDMix achieves an average performance improvement of 7.2% across multiple benchmarks compared with methods that optimize quality and diversity separately, underscoring the effectiveness of the joint approach.

QuaDMix operates in three main stages: feature extraction, quality aggregation, and quality-diversity aware sampling. First, each document is annotated with domain labels and multiple quality scores. These scores are normalized and merged using domain-specific parameters to compute an aggregated quality score. Documents are then sampled according to a sigmoid-based function that prioritizes higher-quality samples while maintaining domain balance through parameterized controls, as sketched below.
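To make the sampling stage concrete, here is a minimal Python sketch of how an aggregated quality score could feed a sigmoid-shaped sampling function. The function names, the weighted-sum aggregation, and the threshold/steepness parameterization are illustrative assumptions, not the exact formulation from the paper.

```python
import numpy as np

def aggregate_quality(quality_scores: np.ndarray, domain_weights: np.ndarray) -> float:
    """Merge normalized per-criterion quality scores into one aggregated
    score using domain-specific weights (hypothetical parameterization)."""
    return float(np.dot(domain_weights, quality_scores))

def sampling_probability(agg_score: float, threshold: float, steepness: float) -> float:
    """Sigmoid-shaped sampling function: documents scoring above the
    domain-specific threshold are retained with high probability."""
    return 1.0 / (1.0 + np.exp(-steepness * (agg_score - threshold)))

# Example: one document with three normalized quality scores.
scores = np.array([0.8, 0.6, 0.9])   # e.g., fluency, educational value, coherence
weights = np.array([0.5, 0.3, 0.2])  # domain-specific merge weights (tunable)
p = sampling_probability(aggregate_quality(scores, weights),
                         threshold=0.6, steepness=10.0)
print(f"sampling probability: {p:.3f}")
```

Because the threshold and steepness can differ per domain, the same knobs that reward quality can also be tuned to keep under-represented domains in the mixture.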

Optimization is performed by training thousands of small proxy models across different parameter settings. A regression model, trained on these proxy experiments, predicts performance outcomes, enabling identification of optimal sampling configurations. This method allows a structured exploration of a high-dimensional parameter space, aligning data selection more closely with the intended downstream tasks. A minimal sketch of this loop follows.
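The sketch below illustrates the proxy-based optimization loop under stated assumptions: it fits a LightGBM regressor to (sampling-parameter, downstream-score) pairs and then searches the parameter space for the configuration with the highest predicted score. The 12-dimensional parameter space and the synthetic scores are placeholders; in practice each score would come from training a small proxy model on the mixture induced by that parameter setting.

```python
import numpy as np
import lightgbm as lgb

# Hypothetical proxy-experiment data: each row of X is a sampling-parameter
# vector (domain thresholds, merge weights, steepness, ...), and y is the
# downstream score measured after training a small proxy model on the
# mixture that configuration induces. Synthetic values stand in here.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(3000, 12))                       # assumed 12-dim space
y = -np.sum((X - 0.5) ** 2, axis=1) + rng.normal(0, 0.01, 3000)  # placeholder scores

# Fit a LightGBM regressor mapping parameters -> predicted performance.
model = lgb.LGBMRegressor(n_estimators=300, learning_rate=0.05)
model.fit(X, y)

# Search the parameter space for the configuration the regressor predicts
# will maximize downstream performance (random search for simplicity).
candidates = rng.uniform(0.0, 1.0, size=(100_000, 12))
best = candidates[np.argmax(model.predict(candidates))]
print("predicted-optimal sampling parameters:", np.round(best, 3))
```

The regressor is cheap to query, so the expensive step (proxy-model training) is amortized across the entire parameter search rather than repeated for every candidate configuration.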

QuaDMix provides several advantages:

  • Unified optimization of data quality and domain diversity.
  • Adaptability to task-specific requirements through proxy evaluation target selection.
  • Computational efficiency by circumventing exhaustive full-model retraining.
  • Consistent downstream performance improvements without increasing compute budgets.

Experimental Results and Insights

Validation experiments were conducted on the RefinedWeb dataset, training 530M-parameter models from scratch. QuaDMix was compared against several baselines, including Random Selection, Fineweb-edu, AskLLM, DCLM, DSIR, and RegMix. QuaDMix consistently outperformed these methods, achieving an average score of 39.5% across nine diverse benchmarks.

Key observations include:

  • Joint optimization strategies consistently outperform isolated quality- or diversity-focused methods.
  • Proxy model performance correlates strongly with large-scale model outcomes, validating the efficacy of the proxy-based approach.
  • Data mixtures optimized for specific downstream tasks further enhance task performance.
  • Merging multiple quality criteria reduces inherent biases and improves overall model robustness.
  • Expanding token diversity beyond a certain threshold yields diminishing returns, emphasizing the value of curated quality over sheer quantity.

Conclusion

QuaDMix offers a principled approach to data selection for LLM pretraining, addressing the longstanding challenge of simultaneously optimizing data quality and diversity. By integrating quality aggregation and domain-aware sampling within a unified framework and leveraging proxy-based optimization, QuaDMix establishes a scalable methodology for enhancing LLM pretraining efficiency. While there are opportunities for future improvements, such as refining the parameter space and enhancing proxy model fidelity, QuaDMix represents a significant step toward more systematic and effective data curation strategies for large-scale model development.


Check out the Paper.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.
