Reinforcement Learning from Human Feedback (RLHF) is important for aligning LLMs with human values and preferences. Despite the introduction of non-RL alternatives like DPO, industry-leading models such as ChatGPT/GPT-4, Claude, and Gemini continue to rely on RL algorithms like PPO for policy optimization. Recent research focuses on algorithmic improvements, including eliminating critic models to reduce computational costs, filtering noisy samples during PPO sampling, and enhancing reward models to mitigate reward hacking. However, only a few studies focus on RLHF data construction (i.e., training prompts) and how performance scales with these training prompts.
The success of RLHF depends heavily on reward model quality, which faces three challenges: mis-specified reward modeling in representing human preferences, incorrect and ambiguous preferences in training datasets, and poor generalization ability. To address these issues, GenRM was introduced to validate model predictions against ground-truth responses, showing good resistance to reward hacking and gaining adoption in advanced LLMs like DeepSeek-V3. Data-selection methods that filter out overly challenging instances during training, as well as strategic selection methods that identify key training prompts, achieve comparable performance with less data. Performance scaling analysis reveals that RLHF generalizes better than SFT on new inputs but significantly reduces output diversity.
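To make the GenRM idea concrete, here is a minimal sketch of scoring a response by asking a generative judge to compare it against a ground-truth answer. The function name, prompt template, and verdict parsing are illustrative assumptions, not the paper's implementation, and `judge` stands in for any LLM call that returns text.

```python
def genrm_score(judge, prompt, response, reference):
    """Illustrative GenRM-style scoring: ask a generative judge whether
    `response` matches the ground-truth `reference`; 1.0 if yes, else 0.0.
    The prompt template and yes/no parsing are simplifications."""
    verdict = judge(
        f"Question: {prompt}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {response}\n"
        "Does the candidate match the reference? Answer yes or no:"
    )
    return 1.0 if verdict.strip().lower().startswith("yes") else 0.0

# Toy usage with a stub judge that just checks for the expected substring.
stub_judge = lambda p: "yes" if "Candidate answer: 4" in p else "no"
score = genrm_score(stub_judge, "What is 2+2?", "4", "4")
```

In practice the judge would be a strong LLM, and the verdict could be a graded score rather than a binary match; the point is that grounding the comparison in a reference answer is what gives GenRM its resistance to reward hacking.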
Researchers from ByteDance Seed address a critical gap in RLHF research, where the role of prompt-data construction and its scalability has received little attention. They explore the data-driven bottlenecks that limit RLHF performance scaling, focusing on the challenges of reward hacking and decreasing response diversity. A hybrid reward system is introduced that combines reasoning task verifiers (RTV) with a generative reward model (GenRM), showing stronger resistance to reward hacking and enabling a more accurate assessment of responses against ground-truth solutions. Moreover, a novel prompt-selection method called Pre-PPO is introduced to identify inherently challenging training prompts that are less susceptible to reward hacking.
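The Pre-PPO idea can be sketched as a pre-training filtering pass: score the prompt pool with the current reward signal and keep the lowest-scoring (i.e., hardest) prompts for RL training. The function name, the keep-fraction, and the toy scores below are assumptions for illustration, not values from the paper.

```python
def pre_ppo_select(prompts, score_fn, keep_fraction=0.3):
    """Illustrative Pre-PPO-style selection: keep the fraction of prompts
    whose responses currently receive the lowest reward scores, on the
    assumption that these are the inherently challenging prompts."""
    scored = sorted(prompts, key=score_fn)  # ascending: hardest first
    k = max(1, int(len(scored) * keep_fraction))
    return scored[:k]

# Toy usage with stand-in reward scores per prompt.
pool = ["easy prompt", "medium prompt", "hard prompt"]
fake_scores = {"easy prompt": 0.9, "medium prompt": 0.6, "hard prompt": 0.2}
selected = pre_ppo_select(pool, fake_scores.get, keep_fraction=0.34)
```

Here `selected` retains only the hardest prompt. In a real pipeline, `score_fn` would average reward-model scores over sampled responses from the current policy rather than look up fixed numbers.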
The experimental setup employs two pre-trained language models of different scales: a smaller model with 25B parameters and a larger model with 150B parameters. The training dataset contains one million prompts from diverse domains, including mathematics, coding, instruction-following, creative writing, and logical reasoning. Moreover, the researchers constructed a detailed evaluation framework covering multiple skill areas: logical reasoning, instruction-following, STEM tasks, coding, natural language processing, knowledge, contextual understanding, and out-of-distribution generalization. The evaluation framework includes two versions (V1.0 and V2.0) with overlapping prompts, though V2.0 features more challenging prompts.
The experimental results show that the proposed approach, combining Pre-PPO with prioritized mathematical and coding tasks, consistently outperforms the baseline method across model sizes and evaluation datasets. The approach shows an improvement of +1.1 over the baseline when evaluated at 100-step intervals using TestSet V1.0. When tested on the more challenging TestSet V2.0, the performance improvement increases to +1.4. The most significant gains appear in mathematics-intensive and coding tasks, with improvements of +3.9 points in STEM and +3.2 points in coding. These gains are attributed to the strategic prioritization of mathematical reasoning and coding tasks during early RLHF training phases.
In conclusion, this paper addresses critical bottlenecks in RLHF data scaling, specifically identifying reward hacking and reduced response diversity as significant challenges. The researchers proposed a combined approach featuring strategic prompt construction and early-stage training prioritization to solve these issues. The method uses RTV and GenRM to combat reward hacking, alongside the novel Pre-PPO prompt-selection strategy that identifies and prioritizes challenging training prompts. Analysis reveals that RTV supervision shows the strongest resistance to reward hacking, followed by GenRM with ground-truth labels and then the Bradley-Terry (BT) reward model. The research establishes a foundation for optimizing RLHF data construction and developing more principled methods to address reward hacking and model alignment.
Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 85k+ ML SubReddit.
Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.