Scalable and Principled Reward Modeling for LLMs: Enhancing Generalist Reward Models (RMs) with SPCT and Inference-Time Optimization

Reinforcement Learning (RL) has become a widely used post-training method for LLMs, enhancing capabilities such as human alignment, long-term reasoning, and adaptability. A major challenge, however, is generating accurate reward signals in broad, less structured domains, as current high-quality reward models are mostly built on rule-based systems or verifiable tasks such as math and coding. In general applications, reward criteria are more diverse and subjective, and lack clear ground truths. To address this, generalist reward models (RMs) are being explored for broader applicability. However, these models must balance input flexibility and scalability during inference, particularly in producing reliable, high-quality rewards across varied tasks and domains.

Existing reward modeling approaches include scalar, semi-scalar, and generative techniques, each with trade-offs in flexibility and inference-time performance. For instance, pairwise models are limited to relative comparisons, while scalar models struggle to produce diverse feedback. Generative reward models (GRMs) offer richer, more flexible outputs, making them better suited for evaluating varied responses. Recent work has explored training GRMs through offline RL, integrating tools and external knowledge to improve reward quality. However, few methods directly address how RMs can scale efficiently during inference. This has led to research on methods such as sampling-based scaling, chain-of-thought prompting, and reward-guided aggregation, which aim to co-scale policy models and reward models during inference. These developments hold promise for more robust, general-purpose reward systems in LLMs.
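To make the trade-offs above concrete, here is a minimal sketch contrasting the three interfaces. It is not code from the paper; all functions are hypothetical placeholders meant only to show why a pointwise generative RM handles flexible inputs while scalar and pairwise RMs do not.

```python
# Hypothetical interfaces illustrating scalar vs. pairwise vs. pointwise generative RMs.
from dataclasses import dataclass
from typing import List


@dataclass
class GRMJudgment:
    principles: str    # evaluation criteria generated for this query
    critique: str      # free-text reasoning about the responses
    scores: List[int]  # one discrete score per response


def scalar_rm(query: str, response: str) -> float:
    """Scalar RM: a single number, no textual feedback to inspect or aggregate."""
    raise NotImplementedError  # stand-in for a learned scoring head


def pairwise_rm(query: str, response_a: str, response_b: str) -> int:
    """Pairwise RM: only a relative preference, so it cannot grade a single response."""
    raise NotImplementedError  # stand-in for a learned comparator


def generative_pointwise_rm(query: str, responses: List[str]) -> GRMJudgment:
    """Pointwise GRM: accepts any number of responses and returns a critique plus
    per-response scores, which is what makes sampling-and-voting at inference possible."""
    raise NotImplementedError  # stand-in for an LLM-based judge
```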

DeepSeek-AI and Tsinghua University researchers explore enhancing reward models (RMs) for general queries by improving inference-time scalability with increased compute and better learning techniques. They employ pointwise GRMs for flexible input handling and propose a learning method, Self-Principled Critique Tuning (SPCT), which helps GRMs generate adaptive principles and accurate critiques during online reinforcement learning. They apply parallel sampling and introduce a meta RM to scale effectively and refine the voting process. Their DeepSeek-GRM models outperform existing benchmark methods, offering higher reward quality and scalability, with plans for open-sourcing despite challenges in some complex tasks.

The researchers present SPCT, a method designed to enhance pointwise GRMs by enabling them to generate adaptive principles and accurate critiques. SPCT consists of two stages: rejective fine-tuning to initialize principle and critique generation, and rule-based RL for refinement. Instead of treating principles as a preprocessing step, they are generated dynamically during inference, which promotes scalability by improving reward granularity. Additionally, inference-time performance is boosted through parallel sampling and voting, supported by a meta reward model (meta RM) that filters out low-quality outputs. Overall, SPCT improves reward accuracy, robustness, and scalability in GRMs.
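The inference-time procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: `grm_generate` and `meta_rm_score` are hypothetical stand-ins for the DeepSeek-GRM and meta-RM calls, and sampling is shown sequentially rather than in parallel.

```python
# Sketch of SPCT-style inference-time scaling: sample k judgments, filter with a
# meta RM, then vote by summing the surviving pointwise scores.
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# One sample = (generated principles + critique text, {response_index: score})
Sample = Tuple[str, Dict[int, int]]


def vote_with_meta_rm(
    query: str,
    responses: List[str],
    grm_generate: Callable[[str, List[str]], Sample],
    meta_rm_score: Callable[[str, List[str], Sample], float],
    k: int = 8,
    keep_top: int = 4,
) -> Dict[int, int]:
    """Aggregate k sampled GRM judgments into one reward per response."""
    # 1) Sampling: each call generates its own principles and critique,
    #    then assigns pointwise scores to every candidate response.
    samples = [grm_generate(query, responses) for _ in range(k)]

    # 2) Meta-RM filtering: keep only the highest-quality judgments.
    ranked = sorted(samples, key=lambda s: meta_rm_score(query, responses, s), reverse=True)
    kept = ranked[:keep_top]

    # 3) Voting: sum the pointwise scores of the surviving samples.
    totals: Dict[int, int] = defaultdict(int)
    for _, scores in kept:
        for idx, score in scores.items():
            totals[idx] += score
    return dict(totals)
```

Summing scores across filtered samples is one simple way to realize the "voting refined by a meta RM" idea; the exact aggregation rule used in the paper may differ.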

Using standard metrics, the study evaluates various RM methods across benchmarks such as Reward Bench, PPE, RMB, and ReaLMistake. DeepSeek-GRM-27B consistently outperforms baselines and rivals strong public models such as GPT-4o. Inference-time scaling, especially with voting and meta reward models, significantly boosts performance, achieving results comparable to much larger models. Ablation studies highlight the importance of components such as principle generation and non-hinted sampling. Training-time scaling shows diminishing returns compared to inference-time strategies. Overall, DeepSeek-GRM, enhanced with SPCT and a meta RM, offers robust, scalable reward modeling with reduced domain bias and strong generalization.

In conclusion, the study presents SPCT, a method that improves inference-time scalability for GRMs through rule-based online reinforcement learning. SPCT enables adaptive principle and critique generation, enhancing reward quality across diverse tasks. DeepSeek-GRM models outperform several baselines and strong public models, especially when paired with a meta reward model for inference-time scaling. Using parallel sampling and flexible input handling, these GRMs achieve strong performance without relying on larger model sizes. Future work includes integrating GRMs into RL pipelines, co-scaling them with policy models, and using them as reliable offline evaluators.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 85k+ ML SubReddit.


Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
