Scalable Reinforcement Learning With Verifiable Rewards: Generative Reward Modeling For Unstructured, Multi-domain Tasks


Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective in enhancing LLMs' reasoning and coding abilities, particularly in domains where structured reference answers allow clear-cut verification. This approach relies on reference-based signals to determine whether a model's response aligns with a known correct answer, typically through binary correctness labels or graded scores. RLVR has mainly been applied to areas like mathematics and coding, where rule-based or tool-assisted verification is straightforward. However, extending RLVR to more complex and less structured tasks has been difficult due to the challenge of verifying open-ended or ambiguous reference responses. Although generative models and closed-source LLMs like GPT-4o have been explored as verifiers, these solutions often remain domain-specific and require extensive annotated datasets for training.
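To make the contrast concrete, here is a hypothetical sketch of the kind of rule-based binary verification RLVR typically relies on (not the authors' code): a response earns reward 1 only when its final answer matches the reference after light normalization, which is exactly what breaks down for open-ended answers.

```python
# Hypothetical rule-based verifier: reward 1 only on a (lightly
# normalized) exact string match against the reference answer.
def rule_based_reward(response: str, reference: str) -> int:
    def normalize(s: str) -> str:
        return s.strip().lower().rstrip(".")
    return int(normalize(response) == normalize(reference))

print(rule_based_reward("42", "42."))        # equal after normalization -> 1
print(rule_based_reward("forty-two", "42"))  # semantically equal, but no string match -> 0
```

The second call illustrates the failure mode for unstructured tasks: a semantically correct free-form answer receives zero reward under string matching.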

Recent developments aim to broaden RLVR applications by introducing generative reward modeling, where LLMs use their generative abilities to produce judgments and justifications. These models can be trained without detailed rationales, instead relying on the confidence of the verifier's outputs to generate stable reward signals. This method supports reinforcement learning in tasks with noisy or ambiguous labels. Furthermore, researchers are exploring RLVR in a wider variety of domains using more free-form reference answers, sourced from expert annotations and pretraining data or generated by LLMs, moving beyond narrowly defined tasks like mathematics and logic puzzles. These efforts mark a significant step toward scalable and domain-general RLVR training.
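A minimal sketch of how a verifier's confidence can become a soft reward, under the assumption (common in this line of work, not confirmed as the paper's exact recipe) that the verifier is prompted to judge "Yes"/"No" and the probability it assigns to "Yes" is used as the reward. The logits below are placeholders for real model outputs.

```python
import math

def soft_reward(yes_logit: float, no_logit: float) -> float:
    # Softmax over the two judgment tokens gives P("Yes"),
    # computed in a numerically stable way.
    m = max(yes_logit, no_logit)
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)

print(round(soft_reward(4.0, -2.0), 3))  # confident "Yes" -> reward near 1
print(round(soft_reward(0.0, 0.0), 3))   # ambiguous judgment -> 0.5
```

Unlike a hard 0/1 label, this graded signal lets partially correct or ambiguous free-form answers receive intermediate reward.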

Tencent AI Lab and Soochow University researchers are exploring extending RLVR to complex, unstructured domains such as medicine, chemistry, and education. They show that binary correctness judgments remain consistent across LLMs when expert-written references are available. To address the limitations of binary rewards in free-form tasks, they introduce soft, generative model-based reward signals. Using compact 7B models, they train cross-domain reward verifiers without requiring extensive domain-specific annotation. Their RLVR framework significantly outperforms top open-source models on reasoning tasks and scales effectively. They also release a 570k-example dataset to support further research in multi-domain RLVR.

The method uses expert-written reference answers to guide reward estimation for reinforcement learning. Responses are evaluated using a generative LLM verifier, which outputs binary (0/1) or soft rewards based on the likelihood of correctness. Rewards are normalized using z-score normalization for stable training and improved learning dynamics. The authors train a compact (7B) generative reward model using judgments collected during RL exploration, to avoid relying solely on large models. These binary labels are obtained from a larger LLM and used to fine-tune the smaller verifier. This approach balances performance and efficiency while increasing robustness to noise and formatting variations.
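The z-score normalization mentioned above can be sketched as follows; this is the standard recipe (subtract the batch mean, divide by the batch standard deviation), though the exact grouping of rewards used in the paper may differ.

```python
def z_normalize(rewards: list[float], eps: float = 1e-8) -> list[float]:
    # Shift rewards to zero mean and unit variance within the batch;
    # eps guards against division by zero when all rewards are equal.
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Binary rewards for four sampled responses to the same prompt.
print([round(z, 3) for z in z_normalize([1.0, 0.0, 1.0, 0.0])])  # [1.0, -1.0, 1.0, -1.0]
```

Centering the rewards this way keeps the policy-gradient updates well scaled regardless of how often the verifier says "correct" for a given prompt.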

The study uses two large-scale Chinese QA datasets: one with 773k free-form math questions spanning school levels, and another with 638k multi-subject college-level questions from ExamQA. These datasets feature complex, unstructured answers that challenge rule-based reward methods. The researchers trained a 7B reward model (RM-7B) using 160k distilled samples and tested various RL approaches. Results show that RL with model-based rewards outperforms rule-based methods and supervised fine-tuning (SFT), especially on reasoning tasks. Notably, RM-7B achieves performance close to the larger 72B model, highlighting its efficiency. Binary rewards outperform soft rewards in rule-based settings due to semantic mismatch issues.

In conclusion, the study simplifies reward modeling by training a generative model to output binary scores (1 or 0) without relying on chain-of-thought reasoning. While CoT aids reasoning, its necessity for verifying semantic similarity remains unclear. Unlike past work that relied on format-based scoring, this approach avoids strict answer formatting, reducing manual effort. The research extends RLVR beyond structured domains to areas like medicine and economics, where reference answers are less well defined. Using a 7B model, it shows that soft, model-based rewards enhance performance on free-form tasks, outperforming larger models and improving RLVR's adaptability and scalability.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 85k+ ML SubReddit.


Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
