Reasoning with LLMs can benefit from using more test-time compute, which depends on high-quality process reward models (PRMs) to select promising paths for search or ranking. PRMs score problem-solution pairs to indicate whether the solution is correct, and have typically been implemented as discriminative classifiers. However, these models require extensive resources, including human annotation, gold step-by-step solutions, or computationally intensive rollouts. LLM-as-a-judge approaches offer advantages in data efficiency and interpretability, but they perform poorly compared to specialized reward models on complex reasoning tasks, often failing to recognize incorrect reasoning. This creates the challenge of retaining the data-efficiency and interpretability advantages while matching the superior performance of discriminative PRMs.
Research on process verification has followed three main paths. Discriminative PRMs function as classifiers that predict numerical correctness scores for each reasoning step, requiring extensive step-level annotations. Generative PRMs frame verification as a language-generation task, producing correctness decisions as natural-language tokens accompanied by a verification chain-of-thought (CoT). These models compute correctness scores through conditional token probabilities such as P(“correct”), making them inherently interpretable and scalable. Test-time scaling techniques such as best-of-N selection and tree-based search improve reasoning performance using additional inference-time compute. The effectiveness of these approaches depends heavily on verifier quality for scoring solutions.
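To make the scoring mechanism concrete, here is a minimal sketch of how a generative verifier can turn the probability of the token “correct” into a step score. It assumes a Hugging Face causal LM; the model name, prompt template, and function names are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch: scoring a reasoning step with a generative PRM via P("correct").
# Assumes a Hugging Face causal LM; prompt format and names are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # hypothetical choice
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def step_correct_probability(problem: str, solution_prefix: str, verification_cot: str) -> float:
    """Return the probability mass on the token " correct" at the verdict position.

    The verifier first writes its chain-of-thought, then emits a verdict token;
    the step score is P("correct") under the model's next-token distribution.
    """
    prompt = (
        f"Problem:\n{problem}\n\nSolution so far:\n{solution_prefix}\n\n"
        f"Verification:\n{verification_cot}\nVerdict: the step is"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]  # logits for the verdict token
    probs = torch.softmax(next_token_logits, dim=-1)
    correct_id = tokenizer.encode(" correct", add_special_tokens=False)[0]
    return probs[correct_id].item()
```

Because the score is just a token probability read off the verifier's own generation, the same chain-of-thought that produces the verdict also serves as a human-readable explanation of it.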
Researchers from the University of Michigan, Mila, LG AI Research, and the University of Illinois Urbana-Champaign have proposed THINKPRM, a long CoT verifier fine-tuned on significantly fewer process labels than those required by discriminative PRMs. It leverages the inherent reasoning abilities of long CoT models to outperform both LLM-as-a-Judge and discriminative verifiers while using only 1% of the process labels in PRM800K across several challenging benchmarks. Under equal token budgets, THINKPRM scales verification compute more effectively than LLM-as-a-Judge, outperforming it by 7.2% on a ProcessBench subset and highlighting the value of generative, long CoT PRMs for scaling test-time verification compute with minimal supervision.
THINKPRM is evaluated against DiscPRM, the same base model fine-tuned with binary cross-entropy on the full PRM800K dataset containing 712K process labels from 98K problem-solution pairs. Additional comparisons include unweighted majority voting and verifier-weighted majority voting for the best-of-N experiments. Results are reported on three mathematical reasoning tasks: 100 problems from MATH-500 covering all difficulty levels, 2024 American Invitational Mathematics Examination (AIME) problems, and out-of-domain tasks including physics problems from GPQA-Diamond and a 200-problem subset of LiveCodeBench v5. For MATH-500, the researchers used THINKPRM-1.5B and THINKPRM-14B with two different generator models.
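The three aggregation strategies compared here differ only in how candidate solutions and their verifier scores are combined. The sketch below, under the assumption that each candidate is a (final answer, verifier score) pair, illustrates the distinction; the data structure and function names are ours, not the paper's.

```python
# Sketch of the aggregation strategies: majority voting, verifier-weighted
# majority voting, and best-of-N selection over sampled solutions.
from collections import defaultdict
from typing import List, Tuple

Candidate = Tuple[str, float]  # (final answer, verifier score in [0, 1])

def majority_vote(candidates: List[Candidate]) -> str:
    counts = defaultdict(int)
    for answer, _ in candidates:
        counts[answer] += 1                 # each sample counts once
    return max(counts, key=counts.get)

def weighted_majority_vote(candidates: List[Candidate]) -> str:
    weights = defaultdict(float)
    for answer, score in candidates:
        weights[answer] += score            # sum verifier scores per answer
    return max(weights, key=weights.get)

def best_of_n(candidates: List[Candidate]) -> str:
    return max(candidates, key=lambda c: c[1])[0]  # single highest-scored solution

# Example with hypothetical scores: two samples agree on "42" but score low,
# one "41" sample scores high, so the strategies can disagree.
samples = [("42", 0.31), ("41", 0.92), ("42", 0.35)]
print(majority_vote(samples), weighted_majority_vote(samples), best_of_n(samples))
```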
On best-of-N selection with MATH-500, THINKPRM achieves reasoning accuracy higher than or comparable to DiscPRM across all sampling budgets. Under verifier-guided search on MATH-500, THINKPRM-1.5B outperforms DiscPRM by about 5 percentage points and surpasses LLM-as-a-judge using the same base model (R1-Qwen-1.5B). THINKPRM-1.5B's scaling curve exceeds all baselines when compared against strong off-the-shelf PRMs such as RLHFFlow-Deepseek-PRM and MATH-Shepherd-PRM, outperforming RLHFFlow-Deepseek-PRM by over 7% at 16 beams. In out-of-domain evaluation, THINKPRM shows better scaling than DiscPRM on GPQA-physics, outperforming it by 8%, while on LiveCodeBench it surpasses DiscPRM by 4.5%.
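For readers unfamiliar with verifier-guided search, the following sketch shows one common way a PRM can steer step-level beam search; the loop structure, parameter names, and the generator/verifier callables are illustrative assumptions rather than the exact search procedure used in the paper.

```python
# Sketch of verifier-guided step-level beam search. `propose_steps` stands in
# for the generator and `score_prefix` for the PRM; both are assumptions.
from typing import Callable, List, Tuple

def verifier_guided_search(
    problem: str,
    propose_steps: Callable[[str, List[str], int], List[str]],  # generator: propose next steps
    score_prefix: Callable[[str, List[str]], float],            # PRM: score a partial solution
    n_beams: int = 16,
    expansions_per_beam: int = 4,
    max_steps: int = 10,
) -> List[str]:
    beams: List[Tuple[List[str], float]] = [([], 0.0)]
    for _ in range(max_steps):
        candidates: List[Tuple[List[str], float]] = []
        for steps, _ in beams:
            for next_step in propose_steps(problem, steps, expansions_per_beam):
                new_steps = steps + [next_step]
                candidates.append((new_steps, score_prefix(problem, new_steps)))
        # keep the n_beams partial solutions the verifier rates highest
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:n_beams]
    return beams[0][0]  # highest-scored reasoning trace
```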
In conclusion, the researchers introduced THINKPRM, a generative process reward model trained with minimal supervision on synthetic data, enabling efficient and scalable verification of step-by-step reasoning. They demonstrate that lightweight fine-tuning of generative PRMs on as few as 8K process labels can improve upon zero-shot LLM-as-a-judge baselines. THINKPRM also surpasses discriminative PRMs trained with orders of magnitude more process labels, highlighting the advantages of generative language-modeling objectives for interpretability, scalability, and data efficiency. The results underscore the potential of generative PRMs to scale test-time verification compute effectively, benefiting challenging domains such as mathematical and scientific reasoning.
Check out the Paper. Also, don't forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don't forget to join our 90k+ ML SubReddit.
Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.