The research paper investigates challenges in using process reward models (PRMs) for reinforcement fine-tuning (RFT) of large language models (LLMs) on reasoning tasks, focusing on the reward hacking induced by conventional summation-based credit assignment. To mitigate this, the authors introduce PURE (Process sUpervised Reinforcement lEarning), a framework built on min-form credit assignment: the value of each step is taken as the minimum over future process rewards rather than their sum, which leads to more stable and efficient training. Their experiments show that PRM-based training with PURE matches or exceeds the reasoning performance of methods relying on verifiable rewards, and that supplementing PRMs with a small amount of verifiable rewards further improves performance and reduces reward hacking. The paper also analyzes representative cases of reward hacking and the causes of training collapse, offering insights for future research on PRM-based RFT.
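As a rough sketch of the idea described above (not the authors' reference implementation), the contrast between summation-form and min-form credit assignment can be illustrated as follows; the function names and reward values are hypothetical:

```python
import numpy as np

def sum_form_returns(step_rewards, gamma=1.0):
    """Summation-form credit assignment: each step's return is the
    (discounted) sum of all future process rewards. A few highly
    rewarded steps can dominate the signal and invite reward hacking."""
    returns = np.zeros(len(step_rewards))
    running = 0.0
    for t in reversed(range(len(step_rewards))):
        running = step_rewards[t] + gamma * running
        returns[t] = running
    return returns

def min_form_returns(step_rewards):
    """Min-form credit assignment (as described for PURE): each step's
    return is the minimum over the current and all future process rewards,
    so a single low-quality step caps the credit of every step before it."""
    returns = np.zeros(len(step_rewards))
    running = np.inf
    for t in reversed(range(len(step_rewards))):
        running = min(running, step_rewards[t])
        returns[t] = running
    return returns

# Hypothetical step-level rewards from a PRM for a four-step solution
rewards = [0.9, 0.8, 0.1, 0.95]
print(sum_form_returns(rewards))  # later rewards accumulate into earlier steps
print(min_form_returns(rewards))  # every step up to the weak third step is capped at 0.1
```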


Neural intel Pod

Min-Form Credit Assignment for Process Reward Model Reasoning
