Chain-of-Thought Reasoning and Reasoning RL
Different types of reasoning
Chain-of-Thought (CoT) Reasoning with LLMs: Early research used a "scratchpad" method to break a problem into intermediate steps. Later work found that simply prompting a strong model to "think step by step" significantly improves performance on reasoning tasks.
Reason with expert iteration : The Self-Taught Reasoner (STaR) [Zelikman et al., 2022] frames reasoning as a bootstrapping loop: a pretrained model first samples diverse chains-of-thought (CoTs), keeps only those that lead to correct answers, and then finetunes on these “expert” traces. Iterating this cycle can improve the LM’s reasoning capabilities and solve rate. STaR demonstrated that this version of expert iteration [Anthony et al., 2017] using automatic, string match–based verification of generated answers can bootstrap reasoning skills without human-written reasoning traces.
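The loop above is easy to sketch in code. The following is a minimal illustration of one STaR-style expert-iteration round, not the authors' implementation; the helpers `sample_cot`, `extract_answer`, and `finetune` are hypothetical wrappers around your own generation, answer-parsing, and SFT code.

```python
# Sketch of one STaR-style expert-iteration round.
# sample_cot, extract_answer, and finetune are hypothetical helpers.
def star_round(model, dataset, samples_per_question=8):
    kept = []
    for question, gold_answer in dataset:
        # Sample several chains-of-thought for the same question.
        for cot in sample_cot(model, question, n=samples_per_question):
            # Keep only rationales whose final answer string-matches the gold answer.
            if extract_answer(cot) == gold_answer:
                kept.append({"prompt": question, "completion": cot})
                break  # one verified trace per question is enough here
    # Finetune on the verified "expert" traces, then iterate with the new model.
    return finetune(model, kept)
```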
Reasoning RL with verified rewards (o1 and R1): OpenAI o1, DeepSeek R1, and Kimi k1.5 use policy gradient methods to train on math and code tasks, where string matching or unit tests verify the correctness of the final answer.
SFT
We observed that we can improve the performance of the SFT model by filtering out bad examples from the SFT data.
Expert Iteration
Policy gradient
For an LM, the policy $\pi_\theta$ is the model itself: given the prompt and the tokens generated so far (the state $s_t$), it samples the next token (the action $a_t$).
Trajectory
We call a sequence of states and actions a trajectory, $\tau = (s_0, a_0, s_1, a_1, \ldots, s_T, a_T)$. For an LM, a trajectory is a prompt followed by the generated response tokens.
Rewards and Return
A scalar reward $r_t = R(s_t, a_t)$ measures how good the state-action pair at step $t$ is. The return $R(\tau)$ aggregates rewards along a trajectory; two common formulations are
finite-horizon undiscounted returns:
$$R(\tau) = \sum_{t=0}^{T} r_t,$$
and infinite-horizon discounted returns:
$$R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t, \quad \gamma \in (0, 1).$$
In our case, we will use the undiscounted formulation since episodes have a natural termination point (end-of-text or max generation length). The objective of the agent is to maximize the expected return:
$$J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big],$$
leading to the optimization problem:
$$\theta^\ast = \arg\max_{\theta}\; J(\pi_\theta).$$
In one sentence: maximize the policy's expected return.
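As a quick illustration (not from the original text), here is how the two return definitions translate into code for a list of per-step rewards:

```python
def undiscounted_return(rewards):
    # Finite-horizon undiscounted return: R(tau) = sum_t r_t
    return sum(rewards)

def discounted_return(rewards, gamma=0.99):
    # Infinite-horizon discounted return: R(tau) = sum_t gamma^t r_t
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: a 4-step episode rewarded only at the end (e.g., a verified answer).
rewards = [0.0, 0.0, 0.0, 1.0]
print(undiscounted_return(rewards))      # 1.0
print(discounted_return(rewards, 0.9))   # 0.9**3 = 0.729
```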
Vanilla Policy Gradient
Next, let us attempt to learn the policy parameters $\theta$ by gradient ascent on the expected return:
$$\theta_{k+1} = \theta_k + \alpha\, \nabla_\theta J(\pi_\theta)\big|_{\theta_k}.$$
The core identity that we will use to do this is the REINFORCE policy gradient, shown below:
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right].$$
Deriving the policy gradient
How did we get this equation? For completeness, we will give a derivation of this identity below. We will make use of a few identities.
The probability of a trajectory is given by
$$P(\tau \mid \theta) = \rho_0(s_0) \prod_{t=0}^{T} P(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t).$$
Therefore, the log-probability of a trajectory is:
$$\log P(\tau \mid \theta) = \log \rho_0(s_0) + \sum_{t=0}^{T} \Big( \log P(s_{t+1} \mid s_t, a_t) + \log \pi_\theta(a_t \mid s_t) \Big).$$
The log-derivative trick:
$$\nabla_\theta P(\tau \mid \theta) = P(\tau \mid \theta)\, \nabla_\theta \log P(\tau \mid \theta).$$
The environment terms $\rho_0(s_0)$ and $P(s_{t+1} \mid s_t, a_t)$ do not depend on the policy parameters $\theta$, so their gradients are zero and
$$\nabla_\theta \log P(\tau \mid \theta) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t).$$
Applying the facts above:
$$
\begin{aligned}
\nabla_\theta J(\pi_\theta) &= \nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big] \\
&= \nabla_\theta \int_\tau P(\tau \mid \theta)\, R(\tau) \\
&= \int_\tau \nabla_\theta P(\tau \mid \theta)\, R(\tau) \\
&= \int_\tau P(\tau \mid \theta)\, \nabla_\theta \log P(\tau \mid \theta)\, R(\tau) \\
&= \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right].
\end{aligned}
$$
Intuitively, this gradient increases the log-probability of every action in a trajectory with high return, and decreases it for trajectories with low return.
Sample estimate of the gradient. Given a batch of trajectories $\mathcal{D} = \{\tau_i\}$ collected by running the policy $\pi_\theta$ in the environment, we can estimate the policy gradient as
$$\hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau),$$
where $|\mathcal{D}|$ is the number of trajectories in the batch.
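A minimal PyTorch sketch of this estimator for LM rollouts (names and shapes are illustrative, not from the original text): it assumes you already have per-token log-probabilities of the sampled responses and one scalar return per trajectory.

```python
import torch

def reinforce_pg_loss(logprobs, response_mask, returns):
    """Surrogate objective whose gradient is the REINFORCE estimate.

    logprobs:      (batch, seq_len) log pi_theta(a_t | s_t) of the sampled tokens
    response_mask: (batch, seq_len) 1 on response tokens, 0 on prompt/padding
    returns:       (batch,) scalar return R(tau) of each trajectory
    """
    # sum_t log pi(a_t|s_t) * R(tau), averaged over the batch;
    # negated so that minimizing the loss ascends the expected return.
    weighted = logprobs * response_mask * returns.unsqueeze(-1)
    return -weighted.sum(dim=-1).mean()
```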
Policy Gradient Baseline
Having introduced the basic vanilla policy gradient (REINFORCE) algorithm, we can consider its shortcomings. A major problem is that the policy gradient estimate is unstable: its variance is very high, which leads to slow convergence.
To reduce variance, a common trick is to introduce a baseline.
Policy Gradient with a Baseline
The policy gradient with a baseline is:
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(R(\tau) - b(s_t)\big)\right].$$
A reasonable choice of baseline is the on-policy value function $V^{\pi}(s_t)$: the expected return when starting from state $s_t$ and acting according to $\pi$ thereafter.
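A sketch of the same surrogate with a baseline subtracted. Here the baseline is simply the batch-mean return, a cheap stand-in chosen for illustration; a learned value function $V^\pi(s_t)$ is the more principled choice.

```python
import torch

def pg_loss_with_baseline(logprobs, response_mask, returns):
    # Batch-mean return as a simple state-independent baseline b
    # (illustrative; a learned value function is the usual choice).
    baseline = returns.mean().detach()
    advantages = (returns - baseline).unsqueeze(-1)      # (batch, 1)
    weighted = logprobs * response_mask * advantages
    return -weighted.sum(dim=-1).mean()
```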
Proof of Unbiasedness
As long as the baseline $b(s_t)$ depends only on the state and not on the action $a_t$, subtracting it does not bias the policy gradient estimate.
We need to show that the subtracted term has expectation zero. By the law of total expectation, we can rewrite that term as:
$$\mathbb{E}_{\tau \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\big] = \mathbb{E}_{s_t}\Big[ b(s_t)\, \mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\big] \Big].$$
The key mathematical identity here is that the expectation of the score function is zero.
Proof that the Expected Score Function Is Zero
For any probability distribution $\pi_\theta(a \mid s)$ parameterized by $\theta$,
$$\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\big] = 0.$$
The proof is as follows:
$$
\mathbb{E}_{a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\big]
= \int_a \pi_\theta(a \mid s)\, \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)}\, da
= \nabla_\theta \int_a \pi_\theta(a \mid s)\, da
= \nabla_\theta\, 1 = 0.
$$
Since the probabilities integrate to one, their gradient with respect to $\theta$ is zero; combined with the law of total expectation above, this shows the baseline term vanishes and the estimator remains unbiased.
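One can sanity-check this identity numerically. Below is a small, purely illustrative sketch using a categorical distribution parameterized by logits:

```python
import torch

# E_{a ~ pi_theta}[ grad_theta log pi_theta(a) ] should be (close to) zero.
logits = torch.randn(5, requires_grad=True)
probs = torch.softmax(logits, dim=-1)

expected_score = torch.zeros_like(logits)
for a in range(5):
    log_prob_a = torch.log_softmax(logits, dim=-1)[a]
    (grad,) = torch.autograd.grad(log_prob_a, logits, retain_graph=True)
    expected_score += probs[a].detach() * grad

print(expected_score)  # ~ tensor([0., 0., 0., 0., 0.]) up to float error
```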
pg_loss
pg_loss is not a loss in the canonical sense—it’s not meaningful to report pg_loss on the train or validation set as an evaluation metric, and a good validation pg_loss doesn’t indicate that our model is generalizing well. Instead, it is a surrogate objective function whose gradient is the policy gradient estimator.
When doing RL, you should always log and report train and validation rewards. These are the "meaningful" evaluation metrics and what we are attempting to optimize with policy gradient methods.
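A rough sketch of how this separation looks in a training loop; everything here (`rollout`, `reward_fn`, `sample_batch`, `log`, and the `reinforce_pg_loss` sketch from above) is a hypothetical placeholder, not a specific library's API.

```python
# pg_loss is only the surrogate being differentiated;
# the reward is the metric that actually matters.
for step in range(num_steps):
    prompts = sample_batch(train_prompts)                    # hypothetical helper
    responses, logprobs, mask = rollout(policy, prompts)     # hypothetical helper
    rewards = reward_fn(prompts, responses)                  # verified rewards
    loss = reinforce_pg_loss(logprobs, mask, rewards)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    log({"train/reward_mean": rewards.mean().item(),   # the meaningful metric
         "train/pg_loss": loss.item()})                # logged, but not an eval metric
```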
Off-Policy Policy Gradient
The vanilla policy gradient above and its baseline variant are both typical on-policy methods: every update of the policy parameters $\theta$ requires fresh rollouts sampled from the current policy, and old samples cannot be reused.
Typical alternatives are PPO and GRPO, both of which collect rollouts from an old policy $\pi_{\theta_{\text{old}}}$ and reuse them to update the new policy's parameters.
When the new policy drifts too far from the old one, the importance-sampling ratio $\pi_\theta(a_t \mid s_t)/\pi_{\theta_{\text{old}}}(a_t \mid s_t)$ that corrects for the distribution mismatch becomes unreliable, which is why these objectives clip the ratio.
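As a small illustration (names are assumptions, not from the original), the per-token importance ratio can be computed from the stored old log-probabilities of the sampled tokens:

```python
import torch

def importance_ratio(new_logprobs, old_logprobs):
    # pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), computed in log space for stability.
    return torch.exp(new_logprobs - old_logprobs.detach())
```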
Group Relative Policy Optimization (GRPO)
Baseline setup: for each question, we sample multiple rollouts and compute their group-normalized rewards. For a question $q$, we sample a group of $G$ outputs $\{o_1, o_2, \ldots, o_G\}$ from the old policy $\pi_{\theta_{\text{old}}}$, score them to obtain rewards $\{r_1, r_2, \ldots, r_G\}$, and compute the advantage
$$\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \ldots, r_G\})}{\operatorname{std}(\{r_1, \ldots, r_G\})}.$$
For the same output $o_i$, every token shares the same advantage, $\hat{A}_{i,t} = \hat{A}_i$ (outcome supervision).
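A sketch of the group-normalized advantage computation, assuming `rewards` has shape `(num_questions, G)` (the shape and epsilon are assumptions for illustration):

```python
import torch

def grpo_advantages(rewards, eps=1e-6):
    # rewards: (num_questions, G) scalar reward per sampled output in each group.
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    # Group-normalized advantage; every token of output o_i reuses A_i.
    return (rewards - mean) / (std + eps)
```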
GRPO-Clip Objective
Let us first write out the full GRPO-Clip objective, and then we can build some intuition on what the clipping does:
$$
J_{\text{GRPO}}(\theta) = \mathbb{E}_{q,\; \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\!\Big( r_{i,t}(\theta)\, \hat{A}_{i,t},\; \operatorname{clip}\big(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_{i,t} \Big) \right],
$$
where the per-token probability ratio is
$$r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}.$$
The hyperparameter $\epsilon$ controls how far the probability ratio may move away from $1$ before it is clipped; a typical value is around $0.2$.
We can rewrite the per-token objective as
$$\ell_{i,t}(\theta) = \min\!\Big( r_{i,t}(\theta)\, \hat{A}_{i,t},\; \operatorname{clip}\big(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_{i,t} \Big).$$
We can now reason by cases. When the advantage $\hat{A}_{i,t} > 0$, the objective reduces to
$$\min\!\big(r_{i,t}(\theta),\, 1+\epsilon\big)\, \hat{A}_{i,t}.$$
Since this term is capped at $(1+\epsilon)\hat{A}_{i,t}$, once the ratio exceeds $1+\epsilon$ the objective stops growing and its gradient vanishes: there is no incentive to push the probability of this (good) token far above what the old policy assigned, which keeps the update close to $\pi_{\theta_{\text{old}}}$.
Analogously, when the advantage $\hat{A}_{i,t} < 0$, the objective reduces to
$$\max\!\big(r_{i,t}(\theta),\, 1-\epsilon\big)\, \hat{A}_{i,t},$$
so once the ratio falls below $1-\epsilon$ the gradient vanishes and the policy stops pushing the probability of this (bad) token further down beyond the clip range.
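A sketch of the clipped per-token objective in PyTorch; tensor shapes, names, and the masking/averaging convention are assumptions for illustration, since implementations differ in how they normalize over tokens.

```python
import torch

def grpo_clip_loss(new_logprobs, old_logprobs, advantages, response_mask, eps=0.2):
    """new/old_logprobs: (batch, seq_len) per-token log-probs of the sampled tokens.
    advantages:          (batch, 1) group-normalized advantage, shared across tokens.
    response_mask:       (batch, seq_len) 1 on response tokens."""
    ratio = torch.exp(new_logprobs - old_logprobs)                 # r_{i,t}(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    per_token = torch.minimum(unclipped, clipped)                  # min of the two terms
    # Length-normalized per-sequence average, then batch mean; negated for gradient descent.
    per_seq = (per_token * response_mask).sum(-1) / response_mask.sum(-1).clamp(min=1)
    return -per_seq.mean()
```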