
Chain-of-Thought Reasoning and Reasoning RL

Different types of reasoning

  • Chain-of-Thought (CoT) Reasoning with LLMs: Early research used a "scratchpad" method to break the problem into intermediate steps. Later work prompted a strong model to "think step by step," which was found to significantly improve performance.

  • Reasoning with expert iteration: The Self-Taught Reasoner (STaR) [Zelikman et al., 2022] frames reasoning as a bootstrapping loop: a pretrained model first samples diverse chains-of-thought (CoTs), keeps only those that lead to correct answers, and then finetunes on these “expert” traces. Iterating this cycle can improve the LM’s reasoning capabilities and solve rate. STaR demonstrated that this version of expert iteration [Anthony et al., 2017], using automatic, string match–based verification of generated answers, can bootstrap reasoning skills without human-written reasoning traces.

  • Reasoning RL with verified rewards, o1 and R1: OpenAI o1, DeepSeek R1, and Kimi k1.5 use policy gradient methods to train on math and code tasks where string matching or unit tests verify correctness.

SFT

In our SFT experiments, we observed that filtering out bad examples from the SFT data improves the performance of the resulting SFT model.

Expert Iteration

Policy gradient

For an LM, given $s_t$ as an input (state) and $a_t$ as an output (action) taken in that state, we can view the LM as a categorical stochastic policy:

$$a_t \sim \pi_\theta(\cdot \mid s_t), \qquad \pi_\theta(a_t \mid s_t) = [\mathrm{softmax}(f_\theta(s_t))]_{a_t}$$
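A minimal PyTorch sketch of this view (the shapes and the stand-in `f_theta` are illustrative assumptions, not a real LM):

```python
import torch

# Hypothetical setup: f_theta maps a state (an encoded token prefix) to logits
# over the vocabulary. A real LM would compute logits from the full prefix;
# here a single linear layer over a fake prefix embedding stands in for it.
vocab_size, hidden = 50_257, 768
f_theta = torch.nn.Linear(hidden, vocab_size)

def sample_action(state_embedding: torch.Tensor) -> tuple[int, torch.Tensor]:
    """Treat the LM as a categorical policy: a_t ~ softmax(f_theta(s_t))."""
    logits = f_theta(state_embedding)                 # f_theta(s_t)
    dist = torch.distributions.Categorical(logits=logits)
    a_t = dist.sample()                               # a_t ~ pi_theta(. | s_t)
    return a_t.item(), dist.log_prob(a_t)             # also return log pi_theta(a_t | s_t)

s_t = torch.randn(hidden)                             # stand-in for an encoded prefix
action, logp = sample_action(s_t)
```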

Trajectory

We call the sequence produced by repeatedly appending the sampled token to the context, $s_{t+1} = s_t a_t$, a trajectory $\tau$ (also called an episode or rollout).

Rewards and Return

A scalar reward $r_t = R(s_t, a_t)$ judges the immediate quality of the action taken at state $s_t$.

$$r_T = R(s_T, a_T) := \begin{cases} 1 & \text{if the trajectory } s_T a_T \text{ matches the ground truth according to our reward function} \\ 0 & \text{otherwise.} \end{cases}$$

The finite-horizon undiscounted return is:

$$R(\tau) := \sum_{t=0}^{T} r_t$$

and the infinite-horizon discounted return is:

$$R(\tau) := \sum_{t=0}^{\infty} \gamma^t r_t, \qquad 0 < \gamma < 1.$$
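As a quick sanity check, here is how the two definitions differ on a toy reward sequence (the numbers are made up for illustration):

```python
# Toy reward sequence: sparse reward of 1 at the final step only.
rewards = [0.0, 0.0, 1.0]
gamma = 0.99

undiscounted = sum(rewards)                                    # R(tau) = 1.0
discounted = sum(gamma**t * r for t, r in enumerate(rewards))  # 0.99**2 = 0.9801
```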

In our case, we will use the undiscounted formulation since episodes have a natural termination point (end-of-text or max generation length). The objective of the agent is to maximize the expected return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)],$$

leading to the optimization problem:

$$\theta^* = \arg\max_\theta J(\theta).$$

In one sentence: maximize the expected return of the policy.

Vanilla Policy Gradient

Next, let us attempt to learn policy parameters θ with gradient ascent on the expected return:

$$\theta_{k+1} = \theta_k + \alpha \nabla_\theta J(\theta_k).$$

The core identity that we will use to do this is the REINFORCE policy gradient, shown below:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right].$$

Deriving the policy gradient

How did we get this equation? For completeness, we will give a derivation of this identity below. We will make use of a few identities.

  1. The probability of a trajectory is given by

    $$P(\tau \mid \theta) = \rho_0(s_0) \prod_{t=0}^{T} P(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$$

    Therefore, the log-probability of a trajectory is:

    $$\log P(\tau \mid \theta) = \log \rho_0(s_0) + \sum_{t=0}^{T} \left[\log P(s_{t+1} \mid s_t, a_t) + \log \pi_\theta(a_t \mid s_t)\right]$$
  2. The log-derivative trick:

    $$\nabla_\theta P = P\, \nabla_\theta \log P$$
  3. The environment terms are constant in $\theta$: $\rho_0(\cdot)$, $P(\cdot \mid \cdot, \cdot)$, and $R(\tau)$ do not depend on the policy parameters, so

    $$\nabla_\theta \rho_0 = \nabla_\theta P = \nabla_\theta R(\tau) = 0$$

Applying the facts above:

$$\begin{aligned} \nabla_\theta J(\theta) &= \nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] \\ &= \nabla_\theta \int_\tau P(\tau \mid \theta)\, R(\tau) \\ &= \int_\tau \nabla_\theta P(\tau \mid \theta)\, R(\tau) \\ &= \int_\tau P(\tau \mid \theta)\, \nabla_\theta \log P(\tau \mid \theta)\, R(\tau) \quad \text{(log-derivative trick)} \\ &= \mathbb{E}_{\tau \sim \pi_\theta}\left[\nabla_\theta \log P(\tau \mid \theta)\, R(\tau)\right] \\ &= \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right] \end{aligned}$$

Intuitively, this gradient increases the log-probability of every action in a trajectory with high return, and decreases it otherwise.

Sample estimate of the gradient. Given a batch of $N$ rollouts $\mathcal{D} = \{\tau^{(i)}\}_{i=1}^{N}$ collected by sampling a starting state $s_0^{(i)} \sim \rho_0(s_0)$ and then running the policy $\pi_\theta$ in the environment, we form an unbiased estimator of the gradient as:

$$\hat{g} = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T_i} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})\, R(\tau^{(i)})$$

where $T_i$ is the length of the $i$-th trajectory, which may vary across samples. This vector is used in the gradient-ascent update: $\theta \leftarrow \theta + \alpha \hat{g}$.
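In practice one does not form $\hat{g}$ by hand; instead one backpropagates through a surrogate whose gradient equals $-\hat{g}$. A minimal sketch, assuming per-trajectory log-probs and returns have already been collected (the function and variable names are illustrative):

```python
import torch

def reinforce_surrogate(log_probs: list[torch.Tensor], returns: list[float]) -> torch.Tensor:
    """Surrogate whose gradient equals -g_hat, so minimizing it performs
    gradient ascent on J(theta).

    log_probs[i]: 1-D tensor of log pi_theta(a_t | s_t) over trajectory i (length T_i).
    returns[i]:   scalar return R(tau^(i)), treated as a constant.
    """
    per_traj = [lp.sum() * R for lp, R in zip(log_probs, returns)]
    return -torch.stack(per_traj).mean()   # 1/N sum over trajectories, negated

# Usage: loss = reinforce_surrogate(batch_log_probs, batch_returns)
#        loss.backward()   # gradients now hold -g_hat; optimizer.step() ascends J
```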

Policy Gradient Baseline

Having introduced the vanilla policy gradient (REINFORCE), we now consider its shortcomings. A major one is that the policy gradient is unstable: its variance is very large, which leads to slow convergence.

To reduce the variance, a common trick is to introduce a baseline $b(s_t)$.

Policy gradient with a baseline

The policy gradient with a baseline is:

$$B = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(R(\tau) - b(s_t)\big)\right]$$

A reasonable choice of baseline is the on-policy value function $V^\pi(s) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau) \mid s_t = s]$. Intuitively, $R(\tau) - V^\pi(s_t)$ measures how much better the observed return is than expected.

Proof of unbiasedness

As long as the baseline $b(s_t)$ depends only on the state $s_t$ and not on the specific action $a_t$, introducing it does not bias the gradient. We can show this by expanding the expectation:

$$B = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right] - \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\right]$$

We want to show that the term after the minus sign equals 0. By the law of total expectation, we can rewrite it as:

$$\mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\right] = \sum_{t=0}^{T} \mathbb{E}_{s_t}\left[b(s_t)\, \mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]\right]$$

The key mathematical identity here is: the expectation of the score function is 0.

Proof that the score function has zero expectation

For any probability distribution $P_\theta(x)$, the score function $\nabla_\theta \log P_\theta(x)$ with respect to its parameters $\theta$ has expectation 0 under the distribution itself:

$$\mathbb{E}_{x \sim P_\theta}\left[\nabla_\theta \log P_\theta(x)\right] = 0$$

The proof is as follows:

$$\begin{aligned} \mathbb{E}_{x \sim P_\theta}\left[\nabla_\theta \log P_\theta(x)\right] &= \int P_\theta(x)\, \nabla_\theta \log P_\theta(x)\, dx \\ &= \int P_\theta(x)\, \frac{\nabla_\theta P_\theta(x)}{P_\theta(x)}\, dx \\ &= \int \nabla_\theta P_\theta(x)\, dx \\ &= \nabla_\theta \int P_\theta(x)\, dx \\ &= \nabla_\theta(1) = 0 \end{aligned}$$

Since $\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] = 0$, the baseline term indeed has expectation 0. This means $B = \nabla_\theta J(\theta)$: introducing a baseline leaves the expected gradient unchanged (unbiased), while the variance can be significantly reduced.
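We can also verify the identity numerically for a small categorical distribution, where the expectation is tractable and no sampling is needed (a quick sanity-check sketch):

```python
import torch

# Logits of a small categorical distribution P_theta over 5 outcomes.
theta = torch.randn(5, requires_grad=True)
probs = torch.softmax(theta, dim=0).detach()

# E_{x ~ P_theta}[grad_theta log P_theta(x)] computed exactly as a weighted sum.
expected_score = torch.zeros(5)
for x in range(5):
    log_p_x = torch.log_softmax(theta, dim=0)[x]
    (grad,) = torch.autograd.grad(log_p_x, theta)   # grad_theta log P_theta(x)
    expected_score += probs[x] * grad

print(expected_score)   # ~ all zeros, up to floating-point error
```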

pg_loss

pg_loss is not a loss in the canonical sense—it’s not meaningful to report pg_loss on the train or validation set as an evaluation metric, and a good validation pg_loss doesn’t indicate that our model is generalizing well. Instead, it is a surrogate objective function whose gradient is the policy gradient estimator.

$$\text{pg\_loss}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T_i} \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})\, \big(R(\tau^{(i)}) - b(s_t^{(i)})\big)$$
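A minimal per-token implementation sketch in PyTorch, assuming flattened per-token log-probs, returns, and baseline values (the names are illustrative, not a fixed API):

```python
import torch

def pg_loss(log_probs: torch.Tensor, returns: torch.Tensor,
            baselines: torch.Tensor) -> torch.Tensor:
    """Surrogate loss whose gradient is minus the baselined policy gradient estimator.

    Inputs are flattened over all trajectories and their tokens:
    log_probs: log pi_theta(a_t | s_t), carrying gradients.
    returns, baselines: R(tau) and b(s_t) per token, treated as constants.
    Note: .mean() averages over tokens rather than the 1/N per-trajectory
    normalization above, a common simplification.
    """
    advantages = (returns - baselines).detach()   # R(tau) - b(s_t), no grad through these
    return -(log_probs * advantages).mean()       # minimize => gradient ascent on J
```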

When doing RL, you should always log and report train and validation rewards. These are the “meaningful” evaluation metrics and what we are attempting to optimize with policy gradient methods.

Off-Policy Policy Gradient

The vanilla policy gradient above, and its variant with a baseline, are both typical on-policy methods: after every update to the policy parameters $\theta$, we must collect a fresh batch of data under the new $\theta$ to compute the next gradient. This leads to poor data efficiency. To address this, we can use off-policy methods.

Typical examples are PPO and GRPO, both of which update the new policy's parameters using rollouts collected from an old policy:

$$\hat{g}_{\text{off-policy}} = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T_i} \frac{\pi_\theta(a_t^{(i)} \mid s_t^{(i)})}{\pi_{\theta_{\text{old}}}(a_t^{(i)} \mid s_t^{(i)})}\, \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})\, R(\tau^{(i)})$$

πθπθold相差不大时,上面的权重系数是合理的

Group Relative Policy Optimization (GRPO)

Baseline setup: for each question we sample multiple rollouts and compute their group-normalized rewards. For a question $q$ and group outputs $\{o^{(i)}\}_{i=1}^{G} \sim \pi_\theta(\cdot \mid q)$, let $r^{(i)} = R(q, o^{(i)})$.

$$A^{(i)} = r^{(i)}_{\text{norm}} = \frac{r^{(i)} - \text{mean}_j\, r^{(j)}}{\text{std}_j\, r^{(j)} + \epsilon}$$

All tokens within the same output $i$ share the same advantage $A^{(i)}$.
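A sketch of the group-normalized advantage computation for one question and its $G$ sampled outputs (function name and example rewards are illustrative):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: shape (G,), one scalar reward per sampled output for the same question.
    Returns A^(i) = (r^(i) - mean_j r^(j)) / (std_j r^(j) + eps)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 rollouts for one question, two correct (reward 1) and two wrong (reward 0).
A = grpo_advantages(torch.tensor([1.0, 0.0, 1.0, 0.0]))
# A ≈ tensor([ 0.866, -0.866,  0.866, -0.866]); every token of output i gets A[i].
```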

GRPO-Clip Objective

Let us first write out the full GRPO-Clip objective, and then we can build some intuition on what the clipping does:

$$J_{\text{GRPO-Clip}}(\theta) = \mathbb{E}_{q \sim \mathcal{D},\, \{o^{(i)}\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)} \left[\frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o^{(i)}|} \sum_{t=1}^{|o^{(i)}|} \underbrace{\min\!\left(\frac{\pi_\theta(o_t^{(i)} \mid q, o_{<t}^{(i)})}{\pi_{\theta_{\text{old}}}(o_t^{(i)} \mid q, o_{<t}^{(i)})}\, A^{(i)},\ \mathrm{clip}\!\left(\frac{\pi_\theta(o_t^{(i)} \mid q, o_{<t}^{(i)})}{\pi_{\theta_{\text{old}}}(o_t^{(i)} \mid q, o_{<t}^{(i)})},\, 1-\epsilon,\, 1+\epsilon\right) A^{(i)}\right)}_{\text{per-token objective}}\right]$$

The hyperparameter $\epsilon > 0$ controls how much the policy can change. To see this, we can rewrite the per-token objective in a more intuitive way following Achiam [2018a, b]. Define the function

$$g(\epsilon, A^{(i)}) = \begin{cases} (1+\epsilon)\, A^{(i)} & \text{if } A^{(i)} \geq 0 \\ (1-\epsilon)\, A^{(i)} & \text{if } A^{(i)} < 0 \end{cases}$$

We can rewrite the per-token objective as

$$\text{per-token objective} = \min\!\left(\frac{\pi_\theta(o_t^{(i)} \mid q, o_{<t}^{(i)})}{\pi_{\theta_{\text{old}}}(o_t^{(i)} \mid q, o_{<t}^{(i)})}\, A^{(i)},\ g(\epsilon, A^{(i)})\right)$$

We can now reason by cases. When the advantage $A^{(i)}$ is positive, the per-token objective simplifies to

$$\text{per-token objective} = \min\!\left(\frac{\pi_\theta(o_t^{(i)} \mid q, o_{<t}^{(i)})}{\pi_{\theta_{\text{old}}}(o_t^{(i)} \mid q, o_{<t}^{(i)})},\ 1+\epsilon\right) A^{(i)}$$

Since $A^{(i)} > 0$, the objective goes up if the action $o_t^{(i)}$ becomes more likely under $\pi_\theta$, i.e., if $\pi_\theta(o_t^{(i)} \mid q, o_{<t}^{(i)})$ increases. The clipping with $\min$ limits how much the objective can increase: once $\pi_\theta(o_t^{(i)} \mid q, o_{<t}^{(i)}) > (1+\epsilon)\, \pi_{\theta_{\text{old}}}(o_t^{(i)} \mid q, o_{<t}^{(i)})$, this per-token objective hits its maximum value of $(1+\epsilon)\, A^{(i)}$. So, the policy $\pi_\theta$ is not incentivized to go very far from the old policy $\pi_{\theta_{\text{old}}}$.

Analogously, when the advantage $A^{(i)}$ is negative, the model tries to drive down $\pi_\theta(o_t^{(i)} \mid q, o_{<t}^{(i)})$, but is not incentivized to decrease it below $(1-\epsilon)\, \pi_{\theta_{\text{old}}}(o_t^{(i)} \mid q, o_{<t}^{(i)})$ (refer to Achiam [2018b] for the full argument).
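The clipped per-token objective is straightforward to express in PyTorch. A sketch, assuming flattened per-token log-probs under the new and old policies and an advantage already broadcast to each token (names illustrative):

```python
import torch

def grpo_clip_per_token(log_probs: torch.Tensor, old_log_probs: torch.Tensor,
                        advantages: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    """Per-token GRPO-Clip objective (to be maximized; negate for a loss).

    log_probs / old_log_probs: per-token log-probs under pi_theta (with grad)
    and pi_theta_old (constant). advantages: A^(i) broadcast to every token
    of output i. eps is the clip-range hyperparameter.
    """
    ratio = torch.exp(log_probs - old_log_probs.detach())
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return torch.minimum(unclipped, clipped)   # elementwise min of the two branches

# Full objective: average per-token values within each output, then across the
# group of G outputs, and negate to obtain a loss for gradient descent.
```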