
Chain-of-Thought Reasoning and Reasoning RL

Different types of reasoning

  • Chain-of-Thought (CoT) Reasoning with LLMs: Early research used a "scratchpad" method to break the problem into intermediate steps. Later work prompted a strong model to "think step by step," which was found to significantly improve performance.

  • Reasoning with expert iteration: The Self-Taught Reasoner (STaR) [Zelikman et al., 2022] frames reasoning as a bootstrapping loop: a pretrained model first samples diverse chains-of-thought (CoTs), keeps only those that lead to correct answers, and then finetunes on these “expert” traces. Iterating this cycle can improve the LM’s reasoning capabilities and solve rate. STaR demonstrated that this version of expert iteration [Anthony et al., 2017], using automatic, string match–based verification of generated answers, can bootstrap reasoning skills without human-written reasoning traces.

  • Reasoning RL with verified rewards, o1 and R1: OpenAI o1, DeepSeek R1, and Kimi k1.5 use policy gradient methods to train on math and code tasks where string matching or unit tests verify correctness.

SFT

In our SFT experiments, we observed that filtering out bad examples from the SFT data improves the performance of the resulting SFT model.

Expert Iteration

Policy gradient

For an LM, given $s_t$ as an input (state) and $a_t$ as an output (action) taken in that state, we can view the LM as a categorical stochastic policy:

$$a_t \sim \pi_\theta(\cdot \mid s_t), \qquad \pi_\theta(a_t \mid s_t) = [\mathrm{softmax}(f_\theta(s_t))]_{a_t}$$
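A minimal PyTorch sketch of this view (the shapes and the stand-in `f_theta` are illustrative assumptions, not a real LM):

```python
import torch

# Hypothetical setup: f_theta maps a state (an encoded token prefix) to logits
# over the vocabulary. A real LM would compute logits from the full prefix;
# here a single linear layer over a fake prefix embedding stands in for it.
vocab_size, hidden = 50_257, 768
f_theta = torch.nn.Linear(hidden, vocab_size)

def sample_action(state_embedding: torch.Tensor) -> tuple[int, torch.Tensor]:
    """Treat the LM as a categorical policy: a_t ~ softmax(f_theta(s_t))."""
    logits = f_theta(state_embedding)                 # f_theta(s_t)
    dist = torch.distributions.Categorical(logits=logits)
    a_t = dist.sample()                               # a_t ~ pi_theta(. | s_t)
    return a_t.item(), dist.log_prob(a_t)             # also return log pi_theta(a_t | s_t)

s_t = torch.randn(hidden)                             # stand-in for an encoded prefix
action, logp = sample_action(s_t)
```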

Trajectory

We call the sequence produced by repeatedly appending the sampled token to the context, $s_{t+1} = s_t a_t$, a trajectory $\tau$ (also called an episode or rollout).

Rewards and Return

A scalar reward $r_t = R(s_t, a_t)$ judges the immediate quality of the action taken at state $s_t$.

$$r_T = R(s_T, a_T) := \begin{cases} 1 & \text{if the trajectory } s_T a_T \text{ matches the ground truth according to our reward function} \\ 0 & \text{otherwise.} \end{cases}$$

The finite-horizon undiscounted return is:

$$R(\tau) := \sum_{t=0}^{T} r_t$$

and the infinite-horizon discounted return is:

$$R(\tau) := \sum_{t=0}^{\infty} \gamma^t r_t, \qquad 0 < \gamma < 1.$$
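As a quick sanity check, here is how the two definitions differ on a toy reward sequence (the numbers are made up for illustration):

```python
# Toy reward sequence: sparse reward of 1 at the final step only.
rewards = [0.0, 0.0, 1.0]
gamma = 0.99

undiscounted = sum(rewards)                                    # R(tau) = 1.0
discounted = sum(gamma**t * r for t, r in enumerate(rewards))  # 0.99**2 = 0.9801
```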

In our case, we will use the undiscounted formulation since episodes have a natural termination point (end-of-text or max generation length). The objective of the agent is to maximize the expected return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)],$$

leading to the optimization problem:

$$\theta^* = \arg\max_\theta J(\theta).$$

In one sentence: maximize the expected return of the policy.

Vanilla Policy Gradient

Next, let us attempt to learn policy parameters θ with gradient ascent on the expected return:

$$\theta_{k+1} = \theta_k + \alpha \nabla_\theta J(\theta_k).$$

The core identity that we will use to do this is the REINFORCE policy gradient, shown below:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right].$$

Deriving the policy gradient

How did we get this equation? For completeness, we will give a derivation of this identity below. We will make use of a few identities.

  1. The probability of a trajectory is given by

    $$P(\tau \mid \theta) = \rho_0(s_0) \prod_{t=0}^{T} P(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$$

    Therefore, the log-probability of a trajectory is:

    $$\log P(\tau \mid \theta) = \log \rho_0(s_0) + \sum_{t=0}^{T} \left[\log P(s_{t+1} \mid s_t, a_t) + \log \pi_\theta(a_t \mid s_t)\right]$$
  2. The log-derivative trick:

    $$\nabla_\theta P = P\, \nabla_\theta \log P$$
  3. The environment terms are constant in $\theta$: $\rho_0(\cdot)$, $P(\cdot \mid \cdot, \cdot)$, and $R(\tau)$ do not depend on the policy parameters, so

    $$\nabla_\theta \rho_0 = \nabla_\theta P = \nabla_\theta R(\tau) = 0$$

Applying the facts above:

$$\begin{aligned} \nabla_\theta J(\theta) &= \nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] \\ &= \nabla_\theta \int_\tau P(\tau \mid \theta)\, R(\tau) \\ &= \int_\tau \nabla_\theta P(\tau \mid \theta)\, R(\tau) \\ &= \int_\tau P(\tau \mid \theta)\, \nabla_\theta \log P(\tau \mid \theta)\, R(\tau) \quad \text{(log-derivative trick)} \\ &= \mathbb{E}_{\tau \sim \pi_\theta}\left[\nabla_\theta \log P(\tau \mid \theta)\, R(\tau)\right] \\ &= \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right] \end{aligned}$$

Intuitively, this gradient increases the log-probability of every action in a trajectory with high return, and decreases it otherwise.

Sample estimate of the gradient. Given a batch of $N$ rollouts $\mathcal{D} = \{\tau^{(i)}\}_{i=1}^{N}$ collected by sampling a starting state $s_0^{(i)} \sim \rho_0(s_0)$ and then running the policy $\pi_\theta$ in the environment, we form an unbiased estimator of the gradient as:

$$\hat{g} = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T_i} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})\, R(\tau^{(i)})$$

where $T_i$ is the length of the $i$-th trajectory, which may vary across samples. This vector is used in the gradient-ascent update: $\theta \leftarrow \theta + \alpha \hat{g}$.
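In practice one does not form $\hat{g}$ by hand; instead one backpropagates through a surrogate whose gradient equals $-\hat{g}$. A minimal sketch, assuming per-trajectory log-probs and returns have already been collected (the function and variable names are illustrative):

```python
import torch

def reinforce_surrogate(log_probs: list[torch.Tensor], returns: list[float]) -> torch.Tensor:
    """Surrogate whose gradient equals -g_hat, so minimizing it performs
    gradient ascent on J(theta).

    log_probs[i]: 1-D tensor of log pi_theta(a_t | s_t) over trajectory i (length T_i).
    returns[i]:   scalar return R(tau^(i)), treated as a constant.
    """
    per_traj = [lp.sum() * R for lp, R in zip(log_probs, returns)]
    return -torch.stack(per_traj).mean()   # 1/N sum over trajectories, negated

# Usage: loss = reinforce_surrogate(batch_log_probs, batch_returns)
#        loss.backward()   # gradients now hold -g_hat; optimizer.step() ascends J
```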

Policy Gradient Baseline

Having introduced the vanilla policy gradient (REINFORCE), we now consider its shortcomings. A major one is that the policy gradient is unstable: its variance is very large, which leads to slow convergence.

To reduce the variance, a common trick is to introduce a baseline $b(s_t)$.

Policy gradient with a baseline

The policy gradient with a baseline is:

$$B = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(R(\tau) - b(s_t)\big)\right]$$

A reasonable choice of baseline is the on-policy value function $V^\pi(s) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau) \mid s_t = s]$. Intuitively, $R(\tau) - V^\pi(s_t)$ measures how much better the observed return is than expected.

Proof of unbiasedness

As long as the baseline $b(s_t)$ depends only on the state $s_t$ and not on the specific action $a_t$, introducing it does not bias the gradient. We can show this by expanding the expectation:

$$B = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right] - \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\right]$$

We want to show that the term after the minus sign equals 0. By the law of total expectation, we can rewrite it as:

$$\mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\right] = \sum_{t=0}^{T} \mathbb{E}_{s_t}\left[b(s_t)\, \mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]\right]$$

The key mathematical identity here is: the expectation of the score function is 0.

Proof that the score function has zero expectation

For any probability distribution $P_\theta(x)$, the score function $\nabla_\theta \log P_\theta(x)$ with respect to its parameters $\theta$ has expectation 0 under the distribution itself:

$$\mathbb{E}_{x \sim P_\theta}\left[\nabla_\theta \log P_\theta(x)\right] = 0$$

The proof is as follows:

$$\begin{aligned} \mathbb{E}_{x \sim P_\theta}\left[\nabla_\theta \log P_\theta(x)\right] &= \int P_\theta(x)\, \nabla_\theta \log P_\theta(x)\, dx \\ &= \int P_\theta(x)\, \frac{\nabla_\theta P_\theta(x)}{P_\theta(x)}\, dx \\ &= \int \nabla_\theta P_\theta(x)\, dx \\ &= \nabla_\theta \int P_\theta(x)\, dx \\ &= \nabla_\theta(1) = 0 \end{aligned}$$

Since $\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] = 0$, the baseline term indeed has expectation 0. This means $B = \nabla_\theta J(\theta)$: introducing a baseline leaves the expected gradient unchanged (unbiased), while the variance can be significantly reduced.
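We can also verify the identity numerically for a small categorical distribution, where the expectation is tractable and no sampling is needed (a quick sanity-check sketch):

```python
import torch

# Logits of a small categorical distribution P_theta over 5 outcomes.
theta = torch.randn(5, requires_grad=True)
probs = torch.softmax(theta, dim=0).detach()

# E_{x ~ P_theta}[grad_theta log P_theta(x)] computed exactly as a weighted sum.
expected_score = torch.zeros(5)
for x in range(5):
    log_p_x = torch.log_softmax(theta, dim=0)[x]
    (grad,) = torch.autograd.grad(log_p_x, theta)   # grad_theta log P_theta(x)
    expected_score += probs[x] * grad

print(expected_score)   # ~ all zeros, up to floating-point error
```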

pg_loss

pg_loss is not a loss in the canonical sense—it’s not meaningful to report pg_loss on the train or validation set as an evaluation metric, and a good validation pg_loss doesn’t indicate that our model is generalizing well. Instead, it is a surrogate objective function whose gradient is the policy gradient estimator.

$$\text{pg\_loss}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T_i} \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})\, \big(R(\tau^{(i)}) - b(s_t^{(i)})\big)$$
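A minimal per-token implementation sketch in PyTorch, assuming flattened per-token log-probs, returns, and baseline values (the names are illustrative, not a fixed API):

```python
import torch

def pg_loss(log_probs: torch.Tensor, returns: torch.Tensor,
            baselines: torch.Tensor) -> torch.Tensor:
    """Surrogate loss whose gradient is minus the baselined policy gradient estimator.

    Inputs are flattened over all trajectories and their tokens:
    log_probs: log pi_theta(a_t | s_t), carrying gradients.
    returns, baselines: R(tau) and b(s_t) per token, treated as constants.
    Note: .mean() averages over tokens rather than the 1/N per-trajectory
    normalization above, a common simplification.
    """
    advantages = (returns - baselines).detach()   # R(tau) - b(s_t), no grad through these
    return -(log_probs * advantages).mean()       # minimize => gradient ascent on J
```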

When doing RL, you should always log and report train and validation rewards. These are the “meaningful” evaluation metrics and what we are attempting to optimize with policy gradient methods.

Off-Policy Policy Gradient

The vanilla policy gradient above, and its variant with a baseline, are both typical on-policy methods: after every update to the policy parameters $\theta$, we must collect a fresh batch of data under the new $\theta$ to compute the next gradient. This leads to poor data efficiency. To address this, we can use off-policy methods.

Typical examples are PPO and GRPO, both of which update the new policy's parameters using rollouts collected from an old policy:

$$\hat{g}_{\text{off-policy}} = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T_i} \frac{\pi_\theta(a_t^{(i)} \mid s_t^{(i)})}{\pi_{\theta_{\text{old}}}(a_t^{(i)} \mid s_t^{(i)})}\, \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})\, R(\tau^{(i)})$$

πθπθold相差不大时,上面的权重系数是合理的

Group Relative Policy Optimization (GRPO)

Baseline setup: for each question we sample multiple rollouts and compute their group-normalized rewards. For a question $q$ and group outputs $\{o^{(i)}\}_{i=1}^{G} \sim \pi_\theta(\cdot \mid q)$, let $r^{(i)} = R(q, o^{(i)})$.

$$A^{(i)} = r^{(i)}_{\text{norm}} = \frac{r^{(i)} - \text{mean}_j\, r^{(j)}}{\text{std}_j\, r^{(j)} + \epsilon}$$

All tokens within the same output $i$ share the same advantage $A^{(i)}$.
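A sketch of the group-normalized advantage computation for one question and its $G$ sampled outputs (function name and example rewards are illustrative):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: shape (G,), one scalar reward per sampled output for the same question.
    Returns A^(i) = (r^(i) - mean_j r^(j)) / (std_j r^(j) + eps)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 rollouts for one question, two correct (reward 1) and two wrong (reward 0).
A = grpo_advantages(torch.tensor([1.0, 0.0, 1.0, 0.0]))
# A ≈ tensor([ 0.866, -0.866,  0.866, -0.866]); every token of output i gets A[i].
```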

GRPO-Clip Objective

Let us first write out the full GRPO-Clip objective, and then we can build some intuition on what the clipping does:

$$J_{\text{GRPO-Clip}}(\theta) = \mathbb{E}_{q \sim \mathcal{D},\, \{o^{(i)}\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)} \left[\frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o^{(i)}|} \sum_{t=1}^{|o^{(i)}|} \underbrace{\min\!\left(\frac{\pi_\theta(o_t^{(i)} \mid q, o_{<t}^{(i)})}{\pi_{\theta_{\text{old}}}(o_t^{(i)} \mid q, o_{<t}^{(i)})}\, A^{(i)},\ \mathrm{clip}\!\left(\frac{\pi_\theta(o_t^{(i)} \mid q, o_{<t}^{(i)})}{\pi_{\theta_{\text{old}}}(o_t^{(i)} \mid q, o_{<t}^{(i)})},\, 1-\epsilon,\, 1+\epsilon\right) A^{(i)}\right)}_{\text{per-token objective}}\right]$$

The hyperparameter $\epsilon > 0$ controls how much the policy can change. To see this, we can rewrite the per-token objective in a more intuitive way following Achiam [2018a, b]. Define the function

$$g(\epsilon, A^{(i)}) = \begin{cases} (1+\epsilon)\, A^{(i)} & \text{if } A^{(i)} \geq 0 \\ (1-\epsilon)\, A^{(i)} & \text{if } A^{(i)} < 0 \end{cases}$$

We can rewrite the per-token objective as

$$\text{per-token objective} = \min\!\left(\frac{\pi_\theta(o_t^{(i)} \mid q, o_{<t}^{(i)})}{\pi_{\theta_{\text{old}}}(o_t^{(i)} \mid q, o_{<t}^{(i)})}\, A^{(i)},\ g(\epsilon, A^{(i)})\right)$$

We can now reason by cases. When the advantage $A^{(i)}$ is positive, the per-token objective simplifies to

$$\text{per-token objective} = \min\!\left(\frac{\pi_\theta(o_t^{(i)} \mid q, o_{<t}^{(i)})}{\pi_{\theta_{\text{old}}}(o_t^{(i)} \mid q, o_{<t}^{(i)})},\ 1+\epsilon\right) A^{(i)}$$

Since $A^{(i)} > 0$, the objective goes up if the action $o_t^{(i)}$ becomes more likely under $\pi_\theta$, i.e., if $\pi_\theta(o_t^{(i)} \mid q, o_{<t}^{(i)})$ increases. The clipping with $\min$ limits how much the objective can increase: once $\pi_\theta(o_t^{(i)} \mid q, o_{<t}^{(i)}) > (1+\epsilon)\, \pi_{\theta_{\text{old}}}(o_t^{(i)} \mid q, o_{<t}^{(i)})$, this per-token objective hits its maximum value of $(1+\epsilon)\, A^{(i)}$. So, the policy $\pi_\theta$ is not incentivized to go very far from the old policy $\pi_{\theta_{\text{old}}}$.

Analogously, when the advantage $A^{(i)}$ is negative, the model tries to drive down $\pi_\theta(o_t^{(i)} \mid q, o_{<t}^{(i)})$, but is not incentivized to decrease it below $(1-\epsilon)\, \pi_{\theta_{\text{old}}}(o_t^{(i)} \mid q, o_{<t}^{(i)})$ (refer to Achiam [2018b] for the full argument).
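The clipped per-token objective is straightforward to express in PyTorch. A sketch, assuming flattened per-token log-probs under the new and old policies and an advantage already broadcast to each token (names illustrative):

```python
import torch

def grpo_clip_per_token(log_probs: torch.Tensor, old_log_probs: torch.Tensor,
                        advantages: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    """Per-token GRPO-Clip objective (to be maximized; negate for a loss).

    log_probs / old_log_probs: per-token log-probs under pi_theta (with grad)
    and pi_theta_old (constant). advantages: A^(i) broadcast to every token
    of output i. eps is the clip-range hyperparameter.
    """
    ratio = torch.exp(log_probs - old_log_probs.detach())
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return torch.minimum(unclipped, clipped)   # elementwise min of the two branches

# Full objective: average per-token values within each output, then across the
# group of G outputs, and negate to obtain a loss for gradient descent.
```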