Explain the moon landing to a 6 year old in a few sentences.
GPT-3(仅预训练)的"补全"
Explain the theory of gravity to a 6 year old.
Explain the theory of relativity to a 6 year old in a few sentences.
Explain the big bang theory to a 6 year old.
Explain evolution to a 6 year old.
人类期望的回答
A giant rocket ship blasted off from Earth carrying astronauts to the moon. The astronauts landed their spaceship on the moon and walked around exploring the lunar surface. Then they returned safely back to Earth, bringing home moon rocks to show everyone.
InstructGPT(RLHF 之后)的回答
People went to the moon, and they took pictures of what they saw, and sent them back to the earth so we could all see them.
这就是 Ouyang et al. (2022) 在 InstructGPT 论文里强调的核心问题:
"Language models are not aligned with user intent."
预训练让模型学到了关于世界的大量知识与语言模式,但预训练目标(next-token prediction)和用户真实意图(被帮助、被回答)之间存在系统性鸿沟。
flowchart LR
A["Pretrain (语言建模)"] --> B["SFT / IFT (指令微调)"]
B --> C["RM 训练 (Bradley-Terry)"]
C --> D["PPO / RLHF (在线 RL)"]
B --> E["DPO (2023, 跳过 RM)"]
D --> F["GRPO (2024, DeepSeek)"]
E --> G["KTO / IPO / SimPO (2024, 各种变体)"]
style A fill:#e8f1f4
style B fill:#fff7e6
style C fill:#fff7e6
style D fill:#fbecec
style E fill:#eafaf0
style F fill:#eafaf0
style G fill:#eafaf0
每一步都试图解决前一步遗留的问题:
方法
解决的问题
引入的新问题
SFT / IFT
模型听不懂指令
需要大量人工示范;token-level 等价惩罚;无法学习"哪种更好"
RLHF (PPO)
能用 ranking 偏好数据;可优化任意 non-differentiable 奖励
训练不稳定;4 个 LM 同时在显存;超参敏感;reward hacking
DPO
跳过 RM,直接用 max likelihood 解 RLHF 的闭式解
纯 offline;reference 模型仍需在显存;某些任务弱于 PPO
GRPO
移除 value model,组内归一化做 advantage;适合 verifiable reward(数学/代码)
Please answer the following question. What is the boiling point of Nitrogen?
-320.4F
INPUT 2 (CoT math)
Answer the following question by reasoning step-by-step. The cafeteria had 23 apples. If they used 20 for lunch and bought 6 more, how many apples do they have?
The cafeteria had 23 apples originally. They used 20 to make lunch. So they had 23 - 20 = 3. They bought 6 more apples, so they have 3 + 6 = 9.
UNSEEN TASK (推理 + 常识)
Q: Can Geoffrey Hinton have a conversation with George Washington? Give the rationale before answering.
Geoffrey Hinton is a British-Canadian computer scientist born in 1947. George Washington died in 1799. Thus, they could not have had a conversation together. So the answer is "no".
Q: In the following sentences, explain the antecedent of the pronoun, or state that it is ambiguous.
Sentence: The reporter and the chef will discuss their favorite dishes.
Options: (A) reporter's (B) chef's (C) Ambiguous
A: Let's think step by step.
微调前 T5-XXL(无效复读)
The reporter and the chef will discuss their favorite dishes.
The reporter and the chef will discuss the reporter's favorite dishes.
The reporter and the chef will discuss the chef's favorite dishes.
... ❌ (doesn't answer question)
微调后 Flan-T5-XXL
"The reporter and the chef will discuss their favorite dishes" does not indicate whose favorite dishes they will discuss. So, the answer is (C). ✓
LLaMA 在 2023 年初开源后,社区掀起了一波"如何低成本造 IFT 数据"的浪潮。核心思路:
让一个强 LLM 自动生成指令数据(self-instruct, Wang et al. 2022)。
flowchart LR
A["175 seed tasks (人工撰写)"] --> B["text-davinci-003 (GPT-3.5)"]
B --> C["52K instruction- following examples"]
C --> D["Supervised Finetuning"]
E["LLaMA 7B (Meta 开源)"] --> D
D --> F["Alpaca 7B (Stanford, 2023)"]
style F fill:#eafaf0
LIMA(Less Is More for Alignment, Zhou et al., 2023)做了一个反直觉的实验:
只用 1,000 条精心挑选的高质量指令样本微调 LLaMA-65B,
然后人类盲评 vs Alpaca-65B / DaVinci003 / BARD / Claude / GPT-4。
真值:"Avatar is a fantasy TV show"
错误 A:"Avatar is a fantasy TV show" (正确)
错误 B:"Avatar is a adventure TV show"(题材错了,但意思接近)
错误 C:"Avatar is a musical TV show"(完全错)
SFT 把错误 B 和 C 视为同等错误,但显然 C 比 B 离谱得多。
我们需要一个能给"更接近真值的错误更小的惩罚"的训练信号——而这正是 RLHF 要解决的。
原文: "SAN FRANCISCO, California (CNN) — A magnitude 4.2 earthquake shook the San Francisco area... overturn unstable objects."
$s_1$: "An earthquake hit San Francisco. There was minor property damage, but no injuries." → $R(s_1) = 8.0$ good $s_2$: "The Bay Area has good weather but is prone to earthquakes and wildfires." → $R(s_2) = 1.2$ off-topic
flowchart LR
A["Step 1: SFT Collect demonstration data Fine-tune base LM"] --> B["Step 2: RM Training Collect comparison data Train RM with B-T loss"]
B --> C["Step 3: PPO RL Sample completions Score with RM + KL Update policy"]
C -->|"new policy"| C
style A fill:#e8f1f4
style B fill:#fff7e6
style C fill:#fbecec
从 OpenAI API waitlist 申请书里抽取真实 use case,请标注员撰写对应 prompt。
典型样例:
Use-case
Prompt
Brainstorming
List five ideas for how to regain enthusiasm for my career
Generation
Write a short story where a bear goes to the beach, makes friends with a seal, and then returns home.
效果对比
PROMPT
Write a short poem about a wise frog.
GPT-3(原始)
Write a short story in which a character has two different names.
Write a short story in which you try to get something back that you have lost.
Write a short story in which a character has a bad dream.
InstructGPT
The frog is a symbol of wisdom
He knows all the secrets of the world
He is a master of disguise
And a great teacher of life
He is a symbol of transformation
And the bringer of change
He is the frog who has seen it all
And knows the meaning of it all
注意 GPT-3 完全无视"写诗"的指令,继续在补全"Write a short ..."的模式;
而 InstructGPT 真正理解了任务,写出了一首关于智慧青蛙的诗。
"We trained this model using Reinforcement Learning from Human Feedback (RLHF), using the same methods as InstructGPT,
but with slight differences in the data collection setup. We trained an initial model using supervised fine-tuning:
human AI trainers provided conversations in which they played both sides—the user and an AI assistant.
We gave the trainers access to model-written suggestions to help them compose their responses.
We mixed this new dialogue dataset with the InstructGPT dataset, which we transformed into a dialogue format."
What are the five most common causes of stress among people?
Alpaca (SFT only, 10k data)
The five most common causes of stress among people are work, money, relationships, health, and family
Alpaca + PPO
The five most common causes of stress among people are:
1. Pressure at work: Whether it is due to tight deadlines, high expectations, or an overwhelming workload, pressure at work can be a major source of stress
2. Money problems: Financial issues can be a significant source of stress, from not being able to make ends meet to struggling with debt.
3. Unhealthy lifestyle: Poor diet, lack of exercise, and sleep deprivation can all contribute to stress.
4. Unhappiness at home: Problems in relationships, conflicts with family members, and other domestic issues can be a major source of stress.
5. Uncertainty: Excessive worry and anxiety about the future, health issues, or other unknowns can be a major cause of stress
2023 年 Bing AI 在回答"Who won the Super Bowl?"时声称:
"The most recent Super Bowl was Super Bowl LVI, in which the Philadelphia Eagles defeated the Kansas City Chiefs by 31-24."
事实:(1) 当时最新的 Super Bowl 是 LVII;(2) Eagles 输给了 Chiefs,比分 35-38。
更糟的是 Google Bard 在演示中也犯了一个类似错误,导致 Alphabet 股价当天跌掉 100B 美元。
Ryan et al. (2024) 用 Starling 7B 的 reward model 做了一个直观实验:
让 AI 助手回答 "Where are you from?",把"I am from {country}" 中的 country 换成世界各国,
观察 RM 给的分数。结果是一张世界地图染色图——
"AI feedback 能替代 human feedback 吗?"是 2022-2024 年的核心研究问题。
RLAIF (Reinforcement Learning from AI Feedback)
Lee et al. (2023, Google) 的 RLAIF 论文做了一个直接对比:
用 GPT-4 / PaLM 等强模型替代人类标注员,产出 ranking 数据训 RM,再做 RLHF。
结果:RLAIF 在摘要任务上和 RLHF 打平甚至略胜(71% vs 73% win-rate)。
这一发现的实际意义:
大幅降低成本:API 调用比人工标注便宜 100-1000×
消除人类不一致:LLM 之间的 inter-rater agreement 高于人类
缩短迭代:可以在几小时内生成数百万条偏好数据
Constitutional AI (Anthropic)
Bai et al. (2022, Anthropic) 的 Constitutional AI 更进一步——
让模型根据一组明文"宪法"原则自己批评和修正自己,完全不需要人类对每个样本打标签。
flowchart TB
A["Initial Response (SFT model 回答)"] --> B["Critique 'How could this response be more helpful and less harmful, according to principle X?'"]
B --> C["Revision (模型重写自己的回答)"]
C --> D["SL-CAI (SFT on revised data)"]
D --> E["RL with AI feedback (用宪法 prompt 让 LM 自评)"]
style E fill:#eafaf0
典型宪法原则(节选自 Anthropic 公开的版本):
"Choose the response that is most helpful, honest, and harmless"
"Choose the response that least objectionable to a thoughtful, ethical person"
"Choose the response that avoids implying that the AI has any preferences, friends, or family"
"Choose the response that doesn't claim to have a body or be able to move in a body"
Constitutional AI 的优势:透明的价值观规则、可审计、可修改。这与"黑箱 reward model"形成鲜明对比。
缺点:宪法本身也是人写的,价值观的偏见从标注员转移到了"宪法起草者"。