
Commit f9a6829

Add deepseek-r1-paper-zh.md
1 parent 182b59c commit f9a6829

File tree

12 files changed (+1527 / -25 lines)

_posts/2023-08-06-llama2-paper-zh.md

Lines changed: 19 additions & 22 deletions
@@ -2,7 +2,7 @@
layout : post
title : "[译][论文] LLaMA 2:开放基础和微调聊天模型(Meta/Facebook,2023)"
date : 2023-08-06
-lastupdate: 2024-04-06
+lastupdate: 2025-02-15
categories: llama ai gpt
---

@@ -670,34 +670,33 @@ evaluating a generative model is an open research question, the ranking task of
Therefore, everything else being equal, an improvement of the reward model can be directly translated into
an improvement for LLaMA2-Chat.

-### 3.2.3 Iterative Fine-Tuning
+### 3.2.3 迭代式微调(Iterative Fine-Tuning)

As we received more batches of human preference data annotation, we were able to train better reward
models and collect more prompts. We therefore trained successive versions for RLHF models, referred to
here as RLHF-V1, . . . , RLHF-V5.
We explored RLHF fine-tuning with two main algorithms:

-* Proximal Policy Optimization (PPO) (Schulman et al., 2017), the standard in RLHF literature.
-* Rejection Sampling fine-tuning. We sample K outputs from the model and select the best candidate
-with our reward, consistent with Bai et al. (2022b). The same re-ranking strategy for LLMs was also
-proposed in Deng et al. (2019), where the reward is seen as an energy function. Here, we go one step
-further, and use the selected outputs for a gradient update. For each prompt, the sample obtaining
-the highest reward score is considered the new gold standard. Similar to Scialom et al. (2020a), we
-then fine-tune our model on the new set of ranked samples, reinforcing the reward.
+* **<mark><code>Proximal Policy Optimization (PPO)</code></mark>** (Schulman et al., 2017), the standard in RLHF literature.
+* **<mark><code>Rejection Sampling fine-tuning</code></mark>**.
+    * 与 Bai 等(2022b)保持一致，**<mark>从模型的 K 个输出中采样</mark>**，基于我们的奖励算法**<mark>选出最佳的几个输出</mark>**
+    * Deng 等(2019)提出了类似的 re-ranking 策略，其中将奖励视为一个能量函数(energy function)。
+    * 这里我们更进一步，**<mark>用选出的输出进行梯度更新</mark>**。对于每个 prompt，得分最高的输出作为新的黄金标准。与 Scialom 等(2020a)类似，
+      我们用这些新的样本对模型进行微调，强化奖励(reinforcing the reward)。

-The two RL algorithms mainly differ in:
+**<mark>两个 RL algorithm 的主要区别</mark>**

-* Breadth — in Rejection Sampling, the model explores K samples for a given prompt, while only one generation is done for PPO.
-* Depth — in PPO, during training at step t the sample is a function of the updated model policy from
-t − 1 after the gradient update of the previous step. In Rejection Sampling fine-tuning, we sample
-all the outputs given the initial policy of our model to collect a new dataset, before applying the
-fine-tuning similar to SFT. However, since we applied iterative model updates, the fundamental
-differences between the two RL algorithms are less pronounced.
+* 广度：在 Rejection Sampling 中，模型为给定的提示生成 K 个样本，而在 PPO 中只生成一个。
+* 深度：
+    * in PPO, during training at step t the sample is a function of the updated model policy from t − 1 after the gradient update of the previous step.
+    * In Rejection Sampling fine-tuning, we sample all the outputs given the initial policy of our model to collect a new dataset, before applying the fine-tuning similar to SFT.
+
+不过，由于我们使用了迭代式模型更新，两种 RL 算法之间的基本差异不那么明显。

Until RLHF (V4), we used only Rejection Sampling fine-tuning, and after that, we combined the two
sequentially, applying PPO on top of the resulted Rejection Sampling checkpoint before sampling again.

-#### Rejection Sampling
+#### Rejection Sampling(拒绝采样)

We perform rejection sampling only with our largest 70B LLaMA2-Chat. All smaller
models are fine-tuned on rejection sampled data from the larger model, thus distilling the large-model
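The hunk above describes LLaMA 2's Rejection Sampling fine-tuning: sample K outputs per prompt, keep the highest-reward candidate as the new gold standard, and fine-tune on it. The sketch below is a minimal illustration of that loop, not code from the paper or this commit; `generate`, `reward`, and `sft_update` are hypothetical stand-ins for the policy model, the reward model, and an SFT-style update.

```python
# Minimal sketch of one rejection-sampling fine-tuning round (hypothetical
# generate / reward / sft_update callables; not the LLaMA 2 implementation).
import random
from typing import Callable, List, Tuple


def rejection_sampling_round(
    prompts: List[str],
    generate: Callable[[str], str],                       # one sampled completion per call
    reward: Callable[[str, str], float],                  # reward-model score for (prompt, completion)
    sft_update: Callable[[List[Tuple[str, str]]], None],  # fine-tune on (prompt, best completion) pairs
    k: int = 4,
) -> List[Tuple[str, str]]:
    """Sample K candidates per prompt (breadth), keep the highest-reward one as the
    new 'gold standard', then fine-tune on those pairs, reinforcing the reward."""
    best_pairs: List[Tuple[str, str]] = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(k)]
        best = max(candidates, key=lambda c: reward(prompt, c))
        best_pairs.append((prompt, best))
    # Unlike PPO, every sample comes from the same (initial) policy and the update is
    # applied offline on the collected dataset, similar to SFT (the "depth" difference).
    sft_update(best_pairs)
    return best_pairs


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    random.seed(0)
    toy_generate = lambda p: f"{p} -> draft {random.randint(0, 9)}"
    toy_reward = lambda p, c: float(c.split()[-1])        # pretend the trailing digit is the score
    toy_update = lambda pairs: print(f"fine-tuning on {len(pairs)} best-of-K pairs")
    print(rejection_sampling_round(["prompt A", "prompt B"], toy_generate, toy_reward, toy_update, k=4))
```

In the paper's setup this round is repeated across RLHF-V1 to V5, with PPO applied on top of the rejection-sampling checkpoint in the later versions, as the hunk above notes.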
@@ -908,16 +907,14 @@ on human evaluations, it is important to note that human evaluations have severa

# 5 讨论

-## 5.1 新发现与评论(Learnings and Observations)
+## 5.1 新发现与点评

我们的调优过程揭示了一些有趣的结果，比如 LLaMA2-Chat 在时间维度上组织知识的能力，以及调用外部工具 API 的能力。

### 超越人类监督：从 SFT 到 RLHF

-在项目开始时，我们中的许多人都倾向于使用**<mark>有监督标注</mark>**(supervised annotation)，
-attracted by its denser signal。
-同时，**<mark>强化学习</mark>**(reinforcement learning)的不稳定性已经众所周知，
-因此自然语言处理领域对其还是抱有一种怀疑态度。
+在项目开始时，我们中的许多人都倾向于使用**<mark>有监督标注</mark>**(supervised annotation)，attracted by its denser signal。
+同时，**<mark>强化学习的不稳定性已经众所周知</mark>**，因此自然语言处理领域对其还是抱有一种怀疑态度。
但事实证明强化学习非常有效，尤其是考虑到其**<mark>成本和时间效率</mark>**
我们的研究结果认为，

_posts/2025-01-07-ai-agent-white-paper-zh.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
---
22
layout : post
3-
title : "AI Agent(智能体)技术白皮书(Google,2024)"
3+
title : "[译] AI Agent(智能体)技术白皮书(Google,2024)"
44
date : 2025-01-07
55
lastupdate: 2025-01-07
66
categories: ai llm
