|
2 | 2 | layout : post |
3 | 3 | title : "[Translation][Paper] LLaMA 2: Open Foundation and Fine-Tuned Chat Models (Meta/Facebook, 2023)" |
4 | 4 | date : 2023-08-06 |
5 | | -lastupdate: 2024-04-06 |
| 5 | +lastupdate: 2025-02-15 |
6 | 6 | categories: llama ai gpt |
7 | 7 | --- |
8 | 8 |
|
@@ -670,34 +670,33 @@ evaluating a generative model is an open research question, the ranking task of |
670 | 670 | Therefore, everything else being equal, an improvement of the reward model can be directly translated into |
671 | 671 | an improvement for LLaMA2-Chat. |
672 | 672 |
|
673 | | -### 3.2.3 Iterative Fine-Tuning |
| 673 | +### 3.2.3 Iterative Fine-Tuning |
674 | 674 |
|
675 | 675 | As we received more batches of human preference data annotation, we were able to train better reward |
676 | 676 | models and collect more prompts. We therefore trained successive versions for RLHF models, referred to |
677 | 677 | here as RLHF-V1, . . . , RLHF-V5. |
678 | 678 | We explored RLHF fine-tuning with two main algorithms: |
679 | 679 |
|
680 | | -* Proximal Policy Optimization (PPO) (Schulman et al., 2017), the standard in RLHF literature. |
681 | | -* Rejection Sampling fine-tuning. We sample K outputs from the model and select the best candidate |
682 | | -with our reward, consistent with Bai et al. (2022b). The same re-ranking strategy for LLMs was also |
683 | | -proposed in Deng et al. (2019), where the reward is seen as an energy function. Here, we go one step |
684 | | -further, and use the selected outputs for a gradient update. For each prompt, the sample obtaining |
685 | | -the highest reward score is considered the new gold standard. Similar to Scialom et al. (2020a), we |
686 | | -then fine-tune our model on the new set of ranked samples, reinforcing the reward. |
| 680 | +* **<mark><code>Proximal Policy Optimization (PPO)</code></mark>** (Schulman et al., 2017), the standard in RLHF literature. |
| 681 | +* **<mark><code>Rejection Sampling fine-tuning</code></mark>**. |
| 682 | + * Consistent with Bai et al. (2022b), we **<mark>sample K outputs from the model</mark>** and **<mark>select the best candidate</mark>** with our reward. |
| 683 | + * A similar re-ranking strategy for LLMs was also proposed in Deng et al. (2019), where the reward is seen as an energy function. |
| 684 | + * Here, we go one step further and **<mark>use the selected outputs for a gradient update</mark>**. For each prompt, the sample obtaining the highest reward score is considered the new gold standard. Similar to Scialom et al. (2020a), |
| 685 | + we then fine-tune our model on this new set of samples, reinforcing the reward (see the sketch below).
687 | 686 |
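To make the Rejection Sampling fine-tuning loop concrete, here is a minimal, illustrative sketch, not Meta's actual training code. It assumes a Hugging Face-style causal LM as the policy and a user-supplied `reward_fn(prompt, response)` scorer standing in for the trained reward model; the checkpoint name, sampling hyperparameters, and helper names are assumptions for illustration.

```python
# Minimal sketch of one Rejection Sampling fine-tuning step (illustrative only).
# Assumptions: a Hugging Face causal LM as the policy and a user-provided
# `reward_fn(prompt, response) -> float` standing in for the trained reward model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM checkpoint works for the sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
policy = AutoModelForCausalLM.from_pretrained(model_name)
policy.train()
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-5)

def rejection_sampling_step(prompt: str, reward_fn, k: int = 8):
    """Sample K candidates, keep the highest-reward one, take one gradient step on it."""
    inputs = tokenizer(prompt, return_tensors="pt")
    # Breadth: explore K samples for the same prompt (PPO would generate only one).
    outputs = policy.generate(
        **inputs, do_sample=True, top_p=0.9, temperature=1.0,
        max_new_tokens=256, num_return_sequences=k,
    )
    prompt_len = inputs["input_ids"].shape[1]
    candidates = [tokenizer.decode(o[prompt_len:], skip_special_tokens=True) for o in outputs]
    # The sample with the highest reward score becomes the new "gold standard" for this prompt.
    best = max(candidates, key=lambda resp: reward_fn(prompt, resp))
    # Fine-tune on the selected sample with the ordinary LM loss (SFT-style update);
    # in practice the prompt tokens would be masked out of the loss.
    batch = tokenizer(prompt + best, return_tensors="pt")
    loss = policy(input_ids=batch["input_ids"], labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return best, loss.item()
```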
|
688 | | -The two RL algorithms mainly differ in: |
| 687 | +The **<mark>two RL algorithms mainly differ in</mark>**:
689 | 688 |
|
690 | | -* Breadth — in Rejection Sampling, the model explores K samples for a given prompt, while only one generation is done for PPO. |
691 | | -* Depth — in PPO, during training at step t the sample is a function of the updated model policy from |
692 | | -t − 1 after the gradient update of the previous step. In Rejection Sampling fine-tuning, we sample |
693 | | -all the outputs given the initial policy of our model to collect a new dataset, before applying the |
694 | | -fine-tuning similar to SFT. However, since we applied iterative model updates, the fundamental |
695 | | -differences between the two RL algorithms are less pronounced. |
| 689 | +* Breadth: in Rejection Sampling, the model explores K samples for a given prompt, while only one generation is done in PPO.
| 690 | +* Depth:
| 691 | + * In PPO, during training at step t the sample is a function of the updated model policy from t − 1 after the gradient update of the previous step.
| 692 | + * In Rejection Sampling fine-tuning, we sample all the outputs given the initial policy of our model to collect a new dataset, before applying the fine-tuning similar to SFT. |
| 693 | + |
| 694 | +However, since we applied iterative model updates, the fundamental differences between the two RL algorithms are less pronounced (contrasted in the sketch below).
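As a schematic contrast of the depth dimension, the sketch below uses a hypothetical `Policy` interface (the `sample`, `supervised_finetune`, and `ppo_update` methods are placeholders, not a real library API): Rejection Sampling collects a whole dataset from the round's initial policy and only then fine-tunes, while PPO samples at every step from the policy updated at the previous step.

```python
# Schematic contrast of the "depth" dimension (illustrative; not Meta's code).
from typing import Callable, List, Protocol, Tuple

class Policy(Protocol):
    # Hypothetical interface, used only to show where sampling and updating interleave.
    def sample(self, prompt: str, n: int) -> List[str]: ...
    def supervised_finetune(self, dataset: List[Tuple[str, str]]) -> None: ...
    def ppo_update(self, prompt: str, response: str, reward: float) -> None: ...

def rejection_sampling_round(policy: Policy, prompts: List[str],
                             reward_fn: Callable[[str, str], float], k: int = 8) -> None:
    dataset = []
    for p in prompts:
        # All candidates come from the policy as it was at the start of the round ...
        best = max(policy.sample(p, n=k), key=lambda r: reward_fn(p, r))
        dataset.append((p, best))
    # ... and only then is a single SFT-style fine-tuning pass applied.
    policy.supervised_finetune(dataset)

def ppo_round(policy: Policy, prompts: List[str],
              reward_fn: Callable[[str, str], float]) -> None:
    for p in prompts:
        # Each step samples from the policy already updated by the previous step.
        response = policy.sample(p, n=1)[0]
        policy.ppo_update(p, response, reward_fn(p, response))
```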
696 | 695 |
|
697 | 696 | Until RLHF (V4), we used only Rejection Sampling fine-tuning, and after that, we combined the two |
698 | 697 | sequentially, applying PPO on top of the resulting Rejection Sampling checkpoint before sampling again.
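One possible reading of this schedule, reusing the hypothetical helpers from the sketch above (the loop bounds and the fixed `prompts`/`reward_fn` are simplifications; in reality new preference data and prompts arrive between rounds):

```python
# Illustrative schedule only: rounds up to V4 use Rejection Sampling fine-tuning,
# after which PPO is applied on top of the resulting checkpoint.
for _version in range(1, 5):                       # roughly RLHF-V1 .. RLHF-V4
    rejection_sampling_round(policy, prompts, reward_fn)
ppo_round(policy, prompts, reward_fn)              # PPO on top, before sampling again
```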
699 | 698 |
|
700 | | -#### Rejection Sampling |
| 699 | +#### Rejection Sampling |
701 | 700 |
|
702 | 701 | We perform rejection sampling only with our largest 70B LLaMA2-Chat. All smaller |
703 | 702 | models are fine-tuned on rejection sampled data from the larger model, thus distilling the large-model |
@@ -908,16 +907,14 @@ on human evaluations, it is important to note that human evaluations have severa |
908 | 907 |
|
909 | 908 | # 5 Discussion |
910 | 909 |
|
911 | | -## 5.1 New Findings and Comments (Learnings and Observations) |
| 910 | +## 5.1 Learnings and Observations |
912 | 911 |
|
913 | 912 | Our tuning process revealed several interesting results, such as LLaMA2-Chat's ability to organize its knowledge along the time dimension and its ability to call APIs for external tools. |
914 | 913 |
|
916 | 915 | ### Beyond Human Supervision: from SFT to RLHF |
916 | 915 |
|
917 | | -At the beginning of the project, many of us expressed a preference for **<mark>supervised annotation</mark>**, |
918 | | -attracted by its denser signal. |
919 | | -Meanwhile, the instability of **<mark>reinforcement learning</mark>** was already well known, |
920 | | -and the NLP community therefore still regarded it with some skepticism. |
| 916 | +At the beginning of the project, many of us expressed a preference for **<mark>supervised annotation</mark>**, attracted by its denser signal.
| 917 | +Meanwhile, **<mark>the instability of reinforcement learning was already well known</mark>**, and the NLP community therefore still regarded it with some skepticism.
921 | 918 | However, reinforcement learning proved highly effective, particularly given its **<mark>cost and time effectiveness</mark>**.
922 | 919 | Our findings suggest that
923 | 920 |
|
|