Paper summary) Training language models to follow instructions with human feedback

January 5, 2026 10 min read

1. Link to the original paper

View the paper

2. Paper review

Abstract

In human evaluations, outputs from the 1.3B-parameter InstructGPT model (the paper's model) were preferred to outputs from the 175B GPT-3, even though it has 100x fewer parameters.
- Original: In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters.

1. Introduction

InstructGPT is consistently preferred over GPT-3. Outputs from the 1.3B InstructGPT were rated higher than those from the 175B GPT-3. At 175B, labelers preferred InstructGPT outputs to GPT-3 outputs 85 ± 3% of the time, and to few-shot GPT-3 outputs 71 ± 4% of the time.
- Original: Labelers significantly prefer InstructGPT outputs over outputs from GPT-3. On our test set, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having over 100x fewer parameters. Outputs from our 175B InstructGPT are preferred to 175B GPT-3 outputs 85 ± 3% of the time, and preferred 71 ± 4% of the time to few-shot 175B GPT-3.
Truthfulness improved: on the TruthfulQA benchmark, InstructGPT generated truthful and informative answers about twice as often as GPT-3.
- Original: InstructGPT models show improvements in truthfulness over GPT-3. On the TruthfulQA benchmark, InstructGPT generates truthful and informative answers about twice as often as GPT-3.
Toxicity decreased slightly (about 25% fewer toxic outputs when prompted to be respectful), but bias, including gender bias, barely improved.
- Original: InstructGPT shows small improvements in toxicity over GPT-3, but not bias. we use the RealToxicityPrompts dataset. InstructGPT models generate about 25% fewer toxic outputs than GPT-3 when prompted to be respectful. InstructGPT does not significantly improve over GPT-3 on the Winogender (Rudinger et al., 2018) and CrowSPairs (Nangia et al., 2020) datasets.
Performance on some public benchmarks dropped: regressions appeared on SQuAD, DROP, HellaSwag, and WMT 2015 French-to-English translation (the "alignment tax").
- Original: During RLHF fine-tuning, we observe performance regressions compared to GPT-3 on certain public NLP datasets, notably SQuAD (Rajpurkar et al., 2018), DROP (Dua et al., 2019), HellaSwag (Zellers et al., 2019), and WMT 2015 French to English translation (Bojar et al., 2015). This is an example of an “alignment tax” since our alignment procedure comes at the cost of lower performance on certain tasks that we may care about.
The model still makes up facts, fails to follow some instructions, and gives long hedging answers.
- Original: InstructGPT still makes simple mistakes. For example, InstructGPT can still fail to follow instructions, make up facts, give long hedging answers to simple questions, or fail to detect instructions with false premises.

2. Related work

Kenton et al. (2021) catalog behavioral issues in LMs that result from misalignment, including producing harmful content and gaming misspecified objectives.
- Original: The question of what it means for language models to be aligned has also received attention recently (Gabriel, 2020). Kenton et al. (2021) catalog behavioral issues in LMs that result from misalignment, including producing harmful content and gaming misspecified objectives.
This work is also related to research on cross-task generalization in language models, where LMs are fine-tuned on a broad range of public NLP datasets (usually prefixed with an appropriate instruction) and evaluated on a different set of NLP tasks.
- Original: Our work is also related to research on cross-task generalization in language models, where LMs are fine-tuned on a broad range of public NLP datasets (usually prefixed with an appropriate instruction) and evaluated on a different set of NLP tasks.

3. Methods and experimental details

3.1 High-level methodology

The methodology follows Ziegler et al. (2019) and Stiennon et al. (2020), who applied it to stylistic continuation and summarization.
- Original: Our methodology follows that of Ziegler et al. (2019) and Stiennon et al. (2020), who applied it in the stylistic continuation and summarization domains.
Step 1: collect labeler demonstration data and train a supervised policy on it. Step 2: collect comparison data and train a reward model on it. Step 3: optimize the policy against the reward model using PPO.
- Original: Step 1: Collect demonstration data, and train a supervised policy. Step 2: Collect comparison data, and train a reward model. Step 3: Optimize a policy against the reward model using PPO.
Steps 2 and 3 can be iterated continuously: more comparison data is collected on the current best policy and used to train a new RM and then a new policy.
- Original: Steps 2 and 3 can be iterated continuously; more comparison data is collected on the current best policy, which is used to train a new RM and then a new policy.

3.2 Dataset

Because instruction-like prompts were rarely submitted to the GPT-3 API, labelers wrote three kinds of prompts themselves to obtain the bootstrapping data for training the first InstructGPT models: Plain (an arbitrary task, with sufficient diversity across tasks), Few-shot (an instruction plus multiple query/response pairs), and User-based (prompts matching real use cases stated in OpenAI API waitlist applications).
- Original: We asked labelers to write three kinds of prompts. Plain: We simply ask the labelers to come up with an arbitrary task, while ensuring the tasks had sufficient diversity. Few-shot: We ask the labelers to come up with an instruction, and multiple query/response pairs for that instruction. User-based: We had a number of use-cases stated in waitlist applications to the OpenAI API. We asked labelers to come up with prompts corresponding to these use cases.

3.3 Tasks

These prompts cover a very diverse set of natural language tasks, including generation, question answering, dialog, summarization, and extraction.
- Original: These prompts are very diverse and include generation, question answering, dialog, summarization, extractions, and other natural language tasks (see Table 1).

3.4 Human data collection

Compared with earlier work that collected human preferences on summarization, the inputs span a much broader range of tasks and occasionally include controversial or sensitive topics.
- Original: Compared to earlier work that collects human preference data on the task of summarization (Ziegler et al., 2019; Stiennon et al., 2020; Wu et al., 2021), our inputs span a much broader range of tasks, and can occasionally include controversial and sensitive topics.
Despite the complexity of the task, agreement rates were high: 72.6 ± 1.5% among training labelers and 77.3 ± 1.3% among held-out labelers.
- Original: Despite the complexity of the task, we find that inter-annotator agreement rates are quite high: training labelers agree with each-other 72.6 ± 1.5% of the time, while for held-out labelers this number is 77.3 ± 1.3%.
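The agreement figures above can be read as a pairwise agreement rate. The sketch below shows one plausible way to compute such a rate; the summary does not spell out the exact formula, so the function name `pairwise_agreement`, the vote encoding ("A"/"B"), and the sample votes are all illustrative assumptions.

```python
from itertools import combinations


def pairwise_agreement(labels_per_item):
    """Fraction of labeler pairs that made the same choice, pooled over items.

    labels_per_item: list of lists; each inner list holds the choice
    (e.g. "A" or "B") each labeler made for one comparison.
    This definition is an assumption, not the paper's stated formula.
    """
    agree = 0
    total = 0
    for labels in labels_per_item:
        # Every unordered pair of labelers on the same item counts once.
        for first, second in combinations(labels, 2):
            agree += int(first == second)
            total += 1
    if total == 0:
        raise ValueError("need at least one item with two or more labels")
    return agree / total


# Hypothetical votes: three comparisons, labeled by 3, 3, and 2 labelers.
votes = [["A", "A", "B"], ["B", "B", "B"], ["A", "B"]]
rate = pairwise_agreement(votes)
print(f"agreement rate: {rate:.3f}")
```

Pooling over labeler pairs weights items with more labelers more heavily; averaging per-item rates instead would be an equally reasonable choice.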

3.5 Models

Supervised fine-tuning (SFT).
The SFT model was trained for 16 epochs with cosine learning rate decay and a residual dropout of 0.2. The final SFT model was selected by its RM score on the validation set.
- Original: We trained for 16 epochs, using a cosine learning rate decay, and residual dropout of 0.2. We do our final SFT model selection based on the RM score on the validation set.
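For readers unfamiliar with cosine learning rate decay, here is a minimal sketch of the schedule shape. The function name `cosine_lr`, the peak and final learning rates, and the step counts are assumptions chosen for illustration; the quoted passage does not give these values.

```python
import math


def cosine_lr(step, total_steps, peak_lr, final_lr=0.0):
    """Learning rate at `step` under a cosine decay from peak_lr to final_lr."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so steps past the end stay at final_lr.
    progress = min(max(step / total_steps, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (peak_lr - final_lr) * cosine


# Hypothetical schedule: 100 steps, peak learning rate 1e-5.
for step in (0, 25, 50, 75, 100):
    print(step, cosine_lr(step, total_steps=100, peak_lr=1e-5))
```

The rate falls slowly at first, fastest around the midpoint, and flattens out again near the end of training.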
Reward modeling (RM).
The paper uses only 6B RMs, because this saves a lot of compute and because 175B RM training was found to be unstable and therefore less suitable as the value function during RL.
- Original: In this paper we only use 6B RMs, as this saves a lot of compute, and we found that 175B RM training could be unstable and thus was less suitable to be used as the value function during RL.
To speed up comparison collection, labelers are shown between K = 4 and K = 9 responses to rank. All $\binom{K}{2}$ comparisons from each prompt are trained on as a single batch element. This is much more computationally efficient because it requires only a single forward pass of the RM for each completion.
- Original: In order to speed up comparison collection, we present labelers with anywhere between K = 4 and K = 9 responses to rank. we train on all $\binom{K}{2}$ comparisons from each prompt as a single batch element. This is much more computationally efficient because it only requires a single forward pass of the RM for each completion
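To make the $\binom{K}{2}$ count concrete, the sketch below expands one ranking into its pairwise comparisons. It assumes a strict ranking with no ties, ordered from best to worst; the function name `ranking_to_pairs` and the response labels are hypothetical.

```python
from itertools import combinations
from math import comb


def ranking_to_pairs(ranked):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs.

    combinations() keeps input order, so the first element of each
    pair is always the higher-ranked response.
    """
    return list(combinations(ranked, 2))


# Hypothetical ranking of K = 4 responses, best first.
pairs = ranking_to_pairs(["resp_a", "resp_b", "resp_c", "resp_d"])
print(len(pairs), comb(4, 2))  # number of pairs for K = 4
print(comb(9, 2))              # number of pairs for K = 9
```

Each of the K completions needs only one reward computation, and all pairs for that prompt reuse those K values.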

$$ \textrm{loss} (\theta) = - \frac{1}{\binom{K}{2}} \mathbb{E}_{(x, y_w, y_l) \sim D} [\log (\sigma (r_\theta (x, y_w) - r_\theta (x, y_l)))] $$

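$\textrm{loss} (\theta)$ : loss function of the reward model with parameters $\theta$
$\binom{K}{2}$ : number of ways to choose 2 of the K responses
$y_{w}$ : the preferred (higher-ranked) response, $y_{l}$ : the less preferred (lower-ranked) response
$r_{\theta}(x,y)$ : scalar reward for prompt $x$ and response $y$
$\sigma (r_\theta (x, y_w) - r_\theta (x, y_l))$ : sigmoid of the reward difference, i.e. the probability the RM assigns to $y_w$ being preferred over $y_l$

Minimizing this loss maximizes $\log \sigma (r_\theta (x, y_w) - r_\theta (x, y_l))$, so the RM learns to widen the score gap between the preferred response ($y_w$) and the less preferred response ($y_l$). This is the core mechanism that makes the RM reflect human preference rankings, so that the PPO stage can be steered toward high-quality responses.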
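The loss above can be sketched for a single prompt as follows. This is a plain-Python illustration, not the paper's training code: it assumes the rewards for the K responses are already computed and listed from most to least preferred, and the names `log_sigmoid` and `rm_pairwise_loss` are hypothetical.

```python
import math
from itertools import combinations


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def rm_pairwise_loss(rewards_ranked):
    """Pairwise ranking loss for one prompt.

    rewards_ranked: scalar rewards r_theta(x, y) for the K responses,
    ordered from most preferred to least preferred (no ties assumed).
    Returns the negative mean of log sigmoid(r_w - r_l) over all
    K-choose-2 (preferred, rejected) pairs.
    """
    if len(rewards_ranked) < 2:
        raise ValueError("need at least two responses")
    pairs = list(combinations(rewards_ranked, 2))
    total = sum(log_sigmoid(r_w - r_l) for r_w, r_l in pairs)
    return -total / len(pairs)


# Hypothetical rewards: one RM agrees with the human ranking, one inverts it.
print(rm_pairwise_loss([2.0, 1.0, 0.0, -1.0]))
print(rm_pairwise_loss([-1.0, 0.0, 1.0, 2.0]))
```

The loss is lower when the rewards are ordered the same way as the human ranking, and equals log 2 when the RM scores every response identically.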
Reinforcement learning (RL).
To fix the performance regressions on public NLP datasets, pretraining gradients were mixed into the PPO gradients. These models are called PPO-ptx.
- Original: We also experiment with mixing the pretraining gradients into the PPO gradients, in order to fix the performance regressions on public NLP datasets. We call these models “PPO-ptx.” We maximize the following combined objective function in RL training

$$ \begin{aligned} \textrm{objective} (\phi) &= \mathbb{E}_{(x, y) \sim D_{\pi_\phi^\textrm{RL}}} \bigg[ r_\theta (x, y) - \beta \log \frac{\pi_\phi^\textrm{RL} (y \vert x)}{\pi^\textrm{SFT} (y \vert x)} \bigg] \\ &+ \gamma \mathbb{E}_{x \sim D_\textrm{pretrain}} [\log (\pi_\phi^\textrm{RL} (x))] \end{aligned} $$

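$\textrm{objective} (\phi)$ : objective function ($\phi$ denotes the weights of the RL policy)
$\pi_\phi^\textrm{RL} (y \vert x)$ : probability the RL policy assigns to response y given prompt x
$\pi^\textrm{SFT} (y \vert x)$ : probability the SFT model assigns to response y given prompt x
$\log \frac{\pi_\phi^\textrm{RL} (y \vert x)}{\pi^\textrm{SFT} (y \vert x)}$ : per-sample log ratio between the RL policy and the SFT model for prompt x; its expectation under the RL policy is the KL divergence between the two models, so the $\beta$ term acts as a KL penalty
$\pi_\phi^\textrm{RL} (x)$ : likelihood the RL policy assigns to a text x drawn from the pretraining distribution $D_\textrm{pretrain}$

The objective mixes the PPO gradients (maximize $r_{\theta}(x,y)$, minimize $\log \frac {\pi_\phi^\textrm{RL} (y \vert x)}{\pi^\textrm{SFT} (y \vert x)}$) with the pretraining gradients (maximize $\log \pi_\phi^\textrm{RL} (x)$). In other words, it maximizes the reward while keeping the policy close to the SFT model and, at the same time, limits the performance regressions on public NLP datasets. The coefficients $\beta$ and $\gamma$ control the strength of the KL penalty and of the pretraining term, respectively.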
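As a rough illustration of how the terms of the combined objective fit together, the sketch below computes a Monte Carlo estimate of its value from already-sampled quantities. It only evaluates the objective and does not perform PPO updates; the function name `ppo_ptx_objective`, the tuple layout, and the values of `beta` and `gamma` are all assumptions for the example.

```python
def ppo_ptx_objective(rl_samples, pretrain_logprobs, beta, gamma):
    """Monte Carlo estimate of the combined objective value.

    rl_samples: list of (reward, logp_rl, logp_sft) for (x, y) pairs
        sampled from the RL policy, where logp_* are log pi(y | x).
    pretrain_logprobs: list of log pi_RL(x) for texts x drawn from
        the pretraining data.
    beta: KL penalty coefficient; gamma: pretraining loss coefficient.
    """
    if not rl_samples:
        raise ValueError("need at least one RL sample")
    # Reward minus the KL penalty (per-sample log ratio), averaged.
    rl_term = sum(
        reward - beta * (logp_rl - logp_sft)
        for reward, logp_rl, logp_sft in rl_samples
    ) / len(rl_samples)
    if gamma == 0 or not pretrain_logprobs:
        return rl_term  # no pretraining term: reward and KL penalty only
    ptx_term = sum(pretrain_logprobs) / len(pretrain_logprobs)
    return rl_term + gamma * ptx_term


# Hypothetical numbers, chosen only to show the arithmetic.
samples = [(1.0, -2.0, -3.0), (0.5, -1.0, -1.0)]
print(ppo_ptx_objective(samples, [-4.0, -6.0], beta=0.1, gamma=0.01))
print(ppo_ptx_objective(samples, [-4.0, -6.0], beta=0.1, gamma=0.0))
```

Raising `beta` penalizes responses that the RL policy makes much more likely than the SFT model does, while raising `gamma` gives more weight to fitting the pretraining text.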

3.6 Evaluation

The aim is to train models that act in accordance with user intentions. In practice, a model is defined as aligned if it is helpful, honest, and harmless.
- Original: Following Leike et al. (2018), our aim is to train models that act in accordance with user intentions. More practically, for the purpose of our language tasks, we use a framework similar to Askell et al. (2021), who define models to be aligned if they are helpful, honest, and harmless.

4. Results

4.1 Results on the API distribution

(Figure) Graph evaluating the models on a range of criteria
Labelers significantly prefer InstructGPT outputs over GPT-3 outputs. In a direct comparison, outputs from the 175B-parameter InstructGPT were preferred to GPT-3 outputs 85 ± 3% of the time, and to few-shot GPT-3 outputs 71 ± 4% of the time.
- Original: Labelers significantly prefer InstructGPT outputs over outputs from GPT-3. When compared directly, 175B InstructGPT outputs are preferred to GPT-3 outputs $85 \pm 3\%$ of the time, and preferred $71 \pm 4\%$ of the time to few-shot GPT-3.
The models generalize to the preferences of "held-out" labelers who did not produce any training data. The reward models predict the preferences of the held-out group with 69.6 ± 0.9% accuracy, a small decrease from the 72.4 ± 0.4% accuracy on labelers in the training set.
- Original: Our models generalize to the preferences of ‘held-out’ labelers that did not produce any training data. These RMs have an accuracy of 69.6 ± 0.9% on predicting the preferences of labelers in the held-out group, a small decrease from their 72.4 ± 0.4% accuracy on predicting the preferences of labelers in their training set.
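One natural reading of "accuracy on predicting preferences" is the fraction of human comparisons in which the RM scores the human-preferred response higher. The sketch below follows that reading, which is an assumption rather than the paper's stated procedure; the function name `rm_accuracy` and the sample scores are hypothetical, and ties are counted as misses.

```python
def rm_accuracy(scored_pairs):
    """Fraction of comparisons where the RM ranks the preferred response higher.

    scored_pairs: list of (reward_preferred, reward_rejected), where
    "preferred" is the response the human labeler chose.
    Ties (equal rewards) are counted as incorrect predictions.
    """
    if not scored_pairs:
        raise ValueError("need at least one comparison")
    correct = sum(1 for r_w, r_l in scored_pairs if r_w > r_l)
    return correct / len(scored_pairs)


# Hypothetical RM scores for four held-out comparisons.
held_out = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0), (0.5, 0.5)]
print(rm_accuracy(held_out))
```

Computing this separately on comparisons from training labelers and from held-out labelers gives the two numbers being contrasted above.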
Public NLP datasets do not reflect how the language models are actually used. In a head-to-head comparison, outputs from the 175B InstructGPT were preferred over the FLAN model 78 ± 4% of the time and over the T0 model 79 ± 4% of the time.
- Original: Public NLP datasets are not reflective of how our language models are used. In a head to head comparison, our 175B InstructGPT model outputs were preferred over our FLAN model $78 \pm 4\%$ of the time and over our T0 model $79 \pm 4\%$ of the time.

4.2 Results on public NLP datasets

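(Figure) Gray bars show truthfulness scores; colored bars show truthfulness + informativeness scores
(Figure) Human and automatic evaluation results on RealToxicityPrompts

InstructGPT models show improved truthfulness over GPT-3. Toxicity also improves slightly, but bias does not. Performance regressions on public NLP datasets can be minimized by modifying the RLHF fine-tuning procedure.
- Original: InstructGPT models show improvements in truthfulness over GPT-3. InstructGPT shows small improvements in toxicity over GPT-3, but not bias. We can minimize performance regressions on public NLP datasets by modifying our RLHF fine-tuning procedure.
(Top) InstructGPT can follow instructions in other languages, though it sometimes generates its output in English.
(Bottom) InstructGPT can summarize and answer questions about code more reliably than GPT-3.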

4.3 Qualitative results

InstructGPT models show promising generalization to instructions outside the RLHF fine-tuning distribution. The model often produces output in English even when the instruction is in another language. By comparison, GPT-3 can perform these tasks but requires more careful prompting, and rarely follows instructions in these domains.
- Original: InstructGPT models show promising generalization to instructions outside of the RLHF fine-tuning distribution. we notice that it often produces an output in English even when the instruction is in another language. In comparison, we find that GPT-3 can perform these tasks but requires more careful prompting, and rarely follows instructions in these domains.
InstructGPT models still make simple mistakes. (1) When an instruction contains a false premise, the model sometimes incorrectly assumes the premise is true. (2) It can hedge excessively: even when the context gives one fairly clear answer, it may say there is no single answer and list several possibilities. (3) Performance degrades when instructions contain multiple explicit constraints or constraints that are challenging for language models, such as a strict limit on the number of sentences.
- Original: InstructGPT still makes simple mistakes. (1) when given an instruction with a false premise, the model sometimes incorrectly assumes the premise is true, (2) the model can overly hedge; when given a simple question, it can sometimes say that there is no one answer to the question and give multiple possible answers, even when there is one fairly clear answer from the context, and (3) the model’s performance degrades when instructions contain multiple explicit constraints or when constraints can be challenging for language models.
