Paper Summary) DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

May 10, 2025 17 min read

1. Link to the Paper

View the paper

2. Paper Review

Abstract

The paper introduces DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero demonstrates that strong reasoning ability can emerge from RL (reinforcement learning) alone, without SFT (supervised fine-tuning) as a preliminary step.
- Original : We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities.
However, DeepSeek-R1-Zero suffers from poor readability and language mixing. DeepSeek-R1 addresses these issues with 'cold-start data' (a small amount of curated data used for fine-tuning before RL) and 'multi-stage training'.
- Original : Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL

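Terminology
- SFT : supervised fine-tuning
- RL : reinforcement learning
- cold-start data : a small amount of curated data (thousands of samples) used to fine-tune the base model before RL begins
- multi-stage training : a multi-stage training pipeline (2 RL stages + 2 SFT stages)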
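Benchmarks used
- AIME 2024 : American Invitational Mathematics Examination
  - Domain : advanced mathematical reasoning
  - Metric : accuracy
- Codeforces : Codeforces Programming Contest Platform
  - Domain : algorithmic problem solving
  - Details : a solution must satisfy the input/output format, the memory limit, and the time limit.
  - Metric : pass/fail per problem (reported in the paper as an Elo rating and a percentile)
- GPQA Diamond : Graduate-Level Google-Proof Q&A, Diamond subset
  - Domain : graduate-level science (biology, physics, chemistry)
  - Metric : accuracy
- MATH-500 : a 500-problem benchmark drawn from the MATH dataset by Hendrycks et al.
  - Domain : mathematics from secondary-school to university level
  - Metric : pass@1 (probability of answering correctly in a single attempt)
- MMLU : Massive Multitask Language Understanding
  - Domain : broad general knowledge (history, law, engineering, etc.)
  - Metric : accuracy
- SWE-bench Verified : Software Engineering Benchmark, human-validated subset
  - Domain : software maintenance, bug fixing, etc.
  - Metric : share of issues resolved, judged by test cases (the task set itself was validated by humans)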

Introduction

Prior work tried various approaches, including process-based reward models, reinforcement learning, and search algorithms such as Monte Carlo Tree Search and Beam Search, but none reached general reasoning performance comparable to OpenAI’s o1 series models.
- Original : Several prior works have explored various approaches, including process-based reward models (Lightman et al., 2023; Uesato et al., 2022; Wang et al., 2023), reinforcement learning (Kumar et al., 2024), and search algorithms such as Monte Carlo Tree Search and Beam Search (Feng et al., 2024; Trinh et al., 2024; Xin et al., 2024). However, none of these methods has achieved general reasoning performance comparable to OpenAI’s o1 series models.
With GRPO as the RL framework, thousands of RL steps raised the pass@1 score on AIME 2024 from 15.6% to 71.0%, and majority voting raised it further to 86.7%.
- Original
  - we use DeepSeek-V3-Base as the base model and employ GRPO (Shao et al., 2024) as the RL framework to improve model performance in reasoning.
  - After thousands of RL steps, DeepSeek-R1-Zero exhibits super performance on reasoning benchmarks. For instance, the pass@1 score on AIME 2024 increases from 15.6% to 71.0%, and with majority voting, the score further improves to 86.7%
DeepSeek-R1-Zero trained this way suffers from language mixing and poor readability.
- Original : However, DeepSeek-R1-Zero encounters challenges such as poor readability, and language mixing
DeepSeek-R1 is designed to fix DeepSeek-R1-Zero's readability and language-mixing problems and to further improve reasoning performance. It goes through a multi-stage training process: SFT on cold-start data, reasoning-oriented RL, SFT on rejection-sampled data, and an additional round of RL. The resulting model reaches performance on par with OpenAI-o1-1217 (the base model is DeepSeek-V3-Base).
- Training process : cold-start SFT → reasoning-oriented RL (improves the model and generates SFT data) → SFT → RL fine-tuning over prompts from all scenarios
- Original
  - we perform reasoning-oriented RL like DeepSeek-R1-Zero. Upon nearing convergence in the RL process, we create new SFT data through rejection sampling on the RL checkpoint, combined with supervised data from DeepSeek-V3
  - After fine-tuning with the new data, the checkpoint undergoes an additional RL process, taking into account prompts from all scenarios
Distilling directly from DeepSeek-R1 into smaller models such as Qwen2.5-32B gives better reasoning performance than applying RL to the small model itself, which shows how important the reasoning patterns discovered by the larger model are.
- Original
  - We further explore distillation from DeepSeek-R1 to smaller dense models. Using Qwen2.5-32B (Qwen, 2024b) as the base model, direct distillation from DeepSeek-R1 outperforms applying RL on it
  - Notably, our distilled 14B model outperforms state-of-the-art open-source

Contributions

Post-Training: Large-Scale Reinforcement Learning on the Base Model

DeepSeek-R1-Zero applies RL directly to the base model without SFT, letting the model explore CoT (chain of thought) to solve complex problems. The resulting model is capable of self-verification, reflection, and generating long CoTs.
- Original : We directly apply RL to the base model without relying on supervised fine-tuning (SFT) as a preliminary step. This approach allows the model to explore chain-of-thought (CoT) for solving complex problems, resulting in the development of DeepSeek-R1-Zero. DeepSeek-R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community
DeepSeek-R1 improves reasoning ability and alignment with human preferences through a pipeline with two RL stages and two SFT stages.
- Actual training order : cold-start SFT (1st SFT) → reasoning-oriented RL (1st RL) → SFT on rejection-sampled reasoning data plus non-reasoning data such as writing, factual QA, and self-cognition (2nd SFT) → RL for all scenarios, including human-preference rewards (2nd RL)
- Original : We introduce our pipeline to develop DeepSeek-R1. The pipeline incorporates two RL stages aimed at discovering improved reasoning patterns and aligning with human preferences, as well as two SFT stages that serve as the seed for the model’s reasoning and non-reasoning capabilities.

Distillation: Smaller Models Can Be Powerful Too

Several small dense models (based on Qwen2.5 and Llama 3) were fine-tuned on reasoning data generated by DeepSeek-R1. These distilled models showed better reasoning performance than small models trained with RL alone.
- Original : Using the reasoning data generated by DeepSeek-R1, we fine-tuned several dense models that are widely used in the research community. The evaluation results demonstrate that the distilled smaller dense models perform exceptionally well on benchmarks
On major benchmarks such as AIME 2024, MATH-500, and LiveCodeBench, the distilled models outperform existing open-source models.
- Original : DeepSeek-R1-Distill-Qwen-7B achieves 55.5% on AIME 2024, surpassing QwQ-32B-Preview. Additionally, DeepSeek-R1-Distill-Qwen-32B scores 72.6% on AIME 2024, 94.3% on MATH-500, and 57.2% on LiveCodeBench

Summary of Evaluation Results

Category	Benchmark	Performance
Reasoning	AIME 2024	79.8% Pass@1
Reasoning	MATH-500	97.3%
Coding	Codeforces	2,029 Elo (outperforms 96.3% of human participants)
Engineering	Engineering tasks	Slightly better than DeepSeek-V3
Knowledge	MMLU	90.8%
Knowledge	MMLU-Pro	84.0%
Knowledge	GPQA Diamond	71.5%
Knowledge	SimpleQA	Better than DeepSeek-V3
General QA / Creative	AlpacaEval 2.0	87.6% win-rate
General QA / Creative	ArenaHard	92.3% win-rate
Long Context	Long-context tasks	Substantially better than DeepSeek-V3

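Terminology
- GRPO : Group Relative Policy Optimization. For each question it samples a group of responses, scores them, and uses each response's reward normalized by the group's mean and standard deviation as its advantage. Because the baseline comes from the group itself, no separate value function (critic) is needed.
- majority voting : an inference-time strategy that samples multiple responses to the same question and takes the most frequent final answer. It is not part of GRPO training.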

Approach

DeepSeek-R1-Zero: Reinforcement Learning on the Base Model

1. Reinforcement Learning Algorithm : Group Relative Policy Optimization

$$ \mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(O \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \left( \min\left( \frac{\pi_{\theta}(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)} A_i,\, \text{clip} \left( \frac{\pi_{\theta}(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}, 1 - \varepsilon, 1 + \varepsilon \right) A_i \right) - \beta \, \mathbb{D}_{\text{KL}}(\pi_{\theta} \| \pi_{\text{ref}}) \right) \right] $$

$$ \mathbb{D}_{\text{KL}}(\pi_{\theta} \| \pi_{\text{ref}}) = \frac{\pi_{\text{ref}}(o_i \mid q)}{\pi_{\theta}(o_i \mid q)} - \log\left( \frac{\pi_{\text{ref}}(o_i \mid q)}{\pi_{\theta}(o_i \mid q)} \right) - 1 $$

$$ A_i = \frac{r_i - \text{mean}(\{r_1, r_2, \cdots, r_G\})}{\text{std}(\{r_1, r_2, \cdots, r_G\})} $$

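Assumptions
- Assume a distribution over questions from which each question (q) is sampled
  
  $q$ = question
  $P(Q)$ = distribution over questions (the probability that q is drawn; a theoretical modeling device)
  $o_i$ = the $i$-th response (output) in a group of $G$ responses
  $\pi_{\theta}(O|q)$ = the policy's distribution over responses (the probability of producing o given q)
  Note : ref = reference (base) model / old = model before the update step / no subscript = model being updated
  $D_{KL}$ = modified KL divergence (a per-sample estimate that is never negative)
  $A_i$ = reward normalized within the group (the advantage)
Explanation of the formula
- min/clip : clipping the probability ratio to $[1-\varepsilon, 1+\varepsilon]$ keeps the objective from gaining by pushing the ratio far from 1. Because of the outer min, the bound is one-sided: whenever the unclipped term is smaller (for example, a large ratio with a negative $A_i$), it is used as-is, so there is no lower limit.
- $\beta$ and $D_{KL}$ keep the policy from drifting away from the reference model → the further the policy's probabilities move from the reference model's, the larger the penalty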
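The GRPO formula above can be traced with a few lines of plain Python. This is only a minimal sketch, not the paper's implementation: the function names are made up for illustration, `eps=0.2` and `beta=0.04` are placeholder values, and using the population standard deviation for `std` is an assumption, since the summary does not say which variant is used.

```python
from statistics import mean, pstdev


def group_advantages(rewards):
    """A_i = (r_i - mean(r)) / std(r), computed within one group of G outputs."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    if sigma == 0:
        # Every output got the same reward, so the group carries no signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def clipped_term(ratio, advantage, eps=0.2):
    """min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) for a single output."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def grpo_objective(ratios, rewards, kl_terms, eps=0.2, beta=0.04):
    """Average over the group of (clipped surrogate - beta * KL penalty)."""
    advantages = group_advantages(rewards)
    terms = [
        clipped_term(ratio, adv, eps) - beta * kl
        for ratio, adv, kl in zip(ratios, advantages, kl_terms)
    ]
    return sum(terms) / len(terms)


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]  # e.g. two correct and two wrong answers
    print(group_advantages(rewards))
    # Positive advantage: the gain is capped once the ratio exceeds 1 + eps.
    print(clipped_term(1.5, 1.0))
    # Negative advantage: the unclipped (more negative) term is kept.
    print(clipped_term(1.5, -1.0))
```

The two `clipped_term` calls show the one-sided bound: with a positive advantage the term stops growing at `1 + eps`, while with a negative advantage the larger penalty passes through unclipped.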
(Figure: example plot of the modified KL divergence)
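The shape of the modified KL term can also be checked numerically. The sketch below evaluates $\frac{\pi_{\text{ref}}}{\pi_{\theta}} - \log\frac{\pi_{\text{ref}}}{\pi_{\theta}} - 1$ for a single output; the function name and the probability values are illustrative assumptions, not numbers from the paper.

```python
import math


def kl_estimate(p_theta, p_ref):
    """Per-sample penalty: pi_ref/pi_theta - log(pi_ref/pi_theta) - 1."""
    x = p_ref / p_theta
    return x - math.log(x) - 1.0


if __name__ == "__main__":
    p_ref = 0.4  # hypothetical probability under the reference model
    for p_theta in (0.05, 0.2, 0.4, 0.8):
        penalty = kl_estimate(p_theta, p_ref)
        print(f"pi_theta={p_theta:.2f}  penalty={penalty:.4f}")
```

The penalty is zero when the two probabilities match and grows as they move apart in either direction, which is the behavior the plot is meant to show.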

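2. Reward Modeling
1) Accuracy rewards
The accuracy reward checks whether a response is correct using rules rather than a learned model. For math problems with deterministic results, the final answer must be given in a specified format so that it can be verified by rules. Similarly, for LeetCode problems, a compiler can be used to generate feedback based on predefined test cases.
- Original : enabling reliable rule-based verification of correctness. Similarly, for LeetCode problems, a compiler can be used to generate feedback based on predefined test cases.
2) Format rewards
A format reward enforces that the model puts its thinking process between '<think>' and '</think>' tags.
- Original : we employ a format reward model that enforces the model to put its thinking process between ‘<think>’ and ‘</think>’ tags.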
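The two rule-based rewards can be sketched as simple string checks. This is a hedged illustration only: the function names, the 0/1 reward values, and the choice of `\boxed{...}` as the "specified format" for math answers are assumptions made for the example, not details confirmed by the summary.

```python
import re

# Response must start with a <think>...</think> block, followed by the answer.
THINK_RE = re.compile(r"^\s*<think>(.+?)</think>(.*)$", re.DOTALL)
# Assumed answer format: the final answer appears inside \boxed{...}.
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")


def format_reward(response):
    """1.0 if the thinking process is wrapped in <think> tags, else 0.0."""
    return 1.0 if THINK_RE.match(response) else 0.0


def accuracy_reward(response, ground_truth):
    """1.0 if the last boxed answer after the thinking block matches exactly."""
    match = THINK_RE.match(response)
    answer_part = match.group(2) if match else response
    boxed = BOXED_RE.findall(answer_part)
    if not boxed:
        return 0.0
    return 1.0 if boxed[-1].strip() == ground_truth.strip() else 0.0


if __name__ == "__main__":
    response = "<think>2 + 2 equals 4.</think> The answer is \\boxed{4}."
    print(format_reward(response), accuracy_reward(response, "4"))
```

Exact string matching is the simplest possible verifier; a real checker for math would likely need to normalize equivalent forms such as `0.5` and `1/2`.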

3. Training Template
The training template for DeepSeek-R1-Zero has the model produce its reasoning process first and the final answer afterward. The constraints are deliberately limited to this structural format so that no particular solution method or strategy is imposed and the model's natural behavior during RL can be observed.
- Original : We intentionally limit our constraints to this structural format, avoiding any content-specific biases (such as mandating reflective reasoning or promoting particular problem-solving strategies) to ensure that we can accurately observe the model’s natural progression during the RL process.

4. Performance, Self-evolution Process and Aha Moment of DeepSeek-R1-Zero
1) Performance of DeepSeek-R1-Zero
On AIME 2024, the average pass@1 score rose sharply from 15.6% to 71.0%.
- Original : the average pass@1 score on AIME 2024 shows a significant increase, jumping from an initial 15.6% to an impressive 71.0%, reaching performance levels comparable to OpenAI-o1-0912.

Table 2 compares DeepSeek-R1-Zero with OpenAI’s o1-0912 models across several reasoning benchmarks. The results show how effective RL is.
- Original : Table 2 provides a comparative analysis between DeepSeek-R1-Zero and OpenAI’s o1-0912 models across a variety of reasoning-related benchmarks. The findings reveal that RL empowers

DeepSeek-R1-Zero reaches strong reasoning performance through RL alone, without any supervised fine-tuning data. With majority voting, its AIME score rises from 71.0% to 86.7%, exceeding OpenAI-o1-0912. This underscores the model's ability to learn and generalize through RL alone.
- Original : DeepSeek-R1-Zero to attain robust reasoning capabilities without the need for any supervised fine-tuning data. This is a noteworthy achievement, as it underscores the model’s ability to learn and generalize effectively through RL alone. Additionally, the performance of DeepSeek-R1-Zero can be further augmented through the application of majority voting. For example, when majority voting is employed on the AIME benchmark, DeepSeek-R1-Zero’s performance escalates from 71.0% to 86.7%, thereby exceeding the performance of OpenAI-o1-0912.
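Majority voting, as used for the 86.7% figure, amounts to sampling several responses for one question and keeping the most common final answer. A minimal sketch follows; the function name and the convention of using `None` for a response with no extractable answer are assumptions for the example.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among the sampled responses.

    Entries that are None (no answer could be extracted) are ignored.
    On a tie, the answer that was seen first wins.
    """
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    sampled = ["42", "17", "42", None, "42", "17"]
    print(majority_vote(sampled))
```

A question counts as solved under majority voting when the voted answer is correct, even if many individual samples are wrong, which is why the score can exceed pass@1.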
2) Self-evolution Process of DeepSeek-R1-Zero
The self-evolution process demonstrates how RL can drive a model to improve its reasoning capabilities on its own.
- Original : The self-evolution process of DeepSeek-R1-Zero is a fascinating demonstration of how RL can drive a model to improve its reasoning capabilities autonomously

These behaviors are not explicitly programmed. They emerge from the model's interaction with the RL environment and greatly improve DeepSeek-R1-Zero's reasoning ability, letting it solve harder tasks more efficiently and accurately.
- Original : These behaviors are not explicitly programmed but instead emerge as a result of the model’s interaction with the reinforcement learning environment. This spontaneous development significantly enhances DeepSeek-R1-Zero’s reasoning capabilities, enabling it to tackle more challenging tasks with greater efficiency and accuracy.
3) Aha Moment of DeepSeek-R1-Zero
Given only the right incentives, the model develops advanced problem-solving strategies by itself without explicit instruction. This shows the power and beauty of reinforcement learning.
- Original : This moment is not only an “aha moment” for the model but also for the researchers observing its behavior. It underscores the power and beauty of reinforcement learning: rather than explicitly teaching the model on how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies.

(Figure: example of an aha moment)

4) Drawback of DeepSeek-R1-Zero
DeepSeek-R1-Zero suffers from poor readability and language mixing. To make the reasoning process more readable, the authors explore DeepSeek-R1, which uses RL with human-friendly cold-start data.
- Original : DeepSeek-R1-Zero struggles with challenges like poor readability, and language mixing. To make reasoning processes more readable and share them with the open community, we explore DeepSeek-R1, a method that utilizes RL with human-friendly cold-start data.

DeepSeek-R1: Reinforcement Learning with Cold Start

1. Cold Start
Thousands of cold-start samples were collected to fine-tune DeepSeek-V3-Base as the starting point for RL. Compared with DeepSeek-R1-Zero, cold-start data brings advantages in two areas: readability and potential.
- Original : we collect thousands of cold-start data to fine-tune the DeepSeek-V3-Base as the starting point for RL. Compared to DeepSeek-R1-Zero, the advantages of cold start data include Readability, Potential.
A key limitation of DeepSeek-R1-Zero is that its output is often unsuitable for reading (language mixing, no markdown formatting to highlight the answer).
- Original : A key limitation of DeepSeek-R1-Zero is that its content is often not suitable for reading. Responses may mix multiple languages or lack markdown formatting to highlight answers for users.
The output format is defined as |special_token|<reasoning_process>|special_token|<summary>, where the reasoning process is the CoT for the query and the summary condenses the reasoning result.
- Original : Here, we define the output format as |special_token|<reasoning_process>|special_token|<summary>, where the reasoning process is the CoT for the query, and the summary is used to summarize the reasoning results.
On potential, the authors believe that iterative training (alternating RL and fine-tuning) is a better way to build reasoning models.
- Original : We believe the iterative training is a better way for reasoning models.

2. Reasoning-oriented Reinforcement Learning
To mitigate language mixing, a language consistency reward is added during RL training, calculated as the proportion of target-language words in the CoT.
- Original : To mitigate the issue of language mixing, we introduce a language consistency reward during RL training, which is calculated as the proportion of target language words in the CoT
The language consistency reward slightly degrades performance in ablations, but it matches human preferences and improves readability. It is summed directly with the reasoning accuracy reward to form the final reward.
- Original : Although ablation experiments show that such alignment results in a slight degradation in the model’s performance, this reward aligns with human preferences, making it more readable. Finally, we combine the accuracy of reasoning tasks and the reward for language consistency by directly summing them to form the final reward
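The language consistency reward and its combination with the accuracy reward can be sketched as follows. The word splitting rule and the `is_english` check (a word counts as English if it is pure ASCII) are crude assumptions chosen for the example; the summary only states that the reward is the proportion of target-language words in the CoT and that the two rewards are summed.

```python
import re

# Runs of letters in any script (digits and underscores are excluded).
WORD_RE = re.compile(r"[^\W\d_]+")


def language_consistency(cot, is_target_word):
    """Fraction of words in the chain of thought that are in the target language."""
    words = WORD_RE.findall(cot)
    if not words:
        return 0.0
    return sum(1 for w in words if is_target_word(w)) / len(words)


def is_english(word):
    """Hypothetical target-language check: treat pure-ASCII words as English."""
    return word.isascii()


def final_reward(accuracy, cot):
    """Final reward = accuracy reward + language consistency reward (direct sum)."""
    return accuracy + language_consistency(cot, is_english)


if __name__ == "__main__":
    mixed_cot = "首先 we compute x"
    print(language_consistency(mixed_cot, is_english))
    print(final_reward(1.0, "we add two numbers"))
```

A direct sum means a correct answer written in a mixed-language CoT still earns most of the reward, so the consistency term acts as a nudge toward readability and not as a hard filter.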

3. Rejection Sampling and Supervised Fine-Tuning
Once reasoning-oriented RL converges, the resulting checkpoint is used to collect SFT data for the next round. This data also targets other capabilities (writing, role-playing, etc.), and the model is fine-tuned further on it.
- Original : When reasoning-oriented RL converges, we utilize the resulting checkpoint to collect SFT (Supervised Fine-Tuning) data for the subsequent round. we generate the data and fine-tune the model as described below.
1) Reasoning data
Reasoning prompts are curated, and reasoning trajectories are generated by rejection sampling from the checkpoint obtained in the RL training above.
- Original : We curate reasoning prompts and generate reasoning trajectories by performing rejection sampling from the checkpoint from the above RL training.
The previous stage included only data that could be evaluated with rule-based rewards.
- Original : In the previous stage, we only included data that could be evaluated using rule-based rewards.
In this stage the dataset is expanded with additional data, some of which is judged by a generative reward model that feeds the ground truth and the model prediction into DeepSeek-V3.
- Original : We expand the dataset by incorporating additional data, some of which use a generative reward model by feeding the ground-truth and model predictions into DeepSeek-V3.
Chains of thought with mixed languages, long paragraphs, or code blocks are filtered out. Multiple responses are sampled for each prompt, and only the correct ones are kept.
- Original : We have filtered out chain-of-thought with mixed languages, long paragraphs, and code blocks. We sample multiple responses and retain only the correct ones.
In total, about 600k reasoning-related training samples were collected.
- Original : In total, we collect about 600k reasoning related training samples.
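The rejection sampling step described above (sample several responses, keep only correct ones, drop unreadable chains of thought) can be sketched as a filter loop. Everything here is illustrative: `generate` and `is_correct` are hypothetical callables standing in for the RL checkpoint and the reward check, the mixed-language test is a crude Chinese-plus-Latin heuristic, and the 1200-character paragraph limit is an arbitrary placeholder.

```python
import re

FENCE = "`" * 3  # marker that opens or closes a fenced code block
CJK_RE = re.compile(r"[\u4e00-\u9fff]")
LATIN_RE = re.compile(r"[A-Za-z]")


def is_readable(cot, max_paragraph_chars=1200):
    """Reject mixed-language text, overly long paragraphs, and code blocks."""
    mixed = bool(CJK_RE.search(cot)) and bool(LATIN_RE.search(cot))
    too_long = any(len(p) > max_paragraph_chars for p in cot.split("\n\n"))
    has_code = FENCE in cot
    return not (mixed or too_long or has_code)


def rejection_sample(prompt, generate, is_correct, n=4):
    """Sample n responses and keep those that are both correct and readable."""
    kept = []
    for _ in range(n):
        response = generate(prompt)
        if is_correct(response) and is_readable(response):
            kept.append({"prompt": prompt, "response": response})
    return kept


if __name__ == "__main__":
    canned = iter([
        "x = 2, so the answer is 4",
        "首先 compute x, answer is 4",
        "the answer is 5",
    ])
    samples = rejection_sample(
        "What is 2 + 2?",
        generate=lambda p: next(canned),
        is_correct=lambda r: "4" in r,
        n=3,
    )
    print(len(samples))
```

The kept prompt/response pairs would then serve as SFT examples, so the correctness check and readability filter together determine what the next model imitates.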
2) Non-Reasoning data
About 200k training samples unrelated to reasoning were collected. DeepSeek-V3-Base was then fine-tuned for two epochs on the combined dataset of about 800k samples.
- Original : we collected a total of approximately 200k training samples that are unrelated to reasoning. We fine-tune DeepSeek-V3-Base for two epochs using the above curated dataset of about 800k samples.

4. Reinforcement Learning for all Scenarios
A secondary RL stage is introduced to further align the model with human preferences, improving helpfulness and harmlessness while also refining reasoning ability.
- Original : To further align the model with human preferences, we implement a secondary reinforcement learning stage aimed at improving the model’s helpfulness and harmlessness while simultaneously refining its reasoning capabilities.
The model is trained with a combination of reward signals and diverse prompt distributions.
- Original : we train the model using a combination of reward signals and diverse prompt distributions.
Reasoning data uses the rule-based rewards of DeepSeek-R1-Zero, while general data uses reward models to capture complex human preferences.
- Original : For reasoning data, we adhere to the methodology outlined in DeepSeek-R1-Zero, which utilizes rule-based rewards to guide the learning process in math, code, and logical reasoning domains. For general data, we resort to reward models to capture human preferences in complex and nuanced scenarios.
Helpfulness is assessed on the final summary only, so that the assessment interferes as little as possible with the underlying reasoning process. Harmlessness is assessed on the entire response (reasoning plus summary) to identify and mitigate risks, biases, and harmful content.
- Original : For helpfulness, we focus exclusively on the final summary, ensuring that the assessment emphasizes the utility and relevance of the response to the user while minimizing interference with the underlying reasoning process. For harmlessness, we evaluate the entire response of the model, including both the reasoning process and the summary, to identify and mitigate any potential risks, biases, or harmful content that may arise during the generation process
Integrating reward signals with diverse data distributions makes it possible to train a model that excels at reasoning while remaining helpful and harmless.
- Original : the integration of reward signals and diverse data distributions enables us to train a model that excels in reasoning while prioritizing helpfulness and harmlessness.

Distillation: Empower Small Models with Reasoning Capability

Direct distillation significantly strengthens the reasoning ability of smaller models.
- Original : Our findings indicate that this straightforward distillation method significantly enhances the reasoning abilities of smaller models.
Adding an RL stage to the distilled models could boost performance substantially, but it was left out in order to isolate the effect of distillation itself.
- Original : For distilled models, we apply only SFT and do not include an RL stage, even though incorporating RL could substantially boost model performance.

Experiment

Benchmark

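Benchmark	Source	Description
MMLU	Hendrycks et al., 2020	Massive multitask language understanding benchmark
MMLU-Redux	Gema et al., 2024	Re-annotated subset of MMLU with labeling errors corrected
MMLU-Pro	Wang et al., 2024	Harder, more reasoning-focused version of MMLU
C-Eval	Huang et al., 2023	Chinese evaluation suite covering many subjects
CMMLU	Li et al., 2023	Chinese massive multitask language understanding benchmark
IFEval	Zhou et al., 2023	Instruction-following evaluation with verifiable instructions
FRAMES	Krishna et al., 2024	Benchmark for factuality, retrieval, and reasoning in document-grounded question answering
GPQA Diamond	Rein et al., 2023	Highest-quality subset of GPQA, a graduate-level science question-answering benchmark
SimpleQA	OpenAI, 2024c	Short fact-seeking question answering
C-SimpleQA	He et al., 2024	Chinese counterpart of SimpleQA
SWE-Bench Verified	OpenAI, 2024d	Human-validated subset of the SWE-bench software engineering benchmark
Aider	-	Code-editing benchmark associated with the Aider coding assistant
LiveCodeBench	Jain et al., 2024	Coding benchmark built from recently published contest problems
Codeforces	-	Benchmark based on the Codeforces competitive programming platform
Chinese National High School Mathematics Olympiad (CNMO 2024)	-	Math olympiad problems for Chinese high-school students
American Invitational Mathematics Examination 2024 (AIME 2024)	MAA, 2024	Math exam for US high-school students
AlpacaEval 2.0	Dubois et al., 2024	Open-ended generation benchmark judged by an LLM (GPT-4-Turbo-1106)
Arena-Hard	Li et al., 2024	Open-ended generation benchmark judged by an LLM (GPT-4-Turbo-1106)
MATH-500	(no source listed; mentioned in the distilled-model evaluation)	500 math problems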

Evaluation Setup

$$pass@1 = \frac{1}{k} \sum_{i=1}^{k} p_i$$

- $k$ : number of responses sampled per question (4 to 64)
- $p_{i}$ : correctness of the $i$-th response

The maximum generation length is set to 32,768 tokens. The authors found that evaluating long-output reasoning models with greedy decoding leads to higher repetition rates and significant variability across checkpoints.
- Original : We set the maximum generation length to 32,768 tokens for the models. We found that using greedy decoding to evaluate long-output reasoning models results in higher repetition rates and significant variability across different checkpoints.
Instead, $k$ responses per question (typically between 4 and 64, depending on the test set size) are generated with a sampling temperature of 0.6 and a top-p value of 0.95.
- Original : we use a sampling temperature of 0.6 and a top-𝑝 value of 0.95 to generate 𝑘 responses (typically between 4 and 64, depending on the test set size) for each question.
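The pass@1 estimate defined above is just the mean correctness over the $k$ sampled responses. A minimal sketch follows; the function names are made up, and averaging the per-question values to get a benchmark-level score is an assumption about how the numbers are aggregated.

```python
def pass_at_1(correct_flags):
    """pass@1 = (1/k) * sum(p_i), where p_i is 1 for a correct response, else 0."""
    if not correct_flags:
        raise ValueError("need at least one sampled response")
    return sum(1 for flag in correct_flags if flag) / len(correct_flags)


def benchmark_pass_at_1(per_question_flags):
    """Assumed aggregation: average the per-question pass@1 over all questions."""
    if not per_question_flags:
        raise ValueError("need at least one question")
    scores = [pass_at_1(flags) for flags in per_question_flags]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # One question, k = 4 sampled responses, three of them correct.
    print(pass_at_1([True, False, True, True]))
    # Two questions with k = 2 each.
    print(benchmark_pass_at_1([[True, False], [False, False]]))
```

Averaging over several sampled responses gives a steadier estimate than a single greedy decode, which matches the motivation given for sampling at temperature 0.6.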

DeepSeek-R1 Evaluation

On education-oriented knowledge benchmarks (MMLU, MMLU-Pro, GPQA Diamond, etc.), DeepSeek-R1 outperforms DeepSeek-V3, mainly because large-scale RL improves accuracy on STEM-related questions.
- Original : For education-oriented knowledge benchmarks such as MMLU, MMLU-Pro, and GPQA Diamond, DeepSeek-R1 demonstrates superior performance compared to DeepSeek-V3. This improvement is primarily attributed to enhanced accuracy in STEM-related questions, where significant gains are achieved through large-scale reinforcement learning.

Distilled Model Evaluation

Simply distilling DeepSeek-R1's outputs lets the efficient DeepSeek-R1-7B (i.e., DeepSeek-R1-Distill-Qwen-7B; the other distilled models are abbreviated the same way below) outperform non-reasoning models such as GPT-4o-0513 across the board.
- Original : Simply distilling DeepSeek-R1’s outputs enables the efficient DeepSeek-R1-7B (i.e., DeepSeek-R1-Distill-Qwen-7B, abbreviated similarly below) to outperform non-reasoning models like GPT-4o-0513 across the board.

Discussion

Distillation v.s. Reinforcement Learning

First, distilling a more powerful model into a smaller one gives excellent results, whereas a smaller model trained with the large-scale RL described in this paper requires enormous computing power and may still fall short of distillation.
- Original : First, distilling more powerful models into smaller ones yields excellent results, whereas smaller models relying on the large-scale RL mentioned in this paper require enormous computational power and may not even achieve the performance of distillation.
Second, distillation is economical and effective, but pushing beyond the current boundaries of intelligence may still require more powerful base models and larger-scale RL.
- Original : Second, while distillation strategies are both economical and effective, advancing beyond the boundaries of intelligence may still require more powerful base models and larger scale reinforcement learning.

Unsuccessful Attempts

A PRM (process reward model) is useful for reranking the model's top-N responses or assisting guided search, but its benefit is limited compared with the extra computational overhead it adds during large-scale RL.
- Original : while PRM demonstrates a good ability to rerank the top-N responses generated by the model or assist in guided search (Snell et al., 2024), its advantages are limited compared to the additional computational overhead it introduces during the large-scale reinforcement learning process in our experiments.
MCTS can improve inference-time performance when paired with a pre-trained value model, but iteratively boosting model performance through self-search remains a significant challenge.
- Original : while MCTS can improve performance during inference when paired with a pre-trained value model, iteratively boosting model performance through self-search remains a significant challenge.

Conclusion, Limitations, and Future Work

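DeepSeek-R1 is optimized for Chinese and English, so it can mix languages when handling queries in other languages. It is sensitive to prompts, and zero-shot prompting is recommended over few-shot prompting. Because long evaluation times hurt the efficiency of RL, large-scale RL has not been applied extensively to software engineering tasks, so DeepSeek-R1 does not yet show a large improvement over DeepSeek-V3 there. The authors plan to address these points in future versions.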
