11 min read

1. Original paper link

2. Related code

3. Paper review

Abstract

  • Existing machine translation models rely on complex recurrent or convolutional neural networks, whereas the Transformer proposes a simple network architecture built solely on attention mechanisms
  • On the WMT 2014 English-to-German translation task it achieves 28.4 BLEU, more than 2 BLEU above the previous best results (including ensembles); on English-to-French it reaches a new single-model state of the art of 41.8 BLEU after only 3.5 days of training
    • Original: Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature

Introduction

  • Existing sequence models are built on recurrent neural networks; the Transformer dispenses with recurrence and relies entirely on an attention mechanism to connect input and output, training faster and reaching better translation quality than previous approaches
    • Original: In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.

Background

  • Several approaches based on convolutional neural networks were proposed to reduce sequential computation, but they struggle to learn dependencies between distant positions. The Transformer is the first model to compute dependencies between input and output using self-attention alone.
    • Original: In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2.

Model Architecture

[Figure 1: The Transformer model architecture (encoder on the left, decoder on the right)]

  • Neural sequence transduction models use an encoder-decoder structure that maps an input sequence to an output sequence, generating one symbol at a time auto-regressively. The Transformer follows this overall architecture, using stacked self-attention and point-wise fully connected layers for both the encoder and decoder.
  • Original: Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35]. Here, the encoder maps an input sequence of symbol representations (x1,…,xn) to a sequence of continuous representations z = (z1,…,zn). Given z, the decoder then generates an output sequence (y1,…,ym) of symbols one element at a time. At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next. The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.

1) Encoder and Decoder Stacks

  • encoder: a stack of N = 6 identical layers. Each layer has two sub-layers: a multi-head self-attention mechanism and a position-wise feed-forward network. A residual connection is applied around each sub-layer, followed by layer normalization, and all layer outputs have dimension d_model = 512 (a minimal sketch of this residual pattern follows this list).
    • Original: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, positionwise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension dmodel = 512.
  • decoder: also a stack of N = 6 identical layers. In addition to the two encoder sub-layers, each decoder layer inserts a third sub-layer that performs multi-head attention over the output of the encoder stack. Residual connections and layer normalization are applied in the same way, and the self-attention sub-layer is masked so that a position cannot attend to subsequent positions.
    • Original: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
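The LayerNorm(x + Sublayer(x)) pattern above is compact enough to sketch directly. Below is a minimal NumPy illustration, not the paper's implementation: the learnable layer-norm gain and bias, and the dropout applied before the addition, are omitted.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the feature dimension; learnable gain/bias omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_sublayer(x, sublayer):
    # Output of each sub-layer: LayerNorm(x + Sublayer(x)).
    return layer_norm(x + sublayer(x))

# e.g. out = residual_sublayer(x, self_attention) or residual_sublayer(x, feed_forward),
# where self_attention / feed_forward stand for hypothetical sub-layer functions.
```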

2) Attention

  • An attention function maps a query and a set of key-value pairs to an output vector, computed as a weighted sum of the values. Multi-head attention runs several attention layers in parallel.

[Figure 2: (left) Scaled Dot-Product Attention, (right) Multi-Head Attention consisting of several attention layers running in parallel]
2-1) Scaled Dot-Product Attention
  • Dot products: matrix product of the queries and keys, measuring the similarity between each query and each key
  • Scale: each dot product is divided by $\sqrt{d_{k}}$, the square root of the key dimension
    • Without this scaling, large dot products push the softmax into regions with extremely small gradients
  • Softmax: turns the scaled scores into weights indicating how relevant each value vector is to the current query
  • Weighted sum: each value vector in V is multiplied by its corresponding softmax weight
  • Output: the weighted values are summed to give the final attention output (see the sketch below) \(Attention(Q,K,V) = softmax(\frac{QK^T}{\sqrt{d_k}})V\)
  • Original: We compute the dot products of the query with all keys, divide each by $\sqrt{d_k}$, and apply a softmax function to obtain the weights on the values. In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q. The keys and values are also packed together into matrices K and V.
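A minimal NumPy sketch of this formula (not the paper's reference code; the optional mask argument anticipates the decoder masking described in 2-3):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    # Q: (..., n_q, d_k), K: (..., n_k, d_k), V: (..., n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)   # query-key similarities, scaled
    if mask is not None:
        scores = np.where(mask, scores, -1e9)        # block disallowed positions (see 2-3)
    scores = scores - scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # weighted sum of the value vectors
```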
2-2) Multi-Head Attention
$$MultiHead(Q,K,V) = Concat(head_1,...,head_h)W^O$$
$$where \ head_i = Attention(QW_{i}^{Q}, KW_{i}^{K},VW_{i}^{V})$$
  • Q (query), K (key) and V (value) are produced by multiplying the input by different learned projection matrices for each head
  • Each head runs scaled dot-product attention; the head outputs are concatenated and projected once more (see the sketch below)
  • Original: we found it beneficial to linearly project the queries, keys and values h times with different, learned linear projections to dk, dk and dv dimensions, respectively. On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding dv-dimensional output values. These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2.
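A rough sketch of multi-head attention that reuses the scaled_dot_product_attention function above; the randomly initialized matrices stand in for the learned projections $W_i^Q, W_i^K, W_i^V, W^O$, and the defaults assume the paper's h = 8, d_model = 512.

```python
import numpy as np

rng = np.random.default_rng(0)

def multi_head_attention(Q, K, V, h=8, d_model=512):
    # Each head attends in a (d_model / h)-dimensional subspace.
    d_k = d_v = d_model // h
    heads = []
    for _ in range(h):
        W_q = rng.normal(0.0, d_model ** -0.5, (d_model, d_k))  # stand-in for learned W_i^Q
        W_k = rng.normal(0.0, d_model ** -0.5, (d_model, d_k))  # stand-in for learned W_i^K
        W_v = rng.normal(0.0, d_model ** -0.5, (d_model, d_v))  # stand-in for learned W_i^V
        heads.append(scaled_dot_product_attention(Q @ W_q, K @ W_k, V @ W_v))
    W_o = rng.normal(0.0, d_model ** -0.5, (h * d_v, d_model))  # stand-in for learned W^O
    return np.concatenate(heads, axis=-1) @ W_o  # Concat(head_1, ..., head_h) W^O
```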
2-3) Applications of Attention in our Model
  • The Transformer uses multi-head attention in three ways:
    • Encoder-decoder attention: the queries come from the previous decoder layer and the keys and values from the encoder output, so every position in the decoder can attend over all positions in the input sequence.
    • Self-attention: both the encoder and the decoder use self-attention layers in which the queries, keys and values all come from the output of the previous layer, so each position can attend to all positions of that layer.
    • The decoder's self-attention must not use future information; illegal connections are blocked with masking (a small example follows this list).
  • Original
    • The Transformer uses multi-head attention in three different ways
      • In “encoder-decoder attention” layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as [38, 2, 9].
      • The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.
      • Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections. See Figure 2.
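To make the masking concrete: it can be expressed as a lower-triangular boolean matrix passed to the scaled_dot_product_attention sketch above, so the disallowed scores become a large negative number and their softmax weight is effectively zero.

```python
import numpy as np

n = 5  # target-sequence length
# True where attention is allowed: position i may only attend to positions <= i.
causal_mask = np.tril(np.ones((n, n), dtype=bool))
```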

3) Position-wise Feed-Forward Networks

$$ FFN(x) = max(0,xW_1+b_1)W_2+b_2 $$
  • Two linear transformations with a ReLU activation in between (a sketch follows this list)
  • The input and output are 512-dimensional; the inner layer is 2048-dimensional
  • Original
    • This consists of two linear transformations with a ReLU activation in between.
    • The dimensionality of input and output is d_model = 512, and the inner-layer has dimensionality d_ff = 2048.
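A one-line NumPy sketch of this network; the weight and bias arguments are stand-ins for learned parameters with the shapes given in the paper.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied identically at every position.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

# Expected shapes: W1 (512, 2048), b1 (2048,), W2 (2048, 512), b2 (512,)
```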

4) Embeddings and Softmax


  • Embedding layers convert the input and output tokens into d_model-dimensional vectors
  • A learned linear transformation and a softmax convert the decoder output into next-token probabilities
  • The two embedding layers and the pre-softmax linear transformation share the same weight matrix (a minimal sketch follows this list)
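A minimal sketch of the shared-weight idea, assuming NumPy; E stands in for the learned embedding matrix, the vocabulary size is only illustrative, and the multiplication by sqrt(d_model) follows the paper.

```python
import numpy as np

d_model, vocab_size = 512, 32000  # vocab_size is illustrative
rng = np.random.default_rng(0)
E = rng.normal(0.0, d_model ** -0.5, (vocab_size, d_model))  # shared weight matrix

def embed(token_ids):
    # Embedding lookup; the paper multiplies the embedding weights by sqrt(d_model).
    return E[token_ids] * np.sqrt(d_model)

def output_logits(decoder_output):
    # The pre-softmax linear transformation reuses the same matrix, transposed.
    return decoder_output @ E.T
```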

5) Positional Encoding

$$ PE_{(pos,2i)} = \sin(pos/10000^{2i/d_{model}}) $$
$$ PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d_{model}}) $$
  • Because the model contains no recurrence and no convolution, positional encodings are needed to inject information about token order
  • pos is the position and i is the dimension; each dimension corresponds to a sinusoid
  • The sinusoidal version was chosen over learned positional embeddings because it may allow the model to extrapolate to sequence lengths longer than those seen during training (a sketch follows the quotes below)

  • Original
    • Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence
    • We also experimented with using learned positional embeddings [9] instead, and found that the two versions produced nearly identical results (see Table 3 row (E)). We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.
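A short NumPy sketch of these two formulas; the returned matrix is added to the token embeddings.

```python
import numpy as np

def positional_encoding(max_len, d_model=512):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]  # (1, d_model / 2): the even indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```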

Why Self-Attention

  • Three advantages of self-attention layers are presented:
    • First, lower computational complexity per layer
      • self-attention layer: $O(n^2 \cdot d)$ per layer
        • faster than a recurrent layer when the sequence length n is smaller than the representation dimensionality d (a rough numeric check follows this list)
        • for very long sequences, self-attention can be restricted to a neighborhood of size r around each output position
      • recurrent layer: $O(n \cdot d^2)$ per layer, with $O(n)$ sequential operations
    • Second, more of the computation can be parallelized (self-attention needs only $O(1)$ sequential operations)
    • Third, every position is connected directly to every other, so the paths between long-range dependencies are short, which makes them easier to learn
  • Original
    • One is the total computational complexity per layer. Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required. The third is the path length between long-range dependencies in the network. Learning long-range dependencies is a key challenge in many sequence transduction tasks. One key factor affecting the ability to learn such dependencies is the length of the paths forward and backward signals have to traverse in the network. The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies [12]. Hence we also compare the maximum path length between any two input and output positions in networks composed of the different layer types.
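A rough numeric check of the "faster when n < d" point above, using the per-layer operation counts from Table 1 of the paper with constant factors ignored; n = 50 is just an illustrative sentence length.

```python
n, d = 50, 512                   # illustrative sequence length vs. model dimension
self_attention_ops = n * n * d   # n^2 * d  = 1,280,000
recurrent_ops = n * d * d        # n * d^2  = 13,107,200
print(self_attention_ops < recurrent_ops)  # True: self-attention is cheaper when n < d
```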

Training

1) Training Data and Batching

  • Trained on the WMT 2014 English-German dataset of about 4.5 million sentence pairs and the much larger WMT 2014 English-French dataset of 36M sentences
  • Original
    • We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs
    • For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary

2) Hardware and Schedule

  • Trained on one machine with 8 NVIDIA P100 GPUs.
  • base model: 0.4 seconds per step, 100,000 steps ≈ 12 hours
  • big model: 1.0 seconds per step, 300,000 steps ≈ 3.5 days
  • Original: We trained our models on one machine with 8 NVIDIA P100 GPUs

3) Optimizer

  • The Adam optimizer is used with $β_1$ = 0.9, $β_2$ = 0.98, ϵ = $10^{−9}$, together with a warm-up learning-rate schedule (a sketch follows the quotes below)
  • Original
    • We used the Adam optimizer with β1 = 0.9, β2 = 0.98 and ϵ = 10−9. We varied the learning rate over the course of training, according to the formula: \(lrate = d^{-0.5}_{model} \cdot min(step\_num^{-0.5}, step\_num\cdot warmup\_steps^{-1.5})\)
    • This corresponds to increasing the learning rate linearly for the first warmup_steps training steps, and decreasing it thereafter proportionally to the inverse square root of the step number. We used warmup_steps = 4000.
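The schedule transcribes directly into a small Python function (step_num is assumed to start at 1):

```python
def transformer_lrate(step_num, d_model=512, warmup_steps=4000):
    # lrate = d_model^(-0.5) * min(step_num^(-0.5), step_num * warmup_steps^(-1.5))
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)

# Linear warm-up for the first 4000 steps, then inverse-square-root decay;
# the peak value is transformer_lrate(4000) ≈ 7.0e-4.
```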

4) Regularization

  • Achieves higher BLEU scores than previous models even with less training

  • Three types of regularization are applied during training:

    • Residual dropout ($P_{drop} = 0.1$)
      • Dropout is applied to the output of each sub-layer, before it is added to the sub-layer input and normalized
      • Dropout is also applied to the sums of the embeddings and the positional encodings, in both the encoder and decoder stacks
    • Label smoothing: $ϵ_{ls} = 0.1$ (a minimal sketch follows this list)
    • Original
      • Residual Dropout We apply dropout [33] to the output of each sub-layer, before it is added to the sub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of $P_{drop}=0.1$.
      • Label Smoothing During training, we employed label smoothing of value $ϵ_{ls} = 0.1$
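A minimal sketch of label smoothing with ϵ_ls = 0.1, assuming NumPy and one common formulation (the true class keeps 1 − ϵ of the probability mass and the rest is spread over the other classes); the paper does not spell out the exact distribution it used.

```python
import numpy as np

def smooth_labels(target_ids, vocab_size, eps=0.1):
    # One-hot targets are softened: the true class gets 1 - eps,
    # the remaining eps is spread uniformly over the other classes.
    smoothed = np.full((len(target_ids), vocab_size), eps / (vocab_size - 1))
    smoothed[np.arange(len(target_ids)), target_ids] = 1.0 - eps
    return smoothed
```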

Results

1) Machine Translation

  • Achieves state-of-the-art results on English-to-German (28.4 BLEU) and English-to-French (41.0 BLEU) translation
  • Original
    • On the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big) in Table 2) outperforms the best previously reported models (including ensembles) by more than 2.0 BLEU, establishing a new state-of-the-art BLEU score of 28.4.
    • On the WMT 2014 English-to-French translation task, our big model achieves a BLEU score of 41.0, outperforming all of the previously published single models, at less than 1/4 the training cost of the previous state-of-the-art model. The Transformer (big) model trained for English-to-French used dropout rate $p_{drop} = 0.1$, instead of 0.3.

2) Model Variations

  • Several components ($d_{model}, \, d_{ff}, \, h, \, d_k, \, d_v, \, P_{drop}, \, ϵ_{ls}$) are varied and tested
    • In rows (A), the number of attention heads and the key/value dimensions are varied while keeping the total computation constant; single-head attention is 0.9 BLEU worse than the best setting, and quality also drops with too many heads
    • In rows (B), reducing the attention key size $d_k$ hurts model quality
    • In rows (C) and (D), bigger models are better and dropout is very helpful in avoiding over-fitting [Table 3: variations on the Transformer architecture]
  • Original
    • In Table 3 rows (A), we vary the number of attention heads and the attention key and value dimensions, keeping the amount of computation constant, as described in Section 3.2.2. While single-head attention is 0.9 BLEU worse than the best setting, quality also drops off with too many heads.
    • In Table 3 rows (B), we observe that reducing the attention key size $d_k$ hurts model quality. This suggests that determining compatibility is not easy and that a more sophisticated compatibility function than dot product may be beneficial. We further observe in rows (C) and (D) that, as expected, bigger models are better, and dropout is very helpful in avoiding over-fitting. In row (E) we replace our sinusoidal positional encoding with learned positional embeddings [9], and observe nearly identical results to the base model.

3) English Constituency Parsing

  • Shows surprisingly strong performance on English constituency parsing despite no task-specific tuning [Table 4: English constituency parsing results]

  • Original: Our results in Table 4 show that despite the lack of task-specific tuning our model performs surprisingly well, yielding better results than all previously reported models with the exception of the Recurrent Neural Network Grammar [8]

Conclusion

  • Given these strong results, the authors plan to extend attention-based models to other tasks and to input and output modalities beyond text
  • Original: We are excited about the future of attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video. Making generation less sequential is another research goal of ours. The code we used to train and evaluate our models is available at https://github.com/tensorflow/tensor2tensor
