BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Paper)

Jacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova Google AI Language

1. Introduction

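Traditional NLP models were built by training a task-specific model from its initial state. This approach has several problems, but the biggest is probably that a large amount of training data must be secured for each purpose. So a lot of recent research, such as ELMo, builds a pretrained model from the varied language data around us in an unsupervised way (for example, given words/characters 1 to N, predict N+1) and supplies its embeddings when a task-specific model is built, aiming for higher performance with less data. I understand BERT to be research in the same vein, but it differs from ELMo in claiming that state-of-the-art performance can be reached on all kinds of NLP tasks by adding only a single task-specific layer, without a separate architecture. The paper also compares itself with OpenAI GPT; as I understand it, the difference there is bidirectional versus unidirectional. I have not verified this myself yet, but if what the paper says is true, it is truly remarkable. A pretrained model for Korean was also released on GitHub recently (third week of November 2018), so I should apply it soon and check the performance difference against existing models.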

2. Related Work

(1) Embeddings from Language Models (link)

It is similar in that a pretrained model is built in an unsupervised way and then applied to a task-specific purpose.

(2) Attention Is All You Need (link)

An understanding of the Transformer architecture proposed in this paper is required; BERT uses its encoder part. The architecture is used almost unchanged, so the paper and its implementations are enough to understand it. (※ A site that explains it well with visualizations: [link])

3. Key Ideas

I will analyze the paper in three sections.

——————————————————————

(1) Use of the Transformer architecture
(2) Loss function design and pre-training
(3) BERT – Fine-Tuning & application to various NLP problems

——————————————————————-

(1) Transformer Architecture

The Transformer architecture can be understood as a repetition of the structure shown above. Unlike earlier architectures, it uses no CNN or RNN structure and relies on attention throughout, which makes it very memory-hungry. It consists of several components, which I will explain in the following order.

-Scaled Dot-Product Attention : the small unit that makes up Multi-Head Attention, so it needs to be understood first
-Multi-Head Attention : concatenates several Scaled Dot-Product Attention results to obtain a better attention
-Positional Encoding : a device that lets the same word be interpreted differently depending on its position
-Short Cut & Add/Norm : ResNet-style shortcuts that minimize the vanishing gradient problem
-Decoder Side : applies the encoder's attention to the decoder, through to the final prediction

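A. Scaled Dot-Product Attention

The paper's one-line formula, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, is really the whole explanation, but the definition of Q, K, and V is the slightly ambiguous part. Each of Q, K, and V is the input multiplied by its own learned weight matrix; visualized, it looks like the following.

If we insist on visualizing this simple formula, it looks like the following.

In terms of meaning: applying softmax to the Q·K product yields the attention, that is, a weight for each word, and V is the original input multiplied by W^V, so it still represents the vector of each input word. Z can therefore be read as the weighted vector that represents each word.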
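To make the formula concrete, here is a minimal pure-Python sketch of scaled dot-product attention. The helper names (`softmax`, `matmul`, `scaled_dot_product_attention`) and the toy input are my own assumptions for illustration, and the learned projections are replaced by the identity so that Q = K = V = X.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def scaled_dot_product_attention(q, k, v):
    """Return (Z, weights) for Z = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]   # K^T
    scores = matmul(q, k_t)                # one row of scores per query token
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Toy input: 3 tokens of width 2 (hypothetical values, identity projections).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z, weights = scaled_dot_product_attention(x, x, x)
```

Each row of `weights` sums to 1, and each row of `z` is the corresponding weighted average of the rows of V.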

B. Multi-Head Attention

As the formula shows, the concept is simple: the Scaled Dot-Product Attention described above is run several times independently (in parallel), and the results are concatenated. The paper says this spreads out attention that, with a single head, would still concentrate on a few particular words.

Borrowing the visualization for a further explanation: because there are now several Scaled Dot-Product Attentions, each head defines its own weight matrices for Q, K, and V.

As a result, several self-attention outputs Z are produced from the same X.

Putting it all together, the individual self-attention outputs Z (0 to N) are concatenated and multiplied by a separately defined weight matrix to produce the final Z.

Combining several attention results in this way is said to give an attention vector whose weights are distributed somewhat better, as shown above.
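The concatenate-then-project step can be sketched in a few lines of pure Python. This is only an illustration under my own assumptions: the function names, the two hand-picked heads of size 1, and the identity output matrix `w_o` are hypothetical and not taken from the paper or the BERT code.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention(q, k, v):
    """Single-head scaled dot-product attention."""
    scale = math.sqrt(len(k[0]))
    scores = matmul(q, [list(col) for col in zip(*k)])
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)

def multi_head_attention(x, heads, w_o):
    """heads is a list of (W_q, W_k, W_v) triples; w_o is the output projection."""
    outputs = []
    for w_q, w_k, w_v in heads:
        outputs.append(attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)))
    # Concatenate the per-head Z matrices along the feature axis.
    concat = [sum((z[i] for z in outputs), []) for i in range(len(x))]
    return matmul(concat, w_o)

# Toy setup: head 0 looks only at feature 0, head 1 only at feature 1.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
head_0 = ([[1.0], [0.0]],) * 3
head_1 = ([[0.0], [1.0]],) * 3
w_o = [[1.0, 0.0], [0.0, 1.0]]
z = multi_head_attention(x, [head_0, head_1], w_o)
```

With an identity `w_o`, column 0 of `z` is exactly the output of head 0 and column 1 the output of head 1, which makes the concatenation easy to inspect.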

Next, let's look at the architecture in more detail through the attention part of the BERT code released by Google.

[Code Review]

def attention_layer(from_tensor,
                    to_tensor,
                    attention_mask=None,
                    num_attention_heads=1,
                    size_per_head=512,
                    query_act=None,
                    key_act=None,
                    value_act=None,
                    attention_probs_dropout_prob=0.0,
                    initializer_range=0.02,
                    do_return_2d_tensor=False,
                    batch_size=None,
                    from_seq_length=None,
                    to_seq_length=None):


  def transpose_for_scores(input_tensor, batch_size, num_attention_heads,
                           seq_length, width):
    ## Reshape and transpose so that Query, Key, and Value can be multiplied per head
    output_tensor = tf.reshape(
        input_tensor, [batch_size, seq_length, num_attention_heads, width])

    output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3])
    return output_tensor

  ## Validate the tensor shapes
  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])

  if len(from_shape) != len(to_shape):
    raise ValueError(
        "The rank of `from_tensor` must match the rank of `to_tensor`.")

  if len(from_shape) == 3:
    batch_size = from_shape[0]
    from_seq_length = from_shape[1]
    to_seq_length = to_shape[1]
  elif len(from_shape) == 2:
    if (batch_size is None or from_seq_length is None or to_seq_length is None):
      raise ValueError(
          "When passing in rank 2 tensors to attention_layer, the values "
          "for `batch_size`, `from_seq_length`, and `to_seq_length` "
          "must all be specified.")

  ## Scaled Dot-Product Attention and Multi-Head Attention were explained
  ## separately above, but the implementation handles both together.
  ## The code below defines the Q, K, and V projections.
  # Scalar dimensions referenced here:
  #   B = batch size (number of sequences)
  #   F = `from_tensor` sequence length
  #   T = `to_tensor` sequence length
  #   N = `num_attention_heads`
  #   H = `size_per_head`

  from_tensor_2d = reshape_to_matrix(from_tensor)
  to_tensor_2d = reshape_to_matrix(to_tensor)

  # `query_layer` = [B*F, N*H]
  query_layer = tf.layers.dense(
      from_tensor_2d,
      num_attention_heads * size_per_head,
      activation=query_act,
      name="query",
      kernel_initializer=create_initializer(initializer_range))

  # `key_layer` = [B*T, N*H]
  key_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=key_act,
      name="key",
      kernel_initializer=create_initializer(initializer_range))

  # `value_layer` = [B*T, N*H]
  value_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=value_act,
      name="value",
      kernel_initializer=create_initializer(initializer_range))

  # `query_layer` = [B, N, F, H]
  query_layer = transpose_for_scores(query_layer, batch_size,
                                     num_attention_heads, from_seq_length,
                                     size_per_head)

  # `key_layer` = [B, N, T, H]
  key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
                                   to_seq_length, size_per_head)

  ## Compute the raw attention scores from Query and Key: [B, N, F, T]
  attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
  ## Scale the scores by 1 / sqrt(size_per_head)
  attention_scores = tf.multiply(attention_scores,
                                 1.0 / math.sqrt(float(size_per_head)))
  ## attention_mask marks which positions may be attended to (BERT builds it
  ## from input_mask, i.e. padding). It is not the [MASK] token of the masked LM.
  ## Masked positions get a large negative score, so their softmax weight is ~0.
  if attention_mask is not None:
    # `attention_mask` = [B, 1, F, T]
    attention_mask = tf.expand_dims(attention_mask, axis=[1])

    # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
    # masked positions, this operation will create a tensor which is 0.0 for
    # positions we want to attend and -10000.0 for masked positions.
    adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0

    # Since we are adding it to the raw scores before the softmax, this is
    # effectively the same as removing these entirely.
    attention_scores += adder

  # The per-head scores are ready; softmax turns them into probabilities: [B, N, F, T]
  attention_probs = tf.nn.softmax(attention_scores)

  # Apply dropout to the attention probabilities. Unusual, but taken from the Transformer paper.
  attention_probs = dropout(attention_probs, attention_probs_dropout_prob)

  ## Combine each head's attention probabilities with the Value computed above
  ## to build the context layer.
  # `value_layer` = [B, T, N, H]
  value_layer = tf.reshape(
      value_layer,
      [batch_size, to_seq_length, num_attention_heads, size_per_head])

  # `value_layer` = [B, N, T, H]
  value_layer = tf.transpose(value_layer, [0, 2, 1, 3])

  # `context_layer` = [B, N, F, H]
  context_layer = tf.matmul(attention_probs, value_layer)

  # `context_layer` = [B, F, N, H]
  context_layer = tf.transpose(context_layer, [0, 2, 1, 3])

  ## Every head now has its result. What remains is the output projection
  ## (the Z weight matrix) that yields the final Multi-Head Attention output.
  ## That projection is not part of this function.
  if do_return_2d_tensor:
    # `context_layer` = [B*F, N*H]
    context_layer = tf.reshape(
        context_layer,
        [batch_size * from_seq_length, num_attention_heads * size_per_head])
  else:
    # `context_layer` = [B, F, N*H]
    context_layer = tf.reshape(
        context_layer,
        [batch_size, from_seq_length, num_attention_heads * size_per_head])

  return context_layer
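The effect of the additive mask in the code above is easy to check on a single row of scores. The sketch below is a pure-Python illustration with hypothetical helper names and toy values; only the `(1.0 - mask) * -10000.0` trick itself comes from the code.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def masked_softmax(scores, mask):
    """mask[i] is 1 for positions to attend to and 0 for positions to ignore."""
    adder = [(1.0 - m) * -10000.0 for m in mask]
    return softmax([s + a for s, a in zip(scores, adder)])

# The third position has the highest raw score but is masked out.
probs = masked_softmax([2.0, 1.0, 3.0], [1, 1, 0])
```

Adding -10000 before the softmax pushes the masked position's probability to effectively zero while the remaining positions are renormalized among themselves.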

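C. Positional Encoding

In natural language, the same word needs to be treated differently depending on where it occurs. The proposed solution is to add a vector called the Positional Encoding to each embedding so that it carries the position information. In the BERT code below, this vector is a learned variable that is sliced to the sequence length and added to the input.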

[Code Review]

def embedding_postprocessor(input_tensor,
                            use_token_type=False,
                            token_type_ids=None,
                            token_type_vocab_size=16,
                            token_type_embedding_name="token_type_embeddings",
                            use_position_embeddings=True,
                            position_embedding_name="position_embeddings",
                            initializer_range=0.02,
                            max_position_embeddings=512,
                            dropout_prob=0.1):
  ## The input is the result of embedding tokens (char/word) into vectors
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  width = input_shape[2]
  ## The goal is input + positional embedding, so start from the input
  output = input_tensor
  
  ## Add the positional embedding
  if use_position_embeddings:
    ## Error check
    assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
    ## Define the positional embedding table up to the maximum length
    with tf.control_dependencies([assert_op]):
      full_position_embeddings = tf.get_variable(
          name=position_embedding_name,
          shape=[max_position_embeddings, width],
          initializer=create_initializer(initializer_range))

      ## Slice the first seq_length rows out of the full positional table
      position_embeddings = tf.slice(full_position_embeddings, [0, 0],
                                     [seq_length, -1])
      num_dims = len(output.shape.as_list())

      ## Position vectors do not depend on the batch, so reshape for broadcasting:
      ## [seq_length, width] ==> [1, seq_length, width]
      position_broadcast_shape = []
      for _ in range(num_dims - 2):
        position_broadcast_shape.append(1)
      position_broadcast_shape.extend([seq_length, width])
      position_embeddings = tf.reshape(position_embeddings,
                                       position_broadcast_shape)
      ## Add the positional vectors to the input embedding
      output += position_embeddings
  ## Apply layer norm and dropout, then return
  output = layer_norm_and_dropout(output, dropout_prob)
  return output
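The slice-and-broadcast logic above can be mimicked without TensorFlow. The following is a rough pure-Python sketch; the function name and the toy table are hypothetical, and the table here is fixed by hand rather than learned.

```python
def add_position_embeddings(batch, full_position_embeddings):
    """Add the first seq_length position vectors to every sequence in the batch."""
    seq_length = len(batch[0])
    if seq_length > len(full_position_embeddings):
        raise ValueError("sequence is longer than max_position_embeddings")
    position_embeddings = full_position_embeddings[:seq_length]
    return [
        [[t + p for t, p in zip(token, pos)]
         for token, pos in zip(seq, position_embeddings)]
        for seq in batch
    ]

# Toy data: batch of 2 sequences, 2 tokens each, width 2; table with 4 positions.
batch = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
table = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0], [1.5, 1.5]]
out = add_position_embeddings(batch, table)
```

The same token vector ends up different at position 0 and position 1, and every sequence in the batch receives the same position vectors.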

D. Residual

The Add & Normalize block simply adds the self-attention output to the layer's input (a residual shortcut) and applies layer normalization to the sum. In code it looks like this.

[Code Review = Add & Normalize]

  with tf.variable_scope("output"):
    attention_output = tf.layers.dense(
        attention_output,
        hidden_size,
        kernel_initializer=create_initializer(initializer_range))
    attention_output = dropout(attention_output, hidden_dropout_prob)
    attention_output = layer_norm(attention_output + layer_input)

The block labeled Feed Forward in the figure can be regarded as a plain fully connected layer.

[Code Review = Feed Forward]

# The activation is only applied to the "intermediate" hidden layer.
with tf.variable_scope("intermediate"):
  intermediate_output = tf.layers.dense(
      attention_output,
      intermediate_size,
      activation=intermediate_act_fn,
      kernel_initializer=create_initializer(initializer_range))

A residual shortcut is then added once more.

[Code Review = Add & Normalize]

# Down-project back to `hidden_size` then add the residual.
with tf.variable_scope("output"):
  layer_output = tf.layers.dense(
      intermediate_output,
      hidden_size,
      kernel_initializer=create_initializer(initializer_range))
  layer_output = dropout(layer_output, hidden_dropout_prob)
  layer_output = layer_norm(layer_output + attention_output)
  prev_output = layer_output
  all_layer_outputs.append(layer_output)

※ Layer Norm

As an aside, Layer Normalization is a variant of the familiar Batch Norm. Batch Norm normalizes each feature using statistics computed across the examples in a batch, whereas Layer Norm normalizes across the neurons of a layer within each individual example. Below is a comparison of how the error changes over training steps with BN and LN, that is, how quickly each trains. (Reference: https://arxiv.org/pdf/1607.06450.pdf )
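As a small illustration of the Add & Normalize step, here is a pure-Python sketch of a residual sum followed by layer normalization. It is a simplification under my own assumptions: the function names are hypothetical, and the learned gain and bias that a full layer norm normally carries are left out.

```python
import math

def layer_norm(vector, eps=1e-12):
    """Normalize one token's features to zero mean and unit variance."""
    mean = sum(vector) / len(vector)
    var = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(var + eps) for v in vector]

def add_and_norm(sublayer_output, layer_input):
    """Residual shortcut followed by layer normalization, token by token."""
    return [layer_norm([a + b for a, b in zip(out, inp)])
            for out, inp in zip(sublayer_output, layer_input)]

# One token with three features (toy values).
normed = add_and_norm([[1.0, 2.0, 3.0]], [[0.0, 0.0, 0.0]])
```

The statistics are taken over the features of a single token, so the result does not depend on what else is in the batch.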

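E. Decoder Part

With a 2-stack encoder, it could be visualized as above (BERT-Base in the paper stacks 12 layers). In Attention Is All You Need the decoder is also built from self-attention as shown above, but BERT takes a different form: the encoder output is used in two ways depending on the label. The first is predicting the masked tokens, and the second is predicting whether the next sentence really follows. Let's look at the code.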

[Code Review = Last Part of BertModel]

## The very last part of the BertModel class (where the outputs are produced).
self.all_encoder_layers = transformer_model(
    input_tensor=self.embedding_output,
    attention_mask=attention_mask,
    hidden_size=config.hidden_size,
    num_hidden_layers=config.num_hidden_layers,
    num_attention_heads=config.num_attention_heads,
    intermediate_size=config.intermediate_size,
    intermediate_act_fn=get_activation(config.hidden_act),
    hidden_dropout_prob=config.hidden_dropout_prob,
    attention_probs_dropout_prob=config.attention_probs_dropout_prob,
    initializer_range=config.initializer_range,
    do_return_all_layers=True)

self.sequence_output = self.all_encoder_layers[-1]
# The "pooler" converts the encoded sequence tensor of shape
# [batch_size, seq_length, hidden_size] to a tensor of shape
# [batch_size, hidden_size]. This is necessary for segment-level
# (or segment-pair-level) classification tasks where we need a fixed
# dimensional representation of the segment.
with tf.variable_scope("pooler"):
  # We "pool" the model by simply taking the hidden state corresponding
  # to the first token. We assume that this has been pre-trained
  first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
  self.pooled_output = tf.layers.dense(
      first_token_tensor,
      config.hidden_size,
      activation=tf.tanh,
      kernel_initializer=create_initializer(config.initializer_range))

transformer_model is the encoder part of the Transformer architecture explained above. The loss functions are built on sequence_output and pooled_output: sequence_output is the final encoder layer, and pooled_output is the first token of that same final layer passed once through a fully connected layer with tanh activation.
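The pooling step is simple enough to sketch without TensorFlow. The version below is a hypothetical pure-Python illustration: `pool_first_token` and the toy weights are my own names and values, with an identity weight matrix standing in for the learned dense layer.

```python
import math

def pool_first_token(sequence_output, weights, bias):
    """Map [batch, seq, hidden] to [batch, hidden] using the first token only."""
    pooled = []
    for sequence in sequence_output:
        first = sequence[0]
        pooled.append([
            math.tanh(sum(x * w for x, w in zip(first, column)) + b)
            for column, b in zip(zip(*weights), bias)
        ])
    return pooled

# One sequence of two tokens, hidden size 2; only the first token is used.
sequence_output = [[[0.5, -1.0], [9.0, 9.0]]]
identity = [[1.0, 0.0], [0.0, 1.0]]
pooled = pool_first_token(sequence_output, identity, [0.0, 0.0])
```

Whatever the second token contains, the pooled vector depends only on the first token, which is why that position has to summarize the whole input.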

(2) BERT Loss Function & Pre-Training

A. Loss Function

The loss function is a joint loss made of two parts, Masked LM and Next Sentence Prediction. The details of each are covered below.

[Code Review = Loss Function]

model = modeling.BertModel(
    config=bert_config,
    is_training=is_training,
    input_ids=input_ids,
    input_mask=input_mask,
    token_type_ids=segment_ids,
    use_one_hot_embeddings=use_one_hot_embeddings)

(masked_lm_loss,
 masked_lm_example_loss, masked_lm_log_probs) = get_masked_lm_output(
     bert_config, model.get_sequence_output(), model.get_embedding_table(),
     masked_lm_positions, masked_lm_ids, masked_lm_weights)

(next_sentence_loss, next_sentence_example_loss,
 next_sentence_log_probs) = get_next_sentence_output(
     bert_config, model.get_pooled_output(), next_sentence_labels)

total_loss = masked_lm_loss + next_sentence_loss

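B. Masked LM

Masked LM is the task commonly known as fill-in-the-blank, and it is usable because BERT is a bidirectional model. Of the tokens selected for prediction (15% of the input, per the pre-training setup quoted later), 80% are replaced with [MASK], 10% are replaced with a random other word, and 10% are left unchanged. By defining the labels this way (that is, by injecting noise on purpose), it seems intended to avoid the overfitting that a very deep model could suffer from. The paper itself gives the motivation as reducing the mismatch with fine-tuning, where [MASK] never appears.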
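The 80/10/10 rule can be sketched as follows. This is a simplified pure-Python illustration, not the data pipeline from the BERT repository: the function name, the toy vocabulary, and the way positions are sampled are all assumptions of mine, and special tokens such as [CLS] and [SEP] are not handled.

```python
import random

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (corrupted tokens, positions, labels) for masked LM training."""
    output = list(tokens)
    num_to_predict = max(1, int(round(len(tokens) * mask_prob)))
    positions = sorted(rng.sample(range(len(tokens)), num_to_predict))
    labels = [tokens[i] for i in positions]
    for i in positions:
        roll = rng.random()
        if roll < 0.8:
            output[i] = "[MASK]"            # 80%: replace with the mask token
        elif roll < 0.9:
            output[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return output, positions, labels

tokens = ["w%d" % i for i in range(20)]
corrupted, positions, labels = mask_tokens(tokens, ["x", "y", "z"], random.Random(0))
```

The labels always hold the original tokens at the selected positions, regardless of which of the three corruptions was applied, so the model cannot tell from the input alone whether a token was changed.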

Reading the get_masked_lm_output code, it uses the last layer of the Transformer encoder, but unlike an autoencoder it does not regenerate the whole sentence. Only the masked positions are used, as the first line, gather_indexes, shows. (The paper states this as well.)

Two layers are then defined: one hidden layer and an output layer of vocabulary size. These give, for each token, the probability of every word. The loss is the negative log-likelihood against the known answer (a one-hot vector), and because there are several masked positions, the final loss is the average of the loss over them.

[Code Review = Loss Function]

def get_masked_lm_output(bert_config, input_tensor, output_weights, positions,
                         label_ids, label_weights):
  """Get loss and log probs for the masked LM."""
  input_tensor = gather_indexes(input_tensor, positions)

  with tf.variable_scope("cls/predictions"):
    # We apply one more non-linear transformation before the output layer.
    # This matrix is not used after pre-training.
    with tf.variable_scope("transform"):
      input_tensor = tf.layers.dense(
          input_tensor,
          units=bert_config.hidden_size,
          activation=modeling.get_activation(bert_config.hidden_act),
          kernel_initializer=modeling.create_initializer(
              bert_config.initializer_range))
      input_tensor = modeling.layer_norm(input_tensor)

    # The output weights are the same as the input embeddings, but there is
    # an output-only bias for each token.
    output_bias = tf.get_variable(
        "output_bias",
        shape=[bert_config.vocab_size],
        initializer=tf.zeros_initializer())
    logits = tf.matmul(input_tensor, output_weights, transpose_b=True)
    logits = tf.nn.bias_add(logits, output_bias)
    log_probs = tf.nn.log_softmax(logits, axis=-1)

    label_ids = tf.reshape(label_ids, [-1])
    label_weights = tf.reshape(label_weights, [-1])

    one_hot_labels = tf.one_hot(
        label_ids, depth=bert_config.vocab_size, dtype=tf.float32)

    # The `positions` tensor might be zero-padded (if the sequence is too
    # short to have the maximum number of predictions). The `label_weights`
    # tensor has a value of 1.0 for every real prediction and 0.0 for the
    # padding predictions.
    per_example_loss = -tf.reduce_sum(log_probs * one_hot_labels, axis=[-1])
    numerator = tf.reduce_sum(label_weights * per_example_loss)
    denominator = tf.reduce_sum(label_weights) + 1e-5
    loss = numerator / denominator

  return (loss, per_example_loss, log_probs)
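The last four lines of the function compute a weighted mean in which padded prediction slots carry weight 0. A minimal pure-Python sketch with hypothetical toy values:

```python
def masked_lm_loss(per_example_loss, label_weights):
    """Average the loss over real predictions only (weight 1.0), ignoring padding (0.0)."""
    numerator = sum(w * l for w, l in zip(label_weights, per_example_loss))
    denominator = sum(label_weights) + 1e-5   # avoids division by zero
    return numerator / denominator

# Three prediction slots; the last one is padding and must not affect the loss.
loss = masked_lm_loss([2.0, 4.0, 9.0], [1.0, 1.0, 0.0])
```

Here the result is close to 3.0, the mean of the two real losses; the padded slot with loss 9.0 contributes nothing.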

C. Next Sentence Prediction

Predicting the next token is used by many pre-training models and is a common method for NLP models that must understand context. BERT, however, does not predict a particular word: it turns the task into a binary one that decides whether the following sentence is the actual next sentence or some random sentence, and computes the loss on that. (The figure above shows word prediction, but that is only to aid understanding; the actual task is binary.)

In the code, the X of this loss function is the first token of the last Transformer encoder layer passed once through a fully connected layer, and Y is whether the sentence after the [SEP] token is the actual next sentence or a random one. (The pre-training procedure shows that the data is split 50:50.)

[Code Review = Loss Function]

def get_next_sentence_output(bert_config, input_tensor, labels):
  """Get loss and log probs for the next sentence prediction."""

  # Simple binary classification. Note that 0 is "next sentence" and 1 is
  # "random sentence". This weight matrix is not used after pre-training.
  with tf.variable_scope("cls/seq_relationship"):
    output_weights = tf.get_variable(
        "output_weights",
        shape=[2, bert_config.hidden_size],
        initializer=modeling.create_initializer(bert_config.initializer_range))
    output_bias = tf.get_variable(
        "output_bias", shape=[2], initializer=tf.zeros_initializer())

    logits = tf.matmul(input_tensor, output_weights, transpose_b=True)
    logits = tf.nn.bias_add(logits, output_bias)
    log_probs = tf.nn.log_softmax(logits, axis=-1)
    labels = tf.reshape(labels, [-1])
    one_hot_labels = tf.one_hot(labels, depth=2, dtype=tf.float32)
    per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
    loss = tf.reduce_mean(per_example_loss)
    return (loss, per_example_loss, log_probs)
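How the 50:50 labels might be produced can be sketched like this. It is a rough pure-Python illustration with hypothetical names and toy documents, not the actual data generation script; the label convention (0 = actual next sentence, 1 = random sentence) follows the comment in the code above.

```python
import random

def make_nsp_example(documents, doc_index, sent_index, rng):
    """Return (sentence_a, sentence_b, label); label 0 = actual next, 1 = random."""
    sentence_a = documents[doc_index][sent_index]
    if rng.random() < 0.5:
        return sentence_a, documents[doc_index][sent_index + 1], 0
    # Draw the random sentence from a different document.
    other_docs = [i for i in range(len(documents)) if i != doc_index]
    random_doc = documents[rng.choice(other_docs)]
    return sentence_a, rng.choice(random_doc), 1

docs = [["a1", "a2", "a3"], ["b1", "b2"], ["c1", "c2"]]
example = make_nsp_example(docs, 0, 1, random.Random(0))
```

Sampling the negative from another document keeps a "random" sentence from accidentally being the true continuation.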

D. Pre-Training

The description of pre-training is mostly about the data, vocabulary construction, and hyperparameter settings. See the original text below.

For the pre-training corpus we use the concatenation of BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words). For Wikipedia we extract only the text passages and ignore lists, tables, and headers. It is critical to use a document-level corpus rather than a shuffled sentence-level corpus such as the Billion Word Benchmark (Chelba et al., 2013) in order to extract long contiguous sequences. To generate each training input sequence, we sample two spans of text from the corpus, which we refer to as “sentences” even though they are typically much longer than single sentences (but can be shorter also). The first sentence receives the A embedding and the second receives the B embedding. 50% of the time B is the actual next sentence that follows A and 50% of the time it is a random sentence, which is done for the “next sentence prediction” task. They are sampled such that the combined length is ≤ 512 tokens. The LM masking is applied after WordPiece tokenization with a uniform masking rate of 15%, and no special consideration given to partial word pieces. We train with batch size of 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus. We use Adam with learning rate of 1e-4, β1 = 0.9, β2 = 0.999, L2 weight decay of 0.01, learning rate warm up over the first 10,000 steps, and linear decay of the learning rate. We use a dropout probability of 0.1 on all layers. We use a gelu activation (Hendrycks and Gimpel, 2016) rather than the standard relu, following OpenAI GPT. The training loss is the sum of the mean masked LM likelihood and mean next sentence prediction likelihood. Training of BERT_BASE was performed on 4 Cloud TPUs in Pod configuration (16 TPU chips total). Training of BERT_LARGE was performed on 16 Cloud TPUs (64 TPU chips total). Each pretraining took 4 days to complete.
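The numbers in the quoted passage can be checked with a little arithmetic, and the warm-up plus linear decay schedule can be sketched in a few lines. The `learning_rate` function below is one plausible reading of "warm up over the first 10,000 steps, and linear decay" and is my own assumption; the exact schedule used by the authors may differ in detail.

```python
def learning_rate(step, init_lr=1e-4, warmup_steps=10_000, total_steps=1_000_000):
    """Linear warm-up to init_lr, then linear decay toward zero (assumed form)."""
    if step < warmup_steps:
        return init_lr * step / warmup_steps
    return init_lr * (1.0 - step / total_steps)

# 256 sequences * 512 tokens; the paper quotes this as 128,000 tokens/batch.
tokens_per_batch = 256 * 512
tokens_seen = tokens_per_batch * 1_000_000
# Treats words and WordPiece tokens as comparable, which is only approximate.
approx_epochs = tokens_seen / 3.3e9
```

Under that approximation the run covers a little under 40 passes over the 3.3 billion word corpus, which matches the "approximately 40 epochs" in the text.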

(3) BERT – Fine-Tuning & application to various NLP problems

A. Classification

Applying BERT to an actual business problem requires fine-tuning on the individual task. The Transformer encoder weights are not frozen: the paper fine-tunes all parameters end to end, typically with a small learning rate, together with one additional fully connected layer trained against the labels we want. Below is a classification example. In the end, the kinds of Transformer encoder output we can use are limited: first, the entire last layer; second, the first token; third, an average over all layers rather than only the last one; and so on. For classification the first-token approach is used, as model.get_pooled_output() in the code shows.

[Code Review]

output_layer = model.get_pooled_output()

hidden_size = output_layer.shape[-1].value

output_weights = tf.get_variable(
    "output_weights", [num_labels, hidden_size],
    initializer=tf.truncated_normal_initializer(stddev=0.02))

output_bias = tf.get_variable(
    "output_bias", [num_labels], initializer=tf.zeros_initializer())

with tf.variable_scope("loss"):
  if is_training:
    # I.e., 0.1 dropout
    output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)

  logits = tf.matmul(output_layer, output_weights, transpose_b=True)
  logits = tf.nn.bias_add(logits, output_bias)
  probabilities = tf.nn.softmax(logits, axis=-1)
  log_probs = tf.nn.log_softmax(logits, axis=-1)

  one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)

  per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
  loss = tf.reduce_mean(per_example_loss)

  return (loss, per_example_loss, logits, probabilities)

B. MRC (Machine Reading Comprehension)

SQuAD is a problem of predicting a start point and an end point. The input is the last layer of the Transformer encoder; start and end positions are predicted, and the losses between each prediction and the true value are averaged.

[Code Review]

def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
                 use_one_hot_embeddings):
  """Creates a classification model."""
  model = modeling.BertModel(
      config=bert_config,
      is_training=is_training,
      input_ids=input_ids,
      input_mask=input_mask,
      token_type_ids=segment_ids,
      use_one_hot_embeddings=use_one_hot_embeddings)

  final_hidden = model.get_sequence_output()

  final_hidden_shape = modeling.get_shape_list(final_hidden, expected_rank=3)
  batch_size = final_hidden_shape[0]
  seq_length = final_hidden_shape[1]
  hidden_size = final_hidden_shape[2]

  output_weights = tf.get_variable(
      "cls/squad/output_weights", [2, hidden_size],
      initializer=tf.truncated_normal_initializer(stddev=0.02))

  output_bias = tf.get_variable(
      "cls/squad/output_bias", [2], initializer=tf.zeros_initializer())

  final_hidden_matrix = tf.reshape(final_hidden,
                                   [batch_size * seq_length, hidden_size])
  logits = tf.matmul(final_hidden_matrix, output_weights, transpose_b=True)
  logits = tf.nn.bias_add(logits, output_bias)

  logits = tf.reshape(logits, [batch_size, seq_length, 2])
  logits = tf.transpose(logits, [2, 0, 1])

  unstacked_logits = tf.unstack(logits, axis=0)

  (start_logits, end_logits) = (unstacked_logits[0], unstacked_logits[1])

  return (start_logits, end_logits)

[Code Review]

start_loss = compute_loss(start_logits, start_positions)
end_loss = compute_loss(end_logits, end_positions)

total_loss = (start_loss + end_loss) / 2.0
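The body of compute_loss is not shown above, so the following is only a guess at its spirit: a pure-Python sketch that treats the start and end positions as two softmax classification problems over the sequence and averages their negative log-likelihoods. The function names and logits are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_total for v in logits]

def position_loss(logits, position):
    """Negative log-likelihood of the true position over the sequence."""
    return -log_softmax(logits)[position]

# Toy logits for a 4-token sequence; the true span is tokens 1..3.
start_logits = [0.1, 2.0, 0.3, 0.2]
end_logits = [0.1, 0.2, 0.3, 2.5]
total_loss = (position_loss(start_logits, 1) + position_loss(end_logits, 3)) / 2.0
```

When the logits peak at the true position the loss is small, and with completely uniform logits it equals the log of the sequence length.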

4. Conclusion

The conclusion is that BERT achieved state-of-the-art results on a variety of NLP tasks. I think it is a milestone paper, comparable to what ImageNet was for the image field.

End