【Morphological Inflection】Trying the International Olympiad in Artificial Intelligence Sample Problem!
The International Olympiad in Artificial Intelligence (IOAI) is holding its first edition this year, so I tried the sample problems introduced on the official website.
About IOAI
IOAI is one of the international science olympiads and starts this year. The first edition will apparently be held in Burgas, Bulgaria.
As for the contest itself, there are two parts: a scientific round and a practical round. In the scientific round, participants receive an ipynb notebook on machine learning and deep learning and solve it. In the practical round, participants reason about scientific problems using GUI applications such as ChatGPT.
The Kaggle-style coding work is apparently required in the former, the scientific round.
Three sample problems were posted on the official website.
- NLP task (training a language model, reimplementing a paper) <- the one covered here
- NLP task (debiasing word embeddings)
- Image task (adversarial attacks)
The problem covered here is available at the following Colab link.
Task 1: Reimplementing a paper
Requirements
- Load the Navajo subset of the 2017 SIGMORPHON shared task data
- Fix the missing and incorrect code so as to fully reimplement the original paper, Applying the Transformer to Character-level Transduction
- Model training must take no more than 8 hours
- Reach a test-set accuracy of 52.1% (± 0.5%)
The provided code is stated to reach 48.8% accuracy.
I wanted to dive right in, but to check reproducibility I first ran the Colab code four times, getting accuracies of
- 0.219
- 0.346
- 0.392
- 0.439
The numbers fluctuate considerably and none of the runs reached the stated score, so I skip Task 1.
Task 2: Building a higher-scoring morphological inflection model
This one is much simpler than Task 1: the goal is just to push the score as high as possible.
Changes from the original Colab
Below are the changes relative to the code in the Colab notebook that hosts the sample problem (hereafter, the original Colab).
Dataset format
In the original Colab the input takes the form lemma <tag1><tag2><tag3>, with a space between the lemma and the tags. In my comparison experiments, the form lemma<tag1><tag2><tag3>, without the space, improved the score the most, so I change the preprocessing accordingly.
import re
import regex
def parse_tag(tag):
tag = re.sub(r"\)|\(|,|;", ' ', tag).split()
return ''.join(['<{}>'.format(t) for t in tag])
def preprocess_data(raw_data):
preprocessed_data = []
for line in raw_data:
lemma, tag, target = line.split('\t')
+ preprocessed_data.append(('{}{}'.format(lemma, parse_tag(tag)),target))
- preprocessed_data.append(('{} {}'.format(lemma, parse_tag(tag)),target))
return preprocessed_data
data['train'] = preprocess_data(data['train'])
data['dev'] = preprocess_data(data['dev'])
print('Preprocessed data sample:', data['train'][54])
chars = set(list(''.join([''.join([d[0].split()[0], d[1]]) for d in data['train']])))
char2id = { char: i for i, char in enumerate(chars)}
tags = list(set(sum([regex.findall(r"<[A-Za-z0-9]*>",d[0]) for d in data['train']], [])))
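For reference, here is what the preprocessing produces for a hypothetical line in the shared-task format (lemma, tags, and target below are placeholders, not actual Navajo data):
sample_line = 'lemma\tV;PFV;SG\ttarget'
lemma, tag, target = sample_line.split('\t')
print(parse_tag(tag))                        # -> <V><PFV><SG>
print('{}{}'.format(lemma, parse_tag(tag)))  # -> lemma<V><PFV><SG>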
Tokenizer
The original Colab simply defines separate pad, bos, and eos tokens, but to improve training and inference efficiency I set the pad token to the eos token.
The hope is that the model then does not have to learn the pointless extra behavior of filling the unused tail of a sequence with pad tokens.
### omitted ###
class CustomTokenizer(PreTrainedTokenizer):
model_input_names = ["input_ids", "attention_mask"]
def __init__(
self,
vocab: Dict[str, int],
bos_token="<s>",
eos_token="</s>",
unk_token="<unk>",
pad_token="<pad>",
**kwargs,
) -> None:
# Add extra_ids to the special token list
self.__token_ids = vocab
self.__id_tokens: Dict[int, str] = {value: key for key, value in vocab.items()}
pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
self._added_tokens_decoder = {0: pad_token, 1: bos_token, 2: eos_token, 3: unk_token}
self.offset = len(self._added_tokens_decoder)
super().__init__(
bos_token=bos_token,
eos_token=eos_token,
unk_token=unk_token,
+ pad_token=eos_token,
- pad_token=pad_token,
**kwargs,
)
### omitted ###
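A quick check of the resulting special token ids (a sketch; it assumes the omitted parts of CustomTokenizer, such as vocab_size and the token/id conversion methods, are implemented as in the original Colab):
# hypothetical tiny character vocab, shifted past the 4 special token ids
vocab = {c: i + 4 for i, c in enumerate('abc')}
tok = CustomTokenizer(vocab)
print(tok.pad_token_id, tok.eos_token_id)  # both 2: padded positions are simply eos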
Model
The original Colab uses T5; here I use BART instead, which implements absolute positional encoding.
Modeling
Since the model receives no pretraining for this task, I set the positional-embedding offset to 0.
The relevant part:
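(An excerpt of the change only; the full CustomBartLearnedPositionalEmbedding, including the later Feature Invariant additions, is shown further below.)
from torch import nn

class CustomBartLearnedPositionalEmbedding(nn.Embedding):
    def __init__(self, num_embeddings: int, embedding_dim: int):
        # the standard BART implementation uses an offset of 2; since this model
        # is not pretrained, the offset is set to 0
        self.offset = 0  # original: self.offset = 2
        super().__init__(num_embeddings + self.offset, embedding_dim)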
Config
+from transformers import BartConfig
+from modeling_custom_bart import CustomBartForConditionalGeneration
-from transformers import T5Config, T5ForConditionalGeneration
+config = BartConfig(decoder_ffn_dim=1024,
+ encoder_ffn_dim=1024,
+ encoder_attention_heads=4,
+ decoder_attention_heads=4,
+ d_model=256,
+ encoder_layers=4,
+ decoder_layers=4,
+ max_position_embeddings=64,
+ dropout=0.2,
+ activation_function="gelu",
+ decoder_start_token_id=1,
+ bos_token_id=1,
+ pad_token_id=2,
+ vocab_size=len(tokenizer))
-config = T5Config(d_ff=1024,
- d_model=256,
- num_layers=4,
- num_decoder_layers=4,
- num_heads=4,
- dropout_rate=0.2,
- vocab_size=len(tokenizer))
+model = CustomBartForConditionalGeneration(config)
-model = T5ForConditionalGeneration(config)
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.generation_config.decoder_start_token_id = tokenizer.bos_token_id
model.generation_config.max_new_tokens = 32
model.generation_config.eos_token_id = tokenizer.eos_token_id
+model.generation_config.forced_eos_token_id=2
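A quick sanity check (a sketch) that the config, generation config, and tokenizer agree on the special token ids, i.e. bos is 1 and pad coincides with eos at 2:
print(model.config.decoder_start_token_id, tokenizer.bos_token_id)   # 1 1
print(model.generation_config.eos_token_id, tokenizer.eos_token_id)  # 2 2
print(tokenizer.pad_token_id)                                        # 2, same id as eos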
Train
I adopt the hyperparameters that achieved the highest accuracy in my comparison experiments.
### omitted ###
training_args = Seq2SeqTrainingArguments(
+ max_steps=10000,
- max_steps=20000,
+ per_device_train_batch_size=800,
- per_device_train_batch_size=400,
learning_rate = 0.001,
lr_scheduler_type='inverse_sqrt',
warmup_steps=4000,
adam_beta2=0.98,
label_smoothing_factor=0.1,
evaluation_strategy="steps",
eval_steps=400,
eval_delay=400,
save_strategy="steps",
save_steps=400,
predict_with_generate=True,
metric_for_best_model='exact_match',
save_total_limit=1,
logging_strategy="steps",
logging_steps=400,
output_dir='custom_0.2',
overwrite_output_dir=True,
load_best_model_at_end=True
)
### omitted ###
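Since metric_for_best_model='exact_match' is used with load_best_model_at_end, the notebook has to supply a compute_metrics that returns an 'exact_match' entry. A minimal sketch of what such a function computes (the decoding details, e.g. the -100 handling, are assumptions, not the original Colab's code):
import numpy as np

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # -100 marks ignored label positions; map them back to the pad id before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    pred_str = tokenizer.batch_decode(preds, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(labels, skip_special_tokens=True)
    exact = sum(p == l for p, l in zip(pred_str, label_str)) / len(label_str)
    return {"exact_match": exact}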
Results
The score was 0.533.
This is 1.2 points above the 52.1% reported in Table 3 of SIGMORPHON–UniMorph 2023 Shared Task 0: Typologically Diverse Morphological Inflection.
Changes that did not work
I tried turning BART into a Mixture-of-Experts (MoE) model, but the score dropped (see the score tables).
MoE
BartSparseMoeBlock
class BartBlockSparseTop2MLP(nn.Module):
def __init__(self, config: BartConfig):
super().__init__()
self.ffn_dim = config.encoder_ffn_dim  # assumes the intermediate size is the same for encoder and decoder
self.hidden_dim = config.d_model
self.fc1 = nn.Linear(self.hidden_dim, self.ffn_dim, bias=True)
self.fc2 = nn.Linear(self.ffn_dim, self.hidden_dim, bias=True)
self.act_fn = ACT2FN[config.activation_function]
self.dropout = config.dropout
self.activation_dropout = config.activation_dropout
def forward(self, hidden_states):
current_hidden_states = self.act_fn(self.fc1(hidden_states))
current_hidden_states = nn.functional.dropout(current_hidden_states, p=self.activation_dropout, training=self.training)
current_hidden_states = self.fc2(current_hidden_states)
current_hidden_states = nn.functional.dropout(current_hidden_states, p=self.dropout, training=self.training)
return current_hidden_states
class BartSparseMoeBlock(nn.Module):
"""
This implementation is
strictly equivalent to standard MoE with full capacity (no
dropped tokens). It's faster since it formulates MoE operations
in terms of block-sparse operations to accommodate imbalanced
assignments of tokens to experts, whereas standard MoE either
(1) drop tokens at the cost of reduced performance or (2) set
capacity factor to number of experts and thus waste computation
and memory on padding.
"""
def __init__(self, config):
super().__init__()
self.hidden_dim = config.d_model
self.ffn_dim = config.encoder_ffn_dim
self.num_experts = 16 # config.num_local_experts
self.top_k = 4 # config.num_experts_per_tok
# gating
self.gate = nn.Linear(self.hidden_dim, self.num_experts, bias=False)
self.experts = nn.ModuleList([BartBlockSparseTop2MLP(config) for _ in range(self.num_experts)])
# Jitter parameters
self.jitter_noise = 0 # config.router_jitter_noise
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
""" """
batch_size, sequence_length, hidden_dim = hidden_states.shape
if self.training and self.jitter_noise > 0:
hidden_states *= torch.empty_like(hidden_states).uniform_(1.0 - self.jitter_noise, 1.0 + self.jitter_noise)
hidden_states = hidden_states.view(-1, hidden_dim)
# router_logits: (batch * sequence_length, n_experts)
router_logits = self.gate(hidden_states)
routing_weights = F.softmax(router_logits, dim=1, dtype=torch.float)
routing_weights, selected_experts = torch.topk(routing_weights, self.top_k, dim=-1)
routing_weights /= routing_weights.sum(dim=-1, keepdim=True)
# we cast back to the input dtype
routing_weights = routing_weights.to(hidden_states.dtype)
final_hidden_states = torch.zeros(
(batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype, device=hidden_states.device
)
# One hot encode the selected experts to create an expert mask
# this will be used to easily index which expert is going to be solicited
expert_mask = torch.nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0)
# Loop over all available experts in the model and perform the computation on each expert
for expert_idx in range(self.num_experts):
expert_layer = self.experts[expert_idx]
idx, top_x = torch.where(expert_mask[expert_idx])
if top_x.shape[0] == 0:
continue
# Index the correct hidden states and compute the expert hidden state for
# the current expert. We need to make sure to multiply the output hidden
# states by `routing_weights` on the corresponding tokens (top-1 and top-2)
current_state = hidden_states[None, top_x].reshape(-1, hidden_dim)
current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None]
# However `index_add_` only support torch tensors for indexing so we'll use
# the `top_x` tensor here.
final_hidden_states.index_add_(0, top_x, current_hidden_states.to(hidden_states.dtype))
final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim)
return final_hidden_states, router_logits
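A quick smoke test of the block in isolation (a sketch; it assumes the classes above are in scope together with their imports such as torch.nn.functional and ACT2FN, and the batch/sequence sizes are arbitrary):
import torch
from transformers import BartConfig

# config values mirror the model config above
cfg = BartConfig(d_model=256, encoder_ffn_dim=1024, activation_function="gelu")
moe = BartSparseMoeBlock(cfg)

x = torch.randn(2, 8, cfg.d_model)  # (batch, seq_len, d_model)
out, router_logits = moe(x)
print(out.shape)            # torch.Size([2, 8, 256])
print(router_logits.shape)  # torch.Size([16, 16]): (batch * seq_len, num_experts)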
BartEncoderLayer
class BartEncoderLayer(nn.Module):
def __init__(self, config: BartConfig):
super().__init__()
self.embed_dim = config.d_model
self.self_attn = BART_ATTENTION_CLASSES[config._attn_implementation](
embed_dim=self.embed_dim,
num_heads=config.encoder_attention_heads,
dropout=config.attention_dropout,
config=config,
)
self.self_attn_layer_norm = nn.LayerNorm(self.embed_dim)
self.dropout = config.dropout
+ self.block_sparse_moe = BartSparseMoeBlock(config)
- self.activation_fn = ACT2FN[config.activation_function]
- self.activation_dropout = config.activation_dropout
- self.fc1 = nn.Linear(self.embed_dim, config.encoder_ffn_dim)
- self.fc2 = nn.Linear(config.encoder_ffn_dim, self.embed_dim)
self.final_layer_norm = nn.LayerNorm(self.embed_dim)
def forward(
self,
hidden_states: torch.FloatTensor,
attention_mask: torch.FloatTensor,
layer_head_mask: torch.FloatTensor,
output_attentions: Optional[bool] = False,
) -> Tuple[torch.FloatTensor, Optional[torch.FloatTensor]]:
"""
Args:
hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
attention_mask (`torch.FloatTensor`): attention mask of size
`(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
layer_head_mask (`torch.FloatTensor`): mask for attention heads in a given layer of size
`(encoder_attention_heads,)`.
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under
returned tensors for more detail.
"""
residual = hidden_states
hidden_states, attn_weights, _ = self.self_attn(
hidden_states=hidden_states,
attention_mask=attention_mask,
layer_head_mask=layer_head_mask,
output_attentions=output_attentions,
)
hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)
hidden_states = residual + hidden_states
hidden_states = self.self_attn_layer_norm(hidden_states)
residual = hidden_states
+ hidden_states, router_logits = self.block_sparse_moe(hidden_states)
- hidden_states = self.activation_fn(self.fc1(hidden_states))
- hidden_states = nn.functional.dropout(hidden_states, p=self.activation_dropout, training=self.training)
- hidden_states = self.fc2(hidden_states)
- hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)
hidden_states = residual + hidden_states
hidden_states = self.final_layer_norm(hidden_states)
if hidden_states.dtype == torch.float16 and (
torch.isinf(hidden_states).any() or torch.isnan(hidden_states).any()
):
clamp_value = torch.finfo(hidden_states.dtype).max - 1000
hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)
outputs = (hidden_states,)
if output_attentions:
outputs += (attn_weights,)
return outputs
BartDecoderLayer
class BartDecoderLayer(nn.Module):
def __init__(self, config: BartConfig):
super().__init__()
self.embed_dim = config.d_model
self.self_attn = BART_ATTENTION_CLASSES[config._attn_implementation](
embed_dim=self.embed_dim,
num_heads=config.decoder_attention_heads,
dropout=config.attention_dropout,
is_decoder=True,
is_causal=True,
config=config,
)
self.dropout = config.dropout
"""
self.activation_fn = ACT2FN[config.activation_function]
self.activation_dropout = config.activation_dropout
"""
self.self_attn_layer_norm = nn.LayerNorm(self.embed_dim)
self.encoder_attn = BART_ATTENTION_CLASSES[config._attn_implementation](
self.embed_dim,
config.decoder_attention_heads,
dropout=config.attention_dropout,
is_decoder=True,
config=config,
)
self.encoder_attn_layer_norm = nn.LayerNorm(self.embed_dim)
+ self.block_sparse_moe = BartSparseMoeBlock(config)
- self.fc1 = nn.Linear(self.embed_dim, config.decoder_ffn_dim)
- self.fc2 = nn.Linear(config.decoder_ffn_dim, self.embed_dim)
self.final_layer_norm = nn.LayerNorm(self.embed_dim)
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
encoder_hidden_states: Optional[torch.Tensor] = None,
encoder_attention_mask: Optional[torch.Tensor] = None,
layer_head_mask: Optional[torch.Tensor] = None,
cross_attn_layer_head_mask: Optional[torch.Tensor] = None,
past_key_value: Optional[Tuple[torch.Tensor]] = None,
output_attentions: Optional[bool] = False,
use_cache: Optional[bool] = True,
) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
"""
Args:
hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
attention_mask (`torch.FloatTensor`): attention mask of size
`(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
encoder_hidden_states (`torch.FloatTensor`):
cross attention input to the layer of shape `(batch, seq_len, embed_dim)`
encoder_attention_mask (`torch.FloatTensor`): encoder attention mask of size
`(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
layer_head_mask (`torch.FloatTensor`): mask for attention heads in a given layer of size
`(encoder_attention_heads,)`.
cross_attn_layer_head_mask (`torch.FloatTensor`): mask for cross-attention heads in a given layer of
size `(decoder_attention_heads,)`.
past_key_value (`Tuple(torch.FloatTensor)`): cached past key and value projection states
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under
returned tensors for more detail.
"""
residual = hidden_states
# Self Attention
# decoder uni-directional self-attention cached key/values tuple is at positions 1,2
self_attn_past_key_value = past_key_value[:2] if past_key_value is not None else None
# add present self-attn cache to positions 1,2 of present_key_value tuple
hidden_states, self_attn_weights, present_key_value = self.self_attn(
hidden_states=hidden_states,
past_key_value=self_attn_past_key_value,
attention_mask=attention_mask,
layer_head_mask=layer_head_mask,
output_attentions=output_attentions,
)
hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)
hidden_states = residual + hidden_states
hidden_states = self.self_attn_layer_norm(hidden_states)
# Cross-Attention Block
cross_attn_present_key_value = None
cross_attn_weights = None
if encoder_hidden_states is not None:
residual = hidden_states
# cross_attn cached key/values tuple is at positions 3,4 of present_key_value tuple
cross_attn_past_key_value = past_key_value[-2:] if past_key_value is not None else None
hidden_states, cross_attn_weights, cross_attn_present_key_value = self.encoder_attn(
hidden_states=hidden_states,
key_value_states=encoder_hidden_states,
attention_mask=encoder_attention_mask,
layer_head_mask=cross_attn_layer_head_mask,
past_key_value=cross_attn_past_key_value,
output_attentions=output_attentions,
)
hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)
hidden_states = residual + hidden_states
hidden_states = self.encoder_attn_layer_norm(hidden_states)
# add cross-attn to positions 3,4 of present_key_value tuple
present_key_value = present_key_value + cross_attn_present_key_value
# Fully Connected
residual = hidden_states
+ hidden_states, router_logits = self.block_sparse_moe(hidden_states)
- hidden_states = self.activation_fn(self.fc1(hidden_states))
- hidden_states = nn.functional.dropout(hidden_states, p=self.activation_dropout, training=self.training)
- hidden_states = self.fc2(hidden_states)
- hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)
hidden_states = residual + hidden_states
hidden_states = self.final_layer_norm(hidden_states)
outputs = (hidden_states,)
if output_attentions:
outputs += (self_attn_weights, cross_attn_weights)
if use_cache:
outputs += (present_key_value,)
return outputs
Feature Invariant
This is the technique described in Figure 2 of the original paper, Applying the Transformer to Character-level Transduction: when the input takes a form like <tag1><tag2><tag3>lemma, positional encoding is not applied to the tags.
Note that this change decreased the score (see the score tables).
CustomBartLearnedPositionalEmbedding
class CustomBartLearnedPositionalEmbedding(nn.Embedding):
"""
This module learns positional embeddings up to a fixed maximum size.
"""
def __init__(self, num_embeddings: int, embedding_dim: int):
# Bart is set up so that if padding_idx is specified then offset the embedding ids by 2
# and adjust num_embeddings appropriately. Other models don't have this hack
+ self.offset = 0
- self.offset = 2
super().__init__(num_embeddings + self.offset, embedding_dim)
def forward(self, input_ids: torch.Tensor, past_key_values_length: int = 0, zero_count=None):
"""`input_ids' shape is expected to be [bsz x seqlen]."""
bsz, seq_len = input_ids.shape[:2]
+ if zero_count == bsz*[0]:
+ positions = torch.arange(
+ past_key_values_length, past_key_values_length + seq_len, dtype=torch.long, device=self.weight.device
+ ).expand(bsz, -1)
+ else:
+ positions = []
+ for i, zero_c in zip(range(bsz), zero_count):
+ position = torch.cat((torch.zeros(zero_c - 1, dtype=torch.long, device=self.weight.device), torch.arange(
+ past_key_values_length, past_key_values_length + seq_len - zero_c + 1, dtype=torch.long, device=self.weight.device
+ )), dim=0).tolist()
+ positions.append(position)
+ positions = torch.tensor(positions, device=self.weight.device)
+ positions = torch.cat((positions,), dim=0)
- positions = torch.arange(
- past_key_values_length, past_key_values_length + seq_len, dtype=torch.long, device=self.weight.device
- ).expand(bsz, -1)
return super().forward(positions + self.offset)
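To illustrate the effect (a sketch with dummy ids): for a 6-token input whose first 3 tokens are tags, passing zero_count=[3] assigns position 0 to every tag token while the lemma characters receive positions 1, 2, 3.
import torch

# only the shape (bsz=1, seq_len=6) of the ids matters here; zero_count says the first 3 tokens are tags
emb = CustomBartLearnedPositionalEmbedding(num_embeddings=64, embedding_dim=256)
dummy_ids = torch.zeros(1, 6, dtype=torch.long)
pos = emb(dummy_ids, zero_count=[3])
# position ids fed to the embedding table: [[0, 0, 0, 1, 2, 3]]
print(pos.shape)  # torch.Size([1, 6, 256])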
CustomBartEncoder
class BartEncoder(BartPreTrainedModel):
### omitted ###
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
head_mask: Optional[torch.Tensor] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, BaseModelOutput]:
### omitted ###
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
)
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+ zero_number = []
+ if input_ids is not None:
+ for ids in input_ids.tolist():
+ zero_count = 0
+ for id in ids:
+ if id in self.zero_id:
+ zero_count += 1
+ zero_number.append(zero_count)
+ zero_count = zero_number
### omitted ###
if inputs_embeds is None:
inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale
+ embed_pos = self.embed_positions(input, zero_count=zero_count)
- embed_pos = self.embed_positions(input)
embed_pos = embed_pos.to(inputs_embeds.device)
### omitted ###
CustomBartDecoder
class BartDecoder(BartPreTrainedModel):
### omitted ###
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
encoder_hidden_states: Optional[torch.FloatTensor] = None,
encoder_attention_mask: Optional[torch.LongTensor] = None,
head_mask: Optional[torch.Tensor] = None,
cross_attn_head_mask: Optional[torch.Tensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, BaseModelOutputWithPastAndCrossAttentions]:
### omitted ###
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
)
use_cache = use_cache if use_cache is not None else self.config.use_cache
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+ zero_number = []
+ if input_ids is not None:
+ for ids in input_ids.tolist():
+ zero_count = 0
+ for id in ids:
+ if id in self.zero_id:
+ zero_count += 1
+ zero_number.append(zero_count)
+ zero_count = zero_number
### omitted ###
# embed positions
+ positions = self.embed_positions(input, past_key_values_length, zero_count=zero_count)
- positions = self.embed_positions(input, past_key_values_length)
positions = positions.to(inputs_embeds.device)
### omitted ###
Score tables
Comparison experiment scores
score | architecture | lemma/tag swap | space removal | trailing space | act fn | offset | encoder bos | experts | topk | bias | batch size | d_model | steps |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.523 | BART | × | ○ | × | gelu | 2 | × | ||||||
0.491 | BART | ○ | ○ | × | gelu | 2 | × | ||||||
0.353 | T5 | ○ | ○ | × | gelu | N/A | × | ||||||
0.506 | BART | × | × | × | gelu | 2 | × | ||||||
0.372 | T5 | × | × | × | gelu | N/A | × | ||||||
0.392 | T5 | × | × | × | relu | N/A | × | ||||||
0.472 | BART | ○ | ○ | × | relu | 2 | × | ||||||
0.498 | BART | ○ | ○ | × | gelu | 0 | × | ||||||
0.483 | BART | ○ | ○ | × | gelu | Feature Invariant | × | ||||||
0.471 | BART | ○ | ○ | ○ | gelu | 0 | × | ||||||
0.458 | BART | ○ | ○ | × | silu | 0 | × | ||||||
0.532 | BART | × | ○ | × | gelu | 0 | × | 400 | |||||
0.51 | BART | × | ○ | × | gelu | 0 | ○ | ||||||
0.42 | BART | × | ○ | × | gelu | 0 | × | 8 | 2 | ○ | |||
0.468 | BART | × | ○ | × | gelu | 0 | × | 8 | 2 | × | |||
0.474 | BART | × | ○ | × | gelu | 0 | × | 4 | 1 | ○ | |||
0.474 | BART | × | ○ | × | gelu | 0 | × | 16 | 4 | ○ | |||
0.533 | BART | × | ○ | × | gelu | 0 | × | ○ | 800 | 10000 | |||
0.431 | BART | × | ○ | × | gelu | 0 | × | ○ | 4000 | 2000 | |||
0.528 | BART | × | ○ | × | gelu | 0 | × | × | 400 | ||||
0.493 | BART | × | ○ | × | gelu | 0 | × | ○ | 400 | 512 | |||
0.482 | BART | × | ○ | × | gelu | 0 | × | ○ | 800 | 20000 |
Baseline (original Colab)
score | architecture | lemma/tag swap | space removal | trailing space | act fn | offset | encoder bos |
---|---|---|---|---|---|---|---|
0.392 | T5 | × | × | × | relu | N/A | × |
0.346 | T5 | × | × | × | relu | N/A | × |
0.439 | T5 | × | × | × | relu | N/A | × |
0.219 | T5 | × | × | × | relu | N/A | × |
Hyperparameter search scores (T5)
score | Feature Invariant | lemma/tag swap | space removal | dropout | adam beta2 | label smoothing |
---|---|---|---|---|---|---|
0.33 | ○ | ○ | ○ | 0.2 | 0.98 | 0.1 |
0.386 | ○ | ○ | ○ | 0.3 | 0.999 | 0 |
0.34 | ○ | ○ | ○ | 0.3 | 0.98 | 0.1 |
0.467 | × | ○ | ○ | 0.3 | 0.98 | 0.1 |
0.396 | × | × | × | 0.1 | 0.999 | 0.1 |
0.386 | × | × | × | 0.3 | 0.98 | 0.1 |
0.319 | × | × | × | 0.3 | 0.999 | 0 |
0.335 | × | × | × | 0.1 | 0.999 | 0 |
0.196 | × | × | × | 0.1 | 0.98 | 0 |
0.239 | × | × | × | 0.3 | 0.98 | 0 |
Closing remarks
If anything looks off, feel free to get in touch. I am not very confident about the implementation around Feature Invariant.
Incidentally, IOAI is apparently accepting applications until April 17.