【Morphological Inflection】Trying the International Olympiad in Artificial Intelligence Sample Problem!
The International Olympiad in Artificial Intelligence (IOAI) is holding its first edition this year, so I tried the sample problems introduced on the official website.
About IOAI
IOAI is one of the international science olympiads and starts this year. The first edition will apparently be held in Burgas, Bulgaria.
As for the contest itself, there are two parts: a scientific round and a practical round. In the scientific round, participants receive an ipynb notebook on machine learning and deep learning and solve it. In the practical round, participants reason about scientific problems using GUI applications such as ChatGPT.
The Kaggle-style coding work is apparently required in the former, the scientific round.
Three sample problems were posted on the official website.
- NLP task (training a language model, reimplementing a paper) <- the one covered here
- NLP task (debiasing word embeddings)
- Image task (adversarial attacks)
The problem covered here is available at the following Colab link.
Task 1: Reimplementing a paper
Requirements
- Load the Navajo subset of the 2017 SIGMORPHON shared task data
- Fix the missing and incorrect code so as to fully reimplement the original paper, Applying the Transformer to Character-level Transduction
- Model training must take no more than 8 hours
- Reach a test-set accuracy of 52.1% (± 0.5%)
The provided code is stated to reach 48.8% accuracy.
I wanted to dive right in, but to check reproducibility I first ran the Colab code four times, getting accuracies of
- 0.219
- 0.346
- 0.392
- 0.439
The numbers fluctuate considerably and none of the runs reached the stated score, so I skip Task 1.
Task 2: Building a higher-scoring morphological inflection model
This one is much simpler than Task 1: the goal is just to push the score as high as possible.
Changes from the original Colab
Below are the changes relative to the code in the Colab notebook that hosts the sample problem (hereafter, the original Colab).
Dataset format
In the original Colab the input takes the form lemma <tag1><tag2><tag3>, with a space between the lemma and the tags. In my comparison experiments, the form lemma<tag1><tag2><tag3>, without the space, improved the score the most, so I change the preprocessing accordingly.
import re
import regex
def parse_tag(tag):
tag = re.sub(r"\)|\(|,|;", ' ', tag).split()
return ''.join(['<{}>'.format(t) for t in tag])
def preprocess_data(raw_data):
preprocessed_data = []
for line in raw_data:
lemma, tag, target = line.split('\t')
+ preprocessed_data.append(('{}{}'.format(lemma, parse_tag(tag)),target))
- preprocessed_data.append(('{} {}'.format(lemma, parse_tag(tag)),target))
return preprocessed_data
data['train'] = preprocess_data(data['train'])
data['dev'] = preprocess_data(data['dev'])
print('Preprocessed data sample:', data['train'][54])
chars = set(list(''.join([''.join([d[0].split()[0], d[1]]) for d in data['train']])))
char2id = { char: i for i, char in enumerate(chars)}
tags = list(set(sum([regex.findall(r"<[A-Za-z0-9]*>",d[0]) for d in data['train']], [])))
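For reference, here is what the preprocessing produces for a hypothetical line in the shared-task format (lemma, tags, and target below are placeholders, not actual Navajo data):
sample_line = 'lemma\tV;PFV;SG\ttarget'
lemma, tag, target = sample_line.split('\t')
print(parse_tag(tag))                        # -> <V><PFV><SG>
print('{}{}'.format(lemma, parse_tag(tag)))  # -> lemma<V><PFV><SG>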
Tokenizer
The original Colab simply defines separate pad, bos, and eos tokens, but to improve training and inference efficiency I set the pad token to the eos token.
The hope is that the model then does not have to learn the pointless extra behavior of filling the unused tail of a sequence with pad tokens.
### omitted ###
class CustomTokenizer(PreTrainedTokenizer):
model_input_names = ["input_ids", "attention_mask"]
def __init__(
self,
vocab: Dict[str, int],
bos_token="<s>",
eos_token="</s>",
unk_token="<unk>",
pad_token="<pad>",
**kwargs,
) -> None:
# Add extra_ids to the special token list
self.__token_ids = vocab
self.__id_tokens: Dict[int, str] = {value: key for key, value in vocab.items()}
pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
self._added_tokens_decoder = {0: pad_token, 1: bos_token, 2: eos_token, 3: unk_token}
self.offset = len(self._added_tokens_decoder)
super().__init__(
bos_token=bos_token,
eos_token=eos_token,
unk_token=unk_token,
+ pad_token=eos_token,
- pad_token=pad_token,
**kwargs,
)
### omitted ###
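A quick check of the resulting special token ids (a sketch; it assumes the omitted parts of CustomTokenizer, such as vocab_size and the token/id conversion methods, are implemented as in the original Colab):
# hypothetical tiny character vocab, shifted past the 4 special token ids
vocab = {c: i + 4 for i, c in enumerate('abc')}
tok = CustomTokenizer(vocab)
print(tok.pad_token_id, tok.eos_token_id)  # both 2: padded positions are simply eos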
Model
The original Colab uses T5; here I use BART instead, which implements absolute positional encoding.
Modeling
Since the model receives no pretraining for this task, I set the positional-embedding offset to 0.
The relevant part:
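(An excerpt of the change only; the full CustomBartLearnedPositionalEmbedding, including the later Feature Invariant additions, is shown further below.)
from torch import nn

class CustomBartLearnedPositionalEmbedding(nn.Embedding):
    def __init__(self, num_embeddings: int, embedding_dim: int):
        # the standard BART implementation uses an offset of 2; since this model
        # is not pretrained, the offset is set to 0
        self.offset = 0  # original: self.offset = 2
        super().__init__(num_embeddings + self.offset, embedding_dim)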
Config
+from transformers import BartConfig
+from modeling_custom_bart import CustomBartForConditionalGeneration
-from transformers import T5Config, T5ForConditionalGeneration
+config = BartConfig(decoder_ffn_dim=1024,
+ encoder_ffn_dim=1024,
+ encoder_attention_heads=4,
+ decoder_attention_heads=4,
+ d_model=256,
+ encoder_layers=4,
+ decoder_layers=4,
+ max_position_embeddings=64,
+ dropout=0.2,
+ activation_function="gelu",
+ decoder_start_token_id=1,
+ bos_token_id=1,
+ pad_token_id=2,
+ vocab_size=len(tokenizer))
-config = T5Config(d_ff=1024,
- d_model=256,
- num_layers=4,
- num_decoder_layers=4,
- num_heads=4,
- dropout_rate=0.2,
- vocab_size=len(tokenizer))
+model = CustomBartForConditionalGeneration(config)
-model = T5ForConditionalGeneration(config)
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.generation_config.decoder_start_token_id = tokenizer.bos_token_id
model.generation_config.max_new_tokens = 32
model.generation_config.eos_token_id = tokenizer.eos_token_id
+model.generation_config.forced_eos_token_id=2
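A quick sanity check (a sketch) that the config, generation config, and tokenizer agree on the special token ids, i.e. bos is 1 and pad coincides with eos at 2:
print(model.config.decoder_start_token_id, tokenizer.bos_token_id)   # 1 1
print(model.generation_config.eos_token_id, tokenizer.eos_token_id)  # 2 2
print(tokenizer.pad_token_id)                                        # 2, same id as eos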
Train
I adopt the hyperparameters that achieved the highest accuracy in my comparison experiments.
### omitted ###
training_args = Seq2SeqTrainingArguments(
+ max_steps=10000,
- max_steps=20000,
+ per_device_train_batch_size=800,
- per_device_train_batch_size=400,
learning_rate = 0.001,
lr_scheduler_type='inverse_sqrt',
warmup_steps=4000,
adam_beta2=0.98,
label_smoothing_factor=0.1,
evaluation_strategy="steps",
eval_steps=400,
eval_delay=400,
save_strategy="steps",
save_steps=400,
predict_with_generate=True,
metric_for_best_model='exact_match',
save_total_limit=1,
logging_strategy="steps",
logging_steps=400,
output_dir='custom_0.2',
overwrite_output_dir=True,
load_best_model_at_end=True
)
### omitted ###
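Since metric_for_best_model='exact_match' is used with load_best_model_at_end, the notebook has to supply a compute_metrics that returns an 'exact_match' entry. A minimal sketch of what such a function computes (the decoding details, e.g. the -100 handling, are assumptions, not the original Colab's code):
import numpy as np

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # -100 marks ignored label positions; map them back to the pad id before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    pred_str = tokenizer.batch_decode(preds, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(labels, skip_special_tokens=True)
    exact = sum(p == l for p, l in zip(pred_str, label_str)) / len(label_str)
    return {"exact_match": exact}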
Results
The score was 0.533.
This is 1.2 points above the 52.1% reported in Table 3 of SIGMORPHON–UniMorph 2023 Shared Task 0: Typologically Diverse Morphological Inflection.
Changes that did not work
I tried turning BART into a Mixture-of-Experts (MoE) model, but the score dropped (see the score tables).
MoE
BartSparseMoeBlock
class BartBlockSparseTop2MLP(nn.Module):
def __init__(self, config: BartConfig):
super().__init__()
self.ffn_dim = config.encoder_ffn_dim  # assumes the intermediate size is the same for encoder and decoder
self.hidden_dim = config.d_model
self.fc1 = nn.Linear(self.hidden_dim, self.ffn_dim, bias=True)
self.fc2 = nn.Linear(self.ffn_dim, self.hidden_dim, bias=True)
self.act_fn = ACT2FN[config.activation_function]
self.dropout = config.dropout
self.activation_dropout = config.activation_dropout
def forward(self, hidden_states):
current_hidden_states = self.act_fn(self.fc1(hidden_states))
current_hidden_states = nn.functional.dropout(current_hidden_states, p=self.activation_dropout, training=self.training)
current_hidden_states = self.fc2(current_hidden_states)
current_hidden_states = nn.functional.dropout(current_hidden_states, p=self.dropout, training=self.training)
return current_hidden_states
class BartSparseMoeBlock(nn.Module):
"""
This implementation is
strictly equivalent to standard MoE with full capacity (no
dropped tokens). It's faster since it formulates MoE operations
in terms of block-sparse operations to accommodate imbalanced
assignments of tokens to experts, whereas standard MoE either
(1) drop tokens at the cost of reduced performance or (2) set
capacity factor to number of experts and thus waste computation
and memory on padding.
"""
def __init__(self, config):
super().__init__()
self.hidden_dim = config.d_model
self.ffn_dim = config.encoder_ffn_dim
self.num_experts = 16 # config.num_local_experts
self.top_k = 4 # config.num_experts_per_tok
# gating
self.gate = nn.Linear(self.hidden_dim, self.num_experts, bias=False)
self.experts = nn.ModuleList([BartBlockSparseTop2MLP(config) for _ in range(self.num_experts)])
# Jitter parameters
self.jitter_noise = 0 # config.router_jitter_noise
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
""" """
batch_size, sequence_length, hidden_dim = hidden_states.shape
if self.training and self.jitter_noise > 0:
hidden_states *= torch.empty_like(hidden_states).uniform_(1.0 - self.jitter_noise, 1.0 + self.jitter_noise)
hidden_states = hidden_states.view(-1, hidden_dim)
# router_logits: (batch * sequence_length, n_experts)
router_logits = self.gate(hidden_states)
routing_weights = F.softmax(router_logits, dim=1, dtype=torch.float)
routing_weights, selected_experts = torch.topk(routing_weights, self.top_k, dim=-1)
routing_weights /= routing_weights.sum(dim=-1, keepdim=True)
# we cast back to the input dtype
routing_weights = routing_weights.to(hidden_states.dtype)
final_hidden_states = torch.zeros(
(batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype, device=hidden_states.device
)
# One hot encode the selected experts to create an expert mask
# this will be used to easily index which expert is going to be solicited
expert_mask = torch.nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0)
# Loop over all available experts in the model and perform the computation on each expert
for expert_idx in range(self.num_experts):
expert_layer = self.experts[expert_idx]
idx, top_x = torch.where(expert_mask[expert_idx])
if top_x.shape[0] == 0:
continue
# Index the correct hidden states and compute the expert hidden state for
# the current expert. We need to make sure to multiply the output hidden
# states by `routing_weights` on the corresponding tokens (top-1 and top-2)
current_state = hidden_states[None, top_x].reshape(-1, hidden_dim)
current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None]
# However `index_add_` only support torch tensors for indexing so we'll use
# the `top_x` tensor here.
final_hidden_states.index_add_(0, top_x, current_hidden_states.to(hidden_states.dtype))
final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim)
return final_hidden_states, router_logits
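A quick smoke test of the block in isolation (a sketch; it assumes the classes above are in scope together with their imports such as torch.nn.functional and ACT2FN, and the batch/sequence sizes are arbitrary):
import torch
from transformers import BartConfig

# config values mirror the model config above
cfg = BartConfig(d_model=256, encoder_ffn_dim=1024, activation_function="gelu")
moe = BartSparseMoeBlock(cfg)

x = torch.randn(2, 8, cfg.d_model)  # (batch, seq_len, d_model)
out, router_logits = moe(x)
print(out.shape)            # torch.Size([2, 8, 256])
print(router_logits.shape)  # torch.Size([16, 16]): (batch * seq_len, num_experts)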
BartEncoderLayer
class BartEncoderLayer(nn.Module):
def __init__(self, config: BartConfig):
super().__init__()
self.embed_dim = config.d_model
self.self_attn = BART_ATTENTION_CLASSES[config._attn_implementation](
embed_dim=self.embed_dim,
num_heads=config.encoder_attention_heads,
dropout=config.attention_dropout,
config=config,
)
self.self_attn_layer_norm = nn.LayerNorm(self.embed_dim)
self.dropout = config.dropout
+ self.block_sparse_moe = BartSparseMoeBlock(config)
- self.activation_fn = ACT2FN[config.activation_function]
- self.activation_dropout = config.activation_dropout
- self.fc1 = nn.Linear(self.embed_dim, config.encoder_ffn_dim)
- self.fc2 = nn.Linear(config.encoder_ffn_dim, self.embed_dim)
self.final_layer_norm = nn.LayerNorm(self.embed_dim)
def forward(
self,
hidden_states: torch.FloatTensor,
attention_mask: torch.FloatTensor,
layer_head_mask: torch.FloatTensor,
output_attentions: Optional[bool] = False,
) -> Tuple[torch.FloatTensor, Optional[torch.FloatTensor]]:
"""
Args:
hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
attention_mask (`torch.FloatTensor`): attention mask of size
`(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
layer_head_mask (`torch.FloatTensor`): mask for attention heads in a given layer of size
`(encoder_attention_heads,)`.
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under
returned tensors for more detail.
"""
residual = hidden_states
hidden_states, attn_weights, _ = self.self_attn(
hidden_states=hidden_states,
attention_mask=attention_mask,
layer_head_mask=layer_head_mask,
output_attentions=output_attentions,
)
hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)
hidden_states = residual + hidden_states
hidden_states = self.self_attn_layer_norm(hidden_states)
residual = hidden_states
+ hidden_states, router_logits = self.block_sparse_moe(hidden_states)
- hidden_states = self.activation_fn(self.fc1(hidden_states))
- hidden_states = nn.functional.dropout(hidden_states, p=self.activation_dropout, training=self.training)
- hidden_states = self.fc2(hidden_states)
- hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)
hidden_states = residual + hidden_states
hidden_states = self.final_layer_norm(hidden_states)
if hidden_states.dtype == torch.float16 and (
torch.isinf(hidden_states).any() or torch.isnan(hidden_states).any()
):
clamp_value = torch.finfo(hidden_states.dtype).max - 1000
hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)
outputs = (hidden_states,)
if output_attentions:
outputs += (attn_weights,)
return outputs
BartDecoderLayer
class BartDecoderLayer(nn.Module):
def __init__(self, config: BartConfig):
super().__init__()
self.embed_dim = config.d_model
self.self_attn = BART_ATTENTION_CLASSES[config._attn_implementation](
embed_dim=self.embed_dim,
num_heads=config.decoder_attention_heads,
dropout=config.attention_dropout,
is_decoder=True,
is_causal=True,
config=config,
)
self.dropout = config.dropout
"""
self.activation_fn = ACT2FN[config.activation_function]
self.activation_dropout = config.activation_dropout
"""
self.self_attn_layer_norm = nn.LayerNorm(self.embed_dim)
self.encoder_attn = BART_ATTENTION_CLASSES[config._attn_implementation](
self.embed_dim,
config.decoder_attention_heads,
dropout=config.attention_dropout,
is_decoder=True,
config=config,
)
self.encoder_attn_layer_norm = nn.LayerNorm(self.embed_dim)
+ self.block_sparse_moe = BartSparseMoeBlock(config)
- self.fc1 = nn.Linear(self.embed_dim, config.decoder_ffn_dim)
- self.fc2 = nn.Linear(config.decoder_ffn_dim, self.embed_dim)
self.final_layer_norm = nn.LayerNorm(self.embed_dim)
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
encoder_hidden_states: Optional[torch.Tensor] = None,
encoder_attention_mask: Optional[torch.Tensor] = None,
layer_head_mask: Optional[torch.Tensor] = None,
cross_attn_layer_head_mask: Optional[torch.Tensor] = None,
past_key_value: Optional[Tuple[torch.Tensor]] = None,
output_attentions: Optional[bool] = False,
use_cache: Optional[bool] = True,
) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
"""
Args:
hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
attention_mask (`torch.FloatTensor`): attention mask of size
`(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
encoder_hidden_states (`torch.FloatTensor`):
cross attention input to the layer of shape `(batch, seq_len, embed_dim)`
encoder_attention_mask (`torch.FloatTensor`): encoder attention mask of size
`(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
layer_head_mask (`torch.FloatTensor`): mask for attention heads in a given layer of size
`(encoder_attention_heads,)`.
cross_attn_layer_head_mask (`torch.FloatTensor`): mask for cross-attention heads in a given layer of
size `(decoder_attention_heads,)`.
past_key_value (`Tuple(torch.FloatTensor)`): cached past key and value projection states
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under
returned tensors for more detail.
"""
residual = hidden_states
# Self Attention
# decoder uni-directional self-attention cached key/values tuple is at positions 1,2
self_attn_past_key_value = past_key_value[:2] if past_key_value is not None else None
# add present self-attn cache to positions 1,2 of present_key_value tuple
hidden_states, self_attn_weights, present_key_value = self.self_attn(
hidden_states=hidden_states,
past_key_value=self_attn_past_key_value,
attention_mask=attention_mask,
layer_head_mask=layer_head_mask,
output_attentions=output_attentions,
)
hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)
hidden_states = residual + hidden_states
hidden_states = self.self_attn_layer_norm(hidden_states)
# Cross-Attention Block
cross_attn_present_key_value = None
cross_attn_weights = None
if encoder_hidden_states is not None:
residual = hidden_states
# cross_attn cached key/values tuple is at positions 3,4 of present_key_value tuple
cross_attn_past_key_value = past_key_value[-2:] if past_key_value is not None else None
hidden_states, cross_attn_weights, cross_attn_present_key_value = self.encoder_attn(
hidden_states=hidden_states,
key_value_states=encoder_hidden_states,
attention_mask=encoder_attention_mask,
layer_head_mask=cross_attn_layer_head_mask,
past_key_value=cross_attn_past_key_value,
output_attentions=output_attentions,
)
hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)
hidden_states = residual + hidden_states
hidden_states = self.encoder_attn_layer_norm(hidden_states)
# add cross-attn to positions 3,4 of present_key_value tuple
present_key_value = present_key_value + cross_attn_present_key_value
# Fully Connected
residual = hidden_states
+ hidden_states, router_logits = self.block_sparse_moe(hidden_states)
- hidden_states = self.activation_fn(self.fc1(hidden_states))
- hidden_states = nn.functional.dropout(hidden_states, p=self.activation_dropout, training=self.training)
- hidden_states = self.fc2(hidden_states)
- hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)
hidden_states = residual + hidden_states
hidden_states = self.final_layer_norm(hidden_states)
outputs = (hidden_states,)
if output_attentions:
outputs += (self_attn_weights, cross_attn_weights)
if use_cache:
outputs += (present_key_value,)
return outputs
Feature Invariant
This is the technique described in Figure 2 of the original paper, Applying the Transformer to Character-level Transduction: when the input takes a form like <tag1><tag2><tag3>lemma, positional encoding is not applied to the tags.
Note that this change decreased the score (see the score tables).
CustomBartLearnedPositionalEmbedding
class CustomBartLearnedPositionalEmbedding(nn.Embedding):
"""
This module learns positional embeddings up to a fixed maximum size.
"""
def __init__(self, num_embeddings: int, embedding_dim: int):
# Bart is set up so that if padding_idx is specified then offset the embedding ids by 2
# and adjust num_embeddings appropriately. Other models don't have this hack
+ self.offset = 0
- self.offset = 2
super().__init__(num_embeddings + self.offset, embedding_dim)
def forward(self, input_ids: torch.Tensor, past_key_values_length: int = 0, zero_count=None):
"""`input_ids' shape is expected to be [bsz x seqlen]."""
bsz, seq_len = input_ids.shape[:2]
+ if zero_count == bsz*[0]:
+ positions = torch.arange(
+ past_key_values_length, past_key_values_length + seq_len, dtype=torch.long, device=self.weight.device
+ ).expand(bsz, -1)
+ else:
+ positions = []
+ for i, zero_c in zip(range(bsz), zero_count):
+ position = torch.cat((torch.zeros(zero_c - 1, dtype=torch.long, device=self.weight.device), torch.arange(
+ past_key_values_length, past_key_values_length + seq_len - zero_c + 1, dtype=torch.long, device=self.weight.device
+ )), dim=0).tolist()
+ positions.append(position)
+ positions = torch.tensor(positions, device=self.weight.device)
+ positions = torch.cat((positions,), dim=0)
- positions = torch.arange(
- past_key_values_length, past_key_values_length + seq_len, dtype=torch.long, device=self.weight.device
- ).expand(bsz, -1)
return super().forward(positions + self.offset)
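To illustrate the effect (a sketch with dummy ids): for a 6-token input whose first 3 tokens are tags, passing zero_count=[3] assigns position 0 to every tag token while the lemma characters receive positions 1, 2, 3.
import torch

# only the shape (bsz=1, seq_len=6) of the ids matters here; zero_count says the first 3 tokens are tags
emb = CustomBartLearnedPositionalEmbedding(num_embeddings=64, embedding_dim=256)
dummy_ids = torch.zeros(1, 6, dtype=torch.long)
pos = emb(dummy_ids, zero_count=[3])
# position ids fed to the embedding table: [[0, 0, 0, 1, 2, 3]]
print(pos.shape)  # torch.Size([1, 6, 256])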
CustomBartEncoder
class BartEncoder(BartPreTrainedModel):
### omitted ###
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
head_mask: Optional[torch.Tensor] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, BaseModelOutput]:
### omitted ###
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
)
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+ zero_number = []
+ if input_ids is not None:
+ for ids in input_ids.tolist():
+ zero_count = 0
+ for id in ids:
+ if id in self.zero_id:
+ zero_count += 1
+ zero_number.append(zero_count)
+ zero_count = zero_number
### omitted ###
if inputs_embeds is None:
inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale
+ embed_pos = self.embed_positions(input, zero_count=zero_count)
- embed_pos = self.embed_positions(input)
embed_pos = embed_pos.to(inputs_embeds.device)
### omitted ###
CustomBartDecoder
class BartDecoder(BartPreTrainedModel):
### omitted ###
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
encoder_hidden_states: Optional[torch.FloatTensor] = None,
encoder_attention_mask: Optional[torch.LongTensor] = None,
head_mask: Optional[torch.Tensor] = None,
cross_attn_head_mask: Optional[torch.Tensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, BaseModelOutputWithPastAndCrossAttentions]:
### omitted ###
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
)
use_cache = use_cache if use_cache is not None else self.config.use_cache
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+ zero_number = []
+ if input_ids is not None:
+ for ids in input_ids.tolist():
+ zero_count = 0
+ for id in ids:
+ if id in self.zero_id:
+ zero_count += 1
+ zero_number.append(zero_count)
+ zero_count = zero_number
### omitted ###
# embed positions
+ positions = self.embed_positions(input, past_key_values_length, zero_count=zero_count)
- positions = self.embed_positions(input, past_key_values_length)
positions = positions.to(inputs_embeds.device)
### omitted ###
Score tables
Comparison experiment scores
score | architecture | lemma/tag swap | space removal | trailing space | act fn | offset | encoder bos | experts | topk | bias | batch size | d_model | steps |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.523 | BART | × | ○ | × | gelu | 2 | × | ||||||
0.491 | BART | ○ | ○ | × | gelu | 2 | × | ||||||
0.353 | T5 | ○ | ○ | × | gelu | N/A | × | ||||||
0.506 | BART | × | × | × | gelu | 2 | × | ||||||
0.372 | T5 | × | × | × | gelu | N/A | × | ||||||
0.392 | T5 | × | × | × | relu | N/A | × | ||||||
0.472 | BART | ○ | ○ | × | relu | 2 | × | ||||||
0.498 | BART | ○ | ○ | × | gelu | 0 | × | ||||||
0.483 | BART | ○ | ○ | × | gelu | Feature Invariant | × | ||||||
0.471 | BART | ○ | ○ | ○ | gelu | 0 | × | ||||||
0.458 | BART | ○ | ○ | × | silu | 0 | × | ||||||
0.532 | BART | × | ○ | × | gelu | 0 | × | 400 | |||||
0.51 | BART | × | ○ | × | gelu | 0 | ○ | ||||||
0.42 | BART | × | ○ | × | gelu | 0 | × | 8 | 2 | ○ | |||
0.468 | BART | × | ○ | × | gelu | 0 | × | 8 | 2 | × | |||
0.474 | BART | × | ○ | × | gelu | 0 | × | 4 | 1 | ○ | |||
0.474 | BART | × | ○ | × | gelu | 0 | × | 16 | 4 | ○ | |||
0.533 | BART | × | ○ | × | gelu | 0 | × | ○ | 800 | 10000 | |||
0.431 | BART | × | ○ | × | gelu | 0 | × | ○ | 4000 | 2000 | |||
0.528 | BART | × | ○ | × | gelu | 0 | × | × | 400 | ||||
0.493 | BART | × | ○ | × | gelu | 0 | × | ○ | 400 | 512 | |||
0.482 | BART | × | ○ | × | gelu | 0 | × | ○ | 800 | 20000 |
Baseline (original Colab)
score | architecture | lemma/tag swap | space removal | trailing space | act fn | offset | encoder bos |
---|---|---|---|---|---|---|---|
0.392 | T5 | × | × | × | relu | N/A | × |
0.346 | T5 | × | × | × | relu | N/A | × |
0.439 | T5 | × | × | × | relu | N/A | × |
0.219 | T5 | × | × | × | relu | N/A | × |
Hyperparameter search scores (T5)
score | Feature Invariant | lemma/tag swap | space removal | dropout | adam beta2 | label smoothing |
---|---|---|---|---|---|---|
0.33 | ○ | ○ | ○ | 0.2 | 0.98 | 0.1 |
0.386 | ○ | ○ | ○ | 0.3 | 0.999 | 0 |
0.34 | ○ | ○ | ○ | 0.3 | 0.98 | 0.1 |
0.467 | × | ○ | ○ | 0.3 | 0.98 | 0.1 |
0.396 | × | × | × | 0.1 | 0.999 | 0.1 |
0.386 | × | × | × | 0.3 | 0.98 | 0.1 |
0.319 | × | × | × | 0.3 | 0.999 | 0 |
0.335 | × | × | × | 0.1 | 0.999 | 0 |
0.196 | × | × | × | 0.1 | 0.98 | 0 |
0.239 | × | × | × | 0.3 | 0.98 | 0 |
Closing remarks
If anything looks off, feel free to get in touch. I am not very confident about the implementation around Feature Invariant.
Incidentally, IOAI is apparently accepting applications until April 17.