Open · Last comment added 2023/12/25 · 16 comments

Probabilistic Programming and LLMs

seya

I heard that probabilistic programming might help reduce LLM hallucinations and improve accuracy, so this scrap is where I study it.

seya

This is the paper that tries out using probabilistic programming to improve LLM reliability:
https://arxiv.org/abs/2306.03081

PoC source code:
https://github.com/probcomp/hfppl

seya

Terms that appear in the paper

Feynman-Kac Transformer models

A probabilistic model over Transformer token sequences that is amenable to SMC and can encode a variety of language generation tasks.

SMC Transformer steering (Sequential Monte Carlo steering)

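A variant of SMC specialized for Feynman-Kac Transformer models.
It uses a without-replacement particle resampling strategy to avoid particle degeneracy, and caches neural activations to avoid duplicating computation across particles.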

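-> "Particle" is SMC terminology: sequential Monte Carlo uses "a set of particles (samples) that represent the state of a system and simulates how they evolve over time".
-> From reading the source code, a particle seems to correspond to one run of the model (LLM). Several LLM generations run in parallel, and after each token an evaluation step constrains where the next token can go. Generation then resumes from partway through the sequence, so the results computed up to that point are cached and reused to speed things up. Probably.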
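To make the caching idea concrete for myself, here is a minimal sketch. The `next_token_logits` function is a made-up stand-in for an expensive LLM forward pass, and keying a cache by the token prefix is my assumption about the general idea; the library's real cache is presumably organized differently.

```python
from functools import lru_cache

calls = 0  # counts how often the (fake) model is actually evaluated


@lru_cache(maxsize=None)
def next_token_logits(prefix: tuple) -> tuple:
    """Hypothetical stand-in for an expensive LLM forward pass over `prefix`."""
    global calls
    calls += 1
    # Fake "logits" over a 3-token vocabulary, derived from the prefix only.
    return tuple(float((sum(prefix) + i) % 3) for i in range(3))


# Four particles; three of them share the same prefix (e.g. after resampling).
particles = [(1, 2), (1, 2), (1, 2), (1, 0)]
for prefix in particles:
    next_token_logits(prefix)

print(calls)  # 2 distinct prefixes -> the fake model is evaluated only twice
```

Particles that were duplicated by resampling share a prefix, so the work for that prefix only has to be done once.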

LLaMPPL

A library (the source code linked above) that builds Feynman-Kac Transformer models as probabilistic programs that call a Llama Transformer, and automates SMC steering.

Background knowledge to look up

Monte Carlo method

The Monte Carlo method is a technique for performing numerical computation or simulation using randomly drawn numbers.

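My understanding: run a huge number of random simulations and estimate a probability from the results (I hope that's right...). It is useful for complex or high-dimensional problems that have no analytical solution, since random sampling lets you explore the probabilistic properties of the problem (by ChatGPT).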
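A classic toy example of this idea is estimating pi from random points. This is only a sketch to check my understanding; the function name and the seed are my own choices.

```python
import random


def estimate_pi(n_samples: int, seed: int = 0) -> float:
    """Estimate pi by sampling points uniformly in the unit square."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        # A point lands inside the quarter circle with probability pi / 4.
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_samples


print(estimate_pi(100_000))
```

The more samples you draw, the closer the estimate tends to get to the true value, which is the "simulate a lot and count" intuition above.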

▼ Trivia

Monte Carlo is a district of the Principality of Monaco in Europe, famous for its casino.

https://www.jaysong.net/RBook/monte.html#:~:text=モンテカルロ法 (Monte%20Carlo%20method,%E3%81%A7%E6%9C%89%E5%90%8D%E3%81%AA%E3%81%A8%E3%81%93%E3%82%8D%E3%81%A7%E3%81%99%E3%80%82

Sequential Monte Carlo(SMC)

A family of Monte Carlo methods used to estimate systems or random processes that change over time.
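As a generic illustration (not taken from the paper), here is a small bootstrap particle filter, probably the most common form of SMC. The 1-D random-walk model, the noise levels, and all names are assumptions I made up for the sketch.

```python
import math
import random


def particle_filter(observations, n_particles=500, seed=0):
    """Bootstrap particle filter for a 1-D random walk observed with Gaussian noise."""
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    estimates = []
    for y in observations:
        # 1. Propagate: move every particle with the (assumed) transition model.
        particles = [x + rng.gauss(0.0, 0.5) for x in particles]
        # 2. Weight: score each particle by how well it explains the observation.
        weights = [math.exp(-0.5 * (y - x) ** 2) for x in particles]
        total = sum(weights)
        # Weighted mean of the particles = estimate of the hidden state.
        estimates.append(sum(w * x for w, x in zip(weights, particles)) / total)
        # 3. Resample: duplicate likely particles and drop unlikely ones.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates


print(particle_filter([0.2, 0.5, 1.1, 1.4, 2.0]))
```

The propagate / weight / resample loop is the part that carries over to the LLM setting, where "propagate" becomes "generate the next token".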

seya

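As sketched below, an SMC evaluation step runs each time a token is generated, and the candidate tokens are filtered according to the constraints specified by the probabilistic program.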

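Generation process of a Feynman-Kac Transformer model under SMC steering:

1. Initial tokens / prompt ---> [Transformer model] ---+---> candidate tokens
   (no probabilistic constraint applied yet)

2. Candidate tokens ---------> [SMC steering] ------> tokens selected under the constraints
   (SMC examines and selects candidate tokens based on the constraints supplied as a probabilistic program)

3. Selected tokens --------> [Transformer model] ---+---> new candidate tokens
   (the selected tokens are used as the next input)

4. New candidate tokens ---> [SMC steering] ------> next tokens selected under the constraints
   ...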
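The loop in the diagram can be sketched with a toy "language model". Everything here is hypothetical: `toy_lm` is a hand-written probability table, the constraint is "tokens of at most 3 letters", and the resampling is plain multinomial resampling with replacement, not the without-replacement strategy from the paper.

```python
import random

VOCAB = ["a", "bb", "ccc", "dddddd", "<eos>"]


def toy_lm(context):
    """Hypothetical stand-in for an LLM: next-token probabilities given the context."""
    if context and context[-1] == "a":
        return [0.1, 0.1, 0.1, 0.6, 0.1]  # after "a", the long word is likely
    return [0.3, 0.3, 0.2, 0.1, 0.1]


def allowed(token):
    """Constraint: only tokens of at most 3 letters (or end-of-sequence)."""
    return token == "<eos>" or len(token) <= 3


def smc_steer(n_particles=20, max_tokens=5, seed=0):
    rng = random.Random(seed)
    particles = [[] for _ in range(n_particles)]
    for _ in range(max_tokens):
        weights = []
        for ctx in particles:
            if ctx and ctx[-1] == "<eos>":
                weights.append(1.0)  # finished particles are left unchanged
                continue
            probs = toy_lm(ctx)
            # Zero out tokens that violate the constraint.
            masked = [p if allowed(t) else 0.0 for p, t in zip(probs, VOCAB)]
            ctx.append(rng.choices(VOCAB, weights=masked, k=1)[0])
            # Weight = probability mass the model put on allowed tokens.
            weights.append(sum(masked))
        # Resample: contexts that drift toward forbidden tokens get dropped.
        chosen = rng.choices(particles, weights=weights, k=n_particles)
        particles = [list(p) for p in chosen]
    return particles


for p in smc_steer(n_particles=5):
    print(" ".join(p))
```

Contexts ending in "a" get a low weight (the model wants to say "dddddd" next), so resampling tends to replace them with contexts that satisfy the constraint more naturally.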

seya

Let's read the source code

This is the part that runs the sample.
ConstraintModel is treated as a particle: the smc_standard function creates as many copies of ConstraintModel as the number of particles given in its second argument.

async def main():
    constraint_model = ConstraintModel(prompt, 50)
    particles = await smc_standard(constraint_model, 40)
    for p in particles:
        print(f"{p.context}")

asyncio.run(main())

By the way, the LLM that ConstraintModel calls is set up as follows. from_pretrained wraps a function from the transformers library.

LLM = CachedCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", auth_token=HF_AUTH_TOKEN)
LLM.batch_size = 40

seya

Inside smc_standard, many copies of ConstraintModel are created:

particles = [copy.deepcopy(model) for _ in range(n_particles)]

It then runs each ConstraintModel's step() function, so this is presumably what performs the LLM inference:

await asyncio.gather(*[p.step() for p in particles if not p.done_stepping()])
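To see what that `asyncio.gather` line does in isolation, here is a sketch with a made-up `ToyParticle` class (no LLM involved). My guess is that awaiting all steps together is what lets the real library batch the LLM calls, but that part is an assumption.

```python
import asyncio


class ToyParticle:
    """Hypothetical particle that finishes after a fixed number of steps."""

    def __init__(self, n_steps):
        self.remaining = n_steps
        self.steps_taken = 0

    def done_stepping(self):
        return self.remaining == 0

    async def step(self):
        # Yield control, as a real particle would while awaiting an LLM call.
        await asyncio.sleep(0)
        self.remaining -= 1
        self.steps_taken += 1


async def run(particles):
    rounds = 0
    while any(not p.done_stepping() for p in particles):
        # Advance every unfinished particle by one step, concurrently.
        await asyncio.gather(*[p.step() for p in particles if not p.done_stepping()])
        rounds += 1
    return rounds


particles = [ToyParticle(n) for n in (1, 3, 2)]
rounds = asyncio.run(run(particles))
print(rounds, [p.steps_taken for p in particles])  # 3 [1, 3, 2]
```

Finished particles are simply skipped, so the loop runs as many rounds as the longest particle needs.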

seya

Contents of step

class ConstraintModel(Model):
    def __init__(self, prompt, max_tokens):
        super().__init__()
        self.context = LMContext(LLM, prompt)
        self.q       = LMContext(LLM, prompt)
        self.max_tokens = max_tokens

    async def step(self):
        # Restrict the usable tokens here (in the example, words of at most 5 letters)
        mask = self.active_constraint_mask()
        
        # next_token() returns a distribution over tokens, not a single next token
        # proposal is a distribution based on the mask (I don't quite get what dist.log_prob(x) does inside intervene; the returned x isn't even used)
        token = await self.sample(self.context.next_token(), proposal = await self.proposal(mask))

        # Condition on constraint — a no-op since proposal already guarantees the constraint
        self.condition(token.token_id in mask)
        
        # Reduce number of max tokens remaining
        self.max_tokens -= 1
        
        print(f"{self.context}")

        # Check if done
        if token == LLM.tokenizer.eos_token_id or self.max_tokens == 0:
            self.finish()
            return
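Sampling from a proposal instead of the model's own distribution is importance sampling: the drawn value gets weight p(x) / q(x). Here is a sketch with made-up numbers, assuming the proposal q is just the model distribution p restricted to the mask and renormalized (my reading of the example, not something I verified for every case).

```python
import math
import random


def sample_with_proposal(p, q, rng):
    """Draw x from proposal q; return (x, log p(x) - log q(x))."""
    tokens = list(q)
    x = rng.choices(tokens, weights=[q[t] for t in tokens], k=1)[0]
    return x, math.log(p[x]) - math.log(q[x])


# Target next-token distribution p and an allowed-token mask (both made up).
p = {"cat": 0.5, "dog": 0.3, "elephant": 0.2}
mask = {"cat", "dog"}

# Proposal q: p restricted to the mask and renormalized.
mass = sum(p[t] for t in mask)
q = {t: p[t] / mass for t in mask}

x, log_w = sample_with_proposal(p, q, random.Random(0))
print(x, math.exp(log_w))  # the weight equals the probability mass of the mask
```

Whichever allowed token is drawn, the weight comes out as the total probability the model assigned to the mask, so particles in contexts where the constraint is "unnatural" are down-weighted.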

seya

First, keep in mind that there are three operations on an LLM's next-token prediction:

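Sampling: choosing the next token probabilistically. A token is drawn at random according to the probabilities the model assigns to each token.
Observing: fixing the next token to a given value and conditioning on it. Nothing is drawn at random; instead, the probability the model assigns to that value is treated as evidence (in SMC, it scales the particle's weight).
Intervening: stepping in on the choice of the next token in some way, for example forcing a specific token, manually adjusting the probability of specific tokens, or changing the distribution itself. Interventions are used to guide the content of the generated text or to produce text with particular properties.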

Intervening here means an intervention like the one above, where only tokens inside a given mask can be chosen.
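In probabilistic-programming terms, my reading is that the three operations differ mainly in what they do to a particle's weight. Here is a toy sketch of that reading; `ToyModel` and the two-value distribution are made up and are not hfppl's classes.

```python
import math
import random


class ToyModel:
    """Tracks a log-weight, mimicking how a particle accumulates evidence."""

    def __init__(self, seed=0):
        self.log_weight = 0.0
        self.rng = random.Random(seed)

    def sample(self, dist):
        # Draw a value according to the distribution; the weight is unchanged.
        values = list(dist)
        return self.rng.choices(values, weights=[dist[v] for v in values], k=1)[0]

    def observe(self, dist, x):
        # Condition on the value being x: multiply the weight by its probability.
        self.log_weight += math.log(dist[x])
        return x

    def intervene(self, dist, x):
        # Force the value to x without touching the weight.
        return x


dist = {"yes": 0.25, "no": 0.75}
m = ToyModel()
m.sample(dist)           # weight stays 1
m.observe(dist, "yes")   # weight becomes 0.25
m.intervene(dist, "no")  # weight stays 0.25
print(math.exp(m.log_weight))
```

Only `observe` changes the weight here, which is why it encodes "this really had to happen" while `intervene` just sets a value.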

seya

It's starting to look like this follows the transformers library fairly closely.

The lm used at initialization is (a wrapper around) AutoModelForCausalLM.from_pretrained.

class LMContext:    
    def __init__(self, lm, prompt, temp=1.0):
        """Create a new `LMContext` with a given prompt and temperature.
        
        Args:
            lm (hfppl.llms.CachedCausalLM): the language model for which this is a context.
            prompt (str): a string with which to initialize the context. Will be tokenized using `lm.tokenizer`.
            temp (float): temperature for next-token distribution (0 < temp < float('inf'))"""
        self.lm                  = lm
        self.s                   = TokenSequence(lm, prompt)
        self.next_token_logprobs = log_softmax(lm.next_token_logprobs_unbatched(self.s.seq) / temp)
        self.temp                = temp
        self.NO_MASK    = set(range(len(self.lm.vocab)))
        self.model_mask = self.NO_MASK
        self.prompt_string_length = len(str(self.s))
        self.show_prompt = False
        
    def next_token(self):
        """Distribution over the next token.
        
        Sampling or observing from this distribution advances the state of this `LMContext` instance."""
        return LMNextToken(self)
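The `log_softmax(... / temp)` line in the constructor is ordinary temperature scaling. Here is a pure-Python sketch of what it does to a distribution; the helper names and the logits are my own and the real code works on arrays.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def next_token_probs(logits, temp=1.0):
    # Divide by the temperature before normalizing, as in the constructor above.
    return [math.exp(lp) for lp in log_softmax([x / temp for x in logits])]


logits = [2.0, 1.0, 0.0]
print(next_token_probs(logits, temp=1.0))  # baseline distribution
print(next_token_probs(logits, temp=0.5))  # sharper: top token gets more mass
print(next_token_probs(logits, temp=5.0))  # flatter: closer to uniform
```

A temperature below 1 concentrates probability on the most likely tokens, and a temperature above 1 spreads it out.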

seya

I don't see why this function restricts the distribution, so I want to understand it.

    async def intervene(self, dist, x):
        """Force the distribution to take on the value `x`, but do not _condition_ on this result.
        
        This is useful primarily with distributions that have side effects (e.g., modifying some state).
        For example, a model with the code
        
        ```python
        token_1 = await self.sample(self.stateful_lm.next_token())
        await self.observe(self.stateful_lm.next_token(), token_2)
        ```
        
        encodes a posterior inference problem, to find `token_1` values that *likely preceded* `token_2`. By contrast,
        
        ```python
        token_1 = await self.sample(self.stateful_lm.next_token())
        await self.intervene(self.stateful_lm.next_token(), token_2)
        ```
        
        encodes a much easier task: freely generate `token_1` and then force-feed `token_2` as the following token.
        
        Args:
            dist (hfppl.distributions.distribution.Distribution): the distribution on which to intervene.
            x: the value to intervene with.
        """
        await dist.log_prob(x)
        return x

It is called like this. q is an LMContext, and mask is the set of token ids to restrict to.

await self.intervene(self.q.mask_dist(mask), True)

Contents of mask_dist.

    def mask_dist(self, mask):
        """Bernoulli distribution, with probability of True equal to the probability that the next token of this `LMContext` belongs
        to the given mask.
        
        Sampling or observing from this distribution modifies the state of this `LMContext` instance, so that
        the `next_token()` distribution either *will* (if True) or *will not* (if False) generate a token from
        the given mask.
        
        Args:
            mask: a `set(int)` specifying which token ids are included within the mask."""
        return LMTokenMask(self, mask)

Implementation of log_prob. ctx is the LMContext declared in ConstraintModel (self.q in the call above). log_prob mutates that context, which is why the distribution changes.

class LMTokenMask(Distribution):
    def __init__(self, ctx, mask):
        self.ctx  = ctx
        self.mask = mask
        
    async def log_prob(self, v):
        good_tokens  = self.ctx.model_mask.intersection(self.mask) if v else self.ctx.model_mask - self.mask
        bad_tokens   = [i for i in self.ctx.model_mask if i not in good_tokens]
        logprob_good = logsumexp(self.ctx.next_token_logprobs[list(good_tokens)])
        self.ctx.next_token_logprobs[bad_tokens] = float('-inf')
        self.ctx.next_token_logprobs -= logprob_good
        self.ctx.model_mask = good_tokens
        return logprob_good
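To convince myself of what this does, here is a pure-Python mirror of the same logic on a three-token vocabulary. The function name `apply_mask` and the numbers are made up, and it returns new values instead of mutating a context like the original does.

```python
import math


def logsumexp(xs):
    """log(sum(exp(x))) computed in a numerically stable way."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def apply_mask(logprobs, model_mask, mask, value=True):
    """Return (new_logprobs, new_mask, logprob_good), mirroring log_prob above."""
    good = model_mask & mask if value else model_mask - mask
    # Log of the total probability mass on the tokens that stay allowed.
    logprob_good = logsumexp([logprobs[i] for i in good])
    # Forbidden tokens go to -inf; allowed ones are renormalized.
    new_logprobs = [
        lp - logprob_good if i in good else float("-inf")
        for i, lp in enumerate(logprobs)
    ]
    return new_logprobs, good, logprob_good


logprobs = [math.log(p) for p in (0.5, 0.3, 0.2)]
new_logprobs, new_mask, lp_good = apply_mask(logprobs, {0, 1, 2}, {0, 2})
print(math.exp(lp_good))                      # probability mass inside the mask
print([math.exp(lp) for lp in new_logprobs])  # renormalized distribution
```

So `intervene(mask_dist(mask), True)` works purely through this side effect: the context's stored log-probabilities are overwritten, and the next `next_token()` can only produce tokens from the mask. The sketch assumes at least one token remains allowed.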