An Introduction to Simple Test-time Scaling (s1: Simple test-time scaling)

Taiju Watanabe

Published 2025/12/08

TTS

LLM

idea

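Introduction

Nice to meet you. I'm Watanabe, a data scientist at Matsuo Institute.

This article is part of the Matsuo Institute Advent Calendar 2025.

In this post I introduce the following paper on simple test-time scaling [1] and share the results of running it myself.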

https://arxiv.org/abs/2501.19393

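Overview of Test-time Scaling

In a sentence, test-time scaling is a technique that spends additional compute at inference time to improve accuracy and the quality of reasoning.

Until now, as exemplified by scaling laws, the standard approach was training-time scaling: improving model performance by pretraining on large amounts of data over a very long time. However, large-scale pretraining requires enormous compute, so only a handful of research institutions can carry it out. Inference-time scaling (test-time scaling), which needs far less compute than pretraining and is open to many researchers, has therefore been attracting attention recently. The key question is how to draw out, at inference time, as much as possible of the knowledge embedded in the LLM during pretraining.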

Types of Test-time Scaling

According to a survey paper on test-time scaling [2], the methods fall into four broad categories.


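Method	Overview
Sequential Scaling	Later steps are decided carefully based on the results of intermediate steps. s1, described below, is one example: it inserts "Wait" partway through so that the model re-examines its own reasoning.
Parallel Scaling	An LLM normally generates one response per query. Parallel scaling instead generates multiple outputs in parallel and aggregates them into a final answer. Representative examples are Best-of-N and majority voting over multiple sampled reasoning paths.
Hybrid Scaling	Combines sequential and parallel scaling. It models human problem solving by combining the "convergent thinking" of sequential scaling, which scrutinizes and evaluates each step, with the "divergent thinking" of parallel scaling, which generates multiple candidate paths.
Internal Scaling	The model itself (through its internal parameters) autonomously decides how much compute to spend on reasoning at test time, without relying on external, human-designed strategies. Sequential scaling involves human intervention such as inserting prompts partway through, and parallel scaling involves steps such as a majority vote at the end. With internal scaling, the model examines its own reasoning and evaluates itself without such intervention. Typical examples are reasoning models such as DeepSeek-R1, where large-scale RL led self-evaluation such as "Wait a moment" to emerge spontaneously during inference.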
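As a rough illustration of the parallel scaling category, the aggregation step can be sketched in a few lines of Python. This is a minimal sketch and not taken from the survey: the function names and the sample answers are made up, and `score` stands in for a hypothetical scorer such as a verifier or reward model.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate that a (hypothetical) scorer rates highest."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return max(answers, key=score)


# Made-up final answers extracted from five independently sampled reasoning paths.
sampled = ["3", "2", "3", "3", "2"]
print(majority_vote(sampled))  # prints 3
```

Majority voting needs no extra model, while Best-of-N depends on how good the scorer is; both spend more inference compute by sampling several paths instead of one.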

This post focuses on sequential scaling and examines its effectiveness through the paper "s1: Simple test-time scaling".

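Introducing s1: Simple test-time scaling

This paper achieves test-time scaling with a very simple sequential scaling algorithm. Specifically, it applies a technique called "budget forcing". Budget forcing has two parts, enforcing a maximum number of thinking tokens and enforcing a minimum number, which aim to cut off unnecessary thinking and, conversely, to make the model think further when its thinking is insufficient.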
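Enforcing a maximum number of tokens
To keep the model from reasoning longer than necessary, the thinking phase is ended early and the model is made to output its answer at that point. The thinking steps are cut off by inserting the end-of-thinking delimiter token and, if needed, the string "Final Answer:".

Enforcing a minimum number of tokens
To keep the model from finishing its thinking too early, generation of the end-of-thinking delimiter token is suppressed and, if needed, the string "Wait" is appended. This encourages the model to reflect on its current reasoning and reconsider it.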
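The two rules can be sketched as one control loop over a toy model. This is a minimal sketch under stated assumptions, not the paper's implementation: `step`, `toy_step`, and the `<end>` delimiter are hypothetical stand-ins for a real model call and its end-of-thinking token, and tokens are plain strings.

```python
from typing import Callable, List

END_OF_THINKING = "<end>"  # hypothetical delimiter standing in for the real token


def budget_forced_thinking(
    step: Callable[[List[str], int], List[str]],
    max_tokens: int,
    num_ignore: int,
    wait: str = "Wait",
) -> List[str]:
    """Collect thinking tokens from `step` under a token budget.

    `step(context, limit)` is a hypothetical stand-in for one model call. It
    returns new tokens and signals that it wants to stop thinking by ending
    its output with END_OF_THINKING.
    """
    trace: List[str] = []
    ignored = 0
    while len(trace) < max_tokens:
        remaining = max_tokens - len(trace)
        new = step(trace, remaining)[:remaining]
        stopped = bool(new) and new[-1] == END_OF_THINKING
        trace.extend(new[:-1] if stopped else new)
        if not stopped or ignored >= num_ignore or len(trace) + 1 >= max_tokens:
            break
        # Minimum side: drop the delimiter and append "Wait" to force more thinking.
        trace.append(wait)
        ignored += 1
    # Maximum side (and normal exit): close the thinking phase explicitly.
    trace.append(END_OF_THINKING)
    return trace


def toy_step(context: List[str], limit: int) -> List[str]:
    """Made-up model: miscounts first, corrects itself once it sees "Wait"."""
    if "Wait" in context:
        tokens = ["recount", "->", "3", END_OF_THINKING]
    else:
        tokens = ["count", "->", "2", END_OF_THINKING]
    return tokens[:limit]


print(budget_forced_thinking(toy_step, max_tokens=16, num_ignore=1))
```

With `num_ignore=0` the loop returns the first attempt unchanged, and with a very small `max_tokens` the thinking is truncated and closed with the delimiter, which mirrors the maximum-token rule.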

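The figure below, quoted from the paper, shows that the performance of the model using budget forcing (s1-32B) on AIME2024 improves as test-time compute increases.
Note: AIME2024 is a high-difficulty math benchmark. It consists of the 30 problems used for evaluation from the 2024 editions (AIME I / AIME II) of the American Invitational Mathematics Examination (AIME), one of the US math competitions.
However, the paper also found that as "Wait" is inserted more and more times during reasoning, performance eventually plateaus. This is because suppressing the end-of-thinking delimiter token too often can push the model into a loop of repeating the same thing instead of continuing to reason.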



Quoted from Fig. 4 of s1: Simple test-time scaling

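Running s1: Simple test-time scaling

Following the official s1 repository, I checked how the model revisits its own reasoning when the minimum-token side of budget forcing is applied. The concrete inference code is shown below.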

In this example the LLM answers the question "How many r in raspberry". If the model's thinking output is shorter than 32,000 tokens, "Wait" is inserted once to prompt it to keep thinking.

Inference code
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

# Decide on a token limit for thinking; As the model's max tokens is 32768, 32000 usually ensures there is enough space for the model to still answer
MAX_TOKENS_THINKING = 32000
# Decide how often to ignore end-of-thinking token
NUM_IGNORE = 1

model = LLM(
    "simplescaling/s1-32B", # s1 originally gets this prompt wrong but with budget forcing it fixes it
    tensor_parallel_size=2,
)
tok = AutoTokenizer.from_pretrained(
    "simplescaling/s1-32B"
)

stop_token_ids = tok("<|im_end|>")["input_ids"]
sampling_params = SamplingParams(
    max_tokens=32768,
    min_tokens=0,
    stop_token_ids=stop_token_ids,
    skip_special_tokens=False,
    temperature=0.0,
)

# For the exact raspberry sample in the paper see
prompts = [
    "How many r in raspberry",
]

for i, p in enumerate(prompts):
    prompt = "<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n" + p + "<|im_end|>\n<|im_start|>assistant\n"
    stop_token_ids = tok("<|im_start|><|im_end|>")["input_ids"]
    sampling_params = SamplingParams(
        max_tokens=MAX_TOKENS_THINKING,
        min_tokens=0,
        stop_token_ids=stop_token_ids,
        skip_special_tokens=False,
        temperature=0.0,
    )
    prompt += "<|im_start|>think"
    o = model.generate(
        prompt,
        sampling_params=sampling_params
    )
    print(prompt + o[0].outputs[0].text)
    ignore_str = "Wait"
    max_tokens_thinking_tmp = MAX_TOKENS_THINKING
    for _ in range(NUM_IGNORE):  # Number of times to ignore the end-of-thinking token
        max_tokens_thinking_tmp -= len(o[0].outputs[0].token_ids)
        if max_tokens_thinking_tmp > 0:
            prompt += o[0].outputs[0].text + ignore_str
            sampling_params = SamplingParams(
                max_tokens=max_tokens_thinking_tmp,
                min_tokens=1,
                stop_token_ids=stop_token_ids,
                skip_special_tokens=False,
                temperature=0.0,
            )
            o = model.generate(
                prompt,
                sampling_params=sampling_params
            )
            print(prompt + o[0].outputs[0].text)
    ### Final answer ###
    prompt += o[0].outputs[0].text # You can also append "Final Answer:" here like we do for some evaluations to prevent the model from just continuing to reason in its answer when early exiting
    stop_token_ids = tok("<|im_end|>")["input_ids"]
    sampling_params = SamplingParams(
        max_tokens=32768,
        min_tokens=0,
        stop_token_ids=stop_token_ids,
        skip_special_tokens=False,
        temperature=0.0,
    )
    o = model.generate(
        prompt,
        sampling_params=sampling_params,
    )
    print(prompt + o[0].outputs[0].text)
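The prompt layout and the remaining-budget bookkeeping from the code above can be pulled out into small pure-Python helpers, which makes them easy to inspect without loading the model. This is only a sketch: the helper names are hypothetical, the special tokens and system prompt are copied from the code above, and the token count of 1200 is made up.

```python
from typing import Sequence

SYSTEM_PROMPT = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."


def build_thinking_prompt(question: str, system: str = SYSTEM_PROMPT) -> str:
    """Assemble the chat-formatted prompt and open the thinking phase."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{question}<|im_end|>\n"
        "<|im_start|>assistant\n"
        "<|im_start|>think"
    )


def remaining_thinking_budget(max_tokens_thinking: int, used: Sequence[int]) -> int:
    """Tokens still available for thinking after the given generations."""
    return max_tokens_thinking - sum(used)


prompt = build_thinking_prompt("How many r in raspberry")
budget = remaining_thinking_budget(32000, [1200])  # 1200 is a made-up token count
print(prompt)
print(budget)  # prints 30800
```

In the loop above, "Wait" is appended only while this remaining budget is positive, so a long first pass leaves less room for the forced second pass.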
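Below is an excerpt of the output. The idea is very simple, but its effect is visible here: the LLM first answered that there are 2 r's, then, once "Wait" was inserted, re-examined its reasoning and finally gave the correct answer of 3. Even this simple form of sequential scaling, inserting a prompt partway through, improved the quality of the reasoning in this example.

Output (excerpt)
18. **Final Answer:** The final answer is 2.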

This process demonstrates careful reading, understanding the task, systematic analysis, and a final check. It's important to be thorough, especially with potentially tricky words or questions.
Wait, let me re-read the question carefully. "How many r in raspberry".  The question is asking about the letter 'r'.

Let me re-evaluate my initial thought process. I seem to have made a mistake in my initial count.

**Corrected Thinking Process:**

1. **Read the Question:** "How many r in raspberry?"
2. **Identify Key Elements:** Letter 'r', word 'raspberry'.
3. **Understand the Task:** Count occurrences of 'r' in 'raspberry'.
4. **Visualize the Word:**  See "raspberry" in my mind.
5. **Scan for 'r':**
    * First letter: r (Count: 1)
    * Second letter: a
    * Third letter: s
    * Fourth letter: p
    * Fifth letter: b
    * Sixth letter: e
    * Seventh letter: r (Count: 2)
    * Eighth letter: r (Count: 3)
    * Ninth letter: y
6. **End of Word:**  Scanned all letters.
7. **Review the Count:** The count of 'r' is 3.
8. **Consider Alternatives:**  Is the question asking about something else?  No, it's clearly about the letter 'r'.  Is there a trick with the spelling of 'raspberry'?  The spelling seems correct.  Could I have missed an 'r'?  Let's recount carefully.
9. **Final Answer:** The final answer is 3.
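As a quick sanity check on the corrected count, the same letter scan can be done directly in Python. This is just a verification sketch and is independent of the model; the variable names are arbitrary.

```python
word = "raspberry"
# 1-based positions of every 'r', matching the numbering in the trace above.
positions = [i + 1 for i, ch in enumerate(word) if ch == "r"]
print(len(positions), positions)  # prints 3 [1, 7, 8]
```

The positions agree with the first, seventh, and eighth letters listed in the corrected thinking process.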

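Conclusion

This post introduced "s1: Simple test-time scaling", a paper on simple test-time scaling (sequential scaling). Adding the single word "Wait" to the reasoning process was enough for the LLM to review its own thinking and move toward correct reasoning. Test-time scaling is one of the areas attracting attention recently, so I plan to keep following its progress.

Reference

I recently gave a talk on test-time scaling at Paper&Hacks in the Matsuo Lab LLM Community, so please take a look at that as well if you are interested.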

Footnotes
[1] Niklas Muennighoff et al. (2025) "s1: Simple test-time scaling" EMNLP 2025.
[2] Qiyuan Zhang et al. (2025) "A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?" arXiv:2503.24235.


Matsuo Institute Tech Blog

The tech blog of Matsuo Institute, Inc.
