iTranslated by AI
Act II: The Aftermath of the SLM Pipeline — Ensuring Reproducibility Across Three Macs
About this article
Claude Code created a draft based on AI conversation logs and experimental records, which I then refined.
This article documents the investigation into reproducibility within an SLM pipeline currently operating on a personal development environment (three Macs). It is not intended as a general guide to ollama specifications or SLM behavior.
In my previous article, I designed eval-kit2, which changed SLM quality evaluation from "multiplier of the number of items" to "importance-weighted coverage."
Through that, I found that SLM non-determinism is limited to Lv3 and below. Essential information is being captured consistently. If so, is non-determinism "acceptable"?
Setting temperature=0 fixes the output, but I will cover that in the next entry.
I wrote that last time. This is that story.
Introduction
What I will do
- Report the phenomenon where output quality of the same SLM differed across three Macs.
- Share the process of isolating whether the ollama version difference or SLM non-determinism was the main cause.
- Show the procedure and effectiveness of ensuring reproducibility using temperature=0 + seed fixing.
What I will not do
- Provide a detailed explanation of ollama's internal implementation or sampling algorithms.
- Compare with other inference engines (llama.cpp, vLLM, etc.).
- Discuss parameter tuning theories for temperature or seeds.
The Trigger — Different Quality Despite the Same Model
In my SLM pipeline, I have qwen2.5:14b (Q4_K_M quantization) process summaries of AI conversation logs. I run this pipeline on three Macs.
| Machine | Chip | Memory | Usage |
|---|---|---|---|
| mbp | M4 Pro | 48GB | Main dev machine |
| wk-01 | M4 | 16GB | Git server & worker |
| wk-02 | M4 | 24GB | Spare machine & worker |
When I threw the same prompt at the same conversation log, wk-01 always had the highest quality, and mbp had the lowest.
To quantify quality, I used the eval-kit2 (importance-weighted coverage) designed last time. I classified the items that should be included in the conversation log summary into five levels of importance and measured the coverage of Lv1 (Essential) and Lv2 (Important).
Results for two test conversation logs:
| Machine | Test c16 Lv1-2 | Test c17 Lv1-2 |
|---|---|---|
| wk-01 | 3/5 | 5/6 |
| wk-02 | 3/5 | 4/6 |
| mbp | 2/5 | 4/6 |
There is a clear difference between wk-01 and mbp. The model should be the same.
Hunting the Culprit — Difference in ollama Versions
First, I confirmed the identity of the model binaries.
I compared the blob SHA256 of the models held by ollama on all three machines. As a result, they were identical on all machines (2049f5674b1e...9a54). The model files themselves are the same.
Next, I suspected the version of the ollama engine itself. Upon checking, they were inconsistent across the three machines.
| Machine | ollama version |
|---|---|
| wk-01 | 0.13.5 |
| mbp | 0.15.6 |
| wk-02 | 0.17.4 |
I updated all machines to 0.13.5, the same as wk-01, and retested.
Result: The effect was limited.
The quality gap narrowed but did not disappear. Even with the same version and the same model, running it twice changes the output. The ollama version difference is one factor, but the main cause lies elsewhere. SLM non-determinism itself was dominant.
The Solution — temperature=0 + seed=42
Can't the fact that SLM output changes with every execution be controlled by runtime parameters? I checked the list of model parameters in ollama. I found that temperature (default 0.8) and seed are the parameters related to reproducibility.
To control this, I added support for two environment variables to the pipeline's execution script (run_bundle.sh).
OLLAMA_TEMPERATURE=0 OLLAMA_SEED=42 ./run_bundle.sh
- temperature=0: Always selects the token with the highest probability (greedy decoding). Since no sampling is performed, the output is stable.
- seed=42: Fixes the random number sequence when using sampling. Even when temperature > 0, it becomes easier to get the same output.
Results
I compared the three machines with temperature=0 + seed=42 configured.
| Machine | Test c16 Lv1-2 | Test c17 Lv1-2 |
|---|---|---|
| wk-01 | 3/5 | 5/6 |
| wk-02 | 3/5 | 5/6 |
| mbp (remote) | 3/5 | 4/6 |
The outputs of wk-01 and wk-02 converged to be nearly identical.
Only mbp had a slight remaining difference. Only mbp connects via the OpenAI-compatible API, and it is possible that the request structure or sampling settings differ internally. I plan to verify if this difference is resolved by unifying all machines to use the OpenAI-compatible API in the future.
The Trade-off — Reproducibility vs. Diversity
The cost of setting temperature=0 is the "loss of diversity."
At temperature=0, the token with the highest probability is always selected. This means the possibility of "getting a good expression by chance" also becomes zero. SLM output becomes stable, but also uniform.
I decided on the following policy for switching:
- When measuring baseline: temperature=0 + fixed seed. Without reproducibility, the effect of prompt improvements cannot be correctly measured.
- During production operation: Keep the default temperature (0.8). Allow for some variation to utilize diverse expressions.
The conclusion I reached from this experiment was that reproducibility is required as a prerequisite for quality evaluation. When comparing "which is better, prompt A or B," if the SLM output changes every time, it is impossible to know whether the difference stems from the prompt or from the random numbers.
The Next Move is Clear — Controlling Quality with Parameters
By investigating the list of parameters in ollama this time, I feel I have found not only a solution to the reproducibility problem but also a hint for improving summary quality.
Temperature and seed were parameters for "reproducibility." However, looking at the list, there are other parameters that seem to affect quality itself. Adjusting these may allow me to build SLM outputs with the quality I am seeking.
| Parameter | Default | Role | Example impact on summary quality |
|---|---|---|---|
top_k |
40 | Limits next token candidates to top K probabilities | Lowering the value (e.g., 20) may suppress the appearance of eccentric words and increase summary stability |
top_p |
0.9 | Limits candidates to those whose cumulative probability exceeds P (nucleus sampling) | Lowering the value (e.g., 0.7) combined with top_k might result in more conservative output, potentially reducing redundant items |
repeat_penalty |
1.1 | Penalizes reappearance of tokens that appeared immediately before | Increasing the value (e.g., 1.3) might reduce repetition of the same phrases and improve summary redundancy |
num_ctx |
2048 | Context window size (number of tokens) | Increasing the value (e.g., 4096) allows processing long conversation logs without chunking, reducing context disconnects |
num_predict |
128 | Maximum number of tokens to generate | If the value is too small, the output will be cut off. It is necessary to set an appropriate value according to the length of the summary |
mirostat |
0 (disabled) | Adaptive sampling method that keeps Perplexity constant | Setting to 1 or 2 automatically adjusts output consistency instead of temperature/top_k |
I haven't tried them yet. However, there is a big difference between "knowing the parameters" and "not knowing." Without this reproducibility investigation, I might have continued trying to improve quality only by refining prompts without realizing these parameters existed.
Furthermore, I had to be careful about managing the ollama version itself. Because I installed ollama via brew, the versions on each machine had become inconsistent at the timing of brew upgrade. If the inference engine version changes, the output can change even with the same model and the same parameters. I learned that aligning not only parameters but also the ollama version, and comparing output differences between versions, is part of quality management.
Summary
In this article, I reported on the investigation into the cause of the phenomenon where SLM output quality differed across three Macs, and the countermeasures taken to ensure reproducibility.
- ollama version difference is a factor, but not the primary cause: Even when all machines were aligned to the same version, SLM non-determinism altered the output. Version unification is a necessary condition, but not a sufficient one.
- Convergence with temperature=0 + fixed seed: The outputs of the three machines became nearly identical. This setting is indispensable for baseline measurements.
Remaining Tasks
With reproducibility ensured, A/B testing for prompt improvement has become possible. Now that the foundation of eval-kit2 + reproducibility is in place, the next step is to enter the stage of prompt improvement and parameter optimization.
Playbill (Development Records via docs × AI)
*In this series, the trial-and-error process is likened to a "play," and progress units are described as "acts."
Part 1: docs × AI (Five Chapters)
Part 2: Organizing Conversation Logs (Six Acts + Interlude)
Part 3: Workflow and Task Management (Four Acts)
Part 4: Building the First App (Five Acts)
Part 5: After the SLM Pipeline (Number of Acts TBD)
- Act 1: After the SLM Pipeline — Measuring Quality by "Importance"
- Act 2: After the SLM Pipeline — Ensuring Reproducibility on Three Macs ← Here now
Created: 2026-03-02
Source: slm.2026-03-02.claude.16.md / slm.2026-03-02.claude.08.md (AI Conversation Summary)
Article Creation Process
- Plot Creation: Claude Code (extracted from conversation summary)
- Initial Creation: Claude Code
- First Review: Me
- Editing: ChatGPT
- Pre-posting Review: (Not performed)
- Post-posting Review: NotebookLM
- The plot was generated with the
/dev-log-to-article-plotskill, and the draft was created with the/zenn-blog-writingskill.
Discussion