
Running the Same Prompt Across 4 AIs Revealed Insights Invisible in a Single Run: SRP Case Study


About the Previous Article

In June 2025, I introduced SRP (Stochastic Resonance Prompting) on note.com using an R code refactoring case study. This time, I will cover its application to open-ended thinking tasks with no single correct answer, rather than code-related work.

https://note.com/chemica_tan/n/n92f4ee61a831

Execution Environment

The novelty this time lies in automation.

I implemented a command called /quick-homo-srp using the custom skill feature of Claude Code. When I give an instruction in natural language like "Apply SRP to this discussion," it automatically generates an SRP prompt based on the session context and launches 4 Codex CLI instances (the number and models are adjustable) in parallel in the background. No manual copy-pasting is required, and I get a notification once it's complete. The mechanism for prompt generation is summarized in the Appendix.

The content of the prompt (§1 to §7) is identical for all instances. The only difference is §8 (the output file path). This setup is designed to draw out "stochastic fluctuations for the same input."
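In rough outline, that setup amounts to fanning out one shared prompt to several background processes. The sketch below is hypothetical (the file names and the `report_N.md` paths are my own inventions; the skill's actual output is shown in the Appendix): §1 to §7 come from a shared file, and each instance gets its own §8.

```shell
# Hypothetical sketch: srp_common.txt holds the shared §1-§7 of the prompt.
# Each of the 4 instances receives its own §8 output path, runs in the
# background, and logs to its own file; `wait` blocks until all finish.
for i in 1 2 3 4; do
  {
    cat srp_common.txt
    echo "§8: Write your report to report_${i}.md"
  } | codex exec --full-auto -m gpt-5.2-codex - > "instance_${i}.log" 2>&1 &
done
wait
echo "All 4 instances finished."
```

The `wait` at the end is what makes the "notification once it's complete" step possible: the script regains control only after every background instance has exited.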

Task: Defining "Grothendieck Prime-ness"

The task I assigned was this: "For integers from 2 to 1000, define 'Grothendieck prime-ness' and discuss the Top 5 and the interval distribution."

The term "Grothendieck prime" originates from a mathematical anecdote. Alexander Grothendieck, one of the greatest abstract mathematicians of the 20th century, was once asked during a lecture to give an example of a specific prime number, and he replied "57." Since 57 = 3 × 19 is a composite number, it has been called the "Grothendieck prime" ever since.

It is a somewhat silly topic, but I chose it because something too difficult would not be suitable for the article. Tasks with no single correct answer are also ideal for SRP. If four AIs independently create definitions, comparing the diversity of their approaches can provide hints for a better implementation.
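To make the task concrete, here is one hypothetical scoring function, invented purely for illustration; it is none of the four instances' actual formulas. It rewards composites whose smallest prime factor sits close to √n (hard to unmask by trial division) and gives true primes a score of zero:

```python
import math

def smallest_prime_factor(n: int) -> int:
    for p in range(2, math.isqrt(n) + 1):
        if n % p == 0:
            return p
    return n  # no factor found below sqrt(n): n is prime

def grothendieck_score(n: int) -> float:
    """Hypothetical score: high for composites whose smallest factor is
    near sqrt(n). Divisibility by 2, 3, 5, ... is penalized automatically,
    since a small smallest factor yields a small score."""
    spf = smallest_prime_factor(n)
    if spf == n:  # actual primes can't "look prime but not be one"
        return 0.0
    return (spf - 1) / math.sqrt(n)

top5 = sorted(range(2, 1001), key=grothendieck_score, reverse=True)[:5]
print(top5)  # → [961, 841, 529, 361, 289] : prime squares dominate
```

Under this particular toy metric the prime squares saturate the top and 437 only scores about 0.86, which already illustrates how strongly the ranking depends on the chosen definition.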

The name SRP comes from "Stochastic Resonance" in physics. In cases where a completely bad idea is simply rejected or an obviously good idea is adopted, SRP is not necessary. The core lies at the boundary between "slightly bad" and "slightly good." A slightly bad idea is not just discarded; it serves as a foil for a slightly good idea, helping its adoption. For example, even if three out of four are mediocre, comparing them highlights the excellence of the remaining one. That is stochastic resonance.

What was Revealed by the Comparison

When reading the four reports side-by-side, there were three discoveries that could not be noticed in a single execution.

Discovery 1: The Emergence of the Number 437

Let's list the Top 5 for each instance.

| Rank | #1 | #2 | #3 | #4 |
| --- | --- | --- | --- | --- |
| 1 | 169 (13²) | 259 (7×37) | 169 (13²) | 943 (23×41) |
| 2 | 361 (19²) | 371 (7×53) | 289 (17²) | 989 (23×43) |
| 3 | 841 (29²) | 559 (13×43) | 361 (19²) | 851 (23×37) |
| 4 | 437 (19×23) | 589 (19×31) | 437 (19×23) | 437 (19×23) |
| 5 | 221 (13×17) | 611 (13×47) | 841 (29²) | 589 (19×31) |

The Top 1 is different for all four instances: 169, 259, 169, and 943. However, 437 is in the Top 5 for 3 out of 4 instances. Other numbers appeared at most 2/4 times.

Why is 437 so robust? It is not divisible by 2, 3, 5, 7, 11, or 13. Its smallest prime factor, 19, is close to √437 ≈ 20.9, making it tedious to find by trial division. Furthermore, it is sandwiched between 431 (prime) and 439 (prime), placing it in an "environment" with high prime density.

The contrast with 57 is interesting. 57 remained in the "middle to upper-middle" range in all instances. Being divisible by 3 is a universal penalty factor. 437 can be seen as an "upgraded version of 57" that retains 57's strengths (odd number, many neighboring primes) while overcoming its weakness (divisibility by 3).
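These arithmetic claims about 437 are easy to verify with a few lines of standard-library Python:

```python
def is_prime(n: int) -> bool:
    # Trial division up to sqrt(n); fine for n <= 1000.
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

print(19 * 23)                                      # → 437
print([437 % p for p in (2, 3, 5, 7, 11, 13)])      # → [1, 2, 2, 3, 8, 8] (no zeros)
print(is_prime(431), is_prime(437), is_prime(439))  # → True False True
```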

In a single execution, you would only know that "169 is 1st" or "259 is 1st." Only by comparing them does 437's stability across definitions become apparent.

Discovery 2: Detecting Design Flaws

Take another look at the Top 5 for Instance #3. The scores are all lined up at 1.000.

> Note: There is a "ceiling effect" where the top ranks reach the score limit of 1.000, making it impossible to rank them by small differences.
>
> — From the Instance #3 report

#3 itself acknowledges this issue in a footnote. However, the extent of its severity cannot be judged within #3 alone.

When lined up with #1, #2, and #4, the situation becomes clear. In the other three, the scores are continuously distributed and properly differentiated. The fact that only #3 is stuck at the ceiling is a design flaw.
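In miniature, the flaw looks like this (the scores below are made-up numbers, not instance #3's actual output): any formula that clamps or saturates at 1.0 throws away exactly the information needed to rank the strongest candidates.

```python
# Hypothetical raw scores before clamping to the [0, 1] range.
raw = {943: 1.37, 989: 1.21, 851: 1.08, 437: 0.93, 589: 0.88}

# After clamping, the top three collapse to 1.0 and become indistinguishable.
capped = {n: min(s, 1.0) for n, s in raw.items()}
print(capped)
```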

It follows the same principle as code review. It is hard to see problems in your own code. When four people tackle the same problem, someone will — even if indirectly — detect the flaw.

Discovery 3: Visualizing Implicit Assumptions

Only Instance #4 introduced a \ln(n) term (size scale) into its definition.

A(n) = (ln n − ln 2) / (ln 1000 − ln 2)

As a result, the Top 5 for #4 are concentrated in the 900s (943, 989, 851). The other three are scattered across the entire range.
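Instance #4's size term is just a log-scaled normalization of n over [2, 1000]: it is 0 at n = 2, 1 at n = 1000, and monotonically increasing in between, which is why it pushes the Top 5 toward the 900s.

```python
import math

def A(n: int) -> float:
    # Instance #4's size-scale term: log-position of n within [2, 1000].
    return (math.log(n) - math.log(2)) / (math.log(1000) - math.log(2))

print(round(A(2), 3), round(A(437), 3), round(A(943), 3))  # → 0.0 0.867 0.991
```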

"Are larger numbers more Grothendieck-like?" — This question was not in the prompt. #4 implicitly answered Yes, while #1–#3 implicitly answered No. There are valid reasons for both positions.

But without comparison, you wouldn't even notice that "a choice was made" in the first place.

SRP visualizes implicit assumptions. The points of divergence are where the true design decisions lie hidden.

FAQ

Q1: Can diversity be achieved with the same model × 4?

In my experience, yes, sufficiently so. This is especially true for tasks with high difficulty or a high degree of freedom. Though, I suppose you wouldn't think of using SRP otherwise... The fact that we obtained four different definition names (Cognitive False Positive Index, Cognitive Illusion Degree, Cognitive Mirage Index, and Cognitive Camouflage Degree), four different formulas, and four different Top 1 results this time is evidence of that. The probabilistic nature of LLM generation naturally ensures diversity.

However, I do sometimes intentionally run them on different models. Recently, I've often used gpt-5.2, gpt-5.2-codex, gemini-3.0-pro, and Claude Opus 4.5. While this adds some breadth to the variations, I find it works best when you have confidence in your own domain knowledge and can prepare a good SRP prompt to begin with.

Q2: Isn't the cost 4x higher?

Since it's parallel execution, the time required is almost the same as a single run. In many cases, running it four times from the start is more efficient than a cycle of "single execution → discovering a flaw → redoing it." As of February 2026, the Codex rate limit has been doubled for two months, so it's a good time to try it out.

https://x.com/sama/status/2018437537103269909

Q3: Is the consolidation of results manual?

In this case, Claude Opus 4.5 created the consolidated report. Full automation is also easy by just adding the logic to the skill. However, consolidation involving domain judgment requires human intervention. Or rather, you might find yourself wanting to step in once you see the results.

Q4: What kind of tasks is it suitable for?

It's for when you want to boost the quality of your output just one step further, even at a slight additional cost. However, it's a waste for deterministic tasks like sorting or calculation.

It works well for open-ended tasks with no single correct answer—design reviews, establishing definitions, refactoring policies, or creating specifications. Meta-planning (deciding how to formulate a plan) is a classic task. I also use it as a Swiss-cheese-style defense for tasks that are too complex to summarize through a simple "coding → verification → re-verification → report" cycle.

To give a specific example, it is well suited to tasks like this: converting complex specifications written in natural language into patterns such as regular expressions, where multiple styles of pattern writing are valid and the correctness of the conversion needs to be comprehensively verified. In professional terms, my task was converting atomic combination rules described in chemical engineering papers into SMARTS (regular expressions for molecular structures) and verifying whether the target molecules could be captured without omission. Having four AIs convert the same specification and comparing the results yielded significant benefits.
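A standard-library-only analogue of that workflow (the spec, the four patterns, and the test strings are all invented for illustration; the real task used SMARTS, not Python regexes): each "instance" proposes a pattern for the same spec, and running them against a shared test set immediately exposes which conversions miss cases.

```python
import re

# Hypothetical spec: "match an identifier followed by a decimal number,
# separated by optional whitespace". Four independent conversion attempts:
patterns = {
    "#1": r"[A-Za-z_]\w*\s*\d+",
    "#2": r"[A-Za-z_]\w*\s*\d+(\.\d+)?",  # also allows a fractional part
    "#3": r"[A-Za-z]\w*\s+\d+",           # forgets leading _ and optional whitespace
    "#4": r"[A-Za-z_][\w]*\s*[0-9]+",
}

tests = ["x1", "temp  42", "_n3.14", "ratio0.5"]

# Comparing the hit lists side-by-side plays the same role as comparing
# the four SRP reports: divergences mark the ambiguous parts of the spec.
for name, pat in patterns.items():
    hits = [t for t in tests if re.fullmatch(pat, t)]
    print(name, hits)
```

Here #2 catches all four strings while #3 catches only one; whether fractional parts or a leading underscore should match is exactly the kind of implicit assumption the comparison surfaces.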

Actually, the motivation for developing this method was similar to what I explained in the note article at the beginning. "It's a task in a field I'm not familiar with, but I want to use SRP to draft a good plan before handing it off to an agent"—I still use it quite often in those situations.

Q5: Any tips for the prompt?

Don't leave any implicit assumptions. The SRP prompt template (8-section structure) enforces structure. In particular, writing Guiding Questions specifically with numbers makes it easier to obtain comparable outputs.

Q6: Can it be done outside of Claude Code?

Yes. The previous note article was done by manual copy-pasting. Anything that allows parallel execution works. Automation is convenient but not required.

Conclusion

SRP is not a panacea. It is based on the simple principle that "comparison reveals what is hidden." Before blindly accepting the output of a single AI, just try sending it a few more times — that's all there is to it.

Questions and feedback are welcome in the comments.


Appendix: Automatic Generation of SRP Prompts

Inside the /quick-homo-srp skill, the following six items are first extracted from the current conversation:

  • Objective: What to generate and why this task exists
  • Domain knowledge: Prerequisite knowledge that the receiving AI must know to avoid invalid output
  • Design decisions: Policies tried and rejected in the past, and the history of constraints
  • File references: An explicit list of files the agent should read
  • Constraints: Rules that cannot be inferred from code alone
  • Evaluation criteria: Criteria for judging the quality of the output

These are fed into an 8-section template ranging from §1 Objective to §8 Output Instructions to generate a structured prompt. The core of the template is §6 Guiding Questions (specific numbered questions), which ensures that the outputs from each instance are in a comparable format.

```shell
# Startup command for each instance (actual skill output)
cat <<'PROMPT_EOF' | codex exec --full-auto -m gpt-5.2-codex -
[§1-§7: SRP prompt common to all instances]
[§8: Output path specific to this instance]
PROMPT_EOF
```

§1 to §7 are the same for all instances; only the output file path in §8 differs. Since the template enforces the structure, the output format remains consistent while the content naturally diverges — this is how SRP works.
