#03 A House of Only Skeletons

When I scored my first prototype, it was a 4.1 out of 10.

I have Claude Sonnet role-play as an "expert in distinguishing humans from AI" and evaluate my output, a method I call LLM-judging. I look at three metrics:

  1. HL (Human-Likeness)
  2. SV (Stylistic Variance; lower is better)
  3. TN (Naturalness of Timing)

HL 4.1, SV 0.64, TN 4.1. Those are the numbers I got after building it and grading it myself.
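A minimal sketch of how one judging round could be parsed, assuming the judge is asked to answer in JSON. The prompt wording, the `JudgeScores` class, the `parse_judge_reply` helper, and the value ranges for SV and TN are all hypothetical; only the three metric names and the 10-point HL scale come from the text above.

```python
import json
from dataclasses import dataclass

# Hypothetical judge instructions; the actual prompt is not shown in this article.
JUDGE_SYSTEM_PROMPT = (
    "You are an expert in distinguishing humans from AI. "
    "Read the conversation and reply with JSON only, using the keys "
    '"HL" (human-likeness, 0-10), "SV" (stylistic variance, 0-1, lower is better), '
    'and "TN" (naturalness of timing, 0-10).'
)


@dataclass
class JudgeScores:
    hl: float
    sv: float
    tn: float


def parse_judge_reply(reply_text: str) -> JudgeScores:
    """Parse the judge's JSON reply and reject out-of-range values."""
    data = json.loads(reply_text)
    scores = JudgeScores(hl=float(data["HL"]), sv=float(data["SV"]), tn=float(data["TN"]))
    # The ranges below are assumptions, not documented limits.
    if not (0 <= scores.hl <= 10 and 0 <= scores.tn <= 10 and 0 <= scores.sv <= 1):
        raise ValueError(f"score out of range: {scores}")
    return scores


# Example using the v1 numbers reported above.
v1 = parse_judge_reply('{"HL": 4.1, "SV": 0.64, "TN": 4.1}')
print(v1)
```

Validating the reply before recording it is a small guard against the judge drifting off format, which would otherwise silently corrupt the score history.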

You Cannot Live in a Blueprint

The cause was immediately clear. process_message() returned only parameters and generated no text. The emotional state, recommended style, and response delay were all there as a blueprint, but the words that should reflect them were nowhere to be found. It was like a house that is nothing but a skeleton. A score of about 4 was the natural ceiling.

I integrated the Anthropic API. I passed the emotional state into the system prompt to perform text generation. HL jumped from 4.1 to 6.1. However, TN dropped from 4.1 to 3.5.
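The step from "parameters only" to "parameters that shape the text" can be sketched roughly as follows. The `MessagePlan` class, its field names, and the prompt wording are hypothetical stand-ins for whatever process_message() actually returns; the sketch stops at building the system prompt and does not call the API.

```python
from dataclasses import dataclass


@dataclass
class MessagePlan:
    """Hypothetical shape of the v1 output: parameters only, no text."""
    emotional_state: str
    recommended_style: str
    response_delay_sec: float


def build_system_prompt(plan: MessagePlan) -> str:
    """Turn the parameters into instructions the text generator can act on."""
    return (
        "Reply as the persona described below.\n"
        f"Current emotional state: {plan.emotional_state}\n"
        f"Recommended style: {plan.recommended_style}\n"
        "Write only the reply text."
    )


plan = MessagePlan(
    emotional_state="slightly tired",
    recommended_style="short and casual",
    response_delay_sec=120.0,
)
system_prompt = build_system_prompt(plan)
print(system_prompt)
# The resulting string would be passed as the system prompt of the generation call.
```

Note that the delay is carried in the plan but never reaches the prompt or the output here, which is the kind of gap that shows up in the TN score.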

The API response was too fast. The message came back in 0.3 seconds, and although I had set a "2-minute delay" for it, that delay was not included in what I passed to the judge. The TimingController was returning a value, but the value did not appear anywhere in the output. It was a design oversight.
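One way the planned delay could be made visible to the judge is to stamp each turn with its intended send time instead of the API latency. This is a sketch under that assumption, not the fix actually used; `render_transcript`, the turn tuple layout, and the start time are hypothetical.

```python
from datetime import datetime, timedelta


def render_transcript(turns, start):
    """Render turns as lines stamped with the intended send time.

    Each turn is (speaker, text, delay_sec), where delay_sec is the
    planned delay before sending, not the time the API took to respond.
    """
    lines = []
    clock = start
    for speaker, text, delay_sec in turns:
        clock += timedelta(seconds=delay_sec)
        lines.append(f"[{clock:%H:%M:%S}] {speaker}: {text}")
    return "\n".join(lines)


turns = [
    ("user", "Are you free on Friday?", 0),
    ("bot", "Let me think about it.", 120),  # the "2-minute delay"
]
# Arbitrary start time chosen for the example.
transcript = render_transcript(turns, datetime(2026, 3, 18, 21, 0, 0))
print(transcript)
```

With the timestamps in the transcript, the judge can at least see the two-minute gap that the timing logic intended.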


Injecting Cultural Context

The configuration file has a parameter, context_level: 0.85, which represents how high-context the culture is. I reflected it in the system prompt as a rule: "avoid direct negation and make the reader infer from the context."
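A small sketch of how a numeric setting like context_level might be turned into a prompt rule. The 0.0 to 1.0 range, the 0.7 threshold, and the `cultural_rules` function are assumptions made for illustration; the article only states the value 0.85 and the rule itself.

```python
def cultural_rules(context_level: float, threshold: float = 0.7) -> list[str]:
    """Return prompt rules for a context_level (assumed 0.0 = low, 1.0 = high)."""
    if not 0.0 <= context_level <= 1.0:
        raise ValueError("context_level must be between 0.0 and 1.0")
    if context_level >= threshold:
        # Rule text taken from the article; the threshold is hypothetical.
        return ["Avoid direct negation and make the reader infer from the context."]
    return []


config = {"context_level": 0.85}
rules = cultural_rules(config["context_level"])
print(rules)
```

The point of the mapping is that a number in a config file does nothing by itself; it only changes behavior once it becomes a concrete sentence in the prompt.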

"I am sorry, but that is difficult" became "Let me think about it." HL rose to 6.8. That +0.7 was the moment the parameters were mapped to actual linguistic behavior.

However, SV stayed stuck at 0.50, even after I inserted fillers ("Um," "Ah,") and increased structural variation. The fillers themselves followed a pattern: they were inserted at the exact same position every time. The problem was not a lack of randomness; the structure outside the fillers remained identical.
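To illustrate why fillers alone do not help, here is a rough sketch that strips the fillers and compares what is left. This is not how SV was computed; the `skeleton` function and its length bucketing are a hypothetical stand-in used only to show that two replies can differ in wording and still share one shape.

```python
import re

FILLERS = ("Um,", "Ah,")  # the two fillers quoted above


def skeleton(reply: str) -> tuple:
    """Drop fillers, then describe each sentence by its closing
    punctuation and a coarse word-count bucket (words // 5)."""
    for filler in FILLERS:
        reply = reply.replace(filler, "")
    sentences = re.findall(r"[^.!?]+[.!?]", reply)
    return tuple((s[-1], len(s.split()) // 5) for s in sentences)


a = "Um, thanks for the note. I will check the schedule and reply tomorrow."
b = "Ah, thanks for the file. I will read the draft and answer tonight."
c = "Sure. Tomorrow?"

print(skeleton(a) == skeleton(b))
print(skeleton(a) == skeleton(c))
```

Replies `a` and `b` use different fillers and different words, yet their skeletons match, which is the pattern described above.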

The Discovery of Subtraction

The final +0.5 was the most interesting.

While looking at the test output, I noticed that replies very often started with "Thank you for contacting me." Humans don't thank you every single time in subsequent exchanges. LLMs almost always do. In English, "Thanks for reaching out" appears in the same position.

I added these to the configuration file as banned phrases.

"banned_phrases": [
  "Thank you for reaching out",
  "Please feel free to ask",
  "Feel free at any time"
]

In the system prompt, I instructed the model to "never use these." With just this, HL increased by 0.5.
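A minimal sketch of how the banned phrases could be loaded and turned into a prompt instruction. The surrounding JSON object, the function names, and the post-generation check in `find_violations` are assumptions; the article only describes the list and the "never use these" instruction.

```python
import json

# The list from above, wrapped in an object so it parses as standalone JSON.
CONFIG_JSON = """
{
  "banned_phrases": [
    "Thank you for reaching out",
    "Please feel free to ask",
    "Feel free at any time"
  ]
}
"""


def banned_phrase_rule(phrases):
    """Format the banned phrases as a system-prompt instruction."""
    quoted = "\n".join(f'- "{p}"' for p in phrases)
    return "Never use these phrases:\n" + quoted


def find_violations(reply, phrases):
    """Return the banned phrases that still appear in a reply (case-insensitive)."""
    lowered = reply.lower()
    return [p for p in phrases if p.lower() in lowered]


phrases = json.loads(CONFIG_JSON)["banned_phrases"]
print(banned_phrase_rule(phrases))
print(find_violations("Thank you for reaching out! Friday works.", phrases))
```

The check after generation is optional, but it gives a cheap way to confirm that the instruction is actually being followed before a reply goes to the judge.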

Human-likeness can sometimes be improved more by "what you stop doing" than by "what you add." Subtraction rather than addition. This was the biggest discovery this time.


HL 7.7

From v1 to v5, I went through five versions in a single day.

| Version | Changes | HL | SV | TN |
|---|---|---|---|---|
| v1 | Returning parameters only | 4.1 | 0.64 | 4.1 |
| v2 | Text generation via API | 6.1 | 0.56 | 3.5 |
| v3 | Cultural context reflection | 6.8 | 0.50 | 4.5 |
| v4 | Fillers & structural variation | 7.2 | 0.50 | 4.5 |
| v5 | Banned phrases & tone mirroring | 7.7 | 0.36 | 5.5 |

The numbers improved. But there is something to stop and consider here.

Even if the LLM-judge score is 7.7, whether an actual human would feel the same way is a separate question. There is no guarantee that what an LLM deems "human-like" will also be perceived as "human-like" by a human. I haven't performed a human evaluation yet. I am aware that I am just chasing numbers.

SV also ended at 0.36, just short of the 0.35 target. This may be the structural limit of pipeline-based post-processing.

<!-- metadata
event_date: 2026-03-18
notes: HL 4.1→7.7, SV 0.64→0.36, TN 4.1→5.5 confirmed via back_log(old_chat2.txt) and commit history. All 7 commits concentrated on 2026-03-18. The reliability issue of LLM-judges was recognized at the time.
-->
