Understanding LLM APIs through Concepts and Structure: A Farmer-Engineer's Chatbot Development Journey (Part 3)
Introduction: The Wall After "Making It Work"
In the previous two installments, I got sample code from a reference book running on Z.ai, converted it into a Flask application, and deployed it to Render as a watercress recipe AI.
The app is running. However, a nagging feeling remained.
I didn't really understand "why the code is written this way."
Why set temperature=0.7?
Why is session.modified = True necessary?
What is the difference between Session and RAG?
Making something work is different from understanding it. Today, I decided to face the "concepts and structure of LLM APIs" head-on.
Development: Reading Chapter 5 of "Practical Programming in the Generative AI Era"
I read Chapter 5, "Best Practices for Using the OpenAI API," from the book "Practical Programming in the Generative AI Era" (Impress).
The book is based on GPT-3.5 (as of the end of 2023), so the model names and pricing are outdated. However, the concepts and structure of the API have not changed. I realized that the best way to learn is a two-step approach: "Grasp the concepts from the book, and check the latest information in the official documentation."
3 Key Points That Deepened My Understanding
① The Meaning of API Parameters
The three parameters temperature, presence_penalty, and frequency_penalty control output randomness ("creativity"), topic diversity, and suppression of repetition, respectively.
Intuitive Understanding of presence_penalty and frequency_penalty
These can be confusing at first, so let's compare them to cooking.
presence_penalty is the "degree to which the AI attempts to use ingredients that haven't been discussed yet." The higher the value, the more aggressively the AI tries to bring up "topics not yet mentioned."
frequency_penalty is the "degree to which the AI tries to avoid words that have been used many times already." The higher the value, the more the AI dislikes repetitive use of the same word and tries to paraphrase.
When asking the Watercress AI, "Tell me in detail how to make ohitashi":
- frequency_penalty=0.0 → "Put watercress in a watercress pot, and watercress..."
- frequency_penalty=0.5 → "Boil the watercress in a pot, and then drain the water..."
presence_penalty controls "topic breadth," while frequency_penalty controls "word repetition." The former applies once a word has appeared at all, whereas the latter grows with each additional use, so the two act independently and can be tuned in combination.
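To make the difference concrete, here is a minimal sketch of the adjustment that OpenAI's documentation describes for these two penalties: each token's score is lowered by its count times frequency_penalty, plus a flat presence_penalty if the token has appeared at least once. The function name `apply_penalties` and the toy scores are made up for illustration, and it is an assumption that GLM models on Z.ai apply exactly the same formula.

```python
from collections import Counter


def apply_penalties(logits, generated_tokens, presence_penalty=0.0, frequency_penalty=0.0):
    """Return a copy of `logits` with both penalties applied.

    logits: dict mapping token -> raw score (toy values, not real model output)
    generated_tokens: tokens produced so far in this response
    """
    counts = Counter(generated_tokens)
    adjusted = {}
    for token, score in logits.items():
        count = counts[token]
        # frequency_penalty grows with every repetition of the token
        score -= count * frequency_penalty
        # presence_penalty is a flat, one-time cost once the token has appeared
        if count > 0:
            score -= presence_penalty
        adjusted[token] = score
    return adjusted


toy_logits = {"watercress": 2.0, "pot": 1.0, "drain": 1.0}
history = ["watercress", "watercress", "pot"]

adjusted = apply_penalties(toy_logits, history, presence_penalty=0.3, frequency_penalty=0.5)
print(adjusted)
```

In this toy run, "watercress" loses the most because it has already appeared twice, "pot" loses less, and "drain" is untouched because it has not been used yet.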
Recommended Additional Settings for Current app_v1.py
response = client.chat.completions.create(
    model="glm-4.7",
    messages=api_messages,
    temperature=0.7,        # Balanced type
    presence_penalty=0.3,   # Encourage slightly more diverse suggestions
    frequency_penalty=0.3,  # Slightly suppress repetition of expressions
    max_tokens=4096,
)
temperature (Creativity & Diversity)
0.0 → Nearly the same response every time (close to deterministic)
"What watercress goes well with curry?" → Always the same answer
→ Ideal for classifiers and question categorization
0.7 → Balanced type (current app_v1.py)
→ Ideal for general-purpose chatbots
1.0 → Different response every time, creative
→ Ideal for brainstorming and generating advertising copy
presence_penalty (Guiding toward new topics)
0.0 → Default. Tends to repeat the same topics
0.5 → Makes it easier to introduce new topics and perspectives
1.0 → Actively tries to expand the conversation
Application to Watercress AI:
"As many uses for watercress as possible"
→ Setting presence_penalty=0.6 makes the AI less likely to keep repeating the same "ohitashi" or "shira-ae" and more likely to suggest diverse menu items.
frequency_penalty (Suppressing repetition of the same words)
0.0 → Default. Repeats the same expressions
0.5 → Reduces repetition of the same words
1.0 → Actively paraphrases
Application to Watercress AI:
Can suppress "watercress, watercress, watercress..." appearing in every sentence.
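To see why a lower temperature gives steadier answers, the following is a small sketch of the usual way temperature is applied: the raw scores are divided by the temperature before being turned into probabilities. The function name and the three toy scores are hypothetical, and treating temperature 0 as "always pick the top score" is a simplification; real services do not guarantee identical output even at 0.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, scaled by temperature."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all probability on the top score
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.0, 0.7, 1.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

The lower the temperature, the more probability piles up on the highest-scoring candidate, which is why 0.0 suits classification and 1.0 suits brainstorming.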
② The Meaning of Function Calling
A mechanism where the LLM decides on its own "which function to call" and returns the function name and arguments; the application code then runs the actual Python function and passes the result back to the LLM.
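# Conceptual sketch (simplified: the OpenAI tools format wraps each entry in
# {"type": "function", "function": {...}} and adds a "parameters" JSON schema)
tools = [
    {
        "name": "search_neo4j",
        "description": "Search the watercress cuisine database",
    },
    {
        "name": "get_farm_info",
        "description": "Get inventory information for Nanaka Farm",
    },
]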
# "What autumn hot pot dishes go well with watercress this week?"
# → LLM autonomously decides to call search_neo4j
# → Search Neo4j → Convert results into text
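The application side of this flow can be sketched as a small dispatcher. In OpenAI-style chat completions, the model returns only a function name and a JSON string of arguments, and the application is responsible for running the function. Everything below is a hypothetical stand-in: the two functions are stubs that do not touch Neo4j, the `keyword` parameter is invented, and the model output is simulated rather than fetched from an API.

```python
import json


def search_neo4j(keyword: str) -> str:
    """Hypothetical stand-in: a real version would query Neo4j."""
    return f"(stub) dishes matching '{keyword}'"


def get_farm_info() -> str:
    """Hypothetical stand-in for the farm inventory lookup."""
    return "(stub) inventory information"


# Map the tool names the model may return to actual Python callables
TOOL_REGISTRY = {
    "search_neo4j": search_neo4j,
    "get_farm_info": get_farm_info,
}


def run_tool_call(name: str, arguments_json: str) -> str:
    """Execute one tool call requested by the model and return its result as text."""
    if name not in TOOL_REGISTRY:
        return f"Unknown tool: {name}"
    # Arguments arrive as a JSON string; an empty string means no arguments
    kwargs = json.loads(arguments_json) if arguments_json else {}
    return TOOL_REGISTRY[name](**kwargs)


# Simulated model output: the model only names the function and its arguments
simulated_call = {"name": "search_neo4j", "arguments": '{"keyword": "autumn hot pot"}'}
result = run_tool_call(simulated_call["name"], simulated_call["arguments"])
print(result)
```

Keeping the registry explicit means the model can only trigger functions that were deliberately exposed, and an unknown name is reported back instead of raising an error.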
This is the essence of the "agent" I am aiming for in Phase 4. My hunch is that it will broaden what the app can do.
③ The Importance of Token Management
The "GLM-4.7 content empty error" I experienced in the last deployment was exactly a failure in token management.
GLM-4.7 has reasoning mode (thinking mode) enabled by default
↓
reasoning_content (English thought process) consumes tokens first
↓
No tokens left to generate content (final answer)
↓
Content returns as an empty string
The solution is to leave enough headroom with max_tokens=4096. This behavior is specific to Z.ai and, as far as I could find, is not covered in the official documentation; I only worked it out after five hours of debugging.
Twist: Claude Gave Me a 10-Question Quiz
To break out of the state of "I made it work, but I'm not sure if I understand it," I asked Claude to give me a 10-question quiz on the "concepts and structure of APIs."
The results were as follows:
| Question | Theme | Result |
|---|---|---|
| Q1 | Meaning of stateless API | ✅ |
| Q2 | Safe writing to Flask sessions | ✅ |
| Q3 | Behavior of temperature=0.0 | ✅ |
| Q4 | Mechanism of RAG | ✅ |
| Q5 | Meaning of cosine similarity 1.0 | ✅ |
| Q6 | Role of Function Calling | ✅ |
| Q7 | Replacing pickle with Neo4j | ✅ |
| Q8 | Why system prompts aren't in Sessions | ✅ |
| Q9 | Cause of GLM-4.7 content error | ✅ |
| Q10 | Difference between Session and RAG | ✅ |
All 10 questions answered correctly.
There were three questions that were particularly valuable for reflection.
Q8: Why not put the system prompt in the Session?
I tried to put the 30,613 characters of watercress data from Notion into a system prompt and save it to the Session.
→ The cookie exceeded the roughly 4KB size limit (Flask's default session is stored in a browser cookie).
→ A "session cookie too large" warning appeared.
Solution: Save only the conversation history (user/assistant) in the Session. Add the system prompt only at the beginning during each API call.
This problem was actually encountered during deployment, so the answer is tied to my memory as an "experience."
Q9: GLM-4.7 content error
This is knowledge specific to Z.ai that only I possess. It is also a problem I would have never encountered if I had been using OpenAI.
Q10: The difference between Session and RAG
Session = Short-term memory (what we talked about in this conversation)
RAG = Long-term memory (the global watercress cuisine database)
Being able to summarize it this briefly made the challenges of my current implementation and the next steps clear.
The Value of "Experiential Learning" Through Quizzes
Looking back at why I was able to answer all 10 questions correctly, I realized that I didn't "learn the answers as knowledge," but rather that they were all problems I had "experienced as errors."
Experience Q8: Cookie size and separation of system prompts
When I tried to put Notion's world watercress cuisine database (30,613 characters) into a system prompt and save it to the Flask Session, this warning streamed into my terminal:
UserWarning: The 'session' cookie is too large:
the value was 40040 bytes but the limit is 4093 bytes.
I was trying to squeeze a 40KB cookie into a 4KB frame. The solution was a design change: "do not put the system prompt in the Session."
# Before: Put everything in the Session
session["messages"] = [
    {"role": "system", "content": SYSTEM_PROMPT},  # 30,000 characters
    {"role": "user", "content": "..."},
]
# After: Only user and assistant in the Session
session["messages"] = []  # Conversation history only
# Add system prompt at the beginning only during API call
api_messages = [{"role": "system", "content": SYSTEM_PROMPT}] + session["messages"]
I would never have internalized the "4KB cookie limit" just by reading a reference book. It stuck because I actually tried to stuff 30,000 characters of data into a cookie.
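Even with the system prompt removed, the conversation history itself keeps growing inside the same cookie. One possible safeguard, sketched below, is to drop the oldest messages once a rough size estimate passes a budget. The function names and the 3000-byte budget are assumptions for illustration, and the estimate ignores the signing and encoding overhead Flask adds, so it is only a rough guide.

```python
import json

COOKIE_LIMIT_BYTES = 4093  # the limit quoted in the warning above


def estimated_size(messages):
    """Rough size of the history as UTF-8 JSON (ignores signing and encoding overhead)."""
    return len(json.dumps(messages, ensure_ascii=False).encode("utf-8"))


def trim_history(messages, budget_bytes=3000):
    """Drop the oldest messages until the estimated size fits the budget."""
    trimmed = list(messages)
    while len(trimmed) > 1 and estimated_size(trimmed) > budget_bytes:
        trimmed.pop(0)  # oldest message goes first
    return trimmed


history = [
    {"role": "user", "content": "question " * 100},
    {"role": "assistant", "content": "answer " * 300},
    {"role": "user", "content": "What goes well with curry?"},
]
trimmed = trim_history(history)
print(len(history), "->", len(trimmed), "messages,", estimated_size(trimmed), "bytes")
```

The budget is deliberately set below the 4093-byte limit to leave some margin. A stricter version could drop messages in user/assistant pairs so the history never starts with an assistant turn.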
Experience Q9: The trap of GLM-4.7's reasoning mode
Immediately after deploying to Render, I encountered a mysterious error that only occurred in the production environment. When I asked "What dishes use watercress?", the response was 0 bytes; nothing was returned.
The cause was GLM-4.7's "reasoning mode (thinking mode)."
GLM-4.7 response structure:
  reasoning_content: "Let me think about watercress recipes..."
      ← Writes thought process in English (consumes tokens first)
  content: ""
      ← Not enough tokens remaining, becomes empty
With max_tokens=1024 (default), the thinking process alone used up all the tokens, leaving no room to generate the actual answer. The problem went away only after I raised the value to max_tokens=4096.
This is a Z.ai-specific behavior not found in the official documentation. It's a problem I would have never encountered if I had used OpenAI. This knowledge is also the reason I decided it was "worth writing about on the blog."
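A failure like this is easier to spot if the app checks the response before showing it. The sketch below classifies a reply from three values pulled out of the response. In OpenAI-compatible APIs, a finish_reason of "length" means generation stopped at the max_tokens limit; it is an assumption that Z.ai reports the same value in this situation. The function name `diagnose_reply` and the status labels are hypothetical.

```python
def diagnose_reply(content, reasoning_content, finish_reason):
    """Classify a chat completion result using three fields pulled from the response."""
    if content:
        return "ok"
    if finish_reason == "length" and reasoning_content:
        # The token budget ran out while the model was still thinking
        return "truncated_during_reasoning"
    if finish_reason == "length":
        return "truncated"
    return "empty_for_other_reason"


# Simulated values mirroring the failure described above
status = diagnose_reply(
    content="",
    reasoning_content="Let me think about watercress recipes...",
    finish_reason="length",
)
print(status)
```

Logging a status like this in production would point toward max_tokens right away instead of leaving only an empty reply to investigate.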
Experience Q10: The moment the difference between Session and RAG clicked
I could already explain the "difference between Session and RAG" in words, but it only clicked at a gut level once I implemented it.
Session is short-term memory that remembers "what we talked about in this conversation." It disappears when you close the browser. It is nothing more and nothing less.
RAG is long-term memory that holds "knowledge of 190 watercress dishes from around the world." It remains in Neo4j even after the app restarts. If I add a new dish to Neo4j, the knowledge increases without changing the code.
When you combine these two:
"Talked about curry in today's conversation (Session)" +
"Data on curry-based watercress dishes from around the world (RAG)"
= "AI that can suggest specific dishes that match the context of today's conversation"
On the day I implemented app_v2_rag.py, the meaning of this design finally came together for me as a single connected line.
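The combination can be sketched in a few lines. This is not the code from app_v2_rag.py: the in-memory document list stands in for Neo4j, the two-dimensional vectors are hand-made toy values rather than real embeddings, and the function names are hypothetical. It only shows where long-term knowledge (RAG) and short-term memory (Session) meet in the messages sent to the API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, documents, top_k=1):
    """Return the top_k document texts most similar to the query vector."""
    ranked = sorted(
        documents,
        key=lambda d: cosine_similarity(query_vec, d["vec"]),
        reverse=True,
    )
    return [d["text"] for d in ranked[:top_k]]


def build_messages(system_prompt, retrieved, history, user_input):
    """Combine long-term knowledge (RAG) with short-term memory (Session)."""
    context = "\n".join(retrieved)
    system = f"{system_prompt}\n\nReference dishes:\n{context}"
    return (
        [{"role": "system", "content": system}]
        + history
        + [{"role": "user", "content": user_input}]
    )


# Toy data: hand-made 2-D vectors standing in for real embeddings
documents = [
    {"text": "Watercress curry", "vec": [1.0, 0.0]},
    {"text": "Watercress ohitashi", "vec": [0.0, 1.0]},
]
history = [
    {"role": "user", "content": "I like curry."},
    {"role": "assistant", "content": "Curry pairs well with watercress."},
]
retrieved = retrieve([0.9, 0.1], documents)
messages = build_messages(
    "You are a watercress recipe assistant.", retrieved, history, "What should I cook tonight?"
)
print(retrieved, len(messages))
```

The retrieved text goes into the system message on every call, while the history list is the only part that would need to live in the Session.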
Conclusion: From "Making it Work" to "Understanding it"
I realized something today.
How fast you progress depends heavily on whether you understand "why" an error occurred.
Whether it's the GLM-4.7 content error or the Session cookie size overflow, I was able to answer today's quiz precisely because I understood "why it happened," not just because I "fixed the error."
Reading a text in a reference book and understanding it is completely different from fighting an actual error and understanding it. The latter sticks with you much better.
I am leaving this blog as a record of the day a farmer understood LLM APIs through "concepts and structure."
Next Preview
Part 4 is a "Record of deploying a Flask app to Render with Claude Code." I plan to write about the commit logs that include Co-Authored-By: Claude Sonnet 4.5.
It is a five-hour record of using pair programming with Claude Code to solve an error that only reproduces in the production environment, the GLM-4.7 reasoning mode (thinking mode) problem, and it will be the most technically unique content yet.
Nanaka Farm produces watercress in Kumamoto Prefecture.
For inquiries from restaurants and chefs, please contact nanaka-farm.com.