Testing if Zoltraak can run on local LLMs
Introduction
Who is this article for?
- People interested in local LLMs
- Those interested in zoltraak
- Those who want to create requirement definition documents using LLMs
Verification environment: Mac Studio (M2 Ultra, 128GB)
Content
In this article, I will attempt to create a requirement definition document from natural language using Zoltraak by Mr. Motoki.
However, it wouldn't be very interesting just to run zoltraak normally more than two months after its release. Therefore, in this article, I would like to verify what level of output quality can be achieved when using local LLMs and whether it is practical to use in the first place.
Conclusion
To state the conclusion first, the results were as follows:
- Currently, it seems difficult to make zoltraak fully operational using only local LLMs.
  - Requirement definition documents can be created without any problems.
  - I was unable to generate executable Python code for the subsequent step of creating the directory and file structure.
  - This might be improved to some extent by refining the descriptions in the grimoires.
- Unless there is a specific reason why you must execute it in an on-premises environment, it is likely best to use the default Claude API (in terms of quality, speed, and success rate).
- gemma2:27b-instruct and qwen2:72b-instruct are also not to be underestimated.
What is Zoltraak?
I only have a partial understanding of Zoltraak so far, but I will describe my current understanding below.
- It allows creating requirement definition documents from natural language using LLMs.
- It can write source code based on the requirement definition document.
- Depending on how you use it, you can even create a draft for a new business proposal.
I think the following post by Tetumemo-san is helpful for getting an idea of it.
However, since the quality of the output highly depends on the performance of the LLM used, I understand that at the time of writing, Claude 3.5 Sonnet produces the highest quality results.
Selection of Local LLM Models
While I understand that Claude 3.5 Sonnet is the best if only quality is considered, the performance of open LLMs has improved dramatically recently, especially with the emergence of Google Gemma 2.
In particular, the EZO-Common/Humanities-9B-gemma-2-it model introduced in my previous article has been found to boast Japanese language performance comparable to Claude 3 Haiku and Gemini 1.5 Flash, which are used in the default settings of zoltraak, according to the shaberi3 benchmark.
Therefore, in this article, I would like to conduct the verification using EZO-Common-9B-gemma-2-it.
Installing zoltraak
I will install it by following the GitHub page.
- Installing zoltraak
$ git clone https://github.com/dai-motoki/zoltraak.git
$ cd zoltraak
$ python -m venv zoltraak-dev
$ source zoltraak-dev/bin/activate # Linux and macOS
$ pip install setuptools wheel
$ pip install -e .
The installation is now complete.
Using the Claude API (Default)
If you are using the Claude API, you should be able to run it after setting up your Anthropic API key.
- Create a .env file in the root directory of the project.
- Add the following line to the .env file, replacing YOUR_API_KEY with your actual Anthropic API key:
ANTHROPIC_API_KEY=YOUR_API_KEY
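zoltraak itself takes care of reading this key, but as a quick sanity check you can confirm the variable is actually visible to Python before running. This is a minimal sketch of my own; the helper name is hypothetical and not part of zoltraak:

```python
import os

def get_anthropic_key() -> str:
    # The Anthropic SDK reads ANTHROPIC_API_KEY from the environment.
    # If you keep the key in a .env file, make sure it is exported into
    # the shell (or loaded with python-dotenv) before running zoltraak.
    key = os.environ.get("ANTHROPIC_API_KEY")
    if not key:
        raise RuntimeError("ANTHROPIC_API_KEY is not set; check your .env file")
    return key
```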
Running
Now, you should be able to run zoltraak by entering a command like the one below.
$ zoltraak "Develop an educational augmented reality (AR) application by the end of this month" -c general_def
Trying it with Local LLMs
While it wasn't a massive undertaking, I thought the details of the modification steps to make it work with local LLMs were a bit too granular for a blog post, so I will omit them.
However, for reference, I will outline the general flow of how I carried out the modifications:
- Step 1: Modified it to run using Gemini-1.5-flash instead of Claude
- A generate_response function for Gemini was already available in zoltraak/zoltraak/llms/gemini.py, so I updated the code to use it.
- Additionally, you need to set GEMINI_API_KEY instead of ANTHROPIC_API_KEY.
- Step 2: Modified it to perform similar tasks via the OpenAI API
- Tools like llama.cpp's llama-server and ollama are compatible with the OpenAI API.
- Just like in the Shaberi3 Benchmark, you can run zoltraak using local LLMs by configuring the api_base URL and the model name.
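The Step 2 modification can be sketched roughly as follows. This is my own illustration rather than zoltraak's actual code: the base URL is ollama's default OpenAI-compatible endpoint, the model name is one of those tested later, and only the standard library is used:

```python
import json
import urllib.request

API_BASE = "http://localhost:11434/v1"   # ollama's OpenAI-compatible endpoint
MODEL = "gemma2:27b-instruct-fp16"       # example model name used in this article

def build_payload(prompt: str, model: str = MODEL) -> dict:
    # OpenAI-style chat payload that both llama-server and ollama accept
    return {"model": model,
            "messages": [{"role": "user", "content": prompt}]}

def generate_response(prompt: str) -> str:
    # POST the payload to the local server and extract the reply text
    req = urllib.request.Request(
        f"{API_BASE}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer ollama"},  # local servers ignore the key
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

Swapping `API_BASE` and `MODEL` is all it takes to point the same code at a different local server or model.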
Verification
Here, I will compare the outputs when the same prompt is given to gemini-1.5-flash and EZO-Common/Humanities-9B-gemma-2-it-f16.
I ran it with the following prompt, which is quite broad:
$ zoltraak "Development of an AI investment advisor using web API and LLM" -c general_def
The following text is then output to the terminal, and the requirement definition document is generated first.
==============================================================
Activation Formula (Prompt Compiler) : general_def.md
Magic Formula (Requirement Definition) : def_ai_investment_advisor_web_api_llm.md
Refining Formula (Prompt Formatter) : md_comment_lang.md
Kotodama (LLM Vendor/Model Name) : gemini/gemini-1.5-flash
Open File : False
==============================================================
Step 1. Constructing the Magic Formula using the Activation Formula... 🪄 ━━━━━━━━━━━━━━━━━━━━━━☆゚.*・。゚
After a while, you will be asked "Do you want to execute the Domain Formula from the Magic Formula? (y/n):". If you type y and press Enter, multiple outputs will be generated. The content and number of outputs depend on the prompt.
Step 2. Constructing the Domain from the Magic Formula...
0%| | 0/7 [00:00<?, ?files/s]
...
100%|██████████| 7/7 [00:57<00:00, 8.19s/files]
If the process completes without any errors, the generation of all outputs is finished.
About Errors
In Step 2, the process stopped due to errors quite frequently.
- Previously, when I was using Claude 3 Haiku on the free tier, I encountered errors due to rate limits.
- This can be resolved simply by switching to a pay-as-you-go plan.
- With Gemini, there were times when errors occurred and times when they didn't.
- Errors usually happened when an unexecutable Python file was generated and the system tried to run it.
- With EZO-Common/Humanities-9B-gemma-2-it-f16, I tried about 10 times, but all of them resulted in errors during Step 2, only progressing as far as outputting the requirement definition document in Step 1.
- The output Python file def_ai_investment_advisor_llm_api_v1.py contained the following description. Although this file should have contained a Python script for generating README.md and detailed explanations, it instead contained a description of what the script was supposed to do:
This script aims to create a README.md file based on the content
of the requirement definition document and further generate
detailed explanations for each section using an LLM.
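One conceivable safeguard against this failure mode (my own sketch, not a feature of zoltraak) is to check whether a generated .py file actually parses as Python, and contains something beyond a bare prose description or docstring, before trying to execute it:

```python
import ast

def contains_real_code(source: str) -> bool:
    """Return True only if the text parses as Python and includes
    functional statements, not just prose or a lone docstring."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Prose such as "This script aims to ..." is not valid Python
        return False
    # A file holding only a docstring/expression has no non-Expr statements
    return any(not isinstance(node, ast.Expr) for node in tree.body)
```

Usage would be something like `contains_real_code(open(path).read())` before handing the file to the interpreter.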
Fundamental Issues
If we set aside rate limits, since they are not fundamental errors, the current challenge with local LLMs seems to be their inability to accurately handle complex instructions such as those in zoltraak's architect_claude.md grimoire.
An instruction of this kind could be considered a task that is quite difficult even for humans to follow precisely. For tasks of this complexity, it still seems difficult to process them accurately using only current local LLMs.
As a countermeasure, one might consider rewriting the aforementioned architect_claude.md into expressions that are easier for the models to understand, but despite various trials and errors, I was ultimately unable to reach a solution.
By the way, I also tried with gemma2:27b-instruct-fp16 and qwen2:72b-instruct-q4_K_M, and the results were as follows:
| Model | Context Window | Results | Remarks |
|---|---|---|---|
| Claude 3 haiku, sonnet, etc. | 200k | Relatively high success rate | |
| gemini-1.5-flash | 1.0M | Succeeds occasionally | |
| qwen2:72b-instruct-q4_K_M | 128k | Rarely succeeds | |
| gemma2:27b-instruct-fp16 | 8k | All failed | Even when Python code was occasionally generated, the model name inside it was rewritten to gpt-3.5-turbo etc., so it could not be executed |
| EZO-Common-9B-gemma-2-it-f16 | 8k | All failed | Python code was not output in the first place |
Although EZO-Common-9B-gemma-2-it-f16 scored highly on the shaberi3 benchmark, it failed here, whereas gemma2:27b-instruct-fp16 and qwen2:72b-instruct-q4_K_M may be useful in situations where you want to perform complex processing. The downside is that processing becomes quite slow with models of the qwen2:72b-instruct-q4_K_M class...
Another potential issue is that gemma2-based models have a short context window of 8k, which might be an influencing factor.
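To get a feel for how quickly an 8k window fills up, a crude back-of-the-envelope check can help. This is a rough heuristic of my own (roughly 4 characters per token for English text; the ratio varies by tokenizer and is worse for Japanese), not a real tokenizer:

```python
def rough_token_estimate(text: str) -> int:
    # Very crude heuristic: ~4 characters per token for English text
    return len(text) // 4

def fits_in_context(texts, context_window=8192, reserve_for_output=2048):
    # Leave room for the model's own output within the window
    used = sum(rough_token_estimate(t) for t in texts)
    return used <= context_window - reserve_for_output
```

A long grimoire plus a generated requirement definition document can easily exceed what this leaves available in an 8k window, which would be consistent with the failures seen above.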
Output Results: Excerpt from the Requirement Definition Document
Therefore, I would like to check only the output of the requirement definition document from Step 1 here. The results are as follows:
-
Output from Gemini-1.5-flash (Excerpt)

-
Output from EZO-Common-9B-gemma-2-it-f16 (Excerpt)

In both cases, the output for requirements and stakeholders is quite similar. What I personally found interesting was how the LLM is utilized.
While both seem to assign the LLM a role like a Financial Planner (FP) who creates investment strategies based on investor information, I thought it was impressive that EZO-Common-9B-gemma-2-it-f16 included a strategy to further use the data for LLM training and updates to improve quality.
Conclusion
In this article, I verified whether it is possible to run zoltraak using local LLMs.
I started this verification thinking that since performance has improved so much, it should be possible to run it with local LLMs, but unfortunately, I was unable to execute zoltraak to the end using only local LLMs.
My conclusion is that unless there is a specific reason why you must run it in an on-premises environment, it is likely best to use the default Claude API (in terms of quality, speed, and success rate).
Additionally, I believe there was value in discovering through various tests that gemma2:27b-instruct-fp16 and qwen2:72b-instruct-q4_K_M are also quite capable models.
Thank you for reading this far. I look forward to seeing you again in the next post.