
Testing if Zoltraak can run on local LLMs


Introduction

Who is this article for?

  • People interested in local LLMs
  • Those interested in zoltraak
  • Those who want to create requirement definition documents using LLMs
Environment
Mac Studio (M2 Ultra 128GB)

Content

In this article, I will attempt to create a requirement definition document from natural language using Zoltraak by Mr. Motoki.

However, it wouldn't be very interesting just to run zoltraak normally more than two months after its release. Therefore, in this article, I would like to verify what level of output quality can be achieved when using local LLMs and whether it is practical to use in the first place.

Conclusion

To state the conclusion first, the results were as follows:

  • Currently, it seems difficult to make zoltraak fully operational using only local LLMs.

    • Requirement definition documents can be created without any problems.
    • I was unable to generate executable Python code for the subsequent process of creating the directory and file structure.
    • This might be improved to some extent by refining the descriptions in the grimoires.
  • Unless there is a specific reason why you must execute it in an on-premises environment, it is likely best to use the default Claude API (in terms of quality, speed, and success rate).

  • gemma2:27b-instruct and qwen2:72b-instruct are also not to be underestimated.

What is Zoltraak?

I only have a partial understanding of Zoltraak so far, but I will describe my current understanding below.

  • It allows creating requirement definition documents from natural language using LLMs.
  • It can write source code based on the requirement definition document.
  • Depending on how you use it, you can even create a draft for a new business proposal.

I think the following post by Tetumemo-san is helpful for getting an idea of it.

https://x.com/tetumemo/status/1783227393978859853

However, since the quality of the output highly depends on the performance of the LLM used, I understand that at the time of writing, Claude 3.5 Sonnet produces the highest quality results.

Selection of Local LLM Models

While I understand that Claude 3.5 Sonnet is the best if only quality is considered, the performance of open LLMs has improved dramatically recently, especially with the emergence of Google Gemma 2.

In particular, the EZO-Common/Humanities-9B-gemma-2-it model introduced in my previous article was shown by the shaberi3 benchmark to have Japanese-language performance comparable to Claude 3 Haiku and Gemini 1.5 Flash, the models used in zoltraak's default settings.

https://zenn.dev/robustonian/articles/ezo_9b_gemma_2_it

Therefore, in this article, I would like to conduct the verification using EZO-Common-9B-gemma-2-it.

Installing zoltraak

I will install it by following the GitHub page.

https://github.com/dai-motoki/zoltraak

  • Installing zoltraak
$ git clone https://github.com/dai-motoki/zoltraak.git
$ cd zoltraak
$ python -m venv zoltraak-dev
$ source zoltraak-dev/bin/activate  # For Linux/macOS
$ pip install setuptools wheel
$ pip install -e .

The installation is now complete.

Using the Claude API (Default)

If you are using the Claude API, you should be able to run it after setting up your Anthropic API key.

  • Create a .env file in the root directory of the project.
  • Add the following line to the .env file, replacing YOUR_API_KEY with your actual Anthropic API key:
.env
ANTHROPIC_API_KEY=YOUR_API_KEY
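For reference, here is a minimal sketch of how such a key typically reaches the program: zoltraak reads it from the process environment, and a .env loader (such as python-dotenv) simply exports the file's entries before the client is created. The placeholder value below is illustrative, not a real key.

```python
import os

# Simulate what a .env loader (e.g. python-dotenv) does: export the entry
# into the process environment. "YOUR_API_KEY" is only a placeholder.
os.environ.setdefault("ANTHROPIC_API_KEY", "YOUR_API_KEY")

api_key = os.environ.get("ANTHROPIC_API_KEY")
if not api_key:
    raise RuntimeError("ANTHROPIC_API_KEY is not set; add it to your .env file")
print("key loaded")
```

If the key is missing, failing early with a clear message beats a cryptic HTTP 401 later.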

Running

Now, you should be able to run zoltraak by entering a command like the one below.

$ zoltraak "Develop an educational augmented reality (AR) application by the end of this month" -c general_def

Trying it with Local LLMs

The modifications needed to make zoltraak work with local LLMs weren't a massive undertaking, but the step-by-step details are too granular for a blog post, so I will omit them.

However, for reference, I will outline the general flow of how I carried out the modifications:

  • Step 1: Modified it to run using Gemini-1.5-flash instead of Claude
    • A generate_response function for Gemini was already available in zoltraak/zoltraak/llms/gemini.py, so I updated the code to use it.
    • Additionally, you need to set GEMINI_API_KEY instead of ANTHROPIC_API_KEY.
  • Step 2: Modified it to perform similar tasks via the OpenAI API
    • Tools like llama.cpp's llama-server and ollama are compatible with the OpenAI API.
    • Just like in the Shaberi3 Benchmark, you can run zoltraak using local LLMs by configuring the api_base URL and the model name.
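The Step 2 change above can be sketched as follows. Since ollama and llama.cpp's llama-server expose an OpenAI-compatible /v1/chat/completions endpoint, pointing the LLM call at a local model is mostly a matter of swapping the base URL and model name. The URL and model tag below are assumptions for illustration, not zoltraak's actual code.

```python
import json

API_BASE = "http://localhost:11434/v1"  # ollama's default port (assumption)
MODEL = "ezo-common-9b-gemma-2-it"      # hypothetical local model tag

def build_chat_request(prompt: str, model: str = MODEL) -> dict:
    """Build an OpenAI-style chat-completion payload for a local server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,
    }

# This payload would be POSTed to f"{API_BASE}/chat/completions".
payload = build_chat_request("Create a requirement definition document for ...")
print(json.dumps(payload, ensure_ascii=False))
```

Because the request shape is identical to the hosted OpenAI API, the same code path can serve both cloud and local backends by configuration alone.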

Verification

Here, I will compare the outputs when the same prompt is given to gemini-1.5-flash and EZO-Common/Humanities-9B-gemma-2-it-f16.

I ran it with the following prompt, which is quite broad:

$ zoltraak "Development of an AI investment advisor using web API and LLM" -c general_def

The following text is then output to the terminal, and the requirement definition document is generated first.

==============================================================
Activation Formula (Prompt Compiler)   : general_def.md
Magic Formula (Requirement Definition) : def_ai_investment_advisor_web_api_llm.md
Refining Formula (Prompt Formatter)    : md_comment_lang.md
Kotodama (LLM Vendor/Model Name)       : gemini/gemini-1.5-flash
Open File                              : False
==============================================================
Step 1. Constructing the Magic Formula using the Activation Formula... 🪄 ━━━━━━━━━━━━━━━━━━━━━━☆゚.*・。゚

After a while, you will be asked "Do you want to execute the Domain Formula from the Magic Formula? (y/n):". If you type "y" and press Enter, multiple outputs will be generated. The content and number of outputs depend on the prompt.

Step 2. Constructing the Domain from the Magic Formula...
  0%|            | 0/7 [00:00<?, ?files/s]
 14%|█▋          | 1/7 [00:08<00:48,  8.01s/files]
 29%|███▍        | 2/7 [00:15<00:39,  7.85s/files]
 43%|█████▏      | 3/7 [00:23<00:31,  7.87s/files]
 57%|██████▊     | 4/7 [00:32<00:24,  8.33s/files]
 71%|████████▌   | 5/7 [00:40<00:16,  8.30s/files]
 86%|██████████▎ | 6/7 [00:49<00:08,  8.31s/files]
100%|████████████| 7/7 [00:57<00:00,  8.19s/files]

If the process completes without any errors, the generation of all outputs is finished.

About Errors

In Step 2, the process stopped due to errors quite frequently.

  • Previously, when I was using Claude 3 Haiku on the free tier, I encountered errors due to rate limits.
    • This can be resolved simply by switching to a pay-as-you-go plan.
  • With Gemini, there were times when errors occurred and times when they didn't.
    • Errors usually happened when an unexecutable Python file was generated and the system tried to run it.
  • With EZO-Common/Humanities-9B-gemma-2-it-f16, I tried about 10 times, but all of them resulted in errors during Step 2, only progressing as far as outputting the requirement definition document in Step 1.
    • The output Python file def_ai_investment_advisor_llm_api_v1.py contained the following description.
    • Although this should have contained a Python script for generating README.md or detailed explanations, it instead contained a description of what the Python script was supposed to do.
Beginning of def_ai_investment_advisor_llm_api_v1.py
This script aims to create a README.md file based on the content
of the requirement definition document and further generate
detailed explanations for each section using an LLM.
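One way a zoltraak-style pipeline could guard against this failure mode (a "script" that is really prose) is to compile the generated text before trying to run it. This is my own illustration of the idea, not zoltraak's actual code.

```python
def is_executable_python(source: str) -> bool:
    """Return True only if the text parses as valid Python source."""
    try:
        compile(source, "<generated>", "exec")
        return True
    except SyntaxError:
        return False

# Prose masquerading as a script, like the file described above:
prose = (
    "This script aims to create a README.md file based on the content\n"
    "of the requirement definition document."
)
# An actual (trivial) script:
code = "print('generating README.md')"

print(is_executable_python(prose), is_executable_python(code))
```

A syntax check like this would let the pipeline re-prompt the model instead of crashing on execution, though it cannot catch code that parses but does the wrong thing.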

Fundamental Issues

If we ignore rate limits as they aren't fundamental errors, the current challenge with local LLMs seems to be their inability to accurately handle complex tasks like the following.

https://github.com/dai-motoki/zoltraak/blob/main/zoltraak/grimoires/architect/architect_claude.md

This instruction could be considered a task that is quite difficult even for humans to follow precisely. For tasks of this complexity, it still seems difficult to process them accurately using only current local LLMs.

As a countermeasure, one might consider rewriting the aforementioned architect_claude.md into expressions that are easier for the models to understand, but despite various trials and errors, I was ultimately unable to reach a solution.

By the way, I also tried with gemma2:27b-instruct-fp16 and qwen2:72b-instruct-q4_K_M, and the results were as follows:

| Model | Context Window | Results | Remarks |
| --- | --- | --- | --- |
| Claude 3 haiku, sonnet, etc. | 200k | Relatively high success rate | |
| gemini-1.5-flash | 1.0M | Succeeds occasionally | |
| qwen2:72b-instruct-q4_K_M | 128k | Rarely succeeds | |
| gemma2:27b-instruct-fp16 | 8k | All failed | Even when Python code was occasionally generated, the model name in it was rewritten to gpt-3.5-turbo etc., so it could not be executed |
| EZO-Common-9B-gemma-2-it-f16 | 8k | All failed | No Python code was output in the first place |

Although EZO-Common-9B-gemma-2-it-f16 scored high on the shaberi3 benchmark, it failed every run here; gemma2:27b-instruct-fp16 and qwen2:72b-instruct-q4_K_M may be the more useful choices when you want to perform complex processing. The downside is that processing becomes quite slow with models in the qwen2:72b-instruct-q4_K_M class...

Another potential issue is that gemma2-based models have a short context window of 8k, which might be an influencing factor.
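To make the context-window concern concrete, here is a rough sketch of checking whether a long grimoire prompt even fits in an 8k-token window. The 4-characters-per-token ratio is a crude heuristic for English text, not an exact tokenizer, and the lengths are illustrative assumptions.

```python
def rough_token_count(text: str) -> int:
    """Very rough token estimate (~4 characters per token for English)."""
    return len(text) // 4

CONTEXT_WINDOW = 8_192  # gemma2-based models

# Hypothetical: a long architect grimoire plus the requirement document.
prompt = "x" * 40_000
fits = rough_token_count(prompt) <= CONTEXT_WINDOW
print(fits)
```

If an instruction document like architect_claude.md plus the generated requirement definition overflows the window, the model never even sees the full task, which would explain failures independent of model quality.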

Output Results: Excerpt from the Requirement Definition Document

Therefore, I would like to check only the output of the requirement definition document from Step 1 here. The results are as follows:

  • Output from Gemini-1.5-flash (excerpt)
    [screenshot: requirement definition document generated by gemini-1.5-flash]

  • Output from EZO-Common-9B-gemma-2-it-f16 (excerpt)
    [screenshot: requirement definition document generated by EZO-Common-9B-gemma-2-it-f16]

In both cases, the output for requirements and stakeholders is quite similar. What I personally found interesting was how the LLM is utilized.

While both seem to assign the LLM a role like a Financial Planner (FP) who creates investment strategies based on investor information, I thought it was impressive that EZO-Common-9B-gemma-2-it-f16 included a strategy to further use the data for LLM training and updates to improve quality.

Conclusion

In this article, I verified whether it is possible to run zoltraak using local LLMs.

I started this verification thinking that since performance has improved so much, it should be possible to run it with local LLMs, but unfortunately, I was unable to execute zoltraak to the end using only local LLMs.

My conclusion is that unless there is a specific reason why you must run it in an on-premises environment, it is likely best to use the default Claude API (in terms of quality, speed, and success rate).

Additionally, I believe there was value in discovering through various tests that gemma2:27b-instruct-fp16 and qwen2:72b-instruct-q4_K_M are also quite capable models.

Thank you for reading this far. I look forward to seeing you again in the next post.
