
ELYZA-tasks-100: Building a System for AI-based Evaluation of Human Responses


I have created a system using the ELYZA Tasks 100 task dataset—a benchmark for measuring the performance of various LLMs on Japanese tasks—to score human responses using GPT-4o, compare the results with other AI models, and accumulate the answers and scoring results as data. I've written this article to share the background and an overview of the project.
https://github.com/wmoto-ai/elyza-tasks-100-humanevaluator

Here is how it works:
https://x.com/wmoto_ai/status/1813576952072261653

🌟 Project Overview

The main features of this project are as follows:

  • Accepts human answers to ELYZA-tasks-100 questions
  • Scores and evaluates responses using GPT-4o
  • Compares scoring results with other AI models (HODACHI/EZO-Humanities-9B-gemma-2-it, GPT-4, Claude 3.5 Sonnet)
  • Visually displays scoring results in graphs
  • Docker support for easy setup

💡 Development Background

Initial Ambition

Originally, I aimed to release this system as a public web app to collect a large amount of response data for Elyza Tasks 100. I had high hopes that the collected data could be used as "high-quality human response data" for fine-tuning LLMs and other purposes.

Reality Check

However, once I actually started solving the Elyza-tasks-100 problems myself... it turned out to be more of an ordeal than expected. Carefully answering a task set of 100 questions requires a significant amount of time and effort. I ended up giving up halfway through.
https://x.com/wmoto_ai/status/1808140365905776863
I would like to express my respect once again to the author of the following article, who completed all questions manually and published their results.
https://zenn.dev/yuki127/articles/2496cd8383c84c

At this point, I snapped back to reality and realized that collecting a large amount of carefully written response data from general users would be difficult without substantial promotional reach.

Pivot: Releasing as OSS

Therefore, I changed course and decided to release it as an OSS project rather than a personal data-collection effort, hoping for the following benefits:

  1. Other developers and AI researchers can freely use and improve it.
  2. It can be utilized for educational purposes.
  3. The project can grow by incorporating insights from the community.


🛠 Technical Details

Technologies Used

  • Backend: Python (FastAPI)
  • Frontend: HTML, JavaScript (Chart.js)
  • AI API: OpenAI GPT-4o
  • Containerization: Docker
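One requirement baked into this stack is that the question CSV should be swappable so other users can bring their own task sets. A minimal sketch of that CSV-driven loading might look like the following; the column names (`id`, `input`) and function name are my assumptions for illustration, not necessarily what the repository uses:

```python
import csv
import io

def load_tasks(csv_text: str) -> list[dict]:
    """Parse a task CSV into a list of {'id', 'question'} dicts.

    The 'id' and 'input' column names are assumptions; any CSV with the
    same columns could be dropped in to replace the ELYZA-tasks-100 file.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{"id": row["id"], "question": row["input"]} for row in reader]

# A tiny two-row CSV standing in for the real 100-task file.
sample = "id,input\n1,Question A\n2,Question B\n"
tasks = load_tasks(sample)
print(len(tasks))              # 2
print(tasks[0]["question"])    # Question A
```

Because the loader only depends on the column names, replacing the bundled CSV with a different task set requires no code changes.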

That being said, I actually built these parts by giving Claude 3.5 Sonnet the requirements below, then using the resulting files as a draft inside a Claude.ai Project and iterating on them there. Here is the very first prompt:

<Instructions>
Please take the following user requests and create a requirements definition document and a specification detailed enough that a repository and code can be built from them. If you have any questions, please ask me.

<Requirements> 
* Display questions based on a CSV (attached) containing 100 Japanese task examples called Elyza Tasks 100, and have humans answer in free-text form.
* Since this web app itself will be released as open source, I want to make it somewhat versatile so that users can change the CSV for questions.
* For each task response, have the gpt-4o API score it in the background on a scale of 1 to 5. 
* I want to be able to easily change the prompts used for gpt-4o scoring in a configuration file.
* I want to be able to save the response results and the gpt-4o scoring results as a CSV. The session name (column) in the CSV should be created based on the name the user enters before starting the test and the date information at that time.
* After finishing the questions in the CSV, navigate to a separate page where the average score for each question is calculated and can be compared with pre-evaluated scores of AI models like gpt-4 or claude3 haiku/opus on an interactive dashboard. 
* Since there are many questions, allow users to resume answering later. 
* Minimize costs as much as possible, excluding GPT API fees. 
* Keep the configuration simple and easy to maintain. 
* However, I want to consider the appearance and impact, as having them answer via Google Forms lacks visual appeal. Note that authentication is not required.
* Keep the number of folders and files extremely simple, while making it versatile enough to be used as a repository for automated scoring by LLMs as described above.
* API keys and other sensitive information should be manageable via a .env file.
* It should be easy to start up using Docker after git cloning from the repository.
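One of the requirements above is that the scoring prompt sent to gpt-4o live in a configuration file so it can be changed easily. That part can be sketched roughly as follows; the template wording and variable names here are mine, not the repository's actual prompt:

```python
from string import Template

# Stand-in for a template loaded from a config file; the real wording
# lives in the project's configuration and may differ.
SCORING_TEMPLATE = Template(
    "You are a strict grader. Score the answer from 1 to 5.\n"
    "Question: $question\n"
    "Answer: $answer\n"
    "Reply with the score and a one-sentence reason."
)

def build_scoring_prompt(question: str, answer: str) -> str:
    """Render the text that would be sent to the GPT-4o API."""
    return SCORING_TEMPLATE.substitute(question=question, answer=answer)

prompt = build_scoring_prompt("What is 2+2?", "4")
print(prompt)
```

Keeping the template in configuration rather than code means the 1-to-5 rubric can be tuned without touching the scoring logic.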

Main Features

  1. User Response Acceptance: Input responses through a simple Web UI
  2. Scoring by GPT-4o: Automatically evaluate responses and provide scores and reasons
  3. Visualization of Results: Compare user scores with other AI models using bar charts
  4. Data Accumulation: Save responses and scoring results in CSV files
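The data-accumulation feature above ties each saved row to a session name built from the user's entered name plus the start time, as the requirements prompt specifies. A sketch of that behavior, with a column layout I made up for illustration (the repository's actual columns may differ):

```python
import csv
from datetime import datetime
from pathlib import Path

def make_session_name(user_name: str, started_at: datetime) -> str:
    """Combine the entered name with a timestamp, per the requirements."""
    return f"{user_name}_{started_at.strftime('%Y%m%d_%H%M%S')}"

def append_result(path: Path, session: str, task_id: str,
                  answer: str, score: int, reason: str) -> None:
    """Append one scored answer to the CSV, writing a header if new."""
    is_new = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["session", "task_id", "answer", "score", "reason"])
        writer.writerow([session, task_id, answer, score, reason])

session = make_session_name("wmoto", datetime(2024, 7, 17, 12, 0, 0))
print(session)  # wmoto_20240717_120000
```

Embedding the timestamp in the session name also gives the resume-later feature a natural key for finding a user's partially completed run.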

Key Implementation Points

  • Minimized environment dependency through Docker support
  • Aimed for the most intuitive operation possible
  • Considerations on how to store data, etc.

Side Note

I used this system to evaluate MS Copilot (formerly Bing Chat), which cannot be called via API, by copying and pasting in the browser.
https://x.com/wmoto_ai/status/1813726890710179982

🙏 Conclusion

Originally, I had no connection to frontend development (in fact, I don't even work in the IT field), but this project gave me the chance to think through questions like "how can answering be made easier?" and "what format makes the accumulated data easy to handle?" It struck me as wonderful that having an LLM write the code to a reasonable degree lets me put my own resources into the areas I truly want to focus on.

Thank you for reading to the end. Cheers! 🍻
