[M4 Mac mini / 16GB] AI Evaluation Log #02: Comparing Lightweight LLMs on Ollama

Published

Ollama Local LLM Real-Device Test

Introduction: Motivation for Verification

In my previous article, I concluded that running Llama 3.2 Vision (11B) on a Mac mini with 16GB of memory caused swapping, making it difficult for practical use.

At that time, I ended with "Next, I will look for a lightweight vision language model (VLM)," but I reconsidered, thinking "It would be a problem if even basic text chat is heavy in the first place."

Therefore, as a step before moving on to image analysis verification, I conducted a simultaneous comparative evaluation of "lightweight text-based models (2B-3B class)" that can run smoothly without swapping even on 16GB of memory and can be kept resident.


0. Verification Environment (Prerequisites)

This test was conducted in the following environment:

  • Model: Mac mini (M4 chip)
  • Memory: 16GB
  • Runtime: Ollama

While actually running this round's lightweight model group (2B-3B), Activity Monitor's memory pressure stayed in the "green (normal)" zone, whereas Llama 3.2 (11B) had pushed it into the yellow zone.
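If you want to log the same pressure reading from the command line instead of watching Activity Monitor, macOS ships a `memory_pressure` tool. A minimal sketch, assuming the "System-wide memory free percentage" summary line that recent macOS versions print:

```python
import re
import subprocess

def free_percentage(output: str) -> int:
    """Extract the free-memory percentage from `memory_pressure` output."""
    m = re.search(r"System-wide memory free percentage:\s*(\d+)%", output)
    if m is None:
        raise ValueError("unexpected memory_pressure output")
    return int(m.group(1))

if __name__ == "__main__":
    # macOS only: `memory_pressure` summarizes the same pressure level
    # that Activity Monitor shows as green/yellow/red.
    out = subprocess.run(["memory_pressure"], capture_output=True, text=True).stdout
    print(f"free memory: {free_percentage(out)}%")
```

Sampling this before and after loading a model makes it easy to see how close a given model pushes a 16GB machine to swapping.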


1. One-Line Summary by Model (Developer)

This time, I compared popular lightweight models (text-focused) based on "practicality" rather than benchmarks.

  • Llama 3.2 Vision (11B) / Meta
    Reliable but heavy. If image recognition is not needed, the 3B version can substitute for text processing.
  • Llama 3.2 (3B) / Meta
    An otherworldly fast model in the 6-second range. Combines honesty (doesn't lie) with overwhelming response speed.
  • Qwen 2.5 (3B) / Alibaba Cloud
    The most fluent and practical in Japanese, but takes more than twice as long as Llama 3B and tends to lie as naturally as breathing.
  • Gemma 2 (2B) / Google
    Sufficiently fast, but despite being the lighter model it lost on speed to Llama 3B. Suited to simple tasks (summarization/translation).
  • Phi-3.5 (3.8B) / Microsoft
    Despite being 3.8B, it's slower than the 11B model (over 60 seconds) and hasn't reached a practical level.
  • Sarashina2.2 (3B) / SB Intuitions (SoftBank)
    Unmeasurable (a ghost of knowledge). It doesn't answer questions, instead running wild in completion mode, so no measurement was possible.

2. Detailed Table of 4-Item Test Results

I asked the same questions (Q1 to Q4) to each model and measured the processing time until completion.
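For reference, a measurement loop like this round's can be sketched against Ollama's local HTTP API (`/api/generate`), which reports `total_duration` in nanoseconds in its non-streaming response. The model tags and the placeholder prompt below are assumptions; the actual Q1-Q4 prompts go in their place:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def ns_to_sec(nanoseconds: int) -> float:
    """Ollama reports durations in nanoseconds; convert to seconds."""
    return round(nanoseconds / 1e9, 1)

def ask(model: str, prompt: str) -> tuple[str, float]:
    """Send one prompt and return (answer, total processing time in seconds)."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["response"], ns_to_sec(body["total_duration"])

if __name__ == "__main__":
    # Placeholder model tags and prompt; substitute the real Q1-Q4 questions.
    for model in ["llama3.2:3b", "qwen2.5:3b", "gemma2:2b"]:
        answer, sec = ask(model, "Q1: ...")
        print(f"{model}: {sec} sec")
```

Timing via `total_duration` rather than wall-clock time keeps the numbers comparable even when model loading happens mid-run.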

| Model | Q1: Work Refusal (Counseling Skill) | Q2: Laundry (Logic/Arithmetic) | Q3: Meeting (Info Extraction) | Q4: Onigiri (Hallucination Check) | Processing Time |
| --- | --- | --- | --- | --- | --- |
| Llama 3.2 (11B) (Meta) | Accurate (changes/leave suggested) | Incorrect (27h); simple proportional error | Perfect | Pass: answered "No info" | 52.8 sec |
| Llama 3.2 (3B) (Meta) | Extreme (consult a doctor) | Failed (2.33h); mysterious calculation | Perfect | Pass: "Could not find" | 6.3 sec |
| Qwen 2.5 (Alibaba) | Very accurate (compensatory leave/app suggested) | Incorrect (27h); simple proportional error | Perfect | Fail (lie): invented "Culture Fusion Tech" | 16.9 sec |
| Gemma 2 (Google) | Average (only leave suggested) | Incorrect (9h); mysterious calculation | Perfect | Fail (lie): invented a "manufacturing method" | 15.4 sec |
| Phi-3.5 (Microsoft) | Average (dialogue suggested) | Failed (18h); calculation itself a mystery | Has typos | Big lie: "VR Packaging System" | 60.3 sec |
| Sarashina2.2 (SB Intuitions) | Unmeasurable (runs wild with a list of questions) | - | - | - | Unmeasurable |

Test Item Intent and Interpretation of Results

  • Q1: Don't want to go to work (Consultation)
    Each model's "personality" emerges. Qwen is practical, suggesting things like "applying for compensatory leave," while Llama 3.2 (3B) tended to take it somewhat seriously, suggesting "consulting a doctor or specialist."
  • Q2: Laundry (Logic)
    A total failure for all models. Llama 3.2 (3B) in particular produced a mysterious figure of "2.33 hours." We should give up on expecting accurate calculations.
  • Q3: Extracting meeting information (Office work)
    Perfect except for Phi. Lightweight models are perfectly adequate for administrative tasks.
  • Q4: Digital Onigiri Protocol (Lie detector)
    The most critical item. In response to a fictional term, only Llama 3.2 (3B/11B) honestly replied that there was "no information." The others told plausible-sounding lies.

Discussion of Results: Trade-offs between Lightness and Honesty

Comparing them side-by-side highlighted the following distinct characteristic differences.

Llama 3.2 (3B)'s "Honesty" and "Speed" are Excellent

Only the Llama series answered "could not find" for Q4 (Onigiri Protocol).
What's even more surprising is the speed. While Qwen and Gemma took 15-17 seconds, Llama 3.2 (3B) finished in just 6.3 seconds.
It "doesn't lie" and is "more than twice as fast as the others." There is no reason not to choose this for text processing. *Note: Although it happened to answer honestly this time, hallucinations cannot be prevented in LLMs; it's better to think of it as having a lower occurrence rate than the other models in this test.

Qwen 2.5 (3B) has High "Japanese Proficiency"

On the other hand, Qwen 2.5 is a step ahead in terms of sentence smoothness and "clever suggestions (like applying for compensatory leave)" that grasp the user's intent. Although it takes longer, it is a strong candidate for creative tasks.


Conclusion: Language Model Picks for the Mac mini

This comparative verification revealed that Llama 3.2 (3B) overwhelms the others in both "speed" and "honesty."
My conclusion for the Ollama environment running on a Mac mini (M4/16GB) is as follows:

1. Main Assistant (Reliability-Oriented)

Llama 3.2 (3B)

  • Reason: Overwhelmingly fast operation in the 6-second range. For tasks like "summarization" or "fact extraction," this model seems most suitable among the candidates.
  • Caution: Mathematical questions and calculations are a no-go.

2. Writing and Ideation (Creativity-Oriented)

Qwen 2.5 (3B)

  • Reason: It provides more fluent Japanese and "clever" responses compared to Llama. It's better suited for tasks requiring creativity, such as drafting emails or brainstorming.

Future Outlook

Through this verification, I've concluded that "Llama 3.2 (3B) is recommended for text processing."
This eliminates the need to keep the heavy Llama 3.2 Vision (11B) resident, but at the same time, the "image recognition" capability has been lost.
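For keeping the chosen model resident, Ollama's `keep_alive` request field controls how long a model stays loaded after answering (a duration string like `"30m"`, or `-1` for indefinitely). A minimal sketch of such a request payload; the model tag and prompt are placeholders:

```python
import json

def resident_payload(model: str, prompt: str, keep: str = "30m") -> dict:
    """Build an /api/generate request body that keeps the model loaded
    after it replies. keep_alive takes durations like "30m", or -1."""
    return {"model": model, "prompt": prompt, "stream": False, "keep_alive": keep}

payload = resident_payload("llama3.2:3b", "Summarize this memo: ...")
print(json.dumps(payload))
```

On a 16GB machine this is the practical lever: keep the small model warm for instant replies, and let the heavy one unload.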

Next time, in order to regain those lost "eyes," I plan to switch to lightweight VLMs (Vision Language Models) with fewer parameters and established reputations for Japanese OCR performance, such as MiniCPM-V and Qwen2-VL, and verify a practical setup that suppresses swapping.
