[M4 Mac mini / 16GB] AI Evaluation Log #02: Comparing Lightweight LLMs on Ollama
Ollama Local LLM Real-Device Test
Introduction: Motivation for Verification
In my previous article, I concluded that running Llama 3.2 Vision (11B) on a Mac mini with 16GB of memory caused swapping, making it difficult for practical use.
At that time, I ended with "Next, I will look for a lightweight vision language model (VLM)," but I reconsidered, thinking "It would be a problem if even basic text chat is heavy in the first place."
Therefore, as a step before moving on to image analysis verification, I conducted a simultaneous comparative evaluation of "lightweight text-based models (2B-3B class)" that can run smoothly without swapping even on 16GB of memory and can be kept resident.
0. Verification Environment (Prerequisites)
This test was conducted in the following environment:
- Model: Mac mini (M4 chip)
- Memory: 16GB
- Runtime: Ollama
Here is the memory status while actually running the lightweight model group (2B-3B) this time. While Llama 3.2 (11B) was in the yellow zone, the memory pressure for the lightweight models stayed "green (normal)."
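To reproduce this check without watching Activity Monitor, Ollama exposes a `/api/ps` endpoint that reports which models are loaded and how much memory they occupy (the same data `ollama ps` prints on the command line). A minimal sketch, assuming Ollama is running at its default `localhost:11434` address:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default listen address (assumption)

def summarize_loaded_models(payload: dict) -> list[str]:
    """Format /api/ps output: one 'name: size (share in GPU/unified memory)' line per model."""
    lines = []
    for m in payload.get("models", []):
        size_gb = m["size"] / 1e9
        vram_pct = 100 * m.get("size_vram", 0) / m["size"] if m["size"] else 0
        lines.append(f'{m["name"]}: {size_gb:.1f} GB ({vram_pct:.0f}% in GPU memory)')
    return lines

if __name__ == "__main__":
    # Requires a running Ollama server.
    with urllib.request.urlopen(f"{OLLAMA_URL}/api/ps") as resp:
        print("\n".join(summarize_loaded_models(json.load(resp))))
```

On Apple Silicon the "GPU memory" share comes out of the same 16GB unified pool, so this is a quick way to see how much headroom a resident model leaves.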

1. One-Line Summary by Model (Developer)
This time, I compared popular lightweight models (text-focused) based on "practicality" rather than benchmarks.
- Llama 3.2 Vision (11B) / Meta
  Reliable but heavy. If image recognition is not needed, the 3B version can substitute for text processing.
- Llama 3.2 (3B) / Meta
  Otherworldly fast, finishing in the 6-second range. Combines honesty (it doesn't lie) with overwhelming response speed.
- Qwen 2.5 (3B) / Alibaba Cloud
  The most fluent and practical in Japanese, but takes more than twice as long as Llama 3B and lies as naturally as breathing.
- Gemma 2 (2B) / Google
  Sufficiently fast, but loses on speed to the even lighter Llama 3B. Suited to simple tasks (summarization/translation).
- Phi-3.5 (3.8B) / Microsoft
  Despite being only 3.8B, it is slower than the 11B model (over 60 seconds) and has not reached a practical level.
- Sarashina2.2 (3B) / SB Intuitions (SoftBank)
  Unmeasurable (a ghost of knowledge). It doesn't answer questions and keeps running wild in completion mode, so it could not be timed.
2. Detailed Table of 4-Item Test Results
I asked the same questions (Q1 to Q4) to each model and measured the processing time until completion.
| Model | Q1: Work Refusal (Counseling Skill) | Q2: Laundry (Logic/Arithmetic) | Q3: Meeting (Info Extraction) | Q4: Onigiri (Hallucination Check) | Processing Time |
|---|---|---|---|---|---|
| Llama 3.2 (11B) (Meta) | Accurate (changes/leave suggested) | Incorrect (27h), simple proportional error | Perfect | Pass: answered "No info" | 52.8 sec |
| Llama 3.2 (3B) (Meta) | Extreme (consult a doctor) | Failed (2.33h), mysterious calculation | Perfect | Pass: "Could not find" | 6.3 sec |
| Qwen 2.5 (Alibaba) | Very accurate (compensatory leave/app suggestion) | Incorrect (27h), simple proportional error | Perfect | Fail (lie): invented "Culture Fusion Tech" | 16.9 sec |
| Gemma 2 (Google) | Average (only leave suggestion) | Incorrect (9h), mysterious calculation | Perfect | Fail (lie): invented a "manufacturing method" | 15.4 sec |
| Phi-3.5 (Microsoft) | Average (dialogue suggested) | Failed (18h), calculation itself a mystery | △ Has typos | Big lie: "VR Packaging System" | 60.3 sec |
| Sarashina2.2 (SB Intuitions) | Unmeasurable (runs wild with a list of Qs) | - | - | - | Unmeasurable |
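The measurement method, asking each model the same question and timing it to completion, can be sketched as a small script against Ollama's `/api/generate` endpoint with streaming disabled. The model tags and the sample prompt below are assumptions (the actual test questions were in Japanese); check `ollama list` for your exact tag names:

```python
import json
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address (assumption)

def build_body(model: str, prompt: str) -> dict:
    """Non-streaming request body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def time_prompt(model: str, prompt: str) -> tuple[str, float]:
    """Send one prompt and return (answer, wall-clock seconds to completion)."""
    data = json.dumps(build_body(model, prompt)).encode()
    req = urllib.request.Request(f"{OLLAMA_URL}/api/generate", data=data,
                                 headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        answer = json.load(resp)["response"]
    return answer, time.perf_counter() - start

if __name__ == "__main__":
    # Registry tags are assumptions; the prompt is a stand-in for the real Q1-Q4.
    for model in ("llama3.2:3b", "qwen2.5:3b", "gemma2:2b", "phi3.5"):
        answer, secs = time_prompt(model, "Summarize: the meeting moved to 3 p.m.")
        print(f"{model}: {secs:.1f} sec")
```

Note that the first request to a model includes load time; for a fair comparison, warm each model up once before timing.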
Test Item Intent and Interpretation of Results
- Q1: Don't want to go to work (Consultation)
  Each model's "personality" emerges here. Qwen is practical, suggesting things like applying for compensatory leave, while Llama 3.2 (3B) took it somewhat seriously, suggesting "consulting a doctor or specialist."
- Q2: Laundry (Logic)
  A total failure across the board. Llama 3.2 (3B) in particular produced a mysterious figure of "2.33 hours." Give up on expecting accurate calculations.
- Q3: Extracting meeting information (Office work)
  Perfect except for Phi. Lightweight models are perfectly adequate for administrative tasks.
- Q4: Digital Onigiri Protocol (Lie detector)
  The most critical item. Faced with a fictional term, only Llama 3.2 (3B/11B) honestly replied that there was "no information." The others told plausible-sounding lies.
Discussion of Results: Trade-offs between Lightness and Honesty
Comparing them side by side highlighted some distinct differences in character.
Llama 3.2 (3B)'s "Honesty" and "Speed" are Excellent
Only the Llama series answered "could not find" for Q4 (Onigiri Protocol).
What's even more surprising is the speed. While Qwen and Gemma took 15–16 seconds, Llama 3.2 (3B) finished in just 6.3 seconds.
It "doesn't lie" and is "more than twice as fast as the others." There is no reason not to choose this for text processing. *Note: Although it happened to answer honestly this time, hallucinations cannot be prevented in LLMs; it's better to think of it as having a lower occurrence rate than the other models in this test.
Qwen 2.5 (3B) has High "Japanese Proficiency"
On the other hand, Qwen 2.5 is a step ahead in terms of sentence smoothness and "clever suggestions (like applying for compensatory leave)" that grasp the user's intent. Although it takes longer, it is a strong candidate for creative tasks.
Conclusion: Which Language Model for the Mac mini?
This comparative verification revealed that Llama 3.2 (3B) overwhelms the others in both "speed" and "honesty."
My conclusion for the Ollama environment running on a Mac mini (M4/16GB) is as follows:
1. Main Assistant (Reliability-Oriented)
☑ Llama 3.2 (3B)
- Reason: Overwhelmingly fast operation in the 6-second range. For tasks like "summarization" or "fact extraction," this model seems most suitable among the candidates.
- Caution: Mathematical questions and calculations are a no-go.
2. Writing and Ideation (Creativity-Oriented)
☑ Qwen 2.5 (3B)
- Reason: It provides more fluent Japanese and "clever" responses compared to Llama. It's better suited for tasks requiring creativity, such as drafting emails or brainstorming.
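This two-model setup can be wired up with Ollama's `keep_alive` request parameter: a negative value keeps a model resident instead of unloading it after the default five-minute idle timeout. A minimal sketch, assuming the default `localhost:11434` address and these registry tags:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default Ollama address (assumption)

def preload_body(model: str, keep_alive: int = -1) -> dict:
    """An empty prompt loads the model without generating anything;
    keep_alive=-1 keeps it resident (Ollama unloads after 5 minutes by default)."""
    return {"model": model, "prompt": "", "keep_alive": keep_alive}

def preload(model: str, keep_alive: int = -1) -> None:
    data = json.dumps(preload_body(model, keep_alive)).encode()
    req = urllib.request.Request(f"{OLLAMA_URL}/api/generate", data=data,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req).read()

if __name__ == "__main__":
    preload("llama3.2:3b")  # main assistant stays resident
    # qwen2.5:3b can then be loaded on demand for writing tasks
    # and will unload on its own, keeping the 16GB pool free.
```

With the 3B model occupying roughly a quarter of memory, loading Qwen 2.5 (3B) alongside it temporarily should still stay out of the swap zone.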
Future Outlook
Through this verification, I've concluded that "Llama 3.2 (3B) is recommended for text processing."
This eliminates the need to keep the heavy Llama 3.2 Vision (11B) resident, but at the same time, the "image recognition" capability has been lost.
Next time, in order to regain those lost "eyes," I plan to switch to lightweight VLMs (Vision Language Models) with fewer parameters and established reputations for Japanese OCR performance, such as MiniCPM-V and Qwen2-VL, and verify a practical setup that suppresses swapping.