Translated by AI
Investigating MoE Behavior with Tanuki
I will investigate the behavior of MoE in Ollama using Tanuki-8B-dpo-v1.0 and Tanuki-8x8B-dpo-v1.0.
MoE
MoE (Mixture of Experts) is an architecture for large language models (LLMs). The model contains multiple "experts" and, for each part of a question or sentence, selects the most suitable ones to generate the response.
For example, imagine having experts specialized in mathematics and others in history, where each expert works only when necessary. Since not all experts are used simultaneously, the computation is lighter relative to the model's total size.
DeepSeek, which has recently become a hot topic, is also an MoE. While DeepSeek-V3 has a total of 671B parameters, it consists of 256 experts, with 8 experts (37B) being active at any given time. 👉 Reference
Since DeepSeek-V3 is too massive to run on a local machine, this article uses Tanuki-8x8B-dpo-v1.0. (The "8x" denotes the number of experts.)
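The expert selection described above can be sketched as a small top-k gate: softmax the router's scores, keep the k highest, and renormalize their weights. This is a generic illustration of MoE routing, not the actual Tanuki or DeepSeek router; the function name and the example scores are invented for demonstration.

```python
import math

def top_k_gate(logits, k=2):
    """Pick the k highest-scoring experts and renormalize their
    softmax weights so they sum to 1 (standard top-k MoE gating)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # indices of the k largest gate probabilities
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# 8 experts; only 2 are activated for this hypothetical token
gates = top_k_gate([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3], k=2)
print(gates)  # experts 1 and 4 carry all the weight
```

Only the selected experts' feed-forward weights are used for this token, which is why an MoE's compute cost tracks the active parameter count rather than the total.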
Environment
The measurements were taken in the following environment:
- OS: Windows 11 Home [Version 10.0.26100.3194]
- RAM: 64GB (DDR4-3200)
- CPU: AMD Ryzen 5 5600X 6-Core Processor
- GPU: AMD Radeon RX 7600 XT (VRAM 16GB)
- Ollama: 0.5.11
I executed the following prompt with /set verbose enabled:
Predict the future of AI. (100 characters)
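`/set verbose` prints tokens/s in the REPL, but the same figure can be computed from Ollama's REST API: the `/api/generate` response includes `eval_count` (generated tokens) and `eval_duration` (nanoseconds). A small helper, assuming a server on the default port; the sample numbers below are illustrative, not measurements from this article.

```python
import json
import urllib.request

def tokens_per_second(stats: dict) -> float:
    """Ollama reports eval_duration in nanoseconds."""
    return stats["eval_count"] / stats["eval_duration"] * 1e9

def generate(model: str, prompt: str, host: str = "http://localhost:11434") -> dict:
    """Non-streaming request to a running Ollama server; the returned
    JSON includes the eval_count/eval_duration timing fields."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# generate("7shi/tanuki-dpo-v1.0", "Predict the future of AI. (100 characters)")
# would return such a dict from a running server; illustrative values here:
sample = {"eval_count": 100, "eval_duration": 2_000_000_000}
print(f"{tokens_per_second(sample):.2f} tps")  # → 50.00 tps
```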
Tanuki-8B-dpo-v1.0
First, as a baseline for comparison, I will measure an 8B model that is not an MoE.
ollama run 7shi/tanuki-dpo-v1.0
| Params | Quantization | Size | Env | tps | RAM | VRAM |
|---|---|---|---|---|---|---|
| 8B | Q6_K | 6.2GB | CPU | 6.77 | 6.8GB | |
| 8B | Q6_K | 6.2GB | GPU | 37.50 | 6.2GB | 7.9GB |
Output Example
The future of AI is expected to bring innovation to a wide range of fields such as medicine, education, transportation, and entertainment through the realization of "Artificial General Intelligence" and "Super Intelligence" with more advanced and human-like intelligence. Additionally, ethical issues, privacy protection, and impacts on the labor market will become important themes, leading to further symbiosis between humans and AI. Furthermore, while the realization of autonomous robots and smart cities will improve the quality of life, new social issues may also arise.
The future of AI will involve advanced automation and collaboration with humans. Innovation is expected in various fields such as medicine, education, and entertainment, and discussions regarding ethical issues will deepen toward solutions. Transparency and privacy protection will become crucial, and sustainable development will be required.
Tanuki-8x8B-dpo-v1.0
I will measure the MoE model, which is the primary objective.
Q4_K_M
First is the model quantized with the commonly used Q4_K_M.
ollama run hf.co/team-hatakeyama-phase2/Tanuki-8x8B-dpo-v1.0-GGUF
| Params | Quantization | Size | Env | tps | RAM | VRAM |
|---|---|---|---|---|---|---|
| 8x8B | Q4_K_M | 28GB | CPU | 5.08 | 27.6GB | |
| 8x8B | Q4_K_M | 28GB | CPU+GPU | 6.44 | 28.7GB | 13.2GB |
Output Example
The future of AI is very bright, and applications in a wide range of fields are expected. In medicine, there is diagnostic support and drug development; in autonomous vehicles, traffic safety and efficiency; and in education, support for individualized learning. We also need to address ethical challenges, and AI will evolve while coexisting with humans.
The future of AI has a very bright outlook. In the future, more advanced natural language processing and image recognition technologies will evolve, making interactions with humans even smoother. Additionally, the accuracy of diagnostic support systems in the medical field will improve, enabling personalized treatment. Furthermore, it is expected that curriculums optimized for individual learners will be provided in education. While addressing ethical challenges, AI will contribute to the realization of a sustainable and efficient society.
Since the model size is too large, only the layers that fit into the 16GB VRAM are offloaded to the GPU.
llm_load_tensors: offloading 16 repeating layers to GPU
llm_load_tensors: offloaded 16/33 layers to GPU
One of the 33 layers is for output and is assigned to the CPU. Of the remaining 32 layers, half (16 layers) are offloaded to the GPU.
In this configuration, the speed improved by about 1.27× compared to CPU-only execution (6.44 vs 5.08 tps).
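As a back-of-envelope check on that speedup: if the 16 offloaded layers cost nothing, halving the CPU's work would at best double throughput. This is my own simplified model, which ignores CPU↔GPU transfer overhead and the CPU-resident output layer:

```python
def best_case_tps(cpu_tps: float, total_layers: int, gpu_layers: int) -> float:
    """Upper bound assuming offloaded GPU layers take zero time:
    per-token time shrinks in proportion to the layers left on the CPU."""
    return cpu_tps * total_layers / (total_layers - gpu_layers)

# 16 of the 32 repeating layers offloaded
print(best_case_tps(5.08, 32, 16))  # ~10.2 tps upper bound; measured: 6.44
```

The gap between the ~10.2 tps ceiling and the measured 6.44 tps suggests that synchronization between CPU and GPU layers eats a large share of the theoretical gain.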
IQ2_XXS
Next, I will try a quantization small enough to fit entirely in VRAM.
ollama run hf.co/team-hatakeyama-phase2/Tanuki-8x8B-dpo-v1.0-GGUF:IQ2_XXS
| Params | Quantization | Size | Env | tps | RAM | VRAM |
|---|---|---|---|---|---|---|
| 8x8B | IQ2_XXS | 12GB | CPU | 3.87 | 12.9GB | |
| 8x8B | IQ2_XXS | 12GB | GPU | 18.16 | 13.4GB | 13.1GB |
Output Example
As AI technology evolves further and more human-like dialogue becomes possible, the relationship between humans and machines will deepen, and the level of mutual understanding and empathy will improve. Additionally, personalized services tailored to individual needs will advance, significantly enhancing the user experience. As a result, solutions utilizing AI will become widely prevalent in various fields from medicine to education, and it is believed that people's quality of life will greatly improve.
First, the evolution of AI technology is expected to continue. Particularly in the fields of machine learning and deep learning, more advanced algorithms and models will be developed. This is expected to improve the accuracy of natural language processing and make interactions with users even smoother.
Additionally, ethical aspects and social impacts will be addressed as important issues. Technological development to ensure the transparency and fairness of AI will progress. For example, new algorithms to reduce bias and data management technologies to protect privacy are expected to evolve.
Furthermore, the application of AI is predicted to expand across various industries such as medicine and education. This is expected to improve people's quality of life and the efficiency of society as a whole.
Overall, while AI technology continues to evolve, its usage and ethical aspects will be carefully considered. Consequently, a future where AI and humans coexist is expected to be envisioned.
All layers are offloaded to the GPU.
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
Because IQ2_XXS is more expensive to dequantize, CPU performance is slower than with Q4_K_M; but since the entire model fits in VRAM, GPU inference becomes dramatically faster.
Summary
| Params | Quantization | Size | Env | tps | RAM | VRAM |
|---|---|---|---|---|---|---|
| 8B | Q6_K | 6.2GB | CPU | 6.77 | 6.8GB | |
| 8B | Q6_K | 6.2GB | GPU | 37.50 | 6.2GB | 7.9GB |
| 8x8B | Q4_K_M | 28GB | CPU | 5.08 | 27.6GB | |
| 8x8B | Q4_K_M | 28GB | CPU+GPU | 6.44 | 28.7GB | 13.2GB |
| 8x8B | IQ2_XXS | 12GB | CPU | 3.87 | 12.9GB | |
| 8x8B | IQ2_XXS | 12GB | GPU | 18.16 | 13.4GB | 13.1GB |

Even though an MoE activates only a subset of its experts per token, every expert's weights must still be resident in memory, so whether the whole model fits in VRAM remains decisive.
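The underlying arithmetic: compute scales with active parameters, but file size (and therefore memory) scales with total parameters. Using the DeepSeek-V3 figures from the introduction (671B total, 37B active) and an assumed ~4.5 bits per weight for Q4_K_M (a rough average, not an exact figure):

```python
def quantized_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate file size: parameters × bits per weight, in gigabytes."""
    return params_billion * bits_per_weight / 8

total, active = 671, 37
print(f"active fraction: {active / total:.1%}")  # ~5.5% of weights compute each token
print(f"Q4_K_M size: ~{quantized_size_gb(total, 4.5):.0f} GB")  # ...but all ~377 GB must be loaded
```

Only about 5.5% of the weights participate in each token, yet the full ~377 GB has to live somewhere, which is exactly the situation observed with Tanuki-8x8B and 16GB of VRAM.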
This result might be useful as a reference when scaling up to run DeepSeek.
Comparison
For comparison, here is an earlier article in which I measured the non-MoE Llama 3.3 70B in the same way.
Although the conditions differ because the quantization methods are not the same, the two models are almost identical in file size, yet there is a clear difference in speed — a consequence of the MoE activating only a fraction of its parameters per token.
| Model | Params | Quantization | Size | Env | tps | RAM | VRAM |
|---|---|---|---|---|---|---|---|
| Tanuki-dpo-v1.0 | 8x8B | Q4_K_M | 28GB | CPU | 5.08 | 27.6GB | |
| Tanuki-dpo-v1.0 | 8x8B | Q4_K_M | 28GB | CPU+GPU | 6.44 | 28.7GB | 13.2GB |
| Llama 3.3 | 70B | Q2_K | 26GB | CPU | 1.46 | 27.2GB | |
| Llama 3.3 | 70B | Q2_K | 26GB | CPU+GPU | 2.42 | 26.9GB | 15.7GB |
Related Articles
This article features measurements using a similar prompt. It explains Radeon drivers and includes a comparison with the GeForce RTX 4060 Ti.
This article includes a comparison with the Intel Arc A770.
Addendum
zakki-san informed me that the override-tensor feature is currently under development.