Translated by AI
Running Llama 3.3 70B with Insufficient VRAM
I ran Llama 3.3 70B on a machine whose VRAM cannot hold the whole model and compared CPU+GPU operation with CPU-only operation.
Environment
I conducted measurements in the following environment:
- OS: Windows 11 Home [Version 10.0.26100.3194]
- RAM: 64GB (DDR4-3200)
- CPU: AMD Ryzen 5 5600X 6-Core Processor
- GPU: AMD Radeon RX 7600 XT (VRAM 16GB)
- Ollama: 0.5.11
I executed the following prompt with /set verbose specified:
AIの未来を予測してください。(100字)
(Predict the future of AI. (100 characters))
Q4_K_M
ollama run llama3.3
| Params | Quantization | Size | Environment | tps | RAM | VRAM |
|---|---|---|---|---|---|---|
| 70B | Q4_K_M | 42GB | CPU | 0.99 | 42GB | |
| 70B | Q4_K_M | 42GB | CPU+GPU | 1.34 | 41.6GB | 15.2GB |
Output Examples
AI will be utilized in various fields such as medicine and finance. In the future, collaboration between humans and AI will improve productivity and lead to the birth of new industries. There are also concerns about an increase in cyberattacks misusing AI. Development of technology and strengthening of regulations to ensure safety and ethics are necessary.
AI will become increasingly sophisticated in the future and will work closely with humans. While practical applications such as autonomous driving and medical support by AI will spread, concerns regarding the ethics and safety of artificial intelligence are also expected to deepen. Furthermore, since the development of AI may bring significant changes to the nature of work and social structures, responses such as education and labor policies will be required.
Since the model size is too large, only the layers that fit within the 16GB VRAM are offloaded to the GPU. (In practice, about 2GB spills over into shared GPU memory.)
llm_load_tensors: offloading 27 repeating layers to GPU
llm_load_tensors: offloaded 27/81 layers to GPU
Of the 81 layers, one is for output and is assigned to the CPU. Out of the remaining 80 layers, 27 are offloaded to the GPU.
In this situation, the speed is improved by approximately 1.3 times compared to running entirely on the CPU.
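As a rough consistency check (my own back-of-envelope model, not something from the Ollama logs): if each of the 80 transformer layers costs roughly equal CPU time and the end-to-end time is just the sum of per-layer times, the measured 1.34/0.99 speedup with 27 layers on the GPU implies how much faster a GPU layer runs than a CPU layer.

```shell
# Back-of-envelope sketch (assumption: all 80 layers cost equal CPU time).
awk 'BEGIN {
  layers = 80; gpu_layers = 27
  speedup = 1.34 / 0.99                  # measured CPU+GPU vs CPU-only tps
  cpu_layers = layers - gpu_layers
  # speedup = layers / (cpu_layers + gpu_layers / s); solve for s
  s = gpu_layers / (layers / speedup - cpu_layers)
  printf "implied GPU-vs-CPU per-layer speed: %.1fx\n", s
}'
```

Under this crude model a GPU layer runs about 4.4x faster than a CPU layer; the real picture is muddier because about 2GB of the offloaded weights spill into shared GPU memory.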
Q2_K
Since Q4_K_M is too large to run at a usable speed, I also tried a smaller quantization of the same model.
ollama run hf.co/unsloth/Llama-3.3-70B-Instruct-GGUF:Q2_K
| Params | Quantization | Size | Environment | tps | RAM | VRAM |
|---|---|---|---|---|---|---|
| 70B | Q2_K | 26GB | CPU | 1.46 | 27.2GB | |
| 70B | Q2_K | 26GB | CPU+GPU | 2.42 | 26.9GB | 15.7GB |
Output Examples
AI is highly likely to evolve through artificial intelligence, machine learning, deep learning, etc., and become an existence with intelligence comparable to humans. While wide applications are expected in data analysis, autonomous driving, medicine, and more, there are also concerns about a loss of sense of morality and a decrease in employment. In the United States, educational reforms have begun to prevent talent poaching by AI.
The future of artificial intelligence (AI) will bring many benefits to humanity. It is predicted that AI technology will evolve and revolutionize fields such as medicine, education, and transportation. Furthermore, a transition to a new society through the coexistence of robots and humans is also expected.
Since the model size is too large, only the layers that fit within the 16GB VRAM are offloaded to the GPU. (In practice, about 2GB spills over into shared GPU memory.)
llm_load_tensors: offloading 46 repeating layers to GPU
llm_load_tensors: offloaded 46/81 layers to GPU
Of the 81 layers, one is for output and is assigned to the CPU. Out of the remaining 80 layers, 46 are offloaded to the GPU.
In this situation, the speed is improved by approximately 1.6 times compared to running entirely on the CPU.
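A crude consistency check (my own model, not from the logs): assuming all 80 transformer layers cost equal CPU time, the measured 2.42/1.46 speedup with 46 layers on the GPU implies the per-layer GPU advantage for this run.

```shell
# Back-of-envelope sketch (assumption: all 80 layers cost equal CPU time).
awk 'BEGIN {
  layers = 80; gpu_layers = 46
  speedup = 2.42 / 1.46                  # measured CPU+GPU vs CPU-only tps
  cpu_layers = layers - gpu_layers
  # speedup = layers / (cpu_layers + gpu_layers / s); solve for s
  s = gpu_layers / (layers / speedup - cpu_layers)
  printf "implied GPU-vs-CPU per-layer speed: %.1fx\n", s
}'
```

About 3.2x, which is lower than one might expect from a dedicated GPU; one plausible cause is the roughly 2GB of offloaded weights spilling into slower shared GPU memory.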
Summary
| Params | Quantization | Size | Environment | tps | RAM | VRAM |
|---|---|---|---|---|---|---|
| 70B | Q4_K_M | 42GB | CPU | 0.99 | 42GB | |
| 70B | Q4_K_M | 42GB | CPU+GPU | 1.34 | 41.6GB | 15.2GB |
| 70B | Q2_K | 26GB | CPU | 1.46 | 27.2GB | |
| 70B | Q2_K | 26GB | CPU+GPU | 2.42 | 26.9GB | 15.7GB |
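One way to read the CPU-only rows (my interpretation, not stated in the measurements): token generation on the CPU is memory-bandwidth bound, since every generated token has to read essentially the whole model from RAM. So tps times model size approximates the achieved read bandwidth.

```shell
# Achieved RAM read bandwidth implied by the CPU-only rows:
# tps * model size ~= GB read per second.
awk 'BEGIN {
  printf "Q4_K_M: %.1f GB/s\n", 0.99 * 42
  printf "Q2_K:   %.1f GB/s\n", 1.46 * 26
}'
```

Both land just under the 51.2 GB/s theoretical peak of dual-channel DDR4-3200, which suggests these runs are limited by RAM bandwidth rather than by the CPU cores.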

Comparison
This is an article where I measured an MoE model using the same method.
Although conditions differ because the quantization methods are different, there is a clear difference in speed even at almost the same file size, because an MoE model activates only a subset of its experts for each token rather than its entire set of weights.
| Model | Params | Quantization | Size | Environment | tps | RAM | VRAM |
|---|---|---|---|---|---|---|---|
| Llama 3.3 | 70B | Q2_K | 26GB | CPU | 1.46 | 27.2GB | |
| Llama 3.3 | 70B | Q2_K | 26GB | CPU+GPU | 2.42 | 26.9GB | 15.7GB |
| Tanuki-dpo-v1.0 | 8x8B | Q4_K_M | 28GB | CPU | 5.08 | 27.6GB | |
| Tanuki-dpo-v1.0 | 8x8B | Q4_K_M | 28GB | CPU+GPU | 6.44 | 28.7GB | 13.2GB |
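A rough way to quantify the MoE advantage (my own estimate, under the assumption that CPU generation is bandwidth-bound at the rate implied by the dense Q2_K CPU row above): dividing that bandwidth by Tanuki's 5.08 tps gives the data actually read per token, i.e. the active fraction of the 28GB model.

```shell
# Estimate what fraction of the 28GB MoE model is actually read per token,
# assuming the same effective RAM bandwidth as the dense Q2_K CPU run.
awk 'BEGIN {
  bw = 1.46 * 26               # GB/s implied by the dense Llama 3.3 Q2_K CPU row
  per_token = bw / 5.08        # GB read per token for Tanuki-dpo-v1.0 on CPU
  printf "active fraction: %.0f%% of 28GB\n", 100 * per_token / 28
}'
```

Roughly a quarter of the weights per token, which is plausible for a Mixtral-style router that activates 2 of 8 experts (plus shared attention weights), though I have not confirmed Tanuki's exact routing.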
Related Articles
These articles were measured with the same prompt. In them I explain the Radeon drivers, and there is a comparison with the GeForce RTX 4060 Ti.
There is a comparison with the Intel Arc A770.
Postscript
Haruneko Daifuku (@haru_arc) informed me that when Llama 3.3 70B (Q4_K_M) fits entirely in VRAM on an M2 Ultra, it achieves 12.21 tps.
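That number also fits a bandwidth-bound reading (my estimate): 12.21 tps on a 42GB model implies the effective memory read rate below, comfortably within the M2 Ultra's 800 GB/s unified memory bandwidth.

```shell
# Implied memory read rate for Llama 3.3 70B Q4_K_M fully resident on M2 Ultra.
awk 'BEGIN { printf "%.0f GB/s\n", 12.21 * 42 }'
```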
Discussion
Which AMD driver are you using?
Adrenalin 24.12.1. Versions 24.9.1 and 24.10.1 have bugs and do not work correctly. Please see the following for details.