
Running Llama 3.3 70B with Insufficient VRAM


I ran Llama 3.3 70B in a setup where the model does not fit in VRAM and compared it with CPU-only operation.

Environment

I conducted measurements in the following environment:

  • OS: Windows 11 Home [Version 10.0.26100.3194]
  • RAM: 64GB (DDR4-3200)
  • CPU: AMD Ryzen 5 5600X 6-Core Processor
  • GPU: AMD Radeon RX 7600 XT (VRAM 16GB)
  • Ollama: 0.5.11

I executed the following prompt with /set verbose specified:

AIの未来を予測してください。(100字)
(Predict the future of AI. (100 characters))

Q4_K_M

```
ollama run llama3.3
```

| Params | Quantization | Size | Environment | tps | RAM | VRAM |
| --- | --- | --- | --- | --- | --- | --- |
| 70B | Q4_K_M | 42GB | CPU | 0.99 | 42GB | |
| 70B | Q4_K_M | 42GB | CPU+GPU | 1.34 | 41.6GB | 15.2GB |
Output Examples

AI will be utilized in various fields such as medicine and finance. In the future, collaboration between humans and AI will improve productivity and lead to the birth of new industries. There are also concerns about an increase in cyberattacks misusing AI. Development of technology and strengthening of regulations to ensure safety and ethics are necessary.


AI will become increasingly sophisticated in the future and will work closely with humans. While practical applications such as autonomous driving and medical support by AI will spread, concerns regarding the ethics and safety of artificial intelligence are also expected to deepen. Furthermore, since the development of AI may bring significant changes to the nature of work and social structures, responses such as education and labor policies will be required.

Since the model size is too large, only the layers that fit within the 16GB VRAM are offloaded to the GPU. (In practice, about 2GB spills over into shared GPU memory.)

```
llm_load_tensors: offloading 27 repeating layers to GPU
llm_load_tensors: offloaded 27/81 layers to GPU
```

Of the 81 layers, one is the output layer and is kept on the CPU; of the remaining 80, 27 are offloaded to the GPU.

In this situation, the speed is improved by approximately 1.3 times compared to running entirely on the CPU.
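The layer split can be roughly sanity-checked from the file size. The following is my own back-of-the-envelope estimate, not Ollama's actual allocator (which also reserves VRAM for the KV cache and compute buffers), assuming the 42GB file is spread evenly over the 80 repeating layers:

```python
# Rough estimate of how many repeating layers fit in 16GB of VRAM,
# assuming the model file is spread evenly across the repeating layers.
model_size_gb = 42
repeating_layers = 80
vram_gb = 16

per_layer_gb = model_size_gb / repeating_layers  # ≈ 0.525 GB per layer
max_layers = int(vram_gb // per_layer_gb)        # 30, if VRAM held only weights

print(per_layer_gb, max_layers)
```

The estimate of 30 layers is slightly above the 27 that were actually offloaded; the gap is plausibly the VRAM Ollama sets aside for the KV cache and compute buffers.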

Q2_K

Since Q4_K_M is too large to reach a usable speed, I also tried a smaller quantization.

```
ollama run hf.co/unsloth/Llama-3.3-70B-Instruct-GGUF:Q2_K
```

| Params | Quantization | Size | Environment | tps | RAM | VRAM |
| --- | --- | --- | --- | --- | --- | --- |
| 70B | Q2_K | 26GB | CPU | 1.46 | 27.2GB | |
| 70B | Q2_K | 26GB | CPU+GPU | 2.42 | 26.9GB | 15.7GB |
Output Examples

AI is highly likely to evolve through artificial intelligence, machine learning, deep learning, etc., and become an existence with intelligence comparable to humans. While wide applications are expected in data analysis, autonomous driving, medicine, and more, there are also concerns about a loss of sense of morality and a decrease in employment. In the United States, educational reforms have begun to prevent talent poaching by AI.


The future of artificial intelligence (AI) will bring many benefits to humanity. It is predicted that AI technology will evolve and revolutionize fields such as medicine, education, and transportation. Furthermore, a transition to a new society through the coexistence of robots and humans is also expected.

Since the model size is too large, only the layers that fit within the 16GB VRAM are offloaded to the GPU. (In practice, about 2GB spills over into shared GPU memory.)

```
llm_load_tensors: offloading 46 repeating layers to GPU
llm_load_tensors: offloaded 46/81 layers to GPU
```

Of the 81 layers, one is the output layer and is kept on the CPU; of the remaining 80, 46 are offloaded to the GPU.

In this situation, the speed is improved by approximately 1.6 times compared to running entirely on the CPU.
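A crude upper bound on the speedup: if the offloaded layers took zero time, throughput would scale with the share of layers left on the CPU. A sketch under that (overly optimistic) assumption, ignoring the output layer and the fact that the GPU is not infinitely fast:

```python
# Idealized speedup if GPU-offloaded layers cost nothing:
# total time is proportional to the layers still evaluated on the CPU.
def ideal_speedup(total_layers, gpu_layers):
    return total_layers / (total_layers - gpu_layers)

print(ideal_speedup(80, 27))  # ≈ 1.51 (measured ratio from the tables: ≈ 1.35)
print(ideal_speedup(80, 46))  # ≈ 2.35 (measured ratio from the tables: ≈ 1.66)
```

The measured speedups fall short of these bounds, which is expected: the GPU layers still take time, and CPU and GPU contend for the same system memory while the spilled-over layers are serviced.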

Summary

| Params | Quantization | Size | Environment | tps | RAM | VRAM |
| --- | --- | --- | --- | --- | --- | --- |
| 70B | Q4_K_M | 42GB | CPU | 0.99 | 42GB | |
| 70B | Q4_K_M | 42GB | CPU+GPU | 1.34 | 41.6GB | 15.2GB |
| 70B | Q2_K | 26GB | CPU | 1.46 | 27.2GB | |
| 70B | Q2_K | 26GB | CPU+GPU | 2.42 | 26.9GB | 15.7GB |

![Graph of tps for each configuration](graph.png)

Comparison

In the following article, I measured an MoE model with the same method.

https://zenn.dev/7shi/articles/8c4255d199264f

Although the quantization levels differ, so the conditions are not identical, there is a clear difference in speed at nearly the same file size, because an MoE model activates only part of its weights for each token.

| Model | Params | Quantization | Size | Environment | tps | RAM | VRAM |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Llama 3.3 | 70B | Q2_K | 26GB | CPU | 1.46 | 27.2GB | |
| Llama 3.3 | 70B | Q2_K | 26GB | CPU+GPU | 2.42 | 26.9GB | 15.7GB |
| Tanuki-dpo-v1.0 | 8x8B | Q4_K_M | 28GB | CPU | 5.08 | 27.6GB | |
| Tanuki-dpo-v1.0 | 8x8B | Q4_K_M | 28GB | CPU+GPU | 6.44 | 28.7GB | 13.2GB |
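The gap is consistent with a memory-bandwidth view of token generation: a dense model reads all of its weights for every token, while an MoE model reads only the active experts. A rough sketch under my own assumptions, namely one full pass over the active weights per token and Mixtral-style top-2-of-8 routing for the 8x8B model (which I have not verified for Tanuki):

```python
# Effective memory traffic per second, assuming each generated token
# reads the active portion of the weights exactly once.
def traffic_gb_s(tps, size_gb, active_fraction=1.0):
    return tps * size_gb * active_fraction

dense = traffic_gb_s(1.46, 26)         # Llama 3.3 Q2_K on CPU: ≈ 38 GB/s
moe = traffic_gb_s(5.08, 28, 2 / 8)    # if 2 of 8 experts are active: ≈ 36 GB/s

print(dense, moe)
```

Both figures land near the same ceiling, plausibly the DDR4-3200 memory bandwidth, which would explain why the MoE model runs several times faster at almost the same file size.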

Related Articles

The following articles were measured with the same prompt. The first explains Radeon driver setup and includes a comparison with the GeForce RTX 4060 Ti.

https://qiita.com/7shi/items/dc037c2d5b0add0da33a

There is a comparison with the Intel Arc A770.

https://7shi.hateblo.jp/entry/2024/12/17/020636

Postscript

Haruneko Daifuku (@haru_arc) informed me that when Llama 3.3 70B (Q4_K_M) fits entirely in VRAM on an M2 Ultra, it achieves 12.21 tps.
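That figure also fits a memory-bandwidth-bound picture. A back-of-the-envelope check, under my assumption that each generated token reads the full 42GB of weights once (the M2 Ultra's memory bandwidth is 800GB/s):

```python
# Effective memory bandwidth implied by the reported throughput,
# assuming each token reads the full set of weights once.
tps = 12.21
model_size_gb = 42

effective_gb_s = tps * model_size_gb  # ≈ 513 GB/s
print(effective_gb_s)
```

Roughly 513GB/s against an 800GB/s ceiling is a plausible utilization for all-in-VRAM inference.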
