
Trouble with Updating spark-vllm-docker: A Lesson in Backups


This is a story about why you should back up your vLLM Docker images before updating, especially those for LLMs you use regularly.

Background

https://github.com/eugr/spark-vllm-docker

spark-vllm-docker picks up new versions of vLLM and FlashInfer every few days. I used to update only occasionally, but recently I had started applying every update as it came out.

However, this backfired, and the behavior of cyankiwi/MiniMax-M2.7-AWQ-4bit, a model I use regularly, became unreliable.

https://huggingface.co/cyankiwi/MiniMax-M2.7-AWQ-4bit

Specifically, it started generating names that do not exist, inserting spaces before symbols such as . and - (so that file names were split and treated as separate files), and so on. These issues hit my work so hard that I couldn't get anything done.

Because the update had overwritten my Docker image, I couldn't roll back to the previous version and was completely stuck. "I should have taken a backup..." I thought, but it was too late.

I checked the spark-vllm-docker issues, but since nothing similar was reported, I decided to open an issue myself.

Causes and Solutions

https://github.com/eugr/spark-vllm-docker/issues/208

As it turned out, someone else was experiencing the same symptoms. I'm glad I opened the issue.

Potential Cause

  • A bug in FlashInfer? (The general consensus seems to be that it is not an issue with vLLM itself.)

Solutions

  • Downgrade the spark-vllm-docker version (v0.18.1).
  • Abandon spark-vllm-docker and use the official NVIDIA vLLM Docker image (v0.17).

Since I couldn't afford to stop working, I initially switched to the NVIDIA vLLM Docker image. Later, once someone reported that the downgraded version worked normally, I switched back to spark-vllm-docker at that version and kept working with it for a while.

Aftermath

While the discussion on the issue was still going, vLLM was updated to v0.20.1. When I heard that this version was stable, I updated immediately and checked the behavior. The problems were indeed gone, which was a huge relief.

Needless to say, I have made it a habit to always back up my Docker images before any subsequent updates.
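
For reference, here is a minimal sketch of what that habit can look like, written as a small Python wrapper around the standard `docker tag` and `docker save` commands. The image name `my-vllm-image:latest`, the `backup-YYYYMMDD` tag scheme, and the helper names are all placeholders I made up for illustration; they are not something spark-vllm-docker itself defines, so substitute whatever name your local build actually uses.

```python
import subprocess
from datetime import date
from typing import List, Optional


def backup_tag(image: str, day: date) -> str:
    """Return a date-stamped backup reference for an image (hypothetical naming scheme)."""
    # Treat the part after the last colon as a tag only if it contains no slash,
    # so a registry port such as localhost:5000/vllm is not mistaken for a tag.
    repo, sep, tag = image.rpartition(":")
    if not sep or "/" in tag:
        repo = image
    return f"{repo}:backup-{day:%Y%m%d}"


def backup_commands(image: str, day: date, archive: Optional[str] = None) -> List[List[str]]:
    """Build the docker commands that preserve the current image before an update."""
    target = backup_tag(image, day)
    # `docker tag` adds a second name for the same image, so rebuilding or
    # re-pulling the original name later does not discard the old image.
    commands = [["docker", "tag", image, target]]
    if archive is not None:
        # Optionally also export the image to a tar file for off-host storage.
        commands.append(["docker", "save", "-o", archive, target])
    return commands


def run_backup(image: str, day: date, archive: Optional[str] = None) -> None:
    """Run the backup commands, stopping at the first failure."""
    for cmd in backup_commands(image, day, archive):
        subprocess.run(cmd, check=True)


# Example call before updating (placeholder image name):
# run_backup("my-vllm-image:latest", date.today())
```

If an update goes wrong, the old image can be brought back by tagging the backup over the original name again (`docker tag <backup tag> <original name>`), or with `docker load -i <archive>` if it was exported to a tar file.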

Conclusion

vLLM is still under active development, so I expect this kind of thing to keep happening. I have taken the lesson to heart.
