Translated by AI
Trouble with Updating spark-vllm-docker: A Lesson in Backups
This is a story about why you should back up your vLLM Docker images before updating, especially those for LLMs you use regularly.
Background
spark-vllm-docker receives updates to vLLM and FlashInfer every few days. I rarely kept up with these updates in the past, but recently I had started applying every one of them.
However, this backfired, and the behavior of cyankiwi/MiniMax-M2.7-AWQ-4bit, a model I use regularly, became unreliable.
Specifically, it started generating names that do not exist and inserting spaces before symbols such as . and - (so that the pieces were treated as separate files), among other problems. These issues disrupted my work so badly that I couldn't get anything done.
Because I had overwritten the Docker image, I couldn't revert to the previous version, leaving me completely stuck. "I should have taken a backup..." was a thought that came too late.
I checked the spark-vllm-docker issues, but since nothing similar was reported, I decided to open an issue myself.
Causes and Solutions
As it turned out, someone else was experiencing the same symptoms. I'm glad I opened the issue.
Potential Cause
- A bug in FlashInfer? (The general consensus seems to be that it is not an issue with vLLM itself.)
Solutions
- Downgrade the spark-vllm-docker version (v0.18.1).
- Abandon spark-vllm-docker and use the official NVIDIA vLLM Docker image (v0.17).
Since I couldn't stop my work, I initially used the NVIDIA vLLM Docker image. Later, after hearing that the downgraded version had been confirmed to work normally, I switched back to spark-vllm-docker at that version and continued my work for a while.
Aftermath
While the discussion on the issue was still going on, vLLM was updated and reached v0.20.1. Upon hearing that this version was stable, I immediately updated and verified the behavior. Indeed, the problems were resolved. A huge relief.
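One way to make this kind of verification repeatable is a small check for the "space before a symbol" symptom. The following is only a rough sketch, not the procedure I actually followed: the function name find_split_names and the sample file names are made up for illustration, and it assumes you already know which file names the model was supposed to output.

```python
def find_split_names(output: str, expected_names: list[str]) -> list[str]:
    """Return the expected names that appear in the output only in a
    whitespace-broken form (for example "main .py" instead of "main.py").

    Heuristic: a name counts as split when it is missing from the raw
    output but present once all whitespace has been removed.
    Assumes the expected names themselves contain no whitespace.
    """
    squeezed = "".join(output.split())  # output with all whitespace removed
    return [
        name
        for name in expected_names
        if name not in output and name in squeezed
    ]


if __name__ == "__main__":
    # Hypothetical model output showing the symptom.
    sample = "Edited src/main .py and README.md"
    print(find_split_names(sample, ["main.py", "README.md"]))
```

An empty list means none of the expected names were broken up in this way. It does not catch the other symptom (invented names), so it is a quick sanity check rather than a full test.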
Needless to say, I have made it a habit to always back up my Docker images before any subsequent updates.
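As a minimal sketch of that habit, the helper below only builds the docker commands for a dated backup and for a later rollback; it does not run them. The image name vllm-node:latest, the backup tag format, and the function names are my assumptions for illustration, so substitute whatever your own build produces. The only Docker features it relies on are docker tag and docker save -o.

```python
import shlex
from datetime import date


def backup_commands(repo: str, tag: str, day: date) -> list[list[str]]:
    """Build commands that keep the current image under a dated tag
    and also export it to a tar archive."""
    backup_tag = f"{tag}-backup-{day:%Y%m%d}"
    source = f"{repo}:{tag}"
    backup = f"{repo}:{backup_tag}"
    # Slashes in the repository name are not usable in a file name.
    archive = f"{repo.replace('/', '_')}_{backup_tag}.tar"
    return [
        ["docker", "tag", source, backup],          # extra name for the same image
        ["docker", "save", "-o", archive, backup],  # optional offline copy
    ]


def restore_command(repo: str, tag: str, day: date) -> list[str]:
    """Build the command that points the working tag back at the backup."""
    return ["docker", "tag", f"{repo}:{tag}-backup-{day:%Y%m%d}", f"{repo}:{tag}"]


if __name__ == "__main__":
    # "vllm-node" and "latest" are placeholder values, not confirmed names.
    for cmd in backup_commands("vllm-node", "latest", date.today()):
        print(shlex.join(cmd))
```

Because the dated tag is just a second name for the same image, rebuilding later moves only the original tag and the old image stays available. The tar archive is a further safeguard in case the local image store is pruned; it can be loaded again with docker load -i.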
Conclusion
vLLM is still software in active development, so I expect these kinds of things will happen often. I have taken this lesson to heart.