A contributor (forforever73) opened pull request #21287 on the ggml-org/llama.cpp repository to add support for the step3-vl-10b multimodal (vision-language) checkpoint. The change would let hobbyists and developers run this 10B model within the lightweight C/C++ inference engine that powers local LLaMA deployments, broadening the library's model compatibility for on-device, privacy-preserving, or offline use. The PR reflects the community-driven expansion of llama.cpp's ecosystem and its emphasis on practical deployment of larger multimodal models on consumer hardware without heavy framework dependencies.
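Once support like this is merged, running the converted checkpoint locally usually follows the same pattern as any other GGUF model. Below is a minimal sketch using the llama-cpp-python bindings; the file name is a placeholder, and step3-vl-10b in particular would only load after the PR lands and the bindings are rebuilt against a llama.cpp version that includes it.

```python
# Minimal sketch: loading a converted GGUF through the llama-cpp-python
# bindings and running a text prompt. The model file name is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="step3-vl-10b-Q4_K_M.gguf",  # placeholder path to a converted checkpoint
    n_ctx=4096,                             # context window for the session
    n_gpu_layers=-1,                        # offload all layers to GPU if one is available
)

out = llm(
    "Summarize what the step3-vl-10b model is designed for.",
    max_tokens=128,
)
print(out["choices"][0]["text"])
```

Vision input would additionally require the model's multimodal projector (mmproj) file and whatever interface the merged support exposes for it; that part is omitted here.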
Llama.cpp MTP support now in beta!
A list of large language models expected to support MTP (multi-token prediction) as it is integrated into llama.cpp has circulated, naming DeepSeekv3 OG, DeepSeekv3.2/4, Qwen3.5, GLM4.5+, MiniMax2.5+, Step3.5Flash, and Mimo v2+. The post notes that until ready-made MTP-enabled GGUF weights are released, users must download the Hugging Face weights and convert them to GGUF format themselves for local use (a conversion sketch follows below). The author plans to test qwen3.5-122b or glm4.5-air first. This matters for developers running models locally because MTP lets a model draft several tokens per step, which llama.cpp can exploit for speculative decoding to speed up generation, while GGUF conversion remains the practical route for immediate experimentation.
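The download-and-convert step mentioned above can be scripted. This is a rough sketch only: the Hugging Face repo id and output file names are placeholders (substitute the real model card once MTP-enabled weights are published), and it assumes a local llama.cpp checkout with convert_hf_to_gguf.py and its Python requirements installed.

```python
# Sketch of the "download Hugging Face weights, convert to GGUF" step.
# Repo id and file names are placeholders, not real published checkpoints.
import subprocess
from huggingface_hub import snapshot_download

# Download the original checkpoint from Hugging Face.
local_dir = snapshot_download(repo_id="some-org/qwen3.5-122b")  # placeholder repo id

# Convert it to GGUF with the script bundled in the llama.cpp repository.
subprocess.run(
    [
        "python",
        "llama.cpp/convert_hf_to_gguf.py",
        local_dir,
        "--outfile", "qwen3.5-122b-f16.gguf",
        "--outtype", "f16",
    ],
    check=True,
)
```

The f16 output would typically then be quantized with llama.cpp's llama-quantize tool before running on consumer hardware.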