There's a small thing that happened in the AI-Farm last week that I find more useful than any benchmark we've run on the cluster.
We dropped a verification comment on a vLLM pull request — and discovered, in the process, that the patch had already been quietly fixing things on our hardware for over a month.
The patch
vLLM PR #35568 generalises a handful of CUDA dispatch guards from "SM120 only" to the SM12x family. Small change, big effect on Spark.
SM121 (the GB10 in DGX Spark) and SM120 (RTX 5090) share the same MMA capabilities, but several Marlin and CUTLASS FP8 paths in vLLM were checking arch == 120 exactly. That check silently rejected Spark, which then fell back to slower kernels. The fix is the kind of one-line change that looks trivial in the diff but unlocks a whole class of hardware.
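To make the difference concrete, here is a minimal sketch of an exact-match guard next to a family-wide one. The function names are hypothetical and this is not vLLM's actual code. It assumes the compute capability is encoded as an integer (major * 10 + minor), so 12.0 becomes 120 and 12.1 becomes 121.

```python
# Hypothetical illustration; these helper names are not part of vLLM.
# Assumes capability is encoded as major * 10 + minor (12.1 -> 121).

def is_sm120_exact(capability: int) -> bool:
    """Old-style guard: accepts only compute capability 12.0."""
    return capability == 120


def is_sm12x_family(capability: int) -> bool:
    """Generalised guard: accepts any 12.x compute capability."""
    return capability // 10 == 12


for name, cap in [("RTX 5090", 120), ("GB10 (DGX Spark)", 121), ("an SM100 part", 100)]:
    print(f"{name}: exact={is_sm120_exact(cap)} family={is_sm12x_family(cap)}")
```

With the exact guard, 121 is rejected even though the hardware can run the kernel. The family guard accepts both 120 and 121 while still excluding other architectures.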
The surprise
We came in expecting to build a patched image and verify the new dispatch paths.
We didn't have to.
The patch has been shipping for 38 days in eugr/spark-vllm-docker's Dockerfile — applied inline at build time. Our production image (built 2026-05-06, serving Intel/Qwen3.5-397B-A17B-int4-AutoRound at PP=3 with kv_cache_dtype=fp8) already contained the patched marlin_utils.py. The startup logs prove the fast paths are dispatching correctly:
Using MarlinLinearKernel for GPTQMarlinLinearMethod
Using 'MARLIN' WNA16 MoE backend.
Worth checking what's actually deployed before assuming you need a fresh build — sometimes the community already shipped it.
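A check like this is easy to script. The sketch below scans captured startup log text for the two dispatch messages quoted above and reports any that are missing. The helper name and the sample log are our own illustration, and the exact log wording may differ between vLLM versions, so treat the marker strings as assumptions to adjust.

```python
# Marker strings copied from our startup logs; wording may vary by vLLM version.
EXPECTED_MARKERS = (
    "Using MarlinLinearKernel for GPTQMarlinLinearMethod",
    "Using 'MARLIN' WNA16 MoE backend",
)


def missing_dispatch_markers(log_text: str, markers=EXPECTED_MARKERS) -> list[str]:
    """Return the expected dispatch messages that do not appear in log_text."""
    return [marker for marker in markers if marker not in log_text]


# Hypothetical sample; in practice, read the text from your container logs.
SAMPLE_LOG = (
    "Using MarlinLinearKernel for GPTQMarlinLinearMethod\n"
    "Using 'MARLIN' WNA16 MoE backend.\n"
)

missing = missing_dispatch_markers(SAMPLE_LOG)
print("fast paths dispatching" if not missing else f"missing: {missing}")
```

An empty result suggests the fast paths were selected at startup. A non-empty one is a prompt to look at which kernel the build fell back to.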
What we sent upstream
Instead of "we built and tested it once," the comment we left on the PR is roughly:
> Deployment-tested on real GB10 silicon. 38 days community-wide via the eugr image, 3 days on our specific 3-node Spark cluster, zero PR-related kernel errors observed. Logs show the SM12x guards dispatching as expected.
That's the kind of third-party signal vLLM maintainers find useful when deciding to merge. It's also, for us, the first concrete contribution back to a project we depend on every day.
One thing worth knowing
We started debugging with cutlass_scaled_mm_supports_fp4(121) as our "patched-or-not" canary. It returned True on both patched and unpatched builds — turns out it's a generic arch predicate, not a dispatch-path indicator. The real signal was the dispatch-time logs above.
Capability detection is not the same as dispatch path. Worth keeping in mind for anyone debugging similar guard-generalisation patches.
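A toy model shows why the canary misled us. Everything here is hypothetical and none of these names exist in vLLM: one function stands in for a generic arch predicate, the other for a dispatch decision whose guard depends on whether the build is patched.

```python
# Toy model only: these functions are illustrative, not vLLM APIs.

def supports_feature(capability: int) -> bool:
    """Generic arch predicate: true for any 12.x part, patched or not."""
    return capability // 10 == 12


def select_kernel(capability: int, patched: bool) -> str:
    """Dispatch decision: the outcome depends on which guard the build ships."""
    if patched:
        guard_ok = capability // 10 == 12  # family-wide guard
    else:
        guard_ok = capability == 120       # exact-match guard
    return "fast" if guard_ok else "fallback"


for patched in (False, True):
    print(
        f"patched={patched}: supports={supports_feature(121)} "
        f"kernel={select_kernel(121, patched)}"
    )
```

The predicate answers the same way on both builds, so it cannot tell them apart. Only the dispatch result changes, which is why the dispatch-time logs were the useful signal.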
---
Thanks to @blake-snc for authoring the patch, and to the eugr maintainer for shipping it fast to the Spark community.

Member of the NVIDIA Developer Program — the AI-Farm cluster runs under this identity.
#dgxspark #blackwell #vllm #cuda #aiinfra