First upstream contribution from the AI-Farm: verifying vLLM PR #35568 on DGX Spark GB10

A short field note from the AI-Farm: we left a verification comment on vLLM PR #35568 and found that the patch had already been shipping for 38 days in the community image behind our GB10 deployment.

Something small happened in the AI-Farm last week that I find more useful than any benchmark we've run on the cluster.

We dropped a verification comment on a vLLM pull request and discovered, in the process, that the patch had already been shipping in the image we run for over a month.

The patch

vLLM PR #35568 generalises a handful of CUDA dispatch guards from "SM120 only" to the SM12x family. Small change, big effect on Spark.

SM121 (the GB10 in DGX Spark) and SM120 (RTX 5090) share the same MMA capabilities, but several Marlin and CUTLASS FP8 paths in vLLM were checking for arch == 120 exactly. That check silently rejected Spark and fell back to slower kernels. The fix is the kind of one-line change that looks trivial in the diff but unlocks a whole class of hardware.

Figure: the vLLM PR #35568 patch visualised. SM120 (RTX 5090) and SM121 (GB10) silicon share the same MMA capabilities; the dispatch guard generalises from arch == 120 to arch in {120, 121}.
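To make the guard change concrete, here is a minimal Python sketch of the two styles of check. It is an illustration only: the function names, the `SM12X_FAMILY` constant, and the "fast_path"/"fallback" labels are hypothetical and are not taken from vLLM's source or from the PR diff.

```python
# Hypothetical illustration of an exact-match guard versus a family guard.
# None of these names come from vLLM; they only mirror the idea in the text.

SM12X_FAMILY = frozenset({120, 121})  # SM120 (RTX 5090) and SM121 (GB10)


def exact_guard(arch: int) -> bool:
    """Pre-patch style check: only SM120 passes."""
    return arch == 120


def family_guard(arch: int) -> bool:
    """Post-patch style check: any listed SM12x part passes."""
    return arch in SM12X_FAMILY


def select_kernel(arch: int, guard) -> str:
    """Pick the fast path when the guard accepts the arch, else fall back."""
    return "fast_path" if guard(arch) else "fallback"


if __name__ == "__main__":
    for arch in (120, 121):
        print(arch, select_kernel(arch, exact_guard), select_kernel(arch, family_guard))
```

With the exact-match guard, arch 121 lands on the fallback without raising any error, which is why this kind of rejection is easy to miss in production.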

The surprise

We came in expecting to build a patched image and verify the new dispatch paths.

We didn't have to.

The patch has been shipping for 38 days in the eugr/spark-vllm-docker Dockerfile, applied inline at build time. Our production image (built 2026-05-06, serving Intel/Qwen3.5-397B-A17B-int4-AutoRound at PP=3 with kv_cache_dtype=fp8) already contained the patched marlin_utils.py. The startup logs show the fast paths dispatching correctly:

Using MarlinLinearKernel for GPTQMarlinLinearMethod

Using 'MARLIN' WNA16 MoE backend.

Figure: AI-Farm vLLM startup logs on DGX Spark GB10, showing the Marlin backend dispatching on SM121.

Worth checking what's actually deployed before assuming you need a fresh build — sometimes the community already shipped it.
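One way to check a running deployment is to scan its captured startup log for the two dispatch lines quoted above. The sketch below assumes you have already saved the log as text; the function names are hypothetical, and the marker strings are copied from our logs, so they may differ in other vLLM versions or with other quantization methods.

```python
# Minimal sketch: look for the Marlin dispatch lines in a captured startup log.
# Marker strings are taken from the log excerpt in this post and may not match
# other vLLM versions; the helper names are illustrative.

EXPECTED_MARKERS = (
    "Using MarlinLinearKernel for GPTQMarlinLinearMethod",
    "Using 'MARLIN' WNA16 MoE backend.",
)


def find_dispatch_markers(log_text: str, markers=EXPECTED_MARKERS) -> dict:
    """Map each marker to True if any line of the log contains it."""
    lines = log_text.splitlines()
    return {m: any(m in line for line in lines) for m in markers}


def fast_paths_dispatched(log_text: str) -> bool:
    """True only when every expected marker appears in the log."""
    return all(find_dispatch_markers(log_text).values())


if __name__ == "__main__":
    sample = (
        "INFO Using MarlinLinearKernel for GPTQMarlinLinearMethod\n"
        "INFO Using 'MARLIN' WNA16 MoE backend.\n"
    )
    print(find_dispatch_markers(sample))
    print(fast_paths_dispatched(sample))
```

A missing marker does not prove the patch is absent, since log wording can change between releases, but a present one is direct evidence of the path that was actually taken.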

What we sent upstream

Instead of "we built and tested it once," the comment we left on the PR is roughly:

> Deployment-tested on real GB10 silicon. 38 days community-wide via the eugr image, 3 days on our specific 3-node Spark cluster, zero PR-related kernel errors observed. Logs show the SM12x guards dispatching as expected.

That's the kind of third-party signal vLLM maintainers find useful when deciding to merge. It's also, for us, the first concrete contribution back to a project we depend on every day.

One thing worth knowing

We started debugging with cutlass_scaled_mm_supports_fp4(121) as our "patched-or-not" canary. It returned True on both patched and unpatched builds: it turns out to be a generic arch predicate, not a dispatch-path indicator. The real signal was the dispatch-time logs above.

Capability detection is not the same as dispatch path. Worth keeping in mind for anyone debugging similar guard-generalisation patches.
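The underlying rule is that a canary is only useful if its value differs between the two builds being compared. A small sketch of that check follows; the probe names are hypothetical, and the unpatched value for the log probe is an assumption inferred from the fallback behaviour described earlier, not a recorded measurement.

```python
# Sketch: find which probes actually distinguish an unpatched build from a
# patched one. Probe names and the unpatched log value are illustrative.


def discriminating_probes(unpatched: dict, patched: dict) -> list:
    """Return the probe names whose observed value differs between builds."""
    shared = unpatched.keys() & patched.keys()
    return sorted(name for name in shared if unpatched[name] != patched[name])


# The capability predicate returned True on both builds (as reported above).
# The log probe is assumed to be False on an unpatched build.
unpatched_build = {"capability_predicate": True, "marlin_dispatch_logged": False}
patched_build = {"capability_predicate": True, "marlin_dispatch_logged": True}

if __name__ == "__main__":
    print(discriminating_probes(unpatched_build, patched_build))
    # prints ['marlin_dispatch_logged']
```

Running a candidate canary against a known-unpatched build first would have shown us immediately that the capability predicate carried no information about the patch.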

---

Thanks to @blake-snc for authoring the patch, and to the eugr maintainer for shipping it fast to the Spark community.


Member of the NVIDIA Developer Program — the AI-Farm cluster runs under this identity.

#dgxspark #blackwell #vllm #cuda #aiinfra
