FastKernels: Benchmarking GPU Kernel Generation in Production

Oliaro, Gabriele; Fu, Yichao; Jiang, May; Lu, Owen; Wang, Junli; Jia, Zhihao; Zhang, Hao; Rajbhandari, Samyam

46

representative architectures

8

model categories

96.2%

HF Transformers coverage (409 / 425)

L1–L4

compositional task hierarchy

Abstract

LLM-based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against. Existing benchmarks are poorly aligned with production inference frameworks: they evaluate kernels on a single GPU with synthetic inputs, ignore the surrounding compilation stack, and reward replicating known optimizations rather than discovering new ones. The resulting reward signals are misleading—agents learn to generate kernels that score well in sandboxes but introduce interface incompatibilities, compilation-stack conflicts, and silent correctness degradation when integrated into real systems.

We introduce FastKernels, a kernel benchmark built around a minimal set of 46 representative architectures spanning 8 categories, whose kernels collectively subsume those of 96.2% (409/425) of HuggingFace Transformers architectures. FastKernels doubles as a minimalistic, production-grade inference framework that runs at parity with hardened systems such as vLLM and SGLang on mainstream LLM serving and substantially exceeds upstream references on under-served architectures; each task's interface mirrors the corresponding module in the state-of-the-art library for its architecture family, enabling direct deployment of optimized kernels into production codebases.

Evaluating state-of-the-art kernel agents on FastKernels, we find that even the strongest agent achieves only 0.94× aggregate speedup over production baselines, with weaker agents at 0.78× and 0.53×—confirming that benchmark–production misalignment is a critical bottleneck for the field. We release FastKernels as a stepping stone toward kernel agents whose benchmark gains translate directly into production throughput improvements.

Why FastKernels?

LLM-based kernel-generation agents now achieve strong scores on isolated benchmarks, with state-of-the-art systems autonomously compiling, profiling, and refining CUDA or Triton code. Yet these gains often fail to transfer into production inference frameworks such as vLLM and SGLang: kernels that score well in sandboxes routinely regress once they encounter real serving interfaces, compilation stacks, and workloads.

The root cause is benchmark–production misalignment. Current benchmarks rely on synthetic inputs, single-GPU isolated kernels, simplified interfaces, and independent task levels that do not compose into full inference pipelines. They therefore reward kernels that are fast in isolation but brittle in deployed systems, hiding interface mismatches, compilation-stack conflicts, and correctness degradation that only appears at model scale.

FastKernels closes this gap by being a benchmark that is the framework: optimized kernels run inside a real inference pipeline rather than being ported from a separate harness, and each task's interface mirrors the corresponding module in production libraries so winning kernels are immediately deployable.

Contributions

Benchmark-as-framework

FastKernels is a self-contained inference framework whose task interfaces match production modules, so optimized kernels can be evaluated in place and transferred into systems such as vLLM and SGLang with no heavy refactoring.

Compositional task hierarchy

Tasks progress from primitives (L1) to fused operators (L2), layers (L3), and full models (L4). Agents can reuse lower-level optimizations inside higher-level modules instead of solving each level independently.

Production-faithful evaluation

Kernels are evaluated with production baselines (cuBLAS, FlashAttention-3, FlashInfer, vLLM CUDA ops, DeepGEMM, FLA), captured tensors, compilation-stack effects, and multi-GPU communication including tensor and expert parallelism.

End-to-end validation & metrics

FastKernels injects generated kernels into full model executions, checks downstream quality, and reports MacroEval, which combines calibrated correctness, coverage, and end-to-end throughput–latency speedup across model families.

Broad architecture coverage

46 architectures across 8 categories—dense and MoE LLMs, linear attention and SSMs, vision, audio, video, robotics, 3D graphics, recommendation, and world models—benchmarked against production references rather than hardware-theoretical bounds.

Multi-GPU communication kernels

To our knowledge the first kernel benchmark to include collectives as first-class tasks: tensor-parallel all-reduce / reduce-scatter, expert-parallel all-to-all dispatch and combine for MoE routing, and computation–communication overlap kernels.

Compositional Task Hierarchy

FastKernels organizes tasks into four levels of increasing scope. Unlike benchmarks whose levels are independent, this hierarchy enables a dynamic-programming style optimization loop: an agent can reuse an optimized lower-level kernel (e.g. a linear operator) when optimizing higher-level modules (e.g. MLPs or transformer blocks), instead of rediscovering the same building block from scratch.

L1 — Primitive operators

Individual operations: attention variants (GQA, MLA, sliding-window), normalizations (RMSNorm, LayerNorm), activations (SwiGLU, GeGLU), positional encodings (RoPE, ALiBi), and quantization/dequantization routines.

L2 — Fused operators

Multi-operation kernels representing natural fusion opportunities as they arise in real models: residual-add + RMSNorm + quantization, attention + output projection, MoE gate + dispatch + expert computation.

L3 — Full layers and blocks

Complete architectural blocks: transformer decoder layers, SSM scan blocks, MoE layers with routing, cross-attention layers, and vision encoder stages.

L4 — End-to-end models

Full model inference paths evaluated as integrated systems. The 46 representative architectures in L4 serve as the end-to-end evaluation set, and underlying L1–L3 kernels cover the overwhelming majority of operators across ML and AI models.

Results

Per-architecture speedup over production references

Each bar shows the speedup of FastKernels over the strongest production or upstream reference for one architecture. Panels group architectures by category; the dashed line in each panel marks the average speedup within that category.

Across 46 representative benchmarks, FastKernels averages 1.24× throughput (1.04× median). On mainstream LLM serving against hardened frameworks like vLLM and SGLang, FastKernels sits at parity (e.g. Llama-3.1 1.04×, GPT-OSS 1.02×, Mixtral 0.97×, DeepSeek-V3.2 0.84×). Supra-unity gains concentrate on architectures whose only available reference is a research repository or a non-serving library: Pi0 3.48×, ColBERTv2 3.08×, CosyVoice3 2.13×, RetNet / GLA ~1.86× vs. FLA.

Existing kernel agents fall below 1× on production-shaped tasks

We evaluate three state-of-the-art kernel-generation agents—Dr. Kernel, KernelAgent, and OpenAI Codex—on FastKernels L1 and L2 (88 target families). On aggregate, all three land below 1× against production-grade references:

OpenAI Codex: 0.943× geomean (88/88 correct)
KernelAgent: 0.777× geomean (28/76 correct)
Dr. Kernel: 0.527× geomean (8/38 correct)

Two findings explain the gap: (1) production-grade references (cuBLAS, FlashAttention-3, FlashInfer, DeepGEMM, FLA) separate apparent gains from real ones—Codex hits 1.16× on torch-eager-wrapped operators but only 0.93× on vendor- or hand-written kernels; and (2) production-shaped composite modules (full attention, MoE, and YOLOv10 blocks lifted from production code) are markedly harder than primitives—every agent that reaches L2 regresses on it.

Captured tensors matter: MoE routing study

FastKernels captures and replays the tensors production models actually feed to kernels. Our MoE routing study shows that synthetic inputs materially change both load skew and hot-expert identity: kernels that look great on uniform-random routing decisions can underperform once the actual production routing distribution is replayed.

MacroEval: Cross-Architecture Metrics

Model families expose different correctness signals—tokens, embeddings, labels, rankings, trajectories, audio, video—and different throughput–latency tradeoffs. MacroEval calibrates architecture-specific correctness to a common [0, 1] scale, measures end-to-end speedup, and macro-averages across families so no single architecture dominates the leaderboard.

The default leaderboard score multiplies three components: macro-geometric speedup S_macro, macro-calibrated correctness C_macro, and macro-coverage. This rewards agents that are simultaneously fast, correct, and broadly applicable, while still exposing the three component metrics so users can identify throughput–correctness or coverage–performance tradeoffs.

Getting Started

FastKernels requires Python 3.10+, CUDA 12.x, and a recent NVIDIA GPU (Hopper / Blackwell tested; Ampere supported for a subset of kernels). Clone the repository and install:

git clone git@github.com:Snowflake-AI-Research/fastkernels.git
cd kb_nano
pip install .

Each supported architecture is implemented under tasks/baseline/, organized by level. To scaffold a candidate kernel with the correct signature:

# Scaffold candidate stubs
python agent/create_stubs.py                       # all operators
python agent/create_stubs.py --level 1             # L1 only
python agent/create_stubs.py --architecture llama  # Llama only

Then drop your replacement in tasks/candidate/L<level>/<op_name>.py exposing the same class name and forward signature as the baseline. The benchmark validates correctness against the production reference and reports speedup:

# Kernel-level: correctness + speedup for a single operator
kb_nano kernels --target rms_norm

# Standardized end-to-end eval sweep
kb_nano eval --model meta-llama/Llama-3.1-8B-Instruct

Full Documentation Code

Frequently Asked Questions

How is FastKernels different from KernelBench and similar benchmarks?

Existing benchmarks construct tasks bottom-up by combining primitive operators or extracting subgraphs via automated pipelines, evaluate on a single GPU with synthetic inputs, and compare against PyTorch eager. FastKernels takes a top-down approach: it starts from real model families, walks the forward pass, and produces standalone tasks whose interfaces match the corresponding modules in production libraries (vLLM, SGLang, diffusers, FLA, timm). Speedups are measured against the kernels production systems actually ship—cuBLAS, FlashAttention-3, FlashInfer, vLLM CUDA ops, DeepGEMM—not torch eager.

Why only 46 architectures? Doesn't that limit coverage?

FastKernels consolidates near-identical operators (e.g. RoPE variants, normalizations, attention backends) into a single L1/L2 task and picks a small set of representative architectures whose union of operators subsumes the long tail of model families. An audit of every PyTorch modeling file in HuggingFace Transformers (commit da6c53e4; 425 entries) indicates that these 46 architectures cover 96.2% (409/425) of HF architectures with no new compute primitive—only 5 architectures require a genuinely new kernel and 2 require an external library.

Can I deploy a kernel optimized with FastKernels into vLLM or SGLang?

Yes. Each task's __init__ constructor signature and forward method are designed to closely match the corresponding module in the state-of-the-art library for its architecture family. A kernel optimized within FastKernels can be deployed into vLLM, SGLang, or another production framework essentially via a copy-paste of the module, with no heavy interface refactoring. Unlike FlashInfer-Bench's FIApply (which substitutes at the kernel dispatch level), FastKernels operates at the module level—the unit of composition that production frameworks actually use.

How does FastKernels handle multi-GPU collectives?

Production inference at scale is almost always distributed, and the resulting collectives, synchronization barriers, and communication–computation overlaps materially affect end-to-end latency. To our knowledge FastKernels is the first kernel benchmark to include them as first-class tasks: tensor-parallel all-reduce / reduce-scatter, expert-parallel all-to-all dispatch and combine for MoE routing (DeepSeek-V3, Mixtral), and overlap kernels that must hide NCCL collectives behind computation. A kernel that achieves 1.5× on a single GPU can still degrade end-to-end throughput if it disrupts the communication schedule—these tasks are unreachable from single-GPU benchmarks.

What kernel-generation agents has FastKernels been evaluated against?

We evaluate three state-of-the-art agents on FastKernels L1 and L2 (88 target families):

OpenAI Codex — 0.943× geomean, 88/88 correct
KernelAgent — 0.777× geomean, 28/76 correct
Dr. Kernel — 0.527× geomean, 8/38 correct

All three land below 1× on aggregate, in contrast to the supra-unity numbers reported on operator-level benchmarks whose reference is PyTorch eager. The per-target results pinpoint two design choices in FastKernels that account for the gap: production-grade references and production-shaped composite modules at L2.

Can I add my own architecture to FastKernels?

Yes. For each model architecture, the FastKernels construction pipeline loads the HuggingFace configuration and architecture definition, walks the forward pass, and produces standalone task implementations for each kernel with all configuration constants (hidden size, number of heads, data types) inlined from the model's configuration. The pipeline ships as part of the framework, so users can add new architectures from HuggingFace with minimal effort. See the documentation for details.

What's the benchmarking workflow?

FastKernels exposes three tiers of increasing scope:

Tier 1 — Kernel: instantiates a candidate operator next to the baseline, copies weights, and compares forward() outputs and runtimes across an input registry of shapes, dtypes, and init arguments. For data-dependent operators (e.g. MoE dispatch) it replays golden inputs captured from real end-to-end executions.
Tier 2 — E2E: end-to-end model benchmarks, measuring full-model throughput, latency, and serving behavior under user-specified workloads.
Tier 3 — Eval: standardized evaluation sweeps, comparing baseline and candidate executions across fixed models, tensor-parallel configurations, and throughput–latency workloads. This is the benchmark used for the leaderboard.

Profiling is first-class: Tier 1 integrates with NVIDIA Nsight Compute (NCU); Tier 2 with NVIDIA Nsight Systems (NSYS). MLflow integration tracks kernel lineage and benchmark history.

BibTeX

@article{oliaro2026fastkernels,
  author    = {Gabriele Oliaro and Yichao Fu and May Jiang and Owen Lu and
               Junli Wang and Hao Zhang and Zhihao Jia and Samyam Rajbhandari},
  title     = {FastKernels: Benchmarking GPU Kernel Generation in Production},
  journal   = {arXiv preprint arXiv:2605.23215},
  year      = {2026},
  url       = {https://arxiv.org/abs/2605.23215}
}

FastKernels: Benchmarking GPU KernelGeneration in Production