Deploying 2:4 Sparsity with FP8 on Hopper: A Production Cookbook
On Hopper‑class GPUs, the combination of 2:4 structured sparsity and FP8 precision routinely delivers 1.5–2.0× end‑to‑end throughput gains for LLM decoding, while cutting energy per token by 20–40% when implemented in production engines like TensorRT‑LLM. The kernel‑level picture is simple: NVIDIA’s Sparse Tensor Cores double effective math throughput for eligible GEMMs under the 2:4 N:M pattern, and Transformer Engine’s FP8 pipeline reduces bandwidth and scales activations robustly. But making those gains stick in production—without tanking quality—requires a careful pipeline: the right layers, properly curated calibration, safe pruning, quick adapter‑based recovery, and a well‑tuned engine build.
This article is a hands‑on playbook to take a dense model to a pruned, adapter‑recovered, FP8‑calibrated build in TensorRT‑LLM. We’ll show you how to baseline and profile, select layers and formats compatible with 2:4 pipelines, apply structured pruning across linear/FFN paths, recover with LoRA/AdaLoRA, integrate into a production engine with static shapes and fused attention, and validate with a robust matrix of metrics, latency distributions, memory, and power logging. By the end, you’ll have a repeatable, rollback‑safe process you can operate in a real fleet.
Architecture/Implementation Details
End‑to‑end pipeline overview
- Baseline in production conditions
- Build a dense FP16 or FP8 baseline in TensorRT‑LLM with your production decoding params (batching, max prompt/response lengths, tokenizer, scheduler).
- Enable fused attention (e.g., FlashAttention‑style kernels) so the MLP matmuls become the dominant compute path you’ll later sparsify.
- Measure tokens/s, p50/p99 latency, peak/activation memory, and GPU power (nvidia‑smi dmon or NVML). This is your control.
- Choose 2:4‑eligible layers and orientation
- Target linear/FFN mats with large K dimensions: attention QKV/proj and MLP up/down projections are typically eligible.
- Avoid embedding tables, layer norms, and the final LM head for pruning (a name‑based eligibility filter is sketched after the conceptual diagram below).
- Respect the 2:4 grouping along the innermost compute dimension expected by cuSPARSELt (groups of 4 contiguous values along K) so Sparse Tensor Cores can engage.
- Calibrate for FP8 and sparsity
- Curate a calibration set of 2–5k prompts that mirrors production: instruction, reasoning, and code. Include a long‑context slice.
- Run FP8 activation/weight scaling with Transformer Engine (per‑tensor dynamic scaling) to avoid overflow while you collect activation stats.
- Apply 2:4 structured pruning safely
- Use magnitude or activation‑aware scores within each 4‑value group to drop the two least‑important elements.
- Start at 30–40% global sparsity (2:4 is 50% within eligible mats but applied only to selected layers). Avoid aggressive pruning in late attention layers which can be KV‑critical for long‑range behavior.
- Adapter‑based recovery (LoRA/AdaLoRA)
- Attach low‑rank adapters to pruned modules and fine‑tune for a few thousand steps on a task mixture similar to your calibration set.
- Use early stopping based on a small validation basket (e.g., slices of MMLU/GSM8K/MT‑Bench/HumanEval) to recover 0.5–2 points without expensive full SFT.
- Precision pipeline and post‑pruning recalibration
- Establish a stable FP8 baseline first, then prune, then re‑calibrate FP8 scales post‑pruning.
- If you prefer INT8 (W8A8 or weight‑only), create a pre‑pruning baseline (LLM.int8(), GPTQ), prune, then re‑calibrate and validate.
- Engine integration for production
- Export pruned weights with 2:4 masks.
- Build a TensorRT‑LLM engine with FP8 and sparsity enabled, static shape profiles covering your prompt/response envelope, and fused attention enabled.
- Verify sparse kernel engagement across all configured shapes; shape mismatches will silently disable sparsity.
- Validate and iterate
- Re‑measure tokens/s, p50/p99 latency, memory, and power. Expect 1.3–1.8× from 2:4 alone and 1.5–2.0× with FP8 on H100/H200, with 20–40% energy per token reductions.
- Evaluate MMLU, GSM8K, HumanEval, MT‑Bench, BBH, and a long‑context task with fixed decoding params for apples‑to‑apples comparisons.
Diagram (conceptual): Data flows from tokenizer → FP8‑calibrated embeddings → attention blocks (FlashAttention‑2 fused kernels) → FFN blocks (2:4 sparse matmuls via cuSPARSELt) → LM head. Recovery adapters sit on pruned linear paths. Static shape profiles ensure kernels stay on optimized paths.
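The layer‑selection rules above boil down to a short, name‑based filter. A minimal sketch, assuming Llama‑style module names (q/k/v/o projections plus gate/up/down MLP projections); it doubles as the is_eligible helper referenced in Practical Example 1 below:

# Never prune embeddings, norms, or the LM head (assumption: Llama-style names; adjust for your model).
NEVER_PRUNE = ("embed", "norm", "lm_head")
# 2:4-eligible linear paths: attention projections and MLP matmuls.
ELIGIBLE = {"q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"}

def is_eligible(name: str) -> bool:
    # Name-based filter only; the pruning loop still checks isinstance(mod, nn.Linear),
    # and the masking function asserts that K is divisible by 4.
    if any(tok in name for tok in NEVER_PRUNE):
        return False
    return name.rsplit(".", 1)[-1] in ELIGIBLE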
Practical performance notes on Hopper
- Eligibility is binary: only correctly masked 2:4 mats get the 2× kernel uplift. A single mis‑grouped axis or unsupported layout drops you to dense kernels (a quick pattern check is sketched after this list).
- FP8’s biggest gains come from bandwidth savings and keeping mats in fast paths; keep scale outliers in check with updated calibration after pruning.
- Fused/optimized attention (e.g., FlashAttention‑2) reduces attention overhead, amplifying the speedup realized from sparsifying the MLPs.
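Because a mis‑grouped mask silently falls back to dense kernels, it is worth asserting the 2:4 pattern on every pruned weight before export. A minimal check, assuming weights are grouped in fours along the last (K) dimension:

import torch

def check_2_4(weight: torch.Tensor, atol: float = 0.0) -> bool:
    # True if every contiguous group of 4 along the last dim has at most 2 non-zeros.
    assert weight.shape[-1] % 4 == 0, "K must be divisible by 4"
    groups = weight.reshape(*weight.shape[:-1], weight.shape[-1] // 4, 4)
    nonzeros = (groups.abs() > atol).sum(dim=-1)
    return bool((nonzeros <= 2).all())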
Comparison Tables
Precision and sparsity recipes on Hopper (TensorRT‑LLM)
| Recipe | Expected e2e speedup (decoding) | Memory/BW impact | Risk profile | Notes |
|---|---|---|---|---|
| Dense FP16 | 1.0× | High | Low | Baseline; easiest to validate |
| Dense FP8 | 1.2–1.4× | Activation BW ↓ | Low–mod | Requires careful scaling; good first step |
| 2:4 + FP16 | 1.3–1.8× | Weight BW ↓ | Mod | Ensure layer eligibility/orientation |
| 2:4 + FP8 | 1.5–2.0× | Weight + activation BW ↓ | Mod | Sweet spot on H100/H200 |
| INT8 W8A8 | 1.2–1.6× | Weight/act BW ↓ | Mod | Broadly supported; recalibrate after pruning |
What to prune (dense LLMs)
| Module | 2:4 eligible | Risk to quality | Guidance |
|---|---|---|---|
| Attention Q/K/V | Yes | Medium | Prune conservatively in later layers; preserve KV‑critical heads |
| Attention output proj | Yes | Low–mod | Generally safe; validate long‑context |
| FFN up/down (gate) | Yes | Low | Primary target for 2:4 speedups |
| Embeddings | No | High | Do not prune |
| RMSNorm/LayerNorm | No | High | Do not prune |
| LM head | Generally avoid | High | Optional only with strong validation |
Pros and cons (2:4 vs alternatives on NVIDIA)
| Approach | Pros | Cons |
|---|---|---|
| 2:4 N:M | Kernel‑level 2× math throughput; production‑ready in TensorRT‑LLM | Pattern constraints; strict layout/orientation |
| FP8 only | Easy to adopt; portable across layers | Smaller gains; still bandwidth‑bound in places |
| Unstructured sparsity | High compressibility | Little/no speedup without specialized kernels |
| Block‑structured sparsity | Good locality; easier kernels than unstructured | Requires tuned custom kernels or specific coverage |
Best Practices
Calibration set curation and scaling hygiene
- Mix instruction, reasoning (math), code, and long‑context prompts. 2–5k prompts is enough for stable scaling and pruning scores.
- Fix decoding params (temperature/top‑p/max‑new‑tokens) for both calibration and evaluation to avoid confounds (the sketch after this list shows one way to freeze them).
- For FP8, use Transformer Engine’s per‑tensor dynamic scaling and re‑run calibration after pruning to account for distribution shifts.
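A minimal sketch of the calibration mix and frozen decoding parameters described above; the file names, prompt field, and mixture weights are illustrative assumptions:

import json
import random

# Illustrative mixture; adjust weights to mirror your production traffic.
SOURCES = {
    "instruction.jsonl": 0.40,
    "reasoning_math.jsonl": 0.25,
    "code.jsonl": 0.25,
    "long_context.jsonl": 0.10,
}
TOTAL = 4000  # 2-5k prompts is typically enough for stable scales/scores

def load_prompts(path):
    # Assumes JSONL files with a "prompt" field.
    with open(path) as f:
        return [json.loads(line)["prompt"] for line in f]

random.seed(0)
calib_prompts = []
for path, frac in SOURCES.items():
    pool = load_prompts(path)
    calib_prompts += random.sample(pool, min(len(pool), int(frac * TOTAL)))

# Freeze decoding params and reuse them for calibration *and* evaluation.
DECODING = {"temperature": 0.7, "top_p": 0.9, "max_new_tokens": 512}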
Safe structured pruning
- Prune only layers with proven 2:4 support in your engine. Follow cuSPARSELt’s grouping along the inner K dimension and keep weights in supported layouts.
- Stage sparsity increases (e.g., 20% → 30% → 40%) with quick evaluations between steps; stop and roll back when MMLU/GSM8K drop by more than ~1–2 points (see the staging loop sketched after this list).
- Treat late attention layers as KV‑critical: prune less aggressively there, and preserve heads known to carry long‑range signal.
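The staged ramp can be driven by a short loop like the one below; apply_sparsity_fraction and quick_eval are hypothetical hooks into your pruning and evaluation harness:

import copy

baseline = quick_eval(model)          # hypothetical: returns {"mmlu": ..., "gsm8k": ...} on small slices
MAX_DROP = 1.5                        # metric points; tune per task
last_good = copy.deepcopy(model.state_dict())

for frac in (0.20, 0.30, 0.40):
    apply_sparsity_fraction(model, frac)   # hypothetical: 2:4-mask eligible layers up to this budget
    scores = quick_eval(model)
    if max(baseline[k] - scores[k] for k in baseline) > MAX_DROP:
        model.load_state_dict(last_good)   # roll back to the last acceptable stage
        break
    last_good = copy.deepcopy(model.state_dict())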
Adapter‑based recovery
- Start with LoRA rank 8–16 on attention and FFN paths; raise rank only if validation does not recover to within ~1 point of baseline (an early‑stopping loop is sketched after this list).
- AdaLoRA can allocate rank dynamically across modules; useful when pruning budgets differ by layer depth.
- Train with your calibration mix and a light weight on code/math to stabilize GSM8K/HumanEval.
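Early stopping against a small validation basket can be as simple as the loop below; train_steps and eval_basket are hypothetical wrappers around your trainer and eval slices:

best, patience, bad_rounds = -float("inf"), 3, 0

for round_idx in range(20):              # e.g., 20 x 500 = up to 10k steps
    train_steps(model, steps=500)        # hypothetical: one chunk of LoRA/AdaLoRA training
    score = eval_basket(model)           # hypothetical: mean over MMLU/GSM8K/MT-Bench slices
    if score > best + 0.1:
        best, bad_rounds = score, 0
        model.save_pretrained("ckpt_best")   # PEFT models save only the adapter weights
    else:
        bad_rounds += 1
        if bad_rounds >= patience:
            break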
Engine integration and shape discipline
- Build engines with static shapes that cover real workloads (prompt/response buckets). If a request falls off a profiled shape at runtime, sparsity may be disabled (a simple request guard is sketched after this list).
- Enable fused attention (FlashAttention‑2 style) to surface the MLP bottleneck and maximize end‑to‑end gains.
- Verify sparsity engagement by inspecting kernel traces and TensorRT‑LLM logs; a sudden p99 spike often signals a fallback to dense paths.
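A lightweight guard at the serving layer keeps requests inside the profiled envelope; the bucket list below mirrors the example config under Practical Examples and is an assumption to adapt to your build:

# Profiled (prompt_max, response_max) buckets; must match the engine build profiles.
PROFILES = [(2048, 512), (4096, 1024)]

def pick_profile(prompt_len: int, max_new_tokens: int):
    for p_max, r_max in PROFILES:
        if prompt_len <= p_max and max_new_tokens <= r_max:
            return (p_max, r_max)
    raise ValueError(
        f"Request ({prompt_len}, {max_new_tokens}) exceeds all profiled shapes; "
        "it would fall off the sparse/fused kernel path."
    )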
Validation matrix and guardrails
- Report: tokens/s, p50/p90/p99 latency, peak/activation memory, and energy per token (via power logging), comparing apples‑to‑apples between builds (a reporting sketch follows this list).
- Task suite: MMLU, GSM8K, HumanEval, MT‑Bench/BBH, plus at least one long‑context benchmark.
- Canary tests: short prompts targeting safety, instruction‑following, and long‑context integrity.
- Rollback plan: versioned artifacts per stage (baseline → pruned → recovered → recalibrated). Progressive traffic ramp with automated rollback on error budgets or p99 regressions.
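On the reporting side, a minimal sketch assuming you log per‑request latency and output token counts and sample GPU power via NVML (pynvml):

import numpy as np
import pynvml

def summarize(latencies_s, tokens_out, power_samples_w, wall_s):
    lat = np.asarray(latencies_s)
    total_tokens = sum(tokens_out)
    return {
        "tokens_per_s": total_tokens / wall_s,
        "latency_p50_ms": float(np.percentile(lat, 50) * 1e3),
        "latency_p90_ms": float(np.percentile(lat, 90) * 1e3),
        "latency_p99_ms": float(np.percentile(lat, 99) * 1e3),
        # Energy per token: mean power (W) * wall time (s) / tokens generated
        "joules_per_token": float(np.mean(power_samples_w)) * wall_s / total_tokens,
    }

# Power sampling (poll this in a background thread during the benchmark)
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # milliwatts -> watts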
Troubleshooting quick hits
- FP8 overflow/NaNs: tighten clip‑max or re‑collect scales with outlier‑heavy prompts.
- Speedup <1.2×: check that 2:4 masks align to the correct axis and that all production shapes are profiled with sparsity enabled.
- Attention regressions: roll back Q/K pruning in late layers or increase LoRA rank on those modules.
- p99 spikes: ensure the scheduler doesn’t exceed profiled max lengths, and that batching does not introduce shape drift.
Practical Examples
1) 2:4 mask application (PyTorch, illustrative)
import torch

@torch.no_grad()
def apply_2_4_mask(weight: torch.Tensor, group_dim: int = -1):
    # Reshape so groups of 4 contiguous values lie along the innermost dimension
    assert weight.shape[group_dim] % 4 == 0
    w = weight.transpose(group_dim, -1)
    g = w.reshape(*w.shape[:-1], w.shape[-1] // 4, 4)
    # Magnitude score within each group of 4
    scores = g.abs()
    # Keep the top-2 elements per group; zero the other two
    top2 = scores.topk(k=2, dim=-1).indices
    mask = torch.zeros_like(g, dtype=torch.bool)
    mask.scatter_(-1, top2, True)
    w_pruned = (g * mask).reshape_as(w)
    return w_pruned.transpose(group_dim, -1)

# Example: apply to eligible linear layers (nn.Linear weight is [out_features, in_features], so K = dim 1)
with torch.no_grad():
    for name, mod in model.named_modules():
        if isinstance(mod, torch.nn.Linear) and is_eligible(name):
            mod.weight.copy_(apply_2_4_mask(mod.weight, group_dim=1))  # group along K
Note: The exact grouping/layout must match cuSPARSELt’s expectations for Sparse Tensor Core engagement.
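The mask above is pure magnitude. An activation‑aware variant (in the spirit of Wanda‑style scoring) weights each element by the L2 norm of its input channel; collecting those norms from a calibration pass is assumed:

import torch

@torch.no_grad()
def apply_2_4_mask_act(weight: torch.Tensor, act_norm: torch.Tensor):
    # weight: [out_features, in_features]; act_norm: [in_features] L2 norms of
    # the corresponding input activations, accumulated over calibration batches.
    assert weight.shape[1] % 4 == 0
    scores = weight.abs() * act_norm.unsqueeze(0)        # per-element importance
    g_w = weight.reshape(weight.shape[0], -1, 4)         # group along K
    g_s = scores.reshape(weight.shape[0], -1, 4)
    top2 = g_s.topk(k=2, dim=-1).indices
    mask = torch.zeros_like(g_w, dtype=torch.bool).scatter_(-1, top2, True)
    return (g_w * mask).reshape_as(weight)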
2) FP8 calibration with Transformer Engine (simplified)
import torch
import transformer_engine.pytorch as te

# Calibration pass: run representative batches under FP8 autocast so
# Transformer Engine collects per-tensor scaling statistics.
model.eval()
with torch.no_grad(), te.fp8_autocast(enabled=True):
    for batch in calib_loader:
        _ = model(**batch)
# Save collected scales (the framework manages per-tensor stats)
Reference: NVIDIA Transformer Engine provides FP8 casting, scaling recipes, and integration guidance.
3) LoRA recovery (PEFT‑style pseudocode)
from peft import LoraConfig, get_peft_model

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora)
# Train for 3-10k steps on mixed instruction/math/code; early stop on MMLU/GSM8K slice
Background: LoRA/AdaLoRA recover quality at low compute following pruning.
4) TensorRT‑LLM build (example config excerpt)
{
"precision": "fp8",
"enable_sparse_weights": true,
"fused_attention": "flash_v2",
"profiles": [
{"prompt": [1, 2048], "response": [1, 512]},
{"prompt": [1, 4096], "response": [1, 1024]}
],
"plugins": {"kv_cache": {"static": true}}
}
Build and run (CLI varies by version; check TensorRT‑LLM docs):
trtllm-build --model ./pruned_lora_recovered --config ./trt_config.json --output ./engine
trtllm-run --engine ./engine --dataset ./eval.jsonl --metrics tokens_per_s,latency_p50,latency_p99
Documentation: TensorRT‑LLM repository and docs for enabling FP8, sparsity, and fused kernels.
5) Power and latency logging
# Power
nvidia-smi dmon -s pucmt -i 0 -o DT >> power.log &
# Latency distribution via your runner
trtllm-run ... --metrics latency_histogram
6) INT8 alternative (weight‑only)
If your stack prefers INT8, establish a GPTQ/LLM.int8() baseline, prune, then re‑quantize/re‑calibrate before engine build.
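For orientation, weight‑only INT8 with per‑output‑channel symmetric scales reduces to a small quantize/dequantize round trip (a sketch only, not a GPTQ implementation):

import torch

@torch.no_grad()
def quantize_weight_int8(weight: torch.Tensor):
    # Per-output-channel symmetric scales; weight is [out_features, in_features].
    scale = weight.abs().amax(dim=1, keepdim=True) / 127.0
    scale = scale.clamp(min=1e-8)                      # avoid divide-by-zero rows
    q = torch.round(weight / scale).clamp_(-127, 127).to(torch.int8)
    return q, scale

@torch.no_grad()
def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale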
Conclusion
2:4 sparsity plus FP8 on Hopper is no longer a lab trick—it’s a deployable recipe that consistently yields 1.5–2.0× throughput and material energy savings when executed with discipline in TensorRT‑LLM. The critical path is operational: nail your baseline, prune only eligible layers with correct orientation, recalibrate precision after structural changes, recover with lightweight adapters, and keep engines on optimized kernels with static shape profiles. A rigorous validation matrix, canaries, and rollback guardrails turn those gains into something you can trust at scale.
Key takeaways
- Establish a dense FP16/FP8 baseline in TensorRT‑LLM with fused attention before any pruning.
- Apply 2:4 only to eligible linear/FFN mats, respecting cuSPARSELt’s grouping; prune conservatively in late attention layers.
- Re‑calibrate FP8 (or INT8) after pruning and run brief LoRA/AdaLoRA recovery to keep metric deltas within ~1–2 points.
- Validate tokens/s, latency p50/p99, memory, and power; use MMLU, GSM8K, HumanEval, MT‑Bench/BBH, and long‑context to catch regressions.
- Guard with canaries and versioned rollbacks; shape discipline is essential to keep sparse kernels engaged.
Next steps
- Prototype the pipeline on a mid‑size model (e.g., 7–13B) with a small calibration set to validate tooling.
- Move to your target scale, stage sparsity in increments, and codify pass/fail thresholds for quality and latency.
- Automate engine builds for each profile bucket and integrate power/latency telemetry into your deploy pipeline.
Forward‑looking: as TensorRT‑LLM coverage widens and FP8 tooling matures, expect easier, more automated paths to mix 2:4 sparsity with precision scaling—and more of your fleet running comfortably in the fast lane.