
Deploying 2:4 Sparsity with FP8 on Hopper: A Production Cookbook

A step‑by‑step playbook to take a dense model to a pruned, adapter‑recovered, calibrated build in TensorRT‑LLM

By AI Research Team

On Hopper‑class GPUs, the combination of 2:4 structured sparsity and FP8 precision routinely delivers 1.5–2.0× end‑to‑end throughput gains for LLM decoding, while cutting energy per token by 20–40% when implemented in production engines like TensorRT‑LLM. The kernel‑level picture is simple: NVIDIA’s Sparse Tensor Cores double effective math throughput for eligible GEMMs under the 2:4 N:M pattern, and Transformer Engine’s FP8 pipeline reduces bandwidth and scales activations robustly. But making those gains stick in production—without tanking quality—requires a careful pipeline: the right layers, properly curated calibration, safe pruning, quick adapter‑based recovery, and a well‑tuned engine build.

This article is a hands‑on playbook to take a dense model to a pruned, adapter‑recovered, FP8‑calibrated build in TensorRT‑LLM. We’ll show you how to baseline and profile, select layers and formats compatible with 2:4 pipelines, apply structured pruning across linear/FFN paths, recover with LoRA/AdaLoRA, integrate into a production engine with static shapes and fused attention, and validate with a robust matrix of metrics, latency distributions, memory, and power logging. By the end, you’ll have a repeatable, rollback‑safe process you can operate in a real fleet.

Architecture/Implementation Details

End‑to‑end pipeline overview

  1. Baseline in production conditions
  • Build a dense FP16 or FP8 baseline in TensorRT‑LLM with your production decoding params (batching, max prompt/response lengths, tokenizer, scheduler).
  • Enable fused attention (e.g., FlashAttention‑style kernels) so the MLP matmuls become the dominant compute path you’ll later sparsify.
  • Measure tokens/s, p50/p99 latency, peak/activation memory, and GPU power (nvidia‑smi dmon or NVML). This is your control.
  2. Choose 2:4‑eligible layers and orientation
  • Target linear/FFN matrices with large K dimensions: attention QKV/proj and MLP up/down projections are typically eligible (a selection sketch follows this list).
  • Avoid embedding tables, layer norms, and the final LM head for pruning.
  • Respect the 2:4 grouping along the innermost compute dimension expected by cuSPARSELt (groups of 4 contiguous values along K) so Sparse Tensor Cores can engage.
  3. Calibrate for FP8 and sparsity
  • Curate a calibration set of 2–5k prompts that mirrors production: instruction, reasoning, and code. Include a long‑context slice.
  • Run FP8 activation/weight scaling with Transformer Engine (per‑tensor dynamic scaling) to avoid overflow while you collect activation stats.
  4. Apply 2:4 structured pruning safely
  • Use magnitude or activation‑aware scores within each 4‑value group to drop the two least‑important elements.
  • Start at 30–40% global sparsity (2:4 is 50% within eligible mats but applied only to selected layers). Avoid aggressive pruning in late attention layers which can be KV‑critical for long‑range behavior.
  5. Adapter‑based recovery (LoRA/AdaLoRA)
  • Attach low‑rank adapters to pruned modules and fine‑tune for a few thousand steps on a task mixture similar to your calibration set.
  • Use early stopping based on a small validation basket (e.g., slices of MMLU/GSM8K/MT‑Bench/HumanEval) to recover 0.5–2 points without expensive full SFT.
  6. Precision pipeline and post‑pruning recalibration
  • Establish a stable FP8 baseline first, then prune, then re‑calibrate FP8 scales post‑pruning.
  • If you prefer INT8 (W8A8 or weight‑only), create a pre‑pruning baseline (LLM.int8(), GPTQ), prune, then re‑calibrate and validate.
  7. Engine integration for production
  • Export pruned weights with 2:4 masks.
  • Build a TensorRT‑LLM engine with FP8 and sparsity enabled, static shape profiles covering your prompt/response envelope, and fused attention enabled.
  • Verify sparse kernel engagement across all configured shapes; shape mismatches will silently disable sparsity.
  8. Validate and iterate
  • Re‑measure tokens/s, p50/p99 latency, memory, and power. Expect 1.3–1.8× from 2:4 alone and 1.5–2.0× with FP8 on H100/H200, with 20–40% energy per token reductions.
  • Evaluate MMLU, GSM8K, HumanEval, MT‑Bench, BBH, and a long‑context task with fixed decoding params for apples‑to‑apples comparisons.
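To make step 2 concrete, here is a minimal selection sketch in PyTorch. The name‑based exclusion keywords and the min_k threshold are assumptions for a typical decoder‑only model; adapt them to your architecture and confirm final eligibility against your engine and the cuSPARSELt documentation.

import torch

# Hypothetical exclusion keywords; adjust to your model's module names.
EXCLUDE_KEYWORDS = ("embed", "lm_head", "norm")

def select_2_4_eligible(model: torch.nn.Module, min_k: int = 1024):
    """Return names of Linear modules that are reasonable 2:4 targets.

    Heuristics: skip embeddings/norms/LM head, and require the inner (K)
    dimension to be a multiple of 4 and large enough to matter.
    """
    eligible = []
    for name, mod in model.named_modules():
        if not isinstance(mod, torch.nn.Linear):
            continue
        if any(kw in name.lower() for kw in EXCLUDE_KEYWORDS):
            continue
        k_dim = mod.in_features  # inner compute dimension for y = x @ W^T
        if k_dim % 4 == 0 and k_dim >= min_k:
            eligible.append(name)
    return eligible

The returned names feed the pruning pass (step 4) and the is_eligible check used in the mask‑application example later in this article.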

Diagram (conceptual): Data flows from tokenizer → FP8‑calibrated embeddings → attention blocks (FlashAttention‑2 fused kernels) → FFN blocks (2:4 sparse matmuls via cuSPARSELt) → LM head. Recovery adapters sit on pruned linear paths. Static shape profiles ensure kernels stay on optimized paths.

Practical performance notes on Hopper

  • Eligibility is binary: only correctly masked 2:4 mats get the 2× kernel uplift. A single mis‑grouped axis or unsupported layout drops you to dense kernels (see the verification sketch after this list).
  • FP8’s biggest gains come from bandwidth savings and keeping mats in fast paths; keep scale outliers in check with updated calibration after pruning.
  • Fused/optimized attention (e.g., FlashAttention‑2) reduces attention overhead, amplifying the speedup realized from sparsifying the MLPs.
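A quick way to catch mis‑grouped masks before an engine build is to verify the 2:4 pattern directly. This is a minimal sketch assuming the same grouping convention as the mask‑application example later in this article (groups of 4 contiguous values along the inner K dimension).

import torch

def satisfies_2_4(weight: torch.Tensor, group_dim: int = 1) -> bool:
    """True if every contiguous group of 4 along group_dim has at most 2 nonzeros."""
    w = weight.transpose(group_dim, -1)
    if w.shape[-1] % 4 != 0:
        return False
    groups = w.reshape(*w.shape[:-1], w.shape[-1] // 4, 4)
    nonzeros_per_group = (groups != 0).sum(dim=-1)
    return bool((nonzeros_per_group <= 2).all())

# Usage: run over every pruned Linear before export; any False means that
# layer will silently fall back to dense kernels.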

Comparison Tables

Precision and sparsity recipes on Hopper (TensorRT‑LLM)

Recipe | Expected e2e speedup (decoding) | Memory/BW impact | Risk profile | Notes
Dense FP16 | 1.0× | High | Low | Baseline; easiest to validate
Dense FP8 | 1.2–1.4× | Activation BW ↓ | Low–mod | Requires careful scaling; good first step
2:4 + FP16 | 1.3–1.8× | Weight BW ↓ | Mod | Ensure layer eligibility/orientation
2:4 + FP8 | 1.5–2.0× | Weight + activation BW ↓ | Mod | Sweet spot on H100/H200
INT8 W8A8 | 1.2–1.6× | Weight/act BW ↓ | Mod | Broadly supported; recalibrate after pruning

What to prune (dense LLMs)

Module | 2:4 eligible | Risk to quality | Guidance
Attention Q/K/V | Yes | Medium | Prune conservatively in later layers; preserve KV‑critical heads
Attention output proj | Yes | Low–mod | Generally safe; validate long‑context
FFN up/down (gate) | Yes | Low | Primary target for 2:4 speedups
Embeddings | No | High | Do not prune
RMSNorm/LayerNorm | No | High | Do not prune
LM head | Generally avoid | High | Optional only with strong validation

Pros and cons (2:4 vs alternatives on NVIDIA)

Approach | Pros | Cons
2:4 N:M | Kernel‑level 2× math throughput; production‑ready in TensorRT‑LLM | Pattern constraints; strict layout/orientation
FP8 only | Easy to adopt; portable across layers | Smaller gains; still bandwidth‑bound in places
Unstructured sparsity | High compressibility | Little/no speedup without specialized kernels
Block‑structured sparsity | Good locality; easier kernels than unstructured | Requires tuned custom kernels or specific coverage

Best Practices

Calibration set curation and scaling hygiene

  • Mix instruction, reasoning (math), code, and long‑context prompts; 2–5k prompts is enough for stable scaling and pruning scores (a curation sketch follows this list).
  • Fix decoding params (temperature/top‑p/max‑new‑tokens) for both calibration and evaluation to avoid confounds.
  • For FP8, use Transformer Engine’s per‑tensor dynamic scaling and re‑run calibration after pruning to account for distribution shifts.
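Below is a minimal curation sketch, assuming prompt pools stored as JSONL files with a "prompt" field; the file paths, mixture ratios, and decoding parameters are illustrative placeholders, not fixed recommendations.

import json
import random

# Hypothetical prompt pools; replace with production-representative data.
POOLS = {
    "instruction": "calib/instruction.jsonl",
    "reasoning": "calib/math.jsonl",
    "code": "calib/code.jsonl",
    "long_context": "calib/long_context.jsonl",
}
MIX = {"instruction": 0.40, "reasoning": 0.25, "code": 0.25, "long_context": 0.10}
TOTAL = 3000  # within the 2-5k range suggested above

def load_prompts(path):
    with open(path) as f:
        return [json.loads(line)["prompt"] for line in f]

random.seed(0)  # reproducible calibration set
calib_prompts = []
for name, frac in MIX.items():
    pool = load_prompts(POOLS[name])
    calib_prompts += random.sample(pool, min(int(TOTAL * frac), len(pool)))
random.shuffle(calib_prompts)

# Fix decoding params once and reuse them for calibration and evaluation.
DECODING = {"temperature": 0.0, "top_p": 1.0, "max_new_tokens": 256}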

Safe structured pruning

  • Prune only layers with proven 2:4 support in your engine. Follow cuSPARSELt’s grouping along the inner K dimension and keep weights in supported layouts.
  • Stage sparsity increases (e.g., 20% → 30% → 40%) with quick evaluations between steps, and stop when MMLU/GSM8K move by more than ~1–2 points (a staged‑pruning sketch follows this list).
  • Treat late attention layers as KV‑critical: prune less aggressively there, and preserve heads known to carry long‑range signal.
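The staged schedule can be expressed as a small loop with a quality guardrail. prune_to_budget and evaluate_slice are hypothetical helpers: the first applies 2:4 masks to enough eligible layers to reach the global budget, the second scores a small MMLU/GSM8K slice with fixed decoding params.

# Staged pruning with rollback on quality regressions (sketch).
MAX_DROP = 1.5  # stop once the averaged slice score drops by more than ~1-2 points

baseline_score = evaluate_slice(model)                 # hypothetical eval helper
last_good_state = {k: v.clone() for k, v in model.state_dict().items()}

for budget in (0.20, 0.30, 0.40):
    prune_to_budget(model, budget)                     # hypothetical 2:4 pruning helper
    score = evaluate_slice(model)
    if baseline_score - score > MAX_DROP:
        model.load_state_dict(last_good_state)         # roll back to the previous stage
        break
    last_good_state = {k: v.clone() for k, v in model.state_dict().items()}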

Adapter‑based recovery

  • Start with LoRA rank 8–16 on attention and FFN paths; raise rank only if validation does not recover within 1 point on your metrics.
  • AdaLoRA can allocate rank dynamically across modules; useful when pruning budgets differ by layer depth.
  • Train with your calibration mix and a light weight on code/math to stabilize GSM8K/HumanEval.

Engine integration and shape discipline

  • Build engines with static shapes that cover real workloads (prompt/response buckets); a bucket‑check sketch follows this list. If you fall off a profiled shape at runtime, sparsity may be disabled.
  • Enable fused attention (FlashAttention‑2 style) to surface the MLP bottleneck and maximize end‑to‑end gains.
  • Verify sparsity engagement by inspecting kernel traces and TensorRT‑LLM logs; a sudden p99 spike often signals a fallback to dense paths.
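A sketch of the bucket check, assuming the two illustrative profiles from the config excerpt later in this article; the real bucket list comes from the shapes you actually build into the engine.

# Profiled (max_prompt_tokens, max_response_tokens) buckets; values are assumptions
# and must match the static shape profiles baked into the engine.
PROFILES = [(2048, 512), (4096, 1024)]

def pick_profile(prompt_len: int, max_new_tokens: int):
    """Return the smallest profile that covers the request, or None."""
    for max_prompt, max_resp in sorted(PROFILES):
        if prompt_len <= max_prompt and max_new_tokens <= max_resp:
            return (max_prompt, max_resp)
    return None

# In the scheduler: requests where pick_profile(...) is None must be truncated,
# split, or rejected so the engine never drifts off its optimized shapes.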

Validation matrix and guardrails

  • Report: tokens/s, p50/p90/p99 latency, peak/activation memory, energy per token (via power logging). Compare apples‑to‑apples.
  • Task suite: MMLU, GSM8K, HumanEval, MT‑Bench/BBH, plus at least one long‑context benchmark.
  • Canary tests: short prompts targeting safety, instruction‑following, and long‑context integrity.
  • Rollback plan: versioned artifacts per stage (baseline → pruned → recovered → recalibrated). Progressive traffic ramp with automated rollback on error budgets or p99 regressions (a small guardrail sketch follows this list).
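The guardrail math is small enough to sketch directly: energy per token from periodic power samples, plus a rollback gate on p99 and error budgets. The threshold values are illustrative assumptions.

def energy_per_token_j(power_samples_w, interval_s, tokens_generated):
    """Approximate energy per token (joules) from periodic power samples."""
    energy_j = sum(power_samples_w) * interval_s   # rectangle-rule integration
    return energy_j / max(tokens_generated, 1)

def should_rollback(candidate, baseline, p99_budget=1.15, max_error_rate_pct=1.0):
    """Gate a traffic ramp: candidate/baseline are dicts with 'p99_ms' and 'error_rate_pct'."""
    if candidate["p99_ms"] > baseline["p99_ms"] * p99_budget:
        return True
    if candidate["error_rate_pct"] > max_error_rate_pct:
        return True
    return False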

Troubleshooting quick hits

  • FP8 overflow/NaNs: tighten clip‑max or re‑collect scales with outlier‑heavy prompts (a detection‑hook sketch follows this list).
  • Speedup <1.2×: check that 2:4 masks align to the correct axis and that all production shapes are profiled with sparsity enabled.
  • Attention regressions: roll back Q/K pruning in late layers or increase LoRA rank on those modules.
  • p99 spikes: ensure the scheduler doesn’t exceed profiled max lengths, and that batching does not introduce shape drift.
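For the FP8 overflow/NaN case, a lightweight debugging aid is to hook linear modules and flag non‑finite activations while re‑collecting scales. This is a sketch, not part of the production path; detach the hooks once scales are stable.

import torch

def attach_nan_hooks(model: torch.nn.Module):
    """Register forward hooks that flag NaN/Inf outputs on Linear modules."""
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                print(f"[fp8-debug] non-finite activation in {name}")
        return hook

    for name, mod in model.named_modules():
        if isinstance(mod, torch.nn.Linear):
            handles.append(mod.register_forward_hook(make_hook(name)))
    return handles

# Usage: handles = attach_nan_hooks(model); run the calibration pass; then
# for h in handles: h.remove()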

Practical Examples

1) 2:4 mask application (PyTorch, illustrative)

import torch

@torch.no_grad()
def apply_2_4_mask(weight: torch.Tensor, group_dim: int = -1):
    # Reshape so groups of 4 contiguous values lie along the innermost dimension
    assert weight.shape[group_dim] % 4 == 0
    w = weight.transpose(group_dim, -1)
    g = w.reshape(*w.shape[:-1], w.shape[-1] // 4, 4)
    # Magnitude score within each group of 4
    scores = g.abs()
    # Keep the top-2 entries per group (the 2:4 pattern)
    top2 = scores.topk(k=2, dim=-1).indices
    mask = torch.zeros_like(g, dtype=torch.bool)
    mask.scatter_(-1, top2, True)
    w_pruned = (g * mask).reshape_as(w)
    return w_pruned.transpose(group_dim, -1)

# Example: apply to eligible linear layers (is_eligible is your own selection
# logic, e.g. the sketch in the pipeline section above)
with torch.no_grad():
    for name, mod in model.named_modules():
        if isinstance(mod, torch.nn.Linear) and is_eligible(name):
            mod.weight.copy_(apply_2_4_mask(mod.weight, group_dim=1))  # group along K

Note: The exact grouping/layout must match cuSPARSELt’s expectations for Sparse Tensor Core engagement.

2) FP8 calibration with Transformer Engine (simplified)

import torch
import transformer_engine.pytorch as te

# FP8 applies to Transformer Engine modules (e.g., te.Linear) running inside an
# fp8_autocast region; the context manager manages per-tensor scaling statistics.

# Calibration pass
model.eval()
with torch.no_grad():
    for batch in calib_loader:
        with te.fp8_autocast(enabled=True):
            _ = model(**batch)
# Save collected scales (the framework manages per-tensor stats)

Reference: NVIDIA Transformer Engine provides FP8 casting, scaling recipes, and integration guidance.

3) LoRA recovery (PEFT‑style pseudocode)

from peft import LoraConfig, get_peft_model

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(model, lora)
# Train for 3–10k steps on mixed instruction/math/code; early stop on MMLU/GSM8K slice

Background: LoRA/AdaLoRA recover quality at low compute following pruning.

4) TensorRT‑LLM build (example config excerpt)

{
  "precision": "fp8",
  "enable_sparse_weights": true,
  "fused_attention": "flash_v2",
  "profiles": [
    {"prompt": [1, 2048], "response": [1, 512]},
    {"prompt": [1, 4096], "response": [1, 1024]}
  ],
  "plugins": {"kv_cache": {"static": true}}
}

Build and run (CLI varies by version; check TensorRT‑LLM docs):

trtllm-build --model ./pruned_lora_recovered --config ./trt_config.json --output ./engine
trtllm-run --engine ./engine --dataset ./eval.jsonl --metrics tokens_per_s,latency_p50,latency_p99

Documentation: TensorRT‑LLM repository and docs for enabling FP8, sparsity, and fused kernels.

5) Power and latency logging

# Power
nvidia-smi dmon -s pucmt -i 0 -o DT >> power.log &
# Latency distribution via your runner
trtllm-run ... --metrics latency_histogram
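If you prefer to sample power from Python instead of parsing the nvidia‑smi log, NVML's Python bindings expose board power directly. A minimal sketch, assuming GPU index 0 and a 100 ms sampling interval:

import time
import pynvml  # pip install nvidia-ml-py (or pynvml)

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

samples_w = []
try:
    for _ in range(600):  # ~60 s of samples at 100 ms (assumption; tune to your run)
        mw = pynvml.nvmlDeviceGetPowerUsage(handle)  # board power in milliwatts
        samples_w.append(mw / 1000.0)
        time.sleep(0.1)
finally:
    pynvml.nvmlShutdown()

# Pair average power with tokens/s from the runner to estimate energy per token.
print(f"avg power: {sum(samples_w) / len(samples_w):.1f} W")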

6) INT8 alternative (weight‑only)

If your stack prefers INT8, establish a GPTQ/LLM.int8() baseline, prune, then re‑quantize/re‑calibrate before engine build.
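As a minimal sketch of that baseline, an LLM.int8()‑style weight‑only load via transformers with bitsandbytes looks like the following; the model identifier is a placeholder, and a GPTQ baseline would follow an analogous load‑quantize‑evaluate flow with its own tooling.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Weight-only 8-bit baseline (LLM.int8() via bitsandbytes); model ID is a placeholder.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-dense-model",  # use the same checkpoint as the FP16 baseline
    quantization_config=bnb_config,
    device_map="auto",
)
# Measure quality and latency here, then prune and re-quantize/re-calibrate
# before building the production engine.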

Conclusion

2:4 sparsity plus FP8 on Hopper is no longer a lab trick—it’s a deployable recipe that consistently yields 1.5–2.0× throughput and material energy savings when executed with discipline in TensorRT‑LLM. The critical path is operational: nail your baseline, prune only eligible layers with correct orientation, recalibrate precision after structural changes, recover with lightweight adapters, and keep engines on optimized kernels with static shape profiles. A rigorous validation matrix, canaries, and rollback guardrails turn those gains into something you can trust at scale.

Key takeaways

  • Establish a dense FP16/FP8 baseline in TensorRT‑LLM with fused attention before any pruning.
  • Apply 2:4 only to eligible linear/FFN mats, respecting cuSPARSELt’s grouping; prune conservatively in late attention layers.
  • Re‑calibrate FP8 (or INT8) after pruning and run brief LoRA/AdaLoRA recovery to keep metric deltas within ~1–2 points.
  • Validate tokens/s, latency p50/p99, memory, and power; use MMLU, GSM8K, HumanEval, MT‑Bench/BBH, and long‑context to catch regressions.
  • Guard with canaries and versioned rollbacks; shape discipline is essential to keep sparse kernels engaged.

Next steps

  • Prototype the pipeline on a mid‑size model (e.g., 7–13B) with a small calibration set to validate tooling.
  • Move to your target scale, stage sparsity in increments, and codify pass/fail thresholds for quality and latency.
  • Automate engine builds for each profile bucket and integrate power/latency telemetry into your deploy pipeline.

Forward‑looking: as TensorRT‑LLM coverage widens and FP8 tooling matures, expect easier, more automated paths to mix 2:4 sparsity with precision scaling—and more of your fleet running comfortably in the fast lane.

Sources & References

  • Accelerating Sparsity in the NVIDIA Ampere Architecture (developer.nvidia.com): Explains 2:4 structured sparsity and the 2× kernel‑level throughput uplift on NVIDIA GPUs used in this production recipe.
  • cuSPARSELt Documentation (docs.nvidia.com): Documents Sparse Tensor Core requirements, grouping/orientation, and APIs that underpin 2:4 execution in production.
  • TensorRT‑LLM repository and docs (github.com): Provides the production engine, configuration, and build guidance for enabling FP8, sparsity, and fused attention kernels.
  • NVIDIA Transformer Engine (github.com): Describes FP8 pipelines and scaling/calibration practices essential for the FP8 stages of this cookbook.
  • FlashAttention‑2: Faster Attention with Better Parallelism and Work Partitioning (arxiv.org): Supports the recommendation to use fused attention to shift bottlenecks and magnify realized 2:4 speedups.
  • Are Sixteen Heads Really Better than One? (arxiv.org): Background for preserving KV‑critical heads and cautious attention pruning decisions in the pipeline.
  • SparseGPT: Massive Language Models Can Be Accurately Pruned in One‑Shot (arxiv.org): Used in comparisons to explain limitations of unstructured sparsity for throughput without specialized kernels.
  • Wanda: A Simple and Effective Pruning Approach for Large Language Models (arxiv.org): Complements the comparison by describing activation‑aware pruning and why it does not directly yield speedups without kernel support.
  • CUTLASS sparse examples (github.com): Supports the block‑structured sparsity comparison and kernel considerations.
  • GPTQ: Accurate Post‑Training Quantization for Generative Pre‑trained Transformers (arxiv.org): Cited for INT8 post‑training quantization baselines and post‑pruning recalibration.
  • LLM.int8(): 8‑bit Matrix Multiplication for Transformers at Scale (arxiv.org): Provides background and practices for INT8 quantization used as an alternative precision path.
  • MMLU: Measuring Massive Multitask Language Understanding (arxiv.org): One of the key evaluation benchmarks used in the validation matrix.
  • GSM8K: Training Verifiers to Solve Math Word Problems (arxiv.org): Math reasoning benchmark used to validate pruning and recovery impacts.
  • HumanEval: Evaluating Large Language Models Trained on Code (arxiv.org): Code generation benchmark included in the validation suite.
  • MT‑Bench (arxiv.org): Instruction‑following/dialogue benchmark for post‑pruning evaluation.
  • BIG‑bench: Beyond the Imitation Game Benchmark (arxiv.org): Provides broad task coverage for post‑pruning evaluation.
  • LoRA: Low‑Rank Adaptation of Large Language Models (arxiv.org): Foundation for the adapter‑based recovery step after structured pruning.
  • AdaLoRA: Adaptive Budget Allocation for Parameter‑Efficient Fine‑Tuning (arxiv.org): Supports adaptive allocation of adapter capacity during recovery in this pipeline.
