
Deploying 2:4 Sparsity with FP8 on Hopper: A Production Cookbook

A step‑by‑step playbook to take a dense model to a pruned, adapter‑recovered, calibrated build in TensorRT‑LLM

By AI Research Team

On Hopper‑class GPUs, the combination of 2:4 structured sparsity and FP8 precision routinely delivers 1.5–2.0× end‑to‑end throughput gains for LLM decoding, while cutting energy per token by 20–40% when implemented in production engines like TensorRT‑LLM. The kernel‑level picture is simple: NVIDIA’s Sparse Tensor Cores double effective math throughput for eligible GEMMs under the 2:4 N:M pattern, and Transformer Engine’s FP8 pipeline reduces bandwidth and scales activations robustly. But making those gains stick in production—without tanking quality—requires a careful pipeline: the right layers, properly curated calibration, safe pruning, quick adapter‑based recovery, and a well‑tuned engine build.

This article is a hands‑on playbook to take a dense model to a pruned, adapter‑recovered, FP8‑calibrated build in TensorRT‑LLM. We’ll show you how to baseline and profile, select layers and formats compatible with 2:4 pipelines, apply structured pruning across linear/FFN paths, recover with LoRA/AdaLoRA, integrate into a production engine with static shapes and fused attention, and validate with a robust matrix of metrics, latency distributions, memory, and power logging. By the end, you’ll have a repeatable, rollback‑safe process you can operate in a real fleet.

Architecture/Implementation Details

End‑to‑end pipeline overview

  1. Baseline in production conditions
  • Build a dense FP16 or FP8 baseline in TensorRT‑LLM with your production decoding params (batching, max prompt/response lengths, tokenizer, scheduler).
  • Enable fused attention (e.g., FlashAttention‑style kernels) so the MLP matmuls become the dominant compute path you’ll later sparsify.
  • Measure tokens/s, p50/p99 latency, peak/activation memory, and GPU power (nvidia‑smi dmon or NVML). This is your control.
  2. Choose 2:4‑eligible layers and orientation
  • Target linear/FFN matrices with large K dimensions: attention QKV/proj and MLP up/down projections are typically eligible (a selection sketch follows this list).
  • Avoid embedding tables, layer norms, and the final LM head for pruning.
  • Respect the 2:4 grouping along the innermost compute dimension expected by cuSPARSELt (groups of 4 contiguous values along K) so Sparse Tensor Cores can engage.
  3. Calibrate for FP8 and sparsity
  • Curate a calibration set of 2–5k prompts that mirrors production: instruction, reasoning, and code. Include a long‑context slice.
  • Run FP8 activation/weight scaling with Transformer Engine (per‑tensor dynamic scaling) to avoid overflow while you collect activation stats.
  4. Apply 2:4 structured pruning safely
  • Use magnitude or activation‑aware scores within each 4‑value group to drop the two least‑important elements.
  • Start at 30–40% global sparsity (2:4 is 50% within eligible mats but applied only to selected layers). Avoid aggressive pruning in late attention layers which can be KV‑critical for long‑range behavior.
  5. Adapter‑based recovery (LoRA/AdaLoRA)
  • Attach low‑rank adapters to pruned modules and fine‑tune for a few thousand steps on a task mixture similar to your calibration set.
  • Use early stopping based on a small validation basket (e.g., slices of MMLU/GSM8K/MT‑Bench/HumanEval) to recover 0.5–2 points without expensive full SFT.
  6. Precision pipeline and post‑pruning recalibration
  • Establish a stable FP8 baseline first, then prune, then re‑calibrate FP8 scales post‑pruning.
  • If you prefer INT8 (W8A8 or weight‑only), create a pre‑pruning baseline (LLM.int8(), GPTQ), prune, then re‑calibrate and validate.
  7. Engine integration for production
  • Export pruned weights with 2:4 masks.
  • Build a TensorRT‑LLM engine with FP8 and sparsity enabled, static shape profiles covering your prompt/response envelope, and fused attention enabled.
  • Verify sparse kernel engagement across all configured shapes; shape mismatches will silently disable sparsity.
  8. Validate and iterate
  • Re‑measure tokens/s, p50/p99 latency, memory, and power. Expect 1.3–1.8× from 2:4 alone and 1.5–2.0× with FP8 on H100/H200, with 20–40% energy per token reductions.
  • Evaluate MMLU, GSM8K, HumanEval, MT‑Bench, BBH, and a long‑context task with fixed decoding params for apples‑to‑apples comparisons.
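To make step 2 concrete, here is a minimal selection sketch in PyTorch. The name‑based exclusion keywords and the min_k threshold are assumptions for a typical decoder‑only model; adapt them to your architecture and confirm final eligibility against your engine and the cuSPARSELt documentation.

import torch

# Hypothetical exclusion keywords; adjust to your model's module names.
EXCLUDE_KEYWORDS = ("embed", "lm_head", "norm")

def select_2_4_eligible(model: torch.nn.Module, min_k: int = 1024):
    """Return names of Linear modules that are reasonable 2:4 targets.

    Heuristics: skip embeddings/norms/LM head, and require the inner (K)
    dimension to be a multiple of 4 and large enough to matter.
    """
    eligible = []
    for name, mod in model.named_modules():
        if not isinstance(mod, torch.nn.Linear):
            continue
        if any(kw in name.lower() for kw in EXCLUDE_KEYWORDS):
            continue
        k_dim = mod.in_features  # inner compute dimension for y = x @ W^T
        if k_dim % 4 == 0 and k_dim >= min_k:
            eligible.append(name)
    return eligible

The returned names feed the pruning pass (step 4) and the is_eligible check used in the mask‑application example later in this article.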

Diagram (conceptual): Data flows from tokenizer → FP8‑calibrated embeddings → attention blocks (FlashAttention‑2 fused kernels) → FFN blocks (2:4 sparse matmuls via cuSPARSELt) → LM head. Recovery adapters sit on pruned linear paths. Static shape profiles ensure kernels stay on optimized paths.

Practical performance notes on Hopper

  • Eligibility is binary: only correctly masked 2:4 mats get the 2× kernel uplift. A single mis‑grouped axis or unsupported layout drops you to dense kernels (see the verification sketch after this list).
  • FP8’s biggest gains come from bandwidth savings and keeping mats in fast paths; keep scale outliers in check with updated calibration after pruning.
  • Fused/optimized attention (e.g., FlashAttention‑2) reduces attention overhead, amplifying the speedup realized from sparsifying the MLPs.
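A quick way to catch mis‑grouped masks before an engine build is to verify the 2:4 pattern directly. This is a minimal sketch assuming the same grouping convention as the mask‑application example later in this article (groups of 4 contiguous values along the inner K dimension).

import torch

def satisfies_2_4(weight: torch.Tensor, group_dim: int = 1) -> bool:
    """True if every contiguous group of 4 along group_dim has at most 2 nonzeros."""
    w = weight.transpose(group_dim, -1)
    if w.shape[-1] % 4 != 0:
        return False
    groups = w.reshape(*w.shape[:-1], w.shape[-1] // 4, 4)
    nonzeros_per_group = (groups != 0).sum(dim=-1)
    return bool((nonzeros_per_group <= 2).all())

# Usage: run over every pruned Linear before export; any False means that
# layer will silently fall back to dense kernels.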

Comparison Tables

Precision and sparsity recipes on Hopper (TensorRT‑LLM)

Recipe | Expected e2e speedup (decoding) | Memory/BW impact | Risk profile | Notes
Dense FP16 | 1.0× | High | Low | Baseline; easiest to validate
Dense FP8 | 1.2–1.4× | Activation BW ↓ | Low–mod | Requires careful scaling; good first step
2:4 + FP16 | 1.3–1.8× | Weight BW ↓ | Mod | Ensure layer eligibility/orientation
2:4 + FP8 | 1.5–2.0× | Weight + activation BW ↓ | Mod | Sweet spot on H100/H200
INT8 W8A8 | 1.2–1.6× | Weight/act BW ↓ | Mod | Broadly supported; recalibrate after pruning

What to prune (dense LLMs)

Module | 2:4 eligible | Risk to quality | Guidance
Attention Q/K/V | Yes | Medium | Prune conservatively in later layers; preserve KV‑critical heads
Attention output proj | Yes | Low–mod | Generally safe; validate long‑context
FFN up/down (gate) | Yes | Low | Primary target for 2:4 speedups
Embeddings | No | High | Do not prune
RMSNorm/LayerNorm | No | High | Do not prune
LM head | Generally avoid | High | Optional only with strong validation

Pros and cons (2:4 vs alternatives on NVIDIA)

Approach | Pros | Cons
2:4 N:M | Kernel‑level 2× math throughput; production‑ready in TensorRT‑LLM | Pattern constraints; strict layout/orientation
FP8 only | Easy to adopt; portable across layers | Smaller gains; still bandwidth‑bound in places
Unstructured sparsity | High compressibility | Little/no speedup without specialized kernels
Block‑structured sparsity | Good locality; easier kernels than unstructured | Requires tuned custom kernels or specific coverage

Best Practices

Calibration set curation and scaling hygiene

  • Mix instruction, reasoning (math), code, and long‑context prompts; 2–5k prompts is enough for stable scaling and pruning scores (a curation sketch follows this list).
  • Fix decoding params (temperature/top‑p/max‑new‑tokens) for both calibration and evaluation to avoid confounds.
  • For FP8, use Transformer Engine’s per‑tensor dynamic scaling and re‑run calibration after pruning to account for distribution shifts.
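Below is a minimal curation sketch, assuming prompt pools stored as JSONL files with a "prompt" field; the file paths, mixture ratios, and decoding parameters are illustrative placeholders, not fixed recommendations.

import json
import random

# Hypothetical prompt pools; replace with production-representative data.
POOLS = {
    "instruction": "calib/instruction.jsonl",
    "reasoning": "calib/math.jsonl",
    "code": "calib/code.jsonl",
    "long_context": "calib/long_context.jsonl",
}
MIX = {"instruction": 0.40, "reasoning": 0.25, "code": 0.25, "long_context": 0.10}
TOTAL = 3000  # within the 2-5k range suggested above

def load_prompts(path):
    with open(path) as f:
        return [json.loads(line)["prompt"] for line in f]

random.seed(0)  # reproducible calibration set
calib_prompts = []
for name, frac in MIX.items():
    pool = load_prompts(POOLS[name])
    calib_prompts += random.sample(pool, min(int(TOTAL * frac), len(pool)))
random.shuffle(calib_prompts)

# Fix decoding params once and reuse them for calibration and evaluation.
DECODING = {"temperature": 0.0, "top_p": 1.0, "max_new_tokens": 256}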

Safe structured pruning

  • Prune only layers with proven 2:4 support in your engine. Follow cuSPARSELt’s grouping along the inner K dimension and keep weights in supported layouts.
  • Stage sparsity increases (e.g., 20% → 30% → 40%) with quick evaluations between steps, and stop when MMLU/GSM8K move by more than ~1–2 points (a staged‑pruning sketch follows this list).
  • Treat late attention layers as KV‑critical: prune less aggressively there, and preserve heads known to carry long‑range signal.
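The staged schedule can be expressed as a small loop with a quality guardrail. prune_to_budget and evaluate_slice are hypothetical helpers: the first applies 2:4 masks to enough eligible layers to reach the global budget, the second scores a small MMLU/GSM8K slice with fixed decoding params.

# Staged pruning with rollback on quality regressions (sketch).
MAX_DROP = 1.5  # stop once the averaged slice score drops by more than ~1-2 points

baseline_score = evaluate_slice(model)                 # hypothetical eval helper
last_good_state = {k: v.clone() for k, v in model.state_dict().items()}

for budget in (0.20, 0.30, 0.40):
    prune_to_budget(model, budget)                     # hypothetical 2:4 pruning helper
    score = evaluate_slice(model)
    if baseline_score - score > MAX_DROP:
        model.load_state_dict(last_good_state)         # roll back to the previous stage
        break
    last_good_state = {k: v.clone() for k, v in model.state_dict().items()}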

Adapter‑based recovery

  • Start with LoRA rank 8–16 on attention and FFN paths; raise rank only if validation does not recover within 1 point on your metrics.
  • AdaLoRA can allocate rank dynamically across modules; useful when pruning budgets differ by layer depth.
  • Train with your calibration mix and a light weight on code/math to stabilize GSM8K/HumanEval.

Engine integration and shape discipline

  • Build engines with static shapes that cover real workloads (prompt/response buckets); a bucket‑check sketch follows this list. If you fall off a profiled shape at runtime, sparsity may be disabled.
  • Enable fused attention (FlashAttention‑2 style) to surface the MLP bottleneck and maximize end‑to‑end gains.
  • Verify sparsity engagement by inspecting kernel traces and TensorRT‑LLM logs; a sudden p99 spike often signals a fallback to dense paths.
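A sketch of the bucket check, assuming the two illustrative profiles from the config excerpt later in this article; the real bucket list comes from the shapes you actually build into the engine.

# Profiled (max_prompt_tokens, max_response_tokens) buckets; values are assumptions
# and must match the static shape profiles baked into the engine.
PROFILES = [(2048, 512), (4096, 1024)]

def pick_profile(prompt_len: int, max_new_tokens: int):
    """Return the smallest profile that covers the request, or None."""
    for max_prompt, max_resp in sorted(PROFILES):
        if prompt_len <= max_prompt and max_new_tokens <= max_resp:
            return (max_prompt, max_resp)
    return None

# In the scheduler: requests where pick_profile(...) is None must be truncated,
# split, or rejected so the engine never drifts off its optimized shapes.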

Validation matrix and guardrails

  • Report: tokens/s, p50/p90/p99 latency, peak/activation memory, energy per token (via power logging). Compare apples‑to‑apples.
  • Task suite: MMLU, GSM8K, HumanEval, MT‑Bench/BBH, plus at least one long‑context benchmark.
  • Canary tests: short prompts targeting safety, instruction‑following, and long‑context integrity.
  • Rollback plan: versioned artifacts per stage (baseline → pruned → recovered → recalibrated). Progressive traffic ramp with automated rollback on error budgets or p99 regressions (a small guardrail sketch follows this list).
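The guardrail math is small enough to sketch directly: energy per token from periodic power samples, plus a rollback gate on p99 and error budgets. The threshold values are illustrative assumptions.

def energy_per_token_j(power_samples_w, interval_s, tokens_generated):
    """Approximate energy per token (joules) from periodic power samples."""
    energy_j = sum(power_samples_w) * interval_s   # rectangle-rule integration
    return energy_j / max(tokens_generated, 1)

def should_rollback(candidate, baseline, p99_budget=1.15, max_error_rate_pct=1.0):
    """Gate a traffic ramp: candidate/baseline are dicts with 'p99_ms' and 'error_rate_pct'."""
    if candidate["p99_ms"] > baseline["p99_ms"] * p99_budget:
        return True
    if candidate["error_rate_pct"] > max_error_rate_pct:
        return True
    return False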

Troubleshooting quick hits

  • FP8 overflow/NaNs: tighten clip‑max or re‑collect scales with outlier‑heavy prompts (a detection‑hook sketch follows this list).
  • Speedup <1.2×: check that 2:4 masks align to the correct axis and that all production shapes are profiled with sparsity enabled.
  • Attention regressions: roll back Q/K pruning in late layers or increase LoRA rank on those modules.
  • p99 spikes: ensure the scheduler doesn’t exceed profiled max lengths, and that batching does not introduce shape drift.
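For the FP8 overflow/NaN case, a lightweight debugging aid is to hook linear modules and flag non‑finite activations while re‑collecting scales. This is a sketch, not part of the production path; detach the hooks once scales are stable.

import torch

def attach_nan_hooks(model: torch.nn.Module):
    """Register forward hooks that flag NaN/Inf outputs on Linear modules."""
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                print(f"[fp8-debug] non-finite activation in {name}")
        return hook

    for name, mod in model.named_modules():
        if isinstance(mod, torch.nn.Linear):
            handles.append(mod.register_forward_hook(make_hook(name)))
    return handles

# Usage: handles = attach_nan_hooks(model); run the calibration pass; then
# for h in handles: h.remove()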

Practical Examples

1) 2:4 mask application (PyTorch, illustrative)

import torch

@torch.no_grad()
def apply_2_4_mask(weight: torch.Tensor, group_dim: int = -1):
    # Reshape so groups of 4 contiguous values lie along the innermost dimension
    assert weight.shape[group_dim] % 4 == 0
    w = weight.transpose(group_dim, -1)
    g = w.reshape(*w.shape[:-1], w.shape[-1] // 4, 4)
    # Magnitude score within each group of 4
    scores = g.abs()
    # Keep the top-2 entries per group (the 2:4 pattern)
    top2 = scores.topk(k=2, dim=-1).indices
    mask = torch.zeros_like(g, dtype=torch.bool)
    mask.scatter_(-1, top2, True)
    w_pruned = (g * mask).reshape_as(w)
    return w_pruned.transpose(group_dim, -1)

# Example: apply to eligible linear layers (is_eligible is your own selection
# logic, e.g. the sketch in the pipeline section above)
with torch.no_grad():
    for name, mod in model.named_modules():
        if isinstance(mod, torch.nn.Linear) and is_eligible(name):
            mod.weight.copy_(apply_2_4_mask(mod.weight, group_dim=1))  # group along K

Note: The exact grouping/layout must match cuSPARSELt’s expectations for Sparse Tensor Core engagement.

2) FP8 calibration with Transformer Engine (simplified)

import torch
import transformer_engine.pytorch as te

# FP8 applies to Transformer Engine modules (e.g., te.Linear) running inside an
# fp8_autocast region; the context manager manages per-tensor scaling statistics.

# Calibration pass
model.eval()
with torch.no_grad():
    for batch in calib_loader:
        with te.fp8_autocast(enabled=True):
            _ = model(**batch)
# Save collected scales (the framework manages per-tensor stats)

Reference: NVIDIA Transformer Engine provides FP8 casting, scaling recipes, and integration guidance.

3) LoRA recovery (PEFT‑style pseudocode)

from peft import LoraConfig, get_peft_model

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(model, lora)
# Train for 3–10k steps on mixed instruction/math/code; early stop on MMLU/GSM8K slice

Background: LoRA/AdaLoRA recover quality at low compute following pruning.

4) TensorRT‑LLM build (example config excerpt)

{
  "precision": "fp8",
  "enable_sparse_weights": true,
  "fused_attention": "flash_v2",
  "profiles": [
    {"prompt": [1, 2048], "response": [1, 512]},
    {"prompt": [1, 4096], "response": [1, 1024]}
  ],
  "plugins": {"kv_cache": {"static": true}}
}

Build and run (CLI varies by version; check TensorRT‑LLM docs):

trtllm-build --model ./pruned_lora_recovered --config ./trt_config.json --output ./engine
trtllm-run --engine ./engine --dataset ./eval.jsonl --metrics tokens_per_s,latency_p50,latency_p99

Documentation: TensorRT‑LLM repository and docs for enabling FP8, sparsity, and fused kernels.

5) Power and latency logging

# Power
nvidia-smi dmon -s pucmt -i 0 -o DT >> power.log &
# Latency distribution via your runner
trtllm-run ... --metrics latency_histogram
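If you prefer to sample power from Python instead of parsing the nvidia‑smi log, NVML's Python bindings expose board power directly. A minimal sketch, assuming GPU index 0 and a 100 ms sampling interval:

import time
import pynvml  # pip install nvidia-ml-py (or pynvml)

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

samples_w = []
try:
    for _ in range(600):  # ~60 s of samples at 100 ms (assumption; tune to your run)
        mw = pynvml.nvmlDeviceGetPowerUsage(handle)  # board power in milliwatts
        samples_w.append(mw / 1000.0)
        time.sleep(0.1)
finally:
    pynvml.nvmlShutdown()

# Pair average power with tokens/s from the runner to estimate energy per token.
print(f"avg power: {sum(samples_w) / len(samples_w):.1f} W")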

6) INT8 alternative (weight‑only)

If your stack prefers INT8, establish a GPTQ/LLM.int8() baseline, prune, then re‑quantize/re‑calibrate before engine build.
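As a minimal sketch of that baseline, an LLM.int8()‑style weight‑only load via transformers with bitsandbytes looks like the following; the model identifier is a placeholder, and a GPTQ baseline would follow an analogous load‑quantize‑evaluate flow with its own tooling.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Weight-only 8-bit baseline (LLM.int8() via bitsandbytes); model ID is a placeholder.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-dense-model",  # use the same checkpoint as the FP16 baseline
    quantization_config=bnb_config,
    device_map="auto",
)
# Measure quality and latency here, then prune and re-quantize/re-calibrate
# before building the production engine.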

Conclusion

2:4 sparsity plus FP8 on Hopper is no longer a lab trick—it’s a deployable recipe that consistently yields 1.5–2.0× throughput and material energy savings when executed with discipline in TensorRT‑LLM. The critical path is operational: nail your baseline, prune only eligible layers with correct orientation, recalibrate precision after structural changes, recover with lightweight adapters, and keep engines on optimized kernels with static shape profiles. A rigorous validation matrix, canaries, and rollback guardrails turn those gains into something you can trust at scale.

Key takeaways

  • Establish a dense FP16/FP8 baseline in TensorRT‑LLM with fused attention before any pruning.
  • Apply 2:4 only to eligible linear/FFN mats, respecting cuSPARSELt’s grouping; prune conservatively in late attention layers.
  • Re‑calibrate FP8 (or INT8) after pruning and run brief LoRA/AdaLoRA recovery to keep metric deltas within ~1–2 points.
  • Validate tokens/s, latency p50/p99, memory, and power; use MMLU, GSM8K, HumanEval, MT‑Bench/BBH, and long‑context to catch regressions.
  • Guard with canaries and versioned rollbacks; shape discipline is essential to keep sparse kernels engaged.

Next steps

  • Prototype the pipeline on a mid‑size model (e.g., 7–13B) with a small calibration set to validate tooling.
  • Move to your target scale, stage sparsity in increments, and codify pass/fail thresholds for quality and latency.
  • Automate engine builds for each profile bucket and integrate power/latency telemetry into your deploy pipeline.

Forward‑looking: as TensorRT‑LLM coverage widens and FP8 tooling matures, expect easier, more automated paths to mix 2:4 sparsity with precision scaling—and more of your fleet running comfortably in the fast lane.

Sources & References

  • Accelerating Sparsity in the NVIDIA Ampere Architecture (developer.nvidia.com): Explains 2:4 structured sparsity and the 2× kernel‑level throughput uplift on NVIDIA GPUs used in this production recipe.
  • cuSPARSELt Documentation (docs.nvidia.com): Documents Sparse Tensor Core requirements, grouping/orientation, and APIs that underpin 2:4 execution in production.
  • TensorRT‑LLM repository and docs (github.com): Provides the production engine, configuration, and build guidance for enabling FP8, sparsity, and fused attention kernels.
  • NVIDIA Transformer Engine (github.com): Describes FP8 pipelines and scaling/calibration practices essential for the FP8 stages of this cookbook.
  • FlashAttention‑2: Faster Attention with Better Parallelism and Work Partitioning (arxiv.org): Supports the recommendation to use fused attention to shift bottlenecks and magnify realized 2:4 speedups.
  • Are Sixteen Heads Really Better than One? (arxiv.org): Background for preserving KV‑critical heads and cautious attention pruning decisions in the pipeline.
  • SparseGPT: Massive Language Models Can Be Accurately Pruned in One‑Shot (arxiv.org): Used in comparisons to explain limitations of unstructured sparsity for throughput without specialized kernels.
  • Wanda: A Simple and Effective Pruning Approach for Large Language Models (arxiv.org): Complements the comparison by describing activation‑aware pruning and why it does not directly yield speedups without kernel support.
  • CUTLASS sparse examples (github.com): Supports the block‑structured sparsity comparison and kernel considerations.
  • GPTQ: Accurate Post‑Training Quantization for Generative Pre‑trained Transformers (arxiv.org): Cited for INT8 post‑training quantization baselines and post‑pruning recalibration.
  • LLM.int8(): 8‑bit Matrix Multiplication for Transformers at Scale (arxiv.org): Provides background and practices for INT8 quantization used as an alternative precision path.
  • MMLU: Measuring Massive Multitask Language Understanding (arxiv.org): One of the key evaluation benchmarks used in the validation matrix.
  • GSM8K: Training Verifiers to Solve Math Word Problems (arxiv.org): Math reasoning benchmark used to validate pruning and recovery impacts.
  • HumanEval: Evaluating Large Language Models Trained on Code (arxiv.org): Code generation benchmark included in the validation suite.
  • MT‑Bench (arxiv.org): Instruction‑following/dialogue benchmark for post‑pruning evaluation.
  • BIG‑bench: Beyond the Imitation Game Benchmark (arxiv.org): Provides broad task coverage for post‑pruning evaluation.
  • LoRA: Low‑Rank Adaptation of Large Language Models (arxiv.org): Foundation for the adapter‑based recovery step after structured pruning.
  • AdaLoRA: Adaptive Budget Allocation for Parameter‑Efficient Fine‑Tuning (arxiv.org): Supports adaptive allocation of adapter capacity during recovery in this pipeline.
