AI • 8 min read • Intermediate

Reproducible RTX Training Playbook for 5090, 5080, and RTX PRO 5000

Hands‑on environment setup, benchmarking recipes, and tuning to produce trustworthy tokens/s and images/s

By AI Research Team

Training throughput on consumer and workstation GPUs has jumped in a single generation: independent vision training runs show the RTX 5090 averaging roughly 44% higher images/s than RTX 4090 across diverse timm models under PyTorch 2.6, with even larger gains on transformer-heavy architectures. That uplift doesn’t automatically translate into trustworthy tokens/s or images/s in your lab, though. Variability from drivers, CUDA/cuDNN mismatches, boost transients, and multi-GPU topology can swing results by double digits.

This playbook shows how to produce measurements you can trust on RTX 5090, 5080, and RTX PRO 5000. It covers clean, reproducible environment setup; power and thermal steady-state methods; repeatable training templates for transformers, vision, and SDXL; multi-GPU torchrun practices and NCCL tuning; and a results checklist you can publish confidently. You’ll learn which software versions align with Blackwell and Ada readiness, how to run steady-state tokens/s and images/s without boost artifacts, and how to document a complete reproducibility dossier. The goal: tightly controlled, apples-to-apples numbers that withstand scrutiny.

Architecture/Implementation Details

Clean, reproducible software stack

flowchart TD
 M[Components] --> A[Frameworks]
 M --> G[Precision and Kernels]
 M --> K[Distributed]
 A -->|uses| B["PyTorch 2.6+ with CUDA 12.8"]
 A -->|install| C[cuDNN 9-series]
 A -->|install| D["NCCL 2.19-2.20+"]
 A -->|prefer| E[Linux for measurements]
 A -->|aligns with| F[NVIDIA AI Enterprise 3.3]
 G -->|default to| H[bf16 autocast]
 G -->|enable| I[FlashAttention-2]
 G -->|track| J[FP8 Transformer Engine]
 K -->|uses| L[torchrun with NCCL backend]

A flowchart showcasing the architecture and implementation details for a clean and reproducible software stack for modern RTX training, covering frameworks, precision handling, and distributed computing.

Modern RTX training stability hinges on a coherent CUDA/cuDNN/PyTorch/NCCL matrix and kernels that match the GPU generation.

  • Frameworks:
  • Use PyTorch 2.6 or newer with CUDA 12.8 builds for Blackwell readiness.
  • Install cuDNN 9-series and NCCL 2.19–2.20+.
  • Prefer Linux for first-pass measurements; pro SKUs also align with NVIDIA AI Enterprise 3.3 support matrices.
  • Precision and kernels:
  • Default to bf16 autocast (mixed precision); gradient (loss) scaling is only needed if you fall back to fp16, since bf16 retains the fp32 exponent range.
  • Enable FlashAttention‑2 (or equivalent attention kernels) at 2k–8k contexts for transformers.
  • Track FP8 Transformer Engine enablement in framework release notes if you experiment with FP8 paths; validate convergence.
  • Distributed:
  • Use torchrun with NCCL backend.
  • Tune gradient bucket sizes and overlap compute/communication.
  • Do not mix GPU generations within a node.
  • Data pipeline:
  • Use pinned memory and NUMA-aware data loading on dual-CPU or multi-root systems.
  • Logging (mandatory):
  • GPU model/SKU, power/cooling/clock settings (stock vs OC).
  • Driver version, CUDA/cuDNN/NCCL versions, PyTorch build/commit.
  • Dataloader parameters, precision mode, kernel choices (e.g., FlashAttention‑2).
  • Throughput (tokens/s, images/s, steps/s), time-to-target loss/accuracy, global batch and grad accumulation.
  • Peak VRAM, checkpointing/sharding settings, and GPU-only plus wall power in steady state.
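To make the mandatory logging less error-prone, a small sketch like the following can snapshot the software side of the stack at the start of every run (field names and the nvidia-smi query are illustrative; adapt them to your own logging):

```python
import json
import subprocess
import torch

def capture_environment(path="env_snapshot.json"):
    """Record the framework/driver matrix alongside every benchmark run."""
    info = {
        "pytorch": torch.__version__,
        "cuda_build": torch.version.cuda,            # CUDA the wheel was built against
        "cudnn": torch.backends.cudnn.version(),
        "nccl": ".".join(map(str, torch.cuda.nccl.version())),
        "gpus": [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())],
        # Driver version as reported by nvidia-smi (assumed to be on PATH).
        "driver": subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True,
        ).stdout.strip().splitlines(),
    }
    with open(path, "w") as f:
        json.dump(info, f, indent=2)
    return info

if __name__ == "__main__":
    print(capture_environment())
```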

Baseline software matrix (Blackwell- and Ada-ready)

Stack component | Recommended version | Notes
PyTorch | ≥ 2.6 | Verified runs used 2.6 nightly; compile mode can help multi‑GPU CV
CUDA | 12.8 | Matches Blackwell-ready builds
cuDNN | 9.x | Use 9‑series with CUDA 12.8
NCCL | 2.19–2.20+ | Required for robust PCIe scaling and logging
Attention kernels | FlashAttention‑2 | Material at 2k–8k contexts
Mixed precision | bf16 | Default for robust training across Ada/Blackwell/Hopper

Hardware capabilities to align with

  • RTX 5090: 32 GB GDDR7 on 512‑bit bus, 1.792 TB/s bandwidth, PCIe Gen 5, no NVLink.
  • RTX 5080: 16 GB GDDR7 on 256‑bit bus, 960 GB/s bandwidth, PCIe Gen 5, no NVLink.
  • RTX PRO 5000 (Blackwell): 48 GB or 72 GB GDDR7 with ECC, PCIe Gen 5, up to ~1.344 TB/s bandwidth for the 48 GB model, no NVLink, fifth‑gen Tensor Cores, second‑gen Transformer Engine, and up to two MIG instances per GPU.

These specifications matter because memory bandwidth and capacity strongly influence training throughput and feasible global batches, especially for transformer contexts ≥2k.

Best Practices

Power, thermals, and steady-state methodology

Short “boosty” bursts distort performance claims. Capture steady state:

  • Warm-up period: Run each workload for 10–20 minutes before recording throughput and power to reach thermal equilibrium.
  • Power telemetry:
  • Record GPU-only power (from device telemetry) to normalize perf/W comparisons.
  • Log wall power for context; it captures platform overheads.
  • Cooling:
  • Sustained training behaves differently on blower-style vs open-air coolers; ensure adequate case airflow and monitor hotspot temperatures.
  • PSUs:
  • Follow vendor guidance; e.g., roughly 1000 W of recommended system power for RTX 5090-class rigs. An under-provisioned PSU risks instability under transient load spikes.
  • Clocks:
  • Avoid overclocks for baseline reproducibility. If you do test OC, document exact settings.
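For GPU-only power, one approach is to poll NVML while the workload runs and average only the post-warm-up window. A minimal sketch, assuming the pynvml bindings (provided by the nvidia-ml-py package) and purely illustrative durations:

```python
import time
import pynvml  # provided by the nvidia-ml-py package, imported as pynvml

def steady_state_power(device_index=0, warmup_s=900, sample_s=300, interval_s=1.0):
    """Average GPU-only power after a warm-up window (durations are illustrative)."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    time.sleep(warmup_s)                      # let clocks and thermals settle (10-20 min in practice)
    samples = []
    end = time.time() + sample_s
    while time.time() < end:
        samples.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)  # milliwatts -> watts
        time.sleep(interval_s)
    pynvml.nvmlShutdown()
    return sum(samples) / len(samples)
```

In practice you would run this in a separate process alongside the training job and log the averaged value together with the throughput it corresponds to.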

Transformer training template (LLM pretraining and fine-tuning)

Objective: comparable tokens/s and time-to-target loss at controlled contexts.

  • Precision and kernels:
  • Use bf16 autocast; apply gradient (loss) scaling only if you fall back to fp16.
  • Enable FlashAttention‑2 for 2k/4k/8k contexts.
  • Contexts and memory:
  • Run 2k, 4k, and 8k context lengths; test with and without gradient checkpointing.
  • Log peak VRAM, global batch size (including gradient accumulation), and any sharding (ZeRO/FSDP).
  • Models and feasibility:
  • 24–32 GB GPUs (e.g., RTX 5090, RTX 5000 Ada): prioritize LoRA/QLoRA for 7B/13B fine‑tunes; full‑parameter 7B at 2k–4k is feasible with checkpointing and sharding; 13B will be sharding‑heavy.
  • 48–72 GB GPUs (e.g., RTX PRO 5000, RTX 6000 Ada): larger per‑GPU global batches for 7B/13B and less reliance on deep sharding, enabling higher‑context 13B full‑parameter fine‑tunes.
  • FP8/FP4 pathways:
  • Blackwell hardware supports FP8 and FP4; FP8 TE can raise throughput and reduce memory if framework support is enabled. Validate convergence.
  • FP4 is early for general-purpose training; avoid unless your stack explicitly supports it.
  • Metrics to record:
  • Tokens/s, steps/s, time-to-target loss, peak VRAM, precision mode, attention kernel, checkpointing/sharding settings, GPU-only power.
  • Sanity checks:
  • Expect memory‑bound phases to benefit from RTX 5090 bandwidth; 32 GB VRAM enables larger batches and higher sequence contexts than 24 GB cards.
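To tie the template together, here is a minimal sketch of steady-state tokens/s measurement under bf16 autocast, with PyTorch's scaled-dot-product attention routed to flash kernels where the model uses SDPA. It assumes a Hugging-Face-style model that returns a .loss; the batch contents, step counts, and optimizer are placeholders, not a recommendation:

```python
import time
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

def measure_tokens_per_s(model, batch, steps=200, warmup=50):
    """Steady-state tokens/s under bf16 autocast; `model` and `batch` are placeholders."""
    model.cuda().train()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    with sdpa_kernel(SDPBackend.FLASH_ATTENTION):     # prefer flash kernels where SDPA is used
        for step in range(steps):
            if step == warmup:                        # discard warm-up steps from the measurement
                torch.cuda.synchronize()
                start = time.time()
            with torch.autocast("cuda", dtype=torch.bfloat16):
                loss = model(**batch).loss
            loss.backward()
            opt.step()
            opt.zero_grad(set_to_none=True)
    torch.cuda.synchronize()
    tokens = batch["input_ids"].numel() * (steps - warmup)
    return tokens / (time.time() - start)
```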

Vision training template (timm workflows)

Objective: comparable images/s, steps/s, and time-to-accuracy.

  • Baseline recipe:
  • PyTorch 2.6 with CUDA 12.8 and cuDNN 9‑series.
  • Use the timm reference training script with batch size 256, reporting both FP32 and mixed‑precision throughput.
  • Mixed precision and compile:
  • Enable AMP for mixed precision; use PyTorch compile mode to unlock additional speedups, especially on PCIe multi‑GPU.
  • What to expect:
  • Across diverse models, RTX 5090 averaged roughly +44% training throughput versus RTX 4090; transformer‑heavy vision models (e.g., Swin‑B) saw larger jumps, while classic CNNs (e.g., ResNet‑50) showed smaller—but still substantial—gains.
  • Metrics to record:
  • Images/s, steps/s, time-to-target top‑1 accuracy, precision mode, compile mode status, dataloader parallelism, GPU-only power in steady state.
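As an illustrative harness (not a replacement for the timm reference script), the following sketch measures steady-state training images/s on synthetic data with mixed precision and optional compile mode; the model name, batch size, and step counts are placeholders:

```python
import time
import torch
import timm

def images_per_s(model_name="resnet50", batch_size=256, steps=100, warmup=20, compile_model=True):
    """Steady-state training images/s on synthetic data (illustrative harness only)."""
    model = timm.create_model(model_name, pretrained=False, num_classes=1000).cuda().train()
    if compile_model:
        model = torch.compile(model)                  # compile mode; can help, especially multi-GPU CV
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    images = torch.randn(batch_size, 3, 224, 224, device="cuda")
    labels = torch.randint(0, 1000, (batch_size,), device="cuda")
    for step in range(steps):
        if step == warmup:                            # discard warm-up/compile steps
            torch.cuda.synchronize()
            start = time.time()
        with torch.autocast("cuda", dtype=torch.bfloat16):   # mixed precision (bf16 here)
            loss = torch.nn.functional.cross_entropy(model(images), labels)
        loss.backward()
        opt.step()
        opt.zero_grad(set_to_none=True)
    torch.cuda.synchronize()
    return batch_size * (steps - warmup) / (time.time() - start)
```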

SDXL training template

Objective: comparable samples/s and time-to-validation loss.

  • Precision and augmentations:
  • Fix bf16 training and control augmentations; keep the exact augmentation set and schedulers identical across GPUs.
  • Reporting:
  • Log samples/s and time-to-validation loss in steady state. Clearly distinguish training from inference.
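A cheap way to enforce "identical across GPUs" is to hash the serialized run configuration (augmentations, schedulers, seeds) and log the digest with every run. A sketch, assuming the configuration can be expressed as a plain dict; the fields shown are illustrative, not a complete SDXL recipe:

```python
import hashlib
import json

def config_digest(config: dict) -> str:
    """Stable hash of the run configuration; compare digests across GPUs and hosts."""
    canonical = json.dumps(config, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

# Example with illustrative fields only:
run_config = {
    "precision": "bf16",
    "resolution": 1024,
    "augmentations": ["random_crop", "hflip"],
    "lr_scheduler": {"type": "cosine", "warmup_steps": 500},
    "seed": 1234,
}
print(config_digest(run_config))
```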

Comparison Tables

GPU quick picks for this playbook

GPU | VRAM / Bandwidth | TGP | Training fit | Notes
GeForce RTX 5090 | 32 GB GDDR7 / 1.792 TB/s | 575 W | High‑throughput single‑node BF16 training; bandwidth-heavy fine‑tunes at 2k–4k; larger vision transformers | Clear uplift (~+44% average CV training vs 4090); no NVLink
GeForce RTX 5080 | 16 GB GDDR7 / 960 GB/s | 360 W | Entry Blackwell training where 16 GB suffices | Bandwidth helps, but 16 GB constrains batch/seq; no NVLink
RTX PRO 5000 (Blackwell) | 48/72 GB GDDR7 / up to ~1.344 TB/s (48 GB model) | ~300 W | Workstation reliability with ECC; full‑parameter 13B fine‑tunes at higher contexts; larger‑batch CV/SDXL | 5th‑gen Tensor Cores, 2nd‑gen Transformer Engine; PCIe‑only, no NVLink

Multi‑GPU scaling expectations (PCIe only)

Platform | Additional‑GPU efficiency (indicative) | Notes
RTX 6000 Ada | ~0.94–0.95 | Near‑linear CV scaling observed with AMP and compile mode
RTX 5090 | ~0.91–0.97 | Validate P2P and topology; PCIe Gen 5 recommended for heavy pipelines
RTX 4090 | ~0.62–0.75 | Lower efficiency; Blackwell/workstation platforms fare better

Note: Efficiency ranges reflect independent training indications; measure and report your own with full topology disclosure.

Multi‑GPU torchrun, NCCL tuning, and topology validation

flowchart TD
 A["Workstation & GeForce cards"] --> B[PCIe Topology Validation]
 B --> C{Validate Peer-to-Peer}
 C --> D[nvidia-smi Topology Checks]
 C --> E[NCCL Logging]
 B --> F[PCIe Link Speed]
 F --> G["Single-GPU (Gen 5 vs Gen 4)"]
 F --> H["Multi-GPU (Gen 5 provides more headroom)"]
 B --> I[NUMA Placement]
 I --> J[Affinitize Processes]

Diagram illustrating the workflow for validating GPU topology, PCIe link speed, and NUMA placement in a multi-GPU environment.

Workstation and GeForce cards here use PCIe (no NVLink), so topology diligence matters:

  • Validate peer-to-peer: Use nvidia‑smi topology checks and NCCL logging to confirm P2P and switch placement.
  • PCIe link speed:
  • Single‑GPU: Gen 5 vs Gen 4 is within a few percent across many tasks; Gen 3 is often close for typical LLM/CV kernels.
  • Multi‑GPU and bandwidth‑heavy pipelines: Gen 5 provides more headroom; data‑transfer‑heavy training suffers the most at lower PCIe generations.
  • NUMA placement:
  • On dual‑CPU or multi‑root systems, affinitize processes, ensure pinned memory, and make dataloaders NUMA‑aware.
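A quick programmatic cross-check of peer-to-peer status before launching can complement nvidia-smi topology output and NCCL logs; a minimal sketch:

```python
import torch

def p2p_matrix():
    """Print whether each GPU pair reports peer-to-peer access (PCIe P2P)."""
    n = torch.cuda.device_count()
    for i in range(n):
        for j in range(n):
            if i != j:
                ok = torch.cuda.can_device_access_peer(i, j)
                print(f"GPU{i} -> GPU{j}: P2P {'yes' if ok else 'no'}")

if __name__ == "__main__":
    p2p_matrix()
```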

NCCL and distributed settings

  • torchrun with NCCL backend is the baseline.
  • Tune gradient bucket sizes to overlap compute and all‑reduce effectively.
  • NCCL tuning levers to try and document: number of channels, tree vs ring algorithm selection.
  • Use matched GPUs per node; avoid mixing generations to reduce tail latency in collectives.
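The sketch below shows where these knobs live in a torchrun/DDP setup; the specific NCCL values are illustrative starting points to sweep and document, not recommendations:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# NCCL knobs to sweep and document (values here are illustrative, not recommendations).
os.environ.setdefault("NCCL_DEBUG", "INFO")       # log topology, rings/trees, P2P usage
os.environ.setdefault("NCCL_ALGO", "Ring")        # or "Tree"; record whichever you test
os.environ.setdefault("NCCL_MIN_NCHANNELS", "4")  # channel-count lever

def setup_ddp(model: torch.nn.Module) -> DDP:
    """Wrap a model for data parallelism with an explicit gradient bucket size."""
    dist.init_process_group(backend="nccl")       # torchrun supplies rank/world-size env vars
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    model = model.cuda(local_rank)
    # bucket_cap_mb controls gradient bucketing and thus compute/all-reduce overlap.
    return DDP(model, device_ids=[local_rank], bucket_cap_mb=50)
```

Launch with something like `torchrun --nproc_per_node=<N> train.py`; torchrun supplies the rank, world-size, and LOCAL_RANK environment variables the sketch relies on.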

Scaling targets

  • Aim for ≥0.9 data‑parallel efficiency on RTX 5090, RTX PRO 5000, or RTX 6000 Ada platforms with AMP and compile mode.
  • Record both weak and strong scaling curves; include PCIe link speed/width and P2P status in your report.
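For the report itself, data-parallel efficiency is simply the measured multi-GPU throughput divided by N times the single-GPU baseline; a trivial helper makes the definition explicit:

```python
def scaling_efficiency(throughput_n: float, throughput_1: float, n_gpus: int) -> float:
    """E.g., 8 GPUs delivering 7.4x the single-GPU throughput -> 0.925 efficiency."""
    return throughput_n / (n_gpus * throughput_1)

print(scaling_efficiency(throughput_n=7.4, throughput_1=1.0, n_gpus=8))  # 0.925
```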

Results checklist and reproducibility dossier for publication

Make your dossier exhaustive so peers can repeat the run end-to-end.

  • Platform and environment
  • GPU: exact SKU and memory variant; cooler type.
  • Host CPU(s), memory configuration, PCIe slots/lanes per GPU, storage.
  • OS, driver version, CUDA/cuDNN/NCCL versions, PyTorch version and build/commit.
  • NVIDIA AI Enterprise version if applicable.
  • Workload configuration
  • Model and dataset versions; tokenizer where relevant.
  • Precision mode (bf16/fp16/fp8), attention kernel (FlashAttention‑2), compile mode status.
  • Batch size, gradient accumulation, gradient checkpointing, optimizer and LR schedule.
  • Sharding approach (ZeRO/FSDP) and parameters.
  • Dataloader workers, pinned memory, NUMA settings.
  • For SDXL: exact augmentations and schedulers.
  • Topology and distributed
  • PCIe link speed/width per GPU, P2P status, switch placement (nvidia‑smi topology output summarized).
  • torchrun launch parameters at a high level (no secrets), world size, NCCL tuning (channels, tree/ring).
  • Measurement methodology
  • Warm‑up duration (10–20 minutes) and criteria for steady state.
  • Throughput: tokens/s, images/s, samples/s; steps/s.
  • Time‑to‑target accuracy or loss and the exact target values.
  • Power: GPU‑only telemetry and wall power, both in steady state.
  • Peak VRAM and typical VRAM during steady state.
  • Sanity signals
  • CV: verify the RTX 5090 uplift trend on transformer‑heavy models compared to previous‑gen.
  • LLM: memory‑bound phases should benefit from bandwidth; 32 GB vs 48/72 GB capacity differences reflected in global batches.
  • Perf/$ and perf/W
  • Report tokens/s or images/s per dollar based on actual invoiced GPU cost (not MSRP).
  • Include perf/W normalized to GPU‑only power in steady state.

If a parameter is unknown or not applicable, say so explicitly. Ambiguity is the enemy of reproducibility.
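To keep the dossier machine-readable, the checklist can be captured as a structured template that every run fills in; the field names below are illustrative, and "unknown" or "n/a" are valid, explicit values:

```python
import json

DOSSIER_TEMPLATE = {
    "platform": {"gpu_sku": "", "cooler": "", "cpu": "", "pcie_lanes_per_gpu": "", "os": "",
                 "driver": "", "cuda": "", "cudnn": "", "nccl": "", "pytorch": ""},
    "workload": {"model": "", "dataset": "", "precision": "", "attention_kernel": "",
                 "compile_mode": "", "global_batch": "", "grad_accum": "", "sharding": ""},
    "topology": {"pcie_link": "", "p2p": "", "world_size": "", "nccl_tuning": ""},
    "measurement": {"warmup_minutes": "", "throughput": "", "time_to_target": "",
                    "gpu_power_w": "", "wall_power_w": "", "peak_vram_gb": ""},
}

def new_dossier(**overrides):
    """Start from the template; any field left blank should be set to 'unknown' or 'n/a' explicitly."""
    dossier = json.loads(json.dumps(DOSSIER_TEMPLATE))  # deep copy via JSON round-trip
    for section, fields in overrides.items():
        dossier.setdefault(section, {}).update(fields)
    return dossier

print(json.dumps(new_dossier(platform={"gpu_sku": "RTX 5090 32GB"}), indent=2))
```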

Conclusion

Training on RTX 5090, 5080, and RTX PRO 5000 can be fast and defensible—if the environment and methodology are disciplined. A coherent PyTorch 2.6/CUDA 12.8/cuDNN 9/NCCL 2.19+ stack, bf16 autocast, and FlashAttention‑2 set the foundation for robust transformer runs at modern contexts. Steady‑state thermals and power logging eliminate boost artifacts. timm workflows with AMP and compile mode provide a transparent vision baseline that already shows the generational uplift Blackwell delivers. On PCIe multi‑GPU, torchrun with careful NCCL tuning and topology validation produces near‑linear scaling on the right platforms. Finally, a thorough reproducibility dossier ensures your tokens/s and images/s can be replicated—not just admired.

Key takeaways:

  • Use Blackwell‑ready builds (PyTorch 2.6 + CUDA 12.8 + cuDNN 9 + NCCL ≥2.19) and enable bf16 + FlashAttention‑2 for transformers.
  • Warm up 10–20 minutes and log GPU‑only plus wall power in steady state to avoid boost artifacts.
  • For LLMs, pick batch/context based on VRAM: 32 GB favors LoRA/QLoRA or carefully sharded 7B; 48–72 GB enables larger 13B batches.
  • Expect strong vision training uplifts on RTX 5090 and aim for ≥0.9 multi‑GPU efficiency on Blackwell and workstation Ada platforms with AMP and compile mode.
  • Publish a complete dossier (stack, topology, configs, telemetry) and include perf/$ derived from actual invoiced costs.

Next steps:

  • Lock your software matrix and publish it with your repo.
  • Run the three templates (transformer, vision, SDXL) with steady‑state logging and identical configs across GPUs.
  • Validate PCIe P2P and NCCL behavior, then sweep bucket sizes and algorithms.
  • Share raw logs and a reproducibility checklist alongside your plotted results. 🚀

With a strict setup and measurement ritual, the tokens/s and images/s you report on RTX 5090, 5080, and RTX PRO 5000 will be numbers others can actually reproduce.

Sources & References

  • Benchmarking NVIDIA RTX 5090 (Computer Vision Lab), nikolasent.github.io. Provides methodologically explicit CV training results showing ~44% average throughput uplift of RTX 5090 vs RTX 4090 across timm models, informing expectations and validation checks.
  • NVIDIA RTX Blackwell GPU Architecture (Official brief), images.nvidia.com. Confirms Blackwell architectural capabilities (5th‑gen Tensor Cores, FP8/FP4 support) and GDDR7 characteristics relevant to precision choices and memory bandwidth.
  • GeForce RTX 5090 Graphics Cards (Official Product Page), www.nvidia.com. Documents 5090 key specs (32 GB GDDR7, 512‑bit, 1.792 TB/s bandwidth, PCIe Gen 5, TGP) used in the playbook’s capacity and bandwidth guidance.
  • GeForce RTX 5080 Graphics Cards (Official Product Page), www.nvidia.com. Provides 5080 specs (16 GB GDDR7, 960 GB/s, PCIe Gen 5, TGP) to frame memory/bandwidth constraints and best‑fit training uses.
  • NVIDIA RTX PRO 5000 (Blackwell), Product Page, www.nvidia.com. Establishes RTX PRO 5000 capabilities (48/72 GB GDDR7 ECC, fifth‑gen Tensor Cores, second‑gen Transformer Engine, PCIe Gen 5, no NVLink) central to workstation guidance.
  • NVIDIA RTX PRO 5000 (Blackwell), Datasheet, www.nvidia.com. Lists bandwidth (~1.344 TB/s for the 48 GB model) and power that inform perf/W and batch‑size recommendations.
  • NVIDIA RTX 6000 Ada Generation (Product Page), www.nvidia.com. Provides a 48 GB Ada workstation baseline for comparison and scaling expectations in PCIe environments.
  • Deep Learning GPU Benchmarks (AIME), www.aime.info. Offers independent multi‑GPU PCIe scaling efficiencies and methodology cues (AMP, compile mode) that shape the ≥0.9 scaling targets in this playbook.
  • NVIDIA RTX 5090 PCIe Scaling for Local LLM and AI (Moby Motion), www.youtube.com. Shows PCIe link speed sensitivity patterns and single‑GPU parity between Gen 5 and Gen 4 across many tasks, informing topology guidance.
  • NVIDIA AI Enterprise Release Notes v3.3, docs.nvidia.com. Anchors enterprise stack support for Ada/Hopper-class professional SKUs referenced in environment setup recommendations.
