Grace Hopper Unified Memory Delivers 900 GB/s Coherent CPU–GPU Bandwidth
Why NVLink/NVSwitch platforms change the rules of memory, locality, and scaling compared with classic PCIe-attached GPU systems
The CPU–GPU boundary used to be a chasm. Data lived in host DRAM, models and activations lived in GPU HBM, and everything that crossed the gap paid the toll of a PCIe link—tens of gigabytes per second and microsecond-scale latencies that punished irregular access and fine-grained sharing. With Grace Hopper and the broader NVLink/NVSwitch ecosystem, that boundary compresses into a high-bandwidth, coherent bridge. Up to roughly 900 GB/s of coherent CPU–GPU bandwidth flows across NVLink-C2C inside a Grace Hopper superchip—more than an order of magnitude beyond a PCIe 5.0 x16 link per direction—reframing how developers think about memory placement, NUMA, and scaling.
This isn’t “one memory to rule them all.” It’s a practical evolution: unified address spaces, coherent CPU–GPU load/store within a package, and fabric-level access across GPUs that’s fast enough to make remote reads a first-class option. The physics still reward locality, but the penalties for going nonlocal are smaller and more predictable—especially when you understand the mechanics of CUDA Unified Memory.
Two architectures, one goal: keep data close to compute
The classic discrete model is simple to describe and notoriously tedious to optimize: CPUs attach DDR or LPDDR, GPUs integrate on-package HBM, and PCIe links the two sides. Hopper-class GPUs deliver on the order of 3 TB/s of HBM3 bandwidth per device, with 80–96 GB of capacity. Host memory offers far larger capacity but lower bandwidth; for example, Grace-class LPDDR5X sustains up to roughly 500 GB/s aggregate. There is no hardware cache coherence between CPU and GPU in this model—each side keeps its own caches, and data moves via explicit DMA (cudaMemcpy) or page-granular migrations when using Unified Memory.
Unified-memory platforms reconfigure the picture. In a Grace Hopper superchip, NVLink-C2C ties the CPU complex and GPU together with a coherent protocol and very high bandwidth—up to about 900 GB/s for CPU–GPU traffic—with latency far lower than PCIe. That enables fine-grained load/store sharing between CPU and GPU without explicit copies. At node scale, NVLink and NVSwitch interconnect GPUs so that each device can access peer HBM at high bandwidth and relatively uniform latency, with support for remote loads/stores and atomics. Memory remains physically attached to each GPU, but software can treat it as a global address space with NUMA characteristics rather than a hard boundary.
In both worlds, the overarching goal is unchanged: keep hot data close to the compute engines that use it most. The difference is how punishing it is when you can’t.
Memory and coherence: from hard boundaries to NUMA tiers
Classic PCIe-attached systems draw a hard line between CPU RAM and each GPU’s HBM. No hardware CPU–GPU cache coherence exists; drivers and runtimes must flush, copy, or migrate data explicitly, and stale cache lines are the developer’s problem if they get the ordering wrong. NUMA effects stack up quickly: multi-socket hosts impose their own tiers, and PCIe topologies add asymmetries through switches and root complexes.
Grace Hopper softens the line inside the package. The Grace CPU and Hopper GPU share coherent access across NVLink-C2C. Page tables are coordinated so that when the CPU touches GPU-resident data—or the GPU touches CPU-resident data—the access has coherent load/store semantics. That opens the door to pointer-rich structures shared across CPU/GPU phases without copy choreography.
Across GPUs, coherence stops at the L2 boundary. NVLink and NVSwitch don’t make every GPU’s cache coherent with every other. Instead, the fabric offers high-bandwidth remote access and atomics, while CUDA’s Unified Memory (and libraries like NVSHMEM) orchestrate migration and mapping. The result is a NUMA hierarchy rather than a single uniform pool:
- Local GPU HBM remains the highest-bandwidth, lowest-latency tier for kernels.
- Local CPU DDR/LPDDR is the natural tier for host code.
- CPU–GPU access over NVLink-C2C is slower than local access, but dramatically faster and lower-latency than PCIe.
- Peer GPU HBM over NVSwitch is reachable at high bandwidth with near-uniform latency, yet still slower than local HBM.
Treating these tiers explicitly—rather than assuming “unified” means “uniform”—is the key to consistent performance.
Interconnects and topology: PCIe ceilings vs NVLink fabrics
PCIe sets the baseline. A PCIe Gen5 x16 link sustains roughly 63 GB/s per direction (Gen4 ~31.5 GB/s; Gen6 ~128 GB/s). Initiating a DMA transfer costs microseconds of latency, and switch hops add both latency and contention. Multi-GPU servers often hang two to eight GPUs off one or more PCIe switches per socket; under load, it’s easy to saturate uplinks to the root complex, and traffic that crosses CPU sockets pays additional inter-socket penalties.
NVLink and NVSwitch push well beyond those ceilings. Hopper-generation NVLink exposes as much as ~900 GB/s of aggregate GPU–GPU bandwidth per H100, with lower latency than PCIe. Within a node, NVSwitch builds a fully connected, high-bisection fabric so any GPU can reach any other at near-uniform latency and full per-GPU bandwidth to the switch. CPU–GPU coupling inside Grace Hopper uses NVLink-C2C to deliver up to roughly 900 GB/s of coherent bandwidth—fast enough to make fine-grained sharing practical.
Scale these ideas out, and the NVLink Switch System stitches many Grace Hopper nodes into one logical domain. DGX GH200-class systems combine up to 256 superchips, exposing a massive address space—on the order of 144 TB of aggregate memory—that software can traverse with NUMA-aware placement.
CXL adds coherence to the PCIe physical layer and is rapidly maturing for CPU-centric memory expansion and accelerator attach. In today’s unified GPU memory platforms, however, NVLink/NVSwitch and NVLink-C2C carry the weight for high-performance GPU coherence and fabric-level memory access.
Unified Memory in practice: page migration, mapping, and control
CUDA Unified Memory (UVM) ties the programming model together. A single cudaMallocManaged allocation yields one pointer that both CPU and GPU can dereference. Under the hood, UVM manages pages, not bytes or structs. When a processor accesses a page that isn’t resident locally, the access raises a page fault:
- The driver can migrate the page to the accessing processor.
- Or it can establish a remote mapping so the access is serviced across the interconnect.
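In code, the single-pointer model is as simple as it sounds; a minimal sketch with error handling elided, assuming a CUDA-capable device:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* data, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= s;
}

int main() {
    const int n = 1 << 20;
    float* data = nullptr;
    // One allocation, one pointer, dereferenceable on both CPU and GPU.
    cudaMallocManaged(&data, n * sizeof(float));

    for (int i = 0; i < n; ++i) data[i] = 1.0f;     // CPU touch: pages resident on host

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f); // GPU touch: pages fault/migrate to HBM
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);              // CPU touch again: pages fault back
    cudaFree(data);
}
```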
Granularity matters. On Linux, UVM operates with base 4 KB pages. Frequent faults on small pages can stall kernels, especially on PCIe systems where servicing a fault costs microseconds and the link caps practical throughput. That’s where policy comes in:
- cudaMemAdviseSetPreferredLocation pins frequently accessed regions near a given GPU.
- cudaMemAdviseSetAccessedBy primes page tables for a device without migrating data.
- cudaMemAdviseSetReadMostly allows duplication for shared read access across processors.
- cudaMemPrefetchAsync turns demand paging into proactive movement before kernels launch.
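Applied to a managed buffer, the advice and prefetch APIs compose naturally; a sketch for a read-mostly, GPU-hot access pattern (the target device and stream are the caller’s assumptions):

```cpp
#include <cuda_runtime.h>

// Sketch: prepare a managed region for a read-mostly, GPU-hot access pattern.
void tune_managed(float* buf, size_t bytes, int gpu, cudaStream_t stream) {
    // Keep the hot region resident on the target GPU.
    cudaMemAdvise(buf, bytes, cudaMemAdviseSetPreferredLocation, gpu);
    // Allow read-only duplication instead of bouncing pages between processors.
    cudaMemAdvise(buf, bytes, cudaMemAdviseSetReadMostly, gpu);
    // Prime the GPU's page tables up front, without migrating data.
    cudaMemAdvise(buf, bytes, cudaMemAdviseSetAccessedBy, gpu);
    // Move pages proactively so kernels launch without in-kernel faults.
    cudaMemPrefetchAsync(buf, bytes, gpu, stream);
}
```

Calling this before the first kernel launch converts demand paging into one batched, asynchronous transfer on the given stream.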
On NVLink-enabled platforms, the driver increasingly prefers remote mappings for read-mostly sharing to avoid ping-pong migrations, and when migrations are necessary, NVLink offers far higher bandwidth and lower latency than PCIe. Within Grace Hopper, coherent load/store semantics tighten CPU–GPU interactions even further: pointer-heavy data structures and mixed CPU–GPU phases behave more like a unified process than an accelerator bolted onto a host.
Oversubscription is the other half of the story. Managed allocations can exceed HBM capacity. Cold pages spill to CPU memory and migrate on demand. On PCIe, practical throughput is then bounded by the PCIe link; batched prefetching and streaming help, but random fine-grained access remains painful. On NVLink and especially Grace Hopper, oversubscription becomes more forgiving—remote access is markedly faster—but it still trails local HBM and benefits from careful prefetch, data layout, and batching.
Pinned and zero-copy host memory sits alongside UVM. Pinned buffers allow the GPU to DMA directly, avoiding paging overhead. Zero-copy lets the GPU access host memory without staging. On PCIe, zero-copy is niche; on NVLink/Grace Hopper, it can be substantially faster and viable for streaming-style operators, though still not a substitute for local HBM.
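A zero-copy allocation looks like this; a sketch assuming a UVA platform, where the mapped host pointer is directly usable in kernels:

```cpp
#include <cuda_runtime.h>

// Sketch: pinned host memory that the GPU can access directly (zero-copy).
float* alloc_zero_copy(size_t n) {
    float* host = nullptr;
    // cudaHostAllocMapped pins the buffer and maps it into the GPU address space.
    cudaHostAlloc(&host, n * sizeof(float), cudaHostAllocMapped);
    return host;  // on UVA platforms, kernels can dereference this pointer
}
```

Every GPU access then traverses the interconnect, which is why the pattern suits streaming reads on NVLink-class links far better than random access over PCIe.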
Performance trade-offs: locality still wins, but penalties shrink
Locality remains the dominant lever. Local HBM delivers multi-terabyte-per-second bandwidth and low latency; keeping hot data there dwarfs most other optimizations. CPU-side DDR/LPDDR offers lower bandwidth but ample capacity and responsiveness for host-resident phases.
Interconnects set the cost of going nonlocal. Relative to PCIe Gen5’s ~63 GB/s per direction, NVLink and NVLink-C2C deliver up to an order-of-magnitude more bandwidth with lower latency. That shrinks—but does not eliminate—the penalty of accessing remote memory. The practical outcomes:
- Fine-grained CPU–GPU sharing becomes reasonable on Grace Hopper, where coherent access avoids heavy copy orchestration.
- Multi-GPU read sharing thrives on NVSwitch, where remote mapping can keep hot pages resident on several GPUs.
- Write-sharing across processors remains expensive; avoid frequent writes to the same page from CPU and GPU.
Unified Memory behavior is often decisive. On PCIe systems, tiny 4 KB faults mid-kernel can snowball into fault storms that collapse throughput. Setting preferred locations, marking read-mostly regions, and prefetching large contiguous chunks before kernel launches transforms unpredictable demand paging into predictable transfers. On NVLink systems, the same discipline further reduces in-kernel faults, and read-sharing over remote mappings prevents thrash.
Explicit memory copies still have their place. When working sets fit comfortably in HBM and dataflows are regular, cudaMemcpy with stream pipelining offers deterministic overlap of transfers and compute. With UVM, surprise page faults can disrupt planned overlaps unless you prefetch aggressively. Tools like Nsight Systems/Compute close the loop by showing page faults, migrations, and residency so you can verify that hot kernels run without faulting.
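The stream pipelining mentioned above is the classic double-buffered pattern; a sketch with a placeholder kernel, assuming `host` is pinned and `dev[2]` are preallocated device chunk buffers:

```cpp
#include <cuda_runtime.h>

__global__ void process(float* chunk, int n);  // placeholder kernel

// Sketch: overlap H2D copies with compute using two streams and two buffers.
void pipeline(const float* host, float* dev[2], int chunks, int chunk_elems) {
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

    size_t bytes = chunk_elems * sizeof(float);
    for (int c = 0; c < chunks; ++c) {
        int b = c & 1;  // alternate buffers and streams
        // Copy chunk c while the previous chunk's kernel runs on the other stream.
        cudaMemcpyAsync(dev[b], host + (size_t)c * chunk_elems, bytes,
                        cudaMemcpyHostToDevice, s[b]);
        process<<<(chunk_elems + 255) / 256, 256, 0, s[b]>>>(dev[b], chunk_elems);
    }
    for (int i = 0; i < 2; ++i) cudaStreamSynchronize(s[i]);
    for (int i = 0; i < 2; ++i) cudaStreamDestroy(s[i]);
}
```

Because every transfer is issued explicitly, the overlap is deterministic; there are no surprise faults for a profiler to reveal.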
Capacity and pooling: from per-GPU caps to system-scale address spaces
The split-memory model caps usable capacity at each GPU’s HBM. If models or datasets exceed that cap, you shard across GPUs, tile explicitly, or design out-of-core algorithms to move data in and out of HBM. Unified Memory loosens the constraint. A Grace Hopper superchip exposes a single address space spanning HBM and large LPDDR; allocating far beyond HBM is straightforward, and coherent access over NVLink-C2C lets the GPU reach CPU-resident pages with more grace than PCIe-attached systems.
Across GPUs, NVSwitch and software provide a pooled view. CUDA peer access and UVM enable mapping and migrating pages across devices, while NVSHMEM adds PGAS-style one-sided operations that align with analytics and simulation codes. At platform scale, NVLink Switch Systems turn many Grace Hopper superchips into one logical memory domain; in GH200-class configurations, software can target hundreds of terabytes of addressable memory with NUMA-aware placement.
Storage spill rounds out the hierarchy in analytics stacks: frameworks use memory pools in HBM, spill to pinned host memory, then to NVMe via GPUDirect Storage to avoid host-bounce overhead. Compression happens at the application layer, not as a transparent GPU memory feature.
System factors, security, and deployment realities
High-end unified-memory platforms carry data-center-grade reliability and isolation. HBM and LPDDR employ ECC, and Hopper-generation devices add robust RAS features like page retirement and error containment, with telemetry accessible to tooling. NVLink and NVSwitch incorporate end-to-end protection to keep data intact across the fabric.
Isolation and multi-tenancy rely on established and emerging mechanisms:
- IOMMUs constrain device DMA.
- Multi-Instance GPU (MIG) partitions a physical GPU into isolated instances with dedicated memory and compute slices.
- SR-IOV and mediated pass-through enable virtual GPU sharing.
- Hopper introduces confidential computing features—hardware-enforced TEEs, memory encryption, and attestation—so sensitive models and data can run in shared infrastructure with stronger guarantees.
Containerized deployment is straightforward with standard toolchains; performance-sensitive setups typically use PCIe pass-through with MIG or exclusive devices and rely on topology discovery to exploit NVLink/NVSwitch effectively.
Practical guidance: when to choose explicit copies vs unified memory
Choose explicit copies (cudaMemcpy/streams) when:
- The working set fits in HBM and the dataflow is regular enough to batch large transfers.
- You need deterministic overlap of H2D/D2H transfers with compute.
- Kernels are tight and latency-sensitive, and you can enforce tiling/placement.
Lean on Unified Memory when:
- You traverse complex, pointer-rich data structures shared across CPU and GPU phases.
- You prototype rapidly and want correctness with minimal copy orchestration.
- You operate on NVLink/NVSwitch systems where remote mappings for read-mostly data can avoid duplication and thrashing.
- You need functional correctness with oversubscription, and you’re prepared to prefetch/batch access to mitigate stalls.
Apply UVM controls to make it predictable:
- Set preferred locations for hot regions to the target GPU.
- Mark read-mostly regions so they can be safely duplicated.
- Use SetAccessedBy to prime page tables for peer GPUs without migrating.
- Prefetch ahead of kernel launches to eliminate in-kernel faults.
- Avoid frequent write-sharing on the same pages across CPU and GPU.
A quick side-by-side
| Dimension | Classic discrete (PCIe, separate CPU RAM + GPU HBM) | Unified-memory (Grace Hopper, NVLink/NVSwitch, CUDA UVM) |
|---|---|---|
| Memory and bandwidth | CPU DDR/LPDDR; GPU HBM3 at up to ~3 TB/s locally; no CPU–GPU coherence | HBM3 + large LPDDR5X; coherent CPU–GPU via NVLink-C2C up to ~900 GB/s; pooled view across GPUs via NVLink/NVSwitch |
| Coherence and NUMA | No CPU–GPU hardware coherence; strong NUMA boundaries | Coherent CPU–GPU within GH; NUMA tiers across HBM, LPDDR, and peers; NVSwitch yields uniform high-bandwidth peer access |
| Interconnect | PCIe Gen4/5/6 x16 ≈ 31.5/63/128 GB/s per direction; microsecond latencies; switch contention | NVLink GPU–GPU up to ~900 GB/s aggregate per GPU with lower latency; NVLink-C2C ~900 GB/s coherent CPU–GPU; NVSwitch full-bisection fabrics |
| UVM mechanics | Demand paging over PCIe; 4 KB faults expensive; prefetch/advice critical; oversubscription bounded by PCIe | Demand paging and remote mapping over NVLink; higher fault-service bandwidth and lower latency; coherent fine-grained sharing on GH; oversubscription more viable but still slower than local HBM |
| Programming model | Explicit cudaMemcpy and stream pipelining; maximal predictability | cudaMallocManaged with prefetch/advice; simpler code; performance hinges on residency and locality |
Bottom line: simplify the boundary, not the physics 🚀
Grace Hopper and NVLink/NVSwitch don’t abolish locality; they make crossing boundaries less costly and more controllable. Coherent CPU–GPU bandwidth up to roughly 900 GB/s inside the superchip changes what’s practical for mixed CPU–GPU algorithms. NVSwitch and NVLink expand the idea across GPUs, turning remote reads and atomics into viable, high-throughput primitives rather than last resorts. CUDA Unified Memory ties it together with a single-pointer model that, with the right advice and prefetch, turns fault-prone demand paging into predictable data motion.
The winning strategy is clear:
- Put the hottest working sets in local HBM.
- Use NVLink/NVSwitch to share and scale without drowning in copies.
- Apply UVM policies to keep kernels free of page faults.
- Treat memory as a tiered NUMA landscape, not a flat ocean.
In short, simplify the software boundary—unify the address space and coherence where it counts—while respecting the physics of bandwidth and latency. Do that, and the move from PCIe-attached silos to NVLink/NVSwitch fabrics becomes less about rewriting code and more about unblocking performance you couldn’t touch before.