PetaLinux NVMe Breaks Through 7 GB/s on the AMD ZCU106 Evaluation Kit with Design Gateway’s Solution

By Design Gateway Co., Ltd.

Most Zynq™ UltraScale+™ MPSoC designs running embedded Linux achieve only about 2 GB/s of throughput when using the standard NVMe driver with the PCIe Gen3 hard block. This article introduces the world’s first NVMe PetaLinux solution that operates at PCIe Gen4 speed without relying on the FPGA’s PCIe hard block. The breakthrough comes from combining Design Gateway’s NVMe IP core with a tailor-made device driver. On the AMD ZCU106 evaluation kit, the article demonstrates approximately 7.5 GB/s read and 6.9 GB/s write throughput with a mainstream NVMe Gen4 SSD, and shows how Design Gateway’s technology unlocks the full potential of high-speed NVMe on embedded Linux platforms.

Introduction to the Zynq™ UltraScale+™ MPSoC ZCU106 evaluation kit

The AMD ZCU106 evaluation kit is built on the Zynq™ UltraScale+™ MPSoC platform, integrating quad-core Arm Cortex®-A53 processors with high-speed programmable logic. This powerful combination allows engineers to run full operating systems such as PetaLinux, enabling software control, high-speed I/O management, and FPGA accelerator operation within a single environment.

Figure 1: AMD Zynq™ UltraScale+™ EV. (Image source: Advanced Micro Devices, Inc.)

Figure 2: Linux (PetaLinux) running on AMD Zynq™ UltraScale+™ MPSoC, combining the flexibility of software with FPGA hardware acceleration. (Image source: AMD)

To fully unlock the performance potential of PCIe Gen4 NVMe SSDs, however, developers must also understand how the traditional Linux NVMe stack behaves on embedded SoCs. Even with the ZCU106’s powerful heterogeneous architecture, PetaLinux systems can encounter throughput and efficiency limitations, a challenge explored in the following section.

Concept primer: why NVMe on PetaLinux often bottlenecks

While PetaLinux provides a powerful software layer for controlling hardware and managing data I/O, its conventional NVMe implementation is not optimized for sustained high throughput. The interaction between the Linux kernel and the Arm-based processing system introduces several sources of inefficiency that limit bandwidth utilization even when PCIe Gen4 resources are available:

  • ⚙️ Kernel stack overhead: The standard NVMe driver runs entirely inside the Linux kernel, which involves multiple context switches, interrupt handling, buffer copies, and cache maintenance. These software-driven operations limit IOPS and prevent full bandwidth utilization.
  • 📋 Scheduling and queue depth limitations: Default NVMe configurations often use shallow queues and small I/O block sizes. Combined with kernel scheduling overhead, CPU utilization saturates before the link bandwidth does.
  • 💾 CPU and memory subsystem constraints: On embedded SoCs like the Zynq UltraScale+ MPSoC, DDR bandwidth and cache coherency traffic between the PS and PL can become the real performance ceiling.
  • Power and IRQ management: Systems configured with on-demand CPU governors or unbalanced interrupt affinities may experience reduced performance under heavy I/O workloads.
  • 📊 Real-world impact: Even well-tuned systems using the conventional NVMe driver rarely exceed ~1.5–2.5 GB/s. That is only about 40–60% of the roughly 4 GB/s that a PCIe Gen3 x4 link can deliver.
  • 🧩 Unavailable PCIe Hard Block for Gen4: Although certain SoC devices feature transceivers that support the PCIe Gen4 signaling rate (16 GT/s per lane), their built-in PCIe Hard Block remains limited to Gen3 operation. This architectural gap keeps the system from using Gen4 bandwidth, constraining data-intensive applications that demand sustained multi-gigabyte throughput on cost-optimized FPGA platforms.
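
The bandwidth ceilings quoted in this list follow from simple link arithmetic. The sketch below is a rough illustration, assuming the 128b/130b line encoding used by PCIe Gen3 and Gen4 and ignoring packet and protocol overhead, so achievable payload rates are somewhat lower. The function name is illustrative and is not part of any Design Gateway or AMD software.

```python
def pcie_raw_bandwidth_gbytes(gt_per_s: float, lanes: int) -> float:
    """Raw link bandwidth in GB/s after 128b/130b encoding (no packet overhead)."""
    encoding_efficiency = 128 / 130
    return gt_per_s * lanes * encoding_efficiency / 8  # 8 bits per byte


gen3_x4 = pcie_raw_bandwidth_gbytes(8.0, 4)    # Gen3 signals at 8 GT/s per lane
gen4_x4 = pcie_raw_bandwidth_gbytes(16.0, 4)   # Gen4 signals at 16 GT/s per lane

print(f"Gen3 x4: {gen3_x4:.2f} GB/s")
print(f"Gen4 x4: {gen4_x4:.2f} GB/s")

# How much of a Gen3 x4 link the conventional driver typically uses.
for measured in (1.5, 2.5):
    print(f"{measured} GB/s is {measured / gen3_x4:.0%} of Gen3 x4")
```

Under these assumptions a Gen3 x4 link tops out just under 4 GB/s and a Gen4 x4 link just under 8 GB/s, which is why the Gen3 hard block caps throughput regardless of how fast the SSD is.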

Breaking through the limit

To overcome this performance limitation, developers typically move away from the kernel-managed storage stack toward user-space or hardware-accelerated I/O. There are two mainstream approaches:

  • SPDK/DPDK frameworks: Use poll-mode drivers in user space, eliminating kernel context switches and interrupts. However, busy polling keeps one or more CPU cores fully occupied, which is costly on a quad-core Cortex-A53 system.
  • FPGA offload (e.g., NVMe-IP + DMA engine): Moves command processing, queuing, and data transfer into programmable logic, achieving near-wire-speed throughput with deterministic hardware-level performance.
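
To make the poll-mode idea concrete, here is a simplified model of how a host can detect NVMe completions without interrupts by checking the phase tag that the device flips on each pass through the completion ring. It is a toy simulation, not SPDK's API or Design Gateway's implementation, and the class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CompletionEntry:
    command_id: int = 0
    phase: int = 0  # flipped by the device each time it wraps the ring


class CompletionQueue:
    def __init__(self, depth: int):
        self.depth = depth
        self.entries = [CompletionEntry() for _ in range(depth)]
        self.head = 0             # next slot the host inspects
        self.expected_phase = 1   # phase value that marks a fresh entry
        self._tail = 0            # device write position (simulation only)
        self._device_phase = 1

    def device_post(self, command_id: int) -> None:
        """Simulate the SSD writing one completion entry."""
        self.entries[self._tail] = CompletionEntry(command_id, self._device_phase)
        self._tail += 1
        if self._tail == self.depth:
            self._tail = 0
            self._device_phase ^= 1

    def poll(self) -> List[int]:
        """Host side: drain every fresh entry without waiting for an interrupt."""
        completed = []
        while self.entries[self.head].phase == self.expected_phase:
            completed.append(self.entries[self.head].command_id)
            self.head += 1
            if self.head == self.depth:
                self.head = 0
                self.expected_phase ^= 1
        return completed


cq = CompletionQueue(depth=4)
for cid in (1, 2, 3):
    cq.device_post(cid)
print(cq.poll())
```

A real poll-mode driver calls something like `poll()` in a tight loop, which is where the CPU cost comes from. An FPGA offload moves this bookkeeping into logic instead.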

Solution architecture

Figure 3: Comparison of NVMe solutions on PetaLinux using Zynq UltraScale+. (Image source: Design Gateway)

Design Gateway’s DMA PetaLinux Solution replaces the traditional PCIe Hard IP and NVMe driver with a Soft NVMeG4-IP Core and a custom DG NVMe Driver.

This hardware-offloaded architecture runs PCIe Gen4 entirely through FPGA transceivers, achieving 7 GB/s throughput on Zynq UltraScale+ platforms. By combining NVMeG4-IP and dual AXI DMA channels under a unified DG driver, the system takes the CPU out of the data path, enabling full Gen4 x4 performance on PetaLinux.

Key features

  • NVMe Gen4 Soft IP in PL: a complete hardware-offloaded NVMe solution integrating a PCIe Gen4 Soft IP Core, eliminating the need for a PCIe Hard Block and using the FPGA transceivers to their maximum potential.
  • Dual DMA channels that double DMA bandwidth to 8 GB/s, enough to sustain PCIe Gen4 x4 speed.
  • Custom PetaLinux driver with clean control and monitoring interfaces, optimized to eliminate bottlenecks in software-hardware data movement.
  • AXI-compatible interfaces for easy integration within PL data pipelines.
  • Complete demo package with source code, scripts, documentation, and quick bring-up instructions.
  • Portable design adaptable to any AMD FPGA device that supports embedded Linux.
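
The reason for two DMA channels can be shown with a quick calculation. The sketch below assumes each DMA channel uses a 128-bit AXI data path clocked at 250 MHz. These values are assumptions chosen only because they reproduce the 8 GB/s aggregate quoted above, and they are not taken from the reference design documentation.

```python
def axi_bandwidth_gbytes(data_width_bits: int, clock_mhz: float) -> float:
    """Peak bandwidth of one AXI data channel moving one beat per clock."""
    return data_width_bits / 8 * clock_mhz * 1e6 / 1e9


# Assumed values for illustration; check the reference design for the real ones.
DATA_WIDTH_BITS = 128
CLOCK_MHZ = 250.0

single = axi_bandwidth_gbytes(DATA_WIDTH_BITS, CLOCK_MHZ)
dual = 2 * single
gen4_x4 = 16.0 * 4 * (128 / 130) / 8  # raw PCIe Gen4 x4 rate in GB/s

print(f"one DMA channel : {single:.1f} GB/s")
print(f"two DMA channels: {dual:.1f} GB/s")
print(f"PCIe Gen4 x4    : {gen4_x4:.2f} GB/s")
print("single channel covers the link:", single >= gen4_x4)
print("dual channels cover the link  :", dual >= gen4_x4)
```

Under these assumptions, one channel would cap throughput at about half the Gen4 x4 link rate, while two channels together slightly exceed it.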

Implementation and Performance Results on ZCU106

Figure 4 shows the overview of the reference design based on the AMD ZCU106 Evaluation Kit (XCZU7EV). The system integrates Design Gateway’s NVMe Gen4 Soft IP with a dual-DMA architecture and a custom PetaLinux driver, enabling high-speed access between the NVMe Gen4 SSD and the PetaLinux OS.

For more details, refer to the NVMeG4-IP with DMA on PetaLinux reference design document on Design Gateway’s website.

Figure 4: Reference design overview. (Image source: Design Gateway)

The demo system is designed to write and verify data with the NVMe SSD on the ZCU106. Test execution is controlled via a serial console on PetaLinux using the DG NVMe Application. This application transfers data between host memory and the NVMe SSD through dual DMA channels for high-speed operation. The CPU is responsible only for setup and monitoring, while all data movement is handled in hardware.
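
The article does not describe how the driver divides work between the two DMA channels. One plausible scheme, shown here purely as an illustration, is to cut the buffer into fixed-size chunks and assign them to the channels round-robin. The `DmaDescriptor` type, the function name, and the 4 MiB chunk size are all hypothetical.

```python
from typing import List, NamedTuple


class DmaDescriptor(NamedTuple):
    channel: int  # which DMA channel moves this chunk
    offset: int   # byte offset into the host buffer
    length: int   # number of bytes to move


def stripe_transfer(total_bytes: int, chunk_bytes: int,
                    channels: int = 2) -> List[DmaDescriptor]:
    """Split a buffer into chunks and assign them to channels round-robin."""
    if total_bytes < 0 or chunk_bytes <= 0 or channels <= 0:
        raise ValueError("sizes and channel count must be positive")
    descriptors = []
    offset = 0
    index = 0
    while offset < total_bytes:
        length = min(chunk_bytes, total_bytes - offset)
        descriptors.append(DmaDescriptor(index % channels, offset, length))
        offset += length
        index += 1
    return descriptors


MIB = 1024 * 1024
plan = stripe_transfer(total_bytes=10 * MIB, chunk_bytes=4 * MIB)
for descriptor in plan:
    print(descriptor)
```

In a scheme like this the CPU only builds the descriptor list and checks status, which matches the setup-and-monitoring role described above.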

An AB17-M2FMC adapter board is used to connect the NVMe SSD to the FMC-HPC slot, as shown in Figure 5.

Figure 5: Demo environment set up on ZCU106. (Image source: Design Gateway)

Figure 6 shows an example test result from the demo system on the ZCU106 with a 1 TB Samsung 990 Pro, confirming that the solution makes nearly full use of the PCIe Gen4 x4 bandwidth on PetaLinux.

Figure 6: NVMe SSD read/write performance comparison on ZCU106: Traditional NVMe driver vs. DG NVMe Solution. (Image source: Design Gateway)
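
To put the measured figures in context, the sketch below compares the approximate read and write rates quoted at the start of this article with the raw PCIe Gen4 x4 link rate. It assumes 128b/130b encoding and ignores packet overhead, so the true ceiling for payload data is somewhat lower and the real utilization correspondingly higher.

```python
# Raw PCIe Gen4 x4 rate in GB/s: 16 GT/s per lane, 4 lanes, 128b/130b encoding.
GEN4_X4_RAW = 16.0 * 4 * (128 / 130) / 8

# Approximate results reported in this article for the ZCU106 demo.
results = {"read": 7.5, "write": 6.9}

utilization = {op: rate / GEN4_X4_RAW for op, rate in results.items()}
for op, fraction in utilization.items():
    print(f"{op}: {results[op]} GB/s = {fraction:.1%} of raw Gen4 x4")
```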

Conclusion

Design Gateway’s NVMe Gen4 Soft IP for PetaLinux transforms the Zynq UltraScale+ into a high-performance storage platform, achieving a world-first 7 GB/s throughput at Gen4 speed. By fully offloading the NVMe protocol into hardware logic and pairing it with a well-optimized PetaLinux device driver, the solution eliminates software-level bottlenecks, maximizes data-path efficiency, and scales bandwidth for DAQ and video-processing workloads. The design is portable, efficient, and well suited to edge or embedded systems that demand both high throughput and deterministic performance.

Disclaimer: The opinions, beliefs, and viewpoints expressed by the various authors and/or forum participants on this website do not necessarily reflect the opinions, beliefs, and viewpoints of DigiKey or official policies of DigiKey.

About this author

Design Gateway Co., Ltd.