PCIe, BAR0, and DMA Explained with QEMU

Introduction

In this post, we'll break down how BAR0 and DMA interact in PCIe devices.
This post is the theoretical foundation of a two-part series:

  • Part 1 (this post): BARs, DMA flows, and system-level design.
  • Part 2 (next post): Full hands-on implementation with QEMU and a Linux kernel driver.

💡 Historical note:
In old ISA systems, the CPU directly drove parallel address + data lines to the card.
With PCIe, the same principle applies: the CPU still writes to registers,
but every access is now packed into a PCIe transaction and sent over lanes.

The goal here is to clarify the concepts: BARs as control registers, DMA as the data mover,
and how the CPU, IOMMU, and hardware engine all interact.
In the next part, we'll put this theory into practice with a working QEMU device model and driver.

Base Address Registers (BARs)

  • BARs define memory-mapped regions exposed by the PCIe device.
  • Typically implemented with AXI-Lite in FPGA/ASIC designs.
  • Host drivers map BARs with pci_iomap() and access them using ioread32() / iowrite32() (see the sketch below).
  • Common usage: BAR0 → control registers, BAR1 → status, BAR2 → DMA descriptors.
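
In a Linux driver, the mapping step usually lives in probe(). Here is a minimal sketch, assuming a hypothetical device whose BAR0 holds a readable register at offset 0x00 and a writable control register at 0x04 (device IDs, offsets, and names are illustrative only):

```c
/* Minimal sketch: map BAR0 and touch two hypothetical registers. */
#include <linux/module.h>
#include <linux/pci.h>
#include <linux/io.h>

static void __iomem *bar0;

static int demo_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
        int ret;

        ret = pci_enable_device(pdev);
        if (ret)
                return ret;

        ret = pci_request_regions(pdev, "demo-pcie");
        if (ret)
                goto err_disable;

        bar0 = pci_iomap(pdev, 0, 0);   /* BAR index 0, map the whole BAR */
        if (!bar0) {
                ret = -ENOMEM;
                goto err_regions;
        }

        /* From here on, register access is plain ioread32()/iowrite32(). */
        pr_info("demo-pcie: reg@0x00 = 0x%08x\n", ioread32(bar0 + 0x00));
        iowrite32(0x1, bar0 + 0x04);    /* hypothetical control register */

        return 0;

err_regions:
        pci_release_regions(pdev);
err_disable:
        pci_disable_device(pdev);
        return ret;
}
```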

DMA Basics

DMA (Direct Memory Access) allows large data transfers without CPU intervention.

  • Software allocates and maps buffers (via kernel DMA API).
  • Hardware DMA engine performs the actual transfer across PCIe as a bus master.
  • Writing to BAR registers configures the DMA engine (address, length, command); see the sketch below.
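
A minimal sketch of these two steps, assuming the hypothetical REG_DMA_* offsets and command value used later in this post:

```c
/* Sketch: allocate a coherent buffer and program a hypothetical DMA engine
 * through BAR0. Offsets and command values are assumptions for illustration. */
#include <linux/dma-mapping.h>
#include <linux/gfp.h>
#include <linux/io.h>
#include <linux/kernel.h>

#define REG_DMA_ADDR  0x08
#define REG_DMA_LEN   0x10
#define REG_DMA_CMD   0x14
#define DMA_CMD_START 0x1

static int demo_start_dma(struct device *dev, void __iomem *bar0, size_t len)
{
        dma_addr_t dma_handle;
        void *cpu_buf;

        /* Buffer visible to both CPU and device without explicit sync. */
        cpu_buf = dma_alloc_coherent(dev, len, &dma_handle, GFP_KERNEL);
        if (!cpu_buf)
                return -ENOMEM;

        /* Configure the engine: where, how much, go. No data moves until START. */
        writel(lower_32_bits(dma_handle), bar0 + REG_DMA_ADDR);
        writel(len, bar0 + REG_DMA_LEN);
        writel(DMA_CMD_START, bar0 + REG_DMA_CMD);

        /* Completion arrives later via interrupt or a status register. */
        return 0;
}
```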

How BAR0 and DMA Work Together

  • BAR0 acts as a control panel: the driver writes registers to configure address, length, and commands.
  • DMA engine is implemented in hardware: once started, it moves the data autonomously over the PCIe bus.
  • Software only sets up the transfer and waits for completion (via interrupt or status register).
```mermaid
graph LR
    CPU["CPU / Driver"] --> BAR0["BAR0 Registers"]
    BAR0 --> DMA["DMA Engine (hardware)"]
    DMA --> RAM["System RAM"]
```

Who Controls the DMA?

Two common system architectures exist for heterogeneous FPGA + ARM platforms:

System 1: ARM as Manager

  • ARM configures registers, DMA descriptors, and orchestrates sequencing.
  • Host mostly consumes results.
  • Used in embedded/standalone systems (cameras, medical, industrial).

System 2: Host as Manager

  • Host configures registers and DMA directly.
  • ARM is just a compute worker (DSP, ML, crypto).
  • Used in PCIe accelerator cards in datacenters.

System 1 vs System 2 – Comparison Table

| Aspect | ARM = Manager | Host = Manager |
| --- | --- | --- |
| Ownership | ARM owns control plane & registers | Host owns control plane, ARM is worker |
| DMA Management | ARM allocates buffers, IRQs, sequencing | Host manages DMA, ARM processes data |
| Role Focus | System orchestration & lifecycle | Compute tasks only |
| Latency Sensitivity | Real-time control, sequencing | High-throughput pipelines |
| Deployment Context | Embedded, standalone devices | Servers/datacenters with PCIe accelerators |

System View Diagram

```mermaid
graph TD
    subgraph "System 1: ARM as Manager"
        ARM1["ARM CPU"] -->|AXI-Lite Control| FPGA1["FPGA Modules"]
        ARM1 -->|Setup DMA| DMA1["DMA Engine"]
        DMA1 --> HostRAM1["System RAM (Host)"]
    end
    subgraph "System 2: Host as Manager"
        Host2["Host CPU"] -->|PCIe BAR Control| FPGA2["FPGA Modules"]
        Host2 -->|Setup DMA| DMA2["DMA Engine"]
        DMA2 --> HostRAM2["System RAM (Host)"]
        ARM2["ARM CPU"] -->|Worker Tasks| FPGA2
    end
```

How to read the diagram:

  • System 1 (ARM as Manager): ARM configures registers and DMA, Host only consumes results.
  • System 2 (Host as Manager): Host controls BARs and DMA, ARM works as a compute engine.
  • DMA Engine: Always the hardware block actually moving the data across PCIe.
  • Control Path vs Data Path: Control goes via BAR/AXI-Lite, data streams via DMA.

Practical Tips

  • Define clear ownership of the control plane (ARM vs Host).
  • Always include a REG_IF_VERSION register for compatibility.
  • Provide telemetry: counters, status, error flags (an example register map follows this list).
  • Prototype with QEMU before moving to hardware.
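
As an example of these tips, a BAR0 register map header could look like this. The offsets and the two counter registers are hypothetical; the other names appear elsewhere in this post:

```c
/* Hypothetical BAR0 register map: version first, then status/telemetry,
 * then the DMA programming registers used throughout this post. */
#define REG_IF_VERSION  0x00  /* e.g. 0x00010002 = interface v1.2 */
#define REG_STATUS      0x04  /* busy / done / error flags */
#define REG_DMA_ADDR    0x08  /* bus address of the host buffer */
#define REG_DMA_LEN     0x10  /* transfer length in bytes */
#define REG_DMA_CMD     0x14  /* write 1 to start a transfer */
#define REG_IRQ_COUNT   0x18  /* telemetry: interrupts raised */
#define REG_ERR_COUNT   0x1c  /* telemetry: failed transfers */
```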

🖧 PCIe, BAR0, and DMA with QEMU

PCIe (Peripheral Component Interconnect Express) is the standard high-speed bus connecting CPUs with devices like FPGAs, GPUs, and NICs.
In this post we'll go hands-on: understand BAR0 registers, see how a DMA engine makes a device a true Bus Master, and learn how to debug both the device logic and the driver side with QEMU.


🔌 BAR0 – Control Plane

  • A PCIe device exposes Base Address Registers (BARs), each one mapping to a memory region.
  • BAR0 is often used for control registers (status, DMA setup, configuration).
  • From the kernel side, once the OS enumerates PCIe, the driver maps BAR0 using pci_iomap().
  • Access becomes simple: readl() and writel() from the driver hit the device registers.

👉 Writing to BAR0 + REG_DMA_ADDR doesn't move data. It just tells the device's DMA engine where in system RAM to operate.


⚙️ DMA – Who Really Moves Data?

  • The kernel DMA API (dma_alloc_coherent, dma_map_single) allocates and maps buffers in host RAM (see the streaming-mapping sketch after this list).
  • The driver writes the buffer's dma_handle (bus address) and length to BAR0 registers.
  • The device DMA engine becomes Bus Master on the PCIe fabric and actually transfers the data by issuing PCIe TLPs.
  • The CPU is not involved in the memcpy; it only sets things up.
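
For streaming (one-shot) mappings of a buffer that already exists, the driver uses dma_map_single() instead of dma_alloc_coherent(). A minimal sketch, assuming a device-to-host transfer:

```c
/* Sketch: map an existing kernel buffer for a single DMA transfer.
 * 'dev' is the &pdev->dev of the PCIe device. */
#include <linux/dma-mapping.h>
#include <linux/errno.h>

static int demo_map_buffer(struct device *dev, void *buf, size_t len,
                           dma_addr_t *handle)
{
        *handle = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
        if (dma_mapping_error(dev, *handle))
                return -ENOMEM;

        /* *handle is the bus address the device sees; it goes into BAR0. */
        return 0;
}

/* After the device signals completion, release the mapping so the CPU
 * view of the buffer is coherent again. */
static void demo_unmap_buffer(struct device *dev, dma_addr_t handle, size_t len)
{
        dma_unmap_single(dev, handle, len, DMA_FROM_DEVICE);
}
```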

📶 Flow of Operations

Step-by-step DMA flow (as shown in the diagram):

  • User space app → Kernel driver: An application issues a request (write() / ioctl()), triggering a DMA transaction.
  • Driver → DMA API: The driver allocates a physically contiguous buffer with the Linux DMA API (dma_alloc_coherent).
  • Kernel DMA API → RAM: The API ensures the buffer is accessible by the device and returns a bus address (dma_handle).
  • Driver → Device registers (BAR0): The driver programs the device by writing into BAR0 registers (DMA address, length, and a START command).
  • Device DMA Engine → System RAM: Acting as Bus Master, the device generates PCIe TLPs (Memory Read/Write) to transfer data directly to/from host RAM.
  • Device → Driver: Once the transfer completes, the device signals via MSI/MSI-X interrupt or by updating a status register.
  • Driver → User space app: The driver wakes up the application and reports DMA completion (a completion-handling sketch follows the diagram).
```mermaid
sequenceDiagram
    participant App as 🖥️ User App
    participant Driver as 📄 Kernel Driver
    participant DMA_API as ⚙️ DMA API
    participant Device as 🚀 PCIe Device + DMA Engine
    participant RAM as 💾 System RAM
    App->>Driver: write()/ioctl()
    Driver->>DMA_API: dma_alloc_coherent()
    DMA_API->>RAM: allocate physical buffer
    DMA_API-->>Driver: return dma_handle (bus address)
    Driver->>Device: writel(dma_handle, BAR0 + REG_DMA_ADDR)
    Driver->>Device: writel(size, BAR0 + REG_DMA_LEN)
    Driver->>Device: writel(START, BAR0 + REG_DMA_CMD)
    Device->>RAM: issues PCIe Memory Read/Write TLPs
    RAM-->>Device: data flows
    Device->>Driver: MSI interrupt or REG_STATUS update
    Driver-->>App: complete()
```
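
On the driver side, the completion half of this sequence could be sketched as follows. The REG_STATUS bit layout and the write-1-to-clear behavior are assumptions, and the MSI vector is assumed to have been requested during probe (e.g. with pci_alloc_irq_vectors()):

```c
/* Sketch: MSI handler plus a helper that waits for DMA completion. */
#include <linux/interrupt.h>
#include <linux/completion.h>
#include <linux/jiffies.h>
#include <linux/bits.h>
#include <linux/io.h>

#define REG_STATUS      0x04
#define STATUS_DMA_DONE BIT(0)          /* hypothetical "transfer done" bit */

static DECLARE_COMPLETION(dma_done);
static void __iomem *bar0;              /* mapped in probe() with pci_iomap() */

static irqreturn_t demo_irq(int irq, void *data)
{
        u32 status = readl(bar0 + REG_STATUS);

        if (!(status & STATUS_DMA_DONE))
                return IRQ_NONE;

        writel(status, bar0 + REG_STATUS);   /* acknowledge (write-1-to-clear) */
        complete(&dma_done);
        return IRQ_HANDLED;
}

/* Called right after programming BAR0 and issuing the START command. */
static int demo_wait_for_dma(void)
{
        if (!wait_for_completion_timeout(&dma_done, msecs_to_jiffies(1000)))
                return -ETIMEDOUT;
        return 0;
}
```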

🏗️ Hardware View

How to read the diagram:

  • Local RAM/FIFO – the device's own buffers or on-chip memory.
  • DMA Engine – the "bridge" that turns BAR0 register commands into PCIe transactions.
  • PCIe Endpoint – serializes packets (TLPs) onto the PCIe lanes.
  • PCIe Fabric – the physical high-speed lanes carrying the data.
  • Root Complex – the host's PCIe controller that receives the TLPs.
  • System DRAM – where data finally lands (or is fetched from) in host memory.
```mermaid
graph LR
    subgraph Device["PCIe Device (FPGA/GPU/NIC)"]
        RAM_Device["Local RAM / FIFO"]
        DMA["DMA Engine (Bus Master)"]
        PCIe_EP["PCIe Endpoint (SERDES + TLPs)"]
        RAM_Device <--> DMA
        DMA <--> PCIe_EP
    end
    subgraph PCIe_Bus["PCIe Fabric (lanes)"]
        Link["High-speed serial lines (x1/x4/x16)"]
    end
    subgraph Host["Host PC"]
        Root["Root Complex"]
        MC["Memory Controller"]
        RAM_Host["System DRAM"]
        Root <--> MC
        MC <--> RAM_Host
    end
    PCIe_EP <--> Link
    Link <--> Root
```

💡 Here we see how the DMA engine bridges between local device buffers and host DRAM through the PCIe lanes.


🔄 Bulk vs Streaming DMA

  • Bulk DMA (one-shot): Driver writes address + size → device moves a block (e.g. 1MB). No indices needed.
  • Streaming DMA (ring buffer): Shared structure in RAM with Producer/Consumer indices (a minimal ring layout is sketched below).
    • Device updates Producer as it fills.
    • Driver/CPU updates Consumer as it drains.
    • Common in NICs, audio, video pipelines.
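
A streaming ring can be modeled as a descriptor array plus the two indices, all living in DMA-coherent host memory. This layout is a sketch only; real NICs add doorbell registers and richer descriptor formats:

```c
#include <linux/types.h>

/* One transfer descriptor shared between driver and device (hypothetical). */
struct demo_desc {
        __le64 buf_addr;   /* bus address of the data buffer */
        __le32 len;        /* bytes to transfer */
        __le32 flags;      /* e.g. owned-by-device / done bits */
};

/* Ring living in coherent host RAM, as described above. */
struct demo_ring {
        struct demo_desc desc[256];
        __le32 producer;   /* advanced by the device as it fills entries */
        __le32 consumer;   /* advanced by the driver as it drains entries */
};
```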

📌 Key Principles

  • BAR0 = control plane. The CPU writes registers; no data moves yet.
  • Device = bus master. Once told, the device issues its own PCIe TLPs.
  • DMA = hardware job. CPU is out of the data path.
  • Interrupts complete the loop. Device signals finish via MSI/MSI-X or status register.

🧪 Practicing with QEMU

With QEMU you can simulate:

  • A fake PCIe device exposing BAR0 registers.
  • A Linux driver that allocates buffers, writes BAR0, and waits for DMA completion.
  • GDB debugging of both kernel and QEMU side.

👉 This allows you to debug logic and register maps before touching real FPGA hardware.
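
To give a feel for the QEMU side, here is a rough sketch of the BAR0 write handler of such a fake device. QOM boilerplate, the read path, and MSI initialization are omitted; register offsets follow the hypothetical map above, and the copy uses QEMU's pci_dma_write() helper rather than raw dma_memory_write():

```c
/* Sketch of a fake PCIe device's BAR0 handling in QEMU (not a complete model). */
#include "qemu/osdep.h"
#include "hw/pci/pci.h"
#include "hw/pci/msi.h"

#define REG_DMA_ADDR 0x08
#define REG_DMA_LEN  0x10
#define REG_DMA_CMD  0x14

typedef struct DemoState {
    PCIDevice parent_obj;
    MemoryRegion mmio;          /* backs BAR0 */
    uint64_t dma_addr;
    uint32_t dma_len;
    uint8_t scratch[4096];      /* pretend device-local data */
} DemoState;

static uint64_t demo_mmio_read(void *opaque, hwaddr addr, unsigned size)
{
    return 0;                   /* status/version reads omitted in this sketch */
}

static void demo_mmio_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
{
    DemoState *s = opaque;

    switch (addr) {
    case REG_DMA_ADDR:
        s->dma_addr = val;
        break;
    case REG_DMA_LEN:
        s->dma_len = val;
        break;
    case REG_DMA_CMD:
        /* "DMA": copy device-local data into guest RAM at the programmed address. */
        pci_dma_write(&s->parent_obj, s->dma_addr, s->scratch,
                      MIN(s->dma_len, sizeof(s->scratch)));
        msi_notify(&s->parent_obj, 0);   /* completion, assuming msi_init() was done */
        break;
    }
}

static const MemoryRegionOps demo_mmio_ops = {
    .read = demo_mmio_read,
    .write = demo_mmio_write,
    .endianness = DEVICE_LITTLE_ENDIAN,
};

/* In the device's realize() callback: create BAR0 and register it. */
static void demo_realize(PCIDevice *pdev, Error **errp)
{
    DemoState *s = (DemoState *)pdev;   /* parent_obj is the first member */

    memory_region_init_io(&s->mmio, OBJECT(pdev), &demo_mmio_ops, s,
                          "demo-bar0", 4096);
    pci_register_bar(pdev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &s->mmio);
}
```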


📂 Code Example (to be added)

👉 Full implementation (QEMU device model + Linux kernel driver + user app) will be published in the next post.
Stay tuned, or check the GitHub repo here: github.com/yairgadelov/qemu-pcie-demo
Here we'll place two parts:

  1. QEMU PCIe Device Stub

    • Defines BAR0.
    • Implements registers (REG_DMA_ADDR, REG_DMA_LEN, REG_DMA_CMD, REG_STATUS).
    • Simulates DMA with dma_memory_read/dma_memory_write.
  2. Linux Kernel Driver

    • Uses dma_alloc_coherent().
    • Writes BAR0 with bus addresses and length.
    • Handles MSI interrupt.
    • Demonstrates a 1MB transfer flow (a user-space sketch follows this list).
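
For completeness, the user-space side of such a demo can stay tiny. A sketch, where the device node name and the write()-triggers-DMA convention are assumptions about the demo driver:

```c
/* Hypothetical user-space test: hand a 1MB buffer to the driver and let it
 * run one DMA transfer. The device node name is illustrative only. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 1 << 20;                 /* 1MB, as in the demo flow */
    char *buf = malloc(len);
    int fd = open("/dev/qemu_pcie_demo", O_RDWR);

    if (fd < 0 || !buf) {
        perror("setup");
        return 1;
    }

    /* The driver programs BAR0 with the buffer and sleeps until the device
     * signals DMA completion (MSI or status register). */
    if (write(fd, buf, len) != (ssize_t)len)
        perror("write");
    else
        printf("DMA transfer of %zu bytes completed\n", len);

    close(fd);
    free(buf);
    return 0;
}
```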

The full implementation (QEMU device model + Linux kernel driver + user-space test app)
is detailed in a follow-up post:

👉 Hands-On PCIe with QEMU: From Fake Device to Kernel Driver


✅ Takeaways

  • BAR0 = control registers, exposed to host via PCIe.
  • DMA engine = hardware moves data autonomously once configured.
  • System 1 vs System 2 = ARM controls vs Host controls.
  • QEMU = safe way to test drivers before real hardware.
