Introduction
In this post, we'll break down how BAR0 and DMA interact in PCIe devices.
This is the theoretical foundation of a two-part series:
- Part 1 (this post): BARs, DMA flows, and system-level design.
- Part 2 (next post): Full hands-on implementation with QEMU and a Linux kernel driver.
Historical note:
In old ISA systems, the CPU directly drove parallel address + data lines to the card.
With PCIe, the same principle applies: the CPU still writes to registers,
but every access is now packed into a PCIe transaction and sent over the lanes.
The goal here is to clarify the concepts: BARs as control registers, DMA as the data mover,
and how the CPU, IOMMU, and hardware engine all interact.
In the next part, we'll take this theory into practice with a working QEMU device model and driver.
Base Address Registers (BARs)
- BARs define memory-mapped regions exposed by the PCIe device.
- In FPGA/ASIC designs, the BAR register interface is typically implemented behind an AXI-Lite slave.
- Host drivers map BARs with pci_iomap() and access them using ioread32() / iowrite32() (a minimal sketch follows this list).
- Common usage: BAR0 → control registers, BAR1 → status, BAR2 → DMA descriptors.
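A minimal sketch of that mapping step, assuming a hypothetical control register REG_CTRL at offset 0x00 of BAR0; pci_iomap(), ioread32(), and iowrite32() are the standard kernel accessors named above, the rest is illustrative.

```c
#include <linux/pci.h>
#include <linux/io.h>

#define REG_CTRL 0x00   /* hypothetical control register at BAR0 offset 0 */

static int demo_map_bar0(struct pci_dev *pdev)
{
	void __iomem *bar0;

	/* Map BAR index 0; a maxlen of 0 maps the entire BAR. */
	bar0 = pci_iomap(pdev, 0, 0);
	if (!bar0)
		return -ENOMEM;

	iowrite32(0x1, bar0 + REG_CTRL);                            /* poke a control bit */
	pr_info("REG_CTRL = 0x%08x\n", ioread32(bar0 + REG_CTRL));  /* read it back       */

	return 0;
}
```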
DMA Basics
DMA (Direct Memory Access) allows large data transfers without CPU intervention.
- Software allocates and maps buffers via the kernel DMA API (a minimal sketch follows this list).
- Hardware DMA engine performs the actual transfer across PCIe as a bus master.
- Writing to BAR registers configures the DMA engine (address, length, command).
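As a sketch of the software side of this list, here is how a driver might allocate a DMA-capable buffer with the kernel DMA API; dma_alloc_coherent() is the real API, while the buffer size and surrounding names are illustrative.

```c
#include <linux/dma-mapping.h>
#include <linux/pci.h>

#define DMA_BUF_SIZE (1 * 1024 * 1024)   /* 1 MB example buffer */

static void *cpu_buf;           /* virtual address the CPU uses              */
static dma_addr_t dma_handle;   /* bus address the DMA engine will be given  */

static int demo_alloc_buffer(struct pci_dev *pdev)
{
	/* Coherent mapping: CPU and device see the same data without manual syncs. */
	cpu_buf = dma_alloc_coherent(&pdev->dev, DMA_BUF_SIZE, &dma_handle, GFP_KERNEL);
	if (!cpu_buf)
		return -ENOMEM;

	/* dma_handle (not the virtual address!) is what gets written to the BAR registers. */
	return 0;
}
```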
How BAR0 and DMA Work Together
- BAR0 acts as a control panel: the driver writes registers to configure address, length, and commands (a register-map sketch follows the diagram below).
- DMA engine is implemented in hardware: once started, it moves the data autonomously over the PCIe bus.
- Software only sets up the transfer and waits for completion (via interrupt or status register).
graph LR
CPU["CPU / Driver"] --> BAR0["BAR0 Registers"]
BAR0 --> DMA["DMA Engine (hardware)"]
DMA --> RAM["System RAM"]
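To make the control-panel idea concrete, here is a hypothetical BAR0 register map and a start-transfer helper. The register names follow the ones used later in this post (REG_DMA_ADDR, REG_DMA_LEN, REG_DMA_CMD, REG_STATUS), but the offsets and bit meanings are assumptions for illustration only.

```c
#include <linux/io.h>
#include <linux/dma-mapping.h>

/* Hypothetical BAR0 layout (offsets are illustrative, not a real device map). */
#define REG_DMA_ADDR 0x00   /* bus address of the host buffer        */
#define REG_DMA_LEN  0x08   /* transfer length in bytes              */
#define REG_DMA_CMD  0x0c   /* writing 1 starts the transfer         */
#define REG_STATUS   0x10   /* bit 0 is set by hardware when done    */

static void demo_start_dma(void __iomem *bar0, dma_addr_t addr, u32 len)
{
	/* Program the "control panel": address, length, then kick the engine.
	 * A 32-bit address register is assumed here for simplicity. */
	iowrite32(lower_32_bits(addr), bar0 + REG_DMA_ADDR);
	iowrite32(len, bar0 + REG_DMA_LEN);
	iowrite32(1, bar0 + REG_DMA_CMD);

	/* From this point the DMA engine moves the data on its own;
	 * the CPU just waits for an interrupt or polls REG_STATUS. */
}
```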
Who Controls the DMA?
Two common system architectures exist for heterogeneous FPGA + ARM platforms:
System 1: ARM as Manager
- ARM configures registers, DMA descriptors, and orchestrates sequencing.
- Host mostly consumes results.
- Used in embedded/standalone systems (cameras, medical, industrial).
System 2: Host as Manager
- Host configures registers and DMA directly.
- ARM is just a compute worker (DSP, ML, crypto).
- Used in PCIe accelerator cards in datacenters.
System 1 vs System 2: Comparison Table

| Aspect | ARM = Manager | Host = Manager |
| --- | --- | --- |
| Ownership | ARM owns control plane & registers | Host owns control plane, ARM is worker |
| DMA Management | ARM allocates buffers, IRQs, sequencing | Host manages DMA, ARM processes data |
| Role Focus | System orchestration & lifecycle | Compute tasks only |
| Latency Sensitivity | Real-time control, sequencing | High-throughput pipelines |
| Deployment Context | Embedded, standalone devices | Servers/datacenters with PCIe accelerators |
System View Diagram
graph TD
subgraph "System 1: ARM as Manager"
ARM1["ARM CPU"] -->|AXI-Lite Control| FPGA1["FPGA Modules"]
ARM1 -->|Setup DMA| DMA1["DMA Engine"]
DMA1 --> HostRAM1["System RAM Host"]
end
subgraph "System 2: Host as Manager"
Host2["Host CPU"] -->|PCIe BAR Control| FPGA2["FPGA Modules"]
Host2 -->|Setup DMA| DMA2["DMA Engine"]
DMA2 --> HostRAM2["System RAM Host"]
ARM2["ARM CPU"] -->|Worker Tasks| FPGA2
end
How to read the diagram:
- System 1 (ARM as Manager): ARM configures registers and DMA, Host only consumes results.
- System 2 (Host as Manager): Host controls BARs and DMA, ARM works as a compute engine.
- DMA Engine: Always the hardware block actually moving the data across PCIe.
- Control Path vs Data Path: Control goes via BAR/AXI-Lite, data streams via DMA.
Practical Tips
- Define clear ownership of the control plane (ARM vs Host).
- Always include a REG_IF_VERSION register for compatibility (a version-check sketch follows this list).
- Provide telemetry: counters, status, error flags.
- Prototype with QEMU before moving to hardware.
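A sketch of how the REG_IF_VERSION tip might look in a driver probe path; the register offset, the major/minor packing, and the supported version are all assumptions.

```c
#include <linux/io.h>
#include <linux/errno.h>
#include <linux/printk.h>

#define REG_IF_VERSION      0xfc   /* hypothetical offset of the version register */
#define DRV_SUPPORTED_MAJOR 1      /* register-interface major this driver knows  */

static int demo_check_if_version(void __iomem *bar0)
{
	u32 ver   = ioread32(bar0 + REG_IF_VERSION);
	u32 major = ver >> 16;      /* assumed packing: major in bits [31:16] */
	u32 minor = ver & 0xffff;   /* minor in bits [15:0]                   */

	if (major != DRV_SUPPORTED_MAJOR) {
		pr_err("unsupported register interface %u.%u\n", major, minor);
		return -ENODEV;
	}
	return 0;
}
```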
PCIe, BAR0, and DMA with QEMU
PCIe (Peripheral Component Interconnect Express) is the standard high-speed bus connecting CPUs with devices like FPGAs, GPUs, and NICs.
In this post we'll go hands-on: understand BAR0 registers, see how a DMA engine makes a device a true Bus Master, and learn how to debug both the device logic and the driver side with QEMU.
BAR0: Control Plane
- A PCIe device exposes Base Address Registers (BARs), each one mapping to a memory region.
- BAR0 is often used for control registers (status, DMA setup, configuration).
- From the kernel side, once the OS enumerates PCIe, the driver maps BAR0 using pci_iomap() (see the probe sketch below).
- Access becomes simple: readl() and writel() from the driver hit the device registers.
Writing to BAR0 + REG_DMA_ADDR doesn't move data. It just tells the device's DMA engine where in system RAM to operate.
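A sketch of that probe path, assuming a driver name of "pcie-demo" and the hypothetical REG_STATUS offset from the register map earlier in this post; the PCI calls themselves (pci_enable_device(), pci_request_regions(), pci_iomap(), readl()) are standard.

```c
#include <linux/pci.h>
#include <linux/io.h>

#define REG_STATUS 0x10   /* hypothetical status register, as in the layout above */

static int demo_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	void __iomem *bar0;
	int ret;

	ret = pci_enable_device(pdev);
	if (ret)
		return ret;

	ret = pci_request_regions(pdev, "pcie-demo");   /* reserve the BARs */
	if (ret)
		goto err_disable;

	bar0 = pci_iomap(pdev, 0, 0);                   /* map BAR0, full length */
	if (!bar0) {
		ret = -ENOMEM;
		goto err_release;
	}

	pci_set_drvdata(pdev, bar0);
	pr_info("pcie-demo: BAR0 mapped, status=0x%08x\n", readl(bar0 + REG_STATUS));
	return 0;

err_release:
	pci_release_regions(pdev);
err_disable:
	pci_disable_device(pdev);
	return ret;
}
```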
DMA: Who Really Moves Data?
- The kernel DMA API (dma_alloc_coherent, dma_map_single) allocates and maps buffers in host RAM (a dma_map_single sketch follows this list).
- The driver writes the buffer's dma_handle (bus address) and length to BAR0 registers.
- The device DMA engine becomes Bus Master on the PCIe fabric and actually transfers the data by issuing PCIe TLPs.
- The CPU is not involved in the memcpy; it only sets things up.
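For completeness, a sketch of the streaming variant mentioned in the first bullet: mapping an existing kernel buffer with dma_map_single(). The direction and the error-return convention used here are illustrative.

```c
#include <linux/dma-mapping.h>

/* Map an existing kernel buffer for a device-bound transfer.
 * Returns 0 on mapping failure (convention chosen for this sketch). */
static dma_addr_t demo_map_tx_buffer(struct device *dev, void *buf, size_t len)
{
	dma_addr_t handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);

	if (dma_mapping_error(dev, handle))
		return 0;

	/* handle is the bus address written to the device's BAR0 registers.
	 * After completion, release it with:
	 *   dma_unmap_single(dev, handle, len, DMA_TO_DEVICE);
	 */
	return handle;
}
```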
Flow of Operations
Step-by-step DMA flow (as shown in the diagram):
- User-space app → Kernel driver: An application issues a request (write() / ioctl()), triggering a DMA transaction.
- Driver → DMA API: The driver allocates a physically contiguous buffer with the Linux DMA API (dma_alloc_coherent).
- Kernel DMA API → RAM: The API ensures the buffer is accessible by the device and returns a bus address (dma_handle).
- Driver → Device registers (BAR0): The driver programs the device by writing into BAR0 registers (DMA address, length, and a START command).
- Device DMA Engine → System RAM: Acting as Bus Master, the device generates PCIe TLPs (Memory Read/Write) to transfer data directly to/from host RAM.
- Device → Driver: Once the transfer completes, the device signals via an MSI/MSI-X interrupt or by updating a status register (see the interrupt sketch after the diagram).
- Driver → User-space app: The driver wakes up the application and reports DMA completion.
sequenceDiagram
participant App as User App
participant Driver as Kernel Driver
participant DMA_API as DMA API
participant Device as PCIe Device + DMA Engine
participant RAM as System RAM
App->>Driver: write()/ioctl()
Driver->>DMA_API: dma_alloc_coherent()
DMA_API->>RAM: allocate physical buffer
DMA_API-->>Driver: return dma_handle (bus address)
Driver->>Device: writel(dma_handle, BAR0 + REG_DMA_ADDR)
Driver->>Device: writel(size, BAR0 + REG_DMA_LEN)
Driver->>Device: writel(START, BAR0 + REG_DMA_CMD)
Device->>RAM: issues PCIe Memory Read/Write TLPs
RAM-->>Device: data flows
Device->>Driver: MSI interrupt or REG_STATUS update
Driver-->>App: complete()
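A sketch of the completion side of this flow (steps 6 and 7), assuming a single MSI/MSI-X vector and a hypothetical per-device struct; pci_alloc_irq_vectors(), request_irq(), and the completion API are standard kernel interfaces.

```c
#include <linux/pci.h>
#include <linux/interrupt.h>
#include <linux/completion.h>

struct demo_dev {
	void __iomem      *bar0;
	struct completion  dma_done;
};

static irqreturn_t demo_irq(int irq, void *data)
{
	struct demo_dev *dd = data;

	/* Step 6: the device raised its interrupt; wake whoever started the DMA. */
	complete(&dd->dma_done);
	return IRQ_HANDLED;
}

static int demo_setup_irq(struct pci_dev *pdev, struct demo_dev *dd)
{
	int ret;

	init_completion(&dd->dma_done);

	/* One vector is enough for this demo; prefer MSI-X, fall back to MSI. */
	ret = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_MSIX | PCI_IRQ_MSI);
	if (ret < 0)
		return ret;

	return request_irq(pci_irq_vector(pdev, 0), demo_irq, 0, "pcie-demo", dd);
}

/* Step 7: the thread that programmed BAR0 simply blocks until the IRQ fires:
 *   wait_for_completion(&dd->dma_done);
 */
```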
Hardware View
How to read the diagram:
- Local RAM/FIFO: the device's own buffers or on-chip memory.
- DMA Engine: the "bridge" that turns BAR0 register commands into PCIe transactions.
- PCIe Endpoint: serializes packets (TLPs) onto the PCIe lanes.
- PCIe Fabric: the physical high-speed lanes carrying the data.
- Root Complex: the host's PCIe controller that receives the TLPs.
- System DRAM: where data finally lands (or is fetched from) in host memory.
graph LR
subgraph Device["PCIe Device (FPGA/GPU/NIC)"]
RAM_Device["Local RAM / FIFO"]
DMA["DMA Engine (Bus Master)"]
PCIe_EP["PCIe Endpoint (SERDES + TLPs)"]
RAM_Device <--> DMA
DMA <--> PCIe_EP
end
subgraph PCIe_Bus["PCIe Fabric (lanes)"]
Link["High-speed serial lines (x1/x4/x16)"]
end
subgraph Host["Host PC"]
Root["Root Complex"]
MC["Memory Controller"]
RAM_Host["System DRAM"]
Root <--> MC
MC <--> RAM_Host
end
PCIe_EP <--> Link
Link <--> Root
Here we see how the DMA engine bridges between local device buffers and host DRAM through the PCIe lanes.
Bulk vs Streaming DMA
- Bulk DMA (one-shot): the driver writes address + size and the device moves one block (e.g. 1 MB). No indices needed.
- Streaming DMA (ring buffer): a shared structure in RAM with Producer/Consumer indices (sketched after this list).
- Device updates Producer as it fills.
- Driver/CPU updates Consumer as it drains.
- Common in NICs, audio, video pipelines.
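A sketch of what such a ring might look like in shared memory; the descriptor fields, index placement, and wrap-around convention are assumptions for illustration.

```c
#include <linux/types.h>
#include <linux/compiler.h>

#define RING_ENTRIES 256   /* illustrative; usually a power of two */

struct ring_desc {
	__le64 buf_addr;   /* bus address of one data buffer  */
	__le32 len;        /* bytes the device actually wrote */
	__le32 flags;      /* e.g. a "descriptor done" bit    */
};

struct dma_ring {
	struct ring_desc desc[RING_ENTRIES];  /* shared with the device */
	u32 producer;   /* advanced by the device as it fills entries   */
	u32 consumer;   /* advanced by the driver as it drains entries  */
};

/* Driver-side drain: consume everything the device has produced so far. */
static void demo_ring_drain(struct dma_ring *ring)
{
	while (ring->consumer != READ_ONCE(ring->producer)) {
		struct ring_desc *d = &ring->desc[ring->consumer % RING_ENTRIES];

		/* ... hand d->buf_addr / d->len to the upper layer ... */
		(void)d;

		ring->consumer++;
	}
	/* A real device usually learns the new consumer index through a
	 * BAR register write, so it knows those entries are free again. */
}
```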
Key Principles
- BAR0 = control plane. The CPU writes registers; no data moves yet.
- Device = bus master. Once told, the device issues its own PCIe TLPs.
- DMA = hardware job. CPU is out of the data path.
- Interrupts complete the loop. Device signals finish via MSI/MSI-X or status register.
Practicing with QEMU
With QEMU you can simulate:
- A fake PCIe device exposing BAR0 registers.
- A Linux driver that allocates buffers, writes BAR0, and waits for DMA completion.
- GDB debugging of both kernel and QEMU side.
This allows you to debug logic and register maps before touching real FPGA hardware.
Code Example (to be added)
Full implementation (QEMU device model + Linux kernel driver + user app) will be published in the next post.
Stay tuned, or check the GitHub repo here: github.com/yairgadelov/qemu-pcie-demo
Here we'll place two parts:
- QEMU PCIe Device Stub (see the sketch after this list)
  - Defines BAR0.
  - Implements registers (REG_DMA_ADDR, REG_DMA_LEN, REG_DMA_CMD, REG_STATUS).
  - Simulates DMA with dma_memory_read / dma_memory_write.
- Linux Kernel Driver
  - Uses dma_alloc_coherent().
  - Writes BAR0 with bus addresses and length.
  - Handles MSI interrupt.
  - Demonstrates a 1 MB transfer flow.
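A rough sketch of what the QEMU device stub's BAR0 write handler could look like, reusing the register names above. MemoryRegionOps-style callbacks and pci_dma_read() (a per-device wrapper around dma_memory_read()) are real QEMU APIs, but the state struct and the behavior here are simplified placeholders.

```c
#include "qemu/osdep.h"
#include "hw/pci/pci.h"

/* Same hypothetical offsets as on the driver side. */
#define REG_DMA_ADDR 0x00
#define REG_DMA_LEN  0x08
#define REG_DMA_CMD  0x0c
#define REG_STATUS   0x10

typedef struct DemoState {
    PCIDevice    parent_obj;
    MemoryRegion bar0;
    uint64_t     dma_addr;
    uint32_t     dma_len;
    uint32_t     status;
    uint8_t      scratch[1 << 20];   /* fake on-device memory (1 MB) */
} DemoState;

/* MMIO write callback registered through MemoryRegionOps for BAR0. */
static void demo_bar0_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
{
    DemoState *s = opaque;

    switch (addr) {
    case REG_DMA_ADDR:
        s->dma_addr = val;
        break;
    case REG_DMA_LEN:
        s->dma_len = val;
        break;
    case REG_DMA_CMD:
        /* "DMA": copy from guest RAM into the fake device buffer. */
        pci_dma_read(&s->parent_obj, s->dma_addr, s->scratch,
                     MIN(s->dma_len, sizeof(s->scratch)));
        s->status = 1;   /* transfer done; a real model would raise MSI here */
        break;
    }
}
```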
The full implementation (QEMU device model + Linux kernel driver + user-space test app)
is detailed in a follow-up post:
Hands-On PCIe with QEMU: From Fake Device to Kernel Driver
Takeaways
- BAR0 = control registers, exposed to host via PCIe.
- DMA engine = hardware moves data autonomously once configured.
- System 1 vs System 2 = ARM controls vs Host controls.
- QEMU = safe way to test drivers before real hardware.