Introduction
In this post, we'll break down how BAR0 and DMA interact in PCIe devices.
This is the theoretical foundation of a two-part series:
- Part 1 (this post): BARs, DMA flows, and system-level design.
- Part 2 (next post): Full hands-on implementation with QEMU and a Linux kernel driver.
Historical note:
In old ISA systems, the CPU directly drove parallel address + data lines to the card.
With PCIe, the same principle applies: the CPU still writes to registers,
but every access is now packed into a PCIe transaction and sent over the lanes.
The goal here is to clarify the concepts: BARs as control registers, DMA as the data mover,
and how the CPU, IOMMU, and hardware engine all interact.
In the next part, we'll take this theory into practice with a working QEMU device model and driver.
Base Address Registers (BARs)
- BARs define memory-mapped regions exposed by the PCIe device.
- In FPGA/ASIC designs, the BAR register interface is typically implemented behind an AXI-Lite slave.
- Host drivers map BARs with pci_iomap() and access them using ioread32() / iowrite32() (a minimal sketch follows this list).
- Common usage: BAR0 → control registers, BAR1 → status, BAR2 → DMA descriptors.
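A minimal sketch of that mapping step, assuming a hypothetical control register REG_CTRL at offset 0x00 of BAR0; pci_iomap(), ioread32(), and iowrite32() are the standard kernel accessors named above, the rest is illustrative.

```c
#include <linux/pci.h>
#include <linux/io.h>

#define REG_CTRL 0x00   /* hypothetical control register at BAR0 offset 0 */

static int demo_map_bar0(struct pci_dev *pdev)
{
	void __iomem *bar0;

	/* Map BAR index 0; a maxlen of 0 maps the entire BAR. */
	bar0 = pci_iomap(pdev, 0, 0);
	if (!bar0)
		return -ENOMEM;

	iowrite32(0x1, bar0 + REG_CTRL);                            /* poke a control bit */
	pr_info("REG_CTRL = 0x%08x\n", ioread32(bar0 + REG_CTRL));  /* read it back       */

	return 0;
}
```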
DMA Basics
DMA (Direct Memory Access) allows large data transfers without CPU intervention.
- Software allocates and maps buffers via the kernel DMA API (a minimal sketch follows this list).
- Hardware DMA engine performs the actual transfer across PCIe as a bus master.
- Writing to BAR registers configures the DMA engine (address, length, command).
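As a sketch of the software side of this list, here is how a driver might allocate a DMA-capable buffer with the kernel DMA API; dma_alloc_coherent() is the real API, while the buffer size and surrounding names are illustrative.

```c
#include <linux/dma-mapping.h>
#include <linux/pci.h>

#define DMA_BUF_SIZE (1 * 1024 * 1024)   /* 1 MB example buffer */

static void *cpu_buf;           /* virtual address the CPU uses              */
static dma_addr_t dma_handle;   /* bus address the DMA engine will be given  */

static int demo_alloc_buffer(struct pci_dev *pdev)
{
	/* Coherent mapping: CPU and device see the same data without manual syncs. */
	cpu_buf = dma_alloc_coherent(&pdev->dev, DMA_BUF_SIZE, &dma_handle, GFP_KERNEL);
	if (!cpu_buf)
		return -ENOMEM;

	/* dma_handle (not the virtual address!) is what gets written to the BAR registers. */
	return 0;
}
```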
How BAR0 and DMA Work Together
- BAR0 acts as a control panel: the driver writes registers to configure address, length, and commands (a register-map sketch follows the diagram below).
- DMA engine is implemented in hardware: once started, it moves the data autonomously over the PCIe bus.
- Software only sets up the transfer and waits for completion (via interrupt or status register).
graph LR
CPU["CPU / Driver"] --> BAR0["BAR0 Registers"]
BAR0 --> DMA["DMA Engine (hardware)"]
DMA --> RAM["System RAM"]
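To make the control-panel idea concrete, here is a hypothetical BAR0 register map and a start-transfer helper. The register names follow the ones used later in this post (REG_DMA_ADDR, REG_DMA_LEN, REG_DMA_CMD, REG_STATUS), but the offsets and bit meanings are assumptions for illustration only.

```c
#include <linux/io.h>
#include <linux/dma-mapping.h>

/* Hypothetical BAR0 layout (offsets are illustrative, not a real device map). */
#define REG_DMA_ADDR 0x00   /* bus address of the host buffer        */
#define REG_DMA_LEN  0x08   /* transfer length in bytes              */
#define REG_DMA_CMD  0x0c   /* writing 1 starts the transfer         */
#define REG_STATUS   0x10   /* bit 0 is set by hardware when done    */

static void demo_start_dma(void __iomem *bar0, dma_addr_t addr, u32 len)
{
	/* Program the "control panel": address, length, then kick the engine.
	 * A 32-bit address register is assumed here for simplicity. */
	iowrite32(lower_32_bits(addr), bar0 + REG_DMA_ADDR);
	iowrite32(len, bar0 + REG_DMA_LEN);
	iowrite32(1, bar0 + REG_DMA_CMD);

	/* From this point the DMA engine moves the data on its own;
	 * the CPU just waits for an interrupt or polls REG_STATUS. */
}
```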
Who Controls the DMA?
Two common system architectures exist for heterogeneous FPGA + ARM platforms:
System 1: ARM as Manager
- ARM configures registers, DMA descriptors, and orchestrates sequencing.
- Host mostly consumes results.
- Used in embedded/standalone systems (cameras, medical, industrial).
System 2: Host as Manager
- Host configures registers and DMA directly.
- ARM is just a compute worker (DSP, ML, crypto).
- Used in PCIe accelerator cards in datacenters.
System 1 vs System 2: Comparison Table

| Aspect | ARM = Manager | Host = Manager |
| --- | --- | --- |
| Ownership | ARM owns control plane & registers | Host owns control plane, ARM is worker |
| DMA Management | ARM allocates buffers, IRQs, sequencing | Host manages DMA, ARM processes data |
| Role Focus | System orchestration & lifecycle | Compute tasks only |
| Latency Sensitivity | Real-time control, sequencing | High-throughput pipelines |
| Deployment Context | Embedded, standalone devices | Servers/datacenters with PCIe accelerators |
System View Diagram
graph TD
subgraph "System 1: ARM as Manager"
ARM1["ARM CPU"] -->|AXI-Lite Control| FPGA1["FPGA Modules"]
ARM1 -->|Setup DMA| DMA1["DMA Engine"]
DMA1 --> HostRAM1["System RAM Host"]
end
subgraph "System 2: Host as Manager"
Host2["Host CPU"] -->|PCIe BAR Control| FPGA2["FPGA Modules"]
Host2 -->|Setup DMA| DMA2["DMA Engine"]
DMA2 --> HostRAM2["System RAM Host"]
ARM2["ARM CPU"] -->|Worker Tasks| FPGA2
end
How to read the diagram:
- System 1 (ARM as Manager): ARM configures registers and DMA, Host only consumes results.
- System 2 (Host as Manager): Host controls BARs and DMA, ARM works as a compute engine.
- DMA Engine: Always the hardware block actually moving the data across PCIe.
- Control Path vs Data Path: Control goes via BAR/AXI-Lite, data streams via DMA.
Practical Tips
- Define clear ownership of the control plane (ARM vs Host).
- Always include a REG_IF_VERSION register for compatibility (a version-check sketch follows this list).
- Provide telemetry: counters, status, error flags.
- Prototype with QEMU before moving to hardware.
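A sketch of how the REG_IF_VERSION tip might look in a driver probe path; the register offset, the major/minor packing, and the supported version are all assumptions.

```c
#include <linux/io.h>
#include <linux/errno.h>
#include <linux/printk.h>

#define REG_IF_VERSION      0xfc   /* hypothetical offset of the version register */
#define DRV_SUPPORTED_MAJOR 1      /* register-interface major this driver knows  */

static int demo_check_if_version(void __iomem *bar0)
{
	u32 ver   = ioread32(bar0 + REG_IF_VERSION);
	u32 major = ver >> 16;      /* assumed packing: major in bits [31:16] */
	u32 minor = ver & 0xffff;   /* minor in bits [15:0]                   */

	if (major != DRV_SUPPORTED_MAJOR) {
		pr_err("unsupported register interface %u.%u\n", major, minor);
		return -ENODEV;
	}
	return 0;
}
```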
PCIe, BAR0, and DMA with QEMU
PCIe (Peripheral Component Interconnect Express) is the standard high-speed bus connecting CPUs with devices like FPGAs, GPUs, and NICs.
In this post we'll go hands-on: understand BAR0 registers, see how a DMA engine makes a device a true Bus Master, and learn how to debug both the device logic and the driver side with QEMU.
BAR0: Control Plane
- A PCIe device exposes Base Address Registers (BARs), each one mapping to a memory region.
- BAR0 is often used for control registers (status, DMA setup, configuration).
- From the kernel side, once the OS enumerates PCIe, the driver maps BAR0 using pci_iomap() (see the probe sketch below).
- Access becomes simple: readl() and writel() from the driver hit the device registers.
Writing to BAR0 + REG_DMA_ADDR doesn't move data. It just tells the device's DMA engine where in system RAM to operate.
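A sketch of that probe path, assuming a driver name of "pcie-demo" and the hypothetical REG_STATUS offset from the register map earlier in this post; the PCI calls themselves (pci_enable_device(), pci_request_regions(), pci_iomap(), readl()) are standard.

```c
#include <linux/pci.h>
#include <linux/io.h>

#define REG_STATUS 0x10   /* hypothetical status register, as in the layout above */

static int demo_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	void __iomem *bar0;
	int ret;

	ret = pci_enable_device(pdev);
	if (ret)
		return ret;

	ret = pci_request_regions(pdev, "pcie-demo");   /* reserve the BARs */
	if (ret)
		goto err_disable;

	bar0 = pci_iomap(pdev, 0, 0);                   /* map BAR0, full length */
	if (!bar0) {
		ret = -ENOMEM;
		goto err_release;
	}

	pci_set_drvdata(pdev, bar0);
	pr_info("pcie-demo: BAR0 mapped, status=0x%08x\n", readl(bar0 + REG_STATUS));
	return 0;

err_release:
	pci_release_regions(pdev);
err_disable:
	pci_disable_device(pdev);
	return ret;
}
```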
DMA: Who Really Moves Data?
- The kernel DMA API (dma_alloc_coherent, dma_map_single) allocates and maps buffers in host RAM (a dma_map_single sketch follows this list).
- The driver writes the buffer's dma_handle (bus address) and length to BAR0 registers.
- The device DMA engine becomes Bus Master on the PCIe fabric and actually transfers the data by issuing PCIe TLPs.
- The CPU is not involved in the memcpy; it only sets things up.
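For completeness, a sketch of the streaming variant mentioned in the first bullet: mapping an existing kernel buffer with dma_map_single(). The direction and the error-return convention used here are illustrative.

```c
#include <linux/dma-mapping.h>

/* Map an existing kernel buffer for a device-bound transfer.
 * Returns 0 on mapping failure (convention chosen for this sketch). */
static dma_addr_t demo_map_tx_buffer(struct device *dev, void *buf, size_t len)
{
	dma_addr_t handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);

	if (dma_mapping_error(dev, handle))
		return 0;

	/* handle is the bus address written to the device's BAR0 registers.
	 * After completion, release it with:
	 *   dma_unmap_single(dev, handle, len, DMA_TO_DEVICE);
	 */
	return handle;
}
```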
Flow of Operations
Step-by-step DMA flow (as shown in the diagram):
- User-space app → Kernel driver: An application issues a request (write() / ioctl()), triggering a DMA transaction.
- Driver → DMA API: The driver allocates a physically contiguous buffer with the Linux DMA API (dma_alloc_coherent).
- Kernel DMA API → RAM: The API ensures the buffer is accessible by the device and returns a bus address (dma_handle).
- Driver → Device registers (BAR0): The driver programs the device by writing into BAR0 registers (DMA address, length, and a START command).
- Device DMA Engine → System RAM: Acting as Bus Master, the device generates PCIe TLPs (Memory Read/Write) to transfer data directly to/from host RAM.
- Device → Driver: Once the transfer completes, the device signals via an MSI/MSI-X interrupt or by updating a status register (see the interrupt sketch after the diagram).
- Driver → User-space app: The driver wakes up the application and reports DMA completion.
sequenceDiagram
participant App as User App
participant Driver as Kernel Driver
participant DMA_API as DMA API
participant Device as PCIe Device + DMA Engine
participant RAM as System RAM
App->>Driver: write()/ioctl()
Driver->>DMA_API: dma_alloc_coherent()
DMA_API->>RAM: allocate physical buffer
DMA_API-->>Driver: return dma_handle (bus address)
Driver->>Device: writel(dma_handle, BAR0 + REG_DMA_ADDR)
Driver->>Device: writel(size, BAR0 + REG_DMA_LEN)
Driver->>Device: writel(START, BAR0 + REG_DMA_CMD)
Device->>RAM: issues PCIe Memory Read/Write TLPs
RAM-->>Device: data flows
Device->>Driver: MSI interrupt or REG_STATUS update
Driver-->>App: complete()
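A sketch of the completion side of this flow (steps 6 and 7), assuming a single MSI/MSI-X vector and a hypothetical per-device struct; pci_alloc_irq_vectors(), request_irq(), and the completion API are standard kernel interfaces.

```c
#include <linux/pci.h>
#include <linux/interrupt.h>
#include <linux/completion.h>

struct demo_dev {
	void __iomem      *bar0;
	struct completion  dma_done;
};

static irqreturn_t demo_irq(int irq, void *data)
{
	struct demo_dev *dd = data;

	/* Step 6: the device raised its interrupt; wake whoever started the DMA. */
	complete(&dd->dma_done);
	return IRQ_HANDLED;
}

static int demo_setup_irq(struct pci_dev *pdev, struct demo_dev *dd)
{
	int ret;

	init_completion(&dd->dma_done);

	/* One vector is enough for this demo; prefer MSI-X, fall back to MSI. */
	ret = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_MSIX | PCI_IRQ_MSI);
	if (ret < 0)
		return ret;

	return request_irq(pci_irq_vector(pdev, 0), demo_irq, 0, "pcie-demo", dd);
}

/* Step 7: the thread that programmed BAR0 simply blocks until the IRQ fires:
 *   wait_for_completion(&dd->dma_done);
 */
```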
Hardware View
How to read the diagram:
- Local RAM/FIFO: the device's own buffers or on-chip memory.
- DMA Engine: the "bridge" that turns BAR0 register commands into PCIe transactions.
- PCIe Endpoint: serializes packets (TLPs) onto the PCIe lanes.
- PCIe Fabric: the physical high-speed lanes carrying the data.
- Root Complex: the host's PCIe controller that receives the TLPs.
- System DRAM: where data finally lands (or is fetched from) in host memory.
graph LR
subgraph Device["PCIe Device (FPGA/GPU/NIC)"]
RAM_Device["Local RAM / FIFO"]
DMA["DMA Engine (Bus Master)"]
PCIe_EP["PCIe Endpoint (SERDES + TLPs)"]
RAM_Device <--> DMA
DMA <--> PCIe_EP
end
subgraph PCIe_Bus["PCIe Fabric (lanes)"]
Link["High-speed serial lines (x1/x4/x16)"]
end
subgraph Host["Host PC"]
Root["Root Complex"]
MC["Memory Controller"]
RAM_Host["System DRAM"]
Root <--> MC
MC <--> RAM_Host
end
PCIe_EP <--> Link
Link <--> Root
Here we see how the DMA engine bridges between local device buffers and host DRAM through the PCIe lanes.
Bulk vs Streaming DMA
- Bulk DMA (one-shot): the driver writes address + size and the device moves one block (e.g. 1 MB). No indices needed.
- Streaming DMA (ring buffer): a shared structure in RAM with Producer/Consumer indices (sketched after this list).
- Device updates Producer as it fills.
- Driver/CPU updates Consumer as it drains.
- Common in NICs, audio, video pipelines.
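A sketch of what such a ring might look like in shared memory; the descriptor fields, index placement, and wrap-around convention are assumptions for illustration.

```c
#include <linux/types.h>
#include <linux/compiler.h>

#define RING_ENTRIES 256   /* illustrative; usually a power of two */

struct ring_desc {
	__le64 buf_addr;   /* bus address of one data buffer  */
	__le32 len;        /* bytes the device actually wrote */
	__le32 flags;      /* e.g. a "descriptor done" bit    */
};

struct dma_ring {
	struct ring_desc desc[RING_ENTRIES];  /* shared with the device */
	u32 producer;   /* advanced by the device as it fills entries   */
	u32 consumer;   /* advanced by the driver as it drains entries  */
};

/* Driver-side drain: consume everything the device has produced so far. */
static void demo_ring_drain(struct dma_ring *ring)
{
	while (ring->consumer != READ_ONCE(ring->producer)) {
		struct ring_desc *d = &ring->desc[ring->consumer % RING_ENTRIES];

		/* ... hand d->buf_addr / d->len to the upper layer ... */
		(void)d;

		ring->consumer++;
	}
	/* A real device usually learns the new consumer index through a
	 * BAR register write, so it knows those entries are free again. */
}
```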
Key Principles
- BAR0 = control plane. The CPU writes registers; no data moves yet.
- Device = bus master. Once told, the device issues its own PCIe TLPs.
- DMA = hardware job. CPU is out of the data path.
- Interrupts complete the loop. Device signals finish via MSI/MSI-X or status register.
Practicing with QEMU
With QEMU you can simulate:
- A fake PCIe device exposing BAR0 registers.
- A Linux driver that allocates buffers, writes BAR0, and waits for DMA completion.
- GDB debugging of both kernel and QEMU side.
This allows you to debug logic and register maps before touching real FPGA hardware.
Code Example (to be added)
Full implementation (QEMU device model + Linux kernel driver + user app) will be published in the next post.
Stay tuned, or check the GitHub repo here: github.com/yairgadelov/qemu-pcie-demo
Here we'll place two parts:
- QEMU PCIe Device Stub (see the sketch after this list)
  - Defines BAR0.
  - Implements registers (REG_DMA_ADDR, REG_DMA_LEN, REG_DMA_CMD, REG_STATUS).
  - Simulates DMA with dma_memory_read / dma_memory_write.
- Linux Kernel Driver
  - Uses dma_alloc_coherent().
  - Writes BAR0 with bus addresses and length.
  - Handles MSI interrupt.
  - Demonstrates a 1 MB transfer flow.
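A rough sketch of what the QEMU device stub's BAR0 write handler could look like, reusing the register names above. MemoryRegionOps-style callbacks and pci_dma_read() (a per-device wrapper around dma_memory_read()) are real QEMU APIs, but the state struct and the behavior here are simplified placeholders.

```c
#include "qemu/osdep.h"
#include "hw/pci/pci.h"

/* Same hypothetical offsets as on the driver side. */
#define REG_DMA_ADDR 0x00
#define REG_DMA_LEN  0x08
#define REG_DMA_CMD  0x0c
#define REG_STATUS   0x10

typedef struct DemoState {
    PCIDevice    parent_obj;
    MemoryRegion bar0;
    uint64_t     dma_addr;
    uint32_t     dma_len;
    uint32_t     status;
    uint8_t      scratch[1 << 20];   /* fake on-device memory (1 MB) */
} DemoState;

/* MMIO write callback registered through MemoryRegionOps for BAR0. */
static void demo_bar0_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
{
    DemoState *s = opaque;

    switch (addr) {
    case REG_DMA_ADDR:
        s->dma_addr = val;
        break;
    case REG_DMA_LEN:
        s->dma_len = val;
        break;
    case REG_DMA_CMD:
        /* "DMA": copy from guest RAM into the fake device buffer. */
        pci_dma_read(&s->parent_obj, s->dma_addr, s->scratch,
                     MIN(s->dma_len, sizeof(s->scratch)));
        s->status = 1;   /* transfer done; a real model would raise MSI here */
        break;
    }
}
```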
The full implementation (QEMU device model + Linux kernel driver + user-space test app)
is detailed in a follow-up post:
Hands-On PCIe with QEMU: From Fake Device to Kernel Driver
Takeaways
- BAR0 = control registers, exposed to host via PCIe.
- DMA engine = hardware moves data autonomously once configured.
- System 1 vs System 2 = ARM controls vs Host controls.
- QEMU = safe way to test drivers before real hardware.