What's the point of DMA in embedded CPUs?
The long and short of it is that DMA lets the CPU run at its native speed while the peripherals run at theirs. Most of the numbers in the example below are made up.
Let's compare two options for periodically collecting data from an ADC:
- You can read the ADC from an interrupt handler (periodic or otherwise).
- You can create a buffer and tell the DMA to transfer ADC readings into it.
Let's transfer 1000 samples from the ADC to RAM.
Using option 1, for every sample:
- 12 cycles are spent entering the interrupt
- The ADC(s) are read
- The result is stored in RAM
- 12 cycles are spent exiting the interrupt
Let's pretend the body of this interrupt function is 76 instructions, so the whole routine, with entry and exit overhead, takes 100 cycles, assuming single-cycle execution (best case). That means option 1 will spend 100,000 cycles of CPU time collecting the 1000 samples.
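To make the arithmetic explicit, here is a minimal sketch of the per-sample cost in option 1. It uses only the made-up figures above; the constant and function names are mine, and nothing here is measured on real hardware.

```python
# Back-of-the-envelope model of option 1 (one interrupt per ADC sample).
# All figures are the made-up numbers from the text, not measurements.
ENTRY_CYCLES = 12   # cycles spent entering the interrupt
EXIT_CYCLES = 12    # cycles spent exiting the interrupt
BODY_CYCLES = 76    # ISR body, assuming one cycle per instruction
SAMPLES = 1000


def isr_cost_per_sample():
    """CPU cycles consumed by one interrupt that services one sample."""
    return ENTRY_CYCLES + BODY_CYCLES + EXIT_CYCLES


def interrupt_driven_cost(samples=SAMPLES):
    """Total CPU cycles when every sample raises its own interrupt."""
    return samples * isr_cost_per_sample()


def entry_exit_fraction():
    """Share of each interrupt that is pure entry/exit overhead."""
    return (ENTRY_CYCLES + EXIT_CYCLES) / isr_cost_per_sample()


print(isr_cost_per_sample())      # 100
print(interrupt_driven_cost())    # 100000
print(entry_exit_fraction())      # 0.24
```

Note that with these numbers roughly a quarter of every interrupt is spent just getting in and out of the handler, and that cost is paid once per sample.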
Option 2: The DMA is configured to collect 1000 samples from the ADC. Let's assume the ADC has a hardware trigger from a timer counter.
- ADC and DMA transfer 1000 samples of data into RAM
- DMA interrupts your CPU after 1000 samples
- 12 cycles are spent entering the interrupt
- Code happens (let's say it tells the DMA to overwrite the RAM)
- 12 cycles are spent exiting the interrupt
Let's pretend the whole interrupt (with entry and exit overhead) is again 100 single-cycle instructions. Using DMA, you spend only 100 cycles to save the same 1000 samples.
Now, each time the DMA accesses the bus, yes, there might be contention between the CPU and the DMA. The CPU may even be forced to wait for the DMA to finish. But waiting for the DMA to finish takes much, much less time than tying up the CPU servicing the ADC. If the CPU core clock is 2x the bus clock, the CPU might waste a few core cycles waiting for each DMA access to finish. This means that your effective execution time for the transfer is between 1000 (assuming the CPU never waits) and 9000 cycles. Still WAY better than the 100,000 cycles.
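As a rough sketch of how bus stalls change the picture, the model below charges the CPU one completion interrupt plus a hypothetical number of stalled core cycles per DMA transfer. The `stall_per_transfer` parameter is my own assumption for illustration; real stall counts depend on the bus architecture and on what the CPU happens to be doing.

```python
# Rough CPU-cost comparison for moving 1000 ADC samples into RAM.
# Figures are the made-up numbers from the text; stall_per_transfer is a
# hypothetical knob, not a property of any particular chip.
SAMPLES = 1000
ISR_CYCLES = 100   # one full interrupt, entry and exit overhead included


def interrupt_cost(samples=SAMPLES, isr_cycles=ISR_CYCLES):
    """CPU cycles when every sample is serviced by its own interrupt."""
    return samples * isr_cycles


def dma_cost(samples=SAMPLES, isr_cycles=ISR_CYCLES, stall_per_transfer=0):
    """CPU cycles with DMA: one completion interrupt plus bus stalls."""
    return isr_cycles + samples * stall_per_transfer


for stall in (0, 1, 4, 9):
    cost = dma_cost(stall_per_transfer=stall)
    ratio = interrupt_cost() / cost
    print(f"stall={stall}: DMA costs {cost} cycles, {ratio:.1f}x cheaper")
```

Even when every single transfer stalls the CPU for several cycles, the DMA version stays roughly an order of magnitude cheaper than the interrupt-per-sample version in this model.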
The LPC1768 datasheet I found has the following quotes (emphasis mine):
"Eight channel General Purpose DMA controller (GPDMA) on the AHB multilayer matrix that can be used with SSP, I2S-bus, UART, Analog-to-Digital and Digital-to-Analog converter peripherals, timer match signals, and for memory-to-memory transfers."
"Split APB bus allows high throughput with few stalls between the CPU and DMA"
The block diagram on page 6 shows the SRAM connected to the AHB matrix over multiple channels, and the following quote backs this up:
"The LPC17xx contain a total of 64 kB on-chip static RAM memory. This includes the main 32 kB SRAM, accessible by the CPU and DMA controller on a higher-speed bus, and two additional 16 kB each SRAM blocks situated on a separate slave port on the AHB multilayer matrix. This architecture allows CPU and DMA accesses to be spread over three separate RAMs that can be accessed simultaneously"
And this is reinforced by the following quote:
"The GPDMA enables peripheral-to-memory, memory-to-peripheral, peripheral-to-peripheral, and memory-to-memory transactions."
Therefore you could stream data to your DAC from one of the separate SRAM blocks or from a different peripheral, whilst using the main SRAM for other functions.
This kind of peripheral-to-peripheral DMA is common in smaller parts where the memory interface is quite simple (compared to, say, a modern Intel processor).
If on a given cycle the processor and a DMA controller would need to access the same bus, one or the other would have to wait. Many systems, however, contain multiple areas of memory with separate buses along with a bus "bridge" that will allow the CPU to access one memory while the DMA controller accesses another.
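A toy model can illustrate the difference between sharing one memory and splitting traffic across two. This is only a sketch: the bank names `"main"` and `"ahb0"` are hypothetical labels, and the model merely counts cycles where both masters want the same bank, without modelling how the loser retries its access.

```python
def count_conflicts(cpu_accesses, dma_accesses):
    """Count cycles in which CPU and DMA target the same memory bank.

    Each list holds one entry per cycle: a bank name, or None when that
    bus master does not touch memory in that cycle.
    """
    return sum(
        1
        for cpu_bank, dma_bank in zip(cpu_accesses, dma_accesses)
        if cpu_bank is not None and cpu_bank == dma_bank
    )


# Hypothetical 8-cycle window: the CPU works out of "main" RAM throughout.
cpu = ["main"] * 8

# The DMA moves one word every other cycle.
dma_shared = ["main", None] * 4   # DMA buffer lives in the CPU's RAM
dma_split = ["ahb0", None] * 4    # DMA buffer lives in a separate bank

print(count_conflicts(cpu, dma_shared))  # 4
print(count_conflicts(cpu, dma_split))   # 0
```

In the shared case somebody has to wait on every DMA access; with the buffer placed in a separate bank the two masters never collide in this model.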
Further, many CPUs may not need to access a memory device on every cycle. If a CPU would normally only need to access memory on two out of three cycles, a low-priority DMA device may be able to exploit cycles when the memory bus would otherwise be idle.
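The idle-cycle idea can be sketched as a simple arbiter that always lets the CPU win and hands the bus to the DMA only when the CPU leaves it free. The function name and the fixed two-out-of-three access pattern are assumptions made for illustration; real CPUs have far less regular access patterns.

```python
def steal_idle_cycles(cpu_busy, dma_words):
    """Move DMA words only on cycles the CPU leaves the bus idle.

    cpu_busy: iterable of booleans, True when the CPU uses the bus.
    dma_words: number of words the DMA wants to move.
    Returns (words_moved, cycles_elapsed). The CPU is never delayed.
    """
    moved = 0
    cycles = 0
    for busy in cpu_busy:
        if moved >= dma_words:
            break
        cycles += 1
        if not busy:
            moved += 1
    return moved, cycles


# The CPU touches memory on two out of every three cycles, for 300 cycles.
pattern = [True, True, False] * 100

print(steal_idle_cycles(pattern, 64))    # (64, 192)
print(steal_idle_cycles(pattern, 500))   # (100, 300)
```

With this pattern the DMA gets one word through every three cycles at zero cost to the CPU, as long as the peripheral can tolerate that latency.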
Even in cases where every DMA cycle would cause the CPU to be stalled for a cycle, however, DMA may still be very helpful if data arrives at a rate which is slow enough that the CPU should be able to do other things between incoming data items, but fast enough that the per-item overhead needs to be minimized. If an SPI port were feeding data to a device at a rate of one byte every 16 CPU cycles, for example, interrupting the CPU for each transfer would likely cause it to spend almost all its time entering and returning from the interrupt service routine and none doing any actual work. Using DMA, however, the overhead could be reduced to about 13% (2 cycles out of every 16) even if each DMA transfer caused the CPU to stall for two cycles.
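The percentages in the SPI example can be checked with a few lines. The 16-cycle byte period and 2-cycle stall come from the paragraph above; the 12-cycle interrupt entry and exit costs are a hypothetical figure I have plugged in, since the actual latency depends on the core.

```python
def cpu_load(cycles_per_byte, cpu_cycles_per_byte):
    """Fraction of CPU time consumed per byte, capped at 100%."""
    return min(1.0, cpu_cycles_per_byte / cycles_per_byte)


BYTE_PERIOD = 16         # one SPI byte every 16 CPU cycles (from the text)
DMA_STALL = 2            # CPU stalled two cycles per DMA transfer (from the text)
ISR_OVERHEAD = 12 + 12   # hypothetical entry + exit cost, before any real work

print(f"DMA:       {cpu_load(BYTE_PERIOD, DMA_STALL):.1%}")     # 12.5%
print(f"interrupt: {cpu_load(BYTE_PERIOD, ISR_OVERHEAD):.1%}")  # 100.0%
```

With an entry and exit cost that already exceeds the byte period, the interrupt-driven version saturates the CPU (and in practice would drop data), while the DMA version leaves most of the CPU's time free for other work.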
Finally, some CPUs allow DMA to be performed while the CPU is asleep. Using an interrupt-based transfer would require that the system wake up completely for each unit of data transferred. Using DMA, however, it may be possible for the sleep controller to feed the memory controller a couple of clocks every time a byte comes in but let everything else stay asleep, thus reducing power consumption.