High-Speed GPIO Access? « LeafLabs Garden


(39 posts) (9 voices)

Started 4 years ago by robodude666
Latest reply from hans leuthold


robodude666
Member
Howdy folks!

Got my Maple the other day, and I've been busy porting my QVGA LCD code from my ATXMega128A1 AVR! I've managed to get everything over and it seems to work well, except performance is a bit lacking in my opinion.

Here's a snippet of what I've done:
```
// ...

/*
	LCD's 8-bit parallel bus connected to PC0-PC7
	LCD's control pins connected to PB10-15
*/

#define LCD_CTRL_PORT GPIOB
#define LCD_DB_PORT GPIOC
#define LCD_WR_PIN 10

// Configure data direction
LCD_CTRL_PORT->regs->CRH = 0x33333333;
LCD_DB_PORT->regs->CRL = 0x33333333; // 50MHz output?

//...

// perform a benchmark
for(a crap ton of times)

	// some other stuff

	//benchmark only the time to write one pixel
	start = micros();

	LCD_DB_PORT->regs->ODR = ((LCD_DB_PORT->regs->ODR & 0xFFFF0000) | color);
	LCD_CTRL_PORT->regs->BRR = BIT(LCD_WR_PIN);
	LCD_CTRL_PORT->regs->BSRR = BIT(LCD_WR_PIN);

	end = micros();
	section1 += (end - start);

	// some other stuff

end

//averagetime = section1 / (a crap ton of times)
//print averagetime to LCD
```
I've run the benchmark above, and the average time to write a pixel is about 606ns (the benchmark as shown only times half a pixel). That works out to just over 1.65MHz per pixel (6 GPIO accesses: ODR write, BRR write, BSRR write, ODR write, BRR write, BSRR write). Ignoring the value of the upper 8 bits on the LCD data port, i.e. LCD_DB_PORT->regs->ODR = color;, speeds things up nearly 3x, to 216ns per pixel, i.e. about 4.63MHz. That's about 27.7MHz per GPIO access.
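The figures above can be checked with a couple of small helpers. This is only a desktop-side sketch of the arithmetic, not Maple code; the names `ns_to_mhz` and `per_access_ns` are made up for illustration.

```c
#include <assert.h>

/* Convert a period in nanoseconds to a rate in MHz (1000 / ns). */
double ns_to_mhz(double period_ns)
{
    assert(period_ns > 0.0);
    return 1000.0 / period_ns;
}

/* Split a measured time evenly across a number of GPIO accesses. */
double per_access_ns(double total_ns, int accesses)
{
    assert(accesses > 0);
    return total_ns / accesses;
}
```

With these, `ns_to_mhz(606.0)` is roughly 1.65, `ns_to_mhz(216.0)` is roughly 4.63, and `ns_to_mhz(per_access_ns(216.0, 6))` is roughly 27.8, matching the numbers quoted in the post.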

Am I doing something wrong, is there a better way of doing this, or are we not supposed to come close to the 50MHz output speed configured initially?

-robodude666
Posted 4 years ago #
gbulmer
Moderator
robodude666 -

Am I doing something wrong, is there a better way of doing this, or are we not supposed to come close to the 50MHz output speed configured initially?

That 50MHz refers to the speed that the external electrical signal driven by a GPIO pin changes from low to high, or vice versa; the rise-time. This configures the rate that the drive transistors work at.
That has nothing to do with the speed that the processor can change the state of a GPIO pin through software. Think of it as allowing you to put a load (resistor and capacitor) on the pin to slow it down.

You can configure the electrical signal speed (rise-time) because it can generate electrical noise, and will radiate like a radio transmitter (remember the bandwidth of old-fashioned PAL TV is about 4.5MHz, so 50MHz is quite fast). So it is a good idea to run the pin rise time more slowly if driving a piece of wire.
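For reference, the 0x33333333 written to CRL/CRH in the original snippet can be rebuilt from its parts. The sketch below assumes the STM32F1 layout described in RM0008 (four bits per pin: CNF in the upper two, MODE in the lower two, with MODE 1/2/3 selecting the 10MHz/2MHz/50MHz output drivers). The enum and function names are hypothetical and the code runs on a desktop, not on the Maple.

```c
#include <assert.h>
#include <stdint.h>

/* MODE field values (assumed from RM0008): output driver speed. */
enum { MODE_INPUT = 0x0, MODE_OUT_10MHZ = 0x1, MODE_OUT_2MHZ = 0x2, MODE_OUT_50MHZ = 0x3 };

/* CNF field values when the pin is an output. */
enum { CNF_OUT_PUSH_PULL = 0x0, CNF_OUT_OPEN_DRAIN = 0x1 };

/* Build a CRL or CRH value that gives all 8 pins the same CNF/MODE. */
uint32_t gpio_cr_all_pins(unsigned cnf, unsigned mode)
{
    assert(cnf <= 3 && mode <= 3);
    uint32_t nibble = (uint32_t)((cnf << 2) | mode);
    uint32_t value = 0;
    for (int pin = 0; pin < 8; pin++)
        value |= nibble << (4 * pin);   /* one 4-bit field per pin */
    return value;
}
```

`gpio_cr_all_pins(CNF_OUT_PUSH_PULL, MODE_OUT_50MHZ)` gives 0x33333333, and swapping in `MODE_OUT_2MHZ` gives 0x22222222. Only the edge rate of the pin driver changes; the number of instructions needed to write the pin is the same in both cases.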

The speed that you can change the state of a GPIO pin is determined by the speed the MCU executes Store (STR) instructions across the bus.

Grab a copy of ARM's Cortex-M3 Technical Reference manual:
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0337g/DDI0337G_cortex_m3_r2p0_trm.pdf
and look at Chapter 18 "Instruction Timing"
It explains that Load/Store instructions take 2 cycles for the first access, and 1 for subsequent accesses. That 1 can turn into 2 if the write buffer is full (which it might be).

The best I have achieved is 18MHz to toggle a pin repeatedly with some odd-looking code, but 12MHz was okay.

The processor will load, decode and execute an (ordinary) instruction in one or two cycles of the 72MHz clock.

Assuming the compiler can optimise this code fully:
```
LCD_DB_PORT->regs->ODR = ((LCD_DB_PORT->regs->ODR & 0xFFFF0000) | color);
LCD_CTRL_PORT->regs->BRR = BIT(LCD_WR_PIN);
LCD_CTRL_PORT->regs->BSRR = BIT(LCD_WR_PIN);
```
it is 4 or more instructions, so it will take at least 5 cycles: 72MHz / 5 = about 14.4MHz
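That estimate can be written down as a tiny best-case model: 2 cycles for the first load/store in a run and 1 for each one after it. It is a rough sketch only (it ignores write-buffer stalls, loop overhead and flash wait states), and the function names are invented for this example.

```c
#include <assert.h>

/* Best-case cycles for a run of back-to-back load/store instructions:
   2 for the first access, 1 for each subsequent one. */
unsigned min_store_cycles(unsigned accesses)
{
    assert(accesses > 0);
    return accesses + 1u;
}

/* Upper bound on how often a sequence of that length can repeat. */
double max_rate_mhz(double core_mhz, unsigned cycles)
{
    assert(core_mhz > 0.0 && cycles > 0);
    return core_mhz / cycles;
}
```

`max_rate_mhz(72.0, min_store_cycles(4))` gives 14.4, the figure above, and `max_rate_mhz(72.0, 4)` gives 18.0.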

Then you have the code for micros()
```
for(a crap ton of times)

	// some other stuff

	//benchmark only the time to write one pixel
	start = micros();

	// -- something that may run at 14.4MHz ---

	end = micros();
	section1 += (end - start);

	// some other stuff

end
```
which is
```
static inline uint32 micros(void) {
    uint32 ms;
    uint32 cycle_cnt;
    uint32 res;

    nvic_globalirq_disable();

    cycle_cnt = systick_get_count();
    ms = millis();

    nvic_globalirq_enable();

    /* SYSTICK_RELOAD_VAL is 1 less than the number of cycles it
       actually takes to complete a SysTick reload */
    res = (ms * US_PER_MS) +
        (SYSTICK_RELOAD_VAL + 1 - cycle_cnt)/CYCLES_PER_MICROSECOND;

    return res;
}
```
Which is *MUCH* slower than changing a pin value.

So, the first thing I would do is take the timing for each pixel *OUT* of the for loop
```
	// time the whole loop, not each individual pixel
	start = micros();
	// some other stuff
	for(a crap ton of times) {
		// -- something that may run at 14.4MHz ---
		LCD_DB_PORT->regs->ODR = ((LCD_DB_PORT->regs->ODR & 0xFFFF0000) | color);
		LCD_CTRL_PORT->regs->BRR = BIT(LCD_WR_PIN);
		LCD_CTRL_PORT->regs->BSRR = BIT(LCD_WR_PIN);
	}
	end = micros();
	section1 += (end - start);

	// some other stuff
```
time the whole for loop, then divide that time by 'a crap ton of times', and you'll get a value in the right ball park.

Then you will have only the time to change some GPIO pins, and the overhead of the for loop, and the overhead of one call of micros().
If you want to factor out the micros(), you can time micros() 'a crap ton of times' and see how long it takes. Make sure you do something with the value returned from micros() or it might get optimised away.
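The bookkeeping for that is short. The helper below is a hypothetical sketch (the name and parameters are not from libmaple): it takes the total time for the whole loop in microseconds, subtracts a separately measured overhead for the micros() call, and divides by the iteration count.

```c
#include <assert.h>
#include <stdint.h>

/* Average time per loop iteration in nanoseconds, after removing a
   separately measured timer overhead (both inputs in microseconds). */
double avg_ns_per_iteration(uint32_t total_us, uint32_t overhead_us,
                            uint32_t iterations)
{
    assert(iterations > 0);
    assert(total_us >= overhead_us);
    return (double)(total_us - overhead_us) * 1000.0 / (double)iterations;
}
```

For example, a loop of 10000 iterations that takes 5280us in total averages 528ns per iteration. Because micros() only has 1us resolution, the loop needs to run long enough that a microsecond or two of error is negligible.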

(full disclosure: I am not a member of LeafLabs staff).
Posted 4 years ago #
robodude666
Member
I'm sorry, but I'm lost as to what the 2/10/50MHz output setting actually means. The reference manual (RM0008) refers to it as the max. output speed. I understand you can't reach the maximum spec of a device, but with 2 instructions per GPIO access, I'd hope to at least get to ~25-30MHz.

I've gone ahead and tried:
```
for(a crap ton of times)
{

	start = micros();
	do {		

		LCD_DB_PORT->regs->ODR = ((LCD_DB_PORT->regs->ODR & 0xFFFF0000) | colora);
		LCD_CTRL_PORT->regs->BRR = BIT(LCD_WR_PIN);
		LCD_CTRL_PORT->regs->BSRR = BIT(LCD_WR_PIN);

		LCD_DB_PORT->regs->ODR = ((LCD_DB_PORT->regs->ODR & 0xFFFF0000) | colorb);
		LCD_CTRL_PORT->regs->BRR = BIT(LCD_WR_PIN);
		LCD_CTRL_PORT->regs->BSRR = BIT(LCD_WR_PIN);

		i += 2;
	} while(a few hundred times);

	end = micros();
	section1 += (end - start);

}

average time = ((section1) / (a crap ton of times * a few hundred times)) / 6;
```
And that's gotten me 528ns for one iteration of the entire do-while block, which is about 88ns for each of the six GPIO accesses (assuming the overhead from the loop is divided equally between them). That's about 11.4MHz. If I replace

LCD_DB_PORT->regs->ODR = ((LCD_DB_PORT->regs->ODR & 0xFFFF0000) | colora);

with

LCD_DB_PORT->regs->ODR = colora;

ignoring the fact that something else is on the upper 8 bits of my port, I can get performance up to 52ns per GPIO access, or about 19.23MHz, a ~70% improvement.
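One way to keep the plain-store speed without clobbering the upper pins is to write the byte through BSRR instead of doing a read-modify-write on ODR. The sketch below models that on a desktop with a fake register; it assumes the STM32F1 BSRR behaviour from RM0008 (low half sets pins, high half resets pins, and set wins if both are requested for the same pin). `fake_gpio`, `fake_bsrr_write` and `bsrr_word_for_low_byte` are illustrative names only.

```c
#include <assert.h>
#include <stdint.h>

/* Desktop-side stand-in for one port's output data register. */
typedef struct { uint32_t odr; } fake_gpio;

/* Model of a BSRR write: bits 0-15 set pins, bits 16-31 reset pins.
   If both are requested for the same pin, the set takes priority. */
void fake_bsrr_write(fake_gpio *port, uint32_t value)
{
    assert(port != 0);
    uint32_t set = value & 0xFFFFu;
    uint32_t reset = (value >> 16) & 0xFFFFu;
    port->odr = ((port->odr & ~reset) | set) & 0xFFFFu;
}

/* BSRR word that puts `byte` on pins 0-7 and leaves pins 8-15 untouched:
   request a reset of all eight low pins, then set the ones that are 1. */
uint32_t bsrr_word_for_low_byte(uint8_t byte)
{
    return (0x00FFu << 16) | byte;
}
```

Starting from an ODR of 0xA55A, writing `bsrr_word_for_low_byte(0x3C)` leaves 0xA53C: the low byte is replaced and the high byte survives, with no load from the port. On the hardware this would presumably be a single store along the lines of `LCD_DB_PORT->regs->BSRR = (0x00FF << 16) | color;`, though that has not been benchmarked here. Note also that the 0xFFFF0000 mask in the snippets above clears all 16 pin bits; a read-modify-write meant to preserve PC8-PC15 would need a mask of 0xFF00.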

Is the only way to get high GPIO access to bypass the software, and use something like the FSMC on the fancier STM32 chips? I can has a Maple Native?

One of my benchmarks I have is filling the screen with a single color repeatedly, a few thousand times. That benchmark yields about ~1 million pixels per second on my ATXMega128A1 @ 32MHz (32MIPS). The same test on the STM32 yields only ~60% of the performance (600 thousand pixels per second).

How much of a difference will the length of wire between the Maple and the device make to performance? I know that signals stretch out and square waves get distorted if the cable doesn't meet its bandwidth requirement (since a square wave is a combination of sine waves). But I figure that the ~12" of cable I'm using shouldn't affect performance much.

-robodude666

EDIT: When it doesn't work, rewrite the software! Preliminary changes show some improvement.
Posted 4 years ago #
trevden
Member
My test for this was an unrolled loop attached to a 'scope. The pin mode is set to 50MHz by default, I believe. So:
```
// BIT(6) should be collapsed to a constant by the optimizer
#define PIN_B6_HIGH (GPIOB_BASE)->BSRR = BIT(6)
#define PIN_B6_LOW  (GPIOB_BASE)->BRR  = BIT(6)

void setup() {
  pinMode(5, OUTPUT);  // B6 is pin 5
}

void loop() {
  myloop:
    PIN_B6_HIGH;
    PIN_B6_LOW;
    PIN_B6_HIGH;
    PIN_B6_LOW;
    PIN_B6_HIGH;
    PIN_B6_LOW;
    // ...
  goto myloop;
}
```
On the scope, this got me a sine wave of roughly 18MHz, i.e. 4 processor cycles per period, 2 processor cycles per transition. I didn't try to see if there was a way to get down to 1 cycle per transition, as this was fine for my application.

By the way, is there a straightforward way to get a disassembly of the compiled code? Would be useful for experimenting with this sort of timing-sensitive code.
Posted 4 years ago #
robodude666
Member

trevden,

I have a few unrolled loops where performance really matters. When filling the screen with a single color, I managed to reach 5.5 million pixels per second -- nearly 5x the performance of my ATXMega128A1 benchmarks. These performance improvements also help when rendering from RAM. When drawing just one pixel to the screen, for example when drawing a line, the performance stinks.

I've noticed that when compiling with libmaple the *disas file contains the assembly that is being used. It seems well optimized from what I can see, but then again I'm not too familiar with STM32's assembly commands.

I'm going to pull out my Logic analyzer soon and see what kind of performance I'm really getting instead of messing with micros/millis for benchmarking. Hopefully I can surpass the 25MHz capability of my analyzer!

-robodude666

Posted 4 years ago #
gbulmer
Moderator

robodude666 -

I'm sorry, but I'm lost as to what the 2/10/50MHz output setting actually means.

When a voltage transitions from 0.4V to 2.9V it takes time. You understand that a square wave can be treated as a sum of sine waves, and the faster the square wave, the higher the frequency of the sine wave components.
If a pin is driving a wire off-board, it will radiate energy, exactly like a radio transmitter. The faster the change between the two voltages the higher the frequency. This is 'a bad thing'. The radiated signal is noise which might interfere with other systems.
The facility to slow the voltage change is very helpful if you want to drive a piece of wire more than a few inches.

So even if the state of a pin is only changed, say, once each microsecond, the rate at which the voltage driven by the pin changes could be different. If the pin were set to 2MHz, its edges would have much lower frequency components than the same pin, still changed once per microsecond, but transitioning at 50MHz, i.e. 25 times faster.

How much of a difference will the length of wire between the Maple and the device make to performance? I know that signals stretch out and square waves get distorted if the cable doesn't meet its bandwidth requirement (since a square wave is a combination of sine waves). But I figure that the ~12" of cable I'm using shouldn't affect performance much.

USB Full-speed is 12MHz. It has a good ground/earth shield, and differential signalling and is good for a couple of metres.
With a good ground return, you might be okay. It is likely more of a problem for somebody with a radio nearby than for you :-)
It would likely fail any FCC tests, for example. With a 50MHz transition, it'll be radiating at only an octave below FM radio, so harmonics might have enough energy to be picked up on an FM radio (very easy test if you have a radio to hand).

Is the only way to get high GPIO access to bypass the software, and use something like the FSMC on the fancier STM32 chips? I can has a Maple Native?

As long as it is being accessed via the same instructions, the instruction timings for load and store are the same.
Unless the LCD can be memory mapped, in which case each access will do two jobs.
If you could devise a way to use the DMA controllers, then you might get a speedup.

One of my benchmarks I have is filling the screen with a single color repeatedly, a few thousand times. That benchmark yields about ~1 million pixels per second on my ATXMega128A1 @ 32MHz (32MIPS). The same test on the STM32 yields only ~60% of the performance (600 thousand pixels per second).

I've had a quick look at the ATXMega manual, and I believe some of the store instructions claim single cycle execution (ATmega was 2 cycles). So I am impressed, it may well be able to change a pin at 32MHz, which would give a pin toggle frequency of 16MHz; not much slower than the STM32F.

Further, the ATXmega does have some nifty load and store instruction addressing modes. For example, a byte can be loaded from memory, and a register, which contains the address of the data, can be incremented, all in a single instruction.
So, I am very surprised, but can just about imagine it might be faster than an STM32F for this job.

But, if that code is representative of filling the LCD with one colour, then I don't really understand why it is so slow. I'd expect it to be a couple of megapixels/second.

I'd have to understand the LCD better and read your code to give you good advice.
How fast can the LCD go?

My knee-jerk reaction is to read the code very carefully, and maybe try to use the whole 16 bits of a port, and/or DMA to go faster.

If you could line things (in time) up exactly, you might be able to drive the pin-toggled LCD_WR_PIN with a timer at a fixed frequency, and use DMA to pour colour data out the port. That would have the nice side effect that almost no processor is used. This would take some serious study of the manual.

trevden -

On the scope, this got me a sine wave of roughly 18MHz, i.e. 4 processor cycles per period, 2 processor cycles per transition.

Yes, that's what I got. It is pretty useless code. If all that is needed is a square wave at a fixed frequency on a pin, it would be almost as easy to program one of the timers.
As soon as actual data is retrieved and presented to the pin, the rate that the pin can be changed is slowed.
When I loaded data too, it came down to about 12MHz.

(full disclosure: I am not a member of LeafLabs staff).

Posted 4 years ago #
trevden
Member

It is pretty useless code.

The M3 can execute code in RAM, so you could generate a scanline's worth of unrolled bit-banging code in your off cycles, and then just call it when it's go time.

But then, the Apple IIgs demoscene was my babysitter growing up.

Posted 4 years ago #
robodude666
Member

Very interesting stuff. I've been browsing through the 1096 page RM0008 Reference manual while waiting for my class to start and read up on the DMA controllers.

From what I've gathered, I can map a block of SRAM to act as a buffer for a desired port's ODR with the auto-increment memory address mode enabled, as well as the interrupt. I can then write a bunch of stuff to my RAM buffer, enable the DMA channel. Once the transfer has completed, the interrupt will trigger where I can then disable the DMA channel. This is exactly the same functionality as the FSMC available in high-density STM32s (*cough* Maple Native *cough*), except the DMA only handles the data transfer between memory/peripheral to memory/peripheral. The FSMC goes a bit further and toggles the corresponding write/read/select bits for you as well (as the external LCD can be mapped to the FSMC as a 76,800x16-bit block of external SRAM). As you mentioned, in order for this DMA trickery to work I would have to toggle the Write pin via a timer, and the timing would have to be perfect to the DMA controller's auto-memory increment -- very unlikely.

ST has the "AN2548: Using the STM32F101xx and STM32F103xx DMA controller" application note available, which shows how a DMA channel can be used to acquire data from a GPIO using a clock signal. The control of the channel is done via a Timer's Input Capture (manually toggled by the STM32 itself, but could represent a clock from a device sending data). Could something similar be used, but in the opposite direction? A timer that triggers the DMA to write one value from the SRAM buffer to the GPIO. The timer would also have its own interrupt which would toggle the Write bit on the LCD so that it can accept the data. It would then repeat this for every value in the buffer. The program would have to block until the buffer is emptied, at which point more can be written into the buffer and the process repeated.

On a side note, have there been any performance benchmarks done on the usart dma test that's included in the libmaple examples folder? I'm wondering how much faster it is than traditional ways.

-robodude666

Posted 4 years ago #
trevden
Member

Seems like you're on the right track to me. I'm pretty sure you can hang multiple things off of one timer, and if not, there's also a way to synchronize timers.

Posted 4 years ago #
gbulmer
Moderator

the external LCD can be mapped to the FSMC as a 76,800x16-bit block of external SRAM

Okay, that could be a faster way to access it than via a port. Using DMA, there wouldn't be the overhead of reading instructions, it should be pure address generation and data movement.

How fast can the LCD be written to?

Posted 4 years ago #
robodude666
Member

The LCD controller is the ILI9325, and the i80-System Interface (page 108) timings show a minimum of a 50ns pulse per write, which is 20MHz, but I may be reading the timing diagram wrong.

Posted 4 years ago #
mbolivar
LeafLabs

robodude666,

On a side note, have there been any performance benchmarks done on the usart dma test that's included in the libmaple examples folder? I'm wondering how much faster it is than traditional ways.

No, I just made that to convince myself that the DMA implementation wasn't a complete crock, so I didn't bother measuring how fast it went.

Posted 4 years ago #
gbulmer
Moderator
robodude666 - I've had a skim through the datasheet you pointed out. There are several ways to write faster.

It is not clear to me if you transfer 16-bits of colour data or 18-bits.
As I understand your example code, it loads two 8-bit data values, so I assume it is 16 bits.

16-bits of data can be transferred in 2 load from memory and 2 store to GPIO pins instructions, and it takes another 4 store instructions to toggle the control pin twice, once for each 8 bits (I may be wrong, and maybe 12 instructions for 18 bits).

If 16 pins (16 bits of GPIO) were connected to the LCD, then it would only be necessary to toggle the control pin once.

The STM32F has 16 bit ports, so in theory a single load and store would transfer 16 bits of data.
Unfortunately, Maple consumes at least one pin on every 16-bit GPIO port. So if you could find a free 16-bit port on a Maple Native it would be faster because you could move 16 bits of data to one port of 16 GPIO pins in one load and one store.

I'll take the case of two 8-bit pieces of data to set the value on 16 GPIO pins in two ports.

Currently:
1 load and 1 store transfer the first 8 data bits to the 8 GPIO pins
2 stores toggle the control pin
1 load and 1 store transfer the second 8 data bits to the 8 GPIO pins
2 stores toggle the control pin a second time
A total of 8 instructions.

If the 16-bit interface is used, the data could be loaded in 2 instructions, stored in 2 instructions and a single toggle of the control pin (2 instructions) to enter it:
1 load and 1 store transfer the first 8 data bits to 8 of 16 GPIO pins
1 load and 1 store transfer the second 8 data bits to the other 8 of 16 GPIO pins
2 stores toggle the control pin once

The result would be 6 instructions instead of 8.

This would need an extra 8 pins to be connected.

If the software uses three steps to transfer 18 bits of pixel data, then the speedup is bigger. Instead of 12 instructions, it would need 8. This would need 18 interface pins; the full 18 bits is transferred in one step, 18 pins are connected to the LCD, and it takes a single toggle of the control pin to write the data.

There are other instructions in the loop to adjust pointers, so it isn't this quick, and remember loads and stores can take 2 cycles.
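The counts above all follow one pattern: each chunk of data costs one load plus one store, and each toggle of the control pin costs two stores. A minimal sketch of that bookkeeping follows; the function name is made up, and it counts instructions only, not cycles or loop overhead.

```c
#include <assert.h>

/* Instructions per pixel: (load + store) per data chunk,
   plus two stores per control-pin toggle. */
unsigned pixel_instr_count(unsigned data_chunks, unsigned strobes)
{
    assert(data_chunks > 0 && strobes > 0);
    return data_chunks * 2u + strobes * 2u;
}
```

`pixel_instr_count(2, 2)` is 8 (two 8-bit writes, two toggles), `pixel_instr_count(2, 1)` is 6 (16 pins, one toggle), `pixel_instr_count(3, 3)` is 12 and `pixel_instr_count(3, 1)` is 8 for the 18-bit case.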

I think it could be made a cycle faster by loading 16 bits of data with one instruction and writing to two different ports with two instructions. It could be made faster still by loading 32 bits of data, and writing 4 times. Depending on how you feel about it, it may be the last thing to try because it might take some assembler, though the compiler might be smart and do all the work:
```
uint32 pixels[...];
uint32* pixelptr = &pixels[...];
for (a scan line?)
    ...
    uint32 pixel = *pixelptr;
    GPIO_0to7 = (uint8)pixel; // assumes everything is set up perfectly to ignore
    GPIO_8to15 = (uint8)(pixel >>= 8);
    toggle CONTROL_PIN high then low
    ...
    GPIO_0to7 = (uint8)(pixel >>= 8);
    GPIO_8to15 = (uint8)(pixel >>= 8);
    toggle CONTROL_PIN high then low
```
Summary - make the data interface wider (use more GPIO pins) and hence reduce the number of instructions spent toggling the control pin. Read the data in wider pieces too.
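The "read the data in wider pieces" part can be tried out on a desktop. The sketch below does one 32-bit load and peels off four bytes, low byte first, in the same order as the pseudocode above; `unpack_word` is a hypothetical name, and on the Maple each `out[i]` would be a store to a GPIO port instead of an array.

```c
#include <assert.h>
#include <stdint.h>

/* Split one 32-bit word into four bytes, lowest byte first,
   so a single load feeds four 8-bit port writes. */
void unpack_word(uint32_t word, uint8_t out[4])
{
    assert(out != 0);
    for (int i = 0; i < 4; i++) {
        out[i] = (uint8_t)word;   /* take the current low byte */
        word >>= 8;               /* move the next byte down */
    }
}
```

Feeding it 0x44332211 yields 0x11, 0x22, 0x33, 0x44 in that order, so the byte order in the pixel buffer has to match the order the LCD expects.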

If interrupts can be switched off while moving pixel data, then moving data to the GPIO ports is deterministic; it takes exactly the same number of instructions, and the same amount of time every time. In that case, toggling the control pin could be handled by a timer, so the processor would only execute instructions to move pixel data. Maybe do this one scan line at a time, and take interrupts in between. This would be hard code to get right.

I think that the pixel data transfer and pin toggle could be handled by the DMA system and a timer. This may be possible, and maybe easier to do by using multiple DMA channels to write pixel data and the control signal. But that would be a fiendish bit of programming in either case.

My read of page 108 is a write cycle is 100ns, so data could be written at 10MHz, so you don't need to get any faster than 8 cycles :-)
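To put that limit in terms of pixels and frames, here is a small desktop-side calculation. It assumes the 100ns write cycle read off the datasheet above and a 320x240 (QVGA) panel; the function names are invented for this sketch.

```c
#include <assert.h>

/* Upper bound on pixels per second given the LCD's write cycle time
   and how many bus writes one pixel needs. */
double max_pixels_per_second(double write_cycle_ns, unsigned writes_per_pixel)
{
    assert(write_cycle_ns > 0.0 && writes_per_pixel > 0);
    return 1e9 / (write_cycle_ns * writes_per_pixel);
}

/* Full-screen refreshes per second at a given pixel rate. */
double frames_per_second(double pixels_per_second, unsigned width, unsigned height)
{
    assert(width > 0 && height > 0);
    return pixels_per_second / ((double)width * (double)height);
}
```

With two 8-bit writes per pixel, `max_pixels_per_second(100.0, 2)` is 5 million pixels per second, and `frames_per_second(5e6, 320, 240)` is about 65 full-screen fills per second. If the 100ns reading is right, that is the ceiling for the 8-bit interface regardless of how the MCU drives the pins.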

I had a quick look to see how the interface works, but I didn't see a way to connect the LCD as a memory-mapped device to an STM32F Flexible static memory controller (FSMC).
Posted 4 years ago #
robodude666
Member

Yeah, I'm using the LCD in 16-bit color mode with two 8-bit writes. I am limited to this because the LCD breakout board I have only brings out 8 of the data pins. Unfortunately, there are no good breakout boards on the market that do all 16-bit. The only good board is made by Embedded Artists... and you know how their prices are -- $120, yeah right! Until I find a good 16-bit parallel breakout board for this LCD, or end up designing my own (which is unlikely as I don't have the $ to make PCBs) I'm stuck with 8-bit mode.

I spoke to one of my professors yesterday about this, and he had an interesting idea. Instead of using the DMA controller with 8-bit data, use it with 16-bit data. The lower 8-bit would contain data, while the next couple bits would contain control bits. This would bloat the amount of DMA writes needed by 2x, but it may still be faster than using CPU time. For example, to send a single pixel (assuming we're in GRAM mode already and the x/y and limiting positions have already been selected, manually possibly):

color_lower | WR0
color_lower | WR1
color_upper | WR0
color_upper | WR1

You may have to toggle CS/RD/RS depending on what type of command/data is being sent, but it would allow the DMA to 100% control the LCD. At most 12 I/O would be required on a single GPIO port, and that's certainly possible on any of GPIOA, B, or even C -- A has 14, B has 15, and C has 12.
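The four-word sequence above can be generated ahead of time into a buffer for the DMA to walk through. This is a desktop-testable sketch only: the WR line is assumed to sit on bit 8 of the same port (a hypothetical pin choice), the other control lines are assumed to be held at fixed levels elsewhere, and `build_dma_words` is not a libmaple function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WR_BIT (1u << 8)   /* hypothetical: WR wired to pin 8 of the data port */

/* Expand each 16-bit pixel into four 16-bit port values, in the order
   given in the post: lower byte with WR low, lower byte with WR high,
   upper byte with WR low, upper byte with WR high.
   Returns the number of words written to `out`. */
size_t build_dma_words(const uint16_t *pixels, size_t count,
                       uint16_t *out, size_t out_len)
{
    assert(pixels != 0 && out != 0);
    assert(out_len >= 4 * count);
    size_t n = 0;
    for (size_t i = 0; i < count; i++) {
        uint16_t lo = (uint16_t)(pixels[i] & 0xFFu);
        uint16_t hi = (uint16_t)(pixels[i] >> 8);
        out[n++] = lo;
        out[n++] = (uint16_t)(lo | WR_BIT);
        out[n++] = hi;
        out[n++] = (uint16_t)(hi | WR_BIT);
    }
    return n;
}
```

For a single pixel 0xF81F this produces 0x001F, 0x011F, 0x00F8, 0x01F8. The buffer costs 8 bytes of RAM per pixel, so a full scanline of 320 pixels needs 2560 bytes; whether the lower or upper byte should go first depends on what the LCD controller expects.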

One other thought I had was using an external chunk of SRAM (~4Mbit) that could act as an external buffer for the LCD. A year or two ago I ran across this Design Note #028 published by AVRFreaks. It discusses using external SRAM with a low-I/O AVR using ripple counters to control the address lines. Because I already need 8-bits for data, this will add only a handful of extra pins.

I found a couple SRAM ICs from Cypress on Mouser that fit my requirements well, as well as some counters. The idea is to have incredibly fast sequential access. Random access will become incredibly slow, but I won't be doing it too much. Up/Down counters could also maybe be used to increase random access performance. Additionally, the upper 2-3 bits of the address could be manually controlled to create 4 or 8 "pages" to increase performance.

The way it would work would be:

The STM32 writes to the SRAM. Once a single frame has been "constructed", the counter is reset and all of the bits are shoved from the SRAM directly to the LCD. The only thing that needs to happen is toggling the ripple counter and the write bit on the LCD. I'm unsure what performance will be like, as generating a single frame would take longer due to random access being slow, but once the frame is constructed, updating the LCD will be relatively fast - DMA may even help here too.

Any thoughts on these two options? - Using a 16-bit DMA's lower 8-bit for data, and upper 8-bit for control bits, and using external SRAM as a buffer with ripple counters for fast sequential access.

-robodude666

Posted 4 years ago #

robodude666
Member

Hey guys,

Woke up today at 7AM, got my coffee, and started to mess around with the DMA stuff. Here's what I got so far:

```
#include "dma.h"
#include "gpio.h"

#include "wirish.h"

__attribute__((constructor)) void premain() { init(); }

#define DMA_DEV DMA1
#define DMA_CHANNEL DMA_CH6

#define BUF_SIZE 32
uint8 tx_buf[BUF_SIZE];

int main(void)
{

	GPIOA->regs->CRL = 0x33333333;

	// Fill buffer with goodies
	for(int i = 0; i < BUF_SIZE; i++)
	{
		tx_buf[i] = (i % 2 == 0) ? '0' + 0b110 : '1' + 0b100;
	}

	// Setup DMA
	dma_init(DMA_DEV);
	dma_setup_transfer(DMA_DEV, DMA_CHANNEL,
		&GPIOA->regs->ODR, DMA_SIZE_8BITS,
		tx_buf,            DMA_SIZE_8BITS,
		(DMA_MINC_MODE | DMA_CIRC_MODE | DMA_TRNS_CMPLT | DMA_FROM_MEM)
		);
	dma_set_num_transfers(DMA_DEV, DMA_CHANNEL, BUF_SIZE);

	// Setup Timer
	timer_set_mode(TIMER3, TIMER_CH1, TIMER_OUTPUT_COMPARE);
	timer_pause(TIMER3);
	timer_set_count(TIMER3, 0);
	timer_set_reload(TIMER3, 1);
	timer_resume(TIMER3);

	int isDMAEnabled = 0;

	while(1)
	{
		delay(100);

		if(isDMAEnabled == 0 && millis() >= 10000)
		{
			isDMAEnabled = 1;

			// Enable DMA Channel & Configure DMA on Timer
			dma_enable(DMA1, DMA_CHANNEL);
			timer_dma_enable_request(TIMER3, TIMER_CH1);
			timer_trigger_dma_enable_request(TIMER3);
			timer_set_dma_burst_length(TIMER3, 18);

			SerialUSB.println("DMA Channel Enabled!");
		}

		dma_channel_reg_map *ch_regs = dma_channel_regs(DMA_DEV, DMA_CHANNEL);
		uint8 isr_bits = dma_get_isr_bits(DMA_DEV, DMA_CHANNEL);

		SerialUSB.print("[");
		SerialUSB.print(millis());
		SerialUSB.print("] ISR bits: 0x");
		SerialUSB.print((int32)isr_bits, HEX);

		SerialUSB.print(" CCR: 0x");
		SerialUSB.print((int64)ch_regs->CCR, HEX);

		SerialUSB.print(" CNDTR: 0x");
		SerialUSB.print((int64)ch_regs->CNDTR, HEX);

		SerialUSB.print(" Buffer contents: ");
		for (int i = 0; i < (BUF_SIZE <= 16 ? BUF_SIZE : 16); i++)
		{
			SerialUSB.print('\'');
			SerialUSB.print(tx_buf[i]);
			SerialUSB.print('\'');
			if (i < BUF_SIZE - 1) SerialUSB.print(", ");
		}
		if(BUF_SIZE > 16) SerialUSB.print("...");

		SerialUSB.println();

		if(isr_bits == 0x7)
		{
			SerialUSB.println("** Clearing ISR bits.");
			dma_clear_isr_bits(DMA_DEV, DMA_CHANNEL);
		}

	}

	return 0;
}
```

It's based on the UART Example, but it uses GPIOA as the peripheral and the DMA Controller is sending from the memory to the peripheral. After 10 seconds, the DMA channel is enabled and bits start flying out of GPIOA. I have a Logic Analyzer connected to those pins, and capture @ 24MHz. The buffer gets filled with ASCII 5s and 6s (arbitrarily chosen), but for my purposes I only pay attention to the lower 3 bits. Because the bits toggle, a square wave is generated with an average HIGH/LOW time of about 149 nanoseconds... about 6.7MHz.

I'm not completely sure if I configured the timer correctly, but from what I understand the reload value is when the timer's interrupt is triggered. Setting it to 1 seems to result in the fastest output to GPIOA -- leaving the default 65k resulted in a pulse width of ~900ms. How exactly does this pulse time relate to the timer's configuration? I'd expect 1 count @ 72MHz to result in a 13.9ns timer, unless the default prescaler is not 1. I'm also unsure of what about the timer causes the DMA channel to push bits out: is it an interrupt, or an increment?
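On the question of how the reload value maps to a period, the usual relation for the STM32 general-purpose timers is that the counter wraps every (prescaler + 1) * (reload + 1) timer clocks. The sketch below just evaluates that formula on a desktop; it assumes a 72MHz timer clock and that the value passed to timer_set_reload() ends up in the auto-reload register, neither of which has been verified against this particular setup. `timer_update_hz` is a made-up name.

```c
#include <assert.h>

/* Rate at which the counter wraps (and so at which a per-period
   compare or update event can fire), in Hz. */
double timer_update_hz(double timer_clk_hz, unsigned prescaler, unsigned reload)
{
    assert(timer_clk_hz > 0.0);
    return timer_clk_hz / (((double)prescaler + 1.0) * ((double)reload + 1.0));
}
```

With a prescaler of 0, `timer_update_hz(72e6, 0, 1)` is 36MHz (about 27.8ns per event), and `timer_update_hz(72e6, 0, 65535)` is about 1.1kHz (about 910us per event). Under these assumptions a reload of 1 asks for events faster than the 149ns actually observed.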
