lonewolf - my further analysis ...
Clearly, a frame rate above about 60 frames/second is not necessary, because human persistence of vision is relatively slow. We just won't see any benefit above that sort of rate. So I assume frame rate is an indirect measure of how much time is available to do other things, like render text or images.
Assuming the UART interface is fast enough to reach 60 frames/second, the question becomes: does this approach leave more free time for the processor to do other things? Agreed?
There are three ways to use the UART:
1. Direct, polled UART
2. Interrupt driven
3. DMA driven
3. DMA driven
This technique isn't possible on Arduino's AVR, but both the Maple STM32 and the ChipKit PIC32 could do this.
Assuming the 'frame buffer' can be arranged conveniently, then DMA should use the least processor time, and leave the most 'free CPU time' for other processing. If you are lucky, it may be possible to set up DMA to continuously refresh the LED controller without any processor intervention, ever.
So this has the potential of 'orders of magnitude' less CPU load than any other technique.
1. Direct, polled UART.
I assume this uses the most CPU time to refresh the display controller of the three approaches, but, depending on the baud rate, it might still be comparable with bit-banging. A 'screen refresh' could be triggered by a timer interrupt.
2. Interrupt driven
This will take a bit of estimation. I'll assume the frame buffer data dominates, and that otherwise the UART sends a similar number of bytes to bit-banging.
Sending a single bit, using bit-band addressing, is, roughly:
// send one bit
*_data = (bits & msb) ? 1 : 0;
*_wr = 0;
*_wr = 1;
msb >>= 1;
- three pin writes are 3 x 2-cycle instructions = 6 cycles
- msb >>= 1 is 1 cycle
- (bits & msb) ? 1 : 0 is at least 1 cycle
So let's call that 8 cycles/bit.
Of course, it is a bit slower in reality because of loop overhead, and this may be faster than the HT3216 can receive data, but this'll do for a quick and dirty comparison (if the UART has lower overhead, then these other bit-banging overheads tilt the balance further towards the UART).
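For what it's worth, here is that tally written out as a small C sketch. The constant and function names are mine (hypothetical), and the per-instruction costs are just the estimates from this post, not measurements.

```c
#include <assert.h>
#include <stdint.h>

/* Estimated costs from the post (assumptions, not measured). */
enum {
    PIN_WRITE_CYCLES   = 2,  /* one bit-band store */
    PIN_WRITES_PER_BIT = 3,  /* data, WR low, WR high */
    SHIFT_CYCLES       = 1,  /* msb >>= 1 */
    TEST_CYCLES        = 1   /* (bits & msb) ? 1 : 0, at least */
};

/* Cycles to bit-bang one bit, ignoring loop overhead. */
uint32_t bitbang_cycles_per_bit(void)
{
    return PIN_WRITES_PER_BIT * PIN_WRITE_CYCLES + SHIFT_CYCLES + TEST_CYCLES;
}

/* Cycles to bit-bang a run of bits, ignoring loop overhead. */
uint32_t bitbang_cycles(uint32_t bits)
{
    /* guard against 32-bit overflow for absurdly large inputs */
    assert(bits <= UINT32_MAX / bitbang_cycles_per_bit());
    return bits * bitbang_cycles_per_bit();
}
```

So one byte costs roughly bitbang_cycles(8) cycles of CPU time, which is the number the UART approaches have to beat.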
Taking and returning from an interrupt is 12+12 cycles.
Writing a byte to the UART might be:
// write one byte to UART
UART_DATA_BUFFER = *current_buffer++; // must be a byte because the interrupt was on
if (current_buffer >= BUFFER_END) disable_UART_interrupt();
So, assuming the buffer contains all the data, this will get called repeatedly, until the whole 'frame buffer' has been sent.
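A minimal sketch of that handler, written so it can be run on a PC rather than on the target. Everything hardware-related here is a stand-in I made up for illustration: UART_DATA_BUFFER is just a variable, uart_tx_irq_enabled replaces the real interrupt-enable bit, and start_frame is a hypothetical helper. On a real STM32 or PIC32 these would be the vendor's register definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host-side stand-ins for the hardware (hypothetical, for illustration). */
volatile uint8_t UART_DATA_BUFFER;   /* pretend UART transmit data register */
volatile int uart_tx_irq_enabled;    /* pretend transmit interrupt enable bit */

const uint8_t *current_buffer;       /* next byte to send */
const uint8_t *buffer_end;           /* one past the last byte of the frame */

/* Point the handler at a frame buffer and enable the transmit interrupt. */
void start_frame(const uint8_t *frame, size_t len)
{
    assert(frame != NULL && len > 0);
    current_buffer = frame;
    buffer_end = frame + len;
    uart_tx_irq_enabled = 1;
}

/* Called once per transmit interrupt: hand one byte to the UART. */
void uart_tx_isr(void)
{
    UART_DATA_BUFFER = *current_buffer++;
    if (current_buffer >= buffer_end)
        uart_tx_irq_enabled = 0;     /* whole frame handed over, stop */
}
```

A timer interrupt could call start_frame once per refresh; after that the handler runs once per byte until the frame has been handed over.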
I estimate:
- load current_buffer: worst case 4 cycles
- load UART_DATA_BUFFER: worst case 4 cycles
- load *current_buffer: 2 cycles
- store *current_buffer to UART_DATA_BUFFER: 2 cycles
- current_buffer++: 2 cycles to store
- load BUFFER_END: worst case 4 cycles
- compare current_buffer with BUFFER_END: 1 cycle
- return without disabling interrupts (the common case): 1 cycle
That's about 20 cycles.
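Adding the 12+12 cycles for taking and returning from the interrupt, each interrupt (most common case) costs about 44 cycles and sends 8 bits. That is less than the 48 cycles it takes to bit-bang 6 bits.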
So even this may be faster than bit banging, but it is so close, it'd take careful analysis.
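To make the comparison concrete, here are the same estimates as a few lines of C. The names are hypothetical and the cycle counts are only the rough figures from this post, so treat the result as a ballpark.

```c
#include <assert.h>
#include <stdint.h>

/* Rough estimates from the post (assumptions, not measured). */
enum {
    IRQ_ENTRY_CYCLES       = 12,
    IRQ_EXIT_CYCLES        = 12,
    ISR_BODY_CYCLES        = 20,
    BITBANG_CYCLES_PER_BIT = 8
};

/* CPU cycles spent per byte when one interrupt sends one byte. */
uint32_t isr_cycles_per_byte(void)
{
    return IRQ_ENTRY_CYCLES + IRQ_EXIT_CYCLES + ISR_BODY_CYCLES;
}

/* CPU cycles spent per byte when bit-banging 8 bits. */
uint32_t bitbang_cycles_per_byte(void)
{
    return 8 * BITBANG_CYCLES_PER_BIT;
}

/* Cycles saved per byte by the interrupt approach, under these estimates. */
uint32_t cycles_saved_per_byte(void)
{
    /* the subtraction is unsigned, so check the ordering first */
    assert(isr_cycles_per_byte() <= bitbang_cycles_per_byte());
    return bitbang_cycles_per_byte() - isr_cycles_per_byte();
}
```

On these numbers the interrupt route saves about 20 cycles per byte, but the bit-banging figure ignores loop overhead and the interrupt figure uses worst-case loads, so the real gap could move either way.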
But it may be possible to do better!-)
The UART has two buffers: a data register, and the shift register which shifts out the data. According to RM0008, section 27.3.2 'Transmitter', the value written into the data register gets immediately copied to the shift register if there is no active transmission. So if the interrupt can be arranged to happen at the end of a transmission, two bytes can be copied into the UART for each interrupt.
The interrupt service routine will be slightly longer, but, I guess something like:
test UART data register empty, or that there is no transmission - about 6 cycles
load data byte if there is one, and store to UART data register - about 6 cycles
So for about 12 more cycles, a total of about 56 cycles, 16 bits are sent via the UART, in the time that bit-banging sends 7 bits.
So this might be significantly faster, i.e. less CPU load, than bit-banging.
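Here is a rough host-side sketch of the two-bytes-per-interrupt idea. It does not touch real hardware: sim_uart_t is a toy model of just the data register and shift register, and every name in it (sim_uart, sim_wire, next_byte, frame_end, pair_start_frame, uart_tx_complete_isr) is made up for illustration. It assumes the interrupt fires only when the transmitter is completely idle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of a two-stage transmitter (assumption, not a real peripheral). */
typedef struct {
    uint8_t data_reg;     /* byte waiting for the shifter */
    uint8_t shift_reg;    /* byte currently being shifted out */
    int data_reg_full;
    int shifter_busy;
    int tx_irq_enabled;
} sim_uart_t;

sim_uart_t sim_uart;
uint8_t sim_wire[16];     /* bytes that have left the UART, in order */
size_t sim_wire_len;

const uint8_t *next_byte; /* next byte of the frame to send */
const uint8_t *frame_end; /* one past the last byte of the frame */

/* A write goes straight to the shifter when idle, else waits in data_reg. */
void sim_uart_write(uint8_t byte)
{
    assert(!sim_uart.data_reg_full);  /* caller must check first */
    if (!sim_uart.shifter_busy) {
        sim_uart.shift_reg = byte;
        sim_uart.shifter_busy = 1;
    } else {
        sim_uart.data_reg = byte;
        sim_uart.data_reg_full = 1;
    }
}

/* Let simulated time pass until the transmitter is completely idle. */
void sim_uart_run_until_idle(void)
{
    while (sim_uart.shifter_busy) {
        assert(sim_wire_len < sizeof sim_wire);
        sim_wire[sim_wire_len++] = sim_uart.shift_reg;
        if (sim_uart.data_reg_full) {
            sim_uart.shift_reg = sim_uart.data_reg;
            sim_uart.data_reg_full = 0;
        } else {
            sim_uart.shifter_busy = 0;
        }
    }
}

/* Reset the model and point the handler at a frame. */
void pair_start_frame(const uint8_t *frame, size_t len)
{
    assert(frame != NULL && len > 0);
    sim_uart = (sim_uart_t){0};
    sim_wire_len = 0;
    next_byte = frame;
    frame_end = frame + len;
    sim_uart.tx_irq_enabled = 1;
}

/* Runs when transmission is complete: load up to two bytes per interrupt. */
void uart_tx_complete_isr(void)
{
    int written = 0;
    while (written < 2 && next_byte < frame_end && !sim_uart.data_reg_full) {
        sim_uart_write(*next_byte++);
        written++;
    }
    if (next_byte >= frame_end)
        sim_uart.tx_irq_enabled = 0;  /* nothing left to load */
}
```

The price of this scheme is that the line sits idle between the end of one pair and the handler loading the next, so it trades a little throughput for fewer interrupts.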
The STM32 has, in effect, a two byte buffer, so this could be quite efficient.
The PIC32 has a 4 or 8 byte deep buffer, and so should work even better.
Using interrupts might work on AVRs with more than one USART, but the cost of AVR instructions will be very different, and so the result may be much worse.
Observation: the fastest technique on an 8-bit AVR may be far from optimal on a 32-bit processor with more peripherals and DMA support.