I spoke to one of my professors yesterday about this, and he had an interesting idea. Instead of using the DMA controller with 8-bit data, use it with 16-bit data. The lower 8 bits would contain the data, while the next few bits would carry the control signals. This would double the amount of data the DMA has to move, but it may still be faster than using CPU time.
I had the same thought, but I decided against it: a Maple might not have enough RAM to store the whole image (and the requirement just doubled, because the control signal needs to be stored too), so the processor would probably need to build up a "scan line" for every displayed line. If that is the case, the instructions to interleave the data with the control bits would probably run slower than just picking up the data and writing it to the GPIO.
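The interleaving step being debated here might look something like the sketch below (plain C, runnable on a host for illustration; the control-bit position and the `CTRL_WR_STROBE` name are assumptions for the example, not taken from any real LCD driver):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical control bit riding in the upper byte of each 16-bit
 * DMA word; the real bit assignment would depend on how the LCD's
 * control lines are wired to the GPIO port. */
#define CTRL_WR_STROBE (1u << 8)

/* Build one scan line of 16-bit DMA words from 8-bit pixel data.
 * The lower 8 bits carry the pixel byte, the upper bits carry the
 * control signal. */
static void build_scan_line(const uint8_t *pixels, uint16_t *dma_line,
                            size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* This per-pixel OR-and-store is the overhead in question:
         * it may well cost more cycles than simply writing the byte
         * straight to the GPIO output register. */
        dma_line[i] = (uint16_t)pixels[i] | CTRL_WR_STROBE;
    }
}
```

Each 8-bit pixel byte now occupies a 16-bit word in the scan-line buffer, which is exactly where the 2x RAM cost comes from.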
But using DMA with a timer generating the control signals would reduce the amount of RAM needed (good) and avoid using the processor to interleave data and control (good).
One other thought I had was using an external chunk of SRAM (~4Mbit) that could act as an external buffer for the LCD ...
Yes, you could do that.
This is exactly like an old-fashioned PC display adapter, with a 'random access port' for updating the screen RAM and a 'serial port' for generating the display.
This is the sort of job Oak should be able to do, with the FPGA handling all of the external logic. But unless LeafLabs have one ready, I guess that isn't an option.
I still think software should be able to push a few million pixels/second to that display in a 'benchmark', so I don't understand what is so slow. My outline code is 12 clock cycles for a 16-bit pixel plus some housekeeping; triple that, and it's still only 36 cycles/pixel, which at the Maple's 72 MHz should yield 2 million pixels/second.
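As a sanity check on that arithmetic (assuming the Maple's standard 72 MHz core clock, and treating 36 cycles/pixel as the pessimistic 3x estimate from above):

```c
#include <stdint.h>

/* Throughput estimate: core clock divided by cycles spent per pixel.
 * This is only a back-of-the-envelope bound; it ignores flash wait
 * states, bus contention, and any per-line housekeeping. */
static uint32_t pixels_per_second(uint32_t clock_hz, uint32_t cycles_per_pixel)
{
    return clock_hz / cycles_per_pixel;
}
```

At 36 cycles/pixel this gives 2 million pixels/second; the original 12-cycle inner loop would bound out at 6 million.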
I assume the display is storing the unchanged parts, so what are you trying to do that needs so much performance? For example, if it is video data, can't you find a way to stream it directly to the LCD and miss out the processor entirely?
What is it you are trying to do?