Yup.
I can confirm that you're able to use the FSMC address as your source/destination in a DMA transfer.
I was able to transfer 1,920 16-bit wide values from FSMC into a buffer in 485 microseconds. That's a read rate of about 3.96 MHz (per 16-bit value).
Without the DMA transfer, reading the address manually into your buffer (loop not unrolled; just one at a time as worst case) takes 1070 microseconds, which is a read rate of 1.79 MHz. So DMA is about 2.2x as fast, i.e. roughly a 120% performance gain :).
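For anyone who wants to check the arithmetic: the rates are just the number of values divided by the elapsed microseconds (values per microsecond is numerically MHz). A minimal sketch in plain C, meant to run on a PC rather than the board; the function names are mine and not from any library:

```c
#include <assert.h>
#include <stdio.h>

/* Values per microsecond, which is numerically the rate in MHz.
 * Hypothetical helper, not part of any STM32 library. */
double rate_mhz(unsigned count, unsigned micros)
{
    assert(micros > 0);
    return (double)count / (double)micros;
}

/* How many times faster the second timing is than the first. */
double speedup(unsigned slow_us, unsigned fast_us)
{
    assert(fast_us > 0);
    return (double)slow_us / (double)fast_us;
}

/* Print the figures for the 1,920-value FSMC read benchmark above. */
void print_read_summary(void)
{
    printf("DMA read:    %.2f MHz\n", rate_mhz(1920, 485));
    printf("Manual read: %.2f MHz\n", rate_mhz(1920, 1070));
    printf("Speedup:     %.2fx\n", speedup(1070, 485));
}
```

Call print_read_summary() from main() to print the three numbers; swap in your own counts and timings to compare other configurations.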
For speed reference.... Doing the same 1,920 16-bit value transfer between two SRAM (on the STM32) buffers takes 670 uS without DMA and 165 uS with DMA.
FSMC memory access via the DMA controller certainly gains you some performance, but it's not nearly as fast as the onboard memory. The larger your transfer size is, the faster it'll get transferred per value, but only up to a point. A FSMC to SRAM DMA transfer of 240 16-bit values took 65 uS. 1920 is 8x the size and only took 485 uS instead of the expected 520 uS (65*8). Doing 3072 transfers took 770 uS. With a buffer that's 1.6x larger we'd expect a transfer time of (485*1.6)=776 uS... Not much of a savings. So it's clearly not worth it to allocate a giant buffer on the STM32, as there's very little performance gain after some point. I'll stick to 1920 for my application, after I finish with these benchmarks :).
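The "expected" times above come from scaling a smaller measurement linearly. Here's a small sketch of that calculation in plain C (PC-side, not board code); expected_us and print_scaling are names I made up for illustration, and the timings are the ones measured above:

```c
#include <assert.h>
#include <stdio.h>

/* Time we'd expect for `count` values if cost scaled linearly
 * from a baseline measurement. Hypothetical helper. */
double expected_us(unsigned base_count, double base_us, unsigned count)
{
    assert(base_count > 0);
    return base_us * (double)count / (double)base_count;
}

/* Compare linear expectation against the measured FSMC -> SRAM DMA times. */
void print_scaling(void)
{
    double e1920 = expected_us(240, 65.0, 1920);   /* scaled from 240 values  */
    double e3072 = expected_us(1920, 485.0, 3072); /* scaled from 1920 values */

    printf("1920 values: expected %.0f uS, measured 485 uS, saved %.0f uS\n",
           e1920, e1920 - 485.0);
    printf("3072 values: expected %.0f uS, measured 770 uS, saved %.0f uS\n",
           e3072, e3072 - 770.0);
}
```

The saving shrinks from 35 uS on the first step to 6 uS on the second, which is why a bigger buffer stops paying off.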
I'm going to look into write speeds next. Followed by FSMC to FSMC transfers. But first, coffee.
EDIT:
The results are in, and they're quite surprising. Manually writing 1920 16-bit values (not unrolled) to FSMC took 830 uS, or a write rate of 2.3 MHz. That's quite a bit slower than the 16x unrolled loop the FSMC test demo uses, however... Doing it with DMA took only 325 uS, or a write rate of 5.9 MHz. That's just a little slower than the unrolled loop... However, the unrolled loop demo just wrote garbage values that weren't very useful. If you also unroll your manual write loop so it writes 16 values per iteration, it also goes down to 325 uS.
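In case "16x unrolled" isn't clear, here's roughly what I mean. This is a sketch and not my actual benchmark code: copy16_unrolled is a made-up name, and on the board one of the two pointers would point at the FSMC-mapped memory (the address depends on your board and bank setup, so I've left it out). The pointers are volatile so the compiler keeps every access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy `count` 16-bit values, 16 per loop iteration, then finish
 * whatever is left over one value at a time. */
void copy16_unrolled(volatile uint16_t *dst,
                     const volatile uint16_t *src,
                     size_t count)
{
    size_t i = 0;

    assert(dst != NULL && src != NULL);

    /* Main body: 16 copies per pass, so the loop counter and branch
     * are paid once per 16 values instead of once per value. */
    for (; i + 16 <= count; i += 16) {
        dst[i + 0]  = src[i + 0];
        dst[i + 1]  = src[i + 1];
        dst[i + 2]  = src[i + 2];
        dst[i + 3]  = src[i + 3];
        dst[i + 4]  = src[i + 4];
        dst[i + 5]  = src[i + 5];
        dst[i + 6]  = src[i + 6];
        dst[i + 7]  = src[i + 7];
        dst[i + 8]  = src[i + 8];
        dst[i + 9]  = src[i + 9];
        dst[i + 10] = src[i + 10];
        dst[i + 11] = src[i + 11];
        dst[i + 12] = src[i + 12];
        dst[i + 13] = src[i + 13];
        dst[i + 14] = src[i + 14];
        dst[i + 15] = src[i + 15];
    }

    /* Tail: handles counts that are not a multiple of 16. */
    for (; i < count; i++) {
        dst[i] = src[i];
    }
}
```

1920 is a multiple of 16, so for my buffer the tail loop never runs. How much this helps will depend on your compiler and optimization settings, so time it yourself.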
EDIT2:
Interesting! I went back and edited the STM32 SRAM to FSMC benchmark program and unrolled the manual read loop to 16x. Performance is actually FASTER than DMA for my 1920 values. Comes out to 445 (unrolled) vs 485 (DMA) uS.
But, there's something to keep in mind. While manually doing the transfers you're taking away CPU time from whatever your application is. DMA transfers happen behind the scenes, and supposedly have only a 1% CPU usage. So, in the roughly 485 microseconds you're waiting for the DMA transfer to read/write to FSMC, you can be processing user input, filtering analog data, etc. That still makes DMA worthwhile to implement if performance is a high priority.
For my particular application, I will be using the external SRAM to buffer frames for my LCD and then use DMA transfers to get data from the external SRAM, to an internal buffer, and dump it to GPIO. I'm considering possibly going with FSMC -> SPI directly instead of getting a parallel LCD.
Another cool thing you could do is have a FLASH/EEPROM chip and use a DMA transfer to load data into SRAM on bootup :)! Or, if we ever get SDIO working, buffer stuff off of an SD card.
Next I'm going to look into transferring data between parts of external FSMC-controlled memory. And then publish everything on gist/github.
-robodude666