Hiya!
The document is very sparse on optimization details, but the good news is that it lists timing tables for all of the operations, e.g. a 1024-point complex FFT takes 2.2 ms @ 72 MHz. That's approx. 465K samples per second. So if Fs = 48 kHz, it's running at about 10x the sample rate, which is good, because a typical FFT/IFFT loop requires 4 such operations per block of new samples (e.g. FFT plus IFFT, twice over with 50% overlap). That puts it at roughly 2.4x the speed needed for a mono FFT/IFFT loop, so stereo is likely even possible. There are also ways to squeeze out more: for example, if the code could be modified to work only on real input and output, it could perhaps be 50% faster, but so far that's not necessary, and the energy can be spent on more fun things :)
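Here's a quick sanity check of that headroom estimate, as a rough Python sketch. The only inputs are the figures quoted above (1024 points, 2.2 ms, 48 kHz, 4 transforms per block); the function name and the `channels` parameter are just made up for illustration, and it ignores windowing, overlap-add and any other per-block overhead.

```python
# Back-of-envelope FFT throughput budget, using the figures quoted above.
FFT_POINTS = 1024          # complex FFT length from the timing table
FFT_TIME_S = 2.2e-3        # 2.2 ms per transform at 72 MHz
SAMPLE_RATE_HZ = 48_000    # Fs
TRANSFORMS_PER_BLOCK = 4   # FFT + IFFT, done twice because of 50% overlap


def fft_headroom(points, fft_time_s, fs_hz, transforms_per_block, channels=1):
    """Return how many times faster than real time the FFT/IFFT loop runs.

    Hypothetical helper: counts only the transforms themselves, nothing else.
    """
    samples_per_second = points / fft_time_s   # raw throughput of one FFT
    raw_ratio = samples_per_second / fs_hz     # one transform vs. sample rate
    return raw_ratio / (transforms_per_block * channels)


mono = fft_headroom(FFT_POINTS, FFT_TIME_S, SAMPLE_RATE_HZ, TRANSFORMS_PER_BLOCK)
stereo = fft_headroom(FFT_POINTS, FFT_TIME_S, SAMPLE_RATE_HZ,
                      TRANSFORMS_PER_BLOCK, channels=2)

print(f"mono headroom:   {mono:.2f}x")
print(f"stereo headroom: {stereo:.2f}x")
```

With these numbers the mono loop comes out a bit over 2.4x real time and stereo a bit over 1.2x, so stereo fits on paper but leaves little margin for anything besides the transforms.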
P.S. The PID loop executes in 750 ns! And the doc says "Analysis of the PID timing shows that assembly code is not as fast as C code. The compiler is more efficient in accessing variables than manual optimization (offset computation and data placement in literal pool)."
As for the bootloader, are you releasing the source to it? If so, I think the easiest way is to make my own subtle variation on it, one that requires a button to be held down during boot-up to enter image upgrade mode. Will that be straightforward?