There's libmaple support for this via HardwareSPI::beginSlave().
As there are two SPI peripherals on Maple (and three on RET6 and Maple Native), it should be practical to generate or receive two (or three) pins' worth of input/output by running two (or three) SPI interfaces as slaves, synchronised to the same timer output, at up to 18MHz.
I'd start with the table technique mentioned here: http://forums.leaflabs.com/topic.php?id=1060#post-6506
After all, there are only 16224 bits, so you could generate an initialised data table/array (in flash) long enough (2*16224*sizeof(uint32) bytes) to be written to the port, wiggling the pins for the entire 16224-bit data exchange.
You could even generate a very long function, made of 2*16224 lines of the form (GPIOB_BASE)->BSRR = ...;
(plus 'noops' to get the timing right)
I believe the data-table-based technique should be capable of going faster than 1.6MHz, so it might need 'noops' to slow it down to the correct speed.
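To make the table idea a bit more concrete, here is a rough sketch of how the table could be built and replayed. Treat it as an illustration under assumptions, not tested Maple code: the names build_bsrr_table, play_table, bsrr_set and bsrr_reset are made up for this example, the pin numbers are arbitrary, and I'm assuming the data pin changes while the clock goes low and is sampled on the rising edge. It builds the table at run time so it can be checked on a PC; on the board you would more likely generate it offline as a const array in flash and pass &GPIOB_BASE->BSRR as the register. It relies on the STM32F1 BSRR layout, where bits 0..15 set a pin and bits 16..31 reset it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// STM32F1 BSRR layout: writing 1 to bit n (0..15) sets pin n,
// writing 1 to bit n+16 resets pin n. Zero bits leave pins unchanged.
inline uint32_t bsrr_set(unsigned pin)   { assert(pin < 16); return 1u << pin; }
inline uint32_t bsrr_reset(unsigned pin) { assert(pin < 16); return 1u << (pin + 16); }

// Hypothetical helper: emit two BSRR words per data bit.
//   word 0: clock low, data pin driven to the bit value
//   word 1: clock high, data pin left unchanged
// (Assumed clock phase; swap the two words' clock levels if yours differs.)
std::vector<uint32_t> build_bsrr_table(const std::vector<bool>& bits,
                                       unsigned clk_pin, unsigned data_pin) {
    std::vector<uint32_t> table;
    table.reserve(2 * bits.size());
    for (bool bit : bits) {
        uint32_t data_word = bit ? bsrr_set(data_pin) : bsrr_reset(data_pin);
        table.push_back(bsrr_reset(clk_pin) | data_word);
        table.push_back(bsrr_set(clk_pin));
    }
    return table;
}

// Hypothetical helper: replay the table by writing each word to the
// port's BSRR register. 'bsrr' is passed in as a pointer so the same
// code can be pointed at a plain variable when testing off-target.
void play_table(volatile uint32_t* bsrr, const uint32_t* table, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        *bsrr = table[i];
        // Timing padding (the 'noops') would go here to hit the target rate.
    }
}
```

For the full 16224-bit exchange the table is 2*16224*4 = 129792 bytes (just under 127 KiB), so it is worth checking that against the flash actually available on your board before committing to this approach.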
(full disclosure: I am not a member of LeafLabs staff.)