I'm not sure I understand how you want to do the profiling, or what you want to get out of it.
Do you want to run an emulator on your PC, and get some information from it?
Or do you want to run on the Maple's STM32F103, and get some information from it?
What is the information you want?
Have you got access to an oscilloscope?
Assuming the program is processing a real time data stream, flowing from some inputs to some outputs, I'd be tempted to add a little extra code.
The extra code could be on the input or output side of the application. All it's trying to do is establish a timing marker. For example, it could toggle an unused digital pin each time it executes. Measure that on an oscilloscope. The frequency tells you how long it takes to run through all the code (with one toggle per pass, a full cycle of the square wave covers two passes), and the mark-space ratio would indicate whether the code runs at a consistent rate. A storage scope would let you look at this in detail, but an ordinary scope would let you see the frequency, and hence the duration.
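To make the scope arithmetic concrete, here is a minimal sketch in C of turning the readings into timings. It assumes the pin is toggled exactly once per pass; the function names are made up for illustration and are not part of any Maple library.

```c
#include <stdio.h>

/* Average time for one pass through the code, in seconds.
 * Assumption: the pin is toggled exactly once per pass, so one full
 * square-wave period on the scope spans two passes. */
double pass_time_s(double scope_freq_hz)
{
    return 1.0 / (2.0 * scope_freq_hz);
}

/* Mark-space ratio: high time divided by low time. With one toggle per
 * pass these are the durations of two consecutive passes, so a value
 * near 1.0 means the two passes took about the same time. */
double mark_space_ratio(double high_s, double low_s)
{
    return high_s / low_s;
}

/* Print both figures for one scope reading (hypothetical helper). */
void report(double scope_freq_hz, double high_s, double low_s)
{
    printf("pass time:  %.2f us\n", pass_time_s(scope_freq_hz) * 1e6);
    printf("mark-space: %.2f\n", mark_space_ratio(high_s, low_s));
}
```

For example, a 50 kHz square wave would mean about 10 us per pass. If you pulse the pin high and low once per pass instead of toggling, drop the factor of 2.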
I did some experiments and generated a square wave at about 12 MHz out of C (I'll try to remember to check). Maple's STM32F103 can generate higher frequencies, but the code is a bit specialised, and less useful for this.
Assuming the sample data rate is significantly lower, i.e. less than 1 MHz, then the marker code should have less than 10% impact. This may work okay for an initial test, and may be enough for your purposes.
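The "less than 10%" figure is just the ratio of the two rates. A rough sketch of that estimate in C follows. The function name is hypothetical, and it treats one full cycle of the fastest square wave as the cost of one marker, which is my reading of the estimate rather than a measured figure.

```c
/* Rough estimate of the fraction of each sample period spent on the
 * timing marker.
 * Assumption: one marker costs about one period of the fastest square
 * wave the pin can produce (marker_rate_hz), and runs once per sample. */
double marker_overhead(double sample_rate_hz, double marker_rate_hz)
{
    return sample_rate_hz / marker_rate_hz;
}
```

With a 1 MHz sample rate and the 12 MHz figure, marker_overhead(1e6, 12e6) comes to about 0.083, i.e. roughly 8%. Lower sample rates scale the overhead down in proportion.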
If you have a signal analyser, you could probably do better, but I don't have one (yet :-).
If you had a second Maple, you could take the generated square wave in as an external timer signal, and count it over some period (maybe stacking timers for longer time periods). This would give an overall average. Depending on what the application does, you may not need a second Maple, and could just use the timers in one. This would take a bit more care, as it could easily introduce a 'Heisenberg effect', where the measurement itself distorts the outcome.
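As a sketch of the arithmetic on the measuring side, assuming the second Maple counts rising edges of the square wave over a known gate time, and that two 16-bit timers are stacked so one counts overflows of the other. The function names are invented for illustration; reading the actual timer registers is left out.

```c
#include <stdint.h>

/* Combine two stacked 16-bit counters into one 32-bit edge count.
 * Assumption: 'high' counts overflows of 'low'. */
uint32_t stacked_count(uint16_t high, uint16_t low)
{
    return ((uint32_t)high << 16) | low;
}

/* Average time per pass, in seconds, over the gate period.
 * Assumption: the pin toggles once per pass, so each rising edge
 * counted corresponds to two passes through the code. */
double avg_pass_time_s(uint32_t rising_edges, double gate_s)
{
    if (rising_edges == 0) {
        return 0.0; /* no edges seen, nothing to average */
    }
    return gate_s / (2.0 * (double)rising_edges);
}
```

For example, 100000 rising edges in a 1 second gate would work out to an average of 5 us per pass. This only gives the average; it says nothing about jitter between passes.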
You could go further with a second Maple, but it really depends what it is you want to know. An STM32F is quite a powerful analysis tool.
Edit - I believe I decided the maximum frequency was 9 MHz if the address of the pin needs to be loaded each time (the 12 MHz figure had the address loaded once), and that might flush a register. I think a single high/low pulse would have lower overall overhead than toggling each time round.