Any tips for profiling C code? Practical tips for instrumenting, etc. I'm messing around with several implementations of the AES block cipher, and am curious about how well it runs.
Profiling C code
Posted 5 years ago #
looking at the assembly output from the compiler is probably the most direct, and requires the least extra tools.
from http://www.arm.com/products/processors/cortex-m/cortex-m3.php :
basically, the cortex-m3 has a 3 stage pipeline, which means it theoretically approaches 1 cycle per instruction (CPI). i say theoretically because when real-world issues of branching and interrupts come into play, that number increases a bit, but shouldn't be higher than 3 CPI for any given instruction (since there are 3 stages).
so, count the total number of instructions, and characterize your branching -- how many branch instructions you have, how frequently they're taken, and approximate number of repetitions of loops (AES is convenient for this since you have a known number of rounds depending on your key size)
i'm not sure what your background on architecture is, so this may be something you already know, but "branch speculation" is most likely a single bit that remembers whether or not the branch was taken the previous time a branch instruction occurred. typically, these "speculators" presume that the behavior will be consistent, so if you didn't take a branch last time, you probably won't this time either. this means it's guaranteed to be wrong at least once (but usually not more than twice) in the execution of a for loop. more complex branch predictors exist, but they're often more complex than their increase in effectiveness is worth, especially on an inexpensive microcontroller.
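To make the "wrong at least once, usually not more than twice" point concrete, here is a toy model of the one-bit scheme described above. It is only a sketch of that idea, not a claim about what the Cortex-M3 core actually implements, and the function name is made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy one-bit predictor: guess that the branch does whatever it did last
 * time, and count how often that guess is wrong.
 * outcomes[i] is true when the branch was taken on its i-th execution. */
unsigned count_mispredictions(const bool *outcomes, size_t n, bool initial_guess)
{
    assert(outcomes != NULL || n == 0);

    bool predicted = initial_guess;
    unsigned misses = 0;

    for (size_t i = 0; i < n; i++) {
        if (outcomes[i] != predicted) {
            misses++;
        }
        predicted = outcomes[i];   /* remember only the most recent outcome */
    }
    return misses;
}
```

For a 10-round loop whose back-edge is taken 9 times and then falls through, this gives 2 misses if the bit starts as "not taken" and 1 miss if it starts as "taken".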
in general, fewer instructions means it's faster, but you probably already knew that part ;)
Posted 5 years ago # -
if you use the unix toolchain, the disassembly is available after each "make" under build/maple.disas.
Posted 5 years ago # -
Different people mean different things by profiling C code.
Are you looking for function-level granularity, or statement level granularity?
Will execution counts (i.e. this was done N times) be sufficient, or do you want timing?
Would statistical sampling be okay, or must it be total coverage?
I have not tried it on Maple, but gcc has support for prof and gprof profiling.
It might be interesting to see how far that gets if you compile and link using -p or -pg (I wouldn't expect it to work, but it might be feasible to make it work).
If you have access to The Definitive Guide to ARM Cortex-M3 (second edition) by Joseph Yiu, there are a couple of chapters on debugging, which include an explanation of some tracing hardware built into the core. IIRC it uses a common instruction cycle counter, and the 'watch registers' could be used to measure and time a bunch of things. This same stuff is described in ARM Technical reference documents, but I found Yiu's book a bit easier to understand.
There are ARM simulators available as part of some of the commercial development systems. They would give statement-granularity cycle-accurate timing. They cost money, but would give definitive answers.
Failing those, and if it's adequate, you could make a "poor man's statement profiler" with some sneaky awk script, or flex, or ... (insert your favourite text manipulation technology). Add extra statement-count-code to every C statement, e.g. count[__LINE__]++, and an initialisation and 'dump' function. I haven't done that for a while, but, if you write C code in quite a disciplined way (one statement/line, and always use { and } to wrap around every part of a control statement) then it isn't too hard to do.
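A minimal sketch of what the instrumented output of such a script might look like, assuming the file is shorter than MAX_LINES lines. HIT, profile_reset, profile_dump and sum_to are hypothetical names picked for this example, and the dump goes through fprintf, which you would swap for whatever serial output you use on the board.

```c
#include <stdio.h>

#define MAX_LINES 1024   /* assumed upper bound on the instrumented file's length */

static unsigned long count[MAX_LINES];

/* Marker the script would paste in front of every statement. */
#define HIT() do { if (__LINE__ < MAX_LINES) count[__LINE__]++; } while (0)

/* Initialisation: clear all counters. */
void profile_reset(void)
{
    for (int i = 0; i < MAX_LINES; i++) {
        count[i] = 0;
    }
}

/* Dump: print "line: count" for every line that ran; returns the total hits. */
unsigned long profile_dump(FILE *out)
{
    unsigned long total = 0;
    for (int i = 0; i < MAX_LINES; i++) {
        if (count[i] != 0) {
            fprintf(out, "%4d: %lu\n", i, count[i]);
            total += count[i];
        }
    }
    return total;
}

/* Example of instrumented code: one statement per line, braces everywhere. */
unsigned sum_to(unsigned n)
{
    HIT(); unsigned total = 0;
    HIT(); for (unsigned i = 1; i <= n; i++) {
        HIT(); total += i;
    }
    HIT(); return total;
}
```

The marker on the for line counts entries into the loop, while the one inside the body counts iterations, so the two together give the average trip count.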
Failing those, use the 24-bit systick timer to get timing within your code. Get a rough idea how long the code you want to measure runs, then adjust the Systick prescaler to give you good resolution, and just grab values from it when you want to get 'lap times', and print at the end.
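A rough sketch of the 'lap time' idea, assuming the SysTick reload value is left at its 24-bit maximum and that no lap is longer than one full counter period. The function names and buffer size are invented for this example. The raw readings are only stored during the run, and the subtraction is done afterwards to keep the overhead inside the timed code small.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SYSTICK_RELOAD 0x00FFFFFFu   /* assumed: reload set to the full 24 bits */
#define MAX_LAPS       16            /* arbitrary buffer size for this sketch */

static uint32_t lap_raw[MAX_LAPS];
static size_t lap_count;

/* Store one raw counter reading. On the target the argument would be the
 * SysTick current value register, which ARMv7-M places at 0xE000E018. */
void lap_mark(uint32_t raw_value)
{
    if (lap_count < MAX_LAPS) {
        lap_raw[lap_count++] = raw_value & SYSTICK_RELOAD;
    }
}

/* Ticks between mark i and mark i+1. SysTick counts down, so the earlier
 * reading is the larger one; the mask undoes a single wrap through zero. */
uint32_t lap_ticks(size_t i)
{
    assert(i + 1 < lap_count);
    return (lap_raw[i] - lap_raw[i + 1]) & SYSTICK_RELOAD;
}
```

Call lap_mark at each point of interest, then print lap_ticks(0), lap_ticks(1), and so on once the run has finished.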
Timing mechanisms would have a 'Heisenberg effect', so avoid measuring the measuring!
HTH
Posted 5 years ago # -
basically, the cortex-m3 has a 3 stage pipeline, which means it theoretically approaches 1 cycle per instruction (CPI).
Almost.
The ARM Technical Reference Manual at http://infocenter.arm.com/help/index.jsp gives instruction timing.
Roughly it is:
- 1 cycle for in-register operations, excluding divide
- 2 to 12 for divide (data dependent)
- 2 cycles for 1st Load/Store to 'memory', then 1 for neighbouring load/store
- N for load/store multiple
- 1 to 4 cycles for branch: 1 if it's not taken, or 2-4 if it is (pipeline refill/branch speculation)
- 'barrier' can be 1 to 'cycles for pipeline refill'
- semaphore likely unused in your code
It is well within the capabilities of a text processing language (e.g. awk, flex+C, ruby, perl, ...) to annotate the assembler. You could do the estimates by hand, or have a second pass to roll up the counts.
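The roll-up step could look something like this sketch, which turns executed-instruction counts into a best-case and worst-case cycle estimate using the rough figures listed above. The struct and function names are invented for the example, and it leaves out load/store multiple, barriers and semaphores for brevity.

```c
/* Executed-instruction counts per timing class (dynamic, not static, counts). */
struct insn_counts {
    unsigned long alu;               /* in-register operations, excluding divide */
    unsigned long divide;            /* 2 to 12 cycles each, data dependent */
    unsigned long ldst_first;        /* first load/store of a run: 2 cycles */
    unsigned long ldst_following;    /* neighbouring load/store: 1 cycle */
    unsigned long branch_not_taken;  /* 1 cycle */
    unsigned long branch_taken;      /* 2 to 4 cycles */
};

struct cycle_range {
    unsigned long min;
    unsigned long max;
};

/* Sum the classes with fixed cost once, then add the variable-cost classes
 * at their cheapest and most expensive figures. */
struct cycle_range estimate_cycles(const struct insn_counts *c)
{
    unsigned long fixed = c->alu
                        + 2 * c->ldst_first
                        + c->ldst_following
                        + c->branch_not_taken;
    struct cycle_range r;

    r.min = fixed + 2 * c->divide + 2 * c->branch_taken;
    r.max = fixed + 12 * c->divide + 4 * c->branch_taken;
    return r;
}
```

With 100 ALU ops, 2 divides, 10 first and 5 neighbouring loads/stores, 1 untaken and 9 taken branches, the estimate comes out as 148 to 186 cycles.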
You could mix techniques, using one to direct or validate another. For example, use statement counts to identify likely hot-spots, use cycle-counters as timers to get the actual time, or use assembler roll-ups.
A quick google suggests there may be an Open Source ARM simulator, but I haven't pursued it yet.
Posted 5 years ago # -
Didn't realize it had branch prediction; the pipeline is pretty short, and then there's the conditional execution of instructions. I wonder how good the compiler is at translating the C into conditionally executed assembly.
gbulmer, I'll have to check out the book.
I was looking at timings and at a functional level; I'm used to TI's tools for doing all this stuff (in conjunction with an RTOS). I'll have to read up on the GNU tools too.
Generally, I don't want to stare at assembly code and prefer to benchmark on actual hardware. I was going to run the different implementations (and variations on those) on several test vectors and put the times into a spreadsheet. Good to know there's something more than just instrumenting with counters/timers.
Posted 5 years ago # -
Hi larryang,
You can set up a timer to time your functions, depending on what resolution and timescale you're looking for. You can use either the STM32 timers, or the 24-bit core systick timer. Longer running benchmarks may need to take care of overflow.
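One way to handle overflow on longer runs is to count how many times the timer has wrapped and fold that into the total. The sketch below assumes a down-counter like SysTick and a wrap count that you increment yourself, for example in the timer's interrupt handler; the function name is made up. Note that reading the wrap count and the counter is not atomic, so a real version has to cope with a wrap happening between the two reads.

```c
#include <stdint.h>

/* Total ticks since the counter was started, for a down-counter that reloads
 * to `reload` after reaching zero. `wraps` is the number of reloads so far
 * and `current` is the present counter value. */
uint64_t systick_total_ticks(uint32_t wraps, uint32_t current, uint32_t reload)
{
    uint64_t period = (uint64_t)reload + 1u;   /* ticks in one full cycle */

    return (uint64_t)wraps * period + (reload - current);
}
```

Two full wraps of a 24-bit counter plus 100 ticks into the third period gives 2 * 16777216 + 100 = 33554532 ticks.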
Posted 5 years ago # -
perry - have you used the DWT cycle count (DWT_CYCCNT) 'timer'?
That is 32 bit, and so has more dynamic range than any other single timer, though it has no prescaler. There are also 'counters' in the DWT to exclude cycles spent in exceptions etc., so the measurements could be made pretty meaningful.
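For reference, a minimal sketch of switching on the cycle counter. The register addresses and bit positions are the ARMv7-M ones as I understand them (DEMCR at 0xE000EDFC with TRCENA in bit 24, DWT_CTRL at 0xE0001000 with CYCCNTENA in bit 0, DWT_CYCCNT at 0xE0001004), so check them against the ARM documentation before relying on them. The struct and function names are made up, and the registers are passed in as pointers so the same code can be tried on a PC with dummy variables.

```c
#include <stdint.h>

#define DEMCR_ADDR          0xE000EDFCu   /* Debug Exception and Monitor Control */
#define DWT_CTRL_ADDR       0xE0001000u
#define DWT_CYCCNT_ADDR     0xE0001004u
#define DEMCR_TRCENA        (1u << 24)    /* enables the DWT unit */
#define DWT_CTRL_CYCCNTENA  (1u << 0)     /* starts the cycle counter */

/* Pointers to the three registers involved; hypothetical helper type. */
struct dwt_regs {
    volatile uint32_t *demcr;
    volatile uint32_t *ctrl;
    volatile uint32_t *cyccnt;
};

/* Enable tracing, zero the counter, then start it counting. */
void cyccnt_enable(const struct dwt_regs *r)
{
    *r->demcr |= DEMCR_TRCENA;
    *r->cyccnt = 0;
    *r->ctrl |= DWT_CTRL_CYCCNTENA;
}

/* CYCCNT counts up, so unsigned subtraction gives the elapsed cycles even
 * across a single 32-bit wrap. */
uint32_t cyccnt_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}
```

On the target you would fill the struct with the addresses above cast to volatile uint32_t pointers, call cyccnt_enable once, and read *cyccnt before and after the code being measured.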
Posted 5 years ago # -
gbulmer,
I didn't know that existed. Thanks for the tip!
Posted 5 years ago #