pyrohaz - I think ala42 identified the most important questions.
There are some great quotes about program optimisation on Wikipedia.
My favourite is: "The First Rule of Program Optimization: Don't do it. The Second Rule of Program Optimization (for experts only!): Don't do it yet."
If you are interested in understanding how to improve code performance, I would still recommend Jon Louis Bentley's "Writing Efficient Programs" (if you can find a copy), or his ACM column, "Programming Pearls", or the books based on that column. He identifies six 'levels' at which a program can be optimised; three are software and three are hardware. Trying to code in assembler instead of a high-level language corresponds to one of those levels, 'translation'. Based on a quick glance, this seems an okay summary: http://www.crowl.org/lawrence/programming/Bentley82.html though it misses that important abstraction.
I think it's important to understand that writing some in-line assembler may worsen program performance. Some people seem to think it is at worst 'net neutral', but it is not; performance can get much worse. Mixing two different languages, C/C++ and assembly language, also makes code harder to understand, change and maintain. That cost may be catastrophic.
The compiler does not analyse your assembler code. So it can't move it to a more efficient place (maybe outside a loop, or into a limb of an if which isn't executed every time round the loop), it must leave the registers your assembler has 'reserved' free for the assembler's use (when it might have made better use of them), and it must make memory look 'right enough' for the assembler to work. The compiler likely understands the processor pipeline, and can sequence instructions to exploit it (IMHO, if you have no idea what this means, please don't attempt to write any assembler until you do). Writing in-line assembler generally pokes a hole in the compiler's analysis of the C/C++ code, which is a 'bad thing'. Even a single line of badly judged assembler could force the compiler to generate code which runs a couple of times slower.
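To see what that 'hole' costs, here is a minimal sketch assuming GCC-style extended asm (the function is invented for the example). The asm statement emits no instructions at all, yet its clobber list forces the compiler to reload values from memory and to keep nothing in r4/r5 across every iteration, so the loop is optimised far less aggressively than the plain C version:

#include <stdint.h>

uint32_t sum(const uint32_t *p, uint32_t n)
{
    uint32_t s = 0;
    for (uint32_t i = 0; i < n; i++) {
        /* does nothing at run time, but the clobbers poke the 'hole' */
        __asm volatile ("" ::: "r4", "r5", "memory");
        s += p[i];
    }
    return s;
}

Compare the listings generated with and without the asm line to see the effect.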
Before attempting to write any assembler, make sure you know how to get a listing (from the tool chain) of the assembly language produced by the compiler, mingled with the source code, and learn to read it. IMHO, until you've understood this, making assembler changes is unlikely to be rational or to improve program performance.
Further, you need a reliable way to make accurate measurements of the code. At the level of improving an if test, that measurement system needs to be sensitive to a couple of machine cycles (i.e. about 1/72 of a microsecond at 72MHz).
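On a Cortex-M3/M4, one workable mechanism is the DWT cycle counter. The sketch below assumes the CMSIS core header (the DWT and CoreDebug names come from CMSIS), and "stm32f10x.h" is only a guess at the device header for a 72MHz part, so treat it as an outline rather than drop-in code:

#include "stm32f10x.h"   /* assumption: CMSIS device header for your part */

static void cycle_counter_init(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   /* enable the trace/debug block */
    DWT->CYCCNT = 0;                                  /* reset the counter            */
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;             /* start counting core cycles   */
}

static uint32_t measure(void)
{
    uint32_t start = DWT->CYCCNT;
    /* ... code fragment under test ... */
    return DWT->CYCCNT - start;   /* unsigned subtraction copes with wrap-around */
}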
Assume the compiler has been polished for 10 years by people who intimately understand the architecture of the processor, until proven otherwise. Further, the compiler can keep track of a lot of stuff which a human brain will struggle with. As a simple example, have a look at the assembler code generated by the compiler, then change a couple of variables from volatile to static, and compare the generated assembler code. The set of assumptions the compiler makes changes, and the code reflects that. It is quite hard to make all of the correct changes to the code by hand, as the 'book keeping' can be quite subtle yet tedious. People tend to make much simpler assumptions and hence write inferior code. A piece of in-line assembler may force the compiler to use much 'safer', less 'aggressive', simpler optimisations, which may swamp the improvement from the in-line assembly language.
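As a small illustration (variable and function names invented for the example), compare the assembler generated for these two busy-wait loops:

#include <stdint.h>

static volatile uint32_t flag_v;   /* every use must be a real memory access */
static uint32_t          flag_s;   /* compiler may keep it in a register     */

uint32_t spin_volatile(void)
{
    uint32_t n = 0;
    while (flag_v == 0) { n++; }   /* re-reads flag_v each time round */
    return n;
}

uint32_t spin_static(void)
{
    uint32_t n = 0;
    while (flag_s == 0) { n++; }   /* may collapse to a single test */
    return n;
}

The two listings, side by side, show how much book keeping the volatile qualifier forces on the compiler.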
I think one way to do better than a good optimising compiler (e.g. gcc) is to exploit something you know about the code that the compiler can't deduce. That usually means writing better algorithms, or re-organising code to exploit some property of the algorithm; for example, using a sort which exploits the order of the data, or merging for loops to avoid reloading lots of coefficients.
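A rough sketch of that last point (the function names and arithmetic are invented purely for illustration): two separate passes over a coefficient array, versus a single merged loop:

#include <stdint.h>

#define N 64   /* hypothetical block size */

/* Two passes: coeff[i] is loaded from memory in each loop. */
void two_passes(const int16_t *coeff, const int16_t *x, int32_t *a, int32_t *b)
{
    for (int i = 0; i < N; i++) a[i] = coeff[i] * x[i];
    for (int i = 0; i < N; i++) b[i] = coeff[i] * (x[i] >> 1);
}

/* Merged: each coefficient is loaded once and reused for both outputs. */
void merged(const int16_t *coeff, const int16_t *x, int32_t *a, int32_t *b)
{
    for (int i = 0; i < N; i++) {
        int32_t c = coeff[i];
        a[i] = c * x[i];
        b[i] = c * (x[i] >> 1);
    }
}

The compiler generally can't do this merge for you, because it can't prove the output arrays don't alias the coefficient array; you know they don't.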
A simple exercise is to try the different compiler optimisation flags, and measure the differences.
Another is to change all volatile keywords to static, and measure those differences.
This might indicate that it is faster to 'double buffer' data which is shared with the main loop (letting the compiler do the most aggressive optimisation); use a single volatile to communicate to the main loop which buffer-full of data is ready, but otherwise avoid volatile.
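A rough sketch of that arrangement (SampleISR, read_adc and process are placeholder names, not a real API); only the hand-over flag is volatile, everything else is ordinary static data the compiler is free to optimise:

#include <stdint.h>

#define BUF_LEN 128                       /* hypothetical buffer size */

extern uint16_t read_adc(void);                           /* placeholder I/O       */
extern void process(const uint16_t *buf, uint32_t len);   /* placeholder main work */

static uint16_t buffer[2][BUF_LEN];       /* static: invisible outside this file       */
static volatile uint8_t ready;            /* 0 = nothing, 1 = buffer[0], 2 = buffer[1] */

void SampleISR(void)                      /* hypothetical interrupt handler */
{
    static uint8_t  active;               /* buffer the ISR is currently filling */
    static uint32_t idx;

    buffer[active][idx++] = read_adc();
    if (idx == BUF_LEN) {
        ready  = (uint8_t)(active + 1u);  /* hand the full buffer to the main loop */
        active ^= 1u;
        idx     = 0;
    }
}

void main_loop(void)
{
    for (;;) {
        uint8_t r = ready;                /* the only volatile access needed here */
        if (r != 0) {
            ready = 0;
            process(buffer[r - 1], BUF_LEN);
        }
    }
}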
Declare data at the top level (outside a function) as static, so the compiler can know functions outside the file can't access it.
mlundinse wrote "Cortex M3/4 are 32 bit processors so declaring variables as byte or halfword will in general only make code slower"
That was true for earlier generations of ARM processors. However, the Cortex-M3/M4 are ARMv7-M architecture processors, where this is not necessarily true. The Cortex-M3/M4 have byte and half-word (2 byte) load and store instructions which are the same size and take the same time as word (4 byte int) load and store instructions, so code for byte (char), 2-byte (short) and 4-byte (int) data is the same size and runs at the same speed. Further, for some data structures, like arrays of structs, making the struct bigger by using int for every member might make the code run slower, because the compiler may be forced to use less compact addressing modes.
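A small illustration of the array-of-structs point (field names invented for the example):

#include <stdint.h>

/* LDRB/LDRH/LDR are the same size and speed on Cortex-M3/M4, so the byte
   fields cost nothing extra to access.  Widening every field to 32 bits
   quadruples the element size, so indexing a large array needs bigger
   offsets (less compact addressing) and four times the RAM. */
struct sample_small { uint8_t  ch, gain, flags, spare; };   /*  4 bytes per element */
struct sample_big   { uint32_t ch, gain, flags, spare; };   /* 16 bytes per element */

static struct sample_small small_tab[256];   /* 1KB */
static struct sample_big   big_tab[256];     /* 4KB */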
Using a word (int) where a byte or half-word would be correct may, in general, generate longer code sequences (though I don't think that is the case for C/C++ on the Cortex-M3); for example in a multiply, where the compiler might need an extra register for a double-word (8 byte) result. That register might otherwise have been available for a variable, saving memory accesses.
There is a problem with using non-word data. If a half-word or word crosses a natural 4-byte word boundary, then the processor needs to access memory twice, which is slower than reading memory once. This is invisible to the code; it is 'automatic', and is done by the hardware. The compiler may be instructed by a command line flag to 'pack' data, which would create this problem.
The compiler can be told to align on boundaries using the __attribute__ syntax:
short s __attribute__ ((aligned (4))) = 0;
This ensures the variable is on a word (4-byte) boundary. There is a command line option which tells the compiler to align variables so they don't straddle a word boundary, avoiding the extra memory access. I think it is the default, but I don't know what flags the IDE uses.
A 'trick' is to define all int variables first, then all short, then all char to avoid having to worry about this, though that might make code harder to understand.
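For instance (a hypothetical struct, but the same ordering idea applies to file-scope variables):

#include <stdint.h>

/* Widest members first: every field stays naturally aligned, nothing
   straddles a 4-byte boundary, and there are no hidden padding holes. */
struct ordered {
    uint32_t count;      /* 4-byte members first */
    uint16_t period;     /* then 2-byte members  */
    uint8_t  mode;       /* then single bytes    */
    uint8_t  spare;      /* 8 bytes in total     */
};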
Summary:
- "The First Rule of Program Optimization: Don't do it. The Second Rule of Program Optimization (for experts only!): Don't do it yet."
- Ensure there are extremely good reasons for changing working, tested, debuggable code into harder-to-test, harder-to-debug code
- Ensure there are good measurement mechanisms, sensitive enough to measure differences reliably before changing code
- The compiler is probably better than a person at generating assembler for simple C statements, and better at using registers
- Using assembler may significantly worsen speed by restricting the compiler's ability to re-organise and optimise code
- Be prepared to analyse the compiler-generated assembly code before attempting to optimise by writing in-line assembly code
- Exploit things the compiler can't deduce from the code to optimise the program, for example better algorithms
- Tell the compiler 'the truth, the whole truth, and nothing but the truth'; don't claim volatile, extern, or int when it isn't
EDIT {
Budget time to get competent with the tools, processor and measurement techniques, otherwise you'll underestimate the cost.
IMHO it might be an equally good use of time to switch to e.g. an STM32F3 or STM32F4 MCU, if you can afford the $15; that might give a lot more throughput for heavily DSP-like problems.
}
Another question is:
Do you want the interrupt routine to have a shorter duration, so interrupts are blocked for less time, even if the whole system runs a bit slower?
The main reason for doing an interrupt is to service I/O. What I/O is happening that takes so long? It sounds like a lot of processing is happening which is not I/O. Should that be moved to main? Such an architecture might be easier to test, measure and debug.
Alternatively, the I/O might be better serviced by DMA. Properly using DMA may make a much bigger difference to run-time than re-coding parts of the program in assembly language.