pyrohaz - I think ala42 identified the most important questions.
There are some great quotes about program optimisation on Wikipedia.
My favourite is: "The First Rule of Program Optimization: Don't do it. The Second Rule of Program Optimization (for experts only!): Don't do it yet."
If you are interested in understanding how to improve code performance, I would still recommend Jon Louis Bentley's "Writing Efficient Programs" (if you can find a copy), or his ACM column, "Programming Pearls", or the books based on that column. He identifies six 'levels' at which a program can be optimised; three are software and three are hardware. Trying to code in assembler instead of a high-level language corresponds to one of those levels, 'translation'. Based on a quick glance, this seems an okay summary: http://www.crowl.org/lawrence/programming/Bentley82.html though it misses that important abstraction.
I think it's important to understand that writing some in-line assembler may worsen program performance. Some people seem to think it is at worst 'net neutral', but it is not; performance can get much worse. Mixing two different languages, C/C++ and assembly language, also makes code harder to understand, change and maintain. That cost may be catastrophic.
The compiler does not analyse your assembler code. So it can't move it to a more efficient place (maybe outside a loop, or into a limb of an if which isn't executed every time round the loop), it must leave the registers your assembler has 'reserved' free for the assembler's use (when it might have made better use of them), and it must make memory look 'right enough' for the assembler to work. The compiler likely understands the processor pipeline, and can sequence instructions to exploit it (IMHO, if you have no idea what this means, please don't attempt to write any assembler until you do). Writing in-line assembler generally pokes a hole in the compiler's analysis of the C/C++ code, which is a 'bad thing'. Even a single line of badly judged assembler could force the compiler to generate code which runs a couple of times slower.
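To see what that 'hole' costs, here is a minimal sketch assuming GCC-style extended asm (the function is invented for the example). The asm statement emits no instructions at all, yet its clobber list forces the compiler to reload values from memory and to keep nothing in r4/r5 across every iteration, so the loop is optimised far less aggressively than the plain C version:

#include <stdint.h>

uint32_t sum(const uint32_t *p, uint32_t n)
{
    uint32_t s = 0;
    for (uint32_t i = 0; i < n; i++) {
        /* does nothing at run time, but the clobbers poke the 'hole' */
        __asm volatile ("" ::: "r4", "r5", "memory");
        s += p[i];
    }
    return s;
}

Compare the listings generated with and without the asm line to see the effect.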
Before attempting to write any assembler, make sure you know how to get a listing (from the tool chain) of the assembly language produced by the compiler, mingled with the source code, and learn to read it. IMHO, until you've understood this, making assembler changes is unlikely to be rational or to improve program performance.
Further, you need a reliable way to make accurate measurements of the code. At the level of improving an if test, that measurement system needs to be sensitive to a couple of machine cycles (i.e. about 1/72 of a microsecond at 72MHz).
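On a Cortex-M3/M4, one workable mechanism is the DWT cycle counter. The sketch below assumes the CMSIS core header (the DWT and CoreDebug names come from CMSIS), and "stm32f10x.h" is only a guess at the device header for a 72MHz part, so treat it as an outline rather than drop-in code:

#include "stm32f10x.h"   /* assumption: CMSIS device header for your part */

static void cycle_counter_init(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   /* enable the trace/debug block */
    DWT->CYCCNT = 0;                                  /* reset the counter            */
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;             /* start counting core cycles   */
}

static uint32_t measure(void)
{
    uint32_t start = DWT->CYCCNT;
    /* ... code fragment under test ... */
    return DWT->CYCCNT - start;   /* unsigned subtraction copes with wrap-around */
}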
Assume the compiler has been polished for 10 years by people who intimately understand the architecture of the processor, until proven otherwise. Further, the compiler can keep track of a lot of stuff which a human brain will struggle with. As a simple example, have a look at the assembler code generated by the compiler, then change a couple of variables from volatile to static, and compare the generated assembler code. The set of assumptions the compiler makes changes, and the code reflects that. It is quite hard to make all of the correct changes to the code by hand, as the 'book keeping' can be quite subtle yet tedious. People tend to make much simpler assumptions and hence write inferior code. A piece of in-line assembler may force the compiler to use much 'safer', less 'aggressive', simpler optimisations, which may swamp the improvement from the in-line assembly language.
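As a small illustration (variable and function names invented for the example), compare the assembler generated for these two busy-wait loops:

#include <stdint.h>

static volatile uint32_t flag_v;   /* every use must be a real memory access */
static uint32_t          flag_s;   /* compiler may keep it in a register     */

uint32_t spin_volatile(void)
{
    uint32_t n = 0;
    while (flag_v == 0) { n++; }   /* re-reads flag_v each time round */
    return n;
}

uint32_t spin_static(void)
{
    uint32_t n = 0;
    while (flag_s == 0) { n++; }   /* may collapse to a single test */
    return n;
}

The two listings, side by side, show how much book keeping the volatile qualifier forces on the compiler.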
I think one way to do better than a good optimising compiler (e.g. gcc) is to exploit something you know about the code that the compiler can't deduce. That usually means writing better algorithms, or re-organising code to exploit some property of the algorithm; for example, using a sort which exploits the order of the data, or merging for loops to avoid reloading lots of coefficients.
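A rough sketch of that last point (the function names and arithmetic are invented purely for illustration): two separate passes over a coefficient array, versus a single merged loop:

#include <stdint.h>

#define N 64   /* hypothetical block size */

/* Two passes: coeff[i] is loaded from memory in each loop. */
void two_passes(const int16_t *coeff, const int16_t *x, int32_t *a, int32_t *b)
{
    for (int i = 0; i < N; i++) a[i] = coeff[i] * x[i];
    for (int i = 0; i < N; i++) b[i] = coeff[i] * (x[i] >> 1);
}

/* Merged: each coefficient is loaded once and reused for both outputs. */
void merged(const int16_t *coeff, const int16_t *x, int32_t *a, int32_t *b)
{
    for (int i = 0; i < N; i++) {
        int32_t c = coeff[i];
        a[i] = c * x[i];
        b[i] = c * (x[i] >> 1);
    }
}

The compiler generally can't do this merge for you, because it can't prove the output arrays don't alias the coefficient array; you know they don't.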
A simple exercise is to try the different compiler optimisation flags, and measure the differences.
Another is to change all volatile keywords to static, and measure those differences.
This might indicate that it is faster to 'double buffer' data which is shared with the main loop (letting the compiler do the most aggressive optimisation); use a single volatile to communicate to the main loop which buffer-full of data is ready, but otherwise avoid volatile.
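A rough sketch of that arrangement (SampleISR, read_adc and process are placeholder names, not a real API); only the hand-over flag is volatile, everything else is ordinary static data the compiler is free to optimise:

#include <stdint.h>

#define BUF_LEN 128                       /* hypothetical buffer size */

extern uint16_t read_adc(void);                           /* placeholder I/O       */
extern void process(const uint16_t *buf, uint32_t len);   /* placeholder main work */

static uint16_t buffer[2][BUF_LEN];       /* static: invisible outside this file       */
static volatile uint8_t ready;            /* 0 = nothing, 1 = buffer[0], 2 = buffer[1] */

void SampleISR(void)                      /* hypothetical interrupt handler */
{
    static uint8_t  active;               /* buffer the ISR is currently filling */
    static uint32_t idx;

    buffer[active][idx++] = read_adc();
    if (idx == BUF_LEN) {
        ready  = (uint8_t)(active + 1u);  /* hand the full buffer to the main loop */
        active ^= 1u;
        idx     = 0;
    }
}

void main_loop(void)
{
    for (;;) {
        uint8_t r = ready;                /* the only volatile access needed here */
        if (r != 0) {
            ready = 0;
            process(buffer[r - 1], BUF_LEN);
        }
    }
}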
Declare data at the top level (outside a function) as static, so the compiler can know functions outside the file can't access it.
mlundinse wrote "Cortex M3/4 are 32 bit processors so declaring variables as byte or halfword will in general only make code slower"
That was true for earlier generations of ARM processors. However, the Cortex-M3/M4 are ARMv7-M architecture processors, where this is not necessarily true. The Cortex-M3/M4 have byte and half-word (2 byte) load and store instructions which are the same size and take the same time as word (4 byte int) load and store instructions, so code for byte (char), 2-byte (short) and 4-byte (int) data is the same size and runs at the same speed. Further, for some data structures, like arrays of structs, making the struct bigger by using int for every member might make the code run slower, because the compiler may be forced to use less compact addressing modes.
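A small illustration of the array-of-structs point (field names invented for the example):

#include <stdint.h>

/* LDRB/LDRH/LDR are the same size and speed on Cortex-M3/M4, so the byte
   fields cost nothing extra to access.  Widening every field to 32 bits
   quadruples the element size, so indexing a large array needs bigger
   offsets (less compact addressing) and four times the RAM. */
struct sample_small { uint8_t  ch, gain, flags, spare; };   /*  4 bytes per element */
struct sample_big   { uint32_t ch, gain, flags, spare; };   /* 16 bytes per element */

static struct sample_small small_tab[256];   /* 1KB */
static struct sample_big   big_tab[256];     /* 4KB */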
Using a word (int) where a byte or half-word would be correct may, in general, generate longer code sequences (though I don't think that is the case for C/C++ on the Cortex-M3); for example in a multiply, where the compiler might need an extra register for a double-word (8 byte) result. That register might otherwise have been available for a variable, saving memory accesses.
There is a problem with using non-word data. If a half-word or word crosses a natural 4-byte word boundary, then the processor needs to access memory twice, which is slower than reading memory once. This is invisible to the code; it is 'automatic', and is done by the hardware. The compiler may be instructed by a command line flag to 'pack' data, which would create this problem.
The compiler can be told to align on boundaries using the __attribute__ syntax:
short s __attribute__ ((aligned (4))) = 0;
This ensures the variable is on a word (4-byte) boundary. There is a command line option which tells the compiler to align variables so they don't straddle a word boundary, avoiding the extra memory access. I think it is the default, but I don't know what flags the IDE uses.
A 'trick' is to define all int variables first, then all short, then all char to avoid having to worry about this, though that might make code harder to understand.
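For instance (a hypothetical struct, but the same ordering idea applies to file-scope variables):

#include <stdint.h>

/* Widest members first: every field stays naturally aligned, nothing
   straddles a 4-byte boundary, and there are no hidden padding holes. */
struct ordered {
    uint32_t count;      /* 4-byte members first */
    uint16_t period;     /* then 2-byte members  */
    uint8_t  mode;       /* then single bytes    */
    uint8_t  spare;      /* 8 bytes in total     */
};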
Summary:
- "The First Rule of Program Optimization: Don't do it. The Second Rule of Program Optimization (for experts only!): Don't do it yet."
- Ensure there are extremely good reasons for changing working, tested, debuggable code into harder-to-test, harder-to-debug code
- Ensure there are good measurement mechanisms, sensitive enough to measure differences reliably before changing code
- The compiler is probably better than a person at generating assembler for simple C statements, and better at using registers
- Using assembler may significantly worsen speed by restricting the compiler's ability to re-organise and optimise code
- Be prepared to analyse the compiler-generated assembly code before attempting to optimise by writing in-line assembly code
- Exploit things the compiler can't deduce from the code to optimise the program, for example better algorithms
- Tell the compiler 'the truth, the whole truth, and nothing but the truth'; don't claim volatile, extern, or int when it isn't
EDIT {
Budget time to get competent with the tools, processor and measurement techniques, otherwise you'll underestimate the cost.
IMHO it might be an equally good use of time to switch to e.g. an STM32F3 or STM32F4 MCU, if you can afford the $15; that might give a lot more throughput for heavily DSP-like problems.
}
Another question is:
Do you want the interrupt routine to have a shorter duration, so interrupts are blocked for less time, even if the whole system runs a bit slower?
The main reason for doing an interrupt is to service I/O. What I/O is happening that takes so long? It sounds like a lot of processing is happening which is not I/O. Should that be moved to main? Such an architecture might be easier to test, measure and debug.
Alternatively, the I/O might be better serviced by DMA. Properly using DMA may make a much bigger difference to run-time than re-coding parts of the program in assembly language.