LeafLabs Garden » Topic: Writing fast code

gbulmer on "Writing fast code" http://forums.leaflabs.com/topic.php?id=1133#post-7107 Mon, 07 Nov 2011 07:08:09 +0000 gbulmer 7107@http://forums.leaflabs.com/

siy - I agree. There are undoubtedly trade-offs to working at different "levels" in the software (approximately characterised as architecture/algorithms, coding, and compilation/translation). Algorithms are an important 'level' to consider.

A key point about recognising that there are 'levels' is that there are several places to attack a performance problem, and some may yield an improvement more easily than others. It may be something as simple as using the right compiler flags, which can deliver a 2x or greater improvement. IMHO, people often ignore this option, yet it should require relatively modest effort, every result *should* be either a working program or a compiler error/warning, and the effort spent understanding the compiler options should carry over to other projects.

A great example of focusing on the compilation/translation 'level' (though not applicable here) is comparing Python and Cython for numerical algorithms. For the same, 'optimal', algorithm, the Cython developers claim more than 100x improvement by using Cython over Python (<a href="http://cython.org/" rel="nofollow">http://cython.org/</a>).

Sorting and searching are often identified as good examples of areas where the algorithm dominates. Writing and testing a new algorithm has risks. Usually I'd recommend using a well written and tested library rather than coding it yourself, where practical. Many algorithms, especially searches and sorts, have been published as 'reference' examples and later discovered to be incorrect.
(Try reading "Engineering a sort function" by JON L. BENTLEY. M. DOUGLAS McILROY <a href="http://www.comp.nus.edu/~tanhuiyi/cs1102/2007-2008SEM2/spe862jb.pdf" rel="nofollow">http://www.comp.nus.edu/~tanhuiyi/cs1102/2007-2008SEM2/spe862jb.pdf</a> which focuses on quicksort. It shows how weak some of the published and implemented examples were, and how to do it well) Evidence is critical. IMHO having an effective method for measuring actual performance, to identify where there might be a performance 'bottleneck', or understanding the problem properly, is more important than choosing the 'best algorithm' because the software engineer might not be able to characterise what adequate is otherwise. The evidence (e.g. software engineering research studies and 'war stories') is that a developers 'intuition' is unreliable. So, for example, a bubble sort may be faster than quicksort in specific circumstances. Not understanding the problem, or having inadequate measurement techniques might lead the software engineer to the wrong conclusion. A system I helped with in the '80's used the C library implementation of quicksort expecting it to be the 'best', but we learned that Shell sort 'spanked it' in all realistic, measured, cases for our application. From my software development experience (30+ years), it is worth trying to keep an open mind, and being prepared to measure and characterise performance rather than 'premature optimisation'. The thing that I usually find the most harmful is making assumptions, and evidence is the best antidote :-) "Writing Efficient Programs" by Jon L. Bentley (<a href="http://www.amazon.com/Writing-Efficient-Programs-Prentice-Hall-Software/dp/013970244X" rel="nofollow">http://www.amazon.com/Writing-Efficient-Programs-Prentice-Hall-Software/dp/013970244X</a>) is almost 30 years old, and seems to be rarely recognised even though Jon L. Bentley wrote the ACM column "Programming Pearls" for years. 
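The advice above, to measure rather than assume which sort is 'best', could be sketched as a small host-side C++ harness. This is only an illustration under stated assumptions: the names `insertion_sort`, `library_sort` and `mean_sort_time_us` are hypothetical, it runs on a desktop compiler rather than on Maple, and it says nothing about which sort wins for any particular data.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Simple insertion sort: quadratic in general, but it can be competitive
// on short or nearly sorted inputs. Only measurement can tell.
void insertion_sort(std::vector<int>& v) {
    for (std::size_t i = 1; i < v.size(); ++i) {
        int key = v[i];
        std::size_t j = i;
        while (j > 0 && v[j - 1] > key) {
            v[j] = v[j - 1];
            --j;
        }
        v[j] = key;
    }
}

// The library routine being compared against.
void library_sort(std::vector<int>& v) {
    std::sort(v.begin(), v.end());
}

// Sorts a fresh copy of `input` `runs` times and returns the mean
// wall-clock time per run in microseconds. The copy is made outside
// the timed region so only the sort itself is measured.
template <typename Sorter>
double mean_sort_time_us(Sorter sorter, const std::vector<int>& input, int runs) {
    using clock = std::chrono::steady_clock;
    static volatile int sink = 0;  // keeps the sorted result observable
    std::chrono::duration<double, std::micro> total(0);
    for (int r = 0; r < runs; ++r) {
        std::vector<int> work = input;
        auto start = clock::now();
        sorter(work);
        total += clock::now() - start;
        if (!work.empty()) sink = work.front();
    }
    return total.count() / runs;
}
```

Calling `mean_sort_time_us(insertion_sort, data, 1000)` and `mean_sort_time_us(library_sort, data, 1000)` on data shaped like the real application's input gives two numbers to compare; the outcome may well differ between 16 nearly sorted values and 100000 random ones.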
In the particular case in this thread: <blockquote>The means measured in my setup were: 0.23 us per execution of gpio_write 0.57 us per execution of digitalWrite </blockquote> measuring something like <code>bb_peri_set_bit(&(GPIOB_BASE->ODR), PinIndexOnPortB, HIGH);</code> seems to be worthwhile (the output data register, ODR, is the one to write; the input data register, IDR, is read-only). It is likely less risk and effort than trying to change the algorithm, but may yield about 3x improvement over the library implementations.

Using bit-band addressing does carry risk, for example making a mistake in translating from the high-level Maple pin number to a port and pin. Even so, that should be easier to test than a change to an algorithm. Using bit-band addressing correctly also has the benefit that it is as fast as the software can go, so (assuming the correct compiler flags are used, and hence optimal translation) the architecture/algorithm needs to be changed if this isn't quick enough!-)

It should be added that it is extremely useful to have a good definition of what is needed, and to have an estimate or model of what 'fast enough' is. I haven't understood that in this case, so I am trusting that the OP does. (Even if doing optimisation experiments as a 'hobby', it's useful to know when enough is enough.)

Summary - there are at least three 'levels' to a software application, and each has opportunities to improve performance. Improvement at one 'level' may be lower risk and less effort than at another. IMHO effective approaches to measurement, and gaining a proper understanding of the problem, are more important than optimisation at any 'level'. Developing a simple, clear, *WORKING* program is more important than focusing on performance too early. Developing a method to test that the program continues to work correctly is important if any changes are to be made.
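For readers unfamiliar with the bit-band addressing mentioned in this post, the address arithmetic can be sketched in plain C++. This is a minimal sketch, not libmaple's implementation: the function and constant names are hypothetical, and the GPIOB base address and output data register offset are assumed STM32F103 values that should be checked against the reference manual for the actual part.

```cpp
#include <cstdint>

// Cortex-M3 peripheral bit-band region and its alias region.
constexpr std::uint32_t kPeriphBase      = 0x40000000u;
constexpr std::uint32_t kPeriphAliasBase = 0x42000000u;

// Assumed STM32F103 values; verify against the reference manual.
constexpr std::uint32_t kGpiobBase = 0x40010C00u;  // GPIOB register block
constexpr std::uint32_t kOdrOffset = 0x0Cu;        // output data register

// Each bit of the bit-band region gets its own 32-bit word in the alias
// region: alias = alias_base + byte_offset * 32 + bit_number * 4.
constexpr std::uint32_t bit_band_alias(std::uint32_t reg_addr, std::uint32_t bit) {
    return kPeriphAliasBase + (reg_addr - kPeriphBase) * 32u + bit * 4u;
}
```

Under those assumptions, `bit_band_alias(kGpiobBase + kOdrOffset, 6)` evaluates to `0x42218198`; on the target, writing 1 or 0 to that word would set or clear only bit 6 of the register, with no read-modify-write in software. Checking a computed address like this on the host is one cheap way to catch the pin-to-port translation mistakes the post warns about.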
Getting a simple, clear, working program early in the project is a good strategy in general, and especially when performance might be an issue, because evidence based on measurement and understanding is more valuable than intuition or assumptions.

siy on "Writing fast code" http://forums.leaflabs.com/topic.php?id=1133#post-7106 Mon, 07 Nov 2011 02:24:00 +0000 siy 7106@http://forums.leaflabs.com/

From my significant (20+ years) experience as a software developer I can say that optimizing algorithms is much more effective than any attempt to tune performance with tricks. Rewriting bubble sort in carefully tuned assembler can't make it faster than, for example, Shell sort or heap sort written in plain C without any tuning or tricks. In other words, it is worth leaving to the computer what it can do (code optimization) and letting the human do what the computer can't do: algorithm optimization.

gbulmer on "Writing fast code" http://forums.leaflabs.com/topic.php?id=1133#post-7053 Sun, 30 Oct 2011 19:02:05 +0000 gbulmer 7053@http://forums.leaflabs.com/

For anyone not familiar with the C pre-processor ... <code>#define MACRO</code> defines MACRO as a name used by the pre-processor; it will not be seen by the compiler. <code>#ifdef MACRO</code> is the same as <code>#if defined(MACRO)</code> and <code>#ifndef MACRO</code> is the same as <code>#if !defined(MACRO)</code>.

The more complete syntax of #if is:
<pre><code>#if <expression1>
// stuff to include if <expression1> is non-zero
#elif <expression2>
// stuff to include if <expression1> is zero and <expression2> non-zero
#elif <expression3>
// stuff to include if <expression1> is zero and <expression2> zero and <expression3> non-zero
#else
// stuff to include otherwise
#endif</code></pre>
The <expression> can be any integer expression that can be evaluated at compile time; macro values are substituted. It can include integer arithmetic, character constants, bit shifts, bit operations, &&, ||, !, and defined(), but not sizeof().
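The <code>#if</code> rules described above can be tried out with a small C++ fragment. This is only a sketch: `DEBUG_LEVEL`, `debug_mode` and `kArithmeticBranchTaken` are hypothetical names chosen for illustration, and the default level of 1 is an arbitrary assumption.

```cpp
// DEBUG_LEVEL is a hypothetical macro; give it a default value when the
// build has not defined it.
#ifndef DEBUG_LEVEL
#define DEBUG_LEVEL 1
#endif

// defined(), comparisons and && are all allowed in a #if expression.
// Only one of these three definitions reaches the compiler.
#if defined(DEBUG_LEVEL) && (DEBUG_LEVEL >= 2)
const char* debug_mode() { return "verbose"; }
#elif DEBUG_LEVEL == 1
const char* debug_mode() { return "errors only"; }
#else
const char* debug_mode() { return "off"; }
#endif

// Integer arithmetic, shifts, bit operations and character constants are
// evaluated by the pre-processor before the compiler sees any code.
#if (((1 << 4) | 3) == 19) && (('B' - 'A') == 1)
constexpr bool kArithmeticBranchTaken = true;
#else
constexpr bool kArithmeticBranchTaken = false;
#endif
```

With the default level, `debug_mode()` returns "errors only"; defining the macro as 2 or more before this fragment is compiled selects the "verbose" definition instead, and the other two are never seen by the compiler.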
So everything that can be done with #ifdef or #ifndef can be done with #if, and #if can be used for a lot more besides. I use <code>#ifdef</code>/<code>#ifndef</code> to ensure the body of include files is only included once, and wherever the source code is dependent on the definition of a single macro name. It is simple and requires less brain power to figure out what is intended than #if :-)

You can give a macro symbol a value from the command line (e.g. -DMACRO=3). So the source code can be:
<pre><code>#if MACRO == 1
// ...
#elif MACRO == 2
// ...
#elif MACRO == 3
// ... etc
#endif</code></pre>
I've used this sort of thing to switch on different levels of debugging.

JoshSanders on "Writing fast code" http://forums.leaflabs.com/topic.php?id=1133#post-7049 Sun, 30 Oct 2011 12:04:11 +0000 JoshSanders 7049@http://forums.leaflabs.com/

Thanks, guys! Awesome feedback! Learning lots...

robodude666 on "Writing fast code" http://forums.leaflabs.com/topic.php?id=1133#post-7048 Sun, 30 Oct 2011 10:01:34 +0000 robodude666 7048@http://forums.leaflabs.com/

In addition, you can use <code>#define</code>s to switch between the two blocks of code:
<pre><code>#define USE_FAST_GPIO

#ifdef USE_FAST_GPIO
gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH);
gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH);
gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH);
gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH);
gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH);
// ... lots more
#else
digitalWrite(Pin, HIGH);
digitalWrite(Pin, HIGH);
digitalWrite(Pin, HIGH);
digitalWrite(Pin, HIGH);
digitalWrite(Pin, HIGH);
// ... lots more
#endif</code></pre>
Commenting out <code>#define USE_FAST_GPIO</code> will cause it to not be defined. There is also <code>#ifndef</code>, which checks if something is not defined. I personally find it handier than using <code>#if 0</code>/<code>#if 1</code>, as you can keep your settings/defines in a single location for easy tweaking. This is for more permanent features though... If you're just testing, <code>#if 0</code>/<code>#if 1</code> is fine.
I use it to remove large blocks of code that are under development, while I use defines to switch between things I want options to control (debug, modes, etc.). gbulmer on "Writing fast code" http://forums.leaflabs.com/topic.php?id=1133#post-7045 Sat, 29 Oct 2011 17:28:18 +0000 gbulmer 7045@http://forums.leaflabs.com/ JoshSanders - Okay. I was partly trying to highlight some of the issues and difficulties for other readers than you :-) Having said that, using the 'port bit set/reset register' or direct 'bit band address' (for single pin change) is so much faster (almost 10x) that IMHO using a deterministic timer clock, with a lot more resolution is a step worth taking. On a different area, may I say, rather than use comments to switch code on and off, I like to use the C preprocessor. All lines of source code starting the line with '#' is handled by the pre-processor. Conceptually the pre-processor does things to the source before passing it on to the C/C++ compiler; the C/C++ compiler gets the resulting transformed text from the pre-processor (might not be implemented this way, but this is the concept). The pre-processor does the #include of header files into the source, and the replacement of #define macro's before the C/C++ compiler reads the text. The pre-processor also supports conditions either <code>#if</code> or <code>#ifdef</code>. Those constructs either include or exclude text from the input to the C/C++ compiler. The structure of <code>#if</code> is a bit different from C/C++. <pre><code>#if 1 digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); // ... lots more #else gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); // ... 
lots more #endif</code></pre> Leaves the block of <code>digitalWrite(Pin, HIGH);</code> in the code for the C/C++ compiler, but removes all of the <code>gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH);</code> block. <pre><code>#if 0 digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); // ... lots more #else gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); // ... lots more #endif</code></pre> does the opposite. It is easier to edit the <code>#if 1</code> to <code>#if 0</code> than edit the comment markers. JoshSanders on "Writing fast code" http://forums.leaflabs.com/topic.php?id=1133#post-7043 Sat, 29 Oct 2011 16:19:20 +0000 JoshSanders 7043@http://forums.leaflabs.com/ Agreed - I should have made this more clear - this code isn't for precisely measuring how long it takes Maple to run a block of code, hence my introduction that it "measures and returns the *average* time to execute a block of code, albeit at the expense of information about variance". The purpose of this code is for determining which of two alternative ways of doing something is faster in relative terms, by a factor large enough to warrant more complex code. (In the case of digitalWrite vs. gpio_write_bit, for instance, the factor of 2 helped me make the decision to go with gpio_write_bit in a regime where interrupts have to be on and are unpredictable). gbulmer on "Writing fast code" http://forums.leaflabs.com/topic.php?id=1133#post-7036 Sat, 29 Oct 2011 06:12:10 +0000 gbulmer 7036@http://forums.leaflabs.com/ JoshSanders - That is a very good shot at timing stuff, and may be fine for what you need to do. I think it is worth mentioning for future readers of this thread that timing this sort of stuff is a bit more complex than it looks. 
That code gives an estimate for the empty loop of 0.94us, by running 100000 iterations. So I think it is reasonable to assume the whole block of code was executing for much more than 94 milliseconds, i.e.:
<pre><code>for(int CycleIndex = 0; CycleIndex < 100000; CycleIndex++){
  MeasurementStartTime = micros();
  // -----------------Lines of code being measured------------------
  //---------------------------------------------------------------
  MeasurementEndTime = micros();
  CycleDurationSum = CycleDurationSum + (MeasurementEndTime - MeasurementStartTime);
}</code></pre>
So, on the face of it, CycleDurationSum should be the time used to run micros() 100000 times, and dividing it by 100000 factors out the run time of a single call. The whole loop will run micros() 200000 times, and has the for loop, and some floating-point calculations. So maybe about 200 milliseconds.

There are unaccounted-for activities which also consume run time. During that 200 milliseconds (that is only a guess, it is worth measuring), there should have been about 200 Systick timer interrupts, and about 200 USB interrupts, and it is likely some of those happened during the 'empty' part of the loop. So a chunk of the measured run time is spent running interrupt service routines, and is not the 'empty loop'. Assuming the time spent running micros() 200000 times dominates the time in the block of code, a plausible guess would be that half the time spent in interrupts contributes to the measured elapsed time during the 100000 timed intervals.

To get meaningful results from the empty loop, either switch off those interrupts, or account for the time spent in interrupt service routines. In this case, I think it is easier to switch off the interrupts, but in general it might not be. To switch off USB, I think the call is <code>USBSerial::end()</code> and for Systick <code>systick_disable()</code>. If interrupts are running (i.e. 
it isn't deterministic) run those blocks many times too (with a bit of random delay between each run if practical) to get a better sample set. A tiny jitter comes from micros() which does not take the same amount of time every time it is called. Looking at the code, I see it contains a loop which rarely executes, but might have an awkward 'synchronisation' effect. So I would suggest using a hardware timer instead of micros(). Read the timers count value straight from the hardware, so it would be deterministic (reset the timer's counter too, and have a test duration smaller than the counter maximum). A hardware timer could also give more resolution than micros(). A counter is capable of measuring a single clock cycle (with a prescaler of 1). Sorry about my program breaking. You're correct, the library has changed. I'll try to do something about it when I get some time. It is also worth reiterating that Pete Harrison <a href="http://www.micromouseonline.com/2011/10/26/stm32f4-the-first-taste-of-speed/#axzz1cB4M8UDw" rel="nofollow">http://www.micromouseonline.com/2011/10/26/stm32f4-the-first-taste-of-speed/#axzz1cB4M8UDw</a> has used Rowley Crossworks which gives cycle counts for chunks of STM32F code. I do not know how accurate that is, but assuming it is accurate, then there is less need to do benchmarks (if you can afford the software). JoshSanders on "Writing fast code" http://forums.leaflabs.com/topic.php?id=1133#post-7029 Fri, 28 Oct 2011 17:51:34 +0000 JoshSanders 7029@http://forums.leaflabs.com/ Good point about measure and compare. 
It's also a lot closer to my heart as an experimental biologist =) Here's a tidbit I've tested (along with some specific measurements): You can write digital output to Maple's pins *twice as fast* if you use gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); instead of digitalWrite(Pin, HIGH); (Example code below) The means measured in my setup were: 0.23 us per execution of gpio_write 0.57 us per execution of digitalWrite Here's how I did it: Using Maple Mini rev2, I wrote a generic sketch that measures and returns the average time to execute a block of code (albeit at the expense of information about variance): double CycleDurationSum = 0; unsigned int MeasurementStartTime = 0; unsigned int MeasurementEndTime = 0; void setup() { } void loop(){ CycleDurationSum = 0; for(int CycleIndex = 0; CycleIndex < 100000; CycleIndex++){ MeasurementStartTime = micros(); // -----------------Lines of code being measured------------------ //--------------------------------------------------------------- MeasurementEndTime = micros(); CycleDurationSum = CycleDurationSum + (MeasurementEndTime - MeasurementStartTime); } CycleDurationSum = CycleDurationSum/100000; SerialUSB.println(CycleDurationSum); } With nothing in the space between the "Lines of code being measured" comments, the average cycle duration is 0.94us. I subtracted 0.94us from any measurements made with code in that space. (Since micros() returns a minimum value of 1 or else 0, this is probably an overestimate... 
but since it rounded up to 1us 94% of the time, the true value ought to be between 0.5us and 1us) so I made sure the number of repetitions of the code of interest was large enough to cause a large increase in the total time measured with respect to 1us, allowing for meaningful comparison between two conditions) For the digital write measurements I cited above, I used the code modified as follows, and uncommenting the 30 "digital write" lines to be measured: #include <stdio.h> #include <gpio.h> #define PIN_PORT GPIOB byte Pin = 16; byte PinIndexOnPortB = 6; double CycleDurationSum = 0; unsigned int MeasurementStartTime = 0; unsigned int MeasurementEndTime = 0; void setup() { pinMode(Pin, OUTPUT); } void loop(){ CycleDurationSum = 0; for(int CycleIndex = 0; CycleIndex < 100000; CycleIndex++){ MeasurementStartTime = micros(); // -----------------Lines of code being measured------------------ digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); digitalWrite(Pin, HIGH); // gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); // gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); // gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); // gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); // gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); // gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); // gpio_write_bit(PIN_PORT, 
PinIndexOnPortB, HIGH); // gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); // gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); // gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); // gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); // gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); // gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); // gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); // gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); // gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); // gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); // gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); // gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); // gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); // gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); // gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); // gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); // gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); // gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); // gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); // gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); // gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); // gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); // gpio_write_bit(PIN_PORT, PinIndexOnPortB, HIGH); //--------------------------------------------------------------- MeasurementEndTime = micros(); CycleDurationSum = CycleDurationSum + (MeasurementEndTime - MeasurementStartTime); } CycleDurationSum = CycleDurationSum/100000; SerialUSB.println(CycleDurationSum); } For gpio_write_bit, the system returned 7.97us. 7.97us - 0.94us (empty loop) = 7.03us. 7.03us/30 (repetitions of the line) = 0.23us For digitalWrite, the system returned 17.90us. 17.90us - 0.94us (empty loop) = 16.96us. 16.96us/30 (repetitions of the line) = 0.57us. 
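The arithmetic above (subtract the empty-loop time, then divide by the number of repeated lines) can be written as a small helper. This is a sketch for checking the numbers on a host compiler, not part of the original sketch; `per_call_us` and `speed_ratio` are hypothetical names.

```cpp
// Mean time per call in microseconds: remove the measurement overhead of
// the empty loop, then divide by how many times the line was repeated.
constexpr double per_call_us(double measured_us, double empty_loop_us, int repetitions) {
    return (measured_us - empty_loop_us) / repetitions;
}

// How many times slower the first technique is than the second, using
// the same empty-loop overhead and repetition count for both.
constexpr double speed_ratio(double slow_measured_us, double fast_measured_us,
                             double empty_loop_us, int repetitions) {
    return per_call_us(slow_measured_us, empty_loop_us, repetitions) /
           per_call_us(fast_measured_us, empty_loop_us, repetitions);
}
```

With the figures quoted above, `per_call_us(7.97, 0.94, 30)` is about 0.23 and `per_call_us(17.90, 0.94, 30)` is about 0.57, so `speed_ratio(17.90, 7.97, 0.94, 30)` comes out at roughly 2.4. Using more repetitions per loop makes the result less sensitive to any error in the 0.94us empty-loop estimate.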
For those of you who are new to gpio_write_bit, the basic concept is to use Maple's Master Pin Map ( <a href="http://leaflabs.com/docs/hardware/maple.html#master-pin-map" rel="nofollow">http://leaflabs.com/docs/hardware/maple.html#master-pin-map</a> ) to address lines by their GPIO address(Column 2 in the table) instead of their pin number (Column 1). Each pin has a port address (usually letters A-D) and an index within that port (i.e. pin 39 on the maple is also pin PA13 = Port A, Pin 13.) To use gpio_write_bit, you need to add: #include <stdio.h> #include <gpio.h> and for each pin port you plan to use, add #define PIN_PORT_A GPIOA . . . #define PIN_PORT_X GPIOX Some threads touching on this already exist in the forum: <a href="http://forums.leaflabs.com/topic.php?id=1107#post-6827" rel="nofollow">http://forums.leaflabs.com/topic.php?id=1107#post-6827</a> <a href="http://forums.leaflabs.com/topic.php?id=517#post-2644" rel="nofollow">http://forums.leaflabs.com/topic.php?id=517#post-2644</a> It seems there are even faster ways to manipulate pins through DMA channels or direct updates... one example was given that won't compile with my Maple Mini Rev2 + Win 7: //------------------------------------------------------ /* QuickererPin :-) * * Turns a GPIO pin on and off fast using direct updates. 
* Copyright 2010 G Bulmer */ #include <gpio.h> #include <boards.h> int pinNumb = 13; GPIO_Port *const port = PIN_MAP[pinNumb].port; const int32 pinOffset = PIN_MAP[pinNumb].pin; void setup() { // initialize the digital pin as an output: pinMode(pinNumb, OUTPUT); // don't bother doing this at a low-level, use the library } void loop() { while(true) { // An infinite loop, going fast (can be faster, but yeuk) port->BSRR = 0xFFFF0000 | (1<<pinOffset); port->BSRR = 1<<(pinOffset+16); } } //-------------------------------------------------------------- This code generates the following error (I am using Maple 0.0.12): error: expected constructor, destructor, or type conversion before '*' token removing const* gives: error: 'GPIO_Port' does not name a type. (Sounds like the library is modified somehow since then?) Anyhow, I'll post more speed tips as I figure them out. ~J gbulmer on "Writing fast code" http://forums.leaflabs.com/topic.php?id=1133#post-7010 Wed, 26 Oct 2011 18:48:35 +0000 gbulmer 7010@http://forums.leaflabs.com/ AFAIK, modern Computer Science degrees barely deal with optimisation. IMHO, partly because CPU performance stopped being a major problem in this millennium (usually I/O is the bigger problem), and partly because of the (obvious) observation that it is much easier to improve the performance of an easy to understand, simple, working program than it is deal with a fast, complex but broken program. There was a great story by Gerald M. Weinberg, in "The Psychology of Computer Programming" about performance. A system had been built, and reworked, still didn't work, and was way over budget. Weinberg was called in as a consultant to propose a solution. After Weinberg had outlined his design, the chief developer of the original system quizzed Weinberg about the speed of Weinberg's program design. The chief developer commented something like 'our solution would be 3x faster'. 
Weinberg replied something like 'if the requirement to work, and give correct results was removed I could make it run as fast as you like' :-) I am a huge fan of measuring before trying to improve performance. I like John L Bentley's "Writing efficient programs": <a href="http://books.google.co.uk/books/about/Writing_efficient_programs.html?id=up1QAAAAMAAJ&redir_esc=y" rel="nofollow">http://books.google.co.uk/books/about/Writing_efficient_programs.html?id=up1QAAAAMAAJ&redir_esc=y</a> I like this book a lot because it reminds people that there are about 6 'levels' of an implementation (3 hardware and 3 software) where performance can be optimised (people usually mean for speed, but it could be space). Each 'level' typically yields no more than 10x improvement, so if you need 100x, you know there is little point hoping you'll get it by trying to write efficient code, or recoding in assembler alone. It is often easier to get a 3x at one level, and a 3x at another rather than try to get 9x at one level. Improve the algorithm a bit, and polish a small part of the code, and you might get there with little effort. Both sets of articles (quite reasonably) assume you know where the speed bottleneck is, but seem to assume that it needs to be fixed by recoding. I think that is too narrow. The algorithm often makes more difference, and using the right mix of optimizations flags to the compiler might make a significant difference. I usually start by trying a few levels of optimisation, to see if the compiler can improve speed all by itself. Pete Harrison uses Rowley Crossworks for STM32F development, and it seems to annotate programs with the number of cycles for each statement. If you can get a listing like that, you are likely able to see where time is spent. 
Some you can't do much about :-) That is governed by Amdahl's law (<a href="http://en.wikipedia.org/wiki/Amdahl's_law" rel="nofollow">http://en.wikipedia.org/wiki/Amdahl's_law</a>), and is worth bearing in mind: if the part that is critical, and can't be changed, is 90% of the run time, then you are going to have to get an amazing improvement in the remaining 10% just to improve by 10% overall.

Most of the libmaple functions are very safe and conservative, and so there is room to go quicker.

I haven't read the two links, only skimmed them. I think there are some pieces of advice in <a href="http://www.codeproject.com/KB/cpp/C___Code_Optimization.aspx" rel="nofollow">http://www.codeproject.com/KB/cpp/C___Code_Optimization.aspx</a> which are inaccurate (I don't believe an ARM compiler uses shifts to trim ints to chars; Cortex-M3 has byte swap and byte addressing, etc.). Some is subtle, and hence can't be applied without proper analysis and understanding, and may be harmful if used wrongly (for example switch vs if). Some doesn't apply to this specific MCU (the Cortex-M3 only has a very shallow write-back buffer, and no other cache). Another example of subtle is counting ones in a word; I think there is a technique for counting bits in a full word which is faster than its "Population count - counting the number of bits set". The other link seems okay, but some of it seems to be dealing with stuff which might not be true on a cache-less microcontroller either.

Sometimes sneaky tricks can trick the compiler into making less aggressive optimizations. To be more accurate, I have had the experience of trying to make code go faster, and at some point I did something sneaky which I had expected to work, but which seemed to cause the compiler to suddenly generate slower (more) code. That was years ago, so might no longer be a problem (can't even remember what it was :-( ).

I would encourage everyone trying to get extra performance to measure and compare techniques and changes.
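To put numbers on the Amdahl's law remark in this post, here is a minimal sketch of the usual formula: overall speedup = 1 / ((1 - p) + p / s), where p is the fraction of run time that can be improved and s is the speedup achieved on that fraction. The function name `amdahl_speedup` is hypothetical.

```cpp
// Overall speedup when a fraction `improvable` (0..1) of the run time is
// made `factor` times faster and the rest is left unchanged.
constexpr double amdahl_speedup(double improvable, double factor) {
    return 1.0 / ((1.0 - improvable) + improvable / factor);
}
```

For the case in the post, where 90% of the run time cannot be changed, `amdahl_speedup(0.10, 11.0)` is about 1.1: the remaining 10% has to become 11 times faster just to gain 10% overall. Even an arbitrarily large speedup of that part cannot take the whole program past about 1.11x.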
You might need to read the assembler to understand if it makes any difference because it may be very hard to eliminate the 'Heisenberg effect' in a real-time system, i.e. the overhead of measurement code, or the interaction with external events (without spending money on hardware-based measurement). JoshSanders on "Writing fast code" http://forums.leaflabs.com/topic.php?id=1133#post-7009 Wed, 26 Oct 2011 18:39:16 +0000 JoshSanders 7009@http://forums.leaflabs.com/ Awesome, thanks! I'll definitely check out that course... I've used open courseware before. It was, in its time, the best thing since sliced bread. (Maple came afterwards, of course) mbolivar on "Writing fast code" http://forums.leaflabs.com/topic.php?id=1133#post-7006 Wed, 26 Oct 2011 16:37:47 +0000 mbolivar 7006@http://forums.leaflabs.com/ Maple is based on the STM32 line, so any optimization advice pertinent to that or the ARM Cortex M3 should apply. I personally am a big believer in putting off performance optimization until it's necessary. When I'm convinced that it is, I just try to think like the compiler and measure the results. On that front, I think that a basic grounding in how compilers work goes a long way. I majored in math and computer science in college, so I'm definitely biased towards formal methods, and I definitely have mountains to learn about concrete architectural issues. The usual grain of salt applies ;). MIT's 6.035 (Computer Language Engineering) lecture notes and projects are available online via OpenCourseWare, if you'd like to pursue that avenue further: <a href="http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-035-computer-language-engineering-spring-2010/" rel="nofollow">http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-035-computer-language-engineering-spring-2010/</a> I took this class in 2008; the goal was to take a toy C-like language and write an optimizing compiler for it. 
I lean pretty heavily on what I remember from that course whenever I try to optimize something for libmaple (which, as I mentioned before, is hardly ever). Two books I found really useful while taking that class are: - <a href="http://www.amazon.com/Engineering-Compiler-Keith-Cooper/dp/product-description/155860698X">Cooper and Torczon, Engineering a Compiler</a>: Introductory text on everything from scanning to low IRs. Probably the better book to start with. - <a href="http://www.amazon.com/Advanced-Compiler-Design-Implementation-Muchnick/dp/1558603204">Muchnick, Advanced Compiler Design and Implementation</a>: Nitty-gritty cookbook style reference on individual compiler optimizations and advice on composing them. This book is mostly useful if you want details on how a particular optimization can be implemented. (I mention it because knowing how something gets done sometimes helps write code that lets the optimization "kick in"). JoshSanders on "Writing fast code" http://forums.leaflabs.com/topic.php?id=1133#post-6987 Tue, 25 Oct 2011 19:25:30 +0000 JoshSanders 6987@http://forums.leaflabs.com/ Hi Leaf-ites! As a veteran MATLAB programmer, I've made good use of techniques like these: <a href="http://www.mathworks.com/matlabcentral/fileexchange/5685" rel="nofollow">http://www.mathworks.com/matlabcentral/fileexchange/5685</a> Would be great to have a doc or thread like this one, with examples of how to write efficient MAPLE code, for applications when speed is valued over generality. I've visited a bunch of threads on our forum dealing with ways to optimize code to cut overhead on digital writes and reads, analog reads, etc. There seem to be many ways of doing these things, each with their caveats, and each with varying success. Would love to hear from you guys, in general, which mods you found to be most effective at making your code run faster, and which resources I should turn to for guidance. 
For instance, do all the standard C++ tricks like those in these docs necessarily apply to Maple? <a href="http://en.wikibooks.org/wiki/Optimizing_C%2B%2B/Writing_efficient_code/Performance_improving_features" rel="nofollow">http://en.wikibooks.org/wiki/Optimizing_C%2B%2B/Writing_efficient_code/Performance_improving_features</a> <a href="http://www.codeproject.com/KB/cpp/C___Code_Optimization.aspx" rel="nofollow">http://www.codeproject.com/KB/cpp/C___Code_Optimization.aspx</a> (at the moment, I'm trying to shave a few microseconds off the duty cycle of an instrument I'm designing, but in the long term scope of my development as a MAPLE programmer, any wisdom is greatly appreciated!)