LeafLabs Garden » Topic: Simple Maple-Arduino speed comparison reveals compiler bugs

http://forums.leaflabs.com/topic.php?id=149

gbulmer on "Simple Maple-Arduino speed comparison reveals compiler bugs" http://forums.leaflabs.com/topic.php?id=149&page=2#post-998 Sat, 04 Sep 2010 11:36:21 +0000

Thanks for that.

As an even less important question, has anyone played with the gcc -fmudflap options? From the gcc manual:

-fmudflap -fmudflapth -fmudflapir

For front-ends that support it (C and C++), instrument all risky pointer/array dereferencing operations, some standard library string/heap functions, and some other associated constructs with range/validity tests. Modules so instrumented should be immune to buffer overflows, invalid heap use, and some other classes of C/C++ programming errors. The instrumentation relies on a separate runtime library (‘libmudflap’), which will be linked into a program if ‘-fmudflap’ is given at link time. Run-time behavior of the instrumented program is controlled by the MUDFLAP_OPTIONS environment variable. See env MUDFLAP_OPTIONS=-help a.out for its options.

Use ‘-fmudflapth’ instead of ‘-fmudflap’ to compile and to link if your program is multi-threaded. Use ‘-fmudflapir’, in addition to ‘-fmudflap’ or ‘-fmudflapth’, if instrumentation should ignore pointer reads. This produces less instrumentation (and therefore faster execution) and still provides some protection against outright memory corrupting writes, but allows erroneously read data to propagate within a program.

This seemed like a useful option, especially during development and debugging, but the library has to be available.
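To make "range/validity tests" a little more concrete, here is a hand-written sketch of the kind of check such instrumentation performs around an array write. It is only an illustration: `checked_store` is a hypothetical helper, not part of libmudflap, and it says nothing about how mudflap is actually implemented.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a libmudflap function): store `value` at
// buf[index] only if the index is inside the buffer.
// Returns true when the write happened, false when it was refused.
bool checked_store(int *buf, std::size_t len, std::size_t index, int value) {
    assert(buf != nullptr);   // precondition: caller passes a real buffer
    if (index >= len) {
        return false;         // out-of-range write refused, memory untouched
    }
    buf[index] = value;
    return true;
}
```

With a four-element buffer, a store to index 3 succeeds, while a store to index 4 (the classic off-by-one) is rejected instead of silently corrupting whatever follows the array.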
mbolivar on "Simple Maple-Arduino speed comparison reveals compiler bugs" http://forums.leaflabs.com/topic.php?id=149&page=2#post-995 Sat, 04 Sep 2010 10:34:25 +0000

leaflabsandy:

<blockquote> Reply: Change your "member" name to Key Master so other users can identify you as a leafblower. </blockquote>

Done :)

<blockquote> Is your background in hardware or software or both? BTW ... Welcome. </blockquote>

My background is in software and math. And thanks for the welcome!

gbulmer:

<blockquote> Welcome mbolivar, and thanks for the answer. Is there a reason why it is -march=armv7-m and not -mcpu=cortex-m3 ? Just slightly curious, so don't spend any time on this unless you believe there is a problem building using -mcpu=cortex-m3. </blockquote>

Thanks for your welcome as well! As to your question, if you look at GLOBAL_CFLAGS, you'll notice that we do actually use -mcpu=cortex-m3. From the GCC manpage:

<blockquote> -march=name This specifies the name of the target ARM architecture. GCC uses this name to determine what kind of instructions it can emit when generating assembly code. This option can be used in conjunction with or instead of the -mcpu= option. </blockquote>

gbulmer on "Simple Maple-Arduino speed comparison reveals compiler bugs" http://forums.leaflabs.com/topic.php?id=149&page=2#post-981 Fri, 03 Sep 2010 20:46:16 +0000

Welcome mbolivar, and thanks for the answer. Is there a reason why it is -march=armv7-m and not -mcpu=cortex-m3 ? Just slightly curious, so don't spend any time on this unless you believe there is a problem building using -mcpu=cortex-m3.

StephenFromNYC on "Simple Maple-Arduino speed comparison reveals compiler bugs" http://forums.leaflabs.com/topic.php?id=149&page=2#post-975 Fri, 03 Sep 2010 08:42:01 +0000

September 3, 2010

Hello-

gbulmer, thanks for the explanation.
It makes complete sense that the value of <code>int counter1</code> increments and eventually overflows to -3276X (I always forget the lower boundary value) BEFORE it is promoted to type <code>long</code> and compared to <code>limit</code>.

Yes, floating point calculations are subject to "rounding" errors. However, I feel it should not make a difference if a loop repeats 1000X or just 1X. If the loop is empty I think it should be removed by the compiler. Maybe the compiler authors have a reason for this feature. Perhaps it is a simple way to measure the speed of variable promotion! I am still surprised that in my sample code the empty loops using a limit of type float require more than 0 ms to execute.

I just thought of those examples of "bad" code where a programmer crams a lot of unreadable code into the "test" expression of a FOR loop. Perhaps this is another reason to use just WHILE loops. <a href="http://forums.leaflabs.com/topic.php?id=117#post-947" rel="nofollow">http://forums.leaflabs.com/topic.php?id=117#post-947</a>

I guess I just answered my own question about the execution times of empty loops.

Thanks!

Stephen from NYC

leaflabsandy on "Simple Maple-Arduino speed comparison reveals compiler bugs" http://forums.leaflabs.com/topic.php?id=149#post-973 Fri, 03 Sep 2010 07:30:41 +0000

Quote: "(Hello to everybody on the forums from a new member of the LeafLabs team!)"

Reply: Change your "member" name to Key Master so other users can identify you as a leafblower. Is your background in hardware or software or both? BTW ... Welcome.

mbolivar on "Simple Maple-Arduino speed comparison reveals compiler bugs" http://forums.leaflabs.com/topic.php?id=149#post-972 Fri, 03 Sep 2010 03:11:01 +0000

(Hello to everybody on the forums from a new member of the LeafLabs team!)

<blockquote>PS, Leaflabs: What optimisation flags does the Maple IDE hand to the compiler?
</blockquote>

Currently, we compile with the following optimization flags:

-Os -march=armv7-m -ffunction-sections -fdata-sections

We also use these flags for C++:

-fno-rtti -fno-exceptions

Compiler flags are not currently user-configurable from within the IDE, but if there's interest, we might fold that functionality in at some point.

If you're interested in experimenting with different flags right now, you might want to jump to using our Unix toolchain: <a href="http://leaflabs.com/docs/libmaple/unix-toolchain/" rel="nofollow">http://leaflabs.com/docs/libmaple/unix-toolchain/</a>

Note that direct use of the Unix toolchain is OPTIONAL, and requires that you know your way around a Unix shell / how to use make / how to program in C and C++ / etc. If you're uncomfortable with any of these, we recommend you stick with the Maple IDE.

Specifically, after you've made a fresh project (see: <a href="http://leaflabs.com/docs/libmaple/unix-toolchain/#startingyourown" rel="nofollow">http://leaflabs.com/docs/libmaple/unix-toolchain/#startingyourown</a>) you can check out the Makefile variables <code>GLOBAL_CFLAGS</code>, <code>GLOBAL_CXXFLAGS</code>, and <code>LDFLAGS</code>.

Happy hacking!

gbulmer on "Simple Maple-Arduino speed comparison reveals compiler bugs" http://forums.leaflabs.com/topic.php?id=149#post-966 Thu, 02 Sep 2010 22:45:53 +0000

<blockquote> CarlO, when I was learning C I remember reading about variable type "promotion". This concept is usually taught in the context of int vs float. In my code, when I changed the limit variable to type long I assumed the int counters would be promoted during compilation to type long. If this is not a bug in the compiler(s) I obviously do not understand some subtle property of variable promotion. Maybe I should find my old text and read it again!</blockquote>

C/C++ is very obedient, which can be surprising.
When a variable is defined as <code>int counter0;</code> it will take up exactly the amount of space the compiler for that processor uses for an <code>int</code>. Unlike Java, an <code>int</code> in C/C++ can vary in size on different processors. It is 16 bits (2 bytes) on ATmega (Arduino) and 32 bits (4 bytes) on STM32F (Maple). C/C++ will always allocate the same amount of space for a data type on the same processor (its compiler-allocated address may 'wiggle about' in memory if you ask the compiler to put a variable on 4-byte boundaries, but the space used for the variable won't change).

Type promotion describes how arithmetic expressions are evaluated, not how values are stored. For example, when C/C++ does arithmetic on an <code>int</code> and a <code>long</code>, if the <code>int</code> is smaller, C/C++ will compile in machine instructions to convert the <code>int</code> to a <code>long</code>, and then do the arithmetic on two <code>long</code>s. Comparing two numbers is treated as arithmetic.

Hence the for statement never exited because the <code>int counter</code> couldn't store a value bigger than an <code>int</code> can hold, and on an ATmega (Arduino) that is only -32768 to +32767. So the <code>int</code> is promoted to a <code>long</code> for the comparison, but it will only ever be a value in the range that an <code>int</code> can hold, which on an ATmega is always less than 100000.

I feel CarlO's explanation for not optimising an empty loop which uses float variables is plausible; the compiler can't calculate at compile time how many times the loop will execute, because float arithmetic is imprecise and subject to small errors, so it can't deduce the final values of the counter variables, so it can't remove the code.

If you are concerned about not having an empty float-controlled for loop removed, try fiddling with the optimisation flags.
There are a bunch of them. I usually optimise for space (as small code fits in cache better, so ends up being quick), but there are a lot to choose from with the GCC compiler suite; one may be more aggressive, and take the code out.

HTH

PS, Leaflabs: What optimisation flags does the Maple IDE hand to the compiler?

StephenFromNYC on "Simple Maple-Arduino speed comparison reveals compiler bugs" http://forums.leaflabs.com/topic.php?id=149#post-952 Thu, 02 Sep 2010 08:56:30 +0000

September 2, 2010

Hello,

Thanks for all the replies to my bug/benchmark post.

leaflabsandy, yes, based on the faster clock speed I was expecting to see the 72 MHz Maple run faster than a 16 MHz Arduino. I was expecting a speedup of roughly 4.5X (=72/16). I did the simple benchmarking to show myself that the Maple runs similar code faster. Do you remember President Reagan's quote regarding nuclear treaties, "Trust, but verify"?

Note to the LeafLabs development team: using a Pentium III computer, the Arduino compiles and loads the code faster than the Maple.

In my post I chose the expression "4X-5X" because the speedup varies with the test. The speedup is not always 4.5X. It is occasionally less than 4X but occasionally greater than 5X. It is possible I copied the results incorrectly, but the general trend is clear. In one test which does not show the benefits of a 32-bit MCU, the Arduino requires 4X-5X more time to execute similar code.

gbulmer, thanks for your ideas. When I said (at the end of my post) "Note: I did not try to change the type of the 'counterN' variables from int to float or long. I do not know if this will affect how the code is compiled" I did not think it was the cause of the problem!

CarlO, when I was learning C I remember reading about variable type "promotion". This concept is usually taught in the context of int vs float.
In my code, when I changed the limit variable to type long I assumed the int counters would be promoted during compilation to type long. If this is not a bug in the compiler(s) I obviously do not understand some subtle property of variable promotion. Maybe I should find my old text and read it again!

Regardless of the type of the variable(s) used in a FOR loop, I assumed any empty loop would be optimized away by the compiler. Execution of all empty loops should take 0 milliseconds. I forgot, does infinity times zero equal zero or does the product equal NaN (not a number)? Does the NaN concept have any relevance to compilation of an empty loop?

I do not have the time to contact the avr-gcc compiler project. If you are a contributor to the avr-gcc forums and you think I have uncovered a bug please pass on the information. I wanted to share my observations, because the LeafLabs development team might be interested in sketches where the Maple behaves in surprising ways.

Thanks!

Stephen from NYC

gbulmer on "Simple Maple-Arduino speed comparison reveals compiler bugs" http://forums.leaflabs.com/topic.php?id=149#post-944 Wed, 01 Sep 2010 16:42:57 +0000

ST have announced 120MHz STM32 parts: <a href="http://www.st.com/stonline/stappl/cms/press/news/year2010/t2477.htm" rel="nofollow">http://www.st.com/stonline/stappl/cms/press/news/year2010/t2477.htm</a>

If they migrate production of existing parts to the smaller process, overclocking may be more stable. The improvement comes from moving to a smaller scale process (90nm) which means faster transistors.
To make up for the disparity between the processor and Flash, they have an even fancier prefetch mechanism than the existing STM32F's:

<blockquote>To release the processor’s full 150 DMIPS performance at this frequency the accelerator implements an instruction pre-fetch queue and branch cache, enabling program execution from Flash at up to 120MHz with zero wait states.</blockquote>

I assume the key words are "at up to" 120MHz :-)

Marketing, don't ya just love 'em?-)

gbulmer on "Simple Maple-Arduino speed comparison reveals compiler bugs" http://forums.leaflabs.com/topic.php?id=149#post-942 Wed, 01 Sep 2010 16:32:51 +0000

<blockquote>"- 72 MHz maximum frequency, 1.25 DMIPS/MHz (Dhrystone 2.1) performance at 0 wait state memory access" </blockquote>

Yes, I think there is a bit of "marketing spin" creeping in here, but I don't think it is as bad as it might seem.

There is a prefetch buffer which holds two blocks of 64 bits (2x8 bytes) of instructions, each of which is read in one read from Flash (Flash memory and the ICode instruction bus are 64 bits wide). If the CPU were executing only 16-bit instructions (that is 4 instructions/block), in perfect conditions the processor wouldn't wait even if Flash had 3 wait states. Of course, things are not perfect, but for straight-line code, or small loops, it might get quite close.

As you say, the "1.25 DMIPS/MHz" is true for internal RAM, and I believe is true for Flash when the STM32F runs at 24MHz, but that's where the "marketing spin" is misleading. It would be very helpful to have some 'typical' values for a suite of apps run at 48MHz and 72MHz.

AFAIK, all the MCU manufacturers apply this same "marketing spin" when their processor runs faster than flash, not just ARM licensees, but others too.

While on the subject of "marketing spin", I noticed that the ADC only reaches 1M samples/second when the processor is clocked at 56MHz, and drops to a best case of roughly 850K samples/second at 72MHz.
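The arithmetic behind those two figures can be sketched roughly as below. Several things here are assumptions taken from ST's STM32F103 documentation rather than from this thread: that the ADC prescaler can only be 2, 4, 6 or 8, that the fastest conversion takes 14 ADC clock cycles (1.5 sampling + 12.5 conversion), and that the ADC's input clock equals the system clock. `best_adc_sample_rate` is a made-up name for illustration, not a libmaple function.

```cpp
#include <cassert>

// Best-case ADC sample rate (samples/second) for a given ADC input clock.
// Assumptions, not stated in the thread: prescaler is one of 2, 4, 6, 8;
// the ADC clock may not exceed 14 MHz; the fastest conversion costs
// 14 ADC clock cycles (1.5 sampling + 12.5 conversion).
long best_adc_sample_rate(long pclk2_hz) {
    assert(pclk2_hz > 0);
    const long kMaxAdcClockHz = 14000000L;
    const long kCyclesPerConversion = 14;
    const int kPrescalers[] = {2, 4, 6, 8};
    // Try the smallest divider first: it gives the fastest legal ADC clock.
    for (int i = 0; i < 4; ++i) {
        long adc_clock_hz = pclk2_hz / kPrescalers[i];
        if (adc_clock_hz <= kMaxAdcClockHz) {
            return adc_clock_hz / kCyclesPerConversion;
        }
    }
    return 0;  // even divide-by-8 leaves the ADC clock above the limit
}
```

Under those assumptions `best_adc_sample_rate(56000000L)` gives 1000000 (56MHz / 4 = 14MHz) and `best_adc_sample_rate(72000000L)` gives 857142 (72MHz / 6 = 12MHz), which lines up with the "roughly 850K" figure.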
The disparity is because the ADC clock can't exceed 14MHz, and it is derived from the system clock, divided down for the ADC by an integer value.

To be fair, 850K samples/second, on each of two or three ADCs, is so much better than an Arduino's <10k samples/second, I am still extremely happy.

poslathian on "Simple Maple-Arduino speed comparison reveals compiler bugs" http://forums.leaflabs.com/topic.php?id=149#post-939 Wed, 01 Sep 2010 15:03:02 +0000

Bryan's post on the subject: <a href="http://forums.leaflabs.com/topic.php?id=31#post-114" rel="nofollow">http://forums.leaflabs.com/topic.php?id=31#post-114</a>

poslathian on "Simple Maple-Arduino speed comparison reveals compiler bugs" http://forums.leaflabs.com/topic.php?id=149#post-938 Wed, 01 Sep 2010 15:02:34 +0000

Since we're talking about performance: the stm32 on maple CAN be overclocked! Obviously, at your own peril! User reports and internal experiments suggest that the probability of brickage from stepping up to 120MHz is minimal; however, the digital I/O and many of the peripherals start to get funky, square waves turn to sines, and some comm peripherals just stop functioning.

You can overclock by setting the PLL multiplier to higher than 9. You can find where this happens in libmaple, probably in rcc.c

CarlO on "Simple Maple-Arduino speed comparison reveals compiler bugs" http://forums.leaflabs.com/topic.php?id=149#post-937 Wed, 01 Sep 2010 14:41:42 +0000

gbulmer,

That's great info on the access time for Flash, thanks for doing the research on this. I guess ST's claim of 1.25 DMIPS/MHz on the front page of the data sheet only applies to SRAM then? Interesting that they hide this info in a footnote. When I get some time I'll do some tests to see what the difference is in practical application terms.
I also discovered that the I/O ports are clocked or latched at a lower rate than the CPU core.

"- 72 MHz maximum frequency, 1.25 DMIPS/MHz (Dhrystone 2.1) performance at 0 wait state memory access"

CarlO.

gbulmer on "Simple Maple-Arduino speed comparison reveals compiler bugs" http://forums.leaflabs.com/topic.php?id=149#post-934 Wed, 01 Sep 2010 13:27:32 +0000

CarlO, I think llandy's calculation of 72MHz/16MHz = 4.5 gives an answer in a similar ball park to the relative speed of the Maple and Arduino on the cases where there is a nonzero time.

The theory that the Maple I/O is making a significant difference could be tested by printing a single character, <code>SerialUSB.print(".");</code> or its Arduino equivalent, instead of <code>SerialUSB.println("Inner loop finished");</code> or its Arduino equivalent. The serial output (USART) on the Arduino has one byte of buffering, so this should minimise any differences between the two systems, leaving mostly the difference in for loop speed.

As CarlO wrote, comparing loops using 32-bit <code>long</code> for limit and counters should show up a bigger performance difference than the clock speed alone would account for, because the Maple's 32-bit Cortex-M3 (STM32F) can process 32-bit numbers in far fewer instructions than the Arduino's 8-bit ATmega needs for the same operation. I am actually impressed that the Arduino seems to be slightly faster than the clock speed difference might suggest on integer operations.

The memory access speed for flash and SRAM on the Maple's STM32F is different. According to document 13587.pdf (from st.com), section "2.3.4 Embedded SRAM" says:

Twenty Kbytes of embedded SRAM accessed (read/write) at CPU clock speed with 0 wait states.

Whereas document 13902.pdf (also from st.com), section "2.3.3 Embedded Flash memory", page 53 says:

"Note: 1 These options should be used in accordance with the Flash memory access time.
The wait states represent the ratio of the SYSCLK (system clock) period to the Flash memory access time:
zero wait state, if 0 < SYSCLK <= 24 MHz
one wait state, if 24 MHz < SYSCLK <= 48 MHz
two wait states, if 48 MHz < SYSCLK <= 72 MHz"

The impact of the wait states is reduced by a prefetch buffer; page 52 says:

"Prefetch buffer (2 x 64-bit blocks): it is enabled after reset; a whole block can be replaced with a single read from the Flash memory as the size of the block matches the bandwidth of the Flash memory. Thanks to the prefetch buffer, faster CPU execution is possible as the CPU fetches one word at a time with the next word readily available in the prefetch buffer"

So Flash memory may be slower than SRAM, but estimating how much slower is not easy because it depends on the dynamic flow of the program.

I don't know if your theory is correct, but I think your explanation for the compilers not optimising the for loops is a good one. I'm happy to buy it. As you say, the compiler may know that it can't exactly predict how floating point arithmetic works in these situations, and takes the conservative (and guaranteed to be correct) approach of compiling the code as written.

I believe the test shows no evidence of *bugs* in the compiler. I don't believe it is a bug to fail to optimise away code in a program. IMHO, a compiler should always strive to generate code that gives the correct answer (i.e. what the code says), and secondarily try to optimise that code.

I think it would be helpful if the compiler did print a warning to the effect that 'comparing int counter0 to long limit might give unexpected results'. Maybe that warning is off by default as many programs would provoke it. There are a few gcc warnings which might help:

-Wstrict-overflow -Wfloat-equal -Wunsafe-loop-optimizations -Wtype-limits

I think it would be helpful if every set of tests (e.g. using float or long) used the same set of values for limit so that it is easier to compare results across tests.
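For anyone who wants to see the int-versus-long effect without an Arduino to hand, here is a minimal host-side sketch. It assumes a 16-bit `int` can be emulated with `std::int16_t`, and that the out-of-range narrowing conversion wraps (as it does on GCC); overflow of a real `int` on the ATmega is formally undefined behaviour, so this models what happens in practice, not what the standard promises. The names `next_int16` and `loop_exits` are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Emulate incrementing the ATmega's 16-bit int: 32767 + 1 wraps to -32768.
// The narrowing conversion back to int16_t wraps on GCC; this stands in for
// what the AVR does in practice with a real (16-bit) int.
std::int16_t next_int16(std::int16_t value) {
    return static_cast<std::int16_t>(static_cast<std::uint16_t>(value) + 1u);
}

// Returns true if the condition `counter < limit` ever becomes false within
// max_steps increments of a 16-bit counter starting at 0.
bool loop_exits(long limit, long max_steps) {
    assert(max_steps >= 0);
    std::int16_t counter = 0;
    for (long step = 0; step <= max_steps; ++step) {
        // counter is promoted to long for this comparison, but its value
        // can never be outside -32768..32767.
        if (!(counter < limit)) {
            return true;
        }
        counter = next_int16(counter);
    }
    return false;  // gave up: the emulated for loop would still be running
}
```

With a limit of 30000 the emulated loop exits, but with a limit of 100000 the counter wraps from 32767 to -32768 and the comparison stays true however long you let it run.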
HTH

poslathian on "Simple Maple-Arduino speed comparison reveals compiler bugs" http://forums.leaflabs.com/topic.php?id=149#post-933 Wed, 01 Sep 2010 13:18:35 +0000

cool!

In terms of performance, we haven't really scratched the surface. Sure the clock is faster. But the real win comes from DMA, using fast interrupts, and dedicated communication peripherals (i2c, spi, etc). I can't wait to start benchmarking that stuff.

Incidentally we've finally started on bringing up the DMA, which allows you to do fancy stuff like copy memory from location A to B in the "background" without wasting cycles (except the cycles needed to dispatch the copy operation through the DMA). This works really nicely for communication, where you can throw a bunch of data into a buffer and then use the DMA to squirrel it all out of a uart/i2c/spi port without actually spending cycles to do that.

If anyone has some free cycles, it would be a great help to do some advanced play using those features! *warning* i2c peripheral is nasty complicated, but the dma is simple.