pyrohaz - That is a very interesting question. The 'C' source code may be the same in several cases, yet the explanation may be different for different Central Processing Units (CPUs). Would you post a few links to the GPIO code you've been reading? It might make this thread even more interesting and useful.
As bnewbold wrote, the compiler may convert different 'C' into the same machine code. Modern compilers can be fiendishly sneaky, though not perfect. You'd need to look at the compiler's output (e.g. by compiling to assembler, or using objdump to disassemble the machine code) to see what is actually happening.
Even then, the relative speed is not clear unless we understand instruction timing.
Though RISC (Reduced Instruction Set Computing) CPUs aim for single-cycle instruction execution, several implementations, including ARM Cortex-M3, do not achieve it. Sometimes longer sequences of instructions run faster than shorter sequences; more instructions run faster than fewer instructions!
On Cortex-M3, an arithmetic operation between registers takes one cycle, but a load or store between memory (or an I/O port) and a register takes at least 2 cycles. So a sequence which needs four loads and one arithmetic operation (nine cycles) may be slower than one with one load and six arithmetic operations (eight cycles).
Loads may be even slower from Flash memory, which usually contains literal constants, because it is much slower than SRAM; at 72MHz, Flash access has a couple of wait states on STM32F1. This is usually hidden for program code because the STM32F1 reads 8 bytes of instructions at a time, but it isn't hidden for random data access.
Most modern CPUs overlap fetching, decoding and executing one instruction with the next instruction (they are 'pipelined'); Cortex-M3 has a three-stage pipeline. Load and store instructions need their own memory accesses on top of instruction fetches, so a dense run of them can demand memory faster than it can respond, causing wait cycles which "stall" the CPU. So the speed difference between instruction sequences may be slightly more complex than simply counting cycles for each instruction type. For example, a sequence with several sequential reads might run more slowly than one with reads interspersed by arithmetic operations. The compiler should be hiding these effects, but that isn't always possible.
Normally the "?:" operator implies a branch instruction, which reduces the effectiveness of a pipelined CPU. A branch can cause a "stall" because the CPU's pipeline has already read and decoded instructions which are after the branch, but if the branch is taken those instructions are thrown away, and not used. So the CPU "stalls" while it fetches new instructions and begins decoding them.
However the Cortex-M3 has 'conditional' instruction execution (via the Thumb-2 IT, "If-Then", instruction). The compiler can emit a test, e.g. for non-zero, and up to four following instructions may be executed or skipped depending on their sensitivity to the test. So
int a = foo & 0x0020 ? 1 : 0;
might be such a short sequence that it does not need a branch instruction; instead it uses conditional execution. This can avoid a "pipeline stall", because the following instructions are consumed rather than discarded, and it avoids a 'random' read of Flash memory.
Further, foo & 0x0020 ? 1 : 0
might require loading fewer constants than a shift-and-mask sequence, so it may use fewer cycles that way too.
Summary:
- Loads and stores take at least 2 cycles vs 1 cycle for register-to-register operations on Cortex-M3
- Flash has wait states at 72MHz; that is often hidden for straight-line program code by the STM32F1 reading 8 bytes of Flash at a time
- Conditional instruction blocks can avoid potential "pipeline stalls" and Flash wait states
- Sequential loads/stores can be slower than loads/stores interspersed with register-to-register operations.
- It isn't simple to estimate relative performance, even of short instruction sequences.
An example of using bit-band addressing to speed up reading or writing is explained in the thread:
http://forums.leaflabs.com/topic.php?id=737&page=2#post-22939
Forum member tlgosser very helpfully took the example, expanded it (for all Maples), tested it and posted a complete library at: https://github.com/tlgosser/Maplefiles
The comparison of using bit-banding vs 'normal' port access is
A. loading the bit-band address (unique for every port-pin) then reading the bit, vs
B. loading a port address (e.g. GPIO_A_PORT_READ, common for up to 16 pins on STM32F1), reading the port, loading the shift count (GPIO_Pin), possibly also loading a mask (1) and doing the mask operation (&).
For single pin I/O, digitalReadFaster/digitalWriteFaster should always be quicker than digitalRead/digitalWrite.
A way to understand this stuff is to write tiny programs, look at the machine code, and estimate its performance. You'll need a copy of the ARM manual which gives Cortex-M3 instruction times (I think the appropriate one is DDI 0337G, "Cortex-M3 r2p0 Technical Reference Manual").
An easier way is to buy ARM's simulator (included in the professional version of Keil's tools, I believe), but IIRC that is in the $thousands. (I haven't checked recently, but last time I looked, there wasn't an Open Source cycle-accurate ARM simulator.)
(Another way is to run programs and measure them, though that needs some care if the results are to be valid.)