pyrohaz - That is a very interesting question. The 'C' source code may be the same in several cases, yet the explanation may be different for different Central Processing Units (CPUs). Would you post a few links to the GPIO code you've been reading? It might make this thread even more interesting and useful.
As bnewbold wrote, the compiler may convert different 'C' into the same machine code. Modern compilers can be fiendishly sneaky, though not perfect. You'd need to look at the compiler's output (e.g. by compiling to assembler, or using objdump to disassemble the machine code) to see what is actually happening.
Even then, the relative speed is not clear unless we understand instruction timing.
Though RISC (Reduced Instruction Set Computing) CPUs aim for single-cycle instruction execution, several implementations, including ARM Cortex-M3, do not achieve it. Sometimes longer sequences of instructions run faster than shorter sequences; more instructions run faster than fewer instructions!
On Cortex-M3, an arithmetic operation between registers takes one cycle, but a load or store between memory (or an I/O port) and a register takes at least 2 cycles. So a sequence which needs four loads and one arithmetic operation (nine cycles) may be slower than one with one load and six arithmetic operations (eight cycles).
Loads may be even slower from Flash memory, which usually contains literal constants, because it is much slower than SRAM; at 72MHz, Flash access has a couple of wait states on STM32F1. This is usually hidden for program code because the STM32F1 reads 8 bytes of instructions at a time, but it isn't hidden for random data access.
Most modern CPUs overlap fetching, decoding and executing one instruction with the next instruction (they are 'pipelined'); Cortex-M3 has a three-stage pipeline. Load and store instructions need their own memory accesses on top of instruction fetches, so a dense run of them can demand memory faster than it can respond, causing wait cycles which "stall" the CPU. So the speed difference between instruction sequences may be slightly more complex than simply counting cycles for each instruction type. For example, a sequence with several sequential reads might run more slowly than one with reads interspersed by arithmetic operations. The compiler should be hiding these effects, but that isn't always possible.
Normally the "?:" operator implies a branch instruction, which reduces the effectiveness of a pipelined CPU. A branch can cause a "stall" because the CPU's pipeline has already read and decoded instructions which are after the branch, but if the branch is taken those instructions are thrown away, and not used. So the CPU "stalls" while it fetches new instructions and begins decoding them.
However the Cortex-M3 has 'conditional' instruction execution (via the Thumb-2 IT, "If-Then", instruction). The compiler can emit a test, e.g. for non-zero, and up to four following instructions may be executed or skipped depending on their sensitivity to the test. So
int a = foo & 0x0020 ? 1 : 0;
might be such a short sequence that it does not need a branch instruction; instead it uses conditional execution. This can avoid a "pipeline stall", because the following instructions are consumed rather than discarded, and it avoids a 'random' read of Flash memory.
Further, foo & 0x0020 ? 1 : 0
might require loading fewer constants than a shift-and-mask sequence, so it may use fewer cycles that way too.
Summary:
- Loads and stores take at least 2 cycles vs 1 cycle for register-to-register operations on Cortex-M3
- Flash has wait states at 72MHz; that is often hidden for straight-line program code by the STM32F1 reading 8 bytes of Flash at a time
- Conditional instruction blocks can avoid potential "pipeline stalls" and Flash wait states
- Sequential loads/stores can be slower than loads/stores interspersed with register-to-register operations.
- It isn't simple to estimate relative performance, even of short instruction sequences.
An example of using bit-band addressing to speed up reading or writing is explained in the thread:
http://forums.leaflabs.com/topic.php?id=737&page=2#post-22939
Forum member tlgosser very helpfully took the example, expanded it (for all Maples), tested it and posted a complete library at: https://github.com/tlgosser/Maplefiles
The comparison of using bit-banding vs 'normal' port access is
A. loading the bit-band address (unique for every port-pin) then reading the bit, vs
B. loading a port address (e.g. GPIO_A_PORT_READ, common for up to 16 pins on STM32F1), reading the port, loading the shift count (GPIO_Pin), possibly also loading a mask (1) and doing the mask operation (&).
For single pin I/O, digitalReadFaster/digitalWriteFaster should always be quicker than digitalRead/digitalWrite.
A way to understand this stuff is to write tiny programs, look at the machine code, and estimate its performance. You'll need a copy of the ARM manual which gives Cortex-M3 instruction times (I think the appropriate one is DDI 0337G, "Cortex-M3 r2p0 Technical Reference Manual").
An easier way is to buy ARM's simulator (included in the professional version of Keil's tools, I believe), but IIRC that is in the $thousands. (I haven't checked recently, but last time I looked, there wasn't an Open Source cycle-accurate ARM simulator.)
(Another way is to run programs and measure them, though that needs some care if the results are to be valid.)