Fast 64-bit division by a constant on 32-bit ARMs
Modern versions of gcc and clang do not optimize long division by a constant into a multiplication by its reciprocal.
On 32-bit ARM processors, a library call is generated (e.g. __aeabi_uldivmod
or __aeabi_ldivmod
), which takes more
than a hundred clock cycles to execute. Reduction of a 64-bit division by a constant to multiplication requires
multiplying two 64 bit numbers to produce a 128-bit result, although its lower half is discarded. Such a long
multiplication is not available on 32-bit arms (same is true for other 32 bit processors). However, all modern arms
(i.e. arm v7 and later) include a relatively fast 32x32 to 64 bit multiplication instruction (UMULL
). The 64x64
multiplication operation can be written using four 32x32 multiplications and a few additions.
Below is an example code for a 32-bit ARM.
This implementation was compared against the call to __aeabi_uldivmod
on several ARM v7 processors and was found
between 4 and 8 times faster. Such an improvement will be very welcome for e.g. inner loops doing time unit conversion.
Approximate clock cycle costs of dividing by 1000 are in the table below. The results are not cycle-accurate and may
be off by a cycle or two.
CPU | Clock Frequency, MHz | Inline UMULL-based, cycles | __aeabi_uldivmod call, cycles | Times faster |
---|---|---|---|---|
Cortex M3 | 16 | 42 | 154 | 3.7 |
Cortex M3 | 72 | 50 | 197 | 3.9 |
Cortex M4 | 96 | 22 | 180 | 8.2 |
Cortex M4 | 16 | 22 | 152 | 6.9 |
Cortex A7 | 1000 | 24 | 136 | 5.7 |
Note that for the M cores, the measurements are for code running from flash, which causes additional memory stalls on instruction fetches. Therefore, in addition to their nominal frequency, execution on the M cores was also measured at a zero wait state frequency (16 MHz). The low frequency result is more representative of the integer arithmetic block performance rather than of the processor as a whole.
The difference between the M3 and M4 cores was quite surprising. Previously, I have been assuming that these cores are identical, with the exception of the floating point/DSP block added in M4.
The only downside of the UMULL
-based implementation is a potential code size increase as it is 50-60 bytes longer than the call. However, this will rarely be a problem, as long as there is not a huge number of divisions in the code. In fact, the library function is not small either, and a program with just a few divisions will be actually smaller without it linked in.
The ARM architecture also defines the UMLAL
instruction which combines a 32-bit multiplication with 64-bit addition. Unfortunately, UMLAL
does not set the carry flag and therefore cannot be used to reduce the number of additions, with the possible exception of the two last ones.