Fast 64-bit division by a constant on 32-bit ARMs

Modern versions of gcc and clang do not optimize long division by a constant into a multiplication by its reciprocal. On 32-bit ARM processors, a library call is generated (e.g. __aeabi_uldivmod or __aeabi_ldivmod), which takes more than a hundred clock cycles to execute. Reduction of a 64-bit division by a constant to multiplication requires multiplying two 64 bit numbers to produce a 128-bit result, although its lower half is discarded. Such a long multiplication is not available on 32-bit arms (same is true for other 32 bit processors). However, all modern arms (i.e. arm v7 and later) include a relatively fast 32x32 to 64 bit multiplication instruction (UMULL). The 64x64 multiplication operation can be written using four 32x32 multiplications and a few additions.

Below is an example code for a 32-bit ARM.

uint64_t umul64x64hi(uint32_t b, uint32_t a, uint32_t d, uint32_t c)
{
    __asm__ ("\n\
umull   r4, r5, %[b], %[d]   \n\
umull   %[d], r4, %[a], %[d]    \n\
adds    r5, %[d]         \n\
umull   %[d], %[a], %[a], %[c]  \n\
adcs    r4, %[d]        \n\
adc     %[a], #0        \n\
umull   %[c], %[b], %[b], %[c]  \n\
adds    r5, %[c]        \n\
adcs    %[b], r4        \n\
adc     %[a], #0        \n\
"   : [a] "+r" (a), [b] "+r" (b), [c] "+r" (c), [d] "+r" (d) : : "r4", "r5");
    return (uint64_t) a << 32 | b;
}

uint64_t div1000(uint64_t x)
{
#if 0
    uint64_t d = 0x20c49ba5e353f7cf;
    x >>= 3;
    return umul64x64hi(x, x >> 32, d, d >> 32) >> 4;
#else
    return x / 1000u;
#endif
}

This implementation was compared against the call to __aeabi_uldivmod on several ARM v7 processors and was found between 4 and 8 times faster. Such an improvement will be very welcome for e.g. inner loops doing time unit conversion. Approximate clock cycle costs of dividing by 1000 are in the table below. The results are not cycle-accurate and may be off by a cycle or two.

CPU	Clock Frequency, MHz	Inline UMULL-based, cycles	__aeabi_uldivmod call, cycles	Times faster
Cortex M3	16	42	154	3.7
Cortex M3	72	50	197	3.9
Cortex M4	96	22	180	8.2
Cortex M4	16	22	152	6.9
Cortex A7	1000	24	136	5.7

Note that for the M cores, the measurements are for code running from flash, which causes additional memory stalls on instruction fetches. Therefore, in addition to their nominal frequency, execution on the M cores was also measured at a zero wait state frequency (16 MHz). The low frequency result is more representative of the integer arithmetic block performance rather than of the processor as a whole.

The difference between the M3 and M4 cores was quite surprising. Previously, I have been assuming that these cores are identical, with the exception of the floating point/DSP block added in M4.

The only downside of the UMULL-based implementation is a potential code size increase as it is 50-60 bytes longer than the call. However, this will rarely be a problem, as long as there is not a huge number of divisions in the code. In fact, the library function is not small either, and a program with just a few divisions will be actually smaller without it linked in.

The ARM architecture also defines the UMLAL instruction which combines a 32-bit multiplication with 64-bit addition. Unfortunately, UMLAL does not set the carry flag and therefore cannot be used to reduce the number of additions, with the possible exception of the two last ones.