Blog Index June 7 2024

Thoughts on Implementing (SIMD) fmod

A while ago I became interested in creating implementations of fmod, and more specifically, SIMD vectorized implementations. This is part of a broader effort to maximize what I’m able to get out of my Ice Lake CPU which features AVX-512. With such high throughput instructions available to me I figured it would be a waste to not use them. However, when it comes to many operations that are even slightly higher-level, it’s a bit of a challenge to utilize SIMD since vectorized implementations of most non-trivial operations either do not exist, or are not widespread.

I was able to find only one SIMD vectorized implementation of fmod online, that belonging to the Sleef library. However, taking a look at their implementation, it seemed more complex than strictly necessary. Their approach appears to heavily rely on Decker arithmetic, that is, using two floats to represent one numeric value such that you effectively have a signficand which is twice as wide as normal. I had a hunch that simpler alternatives were possible, so I decided to look into it.

As it turns out, this problem is both conceptually simple yet features a number of subtleties. In retrospect, that’s not surprising because it belongs to the family of division/remainder algorithms, and there’s an extensive body of work on that subject. Contrary to how my previous experience vectorizing other algorithms informed my expectations, the SIMD vectorization itself did not form the crux of this problem.

For the most part, I’ll be discussing different algorithms for evaluating fmod and the their practical considerations when being implemented for 16, 32, and 64-bit floats, both in the context of scalar and SIMD code, on x86 and ARM.

For now, let’s begin by taking a look at the fmod function itself.

The Requirements

A proper implementation of fmod(n, d) should satisfy the following behaviors:

Note the second-to-last point can be combined with the second and third to simplify them down to just:

We will being ignoring any additional error handling that an implementation may offer, such as floating-point exceptions, and the setting of global variables.

Although listing these requirements is somewhat necessary if your goal is to implement fmod, the real meat of this post will focus on the common case where both inputs are finite values. We’ll also assume that they’re both positive because the sign of fmod’s result is just the sign of n, and that’s trivial to implement using copysign, which itself can be implemented with a couple of bitwise operations. Similarly, the handling of edge cases will be left until the end since they’re not as interesting.

The Naive Solution

An obvious first attempt can fit on a single line:

n - trunc(n / d) * d

Given the fact that this solution is under the word “naive” you’re surely already deduced that this solution is not necessarily a good one. Although based on the correct formula, this solution’s results are not generally correct. Some of those short-comings are with its handling of edge cases, but the core of the problem is that it does not always deliver an exact result, even when the inputs are both finite values. To be clear, fmod is a function where the result is always exact so long as the inputs are both finite. This is unlike other cmath functions such as exp or sin, where some amount of rounding is inherently necessary.

It’s easy to find examples of this implementation producing inexact results by generating inputs at random and comparing outputs against an existing fmod implementation: Compiler Explorer Link

One easy to identify reason for these inexact results is that n / d is subject to overflow. For example, if n is equal to 2^75 and d is equal to 2^-75, then the arithmetic quotient would be 2^150, but that’s larger than a 32-bit float’s maximum representable value of 2^128 - Ɛ. Therefore, the division yields infinity and all that comes after is no longer correct.

But this is not the sole reason for inexact results. It will come as no surprise that the primary culprit is generally just the finite number of significant digits that floats have, which makes rounding a necessity. Any given floating-point operation might produce an exact result if we’re lucky, but it could just as well be an over or under-estimate. However, there is actually relatively little need to think too hard about floating-point rounding to implement fmod. In fact, it’s possible to implement using nothing more than integer arithmetic if so desired, as we’ll see shortly.

Remainders

Since it’s always a good idea to start simple and build up complexity gradually, let’s first tackle the problem of how to compute a remainder by hand in the simple case that n and d are two positive whole numbers.

At this point, the obvious thing to do is apply the long division algorithm that we’re all taught in elementary school, but that misses the point a bit. Since our goal is ultimately to implement our own remainder algorithm, what’s more important is the intuition that we developed for why long division works in the first place. Let’s just ignore what we know about modern computing hardware and instead focus on the algorithmic challenge in the abstract.

Recall those lessons where your teacher taught you that division is about how many times d fits into n, that if you had n marbles in front of you, and you try to make as many groups of d marbles as possible, then the number of groups you’re able to complete is the quotient and the number of marbles left over is the remainder.

The most straightforward way to turn this intuition into a remainder algorithm would be to subtract d from n until n is smaller than d. The final value of n would be the remainder. Of course, this approach is a little naive since it’s algorithmic complexity is linear with respect the magnitude of the quotient, and given that quotients can be arbitrarily large, this approach can easily take a while.

However, it’s more than possible to go faster than this. Performing a subtraction by d, m subsequent times is equivalent to subtracting d * m once. So instead of grouping marbles into groups of size d, we can group them into groups of size d * m. In order to minimize the number of subtractions performed, we’d ideally want m to be as large as possible such that m * d is still less than n. Naturally, the optimal value of m is the whole part of the quotient of n and d, i.e. ⌊n / d⌋. If we can compute this quotient in constant time, then the remainder too can be computed in constant time. But this is really just saying that n mod d = n - ⌊n / d⌋ * d of course.

But let’s assume that computing the quotient isn’t feasible. (Remember the earlier problems with floating-point division?) To go faster, we don’t necessarily need to subtract exactly the largest multiple of d that’s less than n. So long as we subtract any multiple of d larger than d itself, that would be an improvement. But while we want m to be as large as possible, we also don’t want to spend too much time trying to actually find a good value for it, since that would defeat our optimization efforts.

A simple method of doing this would be to multiply d by some exponentiation of whatever base we’re working with, since this would amount to a performing a left shift. Whether you’re working by hand and that means just moving a decimal point, or you’re implementing this in software and it means executing a shift, it’s a cheap and easy thing to do. We just add zeroes to the right of the denominator until we can add no more without it becoming larger than the numerator.

In bases other than two, conventionally we use our knowledge of multiplication tables to refine this estimate by further multiplying by some number in the range of (1, b) where b is the base we’re currently working in. However, at this point, we can go back to considering the reality that computers work in base two, for which that range is empty, and therefore we need not consider this.

So we have the following algorithm:

while n >= d:
  shift_amount = ...
  n -= d << shift_amount

The complexity of this approach is quite respectable at logarithmic with respect to the magnitude of the quotient. With each iteration, this process ensures the numerator’s leading set bit is cleared, and hence, the algorithm will not run for more iterations than the number of bits that the numerator’s value is wide.

Unsigned Integer Remainder: Naive Bit-by-Bit w/ lzcnt

You may have noticed that in the earlier algorithm, the problem of finding the number of places to shift by was left unsolved. This is because there’s a subtle (or rather annoying) detail if you want to do things exactly as was described earlier.

We’ll assume the types of the inputs to our remainder function are integral for the time being.

The obvious thing to try is to compute the shift amount using leading zero count instructions like so: lznct(n) - lzcnt(d). This would ensure that the most significant set bit in n and in the shifted value of d will be in the same position. But if we want to ensure that our shifted denominator is less than the numerator, then how many places to shift by is actually dependent on whether the significant digits in n are greater than the significant digits in d.

Consider two examples expressed in base 10 for the sake of simplicity:  5200 / 17 and 1700 / 52. Note that both numerators and both denominators are the same width and have the same number of significant digits. However, in the first case, it would make sense to shift 17 by two places to get 1700 since that’s less than 5200. In the second case, it would only make sense to shift 52 left by one place to get 520 since shifting by two places to get 5200 would make it larger than 1700. So we see that the magnitude of the significant digits can be relevant.

The most direct way to apply this in software would be to adjust the number of places we shift depending on a comparison for every iteration of the loop:

c = lzcnt(d)
while n >= d
  t = d << (lzcnt(n) - c)
  t >>= (t > n)

  n -= t

Now, there’s a few details worth mentioning about this approach’s feasibility.

In a scalar context, leading zero count instructions are fairly widespread on common ISAs.

But SIMD vectorized leading zero count instructions are not necessarily as widely available, particularly on x86. 32 and 64-bit leading zero count ops were only introduced with AVX-512CD and there are no 16-bit leading zero count instructions, so some emulation will be necessary. If you’re targeting ARM Neon, you’re somewhat more in luck however as it has widespread support for 8, 16, and 32-bit leading zero counts. 64-bit leading zero counts will still have to be emulated.

You could also point out that on x86 SIMD unsigned integer comparisons were not a thing until AVX-512F and that variable shifts were not available until AVX2. So strictly speaking, this approach isn’t great if you’re targeting older hardware.

Unsigned Integer Remainder: Bit-by-Bit w/o lzcnt

The previous approach isn’t terrible, but it can be less than optimal if there isn’t a cheap leading zero count instruction available. The lazy thing to do here would be to use some software implementation of lzcnt, but any off-the-shelf solution wouldn’t leverage the contextual information available to us that the result of lzcnt(n) will increase by at least one with each iteration of the loop. Hence, it’s bound to do more work than is strictly necessary.

One alternative would be to avoid using lzcnt inside of the loop altogether. Upfront, we can shift d to the left so that its most significant digit lines up with that of n. Then we handle the possibility that our shifted value of d, t in the psuedocode below, is greater than n by making the subtraction predicated on a comparison between the current value of n, and t. With each iteration of the loop, we perform a right shift on t to compute a smaller multiple of d.

t = d << max(0, lzcnt(n) - lzcnt(d))
while n >= d:
  if n >= t:
    n -= t

  t >>= 1

This approach does not necessarily guarantee that each iteration clears the leading set bit in n, but for the ith iteration (one-indexed), it ensures that lzcnt(n) will have increased by at least i compared to the initial value of n.

This approach leads to a very tight loop that, although may execute more iterations than the previous solution, may be better overall when a lzcnt instruction is not at our disposal. It may also be better overall if this solution would require a comparable amount of iterations as the previous one, since the iterations would be cheaper.

The branch in the loop is somewhat suboptimal, but it may be made branchless rather straightforwardly. For scalar code, a conditional move would do the trick. For SIMD code, a mask operation, either a dedicated masked subtraction or a bitwise AND operation between the result of the comparison and the subtrahend would handle it.

Contrived Circumstances: Towards Using Integer Division

It’s worth exploring the utility of integer division instructions for computing fmod, however using integers as inputs isn’t very insightful because the problem can generally be solved as n - (n / d) * d. Simply looking at this formula doesn’t provide any more insight into how to construct an fmod implementation than we already had.

So let’s consider a scenario that’s slightly more like fmod. Let’s say that the numerator isn’t just an integer. It’s an integer multiplied by a power of two, so our problem is evaluating mod(n * 2^E, d). Also, let’s assume that the number of significant digits in n and d are less than the width of our machine’s integers. This detail will become important for reasons that will be discussed later.

We’ll begin by modifying the earlier bit-by-bit w/o lzcnt approach. (You’ll want to make sure you refer back to it as you read through this section, as I think it’ll be difficult to make sense of it otherwise.) The straightforward thing to do would be to update the algorithm line-by-line, but at least when it comes to the bit before the loop, the necessary changes aren’t well-motivated just yet.

Let’s start with tackling the problem of how to evaluate the loop condition of n * 2^E >= d. With the numerator being the form of n * 2^E, this is a little difficult since we can no longer use a native comparison operation. However, if we transform both the numerator and denominator into a different, shared representation that effectively emulates floats, the comparison becomes much easier. We’ll represent the numerator and denominator as fixed-point numbers in the range of [1.0, 2.0), multiplied against some power of two. Let’s call these N and D to distinguish them from the function parameters, and say that they have two fields, s for the significant digits, or the significand, and e for the power of two, or rather, the exponent.

We can note that if N’s exponent is greater than D’s exponent, then N >= D will necessarily be true. If the two exponents are equal, then we have to evaluate N.s >= D.s to complete the comparison, however that will only be true at most once. So for the sake of making the loop as simple as possible, that case may be handled by a one-time check performed afterwards. We may also note that if N.s becomes zero, then the remainder will be zero, so the loop should terminate under this scenario as well. Therefore, our loop condition is   (N.e > D.e) && (N.s != 0).

Next, we can move onto handling the loop body. This is largely a straightforward process, but there is one tricky little detail that has to be considered. It’s related to the earlier discussion of shifting by different amounts depending on which of the numerator’s and denominator’s significant digits are larger. Earlier, I mentioned representing the signficands as fixed-point numbers in the range of [1.0, 2.0), but I wasn’t very specific about the actual format. The most obvious format would be one with one whole bit, and with the remaining bits being fractional. However, this representation is problematic when we want to perform the subtraction required by the loop body.

Again, an example in base 10 will be used. Let’s say that we have 3 decimal digits to play with per integer, and n = 123, E = 1, d = 321. In this case, the value of the numerator is 1230 and the value we want to subtract is 321 to get 909. However, if we’re constrained to 3-digit arithmetic, there’s no trivial way to get 909 from 123 and 321. But, if we have 4 digit integers, then so long as the numerators and denominators don’t ever have more than 3 decimal digits, we never run into this problem.

This issue arises when working in base two under similar circumstances and may be solved in just the same manner. Thankfully, earlier it was mentioned that we were assuming that the numerator and denominator would be narrower than the integers on our target system. With this in mind, the fixed-point representation we’ll use will have two whole bits, and the remaining bits will be fractional.

When updating the earlier bit-by-bit approach, there will need to be an additional conversion performed upfront that comes out to:

// Convert inputs to common format
N.e = W - 1 - lzcnt(d) + E
N.s = n << (lznct(n) - 1)

D.e = W - 1 - lzcnt(d)
D.s = d << (lznct(d) - 1)

And with that detail of representation clarified, we can actually deal with the loop body.

If we were doing a very direct translation, then we’d replace the variable t with a corresponding T, like we replaced n and d with N and D. However, as it turns out this isn’t really necessary. You’re free to give it a try yourself and see why, but essentially T.e would never be read from, and T.s would just be equal to D.s since T is just D multiplied by some power of two. So it would just add pointless complexity to our algorithm to introduce this additional variable.

Earlier, the loop body’s conditional subtraction was predicated on whether the current value of n is greater-than-or-equal-to the shifted valued of d. As was the case earlier, a native comparison operation is not available. However, a comparison between N.s and D.s would suffice so long as assume that N.e and T.e are equal (this is equivalent to saying that the position of the the most significand digits of the numbers they represent line up). Therefore, we’ll design our algorithm such that this subtraction only occurs when this is true.

As far as the right shift of t, that would be equal to a decrement on T.e, which we’re not using as has already been mentioned.

A bit of additional complexity that our new algorithm requires us to deal with is renormalizing N. After performing the subtraction of N.s -= D.s, we need to make sure that N.s and N.e are updated appropriately. In order to do this, we have two familiar options:

The first option is use a leading zero count instruction, subtract one from the result, and then perform:

t0 = lzcnt(N.s) - 1
N.s <<= t0
N.e -= t0

This gets N.s to have its most significant digit as the second most significant position, and N.e correctly adjusted for all the places the subtraction may have cleared.

The second option would be to again avoid the leading zero count operation as we did earlier. We shift left N.s by one place and decrement N.e to match. Since the subtraction is already predicated on N.s being larger than T.s, the loop will still only do the subtract when N.s is correctly positioned.

Note that this requires one additional augmentation to the algorithm. If the subtraction in the last loop iteration clear many bits beyond the one at the position D.e, the loop would not adjust the value of N.e or the position of the bits in N.s as necessary. A onetime use of the earlier leading-zero count approach is appropriate.

With the loop body done, we just have to reconstruct the result. This is straightforwardly,

result = N.s >> (W - N.e)

where W is the number of bits wide our integers are.

So the final algorithm without using leading zero counts in the body is:

// Convert inputs to common format
N.e = W - 1 - lzcnt(d) + E
N.s = n << (lznct(n) - 1)

D.e = W - 1 - lzcnt(d)
D.s = d << (lznct(d) - 1)

// Core loop
while (N.e > D.e) && N.s != 0:
  if N.s >= D.s:
    N.s -= D.s

  N.s << 1
  N.e -= 1

// Handle case where N and D have the same exponent
if (N.e == D.e) && N.s >= D.s:
  N.s -= D.s

// Remove any excessive leading zeros from N.s
t0 = lzcnt(N.s) - 1
N.s <<= t0
N.e -= t0

// Reconstruct result as integer
result = N.s >> (W - N.e)

If we use a leading zero count in the body then it is:

// Convert inputs to common format
N.e = W - 1 - lzcnt(d) + E
N.s = n << (lznct(n) - 1)

D.e = W - 1 - lzcnt(d)
D.s = d << (lznct(d) - 1)

// Core loop
while (N.e > D.e) && N.s != 0:
  if N.s >= D.s:
    N.s -= D.s

  t2 = lzcnt(N.s) - 1
  N.s << t2
  N.e -= t2

// Handle case where N and D have the same exponent
if (N.e == D.e) && N.s >= D.s:
  N.s -= D.s

  // Remove any excessive leading zeros from N.s
  t0 = lzcnt(N.s) - 1
  N.s <<= t0
  N.e -= t0

// Reconstruct result as integer
result = N.s >> (W - N.e)

Although we’ve assumed integer inputs thus far, it should be straightforward to see how this would translate to floating-point inputs. The frexp function will decompose a float into significand and exponent, easing the construction of N and D.

As far as practical considerations go, there’s nothing meaningfully new with this technique that we haven’t already encountered with previous approaches, but we can still note something about practicality. Although x86 does not have vectorized unsigned integer comparisons until AVX-512F and AVX-512BW, signed integer comparisons are available much earlier (SSE2 for 8, 16, and 32-bit integers. SSE4.2 for 64-bit). If we just add another leading whole bit to our fixed-point representation and adjust the rest of the algorithm appropriately, the signed comparison instructions may be used instead.

Contrived Circumstances: Integer Division

Now that we’re familiar with this more complex problem, we can actually begin to leverage integer division. The good news is that most of the hard work has already been done. There are just a couple more things to address.

Remember that what we’re trying to do here is to quickly find large multiples of d which are not greater than n. Since we’re computing those multiples as m = N.s / D.s the key to finding these large multiples will be to maximize the value of m. This straightforwardly means making N.s as large as possible and making D.s as small as possible, or rather, it’s about altering their representations. These are processed as integers at the end of the day, even though we were treating them as fixed-point numbers earlier. Hence, we must pack N.s’s bits as far left as possible and D.s’s bits as far right as possible to maximize the integer quotient of N.s / D.s.

When N.s is packed as far left as possible, that’s effectively using a fixed-point representation with 1 whole bit. When D.s is packed as far right as possible, that’s effectively using a fixed-point representation with lznct(d) + tzcnt(d) + 1 whole bits.

To convert to N and D, the details are slightly different than before:

N.s = n << lznct(n)
N.e = (W - 1 - lznct(n)) + E

D.s = d >> tzcnt(d)
D.e = (W - 1 - lzcnt(d) - tzcnt(d))

At this point, modifying the loop body to leverage integer division is a fairly straightforward process. The value (N.s / D.s) * D.s is subtracted from N.s. Then the leading excessive leading zeros in N.s must be dealt with. This can be handled in just the same way as mentioned previously, but the approach that relies on a leading zero count makes the most sense here.

while ... & N.s != 0:
  N.s -= (N.s / D.s) * D.s

  N.s <<= lzcnt(N.s)
  N.e -= lzcnt(N.s)

But we must be careful here. This previous loop will clear at least lzcnt(D.s) leading set bits from N.s with each iteration. That can easily be more bits than are required. For example, 15.0 mod 6.0, effectively 0b1111 mod 0b0110, only requires the clearing of the two leading set bits to get the answer of 0b0011 or 3.0. However, applying the above approach, D.s would equal 0b0011, of which 0b1111 is a multiple, so we’d end up subtracting 0b1111 from itself and getting zero. Subtracting an excessively large value has lead us to an incorrect result.

Thankfully, it’s quite easy to reduce the minimum number of leading set bits which are cleared by reducing the ratio between N.s and D.s, either by shifting N.s right, or shifting D.s left. We’ll opt for the latter approach.

If the difference in the exponents, is small enough that that N.e - D.e + (W - lzncnt(D.s)) <= W, that is to say that if you were treat N and D as infinite width fixed-point integers, then N | D would have fewer than W significant bits. Under these circumstances, a single integer quotient would be all that’s necessary to evaluate the remainder’s signficand.

With a bit of manipulation, we get that we should only run the loop only while N.e > (D.e + lzcnt(D.s)) is true.

Once that’s no longer true, we need to shift D.s left an appropriate amount such that lzcnt(D.s) equals N.e - D.e. This shift amount comes out to lzcnt(D.s) - W - (N.e - D.e) places.

With all that done, the final algorithm comes out to:

// Convert inputs
N.s = n << lznct(n)
N.e = (W - 1 - lznct(n)) + E

D.s = d >> tzcnt(d)
D.e = (W - 1 - lzcnt(d) - tzcnt(d))

// Core loop
while N.e > (D.e + lzcnt(D.s)) & N.s != 0:
  N.s -= (N.s / D.s) * D.s

  N.s <<= lzcnt(N.s)
  N.e -= lzcnt(N.s)

// Handle case where narrower division is necessary
if (N.e > D.e) && (N.s != 0):
  D.s << (lzcnt(D.s) - W - (N.e - D.e))
  N.s -= (N.s / D.s) * D.s

  N.s <<= lzcnt(N.s)
  N.e -= lzcnt(N.s)

// Reconstruct result as integer
result = N.s >> (W - N.e)

It must be noted that vectorized integer division instructions don’t generally exist. To my knowledge, the only notable instruction to set have them is ARM SVE has them, which is not currently available in consumer hardware.

However, knowing how to leverage integer division is a stepping stone. It allows us to consider alternative approaches which either achieve integer division by alternative means, or which leveraging floating-point division instructions instead which are widely available in SIMD form.

Unsigned Integer Division: Granlund-Montgomery Division

We may note that in the previous approach, the loop performed an integer division by a denominator which is constant across iterations. This opens the possibility to utilize the well-known Granlund-Montgomery division technique. For those unaware, this allows us to replace a division by a known denominator with a multiplication followed by come subsequent corrective steps. Given that integer division instructions are generally known to be expensive, this is often a good optimization.

The technique has two parts. The first must be evaluated upfront once per denominator:

l = W - lzcnt(d - 1)

m = (double_width_uint((1 << l) - d) << W) / (d + 1)
sh1 = min(l, 1)
sh2 = l - sh1

Again, W is the width of our integers.

The second part is evaluated each time that a division is to be performed.

t1 = mulhi(m, n)
uint q = t1 + ((n - t1) >> sh1) >> sh2

Plugging this into the previous approach is fairly straightforward, with one exception, that being the final division that is performed outside of the main loop body. In order to perform that last division using the Grunland-Montgomery technique you just have to add D.e - N.e - lzcnt(d)) to sh2.

However, it needs to mentioned that using this technique isn’t necessarily an improvement. In order for this approach to be faster, the overhead of initializing the Granlund-Montgomery state must be less than the amount of work saved by reducing the amount of divisions. However, in many practical contexts, the ratio between the numerator and denominator passed to fmod isn’t that high. Therefore, the loop body will may very well run an average of only around one time.

If we assume an IEEE-754 single-precision float and 32-bit integers, then the ratio would have to be higher than 128 for more than one integer division to be necessary. This is a very real possibility that you’ll no doubt encounter, but it’s not necessarily representative of the common case. If we assume 64-bit integers, then the ratio would have to be larger than 549 billion. If utilize the fact that x86’s 64-bit div instruction actually takes an 128-bit numerator allowing us to clear 63 bits from the numerator with each iteration, then that threshold jumps to more than 9 quintillion. In that case, this means that no more than 5 divisions would be necessary. Even if we change to considering double-precision operands, that’s only 17 of these divisions that would be required.

However, there is something be said about utilizing this optimization for cases where the denominator for fmod is reused across multiple numerators, or where it’s known at compile time. This technique, along with many others that follow in its foot steps, could be worthwhile there.

As far SIMD vectorization goes, there is honestly a fair bit to be said about this technique, perhaps deserving of its own separate post. Properly addressing it would require a rather large tangent, which I’d prefer to avoid. While it’s certainly possible (although not necessarily entirely effective depending on the target architecture) there’s another solution that’s more worthy of our time and attention.

Widening + Division via Fixed-Point Multiplication

A much simpler method for more rapidly computing integer division by a known denominator is possible if we have sufficiently wide integers available to us. If sig_width is the number of bits in our operand’s signficands, then with an integer that has at least 2 * sig_width + 1 bits, we can utilize fixed-point arithmetic instead.

For a fixed-point format with 2 * sig_width fractional bits, the reciprocal of y may be computed as:

r = (1 << 2 * sig_width) + (y - 1)) / y

Then the quotient is computed as:

q = (x * r) >> sig_width

Again, it should be straightforward enough to see how you would substitute these snippets into the integer division approach.

When it comes to 16 and 32-bit floats, this approach can be quite feasible as these would only requires integers which are at least 23 and 49 bits wide respectively, and these are hardly rare on most modern platforms.

For 64-bit floats that would require 107-bit integers, things are a little less convenient, but at least on x86, there are a number of instructions which are meant to facilitate the processing of integers wider than 64 bits. This includes a 64-bit integer division instruction that actually takes in an 128-bit numerator.

Naturally, such an approach is held back when it comes to SIMD vectorization in that the effective number of lanes available is reduced, so the main benefit of higher data throughput is severely compromised.

When it comes to SIMD vectorization on x86 specifically, this technique has some feasibility, particularly with AVX-512IFMA and when processing 32-bit float operands. This family of extensions introduces fused multiply accumulate instructions which operate on 52-bit integers. The parameters to this function are packed into 64-bit lanes within a SIMD vector. What’s particularly powerful about using this technique is that it permits us to perform N.s -= trunc(N.s / D.s) * N.s within a single CPU instruction, one that only has a latency of 4 cycles. And since we’re assuming that we have AVX-512IFMA, we also have AVX-512F and practically speaking we’ll also have AVX-512CD. With these instruction sets available to us, it’s possible to make the loop body remarkably succinct.

Widening + Direct Remainder Computation

So far, all of the approaches we’ve used have been relying on the idea of repeatedly running some variant of n = n - trunc(n / d) * d. However, there does exist a technique which may be used to compute the remainder slightly more directly.

If you take the fractional component of the quotient, instead of the whole part, then multiplying that by the denominator computes the remainder. So, in the abstract:

remainder = frac(n / d) * d

In theory, this saves you 1 subtraction instruction over the last approach. While that’s hardly much, it’s also not too hard implement, so we might as well consider it.

As was the case earlier, the value of n / d may be computed via fixed-point multiplication against a reciprocal. To extract the fractional part, we’ll perform a bitwise AND to extract the low half (this may not actually be necessary depending on implementation). Once we have the low half of our product, we multiply against our denominator and the high half of the resulting product is our answer:

mask = ... // Some constant consisting of sig_width many bits set to 1 
result = mulhi((n * r) & mask, d)

However, actually putting this idea into practice comes with some challenges. Essentially, we must be able to widen to a integer for which we have a widening multiply available.

Off the top of my head, there appear to be only a few circumstances where this would be feasible:

Note that in the last case, there’s no reason to believe it would perform any better than the previous approach since the subtraction this technique eliminates isn’t a separate instruction there.

Floating-Point Remainder: Bit-by-Bit

The initial bit-by-bit approach that was mentioned for integers can be utilized without relying on integer arithmetic at all. It’s actually possible to implement it entirely using nothing but floating-point arithmetic, and it’s surprisingly straightforward:

t = ldexp(frexp(d, ...), ilogb(n) + 1)
while n >= d:
    if n >= t:
        n -= t

    t *= 0.5f

For once, there really is nothing more to it.

On a machine with AVX-512DQ, the computation of t can be accelerated through the use of the the vgetmant**, vgetexp** and vscalef** instructions. These instruction allow you to extract a float’s significand, extract a float’s exponent, and multiply a float by a power two respectively.

Floating-Point Remainder: Widening

If the type of the function’s parameters is not the the widest available floating-point type, then that opens the possibility of using a wider floating-point division as the backbone of our implementation. This becomes particularly advantageous for SIMD vectorization since common platforms such as x86 and ARM only have vectorized division instructions for floating-point operands. Only ARM SVE has vectorized integer division instructions, yet SVE is far from widespread at the current time.

The core idea is the same as it’s always been: subtracting multiples of d from n. We’re still just repeatedly executing n -= d * m. It’s similar to what was done with the integer division approach, however, since we don’t have to explicitly manage the exponents as separate fields, in some sense, the problem becomes easier. The algorithm is ostensibly as straightforward as:

N = widen_float(n)
D = widen_float(d)

while N >= D:
  N -= trunc(N / D) * D

result = narrow_float(N)

But the practical reality makes things more complicated.

First, let’s take a look at the division. Recall that in general, the result of a floating-point operation may be an over-estimate, an exact result, or an under-estimate. Let’s think about how this applies to the division instruction, ignoring the possible rounding introduced by the subsequent multiplication for the time being. In this scenario, an over-estimated quotient leads to an incorrect result since we end up subtracting too large of a value. An exact quotient is naturally no issue. An under-estimated quotient would just mean subtracting a smaller multiple of D, which is fundamentally still generally fine.

In order to remove the possibility of an over-estimate, one thing that can be done is to decrement the result of N / D, that is, treat the float’s bits as an integer and subtract 1 from it. However, we must be careful with this as this may lead to an infinite loop. Consider that if N / D delivers an exact result such that the quotient is 1.0, e.g. N = 1.0, D = 1.0, then decrementing the resulting quotient would mean that after applying the truncation, the result would be 0.0. Hence, nothing would be subtracted, and the loop’s state would not change from iteration to iteration. We can address this by changing the loop termination criteria to 2 * N >= D and then after the loop, performing N -= D predicated on N >= D.

If you’re on a machine with AVX-512F then we may use its embedded rounding mode feature to avoid the possibility of an over-estimate entirely. This feature allows us to encode a rounding mode directly into the division instruction which will override the MXCSR register’s rounding mode. This is great for us since the alternative of setting the MXCSR register state, and then restoring it can be quite costly. By setting the embedded rounding mode on the divide instruction to round-towards-zero, the result will always either by an exact result or an under-estimate.

Now, let’s deal with rounding in the multiplication that comes after the truncation. The approach we’ll take is to remove the possibility of rounding altogether. In order to do this we’ll leverage the following fact: when you multiply two factors, the resulting product will have at most as many significant digits as the sum of the number of significant digits of the two factors.

Let’s assume that we’re widening 32-bit floats to 64-bit floats for the sake of the following explanation. double(d) will have no more than 24 significant digits since that’s the maximum number the 32-bit d would have. We’ll want double(d) * double(m) to have no more than 53 significant digits. Based on the earlier fact, that means that m can have no more than 53 - 24 = 29 significant bits.

It’s possible for m = trunc(N / D) to have a full 53 significant bits, but thankfully, reducing the number of significant figures in a float is a fairly straightforward operation. It’s just a matter of zeroing out the last 24 bits in m using a bitwise AND operation with ~0xffffff.

This is slightly cumbersome to do with a language like C or C++, but it’s hardly a real challenge. Either use memcpy or std::bitcast with C++20 or later. It should be noted that mainstream contemporary compilers tend to generate machine code that moves the float from its resident XMM register to a GPR, perform the bitwise AND there, only to have to move it back to an XMM register, even though bitwise AND instructions with XMM register operands exist. Indeed, a similar thing happens when you try to perform the decrement on the quotient mentioned earlier. This inefficiency can be alleviated through the use of inline assembly or compiler intrinsics.

So, with all of these changes made, the generic version would be:

N = widen_float(n)
D = widen_float(d)

mask = ~0xffffff // Should be adjusted for different fp types

while (N + N) >= D:
  quotient = N / D
  decremented_quotient = as_int(quotient) - 1
  masked_quotient = as_wide_float(decremented_quotient & mask)

  N -= trunc(masked_quotient) * D

if N >= D:
  N -= D

result = narrow_float(N)

With AVX-512F’s embedded rounding leveraged, it would be simpler at:

N = widen_float(n)
D = widen_float(d)

while N >= D:
  quotient = div_trunc(N, D)
  masked_quotient = as_wide_float(as_int(quotient) & mask)

  N -= trunc(masked_quotient) * D

result = narrow_float(N)

Floating-point Remainder: Widening with FMA

Although masking our bits from our truncated quotient is one option for avoiding rounding in the multiplication from the earlier approach, there is a much better alternative available to us on many pieces of contemporary hardware: fused-multiply accumulate instructions. FMA instructions evaluate the formula a * b + c with only one rounding operation performed at the end. It doesn’t matter if a * b has more significant digits than our floating-point type can represent, so long as after adding c, the number of significant digits is reduced back down.

In our particular problem, N -= D * m bit is equivalent to N = -D * m + N, which is in the form required by the FMA instruction so it may be replaced with N = fma(-D, M, N).

Since we don’t clear out bits from our quotient, this approach can clear more bits per loop iteration. If we take a look at 32-bit floats being widened to 64-bit floats, then this loop will run for no more than 6 iterations.

On x86 and ARM, the negation does not even need to be a separate instruction from the FMA, since x86 has the (v)fnmadd***** family of instructions and ARM has the FMLS family of instructions. These perform a fused-multiply-accumulate where a negation is performed on the result of the multiplication before performing the subsequent accumulation. Since these are also available in SIMD vectorized form, this approach is particularly powerful. That said, you could always compute -D upfront and keep it around assuming you have a spare register, which you probably do. If you can, throw AVX-512F’s embedded rounding into the division instruction and the loop body become very terse.

The resulting algorithm isn’t meaningfully simpler than before, but it’s likely to be faster:

N = widen_float(n)
D = widen_float(d)

neg_D = -D

while (N + N) >= D:
  quotient = N / D
  decremented_quotient = as_int(quotient) - 1
  truncated_quotient = trunc(as_wide_float(decremented_quotient))
  N = fma(truncated_quotient, neg_D, N)

if N >= D:
  N -= D

result = narrow_float(N)

Floating-point Remainder: Widening with Reciprocal

As was the case for the integer approach, the fact that we’re dividing by a constant denominator means that precomputed a reciprocal upfront may make for an optimization under the right circumstances.

Again, as has been the case with previous approaches, rounding must be handled so that the subtrahend is not too large.

The solutions are similar to those available before.

With AVX-512, you can use set the embedded rounding mode to rounding upwards for the division instruction that’s used to compute the reciprocal. This ensure that it’s either exact or an over-estimate.

Much as was the case for the integer case, this is unlikely to provide a speedup unless the amount of loop iterations is known to be high, which is not commonly the case under most practical circumstances.

The algorithm is quite similar that what we already had before:

N = widen_float(n)
D = widen_float(d)

recip = increment(1 / D)

mask = (-1 << sig_width)

while (N + N) >= D:
  quotient = N * recip
  decremented_quotient = as_int(quotient) - 1
  masked_quotient = as_wide_float(decremented_quotient & mask)

  N -= trunc(masked_quotient) * D

if N >= D:
  N -= D

result = narrow_float(N)

Edge-Case Handling

Handling fmod’s edge cases isn’t a particularly difficult task.

For scalar implementations of fmod, the following is all that’s necessary.

if isnan(x) || isnan(y) || isinf(x) || (y == 0.0):
    return NAN

if isinf(y) || (abs(x) < abs(y)): 
    return x

For SIMD vectorized implementations, a mask can created for those lanes which meet the first case, and before returning, this can be used to perform a blend between the result computed by the rest of the algorithm, and a vector full of NaN. For the second set of cases, it’s possible they may be handled by our implementation of the core fmod loop, but the possibility of y being infinite is not really something we designed our approaches for.

For a scalar implementation the early return is very likely to be an optimization, but for a vectorized implementation, unless you expect all lanes to be able to be able to return early, then it’s likely to be a pessimization.

On a machine with AVX-512DQ, all of the tests required by the first early return can be performed quite conveniently via the (v)fpclass** instructions.

SIMD-Specific Considerations

The core problem introduced by attempting to create a vectorized fmod implementation is that the loop may need to execute a variable number of times across different lanes. The loop itself must continue while the numerator is larger than or equal to the denominator in any lane, but it must also avoid changing the value of the numerator within the lanes for which that comparison is false.

On the bright side, for most of these approaches, there is no need for any augmentation because the value subtracted from the numerator will come out to zero. Even if this is not the case, only the subtraction performed on n or N.s would need to be predicated on that comparison. For the integer approaches which perform an unconditional shift, then that will have to be predicated on an evaluation of the loop condition.

With ARM Neon, it’s slightly inconvenient to check if a the loop condition is true for at least one lane since there is no counterpart to x86’s movemask instructions. However, umaxv may be handy for constructing a substitute. The instruction performs a reduction that performs an unsigned max reduce across all vector lanes. Checking if this value is non zero is enough for our purposes.

TODO: Implementations and Benchmarks

At the current time, my exploration of this problem is incomplete. I’m currently working to complete a set of implementations which are tailored for x86 under the assumption of particular ISA extensions being available. I hope to complete these implementations soon, however, it is somewhat challenging to implement some of these efficiently with the more constrained earlier SIMD instruction sets such as when working with pure SSE2. Not to mention, once you get into the weeds of it, there are a number of micro-optimizations which have not been mentioned here which must be considered if aiming for optimal performance.

Beyond that, getting meaningful benchmark results is difficult for this problem since the runtime of these algorithms is dependent on the ratio of numerator to denominator, and the denominator’s number of significant digits. Benchmarks would need to be run under the assumption of a particular ratio and a denominator with a certain amount of significant digits. Then, these two parameters would need to be varied to produce a matrix of results from which the performance characteristics of these different algorithms could be meaningfully compared.