1 Introduction

Floating-point arithmetic has long been the prevailing system used for numerical computing, allowing efficient and precise calculation on modern hardware. Whilst it closely approximates real arithmetic, it does not always satisfy its laws. These sporadic exceptions often confuse programmers and complicate mathematical reasoning to the point that deciding even the simplest statements can take excessive amounts of time. Our interest stems from the area of formal software verification, where automated reasoning is used to prove that programs satisfy their specifications. Although modern verification systems can reason about floating-point arithmetic, it has much less support than integer or bit-vector arithmetic. This makes proving correctness far more difficult and drastically limits practical applications of formal methods. Our work seeks to fill these gaps.

Denoting floating-point multiplication by \(\otimes \), we study the following problem:

Problem 1

Let X, Y and Z be intervals over the floating-point numbers and let

$$\begin{aligned} V = \{ (x, y, z) \mid x \in X, y \in Y, z \in Z, x \otimes y = z \}. \end{aligned}$$

What is the smallest box (Footnote 1) enclosing V?

We can give a simplified example without getting into the details of floating-point arithmetic:

Example 1.1

For the sake of exposition, we will use decimal numbers with two decimal places, rounding downward the results of arithmetic operations. For example, since \(0.50 \cdot 1.01 = 0.505\), we have \(0.50 \otimes 1.01 = 0.50\), where \(\otimes \) is rounded multiplication.
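The rounded multiplication of this toy system is easy to emulate. The following Python sketch (the helper name `otimes` is ours, not from any library) computes the exact product and then quantizes it to two decimal places, rounding downward:

```python
from decimal import Decimal, ROUND_FLOOR

def otimes(x: Decimal, y: Decimal) -> Decimal:
    """Two-place decimal multiplication, rounding the result downward."""
    # ROUND_FLOOR rounds toward negative infinity, i.e. downward.
    return (x * y).quantize(Decimal("0.01"), rounding=ROUND_FLOOR)
```

For example, `otimes(Decimal("0.50"), Decimal("1.01"))` yields `Decimal("0.50")`, as in the text.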

Let x and y be two-place decimal numbers, and suppose \(x \otimes y = 5.00\) where \(2.20 \le x \le 2.50\) and \(1.00 \le y \le 2.50\). These bounds are clearly quite loose. For instance, the lower bound on y is too small: if \(x = 2.50\) and \(y = 1.00\), then \(x \otimes y = 2.50\), which is much less than the desired 5.00.

First, we construct an equivalent condition that uses real multiplication instead of floating-point multiplication. This can be accomplished with a simple transformation. Since the numbers that round downward to 5.00 are exactly those in [5.00, 5.01), it follows that \(x \otimes y = 5.00\) is equivalent to \(5.00 \le xy < 5.01\). By getting rid of the \(\otimes \) operator, we can now reason using real arithmetic:

  1. Multiplying x and y and combining their bounds, we get \(2.20 \le xy \le 6.25\).

  2. This is worse than our existing bounds on xy, so we cannot improve them.

  3. Dividing the bounds on xy by y, we obtain \(5.00/2.50 \le x < 5.01/1.00\).

  4. This also does not improve our existing bounds on x.

  5. Dividing by x instead, we obtain \(5.00/2.50 \le y < 5.01/2.20\).

  6. We can thus improve our bounds on y to \(2.00 \le y < 2.27\overline{72}\) (Footnote 2).

If we were allowed the full range of the reals, we would now be finished, stating that \(5.00 \le xy < 5.01\) where \(2.20 \le x \le 2.50\) and \(2.00 \le y < 2.27\overline{72}\). We can easily show that these bounds are tight for real multiplication:

  • If \(x = 2.20\) and \(y < 2.27\overline{72}\), then \(xy < 5.01\). If \(y \ge 2.27\overline{72}\) instead, then \(xy \ge 5.01\).

  • If \(x = 2.50\) and \(y = 2.00\), then \(xy = 5.00\).

Having used each of the bounds on x, y and xy, we know that this set of bounds is tight on its own. However, we also require that x and y are decimal numbers with two decimal places. Thus, since \(y < 2.27\overline{72}\), we can further conclude that \(y \le 2.27\). Since the bounds all have two decimal places now, we would ideally like to be done. If we check, though, we find that the bounds are no longer tight: if \(x = 2.20\) and \(y = 2.27\), then \(xy = 4.994\), which is too small, and therefore \(x \ge 2.21\).

We can continue tightening even further. If \(x = 2.21\) instead and \(y = 2.27\), then \(xy = 5.0167\), which is too large. Therefore \(y \le 2.26\), and so on. From this point, division does not help us as much as it did earlier. Since the product of the bounds on x and y is so close to the optimal result, all of the quotients differ by less than 0.01, and so it is equivalent to an exhaustive search.

After a few more iterations, we eventually reach the optimal solution: \(2.33 \le x \le 2.50\) and \(2.00 \le y \le 2.15\). This clearly satisfies our bounds, since \(2.33 \cdot 2.15 = 5.0095 \in [5.00, 5.01)\). However, proving optimality is tedious without further results, requiring us to iterate the above argument for every decimal until we reach the optimum.
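The optimal box above can be confirmed by exhaustive search over the two-place decimals, using the real-arithmetic condition \(5.00 \le xy < 5.01\) derived earlier. A Python sketch (our own illustration, not the algorithm presented later in this paper):

```python
from decimal import Decimal

STEP = Decimal("0.01")
xs = [Decimal("2.20") + i * STEP for i in range(31)]   # 2.20, 2.21, ..., 2.50
ys = [Decimal("1.00") + i * STEP for i in range(151)]  # 1.00, 1.01, ..., 2.50

# Decimal products are exact, so the real-product condition is tested directly.
solutions = [(x, y) for x in xs for y in ys
             if Decimal("5.00") <= x * y < Decimal("5.01")]

x_lo = min(x for x, _ in solutions)  # 2.33
x_hi = max(x for x, _ in solutions)  # 2.50
y_lo = min(y for _, y in solutions)  # 2.00
y_hi = max(y for _, y in solutions)  # 2.15
```

The search recovers exactly the bounds \(2.33 \le x \le 2.50\) and \(2.00 \le y \le 2.15\) stated above.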

The toy example above illustrates the limits of current floating-point interval reasoning. As demonstrated, the analogous problem over the reals is straightforward, and the steps in the example suggest a general algorithm. If we were working with the much larger IEEE 754 64-bit floating-point numbers instead, we could easily be forced to iterate over millions of values: for a concrete example, take \(12738103310254127 \le x \le 12738103379848963\), y unrestricted and \(z = 2^{54} - 1\) under rounding to nearest. Although \(x \otimes y = z\) has no solution under these conditions, it will take 69,594,839 steps to prove this using the traditional interval-based approach. In general, the traditional algorithm struggles to prove unsatisfiability if the real relaxation of the problem is solvable, since the optimal intervals for an unsatisfiable system are empty. By contrast, the method we present in this paper arrives at the answer immediately.

We consider the following to be the primary contributions of this paper:

  • An algorithm to efficiently compute the next normal floating-point factor of any floating-point number in radix 2.

    • This algorithm is also partially applicable in higher bases. We do not assume a particular radix in any of our results.

  • When the above algorithm is applicable, an efficient procedure for finding optimal interval bounds on variables of floating-point constraints of the form \(x \otimes y = z\).

In addition, we believe the following secondary contributions are of interest:

  • A demonstration that “round-trip” rounding errors have useful, nontrivial piecewise polynomial extensions to the reals.

  • A characterization of upward and downward floating-point rounding in terms of the remainder of floored division.

  • A floating-point predicate for deciding, in any base, whether a number is a floating-point factor of another floating-point number.

  • Results on solving certain kinds of interval constraints over integer remainders.

1.1 Related Work

Although floating-point arithmetic has existed for a long time, work on solving floating-point equations and inequalities has been sporadic and has mostly been concerned with binary floating-point arithmetic. Michel [21] developed partial solutions for floating-point equations of the form \(x \circ y = z\) where \(\circ \) is one of rounded addition, subtraction, multiplication, or division, as well as a complete solution for rounded square root. His work contains an early description of the classical interval-based algorithm, as it applies to the floating-point numbers (Footnote 3). As an early example, only normal, binary floating-point numbers and the standard rounding modes were considered. Ziv et al. [30] showed that, under mild conditions, if z can take at least 2 different values, then the equation \(x \otimes y = z\) has at least one solution in binary floating-point arithmetic. However, their work does not treat other bases, nor the case where z is fixed. Bagnara et al. [2] gave some extremal bounds on x and y in terms of the bounds on z for binary floating-point arithmetic. However, the resulting bounds are very weak, since they do not account for any of the bounds on x or y. At present, there are no known efficient algorithms for computing the solution set of \(x \otimes y = z\) in general. Indeed, in terms of similar efforts, the only basic operation which has been solved (apart from square root) is addition [1, 12].

1.2 Outline

In Sect. 2, we introduce our notation and develop the formalisms used in this paper. We also give an account of the basic properties of floating-point arithmetic that we will need, and precisely define the classical interval-based algorithm. Due to the substantial amount of notation, a glossary is supplied for reference in Appendix B. In Sect. 3, we relate optimal intervals, the division-based algorithm and floating-point factors, and simplify the problem by narrowing down the conditions under which the division-based algorithm does not produce an optimal result. In Sect. 4, we then develop conditions for numbers being factors based on the rounding error in the product. Then, in Sect. 5, we show that we can control that rounding error under certain conditions and thus directly find floating-point factors. Combining these in Sect. 6, we give an algorithm to compute such factors efficiently and thus solve the problem. Appendix A contains some ancillary proofs used in the results of Sect. 6. In Sect. 7, we conclude and discuss open problems.

2 Preliminaries

Following convention, we denote the integers by \(\mathbb {Z}\), the reals by \(\mathbb {R}\), and the extended reals by \({\overline{\mathbb {R}}} = \mathbb {R}\cup \{ +\infty , -\infty \}\), and the same sets excluding zero by \(\mathbb {Z}^*\), \(\mathbb {R}^*\) and \(\overline{\mathbb {R}}^*\). The power set of a set X is denoted by \({\mathcal {P}}(X)\). We write f[X] for the image of X under a function f, and \(f^{-1}[X]\) for the preimage of X under f. To simplify exposition, we will denote the images of \(X \times Y\) and \(X \times \{ y \}\) under a binary operator \(\circ \) simply by \(X \circ Y\) and \(X \circ y\), respectively. We will use such notation most frequently in the context of intervals. We also discuss the size of intervals in terms of their width on the extended real line. To this end, we define a notion of diameter for subsets of the extended reals:

Definition 2.1

(Set diameter) For any \(X \subseteq {\overline{\mathbb {R}}}\), we define

$$\begin{aligned} {{\,\textrm{diam}\,}}X = \sup {\{ {d(x, y) \mid x, y \in X} \}\cup \{ 0 \}}, \end{aligned}$$

where \(d :{\overline{\mathbb {R}}}^2 \rightarrow {\overline{\mathbb {R}}}\) is a function such that

$$\begin{aligned} d(x, y) = {\left\{ \begin{array}{ll} 0 &{} \text {if } x = y,\\ {|x - y|} &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$

The diameter satisfies the following properties for all sets \(X \subseteq {\overline{\mathbb {R}}}\):

$$\begin{aligned} {{\,\textrm{diam}\,}}\emptyset = 0{} & {} \text {(diameter of empty set)} \end{aligned}$$
(1)
$$\begin{aligned} \text {for all } x \in {\overline{\mathbb {R}}},\ {{\,\textrm{diam}\,}}{\{ x \}} = 0{} & {} \text {(diameter of singleton)}\end{aligned}$$
(2)
$$\begin{aligned} \text {if}\ \inf X < \sup X,\ \text {then}\ {{\,\textrm{diam}\,}}X = \sup X - \inf X\end{aligned}$$
(3)
$$\begin{aligned} \text {for all } a \in \mathbb {R}^*,\ {{\,\textrm{diam}\,}}aX = {|a|} {{\,\textrm{diam}\,}}X{} & {} \text {(absolute homogeneity)} \end{aligned}$$
(4)
$$\begin{aligned} \text {for all } b \in \mathbb {R},\ {{\,\textrm{diam}\,}}(X + b) = {{\,\textrm{diam}\,}}X{} & {} \text {(translation invariance)} \end{aligned}$$
(5)
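For finite subsets of the extended reals, the supremum in Definition 2.1 becomes a maximum, and the diameter can be sketched directly; the case split in d avoids the undefined difference \(\infty - \infty \). A minimal Python illustration (our own, for finite sets only):

```python
import math

def diam(X):
    """Diameter of a finite subset of the extended reals (Definition 2.1)."""
    def d(x, y):
        # d(x, x) = 0 even when x is infinite, avoiding inf - inf.
        return 0.0 if x == y else abs(x - y)
    return max((d(x, y) for x in X for y in X), default=0.0)
```

For instance, `diam({math.inf})` is 0 by the singleton property, while `diam({0.0, math.inf})` is infinite.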

Importantly, the diameter of an interval is equal to the distance between its endpoints:

Lemma 2.2

For all \(a, b \in {\overline{\mathbb {R}}}\), if \(a < b\), then

$$\begin{aligned} {{\,\textrm{diam}\,}}{[a, b]} = {{\,\textrm{diam}\,}}{(a, b]} = {{\,\textrm{diam}\,}}{[a, b)} = {{\,\textrm{diam}\,}}{(a, b)} = b - a. \end{aligned}$$

Additionally, if two intervals overlap, then the larger one contains at least one of the endpoints of the smaller one:

Lemma 2.3

Let \(a, b \in \mathbb {R}\) and let \(I \subseteq {\overline{\mathbb {R}}}\) be an interval. If \([a, b] \cap I\) is nonempty and \({{\,\textrm{diam}\,}}I > b - a\), then either \(a \in I\) or \(b \in I\).

Since the subject of this work is solving equations similar to \(xy = z\), we will often need to use division. However, division alone does not provide what we need. For instance, if \(y = z = 0\), then any finite and nonzero x satisfies the equation, even though \(z/y = 0/0\) is undefined. In order to succinctly describe these solution sets, we introduce a special “division” operator:

Definition 2.4

For any \(Y, Z \subseteq {\overline{\mathbb {R}}}\), we define

The following properties then hold for all \(Y, Z \subseteq {\overline{\mathbb {R}}}\) and \(a \in \mathbb {R}^*\):

(6)
(7)
(8)
(9)

Furthermore, for all \(Y', Z' \subseteq {\overline{\mathbb {R}}}\), we also have the following properties:

(10)
(11)

Remark

Note that, since \(Y = (Y \cap \mathbb {R}^*) \cup (Y {\setminus } \mathbb {R}^*)\), we have

It is also worth noting that \(Y {\setminus } \mathbb {R}^* = Y \cap \{ 0,-\infty ,+\infty \}\). This allows us to compute as the union of an ordinary division and at most three special cases as described above.

We also define the box operator as follows:

Definition 2.5

(Box closure) A box is any Cartesian product of (extended) real intervals. For any set \(X \subseteq {\overline{\mathbb {R}}}^n\), we define its box closure \(\Box X\) as the unique smallest n-dimensional box containing X as a subset (Footnote 4).

For any sets \(X, Y \subseteq {\overline{\mathbb {R}}}^n\) we have:

$$\begin{aligned} X \subseteq \Box X{} & {} {\text {(extensivity)}} \end{aligned}$$
(12)
$$\begin{aligned} \text {if } X \subseteq Y, \text {then } \Box X \subseteq \Box Y,{} & {} {\text {(monotonicity)}} \end{aligned}$$
(13)
$$\begin{aligned} \Box \Box X = \Box X{} & {} {\text {(idempotence)}} \end{aligned}$$
(14)

Furthermore, for all subsets \(A_1, \ldots , A_n\) of \({\overline{\mathbb {R}}}\), where the product is the Cartesian product, we have the following property:

$$\begin{aligned} \Box \prod _{i=1}^{n} A_i = \prod _{i=1}^{n} \Box A_i{} & {} {\text {(distributivity)}} \end{aligned}$$
(15)

That is, the box of a product of sets is exactly the product of the boxes of those sets.

Remark

A 1-dimensional box is an interval, a 2-dimensional box is a rectangle, a 3-dimensional box is a cuboid, and so on. Also note that a box does not necessarily contain its boundary: an open interval is still an interval, and therefore a box. However, if X is a finite subset of \({\overline{\mathbb {R}}}\), then \(\Box X = [\min X, \max X]\). In general, if X is finite (or otherwise contains its boundary), then \(\Box X\) is a Cartesian product of closed intervals.
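For a finite set of points, the box closure is simply the Cartesian product of coordinatewise [min, max] intervals, as the remark notes. A minimal Python sketch (our own illustration of Definition 2.5):

```python
def box(points):
    """Box closure of a finite set of n-dimensional points: the Cartesian
    product of [min, max] over each coordinate (cf. Definition 2.5)."""
    return [(min(c), max(c)) for c in zip(*points)]
```

For example, `box([(1, 5), (3, 2)])` gives `[(1, 3), (2, 5)]`: the smallest rectangle containing both points.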

2.1 Floating-Point Arithmetic

Much work has been done on specifying floating-point arithmetic in various formal systems [3,4,5,6, 8,9,10,11, 14,15,20, 22, 23, 25, 26, 29]. However, to the best of the author’s knowledge, there is no well-established and precise informal language of floating-point theory. For the sake of completeness and rigor, the rest of this section will be devoted to defining our terms. Readers interested in a deeper introduction may wish to consult a reference such as Muller et al. [24] or Goldberg [13].

2.1.1 Floating-Point Numbers, Briefly

A finite floating-point number is usually written in the style of scientific notation as \(\pm d_1.d_2d_3\cdots {}d_p \times \beta ^e\), where each digit \(d_i\) lies in \(\{ 0, \dots , \beta - 1 \}\). The \(\pm d_1.d_2d_3\cdots {}d_p\) part is the significand, the number of digits p is the precision, \(\beta \) is the base (commonly 2 or 10), and e is the exponent. For example, with precision \(p = 3\) and base \(\beta = 2\), we can write the decimal number 0.75 as either \(1.10 \times 2^{-1}\) or \(0.11 \times 2^0\). Of the two representations, the former is preferred, as it is normalized—that is, it has no leading zeros—and is therefore unique.

With a higher precision we can represent more numbers. For instance, we need \(p \ge 4\) to represent the binary number 1.101 exactly. Regardless of precision, however, there are still limits imposed by the choice of base. Just as \(\frac{1}{3}\) lacks a finite decimal expansion, for any given base there are numbers that cannot be represented. Notably, the decimal 0.1 is \(0.000\overline{1100}\) in binary, and thus it has no representation in any binary floating-point system. As an example, if we take \(\beta = 2\) and \(p = 3\), then \(1.10 \times 2^{-4}\) and \(1.11 \times 2^{-4}\) are consecutive floating-point numbers, but they correspond to the decimal numbers 0.09375 and 0.109375, respectively.

Consequently, the sum, product or ratio of two precision-p base-\(\beta \) floating-point numbers is rarely itself a precision-p base-\(\beta \) floating-point number and thus computing with them typically involves accepting some degree of rounding error. In ordinary usage, we minimize rounding error by rounding to nearest; that is, by choosing the floating-point number closest to the real number we are rounding. Other rounding modes such as rounding upward, downward, or towards zero (also known as truncation) are typically only used in situations where the direction of the rounding is more important than the accuracy of a single operation. For example, the simplest way to avoid underapproximating the set of true results in interval arithmetic is to always round upper bounds upward and lower bounds downward.

2.1.2 Definitions and Terminology

The arithmetic described in this work is compatible with and generalizes that of the IEEE 754 standard, with two exceptions: we omit signed zeros (Footnote 5) and “NaN” values, because they are essentially irrelevant to the problem but add clutter. Appendix B contains a summary of the definitions introduced here. We begin with a construction of the floating-point numbers.

Definition 2.6

(Floating-point numbers) A floating-point format is a quadruple of integers \((\beta , p, {e_{\textrm{min}}}, {e_{\textrm{max}}})\) such that \(\beta \ge 2\), \(p \ge 1\), and \({e_{\textrm{min}}}\le {e_{\textrm{max}}}\), where \(\beta \) is the base, p is the precision, \({e_{\textrm{min}}}\) is the minimum exponent, and \({e_{\textrm{max}}}\) is the maximum exponent. For convenience, we also define the minimum quantum exponent (Footnote 6) \({q_{\textrm{min}}}= {e_{\textrm{min}}}- p + 1\) and the maximum quantum exponent \({q_{\textrm{max}}}= {e_{\textrm{max}}}- p + 1\). Given these quantities, we define the set \(\mathbb {F}^*\) of finite nonzero floating-point numbers as follows:

$$\begin{aligned} \mathbb {F}^* = \{ M \cdot \beta ^{e - p + 1} \mid M, e \in \mathbb {Z}, 0< {|M|} < \beta ^p, {e_{\textrm{min}}}\le e \le {e_{\textrm{max}}} \}. \end{aligned}$$

From here, we define the set \(\mathbb {F}\) of finite floating-point numbers and the set \({\overline{\mathbb {F}}}\) of floating-point numbers:

$$\begin{aligned} \mathbb {F}&= \mathbb {F}^* \cup \{ 0 \},\\ {\overline{\mathbb {F}}}&= \mathbb {F}\cup \{ -\infty , +\infty \}. \end{aligned}$$

To avoid repetition, from this point forward we will assume that the sets of floating-point numbers correspond to some floating-point format \((\beta , p, {e_{\textrm{min}}}, {e_{\textrm{max}}})\).

From the above definition, we immediately obtain the following:

Lemma 2.7

\(\mathbb {F}^*\), \(\mathbb {F}\) and \({\overline{\mathbb {F}}}\) are each closed under negation.

Furthermore, the largest finite floating-point number and the smallest positive floating-point number have simple expressions:

$$\begin{aligned}&\max \mathbb {F}^* = \max \mathbb {F}= (\beta ^p - 1)\beta ^{q_{\textrm{max}}}\end{aligned}$$
(16)
$$\begin{aligned}&\min {\overline{\mathbb {F}}} \cap (0, +\infty ] = \beta ^{q_{\textrm{min}}}\end{aligned}$$
(17)
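Identities (16) and (17) are easy to check by enumerating a toy format. The Python sketch below uses illustrative parameters \(\beta = 2\), \(p = 3\), \({e_{\textrm{min}}}= -1\), \({e_{\textrm{max}}}= 2\) of our own choosing:

```python
beta, p, emin, emax = 2, 3, -1, 2
qmin, qmax = emin - p + 1, emax - p + 1

# Finite nonzero floating-point numbers: M * beta**(e - p + 1), 0 < |M| < beta**p.
F_star = {M * beta ** (e - p + 1)
          for M in range(-(beta ** p - 1), beta ** p) if M != 0
          for e in range(emin, emax + 1)}

largest = max(F_star)                           # (beta**p - 1) * beta**qmax = 7
smallest_pos = min(x for x in F_star if x > 0)  # beta**qmin = 0.125
```

The enumeration also confirms Lemma 2.7 for this format: negating every element of `F_star` leaves the set unchanged.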

We further categorize the elements of \(\mathbb {F}^*\) as follows:

Definition 2.8

(Normal and subnormal numbers) A nonzero finite floating-point number \(x \in \mathbb {F}^*\) is normal if \({|x|} \ge \beta ^{e_{\textrm{min}}}\), or subnormal otherwise.

The normal numbers are so called because they each have a normalized representation in scientific notation, restricting the exponent to \(\{ {e_{\textrm{min}}}, \dots , {e_{\textrm{max}}} \}\). By contrast, a subnormal number can only be written without leading zeros by choosing an exponent below \({e_{\textrm{min}}}\). Note that we do not consider zero to be normal or subnormal.

A number written in scientific notation has an exponent and significand. Similarly, each finite floating-point number has an associated exponent and significand determined by the floating-point format. We will now define these concepts and generalize them to the finite reals.

Definition 2.9

(Exponent and significand) The functions \(E :\mathbb {R}\rightarrow \mathbb {Z}\) and \(Q :\mathbb {R}\rightarrow \mathbb {Z}\) are defined as follows:

$$\begin{aligned} E(x)&= {\left\{ \begin{array}{ll} \left\lfloor \log _\beta {|x|}\right\rfloor , &{} {|x|} \ge \beta ^{e_{\textrm{min}}},\\ {e_{\textrm{min}}}, &{} {|x|} < \beta ^{e_{\textrm{min}}}. \end{array}\right. }\\ Q(x)&= E(x) - p + 1. \end{aligned}$$

Given \(x \in \mathbb {R}\), the integers E(x) and Q(x) are the exponent and quantum exponent of x, respectively. The significand of x is the number \(x\beta ^{-E(x)}\). If \(x\beta ^{-Q(x)}\) is an integer, then it is the integral significand [24] of x.

To justify these names, we establish that these indeed compute the exponent of a number written as a product of a significand and a power of \(\beta \), respecting the minimum exponent:

Lemma 2.10

(Exponent law) Let \(m \in \mathbb {R}\) and \(e \in \mathbb {Z}\) such that \({|m|} < \beta \) and \(e \ge {e_{\textrm{min}}}\). If either \({|m|} \ge 1\) or \(e = {e_{\textrm{min}}}\), then \(E(m \beta ^e) = e\). Otherwise, \(E(m \beta ^e) < e\).

Proof

First, note that if \({|m \beta ^e|} \ge \beta ^{e_{\textrm{min}}}\), then

$$\begin{aligned} E(m \beta ^e)&= \left\lfloor \log _\beta {|m \beta ^e|}\right\rfloor \\&= \left\lfloor \log _\beta {|m|} + e\right\rfloor \\&= \left\lfloor \log _\beta {|m|}\right\rfloor + e. \end{aligned}$$

We now proceed by cases:

  • Suppose \({|m|} \ge 1\). Then \({|m \beta ^e|} \ge \beta ^{e_{\textrm{min}}}\) and \(\left\lfloor \log _\beta {|m|}\right\rfloor = 0\), and thus \(E(m \beta ^e) = e\).

  • Suppose \(e = {e_{\textrm{min}}}\) and \({|m|} < 1\). Then \({|m \beta ^e|} < \beta ^{e_{\textrm{min}}}\) and so we have \(E(m \beta ^e) = {e_{\textrm{min}}}= e\) immediately by the definition of E.

  • Suppose \(e > {e_{\textrm{min}}}\) and \({|m|} < 1\). If \({|m \beta ^e|} \ge \beta ^{e_{\textrm{min}}}\), then \(\left\lfloor \log _\beta {|m|}\right\rfloor < 0\) and hence \(E(m \beta ^e) = \left\lfloor \log _\beta {|m|}\right\rfloor + e < e\). If \({|m \beta ^e|} < \beta ^{e_{\textrm{min}}}\) instead, then \(E(m \beta ^e) = {e_{\textrm{min}}}< e\).

Since the result holds in all cases, we are finished. \(\square \)

Expanding on the above, the function E also satisfies the following properties for all \(x, y \in \mathbb {R}\):

$$\begin{aligned} E(x) = E(-x){} & {} \text {(evenness)} \end{aligned}$$
(18)
$$\begin{aligned} \text {if } {|x|} \le {|y|}, \text { then } E(x) \le E(y){} & {} \text {(piecewise monotonicity)} \end{aligned}$$
(19)
$$\begin{aligned} \text {if } x \in \mathbb {F}, \text { then } {e_{\textrm{min}}}\le E(x) \le {e_{\textrm{max}}}{} & {} \text {(exponent bound)}\end{aligned}$$
(20)
$$\begin{aligned} {|x \beta ^{-E(x)}|} < \beta{} & {} \text {(significand bound)}\end{aligned}$$
(21)
$$\begin{aligned} 1 \le {|x \beta ^{-E(x)}|} \text { if and only if } \beta ^{e_{\textrm{min}}}\le {|x|}{} & {} \text {(normality)} \end{aligned}$$
(22)

From the above properties, the definition of Q immediately implies the following:

$$\begin{aligned} Q(x) = Q(-x){} & {} \text {(evenness)}\end{aligned}$$
(23)
$$\begin{aligned} \text {if } {|x|} \le {|y|}, \text { then } Q(x) \le Q(y){} & {} \text {(piecewise monotonicity)}\end{aligned}$$
(24)
$$\begin{aligned} \text {if } x \in \mathbb {F}, \text { then } {q_{\textrm{min}}}\le Q(x) \le {q_{\textrm{max}}}{} & {} \text {(quantum exponent bound)}\end{aligned}$$
(25)
$$\begin{aligned} {|x \beta ^{-Q(x)}|} < \beta ^p{} & {} \text {(integral significand bound)}\end{aligned}$$
(26)
$$\begin{aligned} \beta ^{p - 1} \le {|x \beta ^{-Q(x)}|} \text { iff } \beta ^{e_{\textrm{min}}}\le {|x|}{} & {} \text {(normality)} \end{aligned}$$
(27)

Additionally, the following fact establishes an important connection between integral significands and the finite floating-point numbers:

$$\begin{aligned} \text {if } x \in \mathbb {F}, \text { then } x \beta ^{-Q(x)} \in \mathbb {Z}{} & {} \text {(integrality)} \end{aligned}$$
(28)

In other words, every finite floating-point number has an integral significand.
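The functions E and Q of Definition 2.9 can be sketched directly. The Python illustration below assumes \(\beta = 2\) (so that `math.log2` applies) with illustrative parameters p = 3 and \({e_{\textrm{min}}}= -1\); it is our own sketch, not code from the paper:

```python
import math

beta, p, emin = 2, 3, -1   # illustrative format parameters

def E(x):
    """Exponent of x (Definition 2.9), clamped below at emin."""
    if abs(x) >= beta ** emin:
        return math.floor(math.log2(abs(x)))   # floor(log_2 |x|)
    return emin

def Q(x):
    """Quantum exponent of x."""
    return E(x) - p + 1
```

For example, \(E(0.75) = -1\) since \(0.75 = 1.1_2 \times 2^{-1}\), and the integrality property (28) holds: \(0.75 \cdot 2^{3}= 6\) is an integer.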

Definition 2.11

(Unit in first/last place [27]) The functions \({{\,\textrm{ufp}\,}}, {{\,\textrm{ulp}\,}}: \mathbb {R}\rightarrow \mathbb {R}\) are defined as follows:

$$\begin{aligned} {{\,\textrm{ufp}\,}}(x)&= {\left\{ \begin{array}{ll} \beta ^{\left\lfloor \log _\beta {{|x|}}\right\rfloor }, &{} x \ne 0,\\ 0, &{} x = 0, \end{array}\right. }\\ {{\,\textrm{ulp}\,}}(x)&= {{\,\textrm{ufp}\,}}(x)\beta ^{1-p}. \end{aligned}$$

For any \(x \in \mathbb {R}\), we refer to \({{\,\textrm{ufp}\,}}(x)\) as the unit in first place of x, and to \({{\,\textrm{ulp}\,}}(x)\) as the unit in last place of x (Footnote 7).

For all \(x, y \in \mathbb {R}\) and \(n \in \mathbb {Z}\), the following statements hold:

$$\begin{aligned} {{\,\textrm{ufp}\,}}(x) \le {|x|}{} & {} \text {(lower bound)}\end{aligned}$$
(29)
$$\begin{aligned} \text {if } x \ne 0, \text { then } {{{\,\textrm{ufp}\,}}(x) \le {|x|} < \beta {{\,\textrm{ufp}\,}}(x)}{} & {} \text {(bounds)}\end{aligned}$$
(30)
$$\begin{aligned} {{\,\textrm{ufp}\,}}(x) = {|x|} \text { iff } x = 0 \text { or } {|x|} = \beta ^k \text { for some } k \in \mathbb {Z}{} & {} \text {(exactness)}\end{aligned}$$
(31)
$$\begin{aligned} {{\,\textrm{ufp}\,}}(x \beta ^n) = {{\,\textrm{ufp}\,}}(x) \beta ^n{} & {} \text {(scaling)}\end{aligned}$$
(32)
$$\begin{aligned} {{\,\textrm{ufp}\,}}(x) {{\,\textrm{ufp}\,}}(y) = {{\,\textrm{ufp}\,}}(x {{\,\textrm{ufp}\,}}(y)){} & {} \text {(product law)}\end{aligned}$$
(33)
$$\begin{aligned} \text {if } y \ne 0, \text { then } {{{\,\textrm{ufp}\,}}(x)/{{\,\textrm{ufp}\,}}(y)} = {{\,\textrm{ufp}\,}}(x/{{\,\textrm{ufp}\,}}(y)){} & {} \text {(quotient law)} \end{aligned}$$
(34)
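The unit in first place is the largest power of \(\beta \) not exceeding \({|x|}\), which can be computed by repeated scaling rather than logarithms. A Python sketch of Definition 2.11 (our own, with illustrative parameters \(\beta = 2\), p = 3):

```python
beta, p = 2, 3   # illustrative format parameters

def ufp(x):
    """Unit in first place: the largest power of beta not exceeding |x|."""
    if x == 0:
        return 0
    m, k = abs(x), 0
    while m >= beta:            # scale down until the significand is < beta
        m, k = m / beta, k + 1
    while m < 1:                # scale up until the significand is >= 1
        m, k = m * beta, k - 1
    return beta ** k

def ulp(x):
    """Unit in last place."""
    return ufp(x) * beta ** (1 - p)
```

For instance, `ufp(0.75)` is 0.5 and `ufp(-6)` is 4, consistent with the bounds property (30) and evenness of the definition.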

There is also a close relationship between \({{\,\textrm{ufp}\,}}\) and E. In particular, the base-\(\beta \) logarithm of the unit in first place is the exponent (if exponents were unbounded):

$$\begin{aligned} \text {if } {|x|} \ge \beta ^{e_{\textrm{min}}}, \text { then } {{{\,\textrm{ufp}\,}}(x) = \beta ^{E(x)}}\end{aligned}$$
(35)
$$\begin{aligned} \text {if } {|x|}< \beta ^{e_{\textrm{min}}}, \text { then } {{{\,\textrm{ufp}\,}}(x) < \beta ^{E(x)}} \end{aligned}$$
(36)

The following result allows us to further simplify computing the unit in first place of a quotient by separating out the significands:

Lemma 2.12

For all \(x, y \in \mathbb {R}^*\), we have

$$\begin{aligned} {{\,\textrm{ufp}\,}}(x/y) = {{\,\textrm{ufp}\,}}(m_x/m_y) \frac{{{\,\textrm{ufp}\,}}(x)}{{{\,\textrm{ufp}\,}}(y)},\\ {{\,\textrm{ufp}\,}}(m_x/m_y) = {\left\{ \begin{array}{ll} 1/\beta &{} \text {if } {|m_x|} < {|m_y|},\\ 1 &{} \text {otherwise,} \end{array}\right. } \end{aligned}$$

where \(m_x = x/{{\,\textrm{ufp}\,}}(x)\) and \(m_y = y/{{\,\textrm{ufp}\,}}(y)\).

In a similar vein, we have the following lemma on the exponents of quotients:

Lemma 2.13

Let \(x, z \in \mathbb {R}^*\) and let \(m_x = x\beta ^{-E(x)}\) and \(m_z = z\beta ^{-E(z)}\). If \(\beta ^{e_{\textrm{min}}}\le {|z/x|}\), then \(E(z/x) = E(z) - E(x) + \log _\beta {{\,\textrm{ufp}\,}}(m_z/m_x)\).

Proof

Suppose \(\beta ^{e_{\textrm{min}}}\le {|z/x|}\). Then,

$$\begin{aligned} \beta ^{E(z/x)}&= {{\,\textrm{ufp}\,}}(z/x)\\&= {{\,\textrm{ufp}\,}}\left( \frac{m_z \beta ^{E(z)}}{m_x \beta ^{E(x)}}\right) \\&= {{\,\textrm{ufp}\,}}(m_z/m_x)\beta ^{E(z) - E(x)}. \end{aligned}$$

Taking logarithms, we obtain the desired result. \(\square \)

Having defined the basic structure of \({\overline{\mathbb {F}}}\), we now turn to the definition of the computational operations involving the floating-point numbers.

Definition 2.14

(Rounding) A rounding function is any function \(f :{\overline{\mathbb {R}}} \rightarrow {\overline{\mathbb {F}}}\). The downward rounding function \({{\,\textrm{RD}\,}}:{\overline{\mathbb {R}}} \rightarrow {\overline{\mathbb {F}}}\) and the upward rounding function \({{\,\textrm{RU}\,}}:{\overline{\mathbb {R}}} \rightarrow {\overline{\mathbb {F}}}\) are defined as follows:

$$\begin{aligned} {{\,\textrm{RD}\,}}(x)&= \max \,\{ y \in {\overline{\mathbb {F}}} \mid y \le x \},\\ {{\,\textrm{RU}\,}}(x)&= \min \,\{ y \in {\overline{\mathbb {F}}} \mid y \ge x \}. \end{aligned}$$

A rounding function f is faithful if and only if \(f(x) \in \{ {{\,\textrm{RD}\,}}(x), {{\,\textrm{RU}\,}}(x) \}\) for all \(x \in {\overline{\mathbb {R}}}\).

It is easy to show that both \({{\,\textrm{RD}\,}}\) and \({{\,\textrm{RU}\,}}\) are nondecreasing (Footnote 8) functions. In addition, they satisfy the following properties for all \(x \in {\overline{\mathbb {R}}}\):

$$\begin{aligned} {{\,\textrm{RD}\,}}(x) \le x \le {{\,\textrm{RU}\,}}(x){} & {} \text {(lower/upper bound)}\end{aligned}$$
(37)
$$\begin{aligned} {{\,\textrm{RU}\,}}(-x) = -{{\,\textrm{RD}\,}}(x){} & {} \text {(duality)}\end{aligned}$$
(38)
$$\begin{aligned} x \in {\overline{\mathbb {F}}} \text { if and only if } {{\,\textrm{RU}\,}}(x) = {{\,\textrm{RD}\,}}(x){} & {} \text {(exactness)} \end{aligned}$$
(39)

In addition, the following gap property holds for all \(x \in \mathbb {R}\):

$$\begin{aligned} {{\,\textrm{RU}\,}}(x) - {{\,\textrm{RD}\,}}(x) = {\left\{ \begin{array}{ll} \infty &{} \text {if } {|x|} > \max \mathbb {F},\\ 0 &{} \text {if } x \in \mathbb {F},\\ \beta ^{Q(x)} &{} \text {otherwise.}\\ \end{array}\right. } \end{aligned}$$
(40)
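Since \({\overline{\mathbb {F}}}\) is finite, \({{\,\textrm{RD}\,}}\) and \({{\,\textrm{RU}\,}}\) can be sketched by direct search over an enumerated toy format. The Python illustration below (our own, with illustrative parameters \(\beta = 2\), p = 3, \({e_{\textrm{min}}}= -1\), \({e_{\textrm{max}}}= 2\)) also exhibits the gap property (40):

```python
import math

beta, p, emin, emax = 2, 3, -1, 2
# Finite floating-point numbers (M = 0 contributes zero), plus the infinities.
Fbar = sorted({M * beta ** (e - p + 1)
               for M in range(-(beta ** p - 1), beta ** p)
               for e in range(emin, emax + 1)} | {-math.inf, math.inf})

def RD(x):
    """Downward rounding: the largest floating-point number <= x."""
    return max(y for y in Fbar if y <= x)

def RU(x):
    """Upward rounding: the smallest floating-point number >= x."""
    return min(y for y in Fbar if y >= x)
```

In this format, \({{\,\textrm{RD}\,}}(0.3) = 0.25\) and \({{\,\textrm{RU}\,}}(0.3) = 0.375\), so the gap is \(0.125 = \beta ^{Q(0.3)}\); values beyond \(\max \mathbb {F}= 7\) round upward to \(+\infty \).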

We also give a partial definition of rounding to nearest, which is the default rounding mode used in practice:

Definition 2.15

(Rounding to nearest) A rounding function f rounds to nearest if \({|f(x) - x|} = \min _{y \in {\overline{\mathbb {F}}}}\,{|y - x|}\) for all \(x \in [\min \mathbb {F}, \max \mathbb {F}]\).

We denote by \({{\,\textrm{RN}\,}}\) an arbitrary nondecreasing and faithful rounding to nearest. Note that this is not a complete definition! In particular, it does not give a tie-breaking rule for when x is the exact midpoint between two floating-point numbers. It also does not fully specify how to round values outside \([\min \mathbb {F}, \max \mathbb {F}]\). We leave these unspecified since IEEE 754 specifies two different round-to-nearest modes with different rules in these cases (Footnote 9). Notably, even though \(+\infty \) and \(-\infty \) are infinitely far apart from any finite number, according to IEEE 754, they may still result from rounding finite values to nearest. More specifically, it states that the rounding to nearest of any x such that \({|x|} \ge \beta ^{e_{\textrm{max}}}(\beta - \beta ^{1-p}/2)\) is the infinity with the same sign as x. Note that this bound is the exact midpoint of \(\max \mathbb {F}\) and \(\beta ^{{e_{\textrm{max}}}+ 1}\).

We shall denote by \({{\,\textrm{fl}\,}}\) an arbitrary nondecreasing faithful rounding function. This condition is satisfied by most practical rounding modes, including \({{\,\textrm{RD}\,}}\), \({{\,\textrm{RU}\,}}\), and rounding to nearest as defined above. The following properties hold for all \(x \in {\overline{\mathbb {R}}}\):

$$\begin{aligned} x \in {\overline{\mathbb {F}}} \text { if and only if } {{\,\textrm{fl}\,}}(x) = x{} & {} \text {(exactness)}\end{aligned}$$
(41)
$$\begin{aligned} {{\,\textrm{fl}\,}}({{\,\textrm{fl}\,}}(x)) = {{\,\textrm{fl}\,}}(x){} & {} \text {(idempotence)} \end{aligned}$$
(42)

Throughout this paper, we will frequently use the preimages of floating-point intervals under rounding. We will often rely on the following lemma:

Lemma 2.16

Let \(X \subseteq {\overline{\mathbb {F}}}\). If X is a nonempty floating-point interval, then \({{\,\textrm{fl}\,}}^{-1}[X]\) is an extended real interval such that \({{\,\textrm{fl}\,}}^{-1}[X] \supseteq [\min X, \max X]\).

Proof

Suppose X is a nonempty floating-point interval, and suppose to the contrary that \({{\,\textrm{fl}\,}}^{-1}[X]\) is not an interval. Then there are some \(a, b \in {{\,\textrm{fl}\,}}^{-1}[X]\) and \(y \in {\overline{\mathbb {R}}}\) such that \(a< y < b\) and \(y \notin {{\,\textrm{fl}\,}}^{-1}[X]\). Since \({{\,\textrm{fl}\,}}\) is nondecreasing, it follows that \({{\,\textrm{fl}\,}}(a) \le {{\,\textrm{fl}\,}}(y) \le {{\,\textrm{fl}\,}}(b)\). However, since \({{\,\textrm{fl}\,}}(a), {{\,\textrm{fl}\,}}(b) \in X\) where X is a floating-point interval, we thus have \({{\,\textrm{fl}\,}}(y) \in X\). Contradiction! Hence \({{\,\textrm{fl}\,}}^{-1}[X]\) is an interval.

Now, since \({{\,\textrm{fl}\,}}\) is the identity over the floating-point numbers, \({{\,\textrm{fl}\,}}(x) = x\) for all \(x \in X\) and hence \({{\,\textrm{fl}\,}}^{-1}[X] \supseteq X\). Since \({\overline{\mathbb {F}}}\) is finite, X is also finite, and hence \([\min X, \max X]\) is the smallest interval containing X as a subset. Therefore, since \({{\,\textrm{fl}\,}}^{-1}[X]\) is an interval, it follows that \({{\,\textrm{fl}\,}}^{-1}[X] \supseteq [\min X, \max X]\). \(\square \)
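As a small illustration (our own, using the round-down two-place decimals of Example 1.1), the preimage of the one-element interval \(Z = \{5.00\}\) under rounding is the half-open real interval [5.00, 5.01), which indeed contains \([\min Z, \max Z]\):

```python
from fractions import Fraction as F

def rd(x, quantum=F(1, 100)):
    # Round x downward to a two-place decimal (quantum 0.01 near 5.0),
    # as in Example 1.1.
    return (x / quantum // 1) * quantum

Z = {F('5.00')}

assert rd(F('5.00')) in Z       # left endpoint of the preimage, included
assert rd(F('5.0099')) in Z     # an interior point of [5.00, 5.01)
assert rd(F('5.01')) not in Z   # right endpoint, excluded
```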

To ensure the generality of our results, we define floating-point multiplication in terms of \({{\,\textrm{fl}\,}}\) rather than fixing a specific rounding functionFootnote 10:

Definition 2.17

(Multiplication) The floating-point multiplication operator \(\otimes \) is defined by \(x \otimes y = {{\,\textrm{fl}\,}}(xy)\) for all \(x, y \in {\overline{\mathbb {F}}}\) such that the product xy is defined.

Note that the extended real product is defined if and only if the factors do not include both zero and an infinity. In particular, the product of an extended real with a nonzero (finite) real number is always defined. As a consequence, however, \({\overline{\mathbb {R}}}\) is not closed under multiplication, and so neither is \({\overline{\mathbb {F}}}\). In fact, neither \(\mathbb {F}^*\) nor \(\mathbb {F}\) is necessarily closed under floating-point multiplication, since it is possible for a sufficiently large product to round to infinity or a sufficiently small nonzero product to round to zero.
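For instance, in binary64 (a sketch of ours; the paper works over an arbitrary format), both failure modes are easy to trigger:

```python
import math

# Finite nonzero factors whose product overflows to infinity...
assert math.isinf(1e300 * 1e300)

# ...and finite nonzero factors whose product underflows to zero
# (1e-600 is far below the smallest binary64 subnormal, about 4.9e-324).
assert 1e-300 * 1e-300 == 0.0
```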

However, since \({{\,\textrm{fl}\,}}\) is nondecreasing, we can still carry over some important ordering properties from the reals.

Lemma 2.18

Let \(a \in \mathbb {F}^*\) and \(x, y \in {\overline{\mathbb {F}}}\) such that \(x \le y\). If \(a > 0\), then \(a \otimes x \le a \otimes y\). If \(a < 0\), then \(a \otimes y \le a \otimes x\).

Since \({\overline{\mathbb {F}}}\) is a subset of \({\overline{\mathbb {R}}}\), it is totally ordered by the usual order relation. As such, there is a simple definition of an interval over \({\overline{\mathbb {F}}}\):

Definition 2.19

(Floating-point interval) A floating-point interval is any set \(X \cap {\overline{\mathbb {F}}}\) where X is an extended real interval.

With the exception of \([-\infty , +\infty ] \cap {\overline{\mathbb {F}}}\), due to the gaps between floating-point numbers, every floating-point interval has many representations in terms of extended real intervals. However, since \({\overline{\mathbb {F}}}\) is finite, each of its nonempty subsets has a minimum and a maximum, making it trivial to find the smallest interval containing it. For all \(a, b \in {\overline{\mathbb {R}}}\), we have the following:

$$\begin{aligned} \text {if } a, b \in {\overline{\mathbb {F}}} \text { and } a \le b, \text { then } a, b \in [a, b] \cap {\overline{\mathbb {F}}}{} & {} \text {(tightness)}\end{aligned}$$
(43)
$$\begin{aligned} [a, b] \cap {\overline{\mathbb {F}}} = [{{\,\textrm{RU}\,}}(a), {{\,\textrm{RD}\,}}(b)] \cap {\overline{\mathbb {F}}}{} & {} \text {(rounding)} \end{aligned}$$
(44)

That is, if the bounds are in \({\overline{\mathbb {F}}}\), they are necessarily the tightest possible bounds, and intersecting an interval with \({\overline{\mathbb {F}}}\) corresponds to rounding its bounds “inward”.
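The rounding identity can be checked mechanically. Below is a small sketch of ours using toy floating-point numbers spaced 0.1 apart (the two-digit decimals of Example 2.29 near 1.x), with RD and RU implemented over exact rationals:

```python
from fractions import Fraction as F

Q = F(1, 10)                      # spacing of the toy floats near 1.x

def RD(x):                        # round toward minus infinity
    return (x / Q // 1) * Q

def RU(x):                        # round toward plus infinity
    return -RD(-x)

a, b = F('1.234'), F('1.571')
assert (RU(a), RD(b)) == (F('1.3'), F('1.5'))

# [a, b] and [RU(a), RD(b)] contain exactly the same toy floats.
grid = [F(k, 10) for k in range(10, 20)]      # toy floats in [1.0, 1.9]
assert [x for x in grid if a <= x <= b] == \
       [x for x in grid if RU(a) <= x <= RD(b)]
```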

2.2 The Classical Iterative Procedure

As shown in Example 1.1, we can use real division to correctly improve the bounds on the factors of a floating-point product. Although it did not find optimal bounds immediately, the division-based approach shown in the example is still essential to solving the problem efficiently, so let us describe it more rigorously.

In the example at the start, we reasoned in terms of inequalities. However, a general method will be easier to describe in terms of sets. Let \(X, Y, Z \subseteq {\overline{\mathbb {F}}}\) be floating-point intervals and let \(x \in X\). First, note that by the definition of floating-point multiplication, for all \(y \in Y\), we have \(x \otimes y \in Z\) if and only if \(xy \in {{\,\textrm{fl}\,}}^{-1}[Z]\). This greatly simplifies things by allowing us to characterize x using extended real division in most cases. More precisely, using our augmented “division”, we have \(x \otimes y \in Z\) for some \(y \in Y\) if and only if . We relate this to our problem statement as follows:

Definition 2.20

(Solution sets and optimality) A problem instance is any triple \((X, Y, Z)\) where \(X, Y, Z \subseteq {\overline{\mathbb {F}}}\). The solution set of \((X, Y, Z)\) is the set of triples

$$\begin{aligned} V = \{ (x, y, z) \mid x \in X, y\in Y, z \in Z, x \otimes y = z \}. \end{aligned}$$

A problem instance is satisfiable if and only if its solution set is nonempty. The triple \((X, Y, Z)\) has optimal bounds if and only if \(\Box V = \Box X \times \Box Y \times \Box Z\).

Lemma 2.21

Let \(X, Y, Z \subseteq {\overline{\mathbb {F}}}\) and let V be the solution set of \((X, Y, Z)\). Then \(\Box V = \Box X' \times \Box Y' \times \Box Z'\), where

Proof

Let \(V_1\), \(V_2\) and \(V_3\) be the sets containing, for each triple in V, its first, second, or third element, respectively. Then,

$$\begin{aligned} \Box V = \Box V_1 \times \Box V_2 \times \Box V_3. \end{aligned}$$

Now, by the definition of , it is easy to check that \(V_1 = X'\), \(V_2 = Y'\), and \(V_3 = Z'\), and the result follows immediately. \(\square \)

Although useful, it is impractical to directly use the above result to solve the problem. Since Y is a floating-point interval, it has many gaps, and so it is very expensive to compute the set of quotients. These gaps were invisible in Example 1.1 because we were working with inequalities, rather than the exact sets. Translated in terms of sets, what we did corresponds to enlarging the set of denominators Y to the extended real interval \(\Box Y\). By enlarging X and Z similarly, we obtain a precise definition of our algorithm:

Definition 2.22

(Classical iteration) We define the function \(F :{\mathcal {P}}({\overline{\mathbb {R}}})^3 \rightarrow {\mathcal {P}}({\overline{\mathbb {R}}})^3\) by \(F(X, Y, Z) = (X', Y', Z')\) where

Since \(\Box Y\) is a superset of Y, it follows that is a superset of , and so this substitution is sound, though it may not produce an optimal result. Importantly, however, a quotient of intervals is easy to compute and represent, since it must itself be either an interval or a union of two disjoint intervals.

Lemma 2.23

(Soundness) Let \(X, Y, Z \subseteq {\overline{\mathbb {F}}}\) and let \(x \in X\), \(y \in Y\), \(z \in Z\). If \(x \otimes y = z\), then \(x \in X', y \in Y', z \in Z'\) where \((X', Y', Z') = F(X, Y, Z)\).

Proof

Suppose \(x \otimes y = z\). Then, since \(x \otimes y = {{\,\textrm{fl}\,}}(xy) \in {{\,\textrm{fl}\,}}[\Box X \Box Y]\) and \(z \in Z\), we have \(z \in Z'\) by definition. Since \(z \in Z\), we also have \(xy \in {{\,\textrm{fl}\,}}^{-1}[Z]\) by definition. Thus, since \(y \in \Box Y\), we have by definition, and hence \(x \in X'\). Since \(x \in \Box X\), we also similarly have \(y \in Y'\), as desired. \(\square \)

It is easy to prove that iterating F converges in finite time, since the sets involved are finite, and we only shrink them:

Lemma 2.24

(Termination) For all \(X_1, Y_1, Z_1 \subseteq {\overline{\mathbb {F}}}\), the sequences of sets given by

$$\begin{aligned} (X_n, Y_n, Z_n) = F(X_{n-1}, Y_{n-1}, Z_{n-1}),\; n>1 \end{aligned}$$

are eventually constant.

Proof

Since \({\overline{\mathbb {F}}}\) is finite, \(X_1\), \(Y_1\) and \(Z_1\) are finite. Therefore, since the definition of F implies that the sequences \(X_n\), \(Y_n\) and \(Z_n\) are monotonically decreasing by the subset relation, they must be eventually constant. \(\square \)
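Termination and soundness are easy to observe empirically. The sketch below is our own construction, not the paper's implementation: it brute-forces one step of F specialized to nonempty sets of positive two-place decimals under round-down, using the data of Example 1.1, and iterates to a fixpoint.

```python
from fractions import Fraction as F

Q = F(1, 100)                                  # quantum of the toy floats

def rd(x):                                     # round down to a toy float
    return (x / Q // 1) * Q

def step(X, Y, Z):
    # One application of F, specialized to sets of positive toy floats:
    # interval hulls stand in for Box X, Box Y, Box Z, and the preimage
    # rd^{-1}[Z] is the real interval [min Z, max Z + Q).
    if not (X and Y and Z):
        return set(), set(), set()
    zlo, zhi = min(Z), max(Z) + Q
    X1 = {x for x in X if x * min(Y) < zhi and x * max(Y) >= zlo}
    Y1 = {y for y in Y if y * min(X) < zhi and y * max(X) >= zlo}
    lo, hi = min(X) * min(Y), max(X) * max(Y)  # hull of the products
    Z1 = {z for z in Z if z <= hi and z + Q > lo}
    return X1, Y1, Z1

def grid(a, b):
    return {a + k * Q for k in range(int((b - a) / Q) + 1)}

X, Y, Z = grid(F('2.20'), F('2.50')), grid(F('1.00'), F('2.50')), {F('5.00')}
solutions = {(x, y) for x in X for y in Y if rd(x * y) in Z}

while True:
    X1, Y1, Z1 = step(X, Y, Z)
    assert X1 <= X and Y1 <= Y and Z1 <= Z     # the sets only shrink,
    if (X1, Y1, Z1) == (X, Y, Z):              # so iteration terminates
        break
    X, Y, Z = X1, Y1, Z1

# Soundness: no solution of the original instance was ever discarded.
assert all(x in X and y in Y for x, y in solutions)
```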

Furthermore, individual applications of F can be computed quickly:

Lemma 2.25

(Time complexity) If X, Y, and Z are floating-point intervals or are each a disjoint union of an all-nonnegative and an all-nonpositive floating-point interval, then \(F(X, Y, Z)\) can be computed using O(1) arithmetic operations and computations of the preimage of a floating-point interval under \({{\,\textrm{fl}\,}}\).

Remark

Although the number of arithmetic operations needed does not increase with the problem parameters, the individual cost of computing them does. In the sign-significand-exponent encoding, floating-point numbers (in any base) take up \(O(p + \log ({e_{\textrm{max}}}- {e_{\textrm{min}}}+ 1))\) bits. If \({{\,\textrm{fl}\,}}\) belongs to one of the usual families of rounding functions (e.g. \({{\,\textrm{RD}\,}}\), \({{\,\textrm{RU}\,}}\), or \({{\,\textrm{RN}\,}}\)), both it and its preimage can be computed in time linear in the length of their inputs. Thus, floating-point addition can be computed in time linear in the length of the addends, and multiplying two floating-point numbers takes time \(O(M(p) + \log ({e_{\textrm{max}}}- {e_{\textrm{min}}}+ 1))\), where M(p) is the time complexity of multiplying two p-digit integers. Correctly rounded division is more difficult, but it can be computed using Newton-Raphson division in asymptotically the same time as multiplication. Thus, for most choices of \({{\,\textrm{fl}\,}}\), computing F takes time \(O(M(p) + \log ({e_{\textrm{max}}}- {e_{\textrm{min}}}+ 1))\).Footnote 11 Note that we do not take the base, precision or exponent range as constants in our analysis, as doing so tends to collapse the time complexity of floating-point algorithms to O(1), precluding any meaningful comparison. We also do not assume a specific multiplication algorithm, as there are many to choose from with different characteristics.

However, given that what we are computing is a relaxation of the original problem, it is unclear how quickly the iteration converges, or even that it necessarily converges to the optimum. Certainly, it can be quite slow in some cases, as we saw in Example 1.1. Although we do not have a concrete answer, we offer the following conjecture on the rate of convergence:

Conjecture 2.26

Let \(X_1, Y_1, Z_1 \subseteq {\overline{\mathbb {F}}}\) be floating-point intervals, and define the sequences of sets

$$\begin{aligned} (X_n, Y_n, Z_n) = F(X_{n-1}, Y_{n-1}, Z_{n-1}),\; n>1. \end{aligned}$$

Then, for all \(m > 2\), the sets \(X_{m} {\setminus } X_{m+1}\), \(Y_{m} {\setminus } Y_{m+1}\) and \(Z_{m} {\setminus } Z_{m+1}\) contain at most 4 members each.

That is, we conjecture that the rate of convergence is constant. The intuition behind this is as follows: after two iterations, the bounds will have been propagated from one factor to the other, and back again. After that, however, the bounds are within one floating-point number of the optimum of the real relaxation, so there is no significant progress to be gained from interval reasoning. Instead, each step only removes at most one floating-point number from each end of each interval in the sets. (Each set after the first iteration may be a disjoint union of two intervals, if the initial intervals contain both positive and negative numbers.) This is clearly rather slow, so we make a further conjecture: that, in the worst case, iterating F takes a number of steps exponential in the precision to converge:

Conjecture 2.27

Denote by \(\overline{\mathbb {F}}_n\) the floating-point numbers with format \((\beta , n, {e_{\textrm{min}}}, {e_{\textrm{max}}})\) for each precision \(n \ge 2\) and let \({{\,\textrm{fl}\,}}_n\) be an associated nondecreasing, faithful rounding function. Let \(F_n\) be F parameterized over \({{\,\textrm{fl}\,}}_n\). For all \(X, Y, Z \subseteq \overline{\mathbb {F}}_n\), let \(\mu _n(X, Y, Z) = m\) be the least positive integer such that \(F^{m+1}_n(X, Y, Z) = F^m_n(X, Y, Z)\), where \(F^k_n\) denotes \(F_n\) composed with itself k times. If \(f :\{ 2, 3, \ldots \} \rightarrow \mathbb {Z}\) is a function such that

$$\begin{aligned} f(n) = \max {\{ \mu _n(X, Y, Z) \mid X, Y, Z \text { are floating-point intervals over } \overline{\mathbb {F}}_n \}}, \end{aligned}$$

then \(f(n) = \Theta (\beta ^{cn})\) for some positive real c.

Remark

Proving the upper bound is trivial, as it is of the same order as brute-force search. The lower bound is tricky: to prove it, we essentially need to show that bad (i.e. exponentially slow) instances exist at every precision. It would perhaps be easier to show that f is not dominated by any function of lesser order; for this, we merely need infinitely many precisions with bad instances. A promising candidate is the set of all instances of the form \((X, {\overline{\mathbb {F}}}, \{ \beta ^p - 1 \})\), where X is the widest floating-point interval centered on \(\sqrt{\beta ^{p-1}(\beta ^p - 1)}\) such that the instance is unsatisfiable. (The unsatisfiable example we gave earlier for 64-bit IEEE floating-point numbers is one such instance.)

Leaving aside the rate of convergence, we would also like to know what exactly we are converging to. We can show that whenever \(\Box X\) and \(\Box Y\) do not contain zero, convergence implies that the bounds on X and Y are optimal. In such cases, we obtain explicit solutions to the equation \(x \otimes y = z\) where x or y is an endpoint of X or Y, respectively.

Lemma 2.28

(Conditional partial optimality) Let \(X, Y, Z \subseteq {\overline{\mathbb {F}}}\). If Z is a floating-point interval, and neither \(\Box X\) nor \(\Box Y\) contains zero, and \(F(X, Y, Z) = (X, Y, Z)\), then one of the following statements holds:

  1.

    X, Y and Z are empty.

  2.

    The elements of X have the same sign as the elements of Y, the elements of Z are nonnegative, and \(\min X \otimes \max Y \in Z\) and \(\max X \otimes \min Y \in Z\).

  3.

    The elements of X have the opposite sign of the elements of Y, the elements of Z are nonpositive, and \(\min X \otimes \min Y \in Z\) and \(\max X \otimes \max Y \in Z\).

Proof

Suppose the conditions hold. Then, if any of X, Y, or Z are empty, the definition of F trivially implies that they are all empty, so the result holds. Suppose instead that X, Y and Z are nonempty. Since \(\Box X\) is an interval not containing zero, it is a subset of either \([-\infty , 0)\) or \((0, +\infty ]\), and thus the elements of X all have the same sign. Similarly, we also find that every element of Y has the same sign. We now proceed by cases:

Suppose that the elements of X have the same sign as the elements of Y. Then \({{\,\textrm{fl}\,}}(xy) \ge 0\) for all \(x \in \Box X\) and \(y \in \Box Y\) where the product is defined. Therefore, by the definition of F, the elements of Z are all nonnegative.

  1.

    Let \(x \in \Box X\) and \(y \in \Box Y\) such that \(\min X \cdot y \in {{\,\textrm{fl}\,}}^{-1}[Z]\) and \(x \cdot \max Y \in {{\,\textrm{fl}\,}}^{-1}[Z]\). (These exist by the definition of .)

    (a)

      Suppose \(\min X > 0\) and \(\max Y > 0\). Then, since \(\min X \le x\) and \(y \le \max Y\), multiplying and rounding gives

      $$\begin{aligned} \min Z \le {{\,\textrm{fl}\,}}(\min X \cdot y) \le \min X \otimes \max Y \le {{\,\textrm{fl}\,}}(x \cdot \max Y) \le \max Z, \end{aligned}$$

      and since Z is a floating-point interval, we thus have \(\min X \otimes \max Y \in Z\).

    (b)

      Suppose \(\min X < 0\) and \(\max Y < 0\) instead. Then, similarly, we obtain

      $$\begin{aligned} {{\,\textrm{fl}\,}}(x \cdot \max Y) \le \min X \otimes \max Y \le {{\,\textrm{fl}\,}}(\min X \cdot y), \end{aligned}$$

      and thus \(\min X \otimes \max Y \in Z\).

  2.

    Let \(x \in \Box X\) and \(y \in \Box Y\) instead such that \(\max X \cdot y \in {{\,\textrm{fl}\,}}^{-1}[Z]\) and \(x \cdot \min Y \in {{\,\textrm{fl}\,}}^{-1}[Z]\).

    (a)

      Suppose \(\max X > 0\) and \(\min Y > 0\). Then, since \(x \le \max X\) and \(\min Y \le y\),

      $$\begin{aligned} {{\,\textrm{fl}\,}}(x \cdot \min Y) \le \max X \otimes \min Y \le {{\,\textrm{fl}\,}}(\max X \cdot y), \end{aligned}$$

      and thus \(\max X \otimes \min Y \in Z\).

    (b)

      Suppose \(\max X < 0\) and \(\min Y < 0\) instead. Then, since \(x \le \max X\) and \(\min Y \le y\),

      $$\begin{aligned} {{\,\textrm{fl}\,}}(\max X \cdot y) \le \max X \otimes \min Y \le {{\,\textrm{fl}\,}}(x \cdot \min Y), \end{aligned}$$

      and thus \(\max X \otimes \min Y \in Z\).

Suppose instead that the elements of X have the opposite sign of the elements of Y. Then \({{\,\textrm{fl}\,}}(xy) \le 0\) for all \(x \in \Box X\) and \(y \in \Box Y\) and thus the elements of Z are all nonpositive.

  1.

    Let \(x \in \Box X\) and \(y \in \Box Y\) such that \(\min X \cdot y \in {{\,\textrm{fl}\,}}^{-1}[Z]\) and \(x \cdot \min Y \in {{\,\textrm{fl}\,}}^{-1}[Z]\).

    (a)

      Suppose \(\min X > 0\) and \(\min Y < 0\). Then, since \(\min X \le x\) and \(\min Y \le y\), multiplying and rounding gives

      $$\begin{aligned} \min Z \le {{\,\textrm{fl}\,}}(x \cdot \min Y) \le \min X \otimes \min Y \le {{\,\textrm{fl}\,}}(\min X \cdot y) \le \max Z, \end{aligned}$$

      and therefore \(\min X \otimes \min Y \in Z\).

    (b)

      Suppose \(\min X < 0\) and \(\min Y > 0\) instead. Then, similarly, we obtain

      $$\begin{aligned} {{\,\textrm{fl}\,}}(\min X \cdot y) \le \min X \otimes \min Y \le {{\,\textrm{fl}\,}}(x \cdot \min Y), \end{aligned}$$

      and thus \(\min X \otimes \min Y \in Z\).

  2.

    Let \(x \in \Box X\) and \(y \in \Box Y\) instead such that \(\max X \cdot y \in {{\,\textrm{fl}\,}}^{-1}[Z]\) and \(x \cdot \max Y \in {{\,\textrm{fl}\,}}^{-1}[Z]\).

    (a)

      Suppose \(\max X > 0\) and \(\max Y < 0\). Then, since \(x \le \max X\) and \(y \le \max Y\),

      $$\begin{aligned} {{\,\textrm{fl}\,}}(\max X \cdot y) \le \max X \otimes \max Y \le {{\,\textrm{fl}\,}}(x \cdot \max Y), \end{aligned}$$

      and so \(\max X \otimes \max Y \in Z\).

    (b)

      Suppose \(\max X < 0\) and \(\max Y > 0\) instead. Then, similarly,

      $$\begin{aligned} {{\,\textrm{fl}\,}}(x \cdot \max Y) \le \max X \otimes \max Y \le {{\,\textrm{fl}\,}}(\max X \cdot y), \end{aligned}$$

      and thus \(\max X \otimes \max Y \in Z\).

Since we have exhausted all cases, the proof is complete. \(\square \)

Remark

We can use this result in the general case by splitting X and Y into positive and negative parts, though doing so means that we may need to solve multiple smaller instances to solve the overall instance. Alternatively, we could reformulate F to split X and Y appropriately.

Remark

Note that if there are no solutions, iterating F eventually proves this by converging to empty sets.

Now, although the above result is very informative, it does not tell us about the bounds on Z when F converges. We now show that, unfortunately, iterating F does not always produce optimal bounds:

Example 2.29

Consider \(\beta = 10\), \(p = 2\) and \({e_{\textrm{min}}}\le 0 \le {e_{\textrm{max}}}\). (That is, 1-place decimal floating-point numbers.) Let \(X = [1.2, 1.4] \cap \mathbb {F}\), \(Y = [1.2, 1.3] \cap \mathbb {F}\) and \(Z = [1.5, 1.8] \cap \mathbb {F}\) and suppose \({{\,\textrm{fl}\,}}\) is rounding to nearest, ties to even. We first show that \(F(X, Y, Z) = (X, Y, Z)\). In the following, note that \({{\,\textrm{fl}\,}}^{-1}[Z] = (1.45, 1.85]\):

From the definition of F, the above confirms that \(F(X, Y, Z) = (X, Y, Z)\) indeed. It is also easy to verify that the bounds on X and Y are optimal, as per Lemma 2.28:

$$\begin{aligned}&\min X \otimes \max Y = {{\,\textrm{fl}\,}}(1.2 \cdot 1.3) = {{\,\textrm{fl}\,}}(1.56) = 1.6 \in Z,\\&\max X \otimes \min Y = {{\,\textrm{fl}\,}}(1.4 \cdot 1.2) = {{\,\textrm{fl}\,}}(1.68) = 1.7 \in Z. \end{aligned}$$

However, if we enumerate all the products of the members of X and Y, we find that none of them round to \(\min Z = 1.5\).

Table 1 Real and rounded products of the elements of X and Y

From Table 1, we see that \(X \otimes Y = \{ 1.4, 1.6, 1.7, 1.8 \}\), and so \(Z \cap (X \otimes Y) = [1.6, 1.8] \cap \mathbb {F}\) is a proper subset of Z that is in fact a floating-point interval. Therefore \(\Box X \times \Box Y \times \Box Z\) is not the smallest box enclosing the solution set.
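Table 1 can be reproduced mechanically. In this sketch (ours), rn implements round-to-nearest, ties to even, over exact rationals with the quantum 0.1 that applies to all values in the example:

```python
from fractions import Fraction as F

def rn(x, quantum=F(1, 10)):
    # Round x to the nearest multiple of `quantum`, ties to even multiples.
    n, r = divmod(x / quantum, 1)
    if r > F(1, 2) or (r == F(1, 2) and n % 2 == 1):
        n += 1
    return n * quantum

X = [F('1.2'), F('1.3'), F('1.4')]
Y = [F('1.2'), F('1.3')]

# The set of rounded products of the elements of X and Y, as in Table 1.
rounded = {rn(x * y) for x in X for y in Y}
assert rounded == {F('1.4'), F('1.6'), F('1.7'), F('1.8')}
assert F('1.5') not in rounded       # min Z = 1.5 is never attained
```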

Example 2.29 demonstrates that the classical algorithm does not necessarily produce optimal results. However, there is a simple strategy we can use to remedy this. To find optimal bounds on Z—that is, find the least and greatest values \(z \in Z\) such that \(x \otimes y = z\) is satisfied by some \(x \in X\) and \(y \in Y\)—we can use binary search. Now, testing satisfiability is simple if we can find optimal bounds on any part: if a problem instance is unsatisfiable, then the optimal bounds are all empty sets. The key, then, is to find an efficient optimization method. In this paper, we develop such a procedure for the bounds on factors.
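On the data of Example 2.29, the binary-search strategy looks like this (a sketch of ours, with brute-force enumeration standing in for an efficient satisfiability test). The predicate "some product rounds into the prefix Z[:k]" is monotone in k, so we can bisect on the prefix length:

```python
from fractions import Fraction as F

def rn(x, quantum=F(1, 10)):
    # Round x to the nearest multiple of `quantum`, ties to even multiples.
    n, r = divmod(x / quantum, 1)
    if r > F(1, 2) or (r == F(1, 2) and n % 2 == 1):
        n += 1
    return n * quantum

X = [F('1.2'), F('1.3'), F('1.4')]
Y = [F('1.2'), F('1.3')]
Z = [F('1.5'), F('1.6'), F('1.7'), F('1.8')]

def satisfiable(zs):
    # Brute force here; the paper seeks an efficient replacement.
    return any(rn(x * y) in zs for x in X for y in Y)

# Find the least k such that satisfiable(Z[:k]); since the instance is
# satisfiable, Z[k-1] is then the optimal lower bound on z.
lo, hi = 1, len(Z)
while lo < hi:
    mid = (lo + hi) // 2
    if satisfiable(Z[:mid]):
        hi = mid
    else:
        lo = mid + 1

assert Z[lo - 1] == F('1.6')    # matches Table 1: no product rounds to 1.5
```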

3 Simplifying the Problem

3.1 Sufficient and Necessary Conditions

In order to simplify the problem, we will need to precisely determine the conditions under which the division-based algorithm produces optimal results. In the second half of Example 1.1, we encountered certain combinations of values for which the equation \(x \otimes y = z\) has no solutions over the floating-point numbers. Unlike with the reals, there is no guarantee that we can find a value for x satisfying the equality for any given y and z. The existence of a solution is necessary for any optimal bound, however, and so the following definition will be useful:

Definition 3.1

(Floating-point factors) A floating-point number x is a floating-point factor of a floating-point number z if and only if \(x \otimes y = z\) for some \(y \in {\overline{\mathbb {F}}}\). Given a set \(Z \subseteq {\overline{\mathbb {F}}}\), we say that x is feasible for Z if and only if x is a floating-point factor of some \(z \in Z\). The set of all floating-point factors of the members of Z is denoted \({{\,\textrm{Feas}\,}}(Z)\).
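Over a small toy system, feasibility can be decided by brute force. The sketch below is our own: it uses two-digit decimal significands spread over three decades with round-to-nearest, ties to even. Consistent with Example 2.29, 1.0 is feasible for {1.5} but 1.2 is not:

```python
from fractions import Fraction as F

def rn(x):
    # Round a positive rational to two significant decimal digits,
    # ties to even (the quantum is chosen from x's decade).
    q = F(1, 10)
    while x / q >= 100:
        q *= 10
    while x / q < 10:
        q /= 10
    n, r = divmod(x / q, 1)
    if r > F(1, 2) or (r == F(1, 2) and n % 2 == 1):
        n += 1
    return n * q

# Positive toy floats: significands 10..99 over three decades.
floats = {F(s, 100) * 10**e for s in range(10, 100) for e in range(3)}

def feasible(x, Z):
    # Brute-force test: is x a floating-point factor of some z in Z?
    return any(rn(x * y) in Z for y in floats)

Z = {F('1.5')}
assert feasible(F('1.0'), Z)        # e.g. rn(1.0 * 1.5) = 1.5
assert not feasible(F('1.2'), Z)    # no toy y puts 1.2 * y in (1.45, 1.55)
```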

For the sake of brevity, we will typically simply write “factor” when it is clear from context that we are discussing floating-point factors. It is worth noting that sets of factors satisfy the following closure property:

Lemma 3.2

For any \(Z \subseteq {\overline{\mathbb {F}}}\), the set \({{\,\textrm{Feas}\,}}(Z)\) is closed under negation.

Proof

Let \(Z \subseteq {\overline{\mathbb {F}}}\). If \({{\,\textrm{Feas}\,}}(Z)\) is empty, it is trivially closed under negation. Suppose \({{\,\textrm{Feas}\,}}(Z)\) is nonempty instead, and let \(x \in {{\,\textrm{Feas}\,}}(Z)\). Then there is some \(y \in {\overline{\mathbb {F}}}\) such that \(x \otimes y \in Z\). Therefore,

$$\begin{aligned} -x \otimes -y&= {{\,\textrm{fl}\,}}(-x \cdot -y)\\&= {{\,\textrm{fl}\,}}(xy)\\&= x \otimes y \in Z, \end{aligned}$$

and so \(-x \in {{\,\textrm{Feas}\,}}(Z)\). \(\square \)

Corollary 3.2.1

For any \(x \in {\overline{\mathbb {F}}}\) and \(Z \subseteq {\overline{\mathbb {F}}}\), we have

$$\begin{aligned} \max {\{ y \in {{\,\textrm{Feas}\,}}(Z) \mid y \le x \}} = -\min {\{ y \in {{\,\textrm{Feas}\,}}(Z) \mid y \ge -x \}}. \end{aligned}$$

Remark

The above corollary implies that we only need one procedure to compute both the minimum and maximum of \([a, b] \cap {{\,\textrm{Feas}\,}}(Z)\) for any \(a, b \in {\overline{\mathbb {F}}}\).

The following two lemmas are crucial. First, we prove that the real quotients from the division-based algorithm can still provide us with useful bounds.

Lemma 3.3

Let \(x \in {\overline{\mathbb {F}}}\), and let \(Y, Z \subseteq {\overline{\mathbb {F}}}\) where Y is a floating-point interval. If \({{\,\textrm{fl}\,}}(xy) \in Z\) for some \(y \in \Box Y\), then

  1.

    \(x \otimes s \ge \min Z\) for some \(s \in Y\), and

  2.

    \(x \otimes t \le \max Z\) for some \(t \in Y\).

Proof

Suppose \({{\,\textrm{fl}\,}}(xy) \in Z\) for some \(y \in \Box Y\). Since \(\min Y\) and \(\max Y\) are floating-point numbers and \(\min Y \le y \le \max Y\), by the monotonicity of rounding, we have \(\min Y \le {{\,\textrm{RD}\,}}(y) \le \max Y\) and \(\min Y \le {{\,\textrm{RU}\,}}(y) \le \max Y\). Since Y is a floating-point interval, it thus follows that \({{\,\textrm{RD}\,}}(y) \in Y\) and \({{\,\textrm{RU}\,}}(y) \in Y\). We shall first deal with the special cases of x being zero or infinite.

Suppose \(x = 0\). Since xy is defined by assumption, y must be finite, and hence \(xy = 0\). Thus \({{\,\textrm{fl}\,}}(xy) = 0\). Since y is finite, at least one of either \({{\,\textrm{RD}\,}}(y)\) or \({{\,\textrm{RU}\,}}(y)\) is finite, and thus either \(x {{\,\textrm{RD}\,}}(y) = 0\) or \(x {{\,\textrm{RU}\,}}(y) = 0\). Therefore either \(x \otimes {{\,\textrm{RD}\,}}(y) = 0\) or \(x \otimes {{\,\textrm{RU}\,}}(y) = 0\). Since \({{\,\textrm{fl}\,}}(xy) = 0\) and \(\min Z \le {{\,\textrm{fl}\,}}(xy) \le \max Z\), the result follows.

Suppose \(x = +\infty \) or \(x = -\infty \) instead. Since xy is defined by assumption, y must be nonzero and hence either \(xy = +\infty \) or \(xy = -\infty \). Therefore \({{\,\textrm{fl}\,}}(xy) = xy\). Since y is nonzero, at least one of \({{\,\textrm{RU}\,}}(y)\) or \({{\,\textrm{RD}\,}}(y)\) is nonzero. If \({{\,\textrm{RD}\,}}(y)\) is nonzero, then it has the same sign as y and hence \(x y = x {{\,\textrm{RD}\,}}(y)\). Similarly, if \({{\,\textrm{RU}\,}}(y)\) is nonzero, then it has the same sign as y and hence \(xy = x {{\,\textrm{RU}\,}}(y)\). Therefore either \(x \otimes {{\,\textrm{RD}\,}}(y) = {{\,\textrm{fl}\,}}(xy)\) or \(x \otimes {{\,\textrm{RU}\,}}(y) = {{\,\textrm{fl}\,}}(xy)\), and hence the result.

We now handle the remaining case of x being finite and nonzero. Suppose \(x \in \mathbb {F}^*\) instead. Then multiplication by x is always defined. Note that \(\min Y \le y \le \max Y\) and \(\min Z \le {{\,\textrm{fl}\,}}(xy) \le \max Z\). We shall consider each sign of x separately:

  • Suppose x is positive. Multiplying by x, it follows that

    $$\begin{aligned} x \min Y \le xy \le x \max Y, \end{aligned}$$

    and by the monotonicity of rounding, we have

    $$\begin{aligned} x \otimes \min Y \le {{\,\textrm{fl}\,}}(xy) \le x \otimes \max Y. \end{aligned}$$

    Combining these bounds with our previous bounds on \({{\,\textrm{fl}\,}}(xy)\), we obtain

    $$\begin{aligned} x \otimes \min Y \le&{{\,\textrm{fl}\,}}(xy) \le \max Z,\\ \min Z \le&{{\,\textrm{fl}\,}}(xy) \le x \otimes \max Y, \end{aligned}$$

    and hence the result holds.

  • Suppose x is negative. Then, similarly, we have

    $$\begin{aligned} x \otimes \min Y \ge {{\,\textrm{fl}\,}}(xy) \ge x \otimes \max Y. \end{aligned}$$

    Therefore, since \({{\,\textrm{fl}\,}}(xy) \in Z\), it follows that

    $$\begin{aligned} x \otimes \min Y \ge&{{\,\textrm{fl}\,}}(xy) \ge \min Z,\\ \max Z \ge&{{\,\textrm{fl}\,}}(xy) \ge x \otimes \max Y, \end{aligned}$$

    and thus we obtain the result.

Since we have proved the result for all possible cases, we are finished. \(\square \)

Next, we show that the solution set is exactly the set of feasible quotients:

Lemma 3.4

Let \(x \in {\overline{\mathbb {F}}}\), and let \(Y, Z \subseteq {\overline{\mathbb {F}}}\) be floating-point intervals. Then there exists \(y \in Y\) such that \(x \otimes y \in Z\) if and only if

  1.

    x is feasible for Z, and

  2.

    \({{\,\textrm{fl}\,}}(xa) \in Z\) for some \(a \in \Box Y\).

Proof

If \(x \otimes y \in Z\) for some \(y \in Y\), then \(x \in {{\,\textrm{Feas}\,}}(Z)\) by definition and \({{\,\textrm{fl}\,}}(xy) = x \otimes y \in Z\) where \(y \in \Box Y\) trivially. Suppose instead that x is feasible for Z, and that \({{\,\textrm{fl}\,}}(xa) \in Z\) for some \(a \in \Box Y\). Then, by definition, there is some \(w \in {\overline{\mathbb {F}}}\) such that \(x \otimes w \in Z\). Further, by Lemma 3.3, we have \(x \otimes s \ge \min Z\) and \(x \otimes t \le \max Z\) for some \(s, t \in Y\). If \(w \in Y\), then the result follows immediately. Since Y is a floating-point interval, if instead \(w \notin Y\), then either \(w < \min Y\) or \(w > \max Y\). For the following, note that \(\min Z \le x \otimes w \le \max Z\). We shall divide the problem into four cases, depending on the sign of x and the region w lies in:

  1.

    Suppose \(w < \min Y\). Then \(w < s\) and \(w < t\).

    (a)

      Suppose \(x \ge 0\). Multiplying, we obtain \(x \otimes w \le x \otimes t\), and thus \(\min Z \le x \otimes t \le \max Z\). Since Z is a floating-point interval, it follows that \(x \otimes t \in Z\).

    (b)

      Suppose \(x < 0\). Then, we can multiply \(w < s\) by x to obtain \(x \otimes w \ge x \otimes s\), and thus \(\max Z \ge x \otimes s \ge \min Z\). Since Z is a floating-point interval, we have \(x \otimes s \in Z\).

  2.

    Suppose \(w > \max Y\). Then \(w > s\) and \(w > t\).

    (a)

      Suppose \(x \ge 0\). Then \(x \otimes w \ge x \otimes s\), and thus \(x \otimes s \in Z\) by the reasoning of case (1)(b).

    (b)

      Suppose \(x < 0\). Then \(x \otimes w \le x \otimes t\), and hence \(x \otimes t \in Z\) by case (1)(a).

Since we have either \(x \otimes s \in Z\) or \(x \otimes t \in Z\) in all cases, the result follows. \(\square \)

Remark

This is not a solution on its own. Note that a is not necessarily a floating-point number!

This brings us to our first major result. The following theorem breaks the problem of computing optimal bounds into two simpler problems: computing bounds using the division-based algorithm, and then shrinking those bounds to the nearest feasible values.

Theorem 3.5

Let \(X, Y, Z \subseteq {\overline{\mathbb {F}}}\) be nonempty floating-point intervals, and let

Then \(X' = {\hat{X}} \cap {{\,\textrm{Feas}\,}}(Z)\).

Proof

By Lemma 3.4, for all \(x \in {\overline{\mathbb {F}}}\), we have \(x \in X'\) if and only if \(x \in X\), \(x \in {{\,\textrm{Feas}\,}}(Z)\) and , and the result follows immediately. \(\square \)

Remark

This result implies that the division-based algorithm produces optimal bounds if and only if it produces feasible bounds. That is, \(\min {\hat{X}} = \min X'\) if and only if \(\min {\hat{X}}\) is feasible for Z. Therefore, the difficult case is when the bounds produced by the division-based algorithm are infeasible.

Theorem 3.5 shows that we can solve Problem 1 by using the classical division-based algorithm to compute the set of quotients and then intersecting it with the set of factors. As such, the remainder of this paper will focus on solving this new problem:

Problem 2

Let \(x \in {\overline{\mathbb {F}}}\) and let \(Z \subseteq {\overline{\mathbb {F}}}\) be a floating-point interval. What is the least (greatest) number feasible for Z that is no less (greater) than x?

3.2 Narrowing Down Infeasibility

Due to Theorem 3.5, we would like to devise a method to find the factors nearest any given non-factor. To that end, we now identify some more practical conditions for feasibility. The following lemma gives a simple and direct test for feasibility, and also shows that floating-point multiplication and division are closely related:

Lemma 3.6

Let \(x \in \mathbb {F}^*\) and let \(Z \subseteq {\overline{\mathbb {F}}}\) be a floating-point interval. Let \(z \in Z\). Then x is feasible for Z if and only if either \(x \otimes {{\,\textrm{RD}\,}}(z/x) \in Z\) or \(x \otimes {{\,\textrm{RU}\,}}(z/x) \in Z\).

Proof

If either \(x \otimes {{\,\textrm{RD}\,}}(z/x) \in Z\) or \(x \otimes {{\,\textrm{RU}\,}}(z/x) \in Z\), then x is trivially feasible for Z. For the other half, suppose x is feasible for Z. Then \(x \otimes y \in Z\) for some \(y \in {\overline{\mathbb {F}}}\). Now, let \(Y = {{\,\textrm{fl}\,}}^{-1}[Z]/x\). Then \(y \in Y\) and \(z/x \in Y\), and since Z is a floating-point interval, it follows that Y is an interval. Therefore, if \(y \le z/x\), then \(y \le {{\,\textrm{RD}\,}}(z/x) \le z/x\), and so \({{\,\textrm{RD}\,}}(z/x) \in Y\). If \(y > z/x\) instead, then \(y \ge {{\,\textrm{RU}\,}}(z/x) \ge z/x\), and thus \({{\,\textrm{RU}\,}}(z/x) \in Y\). Since either \({{\,\textrm{RD}\,}}(z/x) \in Y\) or \({{\,\textrm{RU}\,}}(z/x) \in Y\), we have either \(x \otimes {{\,\textrm{RD}\,}}(z/x) \in Z\) or \(x \otimes {{\,\textrm{RU}\,}}(z/x) \in Z\), as desired. \(\square \)

Corollary 3.6.1

Let \(x \in \mathbb {F}^*\) and \(z \in {\overline{\mathbb {F}}}\). Then x is a floating-point factor of z if and only if either \(x \otimes {{\,\textrm{RD}\,}}(z/x) = z\) or \(x \otimes {{\,\textrm{RU}\,}}(z/x) = z\).

Additionally, it is worth mentioning that the above serves as a manual proof of a general statement of the automatically derived “invertibility condition” for multiplication of Brain et al. [7].
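To make the corollary concrete, here is a brute-force sanity check in a small simulated format. All parameters and helper names here (the toy format, `RD`, `mul`, and so on) are our own illustrative choices, not constructs from the paper; the sketch assumes \(\beta = 10\), \(p = 2\), and \({{\,\textrm{fl}\,}}= {{\,\textrm{RD}\,}}\), similar to the systems used in the figures.

```python
from fractions import Fraction
import bisect

# A tiny simulated floating-point format for experimentation (parameters and
# names are our own choices): beta = 10, p = 2, e_min = -1, e_max = 1, fl = RD.
BETA, P, EMIN, EMAX = 10, 2, -1, 1
QMIN, QMAX = EMIN - P + 1, EMAX - P + 1

# F: zero plus every +/- M * beta^q with 0 < M < beta^p and QMIN <= q <= QMAX.
F = {Fraction(0)}
for q in range(QMIN, QMAX + 1):
    for M in range(1, BETA ** P):
        v = Fraction(M) * Fraction(BETA) ** q
        F.update((v, -v))
F_SORTED = sorted(F)

def RD(x):
    """Round x down to the toy format (clamped at the ends for simplicity)."""
    i = bisect.bisect_right(F_SORTED, x)
    return F_SORTED[max(i - 1, 0)]

def RU(x):
    """Round x up to the toy format (assumes x <= max F)."""
    return F_SORTED[bisect.bisect_left(F_SORTED, x)]

def mul(x, y):
    """The rounded product x (*) y, here with fl = RD."""
    return RD(Fraction(x) * Fraction(y))

def is_factor_bruteforce(x, z):
    """x is a factor of z iff some y yields x (*) y = z.  For x, z > 0 it
    suffices to try y > 0, since otherwise the product is nonpositive."""
    return any(mul(x, y) == z for y in F_SORTED if y > 0)

def is_factor_corollary(x, z):
    """The two-division test of Corollary 3.6.1."""
    q = Fraction(z) / Fraction(x)
    return mul(x, RD(q)) == z or mul(x, RU(q)) == z
```

Exhaustive comparison of the two predicates over this toy format agrees with the corollary: two rounded divisions suffice to decide factor status.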

Although it is clear that if the set of products is large, there will be many solutions, it is unclear exactly how large it needs to be. The following results give a more precise idea of how its diameter relates to the set of solutions. In the following lemma, we describe a lower bound on the diameter sufficient to ensure feasibility in all cases.

Lemma 3.7

Let \(Z \subseteq {\overline{\mathbb {F}}}\) be a floating-point interval, and let \(x \in \mathbb {F}^*\) and \(z \in Z\). If \({|z/x|} \le \max \mathbb {F}\) and \({{\,\textrm{diam}\,}}{{\,\textrm{fl}\,}}^{-1}[Z] > {|x|}\beta ^{Q(z/x)}\), then x is feasible for Z.

Proof

If \(z/x \in {\overline{\mathbb {F}}}\), then \(x \otimes (z/x) = z\), and hence x is feasible for Z. Suppose \(z/x \notin {\overline{\mathbb {F}}}\) instead, and suppose \({|z/x|} \le \max \mathbb {F}\) and \({{\,\textrm{diam}\,}}{{\,\textrm{fl}\,}}^{-1}[Z] > {|x|}\beta ^{Q(z/x)}\). First, note that \({{\,\textrm{RD}\,}}(z/x)< z/x < {{\,\textrm{RU}\,}}(z/x)\), and hence z lies strictly between \(x{{\,\textrm{RD}\,}}(z/x)\) and \(x{{\,\textrm{RU}\,}}(z/x)\). More precisely, let \(I = x \cdot [{{\,\textrm{RD}\,}}(z/x), {{\,\textrm{RU}\,}}(z/x)]\). Then \(z \in I\), and hence \(I \cap {{\,\textrm{fl}\,}}^{-1}[Z]\) is nonempty. Since Z is a floating-point interval, \({{\,\textrm{fl}\,}}^{-1}[Z]\) is an interval. We shall now show that \({{\,\textrm{fl}\,}}^{-1}[Z]\) is strictly wider than I:

$$\begin{aligned} {{\,\textrm{diam}\,}}I&= {|x|} ({{\,\textrm{RU}\,}}(z/x) - {{\,\textrm{RD}\,}}(z/x))\\&= {|x|} \beta ^{Q(z/x)}\\&< {{\,\textrm{diam}\,}}{{\,\textrm{fl}\,}}^{-1}[Z]. \end{aligned}$$

Therefore, since I and \({{\,\textrm{fl}\,}}^{-1}[Z]\) are overlapping intervals, by Lemma 2.3, it follows that either \(\min I \in {{\,\textrm{fl}\,}}^{-1}[Z]\) or \(\max I \in {{\,\textrm{fl}\,}}^{-1}[Z]\). Hence either \(x \otimes {{\,\textrm{RU}\,}}(z/x) \in Z\) or \(x \otimes {{\,\textrm{RD}\,}}(z/x) \in Z\), and so x is feasible for Z. \(\square \)

The next lemma builds on the previous result to give a width independent of the denominator under which virtually all values are feasible.

Lemma 3.8

Let \(Z \subseteq {\overline{\mathbb {F}}}\) be a floating-point interval, and let \(x \in \mathbb {F}^*\) and \(z \in Z\). If \(\beta ^{e_{\textrm{min}}}\le {|z/x|} \le \max \mathbb {F}\) and \({{\,\textrm{diam}\,}}{{\,\textrm{fl}\,}}^{-1}[Z] \ge \beta ^{Q(z) + 1}\), then x is feasible for Z.

Proof

Suppose \(\beta ^{e_{\textrm{min}}}\le {|z/x|} \le \max \mathbb {F}\) and \({{\,\textrm{diam}\,}}{{\,\textrm{fl}\,}}^{-1}[Z] \ge \beta ^{Q(z) + 1}\). Then,

$$\begin{aligned} {|x|} \beta ^{Q(z/x)}&= {|x|} \beta ^{E(z/x) - p + 1}\\&\le {|x|} {|z/x|} \beta ^{1-p}\\&= {|z|} \beta ^{1 - p}\\&< \beta ^{E(z) + 1} \beta ^{1 - p}\\&= \beta ^{Q(z) + 1}\\&\le {{\,\textrm{diam}\,}}{{\,\textrm{fl}\,}}^{-1}[Z]. \end{aligned}$$

Therefore, by Lemma 3.7, x is feasible for Z. \(\square \)

Remark

For any ordinary rounding function (e.g. \({{\,\textrm{RD}\,}}\), \({{\,\textrm{RU}\,}}\), or \({{\,\textrm{RN}\,}}\)), the requirement that \({{\,\textrm{diam}\,}}{{\,\textrm{fl}\,}}^{-1}[Z] \ge \beta ^{Q(z) + 1}\) roughly corresponds to Z containing at least \(\beta \) floating-point numbers with the same exponent as z (cf. Ziv et al. [30]).

The next lemma provides an even stronger bound, under the condition that the significand of the numerator is no greater in magnitude than the significand of the denominator:

Lemma 3.9

Let \(Z \subseteq {\overline{\mathbb {F}}}\) be a floating-point interval, and let \(x \in \mathbb {F}^*\) and \(z \in Z\). Let \(m_x = x \beta ^{-E(x)}\) and \(m_z = z \beta ^{-E(z)}\). If \({{\,\textrm{diam}\,}}{{\,\textrm{fl}\,}}^{-1}[Z] \ge \beta ^{Q(z)}\), \({|m_z|} \le {|m_x|}\), and \(\beta ^{e_{\textrm{min}}}\le {|z/x|} \le \max \mathbb {F}\), then x is feasible for Z.

Proof

Suppose \(\beta ^{e_{\textrm{min}}}\le {|z/x|} \le \max \mathbb {F}\). If \({|m_z|} = {|m_x|}\), then \({|z/x|} = \beta ^{E(z) - E(x)}\) and hence \(z/x \in \mathbb {F}\), so x is trivially feasible for Z. Suppose \({|m_z|} < {|m_x|}\) instead, and also suppose \({{\,\textrm{diam}\,}}{{\,\textrm{fl}\,}}^{-1}[Z] \ge \beta ^{Q(z)}\). Then \(\log _\beta {{\,\textrm{ufp}\,}}(m_z/m_x) \le -1\) and so by Lemma 2.13, it follows that \(E(z/x) \le E(z) - E(x) - 1\). Thus,

$$\begin{aligned} {|x|}\beta ^{Q(z/x)}&= {|x|}\beta ^{E(z/x) - p + 1}\\&\le {|x|}\beta ^{E(z) - E(x) - p}\\&= {|m_x|}\beta ^{Q(z) - 1}\\&< \beta ^{Q(z)}\\&\le {{\,\textrm{diam}\,}}{{\,\textrm{fl}\,}}^{-1}[Z]. \end{aligned}$$

Therefore, by Lemma 3.7, x is feasible for Z. \(\square \)

Remark

Fig. 1
figure 1

Preimages of 1 and 1.75 under floating-point rounding, with \((\beta , p, {e_{\textrm{min}}}, {e_{\textrm{max}}}) = (2, 3, -2, 1)\). \({{\,\textrm{RNE}\,}}\) is rounding to nearest, breaking ties by choosing numbers with even integral significands

Any “ordinary” rounding function almost always satisfies the condition on the diameter of the preimage. Specifically, we always have \({{\,\textrm{diam}\,}}{{\,\textrm{fl}\,}}^{-1}[\{ z \}] = \beta ^{Q(z)}\) unless \({|z|}\) is normal and a power of \(\beta \). This can be seen in Fig. 1; note the different diameters of the preimages of 1 under rounding.

Note that Lemmas 3.7 and 3.9 are both easier to apply when using standard rounding modes, or more specifically, when the preimage under rounding of any floating-point number is not too narrow. Although the usual rounding functions are both faithful and monotonic, and these are indeed very useful properties, they are not enough on their own:

Example 3.10

We shall construct an ill-behaved rounding function which is still both faithful and monotonic. Let \(f :{\overline{\mathbb {R}}} \rightarrow {\overline{\mathbb {F}}}\) be a function such that

$$\begin{aligned} f(x) = {\left\{ \begin{array}{ll} {{\,\textrm{RD}\,}}(x) &{} \text {if } x \le \max \mathbb {F},\\ +\infty &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$

Since \({{\,\textrm{RD}\,}}\) is nondecreasing and faithful, it is easy to see that f is also nondecreasing and faithful. However, the only real number that f rounds to \(\max \mathbb {F}\) is \(\max \mathbb {F}\) itself. That is, we have \(f^{-1}[\{ \max \mathbb {F} \}] = \{ \max \mathbb {F} \}\), and so the preimage under rounding has zero diameter. Further, given \(x, y \in {\overline{\mathbb {F}}}\), we have \(x \otimes y = \max \mathbb {F}\) if and only if \(xy = \max \mathbb {F}\) exactly. This means that finding the floating-point factors of \(\max \mathbb {F}\) requires integer factorization. Specifically, since \(\max \mathbb {F}= (\beta ^p - 1)\beta ^{q_{\textrm{max}}}\), the floating-point factors of \(\max \mathbb {F}\) are exactly the floating-point numbers with a nonnegative exponent (i.e. integers) whose integral significands divide \(\beta ^p - 1\).

As demonstrated above, faithfulness and monotonicity are not enough to ensure that a rounding function has a tractable preimage. We therefore restrict our attention to a better-behaved class of rounding functions. In particular, we are interested in rounding functions that allocate to each floating-point number at least half of the gap around it:

Definition 3.11

(Equitable rounding) A rounding function f is equitable if it is faithful and monotonic, and if \({{\,\textrm{diam}\,}}f^{-1}[\{ x \}] \ge \beta ^{Q(x) - k}\) for all \(x \in \mathbb {F}^*\), where \(k = 1\) if \({|x\beta ^{-E(x)}|} = 1\) and \(k = 0\) otherwise.

Remark

Note that we do not require the preimages under rounding of 0, \(+\infty \), and \(-\infty \) to have a minimum diameter. This is because they already have every nonzero finite floating-point number as a floating-point factor, and so we do not need their preimages at all.

Given this definition, the following is then straightforward (albeit tedious) to prove:

Lemma 3.12

\({{\,\textrm{RD}\,}}\) and \({{\,\textrm{RU}\,}}\) are equitable rounding functions.

However, \({{\,\textrm{RN}\,}}\) is not necessarily equitable, since it is unclear what rounding to nearest means for numbers outside \([\min \mathbb {F}, \max \mathbb {F}]\) and thus our definition does not fully specify the preimages of \(\min \mathbb {F}\) and \(\max \mathbb {F}\) under \({{\,\textrm{RN}\,}}\). For instance, we could choose \({{\,\textrm{RN}\,}}\) such that \({{\,\textrm{RN}\,}}^{-1}[\{ x \}] = ((x^- + x)/2, x]\) when \(x = \max \mathbb {F}\). This is clearly not equitable, since \({{\,\textrm{diam}\,}}{{\,\textrm{RN}\,}}^{-1}[\{ x \}] = \beta ^{q_{\textrm{max}}}/2\). However, this inconvenience is not as important as it may seem. As mentioned earlier, IEEE 754-style rounding to nearest specifies that a number rounds to an infinity if and only if its magnitude is no less than \((\max \mathbb {F}+ \beta ^{{e_{\textrm{max}}}+ 1})/2\). This in turn means that the preimages of \(\min \mathbb {F}\) and \(\max \mathbb {F}\) under rounding have a diameter of \(\beta ^{q_{\textrm{max}}}\), as required for equitability. In particular, the preimages of \(\min \mathbb {F}\) and \(\max \mathbb {F}\) under any equitable rounding to nearest are at least as large as those under IEEE 754-style rounding:

Lemma 3.13

\({{\,\textrm{RN}\,}}\) is an equitable rounding function if and only if \({{\,\textrm{RN}\,}}^{-1}[\{ +\infty \}] \subseteq I\) and \({{\,\textrm{RN}\,}}^{-1}[\{ -\infty \}] \subseteq -I\) where \(I = [(\max \mathbb {F}+ \beta ^{{e_{\textrm{max}}}+ 1})/2, +\infty ]\).

Therefore, since IEEE 754-style systems are virtually the only ones in use, we do not consider equitability to be an especially onerous requirement.

Now, under the assumption of equitable rounding, we can substantially refine the conclusion of Lemma 3.9.

Lemma 3.14

Let \(x \in \mathbb {F}^*\) and \(z \in {\overline{\mathbb {F}}}\), and let \(m_x = x \beta ^{-E(x)}\) and \(m_z = z \beta ^{-E(z)}\). If \({{\,\textrm{fl}\,}}\) is equitable and \(\beta ^{e_{\textrm{min}}}\le {|z/x|} \le \max \mathbb {F}\) and \({|m_z|} \le {|m_x|}\), then either \({|m_z|} = 1\) or x is a floating-point factor of z.

Proof

Suppose \({{\,\textrm{fl}\,}}\) is equitable and \(\beta ^{e_{\textrm{min}}}\le {|z/x|} \le \max \mathbb {F}\) and \({|m_z|} \le {|m_x|}\). Suppose also that \({|m_z|} \ne 1\). Then, by the definition of equitable rounding, we have \({{\,\textrm{diam}\,}}{{\,\textrm{fl}\,}}^{-1}[\{ z \}] \ge \beta ^{Q(z)}\). Therefore by Lemma 3.9, it follows that x is feasible for \(\{ z \}\), and hence the result. \(\square \)

Remark

Note that here we show that x is a factor of z, whereas Lemmas 3.7 and 3.9 only prove that it is feasible for some superset Z.

Combining the previous results, we obtain a much clearer picture of what it means when a number is infeasible:

Theorem 3.15

Let \(Z \subseteq {\overline{\mathbb {F}}}\) be a floating-point interval, and let \(x \in \mathbb {F}^*\) and \(z \in Z\). Let \(m_x = x\beta ^{-E(x)}\) and \(m_z = z\beta ^{-E(z)}\). If \({{\,\textrm{fl}\,}}\) is equitable and \({|z/x|} \le \max \mathbb {F}\) and x is not feasible for Z, then either

  1. \({|z/x|} < \beta ^{e_{\textrm{min}}}\), or

  2. \(1 = {|m_z|} < {|m_x|}\) and \({{\,\textrm{diam}\,}}{{\,\textrm{fl}\,}}^{-1}[Z] < \beta ^{Q(z)}\), or

  3. \({|m_x|} < {|m_z|}\) and \({{\,\textrm{diam}\,}}{{\,\textrm{fl}\,}}^{-1}[Z] < \beta ^{Q(z) + 1}\).

Proof

Follows immediately from Lemmas 3.7 and 3.14. \(\square \)

Remark

When rounding the quotient produces a subnormal number, as in case (1), the loss of significant digits can drastically magnify the error and make x infeasible unless Z is exceptionally wide. This makes subnormal quotients difficult to handle in general, since the difference between z and \(x \otimes {{\,\textrm{fl}\,}}(z/x)\) may be as large as x itself.

Since we assume \({{\,\textrm{fl}\,}}\) is equitable, case (2) is the only instance where x is infeasible and \({|m_z|} \le {|m_x|}\). Although it is a very constrained special case, it shall take some work to dispense with, as we will see later. The most common cause of infeasibility is case (3). In this instance, we have \({|m_x|} < {|m_z|}\), but Z is too narrow to provide a direct solution, likely due to containing fewer than \(\beta \) numbers. Note that this means that having more than one number in Z usually suffices for feasibility when \(\beta = 2\).

Theorem 3.15 gives a straightforward classification of cases where infeasibility can occur. Although we are not able to give a full treatment of the subnormal quotient case in this paper, we will present an efficient solution for the other two cases. In the next section, we develop criteria for feasibility that are more amenable to a computational solution.

4 Analyzing the Error

Fig. 2
figure 2

Graphs of exact products of floating-point divisors x and rounded quotients \({{\,\textrm{RD}\,}}(z/x)\) and \({{\,\textrm{RU}\,}}(z/x)\) where \((\beta , p, {e_{\textrm{min}}}, {e_{\textrm{max}}}) = (10, 2, -1, 1)\) and \(z = 2.5\). The x-axis shows all positive finite floating-point numbers, spaced equally

When computing a real number, we are often forced to approximate it due to the limitations of the number system. For systems as complicated as the floating-point numbers, the error in this approximation—the rounding error—can seem unpredictable. The notion is simple: given a rounding function f, the rounding error of some \(x \in \mathbb {R}\) with respect to f is just \(f(x) - x\). However, it is difficult to tell from this what values the rounding error can take and how it varies with x. In our case, we are more precisely interested in a “round-trip” error. Roughly speaking, according to Lemma 3.6, a floating-point number is feasible for a given set if and only if we can undo division by using multiplication. In other words, it is feasible if and only if the error in the “round-trip” is small enough.

For any given floating-point numerator z and denominator x, any error in attempting to round-trip multiplication using division stems from rounding the quotient z/x. This is easy to show: since \({{\,\textrm{fl}\,}}\) is assumed to be faithful, if z/x is also a floating-point number, then \(x \otimes {{\,\textrm{fl}\,}}(z/x) = z\) exactly. Consequently, any instance where \(x \otimes {{\,\textrm{fl}\,}}(z/x) \ne z\) implies that z/x is not exactly representable. This is the common case, since the quotient of two floating-point numbers rarely fits into the available number of digits.Footnote 12 However, the product is also rounded, so an inexact quotient may still be close enough. For instance, if we round downward, it is sufficient that the product lies in \([z, z^+)\), where \(z^+\) is the next floating-point number after z. A number of such examples can be seen in Fig. 2, as well as a few instances where the product has no error (note that the graph is centered on z). Other patterns are visible as well, such as periodic behavior and a reduction in error whenever the significand of x is greater than the significand of z.
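The exactness claim above can be spot-checked mechanically. The sketch below is our own illustrative code, using the parameters of Fig. 2 with \({{\,\textrm{fl}\,}}= {{\,\textrm{RD}\,}}\) and \(z = 2.5\); it tabulates the round-trip products \(x \otimes {{\,\textrm{RD}\,}}(z/x)\) over the positive toy floats.

```python
from fractions import Fraction
import bisect

# Round-trip products x (*) RD(z/x) in a toy decimal format matching Fig. 2:
# beta = 10, p = 2, e_min = -1, e_max = 1, fl = RD, z = 2.5.  (Helper names
# are ours; this is an illustration, not the paper's code.)
BETA, P, EMIN, EMAX = 10, 2, -1, 1
F = sorted({s * Fraction(M) * Fraction(BETA) ** q
            for q in range(EMIN - P + 1, EMAX - P + 2)
            for M in range(1, BETA ** P)
            for s in (1, -1)} | {Fraction(0)})
FSET = set(F)

def RD(x):  # round down (clamped at the bottom for simplicity)
    return F[max(bisect.bisect_right(F, x) - 1, 0)]

z = Fraction(5, 2)
roundtrip = {}
for x in F:
    if x > 0 and z / x <= F[-1]:
        roundtrip[x] = RD(x * RD(z / x))  # x (*) RD(z/x)
```

Whenever the quotient z/x is itself representable, the round-trip recovers z exactly; otherwise the product may land below z, reproducing the error pattern of Fig. 2.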

In the remainder of this section, we shall develop ways of manipulating this round-trip error, with the goal of using integer optimization to find floating-point factors. The key insight comes from Lemma 3.6, which tells us that feasibility can be decided using only \({{\,\textrm{fl}\,}}\) and upward- or downward-rounded quotients. Further, since we assume \({{\,\textrm{fl}\,}}\) is faithful, we also know that each product is rounded either upward or downward. Therefore, we only need to be concerned with rounding errors relative to \({{\,\textrm{RU}\,}}\) and \({{\,\textrm{RD}\,}}\) to solve the problem. However, the definitions of \({{\,\textrm{RU}\,}}\) and \({{\,\textrm{RD}\,}}\) are very cumbersome to manipulate directly. In order to make progress, we must first find a more practical description of rounding. We will do this by expressing rounding error in terms of the remainder of floored division. Furthermore, for the sake of simplicity, we will consider rounding over floating-point numbers with unbounded exponents.Footnote 13 We will also show how this relates to the original case.

4.1 Further Preliminaries

4.1.1 The Mod Operator

We shall use a generalized notion of the remainder of a division. Given a numerator n, denominator d, and quotient q, a remainder r satisfies \(n = qd + r\) and \({|r|} < {|d|}\). These quantities are traditionally integers, but we extend the operation to the real numbers. Given n and d, we shall determine the remainder r by choosing the quotient \(q = \left\lfloor n/d\right\rfloor \). Thus, we define the binary operator \(\bmod \) as follows:

Definition 4.1

(Remainder of floored division) For any numerator \(x \in \mathbb {R}\) and denominator \(y \in \mathbb {R}^*\), we define \((x \bmod y) = x - y\left\lfloor x/y\right\rfloor \).

Remark

To avoid excessive parenthesization, the mod operator defined here is nonassociative and has the lowest precedence. That is, within the parentheses, everything before mod is the first argument, and everything after mod is the second argument.

Since this is an extension of the usual notion of the remainder to real numerators and denominators, not all the same properties will hold exactly. However, there is still a close connection with modular arithmetic, which the next two lemmas will illustrate. Firstly, the remainder is periodic in the numerator, with a period equal to the denominator. That is, we can add or subtract integer multiples of the denominator to the numerator without affecting the result:

Lemma 4.2

For any \(x \in \mathbb {R}\), \(y \in \mathbb {R}^*\), and \(n \in \mathbb {Z}\), \((x + ny \bmod y) = (x \bmod y)\).

Proof

Follows directly by a straightforward computation. \(\square \)

Secondly, if we restrict ourselves wholly to the integers and fix the denominator, equality of remainders is equivalent to congruence:

Lemma 4.3

For all \(a, b \in \mathbb {Z}\), \(n \in \mathbb {Z}^*\), we have \((a \bmod n) = (b \bmod n)\) if and only if \(a \equiv b \pmod {{|n|}}\).

Proof

If \((a \bmod n) = (b \bmod n)\), then \(a - n \left\lfloor a/n\right\rfloor = b - n \left\lfloor b/n\right\rfloor \), and hence \(a \equiv b \pmod {{|n|}}\). Suppose instead that \(a \equiv b \pmod {{|n|}}\). Then there exist \(p, q, r \in \mathbb {Z}\) such that \(a = pn + r\) and \(b = qn + r\). As \((pn + r \bmod n) = (qn + r \bmod n) = (r \bmod n)\), the result follows. \(\square \)

As might be expected from being so closely related to congruence, the \(\bmod \) operator has many useful properties, though we shall only need a few. The following properties hold for all \(x \in \mathbb {R}\) and \(a, y \in \mathbb {R}^*\):

$$\begin{aligned} {|(x \bmod y)|} < {|y|}{} & {} \text {(upper bound)} \end{aligned}$$
(45)
$$\begin{aligned} (ax \bmod ay) = a(x \bmod y){} & {} \text {(distributive law)}\end{aligned}$$
(46)
$$\begin{aligned} \text {if } x, y \in \mathbb {Z}, \text { then } (x \bmod y) \in \mathbb {Z}{} & {} \text {(integrality)}\end{aligned}$$
(47)
$$\begin{aligned} \text {if } (x \bmod y) \ne 0, \text { then } (-x \bmod y) = y - (x \bmod y){} & {} \text {(complement)} \end{aligned}$$
(48)

It is also worth noting that the remainder (when nonzero) has the same sign as the denominator:

$$\begin{aligned} y(x \bmod y) \ge 0 \end{aligned}$$
(49)

Finally, the following results simplify some common cases of the numerator:

$$\begin{aligned} (0 \bmod y) = 0\end{aligned}$$
(50)
$$\begin{aligned} (x \bmod y) = 0 \text { if and only if } x = ny \text { for some } n \in \mathbb {Z}\end{aligned}$$
(51)
$$\begin{aligned} (x \bmod y) = x \text { if and only if } 0 \le x/y < 1 \end{aligned}$$
(52)
$$\begin{aligned} (x \bmod y) = x \text { if and only if } xy \ge 0 \text{ and } {|x|} < {|y|} \end{aligned}$$
(53)

The proofs of these statements are simple and follow easily from the definition of \(\bmod \).
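The \(\bmod \) operator is easy to implement exactly over the rationals, which makes these properties simple to spot-check. The following sketch (function name ours) uses Python's `fractions` module for exact arithmetic:

```python
from fractions import Fraction
import math

def mod(x, y):
    """Remainder of floored division (Definition 4.1): x - y * floor(x / y).
    Works for arbitrary rational x and nonzero rational y."""
    return x - y * math.floor(Fraction(x) / Fraction(y))
```

Exhaustive checks of the upper bound (45), distributive law (46), complement (48), sign property (49), and the periodicity of Lemma 4.2 over a grid of rationals all pass.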

4.1.2 Floating-Point Numbers with Unbounded Exponents

We define these sets almost identically to the bounded case. For compatibility with their bounded counterparts, we inherit the base \(\beta \) and precision p:

Definition 4.4

(Unbounded floating-point numbers) We define the following sets:

$$\begin{aligned} \mathbb {F}_\infty ^*&= \{ M \beta ^q \mid M, q \in \mathbb {Z}, 0< {|M|} < \beta ^p \},\\ \mathbb {F}_\infty&= \mathbb {F}_\infty ^* \cup \{ 0 \},\\ \overline{\mathbb {F}}_\infty&= \mathbb {F}_\infty \cup \{ -\infty , +\infty \}. \end{aligned}$$

Remark

Note that the unbounded sets are strict supersets of their bounded counterparts. That is, we have \(\mathbb {F}^* \subset \mathbb {F}_\infty ^*\), \(\mathbb {F}\subset \mathbb {F}_\infty \), and \({\overline{\mathbb {F}}} \subset \overline{\mathbb {F}}_\infty \).

The unbounded floating-point numbers satisfy the following properties for all \(x \in \overline{\mathbb {F}}_\infty \) and \(n \in \mathbb {Z}\):

$$\begin{aligned} \text {if } {|n|} \le \beta ^p, \text {then } n&\in \mathbb {F}_\infty \end{aligned}$$
(54)
$$\begin{aligned} \text {if } \beta ^{e_{\textrm{min}}}\le {|x|} \le \max \mathbb {F}, \text { then } x&\in \mathbb {F}^* \end{aligned}$$
(55)
$$\begin{aligned} x\beta ^n&\in \overline{\mathbb {F}}_\infty{} & {} \text {(scaling by powers of }{\beta )}\end{aligned}$$
(56)
$$\begin{aligned} \text {if } x \in \mathbb {F}_\infty ^*, \text { then } x/{{{\,\textrm{ulp}\,}}(x)}&\in \mathbb {Z}{} & {} \text {(integral significand)} \end{aligned}$$
(57)

Proving the existence of a unique upward or downward rounding takes a little more work in the unbounded case, since the sets are infinite, so their subsets might not have a minimum or maximum. For instance, the set \((0, 1] \cap \overline{\mathbb {F}}_\infty \) has no minimum, since it contains every \(\beta ^{-n}\) where \(n \ge 0\). However, such infinite subsets only occur around zero; away from zero, every bounded set of unbounded floating-point numbers is finite:

Lemma 4.5

Let \(a, b \in \mathbb {R}\). If \(0 \notin [a, b]\), then \([a, b] \cap \overline{\mathbb {F}}_\infty \) is a finite set.

Proof

Let \(S = [a, b] \cap \overline{\mathbb {F}}_\infty \). If S is empty, then the result follows trivially. Suppose S is nonempty instead and suppose that \(0 \notin [a, b]\). Then \(a \le b\), and a and b must be nonzero and have the same sign. We shall consider the case of each sign separately.

Suppose a and b are positive. Let \(e_a = \left\lfloor \log _\beta a\right\rfloor \) and \(e_b = \left\lfloor \log _\beta b\right\rfloor \), and let

$$\begin{aligned} T = \{ M \beta ^{e-p+1} \mid M, e \in \mathbb {Z}, 0< M < \beta ^p, e_a \le e \le e_b + p - 1 \}. \end{aligned}$$

Clearly, T is finite. Now, let \(s \in S\). Since s is positive, by definition we have \(s = M \beta ^q\) for some \(M, q \in \mathbb {Z}\) where \(0< M < \beta ^p\). Thus, to show that \(s \in T\), it suffices to show that \(e_a - p + 1 \le q \le e_b\). First,

$$\begin{aligned} \beta ^q \le M \beta ^q = s \le b < \beta ^{e_b + 1} \end{aligned}$$

so \(q < e_b + 1\) and hence \(q \le e_b\). Similarly,

$$\begin{aligned} \beta ^{e_a} \le a \le s = M\beta ^{q} < \beta ^{q + p} \end{aligned}$$

so \(e_a - p < q\) and hence \(e_a - p + 1 \le q\). Therefore \(s \in T\), and hence \(S \subseteq T\), and since T is finite, so is S.

Suppose instead a and b are negative. Then \(-b\) and \(-a\) are positive, so \(0 \notin [-b, -a]\), and thus \([-b, -a] \cap \overline{\mathbb {F}}_\infty \) is finite. Since \(\overline{\mathbb {F}}_\infty \) is closed under negation, we have \([-b, -a] \cap \overline{\mathbb {F}}_\infty = -([a, b] \cap \overline{\mathbb {F}}_\infty ) = -S\), and hence S is also finite. \(\square \)

Corollary 4.5.1

For any \(x \in {\overline{\mathbb {R}}}\), the sets \(\{ y \in \overline{\mathbb {F}}_\infty \mid y \le x \}\) and \(\{ y \in \overline{\mathbb {F}}_\infty \mid y \ge x \}\) have a minimum and a maximum.

We can now define downward and upward rounding over the unbounded floating-point numbers:

Definition 4.6

(Unbounded rounding) We define the functions \({{\,\textrm{RD}\,}}_\infty , {{\,\textrm{RU}\,}}_\infty :{\overline{\mathbb {R}}} \rightarrow \overline{\mathbb {F}}_\infty \) as follows:

$$\begin{aligned} {{\,\textrm{RD}\,}}_\infty (x)&= \max {\{ y \in \overline{\mathbb {F}}_\infty \mid y \le x \}},\\ {{\,\textrm{RU}\,}}_\infty (x)&= \min {\{ y \in \overline{\mathbb {F}}_\infty \mid y \ge x \}}. \end{aligned}$$

As expected, these functions are equal to their bounded counterparts over the range of the normal numbers:

Lemma 4.7

For all \(x \in {\overline{\mathbb {R}}}\), if \(\beta ^{e_{\textrm{min}}}\le {|x|} \le \max \mathbb {F}\), then \({{\,\textrm{RD}\,}}(x) = {{\,\textrm{RD}\,}}_\infty (x)\) and \({{\,\textrm{RU}\,}}(x) = {{\,\textrm{RU}\,}}_\infty (x)\).

We also have a similar gap property to Eq. 40:

Lemma 4.8

For all \(x \in {\overline{\mathbb {R}}}\), we have either \({{\,\textrm{RD}\,}}_\infty (x) = {{\,\textrm{RU}\,}}_\infty (x) = x\) or \({{\,\textrm{RU}\,}}_\infty (x) - {{\,\textrm{RD}\,}}_\infty (x) = {{\,\textrm{ulp}\,}}(x)\).

Note that the gap is always finite: since exponents are unbounded, every finite real rounds to a finite floating-point number, and only \(+\infty \) and \(-\infty \) round to themselves.
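These definitions are straightforward to simulate exactly. The sketch below (a toy decimal precision and helper names of our own choosing) implements \({{\,\textrm{RD}\,}}_\infty \) and \({{\,\textrm{RU}\,}}_\infty \) over the rationals and can be used to spot-check the gap property of Lemma 4.8:

```python
from fractions import Fraction
import math

# Unbounded-exponent rounding for a toy decimal precision (beta = 10, p = 2).
# The helper names are ours; RD_inf / RU_inf follow Definition 4.6.
BETA, P = 10, 2

def E(x):
    """The exponent of x != 0: the unique e with beta^e <= |x| < beta^(e+1)."""
    a, e = abs(Fraction(x)), 0
    while a >= BETA:
        a /= BETA
        e += 1
    while a < 1:
        a *= BETA
        e -= 1
    return e

def ulp(x):
    return Fraction(BETA) ** (E(x) - P + 1)

def RD_inf(x):
    """Largest unbounded float <= x, for nonzero rational x."""
    return ulp(x) * math.floor(Fraction(x) / ulp(x))

def RU_inf(x):
    """Smallest unbounded float >= x, for nonzero rational x."""
    return ulp(x) * math.ceil(Fraction(x) / ulp(x))
```

For every sampled x, either both roundings return x itself or they differ by exactly \({{\,\textrm{ulp}\,}}(x)\), as Lemma 4.8 states.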

4.1.3 Plausibility

In order to effectively translate our problem into the realm of unbounded floating-point numbers, we will also need a counterpart to our earlier concept of feasibility. Note that we keep the same rounding function: even though \({{\,\textrm{fl}\,}}\) is not surjective with respect to the unbounded floating-point numbers, it is compatible since it already maps every extended real to a bounded floating-point number.

Definition 4.9

(Plausibility) Given \(Z \subseteq \overline{\mathbb {F}}_\infty \) and \(x \in \overline{\mathbb {F}}_\infty \), we say that x is plausible for Z if and only if \({{\,\textrm{fl}\,}}(xy) \in Z\) for some \(y \in \overline{\mathbb {F}}_\infty \). We denote the set of all unbounded floating-point numbers plausible for Z by \({{\,\textrm{Plaus}\,}}(Z)\).

Remark

If \(Z \subseteq {\overline{\mathbb {F}}}\), then we immediately have \(Z \subseteq {{\,\textrm{Feas}\,}}(Z) \subseteq {{\,\textrm{Plaus}\,}}(Z)\).

Similarly to the sets of factors, we have the following closure property:

Lemma 4.10

For all \(Z \subseteq \overline{\mathbb {F}}_\infty \), the set \({{\,\textrm{Plaus}\,}}(Z)\) is closed under negation and scaling by powers of \(\beta \).

Due to exponents being unrestricted, we have the following:

Lemma 4.11

For all \(Z \subseteq \overline{\mathbb {F}}_\infty \), we have \(1 \in {{\,\textrm{Plaus}\,}}(Z)\) if and only if Z is nonempty.

Perhaps most importantly, we have an analogue to Lemma 3.6 (the proof is identical, mutatis mutandis):

Lemma 4.12

Let \(Z \subseteq {\overline{\mathbb {F}}}\) be a floating-point interval, and let \(x \in \mathbb {F}^*_\infty \) and \(z \in Z\). Then \(x \in {{\,\textrm{Plaus}\,}}(Z)\) if and only if either \({{\,\textrm{fl}\,}}(x {{\,\textrm{RD}\,}}_\infty (z/x)) \in Z\) or \({{\,\textrm{fl}\,}}(x {{\,\textrm{RU}\,}}_\infty (z/x)) \in Z\).

We also have an analogue to Lemma 3.9:

Lemma 4.13

Let \(Z \subseteq {\overline{\mathbb {F}}}\) be a floating-point interval, and let \(x \in \mathbb {F}^*_\infty \) and \(z \in Z\). Let \(m_x = x/{{\,\textrm{ulp}\,}}(x)\) and \(m_z = z/{{\,\textrm{ulp}\,}}(z)\). If \({{\,\textrm{diam}\,}}{{\,\textrm{fl}\,}}^{-1}[Z] \ge {{\,\textrm{ulp}\,}}(z)\) and \({|m_z|} \le {|m_x|}\), then x is plausible for Z.

As expected, feasibility and plausibility agree when the quotients are normal:

Lemma 4.14

Let \(Z \subseteq {\overline{\mathbb {F}}}\) be a floating-point interval, and let \(x \in \mathbb {F}^*\) and \(z \in Z\). If \(\beta ^{e_{\textrm{min}}}\le {|z/x|} \le \max \mathbb {F}\), then \(x \in {{\,\textrm{Feas}\,}}(Z)\) if and only if \(x \in {{\,\textrm{Plaus}\,}}(Z)\).

Proof

If \(x \in {{\,\textrm{Feas}\,}}(Z)\), then \(x \in {{\,\textrm{Plaus}\,}}(Z)\) trivially. Suppose \(x \in {{\,\textrm{Plaus}\,}}(Z)\) and \(\beta ^{e_{\textrm{min}}}\le {|z/x|} \le \max \mathbb {F}\). Then \({{\,\textrm{RD}\,}}(z/x) = {{\,\textrm{RD}\,}}_\infty (z/x)\) and \({{\,\textrm{RU}\,}}(z/x) = {{\,\textrm{RU}\,}}_\infty (z/x)\), so by Lemma 4.12, we have either

$$\begin{aligned} x \otimes {{\,\textrm{RD}\,}}(z/x)&= {{\,\textrm{fl}\,}}(x {{\,\textrm{RD}\,}}_\infty (z/x)) \in Z, \end{aligned}$$

or

$$\begin{aligned} x \otimes {{\,\textrm{RU}\,}}(z/x)&= {{\,\textrm{fl}\,}}(x {{\,\textrm{RU}\,}}_\infty (z/x)) \in Z, \end{aligned}$$

and therefore \(x \in {{\,\textrm{Feas}\,}}(Z)\) by definition. \(\square \)

Since exponents are unbounded, there is always a next and a previous plausible number. We will find it useful to have a function giving the smallest plausible number no less than a given lower bound. (Note that \({\overline{\mathbb {F}}}\) being finite means that we cannot have a similar total function for the feasible numbers.) First, we prove its existence:

Lemma 4.15

Let \(Z \subseteq {\overline{\mathbb {F}}}\) be a floating-point interval, and let \(x \in \mathbb {R}^*\). Let \(A = \{ y \in {{\,\textrm{Plaus}\,}}(Z) \mid y \ge x \}\). Then A is nonempty and has a minimum if and only if Z is nonempty.

Proof

If Z is empty, then \({{\,\textrm{Plaus}\,}}(Z)\) is trivially empty and hence so is A. Suppose instead that Z is nonempty, let \(z \in Z\), and let \(t = z/s\), where

$$\begin{aligned} s = {\left\{ \begin{array}{ll} \beta {{\,\textrm{ufp}\,}}(x) &{} \text {if } x > 0,\\ -{{\,\textrm{ufp}\,}}(x) &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$

Clearly, \(x \le s\). Now, since \({|s|}\) is a power of \(\beta \) and \(z \in \overline{\mathbb {F}}_\infty \), we have \(s, t \in \overline{\mathbb {F}}_\infty \). Therefore \({{\,\textrm{fl}\,}}(st) = {{\,\textrm{fl}\,}}(z) = z\), and hence \(s, t \in {{\,\textrm{Plaus}\,}}(Z)\) by definition and so \(s \in A\).

Now, let \(L = [x, s] \cap {{\,\textrm{Plaus}\,}}(Z)\) and \(L' = [x, s] \cap \overline{\mathbb {F}}_\infty \). Since x and s have the same sign by definition, by Lemma 4.5, it follows that \(L'\) is finite. Hence, since \(s \in L \subseteq L'\), it follows that L is also finite, and since it is nonempty, it has a minimum. Now, let \(R = (s, +\infty ] \cap {{\,\textrm{Plaus}\,}}(Z)\). Since \(\min L \le s\), it follows that \(\min L < r\) for all \(r \in R\). Thus, since \(A = L \cup R\), we have \(\min L \le a\) for all \(a \in A\), and hence \(\min A = \min L\). \(\square \)

With this result, we can now define our “round up to plausible” function:

Definition 4.16

For each nonempty \(Z \subseteq {\overline{\mathbb {F}}}\), we define the function \(\phi _{Z} :\mathbb {R}^* \rightarrow \overline{\mathbb {F}}_\infty \) by

$$\begin{aligned} \phi _{Z}(x) = \min {\{ y \in {{\,\textrm{Plaus}\,}}(Z) \mid y \ge x \}}. \end{aligned}$$

It is easy to see from the definition that these functions are nondecreasing:

Lemma 4.17

For all \(Z \subseteq {\overline{\mathbb {F}}}\) and \(x, y \in \mathbb {R}^*\), if \(x \le y\), then \(\phi _{Z}(x) \le \phi _{Z}(y)\).

Thus, since plausible numbers are a superset of feasible numbers, we have the following:

Lemma 4.18

Let \(Z \subseteq {\overline{\mathbb {F}}}\), \(x \in \mathbb {R}^*\) and \(y \in {{\,\textrm{Feas}\,}}(Z)\). Then, if \(x \le y\), then \(\phi _{Z}(x) \le y\).

That is, each of these functions gives a lower bound on the next feasible number.

4.2 Relating Feasibility, Plausibility, and Rounding Errors

We begin by giving expressions for rounding in terms of remainders. Note that the following result directly gives a rounding error term.

Lemma 4.19

For all \(x \in \mathbb {R}^*\), we have

$$\begin{aligned} {{\,\textrm{RD}\,}}_\infty (x)&= x - (x \bmod {{\,\textrm{ulp}\,}}(x)),\\ {{\,\textrm{RU}\,}}_\infty (x)&= x + (-x \bmod {{\,\textrm{ulp}\,}}(x)). \end{aligned}$$

Proof

Let \(x' = x - (x \bmod {{\,\textrm{ulp}\,}}(x))\), \(x'' = x + (-x \bmod {{\,\textrm{ulp}\,}}(x))\), and \(M = x/{{\,\textrm{ulp}\,}}(x)\). Then, rearranging using the definition of \(\bmod \), we obtain \(x' = \left\lfloor M\right\rfloor {{\,\textrm{ulp}\,}}(x)\) and \(x'' = \left\lceil M\right\rceil {{\,\textrm{ulp}\,}}(x)\), and thus \(x' \le x \le x''\). Since \(\beta ^{p-1} \le {|M|} < \beta ^p\), it follows that \(\beta ^{p-1} \le {|\left\lfloor M\right\rfloor |} \le \beta ^p\) and \(\beta ^{p-1} \le {|\left\lceil M\right\rceil |} \le \beta ^p\). Thus \(\left\lfloor M\right\rfloor , \left\lceil M\right\rceil \in \mathbb {Z}\), and since \({{\,\textrm{ulp}\,}}(x)\) is a power of \(\beta \), we have \(x', x'' \in \mathbb {F}_\infty \). Thus, by the definitions of \({{\,\textrm{RD}\,}}_\infty \) and \({{\,\textrm{RU}\,}}_\infty \),

$$\begin{aligned} x' \le {{\,\textrm{RD}\,}}_\infty (x) \le x \le {{\,\textrm{RU}\,}}_\infty (x) \le x''. \end{aligned}$$

Now, if \(M \in \mathbb {Z}\), then \(x' = x''\) and hence the result follows immediately. Suppose \(M \notin \mathbb {Z}\) instead. Then \(x \notin \mathbb {F}_\infty \), and so \({{\,\textrm{RU}\,}}_\infty (x) - {{\,\textrm{RD}\,}}_\infty (x) = {{\,\textrm{ulp}\,}}(x) = x'' - x'\). Therefore,

$$\begin{aligned} x' \le {{\,\textrm{RD}\,}}_\infty (x) \le x \le {{\,\textrm{RD}\,}}_\infty (x) + {{\,\textrm{ulp}\,}}(x) \le x' + {{\,\textrm{ulp}\,}}(x), \end{aligned}$$

and hence \(x' = {{\,\textrm{RD}\,}}_\infty (x)\) and \(x'' = {{\,\textrm{RU}\,}}_\infty (x)\), as desired. \(\square \)
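These identities are easy to check computationally. The following sketch is our own illustration, not part of the formal development: it models \(\mathbb {F}_\infty \) with \(\beta = 2\) and a toy precision \(p = 4\) using exact rational arithmetic, and the helper names `ufp`, `ulp`, `rd`, and `ru` are ours.

```python
from fractions import Fraction as F

BETA, P = 2, 4  # toy format: base 2, precision 4, unbounded exponent

def ufp(x):
    """Unit in the first place: largest power of BETA not exceeding |x| (x != 0)."""
    a, u = abs(x), F(1)
    while a >= BETA:
        a, u = a / BETA, u * BETA
    while a < 1:
        a, u = a * BETA, u / BETA
    return u

def ulp(x):
    """Unit in the last place: ufp(x) * BETA**(1 - P)."""
    return ufp(x) * F(BETA) ** (1 - P)

def mod(a, n):
    """a mod n = a - n * floor(a/n), as used throughout the paper."""
    return a - n * (a // n)

def rd(x):
    """Round toward -infinity onto the toy format (nonzero x)."""
    return (x // ulp(x)) * ulp(x)

def ru(x):
    """Round toward +infinity onto the toy format (nonzero x)."""
    return -rd(-x)

# Lemma 4.19: RD(x) = x - (x mod ulp(x)) and RU(x) = x + (-x mod ulp(x)).
for num in range(-50, 51):
    for den in (7, 16, 23):
        if num == 0:
            continue
        x = F(num, den)
        assert rd(x) == x - mod(x, ulp(x))
        assert ru(x) == x + mod(-x, ulp(x))
        assert rd(x) <= x <= ru(x)
```

Exact rationals avoid any rounding noise in the check itself; Python's floor division on `Fraction` values computes exactly the \(\left\lfloor \cdot \right\rfloor \) used in the proof.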

The above result allows us to manipulate the rounding error using the various laws obeyed by the mod operator. The following result gives us an expression for the round-trip error between any two consecutive plausible numbers.

Lemma 4.20

Let \(Z \subseteq {\overline{\mathbb {F}}}\) and \(z \in Z\). Let \(x \in \mathbb {R}^*\). If \(z \in \mathbb {F}^*\), then for all \(t \in [x, \phi _{Z}(x)]\), we have \({{\,\textrm{ufp}\,}}(x) \le {|t|} \le \beta {{\,\textrm{ufp}\,}}(x)\) and

$$\begin{aligned} t {{\,\textrm{RD}\,}}_\infty (z/t)&= z - (z \bmod t {{\,\textrm{ulp}\,}}(z/x)),\\ t {{\,\textrm{RU}\,}}_\infty (z/t)&= z + (-z \bmod t {{\,\textrm{ulp}\,}}(z/x)). \end{aligned}$$

Proof

Suppose \(z \in \mathbb {F}^*\) and let \(t \in [x, \phi _{Z}(x)]\). Let \(m_x = x/{{\,\textrm{ufp}\,}}(x)\), \(m_z = z/{{\,\textrm{ufp}\,}}(z)\), and \(m_t = t/{{\,\textrm{ufp}\,}}(t)\). By Lemma 4.19 and the distributivity of multiplication over \(\bmod \),

$$\begin{aligned} t {{\,\textrm{RD}\,}}_\infty (z/t)&= z - (z \bmod t {{\,\textrm{ulp}\,}}(z/t)),\\ t {{\,\textrm{RU}\,}}_\infty (z/t)&= z + (-z \bmod t {{\,\textrm{ulp}\,}}(z/t)), \end{aligned}$$

so for the equalities it suffices to show that \({{\,\textrm{ulp}\,}}(z/t) = {{\,\textrm{ulp}\,}}(z/x)\), or equivalently, that \({{\,\textrm{ufp}\,}}(z/t) = {{\,\textrm{ufp}\,}}(z/x)\). Note that if \({{\,\textrm{ufp}\,}}(x) = {{\,\textrm{ufp}\,}}(t)\) and \({{\,\textrm{ufp}\,}}(m_z/m_x) = {{\,\textrm{ufp}\,}}(m_z/m_t)\), then by Lemma 2.12, we have

$$\begin{aligned} {{\,\textrm{ufp}\,}}(z/t)&= {{\,\textrm{ufp}\,}}(m_z/m_t){{\,\textrm{ufp}\,}}(z)/{{\,\textrm{ufp}\,}}(t)\\&= {{\,\textrm{ufp}\,}}(m_z/m_x){{\,\textrm{ufp}\,}}(z)/{{\,\textrm{ufp}\,}}(x)\\&= {{\,\textrm{ufp}\,}}(z/x). \end{aligned}$$

Also note that if \({{\,\textrm{ufp}\,}}(t) = {{\,\textrm{ufp}\,}}(x)\), then \({{\,\textrm{ufp}\,}}(x) \le {|t|} < \beta {{\,\textrm{ufp}\,}}(t)\) immediately.

We proceed by cases. In the following, note that by Lemma 4.10, every power of \(\beta \) and every scaling of z by a power of \(\beta \) is plausible for Z.

  • Suppose \({|m_z|} < m_x\). Then by Lemma 2.12, we have \({{\,\textrm{ufp}\,}}(m_z/m_x) = 1/\beta \). Then, since \(x < \beta {{\,\textrm{ufp}\,}}(x) \in {{\,\textrm{Plaus}\,}}(Z)\), we have \(t \le \beta {{\,\textrm{ufp}\,}}(x)\).

    • Suppose \(t < \beta {{\,\textrm{ufp}\,}}(x)\). Then, since \({{\,\textrm{ufp}\,}}(x) \le x \le t\), we have \({{\,\textrm{ufp}\,}}(t) = {{\,\textrm{ufp}\,}}(x)\) and hence \({|m_z|} < m_x \le m_t\). Therefore, by Lemma 2.12, we have \({{\,\textrm{ufp}\,}}(m_z/m_t) = 1/\beta \) as well, as desired.

    • Suppose \(t = \beta {{\,\textrm{ufp}\,}}(x)\) instead. Then,

      $$\begin{aligned} {{\,\textrm{ufp}\,}}(z/t)&= {{\,\textrm{ufp}\,}}\left( \frac{z}{\beta {{\,\textrm{ufp}\,}}(x)}\right) \\&= \frac{1}{\beta } \cdot \frac{{{\,\textrm{ufp}\,}}(z)}{{{\,\textrm{ufp}\,}}(x)}\\&= {{\,\textrm{ufp}\,}}(m_z/m_x) {{\,\textrm{ufp}\,}}(z)/{{\,\textrm{ufp}\,}}(x)\\&= {{\,\textrm{ufp}\,}}(z/x). \end{aligned}$$
  • Suppose \(0 < m_x \le {|m_z|}\) instead. Then, since \({|m_z|} {{\,\textrm{ufp}\,}}(x) \in {{\,\textrm{Plaus}\,}}(Z)\), we have

    $$\begin{aligned} x \le t \le {|m_z|} {{\,\textrm{ufp}\,}}(x) < \beta {{\,\textrm{ufp}\,}}(x). \end{aligned}$$

    Thus \({{\,\textrm{ufp}\,}}(t) = {{\,\textrm{ufp}\,}}(x)\) and hence

    $$\begin{aligned} m_x \le m_t = t/{{\,\textrm{ufp}\,}}(x) \le {|m_z|}. \end{aligned}$$

    Thus by Lemma 2.12, we also have \({{\,\textrm{ufp}\,}}(m_z/m_x) = {{\,\textrm{ufp}\,}}(m_z/m_t) = 1\), as desired.

  • Suppose \(-{|m_z|} \le m_x < 0\) instead. Then, since \(-{{\,\textrm{ufp}\,}}(x) \in {{\,\textrm{Plaus}\,}}(Z)\), we have

    $$\begin{aligned} x \le t \le -{{\,\textrm{ufp}\,}}(x), \end{aligned}$$

    so \({{\,\textrm{ufp}\,}}(t) = {{\,\textrm{ufp}\,}}(x)\) and hence

    $$\begin{aligned} -{|m_z|} \le m_x \le m_t \le -1. \end{aligned}$$

    Therefore, by Lemma 2.12, we have \({{\,\textrm{ufp}\,}}(m_z/m_x) = {{\,\textrm{ufp}\,}}(m_z/m_t) = 1\), as desired.

  • Suppose \(m_x < -{|m_z|}\) instead. Then \({{\,\textrm{ufp}\,}}(m_z/m_x) = 1/\beta \) by Lemma 2.12, and since \(-{|m_z|} {{\,\textrm{ufp}\,}}(x) \in {{\,\textrm{Plaus}\,}}(Z)\), we have

    $$\begin{aligned} x \le t \le -{|m_z|} {{\,\textrm{ufp}\,}}(x), \end{aligned}$$

    and thus \({{\,\textrm{ufp}\,}}(x) = {{\,\textrm{ufp}\,}}(t)\).

    • Suppose \(t < -{|m_z|} {{\,\textrm{ufp}\,}}(x)\). Then

      $$\begin{aligned} {{\,\textrm{ufp}\,}}(m_z/m_x) = {{\,\textrm{ufp}\,}}(m_z/m_t) = 1/\beta , \end{aligned}$$

      and hence \({{\,\textrm{ufp}\,}}(z/t) = {{\,\textrm{ufp}\,}}(z/x)\), as desired.

    • Suppose \(t = -{|m_z|} {{\,\textrm{ufp}\,}}(x)\) instead. Then,

$$\begin{aligned} t {{\,\textrm{ulp}\,}}(z/x)&= -{|m_z|} {{\,\textrm{ufp}\,}}(x) {{\,\textrm{ulp}\,}}(z/x)\\&= -{|m_z|} {{\,\textrm{ufp}\,}}(x) {{\,\textrm{ufp}\,}}(m_z/m_x) \frac{{{\,\textrm{ufp}\,}}(z)}{{{\,\textrm{ufp}\,}}(x)} \beta ^{1-p} \\&= -{|m_z|}{{\,\textrm{ufp}\,}}(z)\beta ^{-p}\\&= -\frac{{|z|}}{\beta ^{p}}, \end{aligned}$$

      hence z is an integer multiple of \(t {{\,\textrm{ulp}\,}}(z/x)\). Thus

      $$\begin{aligned} (z \bmod t {{\,\textrm{ulp}\,}}(z/x)) = (-z \bmod t {{\,\textrm{ulp}\,}}(z/x)) = 0, \end{aligned}$$

      and so the rounding error is zero. Similarly, since

      $$\begin{aligned} z/t = \frac{z}{-{|m_z|} {{\,\textrm{ufp}\,}}(x)} = -{{\,\textrm{sgn}\,}}(z)\frac{{{\,\textrm{ufp}\,}}(z)}{{{\,\textrm{ufp}\,}}(x)}, \end{aligned}$$

      it is a (possibly negated) power of \(\beta \), and thus in \(\mathbb {F}^*_\infty \). Therefore,

      $$\begin{aligned} t {{\,\textrm{RD}\,}}_\infty (z/t) = t {{\,\textrm{RU}\,}}_\infty (z/t) = t \cdot \frac{z}{t} = z \end{aligned}$$

      exactly, and hence the result follows.

We have thus proven the result in all cases, so we are finished. \(\square \)

The following lemma constrains the feasibility of tiny factors. Essentially, if an especially small factor is feasible, then so is any larger one, up to a limit. This result will be useful in determining whether there are any floating-point factors around zero. Although we cannot easily find subnormal factors, we can often rule them out when none exist.

Lemma 4.21

Let \(Z \subseteq {\overline{\mathbb {F}}}\) be a floating-point interval, \(x, y \in \mathbb {F}^*\), and \(z \in Z\). If \(y \in {{\,\textrm{Feas}\,}}(Z)\) and \({|y|} \le {|x|} \le {|z|}/\max \mathbb {F}\), then \(x \in {{\,\textrm{Feas}\,}}(Z)\).

Proof

If \({|x|} = {|z|}/\max \mathbb {F}\) exactly, then \({|x|} \otimes \max \mathbb {F}= {|z|}\) and hence \(x \in {{\,\textrm{Feas}\,}}(Z)\) by Lemma 3.2. Suppose \(y \in {{\,\textrm{Feas}\,}}(Z)\) and \({|y|} \le {|x|} < {|z/\max \mathbb {F}|}\) instead. Then \({|z/x|} > \max \mathbb {F}\) and hence z is nonzero. We first prove the result for positive x and y, before then generalizing to all cases. In the following, note that by Lemma 3.6, we have either \(y \otimes {{\,\textrm{RU}\,}}(z/y) \in Z\) or \(y \otimes {{\,\textrm{RD}\,}}(z/y) \in Z\).

Suppose \(z > 0\). Since \({|z/x|} > \max \mathbb {F}\) and \(0 < y \le x\) by assumption, we have \({{\,\textrm{RD}\,}}(z/y) = {{\,\textrm{RD}\,}}(z/x) = \max \mathbb {F}\) and \({{\,\textrm{RU}\,}}(z/y) = +\infty \). Therefore, if \(y \otimes {{\,\textrm{RU}\,}}(z/y) = +\infty \in Z\), then \(x \otimes +\infty = +\infty \in Z\) also, and hence x is feasible for Z. If \(y \otimes {{\,\textrm{RD}\,}}(z/y) \in Z\) instead, then, since \(y \le x\), multiplying by \({{\,\textrm{RD}\,}}(z/y)\) gives,

$$\begin{aligned} \min Z \le y \otimes {{\,\textrm{RD}\,}}(z/y) \le x \otimes {{\,\textrm{RD}\,}}(z/y) = x \otimes {{\,\textrm{RD}\,}}(z/x) \le z \le \max Z, \end{aligned}$$

and since Z is a floating-point interval, it follows that \(x \otimes {{\,\textrm{RD}\,}}(z/x) \in Z\). Therefore x is feasible for Z.

Suppose \(z < 0\) instead. Then, we similarly have \({{\,\textrm{RD}\,}}(z/y) = -\infty \) and \({{\,\textrm{RU}\,}}(z/y) = {{\,\textrm{RU}\,}}(z/x) = \min \mathbb {F}\). Hence, if \(y \otimes {{\,\textrm{RD}\,}}(z/y) = -\infty \in Z\), then \(x \otimes -\infty = -\infty \in Z\), and so x is feasible for Z. If \(y \otimes {{\,\textrm{RU}\,}}(z/y) \in Z\) instead, then since \(y \le x\), multiplying by \({{\,\textrm{RU}\,}}(z/y)\) gives

$$\begin{aligned} \max Z \ge y \otimes {{\,\textrm{RU}\,}}(z/y) \ge x \otimes {{\,\textrm{RU}\,}}(z/y) = x \otimes {{\,\textrm{RU}\,}}(z/x) \ge z \ge \min Z, \end{aligned}$$

and since Z is a floating-point interval, it follows that \(x \otimes {{\,\textrm{RU}\,}}(z/x) \in Z\). Therefore x is feasible for Z.

We have thus shown the result for positive x and y. Now, suppose instead that y is negative (x can have any sign). Then \({|y|}\) is feasible for Z by Lemma 3.2, and thus \({|x|}\) is feasible for Z by the earlier result. Therefore \({|x|}\) is feasible for Z regardless of the sign of y, and hence by Lemma 3.2, so is x. \(\square \)

The following theorem allows us to reduce the question of finding feasible numbers to that of finding plausible numbers, when applicable:

Theorem 4.22

Let \(Z \subseteq {\overline{\mathbb {F}}}\) be a floating-point interval, and let \(x \in \mathbb {F}^*\) and \(z \in Z\). Let

$$\begin{aligned} A&= \{ y \in {{\,\textrm{Feas}\,}}(Z) \mid y \ge x \},\\ B&= \{ y \in {{\,\textrm{Plaus}\,}}(Z) \mid y \ge x \},\\ C&= \{ y \in {{\,\textrm{Plaus}\,}}(Z) \mid y \ge {|x|} \}. \end{aligned}$$

If x is normal and \(\beta ^{e_{\textrm{min}}}\le {|z/x|} \le \max \mathbb {F}\), then either A is empty or \(\min A \in \{ \min B, \min C \}\).

Proof

Suppose x is normal, \(\beta ^{e_{\textrm{min}}}\le {|z/x|} \le \max \mathbb {F}\), and A is nonempty. Let \(a = \min A\), \(b = \min B\), and \(c = \min C\). Since \({{\,\textrm{Feas}\,}}(Z) \subseteq {{\,\textrm{Plaus}\,}}(Z)\), we have \(a \in B\) and thus \(x \le b \le a\).

We first show that if \(c \le \max \mathbb {F}\), then \(c \in {{\,\textrm{Feas}\,}}(Z)\). Let \(t = z/\beta ^{E(z/x)}\). By Lemma 4.10, since \({|z|} \in {{\,\textrm{Plaus}\,}}(Z)\), we have \({|t|} \in {{\,\textrm{Plaus}\,}}(Z)\). Now, since \({|z/x|} \ge \beta ^{e_{\textrm{min}}}\), we have \({|z/x|} \ge \beta ^{E(z/x)}\). Taking reciprocals and multiplying by \({|z|}\), we obtain \({|x|} \le {|z|}/\beta ^{E(z/x)} = {|t|}\) and therefore \({|t|} \in C\). Thus we obtain \({|x|} \le c \le {|t|}\), and taking reciprocals and multiplying by \({|z|}\) again, it follows that

$$\begin{aligned} \max \mathbb {F}\ge {|z/x|} \ge {|z/c|}&\ge {|z/t|} = \beta ^{E(z/x)} \ge \beta ^{e_{\textrm{min}}}. \end{aligned}$$

Therefore, by Lemma 4.14, since \(\beta ^{e_{\textrm{min}}}\le {|x|} \le c\), if \(c \le \max \mathbb {F}\), then \(c \in \mathbb {F}^*\) and thus \(c \in {{\,\textrm{Feas}\,}}(Z)\).

Next, we dispose of two trivial cases:

  • Suppose \(0 \in {{\,\textrm{Feas}\,}}(Z)\). Then \(0 \otimes y \in Z\) for some \(y \in \mathbb {F}\) and hence \(0 \in Z\). Thus \(0 \otimes x = 0 \in Z\), and so \(x \in {{\,\textrm{Feas}\,}}(Z)\). Therefore \(a \le x\), and hence \(x = a = b\).

  • Suppose \(+\infty \in {{\,\textrm{Feas}\,}}(Z)\). Then \(+\infty \otimes y \in Z\) for some nonzero \(y \in {\overline{\mathbb {F}}}\). If \(y > 0\), then \(+\infty \otimes y = +\infty = +\infty \otimes x\), and so \(x \in {{\,\textrm{Feas}\,}}(Z)\). If \(y < 0\), then \(+\infty \otimes y = -\infty = -\infty \otimes x\), and so \(x \in {{\,\textrm{Feas}\,}}(Z)\) again. Thus \(a \le x\) and so \(x = a = b\).

We can now proceed to the proof of the main statement. Suppose \(0 \notin {{\,\textrm{Feas}\,}}(Z)\) and \(+\infty \notin {{\,\textrm{Feas}\,}}(Z)\) instead. Then \(a < +\infty \), and so \(a \le \max \mathbb {F}\). We shall divide the problem into two cases, depending on the sign of x.

Suppose \(x > 0\). Then \(B = C\), and hence \(b = c \le a \le \max \mathbb {F}\). Therefore \(c \in {{\,\textrm{Feas}\,}}(Z)\) by our earlier result, so \(a \le c\) also, and hence \(a = b = c\), as desired.

Suppose \(x < 0\) instead. Then, since x is normal, we have \(x \le -\beta ^{e_{\textrm{min}}}\). Since \(-\beta ^{e_{\textrm{min}}}\in {{\,\textrm{Plaus}\,}}(Z)\) by Lemma 4.10, we have \(-\beta ^{e_{\textrm{min}}}\in B\) and thus \(x \le b \le -\beta ^{e_{\textrm{min}}}\). Therefore \(\beta ^{e_{\textrm{min}}}\le {|b|} \le {|x|} \le \max \mathbb {F}\) and hence \(b \in \mathbb {F}^*\) and \(\beta ^{e_{\textrm{min}}}\le {|z/x|} \le {|z/b|}\). Thus, if \({|z/b|} \le \max \mathbb {F}\), then \(b \in {{\,\textrm{Feas}\,}}(Z)\) by Lemma 4.14 and hence \(a = b\). Suppose \({|z/b|} > \max \mathbb {F}\) instead. Then \({|b|} < {|z|}/\max \mathbb {F}\). We further divide this into two subcases, depending on the sign of a. Note that \(a \ne 0\) since \(0 \notin {{\,\textrm{Feas}\,}}(Z)\) by assumption.

  • Suppose a is negative. Then \(b \le a < 0\), and thus \(0< {|a|} \le {|b|} < {|z|}/\max \mathbb {F}\). Since \(a \in {{\,\textrm{Feas}\,}}(Z)\) by definition, Lemma 4.21 implies that \(b \in {{\,\textrm{Feas}\,}}(Z)\) also and thus \(a = b\).

  • Suppose a is positive. Then \([x, 0] \cap {{\,\textrm{Feas}\,}}(Z)\) is empty, and since \({{\,\textrm{Feas}\,}}(Z)\) is closed under negation, it follows that \(a > {|x|}\). Thus, since \(a \in {{\,\textrm{Plaus}\,}}(Z)\), we have \(a \in C\) and hence \(c \le a \le \max \mathbb {F}\). Therefore \(c \in {{\,\textrm{Feas}\,}}(Z)\) by our earlier result, so \(a \le c\) and therefore \(a = c\), as desired.

Therefore the result holds in all cases, and we are done. \(\square \)

We now give the main result of this section. The following theorem reduces the problem to an optimization problem over the integral significands of plausible numbers.

Theorem 4.23

Let \(Z \subseteq {\overline{\mathbb {F}}}\) be a floating-point interval and \(z \in Z \cap \mathbb {F}^*\). Then, for all \(x \in \mathbb {F}^*_\infty \), we have \(\phi _{Z}(x) = M {{\,\textrm{ulp}\,}}(x)\) where

$$\begin{aligned} M&= \min {\{ N \in A \cup B \mid N \ge M_x \}},\\ A&= \{ N \in \mathbb {Z}^* \mid -(M_z \beta ^{p-k} \bmod N) \in I \beta ^{p-k} \},\\ B&= \{ N \in \mathbb {Z}^* \mid (-M_z \beta ^{p-k} \bmod N) \in I \beta ^{p-k} \},\\ I&= \frac{{{\,\textrm{fl}\,}}^{-1}[Z] - z}{{{\,\textrm{ulp}\,}}(z)},\\ k&= {\left\{ \begin{array}{ll} 0 &{} \text {if } {|M_x|} > {|M_z|},\\ 1 &{} \text {otherwise,} \end{array}\right. }\\ M_x&= \frac{x}{{{\,\textrm{ulp}\,}}(x)},\\ M_z&= \frac{z}{{{\,\textrm{ulp}\,}}(z)}. \end{aligned}$$

Proof

Let \(y = \phi _{Z}(x)\) and \(M_y = y/{{\,\textrm{ulp}\,}}(x)\). Since y is plausible by definition, we have either \({{\,\textrm{fl}\,}}(y {{\,\textrm{RD}\,}}_\infty (z/y)) \in Z\) or \({{\,\textrm{fl}\,}}(y {{\,\textrm{RU}\,}}_\infty (z/y)) \in Z\) by Lemma 4.12. Therefore, by Lemma 4.20, we have either

$$\begin{aligned} z - (z \bmod y {{\,\textrm{ulp}\,}}(z/x))&\in {{\,\textrm{fl}\,}}^{-1}[Z], \end{aligned}$$

or

$$\begin{aligned} z + (-z \bmod y {{\,\textrm{ulp}\,}}(z/x))&\in {{\,\textrm{fl}\,}}^{-1}[Z]. \end{aligned}$$

Since \({{\,\textrm{ulp}\,}}(M_z/M_x) = \beta ^{k - p}\) by Lemma 2.12, subtracting z and then dividing by \({{\,\textrm{ulp}\,}}(z)\cdot {{\,\textrm{ulp}\,}}(M_z/M_x)\) gives us either

$$\begin{aligned} -(M_z \beta ^{p-k} \bmod M_y)&\in I \beta ^{p-k}, \end{aligned}$$

or

$$\begin{aligned} (-M_z \beta ^{p-k} \bmod M_y)&\in I \beta ^{p-k}. \end{aligned}$$

By Lemma 4.20, we have \({{\,\textrm{ufp}\,}}(x) \le {|y|} \le \beta {{\,\textrm{ufp}\,}}(x)\) and thus \(\beta ^{p-1} \le {|M_y|} \le \beta ^p\). If \({|M_y|} < \beta ^p\), then \({{\,\textrm{ufp}\,}}(y) = {{\,\textrm{ufp}\,}}(x)\), and hence \(M_y\) is an integer. If \({|M_y|} = \beta ^p\) instead, then \(M_y\) is again an integer, and thus \(M_y \in A \cup B\) in all cases. Hence, since \(x \le y\) and thus \(M_x \le M_y\) by definition, we obtain \(M_x \le M \le M_y\). Hence \(\beta ^{p-1} \le {|M|} \le \beta ^p\) and \(x \le M {{\,\textrm{ulp}\,}}(x) \le y\), and therefore \(M {{\,\textrm{ulp}\,}}(x) \in \mathbb {F}^*_\infty \). Then, by Lemma 4.20, we have

$$\begin{aligned} t {{\,\textrm{RD}\,}}_\infty (z/t)&= z - (z \bmod t {{\,\textrm{ulp}\,}}(z/x)),\\ t {{\,\textrm{RU}\,}}_\infty (z/t)&= z + (-z \bmod t {{\,\textrm{ulp}\,}}(z/x)), \end{aligned}$$

where \(t = M {{\,\textrm{ulp}\,}}(x)\).

Next we prove that \(t \in {{\,\textrm{Plaus}\,}}(Z)\). Substituting the definition of I into the definitions of A and B and rearranging, we obtain

$$\begin{aligned} A&= \{ N \in \mathbb {Z}^* \mid z - (z \bmod N {{\,\textrm{ulp}\,}}(z) \beta ^{k-p}) \in {{\,\textrm{fl}\,}}^{-1}[Z] \},\\ B&= \{ N \in \mathbb {Z}^* \mid z + (-z \bmod N {{\,\textrm{ulp}\,}}(z) \beta ^{k-p}) \in {{\,\textrm{fl}\,}}^{-1}[Z] \}. \end{aligned}$$

Now, since \(\beta ^{k-p} = {{\,\textrm{ulp}\,}}(M_z/M_x)\), we have

$$\begin{aligned} M {{\,\textrm{ulp}\,}}(z) \beta ^{k-p}&= M {{\,\textrm{ulp}\,}}(z) {{\,\textrm{ulp}\,}}(M_z/M_x)\\&= M {{\,\textrm{ulp}\,}}(z/M_x)\\&= M {{\,\textrm{ulp}\,}}(x) {{\,\textrm{ulp}\,}}(z/x)\\&= t {{\,\textrm{ulp}\,}}(z/x). \end{aligned}$$

Hence, since \(M \in A \cup B\), we have either

$$\begin{aligned} z - (z \bmod t {{\,\textrm{ulp}\,}}(z/x))&= t {{\,\textrm{RD}\,}}_\infty (z/t) \in {{\,\textrm{fl}\,}}^{-1}[Z], \end{aligned}$$

or

$$\begin{aligned} z + (-z \bmod t {{\,\textrm{ulp}\,}}(z/x))&= t {{\,\textrm{RU}\,}}_\infty (z/t) \in {{\,\textrm{fl}\,}}^{-1}[Z]. \end{aligned}$$

Since \(t \in \mathbb {F}^*_\infty \), we thus have \(t \in {{\,\textrm{Plaus}\,}}(Z)\). Therefore \(y \le t\) by definition, and hence \(y = t = M {{\,\textrm{ulp}\,}}(x)\), as desired. \(\square \)

By transforming the problem of finding plausible numbers into an integer optimization problem, we make it much more straightforward to analyze. Even so, the optimization problem presented in Theorem 4.23 is unusual, since what is being optimized is a remainder. However, as we saw earlier in Fig. 2, there are patterns in the round-trip error. In the next section, we will make use of them to bound the error and thus solve the optimization problem efficiently.

5 Minimizing Remainders of a Constant Divided by a Variable

In the previous section, we reduced the problem of finding factors to an optimization problem over remainders where the dividend is constant but the divisor is not. The underlying problem is not specific to the floating-point numbers, and we can state it generally as follows:

Problem 3

Given a rational x and an interval I containing zero, what are the integers n such that \((x \bmod n) \in I\)? Specifically, given a fixed integer m, what is the least (or greatest) such n where \(n \ge m\) (or \(n \le m\))?

Naively, this problem can be solved by exhaustive search on the divisors in at most \({|x|} + 1\) trials, but this is clearly impractical. We do not have a provably efficient solution to the general problem, and it may well be the case that one does not exist at all.Footnote 14 However, the specific case we are interested in can be solved in a constant number of arithmetic operations.
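As a baseline, the exhaustive search might be sketched as follows (our own illustration; the function name and the bracket representation of I are assumptions). Python's `%` agrees with the \(\bmod \) used here when the divisor is positive.

```python
def naive_least_divisor(x: int, lo: int, hi: int, m: int):
    """Least positive n >= m with lo <= (x mod n) <= hi, found by trying
    n = m, m + 1, ..., |x| + 1 in turn; returns None if none is found there.
    (For n > |x| + 1 the remainder is simply x when x >= 0, or x + n when
    x < 0, so those larger divisors can be checked directly instead.)
    """
    for n in range(max(m, 1), abs(x) + 2):
        if lo <= x % n <= hi:
            return n
    return None
```

For example, `naive_least_divisor(-1000, 1, 3, 2)` skips \(n = 2\) (remainder 0) and returns \(n = 3\) (remainder 2).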

Fig. 3

Graphs of \((-1000 \bmod n)\) and \(\left\lfloor -1000/n\right\rfloor \), respectively above and below the x-axis. The upper graph follows a parabola whenever the lower graph behaves linearly (e.g. around \(n = 30\)). Similarly, linear segments in the upper graph correspond to constant segments in the lower (e.g. near \(n= 80\))

To better grasp the problem, let us first observe how the remainder varies with the divisor. Plotting \((-1000 \bmod n)\) in Fig. 3, we see parabolas near \(n = 30\), as well as between \(n = 20\) and \(n = 25\). After \(n = 40\), a pattern of “stepped” parabolas emerges, with an increasing number of “steps” later on. Finally, the parabolas start disappearing entirely around \(n = 60\) and give way to a linear pattern. As one might expect from rounding being expressible in terms of remainder, these patterns are also visible in Fig. 2.

This behavior may seem an unusual coincidence at first, but it occurs for any choice of numerator. Algebraically, the reason is remarkably simple. By the definition of the \(\bmod \) operator, we have

$$\begin{aligned} (-1000 \bmod n) = -1000 - n \left\lfloor -1000/n\right\rfloor . \end{aligned}$$

Thus, if \(\left\lfloor -1000/n\right\rfloor \) is constant over a certain interval, then \((-1000 \bmod n)\) is linear in n over that interval. If \(\left\lfloor -1000/n\right\rfloor \) changes linearly, then \((-1000 \bmod n)\) is instead quadratic in n over that interval. This relationship can be clearly seen in Fig. 3, comparing the plots above and below the x-axis. The “stepped parabolas” correspond to values of n where \(-1000/n\) is close to linear with slope less than 1 in magnitude, so that \(\left\lfloor -1000/n\right\rfloor \) does not change at every step.

More formally, consider integers a, n, i such that n and \(n + i\) are nonzero, and let \(q = \left\lfloor a/n\right\rfloor \):

  1. Constant quotient

    If \(\left\lfloor a/(n+i)\right\rfloor = q\), then

    $$\begin{aligned} (a \bmod (n + i))&= a - (n+i)q\\&= (a \bmod n) - iq. \end{aligned}$$
  2. Linear quotient

    If there is some integer d such that \(\left\lfloor a/(n+i)\right\rfloor = q + di\), then

    $$\begin{aligned} (a \bmod (n + i))&= a - (n+i)(q + di)\\&= (a \bmod n) - (di^2 + (q+nd)i). \end{aligned}$$

Note that the above still holds if we permit the variables to take arbitrary real values. In this way we can also find parameters corresponding to each parabola inside the “stepped” segments.Footnote 15
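Both identities can be confirmed mechanically. The check below is our own; it uses Python's `//` and `%`, which match \(\left\lfloor \cdot \right\rfloor \) and \(\bmod \) here for positive divisors, and exercises the linear-quotient case near \(n = 30\), where the quotient of \(-1000/n\) happens to change by exactly one per step.

```python
a = -1000

# Constant quotient: if floor(a/(n+i)) == floor(a/n) == q, then
# (a mod (n+i)) == (a mod n) - i*q.
for n in range(1, 200):
    q = a // n
    for i in range(1, 10):
        if a // (n + i) == q:
            assert a % (n + i) == a % n - i * q

# Linear quotient: if floor(a/(n+i)) == q + d*i, then
# (a mod (n+i)) == (a mod n) - (d*i*i + (q + n*d)*i).
n, d = 30, 1
q = a // n  # -34
for i in range(4):  # quotients -34, -33, -32, -31: linear with d = 1
    assert a // (n + i) == q + d * i
    assert a % (n + i) == a % n - (d * i * i + (q + n * d) * i)
```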

In our work, we will focus on linear and quadratic segments. These are especially useful, since we already have the tools to work with them. The following observations are key:

  1. Given a linear or quadratic segment, we can directly check it for a satisfying divisor by solving a linear or quadratic equation, respectively.

  2. The graph of remainders can always be partitioned into such segments.Footnote 16

  3. Therefore, we can search for a solution one segment at a time, rather than one divisor at a time.

With a good (that is, small) partition, we can shrink the search space significantly. However, this is still an exhaustive search, so proving the absence of a solution requires checking every segment. Furthermore, it is not clear that it is necessarily possible to construct a partition that is asymptotically smaller than the magnitude of the dividend.

For the specific case we are interested in, however, the number of segments does not matter: for a suitable partition, we can show that the remainder at either the start or the end of any segment always satisfies the bounds. More specifically, according to Theorem 4.23, we are interested in remainders of the form \((-ab \bmod n)\) or \((ab \bmod n)\) for nonzero integers abn. We now study the solution space under these special conditions.

5.1 Conditions for a Solution via Extrapolation

Fig. 4

Graphs of \((-1000 \bmod x)\) and \((-x^2 - 1000 \bmod x)\). The dots indicate points where x is an integer

Given nonzero integers a and b, we construct an appropriate partition using a function \(f :\mathbb {R}^* \rightarrow \mathbb {R}\) such that \(f(x) = (-x^2 - {|ab|} \bmod x)\). Clearly, we have \(f(n) = (-{|ab|} \bmod n)\) for all nonzero integers n. Since the numerator is a quadratic itself, the interval between each pair of consecutive roots of f is a quadratic segment (Fig. 4, bottom). Now, note the following:

  1. A segment contains solutions if and only if the value of least magnitude within it satisfies the bounds.

  2. In each segment, the value of least magnitude is either at the start or the end of the segment.

  3. The start and the end of each segment lie between two consecutive roots of f, and are obtained by rounding those roots.

Therefore, it suffices to prove that if \(f(x) = 0\), then either \(f(\left\lfloor x\right\rfloor )\) or \(f(\left\lceil x\right\rceil )\) lies within the bounds. Note that a linear mapping does not suffice: although it fits better when x is large (this can be seen in Fig. 4), it does not provide sufficiently strong bounds when x is small. The next result further explains the behavior of real extensions of remainders. In particular, we see that a quadratic extension makes quadratic segments have constant quotients.
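Before proceeding, this setup is easy to sanity-check numerically (our own snippet, with \({|ab|} = 1000\)): at integer points the quadratic extension coincides with the plain remainder.

```python
from fractions import Fraction as F

AB = 1000  # |ab|

def f(x):
    """Quadratic extension f(x) = (-x^2 - AB mod x), computed exactly."""
    x = F(x)
    t = -x * x - AB
    return t - x * (t // x)  # t mod x, with t // x the floored quotient

# At every positive integer, f agrees with the remainder (-AB mod n):
for n in range(1, 200):
    assert f(n) == (-AB) % n
```

Between integers, f traces the quadratic segments shown at the bottom of Fig. 4.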

Lemma 5.1

Let I be an interval not containing zero and let f be a real-valued function continuous over I. If \((f(x) \bmod x) \ne 0\) for all \(x \in I\), then \(\left\lfloor f(x)/x\right\rfloor = \left\lfloor f(y)/y\right\rfloor \) for all \(x, y \in I\).

Proof

We proceed by contraposition. Suppose \(\left\lfloor f(x)/x\right\rfloor \ne \left\lfloor f(y)/y\right\rfloor \) for some \(x, y \in I\). Let \(q :I \rightarrow \mathbb {R}\) be a function such that \(q(t) = f(t)/t\). Then, since f is continuous over I and I does not contain zero, so is q. Without loss of generality, we assume that \(\left\lfloor q(x)\right\rfloor < \left\lfloor q(y)\right\rfloor \). Then \(\left\lfloor q(y)\right\rfloor \ge \left\lfloor q(x)\right\rfloor + 1 > q(x)\) and \(\left\lfloor q(y)\right\rfloor \le q(y)\), and hence by the intermediate value theorem there is some w between x and y (and thus in I) such that \(q(w) = \left\lfloor q(y)\right\rfloor \). Thus f(w) is an integer multiple of w, and so \((f(w) \bmod w) = 0\), as desired. \(\square \)

Corollary 5.1.1

Let I be an interval not containing zero and let f be a real-valued function continuous over I. If \((f(x) \bmod x) \ne 0\) for all \(x \in I\), then \((f(x) \bmod x) = f(x) - xq\) for all \(x \in I\), where q is an integer such that \(q = \left\lfloor f(y)/y\right\rfloor \) for all \(y \in I\).

The following lemma allows us to express the remainders of interest as quadratics.

Lemma 5.2

Let \(a, b \in \mathbb {Z}\) and let \(f, g, h :\mathbb {R}^* \rightarrow \mathbb {R}\) be functions such that

$$\begin{aligned} f(t)&= (-t^2 - {|ab|} \bmod t),\\ g(t)&= (-ab \bmod t),\\ h(t)&= -(ab \bmod t). \end{aligned}$$

Let I be an interval not containing zero. Let \(q = \left\lfloor (-x^2 - {|ab|})/x\right\rfloor \) for some \(x \in I\). If \(f(t) \ne 0\) for all \(t \in I\), then for all \(n \in I \cap \mathbb {Z}\),

$$\begin{aligned} f(n)&= -n^2 - qn - {|ab|},\\ g(n)&= {\left\{ \begin{array}{ll} f(n) &{} \text {if } ab \ge 0,\\ n - f(n) &{} \text {otherwise.} \end{array}\right. }\\ h(n)&= g(n) - n. \end{aligned}$$

Proof

Suppose \(f(t) \ne 0\) for all \(t \in I\) and let \(n \in I \cap \mathbb {Z}\). Then, by Lemma 5.1 we immediately have \(f(n) = -n^2 - qn - {|ab|}\). Since n is an integer, we also have \(f(n) = (-{|ab|} \bmod n)\). Thus, since f(n) is nonzero, it follows that n does not divide ab. Therefore g(n) and h(n) are nonzero, and hence

$$\begin{aligned} g(n) - n = -(n - (-ab \bmod n)) = -(ab \bmod n) = h(n). \end{aligned}$$

If \(ab \ge 0\), then \(f(n) = (-ab \bmod n) = g(n)\). If \(ab < 0\), then \(f(n) = (ab \bmod n) = -h(n) = n - g(n)\), and thus \(g(n) = n - f(n)\) as desired. \(\square \)
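The sign bookkeeping between f, g, and h can also be checked directly for positive divisors (an informal check of ours, relying on Python's `%` returning a result in \([0, n)\) for a positive modulus n):

```python
def g(ab, n):
    return (-ab) % n        # (-ab mod n)

def h(ab, n):
    return -(ab % n)        # -(ab mod n)

def f(ab, n):
    return (-abs(ab)) % n   # (-|ab| mod n)

for ab in (840, -840):
    for n in range(2, 100):
        if ab % n == 0:
            continue  # when n divides ab, all three remainders vanish
        assert h(ab, n) == g(ab, n) - n
        if ab >= 0:
            assert g(ab, n) == f(ab, n)
        else:
            assert g(ab, n) == n - f(ab, n)
```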

We now proceed to demonstrate the conditions under which the roots of the quadratic mapping provide a solution. We first derive a simpler expression for the quadratic passing through a given root.

Lemma 5.3

Let \(a, b, \delta \in \mathbb {R}\) and let \(x \in \mathbb {R}^*\). Then \(-(x + \delta )^2 + (x + \delta )k - ab = - \delta ^2 -(x - \frac{ab}{x})\delta \) where \(k = \frac{x^2 + ab}{x}\).

Remark

Note that \((-x^2 - ab \bmod x) = 0\) if and only if \(\frac{-x^2-ab}{x}\) is an integer.

We now bound the magnitude of that quadratic in terms of the difference between the factors in the numerator.

Lemma 5.4

Let \(a, b, x \in \mathbb {R}\). If \(1 < a + 1 \le {|x|} \le b - 1\), then \({|x - ab/x|} < b - a - 1\).

Proof

Let \(f :\mathbb {R}^* \rightarrow \mathbb {R}\) be a function such that \(f(t) = t - \frac{ab}{t}\) and suppose \(1 < a + 1 \le {|x|} \le b - 1\). Since a and b are positive, we have \(ab > 0\) and so f is strictly increasing over \((-\infty , 0)\) and over \((0, +\infty )\). Hence \(f(a+1) \le f({|x|}) \le f(b - 1)\). Now, since \(b - 1\) and a are positive, we have

$$\begin{aligned} (b - a - 1) - f(b-1)&= (b - a - 1) - \left( b - 1 - \frac{ab}{b-1}\right) \\&= \frac{ab}{b-1} - a\\&= \frac{a}{b-1}\\&> 0, \end{aligned}$$

and hence \(b - a - 1 > f(b-1)\). Similarly, since b and \(a+1\) are positive,

$$\begin{aligned} -(b - a - 1) - f(a+1)&= -(b - a -1) - \left( a + 1 - \frac{ab}{a+1}\right) \\&= \frac{ab}{a+1} - b\\&= -\frac{b}{a+1}\\&< 0, \end{aligned}$$

and so \(-(b - a - 1) < f(a+1)\). Therefore, \({|f({|x|})|} < b - a - 1\). If \(x > 0\), then \({|x|} = x\) and the result follows immediately. If \(x < 0\) instead, we have

$$\begin{aligned} {|f({|x|})|} = {|f(-x)|} = {|-f(x)|} = {|f(x)|}, \end{aligned}$$

and so the result follows. \(\square \)
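The bound of Lemma 5.4 is simple to probe numerically over a grid of sample points (our own check, using exact rationals to avoid rounding noise; the function name is ours):

```python
from fractions import Fraction as F

def check_bound(a, b, samples=200):
    """Assert |x - a*b/x| < b - a - 1 at sample x with a + 1 <= |x| <= b - 1."""
    lo, hi = a + 1, b - 1
    for k in range(samples + 1):
        x = F(lo) + F(k * (hi - lo), samples)  # evenly spaced in [a+1, b-1]
        for s in (x, -x):  # the bound is symmetric in the sign of x
            assert abs(s - F(a * b) / s) < b - a - 1

check_bound(10, 17)
check_bound(5, 50)
```

Note that \(b \le 2a\) is not needed here; that assumption only enters later, in Lemma 5.6.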

Remark

We assume that \(a + 1 \le x \le b - 1\) in order to obtain the necessary bound. When a and b are integers, if \(x = a\) or \(x = b\), then \((-ab \bmod x) = 0\), which lies within the bounding interval by assumption.

The following lemma is an auxiliary result.

Lemma 5.5

Let \(x \in \mathbb {R}\) and \(\delta \in (0, 1)\). If \(\delta x\) and \((\delta - 1)(x+1)\) are integers, then they are both nonzero and have opposite signs.

Proof

Suppose \(\delta x\) and \((\delta - 1)(x+1)\) are integers. If \(x = 0\), then \((\delta - 1)(x+1) = \delta - 1 \notin \mathbb {Z}\), which is a contradiction. If \(0 < {|x|} \le 1\), then \(\delta x \notin \mathbb {Z}\), which is also a contradiction. Therefore \({|x|} > 1\), and hence \(x + 1\) is nonzero and has the same sign as x. Therefore, since \(\delta - 1\) is negative, it follows that \(\delta x\) and \((\delta - 1)(x+1)\) are nonzero and have opposite signs. \(\square \)

We now show that rounding a root of the quadratic extension gives a solution for a sufficiently wide interval. However, we assume that the factors in the numerator are within a factor of two of each other. Otherwise, the bound from Lemma 5.4 grows too quickly to be useful. Luckily, this assumption always holds when \(\beta = 2\), which is the most common case by far.

Lemma 5.6

Let \(a, b \in \mathbb {Z}\) and let \(x \in \mathbb {R}^*\). Let I be an interval containing zero. If \(0 < a \le {|x|} \le b\), \((-x^2 - ab \bmod x) = 0\), \(b \le 2a\) and \({{\,\textrm{diam}\,}}I \ge a\), then for some \({\hat{x}}_1, {\hat{x}}_2 \in \{ \left\lfloor x\right\rfloor , \left\lceil x\right\rceil \}\),

  • either \((-ab \bmod {\hat{x}}_1) \in I\) or \(-(ab \bmod {\hat{x}}_1) \in I\), and

  • either \((ab \bmod {\hat{x}}_2) \in I\) or \(-(-ab \bmod {\hat{x}}_2) \in I\).

Proof

Suppose \(0 < a \le {|x|} \le b\), \((-x^2 - ab \bmod x) = 0\), \(b \le 2a\) and \({{\,\textrm{diam}\,}}I \ge a\). We shall first dispose of some trivial cases. If \(x \in \mathbb {Z}\), then \(\left\lfloor x\right\rfloor = \left\lceil x\right\rceil = x\), so

$$\begin{aligned} (-ab \bmod x) = (-x^2 - ab \bmod x) = 0 \in I, \end{aligned}$$

and hence the result. Since \(0 \in I\), if \(a< {|x|} < a + 1\) instead, then either \({|\left\lfloor x\right\rfloor |} = a\) or \({|\left\lceil x\right\rceil |} = a\), and hence the result holds trivially. Similarly, if \(b - 1< {|x|} < b\), then either \({|\left\lfloor x\right\rfloor |} = b\) or \({|\left\lceil x\right\rceil |} = b\), and the result follows trivially again.

Suppose \(x \notin \mathbb {Z}\) and \(a + 1< {|x|} < b - 1\) instead, and let \(f :\mathbb {R}\rightarrow \mathbb {R}\) be a function such that \(f(t) = -t^2 + kt - ab\) where \(k = \frac{x^2 + ab}{x}\). Then, since \((-x^2 - ab \bmod x) = 0\), it follows that k is an integer and thus f(t) is an integer whenever t is an integer. Now, let \(\delta = {\left\lceil x\right\rceil } - x\) and \(y = \frac{ab}{x} - x - \delta \). Then, by Lemma 5.3,

$$\begin{aligned} \begin{aligned} f({\left\lceil x\right\rceil })&= -{\left\lceil x\right\rceil }^2 + k{\left\lceil x\right\rceil } - ab\\&= -(x + \delta )^2 + k(x + \delta ) - ab\\&= -\delta ^2 - \left( x-\frac{ab}{x}\right) \delta \\&= \delta \left( \frac{ab}{x} - x - \delta \right) \\&= \delta y. \end{aligned} \end{aligned}$$

Since x is not an integer, it follows that \(\delta \in (0, 1)\) and hence \(\left\lfloor x\right\rfloor = x + \delta - 1\). Thus, by Lemma 5.3 again,

$$\begin{aligned} f(\left\lfloor x\right\rfloor ) = -(x + \delta - 1)^2 + k(x + \delta - 1) - ab = (\delta - 1)\left( \frac{ab}{x} - x - \delta + 1\right) = (\delta - 1)(y + 1). \end{aligned}$$

Now, since \(b \le 2a\), by Lemma 5.4, we have \({|ab/x - x|} < b - a - 1 \le a - 1\). Hence, applying the triangle inequality,

$$\begin{aligned} {|f({\left\lceil x\right\rceil })|}&= {|\delta y|}\\&< {|ab/x - x - \delta |}\\&\le {|ab/x - x|} + \delta \\&< {|ab/x - x|} + 1\\&< a, \end{aligned}$$

and

$$\begin{aligned} {|f(\left\lfloor x\right\rfloor )|}&= {|(\delta - 1)(y + 1)|}\\&\le {|ab/x - x - \delta + 1|}\\&\le {|ab/x - x|} + 1 - \delta \\&< {|ab/x - x|} + 1\\&< a, \end{aligned}$$

and

$$\begin{aligned} {|f(\left\lfloor x\right\rfloor )|} + {|f({\left\lceil x\right\rceil })|}&= (1 - \delta ){|y + 1|} + \delta {|y|}\\&\le \max \{ {|y|}, {|y + 1|} \}\\&\le {|ab/x - x|} + 1\\&< b - a. \end{aligned}$$

Hence the distance between each of \(f(\left\lfloor x\right\rfloor )\), \(f({\left\lceil x\right\rceil })\) and 0 is strictly less than \(a \le {{\,\textrm{diam}\,}}I\). Let \(c = \min R\) and \(d = \max R\) where \(R = \{ f(\left\lfloor x\right\rfloor ), f({\left\lceil x\right\rceil }) \}\). Then \({{\,\textrm{diam}\,}}{[c, d]} = d - c < b - a\). Since \(f(\left\lfloor x\right\rfloor )\) and \(f({\left\lceil x\right\rceil })\) are both integers and \(\delta \in (0, 1)\), it follows by Lemma 5.5 that \(\delta y\) and \((\delta - 1) (y + 1)\) are nonzero and have opposite signs and hence \(c< 0 < d\). Thus both \([c, d]\) and I contain zero, and since \({{\,\textrm{diam}\,}}{[c, d]} < a \le {{\,\textrm{diam}\,}}I\), we have either \(c \in I\) or \(d \in I\) by Lemma 2.3, and therefore either \(f(\left\lfloor x\right\rfloor ) \in I\) or \(f({\left\lceil x\right\rceil }) \in I\). Similarly, since \([-d, -c]\) also contains zero and has the same diameter as \([c, d]\), we also have either \(-f(\left\lfloor x\right\rfloor ) \in I\) or \(-f({\left\lceil x\right\rceil }) \in I\).

Now, in the following, note that we have both \({|f(\left\lfloor x\right\rfloor )|} < {|\left\lfloor x\right\rfloor |}\) and \({|f({\left\lceil x\right\rceil })|} < {|{\left\lceil x\right\rceil }|}\), which drastically simplifies working with mod.

  • Suppose \(f({\left\lceil x\right\rceil })\) and x have the same sign. Then \(-f(\left\lfloor x\right\rfloor )\) also has the same sign, and \(-f({\left\lceil x\right\rceil })\) and \(f(\left\lfloor x\right\rfloor )\) have the opposite sign. Therefore,

    $$\begin{aligned} f({\left\lceil x\right\rceil }) = (f({\left\lceil x\right\rceil }) \bmod {\left\lceil x\right\rceil }) = (-ab \bmod {\left\lceil x\right\rceil }),\\ -f(\left\lfloor x\right\rfloor ) = (-f(\left\lfloor x\right\rfloor ) \bmod \left\lfloor x\right\rfloor ) = (ab \bmod \left\lfloor x\right\rfloor ), \end{aligned}$$

    and thus

    $$\begin{aligned} -f({\left\lceil x\right\rceil })&= -(-ab \bmod {\left\lceil x\right\rceil }),\\ f(\left\lfloor x\right\rfloor )&= -(ab \bmod \left\lfloor x\right\rfloor ). \end{aligned}$$

    Hence either \((-ab \bmod {\left\lceil x\right\rceil }) \in I\) or \(-(ab \bmod \left\lfloor x\right\rfloor ) \in I\), and also either \((ab \bmod \left\lfloor x\right\rfloor ) \in I\) or \(-(-ab \bmod {\left\lceil x\right\rceil }) \in I\).

  • Suppose instead \(f({\left\lceil x\right\rceil })\) and x have opposite signs. Then \(f(\left\lfloor x\right\rfloor )\) and \(-f({\left\lceil x\right\rceil })\) have the same sign as x, and hence \(-f(\left\lfloor x\right\rfloor )\) has the opposite sign of x. Therefore,

    $$\begin{aligned} f(\left\lfloor x\right\rfloor ) = (f(\left\lfloor x\right\rfloor ) \bmod \left\lfloor x\right\rfloor ) = (-ab \bmod \left\lfloor x\right\rfloor ),\\ -f({\left\lceil x\right\rceil }) = (-f({\left\lceil x\right\rceil }) \bmod {\left\lceil x\right\rceil }) = (ab \bmod {\left\lceil x\right\rceil }), \end{aligned}$$

    and thus

    $$\begin{aligned} -f(\left\lfloor x\right\rfloor ) = -(-ab \bmod \left\lfloor x\right\rfloor ),\\ f({\left\lceil x\right\rceil }) = -(ab \bmod {\left\lceil x\right\rceil }). \end{aligned}$$

    Hence either \((-ab \bmod \left\lfloor x\right\rfloor ) \in I\) or \(-(ab \bmod {\left\lceil x\right\rceil }) \in I\), and also either \((ab \bmod {\left\lceil x\right\rceil }) \in I\) or \(-(-ab \bmod \left\lfloor x\right\rfloor ) \in I\).

Therefore, for some \({\hat{x}}_1, {\hat{x}}_2 \in \{ \left\lfloor x\right\rfloor , {\left\lceil x\right\rceil } \}\), we have either \((-ab \bmod {\hat{x}}_1) \in I\) or \(-(ab \bmod {\hat{x}}_1) \in I\), and also either \((ab \bmod {\hat{x}}_2) \in I\) or \(-(-ab \bmod {\hat{x}}_2) \in I\). \(\square \)
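Lemma 5.6 can be spot-checked numerically in the same spirit (again, an illustration of ours only). Choosing an integer k and letting x be a real root of \(t^2 - kt + ab\) guarantees \((x^2 + ab)/x = k \in \mathbb {Z}\) and hence \((-x^2 - ab \bmod x) = 0\); we then test the conclusion with \(I = [-a/2, a/2]\), which contains zero and has diameter a. Note that Python's % operator returns a result with the sign of the modulus, matching the sign convention used in the proof above:

```python
import math
import random

random.seed(2)
checked = 0
for _ in range(5_000):
    a = random.randint(2, 40)
    b = random.randint(a, 2 * a)                  # hypothesis: b <= 2a
    k = random.choice((1, -1)) * random.randint(0, 3 * a)
    disc = k * k - 4 * a * b
    if disc < 0:
        continue
    for sign in (1, -1):
        # x is a root of t^2 - k*t + ab, so (x^2 + ab)/x = k is an
        # integer and hence (-x^2 - ab) mod x = 0.
        x = (k + sign * math.sqrt(disc)) / 2
        if x == 0 or not (a <= abs(x) <= b):      # hypothesis: a <= |x| <= b
            continue
        half = a / 2                              # I = [-a/2, a/2], diam I = a
        in_I = lambda v: -half <= v <= half
        hats = [h for h in {math.floor(x), math.ceil(x)} if h != 0]
        # Conclusion of Lemma 5.6, for some hat(x)_1 and hat(x)_2:
        ok1 = any(in_I(-a * b % h) or in_I(-(a * b % h)) for h in hats)
        ok2 = any(in_I(a * b % h) or in_I(-(-a * b % h)) for h in hats)
        assert ok1 and ok2
        checked += 1
assert checked > 0
```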

Combining the above result with Theorem 4.23, we can finally show that, for equitable rounding functions, roots of the quadratic extension give a bound on plausible integral significands. With the help of Lemma 5.2, this will let us compute plausible numbers efficiently in base 2.

Theorem 5.7

Let \(Z \subseteq {\overline{\mathbb {F}}}\) be a floating-point interval, and let \(z \in Z \cap \mathbb {F}^*\) and \(x \in \mathbb {F}^*_\infty \). Let \(f :\mathbb {R}^* \rightarrow \mathbb {R}\) be a function such that \(f(t) = (-t^2 - {|M_z|}\beta ^{p-k} \bmod t)\) where

$$\begin{aligned} k&= {\left\{ \begin{array}{ll} 0 &{} \text {if } {|M_x|} > {|M_z|},\\ 1 &{} \text {otherwise,} \end{array}\right. }\\ M_x&= \frac{x}{{{\,\textrm{ulp}\,}}(x)},\\ M_z&= \frac{z}{{{\,\textrm{ulp}\,}}(z)}. \end{aligned}$$

Let \(r \in \mathbb {R}^*\) be a root of f. If \({{\,\textrm{fl}\,}}\) is equitable, \(r \ge M_x\), and either \(\beta ^{p-1} < {|M_z|} \le 2\beta ^{p-1}\) or \(\beta = 2\), then \(\phi _{Z}(x) \le {\left\lceil r\right\rceil } {{\,\textrm{ulp}\,}}(x)\).

Proof

Suppose the conditions hold and let \(I = ({{\,\textrm{fl}\,}}^{-1}[Z] - z)/{{\,\textrm{ulp}\,}}(z)\) and \(M = \phi _{Z}(x) / {{\,\textrm{ulp}\,}}(x)\). We divide the proof into two major cases based on the value of \(M_z\), and then handle the different values of k within those cases.

Suppose \(\beta ^{p-1} < {|M_z|} \le 2 \beta ^{p-1}\). Then, since \({{\,\textrm{fl}\,}}\) is equitable, by definition we have

$$\begin{aligned} {{\,\textrm{diam}\,}}{{\,\textrm{fl}\,}}^{-1}[Z] \ge {{\,\textrm{diam}\,}}{{\,\textrm{fl}\,}}^{-1}[\{ z \}] \ge \beta ^{Q(z)} \ge {{\,\textrm{ulp}\,}}(z), \end{aligned}$$

and therefore,

$$\begin{aligned} {{\,\textrm{diam}\,}}I \beta ^{p-k}&= \beta ^{p-k} {{\,\textrm{diam}\,}}\frac{{{\,\textrm{fl}\,}}^{-1}[Z]-z}{{{\,\textrm{ulp}\,}}(z)}\\&= \frac{\beta ^{p-k}}{{{\,\textrm{ulp}\,}}(z)} {{\,\textrm{diam}\,}}{{\,\textrm{fl}\,}}^{-1}[Z]\\&\ge \beta ^{p-k}. \end{aligned}$$

If \(k = 0\), then \({|M_z|} < {|M_x|}\), so Lemma 4.13 implies that x is plausible for Z, and thus

$$\begin{aligned} \phi _{Z}(x) = x = M_x {{\,\textrm{ulp}\,}}(x) \le r {{\,\textrm{ulp}\,}}(x) \le {\left\lceil r\right\rceil } {{\,\textrm{ulp}\,}}(x). \end{aligned}$$

Suppose \(k = 1\) instead, and let \(a = \beta ^{p-k}\) and \(b = {|M_z|}\). Then \(b \le 2a = 2\beta ^{p-1}\) by assumption and \({{\,\textrm{diam}\,}}I \beta ^{p-k} \ge a\). We now consider each sign of \(M_z\):

  • Suppose \(M_z > 0\). Then \(b = M_z\), and hence by Lemma 5.6, we have either \(-(ab \bmod {\hat{r}}) = -(M_z\beta ^{p-k} \bmod {\hat{r}}) \in I \beta ^{p-k}\) or \((-ab \bmod {\hat{r}}) = (-M_z\beta ^{p-k} \bmod {\hat{r}}) \in I \beta ^{p-k}\) for some \({\hat{r}} \in \{ \left\lfloor r\right\rfloor , {\left\lceil r\right\rceil } \}\), and thus \(M \le {\hat{r}} \le {\left\lceil r\right\rceil }\) by Theorem 4.23.

  • Suppose \(M_z < 0\). Then \(b = -M_z\), and by Lemma 5.6 we have either \((ab \bmod {\hat{r}}) = (-M_z\beta ^{p-k} \bmod {\hat{r}}) \in I \beta ^{p-k}\) or \(-(-ab \bmod {\hat{r}}) = -(M_z\beta ^{p-k} \bmod {\hat{r}}) \in I \beta ^{p-k}\) for some \({\hat{r}} \in \{ \left\lfloor r\right\rfloor , {\left\lceil r\right\rceil } \}\), and thus \(M \le {\hat{r}} \le {\left\lceil r\right\rceil }\) by Theorem 4.23.

Therefore \(M \le {\left\lceil r\right\rceil }\) in both cases, and multiplying by \({{\,\textrm{ulp}\,}}(x)\), we obtain \(\phi _{Z}(x) \le {\left\lceil r\right\rceil } {{\,\textrm{ulp}\,}}(x)\).

Finally, suppose \(\beta = 2\) and \({|M_z|} = \beta ^{p-1}\) instead. If \(k = 1\), then \({|M_x|} = {|M_z|}\), so x is trivially plausible for Z and hence \(\phi _{Z}(x) = x \le {\left\lceil r\right\rceil } {{\,\textrm{ulp}\,}}(x)\) immediately. Suppose \(k = 0\) instead, and let \(a = {|M_z|}\) and \(b = \beta ^{p-k}\). Then, since \(\beta = 2\), we have \(b = 2a\). Since \({{\,\textrm{fl}\,}}\) is equitable,

$$\begin{aligned} {{\,\textrm{diam}\,}}{{\,\textrm{fl}\,}}^{-1}[Z] \ge {{\,\textrm{diam}\,}}{{\,\textrm{fl}\,}}^{-1}[\{ z \}] \ge \beta ^{Q(z) - 1} \ge {{\,\textrm{ulp}\,}}(z)/\beta , \end{aligned}$$

and therefore,

$$\begin{aligned} {{\,\textrm{diam}\,}}I \beta ^{p-k}&= \beta ^p {{\,\textrm{diam}\,}}\frac{{{\,\textrm{fl}\,}}^{-1}[Z]-z}{{{\,\textrm{ulp}\,}}(z)}\\&= \frac{\beta ^{p}}{{{\,\textrm{ulp}\,}}(z)} {{\,\textrm{diam}\,}}{{\,\textrm{fl}\,}}^{-1}[Z]\\&\ge \beta ^{p-1}\\&= a. \end{aligned}$$

We now complete the proof using Lemma 5.6 and Theorem 4.23. By the same argument as earlier, we have \(M \le {\hat{r}} \le {\left\lceil r\right\rceil }\) for some \({\hat{r}} \in \{ \left\lfloor r\right\rfloor , {\left\lceil r\right\rceil } \}\), and thus \(\phi _{Z}(x) \le {\left\lceil r\right\rceil } {{\,\textrm{ulp}\,}}(x)\), as desired. \(\square \)

Remark

If r is the least root of f greater than \(M_x\), then f is nonzero over \((M_x, r)\). Thus, according to Lemma 5.2, the restrictions of \(M \mapsto -(M_z\beta ^{p-k} \bmod M)\) and \(M \mapsto (-M_z \beta ^{p-k} \bmod M)\) to \((M_x, r) \cap \mathbb {Z}\) are quadratics. Hence, if neither x nor \(\left\lfloor r\right\rfloor {{\,\textrm{ulp}\,}}(x)\) is plausible for Z, then we necessarily have \(\phi _{Z}(x) = {\left\lceil r\right\rceil }{{\,\textrm{ulp}\,}}(x)\). Otherwise, \(\phi _{Z}(x) \le \left\lfloor r\right\rfloor {{\,\textrm{ulp}\,}}(x)\), so we can compute \(\phi _{Z}(x)\) by solving the quadratics from Lemma 5.2 to find the earliest point where they lie in the interval \(({{\,\textrm{fl}\,}}^{-1}[Z] - z)/{{\,\textrm{ulp}\,}}(z)\).

As discussed above, combining the bound given by Theorem 5.7 with Lemma 5.2 allows us to solve Problem 1 for normal numbers in binary floating-point arithmetic. We next turn to assembling our results into an algorithm.

6 Algorithms

Theorem 3.5 states that the classical division-based algorithm can be used to produce optimal bounds for floating-point arithmetic if we can devise some way of finding the feasible floating-point numbers nearest any given infeasible number. If the bounds and their quotients are within the range of the normal numbers, Theorem 4.22 tells us that we can do this by instead finding the nearest plausible numbers. Finally, for equitable rounding functions in base 2 (and other bases under certain conditions), Theorem 5.7 allows us to efficiently find the next plausible number. Putting these together, we can derive an algorithm for efficiently computing optimal bounds on the factors of a binary floating-point multiplication. Algorithms 1 to 6 form a complete pseudocode implementation of the results of this paper. They are annotated with the properties they satisfy and justifications for them. The proofs of correctness for Algorithms 4 to 6 require some additional minor results which are included in Appendix A. We thus obtain the following:

Theorem 6.1

Algorithms 1 to 6 are correct. Algorithms 3 to 6 can be computed in O(1) arithmetic operations. Algorithms 1 and 2 can be computed in O(1) arithmetic operations and computations of the preimage of a floating-point interval under \({{\,\textrm{fl}\,}}\).

Remark

As in Lemma 2.25, for typical choices of \({{\,\textrm{fl}\,}}\), these algorithms have time complexity \(O(M(p) + \log ({e_{\textrm{max}}}- {e_{\textrm{min}}}+ 1))\), where M(p) is the time complexity of multiplying two p-digit integers. We again do not take the base, precision or exponent range as constants in our analysis.

According to Theorem 3.5, given the preconditions hold, applying Algorithm 1 to the bounds produced by the traditional algorithm gives optimal bounds on the factors. This takes O(1) arithmetic operations and computations of \({{\,\textrm{fl}\,}}^{-1}\), which is much less than the exponential worst case we proposed in Conjecture 2.27 for just iterating the classical algorithm.
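Independently of the efficient algorithms, a brute-force oracle is invaluable for testing an implementation. The sketch below (ours, not one of Algorithms 1 to 6; the helper name optimal_box is our own) computes the optimal box of Problem 1 by exhaustive enumeration in the two-place decimal format of Example 1.1 with downward rounding, using exact rational arithmetic to avoid binary rounding artifacts:

```python
import math
from fractions import Fraction as F

STEP = F(1, 100)                       # two-place decimal numbers

def fl(v):
    """Round v downward to two decimal places."""
    return math.floor(v / STEP) * STEP

def optimal_box(x_lo, x_hi, y_lo, y_hi, z):
    """Smallest box enclosing {(x, y) : x (*) y = z} by brute force."""
    xs, ys = [], []
    x = x_lo
    while x <= x_hi:
        y = y_lo
        while y <= y_hi:
            if fl(x * y) == z:         # rounded product equals z
                xs.append(x)
                ys.append(y)
            y += STEP
        x += STEP
    if not xs:                         # instance unsatisfiable
        return None
    return (min(xs), max(xs)), (min(ys), max(ys))

# The instance of Example 1.1: x (*) y = 5.00 with
# 2.20 <= x <= 2.50 and 1.00 <= y <= 2.50.
box = optimal_box(F("2.20"), F("2.50"), F("1.00"), F("2.50"), F("5.00"))
(x_min, x_max), (y_min, y_max) = box
```

Running this on the instance of Example 1.1 confirms that the given bounds are loose: the oracle tightens both lower bounds, exactly as the discussion in the introduction suggests.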

Although the preconditions on Algorithm 1 almost always hold when \(\beta = 2\), they must still be checked. Otherwise, even though the algorithm always returns either a feasible number or \(+\infty \), the value returned may not in fact be the next feasible number (or, in the case of \(+\infty \) being infeasible, the problem instance may still be satisfiable).

In the event that some or all of the inputs are subnormal or we are working in a higher base, we still have some options for using these results. By Lemma 4.18, the next plausible number is always a lower bound on the next feasible number, and Algorithm 2 works correctly for subnormal input, since there is no such distinction over the unbounded floating-point numbers. Furthermore, it can be shown that Algorithm 2 returns a lower bound on the next plausible number even in higher bases. However, we make no claims about the efficiency of iterating Algorithm 2 to find the next feasible number under such conditions.

Now, as we demonstrated in Example 2.29, iterating the classical algorithm is insufficient to find optimal bounds on solution sets in the general case. That is, given a problem instance (XYZ) where \(X, Y, Z \subseteq {\overline{\mathbb {F}}}\) are floating-point intervals, iterating F until a fixed point is reached may not produce optimal bounds on Z. We proposed solving this by using binary search to find the greatest and least values in Z for which the instance is satisfiable. (As a reminder, an instance is satisfiable if and only if its optimal bounds are nonempty.) Naively, this search takes \(O(p + \log ({e_{\textrm{max}}}- {e_{\textrm{min}}}+ 1))\) satisfiability queries. However, we can do better. Lemma 3.8 says that if we have at least \(\beta \) values in Z, then almost any value is feasible for Z. Thus, in most cases, we only need to consider the first and last \(\beta \) values of Z to obtain optimal intervals. When \(\beta = 2\), this means that we typically only need to trim at most one value from each endpoint of Z. Note that this situation only occurs when the problem instance is satisfiable—otherwise, our optimal bounds on X and Y will be empty, and thus so will the optimal bounds on Z.

[Algorithms 1 to 6: pseudocode listings.]

7 Conclusion

In this paper, we have rigorously presented the first step towards an efficient solution to the problem of computing optimal interval bounds on the variables of the floating-point constraint \(x \otimes y = z\). The algorithms presented here are sound, complete, and efficient for binary floating-point arithmetic involving normal numbers. This has immediate implications for floating-point decision procedures based on interval constraint solving. We hope that the techniques developed here will be useful in future study of floating-point theory.

As suggested earlier, a number of questions remain open in this topic:

  • What is the optimal time complexity for computing feasible numbers in general (i.e. in all bases and with subnormals)? Can we asymptotically outperform brute force search? Does an efficient algorithm exist for this task?

  • Although the algorithms in this paper are relatively brief, the work to develop them is much longer. Is there a shorter derivation of this solution?

  • The results of this paper rely on a mixture of integer and floating-point reasoning. Can these results be stated naturally using only floating-point numbers? If so, is there a version of this algorithm which uses only floating-point arithmetic?

  • Separately, Sect. 5 contains results which may be useful for solving mod constraints beyond the scope of this paper. Can they be applied more broadly?