Numerical representations and error

Introduction

There are many ways to represent a number within a digital computer. In this section we will examine some of the more useful and common types of these representations, as well as their strengths and weaknesses in terms of error analysis and efficiency.

First some basic terminology: if u is an approximation to a quantity v, then the absolute error in the approximation is

|u - v|

and the relative error is

\frac{|u-v|}{|v|}

Note that the relative error is not defined if v=0.
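
As a concrete illustration, here is a small Python sketch (the function names are our own) that computes both quantities for an approximation u of a value v:

    def absolute_error(u, v):
        """Absolute error |u - v| of the approximation u to the quantity v."""
        return abs(u - v)

    def relative_error(u, v):
        """Relative error |u - v| / |v|; not defined when v = 0."""
        if v == 0:
            raise ValueError("relative error is not defined for v = 0")
        return abs(u - v) / abs(v)

    # Example: approximating pi by 22/7
    print(absolute_error(22 / 7, 3.141592653589793))   # about 1.26e-3
    print(relative_error(22 / 7, 3.141592653589793))   # about 4.02e-4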

In discussing the behavior of small or large quantities it can be useful to use the 'big-O' and 'big-\Theta' notations. For a small quantity h, we say that a function f(h) is O(g(h)) if there exist an H>0 and a constant C>0 such that |f(h)| < C g(h) for all 0 < h < H. For example, the function sin(h) is O(h). The notation \Theta(g(h)) is more restrictive: there must be constants C_1>0, C_2>0, and H>0 such that

C_1 g(h) < |f(h)| < C_2 g(h)

for all 0 < h < H. The function f(h) = 2h + h^2 is both O(h) and O(1), but it is \Theta(h) and not \Theta(1).

Similarly, if we are interested in the behavior of f(n) for large values of n (often an integer), then it is O(g(n)) if there are an N > 0 and a constant C > 0 such that |f(n)| < C g(n) for all n>N.
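
As a quick numerical check of the example above (a throwaway Python sketch, not a rigorous argument), we can tabulate the ratio f(h)/h for f(h) = 2h + h^2 as h shrinks; it stays near the constant 2, consistent with f being \Theta(h), while f(h) itself tends to 0, so no lower bound of the form C_1 \cdot 1 can hold and f is not \Theta(1):

    f = lambda h: 2 * h + h ** 2

    for k in range(1, 7):
        h = 10.0 ** (-k)
        print(h, f(h) / h, f(h))   # f(h)/h approaches 2; f(h) itself approaches 0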

Floating-point representations

The most common way of representing numbers in computations is to use a floating-point representation. We will focus on the IEEE (Institute of Electrical and Electronics Engineers) double precision standard, which is currently the most widely used (though other formats are important in specialized areas such as audio processing and machine learning).

A floating-point number is represented with a sign (either positive or negative), a mantissa of some number of digits in a particular base (almost always base-2 on a computer), and an exponent. In base-2 most numbers will be represented in the form

\pm 1.b_1 b_2 b_3 \ldots b_N \cdot 2^p

For 64-bit double precision numbers, N=52. The number is represented in a block of memory (a "word") of 64 bits. The first bit is the sign, with 0 indicating a positive number and 1 indicating a negative number.

The next part of the word stores the exponent, p, shifted so that both small and large numbers can be represented. For 64-bit floats, 11 bits are used for the exponent, shifted by 1023. This shift value is called the exponent bias. For example, an exponent block of 10000001000 represents the exponent (2^{10} + 2^3) - 1023 = 9. The last 52 bits are the mantissa (the digits b_1, b_2, up to b_{52}).
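
For instance, a quick Python check (our own snippet) decodes the example exponent block above by converting it from base-2 and subtracting the bias:

    exponent_bits = "10000001000"          # the 11-bit exponent block from the example
    bias = 1023                            # exponent bias for 64-bit doubles
    print(int(exponent_bits, 2) - bias)    # prints 9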

So, for example, the decimal number 1,073,741,824.5, which in exponential binary form is

+1.0000000000000000000000000000001000000000000000000000 \cdot 2^{30}

would have an exponent block representing 30 + 1023 = 1053, computed in base-2 as

00000011110 + 01111111111 = 10000011101

and as it is positive its sign bit would be 0, so the 64-bit word would be

0100000111010000000000000000000000000000001000000000000000000000
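
We can verify this bit pattern in Python, whose floats are IEEE 64-bit doubles, by viewing the raw bytes with the struct module (a verification sketch):

    import struct

    x = 1073741824.5                       # 2**30 + 0.5
    raw = struct.pack(">d", x)             # the 8 bytes of the big-endian IEEE double
    bits = format(int.from_bytes(raw, "big"), "064b")
    print(bits[0], bits[1:12], bits[12:])  # sign, exponent, and mantissa fields
    print(int(bits[1:12], 2) - 1023)       # decoded exponent: prints 30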

For more information on floating point numbers, the Wikipedia entry is quite good. There is also a page on "minifloats", which use a small number of bits and are mainly useful for examples, since working through examples with 64 or 32 bits is awkward.

Special and subnormal numbers

There are a few exceptions to the usual representation. If the exponent is as large as possible (all ones), the mantissa bits are then examined. If the mantissa is all zero, then the number is interpreted as \pm \infty, depending on the sign bit. If the mantissa is not zero, then the number is interpreted as "NaN", or "Not a Number", indicating that an indeterminate quantity such as 0/0 was encountered (or sometimes that something has gone wrong in a computation).

The second exception is when the exponent is as small as possible (all zeros). In this case, the implied leading bit is taken to be 0 rather than 1, and the exponent is interpreted as -1022 (the same as for the smallest normal numbers) rather than -1023. These are called subnormal numbers, and this case includes 0. The smallest nonzero number is the subnormal number whose mantissa is all zeros except for a final 1. For 64-bit floats, this number is 2^{-1074} \approx 5 \cdot 10^{-324}.
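
These limits can be probed directly in Python, whose floats are 64-bit doubles (a small exploration sketch):

    import math, sys

    print(sys.float_info.min)        # smallest positive *normal* double, 2**-1022
    tiny = math.ldexp(1.0, -1074)    # smallest positive subnormal, about 5e-324
    print(tiny)
    print(tiny / 2)                  # halfway between 0 and tiny: rounds to 0.0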

Rounding

Most numbers cannot be exactly represented in floating-point, and need to be rounded. The IEEE standard rounds to the nearest representable number, unless the value is exactly halfway between the two nearest representable numbers. In that special case, the rounding is chosen so that the last digit of the mantissa will be zero ("round half to even"). This avoids a bias from always rounding the halfway case up or always rounding it down.
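
This "round half to even" rule can be seen in ordinary 64-bit arithmetic. Near 1, the representable doubles are spaced 2^{-52} apart, so adding 2^{-53} lands exactly halfway between two representable numbers; a short Python check (our own illustration):

    # The tie at 1 + 2**-53 rounds down to 1.0 (its last mantissa bit, 0, is even)
    print(1.0 + 2.0 ** -53 == 1.0)                               # True
    # The tie at (1 + 2**-52) + 2**-53 rounds up to 1 + 2**-51 (even again)
    print((1.0 + 2.0 ** -52) + 2.0 ** -53 == 1.0 + 2.0 ** -51)   # True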

[Figure: Floating point number types. Some common float formats, showing the sign bit, the exponent bits, and the mantissa bits of each format.]

Machine epsilon

There is an important difference between the smallest nonzero number representable in floating-point, and the precision of an arithmetic operation. The precision in basic operations such as addition and multiplication is limited by the precision of the mantissa. Often this is described by the "machine epsilon" of a given system; unfortunately the machine epsilon has several different definitions, but a simple one is: the machine epsilon is the difference between 1 and the nearest representable number above 1. For 64-bit floating point, the machine epsilon is 2^{-52}, which is about 2.22 \cdot 10^{-16}.
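
In Python, whose floats are 64-bit doubles, the machine epsilon can be read from sys.float_info or found by repeated halving (a simple sketch):

    import sys

    print(sys.float_info.epsilon)          # 2.220446049250313e-16, i.e. 2**-52

    # Find the gap between 1 and the next representable number by halving
    eps = 1.0
    while 1.0 + eps / 2 > 1.0:
        eps /= 2
    print(eps, eps == 2.0 ** -52)          # 2.220446049250313e-16 True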

Example

Let's look at an example of how precision can be lost in arithmetic operations on floating-point numbers. For simplicity, we will use a 16-bit floating-point format ("half precision"): base b=2, one sign bit, five exponent bits with a bias of 16 (the IEEE half-precision standard actually uses a bias of 15), and 10 bits of precision (with an implied leading 1, so 11 bits of actual precision).

Let's compute

(fl(6/5) - fl(1/5)) - fl(1)

where fl(x) denotes the floating-point representation of a number x.

To make the bit representations clearer, we will insert commas between the segments for the sign, the exponent, and the significand.

To see how the number 6/5 is represented in this system, we first compute its binary representation:

(6/5)_2 = 1.0011001100110011... = 1.\overline{0011}

This needs to be rounded to 10 bits after the binary point; since the discarded part is more than halfway to the next representable number, we round up to

(6/5)_2 \approx 1.0011001101

We do not need to adjust the location of the binary point here, so the exponential part is just 2^0, and we need to represent an exponent of zero. Since the exponent bias is 16, we actually store 16 in the five exponent bits. So we get

fl(6/5) = 0,10000,0011001101

where the leading 0 is the sign bit, indicating a positive number, followed by the exponent 16 = 10000_2, and then the 10 binary digits of the significand.

Similarly,

(1/5)_2 = 0.0011001100110011... = 0.\overline{0011}

but now we need to multiply by 2^3 to get a leading 1 before the binary point. So the exponent is -3, and with the bias of 16 we store -3 + 16 = 13 = 01101_2 in the exponent bits. In this case, we need to round down for the 10 significand bits, obtaining:

fl(1/5) = 0,01101,1001100110
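
The binary expansions used above are easy to generate programmatically. The following Python sketch (binary_digits is our own helper, using exact rational arithmetic) prints the leading binary digits of 6/5 and 1/5 for comparison with the hand calculations:

    from fractions import Fraction

    def binary_digits(x, n):
        """Integer part of x followed by its first n binary digits after the point (0 <= x < 2)."""
        digits = []
        frac = x - int(x)
        for _ in range(n):
            frac *= 2
            digits.append(str(int(frac)))
            frac -= int(frac)
        return str(int(x)) + "." + "".join(digits)

    print(binary_digits(Fraction(6, 5), 14))   # 1.00110011001100
    print(binary_digits(Fraction(1, 5), 14))   # 0.00110011001100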

When we subtract fl(6/5) - fl(1/5), we first use all of the bits available, and then round the answer. In this case, we have

 1.001100110100
-0.001100110011

which is

 1.000000000001

which rounds down to 1, since the part being discarded is less than half a unit in the last place that is kept.
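
This can be double-checked with exact rational arithmetic in Python (a verification sketch; the 11-bit strings below are the significands found above, including the implied leading 1):

    from fractions import Fraction

    # fl(6/5) = 1.0011001101 * 2**0  and  fl(1/5) = 1.1001100110 * 2**-3
    fl_6_5 = Fraction(int("10011001101", 2), 2 ** 10)
    fl_1_5 = Fraction(int("11001100110", 2), 2 ** 10) / 2 ** 3

    diff = fl_6_5 - fl_1_5
    print(diff)                                # 4097/4096, i.e. 1 + 2**-12
    print(diff == 1 + Fraction(1, 2 ** 12))    # True; the nearest 10-bit significand is exactly 1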

Since 1 is exactly representable by 0,10000,0000000000,

(fl(6/5) - fl(1/5)) - fl(1) = 0,00000,0000000000

and we get zero. Here the computed answer happens to agree with the exact answer, since (6/5 - 1/5) - 1 = 0, but precision was still lost along the way: the exact difference fl(6/5) - fl(1/5) = 1 + 2^{-12} cannot be stored with 10 significand bits, so the 2^{-12} was discarded when the intermediate result was rounded.
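
The same behavior is easy to observe in ordinary 64-bit arithmetic, for example in Python, where 6/5 and 1/5 are likewise not exactly representable; whether the rounding errors show up in the final answer depends on the order of the operations:

    print((1.2 - 0.2) - 1.0)    # 0.0: as in the example, the rounding errors do not show up
    print((1.2 - 1.0) - 0.2)    # -5.551115123125783e-17: here they do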

Exercises

  1. Convert the number \frac{28}{3} to its binary representation.

  2. Find the two solutions of the equation x^2 + 9^{14}x = 3 to four significant digits.

  3. Compute the quantity (\frac{7}{3} - \frac{4}{3}) - 1 in these two versions of floating point arithmetic:

    1. Standard 64-bit floating point: base b=2, eleven bits for the exponent (with a bias of 1023), and 52 bits of precision.

    2. Base b = 3, three ternary digits for the exponent (with a bias of 13), and 5 ternary digits of precision.