Floating Point

Generalise fixed point by encoding the position of the binary point inside the number.

IEEE 754 Format

The sign is simple sign-magnitude (didn't someone say that was bad?)

Exponent is excess-127 (single) or excess-1023 (double)

Mantissa is an unsigned binary fraction (why unsigned?)

If exponent is not zero, mantissa is interpreted as having a leading 1

E.g.

1 10000111 01000000000000000000000
This is a negative number (sign is 1)

Exponent is 135-127=8

Mantissa field is 1 * 2^-2 = 0.25, so with the implicit leading 1 the significand is 1.25

Value is: -1.25 * 2^8 = -320
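
The same decoding can be done programmatically. A minimal C sketch, assuming 32-bit IEEE 754 singles (true on essentially all current hardware); 0xC3A00000 is just the bit pattern above written in hex:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void) {
    /* 1 10000111 01000000000000000000000 packed into 32 bits */
    uint32_t bits = 0xC3A00000u;

    uint32_t sign     = bits >> 31;            /* 1 sign bit */
    uint32_t exponent = (bits >> 23) & 0xFFu;  /* 8 exponent bits */
    uint32_t fraction = bits & 0x7FFFFFu;      /* 23 mantissa bits */

    printf("sign=%u exponent=%u (unbiased %d) fraction=0x%06X\n",
           sign, exponent, (int)exponent - 127, fraction);

    /* Reinterpret the same bits as a float (memcpy avoids aliasing traps) */
    float f;
    memcpy(&f, &bits, sizeof f);
    printf("value = %f\n", f);   /* -320.000000 */

    return 0;
}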

Largest number:

0 11111111 11111111111111111111111
Positive sign

Maximum exponent (255-127=128)

Maximum mantissa (1-2^-23)

Value is: (1+1-2^-23) * 2^128 ≈ 6.81 * 10^38.

Or at least, this would be the largest IEEE single if the exponent 11111111 were not treated in a special way. (The actual largest finite single is (2-2^-23) * 2^127 ≈ 3.40 * 10^38.)
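
This is easy to check in C: float.h exposes the true maximum as FLT_MAX. A minimal sketch, again assuming 32-bit IEEE singles:

#include <stdio.h>
#include <string.h>
#include <float.h>
#include <stdint.h>

int main(void) {
    printf("%e\n", FLT_MAX);   /* 3.402823e+38, not 6.81e+38 */

    /* Its bit pattern: exponent field 11111110, not 11111111 */
    uint32_t bits;
    float f = FLT_MAX;
    memcpy(&bits, &f, sizeof bits);
    printf("0x%08X\n", bits);  /* 0x7F7FFFFF */
    return 0;
}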

"Not-a-numbers", and "infinities"

Infinities can result from e.g. trying to compute 1/0, or from an overflow

Bit patterns (+INF and -INF):

0 11111111 00000000000000000000000
1 11111111 00000000000000000000000
Operations are well-defined on infinities: e.g. INF + 1 = INF and 1/INF = 0, but INF - INF = NaN.
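
A minimal C sketch of these rules, assuming IEEE 754 semantics (which mainstream compilers provide by default):

#include <stdio.h>

int main(void) {
    float zero = 0.0f;
    float inf  = 1.0f / zero;       /* division by zero yields +INF */

    printf("%f\n", inf);            /* inf */
    printf("%f\n", -inf);           /* -inf */
    printf("%f\n", 1.0f / inf);     /* 0.000000 */
    printf("%f\n", inf + 1.0e30f);  /* inf: no finite value can change it */
    printf("%f\n", inf - inf);      /* nan: this operation is meaningless */
    return 0;
}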

NaNs have 11111111 exponent, but non-zero mantissa field.

Used to indicate "meaningless" operations

E.g. 0/0, INF - INF, and sqrt(-1) all produce NaNs.

The mantissa field is used to record a NaN code.

Operating on a NaN usually propagates that same NaN to the result.
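
A minimal C sketch of NaN generation and propagation (sqrtf() and isnan() come from math.h; link with -lm on Unix):

#include <stdio.h>
#include <math.h>

int main(void) {
    float zero = 0.0f;
    float nan1 = zero / zero;       /* 0/0 has no meaningful answer */
    float nan2 = sqrtf(-1.0f);      /* neither does sqrt(-1) */

    printf("%f %f\n", nan1, nan2);  /* nan nan (the sign may vary) */
    printf("%f\n", nan1 * 2.0f);    /* nan: operating on a NaN gives a NaN */

    /* A NaN compares unequal even to itself; isnan() is the portable test */
    printf("%d %d\n", nan1 == nan1, isnan(nan1));   /* 0 1 */
    return 0;
}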

Normalised and denormalised mantissas

Leading 1 bit is not stored if exponent is non-zero

E.g.

1.100110011001100110011001 * 2^-126
This already has the smallest exponent a normalised single can have (-126, stored exponent field 00000001). Divide the number by 2 and the exponent cannot be reduced any further; instead the mantissa is shifted right and the stored exponent field becomes zero:
0.1100110011001100110011001 * 2^-126
I.e. the leading bit is now 0: the number has become denormalised.
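
Gradual underflow can be watched happening in C. A minimal sketch using FLT_MIN from float.h, assuming subnormals are not flushed to zero (the IEEE default):

#include <stdio.h>
#include <float.h>

int main(void) {
    float x = FLT_MIN;      /* smallest normalised single, 1.0 * 2^-126 */
    printf("%e\n", x);      /* 1.175494e-38 */

    /* Halving now produces denormalised (subnormal) values: the magnitude
       keeps shrinking, but one bit of precision is lost at each step */
    for (int i = 0; i < 3; i++) {
        x /= 2.0f;
        printf("%e\n", x);
    }
    return 0;
}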

Rounding errors: density of floats

When does x + 1 = x?

Happens (roughly) once the exponent exceeds 23: the gap between successive floats with exponent e is 2^(e-23), so the added 1 falls below half the gap and is rounded away.

Distance "between" successive floats doubles with each increase in the exponent.

With floats at the upper end of the range (exponent 127), the distance reaches 2^104 ≈ 2*10^31. E.g. in this range, x + 1,000,000 still equals x.
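
Both effects are easy to demonstrate. A minimal C sketch, assuming IEEE singles; the constants are chosen to sit in the regions just described:

#include <stdio.h>

int main(void) {
    /* At 2^24 the gap between successive floats is already 2 */
    float x = 16777216.0f;        /* 2^24 */
    float y = x + 1.0f;           /* assignment forces rounding to single */
    printf("%d\n", y == x);       /* 1: the +1 is rounded away */

    /* Near the top of the range the gap is around 10^31 */
    float big  = 1.0e38f;
    float big2 = big + 1.0e6f;
    printf("%d\n", big2 == big);  /* 1: adding a million changes nothing */
    return 0;
}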

Incommensurable bases

It is not possible, in general, to convert between base 10 and base 2 completely accurately: many fractions that terminate in base 10 (such as 0.1) recur forever in base 2.

E.g. represent 0.1 as a float

Closest single-precision IEEE float is

0.000110011001100110011001101
but this is more like
0.10000000149
These "minor" differences compound with each operation. Eventually, all precision can be lost.

Degradation is delayed by computing at higher precision (e.g. extended floats), but the problem never goes away.
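
For instance, a minimal sketch contrasting single and double precision (exact outputs vary slightly by platform):

#include <stdio.h>

int main(void) {
    float  fsum = 0.0f;
    double dsum = 0.0;
    for (int i = 0; i < 1000000; i++) {
        fsum += 0.1f;   /* error compounds quickly at single precision */
        dsum += 0.1;    /* compounds far more slowly at double precision */
    }
    printf("float:  %f\n", fsum);  /* noticeably off from 100000 */
    printf("double: %f\n", dsum);  /* much closer, but still not exact */
    return 0;
}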