Floating Point

Generalise fixed point by encoding the position of the binary point inside the number.

IEEE 754 Format

The sign is simple sign-magnitude (didn't someone say that was bad?)

Exponent is excess-127 (single) or excess-1023 (double)

Mantissa is an unsigned binary fraction (why unsigned?)

If exponent is not zero, mantissa is interpreted as having a leading 1

E.g.

1 10000111 01000000000000000000000
This is a negative number (sign is 1)

Exponent is 135-127=8

Mantissa field is 1 * 2^-2 = 0.25, so with the implicit leading 1 the significand is 1.25

Value is: -1.25 * 2^8 = -320
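
The same decoding can be done programmatically. A minimal C sketch, assuming 32-bit IEEE 754 singles (true on essentially all current hardware); 0xC3A00000 is just the bit pattern above written in hex:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void) {
    /* 1 10000111 01000000000000000000000 packed into 32 bits */
    uint32_t bits = 0xC3A00000u;

    uint32_t sign     = bits >> 31;            /* 1 sign bit */
    uint32_t exponent = (bits >> 23) & 0xFFu;  /* 8 exponent bits */
    uint32_t fraction = bits & 0x7FFFFFu;      /* 23 mantissa bits */

    printf("sign=%u exponent=%u (unbiased %d) fraction=0x%06X\n",
           sign, exponent, (int)exponent - 127, fraction);

    /* Reinterpret the same bits as a float (memcpy avoids aliasing traps) */
    float f;
    memcpy(&f, &bits, sizeof f);
    printf("value = %f\n", f);   /* -320.000000 */

    return 0;
}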

Largest number:

0 11111111 11111111111111111111111
Positive sign

Maximum exponent (255-127=128)

Maximum mantissa (1-2^-23)

Value is: (1+1-2^-23) * 2^128 ≈ 6.81 * 10^38.

Or at least, this would be the largest IEEE single if the exponent 11111111 were not treated in a special way. (The actual largest finite single is (2-2^-23) * 2^127 ≈ 3.40 * 10^38.)
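
This is easy to check in C: float.h exposes the true maximum as FLT_MAX. A minimal sketch, again assuming 32-bit IEEE singles:

#include <stdio.h>
#include <string.h>
#include <float.h>
#include <stdint.h>

int main(void) {
    printf("%e\n", FLT_MAX);   /* 3.402823e+38, not 6.81e+38 */

    /* Its bit pattern: exponent field 11111110, not 11111111 */
    uint32_t bits;
    float f = FLT_MAX;
    memcpy(&bits, &f, sizeof bits);
    printf("0x%08X\n", bits);  /* 0x7F7FFFFF */
    return 0;
}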

"Not-a-numbers", and "infinities"

Infinities can result from e.g. trying to compute 1/0, or from an overflow

Bit patterns (+INF and -INF):

0 11111111 00000000000000000000000
1 11111111 00000000000000000000000
Operations are well-defined on infinities: e.g. INF + 1 = INF and 1/INF = 0, but INF - INF = NaN.
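
A minimal C sketch of these rules, assuming IEEE 754 semantics (which mainstream compilers provide by default):

#include <stdio.h>

int main(void) {
    float zero = 0.0f;
    float inf  = 1.0f / zero;       /* division by zero yields +INF */

    printf("%f\n", inf);            /* inf */
    printf("%f\n", -inf);           /* -inf */
    printf("%f\n", 1.0f / inf);     /* 0.000000 */
    printf("%f\n", inf + 1.0e30f);  /* inf: no finite value can change it */
    printf("%f\n", inf - inf);      /* nan: this operation is meaningless */
    return 0;
}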

NaNs have 11111111 exponent, but non-zero mantissa field.

Used to indicate "meaningless" operations

E.g. 0/0, INF - INF, and sqrt(-1) all produce NaNs.

The mantissa field is used to record a NaN code.

Operating on a NaN usually propagates that same NaN to the result.
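
A minimal C sketch of NaN generation and propagation (sqrtf() and isnan() come from math.h; link with -lm on Unix):

#include <stdio.h>
#include <math.h>

int main(void) {
    float zero = 0.0f;
    float nan1 = zero / zero;       /* 0/0 has no meaningful answer */
    float nan2 = sqrtf(-1.0f);      /* neither does sqrt(-1) */

    printf("%f %f\n", nan1, nan2);  /* nan nan (the sign may vary) */
    printf("%f\n", nan1 * 2.0f);    /* nan: operating on a NaN gives a NaN */

    /* A NaN compares unequal even to itself; isnan() is the portable test */
    printf("%d %d\n", nan1 == nan1, isnan(nan1));   /* 0 1 */
    return 0;
}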

Normalised and denormalised mantissas

Leading 1 bit is not stored if exponent is non-zero

E.g.

1.100110011001100110011001 * 2^-126
This already has the smallest exponent a normalised single can have (-126, stored exponent field 00000001). Divide the number by 2 and the exponent cannot be reduced any further; instead the mantissa is shifted right and the stored exponent field becomes zero:
0.1100110011001100110011001 * 2^-126
I.e. the leading bit is now 0: the number has become denormalised.
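
Gradual underflow can be watched happening in C. A minimal sketch using FLT_MIN from float.h, assuming subnormals are not flushed to zero (the IEEE default):

#include <stdio.h>
#include <float.h>

int main(void) {
    float x = FLT_MIN;      /* smallest normalised single, 1.0 * 2^-126 */
    printf("%e\n", x);      /* 1.175494e-38 */

    /* Halving now produces denormalised (subnormal) values: the magnitude
       keeps shrinking, but one bit of precision is lost at each step */
    for (int i = 0; i < 3; i++) {
        x /= 2.0f;
        printf("%e\n", x);
    }
    return 0;
}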

Rounding errors: density of floats

When does x + 1 = x?

Happens (roughly) once the exponent exceeds 23: the gap between successive floats with exponent e is 2^(e-23), so the added 1 falls below half the gap and is rounded away.

Distance "between" successive floats doubles with each increase in the exponent.

With floats at the upper end of the range (exponent 127), the distance reaches 2^104 ≈ 2*10^31. E.g. in this range, x + 1,000,000 still equals x.
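
Both effects are easy to demonstrate. A minimal C sketch, assuming IEEE singles; the constants are chosen to sit in the regions just described:

#include <stdio.h>

int main(void) {
    /* At 2^24 the gap between successive floats is already 2 */
    float x = 16777216.0f;        /* 2^24 */
    float y = x + 1.0f;           /* assignment forces rounding to single */
    printf("%d\n", y == x);       /* 1: the +1 is rounded away */

    /* Near the top of the range the gap is around 10^31 */
    float big  = 1.0e38f;
    float big2 = big + 1.0e6f;
    printf("%d\n", big2 == big);  /* 1: adding a million changes nothing */
    return 0;
}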

Incommensurable bases

It is not possible, in general, to convert between base 10 and base 2 completely accurately: many fractions that terminate in base 10 (such as 0.1) recur forever in base 2.

E.g. represent 0.1 as a float

Closest single-precision IEEE float is

0.000110011001100110011001101
but this is more like
0.10000000149
These "minor" differences compound with each operation. Eventually, all precision can be lost.

Degradation is delayed by computing at higher precision (e.g. extended floats), but the problem never goes away.
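
For instance, a minimal sketch contrasting single and double precision (exact outputs vary slightly by platform):

#include <stdio.h>

int main(void) {
    float  fsum = 0.0f;
    double dsum = 0.0;
    for (int i = 0; i < 1000000; i++) {
        fsum += 0.1f;   /* error compounds quickly at single precision */
        dsum += 0.1;    /* compounds far more slowly at double precision */
    }
    printf("float:  %f\n", fsum);  /* noticeably off from 100000 */
    printf("double: %f\n", dsum);  /* much closer, but still not exact */
    return 0;
}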