Floating Point Numbers. How they work, the pitfalls of imprecision and how to fix them.
- Prerequisite: basic knowledge of bitwise operations (shift operations and bit "fallout")
Acronym | Meaning |
---|---|
FP | Floating point |
^ | "To the power of" |
RTZ | Round to Zero |
RTP | Round to Positive infinity |
RTN | Round to Negative infinity |
RTE | Round to Even (round to nearest, ties to even) |
Floating point numbers simply are a "positional notation representing numbers as a fixed count of significant bits, starting from the first non-zero high-order bit of the number when represented as a finite binary fraction".
Ok... maybe it wasn't so simple, and I don't know why people would think up such a difficult definition for such a simple concept. All floating point numbers really are is a way to represent decimal numbers in a computer.
To understand how FP numbers are represented we need to get familiar with the so-called scientific notation for representing numbers. Basically, a number is broken down into a value times a base (AKA radix) raised to some power:
- The number 25.67 in Scientific notation would be written out as: 2567 * (10 ^ -2).
- The number 21 in Scientific notation would be written out as: 21 * (10 ^ 0).
- The number -0.4 in Scientific notation would be written out as: -4 * (10 ^ -1), and so on (a quick sanity check in code follows below).
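If you want to verify these decompositions, here is a minimal C sketch (my own illustration, using nothing but the standard library) that rebuilds each value as significand times base to the power of exponent:

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    /* value = significand * base^exponent, with base (radix) 10 */
    printf("%g\n", 2567 * pow(10, -2)); /* 25.67 */
    printf("%g\n",   21 * pow(10,  0)); /* 21    */
    printf("%g\n",   -4 * pow(10, -1)); /* -0.4  */
    return 0;
}
```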
Initially FP numbers were meant to be written in this notation, however one problem prevented that. Namely, that a single number could be represented in multiple ways. Let's take PI as an example: if we wanted to represent 3.14159 in scientific notation we could represent it like so:
- 0.314159 * (10 ^ 1) however
- 3.14159 * (10 ^ 0) is also a valid representation and so is:
- 31.4159 * (10 ^ -1)
In order to get FP numbers to work in the same manner across all architectures, everyone had to agree that there is only one "right" way to represent a number. So it was decided that floating point numbers would be represented using Normalized Scientific Notation.
Normalized form states that the integer part of your number (the part to the left of the decimal point) will have a single non-zero digit. Going back to the PI example above, we see that the "proper" way to represent it is the second one, namely 3.14159 * (10 ^ 0).
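Conveniently, C's %e format prints numbers in exactly this normalized form, so a quick sketch can confirm that all three spellings of PI above collapse to the same representation:

```c
#include <stdio.h>

int main(void) {
    /* %e always prints a single non-zero digit before the decimal point */
    printf("%e\n", 0.314159 * 10.0); /* 3.141590e+00 */
    printf("%e\n", 3.14159);         /* 3.141590e+00 */
    printf("%e\n", 31.4159 / 10.0);  /* 3.141590e+00 */
    return 0;
}
```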
So to summarize, a floating point number is broken into 3 parts:
- Sign - for telling apart negatives from positives. Takes up 1 bit.
- Significand - the significant digits of the number, AKA the mantissa. A fixed number of bits, usually 23 for a 32 bit float.
- Exponent - the position of the significand with respect to the binary point.
If we translate this into binary it goes like:
- value = (-1) ^ sign * (significand << exponent)
And 3.14159 in binary normalized form will look like:
- 1.1001001000011111...b << 1
Note* that the shift operator used here differs from the bitwise shift operators we are used to. The difference comes from the fact that this shift operator shifts in a different direction depending on the sign of the exponent: if the exponent is a positive number the operator shifts to the left, and if the exponent is a negative number it shifts to the right.
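To make the layout concrete, here is a minimal C sketch, assuming the IEEE 754 single precision layout (1 sign bit, 8 exponent bits biased by 127, 23 mantissa bits), that pulls the three fields out of a 32 bit float:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float f = 3.14159f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits); /* reinterpret the float's raw bits */

    unsigned sign     =  bits >> 31;          /* 1 bit                      */
    unsigned exponent = (bits >> 23) & 0xFF;  /* 8 bits, biased by 127      */
    unsigned mantissa =  bits & 0x7FFFFF;     /* 23 bits, leading 1 implied */

    printf("sign=%u exponent=%d mantissa=0x%06X\n",
           sign, (int)exponent - 127, mantissa);
    /* prints: sign=0 exponent=1 mantissa=0x490FD0 */
    return 0;
}
```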
Note* that since normalized form mandates that we have a single non-zero digit in the integer part, there is no way we could represent 0 in this form. So 0 is represented with a zero significand and the lowest exponent we can represent, which for 32 bit numbers is -126.
Also, since the notation mandates that we start our representation with a non-zero bit, and in binary the only non-zero digit is 1, that first bit is not kept in memory. The computer simply remembers that it is there.
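Building on the decoding sketch above, we can re-attach that hidden bit ourselves and rebuild the value from the stored fields (again assuming IEEE 754 single precision; the sign bit is skipped since the input is positive):

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

int main(void) {
    float f = 3.14159f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);

    int      exponent = (int)((bits >> 23) & 0xFF) - 127; /* strip the bias   */
    unsigned mantissa = bits & 0x7FFFFF;                  /* 23 fraction bits */

    /* 1.0 is the implicit leading bit; 8388608 = 2^23 scales the fraction */
    double value = ldexp(1.0 + mantissa / 8388608.0, exponent);
    printf("%f\n", value); /* 3.141590 */
    return 0;
}
```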
For simplicity let's invent our own 8 bit floating point format which takes a single bit for its sign, 3 bits for its exponent and 4 bits for its mantissa. Using this format we can see that the exponent field can hold values between [0, 7], as that is the range that fits in 3 bits. However we also need negative exponents, so the stored value is biased by 3 and, IEEE style, the all-zero and all-one patterns are reserved, which leaves our FP8 with normal exponents in the range [-2, 3].
Let's have three positive FP8 numbers:
- a = 1.0000b << -2 = 0.25 ; smallest positive normal in our FP8
- b = 1.0000b << 3 = 8
- c = 1.0100b << 0 = 1.25
Then: a + b + c = 0.25 + 8 + 1.25 = 9.5 OR 1.0011b << 3 in binary form.
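You can cross-check this without doing the binary arithmetic by hand: ldexp from the C standard library computes x * 2^e, which is exactly what our << operator means here:

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    /* ldexp(x, e) == x * 2^e, i.e. "x << e" in our FP notation */
    double a = ldexp(1.0,  -2); /* 1.0000b << -2 = 0.25 */
    double b = ldexp(1.0,   3); /* 1.0000b << 3  = 8    */
    double c = ldexp(1.25,  0); /* 1.0100b << 0  = 1.25 */

    /* 1.0011b = 1 + 1/8 + 1/16 = 1.1875 */
    printf("%g == %g\n", a + b + c, ldexp(1.1875, 3)); /* 9.5 == 9.5 */
    return 0;
}
```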
How do we add two FPs with different exponents though? We simply denormalize the number with the smaller exponent so that it is expressed with the larger exponent, carry out the computation, then re-normalize, applying the rounding policy as needed. Example: TODO: GIF HERE
- (1.0000b << -2) + (1.0000b << 3) ; denormalize the number with the smaller exponent
- (0.00001b << 3) + (1.0000b << 3) ; add the intermediates
- = 1.00001b << 3 ; the result no longer fits in 4 mantissa bits, so a rounding policy applies: RTZ, RTN and RTE give 1.0000b << 3, while RTP gives 1.0001b << 3 (the sketch below shows the same effect with real floats)
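The same loss is easy to reproduce with ordinary 32 bit floats: once the exponents differ by more than the mantissa width, the smaller addend disappears entirely in the rounding step:

```c
#include <stdio.h>

int main(void) {
    float big   = 16777216.0f; /* 2^24: from here on consecutive floats are 2 apart */
    float small = 1.0f;

    /* 2^24 + 1 is not representable as a 32 bit float; under the default
       round-to-even policy the sum rounds right back down to 2^24 */
    printf("%.1f\n", big + small); /* 16777216.0 */
    return 0;
}
```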
- Scientific notation Introduction - A Khan Academy video in which Salman Khan explains scientific notation.
- On the buoyancy of the Floating Point - A brilliant talk by Martin Krustev from Chaos Group. This blog is actually based on that talk. CONTENT IN BULGARIAN.