chapter 3

floating point data types

representation

When using floating point numbers one has to consider the peculiarities of these data types. The way floating point numbers are represented inside your computer leads to two commonly seen effects in calculations:

	round-off errors: because only a limited number of digits is stored, it is problematic to a) add two numbers of very different magnitude or b) subtract two numbers that differ only slightly
	truncation errors: they occur when you calculate numerical values iteratively (e.g. integrals) - even with an unlimited number of digits one has to stop the algorithm after a finite number of steps.

A typical floating point number is represented by a sign bit (s), a mantissa (m), a base (b) and an exponent (e), which frequently has an offset (E):

(-1)^s * m * b^(e-E)

with mantissa and exponent being integers. The base is usually b=2. In C/C++ you have the types float (32 bit) and double (64 bit), with double being the de facto standard for floating point operations. If one needs higher precision there is also long double, whose size depends on compiler and platform (80 bit on many x86 systems, but only 64 bit with some compilers). For even higher precision special libraries exist.

A float has 8 bits for the exponent, 23 bits for the mantissa and one sign bit. With 8 bits the stored exponent ranges from 0 to 255; let's assume an offset of E=128. The largest number is then of the order of 2^(128)=3.4E38. (IEEE 754, the format used on practically all current machines, uses an offset of 127 and reserves the exponent value 255 for infinity and NaN, but the largest float is still about 3.4E38.) It is very important to understand that this largest number has nothing to do with round-off errors. These are determined only by the number of bits used for the mantissa: to add or subtract two numbers of different magnitude you first have to make both exponents equal and then add the mantissae. For the smaller number this means increasing the exponent and decreasing the mantissa accordingly. Decreasing the mantissa means losing information. To clarify this, let's take an example with base 10 and 5 digits for the mantissa:

a = 1453.4 = 1.4534E3
b = 1.0942 = 1.0942E0

To add a and b, b has to be brought to the same exponent as a. With only 5 digits available its last digits are cut off: b = 0.0010E3

=> a+b = 1.4544E3 = 1454.4

In this example we clearly lose information about the number b when its digits are "right-shifted": the mantissa is decreased by the same factor by which the exponent is increased. Let's see this effect in a program:

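#include <iostream>

int main()
{
    float f1=1.0f,f2;
    int i;

    for (i=0;i<30;i++)
    {
        f2=1.0f+f1;    // add f1 = 2^-i to 1 in float precision
        f2-=1.0f;      // subtract 1 again: what is left of f1?
        std::cout << i << "\t" << f2 << std::endl;
        f1/=2;         // next power of two
    }
    return 0;
}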

In this example we first add f1=2^-i to 1 and subtract 1 again immediately afterwards. What remains should be f1, as long as f1, once shifted to the same exponent as 1, is not zero. When it is, the resulting value of f2 becomes 0. Running this program you will see that this happens between i=23 and i=24. Compare this with the 23 bits of a float's mantissa. As an exercise, change this program to use the double and long double types and find out the number of bits used for their mantissae.

The precision of a float and a double can also be nicely seen with the following simple example:

#include <iostream>

int main()
{
    double d;
    float f;
    d=1.2345678901234567890;
    f=1.2345678901234567890;   // converted to float when stored
    std::cout.precision(20);   // print 20 significant digits
    std::cout << d << "\n" << f << "\n";
    return 0;
}

In the output

1.2345678901234566904
1.2345678806304931641

we can see at which digit the stored value (which is what gets printed) starts to differ from the number we set in the program. Compare this with the number of mantissa bits: 23 for float and 52 for double (24 and 53 if the implicit leading bit is counted). How many decimal digits do these numbers correspond to?

Another example is the first exercise of this chapter.

functions using floating point types

Mathematical functions are declared in math.h and take a double as an argument and also return a double. Some of the more important ones are

	pow(x,b) calculates x to the power of b: x^b
	exp(b) evaluates e^b
	log(x), log10(x) evaluate the logarithm of x to base e and 10, respectively
	sin(x), cos(x), tan(x) the trigonometric functions
	asin(x), acos(x), atan(x), atan2(y,x) the arc trigonometric functions (atan2 is interesting in geometrical calculations, as it uses the signs of both arguments to return the angle in the correct quadrant)
	sqrt(x) guess what this one evaluates...

In C these functions exist in further versions with an appended "f" (e.g. sinf), which take and return a float instead of a double, and with an appended "l" for long double. In C++ one can alternatively use the header cmath. There the above functions are overloaded: you can pass a float, a double or a long double, and the return type is the same as the type of the argument. Trigonometric functions always work with angles in radians.

a pitfall when using floating point numbers in C/C++

A common source of errors is wrongly initialized floating point numbers. In a statement like

double a=1;

the integer 1 is automatically converted to the floating point value 1.0. Nevertheless, if we try the following:

double b=1/3;

we may be surprised that the resulting value of b is 0! The reason is that we divide 1 by 3, two integers. As stated in chapter 1, such a division returns an integer, and this is also the case here. Furthermore, in a pure integer division the fractional part is always cut off: the quotient is 0, and only this 0 is then converted to double. To correct this behavior it is sufficient to make just one of the operands a floating point value, e.g. 1.0/3. Note that if only zeros follow the decimal point, the following two statements are equivalent:

b=1.;
b=1.0;

Maybe you already noticed this problem in the 2nd exercise of the last chapter. If you have not yet done so, try that example again, replacing "e+=1.0/nf" by "e+=1/nf".

email me: Daniel Schürmann