We get the smallest representable number when all of the values of \(a_i\) and \(b_i\) are \(0\), except for \(b_{32}\).
\[\begin{eqnarray} a_i &=& 0 \ \forall i, \quad b_1 = b_2 = \dots = b_{31} = 0, \quad b_{32} = 1 \\ &\implies& 2^{-32} \approx 10^{-9} \end{eqnarray}\]
Largest Number
We get the largest representable number when all of the values of \(a_i\) and \(b_i\) are \(1\).
\[\begin{eqnarray} a_i &=& 1 \ \forall i \quad \text{and} \quad b_i = 1 \ \forall i \\ &\implies& 2^{31} + \dots + 2^1 + 2^0 + 2^{-1} + 2^{-2} + \dots + 2^{-32} \approx 10^9 \end{eqnarray}\]
Binary Point Placement
We consider the range and precision when evaluating a representation.
Range: Difference between the largest and smallest numbers possible.
Precision: Smallest possible difference between any two numbers.
Consider an 8-bit system where 6 bits store the integer part and 2 bits store the fractional part. What are its range and precision?
Answer
Smallest Numbers
\(000000.01 = 0.25\)
\(000000.10 = 0.5\)
Largest Numbers
\(111111.10 = 63.5\)
\(111111.11 = 63.75\)
Range
\(63.75 - 0.25 = 63.5\)
Precision
\(0.5 - 0.25 = 63.75 - 63.5 = 0.25\)
How about an 8-bit system where 2 bits store the integer part and 6 bits store the fractional part?
Answer
Smallest Numbers
\(00.000001 = 0.015625\)
\(00.000010 = 0.03125\)
Largest Numbers
\(11.111110 = 3.96875\)
\(11.111111 = 3.984375\)
Range
\(3.984375 - 0.015625 = 3.96875\)
Precision
\(0.03125 - 0.015625 = 3.984375 - 3.96875 = 0.015625\)
We see that having more integer bits increases range and more bits for the fractional part increases precision. It can be hard to decide how much precision and range you need.
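The tradeoff can be checked directly. The short sketch below (a hypothetical helper, not part of the original notes) computes the range and precision of an unsigned fixed-point format from its bit allocation and reproduces the two examples above.
def fixed_point_stats(int_bits, frac_bits):
    # Precision is the gap between adjacent representable values.
    precision = 2.0 ** (-frac_bits)
    smallest = precision                                   # smallest nonzero value
    largest = (2 ** (int_bits + frac_bits) - 1) * precision
    return largest - smallest, precision
print(fixed_point_stats(6, 2))   # (63.5, 0.25)
print(fixed_point_stats(2, 6))   # (3.96875, 0.015625)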
Fix: Let the binary point "float"
Floating Point Numbers
The floating point representation of a binary number is similar to scientific notation for decimals. Similar to how you can represent 23.375 as
\[2.3375 \cdot 10^1,\]
you can represent \((10111.011)_2\) as
\[1.0111011 \cdot 2^4.\]
A floating-point number can represent numbers of different orders of magnitude (very large and very small) with the same fixed number of digits.
More formally, we can define a floating point number \(x\) as
\[x = \pm q \cdot 2^m,\]
where:
- \(\pm\) is the sign
- \(q\) is the significand
- \(m\) is the exponent
Aside from the special cases of zero and subnormal numbers (discussed below), the significand is always in normalized form
\[q = 1.f,\]
where:
- \(f\) is the fractional part of the significand
Whenever we store a normalized floating point number, the 1 is assumed. We don't store the entire significand, just the fractional part. This is called the "hidden bit representation", which gives one additional bit of precision.
Properties of Normalized Floating-Point Systems
A number \(x\) in a normalized binary floating-point system has the form
\[ \begin{equation} x = \pm 1.b_1b_2b_3...b_n \times 2^m = \pm 1.f \times 2^m \end{equation} \]
- Digits: \(b_i \in \{0, 1\}\)
- Exponent range: Integer \(m \in [L,U]\)
- Precision: \(p = n + 1\)
- Smallest positive normalized floating-point number: \( 2^L\)
- Largest positive normalized floating-point number: \( 2^{U+1}(1-2^{-p})\)
Example
\[\begin{equation} x = \pm 1.b_1b_2 \times 2^m \text{ for } m \in [-4,4] \text{ and } b_i \in \{0,1\} \end{equation} \]
Answer
- Smallest normalized positive number:
\[ \begin{equation} (1.00)_2 \times 2^{-4} = 0.0625 \end{equation} \]
- Largest normalized positive number:
\[ \begin{equation} (1.11)_2 \times 2^4 = 28.0 \end{equation} \]
- Any number \(x\) closer to zero than 0.0625 would underflow to zero.
- Any number \(x\) outside the range \(-28.0\) to \(+28.0\) would overflow to infinity.
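Because the toy system is so small, we can also enumerate every positive normalized number and check the formulas for the smallest and largest values. The sketch below is illustrative only, assuming the system \(1.b_1b_2 \times 2^m\) with \(m \in [-4,4]\).
L_exp, U_exp, n = -4, 4, 2          # exponent bounds and number of fraction bits
p = n + 1                           # precision, including the hidden bit
# Enumerate every positive normalized number 1.b1b2 x 2^m.
values = sorted({(1 + f / 2**n) * 2.0**m
                 for m in range(L_exp, U_exp + 1)
                 for f in range(2**n)})
print(values[0], 2.0**L_exp)                           # 0.0625 0.0625  (smallest)
print(values[-1], 2.0**(U_exp + 1) * (1 - 2.0**-p))    # 28.0 28.0      (largest)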
Machine Epsilon
Machine epsilon (\(\epsilon_m\)) is defined as the distance (gap) between \(1\) and the next largest floating point number.
For example, consider the toy system
\[\pm 1.b_1b_2 \times 2^m \text{ for } m \in [-4,4] \text{ and } b_i \in \{0, 1\}.\]
The two smallest numbers at exponent \(m = 0\) are
\[(1.00)_2 \times 2^0 = 1 \hspace{1.8cm} (1.01)_2 \times 2^0 = 1.25,\]
so
\[\epsilon_m = (0.01)_2 \times 2^0 = 0.25.\]
More generally, for a normalized floating point system \(1.f \times 2^m\), where \(f\) is represented with \(n\) bits, machine epsilon is defined as:
\[\epsilon_m = 2^{-n}\]
In programming languages, these values are typically available as predefined constants. For example, in C, these constants are FLT_EPSILON and DBL_EPSILON and are defined in the float.h header. In Python, you can access these values with the code snippet below.
import numpy as np
# Single Precision
eps_single = np.finfo(np.float32).eps
print("Single precision machine eps = {}".format(eps_single))
# Double Precision
eps_double = np.finfo(np.float64).eps
print("Double precision machine eps = {}".format(eps_double))
Note: There are many definitions of machine epsilon that are used in various resources, such as the smallest number such that \(\text{fl}(1 + \epsilon_m) \ne 1\). These other definitions may give slightly different values from the definition above depending on the rounding mode (next topic). In this course, we will always use the values from the "gap" definition above.
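To see why the alternative definition can depend on the rounding mode, note that adding something smaller than \(\epsilon_m\) to 1 may simply round back to 1. A quick check in Python, using the same NumPy constant as above:
import numpy as np
eps = np.finfo(np.float64).eps        # gap definition: 2**-52
print(1.0 + eps == 1.0)               # False: 1 + eps is the next floating point number after 1
print(1.0 + eps / 4 == 1.0)           # True: the perturbation is lost when rounding back to 1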
Range of Integer Numbers
What is the range of integer numbers that can be represented exactly in this representation?
\[\pm 1.b_1b_2 \times 2^m \text{ for } m \in [-4,4] \text{ and } b_i \in \{0, 1\}\]
\[\begin{eqnarray} (1)_2 &=& 1.00 \times 2^0 &=& 1_{10} \\ (10)_2 &=& 1.00 \times 2^1 &=& 2_{10} \\ (11)_2 &=& 1.10 \times 2^1 &=& 3_{10} \\ (100)_2 &=& 1.00 \times 2^2 &=& 4_{10} \\ (101)_2 &=& 1.01 \times 2^2 &=& 5_{10} \\ (110)_2 &=& 1.10 \times 2^2 &=& 6_{10} \\ (111)_2 &=& 1.11 \times 2^2 &=& 7_{10} \\ (1000)_2 &=& 1.00 \times 2^3 &=& 8_{10} \\ (1001)_2 &=& \_\_\_\_?\_\_\_\_ &=& 9_{10} \\ (1010)_2 &=& 1.01 \times 2^3 &=& 10_{10} \\ \end{eqnarray}\]
We see that we can represent every integer from \((1)_{10}\) to \((8)_{10}\) in this floating point system. However, in order to represent \((9)_{10} = (1001)_2\), we would need more than two bits in the fractional part. This limited precision causes gaps between representable integers beyond this range.
Notice that the upper bound of this contiguous range has the form \(1.00 \times 2^3\), where the exponent \(3\) equals the precision \(p\) of the floating point system. In general, every integer up to \(2^p\) is exactly representable, and \(2^p + 1\) is the first integer that is not.
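The same skips occur in IEEE-754 arithmetic. The snippet below (an illustrative check using NumPy) shows that the first integer that cannot be stored exactly is \(2^p + 1\), with \(p = 24\) for single precision and \(p = 53\) for double precision.
import numpy as np
# 2**24 + 1 is not representable in single precision and rounds back to 2**24.
print(np.float32(2**24 + 1) == np.float32(2**24))    # True
# The analogous skip for double precision starts at 2**53 + 1.
print(np.float64(2**53 + 1) == np.float64(2**53))    # True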
IEEE-754 Single Precision
- \(x = (-1)^s 1.f \times 2^m\)
- 1-bit sign, s = 0: positive sign, s = 1: negative sign
- 8-bit exponent \(c\), where \(c = m + 127\); the values \(c = (11111111)_2 = 255\) and \(c = (00000000)_2 = 0\) are reserved for special cases, so \(0 < c < 255\)
- 23-bit fractional part \(f\)
- Machine epsilon: \(\epsilon_m = 2^{-23} \approx 1.2 \times 10^{-7}\)
- Smallest positive normalized FP number: \(UFL = 2^L = 2^{-126} \approx 1.2 \times 10^{-38}\)
- Largest positive normalized FP number: \(OFL = 2^{U+1}(1 - 2^{-p}) = 2^{128}(1 - 2^{-24}) \approx 3.4 \times 10^{38}\)
The exponent is shifted by 127 to avoid storing a negative sign. Instead of storing \(m\), we store \(c = m + 127\). Thus, the largest possible exponent is 127, and the smallest possible exponent is -126.
Example:
Convert the binary number \((100101.101)_2\) to the normalized FP representation
\[1.f \times 2^m\]
Answer
\((100101.101)_2 = (1.00101101)_2 \times 2^5\)
\(s = 0,\quad f = 00101101\ldots0,\quad m = 5\)
\(c = m + 127 = 132 = (10000100)_2\)
Normalized FP representations: \(0 \; 10000100 \; 00101101000000000000000\)
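As a sanity check, we can inspect the stored bits directly. The sketch below uses Python's standard struct module and the fact that \((100101.101)_2 = 37.625\); it is illustrative, not part of the original example.
import struct
x = 37.625   # (100101.101)_2
# Reinterpret the 32-bit single precision pattern as an unsigned integer and print its bits.
bits = format(struct.unpack('>I', struct.pack('>f', x))[0], '032b')
print(bits[0], bits[1:9], bits[9:])
# 0 10000100 00101101000000000000000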
For additional reading about IEEE floating point numbers, see the IEEE-754 standard.
IEEE-754 Double Precision
- \(x = (-1)^s 1.f \times 2^m\)
- 1-bit sign, s = 0: positive sign, s = 1: negative sign
- 11-bit exponent \(c\), where \(c = m + 1023\); the values \(c = (11111111111)_2 = 2047\) and \(c = (00000000000)_2 = 0\) are reserved for special cases, so \(0 < c < 2047\)
- 52-bit fractional part \(f\)
- Machine epsilon:
- For IEEE-754 double precision, \(\epsilon_m = 2^{-52}\), as shown by:
\[\epsilon_m = 1.\underbrace{000000...000}_{\text{51 bits}}{\bf 1} - 1.\underbrace{000000...000}_{\text{51 bits}}{\bf 0} = 2^{-52}\]
- Smallest positive normalized FP number: \(UFL = 2^L = 2^{-1022} \approx 2.2 \times 10^{-308}\)
- Largest positive normalized FP number: \(OFL = 2^{U+1}(1 - 2^{-p}) = 2^{1024}(1 - 2^{-53}) \approx 1.8 \times 10^{308}\)
The exponent is shifted by 1023 to avoid storing a negative sign. Instead of storing \(m\), we store \(c = m + 1023\). Thus, the largest possible exponent is 1023, and the smallest possible exponent is -1022.
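The double precision limits above can be read off from NumPy's finfo, as in the machine epsilon snippet earlier; this is an illustrative check, not part of the original notes.
import numpy as np
fi = np.finfo(np.float64)
print(fi.eps)    # 2.220446049250313e-16   (machine epsilon, 2**-52)
print(fi.tiny)   # 2.2250738585072014e-308 (UFL, 2**-1022)
print(fi.max)    # 1.7976931348623157e+308 (OFL)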
Special Cases in IEEE-754
There are several corner cases that arise in floating point representations.
Zero
In our definition of floating point numbers above, we said that there is always a leading 1 assumed. This is true for most floating point numbers. A notable exception is zero. In order to store zero as a floating point number, we store all zeros for the exponent and all zeros for the fractional part. Note that there can be both +0 and -0 depending on the sign bit.
Infinity
If a floating point calculation results in a number that is beyond the range of possible numbers in floating point, it is considered to be infinity. We store infinity with all ones in the exponent and all zeros in the fractional. \(+\infty\) and \(-\infty\) are distinguished by the sign bit.
NaN
Arithmetic operations that result in something that is not a number are represented in floating point with all ones in the exponent and a non-zero fractional part.
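These special values arise naturally from ordinary arithmetic, as the short Python sketch below illustrates.
x = 1e308 * 10          # overflows the double precision range
print(x)                # inf
print(x - x)            # nan: infinity minus infinity is not a number
print(0.0 == -0.0)      # True: +0 and -0 compare equal even though their sign bits differ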
Floating Point Number Line
[Figure: the number line for the IEEE-754 floating point system.]
Subnormal Numbers
As mentioned above, a normal number is defined as a floating point number with a 1 at the start of the significand, and the smallest normal number in double precision is \(1.000… \times 2^{-1022}\). The smallest representable normal number is called the underflow level, or UFL.
However, we can go even smaller than this by removing the restriction that the first digit of the significand must be a 1. These numbers are known as subnormal, and are stored with all zeros in the exponent field. Technically, zero is also a subnormal number.
It is important to note that subnormal numbers do not have as many significant digits as normal numbers.
IEEE-754 Single precision (32 bits):
- \( c = (00000000)_2 = 0 \)
- Exponent set to \(m\) = -126
- Smallest positive subnormal FP number: \(2^{-23} \times 2^{-126} \approx 1.4 \times 10^{-45}\)
IEEE-754 Double precision (64 bits):
- \( c = (00000000000)_2 = 0 \)
- Exponent set to \(m\) = -1022
- Smallest positive subnormal FP number: \(2^{-52} \times 2^{-1022} \approx 4.9 \times 10^{-324}\)
The use of subnormal numbers allows for more gradual underflow to zero (however subnormal numbers don't have as many accurate bits as normalized numbers).
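Gradual underflow can be observed in double precision with NumPy; the check below is illustrative, using the fact that nextafter(0, 1) returns the smallest positive representable number.
import numpy as np
ufl = np.finfo(np.float64).tiny       # smallest positive normalized double, ~2.2e-308
print(np.nextafter(0.0, 1.0))         # ~4.9e-324, the smallest positive subnormal
print(ufl / 2)                        # still nonzero thanks to subnormal numbers
print(ufl / 2**60)                    # too small even for subnormals: underflows to 0.0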
Example:
Suppose you are given a (binary) floating point system of the form
\[(-1)^s(1.b_1b_2)_2 \times 2^E\] that has an exponent range from -2 to 5.
We use this floating point system to represent the following subnormal number:
\[s = 1\]
\[b_1b_2 = (11)_2\]
Convert this subnormal number into a decimal number.
Answer
In this floating point system, subnormal numbers will be represented as
\[x = (-1)^s(0.f)\times2^L.\]
We convert the fractional part \((11)_2\) into a decimal number and use the lower bound of the exponent range, \(L = -2\).
\[\begin{eqnarray} x &=& (-1)^1 \times (0.11)_2 \times 2^{-2} \\ &=& (-1) \times (0.11)_2 \times 2^{-2} \\ &=& (-1) \times (0.75)_{10} \times 2^{-2} \\ &=& -0.1875 \end{eqnarray}\]
Review Questions
- Convert a decimal number to binary.
- Convert a binary number to decimal.
- What are the differences between floating point and fixed point representation?
- Given a real number, how would you store it as a machine number?
- Given a real number, what is the rounding error involved in storing it as a machine number? What is the relative error?
- Explain the different parts of a floating-point number: sign, significand, and exponent.
- How is the exponent of a machine number actually stored?
- What is machine epsilon?
- What is underflow (UFL)?
- What is overflow?
- Why is underflow sometimes not a problem?
- Given a toy floating-point system, determine machine epsilon and UFL for that system.
- How do you store zero as a machine number?
- What are subnormal numbers?
- How are subnormal numbers represented in a machine?
- Why are subnormal numbers sometimes helpful?
- What are some drawbacks to using subnormal numbers?
Changelog
- April 2nd, 2024: Apramey Hosahalli (apramey2) — adjust examples, reorder sections to match slides
- March 31st, 2024: Apramey Hosahalli (apramey2) — add binary point location section
- March 27th, 2024: Apramey Hosahalli (apramey2) — update learning objectives and add section for fixed-point representation
- April 28th, 2020: Mariana Silva (mfsilva) — moved rounding content to a separate page
- January 26th, 2020: Wanjun Jiang (wjiang24) — add normalized fp numbers, and some examples
- January 14th, 2018: Erin Carrier (ecarrie2) — removes demo links
- December 13th, 2017: Adam Stewart (adamjs5) — fix incorrect formula under number systems and bases
- December 8th, 2017: Erin Carrier (ecarrie2) — specifies UFL is positive
- November 19th, 2017: Matthew West (mwest) — addition of machine epsilon diagrams
- November 18th, 2017: Erin Carrier (ecarrie2) — updates machine eps def
- November 15th, 2017: Erin Carrier (ecarrie2) — fixes typo in converting integers
- November 14th, 2017: Erin Carrier (ecarrie2) — clarifies when stored normalized
- November 13th, 2017: Erin Carrier (ecarrie2) — updates machine epsilon
- November 13th, 2017: Erin Carrier (ecarrie2) — updates machine epsilon definition, fixes inconsistent capitalization
- November 12th, 2017: Erin Carrier (ecarrie2) — minor fixes throughout, adds changelog, adds section on number systems in different bases
- November 1st, 2017: Adam Stewart (adamjs5) — first complete draft
- October 16th, 2017: Luke Olson (lukeo) — outline