IEEE Standard for Floating-Point Arithmetic (IEEE 754)
Description
The IEEE-754 standard defines the formats computers use to store and manipulate floating point numbers. It covers 8 different variations: 5 binary, ranging from 16 to 256 bits, and 3 decimal, ranging from 32 to 128 bits. Only the binary formats in the 32, 64 and 128 bit sizes will be discussed. The binary formats are all very similar to each other. The decimal formats are different and are not used by the C/C++ compilers that will be used here. Note also that the LLVM and GNU C/C++ compilers do not use the IEEE-754 128 bit format for the long double type in all cases. There are options for using 128 bit formats, however they will not be covered here. The C++23 standard (ISO/IEC 14882:2023) also adds fixed width floating point types. The current C/C++ standards define three fundamental floating point types:
- float  32 bits
- double  64 bits
- long double  64 or 128 bits, and in some cases the 80 bit x87 extended-precision binary format
The three IEEE-754 binary formats (32, 64 and 128 bit) are covered here.
For general information and history visit IEEE 754 - Wikipedia
The IEEE-754 Formats
32 Bit
64 Bit
128 Bit
How to convert to the 32 bit format
The floating point number 34.25 is used as the example.
- Convert the number to binary
- Move the floating point and determine the exponent
- Remove the leading 1 and determine the mantissa
- Add the sign bit followed by the 8 bit exponent and then the 23 bit mantissa
34.25 = 0b100010.01
Move the floating point to the position directly behind the leading one.
0b100010.01 = 0b1.0001001 (the point moves 5 positions to the left)
The exponent is the number of positions shifted plus the 32 bit bias of 127.
exponent = 127 + 5 = 132
binary exponent = 0b10000100
Note: In cases where the leading 1 is to the right of the floating point, the number added to the bias is negative.
One technique for determining the exponent is to compare the number with successive powers of two (2^N) until the position of the leading one is found. In the code below, for the 64 bit format, lines 118-122 show this technique for determining a positive exponent, and lines 106-110 show it for determining a negative exponent.
The number after the shift to determine the exponent is: 0b1.0001001
Remove the leading 1 and the floating point, then fill in the remaining bits with zeros to determine the mantissa.
0b00010010000000000000000
If the number is positive the sign bit is 0; if it is negative the sign bit is 1.
0 10000100 00010010000000000000000
This process is followed for all three of the formats. The only differences are the sizes of the exponent and mantissa, and the bias.
The bias is determined by the size of the exponent:
(2^(number bits of the exponent))/2 - 1
- 32 bit: (2^8)/2 - 1 = 127
- 64 bit: (2^11)/2 - 1 = 1023
- 128 bit: (2^15)/2 - 1 = 16383
Program to format IEEE-754 binary 64 bit format
This program takes a decimal number from the command line and formats it in the IEEE-754 binary 64 bit format. It outputs various data along the way showing information about the process. The formatted data is placed in a "long integer" which is part of a union that also contains a "double" of the same size. At the end of the program the "double" is printed to verify the format.
To Compile:
clang -o formatF2b formatF2b.c -lm
or
gcc -o formatF2b formatF2b.c -lm
Conversion Demo Code
1 ///////////////////////////////////////////////////////////////////////
2 //
3 // formatF2b.c
4 //
5 // Written by: Don Dugger
6 // Date: 2024-01-06
7 //
8 // Copyright (C) 2024 Acme Software Works, Inc.
9 //
10 // Redistribution and use in source and binary forms, with or without
11 // modification, are permitted provided that the following conditions are met:
12 //
13 // 1. Redistributions of source code must retain the above copyright notice,
14 // this list of conditions and the following disclaimer.
15 //
16 // 2. Redistributions in binary form must reproduce the above copyright notice,
17 // this list of conditions and the following disclaimer in the documentation
18 // and/or other materials provided with the distribution.
19 //
20 // THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
21 // AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
22 // IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
23 // ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
24 // LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
25 // CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
26 // GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
27 // HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
28 // LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
29 // OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
30 //
31 ///////////////////////////////////////////////////////////////////////
32
33 #include <stdio.h>
34 #include <stdlib.h>
35 #include <assert.h>
36 #include <stdint.h>
37 #include <unistd.h>
38 #include <libgen.h>
39 #include <stdbool.h>
40 #include <math.h>
41
42 ///////////////////////////////////////////////////////////////////////
43
44 long fraction2bin(double decimal, int size);
45
46 ///////////////////////////////////////////////////////////////////////
47
48 #define FORMAT_SIZE 64
49 #define MANTISSA_SIZE 52
50 // The BIAS is (2^(number of bits in the exponent))/2 - 1
51 #define BIAS 1023
52
53 ///////////////////////////////////////////////////////////////////////
54 ///
55 /// main
56 ///
57 /// @brief This program takes a decimal number from the command line
58 /// and formats it in the IEEE-754 binary 64 bit format.
59 /// It outputs various data along the way showing information
60 /// about the process.
61 /// The formatted data is placed in a "long integer" which is
62 /// part of a union that also contains a "double" of the same
63 /// size. At the end of the program the "double" is printed to
64 /// verify the format.
65 ///
66 /// @return Zero
67 ///
68 ///////////////////////////////////////////////////////////////////////
69 ///
70 int main(int argc, char* argv[])
71 {
72     /// This union is used to verify the result (assumes a 64 bit long)
73     union {
74         double f;
75         long i;
76     } u;
77     ///////////////////////////////////////////////////////////////////
78     /// Get the floating point number that will be formatted
79     if (argc < 2) { fprintf(stderr, "Usage: %s <number>\n", argv[0]); return 1; }
80     double original = strtod(argv[1], NULL);
81
82     ///////////////////////////////////////////////////////////////////
83     /// Check the sign of the input and remove it from the input
84     bool negative = false;
85     if (original < 0) {
86         negative = true;
87         original *= -1;
88     }
89
90     ///////////////////////////////////////////////////////////////////
91     /// Extract the integer portion and fraction of the input
92     double integer;
93     double fraction = modf(original,&integer);
94     long int_portion = (long)integer;
95
96     ///////////////////////////////////////////////////////////////////
97     /// Determine the exponent
98     long exponent = 0;
99     if ( int_portion == 0 ) {
100         /// If the input is less than one, the exponent will be a
101         /// negative number. In that case, decreasing powers of two
102         /// are tested against the fraction. The last power that is
103         /// still greater than the fraction is one greater than the
104         /// proper exponent. The negative exponent can not go below
105         /// -BIAS, so that is where the testing stops.
106         for (int idx=0;idx>-BIAS;--idx) {
107             if ( pow(2,idx) > fraction) {
108                 exponent = idx-1;
109             }
110         }
111     } else {
112         /// If the input is one or greater, the exponent will be
113         /// zero or positive. In that case the exponent is the
114         /// position of the most significant one in the binary
115         /// integer portion of the input, which is the largest
116         /// "idx" for which 2 to the power of "idx" is not greater
117         /// than the integer portion.
118         for (int idx=0; idx<=BIAS; ++idx) {
119             if ( int_portion >= pow(2,idx)) {
120                 exponent = idx;
121             }
122         }
123     }
124
125     ///////////////////////////////////////////////////////////////////
126     /// Determine the mantissa
127     /// The mantissa times two raised to the exponent is the way the
128     /// formatted floating point will be evaluated. So reversing the
129     /// process will yield the mantissa.
130     double mantissa = original / pow(2,exponent);
131
132     ///////////////////////////////////////////////////////////////////
133     /// Now output the result.
134     printf("Exponent = %ld\n",exponent);
135     printf("Integer = %ld\n",int_portion);
136     printf("Fraction = %lf\n",fraction);
137     printf("Mantissa = %lf\n",mantissa);
138
139     ///////////////////////////////////////////////////////////////////
140     /// Now place the results in the integer element of the union to
141     /// verify the results.
142     long bin_exp = ((exponent + BIAS) & ((BIAS*2)+1));
143     printf("==============================\n");
144     printf("Sign = %d\n", negative ? 1 : 0);
145     // The 64 bit exponent is 11 bits (%b needs C23 printf support)
146     printf("Binary Exponent = %11.11b\n",(unsigned)bin_exp);
147     u.i = fraction2bin(mantissa,MANTISSA_SIZE);
148     printf("Binary Mantissa = ");
149     for (long idx = MANTISSA_SIZE-1;idx>=0;--idx) {
150         printf("%d", (((long)1 << idx) & u.i) ? 1 : 0);
151     }
152     printf("\n");
153     bin_exp <<= MANTISSA_SIZE;
154     u.i |= bin_exp;
155     if ( negative ) {
156         u.i |= (long)((uint64_t)1 << (FORMAT_SIZE-1));
157     }
158     printf("==============================\n");
159     printf("Results = %lf\n",u.f);
160
161     return 0;
162
163 } // End of main()
164
165 ///////////////////////////////////////////////////////////////////////
166 ///
167 /// @brief This is a basic decimal fraction to binary fraction
168 /// converter.
169 /// The algorithm works by first removing the integer
170 /// portion of the decimal number and then multiplying
171 /// the fraction by two. If the result is one or greater,
172 /// the bit in the binary result is set to one and the one
173 /// is subtracted. It then moves to the next bit, repeating
174 /// the process until it reaches the size requested. It
175 /// starts at the most significant bit of the output and
176 /// works its way down to the least significant bit.
177 ///
178 /// @param fraction The decimal with the one in place.
179 /// @param size The size of the binary fraction.
180 ///
181 /// @return The binary fraction.
182 ///
183 ///////////////////////////////////////////////////////////////////////
184 ///
185 long fraction2bin(double fraction, int size)
186 {
187     // The leading one is removed by modf() below.
188     double integer;
189     long bin = 0;
190     long pos = 1;
191     fraction = modf(fraction,&integer);
192     pos <<= (size-1);
193     for (int idx=0;idx<size;++idx) {
194         fraction *= 2;
195         if ( fraction >= 1.0 ) {
196             bin |= pos;
197             fraction -= 1.0;
198         }
199         pos >>= 1;
200     }
201
202     return bin;
203 } // End of fraction2bin()
204
205 ///////////////////////////////////////////////////////////////////////
206 //////////////////////// End of File //////////////////////////////////
207 ///////////////////////////////////////////////////////////////////////
208
For more information: info@acmesoftwareworks.com