ASW
Floating Point
 

IEEE Standard for Floating-Point Arithmetic (IEEE 754)

Description

IEEE-754 is the standard that defines the formats computers use to store and manipulate floating point numbers. The standard covers 8 formats: 5 binary, ranging from 16 to 256 bits, and 3 decimal, ranging from 32 to 128 bits. Only the binary formats in the 32, 64 and 128 bit sizes will be discussed. The binary formats are all very similar to each other. The decimal formats are different and are not used by the C/C++ compilers that will be used here. The LLVM and GNU C/C++ compilers also do not use the IEEE-754 128 bit format for the long double type in all cases. There are options for using 128 bit formats, however they will not be covered here. The C++23 standard (ISO/IEC 14882:2024) also adds fixed width floating point types. The current C/C++ standards also define three fundamental floating point types: float, double and long double.

The 80 bit extended precision format is used to take advantage of the FPU (Floating Point Unit) built into x86 processors. The GNU 128 formats are discussed here.
It is the three IEEE-754 formats that will be covered here.

For general information and history visit IEEE 754 - Wikipedia


The IEEE-754 Formats

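32 Bit: 1 sign bit, 8 exponent bits, 23 mantissa bits

64 Bit: 1 sign bit, 11 exponent bits, 52 mantissa bits

128 Bit: 1 sign bit, 15 exponent bits, 112 mantissa bits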




How to convert to the 32 bit format

The floating point number 34.25
  1. Convert the number to binary

        34.25 = 0b100010.01

  2. Move the binary point and determine the exponent

        Move the binary point to the position directly behind the leading one.
        0b100010.01 = 0b1.0001001 x 2^5
           54321<
        The exponent is the number of positions shifted plus the 32 bit bias of 127.
        exponent = 127 + 5 = 132
        binary exponent = 0b10000100
        Note: In cases where the leading 1 is to the right of the binary point,
              the number added to the bias is negative.

        One technique for determining the exponent is to compare the number with
        successive powers of two [2^N] until the first power greater than the number is found.
        The code below, for the 64 bit format, uses this technique. Lines 118-122 determine a
        positive exponent and lines 106-110 determine a negative exponent.

  3. Remove the leading 1 and determine the mantissa

        The number after the shift to determine the exponent is:
            0b1.0001001
        Remove the leading 1 and the binary point and fill in the remaining bits with zeros
        to determine the mantissa
            0b00010010000000000000000

  4. Add the sign bit followed by the 8 bit exponent and then the 23 bit mantissa

            If the number is positive the sign is 0, if negative it's 1
                0 10000100 00010010000000000000000
        

This process is followed for all three of the formats. The only differences are the sizes of the exponent and mantissa, and the bias.

The bias is determined by the size of the exponent:

(2^(number bits of the exponent))/2 - 1

  • 32 bit: (2^8)/2 - 1 = 127
  • 64 bit: (2^11)/2 - 1 = 1023
  • 128 bit: (2^15)/2 - 1 = 16383

Program to format IEEE-754 binary 64 bit format

This program takes a decimal number from the command line and formats it in the IEEE-754 binary 64 bit format. It outputs various data along the way showing information about the process. The formatted data is placed in a "long integer" which is part of a union that also contains a "double" of the same size. At the end of the program the "double" is output to verify the format.

To Compile:


        clang -o formatF2b formatF2b.c -lm
        
       or

        gcc -o formatF2b formatF2b.c -lm
        



Conversion Demo Code

Download C Source

     1	///////////////////////////////////////////////////////////////////////
     2	// 
     3	// formatF2b.c
     4	// 
     5	// Written by: Don Dugger
     6	//       Date: 2024-01-06
     7	//
     8	// Copyright (C) 2024 Acme Software Works, Inc.
     9	//
    10	// Redistribution and use in source and binary forms, with or without
    11	// modification, are permitted provided that the following conditions are met:
    12	// 
    13	// 1. Redistributions of source code must retain the above copyright notice,
    14	//    this list of conditions and the following disclaimer.
    15	// 
    16	// 2. Redistributions in binary form must reproduce the above copyright notice,
    17	//    this list of conditions and the following disclaimer in the documentation
    18	//    and/or other materials provided with the distribution.
    19	// 
    20	// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
    21	// AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
    22	// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
    23	// ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
    24	// LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
    25	// CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
    26	// GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
    27	// HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
    28	// LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
    29	// OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
    30	//
    31	///////////////////////////////////////////////////////////////////////
    32	 
    33	#include <stdio.h>
    34	#include <stdlib.h>
    35	#include <assert.h>
    36	#include <stdint.h>
    37	#include <unistd.h>
    38	#include <libgen.h>
    39	#include <stdbool.h>
    40	#include <math.h>
    41	
    42	///////////////////////////////////////////////////////////////////////
    43	
    44	long fraction2bin(double decimal, int size);
    45	
    46	///////////////////////////////////////////////////////////////////////
    47	
    48	#define FORMAT_SIZE   64
    49	#define MANTISSA_SIZE 52
    50	// The BIAS is (2^(number bits of the exponent))/2 - 1
    51	#define BIAS          1023
    52	
    53	///////////////////////////////////////////////////////////////////////
    54	///
    55	///  main 
    56	/// 
    57	///  @brief  This program takes a decimal number from the command line
    58	///          and formats it in the IEEE-754 binary 64 bit format.
    59	///          It outputs various data along the way showing information 
    60	///          about the process. 
    61	///          The formatted data is placed in a "long integer" which is
    62	///          part of a union that also contains a "double" of the same
    63	///          size. At the end of the program the "double" is output to
    64	///          verify the format.
    65	///
    66	///  @return Zero on success
    67	///
    68	///////////////////////////////////////////////////////////////////////
    69	///
    70	int main(int argc, char* argv[])
    71	{
    72	    /// This union is used to verify the result
    73	    union {
    74	        double f;
    75	        long i;
    76	    } u;
    77	    ///////////////////////////////////////////////////////////////////
    78	    /// Get the floating point number that will be formatted
    79	    if (argc < 2) { fprintf(stderr,"Usage: %s number\n",argv[0]); return 1; }
    80	    double original = strtod(argv[1],NULL);
    81	
    82	    ///////////////////////////////////////////////////////////////////
    83	    /// Check the sign of the input and remove it from the input
    84	    bool negative = false;
    85	    if (original < 0) {
    86	        negative = true;
    87	        original *= -1;
    88	    }
    89	
    90	    ///////////////////////////////////////////////////////////////////
    91	    /// Extract the integer portion and fraction of the input
    92	    double integer;
    93	    double fraction = modf(original,&integer);
    94	    long int_portion = (long)integer;
    95	
    96	    ///////////////////////////////////////////////////////////////////
    97	    /// Determine the exponent
    98	    long exponent = 0;
    99	    if ( int_portion == 0 ) {
   100	        /// If the input is less than one, the exponent will be a
   101	        /// negative number. In that case, decreasing powers of two
   102	        /// are tested while they are greater than the fraction. The
   103	        /// last such power is one greater than the proper exponent.
   104	        /// The negative exponent cannot go below -BIAS, so that
   105	        /// bounds the last exponent tested.
   106	        for (int idx=0;idx>-BIAS;--idx) {
   107	            if ( pow(2,idx) > fraction) {
   108	                exponent = idx-1;
   109	            }
   110	        }
   111	    } else {
   112	        /// If the input is one or greater, the exponent will be
   113	        /// zero or positive. In that case the exponent is determined
   114	        /// by the position of the most significant one in the
   115	        /// binary integer portion of the input. When 2 to the
   116	        /// power of "idx" is greater than the integer portion of
   117	        /// the input, the most significant one is at idx-1.
   118	        for (int idx=1; idx<=BIAS; ++idx) {
   119	            exponent = idx-1;
   120	            /// Stop at the first power of two above the integer portion
   121	            if ( int_portion < pow(2,idx)) break;
   122	        }
   123	    }
   124	
   125	    ///////////////////////////////////////////////////////////////////
   126	    /// Determine the mantissa
   127	    ///  The mantissa times two raised to the exponent is the way the
   128	    ///  formatted floating point will be evaluated. So reversing the
   129	    ///  process will yield the mantissa.
   130	    double mantissa = original / pow(2,exponent);
   131	
   132	    ///////////////////////////////////////////////////////////////////
   133	    /// Now output the result.
   134	    printf("Exponent = %ld\n",exponent);
   135	    printf("Integer  = %ld\n",int_portion);
   136	    printf("Fraction = %lf\n",fraction);
   137	    printf("Mantissa = %lf\n",mantissa);
   138	
   139	    ///////////////////////////////////////////////////////////////////
   140	    /// Now place the results in the integer element of the union to
   141	    /// verify the results.
   142	    long bin_exp = ((exponent + BIAS) & ((BIAS*2)+1));
   143	    printf("==============================\n");
   144	    printf("Sign            = %d\n", negative ? 1 : 0);
   145	    // The 64 bit exponent is 11 bits; %b needs C23 printf (glibc 2.35+)
   146	    printf("Binary Exponent = %11.11b\n",(int)bin_exp);
   147	    u.i = fraction2bin(mantissa,MANTISSA_SIZE);
   148	    printf("Binary Mantissa = ");
   149	    for (long idx = MANTISSA_SIZE-1;idx>=0;--idx) {
   150	        printf("%d", (((long)1 << idx) & u.i) ? 1 : 0);
   151	    }
   152	    printf("\n");
   153	    bin_exp <<= MANTISSA_SIZE;
   154	    u.i |= bin_exp;
   155	    if ( negative ) {
   156	        u.i |= (long)((unsigned long)1 << (FORMAT_SIZE-1));
   157	    }
   158	    printf("==============================\n");
   159	    printf("Results = %lf\n",u.f);
   160	
   161	    return 0;
   162	
   163	} // End of main()
   164	
   165	///////////////////////////////////////////////////////////////////////
   166	/// 
   167	/// @Brief This is a basic decimal fraction to binary fraction 
   168	///        converter.
   169	///        The algorithm works by first removing the integer
   170	///        portion of the decimal number and then multiplying 
   171	///        the fraction by two. If the result is greater than or
   172	///        equal to one, the bit in the binary result is set to
   173	///        one and one is subtracted. The process repeats for the
   174	///        next bit until the requested size is reached. It starts
   175	///        at the most significant bit of the output and works its
   176	///        way down to the least significant bit.
   177	///
   178	/// @Param The decimal with the leading one in place.
   179	/// @Param The size of the binary fraction in bits.
   180	///
   181	/// @return The binary fraction.
   182	///
   183	///////////////////////////////////////////////////////////////////////
   184	///
   185	long fraction2bin(double fraction, int size)
   186	{
   187	    // modf() below discards the leading one (the integer portion)
   188	    double integer;
   189	    long bin = 0;
   190	    long pos = 1;
   191	    fraction = modf(fraction,&integer);
   192	    pos <<= (size-1);
   193	    for (int idx=0;idx<size;++idx) {
   194	        fraction *= 2;
   195	        if ( fraction >= 1.0 ) {
   196	            bin |= pos;
   197	            fraction -= 1.0;
   198	        }
   199	        pos >>= 1;
   200	    }
   201	
   202	    return bin;
   203	} // End of fraction2bin()
   204	
   205	///////////////////////////////////////////////////////////////////////
   206	//////////////////////// End of File //////////////////////////////////
   207	///////////////////////////////////////////////////////////////////////
   208	
         

For more information: info@acmesoftwareworks.com