Number representations - IEEE floating point, 2's compliment,biased ..

2's compliment

Fig 2's compliment representation

Positive number in 2s compliment
sign bit an-1 is zero
n-1 bits are allowed for the magnitude of number so total distinct positive numbers will be 2^(n-1)
numbers range from all zero to all ones(2^0+2^1+2^2 ...2^(n-2)) i.e from 0 to 2^(n-1) - 1

Negative Number in 2s compliment
sign bit an-1 is 1
for magnitude of number all bits are used in calculation including the most significant an-1 bit
Calculation of magnitude considers most significant bit as negative and rest as positive so
magnitude is calculated as -2^(n-1)+ sum(0 to n-2)ai*2^i
smallest number in negative that can be represented is when all the n-1 bits are zero and only the msb contributes that is -2^(n-1)
Maximum number in negative 2s compliment is when all n-1 bits are 1 and they reduce the effect of msb, it is -2^(n-1) + (2^n-2 + 2^(n-3) ....2^1+2^0) = -1

Observations
single representation of zero, when all n bits are zero
we can represent -2^(n-1) but not +2^(n-1) in 2s compliment
when bits of positive number n are complimented(1s compliment) it becomes -(n+1) in decimal example
+3 is 0011, its compliment 1100 is -4
+7 is 0111 its compliment 1000 is -(8)
To obtain the negation of the number that is +n to -n we need to first take 1s compliment which will give us -(n+1) then we can add single digit to get the required number -n
negation of zero will give zero only if we ignore the carry, exception case is for negation of -2^n-1 also where all bits will become zero , we get same number -2^n-1

Arithmetic Operations in 2s compliment

Addition
discard the extra bit when adding , two 4bit numbers discard the 5 msb if any of result. It is different from overflow which gives incorrect result.
when result is large than can be handled in word size we call it overflow. ALU must signal not to use the result.
example of overflow
add 1001 //-7
1010 //-6
10011 // +3
Overflow can occur whether or not there is carry
overflow occurs when sign of result for two positive numbers or two negative numbers is opposite

Multiplication
//booth algorithm

Division

Overflow
when adding unsigned binary numbers, overflow occurs when the final carry-out is a 1
In 2's complement, the final carry-out is always ignored, and overflow occurs when there is a sign change

One's Complement
used to represent negative numbers
positive numbers same as in sign magnitude system (with first bit zero )
negative number is obtained by inverting the bits of positive number
zero has two representation +0 and -0
range of number sis -(2^n-1 - 1) to (2^n-1 -1)

Floating Point Representation

for numbers too large or too small
number is represented approx to fixed number of significant digits and scaled using exponent
way to represent scientific numbers
A floating point number consist of
1) significand mantissa or coefficient signed digit string.
significand length determines to which precision number can be represented, prportional to accuracy
2)a signed integer exponent scale factor proportional to range, more bits to the exponent means more range

|---|--------------|---------------------|
|---|--------------|---------------------|
sign(1bit) E(8bit) Significand(23bit)

Fig. 32 bit floating point Format

S*B^e
B is bias, implicit, same for all numbers. bias value is typically 2^e-1 - 1
radix point is assumed to be after MSB bit of signigicand

advantage of biased representation is that non negative floating point numbers can be treated as integers for comparison purpose

sign: 0 positive, 1 negative

exponent: stored in biased form, that is, bias is subtracted from the field to get true exponent
exponent is 8 bits means number from 0 to 255 with bias it becomes -127 to +128

Significand: stored in normalized form, msb should be one and radix point should be after msb
msb one is taken as implicit and need not be stored, so 23 bit of significand can represnt 24 bits
In normalized value of significand lies in [1,2)

IBM/360 floating point format
total length 32 bits
exponent 7 bits
fraction 24 bits
bias 2^7-1 = 64 unlike IEEE, 2^n-1 -1 bias

base is 16 unlike IEEE base-2
there are no hidden 1 significand
so normalization is to 0.bbbb only
If number is represented in IBM/360 floating point like
1/0 eeeeeee fffffff..

fraction has no hidden 1.
we get the decimal for the binary representation exponent and raise it to the base 16
(sign) 0.ffff * 16 ^exponent

IEEE 754 floating point standard

single precision: 1 bit sign, 8 bit exponent, 23 bit fraction (24 bit significand = 1+ fraction) (“1” is implied)
double precision: 1 bit sign, 11 bit exponent, 52 bit fraction (53 bit significand)

Fractional portion of the significand represents a fraction between 0 and 1 (each bit from left to right has weight 2-1, 2-2, …)
Since 0 has no leading 1, it is given the reserved exponent value 0 so that hardware won’t attach a leading 1 to it
Exponent is “biased” (biased notation) to make sorting easier
all 0s is smallest exponent (most negative) all 1s is largest (most positive)
bias of 127 for single precision and 1023 for double precision

(–1)^sign × (1 + fraction) × 2^{exponent – bias}

Exponent and mantissa of all zeros and ones are special values

Range of exponents
exponent of all zero(0) and all ones(255) defines special values
For single precision
it can range from 1 to 254 and in bias we would say -126 to +127
for double precision
-1022 to +1023

Mantissa number of bits
A normalized number would add a hidden 1 making the mantissa of 2
24 bit for single precision
53 bit for double precision

Denomalized or subnormal number
demormal numbers have hidden bit of significand as 0
have exponent of all zeros

0.0         0 or 1   00000000   00000000000000000000000
                                (hidden bit is a 0)

subnormal   0 or 1   00000000   not all zeros
                                (hidden bit is a 0)

normalized  0 or 1    > 0        any bit pattern
                                (hidden bit is a 1)

Representation of Zero
Two representation of zero +0 and -0

exponent and mantissa all 0

in hex 0x0000 0000 and 0x8000 0000

Representation of infinity and NaN
• +∞ an -∞ are denoted with an exponent of all 1's

s  e        f
  +∞     0 11111111 00000... (0x7f80 0000)
  -∞     1 11111111 00000... (0xff80 0000)
  NaN    ? 11111111 (not all zero)

(S is either 0 or 1, E=0xff, and F is anything but all zeros)

significand fraction lies betwee 1/B <= s <=1
where B is base
for base 16 fraction lies betwee 1/16 and 1 inclusive
for base 2 its .5 to 1 inclusive

Error in representing recurring fractions

xT – true value
xA – approximate value

True Error in xA (exact value of the error) = E(x) = xT - x A
True Relative Error in xA = xT-xA / xT
True Percentage Relative Error in xA = xT-xA / xT * 100

Problem Consider 16 bit floating point format with 1 bit for sign 5 for exponent and 10 for fraction exponent is in excess 15 representation
a) decimal for number in this format
0 01001 0101000000
Solution
sign bit is positive
exponent is in excess 15 so e+15 = 9
e=-6
so the value is 1.0101₂ * 2^-6 which is (1 + 1/4 + 1/16)/64 = 2.05078 * 10^-2

b)convert decimal number -50.75 in above representation
50₁₀ = 110010₂ and .75₁₀ = .11₂
so -50.75 = -1.1001011 * 2⁵
biased exponent value will be 15 + 5 = 20₁₀= 10100₂
leaving the first bit in significand
Number is 1 10100 1001011000

Problem 6 bit wide number represented in 2's compliment what is maximum and minimum number
Solution Minimum when first bit is 1 i.e -2^n-1 = -32
Max number when first bit is zero and sum of all n-1 bits i.e 2^n-1 -1 = 31

Problem Consider bit pattern 1010 1100 1011 0101 0011 0000 0011 1000 what value does it represent in
1)2's compliment
since number is negative(msb is 1) we find the value by
1)subtracting 1
2)fliping bits

2)single precision floating point
sign = negative
exponent 8 bits after sign 010 11001, its biased so we have to subtract 127 from it
i.e 89-127 = -38
significand is remaining bits with an implicit 1 so
1.011 0101 0011 0000 0011 1000
number is - 1.011 0101 0011 0000 0011 1000 * 2^-38

Problem Convert - 1101001.10011 to IEEE format

Solution
a) Normalize : -1.10100110011 * 2 ^6
b)exponent = 6+127= 133 = 10000101
significand 10100110011 and remaining bits goes zero

Problem Consider IBM base-16 single precision format
represent decimal number 2.25 in this format

Solution

1)get sign bit, 0
2)express number in binary
(2.25)d = (10.01)b

Base 16 format means number is represented

Offset Binary Numbers (Excess-K)

all zeros corresponds to minimal negative value 00000000..
all ones correspond to maximal positive value 111111...

excess 2^n-1
take an example where 4 bit number is stored in excess 8
4 bit number 1111
in unsigned notation its 15 in decimal
in offset biased notation the number is 15 - 8 =7
so decimal range for number is from -8, all zero's 0 - 8
and max would be 7, when all 1's 15-8
zero would be represented by 1000 because 8- 8(excess-8)

comparison to 2's compliment
If we look at the minimum and maximum number above we can see that if we invert the first bit we get the same number in 2's compliment
like minimum -8 is 0000 in excess 8
invert first bit 1000 and interpret as 2's compliment its same
for 8 we have 0111 in excess-8
invert first bit 1111 and in 2's compliment it is 8

Example
excess 2^n-1 is used in IBM mainframe 360/370 to store the exponent
excess (2^n-1 - 1) is used in IEEE 754 to store the exponent

converting IEEE to decimal number
IEEE single precision 1 bit sign 8 bit exponent rest signigicand
number in significand lies between 1/2 <=s <= 1

For single precision
total length = 32 bits
exponent = 8 bits
bias = 2^8-1 -1 =127
fraction is 32 - (8+1) = 23 bits

for double precision
total lenght = 64 bits
exponent= 11bits
bias is 2^11-1 -1 =1023
fraction is 64 - (11+1) = 52
steps
1)get the sign bit from
2)get the next 8 bits for the exponent from the representation
this is in excess 2^8-1 -1 i.e excess-127 for single precision
3)subtract 127 from the exponent value
e = extracted exponent - 127
4) get the 24 bit fraction f the representation
and one to the fraction 1.f
now number is
(sign) (1.f) * 2^e

Convert from binary to IEEE format

convert from decimal to IEEE
1)convert decimal number into binary
2)normalize it form 1.bbb * 2^e
3) find the exponent by adding bias
4)add bbbb... to significand and store exponent
when bbb... lenght is less than number of bits to store store remaining zeros
(sign) (exponent) bbbbbbbbbb

Example

1)10.6
10 = 1010 , .6=10011001...
For single precision
1) so number in binary is 1010.10011001 with repeating pattern 1001
2) normalize it, 1.01010011001... * 2^3
3) exponent is 3+127 = 130, in binary 10000010
4) sign is 0
fraction becomes repeating pattern for total length of 23
0 10000010 01010011001..

Converting decimal to IBM/360 floating point format

Steps
1)get the sign bit of the number
2)express decimal number in binary notation
3)normalize it, its base-16 so we move 4 bits at a time instead of 1
since we are moving radix point in binary *2^4(16) will move radix right 4 places
and if we moved n times exponent is n
like for binary 10.01 we represent it in normal base 16 as
.001001 * 16^1
for binary .00001
.1*16^-1
4)fraction is to the right of radix point in normalized form
5)exponent is 64+power of base-16, express it in binary

Example
2.25
1)sign = +
2)binary for 2.25 decimal = 10.01
3)normalized value = .001001 * 16^1
4)fraction value is 001001 and rest zeros for 24 bits total
5)exponent is 64 + 1 = 65, in binary 1000001
6) number is 0 1000001 001001000000000000000000

Floating Point Arithmetic
1)Align radix point, exponent values should be same for both opernads
2)perform operation
3)

Overflow and Underflow

Overflow
overflow is check by checking exponent value before and during normalization
positive exponent exceeds the maximum possible exponent value
represented by setting to infinity
Underflow
occurs when number is too small(near zero) and getting a 1 before radix point in normalized form causes exponent field to be zero
underflow may result in demormalized values where hidden bit will be zero and exponent wil be all zeros
precision would be reduced

largest negative number(magnitude)
exponent all 1's = 255-127 = 128
fraction all 1's = (.111...) = (1/2)^1 +(1/2)^2 .....(1/2)^23 = ((1/2)^24 - (1/2))/(-1/2) = -2^-23+1
significand would be 2-2^-23
sign -1
number is -(2-2^-23) *2^128
Negative overflow
numbers less than -(2-2^-23) *2^128 are negative underflow
smallest magnitude negative number
exponent all 0's = -127
fraction all 0's =0
significand =1+0 =1
number - 2^-127
Negative underflow
numbers greater than - 2^-127 are negative underflow
smallest positive number expressible
exponent all zeros
fraction all zeros
significand 1
number 2^-127
Positive underflow
numbers less than 2^-127 are positive underflow
largest positive number
all exponent bits 1
all fraction bits 1 = 1-2^-23
significand = 2-2^-23
number (2-2^-23)*2^128

Rouding
when number cannot represented in limited precision bits, we round the results
there are number of ways of rounding
IEEE standard requires
ALU register also has three extra bits other than significand 1.bbbbb
extra bits in order from lsb
1)guard bit
2)round bit
3)sticky bit

when mantissa has to be shifter to align the radix point, bits that fall off the lsb goes into extra bits.These bits can also be set in division and multiplication
guard and round bits are for precision
if sticky bit is set to 1 it remains at 1 to indicate what was beyond lsb

round toward 0. truncates any bits that the representation does not hold. result is approximate value represented err on the side of being closer to 0.
round toward positive infinity. chooses representations that err on the side of being to "the right" as viewed on a number line.
round toward negative infinity. chooses representations that err (for representing approximate values) on the side of being to "the left" as viewed on a number line.
round to nearest. This method attempts to distinguish which approximate value that may be represented is closer to the desired value.

Number Conversions

All Operations on 7bit number
Decimal Number to 1's compliment form
1)find the representation for positive number that is 127
2)negate the bits
Represent -127 in 1's compliment
solution
representation of 127 is 0111111 so 1's compliment is negation of this that is1000000
Represent 0 in 1's compliment
1's compliment has two representation for zero +0(all zero's) and -0(will all 1's)

Decimal Number to signed representation
1)include sign bit (msb) =1 when number is negative else 0
2)represent the number with remaining n-1 bits in normal form

Represent -23 in signed representation
1)msb =1
2)binary of 23 is 16+4+2+1 = 010111
so signed number is 1010111

Decimal Number to 2's compliment
for negative number
1)find binary for +N
2) -N = (1's compliment[n] +1)

Note remember to convert number to required number of bits

example if we convert 23 to 10111 in 5 bits instead of 6 and perform operation like
23 = 10111
2's compliment = (01000 +1) = 01001 it gives positive number which is incorrect

represent -23
1)23=010111

2)1's compliment of 23= 1101000
2's compliment is 1101001

2's compliment to decimal number
For negative
if number has more 1's then zero then
1)B = (1's compliment B + 1)
example represent 2's compliment 10111 in decimal
B = -(01000 +1) = -(01001) = -9

http://pages.cs.wisc.edu/~cs354-1/cs354/karen.notes/flpt.apprec.html
www.cs.sunysb.edu/~cse220/Homework/HW1-sol.pdf //IEEE and floating point