Floating point number

A floating point number belongs to a particular subset of the rational numbers and is a numerical representation used in computers to approximate real numbers. Specifically, such a number is obtained by multiplying an integer or a fixed-point number (the mantissa) by an integer power of some radix (usually 2 in computers). This representation is similar to scientific notation in base 10.

Brief introduction


Floating point calculation

Floating point calculation refers to operations involving floating point numbers, and is usually accompanied by approximation or rounding.
A floating point number a is represented by two numbers m and e: a = m * b^e. In any such system we choose a base b (the radix of the number system) and a precision p (how many digits are stored). m (the mantissa) has the form ±d.ddd...ddd with p digits, where each digit is an integer between 0 and b-1 inclusive. If the leading digit of m is non-zero, m is said to be normalized. Some descriptions use a separate sign bit (s, representing + or -), in which case m must be non-negative. e is the exponent.
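To make this concrete, the C++ standard library can decompose a number into exactly such an m and e (with b = 2). A minimal sketch using std::frexp and std::ldexp:

#include <cmath>
#include <cstdio>

int main() {
    double a = 6.5;                 // 6.5 = 0.8125 * 2^3
    int e = 0;
    double m = std::frexp(a, &e);   // splits a into m * 2^e with 0.5 <= |m| < 1
    std::printf("%g = %g * 2^%d\n", a, m, e);
    std::printf("recomposed: %g\n", std::ldexp(m, e));  // m * 2^e restores a
    return 0;
}

Note that frexp returns a mantissa in [0.5, 1) rather than the normalized [1, 2) form; the two conventions differ only by one unit in the exponent.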

Structure

A floating point number is represented in the computer with the following structure:
Exponent part (a fixed-point integer): exponent sign ±, exponent e
Mantissa part (a fixed-point fraction): mantissa sign ±, mantissa m
This design can represent, within a fixed-length storage space, a larger range of numbers than fixed-point numbers can.
Floating point addition and subtraction
Let two floating point numbers x and y be
x = Mx * 2^Ex
y = My * 2^Ey
where Ex and Ey are the exponents of x and y, and Mx and My are their mantissas.
The rule for adding or subtracting the two numbers is:
if Ex <= Ey, then x ± y = (Mx * 2^(Ex - Ey) ± My) * 2^Ey.
Floating point addition and subtraction generally proceed in four steps:
1. Check for zero operands;
2. Compare the exponents and align them;
3. Add or subtract the mantissas;
4. Normalize and round the result.
(1) Zero-operand check
Floating point addition and subtraction are more complex than fixed-point operations. If either operand x or y is found to be 0, the result is known immediately and the subsequent steps become unnecessary, saving operation time. The zero-operand check performs this function.
(2) Compare the exponents and align
To add or subtract two floating point numbers, first check whether their exponents are equal, that is, whether the decimal points are aligned. If the exponents are equal, the decimal points are aligned and the mantissas can be added or subtracted directly. If the exponents differ, the decimal points are not aligned, and the two exponents must first be made equal; this process is called exponent alignment.
To align, first compute the difference between the two exponents Ex and Ey:
ΔE = Ex - Ey
If ΔE = 0, the two exponents are equal, i.e. Ex = Ey; if ΔE > 0, then Ex > Ey; if ΔE < 0, then Ex < Ey.
When Ex ≠ Ey, a mantissa is shifted to change Ex or Ey until they are equal. In principle, either Mx could be shifted to change Ex, or My could be shifted to change Ey. However, since floating point numbers are usually kept normalized, shifting a mantissa left would lose its most significant bits and cause a large error, whereas shifting it right loses only the least significant bits and causes a smaller error. The alignment rule therefore shifts a mantissa to the right while increasing its exponent accordingly, leaving the value unchanged. Since the increased exponent must end up equal to the other one, it must be the smaller of the two; hence, during alignment, the mantissa of the number with the smaller exponent is always shifted right.
(3) Mantissa addition
After alignment, the mantissas can be summed. Whether the operation is an addition or a subtraction, it is carried out as an addition, in exactly the same way as fixed-point addition and subtraction.
(4) Result normalization
In floating point addition and subtraction, the mantissa sum may come out in the form 01.x...x or 10.x...x, that is, with two unequal sign bits. In fixed-point addition and subtraction this is called overflow and is not allowed, but in floating point arithmetic it indicates that the absolute value of the mantissa sum is greater than 1, so the result has broken normalization to the left. Shifting the result right to restore normalization is called right normalization; its rule is: shift the mantissa right by 1 bit and add 1 to the exponent. Conversely, when the mantissa is not in normalized form (its most significant bit is 0), it must be shifted left, with the exponent decreased by 1 for each shift, until it is normalized; this is called left normalization.
(5) Rounding
During alignment or right normalization, the mantissa is shifted right, and the low-order bits shifted out are lost, introducing error, so rounding is required.
There are two simple rounding methods. One is "discard on 0, carry on 1": if the highest bit lost during the right shift is 0, it is simply discarded; if it is 1, add 1 to the least significant bit of the mantissa. The other is the "always set 1" method: whenever bits have been shifted out, the least significant bit of the remaining mantissa is set to 1.
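A small sketch of these two schemes, rounding an 8-bit value down to its 4 high bits (the bit widths and variable names here are illustrative only):

#include <cstdio>

int main() {
    unsigned int m = 0xBE;        // 1011 1110: keep the 4 high bits, drop 4
    unsigned int kept = m >> 4;   // plain truncation: 1011

    // "discard on 0, carry on 1": add 1 if the highest dropped bit is 1
    unsigned int carry = kept + ((m >> 3) & 1u);

    // "always set 1" (sometimes called von Neumann rounding): force the LSB to 1
    unsigned int jam = kept | 1u;

    std::printf("truncate=%X carry=%X jam=%X\n", kept, carry, jam);
    return 0;
}

Here the dropped bits are 1110, whose highest bit is 1, so the first scheme rounds 1011 up to 1100, while the second leaves 1011 unchanged because its least significant bit is already 1.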
The IEEE 754 standard provides four selectable rounding methods:
Round to nearest: in essence the familiar "rounding" rule, with ties going to even. For example, if the extra bits beyond the specified 23-bit mantissa are 10010, their value exceeds half of the least significant kept bit, so 1 is added to that bit. If the extra 5 bits are 01111, the tail is simply truncated. In the special case where the extra 5 bits are exactly 10000, the tail is truncated if the least significant kept bit is 0; if that bit is 1, 1 is added so that it carries up and becomes 0.
Round toward 0 (toward the origin of the number axis): simple truncation. Whether the mantissa is positive or negative, truncation makes the absolute value smaller than the original absolute value. This method easily accumulates error.
Round toward +∞: for positive numbers, as long as the extra bits are not all 0, 1 is added to the least significant bit; negative numbers are simply truncated.
Round toward -∞: the opposite of rounding toward +∞. Positive numbers are simply truncated as long as the extra bits are not all 0; for negative numbers, 1 is added to the least significant bit.
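These four modes can be selected from C++ through <cfenv>; a minimal sketch (std::rint rounds according to the currently selected mode):

#include <cfenv>
#include <cmath>
#include <cstdio>

int main() {
    const struct { const char* name; int mode; } modes[] = {
        {"to nearest",  FE_TONEAREST},
        {"toward 0",    FE_TOWARDZERO},
        {"toward +inf", FE_UPWARD},
        {"toward -inf", FE_DOWNWARD},
    };
    volatile double p = 2.5, n = -2.5;  // volatile blocks compile-time folding
    for (const auto& m : modes) {
        std::fesetround(m.mode);        // select the rounding mode
        std::printf("%-12s rint(2.5)=%g rint(-2.5)=%g\n",
                    m.name, std::rint(p), std::rint(n));
    }
    return 0;
}

Under round-to-nearest, the tie cases 2.5 and -2.5 both round to the even neighbor ±2, matching the special case described above.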
(6) Overflow handling
Floating point overflow manifests itself as exponent overflow. During addition and subtraction, check whether overflow has occurred: if the exponent is in range, the operation completes normally; if the exponent overflows, it must be handled accordingly. Mantissa overflow also needs to be handled.
Exponent overflow: the exponent exceeds the largest positive value it can represent; the result is generally treated as +∞ or -∞.
Exponent underflow: the exponent falls below the smallest negative value it can represent; the result is generally treated as 0.
Mantissa overflow: adding two mantissas of the same sign produces a carry out of the most significant bit. The mantissa is shifted right and the exponent increased by 1 to renormalize.
Mantissa underflow: when the mantissa is shifted right, its least significant bits flow out of the right end of the mantissa field, and rounding is required.
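As a summary of steps (1) through (4), here is a minimal toy sketch in C++. The format, the SoftFloat name, and the 32-bit signed mantissa are assumptions for illustration; rounding and exponent overflow/underflow handling are omitted:

#include <cstdint>
#include <cstdio>

// Toy format: value = mant * 2^exp (signed integer mantissa, unbiased exponent).
struct SoftFloat {
    int32_t mant;
    int32_t exp;
};

SoftFloat add(SoftFloat x, SoftFloat y) {
    // (1) Zero-operand check: if one operand is 0, the result is known at once.
    if (x.mant == 0) return y;
    if (y.mant == 0) return x;

    // (2) Exponent alignment: shift the smaller-exponent mantissa right.
    if (x.exp < y.exp) { SoftFloat t = x; x = y; y = t; }
    int32_t d = x.exp - y.exp;
    y.mant = (d > 31) ? 0 : (y.mant >> d);  // low bits of the smaller number are lost

    // (3) Mantissa addition, in a wider type so the carry is not lost.
    int64_t sum = (int64_t)x.mant + (int64_t)y.mant;

    // (4) Normalization: on mantissa overflow, shift right and bump the exponent.
    SoftFloat r = { 0, x.exp };
    while (sum > INT32_MAX || sum < INT32_MIN) { sum >>= 1; r.exp += 1; }
    r.mant = (int32_t)sum;
    return r;
}

int main() {
    SoftFloat a = { 3, 3 };   // 3 * 2^3 = 24
    SoftFloat b = { 5, 1 };   // 5 * 2^1 = 10
    SoftFloat c = add(a, b);
    std::printf("%d * 2^%d\n", c.mant, c.exp);  // prints 4 * 2^3, i.e. 32
    return 0;
}

The exact sum is 34, but the sketch prints 32: the low bits of the smaller operand were truncated during alignment, which is exactly the error that the rounding step of a real implementation exists to control.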

Example


Problem

For example, a decimal floating point number with 4 significant digits and an exponent range of ±4 can be used to represent 43210, 4.321, or 0.0004321, but does not have enough precision to represent 432.123 or 43212.3 (they must be approximated as 432.1 and 43210). In practice, of course, the number of digits used is usually far greater than 4.

Special value

In addition, floating point representations usually include some special values: +∞ and -∞ (positive and negative infinity) and NaN ("Not a Number"). Infinity is used when a number is too large to represent, while NaN indicates an illegal operation or an undefined result.
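A short sketch of how these special values arise and are tested for in C++:

#include <cmath>
#include <cstdio>
#include <limits>

int main() {
    double inf = std::numeric_limits<double>::infinity();
    double nan = std::sqrt(-1.0);                 // an invalid operation yields NaN
    std::printf("1e308 * 10 = %g\n", 1e308 * 10); // too large to represent: inf
    std::printf("isinf=%d isnan=%d\n", std::isinf(inf), std::isnan(nan));
    std::printf("nan == nan: %d\n", nan == nan);  // NaN never compares equal, prints 0
    return 0;
}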

Binary representation

As is well known, all data in a computer are stored in binary, and floating point numbers are no exception. However, the binary representation of a floating point number is not as straightforward as that of a fixed-point number.

Floating point concept

First, to clarify a concept: a floating point number is not necessarily a fraction, and a fixed-point number is not necessarily an integer. A floating point number is one whose decimal point is logically not fixed, whereas a fixed-point number can only represent a value with a fixed decimal-point position; which is which depends on the meaning the user gives the number.
C++ has three floating point types:
float: single precision, 32-bit
double: double precision, 64-bit
long double: extended precision, typically 80 bits on x86
Only float (signed, single precision, 32-bit) is discussed below to show how floating point numbers in C++ are represented in memory. First, some basics: the binary representation of a pure decimal.
A pure decimal must first be normalized in its binary representation, that is, converted into the form 1.xxxxx * (2^n) ("^" denotes exponentiation, so 2^n is the nth power of 2). For a pure decimal D, n is given by:
n = floor(log2(D)); // for a pure decimal, n is always negative
Then D / (2^n) gives the normalized significand.
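A quick sketch of this normalization step:

#include <cmath>
#include <cstdio>

int main() {
    double D = 0.456;                       // a pure decimal, 0 < D < 1
    int n = (int)std::floor(std::log2(D));  // negative for a pure decimal
    double norm = D / std::pow(2.0, n);     // normalized significand in [1, 2)
    std::printf("%g = %g * 2^%d\n", D, norm, n);  // 0.456 = 1.824 * 2^-2
    return 0;
}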
Next comes the problem of converting from decimal to binary. For a better understanding, first look at how a pure decimal is expressed in base 10. Suppose a pure decimal D whose digits after the decimal point form the sequence
{k1,k2,k3,...,kn}
Then D can be expressed as follows:
D = k1/(10^1) + k2/(10^2) + k3/(10^3) + ... + kn/(10^n)
Extending this to binary, the representation of a pure decimal is:
D = b1/(2^1) + b2/(2^2) + b3/(2^3) + ... + bn/(2^n)
The problem now is how to find b1, b2, b3, ..., bn. The algorithm is complex to describe in words, so let's illustrate it with numbers. Each 1/(2^n) is a special value, which I will call the rank value of the nth bit.

Example 2

For example, take 0.456. Bit 1: 0.456 is less than the rank value 0.5, so this bit is 0. Bit 2: 0.456 is greater than the rank value 0.25, so this bit is 1, and 0.456 - 0.25 = 0.206 is carried to the next bit. Bit 3: 0.206 is greater than the rank value 0.125, so this bit is 1, and 0.206 - 0.125 = 0.081 is carried to the next bit. Bit 4: 0.081 is greater than 0.0625, so this bit is 1, and 0.081 - 0.0625 = 0.0185 is carried to the next bit. Bit 5: 0.0185 is less than 0.03125, so this bit is 0; and so on.
Finally, stringing together enough 1s and 0s in rank order yields a fairly accurate binary representation of the pure decimal. This is also where the precision problem arises: many numbers cannot be expressed exactly with any finite number of bits n, and a larger n only expresses them more accurately. That is why, in many fields, programmers prefer double to float.
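The bit-by-bit procedure above is equivalent to repeatedly doubling the fraction; a small sketch:

#include <cstdio>

int main() {
    double d = 0.456;         // the pure decimal from the example above
    std::printf("0.");
    for (int i = 0; i < 24 && d > 0.0; ++i) {
        d *= 2.0;             // doubling compares d against the next rank value
        if (d >= 1.0) { std::printf("1"); d -= 1.0; }
        else          { std::printf("0"); }
    }
    std::printf("\n");        // prints 0.011101001011110001101010
    return 0;
}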
A struct with bit fields can describe the memory layout of a float as follows:
struct MYFLOAT
{
    bool bSign : 1;                 // sign: 1 bit (1 = negative, 0 = positive)
    unsigned char cExponent : 8;    // biased exponent: 8 bits
    unsigned long ulMantissa : 23;  // mantissa (fraction): 23 bits
};
// Note: bit-field order and packing are implementation-defined, so this struct
// is descriptive rather than a portable way to access a float's bits.
The sign bit needs little explanation: 1 means negative, 0 means positive.
The exponent is base 2 and logically ranges from -128 to 127; the exponent actually stored is the true exponent plus 127. If it exceeds 127 it wraps around to -128, behaving like the overflow of integer addition and subtraction on the x86 architecture.
For example: 127 + 2 = -127, and -127 - 2 = 127.
The mantissa omits the leading 1, so that 1 must be restored to the first position before use. The represented value may contain both an integer part and a pure fractional part, or only one of them, depending on the magnitude of the number. For floating point numbers with an integer part there are two representations: when the integer part is greater than the decimal value 16777215, scientific notation is used; when it is less than or equal to that, the ordinary binary representation is used directly. The scientific notation here works just like its decimal counterpart.
The fractional part also uses scientific notation directly, but in the form X * (2^n) rather than X * (10^n).
The 32 bits of a float are laid out as follows:
0 00000000 00000000000000000000000
(sign bit, exponent bits, mantissa bits)
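A portable way to inspect this layout is to copy the float's bits into an integer; unlike the bit-field struct above, this does not depend on implementation-defined packing. A minimal sketch:

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    float f = -6.25f;                        // -6.25 = -1.5625 * 2^2
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);     // reinterpret the 32 float bits

    uint32_t sign     = bits >> 31;          // 1 sign bit
    uint32_t exponent = (bits >> 23) & 0xFF; // 8 exponent bits, biased by 127
    uint32_t mantissa = bits & 0x7FFFFF;     // 23 mantissa bits, implicit leading 1

    std::printf("sign=%u exponent=%u (true %d) mantissa=0x%06X\n",
                sign, exponent, (int)exponent - 127, mantissa);
    return 0;
}

For -6.25 this prints sign=1, a stored exponent of 129 (true exponent 2), and mantissa 0x480000, i.e. the fraction bits of 1.5625.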
--------------------------------------------------------------------------------

Example 3

Determine whether two floating point numbers are equal.
This example uses C++ code to determine whether two floating point numbers are equal. Since floating point numbers cannot be stored exactly, fp1 == fp2 cannot reliably determine whether the float variables fp1 and fp2 are equal; a tolerance comparison such as fabs(fp1 - fp2) < 1e-6 should be used instead.
Example:
#include <cmath>

bool equal(float fp1, float fp2)
{
    // consider the two floats equal when they differ by less than a tolerance
    return std::fabs(fp1 - fp2) < 1e-6f;
}
--------------------------------------------------------------------------------

Leading digit distribution


Brief introduction

Author: concreteHAM
What a floating point number is needs no further explanation. What we want to discuss here is the probability distribution of the leading digit of normalized floating point numbers in an arbitrary base.
Volume 2 of The Art of Computer Programming treats this in great depth; the main points are distilled below.

Example

For example, the decimal normalized floating point number
2.345E67
has leading digit 2.
For a single "random" floating point number, discussing its distribution is meaningless. What we want to discuss is the distribution of the leading digits of the results produced after a series of operations on sufficiently many "random" numbers.
Suppose there is a huge set of floating point numbers, and every number in the set is multiplied by 2. Consider a decimal floating point number F with leading digit 1: its significand ranges over 1.000... to 1.999.... After multiplying by 2, the significand ranges over 2.000... to 3.999..., so the count of numbers whose leading digit was 1 before the multiplication equals the count of numbers whose leading digit is now 2 or 3. With this observation we can proceed to the analysis.
For a floating point number in base b, treat the leading significand x as a continuous quantity with 0 < x < b. Let f(x) be the probability density function of the leading significand over the number set above (note: a density function). Then the probability that the significand lies between u and v (0 < u < v < b) is:
∫[u,v] f(x) dx    (1)
By the scaling argument above, for a sufficiently small increment Δx, f(x) must satisfy:
f(1)Δx = x*f(x)Δx    (2)
because f(1)Δx is the probability mass in the differential segment near 1, and multiplying the set by x carries that same mass to a segment of width x*Δx near x, where it equals f(x)*(x*Δx).
It follows that:
f(x) = f(1)/x    (3)
Integrating both sides over [1, b], the left side must equal 1 and the right side equals f(1)*ln(b):
1 = f(1)*ln(b)    (4)
This gives f(1) = 1/ln(b); substituting into (3):
f(x) = 1/(x*ln(b))
Then, using formula (1):
∫[u,v] 1/(x*ln(b)) dx = ln(v/u)/ln(b)
This is the probability distribution of the leading digit.
For example, in base b = 10, the probability that the leading digit is 1 is:
ln((1+1)/1) / ln(10) ≈ 0.301
and the probability that the leading digit is 9 is:
ln((9+1)/9) / ln(10) ≈ 0.0458
The following is a test program (Mathematica software):
T[n_, b_] := Block[{res = {}, ran, i, a},
  For[i = 1, i < b, i++,
    res = Append[res, 0]               (* one counter per leading digit 1 .. b-1 *)
  ];
  For[i = 0, i < n, i++,
    ran = Random[]*Random[]*Random[];  (* thoroughly scramble the floating point number *)
    ran = Log[b, ran];
    a = Floor[b^(ran - Floor[ran])];   (* extract the leading digit *)
    res[[a]]++                         (* tally this leading digit *)
  ];
  Return[res]
]
Executing T[100000, 10] tests 100000 floating point numbers in base 10 and yields a distribution such as:
{30149,18821,13317,9674,7688,6256,5306,4655,4134}
which is quite close to the theoretical values.