A floating-point number belongs to a particular subset of the rational numbers and is used in a computer to approximate real numbers. Specifically, such a number is obtained by taking an integer or fixed-point number (the mantissa, or significand) and multiplying it by an integer power of some radix (usually 2 in computers). This representation is analogous to base-10 scientific notation.
Floating-point arithmetic refers to operations involving floating-point numbers; it is usually accompanied by approximation or rounding.
A floating-point number a is represented by two numbers m and e: a = m × b ^ e. In any such system we choose a base b (the radix of the number system) and a precision p (how many digits are stored). The mantissa m has the form ± d.ddd…ddd with p digits, each an integer between 0 and b − 1 inclusive. If the first digit of m is nonzero, m is said to be normalized. Some descriptions use a separate sign bit (s, representing + or −), in which case m must be nonnegative. e is the exponent.
Structure
As can be seen, a floating-point number stored in a computer has the following structure: a sign bit, an exponent (order code), and a mantissa.
Floating-point addition and subtraction are more complex than the corresponding fixed-point operations. The steps are as follows.
(1) Zero-operand check
If either operand x or y is judged to be 0, the result is known immediately and the subsequent steps can be skipped, saving operation time. The zero-operand check step performs this function.
(2) Compare the exponents and align them
To add or subtract two floating-point numbers, first check whether their exponents are equal, i.e. whether the radix points are aligned. If the two exponents are equal, the points are aligned and the mantissas can be added or subtracted directly. Otherwise, the points are not aligned, and the two exponents must first be made equal; this process is called exponent alignment.
To align, first compute the difference between the two exponents Ex and Ey:
ΔE = Ex − Ey
If ΔE = 0, the two exponents are equal (Ex = Ey); if ΔE > 0, then Ex > Ey; if ΔE < 0, then Ex < Ey.
When Ex ≠ Ey, a mantissa is shifted so that Ex or Ey changes until the two are equal. In principle, Mx could be shifted to change Ex, or My shifted to change Ey. However, since floating-point numbers are generally stored normalized, shifting a mantissa left would discard its most significant bits and cause a large error, while shifting right discards only least significant bits and causes a much smaller error. The alignment rule therefore only shifts mantissas to the right: each one-bit right shift of a mantissa is accompanied by adding 1 to its exponent, leaving the value unchanged. Since the exponent that is increased must end up equal to the other, it must be the smaller one; hence during alignment it is always the mantissa of the number with the smaller exponent that is shifted right.
(3) Mantissa addition or subtraction
After alignment, the mantissas can be summed. Whether the operation is addition or subtraction, it is carried out as an addition, using exactly the same method as fixed-point addition and subtraction.
(4) Result normalization
During floating-point addition and subtraction, the mantissa sum may come out as 01.φ…φ or 10.φ…φ, i.e. with unequal sign bits. In fixed-point addition and subtraction this is called overflow and is not allowed; in floating-point arithmetic it merely indicates that the absolute value of the mantissa sum is greater than 1, i.e. normalization has been broken to the left. Shifting the result right to restore normalization is called right normalization; the rule is: shift the mantissa right one bit and add 1 to the exponent. When the most significant bit of the mantissa is not 1, the result must instead be normalized by shifting left (left normalization), subtracting 1 from the exponent for each shift.
(5) Rounding
During alignment or right normalization, the mantissa is shifted right, and the low-order bits shifted out are lost, causing error; rounding is therefore required.
There are two simple rounding methods. One is the "0-truncate, 1-round" method: if the highest bit lost in the right shift is 0, simply truncate; if it is 1, add 1 to the least significant bit of the mantissa. The other is the "force-1" method: whenever bits have been shifted out, set the least significant bit of the remaining mantissa to 1.
Round-to-nearest is essentially the familiar "round half up", refined with a tie-to-even rule. For example, if the bits beyond the specified 23-bit mantissa are 10010, the excess is more than half the value of the least significant bit, so 1 is added to the least significant bit. If the extra 5 bits are 01111, the excess is less than half, so the result is simply truncated. In the special halfway case where the extra 5 bits are exactly 10000: if the least significant bit is 0, truncate; if it is 1, add 1 so that, with the carry, the least significant bit becomes 0.
Rounding toward 0 (toward the origin of the number axis) is simple truncation. Whether the mantissa is positive or negative, truncation makes the absolute value of the result no larger than that of the original value; this method easily leads to accumulated error.
Rounding toward +∞: for a positive number, as long as the discarded bits are not all 0, add 1 to the least significant bit; for a negative number, simply truncate.
Rounding toward −∞ is the opposite of rounding toward +∞: for a positive number, simply truncate; for a negative number, as long as the discarded bits are not all 0, add 1 to the least significant bit (making the value more negative).
(6) Overflow handling
Floating-point overflow shows up as exponent overflow. During addition/subtraction, check whether the exponent has overflowed: if the exponent is in range, the operation finishes normally; if it has overflowed, it must be handled accordingly. Mantissa overflow must also be handled.
Exponent overflow: an exponent exceeding the largest representable positive exponent; the result is generally treated as +∞ or −∞.
Exponent underflow: an exponent below the smallest representable negative exponent; the result is generally treated as 0.
Mantissa overflow: adding two mantissas of the same sign produces a carry out of the most significant bit; the mantissa is shifted right and the exponent increased by 1 to renormalize.
Mantissa underflow: while shifting the mantissa right, its least significant bit is shifted out of the right end of the mantissa field; rounding is then required.
Example
For example, a decimal floating-point number with 4 significant digits and an exponent range of ±4 can represent 43210, 4.321, or 0.0004321, but does not have enough precision to represent 432.123 or 43212.3 exactly (they must be approximated as 432.1 and 43210). In practice, of course, far more than 4 digits are used.
Special values
In addition, floating-point representations usually include some special values: +∞ and −∞ (positive and negative infinity) and NaN ("Not a Number"). Infinity is used when a number is too large to represent, and NaN indicates an illegal operation or an undefined result.
Binary representation
As everyone knows, all data in a computer is stored in binary, and floating-point numbers are no exception. However, the binary representation of a floating-point number is not as straightforward as that of a fixed-point number.
Floating point concept
First, to clarify a concept: a floating-point number is not necessarily a fraction, and a fixed-point number is not necessarily an integer. "Floating point" means that the position of the radix point is logically not fixed, whereas a fixed-point number can only represent values with a fixed radix-point position; the interpretation depends on the meaning the user gives the number.
There are 6 kinds of floating-point numbers in C++. Here only float (signed, single precision, 32-bit) is used to show how floating-point numbers are represented in memory in C++. Let us start with the basics: the binary representation of a pure decimal (a number with no integer part).
A pure decimal must first be normalized in binary, that is, converted into the form 1.xxxxx * (2 ^ n) ("^" denotes exponentiation; 2 ^ n is the nth power of 2). For a pure decimal D, the formula for n is:
n = floor(log2(D)); // n obtained from a pure decimal is always negative
Then D/(2 ^ n) gives the normalized significand. Next comes the problem of converting the fraction from decimal to binary. For a better understanding, first look at how a pure decimal is expressed in binary. Suppose there is a pure decimal D; the digits after its binary point form a sequence b1, b2, b3, …, bn. The problem now is how to find b1, b2, b3, …, bn. The algorithm is complicated to describe abstractly, so let us use concrete numbers. For bit i, 1/(2 ^ i) is a special value that I will call the place value of that bit.
Example 2
For example, take 0.456. Bit 1: 0.456 is less than the place value 0.5, so this bit is 0. Bit 2: 0.456 is greater than the place value 0.25, so this bit is 1, and 0.456 − 0.25 = 0.206 is carried to the next bit. Bit 3: 0.206 is greater than the place value 0.125, so this bit is 1, and 0.206 − 0.125 = 0.081 is carried on. Bit 4: 0.081 is greater than 0.0625, so this bit is 1, and 0.081 − 0.0625 = 0.0185 is carried on. Bit 5: 0.0185 is less than 0.03125, so this bit is 0; and so on.
Finally, taking enough 1s and 0s in place-value order yields an increasingly accurate binary representation of the pure decimal. At the same time the precision problem appears: many numbers cannot be represented exactly with a finite number of bits, and we can only use more bits to represent them more accurately. That is why, in many fields, programmers prefer double over float.
I use a struct with bit fields to describe the memory layout of a float as follows (the exponent and mantissa fields are completed here from the description below; note that bit-field packing order is implementation-defined):
struct MYFLOAT
{
    bool bSign : 1;              // sign, 1 bit
    unsigned int uExponent : 8;  // exponent, 8 bits
    unsigned int uMantissa : 23; // mantissa, 23 bits
};
The sign needs no further explanation: 1 means negative, 0 means positive.
The exponent is base 2 and ranges from −128 to 127. The exponent stored in the actual data is the true exponent plus 127; if the result exceeds 127 it wraps around from −128, the same behavior as integer addition/subtraction overflow on the x86 architecture.
For example: 127 + 2 = −127, and −127 − 2 = 127.
The mantissa omits the leading 1, so a 1 must be prepended before the value is restored. It may contain both an integer part and a fractional part, or only one of them, depending on the magnitude of the number. For floating-point numbers with an integer part, there are two representations of the integer: when the integer exceeds the decimal value 16777215, scientific notation is used; when it is less than or equal to that, ordinary binary representation is used directly. The scientific notation here is the same idea as its decimal counterpart.
The fractional part always uses scientific notation directly, but in the form X * (2 ^ n) rather than X * (10 ^ n).
Determine whether two floating-point numbers are equal.
In this example, we use C++ code to determine whether two floating-point numbers are equal. Since floating-point numbers cannot be stored exactly, fp1 == fp2 cannot reliably determine whether the float variables fp1 and fp2 are equal; fabs(fp1 - fp2) < 0.0000001 should be used for the judgment instead.
What a floating-point number is needs no further introduction. What we want to discuss here is the probability distribution of the leading digits of normalized floating-point numbers in an arbitrary base.
Volume 2 of "The Art of Computer Programming" contains a very in-depth discussion of this question, from which the main points are distilled below.
For a single "random" floating-point number, discussing the distribution of its leading digit is meaningless. What we discuss is the distribution of the leading digits of the results produced after a series of operations on sufficiently many "random" numbers.
Suppose there is a huge set of floating-point numbers, and every number in the set is multiplied by 2. Consider a decimal floating-point number F in the set whose leading digit is 1; its significand lies in the range 1.000…–1.999…. After multiplying by 2, the significand lies in 2.000…–3.999…. Clearly, the count of numbers whose leading digit was 1 before the multiplication equals the count of numbers whose leading digit is now 2 or 3. With this observation we can proceed to the analysis.
For a base b, the leading significand x of a normalized number ranges over 1 ≤ x < b. Let f(x) be the probability density function of the leading significand over the number set above (note: it is a density function). Then the probability that it lies between u and v (1 ≤ u < v < b) is:
∫[u,v] f(x) dx    (1)
As implied by the multiplication argument above, for a sufficiently small increment Δx, f(x) must satisfy:
f(1)Δx = x·f(x)·Δx    (2)
This is because f(1)Δx is the probability mass in a differential segment at leading significand 1; by the scaling argument above, the segment [1, 1 + Δx] maps to [x, x(1 + Δx)], whose probability mass is f(x)·(x·Δx).
It follows immediately that:
f(x) = f(1)/x    (3)
Integrating both sides of (3) over [1, b], the left side must equal 1 and the right side equals f(1)·ln(b):
1 = f(1)·ln(b)    (4)
Hence f(1) = 1/ln(b); substituting back into (3):
f(x) = 1/(x·ln(b))
Then formula (1) gives the probability that the leading significand lies in [u, v]:
∫[u,v] 1/(x·ln(b)) dx = ln(v/u) / ln(b)    (5)
This is the probability distribution of the leading digit.
For example, in base b = 10, the probability that the leading digit is 1 is:
ln((1+1)/1) / ln(10) ≈ 0.301
and the probability that the leading digit is 9 is:
ln((9+1)/9) / ln(10) ≈ 0.0458
The following is a test program (Mathematica software):
T[n_, b_] := Block[{res = {}, ran, i, a},
  For[i = 1, i < b, i++,
    res = Append[res, 0]
  ];
  For[i = 0, i < n, i++,
    ran = Random[]*Random[]*Random[]; (* fully scramble the floating-point number *)
    ran = Log[b, ran];
    a = Floor[b^(ran - Floor[ran])]; (* extract the leading digit *)
    res[[a]]++ (* tally the count for this leading digit *)
  ];
  Return[res]
]
Executing T[100000, 10] tests 100000 floating-point numbers in base 10 and returns the resulting leading-digit distribution.