Decoding is the inverse of encoding; it also removes the noise mixed into the bitstream during propagation. Using a code table to translate a group of numbers back into text, or to translate a signal that represents a series of information back into words, is called decoding.
In digital circuits, a decoder (such as an n-line to 2^n-line decoder or a BCD decoder) acts as a multiple-input, multiple-output logic circuit that converts a coded input into a coded output, where the input code and the output code are different. The enable input must be asserted for the decoder to work normally; otherwise the output is an invalid codeword. Decoding is necessary in applications such as multiplexing, seven-segment displays and memory address decoding.
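As an illustration of a decoder whose input and output codes differ, the following is a minimal sketch of a BCD to seven-segment decoder with an enable input. The function name `bcd_to_7seg`, the segment table (one common drawing convention for the digits) and the choice to blank the display when the decoder is disabled or the input is not a valid BCD digit are assumptions made for this example.

```python
# Segments lit for each decimal digit, using the usual a..g segment names.
# This table follows one common convention; real parts differ in details
# such as whether 6 and 9 have "tails".
SEGMENTS = {
    0: "abcdef", 1: "bc", 2: "abdeg", 3: "abcdg", 4: "bcfg",
    5: "acdfg", 6: "acdefg", 7: "abc", 8: "abcdefg", 9: "abcdfg",
}

def bcd_to_7seg(bits: str, enable: bool = True) -> str:
    """Convert a 4-bit BCD string (MSB first) to a 7-character string
    for segments a..g, where '1' means the segment is lit."""
    if not enable:
        return "0000000"          # assumption: a disabled decoder blanks the display
    value = int(bits, 2)
    if value not in SEGMENTS:
        return "0000000"          # assumption: non-BCD inputs (10..15) are blanked
    return "".join("1" if seg in SEGMENTS[value] else "0" for seg in "abcdefg")

print(bcd_to_7seg("0101"))         # digit 5 -> 1011011
print(bcd_to_7seg("0101", False))  # disabled -> 0000000
```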
Suppose the transmitted code sequence is $C_m = (c_{m1}, c_{m2}, \ldots)$ and the receiver receives the signal $R$ (an analog signal or a digital signal, depending on how the channel is defined). The receiver then naturally searches all possible code sequences for the one with the largest conditional probability $P(C_m \mid R)$ and takes it as the most likely transmitted sequence. That is:
$$\hat{C} = \arg\max_{C_m} P(C_m \mid R)$$ This decision criterion is called the maximum a posteriori probability (MAP) criterion.
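A small numerical sketch may make the MAP rule concrete. It assumes a binary symmetric channel with crossover probability `p` and a known prior for each codeword; the function name `map_decode` and the example code (a three-bit repetition code) are chosen only for illustration.

```python
from math import prod

def map_decode(received: str, codewords: list, priors: dict, p: float) -> str:
    """Return the codeword C that maximizes P(C | R) on a binary
    symmetric channel with bit-flip probability p."""
    def likelihood(c: str) -> float:
        # P(R | C): each bit is flipped with probability p, kept with 1 - p.
        return prod(p if ci != ri else 1 - p for ci, ri in zip(c, received))
    # P(C | R) is proportional to P(R | C) * P(C); P(R) is the same for every C.
    return max(codewords, key=lambda c: likelihood(c) * priors[c])

codewords = ["000", "111"]
# Equal priors: MAP reduces to choosing the codeword closest to R.
print(map_decode("010", codewords, {"000": 0.5, "111": 0.5}, p=0.2))    # 000
# A strongly skewed prior can override the channel evidence.
print(map_decode("010", codewords, {"000": 0.01, "111": 0.99}, p=0.2))  # 111
```

With equal priors the MAP decision coincides with the maximum-likelihood decision, which is the case the Viterbi algorithm addresses.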
Algorithm
The Viterbi algorithm is a decoding algorithm for convolutional codes. Its disadvantage is that the complexity grows rapidly with the constraint length: the number of paths to be compared is $2^{N-1}$ (that is, 1 << (N-1)), so there are 64 paths when the constraint length N is 7 and 128 paths when it is 8. Viterbi decoding is therefore generally applied only when the constraint length is less than 10.
Taking N = 7 as an example, the algorithm performs 64 comparisons on the data received at each time t, one per state. Each of the 64 states has two outgoing branches (because the input bit is 0 or 1) that lead to two different states, so two candidate paths also arrive at each state. For each state, the decoder compares the outputs expected on the two arriving branches with the output actually received and discards the candidate with the larger metric (that is, the one that differs more from the received data); the one that remains is called the survivor path. The branch metric is added to the metric the survivor path had at the previous time step and the sum is saved, which extends each of the 64 survivor paths by one step. At the end of decoding, the path with the smallest metric among the 64 survivor paths is selected and followed backwards (called traceback) to obtain the decoded output.
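The procedure can be sketched in a few lines of Python. To keep the example short it assumes a rate-1/2 code with constraint length K = 3 (4 states instead of the 64 discussed above), generator polynomials 7 and 5 (octal), hard decisions with a Hamming-distance metric, and an encoder flushed with K-1 zeros so that the path ends in state 0. All names (`conv_encode`, `viterbi_decode`, `parity`) are illustrative.

```python
def parity(x: int) -> int:
    return bin(x).count("1") & 1

def conv_encode(bits, K=3, gens=(0b111, 0b101)):
    """Rate-1/len(gens) convolutional encoder; appends K-1 zeros to flush."""
    state, out = 0, []                      # state = the K-1 most recent input bits
    for b in list(bits) + [0] * (K - 1):
        reg = (b << (K - 1)) | state        # newest bit in the top position
        out.extend(parity(reg & g) for g in gens)
        state = reg >> 1
    return out

def viterbi_decode(received, K=3, gens=(0b111, 0b101)):
    n_states, n = 1 << (K - 1), len(gens)   # 2^(K-1) states
    INF = float("inf")
    metrics = [0] + [INF] * (n_states - 1)  # the encoder starts in state 0
    history = []                            # per step: (previous state, input bit) for each state
    for t in range(0, len(received), n):
        symbol = received[t:t + n]
        new_metrics = [INF] * n_states
        back = [None] * n_states
        for state in range(n_states):
            if metrics[state] == INF:
                continue
            for b in (0, 1):                # two branches per state
                reg = (b << (K - 1)) | state
                expected = [parity(reg & g) for g in gens]
                nxt = reg >> 1
                m = metrics[state] + sum(e != r for e, r in zip(expected, symbol))
                if m < new_metrics[nxt]:    # keep the survivor, discard the larger metric
                    new_metrics[nxt] = m
                    back[nxt] = (state, b)
        metrics = new_metrics
        history.append(back)
    # Traceback from state 0, where the zero-flushed encoder must have ended.
    state, bits = 0, []
    for back in reversed(history):
        state, b = back[state]
        bits.append(b)
    bits.reverse()
    return bits[:len(bits) - (K - 1)]       # drop the flush bits

message = [1, 0, 1, 1, 0, 0, 1]
coded = conv_encode(message)
coded[3] ^= 1                               # inject a single channel error
print(viterbi_decode(coded) == message)     # True
```

For K = 7 the same loop would visit `1 << 6 = 64` states per step, matching the 64 comparisons described above.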
Code definition
Coding is the conversion of text, numbers or other objects into digital form, or the conversion of information and data into specified electrical pulse signals. Coding is widely used in computers, television, remote control and communications. It is the process of converting analog information into a bitstream according to a certain protocol or format.
In computer hardware, coding is the process of converting information about a subject or unit into coded values (typically numbers) for data storage, management and analysis. In software, coding means writing a program's logic in a specific language such as C or C++. In cryptography, coding refers to the act of writing in a code or cipher.
Coding also means converting data into codes or coded characters in such a way that they can be translated back into the original form. It is the process of writing instructions for a computer and is part of programming. In automated cartography, coding is the process of representing map content with numbers and letters according to certain rules; through coding, computers can identify the geographic features of maps.
An n-bit binary number can form 2^n different combinations. Assigning a specific code group to each piece of information is also called coding.
The files we encounter every day are classified as ASCII or binary. ASCII is the acronym of "American Standard Code for Information Interchange". ASCII assigns the 128 numbers from 0 to 127 as standard codes for information, comprising 33 control codes, one space code and 94 graphic codes. The graphic codes include the English upper-case and lower-case letters, Arabic numerals, punctuation marks and so on. The English text we usually read on a computer is transmitted and stored in the form of graphic codes. ASCII is the common code of most computers in the world.
However, a character in a computer is usually represented by an eight-bit binary number, so every character can take 256 different values. Since ASCII specifies only 128 codes, the remaining 128 codes are not standardized and their usage varies from vendor to vendor. The use of the 33 ASCII control codes also varies from manufacturer to manufacturer. So when files are exchanged between different computers, two types of file must be distinguished. In the first type, every character is an ASCII graphic code or a space code. Such files are called "ASCII text files", or "text files" for short, and can be exchanged directly between different computer systems. The second type, files that contain control codes or non-ASCII codes, cannot be exchanged directly between different computer systems. These files have a general name: "binary files".
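The code counts and the text/binary distinction above can be checked with a short sketch. It applies the strict definition given in the text, under which any control code makes a file binary; practical tools usually also allow tab and line-break characters, which this example deliberately does not. The name `is_ascii_text` is illustrative.

```python
CONTROL = set(range(0, 32)) | {127}   # 33 control codes
SPACE = {32}                          # 1 space code
GRAPHIC = set(range(33, 127))         # 94 graphic codes

def is_ascii_text(data: bytes) -> bool:
    """True if every byte is an ASCII graphic code or the space code."""
    return all(b in GRAPHIC or b in SPACE for b in data)

print(len(CONTROL), len(SPACE), len(GRAPHIC))   # 33 1 94
print(is_ascii_text(b"Hello, world"))           # True
print(is_ascii_text("中".encode("gbk")))         # False: both bytes are above 127
```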
2. National standard (GB), zone-position code, "quasi national standard"
"National standard" (GB) is short for the "National Standard of the People's Republic of China: Chinese character code for information interchange". The national standard table (basic set) arranges more than 7000 Chinese characters, punctuation marks, foreign letters and so on into a square array of 94 rows and 94 columns. Each row of the array is called a "zone", and each zone has 94 "positions". The coordinates of a Chinese character in the array are called its "zone-position code". For example, the character "中" is at position 48 of zone 54, so its zone-position code is 5448.
The number 94 is no coincidence: it is the total number of graphic codes in ASCII. The national standard table uses this number so that a Chinese character can be represented by two ASCII graphic symbols. Since the ASCII graphic codes run from 33 to 126, adding 32 to both the zone code and the position code of a Chinese character places them within the range of the ASCII graphic codes. For example, adding 32 to the zone and position codes of "中" gives 86 and 80. Written in hexadecimal and put together, these two numbers give 5650, which is called the "national standard code", and the two corresponding ASCII symbols, "VP", are the national standard symbols of the character "中".
This raises the problem of telling national standard symbols from ASCII symbols. In a file that mixes Chinese and English, does "VP" stand for the character "中" or for an English abbreviation? When developing CCDOS, the Sixth Research Institute of the Ministry of Electronics Industry used a simple solution: add 128 to each of the two numbers of the national standard code, which lifts them into the non-ASCII range. (The modified code is still conventionally called "national standard".)
Although this solved the original problem, it created new ones. Chinese documents became "binary files", which can neither be exchanged reliably between different computer systems nor work with most of the ASCII-only software on the market.
To distinguish the two "national standards", we call the original national standard code, which overlaps the ASCII graphic codes, "pure national standard", and the CCDOS code with 128 added "quasi national standard".
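The arithmetic for "中" can be reproduced directly. The helper names below are invented for this sketch; the final check uses Python's built-in `gb2312` codec, which uses the two-byte form with 128 added to each byte.

```python
def zone_position_to_pure_gb(zone: int, position: int) -> bytes:
    """Add 32 to the zone and position codes ("pure national standard")."""
    return bytes([zone + 32, position + 32])

def pure_gb_to_quasi_gb(code: bytes) -> bytes:
    """Add 128 to each byte ("quasi national standard", as used by CCDOS)."""
    return bytes(b + 128 for b in code)

pure = zone_position_to_pure_gb(54, 48)   # zone-position code 5448
quasi = pure_gb_to_quasi_gb(pure)

print(pure.hex().upper(), pure.decode("ascii"))   # 5650 VP
print(quasi.hex().upper())                        # D6D0
print(quasi.decode("gb2312"))                     # 中
```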
3. GBK code
GBK is a character encoding that extends the GB code. It encodes more than 20000 simplified and traditional Chinese characters, and the simplified Chinese versions of Win95 and Win98 use GBK as the system internal code.
"GB" stands for national standard (guobiao) and "K" is the first letter of "kuozhan" (extension) in Hanyu Pinyin. GBK is in fact a separate Chinese character encoding standard; its full name is the Chinese Internal Code Specification, and it was issued in 1995.
In practice, Microsoft has used GBK as the system code since the simplified Chinese version of Win95. The system ships with GBK TrueType font libraries in the Song and Hei typefaces (provided by Beijing Zhongyi Electronic Company), which can be used for display and printing, and with four GBK Chinese character input methods. The simplified and traditional Chinese versions of the IE 4.0 browser also provide a built-in two-way conversion between GBK and BIG5. In addition, the simplified Chinese language support kit in the language packs that Microsoft provides for IE contains two fonts, Song and Hei, which also cover GBK Chinese characters (provided by the Zhuhai Stone Computer Typesetting System development company). Other Chinese font manufacturers have also begun to provide TrueType or PostScript GBK font libraries.
Many add-on Chinese platforms, such as NJStar and RichWin, provide GBK support, including fonts, input methods and converters between GBK and other Chinese codes.
On the Internet, many websites use GBK.
However, most search engines do not support searching for GBK Chinese characters well, and the search engines in mainland China that do support it do so only imperfectly.
GBK is backward compatible with GB 2312 and forward compatible with the international standard ISO 10646.1; it is a transitional standard on the way from the former to the latter.
The GBK specification includes all CJK Chinese characters and symbols in ISO 10646.1, with some supplements. It includes:
a. all Chinese characters and non-Chinese-character symbols in GB 2312;
b. the other CJK Chinese characters in GB 13000.1 (items a and b together give 20902 GB Chinese characters);
c. 52 Chinese characters in the Simplified Summary Table that are not included in GB 13000.1;
d. 28 radicals and important components from the Kangxi Dictionary and Cihai that are not included in GB 13000.1;
e. 13 Chinese character structure symbols;
f. 139 graphic symbols in BIG-5 that are not included in GB 2312 but exist in GB 13000.1;
g. 6 phonetic symbols added in GB 12345;
h. 19 vertical-layout graphic symbols added in GB 12345 (GB 12345 adds 29 vertical punctuation symbols compared with GB 2312, 10 of which are not included in GB 13000.1, so GBK does not include them);
i. 21 Chinese characters selected from the CJK compatibility area of GB 13000.1;
j. 31 IBM OS/2 special symbols included in GB 13000.1.
GBK also uses a double-byte representation. The overall code range is 0x8140 to 0xFEFE: the first byte lies between 0x81 and 0xFE and the second byte between 0x40 and 0xFE, excluding the line 0xXX7F. There are 23940 code positions in total, holding 21886 Chinese characters and graphic symbols: 21003 Chinese characters (including radicals and components) and 883 graphic symbols.
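The figure of 23940 code positions follows directly from the byte ranges, as the following sketch shows. The function name `is_gbk_pair` is illustrative; it checks only the byte ranges stated above, not whether a character is actually assigned to the position.

```python
def is_gbk_pair(lead: int, trail: int) -> bool:
    """True if (lead, trail) lies in the GBK double-byte code range."""
    return 0x81 <= lead <= 0xFE and 0x40 <= trail <= 0xFE and trail != 0x7F

# 126 possible first bytes times 190 possible second bytes.
total = sum(is_gbk_pair(lead, trail) for lead in range(256) for trail in range(256))
print(total)                        # 23940
print(is_gbk_pair(0xD6, 0xD0))      # True  (the bytes of "中")
print(is_gbk_pair(0x81, 0x7F))      # False (the excluded 0xXX7F line)
```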
4. BIG5 code
BIG5 is a Chinese character encoding for traditional Chinese characters that is currently widely used in computer systems in Taiwan and Hong Kong. The coding range of BIG5 code is shown below.
5. HZ code
HZ code is a Chinese character encoding widely used on the Internet. The distinguishing feature of the "HZ" scheme is that "pure national standard" Chinese codes and ASCII codes are mixed in one stream. How does "HZ" tell the national standard codes from the ASCII symbols they coincide with? The answer is simple: when a run of national standard codes is inserted into a string of ASCII codes, "~{" is added in front of the run and "~}" after it. These additional codes are called the "escape-in code" and the "escape-out code" respectively. Since they are themselves ASCII graphic codes, the whole document looks just like an ASCII text file: it can be transferred safely over computer networks and is compatible with most software that processes English text.
6. ISO-2022 CJK code
ISO-2022 is a character coding standard developed by the International Organization for Standardization (ISO). The two-byte Chinese code is called ISO-2022-CN, and the Japanese and Korean codes are called ISO-2022-JP and ISO-2022-KR respectively. Together they are generally called the CJK codes. At present, the CJK codes are mainly used on the Internet.
7. UCS and ISO 10646
In 1993, the international standard ISO 10646 defined the Universal Character Set (UCS). UCS is a superset of all other character set standards. It guarantees round-trip compatibility with other character sets: if you convert any text string to UCS and then back to the original encoding, no information is lost.
UCS contains the characters needed to represent essentially all known languages. This includes not only the Latin, Greek, Cyrillic, Hebrew, Arabic, Armenian and Georgian scripts, but also the Chinese, Japanese and Korean ideographs, as well as Hiragana, Katakana, Bengali, Gurmukhi, Tamil, Kannada, Malayalam, Thai, Lao, Bopomofo, Hangul, Devanagari, Gujarati, Oriya, Telugu and other scripts. Scripts that have not yet been added will be added eventually; research on how best to encode them for computer use is still going on. These include Tibetan, Khmer, Runic, Ethiopian, other hieroglyphs and various Indo-European languages, as well as selected artistic scripts such as Tengwar, Cirth and Klingon. UCS also includes a large number of graphic, typographic, mathematical and scientific symbols, including all the characters provided by TeX, PostScript, MS-DOS, MS-Windows, Macintosh, OCR fonts and many other word processing and publishing systems.
ISO 10646 defines a 31-bit character set. However, in this huge code space only the first 65534 code positions (0x0000 to 0xFFFD) have been allocated so far. This 16-bit subset of UCS is called the Basic Multilingual Plane (BMP). The characters to be encoded outside the BMP are very special characters (such as hieroglyphs) that only experts in history and science will use. According to current plans, no character may ever be allocated outside the 21-bit code space from 0x000000 to 0x10FFFF, which covers more than 1 million potential future characters. ISO 10646-1, first published in 1993, defines the architecture of the character set and the content of the BMP. A second part, ISO 10646-2, which defines the characters encoded outside the BMP, is in preparation but may take several years to complete. New characters are still being added to the BMP, but the existing characters are stable and will not change.
UCS not only assigns a code to each character but also gives it an official name. A hexadecimal number that represents a UCS or Unicode value is usually prefixed with "U+"; for example, U+0041 represents the character "Latin capital letter A". The UCS characters U+0000 to U+007F are identical to US-ASCII (ISO 646), and U+0000 to U+00FF are identical to ISO 8859-1 (Latin-1). The range U+E000 to U+F8FF, and larger ranges outside the BMP, are reserved for private use.
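Several of these properties can be observed with Python's standard `unicodedata` module and built-in codecs. The helper name `describe` is illustrative.

```python
import unicodedata

def describe(ch: str) -> str:
    """Format a character as 'U+XXXX NAME'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

print(describe("A"))    # U+0041 LATIN CAPITAL LETTER A
print(describe("中"))    # U+4E2D CJK UNIFIED IDEOGRAPH-4E2D

# U+0000..U+007F match US-ASCII and U+0000..U+00FF match ISO 8859-1.
ascii_ok = bytes(range(128)).decode("ascii") == "".join(map(chr, range(128)))
latin1_ok = bytes(range(256)).decode("latin-1") == "".join(map(chr, range(256)))
print(ascii_ok, latin1_ok)                  # True True

# U+E000..U+F8FF is the private use area (general category "Co").
print(unicodedata.category("\ue000"))       # Co

# Round trip: a legacy encoding converted to Unicode and back loses nothing.
gbk_bytes = "中".encode("gbk")
print(gbk_bytes.decode("gbk").encode("gbk") == gbk_bytes)   # True
```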
The UCS-4 form defined by ISO 10646 in 1993 uses four bytes per character, which is wide enough to hold an enormous code space. Such a "fat" character standard, however, had an impractical side at the time and still has one now: it occupies too much storage space and reduces the efficiency of information transmission. Meanwhile, the Unicode organization had also started to develop a 16-bit character standard. To avoid competition between two 16-bit codes, the two organizations began to negotiate in 1992 in order to find common ground through compromise. The result is today's UCS-2 (the BMP, Basic Multilingual Plane, 16 bits) and Unicode, although they remain different schemes.
8. Unicode
We need to trace the origin of Unicode.
When computers became popular in East Asia, they met countries such as China, Japan and South Korea whose writing uses ideographic characters rather than an alphabet. The languages of these countries commonly use thousands of characters, but the original character sets were single-byte codes, and one code page can hold at most 2^8 = 256 characters, which is of no use for languages written with ideographic characters. Since one byte is not enough, people naturally turned to double-byte character sets (DBCS). But although a DBCS encodes ideographic characters such as Chinese characters in two bytes, ASCII codes and Japanese katakana are still represented by a single byte. This causes programmers a lot of trouble: whenever DBCS string processing is involved, the program must decide whether a byte represents a whole character or half a character and, if it is half a character, whether it is the first half or the second half. DBCS is therefore not a very good solution.
People kept looking for a better character encoding scheme, and the final result was the birth of Unicode. Unicode is in fact a wide-character set that uses two bytes (16 bits) for every character, so there is no need to worry about processing only half a character.
At present, Unicode is used on networks, in Windows systems and in many large software packages.
Among the GB coding standards, GB2312 and GBK are the ones in common use. GB2312 is a subset of GBK, and its code range is 0xA1A1 to 0xFEFE. Pure GB2312 is simple to handle, whereas handling the GBK character set takes a few tricks. First, the GBK coding standard itself:
GBK uses a double-byte representation. The overall code range is 8140 to FEFE: the first byte lies between 81 and FE and the second byte between 40 and FE, excluding the line xx7F. There are 23940 code positions in total, holding 21886 Chinese characters and graphic symbols: 21003 Chinese characters (including radicals and components) and 883 graphic symbols.
Code area classification
1. Chinese character area. It includes:
a. The GB 2312 Chinese character area, namely GBK/2: B0A1-F7FE. It contains the 6763 Chinese characters of GB 2312, in their original order.
b. The GB 13000.1 extended Chinese character area. It includes:
(1) GBK/3: 8140-A0FE. It contains 6080 CJK Chinese characters from GB 13000.1.
(2) GBK/4: AA40-FEA0. It contains 8160 CJK Chinese characters and supplementary Chinese characters.
The CJK Chinese characters come first, arranged in UCS code order; the supplementary Chinese characters (including radicals and components) follow, arranged by page number and position in the Kangxi Dictionary.
2. Graphic symbol area. It includes:
a. The GB 2312 non-Chinese-character symbol area, namely GBK/1: A1A1-A9FE. In addition to the symbols of GB 2312,
it contains 10 lowercase Roman numerals and the symbols supplemented from GB 12345, for a total of 717 symbols.
b. The GB 13000.1 extended non-Chinese-character area, namely GBK/5: A840-A9A0. The BIG-5 non-Chinese-character symbols, the structure symbols and "○" are placed in this area, for a total of 166 symbols.
3. User-defined area. It is divided into three sub-areas, (1), (2) and (3):
(1) AAA1-AFFE, 564 code positions.
(2) F8A1-FEFE, 658 code positions.
(3) A140-A7A0, 672 code positions.
Although sub-area (3) is open to users, its use is restricted, because the possibility of adding new characters there in the future has not been ruled out.
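The areas listed above tile the whole GBK code space, which can be verified by counting. The table and the helper names (`REGIONS`, `region_size`, `classify`) are written for this sketch from the ranges given in the text; the sizes are code positions, not numbers of assigned characters.

```python
# (name, first lead byte, last lead byte, first trail byte, last trail byte)
REGIONS = [
    ("GBK/1", 0xA1, 0xA9, 0xA1, 0xFE),
    ("GBK/2", 0xB0, 0xF7, 0xA1, 0xFE),
    ("GBK/3", 0x81, 0xA0, 0x40, 0xFE),
    ("GBK/4", 0xAA, 0xFE, 0x40, 0xA0),
    ("GBK/5", 0xA8, 0xA9, 0x40, 0xA0),
    ("user-defined (1)", 0xAA, 0xAF, 0xA1, 0xFE),
    ("user-defined (2)", 0xF8, 0xFE, 0xA1, 0xFE),
    ("user-defined (3)", 0xA1, 0xA7, 0x40, 0xA0),
]

def region_size(lead_lo, lead_hi, trail_lo, trail_hi) -> int:
    """Number of code positions in a region, skipping trail byte 0x7F."""
    trails = [t for t in range(trail_lo, trail_hi + 1) if t != 0x7F]
    return (lead_hi - lead_lo + 1) * len(trails)

def classify(lead: int, trail: int):
    """Return the name of the area containing (lead, trail), or None."""
    if trail == 0x7F:
        return None
    for name, l_lo, l_hi, t_lo, t_hi in REGIONS:
        if l_lo <= lead <= l_hi and t_lo <= trail <= t_hi:
            return name
    return None

sizes = {name: region_size(*bounds) for name, *bounds in REGIONS}
print(sizes["user-defined (1)"], sizes["user-defined (2)"], sizes["user-defined (3)"])  # 564 658 672
print(sum(sizes.values()))      # 23940
print(classify(0xD6, 0xD0))     # GBK/2 (the bytes of "中")
```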
Here are some tips:
1. In PHP, the character encoding is whatever encoding the data was sent in: the bytes are exactly those entered by the user and are not converted automatically. In ASP, by contrast, the default encoding is Unicode, so it is easy to obtain a GBK to Unicode mapping table, which makes converting GBK to UTF-8 straightforward even without a conversion library.
2. Because the lowest value of the second byte in GBK is 0x40, that is, 64, it is best to use ASCII characters below 64 as delimiters when handling strings that contain Chinese, so that replacing or splitting never produces garbled text. Commonly used delimiters include ",", ";" and ":". These characters can never be confused with a byte of a GB code.
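The second tip can be demonstrated at the byte level. The sketch below picks a GBK character whose second byte is 0x7C, the ASCII code of "|", and shows that splitting the raw bytes on "|" cuts the character apart while splitting on "," does not. The names `TRICKY` and `split_bytes` are invented for this example.

```python
# A GBK character whose second byte equals the ASCII code of "|" (0x7C).
TRICKY = bytes([0x81, 0x7C]).decode("gbk")

def split_bytes(text: str, sep: str) -> list:
    """Encode text as GBK and split the raw bytes on an ASCII separator."""
    return text.encode("gbk").split(sep.encode("ascii"))

# "|" is 0x7C, which is >= 0x40, so it can also appear as a trail byte.
print(split_bytes(TRICKY + "|x", "|"))   # [b'\x81', b'', b'x']: the character is cut in half
# "," is 0x2C, which is < 0x40, so it never appears inside a GBK character.
print(split_bytes(TRICKY + ",x", ","))   # [b'\x81|', b'x']: the character stays whole
```

When a delimiter at or above 0x40 cannot be avoided, decoding the bytes to a string first and splitting the string avoids the problem.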
Types of coding
Encoding is a basic perceptual process of interpreting incoming stimuli in cognition. Technically, it is a complex, multi-stage conversion from relatively objective sensory input (such as light and sound) to subjectively meaningful experience.
Character encoding is a set of rules that pairs a set of natural-language characters (such as an alphabet or a syllabary) with a set of other things (such as numbers or electrical pulses).
Text encoding uses a markup language to mark the structure and other features of a text for computer processing.
Semantic encoding of a formal language A in a formal language B is a method of expressing all the terms (such as programs or instructions) of language A in language B.
Electronic encoding converts a signal into a code that has been optimized for transmission or storage. The conversion is usually performed by a codec.
Neural encoding refers to the way information is represented in neurons.
Memory encoding is the process of converting sensations into memories.
Encryption is the process of transforming information to keep it confidential.
Transcoding is the process of converting from one encoding format to another.
Physics
In physics, coding and decoding refer to gate circuits.
The device that converts an analog signal into binary digits is called an analog-to-digital converter (ADC). A digital-to-analog converter (DAC) converts binary numbers into analog quantities. Encoders and decoders are generally used for chip address selection. A 3-to-8 decoder converts a three-bit input code into an 8-bit output in which exactly one bit differs from the others. For example, with active-high outputs written from Y7 down to Y0, the input 010 decodes to 00000100.
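A 3-to-8 decoder is easy to model in software. The sketch below assumes active-high outputs printed from Y7 down to Y0 and an enable input that forces all outputs to 0 when it is not asserted; many real parts use active-low outputs instead. The function name `decode_3to8` is illustrative.

```python
def decode_3to8(code: int, enable: bool = True) -> str:
    """Return the 8 output bits (Y7..Y0) of a 3-to-8 decoder as a string."""
    if not 0 <= code <= 7:
        raise ValueError("code must be a 3-bit value (0..7)")
    if not enable:
        return "00000000"             # assumption: disabled means no output selected
    return format(1 << code, "08b")   # exactly one output bit is set

print(decode_3to8(0b010))   # 00000100
print(decode_3to8(0b111))   # 10000000
```

Used for chip selection, each of the eight outputs would drive the enable pin of one memory chip, so that three address lines select one of eight chips.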