Data compression

Data compression is the process of encoding the original data so that it occupies less storage space [2]: reducing the amount of data, and hence the space needed to store it, without losing useful information. It is a technique for improving the efficiency of data transmission, storage, and processing, or for reorganizing data according to some algorithm so as to reduce redundancy and storage space. Data compression includes lossy compression and lossless compression.
In computer science and information theory, data compression (or source coding) is the process of representing information with fewer data bits (or other information-bearing units) than the unencoded representation would use, according to a specific encoding mechanism. For example, if we encode "compression" as "comp", this article can be expressed with fewer data bits. A popular example of compression is the ZIP file format used by many computers, which not only compresses data but also serves as an archiver, storing many files within a single file.
Chinese name: data compression
Foreign name: Data Compression
Discipline: Computer Science and Technology; Database; New Database Technology

Overview

For any form of communication, compressed data communication works only when both the sender and the receiver of the information understand the encoding mechanism. For example, this article is meaningful only if the recipient knows it is to be interpreted as English characters. Similarly, the receiver can sensibly decompress the data only if it knows the encoding method. Some compression algorithms exploit this property to encrypt the data during compression, for example with password-based encryption, so that only authorized parties can recover the data correctly.
Data compression is possible because most real-world data contains statistical redundancy. For example, in English the letter "e" is far more common than the letter "z", and the letter "q" is very unlikely to be followed by "z". Lossless compression algorithms exploit such statistical redundancy so that the sender's data can be represented more succinctly yet remain complete.
If some loss of fidelity is acceptable, further compression is possible. For example, when people look at pictures or television images, they may not notice that some details are imperfect. Similarly, two sampled audio sequences may sound the same even though they are not exactly identical. Lossy compression algorithms use fewer bits to represent images, video, or audio at the cost of such slight differences.
Compression is important because it reduces the consumption of expensive resources such as hard disk space and connection bandwidth. However, compression itself consumes processing resources, which can also be expensive. The design of a data compression scheme therefore involves trade-offs among compression ratio, distortion, required computing resources, and other factors.
Some schemes are reversible, so that the original data can be recovered exactly; these are called lossless data compression. Others accept a certain degree of data loss in exchange for a higher compression ratio; these are called lossy data compression.
However, there are always files that lossless data compression algorithms cannot compress; indeed, no compression algorithm can compress data that contains no recognizable patterns. Attempting to compress already-compressed data usually expands it, as does attempting to compress encrypted data.
In practice, lossy data compression also eventually reaches a point where it can no longer work. As an extreme example, imagine a "compression algorithm" that removes the last byte of a file on each application: it compresses until the file becomes empty, after which it can do nothing more.

Classification

There are many data compression methods, and data with different characteristics call for different methods (that is, different coding schemes). Here we classify them from several perspectives. [1]
(1) Real-time compression and non-real-time compression
For example, an IP telephone call converts the voice signal into a digital signal, compresses it, and transmits it over the Internet, so the compression happens in real time. Real-time compression is generally used for transmitting video and audio data and often relies on dedicated hardware, such as compression cards.
Non-real-time compression is what ordinary computer users mostly encounter: compression is performed only when needed, with no real-time requirement, for example when compressing a picture, an article, or a piece of music. It generally needs no special equipment; installing and running suitable compression software on the computer is enough.
(2) Data compression and file compression
Data compression actually subsumes file compression. "Data" originally means any digital information, including the various files used in computers, but sometimes it refers specifically to time-sensitive data that is collected, processed, or transmitted in real time. File compression refers to compressing data to be stored on physical media such as disks, for example an article, a piece of music, or program code.
(3) Lossless compression and lossy compression
Lossless compression exploits the statistical redundancy of the data. The theoretical limit of this redundancy is roughly 2:1 to 5:1, so lossless compression ratios are generally low. Such methods are widely used for text, programs, and images in special applications where the data must be stored exactly. Lossy compression, by contrast, exploits the insensitivity of human vision and hearing to certain frequency components of images and sounds, allowing some information to be lost during compression. Although the original data cannot be recovered exactly, the lost part has little effect on how the original image is understood, and the compression ratio is much higher. Lossy compression is widely used for speech, image, and video data.
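As a rough illustration of lossless ratios, here is a minimal sketch using Python's standard zlib module (which implements the DEFLATE algorithm discussed later); the sample text is invented and deliberately repetitive, so it compresses far better than the typical 2:1 to 5:1 achievable for ordinary prose:

```python
import zlib

# A toy text: repeating one sentence makes the redundancy obvious.
# Ordinary, non-repetitive prose compresses much less dramatically.
text = ("Data compression reduces the amount of data without losing "
        "useful information. " * 50).encode("utf-8")

compressed = zlib.compress(text, 9)  # 9 = maximum compression level
print(f"original:   {len(text)} bytes")
print(f"compressed: {len(compressed)} bytes")
print(f"ratio:      {len(text) / len(compressed):.1f}:1")

# Lossless: decompression restores the input exactly.
assert zlib.decompress(compressed) == text
```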

Principle

In fact, multimedia information contains a great deal of data redundancy. For example, in an image of a still building against blue sky and green grass, many of the pixels are identical; storing them point by point wastes a lot of space. This is called spatial redundancy. Likewise, in adjacent frames of television or animation, only the moving objects change slightly, so only the differences need to be stored. This is called temporal redundancy. There are also structural redundancy and visual redundancy. All of these create the conditions for data compression.
In short, the theoretical basis of compression is information theory. From this perspective, compression removes the redundancy in information: it removes what is certain or inferable and keeps what is uncertain, replacing the original redundant description with one closer to the essence of the information, namely its information content.
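As a toy illustration of temporal redundancy, the following sketch (plain Python, with invented "frame" values standing in for pixel rows) stores only the positions that change between two consecutive frames:

```python
# Two consecutive "frames" of a scene: only a few values change,
# so storing frame-to-frame differences removes temporal redundancy.
frame1 = [10, 10, 10, 50, 50, 10, 10, 10]
frame2 = [10, 10, 10, 10, 50, 50, 10, 10]  # the bright object moved

# Delta encoding: record (index, new_value) only where pixels differ.
delta = [(i, b) for i, (a, b) in enumerate(zip(frame1, frame2)) if a != b]
print(delta)  # [(3, 10), (5, 50)] -- two entries instead of a full frame

# Reconstruct frame2 from frame1 plus the delta (lossless).
restored = list(frame1)
for i, value in delta:
    restored[i] = value
assert restored == frame2
```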

Application

A very simple compression method is run-length encoding, which replaces runs of identical data with a short code such as the data value and the run length. This is an example of lossless data compression. It is often used to make better use of disk space on office computers, or of bandwidth in computer networks. For symbolic data such as spreadsheets, text, and executables, losslessness is a critical requirement, because in most cases a change of even a single data bit is unacceptable. A small sketch of the idea follows.
For video and audio data, some degree of quality degradation is acceptable as long as the important parts of the data are preserved. By exploiting the limitations of the human perceptual system, a great deal of storage space can be saved while the result remains perceptually almost indistinguishable from the original. These lossy data compression methods usually involve a trade-off among compression speed, compressed data size, and quality loss.
Lossy image compression is used in digital cameras, greatly increasing storage capacity while barely reducing image quality. The lossy MPEG-2 codec used for DVDs achieves something similar for video compression.
Lossy audio compression uses psychoacoustic methods to remove inaudible or barely audible components of the signal. Compressing human speech often uses more specialized techniques, so "speech compression" or "speech coding" is sometimes treated as a research field distinct from "audio compression". Different audio and speech compression standards fall under the heading of audio codecs. For example, speech compression is used in Internet telephony, while audio compression is used for CD ripping and is decoded by MP3 players.
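A minimal sketch of run-length encoding in Python (the function name and the (count, character) output format are illustrative only; real formats pack counts and values into bytes):

```python
def rle_encode(data: str) -> list[tuple[int, str]]:
    """Replace each run of identical characters with (count, char)."""
    runs: list[tuple[int, str]] = []
    for ch in data:
        if runs and runs[-1][1] == ch:
            runs[-1] = (runs[-1][0] + 1, ch)  # extend the current run
        else:
            runs.append((1, ch))              # start a new run
    return runs

print(rle_encode("wwwwbbbw"))  # [(4, 'w'), (3, 'b'), (1, 'w')]
```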

Theory

The theoretical basis of compression is information theory (and, closely related to it, algorithmic information theory) together with rate-distortion theory. These fields were largely established by Claude Shannon, who published fundamental papers on the subject in the late 1940s and early 1950s. Doyle and Carlson wrote in 2000 that data compression "has one of the simplest and most elegant design theories in all of engineering". Cryptography and coding theory are closely related disciplines, and the ideas of data compression are deeply rooted in statistical inference.
Many lossless data compression systems can be viewed as a four-stage model. Lossy data compression systems usually include additional stages, such as prediction, frequency transformation, and quantization.
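The central quantity of this theory is Shannon's entropy, which bounds lossless compression: a source emitting symbols $x_i$ with probabilities $p(x_i)$ cannot be losslessly coded with fewer average bits per symbol than

```latex
H(X) = -\sum_{i} p(x_i) \, \log_2 p(x_i)
```

Rate-distortion theory extends this bound to the lossy case, characterizing the minimum bit rate achievable when a given amount of distortion is allowed.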

Popular algorithms

The Lempel-Ziv (LZ) compression methods are among the most popular lossless storage algorithms. DEFLATE is a variant of LZ optimized for decompression speed and compression ratio; although its compression can be slow, PKZIP, gzip, and PNG all use DEFLATE. LZW (Lempel-Ziv-Welch) was patented by Unisys until the patent expired in June 2003; the method is used in GIF images. Also worth mentioning are the LZR (LZ-Renau) methods, which serve as the basis of the Zip method. LZ methods use a table-based compression model in which entries in the table are substituted for repeated strings of data. For most LZ methods, this table is generated dynamically from earlier input data. The table itself is often Huffman encoded (e.g., SHRI, LZX). A well-performing LZ-based coding scheme is LZX, which is used in Microsoft's CAB format.
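To make the dynamic table-building concrete, here is a minimal sketch of the LZW idea in Python (the function name is illustrative, and the output is a list of integer table indices rather than the packed variable-width bit codes a real implementation emits):

```python
def lzw_encode(data: str) -> list[int]:
    """LZW: replace repeated strings with indices into a growing table."""
    table = {chr(i): i for i in range(256)}  # start with all single bytes
    current, output = "", []
    for ch in data:
        if current + ch in table:
            current += ch                        # keep extending the match
        else:
            output.append(table[current])        # emit the longest match
            table[current + ch] = len(table)     # learn a new table entry
            current = ch
    if current:
        output.append(table[current])
    return output

print(lzw_encode("TOBEORNOTTOBEORTOBEORNOT"))
```

Note how repeated substrings ("TOBEOR") are emitted as single indices once the table has learned them; the decoder can rebuild the same table from the index stream alone.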

Arithmetic coding

The best compression tools combine probabilistic model predictions with arithmetic coding. Arithmetic coding was invented by Jorma Rissanen and turned into a practical method by Witten, Neal, and Cleary. It achieves better compression than the better-known Huffman algorithm and is particularly well suited to adaptive data compression, where the predictions depend strongly on context. Arithmetic coding is used in the bi-level image compression standard JBIG and the document compression standard DjVu. The text entry system Dasher is an inverse arithmetic coder.
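A toy sketch of the interval-narrowing idea behind arithmetic coding (Python, using exact fractions and a fixed model for clarity; real coders use carefully scaled integer arithmetic and adaptive probability models):

```python
from fractions import Fraction

def arithmetic_encode(message, probs):
    """Shrink [0, 1) to a subinterval per symbol; any number inside the
    final interval identifies the whole message."""
    low, high = Fraction(0), Fraction(1)
    for sym in message:
        span = high - low
        cum = Fraction(0)
        for s, p in probs.items():
            if s == sym:
                high = low + span * (cum + p)  # uses the old low/span
                low = low + span * cum
                break
            cum += p
    return (low + high) / 2  # one rational number encodes the message

def arithmetic_decode(code, probs, n):
    """Replay the narrowing: pick the symbol whose subinterval holds code."""
    low, high = Fraction(0), Fraction(1)
    out = []
    for _ in range(n):
        span = high - low
        cum = Fraction(0)
        for s, p in probs.items():
            if low + span * cum <= code < low + span * (cum + p):
                out.append(s)
                high = low + span * (cum + p)
                low = low + span * cum
                break
            cum += p
    return "".join(out)

probs = {"a": Fraction(3, 4), "b": Fraction(1, 4)}  # model: 'a' is common
code = arithmetic_encode("aaba", probs)
assert arithmetic_decode(code, probs, 4) == "aaba"
print(code)
```

Because probable symbols shrink the interval only slightly, they cost fewer bits to pin down, which is how arithmetic coding approaches the entropy bound more closely than Huffman's whole-bit codes.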

Types

Data compression can be divided into two types: lossless compression and lossy compression.
Lossless compression means that when the compressed data is reconstructed (restored, or decompressed), the result is identical to the original data. It is used when the reconstructed signal must match the original exactly; a common example is disk file compression. Lossless compression algorithms can generally compress ordinary files to 1/2 to 1/4 of their original size. Commonly used lossless algorithms include the Huffman algorithm and the LZW (Lempel-Ziv-Welch) compression algorithm.
Lossy compression means that the reconstructed data differs from the original, but not in a way that distorts one's understanding of the information the original expressed. It suits cases where the reconstructed signal need not be identical to the original. Image and sound compression can be lossy because such data often contains more detail than our visual and auditory systems can take in; some of it can be discarded without the meaning of the sound or image being misunderstood, while the compression ratio improves greatly.

Extended Reading

On the Internet we can easily send images and audio and share video, thanks not only to the increased bandwidth and speed of the network but also to advances in data compression technology. It is no exaggeration to say that virtually all the data we commonly use has been compressed.
Data compression falls roughly into two types: lossless data compression, which can restore the data exactly to its original state, and lossy data compression, which cannot.
Among lossless methods, the simplest is run-length compression. If a string contains consecutive runs of the same character, the repeats can be replaced by a count to shorten the data. For example, the string aaaabbbcccccc consists of four a's, three b's, and six c's, so it can be written as "4a3b6c", compressing the original 13 characters into six. The same idea applies to images: if the image data contains 12 consecutive red pixels followed by 10 consecutive yellow ones, it can be represented as "12 red, 10 yellow". In real data, however, long runs of identical characters or colors are rare.
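A minimal decoder for this textual scheme in Python (it assumes multi-digit counts each followed by a single non-digit character, exactly as in the "4a3b6c" example; real run-length formats use binary layouts instead):

```python
import re

def rle_decode(encoded: str) -> str:
    """Expand count+character pairs such as '4a3b6c' to the original string."""
    return "".join(ch * int(count)
                   for count, ch in re.findall(r"(\d+)(\D)", encoded))

# Run-length coding is lossless: the original 13 characters come back exactly.
assert rle_decode("4a3b6c") == "aaaabbbcccccc"
print(rle_decode("4a3b6c"))
```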
Another famous lossless algorithm is Huffman coding, a compression technique with a much wider range of applications. In a computer, every character (such as an English letter) is normally represented by 8 bits. Huffman coding abandons this fixed 8-bit allocation: characters that occur frequently in the data are represented by short bit sequences, and characters that occur rarely by long ones. This transformation represents the data more efficiently. Huffman coding achieves a good compression rate and has no patent problems, so it is used in zip and other compression algorithms.
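A compact sketch of the code-table construction in Python (it repeatedly merges the two rarest subtrees, which is Huffman's algorithm; representing each subtree directly as a partial code table is an illustrative shortcut rather than the usual explicit tree):

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict[str, str]:
    """Build a code table: frequent characters get shorter bit strings."""
    # Heap entries: (frequency, tiebreaker, partial code table).
    heap = [(freq, i, {ch: ""}) for i, (ch, freq) in
            enumerate(Counter(text).items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)  # the two rarest subtrees
        f2, _, t2 = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in t1.items()}
        merged.update({ch: "1" + code for ch, code in t2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

codes = huffman_codes("this is an example of a huffman tree")
for ch, code in sorted(codes.items(), key=lambda kv: len(kv[1])):
    print(repr(ch), code)  # common characters print with the shortest codes
```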
Lossy data compression is often used for image, audio, and video data. Before compression, such data contains information that human perception cannot detect; deleting that information makes the data smaller, although the deleted data cannot be recovered. For example, image data includes luminance and hue (the components of color). Human vision is very sensitive to changes in luminance but much less sensitive to differences in hue, so the data representing hue can be cleverly reduced. A mathematical tool called the Fourier transform is used for this.
Some digital cameras can store uncompressed raw data and compressed JPEG data simultaneously. For an image of the same subject, the raw data can reach tens of megabytes (MB), while the JPEG data can be compressed to about one tenth of that. [2]
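As a toy sketch of this transform-and-discard idea in Python: the discrete cosine transform below is the Fourier-related transform that JPEG actually applies to blocks of pixels, though real codecs quantize the coefficients rather than simply zeroing small ones, and the sample values here are invented:

```python
import math

def dct(signal):
    """Naive DCT-II: express the signal as weights of cosine frequencies."""
    n = len(signal)
    return [sum(x * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i, x in enumerate(signal))
            for k in range(n)]

def idct(coeffs):
    """Inverse transform: rebuild (approximate) samples from the weights."""
    n = len(coeffs)
    return [(coeffs[0] / 2
             + sum(c * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                   for k, c in enumerate(coeffs) if k > 0)) * 2 / n
            for i in range(n)]

samples = [52, 55, 61, 66, 70, 61, 64, 73]         # one row of pixel values
coeffs = dct(samples)
lossy = [c if abs(c) > 10 else 0 for c in coeffs]   # discard weak frequencies
print([round(x) for x in idct(lossy)])              # close to, not equal to, input
```

Dropping the weak high-frequency coefficients is what makes the scheme lossy: the reconstruction is visually similar but no longer bit-identical to the original samples.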