Error Correction and Data Compression

Prepared by:

Joseph Malkevitch
Mathematics and Computing Department
York College (CUNY)
Jamaica, New York 11451-0001

Many recent communications technologies (e.g. cell phones, fax, HDTV) use both error correction and data compression. If one uses a binary stream of digits to send a picture, sound, or text file, the channel over which it is sent may be noisy, so one wants to protect the data with some kind of error-correction system. Error correction invariably uses more bits to represent the data than would be needed if there were no noise and one could send the original binary stream. It is usually wise to compress the original data stream before it is sent, since the cost (in money or time) of sending extra bits across the channel nearly always exceeds the cost of compressing and adding error protection before the information is sent, and of correcting errors and decompressing after it arrives. One has to compress first and then add error protection, since compressing afterwards would disturb the redundancy that the error-correction step deliberately added.

Suppose that one wants to use an error correction system on a long string of sequenced DNA. We will assume that the only symbols to be considered are A, C, G, and T. Thus, we have four symbols, each of which is to be represented by a binary code word. We can encode four symbols using a code whose words have five binary digits and which is capable of correcting one error per code word. (With code words of fixed length less than 5 this is not possible.)

A = 11111

C = 11000

G = 00100

T = 00011

You can check that the Hamming distance (the number of positions in which two code words differ) between every pair of these code words is at least 3. Thus, this code can correct up to one error per code word.
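
The distance check, and the single-error correction it makes possible, can be sketched in Python. This is only an illustration: the names hamming_distance, min_distance, and correct are choices made here, and decoding to the nearest code word is assumed as the correction rule.

```python
from itertools import combinations

# The 5-bit code from the text.
CODE = {"A": "11111", "C": "11000", "G": "00100", "T": "00011"}


def hamming_distance(u, v):
    """Number of positions in which two equal-length bit strings differ."""
    return sum(a != b for a, b in zip(u, v))


def min_distance(code):
    """Smallest Hamming distance over all pairs of distinct code words."""
    return min(hamming_distance(u, v) for u, v in combinations(code.values(), 2))


def correct(received, code=CODE):
    """Decode a received word to the symbol whose code word is nearest."""
    return min(code, key=lambda sym: hamming_distance(code[sym], received))


print(min_distance(CODE))  # 3
print(correct("11010"))    # C  (11000 with its fourth bit flipped)
```

Because the minimum distance is 3, a word with one flipped bit is at distance 1 from the code word that was sent and at distance at least 2 from every other code word, so the nearest code word is the right one.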

Suppose that A, C, G, and T appear with frequencies:

A = 20, C = 10, G = 4, and T = 2. If we used a fixed-length binary code without error correction, we would be able to send the 36 information symbols with 2(36) = 72 bits.

For example, we could use the code:

A = 00

C = 10

G = 01

T = 11

However, by constructing a Huffman code with these frequencies we would get the code:

A = 1

C = 01

G = 001

T = 000
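
One way to arrive at this code is sketched below in Python, using the standard Huffman procedure of repeatedly merging the two lightest subtrees. The function name huffman_code is a choice made here, and so is the rule for labeling branches (0 for the lighter subtree, 1 for the heavier), which was picked so that the output matches the code words listed above. A different labeling gives different code words with the same lengths.

```python
import heapq
from itertools import count


def huffman_code(freqs):
    """Build a Huffman code from a {symbol: frequency} mapping.

    Assumes at least two symbols. At each merge the lighter subtree
    is labeled 0 and the heavier one 1 (a convention chosen here).
    """
    tiebreak = count()  # keeps heap entries comparable when weights are equal
    heap = [(w, next(tiebreak), {sym: ""}) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w0, _, light = heapq.heappop(heap)  # smallest total weight
        w1, _, heavy = heapq.heappop(heap)  # next smallest
        merged = {s: "0" + c for s, c in light.items()}
        merged.update({s: "1" + c for s, c in heavy.items()})
        heapq.heappush(heap, (w0 + w1, next(tiebreak), merged))
    return heap[0][2]


freqs = {"A": 20, "C": 10, "G": 4, "T": 2}
code = huffman_code(freqs)
for sym in sorted(code):
    print(sym, "=", code[sym])
```

With these frequencies the merges are G with T (weight 6), then that subtree with C (weight 16), then that subtree with A (weight 36).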

Note that no code word is a prefix of any other code word; hence, this code can be decoded unambiguously. Huffman discovered his famous approach to data compression as an "exercise" assigned in a class taught by Robert Fano of MIT (son of Gino Fano, for whom the finite projective plane with 7 points is named).
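
The prefix property and the unambiguous decoding it allows can be illustrated with a short Python sketch. The function names and the sample string GATTACA are hypothetical choices made for this example only.

```python
CODE = {"A": "1", "C": "01", "G": "001", "T": "000"}


def is_prefix_free(code):
    """True if no code word is a prefix of a different code word."""
    words = list(code.values())
    return not any(
        i != j and v.startswith(u)
        for i, u in enumerate(words)
        for j, v in enumerate(words)
    )


def encode(text, code=CODE):
    """Concatenate the code words of the symbols in text."""
    return "".join(code[sym] for sym in text)


def decode(bits, code=CODE):
    """Read bits left to right, emitting a symbol each time a code word completes."""
    lookup = {word: sym for sym, word in code.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in lookup:
            out.append(lookup[current])
            current = ""
    if current:
        raise ValueError("bit string ends in the middle of a code word")
    return "".join(out)


bits = encode("GATTACA")
print(is_prefix_free(CODE))  # True
print(bits)                  # 00110000001011
print(decode(bits))          # GATTACA
```

The decoder never has to look ahead: since no code word begins another, the first code word that matches the bits read so far must be the one that was sent.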

How many bits does this use?

Since A is coded 20 times with one bit, this uses 20 bits; since C is coded 10 times with two bits, this uses 20 bits; since G is coded 4 times with three bits, this uses 12 bits; since T is coded 2 times with three bits, this uses 6 bits.

The total is 20 + 20 + 12 + 6 = 58 bits instead of 72 bits. We have saved 14 bits, which is 14/72 = 7/36 of the original length.

The way that the message is processed for error correction is to split the compressed bit stream into consecutive pairs of bits. Each pair is one of the four possibilities 00, 01, 10, and 11, and each is coded using 5 bits as shown below:

00 = 11111

01 = 11000

10 = 00100

11 = 00011.

Since the compressed "text" would have 58 bits, there would be 29 pairs (of four potential kinds), each of which would be encoded using 5 bits, for a total of 145 bits. This is a saving over the uncompressed but error-corrected stream of 5(36) = 180 bits. Again, the saving is 35/180 = 7/36.
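
The whole pipeline, compression followed by error protection, can be traced with the Python sketch below. The 36-symbol message is a hypothetical one built only to have the stated frequencies; the order of its symbols does not affect the bit counts.

```python
from fractions import Fraction

HUFFMAN = {"A": "1", "C": "01", "G": "001", "T": "000"}
PAIR_TO_WORD = {"00": "11111", "01": "11000", "10": "00100", "11": "00011"}
FREQS = {"A": 20, "C": 10, "G": 4, "T": 2}

# Hypothetical message: 20 A's, then 10 C's, then 4 G's, then 2 T's.
message = "".join(sym * n for sym, n in FREQS.items())

# Step 1: compress with the Huffman code.
compressed = "".join(HUFFMAN[sym] for sym in message)

# Step 2: split into consecutive pairs of bits (the length, 58, is even here)
# and replace each pair by its 5-bit code word.
pairs = [compressed[i:i + 2] for i in range(0, len(compressed), 2)]
protected = "".join(PAIR_TO_WORD[p] for p in pairs)

# For comparison: 5 bits per symbol with no compression.
uncompressed_protected = 5 * len(message)

print(len(compressed))                 # 58
print(len(pairs), len(protected))      # 29 145
print(uncompressed_protected)          # 180
print(Fraction(uncompressed_protected - len(protected), uncompressed_protected))  # 7/36
```

The sketch relies on the compressed stream having an even number of bits. A stream of odd length would need some convention, such as padding, which the text does not discuss.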

Exercises:

1. What is the shortest possible code word length in a fixed-length binary code that represents:

i. 56 information symbols
ii. 116 information symbols
iii. upper and lower case English letters, a blank, and all the decimal digits
iv. upper and lower case English letters, a blank, all the decimal digits, and symbols for the colon, the apostrophe, the semi-colon, quotation marks, the comma, the period and the question mark.

2. Design a Huffman code for the following string of decimal digits:

111210000110000133245678992200001100000100000001240055000222000001

How much compression is achieved over the best binary code of fixed length code words?

3. Design a Huffman code for the following text. Be sure to allow for the "blank" and "period" but treat all letters as if they are lower case:

The sly fox jumped over the cheese and eat all of the corn and the wheat.

How much compression is achieved over the best binary code of fixed length code words?