MATHmaniaCS - Lesson 2: Text Compression

Lesson 2: Text Compression




					Copyright (c) 1998, C. Heeren, T. Magliery, and L. Pitt. This lesson and supporting materials may be copied for nonprofit educational use only. Motivation: In the previous lesson we discussed the way in which data is stored on a computer. Specifically, we talked about representing numbers as sequences of 0s and 1s. In this lesson we'll discuss some techniques for representing text as a sequence of 0s and 1s. Materials: First Code worksheet - one per student. Simple Code worksheet - one per student. Overhead histogram of letter usage in Alice in Wonderland, and Morse Code. A Wonderful Code overhead. No Problem overhead. Get Huffy worksheet - one per student. Lesson Plan: Explain the motivation for the lesson. Discussion: "How can letters be stored using only 0s and 1s?" Someone will probably suggest the simple code where the letters 'a' through 'z' are represented using the numbers 1 through 26, but with the numbers written in binary. This is a good place to start. You may want to prompt younger children with the preliminary question of how to use decimal numbers 1, 2, ... to represent the letters of the alphabet. Distribute the First Code worksheet and ask the students to decode the message. (If they complain that it doesn't work, say "do the best you can.") Discussion: "Is there a problem?" YES! There are at least two different solutions to this code! "Why is that a problem?" (The handout should make this obvious.) "How can we solve this problem?" The students will probably recommend separating the characters with a space or some other character. Explore this a bit. What would this separator have to look like? Have the kids suggest ideas. Remember, it can only be made of 0s and 1s. Here's an example of something they might suggest, and the way you can resolve the problem: "how about using the 0 as a space?" to which you say, "how do you decode 10101? (which is either aaa, or ea, or just u)" Then they might suggest something like: "how about separating the letters by spaces, and representing the space by the number 27 written in binary (11011)?" Then you say, "this particular number isn't bad, but you were just lucky. Besides, we can do better...if we do it this way, it takes more bits to say "the letter's finished" than it does to tell the letter itself!" Here the students will need a hint: ask them if the binary number 110 (6) is the same as 0110 and 00110 and 000110, etc. Make sure they all agree that it is. Tell them that we sometimes call zeros like this "padding." Now you have to change gears a little bit. Say, "Suppose I give you this list of numerals and tell you they're a sequence of phone numbers. (Write any 21 digits on the board.) Can you tell me exactly what the phone numbers are?" They'll say "YES!" You say, "WHY?" They'll say, "because we know how long phone numbers are!" You say, "Computers can be told how long letters are! Now the question is, how long should they be?" How long should they be? Students may be able to figure this out. They know they need 26 characters, so 5 is enough since 32 distinct strings of length 5 can be made. You may need to ask them if 3 is enough? No, that only gets you through the letter 111=G (if A is zero). Is 4 enough? No, that only gets you through 1111=P (if A is zero). So, now you have a code: A=00000, B=00001, C=00010, D=00011, etc. Hand out the Simple Code worksheet for them to solve. Point out that this is a good code, but that we are also interested in transmitting things quickly, and using storage space wisely so all those padding zeros are slowing things down, or taking up valuable space. Next we'll look at a coding technique which addresses this issue. Place histogram of letter usage in Alice in Wonderland on the overhead. Briefly discuss the findings. You may want to refer to "RSTLN E" from Wheel-Of-Fortune. "What is most frequently occurring letter?" "What letter is used least often?" Also mention "ETAOIN SHRDLU". A plentitude of research agrees that ETAOIN are the most frequently occurring letters in the English language, but some people disagree whether SHRDLU are the next 6. Quite a controversy. Have students examine Morse code, and compare it to the info in the histogram. They should notice that more frequently occurring letters have shorter Morse codewords. Explain that if we use short codewords for frequent letters and long codewords for rare letters we can make the overall length of the text shorter. This is a VERY important point. The problem with this idea is that if the codewords are different lengths, we may run into problems like the ones on the First Code worksheet - we can't tell when the codeword for one letter ends and the next begins. This is because one letter's code is the beginning part of another letter's code. For example, A=10, and B=1011. A code where this doesn't occur is called a "prefix" code. Place the A Wonderful Code handout on the overhead and ask the students to examine it for a minute to convince themselves that it is indeed a prefix code. Spend a few minutes in discussion of the code. They should observe that rare letters are long, and common letters are short. This is an example of a Huffman code based on the letter frequencies in Alice in Wonderland. Different texts might result in different codes. More advanced students should research this type of code and learn to make them. Place the No Problem! handout on the overhead and carefully explain how to use the decision tree (the explanation is on the handout). Have the students practice decoding and encoding letters in addition to the handout's examples. Distribute the Get Huffy worksheet so they can practice on their own. Now we spend some time talking about the amount of data compression we achieve by using the Huffman Code instead of the fixed length code. Write these questions on the board, and have the students answer them on the back of their Get Huffy worksheet. How many bits are required to encode the message using the wonderful Huffman Code? How many bits would have been required using the simple fixed length coding? Use numbers to describe the difference between the wonderful code and the simple code. Optional for more advanced students: Let's see what kind of improvement Huffman Coding provides on a larger amount of text. It turns out that Alice in Wonderland contains exactly 107717 letters. How many bits are necessary to store the text using the simple coding scheme of 5 bits per letter? (Answer: 5 x 107717 =538585.) How many bits are necessary to store the text using the Huffman Code? [You may or may not want to help the students with the answer... To answer this question you'll need both the codeword for each letter and the frequency with which each is used. Multiply each frequency by the length of the corresponding codeword, then add up these values. This should give you the total number of characters. Note that rather than give you the actual frequency of each letter we have given you the percentage of the total represented by each letter. To get the frequency, multiply the percentage by the total number of characters. For example, we know that there are 13572 'e's (subject to rounding error), since 12.6% of the 107717 characters are 'e's.] Finally, after doing all this, you should find that Alice in Wonderland can be stored in 456225 bits using the Huffman code. How much improvement is this? (Answer: (538585-456225)/538585 * 100 = 15.3%) Using other techniques (Lempel-Ziv coding for example), compression schemes implemented by modems typically achieve in the neighborhood of 50% compression for text. See Computer Science Unplugged for a fun activity based on this alternate scheme. Conclusion: Despite the ever growing size of computer storage devices, there is still a need for systems to store data as efficiently as possible. By coding text before it is stored, and decoding it when it is retrieved, the storage capacity of a computer (or transmitting capacity of a modem) can be increased at the expense of slightly slower access time. Evaluation: Collect the set of worksheets completed by the students, including the written answers on the back of the Get Huffy worksheet. They are progressively more difficult, and should adequately reflect the students' understanding of the material. Also, be sure to watch for engagement and participation during the dialogue at the beginning of the lesson. Extensions: Advanced students can be told that the frequency histogram represents a probability distribution for the letters in the alphabet. They should be able to find the expected length of a codeword, given a particular code. In addition, given a discrete random variable, they may be able to discover the Huffman coding algorithm on their own. There are entire courses in coding theory, in which Huffman Codes are discussed in great detail. Most discrete mathematics textbooks discuss the Huffman algorithm. Students can spend time compiling letter frequencies from a favorite text and then graphing and analyzing the results. This can either be done individually with short texts, or as a class on a longer text. Advanced students can create the Huffman code using the Huffman algorithm (see previous suggestion), and analyze the compression on the original text. Student teams can be challenged to devise particular codes for 3 or 4-letter alphabets with given frequencies, and see which team does best. For example, if 'a' occurs 50% of the time, and 'b', 'c' each occur 25% of the time, then the code a=0, b=10, c=11 works as well as any code. How about if 'a' occurs 50% of the time, 'b' 25%, and 'c', 'd' each 12.5%? Create your own problems. The culmination of experimenting with particular examples might be to introduce them to Huffman's method for computing the best code (this can be found in virtually every text on discrete mathematics, or can be found on the web easily). An interesting bit of history: Huffman devised his method while he was "only" an undergraduate in college - the problem had been unresolved prior to his discovery. Related Links: Data Compression, Jack Ganssle, the Ganssle Group Data Compression, Jeremy Johnson, Drexel University Huffman Code, John Beidler, University of Scranton, PA Huffman Code Introduction, Java applet, Eric Acosta, Texas Tech University Huffman Codes, Karen Van Houton, University of Idaho Standards: NCTM K-4: 1, 2, 3, 4, 6, 8, 11, 13. NCTM 5-8: 1, 2, 3, 4, 7, 8, 10. NCTM 9-12: 1, 2, 3, 4, 10, 12. Skills: Problem solving / reasoning / communication / connections Encoding and decoding practice Binary number practice Statistics and data collection Multiplication Copyright (c) 1998, C. Heeren, T. Magliery, and L. Pitt. This lesson and supporting materials may be copied for nonprofit educational use only.