|
Copyright (c) 1998, C. Heeren, T. Magliery, and L. Pitt.
This lesson and supporting materials may be copied for nonprofit
educational use only.
Motivation:
In the previous lesson we discussed the way in which data is stored on
a computer. Specifically, we talked about representing numbers as sequences
of 0s and 1s. In this lesson we'll discuss some techniques for representing
text as a sequence of 0s and 1s.
Materials:
-
First Code worksheet - one per student.
-
Simple Code worksheet - one per student.
-
Overhead histogram
of letter usage in Alice in Wonderland, and Morse Code.
-
A Wonderful Code overhead.
-
No Problem overhead.
-
Get Huffy worksheet - one per student.
Lesson Plan:
-
Explain the motivation for the lesson.
-
Discussion: "How can letters be stored using only 0s and 1s?" Someone will
probably suggest the simple code where the letters 'a' through 'z' are
represented using the numbers 1 through 26, but with the numbers written
in binary. This is a good place to start. You may want to prompt
younger children with the preliminary question of how to use decimal numbers
1, 2, ... to represent the letters of the alphabet.
-
Distribute the First Code worksheet
and ask the students to decode the
message. (If they complain that it doesn't work, say "do the best
you can.")
-
Discussion: "Is there a problem?" YES! There are at least
two different solutions to this code! "Why is that a problem?"
(The handout should make this obvious.)
-
"How can we solve this problem?" The students will probably recommend
separating the characters with a space or some other character. Explore
this a bit. What would this separator have to look like? Have
the kids suggest ideas. Remember, it can only be made of 0s and 1s.
Here's an example of something they might suggest, and the way you can
resolve the problem: "how about using the 0 as a space?" to
which you say, "how do you decode 10101? (which is either aaa, or ea, or
just u)" Then they might suggest something like: "how about
separating the letters by spaces, and representing the space by the number
27 written in binary (11011)?" Then you say, "this particular number
isn't bad, but you were just lucky. Besides, we can do better...if
we do it this way, it takes more bits to say "the letter's finished" than
it does to tell the letter itself!"
-
Here the students will need a hint: ask them if the binary number
110 (6) is the same as 0110 and 00110 and 000110, etc. Make sure
they all agree that it is. Tell them that we sometimes call zeros
like this "padding."
-
Now you have to change gears a little bit. Say, "Suppose I give you
this list of numerals and tell you they're a sequence of phone numbers.
(Write any 21 digits on the board.) Can you tell me exactly what
the phone numbers are?" They'll say "YES!" You say, "WHY?"
They'll say, "because we know how long phone numbers are!" You say,
"Computers can be told how long letters are! Now the question is,
how long should they be?"
-
How long should they be? Students may be able to figure this out.
They know they need 26 characters, so 5 is enough since 32 distinct strings
of length 5 can be made. You may need to ask them if 3 is enough?
No, that only gets you through the letter 111=G (if A is zero). Is
4 enough? No, that only gets you through 1111=P (if A is zero).
So, now you have a code: A=00000, B=00001, C=00010, D=00011, etc.
-
Hand out the Simple Code worksheet
for them to solve.
-
Point out that this is a good code, but that we are also interested in
transmitting things quickly, and using storage space wisely so all those
padding zeros are slowing things down, or taking up valuable space.
Next we'll look at a coding technique which addresses this issue.
-
Place histogram of letter usage
in Alice in Wonderland on the overhead. Briefly discuss the findings.
You may want to refer to "RSTLN E" from Wheel-Of-Fortune. "What is
most frequently occurring letter?" "What letter is used least often?"
Also mention "ETAOIN SHRDLU". A plentitude of research agrees that
ETAOIN are the most frequently occurring letters in the English
language, but some people disagree whether SHRDLU are the next 6.
Quite a controversy.
-
Have students examine Morse code, and compare it to the info in the
histogram. They should notice that more frequently occurring letters
have shorter Morse codewords.
-
Explain that if we use short codewords for frequent letters and long codewords
for rare letters we can make the overall length of the text shorter.
This is a VERY important point.
-
The problem with this idea is that if the codewords are different lengths,
we may run into problems like the ones on the First Code worksheet -
we can't tell when the codeword for one letter ends and the next
begins. This is because one letter's code is the beginning part of
another letter's code. For example, A=10, and B=1011.
A code where this doesn't occur is called a "prefix" code.
-
Place the A Wonderful Code handout on
the overhead and ask the students to examine it for a minute to
convince themselves that it is indeed a prefix code. Spend a few
minutes in discussion of the code. They should observe that rare
letters are long, and common letters are short. This is an
example of a
Huffman code based on the letter
frequencies in Alice in Wonderland. Different texts might result in
different codes. More advanced students should research this type of
code and learn to make them.
-
Place the No Problem! handout on the
overhead and carefully explain how to use the decision tree (the
explanation is on the handout). Have the students practice decoding
and encoding letters in addition to the handout's examples.
-
Distribute the Get Huffy worksheet so they can practice on their own.
-
Now we spend some time talking about the amount of data compression we
achieve by using the Huffman Code instead of the fixed length code.
Write these questions on the board, and have the students answer them on
the back of their Get Huffy worksheet.
-
How many bits are required to encode the message using the wonderful Huffman
Code?
-
How many bits would have been required using the simple fixed length coding?
-
Use numbers to describe the difference between the wonderful code and the
simple code.
-
Optional for more advanced students: Let's see what kind of improvement
Huffman Coding provides on a larger amount of text. It turns out that Alice
in Wonderland contains exactly 107717 letters. How many bits are necessary
to store the text using the simple coding scheme of 5 bits per letter?
(Answer: 5 x 107717 =538585.) How many bits are necessary to store the text
using the Huffman Code?
[You may or may not want to help the students with the answer...
To answer this question you'll need both the codeword for each letter and
the frequency with which each is used. Multiply each frequency by the
length of the corresponding codeword, then add up these values. This should
give you the total number of characters. Note that rather than give you
the actual frequency of each letter we have given you the percentage of
the total represented by each letter. To get the frequency,
multiply the percentage by the total number of characters. For
example, we know that there are 13572 'e's (subject to rounding error),
since 12.6% of the 107717 characters are 'e's.]
Finally, after doing all
this, you should find that Alice in Wonderland can be stored in 456225
bits using the Huffman code. How much improvement is this?
(Answer: (538585-456225)/538585 * 100 = 15.3%)
Using other techniques (Lempel-Ziv coding for example),
compression schemes implemented by modems typically achieve in the
neighborhood of 50% compression for text.
See Computer
Science Unplugged for a fun activity based on this alternate scheme.
Conclusion:
Despite the ever growing size of computer storage devices, there is still
a need for systems to store data as efficiently as possible. By coding
text before it is stored, and decoding it when it is retrieved, the storage
capacity of a computer (or transmitting capacity of a modem) can be increased
at the expense of slightly slower access time.
Evaluation:
Collect the set of worksheets completed by the students, including the
written answers on the back of the Get Huffy worksheet. They are
progressively more difficult, and should adequately reflect the students'
understanding of the material. Also, be sure to watch for engagement
and participation during the dialogue at the beginning of the lesson.
Extensions:
Advanced students can be told that the frequency histogram represents a
probability distribution for the letters in the alphabet. They should be
able to find the expected length of a codeword, given a particular code.
In addition, given a discrete random variable, they may be able to discover
the Huffman coding algorithm on their own.
There are entire courses in coding theory, in which Huffman Codes are
discussed in great detail. Most discrete mathematics textbooks discuss
the Huffman algorithm.
Students can spend time compiling letter frequencies from a favorite
text and then graphing and analyzing the results. This can either
be done individually with short texts, or as a class on a longer text.
Advanced students can create the Huffman code using the Huffman algorithm
(see previous suggestion), and analyze the compression on the original
text.
Student teams can be challenged to devise particular codes for 3 or
4-letter alphabets with given frequencies, and see which team does best.
For example, if 'a' occurs 50% of the time, and 'b', 'c' each occur 25%
of the time, then the code a=0, b=10, c=11 works as well as any code. How
about if 'a' occurs 50% of the time, 'b' 25%, and 'c', 'd' each 12.5%?
Create your own problems. The culmination of experimenting with particular
examples might be to introduce them to Huffman's method for computing the
best code (this can be found in virtually every text on discrete mathematics,
or can be found on the web easily). An interesting bit of history: Huffman
devised his method while he was "only" an undergraduate in college - the
problem had been unresolved prior to his discovery.
Related Links:
-
Data Compression,
Jack Ganssle, the Ganssle Group
-
Data
Compression, Jeremy Johnson, Drexel University
-
Huffman Code, John Beidler, University of Scranton, PA
-
Huffman Code Introduction, Java applet, Eric Acosta, Texas Tech
University
-
Huffman Codes, Karen Van Houton, University of Idaho
Standards:
-
NCTM K-4: 1, 2, 3, 4, 6, 8, 11, 13.
-
NCTM 5-8: 1, 2, 3, 4, 7, 8, 10.
-
NCTM 9-12: 1, 2, 3, 4, 10, 12.
Skills:
- Problem solving / reasoning / communication / connections
- Encoding and decoding practice
- Binary number practice
- Statistics and data collection
- Multiplication
Copyright (c) 1998, C. Heeren, T. Magliery, and L. Pitt.
This lesson and supporting materials may be copied for nonprofit
educational use only.
|