Compression is a method or protocol for using
fewer bits to represent the original information. Compression can
be achieved in a variety of methods including looking for patterns
and substituting symbols for the larger patterns of data. Compression
can be a "hard problem" for computers because it is difficult to
know whether or not the compression you've found is optimal - if
you keep trying would it get better? It's hard to know when to stop,
and hard to verify that you've compressed it "enough". When it's
impossible, or would take an unreasonable amount of time, to know an
exact solution you can create a strategy called a "heuristic" to
define some rules about when the solution is good enough. See the vocabulary section below.
Heuristic: A problem solving approach (algorithm)
to find a satisfactory solution where finding an optimal or exact
solution is impractical or impossible.
Lossless Compression: A data compression
algorithm that allows the original data to be perfectly
reconstructed from the compressed data.
Students will be able to:
Compress a piece of text using the Text Compression program.
Explain why the optimal amount of compression is impossible or "hard" to identify.
Explain some factors that make compression challenging.
Develop a heuristic for text compression.
Describe the purpose and rationale for lossless compression.
This is a big lesson that covers a lot of bases. First and foremost it covers the following topics directly from the CSP framework:
1. Lossless Compression: The basic principle behind compression is to develop a method or protocol for using fewer bits to represent the original information.
2. Heuristics: There is no single correct way to compress text using the method we use in this lesson because a) there is no known algorithm for finding an optimal solution, and b) we don't even know a way to verify whether a given solution is optimal. There is no way to prove it or derive it beyond trying all possibilities by brute force. This is an example of an algorithm that cannot run in a "reasonable amount of time" - one of the CSP learning objectives.
3. Foreshadowing Programming Behaviors: Lastly, the Text Compression Activity is an important lesson to refer back to when students start programming. The activity engages students in thinking and problem solving behaviors that foreshadow skills that are particularly useful for programming later down the line. In particular, when students recognize patterns that repeat, and then represent those patterns as abstract symbols, and then further recognize patterns within those patterns, it is very similar to the kinds of abstractions we develop when writing functions and procedures when programming. Decoding the message in the activity is very similar to tracing a sequence of function calls in a program.
Before beginning the lesson activities, briefly
talk with students about compression -- how they might use it normally
in their own discourse, what it does, and why it's important in computing.
When you send text messages to a friend, do you spell every word correctly?
Do you use abbreviations for common words? List as many as you can.
Write some examples of things you might see in a text message that are not proper English.
Why do you use these abbreviations? What is the benefit?
Today's class is about compression.
When you abbreviate or use coded language to shorten the original text, you are "compressing text." Computers do this too, in order to save time and space.
The art and science of compression is about figuring out how to represent the same data with fewer bits.
Why is compression important?
One reason is that storage space is limited and you'd always prefer to use fewer bits if you could.
A much more compelling reason is that there is an upper limit to how fast bits can be transmitted over the Internet.
What if we need to send a large amount of text faster over the Internet, but we've reached the physical limit of how fast we can send bits? Our only choice is to somehow capture the same information with fewer bits; we call this compression.
Let's look at an example of a text message that's been compressed in a clever way.
Provide students with the Decode This Message - Activity Guide.
Either individually or with partners, let students decode the message using the key in the activity guide.
How much was it compressed?
To answer, we need to compare the number of characters in the original poem to the number of characters needed to represent the compressed version.
Display or demonstrate the ideas from the Decode This Message - Activity Recap (Teacher's Guide)
Important Note: The compressed poem is not just this part. If you were to send this to someone over the Internet they would not be able to decode it. The full compressed text includes BOTH the compressed text and the key to solve it. Thus, you must account for the total number of characters in the message plus the total number of characters in the key to see how much you've compressed it over the original.
Now you are going to try your hand at compressing things on your own.
First, watch the following video about Text Compression.
Rather than using Code.org's Text Compression Widget as described in the video, we will use a Quorum program instead. The program is fully finished - it doesn't need any modifications, and can be run as-is. The widget can be accessed by the "Lossless Text Compression Widget" resource.
How much were you able to compress your sentences? What was difficult about it?
What makes compression hard?
Do we think that these compression amounts that we've found
are the the best? Is there a way to know what the best compression is?
Is there a process a person can follow to find the best
(or a pretty good) compression for a piece of text?
In computer science there is a word for strategies to use when you're
not sure what the exact or best solution to a problem is. A HEURISTIC
is a problem
solving approach (typically an algorithm) to find a satisfactory
solution where finding an optimal or exact solution is impractical
Continue working on compressing your text using a key.
As you do so, develop a set of rules, or a
'heuristic' that generally seems to provide good results.
Record your heuristic as a list of steps that someone else
unfamiliar with the problem could follow and still end up with
If working with a group, trade your heuristics with someone
else. Are they clear and specific enough that you always know
what to do? If not, provide feedback to one another and improve
your heuristics to provide clearer instructions.
Using someone else's heuristic, attempt to compress one or
more text passage. Record the amount of compression you achieve.
Compression in the Real World (.zip)
Zip Compression - There is a compression algorithm called
LZW compression upon which the common "zip" utility is based.
Zip compression does something very similar to what you did today
with text compression.
Do you want to use zip compression for real? Most computers
have it built in: Windows - select a file or group of files,
right-click, and choose "Send To...Compressed (zipped) Folder"
Mac - select a file or group of files, ctrl+click, and choose "Compress Items"
Warning: if you try this results may vary. Zip works really
well for text, but only on large files. If you try to compress
the simple hello.txt file we used in a previous lesson, you'll
see the resulting file is actually bigger. Zip is meant for text.
It might not work on non-text files very well because they
are already compressed or don't have the same kinds of embedded
patterns that text documents do.
Final Thoughts: If you send the compressed poem, would your
friend be able to read it? Why is the key important? Your
friend would only be able to read it if she knew how it was encoded.
The dictionary or key is necessary because it tells her how to
decompress the information that she has. Why do you want to compress anything? What's the point?
It is useful for sending things faster or for smaller storage.
It allows for optimization of limited resources. For a piece of text,
what is a "good" amount of compression? Is there a way to know
when you've compressed it the most? Explain how you would know, or why you can't know.
Do you think it's possible to describe (or write) a specific
set of instructions that a person could follow that would always
result in better text compression than your heuristic? Why or why not?
Some compression programs (like zip) do a great job if the file
is sufficiently large and has reasonable amounts of repetition.
However, it is also possible to create a 'compressed file' that
is larger than the original because the heuristic doesn't work in every single case.
Is there a way to know that a compressed piece of text is as
compressed as possible? If yes, describe how you could
determine it. If no, why not? There is no perfect solution. The
size and shape of the data will determine what the 'best' answer
is and we often cannot even be sure it is the best answer
(only that it is better than other answers we have tried.)
What did everyone's processes for compression have in common?
Real World Zip Compression: Experiment with zip using text
files with different contents. Are the results for small files
as good as for large files? (On Macs, in the Finder choose “get
info" for a file to see the actual number of bytes in the file,
since the Finder display will show 4KB for any file that's less
than that. Warning: results may vary. Zip works really well for
text, but it might not compress other files very well because
they are already compressed or don't have the same kinds of
embedded patterns that text documents do.
Challenge: Research the LZW algorithm - .zip compression is
based on the LZW Compression Scheme. While the idea behind the
text compression tool is similar to LZW (zip) algorithm,
tracing the path of compression and decompression is somewhat
challenging. Learning more about LZW and what happens in the
course of this algorithm would be an excellent extension project.