CS 8761 Natural Language Processing - Fall 2004
Assignment 2 - Due Friday Oct 15, 2004, noon
This may be revised in response to your questions.
Last update Sunday Oct 10, 4pm
Objectives
To estimate the entropy of English and interpret the results of
experiments using literary and newswire text. You may also gain an
appreciation for the difficulty of sentence boundary detection.
Specification
Write a Perl program called shannon.pl that will play the "Shannon game".
Your program should accept as input an arbitrary number of sentences that
are read from STDIN. The user must then guess each sentence, letter by
letter. You should estimate the entropy of English based on these guesses.
Your Shannon Game program should display one sentence at a time, where all
the positions in the sentence have been covered by "*" characters. When
the user guesses the correct letter it should be displayed in place of the
"*". When the user guesses the complete sentence, the total entropy for
that sentence should be displayed, as should a running total that
reflects the entropy across all the sentences processed thus far.
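The assignment leaves the exact entropy calculation to you. One common reading of Shannon's guessing game (an assumption here; confirm the formula you are expected to use against the course notes) estimates per-sentence entropy as the average of log base 2 of the number of guesses needed at each position. A minimal sketch:

```perl
use strict;
use warnings;

# Sketch only: H = (1/n) * sum over positions of log2(g_i), where g_i is
# the number of guesses it took to reveal position i. This particular
# formula is an assumption, not part of the spec.
sub sentence_entropy {
    my @guesses = @_;
    my $sum = 0;
    $sum += log($_) / log(2) for @guesses;   # log base 2
    return $sum / @guesses;
}

# Two positions guessed on the first try, one on the fourth try:
printf "%.2f\n", sentence_entropy(1, 1, 4);   # prints 0.67
```

Under this reading, the cumulative figure would be the same average taken over every position in every sentence processed so far, not an average of the per-sentence values.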
You should not assume that sentence boundaries have been determined in
your input text. Thus, you should provide a sentence boundary detector
program called boundary.pl to create the input to your Shannon game
program. You should require that each sentence selected by boundary.pl
has a minimum of 7 words. Make sure that once a sentence has been
selected, it is not selected again during the same run of the program.
Your sentence boundary detection program should also convert the
text to upper case and remove all punctuation and non-alpha characters
except for spaces. Please note that sentence boundary detection is a
difficult problem in its own right, so you should not be surprised if
you have an imperfect solution. That is ok, as long as you are able to
get the simple cases correct. At a minimum, your boundary detector should
identify sentences that end with . ? or ! and have no internal uses of
these characters (as might occur in an abbreviation or quotation). You
should also be able to identify a sentence that spans multiple lines, and
multiple sentences on a single line. For example, the following are cases you
should be able to handle.
This sentence is
going to go on and on for a
long while, and then end.
What a great sentence. This one is even better!
My friends, I wish that I would be with you today, but alas, I can't!
Examples of cases that I wouldn't necessarily expect you to handle (but
would be happy if you did) are as follows:
Dr. Johnson, the noted Ph.D. in anthropology, is a great person.
My friend said, "Hey! Ted! You're lost!" and I had to agree.
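For the simple cases, the boundary logic can be sketched as a naive split on end punctuation followed by cleanup. Random sentence selection and file handling are omitted, and the function name is illustrative:

```perl
use strict;
use warnings;

# Simple-case sketch only: split on . ? ! and keep cleaned sentences of
# at least 7 words. Abbreviations ("Dr.") and quoted punctuation will be
# mis-split, which the assignment accepts at this minimum level.
sub simple_sentences {
    my ($text) = @_;
    $text =~ s/\s+/ /g;                  # join multi-line sentences
    my @kept;
    for my $s (split /[.?!]/, $text) {   # naive boundary split
        $s = uc $s;                      # upper case
        $s =~ s/[^A-Z ]//g;              # drop punctuation and non-alpha
        $s =~ s/^ +| +$//g;              # trim
        $s =~ s/ +/ /g;
        my @words = split ' ', $s;
        push @kept, $s if @words >= 7;
    }
    return @kept;
}

my @s = simple_sentences("The quick brown fox\njumps over the lazy dog. So short.");
print "$_\n" for @s;    # only the 9-word sentence survives
```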
These two programs should be run as follows from the command line:
boundary.pl 10 chapter1.txt chapter2.txt | shannon.pl outfile1 outfile2 outfile3 outfile4
boundary.pl will select 10 random sentences from chapter1.txt and
chapter2.txt and output them to STDOUT. shannon.pl will read from STDIN
and display the sentences one by one to the user (where the actual
letters and spaces have been covered with "*" characters). outfile1,
outfile2, outfile3, and outfile4 (whose names are specified by the user)
are the output files described below.
If the user happens to guess the same letter twice for the same position,
count it only once. Keep track of the letters already guessed at each
position to avoid double counting.
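One way to implement this bookkeeping (the names here are illustrative) is a hash keyed on position and letter, so a repeated guess at the same position is ignored:

```perl
use strict;
use warnings;

# Sketch: count a guess at a position only the first time that letter is
# tried there. %seen maps "position:letter" to a true value.
my %seen;
my @count;    # distinct guesses made at each position so far

sub record_guess {
    my ($pos, $letter) = @_;
    $letter = uc $letter;
    return 0 if $seen{"$pos:$letter"}++;   # repeat guess: not counted
    $count[$pos]++;
    return 1;
}

record_guess(0, 'e');
record_guess(0, 'e');    # duplicate at position 0, ignored
record_guess(0, 't');
print "$count[0]\n";     # prints 2
```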
You may assume that the command line arguments are valid. There is no need
to include command line error checking. Please note that I will test your
program automatically on a csdev machine, so please test it on the same
machines and follow exactly the same command line format.
Output
Your program should reveal the sentence as letters are guessed. This does
not need to be a fancy display - simple ASCII output is fine.
After each sentence is guessed, your program should output the entropy
associated with that sentence, and a cumulative total of entropy for all
sentences processed thus far. You only need to display two digits of
precision to the right of the decimal point.
Your program should also create four output files whose names are given
on the command line. Above they are called outfile1, outfile2, outfile3,
and outfile4, and they will be referred to as such here; however, you
should assume that the user can name them anything they like.
outfile1: A table of the average number of guesses before the given
letter was guessed correctly.
A 4.44
B 3.21
etc.
This shows that when A was the correct letter, it took 4.44 tries on
average to guess it correctly. Your table should have 27 entries, one for
each of the 26 letters plus one for the space character.
outfile2: A table showing the average number of guesses to get a letter
correct given that the previous letter is as shown:
A 3.99
B 2.99
etc.
This shows that when A was the letter preceding the letter to be guessed,
it took 3.99 turns to guess that next letter correctly.
outfile3: A table showing the average number of guesses to get a letter
correct given that the following letter is as shown:
A 2.99
B 4.55
etc.
This shows that when A follows the letter to be guessed, it took 2.99
guesses (on average) to get that preceding letter correct.
The values in these files should be computed to 2 digits of precision to
the right of the decimal point.
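All three tables can be accumulated in one pass over each guessed sentence. The structure below is illustrative, not required: each table maps a conditioning letter to a running sum of guess counts and a number of observations, and the averages are printed at the end.

```perl
use strict;
use warnings;

# Illustrative accumulation for outfile1-outfile3. For position i we
# know the revealed character, its neighbours, and the guess count g.
# Each table maps a letter to [sum_of_guesses, number_of_observations].
my (%self, %prev, %next);

sub add { my ($t, $k, $g) = @_; $t->{$k}[0] += $g; $t->{$k}[1]++; }

sub tally {
    my ($sentence, @guesses) = @_;          # one guess count per character
    my @chars = split //, $sentence;
    for my $i (0 .. $#chars) {
        my $g = $guesses[$i];
        add(\%self, $chars[$i],     $g);                  # outfile1
        add(\%prev, $chars[$i - 1], $g) if $i > 0;        # outfile2
        add(\%next, $chars[$i + 1], $g) if $i < $#chars;  # outfile3
    }
}

tally("AB A", 3, 1, 1, 2);
for my $k (sort keys %self) {   # averages to 2 decimal places
    printf "%s %.2f\n", $k, $self{$k}[0] / $self{$k}[1];
}
# A appears twice with guess counts 3 and 2, so its average is 2.50.
```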
outfile4: A table that displays each sentence that was guessed, the
number of guesses per letter, and the entropy for that sentence and
overall. Each entry in this table should be formatted as follows:
M Y S E N T E N C E I S H E R E T O D A Y
10 2 1 9 3 2 1 1 1 1 1 1 8 3 1 10 2 1 1 1 9 1 1 1 1
Sentence Entropy: X.XXXX
Cumulative Entropy: X.XXXX
Note that the sentence above has only 5 words; it is shown just to
illustrate the output format for outfile4. Remember that each of your
sentences should contain at least 7 words.
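A plain printf with a fixed field width is enough to line up the letter row with its guess-count row. The 3-character column width here is a formatting choice, not a requirement:

```perl
use strict;
use warnings;

# Layout sketch for one outfile4 entry: letters above, guess counts below.
my @letters = split //, "MY DAY";
my @guesses = (4, 2, 1, 9, 3, 2);
printf "%-3s", $_ for @letters;
print "\n";
printf "%-3d", $_ for @guesses;
print "\n";
```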
Experiments and Report
You should produce a written report describing the outcome of the
following experiments. This should be a plain text file called
experiments.txt. This report should begin with a description of your
method for computing entropy of English.
Download a novel from
Project Gutenberg
(www.gutenberg.net). In your report, please name the novel and also
provide a specific URL from which it can be downloaded directly.
Run the Shannon Game for
20 randomly selected sentences from this novel. Examine the results in
outfile1, outfile2, outfile3, and outfile4. What conclusions can you draw
from these results? Please
make an effort to incorporate all the different pieces of information you
have available into your analysis. Make sure you include your input
and output files as well as the entropy values in your report.
Now use the apw.100000 file to select 20 random sentences and repeat the
experiment and analysis that you conducted for the novel.
After running experiments on both kinds of texts, what general conclusions
can you draw about the entropy of English? Make specific references to the
results in your output files, as well as the overall entropy values.
Make sure to include your view on whether or not the entropy of
English is different depending on the type of text. Make sure to support
your position with evidence from your experiments.
I am not looking for a particular answer. There are many conclusions one
can draw from this kind of data. I'm curious as to what you find
interesting, and want to see how you analyze this type of data.
If your analysis suggests that other experiments are appropriate, feel
free to carry them out and comment on them. Make sure you do my original
set of experiments above however, even if you introduce your own.
Submission Guidelines
Please name your programs 'boundary.pl' and 'shannon.pl'. Please name your
report 'experiments.txt'. Each of these should be plain text files. Make
sure that your name, date, and class information are contained in each
file, and that your programs are commented using perldoc.
Place these three files into a directory that is named with your umd user
id. In my case the directory would be called tpederse, for example. Then
create a tar file that includes this directory and your three files.
Compress that tar file and submit it via the web drop from the class home
page. Please note that any submissions after the deadline will not be
graded.
This is an individual assignment. You must write *all* of your code on
your own. Do not get code from your colleagues, the internet, etc. You are
welcome to discuss the Shannon game in general with your classmates,
friends, family, etc., but do not discuss implementation details. You
must also write your report on your own. Please do not discuss your
interpretations of these results amongst yourselves. This is meant to
make you think for yourself and arrive at your own conclusions.
by:
Ted Pedersen
- tpederse@umn.edu