CS 8761 Natural Language Processing - Fall 2002
Assignment 1 - Due Mon, September 30, noon
This may be revised in response to your questions.
Last update Sun September 29 6:00 pm
Objectives
To compute the entropy of English and a second language and interpret
those results based on several experiments you will conduct with your
code. You may also gain an appreciation for the difficulties of sentence
boundary detection.
Specification
Write a Perl program that will play the "Shannon game". Your program
should accept as input some number of randomly selected sentences from a
text file. The user must then guess each sentence, letter by letter, so
that the entropy can be computed.
Your Shannon Game program should display one sentence at a time, where all
the positions in the sentence have been covered by "*" characters. When
the user guesses the correct letter it should be displayed in place of the
"*". When the user guesses the complete sentence, the total entropy for
that sentence should be displayed, as should a running total that
reflects the entropy across all the sentences processed thus far.
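As an illustration of the display described above, here is a minimal sketch in Python (not the required Perl). It assumes the sentence is guessed left to right and that a hypothetical counter, `revealed`, records how many positions have been guessed so far; neither the function name nor that counter comes from the assignment.

```python
def masked(sentence, revealed):
    """Show the first `revealed` characters and cover the rest with '*'.

    Assumption: positions are guessed strictly left to right, so a single
    counter is enough to describe what the user has uncovered so far.
    Spaces are covered as well, just like letters.
    """
    return sentence[:revealed] + "*" * (len(sentence) - revealed)


if __name__ == "__main__":
    target = "CALL ME ISHMAEL"
    for revealed in (0, 4, len(target)):
        print(masked(target, revealed))
```

A plain line of text per guess is enough here, since the display does not need to be fancy.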
You should not assume that sentence boundaries have been determined in
your input text. Thus, you should provide a sentence boundary detector
program to create the input to your Shannon game program. Your
sentence boundary detection program should also convert the text
to upper case and remove all punctuation and non-alpha characters
except for spaces. Please note that sentence boundary detection is a
difficult problem in its own right, so you should not be surprised if
you have an imperfect solution. That is ok, as long as you are able to
get the simple cases correct.
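To make the preprocessing concrete, the following is a rough sketch in Python (not the required Perl) of one very simple approach. It assumes a sentence ends at ".", "!" or "?" followed by whitespace, which is only a heuristic; the function names are hypothetical and chosen for illustration.

```python
import random
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when whitespace follows.

    Assumption: this rule is a heuristic only. It gets abbreviations such
    as "Mr. Smith" wrong, which is one of the hard cases.
    """
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p for p in parts if p.strip()]


def normalize(sentence):
    """Upper-case the text and keep only the letters A-Z and single spaces."""
    upper = re.sub(r"\s+", " ", sentence.upper())   # newlines and tabs become spaces
    letters_only = re.sub(r"[^A-Z ]", "", upper)    # drop punctuation and digits
    return re.sub(r" +", " ", letters_only).strip()  # collapse repeated spaces


def sample_sentences(text, count, rng=random):
    """Pick up to `count` cleaned sentences at random."""
    cleaned = [normalize(s) for s in split_sentences(text)]
    cleaned = [s for s in cleaned if s]
    return rng.sample(cleaned, min(count, len(cleaned)))


if __name__ == "__main__":
    sample = "Call me Ishmael. Some years ago, never mind how long! Yes?"
    for line in sample_sentences(sample, 2):
        print(line)
```

Note that dropping a hyphen or apostrophe joins the surrounding letters (for example "it's" becomes "ITS"); whether that is the desired behavior is a design choice left open here.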
These two programs should be run as follows from the command line:
sentence.pl 10 mobydick.txt | shannon.pl outfile1 outfile2 outfile3
sentence.pl will select 10 random sentences from mobydick.txt and pass
them to shannon.pl which should display them one by one to the user (where
the actual letters and spaces have been covered with "*" characters).
outfile1 outfile2 and outfile3 (whose names are specified by the user) are
output files that are described below.
If the user happens to guess the same letter twice for the same position,
don't count this more than once. Please keep track of the letters guessed
for each position to avoid double counting.
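One way to avoid double counting is to keep a set of the letters already tried at the current position. The sketch below is in Python (not the required Perl), and the function name `count_guesses` is hypothetical.

```python
def count_guesses(target, guesses):
    """Count distinct guesses up to and including the correct one.

    `target` is the true character at this position and `guesses` is the
    sequence of characters the user typed. Repeated guesses are stored in
    a set, so they are counted only once. Returns None if the target was
    never guessed.
    """
    seen = set()
    for guess in guesses:
        guess = guess.upper()
        seen.add(guess)
        if guess == target:
            return len(seen)
    return None


if __name__ == "__main__":
    # 'a' is typed twice but counted once: A, T, E are three distinct guesses.
    print(count_guesses("E", ["a", "t", "A", "e"]))
```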
You may assume that the command line arguments are valid. There is no need
to include command line error checking. Please note that I will test your
program automatically on a csdev machine so please test it on the same
and follow exactly the same command line format.
Output
Your program should reveal the sentence as letters are guessed. This does
not need to be a fancy display - simple ascii output is fine.
After each sentence is guessed, your program should output the entropy
associated with that sentence, and a cumulative total of entropy for all
sentences processed thus far. You only need to display two digits of
precision to the right of the decimal point.
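The assignment does not prescribe a formula for the entropy. One common estimate from Shannon's guessing experiment treats the number of guesses needed at each position as a random variable and takes the entropy of its distribution, H = -sum q_i log2 q_i, where q_i is the fraction of positions that needed i guesses. The Python sketch below (not the required Perl) assumes that estimate; the function name and the idea of pooling counts for the running figure are illustrative assumptions.

```python
import math
from collections import Counter


def guess_entropy(guess_counts):
    """Entropy in bits of the distribution of guess counts.

    Assumption: this is one possible estimate (entropy of the guess-count
    distribution); the assignment itself does not fix the formula.
    """
    total = len(guess_counts)
    if total == 0:
        return 0.0
    frequencies = Counter(guess_counts)
    return sum(-(n / total) * math.log2(n / total) for n in frequencies.values())


if __name__ == "__main__":
    per_sentence = [[1, 1, 2, 2], [1, 1, 1, 2]]
    pooled = []
    for counts in per_sentence:
        pooled.extend(counts)  # running figure pools every position seen so far
        print("sentence: %.2f  cumulative: %.2f"
              % (guess_entropy(counts), guess_entropy(pooled)))
```

The `%.2f` format gives the two digits to the right of the decimal point.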
Your program should also create three output files that can be named from
the command line. Above they are called outfile1, outfile2, and outfile3
and will be referred to as such here.
outfile1: A table of the average number of guesses before the given
letter was guessed correctly.
A 4.44
B 3.21
etc.
This shows that when A was the correct letter, it took 4.44 tries on average
to guess it correctly. Your table should have 27 entries, one for each of the 26
letters and also for space.
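A table like this can be built by accumulating, for each symbol, the total number of guesses and the number of times it occurred. The Python sketch below (not the required Perl) assumes the game has produced a list of (symbol, guesses) pairs. Printing 0.00 for symbols that never occurred and labeling the space row "SPACE" are assumptions, since the assignment does not say how to handle either.

```python
import string
from collections import defaultdict

SYMBOLS = string.ascii_uppercase + " "  # 26 letters plus space = 27 rows


def average_table(records):
    """Return 27 lines of the form 'A 4.44' from (symbol, guesses) pairs."""
    sums = defaultdict(int)
    counts = defaultdict(int)
    for symbol, guesses in records:
        sums[symbol] += guesses
        counts[symbol] += 1
    lines = []
    for symbol in SYMBOLS:
        # Assumption: symbols never observed are reported as 0.00.
        average = sums[symbol] / counts[symbol] if counts[symbol] else 0.0
        label = "SPACE" if symbol == " " else symbol  # hypothetical label
        lines.append(f"{label} {average:.2f}")
    return lines


if __name__ == "__main__":
    demo = [("A", 4), ("A", 5), ("B", 3), (" ", 1)]
    print("\n".join(average_table(demo)[:3]))
```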
outfile2: A table showing the average number of guesses to get a letter
correct given that the previous letter is as shown:
A 3.99
B 2.99
etc.
This shows that when A was the letter preceding the letter to be guessed,
it took 3.99 guesses on average to get that next letter correct.
outfile3: A table showing the average number of guesses to get a letter
correct given that the following letter is as shown:
A 2.99
B 4.55
etc.
This shows that when A follows the letter to be guessed, it takes 2.99
guesses to get that preceding letter correct (on average).
The values in these files should be computed to 2 digits of precision to
the right of the decimal point.
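For outfile2 and outfile3 the guess count at each position is keyed by a neighboring letter rather than by the letter itself. The Python sketch below (not the required Perl) shows one way to derive those keys from a sentence and its per-position guess counts. Skipping the first position for the "previous" table and the last position for the "following" table is an assumption, since those positions have no such neighbor.

```python
def context_records(sentence, guess_counts):
    """Pair each position's guess count with its neighboring characters.

    Returns two lists of (neighbor, guesses) pairs: one keyed by the
    previous character (for outfile2) and one keyed by the following
    character (for outfile3).
    """
    by_previous = []
    by_following = []
    for i, guesses in enumerate(guess_counts):
        if i > 0:  # assumption: the first position has no previous letter
            by_previous.append((sentence[i - 1], guesses))
        if i + 1 < len(sentence):  # assumption: the last has no following letter
            by_following.append((sentence[i + 1], guesses))
    return by_previous, by_following


if __name__ == "__main__":
    previous, following = context_records("AB C", [3, 1, 2, 4])
    print(previous)
    print(following)
```

Each list can then be averaged per key and written with two digits after the decimal point.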
Experiments and Report
You will produce a short written report describing the outcome of the
following experiments.
Download some text from Project Gutenberg. Run the Shannon Game for at
least 20 randomly selected sentences (more if you think it is necessary).
Compare the results in outfile1, outfile2, and outfile3. What conclusions
can you draw from those results? Please compare these tables with each
other, and also discuss each of them on their own. Make sure you
reproduce the tables you are discussing in your report file.
Download some text in another language that you know. If you know Hindi,
then it must be Hindi and it should be transliterated (i.e., Hindi written in
the Roman/English alphabet). If you don't know any other language, you
can use some Hindi text, or lacking that use the Spanish version of Don
Quixote available from Project Gutenberg. Repeat the same experiment that
you carried out for English. Discuss your Hindi/second language results
on their own merits, and then compare them to English. What conclusions
can you draw?
I am not looking for a particular answer. There are many conclusions one
can draw from this kind of data. I'm curious as to what you find
interesting, and want to see how you analyze this type of data.
If your analysis suggests that other experiments are appropriate, feel
free to carry them out and comment on them. Make sure you do my original
set of experiments above however, even if you introduce your own.
Submission Guidelines
Please name your programs 'sentence.pl' and 'shannon.pl'. Please name your
report 'experiments.txt'. Each of these should be plain text files. Make
sure that your name, date, and class information are contained in each
file, and that your programs are commented.
Place these three files into a directory that is named with your umd user
id. In my case the directory would be called tpederse, for example. Then
create a tar file that includes this directory and your three files.
Compress that tar file and submit it via the web drop from the class home
page. Please note that the deadline will
be enforced by automatic means. Any submissions after the deadline will
not be graded. The web drop has a limit of 10mb, so your files should be
plain text. If you have large data files you wish to share, please contact
me via email and we'll figure out a way to do that.
For both your first and second language text, I would like your
report to tell me where it came from (presumably this will come
from URLs, so you can list those). Also, create a directory in
your /home/cs/ partition and put the data you use in your experiments
there. Make the files world readable and name them as follows:
/home/cs/(your id)/CS8761/shannon/english
/home/cs/(your id)/CS8761/shannon/hindi (or whatever your 2nd language was)
chmod -R gou+r /home/cs/(your id)/CS8761
will accomplish this. Please note that everything in CS8761
and the subdirectories will be world readable as a result.
This is an individual assignment. You must write *all* of your code on
your own. Do not get code from your colleagues, the internet, etc. You are
welcome to discuss the Shannon game in general with your classmates,
friends, family, etc., but do not discuss implementation details. You
must also write your report on your own. Please do not discuss your
interpretations of these results amongst yourselves. This is meant to
make you think for yourself and arrive at your own conclusions.
by:
Ted Pedersen
- tpederse@umn.edu