CS 5761 - Introduction to Natural Language Processing
Programming Assignment 5 - Demo in Lab on Monday, Mar 11 at 4pm
(submit code via email to patw0006@d.umn.edu before lab)
Objectives
To see how different smoothing algorithms affect probability estimates.
Specification
Modify your assignment 1 program to output n-gram probabilities using
maximum likelihood estimates, add-1 smoothing, and Witten-Bell smoothing.
Recall that assignment 1 finds the top N most frequent sequences
consisting of M words. Thus, your modified program should still output the
top N most frequent M word sequences, only now it should also output
probability estimates based on each of these smoothing schemes. Also
output the frequency counts associated with each M word sequence. Make
sure your output is nicely formatted (columns line up, etc.)
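The counting step above can be sketched as follows. This is only an illustration in Python (the assignment itself names a Perl script, assign5.pl), and it assumes whitespace tokenization and upper-casing, which may differ from what your assignment 1 program does. The function names are hypothetical.

```python
from collections import Counter

def count_sequences(tokens, m):
    """Count every run of m consecutive tokens."""
    return Counter(tuple(tokens[i:i + m]) for i in range(len(tokens) - m + 1))

def top_sequences(tokens, m, n):
    """Return (sequence, frequency, MLE) for the n most frequent m-word sequences."""
    counts = count_sequences(tokens, m)
    total = sum(counts.values())  # number of m-word sequence tokens in the text
    return [(" ".join(seq), c, c / total) for seq, c in counts.most_common(n)]

# Toy input, assumed tokenization: split on whitespace, fold to upper case.
tokens = "the cat sat on the mat the cat ran".upper().split()
for seq, freq, mle in top_sequences(tokens, 2, 2):
    print(f"{seq:<12}{freq:>6}{mle:>10.4f}")
```

Here the maximum likelihood estimate is simply the count of a sequence divided by the total number of M word sequences seen in the text.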
For example, your output might look something like this if you
ran 'assign5.pl 2 2 textinput' :
TOP 2 MOST FREQUENT 2 WORD SEQUENCES
            FREQ       MLE     ADD-1  WITTEN-BELL
OF THE       100  0.000001  0.000002     0.000003
FOR THE       99  0.000001  0.000002     0.000003
Remember that if you display all of the M word sequences in the text, the
total of each probability column should be 1.0000 (more or less; there may
be some round-off error). You should use as much precision as you need to
differentiate among the estimates. This may vary depending on the
amount of text you are using, and the length of your sequences, so
your program should deal with that automatically.
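One possible way to pick the precision automatically is to add decimal places until all distinct estimates print differently. The sketch below is in Python for illustration only; the function name, the default limits, and the sample numbers are assumptions, not part of the assignment.

```python
def decimals_to_distinguish(values, minimum=4, maximum=15):
    """Smallest number of decimal places at which distinct values print differently."""
    distinct = set(values)
    for places in range(minimum, maximum + 1):
        shown = {f"{v:.{places}f}" for v in distinct}
        if len(shown) == len(distinct):
            return places
    return maximum  # stop near the precision a float can reasonably show

# Made-up estimates, used only to exercise the function.
rows = [("OF THE", 100, 0.000013), ("FOR THE", 99, 0.000012)]
places = decimals_to_distinguish([p for _, _, p in rows])
width = places + 4  # room for "0.", the digits, and column spacing
print(f"{'':<10}{'FREQ':>6}{'MLE':>{width}}")
for seq, freq, p in rows:
    print(f"{seq:<10}{freq:>6}{p:>{width}.{places}f}")
```

Computing the column width from the chosen precision keeps the columns lined up however many digits end up being needed.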
Assume that the number of possible types in the unigram model is equal to
the number of types observed (call this value V). Then, in the bigram
model assume that the number of possible types is V*V, and in the trigram
model V*V*V, and so on.
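Under this assumption the three estimates for an observed sequence might be computed as in the sketch below (Python, for illustration only). It assumes one common formulation of Witten-Bell for joint events, in which a sequence seen c times gets c / (N + T), where N is the number of sequence tokens and T the number of distinct observed sequences; check this against the version presented in class. The function and key names are hypothetical.

```python
from collections import Counter

def smoothed_estimates(counts, vocab_size, m):
    """Map each observed m-word sequence to its MLE, add-1, and Witten-Bell estimate."""
    n_tokens = sum(counts.values())  # N: total m-word sequence tokens
    n_types = len(counts)            # T: distinct observed sequences
    possible = vocab_size ** m       # V**m possible sequences (assignment's assumption)
    return {
        seq: {
            "mle": c / n_tokens,
            "add1": (c + 1) / (n_tokens + possible),
            # Assumed Witten-Bell form for seen events: c / (N + T).
            "witten_bell": c / (n_tokens + n_types),
        }
        for seq, c in counts.items()
    }

tokens = "the cat sat on the mat the cat ran".upper().split()
m = 2
counts = Counter(tuple(tokens[i:i + m]) for i in range(len(tokens) - m + 1))
estimates = smoothed_estimates(counts, vocab_size=len(set(tokens)), m=m)
print(estimates[("THE", "CAT")])
```

In this toy text V is 6, so the bigram model has 36 possible types, of which 7 are observed across 8 bigram tokens.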
If you request the top N ranked sequences, and there are fewer than
N observed events, once you have displayed all the observed events,
have a generic display for unobserved events where their smoothed
estimates are shown. You do not need to generate and display specific
N-grams that have not been observed; something like "UNOBSERVED"
will be fine (as long as you have estimates for it as well). Our
assumption here is that any unobserved event will be as likely as
any other unobserved event.
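Because every unobserved event is treated as equally likely, a single UNOBSERVED row is enough. The sketch below (Python, for illustration only) computes the per-event estimate for that row. It assumes the joint-event form of Witten-Bell in which the reserved mass T / (N + T) is split evenly over the Z = V**M - T unseen types; the function and key names are hypothetical.

```python
def unobserved_estimates(n_tokens, n_types, vocab_size, m):
    """Per-event estimates for any single unobserved m-word sequence.

    Returns None when every possible sequence was observed (for example
    the unigram case, where possible types equal observed types).
    """
    possible = vocab_size ** m   # V**m possible sequences
    unseen = possible - n_types  # Z: sequences never observed
    if unseen == 0:
        return None
    return {
        "unseen_types": unseen,
        "mle": 0.0,  # MLE gives no mass to unseen events
        "add1": 1 / (n_tokens + possible),
        # Assumed Witten-Bell form: T / (Z * (N + T)) per unseen event.
        "witten_bell": n_types / (unseen * (n_tokens + n_types)),
    }

# Toy figures: 8 bigram tokens, 7 bigram types, vocabulary of 6 words.
est = unobserved_estimates(n_tokens=8, n_types=7, vocab_size=6, m=2)
print(f"{'UNOBSERVED':<12}{0:>6}{est['mle']:>10.4f}"
      f"{est['add1']:>10.4f}{est['witten_bell']:>10.4f}")
```

Multiplying each unobserved estimate by the number of unseen types and adding the observed estimates is a quick check that each smoothed column totals 1.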
Policies (from syllabus)
All programming assignments and your project will be demonstrated during
designated lab sessions. You should also submit an electronic copy of
your source code to the TA prior to the designated demo session. (His
email address is patw0006@d.umn.edu.) There is no other way to submit
your programming assignments or project. Failure to submit AND demo on
time will result in a zero.
Any code you submit should be commented. I must be able to understand
what your code does simply by reading the comments. This understanding
should extend down to the details of your code. So do not simply
describe the input and output; also include comments that describe
your particular algorithm and coding techniques. Failure to comment
to this degree will result in a zero.
All assignments and the project are to be done individually. You are
required to write your own code. Unless otherwise specified, you must
turn in only code that you personally wrote. The only possible exception
to this is if I tell you to use a module that is available in a book
or online archive. However, I will clearly indicate when this is
permissible. Violations of this policy will result in severe grading
penalties and/or failure in the class.