CS 5761 - Introduction to Natural Language Processing
Programming Assignment 4 - Demo in Lab on Monday, Mar 04 at 4pm
(submit code via email to patw0006@d.umn.edu before lab)
Objectives
To gain experience with n-gram models.
Specification
Implement a program that will learn an n-gram model from a given body of
text. Your program should then generate a given number of sentences based
on that n-gram model. See the discussion on pages 202-206 of your text for
further details.
Your program should work for any value of n, and should output m
sentences. Convert all text to lower case, and make sure to include
punctuation in the n-gram models. Your program should learn a single
n-gram model from any number of input files.
Your program should run as follows:
assign4.pl n m input-file/s
so running your program like this:
assign4.pl 3 10 book1.text book2.text
should result in 10 randomly generated sentences based on a trigram
model learned from book1.text and book2.text.
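The command-line convention above could be handled roughly as follows. This is only an illustrative sketch in Python (the assign4.pl name suggests the assignment itself is written in Perl); the function names parse_args and read_corpus, the UTF-8 encoding, and the choice to join files with a newline are assumptions, not part of the specification.

```python
import argparse


def parse_args(argv):
    """Parse 'n m input-file/s' in the order the assignment specifies."""
    parser = argparse.ArgumentParser(prog="assign4")
    parser.add_argument("n", type=int, help="order of the n-gram model")
    parser.add_argument("m", type=int, help="number of sentences to generate")
    parser.add_argument("files", nargs="+", help="one or more input text files")
    args = parser.parse_args(argv)
    if args.n < 1 or args.m < 0:
        parser.error("n must be at least 1 and m must not be negative")
    return args


def read_corpus(paths):
    """Concatenate every input file so that a single model is learned from all of them."""
    parts = []
    for path in paths:
        # Assumption: the input files are UTF-8 encoded plain text.
        with open(path, encoding="utf-8") as handle:
            parts.append(handle.read())
    return "\n".join(parts)
```

Calling parse_args(["3", "10", "book1.text", "book2.text"]) yields n = 3, m = 10, and the two file names, mirroring the example invocation.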
Make sure that you separate punctuation marks from text and treat them as
tokens. Also treat numeric data as tokens. So, in a sentence like
my, oh my, i wish i had 100 dollars
you should have 11 tokens
my , oh my , i wish i had 100 dollars
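One possible way to get this tokenization is a single regular expression that matches either a run of word characters or one punctuation character. The sketch below is in Python for illustration only; the pattern and the name tokenize are assumptions. Note that under this pattern a contraction such as "don't" splits into three tokens, which is a design choice the specification leaves open.

```python
import re

# Either a run of letters/digits (a word or a number) or a single
# character that is neither a word character nor whitespace (punctuation).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Lower-case the text and split it into word, number, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text.lower())
```

Applied to the example sentence, tokenize("my, oh my, i wish i had 100 dollars") returns the 11 tokens listed above.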
You may assume that any period, question mark, or exclamation point
represents the end of a sentence. (In general this assumption is wrong,
but is perfectly adequate for our purposes here.) When generating a
sentence, keep adding tokens until you generate a terminating punctuation
mark. At that point the sentence is complete.
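The generation loop might look like the sketch below (Python, for illustration only). It assumes a hypothetical model that maps each context, a tuple of n-1 tokens, to a Counter of the tokens that followed it in training, and a hypothetical list starts holding the first n-1 tokens of each training sentence. Starting from a randomly chosen training-sentence opening is an assumption; the specification does not say how a sentence should begin.

```python
import random
from collections import Counter

TERMINATORS = {".", "?", "!"}


def generate_sentence(model, starts, rng=random):
    """Sample tokens in proportion to their counts until a terminator appears."""
    context = rng.choice(starts)   # assumption: begin like some training sentence
    order = len(context)           # n - 1; zero for a unigram model
    sentence = list(context)
    while not sentence or sentence[-1] not in TERMINATORS:
        followers = model.get(context)
        if not followers:
            break                  # unseen context: stop rather than loop forever
        candidates = list(followers)
        weights = list(followers.values())
        sentence.append(rng.choices(candidates, weights=weights, k=1)[0])
        # Slide the context window forward by one token.
        context = tuple(sentence[-order:]) if order else ()
    return sentence
```

Sampling with weights equal to the raw counts is equivalent to sampling from the relative-frequency estimate of each n-gram probability, so no explicit normalization is needed.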
If a sentence in the input text contains fewer than n tokens, then
you may simply discard that sentence and not use it when computing
n-gram probabilities.
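A minimal sketch of sentence splitting and n-gram counting under these rules is shown below (Python, for illustration only). The names split_sentences and train are hypothetical. Two assumptions are made: the terminating punctuation mark counts toward a sentence's length, and trailing tokens with no terminator are dropped.

```python
from collections import Counter, defaultdict

TERMINATORS = {".", "?", "!"}


def split_sentences(tokens):
    """Group tokens into sentences, each ending with a terminator token."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in TERMINATORS:
            sentences.append(current)
            current = []
    # Assumption: leftover tokens without a terminator are discarded.
    return sentences


def train(sentences, n):
    """Count, for each (n-1)-token context, the tokens that follow it."""
    model = defaultdict(Counter)
    starts = []
    for sentence in sentences:
        if len(sentence) < n:
            continue  # too short to contain even one n-gram
        starts.append(tuple(sentence[:n - 1]))
        for i in range(len(sentence) - n + 1):
            context = tuple(sentence[i:i + n - 1])
            model[context][sentence[i + n - 1]] += 1
    return model, starts
```

The probability of a token given its context is then its count divided by the sum of all counts stored for that context.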
Policies (from syllabus)
All programming assignments and your project will be demonstrated during
designated lab sessions. You should also submit an electronic copy of
your source code to the TA prior to the designated demo session. (His
email address is patw0006@d.umn.edu.) There is no other way to submit
your programming assignments or project. Failure to submit AND demo on
time will result in a zero.
Any code you submit should be commented. I must be able to understand
what your code does simply by reading the comments. This understanding
should extend down to the details of your code. Do not simply
describe the input and output; also include comments that describe
your particular algorithm and coding techniques. Failure to comment
to this degree will result in a zero.
All assignments and the project are to be done individually. You are
required to write your own code. Unless otherwise specified, you must
only turn in code that you personally wrote. The only possible exception
to this is if I tell you to use a module that is available in a book
or online archive. However, I will clearly indicate when this is
permissible. Violations of this policy will result in severe grading
penalties and/or failure in the class.