CS 5761 - Introduction to Natural Language Processing
Programming Assignment 4 - Submit via
web drop by 5pm Friday March 26.
Objectives
To gain experience learning n-gram language models from text.
Specification
Design and implement a Perl program called ngram.pl that will learn an
n-gram language model from a given body of text. Your program should then
generate a given number of sentences based on that n-gram model. See the
discussion on pages 202-206 of your text for further details.
Your program should work for any value of n, and should output m
sentences. Convert all text to lower case, and make sure to include
punctuation in the n-gram models. Your program should learn a single
n-gram model from any number of input files.
Your program should run as follows:
ngram.pl n m input-file/s
so running your program like this:
ngram.pl 3 10 book1.text book2.text
should result in 10 randomly generated sentences based on a trigram
model learned from book1.text and book2.text.
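The argument handling described above can be sketched as follows. This is only an illustration, written in Python rather than the Perl the assignment requires; the function name parse_args and the error messages are assumptions, not part of the specification.

```python
def parse_args(argv):
    """Split an argument vector into (n, m, input_files).

    Follows the usage line `ngram.pl n m input-file/s`, where argv[0]
    is the program name. The validation rules are assumptions.
    """
    if len(argv) < 4:
        raise SystemExit("usage: ngram.pl n m input-file/s")
    n, m = int(argv[1]), int(argv[2])
    if n < 1 or m < 0:
        raise SystemExit("n must be at least 1 and m must not be negative")
    # Everything after n and m is an input file; all of them feed one model.
    return n, m, argv[3:]
```

For the example command line above, parse_args would return n = 3, m = 10, and the two file names.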
Make sure that you separate punctuation marks from text and treat them as
tokens. Also treat numeric data as tokens. So, in a sentence like
my, oh my, i wish i had 100 dollars.
you should have 12 tokens:
my , oh my , i wish i had 100 dollars .
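One possible way to get this tokenization is a single regular expression, sketched here in Python for illustration only (the assignment itself calls for Perl). The name tokenize and the choice to keep an apostrophe inside a word such as "don't" are assumptions; the specification does not say how contractions should be handled.

```python
import re

# A word or number is a run of ASCII letters/digits, optionally joined by
# apostrophes (assumption). Any other non-space character is its own token.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)*|[^a-z0-9\s]")

def tokenize(text):
    """Lower-case the text and split it into word, number, and punctuation tokens."""
    return TOKEN_RE.findall(text.lower())

tokens = tokenize("my, oh my, i wish i had 100 dollars.")
print(len(tokens), tokens)
```

With this pattern every punctuation character becomes a separate token, so a sequence such as "..." would yield three period tokens.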
Your program will need to identify sentence boundaries, and your n-grams
should *not* cross these boundaries. For example, you could have input
like this:
He went down the stairs
and then out the side door.
My mother and brother
followed him.
You should treat this as two sentences, as in:
He went down the stairs and then out the side door .
My mother and brother followed him .
To identify sentence boundaries, you may assume that any period,
question mark, or exclamation point represents the end of a sentence.
(In general this assumption is wrong, but is perfectly adequate for our
purposes here.) When generating a sentence, keep going until you generate a
terminating punctuation mark. At that point the sentence is complete.
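The boundary rule above can be sketched as a pass over the token stream that ignores line breaks and closes a sentence at each terminator. This is a Python illustration, not the required Perl; the names tokenize and split_sentences are assumptions, and dropping any trailing tokens that lack a terminator is also an assumption the specification does not address.

```python
import re

TERMINATORS = {".", "?", "!"}
TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)*|[^a-z0-9\s]")

def tokenize(text):
    """Lower-case the text and split it into word, number, and punctuation tokens."""
    return TOKEN_RE.findall(text.lower())

def split_sentences(tokens):
    """Group tokens into sentences, each ending with a period, '?', or '!'."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in TERMINATORS:
            sentences.append(current)
            current = []
    # Tokens left in `current` never reached a terminator and are dropped (assumption).
    return sentences

text = "He went down the stairs\nand then out the side door.\nMy mother and brother\nfollowed him.\n"
for sentence in split_sentences(tokenize(text)):
    print(" ".join(sentence))
```

Because line breaks are treated as ordinary whitespace, the four input lines come out as two sentences.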
If a sentence in the input text contains fewer than n tokens, then
you may simply discard that sentence and not use it when computing
n-gram probabilities.
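A minimal sketch of one way to learn the model and generate from it, in Python for illustration only (the assignment requires Perl). Several choices here are assumptions rather than requirements: the "<s>" start marker used to begin a sentence, counting the terminating punctuation mark toward a sentence's length, and the names train and generate. Sampling a next token in proportion to its count is equivalent to sampling from the relative-frequency estimate of its probability given the preceding n-1 tokens.

```python
import random
from collections import defaultdict

START = "<s>"  # hypothetical start-of-sentence marker, not part of the spec
TERMINATORS = {".", "?", "!"}

def train(sentences, n):
    """Count how often each token follows each (n-1)-token history."""
    counts = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        if len(sent) < n:
            continue  # sentence shorter than n tokens: discard it
        padded = [START] * (n - 1) + sent
        for i in range(n - 1, len(padded)):
            history = tuple(padded[i - n + 1:i])
            counts[history][padded[i]] += 1
    return counts

def generate(counts, n, rng=random):
    """Generate one sentence, stopping at the first terminating punctuation mark.

    Assumes every training sentence ended with a terminator, so every
    history reached here has at least one recorded follower.
    """
    history = (START,) * (n - 1)
    sentence = []
    while True:
        followers = counts[history]
        candidates = list(followers)
        weights = [followers[tok] for tok in candidates]
        tok = rng.choices(candidates, weights=weights)[0]
        sentence.append(tok)
        if tok in TERMINATORS:
            return sentence
        # Slide the history window forward by one token.
        history = (history + (tok,))[1:]

corpus = [["he", "went", "home", "."], ["he", "went", "out", "."]]
model = train(corpus, 3)
print(" ".join(generate(model, 3)))
```

Padding with start markers keeps the histories of different sentences separate, so no counted n-gram spans a sentence boundary.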
Policies (see syllabus for more details)
Please comment your code. You must provide a detailed description of your
n-gram modeling and sentence generation algorithm in your source code
comments. This should focus on how you choose each word when generating a
sentence. Also make sure your name, class, etc. are clearly included in the
comments.
It is fine to use a Perl reference book to provide examples of loops,
variables, etc., but your ngram.pl specific code must be your own, and
not taken from any other source (human, published, on the web, etc.).