Ted Pedersen - CS 5761 - Introduction to Natural Language Processing

Programming Assignment 4 - Submit via web drop by 5pm Friday March 26.

Objectives

To gain experience learning n-gram language models from text.

Specification

Design and implement a Perl program called ngram.pl that will learn an n-gram language model from a given body of text. Your program should then generate a given number of sentences based on that n-gram model. See the discussion on pages 202-206 of your text for further details.

Your program should work for any value of n, and should output m sentences. Convert all text to lower case, and make sure to include punctuation in the n-gram models. Your program should learn a single n-gram model from any number of input files.

Your program should run as follows:

ngram.pl n m input-file/s

so running your program like this:

ngram.pl 3 10 book1.text book2.text

should result in 10 randomly generated sentences based on a trigram model learned from book1.text and book2.text.
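As an illustration of the command line above, here is a minimal sketch of reading n, m, and the input files. It is written in Python purely for illustration (your submission must still be in Perl), and the helper names parse_args and read_corpus are hypothetical, not part of the assignment.

```python
def parse_args(argv):
    """Split argv (without the program name) into n, m, and a list of input paths."""
    if len(argv) < 3:
        raise SystemExit("usage: ngram.pl n m input-file/s")
    n, m = int(argv[0]), int(argv[1])
    if n < 1 or m < 0:
        raise SystemExit("n must be at least 1 and m must not be negative")
    return n, m, argv[2:]

def read_corpus(paths):
    """Concatenate all input files so that a single model is learned from them."""
    parts = []
    for path in paths:
        # Assumption: the input files are plain text readable as UTF-8.
        with open(path, encoding="utf-8") as handle:
            parts.append(handle.read())
    return "\n".join(parts)

print(parse_args(["3", "10", "book1.text", "book2.text"]))
# prints (3, 10, ['book1.text', 'book2.text'])
```

Joining the files into one string is only one possible design. If a file does not end with a terminating punctuation mark, its last sentence would run into the next file, so you may prefer to process each file separately.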

Make sure that you separate punctuation marks from text and treat them as tokens. Also treat numeric data as tokens. So, in a sentence like

my, oh my, i wish i had 100 dollars .

you should have 12 tokens:

my , oh my , i wish i had 100 dollars .
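The tokenization described above could be sketched as follows. This is a Python illustration under stated assumptions, not the required Perl solution: it assumes plain ASCII text, treats every run of letters as a word, every run of digits as a number, and every other non-space character as a one-character punctuation token. The name tokenize is hypothetical.

```python
import re

# Assumption: a token is a run of letters, a run of digits,
# or a single character that is neither alphanumeric nor whitespace.
TOKEN_RE = re.compile(r"[a-z]+|[0-9]+|[^a-z0-9\s]")

def tokenize(text):
    """Lowercase the text and split it into word, number, and punctuation tokens."""
    return TOKEN_RE.findall(text.lower())

tokens = tokenize("My, oh my, I wish I had 100 dollars.")
print(len(tokens), " ".join(tokens))
# prints 12 my , oh my , i wish i had 100 dollars .
```

Note that this sketch splits a contraction such as "don't" into three tokens. Whether that is what you want is a design decision the assignment leaves to you.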

Your program will need to identify sentence boundaries, and your n-grams should *not* cross these boundaries. For example, you could have input like this:

He went down the stairs
and then out the side door. 
My mother and brother 
followed him.

You should treat this as two sentences, as in:

He went down the stairs and then out the side door . 
My mother and brother followed him .
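One way to picture this step is the Python sketch below (for illustration only; the assignment requires Perl). It assumes the simple rule that a period, question mark, or exclamation point ends a sentence, it lowercases the text as the specification asks, and it ignores line breaks entirely. The function name split_sentences is hypothetical.

```python
import re

def split_sentences(text):
    """Return a list of sentences, each a list of lowercase tokens ending in . ? or !"""
    # Line breaks count as ordinary whitespace, so a sentence may span several lines.
    tokens = re.findall(r"[a-z]+|[0-9]+|[^a-z0-9\s]", text.lower())
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in {".", "?", "!"}:
            sentences.append(current)
            current = []
    # Assumption: trailing tokens with no terminating mark are dropped.
    return sentences

text = "He went down the stairs\nand then out the side door. \nMy mother and brother \nfollowed him.\n"
for sentence in split_sentences(text):
    print(" ".join(sentence))
# prints:
# he went down the stairs and then out the side door .
# my mother and brother followed him .
```

Keeping each sentence as its own list makes it easy to collect n-grams one sentence at a time, so that none of them crosses a boundary.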

To identify sentence boundaries, you may assume that any period, question mark, or exclamation point represents the end of a sentence. (In general this assumption is wrong, but it is perfectly adequate for our purposes here.) When generating a sentence, keep going until you generate a terminating punctuation mark. At that point the sentence is complete.
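The generation loop can be sketched as below, in Python for illustration only (the assignment requires Perl). The assignment does not say how a sentence should begin, so this sketch assumes each training sentence is padded with n-1 copies of a hypothetical start symbol "<s>". It also assumes every training sentence ends in a terminating mark. The names train and generate are hypothetical.

```python
import random
from collections import Counter, defaultdict

START = "<s>"                # hypothetical padding symbol, an assumption of this sketch
END_MARKS = {".", "?", "!"}

def train(sentences, n):
    """Map each (n-1)-token history to a Counter of the tokens that follow it."""
    model = defaultdict(Counter)
    for sent in sentences:
        padded = [START] * (n - 1) + sent
        for i in range(n - 1, len(padded)):
            history = tuple(padded[i - n + 1:i])
            model[history][padded[i]] += 1
    return model

def generate(model, n, rng=random):
    """Sample tokens in proportion to their counts until a terminating mark appears."""
    history = [START] * (n - 1)
    out = []
    while True:
        counts = model[tuple(history)]
        tok = rng.choices(list(counts), weights=list(counts.values()))[0]
        out.append(tok)
        if tok in END_MARKS:
            return out
        # Slide the history window forward by one token.
        history = (history + [tok])[1:]

sentences = [["he", "went", "home", "."], ["she", "went", "out", "!"]]
model = train(sentences, 2)
print(" ".join(generate(model, 2, random.Random(0))))
```

Sampling in proportion to the counts is equivalent to sampling from the relative-frequency estimate of each conditional probability, so the counts never need to be normalized explicitly.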

If a sentence in the input text contains fewer than n tokens, you may simply discard that sentence and not use it when computing n-gram probabilities.
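A minimal Python sketch of counting n-grams while skipping short sentences follows (for illustration only; the assignment requires Perl). It assumes the sentences have already been split into lists of tokens, and the name count_ngrams is hypothetical.

```python
from collections import Counter

def count_ngrams(sentences, n):
    """Count n-grams inside each sentence; sentences with fewer than n tokens are skipped."""
    counts = Counter()
    for sent in sentences:
        if len(sent) < n:
            continue  # too short to contain even one n-gram
        for i in range(len(sent) - n + 1):
            counts[tuple(sent[i:i + n])] += 1
    return counts

sentences = [["he", "went", "home", "."], ["why", "?"]]
counts = count_ngrams(sentences, 3)
print(sorted(counts.items()))
# prints [(('he', 'went', 'home'), 1), (('went', 'home', '.'), 1)]
```

Because each sentence is processed separately, no n-gram spans two sentences. The explicit length check is there for clarity, since the inner loop would produce nothing for a short sentence anyway.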

Policies (see syllabus for more details)

Please comment your code. You must provide a detailed description of your n-gram modeling and sentence generation algorithm in your source code comments. This should focus on how you choose each word of a generated sentence. Also make sure your name, class, etc. are clearly included in the comments.

It is fine to use a Perl reference book to provide examples of loops, variables, etc., but your ngram.pl specific code must be your own, and not taken from any other source (human, published, on the web, etc.).