CS 5761 - Introduction to Natural Language Processing
Programming Assignment 1 - Demo in Lab on Monday, Feb 4 at 4pm
(also submit via email to patw0006@d.umn.edu before lab)
Objectives
To gain experience with the Perl programming language, in particular
focusing on its text processing capabilities.
Specification
Write a Perl program that prints to standard output the n most
frequent m word sequences in an arbitrary number of input
files. For this assignment a word is defined as a string of alphabetic
characters. Word sequences are determined as follows. Suppose your input
file consists of the following:
this is the test
input
file
If m = 2, then the following word sequences will be found:
this is
is the
the test
test input
input file
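The sliding window described above can be sketched as follows. This is only one possible approach, not a required design; the helper name word_sequences is an assumption made for illustration, and the word list is typed in directly rather than read from a file.

```perl
use strict;
use warnings;

# Return every sequence of $m consecutive words from @words, in order.
# (word_sequences is a hypothetical helper name, not part of the assignment.)
sub word_sequences {
    my ($m, @words) = @_;
    my @sequences;
    # Index of the last position where a full $m-word sequence can start.
    # It is negative when there are fewer than $m words, so the loop is skipped.
    my $last_start = @words - $m;
    for my $start (0 .. $last_start) {
        push @sequences, join ' ', @words[$start .. $start + $m - 1];
    }
    return @sequences;
}

# The words of the sample input file above, already split out.
my @words = qw(this is the test input file);
print "$_\n" for word_sequences(2, @words);
```

Run as written, this prints the five 2 word sequences listed above, one per line.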
The values of n and m and the input file names should
be specified by the user on the command line. Your program should display
each word sequence along with the number of times it occurs
in the input files.
The format of the command line arguments should be as follows:
yourProgram n m [list of input files]
For example,
yourProgram 5 3 in.txt file2
should display the 5 most frequent 3 word sequences that are found in
files 'in.txt' and 'file2'. Your program should display the 3 word
sequences and their associated frequency counts. Your output might
look like this:
new york city 456
for the people 345
president george bush 342
in the time 211
he said this 199
This tells us that 'new york city' occurred 456 times in these 2 files,
that 'for the people' occurred 345 times, etc.
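A minimal sketch of reading the command line arguments and counting sequences with a hash is shown here. The helper names parse_arguments and count_sequences are assumptions made for illustration, and the sequences being counted are typed in directly so that the sketch runs without any input files.

```perl
use strict;
use warnings;

# Split an argument list of the form: n m file1 file2 ...
# (parse_arguments is a hypothetical helper name.)
sub parse_arguments {
    my @args = @_;
    my $n = shift @args;       # how many top sequences to display
    my $m = shift @args;       # how many words per sequence
    return ($n, $m, @args);    # whatever remains is the list of input files
}

# Count how often each sequence occurs, using a hash keyed by the sequence.
# (count_sequences is a hypothetical helper name.)
sub count_sequences {
    my @sequences = @_;
    my %count;
    $count{$_}++ for @sequences;
    return %count;
}

# A real script would pass @ARGV; a literal list keeps the sketch runnable.
my ($n, $m, @files) = parse_arguments(qw(5 3 in.txt file2));
print "n=$n m=$m files=@files\n";

my %count = count_sequences('new york city', 'for the people', 'new york city');
# Sort by descending count, breaking ties alphabetically for a stable display.
for my $sequence (sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count) {
    print "$sequence $count{$sequence}\n";
}
```

This prints "n=5 m=3 files=in.txt file2", followed by "new york city 2" and "for the people 1".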
In the event of ties among the top n most frequent m word
sequences, display all of the sequences involved in the tie. In such
cases your output should have more than n of these m
word sequences. For example, suppose that 'in the time' and 'under the
bridge' tie as the 4th ranked 3 word sequence in the files above, each
occurring 211 times. Both should then be displayed, so 6 three word
sequences are displayed given the command line arguments above:
new york city 456
for the people 345
president george bush 342
in the time 211
under the bridge 211
he said this 199
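One way to read the example above is that sequences with equal counts share a rank, and every sequence in the top n ranks is displayed. The sketch below implements that reading; it is an interpretation of the example, not a statement of the only acceptable behavior. The helper name top_ranked and the 'on the road' entry are made up for illustration.

```perl
use strict;
use warnings;

# Return the sequences in the top $n ranks, where sequences with equal
# counts share a rank. (top_ranked is a hypothetical helper name.)
sub top_ranked {
    my ($n, %count) = @_;
    my @sorted = sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;
    my @shown;
    my $rank = 0;
    my $previous;    # count of the sequence handled just before this one
    for my $sequence (@sorted) {
        # A new, lower count starts the next rank.
        if (!defined($previous) || $count{$sequence} != $previous) {
            $rank++;
        }
        last if $rank > $n;
        push @shown, $sequence;
        $previous = $count{$sequence};
    }
    return @shown;
}

my %count = (
    'new york city'         => 456,
    'for the people'        => 345,
    'president george bush' => 342,
    'in the time'           => 211,
    'under the bridge'      => 211,
    'he said this'          => 199,
    'on the road'           => 150,    # made-up entry that falls in rank 6
);
print "$_ $count{$_}\n" for top_ranked(5, %count);
```

With n = 5 this prints the six lines shown above and omits the made-up rank 6 entry. Because the loop simply runs out of sequences when n exceeds the number of ranks, the same helper returns every sequence in that case.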
Prior to finding any word sequences, your program should eliminate all
non-alphabetic characters from the input files, and convert all text to
lower case.
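The clean-up step might be sketched as below. The helper name extract_words is an assumption, and so is the treatment of punctuation inside a word: here any non-alphabetic character acts as a word separator, so "don't" becomes "don" and "t". Deleting the character outright would give "dont" instead.

```perl
use strict;
use warnings;

# Lower-case the text and return its words, where a word is a maximal run
# of alphabetic characters. (extract_words is a hypothetical helper name.)
# Assumption: a non-alphabetic character separates words rather than
# being deleted in place.
sub extract_words {
    my ($text) = @_;
    # In list context a /g match returns every matching substring.
    my @words = lc($text) =~ /[a-z]+/g;
    return @words;
}

print join(' ', extract_words("This is the TEST-input, file #2!")), "\n";
```

This prints "this is the test input file".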
Make sure to check the boundary cases. For example, if you have a very
small number of words in your input files, it's possible that the total
number of m word sequences will be less than n. In
this case your program should display all the m word
sequences. If n is larger than the number of words in the
input files, then your program should list all of the m word
sequences as well.
Your program should treat the input files as one long line of text; in
other words, m word sequences should not be interrupted by
end-of-line or end-of-file markers.
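One way to keep sequences from being interrupted is to carry a window of the most recent words across line and file boundaries, as in this sketch. The helper name count_stream is an assumption, and in-memory filehandles stand in for real input files so that the example runs on its own.

```perl
use strict;
use warnings;

# Count $m-word sequences across several filehandles as if their contents
# were one long line. (count_stream is a hypothetical helper name.)
sub count_stream {
    my ($m, @handles) = @_;
    my %count;
    my @window;    # the most recent words, never more than $m of them
    for my $fh (@handles) {
        while (my $line = <$fh>) {
            for my $word (lc($line) =~ /[a-z]+/g) {
                push @window, $word;
                shift @window if @window > $m;
                # "@window" joins the words with single spaces.
                $count{"@window"}++ if @window == $m;
            }
        }
        # @window is deliberately not cleared here, so a sequence can
        # start in one file and finish in the next.
    }
    return %count;
}

# In-memory filehandles stand in for real files in this sketch.
my $first  = "this is the test\ninput\n";
my $second = "file\n";
open my $fh1, '<', \$first  or die "cannot open first buffer: $!";
open my $fh2, '<', \$second or die "cannot open second buffer: $!";

my %count = count_stream(2, $fh1, $fh2);
print "$_ $count{$_}\n" for sort keys %count;
```

The output includes "test input 1" and "input file 1", sequences that span a line break and a file boundary respectively. Only the current window and the hash of counts are held in memory, which helps with large inputs.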
You may assume that the command line arguments are correct, and that at
least one non-empty input file will always be provided.
Test your program using a variety of input files. Make sure that
your program can handle 1,000,000 words of input in a reasonable amount of
time (no more than a few minutes). You can find large amounts of text
by downloading a few books from the Project Gutenberg web site shown
on the class web page. During your lab demo the TA will provide test
input files, and your program will be graded based on the output from
this data. If your program does not run or produces no correct output,
you will receive no credit.
Policies (from syllabus)
All programming assignments and your project will be demonstrated during
designated lab sessions. You should also submit an electronic copy of
your source code to the TA prior to the designated demo session. (His
email address is patw0006@d.umn.edu.) There is no other way to submit
your programming assignments or project. Failure to submit AND demo on
time will result in a zero.
Any code you submit should be commented. I must be able to understand
what your code does simply by reading the comments. This understanding
should extend down to the details of your code. Do not simply
describe the input and output; also include comments that describe
your particular algorithm and coding techniques. Failure to comment
to this degree will result in a zero.
All assignments and the project are to be done individually. You are
required to write your own code. Unless otherwise specified, you must
turn in only code that you personally wrote. The only possible exception
to this is if I tell you to use a module that is available in a book
or online archive. However, I will clearly indicate when this is
permissible. Violations of this policy will result in severe grading
penalties and/or failure in the class.