CS 5761 - Introduction to Natural Language Processing
Programming Assignment 2 - Submit via
web drop by 5pm Monday Feb 16.
Objectives
To develop techniques that will be useful in creating profiles of written
documents. To gain experience with Perl file handling and hashes.
Specification
Write a Perl program called profiler.pl that will take as input a rank
cutoff (described below), a split value (described below) and an arbitrary
number of text files (1 or more). If multiple files are given as input,
treat all of them as one large document. You should convert all text to
upper or lower case, and remove or ignore punctuation. Your program
should output the following information in a nicely designed report.
For example, the following command will run the profiler and display the
top 30 ranked words, and display the word types that are unique to the
last 5 percent of the text. The text to be processed is found in two
files (olivertwist.txt and davidcopperfield.txt) but will be treated as
one big file.
profiler.pl 30 5 olivertwist.txt davidcopperfield.txt
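The input handling in the specification can be sketched roughly as follows.
This is only an illustration of Perl file handling and hashes, not a
reference solution. The helper names (tokenize_line, read_tokens,
tail_only_types) are hypothetical, and the reading of the split value used
here (the last S percent of the tokens, compared against everything before
them) is an assumption; follow the definition given for the assignment if
it differs.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: lowercase a line, delete punctuation, return its tokens.
# Assumption: plain ASCII text; apostrophes and hyphens are simply removed.
sub tokenize_line {
    my ($line) = @_;
    $line = lc $line;
    $line =~ s/[^a-z0-9\s]//g;
    return grep { length } split /\s+/, $line;
}

# Hypothetical helper: read every file in order into one long token list,
# so that several files behave like a single document.
sub read_tokens {
    my (@files) = @_;
    my @tokens;
    for my $file (@files) {
        open(my $fh, '<', $file) or die "Cannot open $file: $!\n";
        while (my $line = <$fh>) {
            push @tokens, tokenize_line($line);
        }
        close($fh);
    }
    return @tokens;
}

# Hypothetical helper: word types that occur in the last $split percent of
# the tokens and nowhere earlier (one possible reading of the split value).
sub tail_only_types {
    my ($split, @tokens) = @_;
    my $tail_len = int(@tokens * $split / 100);
    my $head_len = @tokens - $tail_len;
    my (%in_head, %tail_only);
    $in_head{ $tokens[$_] } = 1 for 0 .. $head_len - 1;
    for my $i ($head_len .. $#tokens) {
        $tail_only{ $tokens[$i] } = 1 unless $in_head{ $tokens[$i] };
    }
    return sort keys %tail_only;
}

# Only runs when a cutoff, a split value and at least one file are given.
if (@ARGV >= 3) {
    my ($cutoff, $split, @files) = @ARGV;   # $cutoff is not used in this sketch
    my @tokens = read_tokens(@files);
    my @unique = tail_only_types($split, @tokens);
    printf "%d tokens; %d types occur only in the last %s%%\n",
        scalar(@tokens), scalar(@unique), $split;
}
```

Holding every token in memory keeps the sketch short. For very large
corpora, a two-pass approach that first counts tokens and then streams the
files again may be preferable.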
There will be some experiments for you to run and report on as a part of
this assignment. These will be discussed in more detail in the lecture,
and the results will be due a few days after the code is turned in.
Experiments
You should conduct the following experiments, and prepare a short
presentation to be given in the lab on Thu Feb 19. This
will count for 30% of the grade on this assignment.
You should select two corpora that are from different genres. Each should
consist of at least 100,000 tokens. For example, you might choose a novel
as your first corpus, and news wire text as your second corpus. Note that
your two corpora should be fairly distinct, so do not select two different
novels or text from two different newspapers. You should be able
to find suitable text by searching around the Internet. Don't
forget the links that are found on the class web page.
Run your profiler.pl program on each corpus with two different settings
for the split value S: 50 and 5. Interpret what these results tell you
about the nature of the corpora, and about the nature of Zipf's Law.
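Zipf's Law says, roughly, that a word's frequency is inversely proportional
to its rank, so the product of rank and frequency should stay in the same
general range as you move down a ranked list. One way to check this is
sketched below. The helper name rank_words and the tiny sample text are
made up for illustration; a sample this small will not show the pattern,
which only becomes visible on a corpus of realistic size.

```perl
use strict;
use warnings;

# Hypothetical helper: return [word, count] pairs sorted by descending count.
# Ties are broken alphabetically so the order is deterministic.
sub rank_words {
    my (@tokens) = @_;
    my %count;
    $count{$_}++ for @tokens;
    return map  { [ $_, $count{$_} ] }
           sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;
}

# Made-up sample text, already lowercased and free of punctuation.
my @sample = qw(the cat saw the dog and the dog saw a bird);
my @ranked = rank_words(@sample);

# Print rank, word, frequency and rank * frequency for each word type.
my $rank = 0;
printf "%4s %-8s %5s %6s\n", 'rank', 'word', 'freq', 'r*f';
for my $pair (@ranked) {
    $rank++;
    my ($word, $freq) = @$pair;
    printf "%4d %-8s %5d %6d\n", $rank, $word, $freq, $rank * $freq;
}
```

Comparing the last column across the top-ranked words of each corpus is
one simple way to see how closely a text follows the law.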
You should prepare a presentation to give to the class during the lab.
This should consist of PowerPoint slides or handouts. In either case,
submit your slides or handouts to the web drop by 4pm on Thursday, and I
will put the slides on the computer in HH 302 for projection, or I will
print out enough copies of your handouts so that each member of the class
receives one.
Your presentation should summarize the characteristics of
your corpora, the results of each of the 4 possible runs of profiler, and
then your conclusions about what this tells us about the nature of the
text and Zipf's Law. Note that you must draw conclusions - you cannot
simply summarize your results.
Policies (see syllabus for more details)
Please comment your code. In particular, note where each piece of data
requested above is being collected in your code. Also make sure your name,
class, etc. are clearly included in the comments.
It is fine to use a Perl reference book for examples of loops,
variables, etc., but your profiler-specific code must be your own, and not
taken from any other source (human, published, on the web, etc.).