CS 8761 Natural Language Processing - Fall 2004
Assignment 3 - Due Mon, Nov 1, noon
This may be revised in response to your questions.
Last Update, Monday Oct 25, 9pm
Objectives
To develop a method of identifying two-word collocations using mean
and variance, and then to compare that method with the log-likelihood
ratio. This method is based on
Retrieving Collocations from Text: Xtract, by Frank Smadja.
Specification
Implement a Perl program called mean_variance.pl that will identify
collocations in text using the method described in Section 5.2 of your
text. You should also develop a score that ranks the bigrams according to
how "good" a collocation each one is. This score should be based on the mean,
standard deviation, frequency count, and window size, and should be a
single value that you can use to sort your results.
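As a rough illustration of the two statistics involved, the sketch below
computes the mean and the sample standard deviation of a list of offsets
between two words. It is written in Python purely for illustration (your
submission must still be in Perl). The function name offset_stats and the
offsets are made up, and the n - 1 denominator is an assumption that you
should check against the definition in Section 5.2.

```python
from statistics import mean, stdev

def offset_stats(offsets):
    """Return (mean, sample standard deviation, count) for a list of offsets."""
    n = len(offsets)
    m = mean(offsets)
    # stdev() divides by n - 1 and needs at least two values;
    # assumption: report 0.0 when a pair was seen only once
    s = stdev(offsets) if n > 1 else 0.0
    return m, s, n

# made-up offsets for one word pair
m, s, n = offset_stats([3, 3, 5, 5])
print(f"{m:.2f} {s:.2f} {n}")  # prints "4.00 1.15 4"
```

A low standard deviation means the two words tend to occur at a fixed
distance from each other. How you combine that with the count and the
window size is up to you.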
Your program should accept the following input parameters from the
command line:
- An integer value that gives the size of the window in which you will
look for collocations. The window includes the two words of the bigram,
so it also determines the number of intervening words that are allowed.
Thus, a value of 2 means that the words must be adjacent, a value of 3
means that they may be up to 1 word apart, a value of 4 means that they
may be up to 2 words apart, etc. The size of the window must therefore
be 2 or greater.
- An arbitrary number of input files, where each line has been
previously determined to be a sentence (by a slightly revised version
of your program boundary.pl from Assignment 2). Note that your
Assignment 2 version of the boundary program selected some number of
sentences at random. To use this program in this assignment, you will
want to override that behavior and have it
find all sentence boundaries for all the input files. Please make sure
that your boundary detection program results in one sentence per line, and
one line per sentence. Your mean_variance.pl program should assume that
the input is formatted in this way.
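One possible reading of the window definition above can be sketched as
follows. This is Python used only for illustration (the assignment itself
calls for Perl). The function name collect_offsets, the whitespace
tokenization, the lowercasing, and the choice to record only pairs in
left-to-right order are all assumptions made for this sketch, not
requirements.

```python
from collections import defaultdict

def collect_offsets(sentences, window):
    """Map each ordered word pair to the list of offsets at which it occurs."""
    if window < 2:
        raise ValueError("window size must be 2 or greater")
    offsets = defaultdict(list)
    for sentence in sentences:  # one sentence per line, as boundary.pl produces
        tokens = sentence.lower().split()  # assumption: whitespace tokenization
        for i, first in enumerate(tokens):
            # the window covers both words, so the second word may be at most
            # window - 1 positions to the right of the first
            for j in range(i + 1, min(i + window, len(tokens))):
                offsets[(first, tokens[j])].append(j - i)
    return offsets

pairs = collect_offsets(["she knocked on his door"], 3)
print(pairs[("knocked", "on")])      # [1]  (adjacent)
print(pairs[("knocked", "his")])     # [2]  (one intervening word)
print(("knocked", "door") in pairs)  # False (two intervening words, outside a window of 3)
```

Pairs are never formed across lines, which is why the one-sentence-per-line
input format matters.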
Thus, you should be able to run your program as follows:
boundary.pl input1.txt input2.txt > data.txt
mean_variance.pl 12 data.txt
OR
boundary.pl input1.txt input2.txt | mean_variance.pl 12
boundary.pl myfile.txt myotherfile.txt mythirdfile.txt > data.txt
mean_variance.pl 7 data.txt
OR
boundary.pl myfile.txt myotherfile.txt mythirdfile.txt | mean_variance.pl 7
The first group of commands (window size 12) will find bigrams that have up
to 10 words between them, while the second (window size 7) will find bigrams
with up to 5 words between them.
You should also familiarize yourself with the Ngram Statistics Package,
which is available on the Sun systems on campus. You can also find it at
http://www.d.umn.edu/~tpederse/nsp.html. You will run experiments using
the log-likelihood measure as provided in this package.
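For reference, the log-likelihood ratio for a bigram is commonly computed
from a 2x2 contingency table as G^2 = 2 * sum(observed * log(observed /
expected)), where the expected counts assume the two words are independent.
The sketch below is Python written for illustration only. It is not the NSP
code, the function and parameter names are chosen for this sketch, and the
package's own implementation may differ in its details.

```python
import math

def log_likelihood(n11, n1p, np1, npp):
    """G^2 from the joint count n11, the marginal counts n1p and np1, and the total npp."""
    observed = [
        n11,                    # word1 followed by word2
        n1p - n11,              # word1 followed by some other word
        np1 - n11,              # some other word followed by word2
        npp - n1p - np1 + n11,  # neither word in its position
    ]
    expected = [
        n1p * np1 / npp,
        n1p * (npp - np1) / npp,
        (npp - n1p) * np1 / npp,
        (npp - n1p) * (npp - np1) / npp,
    ]
    # cells with an observed count of zero contribute nothing to the sum
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

# made-up counts: the pair occurs exactly as often as independence predicts
print(log_likelihood(10, 100, 100, 1000))  # 0.0
```

A score near zero means the observed counts are close to what independence
would predict, and larger scores indicate a greater departure from it.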
Output
Your program should output a table similar to Table 5.5 (page 161) as
found in your text. This table shows the mean position, the standard
deviation, the count of the bigram, and then the two words that make up
the bigram. In addition to this, you should show your "ranking
value" and sort your output with respect to this value. This is a
score that you will develop that results in the most reliable and
interesting ranking of collocations. This score can be based on some
combination of the mean, standard deviation, frequency, and window size.
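The layout described above might be produced along these lines. This is a
Python sketch for illustration only (your program must be in Perl). The
function name format_rows, the column widths, and the sample values are made
up, and the score shown is a placeholder with no claim to being a good
measure. You are expected to devise your own.

```python
def format_rows(stats):
    """stats maps (word1, word2) to (mean, sd, count); returns lines sorted by score."""
    rows = []
    for (w1, w2), (mean, sd, count) in stats.items():
        score = count / (1.0 + sd)  # placeholder only; substitute your own measure
        rows.append((score, mean, sd, count, w1, w2))
    rows.sort(key=lambda r: r[0], reverse=True)  # highest score first
    return [
        f"{m:6.2f} {sd:6.2f} {c:6d} {w1} {w2} {sc:10.2f}"
        for sc, m, sd, c, w1, w2 in rows
    ]

# made-up statistics for two word pairs
lines = format_rows({
    ("strong", "support"): (1.5, 0.5, 30),
    ("new", "york"): (1.0, 0.0, 100),
})
for line in lines:
    print(line)
```

With these made-up values the "new york" row is printed first, because its
placeholder score is the larger of the two.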
Experiments and Report
All of your experiments should be performed on a corpus of New York Times
text from 2002 that is now available on the class web page. It consists of
approximately 8,000,000 tokens.
You will produce a written report describing the outcome of a number of
experiments. However, you should begin your report by describing how you
are interpreting the scores from the mean_variance.pl method. In other
words, describe whatever you believe to be the best way to rank these
scores in order to find or retrieve significant collocations. Please
title this portion of your report "MY INTERPRETATION OF MEAN VARIANCE".
Please provide specific examples showing why you believe your
interpretation is correct.
TOP 50 COMPARISON:
Run NSP/ll.pm and mean_variance.pl on CORPUS. Look over the top 50 or so
ranks produced by each program. (Note that you should rank your
mean_variance.pl results according to the interpretation you
described above.) Which seems better at identifying significant or
interesting collocations? How would you characterize the top 50 bigrams
found by each module? Is one of these measures significantly "better" or
"worse" than the other? Why do you think that is?
CUTOFF POINT:
Look up and down the list of bigrams as ranked by NSP/ll.pm and
mean_variance.pl. Do you notice any "natural" cutoff point for scores,
where bigrams above this value appear to be interesting or significant,
while those below do not? If you do, indicate what the cutoff point is and
discuss what this value "means" relative to the test that creates it. If
you do not see any such cutoff, discuss why you can't find one. What does
that tell you about these tests?
OVERALL RECOMMENDATION:
Based on your experiences above and any other variations you care to
pursue, is NSP/ll.pm or mean_variance.pl "the best" for identifying
significant collocations in large corpora of text? If one is better,
please explain why it is better. If neither is better, please explain why. Your
explanation should be specific to your investigations and not simply
repeat conventional wisdom.
Please divide your report into sections according to my
subheadings above (TOP 50 COMPARISON, CUTOFF POINT, OVERALL
RECOMMENDATION). You should include written analysis of your results for
each subheading, as well as actual program output to support your
analysis.
Submission Guidelines
Submit your programs boundary.pl and mean_variance.pl, as well as
your report file (experiments.txt) as a single compressed tar file that
is named with your user id. This should be submitted to the web drop on
the class web page prior to the deadline.
This is an individual assignment. You must write *all* of your code on
your own. Do not get code from your colleagues, the Internet, etc. In
addition, you are not to discuss which measure of association you are
using with your classmates. It is essentially impossible that you will
all independently arrive at the same measure or a small set of measures,
so please work independently. You must also write your report on your
own. Please do not discuss your interpretations of these results amongst
yourselves. This is meant to make you think for yourself and arrive at
your own conclusions.
by: Ted Pedersen (tpederse@umn.edu)