CS 8761 Natural Language Processing - Fall 2002
Assignment 2 - Due Fri, October 11, noon
This may be revised in response to your questions.
Last update Mon Oct 7 11am
Objectives
To investigate various measures of association that can be used to
identify collocations in large corpora of text. In particular, you will
identify and implement a measure that can be used with 2 and 3 word
sequences, and compare it with some other standard measures.
Specification
Download and install version 0.51 of the
N-gram Statistics
Package.
NSP comes with a number of modules that can be used to perform various
tests and measures on 2x2 tables in order to identify 2 word collocations
in text. These modules are a good starting point, but they by no means
represent the full range of possibilities.
For example, the module mi.pm provided with NSP implements pointwise
mutual information, yet there is no module for "true" mutual information.
In addition, all of the modules provided are only suitable for 2 word
sequences (bigrams), even though there is no reason that tests of
association cannot be implemented for 3 word sequences (trigrams).
This assignment will require that you identify a measure, not already
part of NSP, that is suitable for both 2 and 3 word sequences. You will
implement this measure and carry out some experiments to
see how well it performs relative to some of the existing measures
supported by NSP.
You will find the documentation of NSP to be relatively complete (we hope),
and should be able to determine how to implement modules from the
information provided. Please closely review the various READMEs that come
with NSP before attempting to carry out any of this assignment.
NSP was implemented entirely by
Satanjeev "Bano" Banerjee,
a nearly former UMD MS student much like yourself. However, if you
have questions about NSP please consult the documentation first, and then
me second. Please do not contact Bano under any circumstances.
Output
Your implementation will consist of four .pm files that will be used by
the statistic.pl program that comes with NSP. These .pm files will not
run on their own and will produce output only when used with NSP. You
should make no modifications to NSP. Your modules must work with version
0.51 of NSP.
Experiments and Report
All of your experiments should be performed on the same corpus of text,
hereafter known as CORPUS. You should create a corpus of at least
1,000,000 tokens from Internet resources such as Project Gutenberg. Please
make CORPUS available to me at /home/cs/(your id)/CS8761/nsp/. You may use
English or transliterated Hindi.
You will produce a short written report describing the outcome of the
following experiments.
Experiment 1:
Implement "true" mutual information for 2 word sequences. Call this module
tmi.pm. True (or "real") mutual information is designated as I(X;Y) in our
text (see page 67).
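For reference, here is a sketch of the arithmetic involved, written as a
standalone Perl script rather than as an NSP module (the counts n11, n1p,
np1, and npp are hypothetical values chosen only for illustration). True
mutual information for a bigram is the sum, over all four cells of its 2x2
contingency table, of the observed cell probability times the log of the
ratio of the observed count to the count expected under independence:

  #!/usr/bin/perl -w
  # Illustrative sketch only: true mutual information I(X;Y) for a single
  # bigram, computed from its 2x2 contingency table. This is not an NSP
  # module; tmi.pm must follow the module conventions in the NSP READMEs.
  use strict;

  # Hypothetical observed counts for a bigram "w1 w2":
  #   n11 = count of bigrams where word1 is w1 and word2 is w2
  #   n1p = count of bigrams where word1 is w1 (any second word)
  #   np1 = count of bigrams where word2 is w2 (any first word)
  #   npp = total number of bigrams in the corpus
  my ($n11, $n1p, $np1, $npp) = (30, 100, 80, 10000);

  # Remaining observed cells of the 2x2 table.
  my $n12 = $n1p - $n11;                # w1 followed by something other than w2
  my $n21 = $np1 - $n11;                # w2 preceded by something other than w1
  my $n22 = $npp - $n11 - $n12 - $n21;  # neither w1 nor w2

  # Counts expected under the independence model p(w1,w2) = p(w1)p(w2).
  my $m11 = $n1p * $np1 / $npp;
  my $m12 = $n1p * ($npp - $np1) / $npp;
  my $m21 = ($npp - $n1p) * $np1 / $npp;
  my $m22 = ($npp - $n1p) * ($npp - $np1) / $npp;

  # I(X;Y) = sum over cells of (n/npp) * log2(n/m); empty cells contribute 0.
  my $tmi = 0;
  foreach my $cell ([$n11, $m11], [$n12, $m12], [$n21, $m21], [$n22, $m22]) {
      my ($n, $m) = @$cell;
      next if $n == 0;
      $tmi += ($n / $npp) * log($n / $m) / log(2);
  }
  printf "I(X;Y) = %.6f bits\n", $tmi;

A script like this is only meant as one way to check a hand-computed
value; your tmi.pm itself must follow the notation and conventions
described in the NSP documentation.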
Identify a measure, not currently supported in NSP, that is suitable for
discovering 2 and 3 word collocations. For example, the log-likelihood
ratio can be formulated in 2 word and 3 word versions; however, you may
not use it, since it is already included in NSP. Please
note that not all measures will extend from 2 to 3 words. You should
consult some statistics books (look for tests or measures of association)
to find candidates. Please discover your own measure! Do not work with
your classmates on this. I expect that there will be a reasonable
variety of measures used, and if the class somehow miraculously settles
on 1 or 2 measures I will assume that you have not worked individually.
The textbook mentions the z-score and relative frequency ratios. I do not
believe that either is appropriate for this problem. There are some minor
variations of Pearson's chi-squared test (which essentially add a constant
value to some of the cell counts). These variations are not sufficiently
distinct from what already exists in NSP and are not suitable.
Finally, please do not invent your own measures. There are plenty of
existing measures of association available.
Once you have discovered a measure suitable for 2 and 3 word sequences,
implement the 2 word version. Call the module 'user2.pm'. Please comment
your module carefully, including a reference to wherever you found out
about this measure. You should describe the measure completely enough so
that I could compute values for it by hand based on your description and
given some 2x2 table of values.
Once you have your implementation working (and do test it by computing
some values by hand and making sure that your program reaches the same
result), please carry out the following and summarize your findings in a
written report. In each section, make sure that you show some of the
bigrams or trigrams identified, along with their statistic values/scores
and frequency counts. Also make sure to clearly indicate any command line
options that you have used so I can recreate your results if I wish.
TOP 50 COMPARISON:
Run NSP on CORPUS using user2.pm and tmi.pm. Look over the top 50 or
so ranks produced by each module. Which seems better at identifying
significant or interesting collocations? How would you characterize the
top 50 bigrams found by each module? Is one of these measures
significantly "better" or "worse" then the other? Why do you think
that is?
CUTOFF POINT:
Look up and down the list of bigrams as ranked by NSP for tmi.pm and
user2.pm. Do you notice any "natural" cutoff point for scores, where
bigrams above this value appear to be interesting or significant, while
those below do not? If you do, indicate what the cutoff point is and
discuss what this value "means" relative to the test that creates it.
If you do not see any such cutoff, discuss why you can't find one.
What does that tell you about these tests?
RANK COMPARISON:
Use rank.pl to compare each of these measures with ll.pm and mi.pm. Which
is more like ll.pm and which is more like mi.pm? To be clear, you should
compare ll.pm with tmi.pm and user2.pm, and then compare mi.pm with tmi.pm
and user2.pm. Comment on and interpret what you observe. (When running
rank.pl, make sure that ll.pm and mi.pm are the first modules indicated
each time you run it.)
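As background on what a rank comparison is quantifying (this is an
illustrative sketch only, not the rank.pl source; rank.pl's own definition
and command line are described in the NSP documentation), one standard way
to measure agreement between two rankings of the same bigrams is
Spearman's rank correlation coefficient:

  #!/usr/bin/perl -w
  # Illustrative sketch only: Spearman's rank correlation between two
  # rankings of the same bigrams. See the NSP documentation for how
  # rank.pl itself defines and reports its comparison.
  use strict;

  # Hypothetical ranks assigned to five bigrams by two different measures.
  my %rank_a = ('new york' => 1, 'of the' => 2, 'in a'    => 3,
                'red car'  => 4, 'to be'  => 5);
  my %rank_b = ('new york' => 1, 'in a'   => 2, 'of the'  => 3,
                'to be'    => 4, 'red car' => 5);

  # rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the difference in
  # rank for each bigram (this simple form assumes there are no ties).
  my ($sum_d2, $n) = (0, 0);
  foreach my $bigram (keys %rank_a) {
      my $d = $rank_a{$bigram} - $rank_b{$bigram};
      $sum_d2 += $d * $d;
      $n++;
  }
  my $rho = 1 - (6 * $sum_d2) / ($n * ($n * $n - 1));
  printf "Spearman's rho = %.4f\n", $rho;

A rho near 1 means the two measures order the bigrams in nearly the same
way, while a rho near 0 means the two orderings are essentially unrelated.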
OVERALL RECOMMENDATION:
Based on your experiences above and any other variations you care to
pursue, which of the following measures is "the best" for identifying
significant collocations in large corpora of text: mi.pm, ll.pm, user2.pm,
or tmi.pm? If there is one, please explain why it is better. If
none is better please explain. Your explanation should be specific to your
investigations and not simply repeat conventional wisdom.
Please divide your report into sections according to the subheadings
above (TOP 50 COMPARISON, CUTOFF POINT, RANK COMPARISON, OVERALL
RECOMMENDATION). You should dedicate 1-2 paragraphs to each
subheading. Please feel free to include small portions of your output to
illustrate your points.
Experiment 2:
Implement a module named ll3.pm that performs the log-likelihood ratio
test for 3 word sequences. Assume that the null hypothesis of the test is
as follows:
p(W1) p(W2) p(W3) = p(W1,W2,W3)
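For reference, here is a sketch of the arithmetic involved, again as a
standalone Perl script rather than an NSP module (the eight observed cell
counts are hypothetical values chosen for illustration). The test compares
each observed cell of the trigram's 2x2x2 contingency table against the
count expected if the three word positions were completely independent:

  #!/usr/bin/perl -w
  # Illustrative sketch only: log-likelihood ratio for a single trigram,
  # computed from its 2x2x2 contingency table under the null hypothesis
  # p(W1,W2,W3) = p(W1)p(W2)p(W3). This is not an NSP module; ll3.pm must
  # follow the module conventions in the NSP READMEs.
  use strict;

  # Hypothetical observed counts. $obs{"ijk"} counts trigrams in which
  # position 1 does (i=1) or does not (i=2) contain word1, and likewise
  # for position 2 (j, word2) and position 3 (k, word3). So $obs{"111"}
  # is the count of the trigram "word1 word2 word3" itself.
  my %obs = (
      '111' => 20,  '112' => 30,   '121' => 15,  '122' => 135,
      '211' => 25,  '212' => 230,  '221' => 310, '222' => 9235,
  );

  # Total number of trigrams and the three single-position marginals.
  my $npp = 0;
  $npp += $_ for values %obs;
  my %m1 = (1 => 0, 2 => 0);
  my %m2 = (1 => 0, 2 => 0);
  my %m3 = (1 => 0, 2 => 0);
  foreach my $cell (keys %obs) {
      my ($i, $j, $k) = split //, $cell;
      $m1{$i} += $obs{$cell};
      $m2{$j} += $obs{$cell};
      $m3{$k} += $obs{$cell};
  }

  # -2 log lambda = 2 * sum over cells of n * ln(n/m), where m is the count
  # expected under complete independence; empty cells contribute 0.
  my $ll3 = 0;
  foreach my $cell (keys %obs) {
      my ($i, $j, $k) = split //, $cell;
      my $n = $obs{$cell};
      next if $n == 0;
      my $m = $m1{$i} * $m2{$j} * $m3{$k} / ($npp * $npp);
      $ll3 += 2 * $n * log($n / $m);
  }
  printf "log-likelihood ratio = %.4f\n", $ll3;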
Create your 3 word version of user2.pm and call it user3.pm. Please
comment user3.pm just as carefully as you did user2.pm. Some of the
information may be duplicated (i.e., references, general description), but
that is fine. Make sure you make it clear how this measure extends to 3 word
sequences and provide enough detail so I can compute values for this by
hand based on your description.
TOP 50 COMPARISON:
As in Experiment 1, except now for user3.pm and ll3.pm.
CUTOFF POINT:
As in Experiment 1, except now for user3.pm and ll3.pm.
OVERALL RECOMMENDATION:
As in Experiment 1, except now for user3.pm and ll3.pm.
Submission Guidelines
Submit your modules 'tmi.pm', 'user2.pm', 'll3.pm', and 'user3.pm'.
Each should be carefully commented, explaining what test it implements and
what that test does. Your comments for 'user2.pm' and 'user3.pm' must be
sufficient so that I can manually compute values for these tests.
You must also provide a complete worked-out example (showing, in detail,
how this measure is computed for a specific case).
You must also indicate where you found out about this measure. Please
provide a complete reference for printed materials, and complete, working
URLs for web pages. Please name your report 'experiments.txt'. Clearly
identify the discussion associated with Experiment 1 and 2, and use the
SUBHEADINGS provided above to organize your discussion.
All of these should be plain text files. Make sure that your name, date,
and class information are contained in each file, and that your .pm
files are carefully commented.
Place all of these files into a directory that is named with your UMD user
id. In my case the directory would be called tpederse, for example. Then
create a tar file that includes this directory and the files you will
submit. Compress that tar file and submit it via the web drop from the
class home page. Please note that the deadline will be enforced by
automatic means. Any submissions after the deadline will not be graded.
The web drop has a limit of 10 MB, so your files should be plain text. If
you have large data files you wish to share, please include them in your
/home/cs/(your id)/CS8761/nsp directory.
This is an individual assignment. You must write *all* of your code on
your own. Do not get code from your colleagues, the Internet, etc. In
addition, you are not to discuss which measure of association you are
using with your classmates. It is essentially impossible that you will
all independently arrive at the same measure or a small set of measures,
so please work independently. You must also write your report on your
own. Please do not discuss your interpretations of these results amongst
yourselves. This is meant to make you think for yourself and arrive at
your own conclusions.
by:
Ted Pedersen
- tpederse@umn.edu