CS 8995 Corpus Based Natural Language Processing
Assignment 2 - Due Mon, Feb 12, 4 pm
Please note that I may revise this based on your feedback
and questions. Last update Fri Feb 2 6:00pm
Objectives
To identify collocations using pointwise mutual information, where the
collocations are defined as regular expressions.
Specification
Write a Perl program that will calculate the pointwise Mutual Information
for the pair of strings that match two regular expressions (represented by
random variables X and Y in the description below).
Your program should display the pair of strings that match the regular
expressions and their pointwise mutual information values for the top N
matching patterns. Please note that the pair of random variables (X;Y) should
represent the two matching strings that occur in sequence in the text. In
other words, X should refer to the string on the left side
and Y to the string on the right side.
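For reference, the standard definition of pointwise mutual information
for one specific pair of outcomes x and y is given below. The base-2
logarithm is an assumption; this handout does not fix the base.

```latex
I(x, y) = \log_2 \frac{P(X = x,\; Y = y)}{P(X = x)\, P(Y = y)}
```

A value above zero means x and y occur together more often than they
would if X and Y were independent; a value below zero means less often.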
Your program should be able to handle any value of N and any number of
text files. As a practical matter, you may assume that the total number
of word tokens in my test cases will be less than 1 million and that
the value of N will be between 1 and 100. Your program should be able to
process 1 million word tokens in under 10 minutes, and certainly much
faster when given very restrictive regular expressions that limit the
number of strings for which mutual information values must be computed.
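One way to stay within this time budget is to count all matched pairs in
a single pass with hash-based counters and only then compute the scores.
The sketch below is in Python purely for illustration (the assignment
itself requires Perl). The name pmi_table, the base-2 logarithm, and the
choice to estimate every probability from the table of matched pairs are
assumptions, not part of the specification.

```python
import math
from collections import Counter

def pmi_table(pairs):
    """Map each distinct (x, y) pair to its pointwise mutual information.

    Assumption: P(x, y), P(x) and P(y) are relative frequencies taken
    from the list of matched pairs itself.
    """
    pair_counts = Counter(pairs)
    left_counts = Counter()
    right_counts = Counter()
    for (x, y), count in pair_counts.items():
        left_counts[x] += count    # times x appears as the left string
        right_counts[y] += count   # times y appears as the right string
    total = sum(pair_counts.values())
    return {
        (x, y): math.log2(count * total / (left_counts[x] * right_counts[y]))
        for (x, y), count in pair_counts.items()
    }
```

The work is linear in the number of matched pairs plus the number of
distinct pairs, so a restrictive regular expression shrinks both terms.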
Your program will need to be able to process raw text files and part
of speech tagged files from the Penn Treebank. You should have a command
line option to tell your program when it is using Treebank data. Remember
that each word in the treebank data has a two or three character part
of speech tag attached to it (e.g., the/DT).
Your program should accept command line arguments ordered as follows:
userid.pl N [-treebank] /regex1/ /regex2/ file1 file2 ... filez
For example:
tpederse.pl 2 /\binterest(ing|s|ed)?\b / /\w+/ file2 file3 file5
tpederse.pl 10 /\binterest in\b / /\w+/ file1 file2 file3 file5
tpederse.pl 10 /interest in / /\w+ / file1 file2 file3
tpederse.pl 5 /../ /./ file1 file2 file3
tpederse.pl 5 -treebank /\w+\/NN / /\w+\/PP/ file1 file2 file3
tpederse.pl 5 -treebank /\w+\/N. / /cat/ file1 file2 file3 file4
Please note that the regular expressions do NOT need to be limited
by word boundaries. The fourth example above describes a situation
where the left component consists of any two characters and the
right consists of any single character. If you want the regex to
respect word boundaries then you should simply use a combination of
\b, spaces, and/or \s+ in the definition of the regex.
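A minimal sketch of how adjacent matches might be collected, written in
Python for illustration only. The helper name find_pairs is hypothetical,
and trying a match at every character position (so that matches may
overlap) is an assumption about the intended counting, not something
this handout states.

```python
import re

def find_pairs(text, regex1, regex2):
    """Return (x, y) for each position where regex1 matches and regex2
    matches immediately after it.

    The zero-width lookahead lets the scan try every character position.
    Named groups keep x and y retrievable even when the patterns contain
    their own parentheses, e.g. (ing|s|ed)?.
    """
    combined = re.compile(f"(?=(?P<x>{regex1})(?P<y>{regex2}))")
    return [(m.group("x"), m.group("y")) for m in combined.finditer(text)]
```

With the patterns .. and . on the text "abcd", this yields the pairs
("ab", "c") and ("bc", "d"). Patterns that use numbered backreferences
would need adjusting, because the wrapper adds groups of its own.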
You may assume that the command line arguments are valid. There is no need
to include error checking for things like invalid values of N (e.g., N < 0),
non-existent files, or invalid regular expressions.
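Because the arguments can be assumed valid, reading them takes only a
few lines. The following Python sketch is illustrative only; parse_args
and strip_delimiters are hypothetical names, and it assumes the shell
delivers each /regex/ as one argument with both slashes intact.

```python
def strip_delimiters(arg):
    """Drop the surrounding slashes from an argument such as /\\w+ /."""
    if len(arg) >= 2 and arg.startswith("/") and arg.endswith("/"):
        return arg[1:-1]
    return arg

def parse_args(argv):
    """Split N [-treebank] /regex1/ /regex2/ file1 ... into its parts.

    argv is the argument list without the program name.
    """
    args = list(argv)
    n = int(args.pop(0))
    treebank = bool(args) and args[0] == "-treebank"
    if treebank:
        args.pop(0)
    regex1 = strip_delimiters(args.pop(0))
    regex2 = strip_delimiters(args.pop(0))
    return n, treebank, regex1, regex2, args  # remaining args are files
```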
Some command shells may not be able to process regular expressions on the
command line as shown above. It is fine if you need to modify the format
of the command line to get the regexes passed properly to Perl.
However, remember that I will test your programs under Unix on csdevXX.
Make sure your program runs on this platform before you submit it.
If special command line formatting is required to run ** on Unix ** please
include a comment near the top of your code showing me exactly what
is needed. If such a comment is not included I will assume your
program follows the syntax shown above.
In the event of ties among the top N values, you should treat all tied
values as one rank. For example,
tpederse.pl 2 /interest(ing|s|ed)?\b / /\w+/ file2 file3 file5
might generate output like this:
interest rate 0.99999
interesting data 0.99988
interest in 0.99988
disinterest in 0.99988
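The tie rule can be sketched as follows, in Python for illustration
only. The name top_n_ranks is hypothetical. Deciding ties on the value
rounded to 5 decimal places and listing tied pairs alphabetically are
both assumptions; the handout only says that tied values share one rank.

```python
def top_n_ranks(scores, n, digits=5):
    """Return (pair, value) entries covering the top n distinct values.

    scores maps (x, y) pairs to PMI values. Pairs whose rounded values
    are equal share a rank, so more than n entries may be returned.
    """
    ordered = sorted(
        scores.items(),
        key=lambda item: (-round(item[1], digits), item[0]),
    )
    selected = []
    ranks_used = 0
    previous = None
    for pair, score in ordered:
        value = round(score, digits)
        if value != previous:
            ranks_used += 1
            if ranks_used > n:
                break
            previous = value
        selected.append((pair, value))
    return selected
```

Applied to the four scores shown above with N = 2, this keeps all four
lines: one pair at the first rank and three tied pairs at the second.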
Please note that we could have more than two words forming the
collocation. We could also have character-based computations. This will
all depend on how the regular expressions are formed.
You should be able to repeat the functionality of your code from
assignment 1 (with the exception of reporting the overall mutual
information values) with the following:
tpederse.pl 10 /\w+ / /\w+/ file1 file2
This should report the pointwise mutual information values of the top 10
two-word sequences.
tpederse.pl 10 /./ /./ file1 file2
This should report the pointwise mutual information values of the top 10
two-character sequences.
It would be wise to compare the results of assignment 1 with test cases
like this to make sure your computations are correct.
Output Format
The earlier example (with the interest data) shows how your
output MUST be formatted. Pointwise Mutual Information values should be
displayed to 5 digits of precision. There should be a single space
between words.
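A small Python sketch of one output line, for illustration only. The
name format_line is hypothetical, and reading "5 digits of precision" as
five digits after the decimal point is an assumption based on the sample
output.

```python
def format_line(x, y, score):
    """Join the two matched strings with single spaces and append the
    score with five digits after the decimal point."""
    words = (x + " " + y).split()   # also drops a trailing space in x
    return " ".join(words) + f" {score:.5f}"
```

For instance, format_line("interest ", "rate", 0.99999) gives the line
"interest rate 0.99999".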
You should treat your text according to the following guidelines:
- Ignore case: treat all text as either upper or lower case.
- Only allow a single space between words in the text.
- Only consider alpha characters and spaces. When using the
-treebank option you must also allow the "/". Other characters are
to be discarded.
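These guidelines could be applied in one pass before any matching. The
Python sketch below is illustrative only. The name normalize is
hypothetical, and taking "alpha characters" to mean the ASCII letters
a-z is an assumption. Note that lower-casing turns the/DT into the/dt,
so a pattern such as /\w+\/NN / would then need case-insensitive
matching (or the text could be upper-cased instead); the handout does
not say which.

```python
import re

def normalize(text, treebank=False):
    """Lower-case the text, discard disallowed characters, and collapse
    every run of whitespace to a single space."""
    text = text.lower()
    # Keep letters and whitespace; also keep "/" for Treebank input.
    disallowed = r"[^a-z/\s]" if treebank else r"[^a-z\s]"
    text = re.sub(disallowed, "", text)
    return re.sub(r"\s+", " ", text).strip()
```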
Other information
Please use turnin to submit this program. Remember that you can only use
turnin from hh33812. No email submission is necessary. The proper turnin
command is:
turnin -c cs8995 -p a2 userid.pl
This is an individual assignment. Please work on your own. You are
free to discuss the problem and coding issues with your colleagues,
but in the end the program you submit should reflect your own
understanding of Mutual Information and Perl programming.
Please note that the deadline will be enforced by automatic means. Any
submissions after the deadline will not be received.
by:
Ted Pedersen
- tpederse@d.umn.edu