Ted Pedersen - CS 8995 Corpus Based Natural Language Processing

Assignment 2 - Due Mon, Feb 12, 4 pm

Please note that I may revise this based on your feedback and questions. Last update Fri Feb 2 6:00pm

Objectives

To identify collocations using pointwise mutual information, where the collocations are defined as regular expressions.

Specification

Write a Perl program that calculates the pointwise mutual information for pairs of strings that match two regular expressions (represented by the random variables X and Y in the description below).

Your program should display the pairs of strings that match the regular expressions, together with their pointwise mutual information values, for the top N matching patterns. Please note that the pair of random variables (X;Y) should represent the two matching strings that occur in sequence in the text. In other words, X should refer to the string on the left side (matched by the first regular expression) and Y to the string on the right side (matched by the second).
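As a sketch of the arithmetic, assuming the usual definition PMI(x, y) = log2( P(x, y) / (P(x) P(y)) ) with probabilities estimated as relative frequencies. The code is Python purely for illustration (the assignment itself requires Perl). The function name, the base-2 logarithm, and the use of one shared total for all three estimates are assumptions, not part of the specification.

```python
import math

def pointwise_mi(pair_count, left_count, right_count, total):
    """PMI of one (x, y) pair from raw counts (hypothetical helper).

    pair_count  : times x was immediately followed by y
    left_count  : times x occurred as the left string of any pair
    right_count : times y occurred as the right string of any pair
    total       : total number of pairs observed
    """
    # (c_xy / N) / ((c_x / N) * (c_y / N)) simplifies to c_xy * N / (c_x * c_y)
    return math.log2(pair_count * total / (left_count * right_count))
```

A pair that co-occurs exactly as often as independence would predict scores 0; a pair that co-occurs twice as often as predicted scores 1.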

Your program should be able to handle any value of N and any number of text files. As a practical matter, you may assume that the total number of word tokens in my test cases will be less than 1 million and that the value of N will be between 1 and 100. Your program should be able to process 1 million word tokens in under 10 minutes, and certainly much faster when given very restrictive regular expressions that limit the number of strings for which mutual information values must be computed.
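One way to stay within the time budget is to tally everything in a single pass using hash tables (Perl hashes play this role; Python is used here only for illustration). The sketch below assumes the matched (left, right) string pairs have already been extracted into a list; the function name and return layout are hypothetical.

```python
from collections import Counter

def count_pairs(pairs):
    """Tally joint and marginal counts for a list of (left, right) pairs."""
    pair_counts = Counter(pairs)                    # count of each (x, y)
    left_counts = Counter(x for x, _ in pairs)      # count of each x
    right_counts = Counter(y for _, y in pairs)     # count of each y
    return pair_counts, left_counts, right_counts, len(pairs)
```

With these four values in hand, each mutual information value needs only constant-time lookups, so the cost is dominated by the matching pass over the text.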

Your program will need to be able to process raw text files and part of speech tagged files from the Penn Treebank. You should have a command line option to tell your program when it is using Treebank data. Remember that each word in the treebank data has a two or three character part of speech tag attached to it (e.g., the/DT).

Your program should accept command line arguments ordered as follows:

userid.pl N [-treebank] /regex1/ /regex2/ file1 file2 ... filez

For example:

tpederse.pl 2 /\binterest(ing|s|ed)?\b / /\w+/ file2 file3 file5
tpederse.pl 10 /\binterest in\b / /\w+/  file1 file2 file3 file5
tpederse.pl 10 /interest in / /\w+ /  file1 file2 file3
tpederse.pl 5 /../ /./  file1 file2 file3
tpederse.pl 5 -treebank /\w+\/NN / /\w+\/PP/  file1 file2 file3
tpederse.pl 5 -treebank /\w+\/N. / /cat/  file1 file2 file3 file4

Please note that the regular expressions do NOT need to be limited by word boundaries. The fourth example above describes a situation where the left component consists of any two characters and the right consists of any single character. If you want the regex to respect word boundaries then you should simply use a combination of \b, spaces, and/or \s+ in the definition of the regex.
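The matching of two regular expressions in sequence could be sketched as follows, in Python for illustration only (the assignment requires Perl). The two patterns are joined into one pattern with a named group for each side, so that groups inside the user's regexes (such as (ing|s|ed)) do not disturb the numbering. Resuming the scan at the end of the left match, so that the right string of one pair can serve as the left string of the next, is an assumption about the intended behavior; the function and group names are hypothetical.

```python
import re

def adjacent_pairs(text, left_regex, right_regex):
    """Return (left, right) strings where left_regex is immediately
    followed by right_regex in text."""
    pattern = re.compile("(?P<x>%s)(?P<y>%s)" % (left_regex, right_regex))
    pairs = []
    pos = 0
    while pos <= len(text):
        m = pattern.search(text, pos)
        if m is None:
            break
        pairs.append((m.group("x"), m.group("y")))
        # Resume after the left string; always advance at least one
        # character so an empty left match cannot loop forever.
        pos = max(m.end("x"), m.start() + 1)
    return pairs
```

Under this assumption, the patterns . and . applied to "abcd" yield the character pairs (a, b), (b, c), (c, d), and the patterns "\w+ " and "\w+" applied to "a bc d" yield the word pairs ("a ", "bc") and ("bc ", "d").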

You may assume that the command line arguments are valid. There is no need to include error checking for things like invalid values of N (e.g., N < 0), non-existent files, or invalid regular expressions.
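The argument order above could be unpacked roughly as follows. This is a Python sketch for illustration only (the assignment requires Perl), and it assumes that each regular expression reaches the program as a single argument with its surrounding slashes intact, which may require quoting in the shell. The function names are hypothetical, and no validation is done, in line with the note above.

```python
def strip_slashes(arg):
    """Remove one leading and one trailing '/' from a /regex/ argument."""
    if len(arg) >= 2 and arg.startswith("/") and arg.endswith("/"):
        return arg[1:-1]
    return arg

def parse_args(argv):
    """Split 'N [-treebank] /regex1/ /regex2/ file1 ... filez' into parts."""
    args = list(argv)
    n = int(args.pop(0))
    treebank = False
    if args and args[0] == "-treebank":
        treebank = True
        args.pop(0)
    left_regex = strip_slashes(args.pop(0))
    right_regex = strip_slashes(args.pop(0))
    return n, treebank, left_regex, right_regex, args  # remaining args are files
```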

Some command shells may not be able to process regular expressions on the command line as shown above. It is fine if you need to modify the format of the command line to get the regexes passed properly to Perl.

However, remember that I will test your programs under Unix on csdevXX. Make sure your program runs on this platform before you submit it. If special command line formatting is required to run ** on Unix ** please include a comment near the top of your code showing me exactly what is needed. If such a comment is not included I will assume your program follows the syntax shown above.

In the event of ties among the top N values, you should treat all tied values as one rank. For example,

tpederse.pl 2 /interest(ing|s|ed)?\b / /\w+/ file2 file3 file5

might generate output like this:

interest rate 0.99999
interesting data 0.99988
interest in 0.99988
disinterest in 0.99988
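The tie handling shown in this output could be sketched as below, in Python for illustration only (the assignment requires Perl). Pairs are sorted by descending value, and a new rank starts only when the displayed value changes, so all pairs sharing a value count as one rank. Judging ties on the value as displayed (5 decimal places) rather than on the unrounded number is an assumption, and the function name and input layout are hypothetical.

```python
def top_n_with_ties(scores, n):
    """scores maps (left, right) -> value; return output lines for the
    top n ranks, where tied values share a single rank."""
    ranked = sorted(scores.items(), key=lambda item: -item[1])
    lines = []
    ranks_used = 0
    previous = None
    for (left, right), value in ranked:
        shown = "%.5f" % value
        if shown != previous:          # a new distinct value starts a new rank
            ranks_used += 1
            if ranks_used > n:
                break
            previous = shown
        # strip() keeps a single space between the two strings in the output
        lines.append("%s %s %s" % (left.strip(), right.strip(), shown))
    return lines
```

With N = 2 and the four values shown above plus any lower-scoring pair, this returns the same four lines: one pair at the first rank and three tied pairs at the second.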

Please note that we could have more than two words forming the collocation. We could also have character-based computations. This will all depend on how the regular expressions are formed.

You should be able to repeat the functionality of your code from assignment 1 (with the exception of reporting the overall mutual information values) with the following:

tpederse.pl 10 /\w+ / /\w+/ file1 file2

This should report the pointwise mutual information values of the top 10 two-word sequences.

tpederse.pl 10 /./ /./ file1 file2

This should report the pointwise mutual information values of the top 10 two-character sequences.

It would be wise to compare the results of assignment 1 with test cases like this to make sure your computations are correct.

Output Format

The earlier example (with the interest data) shows how your output MUST be formatted. Pointwise mutual information values should be displayed to 5 digits of precision. There should be a single space between words.

You should treat your text according to the following guidelines:

Ignore case: treat all text as either upper or lower case.
Only allow a single space between words in the text.
Only consider alpha characters and spaces. When using the -treebank option you must also allow the "/". Other characters are to be discarded.
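These three guidelines could be applied in one small normalization step, sketched here in Python for illustration only (the assignment requires Perl). The sketch assumes that "alpha characters" means the ASCII letters a-z, that lower case is the chosen case, and that line breaks count as spaces; the function name is hypothetical.

```python
import re

def normalize(text, treebank=False):
    """Lowercase, drop disallowed characters, and collapse whitespace."""
    text = text.lower()
    # Keep letters and whitespace; also keep "/" for Treebank word/TAG tokens.
    disallowed = r"[^a-z/\s]" if treebank else r"[^a-z\s]"
    text = re.sub(disallowed, "", text)
    # Only allow a single space between words.
    return re.sub(r"\s+", " ", text)
```

Note that lowercasing also lowercases the part of speech tags (the/DT becomes the/dt), so under this approach a pattern such as \w+\/NN would have to be matched case-insensitively.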

Other information

Please use turnin to submit this program. Remember that you can only use turnin from hh33812. No email submission is necessary. The proper turnin command is:

turnin -c cs8995 -p a2 userid.pl

This is an individual assignment. Please work on your own. You are free to discuss the problem and coding issues with your colleagues, but in the end the program you submit should reflect your own understanding of Mutual Information and Perl programming.

Please note that the deadline will be enforced by automatic means. Any submissions after the deadline will not be received.

by: Ted Pedersen - tpederse@d.umn.edu