Ted Pedersen - CS 8995 Corpus Based Natural Language Processing

CS 8995 Corpus Based Natural Language Processing

Assignment 1 - Due Wed, January 31, 4 pm

Please note that I may revise this based on your feedback and questions. Last updated Mon, Jan 29, 3:00 pm.

Objectives

To measure the mutual information of character pairs and word pairs that occur in text.

Specification

Write a Perl program that will calculate the mutual information for:

Two sequential characters (represented by random variables A and B in the description below), and
Two sequential words (represented by random variables X and Y in the description below).
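
For reference, the standard definition of mutual information for two discrete random variables, written with the A and B used here, is sketched below. The base-2 logarithm (values in bits) and the reading of a single pair's "associated" value as its pointwise term are assumptions; this page does not fix either.

```latex
I(A;B) = \sum_{a}\sum_{b} P(A=a,\,B=b)\,\log_2 \frac{P(A=a,\,B=b)}{P(A=a)\,P(B=b)}

% Assumed per-pair value (pointwise mutual information) for one observed pair (a, b):
\mathrm{pmi}(a,b) = \log_2 \frac{P(A=a,\,B=b)}{P(A=a)\,P(B=b)}
```

The same expressions apply to word pairs with X and Y in place of A and B. Pairs that never occur contribute nothing to the sum.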

Your program should display the values of I(A;B) and I(X;Y) as well as the top N character pairs and word pairs and their associated mutual information values. Please note that the pairs of random variables (A;B) and (X;Y) each represent an ordered two-character or two-word sequence. In other words, A refers to the character on the left side and B to the character on the right side of the character pair; X refers to the word on the left side and Y to the word on the right side. For example, suppose we are computing I(A;B) for the following:

he says

P(A=h, B=e) ≠ P(A=e, B=h) since the first refers to the pair 'he' and the second refers to the pair 'eh' (which is not observed above).
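
A minimal Perl sketch of how I(A;B) might be computed from ordered pair counts follows. The helper name `mutual_information`, the nested-hash layout (`$counts->{left}{right}`), the base-2 logarithm, and the choice to estimate the marginals from the pair counts themselves are all assumptions made for illustration, not requirements stated on this page.

```perl
use strict;
use warnings;

# Hypothetical helper: mutual information in bits, given ordered pair counts
# stored as $counts->{left}{right}. Marginals are estimated from the pair
# counts (an assumption), so P(A=l) = (pairs starting with l) / (all pairs).
sub mutual_information {
    my ($counts) = @_;
    my $total = 0;
    my (%left, %right);
    for my $l (keys %$counts) {
        for my $r (keys %{ $counts->{$l} }) {
            my $c = $counts->{$l}{$r};
            $total     += $c;
            $left{$l}  += $c;
            $right{$r} += $c;
        }
    }
    return 0 unless $total;

    my $mi = 0;
    for my $l (keys %$counts) {
        for my $r (keys %{ $counts->{$l} }) {
            my $c = $counts->{$l}{$r};
            # P(l,r) / (P(l) * P(r)) simplifies to c * total / (left * right).
            my $ratio = $c * $total / ( $left{$l} * $right{$r} );
            $mi += ( $c / $total ) * log($ratio) / log(2);
        }
    }
    return $mi;
}

# Ordered pairs from "he says": 'he' is counted, 'eh' is not.
my %demo;
my $text = 'he says';
$demo{ substr($text, $_, 1) }{ substr($text, $_ + 1, 1) }++
    for 0 .. length($text) - 2;
print "'he' observed, 'eh' not observed\n"
    if $demo{h}{e} && !exists $demo{e}{h};
printf "%.5f\n", mutual_information(\%demo);
```

Keeping the left and right items as separate hash levels makes the ordering explicit, so 'he' and 'eh' can never be merged by accident.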

Your program should be able to handle any value of N and any number of text files. As a practical matter, you may assume that the total number of word tokens in my test cases will be less than 10 million and that the value of N will be between 1 and 100.

Your program should accept command line arguments ordered as follows:

userid.pl N file1 file2 ... filez

For example:

tpederse.pl 3 README robinson.txt

You may assume that the command line arguments are valid. There is no need to include error checking for things like invalid values of N (e.g., N < 0) or non-existent files. Please note that I will test your program automatically, so if you do not follow the command line format your program will fail and receive no credit.
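
One way the argument order above could be handled in Perl is sketched here. The helper name `parse_args` is hypothetical; a real program would most likely pass `@ARGV` to it, and, in line with the note above, no validation is attempted.

```perl
use strict;
use warnings;

# Hypothetical helper: the first argument is N, the rest are file names.
# No error checking, since the arguments may be assumed valid.
sub parse_args {
    my @args = @_;
    my $n    = shift @args;
    return ( $n, @args );
}

# In the actual program this would be parse_args(@ARGV).
my ( $n, @files ) = parse_args(qw(3 README robinson.txt));
print "N = $n, files = @files\n";    # prints: N = 3, files = README robinson.txt
```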

In the event of ties among the top N values, you should treat all tied values as one rank. For example, if N=2 and the top 5 values are:

fine wine 0.99999
great trip 0.99988
big dog 0.99988
new york 0.99988
big time 0.97732

then you would display:

fine wine 0.99999
great trip 0.99988
big dog 0.99988
new york 0.99988
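
The tie rule can be sketched in Perl as below. The helper name `top_n_with_ties` is hypothetical, and two points are assumptions rather than stated requirements: ties are detected on the value as displayed (5 decimal places), and tied pairs are listed alphabetically, since no order among ties is specified.

```perl
use strict;
use warnings;

# Hypothetical helper: return [pair, displayed score] for every pair whose
# score falls within the top $n distinct scores. All pairs tied at a kept
# score are returned, so the result may hold more than $n entries.
sub top_n_with_ties {
    my ( $scores, $n ) = @_;
    my @sorted =
        sort { $scores->{$b} <=> $scores->{$a} or $a cmp $b } keys %$scores;

    my $rank = 0;
    my $prev;
    my @top;
    for my $pair (@sorted) {
        my $shown = sprintf '%.5f', $scores->{$pair};
        if ( !defined $prev or $shown ne $prev ) {
            $rank++;                 # a new distinct value starts a new rank
            last if $rank > $n;
            $prev = $shown;
        }
        push @top, [ $pair, $shown ];
    }
    return @top;
}

my %scores = (
    'fine wine'  => 0.99999,
    'great trip' => 0.99988,
    'big dog'    => 0.99988,
    'new york'   => 0.99988,
    'big time'   => 0.97732,
);
print "$_->[0] $_->[1]\n" for top_n_with_ties( \%scores, 2 );
```

With N=2 this keeps 'fine wine' plus the three pairs tied at 0.99988 and drops 'big time'.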

Output Format

Please follow these output format guidelines carefully, since your program will be graded automatically. Mutual information values should be displayed to 5 digits of precision (five digits after the decimal point, as in the example below). Display I(A;B) and the top N character pairs (with their associated mutual information values) first, and then I(X;Y) and the top N word pairs. Your output should look something like this, supposing N=3 (the values are not meant to illustrate anything other than proper formatting):

14.33333
a e 1.98134
i e 1.97777
o u 1.97011
15.12345
big dog 1.34567
mad hatter 1.29999
for the 1.12030
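
A short Perl sketch of producing lines in this layout follows. The helper name `format_section` and its argument layout are hypothetical; the only detail taken from the format above is that every value is printed with five digits after the decimal point.

```perl
use strict;
use warnings;

# Hypothetical helper: one output section is the overall value on its own
# line, followed by one "left right value" line per pair.
sub format_section {
    my ( $total_mi, @pairs ) = @_;    # each pair is [left, right, score]
    my @lines = ( sprintf( '%.5f', $total_mi ) );
    push @lines, sprintf( '%s %s %.5f', @$_ ) for @pairs;
    return @lines;
}

print "$_\n" for format_section(
    14.33333,
    [ 'a', 'e', 1.98134 ],
    [ 'i', 'e', 1.97777 ],
    [ 'o', 'u', 1.97011 ],
);
```

Calling it once for the character section and once for the word section would give the two-part layout shown above.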

You should treat your text according to the following guidelines:

Ignore case. Treat all letters as upper or lower case. (I have shown lower case in the output above, but if for some reason you prefer upper case that is fine.)
Ignore all characters except alphabetic (a-z) and space. This leaves you with 27 valid characters. All other characters (numeric and punctuation) should be discarded.
Only allow a single space between words.
Treat words that are separated by the end of a line as if they were separated by a space. So given
```
my
name
```
you should have character pairs y(space) and (space)n but NOT yn.
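
The guidelines above could be applied with a small Perl routine like the following sketch. The helper name `normalize_text` is hypothetical. It assumes the whole input is held in one string, and it trims leading and trailing spaces, which the guidelines do not address.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce raw text to lower-case letters and single spaces.
sub normalize_text {
    my ($text) = @_;
    $text = lc $text;            # ignore case
    $text =~ s/\s+/ /g;          # line breaks and tabs act as spaces
    $text =~ s/[^a-z ]//g;       # discard digits, punctuation, everything else
    $text =~ s/ +/ /g;           # only a single space between words
    $text =~ s/^ //;             # assumption: no leading space
    $text =~ s/ $//;             # assumption: no trailing space
    return $text;
}

print normalize_text("My\nname"), "\n";    # prints: my name
```

Converting line breaks to spaces before deleting other characters is what keeps 'y' and 'n' from becoming adjacent in the example above.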

Sample character and word pair counts

Suppose the following sentence is being counted:

the bill is big

The character pair counts should be as follows:

t h 1 
h e 1
e   1
  b 2
b i 2
i l 1
l l 1
l   1
  i 1
i s 1
s   1
i g 1

The word pair counts should be as follows:

the bill 1
bill is 1
is big 1

Please note I did these manually so let me know if any errors are present in the counts!
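
The sample counts can be checked with a short Perl sketch like the one below. The helper names `count_char_pairs` and `count_word_pairs` and the nested-hash layout (`$counts->{left}{right}`) are hypothetical, and the input is assumed to be already lower-cased with single spaces between words.

```perl
use strict;
use warnings;

# Hypothetical helper: count each ordered pair of adjacent characters.
sub count_char_pairs {
    my ($text) = @_;
    my %counts;
    for my $i ( 0 .. length($text) - 2 ) {
        $counts{ substr( $text, $i, 1 ) }{ substr( $text, $i + 1, 1 ) }++;
    }
    return \%counts;
}

# Hypothetical helper: count each ordered pair of adjacent words.
sub count_word_pairs {
    my ($text) = @_;
    my @words = split ' ', $text;
    my %counts;
    $counts{ $words[$_] }{ $words[ $_ + 1 ] }++ for 0 .. $#words - 1;
    return \%counts;
}

my $chars = count_char_pairs('the bill is big');
my $words = count_word_pairs('the bill is big');
print "(space) b: $chars->{' '}{b}\n";        # prints: (space) b: 2
print "b i: $chars->{b}{i}\n";                # prints: b i: 2
print "the bill: $words->{the}{bill}\n";      # prints: the bill: 1
```

For this sentence the sketch yields 12 distinct character pairs (14 pair tokens) and 3 word pairs, in agreement with the manual counts above.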

Other information

This is an individual assignment. Please work on your own. You are free to discuss the problem and coding issues with your colleagues, but in the end the program you submit should reflect your own understanding of Mutual Information and Perl programming.

Please use the turnin program to submit your program. Write just a single program with your umd login id as the name. (In my case I would call my program tpederse.pl)

Please note that the deadline will be enforced by automatic means. Any submissions after the deadline will not be received.

by: Ted Pedersen - tpederse@d.umn.edu