Ted Pedersen -
CS 8995 Corpus Based Natural Language Processing
Assignment 1 - Due Wed, January 31, 4 pm
Please note that I may revise this based on your feedback
and questions. Last update Mon Jan 29 3:00 pm
Objectives
To measure the mutual information of character pairs and word pairs that
occur in text.
Specification
Write a Perl program that will calculate the mutual information for:
- Two sequential characters (represented by random variables A and B
in the description below),
and
- Two sequential words (represented by random variables X and Y in
the description below).
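For reference, the quantity usually meant by mutual information can be written as below. This is the standard textbook definition, not a formula taken from this assignment: it assumes that probabilities are estimated as relative frequencies of the observed pairs and that logarithms are base 2. The same form applies to I(X;Y) with words in place of characters.

```latex
I(A;B) = \sum_{a} \sum_{b} P(A=a, B=b) \, \log_2 \frac{P(A=a, B=b)}{P(A=a) \, P(B=b)}
```

Under the same assumptions, the value associated with a single pair (a, b) would be the term $\log_2 \frac{P(A=a, B=b)}{P(A=a)\,P(B=b)}$. Whether the per-pair value should instead be weighted by P(A=a, B=b) is not stated here.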
Your program should display the values of I(A;B) and I(X;Y) as well as
the top N character pairs and word pairs and their associated mutual
information values. Please note that the pairs of random variables
(A;B) and (X;Y) each represent an ordered two-character or two-word
sequence. In other words, A refers to the character on the left side of
the character pair and B to the character on the right side. Likewise, X
refers to the word on the left side and Y to the word on the right side.
For example, suppose we are computing I(A;B) for the following:
he says
P(A=h, B=e) ≠ P(A=e, B=h) since the first refers to the pair 'he' and
the second refers to the pair 'eh' (which is not observed above).
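As a rough illustration of how I(A;B) might be computed from ordered pair counts, here is a minimal sketch. It is written in Python purely for illustration (the assignment itself requires Perl), and it rests on assumptions that are not stated in the specification: probabilities are relative frequencies of the observed pairs, the marginals for A and B come from the left and right positions of those pairs, logarithms are base 2, and the per-pair score is the unweighted term log2 of P(a,b) / (P(a) P(b)). The function name `mutual_information` is hypothetical.

```python
import math
from collections import Counter


def mutual_information(pair_counts):
    """Return (total_mi, scores) for a mapping of ordered pairs to counts.

    Assumptions (not from the assignment): relative-frequency estimates,
    base-2 logarithms, and per-pair scores equal to the unweighted log ratio.
    """
    total = sum(pair_counts.values())

    # Marginal counts for the left (A or X) and right (B or Y) positions.
    left_counts = Counter()
    right_counts = Counter()
    for (left, right), count in pair_counts.items():
        left_counts[left] += count
        right_counts[right] += count

    total_mi = 0.0
    scores = {}
    for (left, right), count in pair_counts.items():
        p_joint = count / total
        p_left = left_counts[left] / total
        p_right = right_counts[right] / total
        score = math.log2(p_joint / (p_left * p_right))
        scores[(left, right)] = score
        total_mi += p_joint * score  # weight each term by its joint probability
    return total_mi, scores


# Pairs are keyed in reading order, so ('h', 'e') and ('e', 'h') are distinct.
counts = Counter({("h", "e"): 1, ("e", "s"): 1})
mi, scores = mutual_information(counts)
```

Keying the counts by ordered tuples keeps 'he' and 'eh' apart without any extra bookkeeping.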
Your program should be able to handle any value of N and any number of
text files. As a practical matter, you may assume that the total number
of word tokens in my test cases will be less than 10 million and that
the value of N will be between 1 and 100.
Your program should accept command line arguments ordered as follows:
userid.pl N file1 file2 ... filez
For example:
tpederse.pl 3 README robinson.txt
You may assume that the command line arguments are valid. There is no need
to include error checking for things like invalid values of N (e.g., N < 0)
or non-existent files. Please note that I will test your program
automatically, so if you do not follow the command line format, your
program will fail and receive no credit.
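The argument order described above can be sketched as follows. This is an illustration in Python rather than the Perl the assignment requires, and the helper name `parse_args` is hypothetical. In line with the statement that arguments may be assumed valid, it does no error checking.

```python
def parse_args(argv):
    """Split an argument vector into (N, list of file names).

    argv[0] is the program name, argv[1] is N, and the rest are files.
    In a real script this would be called with the interpreter's own
    argument list; no validation is done because inputs are assumed valid.
    """
    n = int(argv[1])
    files = list(argv[2:])
    return n, files


# Mirrors the example invocation: tpederse.pl 3 README robinson.txt
n, files = parse_args(["tpederse.pl", "3", "README", "robinson.txt"])
```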
In the event of ties among the top N values, you should treat all tied
values as one rank. If N=2 and the top 5 values are:
fine wine 0.99999
great trip 0.99988
big dog 0.99988
new york 0.99988
big time 0.97732
You would display:
fine wine 0.99999
great trip 0.99988
big dog 0.99988
new york 0.99988
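One way to read the tie rule is "keep every pair whose value is among the N highest distinct values". The sketch below follows that reading. It is in Python for illustration only (the assignment requires Perl), the name `top_n_with_ties` is hypothetical, and comparing values after rounding to 5 decimal places is an assumption made here so that ties match what would be printed.

```python
def top_n_with_ties(scored_pairs, n, digits=5):
    """Return all (label, value) items whose value is in the top n ranks.

    Items with equal values (after rounding to `digits` places, an
    assumption) share one rank, so more than n items may be returned.
    """
    # sorted() is stable, so tied items keep their original relative order.
    ranked = sorted(scored_pairs, key=lambda item: -round(item[1], digits))

    result = []
    ranks_seen = 0
    previous = None
    for label, value in ranked:
        key = round(value, digits)
        if key != previous:
            ranks_seen += 1
            if ranks_seen > n:
                break
            previous = key
        result.append((label, value))
    return result


pairs = [
    ("fine wine", 0.99999),
    ("great trip", 0.99988),
    ("big dog", 0.99988),
    ("new york", 0.99988),
    ("big time", 0.97732),
]
top = top_n_with_ties(pairs, 2)
```

With N=2 this keeps the first four pairs and drops 'big time', matching the example above.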
Output Format
Please follow these output format guidelines carefully, since your
program will be graded automatically. Mutual information values should
be displayed with 5 digits after the decimal point. Display I(A;B) and
the top N character pairs (with their associated mutual information
values) first, and then I(X;Y) and the top N word pairs. Your output
should look something like this (the values are only meant to
illustrate proper formatting). Suppose N=3.
14.33333
a e 1.98134
i e 1.97777
o u 1.97011
15.12345
big dog 1.34567
mad hatter 1.29999
for the 1.12030
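The layout above (one line for the overall value, then one line per pair, each with five digits after the decimal point) could be produced along these lines. The sketch is in Python for illustration only, since the assignment requires Perl, and `format_report` is a hypothetical helper. How a space character should be printed inside a character pair is not specified above, so the sketch does not address it.

```python
def format_report(total_mi, top_pairs):
    """Build output lines: the overall value first, then one line per pair.

    top_pairs is a list of ((left, right), value) items, already ranked.
    Every number is formatted with exactly 5 digits after the decimal point.
    """
    lines = [f"{total_mi:.5f}"]
    for (left, right), value in top_pairs:
        lines.append(f"{left} {right} {value:.5f}")
    return lines


report = format_report(
    15.12345,
    [(("big", "dog"), 1.34567), (("mad", "hatter"), 1.29999), (("for", "the"), 1.1203)],
)
```

Fixed-point formatting pads with trailing zeros, so 1.1203 is rendered as 1.12030, as in the sample.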
You should treat your text according to the following guidelines:
- Ignore case. Treat all letters as upper or lower case. (I have shown
lower case in the output above, but if for some reason you prefer upper
case that is fine.)
- Ignore all characters except alphabetic (a-z) and space. This leaves
you with 27 valid characters. All other characters (numeric and
punctuation) should be discarded.
- Only allow a single space between words.
- Treat words that are separated by the end of a line as if they
were separated by a space. So given
my
name
you should have character pairs y(space) and (space)n but NOT yn.
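The guidelines above can be sketched as a small normalization step. This is Python for illustration only (the assignment requires Perl), and `normalize` is a hypothetical name. Dropping leading and trailing spaces is an assumption, since the guidelines do not mention them.

```python
import re


def normalize(text):
    """Reduce text to lower-case letters a-z separated by single spaces."""
    # Line ends (and any other whitespace) count as a space between words.
    text = re.sub(r"\s", " ", text)
    # Discard everything that is not an ASCII letter or a space.
    text = re.sub(r"[^A-Za-z ]", "", text)
    # Ignore case.
    text = text.lower()
    # Allow only a single space between words.
    text = re.sub(r" +", " ", text)
    # Assumption: leading and trailing spaces are not counted.
    return text.strip(" ")


cleaned = normalize("my\nname")
```

Here `cleaned` is "my name", so the character pairs include y(space) and (space)n but not yn.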
Sample character and word pair counts
Suppose the following sentence is being counted:
the bill is big
The character pair counts should be as follows:
t h 1
h e 1
e (space) 1
(space) b 2
b i 2
i l 1
l l 1
l (space) 1
(space) i 1
i s 1
s (space) 1
i g 1
The word pair counts should be as follows:
the bill 1
bill is 1
is big 1
Please note I did these manually so let me know if any errors are present
in the counts!
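The sample counts can be cross-checked with a short sketch. It is in Python for illustration only (the assignment requires Perl), `count_pairs` is a hypothetical name, and it assumes the text has already been reduced to lower-case letters and single spaces.

```python
from collections import Counter


def count_pairs(text):
    """Count ordered adjacent character pairs and word pairs in clean text."""
    # Each character is paired with the one immediately to its right.
    char_pairs = Counter(zip(text, text[1:]))
    # Each word is paired with the word immediately to its right.
    words = text.split(" ")
    word_pairs = Counter(zip(words, words[1:]))
    return char_pairs, word_pairs


char_pairs, word_pairs = count_pairs("the bill is big")
```

For this sentence the sketch finds 12 distinct character pairs covering 14 occurrences, with (space)b and bi each counted twice, and three word pairs counted once each.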
Other information
This is an individual assignment. Please work on your own. You are
free to discuss the problem and coding issues with your colleagues,
but in the end the program you submit should reflect your own
understanding of Mutual Information and Perl programming.
Please use the turnin program to submit your program. Write just a single
program with your umd login id as the name. (In my case, I would call my
program tpederse.pl.)
Please note that the deadline will be enforced by automatic means. Any
submissions after the deadline will not be received.
by: Ted Pedersen - tpederse@d.umn.edu