CS 8995 Corpus Based Natural Language Processing
Assignment 3 - Due Mon, Feb 26, 4 pm
This write-up may be revised in response to questions.
Date of last update: Thu Feb 15, 1pm
Objectives
To compare N-gram models of texts using cross entropy. This allows us to
determine how closely related two texts are, and possibly even to resolve
questions of authorship attribution.
Specification
Write a Perl program that will estimate N-gram models for two input texts,
and then compare those models using cross entropy. Probability estimates
should not be made via relative frequency (maximum likelihood
estimates) but rather via the Witten-Bell smoothing algorithm. If you are
interested in extra credit, you can implement the Good-Turing smoothing
algorithm in Perl. This is worth 5 extra credit points, so you could
earn up to 15 out of 10 on this assignment with the extra credit. There
are links to Good-Turing materials on the Sample Code page.
Regardless of whether you implement Witten-Bell or Good-Turing, you may want to
develop your N-gram model first simply using relative frequency counts and
then change the estimation method after that is working. (If you are
unable to implement a smoothing algorithm, you could earn 5 of 10 possible
points by submitting a program that uses relative frequency/maximum
likelihood estimates.)
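For example, a minimal sketch of the relative-frequency starting point for
bigrams might look like the following. The hash and subroutine names are
only illustrative, and the toy token list stands in for the output of
whatever tokenizer you write:

    use strict;
    use warnings;

    # toy token list standing in for the output of your tokenizer
    my @tokens = qw(the cat sat on the mat .);

    my (%bigram, %hist);
    for my $i (0 .. $#tokens - 1) {
        $hist{ $tokens[$i] }++;                     # w1 seen as a history
        $bigram{ "$tokens[$i] $tokens[$i+1]" }++;   # the bigram "w1 w2"
    }

    # maximum likelihood estimate: P(w2 | w1) = c(w1 w2) / c(w1 as history)
    sub mle_prob {
        my ($w1, $w2) = @_;
        return 0 unless $hist{$w1};
        return ( $bigram{"$w1 $w2"} || 0 ) / $hist{$w1};
    }

    printf "P(cat | the) = %.4f\n", mle_prob("the", "cat");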
Your program should accept different values of N. Typical values will be
1 (unigram), 2 (bigram), and 3 (trigram), but your program should not
impose any arbitrary limit on N. If you implement Good-Turing you may
restrict the possible values of N to 2 and 3. However, if you
implement Witten-Bell you should be able to handle any value of N.
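One common formulation of Witten-Bell reserves probability mass for unseen
continuations in proportion to the number of distinct word types observed
after each history. The sketch below handles an arbitrary N >= 2; the
variable names and the uniform fallback for unseen histories are my own
choices, not requirements of the assignment:

    use strict;
    use warnings;

    my $n = 3;                                      # model order (any N >= 2 here)
    my @tokens = qw(the cat sat on the mat . the cat ran .);

    my (%ngram, %history, %types_after);
    for my $i (0 .. $#tokens - $n + 1) {
        my @gram = @tokens[ $i .. $i + $n - 1 ];
        my $h    = join " ", @gram[ 0 .. $n - 2 ];  # the (N-1)-word history
        my $g    = join " ", @gram;
        $types_after{$h}++ unless $ngram{$g};       # first sighting of this continuation
        $ngram{$g}++;
        $history{$h}++;
    }

    my %vocab;
    $vocab{$_} = 1 for @tokens;
    my $V = keys %vocab;                            # vocabulary size

    # Witten-Bell estimate of P(w | h):
    #   seen n-gram:   c(h,w) / ( N(h) + T(h) )
    #   unseen n-gram: T(h) / ( Z(h) * ( N(h) + T(h) ) ),  where Z(h) = V - T(h)
    sub wb_prob {
        my ($h, $w) = @_;
        my $N = $history{$h}     || 0;
        my $T = $types_after{$h} || 0;
        return 1 / $V if $N == 0;                   # unseen history: uniform fallback (a design choice)
        my $c = $ngram{"$h $w"} || 0;
        return $c / ($N + $T) if $c > 0;
        my $Z = $V - $T;
        return $T / ( $Z * ($N + $T) );
    }

    printf "P(sat | the cat) = %.4f\n", wb_prob("the cat", "sat");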
Your program should output the cross entropy of the two texts to as many
digits of precision as you feel necessary.
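One common definition computes the cross entropy of the token stream of one
text under the model estimated from the other; whether you compute it in one
direction, in both, or average the two is a design choice worth noting in
your write up. A sketch, where the names in the usage comment refer back to
the earlier sketches and are hypothetical:

    use strict;
    use warnings;

    # Cross entropy of the tokens of text A under a model built from text B:
    #   H = - (1/M) * sum over positions i of log2 P_B( w_i | w_{i-N+1} ... w_{i-1} )
    sub cross_entropy {
        my ($tokens, $n, $prob) = @_;            # array ref of tokens, model order, code ref P(w|h)
        my ($sum, $m) = (0, 0);
        for my $i ( $n - 1 .. $#{$tokens} ) {
            my $h = join " ", @{$tokens}[ $i - $n + 1 .. $i - 1 ];
            my $p = $prob->( $h, $tokens->[$i] );
            next if $p <= 0;                     # guard; smoothing should prevent this
            $sum += log($p) / log(2);            # Perl's log is natural, so convert to base 2
            $m++;
        }
        return 0 unless $m;
        return -$sum / $m;
    }

    # hypothetical usage:
    # printf "%.4f\n", cross_entropy( \@tokens_of_text1, 3, \&wb_prob_built_from_text2 );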
Your program should also output the frequency and total probability mass
associated with events that occur a given number of times. In other
words, show how many events occurred with a particular frequency, and
how much probability mass all events with that frequency count receive
based on your smoothing algorithm.
For example:
0.0075
0 1000 .50
1 70 .25
2 30 .10
3 30 .10
4 30 .10
5 30 .10
0 1000 .50
1 75 .30
2 50 .11
3 25 .11
4 20 .09
5 20 .09
This output "says" that the cross entropy between the two texts was
0.0075. In text 1, there were 1000 unobserved events (events that occurred
zero times in the data) and that they had a combined probability mass of
.50. There were 70 events that occurred 1 time and they had a combined
probability mass of .25, and so on. The same information is provided for
text 2. Again, you may display your results to as many degrees of
precision as you feel is necessary. The example data above is
intended only to show you the formatting requirements and should
not be interpreted any other way.
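One way the table might be gathered is sketched below, assuming an n-gram
count hash and a smoothed probability function like those in the earlier
sketches (both names are hypothetical). The row for frequency 0 and the way
probability mass is totalled (per history or over the whole model) depend on
your formulation and should be explained in your write up:

    use strict;
    use warnings;

    # Print one "frequency  number-of-events  total-probability-mass" line per
    # observed frequency r.  The r = 0 row (unobserved events) must be added
    # separately, since unseen n-grams are by definition not keys of %$ngram.
    sub print_freq_table {
        my ($ngram, $prob) = @_;                 # hash ref of n-gram counts, code ref P(w|h)
        my (%freq_of_freq, %mass);
        for my $g (keys %$ngram) {
            my @parts = split / /, $g;
            my $w = pop @parts;                  # final word of the n-gram
            my $h = join " ", @parts;            # its (N-1)-word history
            $freq_of_freq{ $ngram->{$g} }++;
            $mass{ $ngram->{$g} } += $prob->($h, $w);
        }
        for my $r (sort { $a <=> $b } keys %freq_of_freq) {
            printf "%d %d %.4f\n", $r, $freq_of_freq{$r}, $mass{$r};
        }
    }

    # hypothetical usage, with names from the earlier sketch:
    # print_freq_table( \%ngram, \&wb_prob );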
You should assume that the input texts are relatively large, containing
at a minimum 100,000 word tokens. Make certain that your program will
complete in a reasonable amount of time. (To some extent this depends on
the value of N so no strict guideline is given here.)
You should treat your text according to the following guidelines (a sketch
of one possible tokenizer follows the list):
- Ignore case; treat all text as either upper or lower case.
- DO NOT IGNORE PUNCTUATION. Treat each punctuation mark as a word
token.
- Treat all numeric values as tokens. Do not worry about dealing with
punctuation embedded in numeric values; in a case such as 9,000.00 it is
fine to treat it as five tokens ('9' ',' '000' '.' '00').
- Only allow a single space between words in the text.
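A minimal tokenizer along these lines is sketched below; the regular
expressions are one possible reading of the guidelines, not a required
implementation:

    use strict;
    use warnings;

    # Read a file and return its word tokens: lowercased, with every punctuation
    # mark split off as its own token and runs of whitespace collapsed.
    sub tokenize_file {
        my ($file) = @_;
        open my $fh, '<', $file or die "cannot open $file: $!";
        my @tokens;
        while ( my $line = <$fh> ) {
            $line = lc $line;                    # ignore case
            $line =~ s/([^\w\s])/ $1 /g;         # each punctuation mark becomes its own token
            push @tokens, split ' ', $line;      # split ' ' collapses any whitespace
        }
        close $fh;
        return @tokens;                          # e.g. "9,000.00" becomes  9 , 000 . 00
    }

    # hypothetical usage:  my @tokens = tokenize_file("text1");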
In addition to submitting a program, you must also submit a write up as
described below. Programs submitted without this write up will not be
graded.
Write up
Describe the results of the following experiment in a comment block at the
beginning of your program.
Select 3 texts from the Project Gutenberg archive. Two of the texts
should be by the same author, and the third should be by an author who
is relatively distinct in style, genre, and/or era. You may
choose any texts you wish, as long as the language is English and the
number of word tokens in each text is more than 100,000.
Suppose that the three texts are:
- text1, by Author#1
- text2, by Author#1
- text3, by Author#2
(In your write up please clearly identify the author, the work, and the
Project Gutenberg file name.)
Run your program as follows and report the results for each:
- userid.pl 3 text1 text1
- userid.pl 3 text2 text2
- userid.pl 3 text3 text3
- userid.pl 3 text1 text2
- userid.pl 3 text2 text3
- userid.pl 3 text1 text3
What conclusions do you draw from your results regarding the effectiveness
of cross entropy and n-grams as a tool for performing authorship
identification? Use your results to support these conclusions. If you wish
to perform additional experiments to support your conclusions, that is
strongly encouraged. (The above represents the minimum required.)
Other information
Please use turnin to submit this program. Remember that you can only use
turnin from hh33812. No email submission is necessary. The proper turnin
command is:
turnin -c cs8995 -p a3 userid.pl
This is an individual assignment. Please work on your own. You are
free to discuss the problem and coding issues with your colleagues,
but in the end the program you submit should reflect your own
understanding of N-gram models, cross-entropy and Perl programming.
Please note that the deadline will be enforced by automatic means. Any
submissions after the deadline will not be accepted.
by:
Ted Pedersen
- tpederse@d.umn.edu