CS 8761 Natural Language Processing - Fall 2002
Assignment 3 - Due Friday Nov 1 noon
This may be revised in response to your questions. Last update Thu
Oct 24 11:00 am.
Objectives
To explore supervised approaches to word sense disambiguation. You will
create sense-tagged text and see what can be learned from it.
Specification
Sense-tag some text and implement a Naive Bayesian classifier to perform
word sense disambiguation.
I. Data Creation (will be credited to Assignment 4, but still due by
Nov 1)
Go to Open Mind Word Expert and create a login id. Register for project
CS8761-UMD. Tag 500
instances/sentences. You may tag whichever words you wish. Please make
sure that your login id is being credited with the tagging you do. It is
possible to end up tagging on behalf of an "anonymous" login id, in which
case there is no way to assign credit to your efforts.
II. Naive Bayesian Classifier Implementation
Implement a Naive Bayesian Classifier to perform word sense
disambiguation. Perform your experiments with the "line" data.
You will find six files in the line tar file: cord2, division2,
formation2, phone2, product2, and text2. Each file corresponds
to instances of line used in the sense implied by the file name.
Each instance contains one occurrence of line in a sentence, and then a
few surrounding sentences. There is one instance per line in each file,
and there is a unique identifier that precedes the actual text. There
are a few sentences where line occurs multiple times. In those cases I
think it is safe to assume that both occurrences of line are used in
the designated sense. For simplicity I would assume that the first
occurrence of line is the target word. Also, note that line may occur
in a variety of forms: Line, Lines, line, lines, and maybe even others.
In the "line" data there is one instance per line of data. Each instance
consists of a sentence in which "line" has been used in the sense
designated by the file name, along with two or three surrounding
sentences. Each instance begins with a unique identifying code known as
the instance id. You should retain the instance id throughout
processing. You will need to know which instances are which after they
are randomized and divided into training and test data. These instance
ids should not be used as features.
There are a number of different programs that you will need to write.
Please follow the specifications described below. If there is any
variation between what I described in lecture and what appears on this
page, please follow the specifications here.
select.pl A target.config list_of_files
This program should output two files, TRAIN and TEST, such that
TRAIN consists of A per-cent of the instances, and TEST consists of the
remaining (100-A) per-cent. Thus, A should be an integer between 0 and 100. Make
sure the order of the instances in these files is randomized before you
divide them into TEST and TRAIN files.
target.config is a file that specifies the format of the target word to
be disambiguated and the format of the "instance ids" that uniquely
identify each instance of the target word. In this assignment the target
word is "line". However, there are a variety of forms of "line" that
appear in the data so you will need to have a single Perl regex that
identifies all of them. You must also have a second regular expression (on
the second line of target.config) that identifies the valid forms of the
instance id.
For example, if your program were only to accept "line" or "lines" as
valid target words, and if the instance id format was an integer followed
by XXX, the regexes found in target.config might be:
/lines?/
/\d+XXX/
Please note that these regexes are not the best solutions since (among
other deficiencies) they will identify "splines" as a target word and do
not reflect the fact that instance ids are at the start of a line.
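For instance, adding word boundaries and a start-of-line anchor would
address both criticisms (still assuming the hypothetical integer-plus-XXX
id format; the id format in the real data will differ):

/\b[Ll]ines?\b/
/^\d+XXX/

The \b boundaries rule out matches like "splines", [Ll] admits the
capitalized forms noted earlier, and ^ ties the instance id to the start
of the line.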
The two regular expressions in target.config should be copied to the top
two lines of the TRAIN and TEST files you create. This is so they can
be used by the other programs that follow to identify the target word
and the instance ids.
As mentioned above, when you unpack the "line" data you will find there
are 6 files. These should be what you specify as your list of files,
although your code should work for any number of files. For example,
you might want to test your program only using two of the input files.
Each of these files contains instances tagged with a particular sense
of "line". For example, cord2 is a file that contains instances all of
which are tagged with the cord sense of "line". However, note that
there is no explicit tag in the data, this is provided implicitly by the
file name. So before you randomize the order of these files, make sure to
provide each instance with a sense tag.
The following is how you should run select.pl to create a 70-30 split of
the training and test data using all of the line data:
select.pl 70 target.config cord2 division2 formation2 phone2 product2 text2
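To make the expected behavior concrete, here is a minimal sketch of one
way select.pl might be written. It assumes the sense tag (taken from the
file name) is simply prepended to each instance line, and it uses shuffle
from the standard List::Util module for randomization; treat it as a
starting point rather than a definitive implementation.

#!/usr/bin/perl
# select.pl -- sketch of the TRAIN/TEST splitter described above.
# Assumes one instance per line in each input file, with the file
# name (e.g. cord2) serving as the sense tag.
use strict;
use warnings;
use List::Util qw(shuffle);

my ($a, $config, @files) = @ARGV;
die "Usage: select.pl A target.config file1 [file2 ...]\n"
    unless defined $a && $a =~ /^\d+$/ && $a <= 100 && @files;

# Read the two regexes (target word, instance id) from target.config.
open my $cfg, '<', $config or die "Cannot open $config: $!";
my @regexes = <$cfg>;
close $cfg;

# Collect every instance, prepending the file name as its sense tag.
my @instances;
for my $file (@files) {
    open my $in, '<', $file or die "Cannot open $file: $!";
    while (my $line = <$in>) {
        chomp $line;
        push @instances, "$file $line" if $line =~ /\S/;
    }
    close $in;
}

# Randomize the order, then split A% / (100-A)%.
@instances = shuffle @instances;
my $cut = int(@instances * $a / 100);
my %split = (TRAIN => [ @instances[0 .. $cut - 1] ],
             TEST  => [ @instances[$cut .. $#instances] ]);

for my $name (qw(TRAIN TEST)) {
    open my $out, '>', $name or die "Cannot open $name: $!";
    print $out @regexes;               # copy the regexes to the top
    print $out "$_\n" for @{ $split{$name} };
    close $out;
}

Note that the sense tag here precedes the instance id already present at
the front of each instance; any consistent placement is fine so long as
your later programs agree on it.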
feat.pl TRAIN W F > FEAT
This program should identify all of the word types that occur within W
positions to the left or right of the target word, and that occur *MORE
THAN* F times in the TRAIN data set. These are the features that you will
use to represent the training and test instances. Do not include the
target word (line) as a feature. It should output a list of types/features
to standard output. Frequency information is not required, just a
list of the types. Also remember that you should only create features from
your TRAIN data.
If W "goes off the end" of an instance (e.g., if W is 20 and there are
only 10 words to the right of the target word) do not go on to the next
instance to get features. Each instance is independent. Simply stop at
the end of the instance. Please note that an instance consists of the
sentence containing the target word plus two or three other sentences.
Also recall that each instance is on a single line in the data file.
A window size of 0 is valid. This means that there are no features
associated with the instance, only its sense tag. In this case feat.pl
will produce essentially nothing; however, this is a valid situation. It
could also occur if you set your frequency cutoff very high.
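A sketch of feat.pl along the same lines, assuming the TRAIN format
produced by the select.pl sketch above (two regex lines, then "sense-tag
instance-id text ..." on each line, with the instance id a single
whitespace-delimited token):

#!/usr/bin/perl
# feat.pl -- sketch of feature selection with window W and cutoff F.
use strict;
use warnings;

my ($train, $w, $f) = @ARGV;
die "Usage: feat.pl TRAIN W F\n" unless defined $f;

open my $in, '<', $train or die "Cannot open $train: $!";
chomp(my $target_re = <$in>);     # line 1: target word regex
chomp(my $id_re     = <$in>);     # line 2: instance id regex (unused here)
$target_re =~ s{^/|/$}{}g;        # strip the surrounding slashes

my %count;
while (my $line = <$in>) {
    chomp $line;
    my ($sense, $id, @words) = split ' ', $line;
    # Position of the first occurrence of the target word.
    my ($t) = grep { $words[$_] =~ /^$target_re$/ } 0 .. $#words;
    next unless defined $t;
    my $lo = $t - $w < 0       ? 0       : $t - $w;
    my $hi = $t + $w > $#words ? $#words : $t + $w;
    for my $i ($lo .. $hi) {
        # Never count the target word itself as a feature.
        next if $words[$i] =~ /^$target_re$/;
        $count{ $words[$i] }++;
    }
}
close $in;

# Emit only those types seen MORE THAN F times in TRAIN.
print "$_\n" for grep { $count{$_} > $f } sort keys %count;

The window never crosses instance boundaries because each instance
occupies a single line and is processed independently.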
cat FEAT | convert.pl W file > file.FV
Convert the input file (either TRAIN or TEST) to a feature vector
representation, where the features (FEAT) are read from standard
input. Each instance in file is converted into a series of binary
values that indicate whether or not each type listed in FEAT has
occurred within the specified window around the target word in the
given instance. The instance id and the sense tag (when file is TRAIN)
should precede the feature vector for each instance.
Remember to convert both your TRAIN and TEST data with this program
using the features found in the TRAIN data. Under no circumstances
should you derive features from the TEST data!
If FEAT is empty then convert.pl should output instance ids followed by
the correct sense tag (when file is TRAIN).
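Here is a matching sketch of convert.pl. One caveat: because nb.pl must
print each test instance's actual sense tag and an overall accuracy, this
sketch keeps the sense tag in TEST.FV as well as in TRAIN.FV; if you omit
it from TEST.FV, nb.pl will need some other way to learn the true tags.

#!/usr/bin/perl
# convert.pl -- sketch of feature-vector conversion.
# Usage: cat FEAT | convert.pl W file > file.FV
use strict;
use warnings;

my ($w, $file) = @ARGV;
die "Usage: cat FEAT | convert.pl W file\n" unless defined $file;

chomp(my @feats = <STDIN>);       # feature list from standard input

open my $in, '<', $file or die "Cannot open $file: $!";
chomp(my $target_re = <$in>);     # line 1: target word regex
my $id_line = <$in>;              # line 2: instance id regex (skipped)
$target_re =~ s{^/|/$}{}g;

while (my $line = <$in>) {
    chomp $line;
    my ($sense, $id, @words) = split ' ', $line;
    # Mark the types that fall inside the window around the target.
    my %in_window;
    my ($t) = grep { $words[$_] =~ /^$target_re$/ } 0 .. $#words;
    if (defined $t) {
        my $lo = $t - $w < 0       ? 0       : $t - $w;
        my $hi = $t + $w > $#words ? $#words : $t + $w;
        $in_window{ $words[$_] } = 1 for $lo .. $hi;
    }
    # Instance id and sense tag precede the binary feature vector.
    print join(' ', $id, $sense,
               map { $in_window{$_} ? 1 : 0 } @feats), "\n";
}
close $in;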
nb.pl TRAIN.FV TEST.FV > TAGGING
This program will learn a Naive Bayesian classifier from TRAIN.FV and
use that classifier to assign sense tags to TEST.FV. For each instance in
TEST.FV, your program should output the instance-id, the sense-tag as
assigned by the classifier and its associated probability according to
your classifier, and the actual sense-tag. On
the last line the program should output the accuracy of the sense tagging
(this is simply the number of instances in the test data that had their
sense correctly assigned, divided by the total number of instances). This
should be output to standard output.
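As a sketch of the underlying model, with f_1, ..., f_k the binary
features of a test instance, the classifier assigns

    \hat{s} = \arg\max_{s} P(s) \prod_{j=1}^{k} P(f_j \mid s)

and a natural choice for the reported probability is this maximized score
divided by the sum of the same score over all senses.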
Your program will inevitably come across parameters that it cannot
estimate from the training data. Please use Witten-Bell
Smoothing to provide values for unobserved events. Make sure that you
retain a valid probability distribution while you are smoothing!
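For reference, one common formulation of Witten-Bell smoothing, with N
the number of training observations for the distribution being estimated,
T the number of distinct event types actually observed, and Z the number
of event types never observed:

    P(x) = \begin{cases} \dfrac{c(x)}{N+T} & \text{if } c(x) > 0 \\[1ex]
                         \dfrac{T}{Z(N+T)} & \text{if } c(x) = 0 \end{cases}

The seen events then carry total mass N/(N+T), the unseen events share
the held-out T/(N+T), and the estimates still sum to one.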
SANITY CHECK: if there are no features and TRAIN.FV consists of a list of
instance ids and sense tags, and TEST.FV only consists of a list of
instance ids, then your classifier should revert to a "most common sense"
baseline, where it assigns every instance in TEST.FV the most frequent
sense as used in TRAIN.FV.
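Putting the pieces together, here is a minimal sketch of nb.pl under the
same assumptions as the earlier sketches (TEST.FV retains the true sense
tag; features are binary, so the event types per parameter are just the
values 0 and 1):

#!/usr/bin/perl
# nb.pl -- sketch of a Naive Bayesian classifier with Witten-Bell
# smoothing. Usage: nb.pl TRAIN.FV TEST.FV > TAGGING
use strict;
use warnings;

my ($trainfv, $testfv) = @ARGV;
die "Usage: nb.pl TRAIN.FV TEST.FV\n" unless defined $testfv;

# --- training: count feature values per sense --------------------------
my (%sense_count, %value_count, $total);
open my $in, '<', $trainfv or die "Cannot open $trainfv: $!";
while (<$in>) {
    my ($id, $sense, @v) = split;
    $sense_count{$sense}++;
    $total++;
    $value_count{$sense}[$_]{ $v[$_] }++ for 0 .. $#v;
}
close $in;

# Witten-Bell estimate of P(value | sense) for feature slot $j.
sub wb_prob {
    my ($sense, $j, $value) = @_;
    my $seen = $value_count{$sense}[$j] || {};
    my $n = $sense_count{$sense};   # observations for this sense
    my $t = keys %$seen;            # distinct values seen (1 or 2)
    my $z = 2 - $t;                 # unseen values of a binary feature
    if (exists $seen->{$value}) {
        # Hold out mass only when some value was left unseen.
        return $z ? $seen->{$value} / ($n + $t) : $seen->{$value} / $n;
    }
    return $t / ($z * ($n + $t));   # equal share of the held-out mass
}

# --- testing: assign the most probable sense ----------------------------
my ($correct, $tested) = (0, 0);
open $in, '<', $testfv or die "Cannot open $testfv: $!";
while (<$in>) {
    my ($id, $true, @v) = split;
    my ($best, $best_p, $sum) = ('', 0, 0);
    for my $s (keys %sense_count) {
        # P(s) * product of P(f_j | s); switch to summing logs if you
        # have enough features for the product to underflow.
        my $p = $sense_count{$s} / $total;
        $p *= wb_prob($s, $_, $v[$_]) for 0 .. $#v;
        $sum += $p;
        ($best, $best_p) = ($s, $p) if $p > $best_p;
    }
    my $prob = $sum ? $best_p / $sum : 0;   # normalize over senses
    printf "%s %s %.4f %s\n", $id, $best, $prob, $true;
    $correct++ if $best eq $true;
    $tested++;
}
close $in;
printf "accuracy: %.4f\n", $tested ? $correct / $tested : 0;

Note how the empty-feature case reduces to the prior alone, which is
exactly the "most common sense" baseline of the sanity check above.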
Output
There is no output to turn in from your Open Mind sense tagging. However,
remember that your tagging is being collected by the Open Mind Project
and will be downloaded and used in our class.
Your programs should produce output as described above. Please make sure
to follow the naming conventions and formats exactly. There is no output
to turn in beyond what is requested below.
Experiments and Report
Carry out the following experiments and summarize your findings in a
report named "experiments.txt".
Experiment with window sizes of 0, 2, 10 and 25. Use frequency cutoffs of
1, 2, and 5. Run your classifiers with all 12 possible combinations of
window size and frequency cutoff using a 70-30 training-test data
ratio.
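One convenient way to drive all twelve runs over a fixed split (a
hypothetical helper script, not something the assignment asks you to
submit):

#!/usr/bin/perl
# runexp.pl -- hypothetical driver for the 12 window/cutoff runs.
use strict;
use warnings;

for my $w (0, 2, 10, 25) {
    for my $f (1, 2, 5) {
        system("perl feat.pl TRAIN $w $f > FEAT") == 0
            or die "feat.pl failed";
        system("cat FEAT | perl convert.pl $w TRAIN > TRAIN.FV") == 0
            or die "convert.pl failed on TRAIN";
        system("cat FEAT | perl convert.pl $w TEST > TEST.FV") == 0
            or die "convert.pl failed on TEST";
        system("perl nb.pl TRAIN.FV TEST.FV > TAGGING.$w.$f") == 0
            or die "nb.pl failed";
        print "window $w, cutoff $f: see TAGGING.$w.$f\n";
    }
}

The accuracy for each combination is then the last line of the
corresponding TAGGING file.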
Report the accuracy values that you obtain for each combination in a
table that looks something like this:
window size | frequency cutoff | accuracy
------------|------------------|---------
     0      |        1         |  .XXXX
     0      |        2         |  .XXXX
    etc.
What effect do you observe in overall accuracy as the window size and
frequency cutoffs change? Are there any combinations of window size and
frequency cutoff that appear to be optimal with respect to the others? Why?
Submission Guidelines
Submit all of your program files and your report. Make sure to submit
your target.config file as well. All should be plain text. Make sure
your name, date, and class information are contained in each file
(except for target.config!), and that your source code files are
carefully commented.
Place all of these files into a directory that is named with your umd user
id. In my case the directory would be called tpederse, for example. Then
create a tar file that includes this directory and the files you will
submit. Compress that tar file and submit it via the web drop from the
class home page. Please note that the deadline will be enforced by
automatic means. Any submissions after the deadline will not be graded.
The web drop has a limit of 10 MB, so your files should be plain text.
This is an individual assignment. You must write *all* of your code on
your own. Do not get code from your colleagues, the Internet, etc.
Please do not discuss your interpretations of these results amongst
yourselves. This is meant to make you think for yourself and arrive at
your own conclusions.
by:
Ted Pedersen
- tpederse@umn.edu