CS 8761 Natural Language Processing - Fall 2002
Assignment 3 - Due Friday Nov 1 noon
This may be revised in response to your questions. Last update Thu
Oct 24 11:00 am.
Objectives
To explore supervised approaches to word sense disambiguation. You will
create sense-tagged text and see what can be learned from it.
Specification
Sense-tag some text and implement a Naive Bayesian classifier to perform
word sense disambiguation.
I. Data Creation (will be credited to Assignment 4, but still due by
Nov 1)
Go to Open Mind Word Expert and create a login id. Register for project
CS8761-UMD. Tag 500
instances/sentences. You may tag whichever words you wish. Please make
sure that your login id is being credited with the tagging you do. It is
possible to end up tagging on behalf of an "anonymous" login id, in which
case there is no way to assign credit to your efforts.
II. Naive Bayesian Classifier Implementation
Implement a Naive Bayesian Classifier to perform word sense
disambiguation. Perform your experiments with the "line" data.
You will find six files in the line tar file: cord2, division2,
formation2, phone2, product2, and text2. Each file corresponds
to instances of line used in the sense implied by the file name.
Each instance contains one occurrence of line in a sentence, and then a
few surrounding sentences. There is one instance per line in each file,
and there is a unique identifier that precedes the actual text. There
are a few sentences where line occurs multiple times. In those cases I
think it is safe to assume that both occurrences of line are used in
the designated sense. For simplicity I would assume that the first
occurrence of line is the target word. Also, note that line may occur
in a variety of forms: Line, Lines, line, lines, and maybe even others.
In the "line" data there is one instance per line of data. Each instance
consists of a sentence in which "line" has been used in the sense
designated by the file name, along with two or three surrounding
sentences. Each instance begins with a unique identifying code known as
the instance id. You should retain the instance id throughout
processing. You will need to know which instances are which after they
are randomized and divided into training and test data. These instance
ids should not be used as features.
There are a number of different programs that you will need to write.
Please follow the specifications described below. If there is any
variation between what I described in lecture and what appears on this
page, please follow the specifications here.
select.pl A target.config list_of_files
This program should output two files, TRAIN and TEST, such that
TRAIN consists of A per-cent of the instances, and TEST consists of the
remaining (100-A) per-cent. Thus, A should be an integer between 0 and 100. Make
sure the order of the instances in these files is randomized before you
divide them into TEST and TRAIN files.
target.config is a file that specifies the format of the target word to
be disambiguated and the format of the "instance ids" that uniquely
identify each instance of the target word. In this assignment the target
word is "line". However, there are a variety of forms of "line" that
appear in the data so you will need to have a single Perl regex that
identifies all of them. You must also have a second regular expression (on
the second line of target.config) that identifies the valid forms of the
instance id.
For example, if your program were only to accept "line" or "lines" as
valid target words, and if the instance id format was an integer followed
by XXX, the regexes found in target.config might be:
/lines?/
/\d+XXX/
Please note that these regexes are not the best solutions since (among
other deficiencies) they will identify "splines" as a target word and do
not reflect the fact that instance ids are at the start of a line.
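For instance, adding word boundaries and a start-of-line anchor would
address both criticisms (still assuming the hypothetical integer-plus-XXX
id format; the id format in the real data will differ):

/\b[Ll]ines?\b/
/^\d+XXX/

The \b boundaries rule out matches like "splines", [Ll] admits the
capitalized forms noted earlier, and ^ ties the instance id to the start
of the line.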
The two regular expressions in target.config should be copied to the top
two lines of the TRAIN and TEST files you create. This is so they can
be used by the other programs that follow to identify the target word
and the instance ids.
As mentioned above, when you unpack the "line" data you will find there
are 6 files. These should be what you specify as your list of files,
although your code should work for any number of files. For example,
you might want to test your program only using two of the input files.
Each of these files contains instances tagged with a particular sense
of "line". For example, cord2 is a file that contains instances all of
which are tagged with the cord sense of "line". However, note that
there is no explicit tag in the data, this is provided implicitly by the
file name. So before you randomize the order of these files, make sure to
provide each instance with a sense tag.
The following is how you should run select.pl to create a 70-30 split of
the training and test data using all of the line data:
select.pl 70 target.config cord2 division2 formation2 phone2 product2 text2
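To make the expected behavior concrete, here is a minimal sketch of one
way select.pl might be written. It assumes the sense tag (taken from the
file name) is simply prepended to each instance line, and it uses shuffle
from the standard List::Util module for randomization; treat it as a
starting point rather than a definitive implementation.

#!/usr/bin/perl
# select.pl -- sketch of the TRAIN/TEST splitter described above.
# Assumes one instance per line in each input file, with the file
# name (e.g. cord2) serving as the sense tag.
use strict;
use warnings;
use List::Util qw(shuffle);

my ($a, $config, @files) = @ARGV;
die "Usage: select.pl A target.config file1 [file2 ...]\n"
    unless defined $a && $a =~ /^\d+$/ && $a <= 100 && @files;

# Read the two regexes (target word, instance id) from target.config.
open my $cfg, '<', $config or die "Cannot open $config: $!";
my @regexes = <$cfg>;
close $cfg;

# Collect every instance, prepending the file name as its sense tag.
my @instances;
for my $file (@files) {
    open my $in, '<', $file or die "Cannot open $file: $!";
    while (my $line = <$in>) {
        chomp $line;
        push @instances, "$file $line" if $line =~ /\S/;
    }
    close $in;
}

# Randomize the order, then split A% / (100-A)%.
@instances = shuffle @instances;
my $cut = int(@instances * $a / 100);
my %split = (TRAIN => [ @instances[0 .. $cut - 1] ],
             TEST  => [ @instances[$cut .. $#instances] ]);

for my $name (qw(TRAIN TEST)) {
    open my $out, '>', $name or die "Cannot open $name: $!";
    print $out @regexes;               # copy the regexes to the top
    print $out "$_\n" for @{ $split{$name} };
    close $out;
}

Note that the sense tag here precedes the instance id already present at
the front of each instance; any consistent placement is fine so long as
your later programs agree on it.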
feat.pl TRAIN W F > FEAT
This program should identify all of the word types that occur within W
positions to the left or right of the target word, and that occur *MORE
THAN* F times in the TRAIN data set. These are the features that you will
use to represent the training and test instances. Do not include the
target word (line) as a feature. It should output a list of types/features
to standard output. Frequency information is not required, just a
list of the types. Also remember that you should only create features from
your TRAIN data.
If W "goes off the end" of an instance (e.g., if W is 20 and there are
only 10 words to the right of the target word) do not go on to the next
instance to get features. Each instance is independent. Simply stop at
the end of the instance. Please note that an instance consists of the
sentence containing the target word plus two or three other sentences.
Also recall that each instance is on a single line in the data file.
A window size of 0 is valid. This means that there are no features
associated with the instance, only its sense tag. In this case feat.pl
will produce essentially nothing; however, this is a valid situation. It
could also occur if you set your frequency cutoff very high.
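A sketch of feat.pl along the same lines, assuming the TRAIN format
produced by the select.pl sketch above (two regex lines, then "sense-tag
instance-id text ..." on each line, with the instance id a single
whitespace-delimited token):

#!/usr/bin/perl
# feat.pl -- sketch of feature selection with window W and cutoff F.
use strict;
use warnings;

my ($train, $w, $f) = @ARGV;
die "Usage: feat.pl TRAIN W F\n" unless defined $f;

open my $in, '<', $train or die "Cannot open $train: $!";
chomp(my $target_re = <$in>);     # line 1: target word regex
chomp(my $id_re     = <$in>);     # line 2: instance id regex (unused here)
$target_re =~ s{^/|/$}{}g;        # strip the surrounding slashes

my %count;
while (my $line = <$in>) {
    chomp $line;
    my ($sense, $id, @words) = split ' ', $line;
    # Position of the first occurrence of the target word.
    my ($t) = grep { $words[$_] =~ /^$target_re$/ } 0 .. $#words;
    next unless defined $t;
    my $lo = $t - $w < 0       ? 0       : $t - $w;
    my $hi = $t + $w > $#words ? $#words : $t + $w;
    for my $i ($lo .. $hi) {
        # Never count the target word itself as a feature.
        next if $words[$i] =~ /^$target_re$/;
        $count{ $words[$i] }++;
    }
}
close $in;

# Emit only those types seen MORE THAN F times in TRAIN.
print "$_\n" for grep { $count{$_} > $f } sort keys %count;

The window never crosses instance boundaries because each instance
occupies a single line and is processed independently.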
cat FEAT | convert.pl W file > file.FV
Convert the input file (either TRAIN or TEST) to a feature vector
representation, where the features (FEAT) are read from standard
input. Each instance in file is converted into a series of binary
values that indicate whether or not each type listed in FEAT has
occurred within the specified window around the target word in the
given instance. The instance id and the sense tag (when file is TRAIN)
should precede the feature vector for each instance.
Remember to convert both your TRAIN and TEST data with this program
using the features found in the TRAIN data. Under no circumstances
should you derive features from the TEST data!
If FEAT is empty then convert.pl should output instance ids followed by
the correct sense tag (when file is TRAIN).
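Here is a matching sketch of convert.pl. One caveat: because nb.pl must
print each test instance's actual sense tag and an overall accuracy, this
sketch keeps the sense tag in TEST.FV as well as in TRAIN.FV; if you omit
it from TEST.FV, nb.pl will need some other way to learn the true tags.

#!/usr/bin/perl
# convert.pl -- sketch of feature-vector conversion.
# Usage: cat FEAT | convert.pl W file > file.FV
use strict;
use warnings;

my ($w, $file) = @ARGV;
die "Usage: cat FEAT | convert.pl W file\n" unless defined $file;

chomp(my @feats = <STDIN>);       # feature list from standard input

open my $in, '<', $file or die "Cannot open $file: $!";
chomp(my $target_re = <$in>);     # line 1: target word regex
my $id_line = <$in>;              # line 2: instance id regex (skipped)
$target_re =~ s{^/|/$}{}g;

while (my $line = <$in>) {
    chomp $line;
    my ($sense, $id, @words) = split ' ', $line;
    # Mark the types that fall inside the window around the target.
    my %in_window;
    my ($t) = grep { $words[$_] =~ /^$target_re$/ } 0 .. $#words;
    if (defined $t) {
        my $lo = $t - $w < 0       ? 0       : $t - $w;
        my $hi = $t + $w > $#words ? $#words : $t + $w;
        $in_window{ $words[$_] } = 1 for $lo .. $hi;
    }
    # Instance id and sense tag precede the binary feature vector.
    print join(' ', $id, $sense,
               map { $in_window{$_} ? 1 : 0 } @feats), "\n";
}
close $in;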
nb.pl TRAIN.FV TEST.FV > TAGGING
This program will learn a Naive Bayesian classifier from TRAIN.FV and
use that classifier to assign sense tags to TEST.FV. For each instance in
TEST.FV, your program should output the instance-id, the sense-tag as
assigned by the classifier and its associated probability according to
your classifier, and the actual sense-tag. On
the last line the program should output the accuracy of the sense tagging
(this is simply the number of instances in the test data that had their
sense correctly assigned, divided by the total number of instances). This
should be output to standard output.
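As a sketch of the underlying model, with f_1, ..., f_k the binary
features of a test instance, the classifier assigns

    \hat{s} = \arg\max_{s} P(s) \prod_{j=1}^{k} P(f_j \mid s)

and a natural choice for the reported probability is this maximized score
divided by the sum of the same score over all senses.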
Your program will inevitably come across parameters that it cannot
estimate from the training data. Please use Witten-Bell
Smoothing to provide values for unobserved events. Make sure that you
retain a valid probability distribution while you are smoothing!
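For reference, one common formulation of Witten-Bell smoothing, with N
the number of training observations for the distribution being estimated,
T the number of distinct event types actually observed, and Z the number
of event types never observed:

    P(x) = \begin{cases} \dfrac{c(x)}{N+T} & \text{if } c(x) > 0 \\[1ex]
                         \dfrac{T}{Z(N+T)} & \text{if } c(x) = 0 \end{cases}

The seen events then carry total mass N/(N+T), the unseen events share
the held-out T/(N+T), and the estimates still sum to one.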
SANITY CHECK: if there are no features and TRAIN.FV consists of a list of
instance ids and sense tags, and TEST.FV only consists of a list of
instance ids, then your classifier should revert to a "most common sense"
baseline, where it assigns every instance in TEST.FV the most frequent
sense as used in TRAIN.FV.
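Putting the pieces together, here is a minimal sketch of nb.pl under the
same assumptions as the earlier sketches (TEST.FV retains the true sense
tag; features are binary, so the event types per parameter are just the
values 0 and 1):

#!/usr/bin/perl
# nb.pl -- sketch of a Naive Bayesian classifier with Witten-Bell
# smoothing. Usage: nb.pl TRAIN.FV TEST.FV > TAGGING
use strict;
use warnings;

my ($trainfv, $testfv) = @ARGV;
die "Usage: nb.pl TRAIN.FV TEST.FV\n" unless defined $testfv;

# --- training: count feature values per sense --------------------------
my (%sense_count, %value_count, $total);
open my $in, '<', $trainfv or die "Cannot open $trainfv: $!";
while (<$in>) {
    my ($id, $sense, @v) = split;
    $sense_count{$sense}++;
    $total++;
    $value_count{$sense}[$_]{ $v[$_] }++ for 0 .. $#v;
}
close $in;

# Witten-Bell estimate of P(value | sense) for feature slot $j.
sub wb_prob {
    my ($sense, $j, $value) = @_;
    my $seen = $value_count{$sense}[$j] || {};
    my $n = $sense_count{$sense};   # observations for this sense
    my $t = keys %$seen;            # distinct values seen (1 or 2)
    my $z = 2 - $t;                 # unseen values of a binary feature
    if (exists $seen->{$value}) {
        # Hold out mass only when some value was left unseen.
        return $z ? $seen->{$value} / ($n + $t) : $seen->{$value} / $n;
    }
    return $t / ($z * ($n + $t));   # equal share of the held-out mass
}

# --- testing: assign the most probable sense ----------------------------
my ($correct, $tested) = (0, 0);
open $in, '<', $testfv or die "Cannot open $testfv: $!";
while (<$in>) {
    my ($id, $true, @v) = split;
    my ($best, $best_p, $sum) = ('', 0, 0);
    for my $s (keys %sense_count) {
        # P(s) * product of P(f_j | s); switch to summing logs if you
        # have enough features for the product to underflow.
        my $p = $sense_count{$s} / $total;
        $p *= wb_prob($s, $_, $v[$_]) for 0 .. $#v;
        $sum += $p;
        ($best, $best_p) = ($s, $p) if $p > $best_p;
    }
    my $prob = $sum ? $best_p / $sum : 0;   # normalize over senses
    printf "%s %s %.4f %s\n", $id, $best, $prob, $true;
    $correct++ if $best eq $true;
    $tested++;
}
close $in;
printf "accuracy: %.4f\n", $tested ? $correct / $tested : 0;

Note how the empty-feature case reduces to the prior alone, which is
exactly the "most common sense" baseline of the sanity check above.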
Output
There is no output to turn in from your Open Mind sense tagging. However,
remember that your tagging is being collected by the Open Mind Project
and will be downloaded and used in our class.
Your programs should produce output as described above. Please make sure
to follow the naming conventions and formats exactly. There is no output
to turn in beyond what is requested below.
Experiments and Report
Carry out the following experiments and summarize your findings in a
report named "experiments.txt".
Experiment with window sizes of 0, 2, 10 and 25. Use frequency cutoffs of
1, 2, and 5. Run your classifiers with all 12 possible combinations of
window size and frequency cutoff using a 70-30 training-test data
ratio.
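One convenient way to drive all twelve runs over a fixed split (a
hypothetical helper script, not something the assignment asks you to
submit):

#!/usr/bin/perl
# runexp.pl -- hypothetical driver for the 12 window/cutoff runs.
use strict;
use warnings;

for my $w (0, 2, 10, 25) {
    for my $f (1, 2, 5) {
        system("perl feat.pl TRAIN $w $f > FEAT") == 0
            or die "feat.pl failed";
        system("cat FEAT | perl convert.pl $w TRAIN > TRAIN.FV") == 0
            or die "convert.pl failed on TRAIN";
        system("cat FEAT | perl convert.pl $w TEST > TEST.FV") == 0
            or die "convert.pl failed on TEST";
        system("perl nb.pl TRAIN.FV TEST.FV > TAGGING.$w.$f") == 0
            or die "nb.pl failed";
        print "window $w, cutoff $f: see TAGGING.$w.$f\n";
    }
}

The accuracy for each combination is then the last line of the
corresponding TAGGING file.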
Report the accuracy values that you obtain for each combination in a
table that looks something like this:
window size | frequency cutoff | accuracy
------------|------------------|---------
     0      |        1         |  .XXXX
     0      |        2         |  .XXXX
    etc.
What effect do you observe in overall accuracy as the window size and
frequency cutoffs change? Are there any combinations of window size and
frequency cutoff that appear to be optimal with respect to the others? Why?
Submission Guidelines
Submit all of your program files and your report. Make sure to submit
your target.config file as well. All should be plain text. Make sure
your name, date, and class information are contained in each file
(except for target.config!), and that your source code files are
carefully commented.
Place all of these files into a directory that is named with your umd user
id. In my case the directory would be called tpederse, for example. Then
create a tar file that includes this directory and the files you will
submit. Compress that tar file and submit it via the web drop from the
class home page. Please note that the deadline will be enforced by
automatic means. Any submissions after the deadline will not be graded.
The web drop has a limit of 10 MB, so your files should be plain text.
This is an individual assignment. You must write *all* of your code on
your own. Do not get code from your colleagues, the Internet, etc.
Please do not discuss your interpretations of these results amongst
yourselves. This is meant to make you think for yourself and arrive at
your own conclusions.
by:
Ted Pedersen
- tpederse@umn.edu