CS 8761 Natural Language Processing - Fall 2004
Assignment 1 (Poor Man's LSA) - Due Wed, Sept 29, noon
This may be revised in response to your questions, so please check
this page from time to time. (Last Revision Thu Sept 23)
Objectives
This assignment will help you learn how to process larger quantities of
text and store important information about that text in an economical way.
It will also introduce you to the method of Latent Semantic Analysis and
similarity measurements in text in a straightforward way.
Specification
In this assignment, you will implement something akin to LSA, except
it won't include SVD. This is a "Poor Man's" version of LSA, and perhaps
when we are richer in knowledge we'll try our hand at the real thing.
But until then...
You will write three Perl programs for this assignment. These must run on
the Solaris/Sun platform we have available here at UMD, since this is the
platform I will use to test your code.
Part A:
Write a program called matrix.pl that will accept as input any number of
text files, and produce a co-occurrence matrix (sort of). Conceptually
your program is producing a co-occurrence matrix that shows how often
each word occurs in each context, but in reality you should not implement
it using a "dense" 2-d matrix (where all cells are explicitly
represented) for reasons that will be described shortly.
You may assume that each line in each input file represents a separate
context. The cells in your co-occurrence matrix should show how many
times each word occurs in each context. Treat all the files as one
single body of text. However, rather than storing the raw count of a
word, store one plus the log of that count; in other words, each cell
that has a non-zero count should contain the following:
1 + log(count)
As mentioned above, you should *not* use a simple 2-d matrix to store
these values. A matrix stored in that form will get too large for our
system to handle. Since it will be a very sparse matrix, we can
use a more economical means of storing these counts, sometimes known as
a "sparse" format. A sparse format typically does not explicitly store 0
values in cells. Thus, you should implement your co-occurrence matrix
such that you only need to store the counts of non-zero values. The 0
values can be thought of as implicit and unmentioned.
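To make the sparse idea concrete, here is a small sketch. (Your actual
program must be in Perl; this Python sketch only illustrates the
data structure, a hash of hashes keyed by word and context, where zero
cells are simply never created. All names here are illustrative, not
required.)

```python
import math
from collections import defaultdict

def build_matrix(lines):
    """Build a sparse word-by-context matrix: only non-zero cells are
    stored. Each input line is one context; a cell holds 1 + log(count)."""
    counts = defaultdict(lambda: defaultdict(int))  # word -> {context_id: raw count}
    for context_id, line in enumerate(lines):
        for word in line.lower().split():
            counts[word][context_id] += 1
    # Replace raw counts with 1 + log(count); zero cells stay implicit.
    return {w: {c: 1 + math.log(n) for c, n in ctxs.items()}
            for w, ctxs in counts.items()}
```

Note that a word appearing once in a context gets 1 + log(1) = 1, and
contexts where a word never appears take up no storage at all.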
matrix.pl should output the co-occurrence information to STDOUT in your
sparse format, and this will become the input to your other two programs.
Thus, your matrix.pl program should run as follows:
matrix.pl file1.txt > occur.txt
matrix.pl file1.txt file2.txt > occur.txt
matrix.pl file2.txt news.txt > occur.txt
etc...
Make no assumptions about the names of input files, and allow an arbitrary
number to be given on the command line. You may assume that the data you
will process is plain text, and that each line contains a separate
context. I will provide you with some data to test with, and you
should also create some of your own for testing.
Part B:
You will write a program called similar.pl that will take an arbitrary
number of words as input, and produce complete pairwise similarity scores
among this set of words. These scores should be from the cosine measure.
Make sure you use the real valued cosine (see page 300 of text).
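The real valued cosine between two sparse vectors is their dot product
divided by the product of their lengths. (Again, your program must be
in Perl; this Python sketch just illustrates how the measure can be
computed directly from a sparse representation, touching only the
non-zero cells.)

```python
import math

def cosine(u, v):
    """Real-valued cosine between two sparse vectors, each a dict of
    context_id -> weight: dot(u, v) / (|u| * |v|)."""
    dot = sum(w * v[c] for c, w in u.items() if c in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```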
The input to your similar.pl program will be the output of matrix.pl,
which should come to similar.pl from STDIN. The output from similar.pl
should be a table which shows a matrix of scores for the set of words
you have submitted. Please only show these scores to 4 digits of
precision. This output should go to STDOUT.
Thus, you should be able to run your program as follows:
similar.pl cat dog house mouse < occur.txt > similar.txt
matrix.pl file3.txt news.txt | similar.pl cat dog house mouse > similar.txt
etc.
The output should be formatted like this:
cat dog house mouse
cat 1.0000 .3243 .1234 .9900
dog .3243 1.0000 .8776 .2211
house .1234 .8776 1.0000 .5321
mouse .9900 .2211 .5321 1.0000
Note that these values are made up, and are not meant to be indicative of
actual values you might find for these words.
Make no assumptions about the number of words that will be input to your
program. If a word is input that is not known to your system, make sure
you handle that gracefully by indicating a ? in the table entries
associated with that word.
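One way to produce the table, including the "?" entries for unknown
words and the 4-digit precision, is sketched below. (Python for
illustration only; the tab-separated layout and all names here are one
possibility, not a requirement. The cosine routine is passed in as a
function.)

```python
def format_table(words, matrix, cosine):
    """Return the pairwise similarity table as a string; any word
    missing from the sparse matrix gets '?' in its row and column."""
    rows = ["\t" + "\t".join(words)]
    for w1 in words:
        cells = []
        for w2 in words:
            if w1 in matrix and w2 in matrix:
                cells.append("%.4f" % cosine(matrix[w1], matrix[w2]))
            else:
                cells.append("?")  # word never seen in any context
        rows.append(w1 + "\t" + "\t".join(cells))
    return "\n".join(rows)
```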
Part C:
You will write a program called knn.pl that will take a given word as
input, and will output the N nearest neighbors to that word according to
the information in your co-occurrence matrix. This program will take two
command line arguments, the word you are interested in, and the number of
neighbors you would like to find. You should find the N words that are
closest to your given word with respect to the cosine measurement. Make
sure that you use the real valued cosine (see page 300 of text).
The input to your knn.pl program will be the output of matrix.pl, which
should come to knn.pl from STDIN. The output from knn.pl should be a table
which includes the word you are interested in, as well as the N neighbors
and their scores (please only show these scores to 4 digits of precision).
This output should go to STDOUT. Thus, you should be able to run your
program as follows:
knn.pl dog 5 < occur.txt > knn.txt
matrix.pl file1.txt file2.txt | knn.pl dog 5 > knn.txt
etc.
The output should be formatted like this:
dog
bone .9998
meat .9977
horse .8890
cow .8740
fish .5432
Note that these values are made up, and are not meant to be indicative of
actual values you might find for "dog".
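Conceptually, finding the N nearest neighbors means scoring every other
word in the matrix against the target and keeping the top N. (A Python
sketch under the same caveats as above; the names are illustrative, and
the cosine routine is passed in as a function.)

```python
def nearest_neighbors(target, matrix, cosine, n):
    """Return the n (word, score) pairs with the highest cosine
    against the target word's vector, target itself excluded."""
    if target not in matrix:
        return []  # unknown word: nothing to report
    scores = [(w, cosine(matrix[target], vec))
              for w, vec in matrix.items() if w != target]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n]
```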
Experiments and Report
You will produce a short written report describing the outcome of the
following experiments.
Compare the results of your similar.pl program to the
Matrix Comparison Demo of LSA
using term-term comparison. You should carry out at least 4 different
comparisons, each using a different set of 5 or more words. Try to choose
sets of words where you have some idea about how related they should be.
Also, try to make the sets of words as different from each other as
possible. For example, one set of words might be verbs, another might be
nouns, another might be place names, etc. In your report, make sure
you describe why you selected the words you did, and what you hope they
show.
In your report, include both the results generated by your program, and
those from the LSA demo. Make sure you describe the settings that
you used with the LSA demo (which topic space you used, how many factors,
etc.). Comment on how well each method seemed to perform, which was better
and closer to your expectations, and why they might lead to
different/same results. Your objective here should be to show that you
can interpret the results of your system and the LSA system in a
reasonably insightful manner that reflects some understanding of
the underlying mechanisms.
Then, compare the results of your knn.pl program to the Near Neighbor Demo
using term-term comparison.
You should carry out at least 5 different comparisons where you generate
20 neighbors, each using a different word to generate neighbors. Have 2
of your words be among those used in similar.pl experiments, and
comment on the difference between what knn.pl and similar.pl report in
those cases.
In your report, you should list the words you used, and explain why you
chose them, and what you would expect their neighbors to be. Show the
results from your program and the LSA demo, and comment on why they are
similar and different, and what this might tell you about how both
methods function.
Finally, in your report, please conclude by discussing how well your Poor
Man's version of LSA approximates the real thing. Discuss improvements you
could make to your Poor Man's version (short of adding SVD!), and how that
might improve performance.
Submission Guidelines
Please name your programs 'matrix.pl', 'similar.pl', and 'knn.pl'.
Please name your report 'experiments.txt'. Each of these should be plain
text files. Make sure that your name, date, and class information are
contained in each file, and that your programs are commented.
Place these four files into a directory that is named with your umd user
id. In my case the directory would be called tpederse, for example. Then
create a tar file that includes this directory and your four files.
Compress that tar file and submit it via the web drop from the class home
page.
Please note that the deadline will be enforced by automatic means. Any
submissions after the deadline will not be graded. The web drop has a
limit of 10 MB, so your files should be plain text. If you have large data
files you wish to share for some reason, please contact me via email and
we'll figure out a way to do that.
This is an individual assignment. You must write *all* of your code on
your own. Do not get code from your colleagues, the internet, etc. You are
welcome to discuss LSA and similarity measurements in general with your
classmates, friends, family, etc., but do not discuss implementation
details. You must also write your report on your own. Please do not
discuss your interpretations of these results amongst yourselves. This is
meant to make you think for yourself and arrive at your own conclusions.
by:
Ted Pedersen
- tpederse@umn.edu