CS 5761 - Introduction to Natural Language Processing
Programming Assignment 3 - Submit via web drop by 5pm Monday March 1
(Part A) and by 5pm Monday March 8 (Part B). See below for details.
Objectives
To gain an understanding of how Google and web search data in general can
be used to solve traditional NLP problems in a non-traditional way.
Specification (Part A) due 5pm Monday March 1
Write a Perl program candidate.pl that will check to see if a word is
found in a dictionary file. If the word is not found, you should output a
list of candidate corrections.
The dictionary file should simply be a file that contains one word per
line, like /usr/dict/words. Note that a simple dictionary like
/usr/dict/words does not include morphological variants of stems. For
example, it includes run, but not running. Thus, if you do not find the
given form of the word in the dictionary, you should run the Porter
Stemmer (via Text::English::stem) to try to find the base form. If the
stemmed form is not in the dictionary either, you should go on to generate
a list of candidate corrections for the word as given on the command line
(*not* the form generated by the Porter stemmer).
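The lookup order described above (the word as given first, then its
stemmed form) can be sketched as follows. This is a minimal illustration
in Python, not the Perl program the assignment asks for. The names
load_dictionary, toy_stem, and is_known are hypothetical, and toy_stem is
only a crude stand-in for Text::English::stem, not the Porter algorithm.

```python
def load_dictionary(path):
    """Read a one-word-per-line dictionary file into a set of lowercase words."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def toy_stem(word):
    """Hypothetical stand-in for a real Porter stemmer: strips a few suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "running" -> "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word


def is_known(word, dictionary, stem=toy_stem):
    """True if the word as given, or its stemmed form, is in the dictionary."""
    word = word.lower()
    return word in dictionary or stem(word) in dictionary
```

With a dictionary containing run but not running, is_known("running", ...)
succeeds through the stemmed form, so no candidates would be generated.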
Your program should output all of the candidate corrections for a word,
based on the assumption that the spelling error is a single insertion,
deletion, transposition, or substitution. If the word is found in the
dictionary, then a message or code should be output indicating that the
word is not misspelled.
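One possible way to enumerate single-edit candidates is sketched below,
again in Python for illustration only. The function names and the default
lowercase-letter alphabet are assumptions; since word may be alphanumeric,
a real solution might need a wider alphabet.

```python
import string


def single_edit_variants(word, alphabet=string.ascii_lowercase):
    """Return every string one insertion, deletion, transposition, or
    substitution away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletions = {left + right[1:] for left, right in splits if right}
    transpositions = {
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    }
    substitutions = {
        left + ch + right[1:] for left, right in splits if right for ch in alphabet
    }
    insertions = {left + ch + right for left, right in splits for ch in alphabet}
    variants = deletions | transpositions | substitutions | insertions
    variants.discard(word)  # the word itself is not a correction
    return variants


def candidate_corrections(word, dictionary):
    """Keep only the variants that are dictionary words, sorted for stable output."""
    return sorted(single_edit_variants(word.lower()) & set(dictionary))
```

Generating every variant and intersecting with the dictionary keeps the
work proportional to the word length rather than the dictionary size.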
Your program should be run like this:
candidate.pl word dict
where word is an alphanumeric string, and dict is the location of a
dictionary file. You should be able to specify this as a complete path.
Your program should output the candidate corrections for word to STDOUT,
one per line. If word or its stemmed form is in dict, your program should
output a message or code indicating that the word is ok. For example...
candidate.pl acress /usr/dict/words
access
acres
across
actress
cress
Note that the above is just given to show the format of the output, and
is not necessarily what you should expect given this dictionary and
acress.
Specification (Part B) due 5pm Monday March 8
Write a Perl program googlespell.pl that will score a list of candidate
corrections for a word by using the Google API. The misspelled word should
be given on the command line, and you should also provide a file that
includes the list of candidate corrections as generated by candidate.pl.
Your program should be run like this:
googlespell.pl word candidates
You should develop your own approach to using the results of Google
searches to score the different candidates. You should certainly use
information about the number of times the candidates occur by themselves,
as well as the number of times each candidate occurs in a page with the
misspelling. Your approach should go beyond this, and incorporate
additional information, or use this information in some clever way.
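As a starting point only, the two required signals (how often a candidate
occurs alone, and how often it occurs with the misspelling) could be
combined as in the Python sketch below. Everything here is an assumption:
hit_count is a hypothetical callable that returns a page count for a
query, the two-term query format is a guess at how a search backend treats
multiple words, and the equal weighting is arbitrary. It covers only the
baseline and not the additional information the assignment calls for.

```python
import math


def score_candidates(typo, candidates, hit_count):
    """Score each candidate in [0, 1] from standalone and co-occurrence counts.

    hit_count is a hypothetical callable: query string -> number of pages.
    """
    alone = {c: hit_count(c) for c in candidates}
    together = {c: hit_count(f"{typo} {c}") for c in candidates}
    # Log-scale standalone counts so very common words do not swamp the rest.
    popularity = {c: math.log1p(n) for c, n in alone.items()}
    # Fraction of the candidate's pages that also contain the typo.
    affinity = {c: together[c] / alone[c] if alone[c] else 0.0 for c in candidates}
    top_pop = max(popularity.values(), default=0.0) or 1.0
    top_aff = max(affinity.values(), default=0.0) or 1.0
    return {
        c: 0.5 * popularity[c] / top_pop + 0.5 * affinity[c] / top_aff
        for c in candidates
    }
```

Passing the counting function in as a parameter makes it easy to test the
scoring logic with canned counts before running any live searches.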
Your program should also find out what correction Google would suggest. Do
not use this information in your approach; it serves only as a point of
comparison.
Your program should output the typo, your correction, Google's correction,
plus a list of all your candidates sorted in order by the score your
program assigns them. For example...
TYPO: acress
MY CORRECTION: access
GOOGLE CORRECTION: actress
LIST OF CANDIDATES WITH SCORES:
access .92
actress .81
across .79
acres .63
cress .42
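The report layout shown above could be produced along these lines. This is
a Python sketch with a hypothetical format_report function; it assumes
scores is a mapping from candidate to score, and it prints scores with a
leading zero (0.92 rather than .92).

```python
def format_report(typo, scores, google_correction):
    """Build the report text, listing candidates from highest to lowest score."""
    # Sort by descending score; break ties alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    best = ranked[0][0] if ranked else ""
    lines = [
        f"TYPO: {typo}",
        f"MY CORRECTION: {best}",
        f"GOOGLE CORRECTION: {google_correction}",
        "LIST OF CANDIDATES WITH SCORES:",
    ]
    lines += [f"{word} {score:.2f}" for word, score in ranked]
    return "\n".join(lines)
```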
Presentation, Thursday March 11
You will present a demo of your scoring algorithm, along with a discussion
of results in the lab on Thursday March 11. You should study at least 10
"interesting" spelling errors, and compare the results of your program
with the correction provided by Google. In how many cases are they the
same? In the cases where they are different, how highly ranked is the
Google correction in your list of candidates? By "interesting"
errors, I mean those that will tend to reveal differences between
your approach and Google's - for example, errors that differ by more than
one character, errors at the start of the word, etc.
From all that you observe, draw some conclusions about how Google must be
doing spelling correction. Submit your slides to the webdrop by 4pm Thu
March 11.
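The two comparison questions in this section (how often your top
correction matches Google's, and where Google's correction ranks in your
list otherwise) can be tallied with a small helper such as the Python
sketch below. The function name and the input layout are assumptions made
for illustration.

```python
def compare_with_google(results):
    """Summarize agreement between ranked candidate lists and Google.

    results: list of (ranked_candidates, google_correction) pairs, one per
    typo, with ranked_candidates ordered from best to worst.
    Returns (number of agreements, ranks of Google's correction in the
    disagreeing cases); a rank of None means it was not a candidate at all.
    """
    agreements = 0
    google_ranks = []
    for ranked, google in results:
        if ranked and ranked[0] == google:
            agreements += 1
        else:
            rank = ranked.index(google) + 1 if google in ranked else None
            google_ranks.append(rank)
    return agreements, google_ranks
```

A rank of None is worth reporting separately, since it points to errors
that fall outside the single-edit assumption used to build the candidates.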
Policies (see syllabus for more details)
Please comment your code. You must provide a detailed description of your
spelling correction algorithm in your source code comments. This should
focus on how you score the candidate corrections for a word. Also make
sure your name, class, etc. are clearly included in the comments.
It is fine to use a Perl reference book to provide examples of loops,
variables, etc., but your candidate.pl and googlespell.pl specific code
must be your own, and not taken from any other source (human, published,
on the web, etc.).