Ted Pedersen - CS 5761 - Introduction to Natural Language Processing

Programming Assignment 3 - Submit via web drop: Part A by 5pm Monday March 1, and Part B by 5pm Monday March 8. See below for details.

Objectives

To gain an understanding of how Google and web search data in general can be used to solve traditional NLP problems in a non-traditional way.

Specification (Part A) due 5pm Monday March 1

Write a Perl program candidate.pl that will check to see if a word is found in a dictionary file. If the word is not found, you should output a list of candidate corrections.

The dictionary file should simply be a file that contains one word per line, like /usr/dict/words. Note that a simple dictionary like /usr/dict/words does not include morphological variants of stems. For example, it includes run, but not running. Thus, if you do not find the given form of the word in the dictionary, you should run the Porter Stemmer (via Text::English::stem) to try to find the base form. If the stemmed form is not in the dictionary either, you should go on to generate a list of candidate corrections for the word as given on the command line (*not* the form generated by the Porter stemmer).
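
The lookup order described above (given form first, then the stemmed form) could be sketched roughly as follows. This is only an illustration under stated assumptions: the subroutine names load_dictionary and find_known_form are hypothetical, words are lowercased before comparison (the assignment does not say whether to do this), and the stemmer is passed in as a code reference so the sketch runs without any stemming module installed. The toy stemmer in the demonstration is a stand-in and is not the Porter algorithm.

```perl
use strict;
use warnings;

# Load a one-word-per-line dictionary from an open filehandle into a hash,
# so that each later lookup is a single hash access.
# Assumption: entries are lowercased; the assignment does not require this.
sub load_dictionary {
    my ($fh) = @_;
    my %dict;
    while ( my $line = <$fh> ) {
        chomp $line;
        next unless length $line;
        $dict{ lc $line } = 1;
    }
    return \%dict;
}

# Return the dictionary form that matches $word (the word itself, or its
# stem), or undef if neither is present. $stemmer is a code reference
# that maps one word to its stem (hypothetical interface).
sub find_known_form {
    my ( $word, $dict, $stemmer ) = @_;
    my $w = lc $word;
    return $w if $dict->{$w};
    my $stem = $stemmer->($w);
    return $stem if defined $stem && $dict->{$stem};
    return;
}

# Demonstration with an in-memory dictionary and a toy stemmer that only
# strips a trailing "ing" (a stand-in, NOT the Porter stemmer).
my $text = "run\nwalk\naccess\n";
open my $fh, '<', \$text or die "cannot open in-memory dictionary: $!";
my $dict = load_dictionary($fh);
close $fh;

my $toy_stemmer = sub { my ($w) = @_; $w =~ s/ing\z//; return $w };

for my $word (qw(run walking acress)) {
    my $form = find_known_form( $word, $dict, $toy_stemmer );
    print defined $form ? "$word: ok ($form)\n" : "$word: not found\n";
}
```

In a real candidate.pl the filehandle would come from opening the dict argument, and a wrapper around Text::English::stem would presumably take the place of the toy stemmer; check that module's documentation for its exact calling convention.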

Your program should output all of the candidate corrections for a word, based on the assumption that the spelling error is a single insertion, deletion, transposition, or substitution. If the word is found in the dictionary, then a message or code should be output indicating that the word is not misspelled.
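
One possible way to enumerate candidates under the single-error assumption is to generate every string that is one edit away from the typed word and keep those found in the dictionary. Undoing an insertion is a deletion and vice versa, so generating all four edit types covers all four error types. The sketch below is illustrative only: single_edits and candidates are hypothetical names, the alphabet is limited to lowercase a-z (an assumption; the assignment allows alphanumeric words), and the small dictionary is made up for the demonstration.

```perl
use strict;
use warnings;

# Return every distinct string exactly one edit away from $word, where an
# edit is a deletion, substitution, transposition, or insertion.
# Assumption: only the lowercase letters a-z are inserted or substituted.
sub single_edits {
    my ($word) = @_;
    my @alphabet = ( 'a' .. 'z' );
    my $n = length $word;
    my %seen;

    for my $i ( 0 .. $n ) {
        my $left  = substr( $word, 0, $i );
        my $right = substr( $word, $i );

        if ( $i < $n ) {
            # Deletion: drop the character at position $i.
            $seen{ $left . substr( $right, 1 ) } = 1;

            # Substitution: replace the character at position $i.
            $seen{ $left . $_ . substr( $right, 1 ) } = 1 for @alphabet;
        }
        if ( $i < $n - 1 ) {
            # Transposition: swap the characters at positions $i and $i+1.
            $seen{ $left
                  . substr( $right, 1, 1 )
                  . substr( $right, 0, 1 )
                  . substr( $right, 2 ) } = 1;
        }

        # Insertion: add one character before position $i.
        $seen{ $left . $_ . $right } = 1 for @alphabet;
    }

    # Substituting a letter with itself reproduces the original word.
    delete $seen{$word};

    my @edits = sort keys %seen;
    return @edits;
}

# Keep only the edits that appear in the dictionary hash.
sub candidates {
    my ( $word, $dict ) = @_;
    return grep { $dict->{$_} } single_edits($word);
}

# Made-up mini dictionary; "acre" is two edits from "acress" and is skipped.
my %dict = map { $_ => 1 } qw(access acre acres across actress caress cress);
print "$_\n" for candidates( 'acress', \%dict );
```

With this made-up dictionary the sketch prints access, acres, across, actress, caress, and cress, one per line; caress appears because it is a transposition of the first two letters.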

Your program should be run like this:

candidate.pl word dict

where word is an alphanumeric string, and dict is the location of a dictionary file. You should be able to specify this as a complete path. Your program should output the candidate corrections for word to STDOUT, one per line. If word or its stemmed form is in dict, your program should output a message or code indicating that the word is ok. For example...

candidate.pl acress /usr/dict/words

access
acres
across
actress
cress

Note that the above is just given to show the format of the output, and is not necessarily what you should expect given this dictionary and acress.

Specification (Part B) due 5pm Monday March 8

Write a Perl program googlespell.pl that will score a list of candidate corrections for a word by using the Google API. The misspelled word should be given on the command line, and you should also provide a file that includes the list of candidate corrections as generated by candidate.pl.

Your program should be run like this:

googlespell.pl word candidates

You should develop your own approach to using the results of Google searches to score the different candidates. You should certainly use information about the number of times the candidates occur by themselves, as well as the number of times each candidate occurs in a page with the misspelling. Your approach should go beyond this, and incorporate additional information, or use this information in some clever way.
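
As a starting point only, the two required kinds of counts could be combined along the following lines. This is a deliberately plain baseline and not a recommended approach, since the assignment asks for something that goes beyond it. The sketch assumes the page counts have already been retrieved and stored in hashes; it makes no Google API calls. The name score_candidates, the mixing weight lambda, and all counts in the demonstration are hypothetical.

```perl
use strict;
use warnings;

# Combine two kinds of page counts into one score per candidate.
#   $alone->{$c}: number of pages containing candidate $c
#   $joint->{$c}: number of pages containing both $c and the misspelling
# Each count is turned into a share of its total, and the two shares are
# mixed with weight $lambda (an arbitrary illustrative parameter).
sub score_candidates {
    my ( $alone, $joint, $lambda ) = @_;

    my ( $alone_total, $joint_total ) = ( 0, 0 );
    for my $c ( keys %$alone ) {
        $alone_total += $alone->{$c};
        $joint_total += $joint->{$c} // 0;
    }

    my %score;
    for my $c ( keys %$alone ) {
        my $p_alone = $alone_total ? $alone->{$c} / $alone_total : 0;
        my $p_joint = $joint_total ? ( $joint->{$c} // 0 ) / $joint_total : 0;
        $score{$c} = $lambda * $p_alone + ( 1 - $lambda ) * $p_joint;
    }
    return \%score;
}

# Hypothetical counts, invented for this demonstration only.
my %alone = ( access => 600, actress => 300, cress => 100 );
my %joint = ( access => 10,  actress => 30,  cress => 10 );

my $score = score_candidates( \%alone, \%joint, 0.5 );
printf "%s %.2f\n", $_, $score->{$_} for sort keys %$score;
```

Because both shares are normalized over the candidate list, the scores sum to 1 whenever both totals are nonzero, which makes them easy to compare across typos.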

Your program should also find out what correction Google would suggest. Do not use this information in your approach; it serves only as a point of comparison.

Your program should output the typo, your correction, Google's correction, plus a list of all your candidates sorted in order by the score your program assigns them. For example...

TYPO: acress
MY CORRECTION: access
GOOGLE CORRECTION: actress

LIST OF CANDIDATES WITH SCORES:
access .92
actress .81
across .79
acres .63
cress .42
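
A report in the layout shown above could be produced roughly like this, assuming the scores are already in a hash and Google's suggestion has already been retrieved. The name format_report is hypothetical, ties are broken alphabetically (an assumption), and the "(none)" placeholder for an empty candidate list is not part of the assignment.

```perl
use strict;
use warnings;

# Build the report text: candidates sorted by descending score, with the
# top-ranked candidate reported as "MY CORRECTION".
sub format_report {
    my ( $typo, $google, $score ) = @_;

    # Highest score first; ties broken alphabetically (an assumption).
    my @ranked =
      sort { $score->{$b} <=> $score->{$a} or $a cmp $b } keys %$score;

    # "(none)" is a hypothetical placeholder for an empty candidate list.
    my $best = @ranked ? $ranked[0] : '(none)';

    my $out = "TYPO: $typo\n";
    $out .= "MY CORRECTION: $best\n";
    $out .= "GOOGLE CORRECTION: $google\n\n";
    $out .= "LIST OF CANDIDATES WITH SCORES:\n";
    for my $c (@ranked) {
        my $s = sprintf '%.2f', $score->{$c};
        $s =~ s/^0(?=\.)//;    # print .92 rather than 0.92, as in the example
        $out .= "$c $s\n";
    }
    return $out;
}

# Scores copied from the example output above.
print format_report(
    'acress', 'actress',
    {
        access  => 0.92,
        actress => 0.81,
        across  => 0.79,
        acres   => 0.63,
        cress   => 0.42,
    }
);
```

Returning the report as a string rather than printing inside the subroutine keeps the formatting easy to check separately from the scoring.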

Presentation, Thursday March 11

You will present a demo of your scoring algorithm, along with a discussion of results, in the lab on Thursday March 11. You should study at least 10 "interesting" spelling errors, and compare the results of your program with the correction provided by Google. In how many cases are they the same? In the cases where they are different, how highly ranked is the Google correction in your list of candidates? By "interesting" errors, I mean those that will tend to reveal differences between your approach and Google's - for example, errors that differ by more than one character, errors at the start of the word, etc.
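
The two questions above (how often the corrections agree, and where Google's correction falls in your ranking) could be tabulated with something like the following sketch. The names rank_of and summarize are hypothetical. The first case reuses the acress example from this handout; the second case is invented purely for illustration and is not a real result.

```perl
use strict;
use warnings;

# Return the 1-based position of $target in the ranked list, or 0 if it
# does not appear there at all.
sub rank_of {
    my ( $target, @ranked ) = @_;
    for my $i ( 0 .. $#ranked ) {
        return $i + 1 if $ranked[$i] eq $target;
    }
    return 0;
}

# Count the cases where the top candidate equals Google's suggestion, and
# describe where Google's suggestion sits in each ranked list.
sub summarize {
    my (@cases) = @_;
    my $agree = 0;
    my @lines;
    for my $case (@cases) {
        my $rank = rank_of( $case->{google}, @{ $case->{ranked} } );
        $agree++ if $rank == 1;
        push @lines,
          sprintf( '%s: Google suggestion "%s" is at rank %s',
            $case->{typo}, $case->{google}, $rank ? $rank : 'none' );
    }
    return ( $agree, @lines );
}

# Each case holds the typo, Google's suggestion, and the candidates in
# rank order. The second case is made up for illustration.
my @cases = (
    {
        typo   => 'acress',
        google => 'actress',
        ranked => [qw(access actress across acres cress)],
    },
    {
        typo   => 'teh',
        google => 'the',
        ranked => [qw(the ten tea)],
    },
);

my ( $agree, @lines ) = summarize(@cases);
print "$_\n" for @lines;
printf "Agreement: %d of %d cases\n", $agree, scalar @cases;
```

A rank of "none" marks the cases where Google's suggestion was never generated as a candidate, for example when it is more than one edit away from the typo.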

From all that you observe, draw some conclusions about how Google must be doing spelling correction. Submit your slides to the webdrop by 4pm Thu March 11.

Policies (see syllabus for more details)

Please comment your code. You must provide a detailed description of your spelling correction algorithm in your source code comments. This should focus on how you score the candidate corrections for a word. Also make sure your name, class, etc. are clearly included in the comments.

It is fine to use a Perl reference book to provide examples of loops, variables, etc., but your candidate.pl and googlespell.pl specific code must be your own, and not taken from any other source (human, published, on the web, etc.)