A WordNet Stop List
What's a Stop List?
A stop list is a list of words that are excluded from some
language processing task, usually because they are viewed as
non-informative or potentially misleading. Typically they are
non-content words such as conjunctions, determiners, and
prepositions. These are often called function words.
What's a WordNet Stop List?
Since WordNet only contains nouns, verbs, adjectives, and
adverbs, you might think that a stop list wouldn't really
be relevant. However, there are words that are normally
used as function words that have senses (usually obscure)
in WordNet.
For example, consider the humble word "at". According to WordNet,
"at" is a noun that has two senses, one for the chemical element
astatine and the other for a Laotian monetary unit.
It is very likely that most systems using WordNet are NOT
using "at" in these senses. Thus, a WordNet stop list will list
those words that are typically used as function words and
yet have unrelated WordNet senses that are obscure and
potentially misleading.
Finding the WordNet Stop List
This project was undertaken by Satanjeev Banerjee, and arose
in the context of an implementation of Lesk's word sense
disambiguation algorithm that will likely yield many interesting
results. The first step was to build a list of likely stop list
words. He found the following:
The stop list formed based on these lists is shown
here.
The next step was to determine which of these words have misleading
WordNet senses, which have related WordNet senses, and which have
no WordNet senses at all.
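The mechanical part of this step, separating candidates that have some WordNet sense from those that have none, could be sketched roughly as follows. This is only an illustration: `split_by_wordnet_presence`, `toy_lookup`, and `TOY_SENSES` are hypothetical names, and the toy inventory stands in for a real WordNet lookup. Deciding whether a word's senses are misleading or related remains a matter of human judgement and is not automated here.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Tuple


def split_by_wordnet_presence(
    candidates: Iterable[str],
    sense_lookup: Callable[[str], Sequence[str]],
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Split candidates into words with at least one sense and words with none.

    sense_lookup is any function that returns the sense glosses for a word
    (an assumption of this sketch; a real system would query WordNet).
    """
    with_senses: Dict[str, List[str]] = {}
    without_senses: List[str] = []
    for word in candidates:
        senses = sense_lookup(word)
        if senses:
            # These words still need manual review: misleading or related?
            with_senses[word] = list(senses)
        else:
            without_senses.append(word)
    return with_senses, without_senses


# Toy inventory for illustration only, NOT real WordNet data.
# The two glosses for "at" paraphrase the senses described earlier in the text.
TOY_SENSES = {
    "at": ["astatine (chemical element)", "Laotian monetary unit"],
}


def toy_lookup(word: str) -> List[str]:
    """Hypothetical stand-in for a WordNet query."""
    return TOY_SENSES.get(word.lower(), [])


if __name__ == "__main__":
    found, missing = split_by_wordnet_presence(["at", "the"], toy_lookup)
    print(found)
    print(missing)
```

Only the words returned in the first group need to be inspected by hand to decide whether their senses are misleading or related.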
The following words are normally used as function words, but also turn
out to have rather odd (but correct) senses listed in WordNet:
I, a, an, as, at, by, he, his, me, or, thou, us, who.
This is our current WordNet stop list!
You can view the senses that led us to this conclusion
here. The words in our initial stop list that have no WordNet sense
are shown here. And those function words that also have WordNet
senses that seem to be related are shown here.
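One way to use the WordNet stop list is as a filter applied to tokens before they are looked up in WordNet. The following is a minimal sketch, not the project's actual code: the names `WORDNET_STOP_LIST` and `drop_wordnet_stop_words` are hypothetical, and matching is assumed to be case-insensitive.

```python
from typing import Iterable, List

# The thirteen words of the current WordNet stop list, stored in lower case
# so that matching can ignore case (an assumption of this sketch).
WORDNET_STOP_LIST = frozenset(
    word.lower()
    for word in [
        "I", "a", "an", "as", "at", "by", "he",
        "his", "me", "or", "thou", "us", "who",
    ]
)


def drop_wordnet_stop_words(tokens: Iterable[str]) -> List[str]:
    """Return the tokens that are not on the WordNet stop list."""
    return [tok for tok in tokens if tok.lower() not in WORDNET_STOP_LIST]


if __name__ == "__main__":
    # "He" and "at" are removed; the other tokens pass through unchanged.
    print(drop_wordnet_stop_words(["He", "sat", "at", "the", "bank"]))
```

Note that case-insensitive matching also removes tokens such as "US" or "OR". A system that needs to keep those abbreviations would have to match case-sensitively or handle them separately.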
Please let us know if you have any other candidates for membership
in the WordNet stop list! These lists have been constructed using
our intuitive judgements and are not meant to be taken as anything
more than that!
By:
Ted Pedersen
- tpederse@d.umn.edu