Free Software
[This page is out of date. Please contact me for more current info.]
This is a directory of software developed by the
Natural Language Processing Group
at the University of Minnesota, Duluth. Most of it is written in Perl, and it is
always
freely available
under the terms of the GNU General Public License
(GPL).
Many of these projects are available via
CPAN
and
SourceForge.
Unsupervised Corpus Based Clustering of Similar Contexts
-
SenseClusters is a package of Perl programs that allows a user to cluster
similar contexts together using unsupervised knowledge-lean methods. These
techniques have been applied to word sense discrimination, email
categorization, and name discrimination.
Collocation Identification
-
NSP allows you to identify word n-grams in large corpora using
standard tests of association such as Fisher's exact test, the log
likelihood ratio, Pearson's chi-squared test, and the Dice coefficient.
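To make the idea of a "test of association" concrete, here is a minimal Python sketch that scores one bigram with the Dice coefficient and the log likelihood ratio, both computed from a 2x2 contingency table of bigram counts. This is not NSP's code or interface (NSP itself is written in Perl); the function names `bigram_table`, `dice`, and `log_likelihood` and the sample sentence are assumptions made purely for illustration.

```python
import math


def bigram_table(tokens, w1, w2):
    """Return the 2x2 contingency counts (n11, n12, n21, n22) for bigram (w1, w2).

    n11: w1 followed by w2        n12: w1 followed by something else
    n21: something else then w2   n22: neither word in its position
    """
    n11 = n12 = n21 = n22 = 0
    for a, b in zip(tokens, tokens[1:]):
        if a == w1 and b == w2:
            n11 += 1
        elif a == w1:
            n12 += 1
        elif b == w2:
            n21 += 1
        else:
            n22 += 1
    return n11, n12, n21, n22


def dice(n11, n12, n21, n22):
    """Dice coefficient: 2 * joint count / (count of w1 first + count of w2 second)."""
    denom = (n11 + n12) + (n11 + n21)
    return 2 * n11 / denom if denom else 0.0


def log_likelihood(n11, n12, n21, n22):
    """Log likelihood ratio (G^2): 2 * sum of observed * ln(observed / expected)."""
    n = n11 + n12 + n21 + n22
    rows = (n11 + n12, n21 + n22)
    cols = (n11 + n21, n12 + n22)
    observed = ((n11, n12), (n21, n22))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing to the sum
                expected = rows[i] * cols[j] / n
                total += o * math.log(o / expected)
    return 2 * total


# Illustrative sample text only, not data from any of the packages listed here.
tokens = "new york is a city and new york is big and new jersey is near".split()
table = bigram_table(tokens, "new", "york")
print(table, dice(*table), log_likelihood(*table))
```

Higher scores suggest that the two words occur together more often than chance would predict. On a corpus this small the numbers mean little; the sketch only shows how the counts feed the measures.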
WordNet Resources
-
WordNet::Similarity allows you to measure the similarity and relatedness
of two concepts in the WordNet lexical database using a variety of
measures of semantic similarity and relatedness.
-
WordNet::SenseRelate allows you to assign meanings to each content word in
a text. It does this by determining which sense of a word is most
related to its neighbors.
-
A few miscellaneous programs that help us work with WordNet.
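The two ideas described above, measuring how related two concepts are and choosing the sense of a word that is most related to its neighbors, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny is-a taxonomy, the sense labels such as `bank#river`, and the functions `path_similarity` and `disambiguate` are all hypothetical and are not the API or data of WordNet::Similarity or WordNet::SenseRelate. The score used here is a simple inverse path length, just one of many possible measures.

```python
# Hypothetical toy is-a taxonomy (child -> parent). This is NOT real WordNet data.
PARENT = {
    "organization": "entity",
    "institution": "organization",
    "bank#finance": "institution",
    "lender#1": "institution",
    "landform": "entity",
    "slope": "landform",
    "bank#river": "slope",
    "shore#1": "landform",
}

# Hypothetical sense inventory: word -> candidate senses.
SENSES = {
    "bank": ["bank#finance", "bank#river"],
    "shore": ["shore#1"],
    "lender": ["lender#1"],
}


def ancestors(concept):
    """Return the chain from a concept up to the root, concept first."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def path_similarity(c1, c2):
    """Inverse of the number of nodes on the shortest is-a path between c1 and c2."""
    depth2 = {c: i for i, c in enumerate(ancestors(c2))}
    for i, c in enumerate(ancestors(c1)):
        if c in depth2:  # first shared ancestor is the lowest common one
            return 1.0 / (i + depth2[c] + 1)
    return 0.0  # no shared ancestor


def disambiguate(target, neighbors):
    """Pick the sense of target with the highest total relatedness to its neighbors."""
    def score(sense):
        return sum(
            max((path_similarity(sense, s) for s in SENSES.get(word, [])), default=0.0)
            for word in neighbors
        )
    return max(SENSES[target], key=score)


print(disambiguate("bank", ["shore"]))
print(disambiguate("bank", ["lender"]))
```

In this toy setup, the neighbor decides the outcome: a sense of "bank" that sits close to the neighbor's sense in the taxonomy receives the higher score and is selected.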
UMLS Resources
-
UMLS::Similarity allows you to measure the similarity and relatedness of
two concepts in the Unified Medical Language System (UMLS) using a
variety of measures of semantic similarity and relatedness.
-
UMLS::Interface provides a Perl interface to the Unified Medical
Language System (UMLS) and supplies much of the functionality that
UMLS::Similarity relies on.
Supervised Methods of Word Sense Disambiguation
-
This is a suite of tools that allows for easy creation of supervised word
sense disambiguation experiments.
-
This is a greatly improved version of the Duluth-Shell as used in the
DuluthX Senseval-2 systems. It makes it easier to run large numbers of
experiments, and provides many detailed reporting options.
-
This extends the Duluth Senseval-2 systems with part of speech and
syntactic features. This system participated in Senseval-3 (2004).
-
Complete source code and documentation for the Duluth systems that
participated in the Senseval-3 (2004) comparative exercise among word
sense disambiguation systems. This includes supervised lexical sample
systems based on the Duluth Senseval-2 systems, and a new unsupervised
lexical sample system.
-
Complete source code and documentation for the Duluth systems
that participated in the lexical sample tasks of the Senseval-2 (2001)
comparative exercise among word sense disambiguation systems. These
systems rely on lexical features like unigrams, bigrams, and
co-occurrences.
-
This is a complete word sense disambiguation system that
integrates NSP and Weka into the GATE environment.
-
This is a complete word sense disambiguation system that assigns senses
to biomedical text based on the UMLS.
Data and Data Creation Tools
-
We support conversions of data in a number of formats into the
Senseval-2 format for lexical sample word sense disambiguation. You
can find those tools here!
-
We have converted a variety of sense-tagged text into the Senseval-2
format. We provide both the converted data
and the source code used to create it.
-
Process Senseval-2 formatted data using the Brill POS Tagger and
the Collins Parser.
-
Tools for automatic and manual alignment of parallel text.
Web Mining
-
GoogleHack finds sets of related words using the Google search engine.
By:
Ted Pedersen
- tpederse AT d umn edu