Free Software

[This page is out of date. Please contact me for more current info.]

This is a directory of software developed by the Natural Language Processing Group at the University of Minnesota, Duluth. Most of it is written in Perl, and all of it is freely available under the terms of the GNU General Public License (GPL). Many of these projects are available via CPAN and SourceForge.

Unsupervised Corpus Based Clustering of Similar Contexts

SenseClusters
SenseClusters is a package of Perl programs that allows a user to cluster similar contexts together using unsupervised knowledge-lean methods. These techniques have been applied to word sense discrimination, email categorization, and name discrimination.

Collocation Identification

Ngram Statistics Package (NSP)
NSP allows you to identify word n-grams in large corpora using standard tests of association such as Fisher's exact test, the log likelihood ratio, Pearson's chi-squared test, and the Dice Coefficient.
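
To give a feel for what such association measures compute, here is a minimal sketch in Python (not Perl, and not NSP's own code or interface). It scores adjacent word pairs with the Dice coefficient and the log likelihood ratio, both derived from a 2x2 contingency table of counts. The function name `bigram_scores` and the toy sentence are assumptions made for illustration only.

```python
import math
from collections import Counter

def bigram_scores(tokens):
    """Return {(w1, w2): (dice, g2)} for every adjacent word pair.

    Hypothetical helper for illustration; not part of NSP.
    """
    bigrams = Counter(zip(tokens, tokens[1:]))
    left = Counter(tokens[:-1])    # how often each word fills the first slot
    right = Counter(tokens[1:])    # how often each word fills the second slot
    total = len(tokens) - 1        # total number of bigrams in the sample
    scores = {}
    for (w1, w2), n11 in bigrams.items():
        n1p, np1 = left[w1], right[w2]
        # Dice coefficient: joint count relative to the two marginal counts.
        dice = 2 * n11 / (n1p + np1)
        # Observed and expected counts for the four cells of the 2x2 table.
        observed = [n11, n1p - n11, np1 - n11, total - n1p - np1 + n11]
        expected = [
            n1p * np1 / total,
            n1p * (total - np1) / total,
            (total - n1p) * np1 / total,
            (total - n1p) * (total - np1) / total,
        ]
        # Log likelihood ratio (G^2); cells with zero observed count add nothing.
        g2 = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
        scores[(w1, w2)] = (dice, g2)
    return scores

if __name__ == "__main__":
    words = "new york is big and new york is old".split()
    ranked = sorted(bigram_scores(words).items(), key=lambda kv: -kv[1][1])
    for pair, (dice, g2) in ranked:
        print(pair, round(dice, 3), round(g2, 3))
```

Pairs whose words nearly always occur together get a Dice score close to 1 and a high log likelihood ratio. On a sample this small the numbers mean little; the measures are meant for large corpora.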

WordNet Resources

WordNet::Similarity
WordNet::Similarity allows you to measure the similarity and relatedness of two concepts in the WordNet lexical database using a variety of measures of semantic similarity and relatedness.
WordNet::SenseRelate
WordNet::SenseRelate allows you to assign meanings to each content word in a text. It does this by determining which sense of a word is most related to its neighbors.
WordNet Utilities
A few miscellaneous programs that help us work with WordNet.
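
The idea behind WordNet::SenseRelate, picking the sense of a word that is most related to its neighbors, can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the tiny sense inventory, the gloss-overlap `relatedness` function, and the name `pick_sense` are all made up here, and none of this reflects the package's actual interface or the measures it uses.

```python
# Hypothetical toy inventory: word -> {sense label: gloss}. Not WordNet data.
TOY_INVENTORY = {
    "bank": {
        "bank#1": "financial institution that accepts deposits of money",
        "bank#2": "sloping land beside a river or stream of water",
    },
    "money": {"money#1": "medium of exchange issued by a financial institution"},
    "river": {"river#1": "large natural stream of water"},
}

def relatedness(gloss_a, gloss_b):
    """Assumed stand-in measure: number of distinct words the glosses share."""
    return len(set(gloss_a.split()) & set(gloss_b.split()))

def pick_sense(target, neighbors, inventory=TOY_INVENTORY):
    """Return the sense of `target` with the highest total relatedness
    to the best-matching sense of each neighboring word."""
    best_sense, best_score = None, -1
    for sense, gloss in inventory[target].items():
        score = sum(
            max((relatedness(gloss, g) for g in inventory[n].values()), default=0)
            for n in neighbors
            if n in inventory
        )
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

if __name__ == "__main__":
    print(pick_sense("bank", ["money"]))
    print(pick_sense("bank", ["river"]))
```

With "money" as the neighbor the financial sense scores highest, and with "river" the land sense does. A real system would draw its senses from WordNet and use a proper measure of semantic relatedness in place of raw word overlap.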

UMLS Resources

UMLS::Similarity
UMLS::Similarity allows you to measure the similarity and relatedness of two concepts in the Unified Medical Language System (UMLS) using a variety of measures of semantic similarity and relatedness.
UMLS::Interface
UMLS::Interface provides a Perl interface to the Unified Medical Language System (UMLS) and supplies much of the functionality that UMLS::Similarity relies on.

Supervised Methods of Word Sense Disambiguation

SenseTools
This is a suite of tools that allows for easy creation of supervised word sense disambiguation experiments.
WSD Shell
This is a greatly improved version of the Duluth-Shell as used in the DuluthX Senseval-2 systems. It makes it easier to run large numbers of experiments, and provides many detailed reporting options.
SyntaLex
This extends the Duluth Senseval-2 systems with part of speech and syntactic features. This system participated in Senseval-3 (2004).
Duluth Senseval-3 Systems
Complete source code and documentation for the Duluth systems that participated in the Senseval-3 (2004) comparative exercise among word sense disambiguation systems. This includes supervised lexical sample systems based on the Duluth Senseval-2 systems, and a new unsupervised lexical sample system.
DuluthX Senseval-2 Systems
Complete source code and documentation for the Duluth systems that participated in the lexical sample tasks of the Senseval-2 (2001) comparative exercise among word sense disambiguation systems. These systems rely on lexical features like unigrams, bigrams, and co-occurrences.
WSD Gate
This is a complete word sense disambiguation system that integrates NSP and Weka into the GATE environment.
CuiTools
This is a complete word sense disambiguation system that assigns senses to biomedical text based on the UMLS.

Data and Data Creation Tools

Senseval-2 Format Conversions
We support conversions of data in a number of formats into the Senseval-2 format for lexical sample word sense disambiguation. You can find those tools here!
Senseval-2 Formatted Data
We have converted a variety of sense-tagged text into the Senseval-2 format. We provide both copies of the converted data as well as the source code used to create it.
POS Tagging and Parsing Tools
Process Senseval-2 formatted data using the Brill POS Tagger and the Collins Parser.
Tools for Parallel Text
Tools for automatic and manual alignment of parallel text.

Web Mining

GoogleHack
GoogleHack finds sets of related words using the Google search engine.

By: Ted Pedersen - tpederse AT d umn edu