Ngram Statistics Package (NSP)
NSP allows you to identify word and character Ngrams that appear in
large corpora using standard tests of association such as Fisher's
exact test, the log likelihood ratio, Pearson's chi-squared test, the
Dice Coefficient, etc. NSP has been designed to allow a user to add
their own tests with minimal effort.
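To make the idea of scoring an Ngram with a test of association concrete, here is a minimal Python sketch for the bigram case. It is not NSP's code or interface (NSP itself is written in Perl); the function names `bigram_counts` and `association_scores` and the toy sentence are assumptions made purely for illustration. The sketch builds the 2x2 contingency table for a word pair and computes the Dice Coefficient, Pearson's chi-squared statistic, and the log likelihood ratio from it.

```python
import math
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent word pairs and how often each word fills each slot."""
    pairs = Counter(zip(tokens, tokens[1:]))
    left = Counter()   # times a word appears as the first word of a bigram
    right = Counter()  # times a word appears as the second word of a bigram
    for (w1, w2), n in pairs.items():
        left[w1] += n
        right[w2] += n
    return pairs, left, right


def association_scores(n11, n1p, np1, npp):
    """Score one bigram from its 2x2 contingency table.

    n11: count of the bigram (w1 followed by w2)
    n1p: count of bigrams with w1 in the first position
    np1: count of bigrams with w2 in the second position
    npp: total number of bigrams in the sample
    """
    # Observed cell counts: (w1, w2), (w1, not w2), (not w1, w2), (neither).
    observed = [n11, n1p - n11, np1 - n11, npp - n1p - np1 + n11]
    # Expected cell counts if w1 and w2 occurred independently.
    expected = [
        n1p * np1 / npp,
        n1p * (npp - np1) / npp,
        (npp - n1p) * np1 / npp,
        (npp - n1p) * (npp - np1) / npp,
    ]
    cells = [(o, e) for o, e in zip(observed, expected) if e > 0]
    dice = 2 * n11 / (n1p + np1)
    chi2 = sum((o - e) ** 2 / e for o, e in cells)
    # Cells with an observed count of zero contribute nothing to the sum.
    ll = 2 * sum(o * math.log(o / e) for o, e in cells if o > 0)
    return {"dice": dice, "x2": chi2, "ll": ll}


# Toy sample (an assumption for illustration, not NSP data).
tokens = "new york is a new city in new york".split()
pairs, left, right = bigram_counts(tokens)
total = sum(pairs.values())
scores = association_scores(
    pairs[("new", "york")], left["new"], right["york"], total
)
print(scores)
```

Higher scores indicate that the two words occur together more often than their individual frequencies would predict. On a sample this small the statistics are not reliable; the sketch only shows how the table and the scores relate.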
We have a mailing list designed to support NSP users.
If you would like to report a bug or request a feature, please do that
here!
Download the Current Version (v1.31, released October 4, 2015) from CPAN or SourceForge
Publications
- The Design, Implementation, and Use of the Ngram Statistics Package
(Banerjee and Pedersen) - Appears in the Proceedings of the Fourth
International Conference on Intelligent Text Processing and Computational
Linguistics, February 17-21, 2003, Mexico City
[Please cite this paper if you use NSP; it is our "official" description
of the package.]
- Fishing for Exactness (Pedersen) - Appears in the Proceedings of the
South-Central SAS Users Group Conference (SCSUG-96), Oct 27-29, 1996,
Austin, TX [introduces Fisher's exact test for collocation identification]
- Significant Lexical Relationships
(Pedersen, Kayaalp, & Bruce) - Appears in the Proceedings of the
Thirteenth National Conference on Artificial Intelligence (AAAI-96),
August 4-8, 1996, Portland, OR [good introduction to the theoretical
limits of 2x2 contingency table testing, and also the exact conditional
test]
Bibliography (papers by users of the Ngram Statistics Package)
Misc
- A note
on working with non-English alphabets.
- The Ngram Statistics Package (NSP) was formerly known as the Bigram
Statistics Package (BSP).
NSP Behind the Scenes
NSP has been used extensively in SenseClusters and in the Duluth word
sense disambiguation systems for Senseval-2 and Senseval-3.
NSP Development Team
Acknowledgments
The development of the Ngram Statistics Package has been supported by a
National Science Foundation
Faculty Early Career Development (CAREER) Program award (#0092784,
2001-2007), and by a Grant in Aid of Research,
Artistry and Scholarship from the Graduate School of the University of
Minnesota (2000-2001).
By:
Ted Pedersen
- tpederse AT d umn edu