Sense Tagged Text
This page contains versions of the Senseval-1, Senseval-2, line, hard,
serve, and interest data that have been converted to a common format
(Senseval-2), POS tagged, and parsed. We have also created a page
that provides disambiguated name discrimination data and a
topic-annotated version of the Enron Corpus.
Our general strategy has been to convert sense-tagged text to the
Senseval-1 format using tools provided below, and then rely on the
program Sval1to2 to convert from Senseval-1 to Senseval-2 format.
The POS tagging was done with the Brill tagger using our package posSenseval. The parsing was done with the Collins
parser using our package parseSenseval.
-
Senseval-1
12,000+ instances of 35 words as used in the
Senseval-1 evaluation exercise.
There are some anomalies in the Senseval-1 data that are
described in this README. We
recommend that you get the corrected data (with fixes).
Senseval-1 format (with fixes)
Senseval-2 format (with fixes)
Senseval-2 format (with fixes and POS tags)
Senseval-2 format (with fixes and parsed)
Senseval-1 format (no fixes)
Senseval-2 format (no fixes)
-
Senseval-1 Dry Run
20,000+ instances of 38 words. Distributed prior to Senseval-1 as a
practice run. It uses a different format than the Senseval-1 exercise, so a
conversion program is included with the data. See the README for general
information.
Senseval-2 format
Conversion tool
-
Senseval-2 Data
12,000+ instances of 73 words.
Senseval-2 data cleaned by posSenseval
Senseval-2 data with POS tags from posSenseval
Senseval-2 data parsed via parseSenseval
Senseval-2 keys in standard and
SenseClusters format
Senseval-2 training data in plain
text format (no xml markup)
-
Senseval-3 Data
57 different words, with the nouns and adjectives tagged with WordNet senses
and the verbs with Wordsmyth senses. The official distribution of the
data is here.
Note that the Senseval-3 data uses the Senseval-2 format, so no
conversion is necessary.
Senseval-3 data parsed
-
line
4000+ instances of the noun line, tagged with 6 WordNet senses. See the
README for general information.
original format
original format with two duplicate instances removed
Conversion to Senseval-1 tool
Senseval-1 format
Senseval-2 format
Senseval-2 format with POS tags
Senseval-2 format parsed
-
hard
4000+ instances of the adjective hard, tagged with 3 WordNet senses. See
the README for general information.
original format
original format without ^M characters and with duplicate instances removed
Conversion to Senseval-1 tool
Program to create unique instance ids in original hard data
Senseval-1 format
Senseval-2 format
Senseval-2 format with POS tags
Senseval-2 format parsed
-
serve
4000+ instances of the verb serve, tagged with 4 WordNet senses. See the README for general information.
original format
original format without ^M characters
Conversion to Senseval-1 tool
Senseval-1 format
Senseval-2 format
Senseval-2 format with POS tags
Senseval-2 format parsed
-
interest
2369 instances of the noun interest from the ACL/DCI Treebank,
tagged with 6 LDOCE senses. See the README for general information.
original format (ftp from NMSU) or a local copy
Conversion to Senseval-1 tool
original format without POS tags
Senseval-1 format with original POS tags
Senseval-2 format with original POS tags
Senseval-2 format with POS tags
Senseval-2 format parsed
Senseval-1 format without POS tags
Senseval-2 format without POS tags
Misc. Notes
The official DTD file as provided by the Senseval-2 organizers is here. Please note that our
converted data will not "parse" as true XML. This is mainly because
characters that require special handling in XML (such as & and <) are
not escaped in the original sense-tagged text. We are considering
ways to make this data "true" XML, and would be most grateful for any
feedback on how best to do this. [TDP Feb 9, 2003]
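One possible approach to the escaping problem, offered as a minimal sketch rather than a tested fix, is to escape the XML special characters in each context while leaving the markup alone. The Python below assumes that <head> is the only tag that legitimately appears inside a context and that the text contains no entities that are already escaped (those would be escaped a second time). The names escape_context and KNOWN_TAG are hypothetical and are not part of any package listed on this page.

```python
import re
from xml.sax.saxutils import escape

# Assumption: <head> and </head> are the only real tags inside a context.
KNOWN_TAG = re.compile(r"(</?head>)")


def escape_context(text):
    """Escape &, < and > in running text, leaving <head> tags intact."""
    # Splitting on a capturing group keeps the matched tags in the
    # result, at the odd indices of the list.
    parts = KNOWN_TAG.split(text)
    return "".join(
        part if i % 2 == 1 else escape(part)
        for i, part in enumerate(parts)
    )
```

Applied to the text of each context, this would turn a bare & into &amp; and a bare < into &lt; without disturbing the tag that marks the target word.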
We have a program OMtoSval2 that converts sense-tagged text in the
Open Mind format to Senseval-2 format, although we do not provide any
versions of the Open Mind data since it is continually evolving.
We also provide several programs to help verify Senseval-2 formatted
data. These are found in the Sval2Check package. They check the validity
of a Senseval-2 formatted file (sval2parser.pl) and identify duplicate
instance ids and contexts (sval2dups.pl), which may signal problems in
the data.
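The duplicate-id check can be illustrated with a short Python sketch. This is not the code in sval2dups.pl. It is only an illustration that assumes each instance opens with a tag of the form <instance id="...">, and it uses a regular expression rather than an XML parser because the converted data may not parse as XML. The name duplicate_instance_ids is hypothetical.

```python
import re
from collections import Counter

# Assumption: every instance starts with <instance id="..."> and the id
# is the first attribute of the tag.
INSTANCE_ID = re.compile(r'<instance\s+id="([^"]+)"')


def duplicate_instance_ids(text):
    """Return a sorted list of instance ids that occur more than once."""
    counts = Counter(INSTANCE_ID.findall(text))
    return sorted(inst_id for inst_id, n in counts.items() if n > 1)
```

Calling duplicate_instance_ids on the full contents of a data file gives the ids that would need attention. Detecting duplicate contexts would need a similar pass over the context text.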
Please note that we have only converted, tagged, and parsed the data; we
did not do any of the sense-tagging! Please see the associated READMEs
for proper crediting of the sense-taggers.
By:
Ted Pedersen
- tpederse AT d umn edu