CS 8995 Corpus Based Natural Language Processing
UNDER CONSTRUCTION. May be revised in response to questions. LAST
UPDATE Monday April 30 5pm
Final Project: Empirical Methods for Multilingual Text
Stage 3 - Due Thu, May 10, 4 pm
Objectives
To create a bilingual translation dictionary from sentence aligned
corpora using the EM algorithm.
Specification
This is an OPTIONAL extra credit project stage. If you complete
it successfully, you could earn up to 150% credit for the final project (30
out of 20 possible points). You may work in teams of one, two or three
members. Each team that intends to work on this stage must advise me of
this via email by 4pm Friday April 20. In this email include the names
and email addresses of all team members. I will confirm via email and
include a team name that you should use during this stage.
For all code submitted, please include a header comment that lists the
name of your team and the team members, as well as brief instructions on
how to use the program. EACH TEAM SHOULD WORK INDEPENDENTLY OF ALL OTHER
TEAMS. If you would like to discuss this stage with a human being, please
consult with your teammates only! Nobody else should be involved in any
way with your team! The code that you submit must be completely written
by the members of your team and your team alone. You are not to share code
with other teams for any reason. Even small violations of this policy
will result in the involved teams not receiving any credit.
If you are aware of alignment tools that are available as free software or
are in the public domain, you are free to use those instead. However, you
should mention this in your report and clearly indicate in the code
comments where you obtained this material. Also remember that
you will need to make sure that you follow the stage 2 convention
regarding the output format of sentence aligned text.
This project stage includes the following:
- Create Sentence Aligned Corpora
Select three parallel corpora as collected for assignment 4. Each of these
corpora should be for a different language pair (e.g., French/English,
German/English, Spanish/English). None of your data should have been
created by a member of your team. Those corpora are available
here. Remember that if you find problems in
a corpus you should report those to me for a small bit of extra credit.
The creator of the corpus will have three days to correct the problem
without penalty.
Use your stage 2 alignment tool to perform the sentence level alignment.
In your write up, make sure to clearly indicate which team's stage 2 tools
you used.
- Build a Translation Dictionary using the EM algorithm
Implement the EM algorithm to estimate word for word
translation/generation parameters from sentence aligned corpora. Remember
that we are using a simplified model that only considers the word by word
translation parameters (t), not fertility, spurious words, or distortion
parameters. If you would like to introduce some assumptions regarding
fertility, spurious words, etc. you are welcome to do so. Just make sure
that those are documented in your report.
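The simplified model above can be sketched as follows. This is a minimal illustration in Python, not the required Perl program and not a reference solution. It assumes the aligned file has already been read into a list of (source words, target words) pairs, since the stage 2 file format is not repeated here. The function name `train_em`, the uniform starting values, and the fixed iteration count are all assumptions made for the example. It estimates t(target word | source word) only, with no fertility, spurious word, or distortion parameters.

```python
from collections import defaultdict

def train_em(pairs, iterations=10):
    """Estimate t[(s, w)] = P(target word w | source word s).

    pairs: list of (source_tokens, target_tokens), one per aligned sentence.
    Hypothetical helper; the input representation is an assumption.
    """
    # Start every co-occurring (source, target) pair at the same value.
    target_vocab = {w for _, tgt in pairs for w in tgt}
    t = {}
    for src, tgt in pairs:
        for s in src:
            for w in tgt:
                t[(s, w)] = 1.0 / len(target_vocab)

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of s generating w
        total = defaultdict(float)  # expected count of s generating anything
        # E-step: share each target word among the source words in its sentence.
        for src, tgt in pairs:
            for w in tgt:
                norm = sum(t[(s, w)] for s in src)
                for s in src:
                    frac = t[(s, w)] / norm
                    count[(s, w)] += frac
                    total[s] += frac
        # M-step: renormalize so each source word's probabilities sum to 1.
        for (s, w), c in count.items():
            t[(s, w)] = c / total[s]
    return t
```

Only word pairs that co-occur in at least one aligned sentence are stored, which keeps the table small. Any extra assumptions about fertility or spurious words would change the E-step and should be documented in the report.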
Your program should accept command line arguments as follows:
teamname-em.pl number source target file cutoff
Your program should display the top [number] most likely translation
pairs and their associated probabilities, according to the parameter
estimates computed. Your program should process the aligned sentences
contained in [file] where the language being translated from is specified
by [source] and the language being translated to is specified by [target].
Words that occur [cutoff] times or fewer are ignored by the
EM algorithm and are not included in the translation pairs.
For example:
guadalajara-em.pl 5 english spanish corpus1.aligned 2
could result in output such as:
yes si 0.7653
there alli 0.7600
for por 0.6332
I yo 0.5433
are esta 0.4321
This command 'says' to display the top five English-Spanish
translation pairs from the sentence aligned file named corpus1.aligned.
Any words that occur 2 or fewer times in that file are not
considered by the EM algorithm.
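The argument order and the cutoff rule described above might be handled along these lines. This is a hedged Python sketch for illustration only; the program you submit is the Perl script named above. The helper names `parse_args` and `apply_cutoff` are hypothetical, the sentences are assumed to be already tokenized into word lists, and counting word frequencies separately for each language is an assumption that the specification does not spell out.

```python
from collections import Counter

def parse_args(argv):
    """Split the five arguments: number source target file cutoff."""
    if len(argv) != 5:
        raise SystemExit("usage: teamname-em.pl number source target file cutoff")
    number, source, target, path, cutoff = argv
    return int(number), source, target, path, int(cutoff)

def apply_cutoff(pairs, cutoff):
    """Drop every word that occurs cutoff times or fewer in the file.

    pairs: list of (source_tokens, target_tokens).
    Frequencies are counted per language side (an assumption).
    """
    src_counts = Counter(w for src, _ in pairs for w in src)
    tgt_counts = Counter(w for _, tgt in pairs for w in tgt)
    return [
        ([w for w in src if src_counts[w] > cutoff],
         [w for w in tgt if tgt_counts[w] > cutoff])
        for src, tgt in pairs
    ]
```

With a cutoff of 0 nothing is removed, because every word in the file occurs at least once. Filtering before training means the ignored words never receive any probability mass, which matches the statement that they are not considered by the EM algorithm.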
Note that there may be more than one translation pair for
a given target or source word. For example, the English word "bill" can
be translated to Spanish as "pico" if we are referring to a "duck bill"
or "cuenta" if referring to a "bill" that one must pay.
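One way to produce the ranked listing is to sort all (source, target) pairs together by probability, so that a word such as "bill" can appear in more than one pair. The Python sketch below is illustrative only. It assumes the estimates are held in a dictionary keyed by (source word, target word), and the names `top_pairs` and `format_pairs` are hypothetical. The probabilities in the usage lines are made-up values, not results.

```python
def top_pairs(t, number):
    """Return the `number` most probable ((source, target), prob) entries."""
    # Sort by descending probability; break ties alphabetically for stable output.
    ranked = sorted(t.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:number]

def format_pairs(ranked):
    """Format entries as 'source target probability' with four decimals."""
    return [f"{s} {w} {p:.4f}" for (s, w), p in ranked]

# Made-up values for illustration only.
example = {("bill", "pico"): 0.40, ("bill", "cuenta"): 0.55, ("yes", "si"): 0.7653}
for line in format_pairs(top_pairs(example, 2)):
    print(line)
```

Ranking globally rather than keeping one best translation per word is a design choice; it lets both senses of an ambiguous word surface if their probabilities are high enough.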
- Evaluation
Do the following for [cutoff] = 0, 1, and 5:
- Find the top 20 ranked translation pairs for each of the
three sentence aligned corpora that you created. Manually check these
using a bilingual dictionary. How many of those 20 are correct?
- Reverse the source and target languages and run again for all cutoff
values and corpora. Do you get the same results?
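The bookkeeping for these checks could be sketched as below. This Python fragment is an illustration under stated assumptions: the manual dictionary check is represented as a set of pairs you have judged correct by hand, and the helper names `count_correct` and `shared_with_reverse` are hypothetical. It does not replace the manual check itself.

```python
def count_correct(ranked_pairs, judged_correct):
    """Count ranked (source, target) pairs that were judged correct by hand."""
    return sum(1 for pair in ranked_pairs if pair in judged_correct)

def shared_with_reverse(forward, reverse):
    """Return forward (source, target) pairs that also appear in the reversed run."""
    # The reversed run lists each pair the other way round, so flip it back first.
    flipped = {(b, a) for (a, b) in reverse}
    return [pair for pair in forward if pair in flipped]
```

Running both helpers once per corpus and cutoff value gives the counts needed to compare the original and reversed directions.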
- Write up
Prepare a short written summary of your experimental methodology and
results. This should describe the following:
- The stage 2 alignment tools you used.
- The assignment 4 corpora you aligned (and how you may have
renamed them).
- The translation pairs produced for cutoffs = 0, 1, and 5 for
each of your three language pairs. (You should have 9 sets of translation
pairs.) In the listing of each translation pair, indicate which pairs are
correct.
- The translation pairs produced for cutoffs = 0, 1, and 5 for
each of your three language pairs when the source and the target are
reversed. (You should have 9 sets of translation pairs in "reverse"
order.) In the listing of each translation pair, indicate which pairs are
the same as you found in the original order.
- Your analysis of the results. Focus on the following issues:
Do the translation pairs change substantially as cutoff varies? Do they
appear to get better/worse? Do the translation pairs change substantially
when the source and target language are reversed? What accounts for this?
- Attendance at lecture
I will provide additional information and help during the lecture sessions
of Mon Apr 30 and Wed May 02. All members of your team are
required to attend these lectures if you are participating in stage 3.
If a team member is not present, I will drop them from your team.
Other information
You should turn in 5 items. Please use turnin to submit all items.
Remember that you can only use turnin from hh33812. No email submission
is necessary. The proper turnin commands are:
turnin -c cs8995 -p p3a teamname-em.pl (em algorithm implementation)
turnin -c cs8995 -p p3b1 teamname-1.utx (aligned corpus 1)
turnin -c cs8995 -p p3b2 teamname-2.utx (aligned corpus 2)
turnin -c cs8995 -p p3b3 teamname-3.utx (aligned corpus 3)
turnin -c cs8995 -p p3c teamname.(pdf|ps) (written report)
This is a team project. Please consult with and work with your team
members closely. You may divide the work as you see fit, and you have
considerable discretion in your approach to this problem. Do not discuss
your stage 3 work with anyone outside of your team.
Please note that the deadline will be enforced by automatic means. Any
submissions after the deadline will not be received.
There will be no partial credit given unless you have a nearly working EM
algorithm program. In other words, if you turn in aligned sentences and a
write-up with an EM algorithm that is significantly flawed, you will get no
credit.
I will not be available to assist you during finals week, so please
resolve any questions you might have about the project before Fri May
4.
by:
Ted Pedersen
- tpederse@d.umn.edu