Ted Pedersen - CS 8995 Corpus Based Natural Language Processing

CS 8995 Corpus Based Natural Language Processing

UNDER CONSTRUCTION. May be revised in response to questions. LAST UPDATE Monday April 30 5pm

Final Project: Empirical Methods for Multilingual Text

Stage 3 - Due Thu, May 10, 4 pm

Objectives

To create a bilingual translation dictionary from sentence aligned corpora using the EM algorithm.

Specification

This is an OPTIONAL extra credit project stage. If you complete it successfully, you could earn up to 150% credit for the final project (30 out of 20 possible points). You may work in teams of one, two or three members. Each team that intends to work on this stage must notify me via email by 4pm Friday April 20. In this email include the names and email addresses of all team members. I will confirm via email and include a team name that you should use during this stage.

For all code submitted, please include a header comment that lists the name of your team and the team members, as well as brief instructions on how to use the program. EACH TEAM SHOULD WORK INDEPENDENTLY OF ALL OTHER TEAMS. If you would like to discuss this stage with a human being, please consult with your teammates only! Nobody else should be involved in any way with your team! The code that you submit must be completely written by the members of your team and your team alone. You are not to share code with other teams for any reason. Even small violations of this policy will result in the involved teams not receiving any credit.

If you are aware of alignment tools that are available as free software or are in the public domain, you are free to use those instead of a stage 2 alignment tool. However, you should mention this in your report and clearly indicate in the comments on the code where you obtained this material. Also remember that you must follow the stage 2 convention regarding the output format of sentence aligned text.

This project stage includes the following:

Create Sentence Aligned Corpora
Select three parallel corpora as collected for assignment 4. Each of these corpora should be for a different language pair (e.g., French/English, German/English, Spanish/English). None of your data should have been created by a member of your team. Those corpora are available here. Remember that if you find problems in a corpus you should report those to me for a small bit of extra credit. The creator of the corpus will have three days to correct the problem without penalty.

Use your stage 2 alignment tool to perform the sentence level alignment. In your write up, make sure to clearly indicate which team's stage 2 tools you used.
Build a Translation Dictionary using the EM algorithm
Implement the EM algorithm to estimate word for word translation/generation parameters from sentence aligned corpora. Remember that we are using a simplified model that only considers the word by word translation parameters (t), not fertility, spurious words, or distortion parameters. If you would like to introduce some assumptions regarding fertility, spurious words, etc. you are welcome to do so. Just make sure that those are documented in your report.
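The simplified model described above can be sketched as follows. This is only an illustration in Python, not the required implementation, and it rests on several assumptions that are not part of the specification: the function name `train_em` is hypothetical, sentence pairs are already tokenized in memory, the parameter is interpreted as t(target word | source word), there is no NULL word, and a fixed number of iterations is used in place of a convergence test.

```python
from collections import defaultdict

def train_em(sentence_pairs, iterations=10):
    """Estimate t(target_word | source_word) from aligned sentence pairs.

    sentence_pairs: list of (source_tokens, target_tokens) tuples.
    Returns a dict keyed by (source_word, target_word).
    Assumption: only the word-for-word parameters t are modeled.
    """
    target_vocab = {w for _, tgt in sentence_pairs for w in tgt}
    uniform = 1.0 / max(len(target_vocab), 1)
    t = defaultdict(lambda: uniform)  # start from a uniform distribution

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (source, target) links
        total = defaultdict(float)  # expected count of links from each source word

        # E-step: share each target word among the source words in its sentence.
        for src, tgt in sentence_pairs:
            for tw in tgt:
                norm = sum(t[(sw, tw)] for sw in src)
                for sw in src:
                    frac = t[(sw, tw)] / norm
                    count[(sw, tw)] += frac
                    total[sw] += frac

        # M-step: renormalize expected counts into probabilities.
        t = defaultdict(float, {pair: c / total[pair[0]] for pair, c in count.items()})
    return dict(t)

# Toy corpus (invented for illustration only).
toy = [
    (["the", "house"], ["la", "casa"]),
    (["the", "table"], ["la", "mesa"]),
    (["a", "table"], ["una", "mesa"]),
]
t = train_em(toy)
```

On this toy corpus the estimate for ("the", "la") ends up higher than for ("the", "casa"), because "la" co-occurs with "the" in two sentences while "casa" co-occurs with it in only one.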
Your program should accept command line arguments as follows:
```
teamname-em.pl number source target file cutoff
```
Your program should display the [number] most likely translation pairs and their associated probabilities, according to the computed parameter estimates. It should process the aligned sentences in [file], where [source] names the language being translated from and [target] names the language being translated to. Words that occur [cutoff] or fewer times are ignored by the EM algorithm and will not be included in the translation pairs.

For example:
```
guadalajara-em.pl 5 english spanish corpus1.aligned 2
```
could result in output such as:
```
yes si 0.7653
there alli 0.7600
for por 0.6332
I yo 0.5433
are esta 0.4321
```
This command 'says' that you will display the top five English-Spanish translation pairs from the sentence aligned file named corpus1.aligned. Any words that occur 2 or fewer times in that file are not considered by the EM algorithm.
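The argument handling, the cutoff, and the output format described above might be sketched as below. This is an illustrative Python sketch, not the required program. The helper names `parse_args`, `apply_cutoff` and `top_pairs` are hypothetical, and it assumes that word frequencies are counted separately for the source side and the target side of the file, which the specification does not state.

```python
from collections import Counter

def parse_args(argv):
    """argv mirrors the command line: number source target file cutoff."""
    if len(argv) != 5:
        raise SystemExit("usage: teamname-em.pl number source target file cutoff")
    number, source, target, path, cutoff = argv
    return int(number), source, target, path, int(cutoff)

def apply_cutoff(sentence_pairs, cutoff):
    """Drop words that occur cutoff or fewer times (assumed: counted per side)."""
    src_freq = Counter(w for src, _ in sentence_pairs for w in src)
    tgt_freq = Counter(w for _, tgt in sentence_pairs for w in tgt)
    return [
        ([w for w in src if src_freq[w] > cutoff],
         [w for w in tgt if tgt_freq[w] > cutoff])
        for src, tgt in sentence_pairs
    ]

def top_pairs(t, number):
    """Format the most probable pairs as 'source target probability' lines."""
    # Sort by descending probability; ties are broken alphabetically.
    ranked = sorted(t.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{s} {w} {p:.4f}" for (s, w), p in ranked[:number]]

args = parse_args(["5", "english", "spanish", "corpus1.aligned", "2"])
filtered = apply_cutoff([(["a", "b"], ["x", "y"]), (["a"], ["x"])], 1)
lines = top_pairs({("yes", "si"): 0.7653, ("for", "por"): 0.6332,
                   ("there", "alli"): 0.76}, 2)
```

With a cutoff of 0 every word that occurs at least once is kept, so the filter only changes the data for cutoffs of 1 and above.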

Note that there may be more than one translation pair for a given target or source word. For example, the English word "bill" can be translated to Spanish as "pico" if we are referring to a "duck bill" or "cuenta" if referring to a "bill" that one must pay.
Evaluation
Do the following for [cutoff] = 0, 1, and 5:
- Find the top 20 ranked translation pairs for each of the three sentence aligned corpora that you created. Manually check these using a bilingual dictionary. How many of those 20 are correct?
- Reverse the source and target languages and run again for all cutoff values and corpora. Do you get the same results?
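The bookkeeping for these two checks could be done along the following lines. This is a minimal Python sketch under stated assumptions: the function names are hypothetical, the set of correct pairs comes from your own manual dictionary check, and "the same results" is read as the same word pair appearing in both directions.

```python
def count_correct(ranked_pairs, judged_correct):
    """Count how many ranked (source, target) pairs were judged correct by hand."""
    return sum(1 for pair in ranked_pairs if pair in judged_correct)

def shared_with_reverse(forward_pairs, reverse_pairs):
    """Return forward pairs that also appear, flipped, in the reverse run."""
    flipped = {(tgt, src) for src, tgt in reverse_pairs}
    return [pair for pair in forward_pairs if pair in flipped]

# Invented example data: an English-to-Spanish run and its reverse.
forward = [("yes", "si"), ("there", "alli"), ("are", "esta")]
reverse = [("si", "yes"), ("por", "for"), ("alli", "there")]
judged = {("yes", "si"), ("there", "alli")}

n_correct = count_correct(forward, judged)
shared = shared_with_reverse(forward, reverse)
```

Here two of the three forward pairs were judged correct, and the same two pairs also appear in the reverse run.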
Write up
Prepare a short written summary of your experimental methodology and results. This should describe the following:
- The stage 2 alignment tools you used.
- The assignment 4 corpora you aligned (and how you may have renamed them).
- The translation pairs produced for cutoffs = 0, 1, and 5 for each of your three language pairs. (You should have 9 sets of translation pairs.) In the listing of each translation pair, indicate which pairs are correct.
- The translation pairs produced for cutoffs = 0, 1, and 5 for each of your three language pairs when the source and the target are reversed. (You should have 9 sets of translation pairs in "reverse" order.) In the listing of each translation pair, indicate which pairs are the same as you found in the original order.
- Your analysis of the results. Focus on the following issues: Do the translation pairs change substantially as cutoff varies? Do they appear to get better/worse? Do the translation pairs change substantially when the source and target language are reversed? What accounts for this?
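The set of runs implied by this list (three corpora, three cutoffs, two directions) can be enumerated as in the Python sketch below. It is illustrative only: the function name `planned_runs` is hypothetical, the language pairs are examples, and it assumes that the same aligned file is used for both directions and that the file names follow the teamname-N.utx pattern used for submission.

```python
from itertools import product

def planned_runs(team, language_pairs, cutoffs=(0, 1, 5), number=20):
    """List the command lines for every corpus, cutoff, and direction."""
    commands = []
    corpora = enumerate(language_pairs, start=1)
    for (index, (source, target)), cutoff in product(corpora, cutoffs):
        aligned = f"{team}-{index}.utx"
        # Original order first, then source and target reversed.
        for src, tgt in ((source, target), (target, source)):
            commands.append(f"{team}-em.pl {number} {src} {tgt} {aligned} {cutoff}")
    return commands

runs = planned_runs(
    "guadalajara",
    [("english", "spanish"), ("english", "french"), ("english", "german")],
)
```

This yields 18 command lines in total: 9 in the original order and 9 in the reverse order, matching the two sets of 9 listed above.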
Attendance at lecture
I will provide additional information and help during the lecture sessions of Mon Apr 30 and Weds May 02. All members of your team are required to attend these lectures if you are participating in stage 3. If a team member is not present I will drop them from your team.

Other information

You should turn in 5 items. Please use turnin to submit all items. Remember that you can only use turnin from hh33812. No email submission is necessary. The proper turnin commands are:

turnin -c cs8995 -p p3a teamname-em.pl (em algorithm implementation)
turnin -c cs8995 -p p3b1 teamname-1.utx (aligned corpus 1)
turnin -c cs8995 -p p3b2 teamname-2.utx (aligned corpus 2)
turnin -c cs8995 -p p3b3 teamname-3.utx (aligned corpus 3)
turnin -c cs8995 -p p3c teamname.(pdf|ps) (written report)

This is a team project. Please consult with and work with your team members closely. You may divide the work as you see fit, and you have considerable discretion in your approach to this problem. Do not discuss your stage 3 work with anyone outside of your team.

Please note that the deadline will be enforced by automatic means. Any submissions after the deadline will not be received.

There will be no partial credit given unless you have a nearly working EM algorithm program. In other words, if you turn in aligned sentences and a write-up with an EM algorithm that is significantly flawed you will get no credit.

I will not be available to assist you during finals week, so please resolve any questions you might have about the project before Fri May 4.

by: Ted Pedersen - tpederse@d.umn.edu
