CS 8995 Corpus Based Natural Language Processing

UNDER CONSTRUCTION. May be revised in response to questions. LAST UPDATE Monday April 30 5pm

Final Project: Empirical Methods for Multilingual Text

Stage 3 - Due Thu, May 10, 4 pm

Objectives

To create a bilingual translation dictionary from sentence-aligned corpora using the EM algorithm.
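
As an illustration of the objective only, the sketch below uses IBM Model 1, a common formulation of EM for estimating word translation probabilities from sentence-aligned text. The assignment itself does not prescribe a model. The sketch is written in Python for brevity (the submission filename suggests Perl), and the function names `train_model1` and `best_translations`, the iteration count, and the toy sentence pairs are all hypothetical. It omits the NULL source word and any smoothing.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """Estimate t(target_word | source_word) with IBM Model 1 style EM.

    pairs: list of (source_tokens, target_tokens), one per aligned sentence.
    Returns a dict keyed by (target_word, source_word).
    Assumption: no NULL source word and no smoothing (illustrative only).
    """
    target_vocab = {tw for _, tgt in pairs for tw in tgt}
    if not target_vocab:
        return {}

    # Start from a uniform distribution over the target vocabulary.
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (target, source) links
        total = defaultdict(float)  # expected count of each source word

        # E-step: share each target word among the source words in its sentence.
        for src, tgt in pairs:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac

        # M-step: renormalize expected counts into probabilities.
        t = defaultdict(
            float,
            {(tw, sw): c / total[sw] for (tw, sw), c in count.items()},
        )

    return dict(t)


def best_translations(t):
    """For each source word, keep the target word with the highest probability."""
    best = {}
    for (tw, sw), p in t.items():
        if sw not in best or p > best[sw][1]:
            best[sw] = (tw, p)
    return best


if __name__ == "__main__":
    # Hypothetical toy corpus: three aligned German/English sentence pairs.
    toy_pairs = [
        ("das haus".split(), "the house".split()),
        ("das buch".split(), "the book".split()),
        ("ein buch".split(), "a book".split()),
    ]
    table = train_model1(toy_pairs, iterations=20)
    for source_word, (target_word, prob) in sorted(best_translations(table).items()):
        print(f"{source_word}\t{target_word}\t{prob:.3f}")
```

On this toy data the estimates move toward the pairings das/the, haus/house, buch/book and ein/a. A real dictionary would likely need a probability threshold and some handling of rare words, neither of which is shown here.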

Specification

This is an OPTIONAL extra credit project stage. If you complete it successfully, you could earn up to 150% credit for the final project (30 out of 20 possible points). You may work in teams of one, two or three members. Each team that intends to work on this stage must advise me of this via email by 4pm Friday April 20. In this email include the name and email address of all team members. I will confirm via email and include a team name that you should use during this stage.

For all code submitted, please include a header comment that lists the name of your team and the team members, as well as brief instructions on how to use the program. EACH TEAM SHOULD WORK INDEPENDENTLY OF ALL OTHER TEAMS. If you would like to discuss this stage with a human being, please consult with your teammates only! Nobody else should be involved in any way with your team! The code that you submit must be completely written by the members of your team and your team alone. You are not to share code with other teams for any reason. Even small violations of this policy will result in the involved teams not receiving any credit.

If you are aware of alignment tools that are available as free software or are in the public domain, you are free to use those instead. However, you must mention this in your report and clearly indicate in the code comments where you obtained this material. Also remember that you must follow the stage 2 convention for the output format of sentence-aligned text.

This project stage includes the following:

Other information

You should turn in 5 items. Please use turnin to submit all items. Remember that you can only use turnin from hh33812. No email submission is necessary. The proper turnin commands are:
turnin -c cs8995 -p p3a teamname-em.pl (EM algorithm implementation)
turnin -c cs8995 -p p3b1 teamname-1.utx (aligned corpus 1)
turnin -c cs8995 -p p3b2 teamname-2.utx (aligned corpus 2)
turnin -c cs8995 -p p3b3 teamname-3.utx (aligned corpus 3)
turnin -c cs8995 -p p3c teamname.(pdf|ps) (written report)
This is a team project. Please consult with and work with your team members closely. You may divide the work as you see fit, and you have considerable discretion in your approach to this problem. Do not discuss your stage 3 work with anyone outside of your team.

Please note that the deadline will be enforced by automatic means. Any submissions after the deadline will not be received.

There will be no partial credit given unless you have a nearly working EM algorithm program. In other words, if you turn in aligned sentences and a write-up along with an EM algorithm program that is significantly flawed, you will get no credit.

I will not be available to assist you during finals week, so please resolve any questions you might have about the project before Fri May 4.

by: Ted Pedersen - tpederse@d.umn.edu