CS 8995 Corpus Based Natural Language Processing
Assignment 4 - Due Wed, Mar 7, 4 pm
Under Construction - Date of last update: Thu Mar 01, 11am
Objectives
To collect a parallel corpus of translated text for future projects.
Specification
Create a UNICODE (UTF-8) file that contains parallel text from two
languages. Your corpus should consist of 20 pairs of translated articles,
documents, etc. where one of the languages is English and the other is a
single language of your choice. Thus, your corpus should consist of 40
articles, 20 in English and 20 in the other language. Your corpus
should consist of plain text (like the Project Gutenberg files). There
should be no html, xml, or other forms of markup embedded in the
text.
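The UTF-8 and no-markup requirements above can be spot-checked before submission. The following is a minimal sketch in Python, not part of the assignment: the function names and the tag/entity pattern are hypothetical, and the pattern only catches text that looks like HTML or XML tags or character entities. It is meant to be run on the text of each article, not on any markup the submission format itself requires.

```python
import re
from typing import List

# Hypothetical pattern (an assumption, not from the assignment): matches
# strings that look like HTML/XML tags (<p>, </div>) or entities (&amp;, &#169;).
MARKUP_RE = re.compile(r"</?[A-Za-z][^<>]*>|&[A-Za-z]+;|&#\d+;")


def decode_utf8(raw: bytes) -> str:
    """Decode raw bytes as UTF-8; raises UnicodeDecodeError if they are not valid UTF-8."""
    return raw.decode("utf-8")


def find_markup(text: str) -> List[str]:
    """Return every tag-like or entity-like substring found in the article text."""
    return MARKUP_RE.findall(text)
```

For example, reading an article with `open(path, "rb").read()`, passing the bytes to `decode_utf8`, and then passing the result to `find_markup` should yield an empty list for clean plain text. A non-empty list is only a hint that markup remains; it still needs a manual look.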
Each article in your corpus should be at least 500 words (in
English). However, if you can find longer articles then your results on
subsequent projects will be much better. Do not subdivide longer articles,
government documents, technical documentation, etc. into smaller
pieces. Your objective should be to collect 20 distinct articles rather
than simply collecting 10,000 words of text (20 articles * 500-word
minimum) in each of the two languages.
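The length and count requirements above can be checked mechanically. The following is a rough sketch in Python under stated assumptions: a "word" is taken to be any whitespace-delimited token (the assignment does not define one), and the function names are hypothetical. It checks each English article individually, since the goal is 20 distinct articles of at least 500 words each rather than 10,000 words in total.

```python
from typing import List, Tuple

MIN_WORDS = 500  # minimum length of each English article, from the assignment
NUM_PAIRS = 20   # required number of article pairs, from the assignment


def word_count(text: str) -> int:
    """Count whitespace-delimited tokens (an assumed definition of 'word')."""
    return len(text.split())


def short_articles(english_articles: List[str]) -> List[Tuple[int, int]]:
    """Return (1-based article number, word count) for each article under MIN_WORDS."""
    counts = [(i, word_count(a)) for i, a in enumerate(english_articles, start=1)]
    return [(i, n) for i, n in counts if n < MIN_WORDS]


def check_corpus(english_articles: List[str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no problem was found."""
    problems = []
    if len(english_articles) != NUM_PAIRS:
        problems.append(
            "expected %d English articles, found %d" % (NUM_PAIRS, len(english_articles))
        )
    for i, n in short_articles(english_articles):
        problems.append("article %d has only %d words" % (i, n))
    return problems
```

Passing a list of the 20 English article texts to `check_corpus` returns an empty list when both requirements are met. The check says nothing about whether the paired articles are genuine translations, which still has to be verified by hand.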
Good choices for text will include newspaper or magazine articles,
technical documentation, or government documents. Do not use religious or
literary works. Remember to make sure that you are collecting articles
that have been translated, as opposed to articles that may simply be about
the same topic but be substantially different. Do not, under any
circumstances, use online translation aids to create a translation!
Submission format
It is crucial that your corpus follow this format. Here is an example (using
literary text, however!).
You should have twenty such entries in your corpus. Remember that
your student-id-number is the 7-digit code that UMD knows you as (not
your email address). The entire corpus that you submit should be in
Unicode, so you must use either Perl or a Unicode text editor to
create the markup scheme shown above.
Other information
Please use turnin to submit this assignment. Remember that you can only
use turnin from hh33812. No email submission is necessary. The proper
turnin command is:
turnin -c cs8995 -p a4 userid.utx
This is an individual assignment. Please work on your own. Given the
huge volume of text available online it is highly unlikely that you will
coincidentally find the same articles as another student. However, if
you are concerned that there is a common source of translated text that
multiple students might access (a popular newspaper, etc.) please arrange
with your colleagues to avoid collecting the same text!
Please note that the deadline will be enforced by automatic means. Any
submissions after the deadline will not be received.
by:
Ted Pedersen
- tpederse@d.umn.edu