CS 5761 - Introduction to Natural Language Processing
Project Proposal due by 5pm Weds April 7 via webdrop. Please submit
a PDF or PostScript (ps) file.
Objectives
To outline the design and scope of your project in a formal written
proposal.
Specification
Your project will involve producing both a Perl implementation and a
written report. There are two possible topics:
- Conduct an analysis that shows whether or not the Voynich Manuscript
consists of human language.
- Develop a method that uses information from Google to identify sets
of related words (much like Google Sets).
Within these topics you have considerable discretion as to how you
proceed. If you would like to work in teams of two, that is possible.
However, you must clearly define what each team member will be
contributing. You should structure things so that each team member has a
distinct role in the project.
If you have an alternative idea for a project, please let me know via
email by Weds March 31. I am willing to consider such possibilities, but
would like to discuss them with you before you proceed too far. If you
would like to take the basic idea of one of the projects above and modify
it in some significant way, you should also send me an email by March 31
describing the general idea so we can discuss it.
By Weds April 7 you should have produced a project proposal. It should
include the following:
- Problem Description (1 page): Which of these problems are
you trying to solve? Describe the problem in general terms and what
practical applications the techniques you are developing could have. You
should provide at least two references to published papers (not just web
sites) that discuss the same or a related problem. You should read and
briefly summarize these papers. If you are working in a team of 2, you
should each find and read 2 papers (for a total of 4).
- Overview of Solution/Approach (1 page): What is the general
approach you plan on taking?
- Voynich Project: describe the tests or techniques your analysis will
consist of, and why you think they will help to answer the question of
whether the manuscript is human language.
- Related Words: describe your algorithm as a series of steps.
Provide an example that shows how it works.
If you are working in a team of two, clearly indicate which of you will
handle which step. All steps in either project should be clearly
assigned to one team member or the other.
- Evaluation Plan (1-2 paragraphs): How will you show that your
solution is valid? Will you need to find or create "gold standard"
data to use as a point of comparison? If so, where will you get it, or
how will you create it?
- Voynich Project: You may want to consider carrying out your analysis
on text that is known to be human language and on text that is known not
to be, and showing that your analysis produces the correct result in each
case. Text that is truly human language is easy to find (Project
Gutenberg, etc.), and you could possibly use your program from lab4 to
generate text based on a unigram model, which might not look like true
human language. (These are just ideas; you can proceed as you wish.)
- Related Words: You may want to consider comparing the sets of words
produced with sets of words that are known to exist in a thesaurus or
other resource. You may want to consider using WordNet, which
provides sets of related words. This is installed on the csdev machines
and freely available for download. Just run "wn" or "man wn" to find out
more.
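As an illustration of comparing a produced set of words against a gold
standard set, here is one possible scoring sketch. It is written in
Python only to keep it short (your project implementation should still be
in Perl). The function name compare_sets and the example word lists are
made up for illustration; they do not come from WordNet or Google, and
you are free to choose different measures.

```python
def compare_sets(produced, gold):
    """Return (precision, recall, jaccard) for two collections of words.

    produced: words your method returned (hypothetical input)
    gold:     words from a trusted resource such as a thesaurus
    """
    # Lowercase both sides so that "Red" and "red" count as the same word.
    produced = {w.lower() for w in produced}
    gold = {w.lower() for w in gold}

    shared = produced & gold
    union = produced | gold

    # Guard against empty sets to avoid dividing by zero.
    precision = len(shared) / len(produced) if produced else 0.0
    recall = len(shared) / len(gold) if gold else 0.0
    jaccard = len(shared) / len(union) if union else 0.0
    return precision, recall, jaccard


# Made-up example data, for illustration only.
produced = ["red", "green", "blue", "apple"]
gold = ["red", "green", "blue", "yellow", "purple"]

precision, recall, jaccard = compare_sets(produced, gold)
print(f"precision={precision:.2f} recall={recall:.2f} jaccard={jaccard:.2f}")
# prints: precision=0.75 recall=0.60 jaccard=0.50
```

Precision tells you how many of the words you produced are in the gold
standard, recall tells you how much of the gold standard you recovered,
and the Jaccard score combines the two into a single overlap measure.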
Your proposal should probably be 2-3 pages, and it should be well written
and carefully thought out. It will provide a road map for your project, so
the more you put into it, the more smoothly your project will go.
You will also present your project proposal in the lab on Thursday April
8. There is no need to prepare a formal presentation; we can use your
written proposal as a point of reference while you describe things.
If I have significant concerns about your topic or some aspect of your
proposal I will let you know within a few days after you submit the
proposal. In that case I might request that you make some changes or
provide additional details.