For a long time, essays have been graded by trained humans, who assign a score to the essay depending on various criteria, while at the same time allowing for individuality and creativity. In fact, any grading which involves the use of natural language, is difficult as there can be numerous ways to say the same thing. The grader has to read the student's answer and then grade it depending on how well written and how accurate the answer is. When it comes to essays, things get even more difficult. This is because essays reflect the thoughts and opinions of another human and thus there are no perfect answers to an essay. This is indeed a very tedious and fatigue inducing job, and hence if this job is automated, we end up saving a lot of time and money and of course, human effort.
Ellis Page, who came up with the idea of an automated essay grader as early as 1966 [7] was the pioneer in the field of Automated Essay Grading. He created the PEG - Project Essay Grader, an automated essay grading system, which used indirect measures to score an essay, depending on various features such as essay length, average word length, number of prepositions, etc. Page's research focused on correlations between simple features of student texts and the grades assigned by teachers. The results showed that the computer program predicted human grades quite reliably. Though the PEG worked fairly well - considering the computing limitations at the time - it was not received well, as it was relatively easy for students to fool the system. People were also opposed to the very idea of automated essay grading on the grounds that it would not be able to assess creativity. Several further studies around that time developed the basis for further research in this field, although there was little interest in using computers in this way.
After the lukewarm response to the PEG, research and development in this field was shelved for quite some time. There was some further work in this area, but the next productive phase did not begin until the early 1980s, when the Writer's Work Bench [3] came up with another automated system, not so much for essay scoring, but as a tool to help amateur writers and students of journalism to write better. By the early 1990s, advances in computing technology and the field of NLP enabled many people to come up with some very good ideas for grading essays with the help of machines. Of these, the most prominent are the E-rater, developed at the ETS by Burstein et al. [2]; the Intelligent Essay Assessor or IEA, which is based on LSA and was developed by Landauer et al. [1]; and a short answer scoring system developed by Chodorow and Leacock [8]. However, these systems are well known and understood by almost every person who has an interest in this field. We shall therefore describe some of the lesser known, but noteworthy developments in Automated Essay Scoring.
In the early 1990s, a researcher named Hellwig of Idaho State University looked at this topic from the aspect of business writing [6]. His system used many of the ideas used by Ellis Page, but he also gave importance to how often students used the 1000 ``most commonly used words'' (according to him) in their essay. However, though this worked in business documents and writings, this system did not work on traditional essays. However, his work did show that we could find a metric that combined rater judgments with intuitive human judgments.
The next significant development is the Alaska Assessment Project, which graded student essays by searching for and counting 24 features. The results were even better than those achieved by Page, and were really encouraging, as humans and computers agreed as much as 96% of the time. However, it did not become very popular, as human-graded essays were still required as a benchmark [5].
The next important example is the Computerised Instrument for writing assessment (CIWE, pronounced ``kiwi'') [4]. This was designed in conjunction with the Carmel California Evaluation Centre following work done on the Alaska Writing Programme (it is not clear whether this is the same event as the Alaska Assessment Project mentioned above). This method collected around 500 student essays in text form. The students ranged from grade school, to high school, to university applicants and graduate students. CIWE used 13 factors to grade any essay, of which the 4 most important were fluency, sentence development, word use or vocabulary, and paragraph development. However, as mentioned previously, the E-rater and the Intelligent Essay Assessor remain the leading Automated Essay Graders to date.
In our system, we combine some of the methods used in both the E-rater and the IEA. We shall provide the details of the system in later sections, but basically, we have used an LSA-based approach, without implementing Singular Value Decomposition. Instead, we have used a system of weighting every word and sentence, and we assign cosine values to sentences based on these weights.
The final objective of our project is to create an Automated Essay Grading system, which will score student essays based on various features such as context, grammar and proper statement of facts. Basically, our system will incorporate the following four primary features and then grade the submitted essays by collectively considering the student's response regarding these features.
Gibberish Detection is a method which identifies and penalizes attempts by students to deliberately fool the system by writing only significant words or phrases in an essay instead of a proper essay. Gibberish is also known as Word Salad, which helps us understand its meaning much more easily. It is basically just a collection of significant words written by students knowing that the essay is to be automatically graded and hence significant words carry the most weight, while stop words carry little or no weight. The essay by itself does not make much sense. It should also be noted here that sentences which are overly ungrammatical could also be considered as Gibberish. This is because these sentences make the essay extremely difficult to understand, rather than adding to its meaning.
Using a simple Latent Semantic Analysis [1] based approach for essay grading is not sufficient, as LSA ignores the stop words in documents while assigning similarity scores. An intelligent student, who does not know the language very well can hence fool the system by including many significant words in the essay which carry a lot of weight, but the overall essay does not make sense at all.
For example, say a student has to write about the recent U.S. elections and knows that the grading will be done by an automated system. So, in an attempt to fool the system, the student may just write an ``essay'' like the paragraph shown below, which includes many significant words, but does not make much sense.
George W. Bush Iowa Florida John Kerry Democrats close elections 2nd November War on Terror Cheney Republicans President Iraq Edwards middle class Halliburton Saddam Hussein administration defense spending Pentagon Bin Laden Afghanistan troops tax cuts education Immigration Ralph Nader improving schools Capital punishment Homeland Security International trade welfare Social Security Federal deficit
Now, the above example contains only significant words. This is of course, an exaggerated example, but this has been done on purpose, so as to facilitate understanding. If this essay were to be compared to a well-written essay, using only the LSA-based system of ignoring stop words, it would score very well, as it contains most of the required and significant words. However, this essay does not make sense at all.
Sometimes, it so happens that a student writes an excellent essay. However, this essay is just not what was asked for in the prompt. Or, a student may start off well, but may diverge later on as the essay progresses. While scoring an essay, it is important not only to grade essays based on how well they have been written, but also on how relevant they are to the question asked. Many times, a student's essay may be factually correct and writing style may also be perfect but it just does not address the question asked.
For example, suppose a student has been asked to write about ``How education can help reduce crime and increase safety'', but the student writes an essay about how women's education is important, how it helps to build a society as a whole, how it brings about the development of women, and so on. Though the points raised by the student may be valid, they do not address the original prompt, which looked for an essay stating how education could help mitigate crimes. Though similar, the topics of the prompt and the essay are not one and the same.
Also, it may so happen that some of the sentences in an essay may not be relevant to the other parts of the essay. They do not contribute to the whole idea of the essay, but strike us as strange and jarring. So sometimes, it is not the entire essay which is different, but just a few sentences which do not make sense in the context of the essay.
Consider that a student is given a prompt like ``Do you think competition at the work place improves the performance of the employees? Support your stand with examples''. Suppose the student writes an essay which strongly supports the above statement, but gives an example like this:
``Competition is very important. I remember in school, our teachers would encourage us to always do our best. In fact they used to keep all sorts of races amongst us and every time we won a race, we would get cookies. So, all of us used to try to the best of our abilities to always win the races and games not only for the cookies, but also for the applause that the teacher called for in honour of the winner. Hence, competition is important.''
This example is a good one to point out the benefits of Competition in general, but it just does not address the issue of how competition in the work place is beneficial.
Many times, essays are more a reflection of the student's ideas and opinions rather than actual statements of fact. Some essays don't even call for facts, as they ask for a student's opinion on a certain matter. However, even in such cases, a student has to provide examples to support his/her stand. These statements of fact are thus mixed with or entangled within the student's opinion about any particular topic. Our system shall try to identify statements of fact from a student's ideas and opinions.
Consider that a student has been asked to write an essay on, ``What do you think is more important for your country - advances in business or agriculture? Give examples to support your choice.''
Now suppose a student's essay states that advances in agriculture are more important and supports this claim with the following sentences: ``India is a country where the primary occupation is agriculture. 70% of the country's population is employed in farming and agriculture. So I think advances in agriculture are more important.''
Now, the first two sentences in the paragraph are actually statements of fact, and only the last sentence expresses the student's opinion. So, our system should be able to extract the sentences which are statements of fact from those which express the student's individual opinion.
This feature is a branch or a result of the one above. It is not enough for a student to just make any arbitrary statements and say they are facts. They have to be facts indeed. In case a student makes an erroneous statement of fact, the student should be penalized for the same. Hence, our system will include a feature which will check the accuracy of the facts stated by the student. If a statement is not an actual fact, the student will lose some points.
Consider the above example again. The student states that India is an agricultural country and that the occupation of around 70% of the population is agriculture. If India were not an agriculture-oriented country or say less than half of the population was into farming, and not almost three-fourths as stated by the student, then the statements of ``fact'' made by the students would not actually be factual. The system should be able to identify between erroneous and actual statements of fact and assign the scores appropriately.
In the final version, we have successfully implemented all the four features - namely Gibberish Detection, Relevance Checking, Fact Identification and the Verification of Identified facts. As previously explained, Gibberish detection helps identify senseless or meaningless sentences in the essay. Relevance checking will identify whether a particular sentence, paragraph, or the whole essay pertains to or addresses the question asked in the prompt. The Fact Identification module goes through the essay and identifies the various statements of fact present in the essay. We would like to point out here that we only identify fact statements and not the student's opinion on anything. The final feature - Fact Verification - then checks whether the identified facts are accurate or not. This feature gives one of three outputs - That the fact is true, that the fact is wrong, or that it cannot be verified. Now below, we shall explain how we have implemented all these approaches for the final version of the project.
This module was really deceptive, in that it appears to be fairly simple at the outset. We tried a variety of approaches for this and they all returned very disappointing results. The methods used and the results are briefly explained below, so as to give the user an idea of the approaches NOT to follow when implementing the gibberish detector.
The primary step to follow in any gibberish detection is Part of Speech Tagging. This has been done by using Brill's Part-of-Speech Tagger, which was downloaded from Eric Brill's website. The entire essay is first passed through the Tagger and thus every sentence in our essay is tagged according to the part of speech. This is the input which was then given to the various implementations of the Gibberish Detector.
The first thing that we tried was manually creating a set of rules for valid sentences and then testing every sentence against each of the rules. However, we soon realised the error of our ways. The main problem with this approach is that there are just way too many valid sentences in any language. There is no practical way to list out all the possible ways in which a sentence can be valid. Hence, we gave up this approach as impractical.
The next approach was the use of the Brown Corpus. Since this corpus has already been tagged, what we did was extract the bigrams of tags from the essay and compare these bigrams with the allowable ones from the Brown Corpus. However, this approach too failed, as we did not consider the order of the bigrams while making comparisons. This led to results which showed any sentence as non-gibberish. We then thought of extending the bigrams to n-grams and then making comparisons, however they led to the same results. Thus, we realized that an approach which compared the valid sentences was not going to work, as there can be innumerable structures for a valid sentence. And much of the allowable syntax is such that it just does not repeat again (following Zipf's Law).
Thus, we realized that we would have to implement this differently. We then came up with the idea of listing a set of rules for definitely invalid sentences. Examples of the same are sentences without any stop words, or sentences which do not contain any nouns or pronouns. Then making use of these rules, we implemented a pre-filter which would identify sentences as either gibberish or non-gibberish. This was what we had in the alpha version. However, even at that time we were aware that this approach was riddled with problems and was never going to do on its own. This was because the pre-filter could only catch a few (definitely gibberish) sentences as such, and that too if the sentence structure matched the few rules that we came up with.
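A minimal sketch of such a pre-filter is given below, covering only the two example rules named above. It assumes that each sentence arrives as a list of (word, tag) pairs using Penn Treebank style tags; the stop-word list, the tag set and the function name are illustrative assumptions, not the lists or code used in the actual system.

```python
# Hypothetical stop-word list; the list used in the actual system is not given.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "were", "of", "in",
              "on", "at", "to", "and", "or", "but", "that", "this", "it"}

# Penn Treebank tags for nouns and pronouns (assumed tag set).
NOUN_PRONOUN_TAGS = {"NN", "NNS", "NNP", "NNPS", "PRP", "PRP$"}


def is_definitely_gibberish(tagged_sentence):
    """Return True if the sentence breaks either of the two example rules.

    tagged_sentence is a list of (word, tag) pairs.
    """
    words = [word.lower() for word, _ in tagged_sentence]
    tags = [tag for _, tag in tagged_sentence]
    has_stop_word = any(word in STOP_WORDS for word in words)
    has_noun_or_pronoun = any(tag in NOUN_PRONOUN_TAGS for tag in tags)
    # A sentence is flagged when it has no stop words or no nouns/pronouns.
    return not has_stop_word or not has_noun_or_pronoun


word_salad = [("Bush", "NNP"), ("Iowa", "NNP"), ("Florida", "NNP")]
normal = [("The", "DT"), ("dog", "NN"), ("barks", "VBZ")]
print(is_definitely_gibberish(word_salad), is_definitely_gibberish(normal))
```

As the text notes, a filter of this kind only catches sentences that match one of the listed rules, so it can serve as a first pass but not as a complete detector.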
At the alpha phase, it was decided to use our previous system as a pre-filter to the next stage in gibberish detection. Though it was felt that the pre-filter would in fact be redundant, it soon came to light that this was not so. We shall see how, momentarily. To improve the performance of the gibberish detection module, we decided to use syntactic tree structure. Since we were sure that we would find several packages for a parser on the web, we decided to try those before coding one ourselves. True enough, we came across numerous parser packages, of which the best were Minipar (developed at the University of Alberta) and the Link Parser (developed at Carnegie Mellon University). The links to their respective websites are given below in the references section.
After some experimentation, we came to the conclusion that the Link Parser is best suited for our purposes. This is because it makes use of the concept of Link Grammars. Link Grammar is a syntax which gives various choices for which words can follow a given word. For example, a definite article (say ``The'') can be followed by nouns, adjectives, adverbs, etc. So, a parser that strictly says that ``The'' should only be followed by a noun will definitely judge a sentence harshly in many cases. We need a parser that does not follow a strictly structured approach, and the Link Parser fulfilled some of these requirements.
However, the accuracy of the Link Parser is not 100%. The very flexibility which makes the Link Parser an attractive package to use, can sometimes let even obviously gibberish sentences pass through unchecked. So, to enhance its performance, we have added a few filters on top of the parser which give us more fine-grained control on deciding which sentences are gibberish and which are not. Also, to customize the parser to our needs, we have changed the values of some of the variables in our program. Hence now, the system is intelligent enough to detect word salads and any obvious attempts made to fool it, but at the same time can identify genuine and minor grammatical errors that students tend to make in examinations. We ignore sentences which have one or two grammatical errors and consider them ungrammatical, but not gibberish. Sentences with more than three to four errors are classified as ``Potentially Gibberish'' in the Developer mode. Then, based on the ratio of number of errors in the sentence to the number of words in the sentence, it is decided as to whether or not the sentence is gibberish. Sentences which really do not make sense are classified as Gibberish, at the outset.
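The error-count and ratio decision described above can be sketched roughly as follows. The cut-off values (`max_minor_errors` and `ratio_cutoff`) are hypothetical placeholders chosen for illustration; the constants used in the actual system are not stated, and the handling of exactly three errors is an assumption.

```python
def classify_by_errors(num_errors, num_words,
                       max_minor_errors=2, ratio_cutoff=0.5):
    """Classify a sentence from its error count and length.

    max_minor_errors and ratio_cutoff are assumed values, not the
    constants used in the actual system.
    """
    if num_words <= 0:
        raise ValueError("a sentence must contain at least one word")
    if num_errors == 0:
        return "non-gibberish"
    if num_errors <= max_minor_errors:
        # One or two errors: ungrammatical, but not gibberish.
        return "ungrammatical"
    # More errors: "potentially gibberish"; the ratio of errors to words decides.
    if num_errors / num_words >= ratio_cutoff:
        return "gibberish"
    return "ungrammatical"


print(classify_by_errors(2, 10))
print(classify_by_errors(5, 6))
```

Expressing the decision as a ratio means that a long sentence with a handful of slips is treated more leniently than a short one with the same number of errors.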
In the Beta version, the Gibberish Detector was unable to consistently classify sentences which we called ``borderline'' sentences. These are sentences which have few grammatical errors, but do not make sense. Sometimes, these sentences are shown as Gibberish, while otherwise they are shown as non-gibberish. However, this was because we had tried to make the system somewhat lenient in that it should not penalize a student to a very great extent. We had tried to optimize the system such that it was neither too lenient, nor too strict.
Also, at that point, whether a sentence was Gibberish or not was decided based only on the syntax of the sentence, not its semantic form. Hence, a sentence like ``Ferocious Paint barks at purple tree'' would pass unchecked through the detector, since its syntactic structure is perfect. For the final version, we have tried to address this issue by assigning sense tags to every word in a sentence, and hence classifying it as Gibberish, Non-Gibberish or Ungrammatical.
Now, we had to make a decision as to how to assign sense tags to the words in the essay. Our problem was solved by Dr. Ted Pedersen, who suggested that we make use of the SenseRelate package developed at the University of Minnesota, Duluth by Jason Michelizzi and Dr. Ted Pedersen. We are extremely grateful to both of them for generously offering us the use of this package, which is as yet in its Beta stage. Though we did not make use of this package due to time constraints, its overall implementation gave us an idea as to how to proceed further with our Gibberish Detection.
Let us now see how we made use of the idea obtained from the SenseRelate package to identify Gibberish sentences. Outlined below is the whole algorithm used to classify sentences as Gibberish, Ungrammatical and Non-Gibberish.
Thus, we have two methods for gibberish detection working together: the original Link Parser based method and the new sense tag based method. Initially, the Link Parser based method is called, which returns the number of linkages found and the null count value in a data structure for every sentence. Next, we call the sense tag based method, which returns the three values described above in a data structure for every sentence. Using a combination of rules applied to the two data structures, we judge whether a sentence is Gibberish, Non-Gibberish or Ungrammatical. Finally, if a sentence is Gibberish, it is assigned 0 points; if it is Ungrammatical, it is assigned 0.8 points; and it gets 1 point if it is Non-Gibberish. These values are calculated for all the sentences in the essay and the final value is scaled down to a score range of 0 to 2. We shall explain the scoring process later. For now, we shall just say that the Gibberish Detector contributes 2 out of the total 6 points allotted to an essay.
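The per-sentence points and the 0 to 2 range can be illustrated with the short sketch below. The point values come from the text; the scaling step (taking the mean of the sentence points and multiplying by 2) is an assumption, since the exact scaling is not spelled out here.

```python
# Points per sentence label, as stated in the text.
POINTS = {"gibberish": 0.0, "ungrammatical": 0.8, "non-gibberish": 1.0}


def gibberish_score(labels, max_score=2.0):
    """Scale the mean per-sentence points to the range 0..max_score.

    Averaging and then multiplying by max_score is an assumed scaling.
    """
    if not labels:
        return 0.0
    mean_points = sum(POINTS[label] for label in labels) / len(labels)
    return mean_points * max_score


essay_labels = ["non-gibberish", "ungrammatical", "gibberish", "non-gibberish"]
print(gibberish_score(essay_labels))
```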
This is one module that has given considerably successful results. In fact, we would go so far as to say that this module is ready to be presented as the final version too. It gives clear, distinct and good results for relevance. In the relevance check we have used a collection of essays from various websites, which share the same prompt, and have a score of 4 or more. These essays then form a benchmark for the relevance checking algorithm and the essay of the student will be compared to these essays for checking relevance at present, and for scoring later on. For every prompt, we have used a minimum of five essays to train the system. If more good essays for a given prompt were available, they were also used. However, it should be noted that even with three essays, the system works very well.
We have used the idea of a co-occurrence matrix as done in LSA, but with a slight change. As in LSA, here too, we have removed all the stop words from this corpus and made a word-by-word co-occurrence matrix for it. Each entry in the co-occurrence matrix shows the count, that is, the number of times that the two words occur together in the corpus. This count is incremented whenever two words co-occur in a sentence.
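A word-by-word co-occurrence count of this kind can be sketched as follows. The stop-word list and whitespace tokenisation are simplifying assumptions, and counting each word pair at most once per sentence is one possible reading of "co-occur in a sentence".

```python
from collections import Counter
from itertools import combinations

# Hypothetical stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "of", "in", "to", "and", "with"}


def cooccurrence_counts(sentences):
    """Count, for each unordered word pair, the sentences containing both words."""
    counts = Counter()
    for sentence in sentences:
        # Whitespace tokenisation; stop words are dropped before pairing.
        words = {w for w in sentence.lower().split() if w not in STOP_WORDS}
        for pair in combinations(sorted(words), 2):
            counts[pair] += 1
    return counts


corpus = [
    "education reduces crime",
    "crime falls with education",
    "the society values education",
]
matrix = cooccurrence_counts(corpus)
print(matrix[("crime", "education")])
```

Storing the pairs in sorted order keeps the matrix symmetric without holding both (a, b) and (b, a).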
Now, we come to the weighting function. The weight for every word was calculated with the help of the following simple formula:
With the help of the above weights, we compute the similarity scores for various words. These scores are then used to calculate the similarity between sentences. The similarity is calculated using Cosine measures [2]. How this is done exactly is explained below.
We first take the prompt and then extract every sentence from the answer essay. Then, based on the weights associated with each word, we compute the cosine similarity measure between the prompt and the sentence. The similarity for a sentence was calculated as follows.
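For illustration, a generic weighted cosine similarity between a prompt and a sentence might be sketched as below. This is not the system's own formula: the `weights` dictionary is a hypothetical stand-in for the word weights, and each text is simply treated as a bag of weighted words.

```python
import math
from collections import Counter


def weighted_vector(words, weights):
    """Build a sparse vector: each word contributes its (assumed) weight."""
    vector = Counter()
    for word in words:
        vector[word] += weights.get(word, 0.0)
    return vector


def cosine_similarity(vec_a, vec_b):
    """Standard cosine measure between two sparse vectors."""
    dot = sum(value * vec_b.get(word, 0.0) for word, value in vec_a.items())
    norm_a = math.sqrt(sum(value * value for value in vec_a.values()))
    norm_b = math.sqrt(sum(value * value for value in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical weights for illustration only.
weights = {"education": 2.0, "crime": 1.5, "safety": 1.0}
prompt_vec = weighted_vector(["education", "crime", "safety"], weights)
sentence_vec = weighted_vector(["education", "crime"], weights)
print(cosine_similarity(prompt_vec, sentence_vec))
```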
Once this is done, we compute the mean of the relevance scores for all the sentences in the essay. This is done by a simple average formula of adding all the values obtained so far and dividing the total by the number of sentences for which similarity was calculated.
In the alpha version, we had to multiply the divergence scores by a certain factor, to get a clear cut-off where we can identify whether a sentence is relevant or not. In the beta version, we found that there is no need for such a factor multiplication. We get proper and conclusive answer due to the assignment of weights to every word in the essay. In the final version, we do not need divergence values at all. What we do is just calculate the similarity scores of different sentences. If the similarity score of a sentence is less than 0.4 or if the value of the difference between the similarity score and the mean is less than -0.25 (similarity - mean), then it is directly classified as irrelevant.
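The final-version rule can be written compactly as in the sketch below. The thresholds 0.4 and -0.25 are the ones stated above; the function and parameter names are illustrative.

```python
def flag_irrelevant(similarities, min_similarity=0.4, max_drop_below_mean=0.25):
    """Return one boolean per sentence: True means classified as irrelevant.

    A sentence is irrelevant if its similarity is below min_similarity, or
    if (similarity - mean) is less than -max_drop_below_mean.
    """
    if not similarities:
        return []
    mean = sum(similarities) / len(similarities)
    return [
        score < min_similarity or (score - mean) < -max_drop_below_mean
        for score in similarities
    ]


scores = [0.95, 0.95, 0.95, 0.5]
print(flag_irrelevant(scores))
```

The second condition catches a sentence such as the last one in `scores`, which clears the absolute threshold but falls well below the rest of the essay.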
Thus, we had the similarity scores for every sentence in the essay. We also calculated the relevance of the entire essay to the prompt. This was done by considering the whole essay as a single long sentence and finding its similarity scores mentioned above. This gives us a very clear and distinct idea of whether the whole essay is relevant to the prompt or not.
For this feature, we take the output from boundary.pl. This output is then compared with a text file which contains numerous keywords which symbolise scientific, historical, astronomical, meteorological, and other such facts. For example, consider a sentence like ``There are 9 planets in the solar system - Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto''. In this case, the words ``Jupiter'', ``Earth'', etc. are present as keywords in our text file, and so this sentence is classified as a ``Potential fact''.
These potential facts are then scanned for the presence of words that show that the user is stating them as an opinion, rather than a fact statement. So, sentences containing words like, ``might'', ``may'', ``guess'', ``think'', etc. are then ignored and not counted as facts. The program gives two files as output. One is for the user, which only shows all the statements of fact present in the student essay. The other output is in the developer mode. It shows the exact execution and flow of the program which can help the developer to understand how the program identifies a sentence as a statement of fact. In this mode, every sentence is tagged to show whether it is a fact, an opinion, or neither. The text file has been created by using a Perl program which filters out relevant words out of a glossary.
Thus, to summarise, we can say that the Fact Identification module does the following. First, it identifies statistical, historical, astronomical and scientific facts which are present in the student essay. These facts are then checked to see whether they are opinions or not by scanning for certain keywords. If it is seen that statements of fact are actually opinions expressed by the student, then these are not counted as facts. The final output shown to the user only contains the obvious statements of fact.
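The two-step filter summarised above can be sketched as follows. The opinion words are the examples given in the text; the fact keyword set is a tiny hypothetical sample standing in for the keyword file, and tagging a sentence without any fact keyword as "neither" is an assumption about the developer-mode output.

```python
import re

# Tiny hypothetical sample of the fact keyword file.
FACT_KEYWORDS = {"jupiter", "earth", "mars", "planets", "population"}

# Opinion markers named in the text.
OPINION_WORDS = {"might", "may", "guess", "think"}


def tag_sentence(sentence):
    """Tag a sentence as 'fact', 'opinion' or 'neither'."""
    words = set(re.findall(r"[a-z']+", sentence.lower()))
    if not words & FACT_KEYWORDS:
        return "neither"   # no fact keyword, so not a potential fact
    if words & OPINION_WORDS:
        return "opinion"   # potential fact stated as an opinion
    return "fact"


for text in ["There are 9 planets in the solar system.",
             "I think Jupiter is the biggest planet.",
             "Competition is very important."]:
    print(tag_sentence(text), "-", text)
```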
This module required a lot of trial and error before we came up with the most effective approach. We first thought of using the Google API, but then realised that though this approach may work in the short term, it is not ideal. The Google API limits user queries to only a thousand per day, and though this seems adequate to start out, we very soon realised that this number is in no way enough. Then, we thought of making use of the LWP package in Perl, but after some thought decided against it. After this, we tried using AltaVista for doing searches, but AltaVista does not allow specific commands within quotes easily. So, after trying many other approaches, like making use of online encyclopedias such as Wikipedia, Metapedia, etc., we finally decided to make use of the Yahoo Search Engine.
Hence, in the Beta version, what we did was first take the exact sentence and query it in Yahoo. If we found an exact match, which either confirmed or refuted the fact, we stated the same and stopped. However, if the exact sentence did not occur, we then removed the quotes and tried to search for the words in the sentence. Then we attempted to verify or deny the fact that had been identified. If we still did not come across any occurrences of the fact phrase in any web page, that is, if the search for the sentence still yielded no result, we said that the sentence is a non-verifiable fact.
Furthermore, if the fact statement contained an important number (like Pi or the number of days in a year) or a date (of historical importance, say 4th July 1776 or 15th August 1947), then we only queried the exact string and verified from the result whether or not the fact is true.
But all these approaches were still imperfect and taking a lot of time. Hence, we decided to make use of the Google API again, irrespective of the fact that it has limited number of queries per day. This approach is outlined stepwise below. This approach is now working considerably well and the speed of the module has also improved.
We have implemented the four required modules as mentioned in the previous section. What has been done in every module has been outlined above. We shall now list further changes which can make the modules function even better. Due to time constraints, these could not be implemented in this system.
The Gibberish Detector uses both a structured approach (Link Parser based) and a semantics based approach (sense tag based). However, the accuracy of gibberish detection depends on the rules that are used to make this decision. There are some constants used in the rules. Even a slight variation in their values can move sentences from one category to another. The ``correct'' values for the constants can be found by carrying out many experiments on sample essays.
The Relevance Detector is working well, and as far as we can see, it need not be further improved. It is clearly demarcating relevant and irrelevant sentences. So far, we have not seen it to show a relevant sentence as irrelevant or vice-versa. The huge difference in the similarity scores also helps us to clearly distinguish relevant and irrelevant sentences. Hence, we cannot come up with any changes in this module, which will improve its performance by a great extent. However, in case the user comes across any bugs in this module or if this module does not work well for any example, please feel free to contact us with the details.
Though the Fact Identification module is giving fairly good results, it can be further improved. We intend to do this by refining the keyword corpus used. The text files of both facts and opinions will be modified to incorporate more words which identify a particular sentence as fact and/or opinion. This will make the distinction between facts and opinions clearer. Also, we would like to point out that at present we are only looking at this one aspect to identify facts.
What can be further done to enhance the performance of this module is to make use of the Part of Speech Tags already assigned. Using the output of this tagging, we can identify the sentences in the essay which contain proper nouns. These sentences can then be divided into a subject phrase and an object phrase. Then, we could search for the subject phrase online, and if it had a lot of hits, we classify the whole sentence as a fact.
In the Fact Verification module too, the performance of the system could be enhanced with certain changes. The first step would be to separate the subject and object phrases or the noun and verb phrases from the fact statements. Then, we query various verb phrases for a particular noun phrase to find out the similarity scores. This discounts the possibility that similar verb phrases occurring in the vicinity of a given noun phrase also show a high similarity score.
The scoring is done on the basis of various features, such as average word length, average sentence length, type to token ratio, length of the essay and the cohesiveness of the essay. Of course, the modules of our system also contribute significantly to the score of the essay.
However, each of the above mentioned attributes contributes separately to the overall essay score. In our scoring method, factors such as relevance, essay length, cohesiveness, grammaticality and number of paragraphs in the essay are the more important features. Other factors like type to token ratio, average word length, average sentence length, precision of the fact statements, etc. do contribute to the overall score, but do not have much of an impact on it.
Let us now see what we mean by some of the terms used above. By ``types'', we mean the number of different words present in the essay. By ``tokens'', we mean the total number of words present in the essay. So, the type to token ratio is the number of different words in the essay divided by the total number of words in the essay. This ratio gives us an idea of the depth of the student's vocabulary. The higher the ratio, the more varied and rich the student's vocabulary, and hence the more it contributes to the score of the essay.
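The type to token ratio can be illustrated with a minimal Python sketch. The tokenizer here (lowercased alphabetic runs, apostrophes kept) is an assumption made for illustration, and the function name is hypothetical; it is not necessarily how our system splits words.

```python
import re


def type_token_ratio(text: str) -> float:
    """Return (number of distinct words) / (total number of words)."""
    # Assumption: a token is a lowercased run of letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        # An empty essay has no vocabulary to measure.
        return 0.0
    return len(set(tokens)) / len(tokens)


if __name__ == "__main__":
    # 10 tokens, of which 7 are distinct (the, cat, sat, on, mat, dog, too).
    essay = "The cat sat on the mat. The dog sat too."
    print(type_token_ratio(essay))
```

Note that the ratio tends to fall as a text gets longer, since common words repeat, so it is best compared across essays of similar length.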
By cohesiveness, we mean how relevant the whole essay is to the prompt and how well its sentences and paragraphs flow into each other. We measure this by calculating the similarity scores between different sentences, and then finding the divergence values. High divergence values between adjacent sentences indicate that the essay does not flow smoothly, but rather jumps abruptly from one topic to another. Such essays, which show less cohesiveness, should be given fewer marks.
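The idea of divergence between adjacent sentences can be sketched as follows. This is only an illustration under stated assumptions: sentences are represented as simple bag-of-words count vectors, similarity is the cosine of the two vectors, and divergence is taken to be one minus that similarity. All function names are hypothetical, and the similarity measure in our system may differ.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence: str) -> Counter:
    # Assumption: words are lowercased runs of letters or apostrophes.
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    # Dot product over the words the two sentences share.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def adjacent_divergences(sentences: list[str]) -> list[float]:
    """Divergence (1 - cosine similarity) for each pair of neighbours."""
    vectors = [bag_of_words(s) for s in sentences]
    return [1.0 - cosine_similarity(x, y) for x, y in zip(vectors, vectors[1:])]


if __name__ == "__main__":
    essay = [
        "Dogs are loyal pets.",
        "Loyal pets are dogs.",
        "Stock markets fell sharply.",
    ]
    # The second value is much larger: the essay jumps to a new topic.
    print(adjacent_divergences(essay))
```

A run of high values in the returned list marks the places where the essay changes topic abruptly.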
We integrate all the above features into a scoring system that assigns the student essay a score from 0 to 6. An essay which is extremely good and has all the required characteristics scores 6, while a student who writes nothing gets 0. Scores from 1 to 5 are calibrated according to this model. In scoring, relevance to the prompt and relevance within the essay carry the most weight. The length of the essay also plays an important role. However, once the essay becomes very long (more than 750 words), we start to deduct points, as excessive length then tends to reduce the quality of the essay.
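A weighted combination of this kind might look like the sketch below. The specific weights, the penalty rate and the assumption that every feature has already been normalized to the range 0 to 1 are all hypothetical and chosen only to illustrate the structure: relevance and cohesiveness weigh most, an empty essay scores 0, and points are deducted beyond 750 words.

```python
# Hypothetical weights (they sum to 1); the real calibration is not shown here.
WEIGHTS = {
    "relevance": 0.30,
    "cohesiveness": 0.25,
    "length": 0.15,
    "grammaticality": 0.15,
    "paragraphs": 0.05,
    "type_token_ratio": 0.03,
    "fact_precision": 0.03,
    "avg_word_length": 0.02,
    "avg_sentence_length": 0.02,
}

MAX_WORDS = 750              # length limit stated in the text
PENALTY_PER_100_WORDS = 0.5  # hypothetical deduction rate


def essay_score(features: dict[str, float], word_count: int) -> int:
    """Combine features (each assumed to lie in [0, 1]) into a 0-6 score."""
    if word_count == 0:
        return 0  # nothing written
    weighted = sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    raw = 6.0 * weighted
    if word_count > MAX_WORDS:
        # Deduct points in proportion to the number of excess words.
        raw -= PENALTY_PER_100_WORDS * (word_count - MAX_WORDS) / 100
    return max(0, min(6, round(raw)))


if __name__ == "__main__":
    perfect = {name: 1.0 for name in WEIGHTS}
    print(essay_score(perfect, 500))
    print(essay_score(perfect, 950))
```

Keeping the weights in a single table makes it easy to recalibrate the scores from 1 to 5 against human-graded essays without touching the scoring logic.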
We now show various test cases in which each of our modules does or does not work. These test cases highlight the salient features of our system, while at the same time pointing out some of its limitations. Based on these test cases, we can deduce how well the system in general, and the modules in particular, work.
Given below is a list of the input sentences given to the relevance detector. We then show the cases where the module has worked well, and where it has failed.
Similarity Score = 0.961492172076257
Similarity Score = 0.959603572732741
Similarity Score = 0.954494948184705
Similarity Score = 0.952009766501856
Similarity Score = 0.943521526322683
Similarity Score = 0.912155300870518
Similarity Score = 0.79489428387211
Similarity Score = 0.766618183333831
Similarity Score = 0.53767700948703
Similarity Score = 0
Similarity Score = 0.956490367045901
Similarity Score = 0.664876199191321
Given below is a list of the input sentences given to the gibberish detector. We then show the cases where the module has worked well, and where it has failed.
Now, for every sentence, we explain in detail the various scores obtained by using both the Link Parser and the Sense Relate function. We also specify which rules and what values led us to conclude whether a sentence is Grammatical, Ungrammatical or Gibberish.
The following is a list of facts and the output as given by the Fact Identification module.
As we can see above, out of 10 facts, the system identifies 7. However, one of the fact statements given above is a generic fact and so need not be classified. Hence, we can say that our system was successful in identifying 7 out of 9 given facts. Though this figure is in no way impressive, it is enough for our purposes. We shall now try to analyze why our system failed to identify certain facts.
From the above examples, we get a clear idea of the cases in which our system works as expected and those in which it fails. The failures, as explained above, are due to the insufficient size of the corpus. However, according to Zipf's law, and as seen in many other cases, a corpus-based approach is not going to be sufficient, however large the corpus is. So, an approach which identifies the POS and looks out for proper nouns might work better in identifying the fact statements present in an essay.
This was a tricky module, both to implement and to evaluate. We make extensive use of the World Wide Web in this module, which makes it susceptible to noise and other problems that arise on the Internet. Search engines sometimes yield noisy results. The online encyclopedias are also not free of noise, as they are open-ended and anybody can put information on them. We shall now see some examples of cases where the module showed expected results and some where it showed unexpected results.
Thus, from the above results, we can say that most of the time our system works as per our expectations and identifies correct and wrong facts as such. Let us see in exact figures how well the module worked.
Hence, we can see that in a majority of the cases, the system can correctly identify whether a fact statement is true or not. We have already seen in the previous section why the module may fail in certain cases.
However, we would also like to point out that we have taken really difficult and extremely confusing examples. For example, there are many web pages which address Mumbai as the commercial capital of India. In such a scenario, it becomes very easy for the system to fail, as most of the words match the query string, with the exception of ``Commercial''. Though semantically the word makes a huge difference, the syntactic structure of the sentence may fool the system. It should be noted that even in this case, our system was able to identify the statement as the wrong fact that it is.
Also, the one correct fact which was stated as untrue was ``India, the world's largest democracy, is one of the oldest civilizations in the world.'' We can instantly see that this is a very long sentence, and this is the likely reason that the system fails to identify it as a correct fact. This may be because Google cannot accept very large search strings. If we break up the sentence into ``India is the largest democracy in the world'' and ``India is one of the oldest civilizations in the world'', then the system identifies both sentences as true statements of fact.
Besides, the String::Similarity package that was used in this module gives absurd results when there are many intervening spaces between the words of a particular sentence. This results in a high similarity score of 0.4545 between two dissimilar sentences such as ``and women managers in Singapore. Human Relations'' and ``In 1960 Ellis Page set the stage for automated writing evaluation''. The occurrences of ``age'' in ``managers'', ``Page'' and ``stage'' cause the algorithm to rate the similarity very high. The package, which was implemented by Marc Lehmann and downloaded from CPAN, is based on the O(ND) Difference Algorithm [11]. This algorithm works on the edit-distance concept, which compares strings based upon the number of characters that need to be edited to transform one string into another.
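The effect of character-level matching can be illustrated with a small Python sketch. It is not the String::Similarity implementation: it assumes that only insertions and deletions count as edits, finds the longest common subsequence with a simple quadratic dynamic program rather than the faster O(ND) algorithm, and uses a hypothetical normalization of twice the common length over the combined length.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the best match that ends just before both characters.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise drop one character from either string.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 2 * LCS / (len(a) + len(b)), in [0, 1]."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # two empty strings are identical
    return 2.0 * lcs_length(a, b) / total


if __name__ == "__main__":
    # "stage" and "page" share the subsequence "age".
    print(similarity("stage", "page"))
    print(similarity("abc", "xyz"))
```

Because the comparison is made character by character, shared letter sequences such as ``age'', as well as repeated spaces, raise the score even when the sentences have nothing in common semantically. Comparing at the word level would reduce this effect.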
Also, the large amount of noise present on the World Wide Web can cause the module to fail. The online encyclopedias are also prone to noise, as they are open-ended, which means that anyone can write anything on them.
So far, in Gibberish Detection we have not used any previous work that we know of. However, if we later realize that this approach has been followed by someone else, we shall definitely include them in our references.
As regards the Relevance Checker, we took the approach described in Section 8.5 of the book Foundations of Statistical Natural Language Processing by Christopher Manning and Hinrich Schütze. In this section they explain vector space measures and word-by-word co-occurrence matrices. These are the matrices that we have implemented. We have also implemented the Poor Man's LSA, which was the first assignment in this course. The idea of using weights to calculate similarity scores was inspired by Burstein et al. [2], whose approach also makes use of weights. Though the weighting is done differently, we have to credit them for this idea.
We would also like to mention here that none of the systems that we have come across has a module that identifies or verifies facts. In this regard, our project seems to be a pioneer. Hence, we do not think that there is any existing work which includes this feature. If, in future, we realize that someone has done considerable work in this field, we shall include them in the references.
[1] Landauer, Foltz and Laham. An Introduction to Latent Semantic Analysis. Discourse Processes. 25, 259-284.
[2] Attali and Burstein. Automated Essay Scoring with E-rater V.2.0. Paper presented at the Conference of the International Association for Educational Assessment held between June 13-18, PA.
[3] S. Reid & G. Findlay. Writer's workbench analysis of holistically scored essays. Computers and Composition, 3(2), 1986. 6-32.
[4] Patricia French. Developments in the provision of quality electronic summative assessments. The Open Polytechnic of New Zealand, Working Paper, March 1998.
[5] William Wresch. The Imminence of Grading Essays by Computer--25 Years Later. Computers and Composition 10(2), April 1993, 45-58.
[6] H. Hellwig. Computational text analysis for predicting holistic writing scores. Paper presented at Conference on College Composition and Communication, held in March 1990 at Chicago, IL.
[7] E. Page. Computer Grading of Student Prose, Using Modern Concepts and Software. Journal of Experimental Education, 62(2), 1994, 127-142.
[8] C. Leacock and M. Chodorow. C-rater: Automated Scoring of Short Answer Questions. Computers and the Humanities 37, no. 4 (2003): 389-405.
[9] D. Grinberg, J. Lafferty and D. Sleator. A robust parsing algorithm for link grammars. Carnegie Mellon University Computer Science technical report CMU-CS-95-125, and Proceedings of the Fourth International Workshop on Parsing Technologies. Prague, September, 1995.
[10] D. Lin. Dependency-based evaluation of minipar. Proceedings of Workshop on the Evaluation of Parsing Systems, First International Conference on Language Resources and Evaluation. 1998.
[11] E. Myers. An O(ND) Difference Algorithm and its Variations. Algorithmica, Vol. 1, No. 2 (1986): 251-266.