AUTOMATIC SHORT ANSWER GRADING A thesis submitted to Indian Institute of Technology, Kharagpur in partial fulfillment of the degree
Bachelor of Technology by
Buddha Prakash Under the supervision of
Prof. Anupam Basu
Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur May 2, 2016
Certificate

This is to certify that the project report titled ‘Automatic Short Answer Grading’, submitted by Buddha Prakash to the Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, in partial fulfillment of the requirements for the award of the degree of Bachelor of Technology (Hons), is a record of bona fide research work carried out by him under my supervision and guidance. The project report has fulfilled all the requirements as per the regulations of the institute and, in my opinion, has reached the standard needed for submission.

Date: _____________________________

Prof. Anupam Basu
Professor, Dept. of Computer Science and Engineering
Acknowledgements

I would like to take this opportunity to extend my thanks and gratitude to Prof. Anupam Basu for his guidance and encouragement throughout the course of my work. His encouragement to find new ideas to work on and to choose my own topic helped me gain interest and confidence in this topic. I am indebted to him for giving me this exposure in the fields of Natural Language Processing and Machine Learning. I would also like to thank Syaamantak Das and Archana Sahu, Ph.D. students of Prof. Anupam Basu, for their ideas and constant help over the last few months. I would like to thank all the faculty and staff of the Department of Computer Science, IIT Kharagpur for their help and assistance. I also wish to thank my parents and friends for their help and love.
Contents

1. Abstract ……………………………………………………………… 4
2. Introduction …………………………………………………………. 5
3. Related Work ………………………………………………………… 7
4. Data Description ……………………………………………………. 8
5. Automatic Answer Grading System ………………………………… 9
6. Preprocessing the Answers …………………………………………. 10
7. Feature Extraction …………………………………………………… 11
8. Training a Machine Learning Model ………………………………… 20
9. Experiments and Results ……………………………………………… 22
10. References …………………………………………………………… 24
Abstract

In my work, I explore and improve upon supervised techniques for the task of automatic short answer grading. I develop a two-level architecture for the task. In the first level I employ a number of knowledge-based and corpus-based measures of text similarity, combined with various distributional methods, to obtain multiple similarity measures between the student and model answer. These similarity values are then used in the second level as features to train a machine learning model. I use the trained model to predict the correctness of unseen testing data and compute the accuracy of the system. Overall, the system's performance is close to that of the state-of-the-art methods for short answer grading that have been proposed in the past.
Introduction

A major task in educational applications of NLP is to assess student responses to examination questions and homework. A critical subtask in student dialogue systems is Student Response Analysis (SRA): given a question and a few reference answers, the system needs to analyze a student response and decide whether it is correct or not. An automatic short answer grading system is one which automatically assigns a grade to an answer provided by a student through a comparison with one or more correct answers. The problem can be formally defined as follows: given a question, a few reference answers and a student answer, all in natural English, classify the student answer as right or wrong. A key requirement for achieving this is semantic inference, for example to detect whether the student answer says the same thing as the reference answer in different words or contradicts it. The task of automatic short answer grading has applications in a wide range of research fields, including paraphrase detection, textual entailment and machine translation evaluation.

In my work, I explore supervised techniques for the task of automatic short answer grading. I develop a two-level architecture for the task. In the first level I employ a number of knowledge-based and corpus-based measures of text similarity, combined with matrix factorization methods, to obtain multiple similarity measures between the student and model answer. These similarity values are then used in the second level as features to train a machine learning model. The text similarity methods employed have been used in previous work on similar tasks to obtain similarity measures of texts.

Feature extraction is the most important stage of the framework. Choosing and obtaining features which capture the degree of syntactic and semantic similarity between the student and reference answers is critical, and these features determine the performance of the system. I extract various semantic and syntactic similarity measures between the student and model answers, and employ various word-level semantic similarity metrics to obtain text-level similarity scores which are used as features. As suggested in the SemEval task baseline, there are four types of lexically driven text similarity measures, each computed by comparing the learner response and the model answer.
Most of these approaches are bag-of-words approaches and hence suffer from their inherent weaknesses, i.e. they lose the ordering of the words and they ignore the semantics of the words. For example, “powerful,” “strong” and “Paris” are treated as equally distant. An alternative is to start by mapping each word of the input language expressions to a vector that shows how strongly the word co-occurs with particular other words in corpora, possibly also taking syntactic information into account. A compositional vector-based meaning representation theory can then be used to combine the vectors of single words, eventually mapping each of the two input expressions to a single vector that attempts to capture its meaning; in the simplest case, the vector of each expression could be the sum or product of the vectors of its words, but more elaborate approaches have also been proposed. Text similarity can then be calculated by measuring the distance between the vectors of the two input expressions, for example by computing their cosine similarity. For obtaining the fixed-length word representations I experiment with both Latent Semantic Analysis and Paragraph Vector (Le and Mikolov, 2014). As defined by Wikipedia, Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text (Landauer and Dumais, 1997). Paragraph Vector is an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of text. The algorithm represents each text by a dense vector which is trained to predict words in the text. Its construction gives the algorithm the potential to overcome the weaknesses of bag-of-words models.

I use a Random Forest classifier for predicting correct/incorrect answers. This particular classifier is chosen due to its robustness and success in various other similar tasks. The model is then used to predict the correctness of student answers present in an unseen test set. The results I obtained are comparable to various state-of-the-art methods and systems which participated in the SemEval task.
Related Work
Burrows et al. (2015) [3] in their survey conclude that most of the work in the field of automatic short answer grading falls into one of the following themes:
● Era of Concept Mapping: consider student answers as made up of several concepts, and detect these concepts in answers while grading.
● Era of Information Extraction: employ information extraction methods to find specific ideas in the answers that are expected in correct answers.
● Era of Corpus-Based Methods: exploit statistical properties of large document corpora, e.g. to interpret synonyms in short answers.
● Era of Machine Learning: use measures extracted from natural language as features to train a classification model.
● Era of Evaluation: promote the use of shared corpora so that advancements in the field can be compared meaningfully.

My work broadly falls under the last two research eras, as I use a shared corpus and apply a machine learning approach to the task. In my study of the large body of research on evaluating the similarity of texts, I came across three major categories of methods that have been employed for automatic short answer grading: (1) string similarity metrics such as n-gram overlap and the BLEU score, (2) syntactic operations on the parse structure, and (3) distributional methods, such as latent semantic analysis. Most of the previous work has focused on these approaches individually; I combine methods from all of the above areas for answer grading. Lately there has been a lot of work using the last approach, and promising results have been achieved. I focus on developing a model built around distributional similarity measures and combine them with methods used in other areas of research, to obtain a model with wider coverage and performance comparable to state-of-the-art methods. The work by other researchers used in my approach is referenced in each section.
My approach for Automatic Short Answer Grading

The system comprises a two-level architecture for the automatic short answer grading task. In the first level I employ a number of knowledge-based and corpus-based measures of text similarity, combined with matrix factorization methods, to obtain multiple similarity measures between the student and model answer. These similarity values are then used in the second level as features to train a machine learning model. I use the trained model to predict the correctness of unseen testing data and compute the accuracy of the system.
Fig 1. Brief overview of the prediction system

I use a Random Forest classifier for the classification task in the second level. Random Forests have been shown to give very good results without needing much effort in hyperparameter tuning.
Fig 2. Prediction Pipeline
Data Description

The corpus I use was made available to the research community for the Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge [11]. The data set draws on two established sources: a data set collected and annotated during an evaluation of the BEETLE II tutorial dialogue system (Dzikovska et al., 2010a) (henceforth, the BEETLE corpus), and a set of student answers to questions from 16 science modules in the Assessing Science Knowledge (ASK) assessment inventory (Lawrence Hall of Science, 2006) (henceforth, the Science Entailments Bank or SCIENTSBANK).

BEETLE Corpus: The BEETLE corpus consists of the interactions between students and the BEETLE II tutorial dialogue system (Dzikovska et al., 2010b). The BEETLE II system is an intelligent tutoring system that teaches basic electricity and electronics concepts to students with no knowledge of high-school physics. The corpus contains explanation and definition questions which require longer answers consisting of 1–2 sentences, e.g., “Why was bulb A on when switch Z was open?” (expected answer: “Because it was still in a closed path with the battery”). From the full BEETLE evaluation corpus, only the students’ answers to explanation and definition questions are extracted, since reacting to them appropriately requires processing more complex input than factual questions do.

SCIENTSBANK Corpus: This corpus (Nielsen et al., 2008) consists of student responses to science assessment questions. Only a subset of the corpus is taken, containing questions that required students to explain their beliefs about topics, typically in one to two sentences.

Both corpora contain manually labeled student responses to explanation and definition questions. Specifically, the data set contains a question, multiple reference answers and a 1–2 sentence student answer. Each student answer is labeled with one of two judgments, i.e. correct or incorrect, by a human annotator.

Training dataset size: 3940 student answers
Test dataset size: 440 student answers
Preprocessing the student answers

We need to preprocess the data since the student answers often contain spelling errors. The system relies on word-to-word similarity, so preprocessing becomes even more essential: a number of the similarity measures use WordNet, which is a lexicon, and misspelled words would not be present in the lexicon, resulting in the system missing important information. For methods relying mainly on dependency structure this would not be a major problem. The student answers contain a large number of spelling errors.
Variations of ‘different’: diffrent, differnt, differant, diferent, diferrent, etc.
I implement the following workflow for text normalization, applied to all the questions, student answers and reference answers:

1. Tokenize all text and convert to lower case (NLTK tokenizer).
2. Remove all special characters such as !@#%^&.
3. Perform spelling correction (Norvig’s spell corrector).
4. Remove all words in answers which are not present in a dictionary or in the question.
5. Remove all stopwords.

For tokenizing the texts I use the NLTK word tokenizer. NLTK is a widely used Python library which has a state-of-the-art tokenizer implementation. For spelling correction I use the spelling corrector written by Peter Norvig [12], which achieves fairly high accuracy at very high processing speed. As my method mainly employs a bag-of-words model, I remove all the stopwords since they do not help improve the performance of the system. A sketch of this pipeline is shown below.
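The following is a minimal sketch of the normalization steps, assuming NLTK's tokenizer and English stopword list are available; `spell_correct` stands in for Norvig's spelling corrector and `dictionary` for the word list used in step 4, both of which are assumed to be provided elsewhere rather than shown here.

```python
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words("english"))

def normalize(text, question_words, dictionary, spell_correct):
    """Lower-case, strip special characters, spell-correct and filter tokens."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())    # keep only alphanumerics
    tokens = word_tokenize(text)                         # NLTK tokenization
    tokens = [spell_correct(t) for t in tokens]          # Norvig-style correction
    kept = [t for t in tokens
            if (t in dictionary or t in question_words)  # drop out-of-vocabulary words
            and t not in STOPWORDS]                      # drop stopwords
    return kept
```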
Feature Extraction

Feature extraction is the most important stage of the framework. Choosing and obtaining features which capture the degree of syntactic and semantic similarity between the student and reference answers is critical, and these features determine the performance of the system. Students often parrot back information mentioned in the question. This can result in false positives: the student answer may have a high similarity with the model answer, because the model answer contains text from the question, even though the student answer does not contain the actual answer. To tackle this problem, for each feature computed between the student and model answer I also include the corresponding student answer to question similarity in the feature set, so that the model takes this behaviour into account.
The features used can be broadly classified into the following categories:
1. Baseline features: four similarity metrics are computed – the raw number of overlapping words, the F1 score, the Lesk score, and the cosine score of the vector representations of the two answers. This baseline is based on the lexical overlap baseline used in RTE tasks (Bentivogli et al., 2009).
2. Semantic similarity features: various word similarity metrics are used to obtain word-to-word similarity, which is then combined using a metric to obtain answer-to-answer similarity.
3. Distributional features: answers are mapped to a fixed-length vector representation. Similarity between the two answers can then be measured by the similarity of the vectors of the two input expressions.
4. Other miscellaneous features: polarity markers and antonym features are used to detect opposite sense or meaning between the two answers.
Also, for each kind of similarity metric I obtain two features:
1. Similarity of the student answer with the model answer
2. Similarity of the student answer with the question
Baseline Features: There are four types of lexically driven text similarity measures, and each is computed by comparing the learner response to both the expected answer(s) and the question, resulting in eight features in total – four in comparison with the question, four with the maximum-similarity reference answer.

1. Overlapping words: simply the number of overlapping words between the student and reference answers.
2. Cosine similarity: representing each answer as a bag-of-words vector, cosine similarity is the cosine of the angle between the vectors.
3. BLEU metric: BLEU is a very popular metric used for MT evaluation. It uses a modified form of precision to compare a candidate translation against multiple reference translations. The metric modifies simple precision since machine translation systems have been known to generate more words than are in a reference text.
4. Lesk score: a simplified Lesk score is used to compare the overlap in meanings of the words in the answers.

This baseline is based on the lexical overlap baseline used in RTE tasks (Bentivogli et al., 2009). A small sketch of two of these features is given below.
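As an illustration, the word-overlap and bag-of-words cosine features could be computed as follows; this is a minimal sketch operating on already-normalized token lists, not the exact implementation used in the thesis.

```python
import math
from collections import Counter

def word_overlap(tokens_a, tokens_b):
    """Raw number of word types shared by the two answers."""
    return len(set(tokens_a) & set(tokens_b))

def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between bag-of-words count vectors."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```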
Distributional Features
Vector space models have recently been widely employed in the related tasks of paraphrase detection and textual entailment recognition. They are an alternative to logical meaning representations: we start by mapping each word of the input language expressions to a vector that shows how strongly the word co-occurs with particular other words in corpora (Lin, 1998b). A compositional vector-based meaning representation theory can then be used to combine the vectors of single words, eventually mapping each of the two input expressions to a single vector that attempts to capture its meaning. In the simplest case, the vector of each expression could be the sum or product of the vectors of its words, but more elaborate approaches have also been proposed (Mitchell & Lapata, 2008; Erk & Padó, 2009; Clarke, 2009). Similarity between two texts can then be computed by measuring the distance between the vectors of the two input expressions, for example by computing their cosine similarity. I employ LSA and distributed word representations to obtain similarity measures between the student and model answer. The overall approach is outlined below.
1. Unsupervised methods are employed to learn the vector representation of words, using all the reference answers as context.
2. Word vector representations are combined to obtain a fixed-length vector representation for the reference and student answers.
3. Similarity between a reference answer and the student answer is computed as the cosine similarity of their respective fixed-length vector representations.
4. The maximum similarity between the model answers and the student answer is used as a feature.
1. Learning vector representation of words
Word Representation Approaches: The first step toward obtaining distributional features is learning word representations. There are several methods in the literature for learning word representations from a text corpus in an unsupervised manner. Some widely employed examples are Latent Semantic Analysis, Latent Dirichlet Allocation, the Vector Space Model (VSM), Explicit Semantic Analysis (ESA), and distributed word representation approaches such as that of Mikolov et al. (2013). I briefly discuss below the methods that I use in my framework.

LSA: As defined in Wikipedia, Latent Semantic Analysis (LSA) is an algebraic method that represents the meaning of words as vectors in a multidimensional semantic space (Landauer et al. 2007). LSA starts by creating a word-document matrix. It then applies a singular value decomposition of the matrix, followed by factor analysis (dimensionality reduction). In other words, each word becomes a point in the new semantic space, and semantically similar words appear closer together in the reduced space (a small sketch is given below).

Distributed word vector representations (Mikolov et al.): In this section I explain the working of a well-known framework for learning word vectors. Given a sequence of training words, this framework produces a matrix W in which each column corresponds to the vector of one word. We therefore obtain a unique fixed-length vector for each word in the dataset.
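As a concrete illustration of the LSA construction described above, the sketch below reads word vectors off a truncated SVD of the term-document matrix. scikit-learn is an assumed library choice rather than the thesis's implementation, attribute names follow recent scikit-learn releases, and `documents` is assumed to be the list of normalized reference-answer strings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

def lsa_word_vectors(documents, n_dim=100):
    """Return a {word: vector} map from a truncated SVD of the term-document matrix."""
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(documents)   # documents x terms count matrix
    svd = TruncatedSVD(n_components=n_dim)    # n_dim must be smaller than the vocabulary size
    svd.fit(X)                                # X is approximated by a rank-n_dim factorization
    term_vectors = svd.components_.T          # one n_dim-dimensional vector per term
    vocab = vectorizer.get_feature_names_out()
    return dict(zip(vocab, term_vectors))
```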
The problem can be formally defined as follows: given a sequence of training words $w_1, w_2, w_3, \ldots, w_T$, the objective of the word vector model is to maximize the average log probability

$$\frac{1}{T}\sum_{t=k}^{T-k} \log p(w_t \mid w_{t-k}, \ldots, w_{t+k}).$$
The prediction task is typically done via a multiclass classifier, such as softmax. There, we have

$$p(w_t \mid w_{t-k}, \ldots, w_{t+k}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}}.$$
Each $y_i$ is the unnormalized log-probability for output word $i$, computed as

$$y = b + U\,h(w_{t-k}, \ldots, w_{t+k}; W),$$

where $U$ and $b$ are the softmax parameters and $h$ is constructed by a concatenation or average of word vectors extracted from $W$.
Figure 1. A framework for learning word vectors. Context of three words (“the,” “cat,” and “sat”) is used to predict the fourth word (“on”). The input words are mapped to columns of the matrix W to predict the output word.
The neural network based word vectors are usually trained using stochastic gradient descent where the gradient is obtained via backpropagation (Rumelhart et al., 1986). After the training converges, words with similar meaning are mapped to a similar position in the vector space. For example, “hate” and “envy” are close to each other, whereas “hate” and “love” are more distant.
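In practice, such word vectors can be trained with an off-the-shelf library. The sketch below uses gensim's Word2Vec as a stand-in for the framework described above (the thesis does not name a specific toolkit); parameter names follow recent gensim releases, and `tokenized_answers` is assumed to be the list of preprocessed token lists built earlier.

```python
from gensim.models import Word2Vec

# tokenized_answers: list of token lists, e.g. [["bulb", "closed", "path"], ...]
model = Word2Vec(
    sentences=tokenized_answers,  # training corpus (reference answers as context)
    vector_size=100,              # dimensionality of the word vectors
    window=5,                     # context window size (k words on each side)
    min_count=1,                  # keep rare words from a small corpus
    sg=0,                         # CBOW-style prediction of the centre word
)
vector = model.wv["bulb"]         # fixed-length vector for one word
```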
2. Vector representations of sentences

Combining LSA word representations: One simple method of combining individual word representations is to sum the vector representations of all the words in the sentence; this is how the LSA-based answer vectors are obtained.

Paragraph vectors (combining distributed word vector representations): The approach for learning paragraph vectors is inspired by the methods for learning word vectors. As in the word vector method, the paragraph vectors are also asked to contribute to the prediction task of the next word given many contexts sampled from the paragraph. The framework maps every paragraph to a unique vector, represented by a column in a matrix D. The paragraph vector and word vectors are averaged or concatenated to predict the next word in a context. The contexts are fixed-length and sampled from a sliding window over the paragraph. The paragraph vector is shared across all contexts generated from the same paragraph, but not across paragraphs. The paragraph vectors and word vectors are trained using neural networks with stochastic gradient descent.
Figure 2. This framework is similar to the framework presented in Figure 1; the only change is the additional paragraph token that is mapped to a vector via matrix D.
3 and 4. Compute maximum similarity between the reference and student answers

The similarity between a reference answer and the student answer is computed as the cosine similarity of their respective fixed-length vector representations, and the maximum similarity over the reference answers is used as a feature. The above steps are carried out to obtain an LSA-based and a paragraph-vector-based similarity measure between the student and reference answers. These two values are used as two additional features for the second stage of the framework. A sketch of this computation is given below.
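A minimal sketch of steps 2-4 for the LSA variant: answer vectors are formed by summing word vectors, and the feature is the maximum cosine similarity over the reference answers. `word_vectors` is assumed to be the {word: vector} map learned earlier; the paragraph-vector variant would instead infer an answer vector directly (for example with gensim's Doc2Vec).

```python
import numpy as np

def answer_vector(tokens, word_vectors, dim=100):
    """Sum the vectors of all known words in the answer."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.sum(vecs, axis=0) if vecs else np.zeros(dim)

def max_reference_similarity(student_tokens, reference_token_lists, word_vectors):
    """Maximum cosine similarity between the student answer and any reference answer."""
    s = answer_vector(student_tokens, word_vectors)
    best = 0.0
    for ref_tokens in reference_token_lists:
        r = answer_vector(ref_tokens, word_vectors)
        denom = np.linalg.norm(s) * np.linalg.norm(r)
        if denom > 0:
            best = max(best, float(np.dot(s, r) / denom))
    return best
```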
Semantic Similarity Features:
Given a metric for word-to-word similarity and a measure of word specificity, I define the semantic similarity of two text segments T1 and T2 using a metric that combines the semantic similarities of each text segment in turn with respect to the other text segment. First, for each word w in segment T1, I identify the word in segment T2 that has the highest semantic similarity to it (maxSim(w, T2)), according to one of the word-to-word similarity measures described in the following section. Next, the same process is applied to determine the most similar word in T1, starting with the words in T2. The word similarities are then weighted with the corresponding word specificity, summed up, and normalized by the length of each text segment. The similarity score obtained in this way has a value between 0 and 1, with a score of 1 indicating identical text segments and a score of 0 indicating no semantic overlap between the two segments.
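For reference, the published form of the text-to-text similarity of Mihalcea, Corley and Strapparava (2006) [7], on which this combination is based, can be written as follows, with idf(w) playing the role of the word specificity weight (the exact normalization used in my implementation is the one described above):

$$\mathrm{sim}(T_1, T_2) = \frac{1}{2}\left( \frac{\sum_{w \in T_1} \mathrm{maxSim}(w, T_2)\,\mathrm{idf}(w)}{\sum_{w \in T_1} \mathrm{idf}(w)} + \frac{\sum_{w \in T_2} \mathrm{maxSim}(w, T_1)\,\mathrm{idf}(w)}{\sum_{w \in T_2} \mathrm{idf}(w)} \right)$$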
Semantic similarity of words:
Wu and Palmer (Wu & Palmer, 1994): This similarity metric uses the depths of the two given concepts in the WordNet taxonomy, together with the depth of their least common subsumer (LCS), and combines these figures into a similarity score.
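In its commonly cited form, with depth(·) denoting the depth of a concept in the WordNet hierarchy, the score is:

$$\mathrm{Sim}_{wup}(c_1, c_2) = \frac{2 \cdot \mathrm{depth}(\mathrm{LCS}(c_1, c_2))}{\mathrm{depth}(c_1) + \mathrm{depth}(c_2)}$$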
Resnik: The measure introduced by Resnik (Resnik, 1995) returns the information content (IC) of the LCS of the two concepts.
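In its commonly cited form, with P(c) the probability of encountering an instance of concept c in a large corpus, this is:

$$\mathrm{Sim}_{res}(c_1, c_2) = \mathrm{IC}(\mathrm{LCS}(c_1, c_2)), \qquad \mathrm{IC}(c) = -\log P(c)$$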
Lin: Lin's measure builds on Resnik's measure of similarity, adding a normalization factor consisting of the information content of the two input concepts.
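Its commonly cited form is:

$$\mathrm{Sim}_{lin}(c_1, c_2) = \frac{2 \cdot \mathrm{IC}(\mathrm{LCS}(c_1, c_2))}{\mathrm{IC}(c_1) + \mathrm{IC}(c_2)}$$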
Leacock & Chodorow: The Leacock & Chodorow similarity is determined from the length of the shortest path between the two concepts (using node counting) and the maximum depth D of the taxonomy.
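In its commonly cited form, with length denoting the shortest path length, this is:

$$\mathrm{Sim}_{lch}(c_1, c_2) = -\log\frac{\mathrm{length}(c_1, c_2)}{2D}$$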
Other Features:

Polarity markers: I include a feature to capture the presence (or absence) of linguistic markers of negative polarity in both text and hypothesis, such as not, no, few, without, except, etc. If neither the student answer nor the reference answer contains words of negative polarity, the polarity feature takes a value of 1, and 0 otherwise.

Antonymy: I include a feature to capture the presence of antonyms in the student and reference answers. I check whether an aligned pair of words in the student and reference answers appears to be antonymous by consulting the WordNet ontology [9]. If such a pair of antonymous words occurs, I also check the preceding words for polarity. For example, if the student and reference answers contain the words good and bad respectively, I assign a boolean positive to this feature. However, if bad is preceded by not, then the feature returns a boolean negative. A sketch of the antonym check is shown below.
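A minimal sketch of the WordNet antonym lookup, using NLTK's WordNet interface; the alignment of word pairs and the handling of preceding negations are assumed to happen elsewhere.

```python
from nltk.corpus import wordnet as wn

def are_antonyms(word_a, word_b):
    """Return True if WordNet lists word_b as an antonym of any sense of word_a."""
    for synset in wn.synsets(word_a):
        for lemma in synset.lemmas():
            for antonym in lemma.antonyms():
                if antonym.name() == word_b:
                    return True
    return False

# Example: are_antonyms("good", "bad") is expected to return True
```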
Training a Machine Learning Model

Once the feature data set is obtained, I fit a machine learning model to the data. I train a Random Forest classifier to learn the weights of the various features in the prediction model. Random forests are an ensemble learning method for classification (and regression) that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.

The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. Given a training set $X = x_1, \ldots, x_n$ with responses $Y = y_1, \ldots, y_n$, bagging repeatedly ($B$ times) selects a random sample with replacement of the training set and fits trees to these samples. For $b = 1, \ldots, B$:

1. Sample, with replacement, $n$ training examples from $X, Y$; call these $X_b, Y_b$.
2. Train a decision or regression tree $f_b$ on $X_b, Y_b$.

After training, predictions for unseen samples $x'$ can be made by averaging the predictions from all the individual regression trees on $x'$,
or by taking the majority vote in the case of classification trees. This bootstrapping procedure leads to better model performance because it decreases the variance of the model without increasing the bias. This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. Simply training many trees on a single training set would give strongly correlated trees (or even the same tree many times, if the training algorithm is deterministic); bootstrap sampling is a way of decorrelating the trees by showing them different training sets.

The number of samples/trees, $B$, is a free parameter. Typically, a few hundred to several thousand trees are used, depending on the size and nature of the training set. An optimal number of trees $B$ can be found using cross-validation, or by observing the out-of-bag error: the mean prediction error on each training sample $x_i$, using only the trees that did not have $x_i$ in their bootstrap sample. The training and test error tend to level off after some number of trees have been fit.

The above procedure describes the original bagging algorithm for trees. Random forests differ in only one way from this general scheme: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. This process is sometimes called "feature bagging". The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the $B$ trees, causing them to become correlated.

The trained model is used to predict the labels on a test set containing student answers, and the prediction accuracy is obtained. A sketch of this final stage is shown below.
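A minimal sketch of this stage with scikit-learn, assuming the extracted similarity features have been assembled into numeric arrays (X_train, y_train, X_test); the hyperparameter values shown are illustrative rather than those used in the thesis.

```python
from sklearn.ensemble import RandomForestClassifier

# X_train, y_train: feature matrix and correct/incorrect labels from the training set
clf = RandomForestClassifier(n_estimators=500, random_state=0)  # B = 500 trees
clf.fit(X_train, y_train)              # bagging + feature bagging happen internally
predictions = clf.predict(X_test)      # majority vote over the individual trees
```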
Experiments and Results

Evaluation of the system: The labels correct and incorrect are roughly balanced in the dataset, so the results are evaluated on the macro-average F1 metric.

Macro-average F1: the average of the F1 scores of the two classes.

The best macro-average F1 of 0.826 was obtained with all manually crafted features combined with the question similarity features.
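As a small illustration of the metric itself (with hypothetical labels, not actual system output), macro-average F1 can be computed directly with scikit-learn:

```python
from sklearn.metrics import f1_score

# Hypothetical gold labels and predictions (1 = correct, 0 = incorrect)
gold = [1, 0, 1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Macro-average: compute F1 for each class separately, then take the unweighted mean
print(f1_score(gold, pred, average="macro"))  # 0.75 for this toy example
```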
The results reflect the system’s ability to correctly evaluate student answers in the majority of cases. An F1 score of 0.826 is comparable to the best systems for automatic short answer grading.
Features                   | Macro Average F1
---------------------------|-----------------
BLEU                       | 0.718
BLEU + Baseline            | 0.731
BLEU + Baseline + Semantic | 0.768
Paragraph vectors / LSA    | 0.814
Rest + combined Question   | 0.826
ETS2                       | 0.833
CoMet1                     | 0.831
Fig: Performance comparison with other systems

Since I use a shared corpus, it was possible to compare my results directly with those of the winners of the challenge, and my system performs reasonably well in comparison to the state-of-the-art methods. The winning submission in the challenge, from ETS2, had an F1 score of 0.833, whereas the best performance of my system was 0.826. Clearly this approach of combining different similarity features is comparable to the state-of-the-art systems and, with proper fine tuning, can be adapted and used in real examinations.
References

1. Michael Mohler and Rada Mihalcea (2009). Text-to-text semantic similarity measures for automatic short answer grading. EACL '09: Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 567–575.
2. Michael A.G. Mohler, Razvan Bunescu, and Rada Mihalcea (2011). Learning to Grade Short Answer Questions using Semantic Similarity Measures and Dependency Graph Alignments. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.
3. Steven Burrows, Iryna Gurevych, and Benno Stein (2015). The eras and trends of automatic short answer grading. International Journal of Artificial Intelligence in Education, 25:60–117.
4. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. ACL '02: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.
5. Socher, Huang, Pennington, Ng, and Manning (2011). Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. Advances in Neural Information Processing Systems.
6. Miller, G. A. (1995). WordNet: a lexical database for English. Communications of the ACM, 38(11), 39–41.
7. Rada Mihalcea, Courtney Corley, and Carlo Strapparava (2006). Corpus-based and Knowledge-based Measures of Text Semantic Similarity. AAAI.
8. Yangfeng Ji and Jacob Eisenstein (2013). Discriminative Improvements to Distributional Sentence Similarity. Proceedings of the Conference on Empirical Methods in Natural Language Processing.
9. Miller, G. A. (1995). WordNet: a lexical database for English. Communications of the ACM, 38(11), 39–41.
10. Sumit Basu, Chuck Jacobs, and Lucy Vanderwende (2013). Powergrading: A Clustering Approach to Amplify Human Effort for Short Answer Grading. ACL – Association for Computational Linguistics.
11. The Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge. SemEval-2013 Task 7: Semantic Evaluation Exercises, International Workshop on Semantic Evaluation.
12. A simple spelling corrector by Peter Norvig (http://norvig.com/spell-correct.html)
13. Perez, D., & Alfonseca, E. (2005). Application of the BLEU algorithm for recognizing textual entailments. In Proc. of the PASCAL Challenges Workshop on Recognising Textual Entailment, Southampton, UK.
14. Budanitsky, A., & Hirst, G. (2006). Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1), 13–47.
15. Corley, C., & Mihalcea, R. (2005). Measuring the semantic similarity of texts. In Proc. of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, pp. 13–18, Ann Arbor, MI.
16. Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado, Greg S.; Dean, Jeff (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems.
17. Q. Le and T. Mikolov (2014). Distributed Representations of Sentences and Documents. In Proceedings of ICML 2014.