Call for participation
Russian Paraphrase Detection Task
We are pleased to announce a shared task on sentence paraphrase detection for the Russian language. The core dataset for the task is ParaPhraser, a freely available corpus of Russian sentence pairs manually annotated as paraphrases, near-paraphrases and non-paraphrases. The corpus is being developed at St. Petersburg State University as part of a project led by E. Yagunova. The corpus currently contains 7000 pairs, which will be used as the training set. The test set is currently being collected using the same crowdsourcing procedure and will become part of the general corpus after the end of the shared task; we expect the test set to contain about 1000 pairs.


The shared task will follow the standard procedure: each participating system takes a pair of sentences as input and returns the similarity class as its response. The shared task will consist of two sub-tasks: binary classification (paraphrase or non-paraphrase) and three-class classification (precise paraphrases, near paraphrases or non-paraphrases). Participants may submit "standard" runs, which use only the ParaPhraser corpus as training data, and "non-standard" runs, which may use any other data. "Standard" and "non-standard" runs will be evaluated separately.
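The two sub-tasks can be illustrated with a minimal scoring sketch. It rests on several assumptions that are not stated in this announcement: the label strings ("precise", "near", "non") are hypothetical, the binary sub-task is assumed to treat both precise and near paraphrases as paraphrases, and plain accuracy is used only as an example metric. The official evaluation procedure may differ.

```python
def to_binary(label):
    # Assumption: precise and near paraphrases both count as "para".
    return "non" if label == "non" else "para"


def accuracy(gold, predicted):
    # Share of sentence pairs whose predicted class matches the gold class.
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Hypothetical labels for four sentence pairs.
gold = ["precise", "near", "non", "non"]
predicted = ["near", "near", "non", "precise"]

three_class_score = accuracy(gold, predicted)
binary_score = accuracy([to_binary(g) for g in gold],
                        [to_binary(p) for p in predicted])
```

In this toy example a system that confuses precise and near paraphrases loses points in the three-class setting but not in the binary one.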

More details on the tasks, evaluation and data formats are available here:
ParaPhraser Program
10th of November
ParaPhraser: Russian Paraphrase Corpus and Shared Task
Elena Yagunova
Combined Models for Russian Paraphrase Detection Shared Task
Ekaterina Pronoza
In this report we describe paraphrase detection models we have built as part of the Russian Paraphrase Detection Shared Task 2016. Our approach towards paraphrase detection is based on the use of three types of features: shallow, semantic and distributional. In the standard runs, we used the combination of shallow and semantic feature sets, and in the non-standard runs we also experimented with distributional features based on word and phrase embeddings. According to the achieved results, our simple model performs better than the complex one.
HSE-School of linguistics at Russian Paraphrase Detection shared task
Anastasia Romanova, Mikhail Nefedov
The task of paraphrase detection is to tell whether a pair of sentences is semantically equivalent or to give a score of that equivalence. To perform it effectively, methods of meaning representation are used. For representing the meaning of words, vector space models such as word2vec and GloVe have proved effective. A common approach for sequences is to take the mean of all the word vectors. This is, however, a rough approach. To capture sequence meaning more accurately, we explore other suggested techniques, including a modification of the BM25 algorithm with IDF weighting, binning of per-dimension similarities and binning of max similarities. We try several pre-trained word2vec models with different parameters, as well as their combinations. We also experiment with the recently published syntactic parser SyntaxNet, which determines the syntactic relationships between words in a sentence and represents them as a dependency parse tree. We compute the tree edit distance between the two dependency trees and use it as another feature. Combining these semantic features with simple surface features such as precision, recall and BLEU score lets us achieve the best results in Tasks 1 and 2 (non-standard runs).
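The baseline mentioned in the abstract above, taking the mean of all word vectors, can be sketched as follows. This is only an illustration under stated assumptions: the three-dimensional vectors and the English toy vocabulary are invented for the example, whereas a real system would load pre-trained word2vec vectors for Russian. It does not reproduce the authors' BM25, binning or tree edit distance features.

```python
import math

# Toy embeddings invented for illustration; not real word2vec output.
TOY_VECTORS = {
    "cat": [1.0, 0.0, 0.0],
    "kitten": [0.9, 0.1, 0.0],
    "sleeps": [0.0, 1.0, 0.0],
    "naps": [0.0, 0.9, 0.1],
    "stocks": [0.0, 0.0, 1.0],
    "fell": [0.1, 0.0, 0.9],
}


def mean_vector(tokens, vectors):
    # Average the vectors of all tokens found in the vocabulary.
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    # Cosine similarity; defined as 0.0 when either vector is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sentence_similarity(s1, s2, vectors=TOY_VECTORS):
    # Whitespace tokenisation is a simplification for this sketch.
    m1 = mean_vector(s1.lower().split(), vectors)
    m2 = mean_vector(s2.lower().split(), vectors)
    if m1 is None or m2 is None:
        return 0.0
    return cosine(m1, m2)
```

Because averaging ignores word order and weights every word equally, two sentences with the same words in a different arrangement get the same score, which is one reason the abstract calls the approach rough.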
Paraphrase Detection using Semantic Similarity Algorithms

Dmitry Kravchenko (skype)
Paraphrase detection using Machine Learning algorithms, External resources and Toolkits
Russian Paraphrase Identification with Simple Overlap Features with SVMs
Asli Eyecioglu Ozmutlu (skype)
Character-level overlap features with SVMs have previously been applied to paraphrase identification in English and Turkish. We also obtained satisfactory results on the Russian paraphrase identification task using the same set of features. Here, we use three-class classification in addition to binary classification. We explain our methods in comparison with the results previously obtained on other corpora.
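One way to picture character-level overlap features is the sketch below. The abstract does not list the exact features, so the choice of character bigrams, the set-based counts and the Jaccard ratio are assumptions made for illustration; the function and key names are hypothetical. In a full system, feature vectors of this kind would be passed to an SVM classifier.

```python
def char_ngrams(text, n=2):
    # Set of character n-grams of the lowercased text (spaces included).
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def overlap_features(s1, s2, n=2):
    # Simple set-overlap statistics between the two sentences' n-gram sets.
    a, b = char_ngrams(s1, n), char_ngrams(s2, n)
    inter = a & b
    union = a | b
    return {
        "intersection": len(inter),
        "union": len(union),
        "size_1": len(a),
        "size_2": len(b),
        "jaccard": len(inter) / len(union) if union else 0.0,
    }


# The same code works on Cyrillic text, since Python strings are Unicode.
features = overlap_features("Кошка спит", "Кошка спала")
```

Working at the character level needs no tokeniser or lemmatiser, which may help with a morphologically rich language such as Russian, where inflected forms share long character sequences.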
NLX-Group at Russian Paraphrase Detection Task: Character-level Convolutional Neural Network for Sentence Paraphrase Detection
Vladislav Maraev
This paper reports on the results of an experimental study on the application of character-level embeddings and a basic convolutional neural network to the shared task of sentence paraphrase detection in Russian. The approach was tested in the Task 2 standard run and offered competitive results (72.74% accuracy on the test set). This approach is compared against a word-level convolutional neural network for the same task.
Neural nets for paraphrase detection in Russian
Kirill Skornyakov
This report considers a solution to the paraphrase detection tasks. It describes the features specific to these tasks and shows the benefit of the neural network approach.
Closing remarks
Lidia Pivovarova

Lidia Pivovarova, University of Helsinki
Ekaterina Pronoza, St. Petersburg State University
Elena Yagunova, St. Petersburg State University