Please use this identifier to cite or link to this item:
http://cris.utm.md/handle/5014/466
Title: | Automate plagiarism detection | Authors: | STRATULAT, Eugeniu; STROIANEȚKI, Stanislav; BOBICEV, Victoria |
Keywords: | plagiarism;automate plagiarism detection;text classification;substring search | Issue Date: | 2019 | Source: | STRATULAT, Eugeniu; STROIANEȚKI, Stanislav; BOBICEV, Victoria. Automate plagiarism detection. In: Electronics, Communications and Computing. Editia a 10-a, 23-26 octombrie 2019, Chişinău. Chișinău, Republica Moldova: Universitatea Tehnică a Moldovei, 2019, p. 29. ISBN 978-9975-108-84-3. | Conference: | Electronics, Communications and Computing | Abstract: | The paper presents a study in which an application for plagiarism detection was created. It was evaluated using the set of documents provided by the PAN 2009 task on external plagiarism detection [1]. The task was formulated as follows: given a set of suspicious documents and a set of source documents, find all text passages in the suspicious documents that have been plagiarized, together with the corresponding text passages in the source documents. The organizers provided a training corpus comprising a set of suspicious documents and a set of source documents. A suspicious document may contain plagiarized passages from one or more source documents. The main metric used for document comparison was NCD (Normalized Compression Distance), which is actually a family of functions that take two objects (here, texts) as arguments and evaluate a fixed formula expressed in terms of the compressed versions of these objects, separately and combined [3]. The method is the outcome of theoretical mathematical developments based on Kolmogorov complexity [4]. The smaller the result, the more similar the objects. The application for plagiarism detection was written in PHP. The similarity of two lines is calculated using the algorithm described in [2]. The threshold value was estimated on the basis of the training data; the selected value provides the best plagiarism detection accuracy on the given texts.
To evaluate our application, we used 400 documents from the set provided by the task organizers. We calculated Precision and Recall on one tenth of this set, namely 40 documents. Information about the plagiarism in these 40 documents was provided by the task organizers, so we knew that exactly 5 of the 40 documents contained plagiarized fragments. The application returned exactly 5 files in which plagiarism was found. This result demonstrated that the application is suitable for the task. |
URI: | http://cris.utm.md/handle/5014/466 | ISBN: | 978-9975-108-84-3 |
Appears in Collections: | Conference Abstracts |
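The abstract describes NCD only in words. The following is a minimal Python sketch of one commonly used form of the distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the length of the compressed input. It is an illustration under stated assumptions, not the authors' PHP implementation: the record does not say which compressor or formula variant the application uses, so zlib is assumed here, and the names `compressed_size` and `ncd` are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Return the length in bytes of the zlib-compressed input.

    zlib at level 9 is an assumption; any real-world compressor
    could be substituted, giving a different member of the NCD family.
    """
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings.

    Uses the compressed sizes of x, y, and their concatenation.
    Smaller values indicate more similar inputs.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    # Illustrative inputs only; these are not taken from the PAN 2009 corpus.
    source = (
        b"A suspicious document may contain plagiarized passages "
        b"from one or more source documents."
    )
    copied = source  # a verbatim copy of the source passage
    unrelated = bytes(range(256))  # data sharing nothing with the source

    print("copy vs source:     ", ncd(copied, source))
    print("unrelated vs source:", ncd(unrelated, source))
```

In a detector along the lines the abstract describes, a pair of passages would be flagged when the distance falls below a threshold tuned on training data. The threshold itself is not given in this record.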
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
29-29_11.pdf | | 287.49 kB | Adobe PDF |