Automate plagiarism detection

STRATULAT, Eugeniu; STROIANEȚKI, Stanislav; BOBICEV, Victoria

Please use this identifier to cite or link to this item: http://cris.utm.md/handle/5014/466

Title: Automate plagiarism detection

Authors: STRATULAT, Eugeniu
STROIANEȚKI, Stanislav
BOBICEV, Victoria

Keywords: plagiarism;automate plagiarism detection;text classification;substring search

Issue Date: 2019

Source: STRATULAT, Eugeniu; STROIANEȚKI, Stanislav; BOBICEV, Victoria. Automate plagiarism detection. In: Electronics, Communications and Computing. 10th edition, 23-26 October 2019, Chișinău. Chișinău, Republic of Moldova: Universitatea Tehnică a Moldovei, 2019, p. 29. ISBN 978-9975-108-84-3.

Conference: Electronics, Communications and Computing

Abstract:
The paper presents an application for plagiarism detection and its evaluation on the document set provided by the PAN 2009 task on external plagiarism detection [1]. The task is formulated as follows: given a set of suspicious documents and a set of source documents, find all text passages in the suspicious documents that have been plagiarized, together with the corresponding text passages in the source documents. The organizers provided a training corpus comprising a set of suspicious documents and a set of source documents. A suspicious document may contain plagiarized passages from one or more source documents. The main metric used for document comparison was NCD (Normalized Compression Distance), a family of functions that take two objects (here, texts) as arguments and evaluate a fixed formula expressed in terms of the compressed versions of these objects, separately and combined [3]. The method is the outcome of theoretical developments based on Kolmogorov complexity [4]. The smaller the result, the more similar the objects. The application was written in PHP. The similarity of two lines is calculated using the algorithm described in [2]. The threshold value was estimated on the training data; the selected value gives the best plagiarism detection accuracy on the given texts. To evaluate the application we used 400 documents from the set provided by the task organizers. We calculated Precision and Recall on one tenth of this set, namely 40 documents. The task organizers supplied the plagiarism annotations for these 40 documents, so we knew that exactly 5 of them contained plagiarized fragments. The application returned exactly 5 files in which plagiarism was found. This result indicates that the application is suitable for the task.
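
The distance and the document-level evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it is written in Python rather than PHP, it assumes zlib as the compressor (the abstract does not name one), and it uses the commonly cited form NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. The function names, the `threshold` parameter, and the document identifiers in the usage note are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (compressor choice is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance between two texts.

    Smaller values mean the texts are more similar.
    """
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def is_suspicious(passage: str, source: str, threshold: float) -> bool:
    """Flag a passage whose distance to a source falls at or below the threshold.

    The threshold is hypothetical here; the abstract says it was tuned on
    training data but does not report its value.
    """
    return ncd(passage, source) <= threshold


def precision_recall(flagged: set, truth: set) -> tuple:
    """Document-level precision and recall for a set of flagged document IDs."""
    true_positives = len(flagged & truth)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall
```

In use, `ncd(text, text)` comes out much smaller than the distance between two unrelated texts, and `precision_recall({"doc1", "doc2"}, {"doc2", "doc3"})` returns `(0.5, 0.5)` for these made-up identifiers. A real compressor only approximates Kolmogorov complexity, so the distance of a text to itself is small but usually not exactly zero.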

URI: http://cris.utm.md/handle/5014/466

ISBN: 978-9975-108-84-3

Appears in Collections: Conference Abstracts
