Automate plagiarism detection

STRATULAT, Eugeniu; STROIANEȚKI, Stanislav; BOBICEV, Victoria

Please use this identifier to cite or link to this item: http://cris.utm.md/handle/5014/466

DC Field Value Language

dc.contributor.author STRATULAT, Eugeniu en_US

dc.contributor.author STROIANEȚKI, Stanislav en_US

dc.contributor.author BOBICEV, Victoria en_US

dc.date.accessioned 2020-04-28T18:31:20Z -

dc.date.available 2020-04-28T18:31:20Z -

dc.date.issued 2019 -

dc.identifier.citation STRATULAT, Eugeniu; STROIANEȚKI, Stanislav; BOBICEV, Victoria. Automate plagiarism detection. In: Electronics, Communications and Computing. 10th edition, 23-26 October 2019, Chișinău. Chișinău, Republica Moldova: Universitatea Tehnică a Moldovei, 2019, p. 29. ISBN 978-9975-108-84-3. en_US

dc.identifier.isbn 978-9975-108-84-3 -

dc.identifier.uri http://cris.utm.md/handle/5014/466 -

dc.description.abstract The paper presents a study in which an application for plagiarism detection was created. It was evaluated on the set of documents provided by the PAN 2009 task on external plagiarism detection [1]. The task is formulated as follows: given a set of suspicious documents and a set of source documents, find all text passages in the suspicious documents that have been plagiarized and the corresponding text passages in the source documents. The organizers provided a training corpus comprising a set of suspicious documents and a set of source documents. A suspicious document may contain plagiarized passages from one or more source documents. The main metric used for document comparison was NCD (Normalized Compression Distance), a family of functions that take two objects (here, texts) as arguments and evaluate a fixed formula expressed in terms of the compressed versions of these objects, separately and combined [3]. The method is the outcome of theoretical mathematical developments based on Kolmogorov complexity [4]. The smaller the result, the more similar the objects. The application for plagiarism detection was written in PHP. The similarity of two lines is calculated using the algorithm described in [2]. The threshold value was estimated on the basis of the training data; this value provides the best plagiarism detection accuracy on the given texts. To evaluate the application, we used 400 documents from the set provided by the task organizers. We calculated Precision and Recall on one tenth of this set, namely 40 documents. Information about plagiarism in these 40 documents was provided by the task organizers, so we knew that only 5 of these 40 documents contained plagiarized fragments. The application returned exactly 5 files in which plagiarism was found. This result demonstrated that the application is suitable for the task. en_US

dc.language.iso en en_US

dc.subject plagiarism en_US

dc.subject automate plagiarism detection en_US

dc.subject text classification en_US

dc.subject substring search en_US

dc.title Automate plagiarism detection en_US

dc.type Article en_US

dc.relation.conference Electronics, Communications and Computing en_US

item.grantfulltext open -

item.fulltext With Fulltext -

item.languageiso639-1 other -

crisitem.author.dept Department of Computer Science and Systems Engineering -

crisitem.author.parentorg Faculty of Computers, Informatics and Microelectronics -
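
The abstract describes NCD only in words, as a fixed formula over the compressed sizes of two texts taken separately and combined. A minimal sketch of the form commonly given in the literature, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), is shown below. It is written in Python rather than the PHP used by the authors, and the compressor (zlib), the helper names `compressed_size`, `ncd`, and `looks_plagiarized`, and the default threshold of 0.5 are all assumptions made for illustration; the record does not state which compressor or threshold the application used.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (compressor is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings.

    Values near 0 mean the inputs share most of their content;
    values near 1 mean they share little.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # the two objects compressed together
    return (cxy - min(cx, cy)) / max(cx, cy)


def looks_plagiarized(suspicious: str, source: str, threshold: float = 0.5) -> bool:
    """Flag a pair when its distance falls below a threshold.

    The default threshold is a placeholder, not the value estimated by the authors.
    """
    distance = ncd(suspicious.encode("utf-8"), source.encode("utf-8"))
    return distance < threshold


if __name__ == "__main__":
    source = "Plagiarism detection compares suspicious passages with sources. " * 10
    copy = source
    other = "".join(chr(33 + (i * 7919) % 90) for i in range(len(source)))
    print("copy vs source: ", round(ncd(copy.encode(), source.encode()), 3))
    print("other vs source:", round(ncd(other.encode(), source.encode()), 3))
```

Because zlib only looks back over a 32 KB window, this sketch is meaningful only for short passages; comparing whole documents would need chunking or a different compressor.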

Appears in Collections: Conference Abstracts
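
The evaluation reported in the abstract (Precision and Recall over 40 documents, 5 of which contain plagiarized fragments) can be sketched as follows. The function name `precision_recall` and the document identifiers are hypothetical. The record says only that the application returned 5 files, so the example assumes, without confirmation from the source, that these were the same 5 documents marked by the task organizers.

```python
from typing import Set, Tuple


def precision_recall(flagged: Set[str], truly_plagiarized: Set[str]) -> Tuple[float, float]:
    """Document-level precision and recall for a set of flagged documents."""
    true_positives = len(flagged & truly_plagiarized)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(truly_plagiarized) if truly_plagiarized else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical identifiers; the real file names are not given in the record.
    truth = {"doc03", "doc11", "doc17", "doc24", "doc38"}
    flagged = set(truth)  # assumption: the 5 returned files are the 5 true cases
    print(precision_recall(flagged, truth))
```

If even one of the 5 returned files were a false alarm, both measures would drop to 0.8, which is why the count alone does not fix the scores.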

Files in This Item:

File Description Size Format
29-29_11.pdf 287.49 kB Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
