Plagiarism Detection - YAP


The Problem of Plagiarism

It is an unfortunate fact of life for University lecturers that the pressures on students lead some of them to copy other students' assignments, or at least to obtain more assistance from their friends than is appropriate. Apart from discrediting the use of assignments for assessment, the copying of assignments also vitiates the assignments' educational aims. The typical institutional response is to require that assignments form only a small part of a student's assessment. However, such a response is inappropriate because it results either in trivial assignments or in assignments which do not adequately repay students in marks for the effort that they have invested.

Computer-based plagiarism detection can restore much of the confidence in the usefulness of computer-based assignments, because the same computers that students use to do the assignments can be used to automate the testing of the assignments and then the detection of similarities among the submissions.

Plagiarism versus Cooperation

Setting aside the issue of group assignments (where a different set of issues arises, to do with the equitable division of labour), students are encouraged to discuss their work with other students, e.g. the merits of particular data-structure choices. With this in mind, and given that students will be attempting the same task in the same language, low-level similarities are bound to arise. However, in the same way that two people given the same topic and a lexicon will nonetheless write very different essays, the similarities due to discussion tend to dissipate remarkably quickly.

The situation is very different when students have seen others' work, because even if they hand back the source texts - unlikely anyway - it is almost inevitable that they will reproduce the original: if they had any strong ideas of their own about how to tackle the assignment, they would not need the crutch!

Plagiarism versus Accidental Similarity

Parker and Hamblen (1982) define software plagiarism as: a program which has been produced from another program with a small number of routine changes. In practice, if one program can be transformed into another simply through use of editor operations (such as global substitutions) or by exploiting synonymous expressions provided by the programming language, then a prima facie case of plagiarism has been found and should be examined further. Note that neither ploy requires a knowledge of the problem being solved by the source program. Note too that this is the level at which optimizing compilers operate.
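The global-substitution ploy described above can be illustrated with a small sketch. It is not YAP's tokenizer: the keyword set, the token names ID and NUM, and the function normalise are all assumptions made for illustration. The idea shown is only that, once identifiers and literals are abstracted away, a program disguised by renaming yields the same token stream as its original.

```python
import re

# Hypothetical keyword set for a tiny C-like language (an assumption,
# not a token table taken from YAP).
KEYWORDS = {"if", "else", "while", "for", "return", "int"}

# Identifiers/keywords, integer literals, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalise(source):
    """Return a token stream with identifiers and numbers abstracted."""
    tokens = []
    for lexeme in TOKEN_RE.findall(source):
        if lexeme in KEYWORDS:
            tokens.append(lexeme)      # keywords are kept as they are
        elif lexeme[0].isalpha() or lexeme[0] == "_":
            tokens.append("ID")        # every identifier looks the same
        elif lexeme[0].isdigit():
            tokens.append("NUM")       # every integer literal looks the same
        else:
            tokens.append(lexeme)      # operators and punctuation
    return tokens


original = "int total = 0; while (i < n) { total = total + a[i]; }"
disguised = "int sum = 0; while (k < len) { sum = sum + v[k]; }"

# Global substitution of names leaves the normalised stream unchanged.
same = normalise(original) == normalise(disguised)
```

Here same is True, since the two fragments differ only in the names chosen. The sketch does nothing about synonymous expressions; handling those would need language-specific rewriting rules that are not shown here.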

YAP

YAP, which stands for Yet Another Plague, is a series of systems which follow a common pattern. YAP3 is the current version.

In the first stage, which is common to all three systems, source texts are tokenized. In particular:

The real difference between the three systems is primarily in their respective second, O(n^2), stages:


A paper which appeared in the First Australian Conference on Computer Science Education, Sydney University, July 3-5, 1996 (http://www.pam1.bcs.uwa.edu.au/~michaelw/ftp/doc/yap_vs_acm.ps) compares the YAP approach with two others from the literature, both of which involve attribute-counting, i.e. comparison of various statistics from the source texts.
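To make the notion of attribute-counting concrete, here is a minimal sketch. The three statistics used (line count, word count, distinct-word count) and the function names attributes and distance are assumptions chosen for illustration; they are not the attributes used by the systems compared in the paper.

```python
import re


def attributes(source):
    """Return a small tuple of statistics for one source text."""
    words = re.findall(r"[A-Za-z_]\w*", source)
    return (
        len(source.splitlines()),  # number of lines
        len(words),                # number of identifier-like words
        len(set(words)),           # number of distinct words
    )


def distance(a, b):
    """Sum of absolute differences between the two attribute tuples."""
    return sum(abs(x - y) for x, y in zip(attributes(a), attributes(b)))


first = "x = 1\ny = x + 1\n"
renamed = "p = 1\nq = p + 1\n"
longer = "x = 1\ny = x + 1\nz = x + y\nprint(z)\n"

d_renamed = distance(first, renamed)   # 0: the statistics are identical
d_longer = distance(first, longer)     # 9: the statistics differ
```

A small distance only flags a pair for closer inspection. Statistics of this kind say nothing about the order in which the counted items occur.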

Obtaining YAP

A tar file containing tokenizers for Pascal, C and LISP, plus YAP1, YAP2 and YAP3, is available at http://www.pam1.bcs.uwa.edu.au/~michaelw/ftp/src/YAP.distibution.tar.gz