Seminar: Large-Scale Clone Detection and Benchmarking

Speaker: Jeff Svajlenko

Abstract: Code clones are pairs of similar code fragments, and are created when developers reuse code through copy and paste. Clone detection is important for measuring and maintaining software quality. Recently, many applications of clone detection in large inter-project source datasets have been proposed, including mining the seeds of APIs, license violation detection, API recommendation, code search, and code completion. However, very few clone detectors can scale to such large datasets. There are also very few benchmarks for evaluating and comparing the detection performance of clone detection tools.

In this thesis work, we introduce a new large-scale clone detection tool named CloneWorks. The tool scales to large inter-project source datasets (e.g., 25 thousand projects, 250 million lines of code) on an average workstation. Its user-guided approach allows the targeting of any type or kind of clone, which is essential for exploring the potential use cases in large datasets. To evaluate this new tool, we introduce two modern clone benchmarks: (1) the Mutation and Injection Framework, which evaluates clone detectors on synthetic clones produced by a mutation-analysis procedure, and (2) BigCloneBench, a large benchmark of over 8 million validated clones mined using a novel, efficient procedure.

Bio: Jeff Svajlenko is a PhD candidate at the University of Saskatchewan.  His research interests include code clones, clone detection, benchmarking, big data and machine learning.

Date: September 25, 2017
Time: 3:00pm
Place: Thorv 105