Leveraging a Corpus of Natural Language Descriptions for Program Similarity (Onward! 2016 - Papers)

Who

Meital Zilberstein, Eran Yahav

Track

Onward! 2016 Onward! Papers


When

Fri 4 Nov 2016 11:45 - 12:10 at Matterhorn 2 - Session 4 Chair(s): Veselin Raychev

Abstract

Program similarity is a central challenge in many programming-related applications, such as code search, clone detection, automatic translation, and programming education.

We present a novel approach for establishing the similarity of code fragments by:
(i) obtaining textual descriptions of code fragments captured in millions of posts on question-answering sites, blogs and other sources, and
(ii) using natural language processing techniques to establish similarity between textual descriptions, and thus between their corresponding code fragments.
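As an illustration of step (ii), the following is a minimal sketch of one common way to score similarity between textual descriptions: TF-IDF weighting with cosine similarity. The abstract does not say which natural language processing techniques the authors use, so the weighting scheme, the function names (tokenize, tfidf_vectors, cosine), and the sample descriptions are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(descriptions):
    """Build one sparse TF-IDF vector (a dict) per description.

    Assumption: smoothed IDF, chosen for illustration only.
    """
    docs = [Counter(tokenize(d)) for d in descriptions]
    n = len(docs)
    # Document frequency: in how many descriptions each term appears.
    df = Counter(term for doc in docs for term in doc)
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical descriptions standing in for question titles attached to code.
descriptions = [
    "How do I read a text file line by line in Java?",
    "Reading a file line by line in Python",
    "How to sort a dictionary by value",
]
vecs = tfidf_vectors(descriptions)
sim_read = cosine(vecs[0], vecs[1])   # two file-reading descriptions
sim_other = cosine(vecs[0], vecs[2])  # file reading vs. sorting
print(sim_read > sim_other)
```

Under this sketch, code fragments attached to the two file-reading descriptions would be ranked as more similar to each other than to the sorting fragment, even though they are written in different languages.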

To improve precision, we use a simple static analysis that extracts type signatures, and combine the results of textual similarity with similarity of the signatures.
Because our notion of code similarity is based on similarity of textual descriptions, our approach can determine semantic relatedness and similarity of code across different libraries and even across different programming languages, a task considered extremely difficult using traditional approaches.
To evaluate our approach, we use data obtained from the popular question-answering site Stack Overflow. To obtain a ground truth to compare against, we developed a crowdsourcing system, Like2Drops, that allows users to label the similarity of code fragments. We used the system to collect similarity classifications for a massive corpus of 6,500 program pairs. Our results show that our technique is effective in determining similarity, and achieves more than 85 percent precision, recall, and accuracy.
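The combination of textual and signature similarity, and the evaluation against labeled pairs, can be sketched as follows. This is not the authors' implementation: representing a signature as a set of type names, scoring it with Jaccard overlap, mixing the two scores with a fixed weight, and classifying with a 0.5 threshold are all assumptions, and the six labeled pairs are made-up toy data rather than Like2Drops results.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections of type names."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def combined_similarity(text_sim, sig_a, sig_b, weight=0.7):
    """Weighted mix of textual and signature similarity.

    Assumption: a linear combination with a hypothetical weight.
    """
    return weight * text_sim + (1.0 - weight) * jaccard(sig_a, sig_b)


def evaluate(predicted, actual):
    """Return (precision, recall, accuracy) for boolean predictions."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return precision, recall, accuracy


# Toy data: (textual similarity, signature A, signature B, human label).
labeled_pairs = [
    (0.90, {"String", "File"}, {"String", "File"}, True),
    (0.80, {"int"}, {"String"}, True),
    (0.60, {"List"}, {"Map"}, False),
    (0.20, {"int"}, {"int"}, False),
    (0.65, {"String"}, {"String"}, False),
    (0.50, {"File"}, {"int"}, True),
]

THRESHOLD = 0.5  # hypothetical decision threshold
predicted = [
    combined_similarity(t, a, b) >= THRESHOLD for t, a, b, _ in labeled_pairs
]
actual = [label for _, _, _, label in labeled_pairs]
precision, recall, accuracy = evaluate(predicted, actual)
print(precision, recall, accuracy)
```

In this toy data the signature score pushes the third pair below the threshold despite moderately similar text, which is the kind of precision gain the abstract attributes to adding type signatures.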

DOI

https://doi.org/10.1145/2986012.2986013

Meital Zilberstein

Technion

Eran Yahav