
Page manager: Web editorial team
Page updated: 2012-09-11 15:12



Recognizing lines of code violating company-specific coding guidelines using machine learning: A Method and Its Evaluation

Journal article
Authors: Miroslaw Ochodek
Regina Hebig
W. Meding
G. Frost
Miroslaw Staron
Published in: Empirical Software Engineering
ISSN: 1382-3256
Publication year: 2019
Published at: Department of Computer Science and Engineering (GU)
Language: en
Links: dx.doi.org/10.1007/s10664-019-09769...
Keywords: Measurement, Machine learning, Action research, Code reviews, Computer Science
Subject categories: Computer and Information Science


Software developers in large and medium-sized companies work with codebases containing millions of lines of code. Assuring the quality of this code has shifted from simple defect management to proactive assurance of internal code quality. Although static code analysis and code reviews have been at the forefront of research and practice in this area, code reviews remain effort-intensive and prone to differing interpretations. The aim of this research is to support code reviews by automatically recognizing violations of company-specific coding guidelines in large-scale, industrial source code. In our action research project, we constructed a machine-learning-based tool for code analysis with which software developers and architects in large and medium-sized companies can use a few examples of source code lines violating code/design guidelines (up to 700 lines of code) to train decision-tree classifiers that find similar violations in their codebases (up to 3 million lines of code). The project consisted of (i) understanding the challenges of two large software development companies, (ii) applying the machine-learning-based tool to detect violations of Sun's and Google's coding conventions in the code of three large open source projects implemented in Java, (iii) evaluating the tool on an evolving industrial codebase, and (iv) finding the best learning strategies to reduce the cost of training the classifiers. We achieved an average accuracy of over 99% and an average F-score of 0.80 for the open source projects when using ca. 40K lines to train the tool. We obtained a similar average F-score of 0.78 for the industrial code, this time using only up to 700 lines of code as the training dataset. Finally, we observed that the tool performed visibly better for rules that require understanding only a single line of code or the context of a few lines (often reaching an F-score of 0.90 or higher).
Based on these results, we observe that this approach can give modern software development companies the ability to use examples to teach an algorithm to recognize violations of code/design guidelines and thus increase the number of reviews conducted before product release. This, in turn, leads to increased quality of the final software.
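
The abstract reports an accuracy above 99% alongside an F-score of about 0.80. The two figures can coexist because lines that violate a guideline are typically rare compared to compliant lines. The following minimal Python sketch illustrates this relationship. The function name `classification_metrics` and all counts are hypothetical and chosen only for illustration; they are not taken from the paper or its tool.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F-score from confusion-matrix counts.

    tp: violating lines correctly flagged
    fp: compliant lines wrongly flagged
    fn: violating lines that were missed
    tn: compliant lines correctly left unflagged
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-score (F1) is the harmonic mean of precision and recall.
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Hypothetical example (assumed numbers, not from the paper):
# 100,000 lines in total, of which 1,000 violate a guideline.
hypothetical = classification_metrics(tp=800, fp=200, fn=200, tn=98_800)

if __name__ == "__main__":
    for name, value in hypothetical.items():
        print(f"{name}: {value:.3f}")
```

With these assumed counts, the accuracy is dominated by the large number of correctly ignored compliant lines, while the F-score reflects only how well the rare violating lines are found. This is one reason to read the F-score, rather than accuracy, as the more informative figure for this kind of task.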

