Automatically Detecting Technical Debt Discussions with Machine Learning
Technical debt (TD) refers to choices made during software development that achieve short-term goals at the expense of long-term quality. Since developers use issue trackers to coordinate task priorities, issue trackers are a natural focal point for discussing TD. In addition, software developers use preset issue types, such as feature, bug, and vulnerability, to differentiate the nature of the task at hand. We have recently started seeing developers explicitly use the phrase "technical debt" or similar terms such as "design debt" or "architectural smells."
Although developers often informally discuss TD, the concept has not yet crystallized into a consistently applied issue type when describing issues in repositories. Applying machine learning to locate technical debt issues can improve our understanding of TD and help develop practices to manage it. In this blog post, which is based on an SEI white paper, we describe the results of a study in which machine learning was used to quantify the prevalence of TD-related issues in issue trackers. Although more work is needed, the study achieved promising results in producing a classifier that automatically determines whether a ticket in an issue tracker relates to TD. Our results suggest the need to designate a new issue type for technical debt to raise visibility and awareness of TD issues among developers and managers.
Results of the Study
Our machine-learning engine for automatically detecting TD relies extensively on text mining, an approach increasingly used in software engineering studies that focus on classifying bugs. Text mining has been found to improve both bug-classification accuracy and problem-report classification. We manually labeled 1,934 tickets in the Chromium issue tracker according to whether the discussion in each ticket indicated TD. We used these labels to train a machine-learning classifier and then used it to estimate labels for an additional 475,000 unlabeled tickets. Our classifier significantly outperforms key-phrase search. We concluded that discussion of TD appears in about 16 percent of the tracked Chromium issues. In more detail, our contributions are as follows:
- 441 of the 1,934 labels we generated from the Chromium issue tracker show strong evidence of TD. These labels are now publicly available to support future research.
- We estimated that developers identify TD as an important underlying factor in 14.5-17.1 percent of tracked Chromium tickets. This estimate is adjusted (making it more conservative) for selection bias in our labeled sample.
- We trained a gradient-boosting machine to automatically determine whether a Chromium ticket concerns TD. On a pure holdout test set, our classifier performed significantly better than a naïve key-phrase search in terms of precision, recall, and area under the receiver operating characteristic (AUROC) curve, after adjusting for sampling bias in our labeled data.
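The three metrics named in the last contribution can be computed directly from holdout labels and predictions. The following is a minimal Python sketch under stated assumptions, not the evaluation code used in the study: the function names and the toy labels and scores are hypothetical, and the sketch omits the sampling-bias adjustment that the study applied.

```python
from typing import Sequence, Tuple


def precision_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Precision and recall for binary labels (1 = TD, 0 = not TD)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a randomly chosen TD ticket scores
    higher than a randomly chosen non-TD ticket (ties count one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up holdout data, for illustration only (not results from the study).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
keyphrase_pred = [1, 0, 0, 0, 0, 1, 0, 0]                  # 0/1 output of a phrase search
model_scores = [0.9, 0.7, 0.4, 0.2, 0.1, 0.6, 0.3, 0.8]    # hypothetical classifier scores
model_pred = [1 if s >= 0.5 else 0 for s in model_scores]  # assumed 0.5 decision threshold

print("key-phrase:", precision_recall(y_true, keyphrase_pred), auroc(y_true, keyphrase_pred))
print("classifier:", precision_recall(y_true, model_pred), auroc(y_true, model_scores))
```

A key-phrase search produces only a yes/no answer per ticket, so its AUROC is computed here from its 0/1 output, whereas a classifier's AUROC uses its continuous scores and does not depend on a threshold.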
Both the new labeled data and the quantitative analysis from our study raise the profile of TD as a software engineering construct comparable in importance to bugs and vulnerabilities. In particular, because each of these constructs is highly complex and context dependent, labeled data sets and machine-learning tools are critically needed to help developers locate, discuss, and address TD throughout a software development life cycle.
A practical motivation for designating a technical debt issue type is to improve time to resolution and avoid unintended consequences due to misreporting TD issues. Several studies suggest that bug reports that include certain bug descriptors tend to be addressed sooner.
Readers interested in reviewing the methodology that we used to apply machine learning to the classification of technical debt issues can find the detailed description of our study in the SEI white paper Automatically Detecting Technical Debt Discussions.
Examples of Technical Debt Discussions in Chromium
The following are some examples of discussions in issue trackers that our analysis determined to be related to technical debt. These examples illustrate the challenges and complexities of automating the identification of technical debt in issue trackers.
The term "technical debt" first appears in the Chromium issue tracker in 2010, though developers have been discussing the concept and its management throughout the lifespan of the project. Here is a sample of developer comments in that ticket:
[Chromium #43780] One might consider this a technical debt paydown bug. However, feel free to reprioritize.... Backup sockets were committed conditionally on them being refactored to the "right" place (10/18/2010)... Looks like the statements about the code are still true. (08/17/2017)
The discussions clearly demonstrate a short-term tradeoff that was taken at the time with the understanding that it would be refactored, but this ticket remained open at the time our study was published.
In other examples, we see indications of a practice emerging in dealing with TD as Chromium developers discuss the tickets:
[Chromium #243948] Paying off technical debt becomes a higher priority, not lower, when in those rare cases it must be deferred. Tests are not a 'nice to fix' feature. Raising to Pri-1.
It is straightforward to spot TD issues when the developers explicitly refer to them as such. Previous work has demonstrated that discussions of TD often involve more convoluted design issues and hard-to-trace changes. The challenge is to locate TD when it is not explicitly discussed. For instance, here is a ticket that experts classified as TD:
[Chromium #442327] 1) Make sure all JNI registration functions are autogenerated by the JNI generator. Currently a few are manual and therefore must be called even when native exports are in use. 2) Make the JNI generator emit both manual registration functions *and* native exports... ... Factor out the code which generates the native export stub name for a given native function, previously duplicated in two places, and also use it in a third place: when generating the table of method registrations... adding ... as blocker because this is still not quite working as intended (though it's functional)
The discussion in this ticket indicates that the developers recognize the limitations of manual registration and the design consequences of changing the system with a configuration-time flag. The TD grows as the refactoring is postponed.
Some tickets, however, are especially challenging to label, such as Chromium ticket #507796:
[Chromium #507796] This is just a first step to make sure the code is being exercised. It's been tested locally but only on this configuration. Some more work might be needed to get this working in non-GN builds. Further refactoring of the Telemetry dependencies will occur in follow-on CLs....Unfortunate that that build breakage wasn't caught. ... let me know if you have any trouble diagnosing what went wrong. I don't know why so many of the other isolates would complain about crashpad_database_util not having been built.
Experts initially labeled this issue as not TD since it focuses on alignment of unit tests, which is a routine task after changes have been made. Upon further consideration, they correctly identified the design concepts by cueing on the build dependencies discussed and they relabeled this ticket as TD.
Occasionally, related words or phrases that reflect frustration, such as "I don't understand," "we are getting nagged," or "workaround," suggest that the issues affecting software developers are symptomatic of TD. Other phrases that refer to the consequences of design changes, such as "increasingly complex," "consequence of refactoring," or "should not have been hard coded," are suggestive of TD as well.
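A key-phrase search of the kind our classifier was compared against can be sketched in a few lines of Python. This is an illustrative sketch only, not the baseline used in the study: the phrase list is simply drawn from phrases quoted in this post, the function names are hypothetical, and the sample texts are short excerpts of the tickets quoted above.

```python
import re
from typing import Dict, List

# Phrases quoted in this post; a real list would need curation and validation.
TD_PHRASES = [
    "technical debt",
    "design debt",
    "architectural smell",
    "workaround",
    "increasingly complex",
    "consequence of refactoring",
    "should not have been hard coded",
]

# One case-insensitive pattern. The leading \b anchors each phrase at a word
# boundary; no trailing boundary is used, so plurals such as "smells" match.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in TD_PHRASES) + r")",
    re.IGNORECASE,
)


def matched_phrases(text: str) -> List[str]:
    """Return the distinct TD-suggestive phrases found in a ticket's free text."""
    return sorted({m.group(0).lower() for m in _PATTERN.finditer(text)})


def flag_tickets(tickets: Dict[str, str]) -> Dict[str, List[str]]:
    """Map ticket id -> matched phrases, keeping only tickets with a match."""
    flagged = {}
    for ticket_id, text in tickets.items():
        hits = matched_phrases(text)
        if hits:
            flagged[ticket_id] = hits
    return flagged


# Short excerpts from the tickets quoted in this post.
SAMPLE = {
    "43780": "One might consider this a technical debt paydown bug.",
    "442327": "Make sure all JNI registration functions are autogenerated "
              "by the JNI generator.",
}

print(flag_tickets(SAMPLE))
```

On these excerpts the search flags ticket #43780 but misses #442327, which experts classified as TD even though it uses none of the listed phrases. This gap is what motivates a learned classifier over phrase matching alone.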
Wrapping Up and Looking Ahead
We concluded from our study that machine learning can help (a) detect discussions of TD at scale and (b) identify features that are strongly associated with TD and that are thus potentially useful for understanding or defining TD. Free text in the Chromium project alone points to many tens of thousands of TD discussions, suggesting that studies of TD have only begun to identify a large class of code-development challenges.
Our experience and findings from this study inform both software engineering practice and research. Our sample data set of 1,934 labeled tickets is a reference point for anyone trying to refine the definition of TD and provides examples for software engineering teams who would like to experiment with more explicitly tracking TD along with other software anomalies such as bugs and vulnerabilities. The tickets identified as TD by our classifier can be studied further, and additional features, over and above those used in our model, can be tested by other researchers to develop new automatic TD-detection methods.
The extent of TD identified in a large project, such as the one used in our study, where TD accounts for about 16 percent of Chromium issues, suggests the potential value of augmenting issue trackers with fields that can effectively monitor TD in large projects. The motivation for a designated issue type such as techdebt is similar to that for tracking security bugs and vulnerabilities separately. Technical debt issues represent design concerns that are otherwise hard to identify automatically. Without a clear way of making these issues visible, software development teams rely on anecdotal information.
Providing a systematic way of discussing these issues is a first step toward quantifying their impact on the project resources and managing the technical debt issues strategically and proactively. Such information would help project managers decide whether to allocate resources to new feature development or to pay down TD. If we can effectively classify TD-related comments and issues, we can focus on what practices could be most useful for its timely communication and resolution.
While our classifier represents progress, it falls far short of a perfect classifier that automatically determines whether an arbitrary ticket in an issue tracker relates to TD. Such an oracle would
- support any TD management strategy by providing a list of all TD-related tickets in a repository.
- obviate the need for developers to manually apply a TD ticket label and for analysts to be trained in a standard TD terminology (provided they operationally understand it well enough to talk about it in some language).
- practically define TD, to the extent that the oracle is human-interpretable, providing reasoning behind each classification.
We expect future work in two key areas to significantly improve our classifier's accuracy:
- Increasing the amount of labeled data via the current post hoc manual process is costly, but other label sources may soon become more common, particularly as more projects begin to use TD as a standard issue tag.
- Feature-engineering advances are needed as well, and the problem is not trivial. In our current work, we tried most of the obvious feature-engineering strategies in natural-language processing, and fundamental advances may be needed before additional features can yield a sizeable performance boost.
Read the SEI white paper, Automatically Detecting Technical Debt Discussions.
Read other SEI blog posts about technical debt.
Read other SEI blog posts about machine learning.
Learn about the eLearning course Managing Technical Debt of Software.
Listen to our podcast, Managing Technical Debt: A Focus on Automation, Design, and Architecture.