We experimented with different tools that provide some kind of Technical Debt index, trying to identify and analyze their main features and to observe what is missing. Below I outline some observations made together with Marco Zanoni on these issues.
Different measures have been proposed to address the estimation of Technical Debt, by analyzing the artifacts contained in a project.
The most recurring characteristics of the proposed measures are:
- localized hints or issues, with an assigned score;
- scores are then aggregated by following the containment relationships existing in the software, e.g., modules, and the aggregation operator is usually the sum;
- scores are assigned to the single issues a priori and arbitrarily; default scores are set up by relying on the experience of the tools’ developers and may be customized to adapt to a specific project’s needs.
Most hints contributing to the building of the measure fall under these categories:
1. coding issues regarding violations of best practices;
2. metric values and violations of defined thresholds;
3. detection of more structured issues, e.g., architectural/code smells.
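The additive model described by the characteristics above can be sketched as follows. This is an illustration only: the issue kinds, default scores, and module layout are invented, not taken from any specific tool.

```python
# Sketch of the additive TD-index model: each detected issue carries a
# fixed a-priori score, and scores are aggregated by summation along the
# containment hierarchy (issue -> module -> project).
# All names and numbers are invented for illustration.

DEFAULT_SCORES = {          # a-priori scores, set by the tool's developers
    "long_method": 30,      # e.g., minutes of estimated remediation time
    "god_class": 120,
    "magic_number": 5,
}

# (module, issue kind) pairs produced by some hypothetical analyzer
detected_issues = [
    ("billing", "god_class"),
    ("billing", "long_method"),
    ("ui", "magic_number"),
    ("ui", "magic_number"),
]

def td_index(issues, scores):
    """Aggregate scores per module and project-wide, by summation."""
    per_module = {}
    for module, kind in issues:
        per_module[module] = per_module.get(module, 0) + scores[kind]
    return per_module, sum(per_module.values())

per_module, total = td_index(detected_issues, DEFAULT_SCORES)
print(per_module)  # {'billing': 150, 'ui': 10}
print(total)       # 160
```

Note how every modeling decision here (which issues count, their scores, summation as the aggregation operator) is baked in a priori; the rest of this post questions exactly these decisions.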
This approach is surely reasonable and motivated by practical feasibility, scalability (new knowledge about issues can be coded and added to the additive model), and manageability of the analysis (as pointed out by Jean Louis Letouzey in “Questions that are quite systematically raised about TD”).
The final goal of TD indexes is to allow the **Management** of Technical Debt, i.e., to allow developers (we use “developers” to mean all people designing/developing/operating the software at any level of abstraction) to understand that a choice (conscious or not) has consequences, usually non-functional ones, that can affect both developers (e.g., ease of maintenance and evolution) and users (e.g., performance, security). Given the knowledge about the **risk** derived from each choice, developers should also know how many resources are needed to transform their software in a way that removes or mitigates the risk. These are the two widely recognized aspects of the problem: the cost of keeping the system as it is, and the cost of fixing it.
Do the current TD indexes allow estimating these aspects?
Since measures are implemented by summing scores that are assigned to each issue recognized by the analysis tool, the precision of this estimation is tied to two factors:
1. the precision of the single scores
2. the appropriateness of the aggregation model
1. As for the precision of the single scores: in all the models we know of, scores are arbitrarily fixed. They are fixed by experts, but they are fixed nonetheless. Depending on the index definition, scores represent the cost/time of fixing the issue or its relative contribution (a penalty, usually) to the overall quality of the system. Both aspects lack empirical evidence, as do other details, such as the thresholds applied to metrics when detecting, e.g., size or complexity violations. In our opinion, a sounder result would be obtained if maintenance costs and the impact on system quality could be fitted from empirical data, and customized per domain/technology/organization. This would allow choosing which issues are relevant and which are not on a statistical basis, and obtaining an estimate, e.g., of their *actual* cost of fixing, or quantifying the existing relations with maintenance times or defects.
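As a minimal illustration of what an empirically grounded score could look like, the sketch below derives per-issue remediation costs from historical fix records instead of fixing them a priori; the fitted “score” is simply the sample mean of observed fix times. The records and issue kinds are invented, and a real study would stratify them per domain/technology/organization as argued above.

```python
# Sketch: fit per-issue remediation costs from empirical data instead of
# assigning them a priori. Each record is (issue kind, minutes actually
# spent fixing one instance). Data and issue kinds are invented.
from collections import defaultdict
from statistics import mean

history = [
    ("god_class", 140), ("god_class", 95), ("god_class", 210),
    ("long_method", 25), ("long_method", 40),
    ("magic_number", 3), ("magic_number", 6), ("magic_number", 4),
]

def fit_scores(records):
    """Empirical per-issue score: mean observed remediation time."""
    by_kind = defaultdict(list)
    for kind, minutes in records:
        by_kind[kind].append(minutes)
    return {kind: mean(times) for kind, times in by_kind.items()}

scores = fit_scores(history)
print(scores["long_method"])  # 32.5
```

The same data would also support dropping statistically irrelevant issues, or fitting richer models (e.g., regressions against defect counts) rather than plain means.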
2. As for the appropriateness of the aggregation model: when trying to estimate the cost of fixing a certain set of TD issues, one should consider that any change applied to a system has consequences that are not obvious. Software systems are structured as complex graphs, where each single change impacts every neighboring node recursively, both at design time and at runtime. In this context, aggregation by sum is simplistic. Especially when dealing with design/architecture-level issues, fixing one issue may remove an entire set of related issues, or generate new ones. An ideal MTD tool should be able to understand these inter-relations and exploit them to suggest the sets of fixes that maximize the obtained quality, by following the structure of the system rather than a superimposed estimation model driven essentially by quality aspects. Quality aspects are relevant, but they should be used to understand, at a coarse granularity, which parts of a system deserve more attention.
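The point about inter-related fixes can be sketched with a tiny dependency graph: fixing one issue transitively resolves the issues that exist only because of it, so the effect of a fix is a whole set of removed issues, which a plain sum over individual scores cannot express. The graph and issue identifiers are invented for illustration.

```python
# Sketch of structure-aware aggregation: each issue maps to the issues
# that disappear once it is fixed. The benefit of a fix is the whole
# transitively resolved set, not a single score.
# Graph and issue ids are invented for illustration.
resolves = {
    "cyclic_dependency_A_B": {"god_class_A", "feature_envy_B"},
    "god_class_A": {"long_method_A1"},
    "feature_envy_B": set(),
    "long_method_A1": set(),
}

def resolved_by_fixing(issue, graph):
    """Transitively collect every issue removed by fixing `issue`."""
    removed, stack = set(), [issue]
    while stack:
        current = stack.pop()
        if current not in removed:
            removed.add(current)
            stack.extend(graph.get(current, ()))
    return removed

print(sorted(resolved_by_fixing("cyclic_dependency_A_B", resolves)))
# ['cyclic_dependency_A_B', 'feature_envy_B', 'god_class_A', 'long_method_A1']
```

A tool built this way could rank candidate fixes by the total set (or cost) of issues each one removes, following the system's structure instead of a flat additive model; modeling fixes that *introduce* new issues would require edges in the opposite direction as well.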
Moreover, with respect to point 1, some issues may be underestimated due to their rarity in historical data. This is where another, complementary approach could be used: one that cares about rare (but not that rare) issues that can potentially lead to out-of-scale risks. We can borrow this approach from security analysis, where a very small vulnerability can disclose extremely important information. People working on these issues do not think in terms of tradeoffs, but try to follow practices that are proven and that *do not allow* certain bad things to happen. When dealing with security, a lot of effort is spent to make sure that, e.g., a user cannot enter a system without successful authentication. That would have extremely bad consequences, whose costs are exponential w.r.t. the effort spent avoiding the issue.
This complementary approach consists of collecting the issues that have (even anecdotally) generated external failures that are “extreme”, such as system shutdown, data loss, or project failure, and (when they are detectable) removing them from the project with maximum priority. Do not even associate a score with them, because their effect is out of scale. If you multiply the probability of suffering this risk by the cost it carries, the product is probably high anyway; but the real problem is that, if it happens, no one would be able to pay that cost. This goes beyond the usual debt/interest metaphor, and resembles more closely how “black swans” behave.
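The blacklist idea can be sketched as a triage step placed before any scoring: blacklisted issues bypass the index entirely and block with maximum priority, while everything else is scored as usual. The blacklist entries and scores below are invented examples.

```python
# Sketch of the complementary "blacklist" approach: issues anecdotally
# linked to out-of-scale failures are never scored; their mere presence
# blocks, with maximum priority. Entries and scores are invented.
BLACKLIST = {
    "credentials_in_source",
    "unauthenticated_admin_endpoint",
}

def triage(detected_kinds, scores):
    """Blacklisted issues bypass scoring entirely; the rest are summed."""
    blockers = [k for k in detected_kinds if k in BLACKLIST]
    if blockers:
        # out of scale: no finite TD index is meaningful here
        return {"blockers": blockers, "td_index": None}
    return {"blockers": [],
            "td_index": sum(scores.get(k, 0) for k in detected_kinds)}

print(triage(["long_method", "credentials_in_source"], {"long_method": 30}))
```

Returning `None` instead of a very large number is deliberate: as argued above, assigning any score to these issues would pretend they trade off against ordinary debt.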
**A possible research agenda**
What we would like to see and work on in future research is:
- estimating the relative relevance and/or absolute times/costs associated with the hints/issues detectable by software analysis tools, with the aim of providing a TD estimation index with an **empirical base**;
- collecting evidence regarding the root causes of known large-scale failures in both operations and development, with the aim of generating a blacklist of the issues to absolutely avoid in any project;
- exploring the existing structural and statistical inter-relations among different TD issues;
- generating alternative estimation models that rely on the structure of a software system, and that allow the simulation of changes to estimate with higher precision the effort needed to implement fixes and their consequences.