TD Indexes – What is missing?

We experimented with different tools providing some kind of Technical Debt index, trying to identify and analyze their main features and observe what is missing. I outline below some observations made together with Marco Zanoni on these issues.

Different measures have been proposed to estimate Technical Debt by analyzing the artifacts contained in a project.
The most recurring characteristics of the proposed measures are:

  • localized hints or issues, with an assigned score;
  • scores are then aggregated by following the containment relationships existing in the software, e.g., modules, and the aggregation operator is usually the sum (a minimal sketch of this additive model follows the list);
  • scores are assigned to the single issues a priori and arbitrarily; default scores are set up by relying on the experience of the tools’ developers and may be customized to adapt to the specific project’s needs.
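
To make this additive model concrete, here is a minimal sketch in Python; the issue types, default scores, and module names are hypothetical placeholders, not taken from any specific tool.

```python
# Minimal sketch of the additive TD model: each detected issue carries a fixed
# score, and scores are summed up the containment hierarchy (issue -> module
# -> project). All names and numbers below are invented for illustration.

DEFAULT_SCORES = {   # expert-assigned remediation "cost" per issue type, in minutes
    "god_class": 240,
    "long_method": 60,
    "magic_number": 5,
}

# Issues detected per module (module name -> list of issue types found in it).
issues_by_module = {
    "billing.core": ["god_class", "long_method", "long_method"],
    "billing.ui": ["magic_number"] * 12,
}

def module_debt(issues):
    """Aggregate by plain sum, as most indexes do."""
    return sum(DEFAULT_SCORES[i] for i in issues)

def project_debt(modules):
    """Follow the containment relationship: project debt = sum over modules."""
    return sum(module_debt(issues) for issues in modules.values())

for name, issues in issues_by_module.items():
    print(name, module_debt(issues), "min")
print("project total:", project_debt(issues_by_module), "min")
```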

Most hints contributing to the measure fall into these categories:
1. coding issues concerning violations of best practices;
2. metric values and violations of defined thresholds (see the sketch after this list);
3. detection of more structured issues, e.g., architectural/code smells.
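
As an illustration of the second category, the sketch below flags metric values that exceed fixed thresholds; the metric names and threshold values are illustrative defaults, not those of any particular tool.

```python
# Sketch of threshold-based metric checking: a method is flagged when one of
# its metrics exceeds a fixed threshold. Thresholds here are invented defaults.

THRESHOLDS = {"loc": 100, "cyclomatic_complexity": 10, "parameters": 5}

def threshold_violations(method_metrics):
    """Return the metrics of a method that exceed their threshold."""
    return {metric: value for metric, value in method_metrics.items()
            if metric in THRESHOLDS and value > THRESHOLDS[metric]}

print(threshold_violations({"loc": 180, "cyclomatic_complexity": 7, "parameters": 6}))
# -> {'loc': 180, 'parameters': 6}
```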

This approach is certainly reasonable, motivated by practical feasibility, scalability (new knowledge about issues can be coded and added to the additive model), and manageability of the analysis (as pointed out by Jean-Louis Letouzey in “Questions that are quite systematically raised about TD”).
The final goal of TD indexes is to allow the **Management** of Technical Debt, i.e., to allow developers (we use “developers” to mean all people designing/developing/operating the software at any level of abstraction) to understand that a choice (conscious or not) has consequences, usually non-functional ones, that can affect both developers (e.g., ease of maintenance and evolution) and users (e.g., performance, security). Given the knowledge about the **risk** derived from each choice, developers should also know how many resources are needed to transform their software in a way that removes or mitigates the risk. So there are two widely recognized aspects of the problem: the cost of keeping the system as it is, and the cost of fixing it.

Do the current TD indexes allow estimating these aspects?

Since measures are implemented by summing scores that are assigned to each issue recognized by the analysis tool, the precision of this estimation is tied to two factors:
1. the precision of the single scores
2. the appropriateness of the aggregation model

1. As for the precision of the single scores, in all the models we know of, scores are arbitrarily fixed. They are fixed by experts, but they are fixed nonetheless. Depending on the index definition, scores represent the cost/time of fixing the issue or its relative contribution (usually a penalty) to the overall quality of the system. Both aspects lack empirical evidence, as do other details such as the thresholds applied to metrics when detecting, e.g., size or complexity violations. A sounder result would be obtained, in our opinion, if maintenance costs and the impact on system quality could be fitted from empirical data and customized per domain/technology/organization. This would allow choosing which issues are relevant and which are not on a statistical basis, obtaining an estimate, e.g., of their *actual* cost of fixing, or quantifying the existing relations with maintenance times or defects (a small sketch of such a fitting follows).
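
As a sketch of what such a calibration could look like, per-issue weights can be fitted by least squares from issue counts and observed maintenance effort mined from a project’s history; the numbers below are invented purely for illustration.

```python
# Sketch of fitting per-issue weights from data instead of fixing them a priori.
# All numbers are invented; a real study would mine issue counts and effort
# from the version control system and the issue tracker.
import numpy as np

# Rows: modules. Columns: counts of each issue type detected in the module
# (god_class, long_method, magic_number).
issue_counts = np.array([
    [1, 3, 10],
    [0, 5, 2],
    [2, 1, 0],
    [0, 0, 25],
], dtype=float)

# Observed maintenance effort per module (e.g., hours spent on corrective work).
observed_effort = np.array([38.0, 21.0, 44.0, 9.0])

# Least-squares fit of: effort ~ sum_i weight_i * count_i
weights, *_ = np.linalg.lstsq(issue_counts, observed_effort, rcond=None)
for name, w in zip(["god_class", "long_method", "magic_number"], weights):
    print(f"{name}: {w:.1f} hours per instance")
```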

2. As for the appropriateness of the aggregation model, when trying to estimate the cost of fixing a certain set of TD issues, one should consider that any change applied to a system has consequences that are not obvious. Software systems are structured as complex graphs, where each change impacts its neighboring nodes recursively, both at design time and at runtime. In this context, aggregation by sum is simplistic. Especially when dealing with design/architecture-level issues, fixing one issue may remove an entire set of related issues, or generate new ones. An ideal MTD tool should be able to understand these inter-relations and exploit them to suggest the sets of fixes that maximize the resulting quality, by following the structure of the system rather than a superimposed estimation model driven essentially by quality aspects. Quality aspects are relevant, but should be used to understand, at coarse granularity, which parts of a system should receive more attention (a small sketch of the difference follows).
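
The sketch below contrasts plain summation with a deliberately simplistic structure-aware estimate in which fixing one issue also resolves related ones; the issue graph, costs, and greedy strategy are hypothetical, and serve only to show why the two estimates diverge.

```python
# Why plain summation can misestimate the cost of fixing: in a structure-aware
# model, fixing one issue can also resolve related issues (or spawn new ones).
# The costs and the "resolves" relation below are invented for illustration.

fix_cost = {"cyclic_dep": 16, "god_class": 8, "feature_envy": 3, "long_method": 2}

# Fixing the key issue also removes the listed issues, e.g., breaking a
# dependency cycle dissolves the feature envy it caused.
resolves = {
    "cyclic_dep": ["feature_envy", "long_method"],
    "god_class": ["long_method"],
}

def naive_estimate(issues):
    """Additive model: just sum the per-issue costs."""
    return sum(fix_cost[i] for i in issues)

def structure_aware_estimate(issues):
    """Greedy sketch: fix costly issues first and drop whatever each fix resolves."""
    remaining, total = set(issues), 0
    for issue in sorted(issues, key=fix_cost.get, reverse=True):
        if issue in remaining:
            total += fix_cost[issue]
            remaining.discard(issue)
            remaining -= set(resolves.get(issue, []))
    return total

backlog = ["cyclic_dep", "god_class", "feature_envy", "long_method"]
print("naive sum:", naive_estimate(backlog))                  # 29
print("structure-aware:", structure_aware_estimate(backlog))  # 24
```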

Moreover, with respect to point 1, some issues may be underestimated because they appear rarely in a project’s history. This is where a complementary approach could be used, one that cares about rare (but not that rare) issues that potentially lead to out-of-scale risks. We can borrow this approach from security analysis, where a very small vulnerability can disclose extremely important information. People working on these issues do not think in terms of trade-offs, but try to follow practices that are proven and that *do not allow* certain bad things to happen. When dealing with security, a lot of effort is spent making sure that, e.g., a user cannot enter a system without successful authentication. A breach would have extremely bad consequences, whose costs are exponential with respect to the effort spent in avoiding the issue.

This complementary approach consists in collecting the issues that have (even anecdotally) generated “extreme” external consequences, such as system shutdown, data loss, or project failure, and (when they are detectable) removing them from the project with maximum priority. Do not even associate a score with them, because their effect is out of scale. If you multiply the chance of suffering this risk by its cost, the product is probably high anyway, but the real problem is that, if it happens, no one would be able to pay that cost. This goes beyond the usual debt/interest metaphor and resembles more how “black swans” behave (a sketch of such a policy follows).
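
A small sketch of such a policy, assuming a hypothetical blacklist: blacklisted issues are never weighed against a score; their mere presence overrides the additive index.

```python
# Sketch of the "out-of-scale" policy: issues known (even anecdotally) to have
# caused extreme failures are not scored at all; their presence alone blocks.
# The blacklist entries are examples, not a proposed list.

BLACKLIST = {"hardcoded_credentials", "unauthenticated_admin_endpoint"}

def assess(detected_issues, td_index_minutes):
    blockers = BLACKLIST & set(detected_issues)
    if blockers:
        # No score, no trade-off: remove with maximum priority.
        return f"BLOCK: remove {sorted(blockers)} before anything else"
    return f"manage via the TD index ({td_index_minutes} min of remediation)"

print(assess({"long_method", "hardcoded_credentials"}, td_index_minutes=480))
print(assess({"long_method"}, td_index_minutes=480))
```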

**A possible research agenda**

What we would like to see and work on in future research is:

  • estimating the relative relevance and/or the absolute time/costs associated with the hints/issues detectable by software analysis tools, with the aim of providing a TD estimation index with an **empirical base**;
  • collecting evidence regarding the root causes of known large-scale failures in both operations and development, with the aim of generating a blacklist of issues to absolutely avoid in any project;
  • exploring the existing structural and statistical inter-relations among different TD issues;
  • generating alternative estimation models that rely on the structure of a software system and allow the simulation of changes, so as to estimate with higher precision the effort needed to implement fixes and their consequences.

What are the best techniques for code and architectural debt identification?

We (University of Milano-Bicocca, Software Evolution and Reverse Engineering Lab) are particularly interested in the identification of code and architectural issues that can be a source of technical debt, in automated support for detecting these issues, and in the prioritization of the most critical ones.

Our Work
With respect to code-level issues, we have worked on code smell detection through both metrics-based and machine learning-based techniques (ESE-2015). Regarding the metrics-based approach, we developed a code smell detection tool (JCodeOdor), defined an Intensity Index to evaluate and prioritize the most critical smells to be removed (MTD2015), and defined filters to remove false positive instances (ICSE2015-poster, SANER 2016).

With respect to architectural-level issues, we performed an empirical study on the possible correlations between code smells and architectural smells. We are also working with other colleagues on the definition of a catalogue and classification of architectural smells.

We experimented with different commercial and open-source tools (e.g., inFusion, Structure101, SonarQube, Sonargraph) to evaluate their support for identifying architectural smells and other architectural-level problems, the usefulness of the Technical Debt Index they provide (SAC2016), and the support they offer for refactoring architectural problems (WICSA2016). We observed many limitations in the tools and several problems in effectively using the computed Technical Debt Indexes, e.g.:
• Tools usually do not exploit historical and/or dynamic information to assess TD.
• Tools detect different relevant issues, and sometimes provide integration with IDEs to show issues in context, but they do not leverage the IDE features to trigger and automate refactoring operations.
• The way time/costs are associated with TD index values is arbitrary and not supported by evidence; even if the index provides a good approximation of TD, prioritization suffers from this lack of a solid link with measures that can be used for decision-making.

Future Work
We are interested in defining and providing an Architectural/Technical Debt measure that can be effectively used and that also takes into account the history of a project. We also plan to enhance the detection of architectural smells by developing a new tool or by collaborating with researchers already working in this direction. We would also like to exploit our experience in design pattern detection (JSS-2015) to enhance the identification of architectural smells, e.g., by prioritizing architectural smells that affect subsystems considered more “critical” due to the presence of specific design patterns.

Another direction we are interested in is the integration of data and software analysis to reach a more precise and holistic assessment of the debt of a system. We had some experience of this kind (JSME-2013) in recovering conceptual schemas from data-intensive systems.