A Research Agenda

"My code is very modular."

I have heard this claim and variations of it many times (and not just from students...) and every time I hear it, I want to ask how the speaker knows this to be true. When I do ask, I get useful answers such as "I designed it that way." This kind of circular reasoning tells me nothing and can hardly be considered good engineering practice. If engineering means anything, it means measurement of some form or other, and so I do not accept any support for claims of modularity that does not include measurement.

The problem with requiring measurements of modularity is that there are none. I have been told that such things as coupling and cohesion are measures of modularity, but these attributes do not even have precise definitions, so they cannot be considered measurements. I have been told that there are metrics for coupling and cohesion, such as CBO and LCOM from the CK metric suite [CK91]. The problem with those metrics is that there are so many of them [BDW98,BDW99] that it is not clear which should be used to measure modularity, or even whether any of them has a useful relationship to what we consider to be modularity.
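To make concrete how loosely such metrics pin things down, here is a sketch of one commonly cited variant of LCOM (lack of cohesion in methods): count the method pairs that share no instance attributes, subtract the pairs that share at least one, and floor at zero. This is only one of the many definitions in the literature, which is exactly the problem; the input representation (a map from method names to the attributes they use) is my own simplification, not anything prescribed by the CK suite.

```python
from itertools import combinations

def lcom1(methods):
    """One common LCOM variant: P = method pairs sharing no attributes,
    Q = method pairs sharing at least one; result is max(P - Q, 0).
    `methods` maps method name -> set of instance attributes it uses."""
    p = q = 0
    for (_, attrs_a), (_, attrs_b) in combinations(methods.items(), 2):
        if attrs_a & attrs_b:
            q += 1
        else:
            p += 1
    return max(p - q, 0)

# Two getters touching disjoint attributes look maximally "incohesive":
print(lcom1({"get_x": {"x"}, "get_y": {"y"}}))                   # 1
# Adding one method that touches both drives the metric back to zero:
print(lcom1({"get_x": {"x"}, "get_y": {"y"}, "reset": {"x", "y"}}))  # 0
```

Note how a single extra method flips the verdict, which illustrates why it is hard to argue that any one of these counts captures "modularity".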

The real problem I have with the CK suite and similar metrics is that they only measure a single module (or class, in this case). I claim that the quality of the design of a system is not something that can be determined by looking at individual classes. It is conceivable that every class in a system could be considered reasonable by whatever single-class metrics are being used, while the overall design is considered bad. Even if we do find some "bad" classes, we are still left with the question of whether it is worth the effort to redesign them so that they are "good" without causing problems for other classes; problems at the design level of a class are unlikely to be fixed without impacting other classes. Finally, it is quite clear that claims of modularity are not being made of single classes, but of the overall design. What we need are measurements of design quality that apply to the entire design.

Measuring overall design quality has a number of issues, not least of which is what "quality" means. I suspect that claims about modularity often really refer to the degree to which the system can be changed, that is, "modifiability". I also suspect that most of the time the speaker is really just using "modularity" as a synonym for "good" and has not thought about what it might really mean. The problem with this is that there are a number of different quality attributes that we care about, and it seems very unlikely that one design will simultaneously meet all requirements of goodness. For example, a design intended to allow a system to be created quickly ("buildability") will probably contain design decisions that make it more difficult to change ("modifiability") than other designs. Designs aimed at modifiability are likely to have properties that affect "performance". Designs intended to have good performance may contain decisions that are difficult to understand ("understandability").

One quality attribute that has received a lot of attention is "correctness", that is, the degree to which the system meets the behaviour needed by the customer. Many researchers have tried to find relationships between various forms of measurements (usually related to the CK metric suite) and measures of correctness based on "defects". The problem I have with these studies is that there is rarely sufficient detail about them to allow them to be replicated. Repeatable experiments are a fundamental aspect of science. Most importantly, in my opinion, the studies do not give clear definitions of what is meant by "defect". Given the problems in measuring defects [FP96], it is difficult to treat the results of these studies with any confidence, and impossible to repeat the studies. We need to carry out repeatable studies.

One means to carry out repeatable studies is through the use of software corpora. If studies are carried out on a clearly identified set of software artifacts, then other studies can be performed on the same set with some confidence that all studies are dealing with the same thing. Developing a corpus has been difficult in the past. For example, the existence of different dialects of languages made it difficult to manage different applications. Much more problematic was the fact that much software was proprietary, and so access was difficult. The growth of the open source software movement has reduced this problem, and the large-scale use of languages such as Java, whose standards are generally followed, has made management easier. There is a difficulty in knowing how "representative" any given corpus is, but that difficulty exists in linguistics corpora too, and the linguistics community seem to cope [Hun02].

Even with a corpus, there is still the question of where to start in the quest for a design quality metric. What we need to do is start with the basics. We need to find attributes of software whose measurement we can all agree on. Much of the existing work on software measurement tries to relate some set of measurements to some quality attributes. This seems to have been taken to the point where work on measurements that does not try to establish some relationship with quality attributes is deemed not useful. But if we cannot agree on what attribute the measurements apply to, then what is the point of knowing that some relationship exists between those measurements and some quality attribute?

So what kind of attribute is worth measuring if it doesn't relate to some notion of quality? Consider the measurement of "height" of a human. Such measurements are made all the time despite the fact that there is no direct relationship between height and quality attributes of humans. Yes, height can be indicative - a height too different from the norm is considered an indicator of health problems - but that is only useful because the norms are known. Height can affect choice of career (someone who is tall has an advantage for basketball but a disadvantage as a pilot of small military aeroplanes); height has some impact on choice of clothes (but knowledge of height alone is not sufficient for finding a good fit). Knowing only heights is generally of limited use, and yet height is considered a fundamental measurement for humans. We need to agree on how to measure the fundamental attributes of software.

An attribute that people often like to begin with is "size". In fact, many raise it as a criticism of software metrics because of the problems associated with such measurements as "lines of code". Yes, there is a problem with using "lines of code" as a measurement, but not because it is inherently bad: there are a number of different meanings that might be associated with this phrase, and generally those using it do not bother to say which one is meant. Having different meanings does not in itself invalidate the concept. After all, we have all kinds of size measurements for humans: height, chest, waist, inside leg, neck, foot, and so on. Is it so unreasonable that we have many different size measurements for software?
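The ambiguity is easy to demonstrate. The sketch below computes three of the many plausible "lines of code" counts for the same source text: physical lines, non-blank lines, and non-blank non-comment lines. The definitions chosen here (and the naive comment detection, which only recognises lines starting with "#") are my own illustrative assumptions, not any standard.

```python
def loc_counts(source):
    """Three plausible 'lines of code' counts for one source string:
    physical lines, non-blank lines, and non-blank non-comment lines.
    Comment detection is deliberately naive: '#'-prefixed lines only."""
    lines = source.splitlines()
    nonblank = [line for line in lines if line.strip()]
    code = [line for line in nonblank if not line.strip().startswith("#")]
    return {"physical": len(lines), "non_blank": len(nonblank), "code": len(code)}

sample = """# a comment

x = 1
y = 2  # a trailing comment still counts as code here
"""
print(loc_counts(sample))  # {'physical': 4, 'non_blank': 3, 'code': 2}
```

Three different answers for one four-line file: unless a report says which definition it used, its "size" figure is uninterpretable.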

In fact, there has been discussion as to how to measure size. Fenton and Pfleeger discuss the dimensions of "size" of software [FP96]. Munson proposes a means to measure what Fenton and Pfleeger call "length" [Mun03], based (roughly) on the number of tokens in the source. Another reasonable notion of "size" for code written in an object-oriented language is "number of classes", which may, on the face of it, seem like something that can be easily defined, but turns out not to be. Nevertheless, what we need to do is continue these efforts. It does not matter if we end up with 27 different size metrics, just so long as we say which one we are using when we report size measurements.
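Even "number of classes" admits more than one defensible count, as this small sketch shows for Python source (the choice of Python and of these two particular counts is mine, purely for illustration): do nested classes and classes defined inside functions count, or only top-level ones?

```python
import ast

def class_counts(source):
    """Two plausible 'number of classes' counts for a Python module:
    top-level class definitions only, versus all class definitions,
    including those nested in other classes or inside functions."""
    tree = ast.parse(source)
    top_level = sum(isinstance(node, ast.ClassDef) for node in tree.body)
    all_defs = sum(isinstance(node, ast.ClassDef) for node in ast.walk(tree))
    return top_level, all_defs

src = """
class A:
    class Inner: pass

def f():
    class Local: pass
"""
print(class_counts(src))  # (1, 3)
```

One module, and the class count is either 1 or 3 depending on a definitional choice that reports rarely make explicit.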

Other criticisms of software metrics have focused on the difficulty of making the measurements. Again, we can look at examples from attributes of humans. Measuring such things as blood pressure, eyesight, or hearing is also difficult, and the history of measuring them can be quite enlightening (e.g., [CEN]). We should not give up on measuring fundamental software attributes because it is "too hard".


[BDW98] Briand, L. C., Daly, J. W., and Wüst, J. 1998. A Unified Framework for Cohesion Measurement in Object-Oriented Systems. Empirical Software Engineering 3, 1 (Jul. 1998), 65-117. DOI= http://dx.doi.org/10.1023/A:1009783721306
[BDW99] Briand, L. C., Daly, J. W., and Wüst, J. K. 1999. A Unified Framework for Coupling Measurement in Object-Oriented Systems. IEEE Transactions on Software Engineering 25, 1 (Jan. 1999), 91-121. DOI= http://dx.doi.org/10.1109/32.748920
[CEN] Clinical Engineering Network. The history of blood pressure measurement.
[CK91] Chidamber, S. R. and Kemerer, C. F. 1991. Towards a metrics suite for object oriented design. In Conference Proceedings on Object-Oriented Programming Systems, Languages, and Applications (Phoenix, Arizona, United States, October 06-11, 1991). A. Paepcke, Ed. OOPSLA '91. ACM Press, New York, NY, 197-211. DOI= http://doi.acm.org/10.1145/117954.117970
[FP96] Fenton, N. and Pfleeger, S. L. 1997. Software Metrics: A Rigorous and Practical Approach, second edition. International Thomson Computer Press, London, UK.
[Hun02] Hunston, S. 2002. Corpora in Applied Linguistics. Cambridge University Press.
[Mun03] Munson, J. C. 2003. Software Engineering Measurement. Auerbach Publications.


Minor amendments 13 March 2007
Minor amendments 4 May 2007
Replaced broken link 3 Jan 2012