A Corpus of Java Software
On my research page I explained how a Java program can be modeled as a directed graph, the nodes of which are source files and the edges of which are compilation dependencies. I also introduced two metrics, Class Reachability Set Size (CRSS) and Strongly Connected Component (SCC) size, that can be computed from this directed graph representation of a program. I explained how these metrics can give us some useful information about the quality of a program's internal structure, or design. On this page I present the results of an empirical study of these two metrics over a large corpus of Java programs.
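The graph model and the two metrics can be sketched in a few lines of Java. This is a minimal illustration, not the tool used for the study: the class name `DependencyMetrics` and its methods are hypothetical, CRSS is assumed to count the classes a class transitively depends on (excluding itself), and a class outside any cycle is assumed to have an SCC size of 1.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: a program is a map from each class to the classes it depends on.
class DependencyMetrics {

    // Every node reachable from start by following one or more dependency edges.
    static Set<String> reachable(Map<String, Set<String>> deps, String start) {
        Set<String> seen = new HashSet<>();
        Deque<String> work =
            new ArrayDeque<>(deps.getOrDefault(start, Collections.emptySet()));
        while (!work.isEmpty()) {
            String next = work.pop();
            if (seen.add(next)) {
                work.addAll(deps.getOrDefault(next, Collections.emptySet()));
            }
        }
        return seen;
    }

    // Assumption: CRSS excludes the class itself, even when it sits on a cycle.
    static int crss(Map<String, Set<String>> deps, String node) {
        Set<String> reach = reachable(deps, node);
        reach.remove(node);
        return reach.size();
    }

    // Size of the SCC containing node: node plus every class that is
    // reachable from node and can also reach node. Simple, not optimized.
    static int sccSize(Map<String, Set<String>> deps, String node) {
        int size = 1;
        for (String other : reachable(deps, node)) {
            if (!other.equals(node) && reachable(deps, other).contains(node)) {
                size++;
            }
        }
        return size;
    }

    public static void main(String[] args) {
        // Made-up example: A -> B, B -> C, C -> B, C -> D
        Map<String, Set<String>> deps = new HashMap<>();
        deps.put("A", new HashSet<>(Arrays.asList("B")));
        deps.put("B", new HashSet<>(Arrays.asList("C")));
        deps.put("C", new HashSet<>(Arrays.asList("B", "D")));
        for (String name : Arrays.asList("A", "B", "C", "D")) {
            System.out.println(name + " crss=" + crss(deps, name)
                + " scc=" + sccSize(deps, name));
        }
    }
}
```

In the made-up example, A depends transitively on B, C and D, so its CRSS is 3, while B and C depend on each other and form an SCC of size 2. A real tool would use a linear-time SCC algorithm such as Tarjan's. The quadratic approach here is only meant to keep the definitions visible.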
How to Interpret the Graphs
The graphs that follow differ from those I have presented elsewhere in that a single graph contains both CRSS and SCC information. The graphs are still CRSS histograms at heart, but the classes in each bar of the histogram are broken down into their SCCs. The yellow part of a bar indicates the number of classes that are not involved in any cycles. The red parts of the bar divide the remaining classes in the bar into their SCCs. The ordering of the parts within a bar is an arbitrary choice of mine: the yellow part goes at the bottom, and the red parts go on top of it in order of increasing size.
Figure 1 shows the CRSS histograms (described above) for 9 major releases of a commercial application. If we were able to zoom in on the first release (as I've tried to depict in the figure), we would see that there are 3 SCCs: two small ones with "low" CRSS values, and one "big" SCC with a "large" CRSS value. There are also classes not involved in any cycles in both bars, represented by the yellow parts. If you're too busy to read my research page and just want to use these graphs to tell a "good" structure from a "bad" one: big red bars are bad.
Figure 1: CRSS histograms (that also show SCCs) for consecutive releases of a commercial product
(If you're wondering why the graphs on this page are "watermarked" with "DEMO", it's because I used a trial version of a graph drawing package called BFO-Graph to generate them. It's the only graph package I know of that can do stacked, multi-series bar graphs; even Excel can't do them.)
Figure 2 shows the CRSS distributions for the latest versions of all programs in the Java software corpus. The programs are sorted along the axis slanted like this "/" in order of increasing size, where size is measured as the number of source files. You can click on the image to get the full-sized (readable) version of it. Because of the bin size chosen for the histogram (200), the distributions of CRSS values are "hidden" for the smaller programs in the corpus. I have attempted to reveal the CRSS distributions for smaller programs by splitting the corpus into three histograms: one showing small programs only, one showing medium-sized programs only, and one showing large programs only. These are shown immediately below Figure 2. Again, you can click on them to view the full-sized images.
Figure 2: CRSS distributions when the corpus is split into small, medium and large programs (click for full-sized images)
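The effect of the bin size on small programs can be illustrated with a short sketch. The class `CrssHistogram` and its method are hypothetical names, and the CRSS values in the usage note are made up. The point is only that any program whose CRSS values all fall below the bin size collapses into a single bar.

```java
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical helper that bins CRSS values the way a histogram would.
class CrssHistogram {

    // Maps the lower bound of each bin to the number of classes falling in it.
    static SortedMap<Integer, Integer> histogram(List<Integer> crssValues, int binSize) {
        SortedMap<Integer, Integer> counts = new TreeMap<>();
        for (int crss : crssValues) {
            int binStart = (crss / binSize) * binSize;
            counts.merge(binStart, 1, Integer::sum);
        }
        return counts;
    }
}
```

With a bin size of 200, a small program whose classes have CRSS values of, say, 3, 57 and 199 produces one bin holding all three classes. A bin size of 50 would spread the same values over three bins.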
In the table below, the left column shows how the distribution of CRSS values changes between consecutive releases of each program (for the programs of which I have multiple versions). In the right column I have attempted to compare the actual CRSS distribution (and SCCs) with the CRSS/SCCs that arise when only the "USES-IN-THE-INTERFACE" relation is considered. The "USES-IN-THE-INTERFACE" relation is meant to give a lower bound on the sizes of the Strongly Connected Components, because it has been said that some cycles among classes are unavoidable. I don't want to go into the details of the "USES" and "USES-IN-THE-INTERFACE" relations here; they are described in my paper "An Empirical Study of Cycles among Classes in Java".
I have also provided links to the data files used to generate the graphs. If you want to see which files are involved in SCCs, or which have particular CRSS values, download the zip file (follow the link "download data files" next to the version you're interested in) and open the file _all.txt in Excel. You will see a table with the column names "class", "crss", "fanout", "fanin" and "scc". The "class" and "crss" columns are self-explanatory. As for the other columns: "fanout" is the number of other source files a given class directly depends on, "fanin" is the number of source files that reference this class, and "scc" is the size of the SCC the class participates in. You can even download the tool I used to collect all the data from Java bytecode here.
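If you would rather process _all.txt programmatically than in Excel, something like the following sketch could work. It rests on assumptions the page does not confirm: that the file is tab-separated, that its first line is a header containing the column names listed above, and that an "scc" value of 1 means the class is not in a cycle. The class `AllTxtReader` and its method are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical reader for _all.txt; the tab delimiter and header row are assumptions.
class AllTxtReader {

    // Returns the names of classes whose "scc" value is greater than minSize.
    static List<String> classesWithSccLargerThan(List<String> lines, int minSize) {
        List<String> header = Arrays.asList(lines.get(0).trim().split("\t"));
        int classCol = header.indexOf("class");
        int sccCol = header.indexOf("scc");
        if (classCol < 0 || sccCol < 0) {
            throw new IllegalArgumentException("expected 'class' and 'scc' columns");
        }
        List<String> result = new ArrayList<>();
        for (String line : lines.subList(1, lines.size())) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] cells = line.split("\t");
            if (Integer.parseInt(cells[sccCol].trim()) > minSize) {
                result.add(cells[classCol].trim());
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java AllTxtReader <path to _all.txt>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        // Assumption: an scc value of 1 means the class is not in any cycle.
        for (String name : classesWithSccLargerThan(lines, 1)) {
            System.out.println(name);
        }
    }
}
```

If the real file uses a different delimiter or quotes its values, the `split("\t")` calls would need adjusting.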