A Corpus of Java Software


On my research page I explained how a Java program can be modeled as a directed graph, the nodes of which are source files and the edges of which are compilation dependencies. I also introduced two metrics, Class Reachability Set Size (CRSS) and Strongly Connected Component (SCC) size, that can be computed from this directed graph representation of a program. I explained how these metrics can give us some useful information about the quality of a program's internal structure, or design.  On this page I present the results of an empirical study of these two metrics over a large corpus of Java programs.
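As a rough illustration of the two metrics, here is a minimal sketch that computes them for a small dependency graph. It is not the tool used for this study. The class name `DependencyMetrics` and its method names are hypothetical, and it rests on two assumptions: a class's reachability set includes the class itself, and a class that is in no cycle is reported with an SCC size of 1. It uses a naive reachability search per node rather than a linear-time algorithm such as Tarjan's, which is fine for small graphs but slow for large ones.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class DependencyMetrics {
    /** For every node, the set of nodes reachable from it (ASSUMPTION: including itself). */
    static Map<String, Set<String>> reachability(Map<String, List<String>> deps) {
        // Collect every node, including ones that only appear as dependency targets.
        Set<String> nodes = new HashSet<>(deps.keySet());
        for (List<String> targets : deps.values()) {
            nodes.addAll(targets);
        }
        Map<String, Set<String>> reach = new HashMap<>();
        for (String start : nodes) {
            Set<String> seen = new HashSet<>();
            Deque<String> stack = new ArrayDeque<>();
            seen.add(start);
            stack.push(start);
            // Depth-first walk along the dependency edges.
            while (!stack.isEmpty()) {
                String current = stack.pop();
                for (String next : deps.getOrDefault(current, List.of())) {
                    if (seen.add(next)) {
                        stack.push(next);
                    }
                }
            }
            reach.put(start, seen);
        }
        return reach;
    }

    /** CRSS of each node: the size of its reachability set. */
    static Map<String, Integer> crss(Map<String, List<String>> deps) {
        Map<String, Integer> out = new HashMap<>();
        reachability(deps).forEach((node, set) -> out.put(node, set.size()));
        return out;
    }

    /** Size of the SCC each node belongs to: nodes that reach it and are reached by it. */
    static Map<String, Integer> sccSize(Map<String, List<String>> deps) {
        Map<String, Set<String>> reach = reachability(deps);
        Map<String, Integer> out = new HashMap<>();
        for (Map.Entry<String, Set<String>> entry : reach.entrySet()) {
            int size = 0;
            for (String other : entry.getValue()) {
                if (reach.get(other).contains(entry.getKey())) {
                    size++;
                }
            }
            out.put(entry.getKey(), size);
        }
        return out;
    }
}
```

The input maps each source file to the files it depends on directly. All classes in the same SCC reach exactly the same set of classes, so they always share one CRSS value.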


How to Interpret the Graphs

The graphs that follow differ from those I have presented elsewhere in that a single graph contains both CRSS and SCC information. They are still CRSS histograms at heart, but the classes in each bar of the histogram are broken down into their SCCs. The yellow part of a bar indicates the number of classes that are not involved in any cycles. The red parts of the bar divide the remaining classes in the bar into their SCCs. The ordering of the parts within a bar is an arbitrary choice on my part: the yellow part goes at the bottom, and the red parts go on top of it in order of increasing size.
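The stacking rule can be sketched in a few lines. This is only an illustration, not the code behind the figures. The class `BarSegments` and the idea of tagging each class with an SCC identifier (`null` for a class in no cycle) are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BarSegments {
    /**
     * sccIds has one entry per class in a single histogram bar: a hypothetical
     * SCC identifier, or null when the class is not involved in any cycle.
     * Returns the segment heights from bottom to top.
     */
    static List<Integer> segments(List<String> sccIds) {
        int acyclic = 0;
        Map<String, Integer> classesPerScc = new HashMap<>();
        for (String id : sccIds) {
            if (id == null) {
                acyclic++;
            } else {
                classesPerScc.merge(id, 1, Integer::sum);
            }
        }
        // Red parts: one per SCC, smallest first.
        List<Integer> red = new ArrayList<>(classesPerScc.values());
        Collections.sort(red);

        // Yellow part (classes in no cycle) always goes at the bottom.
        List<Integer> heights = new ArrayList<>();
        heights.add(acyclic);
        heights.addAll(red);
        return heights;
    }
}
```

The first element of the result is the height of the yellow part, and the remaining elements are the red parts in the order they are stacked.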

Figure 1 shows the CRSS histograms (described above) for 9 major releases of a commercial application. If we were able to "zoom in" on the first release (as I've tried to depict in the figure) we could see that there are 3 SCCs: two small ones with "low" CRSS values, and one "big" SCC with a "large" CRSS value. There are also classes not involved in any cycles in both bars, represented by the yellow parts. If you're too busy to read my research page and just want to use these graphs to tell a "good" structure from a "bad" one: big red bars are bad.

Figure 1: CRSS histograms (that also show SCCs) for consecutive releases of a commercial product

(If you're wondering why the graphs on this page are "watermarked" with "DEMO", it's because I used a trial version of a graph drawing package called BFO-Graph to generate them. It's the only graph package I know of that can do stacked, multi-series bar graphs; even Excel can't.)


The Results

Figure 2 shows the CRSS distributions for the latest versions of all programs in the Java software corpus. The programs are sorted along the axis slanted like this "/" in order of increasing size, where size is measured in terms of number of source files. You can click on the image to get the full-sized (readable) version of it. Because of the bin size chosen for the histogram (200) the distributions of CRSS values are "hidden" for smaller programs in the corpus. I have attempted to reveal the CRSS distributions for smaller programs by splitting the corpus into three histograms -- one showing small programs only, one showing medium-sized programs only, and one showing large programs only. These are shown immediately below Figure 2. Again you can click on them to view the full-sized images.
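To see why a bin size of 200 hides the distributions of small programs, consider this minimal binning sketch. The class `CrssHistogram` is hypothetical, and the choice of half-open bins starting at 0 (so 0-199 is the first bin) is an assumption; the figures may place the bin boundaries differently.

```java
import java.util.Map;
import java.util.TreeMap;

final class CrssHistogram {
    /**
     * Counts how many CRSS values fall into each bin of the given width.
     * Keys are the lower bounds of the bins; ASSUMPTION: bins are
     * [0, width), [width, 2 * width), and so on. Values must be non-negative.
     */
    static Map<Integer, Integer> bin(int[] crssValues, int binWidth) {
        if (binWidth <= 0) {
            throw new IllegalArgumentException("binWidth must be positive");
        }
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int value : crssValues) {
            int lowerBound = (value / binWidth) * binWidth;
            counts.merge(lowerBound, 1, Integer::sum);
        }
        return counts;
    }
}
```

A program with fewer than 200 source files can only have CRSS values below 200, so with `binWidth` set to 200 every class lands in the same bin and the shape of the distribution disappears. A narrower bin width spreads the same values over several bins.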

Figure 2: CRSS distributions for the latest version of each program in the Java corpus (click for full-sized image)


Figure 2: CRSS distributions when the corpus is split into small, medium and large programs (click for full-sized images)


In the table below, the left column shows how the distributions of CRSS values change between consecutive releases of each program (for the programs I have multiple versions of). In the right column I have attempted to compare the actual CRSS distribution (and SCCs) with the CRSS/SCCs that arise under the "USES-IN-THE-INTERFACE" relation. The "USES-IN-THE-INTERFACE" relation is meant to give a lower bound on the sizes of the Strongly Connected Components, because it has been said that some cycles among classes are unavoidable. I don't want to go into the details of the "USES" and "USES-IN-THE-INTERFACE" relations here; they are described in my paper "An Empirical Study of Cycles among Classes in Java".

I have also provided links to the data files used to generate the graphs. If you want to see which files are involved in SCCs, or have particular CRSS values, you should download the zip file (follow the link "download data files" next to the version you're interested in) and open the file _all.txt in Excel. You will see a table with these column names: "class", "crss", "fanout", "fanin", "scc". The "class" and "crss" columns are self-explanatory. As for the other columns: "fanout" is the number of other source files a given class directly depends on, "fanin" is the number of source files that reference this class, and "scc" is the size of the SCC the class participates in. You can even download the tool I used to collect all the data from Java bytecode here.
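If you would rather process _all.txt in code than in Excel, something along these lines may work. It is a sketch under assumptions I have not checked against the actual files: that the file is tab-separated and that its first line is the header row with the five column names listed above. The class `AllTxtReader` and its method names are hypothetical. What value the "scc" column holds for classes that are in no cycle is not stated here, so the threshold below is left to the caller.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class AllTxtReader {
    /** One row of the table: a class and its four metric values. */
    static final class Row {
        final String className;
        final int crss, fanout, fanin, scc;

        Row(String className, int crss, int fanout, int fanin, int scc) {
            this.className = className;
            this.crss = crss;
            this.fanout = fanout;
            this.fanin = fanin;
            this.scc = scc;
        }
    }

    /** ASSUMPTION: tab-separated values, with the column names on the first line. */
    static List<Row> parse(List<String> lines) {
        List<Row> rows = new ArrayList<>();
        if (lines.isEmpty()) {
            return rows;
        }
        // Look columns up by name so their order in the file does not matter.
        List<String> header = Arrays.asList(lines.get(0).trim().split("\t"));
        int iClass = column(header, "class");
        int iCrss = column(header, "crss");
        int iFanout = column(header, "fanout");
        int iFanin = column(header, "fanin");
        int iScc = column(header, "scc");
        for (String line : lines.subList(1, lines.size())) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] cells = line.split("\t");
            rows.add(new Row(
                    cells[iClass].trim(),
                    Integer.parseInt(cells[iCrss].trim()),
                    Integer.parseInt(cells[iFanout].trim()),
                    Integer.parseInt(cells[iFanin].trim()),
                    Integer.parseInt(cells[iScc].trim())));
        }
        return rows;
    }

    private static int column(List<String> header, String name) {
        int index = header.indexOf(name);
        if (index < 0) {
            throw new IllegalArgumentException("missing column: " + name);
        }
        return index;
    }

    /** Names of the classes whose "scc" value is at least minSize, in file order. */
    static List<String> classesInSccsOfAtLeast(List<Row> rows, int minSize) {
        List<String> names = new ArrayList<>();
        for (Row row : rows) {
            if (row.scc >= minSize) {
                names.add(row.className);
            }
        }
        return names;
    }
}
```

The lines can be read with `java.nio.file.Files.readAllLines(java.nio.file.Path.of("_all.txt"))` and passed to `parse`. If the real delimiter turns out to be something other than a tab, only the two `split` calls need to change.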

CRSS/SCC histograms for consecutive releases of each application in the corpus | Comparison between CRSS/SCC histograms for USES and USES-IN-THE-INTERFACE relations
aglets-2.0.2   download data files

ant-1.1   download data files

ant-1.2   download data files

ant-1.3   download data files

ant-1.4   download data files

ant-1.4.1   download data files

ant-1.5   download data files

ant-1.5.1   download data files

ant-1.5.2   download data files

ant-1.5.4   download data files

ant-1.6.0   download data files

ant-1.6.2   download data files

ant-1.6.5   download data files

antlr-2.7.5   download data files

aoi-2.2   download data files

argouml-0.16.1   download data files

argouml-0.18.1   download data files

argouml-0.20   download data files

axion-1.0-M2   download data files

azureus-   download data files

azureus-   download data files

azureus-   download data files

azureus-   download data files

azureus-   download data files

azureus-   download data files

azureus-   download data files

azureus-   download data files

azureus-   download data files

azureus-   download data files

azureus-   download data files

azureus-   download data files

azureus-   download data files

azureus-   download data files

azureus-   download data files

azureus-   download data files

azureus-   download data files

bluej-2.1.0   download data files

colt-1.0.1   download data files

colt-1.0.2   download data files

colt-1.0.3   download data files

colt-1.1.0   download data files

colt-1.2.0   download data files

columba-1.0   download data files

compiere-251e   download data files

derby-   download data files

drawswf-1.2.9   download data files

drjava-20050814   download data files

eclipse-SDK-1.0-win32   download data files

eclipse-SDK-2.0-win32   download data files

eclipse-SDK-2.0.1-win32   download data files

eclipse-SDK-2.0.2-win32   download data files

eclipse-SDK-2.1-win32   download data files

eclipse-SDK-2.1.1-win32   download data files

eclipse-SDK-2.1.2-win32   download data files

eclipse-SDK-2.1.3-win32   download data files

eclipse-SDK-3.0-win32   download data files

eclipse-SDK-3.0.1-win32   download data files

eclipse-SDK-3.0.2-win32   download data files

eclipse-SDK-3.1-win32   download data files

eclipse-SDK-3.1.2-win32   download data files

fitjava-1.1   download data files

fitlibraryforfitnesse-20050923   download data files

galleon-1.8.0   download data files

ganttproject-1.11.1   download data files

geronimo-1.0-M5   download data files

glassfish-9.0-b15   download data files

gt2-2.2-rc3   download data files

hibernate-3.1-rc2   download data files

hsqldb-   download data files

ireport-0.5.2   download data files

jag-5.0.1   download data files

jaga-1.0.b   download data files

james-2.2.0   download data files

jasperreports-1.1.0   download data files

javacc-3.2   download data files

jboss-4.0.3-SP1   download data files

jchempaint-2.0.12   download data files

jedit-4.2   download data files

jeppers-20050607   download data files

jetty-5.1.8   download data files

jext-5.0   download data files

jfreechart-1.0.0-rc1   download data files

jgraph-5.7.4   download data files

jgraph-   download data files

jhotdraw-5.2.0   download data files

jhotdraw-5.3.0   download data files

jhotdraw-5.4.1   download data files

jhotdraw-5.4.2   download data files

jhotdraw-6.0.1   download data files

jmeter-1.8.1   download data files

jmeter-1.9.1   download data files

jmeter-2.0.0   download data files

jmeter-2.0.1   download data files

jmeter-2.0.2   download data files

jmeter-2.0.3   download data files

jmeter-2.1   download data files

jmeter-2.1.1   download data files

joggplayer-1.1.4s   download data files

jparse-0.96   download data files

jre-   download data files

jrefactory-2.9.19   download data files

jtopen-4.9   download data files

jung-1.0.0   download data files

jung-1.1.0   download data files

jung-1.2.0   download data files

jung-1.3.0   download data files

jung-1.4.0   download data files

jung-1.4.2   download data files

jung-1.4.3   download data files

jung-1.5.0   download data files

jung-1.5.1   download data files

jung-1.5.2   download data files

jung-1.5.3   download data files

jung-1.5.4   download data files

jung-1.6.0   download data files

jung-1.7.0   download data files

jung-1.7.1   download data files

junit-2.0   download data files

junit-2.1   download data files

junit-3.0   download data files

junit-3.4   download data files

junit-3.5   download data files

junit-3.6   download data files

junit-3.7   download data files

junit-3.8   download data files

junit-3.8.1   download data files

junit-3.8.2   download data files

junit-4.0   download data files

junit-4.1   download data files

lucene-1.2-final   download data files

lucene-1.3-final   download data files

lucene-1.4.3   download data files

megamek-2005.10.11   download data files

netbeans-3.5.1   download data files

netbeans-3.6   download data files

netbeans-4.0   download data files

netbeans-4.1   download data files

netbeans-5.0   download data files

netbeans-5.5-beta   download data files

openoffice-2.0.0   download data files

pmd-3.3   download data files

poi-2.5.1   download data files

rssowl-1.2   download data files

sablecc-3.1   download data files

sandmark-3.4   download data files

scala-   download data files

sequoiaerp-0.8.2-RC1-all-platforms   download data files

soot-1.0.0   download data files

soot-1.2.0   download data files

soot-1.2.1   download data files

soot-1.2.2   download data files

soot-1.2.3   download data files

soot-1.2.4   download data files

soot-1.2.5   download data files

soot-2.0   download data files

soot-2.0.1   download data files

soot-2.1.0   download data files

soot-2.2.3   download data files

springframework-1.2.7   download data files

tomcat-5.0.28   download data files

tomcat-5.5.17   download data files



