A Corpus of Java Software



Introduction

On my research page I explained how a Java program can be modeled as a directed graph whose nodes are source files and whose edges are compilation dependencies. I also introduced two metrics that can be computed from this directed graph representation of a program: Class Reachability Set Size (CRSS) and Strongly Connected Component (SCC) size. I explained how these metrics can give us some useful information about the quality of a program's internal structure, or design. On this page I present the results of an empirical study of these two metrics over a large corpus of Java programs.
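As a rough illustration of the two metrics (a minimal sketch, not the tool used to collect the data on this page), the following Java code computes both from an adjacency-list graph. The class name `DependencyMetrics`, the method names and the example graph are hypothetical. The sketch also assumes that a class counts as a member of its own reachability set, and that a class involved in no cycle has an SCC size of 1; neither convention is confirmed by this page. SCC membership is computed by the simple mutual-reachability test, which is easy to check by hand; Tarjan's algorithm would be the faster choice for large programs.

```java
import java.util.*;

class DependencyMetrics {

    /** Every node reachable from start by following dependency edges, including start itself. */
    static Set<String> reachable(Map<String, List<String>> graph, String start) {
        Set<String> seen = new HashSet<>();
        Deque<String> stack = new ArrayDeque<>();
        seen.add(start);
        stack.push(start);
        while (!stack.isEmpty()) {
            String node = stack.pop();
            for (String next : graph.getOrDefault(node, Collections.<String>emptyList())) {
                if (seen.add(next)) {
                    stack.push(next);
                }
            }
        }
        return seen;
    }

    /** CRSS of one node: the size of its reachability set (assumed to include the node itself). */
    static int crss(Map<String, List<String>> graph, String start) {
        return reachable(graph, start).size();
    }

    /** For each node, the size of its SCC: u and v share an SCC when each can reach the other. */
    static Map<String, Integer> sccSizes(Map<String, List<String>> graph) {
        Set<String> nodes = new TreeSet<>(graph.keySet());
        for (List<String> targets : graph.values()) {
            nodes.addAll(targets);
        }
        Map<String, Set<String>> reach = new HashMap<>();
        for (String n : nodes) {
            reach.put(n, reachable(graph, n));
        }
        Map<String, Integer> sizes = new TreeMap<>();
        for (String u : nodes) {
            int size = 0;
            for (String v : reach.get(u)) {
                if (reach.get(v).contains(u)) {
                    size++;
                }
            }
            sizes.put(u, size);
        }
        return sizes;
    }

    public static void main(String[] args) {
        // Hypothetical example: A depends on B, B and C depend on each other, C depends on D.
        Map<String, List<String>> graph = new LinkedHashMap<>();
        graph.put("A", Arrays.asList("B"));
        graph.put("B", Arrays.asList("C"));
        graph.put("C", Arrays.asList("B", "D"));

        System.out.println(crss(graph, "A"));  // prints 4
        System.out.println(sccSizes(graph));   // prints {A=1, B=2, C=2, D=1}
    }
}
```

In this example B and C form the only cycle, so they share an SCC of size 2, and A has the largest CRSS because it reaches every other node.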

 

How to Interpret the Graphs

The graphs that follow differ from those I have presented elsewhere in that a single graph contains both CRSS and SCC information. They are still CRSS histograms at heart, but the classes in each bar of the histogram are broken down into their SCCs. The yellow part of a bar indicates the number of classes that are not involved in any cycles. The red parts of the bar divide the remaining classes in the bar into their SCCs. The ordering of the parts within a bar is an arbitrary choice on my part: the yellow part goes at the bottom, and the red parts go on top of it in order of increasing size.
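The bar layout described above can be sketched in Java as follows. This is only an illustration under stated assumptions, not the code that produced the graphs: the class `StackedCrssHistogram`, the `{crss, sccId}` row format and the `NO_CYCLE` marker are all hypothetical. One property makes the layout workable: classes in the same SCC can all reach one another, so they share one reachability set and therefore one CRSS value, which means an SCC never straddles two bars.

```java
import java.util.*;

class StackedCrssHistogram {

    /** Hypothetical marker for a class that is not involved in any cycle. */
    static final int NO_CYCLE = -1;

    /** One histogram bar: a yellow count plus red segments, one per SCC. */
    static final class Bar {
        int acyclic = 0;                                      // yellow part (bottom)
        final List<Integer> sccSegments = new ArrayList<>();  // red parts, smallest first
    }

    /**
     * Builds the bars. Each row of classes is {crss, sccId}, where sccId is
     * NO_CYCLE for a class in no cycle. Keys of the result are bin start values.
     */
    static SortedMap<Integer, Bar> build(int[][] classes, int binWidth) {
        SortedMap<Integer, Bar> bars = new TreeMap<>();
        Map<Integer, Map<Integer, Integer>> sccCounts = new HashMap<>(); // bin -> sccId -> classes
        for (int[] c : classes) {
            int bin = (c[0] / binWidth) * binWidth;
            Bar bar = bars.computeIfAbsent(bin, b -> new Bar());
            if (c[1] == NO_CYCLE) {
                bar.acyclic++;
            } else {
                sccCounts.computeIfAbsent(bin, b -> new HashMap<>())
                         .merge(c[1], 1, Integer::sum);
            }
        }
        for (Map.Entry<Integer, Map<Integer, Integer>> e : sccCounts.entrySet()) {
            List<Integer> segments = bars.get(e.getKey()).sccSegments;
            segments.addAll(e.getValue().values());
            Collections.sort(segments); // red parts stack in order of increasing size
        }
        return bars;
    }

    public static void main(String[] args) {
        // Hypothetical data: two acyclic classes and two small SCCs with low CRSS,
        // plus one SCC of four classes and one acyclic class with higher CRSS.
        int[][] classes = {
            {5, NO_CYCLE}, {5, NO_CYCLE},
            {12, 0}, {12, 0}, {12, 0},
            {14, 1}, {14, 1},
            {230, 2}, {230, 2}, {230, 2}, {230, 2},
            {250, NO_CYCLE}
        };
        for (Map.Entry<Integer, Bar> e : build(classes, 200).entrySet()) {
            Bar bar = e.getValue();
            System.out.println(e.getKey() + ": yellow=" + bar.acyclic + " red=" + bar.sccSegments);
        }
        // prints:
        // 0: yellow=2 red=[2, 3]
        // 200: yellow=1 red=[4]
    }
}
```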

Figure 1 shows the CRSS histograms (described above) for 9 major releases of a commercial application. If we were able to "zoom in" on the first release (as I've tried to depict in the figure) we would see that there are 3 SCCs: two small ones with "low" CRSS values, and one "big" SCC with a "large" CRSS value. Both bars also contain classes not involved in any cycles, represented by the yellow parts. If you're too busy to read my research page and want to use these graphs to tell a "good" structure from a "bad" one: big red bars are bad.

Figure 1: CRSS histograms (that also show SCCs) for consecutive releases of a commercial product

(If you're wondering why the graphs on this page are "watermarked" with "DEMO", it's because I used a trial version of a graph drawing package called BFO-Graph to generate them. It's the only graph package I know of that can do stacked, multi-series bar graphs; even Excel can't do them.)

 

The Results

Figure 2 shows the CRSS distributions for the latest versions of all programs in the Java software corpus. The programs are sorted along the axis slanted like this "/" in order of increasing size, where size is measured as the number of source files. You can click on the image to get the full-sized (readable) version of it. Because of the bin size chosen for the histogram (200), the distributions of CRSS values are "hidden" for smaller programs in the corpus: a CRSS value cannot exceed the number of source files in a program, so a program with fewer than 200 source files falls entirely within the first bin. I have attempted to reveal the CRSS distributions for smaller programs by splitting the corpus into three histograms: one showing small programs only, one showing medium-sized programs only, and one showing large programs only. These are shown immediately below Figure 2. Again, you can click on them to view the full-sized images.



Figure 2: CRSS distributions for the latest version of each program in the Java corpus (click for full-sized image)


small | medium | large

Figure 3: CRSS distributions when the corpus is split into small, medium and large programs (click for full-sized images)

 

In the table below, the left column shows how the distributions of CRSS values change between consecutive releases of each program (for the programs I have multiple versions of). In the right column I have attempted to compare the actual CRSS distribution (and SCCs) with the CRSS/SCCs that arise under the "USES-IN-THE-INTERFACE" relation. The "USES-IN-THE-INTERFACE" relation is meant to give a lower bound on the sizes of the Strongly Connected Components, because it has been said that some cycles among classes are unavoidable. I don't want to go into the details of the "USES" and "USES-IN-THE-INTERFACE" relations here; they are described in my paper "An Empirical Study of Cycles among Classes in Java".

I have also provided links to the data files used to generate the graphs. If you want to see which files are involved in SCCs, or which have particular CRSS values, download the zip file (follow the link "download data files" next to the version you're interested in) and open the file _all.txt in Excel. You will see a table with these column names: "class", "crss", "fanout", "fanin", "scc". The "class" and "crss" columns are self-explanatory. As for the other columns: "fanout" is the number of other source files a given class directly depends on, "fanin" is the number of source files that reference this class, and "scc" is the size of the SCC the class participates in. You can even download the tool I used to collect all the data from Java bytecode here.
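If you would rather process a data file programmatically than open it in Excel, something along the following lines may work. It is a sketch under assumptions that this page does not confirm: that _all.txt is tab-delimited, and that its first line is a header row containing the column names exactly as listed above. The class `AllTxtReader` and its method name are hypothetical. The example lists the classes whose "scc" value is at least a given threshold.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.*;

class AllTxtReader {

    /**
     * Returns the "class" value of every row whose "scc" value is at least minSccSize.
     * Assumes tab-delimited lines with a header row (an assumption, see above).
     */
    static List<String> classesInSccsOfAtLeast(List<String> lines, int minSccSize) {
        List<String> header = Arrays.asList(lines.get(0).trim().split("\t"));
        int classCol = header.indexOf("class");
        int sccCol = header.indexOf("scc");
        if (classCol < 0 || sccCol < 0) {
            throw new IllegalArgumentException("expected 'class' and 'scc' columns in the header");
        }
        List<String> result = new ArrayList<>();
        for (String line : lines.subList(1, lines.size())) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = line.split("\t");
            if (Integer.parseInt(fields[sccCol].trim()) >= minSccSize) {
                result.add(fields[classCol].trim());
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.out.println("usage: java AllTxtReader path/to/_all.txt");
            return;
        }
        // List every class that sits in an SCC of two or more classes.
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        for (String name : classesInSccsOfAtLeast(lines, 2)) {
            System.out.println(name);
        }
    }
}
```

If the header lookup fails on a real file, the delimiter or header assumption is probably wrong, and the split pattern is the first thing to adjust.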


CRSS/SCC histograms for consecutive releases of each application in the corpus | Comparison between CRSS/SCC histograms for USES and USES-IN-THE-INTERFACE relations
aglets
aglets-2.0.2   download data files



ant
ant-1.1   download data files



ant-1.2   download data files



ant-1.3   download data files



ant-1.4   download data files



ant-1.4.1   download data files



ant-1.5   download data files



ant-1.5.1   download data files



ant-1.5.2   download data files



ant-1.5.4   download data files



ant-1.6.0   download data files



ant-1.6.2   download data files



ant-1.6.5   download data files



antlr
antlr-2.7.5   download data files



aoi
aoi-2.2   download data files



argouml
argouml-0.16.1   download data files



argouml-0.18.1   download data files



argouml-0.20   download data files



axion
axion-1.0-M2   download data files



azureus
azureus-2.0.1.0   download data files



azureus-2.0.3.0   download data files



azureus-2.0.3.2   download data files



azureus-2.0.4.0   download data files



azureus-2.0.4.2   download data files



azureus-2.0.7.0   download data files



azureus-2.0.8.0   download data files



azureus-2.0.8.2   download data files



azureus-2.0.8.4   download data files



azureus-2.1.0.0   download data files



azureus-2.1.0.2   download data files



azureus-2.1.0.4   download data files



azureus-2.2.0.0   download data files



azureus-2.2.0.2   download data files



azureus-2.3.0.0   download data files



azureus-2.3.0.2   download data files



azureus-2.3.0.4   download data files



bluej
bluej-2.1.0   download data files



colt
colt-1.0.1   download data files



colt-1.0.2   download data files



colt-1.0.3   download data files



colt-1.1.0   download data files



colt-1.2.0   download data files



columba
columba-1.0   download data files



compiere
compiere-251e   download data files



derby
derby-10.1.1.0   download data files



drawswf
drawswf-1.2.9   download data files



drjava
drjava-20050814   download data files



eclipse
eclipse-SDK-1.0-win32   download data files



eclipse-SDK-2.0-win32   download data files



eclipse-SDK-2.0.1-win32   download data files



eclipse-SDK-2.0.2-win32   download data files



eclipse-SDK-2.1-win32   download data files



eclipse-SDK-2.1.1-win32   download data files



eclipse-SDK-2.1.2-win32   download data files



eclipse-SDK-2.1.3-win32   download data files



eclipse-SDK-3.0-win32   download data files



eclipse-SDK-3.0.1-win32   download data files



eclipse-SDK-3.0.2-win32   download data files



eclipse-SDK-3.1-win32   download data files



eclipse-SDK-3.1.2-win32   download data files



fitjava
fitjava-1.1   download data files



fitlibraryforfitnesse
fitlibraryforfitnesse-20050923   download data files



galleon
galleon-1.8.0   download data files



ganttproject
ganttproject-1.11.1   download data files



geronimo
geronimo-1.0-M5   download data files



glassfish
glassfish-9.0-b15   download data files



gt2
gt2-2.2-rc3   download data files



hibernate
hibernate-3.1-rc2   download data files



hsqldb
hsqldb-1.8.0.2   download data files



ireport
ireport-0.5.2   download data files



jag
jag-5.0.1   download data files



jaga
jaga-1.0.b   download data files



james
james-2.2.0   download data files



jasperreports
jasperreports-1.1.0   download data files



javacc
javacc-3.2   download data files



jboss
jboss-4.0.3-SP1   download data files



jchempaint
jchempaint-2.0.12   download data files



jedit
jedit-4.2   download data files



jeppers
jeppers-20050607   download data files



jetty
jetty-5.1.8   download data files



jext
jext-5.0   download data files



jfreechart
jfreechart-1.0.0-rc1   download data files



jgraph
jgraph-5.7.4   download data files



jgraph-5.7.4.3   download data files



jhotdraw
jhotdraw-5.2.0   download data files



jhotdraw-5.3.0   download data files



jhotdraw-5.4.1   download data files



jhotdraw-5.4.2   download data files



jhotdraw-6.0.1   download data files



jmeter
jmeter-1.8.1   download data files



jmeter-1.9.1   download data files



jmeter-2.0.0   download data files



jmeter-2.0.1   download data files



jmeter-2.0.2   download data files



jmeter-2.0.3   download data files



jmeter-2.1   download data files



jmeter-2.1.1   download data files



joggplayer
joggplayer-1.1.4s   download data files



jparse
jparse-0.96   download data files



jre
jre-1.4.2.04   download data files



jrefactory
jrefactory-2.9.19   download data files



jtopen
jtopen-4.9   download data files



jung
jung-1.0.0   download data files



jung-1.1.0   download data files



jung-1.2.0   download data files



jung-1.3.0   download data files



jung-1.4.0   download data files



jung-1.4.2   download data files



jung-1.4.3   download data files



jung-1.5.0   download data files



jung-1.5.1   download data files



jung-1.5.2   download data files



jung-1.5.3   download data files



jung-1.5.4   download data files



jung-1.6.0   download data files



jung-1.7.0   download data files



jung-1.7.1   download data files



junit
junit-2.0   download data files



junit-2.1   download data files



junit-3.0   download data files



junit-3.4   download data files



junit-3.5   download data files



junit-3.6   download data files



junit-3.7   download data files



junit-3.8   download data files



junit-3.8.1   download data files



junit-3.8.2   download data files



junit-4.0   download data files



junit-4.1   download data files



lucene
lucene-1.2-final   download data files



lucene-1.3-final   download data files



lucene-1.4.3   download data files



megamek
megamek-2005.10.11   download data files



netbeans
netbeans-3.5.1   download data files



netbeans-3.6   download data files



netbeans-4.0   download data files



netbeans-4.1   download data files



netbeans-5.0   download data files



netbeans-5.5-beta   download data files



openoffice
openoffice-2.0.0   download data files



pmd
pmd-3.3   download data files



poi
poi-2.5.1   download data files



rssowl
rssowl-1.2   download data files



sablecc
sablecc-3.1   download data files



sandmark
sandmark-3.4   download data files



scala
scala-1.4.0.3   download data files



sequoiaerp
sequoiaerp-0.8.2-RC1-all-platforms   download data files



soot
soot-1.0.0   download data files



soot-1.2.0   download data files



soot-1.2.1   download data files



soot-1.2.2   download data files



soot-1.2.3   download data files



soot-1.2.4   download data files



soot-1.2.5   download data files



soot-2.0   download data files



soot-2.0.1   download data files



soot-2.1.0   download data files



soot-2.2.3   download data files



springframework
springframework-1.2.7   download data files



tomcat
tomcat-5.0.28   download data files



tomcat-5.5.17   download data files



 

 

 


