The contents details metadata is found in the file contents.txt in the metadata directory of a system version entry.
The purpose of this metadata is to completely characterise what code is available for analysis from what is distributed for a version of a system. The issue is that different systems distribute different kinds of information, making it difficult to know what code might be analysed. The metadata here supports both identifying what is available, and making decisions about what to analyse.
The metadata lists, for every .java file in src and every .class file found in an archive in bin, the actual location of the file, plus information about how the Java type that each file corresponds to is classified in the corpus (see details of the corpus structure). It also provides other information that might be of use.
The example below shows some entries for springframework, version 1.2.7 (out of 2251). Each row shows: the name of the type; where it appears in bin (if it does); where it appears in src (if it does); whether the type is considered to be in the system (as determined by sourcepackages, which in this case is org.springframework); whether it is in both bin and src (0), bin only (1), or src only (2), which summarises the bin and src fields; whether it is considered distributed; whether it is a public (0) or non-public (1) top-level type; how many physical lines of code are in the file; and how many non-commented, non-blank lines of code are in the file.
In the example, the first two entries show types that are in the system, are only provided in src, are not distributed, and are top-level public types. The source file containing AccountForm is 45 physical lines, of which 17 are either blank or contain only comments (leaving 28).
Entries 3 and 4 are for two types that are not considered to be in the system (since their fully-qualified names do not begin with org.springframework). No source is provided for them either (and hence there are no LOC or NCLOC values).
Entries 5 and 6 are for two nested types, so their src entry shows the file for their container types. They have no LOC/NCLOC values, since a per-file line count would not be meaningful for a nested type.
Entries 7 and 8 are for two types that we are likely to be interested in.
Entry 9 is interesting, because it is in the system, binary has been provided, and yet there is no source. By the looks of it, it is a class that was generated.
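The row layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names are hypothetical, the column order is assumed to follow the description above, and the encoding of the "in the system" and "distributed" fields is not given here, so those two values are kept as raw text. The sample line is invented; only its 45 and 28 echo the AccountForm figures quoted earlier.

```python
# Hypothetical field names; the order is assumed to follow the row description.
FIELDS = ("type_name", "bin_location", "src_location", "in_system",
          "location_code", "distributed", "visibility_code", "loc", "ncloc")

# Codes documented in the text.
LOCATION = {"0": "bin and src", "1": "bin only", "2": "src only"}
VISIBILITY = {"0": "public", "1": "non-public"}


def to_int(text):
    """Return an int, or None when the field does not hold a number."""
    text = text.strip()
    return int(text) if text.isdigit() else None


def parse_row(line):
    """Split one tab-separated line into a dict keyed by FIELDS."""
    values = line.rstrip("\n").split("\t")
    values += [""] * (len(FIELDS) - len(values))  # pad rows that are short
    row = dict(zip(FIELDS, values))
    row["location"] = LOCATION.get(row["location_code"], "unknown")
    row["visibility"] = VISIBILITY.get(row["visibility_code"], "unknown")
    row["loc"] = to_int(row["loc"])
    row["ncloc"] = to_int(row["ncloc"])
    return row


def non_code_lines(row):
    """Blank or comment-only lines: LOC minus NCLOC, when both are present."""
    if row["loc"] is None or row["ncloc"] is None:
        return None
    return row["loc"] - row["ncloc"]


# Invented sample line; "?" stands in for the two fields whose encoding
# is not described in the text.
SAMPLE = "org.example.Foo\t\tsrc/org/example/Foo.java\t?\t2\t?\t0\t45\t28"

if __name__ == "__main__":
    row = parse_row(SAMPLE)
    print(row["type_name"], row["location"], row["visibility"],
          non_code_lines(row))
```

Keeping the unknown fields as raw strings avoids guessing at their encoding; they can be mapped to booleans once the real values in contents.txt have been inspected.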
The concept of distribution identifies source directories that are expected to contain source for types that appear in bin. This provides a cross-check that we have correctly identified the system types (helping check sourcepackages), and also means we can determine when non-distributed code is mixed with distributed code (as sometimes happens). Those wanting to analyse source code may find this useful.
The information shown in the example is provided in a tab-separated file (the first column is just for the purposes of the example, and is not in the file). There are Perl scripts for reading the file and doing some basic analysis of it, and these can be extended for other kinds of analysis.
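As a rough illustration of the kind of basic analysis such a file supports, the following Python sketch tallies entries by where they appear (both, bin only, src only) and sums the NCLOC values. It is not one of the Perl scripts mentioned above. The column positions are assumptions taken from the order of the row description, and the rows used in the demonstration are invented.

```python
import csv
import io
from collections import Counter

# Assumed 0-based column positions, following the order of the description.
LOCATION_COLUMN = 4
NCLOC_COLUMN = 8

LABELS = {"0": "bin and src", "1": "bin only", "2": "src only"}


def summarise(lines):
    """Count entries per location code and total the NCLOC values."""
    counts = Counter()
    total_ncloc = 0
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) <= LOCATION_COLUMN:
            continue  # skip blank or malformed lines
        counts[LABELS.get(fields[LOCATION_COLUMN], "unknown")] += 1
        if len(fields) > NCLOC_COLUMN:
            ncloc = fields[NCLOC_COLUMN].strip()
            if ncloc.isdigit():  # entries without source have no NCLOC
                total_ncloc += int(ncloc)
    return counts, total_ncloc


# Invented rows, used only to exercise the function.
DEMO = io.StringIO(
    "a.A\tx.jar\tsrc/a/A.java\t?\t0\t?\t0\t10\t6\n"
    "a.B\tx.jar\t\t?\t1\t?\t0\t\t\n"
    "a.C\t\tsrc/a/C.java\t?\t2\t?\t0\t45\t28\n"
)

if __name__ == "__main__":
    counts, total = summarise(DEMO)
    for label, count in sorted(counts.items()):
        print(label, count)
    print("total NCLOC", total)
```

To run it against a real file, pass an open file handle instead of the demonstration rows, for example `with open("contents.txt", newline="") as handle: summarise(handle)`.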
Updated: 22-May-2012, Managed by Ewan Tempero