computer science, machine learning, decision tree algorithm, tree pruning, tree complexity, data mining, experiments report, knowledge acquisition, expert systems
0 Abstract
This paper presents experiments with 19 datasets and 5 decision tree pruning algorithms that show that increasing training set size often results in a linear increase in tree size, even when that additional complexity results in no significant increase in classification accuracy.
This implies that decreases in tree size should be decomposed into two parts:
. that which is due to reduction of training set size
. and the remainder, which is due to how the method selects instances to discard.
Experiments with one such technique (Robust C4.5, examined in the case study below) show that a large percentage of its effect on tree size is attributable to the fact that it simply reduces the size of the training set.
1 Introduction
Techniques to improve the performance of decision tree algorithms often involve data reduction, the removal of training instances prior to tree construction.
In this paper we argue that, under a broad range of circumstances, all data reduction techniques will result in some decrease in tree size with little impact on accuracy.
For an intuitive illustration, look at Figure 1
. plots of tree size and accuracy as a function of training set size
. for the UC Irvine australian dataset
. C4.5 was used to generate the trees (Quinlan 1993)
. each plot corresponds to one of 5 pruning mechanisms, including:
---- error-based pruning (EBP, the C4.5 default)
We can see from Figure 1:
. accuracy peaks with small numbers of training instances, thereafter remaining almost constant.
. Surprisingly, tree size continues to grow nearly linearly in three of the graphs
. Growth continues despite two important facts:
(1) accuracy has ceased to increase; and
(2) C4.5 is pruning the trees to avoid overfitting.
. almost any scheme for removing training instances prior to tree construction will, on this dataset, yield smaller trees with accuracies roughly equivalent to those obtainable from the full training set
. Also, the size of the resulting tree will depend strongly on the fraction of instances that are discarded
Previous research results (Catlett 1991)
. trees built from extremely large datasets were both significantly larger and more accurate than trees built on datasets half that size.
. The increase in accuracy was attributed, in part, to better attribute selection made possible by additional training instances.
In this paper we demonstrate:
. the linear relationship between training set size and tree size holds across datasets of many different sizes and for several pruning techniques
. In contrast to Catlett, small trees built from subsets of the available data were often just as accurate as larger trees built from all of the available data
2 Empirical Results
. The experiments generate plots like Figure 1
. find the training set size at which accuracy ceases to increase
. run a linear regression on the points in the tree-size curve to the right of that training set size (see the sketch after this list)
. In general, additional tree structure is welcome as long as it improves classification accuracy, and unwelcome otherwise.
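A minimal sketch of this regression step, assuming the incremental-cross-validation output is available as parallel lists of training-set sizes and mean tree sizes; the variable names and the plateau index are illustrative, not taken from the paper:

    # Sketch: regress tree size on training set size beyond the accuracy plateau.
    # `train_sizes` and `tree_sizes` are assumed to hold one point per data-reduction level.
    from scipy.stats import linregress

    def tree_size_regression(train_sizes, tree_sizes, plateau_index):
        """Fit tree size as a linear function of training set size, using only
        the points to the right of the point where accuracy stopped growing."""
        fit = linregress(train_sizes[plateau_index:], tree_sizes[plateau_index:])
        # fit.pvalue plays the role of the p column in Table 1,
        # and fit.rvalue ** 2 the role of the R2 column.
        return fit.pvalue, fit.rvalue ** 2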
Experimental Method
. 5 pruning methods
--- Error-based (EBP, the C4.5 default)
--- Reduced error (REP)
--- Minimum description length (MDL)
--- Cost-complexity with the 1SE rule (CCP1SE)
--- Cost-complexity with the 0SE rule (CCP0SE)
. 19 datasets taken from the UC Irvine repository.
. K-fold cross-validation is used to obtain estimates of the true performance of decision tree algorithms.
. Incremental cross-validation
(1) Given dataset D, with n instances
(2) Divide D into k disjoint sets Di
(3) Build trees on subsets D-Di
(4) Test them on Di
. In this experiment,
.. 20 subsets of each training set D-Di were created, in increments of 5%
.. standard k-fold CV corresponds to the case in which 100% of the instances in D-Di were retained
.. the order of the instances in D was permuted prior to creating the k=10 folds
.. the instances to be retained were gathered sequentially, starting with the first instance in D-Di, for each level of data reduction
. The training set size at which accuracy ceased to grow was found by
.. scanning the accuracy curve from left to right
.. stopping when the mean of three adjacent accuracy estimates was no more than 1% less than the accuracy of the tree based on all available training data (a sketch of this procedure follows)
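A minimal sketch of the incremental cross-validation loop and the plateau scan described above, using scikit-learn's DecisionTreeClassifier as a stand-in for C4.5 (the paper used C4.5 itself); the function names, the node-count measure of tree size, and the defaults are assumptions for illustration:

    # Sketch of incremental cross-validation with nested data-reduction levels.
    # DecisionTreeClassifier stands in for C4.5; k=10 folds and 5% increments follow the text.
    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.tree import DecisionTreeClassifier

    def incremental_cv(X, y, k=10, levels=20, seed=0):
        order = np.random.RandomState(seed).permutation(len(y))  # permute D before folding
        X, y = X[order], y[order]
        acc = np.zeros(levels)
        size = np.zeros(levels)
        for train_idx, test_idx in KFold(n_splits=k).split(X):
            for j in range(levels):
                frac = (j + 1) / levels                                # 5%, 10%, ..., 100% of D - Di
                keep = train_idx[:max(1, int(frac * len(train_idx)))]  # retain instances sequentially
                tree = DecisionTreeClassifier().fit(X[keep], y[keep])
                acc[j] += tree.score(X[test_idx], y[test_idx])
                size[j] += tree.tree_.node_count
        return acc / k, size / k                                       # mean accuracy and tree size per level

    def plateau_level(acc, tol=0.01):
        """Scan left to right; stop when the mean of three adjacent accuracy
        estimates is no more than `tol` below the all-data accuracy."""
        for j in range(len(acc) - 2):
            if np.mean(acc[j:j + 3]) >= acc[-1] - tol:
                return j
        return len(acc) - 1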
Experimental Results --- Table 1
. %Kept: the percentage of available training instances at which accuracy ceased to grow
. p and R2: results of the linear regression of tree size on training set size
. Dsize: the percentage decrease in tree size between the tree built from all available training instances and the tree built from the number of instances at which accuracy ceased to grow
. Daccuracy: the absolute difference in accuracy between those same two trees (both columns are illustrated with a worked example after this list)
. the final row of the table gives the number of datasets for which accuracy peaked prior to seeing 100% of the available training instances, the number of datasets for which the relationship between tree size and training set size is significant, and the means of R2, Dsize and Daccuracy for those datasets with significant p values.
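For illustration only (the numbers here are invented, not taken from Table 1): if the tree built from all available training instances has 80 nodes and 85.0% accuracy, and the tree built at the %Kept point has 50 nodes and 84.8% accuracy, then Dsize = 100*(80-50)/80 = 37.5% and Daccuracy = 84.8% - 85.0% = -0.2% (assuming the convention that the full-data accuracy is subtracted from the reduced-data accuracy).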
The results for EBP
. Accuracy peaked prior to seeing 100% of the available training instances for 16 of the 19 datasets
. every one of the 16 datasets exhibited a significant relationship between tree size and training set size beyond the point at which accuracy stopped growing, and 12 of them were highly significant (at the 0.001 level)
. In spite of the fact that accuracy remains basically constant, tree size continues to grow with training set size
. The most remarkable feature of the table is the R2 column:
--- First, it shows that training set size has an extremely strong and predictable effect on tree size.
--- Second, this effect is robust over a large group of datasets with widely varying characteristics.
. The Dsize column shows the percent reduction in size from trees built on all available training instances to trees built on the number of instances in the %Kept column (the mean reduction in tree size for the 16 datasets with significant p values is 38.29%)
. The Daccuracy column shows the absolute difference in accuracy between those same trees (the mean difference in absolute accuracy is -0.14%)
. By reducing training set sizes through the removal of randomly selected instances, it is possible, on average, to obtain trees that are 38.29% smaller, with a sacrifice in accuracy of less than two tenths of one percent.
. accuracy was higher with reduced training sets in 8 cases, and it was lower in 8 cases.
3 A Case Study
-- Experiments on Robust C4.5 (RC4.5)
To determine how much of RC4.5's effect on tree size is due to reduction of training set size alone, we need:
. the size of the tree that C4.5 builds on the entire dataset (C4.5 Size)
. the size of the tree that C4.5 builds on the reduced datasets generated by RC4.5 (RC4.5 Size)
. the percentage of training instances retained by RC4.5 (%Kept)
. and the size of the tree that C4.5 builds when the same percentage of randomly selected training instances are retained (RDR Size)
The percentage of RC4.5's effect attributable to reduction of training set size alone can be computed as 100 * (C4.5 Size - RDR Size) / (C4.5 Size - RC4.5 Size).
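For illustration only (invented numbers, not results from the paper): if C4.5 builds a 100-node tree on the full dataset, the RC4.5-reduced dataset yields a 60-node tree, and random data reduction at the same %Kept yields a 75-node tree, then 100*(100-75)/(100-60) = 62.5% of RC4.5's size reduction would be attributable to shrinking the training set alone.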
Table 3
It is clear that tree sizes obtained through random data reduction should serve as a baseline against which other data reduction techniques measure their success, much as default accuracy or Holte's one-rules serve as a baseline for classification accuracy.
These results by themselves do not shed any additional light on the merits of RC4.5.
4 Bias in Reduced Error Pruning
The experiments in this section confirmed our hypotheses that
. larger trees and smaller pruning sets each lead to increased bias
. Bias peaked with the largest training set size and the smallest pruning set size
5 Discussion
Experiments with 5 pruning methods and 19 datasets demonstrated that:
1. Tree size is strongly dependent on training set size.
2. Small numbers of training instances suffice to build small, accurate trees; in addition to yielding a useful tree-simplification tool, this frees data previously used in tree construction for other purposes. (RDR produces smaller trees and makes more data available for pruning.)
3. Random data reduction can also serve as a method for evaluating new pruning techniques.
6 Future research
1. Additional investigation of why three of the pruning methods tested in this paper do not avoid overfitting as training set size increases.
2. Investigate the extent to which other model construction algorithms fall victim to a pathological relationship between model complexity and the amount of data used to build the model.