10.2. Stopping Criteria for the Parallel Frameworks

In the second step of the genetic algorithm framework and of the parallel growing neural network framework, variables are added to the model until the prediction errors evaluated by a subsampling procedure no longer improve. A difficult question is how the significance of this improvement should be judged (the stopping criterion for the addition of variables). Many different approaches can be found in the literature, which can be classified into several categories such as significance tests versus numerical comparisons, robust versus non-robust tests, paired versus non-paired tests, and local versus global error minima.
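The variable addition step can be pictured as a greedy forward-selection loop with a pluggable stopping criterion. The following is only a minimal sketch of that idea, not the frameworks' actual procedure; the names `cv_error` (mean subsampled test error for a variable subset) and `stop` (decision on the error history) are hypothetical placeholders.

```python
def add_variables(candidates, cv_error, stop):
    """Greedy forward selection: repeatedly add the candidate variable
    that lowers the subsampled mean prediction error most, until the
    stopping criterion `stop(error_history)` fires.
    `cv_error(selected)` is assumed to return the mean prediction error
    of the subsampled test data for the given variable subset."""
    selected = []
    history = [cv_error(selected)]      # error of the starting model
    remaining = list(candidates)
    while remaining:
        # best candidate to add in this loop of the variable addition
        best = min(remaining, key=lambda v: cv_error(selected + [v]))
        err = cv_error(selected + [best])
        if stop(history + [err]):       # e.g. first non-improvement
            break
        selected.append(best)
        history.append(err)
        remaining.remove(best)
    return selected
```

With `stop` defined as "halt at the first non-improvement", this loop implements the first-local-minimum criterion discussed below; exchanging `stop` for a significance test changes only the halting decision, not the loop.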

In this study, 6 different methods were used to determine the optimal number of variables. First, the simple numerical mean prediction errors of the subsampled test data were compared before and after the addition of a variable; the addition of variables was stopped when the first local minimum of the mean prediction error was found. A second approach calculates all mean prediction errors and uses the number of variables that corresponds to the global minimum of the prediction errors. Commonly used methods to judge the improvement of predictions are based on statistical significance tests. An overview of the different tests can be found in [232]. The significance tests were implemented in the frameworks to stop the addition of variables when the test determines that the improvement of the predictions after the addition is not significant (see also figure 44 and figure 53). The most popular statistical test for comparing the predictions of the subsampled test data before and after the addition of a variable is Student's t-test [254]. The t-test requires a normal distribution of the prediction errors (this can be checked by a Kolmogorov-Smirnov test [38]) and is thus sensitive to outliers. A robust option for comparing the predictions of the test data subsets is the Kruskal-Wallis ANOVA [262],[263], which corresponds to the Mann-Whitney U-test, as only two groups are compared. If the partitioning of the subsampling procedure is reproducible for each addition of a variable (that is, the same test subsets are predicted during each loop of the variable addition), paired significance tests can be used, such as the paired t-test for normally distributed prediction errors and the Wilcoxon signed rank test as its robust counterpart [103],[264],[265].
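The test-based stopping decision can be sketched with the standard `scipy.stats` implementations of these tests. This is an illustrative sketch, not the frameworks' actual code: the function name and the automatic switch between the normal-theory and the robust test based on a Kolmogorov-Smirnov normality check are assumptions for the example.

```python
import numpy as np
from scipy import stats

def improvement_is_significant(errors_before, errors_after,
                               paired=False, alpha=0.05):
    """Compare the subsampled test-set prediction errors before and
    after adding a variable.  A t-test is used when both error samples
    pass a Kolmogorov-Smirnov normality check; otherwise a robust
    rank-based test is used (Wilcoxon signed rank if paired,
    Mann-Whitney U otherwise)."""
    before = np.asarray(errors_before)
    after = np.asarray(errors_after)

    def looks_normal(x):
        # KS test against the standard normal after standardizing
        z = (x - x.mean()) / x.std(ddof=1)
        return stats.kstest(z, "norm").pvalue > alpha

    normal = looks_normal(before) and looks_normal(after)
    if paired:  # same subsampling partition in every addition loop
        p = (stats.ttest_rel(before, after) if normal
             else stats.wilcoxon(before, after)).pvalue
    else:
        p = (stats.ttest_ind(before, after) if normal
             else stats.mannwhitneyu(before, after)).pvalue
    # significant only if the errors actually decreased
    return (after.mean() < before.mean()) and (p < alpha)
```

In the frameworks, such a decision is evaluated after each addition, and the addition of variables stops as soon as the improvement is judged not significant.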
The different categories of significance tests have different requirements, which can be summarized as follows. In contrast to robust tests, t-tests require a normal distribution of the prediction errors and are thus sensitive to biases and outliers, whereas robust tests are less powerful in detecting differences in prediction ability. The paired tests require the same partitioning of the data into calibration and test subsets for each loop of the variable addition step. In contrast to finding the global minimum of the prediction errors, the implementation of the significance tests needs only as many loops as improvements of the prediction errors are observed.
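The two purely numerical criteria can be stated compactly. In the following sketch, the mean prediction errors of the subsampled test data are assumed to be collected in a list indexed by the number of added variables; the helper names are illustrative, not taken from the frameworks.

```python
def first_local_minimum(errors):
    """Number of additions at the first local minimum of the mean
    prediction errors: stop as soon as an addition does not improve."""
    for i in range(1, len(errors)):
        if errors[i] >= errors[i - 1]:
            return i - 1          # previous model was the first local minimum
    return len(errors) - 1        # errors decreased monotonically

def global_minimum(errors):
    """Number of additions at the global minimum; this requires
    evaluating the errors for all additions first."""
    return min(range(len(errors)), key=errors.__getitem__)

errs = [2.1, 1.6, 1.4, 1.5, 1.1, 1.2]
first_local_minimum(errs)   # 2: stops when 1.5 does not improve on 1.4
global_minimum(errs)        # 4: the overall smallest error, 1.1
```

The contrast in the text is visible here: the first-local-minimum criterion can stop early (after evaluating four models), whereas the global-minimum criterion must evaluate every addition before deciding.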

The number of variables selected depends not only on the method but also on the significance level of the statistical test, which was set to a 5 % error probability for all tests. In principle, the different methods can be divided into 4 groups according to the number of variables selected. The t-test, the Kruskal-Wallis ANOVA and the Wilcoxon signed rank test were the most conservative in terms of selecting variables; these three tests selected the same small number of variables for all data sets investigated in this work. The paired t-test selected somewhat more variables in most cases, followed by the criterion of the first local minimum of the prediction errors, whereas the method of the global minimum of the prediction errors generally corresponded to the largest number of variables. All these methods are based on the prediction errors of the subsampled test data and not on an external data set. The question is how the prediction errors of the subsampled data correspond with the prediction errors of external validation data. The answer can be found in the so-called biasing [11]: when the same data are used both for the model building process and for the variable selection process, the variable selection is biased towards selecting too many variables, and this bias increases with a decreasing number of samples. As the subsampled data are used several times for both processes, the optimal method depends on the sample size. For large data sets, the global minimum of the prediction errors of the subsampled test data corresponds with the smallest prediction errors of external validation data, whereas for smaller data sets more conservative methods correspond with the best validation errors. This effect was observed for all data sets under investigation. For the rather large refrigerant data set (441 samples for the calibration of only 2 analytes), the optimal method was the first local minimum criterion, whereas for the smaller quaternary mixtures (256 samples for the calibration of 4 analytes) and the ternary mixtures (245 samples for 3 analytes) the optimal method was the Kruskal-Wallis ANOVA.

Although the selection of the stopping criterion influences the prediction ability of the frameworks, an investigation using all data sets of this work showed that this selection is less critical than it might appear at first glance. Among all data sets, the largest difference in the prediction errors of external validation data was 0.4 % when different stopping criteria were used for the calibration data. The general recommendation of measuring as many samples as possible renders a sophisticated stopping criterion rather unnecessary, since for data sets that are not too small, the local or global minimum criteria are adequate.