7.1. Single Run Genetic Algorithm (Dr. Frank Dieterle)

Frank Dieterle

Ph. D. Thesis

7. Results – Genetic Algorithm Framework


7.1. Single Run Genetic Algorithm

For the variable selection, the combination of a genetic algorithm and neural networks described in section 2.8.5 was used. For the evaluation of the fitness function, the calibration data set of the refrigerant measurements (see section 4.5.1.1) was randomly split into a calibration data subset (75%) and a test data subset (25%). The neural networks were fully connected with 4 hidden neurons and 2 output neurons (1 network for both analytes together). The genetic algorithm evaluated 50 populations during 76 generations, with the stopping criterion set to a convergence of the standard deviation of the genes below 0.04. The parameter a of the fitness function was set to 0.9, which resulted in the selection of 8 time points (0, 12, 15, 51, 67, 93, 122 and 125 seconds) as the most dominant solution in the last generation. The corresponding neural network (8 hidden neurons, fully connected and 1 output neuron) predicted the test data subset with very low rel. RMSE of 1.87% for R22 and 2.50% for R134a. Yet the prediction of the external validation data by this network, which had been trained using the complete calibration data set, gave rel. RMSE of 2.32% for R22 and 2.93% for R134a, which are merely comparable with those of the non-optimized neural networks using all time points (see table 3 in section 7.4). A second run of the genetic algorithm using a different partitioning of the calibration data into calibration and test data subsets showed even worse results. After 86 generations, 8 time points (0, 3, 6, 51, 74, 90, 115 and 125 seconds) were selected with rel. RMSE of 1.84% (R22) and 2.62% (R134a) for the prediction of the test data subset. The prediction of the external validation data showed disappointingly high errors of 2.63% for R22 and 3.35% for R134a (see table 3). For both runs, the predictions of the external validation data are significantly worse than the predictions of the test data subset used for the evaluation of the fitness during the genetic optimization.
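
The two ingredients that make a single run depend on chance, the random 75%/25% partitioning and the convergence-based stopping criterion, can be illustrated with a minimal sketch. It is not the implementation used in this work. The function names, the sample count of 100 and the reading of the stopping criterion as "mean per-gene standard deviation across the population below 0.04" are assumptions made for illustration only.

```python
import random
import statistics


def split_calibration_test(n_samples, test_fraction=0.25, seed=None):
    """Randomly partition sample indices into calibration and test subsets."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = round(n_samples * test_fraction)
    # The first n_test shuffled indices form the test subset, the rest the calibration subset.
    return sorted(indices[n_test:]), sorted(indices[:n_test])


def has_converged(population, threshold=0.04):
    """Assumed criterion: the standard deviation of each gene across all
    chromosomes, averaged over the genes, falls below the threshold."""
    per_gene_std = [statistics.pstdev(gene) for gene in zip(*population)]
    return statistics.mean(per_gene_std) < threshold


# Two runs with different seeds see different test subsets (hypothetical 100 samples).
cal_a, test_a = split_calibration_test(100, seed=1)
cal_b, test_b = split_calibration_test(100, seed=2)

# A population of identical binary chromosomes counts as converged under this criterion.
identical = [[1, 0, 1, 1], [1, 0, 1, 1], [1, 0, 1, 1]]
mixed = [[1, 0, 1, 1], [0, 1, 0, 0], [1, 1, 0, 1]]
```

Because the fitness of every chromosome is evaluated on the one test subset drawn at the start, a second run with another seed optimizes against different samples, which is the situation of the two runs described above.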
Additionally, the selection of the time points is not reproducible. This instability of the variable selection can also be seen in figure 46, which shows how frequently each time point was selected during 100 runs of the GA. Although some time points are selected more often than others, there is no time point that was never selected. Both findings, the instability of the variable selection and the deterioration of the prediction ability for external validation data, can be ascribed to a general problem of single run genetic algorithms. The variables are selected on the basis of a fitness function with a static test and calibration data set. Consequently, the optimal solution is only valid for one individual partitioning of the data into calibration and test data subsets and is not representative of the complete data set. Although the fitness function tries to compensate for the overestimation of the test data by partly considering the calibration data (in contrast to most GA found in literature), the drawbacks of a static partitioning cannot be completely compensated. Apart from these problems known in literature (approximately 99% of all GA are based on static data sets), single run algorithms face additional problems:

1. Both the chromosomes of the initial population and the weights of the neural networks are randomly generated. The walk of the genetic algorithm through the search space also contains random steps, so there is no guarantee that it always finds the best subset of variables before converging. Different runs (even with identical test and calibration data subsets) therefore often find similar but not exactly identical subsets of variables [254].

2. Jouan-Rimbaud et al. [255] recently demonstrated that, owing to chance correlations of variables, GA often select irrelevant variables or at least allow them a significant influence on the final model, even if validation procedures are used.
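
The chance-correlation effect mentioned in the second point can be made tangible with a small numerical experiment. The sketch below is not taken from [255]. It only assumes that both the response and all candidate variables are independent Gaussian noise, and it reports the largest absolute Pearson correlation found among the noise variables. The function names and the sample and variable counts are hypothetical.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def max_chance_correlation(n_samples, n_variables, seed=0):
    """Largest |r| between a random response and purely random variables."""
    rng = random.Random(seed)
    response = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_variables):
        noise = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(noise, response)))
    return best


# Hypothetical setting: 20 samples, either 1 or 200 irrelevant variables.
single = max_chance_correlation(n_samples=20, n_variables=1)
many = max_chance_correlation(n_samples=20, n_variables=200)
```

With the same seed, the first noise variable is identical in both calls, so `many` can never be smaller than `single`. The more irrelevant variables a search may choose from, the better the best of them appears to fit, which is why a selection based on one fixed test data subset is prone to picking such variables.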
