2.4. Data Splitting and Validation (Dr. Frank Dieterle)

Frank Dieterle

Ph. D. Thesis

2. Theory – Fundamentals of the Multivariate Data Analysis

2.4. Data Splitting and Validation

A typical multivariate calibration procedure requires several separate data sets. The calibration (or training) data set is used to set up the model, either by estimating the parameters of an equation or by training a neural network. A second data set, usually called the monitor data set, is often needed to determine when to stop the training or to decide how many and which model components and variables to include. If several models are developed, a third data set, the test set, is required to select the most appropriate model. Finally, a validation data set is essential for estimating the quality of the final model. It has been shown that each of these data sets must consist of different data, as otherwise the models and the estimates are biased [9]-[12]. For example, if the same data set is used for both calibration and validation, the estimate of the prediction ability is overly optimistic.

In addition, each data set should be as large as possible: the larger the calibration data set, the better the model, and the larger the validation data set, the better the estimate of the predictivity. If many data are available, large, representative and independent samples for training, monitoring, testing and validation can be obtained by simply partitioning the pool of all samples. In analytical chemistry, however, data sets are typically limited in size, as measurements are expensive and labor-intensive. To resolve the dilemma of partitioning a small pool of data into independent subsets that should each be as large and as representative as possible, subsampling procedures, also known as resampling procedures, have become the de facto standard in chemometrics. Of the many subsampling techniques, the most important ones are described below.
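The simple partitioning of a large pool of samples into calibration, monitor, test and validation subsets can be illustrated with a minimal Python sketch. The function name `partition_samples`, the split fractions and the pool size of 200 samples are assumptions chosen purely for illustration; they are not taken from the thesis. The sketch only shows that the four subsets are drawn at random and do not share any sample.

```python
import random


def partition_samples(n_samples, fractions=(0.5, 0.15, 0.15, 0.2), seed=0):
    """Randomly split sample indices into four disjoint subsets.

    The fractions (calibration, monitor, test, validation) are
    illustrative assumptions, not recommended values.
    """
    names = ("calibration", "monitor", "test", "validation")
    if len(fractions) != len(names) or abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("need four fractions that sum to 1")

    # Shuffle the indices once so that each subset is a random sample.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    subsets = {}
    start = 0
    cumulative = 0.0
    for name, fraction in zip(names, fractions):
        cumulative += fraction
        # The last subset takes all remaining samples to avoid rounding gaps.
        if name == names[-1]:
            stop = n_samples
        else:
            stop = round(cumulative * n_samples)
        subsets[name] = indices[start:stop]
        start = stop
    return subsets


# Hypothetical pool of 200 measured samples.
subsets = partition_samples(200)
for name, idx in subsets.items():
    print(name, len(idx))
```

Because every index is assigned to exactly one subset, no sample used for calibration can leak into the validation estimate. With a small pool, however, each subset produced this way becomes too small to be representative, which is the motivation for the subsampling procedures.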
