Dr. Frank Dieterle, Ph. D. Thesis: 2. Theory Fundamentals of the Multivariate Data Analysis

2.4. Data Splitting and Validation

A typical multivariate calibration procedure needs several separate data sets. The calibration (or training) data set is used to set up the model, either by estimating the parameters of an equation or by training a neural network. Often a second data set is needed to determine when to stop the training, or how many and which model components and variables to include. This second data set is usually called the monitor data set. If several models are developed, a third data set, called the test set, is required to select the most appropriate model. Finally, a validation data set is essential to estimate the quality of the final model. It has been shown that each of these data sets must consist of different data, as otherwise the models and the estimations are biased [9]-[12]. For example, if the same data set is used for both calibration and validation, the estimate of the prediction ability is overly optimistic.

Additionally, each data set should be as large as possible: the larger the calibration data set, the better the model, and the larger the validation data set, the better the estimate of the predictivity. If many data are available, large, representative and independent samples for training, monitoring, testing and validating can be obtained by simply partitioning the pool of all samples. In analytical chemistry, however, typically only data sets of limited size are available, as measurements are expensive and labor-intensive. To solve the dilemma of partitioning a small pool of data into independent subsets that should each be as large and as representative as possible, subsampling procedures, also known as resampling procedures, have become the de facto standard in chemometrics. There are many subsampling techniques; the most important ones are described below.
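The simple case, partitioning a large pool of samples into four disjoint data sets, can be sketched as follows. This is only a minimal illustration and not the procedure used in this work: the function name `partition_indices`, the split fractions and the pool size of 200 samples are assumptions chosen for the example.

```python
import random


def partition_indices(n_samples, fractions=(0.5, 0.15, 0.15, 0.2), seed=0):
    """Randomly split sample indices into four disjoint subsets.

    The fractions (calibration, monitor, test, validation) are
    illustrative assumptions, not values recommended by the text.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")

    names = ("calibration", "monitor", "test", "validation")
    indices = list(range(n_samples))
    # Shuffle once so that every subset is a random draw from the pool.
    random.Random(seed).shuffle(indices)

    subsets = {}
    start = 0
    for i, (name, frac) in enumerate(zip(names, fractions)):
        # The last subset takes whatever remains, so no sample is lost
        # to rounding.
        if i == len(names) - 1:
            stop = n_samples
        else:
            stop = start + int(round(frac * n_samples))
        subsets[name] = sorted(indices[start:stop])
        start = stop
    return subsets


if __name__ == "__main__":
    # Hypothetical pool of 200 samples.
    for name, idx in partition_indices(200).items():
        print(name, len(idx))
```

Because every index lands in exactly one subset, no sample used for calibration can leak into the validation estimate. With a small pool, each subset produced this way becomes too small to be representative, which is the situation the subsampling procedures address.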

Page 15 © Dr. Frank Dieterle, 14.08.2006