Abstract #5

# 5
Cross validation and bootstrapping: Part II (exercises).
J. A. D. R. N. Appuhamy*¹, L. E. Moraes², ¹Department of Animal Science, Iowa State University, Ames, IA, ²Department of Animal Science, The Ohio State University, Columbus, OH.

Here we demonstrate several applications of cross validation and bootstrapping for evaluating the predictive ability of a linear regression model and quantifying the uncertainty of its parameter estimates, using R, a freely available and widely used statistical programming language. Packages such as “design,” “DAAG,” “caret,” and “boot” can perform cross validation of linear models in R. The “boot” package in particular provides extensive facilities for bootstrapping, and thus for estimating the standard error or confidence interval of a single statistic (e.g., a mean) or a vector of statistics (e.g., regression coefficients). A data set containing a given number of enteric methane emission (CH4) measurements, with the corresponding dry matter intake (DMI) and dietary fat content, is used. A simple linear regression model predicting CH4 from DMI is first developed and evaluated separately using the hold-out, K-fold, and leave-one-out cross validation methods. The outputs are discussed, and the methods are compared with respect to the variability of the mean square prediction error (MSPE) and computational cost. The K-fold cross validation is performed with the traditional K = 10 (90% of the data for training and 10% for testing) and compared with fewer (K = 5) and more (K = 20) folds. One issue with K-fold cross validation is that its MSPE estimate can vary considerably when the procedure is repeated on the same data, because each run partitions the data into folds at random. Replicated K-fold cross validation addresses this issue by performing the whole process several times and averaging over the replications. We therefore perform replicated K-fold cross validation and compare the MSPE with the previous values. We then use our simple prediction model to demonstrate an application of nonparametric bootstrapping to estimate the bias, standard error, and 95% confidence interval of the parameter estimates. Histograms and normal quantile-comparison plots of the bootstrap replications are obtained and discussed.
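The cross validation comparison described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the workshop scripts: the data are simulated stand-ins (the values, units, and the names `dat`, `mspe`, and `kfold` are hypothetical), and the folds are built by hand rather than with one of the packages named above.

```r
# Simulated stand-in data (hypothetical values, NOT the workshop data set)
set.seed(1)
n   <- 100
dat <- data.frame(DMI = runif(n, 10, 30))          # dry matter intake
dat$CH4 <- 50 + 15 * dat$DMI + rnorm(n, sd = 30)   # enteric methane emission

# MSPE of CH4 ~ DMI fitted on the `train` rows and evaluated on the `test` rows
mspe <- function(train, test) {
  fit  <- lm(CH4 ~ DMI, data = dat[train, ])
  pred <- predict(fit, newdata = dat[test, ])
  mean((dat$CH4[test] - pred)^2)
}

# Hold-out: a single random 90/10 split
test_idx <- sample(n, size = n / 10)
holdout  <- mspe(setdiff(seq_len(n), test_idx), test_idx)

# K-fold: every observation is used for testing exactly once
kfold <- function(K) {
  fold <- sample(rep(seq_len(K), length.out = n))  # random fold labels
  mean(sapply(seq_len(K), function(k) mspe(which(fold != k), which(fold == k))))
}
cv5  <- kfold(5)
cv10 <- kfold(10)
cv20 <- kfold(20)

# Leave-one-out is K-fold with K = n; the result does not depend on the split
loocv <- kfold(n)

# Replicated 10-fold: average the MSPE over repeated random partitions
rep_cv10 <- mean(replicate(20, kfold(10)))

# Run-to-run spread of a single (unreplicated) 10-fold estimate
sd_cv10 <- sd(replicate(20, kfold(10)))
```

Leave-one-out refits the model n times, whereas 10-fold refits it only 10 times per run, which is the computational cost trade-off mentioned above. Comparing `sd_cv10` with the spread of repeated `rep_cv10` values gives a rough sense of how much replication stabilizes the estimate.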
If time permits, the bootstrapping will be repeated with a multiple regression model including both DMI and dietary fat content. The data and all the R scripts will be available in advance for download.
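A nonparametric bootstrap of the multiple regression coefficients might look like the sketch below, using the “boot” package. The data are again simulated placeholders, the name `coef_fun` and the choice of 2,000 replications are illustrative assumptions, and resampling whole rows (case resampling) is only one of several nonparametric schemes; the workshop scripts may differ.

```r
library(boot)

# Simulated stand-in data (hypothetical values, NOT the workshop data set)
set.seed(2)
n   <- 100
dat <- data.frame(DMI = runif(n, 10, 30),   # dry matter intake
                  fat = runif(n, 2, 7))     # dietary fat content
dat$CH4 <- 50 + 15 * dat$DMI - 5 * dat$fat + rnorm(n, sd = 30)

# Statistic for boot(): refit the model on the resampled rows `idx`
# and return the vector of regression coefficients
coef_fun <- function(data, idx) {
  coef(lm(CH4 ~ DMI + fat, data = data[idx, ]))
}

# Case resampling with R = 2000 bootstrap replications (illustrative choice)
b <- boot(dat, statistic = coef_fun, R = 2000)

print(b)  # original estimates, bootstrap bias, and standard error

# 95% percentile confidence interval for the DMI coefficient (index = 2)
ci_dmi <- boot.ci(b, conf = 0.95, type = "perc", index = 2)

# Histogram and normal quantile-comparison plot of the DMI replicates
plot(b, index = 2)
```

Dropping `fat` from the model formula gives the simple regression case; in that model the intercept is index 1 and the DMI slope is index 2.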

Key Words: confidence interval, K-fold cross validation, standard error