Abstract #M132

Section: Production, Management and the Environment (posters)
Session: Production, Management and the Environment 1
Format: Poster
Day/Time: Monday 7:30 AM–9:30 AM
Location: Exhibit Hall A

# M132
Comparing multiple regression with two machine learning methods in a case study predicting individual survival to second lactation in Holstein cattle.
E. M. M. van der Heide*¹, R. F. Veerkamp¹, M. L. Pelt², C. Kamphuis¹, I. Athanasiadis³, B. J. Ducro¹, ¹Wageningen University and Research, Animal Breeding and Genomics, Wageningen, the Netherlands, ²Cooperation CRV, Arnhem, the Netherlands, ³Wageningen University, Information Technology Group, Wageningen, the Netherlands.

In this study we compared linear multiple regression with two machine learning methods, naive Bayes and random forest, to assess the added value of machine learning for predicting the complex trait ‘survival to second lactation’. Our dataset contained 6847 heifers born between January 2012 and June 2013 that had a known outcome for survival to second lactation and were genotyped at birth. Each heifer had 50 genomically estimated breeding values and up to 65 phenotypic records that accumulated over time. Survival to second lactation was predicted at five distinct moments in life. Methods were tested using 20-fold validation with randomly selected training (70%) and testing (30%) sets, and then compared on various metrics, including the area under the curve (AUC) and a scenario showing the realised gain in survival if the 50% highest-scoring heifers were selected. At birth and at 18 months, all methods had overlapping performance, with no method significantly outperforming the others. Naive Bayes had the highest average AUC at all decision moments up to 200 days after first calving. At 200 days after calving, random forest had the highest AUC. Individual heifer predictions varied between methods: correlations of individual predictions between methods ranged from moderate to high (lowest r = 0.417, highest r = 0.700). The correlations were highest at birth and once all information was available, and were lower for predictions around first calving. In short, all three methods were able to predict survival at population level, as all methods improved survival in a practical scenario. However, the prediction for an individual animal could differ considerably depending on the method used.
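
The evaluation described in the abstract could be sketched roughly as follows. This is a minimal standard-library Python illustration, not the authors' code. The function names `auc`, `selection_gain` and `repeated_splits` are hypothetical. It assumes that the 20-fold validation means 20 repeated random 70/30 splits, and that the realised gain is the survival rate among the selected half minus the overall survival rate. Model fitting (regression, naive Bayes, random forest) is deliberately left out.

```python
import random


def auc(scores, outcomes):
    """Area under the ROC curve via pairwise comparison; ties count as half.

    outcomes are 1 (survived to second lactation) or 0 (did not survive).
    """
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs both survivors and non-survivors")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def selection_gain(scores, outcomes, fraction=0.5):
    """Survival rate among the top-scoring fraction minus the overall rate.

    Assumed definition of 'realised gain'; the abstract does not spell it out.
    """
    ranked = sorted(zip(scores, outcomes), key=lambda t: t[0], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    selected_rate = sum(y for _, y in ranked[:k]) / k
    overall_rate = sum(outcomes) / len(outcomes)
    return selected_rate - overall_rate


def repeated_splits(n_animals, n_repeats=20, train_fraction=0.7, seed=0):
    """Yield (train_idx, test_idx) index lists for repeated random splits."""
    rng = random.Random(seed)
    n_train = round(n_animals * train_fraction)
    for _ in range(n_repeats):
        idx = list(range(n_animals))
        rng.shuffle(idx)
        yield idx[:n_train], idx[n_train:]
```

In use, each method would be fitted on the training indices of every split and scored on the testing indices, after which `auc` and `selection_gain` would be averaged over the 20 repeats per method and per decision moment.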

Key Words: machine learning, survival, phenotypic prediction