Automated procedure for candidate compound selection in GC-MS metabolomics based on prediction of Kovats retention index

V.V. Mihaleva 1, H.A. Verhoeven 1,2, R.C.H. de Vos 1,2, R.D. Hall 1,2, and R.C.H.J. van Ham 1,3*

1 Applied Bioinformatics, PRI, Droevendaalsesteeg 1, Wageningen, The Netherlands, 2 Centre for BioSystems Genomics (CBSG), Droevendaalsesteeg 1, Wageningen, The Netherlands, and 3 Laboratory of Bioinformatics, Wageningen University, Dreijenlaan 3, Wageningen, The Netherlands

ABSTRACT
Motivation. Matching both the retention index (RI) and the mass spectrum of an unknown compound against a mass spectral reference library provides strong evidence for a correct identification of that compound. Data on retention indices are, however, available for only a small fraction of the compounds in such libraries. We propose a quantitative structure-retention index model that enables putative identifications to be ranked, and those whose predicted RI falls outside a predefined window to be filtered out.
Results. We constructed multiple linear regression and support vector regression (SVR) models using a set of descriptors selected with a genetic algorithm. The SVR model is a significant improvement over previous models built for structurally diverse compounds, as it covers a large range of RI values (360 to 4100) and gives better predictions for isomeric compounds. The hit list reduction varied from 41% to 60% and depended on the size of the original hit list: large hit lists were reduced to a greater extent than small hit lists.
Contact. roeland.vanham@wur.nl
Software availability. http://appliedbioinformatics.wur.nl/GC-MS
Supplementary information. Supplementary data are available at Bioinformatics online.
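The filtering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Hit` record, the function names, the window width of 50 RI units, and all candidate values are hypothetical and chosen only for illustration. The sketch assumes that each library hit already carries an RI predicted by some structure-retention index model and that the window is centred on the observed RI of the unknown compound.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One putative identification from a spectral library search (hypothetical record)."""
    name: str
    match_score: float   # spectral match score; the scale is an assumption
    predicted_ri: float  # RI predicted by a structure-retention index model


def filter_hits(hits, observed_ri, window):
    """Keep hits whose predicted RI lies within +/- window of the observed RI.

    The surviving hits are ranked by absolute RI deviation, smallest first.
    """
    kept = [h for h in hits if abs(h.predicted_ri - observed_ri) <= window]
    return sorted(kept, key=lambda h: abs(h.predicted_ri - observed_ri))


def reduction_percent(n_before, n_after):
    """Hit list reduction as a percentage of the original list size."""
    if n_before == 0:
        return 0.0
    return 100.0 * (n_before - n_after) / n_before


# Illustrative values only; none of these numbers come from the paper.
hits = [
    Hit("candidate_A", 0.91, 1190.0),
    Hit("candidate_B", 0.89, 1320.0),
    Hit("candidate_C", 0.87, 1235.0),
    Hit("candidate_D", 0.85, 980.0),
]
kept = filter_hits(hits, observed_ri=1200.0, window=50.0)
print([h.name for h in kept])
print(reduction_percent(len(hits), len(kept)))
```

In this toy case two of the four candidates fall outside the window and are discarded. In practice the window width would presumably be derived from the prediction error of the RI model rather than fixed by hand.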
