PRlab Header
  • EnsembleGASVR: A novel ensemble method for classifying missense Single Nucleotide Polymorphisms

    Trisevgeni Rapakoulia, Konstantinos Theofilatos, Dimitrios Kleftogiannis, Spiros Likothanasis, Athanasios Tsakalidis and Seferina Mavroudi


    Single Nucleotide Polymorphisms (SNPs) are considered the most frequently occurring DNA sequence variations. Associating them with diseases experimentally is extremely costly and time consuming. For this reason, several computational methods have been proposed for classifying missense SNPs as neutral or disease associated. However, existing computational approaches fail to select relevant features, choosing them arbitrarily and without sufficient documentation. Moreover, they are limited by missing values and by imbalance between the learning datasets, and most of them do not support their predictions with confidence scores. To overcome these limitations, a novel ensemble computational methodology is proposed. EnsembleGASVR employs a two-step algorithm, which in its first step applies a novel evolutionary embedded algorithm to locate close-to-optimal Support Vector Regression models. In its second step, these models are combined to extract a universal predictor, which is less prone to overfitting, systematizes the rebalancing of the learning sets, and uses an internal approach for solving the missing values problem without loss of information. Confidence scores support all the predictions, and the model becomes tunable by modifying the classification thresholds. An extensive study was performed to collect the most relevant features for the problem of classifying SNPs, and a superset of 88 features was constructed, including one newly introduced characteristic: protein essentiality. Experimental results show that the proposed framework outperforms four well-known algorithms in terms of classification performance on the datasets used in the present study. Finally, the proposed algorithmic framework was able to uncover the significant role of certain features, such as solvent accessibility, and the top-scored predictions were further validated by linking them with disease phenotypes.
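
The second step described in the abstract (combining several regression models into one predictor with a tunable threshold and a confidence score) can be illustrated with a minimal sketch. This is not the authors' implementation: the plain averaging rule, the assumption that scores lie in [0, 1], the confidence formula, and the toy models standing in for trained Support Vector Regression models are all assumptions made for illustration.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple

# Hypothetical stand-in for a trained SVR model: maps a feature vector to a
# real-valued score (assumed: higher means more likely disease associated).
Model = Callable[[Sequence[float]], float]


def ensemble_score(models: Sequence[Model], features: Sequence[float]) -> float:
    """Average the outputs of all ensemble members (assumed combination rule)."""
    return mean(m(features) for m in models)


def classify(score: float, threshold: float = 0.5) -> Tuple[str, float]:
    """Turn a score into a label and a confidence in [0, 1].

    Assumes scores lie in [0, 1]. Confidence is the distance from the
    threshold, scaled by the largest possible distance on that side.
    """
    clipped = min(max(score, 0.0), 1.0)
    if clipped >= threshold:
        label, span = "disease", 1.0 - threshold
    else:
        label, span = "neutral", threshold
    confidence = abs(clipped - threshold) / span if span > 0 else 1.0
    return label, confidence


# Toy models over a two-feature vector; purely illustrative.
toy_models = [
    lambda x: 0.9 * x[0],
    lambda x: 0.5 * x[0] + 0.5 * x[1],
    lambda x: x[1],
]

score = ensemble_score(toy_models, [1.0, 0.6])
label, confidence = classify(score, threshold=0.5)
```

Raising `threshold` makes the predictor more conservative about calling a variant disease associated, which is one way the tunability mentioned in the abstract could be exposed.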