Artificial intelligence paradigm for ligand-based virtual screening on the drug discovery of type 2 diabetes mellitus

Bustamam A., Hamzah H., Husna N.A., Syarofina S., Dwimantara N., Yanuar A., Sarwinda D.

Department of Mathematics, Faculty of Mathematics and Natural Sciences, Universitas Indonesia, Depok, Indonesia; Faculty of Pharmacy, Universitas Indonesia, Gedung A Rumpun Ilmu Kesehatan Lantai 1, Depok, Indonesia


Background: New dipeptidyl peptidase-4 (DPP-4) inhibitors with low adverse effects need to be developed for the treatment of type 2 diabetes mellitus. This study aims to build quantitative structure-activity relationship (QSAR) models using the artificial intelligence paradigm. Rotation Forest and Deep Neural Network (DNN) are used to build the QSAR models. We compared principal component analysis (PCA) with sparse PCA (SPCA) as the transformation method in Rotation Forest. K-modes clustering with Levenshtein distance was used to select molecules, and CatBoost was used for feature selection.
Results: Molecule selection with the K-modes clustering algorithm yielded 1020 DPP-4 inhibitor molecules with logP values ranging from -1.6693 to 4.99044. Extended connectivity fingerprints and functional class fingerprints with diameters of 4 and 6 were used to construct four fingerprint datasets: ECFP_4, ECFP_6, FCFP_4, and FCFP_6. The four fingerprint datasets have 1024 features, from which features are then selected using the CatBoost method. With CatBoost feature selection, the QSAR models performed well for both the machine learning and deep learning methods: Sensitivity, Specificity, Accuracy, and Matthews correlation coefficient were all above 70% at feature importance levels of 60%, 70%, 80%, and 90%.
Conclusion: The K-modes clustering algorithm can produce a representative subset of DPP-4 inhibitor molecules. Feature selection on the fingerprint datasets using CatBoost is best applied before building QSAR Classification and QSAR Regression models. QSAR Classification using Machine Learning and QSAR Classification using Deep Learning each achieved an accuracy above 70%. The QSAR RFC-PCA and QSAR RFR-PCA models performed better than the QSAR RFC-SPCA and QSAR RFR-SPCA models because they are more time-efficient. © 2021, The Author(s).
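The four evaluation metrics named in the abstract can all be derived from a binary confusion matrix. The following is a minimal sketch of the standard definitions, not the authors' code; the function name `classification_metrics` and the example counts are hypothetical and chosen only for illustration.

```python
from math import sqrt


def classification_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, accuracy and MCC for binary counts.

    tp, fp, tn, fn are the true/false positive/negative counts of a
    classifier (hypothetical inputs, not values from the paper).
    """
    # Sensitivity: fraction of actual positives predicted as positive.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of actual negatives predicted as negative.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Accuracy: fraction of all predictions that are correct.
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Matthews correlation coefficient; defined as 0 when any marginal is empty.
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Illustrative counts only: 50 active and 50 inactive molecules.
    for name, value in classification_metrics(tp=40, fp=5, tn=45, fn=10).items():
        print(f"{name}: {value:.3f}")
```

Note that MCC ranges from -1 to 1, so a threshold such as "above 70%" corresponds to a value above 0.70 on that scale.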

CatBoost; Deep neural network; Fingerprint; K-modes clustering; Principal component analysis; Quantitative structure-activity relationship; Rotation Forest; Sparse principal component analysis


Journal of Big Data

Publisher: Springer Science and Business Media Deutschland GmbH

Volume 8, Issue 1, Art No 74


doi: 10.1186/s40537-021-00465-3

ISSN: 2196-1115

Type: All Open Access, Gold, Green



Indexed by Scopus
