Determining threshold value on information gain feature selection to increase speed and prediction accuracy of random forest

Prasetiyowati M.I., Maulidevi N.U., Surendro K.

School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Bandung, Indonesia


Abstract

Feature selection is a pre-processing technique used to remove unnecessary features and to speed up an algorithm's work process. One common approach calculates the information gain value of each feature in the dataset and then applies a threshold to decide which features to keep. However, this threshold is often chosen arbitrarily or simply fixed at 0.05. This study therefore proposes determining the threshold from the standard deviation of the information gain values generated by the features in the dataset. The proposed threshold was tested on 10 original datasets and on their FFT- and IFFT-transformed versions, with classification performed by Random Forest. Processing the transformed datasets with the proposed threshold yielded lower accuracy and longer execution times than the same process using Correlation-Based Feature Selection (CBF) or the standard 0.05 threshold; accuracy was likewise lower when the transformed features were used. Processing the original datasets with the standard-deviation threshold, however, produced better feature selection accuracy for Random Forest classification. Furthermore, using the transformed features with the proposed threshold, while excluding the imaginary numbers, led to a faster average execution time than the three compared methods. © 2021, The Author(s).

Keywords: Accuracy; Random forest; Standard deviation; Threshold; Time
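
The abstract outlines a simple pipeline: score each feature by information gain, keep the features whose score exceeds the standard deviation of all scores, and classify with Random Forest, optionally after an FFT transform whose imaginary parts are discarded. Below is a minimal sketch of that idea, not the authors' code: scikit-learn's mutual_info_classif stands in for the paper's information-gain computation, the Breast Cancer Wisconsin data is used only as a convenient example, and the per-sample FFT with only real parts kept is an assumption about the transformation step, which this record does not specify in detail.

```python
# Sketch of the standard-deviation information-gain threshold with Random Forest.
# Assumptions: mutual information approximates information gain; the FFT step
# and the example dataset are illustrative, not the authors' exact setup.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Transformation step mentioned in the abstract: FFT per sample, keeping only
# the real part (imaginary numbers excluded).
X_fft = np.fft.fft(X, axis=1).real

def select_by_std_threshold(X, y):
    """Keep features whose information-gain score exceeds the standard
    deviation of all scores (the proposed threshold)."""
    gains = mutual_info_classif(X, y, random_state=0)
    threshold = gains.std()          # proposed threshold = std of the gains
    mask = gains > threshold
    return X[:, mask], mask, threshold

for name, data in [("original", X), ("FFT-transformed", X_fft)]:
    X_sel, mask, thr = select_by_std_threshold(data, y)
    X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, random_state=0)
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    print(f"{name}: threshold={thr:.3f}, kept {mask.sum()}/{mask.size} features, "
          f"accuracy={clf.score(X_te, y_te):.3f}")
```

The numbers produced by this sketch will not match the paper's results, since the ten datasets, the exact information-gain implementation, and the full FFT/IFFT pipeline are not specified in this record; the sketch only illustrates how a standard-deviation threshold replaces a fixed 0.05 cutoff.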


Journal

Journal of Big Data

Publisher: Springer Science and Business Media Deutschland GmbH

Volume 8, Issue 1, Article No. 84


Journal Link: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85107205048&doi=10.1186%2fs40537-021-00472-4&partnerID=40&md5=0516fb8e67a1c9c3214f555c001b81a9

doi: 10.1186/s40537-021-00472-4

ISSN: 2196-1115

Type: Open Access (Gold, Green)



Indexed by Scopus
