A COMPARISON OF MISSING DATA IMPUTATION METHODS ON SUPERVISED MACHINE LEARNING

Authors

Keywords:

Machine Learning, Supervised Learning, Data Imputation, Missing Data, Methods Comparison

Abstract

Advances in technology have produced an ever-increasing amount of data. However, this data is not always ready for immediate use, often because it is incomplete. Missing values in a feature can be imputed with simple statistics such as the mean, median, or mode. Another approach is to delete the records that contain missing values, though this may discard valuable information. Machine learning algorithms such as K-Nearest Neighbors (KNN) can also be used for imputation, predicting each missing value from a defined neighborhood of similar records. Techniques such as Miceforest and data interpolation are also applicable. This study provides an overview of data preprocessing before machine learning. Six machine learning algorithms (Support Vector Classification, Random Forest, Naive Bayes, Gradient Boosting, Decision Tree, and K-Nearest Neighbors) were used to evaluate predictive performance on data processed with seven methods: dropping samples, dropping features, mean imputer, median imputer, mode imputer, KNN imputer, and Miceforest imputer. Random Forest with the median imputer achieved the highest training and testing accuracy, at 100% and 82.1%, respectively. Naive Bayes scored lowest across all experiments; with dropping samples, for example, its training accuracy was only 22.88% and its testing accuracy 20.1%.
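To make the strategies named in the abstract concrete, here is a minimal sketch of dropping samples, mean and median imputation, and a simplified KNN imputation, using only the Python standard library. It is not the study's pipeline: the toy data and the names `ROWS`, `drop_samples`, `impute_column_stat`, and `knn_impute` are hypothetical, and the KNN routine is a hand-written approximation, not the imputer the authors used.

```python
import math
import statistics

# Toy feature matrix (made-up values); None marks a missing entry.
ROWS = [
    [1.0, 10.0],
    [2.0, None],
    [3.0, 30.0],
    [None, 40.0],
    [100.0, 50.0],  # deliberate outlier in the first feature
]


def drop_samples(rows):
    """Listwise deletion: keep only rows with no missing value."""
    return [row for row in rows if None not in row]


def impute_column_stat(rows, stat):
    """Fill each column's gaps with a statistic of its observed values."""
    columns = list(zip(*rows))
    fills = [stat([v for v in col if v is not None]) for col in columns]
    return [
        [fills[j] if v is None else v for j, v in enumerate(row)]
        for row in rows
    ]


def knn_impute(rows, k=2):
    """Fill each gap with the mean of that feature over the k nearest rows.

    Distance is the root mean squared difference over features observed
    in both rows (a simplifying assumption made for this sketch).
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return math.inf  # nothing to compare on
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    result = []
    for i, row in enumerate(rows):
        filled = list(row)
        for j, value in enumerate(row):
            if value is None:
                # Candidate donors: other rows where feature j is observed.
                donors = [r for m, r in enumerate(rows)
                          if m != i and r[j] is not None]
                donors.sort(key=lambda r: distance(row, r))
                filled[j] = statistics.mean(r[j] for r in donors[:k])
        result.append(filled)
    return result


mean_filled = impute_column_stat(ROWS, statistics.mean)
median_filled = impute_column_stat(ROWS, statistics.median)
knn_filled = knn_impute(ROWS, k=2)

print(len(drop_samples(ROWS)))  # rows that survive listwise deletion
print(mean_filled[3][0])        # 26.5, pulled up by the outlier
print(median_filled[3][0])      # 2.5, unaffected by the outlier
print(knn_filled[3][0])         # average over the two nearest donor rows
```

On this toy data the mean fill for the first feature is dragged toward the outlier while the median fill is not, which illustrates one way the two imputers can differ. It is not offered as an explanation of the accuracies reported above. Passing `statistics.mode` as `stat` gives the analogous mode imputer for categorical features.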

Published

2026-03-31