Comparing Feature Selection Methods on Metagenomic Data using Random Forest Classifier

Authors

  • Zoltán Pödör, Eötvös Loránd University, Faculty of Informatics, Budapest, H-1117, Hungary
  • Máté Hekfusz, Eötvös Loránd University, Faculty of Informatics, Budapest, H-1117, Hungary

DOI:

https://doi.org/10.14738/tecs.121.16525

Keywords:

Feature Selection, Metagenomics, Random Forest, Ensemble Feature Selection

Abstract

Feature selection (FS) is an efficient data preprocessing strategy for fields such as metagenomics, where datasets tend to be very high-dimensional. Its objectives include producing lower-dimensional, cleaner input data and building simpler, more coherent machine learning models. One promising application of machine learning is precision medicine, where disease risk is predicted from patient genetic data that must first be preprocessed with feature selection. In this article, we give a general overview of feature selection methods and their applicability to disease risk prediction. From these, we selected six FS methods and compared them on two freely available metagenomic datasets, using the same machine learning algorithm (Random Forest) throughout for comparability. Based on the results of the individual FS methods, we created ensemble feature sets in multiple ways to improve the accuracy of the Random Forest predictions.
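
The abstract does not say which six FS methods were used or how the ensemble feature sets were built. As a rough illustration of the general idea only, the sketch below combines the top-ranked features of several selectors by union, intersection, and a vote threshold. The method names, feature names, scores, and the helpers `top_k` and `ensemble_feature_sets` are hypothetical and are not taken from the article.

```python
from collections import Counter


def top_k(scores, k):
    """Return the k highest-scoring feature names from a {feature: score} mapping."""
    # Sort by descending score; break ties by feature name for determinism.
    ranked = sorted(scores, key=lambda f: (-scores[f], f))
    return set(ranked[:k])


def ensemble_feature_sets(selections, min_votes):
    """Combine per-method feature sets by union, intersection, and vote threshold."""
    # Count how many methods selected each feature.
    votes = Counter(f for chosen in selections.values() for f in chosen)
    n_methods = len(selections)
    return {
        "union": set(votes),
        "intersection": {f for f, n in votes.items() if n == n_methods},
        "majority": {f for f, n in votes.items() if n >= min_votes},
    }


# Hypothetical importance scores from three unnamed FS methods (made-up values).
method_scores = {
    "method_a": {"taxon_1": 0.9, "taxon_2": 0.7, "taxon_3": 0.2, "taxon_4": 0.1},
    "method_b": {"taxon_1": 0.8, "taxon_2": 0.6, "taxon_3": 0.5, "taxon_4": 0.3},
    "method_c": {"taxon_1": 0.8, "taxon_2": 0.6, "taxon_3": 0.4, "taxon_4": 0.7},
}

# Each method keeps its two best features; the ensembles merge those choices.
selections = {name: top_k(scores, k=2) for name, scores in method_scores.items()}
ensembles = ensemble_feature_sets(selections, min_votes=2)

for name in ("union", "intersection", "majority"):
    print(name, sorted(ensembles[name]))
```

Each resulting set could then serve as the column subset passed to a classifier such as Random Forest, so that the ensemble variants are compared under the same model.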

Published

2024-02-29

How to Cite

Pödör, Z., & Hekfusz, M. (2024). Comparing Feature Selection Methods on Metagenomic Data using Random Forest Classifier. Transactions on Engineering and Computing Sciences, 12(1), 175–187. https://doi.org/10.14738/tecs.121.16525