Feature Selection and an Ensemble Framework for Metagenomic Data

Authors

  • Zoltán Pödör, Eötvös Loránd University, Faculty of Informatics, Budapest, H-1117, Hungary
  • Máté Hekfusz, Eötvös Loránd University, Faculty of Informatics, Budapest, H-1117, Hungary

DOI:

https://doi.org/10.14738/tmlai.1304.19266

Keywords:

Feature Selection, Classification, Ensemble Framework, Genome Data

Abstract

Genome data, characterized by its high dimensionality and complexity, presents significant challenges for computational analysis and biological interpretation. Feature selection (FS) plays a crucial role in reducing dimensionality, improving model interpretability, and enhancing predictive performance by identifying the most informative genomic attributes. In this study, we construct a robust, generalisable ensemble framework for the feature selection and machine learning (ML) classification of metagenomic data. The framework incorporates six feature selection algorithms of different types working in an ensemble. We comprehensively assess four ML classifiers to pair with them and three aggregation methods to combine their results, testing numerous configurations to find which ones perform best. Our results show that Random Forest is a general and reliable algorithm for metagenomic datasets. Consistent with the literature, we found that feature selection universally improves classification performance, though this improvement varies per dataset and, for non-wrapper methods, depends on choosing the right subset size. Judged by their best scores, the six FS algorithms performed broadly similarly across the data, with the largest differences appearing on the hardest-to-classify datasets, where mRMR and Boruta edged out the others.

Published

2025-08-25

How to Cite

Pödör, Z., & Hekfusz, M. (2025). Feature Selection and an Ensemble Framework for Metagenomic Data. Transactions on Engineering and Computing Sciences, 13(04), 116–143. https://doi.org/10.14738/tmlai.1304.19266