Feature Selection and an Ensemble Framework for Metagenomic Data
DOI:
https://doi.org/10.14738/tmlai.1304.19266
Keywords:
Feature Selection, Classification, Ensemble Framework, Genome Data
Abstract
Genome data, characterized by its high dimensionality and complexity, presents significant challenges for computational analysis and biological interpretation. Feature selection (FS) plays a crucial role in reducing dimensionality, improving model interpretability, and enhancing predictive performance by identifying the most informative genomic attributes. In this study, we construct a robust, generalisable ensemble framework for the feature selection and machine learning (ML) classification of metagenomic data. The framework incorporates six feature selection algorithms of different types working in an ensemble. We comprehensively assess four ML classifiers to pair with them and three aggregation methods to combine their results, testing numerous configurations to find which ones perform best. Our results show that Random Forest is a general and reliable algorithm for metagenomic datasets. Consistent with the literature, we found that feature selection universally improves classification performance, though the size of the improvement varies per dataset and, for non-wrapper methods, depends on choosing the right subset size. Judged by their best scores, the six FS algorithms performed broadly similarly across the data, with the largest differences appearing on the hardest-to-classify datasets, where mRMR and Boruta edged out the others.
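To illustrate what combining the outputs of several feature selectors can look like, the following is a minimal sketch of one common aggregation scheme, mean-rank (Borda-style) aggregation. It is an assumption for illustration only: the abstract does not name the three aggregation methods evaluated, and the function names and the importance scores below are hypothetical.

```python
from statistics import mean


def scores_to_ranks(scores):
    """Convert importance scores to ranks (1 = most important).

    Ties are broken by feature order, because sorted() is stable.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def aggregate_mean_rank(score_lists, k):
    """Return the indices of the k features with the lowest mean rank."""
    rank_lists = [scores_to_ranks(s) for s in score_lists]
    n_features = len(score_lists[0])
    mean_ranks = [mean(r[j] for r in rank_lists) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: mean_ranks[j])[:k]


# Hypothetical importance scores from three selectors over five features.
selector_scores = [
    [0.9, 0.1, 0.5, 0.3, 0.0],
    [0.8, 0.2, 0.6, 0.1, 0.05],
    [0.4, 0.3, 0.9, 0.2, 0.1],
]

selected = aggregate_mean_rank(selector_scores, k=2)
print(selected)  # indices of the two features with the best mean rank
```

In a setup like the one described, the selected indices would then be used to subset the feature matrix before training a classifier such as Random Forest; the subset size `k` is the parameter that the abstract notes matters for non-wrapper methods.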
License
Copyright (c) 2025 Zoltán Pödör, Máté Hekfusz

This work is licensed under a Creative Commons Attribution 4.0 International License.
