Transactions on Engineering and Computing Sciences - Vol. 12, No. 1
Publication Date: February 25, 2024
DOI:10.14738/tecs.121.16525.
Pödör, Z., & Hekfusz, M. (2024). Comparing Feature Selection Methods on Metagenomic Data using Random Forest Classifier.
Transactions on Engineering and Computing Sciences, 12(1), 175-187.
Services for Science and Education – United Kingdom
Comparing Feature Selection Methods on Metagenomic Data
using Random Forest Classifier
Zoltán Pödör
Eötvös Loránd University, Faculty of Informatics, Budapest, H-1117, Hungary
Máté Hekfusz
Eötvös Loránd University, Faculty of Informatics, Budapest, H-1117, Hungary
ABSTRACT
Feature selection (FS) as a data preprocessing strategy is an efficient way to prepare
input data for various fields, such as metagenomics, where datasets tend to be very
high-dimensional. The objectives of feature selection include creating lower
dimensional and cleaner input data, along with building simpler and more coherent
machine learning models. One of the promising applications of machine learning is
in precision medicine, where disease risk is predicted using patient genetic data,
which needs to be preprocessed with feature selection. In this article we provide a
general overview of different feature selection methods and their applicability for
disease risk prediction. From these, we selected and compared six different FS
methods on two freely available metagenomic datasets using the same machine
learning algorithm (Random Forest) for comparability. Based on the results of the
individual FS methods, ensemble feature sets were created in multiple ways to
improve the accuracy of Random Forest predictions.
Keywords: Feature selection, Metagenomics, Random Forest, Ensemble feature selection.
INTRODUCTION
Nowadays, machine learning algorithms and artificial intelligence are indispensable parts of big
data analysis. There are many scientific and practical problems where the number of
independent variables (known as the features) is often so high that they are affected by what is
known as the curse of dimensionality. As the number of dimensions or features increases, the
amount of data needed to generalize the machine learning model accurately increases
exponentially, and it is more difficult to achieve high accuracy values [1]. Also, with a huge
number of features, learning models often tend to overfit, which may cause performance
degradation on unseen, new data. Data of high dimensionality can also significantly increase
the memory storage requirements and computational costs for data analytics [2]. This problem
occurs in many areas, but it is a constant challenge in genomics, and especially when dealing
with metagenomic data, where entire microbial communities are sequenced, resulting in huge,
diverse, and noisy feature sets [3].
The advancement of genetic sequencing and machine learning algorithms over the last 10-15
years has increased interest in precision medicine and genome data-based disease detection
[4] [5]. The advent of high-throughput, next-generation sequencing (NGS) has brought a huge
influx of metagenomic data, [6] and today there is an abundance of metagenomic samples
available in public databases. Metagenomics is a field devoted to understanding the workings
of microbes by sequencing and analysing their genomes. To extract knowledge and patterns
and to make decisions based on this sequenced metagenomic data, artificial intelligence and
machine learning methods have been instrumental.
A typical (digitised) microbiome sample is made up of millions of raw sequence reads, which
are converted with bioinformatics tools into a two-dimensional table (the rows are the samples,
the columns are the microbial taxa) containing the relative abundance of each microbial taxon
(the features in this case) present in the sample [7]. This table tends to be sparse with high
noise, which makes for suboptimal training data for machine learning models [8]. Metagenomic
data is extremely high-dimensional: the number of microbial features in each sample is orders
of magnitude greater than the number of samples available for analysis [7] [8]. This number
depends on the dataset, but it is usually on the order of tens of thousands [9].
Thus, to be able to apply machine learning approaches effectively, metagenomic data needs to
be preprocessed: redundant and noisy features must be removed, and dimensionality must be
drastically reduced before a dataset can be used to train an ML model. This preprocessing is
known as feature selection (FS), and it has become an indispensable part of most bioinformatics
pipelines [10]. Well-prepared input datasets are necessary conditions for effective and reliable
data analysis.
This paper has two main purposes: first, to compare several FS methods on the same databases
with the same machine learning algorithm; and second, to create, examine, and compare
ensemble feature sets based on the results of the individual FS methods. Our research
questions, connected to metagenomic data, were: (1) which type of FS method gives the best
feature subset for machine learning, (2) do the different FS methods define similar feature
subsets, and (3) are the ensemble feature sets better than the results of the individual selections?
The remainder of this paper is organized as follows: Section 2 provides a literature review on
the most important FS methods and their applications in metagenomics, as well as on our
applied machine learning algorithm. Section 3 discusses the performance of the FS methods we
selected, both individually and in ensemble, on the same input datasets. Section 4 answers our
research questions and summarizes the main findings and implications of this study.
METHODS
Data preprocessing is the foundation of successful and reliable data analysis. Metagenomic
databases pose several challenges, the most important of which is their high dimensionality:
even if the number of samples is not too large, the number of features is. This means that an
appropriate preprocessing step is critical for a high-quality analysis [11].
In this chapter we introduce the main types of the feature selection methodologies, their
applications in metagenomics, and our chosen machine learning algorithm, Random Forest
(RF), which builds a model from the selected features. Our aim was to compare the result of the
selected FS algorithms with the same machine learning algorithm (RF) in all cases for
consistency.
Feature Selection
One of the main goals of the preprocessing step is to reduce the dimensionality and the
complexity of a dataset, which is accomplished by feature selection. There are five main types
of feature selection methods:
Basic Methods:
1. Filter methods: Features are selected on the basis of their scores in various statistical
tests for their correlation with the dependent variable. Features are ranked by an
evaluation criterion, and those above a certain threshold (usually chosen by the user)
are selected as the feature subset, independently of the learning model. Because these
evaluations are fast and independent of machine learning classifiers, filter methods are
considered the simplest and least computationally intensive of the FS methods [12].
Filter methods can be univariate (testing each feature independently, for example the
Fisher test, t-test, Mann-Whitney test, or Pearson correlation) or multivariate (testing subsets
of features simultaneously, for example the Fast Correlation-Based Filter, minimum-redundancy-maximum-relevance, or Relief-based algorithms).
2. Wrapper methods: Unlike filter techniques, wrapper methods are invariably tied to a
given ML classifier, as they use it to evaluate different combinations of features and
select the subset that performs the best [12]. Their main issue is cost [10]: given the
high-dimensional nature of metagenomic data, evaluating every possible subset is
computationally infeasible, requiring search strategies to narrow down the options.
Perhaps the best-known search strategies are forward selection and backward
elimination.
3. Embedded methods: They integrate feature selection and ML model training into one
step. During training, the ML algorithm automatically determines the importance of
each feature. These methods can be considered a middle ground between filters and
wrappers: like filters, they are reasonably fast, and like wrappers, they consider the
characteristics of the classifier to achieve higher performance [12]. Random Forest and
LASSO methods are typical examples of this type of FS.
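To make the filter/embedded distinction concrete, the following minimal sketch ranks features first with a chi-squared filter and then with Random Forest importances, both via scikit-learn. The data are synthetic non-negative "abundance" values and the sizes are illustrative, not those of the paper's datasets:

```python
# Illustrative sketch: a univariate filter (chi-squared) vs. an embedded
# method (Random Forest importances). Data and sizes are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.random((100, 500))       # 100 samples, 500 non-negative features
y = rng.integers(0, 2, 100)      # binary healthy/disease labels

k = 20                           # subset size, chosen by the user

# Filter: score each feature independently of any classifier, keep the top k.
filt = SelectKBest(chi2, k=k).fit(X, y)
filter_top = np.argsort(filt.scores_)[::-1][:k]

# Embedded: feature importance falls out of model training itself.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
embedded_top = np.argsort(rf.feature_importances_)[::-1][:k]
```

The two rankings generally disagree, which is precisely what motivates aggregating the outputs of several FS methods.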
Advanced Methods (built from the three basic types mentioned above):
1. Hybrid methods: They implement different types of FS algorithms within one multi-step sequential process, taking advantage of their different characteristics. The most
intuitive way to construct a hybrid method is to start with a fast filter technique and
then give its (lower dimensional) output to a wrapper or embedded method, reducing
their higher computational cost while retaining their higher accuracy – the best of both
worlds [12].
2. Ensemble methods: Ensemble methods also utilize multiple FS algorithms, but unlike
hybrid techniques, they do not implement them step-by-step, but rather, in parallel. In
an ensemble process, multiple FS algorithms are run on the dataset separately, each of
which returns a subset of features. Then, these subsets are aggregated in some manner
to find the final feature set. This aggregation can be a simple intersection or union of
the individual subsets or some kind of weighting of each feature based on its position
in each individual subset [12].
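The hybrid pattern described above can be sketched as a two-step pipeline. The data are synthetic and the sizes (2,000 features filtered down to 200, then 20) are hypothetical choices for illustration:

```python
# Sketch of a hybrid FS pipeline: a cheap filter shrinks the feature space,
# then a costlier embedded method refines the survivors. Synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(1)
X = rng.random((120, 2000))      # 120 samples, 2000 features
y = rng.integers(0, 2, 120)

# Step 1 (filter): keep the 200 highest-scoring features by mutual information.
filt = SelectKBest(mutual_info_classif, k=200).fit(X, y)
X_reduced = filt.transform(X)

# Step 2 (embedded): rank the survivors by Random Forest importance, keep 20.
rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_reduced, y)
final_subset = np.argsort(rf.feature_importances_)[::-1][:20]
```

The filter pays its low cost over the full feature space, while the costlier embedded step only ever sees the reduced set.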
Application of FS Algorithms in Bioinformatics
We studied the bioinformatics literature to find feature selection methods effectively used for
disease risk prediction from metagenomic and other genomic data. While the focus of this paper
is metagenomics, feature selection of genomic data in general faces many of the same
challenges: high dimension, sparsity, noise. Thus, we include papers and algorithms that deal
with other forms of genomic data (such as micro-array and genotype data) as well to present a
more comprehensive picture of the field.
Filter methods have proven enduringly popular in bioinformatics, with several different
algorithms being used: Hacilar et al. [13] used the minimum redundancy – maximum relevance
(mRMR) multivariate FS method on an Inflammatory Bowel Disease (IBD) dataset to find the
subset of features most associated with the disease. Urbanowicz et al. [14] analysed the popular
Relief family of multivariate FS methods on simulated genotype datasets, finding that they
accurately detected two-way feature interactions.
Despite their higher cost, wrapper methods have also been used in genomics. He et al. [15]
devised a novel wrapper FS algorithm based on the mRMR filter to predict genetic traits.
Kavakiotis et al. [16] also created a new wrapper, Frequent Item Feature Selection (FIFS), which
outperformed other FS methods in the informative marker selection task. Shen et al. [17] used
the Boruta wrapper, paired with the Random Forest classifier, to find microbes from the gut
microbiome that were important to predicting schizophrenia.
Studies that do not specify a feature selection step but use a classifier like Random Forest could
be considered to be using an embedded method. More explicitly, Kumar & Rath [18] used
Support Vector Machines (SVM) as an embedded way of feature selection, along with statistical
filter methods, on leukaemia datasets. Sasikala et al. [19] proposed a new Genetic Algorithm
(GA), integrating it with four different ML classifiers to produce a highly accurate model for
breast cancer diagnosis.
Hybrid methods have recently become quite popular in the field, with some studies calling them
the ‘best practice’ for feature selection [10] [12]. Jafari et al. [20] combined two univariate filters
(Pearson correlation and information gain) and a multivariate filter (ReliefF) with a Genetic
Algorithm wrapper to infer gene networks. Wang & Cai [21] analysed five different types of
cancer with a two-step FS framework followed by an SVM classifier, manually confirming that
the hybrid process selected near-optimal feature subsets.
Studies show that ensemble methods outperform single-algorithm methods in a variety of
genomic tasks. Verma et al. [22] used a variety of filter and embedded methods – wrappers are
rarely present in ensembles because of their high cost – to show that different FS algorithms
selected different features from genetic data, and thus an ensemble method is needed for the
best performance. Farid et al. [23] proposed an ensemble feature selection and clustering
method specifically for high-dimensional genomic data, showing that it worked better on a
Brugada syndrome dataset than non-ensemble alternatives. Sarkar et al. [24] combined no fewer
than eight different FS methods into an ensemble process and devised an innovative
aggregation step that delivered high classification accuracy on breast cancer microRNA
biomarkers.
The above-mentioned applications show the importance and inevitability of FS in genomic data
analysis, including in metagenomic data, as a preprocessing step.
Random Forest
Random Forest is a machine learning classifier which combines the output of multiple decision
trees to reach a single result. It conducts feature selection during training, making it an
embedded FS algorithm as well. Its ease of use and flexibility have fuelled its adoption, as it
handles both classification and regression problems [12]. It is highly data adaptive, which
makes it well-suited for “large p, small n” problems (where p is the number of
features and n is the number of samples), and it can account for correlation as well as interactions
among the features [25].
RF is a widely used machine learning algorithm for classification tasks in metagenomic analysis,
and it outperforms LASSO and SVM on metagenomic datasets for colorectal cancer, which is a
disease we also analyse [26]. Degenhardt et al. [27] established that RF has been successfully
applied to metagenomic datasets. In [28], the authors found that RF showed better
generalizability and robustness than KNN and SVM methods, making it suitable for use with
high-dimensional data. Shen et al. [29] examined RF, stochastic gradient boosting, and
SVM methods on metagenomic data. They selected Random Forest due to its slightly better
performance from a holistic perspective and its capability of ranking the predictor variables.
These properties of RF and its popularity were the reasons we chose this learning model to
examine the results of our selected feature selection methods.
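As a minimal illustration of the "large p, small n" setting described above (synthetic data, illustrative sizes), Random Forest can be trained directly on such data and yields a predictor ranking as a by-product:

```python
# Sketch: Random Forest on p >> n data, with a feature ranking for free.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.random((60, 5000))       # n = 60 samples, p = 5000 features
y = rng.integers(0, 2, 60)       # binary class labels

rf = RandomForestClassifier(n_estimators=200, random_state=2).fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]   # most important first
```

This dual role, classifier and ranker at once, is what makes RF usable both as the learning model and as an embedded FS method in the experiments below.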
RESULTS
During our literature review, we have identified several feature selection methods effectively
used to identify diseases from genomic data, which we believe are well-suited to be used with
metagenomic data given the similarity of the challenges. Based on this we have selected six for
our experiments: Chi-squared (Chi2), Mutual Information (MI), Minimum Redundancy-Maximum Relevance (mRMR), Fast Correlation-Based Filter (FCBF), MultiSURF, and Random
Forest. The first five are all filter methods, which are frequently used in ensemble feature
selection because of their speed and simplicity, allowing several algorithms to be executed in
tandem. Random Forest is an embedded method, meaning it conducts feature selection and
model training in the same step. We use it not just for feature selection, but also as the machine
learning classifier with which we test all the other methods, providing us an appropriate
baseline. No wrappers were included, as their computational cost tends to be excessive for
very high-dimensional datasets, and thus they rarely feature in ensemble algorithms.
We used two publicly available metagenomic datasets. The first dataset deals with Parkinson’s
disease (PD), with 366 gut microbiome samples from patients in the United States, as described
in the study of Hill-Burns et al. [31]: 211 of these samples come from patients with the disease,
while the other 155
are healthy controls. The second dataset, published by Zeller et al. [32], contains 182 samples
related to colorectal cancer (CRC), with a good balance of 90 cancerous and 92 healthy samples.
The samples in both datasets have already been processed from raw reads into two feature
abundance tables, which can directly be fed into feature selection algorithms. The PD table,
processed during previous research, contains 10,631 features, each corresponding to a
microbial taxon. The CRC table, processed and made available by Qu et al. [33], contains 18,170
features.
The feature selection algorithms we chose were drawn from various open-source libraries and
implemented in Python. MI, Chi2, and RF were taken from the popular machine learning toolkit
scikit-learn [34]. Accompanying their review of the Relief feature selection family, Urbanowicz
et al. published a suite of the algorithms, which included MultiSURF. Finally, mRMR and FCBF
algorithms were taken from separate open-source repositories (available at
https://github.com/smazzanti/mrmr under the MIT license).
Individual Performance on Feature Subsets
First, we examined how well each of our feature selection methods performed individually on
feature subsets of various (but always small, as dimensionality reduction is one of the main
goals of feature selection) sizes. Each method produces a ranking of every feature based on its
own measure of relevance and importance. After a ranking was produced, we took the highest
ranked features and reduced the dataset to just these features. We then divided the samples of
this reduced dataset into train and test splits (80% and 20% of the data, respectively) and
trained the Random Forest classifier with 10-fold cross-validation, recording mean accuracy
and standard deviation for each experiment.
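The evaluation protocol above can be sketched as follows. The data here are synthetic; a real run would substitute the ranking produced by each FS method:

```python
# Sketch of the evaluation loop: reduce the dataset to the top-k ranked
# features, then score a Random Forest with 10-fold cross-validation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.random((180, 1000))      # samples x features (synthetic)
y = rng.integers(0, 2, 180)

# Stand-in for an FS method's output: a ranking of all feature indices.
ranking = rng.permutation(X.shape[1])

results = {}
for k in (10, 50, 100):                        # subset sizes, as in the tables
    X_k = X[:, ranking[:k]]                    # keep only the top-k features
    scores = cross_val_score(RandomForestClassifier(random_state=3),
                             X_k, y, cv=10, scoring="accuracy")
    results[k] = (scores.mean(), scores.std())  # mean accuracy and std. dev.
```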
Overall, we conducted 240 experiments at this stage, testing all six of our algorithms on both
datasets, with feature subsets ranging from size 10 to 200. The results can be found in Table 1
and Table 2. We also ran the full, unreduced datasets through Random Forest to get a baseline
accuracy to compare against. We chose RF for this task partly because it also implicitly employs
feature selection, and thus we believe it provides a higher baseline accuracy on metagenomic
data than classifiers without FS. For the PD dataset, this resulted in 62.57% classification
accuracy with a standard deviation of 4.02. For the CRC dataset, which appears to be easier to
classify, the baseline was 73.04% accuracy with 8.03 standard deviation.
Table 1: Classification accuracy (Acc in %) and standard deviation (Std) values for the
individual FS algorithms on various subsets of the Parkinson’s dataset. Highest
accuracy is highlighted in bold for each algorithm.
RF CHI2 MUTUAL INFO MRMR FCBF MULTISURF
FEATURES Acc Std Acc Std Acc Std Acc Std Acc Std Acc Std
10 62.58 11.57 57.39 5.32 60.65 12.72 62.36 7.88 63.41 10.45 54.63 8.43
20 68.03 6.63 63.97 6.82 59.03 8.22 59.53 8.25 64.99 8.93 58.75 6.66
30 68.24 9.74 64.23 6.58 65.81 5.66 69.93 9.52 66.4 8.65 62.3 8.02
40 70.45 7.64 63.42 9.37 60.34 10.07 72.42 9.08 65.02 9.08 60.68 9.27
50 71.53 7.88 65.04 5.94 68.27 5.36 74.08 11.03 66.13 9.15 62.06 8.45
60 71.28 8.36 65.83 6.08 65.87 9.77 74.33 9.99 63.14 8 64.21 7.47
70 69.64 10.02 65.56 5.78 66.71 9.47 72.7 8.96 63.4 7.17 63.39 6.01
80 70.74 6.99 65.61 6.9 57.92 7 73.24 9.24 63.94 7.24 63.14 5.45
90 66.35 9.8 64.77 5.85 61.46 8.2 74.62 8.59 63.39 8.05 64.2 7.08
100 68.25 8.78 66.13 5.99 63.37 8.49 72.97 9.68 63.92 8.85 65.28 7.89
110 67.16 7.27 66.66 8.67 64 9.96 74.6 10.1 64.46 7.24 65.29 7.92
120 69.89 8.14 68.31 6.88 61.46 8.05 73.01 9.93 63.66 6.13 64.46 6.45
130 67.7 8.56 67.2 6.89 65.53 10.12 75.69 9.15 63.09 6.26 64.72 7.83
140 68.54 7.44 67.48 5.99 66.36 10.38 74.61 8.04 63.36 6.7 64.73 7.35
150 67.18 7.79 66.91 6.98 65.26 8.61 73.25 7.8 67.76 7.3 64.74 7.04
160 69.63 5.64 66.61 7.07 69.64 10.35 72.39 9.34 61.17 6.45 63.35 8.11
170 68.79 9.16 68.55 7.44 66.66 9.53 74.87 9.22 64.49 9.67 63.9 6.45
180 66.92 7.86 66.61 9.3 66.07 7.08 72.69 7.81 63.12 4.37 61.16 7.73
190 67.44 9.47 66.94 5.76 70.68 10.77 71.08 9.16 63.1 5.3 63.09 5.7
200 69.65 6.97 66.41 7.47 65.55 9.76 70.47 9.33 63.92 6.18 63.63 7.48
Table 2: Classification accuracy (Acc in %) and standard deviation (Std) values for the
individual FS algorithms on various subsets of the CRC dataset. Highest accuracy is
highlighted in bold for each algorithm.
RF CHI2 MUTUAL INFO MRMR FCBF MULTISURF
FEATURES Acc Std Acc Std Acc Std Acc Std Acc Std Acc Std
10 68.04 9.37 60.96 9.37 46.64 9.9 72.95 7.44 63.65 10.8 66.49 6.11
20 73.01 10.08 64.91 6.98 60.41 7.43 81.31 8.94 59.24 8.98 70.73 9.87
30 72.98 9.55 66.43 6.96 56.46 13.06 85.67 7.55 62.02 9.94 72.46 6.31
40 74.65 9.16 64.8 7.21 61.55 11.78 84.5 8.24 63.63 14.42 71.29 9.1
50 74.68 7.24 70.79 9.17 64.82 9.03 87.31 6.65 65.26 11.58 79.53 8.3
60 79.65 8.65 72.46 8.03 62.63 10.14 84.53 7 63.13 11.16 77.34 8.16
70 76.96 7.59 77.43 10.14 67.02 10.44 86.2 6.75 63.13 13.94 75.2 7.21
80 77.46 7.21 74.68 5.82 66.05 9.84 85.61 5.76 71.35 10.66 75.18 8.73
90 76.37 9.61 71.35 9.07 70.41 12.92 85.58 8 69.12 8.17 79.59 7.75
100 79.65 9 70.76 10.16 75.26 7.56 85.06 7.51 71.9 8.55 80.15 6.22
110 79.65 9 74.09 12.04 69.15 8.87 88.95 6.14 70.76 11.79 79.53 8.67
120 75.88 7.62 75.18 7.3 68.63 9.74 90.01 6 66.84 11.88 76.29 7.48
130 80.79 7.48 75.79 6.29 66.99 9.07 87.84 7.38 69.12 11.52 76.81 9.63
140 77.98 9.68 75.23 8.42 58.22 9.68 86.81 5.53 70.85 8.66 79.59 9.86
150 78.54 7.29 75.29 7.43 66.52 10.09 86.73 6.74 65.82 14.91 75.76 6.73
160 79.12 6.92 74.15 4.42 74.65 6.29 86.75 8.28 70.23 10.49 75.82 7.68
170 79.68 6.55 76.4 8.47 68.71 11.1 87.31 7.01 66.43 10.89 76.32 8.14
180 76.32 9.98 78.54 8.82 77.37 6.97 86.2 5.2 67.46 10.3 77.34 9.56
190 80.2 8.33 76.35 7.08 77.37 14.63 90 8.9 69.15 13.04 79.06 7.31
200 81.87 7.44 77.49 7.89 70.15 13.6 85.64 5.18 69.74 10.87 74.18 6.01
These individual results in Table 1 and Table 2 confirm that feature selection in general is an
important preprocessing step, generally providing higher classification performance with
drastically reduced feature counts, which also speed up model training. This performance
increase, however, is dependent on several factors. Perhaps the most important is the dataset
itself: we can see that using the same methods and the same parameters nonetheless resulted in
significantly (10-15%) lower accuracy on the PD dataset than the CRC one, indicating that the
former’s samples are more difficult to classify. Another important factor is the number of
features selected: unlike wrappers which determine the optimum number themselves, the filter
and embedded methods we use require the user to make that choice. The tables show that each
FS algorithm has a ‘sweet spot’, an ideal size for the feature set which provides the best
performance, and that this sweet spot is different for each algorithm – meaning each method
must be tested individually to find it.
The choice of the algorithm itself can make a substantial difference as well. Across the two
tested datasets, we have identified mRMR as the best performer by a considerable margin,
indicating that it is a powerful tool for selecting features from metagenomic datasets. We
initially believed that our multivariate filters (FCBF and MultiSURF) would perform better than
our univariate ones (Chi2 and Mutual Info), given the complex nature of the data. And while
MultiSURF delivered the second-highest accuracy on the CRC dataset, overall, we could not
establish such a pattern, with FCBF actually performing the worst from all methods on that
same dataset.
We have also observed high standard deviation values across algorithms and datasets, meaning
that the individual classification accuracies during cross-validation varied widely. This is
indicative of the instability of metagenomic datasets: there are subsets of our data for which
the classifier performs much better (and subsets for which it performs much worse) than the
average. For this study, we judged the performance of our FS algorithms by their accuracy, but
their stability (measured by standard deviations) could be an equally valid metric.
Ensemble Feature Set Creation
After finding the optimal subsets for each feature selection algorithm, we took those subsets
(for Parkinson’s, for example: 50 features from RF, 170 features from Chi2, etc.) and aggregated
them in two different ways. First, we took the union of the optimal subsets. Even though the
CRC dataset had 7,500 more features than the PD dataset and the subset sizes for each dataset
were different, the dimensions of the unions ended up being remarkably similar: 686 features
for PD, 678 for CRC.
Next, we checked each feature in the union set to see how many subsets it was included in
(in other words, how many FS algorithms ranked it high enough) and split the features based on
this number.
Table 3: Breakdown of how many subsets each union feature appears in, for both
datasets.
Subsets 6 5 4 3 2 1
PD features 0 0 7 12 69 598
CRC features 0 4 7 36 93 538
Table 3 lists the breakdown of feature counts after this split and shows that while most features
were only selected by one algorithm, there was enough of an overlap to construct consensus
sets, especially in the CRC dataset. We thus created one ensemble set with features selected by
at least two algorithms, and another with features selected by at least three.
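The consensus construction just described amounts to counting votes over the optimal subsets; here is a toy sketch with hypothetical feature names:

```python
# Sketch of consensus-set construction: count how many optimal subsets each
# union feature appears in, then threshold. Feature names are hypothetical.
from collections import Counter

optimal_subsets = [            # one optimal subset per FS algorithm
    {"otu1", "otu2", "otu3"},
    {"otu2", "otu3", "otu4"},
    {"otu3", "otu5"},
]
votes = Counter(f for s in optimal_subsets for f in s)

union = set(votes)                                     # selected by >= 1
at_least_two = {f for f, n in votes.items() if n >= 2}
at_least_three = {f for f, n in votes.items() if n >= 3}
print(len(union), sorted(at_least_two), sorted(at_least_three))
# 5 ['otu2', 'otu3'] ['otu3']
```

Raising the vote threshold shrinks the set rapidly, exactly the pattern visible in Table 3.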
We tested all three of our aggregated sets with the same RF classifier that was used for the
individual subsets.
Table 4: Feature counts and classification performance of three different ensemble sets
on both datasets.
Dataset Method Features count Accuracy Std. dev.
PD Union 686 68.81 8.59
At least two 88 68.79 7.95
At least three 19 66.65 9.82
CRC Union 678 81.78 9.73
At least two 140 79.12 7.18
At least three 47 81.84 5.01
As Table 4 shows, all three ensemble sets had comparable performances on both datasets. It
should be noted, however, that the consensus sets achieved their accuracy with a fraction of the
features of the union set. For any given machine learning task, it is advised to have fewer
features than samples to avoid the curse of dimensionality discussed earlier, and the consensus
sets achieve this. A small feature count is especially important in metagenomics, where each
feature is a microbial taxon which can be extracted and identified. To demonstrate this, we
extracted three features from each dataset, each of which was selected by at least four of our
six FS algorithms, and then identified the taxa they represented through the GreenGenes
reference database (available at https://greengenes.secondgenome.com/). These microbial
features are presented in Table 5 below.
Table 5: Operational Taxonomic Unit (OTU) code and taxonomical identification of
some features selected by at least four FS algorithms.
Dataset OTU Taxonomy
PD 370287 c__Clostridia; o__Clostridiales; f__Ruminococcaceae;
g__Faecalibacterium; s__prausnitzii
336012 c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides
1078587 c__Clostridia; o__Clostridiales; f__Lachnospiraceae; g__Blautia
CRC 364048 c__Clostridia; o__Clostridiales; f__Lachnospiraceae; g__Blautia
514523 c__Clostridia; o__Clostridiales; f__Ruminococcaceae;
g__Faecalibacterium; s__prausnitzii
301910 c__Clostridia; o__Clostridiales; f__Lachnospiraceae; g__Coprococcus
Identifying the exact connection (whether these features correspond to disease or health) is
beyond the scope of this study, but our example shows that having few features makes it easier
to pinpoint specific bacterial species and families that appear to be highly correlated with
certain diseases [8] [24], which could improve their detection and treatment. Classification
accuracy on the ensemble sets was similar or better than on the individual FS algorithm subsets,
with mRMR being the one notable exception which performed considerably better than the
ensemble. This suggests that rather than augment the strength of mRMR (which is the idea
behind ensemble feature selection), the findings of the other algorithms dragged it down,
although it could equally be said that they elevated the worse-performing methods. Throughout
our experiments, all algorithms had equal say in constructing the ensemble, which could help
explain this phenomenon – in future research, a weighted voting scheme could be instituted
which prioritises the features selected by higher-performing algorithms such as mRMR. This
could also help in increasing stability, as for now, the standard deviations of the ensemble sets
remained high, meaning that the ensemble frameworks did not make feature selection more
stable.
CONCLUSION
In this study, we have explored feature selection in the field of bioinformatics, where it has
become an increasingly crucial preprocessing step in a machine learning pipeline because of
the massive amounts of metagenomic data available today, most of which is high-dimensional,
sparse, and noisy. By conducting a thorough literature review, we have described the three
basic categories of feature selection methods, naming specific algorithms used with genomic
data. We have also looked at how combining them in a hybrid or ensemble framework has the
potential to improve classification accuracy and stability. From the gamut of FS algorithms used
in relevant studies, we have selected six and tested their performance extensively on two
publicly available metagenomic datasets, one for Parkinson’s disease, the other for colorectal
cancer. One of our selected FS methods was Random Forest, which we used both as an embedded
FS algorithm and as a machine learning classifier. Afterwards, we created a new ensemble
framework with these six algorithms, constructing ensemble sets from their highest-performing subsets through different aggregation methods, before comparing their accuracy
with each other and with our individual results using the Random Forest classifier. We have
also identified a few microbial features found through our ensemble framework, which appear
to be highly correlated with our chosen diseases – showing that feature selection can also aid
in precision medicine.
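The aggregation step can be illustrated with a few common schemes, assuming each FS method has already produced its best feature subset. The `aggregate` helper and the toy subsets are illustrative only, not the exact procedure of our experiments:

```python
from collections import Counter

def aggregate(subsets, scheme="majority"):
    """Combine feature subsets from several FS methods into one ensemble set."""
    counts = Counter(f for s in subsets for f in s)
    if scheme == "union":          # every feature any method selected
        return set(counts)
    if scheme == "intersection":   # only features all methods agree on
        return set.intersection(*map(set, subsets))
    if scheme == "majority":       # features kept by more than half the methods
        return {f for f, c in counts.items() if c > len(subsets) / 2}
    raise ValueError(f"unknown scheme: {scheme}")

subsets = [{"a", "b", "c"}, {"b", "c", "d"}, {"b", "e"}]
u = aggregate(subsets, "union")         # == {'a', 'b', 'c', 'd', 'e'}
i = aggregate(subsets, "intersection")  # == {'b'}
m = aggregate(subsets, "majority")      # == {'b', 'c'}
```

The choice matters: the union keeps noise from every method, the intersection can become vanishingly small when the methods disagree, and majority voting sits between the two.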
The main conclusion we can draw from our findings is that choices matter, and care must be
taken at every step of the feature selection process. We have shown that the dataset itself affects
the quality of the classification, with all FS methods performing much better on the CRC dataset
than the PD one. The choice of FS algorithm also makes a profound difference to classification performance: we have identified the minimum redundancy – maximum relevance (mRMR) algorithm as particularly effective on our data, while another promising method, the Fast Correlation-Based Filter (FCBF), did not yield a meaningful improvement over the baseline.
We have also found that despite the general potential of (and large interest in) ensemble
methods, they do not always give better results than a single effective FS algorithm, at least in
terms of our metric, which was classification accuracy. One cannot simply throw together
different algorithms and hope for improvement; the participating methods and the way their
subsets are aggregated must both be carefully chosen.
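To make the mRMR criterion concrete: at each step it greedily adds the feature whose mutual information with the class label is highest after subtracting its mean mutual information with the already-selected features. The following is a toy sketch on discretised features, with illustrative data and helper names rather than our experimental pipeline:

```python
from collections import Counter
from math import log

def mutual_info(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def mrmr(features, target, k):
    """Greedy mRMR: maximise relevance to the target minus mean redundancy."""
    remaining, selected = list(features), []
    while len(selected) < k and remaining:
        def score(name):
            relevance = mutual_info(features[name], target)
            redundancy = (sum(mutual_info(features[name], features[s])
                              for s in selected) / len(selected)) if selected else 0.0
            return relevance - redundancy
        best = max(remaining, key=score)
        remaining.remove(best)
        selected.append(best)
    return selected

# Toy data: f1 duplicates f0, so the redundancy penalty excludes it.
features = {"f0": [0, 0, 1, 1, 0, 0, 1, 1],
            "f1": [0, 0, 1, 1, 0, 0, 1, 1],
            "f2": [0, 1, 0, 1, 0, 1, 0, 1]}
target = [0, 1, 1, 1, 0, 1, 1, 1]   # roughly f0 OR f2
picked = mrmr(features, target, 2)  # contains f0 and f2, never the copy f1
```

This redundancy penalty is what distinguishes mRMR from purely relevance-ranked filters, and is a plausible reason for its strong showing on sparse, correlated metagenomic abundances.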
A voting scheme of the FS algorithms has the potential to create a high-performing ensemble
set, and we will continue exploring in this direction in future research. We also aim to gather
more datasets and more feature selection methods, so that we can help advance understanding
of feature selection for metagenomic data. Finally, to improve the generalizability of our results, we also plan to include other classification methods in our experiments, such as SVM, logistic regression, and gradient boosting.
ACKNOWLEDGEMENT
The research was supported by the project No. 2019-1.3.1-KK-2019-00011 financed by the
National Research, Development and Innovation Fund of Hungary under the Establishment of
Competence Centers, Development of Research Infrastructure Programme funding scheme.