
Feature Selection

Feature selection, also known as variable selection, attribute selection, or variable subset selection, is a form of abstraction: the process of selecting a subset of relevant features (variables, predictors) for use in building a model. Feature selection techniques are used for four reasons:

  • simplification of models, to make them easier for researchers/users to interpret [1],
  • shorter training times,
  • avoiding the curse of dimensionality,
  • improved generalization by reducing overfitting [2] (formally, reducing variance [1]).

The central premise of feature selection is that the data contain some features that are either redundant or irrelevant and can therefore be removed without substantial loss of information [2]. Redundant and irrelevant are two distinct notions, since one relevant feature may be redundant in the presence of another relevant feature with which it is strongly correlated [3].

Feature selection techniques should be distinguished from feature extraction. Feature extraction creates new features as functions of the original features, whereas feature selection returns a subset of the features. Feature selection techniques are often used in domains where there are many features and comparatively few samples (data points). Classic application areas for feature selection are the analysis of handwritten text and DNA microarrays, where there are many thousands of features and from tens to hundreds of samples.

Content

  • 1 Introduction
  • 2 Subset selection
  • 3 Optimality criteria
  • 4 Structural learning
  • 5 Feature selection mechanisms based on information theory
    • 5.1 Feature selection based on minimum redundancy-maximum relevance
    • 5.2 Quadratic programming for feature selection
    • 5.3 Conditional mutual information
    • 5.4 Joint mutual information
  • 6 Feature selection based on the Hilbert-Schmidt independence criterion Lasso
  • 7 Feature selection based on correlation
  • 8 Regularized trees
  • 9 Overview of metaheuristic methods
    • 9.1 Basic principles
      • 9.1.1 Filter method
      • 9.1.2 Wrapper method
      • 9.1.3 Embedded method
    • 9.2 Feature selection metaheuristics applications
  • 10 Feature selection embedded in learning algorithms
  • 11 See also
  • 12 Notes
  • 13 Literature
  • 14 Further reading
  • 15 Links

Introduction

A feature selection algorithm can be viewed as the combination of a search technique for proposing new feature subsets with an evaluation measure that scores the different subsets. The simplest algorithm is to test every possible subset of features and find the one that minimizes the error rate. This is an exhaustive search of the space and is computationally intractable for all but the smallest feature sets. The choice of evaluation metric strongly influences the algorithm, and these metrics distinguish the three main categories of feature selection algorithms: wrappers, filters, and embedded methods [3].

  • Wrapper methods use a predictive model to score feature subsets. Each new subset is used to train a model, which is then tested on a hold-out sample. The number of mistakes made on that hold-out sample (the error rate of the model) gives the score for the subset. Because wrapper methods train a new model for each subset, they are computationally very expensive, but they usually provide the best-performing feature set for that particular type of model.
  • Filter methods use a proxy measure instead of the error rate to score a feature subset. The measure is chosen to be fast to compute while still capturing the usefulness of the feature set. Commonly used measures include mutual information [3] [4], the Pearson product-moment correlation coefficient, Relief-based algorithms [5], and inter/intra-class distance or the scores of significance tests for each class/feature combination [4] [6]. Filters are usually less computationally intensive than wrappers, but they produce feature sets that are not tuned to a specific type of predictive model [7]. This lack of tuning means that a feature set from a filter is more general than one from a wrapper, which usually gives lower prediction performance than a wrapper. However, the feature set does not embody the assumptions of a particular prediction model and is therefore better suited to exposing the relationships between features. Many filters provide a ranking of features rather than an explicit best subset, and the cut-off point in the ranking is chosen by cross-validation. Filter methods are also used as a preprocessing step for wrapper methods, which makes wrappers usable on larger problems. Another popular approach is the recursive feature elimination algorithm, commonly used together with support vector machines to repeatedly build a model and remove features with low weights.
  • Embedded methods are a catch-all group of techniques that perform feature selection as part of the model construction process. A typical example of this approach is the LASSO method (least absolute shrinkage and selection operator) for building a linear model, which penalizes the regression coefficients with an L1 penalty, shrinking many of them to zero. Any features with non-zero regression coefficients are "selected" by the LASSO algorithm. Improvements to LASSO include Bolasso, which bootstraps samples [8]; elastic net regularization, which combines the L1 penalty of LASSO with the L2 penalty of ridge regression; and FeaLect, which scores all features based on a combinatorial analysis of regression coefficients [9]. In terms of computational complexity these approaches fall somewhere between filters and wrappers. A minimal code sketch of L1-based embedded selection follows this list.
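For concreteness, here is a small sketch (not taken from any of the cited papers) of L1-based embedded selection with scikit-learn: features whose LASSO coefficients are shrunk exactly to zero are discarded. The synthetic dataset and the regularization strength alpha are arbitrary illustrative assumptions.

```python
# Minimal sketch of embedded (L1-based) feature selection: features whose
# LASSO coefficients are shrunk exactly to zero are dropped.
# Dataset and alpha are arbitrary illustrative choices.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=1.0, random_state=0)

lasso = Lasso(alpha=0.5).fit(X, y)           # L1 penalty zeroes many coefficients
selected = np.flatnonzero(lasso.coef_ != 0)  # features "selected" by LASSO
print("selected feature indices:", selected)
```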

In traditional statistics, the most popular form of feature selection is stepwise regression, which is a wrapper technique. It is a greedy algorithm that adds the best feature (or removes the worst) at each step. The main question is when to stop the algorithm. In machine learning this is typically done by cross-validation. In statistics, some criterion is optimized instead, which leads to the inherent problem of nesting. More robust methods have also been explored, such as branch and bound and piecewise linear networks.

Subset Selection

Subset selection evaluates a subset of features as a group for suitability. Subset selection algorithms can be divided into wrappers, filters, and embedded methods. Wrappers use a search algorithm to traverse the space of possible feature subsets and evaluate each subset by running a model on it. Wrappers can be computationally expensive and risk overfitting the model. Filters are similar to wrappers in their search approach, but instead of evaluating against a model, a simpler filter measure is evaluated. Embedded techniques are built into, and specific to, a model.

Many popular approaches use greedy hill climbing, which iteratively evaluates a candidate subset of features, then modifies the subset and evaluates whether the new subset is better than the old one. Evaluating a subset requires a scoring metric that ranks subsets of features. An exhaustive search is generally impractical, so the developer (or operator) defines a stopping point, and the subset of features with the highest score found up to that point is taken as the satisfactory feature subset. The stopping criterion depends on the algorithm; possible criteria include: the subset score exceeds a threshold, the program has exceeded the maximum allowed running time, and so on. A minimal sketch of such a wrapper-style forward selection is given below.
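The sketch below illustrates a greedy forward-selection wrapper, assuming scikit-learn, a logistic-regression model, and cross-validated accuracy as the evaluation metric; the fixed feature budget used as the stopping rule is just one of the possible criteria listed above.

```python
# Sketch of a greedy forward-selection wrapper: at each step add the feature
# that most improves cross-validated accuracy of the chosen model.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

selected, remaining = [], list(range(X.shape[1]))
max_features = 5                      # example stopping criterion

while remaining and len(selected) < max_features:
    # Evaluate every candidate subset "selected + [f]" by running the model on it.
    scores = {f: cross_val_score(model, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    best = max(scores, key=scores.get)
    selected.append(best)
    remaining.remove(best)
    print(f"added feature {best}, CV accuracy = {scores[best]:.3f}")
```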

Alternative search-based techniques are based on targeted projection pursuit, which finds low-dimensional projections of the data that score highly: the features with the largest projections in the low-dimensional space are then selected.

Search Approaches:

  • Exhaustive search
  • Best-first search
  • Simulated annealing
  • Genetic algorithm [10]
  • Greedy forward selection [11] [12] [13]
  • Greedy backward elimination
  • Particle swarm optimization [14]
  • Scatter search [15]
  • Variable neighborhood search [16] [17]

Two popular filter metrics for classification problems are correlation and mutual information, although neither is a true metric or "distance measure" in the mathematical sense, since the triangle inequality does not hold for them; they do not compute any actual "distance" and should rather be regarded as "scores". These scores are computed between a candidate feature (or set of features) and the desired output category. There are, however, true metrics that are simple functions of mutual information [18]. A small sketch of such filter scoring is given after the list below.

Other possible filter metrics:

  • Class separability
    • Error probability
    • Interclass distance
    • Probability distance
    • Entropy
  • Consistency-based feature selection
  • Correlation-based feature selection
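As a minimal illustration of the filter scores discussed above (an assumption-laden example, not a reference implementation), the sketch below ranks features by mutual information and by absolute Pearson correlation with the class, using an arbitrary cut-off k.

```python
# Sketch of a univariate filter: score each feature against the target with
# mutual information and with (absolute) Pearson correlation, then rank.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

mi = mutual_info_classif(X, y, random_state=0)          # estimate of I(feature; class)
corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

k = 10                                                  # cut-off chosen by the user
top_by_mi = np.argsort(mi)[::-1][:k]
top_by_corr = np.argsort(corr)[::-1][:k]
print("top features by mutual information:", top_by_mi)
print("top features by |correlation|:", top_by_corr)
```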

Optimality Criteria

The choice of optimality criterion is difficult because there are several objectives in a feature selection task. Many criteria incorporate a measure of accuracy penalized by the number of selected features (e.g., the Bayesian information criterion). The oldest are Mallows's Cp statistic and the Akaike information criterion (AIC). These add variables if the t-statistic is larger than $\sqrt{2}$.

Other criteria are the Bayesian information criterion (BIC), which uses $\sqrt{\log n}$; minimum description length (MDL), which asymptotically uses $\sqrt{\log n}$; Bonferroni / RIC, which uses $\sqrt{2\log p}$; maximum-dependency feature selection; and a set of newer criteria motivated by the false discovery rate (FDR), which use something close to $\sqrt{2\log\frac{p}{q}}$. A maximum entropy rate criterion can also be used to select the most relevant subset of features [19]. A small numeric sketch comparing subsets by AIC and BIC follows.
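The toy sketch below shows how such criteria trade residual error against the number of selected features for an ordinary least-squares model, using the Gaussian-likelihood forms AIC = n·ln(RSS/n) + 2k and BIC = n·ln(RSS/n) + k·ln(n) with constant terms dropped; the synthetic data and the two candidate subsets are assumptions made for illustration.

```python
# Sketch: compare two feature subsets of a linear model by AIC and BIC,
# using the Gaussian-likelihood form with constant terms dropped.
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)   # only 2 features matter

def rss(cols):
    beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    resid = y - X[:, cols] @ beta
    return float(resid @ resid)

def aic(cols):
    k = len(cols)
    return n * np.log(rss(cols) / n) + 2 * k

def bic(cols):
    k = len(cols)
    return n * np.log(rss(cols) / n) + k * np.log(n)

for cols in ([0, 1], [0, 1, 2, 3, 4, 5]):
    print(cols, "AIC =", round(aic(cols), 1), "BIC =", round(bic(cols), 1))
```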

Structural Learning

Filter feature selection is a special case of a more general paradigm called structural learning. Feature selection finds the relevant feature set for a specific target variable, whereas structural learning finds the relationships between all variables, usually by expressing these relationships as a graph. The most common structural learning algorithms assume that the data are generated by a Bayesian network, so the structure is a directed graphical model. The optimal solution to the filter feature selection problem is the Markov blanket of the target node, and in a Bayesian network there is a unique Markov blanket for each node [20].

Feature Selection Mechanisms Based on Information Theory

There are various feature selection mechanisms that use mutual information to score the different features. They usually follow the same algorithm:

  1. Compute the mutual information as a score between every feature ($f_i \in F$) and the target class ($c$).
  2. Select the feature with the highest score (e.g., $\arg\max_{f_i \in F} I(f_i, c)$) and add it to the set of selected features ($S$).
  3. Compute a score that may be derived from the mutual information.
  4. Select the feature with the highest score and add it to the set of selected features (e.g., $\arg\max_{f_i \in F} I_{\mathrm{derived}}(f_i, c)$).
  5. Repeat steps 3 and 4 until a certain number of features has been selected (e.g., $|S| = l$).

The simplest approach uses the mutual information itself as the "derived" score [21].

However, there are various approaches that try to reduce the redundancy between features.

Feature Selection Based on Minimum Redundancy-Maximum Relevance

Peng, Long, and Ding [22] proposed a feature selection method that can use either mutual information, correlation, or distance/similarity scores to select features. The aim is to penalize a feature's relevance by its redundancy in the presence of the other selected features. The relevance of a feature set S for the class c is defined by the average value of all mutual information values between the individual feature f i and the class c :

$D(S,c) = \frac{1}{|S|}\sum_{f_i \in S} I(f_i; c)$ .

The redundancy of all features in the set S is the average value of all mutual information values between the feature f i and the feature f j :

$R(S) = \frac{1}{|S|^2}\sum_{f_i, f_j \in S} I(f_i; f_j)$

The minimum-redundancy-maximum-relevance (mRMR) criterion is a combination of the two measures given above and is defined as:

$\mathrm{mRMR} = \max_{S}\left[\frac{1}{|S|}\sum_{f_i \in S} I(f_i;c) - \frac{1}{|S|^2}\sum_{f_i,f_j \in S} I(f_i;f_j)\right].$

Suppose that there is a complete set of n features. Let x i be the set-membership indicator function for feature f i , so that x i = 1 indicates presence and x i = 0 indicates absence of the feature f i in the globally optimal feature set. Let $c_i = I(f_i; c)$ and $a_{ij} = I(f_i; f_j)$. The expression above can then be rewritten as an optimization problem:

$\mathrm{mRMR} = \max_{x \in \{0,1\}^n}\left[\frac{\sum_{i=1}^n c_i x_i}{\sum_{i=1}^n x_i} - \frac{\sum_{i,j=1}^n a_{ij} x_i x_j}{\left(\sum_{i=1}^n x_i\right)^2}\right].$

The mRMR algorithm is an approximation of the theoretically optimal maximum-dependency feature selection algorithm, which maximizes the mutual information between the joint distribution of the selected features and the classification variable. Since mRMR approximates the combinatorial estimation problem with a series of much smaller problems, each involving only two variables, it uses pairwise joint probabilities, which are more robust. In certain situations the algorithm may underestimate the usefulness of features, because it has no way to measure interactions between features that can increase relevance. This can lead to poor performance [21] when the features are individually useless but become relevant in combination (a pathological case arises when the class is a parity function of the features). Overall, the algorithm is more efficient (in terms of the amount of data required) than the theoretically optimal maximum-dependency selection, yet it produces a feature set with little pairwise redundancy.

mRMR is representative of a large class of filter methods that trade off relevance against redundancy in different ways [21] [23]. A minimal sketch of the greedy mRMR procedure is given below.
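The following sketch of the incremental mRMR procedure relies on scikit-learn's mutual-information estimators (mutual_info_classif for relevance, mutual_info_regression for pairwise redundancy) and a fixed number of selected features; these choices are illustrative assumptions, not the authors' original implementation.

```python
# Sketch of greedy mRMR: repeatedly add the feature maximizing
# relevance I(f; c) minus average redundancy with the already chosen set.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

X, y = load_breast_cancer(return_X_y=True)
n_features = X.shape[1]

relevance = mutual_info_classif(X, y, random_state=0)      # estimates of I(f_i; c)

def redundancy(i, chosen):
    # Average I(f_i; f_j) over the already selected features f_j.
    if not chosen:
        return 0.0
    mi = [mutual_info_regression(X[:, [j]], X[:, i], random_state=0)[0]
          for j in chosen]
    return float(np.mean(mi))

selected, remaining = [], list(range(n_features))
for _ in range(5):                                         # pick 5 features (arbitrary)
    scores = {i: relevance[i] - redundancy(i, selected) for i in remaining}
    best = max(scores, key=scores.get)
    selected.append(best)
    remaining.remove(best)

print("mRMR-selected features:", selected)
```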

Quadratic programming for feature selection

mRMR is a typical example of an incremental greedy strategy for feature selection: once a feature has been selected, it cannot be removed at a later stage. While mRMR can be optimized using floating search to reduce some features, it can also be reformulated as a global quadratic programming optimization problem [24]:

$\mathrm{QPFS}: \min_{\mathbf{x}}\left\{\alpha\,\mathbf{x}^T H \mathbf{x} - \mathbf{x}^T F\right\} \quad \text{s.t.}\ \sum_{i=1}^{n} x_i = 1,\ x_i \geq 0$

where $F_{n\times 1} = [I(f_1;c),\ldots,I(f_n;c)]^T$ is the vector of feature relevances (assuming there are n features in total), $H_{n\times n} = [I(f_i;f_j)]_{i,j=1\ldots n}$ is the matrix of pairwise feature redundancies, and $\mathbf{x}_{n\times 1}$ represents the relative feature weights. The QPFS problem is solved by quadratic programming methods. It has been shown that QPFS is biased towards features with lower entropy [25], due to the feature self-redundancy term $I(f_i;f_i)$ on the diagonal of H. A sketch using a general-purpose solver appears below.
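As an illustration only, the sketch below solves the QPFS program with a general-purpose SLSQP solver instead of a dedicated quadratic-programming package; the mutual-information estimates and the value of α are assumptions.

```python
# Sketch of QPFS: minimize alpha * x'Hx - x'F over the simplex
# (sum x = 1, x >= 0); features are then ranked by their weights x.
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

X, y = load_breast_cancer(return_X_y=True)
n = X.shape[1]

F = mutual_info_classif(X, y, random_state=0)              # relevance I(f_i; c)
H = np.zeros((n, n))                                       # pairwise I(f_i; f_j),
for i in range(n):                                         # diagonal = self-redundancy
    H[i] = mutual_info_regression(X, X[:, i], random_state=0)
H = (H + H.T) / 2                                          # symmetrize the estimate

alpha = 0.5                                                # trade-off (assumption)

def objective(x):
    return alpha * x @ H @ x - x @ F

res = minimize(objective, x0=np.full(n, 1.0 / n), method="SLSQP",
               bounds=[(0, None)] * n,
               constraints=[{"type": "eq", "fun": lambda x: x.sum() - 1.0}])

ranking = np.argsort(res.x)[::-1]
print("QPFS feature ranking (best first):", ranking[:10])
```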

Conditional Mutual Information

Another score derived from mutual information is based on conditional relevance [25]:

$\mathrm{SPEC_{CMI}}: \max_{\mathbf{x}}\left\{\mathbf{x}^T Q \mathbf{x}\right\} \quad \text{s.t.}\ \|\mathbf{x}\| = 1,\ x_i \geq 0$

where $Q_{ii} = I(f_i;c)$ and $Q_{ij} = I(f_i;c \mid f_j)$ for $i \neq j$.

An advantage of SPEC_CMI is that it can be solved simply by finding the dominant eigenvector of Q; it also handles second-order feature interactions. A small sketch of this eigenvector step follows.
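Assuming the matrix Q has already been filled with the mutual-information estimates defined above, the sketch below shows only the eigenvector step; the toy Q used here is a hand-made placeholder.

```python
# Sketch: once Q (Q_ii = I(f_i;c), Q_ij = I(f_i;c|f_j)) has been estimated,
# SPEC_CMI ranks features by the dominant eigenvector of (a symmetrized) Q.
import numpy as np

def spec_cmi_ranking(Q):
    Q_sym = (Q + Q.T) / 2                      # work with a symmetric matrix
    eigvals, eigvecs = np.linalg.eigh(Q_sym)
    x = np.abs(eigvecs[:, -1])                 # dominant eigenvector, made non-negative
    return np.argsort(x)[::-1]                 # larger weight = more important feature

# Toy example with a hand-made 4x4 matrix (placeholder for real MI estimates).
Q = np.array([[0.9, 0.2, 0.1, 0.0],
              [0.2, 0.7, 0.3, 0.1],
              [0.1, 0.3, 0.5, 0.2],
              [0.0, 0.1, 0.2, 0.4]])
print("SPEC_CMI ranking:", spec_cmi_ranking(Q))
```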

Joint Mutual Information

In a study of different scores, Brown, Pocock, Zhao, and Luján [21] recommended joint mutual information [26] as a good score for feature selection. The score tries to find the feature that adds the most new information to the already selected features, in order to avoid redundancy. The score is formulated as follows:


$JMI(f_i) = \sum_{f_j \in S}\bigl(I(f_i;c) + I(f_i;c \mid f_j)\bigr) = \sum_{f_j \in S}\bigl[I(f_j;c) + I(f_i;c) - \bigl(I(f_i;f_j) - I(f_i;f_j \mid c)\bigr)\bigr]$

The score uses conditional mutual information and mutual information to estimate the redundancy between the already selected features ($f_j \in S$) and the feature under investigation ($f_i$).

Feature Selection Based on the Hilbert-Schmidt Independence Criterion Lasso

For high-dimensional data with few samples (e.g., dimensionality > $10^5$ and sample size < $10^3$), the Hilbert-Schmidt independence criterion Lasso (HSIC Lasso) is useful [27]. The HSIC Lasso optimization problem is defined as

$\mathrm{HSIC_{Lasso}}: \min_{\mathbf{x}} \frac{1}{2}\sum_{k,l=1}^{n} x_k x_l\,\mathrm{HSIC}(f_k,f_l) - \sum_{k=1}^{n} x_k\,\mathrm{HSIC}(f_k,c) + \lambda\|\mathbf{x}\|_1, \quad x_1,\ldots,x_n \geq 0,$

where $\mathrm{HSIC}(f_k,c) = \mathrm{tr}(\bar{\mathbf{K}}^{(k)}\bar{\mathbf{L}})$ is a kernel-based independence measure called the (empirical) Hilbert-Schmidt independence criterion (HSIC), $\mathrm{tr}(\cdot)$ denotes the trace, $\lambda$ is a regularization parameter, $\bar{\mathbf{K}}^{(k)} = \mathbf{\Gamma}\mathbf{K}^{(k)}\mathbf{\Gamma}$ and $\bar{\mathbf{L}} = \mathbf{\Gamma}\mathbf{L}\mathbf{\Gamma}$ are the input and output centered Gram matrices, $K^{(k)}_{i,j} = K(u_{k,i}, u_{k,j})$ and $L_{i,j} = L(c_i, c_j)$ are Gram matrices, $K(u,u')$ and $L(c,c')$ are kernel functions, $\mathbf{\Gamma} = \mathbf{E}_m - \frac{1}{m}\mathbf{1}_m\mathbf{1}_m^T$ is the centering matrix, $\mathbf{E}_m$ is the m-dimensional identity matrix (m: the number of samples), $\mathbf{1}_m$ is the m-dimensional vector of ones, and $\|\cdot\|_1$ is the $\ell_1$-norm. HSIC always takes a non-negative value and equals zero if and only if the two random variables are statistically independent, provided a universal reproducing kernel such as the Gaussian kernel is used.

HSIC Lasso can be written as

$\mathrm{HSIC_{Lasso}}: \min_{\mathbf{x}} \frac{1}{2}\left\|\bar{\mathbf{L}} - \sum_{k=1}^{n} x_k \bar{\mathbf{K}}^{(k)}\right\|_F^2 + \lambda\|\mathbf{x}\|_1, \quad x_1,\ldots,x_n \geq 0,$

where $\|\cdot\|_F$ is the Frobenius norm. This optimization problem is a Lasso problem, and so it can be solved efficiently with state-of-the-art Lasso solvers such as the dual augmented Lagrangian method. A toy sketch of this reformulation is given below.
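The sketch below is a toy version of this reformulation: per-feature Gaussian-kernel Gram matrices are centered, vectorized, and fed to a non-negative Lasso. The kernel bandwidth, the delta kernel for the class labels, the subsampling, and the value of the regularization parameter are all assumptions; practical use would rely on the authors' released solver.

```python
# Sketch of HSIC Lasso: per-feature centered Gaussian-kernel Gram matrices,
# vectorized and fed to a non-negative Lasso; non-zero weights = selected features.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import Lasso
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.preprocessing import scale

X, y = load_breast_cancer(return_X_y=True)
X, y = scale(X[:200]), y[:200]                   # subsample to keep Gram matrices small
m, n = X.shape
Gamma = np.eye(m) - np.ones((m, m)) / m          # centering matrix

def centered_gram(v):
    K = rbf_kernel(v.reshape(-1, 1), gamma=1.0)  # Gaussian-kernel Gram matrix of one feature
    return Gamma @ K @ Gamma

# Output Gram matrix: delta kernel on the class labels (1 if same class, else 0).
L = (y[:, None] == y[None, :]).astype(float)
L_bar = Gamma @ L @ Gamma

# One design-matrix column per feature, containing vec(K_bar^(k)).
A = np.column_stack([centered_gram(X[:, k]).ravel() for k in range(n)])
b = L_bar.ravel()

# Non-negative Lasso; sklearn's alpha plays the role of lambda (up to scaling)
# and would need tuning in practice.
model = Lasso(alpha=1e-3, positive=True, max_iter=10000).fit(A, b)
print("HSIC-Lasso-selected features:", np.flatnonzero(model.coef_ > 0))
```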

Feature Selection Based on Correlation

Correlation feature selection (CFS) evaluates subsets of features on the basis of the following hypothesis: "Good feature subsets contain features that are highly correlated with the classification, yet uncorrelated with each other" [28] [29]. The following equation gives the merit of a feature subset S consisting of k features:

$\mathrm{Merit}_{S_k} = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}.$

Here $\overline{r_{cf}}$ is the average value of all feature-classification correlations, and $\overline{r_{ff}}$ is the average value of all feature-feature correlations. The CFS criterion is defined as follows:

$\mathrm{CFS} = \max_{S_k}\left[\frac{r_{cf_1} + r_{cf_2} + \cdots + r_{cf_k}}{\sqrt{k + 2(r_{f_1 f_2} + \cdots + r_{f_i f_j} + \cdots + r_{f_k f_1})}}\right].$

The variables $r_{cf_i}$ and $r_{f_i f_j}$ are referred to as correlations, but they are not necessarily the Pearson correlation coefficient or Spearman's ρ. Mark Hall's dissertation uses neither of these, but three different measures of relatedness: minimum description length (MDL), symmetrical uncertainty, and relief.

Let x i be the set-membership indicator function for feature f i . Then the above can be rewritten as an optimization problem:

$\mathrm{CFS} = \max_{x \in \{0,1\}^n}\left[\frac{\left(\sum_{i=1}^n a_i x_i\right)^2}{\sum_{i=1}^n x_i + \sum_{i\neq j} 2 b_{ij} x_i x_j}\right].$

The combinatorial problems above are in fact mixed 0-1 linear programming problems that can be solved with branch-and-bound algorithms [30]. A small sketch of the merit score appears below.
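The sketch below evaluates the merit score defined above for candidate subsets, using absolute Pearson correlations as the association measure (the thesis itself uses other measures, as noted); the dataset and the example subsets are arbitrary.

```python
# Sketch of the CFS merit score for a candidate feature subset,
# using absolute Pearson correlations as the association measure.
import numpy as np
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

def cfs_merit(subset):
    k = len(subset)
    # mean feature-class correlation
    r_cf = np.mean([abs(np.corrcoef(X[:, f], y)[0, 1]) for f in subset])
    # mean feature-feature correlation (a single feature has no pairs)
    pairs = [(i, j) for idx, i in enumerate(subset) for j in subset[idx + 1:]]
    r_ff = (np.mean([abs(np.corrcoef(X[:, i], X[:, j])[0, 1]) for i, j in pairs])
            if pairs else 0.0)
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

print("merit of {0, 1, 2}:", round(cfs_merit([0, 1, 2]), 3))
print("merit of {0, 20, 27}:", round(cfs_merit([0, 20, 27]), 3))
```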

Regularized trees

Features selected by a decision tree or by an ensemble of trees have been shown to be redundant. A recent method called "regularized trees" [31] can be used for feature subset selection. Regularized trees penalize, when splitting the current node, the use of a variable that is similar to the variables selected at previous tree nodes. Regularized trees require only one model (or one tree ensemble) to be built, and the algorithm is therefore computationally efficient.

Regularized trees naturally handle numerical and categorical features, interactions, and non-linearities. They are invariant to attribute scales (units) and insensitive to outliers, and thus require little data preprocessing such as normalization. Regularized random forest (RRF) [32] is one type of regularized trees. Guided RRF is an enhanced RRF, guided by the importance scores from an ordinary random forest.

Overview of Metaheuristics Methods

A metaheuristic is a general description of an algorithm intended to solve difficult (typically NP-hard) optimization problems for which no classical solution methods are available. Typically, a metaheuristic is a stochastic algorithm that seeks a global optimum. There are many metaheuristics, from simple local search to complex global search algorithms.

Basic Principles

Feature selection methods are usually divided into three classes, according to how they combine the selection algorithm with the model building.

Filter Method

 
Filter method for feature selection

Filter methods select variables independently of any model. They rely only on general characteristics, such as the correlation of a variable with the prediction target. Filter methods suppress the least interesting variables; the remaining variables become part of a classification or regression model used for classification or prediction. These methods are very efficient in computation time and robust to overfitting [33].

However, filter methods tend to select redundant variables because they do not take into account the relationship between variables. For this reason, these methods are mainly used as preprocessing methods.

Wrapper Method

 
Wrapper method for feature selection

Wrapper methods evaluate subsets of variables and, unlike filter approaches, can detect possible interactions between variables [34]. Their two main drawbacks are:

  • The risk of overfitting increases when the number of observations is insufficient.
  • Significant calculation time when the number of variables is large.

Embedded Method

 
Embedded method for feature selection

Embedded methods have been proposed as an attempt to combine the advantages of the two previous approaches. The learning algorithm takes advantage of its own variable selection process and performs feature selection and classification simultaneously.

Feature Selection Metaheuristics Applications

Below is a survey of applications of metaheuristics to feature selection found in the literature, following the survey given in the thesis of J. Hammon [33].

| Application | Algorithm | Approach | Classifier | Evaluation | Reference |
| --- | --- | --- | --- | --- | --- |
| SNPs | Feature selection using feature similarity | Filter | | r² | Phuong 2005 [34] |
| SNPs | Genetic algorithm | Wrapper | Decision tree | Classification accuracy (10-fold) | Shah, Kusiak 2004 [35] |
| SNPs | Hill climbing | Filter + wrapper | Naive Bayes classifier | Predicted residual sum of squares | Long 2011 [36] |
| SNPs | Simulated annealing | | Naive Bayes classifier | Classification accuracy (5-fold) | Ustunkar 2011 [37] |
| Speech segments | Ant colony algorithm | Wrapper | Artificial neural network | | Al-Ani 2005 |
| Marketing | Simulated annealing | Wrapper | Regression | AIC, r² | Meiri 2006 [38] |
| Economics | Simulated annealing, genetic algorithm | Wrapper | Regression | BIC | Kapetanios 2005 [39] |
| Mass spectrometry | Genetic algorithm | Wrapper | Multiple linear regression, partial least squares | | Broadhurst 2007 [40] |
| Spam | Binary particle swarm optimization + mutation | Wrapper | Decision tree | Weighted cost | Zhang 2014 [14] |
| Microarray | Tabu search + particle swarm optimization | Wrapper | Support vector machine, k-nearest neighbors | Euclidean distance | Chuang, Yang 2009 [41] |
| Microarray | PSO + genetic algorithm | Wrapper | Support vector machine | Classification accuracy (10-fold) | Alba 2007 [42] |
| Microarray | Genetic algorithm + … | Embedded | Support vector machine | Classification accuracy (10-fold) | Duval 2009 [43] |
| Microarray | | Wrapper | Regression | Posterior probability | Hans, Dobra, West 2007 [44] |
| Microarray | Genetic algorithm | Wrapper | k-nearest neighbors | Classification accuracy (leave-one-out CV) | Aitken 2005 [45] |
| Microarray | Hybrid genetic algorithm | Wrapper | k-nearest neighbors | Classification accuracy (leave-one-out CV) | Oh, Moon 2004 [46] |
| Microarray | Genetic algorithm | Wrapper | Support vector machine | Sensitivity and specificity | Xuan 2011 [47] |
| Microarray | Genetic algorithm | Wrapper | Pairwise support vector machine | Classification accuracy (leave-one-out CV) | Peng 2003 [48] |
| Microarray | Genetic algorithm | Embedded | Support vector machine | Classification accuracy (10-fold) | Hernandez 2007 [49] |
| Microarray | Genetic algorithm | Hybrid | Support vector machine | Classification accuracy (leave-one-out CV) | Huerta 2006 [50] |
| Microarray | Genetic algorithm | | Support vector machine | Classification accuracy (10-fold) | Muni, Pal, Das 2006 [51] |
| Microarray | Genetic algorithm | Wrapper | Support vector machine | EH-DIALL, CLUMP | Jourdan 2011 [52] |
| Alzheimer's disease | | Filter | Kernel support vector machine | Classification accuracy (10-fold) | Zhang 2015 [53] |
| Computer vision | Infinite feature selection | Filter | Independent | ROC area under the curve | Roffo 2015 [54] |
| Microarrays | Eigenvector centrality feature selection | Filter | Independent | Average precision, ROC AUC | Roffo, Melzi 2016 [55] |
| XML | Symmetrical tau algorithm | Filter | Structural associative classification | Precision, coverage | Shaharani, Hadzic 2014 |

Feature Selection Embedded in Learning Algorithms

Some learning algorithms perform feature selection as part of the algorithm itself:

  • $l_1$-regularization techniques, such as sparse regression, LASSO, and $l_1$-SVM
  • Regularized trees [31], e.g. the regularized random forest implemented in the RRF package [32]
  • Decision trees [56]
  • Random multinomial logit (RMNL)
  • Auto-encoding networks with a bottleneck layer
  • Submodular feature selection [57] [58] [59]
  • Local-learning-based feature selection [60]. Compared with traditional methods, it does not involve any heuristic search, can easily handle multi-class problems, and works for both linear and non-linear problems. The method is also supported by theoretical results. Numerical experiments showed that it can achieve a close-to-optimal solution even when the data contain more than a million irrelevant features.

See also

  • Cluster analysis
  • Data mining
  • Dimensionality reduction
  • Feature extraction
  • Hyperparameter Optimization


Notes

  1. ↑ 1 2 James, Witten, Hastie, Tibshirani, 2013 , p. 204.
  2. ↑ 1 2 Bermingham, Pong-Wong, Spiliopoulou et al., 2015 , p. 10312.
  3. ↑ 1 2 3 Guyon, Elisseeff, 2003 .
  4. ↑ 1 2 Yang, Pedersen, 1997 .
  5. ↑ Urbanowicz, Meeker, LaCava, Olson, Moore, 2017 .
  6. ↑ Forman, 2003 , p. 1289-1305.
  7. ↑ Zhang, Li, Wang, Zhang, 2013 , p. 32–42.
  8. ↑ Bach, 2008 , p. 33–40.
  9. ↑ Zare, 2013 , p. S14.
  10. ↑ Soufan, Kleftogiannis, Kalnis, Bajic, 2015 , p. e0117988.
  11. ↑ Figueroa, 2015 , p. 162–169.
  12. ↑ Figueroa, Neumann, 2013 .
  13. ↑ Figueroa, Neumann, 2014 , p. 4730–4742.
  14. ↑ 1 2 Zhang, Wang, Phillips, 2014 , p. 22–31.
  15. ↑ Garcia-Lopez, Garcia-Torres, Melian, Moreno-Perez, Moreno-Vega, 2006 , p. 477–489.
  16. ↑ Garcia-Lopez, Garcia-Torres, Melian, Moreno-Perez, Moreno-Vega, 2004 , p. 59–68.
  17. ↑ Garcia-Torres, Gomez-Vela, Melian, Moreno-Vega, 2016 , p. 102-118.
  18. ↑ Kraskov, Stögbauer, Andrzejak, Grassberger, 2003 .
  19. ↑ Einicke, 2018 , p. 1097–1103.
  20. ↑ Aliferis, 2010 , p. 171–234.
  21. ↑ 1 2 3 4 Brown, Pocock, Zhao, Luján, 2012 , p. 27-66.
  22. ↑ Peng, Long, Ding, 2005 , p. 1226-1238.
  23. ↑ Nguyen, Franke, Petrovic, 2010 , p. 1529-1532.
  24. ↑ Rodriguez-Lujan, Huerta, Elkan, Santa Cruz, 2010 , p. 1491-1516.
  25. ↑ 1 2 Vinh, Chan, Romano, Bailey, 2014 .
  26. ↑ Yang, Moody, 2000 , p. 687-693.
  27. ↑ Yamada, Jitkrittum, Sigal, Xing, Sugiyama, 2014 , p. 185-207.
  28. ↑ Hall, 1999 .
  29. ↑ Senliol, Gulgezen, Yu, Cataltepe, 2008 , p. 1-4.
  30. ↑ Nguyen, Franke, Petrovic, 2009 .
  31. ↑ 1 2 Deng, Runger, 2012 .
  32. ↑ 1 2 RRF: Regularized Random Forest , R package in the Comprehensive R Archive Network (CRAN) repository
  33. ↑ 1 2 Hammon, 2013 .
  34. ↑ 1 2 Phuong, Lin, Altman, 2005 , p. 301-309.
  35. ↑ Shah, Kusiak, 2004 , p. 183–196.
  36. ↑ Long, Gianola, Weigel, 2011 , p. 247–257.
  37. ↑ Ustunkar, Ozogur-Akyuz, Weber, Friedrich, Son, 2011 , p. 1207-1218.
  38. ↑ Meiri, Zahavi, 2006 , p. 842-858.
  39. ↑ Kapetanios, 2005 .
  40. ↑ Broadhurst, Goodacre, Jones, Rowland, Kell, 1997 , p. 71-86.
  41. ↑ Chuang, Yang, 2009 , p. 1689-1703.
  42. ↑ Alba, Garia-Nieto, Jourdan, Talbi, 2007 .
  43. ↑ Duval, Hao, Hernandez, 2009 , p. 201-208.
  44. ↑ Hans, Dobra, West, 2007 , p. 507-516.
  45. ↑ Aitken, 2005 , p. 148.
  46. ↑ Oh, Moon, 2004 , p. 1424-1437.
  47. ↑ Xuan, Guo, Wang, Liu, Liu, 2011 , p. 588-603.
  48. ↑ Peng, 2003 , p. 358-362.
  49. ↑ Hernandez, Duval, Hao, 2007 , p. 90-101.
  50. ↑ Huerta, Duval, Hao, 2006 , p. 34-44.
  51. ↑ Muni, Pal, Das, 2006 , p. 106-117.
  52. ↑ Jourdan, Dhaenens, Talbi, 2011 .
  53. ↑ Zhang, Dong, Phillips, Wang, 2015 , p. 66.
  54. ↑ Roffo, Melzi, Cristani, 2015 , p. 4202-4210.
  55. ↑ Roffo, Melzi, 2016 , p. 19-38.
  56. ↑ Kohavi, John, 1997 , p. 273-324.
  57. ↑ Das, Kempe, 2011 .
  58. ↑ Liu, Wei, Kirchhoff, Song, Bilmes, 2013 .
  59. ↑ Zheng, Jiang, Chellappa, Phillip, 2014 .
  60. ↑ Sun, Todorovic, Goodison, 2010 , p. 1610-1626.

Literature

  • Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani. An Introduction to Statistical Learning . - Springer, 2013.
  • Mairead L. Bermingham, Ricardo Pong-Wong, Athina Spiliopoulou, Caroline Hayward, Igor Rudan, Harry Campbell, Alan F. Wright, James F. Wilson, Felix Agakov, Pau Navarro, Chris S. Haley. Application of high-dimensional feature selection: evaluation for genomic prediction in man // Sci. Rep. . - 2015 .-- T. 5 . - DOI : 10.1038 / srep10312 . - . - PMID 25988841 .
  • Othman Soufan, Dimitrios Kleftogiannis, Panos Kalnis, Vladimir B. Bajic. DWFS: A Wrapper Feature Selection Tool Based on a Parallel Genetic Algorithm // PLOS One. - 2015.- T. 10 , no. 2 . - ISSN 1932-6203 . - DOI : 10.1371 / journal.pone.0117988 . - . - PMID 25719748 .
  • Alejandro Figueroa. Exploring effective features for recognizing the user intent behind web queries // Computers in Industry. - 2015. - T. 68. - DOI: 10.1016/j.compind.2015.01.005.
  • Alejandro Figueroa, Guenter Neumann. Learning to Rank Effective Paraphrases from Query Logs for Community Question Answering // 27th AAAI Conference on Artificial Intelligence . - 2013.
  • Alejandro Figueroa, Guenter Neumann. Category-specific models for ranking effective paraphrases in community Question Answering // Expert Systems with Applications. - 2014. - T. 41, no. 10. - DOI: 10.1016/j.eswa.2014.02.004.
  • Zhang Y., Wang S., Phillips P. Binary PSO with Mutation Operator for Feature Selection using Decision Tree applied to Spam Detection // Knowledge-Based Systems. - 2014. - T. 64. - DOI: 10.1016/j.knosys.2014.03.015.
  • Garcia-Lopez FC, Garcia-Torres M., Melian B., Moreno-Perez JA, Moreno-Vega JM Solving feature subset selection problem by a Parallel Scatter Search // European Journal of Operational Research. - 2006. - T. 169 , No. 2 .
  • Garcia-Lopez FC, Garcia-Torres M., Melian B., Moreno-Perez JA, Moreno-Vega JM Solving Feature Subset Selection Problem by a Hybrid Metaheuristic // First International Workshop on Hybrid Metaheuristics. - 2004. - S. 59–68.
  • Garcia-Torres M., Gomez-Vela F., Melian B., Moreno-Vega JM High-dimensional feature selection via feature grouping: A Variable Neighborhood Search approach // Information Sciences. - 2016 .-- T. 326 .
  • Alexander Kraskov, Harald Stögbauer, Ralph G. Andrzejak, Peter Grassberger. Hierarchical Clustering Based on Mutual Information . - 2003. - . - arXiv : q-bio / 0311039 .
  • Nguyen X. Vinh, Jeffrey Chan, Simone Romano, James Bailey. Effective Global Approaches for Mutual Information based Feature Selection // 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'14), August 24–27 . - New York City, 2014.
  • Howard Hua Yang, John Moody. Data visualization and feature selection: New algorithms for nongaussian data // Advances in Neural Information Processing Systems. - 2000.
  • Yamada M., Jitkrittum W., Sigal L., Xing EP, Sugiyama M. High-Dimensional Feature Selection by Feature-Wise Non-Linear Lasso // Neural Computation. - 2014.- T. 26 , No. 1 .
  • Mark A. Hall. Correlation-based Feature Selection for Machine Learning . - 1999.
  • Baris Senliol, Gokhan Gulgezen, Lei Yu, Zehra Cataltepe. Fast Correlation Based Filter (FCBF) with a different search strategy // ISCIS'08. 23rd International Symposium on. . - IEEE, 2008 .-- S. 1-4.
  • Hai Nguyen, Katrin Franke, Slobodan Petrovic. Optimizing a class of feature selection measures // Conference NIPS 2009 Workshop on Discrete Optimization in Machine Learning: Submodularity, Sparsity & Polyhedra (DISCML), Vancouver, Canada, December 2009 . - 2009.
  • Hammon J. Optimization combinatoire pour la sélection de variables en régression en grande dimension: Application en génétique animale. . - 2013.
  • Kohavi R., John G. Wrappers for feature subset selection // Artificial intelligence 97. - 1997. - Vol. 1-2 .
  • Deng H., Runger G. Feature Selection via Regularized Trees // Proceedings of the 2012 International Joint Conference on Neural Networks (IJCNN) . - IEEE, 2012.
  • Phuong TM, Lin Z., Altman RB Choosing SNPs using feature selection // IEEE Computational Systems Bioinformatics Conference, CSB. IEEE Computational Systems Bioinformatics Conference . - 2005. Archived copy of September 13, 2016 on Wayback Machine
  • Gavin Brown, Adam Pocock, Ming-Jie Zhao, Mikel Luján. Conditional Likelihood Maximization: A Unifying Framework for Information Theoretic Feature Selection // Journal of Machine Learning Research. - 2012. - T. 13.
  • Shah SC, Kusiak A. Data mining and genetic algorithm based gene / SNP selection // Artificial Intelligence in Medicine. - 2004 .-- T. 31 , no. 3 . - DOI : 10.1016 / j.artmed.2004.04.002 . - PMID 15302085 .
  • Long N., Gianola D., Weigel KA Dimension reduction and variable selection for genomic selection: application to predicting milk yield in Holsteins // Journal of Animal Breeding and Genetics. - 2011.- T. 128 , no. 4 . - DOI : 10.1111 / j.1439-0388.2011.00917.x . - PMID 21749471 .
  • Ustunkar G., Ozogur-Akyuz S., Weber GW, Friedrich CM, Yesim Aydin Son. Selection of representative SNP sets for genome-wide association studies: a metaheuristic approach // Optimization Letters. - Springer-Verlag, 2011. - November ( t. 6 , issue 6 ). - DOI : 10.1007 / s11590-011-0419-7 .
  • Meiri R., Zahavi J. Using simulated annealing to optimize the feature selection problem in marketing applications // European Journal of Operational Research. - 2006. - Juin ( t. 171 , No. 3 ).
  • Kapetanios G. Variable Selection using Non-Standard Optimization of Information Criteria . - 2005. - (Working Paper, Queen Mary, University of London, School of Economics and Finance).
  • Broadhurst D., Goodacre R., Jones A., Rowland JJ, Kell DB Genetic algorithms as a method for variable selection in multiple linear regression and partial least squares regression, with applications to pyrolysis mass spectrometry // Analytica Chimica Acta. - 1997. - August ( t. 348 , No. 1-3 ).
  • Chuang L.-Y., Yang C.-H. Tabu search and binary particle swarm optimization for feature selection using microarray data // Journal of Computational Biology. - 2009. - T. 16 , no. 12 . - DOI : 10.1089 / cmb.2007.0211 . - PMID 20047491 .
  • Alba E., Garia-Nieto J., Jourdan L., Talbi E.-G. Gene Selection in Cancer Classification using PSO-SVM and GA-SVM Hybrid Algorithms // Congress on Evolutionary Computation, Singapor, 2007 . - Singapore, 2007.
  • Duval B., Hao J.-K., Hernandez JCH Amemetic algorithm for gene selection and molecular classification of an cancer // Proceedings of the 11th Annual conference on Genetic and evolutionary computation, GECCO '09 . - New York, NY, USA: ACM, 2009.
  • Hans C., Dobra A., West M. Shotgun stochastic search for 'large p' regression // Journal of the American Statistical Association. - 2007.- T. 102 , no. 478 . - S. 507-516 . - ISSN 0162-1459 . - DOI : 10.1198 / 016214507000000121 .
  • Isabelle Guyon, André Elisseeff. An Introduction to Variable and Feature Selection // JMLR . - 2003. - T. 3 .
  • Ryan J. Urbanowicz, Melissa Meeker, William LaCava, Randal S. Olson, Jason H. Moore. Relief-Based Feature Selection: Introduction and Review // Journal of Biomedical Informatics. - 2017. - Issue. 85 . - DOI : 10.1016 / j.jbi.2018.07.014 .
  • Yiming Yang, Jan O. Pedersen. A comparative study on feature selection in text categorization // Proceedings of the Fourteenth International Conference on Machine Learning (ICML). - 1997. - ISBN 1-55860-486-3 .
  • George Forman. An extensive empirical study of feature selection metrics for text classification // Journal of Machine Learning Research. - 2003. - T. 3 . - ISSN 1533-7928 .
  • Yishi Zhang, Shujuan Li, Teng Wang, Zigang Zhang. Divergence-based feature selection for separate classes // Neurocomputing. - 2013. - T. 101, no. 4. - DOI: 10.1016/j.neucom.2012.06.036.
  • Francis R. Bach. Bolasso: model consistent lasso estimation through the bootstrap . - Proceedings of the 25th International Conference on Machine Learning. - 2008. - ISBN 9781605582054 . - DOI : 10.1145 / 1390156.1390161 .
  • Habil Zare. Scoring relevancy of features based on combinatorial analysis of Lasso with application to lymphoma diagnosis // BMC Genomics. - 2013 .-- T. 14 . - DOI : 10.1186 / 1471-2164-14-S1-S14 . - PMID 23369194 .
  • Einicke GA Maximum-Entropy Rate Selection of Features for Classifying Changes in Knee and Ankle Dynamics During Running // IEEE Journal of Biomedical and Health Informatics. - 2018 .-- T. 28 , no. 4 . - DOI : 10.1109 / JBHI.2017.2711487 . - PMID 29969403 .
  • Constantin Aliferis. Local causal and markov blanket induction for causal discovery and feature selection for classification part I: Algorithms and empirical evaluation // Journal of Machine Learning Research. - 2010 .-- T. 11 .
  • Peng HC, Long F., Ding C. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy // IEEE Transactions on Pattern Analysis and Machine Intelligence. - 2005. - T. 27 , no. 8 . - DOI : 10.1109 / TPAMI.2005.159 . - PMID 16119262 . Program
  • Nguyen H., Franke K., Petrovic S. Towards a Generic Feature-Selection Measure for Intrusion Detection // 20h International Conference on Pattern Recognition (ICPR) . - Istanbul, Turkey, 2010.
  • Rodriguez-Lujan I., Huerta R., Elkan C., Santa Cruz C. Quadratic programming feature selection // JMLR . - 2010 .-- T. 11 .
  • Aitken S. Feature selection and classification for microarray data analysis: Evolutionary methods for identifying predictive genes // BMC Bioinformatics. - 2005. - T. 6 , no. 1 . - DOI : 10.1186 / 1471-2105-6-148 . - PMID 15958165 .
  • Oh IS, Moon BR Hybrid genetic algorithms for feature selection // IEEE Transactions on Pattern Analysis and Machine Intelligence. - 2004 .-- T. 26 , no. 11 . - DOI : 10.1109 / tpami.2004.105 . - PMID 15521491 .
  • Xuan P., Guo MZ, Wang J., Liu XY, Liu Y. Genetic algorithm-based efficient feature selection for classification of pre-miRNAs // Genetics and Molecular Research. - 2011 .-- T. 10 , no. 2 . - DOI : 10.4238 / vol10-2gmr969 . - PMID 21491369 .
  • Peng S. Molecular classification of cancer types from microarray data using the combination of genetic algorithms and support vector machines // FEBS Letters. - 2003. - T. 555 , no. 2 . - DOI : 10.1016 / s0014-5793 (03) 01275-4 .
  • Jose Crispin Hernandez Hernandez, B´eatrice Duval, Jin-Kao Hao. A genetic embedded approach for gene selection and classification of microarray data // Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics, EvoBIO'07. - Berlin, Heidelberg: Springer Verlag, 2007 .-- T. 4447. - (Lecture Notes in Computer Science). - ISBN 3-540-71782-X .
  • Huerta EB, Duval B., Hao J.-K. A hybrid GA / SVM approach for gene selection and classification of microarray data. Evoworkshops // Applicationsof EvolutionaryComputing. - 2006. - T. 3907. - S. 34-44. - (Lecture Notes in Computer Science).
  • Muni DP, Pal NR, Das J. Genetic programming for simultaneous feature selection and classifier design // IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics. - 2006 .-- T. 36.
  • Laetitia Jourdan, Clarisse Dhaenens, El-Ghazali Talbi. Linkage disequilibrium study with a parallel adaptive GA // International Journal of Foundations of Computer Science. - 2011 .-- T. 16 , no. 2 .
  • Zhang Y., Dong Z., Phillips P., Wang S. Detection of subjects and brain regions related to Alzheimer's disease using 3D MRI scans based on eigenbrain and machine learning // Frontiers in Computational Neuroscience. - 2015. - T. 9. - DOI: 10.3389/fncom.2015.00066. - PMID 26082713.
  • Roffo G., Melzi S., Cristani M. Infinite Feature Selection . - 2015 IEEE International Conference on Computer Vision (ICCV). - 2015. - ISBN 978-1-4673-8391-2 . - DOI : 10.1109 / ICCV.2015.4.4 .
  • Giorgio Roffo, Simone Melzi. Features Selection via Eigenvector Centrality // New Frontiers in Mining Complex Patterns (NFMCP 2016). - Springer, 2016. - T. 10312. - S. 19-38. - (Lecture Notes in Artificial Intelligence (LNAI)). - ISBN 978-3-319-61460-1. - DOI: 10.1007/978-3-319-61461-8. Link points to a slightly different version of the article.
  • Abhimanyu Das, David Kempe. Submodular meets Spectral: Greedy Algorithms for Subset Selection, Sparse Approximation and Dictionary Selection // The 28th International Conference on Machine Learning. - 2011.
  • Yuzong Liu, Kai Wei, Katrin Kirchhoff, Yisong Song, Jeff A. Bilmes. Submodular feature selection for high-dimensional acoustic score spaces // 2013 IEEE International Conference on Acoustics, Speech and Signal Processing . - 2013 .-- DOI : 10.1109 / ICASSP.2013.6639057 .
  • Jinging Zheng, Zhuolin Jiang, Rama Chellappa, P. Jonathon Phillip. Submodular Attribute Selection for Action Recognition in Video // Advances in Neural Information Processing Systems 27 (NIPS 2014) / Z. Ghahramani, M. Welling, C. Cortes, ND Lawrence, KQ Weinberger .. - 2014.
  • Sun Y., Todorovic S., Goodison S. Local-Learning-Based Feature Selection for High-Dimensional Data Analysis // IEEE Transactions on Pattern Analysis and Machine Intelligence. - 2010. - T. 32.

Further Reading

  • Feature Selection for Classification: A Review (Survey, 2014)
  • Feature Selection for Clustering: A Review (Survey, 2013)
  • Tutorial Outlining Feature Selection Algorithms, Arizona State University
  • JMLR Special Issue on Variable and Feature Selection
  • Feature Selection for Knowledge Discovery and Data Mining (Book)
  • An Introduction to Variable and Feature Selection (Survey)
  • Toward integrating feature selection algorithms for classification and clustering (Survey)
  • Efficient Feature Subset Selection and Subset Size Optimization (Survey, 2010)
  • Searching for Interacting Features
  • Feature Subset Selection Bias for Classification Learning

Links

  • Feature Selection Package, Arizona State University (Matlab Code)
  • NIPS challenge 2003 (see also NIPS )
  • Naive Bayes implementation with feature selection in Visual Basic (includes executable and source code)
  • Minimum-redundancy-maximum-relevance (mRMR) feature selection program
  • FEAST (Open source Feature Selection algorithms in C and MATLAB)