First, great library and related blog posts. I was beginning to code this procedure myself when I stumbled upon your work. Here is my question / concern. I am using data likely akin to Uber's marketing data (a mix of continuous and dummy-coded features, some highly predictive, some irrelevant, and correlation between engineered features). If I look at the complete list of features and count how many have an absolute correlation above 0.6 with another feature, there are many. After feature selection, the selected set shows proportionally even more correlation. The issue seems to be that the F-statistic can be very large for some correlated features, and it can't be dampened enough by the denominator.
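For reference, here is how I read the default FCQ-style score (my notation, so treat it as an assumption rather than a quote from the code): at each step a candidate feature $i$ is ranked by $\text{score}_i = F(i, y) \big/ \tfrac{1}{|S|}\sum_{s \in S} |\rho(i, s)|$, where $F$ is the ANOVA F-statistic, $S$ is the set of already-selected features, and $\rho$ is the Pearson correlation. Since the denominator is a mean of absolute correlations it can never exceed 1, so a candidate with an F-statistic in the thousands keeps a huge score even when it is almost perfectly correlated with a feature that has already been selected.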
Here is an example based on your quick start (with a few changes):
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from mrmr import mrmr_classif

# create some data
X, y = make_classification(n_samples = 1000, n_features = 100, n_informative = 10, n_redundant = 40)
X = pd.DataFrame(X)
y = pd.Series(y)

# absolute pairwise correlations, clipped away from zero
corr_X = X.corr().abs().clip(0.00001)

# count, for each feature, how many OTHER features it is correlated with above the threshold
threshold_corr = 0.6
pdf_feature_cnt_corr = corr_X.apply(lambda x: sum(x > threshold_corr) - 1, axis = 1)
ax = pdf_feature_cnt_corr.value_counts().sort_index().plot(kind = 'bar')
ax.set_xlabel(f'Number of Features Correlated With Feature\n Above {threshold_corr}')
ax.set_ylabel('Number of Features')
ax.bar_label(ax.containers[0])

# use mrmr classification
selected_features = mrmr_classif(X, y, K = 10)

# repeat the count, restricted to the selected features only
plt.figure()
pdf_feature_cnt_corr = corr_X.loc[selected_features, selected_features].apply(lambda x: sum(x > threshold_corr) - 1, axis = 1)
ax = pdf_feature_cnt_corr.value_counts().sort_index().plot(kind = 'bar')
ax.set_xlabel(f'Number of Features Correlated With Feature\n Above {threshold_corr}')
ax.set_ylabel('Number of Features')
ax.bar_label(ax.containers[0])
plt.show()
It seems to me we end up with far fewer features, but the ones that remain show a strong amount of correlation, in terms of the proportion of candidate model features that are correlated with each other.
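To put a number on that proportion, here is a small continuation of the snippet above (it reuses corr_X, threshold_corr and selected_features; the variable names are just illustrative):

# share of features with at least one neighbour above the correlation threshold,
# among all candidates vs. among the selected features only
prop_all = (corr_X.apply(lambda x: sum(x > threshold_corr) - 1, axis = 1) > 0).mean()
prop_selected = (corr_X.loc[selected_features, selected_features]
                       .apply(lambda x: sum(x > threshold_corr) - 1, axis = 1) > 0).mean()
print(f'correlated share - all features: {prop_all:.2f}, selected features: {prop_selected:.2f}')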
Hello, we ran into this issue as well. Our solution was to transform the denominator (leveraging the redundancy parameter) using something like $\frac{1}{(1 - |\text{correlation}|)^4}$, so that highly correlated features (say, correlation > 0.95) are severely penalized. We also took the square root of the F-statistic.
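For anyone who wants to experiment with that idea outside the library, here is a minimal sketch of a greedy mRMR-style loop with my reading of that scoring: the square root of the F-statistic as relevance, and the mean of $\frac{1}{(1 - |\rho|)^4}$ over the already-selected features as redundancy. The function name and the 1e-6 clipping are my own choices, not part of the mrmr API:

import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif

def mrmr_penalized(X, y, K):
    # relevance: square root of the ANOVA F-statistic of each feature vs. the target
    relevance = pd.Series(np.sqrt(f_classif(X, y)[0]), index = X.columns)
    corr = X.corr().abs()
    selected, candidates = [], list(X.columns)
    for _ in range(K):
        if not selected:
            best = relevance.loc[candidates].idxmax()
        else:
            # redundancy: mean of 1 / (1 - |corr|)^4 over the selected features,
            # which explodes as any correlation approaches 1
            gap = (1.0 - corr.loc[candidates, selected]).clip(lower = 1e-6)
            penalty = (1.0 / gap ** 4).mean(axis = 1)
            best = (relevance.loc[candidates] / penalty).idxmax()
        selected.append(best)
        candidates.remove(best)
    return selected

selected = mrmr_penalized(X, y, K = 10)

With this penalty, a single correlation of 0.95 with an already-selected feature adds a term of $1 / 0.05^4 = 160{,}000$ to the averaged denominator, which outweighs even a very large F-statistic.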