Feature Selection with the FSelector Package

I am currently working on the Countable Care Challenge hosted by the Planned Parenthood Federation of America. The dataset for this challenge has over a thousand features. I used feature selection to cut down on runtime and eliminate unnecessary features before building a prediction model. The random.forest.importance function from the FSelector package in R was used for this task.

To demonstrate how random.forest.importance can be used, I chose the SAheart dataset [1] available in the ElemStatLearn package. SAheart is a retrospective sample of males in a heart-disease high-risk region of the Western Cape, South Africa. Patients positive for coronary heart disease (chd) are labeled 1 and patients negative for coronary heart disease are labeled 0.

> library(ElemStatLearn)
> library(FSelector)

> head(SAheart)
  sbp tobacco  ldl adiposity famhist typea obesity alcohol age chd
1 160   12.00 5.73     23.11 Present    49   25.30   97.20  52   1
2 144    0.01 4.41     28.61  Absent    55   28.87    2.06  63   1
3 118    0.08 3.48     32.28 Present    52   29.14    3.81  46   0
4 170    7.50 6.41     38.03 Present    51   31.99   24.26  58   1
5 134   13.60 3.50     27.78 Present    60   25.99   57.34  49   1
6 132    6.20 6.47     36.21 Present    62   30.77   14.14  45   0

The random.forest.importance function rates the importance of each feature in the classification of the outcome, chd. The outcome is first converted to a factor so that the random forest is fit as a classifier rather than as a regression. The function returns a data frame with one row per attribute (the attribute names are the row names) and an importance value based on the mean decrease in accuracy.

> SAheart$chd <- as.factor(SAheart$chd)
> att.scores <- random.forest.importance(chd ~ ., SAheart)
> att.scores
          attr_importance
sbp              4.511806
tobacco         19.434096
ldl              5.959607
adiposity        8.099294
famhist         12.450121
typea            2.472882
obesity         -3.520101
alcohol         -3.420076
age             22.236682
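
The scores are easier to read once they are ranked. The following is a minimal base-R sketch, assuming the importance values printed above are copied by hand into a data frame of the same shape (the name ranked is only illustrative). In a live session the att.scores object returned by random.forest.importance could be sorted the same way.

```r
# Importance scores copied from the output above.
att.scores <- data.frame(
  attr_importance = c(4.511806, 19.434096, 5.959607, 8.099294, 12.450121,
                      2.472882, -3.520101, -3.420076, 22.236682),
  row.names = c("sbp", "tobacco", "ldl", "adiposity", "famhist",
                "typea", "obesity", "alcohol", "age")
)

# Sort rows from most to least important. drop = FALSE keeps the result a
# data frame (with its row names) even though there is only one column.
ranked <- att.scores[order(att.scores$attr_importance, decreasing = TRUE), ,
                     drop = FALSE]
print(ranked)
```

Sorted this way, age and tobacco sit at the top, while obesity and alcohol sit at the bottom with negative scores. A negative mean decrease in accuracy generally suggests that permuting the feature did not hurt the forest's out-of-bag accuracy.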

The FSelector package offers several functions for choosing the best features from the importance values returned by random.forest.importance. The cutoff.biggest.diff function sorts the features by importance and keeps those ranked above the largest gap between consecutive values. cutoff.k returns the k features with the highest importance values. Similarly, cutoff.k.percent returns the top fraction of the features, where k is given as a number between 0 and 1 (0.4 of the nine features here comes to four).
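
To make these selection rules concrete, here is a rough base-R sketch of the logic applied to the scores printed earlier. The helpers top_k, above_biggest_gap and top_fraction are hypothetical names introduced for illustration. They are not FSelector functions and are not claimed to match the package's source; in particular, rounding to the nearest whole number of features is an assumption.

```r
# Scores copied from the random.forest.importance output shown earlier.
scores <- c(sbp = 4.511806, tobacco = 19.434096, ldl = 5.959607,
            adiposity = 8.099294, famhist = 12.450121, typea = 2.472882,
            obesity = -3.520101, alcohol = -3.420076, age = 22.236682)

# Hypothetical helper: names of the k highest-scoring features.
top_k <- function(s, k) names(sort(s, decreasing = TRUE))[seq_len(k)]

# Hypothetical helper: keep every feature ranked above the largest gap
# between neighbouring scores in the sorted list.
above_biggest_gap <- function(s) {
  sorted <- sort(s, decreasing = TRUE)
  gaps <- -diff(sorted)  # drop from each score to the next lower one
  names(sorted)[seq_len(which.max(gaps))]
}

# Hypothetical helper: top fraction p (between 0 and 1) of the features.
# Rounding to the nearest whole feature count is an assumption.
top_fraction <- function(s, p) top_k(s, round(p * length(s)))

print(top_k(scores, 4))
print(above_biggest_gap(scores))
print(top_fraction(scores, 0.4))
```

With these scores, the largest gap falls between tobacco and famhist, so the gap rule keeps only age and tobacco.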

> cutoff.biggest.diff(att.scores)
[1] "age"     "tobacco"
> cutoff.k(att.scores, k = 4)
[1] "age"       "tobacco"   "famhist"   "adiposity"
> cutoff.k.percent(att.scores, 0.4)
[1] "age"       "tobacco"   "famhist"   "adiposity"
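
Once a subset has been chosen, the feature names can be turned into a model formula for the prediction step. The sketch below uses base R's reformulate for this (FSelector's as.simple.formula is another option). The variable names selected and f are illustrative, and the glm call in the comment is only one possible choice of model, not necessarily the one used for the challenge.

```r
# Feature names returned by cutoff.k(att.scores, k = 4) above.
selected <- c("age", "tobacco", "famhist", "adiposity")

# Build the formula chd ~ age + tobacco + famhist + adiposity from the names.
f <- reformulate(selected, response = "chd")
print(f)

# With the SAheart data loaded as earlier in the post, the formula could be
# passed to a model of your choice, for example a logistic regression:
# fit <- glm(f, data = SAheart, family = binomial)
```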

[1] Rousseauw, J., du Plessis, J., Benade, A., Jordaan, P., Kotze, J. and Ferreira, J. (1983). Coronary risk factor screening in three rural communities, South African Medical Journal 64: 430–436.