I am currently working on the Countable Care Challenge hosted by the Planned Parenthood Federation of America. The dataset for this challenge has over a thousand features, so I used feature selection to cut down on runtime and eliminate unnecessary features before building a prediction model. To do this, I used the random.forest.importance function from the FSelector package in R.
To demonstrate how random.forest.importance can be used, I chose the SAheart dataset [1], available in the ElemStatLearn package. SAheart is a retrospective sample of males in a heart-disease high-risk region of the Western Cape, South Africa. The outcome column, chd, is 1 for patients positive for coronary heart disease and 0 for patients negative for coronary heart disease.
The random.forest.importance function rates the importance of each feature in classifying the outcome, chd. It returns a data frame with one row per attribute: the row names are the attribute names, and the single column, attr_importance, holds the importance value. By default (importance.type = 1), that value is the mean decrease in accuracy.
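Here is a minimal sketch of that call, assuming the FSelector and ElemStatLearn packages are both installed. Converting chd to a factor and setting a seed are my own additions: the factor makes the underlying random forest treat the problem as classification, and the seed (42 is arbitrary) only makes the run repeatable.

```r
library(FSelector)

# Load the SAheart data frame from ElemStatLearn (assumed to be installed)
data(SAheart, package = "ElemStatLearn")

# chd is stored as 0/1 integers; convert it to a factor so the
# random forest performs classification rather than regression
SAheart$chd <- as.factor(SAheart$chd)

# Random forests are stochastic, so fix a seed for repeatable values
set.seed(42)

# Rate every other column as a predictor of chd.
# importance.type = 1 requests the mean decrease in accuracy.
weights <- random.forest.importance(chd ~ ., data = SAheart,
                                    importance.type = 1)
print(weights)
```

The importance values vary somewhat from run to run and with the seed, so I would treat the ranking as more informative than the exact numbers.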
The FSelector package offers several functions for choosing the best features from the importance values returned by random.forest.importance. The cutoff.biggest.diff function sorts the features by importance, finds the largest gap between consecutive values, and returns the features above that gap. cutoff.k returns the k features with the highest importance values. Similarly, cutoff.k.percent returns the top fraction of features, where k is given as a number between 0 and 1 (for example, 0.5 for the top half).
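To show how the three cutoff functions differ, here is a small sketch on a hand-made importance table. The feature names (feat_a to feat_f) and the values are made up for illustration and are not results from SAheart. The table only mimics the shape that random.forest.importance returns: attribute names as row names and one attr_importance column.

```r
library(FSelector)

# Hypothetical importance values, for illustration only
weights <- data.frame(
  attr_importance = c(12, 30, 2, 28, 9, 10),
  row.names = c("feat_a", "feat_b", "feat_c", "feat_d", "feat_e", "feat_f")
)

# Sorted, the values are 30, 28, 12, 10, 9, 2; the largest gap is
# between 28 and 12, so only the two features above it are kept
top_gap <- cutoff.biggest.diff(weights)

# The four features with the highest importance values
top_k <- cutoff.k(weights, 4)

# The top half of the features (k is a fraction between 0 and 1)
top_half <- cutoff.k.percent(weights, 0.5)

# Turn a selection into a model formula with chd as the outcome
f <- as.simple.formula(top_gap, "chd")
print(f)
```

With these made-up values, top_gap holds feat_b and feat_d, top_k adds feat_a and feat_f, and top_half holds feat_b, feat_d and feat_a.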
[1] Rousseauw, J., du Plessis, J., Benade, A., Jordaan, P., Kotze, J. and Ferreira, J. (1983). Coronary risk factor screening in three rural communities, South African Medical Journal 64: 430–436.