Are my datasets different?
I ran into the following problem:
I am performing a regression analysis. I have a training dataset named train and a test dataset named test with a single dependent variable (my output). I find that when performing cross validation on the training set I get a low MSE and a high R2 value (yea!). Now that I have some faith this model could work, I train a model using all my training data and then predict on the features from the test set. When I check the R2 I find that it is negative and the MSE is much higher. So what could be my issue?
In my case I am using data that was collected at different times, i.e. train was collected a year before test. Is it possible some of the features have different distributions in the two datasets? I can use the following tests to find out.
The tests and methods I’ll cover in this post can be used for a number of different purposes outside of the example I provided.
Purpose
The purpose of this post is to provide examples of non-parametric tests and methods along with brief (generalized) descriptions of what each test does. If you are looking for advice on which test to choose for your application, that is beyond the scope of this post. However I would point you to the links below as a great starting point:
How to choose between t-test or non-parametric test
When to Use Non-Parametric Tests
Example 1: Single Feature Comparison
For this example I will focus on a single feature with two labels, a and b. To clarify, a is one of my features from the train dataset and b is the same feature from the test dataset.
Probability Density of a and b
Empirical CDF of a and b
Kolmogorov-Smirnov (KS) test
The KS test compares the cumulative distribution functions (CDFs) of two sample sets. It calculates a D statistic, which measures the maximum discrepancy between the two CDFs.
Things to know about KS:
- Null Hypothesis: The two distributions are the same
- p-value can be used to determine rejection of the null hypothesis (i.e. p < 0.05 -> reject)
- D close to 1 indicates the two samples are from different distributions
- D closer to 0 indicates the two samples are from the same distribution
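Running the KS test in R is a one-liner. A minimal sketch, assuming a and b are numeric vectors holding the feature values from the train and test sets:

```r
# Two-sample KS test comparing the empirical CDFs of a and b
ks_result <- ks.test(a, b)
ks_result$statistic  # D: maximum distance between the two empirical CDFs
ks_result$p.value    # small p-value -> reject the null hypothesis
```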
In the case of our example dataset we can see the KS test returns a p-value < 0.05, so we would reject the null hypothesis. However we can also see that the D statistic is only 0.1, which tells us that although the distributions are different, they are not very far apart.
Median Comparison
The next set of tests, the Wilcoxon signed-rank test and the Mann-Whitney U test, examine a difference in the median value between the two distributions. The Wilcoxon signed-rank test is used for paired data, while the Mann-Whitney U test is used for unpaired data.
Given the example I cited at the beginning of the post, one would assume the data is not paired, but I will go through how to test paired samples anyway. In R, both the Wilcoxon and Mann-Whitney tests are carried out using the wilcox.test function. To implement the Wilcoxon test the paired argument should be set to TRUE; to implement the Mann-Whitney test, paired is set to FALSE.
Things to know about Wilcoxon Signed-Rank Test:
- For use with matched/paired data.
- Null Hypothesis: The median difference between the pairs is zero.
- W: Test Statistic
- Non-zero median differences will cause this value to be very large (negative or positive)
- Small values when median difference is zero/close to zero.
- p-value: significance value used to determine if null hypothesis should be rejected (i.e. p <0.05 -> reject)
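A minimal sketch of the paired test in R, assuming a and b are matched numeric vectors of equal length:

```r
# Wilcoxon signed-rank test: requires paired (matched) observations
wilcox.test(a, b, paired = TRUE)
```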
The results of the Wilcoxon test are a p-value less than 0.05 and a large statistic value. In this case we reject the null hypothesis that the difference between the medians is zero. (Note: R lists the W statistic I mentioned before as V.)
Things to know about the Mann-Whitney U Test:
- For use with unmatched/unpaired data
- Null Hypothesis: The distributions are identical.
- U statistic:
- If U is close to 0 (or to n1*n2), the medians are very different
- If the medians are similar U will be close to n1*n2/2.
- where n1 is the number of samples in distribution 1 and n2 is the number of samples in distribution 2.
- p-value : significance value used to determine if null hypothesis should be rejected (i.e. p <0.05 -> reject)
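And a sketch of the unpaired version, again assuming a and b are the two sample vectors:

```r
# Mann-Whitney U test: wilcox.test with paired = FALSE (the default)
wilcox.test(a, b, paired = FALSE)
```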
We can see in this case the p-value is less than 0.05, so we reject the null hypothesis. However if we look at the value of the U statistic we can see it is reasonably close to 5e5 (n1*n2/2). From this we can gather that the null hypothesis should be rejected, but the medians are not all that far from one another.
Permutation Tests
Permutation tests are a method that can be used to estimate any statistic you specify, such as the difference in means, the difference in medians, or the difference between CDFs.
Things to know about permutation tests:
- Permutation tests can be used with any desired statistic.
- The way it works:
  - For N repetitions:
    - Shuffle the labels of the datasets randomly.
      - Example: labels [a,a,a,b,b,b] corresponding to values [1,2,3,4,5,6]
      - After the shuffle: labels [a,b,b,a,a,b], values stay the same [1,2,3,4,5,6]
    - Calculate the statistic of the shuffled set.
      - Example using the difference between the means (similar to a t-test) and the shuffle above:
        - mean of a = mean([1,4,5]) = 3.33
        - mean of b = mean([2,3,6]) = 3.67
        - difference between the means = 0.34
    - Store the value.
  - Calculate the difference between the sample means of the two original datasets (with their original labels).
  - Compare the N shuffled mean differences with the original sample mean difference:
    - mean(abs(N_mean_differences) > abs(sample_mean_a - sample_mean_b)) = p-value approximation
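A sketch of how these steps can be implemented in R, assuming a and b are the two feature vectors; the names ds, mean_diff, and perm_test are illustrative rather than taken from the original snippet:

```r
# Combine both samples into one dataframe: original label plus value
ds <- data.frame(name  = c(rep("a", length(a)), rep("b", length(b))),
                 value = c(a, b))

# Difference in group means for a given label assignment
mean_diff <- function(labels, values) {
  mean(values[labels == "a"]) - mean(values[labels == "b"])
}

# Permutation test: shuffle the labels N times and recompute the statistic
perm_test <- function(ds, N = 10000) {
  observed <- mean_diff(ds$name, ds$value)
  shuffled <- replicate(N, mean_diff(sample(ds$name), ds$value))
  # Approximate p-value: fraction of shuffles at least as extreme as observed
  mean(abs(shuffled) >= abs(observed))
}

perm_test(ds)
```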
The functions in the snippet above carry out the example I outlined previously: they compute the difference between the means each time the data is shuffled. To make things a bit more compact I first create a dataframe, ds, which contains the name of the original distribution and the values from the original distribution.
We can see from the results that the estimated p-value for the permutation test is less than 0.05, indicating that the null hypothesis should be rejected. In the case of this test my null hypothesis was the same as for a t-test: the means of the two distributions are equal.
The plot above shows the resulting distribution of the difference in means from each permutation. The difference between the true sample means is shown by the dashed red line. We can see that the distribution doesn't overlap the difference between the sample means, further reinforcing the rejection of the null hypothesis.
Example 2: Multiple Feature Testing with Broom
The above example is great if we only have one feature to check, but what if I'm trying to determine which features have the same distribution in both datasets and which have different distributions? We could run the same tests feature by feature, but then we end up with a pile of messy test output objects (lists). That is where the broom package comes into play: it takes the output of each test and puts it into a tidy dataframe.
First I'll create a dataframe which has two features: one with a gamma distribution and one with a normal distribution. For the gamma-distributed feature, the parameters of the distribution vary depending on the label, a or b.
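A sketch of how such a dataframe could be built; the sample size and distribution parameters below are placeholders, not necessarily the values used to generate the results shown later:

```r
set.seed(42)
n <- 1000

df <- data.frame(
  # feature_1: normally distributed, same parameters for both labels
  feature_1 = rnorm(2 * n, mean = 0, sd = 1),
  # feature_2: gamma distributed, parameters depend on the label
  feature_2 = c(rgamma(n, shape = 2, rate = 1),    # label a
                rgamma(n, shape = 4, rate = 1.5)), # label b
  label     = rep(c("a", "b"), each = n)
)
```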
In the resulting dataframe each row is an observation and each column is a variable; the first two columns are independent variables (features) and the last column is the label of the dataset they reside in (a is from the train dataset, b is from the test dataset). I didn't create a dependent variable in this case as we are only concerned with checking whether a given feature has a different distribution based on which dataset it came from.
To perform my test I want to reshape my data a bit.
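A minimal sketch of that reshaping with tidyr, assuming the dataframe above is named df:

```r
library(tidyr)

# Gather the feature columns into long format:
# one row per (label, feature_name, feature_value)
df_long <- gather(df, key = "feature_name", value = "feature_value",
                  feature_1, feature_2)
```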
In the above snippet I used the tidyr package to gather the feature columns so that I now have a feature_name column and a feature_value column. This structure will allow me to run my remaining analysis with broom.
Shapiro-Wilk
The first test I'm going to run here is the Shapiro-Wilk test for normality. The Shapiro-Wilk test is one of the available options for testing normality, along with the KS test (using a normal distribution for comparison) and the Anderson-Darling test.
Things to know about the Shapiro-Wilk test:
- Null Hypothesis: The data is normally distributed.
- W statistic:
- Similar to a correlation between the sample data and the ideal score from a normal distribution
- Values between 0 and 1
- Value closer to 1 would suggest a good fit to normality (sample matches the scores of a normal distribution)
- Value closer to 0 would suggest non-normality
- p-value : significance value used to determine if null hypothesis should be rejected (i.e. p < 0.05 -> reject)
- A q-q plot should also be used to plot the sample values against normally distributed values.
- Quantiles from the sample data should run along the diagonal of the plot (think x = y). The plot can be used to help support or dispute the rejection of the null hypothesis.
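A sketch of running the test per feature and label with dplyr and broom, assuming the long-format dataframe from the previous step is named df_long:

```r
library(dplyr)
library(broom)

# Shapiro-Wilk test for each feature/label combination, tidied into one dataframe
df_long %>%
  group_by(feature_name, label) %>%
  do(tidy(shapiro.test(.$feature_value)))
```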
From the above test results we can gather that the null hypothesis should only be rejected for feature 2 (both the a and b samples). However, let's assume each feature comes from a single distribution (a and b are the same, which we already know is not true for feature 2). We can examine the q-q plot for each feature and do a follow-up Shapiro-Wilk test on each feature as a whole (a and b combined).
Q-Q Plot of Feature 1 (a and b)
In the q-q plot above the sample data runs along the diagonal (the points are the sample data and the line represents a normal distribution). This agrees with the result from the Shapiro-Wilk test, which has a p-value > 0.05 for both breakdowns of feature 1, indicating that the null hypothesis (the distribution being tested is normal) should not be rejected.
Q-Q Plot of Feature 2 (a and b)
The q-q plot and Shapiro-Wilk test for feature 2 show a much different result. The test has a very small p-value, indicating the null hypothesis should be rejected. The q-q plot confirms this result, as we can see the sample data does not overlay the solid line representing a normal distribution.
The above results confirm what we already know, but I wanted to make sure to run through at least one test for normality before moving on with the non-parametric tests.
The following snippets run through the KS test, the Mann-Whitney U test, and a permutation test that checks the difference between the median values of the two distributions. Each test is implemented using the broom package along with dplyr so that I can group by feature and compare the distributions of the a and b labeled data.
KS
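A sketch of the grouped KS test, comparing the a and b samples within each feature (again assuming the long-format dataframe df_long):

```r
# Two-sample KS test per feature, comparing the a and b samples
df_long %>%
  group_by(feature_name) %>%
  do(tidy(ks.test(.$feature_value[.$label == "a"],
                  .$feature_value[.$label == "b"])))
```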
Result of KS Test:
Null Hypothesis: The two distributions are the same
- feature 1 : Do not reject the null hypothesis
- feature 2 : Reject null hypothesis
Mann-Whitney U Test
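A sketch of the grouped Mann-Whitney test:

```r
# Mann-Whitney U test per feature (wilcox.test with paired = FALSE)
df_long %>%
  group_by(feature_name) %>%
  do(tidy(wilcox.test(.$feature_value[.$label == "a"],
                      .$feature_value[.$label == "b"],
                      paired = FALSE)))
```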
Result of Mann-Whitney U Test:
Null Hypothesis: The distributions are identical.
- feature 1 : Do not reject null hypothesis
- feature 2 : Reject null hypothesis
Permutation Test
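A sketch of a grouped permutation test on the difference in medians; the helper median_perm is illustrative, not from the original post:

```r
# Permutation test on the difference in medians for one feature
median_perm <- function(values, labels, N = 1000) {
  observed <- median(values[labels == "a"]) - median(values[labels == "b"])
  shuffled <- replicate(N, {
    s <- sample(labels)
    median(values[s == "a"]) - median(values[s == "b"])
  })
  data.frame(observed = observed,
             p.value  = mean(abs(shuffled) >= abs(observed)))
}

# Run the permutation test per feature
df_long %>%
  group_by(feature_name) %>%
  do(median_perm(.$feature_value, .$label))
```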
Result of Median Difference Permutation Test:
Null Hypothesis: The median difference between the distributions is zero.
- feature 1 : Do not reject null hypothesis
- feature 2 : Reject null hypothesis
Conclusion for Example 2
Test outcomes:
| Feature | Shapiro-Wilk | KS | Mann-Whitney | Permutation Test |
|---|---|---|---|---|
| Feature 1 | Do Not Reject | Do Not Reject | Do Not Reject | Do Not Reject |
| Feature 2 | Reject | Reject | Reject | Reject |
The table above contains the results from the tests performed on the sample data. In this case all our tests reinforce what we already know about the dataset. This is not to say all the tests will always agree, but using the broom package it is simple enough to run through several tests and compare the results.