Saturday 11 April 2015

Are my p-values reliable?

Several colleagues of mine have an ‘unfriendly’ relationship with biometry. They just look for p-values. When I ask them: “Do you really need a p-value for this statement?”, they usually answer: “No, I don’t. But editors do”. I must admit that this is indeed true: p-values have played an extremely important role in the development of rigorous scientific procedures, and it is no wonder that attaching such a measure of ‘strength’ to every main statement in a scientific paper seems highly appropriate. However, not everybody agrees on this, and there are now several dissenting voices (see for example this post here). I think that these voices should be taken as warnings against the abuse of p-values as the sole and final arbiter of whether the null hypothesis is true or false. In particular, the importance of large sample sizes and of confirmatory studies has been frequently highlighted.

In my opinion, there is nothing wrong with p-values, though we should all learn to understand their real meaning. In particular, it is necessary to reinforce the idea that p-values never measure the probability of the hypothesis given a certain body of data! On the contrary, they measure the probability of the data given a certain null hypothesis. The difference may seem small to us biologists, but it is of extreme importance! In other words, we should never say:

  1. there is less than 5% probability that the null is true;
  2. there is less than 5% probability that I’m wrong, if I reject the null.

These statements are nonsense in frequentist statistics. What we can say is:

  1. If the null is true, there is less than a 5% probability of obtaining a dataset such as mine (or a more extreme one). Therefore I reject the null.
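
A minimal simulation in R (my own sketch, not from the original post) may help clarify this point: if we generate many datasets for which the null is true by construction, the p-values are uniformly distributed and about 5% of them fall below 0.05. That 5% is the rate of false rejections under a true null, not the probability that the null is true.

set.seed(1234)
# Draw two samples from the SAME population (so the null is true) and
# record the p-value of a t-test; repeat many times.
pvals <- replicate(10000, {
  y1 <- rnorm(10)
  y2 <- rnorm(10)
  t.test(y1, y2)$p.value
})
mean(pvals < 0.05)  # close to 0.05: the false-rejection rate, not P(null)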

Obviously, it is fundamental that we make sure our p-values make sense, which is a far too often neglected aspect! All statistical tests make assumptions, and it is our sole responsibility to ensure that those assumptions are met before reporting p-values (and editors should make sure that the authors did!)

With reference to F tests in ANOVA, these rely on the basic assumptions of linear models, i.e. that errors are normal, independent and identically distributed (i.i.d.). Independence should be ensured by an appropriate experimental design, but we should always check that:

  1. there are no outliers;
  2. errors are gaussian;
  3. errors are homoscedastic.

How should we do this?

A case study

Let’s consider the following dataset:

sugarData <- structure(data.frame(
  Treat = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 
5L, 5L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 
8L, 9L, 9L, 9L, 9L, 9L, 10L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 
11L, 11L, 12L, 12L, 12L, 12L, 12L, 13L, 13L, 13L, 13L, 13L, 14L, 
14L, 14L, 14L, 14L, 15L, 15L, 15L, 15L, 15L, 16L, 16L, 16L, 16L, 
16L, 17L, 17L, 17L, 17L, 17L, 18L, 18L, 18L, 18L, 18L, 19L, 19L, 
19L, 19L, 19L, 20L, 20L, 20L, 20L, 20L, 21L, 21L, 21L, 21L, 21L, 
22L, 22L, 22L, 22L, 22L, 23L, 23L, 23L, 23L, 23L, 24L, 24L, 24L, 
24L, 24L), .Label = c("1", "2", "3", "4", "5", "6", "7", "8", 
"9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", 
"20", "21", "22", "23", "24"), class = "factor"), 
  Block = structure(c(1L, 
2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 
3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 
4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 
5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 
1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 
2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 
3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 
4L, 5L, 1L, 2L, 3L, 4L, 5L), .Label = c("1", "2", "3", "4", "5"
), class = "factor"), 
  Sugar = c(4.812, 5.515, 5.546, 4.444, 5.895, 
3.138, 3.113, 2.007, 2.877, 3.306, 5.516, 8.151, 7.258, 9.246, 
6.212, 7.614, 9.283, 7.019, 8.015, 5.117, 11.477, 10.671, 10.567, 
9.8, 10.111, 10.019, 8.686, 8.904, 8.956, 7.975, 4.356, 5.689, 
7.35, 4.767, 5.122, 7.591, 8.071, 9.086, 9.285, 8.312, 5.806, 
6.524, 7.365, 7.728, 8.784, 6.063, 5.833, 3.623, 7.196, 6.464, 
8.117, 9.526, 9.901, 10.175, 8.911, 7.608, 8.442, 8.272, 7.738, 
9.684, 3.699, 4.14, 3.073, 3.979, 2.199, 3.618, 3.604, 2.917, 
5.287, 1.698, 2.593, 3.808, 4.93, 2.649, 4.607, 6.584, 6.437, 
7.934, 7.365, 5, 6.508, 5.74, 3.999, 7.025, 6.699, 5.699, 5.918, 
5.341, 5.119, 6.984, 4.945, 5.471, 6.563, 3.81, 4.531, 5.345, 
6.564, 6.81, 6.878, 8.586, 5.968, 6.306, 6.28, 4.634, 6.103, 
6.85, 5.914, 6.447, 7.146, 7.686, 9.014, 7.96, 9.765, 5.587, 
7.254, 6.712, 7.4, 8.494, 7.766, 7.497)), 
  .Names = c("Treat", "Block", "Sugar"))

head(sugarData)
##   Treat Block Sugar
## 1     1     1 4.812
## 2     1     2 5.515
## 3     1     3 5.546
## 4     1     4 4.444
## 5     1     5 5.895
## 6     2     1 3.138
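
As a quick sanity check (my own addition, not in the original post), we can verify that every treatment occurs exactly once in each block, as expected in a complete block layout:

# Cross-tabulate treatments and blocks: all cells should equal 1
all(with(sugarData, table(Treat, Block)) == 1)
## [1] TRUE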

This dataset comes from an experiment with one experimental treatment (Treat) with 24 levels, investigated in relation to its effect on the sugar yield of sugar beet. The experiment was laid out as a randomised complete block design and the ANOVA can be performed as:

mod.aov <- lm(Sugar ~ Block + Treat, data=sugarData)
anova(mod.aov)
## Analysis of Variance Table
## 
## Response: Sugar
##           Df Sum Sq Mean Sq F value Pr(>F)    
## Block      4   2.65  0.6627  0.6125 0.6547    
## Treat     23 443.57 19.2855 17.8262 <2e-16 ***
## Residuals 92  99.53  1.0819                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In this case the F test for the treatment effect is highly significant, but we should still be concerned about the meaning of such a p-value: is it reliable? We need to check the basic assumptions.

Graphical analysis of residuals

A graphical inspection of residuals is enough to spot most of the problems with the basic assumptions. It can be obtained by:

par(mfrow=c(2,2))  # arrange the four diagnostic plots in a 2x2 grid
plot(mod.aov)      # residuals vs. fitted, QQ-plot, scale-location, leverage

The result is quite reassuring: the ‘residuals vs. fitted’ plot shows a random scatter of residuals (there are no visible patterns), while the QQ-plot shows that the standardised residuals lie along the bisector line. Very good, then!

Some possible ‘suspect’ patterns in the ‘residuals vs. fitted’ plot look like these:

[Figure: examples of ‘suspect’ residuals vs. fitted plots]
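
To see where such patterns come from, here is a small simulation of mine (not part of the original post) that produces the typical ‘funnel’ shape caused by heteroscedasticity:

set.seed(42)
x <- rep(1:10, each = 5)
y <- 2 * x + rnorm(50, sd = 0.5 * x)  # error SD grows with the mean
badMod <- lm(y ~ x)
plot(fitted(badMod), residuals(badMod),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Funnel pattern: variance increasing with the mean")
abline(h = 0, lty = 2)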

Testing for homoscedasticity

Sometimes a more formal test is required. One possibility is Bartlett’s test for homoscedasticity:

bartlett.test(Sugar ~ Treat, data=sugarData)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  Sugar by Treat
## Bartlett's K-squared = 18.1506, df = 23, p-value = 0.7493

which confirms that there are no problems. Be warned: this test relies on the normality assumption! A more robust alternative is Levene’s test (be careful to remove the Block effect!):

library(car)
## Warning: package 'car' was built under R version 3.0.3
leveneTest(Sugar ~ Treat, data=sugarData)
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group 23  0.6047 0.9159
##       96

This is also reassuring, in our case.
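
As for the parenthetical warning above: the call shown tests Sugar by Treat on the raw data. One way to actually remove the Block effect first (a sketch of mine, not shown in the original post) is to run Levene’s test on the residuals of a block-only model:

# Remove the Block effect, then test the ‘de-blocked’ values by Treat
sugarData$deblocked <- residuals(lm(Sugar ~ Block, data = sugarData))
leveneTest(deblocked ~ Treat, data = sugarData)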

Testing for normality

Normality is rarely an issue, as ANOVA is rather robust against this type of violation. If we need a formal test, we can use the Shapiro-Wilk test on the residuals from the model fit, scaled by the residual standard error (similar to the standardised residuals used for the QQ-plot):

standardRes <- residuals(mod.aov)/summary(mod.aov)$sigma  # residuals scaled by the residual standard error
shapiro.test(standardRes)
## 
##  Shapiro-Wilk normality test
## 
## data:  standardRes
## W = 0.9876, p-value = 0.3452

which confirms what we have already observed on the QQ-plot. Please do not give too much importance to this test: the graphical inspection of the QQ-plot is much more useful!
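
Incidentally (a note of mine, not in the original post), the Shapiro-Wilk statistic is location- and scale-invariant, so the scaling above does not affect the outcome:

# Raw residuals give exactly the same W and p-value as the scaled ones
shapiro.test(residuals(mod.aov))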

Help me: I have problems!

Only after these fundamental checks can we be sure that the p-values in ANOVA are reliable. In our case everything was OK, but what if some violation is discovered?

The simplest answer is that you need to transform your data. Instead of selecting a transformation by common sense (which is, by the way, a correct approach in most cases), we can select one from a family of transformations, such as:

\[ W = Y^\lambda \quad \textrm{if} \quad \lambda \ne 0 \]

\[ W = \log(Y) \quad \textrm{if} \quad \lambda = 0 \]

where W is the transformed variable, Y is the original variable and \(\lambda\) is the transformation parameter. By varying the latter we obtain almost all the main transformations: \(\lambda = 1\) corresponds to untransformed data, \(\lambda = 0\) to the logarithmic transformation, \(\lambda = 0.5\) to the square-root transformation, and so on.
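
As a minimal illustration (my own sketch, not part of the original post), the family above can be coded as a tiny function:

# W = Y^lambda if lambda != 0; W = log(Y) if lambda == 0
ptrans <- function(Y, lambda) {
  if (lambda == 0) log(Y) else Y^lambda
}
ptrans(c(1, 4, 9), lambda = 0.5)  # square-root transformation
## [1] 1 2 3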

We can select \(\lambda\) by maximum likelihood, with the help of a graphical tool that is available in R as:

library(MASS)
## Warning: package 'MASS' was built under R version 3.0.3
boxcox(mod.aov)

Obviously, in this case the maximum-likelihood estimate of \(\lambda\) is close to 1, which confirms that this dataset does not require any correcting measure.
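
If a number is preferred to a graph, the maximum-likelihood value of \(\lambda\) can also be extracted from the object returned by boxcox() (a convenience step of mine, not in the original post):

bc <- boxcox(mod.aov, plotit = FALSE)  # profiled log-likelihood over a grid of lambda values
bc$x[which.max(bc$y)]                  # the lambda that maximises the likelihood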

Should a transformation be required, do not be scared: transform your data and repeat the analyses. Unfortunately, the discussion and presentation of results become slightly more complex… You may take a look at my other post on this subject.
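
For instance, had the Box-Cox plot pointed at \(\lambda = 0\), refitting on the transformed scale would take one line (a hypothetical follow-up; our dataset does not need it):

mod.log <- lm(log(Sugar) ~ Block + Treat, data = sugarData)  # log-transformed response
anova(mod.log)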

Further readings

  • Ahrens, W. H., D. J. Cox, and G. Budwar. 1990. Use of the arcsine and square root transformations for subjectively determined percentage data. Weed Science: 452-458.
  • Anscombe, F. J. and J. W. Tukey. 1963. The examination and analysis of residuals. Technometrics 5: 141-160.
  • Babbini, M., B. Chiandotto, G. Chisci, R. d. Cristofaro, G. A. Maccararo, N. Montanaro, F. Nicolis, E. Ottaviano, F. Salvi, and M. Turri. 1978. Biometria: principi e metodi per studenti e ricercatori biologi. Padova. P. 552.
  • Box, G. E. P. and D. R. Cox. 1964. An analysis of transformations. Journal of the Royal Statistical Society B-26: 211-243, discussion 244-252.
  • Camussi, A., F. Moller, E. Ottaviano, and M. Sarli Gorla. 1986. Metodi statistici per la sperimentazione biologica. Ed. Zanichelli.
  • Chatfield, C. 1985. The initial examination of data. Journal of the Royal Statistical Society A-148: 214-253.
  • D’Elia, A. 2001. Un metodo grafico per la trasformazione di Box-Cox: aspetti esplorativi ed inferenziali. Statistica LXI: 630-648.
  • Draper, N. R. and H. Smith. 1981. Applied regression analysis. 2nd ed. John Wiley & Sons, New York.
  • Sakia, R. M. 1992. The Box-Cox transformation technique: a review. The Statistician 41: 169-178.
  • Snedecor, G. W. and W. G. Cochran. 1991. Statistical methods. 8th ed. Iowa State University Press, Ames (Iowa). P. 503.
