[…] lm, glm, lmer, or glmer? (result: 12 out of 48 people)
[…] the contrasts command? (result: three or four)
A test especially for those who have worked with this before.
score <- c(34, 45, 33, 54, 45, 23, 66, 35, 24, 50,
67, 50, 64, 43, 47, 50, 68, 56, 60, 39)
region <- c("D", "D", "D", "D", "D", "D", "D", "D", "D", "D",
"F", "F", "F", "F", "F", "F", "F", "F", "F", "F")
geslacht = c("M", "V", "M", "M", "V", "M", "V", "V", "M", "V",
"M", "M", "M", "M", "M", "V", "V", "V", "V", "V")
table <- data.frame (region, geslacht, score)
table
## region geslacht score
## 1 D M 34
## 2 D V 45
## 3 D M 33
## 4 D M 54
## 5 D V 45
## 6 D M 23
## 7 D V 66
## 8 D V 35
## 9 D M 24
## 10 D V 50
## 11 F M 67
## 12 F M 50
## 13 F M 64
## 14 F M 43
## 15 F M 47
## 16 F V 50
## 17 F V 68
## 18 F V 56
## 19 F V 60
## 20 F V 39
(“geslacht” is Dutch for gender; “M” stands for male and “V” (vrouw) for female)
We run a linear model:
model <- lm (score ~ region * geslacht, table)
summary (model)
##
## Call:
## lm(formula = score ~ region * geslacht, data = table)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.6 -7.8 -1.9 6.5 20.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.60 5.08 6.61 6e-06 ***
## regionF 20.60 7.19 2.87 0.011 *
## geslachtV 14.60 7.19 2.03 0.059 .
## regionF:geslachtV -14.20 10.16 -1.40 0.181
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.4 on 16 degrees of freedom
## Multiple R-squared: 0.411, Adjusted R-squared: 0.301
## F-statistic: 3.73 on 3 and 16 DF, p-value: 0.0331
So far, so good. We now decide to code gender not in Dutch but in English, i.e. by changing “V” to “F”.
(result: 4 out of 48 people answered g, and 1 of the 12 “experts” answered g)
If you answered “yes” to the first question and either “no” to the second question or something different from “g” to the last question, you may be in trouble.
Many experimental designs can be captured in a linear model. Consider the Dutch–Flemish case again:
dutch = c(34, 45, 33, 54, 45, 23, 66, 35, 24, 50)
flemish = c(67, 50, 64, 43, 47, 50, 68, 56, 60, 39)
t.test (dutch, flemish, var.equal = TRUE)
##
## Two Sample t-test
##
## data: dutch and flemish
## t = -2.512, df = 18, p-value = 0.02176
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -24.791 -2.209
## sample estimates:
## mean of x mean of y
## 40.9 54.4
We can achieve the same result by regarding this design as a linear model with one independent variable (factor) and one dependent variable.
The score is the dependent variable, and region is the factor. We will now put these data into a table.
The data are not paired: Dutch subject 3 has nothing to do with Flemish subject 3. In fact, the numbers of subjects didn’t even have to be the same in the two groups; the t-test will work fine with 9 Dutch and 10 Flemish subjects. The data therefore don’t go in separate columns, and the table will have 20 rows:
score <- c(dutch, flemish)
region <- c(rep ("D", 10), rep ("F", 10))
table <- data.frame (region, score)
table
## region score
## 1 D 34
## 2 D 45
## 3 D 33
## 4 D 54
## 5 D 45
## 6 D 23
## 7 D 66
## 8 D 35
## 9 D 24
## 10 D 50
## 11 F 67
## 12 F 50
## 13 F 64
## 14 F 43
## 15 F 47
## 16 F 50
## 17 F 68
## 18 F 56
## 19 F 60
## 20 F 39
We can now run the command lm, which means ‘linear model’:
model <- lm (score ~ region, table)
summary (model)
##
## Call:
## lm(formula = score ~ region, data = table)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.90 -7.53 -1.40 9.22 25.10
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.90 3.80 10.76 2.8e-09 ***
## regionF 13.50 5.37 2.51 0.022 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12 on 18 degrees of freedom
## Multiple R-squared: 0.26, Adjusted R-squared: 0.218
## F-statistic: 6.31 on 1 and 18 DF, p-value: 0.0218
The resulting model makes the following approximation:
\[ score \approxeq 40.90 + 13.50 \times region \]
where region is 0 for the Dutch and 1 for the Flemish.
The p-value of 0.022 for “regionF” is the same as the p-value of the t-test, which is good. The t-value is also the same as the -2.512 seen earlier, although the sign has changed. Whereas the t-test computed the t for the Dutch minus the Flemish, the t-value here is computed by going from the reference level (“D”) to the other level (“F”, the “regionF” row), which is a step up in score.
The intercept is 40.90, which is the estimate for the mean score of the reference category “D”, i.e. the Dutch (region = 0); this value of 40.90 is significantly greater than 0 (p = 2.8·10⁻⁹). The coefficient for “regionF” is 13.50, because the estimated mean score goes up by 13.50 if you go from Dutch to Flemish (to 54.40).
To summarize the summary, you can use the anova command:
anova (model)
## Analysis of Variance Table
##
## Response: score
## Df Sum Sq Mean Sq F value Pr(>F)
## region 1 911 911 6.31 0.022 *
## Residuals 18 2599 144
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This yields the same p-value as we saw in the summary of the model. You report this as:
The effect of region is significant (F[1,18] = 6.31, p = 0.022), indicating that Flemish people are better at our German-language proficiency task than Dutch people.
You may think now that everything is under control, but there is a worrisome detail. Consider the summary again:
summary (model)
##
## Call:
## lm(formula = score ~ region, data = table)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.90 -7.53 -1.40 9.22 25.10
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.90 3.80 10.76 2.8e-09 ***
## regionF 13.50 5.37 2.51 0.022 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12 on 18 degrees of freedom
## Multiple R-squared: 0.26, Adjusted R-squared: 0.218
## F-statistic: 6.31 on 1 and 18 DF, p-value: 0.0218
By default (i.e. if you don’t specify otherwise), linear model computation in R will use alphabetical order to interpret the levels of a factor: for the factor “region”, the level “D” will be equated to 0 in the computations, and the level “F” will be equated to 1. That is, you would have obtained the exact same results if you had put “0” in the table instead of “D”, and “1” instead of “F”.
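As a check of this claim, here is a small self-contained sketch (with the same data as above, and an illustrative variable name regionNumeric) in which region is recoded by hand as 0 and 1; the fitted coefficients come out identical:

```r
# Sketch: the alphabetical default is equivalent to manual 0/1 dummy coding,
# with "D" (first in the alphabet) as 0 and "F" as 1.
score <- c(34, 45, 33, 54, 45, 23, 66, 35, 24, 50,
           67, 50, 64, 43, 47, 50, 68, 56, 60, 39)
region <- c(rep ("D", 10), rep ("F", 10))
regionNumeric <- ifelse (region == "F", 1, 0)   # manual 0/1 coding

modelLetters <- lm (score ~ region)
modelNumbers <- lm (score ~ regionNumeric)

unname (coef (modelLetters))   # 40.9 13.5
unname (coef (modelNumbers))   # 40.9 13.5
```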
How bad is this dependence on the alphabet? To find out, we add an additional factor called “region2”:
table $ region2 <- ifelse (table $ region == "D", "N", "F")
table
## region score region2
## 1 D 34 N
## 2 D 45 N
## 3 D 33 N
## 4 D 54 N
## 5 D 45 N
## 6 D 23 N
## 7 D 66 N
## 8 D 35 N
## 9 D 24 N
## 10 D 50 N
## 11 F 67 F
## 12 F 50 F
## 13 F 64 F
## 14 F 43 F
## 15 F 47 F
## 16 F 50 F
## 17 F 68 F
## 18 F 56 F
## 19 F 60 F
## 20 F 39 F
We can now compare the two summaries:
summary (lm (score ~ region, table))
##
## Call:
## lm(formula = score ~ region, data = table)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.90 -7.53 -1.40 9.22 25.10
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.90 3.80 10.76 2.8e-09 ***
## regionF 13.50 5.37 2.51 0.022 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12 on 18 degrees of freedom
## Multiple R-squared: 0.26, Adjusted R-squared: 0.218
## F-statistic: 6.31 on 1 and 18 DF, p-value: 0.0218
summary (lm (score ~ region2, table))
##
## Call:
## lm(formula = score ~ region2, data = table)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.90 -7.53 -1.40 9.22 25.10
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.40 3.80 14.32 2.8e-11 ***
## region2N -13.50 5.37 -2.51 0.022 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12 on 18 degrees of freedom
## Multiple R-squared: 0.26, Adjusted R-squared: 0.218
## F-statistic: 6.31 on 1 and 18 DF, p-value: 0.0218
As you can see from the label “region2N”, the reference category (“level”) for region has changed to “F” (Flanders), because that comes before “N” (Netherlands) in the alphabet. The intercept has changed to 54.40, which is the observed mean of the Flemish (i.e. the group for which region2 is coded 0). Its p-value (a comparison with 0) has become much lower, because the Flemish mean is much farther from 0 than the Dutch mean of 40.90 was. The sign of the estimate for “region2N” is negative, because you go down in score as you go from Flanders to the Netherlands. Fortunately, the p-value has not changed.
Something feels silly, though. Why isn’t the intercept the estimate of the mean of the Dutch and Flemish populations together? Well, we can accomplish that, namely by specifying explicit “contrasts”:
table $ regionCentered <- table $ region
contrasts (table $ regionCentered) <- c(-0.5, +0.5)
table
## region score region2 regionCentered
## 1 D 34 N D
## 2 D 45 N D
## 3 D 33 N D
## 4 D 54 N D
## 5 D 45 N D
## 6 D 23 N D
## 7 D 66 N D
## 8 D 35 N D
## 9 D 24 N D
## 10 D 50 N D
## 11 F 67 F F
## 12 F 50 F F
## 13 F 64 F F
## 14 F 43 F F
## 15 F 47 F F
## 16 F 50 F F
## 17 F 68 F F
## 18 F 56 F F
## 19 F 60 F F
## 20 F 39 F F
You don’t see the difference between columns “region” and “regionCentered” in the table view, but you can see it by asking:
table $ region
## [1] D D D D D D D D D D F F F F F F F F F F
## Levels: D F
table $ regionCentered
## [1] D D D D D D D D D D F F F F F F F F F F
## attr(,"contrasts")
## [,1]
## D -0.5
## F 0.5
## Levels: D F
By the way, R assigned the first number (-0.5) to the level that comes first in the alphabet (“D”)…
Now compare the three summaries:
summary (lm (score ~ region, table))
##
## Call:
## lm(formula = score ~ region, data = table)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.90 -7.53 -1.40 9.22 25.10
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.90 3.80 10.76 2.8e-09 ***
## regionF 13.50 5.37 2.51 0.022 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12 on 18 degrees of freedom
## Multiple R-squared: 0.26, Adjusted R-squared: 0.218
## F-statistic: 6.31 on 1 and 18 DF, p-value: 0.0218
summary (lm (score ~ regionCentered, table))
##
## Call:
## lm(formula = score ~ regionCentered, data = table)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.90 -7.53 -1.40 9.22 25.10
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.65 2.69 17.73 7.6e-13 ***
## regionCentered1 13.50 5.37 2.51 0.022 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12 on 18 degrees of freedom
## Multiple R-squared: 0.26, Adjusted R-squared: 0.218
## F-statistic: 6.31 on 1 and 18 DF, p-value: 0.0218
summary (lm (score ~ region2, table))
##
## Call:
## lm(formula = score ~ region2, data = table)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.90 -7.53 -1.40 9.22 25.10
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.40 3.80 14.32 2.8e-11 ***
## region2N -13.50 5.37 -2.51 0.022 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12 on 18 degrees of freedom
## Multiple R-squared: 0.26, Adjusted R-squared: 0.218
## F-statistic: 6.31 on 1 and 18 DF, p-value: 0.0218
The sign of the estimate for “regionCentered1” is positive, because our number goes up (from -0.5 to +0.5) as we go from Dutch to Flemish. The p-value is, reassuringly, again 0.022.
The intercept has now become 47.65, i.e. the mean of the two estimated group means. Its p-value is even lower than before, because the intercept estimate is now based on all 20 participants instead of 10.
Short form: center your binary variables!
Long form:
Although it may not seem very important, because you are often not interested in the intercept, please make a habit of always specifying explicit contrasts, so that you don’t have to rely on R’s silly alphabetical default. If you do want to use Dutch as the reference, use c(0, 1) as the contrast (this is what the alphabetical default does here), and if you want to use Flemish as the reference, use c(1, 0); these are “treatment contrasts”, so called because in medical experiments you want “no treatment” to be the reference. Usually, however, you will want symmetric contrasts whose elements sum to zero, such as c(-0.5, +0.5), because then (in the one-factor case) the intercept gives you some additional information: the estimated grand mean.
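As a sketch of the options (the variable names regionRefF, regionRefF2 and regionCentered are just for illustration): base R’s relevel is another way to pick a different reference level, and assigning c(-0.5, +0.5) gives the symmetric coding recommended above:

```r
# Sketch: three explicit codings of a two-level factor.
region <- factor (c(rep ("D", 10), rep ("F", 10)))

# 1. Treatment contrasts with "F" as the reference, via relevel:
regionRefF <- relevel (region, ref = "F")

# 2. The same reference choice by assigning the contrast directly:
regionRefF2 <- region
contrasts (regionRefF2) <- c(1, 0)   # "F" (coded 0) becomes the reference

# 3. Symmetric contrasts that sum to zero, so that the intercept
#    estimates the mean of the two group means:
regionCentered <- region
contrasts (regionCentered) <- c(-0.5, +0.5)
contrasts (regionCentered)
```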
Now, finally, everything is under control. Even the analysis of variance behaves well:
anova (lm (score ~ regionCentered, table))
## Analysis of Variance Table
##
## Response: score
## Df Sum Sq Mean Sq F value Pr(>F)
## regionCentered 1 911 911 6.31 0.022 *
## Residuals 18 2599 144
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Same result again. As I said, everything is under control, at least for the one-factor case.
Many interesting datasets involve more than one factor. Let’s add gender (“geslacht”) in:
table $ geslacht = c("M", "V", "M", "M", "V", "M", "V", "V", "M", "V",
"M", "M", "M", "M", "M", "V", "V", "V", "V", "V")
table
## region score region2 regionCentered geslacht
## 1 D 34 N D M
## 2 D 45 N D V
## 3 D 33 N D M
## 4 D 54 N D M
## 5 D 45 N D V
## 6 D 23 N D M
## 7 D 66 N D V
## 8 D 35 N D V
## 9 D 24 N D M
## 10 D 50 N D V
## 11 F 67 F F M
## 12 F 50 F F M
## 13 F 64 F F M
## 14 F 43 F F M
## 15 F 47 F F M
## 16 F 50 F F V
## 17 F 68 F F V
## 18 F 56 F F V
## 19 F 60 F F V
## 20 F 39 F F V
model <- lm (score ~ region * geslacht, table)
summary (model)
##
## Call:
## lm(formula = score ~ region * geslacht, data = table)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.6 -7.8 -1.9 6.5 20.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.60 5.08 6.61 6e-06 ***
## regionF 20.60 7.19 2.87 0.011 *
## geslachtV 14.60 7.19 2.03 0.059 .
## regionF:geslachtV -14.20 10.16 -1.40 0.181
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.4 on 16 degrees of freedom
## Multiple R-squared: 0.411, Adjusted R-squared: 0.301
## F-statistic: 3.73 on 3 and 16 DF, p-value: 0.0331
The resulting model makes the following approximation:
\[ score \approxeq 33.60 + 20.60 \times region + 14.60 \times geslacht - 14.20 \times region \times geslacht \]
where region is 0 for the Dutch and 1 for the Flemish, and geslacht is 0 for males and 1 for females.
The estimated intercept of 33.60 is now the estimate for the mean score of the Dutch male population. The positive value for “regionF” means that your score goes up by 20.60 if you’re Flemish (p = 0.011) and by 14.60 if you’re a woman (almost significant: p = 0.059). The interaction estimate of -14.20 means that those benefits don’t simply add up if you are both Flemish (region = 1) and a woman (geslacht = 1): you have to subtract 14.20, although this number doesn’t differ significantly from zero (p = 0.181). That tells us that the four groups have the following estimated means (please also try substituting 0’s and 1’s in the above equation):
Dutch males: 33.60
Flemish males: 33.60 + 20.60 = 54.20
Dutch females: 33.60 + 14.60 = 48.20
Flemish females: 33.60 + 20.60 + 14.60 - 14.20 = 54.60
If the interaction had come out significant, we would have been able to conclude that the female advantage is greater for Dutch than for Flemish people, or, equivalently, that the Flemish advantage is greater for males than for females. Or perhaps we can report something that suggests to a lesser extent that there are indeed advantages everywhere (such as a female advantage for the Flemish), for instance “the difference score of females minus males is greater for Dutch people than for Flemish people.”
These averages can be checked with the aggregate command:
aggregate (score ~ region + geslacht, data = table, FUN = mean)
## region geslacht score
## 1 D M 33.6
## 2 F M 54.2
## 3 D V 48.2
## 4 F V 54.6
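The same four means can be recovered from the model coefficients by substituting 0’s and 1’s; a small self-contained sketch (data repeated from above):

```r
# Sketch: the four cell means follow from the four coefficients.
score <- c(34, 45, 33, 54, 45, 23, 66, 35, 24, 50,
           67, 50, 64, 43, 47, 50, 68, 56, 60, 39)
region <- c(rep ("D", 10), rep ("F", 10))
geslacht <- c("M", "V", "M", "M", "V", "M", "V", "V", "M", "V",
              "M", "M", "M", "M", "M", "V", "V", "V", "V", "V")
model <- lm (score ~ region * geslacht)
b <- unname (coef (model))   # intercept, regionF, geslachtV, regionF:geslachtV

b [1]                           # Dutch males:     33.6
b [1] + b [2]                   # Flemish males:   54.2
b [1] + b [3]                   # Dutch females:   48.2
b [1] + b [2] + b [3] + b [4]   # Flemish females: 54.6
```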
So this table can be understood. As in the one-factor case, the analysis of variance yields a summary of the summary, but…
anova (model)
## Analysis of Variance Table
##
## Response: score
## Df Sum Sq Mean Sq F value Pr(>F)
## region 1 911 911 7.06 0.017 *
## geslacht 1 281 281 2.18 0.159
## region:geslacht 1 252 252 1.95 0.181
## Residuals 16 2066 129
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-values are different. Suddenly, gender is no longer nearly significant. That’s a problem that didn’t seem to occur in the one-factor case.
Then, let’s translate the letters in the coding of gender to English:
table $ gender = factor (ifelse (table $ geslacht == "V", "F", "M"))
table
## region score region2 regionCentered geslacht gender
## 1 D 34 N D M M
## 2 D 45 N D V F
## 3 D 33 N D M M
## 4 D 54 N D M M
## 5 D 45 N D V F
## 6 D 23 N D M M
## 7 D 66 N D V F
## 8 D 35 N D V F
## 9 D 24 N D M M
## 10 D 50 N D V F
## 11 F 67 F F M M
## 12 F 50 F F M M
## 13 F 64 F F M M
## 14 F 43 F F M M
## 15 F 47 F F M M
## 16 F 50 F F V F
## 17 F 68 F F V F
## 18 F 56 F F V F
## 19 F 60 F F V F
## 20 F 39 F F V F
Let’s investigate the influence of this trivial change:
summary (lm (score ~ region * geslacht, table))
##
## Call:
## lm(formula = score ~ region * geslacht, data = table)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.6 -7.8 -1.9 6.5 20.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.60 5.08 6.61 6e-06 ***
## regionF 20.60 7.19 2.87 0.011 *
## geslachtV 14.60 7.19 2.03 0.059 .
## regionF:geslachtV -14.20 10.16 -1.40 0.181
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.4 on 16 degrees of freedom
## Multiple R-squared: 0.411, Adjusted R-squared: 0.301
## F-statistic: 3.73 on 3 and 16 DF, p-value: 0.0331
summary (lm (score ~ region * gender, table))
##
## Call:
## lm(formula = score ~ region * gender, data = table)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.6 -7.8 -1.9 6.5 20.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.20 5.08 9.48 5.7e-08 ***
## regionF 6.40 7.19 0.89 0.386
## genderM -14.60 7.19 -2.03 0.059 .
## regionF:genderM 14.20 10.16 1.40 0.181
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.4 on 16 degrees of freedom
## Multiple R-squared: 0.411, Adjusted R-squared: 0.301
## F-statistic: 3.73 on 3 and 16 DF, p-value: 0.0331
Apart from the estimate and p-value of the intercept, three things have changed: the sign of the gender estimate has flipped (from +14.60 for “geslachtV” to -14.60 for “genderM”); the sign of the interaction estimate has flipped (from -14.20 to +14.20); and the estimate and p-value of the region effect have changed (from 20.60 with p = 0.011 to 6.40 with p = 0.386).
This last change occurs because in the earlier model the influence of region was evaluated only for the males (the reference category “M”), whereas now it is evaluated only for the females (the new reference category “F”). Sort of. This is analogous to what happened to the intercept in the one-factor case. This change of p-values in the main effects only happens if the interaction effect is present in the model (that’s why I said “sort of”).
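To see that this coding sensitivity of the main effects really hinges on the interaction term, here is a sketch that fits the additive model (no interaction) under both codings of gender; the region row then comes out identical:

```r
# Sketch: without the interaction, recoding gender leaves the region effect
# (estimate, standard error, t value and p-value) unchanged.
score <- c(34, 45, 33, 54, 45, 23, 66, 35, 24, 50,
           67, 50, 64, 43, 47, 50, 68, 56, 60, 39)
region <- c(rep ("D", 10), rep ("F", 10))
geslacht <- c("M", "V", "M", "M", "V", "M", "V", "V", "M", "V",
              "M", "M", "M", "M", "M", "V", "V", "V", "V", "V")
gender <- ifelse (geslacht == "V", "F", "M")   # the English coding from the text

coef (summary (lm (score ~ region + geslacht))) ["regionF", ]
coef (summary (lm (score ~ region + gender))) ["regionF", ]
# Both lines print the same four numbers.
```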
All of you who have been throwing two factors into a linear model without setting explicit contrasts now have to think “oops” and tell your advisors that your analyses have to be redone. This goes not only for lm, but for glm, lmer and glmer as well! Judging from lots of presentations of research in which these commands have been used, the alphabetical order is the researchers’ default!
None of those p-values are correct if you just want to evaluate the difference between the regions or between the genders. The solution should be clear now:
table $ genderCentered <- table $ gender
contrasts (table $ genderCentered) <- c(-0.5, 0.5)
model = lm (score ~ regionCentered * genderCentered, table)
summary (model)
##
## Call:
## lm(formula = score ~ regionCentered * genderCentered, data = table)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.6 -7.8 -1.9 6.5 20.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.65 2.54 18.75 2.6e-12 ***
## regionCentered1 13.50 5.08 2.66 0.017 *
## genderCentered1 -7.50 5.08 -1.48 0.159
## regionCentered1:genderCentered1 14.20 10.16 1.40 0.181
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.4 on 16 degrees of freedom
## Multiple R-squared: 0.411, Adjusted R-squared: 0.301
## F-statistic: 3.73 on 3 and 16 DF, p-value: 0.0331
The p-values are the same as they were above in the analysis of variance, which we can also try now:
anova (model)
## Analysis of Variance Table
##
## Response: score
## Df Sum Sq Mean Sq F value Pr(>F)
## regionCentered 1 911 911 7.06 0.017 *
## genderCentered 1 281 281 2.18 0.159
## regionCentered:genderCentered 1 252 252 1.95 0.181
## Residuals 16 2066 129
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Would the practical advice be to always use anova, or to always use explicit symmetric contrasts? Using an anova that gives you all the effects is not always possible, especially if you want to add and/or drop single terms in an exploration of a big model. Besides, anova does not always give you the same values as summary. We can see that after removing one Flemish female from the data:
table = table [1 : 19, ]
model = lm (score ~ regionCentered * genderCentered, table)
summary (model)
##
## Call:
## lm(formula = score ~ regionCentered * genderCentered, data = table)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.20 -7.85 -2.50 5.65 20.40
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.62 2.50 19.47 4.7e-12 ***
## regionCentered1 15.45 5.00 3.09 0.0074 **
## genderCentered1 -9.45 5.00 -1.89 0.0780 .
## regionCentered1:genderCentered1 10.30 9.99 1.03 0.3189
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.8 on 15 degrees of freedom
## Multiple R-squared: 0.487, Adjusted R-squared: 0.384
## F-statistic: 4.74 on 3 and 15 DF, p-value: 0.0161
anova (model)
## Analysis of Variance Table
##
## Response: score
## Df Sum Sq Mean Sq F value Pr(>F)
## regionCentered 1 1096 1096 9.33 0.008 **
## genderCentered 1 449 449 3.82 0.069 .
## regionCentered:genderCentered 1 125 125 1.06 0.319
## Residuals 15 1762 117
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-values for the main effects are different in summary than in anova. This is because in summary the p-value for a term is based on comparing the full model with the model without that term (“type-III sums of squares”), whereas in anova the terms are entered one by one, starting from the null model (“type-I sums of squares”). The null model, by the way, is the model with only an intercept; you get it with the formula score ~ 1.
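The “full model versus model without that term” comparison can also be done by hand with a model-comparison anova; for a single term it gives the same p-value as the t-test in summary. A self-contained sketch with the 19-row data from above (the names full and reduced are just for illustration):

```r
# Sketch: summary's p-value for the interaction equals the p-value from an
# explicit comparison of the full model with the model without that term.
score <- c(34, 45, 33, 54, 45, 23, 66, 35, 24, 50,
           67, 50, 64, 43, 47, 50, 68, 56, 60)   # one Flemish female removed
region <- c(rep ("D", 10), rep ("F", 9))
gender <- c("M", "F", "M", "M", "F", "M", "F", "F", "M", "F",
            "M", "M", "M", "M", "M", "F", "F", "F", "F")

full    <- lm (score ~ region * gender)
reduced <- lm (score ~ region + gender)
anova (reduced, full)   # p = 0.3189 for the interaction, as in the summary above
```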
So now for another silly change, namely a change in the order of the factors.
Will anything change in summary but not in anova, or in anova but not in summary?
.summary (lm (score ~ regionCentered * genderCentered, table))
##
## Call:
## lm(formula = score ~ regionCentered * genderCentered, data = table)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.20 -7.85 -2.50 5.65 20.40
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.62 2.50 19.47 4.7e-12 ***
## regionCentered1 15.45 5.00 3.09 0.0074 **
## genderCentered1 -9.45 5.00 -1.89 0.0780 .
## regionCentered1:genderCentered1 10.30 9.99 1.03 0.3189
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.8 on 15 degrees of freedom
## Multiple R-squared: 0.487, Adjusted R-squared: 0.384
## F-statistic: 4.74 on 3 and 15 DF, p-value: 0.0161
summary (lm (score ~ genderCentered * regionCentered, table))
##
## Call:
## lm(formula = score ~ genderCentered * regionCentered, data = table)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.20 -7.85 -2.50 5.65 20.40
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.62 2.50 19.47 4.7e-12 ***
## genderCentered1 -9.45 5.00 -1.89 0.0780 .
## regionCentered1 15.45 5.00 3.09 0.0074 **
## genderCentered1:regionCentered1 10.30 9.99 1.03 0.3189
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.8 on 15 degrees of freedom
## Multiple R-squared: 0.487, Adjusted R-squared: 0.384
## F-statistic: 4.74 on 3 and 15 DF, p-value: 0.0161
Since in summary a p-value is based on comparing the full model with the full-minus-one model, there is no difference: both regionCentered * genderCentered and genderCentered * regionCentered specify the full model.
anova (lm (score ~ regionCentered * genderCentered, table))
## Analysis of Variance Table
##
## Response: score
## Df Sum Sq Mean Sq F value Pr(>F)
## regionCentered 1 1096 1096 9.33 0.008 **
## genderCentered 1 449 449 3.82 0.069 .
## regionCentered:genderCentered 1 125 125 1.06 0.319
## Residuals 15 1762 117
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova (lm (score ~ genderCentered * regionCentered, table))
## Analysis of Variance Table
##
## Response: score
## Df Sum Sq Mean Sq F value Pr(>F)
## genderCentered 1 373 373 3.18 0.0949 .
## regionCentered 1 1172 1172 9.98 0.0065 **
## genderCentered:regionCentered 1 125 125 1.06 0.3189
## Residuals 15 1762 117
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since in anova terms are entered one by one, the first anova gives the p-value for comparing the null model with the model that has only “regionCentered” as a factor, whereas the second anova gives the p-value for comparing the model with only “genderCentered” as a factor to the model with both “genderCentered” and “regionCentered” as factors. If the number of participants is the same in all groups, there is no difference, but if the groups have different sizes, the p-values will differ.
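A sketch of that last claim: with the full, balanced 20-row data (5 subjects per cell) the two term orders give identical type-I sums of squares; the order dependence above appeared only because removing one subject made the groups unbalanced.

```r
# Sketch: with balanced groups, type-I sums of squares do not depend on the
# order in which the factors are entered into the model.
score <- c(34, 45, 33, 54, 45, 23, 66, 35, 24, 50,
           67, 50, 64, 43, 47, 50, 68, 56, 60, 39)
region <- c(rep ("D", 10), rep ("F", 10))
gender <- c("M", "F", "M", "M", "F", "M", "F", "F", "M", "F",
            "M", "M", "M", "M", "M", "F", "F", "F", "F", "F")

a1 <- anova (lm (score ~ region * gender))
a2 <- anova (lm (score ~ gender * region))

a1 ["region", "Sum Sq"]   # 911.25, region entered first
a2 ["region", "Sum Sq"]   # 911.25, region entered second: same
```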
If you try all of these options, then please mention in your paper that you tried all of them, as on pages 9–10 of Wanrooij et al. (2014b).
Always use explicit contrasts. Period.
The authors of lme4 deemed p-values unreliable, because of a problem in determining the number of degrees of freedom, and therefore decided to no longer report them. You can still get a p-value for a term by comparing the full model with the model minus that one term:
anova (model, modelMinusThatOneTerm, test = "Chisq")
However, the p-values from such a \(\chi^2\)-test on the difference between a mixed-effects model and the same model with one term added or dropped are just as unreliable as the p-values that the authors of lme4 decided to no longer give:
If you have very large samples an asymptotic approach (using Wald z or chi-square statistics) is probably just fine. However, the further you depart from conventional repeated measures ANOVA assumptions the harder it is to know how large a sample news [sic] to be before the asymptotics kick in. (Thom Baguley on his blog)
Practical advice: add terms to your model if they contribute with p < 0.05 (or perhaps p < 0.01), but trust their significance only if their p-values are below 0.001 (or perhaps 0.0001).