Question: Who of you worked with multiple levels of interaction before in R? More precisely, who of you satisfy all of the following three conditions?:

  1. You worked with lm, glm, lmer, or glmer.
  2. You used a formula with at least two fixed factors (between- or within-subjects doesn’t matter).
  3. You were interested in the significance of at least two degrees of interaction of those factors, e.g. in the case of two factors you were interested both in the main effect(s) and in the interaction effect.

.

.

.

.

.

.

(result: 12 out of 48 people)

Question: Who of these people used explicit contrasts, e.g. with the contrasts command?

.

.

.

.

.

.

(result: three or four)

Here is a test, especially for those who have worked with this before.

score <- c(34, 45, 33, 54, 45, 23, 66, 35, 24, 50,
           67, 50, 64, 43, 47, 50, 68, 56, 60, 39)
region <- c("D", "D", "D", "D", "D", "D", "D", "D", "D", "D",
            "F", "F", "F", "F", "F", "F", "F", "F", "F", "F")
geslacht = c("M", "V", "M", "M", "V", "M", "V", "V", "M", "V",
             "M", "M", "M", "M", "M", "V", "V", "V", "V", "V")
table <- data.frame (region, geslacht, score)
table
##    region geslacht score
## 1       D        M    34
## 2       D        V    45
## 3       D        M    33
## 4       D        M    54
## 5       D        V    45
## 6       D        M    23
## 7       D        V    66
## 8       D        V    35
## 9       D        M    24
## 10      D        V    50
## 11      F        M    67
## 12      F        M    50
## 13      F        M    64
## 14      F        M    43
## 15      F        M    47
## 16      F        V    50
## 17      F        V    68
## 18      F        V    56
## 19      F        V    60
## 20      F        V    39

(geslacht is Dutch for ‘gender’; M = man, V = vrouw ‘woman’)

We run a linear model:

model <- lm (score ~ region * geslacht, table)
summary (model)
## 
## Call:
## lm(formula = score ~ region * geslacht, data = table)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -15.6   -7.8   -1.9    6.5   20.4 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          33.60       5.08    6.61    6e-06 ***
## regionF              20.60       7.19    2.87    0.011 *  
## geslachtV            14.60       7.19    2.03    0.059 .  
## regionF:geslachtV   -14.20      10.16   -1.40    0.181    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.4 on 16 degrees of freedom
## Multiple R-squared:  0.411,  Adjusted R-squared:  0.301 
## F-statistic: 3.73 on 3 and 16 DF,  p-value: 0.0331

So far, so good. We now decide to code gender not in Dutch but in English, i.e. by changing “V” to “F”.

Question: What changes in our summary of the model?

  a. nothing
  b. the sign of the estimate for gender
  c. the sign of the estimates for gender and the interaction
  d. b plus the p-value of the intercept
  e. c plus the p-value of the intercept
  f. b plus more than one p-value
  g. c plus more than one p-value

.

.

.

.

.

.

(result: 4 out of 48 people answered g, and 1 of the 12 “experts” answered g)

If you answered “yes” to the first question and either “no” to the second question or something different from “g” to the last question, you may be in trouble.

A linear model with one factor

Many experimental designs can be captured in a linear model. Consider the Dutch–Flemish case again:

dutch = c(34, 45, 33, 54, 45, 23, 66, 35, 24, 50)
flemish = c(67, 50, 64, 43, 47, 50, 68, 56, 60, 39)
t.test (dutch, flemish, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  dutch and flemish
## t = -2.512, df = 18, p-value = 0.02176
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -24.791  -2.209
## sample estimates:
## mean of x mean of y 
##      40.9      54.4

We can achieve the same result by regarding this design as a linear model with one independent variable (factor) and one dependent variable.

.

.

.

Question: What is the dependent variable?

  a. the German-language proficiency score
  b. the region
  c. don’t know

.

.

.

.

.

.

The score is the dependent variable, and region is the factor. We will now put these data into a table.

.

.

.

Question: How many rows will the table have?

  a. 10
  b. 20
  c. don’t know

.

.

.

.

.

.

The data are not paired: Dutch subject 3 has nothing to do with Flemish subject 3. In fact, the number of subjects doesn’t even have to be the same in both groups; the t-test will work fine with 9 Dutch and 10 Flemish subjects. The data therefore don’t go in separate columns, and the table will have 20 rows:

score <- c(dutch, flemish)
region <- c(rep ("D", 10), rep ("F", 10))
table <- data.frame (region, score)
table
##    region score
## 1       D    34
## 2       D    45
## 3       D    33
## 4       D    54
## 5       D    45
## 6       D    23
## 7       D    66
## 8       D    35
## 9       D    24
## 10      D    50
## 11      F    67
## 12      F    50
## 13      F    64
## 14      F    43
## 15      F    47
## 16      F    50
## 17      F    68
## 18      F    56
## 19      F    60
## 20      F    39

We can now run the command lm, which means ‘linear model’:

model <- lm (score ~ region, table)
summary (model)
## 
## Call:
## lm(formula = score ~ region, data = table)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -17.90  -7.53  -1.40   9.22  25.10 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    40.90       3.80   10.76  2.8e-09 ***
## regionF        13.50       5.37    2.51    0.022 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12 on 18 degrees of freedom
## Multiple R-squared:  0.26,   Adjusted R-squared:  0.218 
## F-statistic: 6.31 on 1 and 18 DF,  p-value: 0.0218

The resulting model makes the following approximation:

\[ score \approxeq 40.90 + 13.50 \times region \]

where region is 0 for the Dutch and 1 for the Flemish.

The p-value of 0.022 for “regionF” is the same as the p-value of the t-test, which is good. The t-value is also the same as the -2.512 seen earlier, although the sign has changed: whereas the t-test computed the t for the Dutch minus the Flemish, the t-value here is computed on the basis of going from the reference category (“regionD”) to the other category (“regionF”), which is a step up in score.

The intercept is 40.90, which is the estimate for the score of the reference category, “regionD”, i.e. Dutch (region = 0); this value of 40.90 is significantly greater than 0 (p = 2.8·10⁻⁹). The coefficient for “regionF” is 13.50, because the estimated mean score goes up by 13.50 if you go from Dutch to Flemish (it becomes 54.40).
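
If you want to see the 0-and-1 coding that R actually used, you can inspect the model’s design matrix (a quick check, not part of the analysis itself):

head (model.matrix (model))   # (Intercept) = 1 everywhere; regionF = 0 for the Dutch rows
tail (model.matrix (model))   # ... and regionF = 1 for the Flemish rows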

To summarize the summary, you can use the anova command:

anova (model)
## Analysis of Variance Table
## 
## Response: score
##           Df Sum Sq Mean Sq F value Pr(>F)  
## region     1    911     911    6.31  0.022 *
## Residuals 18   2599     144                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This yields the same p-value as we saw in the summary of the model. You report this as:

The effect of region is significant (F[1,18] = 6.31, p = 0.022), indicating that Flemish people are better at our German-language proficiency task than Dutch people.

.

.

.

You may think now that everything is under control, but there is a worrisome detail. Consider the summary again:

summary (model)
## 
## Call:
## lm(formula = score ~ region, data = table)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -17.90  -7.53  -1.40   9.22  25.10 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    40.90       3.80   10.76  2.8e-09 ***
## regionF        13.50       5.37    2.51    0.022 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12 on 18 degrees of freedom
## Multiple R-squared:  0.26,   Adjusted R-squared:  0.218 
## F-statistic: 6.31 on 1 and 18 DF,  p-value: 0.0218

Question: Why does R take “D” as the reference level for the factor “region”?

  a. It comes before “F” in the alphabet.
  b. It comes earlier in the table than “F”.
  c. R chooses randomly.
  d. Don’t know.

.

.

.

.

.

.

By default (i.e. if you don’t specify otherwise), linear model computation in R will use alphabetical order to interpret the levels of a factor: for the factor “region”, the level “D” will be equated to 0 in the computations, and the level “F” will be equated to 1. That is, you would have obtained the exact same results if you had put “0” in the table instead of “D”, and “1” instead of “F”.
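
You can verify that claim directly, without touching the table, by building the 0/1 predictor on the fly:

summary (lm (score ~ ifelse (region == "F", 1, 0), table))
# the estimate (13.50), t value and p-value should be exactly the same as above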

How bad is this dependence on the alphabet? We’ll investigate.

.

.

.

Question: What changes in our summary if we reverse the order, e.g. by changing “D” (Dutch) to “N” (Netherlands)?

  a. nothing
  b. the estimate of the intercept
  c. b plus the sign of the estimate for the influence of the factor region
  d. c plus the p-value of the intercept
  e. don’t know

.

.

.

To figure this out, we add a second factor, “region2”:

table $ region2 <- ifelse (table $ region == "D", "N", "F")
table
##    region score region2
## 1       D    34       N
## 2       D    45       N
## 3       D    33       N
## 4       D    54       N
## 5       D    45       N
## 6       D    23       N
## 7       D    66       N
## 8       D    35       N
## 9       D    24       N
## 10      D    50       N
## 11      F    67       F
## 12      F    50       F
## 13      F    64       F
## 14      F    43       F
## 15      F    47       F
## 16      F    50       F
## 17      F    68       F
## 18      F    56       F
## 19      F    60       F
## 20      F    39       F

We can now compare the two summaries:

summary (lm (score ~ region, table))
## 
## Call:
## lm(formula = score ~ region, data = table)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -17.90  -7.53  -1.40   9.22  25.10 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    40.90       3.80   10.76  2.8e-09 ***
## regionF        13.50       5.37    2.51    0.022 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12 on 18 degrees of freedom
## Multiple R-squared:  0.26,   Adjusted R-squared:  0.218 
## F-statistic: 6.31 on 1 and 18 DF,  p-value: 0.0218
summary (lm (score ~ region2, table))
## 
## Call:
## lm(formula = score ~ region2, data = table)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -17.90  -7.53  -1.40   9.22  25.10 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    54.40       3.80   14.32  2.8e-11 ***
## region2N      -13.50       5.37   -2.51    0.022 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12 on 18 degrees of freedom
## Multiple R-squared:  0.26,   Adjusted R-squared:  0.218 
## F-statistic: 6.31 on 1 and 18 DF,  p-value: 0.0218

As you can see from the word “region2N”, the reference category (“level”) for the region has changed to “F” (Flanders), because that comes before “N” (Netherlands) in the alphabet. The intercept has changed to 54.40, which is the observed mean of the Flemish (i.e. when the value of region2 is 0). Its p-value (comparison to 0) has become much lower, because the Flemish value is much farther from 0 than the Dutch value of 40.90 was. The sign of the estimate of the factor “region2” is negative, because you go down in score as you go from Flanders to the Netherlands. Fortunately, the p-value has not changed.

Something feels silly, though. Why isn’t the intercept the estimate of the mean of the Dutch and Flemish populations together? Well, we can accomplish that, namely by specifying explicit “contrasts”:

  1. In our first try, Dutch (“D”) was coded as 0 and Flemish (“F”) was coded as 1, making use of R’s default strategy of using alphabetical order.
  2. In our second try, the Netherlands (“N”) was coded as 1 and Flanders (“F”) as 0, again making use of R’s alphabetical default.
  3. In our third try, we will code Dutch as -0.5 and Flemish as +0.5.
table $ regionCentered <- table $ region
contrasts (table $ regionCentered) <- c(-0.5, +0.5)
table
##    region score region2 regionCentered
## 1       D    34       N              D
## 2       D    45       N              D
## 3       D    33       N              D
## 4       D    54       N              D
## 5       D    45       N              D
## 6       D    23       N              D
## 7       D    66       N              D
## 8       D    35       N              D
## 9       D    24       N              D
## 10      D    50       N              D
## 11      F    67       F              F
## 12      F    50       F              F
## 13      F    64       F              F
## 14      F    43       F              F
## 15      F    47       F              F
## 16      F    50       F              F
## 17      F    68       F              F
## 18      F    56       F              F
## 19      F    60       F              F
## 20      F    39       F              F

You don’t see the difference between columns “region” and “regionCentered” in the table view, but you can see it by asking:

table $ region
##  [1] D D D D D D D D D D F F F F F F F F F F
## Levels: D F
table $ regionCentered
##  [1] D D D D D D D D D D F F F F F F F F F F
## attr(,"contrasts")
##   [,1]
## D -0.5
## F  0.5
## Levels: D F

By the way, R assigned the first number (-0.5) to the level that comes first in the alphabet (“D”)…
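
A caveat if you are following along in a recent R: contrasts can only be assigned to a factor. In the session above, data.frame had silently converted the character vector region into a factor (hence the “Levels” line), but since R 4.0 that conversion no longer happens by default, so you may first need something like:

table $ regionCentered <- factor (table $ region)
contrasts (table $ regionCentered) <- c(-0.5, +0.5)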

Now compare the three summaries:

summary (lm (score ~ region, table))
## 
## Call:
## lm(formula = score ~ region, data = table)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -17.90  -7.53  -1.40   9.22  25.10 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    40.90       3.80   10.76  2.8e-09 ***
## regionF        13.50       5.37    2.51    0.022 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12 on 18 degrees of freedom
## Multiple R-squared:  0.26,   Adjusted R-squared:  0.218 
## F-statistic: 6.31 on 1 and 18 DF,  p-value: 0.0218
summary (lm (score ~ regionCentered, table))
## 
## Call:
## lm(formula = score ~ regionCentered, data = table)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -17.90  -7.53  -1.40   9.22  25.10 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        47.65       2.69   17.73  7.6e-13 ***
## regionCentered1    13.50       5.37    2.51    0.022 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12 on 18 degrees of freedom
## Multiple R-squared:  0.26,   Adjusted R-squared:  0.218 
## F-statistic: 6.31 on 1 and 18 DF,  p-value: 0.0218
summary (lm (score ~ region2, table))
## 
## Call:
## lm(formula = score ~ region2, data = table)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -17.90  -7.53  -1.40   9.22  25.10 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    54.40       3.80   14.32  2.8e-11 ***
## region2N      -13.50       5.37   -2.51    0.022 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12 on 18 degrees of freedom
## Multiple R-squared:  0.26,   Adjusted R-squared:  0.218 
## F-statistic: 6.31 on 1 and 18 DF,  p-value: 0.0218

The sign of the estimate for “regionCentered” is positive, because we are going from Dutch to Flemish when our number goes up (from -0.5 to +0.5). The p-value is, reassuringly, again 0.022.

The intercept has now become 47.65, i.e. the mean of the two estimated means. The p-value is again lower.

.

.

.

Question: How can that p-value be even lower than the one for the Flemish mean?

  a. This mean is even farther from zero.
  b. This p-value is based on more participants.
  c. Don’t know.

.

.

.

.

.

.

The p-value is lower because it is now based on all 20 participants instead of 10.
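
You can also see this in the standard errors: the centered intercept estimates the mean of two independent group means, so its standard error is only

\[ SE = \tfrac{1}{2}\sqrt{SE_D^2 + SE_F^2} = \tfrac{1}{2}\sqrt{3.80^2 + 3.80^2} \approx 2.69 \]

instead of the 3.80 of a single group mean, and the t value rises accordingly.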

Practical advice

Short form: center your binary variables!

Long form:

Although it does not seem very important, because you are often not interested in the intercept, please make a habit of always specifying explicit contrasts, so as not to have to rely on R’s silly alphabetical default strategy. If you want to use Dutch as the reference, use c(0, 1) as the contrast (remember that the first number goes to “D”), and if you want to use Flemish as the reference, use c(1, 0); these are “treatment contrasts”, so called because in medical experiments you want “no treatment” to be the reference. Usually, however, you will want symmetric contrasts whose elements sum to zero, such as c(-0.5, 0.5) (see the sketch after the list below), because that gives you some additional information (in the one-factor case):

  • the estimated intercept will be the estimated mean of means
  • the p-value, based on all data, will tell you whether that estimated mean of means differs reliably from zero
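
To make this concrete, here is a minimal sketch for an arbitrary two-level factor (the variable x is hypothetical; the first number of each contrast goes to the alphabetically first level):

x <- factor (c("D", "D", "F", "F"))
contrasts (x)                    # default treatment contrasts: D = 0 (reference), F = 1
contrasts (x) <- c(0, 1)         # the same thing, but now explicit
contrasts (x) <- c(1, 0)         # treatment contrasts with "F" as the reference
contrasts (x) <- c(-0.5, +0.5)   # symmetric contrasts: the intercept becomes the mean of means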

Now, finally, everything is under control. Even the analysis of variance behaves well:

anova (lm (score ~ regionCentered, table))
## Analysis of Variance Table
## 
## Response: score
##                Df Sum Sq Mean Sq F value Pr(>F)  
## regionCentered  1    911     911    6.31  0.022 *
## Residuals      18   2599     144                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Same result again. As I said, everything is under control, at least for the one-factor case.

A linear model with two factors

Many interesting datasets involve more than one factor. Let’s add gender (“geslacht”) in:

table $ geslacht = c("M", "V", "M", "M", "V", "M", "V", "V", "M", "V",
                     "M", "M", "M", "M", "M", "V", "V", "V", "V", "V")
table
##    region score region2 regionCentered geslacht
## 1       D    34       N              D        M
## 2       D    45       N              D        V
## 3       D    33       N              D        M
## 4       D    54       N              D        M
## 5       D    45       N              D        V
## 6       D    23       N              D        M
## 7       D    66       N              D        V
## 8       D    35       N              D        V
## 9       D    24       N              D        M
## 10      D    50       N              D        V
## 11      F    67       F              F        M
## 12      F    50       F              F        M
## 13      F    64       F              F        M
## 14      F    43       F              F        M
## 15      F    47       F              F        M
## 16      F    50       F              F        V
## 17      F    68       F              F        V
## 18      F    56       F              F        V
## 19      F    60       F              F        V
## 20      F    39       F              F        V
model <- lm (score ~ region * geslacht, table)
summary (model)
## 
## Call:
## lm(formula = score ~ region * geslacht, data = table)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -15.6   -7.8   -1.9    6.5   20.4 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          33.60       5.08    6.61    6e-06 ***
## regionF              20.60       7.19    2.87    0.011 *  
## geslachtV            14.60       7.19    2.03    0.059 .  
## regionF:geslachtV   -14.20      10.16   -1.40    0.181    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.4 on 16 degrees of freedom
## Multiple R-squared:  0.411,  Adjusted R-squared:  0.301 
## F-statistic: 3.73 on 3 and 16 DF,  p-value: 0.0331

The resulting model makes the following approximation:

\[ score \approxeq 33.60 + 20.60 \times region + 14.60 \times geslacht - 14.20 \times region \times geslacht \]

where region is 0 for the Dutch and 1 for the Flemish, and geslacht is 0 for males and 1 for females.

The estimated intercept of 33.60 is now the estimate for the mean score of the male Dutch population. The positive value for “regionF” means that your score goes up by 20.60 if you’re Flemish (p = 0.011) and by 14.60 if you’re a woman (almost significant: p = 0.059). The interaction estimate of -14.20 means that those benefits don’t add up if you are both Flemish (region = 1) and a woman (geslacht = 1): you have to subtract 14.20, although this number doesn’t differ significantly from zero (p = 0.181). That tells us that the four groups have the following estimated means (please also try substituting 0’s and 1’s in the above equation):

  • Dutch males: 33.60
  • Dutch females: 33.60 + 14.60 = 48.20
  • Flemish males: 33.60 + 20.60 = 54.20
  • Flemish females: 33.60 + 20.60 + 14.60 - 14.20 = 54.60

If the interaction had come out significant, we would have been able to conclude that the female advantage is greater for Dutch than for Flemish people, or, equivalently, that the Flemish advantage is greater for males than for females. Or perhaps we should report it in a way that presupposes less strongly that there are advantages everywhere (such as a female advantage for the Flemish), for instance: “the difference score of females minus males is greater for Dutch people than for Flemish people.”
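
One way to look at such an interaction pattern is base R’s interaction.plot, which draws the mean score per cell, with one line per gender (a sketch):

interaction.plot (table $ region, table $ geslacht, table $ score,
                  xlab = "region", trace.label = "geslacht", ylab = "mean score")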

These averages can be checked with the aggregate command:

aggregate (score ~ region + geslacht, data = table, FUN = mean)
##   region geslacht score
## 1      D        M  33.6
## 2      F        M  54.2
## 3      D        V  48.2
## 4      F        V  54.6

So this table can be understood. As in the one-factor case, the analysis of variance yields a summary of the summary, but…

anova (model)
## Analysis of Variance Table
## 
## Response: score
##                 Df Sum Sq Mean Sq F value Pr(>F)  
## region           1    911     911    7.06  0.017 *
## geslacht         1    281     281    2.18  0.159  
## region:geslacht  1    252     252    1.95  0.181  
## Residuals       16   2066     129                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-values are different. Suddenly, gender is no longer nearly significant. That’s a problem that didn’t seem to occur in the one-factor case.

Now let’s translate the letters in the coding of gender into English:

table $ gender = factor (ifelse (table $ geslacht == "V", "F", "M"))
table
##    region score region2 regionCentered geslacht gender
## 1       D    34       N              D        M      M
## 2       D    45       N              D        V      F
## 3       D    33       N              D        M      M
## 4       D    54       N              D        M      M
## 5       D    45       N              D        V      F
## 6       D    23       N              D        M      M
## 7       D    66       N              D        V      F
## 8       D    35       N              D        V      F
## 9       D    24       N              D        M      M
## 10      D    50       N              D        V      F
## 11      F    67       F              F        M      M
## 12      F    50       F              F        M      M
## 13      F    64       F              F        M      M
## 14      F    43       F              F        M      M
## 15      F    47       F              F        M      M
## 16      F    50       F              F        V      F
## 17      F    68       F              F        V      F
## 18      F    56       F              F        V      F
## 19      F    60       F              F        V      F
## 20      F    39       F              F        V      F

Question: What changes in our summary of the model?

  a. nothing
  b. the sign of the estimate for gender
  c. the sign of the estimates for gender and the interaction
  d. b plus the p-value of the intercept
  e. c plus the p-value of the intercept
  f. b plus more than one p-value
  g. c plus more than one p-value

.

.

.

.

.

.

Let’s investigate the influence of this trivial change:

summary (lm (score ~ region * geslacht, table))
## 
## Call:
## lm(formula = score ~ region * geslacht, data = table)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -15.6   -7.8   -1.9    6.5   20.4 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          33.60       5.08    6.61    6e-06 ***
## regionF              20.60       7.19    2.87    0.011 *  
## geslachtV            14.60       7.19    2.03    0.059 .  
## regionF:geslachtV   -14.20      10.16   -1.40    0.181    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.4 on 16 degrees of freedom
## Multiple R-squared:  0.411,  Adjusted R-squared:  0.301 
## F-statistic: 3.73 on 3 and 16 DF,  p-value: 0.0331
summary (lm (score ~ region * gender, table))
## 
## Call:
## lm(formula = score ~ region * gender, data = table)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -15.6   -7.8   -1.9    6.5   20.4 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        48.20       5.08    9.48  5.7e-08 ***
## regionF             6.40       7.19    0.89    0.386    
## genderM           -14.60       7.19   -2.03    0.059 .  
## regionF:genderM    14.20      10.16    1.40    0.181    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.4 on 16 degrees of freedom
## Multiple R-squared:  0.411,  Adjusted R-squared:  0.301 
## F-statistic: 3.73 on 3 and 16 DF,  p-value: 0.0331

Apart from the estimate and p-value of the intercept, three things have changed:

  • the sign of the estimate for gender (now “genderM” at -14.60, instead of “geslachtV” at +14.60);
  • the sign of the estimate for the interaction (+14.20 instead of -14.20);
  • the estimate and the p-value for “regionF” (6.40 with p = 0.386, instead of 20.60 with p = 0.011).

This last change occurs because in the earlier test the influence of region was evaluated only for the males (the reference category “M”), and now it is evaluated only for the females (the new reference category “F”, formerly “V”). Sort of. This is analogous to what happened to the intercept in the one-factor case. This change of p-values in the main effects only happens if the interaction effect is present in the model (that’s why I said “sort of”).
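
You can check this against the cell means from the aggregate table: with treatment coding, the estimate for “regionF” is exactly the Dutch-to-Flemish difference within the reference gender:

with (subset (table, geslacht == "M"), mean (score [region == "F"]) - mean (score [region == "D"]))
# should give 54.2 - 33.6 = 20.6, the "regionF" estimate when "M" is the reference
with (subset (table, geslacht == "V"), mean (score [region == "F"]) - mean (score [region == "D"]))
# should give 54.6 - 48.2 = 6.4, the "regionF" estimate when the females are the reference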

All of you who have been throwing two factors into a linear model without setting explicit contrasts now have to think OOPS and tell your advisors that your analyses have to be redone. This goes not only for lm, but for glm, lmer and glmer as well! Judging from lots of presentations on research where these commands have been used, the alphabetical order is the researcher’s default!

.

.

.

Question: Which p-values are correct?

  a. the V/M values
  b. the F/M values
  c. neither

.

.

.

.

.

.

None of those p-values are correct if you just want to evaluate the difference between the regions or between the genders. The solution should be clear now:

table $ genderCentered <- table $ gender
contrasts (table $ genderCentered) <- c(-0.5, 0.5)
model = lm (score ~ regionCentered * genderCentered, table)
summary (model)
## 
## Call:
## lm(formula = score ~ regionCentered * genderCentered, data = table)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -15.6   -7.8   -1.9    6.5   20.4 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        47.65       2.54   18.75  2.6e-12 ***
## regionCentered1                    13.50       5.08    2.66    0.017 *  
## genderCentered1                    -7.50       5.08   -1.48    0.159    
## regionCentered1:genderCentered1    14.20      10.16    1.40    0.181    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.4 on 16 degrees of freedom
## Multiple R-squared:  0.411,  Adjusted R-squared:  0.301 
## F-statistic: 3.73 on 3 and 16 DF,  p-value: 0.0331

The p-values are the same as they were above in the analysis of variance, which we can also try now:

anova (model)
## Analysis of Variance Table
## 
## Response: score
##                               Df Sum Sq Mean Sq F value Pr(>F)  
## regionCentered                 1    911     911    7.06  0.017 *
## genderCentered                 1    281     281    2.18  0.159  
## regionCentered:genderCentered  1    252     252    1.95  0.181  
## Residuals                     16   2066     129                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Would the practical advice be to always use anova, or to always use explicit symmetric contrasts? The use of an anova that gives you all the effects is not always possible, especially if you want to add and/or drop single terms in an exploration of a big model.
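
For that kind of term-by-term exploration, R’s drop1 and add1 commands can help: they compare the current model with every model that differs from it by a single term (a sketch; note that drop1 respects marginality, so with the interaction still in the model it will only offer to drop the interaction):

drop1 (model, test = "F")                               # tests dropping the interaction
add1 (lm (score ~ 1, table),
      ~ regionCentered * genderCentered, test = "F")    # tests adding each single term to the null model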

Besides, anova does not always give you the same values as summary. We can see that after removing one Flemish female from the data:

table = table [1 : 19, ]
model = lm (score ~ regionCentered * genderCentered, table)
summary (model)
## 
## Call:
## lm(formula = score ~ regionCentered * genderCentered, data = table)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -13.20  -7.85  -2.50   5.65  20.40 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        48.62       2.50   19.47  4.7e-12 ***
## regionCentered1                    15.45       5.00    3.09   0.0074 ** 
## genderCentered1                    -9.45       5.00   -1.89   0.0780 .  
## regionCentered1:genderCentered1    10.30       9.99    1.03   0.3189    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.8 on 15 degrees of freedom
## Multiple R-squared:  0.487,  Adjusted R-squared:  0.384 
## F-statistic: 4.74 on 3 and 15 DF,  p-value: 0.0161
anova (model)
## Analysis of Variance Table
## 
## Response: score
##                               Df Sum Sq Mean Sq F value Pr(>F)   
## regionCentered                 1   1096    1096    9.33  0.008 **
## genderCentered                 1    449     449    3.82  0.069 . 
## regionCentered:genderCentered  1    125     125    1.06  0.319   
## Residuals                     15   1762     117                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-values for the main effects are different in summary than in anova. This is because in summary the p-value for a term is based on comparing the full model with the model without that term (“type-III sums of squares”), whereas in anova the terms are entered one by one starting from the null model (“type-I sums of squares”). The null model, by the way, is the model with only an intercept; you get it with the formula score ~ 1.
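
You can make this sequential story explicit by fitting the chain of models yourself; giving several nested models to anova compares each with the next, and should reproduce the sequential table above:

anova (lm (score ~ 1, table),
       lm (score ~ regionCentered, table),
       lm (score ~ regionCentered + genderCentered, table),
       lm (score ~ regionCentered * genderCentered, table))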

So now for another silly change, namely a change in the order of the factors.

Question: We used “regionCentered * genderCentered”, but what changes if we use “genderCentered * regionCentered” instead?

  a. Nothing.
  b. The p-values change in summary, but not in anova.
  c. The p-values change in anova, but not in summary.

.

.

.

.

.

.

summary (lm (score ~ regionCentered * genderCentered, table))
## 
## Call:
## lm(formula = score ~ regionCentered * genderCentered, data = table)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -13.20  -7.85  -2.50   5.65  20.40 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        48.62       2.50   19.47  4.7e-12 ***
## regionCentered1                    15.45       5.00    3.09   0.0074 ** 
## genderCentered1                    -9.45       5.00   -1.89   0.0780 .  
## regionCentered1:genderCentered1    10.30       9.99    1.03   0.3189    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.8 on 15 degrees of freedom
## Multiple R-squared:  0.487,  Adjusted R-squared:  0.384 
## F-statistic: 4.74 on 3 and 15 DF,  p-value: 0.0161
summary (lm (score ~ genderCentered * regionCentered, table))
## 
## Call:
## lm(formula = score ~ genderCentered * regionCentered, data = table)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -13.20  -7.85  -2.50   5.65  20.40 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        48.62       2.50   19.47  4.7e-12 ***
## genderCentered1                    -9.45       5.00   -1.89   0.0780 .  
## regionCentered1                    15.45       5.00    3.09   0.0074 ** 
## genderCentered1:regionCentered1    10.30       9.99    1.03   0.3189    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.8 on 15 degrees of freedom
## Multiple R-squared:  0.487,  Adjusted R-squared:  0.384 
## F-statistic: 4.74 on 3 and 15 DF,  p-value: 0.0161

Since in summary a p-value is based on comparing the full and the full-minus-one model, there is no difference. Both regionCentered * genderCentered and genderCentered * regionCentered specify the full model.

anova (lm (score ~ regionCentered * genderCentered, table))
## Analysis of Variance Table
## 
## Response: score
##                               Df Sum Sq Mean Sq F value Pr(>F)   
## regionCentered                 1   1096    1096    9.33  0.008 **
## genderCentered                 1    449     449    3.82  0.069 . 
## regionCentered:genderCentered  1    125     125    1.06  0.319   
## Residuals                     15   1762     117                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova (lm (score ~ genderCentered * regionCentered, table))
## Analysis of Variance Table
## 
## Response: score
##                               Df Sum Sq Mean Sq F value Pr(>F)   
## genderCentered                 1    373     373    3.18 0.0949 . 
## regionCentered                 1   1172    1172    9.98 0.0065 **
## genderCentered:regionCentered  1    125     125    1.06 0.3189   
## Residuals                     15   1762     117                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since in anova terms are entered one by one, the first anova gives the p-value for comparing the null model with the model with only “regionCentered” as a factor, whereas the second anova gives the p-value for comparing the model with only “genderCentered” as a factor with the model with both “genderCentered” and “regionCentered” as factors. If the number of participants is the same in all groups, there is no difference, but if the groups have different sizes, the p-values will differ.

If you try all of these options, then please mention in your paper that you have tried all of them, as on pages 9–10 of Wanrooij et al. (2014b).

Practical advice

Always use explicit contrasts. Period.

Why doesn’t lmer give me p-values?

Because the authors of lme4 deemed them unreliable, owing to a problem in determining the number of degrees of freedom. You can still get p-values by comparing the full model with the model minus the one term you are interested in:

anova (model, modelMinusThatOneTerm, test = "Chisq")
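
For instance, a minimal sketch (the data frame d, the grouping factor subject, and the model names are hypothetical):

library (lme4)
full <- lmer (score ~ regionCentered * genderCentered + (1 | subject), data = d, REML = FALSE)
reduced <- lmer (score ~ regionCentered + genderCentered + (1 | subject), data = d, REML = FALSE)
anova (reduced, full)   # likelihood-ratio chi-squared test for the interaction term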

The p-values from a \(\chi^2\)-test on the difference between a mixed-effects model and the same model with one term added or dropped are just as unreliable as the p-values that the authors of lme4 decided to no longer give:

If you have very large samples an asymptotic approach (using Wald z or chi-square statistics) is probably just fine. However, the further you depart from conventional repeated measures ANOVA assumptions the harder it is to know how large a sample news [sic] to be before the asymptotics kick in. (Thom Baguley on his blog)

Lots of terms

Practical advice: add terms if they contribute with p < 0.01 (or perhaps p < 0.05), but trust their significance only if their p-values are below 0.0001 (or perhaps 0.001).