## Family-wise error rate (FWER) under H0

We performed a validation study to assess the type I error rate when applying the permutation and bootstrap clustering approaches to hypothesis testing. We used a balanced repeated-measures ANOVA design with a two-level between-group factor and a three-level within-group factor. A total of 134 observers (67 per group) were drawn from previous face-viewing eye-movement studies. We centred the cell means for the whole dataset to obtain a validation dataset under the null hypothesis. Thus, we used real data to ensure realistic distributions and centred them so that H0 was true by construction. Any significant output from *i*Map4 on this dataset is therefore a false alarm (type I error).
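The centring step can be illustrated with a minimal sketch. It assumes one scalar value per subject and within-subject condition, whereas the real data are fixation maps to which the same operation would be applied at every pixel. The function name `centre_cell_means` and the toy values are hypothetical and not part of *i*Map4.

```python
from collections import defaultdict
from statistics import fmean


def centre_cell_means(values, groups):
    """Subtract each (group, condition) cell mean from the observations in that cell.

    values: list of per-subject lists, one value per within-subject condition.
    groups: list of group labels, one per subject.
    """
    # Collect the observations that belong to each (group, condition) cell.
    cells = defaultdict(list)
    for subject_values, group in zip(values, groups):
        for condition, value in enumerate(subject_values):
            cells[(group, condition)].append(value)
    cell_means = {cell: fmean(obs) for cell, obs in cells.items()}

    # After subtraction every cell has mean zero, so no main effect or
    # interaction is present in the centred data.
    return [
        [value - cell_means[(group, condition)]
         for condition, value in enumerate(subject_values)]
        for subject_values, group in zip(values, groups)
    ]


# Toy data: 2 groups x 3 conditions, 2 subjects per group.
values = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0],
          [10.0, 20.0, 30.0], [20.0, 40.0, 60.0]]
groups = ["A", "A", "B", "B"]
centred = centre_cell_means(values, groups)
```

Centring removes differences between cell means while leaving the within-cell spread of the real data untouched, which is why the resulting distributions stay realistic.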

The validation procedure was as follows. We first randomly sampled, without replacement, a balanced number of subjects from both groups. We then ran *i*Map4 under the default settings and performed hypothesis testing on the two main effects and the interaction. To estimate the family-wise error rate (FWER), we computed the frequency of significant output under different statistics and multiple comparison correction (MCC) settings. Preliminary results based on 1000 randomizations at sample sizes of n ∈ {8, 16, 32, 64} showed that, with an alpha of .05, the FWERs are all under .05 using non-parametric statistics (see Figure 2b for the permutation test, and 2c and 2d for the bootstrap clustering test). More simulations covering a wider range of scenarios will be required to fully understand the behaviour of the proposed approaches, although cluster statistics are likely to behave as in Pernet et al. (2014).
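The simulation loop described above can be sketched as follows. This is an illustration only: `run_test` is a hypothetical stand-in for the full *i*Map4 analysis, assumed to return one boolean per pixel (True meaning significant), and the sketch assumes that the sample size n is counted per group, which the text does not state explicitly.

```python
import random


def estimate_false_alarm_rate(group_a, group_b, n_per_group, n_simulations,
                              run_test, seed=0):
    """Frequency of simulations in which run_test reports any significant output.

    run_test is a placeholder for the real analysis: it receives the two
    subsamples and returns an iterable of booleans, one per pixel.
    """
    rng = random.Random(seed)
    n_false_alarms = 0
    for _ in range(n_simulations):
        # random.sample draws without replacement, keeping the design balanced.
        sample_a = rng.sample(group_a, n_per_group)
        sample_b = rng.sample(group_b, n_per_group)
        if any(run_test(sample_a, sample_b)):
            n_false_alarms += 1
    return n_false_alarms / n_simulations


# Toy usage with subject indices and dummy tests (not real analyses).
group_a = list(range(67))
group_b = list(range(67, 134))


def never_significant(a, b):
    return [False] * 10


def always_significant(a, b):
    return [True] + [False] * 9


rate_never = estimate_false_alarm_rate(group_a, group_b, 8, 1000, never_significant)
rate_always = estimate_false_alarm_rate(group_a, group_b, 8, 1000, always_significant)
```

Because the data are centred, every significant output counted here is a false alarm, so the returned frequency is an estimate of the error rate for the chosen statistic and correction.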

The figure above (Figure 2) shows the validation results of the proposed resampling procedures for statistical inference.

a) The family-wise error rate using the uncorrected parametric p-value. All FWERs are significantly above .05.

b) The family-wise error rate using the permutation approach (Algorithm 1).

c) The family-wise error rate using the proposed bootstrap clustering approach (Algorithm 2), thresholding on cluster mass.

d) The family-wise error rate using the proposed bootstrap clustering approach (Algorithm 2), thresholding on cluster extent.

Note that the FWERs in a) and b) are computed at the pixel level (i.e., the proportion of false-positive pixels across simulations), whereas those in c) and d) are computed at the test level (i.e., the percentage of tests with at least one false positive across the 1000 simulations).

Error bars show the standard error.
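The two levels of aggregation mentioned in the caption give different numbers for the same simulations, as the minimal sketch below shows. The function names and the toy masks are hypothetical. The standard error uses the binomial formula for a proportion, which is an assumption: the source does not state how the error bars were computed.

```python
from math import sqrt


def pixel_level_rate(masks):
    """Proportion of false-positive pixels across all simulations."""
    n_pixels = sum(len(mask) for mask in masks)
    n_false_positives = sum(sum(mask) for mask in masks)
    return n_false_positives / n_pixels


def per_test_rate(masks):
    """Proportion of simulations with at least one false-positive pixel."""
    return sum(any(mask) for mask in masks) / len(masks)


def proportion_standard_error(p, n):
    """Binomial standard error of a proportion p estimated from n trials
    (assumed formula, not taken from the source)."""
    return sqrt(p * (1 - p) / n)


# Toy significance masks: 4 simulations x 5 pixels (True = significant).
masks = [
    [False, False, False, False, False],
    [True, False, False, False, False],
    [False, False, False, False, False],
    [True, True, False, False, False],
]
pixel_rate = pixel_level_rate(masks)  # 3 of 20 pixels
test_rate = per_test_rate(masks)      # 2 of 4 simulations
test_rate_se = proportion_standard_error(test_rate, len(masks))
```

The per-test rate can never be lower than the pixel-level rate for the same masks, so the values in panels a) and b) are not directly comparable with those in panels c) and d).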