Lecture 9: Use Simulations to Assess Statistical Methods

Zhaoxia Yu

Professor, Department of Statistics

2026-04-27

Code
knitr::opts_chunk$set(echo = TRUE, eval = FALSE)
library(tidyr) # used below for gather(); also re-exports the pipe (%>%)
library(MASS)
library(tidyverse)

Midterm Project

Too many choices?

  • Due to the nature of multivariate analysis, we have seen many choices for conducting one-way MANOVA.
  • Do they work equally well?
  • Does their performance depend on the true distribution?
  • You will be asked to compare the methods in your midterm project.

Midterm Project

  • Goal: conduct a simulation study to compare four MANOVA tests

Type I Error Rate

  • The type I error rate of a test is the probability of rejecting the null hypothesis when it is actually true.

  • In many research contexts, the acceptable type I error rate is set at 0.05, meaning that there is a 5% chance of falsely rejecting the null hypothesis.

  • A simulation study can help assess whether a statistical test maintains the desired type I error rate under various conditions.
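
For instance, the type I error rate of a two-sample t-test can be estimated by repeatedly generating data under the null hypothesis; a minimal sketch, with illustrative sample size and number of repetitions:

Code
set.seed(1)
# both samples come from the same N(0, 1), so the null hypothesis is true
pvals <- replicate(1000, t.test(rnorm(20), rnorm(20))$p.value)
mean(pvals < 0.05)  # proportion of false rejections; should be near 0.05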

Power

  • The power of a statistical test is the probability of correctly rejecting a null hypothesis when it is false.

  • In other words, power is the ability of a test to detect a true effect or relationship between variables.

  • Power is affected by several factors, including sample size, effect size, and significance level.

  • For simple tests, analytical formulas for calculating power may exist; for complicated tests, simulations can be used to evaluate power.
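
For example, the analytical power of a two-sample t-test from stats::power.t.test can be checked against a simulation-based estimate; the settings below (n = 20 per group, mean difference 0.5) are illustrative:

Code
# analytical power for a two-sample t-test
power.t.test(n = 20, delta = 0.5, sd = 1, sig.level = 0.05)$power

# simulation-based estimate of the same power
set.seed(1)
pvals <- replicate(1000, t.test(rnorm(20, 0), rnorm(20, 0.5))$p.value)
mean(pvals < 0.05)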

Rationale for Simulation Studies

  • Simulation studies allow researchers to evaluate and compare the performance of statistical tests under controlled conditions.

  • They can help determine which test is most appropriate for a specific situation.

  • By simulating data with known properties, researchers can assess the type I error rate, power, and other performance metrics of each test.

Designing a Simulation Study

  • Choose the statistical methods to compare: Select candidate methods you want to evaluate and ensure they are applicable to your research question. Examples:
    • Statistical tests for hypothesis testing.
    • Estimation methods for parameter estimation.
    • Model selection criteria for choosing between competing models.
    • Predictive modeling methods for classification or regression tasks.
  • Define the data-generating process: Create a model to simulate data with known properties that reflect the characteristics of your target population. Consider factors such as
    • distribution,
    • sample size,
    • effect size.
    • It's helpful to vary these factors to understand how they affect the performance of the methods being evaluated.
  • Determine performance metrics: Decide on the metrics you will use to evaluate the performance of each test (e.g., type I error, power, prediction accuracy, confidence interval coverage).

  • Set up the simulation: Determine the number of repetitions (see the Monte Carlo standard error sketch after this list).

  • Analyze and interpret the results: Assess the performance of each test based on the simulation results and draw conclusions.
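
Morris et al. (2021, discussed below) tie the number of repetitions to the Monte Carlo standard error (MCSE) of the key performance measure: for a rejection proportion estimated from \(n_{rep}\) repetitions, the MCSE is \(\sqrt{\hat{p}(1-\hat{p})/n_{rep}}\). A minimal sketch, with illustrative target values:

Code
# repetitions needed so a rejection rate near p is estimated
# with Monte Carlo standard error at most mcse
p <- 0.05; mcse <- 0.005
ceiling(p * (1 - p) / mcse^2)  # 1900 repetitions

# conversely, the MCSE achieved by n_rep = 1000 repetitions
sqrt(p * (1 - p) / 1000)       # about 0.007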

Simulation Studies in Statistical Research

A Practical Guide Based on the Paper by Morris et al. (2021)

  • Paper reference: Morris, T. P., White, I. R., & Crowther, M. J. (2021). Using simulation studies to evaluate statistical methods. Statistics in Medicine, 40(8), 2073-2102.

Objectives of Simulation Studies

  • Evaluate performance of statistical methods
  • Compare competing methods
  • Assess sensitivity of methods to different data structures
  • Investigate robustness of methods to model assumptions

Key Components of a Simulation Study

  • Data Generating Process (DGP)
  • Method Implementation
  • Performance Measures
  • Simulation Parameters
  • Analysis and Interpretation

Data Generating Process (DGP)

  • Define the true model and relationships
  • Consider the distribution of the data
  • Incorporate relevant covariates and confounders
  • Generate data samples based on the defined model
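
As a sketch of a simple DGP in the one-way MANOVA setting of this course (two groups of bivariate normal responses; the means and covariance below are illustrative choices):

Code
library(MASS)  # for mvrnorm
Sigma <- matrix(c(1, 0.5, 0.5, 1), 2, 2)               # within-group covariance
g1 <- mvrnorm(n = 30, mu = c(0, 0), Sigma = Sigma)     # group 1
g2 <- mvrnorm(n = 30, mu = c(0.5, 0.5), Sigma = Sigma) # group 2, shifted means
dat <- data.frame(rbind(g1, g2), group = rep(c("A", "B"), each = 30))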

Method Implementation and Performance Measures

  • Implement the statistical method(s) being evaluated
  • Apply the method(s) to the generated data samples
  • Record results and performance metrics for each simulation run
  • Select appropriate performance measures to evaluate the method(s)
  • Examples: bias, mean squared error (MSE), coverage, power, type I error rate, etc.
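
A sketch of how bias, MSE, and coverage can be computed once the estimates are collected; est and se below are illustrative stand-ins for simulation output:

Code
set.seed(1)
theta <- 1                                # true parameter value
est <- rnorm(1000, mean = 1.02, sd = 0.2) # stand-in for simulated estimates
se <- rep(0.2, 1000)                      # stand-in for their standard errors

mean(est) - theta                         # bias
mean((est - theta)^2)                     # mean squared error (MSE)
mean(est - 1.96 * se <= theta &
     theta <= est + 1.96 * se)            # coverage of nominal 95% CIs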

Simulation Parameters

  • Determine sample size(s) and number of simulation runs
  • Consider variations in the DGP or model assumptions
  • Ensure adequate coverage of the parameter space
  • Use “trial and error” to choose parameters/effect sizes that yield reasonable power (see the sketch after this list):
    • when the effect size / treatment effect is too small, all methods have low power
    • when the effect size / treatment effect is large, all methods have power close to 100%
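
A sketch of such a scan for a two-sample t-test; the grid of effect sizes and the number of repetitions are illustrative:

Code
set.seed(1)
sapply(c(0.1, 0.2, 0.5, 1, 2), function(delta) {
  pvals <- replicate(500, t.test(rnorm(20), rnorm(20, delta))$p.value)
  mean(pvals < 0.05)  # estimated power at this effect size
})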

Analysis and Interpretation

  • Summarize results across simulation runs
  • Use graphical and numerical summaries
  • Compare methods and assess performance
  • Draw conclusions and provide recommendations

Reporting Simulation Studies

  • Clearly describe the DGP, method implementation, and performance measures
  • Present results in a transparent and reproducible manner
  • Discuss limitations and potential biases
  • Provide guidance for future research and applications

Analyzing and Interpreting the Results

  • Compare the type I error rates and power of the tests.
  • Determine which test performs best for your specific situation.
  • Consider other factors such as assumptions, sample size, and computational efficiency.
  • Use the results to inform your choice of statistical test in your research.

A Case Study

Compare two-sample t-test and Mann-Whitney-Wilcoxon test in the presence of outliers

  • The two-sample t-test is a widely used test
  • The Mann-Whitney-Wilcoxon test is often used as a nonparametric alternative when the distributional assumptions are in doubt
  • It has been frequently reported that the two-sample t-test is still a good choice as long as the sample size is large enough, even for data from a skewed distribution
  • The purpose of this simulation study is to compare their performance in the presence of outliers

Design of Simulation Study

  • Verify that the t-test is indeed more powerful when the normality assumption is met
    • Equal variance: \(t_E\)
    • Unequal variance: \(t\)
  • Compare the performance of the tests when there are outliers

Functions 1

Code
generate_data <- function(n, mu, sd) {
  # a sample of size n from N(mu, sd^2)
  rnorm(n, mean = mu, sd = sd)}

generate_data_out <- function(n, mu, sd, sd_out, p) {
  # a contaminated sample: a proportion p of the n observations are
  # outliers drawn from N(mu, sd_out^2), the rest from N(mu, sd^2)
  n1=round(n*(1-p)); n2=n-n1
  c(rnorm(n1, mean = mu, sd = sd), rnorm(n2, mean=mu, sd=sd_out))
  }

run_tests <- function(data1, data2) {
  # p-values of the three tests applied to the same pair of samples
  t_test <- t.test(data1, data2)$p.value                   # Welch t-test
  t_e_test <- t.test(data1, data2, var.equal=TRUE)$p.value # equal-variance t-test
  wilcox_test <- wilcox.test(data1, data2)$p.value         # Mann-Whitney-Wilcoxon
  c(t = t_test, t_e_test=t_e_test, wilcox_test = wilcox_test)}

Functions 2

Code
#without outliers: return the proportion of p-values below 0.05 across
#n_rep repetitions (power when mu1 != mu2; type I error when mu1 = mu2)
pvalues <- function(n, mu1, mu2, sd, n_rep=1000){
  results=matrix(0, n_rep, 3)
  for(b in 1:n_rep){
    data1 <- generate_data(n, mu1, sd)
    data2 <- generate_data(n, mu2, sd)
    results[b, ]=run_tests(data1, data2) 
  }
  return(colMeans(results<0.05))}

#with outliers: same, but both samples are contaminated
pvalues_out <- function(n, mu1, mu2, sd, sd_out, p, n_rep=1000){
  results=matrix(0, n_rep, 3)
  for(b in 1:n_rep){
    data1 <- generate_data_out(n, mu1, sd, sd_out, p)
    data2 <- generate_data_out(n, mu2, sd, sd_out, p)
    results[b, ]=run_tests(data1, data2) 
  }
  return(colMeans(results<0.05))}

Power: No Outlier

Code
power_n5=rbind(pvalues(n=5, mu1=0, mu2=0.2, sd=1, n_rep=1000),
  pvalues(n=5, mu1=0, mu2=0.5, sd=1,n_rep=1000),
  pvalues(n=5, mu1=0, mu2=1, sd=1, n_rep=1000))

power_n10=rbind(pvalues(n=10, mu1=0, mu2=0.2, sd=1, n_rep=1000),
  pvalues(n=10, mu1=0, mu2=0.5, sd=1, n_rep=1000),
  pvalues(n=10, mu1=0, mu2=1, sd=1, n_rep=1000))

power_n50=rbind(pvalues(n=50, mu1=0, mu2=0.2, sd=1, n_rep=1000),
  pvalues(n=50, mu1=0, mu2=0.5, sd=1, n_rep=1000),
  pvalues(n=50, mu1=0, mu2=1, sd=1, n_rep=1000))

# columns: Welch t ("t"), equal-variance t ("t_e"), Wilcoxon rank test ("r")
colnames(power_n5)=c("t", "t_e", "r")
colnames(power_n10)=c("t", "t_e", "r")
colnames(power_n50)=c("t", "t_e", "r")
power_normal_wide=data.frame(cbind(rbind(power_n5, power_n10, power_n50), 
                                   diff=rep(c(0.2, 0.5, 1.0),3), 
                                   n=rep(c(5, 10, 50), each=3)))
power_normal=gather(power_normal_wide, test, power, 1:3, factor_key=TRUE)

ggplot(power_normal, aes(x=diff, y=power, 
                         group=interaction(test, n), color=factor(n), shape=test)) +
  geom_point(size=3) +geom_line() +  ggtitle("Power: no outlier")

Type I Error Rate: No Outlier

Code
# mu1 = mu2: the null hypothesis is true, so the rejection rate estimates the type I error
type1=rbind(
  pvalues(n=5, mu1=0, mu2=0, sd=1, n_rep=1000),
  pvalues(n=10, mu1=0, mu2=0, sd=1, n_rep=1000),
  pvalues(n=50, mu1=0, mu2=0, sd=1, n_rep=1000))
colnames(type1)=c("t", "t_e", "r")
type1_wide=data.frame(type1, n=c(5,10,50))
type1=gather(type1_wide, test, power, 1:3, factor_key=TRUE)
ggplot(type1, aes(x=n, y=power, group=test, shape=test, color=test))+
  geom_point(size=3) + geom_line()+
  geom_hline(yintercept=0.05, linetype="dashed", color = "red") +
  ggtitle("Type I Error: no outlier")

Type I Error Rate: \(1\%\) Outliers

Code
#1% outliers
type1_out_mat=rbind(
  pvalues_out(n=5, mu1=0, mu2=0, sd=1, sd_out=5, p=0.01, n_rep=1000),
  pvalues_out(n=10, mu1=0, mu2=0, sd=1, sd_out=5, p=0.01, n_rep=1000),
  pvalues_out(n=20, mu1=0, mu2=0, sd=1, sd_out=5, p=0.01, n_rep=1000),
  pvalues_out(n=50, mu1=0, mu2=0, sd=1, sd_out=5, p=0.01, n_rep=1000))
colnames(type1_out_mat)=c("t", "t_e", "r")
type1_wide_out=data.frame(type1_out_mat, n=c(5,10,20,50))
type1_out=gather(type1_wide_out, test, power, 1:3, factor_key=TRUE)
ggplot(type1_out, aes(x=n, y=power, group=test, shape=test, color=test))+
  geom_point(size=3) +
  geom_line()+
  geom_hline(yintercept=0.05, linetype="dashed", color = "red")+
  ggtitle("Type I Error: 1% outliers")

Type I Error Rate: \(5\%\) Outliers

Code
#5% outliers
type1_out_mat=rbind(
  pvalues_out(n=5, mu1=0, mu2=0, sd=1, sd_out=5, p=0.05, n_rep=1000),
  pvalues_out(n=10, mu1=0, mu2=0, sd=1, sd_out=5, p=0.05, n_rep=1000),
  pvalues_out(n=20, mu1=0, mu2=0, sd=1, sd_out=5, p=0.05, n_rep=1000),
  pvalues_out(n=50, mu1=0, mu2=0, sd=1, sd_out=5, p=0.05, n_rep=1000))
colnames(type1_out_mat)=c("t", "t_e", "r")
type1_wide_out=data.frame(type1_out_mat, n=c(5,10,20,50))
type1_out=gather(type1_wide_out, test, power, 1:3, factor_key=TRUE)
ggplot(type1_out, aes(x=n, y=power, group=test, shape=test, color=test))+
  geom_point(size=3) +
  geom_line()+
  geom_hline(yintercept=0.05, linetype="dashed", color = "red")+
  ggtitle("Type I Error: 5% outliers")

Conclusions and Limitations

  • Conclusions:
    • As has been reported by many others, the t-test tends to be more powerful, especially for small sample sizes
    • For the outlier model we examined, the type I error rates of the tests remained reasonable
  • Limitations:
    • Only two distributions were examined: normal and a normal mixture
    • The effect of unequal variances was not examined