Day 2: Introduction to Estimation

Zhaoxia Yu, Department of Statistics, University of California, Irvine

2024-07-14

Introduction

Example

Parameter Estimation

  • For a population, two important parameters are population mean and population variance, denoted as \(\mu\) and \(\sigma^{2}\) respectively, of a random variable.

  • Parameters are unknown in general.

  • We shall use observed data to estimate the unknown parameters.

  • In this process, we often provide

    • a point estimate
    • uncertainty, such as standard error of the point estimate or a confidence interval

Sample Mean and Variance

  • Let \(X_{1}, X_{2}, \ldots, X_{n}\) denote a random sample of size \(n\) from a population with population mean \(\mu\) and population variance \(\sigma^2\).

  • The sample mean: \[ \begin{equation*} \bar{X} = \frac{\sum_{i=1}^{n}X_{i}}{n}=\frac{X_1+\cdots+X_n}{n}. \end{equation*} \]

  • The sample variance

\[ S^2= \frac{\sum_{i=1}^{n}(X_{i}-\bar X)^2}{n-1}=\frac{(X_1-\bar X)^2+\cdots+(X_n-\bar X)^2}{n-1}. \]

Estimators

  • The sample mean \(\bar{X}\) can be used as an estimator for \(\mu\). Notation: \[\hat\mu=\bar X\]

  • The estimator itself is considered as a random variable since it value can change.

  • Similarly, the sample variance \(S^2\) can be used to estimate \(\sigma^2\). Notation: \(\hat\sigma^2=S^2\).

Estimates

Using a sample \(x_{1}, x_{2}, \ldots, x_{n}\), we can compute

\[ \begin{equation*} \bar{x} = \frac{\sum_{i=1}^{n}x_{i}}{n} \end{equation*} \]

\[ \begin{equation*} s^{2} = \frac{\sum_{i=1}^{n}(x_{i} - \bar{x})^{2}}{n-1}. \end{equation*} \]

Theoretical results

The Sampling Distribution of \(\bar X\)

  • As shown in the previous slide, if the true distribution is \(N(\mu, \sigma^2)\), then \[\bar X \sim N(\mu, \frac{\sigma^2}{n}) \mbox{ and } \frac{\bar X-\mu}{\sigma/\sqrt{n}} \sim N(0, 1)\]

  • If the sample is not from a normal distribution, in many cases, as long as the sample size \(n\) is large enough, the normal distribution still works well.

  • The underlying theories related are

    • the central limit theorem
    • law of large numbers

Use Z-crtical Valyes

  • Confidence intervals based on Z-critical values have the form of \(\bar x \pm Z_{crit} \frac{\sigma}{\sqrt{n}}\). In practice, we don’t konw \(\sigma\), so we use

\[\bar x \pm Z_{crit} \frac{s}{\sqrt{n}}.\]

  • Use the 68-95-99.7 rule
    • a 95% confidence interval (CI) is \(\bar x \pm 2 \frac{s}{\sqrt n}\).
    • a 99.8% confidence interval (CI) is \(\bar x \pm 3 \frac{s}{\sqrt n}\).

t-critcal value

  • If the sample is from \(N(\mu, \sigma^2)\), then

\[\frac{\bar X-\mu}{s/\sqrt{n}} \sim t_{n-1}\]

  • For this example, t-critical values are more accurate.

  • Confidence intervals based on t-critical values

\[\bar x \pm t_{crit} \frac{s}{\sqrt{n}},\] where \(t_{crit}\) depends on both the sample size \(n\) and the chosen confidence level.

Margin of error

  • We refer to \(s/\sqrt{n}\) as the standard error of the sample mean \(\bar{X}\).

  • We can write the confidence interval as \[\begin{equation*} \bar{x} \pm t_{\mathrm{crit}}\times SE \end{equation*}\]

  • The term \(t_{\mathrm{crit}}\times SE\) is called the margin of error for the given confidence level.

  • It is common to present interval estimates for a given confidence level as \[\begin{equation*} \textrm{Point estimate} \pm\textrm{Margin of error.} \end{equation*}\]

  • Note, in many articles, people also present mean \(\pm\) SD.

Example

Estimate the volume of hippocampus for women between 40 and 50 years old

Code
alzheimer_data <- read.csv('data/alzheimer_data.csv')
dim(alzheimer_data)
[1] 2700   57
Code
alzheimer_subset =alzheimer_data[alzheimer_data$age<=50 & alzheimer_data$age>40 & alzheimer_data$female==1, ]
dim(alzheimer_subset)
[1] 70 57
Code
hippo=alzheimer_subset$lhippo + alzheimer_subset$rhippo
#estimate of mean
mean(hippo)
[1] 6.470666
Code
#standard error
sd(hippo)/sqrt(length(hippo))
[1] 0.1023539
Code
#a 95% c.i. using z-critical value
mean(hippo) - 2*sd(hippo)/sqrt(length(hippo))
[1] 6.265958
Code
mean(hippo) + 2*sd(hippo)/sqrt(length(hippo))
[1] 6.675374
Code
mean(hippo) - qnorm(0.975)*sd(hippo)/sqrt(length(hippo))
[1] 6.270056
Code
mean(hippo) + qnorm(0.975)*sd(hippo)/sqrt(length(hippo))
[1] 6.671276
Code
mean(hippo) - qt(0.975, df=length(hippo)-1)*sd(hippo)/sqrt(length(hippo))
[1] 6.266475
Code
mean(hippo) + qt(0.975, df=length(hippo)-1)*sd(hippo)/sqrt(length(hippo))
[1] 6.674856