A person’s well-being: social, economic, psychological, medical, physical, etc.
A person’s annual physical exam report
What is Multivariate Analysis?
The term “multivariate analysis” implies a broader scope than univariate analysis.
Certain approaches, such as simple linear regression and multiple regression, are typically not considered multivariate analysis because they focus on the conditional distribution of a single response variable rather than on multiple variables jointly.
Multivariate analysis focuses on the joint behavior of several variables simultaneously to identify patterns and relationships.
Learning Objectives
Matrix algebra, distributions
Visualization
Inference about a mean vector or multiple mean vectors
Multivariate analysis of variance (MANOVA) and multivariate regression
Linear discriminant analysis (LDA)
Principal component analysis (PCA)
Cluster analysis
Factor analysis
Milestones in the history of multivariate analysis
1901: PCA was invented by Karl Pearson; independently developed by Harold Hotelling in the 1930s.
1904: Charles Spearman introduced factor analysis to identify underlying factors that explain the correlation between multiple variables.
1928: Wishart presented the distribution of the covariance matrix of a random sample from a multivariate normal distribution.
1936: Ronald Fisher developed discriminant analysis.
Milestones in the history of multivariate analysis
1932: Cluster analysis by Driver and Kroeber.
1936: Canonical analysis by Harold Hotelling.
1960s: Multidimensional scaling.
1970s: Multivariate regression.
1980s: Structural equation modeling; the idea dates back to Sewall Wright (1920–1921).
Matrix Algebra
Vectors: We begin with a little bit of matrix algebra
Vectors in R
There are many ways to create or define a vector in R
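For instance, a few common ways (a minimal illustration; the values are arbitrary):

```r
x <- c(0.4, 0.2, 0.5)       # combine values into a vector
y <- seq(1, 9, by = 4)      # a regular sequence: 1, 5, 9
z <- rep(1, 3)              # repeat a value: 1, 1, 1
x1 <- matrix(c(0.4, 0.2, 0.5), nrow = 3, ncol = 1)  # a 3x1 column vector
length(x)                   # 3
```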
Let \(x=\begin{pmatrix}x_1\\ \vdots\\ x_n\end{pmatrix}, y=\begin{pmatrix}y_1\\ \vdots\\ y_n\end{pmatrix}\). The inner product of \(x\) and \(y\) is \(\langle x,y\rangle=x_1y_1 + \cdots + x_ny_n=\sum_{i=1}^n x_iy_i\)
Note, the two vectors must have the same length
The norm / Euclidean norm / length of \(x\) is \(||x||=\sqrt{\langle x,x\rangle}\).
The Euclidean distance between \(x\) and \(y\) is \[D(x,y)=||x-y||=\sqrt{(x_1-y_1)^2 + \cdots + (x_n-y_n)^2}\]
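These definitions translate directly into R; a small sketch with made-up vectors:

```r
x <- c(1, 2, 3)
y <- c(4, 5, 6)
inner <- sum(x * y)              # <x, y> = 1*4 + 2*5 + 3*6 = 32
norm_x <- sqrt(sum(x * x))       # ||x|| = sqrt(14)
dist_xy <- sqrt(sum((x - y)^2))  # D(x, y) = sqrt(27)
```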
Inner Product and Norm
Distance: 1d and 2d
Distance: 3d
Example: Norm
x1 = matrix(c(0.4, 0.2, 0.5), 3, 1)
# the norm/length of x1
sqrt(sum(x1^2))
Motivating example. Consider bivariate random vectors. The standard deviations are 2 and 1, respectively.
What is the distance between (-2,0) and (2,0)? 4.
What is the distance between (0, -2) and (0,2)? 4.
Example: (Euclidean) Distance
# R code
library(MASS)      # for mvrnorm()
library(magrittr)  # for the pipe %>%
set.seed(20230404)
par(pty = "s")  # to make sure the shape of the figure is a square
mvrnorm(n = 1000, c(0, 0), matrix(c(4, 0, 0, 1), 2, 2)) %>%
  plot(xlab = "x", ylab = "y", xlim = c(-4, 4), ylim = c(-4, 4))
points(x = c(-2, 0, 0, 2), y = c(0, -2, 2, 0), pch = c(15, 16, 16, 15), col = c(2, 3, 3, 2), cex = 3)
Both pairs have a distance of 4.
But we notice that pairs with a y-distance greater than 4 are very rare; by comparison, many more pairs have an x-distance greater than 4.
A Homework Problem of Euclidean Distances
Suppose \(X_1, X_2, Y_1, Y_2\) are mutually independent.
\(X_1\) and \(X_2\) are iid from \(N(\mu=0, \sigma_x^2=2^2)\)
\(Y_1\) and \(Y_2\) are iid from \(N(\mu=0, \sigma_y^2=1^2)\)
Consider the two pairs \((X_1, X_2)\) and \((Y_1, Y_2)\). Which pair tends to have a larger difference?
To answer the question, we can calculate or estimate the following two probabilities: \[P(|X_1-X_2|>4), P(|Y_1-Y_2|>4)\]
Calculate \(P(|X_1-X_2|>4)\)
First, find the distribution of \(X_1-X_2\) and standardize it to have mean 0 and SD 1.
Second, express the probability as \(P(|Z|>z)\), where \(Z\sim N(0,1)\).
Next, express the probability in terms of \(\Phi(\cdot)\), the CDF of the standard normal distribution.
Last, use the “pnorm” function in R to find the numerical value.
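Following these steps: \(X_1-X_2\sim N(0, 2^2+2^2)=N(0,8)\), so \(P(|X_1-X_2|>4)=P(|Z|>4/\sqrt{8})=2\Phi(-\sqrt{2})\). A minimal sketch in R (the analogous value for \(Y_1-Y_2\sim N(0,2)\) is shown for comparison):

```r
# X1 - X2 ~ N(0, 8), so its SD is sqrt(8)
p_x <- 2 * pnorm(-4 / sqrt(8))   # = 2 * Phi(-sqrt(2)) ~ 0.157
# Y1 - Y2 ~ N(0, 2), so its SD is sqrt(2)
p_y <- 2 * pnorm(-4 / sqrt(2))   # ~ 0.0047
```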
Estimate \(P(|X_1-X_2|>4)\)
The probability can be estimated by doing simulations/sampling.
If you sample many (say 10,000) pairs of \(X_1\) and \(X_2\), count how many pairs satisfy \(|X_1-X_2|>4\). The resulting proportion estimates \(P(|X_1-X_2|>4)\)
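A minimal simulation sketch (the seed and sample size are arbitrary choices):

```r
set.seed(1)
n <- 10000
x1 <- rnorm(n, mean = 0, sd = 2)
x2 <- rnorm(n, mean = 0, sd = 2)
est <- mean(abs(x1 - x2) > 4)  # proportion of pairs with |X1 - X2| > 4
est
```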
Statistical / Mahalanobis Distance
The two probabilities \(P(|X_1-X_2|>4)\) and \(P(|Y_1-Y_2|>4)\) are quite different.
Euclidean distance might be misleading.
In the example we have examined, the x-values and y-values are independent but have different variations.
Statistical / Mahalanobis Distance
The variation along \(x\) is greater than along \(y\). Let \(X_1\) and \(X_2\) be two random points along the \(x\) direction, \(Y_1\) and \(Y_2\) be two random points along the \(y\) direction.
One simple idea is to standardize both. Because the SD of \(Y\) is 1, we don't need to change the y-values. Because the SD of \(X\) is 2, we shrink the x-values by 50%.
point (-2,0) becomes (-1,0)
point (2, 0) becomes (1,0)
The distance between the red pair is 2; the distance between the green pair is 4.
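The standardization above can be checked in R (a sketch; `sweep` divides each coordinate by the corresponding SD):

```r
red   <- rbind(c(-2, 0), c(2, 0))   # the pair along the x direction
green <- rbind(c(0, -2), c(0, 2))   # the pair along the y direction
s <- c(2, 1)                        # SDs of x and y
red_s   <- sweep(red, 2, s, "/")    # becomes (-1, 0) and (1, 0)
green_s <- sweep(green, 2, s, "/")  # unchanged: (0, -2) and (0, 2)
d_red   <- sqrt(sum((red_s[1, ] - red_s[2, ])^2))    # 2
d_green <- sqrt(sum((green_s[1, ] - green_s[2, ])^2)) # 4
```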
Standardized Observations
Original vs Standardized Observations
Statistical Distance
In the example above, \(X\) and \(Y\) are independent, so the covariance is zero. Statistical distance can also be defined when the covariance matrix \(\Sigma\) is not diagonal.
We will introduce a type of statistical distance known as the Mahalanobis distance.
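As a preview, R's built-in `mahalanobis()` returns the *squared* statistical distance \((x-\mu)'\Sigma^{-1}(x-\mu)\); with the diagonal covariance matrix from the example above, it reproduces the standardized distances:

```r
Sigma <- diag(c(4, 1))  # covariance matrix from the example (SDs 2 and 1, independent)
# distance between (-2, 0) and (2, 0): matches the standardized distance 2
d_red   <- sqrt(mahalanobis(c(-2, 0), center = c(2, 0), cov = Sigma))
# distance between (0, -2) and (0, 2): matches the standardized distance 4
d_green <- sqrt(mahalanobis(c(0, -2), center = c(0, 2), cov = Sigma))
```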