Univariate case
A random variable x has normal distribution if its probability density function (pdf) can be expressed as
Here e is the constant 2.7183…, and π is the constant 3.1415…
The normal distribution is completely determined by the parameters μ (mean) and σ (standard deviation). We use the abbreviation N(μ, σ) to refer to a normal distribution with mean μ and standard deviation σ, although for comparison with the multivariate case it would actually be better to use the abbreviation N(μ, σ2) where σ2 is the variance.
Multivariate case
Definition 1: A random vector X has a multivariate normal distribution with vector mean μ and covariance matrix Σ, written X ~ N(μ, Σ) if X has the following joint probability density function:
Here |Σ| is the determinant of the population covariance matrix Σ. The exponent of e consists of the product of the transpose of X – μ, the inverse of Σ and X – μ, which has dimension (1 × k) × (k × k) × (k × 1) = 1 × 1, i.e. a scalar. Thus f(X) yields a single value. The coefficient (2π)k |Σ| can also be expressed as |2πΣ|.
Definition 2: The expression
which appears in the exponent of e, is called the squared Mahalanobis distance between X and μ. We can also define the squared Mahalanobis distance for a sample to be
Where S is the sample covariance matrix and X̄ is the sample mean vector. In Multivariate Normality Functions, we show how to calculate the Mahalanobis distance using the Real Statistics MDistSq worksheet function.
Observation: If k = 1 then the above definition is equivalent to the univariate normal distribution. If k = 2 the result is a three-dimensional bell-shaped curve (as described in Figure 1).
Properties
Property 1: If X ~ N(μ, Σ) where all the xj in X are independent, then the population covariance matrix is a diagonal matrix [aij] with ajj= for all j and aij = 0 for all i ≠ j, and so the joint probability function simplifies to
where each is the univariate normal pdf of xj with mean μj and standard deviation σj.
Property 2: If y = = CTX, where C = the k × 1 vector [cj], and X ~ N(μ, Σ) then y has normal distribution with mean = CTμ and variance = CTΣC; i.e. y ~ N(CTμ,CTΣC).
The unbiased estimates for population mean and population variance are given by the sample mean = CTX̄ and sample variance = CTSC, where sij = cov(xi, xj).
When k = 2, the joint pdf of X depends on the parameters μ1, μ2, σ1, σ2, and ρ. A plot of the distribution for different values of the correlation coefficient ρ is displayed in Figure 1.
Figure 1 – Bivariate normal density function
Observations about Mahalanobis Distance
Suppose X has a multivariate normal distribution. For any constant c, the set of points X which have a Mahalanobis distance from μ of c sketches out a k-dimensional ellipse. The value of the probability density function at all these points is the constant
Let’s take a look at the situation where k = 2. In this case, we have
Finally, note that the equation
is an ellipse with foci at μ = (μ1, μ2).
Note that when ρ = 0, indicating that x1 and x2 are uncorrelated, then the ellipse takes the form z12 + z22 = c2 which is a circle. When ρ = ±1, indicating that x1 and x2 are completely correlated, then the ellipse becomes a straight line.
Using the above calculations, when k = 2 the multivariate normal pdf is
Distribution of Mahalanobis Distance
Property 3: If X ~ N(μ, Σ), then the squared Mahalanobis distance between X and μ has a chi-square distribution with k degrees of freedom.
This property is an extension of Corollary 1 of Chi-square Distribution. We can interpret the property as follows. Let c = the critical value of the chi-square distribution with k degrees of freedom for α = .05. Then the probability that X will fall within the ellipse defined by c, i.e. (X–μ)T Σ-1 (X–μ) = c2, is 1 – α = .95.
References
Genz, A. (2024) Package ‘mvtnorm’ Multivariate normal and t distributions
https://cran.r-project.org/web/packages/mvtnorm/mvtnorm.pdf
Genz, A. (2004) Numerical computation of rectangular bivariate and trivariate normal and t probabilities
A quick question.
According to the equation of the squared Mahalanobis distance, “Z1^2 -(2*p*Z1*Z2)+Z2^2,”
when ρ = 0, the ellipse takes the form (z1 + z2)2 = c2, instead of (z1 – z2)2 = c2?
Hi John,
Thanks for catching this error. I have just corrected this on the webpage.
I appreciate your help in improving the accuracy of the website.
Charles
One more question.
When I was looking at the ‘Confidence Ellipse’ section. You calculate if a point is inside/outside of the ellipse. I think you’re using the equation x’S^-1x <= c^2 and then you substitute chi-square = c. Then you use c^2 = (chi-square)^2 — in your formula you use [… <= H8^2] which is the square of the chi-square value making the ellipse equation x'S^-1x = (chi-square)^2
When I look at other references, (Tatsuoka, Multivariate Analysis, 2nd ed, 1988, page 73) they use x'S^-1x = chi-square. This would be [… <=H8] and not the squared value. Should it be x'S^-1 x = chi-square or x'S^-1x = (chi-square)^2?
Thanks
Good site and program. Question about cumulative bivariate. Suppose sx, sy = 1, means =0 and correlation = 0. When I directly integrate over the limits x, y {-1, 1} using MathCAD, I get 0.466. However, when I do a numerical simulation and count the units within the ellipse, I get ~0.395. Also, I noted somewhere that the integral for a bivariate was ~0.395.
What is the correct area under the range {+/-1} for the x,y for the bivariate? And why do I get ~0.466 and the reported value is ~0.395, and why are these values different? Any help is appreciated.
Daniel,
What software did you use to get the numerical integration as .395? MathCAD?
Charles
I did a simulation with a standard ellipse. I obtained random x,y values using the usual algorithm for generating random values from a defined bivariate. I used MathCAD for the direct integral of the bivariate distribution equation.
If the ellipse lies on either the x or y axes, then the integral over the bivariate with std-x, y = 1, I get the correct value of 0.395.
Subsequently, I found that the probability of being within the ellipse for a bivariate (df = 2) follows the chi-square distribution. So, the probability of being within a standard ellipse (both x, y standard deviations = 1, df =2) should be 0.395.
Your program performs the integral for the bivariate from -inf to a value or a value to +inf. So, the question becomes how to use your program to find the probability between +/-1 for both variables. Or ranges {x1, x2}, {y1, y2}.
Hopefully I clarified. But any help in finding the bivariate distribution integral over arbitrary ranges will help.
Ultimately I am trying to relate to eigenvalues for the covariance matrix for the bivariate distribution. Hence, the SQRT(eigenvalue * chi-square) should be the principal axis. I was trying to simulate the probability for some proportion of the population (i. e. ellipse that contains 90% of the population) bivariate distribution to fall within the an ellipse drawn using the principal axes derived from the eigenvalues. Or maybe there is a specific reference that covers this?
Thanks for the response and any help or clarifications on my confusion.
Ignore the last comment. I searched your website more thoroughly and have found the answers. Good references. Thanks!
However, is there a method to calculate the cumulative bivariate between {x1, x2} and {y1, y2} with your program?
Maybe I missed it somewhere.
Daniel,
Thanks for the additional information.
I am not sure what you mean by the bivariate distribution integral. Are you looking for the area inside the ellipse or the cumulative distribution function (cdf) or the bivariate normal distribution or something else?
Charles
Sorry about the imprecise wording.
When I use your function: =BNORMDIST(1,1,0,0,1,1,0,TRUE) I get 0.707860786678
I think this performs the cdf from -inf to +1 for both x, y. This is close to the value when I do the direct integration in MathCAD (integraly(-inf, 1)intergralx(-inf, 1) [binomial distribution]dxdy = 0.70786098175
In that function [bnormdist[]] I think you only integrate from -inf to x or y. However, how would I find the cdf for the RANGE for x, y of say {-1,1}, {-1,1}? I think the integral should be the chi-square (~0.395).
I was not able to find a method in your program to find the cdf for th binomial distribution over a limited rage in x and y. Is there a method that I missed in the tutorials?
Hopefully this is clearer.
Thanks
Dan,
BNORMDIST(x, y, m1, m2, s1, s2, r, TRUE) = the cdf of the standard bivariate normal distribution at (x,y) with means m1 and m2, standard deviations s1 and s2 and correlation coefficient r. This means that the function evaluates the cdf from (-inf, inf) to (x,y), which seems like what you want. Am I missing something?
Charles
Thanks for the response. Sorry about not being clear. Let me try to simplify.
I need to calculate the bivariate distribution between x1LOW and x1HIGH and y1LOW and y2HIGH.
Your formula:
=BNORMDIST(x1, x2, m1, m2, s1, s2, r, cum)
Calculates the cdf from -inf to the corresponding x1, x2 values.
However, what I am looking for is a formula to calculate across a range of x, y such as:
=BNORMDIST(x1LOW, x1HIGH, x2LOW, x2HIGH, m1, m2, s1, s2, r, cum).
In other words, calculate x1 over the RANGE of (-1, 1) and x2 across the range of (-1, 1).
From theory, if m1, m2 = 0 and s1, s2 = 1, the result I believe will be the chi-square value ~0.395.
How can I use your program to calculate the cdf for the bivariate distribution across a specific range for x1 and x2?
Dan,
I have now implemented a newer, better version of the bivariate normal distribution. I have also included support for ranges as you have requested. This newer version will be included in the next release of the Real Statistics software. I expect this release to be issued in a few days.
Charles
Hi,
Are you sure about BNORMDIST function ?
Using covariance matrix or simplified equation (btw thanks) calculus I get the same result that i believe to be good. When using BNORMDIST i got a value 72.56433684 higher (value remains constant) than simplified equation …
i think there is a glitch somewhere (and should be me) , just want to understand.
BR
Hi Alex,
I just used the calculator at
http://socr.umich.edu/HTML5/BivariateNormal/
to check my calculations for Example 1 at
https://real-statistics.com/multivariate-statistics/multivariate-normal-distribution/multivariate-normality-functions/
The BNORMDIST result is .110764, while the calculator has a result of .1078.
I don’t know which result is better, but they are not wildly different.
It sounds like the results you are getting is completely different. Can you give me a specific example so that I can check it out?
Charles
Is there any way we can plot the PDF of a bivariate normal distribution in EXCEL?
This youtube may help: https://www.youtube.com/watch?v=4BrMSVV7mBM
Thanks,
Reza,
This would have to be a three dimensional graph. In any case, you can use the Real Statistics function BNORMDIST to calculate the pdf values.
Charles
Dr. buneos días, disculpe que significa pdf y cdf, podría ser que
pdf es lo mismo que función de densidad de probabilidad?
Pero cdf?
Muchas gracias
Dr Zaionts, good morning, excuse me what mean PDF and CDF values?
Could be PDF= probability density function?
But CDF?
Thank you
Gerardo,
Yes, PDF = probability density function. CDF = cumulative distribution function.
Charles
Thank you