Linear Discriminant Analysis | Real Statistics Using Excel

Basic Concepts

In Discriminant Analysis, given a finite number of categories (considered to be populations), we want to determine which category a specific data vector belongs to. More specifically, we assume that we have r populations D₁, …, D_r consisting of k × 1 vectors. Furthermore, we assume that each population has a multivariate normal distribution N(μ_i,Σ_i). The values for μ_i and Σ_i are estimated using training data for which we know which population D_i each X in the training data belongs to.

For each i let f_i(X) be the pdf for N(μ_i,Σ_i), and so we can define f(X|D_i) = f_i(X).

In Linear Discriminant Analysis we assume that Σ₁ = Σ₂ = … = Σ_r = Σ, and so each D_i is differentiated by the mean vector μ_i.

Bayesian Approach

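We use a Bayesian analysis approach based on the maximum likelihood function. In particular, we assume some prior probability function

$$\pi_i = P(X \in D_i)$$

We can then define a posterior probability function

$$p_i(X) = P(X \in D_i \mid X) = \frac{\pi_i f_i(X)}{\sum_{j=1}^r \pi_j f_j(X)}$$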

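For each X we decide which population X belongs to by maximizing the value of p_i(X). Since the denominator in the expression for p_i(X) is the same (and positive) for every i, this is equivalent to maximizing the numerator, i.e. X ∈ D_i* where

$$i^* = \arg\max_i \, \pi_i f_i(X)$$

and

$$f_i(X) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2}(X-\mu_i)^T \Sigma^{-1} (X-\mu_i)\right)$$

(in 2π the symbol π is the usual constant, not a prior probability). Since the natural log is an increasing function, i* also maximizes ln(π_i f_i(X)). It now follows that

$$\ln(\pi_i f_i(X)) = -\tfrac{1}{2}\left[k \ln(2\pi) + \ln|\Sigma| + X^T \Sigma^{-1} X\right] + \mu_i^T \Sigma^{-1} X - \tfrac{1}{2}\mu_i^T \Sigma^{-1} \mu_i + \ln \pi_i$$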

Since this first term is the same for each i, it can be dropped as it won’t affect the value of i*. This motivates the following definitions.

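We now define the linear discriminant function to be

$$d_i(X) = d_{i0} + \sum_{j=1}^k d_{ij} x_j$$

where X = (x_1, …, x_k)^T,

$$d_{i0} = -\tfrac{1}{2}\mu_i^T \Sigma^{-1} \mu_i$$

and d_ij is the jth element of the row vector μ_i^T Σ^{-1}.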

We also define the linear score to be s_i(X) = d_i(X) + LN(π_i). As we demonstrated above, i* is the i with the maximum linear score.
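The linear score can be sketched in Python as follows. This is only an illustration, not the Real Statistics implementation. It assumes the standard coefficient definitions d_i0 = −½ μ_i^T Σ^{-1} μ_i and d_ij = jth element of μ_i^T Σ^{-1}, which match the Excel formulas used in Example 1. The function names and the two-population data are hypothetical.

```python
import math

def linear_coefficients(mu, sigma_inv):
    """Return (d_i0, [d_i1, ..., d_ik]) for one population.

    mu        : mean vector of the population (list of k floats)
    sigma_inv : inverse of the common covariance matrix (k x k list of lists)
    """
    k = len(mu)
    # d_ij = j-th element of the row vector mu^T * sigma_inv
    d = [sum(mu[a] * sigma_inv[a][j] for a in range(k)) for j in range(k)]
    # d_i0 = -1/2 * mu^T * sigma_inv * mu
    d0 = -0.5 * sum(d[j] * mu[j] for j in range(k))
    return d0, d

def linear_score(x, mu, sigma_inv, prior):
    """s_i(X) = d_i0 + sum_j d_ij * x_j + ln(prior)."""
    d0, d = linear_coefficients(mu, sigma_inv)
    return d0 + sum(dj * xj for dj, xj in zip(d, x)) + math.log(prior)

# Hypothetical example: two populations sharing an identity covariance matrix
sigma_inv = [[1.0, 0.0], [0.0, 1.0]]
means = {"A": [0.0, 0.0], "B": [2.0, 2.0]}
priors = {"A": 0.5, "B": 0.5}
x = [1.5, 1.2]

scores = {name: linear_score(x, mu, sigma_inv, priors[name])
          for name, mu in means.items()}
best = max(scores, key=scores.get)  # population with the largest linear score
```

With equal priors and an identity covariance matrix, the largest score goes to the population whose mean is closest to x, which here is "B".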

Using the training data, we estimate the value of μ_i by the sample mean vector X̄_i, i.e. the average of all the training data in D_i. Similarly, we estimate the value of Σ by the pooled sample covariance matrix S. More precisely, let S_i be the sample covariance matrix for D_i; then the pooled covariance matrix is

$$S = \frac{1}{n-r} \sum_{i=1}^r (n_i - 1) S_i$$

where n_i = the number of elements in D_i and n = n_1 + ⋯ + n_r. Thus, we estimate the various linear scores using

$$\hat{s}_i(X) = -\tfrac{1}{2}\bar{X}_i^T S^{-1} \bar{X}_i + \bar{X}_i^T S^{-1} X + \ln \pi_i$$
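As a rough sketch, the pooled covariance matrix can be computed in plain Python. It assumes the usual definition: each group's sample covariance matrix S_i (divisor n_i − 1) is weighted by n_i − 1, and the sum is divided by n − r. The function names and the small two-group data set are hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of k-vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def covariance(rows):
    """Sample covariance matrix (divisor n - 1) of a list of k-vectors."""
    n = len(rows)
    m = mean_vector(rows)
    k = len(m)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
             for b in range(k)] for a in range(k)]

def pooled_covariance(groups):
    """S = sum_i (n_i - 1) * S_i / (n - r) for a dict {label: rows}."""
    r = len(groups)
    n = sum(len(rows) for rows in groups.values())
    k = len(next(iter(groups.values()))[0])
    total = [[0.0] * k for _ in range(k)]
    for rows in groups.values():
        S_i = covariance(rows)
        for a in range(k):
            for b in range(k):
                total[a][b] += (len(rows) - 1) * S_i[a][b]
    return [[total[a][b] / (n - r) for b in range(k)] for a in range(k)]

# Hypothetical training data: two groups of two-dimensional vectors
groups = {
    "A": [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]],
    "B": [[2.0, 1.0], [4.0, 3.0]],
}
S = pooled_covariance(groups)
```

Each group needs at least two vectors; otherwise its sample covariance matrix is undefined.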

We can assign the prior probabilities π_i based on our best available prior information, as long as $\sum_{i=1}^r \pi_i = 1$ and π_i > 0 for all i.

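In particular, if we assume that all populations are equally likely then we can choose

$$\pi_i = \frac{1}{r}$$

If we assume that the probability is weighted by the population size then

$$\pi_i = \frac{n_i}{n}$$

where n = n_1 + ⋯ + n_r is the total number of elements in the training data.

The posterior probability that X belongs to D_i is then given by the formula

$$p_i(X) = \frac{\exp(s_i(X))}{\sum_{j=1}^r \exp(s_j(X))}$$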
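The two choices of prior and the conversion from linear scores to posterior probabilities can be sketched as below. The function names and numbers are hypothetical. Subtracting the largest score before exponentiating is a numerical safeguard added in this sketch; it does not change the result and is not something the source describes.

```python
import math

def equal_priors(labels):
    """pi_i = 1 / r for each of the r populations."""
    return {label: 1.0 / len(labels) for label in labels}

def size_weighted_priors(sizes):
    """pi_i = n_i / n, where sizes maps each label to n_i."""
    n = sum(sizes.values())
    return {label: n_i / n for label, n_i in sizes.items()}

def posteriors(scores):
    """p_i = exp(s_i) / sum_j exp(s_j), computed with a max shift."""
    top = max(scores.values())
    weights = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}

# Hypothetical values for illustration only
priors_equal = equal_priors(["A", "B", "C", "D"])
priors_sized = size_weighted_priors({"A": 10, "B": 30})
post = posteriors({"A": 1.0, "B": 1.0 + math.log(3.0)})
```

In the last line the second score exceeds the first by ln 3, so population "B" ends up three times as probable as "A".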

Examples

Example 1: Perform discriminant analysis on the data in Example 1 of MANOVA Basic Concepts. This data is repeated in Figure 1 (in two columns for easier readability). Also determine in which category to put the vector X with yield 60, water 25 and herbicide 6.

Figure 1 – Training Data for Example 1

Assumptions

The analysis begins as shown in Figure 2. First, we perform Box’s M test using the Real Statistics formula =BOXTEST(A4:D35). Since p-value = .72 (cell G5), we cannot reject the hypothesis that the covariance matrices are equal, and so we consider the equal covariance matrix assumption for linear discriminant analysis to be met. The other assumptions can be tested as shown in MANOVA Assumptions.

Pooled covariance and mean vectors

We next calculate the pooled covariance matrix (range F9:H11) using the Real Statistics array formula =COVPooled(A4:D35). The inverse of this matrix is shown in range F15:H17, as calculated by the Excel array formula =MINVERSE(F9:H11).
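Outside Excel, the inversion step can be reproduced with a small Gauss-Jordan routine. This is a sketch of one common technique and makes no claim about how MINVERSE works internally. The function name `invert`, the singularity tolerance and the 2 × 2 matrix are assumptions for illustration.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity: [A | I]
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Pick the row with the largest absolute entry in this column
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so that the pivot becomes 1
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Eliminate this column from every other row
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    # The right half now holds the inverse
    return [row[n:] for row in aug]

# Hypothetical covariance matrix, not the one in Figure 2
S = [[4.0, 2.0], [2.0, 3.0]]
S_inv = invert(S)
```

A pooled covariance matrix that is singular, for example when one variable is a linear combination of the others, cannot be inverted, and the routine raises an error in that case.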

The mean vector for the loam population is displayed in range K6:M6, with the mean vectors for the other three populations shown just below it. These values are calculated by putting the formula

=AVERAGEIF($A$4:$A$35,$J6,B$4:B$35)

in cell K6, highlighting the range K6:M9 and pressing Ctrl-R and Ctrl-D.
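The AVERAGEIF step amounts to averaging each variable within each category. A minimal Python sketch of the same idea follows; the function name and the four rows are hypothetical and are not the data from Figure 1.

```python
def group_means(rows):
    """rows: iterable of (label, x1, ..., xk).

    Returns {label: [mean of x1, ..., mean of xk]} for each label.
    """
    sums, counts = {}, {}
    for label, *values in rows:
        if label not in sums:
            sums[label] = [0.0] * len(values)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], values)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}

# Hypothetical rows: (category, yield, water, herbicide)
rows = [
    ("loam", 70.0, 30.0, 3.0),
    ("loam", 74.0, 34.0, 5.0),
    ("sandy", 60.0, 20.0, 6.0),
    ("sandy", 62.0, 24.0, 8.0),
]
means = group_means(rows)
```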

Figure 2 – Discriminant Analysis

We will assume that a priori all four categories are equally likely and so we set the prior probabilities to 25% as shown in range O6:O9. The LN for each of these values is shown in range P6:P9.

Discriminant analysis coefficients

Finally, we calculate the coefficients of the discriminant analysis in range K14:N17. Here, K14 contains the value of d₁₀, as calculated by the array formula

=-0.5*MMULT(K6:M6,MMULT($F$15:$H$17,TRANSPOSE(K6:M6)))

The remaining d_1j coefficients (for loam) are calculated by inserting the following array formula in range L14:N14

=MMULT(K6:M6,$F$15:$H$17)

Highlighting the range K14:N17 and pressing Ctrl-D, we fill in the d_ij coefficients for the other three categories.

Best category

We now determine in which category to put the vector X with yield 60, water 25 and herbicide 6 (see Figure 3).

Figure 3 – Determining the best category

First, we determine the score for each category i using the formula

$$s_i(X) = d_{i0} + \sum_{j=1}^k d_{ij} x_j + \ln \pi_i$$

For example, the score for loam is 25.50092, as calculated using the formula

=$K$14+SUMPRODUCT(S38:U38,$L$14:$N$14)+$P$6

The scores for the other three categories are calculated in a similar manner using the discriminant coefficients and prior probability. We see that the score for sandy is the highest and so we use that as the category for X (cell Z38). Note that this category can be determined by the following formula:

=INDEX($V$3:$Y$3,,MATCH(MAX(V38:Y38),V38:Y38,0))
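The INDEX/MATCH formula picks the label whose score equals the maximum. A Python sketch of the same lookup is shown below. The function name, the label order and the scores are hypothetical and are not the values in Figure 3.

```python
def best_category(labels, scores):
    """Return the label paired with the largest score (first one on ties)."""
    return labels[scores.index(max(scores))]

# Hypothetical labels and scores for illustration
labels = ["loam", "sandy", "salty", "clay"]
scores = [1.2, 3.4, 0.7, 3.1]
best = best_category(labels, scores)
```

Like MATCH with match type 0, `list.index` returns the first position at which the maximum occurs, so ties are resolved in favor of the leftmost category.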

The posterior probability that X is in the loam population is shown in cell V39, as calculated by the formula

=EXP(V$38)/SUMPRODUCT(EXP($V$38:$Y$38))

which is 35.2%. The posterior probability that X is in the sandy population is 37.6% (cell W39), and so we are only 37.6% confident that X is in the sandy population (i.e. there is a 62.4% chance that this classification is wrong). Whether this is a sufficiently high level of confidence depends on the cost of being wrong and the reward for being right. If we are predicting which women have cancer based on a mammogram, we would clearly require a much higher level of confidence.


Reference

Penn State (2017) Linear discriminant analysis. STAT 505 Applied Multivariate Statistical Analysis
https://online.stat.psu.edu/stat505/lesson/10/10.3

22 thoughts on “Linear Discriminant Analysis”

Hello – I am trying to set up my data to perform LDA and have been using your example as a template. I can’t for the life of me get it to accept my input – I keep getting an ‘invalid training set’ error message. Are you able to provide any guidance on what might be causing this? I was able to select the example data set and perform it fine. Thanks!

Charles

June 29, 2023 at 10:11 am

Hello Darren,
If you email me an Excel spreadsheet with your data and specify how you filled in the dialog box, I will try to figure out why you are getting this error.
Charles

Hello, could you post the procedure used in Excel to run this analysis, please?

Charles

August 15, 2021 at 10:07 am

Which analysis are you referring to? The Real Statistics software already supports Linear Discriminant Analysis.
Charles

Thanks for the detailed post. I see that you have 4 classes and you have 4 discriminant functions (with intercept) i.e one for each class. However when I try to run the same in a software such as R, I get only 3 discriminant functions (LD1,LD2 and LD3) i.e one less than the number of classes, in which case, could you clarify how is the prediction done?

Charles

June 13, 2021 at 10:25 pm

Hello Sendil,
In Example 1 on this webpage, you have 4 categories: loam, sandy, salty, clay. Are you saying that R removes one of these categories? That would be strange.
In your question, what do LD1, LD2 and LD3 refer to?
Charles

I’m working with 3 different places, each with 10 tracks. I have variables to evaluate the length of the trail and the number of vertebrates depending on each place. I would like to know the right way or model to set up the spreadsheet so that it can be run in the software R.

Dear Charles
I am geologist

I want to use Real Statistics for population discrimination of grain size.
My samples give the percentage for each size range. How can I identify the populations that make up each sample?
Example: this is the grain size analysis of one sample
size (mm) %
>0.04 11.22
0.04 16.55
0.05 16.52
0.06 20.5
0.08 16.29
0.10 70.82
0.13 101.34
0.16 79.5
0.20 45.92
0.25 26.04
0.32 21.53
0.40 17.16
0.50 13.32
0.63 9.24
0.80 10.96
1.00 20.33
2.00 2.48

Best regards

Charles

September 15, 2019 at 9:09 am

Sorry, but I don’t understand your question nor what the data represents.
Charles
- Mohamed Dassamiour
  
  September 15, 2019 at 10:15 am
  
  Dear Charles
  In my work, I did a granulometric analysis of soil samples.
  Each sample is sieved through a series of sieves (ranging from 0.4 mm to 2 mm). The particle size fractions obtained for each class will be weighed: each particle size fraction of a given weight (Pi) corresponds to class size (in mm).
  
  The problem is that each sample is composed of a mixture of granulometric subpopulations.
  
  The sieving data are presented in the form of a statistical series composed of classes with the corresponding weight in %.
  
  class (mm) ———————– weight ( Pi %)
  > 0.4 ————————- 11.22
  [0.04 – 0.05 [——————— 16.55
  [0.05 – 0.06 [——————— 16.52
  [0.06 – 0.08 [——————— 20.5
  [0.08 – 0.10 [——————— 16.29
  [0.10 – 0.13 [——————— 70.82
  [0.13 – 0.16 [——————— 101.34
  [0.16 – 0.20 [——————— 79.5
  [0.20 – 0.25 [——————— 45.92
  [0.25 – 0.32 [——————— 26.04
  [0.32 – 0.40 [——————— 21.53
  [0.40 – 0.50 [——————— 17.16
  [0.50 – 0.63 [——————— 13.32
  [0.63 – 0.80 [——————— 9.24
  [0.80 – 1.00 [——————— 10.96
  [1.00 – 2.00 [——————— 20.33
  <2 ————————- 2.48
  The question is: how can I determine or discriminate the granular subpopulations from these data?
  
  Thank you in advance for your passion
  Best regards
  - Charles
    
    September 22, 2019 at 5:34 pm
    
    Hello Mohamed,
    On what basis will you define sub-populations? For discriminant analysis you already have the categories and you want to fit new data into those categories. Here, you need to first define the categories. This can be done in many ways.
    Charles

How do I acquire these functions in EXCEL?
Is there a free add-in that I can download or is it a software package that I have to purchase?

Charles

March 13, 2019 at 8:36 am

Kenneth,
These functions are not available in Excel.
These functions are available for free from the Real Statistics Resource Pack. See
https://real-statistics.com/free-download/real-statistics-resource-pack/
Charles

I have a question: the equation used to determine the coefficients of the discriminant analysis looks like a Gaussian mixture model. Is there any difference between them?

Charles

January 24, 2019 at 6:11 pm

There are a number of websites that address this issue. See, for example:
https://stats.stackexchange.com/questions/254963/differences-linear-discriminant-analysis-and-gaussian-mixture-model
https://www.jstor.org/stable/pdf/2346171.pdf?seq=1#page_scan_tab_contents
Charles

How do we get the values for LN that are shown in range P6:P9?

gottre

January 27, 2018 at 11:11 pm

oh, i just noticed it means the natural log, i thought it’s something else.