Basic Concepts of Multinomial Logistic Regression

Suppose there are r + 1 possible outcomes for the dependent variable, 0, 1, …, r, with r > 1. Pick one of the outcomes as the reference outcome and conduct r pairwise logistic regressions between this outcome and each of the other outcomes. For our purposes, we will assume that 0 is the reference outcome. The binary logistic regression model for the outcome h, with h ≠ 0, is defined by

$$\ln\!\left(\frac{p_{ih}}{p_{i0}}\right) = b_{h0} + b_{h1} x_{i1} + \cdots + b_{hk} x_{ik}$$

Here pih is the probability that the ith sample has outcome h. Taking the exponential of both sides of the above equation yields the equivalent expression

$$\frac{p_{ih}}{p_{i0}} = e^{\sum_{j=0}^{k} b_{hj} x_{ij}}$$

where we define xi0 = 1 (in order to keep our notation simple). Now let

$$z_{ih} = e^{\sum_{j=0}^{k} b_{hj} x_{ij}}$$

and so

$$p_{ih} = p_{i0}\, z_{ih}$$

But

$$\sum_{t=0}^{r} p_{it} = 1$$

Hence

$$p_{i0} = \frac{1}{1 + \sum_{t=1}^{r} z_{it}}$$

and

$$p_{ih} = \frac{z_{ih}}{1 + \sum_{t=1}^{r} z_{it}}$$
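To make these formulas concrete, here is a minimal Python/NumPy sketch (illustrative only, not part of the Real Statistics software; the function name and the layout of the coefficient matrix are assumptions) that computes the probabilities pih for a design matrix X and a coefficient matrix B whose rows hold the coefficients bh0, …, bhk for the outcomes h = 1, …, r:

```python
import numpy as np

def class_probabilities(X, B):
    """Return the n x (r+1) matrix of probabilities p_ih, columns ordered 0, 1, ..., r.

    X : n x (k+1) design matrix whose first column is all 1s (so that x_i0 = 1).
    B : r x (k+1) matrix; row h-1 holds the coefficients b_h0, ..., b_hk for
        outcome h compared with the reference outcome 0.
    """
    Z = np.exp(X @ B.T)                          # z_ih = exp(sum_j b_hj x_ij), shape n x r
    denom = 1.0 + Z.sum(axis=1, keepdims=True)   # 1 + sum_t z_it, shape n x 1
    p0 = 1.0 / denom                             # p_i0, the reference-outcome probability
    return np.hstack([p0, Z / denom])            # p_ih = z_ih / (1 + sum_t z_it)
```

For r = 1 this reduces to the ordinary binary logistic model.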

Whereas the model used in the binary case with only two outcomes is based on the binomial distribution, when there are more than two outcomes the model is based on the multinomial distribution. Thus, the probability that the sample data occurs as it does is given by

$$L = \prod_{i=1}^{n} \prod_{h=0}^{r} p_{ih}^{\,y_{ih}}$$

where the yih are the observed values while the pih are the corresponding theoretical values.

Taking the natural log of both sides and simplifying we get the following definition.

Definition 1: The log-likelihood statistic for multinomial logistic regression is defined as follows:

$$LL = \sum_{i=1}^{n} \sum_{h=0}^{r} y_{ih} \ln p_{ih}$$
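Continuing the illustrative sketch above (the matrices of observed values and probabilities both include the reference outcome as column 0), LL can be evaluated directly from Definition 1:

```python
import numpy as np

def log_likelihood(Y_full, P_full):
    """LL = sum over i and h of y_ih * ln(p_ih) (Definition 1).

    Y_full : n x (r+1) matrix of observed outcomes y_ih (0/1 indicators).
    P_full : n x (r+1) matrix of the corresponding probabilities p_ih.
    """
    return float(np.sum(Y_full * np.log(P_full)))
```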

Observation: The multinomial counterparts to Properties 1 and 2 of Finding Logistic Regression Coefficients using Newton’s Method are as follows.

Property 1: For each h > 0, let Bh = [bhj] be the (k+1) × 1 column vector of binary logistic regression coefficients of the outcome h compared to the reference outcome 0 and let B be the r(k+1) × 1 column vector consisting of the elements in B1, …, Br arranged in a column.

Also let X be the n × (k+1) design matrix (as described in Definition 3 of Least Squares for Multiple Regression). For outcomes h and l, let Vhl be the n × n diagonal matrix whose main diagonal contains elements of the form

$$v_{ii} = \begin{cases} p_{ih}\,(1 - p_{ih}) & \text{if } h = l \\[4pt] -\,p_{ih}\, p_{il} & \text{if } h \neq l \end{cases}$$

and let Chl = XᵀVhlX. Now define the r(k+1) × r(k+1) matrix

$$C = [C_{hl}]$$

i.e. the r × r block matrix whose block in row h and column l is Chl, and let S = C⁻¹. Then S is the covariance matrix for B.
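As an illustrative sketch (assuming P is the n × r matrix of probabilities for the non-reference outcomes, with the blocks ordered to match the stacking of B1, …, Br), C can be built block by block and inverted to obtain S:

```python
import numpy as np

def covariance_matrix(X, P):
    """Return S = C^(-1), where C = [C_hl] is built block by block.

    X : n x (k+1) design matrix.
    P : n x r matrix of probabilities p_ih for the non-reference outcomes 1, ..., r.
    The blocks are ordered to match the stacking of B as B_1, ..., B_r.
    """
    n, k1 = X.shape
    r = P.shape[1]
    C = np.zeros((r * k1, r * k1))
    for h in range(r):
        for l in range(r):
            if h == l:
                v = P[:, h] * (1.0 - P[:, h])        # p_ih (1 - p_ih)
            else:
                v = -P[:, h] * P[:, l]               # -p_ih p_il
            C_hl = X.T @ (v[:, None] * X)            # X^T V_hl X without forming V_hl
            C[h*k1:(h+1)*k1, l*k1:(l+1)*k1] = C_hl
    return np.linalg.inv(C)
```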

Property 2: The maximum of the log-likelihood statistic occurs when, for all h = 1, …, r and j = 0, …, k, the following r(k+1) equations hold

$$\sum_{i=1}^{n} x_{ij}\,(y_{ih} - p_{ih}) = 0$$

Observation: Let Y = [yih] be the n × r matrix of observed outcomes of the dependent variable and let P = [pih] be the n × r matrix of the model’s predicted values for these outcomes (both excluding the reference outcome 0). Let X be the n × (k+1) design matrix. Then the matrix equation

$$X^T (Y - P) = 0$$

where the right side of the equation is the (k+1) × r zero matrix, is equivalent to the equations in Property 2.
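As a small illustration (the helper name is hypothetical), the left-hand side of this matrix equation can be evaluated and its columns stacked into the r(k+1) × 1 vector that appears in the iteration of Property 3 below; at the maximum-likelihood coefficients this vector is approximately zero:

```python
import numpy as np

def score_vector(X, Y, P):
    """Stack the columns of X^T (Y - P) into an r(k+1) x 1 column vector.

    Y : n x r matrix of observed outcomes for the non-reference outcomes.
    P : n x r matrix of the corresponding predicted probabilities.
    At the maximum of LL every entry of the result is (approximately) zero.
    """
    G = X.T @ (Y - P)             # (k+1) x r matrix X^T (Y - P)
    return G.T.reshape(-1, 1)     # column h of G becomes block h of the stacked vector
```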

Property 3: Let B, X, Y, P and S be defined as in Properties 1 and 2, let B(0) be an initial guess of B, and for each m define the following iteration

$$B^{(m+1)} = B^{(m)} + S^{(m)}\, X^T\bigl(Y - P^{(m)}\bigr)$$

where P(m) and S(m) are the values of P and S based on the coefficients in B(m), and the (k+1) × r matrix Xᵀ(Y − P(m)) is rearranged, column by column, into an r(k+1) × 1 column vector so that it is conformable with the r(k+1) × r(k+1) matrix S(m). Then for sufficiently large m, B(m+1) ≈ B(m), and so B(m) is a good approximation of the coefficient vector B.

Observation: Here we can take as the initial guess for B the r(k+1) × 1 zero matrix.
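Putting the pieces together, here is an illustrative sketch of the iteration in Property 3, reusing the hypothetical helpers class_probabilities, covariance_matrix and score_vector from the sketches above; the iteration cap and tolerance are arbitrary choices, not part of the original description:

```python
import numpy as np

def fit_multinomial_logistic(X, Y, r, max_iter=50, tol=1e-8):
    """Newton's method for the r(k+1) x 1 coefficient vector B (Property 3).

    X : n x (k+1) design matrix (first column all 1s).
    Y : n x r matrix of observed outcomes for outcomes 1, ..., r.
    Starts from B^(0) = 0, as suggested in the observation above.
    """
    n, k1 = X.shape
    b = np.zeros((r * k1, 1))                      # B^(0): the r(k+1) x 1 zero vector
    for _ in range(max_iter):
        B = b.reshape(r, k1)                       # row h-1 holds the coefficients B_h
        P = class_probabilities(X, B)[:, 1:]       # drop the reference column -> n x r
        S = covariance_matrix(X, P)                # S^(m), the current covariance matrix
        step = S @ score_vector(X, Y, P)           # S^(m) X^T (Y - P^(m)), stacked
        b = b + step                               # B^(m+1) = B^(m) + step
        if np.max(np.abs(step)) < tol:             # B^(m+1) ~ B^(m): converged
            break
    return b
```

Stopping when the largest change in the coefficients is negligible corresponds to the condition B(m+1) ≈ B(m) above.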

Observation: If we group the data as we did in Example 1 of Basic Concepts of Logistic Regression (i.e. summary data), then Property 1 continues to hold, where n = the number of groups (instead of the sample size), for each i, ni = the number of observations in group i, and Vhl is now the n × n diagonal matrix whose main diagonal contains elements of the form

$$v_{ii} = \begin{cases} n_i\, p_{ih}\,(1 - p_{ih}) & \text{if } h = l \\[4pt] -\,n_i\, p_{ih}\, p_{il} & \text{if } h \neq l \end{cases}$$

Property 2 also holds, where Y = [yih] is the n × r matrix of summarized observed outcomes of the dependent variable (yih = the number of observations in group i with outcome h), X is the corresponding n × (k+1) design matrix and P is the n × r matrix of predicted values, whose entries are the expected counts ni pih.

Thus, the element in the jth row and mth column of Chl is

$$c_{jm} = \begin{cases} \displaystyle \sum_{i=1}^{n} n_i\, x_{ij}\, x_{im}\, p_{ih}\,(1 - p_{ih}) & \text{if } h = l \\[8pt] \displaystyle -\sum_{i=1}^{n} n_i\, x_{ij}\, x_{im}\, p_{ih}\, p_{il} & \text{if } h \neq l \end{cases}$$

In this case, the expressions for L and LL become

$$L = \prod_{i=1}^{n} \prod_{h=0}^{r} p_{ih}^{\,y_{ih}}$$

$$LL = \sum_{i=1}^{n} \sum_{h=0}^{r} y_{ih} \ln p_{ih}$$
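For summary data the earlier sketches change in two places: the diagonal elements of Vhl pick up the factor ni, and Y holds counts rather than 0/1 indicators, so the expected counts ni pih play the role of the predicted values in the score. A minimal illustration (again an assumption-laden sketch, with n_i the vector of group sizes):

```python
import numpy as np

def grouped_covariance_matrix(X, P, n_i):
    """Covariance matrix S for summary data: V_hl picks up the group sizes n_i.

    X   : n x (k+1) design matrix, one row per group.
    P   : n x r matrix of probabilities p_ih for the non-reference outcomes.
    n_i : length-n vector of group sizes.
    """
    n, k1 = X.shape
    r = P.shape[1]
    C = np.zeros((r * k1, r * k1))
    for h in range(r):
        for l in range(r):
            if h == l:
                v = n_i * P[:, h] * (1.0 - P[:, h])   # n_i p_ih (1 - p_ih)
            else:
                v = -n_i * P[:, h] * P[:, l]          # -n_i p_ih p_il
            C[h*k1:(h+1)*k1, l*k1:(l+1)*k1] = X.T @ (v[:, None] * X)
    return np.linalg.inv(C)

def grouped_log_likelihood(Y_counts, P_full):
    """LL = sum_i sum_h y_ih ln(p_ih), with y_ih the count of outcome h in group i
    (the constant term discussed below is not included)."""
    return float(np.sum(Y_counts * np.log(P_full)))
```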

The values of LL and R2 as well as the chi-square test for significance are calculated exactly as for binary logistic regression (see Testing the Fit of the Logistic Regression Model).

In particular, LL0, the log-likelihood for the model containing only the intercept terms, is

$$LL_0 = \sum_{h=0}^{r} \left(\sum_{i=1}^{n} y_{ih}\right) \ln\!\left(\frac{\sum_{i=1}^{n} y_{ih}}{\sum_{i=1}^{n} n_i}\right)$$

As for LL, to the above formula we need to add the constant term

$$\sum_{i=1}^{n} \ln\!\left(\frac{n_i!}{y_{i0}!\, y_{i1}! \cdots y_{ir}!}\right)$$

Note, however, that in calculating the different versions of R2, the constant term is not included in LL and LL0.
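If the full log-likelihood for summary data is needed, the constant term can be computed from the group counts using log-gamma rather than factorials (to avoid overflow). This sketch assumes the constant is the sum of the logs of the multinomial coefficients, as given above:

```python
from math import lgamma

def multinomial_constant(Y_counts):
    """sum over i of ln( n_i! / (y_i0! y_i1! ... y_ir!) ), via log-gamma.

    Y_counts : for each group i, the counts y_i0, ..., y_ir of the r+1 outcomes.
    """
    total = 0.0
    for row in Y_counts:
        n_i = sum(row)                                     # group size n_i
        total += lgamma(n_i + 1) - sum(lgamma(y + 1) for y in row)
    return float(total)
```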

27 thoughts on “Basic Concepts of Multinomial Logistic Regression”

  1. Dear Charles,

    From Property 1, we know that C is an r(k+1) × r(k+1) matrix and S is the inverse of C, which means S is also an r(k+1) × r(k+1) matrix. But in the iteration equation of Property 3, the dimension of the transpose of X is (k+1) × n, which will cause non-conformable arguments. I am wondering whether I missed or misunderstood something.

    Thanks
    Ziv

  2. I’m afraid this might sound stupid but wouldn’t Zi0=Pi0/Pi0=1?
    Some textbooks refer to the probability of any outcome as Zih/sum(Zit) for all h including class h0 (t also starts from 0 so there is no “1+…” in the denominator).

      • Thanks for your reply Dr. Charles. I think you meant t=0 as the lower bound for the sum of probabilities (fifth equation), and after you take the reference outcome out, t starts from 1.
        Thanks again for your great effort.

  3. Good morning Charles.
    As for Property 3 being used for Newton’s Method, I understand that ‘m’ is the iteration number and P holds the predicted probabilities.
    Then what does P(m) mean in the formula used for Newton’s Method?

  4. Charles: Does this approach require grouped data? I have tried it with data such as the famous IRIS dataset and it seems to break down when most of the independent variable combinations are unique, i.e. when the N’s are 1. What happens is that the standard errors of the betas become negative. Is there an adjustment that can be made for these cases? Thanks.

  5. Dear Charles,
    I installed Microsoft Office 2010; in the meantime, the problem of downloading RealStats 2007 was easily solved.
    Thanks

  6. When I enter the formula =VER() in any cell, it gives me 3.0. I downloaded the RealStats-2007 pack, browsed to it and then checked RealStats-2007 in the add-ins list. I get an error message saying “Excel experienced a serious problem with the realstat-2007 add-in.” Can you help me in solving this problem?

  7. Dear Charles,
    I downloaded RealStats-2003, but how can I use this package to calculate multinomial logistic regression?
    Thanks

    • Sisay,
      Since Microsoft stopped supporting Excel 2003 I have not added new features to the Excel 2003 version of Real Statistics. You need to use the Excel 2007, 2010 or 2013/2016 versions of Real Statistics to get this capability.
      Charles

  8. Dear Charles,

    What should I do if the variance-covariance matrix is a singular matrix?
    Is there any solution to this problem?

    • Dear Eki,
      There are approaches for when the variance-covariance matrix is not invertible, but these go beyond the scope of this website. You can find some of these by googling.
      Charles

  9. Dear Charles,
    From the literature, what would you suggest as a rule to define the minimum sample size (1) for binomial logistic regression and (2) for multinomial logistic regression? E.g. a rule based on the number of independent variables and the observed proportions related to each possible outcome of the dependent variable. Should such a threshold be defined by considering the possible outcomes separately (e.g. the minimum observed proportion across the outcomes), or by considering all rows (combinations of outcomes) of the summary table? Thanks.

  10. Dear Charles,

    Many thanks for this very useful material. I’d like to know if, even if it is probably similar to the binomial case, you could add a section on the comparison of regression models. In particular, I’d also be interested to know whether LL0 is supposed to remain identical from one model to the other (I think it depends, however, on the way the summary table is designed, due to the non-linearity in the LL0 formula), and whether the degrees of freedom can also simply be subtracted.
    Many thanks in advance,
    Kr,

    Thomas

    • Thomas,
      What sort of comparison are you looking for? When you use one model rather than another?
      The LL0 values won’t be identical from model to model. Generally, they will be identical only when the summary data are identical.
      Charles

      • Dear Charles,
        Thanks for your prompt answer. I’m thinking of nested models, exactly as illustrated in the binomial case, with a chi-square test based on log-likelihoods and a substitution of LL0 by the LL1 related to the reference model. Is this valid for the multinomial case, provided we keep the summary table identical for all models? Once the final model is selected, I’ll try to define a classification matrix based on RS capabilities.
        Thomas

  11. All I want to figure out is how to get the population and sample for a multinomial logistic regression. I have four generational cohorts and five soft skill categories that I will be testing.

    Jackie

  12. Sir

    When h ≠ l, the element of the V matrix is vii = (-1)*ni*Pih*Pil, but it seems that in the Excel workbook you forgot the -1 term. Why?

    Colin

