Logistic Regression

When the dependent variable is categorical it is often possible to show that the relationship between the dependent variable and the independent variables can be represented by using a logistic regression model. Using such a model, the value of the dependent variable can be predicted from the values of the independent variables.
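
As a minimal sketch of what such a prediction looks like, assume a single independent variable x and two coefficients b0 and b1. The function names and the coefficient values below are hypothetical and are not taken from any data on this site. The model's predicted probability that the dependent variable equals 1 is 1/(1 + exp(-(b0 + b1*x))), and a cutoff (0.5 here, also an assumption) turns that probability into a predicted category.

```python
import math


def predict_probability(x, b0, b1):
    """Logistic model: P(y = 1 | x) = 1 / (1 + exp(-(b0 + b1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def predict_class(x, b0, b1, cutoff=0.5):
    """Predict 1 when the modelled probability reaches the cutoff, else 0."""
    return 1 if predict_probability(x, b0, b1) >= cutoff else 0


# Hypothetical coefficients, chosen only for illustration.
b0, b1 = -3.0, 1.0

p_at_3 = predict_probability(3.0, b0, b1)  # here b0 + b1 * x = 0
print(p_at_3, predict_class(1.0, b0, b1), predict_class(5.0, b0, b1))
```

In practice the coefficients are estimated from the data rather than chosen by hand.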

We review here binary logistic regression models where the dependent variable only takes one of two values. In Multinomial and Ordinal Logistic Regression we look at multinomial and ordinal logistic regression models where the dependent variable can take two or more values.

We also review a model similar to logistic regression called probit regression.
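
The two models differ in the function that maps the linear predictor to a probability: logistic regression uses the logistic function, while probit regression uses the standard normal cumulative distribution function. The sketch below (the function names are my own) simply evaluates both at a few points to show that they have the same S shape but different tails.

```python
import math


def logistic_cdf(z):
    """Inverse logit link used by logistic regression."""
    return 1.0 / (1.0 + math.exp(-z))


def normal_cdf(z):
    """Standard normal CDF, the inverse link used by probit regression."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Both curves pass through 0.5 at z = 0; the normal CDF approaches 0 and 1 faster.
for z in (-2.0, 0.0, 2.0):
    print(z, round(logistic_cdf(z), 4), round(normal_cdf(z), 4))
```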

References

Howell, D. C. (2010) Statistical methods for psychology (7th ed.). Wadsworth, Cengage Learning.
https://labs.la.utexas.edu/gilden/files/2016/05/Statistics-Text.pdf

Christensen, R. (2013) Logistic regression: predicting counts.
http://stat.unm.edu/~fletcher/SUPER/chap21.pdf

Wikipedia (2012) Logistic regression
https://en.wikipedia.org/wiki/Logistic_regression

Agresti, A. (2013) Categorical data analysis, 3rd Ed. Wiley.
https://mybiostats.files.wordpress.com/2015/03/3rd-ed-alan_agresti_categorical_data_analysis.pdf

164 thoughts on “Logistic Regression”

  1. Hello Charles,

    How do I know the order of the coefficients? The results contain no names, and I am using 3 independent variables, but I can't identify which one is B0, B1, B2, …

    Reply
  2. Hi Charles,

    Thanks for making the logistic regression tool available in excel!

    I have two questions:
    1) My overall model seems to be good, as shown in the summary table below. But I don’t understand how, since none of the IVs have p-values < 0.05. How is this possible and how do I interpret this?
    2) Which value in the output is equivalent to the R2? In SPSS it is the Nagelkerke R2 that is shown. I can't see this in your output.

    Thank you!

    Saad

    Reply
  3. I am unable to see the “logistic regression”. What I see is “binary logistic and Probit regression” and “multinomial regression”

    Reply
  4. I have MS Office 365 ProPlus and I have downloaded the real statistics resource pack for , but I am unable to see the “logistic regression”. What I see is “binary logistic and Probit regression” and “multinomial regression”

    Reply
  5. Charles,

    I planned to use the Logistic Regression module to generate propensity scores. However, the output does not provide the probabilities of being a “1” – treatment group – for the items. Is there something I’m missing, or another step I might take? Thanks.

    Reply
  6. I am trying to run a logistic regression using this pack and every time I keep getting an error saying, “Summary table must have at least two data rows” but the data that I am entering as an input has at least two data rows. I’m confused what this alert means and I have no idea what I am doing wrong in trying to run the regression.

    (Sorry for posting this twice. I didn’t see this section before I posted in the other one, so I figured this way it could be seen in either place). Thank you very much for the help.

    Reply
    • Hello Noah,
      It looks like you are using an older release of the software. What do you see when you enter the formula =VER() in any cell?
      In the latest release, Rel 6.7, there is no such error message. If you are using Excel 2016 or later then I suggest that you install this version of the software.
      Charles

      Reply
      • Hi Charles,

        I am using 6.7 for Mac and it is still giving me the error. I am using Excel 2019. Do you have any other suggestions? Thank you for your help in trying to figure this out.

        Reply
  7. We have downloaded the tool with a need to conduct logistic regression. However, Logistic Regression is not listed as one of the tools. HELP!

    Reply
  8. Hi Charles,

    First of all, thanks for developing this add-in. I’m having problems running a logistic regression on a data set with approximately 85k observations and 9 independent variables. The problem is that it stops responding and finally crashes. Is there any limit on the size of the data set that this add-in can handle?

    Thanks in advance and best regards,

    Reply
      • Hi Charles

        I am currently doing a project on competency profiling of critical roles in the petroleum refining process. I have divided the required competencies into 4 categories (safety, technical, managerial and soft skills). Each category has its own set of competencies, with a rating assigned to each for different levels of employees (such as field engineers, shift engineers and shift superintendents). This was done after discussion with the plant managers concerned.
        I am basically trying to develop a model that can predict the current competency level of a candidate, so that we can compare it with the system-requested competency, which we obtained after consultation with the plant managers.

        Reply
  9. Hello,

    First of all, I want to thank you for providing the add-in for free.
    Also, I have a question regarding the binomial logistic regression. Does the add-in split the dataset into training and test datasets, or should we do it on our own?

    Thank you in advance!

    Reply
  10. Hi,
    Thanks for a wonderful and simple tool. I had a query – can you advise how to use this for variables that have interaction between them?

    For instance, in my data, there is a variable ‘age’ and a variable ‘pay’ that are inter-linked with each other. There are a total of 7 other variables.

    Look forward to a reply

    Niraj

    Reply
  11. Dear Charles,

    Thanks for this excellent add in.
    I have a problem when using logistic regression when I have more than one independent variable. I followed your instructions in earlier answered questions. Version number is checked and I did make sure the last column in the Input Range contained the 0 or 1 dichotomous values of the dependent variable. The problem is that most of the output shows #VALUE!. What does it mean?

    Thank you for your help.
    Sincerely,
    Christina

    Reply
    • Christina,
      There is no guarantee that logistic regression is a good fit for your data, in which case the algorithm doesn’t converge to a solution.
      If you like, you can send me an Excel file with your data and output and I will check to see whether this is the situation.
      Charles

      Reply
      • Charles,
        I would be glad for you to have a look and check whether that is the case. I have sent you my worksheet by email. Looking forward to hearing from you.
        Christina

        Reply
    • Christina,
      For some reason, some of your data in column C is formatted as text instead of numeric. Thus, although the values are numeric, Excel is treating them as text. You need to convert the values to numeric. This can be done using the VALUE function. The good news is that once you do this, the logistic regression model converges.
      Charles

      Reply
  12. How can we do “hierarchical Bayes logistic regression” in Excel? Are there any alternatives or methods by which we can perform it? Please let us know.

    Reply
  13. Hi Charles,
    I downloaded the Real Statistics Excel add-in but it is not showing me Logistic Regression under the Reg tab.
    Can you help me resolve this?

    Thanks,
    Ayushi

    Reply
  14. Hello Charles,
    I would like to congratulate you on the excellent work. I am trying to run a binary logistic regression where my dependent variable is binary, but the independent variables are not binary. The results I obtained are difficult to analyze.
    Is it a problem that none of the independent variables are binary?
    Regards

    Reply
  15. Hi Charles, good tutorial on YouTube with the pack.

    I have a data set in which one column is a yes/no (1/0) dependent variable and the other column is an open-ended value (such as revenue or price). How do I formulate a binary logistic regression to see their relationship?

    I can send you the dataset for your assistance. Much appreciated.

    Jeremy

    Reply
  16. Hello, thank you for this useful website.
    If there is no significant correlation between two variables after a chi-square test and a Pearson correlation test, do I need to run a binary logistic regression for the same two variables?
    I.e., do binary logistic regression results matter in the absence of a significant correlation?

    Reply
      • yes, one of them is the dependent variable (binary)
        for example, the presence of acute disease (yes or no) and family income (categorical)

        Reply
        • Nameer,
          Correlation means a linear association between the two variables. Logistic regression uses a non-linear association.
          Thus, even if there is not a significant correlation, logistic regression can still be useful.
          Charles

          Reply
  17. I realise that binary logistic and probit reg seem to be combined in one. However, do confirm this when you can. Thank you.

    Reply
  18. Good Day,
    I am about to do the post-data defence for my MSc, and my data consist of both categorical and continuous variables.
    I want to use “logistic regression”, but I don’t have any idea about this statistical tool. Which software should I use to run logistic regression?
    Thanks

    Reply
      • My question is this:
        I am using “Age, Gender, Marital Status, Education, Occupation and Income level” as my independent variables, and the questions on these IVs are categorical in nature, while my DV consists of continuous questions on a 5-point Likert scale.
        Question 1. Which statistical software can run logistic regression (E-view, stats or SPSS)?
        Question 2. How many tests are required when running logistic regression?
        Thanks Sir

        Reply
        • Hammed,
          Q1: SPSS supports logistic regression. The others probably do as well, but I am not sure. The Real Statistics Resource Pack also supports logistic regression.
          Q2: I don’t understand this question.
          Charles

          Reply
  19. Hi Charles,

    First of all I just wanted to say thanks for the package – I’m doing a series of logistic regressions at the moment and I’ve found it incredibly effective and user friendly!

    I’ve just noticed some strange results for some of my p-values that I was hoping you could give me a little bit more information on. Instead of a plain number such as 0.05, the output occasionally (and just for some IVs) contains the letter E (e.g. 2.18E-06 or 2.36E-10). What does this mean? Is there anything I can do to stop it occurring?

    Thanks Charles and please let me know if you require any more information!

    Reply
    • Pheobe,
      2.18E-06 is written in what is called scientific notation, which Excel commonly uses for very large or very small values. In particular, 2.18E-06 means 2.18 times 1/10^6. Thus 2.18E-06 = 0.00000218. You can reformat such numbers in Excel by choosing the Number format and setting the number of decimal places to at least 6 (8 decimal places show all the digits of this value).
      Charles

      Reply
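
Note: the conversion described in the reply above can be checked outside Excel with a short Python sketch. It parses the two values quoted in the question and prints them in fixed-point form; the variable names are my own.

```python
import math

# The two p-values quoted in the question, written in scientific notation.
p1 = float("2.18E-06")
p2 = float("2.36E-10")

# E-06 means "times 10 to the power -6".
same_value = math.isclose(p1, 2.18 * 10 ** -6)

# Fixed-point formatting with enough decimal places to show every digit.
fixed1 = f"{p1:.8f}"
fixed2 = f"{p2:.12f}"
print(same_value, fixed1, fixed2)
```

Both values are far below 0.05, so the E notation indicates a very small p-value, not an error.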
  20. Dear Charles, how are you?

    First, thank you for sharing all this information.

    So, I really need your help. I’m finishing my dissertation where I need to analyze the data using multiple logistic regression, for a dichotomous dependent variable (yes and no) and four independent variables, two categorical and two continuous variables. In addition, I need to include in the study the Odds Ratio for each variable.

    I believe that I am not putting the data correctly using Real Statistics, because my results don’t seem to agree with the data collected.

    For this reason, I’d like guidance on how to continue this analysis, to arrive at a reliable result.

    Thank you very much.

    Best regards,

    Rafaela

    Reply
    • Rafaela,
      If you send me an Excel file with your data with the formatting that you are using, I will try to figure out what is going wrong.
      You can find my email address at Contact us.
      Charles

      Reply
  21. Hi
    I want to measure only the profitability of two banking sectors on the basis of these ratios:
    1. Return on assets (ROA)
    2. Return on equity (ROE)
    3. Return on capital employed (ROCE)
    4. Dividend and capital gains

    I have only six years of data. Which model can I use? Please help me.

    Reply
    • Sorry, but this seems like an economics question. I am not the right person to answer this. If you think that the relationship is linear, then you can use a linear regression or time series model.
      Charles

      Reply
          • Hi Charles & Mumytaz,

            I’d like to answer the question.
            RoA,RoCE, RoE etc. are accounting ratios used to compare entities usually in same industry and relevant sub-groups.

            @Mumytaz, I think you can follow the steps in the following manner:-
            1. Gather/calculate all ratio data
            2. Calculate regression coefficients for every ratio
            3. Decide the Null hypothesis and alternate hypothesis, in other words the threshold example yes/no, big/small, well performing/under performing(in banking terms)
            4. Use Multiple Logistic Model or for that matter more suitable would be Linear Discriminant Model (LDA)
            5. Put out a graph

            How to do it? Better use statistical s/w like R or SPSS etc.

  22. I am getting different coefficients/estimates from the Excel add-in, Spark MLlib, and even R for the same data set.

    Is this expected behavior? If so, can you please provide an explanation?

    I have searched and found the information below; it would be great if you could shed some light on this.
    R’s glm is returning a maximum likelihood estimate of the model while Spark’s LogisticRegressionWithLBFGS is returning a regularized model estimate.
    Please refer to following URL –
    http://datascience.stackexchange.com/questions/5710/why-does-logistic-regression-in-spark-and-r-return-different-models-for-the-same?newreg=084d82d6809040afa2d3aacb36a9128f&newreg=aa58ae673fd24216af6b2972f34604e8

    Reply
    • Amit,
      I have tested the Real Statistics logistic regression model against some other sources and found that they match. There are many techniques available and so I cannot guarantee that the result will match with all of them.
      Charles

      Reply
  23. Hi Charles,

    I am trying to build a model to predict turnover in our organization based on
    1. Tenure
    2. Age
    3. Training attended
    4. Salary range
    5. Gender

    Any suggestions as to how I should go about this, especially given that I am trying it out in Excel?

    Also, if your e-book is out, I would love to have it.

    Thanks
    Shri

    Reply
  24. Mr. Zaiontz,
    Thanks for the great site!
    I ran a binary logistic regression of Y on each of three different numerical variables A, B and C. I am having an issue with separation: beyond certain values Ao, Bo and Co of A, B and C respectively (different values for each, of course), all responses are successes. (I guess this forces the slope to diverge to minus infinity so that the curve can accommodate the abrupt change from 1 to 0.) I then increased the number of success levels to three: high, medium and low. But now I have lack-of-fit issues. How does one interpret lack of fit in a logistic regression? I know that lack of fit in a simple linear regression means that the data are not linear, but what does it mean for a logistic regression? Does it mean that (the log of) the data is not distributed like the S-curve ExpL/(1+ExpL)?

    Reply
  25. Hi Charles

    Is there a limit to the number of rows and columns of data (apart from the Excel-specific ones) one can use to do a binary logistic regression using your pack?

    Reply
  26. Hi Charles,

    I’m trying to do logistic regression with some categorical/nominal inputs. I’m worried about multicollinearity problems from turning them into dummy variables. I was wondering if you could tell me whether I would have this problem or if it should be okay. I have a binary output (0 or 1) and my input is a set of 2 dummy variables representing 4 different scenarios: a control (0,0), a treatment a (1,0), a treatment b (0,1), or treatments a+b (1,1). I read your blog, watched the YouTube video and ran the regression (I also did a chi-squared test, and it seems that there is a correlation to be found), but I’m not sure whether the results are great, especially because most of my output variable observations are 0 and only 1-3% are 1.

    I’m looking at precision/recall as well, but I want to know if I’m working with the model properly. Would these 2 variables give me any problems as far as you can tell? I’m just curious whether it is appropriate to use logistic regression in this way. And what if I encoded my variables differently? For example, if I had 4 dummy variables that were 0/1 for each of the 4 scenarios (meaning that for every data point, exactly one of the four was set to 1), would that cause a multicollinearity problem? Also, is there some other concern I may be missing here?

    Thanks,

    Robert

    Reply
    • Robert,
      When most of the output is 0 and few are 1, you will certainly need a larger sample than if the data were more balanced.
      Regarding multicollinearity, I would need to see your data before I can really comment further.
      Charles

      Reply
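
Note on the two codings described in the question above: the sketch below is an illustration only and uses no data from the commenter. It lists the four scenarios under both codings and checks the one property that matters for multicollinearity. With four 0/1 indicators, the indicator columns add up to 1 in every row, which is exactly the intercept column, so a model with an intercept plus all four indicators is perfectly collinear. The variable names are my own.

```python
# The four scenarios from the question.
scenarios = ["control", "a", "b", "a+b"]

# Coding 1: two dummies (a, b), with control as the (0, 0) baseline.
two_dummies = {"control": (0, 0), "a": (1, 0), "b": (0, 1), "a+b": (1, 1)}

# Coding 2: one 0/1 indicator per scenario (four dummies).
four_dummies = {s: tuple(int(s == t) for t in scenarios) for s in scenarios}

# In coding 2 the indicators sum to 1 in every row, i.e. they reproduce the
# intercept column, so intercept + four indicators is perfectly collinear.
row_sums_four = [sum(four_dummies[s]) for s in scenarios]

# In coding 1 the dummies do not sum to a constant, so they do not duplicate
# the intercept column.
row_sums_two = [sum(two_dummies[s]) for s in scenarios]

print(row_sums_four, row_sums_two)
```

The usual remedy for the four-indicator coding is to drop one indicator (making that scenario the baseline) or to fit the model without an intercept. Note also that the two-dummy coding treats a+b as the sum of the two separate effects unless an interaction term a*b is added.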
  27. Hello Dr. Charles Zaiontz

    Dear Prof., I would like to have your comment or suggestion on my situation.
    I have collected the data: there are 300 non-injury and only 17 injury cases. Four categorical variables are significant according to the chi-square test, so I then used multiple logistic regression with the significant variables. Three of them are significant again. Does this make sense? I would like to know whether I can use multiple logistic regression, given that only 17 of the 317 respondents were injured.
    I used SPSS to analyze the data.

    If I cannot run it, what should I do? Is there any way to solve this?

    I appreciate all your help and support; it’s been a great encouragement to me

    Reply
    • Shalaw,
      Since I only have very limited information about the analysis that you have done, I will limit my response to the issue that only 17 of 317 respondents had an injury. I don’t see any reason a priori why you couldn’t use logistic regression. One caution is that the power of the test may suffer a lot from such an unbalanced model. E.g. if you are conducting a two sample t test with effect size .5 and alpha .05, then for two samples of size 300 and 17 the power of the test would be 52%, while if the two samples have size 158 and 159 then the power of the test would be 99%. Thus even though the total sample sizes are the same, the power of the more balanced test is much higher.
      Charles

      Reply
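
Note: the power figures in the reply above can be approximated with a short sketch. It uses a normal approximation to the two-sided, two-sample t test, which is an assumption of mine (an exact calculation would use the noncentral t distribution and give slightly different digits), and the function names are my own.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def approx_power(d, n1, n2, z_crit=1.959964):
    """Approximate power of a two-sided, two-sample t test.

    d is the effect size (Cohen's d); z_crit is the two-sided normal
    critical value for alpha = .05.
    """
    # Expected size of the test statistic under the alternative.
    ncp = d / math.sqrt(1.0 / n1 + 1.0 / n2)
    return normal_cdf(ncp - z_crit) + normal_cdf(-ncp - z_crit)


unbalanced = approx_power(0.5, 300, 17)   # roughly 0.52
balanced = approx_power(0.5, 158, 159)    # above 0.99
print(round(unbalanced, 2), round(balanced, 2))
```

The total sample size is 317 in both cases; only the split between the two groups changes.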
  28. HI Charles,
    Do you have any e-book of the above topics explained in detail?
    if yes, please share the link.
    Thanks,
    Prachi

    Reply
  29. Hi Charles,

    How do I interpret the Chi-sq and p-value in the binomial logistic regression? The same with R-sq and hosmer.

    Thanks.

    Reply
    • Hi Jonathan,
      This is explained on the appropriate logistic regression webpages on the website. Please look at these explanations. If you are still having problems, please ask me a more specific question so that I can try to help you.
      Charles

      Reply
  30. Dear Charles,
    I am using Excel 2016 but I couldn’t use your tool pack. There was an error saying that it is incompatible with the version or architecture of this application. Could you please give me a suggestion?

    Reply
    • Amonpun,
      The usual reason is that you need to make sure that Excel’s Solver is operational before you install the Real Statistics Resource Pack. This is described in the installation instructions (on the same webpage from which you downloaded the Real Statistics software).
      To see whether Solver is operational, press Alt-TI and see whether Solver appears on the list with a check mark next to it. If there is no check mark, you need to add it.
      Charles

      Reply
  31. I am trying to estimate the learning curve equation for SW developers. I have 25 developers output over their first 18 months of work. Their output does follow a Sigmoid curve.

    My goal is to use the 25 sets of data to build an estimate with confidence intervals for a new developer (ie what might be their output in month 3 or 6 etc) – if they follow past historical patterns.

    Output is normalized as “estimated delivered hours per work effort”.

    What is your recommendation for handling this data? I think simply averaging output by month for the 25 developers will mask the variability that I am trying to capture.

    Reply
  32. Hello Mr. Charles,
    Thanks for your introduction of logistic regression. I just follow your webpage by webpage and these webpages help me a lot. But I have a question when I see an example in wikipedia for logistic regression.

    https://en.wikipedia.org/wiki/Logistic_regression

    The example states,
    A group of 20 students spend between 0 and 6 hours studying for an exam. How does the number of hours spent studying affect the probability that the student will pass the exam? The table shows the number of hours each student spent studying, and whether they passed (1) or failed (0).

    Hours 0.50 0.75 1.00 1.25 1.50 1.75 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 4.00 4.25 4.50 4.75 5.00 5.50

    Pass 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1

    I attempted to use the method from your Example 1, but there is a difference. In your Example 1, each category (rem) typically has more than 1 person, so most values of P(E) are strictly between 0 and 1. But in this example from Wikipedia, almost every P(E) is 1 or 0, so the log odds ln(P(E)/(1−P(E))) is negative infinity or positive infinity.

    My question is how I solve this problem.

    Thanks in advance

    Steven

    Reply
    • Steven,
      You need to take the transpose of the data as the input to the Real Statistics Logistic Regression data analysis tool. If you do, you will get the same answers as those you found on Wikipedia.
      Charles

      Reply
      • I also have a problem with the example on Wikipedia. I can’t get the same intercept and slope. What am I doing wrong? Do I need to convert the response variable from binary, and if so, how?
        Thanks

        Reply
        • Yolanda,

          Is this the data for the example that you are referring to?
          Hours 0.50 0.75 1.00 1.25 1.50 1.75 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 4.00 4.25 4.50 4.75 5.00 5.50
          Pass 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1

          If so, the response variable (Pass) is clearly binary.

          If you send me an Excel file with the analysis that you are trying to perform I will try to figure out what is going wrong.

          Charles

          Reply
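
Note on the Wikipedia example discussed in this thread: the sketch below fits the model directly to the raw 0/1 data, one observation per row (the transposed layout mentioned above), using Newton's method. It is not the Real Statistics implementation, and the function name fit_logistic is hypothetical. Because the fit works on individual observations, no log of 0 or 1 is ever taken. The result should be close to the coefficients reported on the Wikipedia page, an intercept of about -4.08 and a slope of about 1.50.

```python
import math

# Data from the example quoted in this thread (one observation per entry).
hours = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
         2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]
passed = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]


def fit_logistic(x, y, iterations=50):
    """Maximum likelihood fit of P(y = 1) = 1 / (1 + exp(-(b0 + b1 * x)))."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        # Gradient of the log-likelihood.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Negative Hessian (2 x 2, symmetric).
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 system and update the coefficients.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


intercept, slope = fit_logistic(hours, passed)
print(round(intercept, 2), round(slope, 2))
```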
  33. Hello Charles,
    I have 6 independent variables in my analysis and one dependent variable. All are binary (both independent and dependent). What type of analysis should I use to determine the equation involving all of them?

    Reply
  34. Good day, Dr. I am sorry to write to you in Spanish. If you have a binary dependent variable and a binary independent variable, can logistic regression be applied? And if you have a binary dependent variable and several binary independent variables, can logistic regression also be applied?

    Reply
    • Gerardo,
      If your dependent variable is binary and some but not all of your independent variables are binary, then you can try to apply logistic regression.
      If all your variables are binary then you should use log-linear regression and not logistic regression. The simple case reduces to the chi-square test of independence. Namely,
      if your dependent variable is binary and you also have one independent variable which is binary, then essentially you have a 2 x 2 contingency table which you can address via chi-square or related techniques.
      Charles

      Reply
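
Note: for the simple case mentioned in the reply above (one binary dependent and one binary independent variable), the chi-square test of independence on the 2 x 2 table can be sketched as follows. The counts are hypothetical and chosen only for illustration, the function name is my own, and no continuity correction is applied.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test of independence for the table [[a, b], [c, d]].

    Returns (statistic, p_value) with 1 degree of freedom and no continuity
    correction.
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts: rows are the independent variable (0/1),
# columns are the dependent variable (0/1).
stat, p = chi_square_2x2(30, 10, 15, 25)
print(round(stat, 3), round(p, 5))
```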
  35. Dear Charles;
    Thank you so much for sharing such an excellent tool.
    It’s very useful for people all over the world.

    Could you add the tool for probit regression and tobit regression
    next time you upload the new version?

    I think these two models are widely used in the field of social science.
    Of course we can often use logit regression instead of using probit model,
    but sometimes it is not appropriate.

    Again I appreciate you very much that you made wonderful website.
    Thank you.

    Reply
    • Nozomi,
      Thanks for your comment. I already have probit and tobit on my potential enhancements list. The next release will focus on time series analysis, but I will consider adding probit and/or tobit regression to one of the following releases.
      Charles

      Reply
  36. Dr. Zaiontz,
    Thanks for your very useful application!
    Would you be kind enough to point out how a user could go about reporting the testing power from a logistic multiple regression?
    As background, I observed 544 trees for whether they held cones or not as the dependent variable (i.e., 0 or 1), and for their length and age, both continuous variables. Your Real Stat report showed that among the independent variables, ln length had a significant association with cone presence while ln age and the interaction did not. For ln age in particular, the report shows the following: coeff b = -0.107, s.e. = 0.291, Wald stat = 0.135, p = 0.713, exp(b) = .899.
    How could I use Real Stat to find the power of the test of whether the coefficient b of ln age differs from zero?
    Thanks and Best Regards!

    Reply
    • The theoretical background of logistic regression is provided on the Real Statistics website. If you need additional detail, please look at the Bibliography.
      Charles

      Reply
  37. Charles,

    Thanks for your informational website. I am very new to regression analysis. I am learning from your website and YouTube videos. I have downloaded your Excel plug-in and am working on logistic regression. I am running into some challenges that I thought you could help with…

    I have created my training data set and when I run the logistic regression… sometimes I am getting all garbage values, like below

    p-Pred Suc-Pred Fail-Pred LL % Correct HL Stat
    #VALUE! #VALUE! #VALUE! #VALUE! #VALUE! #VALUE!

    Re-running the same data set again, sometimes it is working. Can you help me with the reason why I should be getting these errors?

    Thanks,

    -Niraj

    Reply
    • Niraj,
      I can’t think of any reason why one time the procedure works and the next time you get garbage, except that perhaps one time you include the column headings option and the next time you don’t (which makes the program think there is invalid data).
      I will take a look at the Excel file you emailed me and try to figure out what is going on.
      Charles

      Reply
  38. Hi Charles,

    I am trying to use a logistic regression to forecast a percentage. The aim is to forecast the turnout at different polling locations in my country. My independent variables are a combination of numerical and categorical data (Month,Day of Week, most recent participation percentage, log of advertising money spent).

    I know it is possible to make forecasts for this data having done so in other statistical software. I’d love to use your package to do it though since most of my work is done in excel and I have had great success with some of your other tools (Thanks!).

    My problem is that when I try to run a logistic regression with participation percentage as the dependent variable, I am told I need to have either 0 or 1 as the dependent variable. Is there any way around this so I can see the coefficients of the independent variables and make forecasts?

    Thanks for your help.

    Reply
    • Hi Michael,
      Logistic regression is used to make forecasts where there is binary outcome. It can be extended to a small number of categorical outcomes, but I have not seen it used to output percentages. You can use other regression techniques to forecast percentages, but as far as I am aware not logistic regression.
      Charles

      Reply
  39. Hi Charles,
    Before asking my question, I wanted to thank you for this website. It has been extremely useful for research I am doing.
    I have found the logistic regression easy to use; however, I am struggling with a further extension of the model to qualitative/categorical variables. I need to consider dichotomous and polytomous explanatory variables, but I don’t know how to code them. The real problem is with dichotomous variables because I normalise my data by taking logs before regressing, which means that I will have Log(1)= 0 and Log(1)= #Value!. How can I include these variables without affecting the accuracy of the whole model?

    Reply
    • Paola,
      Presumably, you mean Log(0)= #NUM!. This is a common problem. One approach is to use a Log(x+a) transformation instead of a Log(x) transformation, choosing the constant a so that x+a is always positive.
      Charles

      Reply
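
Note: the Log(x+a) approach suggested in the reply above can be sketched as follows. The function name shifted_log and the default shift a = 1 are my own assumptions; any constant that keeps x + a positive would do.

```python
import math


def shifted_log(x, a=1.0):
    """Return log(x + a); the shift a must make x + a strictly positive."""
    if x + a <= 0:
        raise ValueError("choose a larger shift: x + a must be positive")
    return math.log(x + a)


# A dichotomous variable coded 0/1: log(0) is undefined, log(0 + 1) is not.
values = [0, 1]
transformed = [shifted_log(v) for v in values]  # log(1) = 0 and log(2)
print(transformed)
```

For a variable that takes only two values, any such transformation just relabels the two categories, so leaving a 0/1 dummy variable untransformed is also an option.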
  40. I very much appreciate your making Real Statistics available – and with such clarity! I am using the Logistic Regression module but am unable to obtain results if I enter more than 5 independent variables. Please let me know where I am going wrong. Many thanks.

    Reply
    • Richard,
      The problem probably has more to do with your data rather than the number of independent variables. If you send me an Excel file with your data I will try to figure out what is going wrong. See Contact Us for my email.
      Charles

      Reply
  41. Hi Charles,
    Thanks for the site – very much insightful.

    Question here specific to the log regression function. How does one whittle down the number of variables for input into the model? Is this done as part of the pre-processing or is there an input parameter in any of the menus?

    Please advise.
    David

    Reply
    • Another thing please Charles,
      Applying the model results in plenty of #VALUE in the summary page. I have checked for possible formatting issues and eliminated nulls. What is the reason for this?
      Could I please send you the workbook?

      Reply
    • David,
      I have not provided any means for automatically whittling down the number of variables. I find that these automatic approaches can be sound mathematically, but they don’t take into account knowledge of the actual domain, especially since they usually don’t handle interactions between the variables, quadratic and higher powers of the variables, etc.
      Charles

      Reply
  42. Dear Charles,

    Thanks for the useful information on this website. I am, however, having a few problems with a logistic regression that I am running to test the relationship between a specific type of financial report (let’s call it type-a, where accounting information prevails, and type-b, where non-accounting information prevails) and the type of rating it gives (good or bad). My hypothesis is that there is no significant relation between type-b and bad ratings. Ratings are always scaled from 1 to 21; however, I have divided the ratings into two classes, class A (good) and class B (bad). I have collected a set of 50 reports, in which I have coded the type of report as 1 (accounting, type-a) or 0 (non-accounting, type-b) and the class as 1 (good, A) or 0 (bad, B). Even though, from eyeballing the data, there is a very weak relation between type-b and bad ratings (they coincide, 0 – 0, only 3 times out of 50), and although the logit regression on the binary variables gives me a coefficient of 0.1245 for the grade, the p-value is very high (0.98944). I cannot explain why this happened, since the process of data gathering and research was very rigorous. Could it be because I only ran the logit regression on a set of dummy variables (the type is the independent variable and the grade is the dependent variable)? What can be the problem?
    Thanks in advance!

    Reply
    • If you send me an Excel spreadsheet with your data, I will try to figure out whether there is a problem or help explain what is going on. You can find my email address at Contact Us.
      Charles

      Reply
  43. Charles,

    Is there a limit to the number of independent variables? I have a dataset with 45 independent variables that I am trying to analyze. If this is above the limit, can you suggest an alternative?

    Reply
    • You should be able to run the logistic regression with 45 independent variables. With such a large number of variables, you will also need a reasonably large sample (at least 45 just to get the model to run, much more to achieve reasonable power).
      Charles

      Reply
  44. Dear Charles,

    Thanks for this excellent explanation.
    I followed your instructions and mostly it worked well. However, when I tried to test a categorical independent variable (1: using multipurpose solution as cleaner; 2: using H2O2 as cleaner; 41 inputs in total), the output cells showed #VALUE!, even though I made sure that the last column in the Input Range contained the 0 or 1 dichotomous values of the dependent variable (microbial contamination of the contact lens system). What does this mean?
    Thank you for your help.
    Sincerely,
    Margaret

    Reply
    • Margaret,
      The usual explanation is that the logistic regression model did not converge to a solution. If you send me your worksheet I will check it out.
      Charles
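To illustrate what non-convergence can look like, here is a minimal Python sketch. It is not the Real Statistics implementation; the function names, the iteration limit, and the tolerances are assumptions chosen for illustration. It fits a one-predictor logistic model by Newton's method. One well-known cause of non-convergence is complete separation, where a value of the independent variable predicts the outcome perfectly. In that case the coefficients keep growing instead of settling down.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def newton_logistic(xs, ys, max_iter=25, tol=1e-8):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by Newton's method.

    Returns (b0, b1, converged). max_iter and tol are illustrative choices.
    """
    b0 = b1 = 0.0
    for _ in range(max_iter):
        # Gradient of the log-likelihood (g) and information matrix entries (h)
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det < 1e-12:
            # Information matrix is numerically singular: give up
            return b0, b1, False
        # Newton step: solve the 2x2 system H * d = g
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) < tol and abs(d1) < tol:
            return b0, b1, True
    return b0, b1, False

# Overlapping outcomes: the fit settles on finite coefficients
ok = newton_logistic([0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 0, 1, 0, 1, 1, 1])

# Complete separation: x = 0 always gives y = 0 and x = 1 always gives y = 1
sep = newton_logistic([0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 1])

print(ok)
print(sep)
```

In the separated case the third value returned is False and the slope is very large, which is the kind of failure that can surface as an error value in a spreadsheet. Checking a cross-tabulation of each categorical variable against the dependent variable for empty cells is a quick way to spot this.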

      Reply
  45. Charles,

    Thank you for this excellent explanation.

    I am building a dataset with three continuous independent variables (binned into values of 1 through 5 corresponding to standard deviation ranges above and below the mean) that I am testing against a dichotomous categorical dependent variable.

    My first attempt to use your data tool gave me cells with all significant categories from p-Pred and rightward containing only a #NAME output. There is a formula there, but it isn’t picking up data. It happened with both the Solver and Newton approach. I only had 30 inputs (and should have 100, under your minimum formula) so that may be the issue. But, I wanted to make sure there wasn’t something else going on before I kept adding data.

    I assume, by the way, that the input box assumes the last column on the right of the data set is the dichotomous output and that all columns to the left of that column in the selected range are the inputs. In other words, the columns must be contiguous and arranged in this fashion.

    Thank you.

    Reply
    • Jonathan,

      Using 30 inputs should not cause this problem.

      If you have chosen the Raw Data option, then the last column in the Input Range contains the 0 or 1 dichotomous values of the dependent variable. This column needs to be included in the data range. If you have chosen the Summary Data option, however, then the last two columns are associated with the dependent variable. The first of these contains the total number of successes for the corresponding independent variables and the second of these columns contains the number of failures for the corresponding independent variables (these totals won’t necessarily be 0 or 1).
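The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration of the data shapes described above, not code from the Real Statistics software; the function name `raw_to_summary` is hypothetical, and it assumes that a dependent value of 1 counts as a success and 0 as a failure.

```python
def raw_to_summary(rows):
    """Collapse raw rows (x1, ..., xk, y), with y in {0, 1}, into summary rows
    (x1, ..., xk, successes, failures), one per distinct combination of x values.

    Assumption: y = 1 is a success and y = 0 is a failure.
    """
    counts = {}  # maps (x1, ..., xk) -> (successes, failures), in first-seen order
    for row in rows:
        *xs, y = row
        key = tuple(xs)
        succ, fail = counts.get(key, (0, 0))
        if y == 1:
            succ += 1
        elif y == 0:
            fail += 1
        else:
            raise ValueError("dependent variable must be 0 or 1")
        counts[key] = (succ, fail)
    return [key + sf for key, sf in counts.items()]

# Raw layout: last column is the 0/1 dependent variable
raw = [(1, 0), (1, 1), (1, 1), (2, 0), (2, 0), (2, 1)]

# Summary layout: last two columns are successes and failures
summary = raw_to_summary(raw)
print(summary)  # [(1, 2, 1), (2, 1, 2)]
```

Either layout carries the same information for the regression; the summary form is simply more compact when many rows share the same values of the independent variables.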

      If you have done all of this correctly, then please make sure that you are using the latest release of the Real Statistics software. You can check this by using the worksheet formula =VER(). You should get the value 12.0 or 12.1 (if you are using the Windows version of the software). I made a few changes a number of releases ago which could be the cause of the problem that you have identified.

      If none of this resolves the problem, I would be happy to look at the worksheet you are using and see if I can resolve the problem. Just email it to me.

      Charles

      Reply
