Contact Us

Welcome to the Real Statistics Using Excel website. As described on the Home page, this website provides the resources (tutorials and statistics software) necessary to perform statistical analysis in the Excel environment.

In order to make this website more valuable to the user community, we welcome your comments, suggestions and feedback. We would especially like to hear about any errors in the website, examples or software. You can contact me as described below or by leaving a comment below or on any of the webpages.

Charles Zaiontz

My email: czaiontz@gmail.com, info@real-statistics.com
My profile: click here 

1,067 thoughts on “Contact Us”

  1. similarly if instead of six metals i want to find out only 1 metal in biscuits taken from nine towns of a city. then will it be appropriate to use single factor anova?

    Reply
  2. Dear Charles,

    How can I change some input value in your macro? I am trying to change value in {=DescStats(B3:B106,TRUE)} to {=DescStats(B3:B166,TRUE)}. Pressing “Enter” does not work here.

    Best regards,
    Kelly

    Reply
    • Dear Charles,

      Oh, eventually I realized that this is “Array formulas” and you had once explained the concept.

      Mant thanks and Merry Christmas,

      Best regards,
      Kelly

      Reply
      • i want to find out 6 heavy metals in bicuits taken from nine towns of lahore. from each town 6 samples of biscuits were collected. In each sample, i want to find out the concentrations of six heavy metals. should i use two way anova with replication or without replication?

        Reply
  3. Hi Charles,

    I am trying to download the software, but every time I click the link it says “Page Cannot be Found”

    Could you help me out? I am trying to compare some slopes by this afternoon!

    Thanks.

    Reply
    • Emma,
      Sorry about that. Earlier today I put out a new version of the software and apparently the wrong link was established. I believe this problem has now been corrected and you can now make the download.
      Charles

      Reply
  4. Dear Charles,

    Thank you very much for this excellent website.

    I was wondering if you might consider allowing users to subscribe to a brief email notice when there is new content. If so, I’ll be the first to sign up!

    Again, many thanks —

    Neil McQuarrie

    Reply
  5. Hi Charles,
    First of all, I would like to thank you for this excellent website. It provides plenty of useful information, especially for people who find difficulties in dealing with statistics.

    I have a question, and I hope to be able to explain it in an understandable way. I’ve performed an experiment without replicates, which consists in treating a cell culture with two different substances (control and treatment), and measure the effect on their growth at different times (1 hour, 7 hours, 12 hours and 24 hours). The effect that I have measured is the percentage of cells in a certain phase of their growth cycle. I would like to know if there is a significant difference in this percentage between control and treated cells.
    My main problem is that I don’t have any replicates. So far, I have tried with a two-factor ANOVA test without replication (considering “treatment” and “time” as the two factors), but I am not sure this is the right method for this situation. Are there any other statistical tests that I can use, considering that I don’t have replicates?

    Thanks in advance,

    Marina

    Reply
    • Marina,
      I am very pleased that you find the website helpful.
      If you want to test both the Treatment and the Time, then you should use Repeated Measures ANOVA with a fixed Treatment factor and a repeated measures Time factor. This specific situation is described on the following webpage:
      Repeated Measures ANOVA with one within subject factor and one between subjects factor
      You will need to read about Repeated Measures ANOVA in general before this webpage will make sense. See
      Repeated Measures ANOVA
      Note that if you only had the Time factor and not the Treatment factor you would use a simpler version of Repeated Measures ANOVA; this would be equivalent to two factor ANOVA without replication where one factor is Time and the other is Subjects.
      Charles

      Reply
  6. Iam getting #NUM! and # div/0! on my data when i use multiple variables for linear regression, which doesn’t happen when iam doing with solitary variables please help

    Reply
  7. Thank you Charles. Your site provides a wealth of information, especially for me, a novice and simply not all that good at mathematics.

    I have a question, but I have yet to find information about a solution. It’s related to your binomial and beta distribution tutorials. I can use the BINOMDIST to find the sample size needed when I provide the required accuracy, confidence level, and number of errors. But I’d like to go the other way and find the number of allowed errors when I know the required accuracy, required confidence level, and number of trials. All my tests are pass/fail. There are no retries. Is there a similar statistical function or formula I can use to do this? I’ve searched, but I’m not familiar enough with statistical functions to know if I’m just missing it or not.

    Thanks in advance,
    Andrew

    Reply
    • Andrew,
      Glad you are getting value from the website.
      There may be a simpler way, but you should be able to use Goal Seek or Solver to solve these sorts of problems. Numerous examples are provided on the website for using these tools for similar type problems.
      Charles

      Reply
      • Hello Charles,
        pls help me to chose design for following problem
        I want to develop mixture/formulation. A combination of 2 polymers gives me this formulation. poly A- 6% & poly B-2%.

        however, for poly A, there is supply issue,so, I am looking for 2nd vendor.

        I tried poly A from 2nd vendor. But, to get same formulation, it requires 4% only (compared to 6% of first vendor).

        Now, I am wondering if it is possible to have one formulation which will accomodate poly A of both vendors?
        we can change level of poly B to get this new robust formulation.
        pls help. Thanks in advance.

        Mudassar

        Reply
  8. Hi Charles,

    yesterday I found out your excellent RealStat. It impressed me so much: I think it may be very helpful. Great job!

    As soon as installed, it worked perfectly.

    However, the spreadsheets I saved yesterday today were no longer working. So I have had to reinstall the package and unblock the file RealStat.xlam to make it working. I also added the Addin directory among the Trusted Locations. (Windows 10 64 bit)

    Curiously, now it is doing all the calculations but, unfortunately, its very useful “help” in the function wizard doesn’t still work. So it is boring and slows me down a lot to have to look in the website for the documentation of each function parameter.

    I tried to reinstall the addin more and more times, but I failed.

    Please, can you tell me why it happens?

    Paolo

    Reply
    • Paolo,
      I am pleased that you like RealStats.
      I don’t know why you had to reinstall the software. The help wizard should work in the Excel 2013/2016 version of the software. Yiu can check which version you are using by entering the formula =VER(). I hope to make this capability available in the Excel 2010 version in the next release.
      Charles

      Reply
      • Hi,

        the formula =VER() reports “5.2 Excel 2013/2016”, Excel version is 2016. I Have had to reinstall RealStats because something was blocking it: Control+M did not longer work.

        Yesterday, for shure, the help worked. Today, after reinstalling, no more. The Macros, on the contrary, work wonderfully.

        Paolo

        Reply
  9. Hello Sir,

    We are developing a method that requires method precision as a validation parameter. Normally we sample 6 to 10 times from a material and analyze each on to generate and average, SD, and RSD. The %RSD serves as the precision. However, this time the sample is a powder and we look to determine two contaminate powders in the sample. It is difficult to obtain a homogeneous sample at the desired 100 ppm target limit making method precision determination nearly impossible. I could of course spike the 2 components on top of the powder sample for each individual prep and find out the percent recovery, mean recovery, SD and %RSD. But could I then use this data to determine the method precision? In other words for to I relate spike and recovery data back to a target amount in ppm? relative difference to the theoretical for each determination then %RSD of that? Just trying to see if I can use recovery data to determine method precision when the samples are independent.

    Reply
  10. Hi Charles,
    Thanks for such an excellent blog. Could you please indicate where to find the spreadsheet listed as Appendix 4 in your 2015 Psicologica paper (A robust function to return the cumulative density of non-central F distributions in Microsoft
    Office Excel)?
    Many thanks.

    Reply
  11. Hi, could you please help me? I’m trying to use the COVP function for obtaining variance-covariance matrix, from two columns of data. Also, I’m selecting in Excel an empty 2×2 cels range an then pressing ctrl-shift-enter. I works great!! I work on Windows. The problem is that my girlfriend has downloaded the Mac version of Realstatistics, She is doing exactly the same that I do, but in her case she obtains only errors. Do you have an idea of what is happening? Thanks

    Reply
    • Eduardo,
      Do you know which version of RealStats she is using? You can find this out by entering the formula =VER(). Also what version of Excel for the Mac is she using?
      Charles

      Reply
  12. Hi Charles:
    Thank you for your blog, Iam learning a lot.
    Right now I am evaluating a method for clinical skill learning (I’m Professor in Dentistry). The method is perfoemred by 25 students, and there are 8 different clinical procedures. And according to the performance of the student in each procedure, the student can be classified as sactisfactory (1) or unsatisfactory (0).
    My doubt is the following:
    In my case, do you thing that Kruder-Richardson is the best choice? Or may I use Fleiss Kappa interrater agreement?
    The following is part of the table, the first column is the number of studentes (1 to 25), the following 8 columns are eight clinical procedures, and 1 mens satisfactory and 0 menas unsatisfactory. Hope that you can help me.

    St DV 11 DL 21 MV 16 DV 15 DL 46 DL 41 ML 43 DV 32
    1 1 1 1 1 1 1 1 1
    2 1 1 1 1 1 1 1 1
    3 1 1 1 1 1 1 1 0
    4 1 1 1 1 1 1 1 1
    5 1 1 1 1 1 0 1 1
    …..
    …..
    …..
    25 1 1 1 1 1 1 1 1

    Thank you for your attention.
    Best regards
    Tien

    Reply
    • Tien,
      Glad to see that the Real Statistics website has been helpful to you.
      Based on additional information you provided me in an email, this doesn’t appear to be a fit for either Kruder-Richardson nor Fleiss Kappa.
      I have tried to give some ideas in my email response.
      Charles

      Reply
  13. Cntrl+M is not bringing up Real-Statistics today and I need to run some time series analysis. Was working fine yesterday. Please advise.

    Reply
  14. I see that I had not un-checked the “data labels” option when specifying the data ranges.
    My bad.
    Apologies for the disruption.

    Reply
  15. Hello-
    In Excel 2010, the Hotelling T2 output for the Hotel 2 example data does not match the results illustrated in the example.
    The output from Real Stats is:
    Hotelling T-square Test
    Two-samples (equal covariance matrices)
    T2 3.149934857
    df1 3
    df2 32
    F 0.988214857
    p-value 0.410715235

    However, the results from the example Hotel 2 are:
    Hotelling T-square Test
    Two-samples (equal covariance matrices)
    T2 4.116057236
    df1 3
    df2 34
    F 1.295795796
    p-value 0.291686665

    Please advise. Thank you.
    Sandy

    Reply
  16. Hi Charles,

    I have downloaded the software and added in in my excel app. However, it disappears after I close my excel app, and I need to do add-in again every time I open my excel. Can you suggest what is the real problems here?

    Thanks,

    Reply
  17. Hi Charles,

    I just wanted to drop a line to say thank you for your work! I used some of your code as well as learnt from your explanations there too.

    Also I read that you may want to do some videos. If I may suggest, please make them from a problem solving perspective. There are already many videos on theory explanation but not so many where the theory is extracted while tackling and solving different problems.

    Regards,
    Eduardo

    Reply
  18. Hello,
    Would you kindly suggest an analytical tool for my agric experiment.
    I have four replicates of each sample and data for plant height and number of leaves, spanning 7 weeks i.e data is recorded weekly.
    i also have data for days to flowering, shoot weight, root weight and fruit yield (all for each sample)
    the experiment is to test for the effect of two pathogens on the sample when treated with chemicals. a positive control, neutral control and negative control were also used

    Reply
    • Riife,
      It is difficult for me to give a precise answer without more detail about the experiment and more importantly about the type of hypotheses you want to test.
      It certainly sounds like some sort of repeated measures Manova type of analysis (Manova since there are multiple dependent variables. Repeated measures if you want to compare weeks).
      Charles

      Reply
  19. Dear Charles,
    Thank you very much for this wonderful site. I am new to forecasting and working on machine learning problems now.

    I’ve a use case to predict number of bookings for a given week.
    Features are – Year, WeekNumber, CountryCode, AgentCode, No.of.bookings.
    I understand that ARMA, ARIMA models looks only for one variable (here No.of bookings) and try to forecast the same. But in the given case, other attributes impact no. of bookings.

    Which are the approaches I can take to model the forecast ?

    Reply
    • You are talking about panel data, which is an extension to time series data. The Real Statistics software currently doesn’t support this type of data. Eventually it will support this type data.
      Charles

      Reply
  20. Dear Charles,

    Thank you for this very useful resource.
    Do you have an online-course of statistics from A to Z?
    If no, did you ever thought about making it?
    I guess lots of people would like to study statistics based on Excel with you.

    Greetings from Russia, Moscow.
    All the best,
    Vladimir

    Reply
  21. Hy sir i appreciate your concern and i have one question
    1. Iam conducting my thesis on Comparative study of Analgesic effectiveness of thoracic paravertebral block (TPVB ) vs. intercostal nerve block vs patient without these two for cholecystectomy under General anesthesia.
    since i have three group my statistic tool is ANOVA
    my question shall i use one way anova or two way anova ?
    thanks sir

    Reply
  22. Hi Charles,
    Great initiative and the website explains the details in a flow that it is easily understandable.
    I am currently trying to study Holt-Winters and apply it to one of my data sets,but i am facing a challenge. My data consists of data points with time stamp interval of 1 minute and complete data set is for 4 months. What should be my value of c? Any suggestions?

    Reply
  23. Hi M. Charles Zaiontz,
    Two or three words : Thanks a lof !
    Your web site and all the related stuff are just amazing. Merci pour cette générosité.

    Reply
  24. Charles,

    When I attempt to use the Shapley function with an array of data (as shown on the Shapley-Owen example page), it provides the decomposition value for the first column in the area but doesn’t present a value for other columns (as described below):

    Real Statistics Functions: The Real Statistics Resource Pack contains the following array function. Here R1 is an n × k array containing the X sample data and R2 is an n × 1 array containing the Y sample data.

    SHAPLEY(R1, R2): outputs an k × 1 column range containing the R_1^2, R_2^2, …, R_3^k values

    For Example 1, the output from the formula SHAPLEY(A4:C8,D4:D8) is shown in range G13:G15 of Figure 1.

    Any idea why the respective values for the remaining columns in the array are not showing up in the workbook (like your example)?

    Thanks,

    Reply
    • Dave,
      There are three x values in Example 1, x1, x2, x3, one for each column. The output contains three rows. Each row corresponds to one of the three columns.
      If you had four X variables and each variable had 100 rows of data, the output would contain four rows, one for each of the four variables.
      Charles

      Reply
  25. Dear Charles,

    I have concentration data for about 50 metabolites from plants under 3 invariables: temperature (5 and 15 degrees), time (1, 7 and 10 days) and light (white and far red). What would be the best statistics to analyse these data to know that which of the invariables or their interaction influence the metabolite concentration.
    Many thanks for your help in advance.
    Best regards
    Ferenc Marincs

    Reply
  26. Hello Charles, My excel crashes when i performed a kruskal Wallis test. Now Every other test i did displays a #N/A. I tried to restore it but i have the same problem. Have you a solution or must i do the tests again ?

    Reply
    • Mathieu,
      I can’t think of any reason why Excel should crash in this case, nor why you should get #N/A.
      If you send me an Excel file with your data and the Kruskal-Wallis test results (if available), then I will try to figure out what is going on. You can send the file to my email address as shown at Contact Us.
      Charles

      Reply
  27. Dear Charles,
    I’m sorry, if my question was not clear.
    I would like to ask if I can pool the data set from two different surveys, taking into account that the first survey population is 328 staff members of the hospital as (independent variable: six-sigma methodology). The second surveys population is the inpatients in the same hospital, numbering 540 patients as (dependent variable: patient satisfaction). Both surveys consist of five dimensions (define, measure, analyze, improve, control). While the survey of workers with 68 questions and the survey of patients with 36 questions. Both surveys were designed according to Likert scale.
    My question is: Can I collect the two data-files from both surveys into one dat- file and calculate the relationship (correlation) or regression between the independent variable (six-sigma methodology) and dependent variable (patient satisfaction)?

    Reply
    • Amer,
      You question has been very clear, and my answer is the same. You can certainly collect the the data from the two data sets into one file, but I don’t see how you can calculate a correlation. If each staff member was responsible for one of the patients, then you would have the one-to-one relationship necessary to calculate a correlation, but this is not your situation.
      You can compare the survey results from the two groups (e.g. by using multiple t tests or, better yet, using Hotelling’s T-square test), but this is different from calculating a correlation.
      Charles

      Reply
  28. Hi Charles,

    I have a question regarding the Holt- Winters method for forecasting using the realstats excel add-in. What do you exactly mean by “number of seasons” in the dialog box? My data is monthly (16 months), if I insert “4” as the number of seasons does that mean that it will regard every (16/4= 4) months as a season? Or does this method not work when data is monthly? Should I have my data prepared in a certain way before forecasting?

    Thank you in advance.

    Reply
  29. How to associate data from two different surveys coming from two different populations?
    I want to associate between staff perception of hospital quality with patient satisfaction about hospital services quality. Two different surveys were used for the 2 different population. The only common thing is that they were collected from the same hospitals. Other meaning: How to find the relation (correlation or regression) between the independent variable (Six Sigma methodology: from the first survey) and between the dependent variable (patient satisfaction: from the second survey)? I have two study populations: the first is the in-patient client at the Hospital and the second population is health providers at the same
    the first survey was to collect the data from workers (health providers: 328 persons) aiming to measure the level of using Six Sigma methodology at the hospitals. The survey consisted of (5) domains (DMAIC) that is (define, measure, analyze, improve, control) and total of (68) statements by Likert scale (5 degrees: strongly agree, agree, neutral, disagree, strongly disagree).
    the second survey was to collect the data from in-patients client (540 patients) aiming to measure the level of patient satisfaction at the hospitals. The survey consisted of (5) domains (DMAIC) that is (define, measure, analyze, improve, control) and total of (36) statements by Likert scale (5 degrees: strongly agree, agree, neutral, disagree, strongly disagree).
    Actually, I collected the data from the two populations, but I don’t know how to correlate between them as I mentioned above.

    Reply
    • Amer,
      Since the populations are different, I can’t think of a way to calculate a correlation.
      If they are measuring the same thing, then you can test whether the measurements differ.
      Charles

      Reply
      • Dear Charles,
        Thanks for your effort, I’ll like to ask, no way to calculate the correlation, even thought the data collected form the patients and health worker form the same hospital and during the same period of time and the basic 5 domains are similar in both surveys.

        Reply
        • Amer,
          You might be able to calculate the correlation in this case. With multiple variables, you essentially run linear regression and look at the Multiple Correlation value output from the Regression data analysis tool.
          Charles

          Reply
          • Dear Charles,
            Thanks again, I’ld like if you could please tell me the criteria which l’ll use is correct.
            Sine I have 2 populations, one is the workers (328 persons) and other is the patients (540 patients), every population is in a separated date file, so I’ll do the following:
            First: combining the two data files into one data file.
            Second: As I’ve two kinds of observations, one form worker at the hospital and the other form the in-patient from the same hospital, also every observation has the same 5 domains, and every domain has several questions (NB: The workers’ questions differ from the patients’ questions), so to overcome this difference I’ll calculate the mean for each of the 5 domains for every observation and as well as the total score.
            Third: Run the regression to find the correlation between the data (means) of workers and that’s for patients, but I don’t know which type of regression must I use because the 2 samples’ (workers and patients) are not equal in size (the workers are 328, and the patients are 540).
            So please Dr. Charles, tell me if the criteria that I mentioned above is right or not and your advice to me.
            My best regards,
            Amer

          • Amer,
            It doesn’t seem like you can use regression or correlation in this case.
            Without using any statistical terms, what real-world problem are you trying to solve?
            Charles

  30. Dear Charles,

    From where can I find the information about Granger-Causality Test, Impulse Responses, and Forecast Error Decompositions at all?

    Best regards,
    Kelly

    Reply
    • Kelly,
      These are topics that I haven’t yet covered on the Real Statistics website. I would imagine that most good books about time series analysis or econometrics cover them – e.g. Gujarati and Porter’s Basic Econometrics.
      Charles

      Reply
  31. Dear Charles,

    In ADFTEST(A3:A22,TRUE,-1), the -1 means using the Schwert estimate for the maximum number of lags. In ADFTEST(A3:A22,TRUE,0), what is the meaning of 0 here?

    Best regards,
    Kelly

    Reply
  32. Hello Sir,

    Although I’m using SPSS for my statistics, I still find your website very useful.
    Regarding “linear regression”, are the results still true if my variables are not normally distributed according to K-S and Shapiro-Wilk’s test? In other words, does linear regression require the assumption of normality even though the variables are linearly correlated?
    After running the test, all other assumptions (except for heteroscedasticity) were met including independence of observations, normal probability plot of residuals and the model was a good fit with an R-squared value of 0.7346.
    I have run “Robust Standard Errors” to overcome heteroscedasticity. Does this make sense? I mean would it correct for both normality and heteroscedasticity? If so how should I write/ report the results of a robust standard errors test?

    Thank you

    Reply
  33. Respected Sir
    I have downloaded real stat for Excel 2010. after installation it asks for password. Please help me how I may get password.

    Reply
  34. Good evening,
    at first I should write, that is great, that you are doing all these things, soft, web, discussion… You have my admiration fo this. But I am also asking for a help. I have two parametres on measured tests. One is position (-5,-4,…0,+1,+2,…,+5). For some positions I have results of test (we could say second parameter), for other position I would like to find result (presumption, estimation, premise) without test, only with knowledge of neibourghs, knowledge of others. For example -5~5.1, -4~4.2, -3~???, -2~5.5, -1~4.5, 0~4.2, +1~5.5, +2~5.5, +3~5.2, +4~3.9, +5~4.5. Is possible to compute position”-3″? Please, how to do it? Thank you very much for your help.
    Miroslav

    Reply
      • Good evening,
        thank you very much for your answer, but I did mistake, it looks. Data, which are used are measured from tests on complicated sample (values corresponding to deformations on the positions -5,-4,…0,1,…5 on the car). For some positions I have results of test, for other position I would like to find result (presumption, estimation, premise) without test, only with knowledge of neibourghs, knowledge of others. So, if it is like this, it is not Normal distribution, is it? In that case, maybe better way is just “trend” in excel. But I am not sure, if it is enough 6th degree of polynomial function for this tests. Thank you very much,
        best regards
        Miroslav

        Reply
        • Miroslav,
          If you believe that the data follows a particular 6th degree polynomial y = p(x), then simply substitute the x value x0 that you want and calculate p(x0) — i.e. the trend, as you have suggested. This is essentially how one of the simplest missing values approaches works.
          Charles

          Reply
          • Good afternoon Charles,
            I am sorry for late reply. I have to thank you very much for your answers. This is reason, why I am writing: thak you.
            Have a nice day 🙂
            best regards
            Miroslav

  35. Dear Charles,

    I am very confused with the ADF and cointegration issue.

    Am I corrected that if I have some time series variables, I have to first make sure they are stationary by applying ADF on the variables. Once they are stationary, I then can build a model using those variables as dependent and independent variables. Then, I have to perform the cointegration test on the model by applying ADF on the residuals.

    Is my understanding right?

    Many thanks and best regards,
    Kelly

    Reply
    • Kelly,
      Conducting an ADF test doesn’t make a time series stationary, it merely tests whether the time series is stationary. If the time series is not stationary, you need to do something else to create a related stationary time series — e.g. differencing. In this case you can now build a model for the stationary time series which you can use, for example, to make predictions. The good news is that form these prediction you can also make predictions about the original non-stationary time series.
      Charles

      Reply
      • Dear Charles,

        Thank you for your response. If I have 4 original variables in which 2 are stationary and 2 are nonstationary, I then need to transform them into 4 stationary variables before building a time series model from these stationary variables that are transformed from the original variables.

        Since not all of the orginal variables are nonstationary at the beginning, do I still have to perform a cointegration test?

        Best regards,
        Kelly

        Reply
        • Kelly, Perhaps. See the following from Wikipedia:

          The possible presence of cointegration must be taken into account when choosing a technique to test hypotheses concerning the relationship between two variables having unit roots (i.e. integrated of at least order one).[3] The usual procedure for testing hypotheses concerning the relationship between non-stationary variables was to run ordinary least squares (OLS) regressions on data which had been differenced. This method is biased if the non-stationary variables are cointegrated.

          For example, regressing the consumption series for any country (e.g. Fiji) against the GNP for a randomly selected dissimilar country (e.g. Afghanistan) might give a high R-squared relationship (suggesting high explanatory power on Fiji’s consumption from Afghanistan’s GNP). This is called spurious regression. To be more mathematically precise, two integrated I(1) series which are statistically independent may nonetheless show a significant correlation; this phenomenon is called spurious correlation.

          Charles

          Reply
  36. Hi Charles

    Very helpful website but I have a query which I hope you can help me with. I am investigating the effects of loneliness on maladaptive behaviours and after computing a Pearson’s correlation, a non-significant relationship was found. I wanted to analyse my data further and see if there were any gender differences or differences regarding age. Is is appropriate to just compare the Pearson’s scores I have ran for the groups even though the sample sizes are different?(I have 34 males and 12 females). Or do I need to use a Fisher’s conversion to appropriately compare them.

    Thank you in advance
    Molly

    Reply
    • Molly,
      I don’t see how you can calculate a Pearson’s correlation when the sample sizes are different. Perhaps you need to use a t test in this case. These subjects are covered on the Real Statistics website.
      Charles

      Reply
  37. Charles,

    We have been using the JENKS formulas in our supply chain to help rank items.

    We come up with a weighted ranking value for each of our SKUs. Then, we take the JENKS formula against this value to determine rank distribution. This worked perfectly last time returning 10 ranks all populated with data.

    However, this year when we ran it, sometimes we are only getting 7 to 9 ranks. When we use JENKS, we are specifying the number of ranks we are wanting returned.

    Do you have any thoughts as to why this would happen?

    Thanks for any input.

    Reply
    • Jonathan,
      If you send me an Excel file with your data and the results of the JENKS function, I will try to figure out what is happening.
      Charles

      Reply
  38. Hi Charles, great to land on your website! I have a small data set (n<25). I run multiple regression for different combinations of 4 variables and single factor. With variable A, it shows multiple R<15% and Rsquare50%, Rsquare>30%, coefficient =0.004. I’m considering to choose A & C. what do you think? how would you interpret these? thanks so much!

    Reply
  39. Hi Charles
    I’ve got some A&E data that need analyzing and would appreciate some advice. The data I have are categorical; address area, age group, trust attended, referral mode, discharge destination. I have done chi-analysis on the data but I want to do analysis to predict the discharge destination of an individual from a certain area, of a certain age group, attended a certain trust by a certain mode. A probability tree would be huge, is there anything else that can be done?

    Thanks
    Sab

    Reply
  40. Charles says:
    December 6, 2015 at 9:13 am

    It turns out that Statistics is not just “science” but also “art”.
    Charles
    _________
    Exactly on the target, Charles!
    In fact Statistics in practice lives “UII”, under insufficient information. It´s the heel, and the pleasure, of our lives,
    Art disguised as a Mathematical Branch let me say.
    Inductive instead of deductive.sts of hypothesis via necessary conditions, not at all sufficient ones.

    Reply
  41. Good evening Dr. Charles.
    I’m an undergraduate student from a university in Indonesia. If you don’t mind, I would like to ask a question about normal distribution.

    If I have data of a pre-test, an immediate post-test, and a delayed post-test, do all my tests have to be normally distributed? Or is it just my pre-test which needs to be normally distributed?

    I would be really glad if you would answer my simple question.
    Thank you in advance.
    Steve

    Reply
    • Steve,
      Assuming that you are referring to a repeated measures Anova, then you need make sure that the pre-test, post-test and delayed post-test are all normally distributed. Of course, Anova is pretty robust to violations of this assumption, especially if each group is reasonably symmetric.
      Charles

      Reply
      • Thank you very much for your reply.

        Sorry for not mentioning it before.
        I’m actually running a univariate ANCOVA test, with the pre-test as a covariate. Then, I guess I also need normal distribution for all my tests?

        If it is robust to violation of the assumption of normality, does it mean I can use the ANCOVA test even though the Kolmogorov and Shapiro tests are significant at .000? And the skewness and kurtosisness of the immediate post-test are significant?

        Thank you!

        Reply
  42. I have a SB coefficient of 0.502 and a guttman split-half coefficient of 0.498. Does it means my data has a low reliability?

    Reply
    • Cedric,
      There isn’t universal agreement about what is high and low, but most people would say that this is not sufficiently high reliability.
      Charles

      Reply
  43. Dear Sir
    I have started to reading this article. I think it may be has instead of have. I really don’t know whether it’s have or has or having.
    ”Four fields are available for testing with each field have fairly uniform characteristics”

    Sincerely
    Sujit

    Reply
    • Sujit,
      It should be “having”. Thanks for identifying this error. I have now corrected the webpage.
      I appreciate your help in improving the website.
      Charles

      Reply
  44. Hi Charles,

    Please give me your thoughts on the following.
    After studying further onto the previous matter, I came to the conclusion that a one-way ANOVA using contrasts for planned comparisons would be appropriate. That way I can identify which groups are significantly different from the control group.
    My problem is how can I identify the proper materials, since I’m interested in the materials that are Not significantly different from the original.
    The original material has a mean of 7.5 but the minimum observed is 7.2 and the maximum 7.7.
    If I use a standard alpha of 0.05 or 0.01, even with the Dunn/Sidak or Bonferroni corrections some materials with a mean of 7.6 are being considered significantly different when it is known that they shouldn’t.
    If I do a one-sample T-test on the control material, and use the p-value obtained as my alpha I get much more reliable results that correlate with the observations of materials that are comparable to the control and the ones that are not.
    With this method my alpha is extremely small (3.73446E-51), but it still identifies the materials that are actually different, and more importantly it does not make everything significantly different as was the case when using the standard p-values.
    Please give me your thoughts on if this would be an acceptable practice.
    Thanks,
    Sergio

    Reply
  45. Dear Sir,
    I love this website so much. I do not have a statistics background but I am trying to learn. I am currently doing a research that needs statistics knowledge. Could you give me some advice on my research below?

    I am trying to use multiple regression method to explore impacts on housing price in a city. The city has a lake and has a smelly river (these two waterbodies are not adjacent to each other). The research purpose is to gauge housing value increase given that the smelly river been converted to a lake by building a barrage.

    I have got housing transaction data that occurred in the area within 1000m buffer from the river and the lake, from 2014 to 2017 of around 750 records. In the database columns refer to different attributes of transactions. The attributes may include ‘property sale price’, ‘name of suburb’, ‘sale date’, ‘land area’, ‘bedroom number’, etc.

    By converting the ‘name of suburb’, I can get another two attributes: 1.the proximate distance of the property to CBD ‘Dist_ CBD’; and 2. whether or not the property is located besides the lake or the smelly river ‘lake_nolake’.

    Now in the regression model, Y is per sqm housing transaction price. Xs are,
    ‘Dist_ CBD’ (I measured by meter);
    ‘lake_nolake’ (if the property is lake-side, I code it 1, if the property is river-side, I code it 0)
    ‘sale date’ (I converted the date to number);
    ‘land area’ (I keep the original unit of sqm);
    ‘bedroom number’ (I keep the original unit of number);

    Then I ran the regression model.
    Regression Analysis

    OVERALL FIT
    Multiple R 0.434978 AIC 7900.829
    R Square 0.189206 AICc 7900.882
    Adjusted R Square 0.187063 SBC 7914.729
    Standard Error 180.5385
    Observations 760

    ANOVA Alpha 0.05
    df SS MS F p-value sig
    Regression 2 5757830 2878915 88.32612 3.33E-35 yes
    Residual 757 24673773 32594.15
    Total 759 30431603

    coeff std err t stat p-value lower upper vif
    Intercept 497.0258 11.62045 42.77164 3.7E-204 474.2137 519.838
    Dist_CBD -20.5369 1.712126 -11.995 1.8E-30 -23.898 -17.1758 1.237728
    Lake_NoLake 3.5051 31.53518 0.111149 0.911528 -58.4017 65.4119 1.237728

    Then the question would be:
    How to interpret the ‘lake-nolake’ coefficient of ‘3.5051’? Can I say the per sqm price of lake-side properties tends to be higher than that of river-side properties by $3.5051?

    Thank you sir.

    Reply
    • Bo,
      Based on the regression model that you have created, you are correct in saying that the model predicts that the per sqm price of lake-side properties is $3.5051 higher than river-side properties.
      Charles

      Reply
      • Thank you very much for your prompt reply.
        The R square of the model is only 0.189 which would mean the model can only explain small part of the situation.
        May I ask how to convince the local residence the reliability of the $3.5051?

        Reply
          • Sorry for my unclear question. I would like to show you the conflict results between multiple regression and single regression using same database and wish to seek your advice.

            In my comment on March 23, 2017, we can say that the per sqm price of lake-side properties tend to be $3.5051 higher than river-side properties.

            However, if I run a single regression using the same data (y=per sqm housing transaction price, x=lake_nolake), the regression result is:
            y=162.27x+383.5 R square= 0.0351 which is low. This single regression result may tell that the per sqm price of lake-side properties tend to be $162 lower than river-side properties.

            Thank you for your help.

  46. I believe there is an error on your page explaining how to execute Holt-Winters method in Excel. For “Example 1” you say to put “=E$22*(C9/D9)+(1-E$22)*F5 in cell F8” however I believe it should be “=E$22*(C8/D8)+(1-E$22)*F4 in cell F8”.

    Thank you for this site, the information is extremely helpful!!

    Breadcrumbs for ref: /time-series-analysis/basic-time-series-forecasting/holt-winters-method/

    Reply
    • Doug,
      Thanks for catching this error. I have now made the change on the referenced webpage.
      I really appreciate your help in making the website better.
      Charles

      Reply

Leave a Comment