Other Intraclass Correlation Models

Basic Concepts

In Intraclass Correlation we reviewed the most common form of the intraclass correlation coefficient (ICC). We now review other approaches to the ICC as described in the classic paper on the subject (Shrout and Fleiss, 1979). In that paper the following three classes are described:

Class 1: For each of the n subjects, a set of k raters is chosen at random from a population of raters and each of these raters rates that subject. Note that each subject is potentially rated by a different set of raters.

Class 2: k raters are chosen at random from a population of raters and these k raters rate all n subjects.

Class 3: Each of the n subjects is rated by the same k raters, and the results apply only to these k raters.

The ICC values for these classes are respectively called ICC(1, 1), ICC(2, 1) and ICC(3, 1). Each of these measures the reliability of a single rater. We can also consider the reliability of the mean rating. The intraclass correlations for these are designated ICC(1, k), ICC(2, k) and ICC(3, k).

Worksheet Functions

Real Statistics Function: The Real Statistics Resource Pack contains the following array function:

ICC(R1, class, type, lab, alpha): outputs a column range consisting of the intraclass correlation coefficient ICC(class, type) of R1 where R1 is formatted as in the data range of Figure 1 of Intraclass Correlation, plus the lower and upper bound of the 1 – alpha confidence interval of ICC. If lab = TRUE then an extra column of labels is added to the output (default FALSE). class takes the values 1, 2 or 3 (default 2) and type takes the values 1 (default) or k where k = the number of raters. The default for alpha is .05.

For example, the output from the formula =ICC(B5:E12,2,1,TRUE,.05) for Figure 1 of Intraclass Correlation is shown in Figure 1 below.


Figure 1 – Output from ICC function
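If you want to check these values outside of Excel, the same six Shrout and Fleiss coefficients can be computed in Python. Below is a minimal sketch that assumes the third-party pingouin library; the ratings data frame is a placeholder in the same subjects-by-raters layout, not the actual wine-tasting data.

import pandas as pd
import pingouin as pg   # assumed installed: pip install pingouin

# Placeholder ratings in "long" format: one row per (subject, rater) pair
ratings = pd.DataFrame({
    "wine":  [w for w in range(1, 9) for _ in range(4)],   # 8 subjects
    "judge": [j for _ in range(8) for j in range(1, 5)],   # 4 raters
    "score": [1, 2, 3, 2, 4, 5, 5, 4, 2, 3, 2, 2, 6, 7, 6, 5,
              3, 4, 4, 3, 5, 5, 6, 6, 2, 2, 3, 2, 7, 8, 7, 7],
})

# intraclass_corr reports ICC1, ICC2, ICC3 (single rater) and ICC1k, ICC2k, ICC3k
# (mean of k raters), each with a 95% confidence interval
icc_table = pg.intraclass_corr(data=ratings, targets="wine", raters="judge", ratings="score")
print(icc_table[["Type", "ICC", "CI95%"]])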

Data Analysis Tool

Real Statistics Data Analysis Tool: The Interrater Reliability data analysis tool supplied in the Real Statistics Resource Pack can also be used to calculate the ICC.

To calculate the ICC for Example 1 press Ctrl-m and choose the Interrater Reliability option from the Corr tab of the Multipage interface as shown in Figure 2 of Real Statistics Support for Cronbach’s Alpha.

If using the original interface, then select the Reliability option from the main menu and then the Interrater Reliability option from the dialog box that appears as shown in Figure 3 of Real Statistics Support for Cronbach’s Alpha.

In either case, fill in the dialog box that appears (see Figure 7 of Cohen’s Kappa) by inserting B5:E12 in the Input Range and choosing the Intraclass Correlation option. After clicking on the OK button, the output shown in Figure 2 is displayed.


Figure 2 – Output from ICC data analysis tool

You can change any of the values in range AG5:AG7, and the values in range AG9:AG11 will change automatically.

We next show how to calculate the various versions of ICC.

Class 1 Model

For class 1, we use the model

xij = μ + βj + εij

where μ is the population mean of the ratings for all the subjects, μ + βj is the population mean for the jth subject and εij is the residual, where we assume that the βj are normally distributed with mean 0 and that the εij are independently and normally distributed with mean 0 (and the same variance). This is a one-way ANOVA model with random effects.

As we saw in One-way ANOVA Basic Concepts

E[MSB] = var(ε) + k·var(β)          E[MSW] = var(ε)

where MSB is the between-groups mean square and MSW is the within-groups mean square.

The subjects are the groups/treatments in the ANOVA model. In this case, the intraclass correlation, called ICC(1,1), is

ICC(1, 1) = var(β) / (var(β) + var(ε))

The unbiased estimate for var(β) is (MSB – MSW)/k and the unbiased estimate for var(ε) is MSW. A consistent (although biased) estimate for ICC is

ICC(1, 1) = (MSB − MSW) / (MSB + (k − 1)·MSW)

Example

For Example 1 of Intraclass Correlation, we can calculate the ICC as shown in Figure 3.


Figure 3 – Calculation of ICC(1, 1)

First, we use Excel’s Anova: Single Factor data analysis tool, selecting the data in Figure 1 of Intraclass Correlation and grouping the data by Rows (instead of the default Columns). Alternatively, we can first transpose the data in Figure 1 of Intraclass Correlation (so that the wines become the columns and the judges become the rows) and use the Real Statistics Single Factor Anova data analysis tool.

The value of ICC(1, 1) is shown in cell I22 of Figure 3, using the formula shown in the figure.

We can calculate the confidence interval using the following formulas:

lower = (FL − 1) / (FL + k − 1)

upper = (FU − 1) / (FU + k − 1)

where F = MSB/MSW, FL = F / Fcrit(α/2; n−1, n(k−1)), FU = F · Fcrit(α/2; n(k−1), n−1) and Fcrit(p; df1, df2) is the right-tail critical value of the F distribution with df1 and df2 degrees of freedom.

For Example 1 of Intraclass Correlation, the 95% confidence interval of ICC(1, 1) is (.434, .927) as described in Figure 4.


Figure 4 – 95% confidence interval for ICC(1,1)

ICC(1, 1) measures the reliability of a single rater. We can also consider the reliability of the mean rating. The intraclass correlation in this case is designated ICC(1, k) and is calculated by the formulas

ICC(1, k) = (MSB − MSW) / MSB

lower = 1 − 1/FL          upper = 1 − 1/FU

where FL and FU are as defined above for ICC(1, 1).

ICC(1, 4) for Example 1 of Intraclass Correlation is therefore .914 with a 95% confidence interval of (.754, .981).
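The class 1 calculations translate directly into code. The sketch below computes ICC(1,1), ICC(1,k) and their confidence intervals from an n × k matrix of ratings (subjects in rows, raters in columns), using the one-way ANOVA mean squares exactly as in Figures 3 and 4; the matrix at the end is a randomly generated placeholder, not the wine-tasting data.

import numpy as np
from scipy.stats import f

def icc_class1(X, alpha=0.05):
    """ICC(1,1), ICC(1,k) and their 1 - alpha confidence intervals.
    X is an n x k array: n subjects (rows) rated by k raters (columns)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape

    # One-way ANOVA with the subjects as groups
    grand_mean = X.mean()
    row_means = X.mean(axis=1)
    msb = k * np.sum((row_means - grand_mean) ** 2) / (n - 1)      # between subjects, df = n - 1
    msw = np.sum((X - row_means[:, None]) ** 2) / (n * (k - 1))    # within subjects, df = n(k - 1)

    icc1 = (msb - msw) / (msb + (k - 1) * msw)      # reliability of a single rater
    icc1k = (msb - msw) / msb                       # reliability of the mean of k raters

    # Confidence limits: bracket F = MSB/MSW with right-tail F critical values
    F = msb / msw
    fl = F / f.ppf(1 - alpha / 2, n - 1, n * (k - 1))
    fu = F * f.ppf(1 - alpha / 2, n * (k - 1), n - 1)
    ci1 = ((fl - 1) / (fl + k - 1), (fu - 1) / (fu + k - 1))
    ci1k = (1 - 1 / fl, 1 - 1 / fu)
    return icc1, ci1, icc1k, ci1k

# Placeholder ratings: 8 subjects rated by 4 raters
rng = np.random.default_rng(0)
X = rng.integers(1, 10, size=(8, 4))
print(icc_class1(X))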

Power and Sample Size

The statistical power of the ICC(1,1) test is

power = P(F > c · Fcrit(α; n−1, n(k−1)))

where F has an F distribution with n−1 and n(k−1) degrees of freedom and

c = (1 − ρ1)(1 + (k − 1)ρ0) / [(1 − ρ0)(1 + (k − 1)ρ1)]

Here ρ0 is the ICC value when the null hypothesis is assumed to be true and ρ1 is the ICC value when an alternative hypothesis is assumed to be true, where ρ1 > ρ0.

Another Example

Example 1: Calculate the power when n = 50, k = 5, α = .05, ρ0 = .2 and ρ1 = .3.

We see that the power is 45.5% as shown in column B of Figure 5. The figure also shows the change in power when the sample size increases to 100 and when the number of raters increases to 10 and 20.

Figure 5 shows the results of the analysis.

Figure 5 – Statistical power for ICC(1,1)

Real Statistics Support

Real Statistics Functions: The following function is provided in the Real Statistics Resource Pack:

ICC_POWER(ρ0, ρ1, n, k, α) = the power of the ICC(1,1) test when ρ0 = the ICC based on the null hypothesis, ρ1 = the expected value of the ICC (alternative hypothesis), n = the sample size, k = the number of raters and α = the significance level (default .05).

ICC_SIZE(ρ0, ρ1, k, pow, α) = the minimum sample size required to achieve power of pow (default .80) where the other arguments are as for ICC_POWER.

For Example 1, ICC_POWER(.2, .3, 50, 5, .05) = .455. The sample size required to achieve 95% power for the same values of ρ0, ρ1, k and α is given by ICC_SIZE(.2, .3, 5, .95) = 244.
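The sketch below mirrors ICC_POWER and ICC_SIZE in Python using the power formula given above. The names icc1_power and icc1_size are illustrative (not part of any library), and the results may differ slightly from the Real Statistics values depending on how the critical values are computed.

from scipy.stats import f

def icc1_power(rho0, rho1, n, k, alpha=0.05):
    """Power of the ICC(1,1) test of H0: rho = rho0 against H1: rho = rho1 (> rho0)."""
    df1, df2 = n - 1, n * (k - 1)
    c = (1 - rho1) * (1 + (k - 1) * rho0) / ((1 - rho0) * (1 + (k - 1) * rho1))
    fcrit = f.ppf(1 - alpha, df1, df2)          # right-tail critical value
    return f.sf(c * fcrit, df1, df2)            # P(F > c * Fcrit) under the alternative

def icc1_size(rho0, rho1, k, power=0.80, alpha=0.05):
    """Smallest sample size n giving at least the requested power (simple search)."""
    n = 2
    while icc1_power(rho0, rho1, n, k, alpha) < power:
        n += 1
    return n

print(icc1_power(0.2, 0.3, 50, 5))     # compare with ICC_POWER(.2, .3, 50, 5, .05) = .455
print(icc1_size(0.2, 0.3, 5, 0.95))    # compare with ICC_SIZE(.2, .3, 5, .95) = 244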

Real Statistics Data Analysis Tool: The Statistical Power and Sample Size data analysis tool can also be used to calculate the power or sample size required for the ICC(1,1) test. See Cronbach’s Alpha Continued for more details.

Class 2 Model

This is the model that is described in Intraclass Correlation. For Example 1 of Intraclass Correlation, we determined that ICC(2, 1) = .728 with a 95% confidence interval of (.434, .927). These are the results for a single rater. The corresponding formulas for the mean rating are as follows:

ICC(2, k) = (MSRow − MSE) / (MSRow + (MSCol − MSE)/n)

If (L, U) is the 1 − α confidence interval for ICC(2, 1), then the corresponding confidence interval for ICC(2, k) is

(kL / (1 + (k − 1)L), kU / (1 + (k − 1)U))

ICC(2, 4) for Example 1 of Intraclass Correlation is therefore .914 with a 95% confidence interval of (.755, .981).
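Once the two-way ANOVA mean squares are available, ICC(2, k) and its confidence interval follow directly from the two formulas above. A minimal Python sketch (the mean squares and the ICC(2, 1) confidence limits passed in are placeholder values, not those of Example 1):

def icc2k(ms_row, ms_col, ms_e, n, k, ci21):
    """ICC(2,k) and its confidence interval.
    ci21 is the (lower, upper) confidence interval already found for ICC(2,1)."""
    icc = (ms_row - ms_e) / (ms_row + (ms_col - ms_e) / n)
    # Spearman-Brown step-up of the ICC(2,1) confidence limits to the mean of k raters
    lower, upper = (k * b / (1 + (k - 1) * b) for b in ci21)
    return icc, (lower, upper)

# Placeholder inputs
print(icc2k(ms_row=11.0, ms_col=1.5, ms_e=1.0, n=8, k=4, ci21=(0.43, 0.93)))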

Class 3 Model

The class 3 model is similar to the class 2 model, except that var(α), the variance of the rater effects, is not used. The intraclass correlation, called ICC(3, 1), is given by the formula

ICC(3, 1) = var(β) / (var(β) + var(ε))

Using the terminology of Two Factor ANOVA without Replication (as for case 2), we see that (MSRow–MSE)/k is an estimate for var(β) and MSE is an estimate for var(ε). A consistent (although biased) estimate for ICC is

ICC(3, 1) = (MSRow − MSE) / (MSRow + (k − 1)·MSE)

with 1 − α confidence interval

lower = (FL − 1) / (FL + k − 1)          upper = (FU − 1) / (FU + k − 1)

where F = MSRow/MSE, FL = F / Fcrit(α/2; n−1, (n−1)(k−1)) and FU = F · Fcrit(α/2; (n−1)(k−1), n−1).

For Example 1 of Intraclass Correlation, we can calculate ICC(3, 1) and its 95% confidence interval as shown in Figure 6 (referring to the worksheet in Figure 2 of Intraclass Correlation).


Figure 6 – Calculation of ICC(3,1) and 95% confidence interval

ICC(3, 4) for Example 1 is therefore .915 with a 95% confidence interval of (.748, .981).
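A Python sketch of the class 3 calculation, again starting from the mean squares of the two-way ANOVA without replication; the values passed in at the end are placeholders, not those of Example 1.

from scipy.stats import f

def icc_class3(ms_row, ms_e, n, k, alpha=0.05):
    """ICC(3,1), ICC(3,k) and their 1 - alpha confidence intervals."""
    icc31 = (ms_row - ms_e) / (ms_row + (k - 1) * ms_e)    # single rater
    icc3k = (ms_row - ms_e) / ms_row                       # mean of k raters
    F = ms_row / ms_e
    fl = F / f.ppf(1 - alpha / 2, n - 1, (n - 1) * (k - 1))
    fu = F * f.ppf(1 - alpha / 2, (n - 1) * (k - 1), n - 1)
    ci31 = ((fl - 1) / (fl + k - 1), (fu - 1) / (fu + k - 1))
    ci3k = (1 - 1 / fl, 1 - 1 / fu)
    return icc31, ci31, icc3k, ci3k

# Placeholder mean squares for 8 subjects and 4 raters
print(icc_class3(ms_row=11.0, ms_e=1.0, n=8, k=4))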

Observation: Class 3 is less commonly used since, by definition, its results don’t generalize to other raters.

Observation: ICC(3, k) = Cronbach’s alpha. For Example 1 of Intraclass Correlation, we see that =CRONALPHA(B5:E12) has value .915, just as we saw above for ICC(3, 4).
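This identity can be checked numerically: Cronbach’s alpha computed from the item and total-score variances equals (MSRow − MSE)/MSRow for any complete ratings matrix. A quick sketch on random data:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 4))                 # 8 subjects rated by 4 raters/items
n, k = X.shape

# Cronbach's alpha from the item variances and the variance of the total score
alpha = k / (k - 1) * (1 - X.var(axis=0, ddof=1).sum() / X.sum(axis=1).var(ddof=1))

# ICC(3,k) from the two-way ANOVA (without replication) mean squares
grand = X.mean()
ms_row = k * ((X.mean(axis=1) - grand) ** 2).sum() / (n - 1)
sse = ((X - X.mean(axis=1, keepdims=True) - X.mean(axis=0) + grand) ** 2).sum()
ms_e = sse / ((n - 1) * (k - 1))
icc3k = (ms_row - ms_e) / ms_row

print(alpha, icc3k)                          # the two values agree up to floating-point rounding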

Examples Workbook

Click here to download the Excel workbook with the examples described on this webpage.

References

Shrout, P. E. and Fleiss, J. L. (1979) Intraclass correlations: uses in assessing rater reliability. Psychological Bulletin, 86, 420-428.
https://www.aliquote.org/cours/2012_biomed/biblio/Shrout1979.pdf

Koo, T. K. and Li, M. Y. (2016) A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15, 155-163.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4913118/

Wu, S., Crespi, C. N., Wong, W. K. (2012) Comparison of methods for estimating the intraclass correlation coefficient for binary responses in cancer prevention cluster randomized trials
https://pubmed.ncbi.nlm.nih.gov/22627076/

Li, L., Zeng, L., Lin, Z., Cavell, M., Lui, H. (2015) Tutorial on use of intraclass correlation coefficients for assessing intertest reliability and its application in functional near-infrared spectroscopy-based brain imaging. Journal of Biomedical Optics, 20(5), 050801.
https://www.spiedigitallibrary.org/journals/journal-of-biomedical-optics/volume-20/issue-05/050801/Tutorial-on-use-of-intraclass-correlation-coefficients-for-assessing-intertest/10.1117/1.JBO.20.5.050801.full?SSO=1

95 thoughts on “Other Intraclass Correlation Models”

  1. Hi, I’ve been tasked with calculating ICCs but have some questions you might be able to help with.

    I have a subset of about 50 subjects and each subject is given a score by a “Primary” rater and a “Reliability” rater. We have 8 total raters, but 1 or 2 raters of those 8 are more often the “Primary” rater and the rest are more often the “Reliability” raters. This is because those specific 1 or 2 raters have more experience and training than the others. The reason we have a subset of 50 subjects with both Primary and Reliability scores is to run interrater-reliability and make sure all of our raters are giving similar scores for the same subjects.

    The score the raters are giving the subjects is a ratio of time the subject spent doing a specific behavior, so the score variable ranges from 0-1.

    What type of ICC Class would this be? We’re only interested in making sure that our specific 8 raters are all reliable amongst each other and that the scores they’re getting are reliable.

    • Hi Katherine,
      You say that “We’re only interested in making sure that (a) our specific 8 raters are all reliable amongst each other and that (b) the scores they’re getting are reliable.”
      Regarding (a), I understand that when measuring reliability you treat the primary raters the same as the reliability raters. Is this what you want?
      Regarding (b), I am not sure what you are trying to measure. Is (b) really the same as (a) or are you seeking some level of consistency of the reliability raters with the “gold standard” of the primary raters?
      Charles

  2. Hello Charles,

    I have a problem with some of my variables when conducting the ICC. Even though the output indicates a perfect ICC, there a no values for the lower and upper bound in the SPSS output. Do you know why this might be the case?

  3. Hi! I would like to calculate inter-rater agreement on scale scores (0=N/a 1=No 2=partial 3=yes) of three rates on 14 questions assessing the quality of 80 research papers. Which is the best model to use according to your opinion and experience?
    I have tried ICC but getting some weird negative results when there is almost complete agreement in some questions.

    • 1. Is 2=partial between No and Yes?
      2. How should I interpret N/A?
      3. Are there 3 raters? Are you trying to get a measure of agreement between these 3 raters?
      Charles

  4. Hello Charles
    Im trying to get the ICC2,4 equation to work

    But I cant seem to get 0.914, nd my CI are 0.755 and 0.981 which are also off

    any thoughts

    thanks

    Chris

    • Hello Chris,
      The example at the beginning of this webpage is for ICC(2,1) and not for ICC(2,4). If you are looking for ICC(2,1) and are not getting the results shown on the webpage, you can send me an Excel file with your data and results, and I will try to figure out what is going wrong. Charles

      • Hello Charles
        ICC class 2 model

        I sorted out the formula for ICC2,4 to 0.914 using MSE, MSrow an MScol
        now its only the lower and upper that are incorrect

        This is the model that is described in Intraclass Correlation. For Example 1 of Intraclass Correlation, we determined that ICC(2, 1) = .728 with a 95% confidence interval of (.408, .950).

        These are dependent on the ICC lower / upper of the 2,1 model
        From my version of Example 1, I had the lower at 0.434 and the app 0.927
        these are different from the values above?
        Thanks again for all your help
        Im also working the Kripp Alpha I’ll have some questions for that soon

        • Hello Chris,
          Thanks for sending me your data and calculations and sorry for not understanding what you were telling me in your comment. The values you calculated on your spreadsheet are correct. I get the same values when I use the Real Statistics ICC function. I don’t know why the values shown on the webpage are incorrect. In any case, thank you very much for finding this error. I have now corrected the webpage.
          I really appreciate your help in improving the accuracy of the Real Statistics website.
          Charles

          • MY PLEASURE

            I hope my spreadsheet wasn’t too confusing
            I take your equations and then enlarge it to meet my needs

  5. Dear Charles,

    I’d like to apply ICC(3, 2) to my data of 100 ratings from two raters.

    1) Am I correct with the assumption that a good to excellent reliability towards an average of the raters’ scores can be concluded if the 95% confidence interval is greater than 0.75?
    I got the following scores with alpha 0,05: ICC 0.97, lower .95, upper .98. This would mean that I can form a reliable average, doesn’t it?

    2) Also, I’d have to first ensure my data fulfills the assumptions for ANOVA, don’t I?
    – Levene’s test is significant
    – Normal distribution w/ Shapiro-Wilk test not significant for rater 1, but significant for rater 2 and significant for the average of both raters
    -> What are my options here? Can I still interpret the values obtained for the ICC? If so, how and with what limitations?

    Thank you!

    • 1. .75 as the lower limit of the confidence interval would usually be viewed as an indication of high reliability, and certainly .95.
      2. Yes, the assumptions are not met, but since the calculated ICC value is so high, it is likely that the reliability is high too. You could use another measure of interrater reliability (e.g. Krippendorff’s alpha or Gwet’s AC2). I don’t believe these measures require normality although the confidence intervals are probably based on normality. The Real Statistics website covers these tests.
      Also, the following paper may be useful.
      https://www.researchgate.net/publication/268370725_Assumptions_behind_Intercoder_Reliability_Indices
      Charles

  6. Dear Charles,

    May I ask you a question regarding the practical difference between Class 1 and Class 3?

    I am from the language testing field. We have a group of raters nad several exam sessions during a year. For one session, each student´s performance is rated by two raters. For individual sessions, there is no overlap in the design, that means that for instance Rater 1 rates always with Rater 2, Rater 3 always with Rater 4, etc. That means two way mixed effect (Class 3). But during the year, when we calculate it for all sessions together and for all raters, the design becomes more complicated, because the pairs of raters are different, randomly assigned (?), for some performances Rater 1 and Rater 3 might form a pair, Rater 2 and Rater 7 might form a pair… That means that the Class 3 design becomes Class 1 (one way random effect), right?

    • Martina,
      Help me understand the scenario better. For every exam, do you have exactly two raters (even though the specific raters will vary)? What sort of rating do the raters give (e.g. values from 0 to 100)? For each session is there one subject or multiple subjects that take the same exam?
      Charles

      • Dear Charles,

        Thanks for your reply. An example:
        In the exam=session in May, 100 students participated and 6 raters were available. Rater 1 and 2 rated students 1-30, rater 3 and 4 rated students 31 – 70 and Rater 5 and 6 rated students 71 – 100. In this analysis, we analyse only the agreement on the total score – a single number. In this case, am I right when I suppose two-way mixed-effects model applies?
        We have several sessions like this, with different students and different combination of pair of raters, but during one year, the sessions are analysed separately.

        But after one year (with e.g. 8 sessions like this), we put everything together an we analyse it again. In this case, there are overlaps. Let´s suppose we have 800 students and 15 raters in total, each rater rated with several other raters in pair. For instance, rater 1 rated students 1 – 3 with rater 2, and also students 101 – 130 with rater 5, etc.
        In this analysis, it is one-way random-effects model, isn´t it?

        • Martina,
          Thanks for the clarification, but what sort of values are used for the ratings? E.g. students 1-30 were rated by rater 1 and rater 2. What sort of ratings do they give? Numbers like 0 through 10 (like in Olympic diving competitions)?
          Charles

          • In the analysis, we use only final score, like in the Olympic Games, so a value between 0 and e.g. 25.
            Thank you.

          • Hello Martina,
            I am not sure any of the classes apply since it doesn’t seem that the raters are chosen at random (class 1 and 2) yet the same raters don’t rate all subjects. I don’t know which approach applies the best. You might be able to use Gwet’s AC2 instead and side-step this issue. See
            Gwet’s AC2
            Perhaps someone else from the Real Statistics community can offer better insight.
            Charles

  7. Dear Charles,
    I have some questions about the ICC.
    The first one is about the conditions of their use. Because an ANOVA is realized, before using the ICC, should we first check the ANOVA hypothesis (normality of residuals…)? Or can we use them on all data ?

    I have repeated data (an observer realize the observation on day 1 and on day 2). I would like to know if the results are stable between days. Can I use ICC3 on these data ?

    I also have another type of repeated data. For a particular measure, two observers realized an observation on the morning and afternoon of day 1 and 2 on the same animals. If I want to test 1) the inter-observers reliability, can I use all the data or should I analyze them separately (moorning day 1, afternoon day 1…) 2) test-retest reliability, can I use for one observer data of morning and afternoon or should I also split the data and do the analyze in two parts ?

    I hope I’m clear about what I said.

    Thank you.

  8. Hi Charles,

    I have a question about ICC in my study.

    There are 344 scans to analyze for one observer, do you think how many scans that need to analyze by another observer to test the ICC?

  9. I’m trying to evaluate inter- (between 2-3 observers) and intraobserver (me doing two measurements 14 days apart) reliability between two different methods of measuring the same object. In this specific case, measuring the same time intervals in the heart using two different ultrasound techniques. In all cases, I’m comparing the average of 3 measurements for each method.

    I’ve been reading the article by Koo & Li (2016, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4913118/), and it seems the ‘Two-way mixed effects, absolute agreement, multiple raters/measurements’ ICC model/type is what best suits my case. However, I’m unsure if this is possible using the Real Statistics Data Analysis Tool, as it seems there doesn’t exist a Shrout and Fleiss (1979) version of this ICC?

    Thanks in advance.

  10. Charles – love this site, it’s such a huge help in my work! Question for you though–is there a way to select a particular ICC form when using the excel add-in? I’m trying to look at interrater consistency (not absolute agreement) across proposal ratings of multiple raters across multiple vendors and multiple dimensions. It would be the ICC (3,k) model. I’ve been using the Corr tab and clicking Intraclass correlation. Separate row for each dimension-vendor combination, and a column for each rater. It is returning ICC type 1 every time.

    • Hello Jen,
      You can change the ICC class on the output. For example, referring to Figure 2, you can change any of the values in range AG5:AG7, and the values in range AG9:AG11 will change automatically. In particular, you can change the value of the class (which is mislabelled as Case) from 1 to 2. I have now updated the webpage to reflect this.
      Charles

  11. Hi Charles,
    thanks for your explanations! They are very helpful! For my research I have some questions.
    1. case: I have measured 32 patients with a medical system for spine analysis, each of them 5 times. Now I want to determine the reproducibility of the system. Is it better to calculate Cronbach’s alpha or the ICC for intrarater reliability? I think the ICC, because it is not a questionnaire case? So I would use ICC(3,1) with k=5 and the 95% confidence interval? Is there a need to calculate significance, like a p-value? How would I do that?

    2. case: I have measured the same patients with a second system for spine analysis. each of them again 5 times. Can I handle those two systems like two raters, and calculate ICC(3,1) with k=2 to determine Interrater-relibility (using the mean of the 5 measurement from system 1 and 2)? Or is it better to use the Bland-Altman method to compare the systems?

    hope you understand my questions, Thanks in advance!
    Julia

    • Hi Julia,
      1. Yes, Cronbach’s alpha doesn’t appear to be the appropriate tool.
      When you say that each patient is measured 5 times, does this mean that the same person/procedure/tool is doing the measurement each time or are there 5 different raters? ICC is based on 5 different raters.
      Why are you choosing ICC(3,1) rather than ICC(2,1)?
      The Shrout and Fleiss research paper didn’t show how to calculate a p-value, but instead shows how to calculate the confidence interval (which explained on this webpage). This is more useful than the p-value in any case.
      2. Probably Bland-Altman is more appropriate. With this tool you are comparing the accuracy of a new method with an existing method.
      Charles

      • Hello Charles,

        Thank you so much for you reply, I just saw it.
        Yes, each Patient was measured 5 times, but NOT by 5 different rates. It was each time the same tool operated by the same person!
        So I am not sure how case 2 and 3 differs when there are no different rates but repeated measurements with the same “rater”.
        So ICC(2,k) would be more appropriate because repeated measurements a more like “k raters are chosen at random from a population of raters”?

        Thanks again!
        Julia

        • And is it then ICC(2,1) or ICC(2,k),
          ICC(2,1) is based on a single measurement, lest say there are 3 rates and each of them made one measurement
          ICC(2,k) is based on a single measurement, lest say there are 3 rates and each of them made 3 measurements and for the ICC-calcultaion there ist the mean of those 3 measurements used, is that correct?
          But since I have 5 repeated-measurements with the same rater I would choose the ICC(2,1), right?

          • Hi Julia, Hi Charles,

            Thank you for your explanations!

            I am having almost the same issue than Julia. I have been taking measurements on patients with a medical tool. Each patient was tested 4 times with the same tool operated by the same person.

            I would like to assess the reliability of the measurements, and until now I was thinking using the ICC 2,1 or ICC 3,1.

            But I guess I should have a look on your article if it worked for Julia.

            My only issue is that I am measuring XY curves and not single measurements.
            So I am wondering what score I should use (std to the mean of the 4 tests for instance) in order to assess the reliability of my measurements ?

            Hope you will understand my question,
            thank you very much
            Antoine

          • Hi Antoine,
            Sorry for the late response.
            When you say that “Each patient was tested 4 times with the same tool operated by the same person.”, are there multiple people doing the testing (each 4 times) or is there only one person doing the measuring? Does anything change when making the 4 measurements? E.g. they are maade one hour apart.
            When you use the word reliability, what do you mean?
            Charles

  12. Hello
    Firstly, great package and finally can run ICC in Excel on multiple sets simultaneously.
    Having played around with this, I think (although very likely wrong) a slight mistake in the presentation of the underlying theory.
    ICC(2,1) and ICC(3,1) are mathematically the same. However, the difference is in how you infer from them. Secondly, the difference between a case 2 and 3 within your package is actually addressing the issue of absolute agreement (which you have labelled as case 2) vs consistency (which you have labelled as case 3).

    So, I think your ICC case 2 is actually a case 2/3 with absolute agreement and your ICC case 3 is actually a case 2/3 with consistency.

    Hope this helps.

    Stuart

    • Stuart,
      ICC(2,1) and ICC(3,1) are not mathematically the same. In fact, if you look at the example given on the webpage, the values for ICC(2,1) and ICC(3,1) are different.
      The version of ICC(2,1) that I use is as described in the original paper by Shrout, P. E. and Fleiss, J. L. (1979) Intraclass correlations: Uses in assessing rater reliability.
      Charles

  13. Thank you Charles – extremely useful tool and webpages! I am still a little confused about which class of ICC to use in my study though. I have hundreds of responses to a survey with dozens of 5-point Likert questions, and want to see if different subgroups of respondents differ significantly in their responses – e.g. Europeans vs. North Americans. After reading these pages, I therefore calculated the average rating of each item for each subgroup, to essentially convert them into two ‘raters’ I could compare, and then used your tool to calculate the ICC(2,2). However, I am not sure if this is really a Class 2 situation, or if it is reasonable to consider each large subgroups as a rater (in this case, k=2). Can you please clarify?

  14. Hi – thanks for this page, it is extremely useful!

    A couple questions:

    1. I’m trying to run a class 2 “agreement” ICC in excel, and I can’t seem to figure out the correct formula because I can’t tell which number designates which type in the excel formula.

    I am inputting: =ICC(data range, 2, __, .05)

    I’m not sure what number to put in the __ because I can’t find documentation saying whether 1 is agreement or consistency. I also get different values for higher numbers in that place (1-5 all return different values, but I have no clue what a 3,4, or 5 might be representing). I hope this makes sense, thanks for the help!

    2. What is the purpose of the alpha value in the formula? I’ve tried several different values between 0 and 1, and it doesn’t seem to affect the returned value.

    Thanks so much!

    • Adam,
      1. The Real Statistics function takes the form ICC(R1, class, type, lab, alpha). Here class takes the values 1,2 or 3 (default 2).
      type takes the value 1 (default) or k where k = the number of raters.
      2. ICC is an array function which returns multiple values. The first of these values (i.e. the only value you see if you only use the Enter key) is not affected by the alpha value. If you treat ICC as an array function, then you will see other values, some influenced by the alpha value. See the following webpage for information about how to use an array function in Excel:
      Array Formulas and Functions
      Charles

      • Thank you so much for the rapid response! I looked into how to use an array formula, and I’m still doing something wrong. If I highlight an area of the spreadsheet as a possible location for those multiple values you mentioned, add my ICC formula, then hit Ctrl+Shift+Enter, the range I’ve highlighted fills up with the value I already have, not a table of values (I assume these “multiple values” are things like the confidence interval etc?)

        Thanks again!

        Adam

        • Adam,
          That is strange. Make sure that the shape of the highlighted range is correct. The range should have two columns if you set lab = True (not two rows).
          If you send me an Excel file with your data and analysis I will try to figure out why you aren’t seeing all the output information (including confidence intervals).
          Charles

  15. Hello Charles,

    I have a problem of testing ICC values by splitting the data in groups. Two raters are measuring each individual. The sample can be divided based on severity. Say I have ICC1 for group 1 and ICC2 for group 2. I can find CI no problem, which are exact. But I need to compare the ICC1 and ICC2 to determine measurements in one group are better agreeable than those of the other group. I attempted a large sample theory but it did not work and it fails badly for small samples. I am looking at n=128 vs. n=9 in one of the comparisons.

    Is there a variance formula for ICC in a non-cluster randomized study?

    Thanks for your reply in advance!

  16. Dr. Zaiontz:

    Thank you very much for your site. In employment testing it is common to conduct oral interviews using three raters/evaluators. With large applicant pools, it may be necessary to assemble multiple panels, and applicants are randomly assigned to only one of the panels. Theoretically, the three interviewers assigned to a panel are randomly assigned from a population of potential interviewers (I say “theoretically” because it’s not easy finding volunteers to sit on these panels!)

    It is common for personnel analysts to use Chronbach’s Alpha as a proxy for ICC Class 3 to estimate the mean reliability for a single three-rater panel. Am I correct that ICC Class 2 is a more appropriate estimate? It would seem to be nearly identical to your original example of the wine tasting (which also sounds like a lot more fun than interviews). Thanks.

  17. Hi Charles,
    Thank you for this post.
    In Figure 5 under the Class 3 Model, the formula for the ICC is displayed in cell K29, referring to another cell H12. However I do not see where the formula (or value) in cell H12 is defined. Can you please clarify what H12 refers to?
    Thanks

  18. Hello Charles,

    Thanks for the detail explanation. I am not sure which ICC class I should use to analyse the following:

    I have 15 MR Images from 15 subjects. I recruited 4 observers to define the organ contours e.g. Lt and Rt eyes from the images. I recorded the volume of what they had drawn and wanted to assess the inter-observer variability. I ran the repeated measure one way ANOVA and calculated the ICC using the formula on the page “Intraclass Correlation”. The ICC is extremely low while the raw data (volume measured) seems to be similar among different observers. Am I doing it the wrong way? Is it correct to consider ICC class 2? And should I consider the ICC with 4 observers, i.e. ICC(2,4) instead of ICC(2,1) which gave me a very low ICC that seems not making sense?

    Many thanks!

    Winky

    • Winky,
      1. Which version of the ICC. ICC(1,.), ICC(2,.) or ICC(3,.) depends on what you are trying to measure, as described on the referenced webpage.
      2. Whether you use ICC(k,1) or ICC(k,4) depends on whether you want to measure the reliability of an individual (the first version) or the average for the group (the second version). Here k = 1, 2 or 3.
      3. Regarding whether you are doing the calculations correctly, if you send me an Excel file with your data and calculations, I will try to see whether the calculations are correct. You can send it to the email address on the Contact Us webpage.
      Charles

  19. Hi Charles,

    Thanks for the great article, it has been highly useful. I had a question regarding 95% CI interpretation in regards to ICCs.

    I executed an ICC on data from 3 judges and came up with the following results:
    ICC 0.752241391
    95% Lower CI -0.497986439
    95% Upper CI 0.946442785

    Obviously that is a fairly high ICC and degree of agreement/reliability. However, I am not sure how to interpret the 95% CIs. I have always been taught that a CI containing 0 or 1 means the test statistic is not significant. Is that the case here as well?

    I have seen previous comments eluding to this same question but could not gleam a definitive answer from them on this topic.

    Thank you for your help,

    Shelby

    • Shelby,
      Since zero is in the 95% CI, the test statistic (namely the ICC value) is not significantly different from 0. Thus even though the calculated ICC value looks pretty high, statistically speaking the population value of the ICC may be zero.
      The fact that your 95% CI is so wide is probably due to the fact that you have very few subjects being evaluated.
      Charles

  20. Hi Charles.
    I have been trying to figure out whether I should use an ICC or Fleiss Kappa analysis. I have multiple raters, who rated 17 groups on 8 different categories. The ratings range low to high. Every rater doesn’t rate all groups, but all groups are rated multiple times on each category. I am not sure how to set up the data or which I should use. I haven’t gotten a handle on how to use the software you’ve provided. Can you provide any additional guidance?

    • Lyn,
      Fleiss’s Kappa is used when the ratings are categorical. ICC is used when the ratings are numerical, even ordinal data which can be viewed as numeric.
      Fleiss’s Kappa does not require that every rater rate every subject, but that all subjects get the same number of ratings. ICC does require that every rater rate every subject.
      It might be that your situation is a partial fit for ICC and a partial fit for Fleiss’s kappa. That may be a problem and so you may need to find a different measurement.
      Charles

  21. Dear Dr. Charles,

    firts, let me congratulate again with you for your site and for the help you provide us!
    I would like to put to your attention this paper that describes an interesting case of Class 2 Inter-Rater and Intra-Rater Reliability assessment, in which there are k raters, each one rating all n subjects of the population, by performing m measurements on every subject.
    The peculiarity of this approach is that the m measurements by the i-th rater on the j-th subject are not averaged together but are considered individually, so allowing more detail in model variances estimation.
    This is the link:
    http://ptjournal.apta.org/content/ptjournal/74/8/777.full.pdf

    I’m not one of the authors so I’m not doing self-promotion, I just hope you and all the community can find it useful!

    Regards
    Piero

  22. Hello Charles,

    I may be wrong, but should the lower CI level for the mean used in the class 2 model example (0.914) be 0.860 as opposed to 0.734?

    Thanks for your time.

  23. First, thanks for this post on ICC calculations using Excel. In the first “Intraclass Correlation” I believe you calculated ICC(2,4) = 0.728, whereas in the “Intraclass Correlation Continued” you refer to this as ICC(2,1) = 0.728, whereas for k=1, the calculation for ICC reverts to a simpler formula (with no dependence on k) which yields a value of 0.914.

    As I understand it the ICC(2,4) applies when one is interested in the reliability of the average group score for the wine, whereas the ICC(2,1) examines the judge to judge reliability. Might it be inferred that ICC(2,1) relates to the reliability of the wine score, but that ICC(2,4) focuses on the reliability of the judges?

    • Robert,

      In the first “Intraclass Correlation” I calculated ICC(2,1) = 0.728, although I was not so clear about the terminology since ICC(2,1) is the usual intraclass correlation.
      Also ICC(2,4) = 0.914.

      I am not sure about what you intend by “reliability of the wine score” and “reliability of the judges”. The ICC is measuring the reliability/agreement of the judges in evaluating wine scores.

      Charles

      • Thanks for your reply. I think my confusion is because the formula for ICC(2,1) is a function of k, whereas ICC(2,k) is not a function of k! I’ve read elsewhere that the inference is that ICC(2,1) estimates the reliability of a single observer rating the wines (even though you use k=4 to obtain this estimate; ICC(2,4) is the reliability estimate using the mean of the 4 judges to assess the wine quality (but is not a function of k=4).

        • Robert,
          That is an interesting observation. Of course, k is indirectly used in the calculation of ICC(2,1) when calculating MSRow and MSE.
          Charles

  24. Hi Charles;
    I got the result of ICC = 0.649, CI 95% (lower bound (-2.366) and upper bound (0.964)).
    How can I report this result please?
    Many thanks.

    • Here is the reporting for different data:

      A high degree of reliability was found between XXX measurements. The average measure ICC was .827 with a 95% confidence interval from .783 to .865 (F(162,972) = 5.775, p < .001).

      I believe that the lower bound for the ICC is -1/(n-1). Thus your CI lower bound of -2.366 seems quite surprising.
      Charles

  25. Thank you Charles.
    Something went wrong with the message I sent and made it impossible to understand.
    Here follows my message again:
    1. What does me not feel at ease with ICC is that after having calculated the ICC value and the 95% confidence interval, I find no way of interpreting rationally the results, which is of course essential for myself and for the jury who is concearned. I read your answer to Sravanti of July 24, 2015: there is no agreement as to what is an acceptable value for ICC, although you have typically seen .7 used. So what to do if ICC<.7 ?
    2. I read about Cohen's interpretation of effect size of an experimental manipulation:
    phi square = theta square/sigma square.
    For phi square = .01 the effect is called "small". For phi square = .0625 the effect is called "medium big". For phi square = .16 the effect is called "big"
    A from phi square derived parameter is eta square = phi square/(phi square + 1) with a range [0;1]. It looks like a correlation coëfficiënt, and its estimator gives an impression of the relative size of the factors and combination of factors in Variance Analysis. So, an effect is "small" for eta square =.010, is "medium big" for eta square =.059, and"big" for eta square = .138.
    Wouldn't there be a way of transposing to ICC ? Thank you once more.
    Erik

    • Erik,
      Sorry, but I don’t know a way of transposing these to ICC. Please note that even the effect size guidelines (small, medium, large) by Cohen and others are really rough and not appropriate for all circumstances.
      Charles

  26. Thank you, Charles for your detailed answer.
    1. What does me not feel at ease with ICC is that after having calculated the ICC value and the 95% confidence interval I find no way of interpreting rationally the results, which is of course essential for myself and the jury who is concerned.
    I read your answer to Sravanti on July 24, 2015: there is no agreement as to what is an acceptable value for ICC, although you have typically seen .7 used. So what to do if ICC < .7?
    2. I read about Cohen’s interpretation of the effect size of an experimental manipulation: phi square = theta square/sigma square. For phi square = .01 the effect is called “small”, for phi square = .0625 “medium big”, and for phi square = .16 “big”.
    A from phi square derived parameter is eta square = phi square/(phi square +1) with a range [0;1]. It looks like a correlation coëfficiënt and its estimator gives an impression of the relative size of the effects of the factors and combination of factors in a Variance Analysis. So an effect is “small” for eta square = .010, “medium big” for eta square = .059, and “big” for eta square = .138
    Wouldn’t there be a way of transposing to ICC?

    Thank you once more
    Erik

    • Erik,
      I agree with you. These measurements seem most valuable when they show a problem or when they are very high. Middle values seem less useful.
      Charles

  27. Thank you Charles. There is no need to worry about your delayed answer. I hope you are fine and I needed anyway more time to understand…
    I struggled further through the matter and realize that my questions in the beginning were not always very adequate.
    The state of affairs for me is now as follows:
    1. After having defined the appropriate statistical model one calculates an ICC which is an estimate of the population mean value rho of a very great number of samples take in identic conditions.
    2. How to interprete this ICC value? It remains an open question to me since
    there is no agreement about.
    Wouldn’t it be logic to only consider lower limit of the confidence interval?
    3. I believe that even the confidence interval, say 95%, relative to that estimate can be questionned. Since that interval is calculated round a one time estimate of rho it can either contain rho or not. Are those chances equal to 95% if you cannot repeat an experience a great number of times?

    I begin to wonder if ICC calculations are suited for Case 2 wine tastings.

    The example I gave you comes from following book (pg 244-247):
    Wines Their sensory Evaluation by
    Maynard A. Amerine and Edward B. Roessler
    1976, 1983 by W.H. Freeman and Company

    This is the data table:
    Judge W1 W2 W3 W4 W5
    1 8 4 2 5 4
    2 6 4 5 6 5
    3 6.5 3 8.5 7 5.5
    4 3 4 5 6 7
    5 8 7 5.5 8 6.5
    6 3 3.5 7 9 8
    7 7.5 5 4.5 5.5 8.5

    Erik

    • Erik,

      How to interpret the ICC value: Generally this statistic is used as a measurement of the agreement between raters. I still don’t completely understand the problem here.

      Confidence Interval. I agree that only the lower bound may be of interest.
      Regarding the issue of basing the confidence interval only on one sample, this is the usual situation not just for ICC but for confidence intervals of all sorts of statistics (t tests, regression, etc.). Our goal is to measure our confidence of the value of a specific population parameter based on the corresponding statistic from one sample. Since we only have one sample we can’t be certain of the value of the population parameter, but the larger the sample the narrower the confidence interval, and so the more confident we are in its value.

      Case Wine Tasting: Why do you believe that the ICC calculations are not suited to this example?

      Charles

  28. I forgot to write in the text of my first question that I referred to example 1 in your Intraclass Coorelation chapter (Four colums for the judges and eight rows for the wines).
    Erik

  29. Charles,
    I read the article “SF” and I wonder if what I understood is correct:
    Say Rho(2,1) is the population mean of all ICC’s between the single scores related to a given situation. When the null Hypothesis Ho says rho is equel t0 zero, this is equivalent to saying: the mean square expectation between wines = 0, the mean square expectation between judges and/or the residual mean square expectation being different from zero.
    ICC(2,1) is an estimate of rho(2,1) and when that ICC-value lies in the 95% confidence area we conclude at the 95% level of confidence that the judges have been consistent.
    Question: Consistancy of a rater does not necessarily mean reliability nor give birth to agreement between raters. What does the ICC exactly assess?

    I met an example (out of a book) that puzzles me a lot:
    Seven judges rated five wines and this is the result two-way ANOVA showed:
    – F wines < F critic .5: we accept Ho of no significant differences between the means of the wine scores.
    – F judges < F critic .5: we accept Ho of no significant differences between the means of the judges' scores. The judges are consistent in their scoring.
    – ICC(2,1) value equal to .118 (!) and confidence interval -.06 and .68. The ICC thus lies in the non significant zone. There too, can we simple conclude the judges are consistant?
    Question: Looking at the correlations of the judges' scores this is so unlikely. How to interprete what figures show exactly?

    Please give me your view on those two items?
    And again: thank you very much
    Erik

    • Erik,

      Sorry for the delayed response.

      As the Shrout and Fleiss article says ICC measures reliability. On page 425, they make a distinction between consistency of ratings using ICC(3,1) and agreement of ratings using ICC(2,1).

      An ICC(2,1) of .118 indicates a low level of agreement between the judges. Since zero lies in the confidence interval of (-.06, .68), we cannot reject the null hypothesis that there is no agreement between the raters. I can’t comment on whether this is unlikely since I don’t have access to the data.

      Charles

    • If the lower bound is -0.0, as in your example, then this likely means a very small negative number. In any case, when a statistic takes a range of value of say 0 to 1 and the lower confidence value is negative, it should be viewed as zero.
      Charles

  30. Hi Charles!
    this is the result of a ICC test. I will appreciate it if you let me know if the Confidence interval is narrow enough to come to any conclusion.

    ICC = .87(CI 95%=.466-.997),F(2,34) = 7.7, p < .005

    best regards
    Behrouz

    • Obviously the narrower the better, but I would say that you have a pretty good level of confidence that there is agreement among the raters. (As always, there is some risk that this is less so.)
      Charles

  31. Hello Charles,
    My objective is to compute intraclass coefficient. I want to assess the degree of agreement between raters on the items of a new proposed tool. The tool is a 5 point Likert rated tool. There are two set of rater, Rater group 1 has psychologists (2 of them) and Rater group 2 has Educators (3 of them). It is a fully crossed model.

    My question is :
    1) How do I interpret these results? I obtained an intraclass coefficient by use of SPSS 20.For the psychologist group, I obtained the coefficient of o.54 , 95% confidence interval, the lower bound .197 and upper bound.754. These are average measures. and 0.54 for the educator group coefficient of o.54 , 95% confidence interval, the lower bound .257 and upper bound.729

    2) I am interested in the degree of agreement for each of the items of the tool. Isn’t this coefficient value an indicator of the overall scale?

    • Hello,

      1. There isn’t agreement as to what is an acceptable value for ICC, although I have typically seen .7 used. With such a small sample, and therefore such large confidence intervals (.197, .754) and (.257, .729), it is pretty hard to derive a lot of meaning from the results, except that they seem significantly different from zero.

      2. For one item, I would simply use the variance, but again with such a small sample, this is not going to tell you very much.

      Charles

  32. Hi! Charles,
    How we calculate ICC with one way anova for teams with different no. of people in each team. From example cited above we can use it only when there are same no. of people in each team. Kindly help.
    Thanks,
    Sapnaa

  33. In figure 3 above, in cell K37 and K37 there is a reference to cell I28. As far as I can see this cell is empty. Can you inform me what should be the right reference?
    I asume it has to refer to F, so the reference should be or I35 or K23. Correct?

    • Hi Carel,

      The correct formulas in cells I36, I37, I38 and I39 are:

      =I35/FINV(I31/2,I33,I34)
      =I35*FINV(I31/2,I34,I33)
      =(I36-1)/(I36+I32-1)
      =(I37-1)/(I37+I32-1)

      I will correct the references made in cells K36, K37, K38 and K39.

      Thanks for catching this error.

      Charles

