Intraclass Correlation

The intraclass correlation (ICC) assesses the reliability of ratings by comparing the variability of different ratings of the same subject to the total variation across all ratings and all subjects. The ratings are quantitative.

Examples

We illustrate the ICC technique applied to Likert scales via the following example.

Example 1: Four judges assess 8 types of wine for quality by assigning a score from 0 to 9 with the ratings given in Figure 1. Each judge tests each wine once. We would like to determine whether the wines can be judged reliably by different judges.

Figure 1 – Data for Example 1

We can see from the data that there is a fair amount of consistency between the ratings of the different judges with a few noticeable differences.

We will assume that the four judges are a random sample from a larger population of judges and use Excel’s Anova: Two-Factor Without Replication data analysis tool (i.e. a repeated measures analysis). Figure 2 shows the results of this analysis.

Figure 2 – Calculation of Intraclass Correlation

Here the rows relate to the Between Subjects factor (the wines) and the columns relate to the Judges (who are the raters). The error term is Judge × Subjects. We have added row 29, which contains the calculation of the ICC (in cell I29) using the following formula, in which J23, J24 and J25 hold MSRow, MSCol and MSE, while I23 and I24 hold dfRow = n – 1 and dfCol = k – 1 from the ANOVA table:

=(J23-J25)/(J23+I24*J25+(I24+1)*(J24-J25)/(I23+1))

ICC based on Variability

We will now explain this formula. From Definition 1 in Two Factor ANOVA without Replication we have the model

x_ij = μ + β_i + α_j + ε_ij

where μ is the overall mean, β_i is the effect of the ith subject (row), α_j is the effect of the jth rater (column) and ε_ij is the residual (error) term.

The intraclass correlation is then

ICC = var(β) / (var(β) + var(α) + var(ε))

Thus there are three types of variability:

var(β): variability due to differences in the subjects (i.e. the wines).

var(ε): variability due to differences in the evaluations of the subjects by the judges (e.g. judge B really likes wine 3, while judge C finds it to be very bad)

var(α): variability due to differences in the rating levels/scale used by the judges (e.g. judges B and C both find wine 1 to be the worst, but while judge C assigns wine 1 a Likert rating of 0, judge B gives it a bit higher rating with a 2).

We can estimate each of these as follows:

var(β) = (MSRow – MSE)/k = (26.89 – 2.28)/4 = 6.15

var(ε) = MSE = 2.28

var(α) = (MSCol – MSE)/n = (2.45 – 2.28)/8 = 0.02

where n = number of rows (i.e. subjects = wines for Example 1) and k = number of columns (i.e. raters = judges). We, therefore, obtain a consistent (although biased) estimate of the intraclass correlation by

ICC = 6.15/(6.15 + 2.28 + 0.02) = 0.728

This can also be expressed by

ICC = (MSRow – MSE) / (MSRow + (k – 1)MSE + k(MSCol – MSE)/n)

The high value of ICC shows there is a fair degree of agreement between the judges.
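
For readers who want to check the arithmetic outside of Excel, the following is a minimal Python sketch (not part of the Real Statistics Resource Pack; the small ratings matrix is invented for illustration) that performs the same two-factor ANOVA decomposition and then applies the mean-square formula above.

import numpy as np

def icc2_1(ratings):
    """ICC(2,1) for an n x k matrix (rows = subjects, columns = raters)."""
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # subject (row) means
    col_means = ratings.mean(axis=0)   # rater (column) means

    # Two-factor ANOVA without replication
    ss_row = k * ((row_means - grand) ** 2).sum()
    ss_col = n * ((col_means - grand) ** 2).sum()
    ss_err = ((ratings - grand) ** 2).sum() - ss_row - ss_col

    ms_row = ss_row / (n - 1)               # MSRow
    ms_col = ss_col / (k - 1)               # MSCol
    ms_err = ss_err / ((n - 1) * (k - 1))   # MSE

    # ICC = (MSRow - MSE) / (MSRow + (k-1)*MSE + k*(MSCol - MSE)/n)
    return (ms_row - ms_err) / (ms_row + (k - 1) * ms_err + k * (ms_col - ms_err) / n)

# Invented 3-wine x 4-judge matrix (not the Figure 1 data)
scores = np.array([[7.0, 8, 6, 7],
                   [3, 2, 4, 3],
                   [5, 6, 5, 4]])
print(round(icc2_1(scores), 3))

Plugging the Example 1 mean squares (MSRow = 26.89, MSCol = 2.45, MSE = 2.28 with n = 8 and k = 4) into the same formula returns approximately 0.728, agreeing with cell I29 of Figure 2.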

Real Statistics Support

Real Statistics Function: The Real Statistics Resource Pack contains the following function:

ICC(R1) = intraclass correlation coefficient of R1 where R1 is formatted as in the data range B5:E12 of Figure 1.

For Example 1, ICC(B5:E12) = .728. This function is actually an array function that provides additional capabilities, as described in Intraclass Correlation Continued.

Real Statistics Data Analysis Tool: The Interrater Reliability data analysis tool supplied in the Real Statistics Resource Pack can also be used to calculate the ICC. We show how to use this tool in Intraclass Correlation Continued.

Observation: There are a number of other measures of ICC in use. We have presented the most useful of these measures above. Click here for information about these other versions of ICC.

Confidence Interval

We now show how to calculate an approximate confidence interval for the ICC. We start by defining the following

a = k·ICC / (n(1 – ICC))

b = 1 + k·ICC(n – 1) / (n(1 – ICC))

v = (a·MSCol + b·MSE)² / [(a·MSCol)²/(k – 1) + (b·MSE)²/((n – 1)(k – 1))]

F1 = FINV(α/2, n – 1, v)

F2 = FINV(α/2, v, n – 1)

Note that for newer versions of Excel, the FINV function should be replaced by F.INV.RT.

From these we calculate the lower and upper bounds of the confidence interval as follows:

lower limit = n(MSRow – F1·MSE) / (F1[k·MSCol + (kn – k – n)MSE] + n·MSRow)

upper limit = n(F2·MSRow – MSE) / (k·MSCol + (kn – k – n)MSE + n·F2·MSRow)

Using these formulas we calculate the 95% confidence interval for ICC for the data in Example 1 to be (.434, .927) as shown in Figure 3.

Figure 3 – 95% confidence interval for ICC

As stated above, for newer versions of Excel, the FINV function in cells L39 and L40 should be replaced by F.INV.RT.
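
As a cross-check of these formulas, here is a short Python sketch (again only an illustration, not the Real Statistics implementation) that reproduces the calculation of Figure 3 from the mean squares in Figure 2, using scipy.stats.f.isf in place of Excel’s FINV / F.INV.RT.

from scipy.stats import f

ms_row, ms_col, ms_err = 26.89, 2.45, 2.28   # mean squares from Figure 2
n, k, alpha = 8, 4, 0.05                     # subjects, raters, 1 - confidence level

icc = (ms_row - ms_err) / (ms_row + (k - 1) * ms_err + k * (ms_col - ms_err) / n)

a = k * icc / (n * (1 - icc))
b = 1 + k * icc * (n - 1) / (n * (1 - icc))
v = (a * ms_col + b * ms_err) ** 2 / (
    (a * ms_col) ** 2 / (k - 1) + (b * ms_err) ** 2 / ((n - 1) * (k - 1)))

f1 = f.isf(alpha / 2, n - 1, v)   # = FINV(alpha/2, n-1, v)
f2 = f.isf(alpha / 2, v, n - 1)   # = FINV(alpha/2, v, n-1)

lower = n * (ms_row - f1 * ms_err) / (
    f1 * (k * ms_col + (k * n - k - n) * ms_err) + n * ms_row)
upper = n * (f2 * ms_row - ms_err) / (
    k * ms_col + (k * n - k - n) * ms_err + n * f2 * ms_row)

print(round(icc, 3), (round(lower, 3), round(upper, 3)))   # roughly 0.728 and (.43, .93)

Because the mean squares above are rounded to two decimal places, the printed interval agrees with Figure 3 only approximately.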

Observations

The overall level of the ratings given by a judge reflects the severity or leniency of that judge. The raters can also be the questions in a test; in this case, the rating level corresponds to the difficulty or leniency of the question.

The measure of ICC is dependent on the homogeneity of the population of subjects being measured. For example, if the raters are measuring the level of violence in the general population, the value of var(β) may be high compared to var(α) and var(ε), thus making ICC high. If instead, the raters are measuring levels of violence in a population of inmates from maximum-security prisons, the value of var(β) may be low compared to var(α) and var(ε), thus making ICC low.

There isn’t universal agreement as to what constitutes an acceptable ICC value. This can also depend on the field in which it is used. E.g. in the medical field, an ICC less than .40 is considered to be poor, between .40 and .60 fair/moderate, between .60 and .75 good, and .75 and above excellent. Others place these boundaries at .50, .75 and .90.

Examples Workbook

Click here to download the Excel workbook with the examples described on this webpage.

References

Shrout, P. E. and Fleiss, J. L. (1979) Intraclass correlations: uses in assessing rater reliability. Psychological Bulletin, 86, 420–428.
https://www.aliquote.org/cours/2012_biomed/biblio/Shrout1979.pdf

Koo, T. K. and Li, M. Y. (2016) A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15, 155–163.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4913118/

Wu, S., Crespi, C. M. and Wong, W. K. (2012) Comparison of methods for estimating the intraclass correlation coefficient for binary responses in cancer prevention cluster randomized trials. Contemporary Clinical Trials.
https://pubmed.ncbi.nlm.nih.gov/22627076/

Li, L., Zeng, L., Lin, Z., Cavell, M. and Lui, H. (2015) Tutorial on use of intraclass correlation coefficients for assessing intertest reliability and its application in functional near-infrared spectroscopy–based brain imaging. Journal of Biomedical Optics, 20(5), 050801.
https://www.spiedigitallibrary.org/journals/journal-of-biomedical-optics/volume-20/issue-05/050801/Tutorial-on-use-of-intraclass-correlation-coefficients-for-assessing-intertest/10.1117/1.JBO.20.5.050801.full?SSO=1

189 thoughts on “Intraclass Correlation”

  1. Hi Charles
    Thank you for the clear review of ICC. You defined the confidence interval above; if we have a wide confidence interval, what does that tell us about the data?

  2. Hello Charles,
    I am assessing test-retest reliability of a self-administered questionnaire where items are rated with a Likert-type scale (1-5). But I am assessing test-retest reliability of the total score of the questionnaire, which is obtained by the sum of the items. However, my data are non-normally distributed (n=35), and thus I should avoid the ICC. Can I use Gwet’s AC2 in place of the ICC to avoid the normality assumption in this continuous variable? Could I use the central limit theorem to justify the use of the ICC? Are there other options? What would be the most appropriate approach?

  3. Hi,
    Thank you for your useful explanations. It has helped me a lot.
    I’m trying to use ICC in order to check the reliability of ratings before aggregating them and reporting a single value. I’d be more than happy if you could help me out in the following scenario:

    A team of experts (variable size S, 2<S<10) are given S copies of the same 6-item survey with Likert-scaled questions. They are given the task to evaluate each of the S people (including themselves) by giving scores to each question in the survey for each of the S people. A score given to a question (or, statement) is an integer between 1 and 5. Each evaluation is then scored by calculating the average of each question's score. At this step, we'll be able to have a dataset like the wine and judge example, which will be an S*S matrix omitting the titles, roughly like the following:

    James John Jennifer Sara
    James 1.33 2 2.25 4
    John 3.33 3.66 4.2 2.85
    Jennifer 1 2.6 2.44 2.5
    Sara 2.33 3 3.25 5

    We'd want to check the interrater reliability, then if reliable, calculate the average of the scores given to a member of the team like the following:

    James: 2.395
    John: 3.51
    Jennifer: 2.135
    Sara: 3.395

    Then, we'd want to aggregate the score at the team level.

    I have the big picture in my mind but I'm still not sure how to do it and whether I'm thinking the right way. I'm pretty much a novice in statistics.

    • What are you trying to accomplish? Are you trying to determine the reliability of the questionnaire? Are you trying to determine whether there is agreement among the 4 people?

      • hi,
        thank you for your response. yes, I’m trying to accomplish the latter, then mark those records which indicate a disagreement within raters as noise.

        What do you recommend?

        • Hi. You can use any of a number of measurements, including the ICC (as well as Gwet’s AC2 and others) to determine the agreement between the raters. I don’t know how well these measurements handle the case where raters rate themselves.
          I don’t know how to determine records that indicate disagreement. One possible approach is to remove one or more records and see how the interrater measurement changes.
          Charles

  4. Hi,
    I am looking at 3 raters, who are rating how long it took for a patient to fall asleep. Each patient has 5 naps, and there are 13 patients (around 60 naps in total).
    So each marker will mark each patient nap (all 60) and I want to look at interrater agreement. As far as I’m aware, I cannot use the intraclass correlation coefficient since there are repeated measures from the same patient. Do you think a repeated measures ANOVA would be better to use in this instance?

    Thanks

  5. Hello!

    Beginner question:
    I would like to compare two estimations of tumor cell content in ~350 samples: one by pathology and one by bioinformatic measurements after sequencing.
    Can I use ICC for this, or should I consider another method, e.g. the concordance correlation coefficient?

    Thank you very much in advance!

    Best,
    Luca

  6. Dear Charles,

    thank you for the nice tutorial about ICC.
    I have a question about a scenario with more than one measurement per observer on the same subject. Example: 8 subjects (wines), 4 observers (judges), but each judge tests each wine three times.

    Thank you!

  7. Hi Charles,
    I found the explanations here really useful thank you.

    With the interpretation, do you have a reference for the comment:
    “E.g. in the medical field, an ICC less than .40 is considered to be poor, between .40 and .50 fair/moderate, between .60 and .75 good and ICC = .75 and above excellent. Others will change these boundaries to .50, .75 and .90.”

    The ICC I calculate is 0.67, which seems pretty reasonable to me but I haven’t found a reference to quote.
    thanks
    Damian

  8. Thank you for all the information! I have a question about absolute agreement and consistency: which of these is the output of the ICC function? I am analyzing the work of 66 students (scores ranging from 0-60) graded by 4 different raters (each rater graded all students), and so I chose class 2 for the ICC class and for type = 4. I am just not sure whether the ICC value is the one for absolute agreement or for consistency. Do you mind explaining this to me or guiding me to useful resources? Thank you very much.

  9. Apologies, it was my VBA code.

    Thank you so much for this resource!!! You are very smart to understand all these bloody stats!~

  10. Thank you so much for taking the time to educate the internet 😉

    May I add a question, too? We want to determine interrater reliability between 2 raters.
    They each scored 20 participants, using 5 items that together form a scale (happiness), and one item (quality) which is evaluated differently.

    Is it correct that we would determine two ICCs, one using the 20 total scores on happiness as input, and one using the 20 (item) scores on quality as input?

    Are 20 observations enough?

    • for clarification:
      one ICC using the 2×20 total scores on happiness as input,
      and the other using the 2×20 item scores on quality as input.

    • Hello Hanna,
      1. You can use this approach.
      2. Whether or not 20 observations is enough depends on what you mean by “enough”. You can calculate the ICC with even fewer than 20 observations. How you interpret the ICC value does depend on the confidence interval of the ICC value, which in turn depends on the number of observations.
      Charles

  11. Thank you for such a thorough website and for the Resource Pack! I would like to calculate the interrater reliability of my students’ ratings on a measure (made after watching a video) against the gold standard ratings for that video. The ratings are on a scale from 0 through 4. Can I still use the Intraclass Correlation function in your Resource Pack?

    I tried it out on some data that I have, and while it appears to work on videos in which the ratings were mostly between 1 – 4, on a video in which the ratings were predominantly 0’s, with a few 1’s, I get a negative ICC.

    Alternatively, it seems to work (again) for the video that has ratings between 1 – 4, in which my student obtained a 0.93 ICC after missing 4 items, but on the video with predominantly 0’s and 1’s, she obtained an ICC of 0.65 after missing only 1 item!

    Thank you!

  12. Hello Charles
    Thank you very much for this incredibly rich resource.
    I am analyzing data where 4 scorers, all scored the same 16 trials. I want to evaluate the homogeneity of scoring.
    In addition to calculating the ICC for all 4 scorers, I also computed pairwise ICCs for each scorer pair (1&2, 1&3, 1&4, 2&3, etc.) and noticed that a single outlier (a large difference in score between scorers) impacts the ICC a lot (0.82 vs. 1.00 when excluded). (=ICC(data,2,1,TRUE,0.05))
    Is this normal or am I doing something incorrectly?
    Do you recommend some other type of statistical test for this type of data?
    Thank you.

    • It is difficult for me to say without seeing the data.
      If this one “outlier” is excluded I see that ICC = 1.0, which means there is perfect agreement. Is this so?
      Charles

  13. Hi, thank you for the post. Why do you say that the ICC you estimated is biased? You said that it is consistent, but biased. How can we estimate an unbiased coefficient?

    • Luis,
      I believe that I am reflecting what was written in the original research paper (although I should really check this). The ICC value is quite useful which probably means that the bias is small.
      Charles

    • Luiz,
      I believe that this is what was stated in the original research paper (although I should have checked this). In any case, the ICC estimate is quite good. I don’t know one that is less biased.
      Charles

  14. Hello
    Is icc appropriate to assess repeatability and reproducibility between two raters each reading two measurements?

    Thanks

  15. Hello Charles, I am Axel. I am conducting a research project with colleagues whose goal is to create a manual for coding social media publications by means of a bibliographic review. After making a selection from 743 articles using a selection methodology in which my colleague and I collected articles separately, we want to use the ICC to compare our selections and see whether there is a correlation between our separate selections. We have 3 selection indicators S1, S2, S3 and two rejection indicators R1 and R2. I wanted to assign a number to each code in order to compute the ICC, but I am stuck, which is why I am turning to you.
    I mainly wanted to use class 1 as indicated in your document, but I am not sure of myself.
    Class 2: the k raters are chosen at random from a population of raters and these k raters rate all n subjects.
    I need your help, please.

    • Hello Charles, I am Axel. I am conducting a research project with colleagues which aims to create a manual for coding publications on social networks through a bibliographic review. After making a selection from 743 articles using a selection methodology in which my colleague and I collected articles separately, we want to use the ICC to compare our selections and see whether there is a correlation between our separate selections. We have 3 selection indicators S1, S2, S3 and two rejection indicators R1 and R2. I wanted to assign each code a number in order to find the ICC, but I am stuck, which is why I am coming to you.
      I mainly wanted to use class 1 as indicated in your document, but I am not sure of myself.
      Class 2: the k assessors are chosen at random from a population of assessors and these k assessors assess all n subjects.
      I need your help, please.

      • Hello Axel,
        Please confirm whether the following is true:
        You have 743 articles that are being assessed
        You have two raters (you and your colleague)
        You have 3 selection indicators and 2 rejection indicators. Do you choose one of S1, S2, S3 if the article is selected and one of R1, R2 if the article is rejected?
        Charles

        • We choose just one indicator per article, but my chosen indicator could be different from my colleague’s indicator; because of this we want to use the ICC to see whether our choices have a correlation

      • Hello Charles, yes, we are working on 743 items that we have to individually select or reject using a selection manual. The first step is to validate this manual for the selection of articles. To validate it we want to see whether there is a correlation between our two article analyses. The analysis is done individually by me and my colleague. Not being very expert in ICC, we want to use it to validate the consistency of our two article analyses. Yes, there are 5 stages, including 3 for selection and 2 for rejection: selection S1, S2, S3, and rejection R1 and R2.

        • Hello Axel,
          I am trying to determine whether for each of the 743 items you assign one rating from the five categories S1, S2, S3, R1 and R2 (and similarly for your colleague).
          Item You Colleague
          1 S1 S2
          2 R1 R1
          3 S2 S2
          4 S3 S1
          etc.
          If it is a little more complicated than this, please explain so that I can determine whether ICC is appropriate.
          Charles

          • Yes Charles, my colleague and I made different choices, such as
            1 S1 R1
            2 R2 S3
            3 S3 R1
            4 R2 S1 …
            But in our method of selecting articles: S1, you select the article based on keywords in the title; S2, you select based on the concept of the title; R1, you reject if the title does not address our research theme; S3, you select based on the concept of the abstract; and R2, you reject after reading the abstract.

  16. I have a data set with 2 raters, each rating 6 different subjects/parameters (measurements), with each of the 2 raters rating each of the 6 parameters three separate (independent) times.
    How do I set up the Excel file for this analysis and do I use ANOVA with replication to eventually calculate the ICC?

  17. Thanks for this! I would just like to ask: how do I calculate the intraclass correlation coefficient if I have 6 raters and they have to rate all 102 items twice, once for content validity and once for structure validity? (one rater = 2 ratings). Thank you very much for your help!

  18. Hello,

    This is very insightful. Please, is the ICC parametric or non-parametric? How does the two-way mixed ICC (where the judges are the only ones of interest) differ from Pearson’s correlation?

    Thank you

  19. Hello Dr. Zaiontz,
    How do you deal with missing data in these calculations? If one of my raters did not input a value for a continuous variable, then that datapoint is simply excluded from the analyses by SPSS (obviously, since there is no numeric value). However, this is clearly leading to an overestimation of my ICC value because it is simply excluding the discrepancy between coders. Is there any way to handle this with ICC calculations? I also have this issue for categorical variables, for which I use kappa to assess reliability, but there I can input “missing” as a category; please correct me if I am wrong. Thank you!

  20. Hello,
    can I use ICC to assess the agreement of 6 different immunochemistry methods on 122 patients for the measurement of one quantitative variable? The values themselves are not directly comparable because of the different antibodies used; however, the final outcome (normal or abnormal) should be the same.

    Thank you!

    • Hello Vanja,
      You can use ICC treating the 6 different immunochemistry methods as the raters provided the ratings are numeric. I don’t fully understand your last sentence though.
      Charles

