The intraclass correlation (ICC) assesses the reliability of ratings by comparing the variability of different ratings of the same subject to the total variation across all ratings and all subjects. The ratings are quantitative.
Topics
- Basic concepts of the ICC(2,1) model (this webpage)
- Other ICC models
- Using ICC for comparisons against a gold standard
- Using ICC for test/retest reliability
Examples
We illustrate the ICC technique applied to Likert scales via the following example.
Example 1: Four judges assess 8 types of wine for quality by assigning a score from 0 to 9 with the ratings given in Figure 1. Each judge tests each wine once. We would like to determine whether the wines can be judged reliably by different judges.
Figure 1 – Data for Example 1
We can see from the data that there is a fair amount of consistency between the ratings of the different judges with a few noticeable differences.
We will assume that the four judges are a random sample from a larger population of judges and use Excel’s Anova: Two-Factor Without Replication data analysis tool (i.e. a repeated measures analysis). Figure 2 shows the results of this analysis.
Figure 2 – Calculation of Intraclass Correlation
Here the rows relate to the Between Subjects (Wines) and the columns relate to the Judges (who are the raters). The error term is Judge × Subjects. We have added row 29 which contains the calculation of the ICC (in cell I29) using the formula
=(J23-J25)/(J23+I24*J25+(I24+1)*(J24-J25)/(I23+1))
ICC based on Variability
We will now explain this formula. From Definition 1 in Two Factor ANOVA without Replication we have the model

x_ij = μ + β_i + α_j + ε_ij

where β_i is the effect of the ith subject (row), α_j is the effect of the jth rater (column) and ε_ij is the residual term. The intraclass correlation is then

ICC = var(β) / (var(α) + var(β) + var(ε))
Thus there are three types of variability:
var(β): variability due to differences in the subjects (i.e. the wines).
var(ε): variability due to differences in the evaluations of the subjects by the judges (e.g. judge B really likes wine 3, while judge C finds it to be very bad)
var(α): variability due to differences in the rating levels/scale used by the judges (e.g. judges B and C both find wine 1 to be the worst, but while judge C assigns wine 1 a Likert rating of 0, judge B gives it a bit higher rating with a 2).
We can estimate each of these as follows:
var(β) = (MSRow – MSE)/k = (26.89 – 2.28)/4 = 6.15
var(ε) = MSE = 2.28
var(α) = (MSCol – MSE)/n = (2.45 – 2.28)/8 = 0.02
where n = number of rows (i.e. subjects = wines for Example 1) and k = number of columns (i.e. raters = judges). We, therefore, obtain a consistent (although biased) estimate of the intraclass correlation by
ICC = 6.15/(6.15 + 2.28 + 0.02) = 0.728
This can also be expressed in terms of the mean squares by

ICC = (MSRow – MSE) / (MSRow + (k – 1)·MSE + k(MSCol – MSE)/n)

which is the formula used in cell I29 of Figure 2.
This relatively high value of ICC shows that there is a good level of agreement between the judges.
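To make the computation above concrete, here is a minimal Python sketch of the same calculations (assuming NumPy is installed). The ratings matrix is illustrative only; it is not the actual data of Figure 1, so the numbers it produces will differ from those of Example 1.

```python
import numpy as np

# Illustrative ratings: rows = subjects (wines), columns = raters (judges).
# Not the actual data of Figure 1 -- substitute your own n x k table.
ratings = np.array([
    [1, 2, 0, 2],
    [5, 6, 4, 7],
    [8, 9, 6, 9],
    [4, 5, 3, 5],
    [6, 8, 5, 7],
    [2, 3, 1, 4],
    [7, 8, 7, 8],
    [3, 4, 2, 4],
], dtype=float)

n, k = ratings.shape                     # n subjects, k raters
grand_mean = ratings.mean()

# Mean squares from a two-factor ANOVA without replication
ss_rows = k * ((ratings.mean(axis=1) - grand_mean) ** 2).sum()
ss_cols = n * ((ratings.mean(axis=0) - grand_mean) ** 2).sum()
ss_err = ((ratings - grand_mean) ** 2).sum() - ss_rows - ss_cols

ms_rows = ss_rows / (n - 1)              # MSRow (between subjects)
ms_cols = ss_cols / (k - 1)              # MSCol (between raters)
ms_err = ss_err / ((n - 1) * (k - 1))    # MSE

# Variance-component estimates
var_beta = (ms_rows - ms_err) / k        # subjects (wines)
var_alpha = (ms_cols - ms_err) / n       # raters (judges)
var_eps = ms_err                         # residual

icc_from_vars = var_beta / (var_alpha + var_beta + var_eps)

# Equivalent mean-square form (the formula used in cell I29)
icc_from_ms = (ms_rows - ms_err) / (
    ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
)

print(round(icc_from_vars, 3), round(icc_from_ms, 3))  # the two values agree
```

The two expressions always agree, since the mean-square form is just the variance-component ratio with the estimates substituted in.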
Real Statistics Support
Real Statistics Function: The Real Statistics Resource Pack contains the following function:
ICC(R1) = intraclass correlation coefficient of R1 where R1 is formatted as in the data range B5:E12 of Figure 1.
For Example 1, ICC(B5:E12) = .728. This function is actually an array function that provides additional capabilities, as described in Intraclass Correlation Continued.
Real Statistics Data Analysis Tool: The Interrater Reliability data analysis tool supplied in the Real Statistics Resource Pack can also be used to calculate the ICC. We show how to use this tool in Intraclass Correlation Continued.
Observation: There are a number of other measures of ICC in use. We have presented the most useful of these measures above. Click here for information about these other versions of ICC.
Confidence Interval
We now show how to calculate an approximate confidence interval for the ICC (based on Shrout and Fleiss, 1979). We start by defining the following, where MSRow, MSCol and MSE are the mean squares from Figure 2 and ICC is the estimate calculated above:

a = k·ICC/(n(1 – ICC))
b = 1 + k·ICC(n – 1)/(n(1 – ICC))
v = (a·MSCol + b·MSE)^2 / [(a·MSCol)^2/(k – 1) + (b·MSE)^2/((n – 1)(k – 1))]
FL = FINV(α/2, n – 1, v)
FU = FINV(α/2, v, n – 1)

Note that for newer versions of Excel, the FINV function should be replaced by F.INV.RT.

From these we calculate the lower and upper bounds of the 1 – α confidence interval as follows:

lower = n(MSRow – FL·MSE) / [FL(k·MSCol + (kn – k – n)MSE) + n·MSRow]
upper = n(FU·MSRow – MSE) / [k·MSCol + (kn – k – n)MSE + n·FU·MSRow]
Using these formulas we calculate the 95% confidence interval for ICC for the data in Example 1 to be (.434, .927) as shown in Figure 3.
Figure 3 – 95% confidence interval for ICC
As stated above, for newer versions of Excel, the FINV function in cells L39 and L40 should be replaced by F.INV.RT.
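For readers working outside of Excel, here is a minimal Python sketch of the confidence interval calculation above (assuming SciPy is installed); scipy.stats.f.ppf(1 − α/2, d1, d2) plays the role of FINV(α/2, d1, d2), and the function name icc2_ci is just for illustration. Plugging in the mean squares from Figure 2 reproduces the interval of Figure 3 up to rounding.

```python
from scipy.stats import f

def icc2_ci(ms_rows, ms_cols, ms_err, n, k, alpha=0.05):
    """Approximate 1 - alpha confidence interval for the ICC,
    using the F-based bounds defined above."""
    icc = (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
    a = k * icc / (n * (1 - icc))
    b = 1 + k * icc * (n - 1) / (n * (1 - icc))
    v = (a * ms_cols + b * ms_err) ** 2 / (
        (a * ms_cols) ** 2 / (k - 1)
        + (b * ms_err) ** 2 / ((n - 1) * (k - 1))
    )
    # f.ppf(1 - alpha/2, d1, d2) is the equivalent of F.INV.RT(alpha/2, d1, d2)
    f_l = f.ppf(1 - alpha / 2, n - 1, v)
    f_u = f.ppf(1 - alpha / 2, v, n - 1)
    lower = n * (ms_rows - f_l * ms_err) / (
        f_l * (k * ms_cols + (k * n - k - n) * ms_err) + n * ms_rows
    )
    upper = n * (f_u * ms_rows - ms_err) / (
        k * ms_cols + (k * n - k - n) * ms_err + n * f_u * ms_rows
    )
    return icc, lower, upper

# Mean squares from Figure 2 (Example 1): n = 8 wines, k = 4 judges
print(icc2_ci(26.89, 2.45, 2.28, n=8, k=4))   # approx (0.728, 0.43, 0.93)
```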
Observations
The rating levels used by the judges reflect the severity or leniency of each judge. The raters can also be the questions in a test; in this case, the rating level corresponds to the difficulty or leniency of the question.
The measure of ICC is dependent on the homogeneity of the population of subjects being measured. For example, if the raters are measuring the level of violence in the general population, the value of var(β) may be high compared to var(α) and var(ε), thus making ICC high. If instead, the raters are measuring levels of violence in a population of inmates from maximum-security prisons, the value of var(β) may be low compared to var(α) and var(ε), thus making ICC low.
There isn’t universal agreement as to what constitutes an acceptable ICC value; this can also depend on the field in which it is used. E.g. in the medical field, an ICC less than .40 is considered to be poor, between .40 and .60 fair/moderate, between .60 and .75 good, and .75 and above excellent. Others change these boundaries to .50, .75 and .90.
Examples Workbook
Click here to download the Excel workbook with the examples described on this webpage.
References
Shrout, P. E. and Fleiss, J. L. (1979) Intraclass correlations: uses in assessing rater reliability. Psychological Bulletin, 86, 420-428.
https://www.aliquote.org/cours/2012_biomed/biblio/Shrout1979.pdf
Koo, T. K. and Li, M. Y. (2016) A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15, 155-163.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4913118/
Wu, S., Crespi, C. M. and Wong, W. K. (2012) Comparison of methods for estimating the intraclass correlation coefficient for binary responses in cancer prevention cluster randomized trials. Contemporary Clinical Trials, 33(5).
https://pubmed.ncbi.nlm.nih.gov/22627076/
Li, L., Zeng, L., Lin, Z., Cazzell, M. and Liu, H. (2015) Tutorial on use of intraclass correlation coefficients for assessing intertest reliability and its application in functional near-infrared spectroscopy–based brain imaging. Journal of Biomedical Optics, 20(5).
https://www.spiedigitallibrary.org/journals/journal-of-biomedical-optics/volume-20/issue-05/050801/Tutorial-on-use-of-intraclass-correlation-coefficients-for-assessing-intertest/10.1117/1.JBO.20.5.050801.full?SSO=1
Hi Charles
Thank you for the clear review of ICC. You defined the confidence interval above; if we have a wide confidence interval, what does that tell us about the data?
Hi Elle,
It likely means that you need a larger sample to get a more definitive picture (and a narrower confidence interval).
Charles
Hello Charles,
I am assessing test-retest reliability of a self-administered questionnaire where items are rated with a Likert-type scale (1-5). But I am assessing test-retest reliability of the total score of the questionnaire, which is obtained by the sum of the items. However, my data are non-normally distributed (n=35), and thus I should avoid the ICC. Can I use Gwet’s AC2 in place of the ICC to avoid the normality assumption in this continuous variable? Could I use the central limit theorem to justify the use of the ICC? Are there other options? What would be the most appropriate approach?
Hello Marcos,
1. It depends on what you mean by test-retest reliability. Are you performing an analysis such as that shown in Examples 2 and 3 of the following webpage?
https://www.real-statistics.com/reliability/interrater-reliability/intraclass-correlation/icc-for-test-retest-reliability/
2. Data for large enough samples tend to be normally distributed per the central limit theorem. But if you have tested the actual data and found them not to come from a normally distributed population, then you can’t assume normality.
3. I don’t believe that Gwet’s AC2 requires normality, but I don’t know whether it does what you are looking for.
Charles
Hi,
Thank you for your useful explanations. It has helped me a lot.
I’m trying to use ICC in order to check the reliability of ratings before aggregating them and reporting a single value. I’d be more than happy if you could help me out in the following scenario:
A team of experts (variable size S, 2<S<10) are given S copies of the same 6-item survey with Likert-scaled questions. They are given the task to evaluate each of the S people (including themselves) by giving scores to each question in the survey for each of the S people. A score given to a question (or, statement) is an integer between 1 and 5. Each evaluation is then scored by calculating the average of each question's score. At this step, we'll be able to have a dataset like the wine and judge example, which will be an S*S matrix omitting the titles, roughly like the following:
James John Jennifer Sara
James 1.33 2 2.25 4
John 3.33 3.66 4.2 2.85
Jennifer 1 2.6 2.44 2.5
Sara 2.33 3 3.25 5
We'd want to check the interrater reliability, then if reliable, calculate the average of the scores given to a member of the team like the following:
James: 2.395
John: 3.51
Jennifer: 2.135
Sara: 3.395
Then, we'd want to aggregate the score at the team level.
I have the big picture in my mind but I'm still not sure how to do it and whether I'm thinking the right way. I'm pretty much a novice in statistics.
What are you trying to accomplish? Are you trying to determine the reliability of the questionnaire? Are you trying to determine whether there is agreement among the 4 people?
Hi,
Thank you for your response. Yes, I’m trying to accomplish the latter, and then mark those records which indicate a disagreement among the raters as noise.
What do you recommend?
Hi. You can use any of a number of measurements, including the ICC (as well as Gwet’s AC2 and others) to determine the agreement between the raters. I don’t know how well these measurements handle the case where raters rate themselves.
I don’t know how to determine records that indicate disagreement. One possible approach is to remove one or more records and see how the interrater measurement changes.
Charles
Hi,
I am looking at 3 raters, who are rating how long it took for a patient to fall asleep. Each patient has 5 naps, and there are 13 patients (around 60 naps in total).
So each marker will mark each patient nap (all 60) and I want to look at interrater agreement. As far as I’m aware, I cannot use the intraclass correlation coefficient since there are repeated measures from the same patient. Do you think a repeated measures ANOVA would be better to use in this instance?
Thanks
Hi Faizah,
Sorry for the delayed response.
The following link might be useful in addressing this problem.
https://www.reddit.com/r/AskStatistics/comments/btyh3v/icc_for_two_raters_with_repeated_measurements/
Charles
Hello!
Beginner question:
I would like to compare two estimations of tumor cell content in ~350 samples: one by pathology and one by bioinformatic measurements after sequencing.
Can I use ICC for this or should I consider another method, e.g. the concordance correlation coefficient?
Thank you very much in advance!
Best,
Luca
Hello Luca,
What specifically do you mean when you say that you “would like to compare these two estimations of tumor cells”?
Are you trying to show that the two methods give the same results? ICC can be used as explained at
https://www.real-statistics.com/reliability/interrater-reliability/intraclass-correlation/icc-compare-against-gold-standard/
Bland-Altman and Lin’s CCC might also be appropriate
https://www.real-statistics.com/reliability/interrater-reliability/bland-altman-analysis/
https://www.real-statistics.com/reliability/lins-concordance-correlation-coefficient/
Depending on what you are trying to accomplish, a two-sample t-test or Mann-Whitney test may be used.
Charles
Dear Charles,
thank you for nice tutorial about ICC.
I have a question about a scenario with more than one measurement per observer on the same subject. Example: 8 subjects (wines), 4 observers (judges), but each judge tests each wine three times.
Thank you!
Dragan,
Why would the rating change each time the rater tastes the wine?
Charles
Thank you for your response, but I need information on how to handle a scenario with repeated measures.
Dragan,
Perhaps the following two links will be useful:
https://stats.stackexchange.com/questions/81454/icc-in-a-linear-mixed-model-longitudial-data-why-is-this-the-correlation-bet
https://www.reddit.com/r/AskStatistics/comments/btyh3v/icc_for_two_raters_with_repeated_measurements/
Here is another relevant link, although I am not sure whether it addresses your issue.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6645485/
Charles
Hi Charles,
I found the explanations here really useful thank you.
With the interpretation, do you have a reference for the comment:
“E.g. in the medical field, an ICC less than .40 is considered to be poor, between .40 and .50 fair/moderate, between .60 and .75 good and ICC = .75 and above excellent. Others will change these boundaries to .50, .75 and .90.”
The ICC I calculate is 0.67, which seems pretty reasonable to me but I haven’t found a reference to quote.
thanks
Damian
Damian,
See https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4913118/#:~:text=A%20flowchart%20showing%20readers%20how,than%200.90%20indicate%20excellent%20reliability.
Charles
Thank you for all the information! I have a question about absolute agreement and consistency, which of these is the output of the ICC function? I am analyzing the work of 66 students (scores ranging from 0-60) graded by 4 different raters (each rater graded all students) and so I chose class 2 for the ICC class and for type = 4. I am just not sure whether the ICC value is one for absolute agreement or consistency, do you mind explaining this to me or guiding me to useful resources? Thank you very much.
Michael,
The difference is explained at https://en.wikipedia.org/wiki/Intraclass_correlation#:~:text=Absolute%20agreement%3A%20the%20agreement%20between,random%20residual%20error%20is%20kept.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3670985/
https://www.ibm.com/support/pages/how-can-icc-reliability-be-higher-absolute-agreement-consistency
Thanks for your question. This can be confusing. I will eventually expand on this topic on the Real Statistics website.
Charles
Apologies, it was my VBA code.
Thank you so much for this resource!!! You are very smart to understand all these bloody stats!~
Thank you so much for taking the time to educate the internet 😉
May I add a question, too? We want to determine interrater reliability between 2 raters.
They each scored 20 participants, using 5 items that together form a scale (happiness), and one item (quality) which is evaluated differently.
Is it correct that we would determine two ICCs, one using the 20 total scores on happiness as input, and one ICC using the 20 (item) scores on quality as input?
Are 20 observations enough?
for clarification:
one ICC using the 2 × 20 total scores on happiness as input,
and the other using the 2 × 20 item scores on quality as input.
See my previous reply.
Charles
Hello Hanna,
1. You can use this approach.
2. Whether or not 20 observations is enough depends on what you mean by “enough”. You can calculate the ICC with even fewer than 20 observations. How you interpret the ICC value does depend on the confidence interval of the ICC value, which in turn depends on the number of observations.
Charles
Thank you for such a thorough website and for the Resource Pack! I would like to calculate the interrater reliability of my students’ ratings on a measure (after watching a video) compared to the gold standard ratings for that video. The ratings are on a scale from 0 through 4. Can I still use the Intraclass Correlation function in your Resource Pack?
I tried it out on some data that I have, and while it appears to work on videos in which the ratings were mostly between 1 – 4, on a video in which the ratings were predominantly 0’s, with a few 1’s, I get a negative ICC.
Alternatively, it seems to work (again) for the video that has ratings between 1 – 4, in which my student obtained a 0.93 ICC after missing 4 items, but on the video with predominantly 0’s and 1’s, she obtained an ICC of 0.65 after missing only 1 item!
Thank you!
Hello Constance,
The approach for comparison against a gold standard is described at
https://real-statistics.com/reliability/interrater-reliability/intraclass-correlation/icc-compare-against-gold-standard/
Charles
Hello Charles
Thank you very much for this incredibly rich resource.
I am analyzing data where 4 scorers all scored the same 16 trials. I want to evaluate the homogeneity of scoring.
In addition to calculating the ICC for all 4 scorers, I also compute pairwise ICCs for each scorer pair (1&2, 1&3, 1&4, 2&3, etc.) and noticed that a single outlier (a large difference in score between scorers) impacts the ICC a lot (0.82 vs. 1.00 when excluded). (=ICC(data,2,1,TRUE,0.05))
Is this normal or am I doing something incorrectly?
Do you recommend some other type of statistical test for this type of data?
Thank you.
It is difficult for me to say without seeing the data.
If this one “outlier” is excluded I see that ICC = 1.0, which means there is perfect agreement. Is this so?
Charles
Hi, thank you for the post. Why do you say that the ICC you estimated is biased? You said that it is consistent, but biased. How can we estimate an unbiased coefficient?
Luis,
I believe that I am reflecting what was written in the original research paper (although I should really check this). The ICC value is quite useful which probably means that the bias is small.
Charles
Luiz,
I believe that this is what was stated in the original research paper (although I should have checked this). In any case, the ICC estimate is quite good. I don’t know of one that is less biased.
Charles
Hi Charles, thank you for your sincere answer!
Hello
Is ICC appropriate to assess repeatability and reproducibility between two raters each reading two measurements?
Thanks
It depends on what you mean by repeatability and reproducibility. Perhaps the following article will be helpful
https://www.sciencedirect.com/science/article/pii/S0894731719309460
Charles
Hello Charles, I am Axel. I am conducting a research project with colleagues whose aim is to create a manual for coding social media publications through a bibliographic review. After making a selection from 743 articles using a selection methodology in which my colleague and I collected articles separately, we want to use the ICC to compare our selections and see whether there is a correlation between them. We have 3 selection indicators (S1, S2, S3) and two rejection indicators (R1 and R2). I wanted to assign a number to each code in order to find the ICC, but I am stuck, which is why I am coming to you.
I mainly wanted to use class 1 as indicated in your document, but I am not sure of myself.
Class 2: the k raters are chosen at random from a population of raters and these k raters rate all n subjects.
I need your help, please.
Hello Axel,
Please confirm whether the following is true:
You have 743 articles that are being assessed
You have two raters (you and your colleague)
You have 3 selection indicators and 2 rejection indicators. Do you choose one of S1, S2, S3 if the article is selected and one of R1, R2 if the article is rejected?
Charles
We choose just one indicator per article, but my chosen indicator could be different from the indicator chosen by my colleague. That is why we want to use the ICC to see whether our choices are correlated.
Hello Charles, yes we are working on 743 items that we have to individually select or reject using a selection manual. The first step is to validate this manual for the selection of articles. To validate it, we want to see whether there is a correlation between our two analyses of the articles. The analysis is done individually by me and my colleague. Not being experts in ICC, we want to use it to validate the consistency of our two analyses. Yes, there are 5 categories, including 3 for selection and 2 for rejection: selection S1, S2, S3, and rejection R1 and R2.
Hello Axel,
I am trying to determine whether for each of the 743 items you assign one rating from the five categories S1, S2, S3, R1 and R2 (and similarly for your colleague).
Item You Colleague
1 S1 S2
2 R1 R1
3 S2 S2
4 S3 S1
etc.
If it is a little more complicated than this, please explain so that I can determine whether ICC is appropriate.
Charles
Yes Charles, my colleague and I can make different choices, such as:
1 S1 R1
2 R2 S3
3 S3 R1
4 R2 S1 …
But in our method of selecting articles: with S1 you select the article based on keywords in the title, with S2 you select based on the concept of the title, with R1 you reject if the title does not relate to our research idea, S3 is selection based on the concept of the abstract, and R2 is rejection after reading the abstract.
I have a data set with 2 raters, each rating 6 different subjects/parameters (measurements), with each of the 2 raters rating each of the 6 parameters three separate (independent) times.
How do I set up the Excel file for this analysis and do I use ANOVA with replication to eventually calculate the ICC?
Hi Wei,
Real Statistics hasn’t implemented this yet, but here are some sources about this issue:
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0219854
https://www.reddit.com/r/AskStatistics/comments/btyh3v/icc_for_two_raters_with_repeated_measurements/
https://stats.stackexchange.com/questions/229443/icc-agreement-with-repeated-measures
Charles
Charles,
Thank you for the quick reply and the info.
Much appreciated.
Thanks for this! I would just like to ask: how do I calculate the intraclass correlation coefficient if I have 6 raters and they have to rate all 102 items twice, once for content validity and once for structure validity? (one rater = 2 ratings). Thank you very much for your help!
Hi Marie,
You should calculate two ICC values, one for each rating.
Charles
Hello,
This is very insightful. Please, is the ICC parametric or non-parametric? How does the two-way mixed ICC (where the judges are the only ones of interest) differ from Pearson’s correlation?
Thank you
Hello Suzan,
1. ICC is parametric
2. ICC is not the same as Pearson’s correlation. See the following for details:
https://en.wikipedia.org/wiki/Intraclass_correlation
Charles
Hello Dr. Zaiontz,
How do you deal with missing data in these calculations? If one of my raters did not input a value for a continuous variable, then that datapoint is simply excluded from the analyses by SPSS (obviously, since there is no numeric value). However, this is clearly leading to an overestimation of my ICC value because it simply excludes the discrepancy between coders. Is there any way to handle this with ICC calculations? I also have this issue for categorical variables, for which I use kappa to assess reliability, but there I can input “missing” as a category; please correct me if I am wrong. Thank you!
Hello Jen,
These sorts of issues are described in the following book
Handbook of Inter-Rater Reliability by Gwet
Note too that Gwet’s AC2 measurement can be used in place of ICC and Kappa and handles missing data. This approach is supported by Real Statistics. See
Gwet’s AC2
According to the following article, listwise deletion is a reasonable approach for Cohen’s Kappa.
https://journals.sagepub.com/doi/abs/10.1177/0013164418823249
Charles
Hello,
Can I use ICC to assess the agreement of 6 different immunochemistry methods on 122 patients for the measurement of one quantitative variable? The values themselves are not directly comparable because of the different antibodies used; however, the final outcome (normal or abnormal) should be the same.
Thank you!
Hello Vanja,
You can use ICC treating the 6 different immunochemistry methods as the raters provided the ratings are numeric. I don’t fully understand your last sentence though.
Charles