Krippendorff’s Alpha Basic Concepts

Krippendorff’s Alpha is a measure of inter-rater reliability. The advantage of this approach is that it supports categorical, ordinal, interval, and ratio data and also handles missing data.

Rating Table

Specifically, we assume that the ratings are organized in an n × m rating table, where the rows correspond to the subjects being rated, labeled 1, …, n, and the columns correspond to the raters, labeled 1, …, m. The ratings take the values 1, …, q, where the rating given by rater j to subject i is denoted cij. Missing data will be coded as cij = “”.

Since Krippendorff’s alpha is based on the computation of pairwise values, we need to eliminate any rows that contain fewer than two non-missing values.

Note that for categorical ratings, the values 1, …, q represent names. For ordinal ratings, they represent ranks. For interval ratings, they represent interval scale data (e.g. decimal values such as temperatures in degrees Celsius). Finally, for ratio ratings, they represent ratio scale data (e.g. decimal values such as temperatures in kelvins).

Agreement Table

In order to make things clearer, we will use the rating table on the left side of Figure 1 as an example.


Figure 1 – Rating and Agreement Tables

In this example, there are 5 raters (A through E) who rate 10 subjects, using ratings 1, 2, 3, and 4. The first thing we need to do is transform the rating table into an agreement table, as shown on the right side of Figure 1.

The agreement table is an n × q matrix whose rows correspond to the subjects in the rating table, after dropping any row with fewer than two ratings (so n counts only the retained subjects). In the example in Figure 1, the last two rows are dropped: row 9 because it contains no ratings and row 10 because it contains only one rating. The columns of the agreement table correspond to the rating categories 1, …, q.

For example, row 2 of the rating table shows that 3 raters give a 2 rating and 1 rater gives a 3 rating. This is reflected in row 2 of the agreement table. Note that this table can be constructed by placing the formula =COUNTIF($B4:$F4,I$3) in cell I4, highlighting the range I4:L11 and pressing Ctrl-R and Ctrl-D.
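
To make the transformation concrete, here is a minimal Python sketch of the same step (it is not part of the Real Statistics Resource Pack; the function name agreement_table and the sample data are chosen for illustration). Like the COUNTIF formula above, it counts how many raters assigned each rating category to each retained subject.

```python
import numpy as np

def agreement_table(rating_table, categories):
    """Convert an n x m rating table (None = missing) into an agreement table,
    dropping any subject rated by fewer than two raters."""
    rows = []
    for subject in rating_table:
        given = [r for r in subject if r is not None]        # ignore missing cells
        if len(given) < 2:                                    # fewer than two ratings: drop the row
            continue
        rows.append([given.count(c) for c in categories])     # COUNTIF per rating category
    return np.array(rows)

# Hypothetical example: 3 subjects, 5 raters, rating categories 1-4.
# The third subject has only one rating and is therefore dropped.
ratings = [[1, 2, None, 1, 2],
           [2, 2, 3, 2, None],
           [None, None, 4, None, None]]
print(agreement_table(ratings, categories=[1, 2, 3, 4]))
# [[2 2 0 0]
#  [0 3 1 0]]
```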

Krippendorff’s Alpha

As we did for Cohen’s kappa, we define Krippendorff’s alpha via the formula

α = (pa − pe) / (1 − pe)

We now show how to calculate pa and pe, but first, we need to define the weights that will be used.

Weights

Weights whk are defined on pairs of ratings, where 0 ≤ whk ≤ 1 and wkk = 1. These weights then define a q × q matrix. For our purposes, we will assume that this matrix is symmetric; i.e. wkh = whk.

Principally, we will use weights that are based on whether the ratings use a categorical, ordinal, interval, or ratio scale. For any h, k with 1 ≤ h, k ≤ q, such weights whk are defined as shown in Figure 2.

categorical: whk = 0 if k ≠ h and 1 if k = h
ordinal: whk = 1 − C(|k̃ − h̃| + 1, 2)/C(q, 2) if k ≠ h and 1 if k = h
interval: whk = 1 − ((k̃ − h̃)/(b − a))^2
ratio: whk = 1 − ((k̃ − h̃)/(k̃ + h̃))^2

Figure 2 – Weights

Here k̃ is the rating value corresponding to category k, and C(x, y) denotes the number of combinations of x items taken y at a time (COMBIN in Excel). For categorical rating categories, k̃ can represent any category. For ordinal data, k̃ represents the rank of the kth rating (and so the k̃ take the values 1 through q, in some order). Finally, for interval and ratio rating categories, k̃ can represent any decimal number. Note that in any case, the q rating values must all be different.

Note too that in Figure 2, a = the smallest rating value and b = the largest rating value. For the examples shown in this section, we will simply use ratings where k̃ = k.
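
For reference, the weights in Figure 2 can also be generated programmatically. The following Python sketch (an illustration, not the Real Statistics implementation; the helper name weight_matrix is chosen here) builds the q × q weight matrix for each scale type from the rating values k̃, following the formulas in Figure 2.

```python
from math import comb
import numpy as np

def weight_matrix(values, scale="categorical"):
    """q x q weight matrix for the given rating values (the k-tilde of Figure 2)."""
    v = np.asarray(values, dtype=float)
    q = len(v)
    a, b = v.min(), v.max()                     # smallest and largest rating values
    w = np.zeros((q, q))
    for h in range(q):
        for k in range(q):
            if scale == "categorical":
                w[h, k] = 1.0 if h == k else 0.0
            elif scale == "ordinal":            # values are ranks; diagonal stays 1
                w[h, k] = 1.0 if h == k else 1 - comb(int(abs(v[k] - v[h])) + 1, 2) / comb(q, 2)
            elif scale == "interval":
                w[h, k] = 1 - ((v[k] - v[h]) / (b - a)) ** 2
            elif scale == "ratio":              # assumes positive rating values
                w[h, k] = 1 - ((v[k] - v[h]) / (v[k] + v[h])) ** 2
    return w

# Categorical weights form the identity matrix; ordinal weights shown for ratings 1-4
print(weight_matrix([1, 2, 3, 4]))
print(weight_matrix([1, 2, 3, 4], scale="ordinal"))
```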

Examples

Example 1: Calculate Krippendorff’s alpha for the data in Figure 1 based on categorical weights.

As described above, we need to calculate the values of pa and pe. Using the notation of the agreement table, let rik = the number of raters who assign rating k to subject i, ri = the number of raters who rate subject i (the sum of the rik over k), r̄ = the average of the ri, and ε = 1/(nr̄). Then

rik* = Σh wkh rih

πk = (Σi rik)/(nr̄)

pa = (1 − ε)(1/n) Σi Σk rik(rik* − 1)/(r̄(ri − 1)) + ε

pe = Σh Σk whk πh πk

We now show how to calculate each of these formulas in Excel for Example 1 (see Figure 3).

Range O15:R18 contains the weights matrix. For categorical data, per Figure 2, this is the identity matrix, where there are ones in the main diagonal and zeros everywhere else. This can be calculated in Excel by putting the formula =IF(O$14=$N15,1,0) in cell O15, highlighting the range O15:R18, and then pressing Ctrl-R and Ctrl-D. Alternatively, the Real Statistics array formula =IDENTITY() can be placed in range O15:R18.

The rik* values are calculated via the array formula =MMULT(I4:L11,O15:R18) placed in range O4:R11. With categorical weights, the rik* values are identical to the rik in range I4:L11. The ri values are the sums of the rik values for each row. These can be calculated by placing the formula =SUM(I4:L4) in cell T4, highlighting the range T4:T11 and pressing Ctrl-D.


Figure 3 – Krippendorff’s Alpha

Figure 4 displays some representative formulas.

Cell Entity Formula
T4  r1 =SUM(I4:L4)
U4  p1 =SUMPRODUCT(I4:L4,O4:R4-1)/($U$15*(T4-1))
U13  n =COUNT(H4:H11)
U14  q =COUNT(I3:L3)
U15  r-bar =AVERAGE(T4:T11)
U16  ε =1/(U13*U15)
O20  π1 =SUM(I4:I11)*$U$16
U17  pa =AVERAGE(U4:U11)*(1-U16)+U16
U18  pe =SUMPRODUCT(MMULT(TRANSPOSE(O20:R20),O20:R20),O15:R18)
U19  α =(U17-U18)/(1-U18)

Figure 4 – Representative formulas

Note that the formula for pe is an array formula. Note too that the πk* values in range O22:R22 are calculated by the array formula =MMULT(O20:R20,O15:R18), as explained in Standard Error for Krippendorff’s Alpha.
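
Putting the pieces together, here is a Python sketch that mirrors the steps of Figures 3 and 4 (the weighted counts rik* via matrix multiplication, then pa, pe, and alpha). It is an illustration of the calculation described above rather than the Real Statistics code, and it assumes an agreement table and weight matrix such as those produced by the sketches earlier on this page.

```python
import numpy as np

def krippendorff_alpha(agreement, weights):
    """Krippendorff's alpha from an n x q agreement table and a q x q weight matrix."""
    r = np.asarray(agreement, dtype=float)      # r_ik: raters assigning rating k to subject i
    w = np.asarray(weights, dtype=float)
    n, q = r.shape
    ri = r.sum(axis=1)                          # r_i: number of raters for subject i
    r_bar = ri.mean()                           # average number of raters per subject
    eps = 1 / (n * r_bar)
    r_star = r @ w.T                            # r*_ik = sum_h w_kh * r_ih
    p_i = (r * (r_star - 1)).sum(axis=1) / (r_bar * (ri - 1))
    pa = (1 - eps) * p_i.mean() + eps           # percent agreement
    pi_k = r.sum(axis=0) / (n * r_bar)          # pi_k: classification probabilities
    pe = float(pi_k @ w @ pi_k)                 # chance agreement: sum_hk w_hk pi_h pi_k
    return (pa - pe) / (1 - pe)
```

For instance, krippendorff_alpha(agreement_table(ratings, [1, 2, 3, 4]), weight_matrix([1, 2, 3, 4])) would carry out the same steps as Example 1 (with categorical weights) for whatever rating table is stored in ratings.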

Worksheet Functions

Real Statistics Function: The Real Statistics Resource Pack contains the following function:

KALPHA(R1, weights, ratings) = Krippendorff’s alpha for the n × q agreement table in range R1 based on the weights and ratings in the second and third arguments.

weights is either a q × q range containing the weights or one of the values 0 (default) for categorical weights, 1 for ordinal weights, 2 for interval weights, or 3 for ratio weights.

ratings is a q × 1 or 1 × q range containing the rating values. If omitted, the ratings 1, …, q are used.

The Real Statistics Resource Pack also contains the following array function where R1 is a range containing a rating table:

KTRANS(R1): returns the agreement table that corresponds to the rating table in range R1.

KTRANS(R1, col): returns the agreement table that corresponds to the rating table in range R1 where column col is removed.

For Example 1, =KALPHA(I4:L11) takes the value .169204, which is the same value as that found in cell U19. The array formula =KTRANS(B4:F13) returns the array in range I4:L11. The formula =KALPHA(KTRANS(B4:F13)) also outputs the Krippendorff alpha value of .169204.

Examples Workbook

Click here to download the Excel workbook with the examples described on this webpage.

References

Gwet, K. L. (2015) On Krippendorff’s alpha coefficient
https://agreestat.com/papers/onkrippendorffalpha_rev10052015.pdf

Krippendorff, K. (2004) Reliability in content analysis: some common misconceptions and recommendations. University of Pennsylvania paper.
http://faculty.washington.edu/jwilker/559/Krippendorf.pdf

Girard, J. M. (2016) Krippendorff’s alpha coefficient, GitHub
https://github.com/jmgirard/mReliability/wiki/Krippendorff’s-alpha-coefficient

Comments

  1. Dear Charles,
    Thank you for your great explanation!
    May I ask you about my research case? I have video data which is multi-labeled into 8 emotion classes. The data are rated by 2 raters. I want to examine the inter-rater agreement of the data annotation. Is Krippendorff’s Alpha suitable for my problem? If yes, do you know how to calculate Krippendorff’s Alpha in a multilabel case? Thank you in advance. Have a good day!

  2. Hello Dr. Charles Zaiontz. Can you please give me an explanation about using Krippendorff’s alpha on multi-label data?
    I very much appreciate your help.

  3. Dear Dr. Charles Zaiontz,
    thank you very much for your website, it has helped me a lot!
    I’m calculating Gwet’s AC2, but I have problems with finding the weights assigned to ordinal data. I don’t understand the formula in Figure 2 (I do for the other measurement scales), nor the formula in the Gwet’s reference.
    I tried different weighting tables, for 3 categories. When I use the GWET_AC2 command, with code 1 for ordinal data, Gwet’s AC2 is very much higher than when using weighting tables assigning values 0,5 (linear weights), 0,33, or 0,25 (quadratic weights) to adjacent categories (and 1 for the same categories, 0 for most disagreeing). Could you perhaps help me with the weights for cells (1,2), (2,3) in a 3 x 3 weight matrix?

    • Hello Leontien,
      I can explain the formula in Figure 2 for ordinal data if you like, but from the rest of your comment it seems that your real issue is to decide what weights to use. What weights to use really depends on what sort of weights you believe are appropriate “in the real world”.
      Let’s put the statistic analysis aside for a moment. What sort of weights do you believe are appropriate and why? If these don’t produce the results that you expect, then let’s discuss this further.
      Charles

      • Dear Charles,
        thank you for your response. I would think that one could argue for linear weights (0, 0,5 , 1) or quadratic weights (0, 0,25, 1) or something in between (0, 0,33, 1). To me it seems that 0,5 (in a 3×3 table, for one category difference) is a bit too lenient and 0,33 or 0,25 are more appropriate. (I’m looking at the interrater agreement of risk classification (low, medium, high risk for child abuse. In this case, one might also argue for different weights for e.g. low and medium vs medium and high.)
        However, I also want to compare my findings with those of others, for another risk assessment instrument with similar ordinal categories, where also Gwet’s AC2 was being calculated. I don’t know which weights they have used, I assumed that they used code 1 for ordinal data and was surprised to see such a big difference in my own data with the results using the weighting tables that I constructed. The results that I found seem reasonable.
        Krippendorff’s alphas are much lower, due to some categories occurring infrequently, in combination with disagreement about these occurrences. I find it an interesting discussion whether Krippendorff’s alpha is or is not too harsh here and whether Gwet’s AC2 is or isn’t a better measure. I would like to report both measures, and compare both measures between the instruments.
        Thank you,
        Leontien

  4. Hello Charles,
    Thank you for a very informative page!
    We want to calculate interrater reliability for a 7 point scale but we’re struggling with what approach we should use. The data is ordinal. We have 9 raters rating 17 different cases that each represent a level on the scale. We would like to examine interrater reliability for the whole scale as well as for the different levels. We want the results to address “degree of agreeing/disagreeing” and not only if they rate the same or not.

    Is Krippendorff’s alpha the right approach for us?

    • Jody,
      I am not sure what you mean by the cases “represent a level on the scale”.
      Probably Krippendorff’s Alpha or Gwet’s AC2 are the best choices, but I can’t say for sure.
      Charles

  5. Hey Charles,

    First of all a big thanks. The tutorial is practical and seemingly a stand-alone excel based approach.
    As an add-on for you and potential readers, here is the STATA based approach: https://journals.sagepub.com/doi/pdf/10.1177/1536867X1801800408

    I gladly received the same results (with kappaetc command) but felt that the excel calculation helped me to grasp Krippendorff’s Alpha more in-depth.

    Two questions/remarks:
    1. You call it the weighted agreement table, but it seems it is just the copied agreement table which is then used for the weighted agreement calculation, right? One might consider renaming as I first thought this would be the final weighted values.
    2. In Excel: beyond categorical weights, with which excel formulas would you calculate them?

    Thanks and best,
    Niklas

    • Hi Niklas,
      Thanks for sharing the STATA link.
      1. The entries in what I called the weighted agreement table consist of the entries in the agreement table to which the weights have been applied.
      2. The formulas are given in Figure 2. The Excel versions are fairly similar. You need to use the function COMBIN for C(x,y) and RANK or RANK.AVG for the entries with a bar over them.
      Charles

  6. Good morning Charles,

    Thank you for putting together such a brilliant resource. I’ve only just found it and it’s been very useful to clarify some concepts for me.

    I’m currently carrying out a systematic review and I would like to look at the inter-rater consensus and the inter-rater agreement between some examiners. My team has used a risk of bias tool to assess a number of scientific studies and these have been rated as low risk of bias, moderate, serious and critical. I have given these values 1 = low to 4 = critical as these are categorical variables with critical being the worst outcome for a paper.

    When trying to calculate the agreement should I use Krippendorff’s alpha or the AC1 calculation for the inter-rater consensus? Also could I use Real Statistics to calculate Cohen’s kappa for the inter-rater reliability?

    Thank you for your help with this, it is much appreciated.

    Kind regards,

    Kenny

  7. Hi there,

    I am doing content analysis on a curriculum and a language policy and the protocol allows for multiple coding. That is, more than one code can be applied to a unit of analysis. There are two coders and sometimes each coder applied different number of codes to the same unit (one may apply 3, the other 2). I am not sure whether K alpha would be suitable for this kind of data? It does not seem to work with the matrix. Do you think this type of data is more suited to a qualitative approach to agreement checking? Any thoughts appreciated

    • Hi Shawna,
      If I understand correctly, each rater can attribute one or more rating codes to each subject. I don’t believe that Krippendorff’s Alpha can be used in this case.
      Charles

  8. Hi Charles,

    Our research team wants to do Krippendorff Alpha analysis but we have a couple of questions for you if you can answer. Let me first describe our data. Our data are CT-integrated subject area lessons. We aim to analyze it through the idea of integration so we came up with a rubric in which we have many components such as integration type, CT coherence, integration coherence, role of technology, and more. For each component, we have four categories: poor, fair, good, and excellent. We are using p symbol for poor, f symbol for fair, g symbol for good, and e symbol for excellent. After reading the method paper written by Krippendorff (2011), I thought that our data is “B. Nominal data, two observers, no missing data”. This is the second data type in that article. We thought that our data are nominal because we have four categories for each component in our rubric. We are two analyzers (raters) and we have no missing data. When the two raters agreed on the same category such as good (g), we say this is a yes because there is agreement. If we disagree, we say this is a no. However, the “no missing data” part confused us because we do not analyze some components in the rubric for some lessons. I mean some components are not applicable (N/A) for some lessons; for example, some lessons do not have a technology used (“role of technology” component). Could you please answer the following questions: Are our data nominal? Do we have “missing data” or “non-missing data” when we use not applicable for some components of the rubric?

    Thank you so much for your answer in advance,

    Emrah

    • Hello Emrah,
      If you have 4 categories and there is no order to these categories, then the data is nominal.
      If neither rater rates a particular component, then for your purposes this component doesn’t exist and so you don’t have any missing data.
      Charles

      • Thank you so much for your response, Charles. Yes, it makes sense that we do not have any missing data. However, is it OK that we could have different numbers of components rated for different lessons? By this I mean that while we could have 10 components for a lesson that the two raters rated, we could have 13 components for another lesson that we both rated. In this case, can Excel calculate this correctly?

        Another question I have if you don’t mind is that you said “If you have 4 categories and there is no order to these categories, then the data is nominal.” in your comment. What do you mean by order? Is having poor, fair, good, excellent considered an order?

        Emrah

        • Hello Emrah,
          1. If you are calculating separate measurements for each lesson, then it doesn’t matter whether the number of components is the same.
          2. Yes, poor, fair, good, excellent are ordered, namely poor < fair < good < excellent.
          Charles

  9. Is there a minimum recommended number of observations needed in order to test Krippendorff’s alpha?

    I’m working with a team of 3 raters who are each providing expert evaluations (ratings) of each step associated with a task. One task may contain 3 steps (so, each rater provides 3 ratings) whereas on a different task, the task may contain 7 steps (so, each rater provides 7 ratings). They are providing ratings on a 1-3 scale.

    We just ran our first session and it seemed like the more steps we have in a task, the easier it is to achieve a Kripp IRR standard of 0.667 and above.

    For example: One team evaluated a task that had 3 steps and this was their output (first scores are always the first step, second scores are always second step, etc):
    Rater 1: 1, 1, 1
    Rater 2: 2, 1, 1
    Rater 3: 1, 1, 1

    Looks pretty good, but our Kripp IRR came back at….0.

    I’m going to try your Excel formula now to see if I can replicate this, but….any insight as to why this may have happened (and implications for use)?

    • Hi Jade,
      Krippendorff’s alpha is not a test, but a measurement, and so there is no minimum sample size. You could decide that you want the confidence interval for this measurement to be of a certain maximum size, though. In this case, you can calculate the minimum sample size required to achieve this. This approach is described for Cronbach’s alpha in Example 2 on the following webpage:
      https://www.real-statistics.com/reliability/internal-consistency-reliability/cronbachs-alpha/cronbachs-alpha-continued/
      With samples where all the ratings are almost equal, sometimes the various IRRs yield strange, counter-intuitive results. I believe that Gwet’s AC2 might yield better (i.e. more intuitive) results. You might consider trying Gwet’s AC2 to see whether things are better.
      Charles

  10. Hello,

    Would this be appropriate for my study in which 20 raters are assessing 28 categories with 26 videos. Sometimes only 19 or 18 or 16 of my raters have assessed the video – I am trying to find a statistical test in which I can vary the value of m for each video analysis.

    Does this make sense?

    Thank you!
    Arielle

    • Hello,
      I understand that you have 26 videos (subjects) to rate. Each is rated by up to 20 raters. Each rating is one of 28 categories.
      Yes, you can use Krippendorff’s Alpha to accomplish this. You can also use Gwet’s AC2.
      Charles

      • Hello,

        Thank you very much for your quick reply.

        Each rating can consist of several of the 28 categories in my survey. So, a rater can select several categories to apply to one video (subject). The categories are clinical signs observed while watching the video and so raters have selected multiple clinical signs (categories) per video.

        Would Krippendorff’s Alpha/Gwet’s AC2 still work?

        Thank you,
        Arielle

        • Arielle,
          I don’t know how to measure agreement in this case. I know how to measure agreement when (1) there are multiple videos and each rating consists of one category or (2) there is only one video and each rating can consist of multiple categories. I don’t know how to measure agreement when there are multiple videos and each rating consists of multiple categories.
          Charles

          • Hello,

            Thank you, that is helpful. If I may ask one further question, please – how would you measure agreement with one video and each rating consisting of multiple categories?

            Kind regards,
            Arielle

          • Hello Arielle,
            You can think of each category as a subject. The ratings therefore for each category are Yes (that rater selects that category for the one video) or No (that rater doesn’t select that category). Now you can use Krippendorff’s (or Fleiss’ or Gwet’s) treating the categories as subjects and the new rating categories as Yes or No.
            This assumes that the categories (i.e. subjects) are independent.
            Charles

  11. Hi there:) Thank you for the useful info.
    My data set consists of 10 people rating 8 questions, rated on 4 point Likert scale (1-4).
    Shall I use Krippendorff’s Alpha or fleiss kappa ?
    And may I send you my data for calculation?
    Thank you so much!

    • Hi Sharon,
      Fleiss’ kappa only supports categorical data and not Likert data. Thus you should use Krippendorff’s alpha.
      Sorry, you will need to do the calculations yourself.
      Charles

  12. Hello, could this be used to determine interrater reliability of participants using survey data? I am looking for a simple way to determine if people in a team shared the same opinions.

  13. Hello Sir, good day!

    Thank you very much for writing about this test. I have a question about a scenario where there are 2 raters. Both raters must choose 1 disease among 10 possible diseases for each patient. They will rate 25 different patients. Can I use Krippendorff’s alpha for this scenario, Sir?

    Thank you very much,

    Lunardi

  14. Hi Charles,
    Do you maybe know if I can convert Krippendorff’s alpha to ICC (or other inter-rater statistic), when I do know the Krippendorff’s alpha value (and also sample size, number of judges) but I don’t have the raw dataset (e.g., when doing a meta-analysis).
    Thanks in advance

  15. Hello Dr. Zaiontz,

    Thank you so much for the detailed examples and wonderful resources!

    I have a dataset where 7 raters provide scores ranging from 1-4 for 20 students. I tried following the instructions and computed both K’s alpha and Gwet’s AC2. However, there seems to be quite a disparity in both values (i.e. 0.32 for K’s alpha, and 0.69 for Gwet’s AC2). Just wondering if such a big difference in reliability scores is normal? Because the data contained no missing values, I ran it the same way for both statistics (that is, by using the rating table, agreement table, and interval-level weights).

  16. Hello Charles,
    I am trying to prepare the coder data to determine the Krippendorff alpha inter-coder reliability, but I am running into some issues on how to set up the data. Please help.

    I have a priori codebook developed to analyze drawings. The codebook has TWO main CONSTRUCTS. Each of the CONSTRUCTS is divided into four SUB-Constructs each. Each sub-construct is coded to identify mutually exclusive instances either POSITIVE (+1) or NEGATIVE(-1) instance for each subconstruct.
    This set-up gives 16 possible codes for each drawing, but because each pair is mutually exclusive the matrix will have eight empty cells and eight filled cells with a 1 or -1.

    Before running the add on I have a question on how to prepare the coders’ results:

    Do you recommend creating a matrix 16 x number of coders with the CUMULATIVE SUM of all the drawings for each sub-construct (16) as rows and coders in the columns?

    OR should I create a matrix for each drawing with the results of the RAW code from data of either 1 and -1? So if I have 100 drawings should I create a matrix for each?

    OR what is your recommendation for going about this issue?

  17. Very useful article. Thank you for the detailed explanations.
    I believe, though, that there is something missing in the Pa calculation, since the weights do not seem to be taken into account (this happens as well in Gwet’s AC2 calculation). If the weights are different from the identity matrix, the result is different from what I get with other tools (IrrCAC package in R, for example).

    • Hello Fiz,
      Are you saying that (1) the weights are not taken into account (i.e you get the same result no matter what values are chosen for the weights) or (2) that the weights are taken into account but the wrong answer is generated?
      Charles

      • Hello Charles,
        thank you for your answer. My mistake, I thought that the range O15:R18 was not being taken into account for the Pa calculation, but then I noticed that the “Weighted agreement table” was taking it into account (hence the name). I was recreating your formulas from scratch and I missed that part. Nevermind and thank you for such amount of useful and very well explained resources.

  18. Hello Charles,
    thank you so much for all your helpful articles.
    One question that I still have, however, is how to use the interrater reliability value once it’s calculated.
    My scenario is as follows:
    10 judges rate a number of images on a scale of 1* to 3*, where 1* is irrelevant and 3* is highly relevant.
    Since I will need to continue doing calculations with that rating, I need to somehow aggregate the ratings of the judges.
    So let’s do an example:
    In an easy case, all 10 judges might say that this image is a 3*, then obviously the rating for that image should be a 3*.
    But what happens, if 8 judges say it is a 3* and 2 judges claim it is only a 1*. Based on the interrater reliability, those two judges are very far away from the majority and probably not very reliable. How can I use that knowledge to find a reliable rating for this image? Because simply taking the average will result in the score being 2.5 and I think that is not 100% accurate.
    I am actually not even sure if Kappa or any of the other methods to calculate interrater reliability are useful in this scenario. But maybe you could point me into the right direction. That would be greatly appreciated.
    Thank you very much.

  19. Hi again,
    How could Krippendorff’s Alpha be used to calculate the reliability of domain/subtest raw scores, or is there a more appropriate method?

    For example, I have a test that includes 4 different domains/subtests, and each subtest consists of 10 Likert questions (1 through 4). Could the sums of all questions within a domain/subtest be used to calculate the reliability of the domain/subtest?

    Additionally, could the total scores of all questions be used to calculate overall reliability of assessment?

    • Hello Matt,
      Krippendorff’s Alpha can be used to measure agreement between raters. There are a number of types of reliability. If this is not the type of reliability that you are trying to measure, then Krippendorff’s is not the right tool. For example, if you are looking to measure the internal consistency type of reliability of a questionnaire, then perhaps Cronbach’s alpha is the right tool. All of these types of reliability are described on the Real Statistics website.
      Charles

      • Thank you.

        I guess I could ask my question another way. If there was a test of overall wine quality that included 10 questions that sum to a single score, would it be appropriate to use Krippendorff’s alpha to evaluate the inter-rater reliability of the total score? As a follow-up question, would Krippendorff’s alpha be appropriate to use for test-retest reliability of the same scorer to be able to ascertain stability of the assessment over time?

  20. Dear Charles,

    Thank you for the information. I find it very useful.

    Does it make sense to calculate Krippendorff’s alpha for a test with one subject only? I have a project that is divided into several parts (since the options are different), one of which only has one subject. The raters are asked to choose between option A and B.

    5 of them choose A and 1 chooses B.
    alpha = -0.139

    Is this correct? How should I explain this?

    • Hello Wei,
      If there is only one subject, then what do you mean that 5 of them chose A and 1 chose B? I thought there was only one subject.
      Since Krippendorff’s alpha is measuring agreement between the raters, there needs to be more than one rater; otherwise you can’t measure agreement.
      Charles

      • Thanks, and sorry for not making myself clear.
        What I meant was that there were 5 raters (A,B,C,D,E) giving ratings (1,2) on one subject. Raters A, B, C and D give rating ‘1’, and E gives rating ‘2’.

        • Hello Wei,
          Thanks for the clarification. I misunderstood your previous comment. Sorry about that.
          In any case, you need more than one subject to calculate Krippendorff’s alpha.
          Charles

  21. Hello Dr. Zaiontz,
    Thank you for providing the software and detailed directions. I am trying to calculate Krippendorff’s alpha for an experiment I conducted, and I keep calculating negative values for the ‘p’ column. Is this normal?

    • Hello Sarah,
      If I remember correctly, the p column contains probabilities and so shouldn’t contain any negative values.
      If you send me an Excel file with your data and calculations, I will try to figure out what is going wrong.
      Charles

  22. Hi!
    I had my developed tool validated by 5 experts (3 nurse experts and 2 psychology majors). The tool has 39 items/statements with 4 categories (not relevant, slightly relevant, quite relevant, highly relevant). Can I use the Krippendorff alpha to test the inter-rater reliability or the Fleiss kappa?

    Thanks

    • Hi Rica,
      Since the ratings are ordinal (i.e. ordered), you shouldn’t use Fleiss’ kappa. You can use Krippendorff’s alpha or Gwet’s AC2.
      Charles

      • Hello Charles, thank you for the reply.
        I can understand and follow through the kalpha in excel, except about the weights in figure 3 (O15:R18)..

          • Hello Charles,

            What if I use only 2 ratings (not relevant & relevant) instead of 4 (not relevant, slightly relevant, quite relevant, highly relevant)? Can I use Fleiss kappa?

          • This is very helpful, I am confused about the weights table. Should the above read “default weights for CATEGORICAL data”?
            I am struggling to calculate the weights in excel.
            I have ordinal data (ie categories 1a;1b;2a;2b etc up to a total of 8 categories)
            Many thanks

          • Hello Mike,
            The default weights depend on the type of data (categorical, ordinal, etc.). You can override these by using your own weights table.
            Is the order of your categories 1a < 1b < 2a < 2b < 3a < 3b < 4a < 4b? Do you want these to be treated as a Likert scale (e.g. 1 < 2 < 3 < 4 < 5 < 6 < 7 < 8)?
            Charles

  23. When using Krippendorff’s Alpha with Likert scale data, does range restriction need to be considered? If so, I am wondering how to do it?
    And thanks for the great software.

      • I have Likert data that has a natural ceiling and floor based on the number of questions and range of the scale (e.g., 10 questions, Likert between 1 – 4, range between 10 and 40).

        Actually, I guess I can answer my own question by looking at mean, SD, skewness, and kurtosis. Unless the data is squeezed to either extremes, I think limited range would not be an issue.

  24. I would like to ask if it’s normal to have a -.0111! I’m sure something is wrong with the way I ran the test! I have 20 questions, which are nominal, each variable with 111 students. Is it because of the large number?

    • I don’t think the result is questionable simply because you have such a large number. Actually it isn’t really such a large number.
      If you send me an Excel file with your analysis, I can look at it and give you feedback.
      Charles

    • Hi Hong,
      Yes, you are correct. Range O22:R22 are calculated by the array formula =MMULT(O20:R20,O15:R18). I have now explained this better on the webpage. Thank you for identifying this omission.
      Charles

  25. Dr. Zaiontz,

    Is Krippendorff’s alpha primarily used in sociological contexts as a
    qualitative content analysis tool? Can it be used in a clinical context with quantitative data?

    Thank you,
    Winford

    • Dennis,
      I don’t know of any way to calculate a p-value. In any case, what hypothesis would you want to test? That alpha is significantly different from zero?
      Charles

  26. Hi, could you help me to check if my formulas in cell O15 for other data types are correct?
    ordinal: =IF(O$14=$N15,1,1-(COMBIN(ABS(O$14-$N15)+1,2)/COMBIN($U$14,2)))
    interval: =1-POWER((O$14-$N15)/(MAX($O$14:$R$14)-MIN($O$14:$R$14)),2)
    ratio: =1-POWER((O$14-$N15)/(O$14+$N15),2)

    The k-alpha I’ve got are:
    ordinal: 0.4827852 (I doubt this result)
    interval: 0.542026709
    ratio: 0.521144788

    Thank you.

  27. Hi,
    thanks for this information. However, using Excel 2016, even after I corrected the syntax in the formula, I still can’t compute pe. The error message points to the “TRANSPOSE” command. Can you help?
    Thank you

  28. Dear Charles,

    Thank you so much for your website, it is incredibly helpful. I am trying to calculate Krippendorff alpha on a data set where there are 8 subjects, 3 categories (0, 1, 2) and 44 raters. The part I am confused about is how to calculate the weighted table for such a data set. In the example, you use a symmetrical data set but I wonder if mine would be asymmetrical?

    I would appreciate your input on this.

