We explore various types of test reliability, as covered by the following topics.
Topics
- Internal Consistency Reliability
- Interrater Reliability
- Cohen’s Kappa
- Weighted Cohen’s Kappa
- Fleiss’ Kappa
- Krippendorff’s Alpha
- Gwet’s AC2
- Intraclass Correlation
- Kendall’s Coefficient of Concordance (W)
- Kendall’s Coefficient of Agreement u for Paired Comparisons
- Kendall’s Coefficient of Agreement u for Paired Rankings
- TC Correlation between several Judges and a Criterion
- Bland-Altman Analysis
- Lin’s Concordance Correlation Coefficient (CCC)
- Bradley-Terry Model
- Test Theory and Item Analysis
- Item Response Theory and Rasch Analysis
- Motivation for Rasch Analysis
- Basic Concepts of Rasch Analysis
- Building a Rasch Model
- Wright Map
- Guessing
- Real Statistics Support for a UCON model
- PROX model
- Building a PROX model
- Real Statistics Support for a PROX model
- Expanded view of ability and difficulty
- Polytomous Model Basic Concepts
- Building a Polytomous UCON model
- Polytomous Model Fit
- Real Statistics Support for a polytomous UCON model
Hi Charles,
Firstly, thanks for all the information and advice that you provide. Real-statistics.com is the best stats resource I have found.
I’m hoping you can answer a question that I’m struggling with. I have developed a marking rubric that has 15 questions. Three markers will use an ordinal scale (1-5) to score each question. They will do this for 20 assessment pieces (same assignment submitted by 20 different students).
My main focus is to determine the reliability of each individual question, as this will help to identify those that need further refinement. Following your instructions on this site, I’m fairly confident that I can complete this procedure (using Gwet’s AC2, for example) for each of the 15 questions.
My question is: is there a reliability test that will take into account ALL the data (i.e. the scores from the three raters, for all 15 questions, on all 20 student assignments) and provide a single reliability measure for the entire rubric?
Thanks for any assistance you can provide.
Hello Patrick,
This is a rather complicated scenario.
1. First of all, we need to understand what you mean by reliability here. Are you trying to measure agreement between the 3 raters?
2. When you used Gwet’s AC2 did you calculate a measurement to compare the three raters’ assessments for question 1, another for question 2, etc.?
Charles
Hi Charles,
1. Yes, by ‘reliability’ I mean that I’m seeking to measure the level of agreement between the 3 raters.
2. I actually have not used Gwet’s AC2 (or any other method) yet. I’m trying to confirm the procedure before doing so. However, yes, I’m planning on using the method to calculate agreement between the raters on each individual question.
Thanks
Patrick
Hi Patrick,
1. I don’t know of a way to combine multiple agreement measures. We also need to define what such a measure would mean (mean of the measures, minimum of the measures, etc.). Perhaps the following article can be useful:
http://people.umass.edu/~bogartz/Interrater%20Agreement.pdf
2. You should be able to use Gwet for any specific question.
Charles
Thanks for your advice.
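As a rough illustration of the per-question calculation discussed above, here is a minimal Python sketch of Gwet’s AC2 with linear (ordinal) weights for a single rubric question, based on Gwet’s published formulas. The 20 x 3 score matrix is made-up demo data, and the Real Statistics add-in’s own implementation may differ in detail.

```python
# Minimal sketch of Gwet's AC2 for one rubric question (3 raters, 20
# assignments, ordinal scores 1-5), based on Gwet's published formulas.
# Linear weights reflect the ordinal scale; identity weights would give AC1.
import numpy as np

def gwet_ac2(ratings, categories):
    """ratings: n_subjects x n_raters array of category values."""
    ratings = np.asarray(ratings, dtype=float)
    cats = np.asarray(categories, dtype=float)
    q = len(cats)

    # Linear (ordinal) weight matrix
    w = 1 - np.abs(cats[:, None] - cats[None, :]) / (cats.max() - cats.min())

    # r_ik = number of raters assigning subject i to category k
    r_ik = np.array([[np.sum(row == c) for c in cats] for row in ratings])
    r_i = r_ik.sum(axis=1)

    # Observed weighted agreement, averaged over subjects rated by 2+ raters
    r_star = r_ik @ w
    use = r_i >= 2
    pa_i = (r_ik[use] * (r_star[use] - 1)).sum(axis=1) / (r_i[use] * (r_i[use] - 1))
    pa = pa_i.mean()

    # Chance agreement
    pi_k = (r_ik / r_i[:, None]).mean(axis=0)
    pe = w.sum() / (q * (q - 1)) * np.sum(pi_k * (1 - pi_k))

    return (pa - pe) / (1 - pe)

# Demo: 20 assignments x 3 raters, random ordinal scores 1-5
rng = np.random.default_rng(0)
scores = rng.integers(1, 6, size=(20, 3))
print(round(gwet_ac2(scores, categories=[1, 2, 3, 4, 5]), 3))
```

Running this for each of the 15 questions gives 15 separate agreement coefficients; as noted above, there is no standard way to collapse them into a single rubric-level figure.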
Hi! What test do I need to use when the test items are dichotomous (e.g. true/false or agree/disagree)?
What hypothesis do you want to test?
Charles
Hello Charles,
I gave a pre-test/post-test of 50 multiple-choice items to 30 respondents. What should I use to see whether the test is reliable?
Thank you for having this site. This is such a great help.
Hello Ella,
You can use Cronbach’s Alpha or any of the other internal consistency reliability tests described on the website.
Charles
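For reference, here is a minimal Python sketch of Cronbach’s alpha for a respondents-by-items score matrix like the one described above (30 x 50). The 0/1 demo data is made up; for dichotomously scored items the result coincides with KR-20.

```python
# Minimal sketch of Cronbach's alpha, assuming rows = respondents and
# columns = items; for 0/1-scored items this is equivalent to KR-20.
import numpy as np

def cronbach_alpha(scores):
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                         # number of items
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Demo: 30 respondents x 50 items scored 1 = correct, 0 = incorrect
rng = np.random.default_rng(1)
data = (rng.random((30, 50)) < 0.6).astype(int)
print(round(cronbach_alpha(data), 3))
```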
How about the t statistic?
What would you like to know about the t statistic?
Charles
Dear Charles,
I want to thank you for the RealStats Excel add-in and this website; they have been both very educational.
I would appreciate it if you could give me some advice about the following situation (I’ll try to be as clear as possible):
I have a set of 13 variables that are considered relevant for analyzing a certain type of financial institution. Each variable has a weight assigned to it, and the weights add up to 90. Then we have some “experts” who are asked to choose from a broader set of variables, which includes the 13 variables, and to assign weights to the ones they consider more relevant. Thus, I have 13 variables, each with its corresponding weight, and a different number of variables, each with its weight, according to each “expert” (there are 4 “experts”, each with his or her own answers).
What tools could I use to find out how close the initial weights of the 13 variables are to the weights assigned by the group of “experts”?
Thank you in advance for your help.
Regards,
Diego
Hi Diego,
This is an interesting scenario, but I don’t know how to measure it. Perhaps someone else in the Real Statistics community can help you.
Charles
Hi! I hope you can help me.
I’m looking for a coefficient that will help me calculate the reliability between three observers, each of whom evaluated a student speaking in public. A rubric with 9 areas was used, each area having 5 categories, for example:
A. Presentation:
0. Poor
1. Minimum
2. Basic
3. Competent
4. Advanced
I have been reviewing your page, but it is not clear to me what index to use. Thanks!!!
Cristina,
I suggest that you use Gwet’s AC2 or Krippendorff’s alpha for ordered categories.
Charles
I am quite unsure how to treat my data and I was hoping that someone could enlighten me. Students from two different year levels were asked to choose from a set of 5 items which of them they preferred to use at school; they could choose as many as they wanted. From there, the counts for each item were tabulated and ranked (arranged highest to lowest). For example, the results are as follows: Grade 1 – 5,4,3,1,2 and Grade 2 – 3,4,5,2,1.
What tests do I perform to show if there is a significant difference in the preferences of the students based on the counts from the two different year levels?
Thank you so much.
Juliano,
If I understand correctly, you are trying to determine whether there is agreement between the students of year 1 and the students of year 2. None of the reliability measurements described on the website groups the raters in this way, but here is an approach that may be appropriate:
I assume that there are two raters (the students for year 1 and the students for year 2). Each of the two “raters” would give a score to each of five subjects (the items) based on how many students in that year preferred that item. This should work provided the number of students in each year is the same. If not, then some sort of scaling would be needed. E.g. if there are 100 students in year 1 and 150 in year 2, then each of the five ratings for the year 1 rater would need to be multiplied by 1.5.
I would think that you might now be able to use the ICC, Krippendorff’s alpha, or Gwet’s approach to interrater reliability.
There may indeed be shortcomings to this approach, but it is the best I could think of at the moment.
Charles
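To make the scaling step above concrete, here is a rough Python sketch that treats the two year levels as raters and the 5 items as subjects, scales the year 1 counts by 150/100 = 1.5, and computes ICC(2,1) from the standard Shrout-Fleiss two-way formulas. The preference counts are hypothetical.

```python
# Rough sketch of the approach described above: 5 items as "subjects",
# the two year levels as "raters", year 1 counts scaled by 1.5 so both
# years are on the same footing, then ICC(2,1) (absolute agreement).
import numpy as np

def icc2_1(x):
    """x: n_subjects x k_raters matrix; Shrout-Fleiss ICC(2,1)."""
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    grand = x.mean()
    msr = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)   # subjects
    msc = n * ((x.mean(axis=0) - grand) ** 2).sum() / (k - 1)   # raters
    sse = ((x - x.mean(axis=1, keepdims=True)
              - x.mean(axis=0, keepdims=True) + grand) ** 2).sum()
    mse = sse / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical preference counts for the 5 items
grade1 = np.array([40, 35, 30, 10, 20]) * 1.5   # scaled: 100 vs 150 students
grade2 = np.array([45, 55, 70, 30, 25])
print(round(icc2_1(np.column_stack([grade1, grade2])), 3))
```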
I would like to statistically analyze the following for interrater reliability, if possible. I will have 8 raters who will rate 11 items for quality, and each of the 11 items has 4 possible values: Succeeding, Progressing, Learning, and N/A. What formulas should I use to analyze the data?
Kelly,
Since the rating categories are categorical you can use Fleiss’ kappa. You can also use Gwet’s AC2 or Krippendorff’s Alpha. I prefer Gwet’s.
Charles
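For illustration, here is a minimal Python sketch of Fleiss’ kappa for this setup, assuming the raw data is an 11 x 8 table of category labels (11 items, 8 raters, 4 unordered categories); the ratings below are randomly generated demo data.

```python
# Minimal sketch of Fleiss' kappa for 11 items rated by 8 raters into
# 4 unordered categories. Demo ratings are randomly generated.
import numpy as np

def fleiss_kappa(labels, categories):
    labels = np.asarray(labels)
    n, m = labels.shape                               # subjects, raters
    counts = np.array([[np.sum(row == c) for c in categories] for row in labels])
    p_j = counts.sum(axis=0) / (n * m)                # category proportions
    p_i = (np.sum(counts ** 2, axis=1) - m) / (m * (m - 1))
    p_bar, p_e = p_i.mean(), np.sum(p_j ** 2)
    return (p_bar - p_e) / (1 - p_e)

cats = ["Succeeding", "Progressing", "Learning", "N/A"]
rng = np.random.default_rng(2)
ratings = rng.choice(cats, size=(11, 8))              # 11 items x 8 raters
print(round(fleiss_kappa(ratings, cats), 3))
```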
Excellent website. I wonder if you can tell me what statistic I should be using. We have an instrument that measures the length of the eyeball (IOL Master). The company recommends that you take five readings, from which it calculates the mean. In viewing the individual readings there is some variability. To reduce the variability, I believe that we need to increase the number of measurements, i.e. 5, 10, or 20. I can easily measure the same finding on a subject multiple times for each condition, or we can have a number of people take two measurements for each condition (test-retest repeatability). Which is the best way, and what statistic should we use? Thanks in advance for your reply.
Jeffrey,
If I understand correctly, you are asking which is better: (a) 2k measurements by the same person or (b) 2 measurements by k different people.
I would guess (b) but I am not sure that I am right. Perhaps someone else can add their view on this.
Charles
Dear Charles,
Thank you for this wonderful website. I am a nurse living and working in Australia, as well as a DHealth candidate. Statistics is something that makes me feel both terrified and overwhelmed! I am currently analysing the data from a 20-question survey I conducted with our doctors, nurses and allied health professionals as part of my studies. I had 148 respondents out of a possible 244, so a 61% response rate. I would like to ensure the tool is reliable, and so I have been looking into kappa statistics to determine interrater reliability. Is this the tool you would suggest?
I have been analysing my data using Excel, just using filters for each of the responses. I can see that nurses, for example, have responded to questions that were doctor-specific, so I think some of my data will need to be cleaned up. Some of the questions were purely demographic (the first 3), and the others were about communication, understanding of a program’s aims and objectives, and whether they had received education on a particular topic at university or in the clinical setting. Thanks so much, I really appreciate any and all advice you have.
Nicole,
Glad you have gotten value from the website.
The most commonly used tool to check the reliability of a questionnaire is Cronbach’s alpha. See the following webpage:
Cronbach’s alpha
Keep in mind that you need to handle demographic questions separately.
Charles
Hello Charles, I’ve downloaded the resource pack and the worksheets and found them very educational, although I’ve had to learn a lot; I’m new to statistics. I’m doing my MA and a study about the relationship of social media to college students’ academic performance. What test of reliability would you suggest for this study? Also, do you have any suggestions on what questions I should include in my questionnaire?
Thank you very much and God bless
Hello Chav,
I am very pleased that the resource pack and example worksheets have been helpful.
Most people would use Cronbach’s alpha to check the internal consistency (reliability) of their questionnaire, although other tools are also used.
Regarding specific questions for your questionnaire, let’s see if anyone else in the community has some suggestions.
Charles
Thank you Mr. Charles Zaiontz. This website is amazing.
As an archaeologist, I have little knowledge of statistics. I am trying to test the reliability (consistency) of a method we use for categorizing lithic raw materials. But I am not sure how to approach it, or maybe I am overthinking this.
I have conducted a blind test in which 9 individuals categorized the same 144 lithic artefacts. The participants were free to label the categories themselves, so the number of groups was not fixed. A total of 20 different groups were chosen.
Do you have any ideas on how I can easily show the reliability, or degree of consistency, of this? Maybe I have misunderstood the concept of reliability?
Any help is appreciated.
Alexander,
Glad you like the website.
It sounds like a fit for Fleiss’s Kappa, but I need to understand better what you mean by “The participants were free to label the categories themselves, so the number of groups were not fixed”.
Charles
Thank you for the quick reply, Charles.
What I meant was that of the total of 144 lithics, only 18 of them were FLINT. The participants were free to label the lithics as they chose, meaning some of them labeled them correctly as FLINT; however, some participants might have labeled some of them as CHICKEN or QUARTZ. Does this make sense?
Most examples I have seen use fixed numeric scores, like 1-5, which is why I was not sure whether Fleiss’ kappa was suitable in my case. Then again, maybe percent agreement will suffice in my study.
Thank you for your help.
Alexander,
I guess you need to decide what you want to measure: consistency with the right answer or consistency between the raters.
You also need to decide whether some ratings are better than others (i.e. whether a weighting factor should be used).
Charles
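To illustrate the distinction Charles draws, here is a small Python sketch that computes both quantities on a toy, made-up set of labels: each rater’s percent agreement with a reference (“right answer”) classification, and the average pairwise percent agreement between raters. The real data would be a 144 x 9 table of labels.

```python
# Toy example: 3 artefacts (rows) labeled by 3 raters (columns), plus a
# reference classification. Real data would be a 144 x 9 label table.
import numpy as np
from itertools import combinations

labels = np.array([["FLINT",  "FLINT",  "QUARTZ"],
                   ["QUARTZ", "QUARTZ", "QUARTZ"],
                   ["FLINT",  "CHERT",  "FLINT"]])
reference = np.array(["FLINT", "QUARTZ", "FLINT"])

# (a) consistency with the right answer: per-rater hit rate
hit_rate = (labels == reference[:, None]).mean(axis=0)

# (b) consistency between raters: average pairwise percent agreement
pairs = combinations(range(labels.shape[1]), 2)
pairwise = np.mean([(labels[:, i] == labels[:, j]).mean() for i, j in pairs])

print(hit_rate, round(float(pairwise), 2))
```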
This is an awesome pack for statistical calculation. Thanks for it.
Thanks to Mr. Charles Zaiontz for his elaborate examples, which make it easy for us to learn how to analyse our research data using Excel-based statistics packages.
Please keep it up.
This website is recommended especially for students and Professors in Measurement and Testing. Very helpful indeed.