We explore various types of test reliability, as covered by the following topics.
Topics
- Internal Consistency Reliability
- Interrater Reliability
- Cohen’s Kappa
- Weighted Cohen’s Kappa
- Fleiss’ Kappa
- Krippendorff’s Alpha
- Gwet’s AC2
- Intraclass Correlation
- Kendall’s Coefficient of Concordance (W)
- Kendall’s Coefficient of Agreement u for Paired Comparisons
- Kendall’s Coefficient of Agreement u for Paired Rankings
- TC Correlation between several Judges and a Criterion
- Bland-Altman Analysis
- Lin’s Concordance Correlation Coefficient (CCC)
- Bradley-Terry Model
- Test Theory and Item Analysis
- Item Response Theory and Rasch Analysis
- Motivation for Rasch Analysis
- Basic Concepts of Rasch Analysis
- Building a Rasch Model
- Wright Map
- Guessing
- Real Statistics Support for a UCON model
- PROX model
- Building a PROX model
- Real Statistics Support for a PROX model
- Expanded view of ability and difficulty
- Polytomous Model Basic Concepts
- Building a Polytomous UCON model
- Polytomous Model Fit
- Real Statistics Support for a polytomous UCON model
Hi Charles,
Firstly, thanks for all the information and advice that you provide. Real-statistics.com is the best stats resource I have found.
I’m hoping you can answer a question that I’m struggling with. I have developed a marking rubric that has 15 questions. Three markers will use an ordinal scale (1-5) to score each question. They will do this for 20 assessment pieces (same assignment submitted by 20 different students).
My main focus is to determine the reliability of each individual question, as this will help to identify those that need further refinement. Following your instructions on this site, I’m fairly confident that I can complete this procedure (using Gwet’s AC2, for example) for each of the 15 questions.
My question is: is there a reliability test that will take into account ALL the data (i.e. the scores from the three raters, for all 15 questions, on all 20 student assignments) and provide a single reliability measure for the entire rubric?
Thanks for any assistance you can provide.
Hello Patrick,
This is a rather complicated scenario.
1. First of all, we need to understand what you mean by reliability here. Are you trying to measure agreement between the 3 raters?
2. When you used Gwet’s AC2 did you calculate a measurement to compare the three raters’ assessments for question 1, another for question 2, etc.?
Charles
Hi Charles,
1. Yes, by ‘reliability’ I mean that I’m seeking to measure the level of agreement between the 3 raters.
2. I actually have not used Gwet’s AC2 (or any other method) yet. I’m trying to confirm the procedure before doing so. However, yes, I’m planning on using the method to calculate agreement between the raters on each individual question.
Thanks
Patrick
Hi Patrick,
1. I don’t know of a way to combine multiple agreement measures. We also need to define what such a measure would mean (mean of the measures, minimum of the measures, etc.). Perhaps the following article can be useful:
http://people.umass.edu/~bogartz/Interrater%20Agreement.pdf
2. You should be able to use Gwet for any specific question.
Charles
Thanks for your advice.
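As a rough illustration of the per-question calculation discussed above, here is a minimal Python sketch of Gwet’s AC2 with linear (ordinal) weights for a single rubric question, based on Gwet’s published formulas. The 20 x 3 score matrix is made-up demo data, and the Real Statistics add-in’s own implementation may differ in detail.

```python
# Minimal sketch of Gwet's AC2 for one rubric question (3 raters, 20
# assignments, ordinal scores 1-5), based on Gwet's published formulas.
# Linear weights reflect the ordinal scale; identity weights would give AC1.
import numpy as np

def gwet_ac2(ratings, categories):
    """ratings: n_subjects x n_raters array of category values."""
    ratings = np.asarray(ratings, dtype=float)
    cats = np.asarray(categories, dtype=float)
    q = len(cats)

    # Linear (ordinal) weight matrix
    w = 1 - np.abs(cats[:, None] - cats[None, :]) / (cats.max() - cats.min())

    # r_ik = number of raters assigning subject i to category k
    r_ik = np.array([[np.sum(row == c) for c in cats] for row in ratings])
    r_i = r_ik.sum(axis=1)

    # Observed weighted agreement, averaged over subjects rated by 2+ raters
    r_star = r_ik @ w
    use = r_i >= 2
    pa_i = (r_ik[use] * (r_star[use] - 1)).sum(axis=1) / (r_i[use] * (r_i[use] - 1))
    pa = pa_i.mean()

    # Chance agreement
    pi_k = (r_ik / r_i[:, None]).mean(axis=0)
    pe = w.sum() / (q * (q - 1)) * np.sum(pi_k * (1 - pi_k))

    return (pa - pe) / (1 - pe)

# Demo: 20 assignments x 3 raters, random ordinal scores 1-5
rng = np.random.default_rng(0)
scores = rng.integers(1, 6, size=(20, 3))
print(round(gwet_ac2(scores, categories=[1, 2, 3, 4, 5]), 3))
```

Running this for each of the 15 questions gives 15 separate agreement coefficients; as noted above, there is no standard way to collapse them into a single rubric-level figure.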
Hi! What test do I need to use when the test items are dichotomous (e.g. true/false or agree/disagree)?
What hypothesis do you want to test?
Charles
Hello Charles,
I gave a pre-test/post-test of 50 multiple-choice items to 30 respondents. What should I use to see whether the test is reliable?
Thank you for having this site. This is such a great help.
Hello Ella,
You can use Cronbach’s Alpha or any of the other internal consistency reliability tests described on the website.
Charles
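For reference, here is a minimal Python sketch of Cronbach’s alpha for a respondents-by-items score matrix like the one described above (30 x 50). The 0/1 demo data is made up; for dichotomously scored items the result coincides with KR-20.

```python
# Minimal sketch of Cronbach's alpha, assuming rows = respondents and
# columns = items; for 0/1-scored items this is equivalent to KR-20.
import numpy as np

def cronbach_alpha(scores):
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                         # number of items
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Demo: 30 respondents x 50 items scored 1 = correct, 0 = incorrect
rng = np.random.default_rng(1)
data = (rng.random((30, 50)) < 0.6).astype(int)
print(round(cronbach_alpha(data), 3))
```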
How about the t statistic?
What would you like to know about the t statistic?
Charles
Dear Charles,
I want to thank you for the RealStats Excel add-in and this website; they have been both very educational.
I would appreciate it if you could give me some advice about the following situation (I’ll try to be as clear as possible):
I have a set of 13 variables that are considered relevant for analyzing a certain type of financial institution. Each variable has a weight assigned to it, and the weights add up to 90. Then we have some “experts” who are asked to choose from a broader set of variables, which includes the 13 variables, and to assign weights to the ones they consider more relevant. Thus, I have 13 variables, each with its corresponding weight, and a different number of variables, each with its weight, according to each “expert” (there are 4 “experts”, each with his or her own answers).
What tools could I use to find out how close the initial weights of the 13 variables are to the weights assigned by the group of “experts”?
Thank you in advance for your help.
Regards,
Diego
Hi Diego,
This is an interesting scenario, but I don’t know how to measure it. Perhaps someone else in the Real Statistics community can help you.
Charles
Hi! I hope you can help me.
I’m looking for a coefficient that will help me calculate the reliability between three observers, each of whom evaluated a student speaking in public. A rubric with 9 areas was used, each area having 5 categories, for example:
A. Presentation:
0. Poor
1. Minimum
2. Basic
3. Competent
4. Advanced
I have been reviewing your page, but it is not clear to me what index to use. Thanks!!!
Cristina,
I suggest that you use Gwet’s AC2 or Krippendorff’s alpha for ordered categories.
Charles
I am quite unsure how to treat my data and I was hoping that someone could enlighten me. Students from two different year levels were asked to choose from a set of 5 items which of them they preferred to use at school; they could choose as many as they wanted. From there, the counts for each item were tabulated and ranked (arranged highest to lowest). For example, the results are as follows: Grade 1 – 5,4,3,1,2 and Grade 2 – 3,4,5,2,1.
What tests do I perform to show if there is a significant difference in the preferences of the students based on the counts from the two different year levels?
Thank you so much.
Juliano,
If I understand correctly, you are trying to determine whether there is agreement between the students of year 1 and the students of year 2. None of the reliability measurements described on the website groups the raters in this way, but here is an approach that may be appropriate:
I assume that there are two raters (the students for year 1 and the students for year 2). Each of the two “raters” would give a score to each of five subjects (the items) based on how many students in that year preferred that item. This should work provided the number of students in each year is the same. If not, then some sort of scaling would be needed. E.g. if there are 100 students in year 1 and 150 in year 2, then each of the five ratings for the year 1 rater would need to be multiplied by 1.5.
I would think that you might now be able to use the ICC, Krippendorff’s alpha, or Gwet’s approach to interrater reliability.
There may indeed be shortcomings to this approach, but it is the best I could think of at the moment.
Charles
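To make the scaling step above concrete, here is a rough Python sketch that treats the two year levels as raters and the 5 items as subjects, scales the year 1 counts by 150/100 = 1.5, and computes ICC(2,1) from the standard Shrout-Fleiss two-way formulas. The preference counts are hypothetical.

```python
# Rough sketch of the approach described above: 5 items as "subjects",
# the two year levels as "raters", year 1 counts scaled by 1.5 so both
# years are on the same footing, then ICC(2,1) (absolute agreement).
import numpy as np

def icc2_1(x):
    """x: n_subjects x k_raters matrix; Shrout-Fleiss ICC(2,1)."""
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    grand = x.mean()
    msr = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)   # subjects
    msc = n * ((x.mean(axis=0) - grand) ** 2).sum() / (k - 1)   # raters
    sse = ((x - x.mean(axis=1, keepdims=True)
              - x.mean(axis=0, keepdims=True) + grand) ** 2).sum()
    mse = sse / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical preference counts for the 5 items
grade1 = np.array([40, 35, 30, 10, 20]) * 1.5   # scaled: 100 vs 150 students
grade2 = np.array([45, 55, 70, 30, 25])
print(round(icc2_1(np.column_stack([grade1, grade2])), 3))
```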
I would like to statistically analyze the following for interrater reliability, if possible. I will have 8 raters who will rate 11 items for quality, and each of the 11 items has 4 possible values: Succeeding, Progressing, Learning, and N/A. What formulas should I use to analyze the data?
Kelly,
Since the rating categories are categorical you can use Fleiss’ kappa. You can also use Gwet’s AC2 or Krippendorff’s Alpha. I prefer Gwet’s.
Charles
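For illustration, here is a minimal Python sketch of Fleiss’ kappa for this setup, assuming the raw data is an 11 x 8 table of category labels (11 items, 8 raters, 4 unordered categories); the ratings below are randomly generated demo data.

```python
# Minimal sketch of Fleiss' kappa for 11 items rated by 8 raters into
# 4 unordered categories. Demo ratings are randomly generated.
import numpy as np

def fleiss_kappa(labels, categories):
    labels = np.asarray(labels)
    n, m = labels.shape                               # subjects, raters
    counts = np.array([[np.sum(row == c) for c in categories] for row in labels])
    p_j = counts.sum(axis=0) / (n * m)                # category proportions
    p_i = (np.sum(counts ** 2, axis=1) - m) / (m * (m - 1))
    p_bar, p_e = p_i.mean(), np.sum(p_j ** 2)
    return (p_bar - p_e) / (1 - p_e)

cats = ["Succeeding", "Progressing", "Learning", "N/A"]
rng = np.random.default_rng(2)
ratings = rng.choice(cats, size=(11, 8))              # 11 items x 8 raters
print(round(fleiss_kappa(ratings, cats), 3))
```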
Excellent website. I wonder if you can tell me what statistic I should be using. We have an instrument that measures the length of the eyeball (IOL Master). The company recommends that you take five readings, from which it calculates the mean. In viewing the individual readings there is some variability. To reduce the variability, I believe that we need to increase the number of measurements, i.e. 5, 10, or 20. I can easily measure the same finding on a subject multiple times for each condition, or we can have a number of people take two measurements for each condition (test-retest repeatability). Which is the best way, and what statistic should we use? Thanks in advance for your reply.
Jeffrey,
If I understand correctly, you are asking which is better: (a) 2k measurements by the same person or (b) 2 measurements by k different people.
I would guess (b) but I am not sure that I am right. Perhaps someone else can add their view on this.
Charles
Dear Charles,
Thank you for this wonderful website. I am a nurse living and working in Australia, as well as a DHealth candidate. Statistics is something that makes me feel both terrified and overwhelmed! I am currently analysing the data from a 20-question survey I conducted with our doctors, nurses and allied health professionals as part of my studies. I had 148 respondents out of a possible 244, so a 61% response rate. I would like to ensure the tool is reliable, and so I have been looking into kappa statistics to determine interrater reliability. Is this the tool you would suggest?
I have been analysing my data using Excel, just using filters for each of the responses. I can see that nurses, for example, have responded to questions that were doctor-specific, so I think some of my data will need to be cleaned up. Some of the questions were purely demographic (the first 3), and the others were about communication, understanding of a program’s aims and objectives, and whether they had received education on a particular topic at university or in the clinical setting. Thanks so much, I really appreciate any and all advice you have.
Nicole,
Glad you have gotten value from the website.
The most commonly used tool to check the reliability of a questionnaire is Cronbach’s alpha. See the following webpage:
Cronbach’s alpha
Keep in mind that you need to handle demographic questions separately.
Charles
Hello Charles, I’ve downloaded the resource pack and the worksheets and found them very educational, although I’ve had to learn a lot; I’m new to statistics. I’m doing my MA and a study about the relationship of social media to college students’ academic performance. What test of reliability would you suggest for this study? Also, do you have any suggestions on what questions I should include in my questionnaire?
Thank you very much and God bless
Hello Chav,
I am very pleased that the resource pack and example worksheets have been helpful.
Most people would use Cronbach’s alpha to check the internal consistency (reliability) of their questionnaire, although other tools are also used.
Regarding specific questions for your questionnaire, let’s see if anyone else in the community has some suggestions.
Charles
Thank you Mr. Charles Zaiontz. This website is amazing.
As an archaeologist, I have little knowledge of statistics. I am trying to test the reliability (consistency) of a method we use for categorizing lithic raw materials. But I am not sure how to approach it, or maybe I am overthinking this.
I have conducted a blind test in which 9 individuals categorized the same 144 lithic artefacts. The participants were free to label the categories themselves, so the number of groups was not fixed. A total of 20 different groups were chosen.
Do you have any ideas on how I can easily show the reliability, or degree of consistency, of this? Maybe I have misunderstood the concept of reliability?
Any help is appreciated.
Alexander,
Glad you like the website.
It sounds like a fit for Fleiss’s Kappa, but I need to understand better what you mean by “The participants were free to label the categories themselves, so the number of groups were not fixed”.
Charles
Thank you for the quick reply, Charles.
What I meant was that of the total of 144 lithics, only 18 of them were FLINT. The participants were free to label the lithics as they chose, meaning some of them labeled them correctly as FLINT; however, some participants might have labeled some of them as CHICKEN or QUARTZ. Does this make sense?
Most examples I have seen use fixed numeric scores, like 1-5, which is why I was not sure whether Fleiss’ kappa was suitable in my case. Then again, maybe percent agreement will suffice in my study.
Thank you for your help.
Alexander,
I guess you need to decide what you want to measure: consistency with the right answer or consistency between the raters.
You also need to decide whether some ratings are better than others (i.e. whether a weighting factor should be used).
Charles
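To illustrate the distinction Charles draws, here is a small Python sketch that computes both quantities on a toy, made-up set of labels: each rater’s percent agreement with a reference (“right answer”) classification, and the average pairwise percent agreement between raters. The real data would be a 144 x 9 table of labels.

```python
# Toy example: 3 artefacts (rows) labeled by 3 raters (columns), plus a
# reference classification. Real data would be a 144 x 9 label table.
import numpy as np
from itertools import combinations

labels = np.array([["FLINT",  "FLINT",  "QUARTZ"],
                   ["QUARTZ", "QUARTZ", "QUARTZ"],
                   ["FLINT",  "CHERT",  "FLINT"]])
reference = np.array(["FLINT", "QUARTZ", "FLINT"])

# (a) consistency with the right answer: per-rater hit rate
hit_rate = (labels == reference[:, None]).mean(axis=0)

# (b) consistency between raters: average pairwise percent agreement
pairs = combinations(range(labels.shape[1]), 2)
pairwise = np.mean([(labels[:, i] == labels[:, j]).mean() for i, j in pairs])

print(hit_rate, round(float(pairwise), 2))
```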
This is an awesome pack for statistical calculation. Thanks for it.
Thanks to Mr. Charles Zaiontz for his elaborate examples, which make it easy for us to learn how to analyse our research data using Excel-based statistics packages.
Please keep it up.
This website is recommended especially for students and Professors in Measurement and Testing. Very helpful indeed.