Correlation in Relationship to t-test

Objective

On this webpage, we describe the following:

  • The relationship between the t-test and the correlation test
  • Point biserial correlation
  • The r-effect size (aka the eta or phi effect size)

Example

Example 2 of Two Sample t Test with Equal Variances tests whether memory recall is the same for the control group and the sleep-deprived group. The t-test is repeated in the middle part of Figure 1 based on the data for the two independent samples shown in ranges A4:A15 and A16:A27 of Figure 1. 

This two-sample t-test can be turned into a correlation problem by combining the two samples into one sample (random variable x) and setting the dichotomous random variable y to 0 for elements in one sample and to 1 for elements in the other sample. It turns out that the two-sample analysis using the t-test is equivalent to the analysis of the correlation coefficient using the t-test.

Example 1: Calculate the correlation coefficient r for x and y as described above, and then test the null hypothesis H0: ρ = 0.

The correlation coefficient is calculated to be r = 0.0214 (cell L5) and the correlation test is shown on the right side of Figure 1 using the approach described in Correlation Testing using a t-Test.

Correlation and t tests

Figure 1 – Using correlation testing to solve Example 1

The p-values (cells H14 and L10) and t-statistics (cells F14 and L7) of the t-test and correlation test are exactly the same. Using either test, we conclude that there is no significant difference in memory recall between the control group and the sleep-deprived group.
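
The equivalence can be checked numerically. The following is a minimal Python sketch using only the standard library. The two samples are made-up illustrative values (not the data from Figure 1), and the helper names `pooled_t`, `pearson_r` and `t_from_r` are hypothetical. It computes the pooled-variance t statistic directly, and then again from the correlation coefficient via t = r√df / √(1 − r²).

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(sample0, sample1):
    """Two-sample t statistic with equal variances (group 1 minus group 0)."""
    n0, n1 = len(sample0), len(sample1)
    # Pooled variance: each sample variance weighted by its degrees of freedom.
    sp2 = ((n0 - 1) * variance(sample0) + (n1 - 1) * variance(sample1)) / (n0 + n1 - 2)
    return (mean(sample1) - mean(sample0)) / sqrt(sp2 * (1 / n0 + 1 / n1))


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def t_from_r(r, df):
    """t statistic for testing H0: rho = 0."""
    return r * sqrt(df) / sqrt(1 - r * r)


# Hypothetical data, for illustration only.
control = [65, 79, 57, 49, 68, 72]
treated = [58, 71, 62, 55, 74, 80]

# Combine the samples into x and code group membership in y (0 or 1).
x = control + treated
y = [0] * len(control) + [1] * len(treated)
df = len(x) - 2

t_direct = pooled_t(control, treated)
t_via_r = t_from_r(pearson_r(x, y), df)
print(t_direct, t_via_r)  # the two values agree up to floating-point rounding
```

Since both statistics are compared against the same T(df) distribution, equal t statistics imply equal p-values.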

Point-biserial Correlation

Definition 1: A variable is dichotomous if it only takes two values (usually set to 0 and 1).

The point-biserial correlation coefficient is simply Pearson’s product-moment correlation coefficient where one or both of the variables are dichotomous.

Property 1: If {y1, …, yn} is a sample for the dichotomous random variable y and {x1, …, xn} is a sample for the random variable x, the point-biserial correlation coefficient between these samples is given by the formula

$$ r = \frac{m_1 - m_0}{s_x} \sqrt{\frac{n_0 n_1}{n(n-1)}} $$

where m0 is the mean of the n0 data elements xi whose corresponding y value is yi = 0, m1 is the mean of the n1 data elements xi whose corresponding y value is yi = 1, and sx is the (sample) standard deviation of {x1, …, xn}.

If {x1, …, xn} and {y1, …, yn} are populations, then the point-biserial correlation coefficient is

$$ r = \frac{m_1 - m_0}{\sigma_x} \sqrt{\frac{n_0 n_1}{n^2}} $$

where σx is the (population) standard deviation of {x1, …, xn}.

Example 2: Calculate the correlation coefficient for Example 1 using Property 1.

Point biserial correlation calculation

Figure 2 – Calculation of point biserial correlation coefficient
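
As a rough cross-check of the sample formula r = ((m1 − m0)/sx)·√(n0·n1/(n(n − 1))), the sketch below computes the point-biserial coefficient both from that formula and as an ordinary Pearson correlation. The data values and function names are hypothetical and are not taken from Figure 2.

```python
from math import sqrt
from statistics import mean, stdev


def point_biserial(x, y):
    """Point-biserial r from group means, where y holds only 0s and 1s."""
    x0 = [a for a, b in zip(x, y) if b == 0]
    x1 = [a for a, b in zip(x, y) if b == 1]
    n0, n1, n = len(x0), len(x1), len(x)
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    return (mean(x1) - mean(x0)) / stdev(x) * sqrt(n0 * n1 / (n * (n - 1)))


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data, for illustration only.
x = [12.1, 9.8, 11.4, 10.6, 13.0, 9.2, 10.9, 12.5]
y = [0, 0, 0, 0, 1, 1, 1, 1]

r_formula = point_biserial(x, y)
r_pearson = pearson_r(x, y)
print(r_formula, r_pearson)  # the two values agree up to floating-point rounding
```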

r-effect size

Property 2:
$$ r = \sqrt{\frac{t^2}{t^2 + df}} $$

where t is the test statistic for two means hypothesis testing of variables x1 and x2 with t ~ T(df), x is a combination of x1 and x2, and y is the dichotomous variable as in Example 1.

Observation: The value for t from Example 1 is .1004. By Property 2,

$$ r^2 = \frac{t^2}{t^2 + df} = \frac{(.1004)^2}{(.1004)^2 + 22} = .00046 $$

and so r = .0214, which agrees with the value we get using CORREL (as we can see in cell L5 in Figure 1).
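
The same arithmetic can be reproduced in a few lines of Python. This is only a sketch: the function name `r_from_t` is hypothetical, and df = 22 is an assumption based on two samples of 12 elements each (the ranges cited for Figure 1).

```python
from math import sqrt


def r_from_t(t, df):
    """r effect size from a t statistic: r = sqrt(t^2 / (t^2 + df))."""
    return sqrt(t * t / (t * t + df))


t = 0.1004   # t statistic quoted for Example 1
df = 22      # assumed: 12 + 12 - 2

r = r_from_t(t, df)
print(round(r, 4), round(r * r, 5))  # 0.0214 0.00046
```

Note that the square root returns a non-negative value, so the sign of r has to be taken from the direction of the difference between the two group means.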

Observation: The effect size for the comparison of two means (see Two Sample t Test with Equal Variances) is given by

$$ d = \frac{\mu_1 - \mu_2}{\sigma} $$

The sample version of this measure of effect size is

$$ d = \frac{\bar{x}_1 - \bar{x}_2}{s} $$

Using the formula from Property 1 of Correlation Testing via the t Test, we can convert this into an expression based on r, namely:

$$ d = r \sqrt{\frac{df \, (n_1 + n_2)}{n_1 n_2 \, (1 - r^2)}} $$

E.g., for the data in Example 1:

$$ d = .0214 \sqrt{\frac{22 \, (12 + 12)}{12 \cdot 12 \, (1 - .0214^2)}} = .041 $$

This means that the difference in average memory recall score between the control group and the sleep-deprived group is only about 4.1% of the standard deviation. Note that this is the same effect size that was calculated in Example 2 of Two Sample t Test with Equal Variances.
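
The conversion from r to d can be sketched in Python as follows. The sketch assumes the relationship d = r·√(df·(n1 + n2) / (n1·n2·(1 − r²))), which follows from combining d = t·√(1/n1 + 1/n2) with t = r·√df / √(1 − r²). The function name `d_from_r` is hypothetical, and the group sizes of 12 are an assumption based on the ranges cited for Figure 1.

```python
from math import sqrt


def d_from_r(r, n1, n2):
    """Cohen's d for two independent samples, computed from the r effect size."""
    df = n1 + n2 - 2
    return r * sqrt(df * (n1 + n2) / (n1 * n2 * (1 - r * r)))


# r as reported for Example 1; group sizes assumed to be 12 each.
d = d_from_r(0.0214, 12, 12)
print(round(d, 3))  # 0.041
```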

Alternatively, we can use φ (phi) as a measure of effect size. Phi is nothing more than r. For this example φ = r = 0.0214. Since r² = 0.00046, we know that only 0.046% of the variation in the memory recall scores is explained by the amount of sleep.

A rough estimate of effect size is that r = .5 represents a large effect size (explains 25% of the variance), r = .3 represents a medium effect size (explains 9% of the variance), and r = .1 represents a small effect size (explains 1% of the variance).

77 thoughts on “Correlation in Relationship to t-test”

  1. Hi Charles
    Could you give your opinion on this?
    Let’s say I compared the distance run in 1 minute by men to distance run by women, and the results led me to reject the null-hypothesis. Would it be wrong to say that the gender correlates with the distance run? I used this phrasing in an essay, but my teacher disagrees, she says a t-test compares means and is not a correlation test.

    • Hello Linda,
      Your teacher is correct that the usual two-sample t-test compares means, but as this webpage explains, this test is equivalent to one that tests whether a specific correlation is zero. For this latter test, it seems to me that “gender correlates with the distance run” is appropriate.
      Charles

  2. Hi, Charles!

    The results of my paired sample t-test show that there is no significant difference between the two variables. Is it right not to proceed to the computation of Pearson’s r given that the difference is not statistically significant?

  3. Hello, Charles,
    I am conducting an exploratory study for a dissertation investigating the extent to which two different types of Master’s programs include particular leadership skills in their curricula. Inclusion is measured by a 4-point survey of faculty members. I used a t-test and found some significant differences between program type and skill inclusion. I then used a point biserial correlation analysis to verify, confirm or perhaps show different significance. I did find some differences. Is it a sound method to use both t-tests and correlations?

    • Hello Robert,
      As the webpage shows, the t-test and correlation test are equivalent, and so you should get the same result. If you are getting different results, then you are performing different tests from those described on this webpage.
      Charles

  4. Hello Charles,

    I am trying to determine the correlation between two data sets of 20 samples each. The data sets use two different simulation methods (bulk method and detailed method) to generate radioactivity in fuel pins (the same pins are compared in each method). The results are close (percentage difference ranging from 0.5% to 1.5%); however, the bulk value is always slightly lower than the detailed value. I tried running a paired t-test, but I was unable to select the null hypothesis because of the way the data appears. I have been looking through the tests on this site and I am unable to find a suitable one for my case. If you could point me in the right direction I would greatly appreciate it.

    Braden

  5. Thanks, Charles for this,

    In my research, I tested whether there is a significant difference between two dependent variables by employing a paired sample t-test. The analysis shows a strong significant difference between these two variables. From this result, I intend to enhance my attribution of these findings by employing Pearson correlation. My question is, is that right, using correlation to enhance the findings of and interpretation of t-test?

    Best

  6. Hi Charles,
    If my hypothesis is about women being less likely to take risks than men, should I use the point biserial correlation test or the independent samples t-test, given that gender is a binary variable and the likelihood of taking risk is a continuous variable?
    Cheers.

    • Elisa,
      As this webpage demonstrates, the correlation test is equivalent to the t-test. Since the point biserial correlation is equivalent to the correlation coefficient when one of the variables takes only two values (such as gender), the two tests that you have referenced are equivalent.
      Charles

  7. Hi Charles

    I am hoping you might be able to help me please.

    I am trying to get my head round using SPSS to analyse the data I have.
    I have worked out the correlation coefficient for the set of data, but now I want to see if there is a difference in the correlation score when I separate the data by gender.
    I thought I could use the Split Data option within the SPSS V26 and then get 2 correlation scores (male & female) and see if they are different.

    I have been told that this is incorrect, please would you be able to help me understand which test I should run in order to compare the correlation (by gender)?

    I look forward to hearing from you and thank you very much for your help.

    Best wishes

    Juliet

  8. Hi Charles, if I am studying the grades for two different semesters what type of t-test should I use, and can I proceed to correlation after that?

  9. Hi, if my question is “the association between the studying time and a test result”, and I have two sets of data (time and result) from the same sample group, how does the t-test help me interpret the result? Thanks!

    • Hi Christine,
      I presume that you are referring to a paired t-test with the null hypothesis that the mean of the difference between the studying time and test result is zero (i.e. the more you study the better the test results and the less you study the worse the test results). This is equivalent to a test of the correlation between studying time and test result.
      Charles

  10. Hi Charles.

    Kindly help here. My research is on the stock exchange. My null hypothesis is that using a dividend discount model does not overvalue equity stock; the alternative hypothesis is that it does overvalue stock.

    What statistical test would be best in testing the hypotheses?

    I have the stock prices for all top 100 companies for 7 years each. I have also calculated their intrinsic value using the Discounted dividend model. I have means for both stock prices and the calculated prices.

    • Hello Santie,
      If for each of the 100 companies, you had one value of stock price based on the dividend discount model and one stock value, then you could use the paired t-test.
      If you have daily prices for 7 years, then the test to use would depend on how you specify the null hypothesis.
      Charles

  11. How to solve? Kindly please help me. Thank you

    n=35
    r=0.8838
    Perform the seven-step significance test at the α = 0.05 level.

  12. Hello sir,

    Could you please help me in identifying the appropriate inferential tests for the following research questions:
    1. Is there a statistically significant difference between the means of student enrollment rates in colleges before and after the Great Recession?
    2. Is there a significant difference between the proportions of the race/ethnic groups for students enrolling before and after the Great Recession?

    I assume that I should use paired sample t test for the first question (college enrollment rates before and after the crisis, it’s like a pre- and post test).
    For the second question, I am totally confused between Chi square test and Pearson correlation coefficient.

    Thank you.

    • Hello Jeedu,
      For the first question, the answer depends on the type of data that you have. If you have the enrollment rate figures for a number of colleges before and after the Great Recession then a paired t test may be appropriate.
      The answer is the same for the second question.
      Charles

  13. Good day sir, what kind of test should I use in my study? Should it be a t-test or correlation?
    My study is about:
    Fiber Strength Evaluation of Two Plants According to their Location in the plant.

    I have 8 parameters:
    plant height
    girth of the plant’s body (base, center, top)
    number of leaves
    fiber fresh weight
    dry weight of fiber
    fiber strength from the plant’s body (base, center, top)

    • Since you are probably comparing multiple parameters, neither test is likely to be appropriate.
      In order to answer your question, first I need to know what hypothesis or hypotheses you want to test. The term “fiber strength according to their Location in the plant” is too vague.
      Charles

  14. Hello Good Evening Sir,

    This is the scenario of the question –

    A psychologist developed a new Anxiety Inventory of 25 items. The Cronbach alpha for the inventory was found to be 0.80. The inventory was administered to a group of 50 anxiety disorder patients before beginning their psychotherapy. The patients had a significantly higher mean than the normative group of normals. The same group of patients was retested after 12 weeks of psychotherapy. The post-test mean was significantly lower than the pre-test mean. The new Anxiety Inventory was correlated with an established anxiety inventory and the correlation turned out to be 0.20. The norms of the inventory were developed in terms of normalized T scores.

    My question is –
    What would be the median T score
    for the new Anxiety Inventory?

  15. Hello,

    I have a question regarding the topic. I have the variable A (scale) with which I create 3 groups (using the mean and Std., Value mean+Std. the highest one, and values inbetween the middle one).

    I would like to see if the lowest group has a different score on variable B (which only is a 10 degree Likert-scale) than the highest group or not. Can I do this, or should I run a Spearman correlation analysis between the two variables? Or should I run both tests to emphasize the results?

    Thanks in advance.

    • Sorry Gabor, but I don’t understand the scenario that you are describing. You should be able to use a t test provided the data is normally distributed. Otherwise, you could use a Mann-Whitney test.
      Charles

  16. Good evening Sir, I have a question that I want to ask. I am looking for a relationship between one independent variable with three levels and one dependent variable, which statistical tool am I going to use for this.
    The hypothesis that I want to test is;
    There is no significant relationship between the availability of laboratory equipment and academic performance.

    The availability has three levels which are adequate, inadequate and not available.
    I would be glad to hear from you sir 🙏

      • Ok, thank you.
        In my case, sir, I have a sample of 27 schools, where I want to find out the relationship between availability of laboratory equipment and academic performance. Two instruments were used for data collection: one was a questionnaire for the level of availability, and the other was a performance test with a total of 25 marks.
        The study found that 4 schools have adequate equipment, 21 schools have inadequate equipment and there was no equipment at all in 2 schools. I calculated the mean test score of each school. For the availability, I assigned 3 to schools with adequate equipment, 2 to schools with inadequate equipment and 1 to schools without equipment.
        My x variable was the level of availability in each school, and my y variable was the mean score of the school. Example below:
        X. Y
        3. 13
        2. 16
        3. 17
        1. 10
        2. 9

        After this, PPMC (Pearson product-moment correlation) was used to correlate the data. Sir, is my analysis correct, or is there any other approach I can use apart from this one? Thank you

  17. Please help me out here.
    Construct a test of four items. Assess the reliability of the test by administering it to four persons at different points or intervals. Compute the coefficient of correlation of the test. Please help; I don’t know how to go about it.

  18. Based on Example 2, where there were two groups (one taking the drug and the other not) converted to 0 and 1 in Example 1, the groups were initially the IV for the t-test statistics. But in this example, they are the y (or the DV). Will it matter if I retain the 0 and 1 as the x for the correlation test, or is it required to use the dichotomous variable as the y? Thank you!

  19. An observation to check the math:

    φ = r = 0.0214. Since r² = 0.00046, we know that 0.46%. Should it be 0.046%?
    I may not be understanding the derivation. Thanks for such a transparent teaching website.

    • Winston,
      Yes, the value should be 0.046%. Thanks for catching this typo. I have corrected the webpage to reflect the correct value.
      I appreciate your help in improving the accuracy of the Real Statistics website.
      Charles

  20. Can anyone tell me the formula for a t-test for the difference between two mean using the Pearson Product Moment Correlation Coefficient instead of calculating difference scores?

  21. Dear Dr. Zaiontz,
    I have two observers who (independently) measured the area of cartilage of a group of patients. I want to compare these measures (which are normally distributed) to show that the two can overlap. Can I run an unpaired t-test, or a correlation coefficient test using the t-test?

    Thank you for creating this website, it’s very easy to understand and also to use,
    Sincerely
    Laura

      • What I would like to say is that there is no statistical difference between measures made by observer 1 and those made by observer 2, so that the two groups of measures are very close to each other.

        Thank you so much,
        Laura

  22. Hi,
    Please advise how I can interpret the following:
    The matched pair t-test shows an insignificant difference but Pearson’s correlation gives a significant result. Please advise how to compare these results.

    • Shanta,
      The matched paired t test is different from the two independent sample t test, but the latter is equivalent to a test of Pearson’s correlation (as explained on the referenced webpage).
      Charles

  23. I am doing a study that looks at the relationship between father participation in school activities (0, 1) and change in their children’s test scores (spring to spring). If I use a t-test to calculate the difference of the means of the change in scores by group and I find that the difference is significant, should I also talk about the slope of the regression line or the correlation coefficient to get a sense of the nature of the relationship or R squared? How would I get that number?

    • Imcafee,
      I am not sure what extra information you would get by doing this, but as the referenced webpage explains, you can turn a t test into a correlation by using a dummy dichotomous variable. Once you calculate the correlation coefficient in this way, R-square is just the square of the correlation coefficient.
      Charles

  24. r(x, y) t p
    -0.78 -3.29 0.01
    -0.28 -0.86 0.41
    -1.00
    Please help me interpret this result (correlation of two variables). Thanks.

    • The first row calculates a sample correlation coefficient of -.78 and shows that the population correlation coefficient is significantly different from zero with 99% confidence
      The second row calculates a sample correlation coefficient of -.28 and cannot reject the null hypothesis that the population correlation coefficient is zero
      The third row calculates a sample correlation coefficient of -1, which means that the two samples are 100% negatively correlated
      Charles

  25. r (Correlation) t comp Tabled t Comparison Decision
    0.12 0.99 1.996 less than ?

    How to interpret this table? Kindly please help me. Thank you

  26. I have a question.

    I have two variables out of which one is continuous and the other is (artificially) dichotomous with an underlying property being continuous and normally distributed.
    I want to find the correlation coefficient between these two variables. Which will be better, the point-biserial or the biserial coefficient?

  27. Can I ask for help? Here is the data given aside from their means.
    ΣX2 = Sum of square First Scores
    ΣY2 = Sum of square Second Scores.

  28. Can I ask for help? Suppose I have only the means of x and y and the sums of their squared values. How can I compute whether they are significantly different at the 5% level of significance? Thank you.

    • Charisa,
      When you say that you want to “compute if they are significantly different”, are you referring to the means of x and y or something related to the correlation between x and y (in which case, the usual test is whether the correlation is significantly different from zero)?
      Charles

  29. Thank you so much for this helpful explanation and the worksheet. However, in the downloadable worksheet, instead of ‘sig’ the cell value will be called ‘reject’, which to me sounds like the exact opposite. Am I missing something?

    Kind regards and many thanks, Christian.

    • Christian,
      “Reject” in this context means “reject the null hypothesis,” which is equivalent to a significant result. Also, it seems that you are referring to an old version of the examples worksheet. The latest version uses “sig” instead of “reject”.
      Charles

