Data Transformations

It can sometimes be useful to transform data to overcome the violation of an assumption required for the statistical analysis we want to make. Typical transformations take a random variable x and transform it into log x, 1/x, x², √x, etc.
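
As a minimal sketch of these four transformations, assuming Python's standard math module and a made-up sample (the values in x are purely illustrative and do not come from any data set on this site):

```python
import math

# Hypothetical positive sample (illustrative values only).
x = [1.0, 4.0, 9.0, 16.0, 25.0]

log_x = [math.log(v) for v in x]      # natural log; requires v > 0
recip_x = [1.0 / v for v in x]        # reciprocal; requires v != 0
square_x = [v ** 2 for v in x]        # square
sqrt_x = [math.sqrt(v) for v in x]    # square root; requires v >= 0

print(sqrt_x)  # [1.0, 2.0, 3.0, 4.0, 5.0]
```

Note that the log and reciprocal are undefined at zero and the log and square root are undefined for negative values, so the choice of transformation also depends on the range of the data.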

There is some controversy regarding the desirability of performing such transformations since often they cause more problems than they solve. Sometimes a transformation can be considered simply as another way of looking at the data. For example, sound volume is often given in decibels, which is essentially a log transformation; time to complete a task is often expressed as speed, which is essentially a reciprocal transformation; the area of a circular plot of land can be expressed as the radius, which is essentially a square root transformation.

In any case, we will see some examples in the rest of this website where transformations are desirable. See, for example, Log Transformation and Box-Cox Transformation.

An important consideration when performing transformations is that they be applied uniformly. For example, when comparing three groups of data, it would not be appropriate to apply a log transformation to one group but not to the other two.
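
One way to enforce uniform application is to route every group through a single helper. The sketch below is only an illustration: the group names, the values, and the helper transform_all are all hypothetical.

```python
import math

# Hypothetical measurements for three groups (illustrative values only).
groups = {
    "A": [1.2, 3.4, 2.2, 8.9],
    "B": [0.9, 5.1, 4.4, 12.0],
    "C": [2.5, 2.9, 7.3, 20.1],
}

def transform_all(groups, func):
    """Apply the same transformation to every value in every group."""
    return {name: [func(v) for v in values] for name, values in groups.items()}

# All three groups receive the log transformation, never just one of them.
log_groups = transform_all(groups, math.log)
```

Because the transformation is passed in once, there is no way to transform group A while leaving groups B and C untouched.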

Also, transformations should only be used to satisfy the assumptions of a test. You should avoid trying out a variety of transformations to find one that achieves a specific test result.

Reference

Howell, D. C. (2010) Statistical methods for psychology, 7th Ed. Wadsworth. Cengage Learning
https://labs.la.utexas.edu/gilden/files/2016/05/Statistics-Text.pdf

63 thoughts on “Data Transformations”

  1. Hi Charles,

    You’ve said above that transformations should not be applied in order to achieve a specific test result.

    Can I please ask whether that statement would apply to the following two scenarios :

    1) For a Pearson correlation test, if the original variables are not normally distributed according to a Shapiro-Wilk test, but a reciprocal transformation of all the variables causes them to pass the SW test, is that an acceptable use of variable transformations?

    2) If the objective is to find the closest fit to a curve, or stated differently, to minimise the standard error of the regression, is that an acceptable use of variable transformations?

    Thank you,

    Gareth

    • Hi Gareth,
      1) Yes, you can use transformations to meet the assumptions of a test.
      2) If I understand your objectives correctly, then this seems acceptable since you are not performing a test.
      Charles

  2. Dear Charles,

    I want to run an independent student’s t-test to compare the group means of British and German participants with Jamovi. However, my assumption of normality is violated for 3 of my 5 dependent variables. Is there any way I can still compare the group means given the violation?

    Thank you,
    Elisabeth

  3. Dear Sir,

    Kindly, I want to ask: my data are visual acuity scores before and after treatment, so I wanted to run a paired-samples t-test. Unfortunately, my data are not normally distributed, so I decided to use a nonparametric test instead (the Wilcoxon signed-rank test). However, my data are measured on a scale, while for Wilcoxon the DV needs to be ordinal. What do you recommend I do in this case?

    Thank you,
    Regards

      • yes
        Visual acuity data its 0.0, 0.1, 0.2, 0.3, and so on
        Contrast data its value 2.00 to 1.3 and have less

        and accommodation data its value for example from -1.50 to +1.50 mostly

        I want to check the values after and before the treatment

        Thank you

          • sorry I’ve provided this data based on my question above and it was

            “Kindly, I want to ask: my data are visual acuity scores before and after treatment, so I wanted to run a paired-samples t-test. Unfortunately, my data are not normally distributed, so I decided to use a nonparametric test instead (the Wilcoxon signed-rank test). However, my data are measured on a scale, while for Wilcoxon the DV needs to be ordinal. What do you recommend I do in this case?”

            and you asked for the type of data so I gave this “Visual acuity data its 0.0, 0.1, 0.2, 0.3, and so on
            Contrast data its value 2.00 to 1.3 and have less

            and accommodation data its value for example from -1.50 to +1.50 mostly

            I want to check the values after and before the treatment”

            I hope this clarifies my question.
            Thank you

          • You should be able to use Wilcoxon’s signed-ranks test for this type of data. If I understand correctly, you plan to perform two such tests, one for visual acuity and another for contrast. Is this correct?
            Charles

          • Yes, exactly. I will do the test for every function separately.

            So I can use Wilcoxon’s signed-ranks test for my data even though they are not ordinal, as required for Wilcoxon’s signed-ranks test, right?

          • From your previous response, I understood that your data was ordinal, and in fact numeric. To get the “ordinal” versions of your data you need to rank the numeric values, for example by using the RANK.AVG function in Excel. Thus, .2, .3, .7, .9, .1 becomes 2, 3, 4, 5, 1.
            Charles
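
The ranking described in the reply above can be sketched in a few lines of Python. The helper average_ranks is a hypothetical name written for this illustration; it ranks in ascending order and gives tied values the mean of their ranks. In Excel, RANK.AVG ranks in descending order unless its optional third argument is nonzero, so an ascending ranking like this one needs that argument set.

```python
def average_ranks(values):
    """Return 1-based ascending ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

print(average_ranks([0.2, 0.3, 0.7, 0.9, 0.1]))  # [2.0, 3.0, 4.0, 5.0, 1.0]
print(average_ranks([0.2, 0.2, 0.5]))            # [1.5, 1.5, 3.0]
```

In the signed-ranks test itself, this kind of ranking is normally applied to the absolute before/after differences of each pair rather than to the raw scores.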

          • Dear Sir Charles,

            This is a great help to me, and I appreciate it highly.
            I will try that and see how it goes.

            Thank you,
            Regards

  4. Hi Charles,

    I am hoping you can offer some advice for solving my data problem. I am using anatomical measurements for a list of species to conduct Principal Components Analysis and Discriminant Analysis. My raw data are not normally distributed. I have tried running a Box-Cox transformation followed by a z-transformation (standardization) (z-transformation to limit the effects of size of the species on the subsequent PCA and DA visual distributions) but the data are still not normal (p values are very small despite Q-Q plots looking ‘not too bad’). I’ve tried a few other transformations prior to the z-transformation (standard log, square root, dividing values by median absolute deviation) with no luck. A Mardia test for multivariate normality on the Box-Cox + z-transformed data showed a relatively high number of outliers in the dataset, as well as a number of the measurements being non-normal, but both the outlier species and non-normal measurements capture important anatomical information that I would like to keep in the dataset; the reasons for the non-normal measurements make sense. Do you have any suggestions for a transformation to try so that the data meet the requirements of normality for the PCA and DA? I realise normality isn’t super important for the visualisation side of things but want to use regularised discriminant analysis to classify unknown species into known classes and, from what I understand from reading, having the data meet the normality assumptions would be preferable.

    Thanks in advance for any advice
    S

  5. Dear Charles,

    I hope you are well. I have done a study of the bacterial communities living on tomato fruits. This study was based on the sequencing of one target gene, and the results are read counts; for example, bacterium A has 5 reads in Tomato A, 100 reads in Tomato B, 2000 reads in Tomato C, and so on. However, some bacteria have zero reads in some tomatoes, etc. If I use one-way ANOVA, how should I transform this data to be continuous? Which log?

    Many thanks in advance

    • Awad,
      Why do you need to transform the data to be continuous? Do you mean “normally distributed”?
      If you add one to all the data values you can take the log of the transformed data.
      Charles

  6. Hi Charles

    I am trying to create a multivariate regression model for consumer response to media inputs.

    I know that the response to certain media inputs takes the shape of an S-curve, and that the raw data must be transformed beforehand to fit this curve, but I am not sure how to find the constants with which to transform the data.

    Can you help?

    Kind regards
    Embeth

  7. Hi, I am doing time series research on the effects of economic growth, population, trade and energy consumption on carbon emissions.
    Do I need to transform my raw data to their natural logs?
    Can you help me with the best model and test to use?

  8. Hi Charles,
    Sir,
    how can we take the natural log transformation by adding one to the base? I ask because I am using terrorism incidents data in my study. Please try to help me with details.

    • Mohsin,
      In Excel, if the value is x, then =LN(x) is the natural log of x and =LN(x+1) is the natural log transformation after first adding one.
      Note that this is not the same as adding one to the base. For the natural log, the base is the constant e, which is calculated as EXP(1) in Excel.
      The log of x, base b is =LOG(x,b) in Excel, and so =LOG(x,EXP(1)+1) is the log of x base e+1.
      Charles
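
The distinction between adding one to the value and adding one to the base can be sketched in Python as follows. The value x = 9 is a made-up count used only for illustration, and the Excel formulas in the comments are the closest equivalents.

```python
import math

x = 9.0  # hypothetical count, e.g. incidents in one period

ln_x = math.log(x)                           # Excel: =LN(x)
ln_x_plus_1 = math.log(x + 1)                # Excel: =LN(x+1)
log_base_e_plus_1 = math.log(x, math.e + 1)  # Excel: =LOG(x,EXP(1)+1)

# math.log1p(v) computes ln(1 + v), so a count of zero maps to zero.
zero_case = math.log1p(0.0)

print(ln_x, ln_x_plus_1, log_base_e_plus_1, zero_case)
```

Adding one to the value keeps zero counts usable, since ln(0 + 1) = 0, whereas changing the base only rescales the log and still fails at zero.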

  9. I have a set of data with one independent variable and five independent samples. They violate potential assumptions including similarly shaped distributions, normality, and homoscedasticity. I can’t figure out what test to use!

  10. Hello,

    I have data that violates the assumption of a monotonic relationship for Spearman’s correlation, is it inappropriate to proceed with analysis or would I need to perform a transformation?

    Thank you.

    • Tk,
      It probably depends on why you want to use Spearman’s correlation in the first place. What are you trying to measure? Are you trying to test some hypothesis; if so what hypothesis?
      Charles

  11. Dear Charles,

    Is this always true:

    “If the transformed variable is normally distributed, then the original data are extracted from the Normal population”?

  12. Hello,

    I am currently working with a data set that violates homogeneity of variance. I am trying to run a 2 x2 mixed factorial ANOVA. My between subject variable has 2 levels with very unequal n (n1 = 435; n2 = 239). I have tried taking a split sample to compare when both n = 200, however I am still violating homogeneity of variance. I think my next step would be to transform the data, however, I am not sure what method would be appropriate. Any suggestions?

    Thank you!

    • It really depends on the details of your data. I suggest that you try the Box-Cox transformation. This subject is described on the Real Statistics website.
      Charles

  13. Hello Sir,

    Please correct me if I’m wrong. I have data on percent prothrombin, let’s say for the treated 12, 20, 28, 22, 34, 19, 27, 32 and for the untreated 34, 45, 50, 38, 41, 44, 32, 39; all values are percentages. I transformed the data using the arcsine transformation and conducted a t-test for independent samples. Did I do it right?

    thanks,

    Mike

  14. Hi Sir, I am Fauzi. I am sorry if my English is bad.
    If I have percentage data whose values range from 1% to more than 100%, what kind of transformation should I choose? Thank you, Sir.

    • Gerardo,
      Box-Cox is actually a series of transformations. One version (lambda = 0) is a log transformation. This is supported as described in the following webpage
      Power Regression
      I don’t explicitly support the other transformations (except the linear regression where lambda = 1), although I will add this in the future.
      Charles
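
The relationship between lambda and the log transformation mentioned in the reply above can be sketched with the usual one-parameter Box-Cox formula, (x^lambda - 1)/lambda for lambda not equal to 0 and log x for lambda = 0. The function name box_cox is a hypothetical helper for illustration and is not a Real Statistics function.

```python
import math

def box_cox(x, lam):
    """One-parameter Box-Cox transform of a positive value x."""
    if x <= 0:
        raise ValueError("Box-Cox requires x > 0")
    if lam == 0:
        return math.log(x)          # lambda = 0 is the log transformation
    return (x ** lam - 1) / lam     # lambda = 1 gives x - 1, a linear shift

print(box_cox(5.0, 0))  # natural log of 5
print(box_cox(5.0, 1))  # 4.0
```

As lambda approaches 0, the general formula approaches the log, which is why the log is treated as the lambda = 0 member of the family.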

  15. Sir, my data failed the assumptions of normality and independence. Am I right to transform the data to satisfy normality first before treating the independence assumption?

    • Akeem,
      If your data was not selected in such a way that each data element selected is independent of the other data elements selected, then there is nothing you can do about it (except change how you create your sample). Thus, I am not sure what you mean by “treating” the independent assumption, since it seems to be independent (no pun intended) of the order in which you “treat” the two problems (normality and independence). Perhaps you mean something different by “independent”.
      Charles

  16. Hello,

    I have a time series dataset.

    X (independent variable) is time and is denoted as 1, 2, 3, 4, 5, 6, ..., 1000, etc. Y (dependent variable) is on a percentage scale, e.g. 99%, 98.7%, 96%, 91%, etc. This is a continuous data set. I also have values of 0%, which I need to take into account when performing calculations.

    I have 1000 such data points. The first 700 data points are used as the training set and the remaining 300 are used for testing.

    I tried to use simple linear regression, but sometimes the prediction is more than 100%. The problem is even worse when I calculate the confidence interval and prediction interval.

    So I tried to use logistic regression, as there is a boundary (from 0% to 100%). But logistic regression can take only binary data. I am confused about how to appropriately convert my existing time series data so that I can try logistic regression on it.

    Would it be meaningful if I convert the existing data to log form and then do a linear regression over the transformed data? Also, I am not quite sure how to handle the zeros in the data set when performing a log transformation.

    • Hello,

      If you are worried about zero, then use the transformation log(1+x).

      Regarding how to do regression when the dependent variable is a percentage, I found this suggestion on the webpage http://www.theanalysisfactor.com/proportions-as-dependent-variable-in-regression-which-type-of-model/

      [One] approach is to treat the proportion as a censored continuous variable. The censoring means that you don’t have information below 0 or above 1… If you take this approach, you would run the model as a two-limit tobit model (Long, 1997). This approach works best if there isn’t an excessive amount of censoring (values of 0 and 1).

      Reference: Long, J.S. (1997). Regression Models for Categorical and Limited Dependent Variables. Sage Publishing.

      Charles

  17. Hello, sir. Which method of data transformation is more convenient for plant disease survey data (incidence % and severity %)?

  18. Hello! Thanks for your post!
    I have a question: is it correct to apply a different transformation to each response variable in a MANOVA test?

    Thanks in advance!

    • Hello Gabriel,
      You can apply different transformations to different variables. The important thing is that you apply the same transformation to all the sample data elements for that variable. Also keep in mind that whenever you transform data the test will apply to the transformed variable/data, and you hope to make meaningful conclusions about the original variable/data.
      Charles

  19. Hi,
    Thank you for a very useful website!

    Since you mentioned sound: I would like to fit some mixed models with sound data (in decibels) as the response term. The response term should ideally be normally distributed; can I transform the sound data to be more normally distributed?

    Anne-Lise

    • You haven’t given me enough information about the distribution of your data to give you a definitive response, but it probably relates to the fact that decibels are already a log of sound intensity. Thus it is possible that you need to use an exponential transformation, but I am only guessing here.
      Charles

    • Vanessa,
      You perform the transform on all the data elements and then perform whatever statistical test you want to make. The results of that test will apply to the transformed data, and not necessarily the original data, but in many cases you will be able to make meaningful conclusions about the population under study as well.
      Charles
