It can sometimes be useful to transform data to overcome the violation of an assumption required for the statistical analysis we want to perform. Typical transformations take a random variable x and transform it into log x, 1/x, x², √x, etc.
There is some controversy regarding the desirability of performing such transformations since often they cause more problems than they solve. Sometimes a transformation can be considered simply as another way of looking at the data. For example, sound volume is often given in decibels, which is essentially a log transformation; time to complete a task is often expressed as speed, which is essentially a reciprocal transformation; the area of a circular plot of land can be expressed as the radius, which is essentially a square root transformation.
In any case, we will see some examples in the rest of this website where transformations are desirable (see, for example, Log Transformation and Box-Cox Transformation).
An important consideration when performing transformations is that they be applied uniformly. For example, when comparing three groups of data, it would not be appropriate to apply a log transformation to one group but not to the other two.
Also, transformations should only be used to satisfy the assumptions of a test. You should avoid trying out a variety of transformations to find one that achieves a specific test result.
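The "apply uniformly" point can be sketched in Python (the group data below are purely illustrative; only the standard library is used):

```python
import math

# Three hypothetical groups of positive-valued, right-skewed data
groups = {
    "A": [1.2, 3.5, 2.1, 15.8, 4.4],
    "B": [0.9, 2.7, 11.3, 3.0, 5.6],
    "C": [2.2, 1.8, 9.9, 4.1, 3.3],
}

# Apply the SAME transformation to every group, never to just one of them
log_groups = {name: [math.log(x) for x in xs] for name, xs in groups.items()}
```

Any subsequent test is then run on the transformed values of all three groups together.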
Reference
Howell, D. C. (2010). Statistical methods for psychology (7th ed.). Wadsworth, Cengage Learning.
https://labs.la.utexas.edu/gilden/files/2016/05/Statistics-Text.pdf
Hi Charles,
You’ve said above that transformations should not be applied in order to achieve a specific test result.
Can I please ask whether that statement would apply to the following two scenarios :
1) for a pearson correlation test, if the original variables are not normally distributed according to a Shapiro-Wilk test, but a reciprocal transformation of all the variables causes them to pass the SW test, is that an acceptable use of variable transformations?
2) If the objective is to find the closest fit to a curve, or stated differently, to minimise the standard error of the regression, is that an acceptable use of variable transformations?
Thank you,
Gareth
Hi Gareth,
1) Yes, you can use transformations to meet the assumptions of a test.
2) If I understand your objectives correctly, then this seems acceptable since you are not performing a test.
Charles
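Scenario 1 can be sketched as follows (an illustrative example assuming scipy is installed; the data values are hypothetical): check the Shapiro-Wilk p-value before and after the reciprocal transformation, before running the correlation itself.

```python
from scipy.stats import shapiro

# Hypothetical right-skewed sample (e.g., task completion times)
times = [1.1, 1.3, 1.2, 1.5, 2.0, 2.4, 3.1, 4.8, 7.5, 12.0]

stat_raw, p_raw = shapiro(times)   # Shapiro-Wilk test on the raw data
speeds = [1 / t for t in times]    # reciprocal transformation
stat_tr, p_tr = shapiro(speeds)    # Shapiro-Wilk test on the transformed data
```

The decision to transform is based on the assumption check alone, not on whether the correlation afterwards turns out significant.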
Dear Charles,
I want to run an independent-samples Student’s t-test to compare the group means of British and German participants in Jamovi. However, the normality assumption is violated for 3 of my 5 dependent variables. Is there any way I can still compare the group means given the violation?
Thank you,
Elisabeth
You could use a data transformation or employ a non-parametric test. The Mann-Whitney test is a likely choice.
Charles
Dear Sir,
Kindly, I want to ask: my data consists of visual acuity scores before and after treatment, so I want to run a paired-sample t-test. Unfortunately, my data is not normally distributed, so I decided to use a nonparametric test instead (the Wilcoxon signed-rank test). However, my data is on a scale (continuous) level, while the Wilcoxon test requires an ordinal dependent variable. What do you recommend in this case?
Thank you,
Regards
Can you further clarify what type of data you have? Can you give me some examples of the data pairs?
Charles
Yes.
The visual acuity data is 0.0, 0.1, 0.2, 0.3, and so on.
The contrast data has values from 2.00 down to 1.3 and lower.
The accommodation data mostly ranges from -1.50 to +1.50.
I want to compare the values before and after the treatment.
Thank you
Sorry, but I don’t understand the data that you have provided. What is your question?
Charles
Sorry, I provided this data based on my question above about comparing visual acuity, contrast, and accommodation values before and after the treatment.
I hope this clarifies my question.
Thank you
You should be able to use Wilcoxon’s signed-ranks test for this type of data. If I understand correctly, you plan to perform two such tests, one for visual acuity and another for contrast. Is this correct?
Charles
Yes, exactly. I will do the test for each function separately.
So I can use Wilcoxon’s signed-ranks test for my data even though it is not ordinal, as required for that test, right?
From your previous response, I understood that your data is not merely ordinal, but in fact numeric. To get the “ordinal” version of your data, you need to rank the numeric values, for example by using the RANK.AVG function in Excel. Thus, .2, .3, .7, .9, .1 becomes 2, 3, 4, 5, 1.
Charles
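The ranking step described above can also be sketched in Python (assuming scipy is available; scipy's rankdata averages the ranks of tied values, similar to Excel's RANK.AVG with ascending order):

```python
from scipy.stats import rankdata

data = [0.2, 0.3, 0.7, 0.9, 0.1]
ranks = rankdata(data, method="average")  # tied values receive their average rank
print(list(ranks))  # [2.0, 3.0, 4.0, 5.0, 1.0]
```

This matches the example in the reply: the smallest value gets rank 1, the largest rank 5.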
Dear Sir Charles,
This is a great help to me, I appreciate that highly.
I will try that and see how it goes.
Thank you,
Regards
Sorry for the late reply; I didn’t notice the comment earlier.
Hi Charles,
I am hoping you can offer some advice for solving my data problem. I am using anatomical measurements for a list of species to conduct Principal Components Analysis (PCA) and Discriminant Analysis (DA). My raw data are not normally distributed.
I have tried running a Box-Cox transformation followed by a z-transformation (standardization, to limit the effects of species size on the subsequent PCA and DA visual distributions), but the data are still not normal (p-values are very small despite the Q-Q plots looking ‘not too bad’). I’ve tried a few other transformations prior to the z-transformation (standard log, square root, dividing values by the median absolute deviation) with no luck.
A Mardia test for multivariate normality on the Box-Cox + z-transformed data showed a relatively high number of outliers in the dataset, as well as a number of non-normal measurements. However, both the outlier species and the non-normal measurements capture important anatomical information that I would like to keep in the dataset; the reasons for the non-normality make sense.
Do you have any suggestions for a transformation to try so that the data meet the normality requirements of the PCA and DA? I realise normality isn’t super important for the visualisation side of things, but I want to use regularised discriminant analysis to classify unknown species into known classes, and from what I understand from my reading, meeting the normality assumptions would be preferable.
Thanks in advance for any advice
S
Hello Sarah,
It looks like you have tried all the usual approaches to transform the data to normality. I don’t know of another approach.
Charles
Dear Charles,
I hope you are well. I have done a study of the bacterial communities living on tomato fruits. The study is based on sequencing one target gene, and the results are read counts; for example, bacterium A has 5 reads in Tomato A, 100 reads in Tomato B, 2000 reads in Tomato C, and so on. However, some bacteria have zero reads in some tomatoes. If I use one-way ANOVA, how do I transform this data to be continuous? What log should I use?
Many thanks in advance
Awad,
Why do you need to transform the data to be continuous? Do you mean “normally distributed”?
If you add one to all the data values you can take the log of the transformed data.
Charles
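The add-one-then-log idea can be sketched in Python (standard library only; the read counts are the hypothetical ones from the question, plus a zero):

```python
import math

reads = [5, 100, 2000, 0, 37]                   # read counts, including a zero
log_reads = [math.log(r + 1) for r in reads]    # log(x + 1) handles the zeros
# math.log1p(r) computes the same quantity with better precision near zero
```

A zero count maps to log(1) = 0, so the transformation is defined everywhere.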
Hi Charles
I am trying to create a multivariate regression model for consumer response to media inputs.
I know that the response to certain media inputs takes the shape of an S-curve, and that the raw data must be transformed beforehand to fit this curve, but I am not sure how to find the constants with which to transform the data.
Can you help?
Kind regards
Embeth
Are you looking to use logistic regression? The output takes the shape of an S-curve.
Charles
Hi, I am doing time series research on the effects of economic growth, population, trade, and energy consumption on carbon emissions.
Do I need to transform my raw data to natural logs?
Can you help me with the best model and test to use?
Jessica,
This depends on many factors and there are no easy answers. To start with, you need to decide on what hypotheses you want to test.
Charles
Hi Charles,
sir!
How can we take a natural log transformation by adding one in the base? I am using terrorism incident data in my study. Please help me with the details.
Mohsin,
In Excel if the value is x, then =LN(x) is the natural log of x and =LN(x+1) is the natural log transformation first adding one.
Note that this is not the same as adding one to the base. For the natural log, the base is the constant e, which is calculated as EXP(1) in Excel.
The log of x base b is =LOG(x,b) in Excel, so =LOG(x,EXP(1)) is the log of x base e, while =LOG(x,EXP(1)+1) would be the log of x base e+1.
Charles
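The Excel formulas above can be checked in Python (a quick sketch; the value of x is just for illustration):

```python
import math

x = 10.0

ln_x        = math.log(x)              # =LN(x):        natural log of x
ln_x_plus_1 = math.log(x + 1)          # =LN(x+1):      natural log after adding 1
log_base_e  = math.log(x, math.e)      # =LOG(x,EXP(1)): same value as LN(x)
log_base_e1 = math.log(x, math.e + 1)  # log of x, base e+1 (a different quantity)
```

Adding one to the data before taking the log and changing the base of the log give different results, as the last two lines show.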
I have a set of data with one independent variable and five independent samples. They violate potential assumptions including similarly shaped distributions, normality, and homoscedasticity. I can’t figure out what test to use!
Lindsey,
What hypothesis (or hypotheses) are you trying to test?
Charles
hi
My incidence and severity data are non-normal, so which type of transformation is preferred?
There is no single answer. It depends on the data. See also
https://real-statistics.com/correlation/box-cox-transformation/
Charles
Hello,
I have data that violates the assumption of a monotonic relationship for Spearman’s correlation. Is it inappropriate to proceed with the analysis, or would I need to perform a transformation?
Thank you.
Tk,
It probably depends on why you want to use Spearman’s correlation in the first place. What are you trying to measure? Are you trying to test some hypothesis; if so what hypothesis?
Charles
Dear Charles,
Is this always true:
“If the transformed variable is normally distributed, then the original data are extracted from the Normal population”?
No, this is not true. If the original data was extracted from a normal population, you wouldn’t need to make a transformation.
Charles
Hello,
I am currently working with a data set that violates homogeneity of variance. I am trying to run a 2 × 2 mixed factorial ANOVA. My between-subjects variable has 2 levels with very unequal n (n1 = 435; n2 = 239). I have tried taking a split sample so that both n = 200, but I am still violating homogeneity of variance. I think my next step would be to transform the data; however, I am not sure what method would be appropriate. Any suggestions?
Thank you!
It really depends on the details of your data. I suggest that you try the Box-Cox transformation. This subject is described on the Real Statistics website.
Charles
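The Box-Cox suggestion can be sketched as follows (assuming scipy is installed; the data values are illustrative and must be strictly positive for Box-Cox):

```python
import numpy as np
from scipy.stats import boxcox

# Hypothetical positive, right-skewed data
x = np.array([1.2, 0.8, 3.5, 2.1, 15.8, 4.4, 0.9, 2.7, 11.3, 3.0])

# boxcox both transforms the data and estimates the lambda
# that maximizes the likelihood of normality
transformed, lam = boxcox(x)
```

The same fitted lambda would then be applied to every group in the design, per the uniformity point in the article above.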
Hello Sir,
Please correct me if I’m wrong. I have data on percent prothrombin: say, for the treated group 12, 20, 28, 22, 34, 19, 27, 32 and for the untreated group 34, 45, 50, 38, 41, 44, 32, 39, all values in percentages. I transformed it using the arcsine transformation and conducted a t-test for independent samples. Did I do it right?
thanks,
Mike
Mike,
Why did you decide to use the arcsin transformation? The data is already reasonably normally distributed.
Charles
Hi Sir, I am Fauzi. I am sorry if my English is bad.
If I have percentage data whose distribution ranges from 1% to more than 100%, what kind of transformation should I choose? Thank you, Sir.
Fauzi,
Sorry, but I don’t understand your question.
Charles
I’m sorry. OK, I will do an ANOVA test. If my data lies within the range 1% to 150%, does this data need to be transformed?
Sorry, not an ANOVA test, but multiple regression, and one of my independent variables has a data range like that.
Fauzi,
You haven’t provided enough information for me to know whether this data needs to be transformed.
Charles
Dr. Charles, good evening. Please excuse my English. Can I do the Box-Cox transformation in Real Statistics?
Gerardo,
Box-Cox is actually a series of transformations. One version (lambda = 0) is a log transformation. This is supported as described in the following webpage
Power Regression
I don’t explicitly support the other transformations (except the linear regression where lambda = 1), although I will add this in the future.
Charles
Sir, my data failed the assumptions of normality as well as independence. Am I right to transform the data to satisfy normality first, before treating the independence assumption?
Akeem,
If your data was not selected in such a way that each data element selected is independent of the other data elements selected, then there is nothing you can do about it (except change how you create your sample). Thus, I am not sure what you mean by “treating” the independent assumption, since it seems to be independent (no pun intended) of the order in which you “treat” the two problems (normality and independence). Perhaps you mean something different by “independent”.
Charles
Hello,
I have a time series dataset.
The X (independent variable) is time, denoted 1, 2, 3, 4, 5, 6, … 1000, etc. The Y (dependent variable) is on a percentage scale: 99%, 98.7%, 96%, 91%, etc. This is a continuous data set. I also have 0% values, which I need to take into account when performing calculations.
I have 1000 such data points. The first 700 data points are used as the training set and the remaining 300 for testing.
I tried to use simple linear regression but when predicting sometimes the prediction is more than 100%. And the case is even worse when I calculated the confidence interval and prediction interval.
So I tried to use logistic regression, since there is a boundary (from 0% to 100%). But logistic regression can take only binary data, and I am confused about how to appropriately convert my existing time series data so that I can try logistic regression on it.
Would it be meaningful to convert the existing data to log form and then do a linear regression on the transformed data? Also, I am not quite sure how to handle the zeros in the data set when performing a log transformation.
Hello,
If you are worried about zeros, then use the transformation log(1+x).
Regarding how to do regression when the dependent variable is a percentage, I found this suggestion on the webpage http://www.theanalysisfactor.com/proportions-as-dependent-variable-in-regression-which-type-of-model/
[One] approach is to treat the proportion as a censored continuous variable. The censoring means that you don’t have information below 0 or above 1… If you take this approach, you would run the model as a two-limit tobit model (Long, 1997). This approach works best if there isn’t an excessive amount of censoring (values of 0 and 1).
Reference: Long, J.S. (1997). Regression Models for Categorical and Limited Dependent Variables. Sage Publishing.
Charles
Thanks Charles.
That helped a lot!!
Hello, sir. Which method of data transformation is most convenient for plant disease survey data (incidence % and severity %)?
Hello! Thanks for your post!
I have a question: is it correct to apply a different transformation to each response variable in a MANOVA test?
Thanks in advance!
Hello Gabriel,
You can apply different transformations to different variables. The important thing is that you apply the same transformation to all the sample data elements for that variable. Also keep in mind that whenever you transform data the test will apply to the transformed variable/data, and you hope to make meaningful conclusions about the original variable/data.
Charles
Hello! Is it acceptable to standardize variables that have already been (square root) transformed? Thank you!
I don’t see why not, although it really depends on what you will do with the data afterwards.
Charles
Thank you!
Hi,
Thank you for a very useful website!
Since you mentioned sound: I would like to fit some mixed models with sound data (in decibels) as the response term. The response should ideally be normally distributed; can I transform the sound data to be more normally distributed?
Anne-Lise
You haven’t given me enough information about the distribution of your data to give you a definitive response, but it probably relates to the fact that decibels are already a log of sound intensity. Thus it is possible that you need to use an exponential transformation, but I am only guessing here.
Charles
I’m still unclear on how to apply the transformation function. Can you provide the steps which do this? Thanks!
Vanessa,
You perform the transform on all the data elements and then perform whatever statistical test you want to make. The results of that test will apply to the transformed data, and not necessarily the original data, but in many cases you will be able to make meaningful conclusions about the population under study as well.
Charles
Could you please explain how I can use Excel for data transformations?
Mahmoud,
Essentially you apply the transformation function to all the data and then test the transformed data.
Sometimes you apply a reverse transformation as well (e.g., when using the Fisher transformation; see https://real-statistics.com/correlation/one-sample-hypothesis-testing-correlation/).
Charles
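The Fisher transformation and its reverse can be sketched in Python (standard library only; r is an illustrative correlation value):

```python
import math

r = 0.65             # sample correlation coefficient
z = math.atanh(r)    # Fisher transformation: z = 0.5 * ln((1 + r) / (1 - r))
r_back = math.tanh(z)  # reverse transformation recovers the original r
```

A confidence interval is typically computed on the z scale and then mapped back to the r scale with tanh.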