Basic Concepts
The goal of the Kaplan-Meier procedure is to create an estimator of the survival function based on empirical data, taking censoring into account.
The procedure makes the assumption that censoring does not change the probability of survival (e.g. it assumes that patients won’t leave the clinical trial because they have a relapse of their cancer).
We suppose that we divide time into the following intervals t0 < t1 < ⋯ < tm where t0 is the start time and tm is the time when the study ends. In fact, the time periods that we will use are those corresponding to death or censoring events (i.e. no one dies or is censored between times tj and tj+1).
For a time tj, the risk set Rj is the set of all subjects who (or which) survive to the time just before time tj. Thus the risk set consists of all those who die at time tj or who are censored or die after time tj. We further define nj = the number of subjects at time tj and dj = the number who die at time tj. Thus, the survival time function S(tj) can be calculated iteratively as follows:
Example
Example 1: Calculate the empirical survival function based on the data for the clinical trial shown in range A3:B21 of Figure 1. Here column A shows how many years each patient stayed in the clinical trial. Patients with a 1 in column B have died during the clinical trial, while patients with a 0 in column B are censored, i.e. either the patient was still alive at the end of the trial or left the trial before it was completed.
Figure 1 – Kaplan-Meier Method
The actual calculations are shown on the right side of Figure 1. First, we note that the values in column D are the unique values shown in column A in sorted order. These values may be obtained by using Excel’s Data > Data Tools|Remove Duplicates followed by Data > Sort & Filter|Sort. Alternatively, it can be obtained by using the Real Statistics function SortUnique. We can leave out the final two values 14 and 15 since they don’t have any death values associated with them.
Figure 2 shows some key formulas from Figure 1
Cells | Entity | Formula |
E5 | d | =COUNTIFS($A$4:$A$21,D5,$B$4:$B$21,1) |
F4 | n | =COUNTIF($A$4:$A$21, “>”&D4-1) |
G5 | 1−d/n | =1-E5/F5 |
H4 | S(0) | 1 |
H5 | S(2) | =H4*G5 |
Figure 2 – Key formulas from Figure 1
Once we have entered formula E5, we can obtain all the other formulas for column E by highlighting range E5:E12 and then pressing Ctrl-D. We can obtain all the formulas for columns F, G, and H in a similar manner.
Observation: Based on the above definitions, for any t, tk ≤ t <tk+1 for k = 1, …, m-1
S(t) = 1 for t < t1. If tm is a censored time, then S(t) = 0 for t ≥ tm. Otherwise, S(t) is undefined for any t > tm.
Also note that if there are no censored data, then nj+1 = nj − dj. Thus since n1 = n, it follows that
Median Survival Time
The median survival time is the time t such that S(t) = .5. When no such t exists, we take the least t such that S(t) ≤ .5.
For Example 1, we see from Figure 1 that the median is between t = 10 and t = 11 since S(10) = .54 and S(11) = .36. Based on the definition we take 11 as the median.
Examples Workbook
Click here to download the Excel workbook with the examples described on this webpage.
References
NCSS (2015) Kaplan-Meier curves (Logrank tests)
https://www.ncss.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Kaplan-Meier_Curves-Logrank_Tests.pdf
Samartzis, L. (2006) Survival and censored data
https://datawisdom.ca/supervision/2005_LafterisSamartzis.pdf
Dear Charles,
In my dataset of 76 patients, 41 patients have dies till the end of follow up period. 35 patients are surviving. The median survival calculated by Kaplan Meier method would calculate the 38th (76/2) patient or 20th (41/2) patient?
Hello,
“The median survival is the smallest time at which the survival probability drops to 0.5 (50%) or below. If the survival curve does not drop to 0.5 or below then the median time cannot be computed.”
See https://www.medcalc.org/manual/kaplan-meier.php#:~:text=The%20median%20survival%20is%20the,median%20time%20cannot%20be%20computed.
Charles
Thanks a lot
Hi Charles,
Thank you for this, I am having an issue when using the formula to calculate L, it for some reason does not populate the numbers as anticipated and when using the formula: only gives me a column of 0’s, any idea as to why this might be?
Sagar
Hi Sagar,
Sorry, but I don’t know which formula you are using to calculate L.
Some of the worksheet formulas are array formulas, in which case you can’t simply press Enter (except if using Excel 365). I don’t know if this is the problem you are having, but in this case, see the following webpage https://real-statistics.com/array-formulas-functions/
Charles
Thank you for your article. I noticed that instead of 1-d/n, SPSS may be using (n-d-c)/(n-d), where c=number of censored observations at that time. Do you know when one version would be preferred over the other? Thanks.
Hi Nathan,
The formula that I use is 1 – d_j / n_j, and so the value of n_j changes (gets smaller at each subsequent time period).
The formula (n-d-c)/(n-d) is equivalent to 1 – c/(n-d). I believe that this value of c is the same as the d_j value in the formula that I am using. I am not familiar with the SPSS formula and so don’t know whether the n-d in that formula is the same as the n_j in the formula that I am using, but I wouldn’t be surprised to learn that it is.
Charles
One can get a median survival time from your notations. How does one obtain the 95% CI for the median survival time? Could you elaborate on that?
Hello Rudra,
The following webpage describes how to calculate confidence intervals at points in time:
https://www.real-statistics.com/survival-analysis/kaplan-meier-procedure/confidence-interval-for-the-survival-function/
You probably are looking for something else. The following webpages describe various methods for obtaining CIs for median survival times:
http://www.jstor.org/stable/2530286
http://www.barkerstats.com/PDFs/tas.pdf
Charles
Dear Charles,
I would like to make a time to event analysis by use of a Kaplan-Meier curve. I have a time variable (0-180 days), ill versus not ill, and a variable of presence of back pain.
Can you explain to me how i can plot this KM curve in excel in steps, with two plots over time?
thanks
Yours,
Daan
Daan,
How would you like the plot in https://real-statistics.com/survival-analysis/kaplan-meier-procedure/survival-curve/ to be modified for this purpose?
Charles
Hi Charles,
Thank you for the excel tool for calculating Kaplan Meier.
I want to calculate the curve for a cohort of 16 patients of with different follow up times and of which 2 died, 1 at 13months and one at 33 months. However, i have follow up of several patients at 40 (alive). When calculating the Kaplan Meier curve, it ends abruptly at 33 months with a vertical stroke. I would expect a final horizontal stroke up to 40months as the amount of dead patients remains stable up until then?
Instead, in the calculation in column “t” the last line is 33…
Could you help me with this?
Thanks,
Emmanuel
Emmanuel,
If you email me an Excel file with your data and results, I will try to figure out why this is happening.
Charles
Dear Mr Zaiontz,
I am a university student and I have been asked to create a Kaplan-Meier survival curves for 4 different hypothetical cohorts. But I am only given the survival data (in days at which the animal was killed). I am not given any other data. How could I calculate this information to create the curves?
Thank you for your time.
Hello George,
This is described on this webpage. Do you have any specific questions?
Charles
Hi,
Could you provide the link for exactly where you’ve covered George’s question, please?
Ben,
This webpage, namely
Kaplan-Meier Overview
Charles
Hi Charles,
I’m trying to calculate median survival time from my data of GPS tagged griffon vultures (which are checked at daily intervals), but the last S(t) value is 0.70, after slightly more than 900 days. How can I (let’s say) extrapolate/interpolate survival values as to derive median survival times?
Thanks a lot
regards
mario posillico
Hello Mario,
I don’t know how to determine the median if more than 50% of the subjects are still “alive”. Perhaps you could extrapolate/interpolate based on the existing “deaths” (perhaps by using a spline fit), but I don’t have any experience doing this.
Charles
I use the same data to make a Kaplan-Meier curve with your formula and with graphPad Prism 8 but the results are different. I don’t understand why.
Hello Giada,
If you email an Excel file with your data and results as well as the results you got from graphPad I will try to figure out why the results are different.
Charles
I think the problem is that my “t” are not “1” “2” “3” etc but “0.564” “1.234” “3.543” etc, so the formula “&D4-1” is not correct for my case!Thanks a lot for these very useful pages.
Thank you very much.
I’m having a bit of trouble- in some cases I get a negative result for 1-d/n when the number of dead is larger than the number of remaining individuals. What should I do in this case? (For example d=20 n=5 1-d/n= -3 )
Thank!!
Sorry, I think I know what the problem was.
I was supposed to write the number of remained individuals in the next row (next t), and not in the same row as the number of individuals who died at the same t.
Hi Charles
Is it possible to re produce the above table in Power BI and how, i can get all the way up to H5 S(2) =H4*G5 which i cannot do…
Hi,
I have not used Power BI and so don’t know how to do this. Perhaps some other reader knows how to do this.
Charles
Hi Charles,
I’ve managed to successfully follow your steps until column F. When I enter in the formula for F4, the result is #ERROR and formula parse error. I am working in google sheets. The exact formula I’m using is =COUNTIF($A$4:$A$157, “>”&D4-1)
My table is formatted the same as your example however I have more rows. Let me know if you have any idea why this is happening as I can’t figure it out.
Hi Cassie,
I am not able to explain why you are getting an error without seeing the spreadsheet. I also dont know whether Google’s implementation is different from the one in Excel.
Charles
I had an error issue too, but by deleting and manually re-adding the quotation marks, it was fixed. I think, if you copied and pasted the formulas from the web to excel you will possibly also need to do this due to character differences in the encoding.
Hi Time,
Thanks for sharing this issue.
Charles
I kept getting 0 when I enter =COUNTIF($A$4:$A$21, “>”&D4-1)
Emma,
I don’t know of any reason why this would happen. If you look at the Regression 2 examples workbook (which you can download for free), you will se that the value returned by this formula is 18.
If you have recreated the formula on your own spreadsheet, then if you send me an Excel file containing your spreadsheet I will try to figure out why you aren’t getting a value of 18.
Charles
This is becuase there’s a space between the “>” and the & which isn’t in the formula if you copy and paste
Hi
If out of 18 participants, 12 died then why do we have 4 people left at the end?
Regards
Aqif
Aqif,
Actually, the value of n-d in the last week in the table represents the number of people left at the end of the study. In particular, the value n = 4 in cell F12 does not represent the number of people who are left at the end, but instead it represents the number of people left at the beginning of week 12. If the rows had continued, at time t = 13, there would be n-d = 4-2 = 2 people left. Neither of these two people die during the study, but are still alive after 12 weeks.
Even if no one had died at week t = 12, the value of n would not be equal to the number of participants minus the number who died. This is clear since in previous weeks the value of n is decreased by both people who died and people who did not die.
Charles
Hi Charles
Thanks for your reply. I really appreciate that.
I am still confused about number left at the end as it goes out of sync at Year 8. Let me explain
– By year 3 one death had already happened therefore n=17
– By year 5 two deaths had already happened therefore n=16
– By year 8 five deaths had already happened therefore through previous logic n should be 13 instead of 12.
Can you please describe how you come to n=12 at year 8 instead of 13.
Regards
Aqif
Aqif,
The value is 12 because there are 4 cells with value 5 in column A (including both deaths and non-deaths), i.e. 16-4 = 12.
The value should be calculated by =COUNTIF($A$4:$A$21, “>”&D8-1) per Figure 2.
Charles
can you teach me how to draw graph for kaplan meier and log rank ? thank you
What type of graph do you want and what variables should be included in the graphs?
Charles
Hey Mr Charles. I am actually dealing with an exercise where I have 2 treatment (A andB) which are used to treat cancer. I actually can’t follow how to deal with 2 treatments since in your example you just used one. Here is my exercise in case you can help.
In a clinical trial of 48 patients 2 treatments are compared: A (20) and B(28). The following events happen:
14-A patients die from cancer after 18,25,30,34,35,36,38,39,43,60,86,130,520 and 819 weeks respectively.
1-A patient dies but not from cancer after 121 weeks.
5-A patients are still alive at last follow up after 68,77,82,324 and 546 respectively.
24-B patients die from cancer after 13,17,22,26,27,31,40,44,47,48,61,65,71,78,83,95,108,123,186,481,568,685,711 and 1105 weeks respectively.
1-B patient dies but not from cancer after 88 weeks.
3-B patients are still alive at last follow up after 37,69 and 213 weeks. I am then asked to construct the survival curves for both groups and use logrank test to test whether there is a significant (alpha=0.05) difference in survival between the two groups.
If you can please share some ideas that can help me so,ve that problem.
Thanks
Hello Lois,
On the following webpage I show how to compare two treatments
https://real-statistics.com/survival-analysis/kaplan-meier-procedure/log-rank-test/
I hope this helps.
Charles
What does the function S(t) exactly represent?
Josh,
See https://real-statistics.com/survival-analysis/survival-analysis-basic-concepts/
Charles
Hi,
i don’t understand how you calculate t(D4) and d(E5) collumns? Maybe there is some function?
Thank you
Teodor,
D4 is just the initial time value of 0.
The formula for E5 is shown in Figure 2.
Charles
I have the same problem as above. It keeps asking for a password and says that it can’t find project or library.
Carin,
See https://real-statistics.com/appendix/faqs/password-prompt/
Charles
Why F8 is 12 instead of 13? Thanks.
The formula in cell F8 is =COUNTIF($A$4:$A$21, “>”&D8-1), which yields the value 12. All the cells in column F use a similar formula (with only D8 in the formula changing).
Charles
I am thinking to apply survival analysis in the field of equipment leases. From the time an equipment is given on lease , till the point the lease expiries. Can I use survival analysis to identify early termination of leases ?. Is it required that the starting point of the lease should be same as the study start date ?. Can I consider leases originating after the study has started ?.
Hari,
Yes, it seems like you could use survival analysis in this context. The leases don’t need to start at the same time and some can start after the study start date.
Charles
Figured it out. The quotation marks in your formula are not the same ones excel uses, so it worked when I typed it out. Also, cell F8 makes sense now. Each n is subtracting previous deaths and censored data points. Thanks.
When I type in the formula for “n”, the syntax used is not accepted by Excel.
This is what I type in for it to work:
=COUNTIF($A$4:$A$21,”>”&(D4-1))
Note that:
1. There is no space between the comma and “<"
2. D4-1 is enclosed within a separate set of parentheses
Perhaps others don't have this problem, but wanted to share in case others struggled like me!
I can’t get the function in column F to work in Excel. It tells me that it “can’t find project or library”. Any idea why this is happening?
Also, shouldn’t F8=13, since 16-3=13? It says 12 instead.