More Kaplan-Meier curves in R

Steve Simon

2014-11-01

I found a larger data set and wanted to show how you could use the Kaplan Meier curves as a preliminary screen of some categorical and continuous variables in a larger and more complex data set.

I am building on a previous blog entry and this referred to an earlier webpage. Here is the R code. The comment section at the top is a cut-and-paste from the documentation of the data file.

# Some simple examples of survival analysis
# The data set is a study of primary biliary cirrhosis
# Variables:
# V1  case number
# V2  number of days between registration and the earlier of death,
#     transplantion

Read the data in from the website. The first 60 rows are documentation about the file and the actual data starts on line 61.

fn <- "http://lib.stat.cmu.edu/datasets/pbc"
pbc <- read.table(fn,header=FALSE,skip=60)
head(pbc)
  V1   V2 V3 V4    V5 V6 V7 V8 V9 V10  V11 V12  V13 V14    V15    V16 V17 V18  V19 V20
1  1  400  2  1 21464  1  1  1  1 1.0 14.5 261 2.60 156 1718.0 137.95 172 190 12.2   4
2  2 4500  0  1 20617  1  0  1  1 0.0  1.1 302 4.14  54 7394.8 113.52  88 221 10.6   3
3  3 1012  2  1 25594  0  0  0  0 0.5  1.4 176 3.48 210  516.0  96.10  55 151 12.0   4
4  4 1925  2  1 19994  1  0  1  1 0.5  1.8 244 2.54  64 6121.8  60.63  92 183 10.3   4
5  5 1504  1  2 13918  1  0  1  1 0.0  3.4 279 3.53 143  671.0 113.15  72 136 10.9   3
6  6 2503  2  2 24201  1  0  1  0 0.0  0.8 248 3.98  50  944.0  93.00  63   . 11.0   3

The data looks like it read in properly

Now that’s strange. There is an object in the survival library that has the same name (pbc). Let’s investigate.

> rm(pbc)
> head(pbc)
  id time status trt      age sex ascites hepato spiders edema bili chol albumin copper alk.phos
1  1  400      2   1 58.76523   f       1      1       1   1.0 14.5  261    2.60    156   1718.0
2  2 4500      0   1 56.44627   f       0      1       1   0.0  1.1  302    4.14     54   7394.8
3  3 1012      2   1 70.07255   m       0      0       0   0.5  1.4  176    3.48    210    516.0
4  4 1925      2   1 54.74059   f       0      1       1   0.5  1.8  244    2.54     64   6121.8
5  5 1504      1   2 38.10541   f       0      1       1   0.0  3.4  279    3.53    143    671.0
6  6 2503      2   2 66.25873   f       0      1       0   0.0  0.8  248    3.98     50    944.0
     ast trig platelet protime stage
1 137.95  172      190    12.2     4
2 113.52   88      221    10.6     3
3  96.10   55      151    12.0     4
4  60.63   92      183    10.3     4
5 113.15   72      136    10.9     3
6  93.00   63       NA    11.0     3

Wow! The data set I was importing was already part of the survival library. Let’s use their version

We have to subset the data because the first 312 observations represent a randomized trial and the remaining observations are an add-on study of baseline values for some non-randomized patients.

> tail(pbc)
     id time status trt      age sex ascites hepato spiders edema bili chol albumin copper
413 413  989      0  NA 35.00068   f      NA     NA      NA     0  0.7   NA    3.23     NA
414 414  681      2  NA 67.00068   f      NA     NA      NA     0  1.2   NA    2.96     NA
415 415 1103      0  NA 39.00068   f      NA     NA      NA     0  0.9   NA    3.83     NA
416 416 1055      0  NA 56.99932   f      NA     NA      NA     0  1.6   NA    3.42     NA
417 417  691      0  NA 58.00137   f      NA     NA      NA     0  0.8   NA    3.75     NA
418 418  976      0  NA 52.99932   f      NA     NA      NA     0  0.7   NA    3.29     NA
    alk.phos ast trig platelet protime stage
413       NA  NA   NA      312    10.8     3
414       NA  NA   NA      174    10.9     3
415       NA  NA   NA      180    11.2     4
416       NA  NA   NA      143     9.9     3
417       NA  NA   NA      269    10.4     3
418       NA  NA   NA      350    10.6     4
> pmod <- pbc[1:312,]
> tail(pmod)
     id time status trt      age sex ascites hepato spiders edema bili chol albumin copper
307 307 1149      0   2 30.57358   f       0      0       0     0  0.8  273    3.56     52
308 308 1153      0   1 61.18275   f       0      1       0     0  0.4  246    3.58     24
309 309  994      0   2 58.29979   f       0      0       0     0  0.4  260    2.75     41
310 310  939      0   1 62.33265   f       0      0       0     0  1.7  434    3.35     39
311 311  839      0   1 37.99863   f       0      0       0     0  2.0  247    3.16     69
312 312  788      0   2 33.15264   f       0      0       1     0  6.4  576    3.79    186
    alk.phos ast trig platelet protime stage
307     1282 130   59      344    10.5     2
308      797  91  113      288    10.4     2
309     1166  70   82      231    10.8     2
310     1713 171  100      234    10.2     2
311     1050 117   88      335    10.5     2
312     2115 136  149      200    10.8     2

Take a look at the two key variables needed to define the survival object. Dividing by 365.25 gives the times in years rather than in days.

> summary(pmodtime)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     41    1191    1840    2006    2697    4556 
> summary(pmodtime/365.25)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1123  3.2610  5.0360  5.4930  7.3850 12.4700 
> table(pmodstatus)

  0   1   2 
168  19 125

The values for status

First

Now let’s define some functions that will calculate and compare Kaplan-Meier curves across all the possible covariates in the model. For categorical variables

Here is the output.

v.cat <- names(pmod)[c(4,6:10,20)]
v.con <- names(pmod)[c(5,11:19)]
for (v in v.cat) {
  km.cat(v)
}
## [1] "trt"
## 
##   1   2 
## 158 154 
## [1] "There are 0 missing values."

## [1] "sex"
## 
##   m   f 
##  36 276 
## [1] "There are 0 missing values."

## [1] "ascites"
## 
##   0   1 
## 288  24 
## [1] "There are 0 missing values."

## [1] "hepato"
## 
##   0   1 
## 152 160 
## [1] "There are 0 missing values."

## [1] "spiders"
## 
##   0   1 
## 222  90 
## [1] "There are 0 missing values."

## [1] "edema"
## 
##   0 0.5   1 
## 263  29  20 
## [1] "There are 0 missing values."

## [1] "stage"
## 
##   1   2   3   4 
##  16  67 120 109 
## [1] "There are 0 missing values."

for (v in v.con) {
  km.con(v)
}
## [1] "age"
## quartiles
##    (0,42.2] (42.2,49.8] (49.8,56.7] (56.7,78.4] 
##          78          78          78          78 
## [1] "There are 0 missing values."

## [1] "bili"
## quartiles
##     (0,0.8]  (0.8,1.35] (1.35,3.42]   (3.42,28] 
##          90          66          78          78 
## [1] "There are 0 missing values."

## [1] "chol"
## quartiles
##        (0,250]      (250,310]      (310,400] (400,1.78e+03] 
##             71             71             72             70 
## [1] "There are 28 missing values."

## [1] "albumin"
## quartiles
##    (0,3.31] (3.31,3.55]  (3.55,3.8]  (3.8,4.64] 
##          81          76          81          74 
## [1] "There are 0 missing values."

## [1] "copper"
## quartiles
##  (0,41.2] (41.2,73]  (73,123] (123,588] 
##        78        80        75        77 
## [1] "There are 2 missing values."

## [1] "alk.phos"
## quartiles
##             (0,872]      (872,1.26e+03] (1.26e+03,1.98e+03] 
##                  78                  78                  78 
## (1.98e+03,1.39e+04] 
##                  78 
## [1] "There are 0 missing values."

## [1] "ast"
## quartiles
##   (0,80.6] (80.6,115]  (115,152]  (152,457] 
##         79         78         79         76 
## [1] "There are 0 missing values."

## [1] "trig"
## quartiles
##   (0,84.2] (84.2,108]  (108,151]  (151,598] 
##         71         71         70         70 
## [1] "There are 30 missing values."

## [1] "platelet"
## quartiles
##   (0,200] (200,257] (257,322] (322,563] 
##        77        77        77        77 
## [1] "There are 4 missing values."

## [1] "protime"
## quartiles
##      (0,10]   (10,10.6] (10.6,11.1] (11.1,17.1] 
##          89          85          61          77 
## [1] "There are 0 missing values."

You can find an earlier version of this page on my blog.