Efficient and accurate age calculation (in years, months, or weeks) in R given birth date and an arbitrary date

The reason lubridate appears to be making mistakes above is that you are calculating duration (the exact amount of time that occurs between two instants, where 1 year = 31536000s), rather than periods (the change in clock time that occurs between two instants).

To get the change in clock time (in years, months, days, etc) you need to use

as.period(interval(start = birthdate, end = givendate))

which gives the following output

 "37y 0m 1d 0H 0M 0S"   
 "37y 0m 0d 0H 0M 0S"   
 "36y 11m 30d 0H 0M 0S" 
 ...
 "46y 11m 30d 1H 0M 0S" 
 "47y 0m 0d 1H 0M 0S"   
 "47y 0m 1d 1H 0M 0S" 

To just extract years, you can use the following

as.period(interval(start = birthdate, end = givendate))$year
 [1] 37 37 36 53 53 52 50 50 49  1  1  0 46 47 47

Note sadly appears even slower than the methods above!

> mbm
Unit: microseconds
       expr       min        lq       mean    median         uq        max neval cld
 arithmetic   116.595   138.149   181.7547   184.335   196.8565   5556.306  1000  a 
  lubridate 16807.683 17406.255 20388.1410 18053.274 21378.8875 157965.935  1000   b

Ok, so I found this function in another post:

age <- function(from, to) {
    from_lt = as.POSIXlt(from)
    to_lt = as.POSIXlt(to)

    age = to_lt$year - from_lt$year

    ifelse(to_lt$mon < from_lt$mon |
               (to_lt$mon == from_lt$mon & to_lt$mday < from_lt$mday),
           age - 1, age)
}

It was posted by @Jim saying "The following function takes a vectors of Date objects and calculates the ages, correctly accounting for leap years. Seems to be a simpler solution than any of the other answers".

It is indeed simpler and it does the trick I was looking for. On average, it is actually faster than the arithmetic method (about 75% faster).

mbm <- microbenchmark(
    arithmetic = (givendate - birthdate) / 365.25,
    lubridate = interval(start = birthdate, end = givendate) /
        duration(num = 1, units = "years"),
    eeptools = age_calc(dob = birthdate, enddate = givendate, 
                        units = "years"),
    age = age(from = birthdate, to = givendate),
    times = 1000
)
mbm
autoplot(mbm)

enter image description here enter image description here

And at least in my examples it does not make any mistake (and it should not in any example; it's a pretty straightforward function using ifelses).

toy_df <- data.frame(
    birthdate = birthdate,
    givendate = givendate,
    arithmetic = as.numeric((givendate - birthdate) / 365.25),
    lubridate = interval(start = birthdate, end = givendate) /
        duration(num = 1, units = "years"),
    eeptools = age_calc(dob = birthdate, enddate = givendate,
                        units = "years"),
    age = age(from = birthdate, to = givendate)
)
toy_df[, 3:6] <- floor(toy_df[, 3:6])
toy_df

    birthdate  givendate arithmetic lubridate eeptools age
1  1978-12-30 2015-12-31         37        37       37  37
2  1978-12-31 2015-12-31         36        37       37  37
3  1979-01-01 2015-12-31         36        37       36  36
4  1962-12-30 2015-12-31         53        53       53  53
5  1962-12-31 2015-12-31         52        53       53  53
6  1963-01-01 2015-12-31         52        53       52  52
7  2000-06-16 2050-06-17         50        50       50  50
8  2000-06-17 2050-06-17         49        50       50  50
9  2000-06-18 2050-06-17         49        50       49  49
10 2007-03-18 2008-03-19          1         1        1   1
11 2007-03-19 2008-03-19          1         1        1   1
12 2007-03-20 2008-03-19          0         1        0   0
13 1968-02-29 2015-02-28         46        47       46  46
14 1968-02-29 2015-03-01         47        47       47  47
15 1968-02-29 2015-03-02         47        47       47  47

I do not consider it as a complete solution because I also wanted to have age in months and weeks, and this function is specific for years. I post it here anyway because it solves the problem for the age in years. I will not accept it because:

  1. I would wait for @Jim to post it as an answer.
  2. I will wait to see if someone else come up with a complete solution (efficient, accurate and producing age in years, months or weeks as desired).

I was going to leave this in the comments, but I think it's worthy of a separate answer. As @Molx points out, your "arithmetic" method is not as simple as it seems -- take a look at the code for -.Date, most importantly:

return(difftime(e1, e2, units = "days"))

Thus, the "arithmetic" method on objects of class Date is really a wrapper for the difftime function. What about difftime? This too has a bunch of overhead if what you're after is raw speed.

The key is that Date objects are stored as an integer number of days since/until Jan. 1, 1970 (though they're not actually stored as integer, hence the birth of the IDate class in data.table), so we can just subtract these and be done with it, but to avoid the -.Date method being called, we have to unclass our inputs:

(unclass(birthdate) - unclass(givendate)) / 365.25

As far as bang for your buck goes, this approach is another several orders of magnitude faster than even @Jim's age method.

Here's some more scaled-up test data:

set.seed(20349)
NN <- 1e6
birthdate <- as.Date(sprintf('%d-%02d-%02d',
                             sample(1901:2030, NN, TRUE),
                             sample(12, NN, TRUE),
                             sample(28, NN, TRUE)))

#average 30 years, most data between 20 and 40 years
givendate <- birthdate + as.integer(rnorm(NN, mean = 10950, sd = 1000))

(excluding eeptools because it is almost impossibly slower--a glance at the code for age_calc suggests the code goes as far as to create a sequence of dates for each pair of dates (O(n^2)-ish), not to mention a peppering of ifelses)

microbenchmark(
  arithmetic = (givendate - birthdate) / 365.25,
  lubridate = interval(start = birthdate, end = givendate) /
    duration(num = 1, units = "years"),
  age = age(from = birthdate, to = givendate),
  fastar = (unclass(givendate) - unclass(birthdate)) / 365.25,
  overlaps = get_age(birthdate, givendate),
  times = 50)
# Unit: milliseconds
#        expr        min         lq      mean     median         uq      max neval  cld
#  arithmetic  28.153465  30.384639  62.96118  31.492764  34.052991 180.9556    50  b  
#   lubridate  94.327968  97.233009 157.30420 102.751351 240.717065 265.0283    50   c 
#         age 338.347756 479.598513 483.84529 483.580981 488.090832 770.1149    50    d
#      fastar   7.740098   7.831528  11.02521   7.913146   8.090902 153.3645    50 a   
#    overlaps 316.408920 458.734073 459.58974 463.806255 470.320072 769.0929    50    d

Thus we also highlight the folly of benchmarking on small-scale data.

The big cost of @Jim's method is that as.POSIXlt is increasingly expensive as your vectors grow.

The issue of inaccuracy remains, but unless this accuracy is paramount, it seems the unclass method is unparalleled.

Tags:

R

Lubridate