Efficient and accurate age calculation (in years, months, or weeks) in R given birth date and an arbitrary date
The reason lubridate appears to be making mistakes above is that you are calculating durations (the exact amount of time that elapses between two instants, where 1 year = 31536000s) rather than periods (the change in clock time that occurs between two instants).
To get the change in clock time (in years, months, days, etc) you need to use
as.period(interval(start = birthdate, end = givendate))
which gives the following output
"37y 0m 1d 0H 0M 0S"
"37y 0m 0d 0H 0M 0S"
"36y 11m 30d 0H 0M 0S"
...
"46y 11m 30d 1H 0M 0S"
"47y 0m 0d 1H 0M 0S"
"47y 0m 1d 1H 0M 0S"
To just extract years, you can use the following
as.period(interval(start = birthdate, end = givendate))$year
[1] 37 37 36 53 53 52 50 50 49 1 1 0 46 47 47
Note that this sadly appears to be even slower than the methods above!
> mbm
Unit: microseconds
       expr       min        lq       mean    median         uq        max neval cld
 arithmetic   116.595   138.149   181.7547   184.335   196.8565   5556.306  1000  a
  lubridate 16807.683 17406.255 20388.1410 18053.274 21378.8875 157965.935  1000   b
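To make the duration/period distinction concrete, here is a minimal sketch for a single pair of dates taken from the example data. The variable names (span, age_duration, and so on) are my own, and the %/% idiom for counting whole periods was not part of the benchmark above, so treat this as an illustration under those assumptions rather than a recommendation.

```r
library(lubridate)

birthdate <- as.Date("1978-12-31")
givendate <- as.Date("2015-12-31")
span <- interval(start = birthdate, end = givendate)

# Duration-based age: elapsed seconds divided by a fixed-length "year".
# The fractional result depends on how many seconds lubridate assigns to a year.
age_duration <- span / duration(num = 1, units = "years")

# Period-based age: whole calendar years elapsed on the clock
age_period <- as.period(span)$year

# Integer division of an interval by a period also counts whole calendar units
age_years  <- span %/% years(1)
age_months <- span %/% months(1)

# Weeks need no calendar logic, because a week is always exactly 7 days
age_weeks <- as.numeric(givendate - birthdate) %/% 7
```

The period-based values agree with the calendar (the 37th birthday falls exactly on givendate), while the duration-based value depends on the year length used.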
Ok, so I found this function in another post:
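age <- function(from, to) {
  from_lt <- as.POSIXlt(from)
  to_lt <- as.POSIXlt(to)

  # Difference in calendar years, before checking whether the birthday has passed
  age <- to_lt$year - from_lt$year

  # Subtract one year if the birthday has not yet occurred in the final year
  ifelse(to_lt$mon < from_lt$mon |
           (to_lt$mon == from_lt$mon & to_lt$mday < from_lt$mday),
         age - 1, age)
}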
It was posted by @Jim saying "The following function takes a vectors of Date objects and calculates the ages, correctly accounting for leap years. Seems to be a simpler solution than any of the other answers".
It is indeed simpler and it does the trick I was looking for. On average, it is actually faster than the arithmetic method (about 75% faster).
library(microbenchmark)
library(lubridate)   # interval(), duration()
library(eeptools)    # age_calc()
library(ggplot2)     # needed for autoplot()

mbm <- microbenchmark(
  arithmetic = (givendate - birthdate) / 365.25,
  lubridate = interval(start = birthdate, end = givendate) /
    duration(num = 1, units = "years"),
  eeptools = age_calc(dob = birthdate, enddate = givendate,
                      units = "years"),
  age = age(from = birthdate, to = givendate),
  times = 1000
)
mbm
autoplot(mbm)
And at least in my examples it does not make any mistakes (nor should it in any other example; it is a pretty straightforward function built on ifelse).
toy_df <- data.frame(
  birthdate = birthdate,
  givendate = givendate,
  arithmetic = as.numeric((givendate - birthdate) / 365.25),
  lubridate = interval(start = birthdate, end = givendate) /
    duration(num = 1, units = "years"),
  eeptools = age_calc(dob = birthdate, enddate = givendate,
                      units = "years"),
  age = age(from = birthdate, to = givendate)
)
toy_df[, 3:6] <- floor(toy_df[, 3:6])
toy_df
    birthdate  givendate arithmetic lubridate eeptools age
1  1978-12-30 2015-12-31         37        37       37  37
2  1978-12-31 2015-12-31         36        37       37  37
3  1979-01-01 2015-12-31         36        37       36  36
4  1962-12-30 2015-12-31         53        53       53  53
5  1962-12-31 2015-12-31         52        53       53  53
6  1963-01-01 2015-12-31         52        53       52  52
7  2000-06-16 2050-06-17         50        50       50  50
8  2000-06-17 2050-06-17         49        50       50  50
9  2000-06-18 2050-06-17         49        50       49  49
10 2007-03-18 2008-03-19          1         1        1   1
11 2007-03-19 2008-03-19          1         1        1   1
12 2007-03-20 2008-03-19          0         1        0   0
13 1968-02-29 2015-02-28         46        47       46  46
14 1968-02-29 2015-03-01         47        47       47  47
15 1968-02-29 2015-03-02         47        47       47  47
I do not consider it a complete solution because I also wanted the age in months and weeks, and this function is specific to years. I am posting it here anyway because it solves the problem for the age in years. I will not accept it because:
- I would rather wait for @Jim to post it as an answer.
- I will wait to see if someone else comes up with a complete solution (efficient, accurate and producing age in years, months or weeks as desired).
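For what it is worth, the same POSIXlt comparison can be stretched to months, and weeks need only integer division of the day count. The sketch below is my own extension, not something @Jim posted: age_months and age_weeks are hypothetical names, and the month rule (a month counts only once the day of the month has been reached) is one convention among several for end-of-month birthdays.

```r
# Hypothetical helper: whole calendar months between two dates
age_months <- function(from, to) {
  from_lt <- as.POSIXlt(from)
  to_lt <- as.POSIXlt(to)

  # Months implied by the year and month fields alone
  months <- 12 * (to_lt$year - from_lt$year) + (to_lt$mon - from_lt$mon)

  # Subtract one month if the day of the month has not been reached yet
  ifelse(to_lt$mday < from_lt$mday, months - 1, months)
}

# Hypothetical helper: whole weeks, exact because a week is always 7 days
age_weeks <- function(from, to) {
  as.numeric(as.Date(to) - as.Date(from)) %/% 7
}

from <- as.Date(c("2007-03-19", "2007-03-20"))
to <- as.Date(c("2008-03-19", "2008-03-19"))
age_months(from, to)
age_weeks(from, to)
```

Both helpers are vectorised in the same way as age, so they inherit its cost from as.POSIXlt on large inputs.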
I was going to leave this in the comments, but I think it's worthy of a separate answer. As @Molx points out, your "arithmetic" method is not as simple as it seems. Take a look at the code for -.Date, most importantly:
return(difftime(e1, e2, units = "days"))
Thus, the "arithmetic" method on objects of class Date is really a wrapper for the difftime function. What about difftime? This too has a bunch of overhead if what you're after is raw speed.
The key is that Date objects are stored as a whole number of days since/until Jan. 1, 1970 (though they're not actually stored as integer, hence the birth of the IDate class in data.table), so we can just subtract these and be done with it. To keep the -.Date method from being called, though, we have to unclass our inputs:
(unclass(givendate) - unclass(birthdate)) / 365.25
As far as bang for your buck goes, this approach is far faster than even @Jim's age method: roughly 60 times faster at the median in the benchmark below.
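A minimal sketch of what unclass exposes, assuming nothing beyond base R. The variable names (d, via_difftime, via_unclass) are mine and purely illustrative.

```r
d <- as.Date("1970-01-10")

unclass(d)   # 9: days since 1970-01-01
typeof(d)    # "double", not "integer"

birthdate <- as.Date("1978-12-31")
givendate <- as.Date("2015-12-31")

# Both routes give the same day count; the second skips the -.Date method
via_difftime <- as.numeric(givendate - birthdate)
via_unclass <- unclass(givendate) - unclass(birthdate)
all.equal(via_difftime, via_unclass)
```

The saving comes entirely from avoiding method dispatch and the difftime bookkeeping; the arithmetic itself is identical.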
Here's some more scaled-up test data:
set.seed(20349)
NN <- 1e6
birthdate <- as.Date(sprintf('%d-%02d-%02d',
                             sample(1901:2030, NN, TRUE),
                             sample(12, NN, TRUE),
                             sample(28, NN, TRUE)))
# average 30 years, most data between 20 and 40 years
givendate <- birthdate + as.integer(rnorm(NN, mean = 10950, sd = 1000))
(I am excluding eeptools because it is almost impossibly slow: a glance at the code for age_calc suggests it goes as far as creating a sequence of dates for each pair of dates (O(n^2)-ish), not to mention a peppering of ifelse calls.)
microbenchmark(
  arithmetic = (givendate - birthdate) / 365.25,
  lubridate = interval(start = birthdate, end = givendate) /
    duration(num = 1, units = "years"),
  age = age(from = birthdate, to = givendate),
  fastar = (unclass(givendate) - unclass(birthdate)) / 365.25,
  overlaps = get_age(birthdate, givendate),
  times = 50)
# Unit: milliseconds
#        expr        min         lq      mean     median         uq      max neval cld
#  arithmetic  28.153465  30.384639  62.96118  31.492764  34.052991 180.9556    50  b
#   lubridate  94.327968  97.233009 157.30420 102.751351 240.717065 265.0283    50   c
#         age 338.347756 479.598513 483.84529 483.580981 488.090832 770.1149    50    d
#      fastar   7.740098   7.831528  11.02521   7.913146   8.090902 153.3645    50 a
#    overlaps 316.408920 458.734073 459.58974 463.806255 470.320072 769.0929    50    d
Thus we also highlight the folly of benchmarking on small-scale data.
The big cost of @Jim's method is that as.POSIXlt is increasingly expensive as your vectors grow.
The issue of inaccuracy remains, but unless this accuracy is paramount, it seems the unclass method is unparalleled.
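To show what that remaining inaccuracy looks like, here is a small sketch using one pair of dates from the toy data in the question. The variable names are my own, and the calendar check simply reuses the comparison from the age function above.

```r
birthdate <- as.Date("1978-12-31")
givendate <- as.Date("2015-12-31")

days <- unclass(givendate) - unclass(birthdate)

# Dividing by an average year length falls just short on the 37th birthday
approx_age <- floor(days / 365.25)

# Calendar-based check: has this year's birthday not happened yet?
from_lt <- as.POSIXlt(birthdate)
to_lt <- as.POSIXlt(givendate)
birthday_pending <- to_lt$mon < from_lt$mon |
  (to_lt$mon == from_lt$mon & to_lt$mday < from_lt$mday)
exact_age <- to_lt$year - from_lt$year - birthday_pending

c(approx = approx_age, exact = exact_age)   # approx = 36, exact = 37
```

The error only shows up within a day or so of a birthday, which is why it may be acceptable when speed matters more than exactness.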