How to calculate the elder generation’ maximum education year in one family
Here is what I tried. I thought it was necessary to create a generation variable. Seeing the sample image in your question, C
and D
are the 1st generation. A
and B
are the 2nd generation. E
and F
are the 3rd generation, and G
is the 4th generation. The first mutate()
with case_when()
created the generation variable. Then, I defined groups by family
and generation
. For each group, I identified max education duration (i.e., max_ed_duration
). Since you said that you do not need spouse's information, I dropped rows that contain mother or spouse in meanings
. Then, I defined groups again using family
. For each family, if generation
is 1, assign NA to fedu
. Otherwise, assign max_ed_duration
value from the previous generation to fedu
. Finally, I arranged the data by family
and relationship
.
library(dplyr)
mutate(mydf, generation = case_when(relationship %in% c("C", "D") ~ 1,
relationship %in% c("A", "B") ~ 2,
grepl(x = relationship, pattern = "^E|F") ~ 3,
grepl(x = relationship, pattern = "^G") ~ 4)) %>%
group_by(family, generation) %>%
mutate(max_ed_duration = max(edu)) %>%
filter(!grepl(x = meanings, pattern = "mother|spouse")) %>%
group_by(family) %>%
mutate(fedu = if_else(generation == 1,
NA_real_,
max_ed_duration[match(x = generation - 1, table = generation)])) %>%
arrange(family, relationship)
# family relationship meanings edu generation max_ed_duration fedu
# <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 A respondent 12 2 18 10
# 2 1 C A's father 10 1 10 NA
# 3 1 E1 A's first son 15 3 15 18
# 4 1 E2 A's second son 13 3 15 18
# 5 1 G11 E1's first son 3 4 3 15
# 6 1 G12 E1's second son 1 4 3 15
# 7 2 A respondent 21 2 21 16
# 8 2 C A's father 12 1 16 NA
# 9 2 E1 A's first son 18 3 18 21
#10 2 E2 A's second son 17 3 18 21
#11 2 E3 A's third son 16 3 18 21
DATA
mydf <- structure(list(family = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,
2, 2, 2, 2, 2), relationship = c("A", "B", "C", "D", "E1", "F1",
"G11", "G12", "E2", "A", "B", "C", "D", "E1", "F1", "E2", "E3"
), meanings = c("respondent", "respondent's spouse", "A's father",
"A's mother", "A's first son", "E1's spouse", "E1's first son",
"E1's second son", "A's second son", "respondent", "respondent's spouse",
"A's father", "A's mother", "A's first son", "E1's spouse", "A's second son",
"A's third son"), edu = c(12, 18, 10, 9, 15, 14, 3, 1, 13, 21,
16, 12, 16, 18, 15, 17, 16)), class = "data.frame", row.names = c(NA,
-17L))
Here is an option using data.table
.
Due to the nature of the dataset, quite a fair bit of data processing is required to restructure it into a parent-child relationship with its corresponding maximum number of years among parents' years of educations.
data processing:
library(data.table)
setDT(d)
#interpret the person mentioned in meanings column
d[, person := sapply(meanings, function(x) strsplit(x, "'")[[1]][1])]
d[person=="respondent", person :=
d[meanings=="respondent"][.SD, on=.(family), relationship]]
#find max edu per couple
d[, max_edu := d[grepl("spouse", meanings)][
.SD, on=.(family, person=relationship), pmax(x.edu, i.edu, na.rm=TRUE)]
]
#decipher parent and child relationship
cols <- c("parent", "child")
d[, (cols) := .(NA_character_, relationship)]
d[grepl("father|mother", meanings), (cols) := .(relationship, person)][
grepl("son", meanings), (cols) := .(person, relationship)]
#find parents' max edu
d[!is.na(parent), parent_max_edu :=
d[.SD, on=.(family, relationship=parent), x.max_edu]
]
d
#the last 3 columns are the useful ones
# family relationship meanings edu person max_edu parent child parent_max_edu
# 1: 1 A respondent 12 A 18 <NA> A NA
# 2: 1 B respondent's spouse 18 A 18 <NA> B NA
# 3: 1 C A's father 10 A 10 C A 10
# 4: 1 D A's mother 9 A 9 D A 9
# 5: 1 E1 A's first son 15 A 15 A E1 18
# 6: 1 F1 E1's spouse 14 E1 14 <NA> F1 NA
# 7: 1 G11 E1's first son 3 E1 3 E1 G11 15
# 8: 1 G12 E1's second son 1 E1 1 E1 G12 15
# 9: 1 E2 A's second son 13 A 13 A E2 18
# 10: 2 A respondent 21 A 21 <NA> A NA
# 11: 2 B respondent's spouse 16 A 16 <NA> B NA
# 12: 2 C A's father 12 A 12 C A 12
# 13: 2 D A's mother 16 A 16 D A 16
# 14: 2 E1 A's first son 18 A 18 A E1 21
# 15: 2 F1 E1's spouse 15 E1 15 <NA> F1 NA
# 16: 2 E2 A's second son 17 A 17 A E2 21
# 17: 2 E3 A's third son 16 A 16 A E3 21
finding max num of yrs of education of parents:
ans <- d[d[!grepl("spouse", meanings)], on=.(family, child=relationship),
by=.EACHI, .(family=i.family, id=i.relationship, edu=i.edu,
fedu=if (any(!is.na(parent_max_edu))) max(parent_max_edu, na.rm=TRUE) else NA_real_)][,
(1L:2L) := NULL]
setorder(ans, family, id)[]
output:
family id edu fedu
1: 1 A 12 10
2: 1 C 10 NA
3: 1 D 9 NA
4: 1 E1 15 18
5: 1 E2 13 18
6: 1 G11 3 15
7: 1 G12 1 15
8: 2 A 21 16
9: 2 C 12 NA
10: 2 D 16 NA
11: 2 E1 18 21
12: 2 E2 17 21
13: 2 E3 16 21