Label Encoder functionality in R?
Create your vector of data:
colors <- c("red", "red", "blue", "green")
Create a factor:
factors <- factor(colors)
Convert the factor to numbers:
as.numeric(factors)
Output (note that the codes follow the alphabetical order of the levels):
# [1] 3 3 1 2
You can also define a custom ordering (note that the output now follows the "rainbow color order" that I defined):
rainbow <- c("red","orange","yellow","green","blue","purple")
ordered <- factor(colors, levels = rainbow)
as.numeric(ordered)
# [1] 1 1 5 4
See ?factor.
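To apply the same mapping to new data, or to turn the integers back into labels, the levels of the original factor can be reused. A minimal sketch using only base R (the names `codes`, `new_colors`, and `new_codes` are just for illustration):

```r
colors <- c("red", "red", "blue", "green")
factors <- factor(colors)

# Integer codes and the labels they stand for
codes <- as.integer(factors)
levels(factors)              # "blue" "green" "red"

# Decode: index the levels by the codes
levels(factors)[codes]       # "red" "red" "blue" "green"

# Encode new data with the same mapping; unseen labels become NA
new_colors <- c("green", "yellow", "blue")
new_codes <- as.integer(factor(new_colors, levels = levels(factors)))
new_codes                    # 2 NA 1
```

Passing `levels = levels(factors)` is what keeps the numbering stable; calling `factor(new_colors)` on its own would renumber from scratch.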
If I understand correctly what you want:
# returns a function that encodes vectors using the sorted unique values of 'vec'
label_encoder = function(vec){
    levels = sort(unique(vec))
    function(x){
        match(x, levels)
    }
}
colors = c("red", "red", "blue", "green")
color_encoder = label_encoder(colors) # create encoder
encoded_colors = color_encoder(colors) # encode colors
encoded_colors
new_colors = c("blue", "green", "green") # new vector
encoded_new_colors = color_encoder(new_colors)
encoded_new_colors
other_colors = c("blue", "green", "green", "yellow")
color_encoder(other_colors) # unseen values ("yellow") are encoded as NA
# save and restore to disk
saveRDS(color_encoder, "color_encoder.RDS")
c_encoder = readRDS("color_encoder.RDS")
c_encoder(colors) # same result
# dealing with multiple columns
# create data.frame
set.seed(123) # make result reproducible
color_dataframe = as.data.frame(
    matrix(
        sample(c("red", "blue", "green", "yellow"), 12, replace = TRUE),
        ncol = 3)
)
color_dataframe
# encode each column
for (column in colnames(color_dataframe)){
    color_dataframe[[column]] = color_encoder(color_dataframe[[column]])
}
color_dataframe
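The encoder above only maps labels to integers. If the reverse mapping is needed as well, one option is to return both directions from the same closure. This is only a sketch: `label_codec`, `encode`, and `decode` are names made up for this example, not part of any package.

```r
# Build a pair of functions that share the same sorted label set
label_codec <- function(vec) {
  lv <- sort(unique(vec))
  list(
    encode = function(x) match(x, lv),  # label -> integer (NA if unseen)
    decode = function(i) lv[i]          # integer -> label
  )
}

colors <- c("red", "red", "blue", "green")
codec <- label_codec(colors)

encoded <- codec$encode(colors)
encoded                 # 3 3 1 2
codec$decode(encoded)   # "red" "red" "blue" "green"

# Encoding every column of a data.frame without an explicit loop
df <- data.frame(a = c("red", "blue"), b = c("green", "red"),
                 stringsAsFactors = FALSE)
df[] <- lapply(df, codec$encode)
df$a                    # 3 1
df$b                    # 2 3
```

Assigning into `df[]` keeps the data.frame structure and column names while replacing the contents of each column.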
Try the CatEncoders package. It replicates the functionality of Python's sklearn.preprocessing.
library(CatEncoders)
# variable to encode values
colors = c("red", "red", "blue", "green")
lab_enc = LabelEncoder.fit(colors)
# new values are transformed to NA
values = transform(lab_enc, c('red', 'red', 'yellow'))
values
# [1] 3 3 NA
# doing the inverse: given the encoded numbers return the labels
inverse.transform(lab_enc, values)
# [1] "red" "red" NA
I would add the functionality of reporting the non-matching labels with a warning.
PS: It also has the OneHotEncoder function.
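The warning for non-matching labels mentioned above can be approximated with a small base R wrapper. This is a sketch of the idea, not a feature of CatEncoders: `encode_with_warning` and its `labels` argument are hypothetical names, and the encoding is done with `match()` rather than with the package.

```r
# Encode x against a known label set and warn about labels that are not in it
encode_with_warning <- function(x, labels) {
  codes <- match(x, labels)
  # Values that failed to match, ignoring genuine NAs in the input
  unseen <- unique(x[is.na(codes) & !is.na(x)])
  if (length(unseen) > 0) {
    warning("Labels not seen during fitting: ",
            paste(unseen, collapse = ", "))
  }
  codes
}

known <- sort(unique(c("red", "red", "blue", "green")))
result <- encode_with_warning(c("red", "red", "yellow"), known)
result   # 3 3 NA, with a warning that names "yellow"
```

A warning rather than an error keeps the NA-producing behaviour shown above while making the unseen labels visible.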