Dear sdcMicro maintainers
I applied globalRecode() to the numeric variable age in order to generalize it into age categories. However, after calling extractManipData() I noticed that the age variable is still numeric rather than the expected factor.
I suspect the reason lies in aux_functions.R, line 471:
```r
## quick and dirty: ensure that keyVars are factors:
if (!is.null(k) && !ignoreKeyVars) {
  for (i in seq_len(ncol(k))) {
    cc <- class(origKeys[[colnames(k)[i]]])[1]
    vname <- colnames(k)[i]
    v_p <- o[[vname]]
    if (cc != class(v_p)[1]) {
      if (cc == "integer") {
        o[[vname]] <- as.integer(v_p)
      }
      if (cc == "character") {
        o[[vname]] <- as.character(v_p)
      }
      if (cc == "numeric") {
        o[[vname]] <- as.numeric(v_p)
      }
      if (cc == "logical") {
        o[[vname]] <- as.logical(v_p)
      }
      if (cc == "ordered") {
        o[[vname]] <- as.ordered(v_p)
      }
    }
  }
}
```

Because the code section above restores the variable class to that of the original dataset, factorized numeric variables are converted back to numeric.
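The effect of that coercion can be illustrated with base R alone. This is only a minimal sketch: it assumes the recoded column behaves like the output of `cut()`, and the names `age_cat` and `restored` are made up for illustration.

```r
# Three ages recoded into 10-year bands with base R's cut(), standing in for
# the factor that globalRecode() stores (assumption: it behaves like cut()).
age <- c(25, 34, 61)
age_cat <- cut(age, breaks = seq(10, 80, 10))

# Mimic the "numeric" branch of the class-restoring loop: the original class
# was numeric, so the factor is passed through as.numeric().
restored <- as.numeric(age_cat)

class(age_cat)         # "factor"
class(restored)        # "numeric"
restored               # 2 3 6: the internal level codes, not the ages
as.character(age_cat)  # the category labels that are lost in the process
```

Note that `as.numeric()` on a factor returns the level codes, so the extracted column would hold neither the original ages nor the category labels.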
Note:
- The section above was last modified in "extractManipData fails when only 1 keyVar is set" (#321).
- In order to apply globalRecode() to a numeric variable, it must first be defined as a key variable (i.e., as categorical) within createSdcObj(). This is not directly related to the issue above, but it may contribute to the confusion.
Here is a reproducible example:
```r
library(sdcMicro)

# create test data ----
set.seed(123) # for reproducibility
n <- 50

# age 18–65
age <- round(runif(n, min = 18, max = 65))

# income
income <- round(rnorm(n, 60000, 20))

# occupation categories
occupation <- sample(c("student", "employed", "unemployed"),
                     size = n, replace = TRUE,
                     prob = c(0.3, 0.5, 0.2))

# gender categories
gender <- sample(c("male", "female", "nonbinary"),
                 size = n, replace = TRUE,
                 prob = c(0.45, 0.45, 0.10))

# depression score
depression_score <- rnorm(n, mean = 10, sd = 5)

# combine
df <- data.frame(
  age,
  income,
  occupation,
  gender,
  depression_score
)

# Apply globalRecode() to numeric variable "age" ----
sdcCat <- createSdcObj(dat = df,
                       keyVars = c("gender", "occupation", "age"))
sdcCat <- globalRecode(sdcCat,
                       column = "age",
                       breaks = seq(10, 80, 10))

# extract dataset
df_recoded <- extractManipData(sdcCat)

# after calling extractManipData(), "age" remains numeric rather than a factor
is.numeric(df_recoded$age)
is.factor(df_recoded$age)

# when extracting "age" directly from the sdcObject, it's correctly represented as factor
is.factor(sdcCat@manipKeyVars$age)
```
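
Until the class handling is changed, one possible workaround is to overwrite the extracted column with the factor stored in the manipKeyVars slot. The following is a minimal sketch rather than a recommended fix: it uses a small made-up data frame, the names `toy`, `sdc_toy` and `out` are illustrative, and it assumes that extractManipData() returns the rows in their original order (i.e. no record randomization).

```r
library(sdcMicro)

# Small made-up data set, for illustration only
toy <- data.frame(
  age    = c(19, 23, 27, 31, 35, 42, 48, 53, 58, 64),
  gender = rep(c("male", "female"), times = 5),
  score  = c(8, 12, 9, 15, 11, 7, 13, 10, 14, 6)
)

sdc_toy <- createSdcObj(dat = toy, keyVars = c("gender", "age"))
sdc_toy <- globalRecode(sdc_toy, column = "age", breaks = seq(10, 80, 10))

out <- extractManipData(sdc_toy)

# Workaround: take the recoded factor straight from the sdcMicroObj slot.
# Assumption: the row order of `out` matches the order in the slot.
out$age <- sdc_toy@manipKeyVars$age

is.factor(out$age)
```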