extractManipData() does not work correctly after applying globalRecode() #360

@MuellerRoman

Dear sdcMicro maintainers,

I applied globalRecode() to the numeric variable age in order to generalize it into age categories. However, after calling extractManipData() I noticed that the age variable is still numeric rather than the expected factor.

I suspect the reason lies in aux_functions.R, line 471:

## quick and dirty: ensure that keyVars are factors:
  if (!is.null(k) && !ignoreKeyVars) {
    for (i in seq_len(ncol(k))) {
      cc <- class(origKeys[[colnames(k)[i]]])[1]
      vname <- colnames(k)[i]
      v_p <- o[[vname]]
      if (cc != class(v_p)[1]) {
        if (cc == "integer") {
          o[[vname]] <- as.integer(v_p)
        }
        if (cc == "character") {
          o[[vname]] <- as.character(v_p)
        }
        if (cc == "numeric") {
          o[[vname]] <- as.numeric(v_p)
        }
        if (cc == "logical") {
          o[[vname]] <- as.logical(v_p)
        }
        if (cc == "ordered") {
          o[[vname]] <- as.ordered(v_p)
        }
      }
    }
  }

Because the code section above restores the variable class to that of the original dataset, factorized numeric variables are converted back to numeric.
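
The effect of that branch can be reproduced in base R alone. The snippet below is a minimal sketch, not sdcMicro code: it assumes the recoded column behaves like the output of base `cut()`, and the names `age_orig`, `age_cat`, and `restored` are made up for illustration. It suggests that the coercion does not bring back the original ages either, since `as.numeric()` on a factor returns the internal level codes.

```r
# Stand-alone illustration in base R; no sdcMicro functions are called.
# Assumption: the recoded key variable behaves like the result of cut().
age_orig <- c(23, 37, 41, 58)                        # original numeric variable
age_cat  <- cut(age_orig, breaks = seq(10, 80, 10))  # factor of age intervals

cc <- class(age_orig)[1]  # class taken from the original data ("numeric")

restored <- age_cat
if (cc != class(age_cat)[1] && cc == "numeric") {
  # same coercion as in the quoted branch for cc == "numeric"
  restored <- as.numeric(age_cat)
}

class(age_cat)   # "factor"
class(restored)  # "numeric"
restored         # 2 3 4 5: level codes, neither ages nor interval labels
```

If the same happens inside extractManipData(), the extracted age column would hold interval indices rather than the recoded categories.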

Note:

  • The section above was last modified in issue #321 ("extractManipData fails when only 1 keyVar is set").
  • In order to apply globalRecode() to a numeric variable, it must first be defined as a key variable (i.e., as categorical) within createSdcObj(). This is not directly related to the issue above, but it may contribute to the confusion.

Here is a reproducible example:

library(sdcMicro)

# create test data ----

set.seed(123)  # for reproducibility
n <- 50

# age 18–65
age <- round(runif(n, min = 18, max = 65))

# income
income <- round(rnorm(n, 60000, 20))

# occupation categories
occupation <- sample(c("student", "employed", "unemployed"),
                     size = n, replace = TRUE,
                     prob = c(0.3, 0.5, 0.2))

# gender categories
gender <- sample(c("male", "female", "nonbinary"),
                 size = n, replace = TRUE,
                 prob = c(0.45, 0.45, 0.10))

# depression score
depression_score <- rnorm(n, mean = 10, sd = 5)

# combine
df <- data.frame(
  age,
  income,
  occupation,
  gender,
  depression_score
)

# Apply globalRecode() to numeric variable "age" ----
sdcCat <- createSdcObj(dat = df,
                       keyVars = c("gender", "occupation", "age"))

sdcCat <- globalRecode(sdcCat,
                       column = "age",
                       breaks = seq(10, 80, 10)) 

# extract dataset
df_recoded <- extractManipData(sdcCat)

# after calling extractManipData(), "age" remains numeric rather than a factor
is.numeric(df_recoded$age)
is.factor(df_recoded$age)

# when extracting "age" directly from the sdcObject, it's correctly represented as factor
is.factor(sdcCat@manipKeyVars$age)
