Performance improvements with factor levels #457

Closed
jmbarbone opened this issue Sep 8, 2022 · 2 comments

Comments

@jmbarbone

Base sub and gsub check whether the input is a factor with fewer levels than elements and, if so, operate on the levels only, which can drastically improve performance. Below is an example of the improvement {stringr} could get from the same check. I'd imagine the check would live inside the stringr functions rather than in an additional function (as it is below).

# contains a check for factor & levels
sub
#> function (pattern, replacement, x, ignore.case = FALSE, perl = FALSE, 
#>     fixed = FALSE, useBytes = FALSE) 
#> {
#>     if (is.factor(x) && length(levels(x)) < length(x)) {
#>         sub(pattern, replacement, levels(x), ignore.case, perl, 
#>             fixed, useBytes)[x]
#>     }
#>     else {
#>         if (!is.character(x)) 
#>             x <- as.character(x)
#>         .Internal(sub(as.character(pattern), as.character(replacement), 
#>             x, ignore.case, perl, fixed, useBytes))
#>     }
#> }
#> <bytecode: 0x0000021504f21ee8>
#> <environment: namespace:base>

foo <- function() sample(letters[1:5], 1e4, TRUE)
x <- paste0(foo(), foo(), foo())
fx <- factor(x)

library(stringr)
str_remove_fct <- function(string, pattern) {
  str_remove(levels(string), pattern)[string]
}

res <- bench::mark(
  sub("a", "", x),
  sub("a", "", fx),
  str_remove(x, "a"),
  str_remove(fx, "a"),
  str_remove_fct(fx, "a")
)

res
#> # A tibble: 5 × 6
#>   expression                   min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 sub("a", "", x)            2.4ms   2.67ms      353.    78.2KB     0   
#> 2 sub("a", "", fx)          86.8µs    145µs     6154.    79.2KB     8.58
#> 3 str_remove(x, "a")        2.34ms    3.7ms      237.   180.5KB     0   
#> 4 str_remove(fx, "a")       2.28ms   2.69ms      297.   158.5KB     0   
#> 5 str_remove_fct(fx, "a")  171.3µs 281.45µs     3299.    79.2KB     6.35

ggplot2::autoplot(res)
#> Loading required namespace: tidyr

Created on 2022-09-08 with reprex v2.0.2
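
The same idea can be written more generally. This is only a minimal sketch: `with_levels` is a hypothetical helper, not part of stringr, and it assumes the function passed in is vectorised element-wise over its first argument and that the remaining arguments (such as the pattern) have length 1.

```r
library(stringr)

# Hypothetical helper (not a stringr function): mirror the check in base sub().
# If `string` is a factor with fewer levels than elements, apply `f` to the
# levels only and expand the result through the factor's integer codes.
with_levels <- function(string, f, ...) {
  if (is.factor(string) && nlevels(string) < length(string)) {
    f(levels(string), ...)[as.integer(string)]
  } else {
    f(as.character(string), ...)
  }
}

x <- c("ab", "ba", "ab", "cc", "ab")
fx <- factor(x)

# Only the three levels are processed, not all five elements.
res <- with_levels(fx, str_remove, "a")
```

Like base sub, this returns a character vector rather than a factor, so it is a drop-in replacement only where the factor structure does not need to be preserved.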

@gagolews
Contributor

Related issue: gagolews/stringi#435

@hadley
Member

hadley commented Oct 1, 2022

I think this is out of scope for stringr. You can use forcats::fct_relabel + stringr functions if this performance improvement is meaningful for your data.
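
A minimal sketch of that suggestion, using a small made-up vector rather than the benchmark data above. Note that `fct_relabel()` returns a factor and collapses levels that become identical after relabelling, so `as.character()` is needed if a character vector is wanted.

```r
library(forcats)
library(stringr)

x <- c("abc", "bac", "abc", "dde", "bac", "abc")
fx <- factor(x)

# The string function runs once per level, not once per element.
# "abc" and "bac" both become "bc", so the two levels are collapsed into one.
fx_clean <- fct_relabel(fx, ~ str_remove(.x, "a"))

# Convert back when a plain character vector is needed.
out <- as.character(fx_clean)
```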

@hadley hadley closed this as completed Oct 1, 2022