memory allocation error if strings too long #138

Closed · richierocks opened this issue Jan 4, 2015 · 11 comments

@richierocks commented Jan 4, 2015

Here's a large text file with 10 million lines, each containing 100 characters.

x <- paste0(sample(letters, 100, replace = TRUE), collapse = "")
lines <- rep.int(x, 1e7)
writeLines(lines, "bigfile.txt")

If I try to read it (tested on a modern desktop PC with 16GB RAM under Windows 7), R crashes with the error:

lines <- stri_read_lines("bigfile.txt")
## terminate called after throwing an instance of 'std::bad_alloc'
##   what():  std::bad_alloc

For comparison, I can read the file using base-R's readLines and data.table's fread.

lines <- readLines("bigfile.txt")                          # slow but OK
lines <- data.table::fread("bigfile.txt", sep = "\n")[[1]] # faster, also OK
@gagolews (Owner) commented Jan 4, 2015

How large is this file, in bytes?

@richierocks (Author)

Previous comment was a little short on lines; I've updated it to read 10 million (1e7 lines * 100 chars ~= 1GB). You may need to vary the file size a bit to reproduce the error, but stri_read_lines definitely fails before readLines on the (Windows) machines that I've tried.

stri_read_raw works fine on the file; the crash happens in stri_encode.
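
As a minimal sketch of that observation (assuming the same bigfile.txt created above; the second step is where the reported std::bad_alloc occurs):

raw_bytes <- stringi::stri_read_raw("bigfile.txt")     # reading the raw bytes works (~1 GB raw vector)
txt <- stringi::stri_encode(raw_bytes, NULL, "UTF-8")  # the conversion step is what fails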

gagolews self-assigned this Jan 5, 2015
gagolews added the bug label Jan 5, 2015
@gagolews (Owner) commented Jan 5, 2015

OK, thanks for the notice. That's why stri_read_lines is marked as a Draft API. I will rewrite the function in pure C++ some day (right now it calls too much R code). I can't promise I'll do that soon, as my main priority for stringi_0.5-1 is to add date/time parse/format features. Thanks also for pointing out data.table's fread -- I'll try to make this function even faster than that :)

@richierocks (Author)

No pressure, but is this still on the TODO list?

@gagolews (Owner) commented Oct 1, 2015

Sure, but I'm quite busy right now...

@gagolews (Owner)

Actually, it's stri_encode()...

> x <- charToRaw(stringi::stri_dup("a", 2**29))
> y <- stringi::stri_encode(x, NULL, "utf-8")
Error in stringi::stri_encode(x, NULL, "utf-8") : 
  memory allocation or access error

@gagolews (Owner)

The same with stri_rand_strings():

> x <- stringi::stri_rand_strings(1, 2**29)
Error in stringi::stri_rand_strings(1, 2^29) : 
  memory allocation or access error

gagolews changed the title from "stri_read_lines fails for large datasets" to "memory allocation error if strings too long" Aug 20, 2020
@gagolews (Owner)

Fix on the way; changing the buffer size type to size_t:

> x <- stringi::stri_rand_strings(1, 2**31-1)
> stringi::stri_length(x)
[1] 2147483647

gagolews reopened this Aug 20, 2020
@gagolews (Owner)

Well... I'll give it another thought tomorrow 🤔

x <- paste0(sample(letters, 100, replace = TRUE), collapse = "")
lines <- rep.int(x, 1e7)
writeLines(lines, "bigfile.txt")
lines <- stringi::stri_read_lines("bigfile.txt")
##Error in stri_encode(txt, encoding, "UTF-8") : 
##  Start of codes indicating failure. (U_ILLEGAL_ARGUMENT_ERROR)

@gagolews (Owner)

After fixing a few bugs, encoding conversion for in-memory data of size 672 MB works fine. A few MB more and we get the following error (which, at least, is now more informative):

Error in stringi::stri_encode(x, NULL, "utf-8") : 
   memory allocation or access error

I would have to rewrite the whole stri_encode to support buffers larger than this, which I don't think is a high priority (?). Of course, I'm open to discussion, but the predicted upper limit for raw inputs would still be ca. (AVAILABLE_RAM)/4 anyway -- not that much more, so I don't think it's worth the hassle.

A workaround is to open a file connection and read the data in mini-batches (say, 0.5 GB each); a rough sketch follows.
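
A rough sketch of that mini-batch idea in plain R (the helper name, the batch size, and the per-batch stri_encode call are illustrative assumptions, not stringi's own implementation):

read_lines_batched <- function(path, batch_lines = 5e6) {
  con <- file(path, open = "r")
  on.exit(close(con))
  chunks <- list()
  repeat {
    batch <- readLines(con, n = batch_lines)  # read at most batch_lines lines per iteration
    if (length(batch) == 0L) break
    # convert each batch separately so no single stri_encode call sees a multi-GB buffer
    chunks[[length(chunks) + 1L]] <- stringi::stri_encode(batch, NULL, "UTF-8")
  }
  unlist(chunks, use.names = FALSE)
}

lines <- read_lines_batched("bigfile.txt")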

@gagolews (Owner)

I have created related issues: #395, #396
