memory allocation error if strings too long #138

Closed · richierocks opened this issue Jan 4, 2015 · 11 comments

@richierocks commented Jan 4, 2015

Here's a large text file with 10 million lines, each containing 100 characters.

x <- paste0(sample(letters, 100, replace = TRUE), collapse = "")
lines <- rep.int(x, 1e7)
writeLines(lines, "bigfile.txt")

If I try to read it (tested on a modern desktop PC with 16GB RAM under Windows 7), R crashes with the error:

lines <- stri_read_lines("bigfile.txt")
## terminate called after throwing an instance of 'std::bad_alloc'
##   what():  std::bad_alloc

For comparison, I can read the file using base-R's readLines and data.table's fread.

lines <- readLines("bigfile.txt")                          # slow but OK
lines <- data.table::fread("bigfile.txt", sep = "\n")[[1]] # faster, also OK
@gagolews (Owner) commented Jan 4, 2015

How large is this file, in bytes?

@richierocks (Author)

Previous comment was a little short on lines; I've updated it to read 10 million (1e7 lines * 100 chars ~= 1GB). You may need to vary the file size a bit to reproduce the error, but stri_read_lines definitely fails before readLines on the (Windows) machines that I've tried.

stri_read_raw works fine on the file; the crash happens in stri_encode.
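
As a minimal sketch of that observation (assuming the same bigfile.txt created above; the second step is where the reported std::bad_alloc occurs):

raw_bytes <- stringi::stri_read_raw("bigfile.txt")     # reading the raw bytes works (~1 GB raw vector)
txt <- stringi::stri_encode(raw_bytes, NULL, "UTF-8")  # the conversion step is what fails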

gagolews self-assigned this Jan 5, 2015
gagolews added the bug label Jan 5, 2015
@gagolews (Owner) commented Jan 5, 2015

OK, thanks for the notice. That's why stri_read_lines is marked as a Draft API. I will rewrite the function in pure C++ some day (right now it calls too much R code). I can't promise I'll do that soon, as my main priority for stringi_0.5-1 is to add date/time parse/format features. Thanks also for pointing out data.table's fread -- I'll try to make this function even faster than that :)

@richierocks (Author)

No pressure, but is this still on the TODO list?

@gagolews (Owner) commented Oct 1, 2015

Sure, but I'm quite busy right now...

@gagolews (Owner)

Actually, it's stri_encode()...

> x <- charToRaw(stringi::stri_dup("a", 2**29))
> y <- stringi::stri_encode(x, NULL, "utf-8")
Error in stringi::stri_encode(x, NULL, "utf-8") : 
  memory allocation or access error

@gagolews (Owner)

The same with stri_rand_strings():

> x <- stringi::stri_rand_strings(1, 2**29)
Error in stringi::stri_rand_strings(1, 2^29) : 
  memory allocation or access error

gagolews changed the title from "stri_read_lines fails for large datasets" to "memory allocation error if strings too long" Aug 20, 2020
@gagolews (Owner)

Fix on the way; changing the buffer size type to size_t:

> x <- stringi::stri_rand_strings(1, 2**31-1)
> stringi::stri_length(x)
[1] 2147483647

gagolews reopened this Aug 20, 2020
@gagolews (Owner)

Well... I'll give it another thought tomorrow 🤔

x <- paste0(sample(letters, 100, replace = TRUE), collapse = "")
lines <- rep.int(x, 1e7)
writeLines(lines, "bigfile.txt")
lines <- stringi::stri_read_lines("bigfile.txt")
##Error in stri_encode(txt, encoding, "UTF-8") : 
##  Start of codes indicating failure. (U_ILLEGAL_ARGUMENT_ERROR)

@gagolews (Owner)

After fixing a few bugs, encoding conversion for in-memory data of size 672 MB works fine. A few MB more and we get the following error (which, at least, is now more informative):

Error in stringi::stri_encode(x, NULL, "utf-8") : 
   memory allocation or access error

I would have to rewrite the whole stri_encode to support buffers larger than this, which I don't think is a high priority (?). Of course, I'm open to discussion, but the predicted upper limit for raw inputs would still be ca. (AVAILABLE_RAM)/4 anyway -- not that much more, so I don't think it's worth the hassle.

A workaround is to open a file connection and read the data in mini-batches (say, 0.5 GB each); a rough sketch follows.
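
A rough sketch of that mini-batch idea in plain R (the helper name, the batch size, and the per-batch stri_encode call are illustrative assumptions, not stringi's own implementation):

read_lines_batched <- function(path, batch_lines = 5e6) {
  con <- file(path, open = "r")
  on.exit(close(con))
  chunks <- list()
  repeat {
    batch <- readLines(con, n = batch_lines)  # read at most batch_lines lines per iteration
    if (length(batch) == 0L) break
    # convert each batch separately so no single stri_encode call sees a multi-GB buffer
    chunks[[length(chunks) + 1L]] <- stringi::stri_encode(batch, NULL, "UTF-8")
  }
  unlist(chunks, use.names = FALSE)
}

lines <- read_lines_batched("bigfile.txt")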

@gagolews (Owner)

I have created related issues: #395, #396
