Parallel doesn't work on Windows (mclapply) #176

Open
bholtdwyer opened this issue May 26, 2023 · 3 comments
Labels
investigate Figure out what is going on

Comments

@bholtdwyer

When I attempt to run some specifications in parallel, I get an error saying that mclapply can't run with more than one core on Windows. However, in other cases the parallelization seems to work fine (or at least doesn't throw an error).

Example of a specification that doesn't work:

> example_attgt <- att_gt(yname = "my_outcome_var",
+                         tname = "year",
+                         idname = "group_id",
+                         gname = "min_treated_year",
+                         data = my_data,
+                         weightsname = "totsampweight",
+                         control_group = "nevertreated",
+                         allow_unbalanced_panel = TRUE,
+                         pl=TRUE,
+                         cores = 8
+ )
Error in parallel::mclapply(chunks, FUN = parallel.function, mc.cores = cores) : 
  'mc.cores' > 1 is not supported on Windows

I haven't been able to figure out why some specifications produce this error and others (such as the "real data" example in your vignette) seem to work fine. Maybe some later commenter will be able to figure out what triggers the error. But if the code calls mclapply at least in some cases, those cases won't run on Windows.

There may be a way to avoid this error by replacing calls to mclapply (which uses forking, which Windows doesn't do) with parLapply, which works on Windows but takes more setup due to the need to explicitly pass objects to the workers. Alternatively, at a cost in memory efficiency and speed, you could use the parallelsugar package to overwrite mclapply on Windows machines:

https://www.r-bloggers.com/2014/07/implementing-mclapply-on-windows-a-primer-on-embarrassingly-parallel-computation-on-multicore-systems-with-r/

https://www.r-bloggers.com/2015/10/parallelsugar-an-implementation-of-mclapply-for-windows/

https://www.r-bloggers.com/2019/06/parallel-r-socket-or-fork/
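To illustrate the extra setup that parLapply needs, here is a minimal sketch using only the base parallel package. It is not code from did: `scale_factor`, `scale_chunk`, and the toy `chunks` list are hypothetical names made up for this example. Each socket worker starts as a fresh R session, so any object the worker function depends on has to be exported to the workers explicitly.

```r
library(parallel)

# Hypothetical object that the worker function depends on.
scale_factor <- 10

# Hypothetical worker function; it refers to scale_factor by name.
scale_chunk <- function(x) sum(x) * scale_factor

# Toy stand-in for the chunks of work.
chunks <- list(1:3, 4:6, 7:9)

# Socket (PSOCK) cluster: works on Windows, macOS, and Linux.
cl <- makeCluster(2)

# Workers are fresh R sessions, so copy the object over explicitly.
clusterExport(cl, varlist = "scale_factor", envir = environment())

res_sock <- parLapply(cl, chunks, scale_chunk)
stopCluster(cl)

# Sequential reference result for comparison.
res_seq <- lapply(chunks, scale_chunk)
```

With forking (mclapply) the `clusterExport` step is unnecessary because child processes inherit the parent's memory, which is why the socket route takes more setup.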

Thanks for providing this great package!!

@bcallaway11
Owner

Thanks for the comment. Note to self: mclapply is called on line 137 of mboot.R.

We have just updated the bootstrap code, and it may get rid of this problem, but I haven't tested it.

Another note to self: we only actually run in parallel if the number of observations/clusters is large enough, so I think this is the explanation for the difference in behavior across applications.

I am going to keep this open for now.

@bcallaway11 added the "investigate" (Figure out what is going on) label Aug 29, 2023
@emmiobara

I am also facing this issue with a data set that has tens of millions of observations on a Windows machine. I am getting the same error when I try to run with multiple cores.

@grantmcdermott
Contributor

parallel::mclapply relies on forking, which is only available on Unix-based machines (i.e., Mac or Linux). For Windows, you have to use socket clusters, aka PSOCK clusters. (Details here.)

Here are two possible solutions:

Option 1: base tools only

The base R function for implementing a PSOCK cluster is parallel::makeCluster. So you could use a simple OS check to pick the right parallel strategy. Simply change L136-L142 of mboot.R to:

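# Run in parallel only for large n, when requested, and with more than one core.
if (n > 2500 && pl == TRUE && cores > 1) {
  if (.Platform$OS.type == "windows") {
    # Windows cannot fork, so use a PSOCK cluster instead.
    cl_cores <- parallel::makeCluster(cores)
    # Shut the workers down when the enclosing function exits.
    on.exit(parallel::stopCluster(cl_cores), add = TRUE)
    # Note: parLapply's function argument is `fun`, not `FUN`.
    results <- parallel::parLapply(cl = cl_cores, X = chunks, fun = parallel.function)
  } else {
    results <- parallel::mclapply(chunks, FUN = parallel.function, mc.cores = cores)
  }
  results <- do.call(rbind, results)
}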

(FWIW this kind of conditional strategy is super common. It's what a bunch of popular bootstrapping packages and functions use, e.g. sandwich::vcovBS.default, etc.)

Option 2: Use pbapply::pblapply

did already imports pbapply, which implements a similar conditional checking system to the above code chunk as part of its internal logic. So you could simply switch parallel::mclapply with pbapply::pblapply on L137 of mboot.R and everything should pass through correctly. The one caveat that I'll add here is that progress bars can be surprisingly expensive, so I recommend setting some configs to govern the default behaviour for did on attach. Here's an example of how I do it with one of my own packages.
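A rough sketch of what Option 2 could look like follows. It is not code from did and the toy `chunks` list is hypothetical. It assumes, based on my reading of the pbapply documentation, that an integer `cl` is ignored on Windows, so the sketch passes a cluster object there to get actual parallelism. `pboptions(type = "none")` turns the progress bar off to avoid its overhead.

```r
library(parallel)
library(pbapply)

# Turn the progress bar off; pboptions() returns the previous settings.
old_opts <- pboptions(type = "none")

# Toy stand-in for the chunks of work.
chunks <- list(1:3, 4:6, 7:9)
cores <- 2

if (.Platform$OS.type == "windows") {
  # Assumption: an integer `cl` is ignored on Windows, so pass a PSOCK cluster.
  cl <- makeCluster(cores)
  res <- pblapply(chunks, sum, cl = cl)
  stopCluster(cl)
} else {
  # On Unix-alikes an integer `cl` requests that many forked child processes.
  res <- pblapply(chunks, sum, cl = cores)
}

# Restore the previous progress-bar settings.
pboptions(old_opts)
```

In a package, the progress-bar default would more likely be set once (for example on attach, as suggested above) rather than toggled around each call.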
