---
title: "Introduction to Parallel Computing with R"
---
```{r, echo=FALSE, message=FALSE, results='hide', purl=FALSE}
source("knitr_header.R")
```
<div class="h_iframe">
<iframe src="11_Presentation/ParallelProcessingIntro.html"> </iframe>
</div>
[<i class="fa fa-file-code-o fa-3x" aria-hidden="true"></i> The R Script associated with this page is available here](`r output`). Download this file and open it (or copy-paste into a new script) with RStudio so you can follow along.
```{r cache=F, message=F,warning=FALSE}
library(knitr)
library(raster)
library(rasterVis)
library(dplyr)
library(ggplot2)
## New Packages
library(foreach)
library(doParallel)
library(arm)
library(fields)
library(snow)
```
If you don't have the packages above, install them via the package manager or by running `install.packages("doParallel")` (and similarly for the others).
# Simple examples
## _Sequential_ `for()` loop
```{r}
x=vector()
for(i in 1:3)
x[i]=i^2
x
```
## _Sequential_ `foreach()` loop
```{r}
x <- foreach(i=1:3) %do%
i^2
x
```
Note that `x` is a list with one element for each value of the iterator (`i`). You can also specify a function to combine the outputs with `.combine`. Let's concatenate the results into a vector with `c`.
## _Sequential_ `foreach()` loop with `.combine`
```{r}
x <- foreach(i=1:3,.combine='c') %do%
i^2
x
```
This tells `foreach()` to first calculate each iteration, then combine the results with `c(...)`.
## _Sequential_ `foreach()` loop with `.combine='rbind'`
```{r}
x <- foreach(i=1:3,.combine='rbind') %do%
i^2
x
```
<div class="well">
## Your turn
Write a `foreach()` loop that:
* generates 10 sets of 100 random values from a normal distribution (`rnorm()`)
* column-binds (`cbind`) them.
<button data-toggle="collapse" class="btn btn-primary btn-sm round" data-target="#demo1">Show Solution</button>
<div id="demo1" class="collapse">
```{r, purl=F}
x <- foreach(i=1:10,.combine='cbind') %do%
rnorm(100)
head(x)%>%kable()
dim(x)
```
</div>
</div>
## _Parallel_ `foreach()` loop
So far we've only used `%do%`, which runs on a single processor.
Before running `foreach()` in parallel, you have to register a _parallel backend_ such as the one provided by `doParallel`. On most multicore systems, `doParallel` is the easiest backend: on Linux and macOS it uses the `fork` system call, while on Windows it uses `snow`-style clusters. The nice thing is that it chooses the appropriate mechanism for your system automatically.
```{r,cache=F,message=FALSE}
# register specified number of workers
registerDoParallel(3)
# or, register all available cores
#registerDoParallel()
# check how many cores (workers) are registered
getDoParWorkers()
```
> _NOTE_: It may be a good idea to use n-1 cores for processing (so you can still use your computer to do other things while the analysis runs).
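For example, to register one fewer worker than the machine has cores (a minimal sketch, using `detectCores()` from the built-in `parallel` package):
```{r, eval=F}
## register all but one of the available cores
registerDoParallel(parallel::detectCores() - 1)
```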
To run in parallel, simply change the `%do%` to `%dopar%`. Wasn't that easy?
```{r}
## run the loop
x <- foreach(i=1:3, .combine='c') %dopar%
i^2
x
```
## A slightly more complicated example
In this section we will:
1. Generate data with known parameters
2. Fit a set of regression models using subsets of the complete dataset (e.g. bootstrapping)
3. Compare processing times for sequential vs. parallel execution
For more on motivations to bootstrap, see [here](http://www.sagepub.com/sites/default/files/upm-binaries/21122_Chapter_21.pdf).
Make up some data:
```{r}
n <- 100000 # number of data points
x1 <- rnorm (n) # make up x1 covariate
b0 <- 1.8 # set intercept (beta0)
b1 <- -1.5 # set beta1
p = invlogit(b0+b1*x1)
y <- rbinom (n, 1, p) # simulate data with noise
data=cbind.data.frame(y=y,x1=x1,p=p)
```
Let's look at the data:
```{r}
kable(head(data),row.names = F,digits = 2)
```
```{r}
ggplot(data,aes(y=x1,x=as.factor(y)))+
geom_boxplot()+
coord_flip()+
geom_line(aes(x=p+1,y=x1),col="red",size=2,alpha=.5)+
xlab("Binary Response")+
ylab("Covariate")
```
### Sampling from a dataset with `sample_n()`
```{r}
size=5
sample_n(data,size,replace=T)
```
### Simple Generalized Linear Model
This is the formal definition of the model we'll use:
$y_i \sim Bernoulli(p_i)$
$logit(p_i) = \beta_0 + \beta_1 X_i$
The index $i$ identifies each data point. $\beta_0$ and $\beta_1$ are the model coefficients (intercept and slope), and $y_i$ is the simulated observation for point $i$.
```{r}
trials = 10000
tsize = 100
ptime <- system.time({
result <- foreach(i=1:trials,
.combine = rbind.data.frame) %dopar%
{
tdata=sample_n(data,tsize,replace=TRUE)
M1=glm(y ~ x1, data=tdata, family=binomial(link="logit"))
## return parameter estimates
cbind.data.frame(trial=i,t(coefficients(M1)))
}
})
ptime
```
Look at the `result` object, which contains the parameter estimates from the subsampled models. There is one row per sample (`1:trials`), with columns for the estimated intercept and slope for that sample.
```{r}
kable(head(result),digits = 2)
```
```{r}
ggplot(dplyr::select(result,everything(),Intercept=contains("Intercept")))+
geom_density(aes(x=Intercept),fill="black",alpha=.2)+
geom_vline(aes(xintercept=b0),size=2)+
geom_density(aes(x=x1),fill="red",alpha=.2)+
geom_vline(aes(xintercept=b1),col="red",size=2)+
xlim(c(-5,5))+
ylab("Parameter Value")+
xlab("Density")
```
So we were able to perform `r trials` separate model fits in `r ptime[3]` seconds. Let's see how long it would have taken in sequence.
```{r }
stime <- system.time({
result <- foreach(i=1:trials,
.combine = rbind.data.frame) %do%
{
tdata=sample_n(data,tsize,replace=TRUE)
M1=glm(y ~ x1, data=tdata,family=binomial(link="logit"))
## return parameter estimates
cbind.data.frame(trial=i,t(coefficients(M1)))
}
})
stime
```
So we were able to run `r trials` separate model fits in `r ptime[3]` seconds when using `r foreach::getDoParWorkers()` CPUs and `r stime[3]` seconds on one CPU. That's `r round(stime[3]/ptime[3],1)`X faster for this simple example.
<div class="well">
## Your turn
* Generate some random data as follows:
```{r}
n <- 10000 # number of data points
x1 <- rnorm (n) # make up x1 covariate
b0 <- 25 # set intercept (beta0)
b1 <- -15 # set beta1
y <- rnorm (n, b0+b1*x1,10) # simulate data with noise
data2=cbind.data.frame(y=y,x1=x1)
```
Write a parallel `foreach()` loop that:
* selects a sample (as above) with `sample_n()`
* fits a 'normal' linear model with `lm()`
* extracts the coefficient of determination (R^2) from each model
Hint: use `summary(M1)$r.squared` to extract the R^2 from model `M1` (see `?summary.lm` for more details).
<button data-toggle="collapse" class="btn btn-primary btn-sm round" data-target="#demo2">Show Solution</button>
<div id="demo2" class="collapse">
```{r, purl=F}
trials = 100
tsize = 100
result <- foreach(i=1:trials,.combine = rbind.data.frame) %dopar%
{
tdata=sample_n(data2,tsize,replace=TRUE)
M1=lm(y ~ x1, data=tdata)
cbind.data.frame(trial=i,
t(coefficients(M1)),
r2=summary(M1)$r.squared,
aic=AIC(M1))
}
```
Plot it:
```{r, purl=F}
ggplot(data2,aes(y=y,x=x1))+
geom_point(col="grey")+
geom_abline(data=dplyr::select(result,everything(),
Intercept=contains("Intercept")),
aes(intercept=Intercept,slope=x1),alpha=.5)
```
</div>
</div>
### Writing data to disk
For long-running processes, you may want to consider writing results to disk _as-you-go_ rather than waiting until the end, in case of a problem (power failure, single job failure, etc.).
```{r}
## assign target directory
td=tempdir()
foreach(i=1:trials,
.combine = rbind.data.frame) %dopar%
{
tdata=sample_n(data,
tsize,
replace=TRUE)
M1=glm(y ~ x1,
data=tdata,
family=binomial(link="logit"))
## return parameter estimates
results=cbind.data.frame(
trial=i,
t(coefficients(M1)))
## write results to disk
file=paste0(td,"/results_",i,".csv")
write.csv(results,file=file)
return(NULL)
}
```
That will save the result of each subprocess to disk (be careful about duplicated file names!):
```{r}
list.files(td,pattern="results")%>%head()
```
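If the loop is interrupted, the files written so far are preserved, and you can re-assemble whatever completed into a single data frame. A minimal sketch, assuming the files were written with `write.csv()` as above:
```{r, eval=F}
## read each results_*.csv back in and stack them into one data.frame
files=list.files(td,pattern="results",full.names=TRUE)
allresults=do.call(rbind,lapply(files,read.csv,row.names=1))
```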
### Other useful `foreach` parameters
* `.inorder` (TRUE/FALSE): should results be combined in the same order they were submitted?
* `.errorhandling` (stop/remove/pass): what to do when an iteration fails
* `.packages`: packages to be made available to sub-processes
* `.export`: variables to export to sub-processes
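Here is a minimal (hypothetical) sketch combining several of these options:
```{r, eval=F}
x <- foreach(i=1:10,
             .combine='c',
             .inorder=TRUE,           # combine results in submission order
             .errorhandling='remove', # drop iterations that error
             .packages='dplyr',       # load dplyr on each worker
             .export='data') %dopar%  # make `data` available to each worker
  nrow(sample_n(data,5))
```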
# Spatial example
In this section we will:
1. Generate some _spatial_ data
2. Tile the region to facilitate processing the data in parallel.
3. Perform a moving window mean for the full area
4. Compare processing times for sequential vs. parallel execution
## Generate Spatial Data
A function to generate a `raster` object with spatial autocorrelation.
```{r}
simrast=function(nx=60,
ny=60,
theta=10,
seed=1234){
## create random raster with spatial structure
## Theta is scale of exponential decay
## This controls degree of autocorrelation,
## values ~1 are close to random while values ~nx/4 have high autocorrelation
r=raster(nrows=ny, ncols=nx,vals=1,xmn=-nx/2,
xmx=nx/2, ymn=-ny/2, ymx=ny/2)
names(r)="z"
# Simulate a Gaussian random field with an exponential covariance function
set.seed(seed) #set a seed so everyone's maps are the same
grid=list(x=seq(xmin(r),xmax(r)-1,
by=res(r)[1]),
y=seq(ymin(r),ymax(r)-1,res(r)[2]))
obj<-Exp.image.cov(grid=grid,
theta=theta,
setup=TRUE)
look<- sim.rf( obj)
values(r)=t(look)*10
return(r)
}
```
Generate a raster using `simrast`.
```{r}
r=simrast(nx=3000,ny=1000,theta = 100)
r
```
Plot the raster showing the grid.
```{r,fig.height=3}
gplot(r)+
geom_raster(aes(fill = value))+
scale_fill_gradient(low = 'white', high = 'blue')+
coord_equal()+ylab("Y")+xlab("X")
```
## "Tile" the region
To parallelize spatial data, you often need to _tile_ the data and process each tile separately. Here is a function that takes a raster and a tile size and generates a tiling system. If given an `overlap` term, it will also add buffers to the tiles to reduce or eliminate edge effects, though how much overlap you need depends on the algorithm/model you are using.
```{r}
tilebuilder=function(raster,size=10,overlap=NULL){
## get raster extents
xmin=xmin(raster)
xmax=xmax(raster)
ymin=ymin(raster)
ymax=ymax(raster)
xmins=c(seq(xmin,xmax-size,by=size))
ymins=c(seq(ymin,ymax-size,by=size))
exts=expand.grid(xmin=xmins,ymin=ymins)
exts$ymax=exts$ymin+size
exts$xmax=exts$xmin+size
if(!is.null(overlap)){
#if overlapped tiles are requested, create new columns with buffered extents
exts$yminb=exts$ymin
exts$xminb=exts$xmin
exts$ymaxb=exts$ymax
exts$xmaxb=exts$xmax
t1=(exts$ymin-overlap)>=ymin
exts$yminb[t1]=exts$ymin[t1]-overlap
t2=exts$xmin-overlap>=xmin
exts$xminb[t2]=exts$xmin[t2]-overlap
t3=exts$ymax+overlap<=ymax
exts$ymaxb[t3]=exts$ymax[t3]+overlap
t4=exts$xmax+overlap<=xmax
exts$xmaxb[t4]=exts$xmax[t4]+overlap
}
exts$tile=1:nrow(exts)
return(exts)
}
```
Generate a tiling system for that raster. Here we will use only three tiles (feel free to play with this).
```{r}
jobs=tilebuilder(r,size=1000,overlap=80)
kable(jobs,row.names = F,digits = 2)
```
Plot the raster with the tiling system overlaid.
```{r,fig.height=4,fig.width=9}
ggplot(jobs)+
geom_raster(data=cbind.data.frame(
coordinates(r),fill = values(r)),
mapping = aes(x=x,y=y,fill = values(r)))+
scale_fill_gradient(low = 'white', high = 'blue')+
geom_rect(mapping=aes(xmin=xmin,xmax=xmax,
ymin=ymin,ymax=ymax),
fill="transparent",lty="dashed",col="darkgreen")+
geom_rect(aes(xmin=xminb,xmax=xmaxb,
ymin=yminb,ymax=ymaxb),
fill="transparent",col="black")+
geom_text(aes(x=(xminb+xmax)/2,y=(yminb+ymax)/2,
label=tile),size=10)+
coord_equal()+ylab("Y")+xlab("X")
```
## Run a simple spatial analysis: `focal` moving window
Use the `focal` function from the raster package to calculate a 101x101 moving window mean over the raster.
```{r, fig.height=4,fig.width=9}
stime2=system.time({
r_focal1=focal(r,w=matrix(1,101,101),mean,pad=T)
})
stime2
```
Plot it:
```{r}
gplot(r_focal1)+
geom_raster(aes(fill = value))+
scale_fill_gradient(low = 'white', high = 'blue')+
coord_equal()+ylab("Y")+xlab("X")
```
That works great (and is pretty fast) for this little example, but as the data (or the size of the window) get larger, it can become prohibitively slow.
## Repeat the analysis, but parallelize using the tile system.
First write a function that crops the original raster to a (buffered) tile, computes the focal mean, and then crops the result back to the unbuffered tile; the tiles are reassembled afterwards with `merge` via `.combine`. You could also put this logic directly in the `foreach()` loop.
```{r}
focal_par=function(i,raster,jobs,w=matrix(1,101,101)){
## identify which row in jobs to process
t_ext=jobs[i,]
## crop original raster to (buffered) tile
r2=crop(raster,extent(t_ext$xminb,t_ext$xmaxb,
t_ext$yminb,t_ext$ymaxb))
## run moving window mean over tile
rf=focal(r2,w=w,mean,pad=T)
## crop to tile
rf2=crop(rf,extent(t_ext$xmin,t_ext$xmax,
t_ext$ymin,t_ext$ymax))
## return the object - could also write the file to disk and aggregate later outside of foreach()
return(rf2)
}
```
Run the parallelized version.
```{r}
registerDoParallel(3)
ptime2=system.time({
r_focal=foreach(i=1:nrow(jobs),.combine=merge,
.packages=c("raster")) %dopar% focal_par(i,r,jobs)
})
```
Are the outputs the same?
```{r}
identical(r_focal,r_focal1)
```
So we were able to process the data in `r ptime2[3]` seconds when using `r foreach::getDoParWorkers()` CPUs and `r stime2[3]` seconds on one CPU. That's `r round(stime2[3]/ptime2[3],1)`X faster for this simple example.
## Parallelized Raster functions
Some functions in the raster package are also easy to parallelize.
```{r}
ncores=2
beginCluster(ncores)
fn=function(x) x^3
system.time(fn(r))
system.time(clusterR(r, fn, verbose=T))
endCluster()
```
`clusterR()` does _not_ work with:
* merge
* crop
* mosaic
* (dis)aggregate
* resample
* projectRaster
* focal
* distance
* buffer
* direction
# High Performance Computers (HPC)
_aka_ *supercomputers*. For example, check out the [University at Buffalo HPC](https://www.buffalo.edu/ccr.html).
Working on a cluster can be quite different from working on a laptop or workstation. The most important difference is the existence of a _scheduler_ that manages large numbers of individual tasks.
## SLURM and R
You typically don't run the script _interactively_, so you need to edit it to 'behave' like a normal `#!` (command-line) script. This is easy with the [getopt](http://cran.r-project.org/web/packages/getopt/index.html) package.
```{r}
cat(paste("
library(getopt)
## get options
opta <- getopt(
matrix(c(
'date', 'd', 1, 'character'
), ncol=4, byrow=TRUE))
## extract value
date=as.Date(opta$date)
## Now your script using date as an input
print(date+1)
q(\"no\")
"
),file=paste("script.R",sep=""))
```
Then you can run this script from the command line like this:
```{r, eval=F}
Rscript script.R --date 2013-11-05
```
You will need the complete path if `Rscript` is not on your system path. For example, on OS X, it might be at `/Library/Frameworks/R.framework/Versions/3.2/Resources/Rscript`.
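If you need to find it programmatically, `R.home("bin")` returns the `bin` directory of the currently running R installation:
```{r, eval=F}
## full path to the Rscript executable for this R installation
file.path(R.home("bin"), "Rscript")
```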
Or even from within R like this:
```{r, eval=F}
system("Rscript script.R --date 2013-11-05")
```
### Driving cluster from R
It is possible to drive the cluster from within R via SLURM. First, define the jobs and write that file to disk:
```{r}
script="script.R"
dates=seq(as.Date("2000-01-01"),as.Date("2000-12-31"),by=60)
pjobs=data.frame(jobs=paste(script,"--date",dates))
write.table(pjobs,
file="process.txt",
row.names=F,col.names=F,quote=F)
```
This table has one row per task:
```{r,results='markup'}
pjobs
```
Now identify other parameters for SLURM.
```{r,eval=FALSE}
### Set up submission script
nodes=2
walltime=5
```
### Write the SLURM script
[More information here](https://www.buffalo.edu/ccr.html)
```{r,eval=F}
### write SLURM script to disk from R
cat(paste("#!/bin/sh
#SBATCH --partition=general-compute
#SBATCH --time=00:",walltime,":00
#SBATCH --nodes=",nodes,"
#SBATCH --ntasks-per-node=8
#SBATCH --constraint=IB
#SBATCH --mem=300
# Memory per node specification is in MB. It is optional.
# The default limit is 3000MB per core.
#SBATCH --job-name=\"date_test\"
#SBATCH --output=date_test-srun.out
#SBATCH [email protected]
#SBATCH --mail-type=ALL
##SBATCH --requeue
#Specifies that the job will be requeued after a node failure.
#The default is that the job will not be requeued.
## Load necessary modules
module load openmpi/gcc-4.8.3/1.8.4
module load R
IDIR=~
WORKLIST=$IDIR/process.txt
EXE=Rscript
LOGSTDOUT=$IDIR/log/stdout
LOGSTDERR=$IDIR/log/stderr
### use mpiexec to parallelize across lines in process.txt
mpiexec -np $CORES xargs -a $WORKLIST -p $EXE 1> $LOGSTDOUT 2> $LOGSTDERR
",sep=""),file=paste("slurm_script.txt",sep=""))
```
Now we have a list of jobs and an `sbatch` script that points at those jobs with the necessary SLURM settings.
```{r,eval=FALSE}
## run it!
system("sbatch slurm_script.txt")
## Check status with squeue
system("squeue -u adamw")
```
For more detailed information about the UB HPC, [see the CCR Userguide](https://www.buffalo.edu/ccr/support/UserGuide.html).
# Summary
> Each task should involve computationally-intensive work. If the tasks are very small, it can take _longer_ to run in parallel.
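You can see this overhead directly by timing a trivially small task both ways; the parallel version may well be _slower_. A minimal sketch:
```{r, eval=F}
## tiny tasks: communication overhead can outweigh the computation
system.time(foreach(i=1:1000) %do% sqrt(i))
system.time(foreach(i=1:1000) %dopar% sqrt(i))
```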
## Choose your method
1. Run from master process (e.g. `foreach`)
- easier to implement and collect results
- fragile (one failure can kill it and lose results)
- clumsy for *big* jobs
2. Run as separate R processes
- see the [`getopt`](http://cran.r-project.org/web/packages/getopt/index.html) package
- safer for big jobs: each job is completely independent
- update the job list to re-run incomplete submissions (see the sketch after this list)
- compatible with SLURM / cluster computing
- forces you to have a clean processing script
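For example, the job list from the SLURM example above could be rebuilt to contain only the unfinished work. A minimal sketch, assuming (hypothetically) that each job writes a file named `output_<date>.csv` when it completes:
```{r, eval=F}
## hypothetical: each completed job wrote output_<date>.csv
done=file.exists(paste0("output_",dates,".csv"))
## overwrite process.txt with only the incomplete jobs
write.table(data.frame(jobs=paste(script,"--date",dates[!done])),
            file="process.txt",
            row.names=F,col.names=F,quote=F)
```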
## Further Reading
* [CRAN Task View: High-Performance and Parallel Computing with R](http://cran.r-project.org/web/views/HighPerformanceComputing.html)
* [Simple Parallel Statistical Computing in R](http://www.stat.uiowa.edu/~luke/talks/uiowa03.pdf)
* [Parallel Computing with the R Language in a Supercomputing Environment](https://doi.org/10.1007/978-3-642-13872-0_64)