callr not working with nested loops #27

Open
jestover opened this issue Jan 3, 2024 · 5 comments

@jestover

jestover commented Jan 3, 2024

library(furrr)
library(future.callr)  # provides the `callr` future backend used in plan() below

inner_loop <- function(x){future_map_dbl(x, ~ .x)}
outer_loop <- function(x){future_map(x, ~ inner_loop(.x))}
x <- list(1:10, 11:20, 21:30, 31:40)

plan(sequential)
outer_loop(x)
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10

[[2]]
 [1] 11 12 13 14 15 16 17 18 19 20

[[3]]
 [1] 21 22 23 24 25 26 27 28 29 30

[[4]]
 [1] 31 32 33 34 35 36 37 38 39 40

plan(list(tweak(multisession, workers = 2), tweak(multisession, workers = 2)))
outer_loop(x)
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10

[[2]]
 [1] 11 12 13 14 15 16 17 18 19 20

[[3]]
 [1] 21 22 23 24 25 26 27 28 29 30

[[4]]
 [1] 31 32 33 34 35 36 37 38 39 40

plan(list(tweak(callr, workers = 2), tweak(callr, workers = 2)))
outer_loop(x)
Error in (function (.x, .f, ..., .progress = FALSE)  : 
  ℹ In index: 1.
Caused by error:
! object '...furrr_map_fn' not found

plan(list(tweak(callr, workers = 2), tweak(multisession, workers = 2)))
outer_loop(x)
Error in (function (.x, .f, ..., .progress = FALSE)  : 
  ℹ In index: 1.
Caused by error:
! object '...furrr_map_fn' not found

plan(callr(workers = 2))
outer_loop(x)
Error in (function (.x, .f, ..., .progress = FALSE)  : 
  ℹ In index: 1.
Caused by error in `vctrs::vec_c()`:
! Can't convert `..1` <list> to <double>.

plan(list(tweak(multisession, workers = 2), tweak(callr, workers = 2)))
outer_loop(x)
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10

[[2]]
 [1] 11 12 13 14 15 16 17 18 19 20

[[3]]
 [1] 21 22 23 24 25 26 27 28 29 30

[[4]]
 [1] 31 32 33 34 35 36 37 38 39 40

I ran this example on a 2023 MacBook Pro, but I originally discovered the issue while running code on a Linux server. The real code is a cross-validation exercise on some large textual data. Memory usage kept growing over time, so I switched to the callr backend to try to address that, but I keep running into other problems. This is the first one I was able to replicate in a small reproducible example. Let me know if there is any other useful information I can provide.

@HenrikBengtsson
Collaborator

Thanks for the report. I can reproduce this. I also noticed that we get another error without nested parallelization:

library(furrr)
plan(future.callr::callr, workers = 1)

inner_loop <- function(x) { future_map_dbl(x, ~ .x) }
outer_loop <- function(x) { future_map(x, ~ inner_loop(.x)) }
x <- list(1:10, 11:20, 21:30, 31:40)
y <- outer_loop(x)

gives

Error in (function (.x, .f, ..., .progress = FALSE)  : 
  ℹ In index: 1.
Caused by error in `vctrs::vec_c()`:
! Can't convert `..1` <list> to <double>.

It does not happen with other backends, e.g. plan(future.batchtools::batchtools_local), plan(future::cluster, workers = 1), and plan(future::multisession, workers = 2).

@HenrikBengtsson
Collaborator

I can also reproduce it without NSE, i.e.

library(furrr)
library(future.callr)  # needed for the `callr` backend referenced in plan()
plan(list(tweak(callr, workers = 2), tweak(callr, workers = 2)))
inner_loop <- function(x) { future_map_dbl(x, identity) }
outer_loop <- function(x) { future_map(x, inner_loop) }
x <- list(1:10, 11:20, 21:30, 31:40)
y <- outer_loop(x)

produces the same error.

What's interesting, though, is that it looks specific to furrr. For example, I cannot reproduce it with future.apply:

library(future.apply)
library(future.callr)
plan(list(tweak(callr, workers = 2), tweak(callr, workers = 2)))

inner_loop <- function(x) { future_sapply(x, FUN = identity) }
outer_loop <- function(x) { future_lapply(x, FUN = inner_loop) }
x <- list(1:10, 11:20, 21:30, 31:40)
y <- outer_loop(x)
str(y)
#> List of 4
#>  $ : int [1:10] 1 2 3 4 5 6 7 8 9 10
#>  $ : int [1:10] 11 12 13 14 15 16 17 18 19 20
#>  $ : int [1:10] 21 22 23 24 25 26 27 28 29 30
#>  $ : int [1:10] 31 32 33 34 35 36 37 38 39 40

It also works with doFuture:

library(doFuture)
library(future.callr)
plan(list(tweak(callr, workers = 2), tweak(callr, workers = 2)))

inner_loop <- function(x) { foreach(z = x, .combine = c) %dofuture% z }
outer_loop <- function(x) { foreach(z = x) %dofuture% inner_loop(z) }
x <- list(1:10, 11:20, 21:30, 31:40)
y <- outer_loop(x)
str(y)
#> List of 4
#>  $ : int [1:10] 1 2 3 4 5 6 7 8 9 10
#>  $ : int [1:10] 11 12 13 14 15 16 17 18 19 20
#>  $ : int [1:10] 21 22 23 24 25 26 27 28 29 30
#>  $ : int [1:10] 31 32 33 34 35 36 37 38 39 40

@jestover
Author

jestover commented Jan 4, 2024

I had noticed the separate errors as well. I had also been getting hard-to-diagnose errors in the real code that I was trying to use callr for. Here are some examples in case they are helpful (they may not be without the full context).

Error: CallrFuture (<none>) failed. The reason reported was ‘! callr subprocess failed: could not read result from callr’. Post-mortem diagnostic: The parallel worker (PID 38814) started at 2023-12-28T22:53:37+0000 finished with exit code 0. The total size of the 8 globals exported is 603.19 MiB. The three largest globals are ‘...furrr_dots’ (603.02 MiB of class ‘list’), ‘future_dmr’ (79.36 KiB of class ‘function’) and ‘...furrr_fn’ (55.33 KiB of class ‘function’)
Execution halted

Error in (function (.x, .f, ..., .progress = FALSE)  : ℹ In index: 1.
Caused by error in `do.call()`:
! object '...furrr_map_fn' not found
Calls: robust_mnir ... resolve.list -> signalConditionsASAP -> signalConditions
Execution halted

Error in (function (.x, .f, ..., .progress = FALSE)  : ℹ In index: 1.
Caused by error:
ℹ In index: 1.
Caused by error in `...furrr_fn()`:
! unused arguments (.y = list(c(2, 0, 8, 1, 2, 1, 2, 0, 3, 0, 3, 1, 2, 1, 1, 1, 0, 1, 1, 0, 1, 2, 6, 2, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 2, 2, 2, 0, 0, 0, 0, 1, 1, 2, 3, 12, 0, 0, 0, 0, 0, 0, 5, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 13, 0, 0, 0, 0, 1, 0, 0, 1, 2, 1, 0, 3, 1, 0, 0, 0, 2, 2, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 5, 0, 13, 0, 0, 0, 2, 0, 0, 3, 1, 0, 0, 4, 0, 2, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 5, 4, 2, 0, 0, 1, 0, 0,
0, 0, 0, 1, 1, 3, 1, 1, 0, 2, 1, 0, 1, 1, 0, 0, 3, 0, 1, 1, 1, 1, 0, 0, 0, 2, 1, 0, 2, 4, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 2, 2, 1, 0, 0, 1, 1, 2, 1, 1, 5, 3, 0, 0, 2, 0, 0, 0, 0, 1, 9, 1, 11, 0, 0, 1, 2, 0, 1, 0, 17, 2, 1, 1, 0, 1, 0, 0, 10, 2, 1, 0, 0, 0, 0, 1, 1, 5, 0, 2, 1, 0, 1, 0, 0, 0, 3, 0, 0, 1, 0, 2, 1, 0, 0, 1, 0, 3, 1, 0, 1, 1, 2, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 2, 1, 6, 1, 0, 0
Calls: robust_mnir ... resolve.list -> signalConditionsASAP -> signalConditions
Execution halted

Given that this is only an issue with furrr, would you prefer me to repost the issue there?

@jestover
Author

jestover commented Jan 5, 2024

Here is an error from a more recent attempt with plan(list(tweak(multisession, workers = 6), tweak(callr, workers = 8))):

Error in unserialize(node$con) :
  MultisessionFuture (<none>) failed to receive message results from cluster RichSOCKnode #4 (PID 255015 on localhost ‘localhost’). The reason reported was ‘error reading from connection’. Post-mortem diagnostic: No process exists with this PID, i.e. the localhost worker is no longer alive. The total size of the 8 globals exported is 603.20 MiB. The three largest globals are ‘...furrr_dots’ (603.02 MiB of class ‘list’), ‘future_dmr’ (79.36 KiB of class ‘function’) and ‘...furrr_fn’ (55.33 KiB of class ‘function’)
Calls: robust_mnir ... resolved -> resolved.ClusterFuture -> receiveMessageFromWorker
Execution halted

@HenrikBengtsson
Collaborator

HenrikBengtsson commented Jan 5, 2024

Thanks for more examples and details.

The error "MultisessionFuture (<none>) failed to receive message results from cluster RichSOCKnode #4 (PID 255015 on localhost ‘localhost’). The reason reported was ‘error reading from connection’. Post-mortem diagnostic: No process exists with this PID, i.e. the localhost worker is no longer alive..." suggests that the "multisession" background process terminated abruptly.

Similarly, the error "CallrFuture (<none>) failed. The reason reported was ‘! callr subprocess failed: could not read result from callr’..." suggests that the "callr" background process is no longer responding. The post-mortem diagnostic "The parallel worker (PID 38814) started at 2023-12-28T22:53:37+0000 finished with exit code 0" confirms that it is no longer running.

That a background R process terminates prematurely suggests that something exceptional happened in that process. A simple run-time error would not cause this. Instead, it might be due to a core dump (which should never happen in R), or to the process running out of memory. If you could load the same objects and packages into an interactive R session and run the same code, it would most likely also crash. In other words, there's nothing special about background R processes, other than that with parallelization we might run many more of them at the same time.
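
To make the out-of-memory hypothesis concrete, here is a rough back-of-the-envelope sketch in base R. It uses the figures reported in the error messages above (603.20 MiB of exported globals, an outer layer of 6 multisession workers and an inner layer of 8 callr workers). It assumes, as a worst case, that every outer and inner worker process holds its own full copy of the globals at the same time; that assumption is not confirmed anywhere in this thread. The `measure_mib()` helper and the `example_dots` list are hypothetical names used only for illustration.

```r
# Figures taken from the error messages above
globals_mib   <- 603.20  # total size of the exported globals per future
outer_workers <- 6L      # tweak(multisession, workers = 6)
inner_workers <- 8L      # tweak(callr, workers = 8)

# Worst-case assumption (not verified): each outer worker and each of its
# inner workers holds one full copy of the globals simultaneously.
n_processes    <- outer_workers + outer_workers * inner_workers
worst_case_gib <- n_processes * globals_mib / 1024

cat(sprintf("%d processes, up to ~%.1f GiB of globals\n",
            n_processes, worst_case_gib))

# Hypothetical helper: size of each element of a list in MiB, largest first.
# Handy for checking what is passed via `...` before launching futures.
measure_mib <- function(objs) {
  sizes <- vapply(objs, function(obj) as.numeric(object.size(obj)), numeric(1))
  sort(sizes / 1024^2, decreasing = TRUE)
}

# Illustrative stand-in for the arguments passed through `...`
example_dots <- list(small = 1:10, large = numeric(1e6))
print(round(measure_mib(example_dots), 2))
```

If the estimate is anywhere near the physical memory of the machine, lowering the number of workers or passing less data through `...` would be the first things to try.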

These errors, due to "crashed" workers, are independent of the other errors, including your original error on object '...furrr_map_fn' not found. The latter errors are due to something not working correctly in future, future.callr, or furrr.

> Given that this is only an issue with furrr, would you prefer me to repost the issue there?

Let's keep it here until we know a bit more about why this happens.

cc/ @DavisVaughan
