Failed to retrieve the result of MulticoreFuture (<none>) from the forked worker (on localhost; PID 62510). Post-mortem diagnostic: No process exists with this PID, i.e. the forked localhost worker is no longer alive. #498

nlarusstone · 2021-03-18T21:53:25Z

nlarusstone
Mar 18, 2021

I'm getting the following error: Failed to retrieve the result of MulticoreFuture (<none>) from the forked worker (on localhost; PID 62510). Post-mortem diagnostic: No process exists with this PID, i.e. the forked localhost worker is no longer alive.

This error occurs intermittently in the normal course of running our scripts.
It's difficult for me to provide a reprex, as the only way for me to consistently reproduce this is by running my code and sending kill -9 signals to child processes that are spawned.
Here's the code that produces that error:

plan(multicore)
foo <- df %>%
      mutate(
        fit = future_map(data, fit.fxn, formula = effect ~ bar),
      )

Is there a way for me to ensure that if a child process dies it gets restarted?

Here's my session info:

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] logger_0.2.0     units_0.6-7      forcats_0.5.1    stringr_1.4.0
 [5] dplyr_1.0.4      purrr_0.3.4      readr_1.4.0      tidyr_1.1.2
 [9] tibble_3.0.6     tidyverse_1.3.0  magrittr_2.0.1   rlang_0.4.10
[13] stringi_1.5.3    readxl_1.3.1     furrr_0.1.0.9002 future_1.21.0
[17] broom_0.7.4      ggridges_0.5.3   ggbeeswarm_0.6.0 paletteer_1.3.0
[21] ineq_0.2-13      drc_3.0-1        caret_6.0-86     ggplot2_3.3.3
[25] lattice_0.20-41  sandwich_3.0-0   glmnet_4.1       Matrix_1.2-18
[29] MASS_7.3-53      edgeR_3.32.1     limma_3.46.0     here_1.0.1

loaded via a namespace (and not attached):
 [1] nlme_3.1-149         fs_1.5.0             lubridate_1.7.9.2
 [4] httr_1.4.2           rprojroot_2.0.2      tools_4.0.3
 [7] backports_1.2.1      utf8_1.1.4           R6_2.5.0
[10] rpart_4.1-15         vipor_0.4.5          DBI_1.1.1
[13] colorspace_2.0-0     nnet_7.3-14          withr_2.4.1
[16] tidyselect_1.1.0     curl_4.3             compiler_4.0.3
[19] cli_2.3.0            rvest_0.3.6          xml2_1.3.2
[22] scales_1.1.1         mvtnorm_1.1-1        digest_0.6.27
[25] foreign_0.8-80       rio_0.5.16           pkgconfig_2.0.3
[28] parallelly_1.23.0    plotrix_3.8-1        dbplyr_2.1.0
[31] rstudioapi_0.13      shape_1.4.5          generics_0.1.0
[34] zoo_1.8-8            jsonlite_1.7.2       gtools_3.8.2
[37] ModelMetrics_1.2.2.2 zip_2.1.1            car_3.0-10
[40] fansi_0.4.2          Rcpp_1.0.6           munsell_0.5.0
[43] abind_1.4-5          lifecycle_1.0.0      multcomp_1.4-16
[46] pROC_1.17.0.1        carData_3.0-4        plyr_1.8.6
[49] recipes_0.1.15       grid_4.0.3           parallel_4.0.3
[52] listenv_0.8.0        crayon_1.4.1         haven_2.3.1
[55] splines_4.0.3        hms_1.0.0            locfit_1.5-9.4
[58] ps_1.5.0             pillar_1.4.7         reshape2_1.4.4
[61] codetools_0.2-16     stats4_4.0.3         reprex_1.0.0
[64] glue_1.4.2           data.table_1.13.6    modelr_0.1.8
[67] vctrs_0.3.6          foreach_1.5.1        cellranger_1.1.0
[70] gtable_0.3.0         rematch2_2.1.2       assertthat_0.2.1
[73] gower_0.2.2          openxlsx_4.2.3       prodlim_2019.11.13
[76] class_7.3-17         survival_3.2-7       timeDate_3043.102
[79] iterators_1.0.13     beeswarm_0.2.3       lava_1.6.8.1
[82] globals_0.14.0       TH.data_1.0-10       ellipsis_0.3.1
[85] ipred_0.9-9

HenrikBengtsson · 2021-03-18T22:27:35Z

HenrikBengtsson
Mar 18, 2021
Maintainer

What's fit.fxn()?

Does it happen also when you use plan(multisession)? Forked processing (which "multicore" uses) is known to be unreliable in various environments and with various packages.

Is there a way for me to ensure that if a child process dies it gets restarted?

Unfortunately not; that's a frequently requested feature that's on the roadmap but it involves lots of work and several things need to be in place first, so not anything soon.

0 replies

nlarusstone · 2021-03-18T22:58:46Z

nlarusstone
Mar 18, 2021
Author

What's fit.fxn()?

Does it happen also when you use plan(multisession)? Forked processing (which "multicore" uses) is known to be unreliable in various environments and with various packages.

Is there a way for me to ensure that if a child process dies it gets restarted?

Unfortunately not; that's a frequently requested feature that's on the roadmap but it involves lots of work and several things need to be in place first, so not anything soon.

fit.fxn() is a function I wrote that calls a function from the drc package (https://cran.r-project.org/web/packages/drc/drc.pdf)

fit.fxn <- (df, formula, control) {
      res <- drm(formula,
        data = df,
        fct = LL2.4(names = c("Slope", "Lower", "Upper", "EC50")),
        control = control
      )
}

Are you suggesting I use plan(multisession) as part of my pipeline or just when trying to reproduce? I haven't been using it since it's quite a bit slower (the function is pretty quick, and there seems to be quite a bit more overhead when starting up new sessions).

But I am able to reproduce this by knocking out child processes that were started using plan(multisession):

✖ Failed to retrieve the value of MultisessionFuture (<none>) from cluster RichSOCKnode #9 (PID 36037 on localhost ‘localhost’). The reason reported was ‘error reading from connection’. Post-mortem diagnostic: No process exists with this PID, i.e. the localhost worker is no longer alive.

So I'm not convinced that dropping in plan(multisession) would solve my issues (unless they're less likely to die for some reason)

0 replies

HenrikBengtsson · 2021-03-18T23:19:24Z

HenrikBengtsson
Mar 18, 2021
Maintainer

Are you suggesting I use plan(multisession) as part of my pipeline or just when trying to reproduce?

For troubleshooting purposes. Something is causing one or more of your parallel workers to die - not just error but crash so that the R process terminates. That can happen for several reasons. Since plan(multicore) is used and forked parallel processing is known to be unstable in different cases(*), that's the first thing to rule out. If it runs well with plan(multisession), that's a really useful clue.

You can also set

options(future.globals.onReference = "error")

to (99%) rule out that you're not using any objects that cannot be used in parallelization (https://cran.r-project.org/web/packages/future/vignettes/future-4-non-exportable-objects.html). If you get an error from running with the above, that's another good clue, especially since forked processing sometimes misleads us to believe it works (but it's actually very unstable).

It could also be that you're running out of memory on the workers causing them to crash.

I'm almost 100% certain the underlying problem is unrelated to the future package per se. If you replace your future_map() + plan(multicore) with a parallel::mclapply(), I'd guess it'll crash too.

Also, there was a bug in R's parallel that was just fixed that possibly could also explain this problem.
See https://twitter.com/henrikbengtsson/status/1371874688037646337. To know that, it would mean to trace down the source of called code to see if anything is calling parallel::mclapply(..., mc.preschedule=FALSE). I don't think that's the problem here since you say it's only happening once in a while, but I wanted to mention it just in case.

(*) R Core & mclapply author Simon Urbanek wrote on R-devel (April 2020): “Do NOT use mcparallel() in packages except as a non-default option that user can set ... Multicore is intended for HPC applications that need to use many cores for computing-heavy jobs, but it does not play well with RStudio and more importantly you [as the developer] don't know the resource available so only the user can tell you when it's safe to use.”

0 replies

nlarusstone · 2021-03-18T23:37:31Z

nlarusstone
Mar 18, 2021
Author

Hmmm this is helpful -- the difficult thing is that I can't reproduce consistently just by running my code, so it's difficult to tell if just dropping in plan(multisession) will fix everything.

Are you suggesting I use plan(multisession) as part of my pipeline or just when trying to reproduce?

For troubleshooting purposes. Something is causing one or more of your parallel workers to die - not just error but crash so that the R process terminates. That can happen for several reasons. Since plan(multicore) is used and forked parallel processing is known to be unstable in different cases(*), that's the first thing to rule out. If it runs well with plan(multisession), that's a really useful clue.

You can also set
options(future.globals.onReference = "error")
to (99%) rule out that you're not using any objects that cannot be used in parallelization (https://cran.r-project.org/web/packages/future/vignettes/future-4-non-exportable-objects.html). If you get an error from running with the above, that's another good clue, especially since forked processing sometimes misleads us to believe it works (but it's actually very unstable).

I set this and I do see errors -- are you suggesting that those objects are causing the error? I don't have control over most of the code, since I'm calling out to an external package, so I'm not sure if I would be able to remove all of those errors. Is the idea that multisession has no chance of sharing the objects since it doesn't spawn child objects, so it would sidestep the errors.

It could also be that you're running out of memory on the workers causing them to crash.

This is possible... I'm a bit skeptical that this is the issue since all workers should have the same workloads and I'd like to think this error would show up more consistently.

I'm almost 100% certain the underlying problem is unrelated to the future package per se. If you replace your future_map() + plan(multicore) with a parallel::mclapply(), I'd guess it'll crash too.

Also, there was a bug in R's parallel that was just fixed that possibly could also explain this problem.
See https://twitter.com/henrikbengtsson/status/1371874688037646337. To know that, it would mean to trace down the source of called code to see if anything is calling parallel::mclapply(..., mc.preschedule=FALSE). I don't think that's the problem here since you say it's only happening once in a while, but I wanted to mention it just in case.

This is super interesting! That is the exact error, though I'm reasonably sure the package I'm using isn't using parallel, so it would have to be in the parallel_map call.

(*) R Core & mclapply author Simon Urbanek wrote on R-devel (April 2020): “Do NOT use mcparallel() in packages except as a non-default option that user can set ... Multicore is intended for HPC applications that need to use many cores for computing-heavy jobs, but it does not play well with RStudio and more importantly you [as the developer] don't know the resource available so only the user can tell you when it's safe to use.”

0 replies

HenrikBengtsson · 2021-03-19T00:42:06Z

HenrikBengtsson
Mar 19, 2021
Maintainer

I set this and I do see errors -- are you suggesting that those objects are causing the error?

I think that's a very strong clue. Exactlty what does the error say? The error message may provide clues on what type of object is involved or from which package it originates.

That is the exact error, ...

Note that you get that error message for any type of problem that causes your multicore worker to die - so there can still be many different for why it's dying.

... though I'm reasonably sure the package I'm using isn't using parallel, so it would have to be in the parallel_map call.

It can also be in drc::drm(), or any code that it calls, including code part of packages it relies on. This is why forked processing is so risky - you don't have full control of what's happening underneath.

0 replies

nlarusstone · 2021-03-19T16:33:09Z

nlarusstone
Mar 19, 2021
Author

Here's the error I'm getting:

✖ Detected a non-exportable reference (‘externalptr’) in one of the globals (<unknown>) used in the future expression

That's not immediately clear to me where I should be looking -- would love some advice on how I can narrow down the issue.

It can also be in drc::drm(), or any code that it calls, including code part of packages it relies on. This is why forked processing is so risky - you don't have full control of what's happening underneath.

I don't understand this -- shouldn't both multisession and multicore fork new processes?

0 replies

HenrikBengtsson · 2021-03-19T17:21:39Z

HenrikBengtsson
Mar 19, 2021
Maintainer

That's not immediately clear to me where I should be looking ...

Me neither; I was hoping it would mention a variable name as lead, e.g. "... one of the globals ('var_a') ...".

I don't understand this -- shouldn't both multisession and multicore fork new processes?

Yes, they do, with the exception that 'multisession' does not fork, it spawns a process that runs in the background. It's only 'multicore' that relies on "forking", which is a very special concept in operating systems/parallel processing. It's important to distinguish forking from all other types of parallel processing. The core problem with forking is that you cannot run everything inside a forked process. So, it can very well be that drc::drm() works great inside a forked process but then all of a sudden after an update it no longer does. If the maintainer of drc don't test their code in forked processes, they will never know there's a problem. So, take-home message is: "forked parallel processing is neat and quick when it works, but it's often a black box and if you run into errors it's really tricky to narrow down, especially since the problem may occur sporadically". I don't know enough to narrow that down myself either - it's hard. So, to troubleshoot your problem, I suggest first ruling out that there are other things going on, e.g. by testing with plan(multisession).

0 replies

HenrikBengtsson · 2021-03-19T17:25:52Z

HenrikBengtsson
Mar 19, 2021
Maintainer

Oh, BTW, and importantly, the error Detected a non-exportable reference (‘externalptr’) in one of the globals (<unknown>) can be a false positive. Some months ago, this error started to show up for unknown reasons, because we haven't had time to track it down (futureverse/furrr#168). So, in your case, I would focus trying to prove it fails also with plan(multisession).

0 replies

nlarusstone · 2021-03-19T22:08:55Z

nlarusstone
Mar 19, 2021
Author

Thanks so much for your help! I've was finally able to isolate a test case that failed consistently. The plan(multisession) seemed to fix that error, so that's a reasonable workaround.

One thing I've noticed is that plan(multisession) is much slower than plan(multicore). My hypothesis is that since the workload is relatively quick, the cost of spawning a process becomes a much bigger fraction of the runtime. Does that sound realistic to you? Also, do you have any tips to alleviate that (I'm happy to open a new issue if you think that's useful).

0 replies

HenrikBengtsson · 2021-03-19T22:16:58Z

HenrikBengtsson
Mar 19, 2021
Maintainer

I've was finally able to isolate a test case that failed consistently. The plan(multisession) seemed to fix that error, so that's a reasonable workaround.

That's great. Then to 100% rule out it's related to the future framework, you could replace your:

fit = furrr::future_map(data, fit.fxn, formula = effect ~ bar)

call with the following counter part:

fit = parallel::mclapply(data, fit.fxn, formula = effect ~ bar)

If that also crashes in your test case, then we can be pretty certain it has to do with forking.

Does that sound realistic to you?

Yes, forking is faster since it's more lightweight with less overhead and also implemented by the OS itself.

Unfortunately, there's nothing magic we can do to lower the overhead of the other parallelization backends down the same level.

BTW, if your test case is small and can be shared, please consider doing so. It might help someone else and it might even be that someone will see it and fix it.

0 replies

nlarusstone · 2021-03-19T22:28:09Z

nlarusstone
Mar 19, 2021
Author

Yeah, let me try to get a reprex over this weekend and also try the parallel version -- I'll post back here when I've done it

0 replies

tylerlittlefield · 2021-03-20T00:13:43Z

tylerlittlefield
Mar 20, 2021

I've was finally able to isolate a test case that failed consistently.

@nlarusstone I'm really interested to see what you did to replicate the problem. I have a shiny app using multiprocess which gives an error:

Warning: Error in : Detected an error (‘fatal error in wrapper code’) by the 'parallel' package while trying to retrieve the value of a MulticoreFuture (‘<none>’). This could be because the forked R process that evaluates the future was terminated before it was completed

And seems to be related to #226. I'm thinking of just setting the plan to multisession and see if the error just magically disappears.

0 replies

HenrikBengtsson · 2021-03-22T02:08:45Z

HenrikBengtsson
Mar 22, 2021
Maintainer

@nlarusstone, install:

remotes::install_github("HenrikBengtsson/future", ref = "09f9b7d")

then retry with:

options(future.globals.onReference = "error")

I'm quite certain that the error on:

Detected a non-exportable reference (‘externalptr’) in one of 
the globals (<unknown>) used in the future expression

will go away. If it does, you can rule out that external pointers are the problem. (The above version fixes a bug where the future framework would think that there are external pointers when there aren't).

0 replies

nlarusstone · 2021-03-22T23:20:20Z

nlarusstone
Mar 22, 2021
Author

I've was finally able to isolate a test case that failed consistently.

@nlarusstone I'm really interested to see what you did to replicate the problem. I have a shiny app using multiprocess which gives an error:
Warning: Error in : Detected an error (‘fatal error in wrapper code’) by the 'parallel' package while trying to retrieve the value of a MulticoreFuture (‘<none>’). This could be because the forked R process that evaluates the future was terminated before it was completed
And seems to be related to #226. I'm thinking of just setting the plan to multisession and see if the error just magically disappears.

@tyluRp unfortunately I don't have any good advice for you. We happened to get lucky where a specific set of data consistently reproduced this error. I'm working on a minimal reprex, but it's difficult to get the error to reproduce.

0 replies

nlarusstone · 2021-03-22T23:39:59Z

nlarusstone
Mar 22, 2021
Author

@nlarusstone, install:
remotes::install_github("HenrikBengta bug whersson/future", ref = "09f9b7d")
then retry with:
options(future.globals.onReference = "error")
I'm quite certain that the error on:
Detected a non-exportable reference (‘externalptr’) in one of 
the globals (<unknown>) used in the future expression
will go away. If it does, you can rule out that external pointers are the problem. (The above version fixes a bug where the future framework would think that there are external pointers when there aren't).

@HenrikBengtsson I installed that version and you're right, that error went away. Still not quite sure what the exact error was...but seems like plan(multisession) works for now.

0 replies

HenrikBengtsson · 2021-03-23T01:13:30Z

HenrikBengtsson
Mar 23, 2021
Maintainer

Awesome. So that rules out one thing for you; that previous error on external pointers (https://cran.r-project.org/web/packages/future/vignettes/future-4-non-exportable-objects.html) was a false alert and that was not the cause of your original problem.

0 replies

HenrikBengtsson · 2021-03-23T01:16:07Z

HenrikBengtsson
Mar 23, 2021
Maintainer

Still not quite sure what the exact error was...but seems like plan(multisession) works for now.

And if the problem comes back if you go back to plan(multicore), then you have s pretty strong case that there's some code in that package that cannot be used in forked processing.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Failed to retrieve the result of MulticoreFuture (<none>) from the forked worker (on localhost; PID 62510). Post-mortem diagnostic: No process exists with this PID, i.e. the forked localhost worker is no longer alive. #498

{{title}}

Replies: 17 comments

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Failed to retrieve the result of MulticoreFuture (<none>) from the forked worker (on localhost; PID 62510). Post-mortem diagnostic: No process exists with this PID, i.e. the forked localhost worker is no longer alive. #498

nlarusstone Mar 18, 2021

Replies: 17 comments

HenrikBengtsson Mar 18, 2021 Maintainer

nlarusstone Mar 18, 2021 Author

HenrikBengtsson Mar 18, 2021 Maintainer

nlarusstone Mar 18, 2021 Author

HenrikBengtsson Mar 19, 2021 Maintainer

nlarusstone Mar 19, 2021 Author

HenrikBengtsson Mar 19, 2021 Maintainer

HenrikBengtsson Mar 19, 2021 Maintainer

nlarusstone Mar 19, 2021 Author

HenrikBengtsson Mar 19, 2021 Maintainer

nlarusstone Mar 19, 2021 Author

tylerlittlefield Mar 20, 2021

HenrikBengtsson Mar 22, 2021 Maintainer

nlarusstone Mar 22, 2021 Author

nlarusstone Mar 22, 2021 Author

HenrikBengtsson Mar 23, 2021 Maintainer

HenrikBengtsson Mar 23, 2021 Maintainer

nlarusstone
Mar 18, 2021

HenrikBengtsson
Mar 18, 2021
Maintainer

nlarusstone
Mar 18, 2021
Author

HenrikBengtsson
Mar 18, 2021
Maintainer

nlarusstone
Mar 18, 2021
Author

HenrikBengtsson
Mar 19, 2021
Maintainer

nlarusstone
Mar 19, 2021
Author

HenrikBengtsson
Mar 19, 2021
Maintainer

HenrikBengtsson
Mar 19, 2021
Maintainer

nlarusstone
Mar 19, 2021
Author

HenrikBengtsson
Mar 19, 2021
Maintainer

nlarusstone
Mar 19, 2021
Author

tylerlittlefield
Mar 20, 2021

HenrikBengtsson
Mar 22, 2021
Maintainer

nlarusstone
Mar 22, 2021
Author

nlarusstone
Mar 22, 2021
Author

HenrikBengtsson
Mar 23, 2021
Maintainer

HenrikBengtsson
Mar 23, 2021
Maintainer