The effective number of parallel jobs decreases during the computation when the total number of jobs is larger than number of cores #644
-
Hello, I am fitting n = 1000 Stan models on a 64-core machine using the doFuture package. Below is my code.

Each model fit uses 4 cores/threads in parallel (parallel_chains = 4), so with 128 threads I should have 32 models fitting at the same time. That was true at the beginning of the computation: all 64 cores (128 threads) were indeed in use. I expected that whenever one fit finishes (freeing 4 threads), another model fit would start, so all CPUs would stay busy until only the last few models remain, at which point the number of active CPUs would taper off.

However, the total number of working CPUs decreased little by little: around the middle of the computation only ~32 cores were working, and this kept dropping until just 4 cores were active and the remaining models were computed one after another, which increased the total computation time a lot. In other words, the effective number of parallel jobs decreased little by little until it reached 4 (parallel_chains = 4).

Could you please help me fix this problem? I guess it is related to parallel_chains = 4 within a single model fit, but I don't know the exact reason. Thank you very much.
-
Hello. If I understand your description of the problem correctly, it sounds like a so-called "load balancing" issue, where you end up with parallel workers sitting idle toward the end.

The default behavior of doFuture, and also of its siblings future.apply and furrr, is to take all N iterations and chunk them up into W equally sized portions, where W is the number of workers. Each worker then processes one chunk. In your case, with N = 1000 and W = 64, each worker processes 15-16 Stan models. It sounds like some chunks finish much sooner than others, so at the end only a few workers are actually using the CPU.

There are ways to configure which chunking strategy to use. See Section 'Load balancing ("chunking")' in https://dofuture.futureverse.org/reference/doFuture.html. The default argument value is […]. So, try that and see if it helps.
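As an illustrative sketch (not from the thread itself): per the 'Load balancing ("chunking")' section of the doFuture documentation, chunking can be controlled through the `.options.future` argument of `foreach()`. Here, `fit_one()` is a hypothetical stand-in for the poster's Stan model-fitting call, and the worker count is assumed from the numbers in the discussion.

```r
library(doFuture)
library(foreach)

## With 128 threads and parallel_chains = 4 per fit, allow up to
## 32 concurrent model fits (numbers assumed from the discussion).
plan(multisession, workers = 32)

fits <- foreach(
  i = seq_len(1000),
  ## chunk.size = 1: hand out one iteration per future, so a worker
  ## that finishes early immediately picks up the next model instead
  ## of idling once its pre-assigned chunk is done.
  .options.future = list(chunk.size = 1L)
) %dofuture% {
  fit_one(i)  # placeholder for the cmdstanr fit with parallel_chains = 4
}
```

The trade-off: the default (one chunk per worker) minimizes dispatch overhead but lets slow chunks strand fast workers, while `chunk.size = 1L` adds per-iteration dispatch cost in exchange for dynamic balancing, which should pay off when each iteration is an expensive Stan fit.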