future_map not obviously faster than map in simple linear regression setting #267

Open
wbvguo opened this issue Jan 13, 2024 · 1 comment


wbvguo commented Jan 13, 2024

Dear furrr developer,

Thank you for maintaining this package. I recently tried it and found that future_map gives no obvious speed-up over map in a simple linear regression setting. Below is a toy example, with a plot of how the running time changes with the number of workers.

benchmark

require(dplyr); require(furrr); require(purrr); require(tidyr)

# Create some large dataset
Data <- as_tibble(mtcars)
Data <- vctrs::vec_rep(Data, 50000)
Data$ID <- vctrs::vec_rep_each(1:50000, nrow(mtcars))

NestedData <- Data %>% 
  nest(.by = ID)

map_vec = vector(mode = "numeric", length = 10)
future_map_vec = vector(mode = "numeric", length = 10)

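for (i in seq(10)) {
  future::plan(multisession, workers = i)

  # Sequential baseline (does not depend on the number of workers)
  stamp1 <- Sys.time()
  xx <- mutate(NestedData, data2 = map(data, identity))
  # Pin the units to seconds: a bare difftime picks its units automatically,
  # and those units are dropped when the value is stored in a numeric vector
  map_vec[i] <- as.numeric(difftime(Sys.time(), stamp1, units = "secs"))

  # Parallel version with i workers
  stamp2 <- Sys.time()
  xx <- mutate(NestedData, data2 = future_map(data, identity))
  future_map_vec[i] <- as.numeric(difftime(Sys.time(), stamp2, units = "secs"))
}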
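
The benchmark above maps identity over each group, so it measures mostly the cost of moving data rather than any regression work. A minimal sketch of the same comparison with an actual per-group fit might look like the following. The helper fit_slope, the mpg ~ wt model, and the reduced size of 200 groups are assumptions made for illustration and are not part of the original benchmark.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(future)
library(furrr)

# Smaller version of the dataset above so the sketch runs quickly
small <- as_tibble(mtcars)
small <- vctrs::vec_rep(small, 200)
small$ID <- vctrs::vec_rep_each(1:200, nrow(mtcars))
nested_small <- nest(small, .by = ID)

# Hypothetical helper: fit a simple linear regression per group and
# return only the slope, so very little data travels back from a worker
fit_slope <- function(df) {
  coef(lm(mpg ~ wt, data = df))[["wt"]]
}

plan(multisession, workers = 2)
seq_slopes <- map_dbl(nested_small$data, fit_slope)
par_slopes <- future_map_dbl(nested_small$data, fit_slope)
plan(sequential)  # shut down the background workers

# Both approaches should give the same slopes
all.equal(seq_slopes, par_slopes)
```

Wrapping each of the two map calls in system.time() would give timings comparable to the loop above.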

plots

[image: plot of running time against the number of workers for map and future_map]

I also noticed that although the number of workers is set, htop on the command line did not show that many CPUs in use. I am not yet clear on the details of how future_map is implemented, but the CPU utilization makes me wonder whether the slowness is due to an I/O bottleneck. If so, there may be room for improvement (for example, by avoiding unnecessary file creation, copying, or writing).
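
One way to get a feel for how much data has to move is to measure the serialized size of a single group. With a multisession plan the workers are separate R processes, so inputs and results are serialized on their way in and out. The sketch below uses only base R and gives a rough estimate, not a measurement of what furrr actually transfers; the variable names are illustrative, and one_group only approximates an element of the nested column (it still carries mtcars row names).

```r
# Approximate one element of the nested list-column with mtcars itself
one_group <- mtcars

# Size in bytes of one group once serialized
bytes_per_group <- length(serialize(one_group, connection = NULL))

# Same number of groups as in the benchmark
n_groups <- 50000

# Rough volume in MB that travels to the workers; identity() returns the
# same data, so a similar volume travels back
approx_mb_each_way <- bytes_per_group * n_groups / 1024^2
approx_mb_each_way
```

If this volume is large relative to the work done per group, the transfer cost can outweigh any gain from extra workers.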

Given that future_map's performance was not satisfying in this attempt, could you share some guidance on the scenarios in which future_map gives a significant speed-up?
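
For comparison, here is a minimal sketch of the kind of workload where a speed-up would normally be expected: each element is slow to compute but small to send and return. The function slow_square and its 0.05 second sleep are hypothetical stand-ins for an expensive computation, and the actual timings will depend on the machine.

```r
library(purrr)
library(future)
library(furrr)

# Hypothetical stand-in for an expensive per-element computation
slow_square <- function(x) {
  Sys.sleep(0.05)
  x^2
}

inputs <- 1:40

plan(multisession, workers = 2)

# Elapsed wall-clock seconds for the sequential and the parallel version
time_seq <- system.time(res_seq <- map_dbl(inputs, slow_square))[["elapsed"]]
time_par <- system.time(res_par <- future_map_dbl(inputs, slow_square))[["elapsed"]]

plan(sequential)  # shut down the background workers

c(sequential = time_seq, parallel = time_par)
```

Here the per-element cost dominates the cost of shipping a single integer to a worker, which is the opposite of the identity benchmark.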

This might be a duplicate of issues #41, #234, and #252.

Thanks!

session info

> sessionInfo()
R version 4.2.3 (2023-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C           LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] shiny_1.7.5.1   lubridate_1.9.3 forcats_1.0.0   stringr_1.5.0   readr_2.1.4     tibble_3.2.1    ggplot2_3.4.4   tidyverse_2.0.0 tidyr_1.3.0    
[10] purrr_1.0.2     furrr_0.3.1     future_1.33.0   dplyr_1.1.3   
wbvguo changed the title from "future_map is consistently slower than map in simple linear regression setting" to "future_map not obviously faster than map in simple linear regression setting" on Jan 13, 2024

D3SL commented Feb 25, 2024

One thing is that you're nesting data, which is a documented limitation, as mentioned in #234. Aside from that, though, others have noted in #260 that there seems to have been some change in the past year or so that has degraded furrr's performance. I've been using it for a while with excellent results, but over that period I've noticed the same code (on the same test data) running noticeably slower, and at one point I even found several gigabytes of temp files that hadn't been cleaned up.
