Filesystem / DB contention with multiple readers #1005
Comments
Just adding some more numbers to this: loading one observation (7 wafers) on one node takes about 450 s. Loading 2 observations on 2 nodes takes about the same. Loading 8 observations on 8 nodes takes about 800 s. I have copied a small set of data (100 observations plus metadata) to scratch to see if it helps to run from there.
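To put the numbers above on a common scale, here is a small sketch that expresses them as weak-scaling efficiency (baseline wall time divided by wall time at higher concurrency, with the work per node held fixed). The function name is made up for illustration, and the 2-node time is assumed to be exactly 450 s since it was only reported as "about the same".

```python
def weak_scaling_efficiency(t_base, t_n):
    """Ratio of baseline wall time to wall time at higher concurrency.

    With a fixed amount of work per node, 1.0 means no slowdown.
    """
    return t_base / t_n


# Approximate wall times in seconds, keyed by node count.
# The 2-node value is an assumption ("about the same" as one node).
timings = {1: 450.0, 2: 450.0, 8: 800.0}

efficiency = {
    n: weak_scaling_efficiency(timings[1], t) for n, t in timings.items()
}

for n, eff in efficiency.items():
    print(f"{n} node(s): {timings[n]:.0f} s, efficiency {eff:.2f}")
```

By this measure the 8-node run is at roughly 0.56 of ideal, i.e. each node takes a bit less than twice as long as it does when running alone.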
Adding some more details to this. I copied the data in question to scratch and compared several cases using 8 nodes each running one of 8 observations and 64 nodes each running one of 64 observations.
The DB access for reading does not seem to hit any filesystem contention up to 64 x 7 = 448 readers, regardless of whether the metadata is on CFS or scratch (it is slightly faster on CFS). The data access also seems to scale well and is much faster on scratch. Since this total time is measured after a barrier, the increase in time going from 8 nodes to 64 nodes may just be due to including some longer observations. A better profiling exercise should break out the time for I/O and FLAC decompression separately, and present those as "samples per second" or similar to account for the differing lengths of observations. I will leave this issue open until we have tested data reading at higher concurrency, but for now the solution seems to be caching data to scratch and leaving metadata on CFS.
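As a rough illustration of the per-stage profiling suggested above, here is a minimal sketch of a timer that accumulates wall time and sample counts per named stage and reports samples per second. `StageTimer`, the stage names, and the sample count are all hypothetical, and the `time.sleep` calls are stand-ins for the real read and FLAC decode steps, which this sketch does not attempt to call.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulate wall time and sample counts per named stage."""

    def __init__(self):
        self.seconds = defaultdict(float)
        self.samples = defaultdict(int)

    @contextmanager
    def stage(self, name, n_samples=0):
        # Time the enclosed block and credit it with n_samples samples.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.seconds[name] += time.perf_counter() - start
            self.samples[name] += n_samples

    def samples_per_second(self, name):
        elapsed = self.seconds[name]
        return self.samples[name] / elapsed if elapsed > 0 else float("nan")


timer = StageTimer()
n_samples = 100_000  # hypothetical number of samples in one observation

with timer.stage("io", n_samples):
    time.sleep(0.01)  # placeholder for reading the raw bytes

with timer.stage("decompress", n_samples):
    time.sleep(0.02)  # placeholder for FLAC decompression

for name in ("io", "decompress"):
    rate = timer.samples_per_second(name)
    print(f"{name}: {timer.seconds[name]:.3f} s, {rate:.3g} samples/s")
```

Normalizing by sample count means a long observation and a short one can be compared directly, which the total time after a barrier does not allow.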
I just want to mention some studies I did on what factors might be affecting data load speed. With "data on scratch", I don't see much performance difference when saving "one big frame" per file vs. several smaller frames. I also didn't get any speedup from disabling G3SuperTimestream compression (using 4 cores for decompression). The choice of one big frame vs. many frames also doesn't seem to affect load speed from CFS. I've implemented automatic copy-to-scratch-and-load support in #1037.
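For readers unfamiliar with the copy-to-scratch pattern, a generic sketch follows. This is not the implementation in #1037; `stage_to_scratch`, the file name, and the directory layout are invented for illustration. It copies a file into a scratch directory only if a same-size copy is not already there, and writes through a per-process temporary name so that a partially copied file is never visible under the final name.

```python
import os
import shutil
import tempfile
from pathlib import Path


def stage_to_scratch(src, scratch_dir):
    """Copy src into scratch_dir unless a same-size copy exists.

    Returns the path of the local copy.
    """
    src = Path(src)
    scratch_dir = Path(scratch_dir)
    scratch_dir.mkdir(parents=True, exist_ok=True)
    dest = scratch_dir / src.name
    if not dest.exists() or dest.stat().st_size != src.stat().st_size:
        # Copy to a per-process temporary name, then rename into place.
        tmp = dest.with_name(f"{dest.name}.{os.getpid()}.partial")
        shutil.copyfile(src, tmp)
        tmp.replace(dest)
    return dest


# Demonstration with temporary directories standing in for CFS and scratch.
with tempfile.TemporaryDirectory() as root:
    src = Path(root) / "cfs" / "frames.g3"
    src.parent.mkdir()
    src.write_bytes(b"\x00" * 1024)

    local = stage_to_scratch(src, Path(root) / "scratch")
    copied_ok = local.read_bytes() == src.read_bytes()
    print(local.name, copied_ok)
```

A size check is a cheap but weak cache test; a checksum or modification-time comparison would be stricter at some extra cost.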
This issue is just for keeping track of an investigation into observed "slowdowns" when multiple processes call get_meta() / get_obs() on different wafers (i.e. different framefiles), both within a single book and from separate books. This is from within the LoadContext operator, so each process creates a context, does the operation (either get_meta or get_obs), and then closes the context.

Mostly this is just anecdotal so far. For example, running a single process that loads 7 wafers in sequence from one observation takes about 60 seconds per wafer (Perlmutter compute node, reading data from CFS) to call get_meta + get_obs. Running with 8 processes, each reading 7 wafers in sequence from different observations, seems to take considerably longer.
A more systematic test is needed. The changes in #845 should also be tested to see if they help.
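One possible shape for such a test is sketched below, under loud assumptions: `fake_load` is a placeholder for the real get_meta + get_obs sequence, the wafer names are invented, and threads in a single process stand in for what would really be separate processes (for example MPI ranks) on separate nodes. Each worker loads one item, so the total wall time should stay flat as the worker count grows if there is no contention.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def time_concurrent_loads(load_fn, items, n_workers):
    """Run load_fn over items with n_workers concurrent workers.

    Returns (total wall time, list of per-item wall times).
    """

    def timed(item):
        start = time.perf_counter()
        load_fn(item)
        return time.perf_counter() - start

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        per_item = list(pool.map(timed, items))
    return time.perf_counter() - start, per_item


def fake_load(wafer):
    # Placeholder for the real per-wafer load.
    time.sleep(0.01)


wafers = [f"wafer{i}" for i in range(8)]  # hypothetical identifiers

results = {}
for n in (1, 2, 8):
    # One item per worker, so work per worker is constant.
    total, per_item = time_concurrent_loads(fake_load, wafers[:n], n)
    results[n] = (total, per_item)
    print(f"{n} worker(s): total {total:.3f} s, "
          f"slowest item {max(per_item):.3f} s")
```

Recording per-item times alongside the total helps separate a uniform slowdown from a few slow outliers.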