Filesystem / DB contention with multiple readers #1005

Open
tskisner opened this issue Oct 17, 2024 · 3 comments

@tskisner
Member

This issue tracks an investigation into observed slowdowns when multiple processes call get_meta() / get_obs() on different wafers (i.e. different framefiles), both within a single book and across separate books. The calls happen inside the LoadContext operator, so each process creates a context, performs the operation (either get_meta or get_obs), and then closes the context.

The evidence is mostly anecdotal so far. For example, a single process that loads 7 wafers in sequence from one observation takes about 60 seconds per wafer (Perlmutter compute node, reading data from CFS) to call get_meta + get_obs. Running 8 processes, each reading 7 wafers in sequence from a different observation, seems to take considerably longer.

A more systematic test is needed. The changes in #845 should also be tested to see if they help.
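As a starting point for such a test, here is a minimal timing sketch. It assumes nothing about the real loader: `load_meta` and `load_obs` are hypothetical callables standing in for whatever wraps get_meta and get_obs, and the fake loaders below exist only so the example runs anywhere.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Append the wall-clock duration of the enclosed block to results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results.setdefault(label, []).append(time.perf_counter() - start)


def time_wafer_loads(wafers, load_meta, load_obs):
    """Time the metadata load and the data load separately for each wafer."""
    results = {}
    for wafer in wafers:
        with timed("get_meta", results):
            meta = load_meta(wafer)
        with timed("get_obs", results):
            load_obs(meta)
    return results


# Hypothetical stand-in loaders; replace these with the real calls.
def fake_meta(wafer):
    return {"wafer": wafer}


def fake_obs(meta):
    time.sleep(0.01)  # pretend to read data
    return meta


timings = time_wafer_loads(["ws0", "ws1", "ws2"], fake_meta, fake_obs)
for label, values in timings.items():
    print(f"{label}: min={min(values):.3f}s max={max(values):.3f}s")
```

Recording the per-wafer times for each phase, rather than one total, makes it possible to report ranges and to see which phase degrades as the number of readers grows.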

@tskisner self-assigned this Oct 17, 2024
@tskisner
Member Author

Adding some more numbers: loading one observation (7 wafers) on one node takes about 450s. Loading 2 observations on 2 nodes takes about the same. Loading 8 observations on 8 nodes takes about 800s. I have copied a small set of data (100 observations plus metadata) to scratch to see if it helps to run from there.

@tskisner
Member Author

Adding some more details to this. I copied the data in question to scratch and compared several cases using 8 nodes each running one of 8 observations and 64 nodes each running one of 64 observations.

8 observations of 7 wafers on 8 nodes
=================================================

One process / one thread reading each wafer
-------------------------------------------------

Data on CFS, metadata on CFS:  418s
	get_meta = 3-6s
	get_obs = 40-60s

Data on CFS, metadata on scratch:  414s
	get_meta = 3-6s
	get_obs = 40-60s

Data on scratch, metadata on scratch: 95s
	get_meta ~= 1-10s
	get_obs = 8-10s

Data on scratch, metadata on scratch, "better sqlite" (PR #845): 95s
	(no change; most of the benefits in this branch
	 would only be seen with multiple writers)
	get_meta ~= 1-10s
	get_obs = 8-10s

One process / four threads reading each wafer
-------------------------------------------------

Data on scratch, metadata on scratch: 75s
	get_meta ~= 1-10s
	get_obs = 4-6s

64 observations of 7 wafers on 64 nodes
=================================================

One process / one thread reading each wafer
-------------------------------------------------

Data on scratch, metadata on scratch: 155s
	get_meta ~= 1-10s
	get_obs = 6-20s

One process / four threads reading each wafer
-------------------------------------------------

Data on scratch, metadata on scratch: 126s
	get_meta ~= 1-10s
	get_obs = 3-10s
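
For a quick comparison, the ratios between the totals above can be worked out directly. This is plain arithmetic on the reported wall-clock times; the variable names are only for illustration.

```python
# Total wall-clock times in seconds, copied from the results above.
cfs_8_nodes = 418            # data on CFS, metadata on CFS, 1 thread
scratch_8_nodes_1t = 95      # data and metadata on scratch, 1 thread
scratch_8_nodes_4t = 75      # data and metadata on scratch, 4 threads
scratch_64_nodes_1t = 155
scratch_64_nodes_4t = 126


def speedup(before, after):
    """Ratio of the earlier total time to the later one."""
    return before / after


print(f"CFS -> scratch (8 nodes):   {speedup(cfs_8_nodes, scratch_8_nodes_1t):.1f}x")
print(f"1 -> 4 threads (8 nodes):   {speedup(scratch_8_nodes_1t, scratch_8_nodes_4t):.2f}x")
print(f"1 -> 4 threads (64 nodes):  {speedup(scratch_64_nodes_1t, scratch_64_nodes_4t):.2f}x")
```

Moving the data to scratch accounts for most of the gain (about 4.4x), while going from one to four threads gives roughly 1.2-1.3x at both node counts.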

DB access for reading does not seem to hit any filesystem contention up to 64x7 = 448 readers, regardless of whether metadata is on CFS or scratch (slightly faster on CFS).

The data access also seems to scale well and is much faster on scratch. Since this total time is measured after a barrier, the increase in time going from 8 nodes to 64 nodes may just be due to including some longer observations. A better profiling exercise should break out the time for I/O and FLAC decompression separately, and present those as "samples per second" or similar to account for the differing lengths of observations.
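
A rough sketch of that normalization is below. The `LoadTiming` class and all numbers in it are hypothetical placeholders, not measurements; the point is only that two observations of different lengths can have the same throughput even though their wall-clock times differ.

```python
from dataclasses import dataclass


@dataclass
class LoadTiming:
    """Timing for one wafer load, with I/O and decompression kept separate."""
    n_samples: int         # samples per detector
    n_dets: int            # number of detectors
    io_seconds: float      # time spent reading from disk
    decode_seconds: float  # time spent in decompression

    def rate(self, seconds):
        """Detector-samples per second for a given elapsed time."""
        return self.n_samples * self.n_dets / seconds

    def summary(self):
        total = self.io_seconds + self.decode_seconds
        return {
            "io": self.rate(self.io_seconds),
            "decode": self.rate(self.decode_seconds),
            "total": self.rate(total),
        }


# Placeholder values: the second observation is twice as long as the first.
short_obs = LoadTiming(n_samples=100_000, n_dets=1000, io_seconds=4.0, decode_seconds=1.0)
long_obs = LoadTiming(n_samples=200_000, n_dets=1000, io_seconds=8.0, decode_seconds=2.0)

for name, timing in [("short", short_obs), ("long", long_obs)]:
    rates = timing.summary()
    print(f"{name}: io={rates['io']:.3g} decode={rates['decode']:.3g} total={rates['total']:.3g} samples/s")
```

Reported this way, a longer observation no longer looks like a slowdown, and a real drop in the I/O rate at higher node counts would stand out from a change in the decode rate.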

I will leave this issue open until we have tested data reading at higher concurrency, but for now the solution seems to be caching data to scratch and leaving metadata on CFS.
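
For illustration, a generic copy-to-scratch step could look like the sketch below. This is not the project's implementation: the function name `stage_to_scratch`, the size-based reuse check, and the file names are all assumptions, and the demo uses a temporary directory in place of CFS and scratch.

```python
import os
import shutil
import tempfile
from pathlib import Path


def stage_to_scratch(src, scratch_dir):
    """Copy src into scratch_dir unless a same-sized copy is already there.

    Returns the path of the staged copy.
    """
    src = Path(src)
    scratch_dir = Path(scratch_dir)
    scratch_dir.mkdir(parents=True, exist_ok=True)
    dest = scratch_dir / src.name
    if dest.exists() and dest.stat().st_size == src.stat().st_size:
        return dest  # reuse the existing copy
    # Copy to a per-process temporary name, then rename, so that other
    # readers never see a partially written file.
    tmp = dest.with_name(f"{dest.name}.{os.getpid()}.part")
    shutil.copyfile(src, tmp)
    os.replace(tmp, dest)
    return dest


# Demo with stand-in directories and a fake framefile.
with tempfile.TemporaryDirectory() as root:
    cfs = Path(root) / "cfs"
    cfs.mkdir()
    frame = cfs / "wafer0.g3"
    frame.write_bytes(b"fake frame data")
    scratch = Path(root) / "scratch"
    first = stage_to_scratch(frame, scratch)
    second = stage_to_scratch(frame, scratch)  # second call reuses the copy
    staged_bytes = first.read_bytes()
    print(first.name, len(staged_bytes))
```

The rename step matters when several processes stage the same file at once: each writes its own temporary file, and whichever finishes last simply replaces an identical copy.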

@mhasself
Member

I just want to mention some studies I did on what factors might be affecting data load speed.

With "data on scratch", I don't see much performance difference between saving "one big frame" per file and saving several smaller frames. I also didn't get any speedup from disabling G3SuperTimestream compression (using 4 cores for decompression).

The choice between one big frame and many frames also doesn't seem to affect load speed from CFS.

I've implemented automatic copy-to-scratch-and-load support in #1037.
