Performance of blockdir prefix directories #180

Open
sourcefrog opened this issue Aug 13, 2022 · 1 comment
Labels
topic:performance type:format-change issues requiring an archive format change

Comments

@sourcefrog
Owner

sourcefrog commented Aug 13, 2022

One more thought from #177, cc @road2react and @WolverinDEV:

Conserve's current format puts blocks into subdirectories with a 3-hex-digit name, taken from the first 12 bits of the hash. So there are up to 1<<12, or 4096, of them. This introduces a blocking mkdir ahead of writing each block file.
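
To make the cost concrete, here is a minimal sketch of the layout described above. It is not Conserve's actual code: `block_relpath`, `write_block`, and `PREFIX_HEX_DIGITS` are hypothetical names, and it assumes the hash is given as an ASCII hex string of at least 3 characters and that the block file is named by the full hash inside its prefix directory.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical constant: leading hex digits used as the subdirectory name.
/// 3 hex digits = 12 bits, so at most 1 << 12 = 4096 subdirectories.
const PREFIX_HEX_DIGITS: usize = 3;

/// Relative path of a block: "<first 3 hex digits>/<full hash>".
/// Assumes `hash_hex` is ASCII hex with at least 3 characters.
fn block_relpath(hash_hex: &str) -> PathBuf {
    let prefix = &hash_hex[..PREFIX_HEX_DIGITS];
    Path::new(prefix).join(hash_hex)
}

/// Write one block, creating its prefix directory first.
/// The directory creation is the extra blocking call issued before every write,
/// even when the directory already exists.
fn write_block(blockdir: &Path, hash_hex: &str, data: &[u8]) -> io::Result<PathBuf> {
    let path = blockdir.join(block_relpath(hash_hex));
    if let Some(parent) = path.parent() {
        // Succeeds without error if the directory is already there.
        fs::create_dir_all(parent)?;
    }
    fs::write(&path, data)?;
    Ok(path)
}
```

In this sketch every call to `write_block` pays for a directory-creation round trip, which is cheap locally but may be noticeable on a remote or high-latency backend.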

The point of this is to reduce the size of any single directory, although that is probably less of a concern on most local filesystems than in years past. It may actually help with rclone/Box, if the client regularly reads whole directories. It may still be a good idea for VFAT USB drives.

It's probably a loss on scalable local filesystems? In particular, walking the list of blocks needs to read up to 4096 directories.

There are several options, in order of priority:

  1. Remember which subdirectories are known to exist (because we already wrote or saw a block in them) and then there's no need to create them.
  2. In addition, at the start of a backup, read the block directory to see which prefixes are present and remember them. This has the added benefit of quickly answering whether a given hash can possibly be present.
  3. Make it tunable so that we can at least experiment with different settings, where 0 means no subdirectories. (It should be stored in some archive metadata. It may not be worth allowing this to be changed once the archive exists.)
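
Options 1 and 2 could be sketched roughly as below. This is only an illustration under stated assumptions, not a proposed patch: `KnownPrefixes`, `scan`, `ensure`, and `may_contain` are hypothetical names, prefixes are assumed to be 3 hex digits, and the negative answer from `may_contain` is only trustworthy if no other process adds prefix directories after the scan.

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical in-memory record of prefix directories known to exist.
struct KnownPrefixes {
    blockdir: PathBuf,
    known: HashSet<String>,
}

impl KnownPrefixes {
    /// Option 2: list the block directory once at the start of a backup
    /// and remember every subdirectory name found there.
    fn scan(blockdir: &Path) -> io::Result<Self> {
        let mut known = HashSet::new();
        for entry in fs::read_dir(blockdir)? {
            let entry = entry?;
            if entry.file_type()?.is_dir() {
                // Skip names that are not valid UTF-8; they cannot be hex prefixes.
                if let Ok(name) = entry.file_name().into_string() {
                    known.insert(name);
                }
            }
        }
        Ok(Self { blockdir: blockdir.to_path_buf(), known })
    }

    /// Option 1: create the prefix directory only if it is not already known.
    /// Returns true when a directory creation was actually issued.
    fn ensure(&mut self, prefix: &str) -> io::Result<bool> {
        if self.known.contains(prefix) {
            return Ok(false);
        }
        fs::create_dir_all(self.blockdir.join(prefix))?;
        self.known.insert(prefix.to_owned());
        Ok(true)
    }

    /// Quick negative check: if the prefix directory is unknown, no block
    /// with this hash can be present (assuming no concurrent writers).
    fn may_contain(&self, hash_hex: &str) -> bool {
        hash_hex.get(..3).map_or(false, |p| self.known.contains(p))
    }
}
```

With something like this, a writer would call `ensure` before each block write and pay for a mkdir at most once per prefix, so at most 4096 times per backup rather than once per block.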

I list the first two first because they are direct efficiency wins that require neither a format change, nor guessing what is likely to be optimal in a given situation, nor making the user guess.

@sourcefrog sourcefrog added type:format-change issues requiring an archive format change topic:performance labels Aug 13, 2022
@WolverinDEV
Contributor

WolverinDEV commented Aug 14, 2022

#179 seems pretty interesting, but I don't have the time to join the conversation right now.
After #173 I wanted to focus on performance (I'm into bug hunting) and encryption.
I'll probably respond during the week (I'm working weekends).
