Stored data de-duplication #330

cayolblake · 2021-04-20T14:02:30Z

Hello,

I wonder if data de-duplication is on your roadmap.

Basically, an attempt to store the same file multiple times would first check whether the file's chunks already exist somewhere (using some sort of fast hashing algorithm). If they do, the new file would simply point to the existing data, without consuming additional space or effort.
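The idea described above can be sketched as a tiny content-addressed store. This is only an illustration of the general technique, not how JuiceFS works: the `DedupStore` class, its in-memory dictionaries, the 4-byte chunk size, and the choice of SHA-256 (picked for collision resistance rather than speed) are all assumptions made for the example.

```python
import hashlib


class DedupStore:
    """Toy in-memory store that keeps each distinct chunk only once (hypothetical)."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size  # tiny on purpose, for illustration
        self.chunks = {}              # digest -> chunk bytes (stored once)
        self.files = {}               # file name -> ordered list of chunk digests

    def put(self, name, data):
        """Store data under name; return how many new chunk bytes were written."""
        written = 0
        digests = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:
                # First time this content is seen: actually store it.
                self.chunks[digest] = chunk
                written += len(chunk)
            digests.append(digest)
        self.files[name] = digests
        return written

    def get(self, name):
        """Reassemble a file from its chunk digests."""
        return b"".join(self.chunks[d] for d in self.files[name])


store = DedupStore(chunk_size=4)
first = store.put("a.bin", b"abcdefgh")   # all chunks are new
second = store.put("b.bin", b"abcdefgh")  # every chunk already exists
```

Storing the second file only records a list of digests, so `second` reports no newly written bytes. A real system would also need reference counting so that deleting one file does not free chunks still used by another, which is part of the cost of offering this as a general feature.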

xiaogaozi (Collaborator) · 2021-04-21T03:00:30Z

It's an interesting idea. We will have a discussion internally. If you have enough time, contributions are very welcome. 😀

davies (Maintainer) · 2021-04-22T02:53:53Z

De-duplication is a hot topic in storage and is useful in some cases, but as a general feature it may not justify its high cost, so we have not put it on the roadmap.

JuiceFS does provide some features for deduplicating data at the application level, for example hard links and a faster CopyFileRange (which does not copy the data).
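As a rough illustration of application-level deduplication with hard links, the sketch below hashes the files in one directory and replaces byte-identical copies with links to the first one found. It uses only the Python standard library; the function names, the `.dedup-tmp` suffix, and the single-directory scope are assumptions made for the example, and nothing here is specific to JuiceFS.

```python
import hashlib
import os
import tempfile


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB pieces."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()


def hardlink_duplicates(directory):
    """Replace byte-identical regular files in directory with hard links.

    Returns the number of files that were replaced.
    """
    seen = {}      # digest -> path of the first file with that content
    replaced = 0
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path) or os.path.islink(path):
            continue
        digest = file_digest(path)
        original = seen.setdefault(digest, path)
        if original == path or os.path.samefile(original, path):
            continue  # first occurrence, or already linked
        tmp = path + ".dedup-tmp"  # hypothetical temporary name
        os.link(original, tmp)     # create the new link first
        os.replace(tmp, path)      # then swap it over the duplicate
        replaced += 1
    return replaced


with tempfile.TemporaryDirectory() as d:
    for name, body in [("a.txt", b"same"), ("b.txt", b"same"), ("c.txt", b"other")]:
        with open(os.path.join(d, name), "wb") as f:
            f.write(body)
    replaced = hardlink_duplicates(d)
    same_inode = os.path.samefile(os.path.join(d, "a.txt"), os.path.join(d, "b.txt"))
```

Hard links share one inode, so the linked names also share metadata, and a write through one name is visible through the other. That makes this approach suitable only for files that are treated as immutable.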

4 replies

xyb May 25, 2023

I'm researching how to implement out-of-band deduplication using copy_file_range, as duperemove does on btrfs. I want to understand the level of support offered by JuiceFS. Does it only work on block boundaries, and if so, what is the block size? Alternatively, does JuiceFS support flexible block sizes, such as those produced by a rolling hash?

davies May 25, 2023
Maintainer

There is no limit on CopyFileRange; it works with any length, starting at any offset.

Ideally, you should be aware of the chunk boundary (64 MiB): copying on that boundary means the two files share the same 64 MiB chunk. Otherwise, a file may refer to part of a slice in another file. When there are too many slices (> 5), this may trigger compaction, which duplicates the data.
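One way to act on this advice is to trim each candidate range to whole 64 MiB chunks before calling `copy_file_range`. The sketch below is a minimal illustration under several assumptions: the helper names are hypothetical, the range is copied to the same offset in the destination so that it stays aligned in both files, and the mounted file system is assumed to forward the system call. `os.copy_file_range` is available on Linux with Python 3.8 or later.

```python
import os

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MiB chunk size quoted in the reply above


def aligned_span(offset, length, chunk_size=CHUNK_SIZE):
    """Shrink [offset, offset + length) to the largest chunk-aligned span.

    Returns (start, size); size is 0 if the range covers no whole chunk.
    """
    end = offset + length
    start = -(-offset // chunk_size) * chunk_size  # round offset up
    stop = (end // chunk_size) * chunk_size        # round end down
    return (start, max(0, stop - start))


def share_aligned(src_fd, dst_fd, offset, length):
    """Copy only the chunk-aligned part of a range to the same offset in dst.

    Hypothetical helper; returns the number of bytes copied.
    """
    start, size = aligned_span(offset, length)
    copied = 0
    while copied < size:
        # Positional arguments: src, dst, count, offset_src, offset_dst.
        n = os.copy_file_range(src_fd, dst_fd, size - copied,
                               start + copied, start + copied)
        if n == 0:  # reached the end of the source file
            break
        copied += n
    return copied
```

For example, a range starting at byte 1 and spanning two chunks' worth of data contains only one whole chunk, so `aligned_span(1, 2 * CHUNK_SIZE)` keeps just the second chunk. The unaligned head and tail would have to be handled separately, or left unshared to avoid creating partial slice references.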

xyb May 25, 2023

@davies Thank you for the information. I believe it would be valuable to include this in the JuiceFS documentation.

davies May 26, 2023
Maintainer

We have updated this part recently: https://juicefs.com/docs/zh/community/architecture#how-juicefs-store-files
