[Bug]: GC deleting files but not their references in metadata #10046
IMHO, GC works as designed: https://projectnessie.org/nessie-0-100-3/gc/ The docs may not be crystal clear, but they do say that metadata is not modified, in order to preserve Nessie's commit history integrity (commits are immutable). If the Spark job needs access to a range of historical Iceberg snapshots, please set cutoff policies so that the required data is considered "live" by Nessie GC.
If your Spark job must process all snapshots in the Iceberg metadata file, you may want to add another job that removes some (old) snapshots before running GC.
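If it helps, here is a minimal PySpark sketch of that idea, using Iceberg's standard `expire_snapshots` procedure. The catalog name (`nessie`), the table name (`db.events`) and the 48-hour window are placeholders for whatever your setup actually uses:

```python
# Minimal sketch, assuming a Spark session already configured with an Iceberg/Nessie
# catalog named "nessie"; "db.events" and the 48-hour window are placeholders.
# It uses Iceberg's expire_snapshots procedure to drop snapshots older than the GC cutoff.
from datetime import datetime, timedelta, timezone

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("expire-old-snapshots").getOrCreate()

# Keep roughly the same window as the GC cutoff (48 h here) so Spark jobs that read
# recent history still find their snapshots.
older_than = (datetime.now(timezone.utc) - timedelta(hours=48)).strftime("%Y-%m-%d %H:%M:%S")

spark.sql(f"""
    CALL nessie.system.expire_snapshots(
        table => 'db.events',
        older_than => TIMESTAMP '{older_than}',
        retain_last => 1
    )
""")
```

That way the Iceberg metadata no longer references snapshots whose files GC is about to delete, and history queries stay consistent with what is actually on storage.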
Thank you for the response. Just to clarify: since my marking policy is configured with a cutoff of at least 48 hours, and GC always leaves at least one snapshot needed for the latest commit intact, I believe this is data loss happening at some stage. I have collected data in this catalog for over a quarter with no issues, but starting on the very day GC was launched last week I began seeing errors on the tables with the highest throughput. I am running GC once a day in the recommended production setup, and the number of files missing from each table's metadata seems to correlate with the total number of files in the table. Specific numbers:
I have only two workflows that interact with the tables, so I am quite confident that the files were not deleted manually by accident:
We have three more, smaller tables (more than 100× smaller in total storage size) and we have no issues with them. Please advise if my understanding of how GC works is wrong, or if I may have missed something.
@justas200 : It is pretty hard to say whether it's a bug or not based on the provided data, unfortunately. Are you able to reproduce the problem in a small, fresh test env? As to the original problem, are you able to identify the full URI of the missing file that Spark wants to access? Is it referenced from "live" snapshots?
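For that check, a rough PySpark sketch that looks the missing file up in Iceberg's metadata tables may help; the catalog name (`nessie`), the table (`db.events`) and the S3 URI below are placeholders for the real ones:

```python
# Rough sketch, assuming catalog "nessie" and table "db.events"; the URI is a placeholder.
# Checks whether the file Spark complains about is still referenced from snapshots
# present in the Iceberg table metadata.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("find-missing-file-refs").getOrCreate()

missing_uri = "s3://bucket/warehouse/db/events/data/part-00000.parquet"

# "files" lists data files of the current snapshot; "all_files" lists data files
# reachable from any snapshot still kept in the table metadata.
for meta_table in ("files", "all_files"):
    count = (
        spark.sql(f"SELECT file_path FROM nessie.db.events.{meta_table}")
        .filter(f"file_path = '{missing_uri}'")
        .count()
    )
    print(f"{meta_table}: {count} reference(s) to {missing_uri}")

# The snapshot timeline helps correlate missing files with the GC cutoff (48 h here).
spark.sql(
    "SELECT snapshot_id, committed_at FROM nessie.db.events.snapshots ORDER BY committed_at"
).show(truncate=False)
```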
I will try reproducing it in a small test env later today or early next week. |
@justas200 Hi! Do you have any news here? The issue looks very bad at first glance.
Could it be that Nessie GC was triggered while files were being written but not yet committed, and marked them as orphaned?
I suspect there may be a timing/concurrency problem in how Nessie's GC handles data files. Consider this scenario: This is just a hypothesis, but it highlights a potential race condition: files are labeled orphaned before they are fully committed. To address this, it seems that Nessie GC should not only look at the commit time for pruning older snapshots, but also consider the creation or last-modified time of the "orphaned" files. Otherwise, there's a risk of deleting files that were recently created and are still in use. Or am I missing something?
I noticed there is a `--max-file-modification=` flag for the expire operation in Nessie GC. Could this option resolve the concurrency issue described above, where freshly written (but not yet committed) files end up marked as orphaned?
Yes. This value applies to storage objects' "creation" timestamps as reported by the storage backend (e.g. S3).
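For what it's worth, here is a conceptual sketch (plain Python, not Nessie GC's actual implementation) of how such a last-modified threshold protects freshly written, not-yet-committed files from being swept:

```python
# Conceptual sketch only (not Nessie GC's code): an unreferenced object is eligible
# for deletion only if the storage backend reports a creation/last-modified time
# older than the max-file-modification threshold.
from datetime import datetime, timedelta, timezone

def may_delete(last_modified: datetime, max_file_modification: datetime) -> bool:
    """An orphan candidate may be swept only if it was written before the threshold."""
    return last_modified < max_file_modification

now = datetime.now(timezone.utc)
threshold = now - timedelta(hours=3)            # a threshold safely in the past

freshly_written = now - timedelta(minutes=5)    # written mid-commit, not yet referenced
old_orphan = now - timedelta(days=7)            # genuinely abandoned file

print(may_delete(freshly_written, threshold))   # False -> kept, the commit can still complete
print(may_delete(old_orphan, threshold))        # True  -> swept
```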
@dimas-b So could this be the root cause of the topic starter's issue, if @justas200 is using the default?
Could the timezone in the Nessie GC environment differ from the timezone Nessie GC receives from the S3 objects?
Timezones (or DST) should not matter. Timestamps are manipulated as points on the universal time scale.
What happened
Our team is running Iceberg on S3 with Nessie on Kubernetes.
Recently we turned on GC and noticed that we cannot query the full range of historical data on some tables using Spark. We get an error that a file cannot be found in S3:
How to reproduce it
These are the configurations for the GC:
Nessie server type (docker/uber-jar/built from source) and version
docker
Client type (Ex: UI/Spark/pynessie ...) and version
Spark
Additional information
No response