[Bug]: GC deleting files but not their references in metadata #10046

Open
justas200 opened this issue Dec 5, 2024 · 12 comments

@justas200

justas200 commented Dec 5, 2024

What happened

Our team is running Iceberg on S3 with Nessie on Kubernetes.
Recently we turned on GC and noticed that we can no longer query the full history of some tables using Spark. We get an error that a file cannot be found in S3:

pyspark.errors.exceptions.connect.SparkConnectGrpcException: (org.apache.spark.SparkException) Job aborted due to stage failure: Task 1621 in stage 11252.0 failed 4 times, most recent failure: Lost task 1621.3 in stage 11252.0 (TID 3116305) (10.32.186.235 executor 4282): org.apache.iceberg.exceptions.NotFoundException: File does not exist: <file_on_s3.parquet>

How to reproduce it

These are the configurations for the GC:

# Marking step

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nessie-gc-mark-step
  namespace: {{ .Release.Namespace }}
spec:
  schedule: "0 0 * * *" # Runs daily at midnight GMT
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: nessie-gc
              image: ghcr.io/projectnessie/nessie-gc:latest
              args:
                - mark
                - --default-cutoff=PT48H
                - --uri
                - http://nessie:19120/api/v2
                - --jdbc
                - --jdbc-url
                - "$(JDBC_URL)"
                - --jdbc-user
                - "$(JDBC_USER)"
                - --jdbc-password
                - "$(JDBC_PASSWORD)"
                - --write-live-set-id-to
                - /nessie/live_set_id
              envFrom:
                - secretRef:
                    name: {{ .Values.nessie.jdbc.secret.name }}-gc
              volumeMounts:
                - name: nessie-gc-data-volume
                  mountPath: /nessie
          restartPolicy: Never
          volumes:
            - name: nessie-gc-data-volume
              persistentVolumeClaim:
                claimName: nessie-gc-pvc

---

# Sweep step

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nessie-gc-sweep-step
  namespace: {{ .Release.Namespace }}
spec:
  schedule: "30 0 * * *" # Runs daily at 30 past midnight GMT
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: nessie-sa
          containers:
            - name: nessie-gc
              image: ghcr.io/projectnessie/nessie-gc:latest
              args:
                - sweep
                - --defer-deletes
                - --jdbc
                - --jdbc-url
                - "$(JDBC_URL)"
                - --jdbc-user
                - "$(JDBC_USER)"
                - --jdbc-password
                - "$(JDBC_PASSWORD)"
                - --read-live-set-id-from
                - /nessie/live_set_id
              envFrom:
                - secretRef:
                    name: "{{ .Values.nessie.jdbc.secret.name }}-gc"
              volumeMounts:
                - name: nessie-gc-data-volume
                  mountPath: /nessie
          restartPolicy: Never
          volumes:
            - name: nessie-gc-data-volume
              persistentVolumeClaim:
                claimName: nessie-gc-pvc

---

# Delete step

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nessie-gc-delete-step
  namespace: {{ .Release.Namespace }}
spec:
  schedule: "0 3 * * *" # Runs daily at 3 AM GMT
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: nessie-sa
          containers:
            - name: nessie-gc
              image: ghcr.io/projectnessie/nessie-gc:latest
              args:
                - deferred-deletes
                - --jdbc
                - --jdbc-url
                - "$(JDBC_URL)"
                - --jdbc-user
                - "$(JDBC_USER)"
                - --jdbc-password
                - "$(JDBC_PASSWORD)"
                - --read-live-set-id-from
                - /nessie/live_set_id
              envFrom:
                - secretRef:
                    name: "{{ .Values.nessie.jdbc.secret.name }}-gc"
              volumeMounts:
                - name: nessie-gc-data-volume
                  mountPath: /nessie
          restartPolicy: Never
          volumes:
            - name: nessie-gc-data-volume
              persistentVolumeClaim:
                claimName: nessie-gc-pvc

---

# DB initialization
apiVersion: batch/v1
kind: Job
metadata:
  name: nessie-gc-db-init-step
  namespace: {{ .Release.Namespace }}
spec:
  template:
    spec:
      containers:
        - name: nessie-gc
          image: ghcr.io/projectnessie/nessie-gc:latest
          args:
            - create-sql-schema
            - --jdbc
            - --jdbc-url
            - "$(JDBC_URL)"
            - --jdbc-user
            - "$(JDBC_USER)"
            - --jdbc-password
            - "$(JDBC_PASSWORD)"
            - --jdbc-schema
            - "CREATE_IF_NOT_EXISTS"
          envFrom:
            - secretRef:
                name: "{{ .Values.nessie.jdbc.secret.name }}-gc"
      restartPolicy: Never

Nessie server type (docker/uber-jar/built from source) and version

docker

Client type (Ex: UI/Spark/pynessie ...) and version

Spark

Additional information

No response

@dimas-b
Member

dimas-b commented Dec 5, 2024

IMHO, GC works as designed. https://projectnessie.org/nessie-0-100-3/gc/

The docs may not be crystal clear, but they do say that Nessie GC is a tool to clean up orphaned files in a Nessie repository. The key here is identifying the orphan files. The GC CLI delegates this task to the user via "cutoff" arguments / policies. Whatever falls beyond the cutoff line is assumed to be no longer needed, so GC deletes those files from storage.

Metadata is not modified, in order to preserve the integrity of Nessie's commit history (commits are immutable).

If the Spark job needs access to a range of historical Iceberg snapshots, please set cutoff policies so that the required data is considered "live" by Nessie GC.
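
For example (a sketch; per the Nessie GC docs, --default-cutoff accepts cutoff policies such as an ISO-8601 duration), the mark step above could keep more history live by widening the cutoff:

                - mark
                - --default-cutoff=PT720H  # assumption for illustration: keep ~30 days of history live instead of 48 hours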

@dimas-b
Member

dimas-b commented Dec 5, 2024

If your Spark job must process all snapshots in the Iceberg metadata file, you may want to add another job that removes some (old) snapshots before running GC.
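
As a sketch of that idea (catalog name, table name, and timestamp are placeholders), Iceberg's expire_snapshots procedure can trim old snapshots from the table metadata via Spark SQL before the GC run:

-- remove snapshot entries older than the given timestamp, always retaining the last 10
CALL nessie.system.expire_snapshots(
  table => 'db.my_table',
  older_than => TIMESTAMP '2024-11-01 00:00:00',
  retain_last => 10
)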

@justas200
Author

justas200 commented Dec 5, 2024

Thank you for the response. Just to clarify, I am running a SELECT * FROM <table> statement, which, I believe, defaults to the latest commit/snapshot on the main branch.

Since my marking policy is configured with a cutoff of at least 48 hours, and GC always leaves at least the one snapshot needed for the latest commit intact, I believe data loss is happening at some stage.

I collected data in this catalog for over a quarter without issues; however, starting the very day GC was launched last week, I began seeing errors on the tables with the highest throughput.

I am running GC once a day in the recommended production setup, and the number of files missing from each table's metadata seems to correlate with the total number of files in the table.

Specific numbers:

  • Table1: 8 of 2,544 files missing
  • Table2: 160 of 719,285 files missing
  • Table3: 70 of 694,985 files missing

Only two workflows interact with the tables, so I am quite confident the files were not deleted manually by accident:

  • Kafka Connector in append-only mode (uses the Java Iceberg library to append data)
  • Spark Connect to query data

We have three more, smaller tables (more than 100× smaller in total storage size) and have no issues with them.

Please advise if some of my understanding is wrong about how GC works, or if I may have missed something.

@dimas-b
Member

dimas-b commented Dec 5, 2024

@justas200 : It is pretty hard to say whether it's a bug or not based on the provided data, unfortunately.

Are you able to reproduce the problem in a small fresh test env?

As to the original problem, are you able to identify the full URI of the missing file that Spark wants to access? Is it referenced from "live" snapshots?
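
One way to check (a sketch; catalog, table, and path are placeholders) is to look the path up in Iceberg's metadata tables from Spark SQL:

-- referenced by the current snapshot?
SELECT file_path FROM nessie.db.my_table.files
WHERE file_path = 's3://bucket/path/file_on_s3.parquet';

-- referenced by any snapshot still listed in the table metadata?
SELECT file_path FROM nessie.db.my_table.all_files
WHERE file_path = 's3://bucket/path/file_on_s3.parquet';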

@justas200
Author

I will try reproducing it in a small test env later today or early next week.
I am able to identify the full URI of the missing file - it is referenced from the live (current) snapshot.

@Albel-Analyst

@justas200 Hi! Do you have any news here? This looks like a serious issue at first glance.

@Albel-Analyst

Could it be that Nessie GC ran while files were being written but not yet committed, and marked them as orphaned?

@Albel-Analyst

@dimas-b

I suspect there may be a timing/concurrency problem in how Nessie’s GC handles data files. Consider this scenario:
1. A job is in the process of writing data files.
2. At the same time, Nessie GC runs and marks these (still-being-written) files as orphaned.
3. The job completes writing the data files and successfully commits them, so from Iceberg’s perspective, everything looks correct.
4. However, Nessie’s database has already flagged these files as orphaned. When the deletion phase runs, it could remove files that are actually valid and just committed.

This is just a hypothesis, but it highlights a potential race condition: files are labeled orphaned before they are fully committed. To address this, it seems that Nessie GC should not only look at the commit time for pruning older snapshots, but also consider the creation or last-modified time of the “orphaned” files. Otherwise, there’s a risk of deleting files that were recently created and are still in use. Or am I missing something?
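
To make the hypothesis concrete, here is the suspected sequence laid against the cron schedules from the configs above (times are illustrative):

23:55  a writer starts uploading data files to S3 (not yet committed)
00:00  the mark step runs; the live set is computed without the in-flight files
00:05  the writer commits; the files are now referenced by the latest snapshot
00:30  the sweep step compares storage against the 00:00 live set and treats those files as orphans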

@Albel-Analyst

I noticed there is a --max-file-modification flag for the expire operation in Nessie GC.

Could this option resolve the concurrency issue described above, where freshly written (but not yet committed) files end up marked as orphaned?

@dimas-b
Member

dimas-b commented Dec 27, 2024

Yes, --max-file-modification is meant to avoid situations when new files are created while GC is running.

This value applies to storage objects' "creation" timestamps as reported by the storage backend (e.g. S3).
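
Applied to the configs above, that would mean passing the flag in the sweep step's args. A sketch, assuming the flag accepts an ISO-8601 instant; the concrete value here is hypothetical and would normally be chosen to pre-date any in-flight writes (e.g. the start of the mark run):

                - sweep
                - --defer-deletes
                - --max-file-modification
                - "2024-12-05T00:00:00Z"  # hypothetical instant: only files created before this are eligible for deletion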

@Albel-Analyst

Albel-Analyst commented Dec 28, 2024

@dimas-b So could this be the root cause of the original issue, given that @justas200 is using the default ("Defaults to the created timestamp of the live-content-set")?

Could the timezone in the Nessie GC environment differ from the timezone Nessie GC receives from the S3 objects?

@dimas-b
Member

dimas-b commented Dec 30, 2024

Timezones (or DST) should not matter. Timestamps are manipulated as points on the universal time scale; for example, 2024-12-28T03:00:00+03:00 and 2024-12-28T00:00:00Z denote the same instant.
