[Bug]: GC deleting files but not their references in metadata #10046

Open
justas200 opened this issue Dec 5, 2024 · 12 comments

@justas200

justas200 commented Dec 5, 2024

What happened

Our team is running Iceberg on S3 with Nessie on Kubernetes.
Recently we turned on GC and noticed that we can no longer query the full history of some tables using Spark. We get an error that a file cannot be found in S3:

pyspark.errors.exceptions.connect.SparkConnectGrpcException: (org.apache.spark.SparkException) Job aborted due to stage failure: Task 1621 in stage 11252.0 failed 4 times, most recent failure: Lost task 1621.3 in stage 11252.0 (TID 3116305) (10.32.186.235 executor 4282): org.apache.iceberg.exceptions.NotFoundException: File does not exist: <file_on_s3.parquet>

How to reproduce it

These are the configurations for the GC:

# Marking step

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nessie-gc-mark-step
  namespace: {{ .Release.Namespace }}
spec:
  schedule: "0 0 * * *" # Runs daily at midnight GMT
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: nessie-gc
              image: ghcr.io/projectnessie/nessie-gc:latest
              args:
                - mark
                - --default-cutoff=PT48H
                - --uri
                - http://nessie:19120/api/v2
                - --jdbc
                - --jdbc-url
                - "$(JDBC_URL)"
                - --jdbc-user
                - "$(JDBC_USER)"
                - --jdbc-password
                - "$(JDBC_PASSWORD)"
                - --write-live-set-id-to
                - /nessie/live_set_id
              envFrom:
                - secretRef:
                    name: {{ .Values.nessie.jdbc.secret.name }}-gc
              volumeMounts:
                - name: nessie-gc-data-volume
                  mountPath: /nessie
          restartPolicy: Never
          volumes:
            - name: nessie-gc-data-volume
              persistentVolumeClaim:
                claimName: nessie-gc-pvc

---

# Sweep step

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nessie-gc-sweep-step
  namespace: {{ .Release.Namespace }}
spec:
  schedule: "30 0 * * *" # Runs daily at 30 past midnight GMT
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: nessie-sa
          containers:
            - name: nessie-gc
              image: ghcr.io/projectnessie/nessie-gc:latest
              args:
                - sweep
                - --defer-deletes
                - --jdbc
                - --jdbc-url
                - "$(JDBC_URL)"
                - --jdbc-user
                - "$(JDBC_USER)"
                - --jdbc-password
                - "$(JDBC_PASSWORD)"
                - --read-live-set-id-from
                - /nessie/live_set_id
              envFrom:
                - secretRef:
                    name: "{{ .Values.nessie.jdbc.secret.name }}-gc"
              volumeMounts:
                - name: nessie-gc-data-volume
                  mountPath: /nessie
          restartPolicy: Never
          volumes:
            - name: nessie-gc-data-volume
              persistentVolumeClaim:
                claimName: nessie-gc-pvc

---

# Delete step

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nessie-gc-delete-step
  namespace: {{ .Release.Namespace }}
spec:
  schedule: "0 3 * * *" # Runs daily at 3 AM GMT
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: nessie-sa
          containers:
            - name: nessie-gc
              image: ghcr.io/projectnessie/nessie-gc:latest
              args:
                - deferred-deletes
                - --jdbc
                - --jdbc-url
                - "$(JDBC_URL)"
                - --jdbc-user
                - "$(JDBC_USER)"
                - --jdbc-password
                - "$(JDBC_PASSWORD)"
                - --read-live-set-id-from
                - /nessie/live_set_id
              envFrom:
                - secretRef:
                    name: "{{ .Values.nessie.jdbc.secret.name }}-gc"
              volumeMounts:
                - name: nessie-gc-data-volume
                  mountPath: /nessie
          restartPolicy: Never
          volumes:
            - name: nessie-gc-data-volume
              persistentVolumeClaim:
                claimName: nessie-gc-pvc

---

# DB initialization
apiVersion: batch/v1
kind: Job
metadata:
  name: nessie-gc-db-init-step
  namespace: {{ .Release.Namespace }}
spec:
  template:
    spec:
      containers:
        - name: nessie-gc
          image: ghcr.io/projectnessie/nessie-gc:latest
          args:
            - create-sql-schema
            - --jdbc
            - --jdbc-url
            - "$(JDBC_URL)"
            - --jdbc-user
            - "$(JDBC_USER)"
            - --jdbc-password
            - "$(JDBC_PASSWORD)"
            - --jdbc-schema
            - "CREATE_IF_NOT_EXISTS"
          envFrom:
            - secretRef:
                name: "{{ .Values.nessie.jdbc.secret.name }}-gc"
      restartPolicy: Never

Nessie server type (docker/uber-jar/built from source) and version

docker

Client type (Ex: UI/Spark/pynessie ...) and version

Spark

Additional information

No response

@dimas-b
Member

dimas-b commented Dec 5, 2024

IMHO, GC works as designed. https://projectnessie.org/nessie-0-100-3/gc/

The docs may not be crystal clear, but they do say that Nessie GC is a tool to clean up orphaned files in a Nessie repository. The key here is identifying the orphan files. The GC CLI delegates this task to the user via "cutoff" arguments / policies. Whatever falls beyond the cutoff line is assumed to be no longer needed, so GC deletes those files from storage.

Metadata is not modified, in order to preserve the integrity of Nessie's commit history (commits are immutable).

If the Spark job needs access to a range of historical Iceberg snapshots, please set cutoff policies so that the required data is considered "live" by Nessie GC.
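
For example (a sketch; per the Nessie GC docs, --default-cutoff accepts cutoff policies such as an ISO-8601 duration), the mark step above could keep more history live by widening the cutoff:

                - mark
                - --default-cutoff=PT720H  # assumption for illustration: keep ~30 days of history live instead of 48 hours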

@dimas-b
Member

dimas-b commented Dec 5, 2024

If your Spark job must process all snapshots in the Iceberg metadata file, you may want to add another job that removes some (old) snapshots before running GC.
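
As a sketch of that idea (catalog name, table name, and timestamp are placeholders), Iceberg's expire_snapshots procedure can trim old snapshots from the table metadata via Spark SQL before the GC run:

-- remove snapshot entries older than the given timestamp, always retaining the last 10
CALL nessie.system.expire_snapshots(
  table => 'db.my_table',
  older_than => TIMESTAMP '2024-11-01 00:00:00',
  retain_last => 10
)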

@justas200
Author

justas200 commented Dec 5, 2024

Thank you for the response. Just to clarify, I am running a SELECT * FROM <table> statement, which, I believe, defaults to the latest commit/snapshot on the main branch.

Since my marking policy is configured with a cutoff of at least 48 hours, and GC always leaves at least the one snapshot needed for the latest commit intact, I believe data loss is happening at some stage.

I collected data in this catalog for over a quarter without issues; however, starting the very day GC was launched last week, I began seeing errors on the tables with the highest throughput.

I am running GC once a day in the recommended production setup, and the number of files missing from each table's metadata seems to correlate with the total number of files in the table.

Specific numbers:

  • Table1: 8 of 2,544 files missing
  • Table2: 160 of 719,285 files missing
  • Table3: 70 of 694,985 files missing

Only two workflows interact with the tables, so I am quite confident the files were not deleted manually by accident:

  • Kafka Connector in append-only mode (uses the Java Iceberg library to append data)
  • Spark Connect to query data

We have three more, smaller tables (more than 100× smaller in total storage size) and have no issues with them.

Please advise if some of my understanding is wrong about how GC works, or if I may have missed something.

@dimas-b
Member

dimas-b commented Dec 5, 2024

@justas200 : It is pretty hard to say whether it's a bug or not based on the provided data, unfortunately.

Are you able to reproduce the problem in a small fresh test env?

As to the original problem, are you able to identify the full URI of the missing file that Spark wants to access? Is it referenced from "live" snapshots?
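
One way to check (a sketch; catalog, table, and path are placeholders) is to look the path up in Iceberg's metadata tables from Spark SQL:

-- referenced by the current snapshot?
SELECT file_path FROM nessie.db.my_table.files
WHERE file_path = 's3://bucket/path/file_on_s3.parquet';

-- referenced by any snapshot still listed in the table metadata?
SELECT file_path FROM nessie.db.my_table.all_files
WHERE file_path = 's3://bucket/path/file_on_s3.parquet';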

@justas200
Author

I will try reproducing it in a small test env later today or early next week.
I am able to identify the full URI of the missing file - it is referenced from the live (current) snapshot.

@Albel-Analyst

@justas200 Hi! Do you have any news here? This looks like a serious issue at first glance.

@Albel-Analyst

Could it be that Nessie GC ran while files were being written but not yet committed, and marked them as orphaned?

@Albel-Analyst

@dimas-b

I suspect there may be a timing/concurrency problem in how Nessie’s GC handles data files. Consider this scenario:
1. A job is in the process of writing data files.
2. At the same time, Nessie GC runs and marks these (still-being-written) files as orphaned.
3. The job completes writing the data files and successfully commits them, so from Iceberg’s perspective, everything looks correct.
4. However, Nessie’s database has already flagged these files as orphaned. When the deletion phase runs, it could remove files that are actually valid and just committed.

This is just a hypothesis, but it highlights a potential race condition: files are labeled orphaned before they are fully committed. To address this, it seems that Nessie GC should not only look at the commit time for pruning older snapshots, but also consider the creation or last-modified time of the “orphaned” files. Otherwise, there’s a risk of deleting files that were recently created and are still in use. Or am I missing something?
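
To make the hypothesis concrete, here is the suspected sequence laid against the cron schedules from the configs above (times are illustrative):

23:55  a writer starts uploading data files to S3 (not yet committed)
00:00  the mark step runs; the live set is computed without the in-flight files
00:05  the writer commits; the files are now referenced by the latest snapshot
00:30  the sweep step compares storage against the 00:00 live set and treats those files as orphans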

@Albel-Analyst

I noticed there is a --max-file-modification flag for the expire operation in Nessie GC.

Could this option resolve the concurrency issue described above, where freshly written (but not yet committed) files end up marked as orphaned?

@dimas-b
Member

dimas-b commented Dec 27, 2024

Yes, --max-file-modification is meant to avoid situations when new files are created while GC is running.

This value applies to storage objects' "creation" timestamps as reported by the storage backend (e.g. S3).
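
Applied to the configs above, that would mean passing the flag in the sweep step's args. A sketch, assuming the flag accepts an ISO-8601 instant; the concrete value here is hypothetical and would normally be chosen to pre-date any in-flight writes (e.g. the start of the mark run):

                - sweep
                - --defer-deletes
                - --max-file-modification
                - "2024-12-05T00:00:00Z"  # hypothetical instant: only files created before this are eligible for deletion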

@Albel-Analyst

Albel-Analyst commented Dec 28, 2024

@dimas-b So could this be the root cause of the original issue, given that @justas200 is using the default ("Defaults to the created timestamp of the live-content-set")?

Could the timezone in the Nessie GC environment differ from the timezone Nessie GC receives from the S3 objects?

@dimas-b
Member

dimas-b commented Dec 30, 2024

Timezones (or DST) should not matter. Timestamps are manipulated as points on the universal time scale; for example, 2024-12-28T03:00:00+03:00 and 2024-12-28T00:00:00Z denote the same instant.
