Data preprocessing is non-deterministic due to Python's built-in hash function #11

xinyangz · 2024-12-12T22:47:35Z

First of all, thank you for the great paper and package!

The issue

I've been using it to run evaluations on public models and have found slight variations in model performance on ICL tasks across runs (I haven't tested the other tasks yet).

The cause

Upon examining the code, I found that data loading is not deterministic. The root cause is the use of Python's built-in hash function. For example: https://github.com/princeton-nlp/HELMET/blob/main/data.py#L450-L452

Contrary to common belief, Python's built-in hash function is not deterministic across runs: since Python 3.3, hashes of str and bytes objects are salted with a per-process random value unless PYTHONHASHSEED is set. Please see this community blog post: https://chenna.me/blog/2023/12/25/python-hash-is-not-deterministic/
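The behavior can be checked with a small sketch like the one below. It is not taken from the HELMET code; `hash_in_fresh_interpreter` is a hypothetical helper that starts a new Python process with a chosen `PYTHONHASHSEED` and returns the value of `hash()` for a string.

```python
import os
import subprocess
import sys


def hash_in_fresh_interpreter(text: str, seed: str) -> int:
    """Return hash(text) as computed by a new Python process.

    `seed` is passed as PYTHONHASHSEED, which controls the salt used
    for str/bytes hashing in that process.
    """
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    # Same seed: the value is reproducible. Different seeds: it changes.
    print(hash_in_fresh_interpreter("example", "1"))
    print(hash_in_fresh_interpreter("example", "1"))
    print(hash_in_fresh_interpreter("example", "2"))
```

Without `PYTHONHASHSEED` set, each run picks its own random salt, which is why results can differ between otherwise identical evaluation runs. Exporting a fixed `PYTHONHASHSEED` before launching is a possible stopgap, but it relies on every user remembering to set it.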

Proposed changes

Switch to hashlib for all hashing operations.
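A minimal sketch of such a replacement is shown below. It assumes the hashed values are strings and that only a stable integer is needed; the linked lines are not reproduced here, so how the result is consumed there is an assumption. The name `stable_hash` and the `num_bits` parameter are hypothetical, not part of HELMET.

```python
import hashlib


def stable_hash(text: str, num_bits: int = 64) -> int:
    """Return an integer hash of `text` that is identical across runs and machines.

    Uses SHA-256 from hashlib instead of the built-in hash(), which is
    salted per process for str objects.
    """
    if num_bits % 8 != 0 or not 0 < num_bits <= 256:
        raise ValueError("num_bits must be a multiple of 8 between 8 and 256")
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Keep only the leading bytes so the result fits in `num_bits` bits.
    return int.from_bytes(digest[: num_bits // 8], "big")


if __name__ == "__main__":
    # The printed value does not change between interpreter runs.
    print(stable_hash("some example text"))
```

If the value feeds a random seed, it may need to be reduced further (for example with `% 2**32`) depending on what the seeding function accepts.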


howard-yen commented Dec 15, 2024
