Non-reproducible XGBoost ranking #497
On which ES / Java version did you observe this behaviour? Just to make sure it's not < Java 17 and you are not observing the behaviour that is described here in the context of
Hi @renekrie, thanks for your reply =)
I just noticed that the XGBoost example in the plugin's docs uses the get_dump method.
I have a hunch that this get_dump method may be dumping the numbers as higher-precision decimals instead of float32s (which, given the LTR plugin's trimming to float32 during parsing, may produce an essentially different model when loaded by the plugin).
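A minimal sketch of how one might check that hunch (illustrative only, not the author's original code; the toy data and file name are hypothetical, and the JSON paths follow the XGBoost model schema):

```python
# Compare how get_dump() and save_model() serialize the same split threshold.
import json
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.random((100, 3)).astype(np.float32)
y = (X[:, 0] > 0.5).astype(np.float64)  # force an obvious split on f0
booster = xgb.train({"objective": "reg:squarederror", "max_depth": 2},
                    xgb.DMatrix(X, label=y), num_boost_round=1)

# get_dump() returns one JSON string per tree, thresholds appear as decimal text
dump_tree = json.loads(booster.get_dump(dump_format="json")[0])
print("get_dump threshold:  ", dump_tree["split_condition"])

# save_model() writes the full model; split_conditions holds the raw values
booster.save_model("model.json")
with open("model.json") as f:
    model = json.load(f)
trees = model["learner"]["gradient_booster"]["model"]["trees"]
print("save_model threshold:", trees[0]["split_conditions"][0])
```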
I'm wondering if it's possible for the LTR plugin to load the model from the output of this xgboost method: https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.Booster.save_model
Okay, I've confirmed my theory =) Here's the complete summary of the issue that we observe right now: the model training code (as well as the plugin's docs) uses the get_dump xgboost method to save the trained model. However, the XGBoost docs state that get_dump is primarily meant for visualization or interpretation (XGBoost won't even be able to load the model back from this dump); for dumping models for later inference, save_model should be used instead. I've checked the way the model's split thresholds (the split conditions) are serialized by each method:
As you can see, the number from the get_dump method has more significant decimal digits than the one from save_model, and the same holds for the rest of the model. This kind of difference may lead to an essentially different model being loaded by the LTR plugin (even though it parses the numbers as 32-bit floats), and this may well be the reason why we were seeing different scores. Those extra digits at the end simply shouldn't be there; the original model doesn't have them (which is logical, because XGBoost operates with float32). If we want the scores to be reproducible, the model we load into ES must not contain the extra digits that appear in get_dump's result. The problem now is that the output of the "proper" save_model method can't be read by the LTR plugin.
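To illustrate the precision point (a sketch with a made-up threshold, not a number from the actual model): a decimal with 9+ significant digits is generally not exactly representable as a float32, so parsing it as one silently changes the value.

```python
import numpy as np

text = "0.151874483"     # hypothetical 9-digit threshold, get_dump-style
f32 = np.float32(text)   # thresholds parsed as 32-bit floats, as in the plugin
f64 = float(text)        # the value the decimal text actually denotes
print(f32)               # nearest float32, not the printed decimal
print(float(f32) == f64) # expect False: the float32 differs from the dumped text
```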
We ran tests with our own model and the implementation I've posted above; apparently it fixes the score mismatches that we've been having. cc @wrigleyDan
Thank you for your work and insights @pltb. To be honest, I'm not aware of the differences in format between the get_dump and save_model outputs. Would it be possible to handle the save_model format with the existing parser?
Thank you for your response @renekrie!
Unfortunately it cannot, because the structure of the JSON is entirely different (it's flat; you can see the JSON schema here). I wish we were able to somehow reuse/adapt the original parser, but it seems that introducing a different parser (and exposing it via a separate content type in the API) is the easier way to address the issue.
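Roughly, the structural difference looks like this (a simplified sketch of the two shapes, not the exact schemas):

```python
# get_dump(): one nested JSON object per tree, nodes contain their children
get_dump_style = {
    "nodeid": 0, "split": "f0", "split_condition": 0.5, "children": [
        {"nodeid": 1, "leaf": 0.1},
        {"nodeid": 2, "leaf": -0.1},
    ],
}

# save_model(): each tree is a set of flat, parallel arrays indexed by node id
save_model_style = {
    "split_indices":    [0, 0, 0],         # feature index per node
    "split_conditions": [0.5, 0.1, -0.1],  # thresholds (leaf values for leaves)
    "left_children":    [1, -1, -1],       # -1 marks a leaf
    "right_children":   [2, -1, -1],
}
```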
@renekrie I've cleaned the code up a bit; now it should be clearer why the parser can't simply be reused.
Thank you @pltb! I will have a look and then we'll need to figure out how we deal with it.
@renekrie thank you! |
@renekrie let me know if you need my help with that 👍
We've noticed some unexpected XGBoost model behavior when comparing the scoring in the ES+LTR runtime to what the original inference (Python + pandas + XGBoost) gave us.
The likely cause of the discrepancies was this: since every feature the model receives is a function score, the model gets float32-typed features in the ES LTR runtime (i.e., at most about 7 significant decimal digits), while our model's splitting thresholds are numbers with 9 or more significant digits.
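A quick illustration of that precision limit (a sketch; the sample value is made up):

```python
import numpy as np

# float32 carries roughly 6-7 significant decimal digits
print(np.finfo(np.float32).precision)  # 6

feature = np.float32(0.123456789)      # a hypothetical function-score feature
print(repr(feature))                   # digits beyond ~7 are not preserved
```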
We think it's important to mention this explicitly in the plugin's documentation, because it's not obvious and may lead to unpredictable results.
For more context please see the following:
script_score float precision limit elastic/elasticsearch#49402