Realtime test sometimes failing? #2515
See also this run - python 3.13 + postgres.
It looks like the test test_realtime_cache_different_settings is failing intermittently, and it's likely to do with the SQL caching mechanism. I've looked through the stack a bit to try to figure out what code might be non-deterministic. My current hypothesis is based around the fact that we use id(settings) as the cache key.

So far this is just a theory. I think this scenario should be possible, but I haven't managed locally to generate an id collision. If this is the issue, then it should be relatively easy to fix with a slight tweak to the caching strategy (and maybe we could do this anyway). But it would be nice to get to the bottom of this beforehand, as this kind of failure could mask more subtle bugs. If, for instance, the only difference between the two settings objects were in the comparison levels (or even just the parameter values!), then the 'incorrect' SQL would still execute. If we were only dealing with the ...
Note this line from the python docs on id(): "Two objects with non-overlapping lifetimes may have the same id() value." This suggests that this mechanism is at least plausible; how likely it is, I don't really have a sense of.
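To illustrate the mechanism (a minimal CPython demonstration only, not the Splink code): a cache keyed on id() can match a completely different object once the original has been garbage collected.

```python
# Minimal CPython demonstration of id() reuse: two objects with
# non-overlapping lifetimes can end up with the same id, so a cache
# keyed on id(settings) can hand back SQL built for a different object.
class FakeSettings:
    pass

sql_cache = {}

first = FakeSettings()
sql_cache[id(first)] = "SQL generated for the first settings object"
del first  # garbage collected; CPython is now free to reuse the address

second = FakeSettings()  # often allocated at the recycled address
print(id(second) in sql_cache)  # frequently True: second would get stale SQL
```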
Okay, I believe that this is in fact what is occurring. I have run the tests repeatedly on my fork in CI to get a failure, with some additional logging in place. This run has a failure, and you can see in the logs that a settings object ends up with the same id as an earlier (already garbage-collected) settings object.

For reference, here is an equivalent portion of the logs for a non-failing test run - in this case the object gets a distinct id.
Thanks for looking into this, and excellent detective work! I just checked back on this one because I just got the same failure here.

The reason I used id(settings) was simply that it seemed like a low-overhead way of uniquely identifying a settings object. Obviously it didn't work as anticipated, but any other mechanism should do. It was just to avoid caching errors in the case where the user is making predictions from two different models - leading to the potential for one model to use the cached SQL from the other.
Update on this / to document my thinking:
I have a candidate fix based on weak references to the settings object, and I've run it with some hooks to confirm that in cases where this issue would have arisen it fixes things. I could merge that in now, and it would solve the intermittent test failures. But as it's not a blocking issue currently, I'm going to explore another option based on hashing the object (or a snapshot of it). I realise it's probably a fairly niche bug, but I think it would be useful to try and get this right, as there is potential for quite a subtle (and thus hard-to-notice) error here.
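For context, a minimal sketch of what a weakref-based fix along these lines could look like - the cache and helper names here are illustrative, not the actual Splink internals. The idea is that the cache entry is evicted as soon as the settings object is garbage collected, so a recycled id can never resurrect another object's SQL.

```python
import weakref

_sql_cache = {}  # id(settings) -> generated SQL (illustrative)

def get_cached_sql(settings, build_sql):
    key = id(settings)
    if key not in _sql_cache:
        _sql_cache[key] = build_sql(settings)
        # Drop the entry when `settings` is garbage collected, so a later
        # object that happens to reuse this id cannot see stale SQL.
        weakref.finalize(settings, _sql_cache.pop, key, None)
    return _sql_cache[key]
```

Note this relies on the settings being weak-referenceable, which is true for ordinary class instances but not for plain dicts or strings - so it only really helps the object case.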
Interesting - the weakref idea is definitely an improvement, and feels like the correct/principled solution for the object case; it's a shame it can't cover every way settings can be supplied. I agree that it's worth solving this 'properly'. Could we simply set our own uuid on the settings object and use that as the cache key?

But, I appreciate it's a bit hacky! The loading-directly-from-a-path scenario is interesting because it's potentially a common production use case. The simplicity of 'if settings is a string, use it as a key' seems potentially nice - easy to understand and high performance. (Apologies for the mistake in the first place - I had assumed that because the id is a big number, the chance of collision was negligible, but I can see now why that's not the case.) Or maybe it's just a simple branch on how the settings were supplied, along the lines sketched below.
That doesn't feel too bad...
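A rough sketch of that kind of branch (purely illustrative - this is not the snippet originally posted here, and the attribute name is invented): a path keys the cache by its value, while an in-memory settings object gets a uuid stamped on it the first time it is seen.

```python
import uuid

def cache_key(settings):
    if isinstance(settings, str):
        # Settings loaded from a path: the path string itself is the key.
        return settings
    # In-memory settings object: stamp a uuid on first use and reuse it,
    # so the key lives exactly as long as the object does.
    if not hasattr(settings, "_cache_uuid"):
        settings._cache_uuid = uuid.uuid4().hex
    return settings._cache_uuid
```

(A plain dict would need different handling here, since you can't set attributes on it - which is the complication that comes up later in the thread.)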
I had thought about using something along those lines too.

The other, separate issue I wonder about with loading from a path is the fact that the file might change between calls. So you get the SQL, then another process updates the model, but you still get the cached SQL because the path is the same. Not sure for the moment how likely this kind of scenario is, though.

I don't think there's really loads we can do about that - we could read the file in and use a hash of its contents, but then we have a disk read every time, which probably isn't ideal if we are aiming for high performance. I suppose we could query the metadata for last-modified time (which is maybe cached?) if we wanted, but I don't know off the top of my head whether that is still too expensive. Or possibly leave it, and just emit a warning. At any rate I won't do anything here, as I think it is really a separate (albeit related) concern.
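For reference, the two invalidation options mentioned look roughly like this (illustrative helper names; neither is what Splink actually does): keying on a content hash always detects changes but costs a disk read per call, while keying on stat metadata is cheap but trusts filesystem timestamps.

```python
import hashlib
from pathlib import Path

def key_from_contents(path: str) -> str:
    # Always correct, but reads the whole file on every call.
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def key_from_metadata(path: str) -> str:
    # Cheap: one stat() call, but misses edits that preserve mtime and size.
    stat = Path(path).stat()
    return f"{path}:{stat.st_mtime_ns}:{stat.st_size}"
```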
My current feeling is that this is probably the best way, and we can get something merged. I will just keep this in the back of my head though - will follow up with some additional testing to try and cover some potential edge cases.
After this falling by the wayside, I am keen to kick this over the line, as it keeps on coming up and is pretty annoying. The complexity is the case where the user supplies settings as a dict.

I'm tempted to go down the route of serialising the dict and using that as the cache key. Another option (which I don't love) is just turning off the caching mechanism when users supply settings as a dict (along with a warning). This is a very specific use case, and we generally encourage other ways of supplying settings in any case.

Or maybe we do the opposite and always retrieve from the cache when we are dealing with dicts. That way users can still use dicts at high performance, with the caveat that if they are working with multiple settings dicts they might get the wrong SQL. That feels like less of an impact to work around, but potentially more damaging for users who ignore or do not read the warning (although who would ever ignore a warning like that!).

I'm writing this out partly to help me remember what the complications are if stuff comes up and I end up taking a moment to get back to this.
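To make the trade-off concrete, the 'turn it off and warn' option might look something like this (illustrative names only - there is no such get_or_build helper in Splink, and this is not the approach that was ultimately adopted):

```python
import warnings

def get_sql(settings, build_sql, cache):
    # Dict settings bypass the cache entirely, with a warning, so stale SQL
    # can never be served - at the cost of regenerating SQL on every call.
    if isinstance(settings, dict):
        warnings.warn(
            "Settings were supplied as a dict, so SQL caching is disabled "
            "for this call; repeated predictions will regenerate SQL each time."
        )
        return build_sql(settings)
    return cache.get_or_build(settings, build_sql)
```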
Thanks - yeah, it would be good to get this over the line. I agree about what we want to avoid here.

Overall I think I'm inclined towards simpler but less complete solutions that are bug free, rather than something more complex that works in every circumstance. So I think I'd err on the side of the simpler option.

I think what matters is that it's possible to get this performance boost - so that people embedding Splink into realtime systems have at least one option that gives them great performance. Such people should be 'deep' enough into Splink that they've read any warnings we emit. I think it's much less important that it always works for a typical user just playing about with Splink in, say, a Jupyter notebook.
Just for future reference on this: it turns out I had already done the work to allow dicts to be fully converted to native python types, so the solution uses this + json dump as a key.
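A sketch of what that key strategy amounts to (the conversion step here is a stand-in for the existing Splink machinery referred to above, not its real name): once the settings dict is reduced to plain python types, a sorted-keys json dump gives a deterministic key, so identical settings share cached SQL and different settings can never collide.

```python
import json

def settings_dict_cache_key(settings_dict: dict) -> str:
    # Stand-in for the existing "convert to native python types" step:
    # a json round-trip that also stringifies any non-serialisable values.
    native = json.loads(json.dumps(settings_dict, default=str))
    # Deterministic, content-based key: same settings -> same key,
    # different settings -> different key (unlike id()-based keys).
    return json.dumps(native, sort_keys=True)
```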
The test tests/test_realtime.py::test_realtime_cache_different_settings[duckdb] failed in python 3.9 on this run. Re-running the job allowed the test to pass, which is odd as I can't see where any indeterminacy could occur. The error was:

Binder Error: Values list "l" does not have a column named "tf_first_name"

(Full test output, test summary output, and environment info (Python 3.9.20) were attached as collapsed sections.)