-
Notifications
You must be signed in to change notification settings - Fork 168
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP refactor func globals to make external delegation possible #430
base: master
Are you sure you want to change the base?
Conversation
Codecov Report
@@ Coverage Diff @@
## master #430 +/- ##
==========================================
+ Coverage 91.99% 92.17% +0.18%
==========================================
Files 4 4
Lines 687 703 +16
Branches 141 140 -1
==========================================
+ Hits 632 648 +16
Misses 34 34
Partials 21 21
Continue to review full report at Codecov.
|
@staticmethod | ||
def persistent_id(obj): | ||
if ( | ||
isinstance(obj, _FuncMetadataGlobals) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_FuncMetadataGlobals
and _FuncCodeGlobals
are still private API so the subclassing pickler would still be impacted by changes in those constructs. However it feels maybe less brittle than overriding the save_function
or _dynamic_function_reduce
methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this is worse than my solution:
- it is less intuitive as it forces you to use private API
- it also brings unnecessary wrapping of all dicts, so you will have to maintain these wrappers forever
My solution does not bring any more APIs as it follows official python docs. It also does not complicate serialization for people who don't use persistent_id
for func.__globals__
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I don't like your solution because it's calling persistent_id
with the global dict at a point where it's not supposed to be called: the full func globals dict is never pickled when persistent_id
is not overridden, only the derived base_globals
and f_globals
are. Therefore your solution is a breach/abuse of the persistent_id
protocol as defined in the pickle module documentation.
If we don't like the wrapping, we could instead inject a specific key inside the dict with the id of the full func.__globals__
. But it would make it hard to write code to access it and it would make it impossible to know which function is being pickled (but maybe users don't need that info).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the full func globals dict is never pickled when persistent_id is not overridden, only the derived base_globals and f_globals are.
Are you talking about cloudpickle
or pickle
? Because pickle
pickles full func globals dict and we want to be able to tell cloudpickle
that we want the same behaviour here.
Therefore your solution is a breach/abuse of the persistent_id protocol as defined in the pickle module documentation.
I don't see how it is. When you replace cloudpickle.CloudPickler
with pickle.Pickler
in my example you receive the desired behavior:
import io
import pickle
__globals__ = {"a": "foo"}
class Pickler(pickle.Pickler):
@staticmethod
def persistent_id(obj):
if id(obj) == id(__globals__):
return "__globals__"
class Unpickler(pickle.Unpickler):
@staticmethod
def persistent_load(pid):
return {"__globals__": __globals__}[pid]
exec("""
def get():
return a
""", __globals__)
get = __globals__['get']
file = io.BytesIO()
Pickler(file).dump(get)
dumped = file.getvalue()
if b'foo' in dumped:
raise Exception(repr(dumped))
get = Unpickler(io.BytesIO(dumped)).load()
if id(__globals__) != id(get.__globals__):
raise Exception(f"id(__globals__) != id(get.__globals__)")
print(f"get() == {get()}")
__globals__['a'] = 'bar'
print(f"get() == {get()}")
But it would make it hard to write code to access it and it would make it impossible to know which function is being pickled (but maybe users don't need that info).
That's right, users don't need that info.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you replace cloudpickle.CloudPickler with pickle.Pickler in my example you receive the desired behavior
Your example here works without defining persistent_id
and persistent_load
, so I am not sure it is relevant to why your changes are "abusing" the persistent_id
protocol.
In [9]: import io
...: import pickle
...:
...: __globals__ = {"a": "foo"}
...:
...:
...: class Pickler(pickle.Pickler):
...: pass
...:
...: class Unpickler(pickle.Unpickler):
...: pass
...:
...: exec("""
...: def get():
...: return a
...: """, __globals__)
...: get = __globals__['get']
...: file = io.BytesIO()
...: Pickler(file).dump(get)
...: dumped = file.getvalue()
...: if b'foo' in dumped:
...: raise Exception(repr(dumped))
...: get = Unpickler(io.BytesIO(dumped)).load()
...: if id(__globals__) != id(get.__globals__):
...: raise Exception(f"id(__globals__) != id(get.__globals__)")
...: print(f"get() == {get()}")
...: __globals__['a'] = 'bar'
...: print(f"get() == {get()}")
I'll try to explain one more time why your PR does not use persistent_id
as it is intended to be used. According to the pickle
documentation:
To pickle objects that have an external persistent ID, the pickler must have a custom persistent_id() method that takes an object as an argument and returns either None or the persistent ID for that object. When None is returned, the pickler simply pickles the object as normal. When a persistent ID string is returned, the pickler will pickle that object, along with a marker so that the unpickler will recognize it as a persistent ID.
In other words, persistent_id(obj)
only has an impact on which obj
is pickled. In your PR, persistent_id(func.__globals__)
has an impact on the content of two different objects that are NOT func.__globals__
: f_globals
and base_globals
. Such a case is never advertised by the pickle
docs, and thus is an abuse of the persistent_id
protocol.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because pickle pickles full func globals dict and we want to be able to tell cloudpickle that we want the same behaviour here
"pickles full globals dict" in which setting?
import pickle
global_var = 1
def f():
return global_var
pickle.dumps(f)
does not pickle any global variable, or any dictionary containing global variables. It pickles f
by reference.
@@ -49,6 +49,7 @@ | |||
from cloudpickle.cloudpickle import _make_empty_cell, cell_set | |||
from cloudpickle.cloudpickle import _extract_class_dict, _whichmodule | |||
from cloudpickle.cloudpickle import _lookup_module_and_qualname | |||
from cloudpickle.cloudpickle import _FuncMetadataGlobals, _FuncCodeGlobals |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to make those directly importable from the top level cloudpickle
package even if they are private API?
@ogrisel feel free to ping me when this is ready for review. |
I am not sure if we want merge this PR. If it does not solve anybodies problem, then it's not worth the added complexity. Maybe @faucct is best addressed by just overriding |
I don't like this solution. I will check if overriding |
Also, what about putting .globals substituion behind a flag? |
Prototype to tackle #411 without calling
self.persistent_id
on a object that would not otherwise be part of the pickle stream.