feat(jit): (un)-serialization of jit script modules #1240
This is useful for sending jitted modules between R processes. Resolves Issue mlverse#1236
@dfalbel I once again need the 'lantern' tag. Can I maybe get the permission to do this myself?
@dfalbel I think the bug is:
library(torch)
x = torch_tensor(1)
n1 = nn_linear(1, 1)
n2 = nn_linear(1, 1)
set.seed(1)
jit_trace(n1, x)
#> An `nn_module` containing 2 parameters.
#>
#> ── Parameters ──────────────────────────────────────────────────────────────────
#> • weight: Float [1:1, 1:1]
#> • bias: Float [1:1]
set.seed(1)
jit_trace(n1, x)
#> Error in cpp_jit_script_module_new(.compilation_unit, make_script_module_name(mod)): class '__torch__.nn_linear_ydgabwknrsauujvnjgioueiy' already defined.
#> Exception raised from register_type at /Users/runner/work/libtorch-mac-m1/libtorch-mac-m1/pytorch/torch/csrc/jit/api/compilation_unit.h:180 (most recent call first):
#> frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>) + 52 (0x103dc811c in libc10.dylib)
#> frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) + 140 (0x103dc4d6c in libc10.dylib)
#> frame #2: torch::jit::CompilationUnit::register_type(std::__1::shared_ptr<c10::NamedType>) + 264 (0x116a4f3a4 in libtorch_cpu.dylib)
#> frame #3: torch::jit::create_module_object(c10::QualifiedName, std::__1::shared_ptr<torch::jit::CompilationUnit>, bool) + 1640 (0x116a4139c in libtorch_cpu.dylib)
#> frame #4: torch::jit::Module::Module(c10::QualifiedName, std::__1::shared_ptr<torch::jit::CompilationUnit>, bool) + 144 (0x116a41b88 in libtorch_cpu.dylib)
#> frame #5: _lantern_ScriptModule_new + 156 (0x1070db9bc in liblantern.dylib)
#> frame #6: lantern_ScriptModule_new + 48 (0x1051760b8 in torchpkg.so)
#> frame #7: cpp_jit_script_module_new(XPtrTorchCompilationUnit, XPtrTorchstring) + 68 (0x105176070 in torchpkg.so)
#> frame #8: _torch_cpp_jit_script_module_new + 240 (0x104f38ea8 in torchpkg.so)
#> frame #9: R_doDotCall + 268 (0x1015c048c in libR.dylib)
#> frame #10: bcEval_loop + 128060 (0x10161ccfc in libR.dylib)
#> frame #11: bcEval + 684 (0x1015efeec in libR.dylib)
#> frame #12: Rf_eval + 556 (0x1015ef5ec in libR.dylib)
#> frame #13: R_execClosure + 812 (0x1015f21ac in libR.dylib)
#> frame #14: applyClosure_core + 164 (0x1015f12a4 in libR.dylib)
#> frame #15: Rf_eval + 1224 (0x1015ef888 in libR.dylib)
#> frame #16: do_docall + 644 (0x10158df04 in libR.dylib)
#> frame #17: bcEval_loop + 40164 (0x1016075a4 in libR.dylib)
#> frame #18: bcEval + 684 (0x1015efeec in libR.dylib)
#> frame #19: Rf_eval + 556 (0x1015ef5ec in libR.dylib)
#> frame #20: R_execClosure + 812 (0x1015f21ac in libR.dylib)
#> frame #21: applyClosure_core + 164 (0x1015f12a4 in libR.dylib)
#> frame #22: Rf_eval + 1224 (0x1015ef888 in libR.dylib)
#> frame #23: do_eval + 1352 (0x1015f6ac8 in libR.dylib)
#> frame #24: bcEval_loop + 40164 (0x1016075a4 in libR.dylib)
#> frame #25: bcEval + 684 (0x1015efeec in libR.dylib)
#> frame #26: Rf_eval + 556 (0x1015ef5ec in libR.dylib)
#> frame #27: forcePromise + 232 (0x1015f0128 in libR.dylib)
#> frame #28: Rf_eval + 660 (0x1015ef654 in libR.dylib)
#> frame #29: do_withVisible + 64 (0x1015f6e00 in libR.dylib)
#> frame #30: do_internal + 400 (0x10165fbd0 in libR.dylib)
#> frame #31: bcEval_loop + 40724 (0x1016077d4 in libR.dylib)
#> frame #32: bcEval + 684 (0x1015efeec in libR.dylib)
#> frame #33: Rf_eval + 556 (0x1015ef5ec in libR.dylib)
#> frame #34: R_execClosure + 812 (0x1015f21ac in libR.dylib)
#> frame #35: applyClosure_core + 164 (0x1015f12a4 in libR.dylib)
#> frame #36: Rf_eval + 1224 (0x1015ef888 in libR.dylib)
#> frame #37: R_execClosure + 812 (0x1015f21ac in libR.dylib)
#> frame #38: applyClosure_core + 164 (0x1015f12a4 in libR.dylib)
#> frame #39: bcEval_loop + 37296 (0x101606a70 in libR.dylib)
#> frame #40: bcEval + 684 (0x1015efeec in libR.dylib)
#> frame #41: Rf_eval + 556 (0x1015ef5ec in libR.dylib)
#> frame #42: R_execClosure + 812 (0x1015f21ac in libR.dylib)
#> frame #43: applyClosure_core + 164 (0x1015f12a4 in libR.dylib)
#> frame #44: Rf_eval + 1224 (0x1015ef888 in libR.dylib)
#> frame #45: do_begin + 396 (0x1015f4a4c in libR.dylib)
#> frame #46: Rf_eval + 1012 (0x1015ef7b4 in libR.dylib)
#> frame #47: R_execClosure + 812 (0x1015f21ac in libR.dylib)
#> frame #48: applyClosure_core + 164 (0x1015f12a4 in libR.dylib)
#> frame #49: Rf_eval + 1224 (0x1015ef888 in libR.dylib)
#> frame #50: do_docall + 644 (0x10158df04 in libR.dylib)
#> frame #51: bcEval_loop + 40164 (0x1016075a4 in libR.dylib)
#> frame #52: bcEval + 684 (0x1015efeec in libR.dylib)
#> frame #53: Rf_eval + 556 (0x1015ef5ec in libR.dylib)
#> frame #54: R_execClosure + 812 (0x1015f21ac in libR.dylib)
#> frame #55: applyClosure_core + 164 (0x1015f12a4 in libR.dylib)
#> frame #56: Rf_eval + 1224 (0x1015ef888 in libR.dylib)
#> frame #57: do_docall + 644 (0x10158df04 in libR.dylib)
#> frame #58: bcEval_loop + 40164 (0x1016075a4 in libR.dylib)
#> frame #59: bcEval + 684 (0x1015efeec in libR.dylib)
#> frame #60: Rf_eval + 556 (0x1015ef5ec in libR.dylib)
#> frame #61: R_execClosure + 812 (0x1015f21ac in libR.dylib)
#> frame #62: applyClosure_core + 164 (0x1015f12a4 in libR.dylib)
Created on 2025-01-20 with reprex v2.1.1
We have already discussed this once: we should probably create a different compilation unit every time we trace-jit. I don't think this bug was introduced by this PR, however. Not sure why it only pops up now.
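To make the collision in the reprex easier to see, here is a minimal base-R sketch that does not need torch. `make_name()` is a hypothetical stand-in for the package's internal name generator; the assumption (suggested by the error message, not confirmed from the source) is that the name is a class prefix plus a randomly sampled suffix of lowercase letters.

```r
# Hypothetical stand-in for the internal name generator (an assumption):
# a class prefix followed by 24 randomly sampled lowercase letters.
make_name <- function(prefix = "nn_linear") {
  suffix <- paste(sample(letters, 24, replace = TRUE), collapse = "")
  paste0("__torch__.", prefix, "_", suffix)
}

set.seed(1)
first <- make_name()
set.seed(1)
second <- make_name()

# Resetting the seed replays the same draw, so both names are equal.
# Registering `second` in a compilation unit that already holds `first`
# would then fail with "already defined".
identical(first, second)
#> [1] TRUE
```

Under that assumption, any code path that resets the RNG between two traces (tests, parallel workers started with the same seed) can hit the same error.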
I think a quick fix that we can add to this PR is to make the module name more like a timestamp instead of sampling it as we do here: Lines 213 to 215 in ba479c7
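A rough sketch of what such a seed-independent name could look like, in base R. `make_unique_name()` and `name_counter()` are hypothetical helpers for illustration, not the code that was merged; the idea is only that the clock, the process id, and a per-session counter do not depend on `set.seed()`.

```r
# Per-session counter (hypothetical helper): returns 1, 2, 3, ... on each call.
name_counter <- local({
  i <- 0L
  function() {
    i <<- i + 1L
    i
  }
})

# Hypothetical replacement for the sampled name: microseconds since the
# epoch, the process id, and the counter form the suffix.
make_unique_name <- function(prefix = "nn_linear") {
  stamp <- sprintf("%.0f", as.numeric(Sys.time()) * 1e6)
  paste0("__torch__.", prefix, "_", stamp, "_", Sys.getpid(), "_", name_counter())
}

set.seed(1)
a <- make_unique_name()
set.seed(1)
b <- make_unique_name()

# The counter alone makes the two names differ within one session,
# regardless of the RNG state.
identical(a, b)
#> [1] FALSE
```

The counter handles repeated calls within one session even when the clock does not advance. The timestamp and process id are there to make clashes unlikely when names from different R processes meet, which seems relevant when modules are sent between processes.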
@dfalbel The only failing test now seems to be unrelated to this PR.
LGTM!