feat(jit): (un)-serialization of jit script modules #1240

Merged
merged 8 commits into from
Jan 21, 2025

sebffischer
Collaborator

This is useful for sending jitted modules between R processes.

Resolves Issue #1236

@sebffischer
Collaborator Author

@dfalbel I once again need the 'lantern' tag. Can I maybe get the permission to do this myself?

@sebffischer sebffischer changed the title feat(jit): (un)-serialization of jit scropt modules feat(jit): (un)-serialization of jit script modules Jan 16, 2025
@dfalbel dfalbel added the lantern Use this label if your PR affects lantern so it's built in the CI label Jan 16, 2025
@sebffischer
Collaborator Author

@dfalbel I think the bug is:

library(torch)
x = torch_tensor(1)
n1 = nn_linear(1, 1)
n2 = nn_linear(1, 1)

set.seed(1)
jit_trace(n1, x)
#> An `nn_module` containing 2 parameters.
#> 
#> ── Parameters ──────────────────────────────────────────────────────────────────
#> • weight: Float [1:1, 1:1]
#> • bias: Float [1:1]
set.seed(1)
jit_trace(n1, x)
#> Error in cpp_jit_script_module_new(.compilation_unit, make_script_module_name(mod)): class '__torch__.nn_linear_ydgabwknrsauujvnjgioueiy' already defined.
#> Exception raised from register_type at /Users/runner/work/libtorch-mac-m1/libtorch-mac-m1/pytorch/torch/csrc/jit/api/compilation_unit.h:180 (most recent call first):
#> frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>) + 52 (0x103dc811c in libc10.dylib)
#> frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) + 140 (0x103dc4d6c in libc10.dylib)
#> frame #2: torch::jit::CompilationUnit::register_type(std::__1::shared_ptr<c10::NamedType>) + 264 (0x116a4f3a4 in libtorch_cpu.dylib)
#> frame #3: torch::jit::create_module_object(c10::QualifiedName, std::__1::shared_ptr<torch::jit::CompilationUnit>, bool) + 1640 (0x116a4139c in libtorch_cpu.dylib)
#> frame #4: torch::jit::Module::Module(c10::QualifiedName, std::__1::shared_ptr<torch::jit::CompilationUnit>, bool) + 144 (0x116a41b88 in libtorch_cpu.dylib)
#> frame #5: _lantern_ScriptModule_new + 156 (0x1070db9bc in liblantern.dylib)
#> frame #6: lantern_ScriptModule_new + 48 (0x1051760b8 in torchpkg.so)
#> frame #7: cpp_jit_script_module_new(XPtrTorchCompilationUnit, XPtrTorchstring) + 68 (0x105176070 in torchpkg.so)
#> frame #8: _torch_cpp_jit_script_module_new + 240 (0x104f38ea8 in torchpkg.so)
#> frame #9: R_doDotCall + 268 (0x1015c048c in libR.dylib)
#> frame #10: bcEval_loop + 128060 (0x10161ccfc in libR.dylib)
#> frame #11: bcEval + 684 (0x1015efeec in libR.dylib)
#> frame #12: Rf_eval + 556 (0x1015ef5ec in libR.dylib)
#> frame #13: R_execClosure + 812 (0x1015f21ac in libR.dylib)
#> frame #14: applyClosure_core + 164 (0x1015f12a4 in libR.dylib)
#> frame #15: Rf_eval + 1224 (0x1015ef888 in libR.dylib)
#> frame #16: do_docall + 644 (0x10158df04 in libR.dylib)
#> frame #17: bcEval_loop + 40164 (0x1016075a4 in libR.dylib)
#> frame #18: bcEval + 684 (0x1015efeec in libR.dylib)
#> frame #19: Rf_eval + 556 (0x1015ef5ec in libR.dylib)
#> frame #20: R_execClosure + 812 (0x1015f21ac in libR.dylib)
#> frame #21: applyClosure_core + 164 (0x1015f12a4 in libR.dylib)
#> frame #22: Rf_eval + 1224 (0x1015ef888 in libR.dylib)
#> frame #23: do_eval + 1352 (0x1015f6ac8 in libR.dylib)
#> frame #24: bcEval_loop + 40164 (0x1016075a4 in libR.dylib)
#> frame #25: bcEval + 684 (0x1015efeec in libR.dylib)
#> frame #26: Rf_eval + 556 (0x1015ef5ec in libR.dylib)
#> frame #27: forcePromise + 232 (0x1015f0128 in libR.dylib)
#> frame #28: Rf_eval + 660 (0x1015ef654 in libR.dylib)
#> frame #29: do_withVisible + 64 (0x1015f6e00 in libR.dylib)
#> frame #30: do_internal + 400 (0x10165fbd0 in libR.dylib)
#> frame #31: bcEval_loop + 40724 (0x1016077d4 in libR.dylib)
#> frame #32: bcEval + 684 (0x1015efeec in libR.dylib)
#> frame #33: Rf_eval + 556 (0x1015ef5ec in libR.dylib)
#> frame #34: R_execClosure + 812 (0x1015f21ac in libR.dylib)
#> frame #35: applyClosure_core + 164 (0x1015f12a4 in libR.dylib)
#> frame #36: Rf_eval + 1224 (0x1015ef888 in libR.dylib)
#> frame #37: R_execClosure + 812 (0x1015f21ac in libR.dylib)
#> frame #38: applyClosure_core + 164 (0x1015f12a4 in libR.dylib)
#> frame #39: bcEval_loop + 37296 (0x101606a70 in libR.dylib)
#> frame #40: bcEval + 684 (0x1015efeec in libR.dylib)
#> frame #41: Rf_eval + 556 (0x1015ef5ec in libR.dylib)
#> frame #42: R_execClosure + 812 (0x1015f21ac in libR.dylib)
#> frame #43: applyClosure_core + 164 (0x1015f12a4 in libR.dylib)
#> frame #44: Rf_eval + 1224 (0x1015ef888 in libR.dylib)
#> frame #45: do_begin + 396 (0x1015f4a4c in libR.dylib)
#> frame #46: Rf_eval + 1012 (0x1015ef7b4 in libR.dylib)
#> frame #47: R_execClosure + 812 (0x1015f21ac in libR.dylib)
#> frame #48: applyClosure_core + 164 (0x1015f12a4 in libR.dylib)
#> frame #49: Rf_eval + 1224 (0x1015ef888 in libR.dylib)
#> frame #50: do_docall + 644 (0x10158df04 in libR.dylib)
#> frame #51: bcEval_loop + 40164 (0x1016075a4 in libR.dylib)
#> frame #52: bcEval + 684 (0x1015efeec in libR.dylib)
#> frame #53: Rf_eval + 556 (0x1015ef5ec in libR.dylib)
#> frame #54: R_execClosure + 812 (0x1015f21ac in libR.dylib)
#> frame #55: applyClosure_core + 164 (0x1015f12a4 in libR.dylib)
#> frame #56: Rf_eval + 1224 (0x1015ef888 in libR.dylib)
#> frame #57: do_docall + 644 (0x10158df04 in libR.dylib)
#> frame #58: bcEval_loop + 40164 (0x1016075a4 in libR.dylib)
#> frame #59: bcEval + 684 (0x1015efeec in libR.dylib)
#> frame #60: Rf_eval + 556 (0x1015ef5ec in libR.dylib)
#> frame #61: R_execClosure + 812 (0x1015f21ac in libR.dylib)
#> frame #62: applyClosure_core + 164 (0x1015f12a4 in libR.dylib)

Created on 2025-01-20 with reprex v2.1.1

We have discussed this before: we should probably create a new compilation unit every time we call jit_trace(). I don't think this PR introduced the bug, though, and I'm not sure why it only shows up now.
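
A minimal base-R sketch of why the reprex above collides, assuming the name helper is the one quoted from torch/R/trace.R further down in this thread. `fake_module` is a hypothetical stand-in that only carries the class vector, so the sketch runs without torch. Resetting the seed makes `sample()` draw the same 24 letters, so both calls produce the same class name, which matches the "already defined" error.

```r
# Copy of the helper quoted from torch/R/trace.R in this thread.
make_script_module_name <- function(x) {
  paste0(class(x)[1], "_", paste(sample(letters, 24, replace = TRUE), collapse = ""))
}

# Hypothetical stand-in: the helper only looks at class(x)[1].
fake_module <- structure(list(), class = c("nn_linear", "nn_module"))

set.seed(1)
name1 <- make_script_module_name(fake_module)
set.seed(1)
name2 <- make_script_module_name(fake_module)

# Same seed, same draw: the two names are identical.
identical(name1, name2)
```

Without the second `set.seed(1)` a collision would be very unlikely, which may be why the problem rarely shows up in practice.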

@dfalbel
Member

dfalbel commented Jan 21, 2025

I think a quick fix that we can add to this PR is to make the module name more like a timestamp instead of sampling like we do here:

torch/R/trace.R

Lines 213 to 215 in ba479c7

make_script_module_name <- function(x) {
  paste0(class(x)[1], "_", paste(sample(letters, 24, replace = TRUE), collapse = ""))
}
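
One way the timestamp idea could look, as a sketch only: this is not the code from #1247, and `make_unique_module_name`, `.name_counter` and `fake_module` are hypothetical names introduced for illustration. The name is built from the clock plus a per-session counter, so it does not depend on the RNG state that `set.seed()` controls.

```r
# Hypothetical per-session counter; guards against two calls in the same clock tick.
.name_counter <- new.env()
.name_counter$n <- 0L

# Hypothetical replacement for make_script_module_name().
make_unique_module_name <- function(x) {
  .name_counter$n <- .name_counter$n + 1L
  # Timestamp with fractional seconds; does not touch the RNG.
  stamp <- format(Sys.time(), "%Y%m%d%H%M%OS6")
  # Drop the decimal separator so the name has only letters, digits and underscores.
  stamp <- gsub("[^0-9]", "", stamp)
  paste0(class(x)[1], "_", stamp, "_", .name_counter$n)
}

# Stand-in object carrying only the class vector, so torch is not required.
fake_module <- structure(list(), class = c("nn_linear", "nn_module"))

set.seed(1)
a <- make_unique_module_name(fake_module)
set.seed(1)
b <- make_unique_module_name(fake_module)

# The counter suffix differs, so the names differ even with the same seed.
a != b
```

The counter only guarantees uniqueness within one R session. Whether that is sufficient depends on how compilation units are shared, which this thread does not settle.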

@sebffischer
Collaborator Author

@dfalbel Yes, I just had the same idea: #1247

@sebffischer
Collaborator Author

@dfalbel The only failing test now seems to be unrelated to this PR.

Member

@dfalbel dfalbel left a comment

LGTM!

@dfalbel dfalbel merged commit ecdf13b into mlverse:main Jan 21, 2025
15 of 17 checks passed