Unable to determine pod name #77

Open
omus opened this issue May 19, 2021 · 2 comments
Comments

@omus
Member

omus commented May 19, 2021

@kolia reported this issue with K8sClusterManagers:

julia> addprocs(K8sClusterManager(n_workers; pending_timeout=180, memory="1Gi"))
[ Info: driver-2021-05-18--20-31-35-wgssh-worker-z4jjs is up
[ Info: driver-2021-05-18--20-31-35-wgssh-worker-fvt5b is up
[ Info: driver-2021-05-18--20-31-35-wgssh-worker-jt2dv is up
[ Info: driver-2021-05-18--20-31-35-wgssh-worker-gwtsp is up
[ Info: driver-2021-05-18--20-31-35-wgssh-worker-pv5dw is up
[ Info: driver-2021-05-18--20-31-35-wgssh-worker-dlzxx is up
ERROR: TaskFailedException
Stacktrace:
 [1] wait
   @ ./task.jl:322 [inlined]
 [2] addprocs_locked(manager::K8sClusterManager; kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:497
 [3] addprocs_locked
   @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:451 [inlined]
 [4] addprocs(manager::K8sClusterManager; kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:444
 [5] addprocs(manager::K8sClusterManager)
   @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:438
 [6] top-level scope
   @ REPL[16]:1
    nested task error: TaskFailedException
        nested task error: Unable to determine the pod name from: ""
        Stacktrace:
         [1] error(s::String)
           @ Base ./error.jl:33
         [2] create_pod(manifest::DataStructures.DefaultOrderedDict{String, Any, typeof(K8sClusterManagers.rdict)})
           @ K8sClusterManagers ~/.julia/packages/K8sClusterManagers/vRyNt/src/pod.jl:68
         [3] macro expansion
           @ ~/.julia/packages/K8sClusterManagers/vRyNt/src/native_driver.jl:107 [inlined]
         [4] (::K8sClusterManagers.var"#29#31"{K8sClusterManager, Vector{WorkerConfig}, Condition})()
           @ K8sClusterManagers ./task.jl:411
    ...and 25 more exceptions.
    Stacktrace:
     [1] sync_end(c::Channel{Any})
       @ Base ./task.jl:369
     [2] macro expansion
       @ ./task.jl:388 [inlined]
     [3] launch(manager::K8sClusterManager, params::Dict{Symbol, Any}, launched::Vector{WorkerConfig}, c::Condition)
       @ K8sClusterManagers ~/.julia/packages/K8sClusterManagers/vRyNt/src/native_driver.jl:105
     [4] (::Distributed.var"#39#42"{K8sClusterManager, Condition, Vector{WorkerConfig}, Dict{Symbol, Any}})()
       @ Distributed ./task.jl:411
@omus
Member Author

omus commented May 19, 2021

The "nested task error: Unable to determine the pod name from: """ comes from create_pod. It shows that the external command produced no stdout (hence the empty string in the message) and no stderr (otherwise a different exception would have been raised). Note that we're using ignorestatus, so the process's exit code could be useful here. One theory: since the launch call happens inside a task, output might be missed if Julia was busy with another task.
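To make the exit-code idea concrete, here is a minimal sketch of capturing stdout, stderr, and the exit code together while still using ignorestatus. The helper name `run_capture` is hypothetical and not part of K8sClusterManagers, and the `sh` command is only a stand-in for whatever external command create_pod actually runs, which is not shown in this thread.

```julia
# Hypothetical helper (not part of K8sClusterManagers): run a command and
# return its stdout, stderr, and exit code without throwing on failure.
function run_capture(cmd::Cmd)
    out = IOBuffer()
    err = IOBuffer()
    # `ignorestatus` suppresses the exception Julia would raise on a non-zero
    # exit; `run` blocks until the process has exited.
    p = run(pipeline(ignorestatus(cmd); stdout=out, stderr=err))
    return (stdout=String(take!(out)), stderr=String(take!(err)), exitcode=p.exitcode)
end

# Stand-in command that prints nothing and exits non-zero.
result = run_capture(`sh -c "exit 3"`)
if isempty(result.stdout)
    # With the exit code in hand, an empty stdout is no longer ambiguous.
    @warn "command produced no stdout" exitcode=result.exitcode stderr=result.stderr
end
```

Including the exit code in the error message would distinguish "the command failed silently" from "the command succeeded but its output was lost", which is the distinction the theory above depends on.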

Additionally, there are 25 more exceptions that we're not seeing, which could be useful for determining the root cause.
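One possible way to surface the hidden exceptions is to unwrap the nested CompositeException and TaskFailedException wrappers by hand. The sketch below is an assumption about how that could be done, not something the package does: `leaf_exceptions` and `demo` are hypothetical names, and the failing tasks only imitate the shape of the launch failure. It reads the failed task's `result` field, which holds the thrown exception.

```julia
# Recursively collect the underlying exceptions hidden inside nested
# CompositeException / TaskFailedException wrappers.
function leaf_exceptions(ex, acc=Any[])
    if ex isa CompositeException
        foreach(e -> leaf_exceptions(e, acc), ex.exceptions)
    elseif ex isa TaskFailedException
        # For a failed task, `result` holds the exception that was thrown,
        # which may itself be another wrapper.
        leaf_exceptions(ex.task.result, acc)
    else
        push!(acc, ex)
    end
    return acc
end

# Imitate several tasks failing inside `@sync`, as in the launch loop.
function demo()
    try
        @sync for i in 1:3
            @async error("failure in task $i")
        end
    catch ex
        return leaf_exceptions(ex)
    end
    return Any[]
end

# Print every underlying error rather than only the first one.
for e in demo()
    println(sprint(showerror, e))
end
```

Wrapping the addprocs call in a similar try/catch would list every underlying error instead of only the first.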

@ericphanson
Member

ericphanson commented Mar 4, 2022

I just ran into this too; I asked for 6 workers, and it seemed to happen on the 6th (since I got 5 "worker is up" log messages before it failed; no other log messages though). Partial stacktrace:

TaskFailedException
Stacktrace:
 [1] wait
   @ ./task.jl:334 [inlined]
 [2] addprocs_locked(manager::K8sClusterManager; kwargs::Base.Pairs{Symbol, String, Tuple{Symbol}, NamedTuple{(:exeflags,), Tuple{String}}})
   @ Distributed /usr/local/julia/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:504
 [3] addprocs(manager::K8sClusterManager; kwargs::Base.Pairs{Symbol, String, Tuple{Symbol}, NamedTuple{(:exeflags,), Tuple{String}}})
   @ Distributed /usr/local/julia/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:447
[truncated]
    nested task error: TaskFailedException
    
        nested task error: Unable to determine the pod name from: ""
        Stacktrace:
         [1] error(s::String)
           @ Base ./error.jl:33
         [2] create_pod(manifest::DataStructures.DefaultOrderedDict{String, Any, typeof(K8sClusterManagers.rdict)})
           @ K8sClusterManagers ~/.julia/packages/K8sClusterManagers/PIZ9P/src/pod.jl:66
         [3] macro expansion
           @ ~/.julia/packages/K8sClusterManagers/PIZ9P/src/native_driver.jl:103 [inlined]
         [4] (::K8sClusterManagers.var"#17#18"{K8sClusterManager, Vector{WorkerConfig}, Condition})()
           @ K8sClusterManagers ./task.jl:423
    Stacktrace:
     [1] sync_end(c::Channel{Any})
       @ Base ./task.jl:381
     [2] macro expansion
       @ ./task.jl:400 [inlined]
     [3] launch(manager::K8sClusterManager, params::Dict{Symbol, Any}, launched::Vector{WorkerConfig}, c::Condition)
       @ K8sClusterManagers ~/.julia/packages/K8sClusterManagers/PIZ9P/src/native_driver.jl:101
     [4] (::Distributed.var"#39#42"{K8sClusterManager, Condition, Vector{WorkerConfig}, Dict{Symbol, Any}})()
       @ Distributed ./task.jl:423

On K8sClusterManagers v0.1.3.
