
cannot run multiple containers in parallel in the same pod #47

Open · antoinetran opened this issue Jan 24, 2025 · 1 comment

@antoinetran (Contributor)

Example: one pod with one initContainer, one main container, and one wait container.

The generated job.sh is:

#!/bin/bash
#SBATCH --job-name=b7c6ba14-a332-4e63-87a7-5be69be7b968
#SBATCH --output=/home/username/.interlink/argo-workflows-b7c6ba14-a332-4e63-87a7-5be69be7b968/job.out
#SBATCH --mem=384
#SBATCH --cpus-per-task=3


singularity exec --no-eval --containall --nv   --env-file /home/username/.interlink/argo-workflows-b7c6ba14-a332-4e63-87a7-5be69be7b968/init_envfile.properties  --bind /home/username/.interlink/argo-workflows-b7c6ba14-a332-4e63-87a7-5be69be7b968/emptyDirs/argo-staging:/argo/staging:rw   --bind /home/username/.interlink/argo-workflows-b7c6ba14-a332-4e63-87a7-5be69be7b968/emptyDirs/var-run-argo:/var/run/argo:rw   --bind /home/username/.interlink/argo-workflows-b7c6ba14-a332-4e63-87a7-5be69be7b968/projectedVolumeMaps/kube-api-access-kpzmd/token:/var/run/secrets/kubernetes.io/serviceaccount/token:ro  --bind /home/username/.interlink/argo-workflows-b7c6ba14-a332-4e63-87a7-5be69be7b968/projectedVolumeMaps/kube-api-access-kpzmd/ca.crt:/var/run/secrets/kubernetes.io/serviceaccount/ca.crt:ro  --bind /home/username/.interlink/argo-workflows-b7c6ba14-a332-4e63-87a7-5be69be7b968/projectedVolumeMaps/kube-api-access-kpzmd/namespace:/var/run/secrets/kubernetes.io/serviceaccount/namespace:ro  docker://quay.io/argoproj/argoexec:latest argoexec init --loglevel info --log-format text --gloglevel 0 &> /home/username/.interlink/argo-workflows-b7c6ba14-a332-4e63-87a7-5be69be7b968/init.out; echo $? > /home/username/.interlink/argo-workflows-b7c6ba14-a332-4e63-87a7-5be69be7b968/init.status
singularity exec --no-eval --containall --nv   --env-file /home/username/.interlink/argo-workflows-b7c6ba14-a332-4e63-87a7-5be69be7b968/wait_envfile.properties  --bind /home/username/.interlink/argo-workflows-b7c6ba14-a332-4e63-87a7-5be69be7b968/emptyDirs/argo-staging:/mainctrfs/argo/staging:rw   --bind /home/username/.interlink/argo-workflows-b7c6ba14-a332-4e63-87a7-5be69be7b968/emptyDirs/tmp-dir-argo:/tmp:rw   --bind /home/username/.interlink/argo-workflows-b7c6ba14-a332-4e63-87a7-5be69be7b968/emptyDirs/var-run-argo:/var/run/argo:rw   --bind /home/username/.interlink/argo-workflows-b7c6ba14-a332-4e63-87a7-5be69be7b968/projectedVolumeMaps/kube-api-access-kpzmd/namespace:/var/run/secrets/kubernetes.io/serviceaccount/namespace:ro  --bind /home/username/.interlink/argo-workflows-b7c6ba14-a332-4e63-87a7-5be69be7b968/projectedVolumeMaps/kube-api-access-kpzmd/token:/var/run/secrets/kubernetes.io/serviceaccount/token:ro  --bind /home/username/.interlink/argo-workflows-b7c6ba14-a332-4e63-87a7-5be69be7b968/projectedVolumeMaps/kube-api-access-kpzmd/ca.crt:/var/run/secrets/kubernetes.io/serviceaccount/ca.crt:ro  docker://quay.io/argoproj/argoexec:latest argoexec wait --loglevel info --log-format text --gloglevel 0 &> /home/username/.interlink/argo-workflows-b7c6ba14-a332-4e63-87a7-5be69be7b968/wait.out; echo $? > /home/username/.interlink/argo-workflows-b7c6ba14-a332-4e63-87a7-5be69be7b968/wait.status; sleep 30 &
singularity exec --no-eval --containall --nv   --env-file /home/username/.interlink/argo-workflows-b7c6ba14-a332-4e63-87a7-5be69be7b968/main_envfile.properties  --bind /home/username/.interlink/argo-workflows-b7c6ba14-a332-4e63-87a7-5be69be7b968/emptyDirs/argo-staging:/argo/staging:rw   --bind /home/username/.interlink/argo-workflows-b7c6ba14-a332-4e63-87a7-5be69be7b968/emptyDirs/var-run-argo:/var/run/argo:rw   --bind /home/username/.interlink/argo-workflows-b7c6ba14-a332-4e63-87a7-5be69be7b968/projectedVolumeMaps/kube-api-access-kpzmd/ca.crt:/var/run/secrets/kubernetes.io/serviceaccount/ca.crt:ro  --bind /home/username/.interlink/argo-workflows-b7c6ba14-a332-4e63-87a7-5be69be7b968/projectedVolumeMaps/kube-api-access-kpzmd/namespace:/var/run/secrets/kubernetes.io/serviceaccount/namespace:ro  --bind /home/username/.interlink/argo-workflows-b7c6ba14-a332-4e63-87a7-5be69be7b968/projectedVolumeMaps/kube-api-access-kpzmd/token:/var/run/secrets/kubernetes.io/serviceaccount/token:ro  docker://python:3.8.18-bullseye /var/run/argo/argoexec emissary --loglevel info --log-format text --gloglevel 0 -- bash /argo/staging/script &> /home/username/.interlink/argo-workflows-b7c6ba14-a332-4e63-87a7-5be69be7b968/main.out; echo $? > /home/username/.interlink/argo-workflows-b7c6ba14-a332-4e63-87a7-5be69be7b968/main.status; sleep 30 &

This script is structurally equivalent to:

#!/bin/sh

set -x

echo step init
sleep 2
echo step wait
# Only the trailing "sleep 5" is backgrounded; "sleep infinity" blocks forever,
# like the wait container blocking before main ever starts.
sleep infinity ; sleep 5 &
echo step main
sleep 10 ; sleep 4 &
echo end

The result is that the script waits for step init to finish, and then the wait container runs in the foreground; the main container is never started in parallel. This is because in "cmd1 ; cmd2 &" the & applies only to the last command of the list, so each singularity exec line blocks until its container exits.
The expected result is for init to finish, and then for wait and main to run in parallel.
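
For illustration, here is a minimal sketch (toy commands, not the actual plugin code) of the fix at the shell level: grouping each container's command list with { ...; } makes the trailing & background the whole group, so both containers run concurrently:

#!/bin/sh
set -x

echo step init
sleep 2
# The & now applies to the whole braced group, not just the last command.
{ sleep 6 ; echo wait done ; } &
{ sleep 4 ; echo main done ; } &
echo steps wait and main started in parallel
wait    # wait for both background groups to finish
echo end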

My proposed fix is:

# init containers
# => we run any init containers sequentially; this already works.
# Loop over all containers
singularity ... & pid="$!" && pids="${pids} ${pid}" && echo "$(date -Is --utc) Ran container wait pid ${pid}..."
singularity ... & pid="$!" && pids="${pids} ${pid}" && echo "$(date -Is --utc) Ran container main pid ${pid}..."

for pid in ${pids} ; do
  echo "$(date -Is --utc) Waiting for pid ${pid}..."
  wait "${pid}"
done

# On some HPC systems the status files do not have time to be written, because
# Slurm kills the job too soon. Hence the sleep.
sleep 30
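
As a self-contained check of this pattern (sleeps standing in for the singularity commands; a sketch, not the plugin code):

#!/bin/sh

pids=""
# Each command is backgrounded with &, then its PID is recorded from $!.
sleep 3 & pid="$!" && pids="${pids} ${pid}" && echo "$(date -Is --utc) Ran container wait pid ${pid}..."
sleep 1 & pid="$!" && pids="${pids} ${pid}" && echo "$(date -Is --utc) Ran container main pid ${pid}..."

for pid in ${pids} ; do
  echo "$(date -Is --utc) Waiting for pid ${pid}..."
  wait "${pid}"    # returns the exit status of that child
done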
@antoinetran (Contributor, Author)

After reworking it, this is a proper fix. I also replaced bash with sh; everything here is POSIX shell and portable across HPCs. To be tested...

#!/bin/sh

runInitCtn() {
  ctn="$1"
  shift
  printf "%s\n" "$(date -Is --utc) Running ${ctn}..."
  time "$@"
  exitCode="$?"
  printf "%s\n" "${exitCode}" > .../${ctn}.status
  if test "${exitCode}" != 0 ; then
    printf "%s\n" "$(date -Is --utc) InitContainer ${ctn} failed with status ${exitCode}" >&2
    # InitContainers are fail-fast.
    exit "${exitCode}"
  fi
}

runCtn() {
  ctn="$1"
  shift
  time "$@" &
  pid="$!"
  printf "%s\n" "$(date -Is --utc) Running in background ${ctn} pid ${pid}..."
  pidCtns="${pidCtns} ${pid}:${ctn}"
}

waitCtns() {
  # POSIX parameter expansions below. Container names follow the DNS label pattern
  # (alphanumerics and hyphens), so they contain no ":".
  # pidCtn=12345:container-name-rfc-dns
  # ${pidCtn%:*} => 12345
  # ${pidCtn#*:} => container-name-rfc-dns
  for pidCtn in ${pidCtns} ; do
    pid="${pidCtn%:*}"
    ctn="${pidCtn#*:}"
    printf "%s\n" "$(date -Is --utc) Waiting for container ${ctn} pid ${pid}..."
    wait "${pid}"
    exitCode="$?"
    printf "%s\n" "${exitCode}" > .../${ctn}.status
    printf "%s\n" "$(date -Is --utc) Container ${ctn} pid ${pid} ended with status ${exitCode}."
    test "${highestExitCode}" -lt ${exitCode}" && highestExitCode="${exitCode}"
  done
}

highestExitCode=0

# init containers
runInitCtn init singularity ... 

runCtn wait singularity ... 
runCtn main singularity ... 

waitCtns

printf "%s\n" "$(date -Is --utc) End of script, highest exit code ${highestExitCode}, sleeping 30s in case of..."
# For some reason, the status files does not have the time for being written in some HPC, because slurm kills the job too soon. sleep 30

exit "${highestExitCode}"

antoinetran added a commit to antoinetran/interlink-slurm-plugin that referenced this issue Jan 29, 2025
pod with multiple containers are run in parallel