Guide to GPUDrive setup on NYU HPC

🧱 Installation (first time only)

Step 1: Clone the repository

Clone the gpudrive repository into your /home/$USER directory (info on HPC directories and data management).

git clone --recursive https://github.com/Emerge-Lab/gpudrive.git

Step 2: Navigate to repository

Move into the cloned repository folder:

cd gpudrive

Step 3: Set up overlay image

Create a directory for overlay files in the scratch directory:

mkdir -p /scratch/$USER/images/gpudrive
cd /scratch/$USER/images/gpudrive

Copy and decompress the overlay image:

cp /scratch/work/public/overlay-fs-ext3/overlay-50G-10M.ext3.gz .
gunzip overlay-50G-10M.ext3.gz

This may take a couple of minutes.

Verify the decompressed overlay image exists:

ls /scratch/$USER/images/gpudrive
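
If you prefer a scripted check over eyeballing the `ls` output, the following is a minimal sketch. It assumes a `python3` interpreter is available on the node you are logged into and that you kept the overlay file name used above; `overlay_status` is a hypothetical helper, not part of GPUDrive.

```python
import os
from pathlib import Path


def overlay_status(path):
    """Report whether the overlay file exists and, if so, its size on disk."""
    p = Path(path)
    if not p.is_file():
        return f"missing: {p}"
    size_gib = p.stat().st_size / 2**30
    return f"found: {p} ({size_gib:.1f} GiB)"


if __name__ == "__main__":
    # Path follows the layout created in this step; adjust it if you
    # picked a different overlay image.
    user = os.environ.get("USER", "")
    print(overlay_status(f"/scratch/{user}/images/gpudrive/overlay-50G-10M.ext3"))
```

A "missing" result usually means `gunzip` has not finished yet or the copy went to a different directory.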

What if I want to use a different overlay image?

To explore all available overlay images:

ls -l /scratch/work/public/overlay-fs-ext3/

Step 4: Request a GPU

srun --nodes=1 --tasks-per-node=1 --cpus-per-task=1 --mem=10GB --gres=gpu:1 \
--time=1:00:00 --account=<ASK> --pty /bin/bash

Output:

>>> srun: job XXXXXXX queued and waiting for resources
>>> srun: job XXXXXXX has been allocated resources

You will see something like:

[08:33:52 Wed Dec 25 2024] <netid>@<node> ~/gpudrive

Ask Eugene for your account code if you don't have one yet.

Step 5: Launch Singularity container

Navigate back to the main repository:

cd /home/$USER/gpudrive

Run the following to start the container with GPU support and the overlay image:

singularity exec --nv --overlay /scratch/$USER/images/gpudrive/overlay-50G-10M.ext3:rw \
/scratch/work/public/singularity/cuda12.2.2-cudnn8.9.4-devel-ubuntu22.04.3.sif /bin/bash

You should see:

Singularity>

Details on Singularity and overlay images on NYU HPC here.

Step 6: Set up Python environment

Inside the Singularity container, create the conda environment with Python 3.11 (one-off step):

conda env create -f environment.yml

Why use conda? Currently, conda is the only way to use a Python version > 3.8.6 on the NYU HPC without Docker.

Activate conda environment

conda activate gpudrive

Now you should see:

(/scratch/username/.conda/gpudrive) Singularity>

Step 7: Set up GPUDrive

We use the manual install option to set up GPUDrive; see the README for details.

poetry install

If successful, you'll see

[100%] Linking CXX executable my_tests
[100%] Built target my_tests

Step 8: Verify installation

Launch Python:

python3

Then run:

import gpudrive

If there are no errors, the installation was successful!
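
The same check can be scripted, which is handy at the top of a job so a broken environment fails early. The sketch below is an illustration under assumptions: `can_import` and `list_gpus` are hypothetical helper names, and the GPU check relies on `nvidia-smi -L` being on the `PATH` inside the container (it should be when the container is started with `--nv`, but that is not verified here).

```python
import importlib
import shutil
import subprocess


def can_import(name):
    """Try to import a module; return (ok, error_message)."""
    try:
        importlib.import_module(name)
    except Exception as exc:  # a broken compiled extension can raise more than ImportError
        return False, repr(exc)
    return True, ""


def list_gpus():
    """Return one line per GPU reported by `nvidia-smi -L`, or [] if unavailable."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return []
    result = subprocess.run([exe, "-L"], capture_output=True, text=True)
    if result.returncode != 0:
        return []
    return [line for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    ok, err = can_import("gpudrive")
    print("gpudrive import:", "ok" if ok else f"failed ({err})")
    print("GPUs visible:", len(list_gpus()))
```

If the import succeeds but no GPU is listed, the likely causes are a session without `--gres=gpu:1` or a container launched without `--nv`.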

Using `wandb`, `pufferlib` and running experiments

Set up Weights and Biases

Create wandb account and authorize
Set trusted certificates

export SSL_CERT_FILE=$(python -m certifi)
export REQUESTS_CA_BUNDLE=$(python -m certifi)
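
For reference, `python -m certifi` prints the path that `certifi.where()` returns. If you would rather set the two variables from inside a script than in the shell, a minimal sketch follows; `ca_bundle_path` and `export_ca_bundle` are hypothetical helpers, and the code assumes nothing beyond `certifi` possibly being installed.

```python
import importlib.util
import os


def ca_bundle_path():
    """Return certifi's CA bundle path, or None if certifi is not installed."""
    if importlib.util.find_spec("certifi") is None:
        return None
    import certifi

    return certifi.where()


def export_ca_bundle(env=os.environ):
    """Point SSL_CERT_FILE and REQUESTS_CA_BUNDLE at certifi's bundle, if available."""
    path = ca_bundle_path()
    if path is not None:
        env["SSL_CERT_FILE"] = path
        env["REQUESTS_CA_BUNDLE"] = path
    return path


if __name__ == "__main__":
    print("CA bundle:", export_ca_bundle())
```

Variables set this way apply only to the running Python process and its children, not to the parent shell, so call it before anything opens an HTTPS connection.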

Set up Pufferlib

Install PufferLib with SSL Certificate Fixes

Update Certifi Package
Ensure the certifi package (provides root certificates) is up-to-date:

pip install --upgrade certifi

Why?
Keeps SSL certificates current to avoid issues with secure connections.

Set Trusted Certificates Manually (if needed)
Explicitly set the certificate bundle:

export SSL_CERT_FILE=$(python -m certifi)
export REQUESTS_CA_BUNDLE=$(python -m certifi)

Install pufferlib

pip install git+https://github.com/PufferAI/PufferLib.git@gpudrive

Run Self-Play PPO

Use the --help flag to see the arguments that can be configured from the CLI:

Singularity> python baselines/ippo/ippo_pufferlib.py --help
                                                                                                                                                               
 Usage: ippo_pufferlib.py [OPTIONS] [CONFIG_PATH]                                                                                                              
                                                                                                                                                               
 Run PPO training with the given configuration.                                                                                                                
                                                                                                                                                               
╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│   config_path      [CONFIG_PATH]  The path to the default configuration file [default: baselines/ippo/config/ippo_ff_puffer.yaml]                           │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --collision-weight              FLOAT    The weight for collision penalty [default: None]                                                                   │
│ --off-road-weight               FLOAT    The weight for off-road penalty [default: None]                                                                    │
│ --goal-achieved-weight          FLOAT    The weight for goal-achieved reward [default: None]                                                                │
│ --dist-to-goal-threshold        FLOAT    The distance threshold for goal-achieved [default: None]                                                           │
│ --sampling-seed                 INTEGER  The seed for sampling scenes [default: None]                                                                       │
│ --obs-radius                    FLOAT    The radius for the observation [default: None]                                                                     │
│ --learning-rate                 FLOAT    The learning rate for training [default: None]                                                                     │
│ --resample-scenes               INTEGER  Whether to resample scenes during training; 0 or 1 [default: None]                                                 │
│ --resample-interval             INTEGER  The interval for resampling scenes [default: None]                                                                 │
│ --install-completion                     Install completion for the current shell.                                                                          │
│ --show-completion                        Show completion for the current shell, to copy it or customize the installation.                                   │
│ --help                                   Show this message and exit.                                                                                        │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
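
As an illustration of how these options combine, here is a small sketch that assembles a command line from a dictionary of overrides. The flag names come from the help output above; the values are arbitrary placeholders, not recommended settings, and `build_command` is a hypothetical helper rather than part of the repository.

```python
import shlex
import sys

SCRIPT = "baselines/ippo/ippo_pufferlib.py"


def build_command(overrides, config_path=None):
    """Turn {'learning_rate': 3e-4} into [python, script, '--learning-rate', '0.0003']."""
    cmd = [sys.executable, SCRIPT]
    for name, value in overrides.items():
        flag = "--" + name.replace("_", "-")
        cmd += [flag, str(value)]
    if config_path is not None:
        # Optional positional argument; the script falls back to its default config.
        cmd.append(config_path)
    return cmd


if __name__ == "__main__":
    # Placeholder values for illustration only.
    cmd = build_command({"learning_rate": 3e-4, "resample_scenes": 1})
    print(shlex.join(cmd))
    # To launch it from the repository root: subprocess.run(cmd, check=True)
```

Options left out of the dictionary keep whatever the YAML configuration file specifies.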

⚡️ Usage | Interactive node

What is an interactive job and when should I use it?

In short, use interactive nodes for code development and testing.

Steps:

Request an interactive compute node, e.g.:

srun --nodes=1 --tasks-per-node=1 --cpus-per-task=1 --mem=10GB --gres=gpu:1 \
--time=1:00:00 --account=<account_number> --pty /bin/bash

Replace <account_number> with your account code.

Navigate to repository:

cd /home/$USER/gpudrive

Launch the Singularity image:

singularity exec --nv --overlay /scratch/$USER/images/gpudrive/overlay-50G-10M.ext3:ro \
/scratch/work/public/singularity/cuda12.2.2-cudnn8.9.4-devel-ubuntu22.04.3.sif /bin/bash

Activate the virtual environment:

conda activate gpudrive

Run experiments!

To run the PufferLib PPO implementation, install PufferLib first (it is not in requirements.yaml):

pip install git+https://github.com/PufferAI/PufferLib.git@gpudrive

python baselines/ippo/ippo_pufferlib.py

🚀 Usage | `sbatch`

What is sbatch and when should I use it?

In short, use sbatch for large runs, such as hyperparameter sweeps.

Steps:

[Optional] Define run configurations and hyperparameters to sweep over in generate_sbatch.py. Running it stores an sbatch script.

python examples/experiments/scripts/generate_sbatch.py

Submit sbatch jobs using

sbatch <your_sbatch_script>.sh

Because the script uses a job array, all the specified runs are submitted at once.
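
To make the job-array idea concrete, the sketch below expands a hyperparameter grid and renders an sbatch script with one array task per combination. It is not the repository's generate_sbatch.py: `expand_grid` and `render_sbatch` are hypothetical names, the resource settings simply mirror the `srun` example in this guide, and the Singularity launch and `conda activate` that a real job needs are deliberately left out.

```python
import itertools


def expand_grid(grid):
    """Cartesian product of a {name: [values]} grid as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, vals)) for vals in itertools.product(*(grid[k] for k in keys))]


def render_sbatch(runs, account, time="1:00:00"):
    """Render an sbatch script in which array task i launches runs[i]."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --array=0-{len(runs) - 1}",
        "#SBATCH --nodes=1",
        "#SBATCH --tasks-per-node=1",
        "#SBATCH --cpus-per-task=1",
        "#SBATCH --mem=10GB",
        "#SBATCH --gres=gpu:1",
        f"#SBATCH --time={time}",
        f"#SBATCH --account={account}",
        "",
        "# One argument string per array task, selected via SLURM_ARRAY_TASK_ID.",
        "ARGS=(",
    ]
    for run in runs:
        flags = " ".join(f"--{k.replace('_', '-')} {v}" for k, v in run.items())
        lines.append(f'  "{flags}"')
    lines += [
        ")",
        "",
        # Left unquoted on purpose so the shell splits the string into flags.
        "python baselines/ippo/ippo_pufferlib.py ${ARGS[$SLURM_ARRAY_TASK_ID]}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder grid for illustration only.
    runs = expand_grid({"learning_rate": [0.001, 0.0003], "sampling_seed": [0, 1, 2]})
    print(render_sbatch(runs, account="<account_number>"))
```

With the placeholder grid this yields six array tasks (indices 0 to 5), each picking its own argument string when SLURM starts it.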

Help

Running into issues with any of the steps above? Please reach out in the Emerge Lab #code-help channel!
