
Guide to GPUDrive setup on NYU HPC

Daphne Cornelisse edited this page Dec 28, 2024 · 56 revisions

🧱 Installation (first time only)

Step 1: Clone the repository

Clone the gpudrive repository into your /home/$USER directory (info on HPC directories and data management).

git clone --recursive https://github.com/Emerge-Lab/gpudrive.git

Step 2: Navigate to repository

Move into the cloned repository folder:

cd gpudrive

Step 3: Set up overlay image

  • Create a directory for overlay files in the scratch directory:
mkdir -p /scratch/$USER/images/gpudrive
cd /scratch/$USER/images/gpudrive
  • Copy and decompress the overlay image:
cp /scratch/work/public/overlay-fs-ext3/overlay-50G-10M.ext3.gz .
gunzip overlay-50G-10M.ext3.gz

This may take a couple of minutes.

  • Verify the decompressed overlay image exists:
ls /scratch/$USER/images/gpudrive
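The listing above can also be scripted. The following is a minimal sketch, assuming the overlay path used in this step; the function name `overlay_ready` is made up for illustration. It reports a problem if a `.gz` file is still present (gunzip removes it once decompression finishes) or if the image is missing or empty.

```shell
# Hypothetical helper: succeeds only if the decompressed overlay exists,
# is non-empty, and no leftover .gz archive sits next to it.
overlay_ready() {
    img="$1"
    if [ -e "${img}.gz" ]; then
        echo "found ${img}.gz: gunzip may not have finished" >&2
        return 1
    fi
    if [ ! -s "$img" ]; then
        echo "missing or empty overlay: $img" >&2
        return 1
    fi
    echo "overlay ready: $img"
}

# Example usage (path from the step above):
# overlay_ready "/scratch/$USER/images/gpudrive/overlay-50G-10M.ext3"
```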
What if I want to use a different overlay image?

To explore all available overlay images:

ls -l /scratch/work/public/overlay-fs-ext3/

Step 4: Request a GPU

srun --nodes=1 --tasks-per-node=1 --cpus-per-task=1 --mem=10GB --gres=gpu:1 \
--time=1:00:00 --account=<ASK> --pty /bin/bash

Output:

>>> srun: job XXXXXXX queued and waiting for resources
>>> srun: job XXXXXXX has been allocated resources

You will see something like:

[08:33:52 Wed Dec 25 2024] <netid>@<hostname> ~/gpudrive

Ask Eugene for your account code if you don't have one yet.
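Once the prompt returns, it can be worth confirming that the shell really is inside the allocation and that a GPU is visible. The sketch below is one way to do that; the function name `check_allocation` is hypothetical, while `SLURM_JOB_ID` is the variable Slurm sets inside a job and `nvidia-smi -L` lists the visible GPUs.

```shell
# Sketch: confirm we are inside a Slurm allocation and that a GPU is visible.
check_allocation() {
    if [ -z "${SLURM_JOB_ID:-}" ]; then
        echo "not inside a Slurm job (SLURM_JOB_ID is unset)" >&2
        return 1
    fi
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found on this node" >&2
        return 2
    fi
    nvidia-smi -L   # lists the GPUs visible to this job
}

# Example usage, after srun has dropped you onto the compute node:
# check_allocation
```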

Step 5: Launch Singularity container

Navigate back to main repo:

cd /home/$USER/gpudrive

Run the following to start the container with GPU support and the overlay image:

singularity exec --nv --overlay /scratch/$USER/images/gpudrive/overlay-50G-10M.ext3:rw \
/scratch/work/public/singularity/cuda12.2.2-cudnn8.9.4-devel-ubuntu22.04.3.sif /bin/bash

You should see:

Singularity> 

Details on Singularity and overlay images on NYU HPC here.
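This guide mounts the overlay with `:rw` during installation and with `:ro` in the interactive-usage steps further down. To avoid retyping the long paths, a small helper can print the command for either mode. This is only a sketch: the function name `singularity_cmd` is invented here, and the paths are the ones used in this guide.

```shell
# Hypothetical helper: print the singularity command for a given overlay mode.
# "rw" = writable overlay (used during installation in this guide),
# "ro" = read-only overlay (used in the interactive-usage steps).
singularity_cmd() {
    mode="${1:-ro}"
    case "$mode" in
        rw|ro) ;;
        *) echo "mode must be rw or ro" >&2; return 1 ;;
    esac
    user="${USER:-$(id -un)}"
    overlay="/scratch/${user}/images/gpudrive/overlay-50G-10M.ext3"
    sif="/scratch/work/public/singularity/cuda12.2.2-cudnn8.9.4-devel-ubuntu22.04.3.sif"
    echo "singularity exec --nv --overlay ${overlay}:${mode} ${sif} /bin/bash"
}

# Example usage: print the installation-time command.
# singularity_cmd rw
```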


Step 6: Set up Python environment

  • Inside the Singularity container, create a virtual environment:

  • One-off step: Create conda environment with Python 3.11

conda env create -f environment.yml

Why use conda? Currently, conda is the only way to use a Python version > 3.8.6 on the NYU HPC without Docker.

  • Activate conda environment
conda activate gpudrive

Now you should see:

(/scratch/username/.conda/gpudrive) Singularity>
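The prompt is a visual cue; a script can make the same check through the `CONDA_DEFAULT_ENV` variable that conda sets on activation. The sketch below assumes the variable holds either the environment name or, as the prompt above suggests for this setup, its full path. The function name `in_gpudrive_env` is illustrative.

```shell
# Sketch: succeed if the active conda environment is called "gpudrive".
# basename handles both a bare name and a full path such as
# /scratch/username/.conda/gpudrive.
in_gpudrive_env() {
    [ "$(basename "${CONDA_DEFAULT_ENV:-none}")" = "gpudrive" ]
}

# Example usage:
# in_gpudrive_env || echo "run 'conda activate gpudrive' first"
```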

Step 7: Set up GPUDrive

We use the manual install option to set up GPUDrive; see the README for details.

poetry install

If successful, you'll see:

[100%] Linking CXX executable my_tests
[100%] Built target my_tests

Step 8: Verify installation

Launch Python:

python3

Then run:

import gpudrive

If there are no errors, the installation was successful!
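The same check can be run without opening the interpreter, which is handy at the top of a job script. This is a minimal sketch; the helper name `can_import` is made up for illustration and simply wraps `python3 -c "import ..."`.

```shell
# Sketch: report whether a Python module can be imported, without a REPL.
can_import() {
    if python3 -c "import $1" >/dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "FAILED: $1" >&2
        return 1
    fi
}

# Example usage inside the container with the environment active:
# can_import gpudrive
```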

Using wandb, pufferlib and running experiments

Set up Weights and Biases
  1. Create wandb account and authorize

  2. Set trusted certificates

export SSL_CERT_FILE=$(python -m certifi)
export REQUESTS_CA_BUNDLE=$(python -m certifi)
Set up Pufferlib

Install PufferLib with SSL Certificate Fixes

  1. Update Certifi Package
    Ensure the certifi package (provides root certificates) is up-to-date:
pip install --upgrade certifi

Why?
Keeps SSL certificates current to avoid issues with secure connections.

  2. Set Trusted Certificates Manually (if needed)
    Explicitly set the certificate bundle:
export SSL_CERT_FILE=$(python -m certifi)
export REQUESTS_CA_BUNDLE=$(python -m certifi)
  3. Install pufferlib
pip install git+https://github.com/PufferAI/PufferLib.git@gpudrive
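One caveat with the `export VAR=$(python -m certifi)` form: if the command fails (for example, because certifi is not installed in the active environment), both variables end up set to an empty string without any error. A slightly more defensive sketch follows; the function name `set_cert_env` is hypothetical.

```shell
# Sketch: export the certificate variables only if the bundle file exists.
set_cert_env() {
    bundle="$1"
    if [ ! -f "$bundle" ]; then
        echo "certificate bundle not found: $bundle" >&2
        return 1
    fi
    export SSL_CERT_FILE="$bundle"
    export REQUESTS_CA_BUNDLE="$bundle"
}

# Example usage:
# set_cert_env "$(python -m certifi)"
```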
Run Self-Play PPO

Use the --help flag to see the arguments that can be configured from the CLI:

Singularity> python baselines/ippo/ippo_pufferlib.py --help
                                                                                                                                                               
 Usage: ippo_pufferlib.py [OPTIONS] [CONFIG_PATH]

 Run PPO training with the given configuration.

╭─ Arguments ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│   config_path      [CONFIG_PATH]  The path to the default configuration file [default: baselines/ippo/config/ippo_ff_puffer.yaml]  │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --collision-weight              FLOAT    The weight for collision penalty [default: None]                                          │
│ --off-road-weight               FLOAT    The weight for off-road penalty [default: None]                                           │
│ --goal-achieved-weight          FLOAT    The weight for goal-achieved reward [default: None]                                       │
│ --dist-to-goal-threshold        FLOAT    The distance threshold for goal-achieved [default: None]                                  │
│ --sampling-seed                 INTEGER  The seed for sampling scenes [default: None]                                              │
│ --obs-radius                    FLOAT    The radius for the observation [default: None]                                            │
│ --learning-rate                 FLOAT    The learning rate for training [default: None]                                            │
│ --resample-scenes               INTEGER  Whether to resample scenes during training; 0 or 1 [default: None]                        │
│ --resample-interval             INTEGER  The interval for resampling scenes [default: None]                                        │
│ --install-completion                     Install completion for the current shell.                                                 │
│ --show-completion                        Show completion for the current shell, to copy it or customize the installation.          │
│ --help                                   Show this message and exit.                                                               │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
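As an illustration of how these options combine, the sketch below assembles a training command that overrides a few of them. The flag names come from the help output above; the values, the variable names, and the `ippo_cmd` function are invented for the example. It is an assumption here that options left at `default: None` fall back to the values in the configuration file.

```shell
# Illustrative values only; the flags themselves come from the --help output.
LEARNING_RATE="3e-4"
RESAMPLE_INTERVAL="100000"

# Hypothetical helper: print the training command instead of running it.
ippo_cmd() {
    echo "python baselines/ippo/ippo_pufferlib.py" \
        "--learning-rate ${LEARNING_RATE}" \
        "--resample-scenes 1" \
        "--resample-interval ${RESAMPLE_INTERVAL}"
}

ippo_cmd   # print the command for inspection
# eval "$(ippo_cmd)"   # uncomment inside the container to launch training
```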

โšก๏ธ Usage | Interactive node


What is an interactive job and when should I use it?


In short, use interactive nodes for code development and testing.

Steps:

  1. Request an interactive compute node, e.g.:
srun --nodes=1 --tasks-per-node=1 --cpus-per-task=1 --mem=10GB --gres=gpu:1 \
--time=1:00:00 --account=<account_number> --pty /bin/bash

Replace <account_number> with your project number.

  2. Navigate to repository:
cd /home/$USER/gpudrive
  3. Launch the Singularity image:
singularity exec --nv --overlay /scratch/$USER/images/gpudrive/overlay-50G-10M.ext3:ro \
/scratch/work/public/singularity/cuda12.2.2-cudnn8.9.4-devel-ubuntu22.04.3.sif /bin/bash
  4. Activate the virtual environment:
conda activate gpudrive 
  5. Run experiments!

To run the PufferLib PPO implementation, install PufferLib first (it is not listed in requirements.yaml):

pip install git+https://github.com/PufferAI/PufferLib.git@gpudrive
python baselines/ippo/ippo_pufferlib.py
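Since step 1 is typed at the start of every session, it may help to wrap it so that the account cannot be forgotten and the time or memory can be adjusted. The following is a sketch under that assumption; the function name `gpu_srun_cmd` and its argument order are made up, and the resource flags mirror the srun request in step 1.

```shell
# Hypothetical helper: print the srun request, refusing to build it
# when no account is given. Time and memory default to the values in step 1.
gpu_srun_cmd() {
    account="${1:-}"
    walltime="${2:-1:00:00}"
    mem="${3:-10GB}"
    if [ -z "$account" ]; then
        echo "usage: gpu_srun_cmd <account> [time] [mem]" >&2
        return 1
    fi
    echo "srun --nodes=1 --tasks-per-node=1 --cpus-per-task=1 --mem=${mem}" \
        "--gres=gpu:1 --time=${walltime} --account=${account} --pty /bin/bash"
}

# Example usage (the account value is a placeholder):
# gpu_srun_cmd <account_number> 2:00:00
```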

🚀 Usage | sbatch


What is sbatch and when should I use it?


In short, use sbatch for large runs, such as hyperparameter sweeps.

Steps:

  1. [Optional] Define run configurations and hyperparameters to sweep over in generate_sbatch.py. Running it stores an sbatch script.
python examples/experiments/scripts/generate_sbatch.py
  2. Submit sbatch jobs using:
sbatch <your_sbatch_script>.sh

Through the use of job arrays, all the specified runs are launched at once.
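To give a feel for how a job array maps onto a sweep, here is a minimal hand-written sketch. It is not the script that generate_sbatch.py produces: the job name, array range, learning-rate values, and the `lr_for_task` function are all hypothetical. `#SBATCH --array` and `SLURM_ARRAY_TASK_ID` are standard Slurm features. The sketch only prints the command each task would run; a real script would also need the account, the Singularity launch, and the environment activation from the sections above.

```shell
#!/bin/bash
#SBATCH --job-name=gpudrive_sweep
#SBATCH --array=0-2
#SBATCH --gres=gpu:1
#SBATCH --time=1:00:00

# Map an array index to a learning rate (values are illustrative).
lr_for_task() {
    case "$1" in
        0) echo "1e-4" ;;
        1) echo "3e-4" ;;
        2) echo "1e-3" ;;
        *) echo "unknown task index: $1" >&2; return 1 ;;
    esac
}

# Slurm sets SLURM_ARRAY_TASK_ID per array task; default to 0 outside Slurm.
task_id="${SLURM_ARRAY_TASK_ID:-0}"
lr="$(lr_for_task "$task_id")"
echo "task ${task_id}: python baselines/ippo/ippo_pufferlib.py --learning-rate ${lr}"
```

With `--array=0-2`, Slurm starts three tasks from one submission, each seeing a different `SLURM_ARRAY_TASK_ID`.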

Help


Running into issues with any of the steps above? Please reach out in the Emerge Lab #code-help channel!