Guide to GPUDrive setup on NYU HPC
- Clone the gpudrive repository into your /home/$USER directory (info on HPC directories and data management):
git clone --recursive https://github.com/Emerge-Lab/gpudrive.git
Move into the cloned repository folder:
cd gpudrive
- Create a directory for overlay files in the scratch directory:
mkdir -p /scratch/$USER/images/gpudrive
cd /scratch/$USER/images/gpudrive
- Copy and decompress the overlay image:
cp /scratch/work/public/overlay-fs-ext3/overlay-50G-10M.ext3.gz .
gunzip overlay-50G-10M.ext3.gz
This may take a couple of minutes.
- Verify the decompressed overlay image exists:
ls /scratch/$USER/images/gpudrive
What if I want to use a different overlay image?
To explore all available overlay images:
ls -l /scratch/work/public/overlay-fs-ext3/
- Request an interactive GPU node, e.g.:
srun --nodes=1 --tasks-per-node=1 --cpus-per-task=1 --mem=10GB --gres=gpu:1 \
--time=1:00:00 --account=<ASK> --pty /bin/bash
Output:
>>> srun: job XXXXXXX queued and waiting for resources
>>> srun: job XXXXXXX has been allocated resources
You will see something like:
[08:33:52 Wed Dec 25 2024] <netid>@<gpu-node> ~/gpudrive
Ask Eugene for your account code if you don't have one yet.
Navigate back to main repo:
cd /home/$USER/gpudrive
Run the following to start the container with GPU support and the overlay image:
singularity exec --nv --overlay /scratch/$USER/images/gpudrive/overlay-50G-10M.ext3:rw \
/scratch/work/public/singularity/cuda12.2.2-cudnn8.9.4-devel-ubuntu22.04.3.sif /bin/bash
You should see:
Singularity>
Details on Singularity and overlay images on NYU HPC here.
Inside the Singularity container, create a virtual environment:
- One-off step: create the conda environment with Python 3.11:
conda env create -f environment.yml
Why use conda? Currently, conda is the only way to use a Python version > 3.8.6 on the NYU HPC without Docker.
- Activate the conda environment:
conda activate gpudrive
Now you should see:
(/scratch/username/.conda/gpudrive) Singularity>
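To confirm the activated environment provides the required interpreter, here is a quick diagnostic sketch (the exact version string printed depends on your environment):

```python
import sys

# The environment created above pins Python 3.11; verify the active interpreter.
ok = sys.version_info >= (3, 11)
print(sys.version.split()[0], "- OK" if ok else "- too old, re-check the conda env")
```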
We use the manual install option to set up GPUDrive; see the README for details.
poetry install
If successful, you'll see:
[100%] Linking CXX executable my_tests
[100%] Built target my_tests
Launch Python:
python3
Then run:
import gpudrive
If there are no errors, the installation was successful!
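If you prefer to script this smoke test rather than run it interactively, a minimal sketch using only the standard library (checks importability without actually importing the package):

```python
import importlib.util

def is_installed(name: str) -> bool:
    # find_spec returns None when the package is not on the Python path.
    return importlib.util.find_spec(name) is not None

print("gpudrive installed:", is_installed("gpudrive"))
```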
Set up Weights and Biases
- Set trusted certificates:
export SSL_CERT_FILE=$(python -m certifi)
export REQUESTS_CA_BUNDLE=$(python -m certifi)
Set up Pufferlib
Install PufferLib with SSL Certificate Fixes
- Update the certifi package
Ensure the certifi package (which provides root certificates) is up to date:
pip install --upgrade certifi
Why?
Keeps SSL certificates current to avoid issues with secure connections.
- Set trusted certificates manually (if needed)
Explicitly set the certificate bundle:
export SSL_CERT_FILE=$(python -m certifi)
export REQUESTS_CA_BUNDLE=$(python -m certifi)
- Install PufferLib:
pip install git+https://github.com/PufferAI/PufferLib.git@gpudrive
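To sanity-check that OpenSSL will actually consult the SSL_CERT_FILE override set above, the standard library exposes the default verify paths. A diagnostic sketch (not part of the official setup):

```python
import os
import ssl

paths = ssl.get_default_verify_paths()
# openssl_cafile_env names the environment variable OpenSSL reads for the CA bundle.
print("env var consulted:", paths.openssl_cafile_env)
print("override in effect:", os.environ.get(paths.openssl_cafile_env, "<none, using default>"))
print("compiled-in default:", paths.openssl_cafile)
```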
Run Self-Play PPO
Use the --help
command to see the CLI configurable arguments:
Singularity> python baselines/ippo/ippo_pufferlib.py --help
Usage: ippo_pufferlib.py [OPTIONS] [CONFIG_PATH]
Run PPO training with the given configuration.
Arguments:
  config_path [CONFIG_PATH]  The path to the default configuration file [default: baselines/ippo/config/ippo_ff_puffer.yaml]

Options:
  --collision-weight FLOAT        The weight for collision penalty [default: None]
  --off-road-weight FLOAT         The weight for off-road penalty [default: None]
  --goal-achieved-weight FLOAT    The weight for goal-achieved reward [default: None]
  --dist-to-goal-threshold FLOAT  The distance threshold for goal-achieved [default: None]
  --sampling-seed INTEGER         The seed for sampling scenes [default: None]
  --obs-radius FLOAT              The radius for the observation [default: None]
  --learning-rate FLOAT           The learning rate for training [default: None]
  --resample-scenes INTEGER       Whether to resample scenes during training; 0 or 1 [default: None]
  --resample-interval INTEGER     The interval for resampling scenes [default: None]
  --install-completion            Install completion for the current shell.
  --show-completion               Show completion for the current shell, to copy it or customize the installation.
  --help                          Show this message and exit.
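All option defaults are None, which typically means "not passed on the CLI, so keep the value from the YAML config". A hypothetical sketch of that merge logic (the names and values below are illustrative, not GPUDrive's actual configuration):

```python
# Illustrative config-file defaults (stand-in for the YAML file's contents).
defaults = {"collision_weight": -0.5, "learning_rate": 3e-4, "sampling_seed": 42}

# CLI flags that were not passed arrive as None.
cli_args = {"collision_weight": None, "learning_rate": 1e-4, "sampling_seed": None}

# Only non-None CLI values override the config file.
config = {k: cli_args[k] if cli_args.get(k) is not None else v
          for k, v in defaults.items()}
print(config)
```

Here only learning_rate comes from the CLI; the rest fall back to the config file.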
In short, use interactive nodes for code development and testing.
Steps:
- Request an interactive compute node, e.g:
srun --nodes=1 --tasks-per-node=1 --cpus-per-task=1 --mem=10GB --gres=gpu:1 \
--time=1:00:00 --account=<account_number> --pty /bin/bash
Replace <account_number> with your project number.
- Navigate to repository:
cd /home/$USER/gpudrive
- Launch the Singularity image (here the overlay is mounted read-only with :ro, since the environment is already installed):
singularity exec --nv --overlay /scratch/$USER/images/gpudrive/overlay-50G-10M.ext3:ro \
/scratch/work/public/singularity/cuda12.2.2-cudnn8.9.4-devel-ubuntu22.04.3.sif /bin/bash
- Activate the virtual environment:
conda activate gpudrive
- Run experiments!
To run the PufferLib PPO implementation, first install PufferLib (it is not included in environment.yml):
pip install git+https://github.com/PufferAI/PufferLib.git@gpudrive
python baselines/ippo/ippo_pufferlib.py
In short, use sbatch for large runs, such as hyperparameter sweeps.
Steps:
- [Optional] Define run configurations and hyperparameters to sweep over in generate_sbatch.py. Running it stores an sbatch script:
python examples/experiments/scripts/generate_sbatch.py
- Submit sbatch jobs with:
sbatch <your_sbatch_script>.sh
Through the use of job arrays, all the specified runs are launched at once.
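As a hypothetical illustration of what a generator like generate_sbatch.py might emit, this sketch builds a job-array script over a small sweep grid (hyperparameters, flags, and SBATCH directives here are illustrative, not the script's actual output):

```python
from itertools import product

# Illustrative sweep grid.
learning_rates = [1e-4, 3e-4]
seeds = [0, 1, 2]
combos = list(product(learning_rates, seeds))

lines = [
    "#!/bin/bash",
    f"#SBATCH --array=0-{len(combos) - 1}",  # one array task per combination
    "#SBATCH --gres=gpu:1",
]
for i, (lr, seed) in enumerate(combos):
    # Each array task picks out its own hyperparameter combination.
    lines.append(
        f'[ "$SLURM_ARRAY_TASK_ID" = "{i}" ] && '
        f"python baselines/ippo/ippo_pufferlib.py --learning-rate {lr} --sampling-seed {seed}"
    )
script = "\n".join(lines)
print(script)
```

Submitting the resulting file once with sbatch launches all six runs as one job array.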
If you encounter issues with any of the steps outlined above, please reach out in the Emerge lab #code-help channel!