🤖 Code agents represent a powerful leap forward in software development: they can understand complex requirements and execute or generate functional code across multiple programming languages, sometimes even in natural language.
In this work, we propose RedCode, a high-quality, large-scale dataset (over 4,000 test cases) that aims to rigorously and comprehensively evaluate the safety of code agents. It features diverse languages and formats (Python, Bash, natural language), provides real interaction with systems, and supports fine-grained evaluation of both code execution and generation.
RedCode consists of RedCode-Exec and RedCode-Gen.
- RedCode-Exec provides prompts to evaluate code agents' ability to recognize and handle unsafe code, with a total of 4,050 test instances.
- RedCode-Gen provides 160 prompts with function signatures as input to assess whether code agents will follow instructions to generate harmful code or software.
For the safety leaderboard and more visualized results, please consider visiting our RedCode webpage.
🚧 Note: We are working hard to package all the code to provide an off-the-shelf deployment experience.
To stay updated, consider starring⭐️ and watching😎 this repository. Your support means a lot to us!
This directory contains the datasets RedCode-Exec and RedCode-Gen, which are used as inputs for the agents.
The environment directory includes the Docker environment needed for the agents to run. This ensures a consistent and controlled execution environment for all tests and evaluations.
The evaluation directory contains subdirectories for the evaluation of three types of agents:
- CA-evaluation: Evaluation scripts and resources for CodeAct agents.
- OCI-evaluation: Evaluation scripts and resources for OpenCodeInterpreter agents.
- RA-evaluation: Evaluation scripts and resources for ReAct agents.
Additionally, evaluation.py serves as the evaluation script for each risky scenario.
The result directory stores the results of the evaluations.
The scripts directory contains the bash scripts to run the evaluations for OCI, RA, and CA agents.
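The layout described above can be sanity-checked before running anything. The following is a minimal sketch, not part of the RedCode repository: the helper name `missing_paths` is hypothetical, and the paths come only from the directory names mentioned in this README (the dataset directory is left out because its name is not stated here).

```python
from pathlib import Path

# Directories named in this README, relative to the repository root.
# This is an assumed layout; adjust it if the repository differs.
EXPECTED_PATHS = [
    "environment",
    "evaluation/CA-evaluation",
    "evaluation/OCI-evaluation",
    "evaluation/RA-evaluation",
    "result",
    "scripts",
]


def missing_paths(repo_root: str) -> list[str]:
    """Return the expected directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in EXPECTED_PATHS if not (root / p).is_dir()]


if __name__ == "__main__":
    missing = missing_paths(".")
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("Repository layout matches the README.")
```

Run it from the repository root; an empty result means every directory listed above was found.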
Follow these steps to set up the project locally.
Clone this GitHub repo:
git clone https://github.com/AI-secure/RedCode.git
The environment.yml file lists all dependencies required for the project. You can use the following commands to set up and activate the redcode conda environment.
conda env create -f environment.yml
conda activate redcode
./scripts/OCI_eval.sh
./scripts/RA_eval.sh
./scripts/CA_eval.sh
Currently, the scripts are run separately. We are working on merging them into a unified script to provide a better user experience.
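Until a unified script is available, the three scripts can be chained with a small wrapper. This is a sketch under stated assumptions, not the authors' planned script: `run_all` and its `runner` parameter are hypothetical names, and the wrapper assumes it is launched from the repository root with the redcode environment active.

```python
import subprocess
from typing import Callable, Sequence

# Script paths as listed in this README.
EVAL_SCRIPTS = (
    "./scripts/OCI_eval.sh",
    "./scripts/RA_eval.sh",
    "./scripts/CA_eval.sh",
)


def run_all(
    scripts: Sequence[str] = EVAL_SCRIPTS,
    runner: Callable[[list[str]], int] = lambda cmd: subprocess.run(cmd).returncode,
) -> dict[str, int]:
    """Run each script in order and return its exit code, keyed by script path.

    Later scripts still run if an earlier one fails, so a single call
    reports the status of all three evaluations.
    """
    return {script: runner(["bash", script]) for script in scripts}
```

Calling `run_all()` returns a dictionary that maps each script path to its exit code, where 0 indicates success. The `runner` argument exists only so the wrapper can be exercised without launching the real evaluations.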
If you find our work helpful, please consider citing it as follows:
@inproceedings{guo2024redcode,
title={RedCode: Risky Code Execution and Generation Benchmark for Code Agents},
author={Guo, Chengquan and Liu, Xun and Xie, Chulin and Zhou, Andy and Zeng, Yi and Lin, Zinan and Song, Dawn and Li, Bo},
booktitle={Thirty-Eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2024}
}
Please reach out to us if you have any suggestions or need any help in reproducing the results. You can submit an issue or pull request, or send an email to either [email protected], [email protected] or [email protected]. Thanks for your attention!