feat: add benchmark tool #95

Merged 15 commits into main from feature/benchmark on Aug 16, 2024
Conversation

@gmickel (Owner) commented on Aug 15, 2024

This PR introduces a comprehensive benchmark tool for CodeWhisper, designed to evaluate its performance on Exercism Python exercises. The benchmark tool provides detailed insights into CodeWhisper's capabilities across various configurations.

Key Features:

  1. Docker-based sandboxed execution environment
  2. Support for multiple LLM providers (Anthropic, OpenAI, Groq, DeepSeek)
  3. Concurrent execution of tests with configurable worker count
  4. Detailed Markdown reports with timestamps for each benchmark run
  5. Flexible test selection (all tests or a specified number)
  6. Support for different CodeWhisper modes (plan/no-plan, diff/whole file editing)
  7. Benchmark results for our top 3 models, included in this PR and further demonstrating CodeWhisper's one-shot code modification ability

New CodeWhisper Flags:

  • --skip-files: Allows CodeWhisper to run without any user input. When this flag is set, the interactive file selection prompt is skipped and the files to include are supplied via the --filter flag instead (see the sketch below).
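
A minimal sketch of the non-interactive flow, assuming a task-style invocation (the subcommand name and the glob pattern are illustrative, not part of this PR):

```sh
# Illustrative sketch only: the `task` subcommand and the glob are assumptions.
# --skip-files bypasses the interactive file-selection prompt;
# --filter supplies the files that would otherwise be picked interactively.
codewhisper task --skip-files --filter "exercises/practice/two-fer/**"
```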

Benchmark Tool Options:

  • --model: Specify the AI model to use
  • --workers: Set the number of concurrent workers
  • --tests: Choose the number of tests to run (default: all)
  • --no-plan: Disable the planning mode
  • --diff / --no-diff: Override the default diff/whole file edit mode
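
For example, a run that combines these options might look like the following (the model identifier is a placeholder; use whichever provider and model you have an API key for):

```sh
# Hypothetical invocation: the model id below is only a placeholder.
./benchmark/run_benchmark.sh \
  --model claude-3-5-sonnet-20240620 \
  --workers 4 \
  --tests 10 \
  --no-plan \
  --diff
```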

The benchmark tool generates comprehensive reports including:

  • Total time and cost
  • Pass rate for exercises
  • Detailed per-exercise metrics (time, cost, mode used, test results)
  • Failed test cases and any errors encountered

This tool will greatly assist in evaluating CodeWhisper's performance across different configurations and identifying areas for improvement.

To use the benchmark tool:

  1. Build the Docker image: ./benchmark/docker_build.sh
  2. Set the appropriate API key as an environment variable
  3. Run the benchmark: ./benchmark/run_benchmark.sh [options]
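
End to end, a run against Anthropic might look like this (the ANTHROPIC_API_KEY variable name and the model id are assumptions based on common provider conventions, not taken from this PR):

```sh
# 1. Build the sandboxed Docker image used to run the exercises.
./benchmark/docker_build.sh

# 2. Export the API key for the chosen provider
#    (ANTHROPIC_API_KEY is an assumption; use the variable your provider expects).
export ANTHROPIC_API_KEY="sk-..."

# 3. Run the benchmark; the report is written to benchmark/reports/
#    with a timestamped filename.
./benchmark/run_benchmark.sh --model claude-3-5-sonnet-20240620 --tests 5
```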

Reports are saved in the benchmark/reports/ directory with timestamped filenames.

@gmickel self-assigned this on Aug 15, 2024
@gmickel added the enhancement (New feature or request) label on Aug 15, 2024
@gmickel merged commit a82619e into main on Aug 16, 2024 (6 checks passed)
@gmickel deleted the feature/benchmark branch on Aug 16, 2024 at 08:57
github-actions bot pushed a commit that referenced this pull request Aug 16, 2024
# [1.16.0](v1.15.0...v1.16.0) (2024-08-16)

### Features

* add benchmark tool ([#95](#95)) ([a82619e](a82619e))

🎉 This issue has been resolved in version 1.16.0 🎉

The release is available on:

Your semantic-release bot 📦🚀
