feat: add benchmark tool #95

Merged 15 commits into main from feature/benchmark on Aug 16, 2024
Conversation

@gmickel (Owner) commented on Aug 15, 2024

This PR introduces a comprehensive benchmark tool for CodeWhisper, designed to evaluate its performance on Exercism Python exercises. The benchmark tool provides detailed insights into CodeWhisper's capabilities across various configurations.

Key Features:

  1. Docker-based sandboxed execution environment
  2. Support for multiple LLM providers (Anthropic, OpenAI, Groq, DeepSeek)
  3. Concurrent execution of tests with configurable worker count
  4. Detailed Markdown reports with timestamps for each benchmark run
  5. Flexible test selection (all tests or a specified number)
  6. Support for different CodeWhisper modes (plan/no-plan, diff/whole file editing)
  7. Benchmark results for our top 3 models, included in this PR and further demonstrating CodeWhisper's one-shot code modification ability

New CodeWhisper Flags:

  • --skip-files: Allows CodeWhisper to run without any user input. When this flag is set, the interactive file selection prompt is skipped and the files to include are supplied via the --filter flag instead (see the sketch below).
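
A minimal sketch of the non-interactive flow, assuming a task-style invocation (the subcommand name and the glob pattern are illustrative, not part of this PR):

```sh
# Illustrative sketch only: the `task` subcommand and the glob are assumptions.
# --skip-files bypasses the interactive file-selection prompt;
# --filter supplies the files that would otherwise be picked interactively.
codewhisper task --skip-files --filter "exercises/practice/two-fer/**"
```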

Benchmark Tool Options:

  • --model: Specify the AI model to use
  • --workers: Set the number of concurrent workers
  • --tests: Choose the number of tests to run (default: all)
  • --no-plan: Disable the planning mode
  • --diff / --no-diff: Override the default diff/whole file edit mode
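
For example, a run that combines these options might look like the following (the model identifier is a placeholder; use whichever provider and model you have an API key for):

```sh
# Hypothetical invocation: the model id below is only a placeholder.
./benchmark/run_benchmark.sh \
  --model claude-3-5-sonnet-20240620 \
  --workers 4 \
  --tests 10 \
  --no-plan \
  --diff
```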

The benchmark tool generates comprehensive reports including:

  • Total time and cost
  • Pass rate for exercises
  • Detailed per-exercise metrics (time, cost, mode used, test results)
  • Failed test cases and any errors encountered

This tool will greatly assist in evaluating CodeWhisper's performance across different configurations and identifying areas for improvement.

To use the benchmark tool:

  1. Build the Docker image: ./benchmark/docker_build.sh
  2. Set the appropriate API key as an environment variable
  3. Run the benchmark: ./benchmark/run_benchmark.sh [options]
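
End to end, a run against Anthropic might look like this (the ANTHROPIC_API_KEY variable name and the model id are assumptions based on common provider conventions, not taken from this PR):

```sh
# 1. Build the sandboxed Docker image used to run the exercises.
./benchmark/docker_build.sh

# 2. Export the API key for the chosen provider
#    (ANTHROPIC_API_KEY is an assumption; use the variable your provider expects).
export ANTHROPIC_API_KEY="sk-..."

# 3. Run the benchmark; the report is written to benchmark/reports/
#    with a timestamped filename.
./benchmark/run_benchmark.sh --model claude-3-5-sonnet-20240620 --tests 5
```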

Reports are saved in the benchmark/reports/ directory with timestamped filenames.

@gmickel self-assigned this on Aug 15, 2024
@gmickel added the enhancement (New feature or request) label on Aug 15, 2024
@gmickel merged commit a82619e into main on Aug 16, 2024 (6 checks passed)
@gmickel deleted the feature/benchmark branch on Aug 16, 2024 at 08:57
github-actions bot pushed a commit that referenced this pull request Aug 16, 2024
# [1.16.0](v1.15.0...v1.16.0) (2024-08-16)

### Features

* add benchmark tool ([#95](#95)) ([a82619e](a82619e))

🎉 This issue has been resolved in version 1.16.0 🎉

The release is available on:

Your semantic-release bot 📦🚀
