feat: add benchmark tool (#95)
* wip

* chore: make files executable

* feat: finalize benchmark implementation

* add debug mode

* docs: update documentation

* fix benchmarking, add results

* add reports for transparency

* docs: update readme

* docs: stress how good our results are

* docs: stress how good our results are

* docs: add deepseek-coder benchmark

* docs: update README

* docs: fix link

* remove wrong report files

* rerun benchmarks after fix and update results
gmickel authored Aug 16, 2024
1 parent c042d51 commit a82619e
Showing 23 changed files with 4,834 additions and 26 deletions.
169 changes: 169 additions & 0 deletions .dockerignore
@@ -0,0 +1,169 @@
# Test-related files
tests/fixtures/**/.gitignore
tests/**/*.log

# Temporary files
*.tmp
*.temp

### Node ###
# Logs
logs
*.log
npm-debug.log*
yarn-debug.log*
yarn-error.log*
lerna-debug.log*
.pnpm-debug.log*


# Diagnostic reports (https://nodejs.org/api/report.html)
report.[0-9]*.[0-9]*.[0-9]*.[0-9]*.json

# Runtime data
pids
*.pid
*.seed
*.pid.lock

# Directory for instrumented libs generated by jscoverage/JSCover
lib-cov

# Coverage directory used by tools like istanbul
coverage
*.lcov

# nyc test coverage
.nyc_output

# Grunt intermediate storage (https://gruntjs.com/creating-plugins#storing-task-files)
.grunt

# Bower dependency directory (https://bower.io/)
bower_components

# node-waf configuration
.lock-wscript

# Compiled binary addons (https://nodejs.org/api/addons.html)
build/Release

# Dependency directories
node_modules/
jspm_packages/

# Snowpack dependency directory (https://snowpack.dev/)
web_modules/

# TypeScript cache
*.tsbuildinfo

# Optional npm cache directory
.npm

# Optional eslint cache
.eslintcache

# Optional stylelint cache
.stylelintcache

# Microbundle cache
.rpt2_cache/
.rts2_cache_cjs/
.rts2_cache_es/
.rts2_cache_umd/

# Optional REPL history
.node_repl_history

# Output of 'npm pack'
*.tgz

# Yarn Integrity file
.yarn-integrity

# dotenv environment variable files
.env
.env.development.local
.env.test.local
.env.production.local
.env.local

# parcel-bundler cache (https://parceljs.org/)
.cache
.parcel-cache

# Next.js build output
.next
out

# Nuxt.js build / generate output
.nuxt
dist

# Gatsby files
.cache/
# Uncomment the public line below if your project uses Gatsby and not Next.js
# https://nextjs.org/blog/next-9-1#public-directory-support
# public

# vuepress build output
.vuepress/dist

# vuepress v2.x temp and cache directory
.temp

# Docusaurus cache and generated files
.docusaurus

# Serverless directories
.serverless/

# FuseBox cache
.fusebox/

# DynamoDB Local files
.dynamodb/

# TernJS port file
.tern-port

# Stores VSCode versions used for testing VSCode extensions
.vscode-test

# yarn v2
.yarn/cache
.yarn/unplugged
.yarn/build-state.yml
.yarn/install-state.gz
.pnp.*

### Node Patch ###
# Serverless Webpack directories
.webpack/

# Optional stylelint cache

# SvelteKit build / generate output
.svelte-kit

# OS generated files
.DS_Store
.DS_Store?
._*
.Spotlight-V100
.Trashes
ehthumbs.db
Thumbs.db

# Vim configurations
.vim

todos.md
codewhisper.md
testing
ElPlan.md
ElPlanFilter.md
codewhisper-task-output.json
demotask.md
.codewhisper-task-cache.json
4 changes: 4 additions & 0 deletions .gitignore
@@ -167,3 +167,7 @@ ElPlanFilter.md
codewhisper-task-output.json
demotask.md
.codewhisper-task-cache.json

# benchmark reports
benchmark/reports/
!benchmark/reports/*_reference.md
103 changes: 82 additions & 21 deletions README.md
@@ -19,6 +19,7 @@ AI-Powered End-to-End Task Implementation & blazingly fast Codebase-to-LLM Conte
[Templates](#-templates)
[Configuration](#-configuration)
[API](#-api)
[Benchmarking](#benchmarking)
[Contributing](#-contributing)
[Roadmap](#-roadmap)
[FAQ](#-faq)
@@ -27,7 +28,7 @@ AI-Powered End-to-End Task Implementation & blazingly fast Codebase-to-LLM Conte

CodeWhisper is a powerful tool that bridges the gap between your codebase and Large Language Models (LLMs). It serves two primary functions:

1. **AI-Powered End-to-End Task Implementation**: Tackle complex, codebase-spanning tasks with ease. CodeWhisper doesn't just suggest snippets; it plans, generates, and applies comprehensive code changes across your entire project, from backend logic to frontend integration.
1. **AI-Powered End-to-End Task Implementation**: Tackle complex, codebase-spanning tasks with ease. CodeWhisper doesn't just suggest snippets; it plans, generates, and applies comprehensive code changes across your entire project, from backend logic to frontend integration. CodeWhisper's generations are state-of-the-art (SOTA) and outperform other AI code-generation tools in benchmarks. See [Benchmarking](#benchmarking) for more details.

2. **Precision-Guided Context Curation for LLMs**: Harness the power of human insight to feed AI exactly what it needs. Quickly transform carefully selected parts of your codebase into rich, relevant context for LLMs, ensuring more accurate and project-aligned results.

@@ -111,26 +112,27 @@ While CodeWhisper excels at performing individual coding tasks and even large fe

## ✨ Key Features

| Feature | Description |
| ----------------------------------------------- | ----------------------------------------------------------------- |
| 🧠 AI-powered task planning and code generation | Leverage AI to plan and implement complex coding tasks |
| 🔄 Full git integration | Version control of AI-generated changes |
| 🔄 Diff-based code modifications | Handle larger edits within output token limits |
| 🌍 Support for various LLM providers | Compatible with Anthropic, OpenAI, Ollama and Groq |
| 🔐 Support for local models | Use local models via Ollama |
| 🚀 Blazingly fast code processing | Concurrent workers for improved performance |
| 🎯 Customizable file filtering and exclusion | Fine-tune which files to include in the context |
| 📊 Intelligent caching | Improved performance through smart caching |
| 🔧 Extensible template system | Interactive variable prompts for flexible output |
| 🖊️ Custom variables in templates | Support for single-line and multi-line custom variables |
| 💾 Value caching | Quick template reuse with cached values |
| 🖥️ CLI and programmatic API | Use CodeWhisper in scripts or as a library |
| 🔒 Respect for .gitignore | Option to use custom include and exclude globs |
| 🌈 Full language support | Compatible with all text-based file types |
| 🤖 Interactive mode | Granular file selection and template customization |
| ⚡ Optimized for large repositories | Efficient processing of extensive codebases |
| 📝 Detailed logging | Log AI prompts, responses, and parsing results |
| 🔗 GitHub integration | Fetch and work with issues (see [Configuration](#-configuration)) |
| Feature | Description |
| ----------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 🧠 AI-powered task planning and code generation | Leverage AI to plan and implement complex coding tasks |
| 🚀 SOTA generations                              | State-of-the-art generations that outperform other AI code-generation tools in benchmarks, despite using one-shot generation. See [Benchmarking](#benchmarking) for more details.             |
| 🔄 Full git integration | Version control of AI-generated changes |
| 🔄 Diff-based code modifications | Handle larger edits within output token limits |
| 🌍 Support for various LLM providers | Compatible with Anthropic, OpenAI, Ollama and Groq |
| 🔐 Support for local models | Use local models via Ollama |
| 🚀 Blazingly fast code processing | Concurrent workers for improved performance |
| 🎯 Customizable file filtering and exclusion | Fine-tune which files to include in the context |
| 📊 Intelligent caching | Improved performance through smart caching |
| 🔧 Extensible template system | Interactive variable prompts for flexible output |
| 🖊️ Custom variables in templates | Support for single-line and multi-line custom variables |
| 💾 Value caching | Quick template reuse with cached values |
| 🖥️ CLI and programmatic API | Use CodeWhisper in scripts or as a library |
| 🔒 Respect for .gitignore | Option to use custom include and exclude globs |
| 🌈 Full language support | Compatible with all text-based file types |
| 🤖 Interactive mode | Granular file selection and template customization |
| ⚡ Optimized for large repositories | Efficient processing of extensive codebases |
| 📝 Detailed logging | Log AI prompts, responses, and parsing results |
| 🔗 GitHub integration | Fetch and work with issues (see [Configuration](#-configuration)) |

## 📺 Video

@@ -220,6 +222,8 @@ This section is still under development. We are actively testing and evaluating

\* Whole-file edit mode is generally more precise but may lead to issues with maximum output token length, potentially limiting the ability to process larger files or multiple files simultaneously. It can also result in incomplete outputs for very large files, with the model resorting to placeholders like "// other functions here" instead of providing full implementations.

For more details, see the [Benchmarking](#benchmarking) section.

#### Experimental Support

- **Groq as a provider**
@@ -386,6 +390,63 @@ For more detailed instructions on using the GitHub integration and other CodeWhi

CodeWhisper can be used programmatically in your Node.js projects. For detailed API documentation and examples, please refer to [USAGE.md](USAGE.md).

## Benchmarking

CodeWhisper includes a benchmarking tool to evaluate its performance on Exercism Python exercises. This tool allows you to assess the capabilities of different AI models and configurations.

### Key Features

- Docker-based execution for consistent environments
- Concurrent worker support for faster benchmarking
- Detailed Markdown reports with performance metrics
- Options to customize test runs (number of tests, planning mode, diff mode)

### Usage

1. Build the Docker image:

```
./benchmark/docker_build.sh
```

2. Set up the appropriate API key as an environment variable.

3. Run the benchmark:
```
./benchmark/run_benchmark.sh --model <model_name> --workers <num_workers> --tests <num_tests> [options]
```
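
For example, a complete one-shot run against Claude 3.5 Sonnet could look like the following sketch (the environment-variable name is an assumption for Anthropic models; the flag values mirror the reference commands in the Results section below):

```
# Build the benchmark image once
./benchmark/docker_build.sh

# Provide the provider API key (variable name assumed for Anthropic)
export ANTHROPIC_API_KEY="sk-ant-..."

# Run in one-shot (no-plan) mode with 5 concurrent workers
./benchmark/run_benchmark.sh --model claude-3-5-sonnet-20240620 --workers 5 --no-plan
```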

### Output

The benchmark generates a detailed Markdown report including:

- Summary statistics (total time, cost, pass percentage)
- Per-exercise results (time, cost, mode, model, tests passed)

Reports are saved in `benchmark/reports/` with timestamped filenames.

### Results

CodeWhisper's performance has been evaluated across different models using the Exercism Python exercises. Below is a summary of the benchmark results:

| Model | Tests Passed | Time (s) | Cost ($) | Command |
| -------------------------- | ------------ | -------- | -------- | ------------------------------------------------------------------------------ |
| claude-3-5-sonnet-20240620 | 80.26% | 1619.49 | 3.4000 | `./benchmark/run_benchmark.sh --workers 5 --no-plan` |
| gpt-4o-2024-08-06 | 81.51% | 986.68 | 1.6800 | `./benchmark/run_benchmark.sh --workers 5 --no-plan --model gpt-4o-2024-08-06` |
| deepseek-coder | 76.98% | 5850.58 | 0.0000\* | `./benchmark/run_benchmark.sh --workers 5 --no-plan --model deepseek-coder` |

\*The cost calculation was not working properly for this benchmark run.

> **Note:** All benchmarks are one-shot only, unlike other benchmark suites that allow multiple generations informed by the results of each test run.

The full reports used to generate these results are available in the `benchmark/reports/` directory.

These results provide insights into the efficiency and accuracy of different models when used with CodeWhisper. The "Tests Passed" percentage indicates the proportion of Exercism tests successfully completed, while the time and cost metrics offer a view of the resource requirements for each model.

As we continue to run benchmarks with various models and configurations, this table will be updated to provide a comprehensive comparison, helping users make informed decisions about which model might best suit their needs.

For full details on running benchmarks, interpreting results, and available options, please refer to the [Benchmark README](./benchmark/README.md).

## 🤝 Contributing

We welcome contributions to CodeWhisper! Please read our [CONTRIBUTING.md](CONTRIBUTING.md) for details on our code of conduct and the process for submitting pull requests.
1 change: 1 addition & 0 deletions USAGE.md
@@ -61,6 +61,7 @@ codewhisper task [options]
| `-g, --gitignore <path>` | Path to .gitignore file (default: .gitignore) |
| `-f, --filter <patterns...>` | File patterns to include (use glob patterns, e.g., "src/**/*.js") |
| `-e, --exclude <patterns...>` | File patterns to exclude (use glob patterns, e.g., "**/*.test.js") |
| `--skip-files` | Skip the file selection step and use the files provided by the --filter and --exclude options |
| `-s, --suppress-comments` | Strip comments from the code |
| `-l, --line-numbers` | Add line numbers to code blocks |
| `-cw, --context-window <number>` | Specify the context window for the AI model. Only applicable for Ollama models. |
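
Combined with the existing filter options, the new `--skip-files` flag enables fully non-interactive runs. A minimal sketch, using only the flags documented above:

```
# Skip the interactive file-selection step; rely on the glob filters alone
codewhisper task --skip-files \
  --filter "src/**/*.ts" \
  --exclude "**/*.test.ts"
```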
48 changes: 48 additions & 0 deletions benchmark/Dockerfile
@@ -0,0 +1,48 @@
FROM node:20

# Enable corepack for pnpm support
RUN corepack enable

# Install Python, pip, and build essentials
RUN apt-get update && apt-get install -y \
python3 \
python3-pip \
build-essential \
git \
&& rm -rf /var/lib/apt/lists/*

# Set up pnpm to use a specific store directory in the container
RUN echo "store-dir=/root/.pnpm-store" > /root/.npmrc

# Set up working directory for the main project
WORKDIR /app

# Copy the entire CodeWhisper project
COPY .. .

# Install dependencies for the main project
RUN pnpm install

# Set NODE_ENV to development for the build process
ENV NODE_ENV=development

# Build the main project
RUN pnpm run build

# Change to the benchmark directory
WORKDIR /app/benchmark

# Install dependencies for the benchmark
RUN pnpm install

# Build the benchmark
RUN pnpm run build

# Set environment variables back to production
ENV NODE_ENV=production

# Set PATH to include CodeWhisper's dist directory
ENV PATH="/app/dist/cli:${PATH}"

# Run benchmark
CMD ["node", "--unhandled-rejections=strict", "/app/benchmark/dist/benchmark.js"]