feat: add benchmark tool (#95)
* wip

* chore: make files executable

* feat: finalize benchmark implementation

* add debug mode

* docs: update documentation

* fix benchmarking, add results

* add reports for transparency

* docs: update readme

* docs: stress how good our results are

* docs: stress how good our results are

* docs: add deepseek-coder benchmark

* docs: update README

* docs: fix link

* remove wrong report files

* rerun benchmarks after fix and update results
gmickel authored Aug 16, 2024
1 parent c042d51 commit a82619e
Showing 23 changed files with 4,834 additions and 26 deletions.
169 changes: 169 additions & 0 deletions .dockerignore
@@ -0,0 +1,169 @@
# Test-related files
tests/fixtures/**/.gitignore
tests/**/*.log

# Temporary files
*.tmp
*.temp

### Node ###
# Logs
logs
*.log
npm-debug.log*
yarn-debug.log*
yarn-error.log*
lerna-debug.log*
.pnpm-debug.log*


# Diagnostic reports (https://nodejs.org/api/report.html)
report.[0-9]*.[0-9]*.[0-9]*.[0-9]*.json

# Runtime data
pids
*.pid
*.seed
*.pid.lock

# Directory for instrumented libs generated by jscoverage/JSCover
lib-cov

# Coverage directory used by tools like istanbul
coverage
*.lcov

# nyc test coverage
.nyc_output

# Grunt intermediate storage (https://gruntjs.com/creating-plugins#storing-task-files)
.grunt

# Bower dependency directory (https://bower.io/)
bower_components

# node-waf configuration
.lock-wscript

# Compiled binary addons (https://nodejs.org/api/addons.html)
build/Release

# Dependency directories
node_modules/
jspm_packages/

# Snowpack dependency directory (https://snowpack.dev/)
web_modules/

# TypeScript cache
*.tsbuildinfo

# Optional npm cache directory
.npm

# Optional eslint cache
.eslintcache

# Optional stylelint cache
.stylelintcache

# Microbundle cache
.rpt2_cache/
.rts2_cache_cjs/
.rts2_cache_es/
.rts2_cache_umd/

# Optional REPL history
.node_repl_history

# Output of 'npm pack'
*.tgz

# Yarn Integrity file
.yarn-integrity

# dotenv environment variable files
.env
.env.development.local
.env.test.local
.env.production.local
.env.local

# parcel-bundler cache (https://parceljs.org/)
.cache
.parcel-cache

# Next.js build output
.next
out

# Nuxt.js build / generate output
.nuxt
dist

# Gatsby files
.cache/
# Uncomment the public line below if your project uses Gatsby and not Next.js
# https://nextjs.org/blog/next-9-1#public-directory-support
# public

# vuepress build output
.vuepress/dist

# vuepress v2.x temp and cache directory
.temp

# Docusaurus cache and generated files
.docusaurus

# Serverless directories
.serverless/

# FuseBox cache
.fusebox/

# DynamoDB Local files
.dynamodb/

# TernJS port file
.tern-port

# Stores VSCode versions used for testing VSCode extensions
.vscode-test

# yarn v2
.yarn/cache
.yarn/unplugged
.yarn/build-state.yml
.yarn/install-state.gz
.pnp.*

### Node Patch ###
# Serverless Webpack directories
.webpack/

# Optional stylelint cache

# SvelteKit build / generate output
.svelte-kit

# OS generated files
.DS_Store
.DS_Store?
._*
.Spotlight-V100
.Trashes
ehthumbs.db
Thumbs.db

# Vim configurations
.vim

todos.md
codewhisper.md
testing
ElPlan.md
ElPlanFilter.md
codewhisper-task-output.json
demotask.md
.codewhisper-task-cache.json
4 changes: 4 additions & 0 deletions .gitignore
@@ -167,3 +167,7 @@ ElPlanFilter.md
codewhisper-task-output.json
demotask.md
.codewhisper-task-cache.json

# benchmark reports
benchmark/reports/
!benchmark/reports/*_reference.md
103 changes: 82 additions & 21 deletions README.md
@@ -19,6 +19,7 @@ AI-Powered End-to-End Task Implementation & blazingly fast Codebase-to-LLM Conte
[Templates](#-templates)
[Configuration](#-configuration)
[API](#-api)
[Benchmarking](#benchmarking)
[Contributing](#-contributing)
[Roadmap](#-roadmap)
[FAQ](#-faq)
@@ -27,7 +28,7 @@ AI-Powered End-to-End Task Implementation & blazingly fast Codebase-to-LLM Conte

CodeWhisper is a powerful tool that bridges the gap between your codebase and Large Language Models (LLMs). It serves two primary functions:

1. **AI-Powered End-to-End Task Implementation**: Tackle complex, codebase-spanning tasks with ease. CodeWhisper doesn't just suggest snippets; it plans, generates, and applies comprehensive code changes across your entire project, from backend logic to frontend integration.
1. **AI-Powered End-to-End Task Implementation**: Tackle complex, codebase-spanning tasks with ease. CodeWhisper doesn't just suggest snippets; it plans, generates, and applies comprehensive code changes across your entire project, from backend logic to frontend integration. CodeWhisper's generations are state-of-the-art (SOTA) and outperform other AI code-generation tools in benchmarks. See [Benchmarking](#benchmarking) for more details.

2. **Precision-Guided Context Curation for LLMs**: Harness the power of human insight to feed AI exactly what it needs. Quickly transform carefully selected parts of your codebase into rich, relevant context for LLMs, ensuring more accurate and project-aligned results.

@@ -111,26 +112,27 @@ While CodeWhisper excels at performing individual coding tasks and even large fe

## ✨ Key Features

| Feature | Description |
| ----------------------------------------------- | ----------------------------------------------------------------- |
| 🧠 AI-powered task planning and code generation | Leverage AI to plan and implement complex coding tasks |
| 🔄 Full git integration | Version control of AI-generated changes |
| 🔄 Diff-based code modifications | Handle larger edits within output token limits |
| 🌍 Support for various LLM providers | Compatible with Anthropic, OpenAI, Ollama and Groq |
| 🔐 Support for local models | Use local models via Ollama |
| 🚀 Blazingly fast code processing | Concurrent workers for improved performance |
| 🎯 Customizable file filtering and exclusion | Fine-tune which files to include in the context |
| 📊 Intelligent caching | Improved performance through smart caching |
| 🔧 Extensible template system | Interactive variable prompts for flexible output |
| 🖊️ Custom variables in templates | Support for single-line and multi-line custom variables |
| 💾 Value caching | Quick template reuse with cached values |
| 🖥️ CLI and programmatic API | Use CodeWhisper in scripts or as a library |
| 🔒 Respect for .gitignore | Option to use custom include and exclude globs |
| 🌈 Full language support | Compatible with all text-based file types |
| 🤖 Interactive mode | Granular file selection and template customization |
| ⚡ Optimized for large repositories | Efficient processing of extensive codebases |
| 📝 Detailed logging | Log AI prompts, responses, and parsing results |
| 🔗 GitHub integration | Fetch and work with issues (see [Configuration](#-configuration)) |
| Feature | Description |
| ----------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 🧠 AI-powered task planning and code generation | Leverage AI to plan and implement complex coding tasks |
| 🚀 SOTA generations                              | State-of-the-art generations that outperform other AI code-generation tools in benchmarks, despite using one-shot generation. See [Benchmarking](#benchmarking) for more details.             |
| 🔄 Full git integration | Version control of AI-generated changes |
| 🔄 Diff-based code modifications | Handle larger edits within output token limits |
| 🌍 Support for various LLM providers | Compatible with Anthropic, OpenAI, Ollama and Groq |
| 🔐 Support for local models | Use local models via Ollama |
| 🚀 Blazingly fast code processing | Concurrent workers for improved performance |
| 🎯 Customizable file filtering and exclusion | Fine-tune which files to include in the context |
| 📊 Intelligent caching | Improved performance through smart caching |
| 🔧 Extensible template system | Interactive variable prompts for flexible output |
| 🖊️ Custom variables in templates | Support for single-line and multi-line custom variables |
| 💾 Value caching | Quick template reuse with cached values |
| 🖥️ CLI and programmatic API | Use CodeWhisper in scripts or as a library |
| 🔒 Respect for .gitignore | Option to use custom include and exclude globs |
| 🌈 Full language support | Compatible with all text-based file types |
| 🤖 Interactive mode | Granular file selection and template customization |
| ⚡ Optimized for large repositories | Efficient processing of extensive codebases |
| 📝 Detailed logging | Log AI prompts, responses, and parsing results |
| 🔗 GitHub integration | Fetch and work with issues (see [Configuration](#-configuration)) |

## 📺 Video

@@ -220,6 +222,8 @@ This section is still under development. We are actively testing and evaluating

\* Whole-file edit mode is generally more precise but may lead to issues with maximum output token length, potentially limiting the ability to process larger files or multiple files simultaneously. It can also result in incomplete outputs for very large files, with the model resorting to placeholders like "// other functions here" instead of providing full implementations.

For more details, see the [Benchmarking](#benchmarking) section.

#### Experimental Support

- **Groq as a provider**
@@ -386,6 +390,63 @@ For more detailed instructions on using the GitHub integration and other CodeWhi

CodeWhisper can be used programmatically in your Node.js projects. For detailed API documentation and examples, please refer to [USAGE.md](USAGE.md).

## Benchmarking

CodeWhisper includes a benchmarking tool to evaluate its performance on Exercism Python exercises. This tool allows you to assess the capabilities of different AI models and configurations.

### Key Features

- Docker-based execution for consistent environments
- Concurrent worker support for faster benchmarking
- Detailed Markdown reports with performance metrics
- Options to customize test runs (number of tests, planning mode, diff mode)

### Usage

1. Build the Docker image:

```
./benchmark/docker_build.sh
```

2. Set up the appropriate API key as an environment variable.

3. Run the benchmark:
```
./benchmark/run_benchmark.sh --model <model_name> --workers <num_workers> --tests <num_tests> [options]
```
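
For example, a complete one-shot run against Claude 3.5 Sonnet could look like the following sketch (the environment-variable name is an assumption for Anthropic models; the flag values mirror the reference commands in the Results section below):

```
# Build the benchmark image once
./benchmark/docker_build.sh

# Provide the provider API key (variable name assumed for Anthropic)
export ANTHROPIC_API_KEY="sk-ant-..."

# Run in one-shot (no-plan) mode with 5 concurrent workers
./benchmark/run_benchmark.sh --model claude-3-5-sonnet-20240620 --workers 5 --no-plan
```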

### Output

The benchmark generates a detailed Markdown report including:

- Summary statistics (total time, cost, pass percentage)
- Per-exercise results (time, cost, mode, model, tests passed)

Reports are saved in `benchmark/reports/` with timestamped filenames.

### Results

CodeWhisper's performance has been evaluated across different models using the Exercism Python exercises. Below is a summary of the benchmark results:

| Model | Tests Passed | Time (s) | Cost ($) | Command |
| -------------------------- | ------------ | -------- | -------- | ------------------------------------------------------------------------------ |
| claude-3-5-sonnet-20240620 | 80.26% | 1619.49 | 3.4000 | `./benchmark/run_benchmark.sh --workers 5 --no-plan` |
| gpt-4o-2024-08-06 | 81.51% | 986.68 | 1.6800 | `./benchmark/run_benchmark.sh --workers 5 --no-plan --model gpt-4o-2024-08-06` |
| deepseek-coder | 76.98% | 5850.58 | 0.0000\* | `./benchmark/run_benchmark.sh --workers 5 --no-plan --model deepseek-coder` |

\*The cost calculation was not working properly for this benchmark run.

> **Note:** All benchmarks are one-shot only, unlike other benchmark suites that allow multiple generations informed by the results of each test run.

The full reports used to generate these results are available in the `benchmark/reports/` directory.

These results provide insights into the efficiency and accuracy of different models when used with CodeWhisper. The "Tests Passed" percentage indicates the proportion of Exercism tests successfully completed, while the time and cost metrics offer a view of the resource requirements for each model.

As we continue to run benchmarks with various models and configurations, this table will be updated to provide a comprehensive comparison, helping users make informed decisions about which model might best suit their needs.

For full details on running benchmarks, interpreting results, and available options, please refer to the [Benchmark README](./benchmark/README.md).

## 🤝 Contributing

We welcome contributions to CodeWhisper! Please read our [CONTRIBUTING.md](CONTRIBUTING.md) for details on our code of conduct and the process for submitting pull requests.
1 change: 1 addition & 0 deletions USAGE.md
@@ -61,6 +61,7 @@ codewhisper task [options]
| `-g, --gitignore <path>` | Path to .gitignore file (default: .gitignore) |
| `-f, --filter <patterns...>` | File patterns to include (use glob patterns, e.g., "src/**/*.js") |
| `-e, --exclude <patterns...>` | File patterns to exclude (use glob patterns, e.g., "**/*.test.js") |
| `--skip-files` | Skip the file selection step and use the files provided by the --filter and --exclude options |
| `-s, --suppress-comments` | Strip comments from the code |
| `-l, --line-numbers` | Add line numbers to code blocks |
| `-cw, --context-window <number>` | Specify the context window for the AI model. Only applicable for Ollama models. |
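
Combined with the existing filter options, the new `--skip-files` flag enables fully non-interactive runs. A minimal sketch, using only the flags documented above:

```
# Skip the interactive file-selection step; rely on the glob filters alone
codewhisper task --skip-files \
  --filter "src/**/*.ts" \
  --exclude "**/*.test.ts"
```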
48 changes: 48 additions & 0 deletions benchmark/Dockerfile
@@ -0,0 +1,48 @@
FROM node:20

# Enable corepack for pnpm support
RUN corepack enable

# Install Python, pip, and build essentials
RUN apt-get update && apt-get install -y \
python3 \
python3-pip \
build-essential \
git \
&& rm -rf /var/lib/apt/lists/*

# Set up pnpm to use a specific store directory in the container
RUN echo "store-dir=/root/.pnpm-store" > /root/.npmrc

# Set up working directory for the main project
WORKDIR /app

# Copy the entire CodeWhisper project
COPY .. .

# Install dependencies for the main project
RUN pnpm install

# Set NODE_ENV to development for the build process
ENV NODE_ENV=development

# Build the main project
RUN pnpm run build

# Change to the benchmark directory
WORKDIR /app/benchmark

# Install dependencies for the benchmark
RUN pnpm install

# Build the benchmark
RUN pnpm run build

# Set environment variables back to production
ENV NODE_ENV=production

# Set PATH to include CodeWhisper's dist directory
ENV PATH="/app/dist/cli:${PATH}"

# Run benchmark
CMD ["node", "--unhandled-rejections=strict", "/app/benchmark/dist/benchmark.js"]