Commit

Clean up evaluate Khoj helpfulness post based on feedback by Khoj
debanjum committed Nov 23, 2024
1 parent 85d593a commit 667409a
Showing 1 changed file with 27 additions and 26 deletions.
53 changes: 27 additions & 26 deletions src/content/posts/evaluate-khoj-quality.md
@@ -4,7 +4,7 @@ author: debanjum
description: "A deep dive into how we implemented an automated evaluation harness and Khoj's excellent performance on modern factuality and reasoning benchmarks."
heroImage: /eval-khoj-quality.webp
pubDate: 2024-11-22
keywords: ["agent eval", "automated llm benchmark"]
keywords: ["agent eval", "automated llm benchmark", "research mode"]
---

Khoj is an open, personal AI that can gather information from your documents and the web to generate accurate answers, paint images, visualize data, and create documents for you.
@@ -26,28 +26,27 @@ Additionally, as agent capabilities increase, we need more widespread testing to
We selected two primary benchmarks for evaluation:

1. **Google's [FRAMES](https://huggingface.co/datasets/google/frames-benchmark)**: This is the primary evaluation benchmark we tested against. It tests:
-- Multi-hop reasoning: Requires retrieval from multiple sources and reasoning over them.
-- Temporal reasoning: Requires reasoning about time.
-- Tabular reasoning: Requires reasoning on data in tabels.
+- Multi-hop reasoning: Requires retrieval from multiple sources and reasoning over them.
+- Temporal reasoning: Requires reasoning about time.
+- Tabular reasoning: Requires reasoning on data in tables.

-These align well with our retrieval and reasoning goals for Khoj. The benchmark was released in September 2024. It is public, reasonably challenging dataset for modern agents.[^2].
+These align well with our retrieval and reasoning goals for Khoj. The benchmark was released in September 2024 by Google. It is a public, reasonably challenging dataset for modern AI agents[^2].

-2. **OpenAI's [SimpleQA](https://openai.com/index/introducing-simpleqa/)**: This is a newer, secondary evaluation benchmark we incorporated:
-- It evaluates the ability of large language models to give correct and truthful answers.
-- It was created as a challenging Q&A benchmark for modern LLMs. Top models like o1-preview and the latest claude 3.5 sonnet only get ~40% answers correct.
+2. **OpenAI's [SimpleQA](https://openai.com/index/introducing-simpleqa/)**: This is a recently released evaluation benchmark.
+- It evaluates the ability of large language models to give correct and truthful answers.
+- It was created as a challenging Q&A benchmark for modern LLMs. Top models like o1-preview and the latest Claude 3.5 Sonnet only get ~40% of answers correct.

-These aligns well with our helpfulness goals for Khoj. The benchmark was released in October 2024. It is open-source and challenging for current state-of-the-art LLMs.
+These match our helpfulness goals for Khoj. This benchmark was released a few weeks ago by OpenAI. It is open-source and challenging for current state-of-the-art LLMs. A short sketch for loading both benchmarks follows below.
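Both benchmarks are publicly downloadable if you want to run a similar evaluation yourself. The sketch below pulls FRAMES with the Hugging Face `datasets` library and only inspects the splits and columns rather than assuming them; SimpleQA is distributed by OpenAI as a plain question/answer CSV alongside their simple-evals repository.

```python
# Minimal sketch: download the FRAMES benchmark from the Hugging Face hub
# and inspect its structure before wiring it into an eval harness.
from datasets import load_dataset

frames = load_dataset("google/frames-benchmark")
print(frames)                          # shows the available splits and columns
first_split = next(iter(frames.values()))
print(first_split[0])                  # one prompt/answer record

# SimpleQA ships from OpenAI as a CSV of question/answer pairs
# (see their simple-evals repository); it can be loaded with pandas.
```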

#### Implementation
-##### Modes
-Khoj can be interacted with in few different modes. The 3 main ones from the lens of the evaluations are:
+#### Evaluated Modes
+Khoj can be interacted with in a few different modes. The 3 main ones from the perspective of these evaluations are:
- **General**: This is like a closed book exam. No retrieval is allowed. The agent can't access external information, only the LLM's existing *general* knowledge.
- **Default**: This is like an open book exam. Single shot retrieval is allowed. The agent can search for information online and run calculations in a [code sandbox](/posts/ai-with-code-execution).
-- **Research**: This is like a take home exam. Iterative retrival is permitted. The agent can do deeper research for a bit longer with the same web search and code tools.
+- **Research**: This is like a take home exam. Iterative retrieval is permitted. The agent can do deeper research for a bit longer with the same web search and code tools.

-You can trigger any of the 3 modes in Khoj using a slash command like `/research`. Default mode doesn't require slash command. Research mode was released at the start of November and is still in beta.
+You can chat with Khoj in any of the 3 modes using a slash command like `/research`. Default mode doesn't require a slash command. Research mode was released at the start of November and is still in beta.
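As a concrete illustration, the same benchmark question can be routed to each mode by prefixing a slash command. The command names below are assumptions inferred from the mode names above (only `/research` is confirmed in this post), so check the Khoj docs for the commands your version supports.

```python
# Illustrative only: routing one question to each mode with slash commands.
# "/general" is an assumed command name; default mode needs no prefix.
question = "Which country hosted the first FIFA World Cup, and in what year?"

queries = {
    "general": f"/general {question}",    # closed book: model knowledge only
    "default": question,                  # open book: single-shot web search + code sandbox
    "research": f"/research {question}",  # take home: iterative retrieval with the same tools
}

for mode, query in queries.items():
    print(f"[{mode}] {query}")
```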

-##### Evaluation Script
+#### Evaluation Harness

We developed an evaluation script to quiz Khoj on different benchmarks[^6]. It allows you to:
- Configure the sample size, randomization, and target benchmark.
@@ -61,17 +60,19 @@ The eval is automatically run on every release using a Github [workflow](https:/
4. Grades the responses using gemini-1.5-pro-002 as the LLM judge.
5. Publishes the scores and a downloadable report for verification.
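For readers who want to build something similar, here is a minimal, hypothetical sketch of the loop those steps describe: sample benchmark questions, query a chat endpoint in a chosen mode, grade each answer with an LLM judge, and report the aggregate score. The endpoint URL, response field, and judge prompt are illustrative assumptions, not the actual eval.py implementation; see the linked script and workflow for the real harness.

```python
# Hypothetical sketch of the evaluation loop described above; the endpoint,
# field names, and judge wiring are illustrative, not Khoj's actual eval.py.
import random
from typing import Callable

import requests

AGENT_URL = "http://localhost:42110/api/chat"  # assumed local Khoj chat endpoint
SAMPLE_SIZE = 200


def ask_agent(question: str, mode: str = "/research") -> str:
    """Send one benchmark question to the agent, prefixed with a mode slash command."""
    resp = requests.get(AGENT_URL, params={"q": f"{mode} {question}".strip()})
    resp.raise_for_status()
    return resp.json().get("response", "")  # assumed response shape


def grade(judge: Callable[[str], str], question: str, gold: str, answer: str) -> bool:
    """Ask an LLM judge (any prompt -> text callable) to mark the answer correct or not."""
    verdict = judge(
        f"Question: {question}\nGold answer: {gold}\nAgent answer: {answer}\n"
        "Reply with exactly TRUE if the agent answer is correct, otherwise FALSE."
    )
    return "TRUE" in verdict.upper()


def evaluate(dataset: list[dict], judge: Callable[[str], str]) -> float:
    """Quiz the agent on a random subset of the benchmark and return % correct."""
    sample = random.sample(dataset, min(SAMPLE_SIZE, len(dataset)))
    correct = sum(
        grade(judge, row["question"], row["answer"], ask_agent(row["question"]))
        for row in sample
    )
    return 100 * correct / len(sample)
```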

-Using an automated evaluation workflow provides transparency at multiple levels. It creates an audit trail to inspect the setup, reasoning traces and detailed results of Khoj's performance across time and code changes. You can see the raw logs from an eval workflow run [here](https://github.com/khoj-ai/khoj/actions/runs/11963916969/job/33355284137#step:8:38398).
+Using a public evaluation workflow provides transparency at multiple levels. It creates an audit trail to inspect the setup, reasoning traces and detailed results of Khoj's performance across time and code changes. You can see the raw logs from a recent eval workflow run [here](https://github.com/khoj-ai/khoj/actions/runs/11963916969/job/33355284137#step:8:38398).

### Results

-These runs evaluate Khoj with gemini-1.5-flash-002[^5] on a 200-question random subset of the target benchmark. This results in error margins of ~6% at reasonable costs ($5 across the 3 modes and 2 benchmarks).
+These runs evaluate Khoj with gemini-1.5-flash-002 on a 200-question random subset of the target benchmark[^5]. This results in error margins of ~6% at reasonable costs ($5 across the 3 modes and 2 benchmarks).
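As a rough sanity check on that ~6% figure, a binomial standard-error estimate for a 200-question sample lands in the same ballpark. This is an approximation under a normal-approximation assumption, not necessarily the method used for the post:

```python
# Approximate sampling error of an accuracy estimate over n benchmark questions,
# using the normal approximation to the binomial distribution.
from math import sqrt

def margin_of_error(accuracy: float, n: int, z: float = 1.96) -> float:
    """Half-width of a ~95% confidence interval around an observed accuracy."""
    return z * sqrt(accuracy * (1 - accuracy) / n)

for acc in (0.42, 0.635, 0.84):
    print(f"accuracy={acc:.0%}  margin=+/-{margin_of_error(acc, 200):.1%}")
# At n=200 this works out to roughly +/-5-7%, consistent with the ~6% quoted above.
```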

| Benchmark | General | Default | Research | Baseline |
|-----------|------|---------|---------|----------|
| [FRAMES](https://huggingface.co/datasets/google/frames-benchmark) | [27.14](https://github.com/khoj-ai/khoj/actions/runs/11941817410/attempts/1#summary-33287504889) | [42.00](https://github.com/khoj-ai/khoj/actions/runs/11944716303/attempts/1#summary-33296136909) | [63.5](https://github.com/khoj-ai/khoj/actions/runs/11945673147/attempts/1#summary-33298733849) | 26.3% (flash-1.5-001) |
| [SimpleQA](https://openai.com/index/introducing-simpleqa/) | [10.00](https://github.com/khoj-ai/khoj/actions/runs/11963066702/attempts/1#summary-33352767460) | [84.00](https://github.com/khoj-ai/khoj/actions/runs/11963354200/attempts/1#summary-33353634493) | [86.00](https://github.com/khoj-ai/khoj/actions/runs/11963916969/attempts/1#summary-33355284137) | 43.5% (o1 preview) |

+The graphs below visualize the improvements across the 3 modes on the evaluated benchmarks:

![](/khoj-on-frames.webp)

![](/khoj-on-simpleqa.webp)
@@ -85,7 +86,7 @@ Khoj upgrades small hosted LLMs into AI agents that perform at or beyond the cap
- Default to Research mode: a further **50%** relative improvement, from 42% to 63.5%.
- Khoj more than doubled **(2.3x) the accuracy of gemini-1.5-flash**, from 27% to 63.5%. This seems close to saturating the model's reasoning capabilities on this benchmark.
- Khoj research mode lifts gemini-1.5-flash (63.5%) to roughly the level of gemini-1.5-pro with the multi-step retrieval setup from the FRAMES paper (66%).
-- For reference when shown all relevant documents gemini-1.5-flash achieves a 66.5% score.
+- For reference, when shown all relevant documents, gemini-1.5-flash achieves a 66.5% score. This is the ceiling of the model's reasoning capabilities given perfect retrieval.

#### Improvements on the SimpleQA Benchmark
- General to Default mode: **8.4x** improvement, from 10% to 84%.
@@ -96,9 +97,9 @@

#### Impact of Code Interpreter Tool
Khoj can [run code](/posts/ai-with-code-execution). This ability results in notable accuracy improvements:
-- Default mode accuracy **without** code tool: 35.68%
-- Default mode accuracy **with** code tool: 42.00%
-- Net improvement: ~**20%**
+- Default mode accuracy **without** code tool: 35.68%.
+- Default mode accuracy **with** code tool: 42.00%.
+- Net relative improvement: ~**18%** from 35.68% to 42.00%.
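The relative figures quoted in this section follow directly from the absolute scores, e.g. the code-tool gain is (42.00 - 35.68) / 35.68, or roughly 18%. A small helper makes the convention explicit:

```python
# Relative improvement between two absolute accuracy scores, matching the
# percentages quoted in this section.
def relative_gain(before: float, after: float) -> float:
    return (after - before) / before

print(f"{relative_gain(35.68, 42.00):.1%}")  # code tool in Default mode -> ~17.7%
print(f"{relative_gain(42.00, 63.50):.1%}")  # Default -> Research on FRAMES -> ~51.2%
print(f"{relative_gain(27.14, 63.50):.1%}")  # General -> Research on FRAMES -> ~134%, a 2.3x multiple
```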

### Future Work
- Add ability to efficiently test retrieval across internal and external knowledge. Our current eval only measures retrieval from the internet, not from your documents.
@@ -117,10 +118,10 @@ Through systematic testing and continuous monitoring, we can now quantitatively
You can fork our [eval script](https://github.com/khoj-ai/khoj/blob/master/tests/evals/eval.py) & [workflow](https://github.com/khoj-ai/khoj/blob/master/.github/workflows/run_evals.yml) to adapt it as an automated eval harness for your agents.

#### Footnotes
-[^1]: This uncertainty motivated us to build this automated evaluation harness to track and catch any drops in quality across time.
-[^2]: While effective, FRAMES has limitations in its range (25% to 66%), which somewhat constrains testing of agent retrieval and reasoning capabilities.
-[^3]: The high performance on SimpleQA suggests either remarkable effectiveness of Khoj's approach or potential brittleness in the benchmark itself. Or maybe our eval script is broken?
-[^4]: We expect dangerous behaviors and capabilities to appear in AI agents before LLMs. Building systems to detect and isolate these at the AI agent layer is important. But a topic for a separate post.
+[^1]: The uncertainty motivated us to build this automated evaluation harness to track and catch any drops in quality across time.
+[^2]: While FRAMES has good-quality questions, it has a somewhat limited score range (26% to 66% for flash) for testing the model's retrieval capabilities as an agent.
+[^3]: The performance jump on SimpleQA suggests either remarkable effectiveness of Khoj's approach or potential brittleness in the benchmark itself. Or maybe our eval is broken?
+[^4]: We expect dangerous behaviors and capabilities to appear in AI agents before LLMs. Building systems to detect and isolate these at the AI agent layer is important, but that topic needs a separate post. Until then, you can read [this](https://www.lesswrong.com/posts/ZoFxTqWRBkyanonyb/current-safety-training-techniques-do-not-fully-transfer-to#Discussion) post.
[^5]: Similar quality improvements were seen in our internal evaluations of 4o-mini on a 100-question random subset of both benchmarks.
-[^6]: We verified that gemini-1.5-flash in general mode on our eval get the same baseline score as gemini-1.5-flash on the FRAMES [paper](https://arxiv.org/abs/2409.12941) of 26.3%.
+[^6]: We verified that gemini-1.5-flash-002 in general mode on our eval gets a baseline score similar to the 26.3% reported for gemini-1.5-flash-001 in the FRAMES [paper](https://arxiv.org/abs/2409.12941).
[^7]: For the scope of this post, we define helpfulness as the ability to accurately and truthfully answer a query.
