Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang
Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, Fei Huang
👏 Welcome to try web traversal via our Modelscope online demo or 🤗 Huggingface online demo!
Repo for WebWalker: Benchmarking LLMs in Web Traversal
- 🌏 The Online Demo is available at ModelScope and HuggingFace now!
- 🤗 The WebWalkerQA dataset is available at HuggingFace Datasets!
- 🤗 The WebWalkerQA Leaderboard is available at HuggingFace Space!
- We construct a challenging benchmark, WebWalkerQA, which comprises 680 queries from four real-world scenarios spanning more than 1373 webpages.
- To tackle the challenge of web-navigation tasks requiring long context, we propose WebWalker, which utilizes a multi-agent framework for effective memory management.
- Extensive experiments show that WebWalkerQA is challenging and that, for information-seeking tasks, vertical exploration within a page is beneficial.
Each JSON item in the WebWalkerQA dataset is organized in the following format:
{
"Question": "When is the paper submission deadline for the ACL 2025 Industry Track, and what is the venue address for the conference?",
"Answer": "The paper submission deadline for the ACL 2025 Industry Track is March 21, 2025. The conference will be held in Brune-Kreisky-Platz 1.",
"Root_Url": "https://2025.aclweb.org/",
"Info": {
"Hop": "multi-source",
"Domain": "Conference",
"Language": "English",
"Difficulty_Level": "Medium",
"Source_Website": [
"https://2025.aclweb.org/calls/industry_track/",
"https://2025.aclweb.org/venue/"
],
"Golden_Path": ["root->call>student_research_workshop", "root->venue"]
}
}
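As a minimal sketch of how an item in this format could be consumed, the snippet below parses the example above and flattens the fields under `Info`. The helper `summarize_item` and its output keys are hypothetical and are not part of the repository.

```python
import json

# The example item from above, embedded as a JSON string.
raw_item = """
{
  "Question": "When is the paper submission deadline for the ACL 2025 Industry Track, and what is the venue address for the conference?",
  "Answer": "The paper submission deadline for the ACL 2025 Industry Track is March 21, 2025. The conference will be held in Brune-Kreisky-Platz 1.",
  "Root_Url": "https://2025.aclweb.org/",
  "Info": {
    "Hop": "multi-source",
    "Domain": "Conference",
    "Language": "English",
    "Difficulty_Level": "Medium",
    "Source_Website": [
      "https://2025.aclweb.org/calls/industry_track/",
      "https://2025.aclweb.org/venue/"
    ],
    "Golden_Path": ["root->call>student_research_workshop", "root->venue"]
  }
}
"""


def summarize_item(item):
    """Flatten one item into a small summary dict (hypothetical helper)."""
    info = item["Info"]
    return {
        "root_url": item["Root_Url"],
        "hop": info["Hop"],
        "domain": info["Domain"],
        "language": info["Language"],
        "difficulty": info["Difficulty_Level"],
        # One entry per page that contributes to the answer.
        "num_sources": len(info["Source_Website"]),
        # One entry per annotated path from the root page.
        "num_paths": len(info["Golden_Path"]),
    }


item = json.loads(raw_item)
summary = summarize_item(item)
print(summary)
```

A summary like this can be convenient for grouping queries by `Hop`, `Domain`, or `Difficulty_Level` before analysis.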
🤗 The WebWalkerQA Leaderboard is available at HuggingFace!
You can load the dataset via the following code:
from datasets import load_dataset
ds = load_dataset("callanwu/WebWalkerQA", split="main")
Additionally, we provide a collection of approximately 14k silver QA pairs, which have not yet been carefully human-verified.
You can load the silver dataset by changing the split to silver.
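A small wrapper can make the choice between the two splits explicit. This is only a sketch: `load_webwalkerqa` and `KNOWN_SPLITS` are hypothetical names, and the assumption that `main` and `silver` are the only splits comes solely from the text above.

```python
# The two split names mentioned above; assumed here to be the only ones.
KNOWN_SPLITS = ("main", "silver")


def load_webwalkerqa(split="main"):
    """Load one WebWalkerQA split (hypothetical wrapper around load_dataset)."""
    if split not in KNOWN_SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {KNOWN_SPLITS}")
    # Imported lazily so the split check runs even without the `datasets` package.
    from datasets import load_dataset

    return load_dataset("callanwu/WebWalkerQA", split=split)


# Example (downloads data from the Hugging Face Hub, so it is left commented out):
# silver = load_webwalkerqa("silver")
```

Checking the split name before calling `load_dataset` surfaces a typo immediately instead of after a network round trip.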
The performance of web agents is shown below:
🚩 Welcome to submit your method to the leaderboard!
conda create -n webwalker python=3.10
conda activate webwalker
git clone https://github.com/alibaba-nlp/WebWalker.git
cd WebWalker
pip install -e .
# Install requirements
pip install -r requirements.txt
# Run post-installation setup
crawl4ai-setup
# Verify your installation
crawl4ai-doctor
🔑 Before running, please export the OpenAI API key or the DashScope API key as an environment variable:
export OPEN_AI_API_KEY=YOUR_API_KEY
export OPEN_AI_API_BASE_URL=YOUR_API_BASE_URL
or
export DASHSCOPE_API_KEY=YOUR_API_KEY
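Before launching anything, it can help to confirm which of these variables is actually visible to Python. The sketch below is a hypothetical helper, not code from `src/app.py`; the variable names are taken from the export commands above, and giving the OpenAI key precedence when both are set is an assumption.

```python
import os


def detect_api_backend(env=None):
    """Return (backend, base_url) for the first configured key (hypothetical helper)."""
    env = os.environ if env is None else env
    if env.get("OPEN_AI_API_KEY"):
        # The base URL is exported alongside the OpenAI key; it may be unset.
        return "openai", env.get("OPEN_AI_API_BASE_URL")
    if env.get("DASHSCOPE_API_KEY"):
        return "dashscope", None
    # Neither key is set.
    return None, None


backend, base_url = detect_api_backend()
if backend is None:
    print("No API key found: export OPEN_AI_API_KEY or DASHSCOPE_API_KEY first.")
else:
    print(f"Using backend: {backend}")
```

Passing a plain dict as `env` makes the helper easy to check without touching the real environment.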
You can use other supported API keys with Qwen-Agent. For more details, please refer to the Qwen-Agent documentation. To configure the API key, modify the code in lines 44-53 of src/app.py.
Then, run the app.py file with Streamlit:
cd src
streamlit run app.py
cd src
python rag_system.py --api_name [API_NAME] --output_file [OUTPUT_PATH]
The details of the environment setup can be found in the README.md in the src folder.
The evaluation script, which uses GPT-4 to judge the accuracy of the output answers, can be run as follows:
cd src
python evaluate.py --input_path [INPUT_PATH] --output_path [OUTPUT_PATH]
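Once each answer has been judged, accuracy is often reported per group, for example per difficulty level. The sketch below illustrates that aggregation only. The record shape, the `is_correct` field, and the sample values are assumptions made for illustration, since the input and output formats of `evaluate.py` are not described here.

```python
from collections import defaultdict


def accuracy_by_group(records, key="Difficulty_Level"):
    """Compute the fraction of correct answers per group (hypothetical helper)."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        group = rec[key]
        totals[group] += 1
        correct[group] += int(bool(rec["is_correct"]))
    return {group: correct[group] / totals[group] for group in totals}


# Made-up judged records; "is_correct" stands in for the judge's verdict.
records = [
    {"Difficulty_Level": "Easy", "is_correct": True},
    {"Difficulty_Level": "Medium", "is_correct": False},
    {"Difficulty_Level": "Medium", "is_correct": True},
]
scores = accuracy_by_group(records)
print(scores)
```

The same function can group by any other field carried on the records by passing a different `key`.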
- This work is implemented with ReAct, Qwen-Agent, and LangChain. Sincere thanks for their efforts.
- We sincerely thank the contributors and maintainers of crawl4ai for their open-source tool❤️, which helped us get web pages in a Markdown-like format.
- The repo is contributed by Jialong Wu. If you have any questions, please feel free to contact via [email protected] or [email protected], or create an issue.
If this work is helpful, please kindly cite as:
@misc{wu2025webwalker,
title={WebWalker: Benchmarking LLMs in Web Traversal},
author={Jialong Wu and Wenbiao Yin and Yong Jiang and Zhenglin Wang and Zekun Xi and Runnan Fang and Deyu Zhou and Pengjun Xie and Fei Huang},
year={2025},
eprint={2501.07572},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.07572},
}