WebWalker: Benchmarking LLMs in Web Traversal

Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang

Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, Fei Huang

Tongyi Lab, Alibaba Group

👏 Try web traversal via our ModelScope online demo or 🤗 HuggingFace online demo!

[🤖Project] [📄Paper] [🚩Citation]

Repo for WebWalker: Benchmarking LLMs in Web Traversal

📖 Quick Start

🌏 The Online Demo is now available at ModelScope and HuggingFace!
🤗 The WebWalkerQA dataset is available at HuggingFace Datasets!
🤗 The WebWalkerQA Leaderboard is available at HuggingFace Space!

📌 Introduction

We construct a challenging benchmark, WebWalkerQA, composed of 680 queries from four real-world scenarios spanning more than 1,373 webpages.
To tackle web-navigation tasks that require long context, we propose WebWalker, which uses a multi-agent framework for effective memory management.
Extensive experiments show that WebWalkerQA is challenging and that, for information-seeking tasks, vertical exploration within a page proves beneficial.

📚 WebWalkerQA Dataset

Each JSON item in the WebWalkerQA dataset is organized in the following format:

{
  "Question": "When is the paper submission deadline for the ACL 2025 Industry Track, and what is the venue address for the conference?",
  "Answer": "The paper submission deadline for the ACL 2025 Industry Track is March 21, 2025. The conference will be held in Brune-Kreisky-Platz 1.",
  "Root_Url": "https://2025.aclweb.org/",
  "Info": {
    "Hop": "multi-source",
    "Domain": "Conference",
    "Language": "English",
    "Difficulty_Level": "Medium",
    "Source_Website": [
      "https://2025.aclweb.org/calls/industry_track/",
      "https://2025.aclweb.org/venue/"
    ],
    "Golden_Path": ["root->call>student_research_workshop", "root->venue"]
  }
}
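As a quick sanity check on this format, the sketch below reports which of the fields shown in the example are absent from a record. The helper name `missing_fields` and the two field tuples are assumptions made for illustration; they are not part of the WebWalker codebase, and the hosted dataset may name its columns differently.

```python
# Field names are copied from the JSON example above; the helper itself is hypothetical.
REQUIRED_TOP = ("Question", "Answer", "Root_Url", "Info")
REQUIRED_INFO = (
    "Hop",
    "Domain",
    "Language",
    "Difficulty_Level",
    "Source_Website",
    "Golden_Path",
)


def missing_fields(item: dict) -> list:
    """Return dotted names of expected fields that are absent from `item`."""
    missing = [key for key in REQUIRED_TOP if key not in item]
    info = item.get("Info")
    # Only inspect the nested fields when "Info" is present and is a mapping.
    if isinstance(info, dict):
        missing += [f"Info.{key}" for key in REQUIRED_INFO if key not in info]
    return missing


sample = {
    "Question": "When is the paper submission deadline for the ACL 2025 Industry Track?",
    "Answer": "March 21, 2025.",
    "Root_Url": "https://2025.aclweb.org/",
    "Info": {
        "Hop": "multi-source",
        "Domain": "Conference",
        "Language": "English",
        "Difficulty_Level": "Medium",
        "Source_Website": ["https://2025.aclweb.org/calls/industry_track/"],
        "Golden_Path": ["root->venue"],
    },
}

print(missing_fields(sample))
```

An empty list means the record carries every field from the example; anything else names what is missing.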

🤗 The WebWalkerQA Leaderboard is available at HuggingFace!

You can load the dataset via the following code:

from datasets import load_dataset
ds = load_dataset("callanwu/WebWalkerQA", split="main")

Additionally, we provide a collection of approximately 14k silver QA pairs, which have not yet been carefully human-verified. You can load the silver dataset by changing the split to silver.
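Once records are loaded, it can be useful to see how they are distributed over the metadata in `Info`. The following is a minimal sketch that runs on in-memory records; it assumes each record is a dict using the field names from the JSON example above (the hosted dataset may name columns differently), and `count_by_info` is a hypothetical helper, not a function from this repo.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_by_info(records: Iterable[Mapping], field: str) -> Counter:
    """Count records by the value of Info[field]; records lacking it are tallied under None."""
    return Counter(record.get("Info", {}).get(field) for record in records)


# Illustrative placeholder records, not real dataset entries.
records = [
    {"Question": "q1", "Info": {"Hop": "multi-source", "Difficulty_Level": "Medium"}},
    {"Question": "q2", "Info": {"Hop": "multi-source", "Difficulty_Level": "Medium"}},
    {"Question": "q3", "Info": {}},
]

counts = count_by_info(records, "Difficulty_Level")
print(counts)
```

The same function should work on any iterable of dict-like records, so it could be applied to either the main or the silver split if the field names match.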

💡 Performance

📊 Result on Web Agents

The performance on Web Agents is shown below:

📊 Result on RAG-Systems

🤗 The WebWalkerQA Leaderboard is available at HuggingFace!

🚩 Welcome to submit your method to the leaderboard!

🛠 Dependencies

# Create and activate a dedicated environment
conda create -n webwalker python=3.10
conda activate webwalker
git clone https://github.com/alibaba-nlp/WebWalker.git
cd WebWalker
pip install -e .
# Install requirements
pip install -r requirements.txt
# Run post-installation setup
crawl4ai-setup
# Verify your installation
crawl4ai-doctor

💻 Running WebWalker Demo Locally

🔑 Before running, please export the OpenAI API key or the Dashscope API key as an environment variable:

export OPEN_AI_API_KEY=YOUR_API_KEY
export OPEN_AI_API_BASE_URL=YOUR_API_BASE_URL

or

export DASHSCOPE_API_KEY=YOUR_API_KEY

You can also use other API keys supported by Qwen-Agent. For more details, please refer to Qwen-Agent. To configure the API key, modify the code in lines 44-53 of src/app.py.
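Before launching the demo, it may help to confirm that one of the keys above is actually exported. The sketch below only reads the environment variable names given in this section; the function `detect_api_provider`, its return values, and the choice to prefer the OpenAI key when both are set are assumptions for illustration and do not reflect what src/app.py does.

```python
import os
from typing import Mapping, Optional


def detect_api_provider(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return "openai", "dashscope", or None depending on which key is exported."""
    # Assumption: prefer the OpenAI key when both are set.
    if env.get("OPEN_AI_API_KEY"):
        return "openai"
    if env.get("DASHSCOPE_API_KEY"):
        return "dashscope"
    return None


provider = detect_api_provider()
print(provider or "no API key found: export OPEN_AI_API_KEY or DASHSCOPE_API_KEY")
```

Passing a plain dict instead of `os.environ` makes the check easy to try without touching the shell.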

Then, run the app.py file with Streamlit:

cd src
streamlit run app.py

Running RAG-System on WebWalkerQA

cd src
python rag_system.py --api_name [API_NAME] --output_file [OUTPUT_PATH]

The details of environment setup can be found in the README.md in the src folder.

🔍 Evaluation

The evaluation script, which uses GPT-4 to score the accuracy of the output answers, can be run as follows:

cd src
python evaluate.py --input_path [INPUT_PATH] --output_path [OUTPUT_PATH]
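After answers have been judged, accuracy is often broken down by metadata such as difficulty. The sketch below shows one way to do that. The output format of evaluate.py is not documented here, so the boolean `correct` field, the helper `accuracy_by`, and the sample rows (including the "Hard" level) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def accuracy_by(judged: Iterable[Mapping], field: str) -> Dict[str, float]:
    """Compute the fraction of correct answers for each value of Info[field]."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in judged:
        group = row["Info"][field]
        totals[group] += 1
        hits[group] += bool(row["correct"])  # hypothetical judge verdict
    return {group: hits[group] / totals[group] for group in totals}


# Illustrative rows only; "correct" and "Hard" are assumed, not taken from the repo.
judged = [
    {"Info": {"Difficulty_Level": "Medium"}, "correct": True},
    {"Info": {"Difficulty_Level": "Medium"}, "correct": False},
    {"Info": {"Difficulty_Level": "Hard"}, "correct": True},
]

print(accuracy_by(judged, "Difficulty_Level"))
```

Grouping by another `Info` key, such as `Hop` or `Domain`, only requires changing the `field` argument.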

🌻Acknowledgement

This work is built on ReAct, Qwen-Agent, and LangChain. Sincere thanks for their efforts.
We sincerely thank the contributors and maintainers of Crawl4AI for their open-source tool❤️, which helped us get web pages in a Markdown-like format.
The repo is contributed by Jialong Wu. If you have any questions, please feel free to contact via [email protected] or [email protected], or create an issue.

🚩Citation

If this work is helpful, please kindly cite as:

@misc{wu2025webwalker,
      title={WebWalker: Benchmarking LLMs in Web Traversal},
      author={Jialong Wu and Wenbiao Yin and Yong Jiang and Zhenglin Wang and Zekun Xi and Runnan Fang and Deyu Zhou and Pengjun Xie and Fei Huang},
      year={2025},
      eprint={2501.07572},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.07572},
}
