NeuScraper

Source code for our ACL'24 paper: Cleaner Pretraining Corpus Curation with Neural Web Scraping

If you find this work useful, please cite our paper and star the repository.

Quick Start

1️⃣ Download the NeuScraper checkpoint

git lfs install
git clone https://huggingface.co/OpenMatch/neuscraper-v1-clueweb
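
If Git LFS is not set up when the model repository is cloned, large files arrive as small text pointer stubs instead of the real data. The sketch below shows one way to check the downloaded checkpoint before using it. The helper names `is_lfs_pointer` and `check_checkpoint` are hypothetical and not part of NeuScraper.

```python
from pathlib import Path

# Git LFS pointer files are small text stubs that begin with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if `path` looks like a Git LFS pointer rather than real data."""
    with open(path, "rb") as handle:
        head = handle.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


def check_checkpoint(path):
    """Raise a descriptive error if the checkpoint is missing or not downloaded."""
    checkpoint = Path(path)
    if not checkpoint.is_file():
        raise FileNotFoundError(f"Checkpoint not found: {checkpoint}")
    if is_lfs_pointer(checkpoint):
        raise RuntimeError(
            f"{checkpoint} is a Git LFS pointer; run `git lfs pull` in the clone."
        )
    return checkpoint
```

Calling `check_checkpoint('neuscraper-v1-clueweb/training_state_checkpoint.tar')` right after cloning should surface a missing or pointer-only file early, before the model is loaded.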

2️⃣ Clone the repository

git clone https://github.com/MiraclePlus/NeuScraper
cd NeuScraper

3️⃣ Environment

Install PyTorch first:

pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html

Install the other packages:

pip install -r requirements.txt
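
After both install commands, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library. The function name `installed_versions` is hypothetical, and the package names are taken from the pip command above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Report the three packages pinned in the PyTorch install command.
    report = installed_versions(["torch", "torchvision", "torchaudio"])
    for name, version in report.items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```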

4️⃣ Install as a package

Install the neu_scraper package in editable mode:

pip install -e .

You can also build a wheel and install from it:

python setup.py bdist_wheel
pip install dist/neu_scraper-0.1-py3-none-any.whl

5️⃣ Usage

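from neu_scraper import predict
import requests

# Page to process and the checkpoint downloaded in step 1
url = 'https://blog.christianperone.com/2023/06/appreciating-llms-data-pipelines/'
model_path = '../neuscraper-v1-clueweb/training_state_checkpoint.tar'

# Fetch the raw HTML and fail early on HTTP errors
response = requests.get(url, timeout=30)
response.raise_for_status()
html = response.content.decode('utf-8')

# Run NeuScraper on the page and print the result
result = predict(html, url, model_path)
print(result)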
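
To process several pages, the same call can be wrapped in a loop that records failures instead of stopping at the first one. This is only a sketch: `extract_many` is a hypothetical helper, not part of the neu_scraper package, and it assumes `predict_fn` takes `(html, url, model_path)` as in the example above.

```python
def extract_many(pages, model_path, predict_fn):
    """Run `predict_fn` over (url, html) pairs, collecting results and errors.

    `predict_fn` is expected to accept (html, url, model_path), matching the
    call in the usage example; this wrapper itself is hypothetical.
    """
    results, errors = {}, {}
    for url, html in pages:
        try:
            results[url] = predict_fn(html, url, model_path)
        except Exception as exc:  # keep going; record which page failed
            errors[url] = repr(exc)
    return results, errors
```

With the package installed, this would be called as `results, errors = extract_many(pages, model_path, predict)`, where `pages` is a list of `(url, html)` tuples fetched beforehand.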

Citation

@inproceedings{xu2024cleaner,
  title={Cleaner Pretraining Corpus Curation with Neural Web Scraping},
  author={Xu, Zhipeng and Liu, Zhenghao and Yan, Yukun and Liu, Zhiyuan and Xiong, Chenyan and Yu, Ge},
  booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics},
  year={2024}
}

Contact Us

If you have questions, suggestions, or bug reports, please send us an email. We will do our best to help.