Tool that crawls a list of websites given as a CSV, finds emails for each of them, and returns another CSV similar to the one provided, but with an extra column called 'email'.
- You will need Python 2.7 or above for a smooth experience.
- Clone the repository on your machine:

```shell
git clone https://github.com/1MochaChan1/website-email-crawler.git
```
- Create a virtual environment to be safe:

```shell
python -m venv env
```
- Install the requirements using the requirements file:

```shell
python -m pip install -r requirements.txt
```
**Important**: Make sure to have a column named `website` in your `src.csv` file; it should contain all the websites that need to be crawled.
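For illustration, a minimal source file with the required `website` column can be built with Python's standard `csv` module (shown in memory here; the site URLs are placeholders):

```python
import csv
import io

# Build a minimal source file, in memory here, with the required
# "website" column (site URLs are placeholders).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["website"])
writer.writeheader()
writer.writerows([
    {"website": "https://example.com"},
    {"website": "https://example.org"},
])

# Sanity check: the crawler expects this exact header.
buf.seek(0)
header = next(csv.reader(buf))
print(header)  # ['website']
```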
- Ideally you need to go to the root folder after cloning:

```shell
cd website-email-crawler
```

Now you can use the following flags when crawling websites for emails:

- `--src {path_to_source_file}` : specifies the input file from which the data is read. By default it is set to `src.csv`.
- `--res {path_to_result_file}` : (optional) specifies the file you want to store the output in; if the file doesn't exist, it will be created with the same headers as the source (`src`) file. By default it is set to `res.csv`.
- Finally, you can run the `email_spider.py` file using the following command:

```shell
python email_spider.py --src 'path/to/src.csv'
```
- Alternatively, you can navigate to the spiders folder:

```shell
cd email_scraper\spiders
```

and run the `email_spider.py` file from an IDE.
- Enter the path of the file in which the results need to be stored.
- Enter the URL of the website you want to crawl for emails.
- `--cleanup` : Enter the path of the CSV file containing the column "website" to clean it.
- `--keep_colunmns` : Enter the list of columns that you want to keep from the CSV file; the rest of them will be dropped in the final file.
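The column-keeping step above can be sketched with the standard library (the column names and separator here are hypothetical, not the tool's actual defaults):

```python
import csv
import io

# Hypothetical input: three columns, of which we keep only two.
src = io.StringIO(
    "website,company,phone\n"
    "https://example.com,Acme,123\n"
)
keep = ["website", "company"]

# Drop every column that is not in the keep list.
reader = csv.DictReader(src)
kept_rows = [{k: row[k] for k in keep} for row in reader]
print(kept_rows)  # [{'website': 'https://example.com', 'company': 'Acme'}]
```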
- `--crawl_csv` : Use this flag if you want to crawl the websites inside a CSV file.
- `--make_csv_for_verif` : Enter the path of the file containing the scraped emails in single cells. This prepares the document for verification by creating a row for each email found for a specific domain, and creates a **for_verif** file.
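The one-row-per-email transformation described above can be sketched like this (the semicolon separator and column names are assumptions for illustration):

```python
import csv
import io

# Hypothetical scraped output: all emails for a domain in a single cell.
src = io.StringIO(
    "website,email\n"
    "https://example.com,info@example.com;sales@example.com\n"
)

for_verif = []
for row in csv.DictReader(src):
    # Emit one row per email so a verifier can process them individually.
    for email in row["email"].split(";"):
        for_verif.append({"website": row["website"], "email": email})

print(for_verif)
# [{'website': 'https://example.com', 'email': 'info@example.com'},
#  {'website': 'https://example.com', 'email': 'sales@example.com'}]
```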
- `--map_verified_emails` : If your verified list doesn't return the extra data you added to it, this flag helps map your verified emails back to the data in the **for_verif** file.
**Tip**: The ideal order of operations would be to clean the sheet first, then crawl and verify:

1. `--cleanup`
2. `--keep_colunmns`
3. `--crawl_csv`
4. `--make_csv_for_verif`
5. `--map_verified_emails` (if you used MillionVerifier)
- Known limitation: headers and footers of websites are still not parsed; BeautifulSoup (BS4) may be tried for that.
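For context, pulling emails out of raw HTML is typically a regex pass over the page text; a simplified sketch (the pattern and sample HTML are illustrative, not the tool's exact implementation, and obfuscated addresses will be missed):

```python
import re

# Simplified email pattern; unusual but valid addresses may not match.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

html = """
<footer>
  Contact us: <a href="mailto:info@example.com">info@example.com</a>
</footer>
"""

# Deduplicate, since the same address often appears in both href and text.
emails = sorted(set(EMAIL_RE.findall(html)))
print(emails)  # ['info@example.com']
```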