A Python Twitter crawler.
- Supports crawling the timeline of a target user [within a given date range].
Developing
- Support for crawling threads.
Warning
Using TwitterAPI to get user information by uid/screen_name is much faster and safer than the web-scraping method.
- bs4: Beautiful Soup.
- lxml: HTML parser for Beautiful Soup (requires a special installation method on Amazon EC2).
- tqdm: Progress bar in the terminal.
- requestsplus: Self-modified requests package with max retries and a sleep time between requests.
- pushbullet.py: (optional) Notifier for when crawling is finished.
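
The "max retries and sleep time between requests" behavior listed for requestsplus can be pictured with a small sketch. This is not the requestsplus code itself; `fetch_with_retries` and its parameters (`fetch`, `max_retries`, `sleep_seconds`, `sleep`) are hypothetical names chosen for illustration, and only the standard library is used.

```python
import time


def fetch_with_retries(fetch, max_retries=3, sleep_seconds=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or max_retries attempts have failed.

    fetch is any zero-argument callable (hypothetical; for example a wrapped
    HTTP GET). sleep is injectable so the pause can be replaced in tests.
    """
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")

    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return fetch()
        except Exception as error:  # a real crawler would catch narrower errors
            last_error = error
            if attempt < max_retries:
                sleep(sleep_seconds)  # pause before the next attempt
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

Sleeping between attempts keeps a crawler from hammering the server after a transient failure, and capping the number of attempts keeps a permanently failing request from blocking the crawl.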
input = [<screen_name_1>, <screen_name_2>]
access_token = <pushbullet_token>
output_fp = <filepath you want to save>
with GotchaTwitter('timeline', input, output_fp) as gt:
    gt = gt.set_output(save_mode='w', has_header=True) \
           .set_notifier('pushbullet', access_token=access_token)
    gt.crawl()
- Register a Pushbullet account and create an access token in your account settings.
- Download and install the Pushbullet app on your device (tested on iOS).
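
Once the access token exists, a "crawl finished" notification could be sent roughly as follows. This is a sketch only, not how the crawler's `set_notifier` is implemented. `notify_finished` and its `n_rows` argument are hypothetical; the only library call assumed is `push_note(title, body)` on a pushbullet.py `Pushbullet` client.

```python
def notify_finished(client, output_fp, n_rows):
    """Send a 'crawl finished' note through a pushbullet.py-style client.

    client is assumed to expose push_note(title, body), as the Pushbullet
    class in pushbullet.py does. notify_finished is a hypothetical helper.
    """
    title = "Crawl finished"
    body = "Saved {} rows to {}".format(n_rows, output_fp)
    return client.push_note(title, body)


# Example wiring (needs the pushbullet.py package and a real access token):
#   from pushbullet import Pushbullet
#   notify_finished(Pushbullet(access_token), output_fp, n_rows=100)
```

Passing the client in as an argument keeps the helper usable without network access, since any object with a `push_note` method can stand in for the real client.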
sudo yum install libxml2-devel libxslt-devel python-devel gcc
sudo pip install --upgrade setuptools
sudo /usr/local/bin/easy_install lxml