This repository builds a simple spider with Python 2.7. It contains a set of PowerPoint slides and an IPython notebook.
The IPython notebook introduces a series of ideas that make up the basic framework of a crawler: request and response, headers, exception handling, JavaScript rendering, regular expressions, and the BeautifulSoup4 module.
Basic knowledge of Python is needed to understand the content of this workshop. If you want to review our previous workshop, "Python Introduction", please go to "https://goo.gl/KxUzRf"
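As a rough illustration of how these pieces fit together, here is a minimal sketch and not the notebook's actual code. The names `fetch`, `extract_links`, `HEADERS`, and `SAMPLE_HTML`, as well as the User-Agent value, are hypothetical and chosen only for this example. It sends a request with a custom header, handles request errors, and pulls links out of HTML with a regular expression.

```python
from __future__ import print_function  # keeps print() consistent on Python 2.7

import re

# Hypothetical User-Agent value; any descriptive string would do.
HEADERS = {"User-Agent": "workshop-spider/0.1"}

# Matches the href value of <a> tags that use double-quoted attributes.
LINK_PATTERN = re.compile(r'<a\s+[^>]*href="([^"]+)"', re.IGNORECASE)


def fetch(url, timeout=10):
    """Return the page body as text, or None if the request fails."""
    # Imported here so the parsing part below runs even without requests.
    import requests

    try:
        response = requests.get(url, headers=HEADERS, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx status codes into exceptions
    except requests.exceptions.RequestException as error:
        print("Request failed:", error)
        return None
    return response.text


def extract_links(html):
    """Return every double-quoted href value found in the HTML string."""
    return LINK_PATTERN.findall(html)


# A static snippet stands in for a downloaded page so no network is needed.
SAMPLE_HTML = '<p><a href="https://example.com/a">A</a> <a class="x" href="/b">B</a></p>'
links = extract_links(SAMPLE_HTML)
print(links)
```

To crawl a real page, pass the result of `fetch(url)` to `extract_links` after checking that it is not `None`. A regular expression is enough for this toy snippet; for real pages, an HTML parser such as BeautifulSoup4 is usually more robust.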
Requirements: Several modules are needed to run the notebook. a. re (part of the Python standard library, no installation needed) b. selenium c. BeautifulSoup4 d. Jupyter notebook e. phantomjs f. requests
For Mac/Linux users: to install these modules, make sure you have brew installed ("https://brew.sh/"). Once brew is installed, run the following two commands:
- pip install requests selenium BeautifulSoup4 Jupyter
- brew install phantomjs
(Optional) To maintain a clean environment, consider using a virtual environment. For more information, please refer to "https://virtualenv.pypa.io/en/stable/"
For Windows users, please use the Python installed with Anaconda and run the following commands in the Anaconda Prompt (re ships with Python and needs no installation):
conda install -c conda-forge requests
conda install -c conda-forge selenium
conda install -c conda-forge beautifulsoup4
conda install -c conda-forge phantomjs
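After installation, a quick check like the following can confirm that the Python modules are importable. This is a minimal sketch; `check_modules` and `REQUIRED_MODULES` are hypothetical names used only here. Note that BeautifulSoup4 is imported as `bs4`, and that phantomjs is a separate executable, so this check does not cover it.

```python
from __future__ import print_function  # keeps print() consistent on Python 2.7

import importlib

# Import names of the Python modules listed in the requirements.
REQUIRED_MODULES = ["re", "requests", "selenium", "bs4"]


def check_modules(names):
    """Map each module name to True if it can be imported, else False."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


for name, available in sorted(check_modules(REQUIRED_MODULES).items()):
    print(name, "OK" if available else "MISSING")
```

Any module reported as MISSING can be installed again with the pip or conda command for your platform.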