Hi all and welcome to the climate misinformation data science repo!
Feel free to create your own branches and start playing around with the data that is stored in the labelled_data directory.
You will find the text preprocessing and embedding pipeline in the text_preprocessing directory.
In the models directory you will find implementations of several models and their performance evaluation.
The notebooks directory contains additional EDA.
- Set up a virtual environment and add this environment to ipykernel.
The virtual environment allows you to install and use project-specific packages without interfering with other projects. You will need to activate the virtual environment each time before running code.
python3 -m venv ~/venvs/cm-venv
source ~/venvs/cm-venv/bin/activate
pip install -r requirements.txt
python -m ipykernel install --name=cm-venv
- Activate the virtual environment.
source ~/venvs/cm-venv/bin/activate
- Your terminal prompt will now begin with '(cm-venv)'.
- Launch Jupyter Notebook.
jupyter notebook
- Select 'Kernel' > 'Change kernel' > cm-venv
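To confirm the notebook picked up the right kernel, you can run a quick sanity check in a cell (a minimal snippet, assuming the venv path used above):
```
import sys
# Should point inside ~/venvs/cm-venv when the cm-venv kernel is active
print(sys.executable)
```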
Alternatively, you can build a Docker image and run the code inside a container:
```
docker build -t cd-ds .
docker run --rm -it -p 8887:8887 -v "`pwd`":/data cd-ds
```
Then follow the 127.0.0.1 link printed in the terminal to open Jupyter.
So far we have a fairly simple model that classifies articles into one of three categories:
- 0 - Climate denying
- 1 - Climate related (not climate denying)
- 2 - Not climate related
We have experimented with several classification algorithms (e.g. support vector machines, random forests, adaptive boosting) and feature representations (TF-IDF, normalised bag-of-words, word2vec).
So far, the best results have come from a random forest with TF-IDF features.
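For anyone new to the repo, here is a minimal sketch of that kind of TF-IDF + random forest pipeline in scikit-learn. The file path and column names (labelled_data/articles.csv, text, label) are illustrative placeholders, not necessarily the repo's actual layout:
```
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical file/column names -- adjust to the actual labelled_data layout
df = pd.read_csv("labelled_data/articles.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, stratify=df["label"], random_state=42
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=10_000, stop_words="english")),
    ("clf", RandomForestClassifier(n_estimators=200, random_state=42)),
])
pipeline.fit(X_train, y_train)

# Labels: 0 = climate denying, 1 = climate related (not denying), 2 = not climate related
print(classification_report(y_test, pipeline.predict(X_test)))
```
Swapping out the "clf" step makes it easy to compare the other algorithms mentioned above.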