# COVID-19 County-level Web Scraper Project

I created this project because there seems to be no transparent way to determine county-level counts of confirmed cases, deaths, and hospitalizations. Publicly available data from U.S. state health departments is used as input.

## States/territories supported as of 8/18/2020

- [x] Alabama
- [x] Alaska
- [ ] Arizona
- [x] Arkansas
- [x] California
- [x] Colorado
- [x] Connecticut
- [x] Delaware
- [x] Florida
- [x] Georgia
- [ ] Hawaii
- [x] Idaho
- [x] Illinois
- [x] Indiana
- [ ] Iowa
- [ ] Kansas
- [ ] Kentucky
- [x] Louisiana
- [ ] Maine
- [ ] Maryland
- [ ] Massachusetts
- [x] Michigan
- [x] Minnesota
- [x] Mississippi
- [x] Missouri
- [x] Montana
- [x] Nebraska
- [x] Nevada
- [ ] New Hampshire
- [ ] New Jersey
- [x] New Mexico
- [x] New York City
- [ ] New York (excluding NYC)
- [x] North Carolina
- [ ] North Dakota
- [x] Ohio
- [x] Oklahoma
- [ ] Oregon
- [ ] Pennsylvania
- [ ] Rhode Island
- [x] South Carolina
- [ ] South Dakota
- [x] Tennessee
- [x] Texas
- [ ] Utah
- [x] Vermont
- [x] Virginia
- [ ] Washington
- [ ] West Virginia
- [x] Wisconsin
- [ ] Wyoming
- [ ] American Samoa
- [ ] District of Columbia
- [ ] Guam
- [ ] Northern Mariana Islands
- [ ] U.S. Virgin Islands
- [ ] Puerto Rico
- [ ] Palau
- [ ] Federated States of Micronesia
- [ ] Republic of Marshall Islands
- [ ] Navajo Nation

## Breakages

In the roughly 16 hours of development time it took me to write and test these algorithms, three feeds from U.S. state health departments changed slightly. Even these small changes stopped those states from generating output, and their scraping algorithms had to be reworked.

It is likely that continuous development work will be required to keep the scraper project up to date for use in daily reporting.

## Missing data

Some states will never be represented in this project because county-level data is either not published by those states or is too difficult to obtain even with advanced web-scraping techniques.

## Running the code yourself

Install Python 3 and then use `pip` to install the following packages:

```bash
pip install requests
pip install openpyxl
pip install bs4
pip install selenium
```

Some states' data is only accessible through web browser automation, so you will need to install a web driver before you can run the Python code. First install the new Microsoft Edge browser for Windows 10: https://www.microsoft.com/en-us/edge. Note that Edge may already be installed.
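
For reference, the Selenium-based scrapers in this repository create the driver by pointing Selenium at the `msedgedriver.exe` executable. A minimal sketch of that call, using the Alabama page as the example target:

```python
from selenium import webdriver

# point Selenium at the Edge driver executable; the scrapers expect
# msedgedriver.exe to be in the working directory (or on the PATH)
browser = webdriver.Edge('msedgedriver.exe')
browser.get('https://dph1.adph.state.al.us/covid-19/')
```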

Once installed, find the version number of Edge: open Edge, click the ellipsis button at the top right of the screen, and select **Help and Feedback** > **About Microsoft Edge**. Note the version number on the **About** page that appears.

Next, modify the Edge webdriver URL found in the `installEdgeDriver` function of `main.py` so that it matches the version you just saw on the Edge **About** page. Visit https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/ to find a valid URL for your version of Edge, then copy and paste that URL into the Python code. Generally, as long as the major version number on the **About** page matches what's listed on the [Microsoft webdriver website](https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/), it'll probably work.
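
`main.py` is not shown in this commit, so the following is only a sketch of what an `installEdgeDriver` function along these lines might do. The driver URL (including its version segment) and the extraction target are the pieces you would edit, and both are assumptions here:

```python
import io, zipfile, urllib.request

def installEdgeDriver():
    # hypothetical sketch: update the version segment of this URL to match
    # the Edge version shown on the About page
    url = 'https://msedgedriver.azureedge.net/85.0.564.51/edgedriver_win64.zip'

    # download the zip archive and extract msedgedriver.exe into the
    # current working directory, next to the scraper code
    with urllib.request.urlopen(url) as resp:
        with zipfile.ZipFile(io.BytesIO(resp.read())) as zf:
            zf.extract('msedgedriver.exe')
```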

> Edge is updated every few weeks, so changing the Python URL to match your Edge version will likely be required on a periodic basis.

Finally, navigate to the `src` folder and run `main.py`:

```bash
cd src
python main.py
```

Output should start to generate after a few seconds. Web browser windows will appear on occasion; please do not close them, or the scraping operation will fail.

Once the operation completes, open the `src/output` folder to view a timestamped CSV file containing all county-level data for every state included in the scraping operation.
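
The exact columns are determined by the `state_report` and `county_report` modules, which aren't shown in this commit; purely as an illustration, rows might look something like this (column names and values are assumptions, with `-1` standing in for figures a state doesn't publish):

```
state,county,confirmed,deaths,hospitalizations,timestamp
Alaska,Anchorage,1234,12,45,2020-08-18 09:15:00
Alabama,Jefferson,5678,90,-1,2020-08-18 09:15:07
```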

> On Ubuntu or other Linux-based OS distributions, you may need to use the `pip3` command instead of `pip` and `python3` instead of `python`.

> Because this scraping project relies on web drivers to deal with JavaScript-intense pages for a small subset of states, you will need to be running Windows and MS Edge to obtain a full CSV output. A long-term TODO is to use headless Firefox or Chromium so this will run on *nix-based distributions or on Windows Subsystem for Linux (WSL); a sketch of that approach appears below.
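
As a rough sketch of that TODO (not something the current code does), the Edge-specific calls could be swapped for headless Firefox, assuming `geckodriver` is installed and on the PATH:

```python
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

# run Firefox with no visible window, so the Selenium-based scrapers
# could work on a headless Linux host or under WSL
options = Options()
options.add_argument('-headless')
browser = webdriver.Firefox(options=options)
browser.get('https://dph1.adph.state.al.us/covid-19/')
browser.quit()
```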

## Excluding states from the scraping operation

You can exclude states from the scraper by commenting them out in `main.py`. Any state scraper not included in the `scrapers` array will not be run, as in the sketch below.
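
Since `main.py` is not included in the files shown here, this is only a guess at its shape; the per-state module names and the loop are assumptions:

```python
# hypothetical main.py excerpt: each state module exposes a scraper() function
import alabama, alaska, arkansas, california

scrapers = [
    alabama,
    alaska,
    # arkansas,  # commented out, so Arkansas will be skipped
    california,
]

for state_module in scrapers:
    report = state_module.scraper()  # returns a StateReport (or None on failure)
```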

## License

This repository uses code licensed under the terms of the Apache Software License, and is therefore licensed under ASL v2 or later.

The source code in this repository is free: you can redistribute it and/or modify it under the terms of the Apache Software License version 2, or (at your option) any later version.

The source code in this repository is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Apache Software License for more details.

You should have received a copy of the Apache Software License along with this program. If not, see https://www.apache.org/licenses/LICENSE-2.0.html.

Source code forked from other open source projects inherits its original license.

---

**Alaska scraper:**

```python
import requests, json, datetime
import county_report, state_report

STATE_ABBR = 'AK'
STATE = 'Alaska'

def scraper():
    # make an HTTP web request to get the AK JSON from the state's ArcGIS feed
    response = requests.get('https://services1.arcgis.com/WzFsmainVTuD5KML/arcgis/rest/services/Geographic_Distribution_of_Confirmed_Cases/FeatureServer/0/query?where=1%3D1&outFields=*&outSR=4326&f=json')

    if response.status_code == requests.codes.ok:
        # Success - print to the console that the HTTP request succeeded
        print(' ', STATE_ABBR, ': Download succeeded')

        jsonPayload = json.loads(response.text)
        features = jsonPayload['features']

        counties = []

        for feature in features:
            attribute = feature['attributes']

            county_name = attribute['Borough_Census_Area']
            confirmed = int(attribute['All_Cases'])
            hospitalizations = int(attribute['Hospitalizations'])
            deaths = int(attribute['Deaths'])

            county = findCounty(county_name, counties)

            if county is None:
                county = county_report.CountyReport(STATE, county_name, confirmed, deaths, -1, -1, datetime.datetime.now())
                # record the hospitalization count on the new report; the
                # constructor slots above are kept as -1 placeholders
                county.hospitalizations = hospitalizations
                counties.append(county)
            else:
                # the feed can repeat a borough/census area, so accumulate
                county.confirmed += confirmed
                county.hospitalizations += hospitalizations
                county.deaths += deaths

        # print the number of counties we processed
        print(' ', STATE_ABBR, ':', len(counties), 'counties processed OK')

        # build the state-level report object that will include all of the counties
        stateReport = state_report.StateReport(STATE, STATE_ABBR, counties, datetime.datetime.now())

        # return the state-level report
        return stateReport

    else:
        # Fail
        print(' ', STATE_ABBR, ': ERROR : Web download failed - HTTP status code', response.status_code)


def findCounty(county_name, counties):
    # return the matching CountyReport, or None if we haven't seen this area yet
    for county in counties:
        if county.county == county_name:
            return county
```

---

**Alabama scraper:**

```python
import datetime, pathlib, sys, time, os, openpyxl
import county_report, state_report
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

STATE_ABBR = 'AL'
STATE = 'Alabama'

URL = 'https://dph1.adph.state.al.us/covid-19/'

FILE_NAME = 'COVID-19 in Alabama.xlsx'

def scraper():
    counties = []

    # You will need a WebDriver for Edge. See https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
    file_path = pathlib.Path.home().joinpath('Downloads', FILE_NAME)

    # the workbook must not already exist; otherwise the browser saves the new
    # download under a different name and the stale file gets parsed instead
    if os.path.isfile(file_path):
        print(' FAILED on', STATE, ': Please delete', file_path, 'and start the process over. This file must not exist prior to running the scrape operation.')
        return None

    browser = None

    try:
        browser = webdriver.Edge('msedgedriver.exe')
        browser.get(URL)

        # wait (up to 30 seconds) for the workbook download link, then click it
        download_link = WebDriverWait(browser, 30).until(EC.presence_of_element_located((By.XPATH, '/html/body/div[2]/div[2]/div/div[1]/div/div[1]/a[2]')))
        download_link.click()

        # give the browser a few seconds to finish downloading the workbook
        time.sleep(4)

        wb = openpyxl.load_workbook(filename=file_path)
        sheet = wb.worksheets[0]

        # walk the data rows; note that range(2, max_row) stops one row short
        # of the end, leaving out the sheet's final row
        for i in range(2, sheet.max_row):
            rowNum = str(i)
            county_name = sheet['A' + rowNum].value

            if county_name is None or len(county_name) == 0:
                continue

            confirmed = sheet['B' + rowNum].value
            deaths = sheet['D' + rowNum].value

            county = county_report.CountyReport(STATE, county_name, int(confirmed), int(deaths), -1, -1, datetime.datetime.now())
            counties.append(county)  # append the CountyReport to our list of counties

        wb.close()

    except Exception:
        print('Unexpected error:', sys.exc_info()[0])

    if browser is not None:
        browser.quit()

    # clean up the downloaded workbook so the next run starts fresh
    if os.path.isfile(file_path):
        os.remove(file_path)

    # print the number of counties we processed
    print(' ', STATE_ABBR, ':', len(counties), 'counties processed OK')

    # build the state-level report object that will include all of the counties
    stateReport = state_report.StateReport(STATE, STATE_ABBR, counties, datetime.datetime.now())

    # return the state-level report
    return stateReport
```

---

**Arkansas scraper:**

```python
import requests, bs4, datetime
import county_report, state_report

STATE_ABBR = 'AR'
STATE = 'Arkansas'

def scraper():
    # make an HTTP web request to get the AR data
    response = requests.get('https://www.healthy.arkansas.gov/programs-services/topics/covid-19-county-data')

    if response.status_code == requests.codes.ok:
        # Success - print to the console that the HTTP request succeeded
        print(' ', STATE_ABBR, ': Download succeeded')

        # parse the county table rows out of the page HTML
        table = bs4.BeautifulSoup(response.text, features="html.parser").select('table tr')

        counties = []

        # rows 1 through 74 hold the county data; row 0 is the header
        for i in range(1, 75):
            row = table[i].find_all('td')
            county_name = row[0].find('p').getText()
            confirmed = int(row[1].find('p').getText())
            deaths = int(row[3].find('p').getText())

            county = county_report.CountyReport(STATE, county_name, confirmed, deaths, -1, -1, datetime.datetime.now())
            counties.append(county)

        # print the number of counties we processed
        print(' ', STATE_ABBR, ':', len(counties), 'counties processed OK')

        # build the state-level report object that will include all of the counties
        stateReport = state_report.StateReport(STATE, STATE_ABBR, counties, datetime.datetime.now())

        # return the state-level report
        return stateReport

    else:
        # Fail
        print(' ', STATE_ABBR, ': ERROR : Web download failed - HTTP status code', response.status_code)


def findCounty(county_name, counties):
    # return the matching CountyReport, or None if not present
    # (unused here, but kept for consistency with the other scrapers)
    for county in counties:
        if county.county == county_name:
            return county
```

---

**California scraper:**

```python
import requests, csv, datetime
import county_report, state_report

STATE_ABBR = 'CA'
STATE = 'California'

def scraper():
    # make an HTTP web request to get the CA CSV file
    response = requests.get('https://data.ca.gov/dataset/590188d5-8545-4c93-a9a0-e230f0db7290/resource/926fd08f-cc91-4828-af38-bd45de97f8c3/download/statewide_cases.csv')

    if response.status_code == requests.codes.ok:
        # Success - print to the console that the HTTP request succeeded
        print(' ', STATE_ABBR, ': Download succeeded')

        csvData = response.text

        # read the in-memory string using the 'csv' module so we can iterate over each row
        csvReader = csv.reader(csvData.splitlines(), delimiter=',', quotechar='"')

        # create a list that will contain our county data
        counties = []

        # iterate over every row in the CSV
        for row in csvReader:
            # skip the header row
            if row[0] == 'county':
                continue

            county_name = row[0]

            # counts are sometimes formatted as floats (e.g. '123.0') and are
            # occasionally empty, so parse defensively
            confirmedStr = row[1]
            confirmed = 0
            if '.' in confirmedStr:
                confirmed = int(float(confirmedStr))
            elif len(confirmedStr) > 0:
                confirmed = int(confirmedStr)

            deathsStr = row[2]
            deaths = 0
            if '.' in deathsStr:
                deaths = int(float(deathsStr))
            elif len(deathsStr) > 0:
                deaths = int(deathsStr)

            county = findCounty(county_name, counties)

            if county is None:
                county = county_report.CountyReport(STATE, county_name, confirmed, deaths, -1, -1, datetime.datetime.now())
                counties.append(county)  # append the CountyReport to our list of counties
            else:
                # the CSV lists cumulative counts per county per date, so each
                # later row overwrites the earlier one, leaving the most
                # recent totals
                county.confirmed = confirmed
                county.deaths = deaths

        # print the number of counties we processed
        print(' ', STATE_ABBR, ':', len(counties), 'counties processed OK')

        # build the state-level report object that will include all of the counties
        stateReport = state_report.StateReport(STATE, STATE_ABBR, counties, datetime.datetime.now())

        # return the state-level report
        return stateReport

    else:
        # Fail
        print(' ', STATE_ABBR, ': ERROR : Download failed - HTTP status code', response.status_code)


def findCounty(county_name, counties):
    # return the matching CountyReport, or None if not present
    for county in counties:
        if county.county == county_name:
            return county
```
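
The `county_report` and `state_report` modules that every scraper imports are among the commit's 37 files but are not shown above. To experiment with a single scraper on its own, minimal stand-ins along these lines should suffice. The constructor argument order is inferred from the call sites above; the meaning of the two `-1` slots is an assumption (the Alaska scraper treats one of them as `hospitalizations`), and `recoveries` is purely a guessed name:

```python
# county_report.py (minimal stand-in; field order inferred from the
# CountyReport(...) calls in the scrapers above)
class CountyReport:
    def __init__(self, state, county, confirmed, deaths,
                 hospitalizations, recoveries, timestamp):
        self.state = state
        self.county = county
        self.confirmed = confirmed
        self.deaths = deaths
        self.hospitalizations = hospitalizations  # -1 where a state publishes no figure
        self.recoveries = recoveries              # guessed name for the second -1 slot
        self.timestamp = timestamp

# state_report.py (minimal stand-in)
class StateReport:
    def __init__(self, state, abbr, counties, timestamp):
        self.state = state
        self.abbr = abbr
        self.counties = counties
        self.timestamp = timestamp
```

With stand-ins like these importable as `county_report` and `state_report`, an individual state's `scraper()` function can be called directly and the resulting `StateReport.counties` list inspected.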