This page describes how to walk through the full GaNCH data workflow, so that you can replicate the process for your own region. Links from the GaNCH project are included to provide context, but of course your region may have unique local needs.
- Download and install free open source software (and one proprietary program) to manipulate data and upload it to Wikidata. See the Resources page for more.
- Git - Version Control software, used to protect your work against accidental deletion.
- Visual Studio Code - Text and code editor with a built-in terminal. Free, with an easy-to-learn interface.
- Edit CSV extension - Edit CSV files in spreadsheet format within VS Code. We don't use MS Excel since it reformats content (dates, etc.).
- markdownlint extension - Helps with formatting markdown files. We're writing all our documentation in markdown because it provides rich text formatting.
- Python extension - Python language support for VS Code.
- OpenRefine - Data wrangling software which is also used to upload to Wikidata.
- Microsoft Excel - You'll use Excel (or LibreOffice Calc) to perform data formatting and cleanup with formulas to prepare CSVs. Be careful NOT to use Excel for most editing, as it tends to reformat dates, phone numbers, etc.
- Reach out to partner organizations and individuals in your region to identify data sources.
- Data sources can be found on the web, or may be emailed to you directly from a partner organization. Screen the data for Personally Identifying Information (PII) such as personal email addresses. If the PII is already available publicly on the web (e.g. a staff member's personal email address published on the organization's website), we include it. However, if the PII can be replaced with non-personal information (e.g. a general contact address for the organization), we use the non-personal information.
- Create a data dictionary that defines what fields you're going to use, how those fields match up to the Wikidata schema, and whether they're going to be required, recommended, or optional. This will help you format your data correctly for Wikidata, prioritize work, and explain to other folks what you're doing.
- Reformat the data from each source into its own spreadsheet matching a CSV template.
- HTML lists from websites (like the GAMAG dataset) are reformatted into CSV by hand using VS Code's multi-cursor capabilities for batch editing.
- Straightforward website tables (like the GPLSV dataset) are harvested using the HTML Table to CSV/Excel Converter.
- Complex tables (like the GHRAC dataset) are harvested using simple Python scripts which dump data into pipe-delimited text files (a minimal sketch of this kind of script follows directly below). These are then imported into Excel, reformatted to match the template, and exported as CSV.
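The GHRAC harvesting scripts themselves aren't reproduced here; the sketch below just shows the general approach, assuming a simple HTML table of rows and cells and the third-party requests and beautifulsoup4 packages. The URL and output filename are placeholders, not the actual GHRAC source.

```python
"""Minimal sketch: dump an HTML table into a pipe-delimited text file.

Assumes a simple <table> of <tr>/<td> rows and the third-party requests and
beautifulsoup4 packages (pip install requests beautifulsoup4). The URL and
output filename are placeholders.
"""
import requests
from bs4 import BeautifulSoup

URL = "https://example.org/directory.html"  # placeholder source page

html = requests.get(URL, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

with open("harvest.txt", "w", encoding="utf-8") as out:
    for row in soup.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        if cells:
            # Pipe-delimited, so commas inside names and addresses
            # don't break the columns when imported into Excel.
            out.write("|".join(cells) + "\n")
```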
- Make sure to include fields for the Reference URL (REF URL) and Retrieval Date (RET DAT) for the data, to provide references in Wikidata.
- If you're working with several datasets and you know that some organizations will be duplicated across datasets, create an index that will help you de-dupe organizations as you add on new datasets. This will prevent you from wasting time and energy by performing the Update & Source step multiple times for the same organization.
- Generate the new dataset and back it up using Git (see below) so you have a snapshot of the whole dataset.
- Search Wikidata by organization name to see if there is already a record that you have created or edited for that organization. You may have to try a few name variations to be sure the organization hasn't already been created or edited. This of course works best if you're doing one new dataset at a time. (One way to script this search is sketched after these de-duplication steps.)
- Delete duplicated record rows in the new dataset -- that way you're only updating and sourcing data for each organization record once.
- If you accidentally deleted an organization in your new dataset that wasn't a duplicate, use Git to view the deleted record and copy-and-paste it back into your dataset to update and source.
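For large datasets, the name search can be scripted against the Wikidata API. This is a minimal sketch using the wbsearchentities module; the sample organization name and User-Agent string are placeholders, and matches still need to be reviewed by hand, since similarly named organizations exist elsewhere.

```python
"""Minimal sketch: search Wikidata by organization name via wbsearchentities.

The sample name and User-Agent string are placeholders; review matches by hand.
"""
import requests

def search_wikidata(name, limit=10):
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbsearchentities",
            "search": name,
            "language": "en",
            "type": "item",
            "format": "json",
            "limit": limit,
        },
        headers={"User-Agent": "dedupe-check/0.1 (you@example.org)"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("search", [])

for hit in search_wikidata("Example County Historical Society"):
    print(hit["id"], "-", hit.get("label", ""), "-", hit.get("description", ""))
```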
- As you work, use Git to save your work as you go. You can get fancy with Git, but for our work we mostly use `git pull`, `git add .`, `git commit -m "what changed"`, and `git push`.
- At the end of this step, you will have several datasets formatted to match your CSV template.
- Starting with web-based research, verify that the data is correct.
- Look at each organization provided on the partner's list, and try to find that organization on the web.
- Make sure that the data that was provided by the partner matches what is on the web, since datasets can become out-of-date relatively quickly.
- If you find a more up-to-date fact, update the CSV spreadsheet and record the source of the updated information in the REF URL field, and the date you made the update in the RET DAT field.
- Whenever possible, use the Internet Archive Wayback Machine's Save Page Now tool to provide a REF URL that also records the date that the fact was true. (A scripted sketch of this appears at the end of this step.)
- If the partner's list is correct, you can cite that list in the REF URL and RET DAT fields.
- If the information you find is ambiguous or confusing, perform email and phone research to reach out to the organization for clarification.
- Cite the email or phone conversation in the "Source Notes" field of the CSV spreadsheet.
- You can then cite your CSV as the REF URL source (yes, it's recursive, but this way you can record that the information was updated via phone call or email, like a MARC 670a Source citation). Once it's uploaded to Wikidata, your CSV can serve as the source/reference for the corrected information.
(Screenshot: the phone number field, the phone number REF URL field, and the phone number RET DAT field, with several corrected phone numbers underlined in red alongside the REF URLs and RET DATs recording where and when each correction was made.)
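Saving a page with the Wayback Machine can also be scripted when you have many reference URLs. The sketch below uses the simple, unauthenticated web.archive.org/save/ endpoint; the Internet Archive also offers an authenticated Save Page Now API, and the simple endpoint's behavior (redirects, rate limits) can change, so verify the returned snapshot URL before recording it as your REF URL. The example page is a placeholder.

```python
"""Hedged sketch: archive a reference page and build REF URL / RET DAT values.

Uses the simple, unauthenticated https://web.archive.org/save/ endpoint; check
the returned snapshot URL before using it, since this behavior can change.
"""
import datetime
import requests

def save_page_now(url):
    # Following redirects should land on the new snapshot, e.g.
    # https://web.archive.org/web/<timestamp>/<original-url>
    resp = requests.get("https://web.archive.org/save/" + url, timeout=120)
    resp.raise_for_status()
    return resp.url

ref_url = save_page_now("https://example.org/about-us")  # placeholder page
ret_dat = datetime.date.today().isoformat()
print("REF URL:", ref_url)
print("RET DAT:", ret_dat)
```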
- Generate county information.
- Using a free online tool like Geocod.io or MapLarge's Geocoder (free version limited to batches of 100), generate county information for each organization's address. (A single-address API sketch follows this step.)
- These tools are not exact, so you'll check to make sure that the county information is correct during the Quality Control step below.
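For one-off lookups, the same information can be pulled from Geocod.io's REST API. The sketch below is an assumption-heavy illustration: the API version in the endpoint path, the parameter names, and the response fields should all be checked against Geocod.io's current documentation, and the API key and address are placeholders.

```python
"""Hedged sketch: look up the county for a single address via Geocod.io.

The endpoint version, parameter names, and response fields should be verified
against Geocod.io's current docs; the API key and address are placeholders.
"""
import requests

GEOCODIO_API_KEY = "YOUR_API_KEY"  # placeholder

def lookup_county(address):
    resp = requests.get(
        "https://api.geocod.io/v1.7/geocode",
        params={"q": address, "api_key": GEOCODIO_API_KEY},
        timeout=30,
    )
    resp.raise_for_status()
    results = resp.json().get("results", [])
    if not results:
        return None
    # County is expected in the address components of the best match
    return results[0].get("address_components", {}).get("county")

print(lookup_county("123 Main St, Anytown, GA"))  # placeholder address
```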
- Look up coordinate locations in Google Maps.
- Search by organization name or address to locate the physical location of the organization.
- Note that Wikidata requires you to ingest coordinate locations in decimal format (e.g. "34.435818, -84.702066"), but displays them in DMS format (e.g. "34° 26′ 8.95″ N, 84° 42′ 7.44″ W"). (A small decimal-to-DMS conversion sketch follows these steps.)
- In Google Maps, right-click on the organization's physical location and left-click on "What's here?"
- Left-click on the coordinates.
- Select and copy the decimal format coordinates in the left panel. This is what you'll use for the coordinate location.
- Click the Back button in your browser to return to the named location on Google Maps. Then select and copy the URL for that named location up through the coordinates and zoom level (it ends in a z). That's your REF URL for the coordinates, since it includes the named organization.
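You don't need to convert anything by hand for the ingest (decimal is what you paste in), but if you want to sanity-check how Wikidata will display a coordinate, a quick conversion looks like this; the sample coordinate is the one used in the note above.

```python
def decimal_to_dms(lat, lon):
    """Convert decimal-degree coordinates to a DMS string like Wikidata displays."""
    def convert(value, positive, negative):
        hemisphere = positive if value >= 0 else negative
        value = abs(value)
        degrees = int(value)
        minutes_full = (value - degrees) * 60
        minutes = int(minutes_full)
        seconds = round((minutes_full - minutes) * 60, 2)
        return f"{degrees}° {minutes}′ {seconds}″ {hemisphere}"
    return f"{convert(lat, 'N', 'S')}, {convert(lon, 'E', 'W')}"

# ≈ 34° 26′ 8.94″ N, 84° 42′ 7.44″ W (Wikidata's own rounding may differ slightly)
print(decimal_to_dms(34.435818, -84.702066))
```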
- Before beginning the reconciliation process for your dataset, see if your region is well-described geographically in Wikidata.
- To save time during reconciliation, make sure that administrative regions (counties, boroughs, parishes, territories, districts, census areas, consolidated city-counties, etc.), municipalities (cities and towns with a local government), and unincorporated communities (small towns without local governments) are already in Wikidata and have helpful descriptions (so you can tell your region's "Springfield" or "Franklin County" from all the others in the world).
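One way to take stock is a quick query against the Wikidata Query Service. The sketch below lists the counties of one U.S. state; the QIDs shown (Q47168 for "county of the United States" and Q1428 for Georgia, the U.S. state) are believed correct for the GaNCH region but should be verified, and you would swap in the classes and QIDs that fit your own administrative regions.

```python
"""Hedged sketch: list the counties of one U.S. state already in Wikidata.

Swap in the QIDs for your own region; verify the ones below in Wikidata first.
"""
import requests

QUERY = """
SELECT ?county ?countyLabel ?countyDescription WHERE {
  ?county wdt:P31 wd:Q47168 ;   # instance of: county of the United States
          wdt:P131 wd:Q1428 .   # located in: Georgia (U.S. state)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY ?countyLabel
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "region-coverage-check/0.1 (you@example.org)"},
    timeout=60,
)
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    name = row["countyLabel"]["value"]
    description = row.get("countyDescription", {}).get("value", "(no description)")
    print(name, "-", description)
```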
- Using OpenRefine, reconcile your spreadsheet, build your schema, check for issues, and preview the results.
- On the Rows tab, reconcile those fields that can be reconciled against Wikidata (organization label, instance of, city, state, county, country, parent organization, and subsidiaries). Unique fields (phone number, email address, official website, etc.) don't have to be reconciled.
- On the Schema tab, build out the schema to map your fields to the matching properties in Wikidata. Use your REF URL and RET DAT fields to create field-specific references.
- On the Issues tab, check for typos, conflicts, or other problems. Resolve what you can.
- On the Preview tab, do a final check to make sure that the sample records are displaying the way you want them to. Go back and fix any problems you notice.
- NOTE: OpenRefine is awesome, but there are some challenges to be aware of:
- When reconciling or building the schema, OpenRefine will time out if the Wikidata query service server is running slowly. If it's lagging too much, do some other kind of work and come back later when the server isn't so busy.
- OpenRefine doesn't give you a report after uploading to Wikidata, so if records were skipped you won't know unless you specifically look for them. You can catch these by doing a post-ingest check, or during the Quality Control step.
- You can't create a new item in OpenRefine and create relationships for that item (e.g. Parent Organization and Subsidiaries) at the same time. Create the new items first (removing the Parent Org and Subsidiary fields from the schema), then do a separate upload of the relationship fields after the items already exist. Since the items and their relationship statements aren't created simultaneously, OpenRefine may choke on inverse-dependent relationships.
- Updates are uploaded to Wikidata.
- A team member who didn't work on the original dataset reviews the records in Wikidata for any errors.
- Each coordinate location should be checked closely, since coordinates are prone to being incorrect.
- On the organization record in Wikidata, right-click on the coordinate location and open it in a new tab.
- This will take you to the coordinate location on the GeoHack website. Click the Google Maps link at the top of the screen to open the coordinate location in Google Maps. Zoom in to see if the coordinate location pin sits on top of the correct location. If the location isn't labeled, you may have to use Street View to confirm that it's the correct location. If the location is correct, move on to the next record.
- If the coordinate location is incorrect, search to find the correct location (which may take some sleuthing).
- As the reviewer checks (and potentially corrects) each record in Wikidata, they mark that they've checked it in the dataset's QC column by adding the date reviewed and their initials (e.g. 2019-12-03 CL).
- Since quality control is done in Wikidata, your dataset will no longer be the "source of truth" after ingest. The datasets exist to carry data and references into Wikidata and to support quality control; after that, Wikidata, not your datasets, is the source of truth.