Data Background

The data for this repository consists of bill data scraped using the unitedstates/congress open source scraper, enriched with a number of other public data sources. Where necessary, we have built scrapers for this project that download, process, store and update the data. The default location for the bill data is in a congress/data directory in the root of this repository.

The data is updated through a Django/Celery process described in UPDATES_CELERY.

Bulk downloads: bill metadata

For maintainers of this project, we have created an S3 bucket, s3://flatgov, which contains the scraped and processed metadata and bill XML for Congresses 110-116 and the beginning of Congress 117.

The metadata can be downloaded in bulk from publicly available storage maintained by ProPublica: https://www.propublica.org/datastore/dataset/congressional-data-bulk-legislation-bills

Bulk historical metadata is available for one-year ranges. Data for the current Congress is updated twice daily. The metadata is sufficient to test many of the functions in this library. Bill text comparison requires bulk download or scraping of the text.

A script to download and organize these files is provided in scripts/bulk_downloads.sh.

To start, it is sufficient to download a couple of recent congresses (e.g. 117, 116). Unzip, and copy the deepest directory into a single hierarchy so that you have:

FlatGovDir
   |
   -congress
       |
       -data
          |
          -117
          -116

Processing Metadata: Background

For each bill a metadata file is created in congress/data/relatedbills with the following form, combining data from the original data.json with additional information from other bills:

116hr1ih.json

{
  "titles": [...],
  "titles_whole_bill": [...],
  "cosponsors": ["name1", "name2", ...],
  "related_bills": [],
  "related": [
    {
      "116s1356": {
        "titles": [],
        "sponsor": {},
        "cosponsors": [],
        "identified_by": "CRS"
      }
    }
  ]
}

Here 'titles' includes all titles, and 'titles_whole_bill' includes only those where "is_for_portion": false (see below).
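As a minimal sketch of the distinction between all titles and whole-bill titles, the snippet below splits a "titles" list from data.json into the two lists. The function name split_titles, the de-duplication, and the second sample entry are assumptions made for illustration; they are not taken from the project code.

```python
def split_titles(titles):
    """Return (all_titles, whole_bill_titles) as de-duplicated lists of strings.

    `titles` is the "titles" list from a bill's data.json.
    """
    all_titles = []
    whole_bill = []
    for entry in titles:
        text = entry["title"]
        if text not in all_titles:
            all_titles.append(text)
        # Entries with "is_for_portion": false describe the bill as a whole.
        if not entry.get("is_for_portion", False) and text not in whole_bill:
            whole_bill.append(text)
    return all_titles, whole_bill


# Hypothetical sample; the second entry is invented to show the filter.
sample = [
    {"as": "introduced", "is_for_portion": False,
     "title": "INVEST in America Act", "type": "short"},
    {"as": "introduced", "is_for_portion": True,
     "title": "Hypothetical Portion Title", "type": "short"},
]
all_titles, whole_bill = split_titles(sample)
```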

Cosponsors

This information is available for each bill in the data.json file. Two key fields in each cosponsor entry are name and bioguide_id.

Bill Titles

This information is available for each bill (and version) in the data.json file, for example /congress/data/116/bills/hr/hr3/data.json. After collecting titles for each bill, a reverse index can be created, with the title as key and an array of bill numbers as value. This identifies the bills across congresses that share identical titles.

The title information in data.json is of the form:

"titles": [
    {
      "as": "introduced",
      "is_for_portion": false,
      "title": "INVEST in America Act",
      "type": "short"
    },
    {
      "as": "introduced",
      "is_for_portion": false,
      "title": "INVEST in America Act",
      "type": "short"
    },
    {
      "as": "introduced",
      "is_for_portion": false,
      "title": "Investing in a New Vision for the Environment and Surface Transportation in America Act",
      "type": "short"
    } ...
]
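The reverse index described in this section (title as key, array of bill numbers as value) could be sketched as follows. The function names and the sample bill numbers are hypothetical, and the input is assumed to be a dict mapping each bill number to the "titles" list from its data.json.

```python
from collections import defaultdict


def build_title_index(bills):
    """Map each title to the sorted list of bill numbers that carry it.

    `bills` maps a bill number such as "116hr2" to the "titles" list
    from that bill's data.json.
    """
    index = defaultdict(set)
    for bill_number, titles in bills.items():
        for entry in titles:
            index[entry["title"]].add(bill_number)
    return {title: sorted(numbers) for title, numbers in index.items()}


def shared_titles(index):
    """Keep only the titles that appear on more than one bill."""
    return {title: nums for title, nums in index.items() if len(nums) > 1}


# Hypothetical sample data for illustration only.
bills = {
    "116hr2": [{"title": "INVEST in America Act", "is_for_portion": False}],
    "117hr9999": [{"title": "INVEST in America Act", "is_for_portion": False}],
    "116s1": [{"title": "Some Other Act", "is_for_portion": False}],
}
title_index = build_title_index(bills)
```

Using a set per title avoids listing a bill twice when the same title is repeated across versions, as in the example above.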

Similarity measures: simple

  • Simple similarity measures are obtained using the relatedBills.py file, which expands upon the functionality provided by billdata.py and process_bill_meta.py.

  • In relatedBills.py, getRelatedBills() is the top-level function; it calls several functions that each obtain a simple similarity measure.

A few 'simple' measures of similarity can be taken. Bills may share:

  • Identical titles

  • Very similar titles (e.g. all but the year)

  • Identical sponsor lists

  • Significant overlap in sponsors

This can be represented in a summary JSON file, relatedBills.json, in either of two forms:

{
  116s130: {
    same_titles: ['116hr201', ...]
  }
}

OR

{
  116s130: [
    {
      billCongressTypeNumber: '116hr201',
      cosponsors: [bioguide_id1, bioguide_id2],
      titles: ['Shared Title 1', 'Shared Title 2', etc.],
      similar_title: ['Similar (nonidentical) Title 1', 'Similar (nonidentical) Title 2', etc.]
    }...
  ]
}
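As an illustration of the sponsor-overlap measure, the sketch below builds entries shaped like the second form above from a mapping of bill numbers to bioguide_id lists. The function name cosponsor_overlap, the min_shared threshold, and the sample identifiers are assumptions; the project may define "significant overlap" differently.

```python
def cosponsor_overlap(bill_cosponsors, min_shared=1):
    """Build {bill: [{billCongressTypeNumber, cosponsors}, ...]}.

    `bill_cosponsors` maps a bill number to an iterable of bioguide_id strings.
    Two bills are linked when they share at least `min_shared` cosponsors.
    """
    related = {}
    for bill, ids in bill_cosponsors.items():
        own = set(ids)
        entries = []
        for other, other_ids in bill_cosponsors.items():
            if other == bill:
                continue
            shared = sorted(own & set(other_ids))
            if len(shared) >= min_shared:
                entries.append(
                    {"billCongressTypeNumber": other, "cosponsors": shared}
                )
        if entries:
            related[bill] = entries
    return related


# Hypothetical bioguide ids for illustration only.
sample = {
    "116s130": ["A000001", "B000002", "C000003"],
    "116hr201": ["B000002", "C000003"],
    "116hr5": ["Z000009"],
}
summary = cosponsor_overlap(sample, min_shared=2)
```

This pairwise comparison is quadratic in the number of bills, so it is only suitable for small batches; a reverse index from bioguide_id to bills would scale better.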

(Same)Titles

relatedBills.py finds shared titles by creating an index keyed by bill number that holds the bill metadata; each similarity measure is then attached to the corresponding bill number in the index. For example, after the index is created, a getSameTitles function is run, which loops through the index and creates a list of titles for each bill number. A bill number with more than one title may indicate that the bill has more than one version. Identical titles on different bill numbers may indicate identical bills.

Cosponsors

Legislator information is downloaded from YAML files maintained in the unitedstates/congress-legislators repository: 'https://raw.githubusercontent.com/unitedstates/congress-legislators/master/legislators-current.yaml'

This data is downloaded and updated in the database in flatgov/common/cosponsor.py. The Committee data is also updated and associations between Committees and their members are created.

Similar Title

TODO
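This measure is not yet described in the source. One possible approach to the "all but the year" case listed earlier, offered only as a sketch, is to normalize titles before comparing them. The names below and the choice to strip four-digit years are assumptions, not the project's implementation.

```python
import re

# Matches a standalone four-digit year such as 2019
# (assumption: only years 1900-2099 matter here).
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def normalize_title(title):
    """Lower-case a title, drop any year, and collapse whitespace."""
    without_year = YEAR.sub("", title)
    return " ".join(without_year.lower().split())


def similar_but_not_identical(title_a, title_b):
    """True when two titles differ only by year, case, or spacing."""
    return title_a != title_b and normalize_title(title_a) == normalize_title(title_b)
```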

Similarity calculation

For any bill (e.g. 116hr100ih), we want to find related bills from previous congresses. Congress.gov lists related bills for the same congress (e.g. for "hr2"). There are many ways of calculating similarity.

For efficiency and performance, we have developed a similarity measure that is built on a search engine model. In particular, we build an index of document headers and sections in Elasticsearch. We then calculate the similarity between any input text and the sections in the index using the ES/Lucene 'more like this' query. We combine the section-to-section similarity scores to yield an overall bill similarity measure. For more details, see https://github.com/aih/FlatGov/blob/master/server_py/flatgov/elasticsearch/README.adoc
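The two steps above (a 'more like this' query per section, then combining section scores per bill) can be sketched as below. This is an illustration under stated assumptions: the field name section_text, the parameter values, and the choice to sum scores are hypothetical, and the linked README describes what the project actually does. The query body uses only standard more_like_this options (fields, like, min_term_freq, min_doc_freq).

```python
from collections import defaultdict


def more_like_this_query(section_text, field="section_text", max_hits=20):
    """Build an Elasticsearch request body using the more_like_this query.

    `field` is a hypothetical name for the indexed section text.
    """
    return {
        "size": max_hits,
        "query": {
            "more_like_this": {
                "fields": [field],
                "like": section_text,
                "min_term_freq": 1,
                "min_doc_freq": 1,
            }
        },
    }


def combine_section_scores(section_hits):
    """Sum section-level scores per bill and rank bills by the total.

    `section_hits` is a list of (bill_number, score) pairs, one per
    matching section. Summing is an assumption made for this sketch.
    """
    totals = defaultdict(float)
    for bill_number, score in section_hits:
        totals[bill_number] += score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


body = more_like_this_query("Sample section text about surface transportation.")
ranked = combine_section_scores([("116hr1", 10.0), ("115hr5", 4.0), ("116hr1", 2.5)])
```

The request body would be passed to an Elasticsearch search call against the section index; the client call is omitted here so that the sketch runs without a server.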

Statement of Administration Policy

The metadata for the Statements of Administration Policy has been scraped and stored in JSON files. The PDFs are stored in the media directory.

Load Policy Data for Obama Administration to Database

  • Go to ~/…/FlatGov/server_py and activate the virtualenv

$ cd ~/.../FlatGov/server_py
$ source .venv/bin/activate

  • Go to (flatgov) ~/…/FlatGov/server_py/flatgov$

$ cd flatgov

  • Apply all migrations

./manage.py makemigrations
./manage.py migrate

  • Load Statement of Administration Policy data

./manage.py loaddata dumped_statements.json

Statements of Administration Policy for Trump Administration

  1. The PDF files are stored in the media directory. We load the data into the dumped_statements.json file with a management command.

  2. Run the following command to load the data into the database.

./manage.py load_statements

Statements of Administration Policy data for Biden Administration

To load the Biden statements data, run the following management command:

./manage.py biden_statements

CRS Reports

CRS reports are scraped from the everycrsreport website. For each report, an attempt is made to associate it with bills by a combination of: the report title, the report metadata from everycrsreport, and the report html from the site. For example, if the report title includes H. R. 200, that bill is associated with the report in our database.
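As a rough sketch of the title-matching step, the snippet below pulls bill citations such as "H. R. 200" out of a report title and normalizes them. The regular expression and the function name are assumptions made for illustration and cover only House and Senate bills; the project's scraper may use different rules.

```python
import re

# Matches citations such as "H. R. 200", "H.R. 200", "HR200" or "S. 1356".
# Assumption: only House and Senate bills (hr, s); resolutions are ignored.
BILL_CITATION = re.compile(r"\b(H\.?\s?R\.?|S\.?)\s?(\d+)\b")


def find_bill_citations(text):
    """Return normalized bill identifiers (e.g. "hr200") in order of first appearance."""
    found = []
    for chamber, number in BILL_CITATION.findall(text):
        bill_type = "hr" if chamber.upper().startswith("H") else "s"
        bill_id = f"{bill_type}{number}"
        if bill_id not in found:
            found.append(bill_id)
    return found


citations = find_bill_citations("Overview of H. R. 200 and S. 1356; see also H.R. 200")
```

Note that the result carries no Congress number, since the citation text alone does not say which Congress the bill belongs to.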

The scraper, and its instructions, are described in CRS_REPORTS.

There is a many-to-many association between reports and bills: more than one report may be associated with a bill, and a report may mention more than one bill.

To download the CRS report data in csv format, go to /crs/csv-report/

Figure 1: CRS Report CSV table

Figure 2: CRS Report in Django Admin
Note: In most cases, the report data does not state explicitly which Congress a bill refers to (e.g. H. R. 200). We have made an initial association of the bill with the Congress in session on the date of the report's publication. This leads to many, possibly a majority, of the bills being mis-associated. There are several sources of this error: the date we are using may reflect a much later 'update' to a report; a report may refer to historical bills; and, in particular, reports from the early part of the year (January or February) may refer to the previous year's bills. We attempt to handle the last case by adding bills from January/February to both congresses. In addition, many reports refer to Public Laws, and we do not attempt to associate the P.L. with a bill (this would be more accurate, since the P.L. number includes the Congress).
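The date-based association described in the note can be sketched as follows. The function names are hypothetical, and the calculation is a simplification: it maps a calendar year to a Congress number and ignores the fact that a new Congress convenes a few days into January.

```python
from datetime import date


def congress_for_year(year):
    """Congress number for a calendar year (the 1st Congress began in 1789)."""
    return (year - 1789) // 2 + 1


def candidate_congresses(report_date):
    """Congresses to associate with bills cited in a report published on `report_date`."""
    candidates = {congress_for_year(report_date.year)}
    # January/February reports may still discuss the previous year's bills,
    # so the previous year's Congress is added as well.
    if report_date.month in (1, 2):
        candidates.add(congress_for_year(report_date.year - 1))
    return sorted(candidates)
```

For a report dated in January of an odd year this yields two Congresses; in an even year the previous year falls in the same Congress, so only one is returned.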

Committee Documents

TODO: add detail about sources of scraping, setting up and running a Celery task.

The scraper, and its instructions, are described in Scraping: Relevant Committee Documents.