The data for this repository consists of bill data scraped using the unitedstates/congress open source scraper, enriched with a number of other public data sources. Where necessary, we have built scrapers for this project that download, process, store and update the data. The default location for the bill data is the congress/data directory in the root of this repository.
The data is updated through a Django/Celery process described in UPDATES_CELERY.
For maintainers of this project, we have created an S3 bucket, s3://flatgov, which contains the scraped and processed metadata and bill XML for Congresses 110-116, and the beginning of Congress 117.
The metadata can be downloaded in bulk from publicly available storage maintained by ProPublica here: https://www.propublica.org/datastore/dataset/congressional-data-bulk-legislation-bills
For example:
https://s3.amazonaws.com/pp-projects-static/congress/bills/116.zip?_ga=2.93052057.587919213.1601076458-407540952.1601076458
https://s3.amazonaws.com/pp-projects-static/congress/bills/115.zip?_ga=2.159617977.587919213.1601076458-407540952.1601076458
…
https://s3.amazonaws.com/pp-projects-static/congress/bills/110.zip?_ga=2.159617977.587919213.1601076458-407540952.1601076458
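The example links above follow one pattern per Congress, so a download can be sketched in a few lines of Python. This is only an illustration: the function names and the `downloads` directory are hypothetical, the tracking query string is assumed to be optional, and the repository's own script remains the supported route.

```python
import urllib.request
import zipfile
from pathlib import Path

# Base URL taken from the example links above, without the tracking query string.
BULK_BASE_URL = "https://s3.amazonaws.com/pp-projects-static/congress/bills"


def bulk_zip_url(congress: int) -> str:
    """Return the bulk metadata URL for one Congress (e.g. 116)."""
    return f"{BULK_BASE_URL}/{congress}.zip"


def download_bulk_zip(congress: int, dest_dir: str = "downloads") -> Path:
    """Download one archive and extract it under dest_dir/<congress>/ (hypothetical layout)."""
    target = Path(dest_dir)
    target.mkdir(parents=True, exist_ok=True)
    zip_path = target / f"{congress}.zip"
    urllib.request.urlretrieve(bulk_zip_url(congress), zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target / str(congress))
    return target / str(congress)
```

The extracted files would still need to be arranged into the congress/data hierarchy by hand or by the provided script.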
Bulk historical metadata is available for one-year ranges. Data for the current Congress is updated twice daily. The metadata is sufficient to test many of the functions in this library. Bill text comparison requires bulk download or scraping of the text.
A script to download and organize these files is provided in scripts/bulk_downloads.sh.
To start, it is sufficient to download a couple of recent Congresses (e.g. 117, 116). Unzip, and copy the deepest directory into a single hierarchy so that you have:
FlatGovDir
|
-congress
  |
  -data
    |
    -117
    -116
For each bill, a metadata file is created in congress/data/relatedbills with the following form, combining data from the original data.json with additional information from other bills:
116hr1ih.json
{
  titles: [...],
  titles_whole_bill: [...],
  cosponsors: ['name1', 'name2', ...],
  related_bills: [],
  related: [
    {116s1356: {
        titles: [],
        sponsor: {},
        cosponsors: [],
        identified_by: "CRS"
      }
    },
  ]
}
Where 'titles' includes all titles and 'titles_whole_bill' includes those where "is_for_portion": false (see below).
This information is available for each bill in the data.json file. Two key fields in sponsors are name and bioguide_id.
This information is available for each bill (and version) in the data.json file, for example in /congress/data/116/bills/hr/hr3/data.json. After collecting titles for each bill, a reverse index can be created, with the title as key and an array of bill numbers as value. This will identify the bills across Congresses that share identical titles.
The title information in data.json is of the form:
"titles": [
  {
    "as": "introduced",
    "is_for_portion": false,
    "title": "INVEST in America Act",
    "type": "short"
  },
  {
    "as": "introduced",
    "is_for_portion": false,
    "title": "INVEST in America Act",
    "type": "short"
  },
  {
    "as": "introduced",
    "is_for_portion": false,
    "title": "Investing in a New Vision for the Environment and Surface Transportation in America Act",
    "type": "short"
  } ...
]
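The reverse index described above (title as key, list of bill numbers as value) can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names are hypothetical, the directory pattern is inferred from the example path /congress/data/116/bills/hr/hr3/data.json, and only titles with "is_for_portion": false are indexed.

```python
import json
from collections import defaultdict
from pathlib import Path


def whole_bill_titles(bill_data: dict) -> list[str]:
    """Return the distinct titles that apply to the whole bill, in first-seen order."""
    seen = []
    for entry in bill_data.get("titles", []):
        title = entry.get("title")
        if title and not entry.get("is_for_portion", False) and title not in seen:
            seen.append(title)
    return seen


def build_title_index(bills: dict[str, dict]) -> dict[str, list[str]]:
    """Map each title to the sorted bill numbers that carry it."""
    index = defaultdict(set)
    for bill_number, bill_data in bills.items():
        for title in whole_bill_titles(bill_data):
            index[title].add(bill_number)
    return {title: sorted(numbers) for title, numbers in index.items()}


def load_bills(data_root: str) -> dict[str, dict]:
    """Read every data.json under data_root, keyed by congress + bill id (e.g. '116hr3')."""
    bills = {}
    # Assumed layout: <data_root>/<congress>/bills/<type>/<bill>/data.json
    for path in Path(data_root).glob("*/bills/*/*/data.json"):
        data = json.loads(path.read_text(encoding="utf-8"))
        congress = path.parts[-5]
        bills[f"{congress}{path.parent.name}"] = data
    return bills
```

Any title whose list holds more than one bill number points to bills that share an identical title, for example `build_title_index(load_bills("congress/data"))`.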
- Simple similarity measures are obtained using the relatedBills.py file, which expands upon the functionality provided by billdata.py and process_bill_meta.py.
- relatedBills.py uses getRelatedBills() as a higher-order function containing several functions that obtain simple similarity measures.
A few 'simple' measures can be taken of similarity. Bills which share:
- Identical titles
- Very similar titles (e.g. all but the year)
- Identical sponsor lists
- Significant overlap in sponsors
This can be represented in a summary JSON of the form:
relatedBills.json
116s130: {
  same_titles: ['116hr201', ...]
}
OR
116s130: [
  { billCongressTypeNumber: '116hr201',
    cosponsors: [bioguide_id1, bioguide_id2],
    titles: ['Shared Title 1', 'Shared Title 2', etc.],
    similar_title: ['Similar (nonidentical) Title 1', 'Similar (nonidentical) Title 2', etc.]
  }...
]
relatedBills.py does this by creating a bill number index with the bill metadata; any similarity measure is subsequently attributed to the corresponding bill number in the index. For example, after the index is created, a "getSameTitles" function is run, which loops through the index and creates a list of titles for each bill number. A bill number with more than one title would then indicate that the bill has more than one version of itself. Identical titles would indicate identical bills with different bill numbers.
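The "very similar titles" and "overlap in sponsors" measures listed earlier could be computed along the following lines. This is a hedged sketch rather than the code in relatedBills.py: the function names, the year-stripping rule, the Jaccard overlap and the input keys `titles` and `cosponsors` are all assumptions made for illustration.

```python
import re

# Matches a 4-digit year such as "2019"; an assumption about what "all but the year" means.
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def normalize_title(title: str) -> str:
    """Lower-case a title, drop years, and collapse whitespace."""
    without_year = YEAR_PATTERN.sub("", title.lower())
    return " ".join(without_year.split())


def cosponsor_overlap(first: set[str], second: set[str]) -> float:
    """Jaccard overlap of two sets of bioguide ids (0.0 when both are empty)."""
    union = first | second
    return len(first & second) / len(union) if union else 0.0


def compare_bills(a: dict, b: dict) -> dict:
    """Compare two records with hypothetical keys 'titles' and 'cosponsors'."""
    titles_a, titles_b = set(a["titles"]), set(b["titles"])
    shared = titles_a & titles_b
    # Titles of b that are not identical matches, reduced to their year-free form.
    norm_b = {normalize_title(t) for t in titles_b - shared}
    similar = sorted(t for t in titles_a - shared if normalize_title(t) in norm_b)
    sponsors_a, sponsors_b = set(a["cosponsors"]), set(b["cosponsors"])
    return {
        "titles": sorted(shared),
        "similar_title": similar,
        "cosponsors": sorted(sponsors_a & sponsors_b),
        "cosponsor_overlap": cosponsor_overlap(sponsors_a, sponsors_b),
    }
```

The returned keys mirror the second summary form shown above; what counts as "significant" overlap is left to a threshold chosen by the caller.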
Legislator information is downloaded from YAML files maintained in the unitedstates/congress-legislators repository:
'https://raw.githubusercontent.com/unitedstates/congress-legislators/master/legislators-current.yaml'
This data is downloaded and updated in the database in flatgov/common/cosponsor.py. The Committee data is also updated and associations between Committees and their members are created.
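To show how the legislator records relate to the bioguide_id values used for sponsors, here is a small sketch that keys already-parsed records by bioguide id. It is not the code in flatgov/common/cosponsor.py: the function name and output fields are hypothetical, and it assumes the YAML has been parsed into a list of dictionaries beforehand (for instance with PyYAML's `yaml.safe_load`), with `id.bioguide`, `name` and `terms` entries as published in the congress-legislators files.

```python
def index_legislators(records: list[dict]) -> dict[str, dict]:
    """Key parsed legislator records by bioguide id, keeping a few summary fields."""
    index = {}
    for record in records:
        bioguide_id = record.get("id", {}).get("bioguide")
        if not bioguide_id:
            continue  # skip entries without a bioguide id
        name = record.get("name", {})
        terms = record.get("terms") or [{}]
        latest = terms[-1]  # assumes terms are listed oldest first
        full_name = name.get("official_full") or " ".join(
            part for part in (name.get("first"), name.get("last")) if part
        )
        index[bioguide_id] = {
            "name": full_name,
            "state": latest.get("state"),
            "party": latest.get("party"),
            "chamber": latest.get("type"),
        }
    return index
```

A lookup such as `index_legislators(records).get(bioguide_id)` then resolves a cosponsor id from data.json to a name and current term.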
For any bill (e.g. 116hr100ih), we want to find related bills from previous Congresses. Related bills within the same Congress are listed on Congress.gov (e.g. for hr2). There are many ways of calculating similarity.
For efficiency and performance, we have developed a similarity measure that builds on a search engine model. In particular, we build an index of document headers and sections in Elasticsearch. We then calculate the similarity between any input text and sections in the index using the ES/Lucene 'more like this' metric. We combine the section-to-section similarity scores to yield an overall bill similarity measure. For more details, see https://github.com/aih/FlatGov/blob/master/server_py/flatgov/elasticsearch/README.adoc
NOTE: A comparable bill text similarity engine is here: https://github.com/govtrack/govtrack.us-web/blob/master/analysis/text_incorporation.py
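The section-level approach described above can be illustrated with two small helpers: one builds an Elasticsearch `more_like_this` query body for a section of input text, the other combines section hits into per-bill scores. This is a sketch under stated assumptions, not the project's implementation: the field names `section_text` and `bill_number`, the parameter values, and the rule of summing each input section's best score per bill are all hypothetical.

```python
from collections import defaultdict


def more_like_this_body(text: str, field: str = "section_text", size: int = 20) -> dict:
    """Build a search body that asks for indexed sections similar to the given text."""
    return {
        "size": size,
        "query": {
            "more_like_this": {
                "fields": [field],  # hypothetical field holding the section text
                "like": text,
                "min_term_freq": 1,
                "max_query_terms": 50,
                "min_doc_freq": 1,
            }
        },
    }


def bill_scores(section_hits: list[list[dict]]) -> list[tuple[str, float]]:
    """Sum, per candidate bill, the best score each input section achieved against it."""
    totals = defaultdict(float)
    for hits in section_hits:  # one hit list per section of the input bill
        best = {}
        for hit in hits:
            bill = hit["_source"]["bill_number"]  # hypothetical field name
            best[bill] = max(best.get(bill, 0.0), hit["_score"])
        for bill, score in best.items():
            totals[bill] += score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Each body would be sent to the search endpoint of the sections index, and the `hits.hits` list of each response passed to `bill_scores`. Taking the best hit per section keeps one long bill from dominating simply because it has many weakly matching sections.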
The metadata for the Statement of Administration Policy section has been scraped and stored in JSON files. The PDFs are stored in the media directory.
- Activate the virtualenv and go to (flatgov) ~/…/FlatGov/server_py$
$ cd ~/.../FlatGov/server_py
$ source .venv/bin/activate
- Go to (flatgov) ~/…/FlatGov/server_py/flatgov$
$ cd flatgov
- Apply all migrations
./manage.py makemigrations
./manage.py migrate
- Load Statement of Administration Policy data
./manage.py loaddata dumped_statements.json
- The PDF files are stored in the media directory. We load data into the dumped_statements.json file with a management command.
- Run the following command to load the data into the database.
./manage.py load_statements
To load the Biden statements data, run the following management command:
./manage.py biden_statements
CRS reports are scraped from the everycrsreport website. For each report, an attempt is made to associate it with bills by a combination of: the report title, the report metadata from everycrsreport, and the report HTML from the site. For example, if the report title includes H. R. 200, that bill is associated with the report in our database.
The scraper, and its instructions, are described in CRS_REPORTS.
There is a many-to-many association between reports and bills: more than one report may be associated with a bill, and a report may mention more than one bill.
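Finding bill citations such as H. R. 200 in a report title or body could look roughly like the sketch below. It is illustrative only and not the scraper's code: the function name and the regular expression are assumptions, and only House bills ("H. R." or "H.R.") and Senate bills ("S.") are handled.

```python
import re

# The lookbehind skips abbreviations such as "U.S. 12", where "S." is not a bill prefix.
BILL_CITATION = re.compile(r"(?<![\w.])(H\.\s?R\.|S\.)\s?(\d+)")


def bills_in_text(text: str) -> list[str]:
    """Return bill ids such as 'hr200' or 's47' in order of first appearance."""
    found = []
    for prefix, number in BILL_CITATION.findall(text):
        bill_type = "hr" if prefix.startswith("H") else "s"
        bill_id = f"{bill_type}{int(number)}"
        if bill_id not in found:
            found.append(bill_id)
    return found
```

Because one text can yield several ids and the same id can turn up in many reports, the result maps naturally onto the many-to-many association described above.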
To download the CRS report data in csv format, go to /crs/csv-report/
NOTE: In most cases, it is not explicit in the report data which Congress a bill refers to (e.g. H. R. 200). We have made an initial association of the bill with the Congress in session on the date of the report publication. This leads to many, possibly a majority, of the bills being mis-associated. There are many sources of this error: the date we are using may reflect a much later 'update' to a report; a report may refer to historical bills; in particular, reports published in the early part of the year (January or February) may refer to the previous year's bills. We attempt to handle this by adding bills from January/February to both Congresses. In addition, many reports refer to Public Laws, and we do not attempt to associate the P.L. with a bill (this would be more accurate, since the P.L. number includes the Congress).
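The date-based association described in the note can be illustrated as follows. This is a simplified sketch, not the project's code: the function names are hypothetical, a Congress is treated as covering two whole calendar years (ignoring the changeover in early January), and the January/February rule is read as "also include the Congress of the previous year when it differs".

```python
from datetime import date


def congress_for_year(year: int) -> int:
    """Congress in session for most of a calendar year (the 1st Congress began in 1789)."""
    return (year - 1789) // 2 + 1


def candidate_congresses(published: date) -> list[int]:
    """Congresses to associate with a bill mentioned in a report published on this date."""
    current = congress_for_year(published.year)
    candidates = [current]
    if published.month in (1, 2):
        # Early-year reports may still be discussing the previous year's bills.
        previous = congress_for_year(published.year - 1)
        if previous != current:
            candidates.append(previous)
    return candidates
```

Under these assumptions a report published in January 2021 is linked to both the 117th and 116th Congresses, while one published in June 2019 is linked only to the 116th.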
TODO: add detail about sources of scraping, setting up and running a Celery task.
The scraper, and its instructions, are described in Scraping: Relevant Committee Documents.