Contributing guidelines

- Reporting bugs
- Development
  - New features
  - Bug fixes
- Getting Started
- Data Model

Reporting bugs

1. First check whether the bug has already been reported.
2. If it has not, create a bug report issue.

Development

New features

Right now all new feature development is handled internally.

Bug fixes

Most bug fixes are handled internally, but we will accept pull requests for bug fixes if you first:

1. Create an issue describing the bug (see Reporting bugs).
2. Get approval from DDG staff before working on it. Since most bug fixes and feature development are handled internally, we want to make sure that your work doesn't conflict with any current projects.

Getting Started

Pre-Requisites

- PostgreSQL database
- PhantomJS 2.1.1
- Perl
- compare (from ImageMagick)
- pkill

Should run on many varieties of Linux/*BSD.

Setup

Install required Perl modules via cpanfile:

cpanm --installdeps .

Connect to PostgreSQL with psql and create the tables needed by the crawler:

\i sql/full_urls.sql
\i sql/https_crawl.sql
\i sql/mixed_assets.sql
Run \i in the same way for the remaining files under sql/.

Create a copy of the crawler configuration file:

cp config.yml.example config.yml

Edit the settings as necessary for your system.

If you have a source of URLs that you would like crawled for a host, add them to the full_urls table:

insert into full_urls (host, url) values ('duckduckgo.com', 'https://duckduckgo.com/?q=privacy'), ...

The crawler will attempt to get URLs from the home page even if none are available in this table.
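When the URLs come from a file or another system, the multi-row insert above can be generated instead of typed by hand. The following is a minimal Python sketch and not part of the crawler: build_full_urls_insert is a hypothetical helper, the second URL is only an example, and the %s placeholders assume a PostgreSQL driver that uses that style (psycopg2, for instance).

```python
from urllib.parse import urlsplit


def build_full_urls_insert(urls):
    """Return (sql, params) for a multi-row insert into full_urls.

    Hypothetical helper for illustration; the host is taken from each URL.
    """
    rows = []
    for url in urls:
        parts = urlsplit(url)
        # full_urls.url is a complete URL with scheme, so reject anything else.
        if parts.scheme not in ("http", "https") or not parts.hostname:
            raise ValueError(f"not a complete URL: {url!r}")
        rows.append((parts.hostname, url))
    if not rows:
        raise ValueError("no URLs given")
    # One "(%s, %s)" group per row; values are passed separately to the driver.
    placeholders = ", ".join(["(%s, %s)"] * len(rows))
    sql = f"insert into full_urls (host, url) values {placeholders}"
    params = [value for row in rows for value in row]
    return sql, params


sql, params = build_full_urls_insert([
    "https://duckduckgo.com/?q=privacy",
    "https://duckduckgo.com/about",
])
```

Passing the values as parameters rather than formatting them into the statement avoids quoting problems with URLs that contain apostrophes.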

Running the crawler

Add hosts to be crawled to the https_queue table:

insert into https_queue (domain) values ('duckduckgo.com');

The crawler can be run as follows:

perl -Mlib=/path/to/smarter-encryption https_crawl.pl -c /path/to/config.yml

Checking the results

The individual HTTP and HTTPs comparisons for each URL crawled are stored in https_crawl:

select * from https_crawl where domain = 'duckduckgo.com' order by id desc limit 10;

The maximum number of URLs per crawl session (the limit above) is set by URLS_PER_SITE.

Aggregate session data for each host is stored in https_crawl_aggregate:

select * from https_crawl_aggregate where domain = 'duckduckgo.com';

There is also an associated view, https_upgrade_metrics, that calculates some additional metrics:

select * from https_upgrade_metrics where domain = 'duckduckgo.com';

Additional information from the crawl can be found in:

ssl_cert_info
mixed_assets
https_response_headers

Hosts can be selected based on various combinations of criteria directly from the above tables or by using the upgradeable_domains function.

Data Model

full_urls

Complete URLs for hosts that will be used in addition to those the crawler extracts from the home page.

Column	Description	Type	Key
host	hostname	text	unique
url	Complete URL with scheme	text	unique
updated	When added to table	timestamp with time zone

https_queue

Domains to be crawled in rank order. Multiple crawlers can access this concurrently.

Column	Description	Type	Key
rank	Processing order	integer	primary
domain	Domain to be crawled	character varying(500)
processing_host	Hostname of server processing domain	character varying(50)
worker_pid	Process ID of crawler handling domain	integer
reserved	When domain was selected for processing	timestamp with time zone
started	When processing of domain started	timestamp with time zone
finished	When processing of domain completed	timestamp with time zone
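The reserved, processing_host and worker_pid columns suggest how several crawlers can share this queue: each worker claims the lowest-ranked row that nobody has reserved and records who took it. The Python sketch below models that idea in memory only. It is an assumption about the workflow, not the crawler's actual Perl or SQL, and claim_next, the host names and the PIDs are hypothetical.

```python
from datetime import datetime, timezone


def claim_next(queue, processing_host, worker_pid):
    """Reserve the lowest-rank unreserved row and return it, or None.

    `queue` is a list of dicts shaped like https_queue rows. In a real
    database the select and the update would have to happen atomically.
    """
    unreserved = [row for row in queue if row["reserved"] is None]
    if not unreserved:
        return None
    row = min(unreserved, key=lambda r: r["rank"])
    row["processing_host"] = processing_host
    row["worker_pid"] = worker_pid
    row["reserved"] = datetime.now(timezone.utc)
    return row


queue = [
    {"rank": 2, "domain": "example.org", "processing_host": None,
     "worker_pid": None, "reserved": None},
    {"rank": 1, "domain": "duckduckgo.com", "processing_host": None,
     "worker_pid": None, "reserved": None},
]
first = claim_next(queue, "crawler-a", 1001)   # takes rank 1
second = claim_next(queue, "crawler-b", 2002)  # takes rank 2
third = claim_next(queue, "crawler-a", 1001)   # nothing left to claim
```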

https_crawl

Log table of HTTP and HTTPs comparisons made by the crawler.

Column	Description	Type	Key
id	Comparison ID	bigint	unique
domain	Domain evaluated	text
http_request_uri	Resulting URI of HTTP request	text
http_response	HTTP status code for HTTP request	integer
http_requests	Total requests made, including child subrequests, for HTTP request	integer
http_size	Size of HTTP response (bytes)	integer
https_request_uri	Resulting URI of HTTPs request	text
https_response	HTTP status code for HTTPs request	integer
https_requests	Total requests made, including child subrequests, for HTTPs request	integer
https_size	Size of HTTPs response (bytes)	integer
timestamp	When inserted	timestamp with time zone
screenshot_diff	Percentage difference between HTTP and HTTPs screenshots after page load	real
autoupgrade	Whether HTTP request was redirected to HTTPs	boolean
mixed	Whether HTTPs request had HTTP child requests	boolean

mixed_assets

HTTP child requests made for HTTPs.

Column	Description	Type	Key
https_crawl_id	https_crawl.id, only associated with https_* columns	bigint	unique/foreign
asset	URI of HTTP subrequest made during HTTPs request	text	unique
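To make the relationship between https_crawl.mixed and this table concrete, here is a small Python sketch of the idea: given the subrequests observed while loading an HTTPs page, keep the ones that used plain HTTP. find_mixed_assets and the example URIs are hypothetical, and the crawler itself may collect these differently.

```python
from urllib.parse import urlsplit


def find_mixed_assets(page_uri, subrequest_uris):
    """Return the plain-HTTP subrequests made while loading an HTTPS page."""
    if urlsplit(page_uri).scheme != "https":
        return []  # mixed content only applies to HTTPS pages
    assets = []
    for uri in subrequest_uris:
        # urlsplit lowercases the scheme, so "HTTP://" is matched as well.
        if urlsplit(uri).scheme == "http" and uri not in assets:
            assets.append(uri)  # keep each asset once, in first-seen order
    return assets


assets = find_mixed_assets(
    "https://example.com/",
    [
        "https://example.com/app.js",
        "http://example.com/logo.png",
        "http://example.com/logo.png",
        "HTTP://cdn.example.com/font.woff",
    ],
)
mixed = bool(assets)  # corresponds to the https_crawl.mixed flag
```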

https_response_headers

The response headers for HTTPs requests.

Column	Description	Type	Key
https_crawl_id	https_crawl.id, only associated with https_* columns	bigint	unique/foreign
response_headers	key/value of all HTTPs response headers	jsonb

ssl_cert_info

SSL certificate information for domains crawled.

Column	Description	Type	Key
domain	Domain evaluated	text	primary
issuer	Issuer of SSL certificate	text
notbefore	Valid from timestamp	timestamp with time zone
notafter	Valid to timestamp	timestamp with time zone
host_valid	Whether the domain is covered by the SSL certificate	boolean
err	Connection error	text
updated	When last updated	timestamp with time zone

https_crawl_aggregate

Aggregate of https_crawl that holds the latest crawl session for each domain. It can also include domains that were redirected to but not directly crawled.

Column	Description	Type	Key
domain	Domain evaluated	text	primary
https	Comparisons where only HTTPs was supported	integer
http_and_https	Comparisons where HTTP and HTTPs were supported	integer
http	Comparisons where only HTTP was supported	integer
https_errs	Number of non-2xx HTTPs responses	integer
unknown	Comparisons where neither HTTP nor HTTPs responses were valid or the status codes differed	integer
autoupgrade	Comparisons where HTTP was redirected to HTTPs	integer
mixed_requests	HTTPs requests that made HTTP calls	integer
max_screenshot_diff	Maximum percentage difference between HTTP and HTTPs screenshots	real
redirects	Number of HTTPs requests redirected to different host	integer
requests	Number of comparison requests actually made during the crawl session	integer
session_request_limit	The number of comparisons wanted for the session	integer
is_redirect	Whether the domain was actually crawled or is a redirect from another host in the table that was crawled	boolean
max_https_crawl_id	https_crawl.id of last comparison made during crawl session	bigint
redirect_hosts	key/value pairs of hosts and the number of redirects to each	jsonb
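Several of these columns follow directly from the https_crawl column descriptions. The Python sketch below derives only those from a list of https_crawl-style rows; it is an illustration, not the aggregation the project runs. aggregate_session is a hypothetical name, counting a missing HTTPs response as an error is an assumption, and the https, http_and_https, http and unknown counts are left out because their exact rules are not stated here.

```python
def aggregate_session(rows):
    """Summarize https_crawl-style rows (dicts) from one crawl session."""
    diffs = [r["screenshot_diff"] for r in rows if r["screenshot_diff"] is not None]

    def is_https_err(row):
        status = row["https_response"]
        # Assumption: no response at all counts as a non-2xx response.
        return status is None or not 200 <= status < 300

    return {
        "requests": len(rows),
        "https_errs": sum(1 for r in rows if is_https_err(r)),
        "autoupgrade": sum(1 for r in rows if r["autoupgrade"]),
        "mixed_requests": sum(1 for r in rows if r["mixed"]),
        "max_screenshot_diff": max(diffs, default=None),
        "max_https_crawl_id": max((r["id"] for r in rows), default=None),
    }


rows = [
    {"id": 10, "https_response": 200, "screenshot_diff": 0.0,
     "autoupgrade": True, "mixed": False},
    {"id": 11, "https_response": 404, "screenshot_diff": 2.5,
     "autoupgrade": True, "mixed": True},
    {"id": 12, "https_response": 200, "screenshot_diff": None,
     "autoupgrade": False, "mixed": False},
]
summary = aggregate_session(rows)
```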

https_upgrade_metrics

View of https_crawl_aggregate that calculates crawl session percentages for easier selection based on cutoffs.

Column	Description	Type
domain	Domain evaluated	text
unknown_pct	Percentage of unknown	real
combined_pct	Percentage that supported HTTPs	real
https_err_rate	Percentage of https_errs	real
max_screenshot_diff	https_crawl_aggregate.max_screenshot_diff	real
mixed_ok	Whether HTTPs requests contained mixed content requests	boolean
autoupgrade_pct	Percentage of autoupgrade	real
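As a rough illustration of what the view provides, the Python sketch below turns https_crawl_aggregate counts into percentages. It is not the view's definition: using requests as the denominator and computing combined_pct from https plus http_and_https are assumptions, upgrade_metrics is a hypothetical name, and mixed_ok is omitted because its exact rule is not given here.

```python
def pct(count, total):
    """Percentage of `count` in `total`; 0.0 when the session made no requests."""
    return 100.0 * count / total if total else 0.0


def upgrade_metrics(agg):
    """Derive percentage metrics from one https_crawl_aggregate-style dict."""
    total = agg["requests"]  # assumed denominator for every percentage
    return {
        "domain": agg["domain"],
        "unknown_pct": pct(agg["unknown"], total),
        # Assumption: "supported HTTPs" means HTTPs-only plus both schemes.
        "combined_pct": pct(agg["https"] + agg["http_and_https"], total),
        "https_err_rate": pct(agg["https_errs"], total),
        "autoupgrade_pct": pct(agg["autoupgrade"], total),
        "max_screenshot_diff": agg["max_screenshot_diff"],
    }


metrics = upgrade_metrics({
    "domain": "duckduckgo.com", "requests": 20,
    "https": 5, "http_and_https": 13, "http": 1, "unknown": 1,
    "https_errs": 2, "autoupgrade": 10, "max_screenshot_diff": 1.5,
})
```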

domain_exceptions

For manually excluding domains that may otherwise pass specific upgrade criteria given to upgradeable_domains.

Column	Description	Type	Key
domain	Domain to exclude	text	primary
comment	Reason for exclusion	text
updated	When added	timestamp with time zone

upgradeable_domains

Function to select domains based on a variety of criteria.

Parameter	Description	Type	Source
autoupgrade_min	Minimum autoupgrade percentage	real	https_upgrade_metrics
combined_min	Minimum percentage of HTTPs responses	real	https_upgrade_metrics
screenshot_diff_max	Maximum observed screenshot diff allowed	real	https_upgrade_metrics
mixed_ok	Whether to allow domains that had mixed content	boolean	https_upgrade_metrics
max_err_rate	Maximum https_err_rate	real	https_upgrade_metrics
unknown_max	Maximum unknown comparisons	real	https_upgrade_metrics
ssl_cert_buffer	SSL certificate must be valid until this timestamp	timestamp with time zone	ssl_cert_info
exclude_issuers	Array of SSL cert issuers to exclude	text array	ssl_cert_info

In addition to the above parameters, the function enforces several other conditions:

- Domain must not be in domain_exceptions
- From values in ssl_cert_info:
  1. No err
  2. The domain, or host, must be valid for the certificate.
  3. Valid from/to and the issuer must not be null
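The parameters and conditions above can be read as a single predicate per domain. The following Python sketch spells that reading out; it is not the SQL function itself. is_upgradeable is a hypothetical name, the threshold values and issuer names are made-up examples, inclusive comparisons (>= for minimums, <= for maximums) are an assumption, and has_mixed_content is a hypothetical field standing in for the view's mixed content flag.

```python
from datetime import datetime, timezone


def is_upgradeable(domain, metrics, cert, exceptions, criteria):
    """Return True if `domain` passes every check described above."""
    if domain in exceptions:  # domain_exceptions
        return False
    # ssl_cert_info: no err, host covered, and no null validity or issuer.
    if cert is None or cert["err"] is not None or not cert["host_valid"]:
        return False
    if cert["notbefore"] is None or cert["notafter"] is None or cert["issuer"] is None:
        return False
    if cert["notafter"] < criteria["ssl_cert_buffer"]:
        return False
    if cert["issuer"] in criteria["exclude_issuers"]:
        return False
    if metrics["has_mixed_content"] and not criteria["mixed_ok"]:
        return False
    return (
        metrics["autoupgrade_pct"] >= criteria["autoupgrade_min"]
        and metrics["combined_pct"] >= criteria["combined_min"]
        and metrics["max_screenshot_diff"] <= criteria["screenshot_diff_max"]
        and metrics["https_err_rate"] <= criteria["max_err_rate"]
        and metrics["unknown_pct"] <= criteria["unknown_max"]
    )


utc = timezone.utc
criteria = {  # example thresholds only
    "autoupgrade_min": 0.0, "combined_min": 90.0, "screenshot_diff_max": 5.0,
    "mixed_ok": False, "max_err_rate": 10.0, "unknown_max": 5.0,
    "ssl_cert_buffer": datetime(2030, 1, 1, tzinfo=utc),
    "exclude_issuers": ["Untrusted Example CA"],
}
metrics = {
    "autoupgrade_pct": 50.0, "combined_pct": 95.0, "max_screenshot_diff": 1.5,
    "https_err_rate": 2.0, "unknown_pct": 1.0, "has_mixed_content": False,
}
cert = {
    "err": None, "host_valid": True, "issuer": "Example CA",
    "notbefore": datetime(2029, 6, 1, tzinfo=utc),
    "notafter": datetime(2030, 6, 1, tzinfo=utc),
}

ok = is_upgradeable("duckduckgo.com", metrics, cert, set(), criteria)
excluded = is_upgradeable("duckduckgo.com", metrics, cert, {"duckduckgo.com"}, criteria)
expiring_cert = {**cert, "notafter": datetime(2029, 12, 1, tzinfo=utc)}
expiring = is_upgradeable("duckduckgo.com", metrics, expiring_cert, set(), criteria)
```

Here the first call passes every check, the second fails because the domain is listed as an exception, and the third fails because the certificate expires before ssl_cert_buffer.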
