huanchen-stack/post-link-rot-analysis

post-link-rot-analysis

Results are too large to upload to GitHub.

  1. parse_FSM.py: reads and parses the ext dump; outputs the first revision in 2019/2014 along with the revision metadata (optional: article_move_analysis.py)

  2. extract_links.py: reads the first revisions and extracts external links using mwparserfromhell

  3. [logic-change] analysis_augment.py: analyzes the extracted external links and groups them into [live-only, augmented, archive-only]
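
The grouping in step 3 could look roughly like the sketch below. It is an illustration, not the code in analysis_augment.py. It assumes that archived copies are Wayback Machine URLs, that "augmented" means a live link whose archived copy appears in the same revision, and that "archive-only" means an archived copy with no matching live link. The names `WAYBACK_RE`, `normalize`, and `group_links` are hypothetical.

```python
import re

# Assumption: archived copies look like
# https://web.archive.org/web/<timestamp>/<original-url>
WAYBACK_RE = re.compile(
    r"^https?://web\.archive\.org/web/\d+[a-z_]*/(.+)$", re.IGNORECASE
)


def normalize(url):
    """Crude key: drop the scheme and any trailing slash."""
    url = re.sub(r"^https?://", "", url.strip(), flags=re.IGNORECASE)
    return url.rstrip("/")


def group_links(urls):
    """Split one revision's external links into the three groups."""
    live = {}          # normalized key -> original live URL
    archived = set()   # normalized keys that have an archived copy
    for url in urls:
        match = WAYBACK_RE.match(url)
        if match:
            archived.add(normalize(match.group(1)))
        else:
            live.setdefault(normalize(url), url)

    groups = {"live-only": [], "augmented": [], "archive-only": []}
    for key, url in live.items():
        name = "augmented" if key in archived else "live-only"
        groups[name].append(url)
    # Archived targets whose live URL never appears on its own.
    groups["archive-only"] = sorted(archived - set(live))
    return groups
```

A real implementation would likely also need to handle other archive services and citation-template parameters, which this sketch ignores.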

  4. merge_live_links_by_host.py: prepares the live-link probe by grouping links by hostname, so that politeness does not become the performance bottleneck
  5. probe_scheduler.py: randomizes the probe sequence to improve politeness
  6. probe_live.py: sends GET requests to the live links (probe_live_patch.py was used once to patch an earlier bug)
  7. group_broken_links.py: groups broken links by DIR/article name to support batched analysis

Related files: probe_live_patch.py, probe_live_filter_broken.jq
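
As a rough illustration of steps 4 and 5, the sketch below groups URLs by hostname and then emits them in randomized rounds, so that each host is contacted at most once per round. This is one possible reading of "randomize probe sequence", not necessarily what probe_scheduler.py does; `group_by_host` and `randomized_schedule` are hypothetical names.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_host(urls):
    """Map each hostname to the list of URLs it serves."""
    by_host = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname or ""
        by_host[host].append(url)
    return dict(by_host)


def randomized_schedule(by_host, seed=None):
    """Yield URLs in rounds; each round visits every remaining host once,
    in a freshly shuffled order."""
    rng = random.Random(seed)
    queues = {host: list(links) for host, links in by_host.items()}
    while queues:
        hosts = list(queues)
        rng.shuffle(hosts)
        for host in hosts:
            yield queues[host].pop()
            if not queues[host]:
                del queues[host]  # host exhausted, drop it from later rounds
```

With this ordering, hosts that have many links are spread across many rounds instead of being hit back to back, which lets the prober keep a per-host delay without idling.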


  8. last_revision_analysis.py: efficiently extracts all eventually-augmented/removed links (by looking only at the grouped broken links and the last revision)
  9. iter_edit_history_FSM.py: re-iterates through all edit histories to find the date of the first [augmentation, removal]
  10. removal_reason_filters.py:
  11. probe_archive_conf.py
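
The history scan in step 9 can be pictured with the minimal sketch below. It assumes each revision has already been reduced to a timestamp, the set of external links it contains, and the set of links that carry an archived copy; the `Revision` class and `first_events` function are hypothetical and are not taken from iter_edit_history_FSM.py.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class Revision:
    timestamp: str      # ISO 8601 strings in one format sort chronologically
    links: Set[str]     # external links present in this revision
    archived: Set[str]  # links that have an archived copy in this revision


def first_events(revisions, tracked):
    """For each tracked link, record the timestamp of the first revision
    in which it is augmented and the first in which it is gone."""
    events = {url: {"augmentation": None, "removal": None} for url in tracked}
    for rev in sorted(revisions, key=lambda r: r.timestamp):
        for url, ev in events.items():
            if ev["augmentation"] is None and url in rev.archived:
                ev["augmentation"] = rev.timestamp
            gone = url not in rev.links and url not in rev.archived
            if ev["removal"] is None and gone:
                ev["removal"] = rev.timestamp
    return events
```

The sketch assumes the tracked links are all present in the earliest revision passed in, as they would be if they came from the first-revision extraction in step 2.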
