JSON is now the default export option; the --json flag is deprecated
Added several new flags
Improved logging
Bug fixes
Code refactor
Full Changelog
Added
User interface
Analytical tools
Word frequencies generator.
Wordcloud generator.
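As a rough illustration of what the wordcloud generator does, here is a minimal sketch assuming the third-party wordcloud package (the file names are hypothetical; URS's actual implementation lives in its analytics modules):

```python
from wordcloud import WordCloud

# Hypothetical input: plain text pulled from a scrape file.
with open("scrape_data.txt", "r", encoding="utf-8") as file:
    text = file.read()

wordcloud = WordCloud(width=800, height=400).generate(text)
wordcloud.to_file("wordcloud.png")  # skipped when only displaying (see --nosave below)
```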
Source code
CLI
Flags
-e - Display additional example usage.
--check - Run a quick check for PRAW credentials and display the rate limit table after validation.
--rules - Include the Subreddit's rules in the scrape data (for JSON only). This data is included in the subreddit_rules field.
-f - Word frequencies generator.
-wc - Wordcloud generator.
--nosave - Only display the wordcloud; do not save to file.
Added metavar to the args help message.
Added more verbose feedback when invalid arguments are given.
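A minimal argparse sketch of how the flags and metavar values above might be wired up (the names and help strings here are illustrative, not URS's exact definitions):

```python
import argparse

parser = argparse.ArgumentParser(prog="Urs.py")
parser.add_argument("-e", action="store_true",
                    help="display additional example usage")
parser.add_argument("--check", action="store_true",
                    help="check PRAW credentials and display the rate limit table")
parser.add_argument("--rules", action="store_true",
                    help="include the Subreddit's rules in the scrape data (JSON only)")
parser.add_argument("-f", metavar="FILE",
                    help="generate word frequencies from a scrape file")
parser.add_argument("-wc", metavar="FILE",
                    help="generate a wordcloud from a scrape file")
parser.add_argument("--nosave", action="store_true",
                    help="only display the wordcloud; do not save to file")

args = parser.parse_args()
```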
Log decorators
Added new decorator to log individual argument errors.
Added new decorator to log when no Reddit objects are left to scrape after failing validation check.
Added new decorator to log when an invalid file is passed into the analytical tools.
Added new decorator to log when the scrapes directory is missing, which would cause the new make_analytics_directory() method in DirInit.py to fail.
This decorator is also defined in the same file to avoid a circular import error.
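A hedged sketch of the decorator pattern these logging decorators use (the function and message names are hypothetical):

```python
import functools
import logging

logging.basicConfig(filename="urs.log", level=logging.INFO)

def log_args_errors(function):
    """Hypothetical decorator: log argument errors, then re-raise them."""
    @functools.wraps(function)
    def wrapper(*args, **kwargs):
        try:
            return function(*args, **kwargs)
        except ValueError as error:
            logging.critical("An invalid argument was passed: %s", error)
            raise
    return wrapper
```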
ASCII art
Added new art for the word frequencies and wordcloud generators.
Added new error art displayed when a problem arises while exporting data.
Added new error art displayed when Reddit object validation is completed and there are no objects left to scrape.
Added new error art displayed when an invalid file is passed into the analytical tools.
README
Added new Contact section and moved contact badges into it.
Apparently the contact information was not obvious enough in previous versions, since users did not send emails to the address specifically created for URS-related inquiries.
Added new sections for the analytical tools.
Updated demo GIFs
Moved all GIFs to a separate branch to avoid bloating clones of the repository.
Static images are now hosted on Imgur.
Tests
Added additional tests for analytical tools.
Changed
User interface
JSON is now the default export option. The --csv flag is now required to export to CSV instead.
Improved JSON structure.
PRAW scraping export structure:
Scrape details are now included at the top of each exported file in the scrape_details field.
Subreddit scrapes - Includes subreddit, category, n_results_or_keywords, and time_filter.
Redditor scrapes - Includes redditor and n_results.
Submission comments scrapes - Includes submission_title, n_results, and submission_url.
Scrape data is now stored in the data field.
Subreddit scrapes - data is a list containing submission objects.
Redditor scrapes - data is an object containing additional nested dictionaries:
information - a dictionary denoting Redditor metadata,
interactions - a dictionary denoting Redditor interactions (submissions and/or comments). Each interaction follows the Subreddit scrapes structure.
Submission comments scrapes - data is a list containing additional nested dictionaries.
Raw comments scrapes contain dictionaries of comment_id: SUBMISSION_METADATA.
Structured comments scrapes follow the structure seen in raw comments, but include an extra replies field in the submission metadata, holding a list of additional nested dictionaries of comment_id: SUBMISSION_METADATA. This pattern repeats down to third-level replies.
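Put together, a Subreddit scrape export now looks roughly like this (the field names come from the notes above; all values are hypothetical):

```python
subreddit_scrape = {
    "scrape_details": {
        "subreddit": "askscience",
        "category": "top",
        "n_results_or_keywords": "10",
        "time_filter": "all",
    },
    "data": [
        # One dictionary per submission; the keys shown here are illustrative.
        {
            "title": "Example submission title",
            "author": "example_redditor",
            "edited": False,
        },
    ],
}
```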
Word frequencies export structure:
The original scrape data filepath is included in the raw_file field.
data is a dictionary containing word: frequency.
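For example, a word frequencies export might look roughly like this (the filepath and counts are hypothetical):

```python
frequencies_export = {
    "raw_file": "../scrapes/06-15-2021/subreddits/askscience-top-10-results.json",
    "data": {
        "science": 42,
        "question": 17,
    },
}
```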
Log:
scrapes.log is now named urs.log.
Validation of Reddit objects is now included - invalid Reddit objects will be logged as a warning.
Rate limit information is now included in the log.
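PRAW exposes the current rate limit through reddit.auth.limits once a request has been made, so a sketch of this logging might look like the following (the credentials and log message are placeholders):

```python
import logging

import praw

logging.basicConfig(filename="urs.log", level=logging.INFO)

reddit = praw.Reddit(
    client_id="CLIENT_ID",          # placeholder credentials
    client_secret="CLIENT_SECRET",
    user_agent="urs-example",
)

# Any request against the API populates the rate limit information.
next(reddit.subreddit("announcements").hot(limit=1))

limits = reddit.auth.limits  # {"remaining": ..., "reset_timestamp": ..., "used": ...}
logging.info(
    "Rate limit: %s remaining, %s used, resets at %s.",
    limits["remaining"], limits["used"], limits["reset_timestamp"],
)
```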
Source code
Moved the PRAW scrapers into their own package.
Subreddit scraper's "edited" field is now either a boolean (if the post was not edited) or a string (if it was).
Previous iterations did not distinguish between the two types and always returned a string.
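In PRAW, submission.edited is either False or a Unix timestamp, so the conversion is along these lines (a sketch, not URS's exact code):

```python
import datetime

def convert_edited(edited):
    """Return False if the post was not edited, else a readable date string."""
    if not edited:
        return False
    return datetime.datetime.fromtimestamp(edited).strftime("%m-%d-%Y %H:%M:%S")
```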
Scrape settings for the basic Subreddit scraper are now cleaned within Basic.py, further streamlining conditionals in Subreddit.py and Export.py.
The final scrape settings dictionary is now returned from all scrapers after execution for logging purposes, further streamlining the LogPRAWScraper class in Logger.py.
The submission URL, rather than the exception, is now passed into the not_found list for submission comments scraping.
This is a part of a bug fix that is listed in the Fixed section.
ASCII art:
Modified the args error art to display specific feedback when invalid arguments are passed.
Upgraded from relative to absolute imports.
Replaced old header comments with a docstring comment block.
Upgraded method comments to the NumPy/SciPy docstring format.
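For reference, the NumPy/SciPy docstring format looks like this (a hypothetical method, not one from URS's source):

```python
def get_frequencies(words):
    """
    Count how many times each word appears.

    Parameters
    ----------
    words : list of str
        Words extracted from a scrape file.

    Returns
    -------
    frequencies : dict
        A dictionary mapping each word to its frequency.
    """
    frequencies = {}
    for word in words:
        frequencies[word] = frequencies.get(word, 0) + 1
    return frequencies
```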
README
Moved Releases section into its own document.
Deleted all media from master branch.
Tests
Updated absolute imports to match new directory structure.
Updated a few tests to match changes made in the source code.
Community documents
Updated PULL_REQUEST_TEMPLATE:
Updated the section for listing changes to match the new Releases syntax.
Wrapped New Dependencies in a code block.
Updated STYLE_GUIDE:
Created new rules for method comments.
Added Releases:
Moved Releases section from main README to a separate document.
Fixed
Source code
PRAW scraper settings
Bug: Invalid Reddit objects (Subreddits, Redditors, or submissions) and their respective scrape settings would be added to the scrape settings dictionary even after failing validation.
Behavior: URS would try to scrape invalid Reddit objects, then throw an error mid-scrape because it is unable to pull data via PRAW.
Fix: The invalid objects list is now returned from each scraper to GetPRAWScrapeSettings.get_settings() to circumvent this issue.
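A hedged sketch of the idea (all names here are hypothetical): each validation step returns its invalid objects so the settings builder can skip them:

```python
def validate(names, check):
    """Split names into (valid, invalid) using a validation callable."""
    valid, invalid = [], []
    for name in names:
        (valid if check(name) else invalid).append(name)
    return valid, invalid

def get_settings(names, check):
    """Only objects that passed validation make it into the scrape settings."""
    valid, invalid = validate(names, check)
    settings = {name: [] for name in valid}
    return settings, invalid
```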
Basic Subreddit scraper
Bug: The all time filter would be applied to categories that do not support time filters, resulting in errors while scraping.
Behavior: URS would throw an error when trying to export the file, resulting in a failed run.
Fix: Added a conditional that checks whether the category supports a time filter, applying either the all time filter or None accordingly.
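A sketch of that conditional (the category set is illustrative; in PRAW only listings such as top, controversial, and search accept a time_filter):

```python
def resolve_time_filter(category):
    """Apply the 'all' time filter only where the category supports one."""
    time_filter_categories = {"top", "controversial", "search"}
    return "all" if category.lower() in time_filter_categories else None
```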
Deprecated
User interface
Removed the --json flag since it is now the default export option.