Releases · JosephLai241/URS
URS v3.4.0
Summary
This release cleans up the codebase by upgrading the project structure to a Poetry project and rewrites a compute-heavy bottleneck in Rust, drastically improving performance.
Changelog
Added
- taisun
  - A Python module written in Rust that contains the depth-first search algorithm and associated data structures for structured comments scraping. This library will eventually contain additional code that handles compute-heavy tasks.
- GitHub Actions workflows
  - rust.yml - Format and lint Rust code.
  - python.yml - Format and test Python code.
  - manual.yml - Build and deploy the mdBook manual to GitHub Pages.
- A new user guide/manual built from mdBook.
- Added type hints to all urs/ code.
Changed
- Dates used in this program have been updated to use the ISO 8601 timestamp format (YYYY-MM-DD HH:MM:SS) (see the sketch after this list).
- Docstrings have been updated from NumPy to reStructuredText format.
- Simplified STYLE_GUIDE.md.
  - The style is dictated by Black and isort for Python code, and rustfmt for Rust.
- Simplified README.md.
  - Most information previously listed there has been moved to the user guide/manual.
- Formatted every single Python file with Black and isort.
- Upgraded/recorded new demo GIFs.
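A minimal sketch of producing this timestamp format with Python's standard library; where exactly urs/ formats its dates is not shown in these notes:

```python
from datetime import datetime

# Format the current time as an ISO 8601-style timestamp,
# matching the YYYY-MM-DD HH:MM:SS format noted above.
timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
print(timestamp)  # e.g. 2023-06-05 14:30:59
```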
Deprecated
N/A
URS v3.3.2
Summary
This release fixes an open issue.
PRAW v7.3.0 changed the Redditor object's subreddit attribute, which broke the Redditor scraper.
Full Changelog
Added
- Source code
  - In Redditor.py:
    - Added a new method GetInteractions._get_user_subreddit() - extracts subreddit data from the UserSubreddit object into a dictionary (see the sketch after this list).
- Tests
  - In test_Redditor.py:
    - Added TestGetUserSubredditMethod().test_get_user_subreddit() to test the new method.
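A minimal sketch of the extraction pattern this method describes. The field names below are assumptions for illustration; the attributes the real GetInteractions._get_user_subreddit() pulls may differ:

```python
def get_user_subreddit(user_subreddit) -> dict:
    """Flatten a PRAW UserSubreddit object into a plain dictionary.

    The attribute names here are illustrative; the real method may
    extract a different set of fields.
    """
    return {
        "display_name": user_subreddit.display_name,
        "name": user_subreddit.name,
        "nsfw": user_subreddit.over_18,
        "public_description": user_subreddit.public_description,
        "subscribers": user_subreddit.subscribers,
    }
```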
Changed
- Source code
  - In Redditor.py:
    - GetInteractions._get_user_info() calls the new GetInteractions._get_user_subreddit() method to set the Redditor's subreddit data within the main Redditor information dictionary.
  - In Version.py:
    - Incremented version number.
- README
  - Incremented PRAW badge version number.
URS v3.3.1
Summary
- Introduced a new utility, -t, which displays a visual tree of the current day's scrape directory by default. Optionally, include a different date to display that day's scrape directory.
- Moved CI providers from Travis-CI to GitHub Actions.
  - Travis-CI is no longer free - there is now a free build cap.
- Minor code refactoring and issue resolution.
Full Changelog
Added
- User interface
  - Added a new utility: -t/--tree - display the directory structure of the current date directory, or optionally include a date to display that day's scrape directory.
- Source code
  - Added a new file Utilities.py to the urs/utils module.
    - Added a class DateTree which contains methods to find and build a visual tree for the target date's directory (see the sketch after this list).
      - Added logging when this utility is run.
  - Added an additional Halo to the wordcloud generator.
- README
  - Added new "Utilities" section.
    - This section describes how to use the -t/--tree and --check utility flags.
  - Added new "Sponsors" section.
- Tests
  - Added test_Utilities.py under the test_utils module.
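A minimal sketch of finding and rendering a date directory as a visual tree with pathlib. This is a simplified stand-in, not DateTree's actual implementation, and the scrapes path is a hypothetical example:

```python
from pathlib import Path

def display_tree(directory: Path, prefix: str = "") -> None:
    """Recursively print a visual tree of `directory`.

    A simplified stand-in for DateTree's behavior; the real class
    also handles date lookup and logging.
    """
    entries = sorted(directory.iterdir(), key=lambda path: path.name)
    for i, entry in enumerate(entries):
        connector = "└── " if i == len(entries) - 1 else "├── "
        print(f"{prefix}{connector}{entry.name}")
        if entry.is_dir():
            extension = "    " if i == len(entries) - 1 else "│   "
            display_tree(entry, prefix + extension)

# Hypothetical date directory.
display_tree(Path("../scrapes/2021-07-03"))
```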
Changed
- Source code
  - Refactored the following methods within the analytics module:
    - GetPath.get_scrape_type()
    - GetPath.name_file()
    - FinalizeWordcloud().save_wordcloud()
      - Implemented pathlib's Path() to get the path (see the sketch after this list).
  - Upgraded all string formatting from old-school Python formatting (using the % operator) to the superior f-string.
  - Updated GitHub Actions workflow pytest.yml.
    - This workflow was previously disabled. The workflow has been upgraded to test URS on all platforms (ubuntu-latest, macOS-latest, and windows-latest) and to send test coverage to Codecov after testing completes on ubuntu-latest.
- README
  - Changed the Travis-CI badge to a GitHub Actions badge.
    - Updated badge link to route to the workflows page within the repository.
- Tests
  - Upgraded all string formatting from old-school Python formatting (using the % operator) to the superior f-string in the following modules:
    - test_utils/test_Export.py
    - test_praw_scrapers/test_live_scrapers/test_Livestream.py
  - Refactored two tests within test_Export.py:
    - TestExportWriteCSVAndWriteJSON().test_write_csv()
    - TestExportExportMethod().test_export_write_csv()
- Community documents
  - Updated PULL_REQUEST_TEMPLATE.md.
    - Removed Travis-CI configuration block.
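A minimal sketch of the pathlib pattern described above, with hypothetical directory and filename values:

```python
from pathlib import Path

# Hypothetical values - the actual path segments used by the
# analytics module may differ.
date_dir = "2021-07-03"
filename = "askscience-hot-10-results.png"

# Path() composes filesystem paths portably, replacing manual
# string concatenation; f-strings handle the remaining formatting.
export_path = Path("../scrapes") / date_dir / "analytics" / filename
print(f"Saving wordcloud to {export_path}.")
```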
Deprecated
- Source code
  - Removed .travis.yml - URS no longer uses Travis-CI as its CI provider.
URS v3.3.0
Summary
- Introduced livestreaming tools:
  - Livestream comments or submissions submitted within Subreddits.
  - Livestream comments or submissions submitted by a Redditor.
Full Changelog
Added
- User interface
  - Added livestream scraper flags:
    - -lr - livestream a Subreddit
    - -lu - livestream a Redditor
  - Added a livestream scrape control flag to limit the stream exclusively to submissions (the default is streaming comments):
    - --stream-submissions
  - Added a flag -v/--version to display the version number.
- Source code
  - Added a new sub-module live_scrapers within praw_scrapers for livestream functionality (see the sketch after this list):
    - Livestream.py
    - utils/DisplayStream.py
    - utils/StreamGenerator.py
  - Added a new file Version.py to single-source the package version.
  - Added a gallery_data and media_metadata check in Comments.py, which includes the above fields if the submission contains a gallery.
- README
  - Added a new "Installation" section with updated installation procedures.
  - Added a new section "Livestreaming Subreddits and Redditors" with sub-sections containing details for each flag.
  - Updated the Table of Contents accordingly.
- Tests
  - Added additional unit tests for the live_scrapers module. These tests are located in tests/test_praw_scrapers/test_live_scrapers:
    - tests/test_praw_scrapers/test_live_scrapers/test_Livestream.py
    - tests/test_praw_scrapers/test_live_scrapers/test_utils/test_DisplayStream.py
    - tests/test_praw_scrapers/test_live_scrapers/test_utils/test_StreamGenerator.py
- Repository documents
  - Added a Table of Contents for The Forest.md.
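A minimal sketch of how a comment stream can be generated with PRAW, which is presumably the mechanism underneath StreamGenerator.py; the generator structure and credentials below are assumptions:

```python
import praw

reddit = praw.Reddit(
    client_id="CLIENT_ID",          # placeholder credentials
    client_secret="CLIENT_SECRET",
    user_agent="urs-livestream-sketch",
)

def stream_comments(subreddit_name: str):
    """Yield new comments from a Subreddit as they are submitted.

    skip_existing=True ignores comments that existed before the
    stream started, so only live activity is displayed.
    """
    subreddit = reddit.subreddit(subreddit_name)
    for comment in subreddit.stream.comments(skip_existing=True):
        yield comment

for comment in stream_comments("askreddit"):
    print(f"{comment.author}: {comment.body[:80]}")
```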
Changed
- User interface
  - Updated the usage menu to clarify which tools may use which optional flags.
- Source code
  - Reindexed the praw_scrapers module:
    - Moved the following files into the new static_scrapers sub-module:
      - Basic.py
      - Comments.py
      - Redditor.py
      - Subreddit.py
    - Updated absolute imports throughout the source code.
  - Moved confirm_options(), previously located in Subreddit.py, to Global.py.
  - Moved the PrepRedditor.prep_redditor() algorithm to its own class method PrepMutts.prep_mutts().
    - Added additional error handling to the algorithm to fix the KeyError exception mentioned in the Issue Fix or Enhancement Request section.
  - Removed Colorama's init() method from many modules - it only needs to be called once and is now located in Urs.py.
  - Updated requirements.txt.
- README
  - The "Exporting" section is now one large section located above the "URS Overview" section.
- Tests
  - Updated absolute imports for existing PRAW scrapers.
  - Removed a few tests for DirInit.py since the make_directory() and make_type_directory() methods have been deprecated.
Deprecated
- Source code
  - Removed many methods defined in the InitializeDirectory class in DirInit.py:
    - LogMissingDir.log()
    - create()
    - make_directory()
    - make_type_directory()
    - make_analytics_directory()
    - Replaced these methods with a more versatile create_dirs() method (see the sketch after this list).
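A minimal sketch of what a consolidated directory-creation method can look like, assuming it wraps pathlib; the actual create_dirs() signature is not shown in these notes:

```python
from pathlib import Path

class InitializeDirectory:
    """Simplified stand-in for the class described above."""

    @staticmethod
    def create_dirs(path: str) -> None:
        # Create the target directory and any missing parents, doing
        # nothing if it already exists - one method replaces the
        # several special-purpose makers listed above.
        Path(path).mkdir(parents=True, exist_ok=True)

# Hypothetical target directory.
InitializeDirectory.create_dirs("../scrapes/2021-07-03/subreddits")
```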
URS v3.2.1
Release date: March 28, 2021
Summary
- Structured comments export has been upgraded to include comments of all levels.
  - Structured comments are now the default export format. Exporting to raw format requires including the --raw flag.
- Tons of metadata has been added to all scrapers. See the Full Changelog section for a full list of attributes that have been added.
- Credentials.py has been deprecated in favor of .env to avoid hard-coding API credentials.
- Added more terminal eye candy - Halo has been implemented to spice up the output.
Full Changelog
Added
- User interface
  - Added Halo to spice up the output while maintaining minimalism.
- Source code
  - Created a comment Forest and accompanying CommentNode.
    - The Forest contains methods for inserting CommentNodes, including a depth-first search algorithm to do so (see the sketch after this list).
  - Subreddit.py has been refactored and submission metadata has been added to scrape files: "author", "created_utc", "distinguished", "edited", "id", "is_original_content", "is_self", "link_flair_text", "locked", "name", "num_comments", "nsfw", "permalink", "score", "selftext", "spoiler", "stickied", "title", "upvote_ratio", "url"
  - Comments.py has been refactored and submission comments now include the following metadata: "author", "body", "body_html", "created_utc", "distinguished", "edited", "id", "is_submitter", "link_id", "parent_id", "score", "stickied"
  - Major refactor for Redditor.py on top of adding additional metadata.
    - Additional Redditor information has been added to scrape files: "has_verified_email", "icon_img", "subreddit", "trophies"
    - Additional Redditor comment, submission, and multireddit metadata has been added to scrape files:
      - subreddit objects are nested within comment and submission objects and contain the following metadata: "can_assign_link_flair", "can_assign_user_flair", "created_utc", "description", "description_html", "display_name", "id", "name", "nsfw", "public_description", "spoilers_enabled", "subscribers", "user_is_banned", "user_is_moderator", "user_is_subscriber"
      - comment objects will contain the following metadata: "type", "body", "body_html", "created_utc", "distinguished", "edited", "id", "is_submitter", "link_id", "parent_id", "score", "stickied", "submission" (contains additional metadata), "subreddit_id"
      - submission objects will contain the following metadata: "type", "author", "created_utc", "distinguished", "edited", "id", "is_original_content", "is_self", "link_flair_text", "locked", "name", "num_comments", "nsfw", "permalink", "score", "selftext", "spoiler", "stickied", "subreddit" (contains additional metadata), "title", "upvote_ratio", "url"
      - multireddit objects will contain the following metadata: "can_edit", "copied_from", "created_utc", "description_html", "description_md", "display_name", "name", "nsfw", "subreddits", "visibility"
    - interactions are now sorted in alphabetical order.
  - CLI
    - Flags
      - --raw - Export comments in raw format instead (structured format is the default).
  - Created a new .env file to store API credentials.
- README
  - Added new bullet point for The Forest Markdown file.
- Tests
  - Added a new test for the Status class in Global.py.
- Repository documents
  - Added "The Forest".
    - This Markdown file is just a place where I describe how I implemented the Forest.
Changed
- User interface
  - Submission comments scraping parameters have changed due to the improvements made in this pull request.
    - Structured comments is now the default format.
      - Users will have to include the new --raw flag to export to raw format.
    - Both structured and raw formats can now scrape all comments from a submission.
- Source code
  - The submission comments JSON file's structure has been modified to fit the new submission_metadata dictionary. "data" is now a dictionary that contains the submission metadata dictionary and the scraped comments list. Comments are now stored in the "comments" field within "data".
  - Exporting Redditor or submission comments to CSV is now forbidden.
    - URS will ignore the --csv flag if it is present while trying to use either scraper.
  - The created_utc field for each Subreddit rule is now converted to readable time.
  - requirements.txt has been updated.
    - As of v1.20.0, numpy has dropped support for Python 3.6, which means Python 3.7+ is required for URS.
      - .travis.yml has been modified to exclude Python 3.6. Added Python 3.9 to the test configuration.
      - Note: Older versions of Python can still be used by downgrading to numpy<=1.19.5.
  - Reddit object validation block has been refactored.
    - A new reusable module has been defined at the bottom of Validation.py.
  - Urs.py no longer pulls API credentials from Credentials.py as it is now deprecated.
    - Credentials are now read from the .env file (see the sketch after this list).
  - Minor refactoring within Validation.py to ensure an extra Halo line is not rendered on failed credential validation.
- README
  - Updated the Comments section to reflect new changes to the comments scraper UI.
- Repository documents
  - Updated How to Get PRAW Credentials.md to reflect new changes.
- Tests
  - Updated CLI usage and examples tests.
  - Updated the c_fname() test because submission comments scrapes now follow a different naming convention.
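A minimal sketch of reading credentials from a .env file with python-dotenv. The variable names are assumptions; URS's actual key names are not listed in these notes:

```python
import os

from dotenv import load_dotenv
import praw

# Load key-value pairs from the .env file into the environment so
# credentials never have to be hard-coded in source files.
load_dotenv()

reddit = praw.Reddit(
    client_id=os.getenv("CLIENT_ID"),          # hypothetical key names
    client_secret=os.getenv("CLIENT_SECRET"),
    user_agent=os.getenv("USER_AGENT"),
)
```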
Deprecated
- User interface
  - Specifying 0 comments no longer exports all comments to raw format only; it now defaults to structured format.
- Source code
  - Deprecated many global variables defined in Global.py:
    - eo
    - options
    - s_t
    - analytical_tools
  - Credentials.py has been replaced with the .env file.
  - The LogError.log_login decorator has been deprecated due to the refactor within Validation.py.
URS v3.2.0
Release date: February 25, 2021
Summary
- Added analytical tools
  - Word frequencies generator
  - Wordcloud generator
- Significantly improved JSON structure
  - JSON is now the default export option; the --json flag is deprecated
- Added numerous extra flags
- Improved logging
- Bug fixes
- Code refactor
Full Changelog
Added
- User interface
  - Analytical tools (see the sketch after this list):
    - Word frequencies generator.
    - Wordcloud generator.
- Source code
  - CLI
    - Flags
      - -e - Display additional example usage.
      - --check - Runs a quick check for PRAW credentials and displays the rate limit table after validation.
      - --rules - Include the Subreddit's rules in the scrape data (for JSON only). This data is included in the subreddit_rules field.
      - -f - Word frequencies generator.
      - -wc - Wordcloud generator.
      - --nosave - Only display the wordcloud; do not save to file.
    - Added metavar for args help message.
    - Added additional verbose feedback if invalid arguments are given.
  - Log decorators
    - Added new decorator to log individual argument errors.
    - Added new decorator to log when no Reddit objects are left to scrape after failing validation check.
    - Added new decorator to log when an invalid file is passed into the analytical tools.
    - Added new decorator to log when the scrapes directory is missing, which would cause the new make_analytics_directory() method in DirInit.py to fail.
      - This decorator is also defined in the same file to avoid a circular import error.
  - ASCII art
    - Added new art for the word frequencies and wordcloud generators.
    - Added new error art displayed when a problem arises while exporting data.
    - Added new error art displayed when Reddit object validation is completed and there are no objects left to scrape.
    - Added new error art displayed when an invalid file is passed into the analytical tools.
- README
  - Added new Contact section and moved contact badges into it.
    - Apparently it was not obvious enough in previous versions since users did not send emails to the address specifically created for URS-related inquiries.
  - Added new sections for the analytical tools.
  - Updated demo GIFs.
    - Moved all GIFs to a separate branch to avoid unnecessary clones.
    - Hosting static images on Imgur.
- Tests
  - Added additional tests for analytical tools.
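A minimal sketch of a word frequencies generator over scraped text, assuming plain strings as input; URS's actual implementation reads its own export files:

```python
import re
from collections import Counter

def word_frequencies(texts: list[str]) -> dict[str, int]:
    """Count how often each word appears across scraped text."""
    counter = Counter()
    for text in texts:
        # Lowercase and split on non-word characters to normalize.
        counter.update(re.findall(r"\w+", text.lower()))
    return dict(counter.most_common())

print(word_frequencies(["Reddit data, Reddit tools", "data everywhere"]))
# {'reddit': 2, 'data': 2, 'tools': 1, 'everywhere': 1}
```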
Changed
- User interface
  - JSON is now the default export option. The --csv flag is required to export to CSV instead.
  - Improved JSON structure (see the sketch following this list).
    - PRAW scraping export structure:
      - Scrape details are now included at the top of each exported file in the scrape_details field.
        - Subreddit scrapes - Includes subreddit, category, n_results_or_keywords, and time_filter.
        - Redditor scrapes - Includes redditor and n_results.
        - Submission comments scrapes - Includes submission_title, n_results, and submission_url.
      - Scrape data is now stored in the data field.
        - Subreddit scrapes - data is a list containing submission objects.
        - Redditor scrapes - data is an object containing additional nested dictionaries:
          - information - a dictionary denoting Redditor metadata.
          - interactions - a dictionary denoting Redditor interactions (submissions and/or comments). Each interaction follows the Subreddit scrapes structure.
        - Submission comments scrapes - data is a list containing additional nested dictionaries.
          - Raw comments contain dictionaries of comment_id: SUBMISSION_METADATA.
          - Structured comments follow the structure seen in raw comments, but include an extra replies field in the submission metadata, holding a list of additional nested dictionaries of comment_id: SUBMISSION_METADATA. This pattern repeats down to third-level replies.
    - Word frequencies export structure:
      - The original scrape data filepath is included in the raw_file field.
      - data is a dictionary containing word: frequency.
  - Log:
    - scrapes.log is now named urs.log.
    - Validation of Reddit objects is now included - invalid Reddit objects will be logged as a warning.
    - Rate limit information is now included in the log.
- Source code
  - Moved PRAW scrapers into their own package.
  - The Subreddit scraper's "edited" field is now either a boolean (if the post was not edited) or a string (if it was).
    - Previous iterations did not distinguish between the two types and would solely return a string.
  - Scrape settings for the basic Subreddit scraper are now cleaned within Basic.py, further streamlining conditionals in Subreddit.py and Export.py.
  - Returning the final scrape settings dictionary from all scrapers after execution for logging purposes, further streamlining the LogPRAWScraper class in Logger.py.
  - Passing the submission URL instead of the exception into the not_found list for submission comments scraping.
    - This is part of a bug fix that is listed in the Fixed section.
  - ASCII art:
    - Modified the args error art to display specific feedback when invalid arguments are passed.
  - Upgraded from relative to absolute imports.
  - Replaced old header comments with docstring comment blocks.
  - Upgraded method comments to NumPy/SciPy docstring format.
- README
  - Moved the Releases section into its own document.
  - Deleted all media from the master branch.
- Tests
  - Updated absolute imports to match the new directory structure.
  - Updated a few tests to match new changes made in the source code.
- Community documents
  - Updated PULL_REQUEST_TEMPLATE:
    - Updated the section for listing changes that have been made to match the new Releases syntax.
    - Wrapped New Dependencies in a code block.
  - Updated STYLE_GUIDE:
    - Created new rules for method comments.
  - Added Releases:
    - Moved the Releases section from the main README to a separate document.
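A minimal sketch of what the described Subreddit export layout can look like, written as a Python dictionary that mirrors the JSON file; all values are hypothetical:

```python
# Hypothetical Subreddit scrape, mirroring the JSON layout above.
scrape = {
    "scrape_details": {
        "subreddit": "askscience",
        "category": "hot",
        "n_results_or_keywords": "10",
        "time_filter": None,
    },
    "data": [
        {
            "author": "someuser",
            "title": "A sample submission",
            "score": 42,
            # ... remaining submission metadata fields
        },
    ],
}
```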
Fixed
- Source code
  - PRAW scraper settings
    - Bug: Invalid Reddit objects (Subreddits, Redditors, or submissions) and their respective scrape settings would be added to the scrape settings dictionary even after failing validation.
    - Behavior: URS would try to scrape invalid Reddit objects, then throw an error mid-scrape because it is unable to pull data via PRAW.
    - Fix: Returning the invalid objects list from each scraper into GetPRAWScrapeSettings.get_settings() to circumvent this issue.
  - Basic Subreddit scraper
    - Bug: The time filter all would be applied to categories that do not support time filter use, resulting in errors while scraping.
    - Behavior: URS would throw an error when trying to export the file, resulting in a failed run.
    - Fix: Added a conditional to check if the category allows for a time filter, and applies either the all time filter or None accordingly.
Deprecated
- User interface
  - Removed the --json flag since JSON is now the default export option.
URS v3.1.2
Release date: February 05, 2021
Scrapes will now be exported to scrape-defined directories within the date directory.
New in 3.1.2
- URS will create sub-directories within the date directory based on the scraper.
  - Exported files will now be stored in the subreddits, redditors, or comments directories.
    - These directories are only created if the scraper is run. For example, the redditors directory will not be created if you never run the Redditor scraper (see the example layout after this list).
  - Removed the first character used in exported filenames to distinguish scrape type in previous iterations of URS.
    - This is no longer necessary due to the new sub-directory creation.
- The forbidden access message that may appear when running the Redditor scraper was originally red. Changed the color from red to yellow to avoid confusion.
- Fixed a file naming bug that would omit the scrape type if the filename length is greater than 50 characters.
- Updated README.
  - Updated demo GIFs.
  - Added a new directory structure visual generated by the tree command.
  - Created new section headers to improve navigation.
- Minor code reformatting/refactoring.
- Updated STYLE_GUIDE to reflect new changes and made a minor change to the PRAW API walkthrough.
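A hypothetical example of the resulting layout; the date and filenames are made up:

```
scrapes/
└── 2021-02-05/
    ├── comments/
    │   └── some-submission-title-10-results.json
    ├── redditors/
    │   └── spez-5-results.json
    └── subreddits/
        └── askscience-hot-10-results.json
```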
URS v3.1.1
Release date: June 27, 2020
Fulfilled a user enhancement request by adding a Subreddit time filter option.
New in 3.1.1:
- Users will now be able to specify a time filter for the Subreddit categories Controversial, Search, and Top.
- The valid time filters are:
  - all
  - day
  - hour
  - month
  - week
  - year
- Updated CLI unit tests to match new changes to how Subreddit args are parsed.
- Updated community documents located in the .github/ directory: STYLE_GUIDE and PULL_REQUEST_TEMPLATE.
- Updated README to reflect new changes.
URS v3.1.0
Release date: June 22, 2020
Major code refactor. Applied OOP concepts to existing code and rewrote methods in an attempt to improve readability, maintainability, and scalability.
New in 3.1.0:
- Scrapes will now be exported to the scrapes/ directory within a subdirectory corresponding to the date of the scrape. These directories are automatically created for you when you run URS.
- Added log decorators that record what is happening during each scrape, which scrapes were run, and any errors that might arise during runtime in the log file scrapes.log. The log is stored in the same subdirectory corresponding to the date of the scrape.
- Replaced bulky titles with minimalist titles for a cleaner look.
- Added color to terminal output.
- Improved naming convention for scripts.
- Integrated Travis CI and Codecov.
- Updated community documents located in the .github/ directory: BUG_REPORT, CONTRIBUTING, FEATURE_REQUEST, PULL_REQUEST_TEMPLATE, and STYLE_GUIDE.
- Numerous changes to README. The most significant change was splitting and storing walkthroughs in docs/.
URS v3.0.0
Release date: January 15, 2020
New features
- Added JSON support
- Scrape Redditors
- Scrape post comments