This repository has been archived by the owner on Dec 14, 2023. It is now read-only.

Make python package for url normalization code #718

Open
dsjen opened this issue Jul 7, 2020 · 3 comments

Comments

@dsjen
Contributor

dsjen commented Jul 7, 2020

As decided in today's tech meeting, I'm transferring this task into an issue.

@pypt
Contributor

pypt commented Jul 8, 2020

To be more specific, here:

https://github.com/berkmancenter/mediacloud/blob/master/apps/common/src/python/mediawords/util/url/__init__.py#L158-L246

we normalize URLs with various "cruft" (tracking parameters, etc.) into their canonical form. For some examples, see the unit test:

https://github.com/berkmancenter/mediacloud/blob/master/apps/common/tests/python/mediawords/util/test_url.py#L117-L201
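
To give a rough idea of what "removing cruft" means, here is a minimal sketch. It is not the code linked above: the function name `strip_tracking_params`, the parameter lists, and the exact steps (lowercasing the host, dropping the fragment, sorting the remaining query parameters) are assumptions chosen for illustration only.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical examples of tracking parameters; the real module has its own list.
TRACKING_PREFIXES = ("utm_",)
TRACKING_NAMES = {"fbclid", "gclid"}


def strip_tracking_params(url: str) -> str:
    """Return `url` without tracking parameters and without its fragment."""
    parts = urlsplit(url)

    # Keep only query parameters that do not look like tracking cruft.
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not name.lower().startswith(TRACKING_PREFIXES)
        and name.lower() not in TRACKING_NAMES
    ]
    # Sort so that the same parameters in a different order compare equal.
    kept.sort()

    # Lowercasing the whole netloc is a simplification (it would also
    # lowercase any user:password part); the fragment is dropped.
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",
        urlencode(kept),
        "",
    ))


print(strip_tracking_params("HTTP://Example.com/a?utm_source=x&b=2&a=1#frag"))
# → http://example.com/a?a=1&b=2
```

The unit test linked above remains the authoritative list of cases the real function handles.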

@rahulbot
Contributor

rahulbot commented Jul 8, 2020

After a quick scan I'd say this code looks well-isolated enough, with lots of test cases, that Eric could take this task on. The package could just assume that the input is already a Python str and eliminate the decode_object_from_bytes_if_needed call.

@pypt
Contributor

pypt commented Jul 9, 2020

Some wishlist items of mine:

  • Merge with normalize_youtube_url()
  • Users (if any) will probably want to use their own user agent (web client) and logging for the module, so we could encapsulate this functionality in separate classes, provide default implementations (with requests / urllib3 and logging respectively), and make them configurable, e.g.:
import logging

# AbstractLogger is the proposed base class; it does not exist yet
class CustomLogger(AbstractLogger):

    def info(self, message: str) -> None:
        logging.info(message)

    def debug(self, message: str) -> None:
        logging.debug(message)

    # ...

# Another class implementing AbstractUserAgent's interface

normalized_url = normalize_url(url=url, logger=CustomLogger(), ua=CustomUserAgent())
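
A slightly fuller, self-contained sketch of that idea follows. Everything in it is hypothetical: `AbstractLogger`, `AbstractUserAgent`, its `resolve()` method, and the default classes are names made up for illustration, and the assumption that the user agent is used to resolve a URL before normalizing it is mine. The default user agent here does no network access so the sketch runs offline; a real default would presumably wrap requests / urllib3.

```python
import abc
import logging
from typing import Optional


class AbstractLogger(abc.ABC):
    """Hypothetical logging interface the package would depend on."""

    @abc.abstractmethod
    def info(self, message: str) -> None: ...

    @abc.abstractmethod
    def debug(self, message: str) -> None: ...


class AbstractUserAgent(abc.ABC):
    """Hypothetical web client interface."""

    @abc.abstractmethod
    def resolve(self, url: str) -> str:
        """Return the URL that `url` ultimately points to."""


class DefaultLogger(AbstractLogger):
    """Default implementation on top of the standard logging module."""

    def __init__(self) -> None:
        self._log = logging.getLogger("url_normalizer")

    def info(self, message: str) -> None:
        self._log.info(message)

    def debug(self, message: str) -> None:
        self._log.debug(message)


class OfflineUserAgent(AbstractUserAgent):
    """Stand-in default that returns the URL unchanged (no HTTP requests)."""

    def resolve(self, url: str) -> str:
        return url


def normalize_url(
    url: str,
    logger: Optional[AbstractLogger] = None,
    ua: Optional[AbstractUserAgent] = None,
) -> str:
    """Placeholder normalizer showing only how the dependencies are injected."""
    logger = logger or DefaultLogger()
    ua = ua or OfflineUserAgent()

    logger.debug(f"Normalizing URL: {url}")
    resolved = ua.resolve(url.strip())
    # The actual cruft-removal logic would go here.
    logger.info(f"Normalized URL: {resolved}")
    return resolved
```

Callers who are happy with the defaults would call `normalize_url(url=url)`; anyone with their own HTTP stack or logging setup passes in subclasses, as in the call above.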
