FR: capture-html, but with eww-readable #53

Open
stefan2904 opened this issue Dec 16, 2022 · 1 comment

Comments

stefan2904 commented Dec 16, 2022

As always, thanks for the useful package!

Feature Request / Proposal

I propose adding a capture protocol that behaves like capture-html currently does, but, if nothing is selected, sends the whole page's html (i.e., document.body) to emacs and uses eww-readable to extract the interesting html before converting it with pandoc. Basically, a mix of the two existing protocols.

Use case

  • Let's say I want to capture a webpage into org. If I only want to capture selected part(s) of the website, then capture-html works perfectly.
  • On the other hand, if I want to capture the full webpage with the power of eww-readable, but from the content I already have open in the browser, this is currently not possible. (Or I misunderstood something.)

Background

What I mean: If I understand things correctly,

org-protocol-capture-html--with-pandoc (used by the capture-html protocol)

  1. user selects some text in browser, clicks on bookmarklet
  2. selected text is sent to emacs via org-protocol
  3. function converts the html (coming from the browser) to org

org-protocol-capture-html--capture-eww-readable (used by the capture-eww-readable protocol)

  1. user clicks on bookmarklet (which does not send selected text to emacs)
  2. function uses org-protocol-capture-html--url-html to download html directly
  3. function then converts the html (downloaded by curl) to org

The difference being that in capture-html the html is retrieved by the browser, while in capture-eww-readable it is retrieved by emacs/curl.

This makes a difference when the content being captured is, e.g., a page behind a paywall, or dynamically generated content. It also matters when the text is inserted into the DOM by javascript that loads the content after the page itself has loaded, because a separate download by emacs/curl never sees that content.

stefan2904 commented Dec 16, 2022

So basically (coming from someone who does not know elisp):

  • modify the capture-html bookmarklet to send document.body if nothing is selected
  • copy the org-protocol-capture-html--with-pandoc function and its protocol
  • modify the copy to use org-protocol-capture-html--eww-readable on (plist-get data :body) (or after the sanitization, not sure)

(probably there is a nicer way with less duplication of code)
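
A rough sketch of the first step (the bookmarklet side), to make the idea concrete. This is not the package's actual bookmarklet. The helper names, the protocol name capture-html-readable, and the template/url/title parameters are all assumptions for illustration; only the body parameter is taken from the discussion above.

```javascript
// Hypothetical helper: choose which HTML to send to emacs.
// selectionHtml is "" (or only whitespace) when nothing is selected,
// in which case the whole body is used as a fallback.
function pickCaptureHtml(selectionHtml, bodyHtml) {
  return selectionHtml.trim() !== "" ? selectionHtml : bodyHtml;
}

// Hypothetical helper: build the org-protocol URL.
// The parameter names template/url/title are assumptions; only "body"
// is mentioned in this issue (as :body on the emacs side).
function buildCaptureUrl(protocol, page, html) {
  const params = [
    ["template", page.template],
    ["url", page.url],
    ["title", page.title || "[untitled page]"],
    ["body", html],
  ];
  const query = params
    .map(([key, value]) => key + "=" + encodeURIComponent(value))
    .join("&");
  return "org-protocol://" + protocol + "?" + query;
}

// Browser-only part: serialize the current selection to an HTML string.
// Returns "" when there is no selection or the selection is collapsed.
function selectionToHtml(doc) {
  const sel = doc.getSelection();
  if (!sel || sel.rangeCount === 0) return "";
  const container = doc.createElement("div");
  for (let i = 0; i < sel.rangeCount; i++) {
    container.appendChild(sel.getRangeAt(i).cloneContents());
  }
  return container.innerHTML;
}

// What the bookmarklet would do when clicked (not called here).
// "capture-html-readable" and template "w" are placeholders.
function captureCurrentPage(doc, loc) {
  const html = pickCaptureHtml(selectionToHtml(doc), doc.body.innerHTML);
  loc.href = buildCaptureUrl(
    "capture-html-readable",
    { template: "w", url: loc.href, title: doc.title },
    html
  );
}
```

In a browser this would be invoked as captureCurrentPage(document, location). The emacs side would then still need a handler registered for the new protocol that runs eww-readable on the received body before handing it to pandoc.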
