MSN article returns NULL due to (at least) not recursing into open shadowRoots #926

Open
bayramn opened this issue Dec 7, 2024 · 3 comments

bayramn commented Dec 7, 2024

MSN article https://www.msn.com/en-us/news/world/south-korean-president-apologizes-for-declaring-martial-law-as-he-faces-impeachment-vote/ar-AA1vpHO2 simply returns null without errors.

I'm new to the library. Is this expected when an article is not readable, or is it a bug?

@danielnixon
Contributor

I'm not a Readability maintainer but I've done a bit of scraping. You're going to have two main problems with msn.com:

  1. It loads the article text dynamically (with JS, after the main page has loaded). If you use a 'naive' method of downloading the article (e.g. fetch, curl, wget), you're only ever going to get the skeleton HTML and not the article content itself. The content arrives in a subsequent request, as a JSON blob. Watch the network tab in your browser dev tools and you'll spot it.
  2. It puts the article text in a shadow DOM element (which Readability doesn't seem to extract).

To get past 1, you can use playwright or similar browser automation. You may already be doing this, not sure. You'll need to wait for the page to have loaded the article JSON and written it into the DOM. Using waitUntil: "networkidle" in page.goto is discouraged but gets the job done. Once you get that working, you'll probably be better off moving to waiting for a selector that you know only appears on the page once the article is loaded.
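A minimal sketch of the "wait for a selector" approach described above. It assumes `page` is a Playwright `Page` (anything exposing the same `goto`, `waitForSelector` and `content` methods will do). The function name `fetchRenderedHtml`, the `readySelector` argument and the 30 second default timeout are illustrative choices, not something from this thread:

```javascript
// Sketch only: `page` is assumed to be a Playwright Page object.
// `readySelector` is a placeholder for a selector that you have confirmed
// only matches once the article body has been written into the DOM.
async function fetchRenderedHtml(page, url, readySelector, timeoutMs = 30000) {
  // Return as soon as the skeleton HTML has been parsed...
  await page.goto(url, { waitUntil: "domcontentloaded" });
  // ...then wait explicitly for the article content instead of "networkidle".
  await page.waitForSelector(readySelector, { timeout: timeoutMs });
  // Serialized light DOM of the page; shadow DOM content is not included here.
  return page.content();
}
```

With Playwright installed, `page` would typically come from `chromium.launch()` followed by `browser.newPage()`. The HTML returned by `page.content()` still lacks anything living inside shadow roots, which is problem 2.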

2 is trickier. To solve that, you'll need to run a script like this one to extract the content from the shadow DOM nodes. You can run the script in the context of the playwright page with page.evaluate(thatScript). I use a slight variation of that script that can handle being passed the document node, so I can just call return extractHTML(document); to serialize the whole page, including the html and head elements. The main addition you'd need to make to that script is basically:

    if (node instanceof Document) {
      return extractHTML(node.documentElement);
    }

just before the "// beyond here, only deal with element nodes" bail-out.
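For readers without the linked script to hand, here is a rough, independent sketch of what such a serializer might look like. It is not the linked script: the helper names, the void-tag list and the decision to emit shadow content before light DOM children are all assumptions. It uses numeric `nodeType` checks rather than `instanceof Document` so that it does not depend on DOM globals:

```javascript
// nodeType values from the DOM standard, spelled out to avoid relying on globals.
const ELEMENT_NODE = 1, TEXT_NODE = 3, DOCUMENT_NODE = 9;

// Elements that never have a closing tag in HTML.
const VOID_TAGS = new Set(["area", "base", "br", "col", "embed", "hr", "img",
  "input", "link", "meta", "source", "track", "wbr"]);

const escapeText = (s) =>
  s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
const escapeAttr = (s) => s.replace(/&/g, "&amp;").replace(/"/g, "&quot;");

// Serialize a node to HTML, inlining the contents of any OPEN shadow root.
function extractHTML(node) {
  // Allow being handed the document itself.
  if (node.nodeType === DOCUMENT_NODE) return extractHTML(node.documentElement);
  if (node.nodeType === TEXT_NODE) return escapeText(node.nodeValue);
  // Beyond here, only deal with element nodes (comments etc. are dropped).
  if (node.nodeType !== ELEMENT_NODE) return "";

  const tag = node.localName;
  const attrs = Array.from(node.attributes)
    .map((a) => ` ${a.name}="${escapeAttr(a.value)}"`)
    .join("");
  if (VOID_TAGS.has(tag)) return `<${tag}${attrs}>`;

  // Closed shadow roots are invisible here: node.shadowRoot is null for them.
  const shadow = node.shadowRoot
    ? Array.from(node.shadowRoot.childNodes).map((n) => extractHTML(n)).join("")
    : "";
  const light = Array.from(node.childNodes).map((n) => extractHTML(n)).join("");
  return `<${tag}${attrs}>${shadow}${light}</${tag}>`;
}
```

Known simplifications: the doctype is dropped, slotted content is not placed into its `<slot>`, and text inside `<script>` or `<style>` gets escaped like any other text. None of that should matter much for feeding Readability, but it makes the output unsuitable as a faithful copy of the page.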

Once you've got that html from playwright, pass it to Readability and it will work.

You may want to remove elements matching this selector from the DOM first: ".article-video-slot, video-card". Otherwise, you'll see (useless) video player elements at the top of the article produced by Readability. But that's just a minor point relative to everything else.
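A small sketch of that clean-up step, assuming `doc` is a DOM `Document` built from the serialized HTML (for example by a DOM implementation such as jsdom). The function name `removeVideoSlots` is made up for illustration; the selector is the one quoted above:

```javascript
// Selector taken from the comment above; it may stop matching if MSN changes its markup.
const VIDEO_SELECTOR = ".article-video-slot, video-card";

// Remove video player elements from `doc` and report how many were dropped.
function removeVideoSlots(doc, selector = VIDEO_SELECTOR) {
  // Copy to an array first so removal cannot disturb the iteration.
  const matches = Array.from(doc.querySelectorAll(selector));
  for (const el of matches) {
    el.remove();
  }
  return matches.length;
}
```

Call `removeVideoSlots(doc)` before handing the same document to `new Readability(doc).parse()`.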

Problem 1 isn't really a problem for Readability to solve. Problem 2 arguably might be. It might be nice if Readability was able to dive into shadow DOM elements. Maybe _getNextNode should check node.shadowRoot && node.shadowRoot.firstElementChild as well as just checking node.firstElementChild.
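To make that suggestion concrete, here is a standalone sketch of a depth-first "next node" step that also descends into open shadow roots. It is not Readability's actual code and has not been tried against it. The name `getNextNodeWithShadow` and the choice to visit shadow children before light DOM children are assumptions:

```javascript
const DOCUMENT_FRAGMENT_NODE = 11; // a ShadowRoot is a kind of DocumentFragment

// Depth-first traversal step: shadow children, then light children,
// then siblings, then climb. Returns null when the walk is finished.
function getNextNodeWithShadow(node, ignoreSelfAndKids = false) {
  if (!ignoreSelfAndKids) {
    const shadowChild = node.shadowRoot && node.shadowRoot.firstElementChild;
    if (shadowChild) return shadowChild;
    if (node.firstElementChild) return node.firstElementChild;
  }
  let current = node;
  while (current) {
    if (current.nextElementSibling) return current.nextElementSibling;
    const parent = current.parentNode;
    // Check nodeType as well as `host`: <a> elements also have a `host`
    // property (the URL host), so `host` alone would be ambiguous.
    if (parent && parent.nodeType === DOCUMENT_FRAGMENT_NODE && parent.host) {
      // The shadow tree is exhausted: continue with the host's light DOM
      // children, or keep climbing from the host if it has none.
      if (parent.host.firstElementChild) return parent.host.firstElementChild;
      current = parent.host;
    } else {
      current = parent;
    }
  }
  return null;
}
```

Traversal is only part of the story. Any code that scores, moves or removes nodes via `parentNode` would also need to cope with a parent that is a `ShadowRoot`, so a real change would likely touch more than this one method.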

gijsk changed the title from "MSN article returns NULL" to "MSN article returns NULL due to (at least) not recursing into open shadowRoots" on Dec 17, 2024
@gijsk
Contributor

gijsk commented Dec 17, 2024

Thanks @danielnixon! And yes, it sounds like we'd need to update Readability to recurse into shadow roots... It still wouldn't work with closed shadow roots, I expect. I don't know that there's anything we could do about that, but then, I think it'd be unlikely that article pages use those for the main article...

@danielnixon
Contributor

danielnixon commented Dec 18, 2024

I agree it's unlikely. Even open shadow roots are seemingly rare in the sorts of pages one might want to use Readability on.

If it came to it, there are ways: https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/API/dom/openOrClosedShadowRoot (assuming we're in a browser addon context)
