Some URLs cause img2dataset to hang indefinitely #437

Open
thecodingwizard opened this issue Oct 24, 2024 · 2 comments

Comments

@thecodingwizard

Some URLs, such as https://www.ihypress.com/holidays/christmas/xmas-icon.png (which appears in LAION), never finish downloading. This particular URL redirects to a video livestream that never ends.

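The current download_image implementation passes timeout to urlopen, which bounds the connection attempt and each individual blocking socket read, but nothing bounds the total time spent downloading the image. A server that keeps sending data, like a livestream, never trips that timeout.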

def download_image(row, timeout, user_agent_token, disallowed_header_directives):
    """Download an image with urllib"""
    key, url = row
    img_stream = None
    user_agent_string = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0"
    if user_agent_token:
        user_agent_string += f" (compatible; {user_agent_token}; +https://github.com/rom1504/img2dataset)"
    try:
        request = urllib.request.Request(url, data=None, headers={"User-Agent": user_agent_string})
        with urllib.request.urlopen(request, timeout=timeout) as r:
            if disallowed_header_directives and is_disallowed(
                r.headers,
                user_agent_token,
                disallowed_header_directives,
            ):
                return key, None, "Use of image disallowed by X-Robots-Tag directive"
            img_stream = io.BytesIO(r.read())  # can hang indefinitely here!
        return key, img_stream, None
    except Exception as err:  # pylint: disable=broad-except
        if img_stream is not None:
            img_stream.close()
        return key, None, str(err)

As a result, this function can block indefinitely on r.read(), which leaves img2dataset unable to finish downloading LAION.

I believe adding proper timeouts (#261) will fix this issue.
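
For illustration, here is a minimal sketch of a whole-download timeout, written as a drop-in for the `io.BytesIO(r.read())` line. It is not necessarily what #261 does. The helper name `read_with_deadline` and the parameters `total_timeout`, `chunk_size`, and `max_bytes` are hypothetical and are not part of img2dataset. The idea is to read in chunks and check a wall-clock deadline between chunks, so an endless stream is abandoned instead of being read forever.

```python
import io
import time


def read_with_deadline(stream, total_timeout, chunk_size=64 * 1024, max_bytes=None):
    """Read `stream` to EOF, giving up once `total_timeout` seconds have elapsed.

    `stream` is any object exposing read1(), for example the response
    returned by urllib.request.urlopen, or an io.BytesIO.
    (Hypothetical helper; not part of img2dataset.)
    """
    deadline = time.monotonic() + total_timeout
    buffer = io.BytesIO()
    while True:
        # Checked between chunks: a single read can still block for up to
        # the per-operation socket timeout passed to urlopen.
        if time.monotonic() > deadline:
            raise TimeoutError(f"download exceeded {total_timeout}s in total")
        chunk = stream.read1(chunk_size)  # returns whatever is available, up to chunk_size
        if not chunk:  # empty bytes means end of stream
            break
        buffer.write(chunk)
        # Optional size cap, so a fast endless stream cannot fill memory
        # before the deadline is reached.
        if max_bytes is not None and buffer.tell() > max_bytes:
            raise ValueError(f"response larger than {max_bytes} bytes")
    buffer.seek(0)
    return buffer
```

Inside download_image this would be used as `img_stream = read_with_deadline(r, total_timeout)`. Both exceptions are ordinary `Exception` subclasses, so the existing `except Exception` branch would report them as a failed download for that key. The worst-case time per URL is roughly `total_timeout` plus one per-operation socket timeout.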

@rom1504
Owner

rom1504 commented Oct 24, 2024 via email

@thecodingwizard
Author

Thanks for your response! For laion/aesthetics_v2_4.5, this was the only URL that hung. I also found it quite difficult to identify which URL it was stuck on; I had to binary search the feather file it was stuck on until I found the culprit (though there's probably a better way to do this). I ended up just ignoring the shard containing this URL.
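
One possible alternative to bisecting by hand, sketched under assumptions: probe every URL of the stuck shard in its own daemon thread and report the ones that have not finished after a wall-clock limit. The names `find_stalled`, `fetch`, `timeout`, and `batch_size` are hypothetical and not part of img2dataset; `fetch` is whatever callable downloads one URL.

```python
import threading
import time


def _ignore_errors(fetch, url):
    """Run fetch(url); ordinary failures (404, DNS errors, ...) are not of interest here."""
    try:
        fetch(url)
    except Exception:  # only fetches that never return matter
        pass


def find_stalled(urls, fetch, timeout, batch_size=64):
    """Return the URLs for which fetch(url) did not finish within `timeout` seconds.

    URLs are probed in batches of `batch_size` concurrent threads.
    (Hypothetical diagnostic helper; not part of img2dataset.)
    """
    stalled = []
    for start in range(0, len(urls), batch_size):
        threads = []
        for url in urls[start:start + batch_size]:
            # Daemon threads, so a fetch that never returns cannot keep
            # the script alive once the results have been collected.
            t = threading.Thread(target=_ignore_errors, args=(fetch, url), daemon=True)
            t.start()
            threads.append((url, t))
        deadline = time.monotonic() + timeout
        for url, t in threads:
            t.join(max(0.0, deadline - time.monotonic()))
            if t.is_alive():  # still downloading after the limit
                stalled.append(url)
    return stalled
```

With a `fetch` such as `lambda url: urllib.request.urlopen(url, timeout=10).read()`, any URL returned here is a candidate for the kind of endless stream described above. Stalled threads are left running in the background, so this is meant for a short diagnostic script rather than for the download pipeline itself.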
