Some URLs cause img2dataset to hang indefinitely #437
Comments
That's right, but we currently have no good way to add URL-level timeouts without decreasing speeds significantly. A chunk-level timeout is an option.
In the meantime, can you ignore this one URL, or are there more?
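For illustration, here is a minimal sketch of what a URL-level hard timeout could look like (this is not img2dataset's code; `download_with_deadline` and `deadline_seconds` are made-up names): run each download in a worker thread and abandon it once a wall-clock deadline passes. The abandoned thread stays stuck in its read until the remote server gives up, which is exactly the kind of overhead that drags down throughput.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def download_with_deadline(download_fn, row, deadline_seconds, **kwargs):
    """Call download_fn(row, **kwargs), but give up after deadline_seconds."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(download_fn, row, **kwargs)
    try:
        return future.result(timeout=deadline_seconds)
    except FutureTimeout:
        key, _url = row
        # The hung worker cannot be killed; it is simply abandoned and keeps
        # holding its connection, which is the throughput cost mentioned above.
        return key, None, f"download exceeded {deadline_seconds}s deadline"
    finally:
        # Don't block on the (possibly hung) worker; just stop accepting new work.
        pool.shutdown(wait=False)
```

Called as `download_with_deadline(download_image, (key, url), 60, timeout=10, user_agent_token=None, disallowed_header_directives=None)`, it keeps the usual `(key, img_stream, error)` return shape; the 60s deadline is an arbitrary example value.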
On Thu, Oct 24, 2024, 16:36 Nathan Wang wrote:
Some URLs, such as https://www.ihypress.com/holidays/christmas/xmas-icon.png (which appears in Laion <https://huggingface.co/datasets/laion/aesthetics_v2_4.5>), never finish downloading. In particular, this URL redirects to a video livestream that never ends.
The current download_image implementation has a timeout for *establishing* the connection, but it has no timeout for actually downloading the image itself.
```python
import io
import urllib.request

# is_disallowed is a helper defined elsewhere in img2dataset's downloader module.


def download_image(row, timeout, user_agent_token, disallowed_header_directives):
    """Download an image with urllib"""
    key, url = row
    img_stream = None
    user_agent_string = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0"
    if user_agent_token:
        user_agent_string += f" (compatible; {user_agent_token}; +https://github.com/rom1504/img2dataset)"
    try:
        request = urllib.request.Request(url, data=None, headers={"User-Agent": user_agent_string})
        with urllib.request.urlopen(request, timeout=timeout) as r:
            if disallowed_header_directives and is_disallowed(
                r.headers,
                user_agent_token,
                disallowed_header_directives,
            ):
                return key, None, "Use of image disallowed by X-Robots-Tag directive"
            img_stream = io.BytesIO(r.read())  # can hang indefinitely here!
        return key, img_stream, None
    except Exception as err:  # pylint: disable=broad-except
        if img_stream is not None:
            img_stream.close()
        return key, None, str(err)
```
In particular, this function can block indefinitely on r.read(), making img2dataset unable to finish downloading Laion.
I believe adding proper timeouts (#261) will fix this issue.
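For context, `urlopen`'s `timeout` only bounds individual socket operations, so a stream that keeps delivering bytes never trips it. Below is a minimal sketch of a download-side bound (hypothetical name and default limits, not img2dataset's actual code): read the body in chunks against a wall-clock deadline and a size cap.

```python
import io
import time


def read_body_with_deadline(response, deadline=60, max_bytes=50 * 1024 * 1024):
    """Read an HTTP response body in chunks, bounded by a deadline and a size cap."""
    start = time.monotonic()
    buf = io.BytesIO()
    while True:
        chunk = response.read(64 * 1024)  # each read is still covered by the socket timeout
        if not chunk:
            break
        buf.write(chunk)
        if time.monotonic() - start > deadline:
            raise TimeoutError(f"body not fully downloaded within {deadline}s")
        if buf.tell() > max_bytes:
            raise ValueError(f"body exceeded {max_bytes} bytes")
    buf.seek(0)
    return buf
```

Swapping something like this in for `img_stream = io.BytesIO(r.read())` would turn the livestream URL above into an ordinary per-URL error instead of an indefinite hang.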
Thanks for your response! For …