Feature Request: Add Support for Redirects in img2dataset #442

Open
airatvibe opened this issue Dec 12, 2024 · 0 comments

Currently, img2dataset fails to download files from URLs that sit behind a chain of HTTP redirects. For example, downloading the file at the following URL fails because the request is redirected multiple times (11 consecutive 302 responses in the wget log below) before the file is served:

https://hors.easymerch.ru/analytics/photos/view/f-hors/6/70836/70840/239501.jpg

Below is an example of how wget handles the redirects:

wget https://hors.easymerch.ru/analytics/photos/view/f-hors/6/70836/70840/239501.jpg
--2024-12-12 18:00:45--  https://hors.easymerch.ru/analytics/photos/view/f-hors/6/70836/70840/239501.jpg
Resolving hors.easymerch.ru (hors.easymerch.ru)... 77.223.102.239, 188.246.224.25, 5.182.4.205, ...
Connecting to hors.easymerch.ru (hors.easymerch.ru)|77.223.102.239|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://files21.easymerch.ru/f-hors/6/70836/70840/239501.jpg [following]
--2024-12-12 18:00:45--  https://files21.easymerch.ru/f-hors/6/70836/70840/239501.jpg
Resolving files21.easymerch.ru (files21.easymerch.ru)... 95.217.111.153
Connecting to files21.easymerch.ru (files21.easymerch.ru)|95.217.111.153|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://files20.easymerch.ru/f-hors/6/70836/70840/239501.jpg [following]
--2024-12-12 18:00:45--  https://files20.easymerch.ru/f-hors/6/70836/70840/239501.jpg
Resolving files20.easymerch.ru (files20.easymerch.ru)... 135.181.16.12
Connecting to files20.easymerch.ru (files20.easymerch.ru)|135.181.16.12|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://files19.easymerch.ru/f-hors/6/70836/70840/239501.jpg [following]
--2024-12-12 18:00:46--  https://files19.easymerch.ru/f-hors/6/70836/70840/239501.jpg
Resolving files19.easymerch.ru (files19.easymerch.ru)... 95.217.111.157
Connecting to files19.easymerch.ru (files19.easymerch.ru)|95.217.111.157|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://files18.easymerch.ru/f-hors/6/70836/70840/239501.jpg [following]
--2024-12-12 18:00:46--  https://files18.easymerch.ru/f-hors/6/70836/70840/239501.jpg
Resolving files18.easymerch.ru (files18.easymerch.ru)... 65.21.140.24
Connecting to files18.easymerch.ru (files18.easymerch.ru)|65.21.140.24|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://files17.easymerch.ru/f-hors/6/70836/70840/239501.jpg [following]
--2024-12-12 18:00:46--  https://files17.easymerch.ru/f-hors/6/70836/70840/239501.jpg
Resolving files17.easymerch.ru (files17.easymerch.ru)... 65.21.138.242
Connecting to files17.easymerch.ru (files17.easymerch.ru)|65.21.138.242|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://files16.easymerch.ru/f-hors/6/70836/70840/239501.jpg [following]
--2024-12-12 18:00:46--  https://files16.easymerch.ru/f-hors/6/70836/70840/239501.jpg
Resolving files16.easymerch.ru (files16.easymerch.ru)... 65.21.143.51
Connecting to files16.easymerch.ru (files16.easymerch.ru)|65.21.143.51|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://files15.easymerch.ru/f-hors/6/70836/70840/239501.jpg [following]
--2024-12-12 18:00:47--  https://files15.easymerch.ru/f-hors/6/70836/70840/239501.jpg
Resolving files15.easymerch.ru (files15.easymerch.ru)... 65.21.201.86
Connecting to files15.easymerch.ru (files15.easymerch.ru)|65.21.201.86|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://files14.easymerch.ru/f-hors/6/70836/70840/239501.jpg [following]
--2024-12-12 18:00:47--  https://files14.easymerch.ru/f-hors/6/70836/70840/239501.jpg
Resolving files14.easymerch.ru (files14.easymerch.ru)... 65.21.204.225
Connecting to files14.easymerch.ru (files14.easymerch.ru)|65.21.204.225|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://files13.easymerch.ru/f-hors/6/70836/70840/239501.jpg [following]
--2024-12-12 18:00:47--  https://files13.easymerch.ru/f-hors/6/70836/70840/239501.jpg
Resolving files13.easymerch.ru (files13.easymerch.ru)... 65.21.235.55
Connecting to files13.easymerch.ru (files13.easymerch.ru)|65.21.235.55|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://files12.easymerch.ru/f-hors/6/70836/70840/239501.jpg [following]
--2024-12-12 18:00:47--  https://files12.easymerch.ru/f-hors/6/70836/70840/239501.jpg
Resolving files12.easymerch.ru (files12.easymerch.ru)... 65.21.204.240
Connecting to files12.easymerch.ru (files12.easymerch.ru)|65.21.204.240|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://files11.easymerch.ru/f-hors/6/70836/70840/239501.jpg [following]
--2024-12-12 18:00:48--  https://files11.easymerch.ru/f-hors/6/70836/70840/239501.jpg
Resolving files11.easymerch.ru (files11.easymerch.ru)... 65.21.204.245
Connecting to files11.easymerch.ru (files11.easymerch.ru)|65.21.204.245|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5718993 (5,5M) [image/jpeg]
Saving to: ‘239501.jpg.3’

239501.jpg.3                                    100%[=====================================================================================================>]   5,45M  9,76MB/s    in 0,6s    

2024-12-12 18:00:48 (9,76 MB/s) - ‘239501.jpg.3’ saved [5718993/5718993]

To ensure img2dataset works seamlessly with such URLs, it would be helpful to add a feature that enables automatic following of HTTP redirects.

Proposed Solution

Add an optional parameter (e.g., follow_redirects) that allows enabling auto-following of redirects during the download process. The default behavior could remain unchanged to preserve backward compatibility.

For example, the requests library already supports this functionality with its default behavior:

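import requests

response = requests.get(url, timeout=30)  # redirects are followed by default (allow_redirects=True)
response.raise_for_status()  # raise for 4xx/5xx responses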

Alternatively, this behavior could be activated with an additional CLI flag.
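
As a rough illustration of what such an option could look like with only the standard library, here is a minimal sketch. It assumes the downloader is built on urllib.request, which the error text at the end of this issue suggests but does not confirm. The names CappedRedirectHandler, make_opener, fetch and max_redirects are hypothetical and are not part of img2dataset. The default of 30 mirrors the redirect limit that requests uses.

```python
import urllib.request


class CappedRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Stock urllib redirect handler with an adjustable hop limit (hypothetical helper)."""

    def __init__(self, max_redirects):
        # The instance attribute shadows the class-level default
        # (max_redirections = 10). max_repeats is left at its default,
        # so genuine redirect loops are still detected.
        self.max_redirections = max_redirects


def make_opener(max_redirects=30):
    """Build an opener that gives up after `max_redirects` distinct redirects."""
    # Passing a subclass instance replaces the default HTTPRedirectHandler.
    return urllib.request.build_opener(CappedRedirectHandler(max_redirects))


def fetch(url, timeout=30, max_redirects=30):
    """Download `url` and return the body, following redirects up to the cap."""
    with make_opener(max_redirects).open(url, timeout=timeout) as response:
        return response.read()
```

With a limit of 11 or more, a chain like the one in the wget log would stay within the cap. A CLI flag could simply forward its value to max_redirects.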

Benefits

Enables downloading resources from dynamically redirected URLs.
Improves usability for datasets hosted on platforms with redirect-based file access.

Example Use Case

Using img2dataset to download files from:

https://hors.easymerch.ru/analytics/photos/view/f-hors/6/70836/70840/239501.jpg

Without this feature, the download fails, but with redirect support, the process completes successfully.

List of files to test:

https://hors.easymerch.ru/analytics/photos/view/f-hors/6/70836/70837/239497.jpg
https://hors.easymerch.ru/analytics/photos/view/f-hors/6/70836/70838/239498.jpg
https://hors.easymerch.ru/analytics/photos/view/f-hors/6/70836/70839/239499.jpg
https://hors.easymerch.ru/analytics/photos/view/f-hors/6/70836/70840/239500.jpg
https://hors.easymerch.ru/analytics/photos/view/f-hors/6/70836/70840/239501.jpg

Command to run:

img2dataset --url_list=list.txt --output_folder=images --processes_count 2 --thread_count 8 --image_size=256 --timeout 30

Downloading these images with img2dataset produces the following error:

HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.\nThe last 30x error message was:\nFound": 10
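
The error text above matches the message that Python's urllib.request.HTTPRedirectHandler raises when a request exceeds its redirect limits (10 distinct redirects by default), and the wget log shows 11. The sketch below is one way to reproduce the failure offline against a local server. It assumes img2dataset downloads through urllib.request, which this issue does not confirm. ChainHandler, try_fetch and run_demo are hypothetical names used only for illustration.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

HOPS = 11  # same number of 302 responses as in the wget log


class ChainHandler(BaseHTTPRequestHandler):
    """Serve /0 -> /1 -> ... -> /HOPS as 302 redirects, then a small body."""

    def do_GET(self):
        step = int(self.path.lstrip("/"))
        if step < HOPS:
            self.send_response(302)
            self.send_header("Location", f"/{step + 1}")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"image-bytes"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def try_fetch(url):
    """Return the body, or the error text if urllib gives up on the redirects."""
    # An empty ProxyHandler keeps the loopback request off any configured proxy;
    # the default HTTPRedirectHandler (limit of 10) is still installed.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=5) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        return str(err)


def run_demo():
    """Start the local redirect chain, fetch /0 once, and shut the server down."""
    server = HTTPServer(("127.0.0.1", 0), ChainHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return try_fetch(f"http://127.0.0.1:{server.server_address[1]}/0")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_demo())
```

Lowering HOPS to 10 should let the same request succeed, which is consistent with the failure being a redirect cap rather than missing redirect support.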