workspace.download_file: do not change basename #809

Open
bertsky opened this issue Feb 24, 2022 · 1 comment

bertsky commented Feb 24, 2022

I think this is caused by a change in assets: OCR-D/assets@b12e5eb, which was supposed to fix OCR-D/assets#87, but does not work.
Here is a debug log of what actually happens when copying the workspace to a temporary location:

DEBUG    ocrd.resolver.workspace_from_url:resolver.py:164 workspace_from_url
mets_basename='mets.xml'
mets_url='/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/mets.xml'
src_baseurl='/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data'
dst_dir='/tmp/test-ocrd-calamari'
DEBUG    ocrd.resolver.download_to_directory:resolver.py:49 directory=|/tmp/test-ocrd-calamari| url=|/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/mets.xml| basename=|mets.xml| if_exists=|skip| subdir=|None|
DEBUG    ocrd.resolver.download_to_directory:resolver.py:99 Copying file '/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/mets.xml' to '/tmp/test-ocrd-calamari/mets.xml'
DEBUG    ocrd.workspace.download_file:workspace.py:142 download_file <OcrdFile fileGrp=OCR-D-IMG ID=OCR-D-IMG_0001, mimetype=image/tiff, url=OCR-D-IMG/INPUT_0017.tif, local_filename=OCR-D-IMG/INPUT_0017.tif]/>  [_recursion_count=0]
DEBUG    ocrd.resolver.download_to_directory:resolver.py:49 directory=|/tmp/test-ocrd-calamari| url=|OCR-D-IMG/INPUT_0017.tif| basename=|OCR-D-IMG_0001.tif| if_exists=|skip| subdir=|OCR-D-IMG|
DEBUG    ocrd.workspace.download_file:workspace.py:158 First run of resolver.download_to_directory(OCR-D-IMG/INPUT_0017.tif) failed, try prepending baseurl '/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data': File path passed as 'url' to download_to_directory does not exist: OCR-D-IMG/INPUT_0017.tif
DEBUG    ocrd.workspace.download_file:workspace.py:142 download_file <OcrdFile fileGrp=OCR-D-IMG ID=OCR-D-IMG_0001, mimetype=image/tiff, url=/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/OCR-D-IMG/INPUT_0017.tif, local_filename=OCR-D-IMG/INPUT_0017.tif]/>  [_recursion_count=1]
DEBUG    ocrd.resolver.download_to_directory:resolver.py:49 directory=|/tmp/test-ocrd-calamari| url=|/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/OCR-D-IMG/INPUT_0017.tif| basename=|OCR-D-IMG_0001.tif| if_exists=|skip| subdir=|OCR-D-IMG|
DEBUG    ocrd.resolver.download_to_directory:resolver.py:99 Copying file '/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/OCR-D-IMG/INPUT_0017.tif' to '/tmp/test-ocrd-calamari/OCR-D-IMG/OCR-D-IMG_0001.tif'

So, essentially, Resolver.workspace_from_url undoes the non-standard path names when downloading, and subsequently the @imageFilename reference does not work (again).
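The mismatch can be reproduced in isolation. The following is a minimal sketch using only the standard library, not the code in core: `copy_with_id_basename` is a hypothetical stand-in for the copy step, assuming (as the log suggests) that the target name is built from the file ID plus the extension.

```python
import shutil
import tempfile
from pathlib import Path


def copy_with_id_basename(src_root, dst_root, rel_url, file_grp, file_id):
    """Hypothetical stand-in for the copy step: target name is ID + extension."""
    src = Path(src_root) / rel_url
    dst = Path(dst_root) / file_grp / (file_id + src.suffix)
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, dst)
    # Return the new workspace-relative path of the copied file
    return dst.relative_to(dst_root).as_posix()


with tempfile.TemporaryDirectory() as tmp:
    src_root = Path(tmp) / "original"
    dst_root = Path(tmp) / "clone"
    image = src_root / "OCR-D-IMG" / "INPUT_0017.tif"
    image.parent.mkdir(parents=True)
    image.write_bytes(b"placeholder, not a real TIFF")

    # What a PAGE-XML @imageFilename in the clone would still reference
    image_filename = "OCR-D-IMG/INPUT_0017.tif"
    new_rel = copy_with_id_basename(
        src_root, dst_root, image_filename, "OCR-D-IMG", "OCR-D-IMG_0001"
    )

    copied_exists = (dst_root / new_rel).exists()
    referenced_exists = (dst_root / image_filename).exists()
    print(new_rel, copied_exists, referenced_exists)
```

The copy lands under the ID-based name, so the path that `@imageFilename` points to does not exist in the clone.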

@kba I suppose we could fix this in assets by using standard basenames, but it looks more like a bug in core to me.

Originally posted by @bertsky in OCR-D/ocrd_calamari#73 (comment)

bertsky commented Feb 24, 2022

In other words, when you have a partial clone of a local workspace and you attempt to download some of its files, the following happens:

  1. chdir to the clone's Workspace.directory (the only reference to the original workspace is in Workspace.baseurl now)
  2. resolving the relative local URL fails
  3. "downloading" it fails
  4. a recursive attempt is started with the absolute local URL (from baseurl + url)
  5. chdir to the same directory again
  6. resolving the absolute local URL fails
  7. downloading it to a file named ID+ext succeeds ← this changes the relative local URL, though
  8. further down the line, in Workspace.resolve_image_exif or Workspace.image_from_page, the old relative local URL is requested via a PAGE-XML's @imageFilename
  9. it cannot be found

My feeling is that step 7 is wrong: we should at least keep the old relative URL.
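One way to express that idea, as a rough sketch only: `target_relpath` below is a hypothetical helper, not an existing function in core, and the assumption that the ID-based name is the fallback comes from the log above.

```python
import posixpath


def target_relpath(file_grp, file_id, url, local_filename=None):
    """Pick the workspace-relative path for a downloaded file (hypothetical helper)."""
    if local_filename:
        # Keep the relative path that @imageFilename references already point to
        return posixpath.normpath(local_filename)
    # Fall back to the ID-based name only when no relative path is known yet
    ext = posixpath.splitext(url)[1]
    return posixpath.join(file_grp, file_id + ext)


kept = target_relpath(
    "OCR-D-IMG",
    "OCR-D-IMG_0001",
    "/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/OCR-D-IMG/INPUT_0017.tif",
    local_filename="OCR-D-IMG/INPUT_0017.tif",
)
print(kept)
```

With the existing relative path preferred, the file from the log would stay at `OCR-D-IMG/INPUT_0017.tif` in the clone.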

But what if some PAGE files in the workspace to be cloned even contain remote references for @imageFilename?
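This does not answer the question, but such references could at least be detected before cloning. A small sketch, assuming that "remote" means an `http` or `https` URL; `is_remote_reference` is a hypothetical name:

```python
from urllib.parse import urlsplit


def is_remote_reference(image_filename):
    """Return True if an @imageFilename value looks like an http(s) URL (assumption)."""
    return urlsplit(image_filename).scheme in ("http", "https")


print(is_remote_reference("OCR-D-IMG/INPUT_0017.tif"))
print(is_remote_reference("https://example.org/INPUT_0017.tif"))
```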
