Tests broken since last update #73

Closed
mikegerber opened this issue Feb 24, 2022 · 5 comments · Fixed by #74

@mikegerber
Collaborator

Since the last update, the tests are broken:

------------------------------------------------------------------- Captured stderr call --------------------------------------------------------------------
11:00:07.844 INFO processor.CalamariRecognize - INPUT FILE 0 / phys_0001
--------------------------------------------------------------------- Captured log call ---------------------------------------------------------------------
INFO     processor.CalamariRecognize:recognize.py:81 INPUT FILE 0 / phys_0001
================================================================== short test summary info ==================================================================
FAILED test/test_recognize.py::test_recognize - requests.exceptions.MissingSchema: Invalid URL 'OCR-D-IMG/INPUT_0017.tif': No scheme supplied. Perhaps you...
FAILED test/test_recognize.py::test_recognize_should_warn_if_given_rgb_image_and_single_channel_model - requests.exceptions.MissingSchema: Invalid URL 'OC...
FAILED test/test_recognize.py::test_word_segmentation - requests.exceptions.MissingSchema: Invalid URL 'OCR-D-IMG/INPUT_0017.tif': No scheme supplied. Per...
FAILED test/test_recognize.py::test_glyphs - requests.exceptions.MissingSchema: Invalid URL 'OCR-D-IMG/INPUT_0017.tif': No scheme supplied. Perhaps you me...
==================================================================== 4 failed in 16.04s =====================================================================
make: *** [Makefile:77: test] Error 1

Observations:

The new code from @bertsky's change in 1f0252d should download OCR-D-IMG/INPUT_0017.tif but doesn't:

% ls /tmp/test-ocrd-calamari/OCR-D-IMG 
OCR-D-IMG_0001.tif  OCR-D-IMG_0002.tif
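
The MissingSchema failures are consistent with the relative href reaching an HTTP client unchanged. A minimal sketch, using only the standard library instead of requests itself, shows that this path carries no URL scheme. The `has_scheme` helper and the example.org URL are hypothetical and only serve the illustration.

```python
from urllib.parse import urlparse


def has_scheme(ref: str) -> bool:
    """Return True if ref carries a URL scheme such as http or file."""
    return urlparse(ref).scheme != ""


# href taken from the failing tests: a workspace-relative path, not a URL
relative_ref = "OCR-D-IMG/INPUT_0017.tif"

print(has_scheme(relative_ref))
print(has_scheme("https://example.org/INPUT_0017.tif"))  # hypothetical URL
```

A reference without a scheme has to be resolved against the workspace directory before it can be opened, which only works if the file exists there under that name.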
@mikegerber
Collaborator Author

The "downloaded" images' filenames are derived from the mets:file's ID, not from the basename in the FLocat's xlink:href:

   <mets:fileGrp USE="OCR-D-IMG">
      <mets:file MIMETYPE="image/tiff" ID="OCR-D-IMG_0001">
        <mets:FLocat LOCTYPE="URL" xlink:href="OCR-D-IMG/INPUT_0017.tif"/>
      </mets:file>
      <mets:file MIMETYPE="image/tiff" ID="OCR-D-IMG_0002">
        <mets:FLocat LOCTYPE="URL" xlink:href="OCR-D-IMG/INPUT_0020.tif"/>
      </mets:file>
    </mets:fileGrp>
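
To make the mismatch concrete, here is a small sketch that parses the fragment above with the standard library and compares each href basename with a name built from the file ID. The namespace declarations are added so the fragment parses on its own, and `compare_names` is a hypothetical helper, not part of ocrd.

```python
import posixpath
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# The fileGrp from the METS, with namespace declarations added for parsing
METS_FRAGMENT = """
<mets:fileGrp xmlns:mets="http://www.loc.gov/METS/"
              xmlns:xlink="http://www.w3.org/1999/xlink" USE="OCR-D-IMG">
  <mets:file MIMETYPE="image/tiff" ID="OCR-D-IMG_0001">
    <mets:FLocat LOCTYPE="URL" xlink:href="OCR-D-IMG/INPUT_0017.tif"/>
  </mets:file>
  <mets:file MIMETYPE="image/tiff" ID="OCR-D-IMG_0002">
    <mets:FLocat LOCTYPE="URL" xlink:href="OCR-D-IMG/INPUT_0020.tif"/>
  </mets:file>
</mets:fileGrp>
""".strip()


def compare_names(fragment: str):
    """Yield (file ID, href basename, ID-based basename) per mets:file."""
    root = ET.fromstring(fragment)
    for mets_file in root.findall("mets:file", NS):
        href = mets_file.find("mets:FLocat", NS).get(XLINK_HREF)
        href_name = posixpath.basename(href)
        ext = posixpath.splitext(href_name)[1]
        yield mets_file.get("ID"), href_name, mets_file.get("ID") + ext


for file_id, href_name, id_name in compare_names(METS_FRAGMENT):
    # The last column is False whenever the two naming schemes disagree
    print(file_id, href_name, id_name, href_name == id_name)
```

For both files the two names differ, which matches the directory listing in the first comment.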

@mikegerber mikegerber self-assigned this Feb 24, 2022
@mikegerber
Collaborator Author

With an old(!) checkout of test/assets I did not get these failures with the new code, so this may be worth investigating.

@mikegerber
Collaborator Author

With an old(!) checkout of test/assets

See also #72.

@bertsky
Contributor

bertsky commented Feb 24, 2022

I think this is caused by a change in assets: OCR-D/assets@b12e5eb, which was supposed to fix OCR-D/assets#87, but does not work.
Here is a debug log of what actually happens when copying the workspace to a temporary location:

DEBUG    ocrd.resolver.workspace_from_url:resolver.py:164 workspace_from_url
mets_basename='mets.xml'
mets_url='/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/mets.xml'
src_baseurl='/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data'
dst_dir='/tmp/test-ocrd-calamari'
DEBUG    ocrd.resolver.download_to_directory:resolver.py:49 directory=|/tmp/test-ocrd-calamari| url=|/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/mets.xml| basename=|mets.xml| if_exists=|skip| subdir=|None|
DEBUG    ocrd.resolver.download_to_directory:resolver.py:99 Copying file '/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/mets.xml' to '/tmp/test-ocrd-calamari/mets.xml'
DEBUG    ocrd.workspace.download_file:workspace.py:142 download_file <OcrdFile fileGrp=OCR-D-IMG ID=OCR-D-IMG_0001, mimetype=image/tiff, url=OCR-D-IMG/INPUT_0017.tif, local_filename=OCR-D-IMG/INPUT_0017.tif]/>  [_recursion_count=0]
DEBUG    ocrd.resolver.download_to_directory:resolver.py:49 directory=|/tmp/test-ocrd-calamari| url=|OCR-D-IMG/INPUT_0017.tif| basename=|OCR-D-IMG_0001.tif| if_exists=|skip| subdir=|OCR-D-IMG|
DEBUG    ocrd.workspace.download_file:workspace.py:158 First run of resolver.download_to_directory(OCR-D-IMG/INPUT_0017.tif) failed, try prepending baseurl '/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data': File path passed as 'url' to download_to_directory does not exist: OCR-D-IMG/INPUT_0017.tif
DEBUG    ocrd.workspace.download_file:workspace.py:142 download_file <OcrdFile fileGrp=OCR-D-IMG ID=OCR-D-IMG_0001, mimetype=image/tiff, url=/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/OCR-D-IMG/INPUT_0017.tif, local_filename=OCR-D-IMG/INPUT_0017.tif]/>  [_recursion_count=1]
DEBUG    ocrd.resolver.download_to_directory:resolver.py:49 directory=|/tmp/test-ocrd-calamari| url=|/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/OCR-D-IMG/INPUT_0017.tif| basename=|OCR-D-IMG_0001.tif| if_exists=|skip| subdir=|OCR-D-IMG|
DEBUG    ocrd.resolver.download_to_directory:resolver.py:99 Copying file '/ocrd_calamari/test/assets/kant_aufklaerung_1784-page-region-line-word_glyph/data/OCR-D-IMG/INPUT_0017.tif' to '/tmp/test-ocrd-calamari/OCR-D-IMG/OCR-D-IMG_0001.tif'

So, essentially, Resolver.workspace_from_url undoes the non-standard path names when downloading (the copies are named after the file ID), and subsequently the @imageFilename reference does not resolve (again).

@kba I suppose we could fix this in assets by using standard basenames, but it looks more like a bug in core to me.
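
The effect seen in the debug log can be reproduced in isolation. The following is a simplified model of that behaviour, not the actual ocrd implementation: `copy_with_id_basename` is a hypothetical helper that copies a file into the destination under an ID-based name, after which the original href no longer resolves there.

```python
import os
import shutil
import tempfile


def copy_with_id_basename(src_base, dst_dir, file_id, href):
    """Copy src_base/href into dst_dir, naming the copy after the file ID.

    Simplified model of what the debug log shows; the subdirectory is taken
    from the href here, which is an assumption made for this sketch.
    """
    subdir, name = os.path.split(href)
    ext = os.path.splitext(name)[1]
    target = os.path.join(dst_dir, subdir, file_id + ext)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    shutil.copy(os.path.join(src_base, href), target)
    return target


with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    href = "OCR-D-IMG/INPUT_0017.tif"
    os.makedirs(os.path.join(src, "OCR-D-IMG"))
    with open(os.path.join(src, href), "wb") as f:
        f.write(b"placeholder")  # stand-in for real image data

    copied = copy_with_id_basename(src, dst, "OCR-D-IMG_0001", href)

    # The reference that other documents still use is now dangling
    href_resolves = os.path.exists(os.path.join(dst, href))
    print(os.path.basename(copied))
    print(href_resolves)
```

Any document that still points at the original href, such as a PAGE file's @imageFilename, would fail to find the image in the copied workspace.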

@mikegerber
Copy link
Collaborator Author

Relevant parts of test_recognize.py:

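from ocrd.resolver import Resolver
# `assets` is the test suite's asset helper; its import is omitted in this excerpt

METS_KANT = assets.url_of('kant_aufklaerung_1784-page-region-line-word_glyph/data/mets.xml')
WORKSPACE_DIR = '/tmp/test-ocrd-calamari'

# Clone the workspace into a temporary directory
resolver = Resolver()
workspace = resolver.workspace_from_url(METS_KANT, dst_dir=WORKSPACE_DIR)

# Download each image and print the resulting OcrdFile
for imgf in workspace.mets.find_files(fileGrp="OCR-D-IMG"):
    imgf = workspace.download_file(imgf)
    print(imgf)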

This clones the workspace from test/assets and doesn't give the correct local filenames; the printed files carry ID-based names instead of the original INPUT_* names:

<OcrdFile fileGrp=OCR-D-IMG ID=OCR-D-IMG_0001, mimetype=image/tiff, url=OCR-D-IMG/OCR-D-IMG_0001.tif, local_filename=OCR-D-IMG/OCR-D-IMG_0001.tif]/> 
<OcrdFile fileGrp=OCR-D-IMG ID=OCR-D-IMG_0002, mimetype=image/tiff, url=OCR-D-IMG/OCR-D-IMG_0002.tif, local_filename=OCR-D-IMG/OCR-D-IMG_0002.tif]/> 
