checksum mismatch vs missing #32

Open
photomedia opened this issue Jul 30, 2021 · 3 comments


photomedia commented Jul 30, 2021

A checksum MISMATCH should only occur when there is an existing checksum in the EPrints database for a file, and it doesn't match what is being checked. In the case of a MISMATCH, what the system does should be controlled by this option. From documentation:

$c->{DPExport}->{'on-checksum-mismatch'} = 'skip-proceed';   # or 'halt'

skip-proceed should be the default, meaning that the problematic eprint is flagged with an error in the eprint's digital preservation errors field, but the batch job continues. If 'halt' is chosen, the entire batch job that the problematic eprint is a part of halts.

This option still needs to be implemented; it is not yet in the code.

However, MISMATCH is not the same as a MISSING checksum in the EPrints database for a file/document. In this case, the system should do the following (from documentation):

For files with no MD5 value in the EPrints database:

**Ensure that the file is actually part of this eprint**
Generate a new MD5 from the file on disk
Write the MD5 to the EPrints database
Write the MD5 to the checksum.md5 manifest
Note in the eprint's digital preservation warnings field that the MD5 was generated for the given file
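The steps above could be sketched roughly as follows. This is only an illustration, not the plugin's code: `md5_of_file` and `fill_missing_md5` are hypothetical names, a plain hash stands in for the EPrints database, an array stands in for the warnings field, and the md5sum-style manifest line format is an assumption. It also assumes the "is this file part of the eprint" check has already passed.

```perl
use strict;
use warnings;
use Digest::MD5;

# Compute the hex MD5 digest of a file on disk.
sub md5_of_file {
    my ($path) = @_;
    open( my $fh, '<', $path ) or die "Cannot open $path: $!";
    binmode($fh);
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close($fh);
    return $digest;
}

# Hypothetical helper: fill in a missing MD5 for one file.
#   $db       - hash ref standing in for the EPrints database (path => MD5)
#   $manifest - path of the checksum.md5 manifest to append to
#   $warnings - array ref standing in for the eprint's warnings field
sub fill_missing_md5 {
    my ( $db, $file_path, $manifest, $warnings ) = @_;

    # Nothing to do if a checksum is already stored.
    return $db->{$file_path} if defined $db->{$file_path};

    my $digest = md5_of_file($file_path);
    $db->{$file_path} = $digest;    # stand-in for the database write

    # Assumed manifest line format: "<md5>  <path>", as written by md5sum.
    open( my $out, '>>', $manifest ) or die "Cannot open $manifest: $!";
    print {$out} "$digest  $file_path\n";
    close($out);

    push @{$warnings}, "MD5 generated for $file_path";
    return $digest;
}
```

In the real plugin the hash and the array would be replaced by whatever calls update the file record and the eprint's warnings field.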

The relevant code is here; it needs to distinguish the two cases, MISSING vs MISMATCH:

my $ok = ( !defined( $hash_cache{ $file_path } ) || $hash_cache{ $file_path } ne $digest ) ? 0 : 1;
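One way the single boolean above might be split into separate outcomes is sketched below. `classify_checksum` is a hypothetical name, and treating an empty stored value the same as an undefined one is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: compare the stored checksum with the freshly
# computed digest and report which of the three situations applies.
sub classify_checksum {
    my ( $stored, $digest ) = @_;

    # No checksum recorded at all (assumption: empty string counts too).
    return 'MISSING' if !defined $stored || $stored eq '';

    # A checksum is recorded but disagrees with the file on disk.
    return 'MISMATCH' if $stored ne $digest;

    return 'OK';
}
```

The caller could then pass `$hash_cache{ $file_path }` and `$digest` and branch on the returned string instead of on `$ok`.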

UPDATE: for files with no MD5 in the EPrints database, there is also a THIRD possible error, which I did encounter: a pre-existing file in the "objects" directory of the export folder that doesn't belong to the eprint currently being exported. This happens because the current export algorithm doesn't delete the objects folder before writing to it, so a file from a previous export can end up there. Such a file would not have a corresponding hash in the database either. I have therefore added "ensure that the file is actually part of this eprint" to the steps above for files with no checksum in the database.
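A leftover check along these lines might look like the sketch below. `find_leftover_files` is a hypothetical helper, the list of expected file names is assumed to come from the eprint's documents, and only the top level of the objects directory is scanned.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: list files in the objects directory that are not
# among the file names expected for the eprint being exported.
sub find_leftover_files {
    my ( $objects_dir, $expected ) = @_;
    my %is_expected = map { $_ => 1 } @{$expected};

    opendir( my $dh, $objects_dir ) or die "Cannot open $objects_dir: $!";
    my @entries = readdir($dh);
    closedir($dh);

    # Keep plain files only (this also drops "." and "..") that are unexpected.
    my @leftovers =
        grep { !$is_expected{$_} && -f File::Spec->catfile( $objects_dir, $_ ) }
        @entries;

    return sort @leftovers;
}
```

Any file this returns should be neither checksummed nor written to the database for the current eprint.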

photomedia commented

To summarize:

We need to review the logic around error throwing on checksum mismatch vs checksum missing. A missing checksum should be logged as such and, by default, generated and added to the EPrints database, but only IF the file is actually part of the eprint and not a leftover file from a previous export. A checksum mismatch should still skip-proceed by default, whereas a missing checksum should produce a log message noting that a checksum was generated and added for the file. This behaviour (on-checksum-missing) could also be controlled by a config flag, skip-proceed|halt|generate, with generate being the default.
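The two flags and their defaults could be resolved with something like the sketch below. It is only an illustration of the proposal: `resolve_policy` and `%POLICY` are hypothetical names, and the config is assumed to be a plain hash keyed by option name.

```perl
use strict;
use warnings;

# Proposed options with their defaults and allowed values.
my %POLICY = (
    'on-checksum-mismatch' => {
        default => 'skip-proceed',
        allowed => [qw(skip-proceed halt)],
    },
    'on-checksum-missing' => {
        default => 'generate',
        allowed => [qw(skip-proceed halt generate)],
    },
);

# Hypothetical helper: return the configured value for an option,
# falling back to its default and rejecting values that are not allowed.
sub resolve_policy {
    my ( $config, $option ) = @_;
    my $spec = $POLICY{$option} or die "Unknown option: $option";

    my $value = $config->{$option};
    return $spec->{default} if !defined $value;

    die "Invalid value '$value' for $option"
        unless grep { $_ eq $value } @{ $spec->{allowed} };
    return $value;
}
```

With an empty config, `resolve_policy( {}, 'on-checksum-missing' )` returns `generate` and `resolve_policy( {}, 'on-checksum-mismatch' )` returns `skip-proceed`.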

photomedia commented

I added some code to differentiate between the two issues: checksum MISMATCH vs MISSING.
In the case of MISSING, a checksum is added for the file in the EPrints database and processing continues.

281ef81

photomedia commented

The missing-checksum case is now resolved with the following commit: c754a3e
I updated the README with this as well, including the new "add_missing_checksums" configuration option.

The leftover to-do item is just to control what happens in case of checksum-mismatch:
$c->{DPExport}->{'on-checksum-mismatch'} = 'skip-proceed';   # or 'halt'
