Large corpus: Automation Tools stalling #36
Comments
Thanks. This is why I added the --limit parameter to the bin scripts (https://github.com/eprintsug/EPrintsArchivematica#bin-scripts). I use it to cap how many items I process at one time. For the initial export of all of our content, we will need to export 20000 items, but I won't do this all at once; instead, I will work in batches of 500 or 1000 at a time. After each batch is done and processed, I check the logs/Archivematica records in the GUI to see that all items have a UUID in EPrints, meaning they were all archived, and then clear the transfers in the Archivematica dashboard and clear the transfer directory. After the initial export of 20000+ items is done, I don't imagine I will need these limits, because we will just be exporting one day's (or one week's) worth of deposits.
Cheers Tomasz. The limit parameter is very helpful indeed. :-) This suggestion is just that: something for the future! To be honest, I think the issue is more for repositories connected to a CRIS (i.e. Pure). Pure performs so many updates to individual eprints that every week many thousands of items are 'touched'. It is difficult to know whether some of these touches are significant or not, or to distinguish them from updates initiated by team members, so re-processing them all is the only safe course of action (even with the strictest export triggers imposed). And, of course, they need to be processed quickly, before they are touched again. In these instances it would be simpler to export and then re-process everything, because processing in batches requires repeated intervention. Modifying the directory structure could mean that the job could be added to the crontab and intervention could be minimized. Something to ruminate on!
Ok, I understand. Let's keep this open for comments for a while. My impression is that this change would be very difficult to implement. My thoughts as to why are the following:
Yes, I agree. It occurred to me too that -- even if it could be implemented -- it would cause problems for repositories using the existing plugin and directory structure. It might not be worth going here!
It certainly helps -- this is currently the only trigger we have enabled. But it doesn't completely resolve the issue because Pure's interaction with repositories (inc. DSpace) is very primitive. Interactions use Elsevier's proprietary connector rather than SWORD. Pure has dozens of cron jobs, some of which then initiate a write to EPrints, and this sometimes includes over-writing a file even if there has been no change to the file. But, to EPrints, it appears as if there has been a file change. 👎 Things would be a lot easier if Pure just used SWORD like any normal system.
I'm not sure but I am hoping so! :-) Seems to be working so far, but I can report back soon with the benefit of further testing.
This question is in relation to a potential change to the directory structure, yes? If so, my instinct says the structure would have to change because it would be suboptimal to have a lack of UUID specificity. But I guess this is another reason for us to conclude, 'Here be dragons!' ;-)
This is ostensibly an Archivematica issue but future improvements to the plug-in could deliver improvements to the export and ingest process.
In instances where a large number of items have been exported from EPrints for Archivematica ingest, it is common for Archivematica's 'Automation Tools' (AT) functionality to stall. Liaison with Artefactual indicates that this is because AT can struggle with the number of directories in the transfer source (which for us is > 60,000). But even at lower numbers AT can stall, requiring technical intervention on the Archivematica side to restart AT. This stalling can happen frequently and, as EPrints repositories expand their preservation exports, it will become an increasingly common issue.
A possible solution/improvement might be to enhance the directory structure used to export EPrints by adding an additional level of hierarchy. Instead of exporting every AID directory individually at the top level of the transfer source, export each EPrint AID directory within a corresponding parent directory -- for example, grouping AIDs by 1-999, 1000-1999, 2000-2999, and so forth?
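To make the grouping idea concrete, here is a minimal sketch in Python of how a parent directory could be derived from an AID. It is purely illustrative: the function names `bucket_for` and `export_path`, the bucket size of 1000, and the assumption that each item's directory is named by its numeric AID alone are all mine, not anything the plugin or Automation Tools currently does.

```python
from pathlib import PurePosixPath


def bucket_for(aid: int, size: int = 1000) -> str:
    """Return a parent directory name such as '1000-1999' for a numeric AID.

    Hypothetical helper: the bucket size and naming scheme are assumptions.
    """
    if aid < 1:
        raise ValueError("AID must be a positive integer")
    start = (aid // size) * size
    end = start + size - 1
    # The first bucket is labelled from 1 rather than 0, as in "1-999".
    return f"{max(start, 1)}-{end}"


def export_path(root: str, aid: int, size: int = 1000) -> PurePosixPath:
    """Build the nested export path: <root>/<bucket>/<aid>.

    Assumes the item directory is named by the AID only, which may not
    match the directory names the plugin actually writes.
    """
    return PurePosixPath(root) / bucket_for(aid, size) / str(aid)


if __name__ == "__main__":
    for aid in (1, 999, 1000, 2345):
        print(export_path("/transfer-source", aid))
```

With buckets of 1000, a transfer source holding around 60,000 items would have roughly 60 top-level directories instead of 60,000. Whether AT copes any better with a nested layout like this is untested here.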