Show and make accessible for use recent CDXJ files in the replay system web ui #82
I am not sure I understood it completely, but in general I don't like the idea of letting the client select the CDXJ files for replay. CDXJ is nothing but an index that can be created incrementally, and the number of such files can be as few as one or more than a human could practically deal with on the client side. Merging CDXJ files is a trivial task and there is no magic involved in that. A better idea, in my opinion, is to provide an administrative interface to manage collections (namespaces) and associate one or more CDXJ files with each collection. Users can then select those named collections and not worry too much about the underlying details.
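To illustrate why merging is considered trivial, here is a minimal sketch rather than ipwb's actual code. It assumes each input is already sorted line by line, that lines starting with `!` are metadata to be skipped, and that exact duplicate records can be dropped. The name `merge_cdxj` is hypothetical.

```python
import heapq
from typing import Iterable, Iterator


def _records(lines: Iterable[str]) -> Iterator[str]:
    """Yield index records without line endings, skipping blank lines
    and '!'-prefixed metadata lines (an assumption about the format)."""
    for line in lines:
        line = line.rstrip("\n")
        if line and not line.startswith("!"):
            yield line


def merge_cdxj(*sources: Iterable[str]) -> Iterator[str]:
    """Merge already-sorted CDXJ line streams into one sorted stream.

    heapq.merge performs a lazy k-way merge, so the inputs can be open
    file objects and nothing has to be loaded into memory at once.
    """
    previous = None
    for line in heapq.merge(*(_records(src) for src in sources)):
        if line != previous:  # drop exact duplicates across inputs
            yield line
        previous = line
```

Because the merge is lazy, passing open file handles (for example `merge_cdxj(open("a.cdxj"), open("b.cdxj"))`, with placeholder file names) streams the result without building an intermediate list.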
@ibnesayeed That's a good idea. I would also like to see the association of archival collection --> set of index files in the future. An admin interface would be one way to accomplish this. Regarding "Merging CDXJ files is a trivial task and there is no magic involved in that": I am aware of this, but the current ipwb implementation allows the use of only one CDXJ file at a time, which seems very limiting. Manipulation of ipwb-compatible CDXJ files might be the job of a separate (sub-)tool.
The current implementation allows only one CDXJ file because of the nature of the hackathon we developed it in. Extending it to iterate over a list of CDXJ files would not be difficult, but before that we need collection namespacing in place.
@ibnesayeed How do you imagine the user should specify this list of CDXJ files within a collection?
There are many ways to implement this. Some would be more flexible and customizable than others, but might require more components, such as some sort of database. One simple approach would be to introduce a convention and use the structure of the file system itself.
Look at the above directory and file organization. With this in place, if the replay server is invoked with the following command:
The server should recursively read all the CDXJ files that fall under the selected collection namespace. For example, when requested for collection
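The file-system convention could be sketched roughly as follows. This is only an illustration under assumptions: the index root contains one directory per collection, index files end in `.cdxj`, and the function name `cdxj_files_for` is hypothetical rather than part of ipwb.

```python
from pathlib import Path
from typing import List


def cdxj_files_for(index_root: str, collection: str) -> List[Path]:
    """Return every .cdxj file under <index_root>/<collection>, recursively.

    The directory name doubles as the collection namespace, so a nested
    collection is addressed by a relative path such as "news/2016".
    """
    base = Path(index_root) / collection
    if not base.is_dir():
        return []  # unknown collection: nothing to index
    # rglob walks all subdirectories; sorting keeps the order stable.
    return sorted(p for p in base.rglob("*.cdxj") if p.is_file())
```

With this convention, adding an index file to a collection is just a matter of dropping it into the right directory; no database or extra configuration is involved.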
@ibnesayeed Good stuff. We can use this as the basis for introducing the collection concept into ipwb at some point.
The idea behind this is fast switching between "data sets", and it would also lay the foundation for "merging" CDXJ files. As another use case, if we ever provide the ability to extract only the CDXJ lines relevant to some user-provided parameter (e.g., only .co.uk URI-Rs), this would reduce the time the replay system needs to "find" the relevant URI-R/M at query time.
@ibnesayeed Thoughts on procedures and dynamics of CDXJ merging? Your work on archive profiling seems relevant here.
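As a rough sketch of the extraction use case, assuming each CDXJ record begins with the SURT form of the URI-R (so a `.co.uk` host yields a key starting with `uk,co,`) and that `!`-prefixed lines are metadata, a filter could look like this. The function name `filter_by_surt_prefix` is hypothetical.

```python
from typing import Iterable, Iterator


def filter_by_surt_prefix(lines: Iterable[str], prefix: str) -> Iterator[str]:
    """Yield only the CDXJ records whose SURT key starts with `prefix`.

    Assumes records begin with the SURT key and that lines starting
    with '!' are metadata rather than index records.
    """
    for line in lines:
        if line.startswith("!"):
            continue  # skip metadata lines
        if line.startswith(prefix):
            yield line
```

Writing the filtered lines to a smaller CDXJ file once, ahead of time, means the replay system only has to search that subset at query time.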