Show and make accessible for use recent CDXJ files in the replay system web ui #82

Open
machawk1 opened this issue Jan 9, 2017 · 6 comments

Comments

@machawk1
Member

machawk1 commented Jan 9, 2017

The idea behind this is to allow fast switching between "data sets" and also to lay the foundation for "merging" CDXJ files. As another use case, if we ever provide the ability to extract only the CDXJ lines relevant to some user-provided parameter (e.g., only .co.uk URI-Rs), this would reduce the time the replay system needs to "find" the relevant URI-R/M at query time.

@ibnesayeed Thoughts on procedures and dynamics of CDXJ merging? Your work on archive profiling seems relevant here.

@ibnesayeed
Member

I am not sure if I understood it completely, but in general I don't like the idea of letting the client select the CDXJ files for replay. CDXJ is nothing but an index that can be created incrementally, and the number of such files can be as few as one or so many that it is impractical for a human to deal with them on the client side. Merging CDXJ files is a trivial task and there is no magic involved in that. A better idea, in my opinion, is to provide an administrative interface to manage collections (namespaces) and associate one or more CDXJ files with each collection. Users can then select those named collections and not worry too much about the underlying details.
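
To illustrate the point about merging: since CDXJ records are sorted lines, merging amounts to a k-way merge of sorted streams. The following is only a minimal sketch, not ipwb code. The function name `merge_cdxj` is hypothetical, and it assumes that each input is already sorted and that lines starting with `!` are header/metadata lines that can be skipped.

```python
import heapq
from typing import Iterable, Iterator


def merge_cdxj(*sources: Iterable[str]) -> Iterator[str]:
    """Merge already-sorted CDXJ line streams into one sorted stream.

    Hypothetical helper, not part of ipwb.
    """
    def records(lines: Iterable[str]) -> Iterator[str]:
        for line in lines:
            line = line.rstrip("\n")
            # Assumption: blank lines and "!"-prefixed header lines are not records.
            if line and not line.startswith("!"):
                yield line

    previous = None
    # heapq.merge lazily merges inputs that are each sorted on their own.
    for line in heapq.merge(*(records(s) for s in sources)):
        if line != previous:  # drop exact duplicate records
            yield line
        previous = line
```

Because the merge is lazy, open file objects can be passed in directly, e.g. `merge_cdxj(open("a.cdxj"), open("b.cdxj"))`, without loading either index into memory.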

@machawk1
Member Author

machawk1 commented Jan 9, 2017

@ibnesayeed That's a good idea. I would like to see the association of archival collection-->set of index files as well in the future. An admin interface would be one way to accomplish this.

> Merging CDXJ files is a trivial task and there is no magic involved in that.

I am aware of this, but the current ipwb implementation allows the use of only one CDXJ file at a time, which seems very limiting. Manipulation of ipwb-compatible CDXJ files might be the job of a separate (sub-)tool.

@ibnesayeed
Member

The current implementation allows only one CDXJ file because it was developed during a hackathon. Extending it to iterate over a list of CDXJ files would not be difficult, but before that we need collection namespacing in place.
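
As a rough illustration of iterating over a list of CDXJ files, here is a minimal sketch. It is not how ipwb performs lookups: the function name `lookup` is hypothetical, and a plain linear scan stands in for whatever search the replay system actually uses. It assumes each record line starts with the lookup key followed by a space.

```python
from typing import Iterable, Iterator


def lookup(key: str, cdxj_paths: Iterable[str]) -> Iterator[str]:
    """Yield every record whose first field equals `key`, across all files.

    Hypothetical helper, not part of ipwb.
    """
    prefix = key + " "
    for path in cdxj_paths:
        with open(path, encoding="utf-8") as index:
            for line in index:
                # Linear scan for simplicity; a sorted index would also
                # allow a binary search here.
                if line.startswith(prefix):
                    yield line.rstrip("\n")
```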

@machawk1
Member Author

machawk1 commented Jan 9, 2017

@ibnesayeed How do you imagine this list of CDXJ files within a collection should be specified by the user?

@ibnesayeed
Member

There are really many ways to implement this. Some would be more flexible and customizable than others, but might require more components such as some sort of database. One simple approach would be to introduce a convention and utilize the structure of the file system itself.

/ipwb/collections/
├── bar/
│   ├── bar-1.cdxj
│   ├── bar-2.cdxj
│   └── bar-3.cdxj
├── foo/
│   ├── baz/
│   │   └── foo-baz-1.cdxj
│   ├── blah/
│   │   ├── foo-blah-1.cdxj
│   │   ├── foo-blah-2.cdxj
│   │   └── metadata.yaml
│   ├── foo-1.cdxj
│   └── metadata.yaml
└── metadata.yaml

Consider the directory and file organization above. With this in place, suppose the replay server is invoked with the following command:

$ ipwb replay /ipwb/collections/

The server should recursively read all the CDXJ files that fall under the selected collection namespace. For example, when collection bar is requested, it should look up using bar-1.cdxj, bar-2.cdxj, and bar-3.cdxj. When foo is requested, it should look up using foo-1.cdxj, foo-baz-1.cdxj, foo-blah-1.cdxj, and foo-blah-2.cdxj. However, when the collection foo/blah is requested, it should only look up using foo-blah-1.cdxj and foo-blah-2.cdxj. If no collection is specified, it should read all the CDXJ files under /ipwb/collections/ recursively.

Additionally, each collection directory (at any nesting level) can contain an optional metadata.yaml file that customizes various properties of the collection, such as a more human-friendly form of the collection name, a description of the collection, and inclusion/exclusion patterns that override the default behavior of the replay system for that collection namespace. PyWB does similar collection management, but I am not sure whether it supports recursive sub-collections or just a flat list of collections.
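
The directory convention described above could be resolved roughly as follows. This is only a sketch under stated assumptions, not an existing ipwb feature: the function name `cdxj_files_for` is hypothetical, collections are identified by their path relative to the root (e.g. `foo/blah`), and metadata.yaml handling is left out.

```python
from pathlib import Path
from typing import List, Optional


def cdxj_files_for(root: Path, collection: Optional[str] = None) -> List[Path]:
    """Return all CDXJ files under a collection namespace, recursively.

    Hypothetical helper, not part of ipwb. With no collection given,
    every CDXJ file under the root is returned.
    """
    root = root.resolve()
    base = (root / collection).resolve() if collection else root
    # Refuse names such as "../elsewhere" that would escape the root.
    if base != root and root not in base.parents:
        raise ValueError(f"collection {collection!r} is outside {root}")
    if not base.is_dir():
        raise LookupError(f"no such collection: {collection!r}")
    # rglob descends into nested sub-collections such as foo/baz and foo/blah.
    return sorted(base.rglob("*.cdxj"))
```

With the example layout, `cdxj_files_for(Path("/ipwb/collections"), "foo/blah")` would return only the two foo-blah files, while passing `"foo"` would also pick up foo-1.cdxj and foo-baz-1.cdxj.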

@machawk1
Member Author

machawk1 commented Jan 9, 2017

@ibnesayeed Good stuff. We can use this as the basis of introducing the collection concept into ipwb at some point.
