Add information in the README about the ability to pipe output of indexer directly to the replay system #110

Open
machawk1 opened this issue Feb 15, 2017 · 3 comments

@machawk1
Member

machawk1 commented Feb 15, 2017

For example, one can:

ipwb index myCapture.warc | ipwb replay

Replay will read stdin if a CDXJ is not specified and thus process the CDXJ resulting from the ipwb indexer immediately instead of relying on the contents of a file.
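As a rough illustration of the fallback described above, here is a minimal Python sketch of "use the CDXJ file if one is given, otherwise read stdin". The function name `read_index_lines` and its parameters are hypothetical and are not taken from the ipwb source; the actual replay code may be structured differently.

```python
import io
import sys


def read_index_lines(cdxj_path=None, stdin=sys.stdin):
    """Return non-empty index lines from a file if given, else from stdin.

    Hypothetical helper for illustration only; not ipwb's real API.
    """
    if cdxj_path is not None:
        with open(cdxj_path, encoding="utf-8") as f:
            return [line.rstrip("\n") for line in f if line.strip()]
    # No path supplied: consume whatever was piped in, until EOF.
    return [line.rstrip("\n") for line in stdin if line.strip()]


if __name__ == "__main__":
    # Simulate piped input with an in-memory stream instead of a real pipe.
    piped = io.StringIO("first record\nsecond record\n")
    print(read_index_lines(stdin=piped))
```

Note that in this sketch the whole piped index ends up in a list, so it stays in memory for as long as the caller keeps it.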

@ibnesayeed
Member

I am not comfortable with this method and would rather not advertise it, even if it is supported for small tests. With binary search across multiple CDXJ files to serve a TimeMap or Memento, it would become an overhead to maintain and won't scale well.

@machawk1
Member Author

I have a few use cases where this feature is handy for small collections. Can you provide an example (w/ sample data) where this would not scale so we can account for these circumstances?

This ticket is about documenting how to use an existing feature. If others have small collections, it would be informative to let them know that this feature is available.

@ibnesayeed
Member

The real problem is that replay allows the index to be read from STDIN, which is essential to support pipes. Although it looks like a handy feature, it won't scale well. Piping is handy and efficient when the consumer end of the pipe processes the data as it arrives and is then done with it. In this case, though, the index data supplied through the pipe will persist for the lifetime of the replay process and will be looked up (scanned) each time a request hits the replay. This persistence happens in memory, not on disk, which means that for any fairly large dataset the system can run out of memory very quickly.
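To make the distinction concrete, here is a small Python sketch contrasting the two kinds of pipe consumer. It is only an illustration under simplifying assumptions: `count_records` and `InMemoryIndex` are hypothetical names, the sample lines are a simplified stand-in for CDXJ records, and none of this is the actual ipwb replay code.

```python
import io


def count_records(stream):
    """Streaming consumer: handles each line as it arrives, keeps only a counter."""
    total = 0
    for line in stream:
        if line.strip():
            total += 1
    return total


class InMemoryIndex:
    """Persistent consumer: keeps every piped line for the life of the process."""

    def __init__(self, stream):
        # Memory use grows with the size of the piped index.
        self.lines = [line.rstrip("\n") for line in stream if line.strip()]

    def lookup(self, key):
        # Linear scan over everything held in memory, repeated for every request.
        return [line for line in self.lines if line.startswith(key + " ")]


if __name__ == "__main__":
    # Simplified, made-up index lines: "<key> <timestamp> <json>".
    sample = (
        "com,example)/ 20170215000000 {}\n"
        "com,example)/a 20170215000001 {}\n"
    )
    print(count_records(io.StringIO(sample)))
    index = InMemoryIndex(io.StringIO(sample))
    print(index.lookup("com,example)/"))
```

The first consumer needs constant memory regardless of input size; the second needs memory proportional to the index and pays for a scan on each lookup.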

I am afraid that once advertised, it will be difficult to step back. Hence, even if you want to keep this feature for some time to make the tests handy, you should not document it as it might go away soon.