# Support for on-disk appends, partitioning #36

Validating that input blocks don't overlap with existing blocks sounds like a clear win. Appending to existing partitions might be useful, but might also be more than we want to maintain within castra itself. @esc and I considered buffering blocks and appending to existing blocks when we first built castra, but decided against it in order to keep castra very simple. Our plan was to add this stuff on top of castra in external code. This came out of dealing with the bcolz codebase which, while much more fully featured than castra, is also more expensive to maintain. It may be that it's time to revisit this decision; I just wanted to share the historical reasons for how we've tried to keep the core simple.

Appending onto existing blocks sounds like it might be tricky. I understand that you've been diving into bloscpack to do this. I suspect that this would marry castra and bloscpack more tightly than they are currently. This tight coupling concerns me, especially if we want to switch to using other compression libraries. This concern about marrying the two is motivated a bit by bloscpack not releasing the GIL (see Blosc/python-blosc#101). I would be -0.5 on any change that removed this option going forward.
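A minimal sketch of what that validation could look like, assuming a sorted index and a hypothetical `last_index_value` read from the existing castra (not castra's actual code):

```python
def validate_no_overlap(last_index_value, new_df):
    """Reject an incoming dataframe whose index reaches back into data
    that is already stored (hypothetical helper, not castra's API)."""
    if not new_df.index.is_monotonic_increasing:
        raise ValueError("new data must have a sorted index")
    if last_index_value is not None and new_df.index[0] < last_index_value:
        raise ValueError("new data overlaps existing blocks: "
                         "%r < %r" % (new_df.index[0], last_index_value))
```

Note that an index starting exactly at `last_index_value` passes here; whether that boundary case is allowed is exactly what the rest of the thread debates.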

---

My thoughts: Castras should have the following invariants:

- What's on disk always matches what's in memory.
- Partitions are always valid: their index ranges are ordered and don't overlap.
Additionally, having the index be a time series partitioned by some period is a common pattern. We should try to make this as easy as possible for users, while also ensuring the two invariants above. In my mind, the following use case should work:

```python
# Create a castra partitioned by day:
c = Castra('filepath', template=temp, partitionby='d')

# Add some existing data
c.extend_sequence(some_iterator)
c.close()

# Get new data at a later time, and add it, while keeping the partitioning scheme
c = Castra('filepath')
c.extend(df)
```

I really want to support this functionality, as it's something I would expect from a tool like this. Saying "this castra is partitioned by day" means to me that both `extend` and `extend_sequence` should maintain that partitioning.
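For concreteness, here is one way a `partitionby='d'` rule could split a dataframe on day boundaries, sketched with plain pandas rather than castra's actual implementation (`split_by_day` is a hypothetical helper):

```python
import pandas as pd

def split_by_day(df):
    """Yield one sub-dataframe per calendar day, in index order
    (assumes a sorted DatetimeIndex)."""
    for day, piece in df.groupby(df.index.floor('d')):
        yield piece

idx = pd.date_range('2015-05-15 22:00', periods=6, freq='90min')
df = pd.DataFrame({'x': range(6)}, index=idx)
print([len(p) for p in split_by_day(df)])  # [2, 4]: a May 15 piece and a May 16 piece
```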

---

I'm not sure that castras should manage partition sizes; this may be a Pandora's box (although if you have an implementation that does this well, that could be a good counterargument). All use cases that I've come across would be satisfied by moving the `partitionby` option onto `extend_sequence`:

```python
# Create a castra
c = Castra('filepath', template=temp)

# Add some existing data, partitioned by day
c.extend_sequence(some_iterator, partitionby='d')
c.close()
```

Direct use of `extend` is up to the user to coordinate:

```python
c = Castra('filepath')
c.extend(df)  # user manages partition size directly
```

This keeps a lot of logic out of the actual castra object and yet satisfies most use cases I can think of. It's also something that I think can be done very cheaply.
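A minimal sketch of how such an `extend_sequence` could regroup an iterator of dataframes in memory before anything touches disk; the rebucketing logic and the use of `castra.extend` as the write path are assumptions, not the real code:

```python
import pandas as pd

def extend_sequence(castra, dfs, partitionby='d'):
    """Regroup an iterable of dataframes into period-sized partitions
    in memory, then write each finished partition once (sketch only)."""
    buffered = []   # pieces belonging to the partition currently being built
    current = None  # period label of the partition being built
    for df in dfs:
        for label, piece in df.groupby(df.index.to_period(partitionby)):
            if label != current and buffered:
                castra.extend(pd.concat(buffered))  # flush the finished partition
                buffered = []
            current = label
            buffered.append(piece)
    if buffered:
        castra.extend(pd.concat(buffered))  # flush the final partition
```

Because partitions are assembled in memory and written whole, each on-disk block is touched exactly once, even when one period spans several input dataframes.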

---

If you have a castra that already exists on disk up to May 15, and you have a dataframe from May 16 to June 16, what does `extend` do?

Or a simpler case: suppose you have a castra that has an index up to May 16, 0:00:00, and you have a dataframe with a few more datapoints at that same time. How can you add that dataframe to the castra without modifying the existing partitions?

---

In the first case I would expect `extend` to add two partitions. In the second case I would expect castra to throw an error.
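The "two partitions" expectation makes sense under a monthly partition rule, which the thread doesn't state explicitly; a quick check of that assumption with plain pandas:

```python
import pandas as pd

idx = pd.date_range('2015-05-16', '2015-06-16', freq='D')
df = pd.DataFrame({'x': range(len(idx))}, index=idx)

# Splitting on month boundaries yields exactly two pieces:
# 2015-05-16..2015-05-31 and 2015-06-01..2015-06-16.
pieces = [p for _, p in df.groupby(df.index.to_period('M'))]
print(len(pieces))  # 2
```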

---

If there is an application where the second case ends up being really important (e.g. log files that come in slightly out of order), then that sounds like a motivating use case. Do you have such a case?

---

I don't, and I don't think castra should handle out-of-order data specifically. I do think it should work for the overlapping-boundary case, though (end of castra index == start of next frame). Adding periodic new data to an existing castra is something that should work, and should be easy to do; sometimes these datasets overlap. My main use case is also covered by the `extend_sequence` proposal above.

---

I just tried using […]

So it looks like we are erring at least in the case of […]

---

I've been working on a refactor of Castra. Before I spend any more time on this, I should probably get some feedback. Here's the plan.

Issues I'm attempting to solve:

- The state on disk could disagree with what's in memory after calls to `extend`.
- Partitions like `[[1, 2, 3, 3], [3, 3, 4, 5, 6], ...]` were possible (and happened to me).

The plan:

- Add `partitionby=None` to the `__init__` signature. This will live in `meta`. If `None`, no repartitioning is done by Castra. It can also be a time period (things you can pass to `resample` in pandas).
- `extend` checks current partitions for equality overlap (even if `partitionby=None`). There are 3 cases that can happen here; see the sketch after this comment.
- If `partitionby != None`, then data is partitioned by Castra into blocks. `extend` should still take large dataframes (calling `extend` on a single row is a bad idea), but will group them into partitions based on the rule passed to `partitionby`. Using the functionality provided by bloscpack, the on-disk partitions can be appended to with little overhead. This makes writes slightly slower in the cases where appends happen, but has no penalty on reads.
- Add an `extend_sequence` function. This takes an iterable of dataframes (can be a generator), and does the partitioning in memory instead of on disk. This will be faster than calling `extend` in a loop (no on-disk appends), but will result in the same disk file format.

This method means that the disk will match what's in memory after calls to `extend` or `extend_sequence` complete, will allow castra to do the partitioning for the user, and will ensure that the partitions are valid. I have a crude version of this working now, and have found writes to be only slightly penalized when appends happen (no penalty if they don't), and no penalty for reading from disk.
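The three cases aren't spelled out above, so the following sketch is one plausible reading, with hypothetical names: strictly newer data starts a fresh block, an exact boundary match appends to the last block on disk, and anything reaching further back is an error.

```python
def classify_extend(last_index_value, new_index):
    """Decide how incoming data relates to what's on disk (assumed logic).

    Returns 'new-block' or 'append-to-last-block', or raises on true overlap."""
    if last_index_value is None or new_index[0] > last_index_value:
        return 'new-block'              # strictly after existing data
    if new_index[0] == last_index_value:
        return 'append-to-last-block'   # boundary overlap: end == start
    raise ValueError('new data overlaps existing partitions')
```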