Potential Splitstore Prune scaling problems #9128
Replies: 2 comments
-
Something else to note is that the hotstore suffers from the same theoretical problem with headers. The two-marksets, one-compaction idea would solve this for hotstore compaction too: we would put old headers in markset 2 for permanent writing to the coldstore along with old message trees. The keys in the hotstore would then be truly bounded, and no scaling problem would exist even for a blockchain running for hundreds of years.
-
I like the frozen store idea, where we move old messages (block headers currently stay hot, but there is no reason they couldn't move there too). As an added advantage, this store would be append-only, so we could use something other than badger (e.g. storethehash).
-
Problem
@magik6k brought this to my attention in review here. (cc @arajasek @jennijuju @vyzo) The intended use of splitstore prune is to GC away all non-message non-header blocks from the coldstore. I have been pushing to use auto prune of the splitstore as the default lotus mode of operation because it results in an efficient hotstore and a coldstore containing all of the necessary data to recreate filecoin state.
The core of the potential problem is that the compaction algorithm is designed to traverse all objects in the blockstore. For the hotstore this is bounded, because the hotstore flushes all but CompactionBoundary epochs of chain state (messages, states, receipts, forks) every CompactionThreshold - CompactionBoundary epochs (1 finality with current constants), leaving at most CompactionThreshold epochs of chain data in the store at any time. Coldstore prune has similar behavior, with the key difference that it keeps around both headers and messages, and message trees are much larger than headers both in block count and in bytes. Like headers in the hotstore, messages in the coldstore are append-only, so eventually the coldstore compaction (prune) process will be traversing a set of almost entirely stable message blocks that should not be pruned while computing the much smaller dead set.
Slow pruning in an autoprune setup is bad because it takes resources away not just from lotus but from hotstore compaction, which can only run when prune is not running. This could bloat the hotstore and slow down lotus further through slower hotstore accesses. Assuming badger has some superlinear scaling properties, there is also a coldstore size at which prune time starts to degrade more quickly.
Brainstorming Solutions
We could rethink the sensible default of keeping all messages around. It could be that keeping only a range of more recent messages is a better default. Pruning earlier messages would be relatively easy to add to the pruning chain walk. The all-messages case could then be a specialty archive-node configuration more invested in working around cold prune problems.
We could add another layer of indirection, a frozen store for keeping around old messages, to keep down the total number of keys in the coldstore. This is a new algorithm with two different kinds of marksets: the first for keeping objects in the store, the second for moving them to a new store, and no mark at all for deletion.
If we do the above with two marksets, there's no reason I can immediately think of for a third store. What we would do instead is add this second set to the hotstore compaction algorithm. The hotstore chain walk would mark two things: (1) the things to keep, and (2) the things to move to the coldstore. Hotstore compaction would then traverse all keys as it does today, adding anything not in markset (1) to the purge set and deleting it from the hotstore. However, instead of moving the complement of markset (1) to the coldstore, the hotstore would move only the items in markset (2) to the coldstore; the rest would be discarded. To implement discard mode, markset (2) is just the empty set. To implement universal mode, markset (2) is the complement of markset (1), and this would be implemented implicitly by doing the same thing as today. But we could do prune at compaction time pretty easily with a slightly smarter chain traversal function that knows what to put in markset (1) vs (2).
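A minimal sketch of the routing step described above (all names here are hypothetical; the real lotus splitstore uses go-cid CIDs and badger-backed marksets, not plain maps):

```go
package main

// Cid stands in for a content identifier; lotus uses the go-cid type.
type Cid string

// twoMarksets is a hypothetical sketch of the compaction state described
// above: markset (1) holds objects to keep hot, markset (2) holds objects
// to move cold; anything in neither set is discarded.
type twoMarksets struct {
	keep map[Cid]bool // markset (1)
	move map[Cid]bool // markset (2)
}

// compact routes every hotstore key: keep hot, move to cold, or purge.
// It returns the purged keys, which include the moved ones, since moved
// objects are deleted from the hotstore after being written cold.
func (m *twoMarksets) compact(hotKeys []Cid, writeCold func(Cid)) (purged []Cid) {
	for _, k := range hotKeys {
		switch {
		case m.keep[k]:
			// reachable within the compaction boundary: stays hot
		case m.move[k]:
			// old headers / message trees: persist to coldstore, then purge
			writeCold(k)
			purged = append(purged, k)
		default:
			// unmarked: discarded outright (discard mode is move == empty set)
			purged = append(purged, k)
		}
	}
	return purged
}
```

Universal mode falls out by making `move` the complement of `keep` over the walked chain; discard mode by leaving `move` empty.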
As a bit of a tangent this generalizes nicely into an autoprune strategy that I think will have some use and should be added in the future -- a strategy that keeps around every x states in order to trade off storage of states and speed of recomputing states in some optimal way. All you need is a chain walking function that knows about which states to keep and it puts them in markset (2).
Appendix: scratch thinking and calculations
The 50% efficiency point will kick in when PruneThreshold - CompactionBoundary epochs of state trees (currently 3 finalities) have about the same number of blocks as the total message blocks in the coldstore -- mostly a function of how long the splitstore has been operating. I have no good idea of when this happens. Assuming similar block sizes in state trees as in message trees, and going off of very rough size estimates, we have
~500 MB of message trees per day
~25 GB of state trees per day
So ~50x more blocks in state trees than in message trees
PruneThreshold - CompactionBoundary ~ 1 day
So ~50 days until 50% efficiency, falling to ~33% efficiency at 100 days, so something we should start seeing on the timescale of months.
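The crossover arithmetic above can be checked with a tiny calculation (the 500 MB/day and 25 GB/day figures are the rough estimates from this thread, not measurements):

```go
package main

import "fmt"

// daysTo50PercentEfficiency returns how many days of accumulated message
// trees it takes until the stable message blocks traversed by prune equal
// the live state blocks it actually needs to mark, i.e. the point where
// only half of the traversal is useful work.
func daysTo50PercentEfficiency(msgGBPerDay, stateGBPerDay, pruneWindowDays float64) float64 {
	return stateGBPerDay * pruneWindowDays / msgGBPerDay
}

func main() {
	// Rough estimates from the appendix: 0.5 GB of message trees per day,
	// 25 GB of state trees per day, PruneThreshold - CompactionBoundary ~ 1 day.
	fmt.Println(daysTo50PercentEfficiency(0.5, 25, 1)) // 50
}
```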
It might be better to think about total time traversing the blockstore. If this is really inefficient but the blockstore traversal is really fast, then this works well enough. This makes badger scaling relevant.
A characteristic problem point is when pruning takes longer than 2 * (CompactionThreshold - CompactionBoundary), because then we will not only delay every third compaction by an additional finality but also start seeing interference with every hot compaction during autoprune. I want to know what number of messages causes prune to take this long. Maybe it is a few months, maybe it is a few years.
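To put a rough number on that threshold, here is a hypothetical back-of-envelope helper; the traversal rate is a pure placeholder that would need to be measured against badger on real hardware:

```go
package main

import "fmt"

const epochSeconds = 30 // Filecoin epoch duration

// daysUntilPruneExceedsBudget estimates how many days of accumulated
// message trees it takes before traversing them outlasts a time budget
// expressed in epochs, e.g. 2 * (CompactionThreshold - CompactionBoundary).
// traversalGBPerHour is an assumed, unmeasured rate, not a real benchmark.
func daysUntilPruneExceedsBudget(traversalGBPerHour, msgGBPerDay float64, budgetEpochs int) float64 {
	budgetHours := float64(budgetEpochs) * epochSeconds / 3600
	return traversalGBPerHour * budgetHours / msgGBPerDay
}

func main() {
	// With 1 finality = 900 epochs, the budget is 1800 epochs = 15 hours.
	// At an assumed 10 GB/h traversal rate and 0.5 GB/day of messages:
	fmt.Println(daysUntilPruneExceedsBudget(10, 0.5, 1800)) // 300
}
```

Under those assumed rates the problem point lands on the order of a year out, but the answer moves linearly with the real traversal rate, which is exactly the unknown the thread is asking about.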