-
Notifications
You must be signed in to change notification settings - Fork 815
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Restructure littlefs to not use recursion, measure stack usage #658
Conversation
Now with -s/--sort and -S/--reverse-sort for sorting the functions by size. You may wonder why add reverse-sort, since its utility doesn't seem worth the cost to implement (these are just helper scripts after all), the reason is that reverse-sort is quite useful on the command-line, where scrollback may be truncated, and you only care about the larger entries. Outside of the command-line, normal sort is prefered. Fortunately the difference is just the sign in the sort key. Note this conflicts with the short --summary flag, so that has been removed.
This is to avoid unexpected script behavior even though data.py should always return 0 bytes for littlefs. Maybe a check for this should be added to CI?
scripts/coverage.py was missed originally because it's not ran as often as the others. Since it requires run-time info, it's usually only used in CI.
Note this detects loops (recursion), and renders this as infinity. Currently littlefs does have a single recursive function and you can see how this infects the full call graph. Eventually this should be removed.
- size -> code_size - size -> data_size - frame -> stack_frame - limit -> stack_limit - hits -> coverage_hits - count -> coverage_count
Note this does include internal structs, so this should probably be limited to informative purposes.
This required a patch to the --diff flag for the scripts to ignore a missing file. This enables the useful one liner for making comparisons with potentially missing previous versions: ./scripts/code.py lfs.o -d lfs.o.code.csv -o lfs.o.code.csv function (0 added, 0 removed) old new diff TOTAL 25476 25476 +0 One downside, these previous files are easy to delete as a part of make clean, which limits their usefulness for comparing configuration changes...
- Added -L/--depth argument to show dependencies for scripts/stack.py, this replaces calls.py - Additional internal restructuring to avoid repeated code - Removed incorrect diff percentage when there is no actual size - Consistent percentage rendering in test.py
A full summary of static measurements (code size, stack usage, etc) can now be found with: make summary This is done through the combination of a new ./scripts/summary.py script and the ability of existing scripts to merge into existing csv files, allowing multiple results to be merged either in a pipeline, or in parallel with a single ./script/summary.py call. The ./scripts/summary.py script can also be used to quickly compare different builds or configurations. This is a proper implementation of a similar but hacky shell script that has already been very useful for making optimization decisions: $ ./scripts/structs.py new.csv -d old.csv --summary name (2 added, 0 removed) code stack structs TOTAL 28648 (-2.7%) 2448 1012 Also some other small tweaks to scripts: - Removed state saving diff rules. This isn't the most useful way to handle comparing changes. - Added short flags for --summary (-Y) and --files (-F), since these are quite often used.
- Changed `make summary` to show a one line summary - Added `make lfs.csv` rule, which is useful for finding more info with other scripts - Fixed small issue in ./scripts/summary.py - Added *.ci (callgraph) and *.csv (script output) to CI
- Added to GitHub statuses (61 results) - Reworked generated release table to include these (16 results, only thumb) These also required a surprisingly large number of other changes: - Bumbed CI Ubuntu version 18.04 -> 20.04, 22.04 is already on the horizon but not usable in GitHub yet - Manualy upgrade to GCC v10, this is required for the -fcallgraph-info flag that scripts/stack.py uses. - Increased paginated status queries to 100 per-page. If we have more statuses than this the status diffs may get much more complicated... - Forced whitespace in generated release table to always be nbsp. GitHub tables get scrunched rather ugly without this, prefering margins to readable tables. - Added limited support for "∞" results, since this is returned by ./scripts/stack.py for recursive functions. As a side-note, this increases the number of statuses reported per-commit from 6 to 61, so hopefully that doesn't cause any problems...
Using errors=replace in python utf-8 decoding makes these scripts more resilient to underlying errors, rather than just throwing an unhelpfully generic decode error.
GCC is a bit frustrating here, it really wants to generate every file in a single command, which _is_ more efficient if our build system could leverage this. But -fcallgraph-info is a rather novel flag, so we can't really rely on it for generally compiling and testing littlefs. The multi-file output gets in the way when we want an explicitly separate rule for callgraph-info generation. We can't generate the callgraph-info without generating the objects files. This becomes a surprsing issue when parallel building (make -j) is used! Suddenly we might end up with both the .o and .ci rules writing to .o files, which creates a really difficult to track down issue of corrupted .o files. The temporary solution is to use an order-only prerequisite. This still ends up building the .o files twice, but it's an acceptable tradeoff for not requiring the -fcallgraph-info for all builds.
This requires parsing an additional section of the dwarfinfo (--dwarf=rawlines) to get the declaration file info. --- Interpreting the results of ./scripts/structs.py reporting is a bit more complicated than other scripts, structs aren't used in a consistent manner so the cost of a large struct depends on the context in which it is used. But that being said, there really isn't much reason to report internal-only structs. These structs really only exist for type-checking in internal algorithms, and their cost will end up reflected in other RAM measurements, either stack, heap, or other.
…ntu snaps Introduced when updating CI to Ubuntu 20.04, Ubuntu snaps consume loop devices, which conflict with out assumption that /dev/loop0 will always be unused. Changed to request a dynamic loop device from losetup, though it would have been nice if Ubuntu snaps allocated from the last device or something.
Avoids redundant counting of structs shared in multiple .c files, which is very common. This is different from the other scripts, code.py/data.py/stack.py, but this difference makes sense as struct declarations have a very different lifetime.
As noted in Python's subprocess library: > This will deadlock when using stdout=PIPE and/or stderr=PIPE and the > child process generates enough output to a pipe such that it blocks > waiting for the OS pipe buffer to accept more data. Curiously, this only became a problem when updating to Ubuntu 20.04 in CI (python3.6 -> python3.8).
lfs_dir_commit originally relied heavily on tail-recursion, though at least one path (through relocations) was not tail-recursive, and could cause unbounded stack usage in extreme cases of bad blocks. (Keep in mind even extreme cases of bad blocks should be in scope for littlefs). In order to remove recursion from this code path, several changed were raequired: - The lfs_dir_compact logic had to be somewhat inverted. Instead of first compacting and then resolving issues such as relocations and orphans, the overarching lfs_dir_commit now contains a state-machine which after committing or compacting handles the extra changes to the filesystem in a single, non-recursive loop - Instead of fixing all relocations recursively, >1 relocation requires defering to a full deorphan step. This step is unfortunately an additional n^2 process. It also required some changes to lfs_deorphan in order to ignore intentional orphans created as an intermediary in lfs_mkdir. Maybe in the future we should remove these. - Tail recursion normally found in lfs_fs_deorphan had to be rewritten as a loop which restarts any time a new commit causes a relocation. This does show that the algorithm may not terminate, but only if every block is bad, which will eventually cause littlefs to run out of blocks to write to.
This mostly just required separate functions for "lfs_file_rawwrite" and "lfs_file_flushedwrite", since lfs_file_flush recursively invokes lfs_file_rawread and lfs_file_rawwrite. This comes at a code cost, but gives us bounded and measurable RAM usage on this code path.
lfs_dir_traverse is a bit unpleasant in that it is inherently a recursive function, but without a strict bound of 4 calls (commit -> filter -> move -> filter), and efforts to unroll the recursion comes at a signification code cost. It turns out the best solution I've found so far is to simple create an explicit stack with an explicit bound of 4 calls (or more accurately, 3 pushed frames). --- This actually highlights one of the bigger flaws in littlefs right now, which is that this function, lfs_dir_traverse, takes O(n^2) disk reads to traverse. Note that LFS_FROM_MOVE can only occur once per commit, which is why this code is O(n^2) and not O(n^4).
Did you see #621 (I'm sure you did), in particular:
And then:
So I assume the optimisation being talked about there is to cut down this recursion significantly. But it looks like that optimisation cannot be applied generally, is that true? It would be great to see this optimisation applied because it really does help to improve littlefs performance. |
Ah I did, and was actually just re-reading it, but haven't been able to fully understand/provide useful feedback to the proposal yet. I hadn't realized it was related to this, thanks for pointing that out.
It actually should be "only" O(n^2+n^2) = O(n^2), since a LFS_FROM_MOVE tag can only occur once in an attribute list (here, the LFS_FROM_* tags can't occur on disk). That being said, this O(n^2) loop has been be a big mistake. It's something I'm actively exploring, with at least one potential prototype, but it's quite difficult to solve. I'd be quite interested to know @invoxiaamo's ideas, but that might be a discussion for that PR. This is the main reason behind #203
I was initially skeptical but looking into it a bit more this might be true. If so this would be a great improvement to have and might be able to reduce the stack usage in lfs_dir_traverse down to 3 frames. |
Regarding this PR: in general I think it's a good idea to eliminate the use of recursion, or at least have more control over the recursion. If it's fully eliminated it might be interesting to add a check to CI to ensure it's not added again accidentally (and that will limit the features that can be added in the future, which IMO is a good thing). As for code size increase: we do use littlefs in a bootloader and it needs to be small, but it's in read-only mode so looks like that won't change with this PR. |
It's not a hard error, but CI results in the "show all checks" panel would show "∞ B" instead of "1696 B", unfortunately GitHub has made these hard to view recently. At the very least it would be difficult to ignore come release time. But making it a hard error is an even better idea. |
Restructure littlefs to not use recursion, measure stack usage
@geky I recognize this was a difficult and sizeable undertaking and I am extremely thankful for the work you've done to accomplish this. I had a whole class of bugs based around stack usage within littlefs in one of our applications on a RAM constrained device where every bit of stack per task counted. I ended up guessing and checking the RAM usage but was always worried about usage during relocation for worn blocks down the road. Having this be deterministic is exactly what I was hoping for out of littlefs. |
This PR restructures the internals of littlefs a bit to completely avoid recursion. In theory the design of littlefs uses strictly bounded RAM, though it was written using some recursive functions, and at least one code path relied on tail-call optimization for bounded RAM consumption, which isn't guaranteed in standard C.
This broke down into three pieces of logic that used recursion:
lfs_dir_commit/lfs_fs_relocate - 84da4c0
The biggest culprit of recursion was the relocation logic responsible for moving metadata to new blocks when a block goes bad or has high wear. The reason is because this occurs deep within the commit logic and is inherently recursive (any commit can require relocation, which needs to commit to fix the relocation!)
This required quite a bit of restructuring, inverting most of the commit logic so that relocations occur on an outer loop that invokes the core commit logic. The commit logic ends up with several new entry points that "fall back" to outer loops under different error conditions, with the top level loop being a full
lfs_fs_deorphan
pass.lfs_file_read/write/flush - fedf646
A bit easier to solve,
lfs_file_read
andlfs_file_write
sometimes need to flush state from otherlfs_file_read
/lfs_file_write
calls. The only problem being thatlfs_file_flush
callslfs_file_read
/lfs_file_write
to do the flush!This was easy to solve with an extra set of
lfs_file_flushedread
/lfs_file_flushedwrite
entry points that avoid this. Note thatthis code path couldn't actually occur normally outside of filesystem bugs.
lfs_dir_traverse - 8109f28
lfs_dir_traverse
was also a bit tricky. It's inherently recursive, but has a strict upper bound of 4 calls (a traverse with a filter with a move with a filter).I tried a few different things, but it was difficult to unroll things without an extra ~1KiB of code cost. The solution here
is to add an explicit, bounded stack (just an array of state), which turned out to not increase the code cost too badly.
I suspect this hints that modeling traversals with callbacks is a poor choice for a non-recursive code, so this may be worth implementing differently at some point in the future.
This PR also adds a few more scripts and utilities to help analyze builds and compare different implementations:
make code
or./scripts/code.py
- Reports the code size of functionsmake data
or./scripts/data.py
- Reports any data sections, this should always be zero for littlefsmake stack
or./scripts/stack.py
- Reports stack frames and maximum stack usage of functionsmake structs
or./scripts/structs.py
- Reports the size of any public structsFor example:
Output
Or with callgraph info:
$ make stack STACKFLAGS+='-L1'
Output
Note this depends on the
-fstack-usage
flag introduced in GCC 10.Also, if you're curious what the deepest call is in littlefs after this change:
These changes do come with a code cost, as some of the recursion allowed a bit better code reuse:
Should resolve #417