
RFC: various optimizations targeting 'rename' and 'compact' (but not only...) #621

Open · wants to merge 7 commits into base: master
Conversation

@invoxiaamo invoxiaamo commented Dec 1, 2021

Hello.
We have been working with littlefs for a few months, and we have experienced a few long stalls of the system, lasting up to 30 s, despite solid hardware (250+ MHz Arm Cortex-M33, plenty of RAM, 50 MB/s NOR flash bus speed).

Our investigation quickly focused on the rename operations, which we use for atomic file substitution.
The issues are similar to #327.

To reproduce the issue easily, we found the following scenario, which is very representative of the problem:

    - format and mount a 4KB block FS
    - create 100 files of 256 Bytes named "/dummy_%d"
    - create a 1024 Byte file "/test"
    - rename "/test" "/test_rename"
    - create a 1024 Byte file "/test"
    - rename "/test" "/test_rename"  [1]

The second rename [1] triggers a compaction of the dir.
We measured various performance counters:

  • lfs_dir_traverse is called 148393 times
  • lfs_bd_read is called 25205444 times (100 MB+ of data requested)
  • With the current cache implementation, 13872154 SPI read transfers were done (222 MB+ of data transferred).
    THIS IS HUGE for a FS holding only 102 files / 28 KB....

We provide here 2 cache improvements and a simple but very effective lfs_dir_traverse optimization.
All three optimizations are independent.

  1. The cache "read ahead" optimization simply requests a larger rcache fill when it detects that reads are contiguous.
    => 66% reduction in IO operations
    => 20% increase in data transferred over SPI
    => 33% overall reduction in SPI transfer time (since the effective data transfer is only a small part of the SPI command sequence)

  2. A "dynamic" block cache (enabled with the LFS_DYN_CACHE define) mallocs a whole block when it detects repeated reads within a single block. The allocated RAM is freed as soon as the user-facing lfs_xxxxx operation returns. Consequently it should not increase steady-state heap usage, as long as one temporary block fits in RAM.
    => 99.997% reduction in IO operations (compared to (1)): only 123 remaining block reads vs. 13872154
    => 99.993% reduction in IO data transferred: only 15 KB vs. 222 MB.
    When RAM (e.g. 4 KB for our platform) is not an issue, this is a very interesting optimization.

  3. Yet, despite (2), the overall CPU time only decreases by 75%, because of the huge number of times lfs_dir_traverse is called.
    We analyzed the call graph and saw that lfs_dir_traverse recurses 4 levels deep during compaction.
    An O(n^4) algorithm is obviously a problem...
    The 4 levels are due to the presence of an LFS_FROM_MOVE tag in the attribute walk list.

 [1] lfs_dir_traverse(cb=lfs_dir_commit_size)
     - do 'duplicates and tag update'
   [2] lfs_dir_traverse(cb=lfs_dir_traverse_filter, data=tag[1])
       - Reaching a LFS_FROM_MOVE tag
     [3] lfs_dir_traverse(cb=lfs_dir_traverse_filter, data=tag[1]) <= on 'from' dir
         - do 'duplicates and tag update'
       [4] lfs_dir_traverse(cb=lfs_dir_traverse_filter, data=tag[3])

Yet, for the particular case of cb=lfs_dir_traverse_filter in [3] and [4], we saw that all the calls are useless:

  • they can't raise the 'duplicate' trigger of [2]
  • they don't update tag[1] of level [2]

Short-circuiting this particular case reduces the number of lfs_dir_traverse calls by 97.8% (3248 vs. 148393).

With all 3 optimizations, the overall compaction time on our platform drops by 98.3%.

Monitoring lfs_bd_read usage during time-consuming operations (e.g. rename)
shows a very large number of uncached reads at small (less than cache size)
consecutive offsets.

Implementing a read-ahead strategy when such consecutive reads are detected
is an effective optimization.

Rename can be VERY time consuming. One of the reasons is the 4-level recursion
depth of lfs_dir_traverse() seen when a compaction happens during the
rename.

lfs_dir_compact()
  size computation
    [1] lfs_dir_traverse(cb=lfs_dir_commit_size)
         - do 'duplicates and tag update'
       [2] lfs_dir_traverse(cb=lfs_dir_traverse_filter, data=tag[1])
           - Reaching a LFS_FROM_MOVE tag (here)
         [3] lfs_dir_traverse(cb=lfs_dir_traverse_filter, data=tag[1]) <= on 'from' dir
             - do 'duplicates and tag update'
           [4] lfs_dir_traverse(cb=lfs_dir_traverse_filter, data=tag[3])
  followed by the compaction itself:
    [1] lfs_dir_traverse(cb=lfs_dir_commit_commit)
         - do 'duplicates and tag update'
       [2] lfs_dir_traverse(cb=lfs_dir_traverse_filter, data=tag[1])
           - Reaching a LFS_FROM_MOVE tag (here)
         [3] lfs_dir_traverse(cb=lfs_dir_traverse_filter, data=tag[1]) <= on 'from' dir
             - do 'duplicates and tag update'
           [4] lfs_dir_traverse(cb=lfs_dir_traverse_filter, data=tag[3])

Yet, analysis shows that levels [3] and [4] don't perform anything
when the callback is lfs_dir_traverse_filter...

A practical example:

- format and mount a 4KB block FS
- create 100 files of 256 Bytes named "/dummy_%d"
- create a 1024 Byte file "/test"
- rename "/test" "/test_rename"
- create a 1024 Byte file "/test"
- rename "/test" "/test_rename"
This triggers a compaction where lfs_dir_traverse was called 148393 times,
generating 25e6+ lfs_bd_read calls (~100 MB+ of data).

With the optimization, lfs_dir_traverse is now called 3248 times
(589e3 lfs_bd_read calls, ~2.3 MB of data).

=> a 43x improvement...

genotix commented Dec 1, 2021

Dude this is seriously -amazing- work!!
Huge thumbs up!!!

@dpgeorge (Contributor) commented:

This sounds really great!

Do you think it's possible to implement optimisation (2) (and possibly (1) as well) at the block device layer? That is, not within littlefs itself but rather within the read/prog calls defined by the user?

@invoxiaamo (Author) commented:

@dpgeorge

  • (1) is really a small improvement of the current cache. It can either be integrated as a fix of the current cache, or it can be skipped.
  • (2) can obviously be implemented in the flash driver, but the interesting part of the current implementation is that the malloc'd memory is released as soon as the lfs_open/write/rename... call returns. This means the memory is not permanently taken and doesn't increase the memory pressure on the embedded device.

@invoxiaamo (Author) commented:

@AlexanderCsr8904
Is your suggestion to squash all commits of this pull request into one?
In fact, this pull request is an RFC to collect reviews and ideas. I will create independent pull requests later, in particular for (3), which is the most important.
For (2), not everybody has the same needs. For some, memory (even a temporary malloc) is a bigger issue than performance (think of an FS used only to store a few configuration files), while for others, speed is what matters. That's why (2) depends on the LFS_DYN_CACHE configuration define.
There is also the commit concerning performance statistics, which is independent.

@BenBE left a comment:


Just two minor typos …

invoxiaamo and others added 2 commits December 10, 2021 17:06
Co-authored-by: BenBE <[email protected]>
@GIldons left a comment:


Fix dynamic cache initialization.

Co-authored-by: GIldons <[email protected]>
@geky geky removed the enhancement label Apr 9, 2022

geky commented Apr 9, 2022

Hi @invoxiaamo, this is a really great analysis and these are very creative solutions! Thanks for creating this PR, and sorry I wasn't able to properly look into it until now.

(3) seems immediately actionable. The original algorithm shouldn't be exceeding O(n^2), but I can reproduce what you're seeing, so it appears this was a mistake when translating the design to implementation.

Would you be able to create a separate PR for just (3)? We would be able to merge it into the next release then.


Sorry if you've already mentioned it and I'm just missing it, but what cache_size are you using?

I'm curious how these optimizations compare to setting cache_size = block_size. In theory at that point the full metadata block should fit in the cache avoiding the need to go to disk O(n^2) times.

  • (2) is unfortunately a bit incompatible with some of littlefs's design, as littlefs really only uses malloc for convenience functions wrapping optionally static buffers. This means you can use littlefs without malloc without sacrificing any of its features.

    In theory you could actually allocate the dyn-block-cache's buffer on the stack, but dynamic stack allocations come with their own set of problems which we should avoid...

    It may be reasonable to just add an optional third cache to the lfs_config struct. In practice temporary allocations like this are difficult to take advantage of in resource constrained systems because of the need to be prepared for the worst-case RAM consumption.


    A more general solution, one that I would like to see happen at some point, would be to change the caching system in littlefs to be able to cache blocks > 2 if enough RAM is provided. This would in effect turn littlefs's cache layer into an actual block cache and I believe provide a similar solution as (2), while also allowing more RAM to cache more blocks (though probably with diminishing returns).

  • (1) is quite a smart idea, though eagerly fetching reads was the intention of the hint argument to lfs_bd_read. Do you think the performance gains of (1) could be accomplished by tuning lfs_bd_read's hint in the hot path?

    You can see hint is set to block_size (the maximum) in most of lfs_dir_fetchmatch:

    littlefs/lfs.c

    Lines 904 to 906 in 9c7e232

    int err = lfs_bd_read(lfs,
            NULL, &lfs->rcache, lfs->cfg->block_size,
            dir->pair[0], off, &tag, sizeof(tag));

    But it looks like this could be improved in other parts of the code path:

    littlefs/lfs.c

    Lines 767 to 769 in 9c7e232

    int err = lfs_bd_read(lfs,
            NULL, &lfs->rcache, sizeof(tag),
            dir->pair[0], off, &tag, sizeof(tag));

@geky geky added this to the v2.5 milestone Apr 9, 2022

invoxiaamo commented Apr 9, 2022

Hello @geky,
I just created PR #666 with only (3).
Regards,
Arnaud

@geky geky removed this from the v2.5 milestone Apr 9, 2022
@geky geky added the needs minor version new functionality only allowed in minor versions label Apr 9, 2022
@geky geky mentioned this pull request Apr 11, 2022
Labels: enhancement · needs minor version (new functionality only allowed in minor versions) · performance