-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
improve precompilation coverage #3285
Conversation
Also maybe CSV.jl should be handled by as extension? (we would just then need to ensure that we precompile things in a way that we avoid invalidations). If you have some experience here what is best please comment. Thank you! |
As a comment, we probably indeed need to fix these invalidations. Here is what I have when both CSV.jl and DataFrames.jl are loaded in the "Julia 1.9 this PR (new precompilation)" scenario:
and then running the operations in the precompilation part takes 4.684947 seconds (while without CSV.jl it takes 0.394517 so indeed we loose almost all benefits of precompilation) |
Nice! Does it fix most of that recompile time if you depend on InlineStrings & SentinelArrays? |
If I add InlineStrings.jl and SentinelArrays.jl to dependencies AND
So there is recompilation but much less. If I add CSV.jl as a dependency instead then:
So all is good if we load CSV.jl (although we get recompilation when loading DataFrames.jl - @timholy: can you tell why?). In summary: it looks like adding CSV.jl as a dependency would be the best option. The question is if it is worth to make it a conditional dependency (probably yes, but I have not benchmarked it). Also @quinnj - CSV.jl is now on 0.10.9 version. What are the plans for further development/versions of CSV.jl? (the issue is what compat bounds to put into Project.toml if we decide to go forward with adding CSV.jl as a dependency) |
I have pushed the version with CSV.jl as a dependency (simple version - no conditional loading) if someone is interested in testing this. |
Julia complains that the following method definitions are ambiguous:
I will fix this when we make a decision what to include as dependencies. EDIT: fixed |
Do you get recompilation if you use |
Everything above is without Revise.jl and with |
Fixes for the Revise stack:
Packages that define new AbstractString subtypes are tricky! |
Yeah, I've been a little tied up w/ other projects at the moment, so haven't had a lot of time for CSV.jl lately. @Drvi, @nickrobinson251, and I have prototyped a new internal refactoring that currently lives here, which optimizes memory/perf for the chunked/row streaming case, and I want to adapt it to work for the I also really appreciate the investigative efforts here by @bkamins and @timholy; I'm more than happy to make any changes necessary in InlineStrings.jl, SentinelArrays.jl, CSV.jl, WeakRefStrings.jl or wherever else if it means a better story for DataFrames.jl! |
With JuliaLang/julia#48557 I can verify (on a different machine) julia> @time using CSV
0.713006 seconds (759.16 k allocations: 47.936 MiB, 9.82% gc time, 1.54% compilation time)
julia> @time using DataFrames
2.397113 seconds (2.93 M allocations: 164.404 MiB, 5.63% gc time) Fixes the recompilation during load. |
@timholy - I am not sure if it was discussed in other places but maybe the way to go would be to define If this general solution is not something that you would find useful in general maybe we could add CC @KristofferC |
You know about the last section of the SnoopPrecompile docs? using SnoopPrecompile, Preferences
set_preferences!(SnoopPrecompile, "skip_precompile" => ["PackageA", "PackageB"]) That's strongly encouraged over the ENV solution, as the ENV solution can cause you to end up with an inconsistent cache (there's no record of what the ENV settings were when a given package was precompiled). Warning, though: I may change how this works to make the settings more "granular." Stay vaguely tuned over the next month or so.
My impression is that @fonsp is planning to implement (or has implemented) utilities to sync the manifests of many different notebooks to a single "master" environment. (It's "just" a matter of copying the version info from one Manifest into the corresponding slot in a second Manifest.) I hope that should at least hold us over until the exciting work on parallel LLVM compilation lands. |
OK - thank you for an explanation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, see very small comments.
After some more thinking and testing I buy the argument that CSV.jl is too heavy dependency for DataFrames.jl. However, SentinelArrays.jl and InlineStrings.jl seem relatively lightweight as we can see here:
(they add 46 ms and 6.4 ms respectively, which I think is acceptable) Now a comparison of timing of normal load of DataFrames.jl is as follows: If we depend on CSV.jl
If we depend on SentinelArrays.jl and InlineStrings.jl
(time would be similar if we would not depend on SentinelArrays.jl and InlineStrings.jl) The benefit of having SentinelArrays.jl and InlineStrings.jl is that in case someone uses them (directly or indirectly) we do not invalidate precompiled DataFrames.jl code (in practice if someone uses CSV.jl, but maybe in the future these packages will be used more widely). So:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's too bad that we have to add dependencies just for precompilation, but it's probably worth it as a temporary measure. Ideally at some point Julia will be able to make these conditional on these packages being installed in the environment.
Co-authored-by: Milan Bouchet-Valat <[email protected]>
I suspect it's headed to "pseudo-stdlib" status. I plan to move SnoopPrecompile out to JuliaLang sometime soon; I'm dragging my feet mostly because I wonder if we should rename it precisely to avoid conflating it with SnoopCompile. (They use similar techniques and thus are parallel in my mind, but they are also quite different.) SnoopCompile is big with lots of dependencies, but SnoopPrecompile is tiny: https://github.com/timholy/SnoopCompile.jl/blob/master/SnoopPrecompile/src/SnoopPrecompile.jl is the entire package (and it's about 40% docstring). |
I think @nalimilan's concern is having to add CSV/InlineStrings/SentinelArrays for precompilation, not SnoopPrecompile, which as you point out is lightweight. |
Gotcha. Keep in mind that adding them is an efficient way of avoiding having your code invalidated, but it's not the only solution. The other main approach is to identify the inference failures in DataFrames.jl that are causing Julia to be uncertain about which methods will be dispatched and then fix those inference failures. That said, I'm fully on board with this being an expedient and very effective solution that will make things better for your users. I'm painfully aware that SnoopCompile + |
I wanted to confirm one thing here. Since |
Co-authored-by: Milan Bouchet-Valat <[email protected]>
Thank you! (I would love to continue the discussion to get a better understanding what can be done) |
Yes, there are places where deliberate non-specialization is difficult to reconcile with resistance to invalidations. In such cases, setting |
Fixes #3248
To do:
Now I implemented step 1 (select precompilation statements)
Here are some statistics:
Julia 1.9
main
branch (old precompilation)Julia 1.9 this PR (new precompilation)
Julia 1.8.5 this PR (new precompilation)
In general my recommendation is to use the long list of precompilation statements. It adds 9 seconds to precompilation and 0.1 second to load time (but hopefully users will accept this; maybe the only problematic place is Pluto.jl, so let us discuss this). The benefit is that we precompile all commonly used functions.
Decide what to do with InlineStrings.jl and SentinelArrays.jl
After we settle the decision on step 1, we need what to do with InlineStrings.jl and SentinelArrays.jl. I will benchmark it later (after we decide what precompilations to keep). We have three options in general:
DataFrame
should be improved.@nalimilan, @quinnj, @timholy - do you have any opinion? Thank you!