improve precompilation coverage #3285

Merged
merged 9 commits into from
Feb 11, 2023

Conversation

bkamins
Member

@bkamins bkamins commented Feb 5, 2023

Fixes #3248

To do:

  • select precompilation statements
  • decide what to do with InlineStrings.jl and SentinelArrays.jl

I have now implemented step 1 (select precompilation statements).

Here are some statistics:

Julia 1.9 main branch (old precompilation)

  • precompilation time: 36.419711 seconds
  • DataFrames.jl load time later: 1.408057
  • execution of code that is proposed to be used in precompilation (new set of precompile statements): 5.860520 seconds

Julia 1.9 this PR (new precompilation)

  • precompilation time: 45.814128 seconds
  • DataFrames.jl load time later: 1.587016
  • execution of code that is proposed to be used in precompilation (new set of precompile statements): 0.394517 seconds

Julia 1.8.5 this PR (new precompilation)

  • precompilation time: 21.730528 seconds
  • DataFrames.jl load time later: 2.356902
  • execution of code that is proposed to be used in precompilation (new set of precompile statements): 13.682346 seconds
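To make the trade-off in the two Julia 1.9 measurements explicit, the differences can be computed directly from the numbers reported above. This is only arithmetic on the quoted timings; the variable names are illustrative.

```julia
# Timings (seconds) copied from the Julia 1.9 measurements above.
main_branch = (precompile = 36.419711, load = 1.408057, workload = 5.860520)
this_pr     = (precompile = 45.814128, load = 1.587016, workload = 0.394517)

extra_precompile = this_pr.precompile - main_branch.precompile  # ≈ 9.39 s more precompilation
extra_load       = this_pr.load - main_branch.load              # ≈ 0.18 s more load time
workload_speedup = main_branch.workload / this_pr.workload      # ≈ 14.9x faster first run

println((extra_precompile, extra_load, workload_speedup))
```

The one-time precompilation cost grows by roughly a quarter, while the first execution of the workload becomes about fifteen times faster.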

In general, my recommendation is to use the long list of precompilation statements. On Julia 1.9 it adds about 9 seconds to precompilation and just under 0.2 seconds to load time (hopefully users will accept this; the only problematic place may be Pluto.jl, so let us discuss it). The benefit is that we precompile all commonly used functions.
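For readers unfamiliar with precompilation statements, here is a minimal sketch using only Base's `precompile` function. The module `ToyTables` and the function `colmeans` are hypothetical stand-ins, not DataFrames.jl code, and this PR builds its statements with SnoopPrecompile rather than by hand.

```julia
module ToyTables

# Hypothetical stand-in for a commonly used function.
colmeans(cols::Vector{Vector{Float64}}) = [sum(c) / length(c) for c in cols]

# Explicit precompile statement: ask Julia to compile this concrete signature
# ahead of the first call. `precompile` returns `true` when compilation succeeded.
const COLMEANS_PRECOMPILED = precompile(colmeans, (Vector{Vector{Float64}},))

end

means = ToyTables.colmeans([[1.0, 2.0, 3.0], [4.0, 6.0]])
println(ToyTables.COLMEANS_PRECOMPILED, " ", means)
```

In a real package such a statement runs while the package is precompiled, so the compiled code is stored in the package cache instead of being produced on the user's first call.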

Decide what to do with InlineStrings.jl and SentinelArrays.jl

After we settle the decision on step 1, we need to decide what to do with InlineStrings.jl and SentinelArrays.jl. I will benchmark this later (after we decide which precompilation statements to keep). In general we have three options:

  • do not add them
  • add them
  • add CSV.jl as a hidden dependency (this way, when CSV.jl changes its dependencies, we will track them automatically). I could then also add a simple precompilation statement for loading a CSV file in DataFrames.jl, which should shorten the time users wait when they first read a CSV file into a DataFrame.

@nalimilan, @quinnj, @timholy - do you have any opinion? Thank you!

@bkamins bkamins added the ecosystem Issues in DataFrames.jl ecosystem label Feb 5, 2023
@bkamins bkamins added this to the 1.5 milestone Feb 5, 2023
@bkamins
Member Author

bkamins commented Feb 5, 2023

Also, maybe CSV.jl should be handled as an extension? (We would then just need to ensure that we precompile things in a way that avoids invalidations.) If you have experience with what works best here, please comment. Thank you!

@bkamins
Member Author

bkamins commented Feb 5, 2023

As a comment, we probably indeed need to fix these invalidations. Here is what I have when both CSV.jl and DataFrames.jl are loaded in the "Julia 1.9 this PR (new precompilation)" scenario:

julia> @time using CSV
  0.504498 seconds (849.83 k allocations: 54.461 MiB, 3.65% gc time, 2.13% compilation time)

julia> @time using DataFrames
  1.858852 seconds (2.67 M allocations: 169.493 MiB, 3.31% gc time, 34.05% compilation time: 100% of which was recompilation)

and then running the operations in the precompilation part takes 4.684947 seconds (without CSV.jl it takes 0.394517 seconds, so we indeed lose almost all the benefits of precompilation).
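The mechanism behind this slowdown is method invalidation. The following toy sketch (all names are hypothetical, nothing here is taken from CSV.jl or DataFrames.jl) shows the situation in miniature: code is compiled while only one method of a function exists, and a method added later forces Julia to discard and redo that compiled code so that results stay correct.

```julia
module ToyLib

# Generic fallback, the only method known when `labels` is first compiled.
label(x) = "generic"

# The element type is abstract, so inference cannot know the argument type of `label`.
labels(xs::Vector{Any}) = [label(x) for x in xs]

end

before = ToyLib.labels(Any[1, "a"])   # compiles `labels` against the single `label` method

# Code loaded later adds a more specific method, as a package defining
# a new string type might do for functions in Base.
ToyLib.label(x::AbstractString) = "string"

after = ToyLib.labels(Any[1, "a"])    # code that assumed one `label` method must be recompiled

println(before, " ", after)
```

Here the recompilation is cheap, but when the invalidated code is a large precompiled workload, the time saved by precompilation is spent again at run time.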

@timholy
Contributor

timholy commented Feb 5, 2023

Nice!

Does it fix most of that recompile time if you depend on InlineStrings & SentinelArrays?

@bkamins
Member Author

bkamins commented Feb 6, 2023

If I add InlineStrings.jl and SentinelArrays.jl as dependencies AND include them (just listing them as dependencies is not enough), then the only timing that changes is running the test code (everything else is comparable). It becomes:

0.862469 seconds (1.32 M allocations: 74.504 MiB, 2.16% gc time, 97.84% compilation time: 54% of which was recompilation)

So there is recompilation but much less.

If I add CSV.jl as a dependency instead then:

  • precompilation time goes up to 52.912350 seconds (not that bad)
  • then load time of DataFrames.jl goes up to 2.225259 seconds (a bit more but not prohibitive)
  • time to run the benchmark without loading CSV.jl: 0.360435 seconds (good)
  • and the final timings:
julia> @time using CSV
  0.752680 seconds (850.07 k allocations: 54.492 MiB, 2.51% gc time, 1.25% compilation time)

julia> @time using DataFrames
  2.002735 seconds (3.01 M allocations: 189.720 MiB, 3.83% gc time, 30.74% compilation time: 100% of which was recompilation)

julia> @time # running all the benchmark codes
  0.363915 seconds (105.03 k allocations: 5.971 MiB, 95.26% compilation time)

julia> @time CSV.read("test.csv", DataFrame) # and this is something that is really nice - a big bonus of fast first time to read CSV as DataFrame
  0.059060 seconds (26.87 k allocations: 1.770 MiB, 97.05% compilation time)

So all is good if we load CSV.jl (although we get recompilation when loading DataFrames.jl - @timholy: can you tell why?).

In summary: it looks like adding CSV.jl as a dependency would be the best option. The question is whether it is worth making it a conditional dependency (probably yes, but I have not benchmarked it).

Also @quinnj - CSV.jl is now on 0.10.9 version. What are the plans for further development/versions of CSV.jl? (the issue is what compat bounds to put into Project.toml if we decide to go forward with adding CSV.jl as a dependency)

@bkamins
Member Author

bkamins commented Feb 6, 2023

I have pushed the version with CSV.jl as a dependency (simple version - no conditional loading) if someone is interested in testing this.

@bkamins
Member Author

bkamins commented Feb 6, 2023

Julia complains that the following method definitions are ambiguous:

reduce(::typeof(vcat), dfs::Union{Tuple{AbstractDataFrame, Vararg{AbstractDataFrame}}, AbstractVector{<:AbstractDataFrame}}; cols, source)
reduce(op::OP, x::SentinelArrays.ChainedVector) where OP

I will fix this once we decide what to include as dependencies.

EDIT: fixed
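For illustration, the same kind of ambiguity can be reproduced with toy types. Everything below is hypothetical (`myreduce`, `Chained`, `AbstractTable`); it mirrors the shape of the two signatures above and shows one common fix, adding a method that is more specific than both. The thread does not say how the actual fix was done.

```julia
abstract type AbstractTable end
struct Table <: AbstractTable end

# Minimal custom vector type, standing in for SentinelArrays.ChainedVector.
struct Chained{T} <: AbstractVector{T}
    parts::Vector{T}
end
Base.size(c::Chained) = size(c.parts)
Base.getindex(c::Chained, i::Int) = c.parts[i]

# More specific in the first argument.
myreduce(::typeof(vcat), xs::AbstractVector{<:AbstractTable}) = :table_method
# More specific in the second argument.
myreduce(op::OP, xs::Chained) where {OP} = :chained_method

tables = Chained([Table()])

# Neither method is more specific than the other, so the call is ambiguous.
was_ambiguous = try
    myreduce(vcat, tables)
    false
catch err
    err isa MethodError
end

# Fix: define the intersection of both signatures explicitly.
myreduce(::typeof(vcat), xs::Chained{<:AbstractTable}) = :resolved

resolved = myreduce(vcat, tables)
println(was_ambiguous, " ", resolved)
```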

@timholy
Contributor

timholy commented Feb 6, 2023

So all is good if we load CSV.jl (although we get recompilation when loading DataFrames.jl - @timholy: can you tell why?).

Do you get recompilation if you use --startup=no? I see there are several sources of Revise invalidation (I keep finding those...), will try to fix.

@bkamins
Member Author

bkamins commented Feb 6, 2023

Everything above is without Revise.jl and with --startup=no.

@timholy
Contributor

timholy commented Feb 6, 2023

Fixes for the Revise stack:

So all is good if we load CSV.jl (although we get recompilation when loading DataFrames.jl - @timholy: can you tell why?).

Base.require invalidation 😢 :

image

Packages that define new AbstractString subtypes are tricky!

@timholy
Contributor

timholy commented Feb 6, 2023

JuliaLang/julia#48557

@quinnj
Member

quinnj commented Feb 6, 2023

Also @quinnj - CSV.jl is now on 0.10.9 version. What are the plans for further development/versions of CSV.jl? (the issue is what compat bounds to put into Project.toml if we decide to go forward with adding CSV.jl as a dependency)

Yeah, I've been a little tied up w/ other projects at the moment, so haven't had a lot of time for CSV.jl lately. @Drvi, @nickrobinson251, and I have prototyped a new internal refactoring that currently lives here, which optimizes memory/perf for the chunked/row streaming case, and I want to adapt it to work for the CSV.File case as well. It should resolve the multithreading corner cases we continue to see pop up and be a better long-term solution for overall memory use as well. We just need to find the time to do the work to get it upstreamed to CSV.jl. So roughly my plan is we will probably have a 0.10.10 and maybe 0.10.11 release w/ some bugfixes and such, but 1.0 will be once we can upstream our new streaming work. I'm hopeful we can do that by the end of this year.

I also really appreciate the investigative efforts here by @bkamins and @timholy; I'm more than happy to make any changes necessary in InlineStrings.jl, SentinelArrays.jl, CSV.jl, WeakRefStrings.jl or wherever else if it means a better story for DataFrames.jl!

@timholy
Contributor

timholy commented Feb 6, 2023

With JuliaLang/julia#48557 I can verify (on a different machine)

julia> @time using CSV
  0.713006 seconds (759.16 k allocations: 47.936 MiB, 9.82% gc time, 1.54% compilation time)

julia> @time using DataFrames
  2.397113 seconds (2.93 M allocations: 164.404 MiB, 5.63% gc time)

Fixes the recompilation during load.

@bkamins
Member Author

bkamins commented Feb 6, 2023

@timholy - I am not sure if it was discussed in other places but maybe the way to go would be to define ENV["JULIA_PACKAGE_PRECOMPILE"] and if it is set to "no" then skip precompilation instructions. Otherwise perform precompilation. This could allow e.g. Pluto.jl to disable precompilation if it is not desirable.

If you do not find this general solution useful, maybe we could add ENV["JULIA_DATAFRAMES_PRECOMPILE"], which would have the same effect but be limited to DataFrames.jl precompilation?
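A minimal sketch of what such an environment-variable switch could look like, assuming the variable name proposed above. The helper `should_precompile` is hypothetical and does not exist in DataFrames.jl.

```julia
# `env` defaults to the process environment; accepting any dictionary
# makes the check easy to exercise without touching ENV.
should_precompile(env = ENV) = get(env, "JULIA_DATAFRAMES_PRECOMPILE", "yes") != "no"

if should_precompile()
    # The package's precompilation workload would run here.
end

println(should_precompile(Dict("JULIA_DATAFRAMES_PRECOMPILE" => "no")))
```

Because the value would be read while the package is being precompiled, the resulting cache depends on an environment setting that is not recorded anywhere.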

CC @KristofferC

@timholy
Contributor

timholy commented Feb 6, 2023

You know about the last section of the SnoopPrecompile docs?

using SnoopPrecompile, Preferences
set_preferences!(SnoopPrecompile, "skip_precompile" => ["PackageA", "PackageB"])

That's strongly encouraged over the ENV solution, as the ENV solution can cause you to end up with an inconsistent cache (there's no record of what the ENV settings were when a given package was precompiled). Warning, though: I may change how this works to make the settings more "granular." Stay vaguely tuned over the next month or so.

This could allow e.g. Pluto.jl to disable precompilation if it is not desirable.

My impression is that @fonsp is planning to implement (or has implemented) utilities to sync the manifests of many different notebooks to a single "master" environment. (It's "just" a matter of copying the version info from one Manifest into the corresponding slot in a second Manifest.) I hope that should at least hold us over until the exciting work on parallel LLVM compilation lands.

@bkamins
Member Author

bkamins commented Feb 6, 2023

OK - thank you for the explanation.

Contributor

@timholy timholy left a comment

LGTM, see very small comments.

@bkamins
Member Author

bkamins commented Feb 7, 2023

After some more thinking and testing, I buy the argument that CSV.jl is too heavy a dependency for DataFrames.jl. However, SentinelArrays.jl and InlineStrings.jl seem relatively lightweight, as we can see here:

julia> @time_imports using DataFrames
      0.8 ms  Statistics
      0.3 ms  Reexport
      0.2 ms  Compat
      6.1 ms  OrderedCollections
     59.4 ms  DataStructures
      0.5 ms  SortingAlgorithms
      0.8 ms  DataAPI
     15.7 ms  PooledArrays
      7.6 ms  Missings
      2.4 ms  InvertedIndices
      0.3 ms  IteratorInterfaceExtensions
      0.2 ms  TableTraits
      0.9 ms  Formatting
      0.3 ms  DataValueInterfaces
     13.9 ms  Tables
    335.9 ms  StringManipulation
     71.1 ms  Crayons
      0.8 ms  LaTeXStrings
    174.1 ms  PrettyTables
     12.2 ms  Preferences
      0.3 ms  SnoopPrecompile
     46.0 ms  SentinelArrays
     62.9 ms  Parsers
      6.4 ms  InlineStrings
   1004.4 ms  DataFrames

(they add 46 ms and 6.4 ms respectively, which I think is acceptable)
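As a quick check of the added load time, the relevant rows of the table can be summed. The second figure assumes that Parsers is loaded only because InlineStrings depends on it; the import order suggests this, but the thread does not confirm it.

```julia
# Import times (milliseconds) copied from the @time_imports output above.
import_ms = Dict("SentinelArrays" => 46.0, "Parsers" => 62.9, "InlineStrings" => 6.4)

# The two packages named in the discussion.
direct_ms = import_ms["SentinelArrays"] + import_ms["InlineStrings"]   # ≈ 52.4 ms

# Assumption: Parsers is a transitive cost of InlineStrings.
with_parsers_ms = direct_ms + import_ms["Parsers"]                     # ≈ 115.3 ms

println((direct_ms, with_parsers_ms))
```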

Now a comparison of timing of normal load of DataFrames.jl is as follows:

If we depend on CSV.jl

julia> @time using DataFrames
  2.256193 seconds (3.49 M allocations: 217.968 MiB, 4.23% gc time, 3.26% compilation time)

If we depend on SentinelArrays.jl and InlineStrings.jl

julia> @time using DataFrames
  1.910445 seconds (2.64 M allocations: 167.254 MiB, 3.77% gc time, 4.10% compilation time)

(the time would be similar if we did not depend on SentinelArrays.jl and InlineStrings.jl)

The benefit of depending on SentinelArrays.jl and InlineStrings.jl is that when someone uses them (directly or indirectly), precompiled DataFrames.jl code is not invalidated. In practice this currently means CSV.jl users, though these packages may become more widely used in the future. So:

  • CSV.jl users get some benefit (not 100%, but a lot)
  • non-CSV.jl users get speedup

Member

@nalimilan nalimilan left a comment

It's too bad that we have to add dependencies just for precompilation, but it's probably worth it as a temporary measure. Ideally at some point Julia will be able to make these conditional on these packages being installed in the environment.

@timholy
Contributor

timholy commented Feb 10, 2023

add dependencies just for precompilation

I suspect it's headed to "pseudo-stdlib" status. I plan to move SnoopPrecompile out to JuliaLang sometime soon; I'm dragging my feet mostly because I wonder if we should rename it precisely to avoid conflating it with SnoopCompile. (They use similar techniques and thus are parallel in my mind, but they are also quite different.)

SnoopCompile is big with lots of dependencies, but SnoopPrecompile is tiny: https://github.com/timholy/SnoopCompile.jl/blob/master/SnoopPrecompile/src/SnoopPrecompile.jl is the entire package (and it's about 40% docstring).

@quinnj
Member

quinnj commented Feb 10, 2023

I think @nalimilan's concern is having to add CSV/InlineStrings/SentinelArrays for precompilation, not SnoopPrecompile, which as you point out is lightweight.

@timholy
Contributor

timholy commented Feb 10, 2023

Gotcha. Keep in mind that adding them is an efficient way of avoiding having your code invalidated, but it's not the only solution. The other main approach is to identify the inference failures in DataFrames.jl that are causing Julia to be uncertain about which methods will be dispatched and then fix those inference failures. That said, I'm fully on board with this being an expedient and very effective solution that will make things better for your users.

I'm painfully aware that SnoopCompile + ascend + Cthulhu is a big stack of code to learn, and reading Julia's type-inferred CodeInfos is a bit like a 2-language problem. I just started working on JuliaDebug/Cthulhu.jl#345 because I think it's long overdue that Julia have an easy way for relative newbies to identify and fix type-instability in their code. Should help fix invalidations and save lots of hours on discourse helping people resolve "why is Julia slower than LanguageX?" questions.

@bkamins
Member Author

bkamins commented Feb 11, 2023

identify the inference failures in DataFrames.jl that are causing Julia to be uncertain about which methods will be dispatched and then fix those inference failures.

I wanted to confirm one thing here. Since DataFrame is type unstable on purpose, at some point we must have this "dispatch uncertainty" (at the point where we move from type-unstable to type-stable code), and it is unavoidable. Do I understand this correctly?
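The transition described here is often called a function barrier. A minimal sketch with a hypothetical `ToyFrame` type (not the DataFrames.jl implementation): the container stores columns behind an abstract type, so the outer call dispatches dynamically, while the inner kernel is compiled for the concrete column type.

```julia
# Columns are stored behind an abstract element type, so their concrete
# types are unknown to the compiler (type unstable on purpose).
struct ToyFrame
    columns::Vector{AbstractVector}
end

# Outer function: the call to `_colsum` is the point of dispatch uncertainty.
colsum(df::ToyFrame, i::Int) = _colsum(df.columns[i])

# Inner kernel: specialized on the concrete column type, hence type stable.
function _colsum(col::AbstractVector{T}) where {T<:Number}
    acc = zero(T)
    for x in col
        acc += x
    end
    return acc
end

df = ToyFrame(AbstractVector[[1, 2, 3], [0.5, 1.5]])
println(colsum(df, 1), " ", colsum(df, 2))

# Inference knows the kernel's return type once the column type is concrete.
kernel_type = only(Base.return_types(_colsum, (Vector{Int},)))
```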

Co-authored-by: Milan Bouchet-Valat <[email protected]>
@bkamins bkamins merged commit 1b9fa19 into main Feb 11, 2023
@bkamins bkamins deleted the bk/precompilation branch February 11, 2023 09:42
@bkamins
Member Author

bkamins commented Feb 11, 2023

Thank you! (I would love to continue the discussion to get a better understanding of what can be done.)

@timholy
Contributor

timholy commented Feb 13, 2023

Yes, there are places where deliberate non-specialization is difficult to reconcile with resistance to invalidations. In such cases, setting Base.Experimental.@max_methods 1 is probably your best hope. If that causes performance problems, one way to refine that strategy might be to split the code that can tolerate it into a separate submodule, and set @max_methods only in that submodule.
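A sketch of the submodule layout suggested here, assuming Julia 1.8 or later (where `Base.Experimental.@max_methods` is available). The module and function names are hypothetical.

```julia
module ToyFrames

# Submodule holding the deliberately non-specialized code.
module Dynamic

# Applies only to methods defined in this submodule: inference stops
# considering call sites where more than one method could match.
Base.Experimental.@max_methods 1

describe(x::Int) = "integer"
describe(x::String) = "string"

# Element types are unknown here, so `describe` is resolved at run time.
describe_all(xs::Vector{Any}) = [describe(x) for x in xs]

end # module Dynamic

# Performance-sensitive code stays in the parent module with default settings.
total(xs::Vector{Float64}) = sum(xs)

end # module ToyFrames

described = ToyFrames.Dynamic.describe_all(Any[1, "a"])
summed = ToyFrames.total([1.0, 2.0])
println(described, " ", summed)
```

The setting changes only how much the compiler tries to infer, not what the code returns, so the behaviour of both functions is the same as without it.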

Successfully merging this pull request may close these issues.

Invalidations when loading CSV
4 participants