
Merge pull request #77 from sintefmath/dev
Add high-level CUDA support, doc improvements, NLDD performance increases, CSR backend as default
moyner authored Nov 12, 2024
2 parents ea3b38d + ff7e0c0 commit 178d314
Showing 46 changed files with 1,346 additions and 343 deletions.
14 changes: 11 additions & 3 deletions Project.toml
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
name = "JutulDarcy"
uuid = "82210473-ab04-4dce-b31b-11573c4f8e0a"
authors = ["Olav Møyner <[email protected]>"]
version = "0.2.35"
version = "0.2.36"

[deps]
AlgebraicMultigrid = "2169fc97-5a83-5252-b627-83903c6c433c"
@@ -34,20 +34,26 @@ TimerOutputs = "a759f4b9-e2f1-59dc-863e-4aeb61b1ea8f"
Tullio = "bc48ee85-29a4-5162-ae0b-a64e1601d4bc"

[weakdeps]
AMGX = "c963dde9-0319-47f5-bf0c-b07d3c80ffa6"
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
GLMakie = "e9467ef8-e4e7-5192-8a1a-b1aee30e663a"
HYPRE = "b5ffcf37-a2bd-41ab-a3da-4bd9bc8ad771"
MPI = "da04e1cc-30fd-572f-bb4f-1f8673147195"
Makie = "ee78f7c6-11fb-53f2-987a-cfe4a2b5a57a"
PartitionedArrays = "5a9dfac6-5c52-46f7-8278-5e2210713be9"

[extensions]
JutulDarcyAMGXExt = "AMGX"
JutulDarcyCUDAExt = "CUDA"
JutulDarcyGLMakieExt = "GLMakie"
JutulDarcyMakieExt = "Makie"
JutulDarcyPartitionedArraysExt = ["PartitionedArrays", "MPI", "HYPRE"]

[compat]
AlgebraicMultigrid = "0.5.1, 0.6.0"
Artifacts = "1"
AMGX = "0.2"
CUDA = "5"
DataStructures = "0.18.13"
Dates = "1"
DelimitedFiles = "1.6"
@@ -56,7 +62,7 @@ ForwardDiff = "0.10.30"
GLMakie = "0.10.13"
GeoEnergyIO = "1.1.12"
HYPRE = "1.6.0, 1.7"
Jutul = "0.2.40"
Jutul = "0.2.42"
Krylov = "0.9.1"
LazyArtifacts = "1"
LinearAlgebra = "1"
@@ -78,6 +84,8 @@ Tullio = "0.3.4"
julia = "1.7"

[extras]
AMGX = "c963dde9-0319-47f5-bf0c-b07d3c80ffa6"
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
GLMakie = "e9467ef8-e4e7-5192-8a1a-b1aee30e663a"
HYPRE = "b5ffcf37-a2bd-41ab-a3da-4bd9bc8ad771"
MPI = "da04e1cc-30fd-572f-bb4f-1f8673147195"
@@ -87,4 +95,4 @@ TestItemRunner = "f8b46487-2199-4994-9208-9a1283c18c0a"
TestItems = "1c621080-faea-4a02-84b6-bbd5e436b8fe"

[targets]
test = ["Test", "TestItemRunner", "TestItems", "HYPRE", "MPI", "PartitionedArrays"]
test = ["Test", "CUDA", "TestItemRunner", "TestItems", "HYPRE", "MPI", "PartitionedArrays"]
1 change: 1 addition & 0 deletions docs/make.jl
@@ -137,6 +137,7 @@ function build_jutul_darcy_docs(build_format = nothing;
],
"Further reading" => [
"man/advanced/mpi.md",
"man/advanced/gpu.md",
"man/advanced/compiled.md",
"Jutul functions" => "ref/jutul.md",
"Bibliography" => "extras/refs.md"
4 changes: 2 additions & 2 deletions docs/src/index.md
@@ -33,8 +33,8 @@ features:
link: /examples/intro_sensitivities
- icon: 🏃
title: High performance
details: Fast execution with support for MPI and thread parallelism
title: High performance on CPU & GPU
details: Fast execution with support for MPI, CUDA and thread parallelism
link: /man/advanced/mpi
---
````
57 changes: 57 additions & 0 deletions docs/src/man/advanced/gpu.md
@@ -0,0 +1,57 @@
# GPU support

JutulDarcy includes experimental support for running linear solves on the GPU. For many simulations, the linear systems are the most compute-intensive part and a natural choice for acceleration. At the moment, support is limited to CUDA GPUs through [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl). The most efficient CPR preconditioner additionally requires [AMGX.jl](https://github.com/JuliaGPU/AMGX.jl), which is currently limited to Linux systems. Windows users may have luck running Julia inside [WSL](https://learn.microsoft.com/en-us/windows/wsl/install).

## How to use

If you have installed JutulDarcy, you should start by adding the CUDA and optionally the AMGX packages using the package manager:

```julia
using Pkg
Pkg.add("CUDA") # Requires a CUDA-capable GPU
Pkg.add("AMGX") # Requires CUDA + Linux
```

Once the packages have been added to the same environment as JutulDarcy, you can load them to enable GPU support. Let us grab the first ten steps of the EGG benchmark model:

```julia
using Jutul, JutulDarcy
dpth = JutulDarcy.GeoEnergyIO.test_input_file_path("EGG", "EGG.DATA")
case = setup_case_from_data_file(dpth)
case = case[1:10]
```

### Running on CPU

If we wanted to run this on CPU we would simply call `simulate_reservoir`:

```julia
result_cpu = simulate_reservoir(case);
```

### Running on GPU with block ILU(0)

If we now load `CUDA`, we can run the same simulation using the CUDA-accelerated linear solver. On its own, CUDA only supports the ILU(0) preconditioner. JutulDarcy will automatically pick this preconditioner when CUDA is requested without AMGX, but we write it explicitly here:

```julia
using CUDA
result_ilu0_cuda = simulate_reservoir(case, linear_solver_backend = :cuda, precond = :ilu0);
```

### Running on GPU with CPR AMGX-ILU(0)

Loading the AMGX package makes a pure GPU-based two-stage CPR available. Again, we request CPR explicitly, but if both `CUDA` and `AMGX` are available and functional, this is redundant:

```julia
using AMGX
result_amgx_cuda = simulate_reservoir(case, linear_solver_backend = :cuda, precond = :cpr);
```

In short, load `AMGX` and `CUDA` and run `simulate_reservoir(case, linear_solver_backend = :cuda)` to get GPU results. The EGG model is quite small, so a larger case is necessary to see significant performance gains. `AMGX` also exposes a large number of options that advanced users can configure.

## Technical details and limitations

The GPU implementation relies on assembly on the CPU and pinned memory to transfer data onto the GPU. This means performance can be significantly improved by launching Julia with multiple threads to speed up the non-GPU parts of the code. AMGX is currently single-GPU only and does not work with MPI. At present, only `Float64` is supported for CPR, but pure ILU(0) solves also support `Float32`.
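Since assembly runs on the CPU, it is worth checking that Julia was actually started with more than one thread before a GPU run; a minimal check using only Base Julia:

```julia
# Report the number of threads available for CPU-side assembly.
# If this prints 1, restart Julia with e.g. `julia --threads=8`
# or set the JULIA_NUM_THREADS environment variable.
println("Julia threads: ", Threads.nthreads())
```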

!!! warning "Experimental status"
Multiple successive runs with different `AMGX` instances have resulted in crashes when old instances are garbage collected. This part of the code is still considered experimental, with contributions welcome if you are using it.
4 changes: 1 addition & 3 deletions docs/src/man/advanced/mpi.md
@@ -4,7 +4,7 @@ JutulDarcy can use threads by default, but advanced options can improve performance

## Overview of parallel support

There are two main ways of exploiting multiple cores in Jutul/JutulDarcy: Threads are automatically used for assembly and can be used for parts of the linear solve. If you require the best performance, you have to go to MPI where the linear solvers can use a parallel [BoomerAMG preconditioner](https://hypre.readthedocs.io/en/latest/solvers-boomeramg.html) via [HYPRE.jl](https://github.com/fredrikekre/HYPRE.jl).
There are two main ways of exploiting multiple cores in Jutul/JutulDarcy: Threads are automatically used for assembly and can be used for parts of the linear solve. If you require the best performance, you have to go to MPI where the linear solvers can use a parallel [BoomerAMG preconditioner](https://hypre.readthedocs.io/en/latest/solvers-boomeramg.html) via [HYPRE.jl](https://github.com/fredrikekre/HYPRE.jl). In addition, there is experimental GPU support described in [GPU support](@ref).

### MPI parallelization

@@ -25,7 +25,6 @@ Starting Julia with multiple threads (for example `julia --project. --threads=4`)

Threads are easy to use and can give some benefit for large models.


### Mixed-mode parallelism

You can mix the two approaches: Adding multiple threads to each MPI process can use threads to speed up assembly and property evaluations.
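As a sketch, a mixed-mode launch might look like the following, where the script name and the process and thread counts are placeholders to adapt to your setup:

```bash
# 4 MPI processes, each running Julia with 2 threads
mpiexec -n 4 julia --project=. --threads=2 run_simulation_mpi.jl
```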
@@ -42,7 +41,6 @@ A few hints when you are looking at performance:

Example: a 200k cell model on a laptop: 1 process 235 s -> 4 processes 145 s


## Solving with MPI in practice

There are a few adjustments needed before a script can be run in MPI.
7 changes: 4 additions & 3 deletions examples/co2_sloped.jl
@@ -133,7 +133,7 @@ krog_t = so.^2
krog = PhaseRelativePermeability(so, krog_t, label = :og)

# Higher resolution for second table:
sg = range(0, 1, 50)
sg = range(0, 1, 50);

# Evaluate Brooks-Corey to generate tables:
tab_krg_drain = brooks_corey_relperm.(sg, n = 2, residual = 0.1)
@@ -251,7 +251,7 @@ wd, states, t = simulate_reservoir(state0, model, dt,
parameters = parameters,
forces = forces,
max_timestep = 90day
)
);
# ## Plot the CO2 mole fraction
# We plot the overall CO2 mole fraction. We scale the color range to log10 to
# account for the fact that the mole fraction in cells made up of only the
@@ -339,7 +339,8 @@ for cell in 1:nc
x, y, z = centers[:, cell]
is_inside[cell] = sqrt((x - 720.0)^2 + 20*(z-70.0)^2) < 75
end
plot_cell_data(mesh, is_inside)
fig, ax, plt = plot_cell_data(mesh, is_inside)
fig
# ## Plot inventory in ellipsoid
# Note that a small mobile dip can be seen when free CO2 passes through this region.
inventory = co2_inventory(model, wd, states, t, cells = findall(is_inside))
22 changes: 18 additions & 4 deletions examples/data_input_file.jl
@@ -5,28 +5,42 @@
# regular JutulDarcy simulation, allowing modification and use of the case in
# differentiable workflows.
#
# We begin by loading the SPE9 dataset via the GeoEnergyIO package.
# We begin by loading the SPE9 dataset via the GeoEnergyIO package. This package
# includes a set of open datasets that can be used for testing and benchmarking.
# The SPE9 dataset is a 3D model with a corner-point grid and a set of wells
# produced by the Society of Petroleum Engineers. The specific version of the
# file included here is taken from the [OPM
# tests](https://github.com/OPM/opm-tests) repository.
using JutulDarcy, GeoEnergyIO
pth = GeoEnergyIO.test_input_file_path("SPE9", "SPE9.DATA")
pth = GeoEnergyIO.test_input_file_path("SPE9", "SPE9.DATA");
# ## Set up and run a simulation
# If we do not need the case, we could also have done:
# ws, states = simulate_data_file(pth)
case = setup_case_from_data_file(pth)
ws, states = simulate_reservoir(case)
ws, states = simulate_reservoir(case);
# ## Show the input data
# The input data takes the form of a Dict:
case.input_data
# We can also examine, for example, the RUNSPEC section, which is also
# represented as a Dict.
case.input_data["RUNSPEC"]
# ## Plot the simulation model
# These plot are interactive when run outside of the documentations.
# These plots are normally interactive, but if you are reading the published
# online documentation, static screenshots will be inserted instead.
using GLMakie
plot_reservoir(case.model, states)
# ## Plot the well responses
# We can plot the well responses (rates and pressures) in an interactive viewer.
# Multiple wells can be plotted simultaneously, with options to select which
# units are to be used for plotting.
plot_well_results(ws)
# ## Plot the field responses
# Similar to the wells, we can also plot field-wide measurables. We plot the
# field gas production rate and the average pressure as the initial selection.
# If you are running this case interactively you can select which measurables to
# plot.
#
# We observe that the field pressure steadily decreases over time, as a result
# of the gas production. The drop in pressure is not uniform, as during the
# period where little gas is produced, the decrease in field pressure is slower.
plot_reservoir_measurables(case, ws, states, left = :fgpr, right = :pres)
16 changes: 5 additions & 11 deletions examples/five_spot_ensemble.jl
@@ -4,8 +4,7 @@
# injector in one corner and the producer in the opposing corner, with a
# significant volume of fluids injected into the domain.
using JutulDarcy, Jutul
nx = 50
#-
nx = 50;
# ## Setup
# We define a function that, for a given porosity field, computes a solution
# with an estimated permeability field. For assumptions and derivation of the
@@ -55,10 +54,9 @@ function simulate_qfs(porosity = 0.2)
forces = setup_reservoir_forces(model, control = controls)
return simulate_reservoir(state0, model, dt, parameters = parameters, forces = forces, info_level = -1)
end
#-
# ## Simulate base case
# This will give the solution with uniform porosity of 0.2.
ws, states, report_time = simulate_qfs()
ws, states, report_time = simulate_qfs();
# ### Plot the solution of the base case
# We observe a radial flow pattern initially, before coning occurs near the
# producer well once the fluid has reached the opposite corner. The uniform
@@ -75,7 +73,6 @@ ax = Axis(fig[1, 2])
h = contourf!(ax, get_sat(states[nt]))
Colorbar(fig[1, end+1], h)
fig
#-
# ## Create 10 realizations
# We create a small set of realizations of the same model, with porosity that is
# uniformly varying between 0.05 and 0.3. This is not especially sophisticated
@@ -89,11 +86,10 @@ wells = []
report_step = nt
for i = 1:N
poro = 0.05 .+ 0.25*rand(Float64, (nx*nx))
ws, states, rt = simulate_qfs(poro)
push!(wells, ws)
push!(saturations, get_sat(states[report_step]))
ws_i, states_i, rt = simulate_qfs(poro)
push!(wells, ws_i)
push!(saturations, get_sat(states_i[report_step]))
end
#-
# ### Plot the oil rate at the producer over the ensemble
using Statistics
fig = Figure()
@@ -106,15 +102,13 @@ end
xlims!(ax, [mean(report_time), report_time[end]])
ylims!(ax, 0, 0.0075)
fig
#-
# ### Plot the average saturation over the ensemble
avg = mean(saturations)
fig = Figure()
h = nothing
ax = Axis(fig[1, 1])
h = contourf!(ax, avg)
fig
#-
# ### Plot the isocontour lines over the ensemble
fig = Figure()
h = nothing
2 changes: 1 addition & 1 deletion examples/model_coarsening.jl
@@ -27,7 +27,7 @@ fine_case = setup_case_from_data_file(data_pth);
fine_model = fine_case.model
fine_reservoir = reservoir_domain(fine_model)
fine_mesh = physical_representation(fine_reservoir)
ws, states = simulate_reservoir(fine_case, info_level = 1);
ws, states = simulate_reservoir(fine_case, info_level = -1);
# ## Coarsen the model and plot partition
# We coarsen the model using a partition size of 20x20x2 and the IJK method
# where the underlying structure of the mesh is used to subdivide the blocks.
13 changes: 7 additions & 6 deletions examples/optimize_simple_bl.jl
@@ -1,12 +1,13 @@
using Jutul
using JutulDarcy
using LinearAlgebra
using GLMakie
# # Example demonstrating optimization of parameters against observations
# We create a simple test problem: A 1D nonlinear displacement. The observations
# are generated by solving the same problem with the true parameters. We then
# match the parameters against the observations using a different starting guess
# for the parameters, but otherwise the same physical description of the system.
using Jutul
using JutulDarcy
using LinearAlgebra
using GLMakie

function setup_bl(;nc = 100, time = 1.0, nstep = 100, poro = 0.1, perm = 9.8692e-14)
T = time
tstep = repeat([T/nstep], nstep)
@@ -39,10 +40,10 @@ poro_ref = 0.1
perm_ref = 9.8692e-14
# ## Set up and simulate reference
model_ref, state0_ref, parameters_ref, forces, tstep = setup_bl(nc = N, nstep = Nt, poro = poro_ref, perm = perm_ref)
states_ref, = simulate(state0_ref, model_ref, tstep, parameters = parameters_ref, forces = forces, info_level = -1)
states_ref, = simulate(state0_ref, model_ref, tstep, parameters = parameters_ref, forces = forces, info_level = -1);
# ## Set up another case where the porosity is different
model, state0, parameters, = setup_bl(nc = N, nstep = Nt, poro = 2*poro_ref, perm = 1.0*perm_ref)
states, rep = simulate(state0, model, tstep, parameters = parameters, forces = forces, info_level = -1)
states, rep = simulate(state0, model, tstep, parameters = parameters, forces = forces, info_level = -1);
# ## Plot the results
fig = Figure()
ax = Axis(fig[1, 1], title = "Saturation")
1 change: 0 additions & 1 deletion examples/relperms.jl
@@ -316,7 +316,6 @@ lines(simulate_bl(model, prm), axis = (title = "Parametric LET function simulati
# ### Check out the parameters
# The LET parameters are now numerical parameters in the reservoir:
rmodel = reservoir_model(model)
display(rmodel)

# ## Conclusion
# We have explored a few aspects of relative permeabilities in JutulDarcy. There
2 changes: 0 additions & 2 deletions examples/two_phase_buckley_leverett.jl
@@ -52,7 +52,6 @@ end
n, n_f = 100, 1000
states, model, report = solve_bl(nc = n)
print_stats(report)
#-
# ## Run refined version (1000 cells, 1000 steps)
# Using a grid with 100 cells will not yield a fully converged solution. We can
# increase the number of cells at the cost of increasing the runtime a bit. Note
@@ -61,7 +60,6 @@
# use an iterative solver.
states_refined, _, report_refined = solve_bl(nc = n_f);
print_stats(report_refined)
#-
# ## Plot results
# We plot the saturation front for the base case at different times together
# with the final solution for the refined model. In this case, refining the grid

2 comments on commit 178d314


@moyner moyner commented on 178d314 Nov 13, 2024


@JuliaRegistrator register

Release notes:

Changes

  • Support for linear solves (CPR/ILU) on CUDA GPUs. See the docs for more details.
  • Improved EDIT and TRANX/TRANY/TRANZ support
  • Bug fixes in CPR
  • Significantly improved performance for adaptive NLDD solvers
  • Example updates

@JuliaRegistrator

Registration pull request created: JuliaRegistries/General/119311

Tagging

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the github interface, or via:

git tag -a v0.2.36 -m "<description of version>" 178d314d654bf842149b2311465aae350a56457b
git push origin v0.2.36
