WIP: Differentiable programming #12
43 changes: 43 additions & 0 deletions Courses/Differentiable Programming/010 Introduction.jl
@@ -0,0 +1,43 @@
# # Course on Automatic Differentiation in Julia

# ## Why derivatives?

# Derivatives are very useful in scientific computing.
# A common use case is in iterative routines, which
# include finding roots and extreme points of nonlinear equations.
# There are ways to do nonlinear optimization and nonlinear equation solving
# without using derivatives, but they are typically slow and not robust.
# By including "higher order information", namely the derivative, the iterative routine
# gets more information, and this opens up a whole new class of algorithms
# that can be orders of magnitude faster than the ones that do not use the derivative.
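#
# As a concrete example, here is a minimal sketch of Newton's method for root
# finding, one such derivative-based algorithm. The `newton` helper is our own
# illustration, and `f′` is a manually derived derivative; later in the course,
# AD will produce it for us:

function newton(f, f′, x; iters = 5)
    for _ in 1:iters
        x -= f(x) / f′(x)   # follow the tangent line down to its zero crossing
    end
    return x
end
newton(x -> x^2 - 2, x -> 2x, 1.0)   # converges to √2 ≈ 1.41421 in a few steps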
#
# Manually deriving the derivative of a function (or algorithm) can be
# tedious and error prone. A small error in the implementation of the derivative
# can mean that the optimal order of convergence of an iterative method is lost.
# Indeed, the author of this course has lost many days trying to locate that
# last bug in a derivative implementation.
#
# The basic rules for derivatives are also quite simple (here using an apostrophe
# to denote the derivative):
#
# * Addition rule: $(f(x) + g(x))' = f'(x) + g'(x)$
# * Product rule: $(f(x)g(x))' = f(x)g'(x) + f'(x)g(x)$
# * Chain rule: $(f(g(x)))' = f'(g(x))g'(x)$
#
# It is natural to ask: should it not be possible to
# automate the computation of these derivatives
# and have our program give back the derivative for us automatically?
# In most cases, the answer is yes: it is possible to get the exact derivative
# without having to implement any derivatives by hand.
# This is known as Automatic Differentiation, typically shortened to AD.
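#
# As a tiny preview of the mechanics (the course builds a full version later),
# we can carry a value together with its derivative and propagate both through
# every operation. This toy `Dual` type is a sketch for illustration only:

struct Dual
    val::Float64   # the value f(x)
    der::Float64   # the derivative f'(x)
end
Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)             # addition rule
Base.:*(a::Dual, b::Dual) = Dual(a.val*b.val, a.val*b.der + a.der*b.val)   # product rule

x = Dual(3.0, 1.0)   # the variable x at the point 3, seeded with dx/dx = 1
y = x*x + x*x*x      # evaluate f(x) = x² + x³ on the dual number
y.der                # exactly 2x + 3x² = 33, with no calculus done by hand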
#
#
# ## Course goals
#
# In this course we will use the programming language Julia to look at AD.
# The goal is that after successfully finishing the course, you
# should be able to:
# * Use some of the popular AD tools in Julia, both for forward mode and reverse mode AD.
# * Call optimization routines using derivatives computed with AD.
# * Understand the theoretical background behind dual numbers and how they can be used for forward mode AD.
# * Implement your own toy version of forward mode AD.
315 changes: 315 additions & 0 deletions Courses/Differentiable Programming/020 Descent.jl
@@ -0,0 +1,315 @@
#nb import Pkg; Pkg.add(Pkg.PackageSpec(url="https://github.com/JuliaComputing/JuliaAcademyData.jl"))
#nb using JuliaAcademyData; activate("Differentiable Programming")
datapath(p) = joinpath("../../../JuliaAcademyData.jl/courses/Differentiable Programming", p) #src

# # Descending to the top
#
# One killer app for derivatives is gradient descent. This is the process of
# incrementally improving some algorithm by adjusting its "knobs" (the tuneable
# parameters) based on its performance for some existing data.
#
# Each step incrementally improves on the previous set of parameters by
# determining which way to "nudge" each parameter in order to improve the output.
# The trick is finding this direction efficiently.
#
# We could of course try changing each parameter individually and see which direction to move, but that's pretty tedious,
# numerically fraught, and expensive. If, however, we knew the _partial derivatives_ with
# respect to each parameter then we could simply "descend" down the slope of our error
# function until we reach the bottom — that is, the minimum error!
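#
# Concretely, if we collect the parameters into a vector $\theta$ and write the
# error as a loss function $L(\theta)$, each step of gradient descent updates
#
# $$
# \theta_{k+1} = \theta_k - \eta \, \nabla L(\theta_k)
# $$
#
# where $\eta$ is a small step size, often called the learning rate (we will
# meet it below).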
#
# We all know how to draw a line on a graph — it just requires knowing the slope
# and intercept of the equation
#
# $$
# y = m x + b
# $$
#
# What we want to solve, though, is the inverse problem: we have a dataset and we
# want to find the line that best fits it. We can use gradient descent to do this:
#
# <img src="https://raw.githubusercontent.com/JuliaComputing/JuliaAcademyData.jl/master/courses/Differentiable%20Programming/images/020-linear.gif" alt="Fitting a line" />
#
# This is trivial — and there are better ways to do this in the first place —
# but the beauty of gradient descent is that it extends to much more complicated
# examples. For example, we can even fit a differential equation with this method:
#
# <img src="https://raw.githubusercontent.com/JuliaComputing/JuliaAcademyData.jl/master/courses/Differentiable%20Programming/images/020-diffeq.gif" alt="Fitting a differential equation" />
#
# But let's examine that line-fitting first as its simplicity makes a number of
# points clear. To begin, let's load some data.

using DifferentialEquations, CSV #src
import Random #src
Random.seed!(20) #src
generate_data(ts) = solve(ODEProblem([1.0,1.0],(0.0,10.0),[1.5,1.0,3.0,1.0]) do du,u,p,t; du[1] = p[1]*u[1] - p[2]*prod(u); du[2] = -p[3]*u[2] + p[4]*prod(u); end, saveat=ts)[1,:].*(1.0.+0.02.*randn.()).+0.05.*randn.() #src
ts = 0:.005:1 #src
ys = generate_data(ts) #src
CSV.write(datapath("data/020-descent-data.csv"), (t=ts, y=ys)) #src

using CSV
df = CSV.read(datapath("data/020-descent-data.csv"))
#-
using Plots
scatter(df.t, df.y, xlabel="time", label="data")

# Now we want to fit some model to this data — for a linear model we just need
# two parameters:
#
# $$
# y = m x + b
# $$
#
# Let's create a structure that represents this:

mutable struct LinearModel
    m::Float64   # slope
    b::Float64   # intercept
end
(model::LinearModel)(x) = model.m*x + model.b   # make the model callable: ŷ = model(x)

# And create a randomly picked model to see how we do:
linear = LinearModel(randn(), randn())
plot!(df.t, linear.(df.t), label="model")

# We just chose our `m` and `b` at random here, so of course it's not
# going to fit our data well! Let's quantify how far we are from an ideal line
# with a _loss_ function:

loss(f, xs, ys) = sum((f.(xs) .- ys).^2)   # sum of squared errors
loss(linear, df.t, df.y)

# That's a pretty big number — and we want to decrease it. Let's update our plot
# to include this loss value in the legend so we can keep track:

p = scatter(df.t, df.y, xlabel="time", label="data")
plot!(p, df.t, linear.(df.t), label="model: loss $(round(Int, loss(linear, df.t, df.y)))")

# And now we want to try to improve the fit. To do so, we just need to make
# the loss function as small as possible. We can of course simply try a bunch
# of values and brute-force a "good" solution. Plotting this as a three-dimensional
# surface gives us the imagery of a hill, and our goal is to find our way to the bottom:

ms, bs = (-1:.01:6, -2:.05:2.5)
surface(ms, bs, [loss(LinearModel(m, b), df.t, df.y) for b in bs for m in ms], xlabel="m values", ylabel="b values", title="loss value")

# A contour plot makes it a bit more obvious where the minimum is:

contour(ms, bs, [loss(LinearModel(m, b), df.t, df.y) for b in bs for m in ms], levels=150, xlabel="m values", ylabel="b values")

# But building those graphs is expensive! And it becomes completely intractable as soon as we
# have more than a few parameters to fit. We can instead just try nudging the
# current model's values to figure out which direction is "downhill":

linear.m += 0.1
plot!(p, df.t, linear.(df.t), label="new model: loss $(round(Int, loss(linear, df.t, df.y)))")

# We'll either have made things better or worse — but either way it's easy to
# see which way to change `m` in order to improve our model.
#
# Essentially, what we're doing here is we're finding the derivative of our
# loss function with respect to `m`, but again, there's a bit to be desired
# here. Instead of evaluating the loss function twice and observing the difference,
# we can just ask Julia for the derivative! This is the heart of differentiable
# programming — the ability to ask a _complete program_ for its derivative.
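#
# For comparison, here is the "evaluate twice and observe the difference" idea
# made explicit as a central finite difference. The step size `h` is an
# arbitrary choice here, and the result only approximates the true derivative:

let h = 1e-6, m₀ = linear.m
    linear.m = m₀ + h; loss_plus  = loss(linear, df.t, df.y)
    linear.m = m₀ - h; loss_minus = loss(linear, df.t, df.y)
    linear.m = m₀                     # restore the slope we started with
    (loss_plus - loss_minus) / (2h)   # finite-difference estimate of ∂loss/∂m
end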
#
# There are several techniques that you can use to compute the derivative. We'll
# use the [Zygote package](https://fluxml.ai/Zygote.jl/latest/) first. It will
# take a second as it compiles the "adjoint" code to compute the derivatives.
# Since we're doing this in multiple dimensions, the two derivatives together
# are called the gradient:

using Zygote
grads = Zygote.gradient(linear) do m
    return loss(m, df.t, df.y)
end
grads

# So there we go! Zygote saw that the model we gave it (`linear`) had two fields, and
# thus it computed the derivative of the loss function with respect to those two
# parameters.
#
# Now we know the slope of our loss function — and thus
# we know which way to move each parameter to improve it! We just need to iteratively walk
# downhill! We don't want to take too large a step, so we'll multiply each
# derivative by a "learning rate," typically called `η`.
η = 0.001
linear.m -= η*grads[1][].m
linear.b -= η*grads[1][].b

plot!(p, df.t, linear.(df.t), label="updated model: loss $(round(Int, loss(linear, df.t, df.y)))")

# Now we just need to do this a bunch of times!

for i in 1:200
    grads = Zygote.gradient(linear) do m
        return loss(m, df.t, df.y)
    end
    linear.m -= η*grads[1][].m
    linear.b -= η*grads[1][].b
    i > 40 && i % 10 != 1 && continue   # after 40 iterations, only plot every 10th
    #nb IJulia.clear_output(true)
    scatter(df.t, df.y, label="data", xlabel="time", legend=:topleft)
    display(plot!(df.t, linear.(df.t), label="model (loss: $(round(loss(linear, df.t, df.y), digits=3)))"))
end

#src === CREATING GIF ===
f = let #src
    linear = LinearModel(randn(), randn()) #src
    f = @gif for i in 1:150 #src
        grads = Zygote.gradient(linear) do m #src
            return loss(m, df.t, df.y) #src
        end #src
        linear.m -= η*grads[1][].m #src
        linear.b -= η*grads[1][].b #src
        scatter(df.t, df.y, label="data", xlabel="time", legend=:topleft) #src
        plot!(df.t, linear.(df.t), label="model (loss: $(round(loss(linear, df.t, df.y), digits=3)))") #src
    end #src
    mv(f.filename, datapath("images/020-linear.gif")) #src
end #src



# That's looking pretty good now! You might be saying this is crazy — I know
# how to do a least squares fit! And you'd be right — in this case you can
# easily do this with a linear solve of the system of equations:
m, b = [df.t ones(size(df.t))] \ df.y

@show (m, b)
@show (linear.m, linear.b);

# ## Exercise: Use gradient descent to fit a quadratic model to the data
#
# Obviously a linear fit leaves a bit to be desired here. Try to use the same
# framework to fit a quadratic model. Check your answer against the algebraic
# solution with the `\` operator.

struct PolyModel
    p::Vector{Float64}   # coefficients, lowest degree first
end
function (m::PolyModel)(x)
    r = m.p[1]*x^0
    for i in 2:length(m.p)
        r += m.p[i]*x^(i-1)
    end
    return r
end
poly = PolyModel(rand(3))
loss(poly, df.t, df.y)
η = 0.001
for i in 1:1000
    grads = Zygote.gradient(poly) do m
        return loss(m, df.t, df.y)
    end
    poly.p .-= η.*grads[1].p
end
scatter(df.t, df.y, label="data", legend=:topleft)
plot!(df.t, poly.(df.t), label="model by descent")
plot!(df.t, PolyModel([df.t.^0 df.t df.t.^2] \ df.y).(df.t), label="model by linear solve")

# ## Nonlinearities
#
# Let's see what happens when we load a bit more data:

let #src
    ts = 0:.04:8 #src
    ys = generate_data(ts) #src
    CSV.write(datapath("data/020-descent-data-2.csv"), (t=ts, y=ys)) #src
end #src
df2 = CSV.read(datapath("data/020-descent-data-2.csv"))
scatter(df2.t, df2.y, xlabel="t", label="more data")

# Now what? This clearly won't be fit well with a low-order polynomial, and
# even a high-order polynomial will get things wrong once we try to see what
# will happen in the "future"!
#
# Let's use a bit of knowledge about where this data comes from: it happens
# to be the number of rabbits in a predator-prey system! We can express the
# general behavior with a pair of differential equations:
#
# $$
# x' = \alpha x - \beta x y \\
# y' = -\delta y + \gamma x y
# $$
#
# * $x$ is the number of rabbits (🐰)
# * $y$ is the number of wolves (🐺)
# * $\alpha$ and $\beta$ describe the growth and death rates for the rabbits
# * $\gamma$ and $\delta$ describe the growth and death rates for the wolves
#
# But we don't know what those rate constants are — all we have is the population
# of rabbits over time.
#
# These are the classic Lotka-Volterra equations, and can be expressed in Julia
# using the [DifferentialEquations](https://github.com/JuliaDiffEq/DifferentialEquations.jl) package:
using DifferentialEquations
function rabbit_wolf(du,u,p,t)
    🐰, 🐺 = u
    α, β, δ, γ = p
    du[1] =  α*🐰 - β*🐰*🐺
    du[2] = -δ*🐺 + γ*🐰*🐺
end
u0 = [1.0,1.0]
tspan = extrema(df2.t)
p = rand(4).+1 # We don't know what this is!
## But let's see what the model looks like right now:
prob = ODEProblem(rabbit_wolf,u0,tspan,p)
sol = solve(prob, Tsit5(), saveat=df2.t)

scatter(df2.t, df2.y, xlabel="t", label="more data")
plot!(sol, label=["rabbits" "wolves"])

# So we're going to have to improve this — unlike the previous examples, we don't
# have an easy algebraic solution here! But let's try using gradient descent.
# The easiest way to get gradients out of a differential equation solver right
# now is through the [DiffEqFlux package](https://github.com/JuliaDiffEq/DiffEqFlux.jl), and instead of manually converging,
# we can use the Flux package to automatically (and more smartly) handle the gradient descent. Just
# like before, though, we compute the loss of the model evaluation — and in this
# case the model is _solving a differential equation!_
using Flux, DiffEqFlux
p = Flux.param(ones(4))
diffeq_loss(p, xs, ys) = sum(abs2, diffeq_rd(p,prob,Tsit5(),saveat=xs)[1,:] .- ys)

# This works slightly differently — we now track the gradients directly in the
# p vector:
p.grad
#-
l = diffeq_loss(p, df2.t, df2.y)
#-
DiffEqFlux.Tracker.back!(l) # we need to back-propagate our tracking of the gradients
p.grad # but now we can see the gradients involved in that computation!

# So now we can do exactly the same thing as before: iteratively update the parameters
# to descend to a (hopefully) well-fit model:
#
# ```julia
# p.data .-= η*p.grad
# ```
#
# But we can be a bit smarter about this just by asking Flux to handle everything
# with its train! function and associated functionality:

data = Iterators.repeated((df2.t, df2.y), 150)
opt = ADAM(0.1)
history = Any[] #src
cb = function () # callback function to observe training
    #nb IJulia.clear_output(true)
    plt = scatter(df2.t, df2.y, label="data", ylabel="population (thousands)", xlabel="t")
    display(plot!(plt, solve(remake(prob,p=Flux.data(p)),Tsit5(),saveat=0.1),
        ylim=(0,8), label=["rabbits" "wolves"], title="loss: $(round(Flux.data(diffeq_loss(p, df2.t, df2.y)), digits=3))"))
    push!(history, plt) #src
end
Flux.train!((xs, ys)->diffeq_loss(p, xs, ys), [p], data, opt, cb = cb)

f = @gif for i in 1:length(history) #src
    plot(history[i]) #src
end #src
mv(f.filename, datapath("images/020-diffeq.gif")) #src

# # Summary
#
# You can now see the power of differentiating whole programs — we can easily
# and efficiently tune parameters without brute-forcing solutions. Gradient
# descent easily extends to machine learning and artificial intelligence
# applications — and there are a number of tricks that can increase its efficiency
# and help avoid local minima. There are a variety of other places where knowing the gradient can
# be powerful and helpful.