Neural networks can be constructed using the Torch::NN module.
Now that you have had a glimpse of Torch::Autograd, note that Torch::NN depends on Torch::Autograd to define models and differentiate them.
A Torch::NN::Module contains layers and a method forward(input) that returns the output.
A typical training procedure for a neural network is as follows:
- Define the neural network that has some learnable parameters (or weights)
- Iterate over a dataset of inputs
- Process input through the network
- Compute the loss (how far the output is from being correct)
- Propagate gradients back into the network’s parameters
- Update the weights of the network, typically using a simple update rule:
weight = weight - learning_rate * gradient
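The steps above can be sketched without any library. The following plain-Ruby example is only an illustration and does not use the torch.rb API: it assumes a hypothetical one-weight model (prediction = weight * x), a squared-error loss, and a gradient derived by hand, and it fits the weight to data generated with a true weight of 2.

```ruby
# Plain-Ruby sketch of the training procedure (no torch.rb involved).
# Hypothetical model: prediction = weight * x, with a single learnable weight.
dataset = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]] # [input, target] pairs, target = 2 * input
weight = 0.0
learning_rate = 0.05
last_loss = nil

100.times do
  dataset.each do |x, target|
    output = weight * x                     # process input through the "network"
    last_loss = (output - target)**2        # compute the loss
    gradient = 2.0 * (output - target) * x  # d(loss)/d(weight), derived by hand
    weight -= learning_rate * gradient      # weight = weight - learning_rate * gradient
  end
end

puts weight.round(4) # a value very close to 2.0
```

In a real network the gradient is not derived by hand; that is the part Torch::Autograd takes over.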
Let’s define this network:
require "torch"

class MyNet < Torch::NN::Module
  def initialize
    super()
    # 1 input image channel, 6 output channels, 5x5 square convolution
    # kernel
    @conv1 = Torch::NN::Conv2d.new(1, 6, 5)
    @conv2 = Torch::NN::Conv2d.new(6, 16, 5)
    # an affine operation: y = Wx + b
    @fc1 = Torch::NN::Linear.new(16 * 5 * 5, 120)
    @fc2 = Torch::NN::Linear.new(120, 84)
    @fc3 = Torch::NN::Linear.new(84, 10)
  end

  def forward(x)
    # Max pooling over a (2, 2) window
    x = Torch::NN::F.max_pool2d(Torch::NN::F.relu(@conv1.call(x)), [2, 2])
    # If the size is a square, you can specify with a single number
    x = Torch::NN::F.max_pool2d(Torch::NN::F.relu(@conv2.call(x)), 2)
    x = Torch.flatten(x, 1) # flatten all dimensions except the batch dimension
    x = Torch::NN::F.relu(@fc1.call(x))
    x = Torch::NN::F.relu(@fc2.call(x))
    @fc3.call(x) # final layer: 10 raw output scores, no activation
  end
end

net = MyNet.new
p net
(conv1): Conv2d(1, 6, kernel_size: [5, 5], stride: [1, 1])
(conv2): Conv2d(6, 16, kernel_size: [5, 5], stride: [1, 1])
(fc1): Linear(in_features: 400, out_features: 120, bias: true)
(fc2): Linear(in_features: 120, out_features: 84, bias: true)
(fc3): Linear(in_features: 84, out_features: 10, bias: true)
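The in_features: 400 of fc1 follows from the input size and the layer settings. As a rough check, the arithmetic can be written out in plain Ruby. The helpers conv_out and pool_out are hypothetical names introduced here, not torch.rb methods, and they assume no padding, a convolution stride of 1 (as printed above), and pooling windows that do not overlap.

```ruby
# Plain-Ruby arithmetic only; these helpers are hypothetical, not torch.rb methods.

# Output size of a convolution with no padding and stride 1.
def conv_out(size, kernel)
  size - kernel + 1
end

# Output size of max pooling with a non-overlapping window (stride == window).
def pool_out(size, window)
  size / window
end

size = 32                              # input height and width
size = pool_out(conv_out(size, 5), 2)  # conv1 (5x5), then 2x2 pooling: 32 -> 28 -> 14
size = pool_out(conv_out(size, 5), 2)  # conv2 (5x5), then 2x2 pooling: 14 -> 10 -> 5

flattened = 16 * size * size           # 16 channels come out of conv2
puts flattened # 400
```

This is why the first linear layer is declared with 16 * 5 * 5 input features, and why the network expects 32x32 inputs.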
You just have to define the forward method, and the backward method (where gradients are computed) is automatically defined for you using Torch::Autograd. You can use any of the Tensor operations in the forward method.
The learnable parameters of a model are returned by net.parameters
params = net.parameters
p params.length
p params[0].size # conv1's .weight
[6, 1, 5, 5]
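To get a feel for how many learnable values this network holds, the shapes can be tallied by hand. The sketch below is plain Ruby and makes no torch.rb calls; the shapes are read off the printed model, assuming each layer has one weight tensor and one bias tensor and that linear weights are stored as [out_features, in_features].

```ruby
# Plain-Ruby bookkeeping; shapes are assumptions read off the printed model.
shapes = {
  "conv1.weight" => [6, 1, 5, 5],
  "conv1.bias"   => [6],
  "conv2.weight" => [16, 6, 5, 5],
  "conv2.bias"   => [16],
  "fc1.weight"   => [120, 400],
  "fc1.bias"     => [120],
  "fc2.weight"   => [84, 120],
  "fc2.bias"     => [84],
  "fc3.weight"   => [10, 84],
  "fc3.bias"     => [10]
}

# Number of scalars in each tensor is the product of its dimensions.
counts = shapes.transform_values { |shape| shape.reduce(:*) }
total = counts.values.sum

puts shapes.length # 10 parameter tensors
puts total         # 61706 learnable scalars
```

Most of the values sit in fc1, which alone accounts for 48,120 of them.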
Let’s try a random 32x32 input. Note: expected input size of this net (LeNet) is 32x32. To use this net on the MNIST dataset, please resize the images from the dataset to 32x32.
input = Torch.randn(1, 1, 32, 32)
out = net.call(input)
p out
tensor([[ 0.0372, -0.1058, 0.1373, 0.0924, -0.1012, -0.0540, 0.0306, 0.0649,
-0.0865, 0.0951]], requires_grad: true)
Zero the gradient buffers of all parameters and backpropagate with random gradients:
net.zero_grad
out.backward(Torch.randn(1, 10))
Note: Torch::NN only supports mini-batches. The entire Torch::NN module only supports inputs that are a mini-batch of samples, and not a single sample.
For example, Torch::NN::Conv2d will take in a 4D Tensor of nSamples x nChannels x Height x Width.
If you have a single sample, just use input.unsqueeze(0) to add a fake batch dimension.
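What a fake batch dimension means can be pictured with nested Ruby arrays. This is only an analogy in plain Ruby, not tensor code: shape_of is a hypothetical helper, and wrapping the sample in one more array plays the role that unsqueeze(0) plays for a tensor.

```ruby
# Plain-Ruby illustration with nested arrays; `shape_of` is a hypothetical helper.
def shape_of(array)
  shape = []
  while array.is_a?(Array)
    shape << array.length
    array = array.first
  end
  shape
end

# One sample laid out as nChannels x Height x Width.
sample = Array.new(1) { Array.new(32) { Array.new(32, 0.0) } }
# Wrapping it once adds a leading dimension of size 1 (a batch of one sample).
batch = [sample]

p shape_of(sample) # [1, 32, 32]
p shape_of(batch)  # [1, 1, 32, 32]
```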
Before proceeding further, let’s recap all the classes you’ve seen so far.
- Torch::Tensor - A multi-dimensional array with support for autograd operations like backward. Also holds the gradient w.r.t. the tensor.
- Torch::NN::Module - Neural network module. Convenient way of encapsulating parameters, with helpers for moving them to GPU, exporting, loading, etc.
- Torch::NN::Parameter - A kind of Tensor that is automatically registered as a parameter when assigned as an attribute to a Torch::NN::Module.
At this point, we covered:
- Defining a neural network
- Processing inputs and calling backward
Still Left:
- Computing the loss
- Updating the weights of the network
A loss function takes the (output, target) pair of inputs, and computes a value that estimates how far away the output is from the target.
There are several different loss functions under the Torch::NN module. A simple loss is Torch::NN::MSELoss, which computes the mean-squared error between the input and the target.
For example:
output = net.call(input)
target = Torch.randn(10) # a dummy target, for example
target = target.view(1, -1) # make it the same shape as output
criterion = Torch::NN::MSELoss.new
loss = criterion.call(output, target)
p loss
tensor(1.4371, requires_grad: true)
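For intuition, mean-squared error itself is a short computation: square each difference between output and target, then average. The following is a plain-Ruby sketch on ordinary arrays; mse is a hypothetical helper written for this illustration and is not how Torch::NN::MSELoss is implemented.

```ruby
# Plain-Ruby sketch of mean-squared error; `mse` is a hypothetical helper.
def mse(output, target)
  raise ArgumentError, "size mismatch" unless output.length == target.length

  squared_errors = output.zip(target).map { |o, t| (o - t)**2 }
  squared_errors.sum / output.length.to_f
end

output = [0.5, -1.0, 2.0]
target = [1.0, -1.0, 0.0]
puts mse(output, target) # (0.25 + 0.0 + 4.0) / 3
```

The result is a single number, which matches the scalar tensor printed for loss above.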
Now, if you follow loss in the backward direction, you will see a graph of computations that looks like this:
input -> conv2d -> relu -> maxpool2d -> conv2d -> relu -> maxpool2d
-> flatten -> linear -> relu -> linear -> relu -> linear
-> MSELoss
-> loss
So, when we call loss.backward, the whole graph is differentiated w.r.t. the neural net parameters, and all Tensors in the graph that have requires_grad: true will have their .grad Tensor accumulated with the gradient.
To backpropagate the error, all we have to do is call loss.backward. You need to clear the existing gradients first, though; otherwise the new gradients will be accumulated into the existing ones.
Now we shall call loss.backward and have a look at conv1’s bias gradients before and after the backward pass.
net.zero_grad # zeroes the gradient buffers of all parameters
puts "conv1.bias.grad before backward"
p net.conv1.bias.grad
loss.backward
puts "conv1.bias.grad after backward"
p net.conv1.bias.grad
conv1.bias.grad before backward
conv1.bias.grad after backward
tensor([ 0.0044, -0.0015, -0.0084, 0.0121, -0.0160, 0.0089])
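Why the gradients need clearing is easier to see in isolation. The sketch below is plain Ruby with a hypothetical TinyParam class that stands in for a parameter; it is not part of torch.rb and only mimics the accumulate-then-clear behaviour described above.

```ruby
# Plain-Ruby sketch of gradient accumulation. TinyParam is a hypothetical
# stand-in for a parameter and is not part of torch.rb.
class TinyParam
  attr_reader :grad

  def initialize
    @grad = 0.0
  end

  # Like a backward pass: adds to the stored gradient instead of overwriting it.
  def accumulate(gradient)
    @grad += gradient
  end

  # Like zero_grad: clears the stored gradient.
  def zero_grad
    @grad = 0.0
  end
end

param = TinyParam.new
param.accumulate(0.5)
param.accumulate(0.5)
puts param.grad # 1.0, two passes added together

param.zero_grad
param.accumulate(0.5)
puts param.grad # 0.5, only the latest pass
```

Without the call to zero_grad, every backward pass would mix in gradients from earlier passes.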
Now, we have seen how to use loss functions.
The only thing left to learn is:
- Updating the weights of the network
The simplest update rule used in practice is Stochastic Gradient Descent (SGD):
weight = weight - learning_rate * gradient
We can implement this using simple Ruby code:
learning_rate = 0.01
net.parameters.each do |f|
  # in-place update: weight = weight - learning_rate * gradient
  f.data.sub!(f.grad.data * learning_rate)
end
However, as you use neural networks, you will want to use various update rules such as SGD, Nesterov-SGD, Adam, RMSProp, etc. To enable this, we built a small module, Torch::Optim, that implements all these methods. Using it is very simple:
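# create your optimizer
optimizer = Torch::Optim::SGD.new(net.parameters, lr: 0.01)
# in your training loop:
optimizer.zero_grad # zero the gradient buffers
output = net.call(input)
loss = criterion.call(output, target)
loss.backward # compute the gradients
optimizer.step # do the update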
Note: Observe how gradient buffers had to be manually set to zero using optimizer.zero_grad. This is because gradients are accumulated, as explained in the Backprop section.
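The zero_grad / backward / step rhythm can be seen end to end in a small plain-Ruby sketch. TinySGD and the Param struct are hypothetical names made up for this illustration; they only mimic the shape of an optimizer's interface and are not how Torch::Optim is implemented. The gradient of the toy loss (value - 3)^2 is written out by hand in place of a backward pass.

```ruby
# Plain-Ruby sketch of the zero_grad / step pattern. TinySGD and Param are
# hypothetical and only mimic the shape of an optimizer's interface.
Param = Struct.new(:value, :grad)

class TinySGD
  def initialize(params, lr:)
    @params = params
    @lr = lr
  end

  # Clear the stored gradients before the next backward pass.
  def zero_grad
    @params.each { |param| param.grad = 0.0 }
  end

  # Apply the update rule: value = value - lr * grad.
  def step
    @params.each { |param| param.value -= @lr * param.grad }
  end
end

weight = Param.new(0.0, 0.0)
optimizer = TinySGD.new([weight], lr: 0.1)

50.times do
  optimizer.zero_grad
  weight.grad += 2.0 * (weight.value - 3.0) # hand-written gradient of (value - 3)^2
  optimizer.step
end

puts weight.value.round(4) # close to 3.0, the minimum of the toy loss
```

Because the gradient is added with +=, dropping the zero_grad call would make each step use the sum of all earlier gradients.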