Nonlinearity and Neural Networks

This article explores nonlinearity and neural network architectures.

Linear Function vs. Neural Network

(Figure: linear function vs. non-linear function)

If w1 and w2 are weight tensors, and b1 and b2 are bias tensors (all initialized randomly), then the following is a linear function. In Python, matrix multiplication is written with the @ operator.

def linear(xb):
    return xb@w1 + b1
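
To actually run this, the parameters need to be created first. Here is a minimal sketch (the shapes are my assumption for a single-output model on flattened 28*28 images, not from the article):

import torch

w1 = torch.randn(28*28, 1, requires_grad=True)   # weight tensor, initialized randomly
b1 = torch.randn(1, requires_grad=True)          # bias tensor, initialized randomly

xb = torch.randn(64, 28*28)   # a dummy batch of 64 flattened images
preds = linear(xb)            # shape: (64, 1)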

How do we turn a linear function into a neural network?

def simple_net(xb):
    res = xb@w1 + b1
    res = res.max(tensor(0.0))
    res = res@w2 + b2
    return res

Take the output of the first linear function, take the element-wise maximum of that output and zero, then put the result through another linear function. This is a neural network.
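
A minimal sketch of how simple_net might be used. The shapes here are my assumption (28*28 inputs, 30 hidden activations, 1 output, matching the nn.Sequential version later in the article), so w1 and b1 now take the hidden-unit shapes:

import torch
from torch import tensor   # used by simple_net for tensor(0.0)

w1 = torch.randn(28*28, 30, requires_grad=True)
b1 = torch.randn(30, requires_grad=True)
w2 = torch.randn(30, 1, requires_grad=True)
b2 = torch.randn(1, requires_grad=True)

xb = torch.randn(64, 28*28)   # dummy batch of 64 flattened images
out = simple_net(xb)          # shape: (64, 1)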

A neural network can approximate any function to any level of accuracy, given enough parameters with the right values. This is known as the universal approximation theorem.

This chaining of functions is called function composition: combining simple functions to build more complicated ones. The result of each function is passed as the argument of the next, and the result of the last one is the result of the whole.

The expression res.max(tensor(0.0)) is called a rectified linear unit (ReLU). It replaces every negative number with a zero. In PyTorch it is available as F.relu.

ReLU is an activation function or non-linearity.
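
A quick check of this equivalence (my own example values):

import torch.nn.functional as F
from torch import tensor

t = tensor([-2.0, 0.0, 3.0])
t.max(tensor(0.0))   # tensor([0., 0., 3.])
F.relu(t)            # tensor([0., 0., 3.]) -- same result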

Why do we add a Non-linear function?

Using more linear layers, we can have our model do more computation, and therefore model more complex functions.

But there’s no point in putting one linear layer directly after another, because a series of any number of linear layers in a row can be replaced with a single linear layer with a different set of parameters. Mathematically, the composition of two linear functions is another linear function. So we can stack as many linear classifiers as we want on top of each other, and without nonlinear functions between them it will just be the same as one linear classifier.
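
A small numerical check of this claim (my own sketch, with arbitrary shapes): two stacked linear layers collapse into a single linear layer with combined parameters.

import torch

x = torch.randn(5, 4)
w1, b1 = torch.randn(4, 3), torch.randn(3)
w2, b2 = torch.randn(3, 2), torch.randn(2)

two_layers = (x @ w1 + b1) @ w2 + b2   # one linear layer followed by another
w, b = w1 @ w2, b1 @ w2 + b2           # the equivalent single linear layer
one_layer = x @ w + b

print(torch.allclose(two_layers, one_layer, atol=1e-5))   # True (up to floating-point error)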

But if we put a nonlinear function between them, such as max, then this is no longer true. Now each linear layer is actually somewhat decoupled from the other ones and can do its own useful work. The max function operates as a simple if statement.

Some other activation functions:

(Figure: various forms of non-linear activation functions; see the reference below.)

PyTorch implementation:

from torch import nn

simple_net = nn.Sequential(
    nn.Linear(28*28, 30),   # first linear layer: 784 inputs -> 30 hidden activations
    nn.ReLU(),              # the non-linearity
    nn.Linear(30, 1)        # second linear layer: 30 hidden activations -> 1 output
)

nn.Sequential does function composition: it calls each of the listed layers in turn, passing the output of one as the input to the next.

nn.Linear is the linear function; it holds both the weight and bias tensors inside a single module.

nn.ReLU is a PyTorch module that does exactly the same thing as the F.relu function.
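
A quick way to convince yourself of this (my own example):

import torch
import torch.nn.functional as F
from torch import nn

t = torch.tensor([-1.0, 2.0])
print(nn.ReLU()(t))   # tensor([0., 2.])
print(F.relu(t))      # tensor([0., 2.]) -- identical output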

Model Comparison

One Layer Model: (training results figure)

Simple Neural Network: (training results figure)

Deeper Model: (training results figure)

We can add as many layers as we want, as long as we add a nonlinearity between each pair of linear layers. However, the deeper the model gets, the harder it is to optimize the parameters in practice.
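
For example, a deeper version of the model above might look like this (the layer sizes here are my own choice, not from the article):

from torch import nn

deeper_net = nn.Sequential(
    nn.Linear(28*28, 64),
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 1)
)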

(Figure: resnet18 training results)

Why do we use deeper models?

A single nonlinearity with two linear layers is enough to approximate any function. But with a deeper model (one with more layers):

  • Smaller matrices with more layers get better results than larger matrices with fewer layers.
  • The model can perform better in practice, train more quickly, and take less memory.

References:

https://www.researchgate.net/figure/Various-forms-of-non-linear-activation-functions-Figure-adopted-from-Caffe-Tutorial_fig3_315667264
