With our data in place, it’s time to train our first Neural Network. We’ll use an architecture similar to the one from the last blog post of the series: a linear version of our Neural Network, only able to handle linear patterns:
import torch
from torch import nn

class LinearModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer_1 = nn.Linear(in_features=12, out_features=5)
        self.layer_2 = nn.Linear(in_features=5, out_features=1)

    def forward(self, x):
        return self.layer_2(self.layer_1(x))
This neural network uses the nn.Linear module from PyTorch to create a Neural Network with one hidden layer (an input layer, a hidden layer and an output layer).
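As a quick sanity check (this snippet is mine, not part of the original training code), we can instantiate the class and print it to confirm the layer configuration:
model = LinearModel()
print(model)
# Should print something like:
# LinearModel(
#   (layer_1): Linear(in_features=12, out_features=5, bias=True)
#   (layer_2): Linear(in_features=5, out_features=1, bias=True)
# )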
Although we can create our own class inheriting from nn.Module, we can also, more elegantly, use the nn.Sequential constructor to do the same:
model_0 = nn.Sequential(
    nn.Linear(in_features=12, out_features=5),
    nn.Linear(in_features=5, out_features=1)
)
Cool! So our Neural Network contains a single hidden layer with 5 neurons (as indicated by the out_features=5 of the first layer). Each neuron in this hidden layer receives a connection from every input neuron. The 12 in in_features of the first layer reflects the number of input features, and the 1 in out_features of the second layer reflects the output: a single value that, after a sigmoid, ranges from 0 to 1.
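To make those shapes concrete, here’s a small sketch (the random batch of 8 samples is purely illustrative) that pushes data through model_0 and checks the output shape:
# Hypothetical batch: 8 samples, 12 features each
dummy_batch = torch.randn(8, 12)
logits = model_0(dummy_batch)
print(logits.shape)  # torch.Size([8, 1]) -> one raw output (logit) per sample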
To train our Neural Network, we’ll define a loss function and an optimizer. We’ll use BCEWithLogitsLoss (PyTorch 2.1 documentation) as the loss function (torch’s implementation of Binary Cross-Entropy with a built-in sigmoid, appropriate for binary classification problems) and Stochastic Gradient Descent as the optimizer (using torch.optim.SGD).
# Binary Cross-Entropy loss
loss_fn = nn.BCEWithLogitsLoss()

# Stochastic Gradient Descent optimizer
optimizer = torch.optim.SGD(params=model_0.parameters(),
                            lr=0.01)
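The “WithLogits” part means the sigmoid is applied inside the loss, so we can feed it raw model outputs. A quick sketch with made-up logits and labels (mine, not from the original post) shows it matches applying the sigmoid manually and using BCELoss:
# Made-up logits and labels, just to illustrate the built-in sigmoid
example_logits = torch.tensor([1.5, -0.3, 0.8])
example_labels = torch.tensor([1.0, 0.0, 1.0])

with_logits = nn.BCEWithLogitsLoss()(example_logits, example_labels)
manual = nn.BCELoss()(torch.sigmoid(example_logits), example_labels)
print(with_logits, manual)  # both print (approximately) the same value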
Finally, since I also want to calculate the accuracy for every epoch of the training process, we’ll write a function for that:
def compute_accuracy(y_true, y_pred):
    # Count predictions that match the true labels (true positives + true negatives)
    tp_tn = torch.eq(y_true, y_pred).sum().item()
    acc = (tp_tn / len(y_pred)) * 100
    return acc
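As a quick check on some made-up labels and predictions, the function simply returns the percentage of matching entries:
# Illustrative example: 3 out of 4 predictions match the true labels
y_true_example = torch.tensor([1., 0., 1., 1.])
y_pred_example = torch.tensor([1., 0., 0., 1.])
print(compute_accuracy(y_true_example, y_pred_example))  # 75.0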
Time to train our model! Let’s train it for 1000 epochs and see how a simple linear network deals with this data:
torch.manual_seed(42)

epochs = 1000
train_acc_ev = []
test_acc_ev = []

# Build training and evaluation loop
for epoch in range(epochs):
    model_0.train()
    y_logits = model_0(X_train).squeeze()
    loss = loss_fn(y_logits,
                   y_train)

    # Calculate accuracy using the predicted logits
    acc = compute_accuracy(y_true=y_train,
                           y_pred=torch.round(torch.sigmoid(y_logits)))
    train_acc_ev.append(acc)

    # Training steps
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    model_0.eval()
    # Inference mode for prediction on the test data
    with torch.inference_mode():
        test_logits = model_0(X_test).squeeze()
        test_loss = loss_fn(test_logits,
                            y_test)
        test_acc = compute_accuracy(y_true=y_test,
                                    y_pred=torch.round(torch.sigmoid(test_logits)))
        test_acc_ev.append(test_acc)

    # Print out accuracy and loss every 100 epochs
    if epoch % 100 == 0:
        print(f"Epoch: {epoch}, Loss: {loss:.5f}, Accuracy: {acc:.2f}% | "
              f"Test loss: {test_loss:.5f}, Test acc: {test_acc:.2f}%")
Unfortunately, the neural network we’ve just built is not good enough to solve this problem. Let’s see the evolution of training and test accuracy:
(I’m plotting accuracy instead of loss as it is easier to interpret in this problem)
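If you want to reproduce the plot, a minimal matplotlib sketch (assuming the train_acc_ev and test_acc_ev lists filled in the loop above) could look like this:
import matplotlib.pyplot as plt

# Training vs. test accuracy per epoch
plt.plot(train_acc_ev, label="Train accuracy")
plt.plot(test_acc_ev, label="Test accuracy")
plt.xlabel("Epoch")
plt.ylabel("Accuracy (%)")
plt.legend()
plt.show()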
Interestingly, our Neural Network isn’t able to improve the test set accuracy much.
With the knowledge we have from previous blog posts, we can try to add more layers and neurons to our neural network. Let’s do both and see the outcome:
deeper_model = nn.Sequential(
    nn.Linear(in_features=12, out_features=20),
    nn.Linear(in_features=20, out_features=20),
    nn.Linear(in_features=20, out_features=1)
)
Although our deeper model is a bit more complex, with an extra layer and more neurons, that doesn’t translate into better performance:
Even though the model is more complex, it doesn’t really bring more accuracy to our classification problem.
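One way to see why the extra layer doesn’t help: without an activation function in between, stacking linear layers is still just a single linear transformation. Here is a small sketch (mine, not from the original post) that composes two linear layers into one equivalent layer:
# Two stacked linear layers with no activation collapse into a single linear map
lin_a = nn.Linear(12, 20)
lin_b = nn.Linear(20, 1)

x = torch.randn(4, 12)
stacked_out = lin_b(lin_a(x))

# Equivalent single layer: W = W_b @ W_a, b = W_b @ b_a + b_b
combined = nn.Linear(12, 1)
with torch.no_grad():
    combined.weight.copy_(lin_b.weight @ lin_a.weight)
    combined.bias.copy_(lin_b.weight @ lin_a.bias + lin_b.bias)

print(torch.allclose(stacked_out, combined(x), atol=1e-6))  # True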
To achieve better performance, we need to unlock a new feature of Neural Networks: activation functions!
If making our model wider and deeper didn’t bring much improvement, there must be something else we can do with Neural Networks to improve their performance, right?
That’s where activation functions come in! In our example, we’ll return to our simpler model, but this time with a twist:
model_non_linear = nn.Sequential(
    nn.Linear(in_features=12, out_features=5),
    nn.ReLU(),
    nn.Linear(in_features=5, out_features=1)
)
What’s the difference between this model and the first one? We added a new block to our neural network: nn.ReLU. The rectified linear unit is an activation function that changes the computation between the layers of the Neural Network:
Every value produced by a layer (the weighted sum of its inputs) is passed through this function. If that value is negative, it is set to 0; otherwise, it is kept as it is. Just this small change adds a lot of power to a Neural Network architecture. In torch we have several activation functions we can use, such as nn.ReLU, nn.Tanh or nn.ELU. For an overview of all activation functions, check this link.
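A tiny sketch (with an illustrative tensor of my own) of what nn.ReLU does to its inputs:
relu = nn.ReLU()
example = torch.tensor([-2.0, -0.5, 0.0, 0.7, 3.1])
print(relu(example))  # tensor([0.0000, 0.0000, 0.0000, 0.7000, 3.1000])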
Our neural network architecture now contains a small twist:
With this small twist, every value coming from the first layer (represented by nn.Linear(in_features=12, out_features=5)) has to go through the “ReLU” test.
Let’s see the impact of fitting this architecture on our data:
Cool! Although we see some performance degradation after 800 epochs, this model doesn’t exhibit the overfitting the previous ones did. Keep in mind that our dataset is very small, so there’s a chance our results look better just by randomness. Nevertheless, adding activation functions to your torch models definitely has a huge impact in terms of performance, training and generalization, particularly when you have a lot of data to train on.
Now that you know the power of non-linear activation functions, it’s also relevant to know: