from IPython.display import HTML, display, Math
import warnings
warnings.filterwarnings('ignore')
Learn pyTorch and Tensorflow Simultaneously¶
If you are interested in machine learning and can’t decide whether to learn pyTorch or Tensorflow, this is an article for you. Why go through the hassle of learning one and then migrating to the other, when we can learn both at once?
From version 2.0, Tensorflow’s coding experience has become much more similar to pyTorch’s, thanks to the introduction of GradientTape. Following this series of tutorials, you can quickly become proficient in both pyTorch and Tensorflow.
Tape-based Gradient Computation¶
A “Wengert list” [1] is a natural way to compute the numeric values of full or partial derivatives of an algebraic function without deriving an analytical expression for those derivatives.
HTML('<img src="https://www.dropbox.com/s/hgmguh7ieuxxszs/magnetic-tape.png?dl=1" width=400>')
- It works like a magnetic tape that records the mathematical operations applied to some variables. This enables efficient implementation of backpropagation along an arbitrary directed acyclic graph (DAG) of mathematical operations
- It is implemented as “Autograd” in pyTorch and as “GradientTape” in Tensorflow
- The tape-based architecture is conceptually closer to the mathematical formulations used by researchers (as opposed to a layer-based architecture, which is conceptually closer to how software engineers think)
- The concept of the tape is almost identical in pyTorch and Tensorflow, which makes it easier to move code from research to production
- For pyTorch, it enables dynamic neural networks (i.e. networks that can change their architecture at training or inference time based on the data)
- For Tensorflow, it enables automatic construction of computation graphs from Python code, which leads to extremely fast computation with or without specialized hardware (e.g. GPU, TPU)
[1] R.E. Wengert (1964). “A simple automatic derivative evaluation program”. Comm. ACM. 7 (8): 463–464.
Computing the Gradient¶
Computing the gradient of a function with respect to a vector is the fundamental operation in neural networks and in many other machine learning techniques that rely on gradient descent or similar optimization approaches. Let’s review the concept through an example:
Let’s assume $x = \left[x_1, x_2, x_3\right]^T$ is a three-dimensional vector. What is the gradient of $z=x^Tx$?
display(Math(r"""
z = \begin{bmatrix}x_{1}, x_{2}, x_{3}\end{bmatrix}\begin{bmatrix}x_{1} \\x_{2} \\x_{3}\end{bmatrix} = x_1^2+x_2^2+x_3^2 \quad\quad
\text{so,}
\nabla_x z = \begin{bmatrix}\frac{\partial z}{\partial x_{1}} \\\frac{\partial z}{\partial x_{2}} \\\frac{\partial z}{\partial x_{3}}\end{bmatrix} = 2x
"""))
Automatic Computation of Gradients¶
In pyTorch, variables already include a tape: they remember the mathematical operations applied to them as long as "requires_grad" is set to True. This enables automatic gradient calculation.
import torch
from torch.autograd import Variable
x = Variable(torch.Tensor([1,2,3]), requires_grad=True)
y = x**2
z = y.sum()
print(x, y, z)
# It is necessary to remember the mathematical operations
# to compute the gradients. We start back propagation
# from z and want the gradient w.r.t. x
z.backward()
print(x.grad)
# Using GPU (guarded so the cell also runs on a CPU-only machine)
if torch.cuda.is_available():
    x = x.cuda(0)
tensor([1., 2., 3.], requires_grad=True) tensor([1., 4., 9.], grad_fn=<PowBackward0>) tensor(14., grad_fn=<SumBackward0>)
tensor([2., 4., 6.])
In Tensorflow 2.x, the same is done with the help of GradientTape
import tensorflow as tf
x = tf.Variable([1.0, 2.0, 3.0])
x_tensor = tf.convert_to_tensor([1.0, 2.0, 3.0])
# Tensorflow Variables automatically get attached to the tape
with tf.GradientTape() as tape:
    y = x ** 2
    z = tf.math.reduce_sum(y)

# If it is a plain tensor, you need to manually attach it to the tape
# by calling the watch function
with tf.GradientTape() as tape2:
    tape2.watch(x_tensor)  # attaching the tensor to the tape
    y_tensor = x_tensor ** 2
    z_tensor = tf.math.reduce_sum(y_tensor)

grad = tape.gradient(z, x)
grad_tensor = tape2.gradient(z_tensor, x_tensor)
print(grad)
print(grad_tensor)
tf.Tensor([2. 4. 6.], shape=(3,), dtype=float32)
tf.Tensor([2. 4. 6.], shape=(3,), dtype=float32)
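To connect this back to the Wengert-list idea, here is a toy, framework-free sketch of what such a tape conceptually records for this same $z = x_1^2 + x_2^2 + x_3^2$ example. It is purely illustrative (the variable names and the tape layout are made up for this tutorial) and is not how either library is actually implemented.
# A toy "tape" (Wengert list): record each primitive operation during the
# forward pass, then replay the records in reverse to accumulate derivatives.
# Purely illustrative -- not how pyTorch or Tensorflow are implemented.
x = [1.0, 2.0, 3.0]
values = list(x)          # values[0..2] hold the inputs
tape = []                 # entries: (output_index, input_index, local_derivative)

# Forward pass: y_i = x_i ** 2, recording d(y_i)/d(x_i) = 2 * x_i
for i in range(3):
    values.append(values[i] ** 2)
    tape.append((len(values) - 1, i, 2 * values[i]))

# z = y_0 + y_1 + y_2, recording d(z)/d(y_i) = 1
z_index = len(values)
values.append(sum(values[3:6]))
for i in range(3, 6):
    tape.append((z_index, i, 1.0))

# Reverse pass (backpropagation): push the adjoint of z back to the inputs
adjoint = [0.0] * len(values)
adjoint[z_index] = 1.0
for out_i, in_i, local_grad in reversed(tape):
    adjoint[in_i] += adjoint[out_i] * local_grad

print(adjoint[:3])        # [2.0, 4.0, 6.0], i.e. 2*x, matching autograd/GradientTape
Reading the tape backwards is exactly the backpropagation pass that z.backward() and tape.gradient() perform for us automatically.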
Gradient Descent¶
Gradient descent is an algorithm for finding a (local) minimum of a function; its counterpart, gradient ascent, finds maxima. Let’s see the algorithm in action:
HTML('<img src="https://www.dropbox.com/s/5bm62eqdveo9h5g/pytorch_tensorflow_tutorial_1.gif?dl=1">')
import torch
from torch.autograd import Variable
x = Variable(torch.Tensor([-5.5, 5.5]), requires_grad=True)
print(x)
for i in range(5):
    # Also try some other complicated functions of x
    y = (x-3)**2
    z = y.sum()
    z.backward()
    out = x.data - 0.25*x.grad.data
    print(out)
    x = Variable(out, requires_grad=True)
tensor([-5.5000, 5.5000], requires_grad=True)
tensor([-1.2500, 4.2500])
tensor([0.8750, 3.6250])
tensor([1.9375, 3.3125])
tensor([2.4688, 3.1562])
tensor([2.7344, 3.0781])
Tensorflow¶
import tensorflow as tf
x = tf.Variable([-5.5, 3.])
print(x)
for i in range(5):
    with tf.GradientTape() as tape:
        tape.watch(x)
        y = (x - 3) ** 2.
        z = tf.reduce_sum(y)
    dz_dx = tape.gradient(z, x)
    print(dz_dx)
    # Update parameter
    x = x - 0.25*dz_dx
    print(x)
<tf.Variable 'Variable:0' shape=(2,) dtype=float32, numpy=array([-5.5, 3. ], dtype=float32)>
tf.Tensor([-17. 0.], shape=(2,), dtype=float32)
tf.Tensor([-1.25 3. ], shape=(2,), dtype=float32)
tf.Tensor([-8.5 0. ], shape=(2,), dtype=float32)
tf.Tensor([0.875 3. ], shape=(2,), dtype=float32)
tf.Tensor([-4.25 0. ], shape=(2,), dtype=float32)
tf.Tensor([1.9375 3. ], shape=(2,), dtype=float32)
tf.Tensor([-2.125 0. ], shape=(2,), dtype=float32)
tf.Tensor([2.46875 3. ], shape=(2,), dtype=float32)
tf.Tensor([-1.0625 0. ], shape=(2,), dtype=float32)
tf.Tensor([2.734375 3. ], shape=(2,), dtype=float32)
Implementing a simple neuron¶
Let us implement an or gate using a simple neuron. The neuron must take two numbers (each can be either 0 or 1) and produce the logical OR of its inputs. The beauty of a neural network is that we do not need to explicitly tell it how to do the computation. We’ll just provide the inputs and the desired outputs; the network will automatically figure out what to do with the input to produce the correct output.
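For reference, the desired behaviour is just the OR truth table; written out as a small Python list (the variable name is only for illustration):
# The four input/output pairs the trained neuron should reproduce
or_truth_table = [
    ([0, 0], 0),
    ([0, 1], 1),
    ([1, 0], 1),
    ([1, 1], 1),
]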
The mathematical formulation of the neuron is shown below.
HTML('<img src="https://www.dropbox.com/s/a6zb23eu3dhaj9y/Simple_neuron_1.png?dl=1">')
In the following code snippets, notice how similar the pyTorch and Tensorflow codes are. The neural network classes differ only in the name of the method responsible for the forward pass of the computation: in pyTorch it is named forward, and in Tensorflow it is __call__.
pyTorch¶
import numpy as np
import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F
class NNet(nn.Module):
    def __init__(self):
        super().__init__()
        # It takes row vectors of dimension 2
        self.lin = nn.Linear(2, 1)

    def forward(self, x):
        return F.sigmoid(self.lin(x))
net_pt = NNet()
x = Variable(torch.Tensor([[1,0]]))
print(net_pt(x))
tensor([[0.5201]], grad_fn=<SigmoidBackward>)
Tensorflow¶
import tensorflow as tf
from tensorflow.keras.layers import Dense
class NNet(tf.Module):
    def __init__(self):
        super().__init__()
        self.dense_layer = Dense(1, activation='sigmoid')

    def __call__(self, input):
        dense_output = self.dense_layer(input)
        return dense_output
net_tf = NNet()
input = tf.Variable([[1., 0.]])
print(net_tf(input))
tf.Tensor([[0.23421289]], shape=(1, 1), dtype=float32)
The Linear layer in pyTorch or the Dense layer in Tensorflow contains some hidden variables that we call the network parameters. During the training process, these variables adjust to the “correct” values so as to produce the desired output for any given input. It is possible to inspect the network parameters in both pyTorch and Tensorflow.
display(HTML("<strong>Parameters for the pytorch network:</strong>"))
for a_param in net_pt.parameters():
    print(a_param)

display(HTML("<strong>Parameters for the Tensorflow network:</strong>"))
for a_param in net_tf.variables:
    print(a_param.value())
Parameter containing:
tensor([[ 0.6024, -0.4435]], requires_grad=True)
Parameter containing:
tensor([-0.5219], requires_grad=True)
tf.Tensor(
[[-1.1846737]
 [-1.2093475]], shape=(2, 1), dtype=float32)
tf.Tensor([0.], shape=(1,), dtype=float32)
Notably, one of the parameters has the same dimensionality as the input, and the other has dimensionality 1. The first parameter (the one with dimension 2, let’s say $\mathbf{w}$) is multiplied with the input, and the other one (with dimension 1, say $b$) is added to the product. Together, they form a decision line in the 2-dimensional input space: $\mathbf{w}$ determines the slope (orientation) of the decision line, and $b$ its intercept.
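To make this concrete, here is a small sketch (not part of the tutorial’s training flow) that reproduces the pyTorch neuron’s output by hand from those two parameters; it should print the same value as print(net_pt(x)) above:
import numpy as np
# w has shape (1, 2) and b has shape (1,) in the pyTorch network above
w, b = [p.data.numpy() for p in net_pt.parameters()]
x_in = np.array([1.0, 0.0])
manual_output = 1.0 / (1.0 + np.exp(-(w @ x_in + b)))  # sigmoid(w . x + b)
print(manual_output)  # matches the output of net_pt(x) above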
Training the neuron¶
We implemented the forward operation through the neuron. But it is an untrained neuron, so it doesn’t provide the correct answer yet. We want to train it so that it can correctly work as an or gate. To conduct the training, we need the following two things:
- We need some data to train on. In the data, we’ll provide some randomly generated input and the desired output. This format of training is called supervised training.
- We need to decide upon a loss that penalizes the network for providing a wrong answer. Then we’ll minimize that loss using gradient descent (the loss we use is written out below).
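Both training loops below compute, by hand, the squared error between the desired output and the network output:
# The squared-error loss minimized in both training loops below
display(Math(r"\mathcal{L} = \left(y_{\text{desired}} - y_{\text{net}}\right)^2"))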
pyTorch¶
import torch.optim as optim
optimizer = optim.SGD(net_pt.parameters(), lr=0.33)
for i in range(5000):
    input_v = (np.random.rand(1,2)>0.5).astype(float)
    output_v = np.any(input_v).astype(float)
    net_output = net_pt(Variable(torch.Tensor(input_v)))
    loss = (Variable(torch.Tensor([output_v]))-net_output)**2
    if i % 100 == 0:
        print('input:', input_v, 'actual output should be:', output_v, 'net output:', net_output.data.numpy()[0,0], 'loss:', loss.data.numpy()[0])
    loss.backward()
    optimizer.step()
    net_pt.zero_grad()
input: [[0. 0.]] actual output should be: 0.0 net output: 0.37241194 loss: [0.13869065] input: [[0. 1.]] actual output should be: 1.0 net output: 0.74754155 loss: [0.06373527] input: [[0. 0.]] actual output should be: 0.0 net output: 0.38373128 loss: [0.1472497] input: [[0. 0.]] actual output should be: 0.0 net output: 0.25899634 loss: [0.0670791] input: [[0. 0.]] actual output should be: 0.0 net output: 0.2260203 loss: [0.05108518] input: [[0. 0.]] actual output should be: 0.0 net output: 0.23298575 loss: [0.05428236] input: [[0. 1.]] actual output should be: 1.0 net output: 0.89136934 loss: [0.01180062] input: [[0. 1.]] actual output should be: 1.0 net output: 0.8948328 loss: [0.01106014] input: [[1. 1.]] actual output should be: 1.0 net output: 0.9967218 loss: [1.07465685e-05] input: [[0. 0.]] actual output should be: 0.0 net output: 0.15272528 loss: [0.02332501] input: [[0. 0.]] actual output should be: 0.0 net output: 0.16019672 loss: [0.02566299] input: [[1. 1.]] actual output should be: 1.0 net output: 0.9985903 loss: [1.9872807e-06] input: [[1. 1.]] actual output should be: 1.0 net output: 0.99887794 loss: [1.259013e-06] input: [[0. 0.]] actual output should be: 0.0 net output: 0.14272739 loss: [0.02037111] input: [[1. 0.]] actual output should be: 1.0 net output: 0.9234849 loss: [0.00585456] input: [[0. 1.]] actual output should be: 1.0 net output: 0.9309966 loss: [0.00476147] input: [[0. 1.]] actual output should be: 1.0 net output: 0.93148065 loss: [0.0046949] input: [[1. 0.]] actual output should be: 1.0 net output: 0.9344236 loss: [0.00430026] input: [[0. 0.]] actual output should be: 0.0 net output: 0.108278975 loss: [0.01172434] input: [[1. 1.]] actual output should be: 1.0 net output: 0.9994425 loss: [3.1078645e-07] input: [[1. 0.]] actual output should be: 1.0 net output: 0.9390702 loss: [0.00371244] input: [[1. 1.]] actual output should be: 1.0 net output: 0.9995535 loss: [1.9936081e-07] input: [[1. 1.]] actual output should be: 1.0 net output: 0.99959034 loss: [1.6782354e-07] input: [[0. 1.]] actual output should be: 1.0 net output: 0.94351006 loss: [0.00319111] input: [[1. 0.]] actual output should be: 1.0 net output: 0.94535804 loss: [0.00298574] input: [[0. 1.]] actual output should be: 1.0 net output: 0.9476619 loss: [0.00273928] input: [[0. 1.]] actual output should be: 1.0 net output: 0.9494861 loss: [0.00255166] input: [[1. 0.]] actual output should be: 1.0 net output: 0.9475804 loss: [0.00274781] input: [[0. 0.]] actual output should be: 0.0 net output: 0.08541586 loss: [0.00729587] input: [[0. 0.]] actual output should be: 0.0 net output: 0.082992785 loss: [0.0068878] input: [[0. 0.]] actual output should be: 0.0 net output: 0.08331633 loss: [0.00694161] input: [[0. 1.]] actual output should be: 1.0 net output: 0.95129913 loss: [0.00237177] input: [[0. 1.]] actual output should be: 1.0 net output: 0.94989055 loss: [0.00251096] input: [[0. 0.]] actual output should be: 0.0 net output: 0.07586751 loss: [0.00575588] input: [[1. 1.]] actual output should be: 1.0 net output: 0.9997925 loss: [4.3049514e-08] input: [[1. 0.]] actual output should be: 1.0 net output: 0.95409477 loss: [0.00210729] input: [[1. 0.]] actual output should be: 1.0 net output: 0.95546055 loss: [0.00198376] input: [[1. 1.]] actual output should be: 1.0 net output: 0.9998254 loss: [3.0478876e-08] input: [[1. 0.]] actual output should be: 1.0 net output: 0.9571857 loss: [0.00183307] input: [[0. 1.]] actual output should be: 1.0 net output: 0.9546285 loss: [0.00205857] input: [[1. 
1.]] actual output should be: 1.0 net output: 0.99984837 loss: [2.2992936e-08] input: [[0. 1.]] actual output should be: 1.0 net output: 0.9558302 loss: [0.00195097] input: [[1. 0.]] actual output should be: 1.0 net output: 0.95921606 loss: [0.00166333] input: [[0. 1.]] actual output should be: 1.0 net output: 0.95750844 loss: [0.00180553] input: [[0. 0.]] actual output should be: 0.0 net output: 0.06605629 loss: [0.00436343] input: [[0. 1.]] actual output should be: 1.0 net output: 0.9582146 loss: [0.00174602] input: [[0. 0.]] actual output should be: 0.0 net output: 0.06398068 loss: [0.00409353] input: [[1. 1.]] actual output should be: 1.0 net output: 0.9998853 loss: [1.315135e-08] input: [[1. 0.]] actual output should be: 1.0 net output: 0.96160764 loss: [0.00147397] input: [[0. 0.]] actual output should be: 0.0 net output: 0.06089913 loss: [0.0037087]
Tensorflow¶
optim = tf.optimizers.SGD(learning_rate=0.33)
for i in range(5000):
in_v = tf.Variable((np.random.rand(1,2)>0.5).astype(np.float32))
out_v = tf.Variable(np.any(in_v).astype(np.float32))
with tf.GradientTape() as tape:
net_output = net_tf(in_v)
# We are manually computing the squared error loss.
# It is possible to use built in functions instead.
loss = tf.reduce_mean((net_output - out_v) ** 2)
loss_grad = tape.gradient(loss, net_tf.trainable_variables)
if i % 100 == 0:
print('input:', in_v.numpy(), 'actual output should be:', out_v.numpy(),'net output:', net_output.numpy()[0,0], 'loss:', loss.numpy())
# Take a gradient descent update
optim.apply_gradients(zip(loss_grad, net_tf.trainable_variables))
input: [[0. 0.]] actual output should be: 0.0 net output: 0.5 loss: 0.25 input: [[1. 1.]] actual output should be: 1.0 net output: 0.89627415 loss: 0.010759052 input: [[1. 1.]] actual output should be: 1.0 net output: 0.9577437 loss: 0.0017855945 input: [[1. 0.]] actual output should be: 1.0 net output: 0.82634145 loss: 0.030157292 input: [[1. 1.]] actual output should be: 1.0 net output: 0.98112774 loss: 0.00035616223 input: [[0. 0.]] actual output should be: 0.0 net output: 0.2614005 loss: 0.06833022 input: [[0. 0.]] actual output should be: 0.0 net output: 0.22208074 loss: 0.049319852 input: [[1. 1.]] actual output should be: 1.0 net output: 0.99333763 loss: 4.4387158e-05 input: [[1. 1.]] actual output should be: 1.0 net output: 0.99574536 loss: 1.8101955e-05 input: [[0. 1.]] actual output should be: 1.0 net output: 0.89214134 loss: 0.01163349 input: [[0. 0.]] actual output should be: 0.0 net output: 0.17270328 loss: 0.029826423 input: [[1. 0.]] actual output should be: 1.0 net output: 0.9107087 loss: 0.007972932 input: [[1. 1.]] actual output should be: 1.0 net output: 0.9982552 loss: 3.0443507e-06 input: [[1. 0.]] actual output should be: 1.0 net output: 0.9149189 loss: 0.007238793 input: [[1. 1.]] actual output should be: 1.0 net output: 0.9987037 loss: 1.6803466e-06 input: [[0. 1.]] actual output should be: 1.0 net output: 0.9218901 loss: 0.00610116 input: [[1. 1.]] actual output should be: 1.0 net output: 0.99904114 loss: 9.1941234e-07 input: [[0. 1.]] actual output should be: 1.0 net output: 0.9253377 loss: 0.0055744634 input: [[0. 0.]] actual output should be: 0.0 net output: 0.11312116 loss: 0.012796396 input: [[1. 0.]] actual output should be: 1.0 net output: 0.92976546 loss: 0.00493289 input: [[0. 1.]] actual output should be: 1.0 net output: 0.9326902 loss: 0.0045306087 input: [[0. 1.]] actual output should be: 1.0 net output: 0.9371491 loss: 0.003950235 input: [[1. 0.]] actual output should be: 1.0 net output: 0.93886304 loss: 0.0037377279 input: [[0. 1.]] actual output should be: 1.0 net output: 0.9345463 loss: 0.0042841877 input: [[1. 0.]] actual output should be: 1.0 net output: 0.9407276 loss: 0.0035132184 input: [[0. 1.]] actual output should be: 1.0 net output: 0.9391228 loss: 0.0037060338 input: [[1. 1.]] actual output should be: 1.0 net output: 0.9996106 loss: 1.5163013e-07 input: [[1. 1.]] actual output should be: 1.0 net output: 0.9996427 loss: 1.2764201e-07 input: [[0. 1.]] actual output should be: 1.0 net output: 0.94264394 loss: 0.0032897177 input: [[1. 1.]] actual output should be: 1.0 net output: 0.9996921 loss: 9.481324e-08 input: [[0. 1.]] actual output should be: 1.0 net output: 0.9484706 loss: 0.0026552796 input: [[1. 1.]] actual output should be: 1.0 net output: 0.99973613 loss: 6.9627255e-08 input: [[0. 0.]] actual output should be: 0.0 net output: 0.0808695 loss: 0.0065398766 input: [[0. 0.]] actual output should be: 0.0 net output: 0.079687946 loss: 0.0063501685 input: [[1. 0.]] actual output should be: 1.0 net output: 0.9511493 loss: 0.0023863923 input: [[0. 1.]] actual output should be: 1.0 net output: 0.950129 loss: 0.0024871193 input: [[1. 1.]] actual output should be: 1.0 net output: 0.9997973 loss: 4.1093532e-08 input: [[1. 1.]] actual output should be: 1.0 net output: 0.99981123 loss: 3.563332e-08 input: [[0. 0.]] actual output should be: 0.0 net output: 0.072211094 loss: 0.0052144425 input: [[1. 1.]] actual output should be: 1.0 net output: 0.9998307 loss: 2.8654767e-08 input: [[0. 
1.]] actual output should be: 1.0 net output: 0.95585364 loss: 0.001948901 input: [[1. 1.]] actual output should be: 1.0 net output: 0.99984765 loss: 2.3210362e-08 input: [[1. 1.]] actual output should be: 1.0 net output: 0.9998547 loss: 2.1116776e-08 input: [[1. 0.]] actual output should be: 1.0 net output: 0.9592119 loss: 0.0016636702 input: [[0. 0.]] actual output should be: 0.0 net output: 0.06742147 loss: 0.004545655 input: [[0. 1.]] actual output should be: 1.0 net output: 0.95896834 loss: 0.001683597 input: [[0. 1.]] actual output should be: 1.0 net output: 0.9591133 loss: 0.0016717223 input: [[0. 0.]] actual output should be: 0.0 net output: 0.06368894 loss: 0.004056281 input: [[0. 0.]] actual output should be: 0.0 net output: 0.062695764 loss: 0.003930759 input: [[0. 0.]] actual output should be: 0.0 net output: 0.062419973 loss: 0.003896253
Let us check the network parameters again
display(HTML("<strong>Parameters for the pytorch network:</strong>"))
param_pt = []
for a_param in net_pt.parameters():
    param_pt.append(a_param.data.numpy())
    print(a_param)

param_tf = []
display(HTML("<strong>Parameters for the Tensorflow network:</strong>"))
for a_param in net_tf.variables:
    param_tf.append(a_param.numpy())
    print(a_param.value())
Parameter containing:
tensor([[5.9829, 5.9479]], requires_grad=True)
Parameter containing:
tensor([-2.7361], requires_grad=True)
tf.Tensor(
[[5.9202266]
 [5.9310384]], shape=(2, 1), dtype=float32)
tf.Tensor([-2.7203953], shape=(1,), dtype=float32)
So, in this case, the decision surface (which, in a 2-d input space, is just a line) for both the trained pyTorch and Tensorflow models is:
$$ wX^T + b = 0 $$
In the following figures, the decision surface is plotted for both models. Have you noticed how the points (0,1), (1,0), and (1,1) lie on one side of the decision line while the point (0,0) lies on the other side? This is the working principle of a neural network used as a classifier. During the training process, the internal parameters are adjusted so that the different classes of datapoints end up on different sides of the surface. There are many techniques for optimizing the loss, adjusting the learning rate, and terminating the training loop of a neural network, but the core underlying principle is just the same.
import matplotlib.pyplot as plt
import numpy as np
delta = 0.002
xrange = np.arange(-1.0, 2.0, delta)
yrange = np.arange(-1.0, 2.0, delta)
X, Y = np.meshgrid(xrange,yrange)
plt.contour(X, Y, param_pt[0][0,0]*X + param_pt[0][0,1]*Y + np.array(param_pt[1]), [0])
plt.scatter([0,0,1,1], [0,1,0,1])
plt.grid('both')
plt.show()
display(HTML("Decision Boundary for the pyTorch Model"))
plt.contour(X, Y, param_tf[0][0,0]*X + param_tf[0][1,0]*Y + np.array(param_tf[1]), [0])
plt.scatter([0,0,1,1], [0,1,0,1])
plt.grid('both')
plt.show()
display(HTML("Decision Boundary for the Tensorflow Model"))
How about an xor gate?¶
At this point you might be wondering how the neuron would perform on an xor gate. There is no straight line in the 2-d space that can separate the positive class from the negative one in this case. Long ago, MIT Professor Marvin Minsky challenged the work of Frank Rosenblatt on the basis of this very problem, which contributed to an AI winter. Fortunately, now we know the solution. It is true that this problem is not solvable by a single neuron, but we can arbitrarily increase the number of neurons (by either stacking more Dense (Tensorflow) or Linear (pyTorch) layers, or by setting the output size of the Dense/Linear layers to more than 1). That way, the decision surface doesn’t have to be a straight line. By arbitrarily increasing the number of neurons (and hence the number of model parameters), it is possible to make the decision surface arbitrarily distorted; a minimal sketch of such a network follows this paragraph. Try playing with the number of neurons in the neural network playground for this problem.
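As a starting point for experimenting, here is a minimal sketch of such a wider network in pyTorch, following the same pattern as NNet above. The hidden width of 4 and the choice of sigmoid activations are arbitrary illustration choices, and the network still needs to be trained (e.g. with a loop like the one above, but with xor targets) before it can solve the problem:
import torch
import torch.nn as nn

class XorNet(nn.Module):
    def __init__(self):
        super().__init__()
        # A hidden layer gives the network enough capacity to bend
        # the decision surface; width 4 is an arbitrary choice
        self.hidden = nn.Linear(2, 4)
        self.out = nn.Linear(4, 1)

    def forward(self, x):
        h = torch.sigmoid(self.hidden(x))
        return torch.sigmoid(self.out(h))

net_xor = XorNet()
print(net_xor(torch.Tensor([[1., 0.]])))  # untrained output; train it as before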