PyTorch is a deep learning research platform that provides maximum flexibility and speed.
Difference between TensorFlow and PyTorch
The most important difference between the two frameworks is how they define computational graphs. TensorFlow builds a static graph, while PyTorch uses a dynamic graph. What does this mean? In TensorFlow, you first have to define the entire computation graph of the model and then run your ML model. In PyTorch, you can define and manipulate the graph on the go, which is particularly helpful when working with variable-length inputs in RNNs.
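As a small illustration of what a dynamic graph buys you, ordinary Python control flow can depend on the data itself and autograd still tracks it. The loop below is a toy sketch, not part of the original comparison:

import torch

x = torch.randn(3, requires_grad=True)
y = x
while y.norm() < 10:       # the number of iterations depends on the values in x
    y = y * 2
y.sum().backward()         # the graph that was built on the fly is differentiated
print(x.grad)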
Tensors
from __future__ import print_function
import torch

x = torch.empty(5, 3)
x
# out
tensor([[1.1210e-44, 0.0000e+00, 0.0000e+00],
        [0.0000e+00, 0.0000e+00, 0.0000e+00],
        [0.0000e+00, 0.0000e+00, 0.0000e+00],
        [0.0000e+00, 0.0000e+00, 0.0000e+00],
        [0.0000e+00, 0.0000e+00, 0.0000e+00]])
x = torch.rand(5, 3)
x
# out
tensor([[0.9122, 0.0691, 0.9595],
        [0.2535, 0.0617, 0.5030],
        [0.3705, 0.4274, 0.8880],
        [0.0304, 0.0172, 0.9135],
        [0.9683, 0.9874, 0.5131]])
x = torch.ones(2, 2, requires_grad=True)
x
# out
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)

# use a tensor operation
y = x + 2
y
# out
tensor([[3., 3.],
        [3., 3.]], grad_fn=<AddBackward0>)
y.grad_fn
# out
<AddBackward0 at 0x7fac10f9d390>
# more operations
z = y * y * 3
out = z.mean()
out
# out
tensor(27., grad_fn=<MeanBackward0>)
# requires_grad defaults to False
a = torch.randn(2, 2)
a = ((a * 3) / (a - 1))
print(a.requires_grad)
a.requires_grad_(True)
print(a.requires_grad)
b = (a * a).sum()
print(b.grad_fn)
# out
False
True
<SumBackward0 object at 0x7fac11114940>
Gradients
Now for backprop. Because out contains a single scalar, out.backward() is equivalent to out.backward(torch.tensor(1.)).
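Continuing the example above (out was computed from x through y = x + 2 and z = y * y * 3), running the backward pass fills in x.grad with d(out)/dx, which works out to 4.5 for every element:

out.backward()
print(x.grad)
# out
tensor([[4.5000, 4.5000],
        [4.5000, 4.5000]])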
We can now use autograd to define models and differentiate them; the nn package depends on autograd for this. An nn.Module contains layers and a forward(input) method that returns the output.
A simple feed-forward network takes an input, feeds it through several layers one after the other, and finally produces an output.
A typical training procedure for a neural network is as follows:
Define the neural network that has some learnable parameters (or weights)
Iterate over a dataset of inputs
Process input through the network
Compute the loss (how far is the output from being correct)
Propagate gradients back into the network’s parameters
Update the weights of the network, typically using a simple update rule: weight = weight - learning_rate * gradient
A network of this kind can be defined as follows (the layer definitions in __init__ are filled in here so the code runs; they are chosen to be consistent with the 16 * 5 * 5 flattening and the 10-dimensional output used below):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels, 5x5 square convolution kernel
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        # affine operations: y = Wx + b
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the size is a square you can only specify a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features

net = Net()
net
You only have to define the forward function; the backward function (where gradients are computed) is automatically defined for you by autograd. You can use any of the Tensor operations in the forward function.
The learnable parameters of a model are returned by net.parameters()
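For example (the parameter count and weight size below assume the conv1 layer defined in the network above):

params = list(net.parameters())
print(len(params))
print(params[0].size())  # conv1's .weight
# out
10
torch.Size([6, 1, 5, 5])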
input = torch.randn(1, 1, 32, 32)
out = net(input)
print(out)

net.zero_grad()
out.backward(torch.randn(1, 10))
Loss Function
output = net(input)
target = torch.randn(10)      # a dummy target, for example
target = target.view(1, -1)   # make it the same shape as output
criterion = nn.MSELoss()

loss = criterion(output, target)
print(loss)
# out
tensor(0.8399, grad_fn=<MseLossBackward>)

print(loss.grad_fn)  # MSELoss
print(loss.grad_fn.next_functions[0][0])  # Linear
print(loss.grad_fn.next_functions[0][0].next_functions[0][0])  # ReLU
# out
<MseLossBackward object at 0x7fac110ef518>
<AddmmBackward object at 0x7fac110d9b00>
<AccumulateGrad object at 0x7fac110d97f0>
Backprop
net.zero_grad()  # zeroes the gradient buffers of all parameters

print('conv1.bias.grad before backward')
print(net.conv1.bias.grad)

loss.backward()

print('conv1.bias.grad after backward')
print(net.conv1.bias.grad)

# out
conv1.bias.grad before backward
tensor([0., 0., 0., 0., 0., 0.])
conv1.bias.grad after backward
tensor([ 0.0171, -0.0057, -0.0081,  0.0024,  0.0039,  0.0138])
Update the weights
The simplest update rule used in practice is Stochastic Gradient Descent (SGD):
weight = weight - learning_rate * gradient
We can implement this using simple Python code:
learning_rate = 0.01
for f in net.parameters():
    f.data.sub_(f.grad.data * learning_rate)
In practice, however, we often want to use other update rules such as Nesterov-SGD, Adam, RMSProp, and so on. The torch.optim package implements all of these methods.
import torch.optim as optim

# create your optimizer
optimizer = optim.SGD(net.parameters(), lr=0.01)

# in your training loop:
optimizer.zero_grad()   # zero the gradient buffers
output = net(input)
loss = criterion(output, target)
loss.backward()
optimizer.step()        # does the update
Training A Classifier
For any kind of data, such as image, text, audio, or video data, we can use standard Python packages that load the data into a NumPy array. We can then convert this array into a torch.*Tensor.
For images, packages such as Pillow, OpenCV are useful
For audio, packages such as scipy and librosa are useful
For text, either raw Python or Cython based loading, or NLTK and SpaCy are useful
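For instance, an array produced by any of these loaders can be converted with torch.from_numpy; the array shape below is just an illustrative example:

import numpy as np
import torch

img = np.random.rand(32, 32, 3)   # stand-in for an image loaded as a NumPy array
t = torch.from_numpy(img)         # shares memory with the underlying NumPy array
t = t.permute(2, 0, 1).float()    # reorder to channels-first and convert to float32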
The images in CIFAR-10 are of size 3x32x32, i.e. 3-channel color images of 32x32 pixels in size.
We will do the following steps in order:
Load and normalize the CIFAR-10 training and test datasets using torchvision
Define a Convolutional Neural Network
Define a loss function
Train the network on the training data
Test the network on the test data
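As a sketch of the first step, the data can be loaded and normalized with torchvision as below; the batch size and the (0.5, 0.5, 0.5) normalization constants are the usual tutorial choices, not requirements:

import torch
import torchvision
import torchvision.transforms as transforms

# convert PIL images to tensors and scale them to the [-1, 1] range
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4,
                                         shuffle=False, num_workers=2)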
The classifier network is a small CNN similar to the one above, but with 3 input channels; the __init__ below is filled in so the code runs and matches the forward pass that was given:

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

net = Net()
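Putting the remaining steps together, a training loop might look like the following sketch; the loss function, optimizer settings, epoch count, and logging interval are the standard CIFAR-10 tutorial choices and are assumptions here:

import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

for epoch in range(2):  # loop over the dataset multiple times
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data              # get a mini-batch from the DataLoader
        optimizer.zero_grad()              # zero the parameter gradients
        outputs = net(inputs)              # forward pass
        loss = criterion(outputs, labels)  # compute the loss
        loss.backward()                    # backward pass
        optimizer.step()                   # update the weights

        running_loss += loss.item()
        if i % 2000 == 1999:               # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' % (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0

print('Finished Training')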