MiniFlow
Outline:
What is a Neural Network
- A neural network is a graph of mathematical functions: linear combinations and activation functions.
- The graph consists of nodes and edges.
- Nodes in each layer (except for nodes in the input layer) perform mathematical functions using inputs from nodes in the previous layers.
- For example, a node could represent f(x, y) = x + y, where x and y are input values from nodes in the previous layer (a minimal sketch follows this list).
- Each node creates an output value which may be passed to nodes in the next layer.
- The output value from the output layer does not get passed on, since it is the last layer.
- Layers between the input layer and the output layer are called hidden layers.
- The edges in the graph describe the connections between the nodes, along which values flow from one layer to the next.
- These edges can also apply operations to the values that flow along them, such as multiplying by weights and adding biases.
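A minimal sketch of one such node, assuming a hypothetical add_node function (not part of MiniFlow):

def add_node(x, y):
    # A node that combines two upstream values into one output.
    return x + y

print(add_node(4, 5))  # 9, which could then flow to nodes in the next layer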
Graphs
- The nodes and edges create a graph structure.
- It isn't hard to imagine that increasingly complex graphs can calculate almost anything.
- There are generally two steps to creating neural networks (sketched below):
- Define the graph of nodes and edges.
- Propagate values through the graph.
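As a rough sketch of those two steps (the dict-based graph here is hypothetical, just for illustration; MiniFlow builds the real version below):

# Step 1: define the graph -- each node lists the nodes feeding into it.
graph = {'x': [], 'y': [], 'add': ['x', 'y']}

# Step 2: propagate values through the graph in dependency order.
values = {'x': 4, 'y': 5}
values['add'] = sum(values[n] for n in graph['add'])
print(values['add'])  # 9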
Gradient Descent
- Technically, the gradient actually points uphill, in the direction of steepest ascent.
- But if we put a minus (-) sign in front of this value, we get the direction of steepest descent, which is what we want.
- How much force should be applied to the push is known as the learning rate,
- which is an apt name since this value determines how quickly or slowly the neural network learns.
- This is more of a guessing game than anything else, but empirically values in the range 0.1 to 0.0001 work well.
- The range 0.001 to 0.0001 is popular, as 0.1 and 0.01 are sometimes too large.
Gradient-Descent-Convergence:
Gradient-Descent-Divergence:
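The two figures referenced above show gradient descent converging and diverging. A small sketch (hypothetical values, using f(x) = x**2) shows how the learning rate decides between the two:

def df(x):
    return 2*x  # derivative of f(x) = x**2

for learning_rate in (0.1, 1.1):
    x = 10.0
    for _ in range(5):
        x = x - learning_rate * df(x)
    print(learning_rate, x)
# 0.1 -> x ~ 3.28   (steps shrink toward the minimum: convergence)
# 1.1 -> x ~ -24.88 (steps overshoot and grow: divergence)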
Code:
"""
Given the starting point of any `x` gradient descent
should be able to find the minimum value of x for the
cost function `f` defined below.
"""
import random
from gd import gradient_descent_update
def f(x):
"""
Quadratic function.
It's easy to see the minimum value of the function
is 5 when is x=0.
"""
return x**2 + 5
def df(x):
"""
Derivative of `f` with respect to `x`.
"""
return 2*x
def gradient_descent_update(x, gradx, learning_rate):
"""
Performs a gradient descent update.
"""
# TODO: Implement gradient descent.
# Return the new value for x
return x - learning_rate * gradx
# Random number between 0 and 10,000. Feel free to set x whatever you like.
x = random.randint(0, 10000)
# TODO: Set the learning rate
learning_rate = 0.1
epochs = 100
for i in range(epochs+1):
cost = f(x)
gradx = df(x)
print("EPOCH {}: Cost = {:.3f}, x = {:.3f}".format(i, cost, gradx))
x = gradient_descent_update(x, gradx, learning_rate)
Output:
EPOCH 0: Cost = 2601774.000, gradx = 3226.000
EPOCH 1: Cost = 1665137.160, gradx = 2580.800
EPOCH 2: Cost = 1065689.582, gradx = 2064.640
EPOCH 3: Cost = 682043.133, gradx = 1651.712
EPOCH 4: Cost = 436509.405, gradx = 1321.370
EPOCH 5: Cost = 279367.819, gradx = 1057.096
EPOCH 6: Cost = 178797.204, gradx = 845.677
EPOCH 7: Cost = 114432.011, gradx = 676.541
EPOCH 8: Cost = 73238.287, gradx = 541.233
EPOCH 9: Cost = 46874.304, gradx = 432.986
EPOCH 10: Cost = 30001.354, gradx = 346.389
EPOCH 11: Cost = 19202.667, gradx = 277.111
EPOCH 12: Cost = 12291.507, gradx = 221.689
EPOCH 13: Cost = 7868.364, gradx = 177.351
EPOCH 14: Cost = 5037.553, gradx = 141.881
EPOCH 15: Cost = 3225.834, gradx = 113.505
EPOCH 16: Cost = 2066.334, gradx = 90.804
EPOCH 17: Cost = 1324.254, gradx = 72.643
EPOCH 18: Cost = 849.322, gradx = 58.114
EPOCH 19: Cost = 545.366, gradx = 46.492
EPOCH 20: Cost = 350.834, gradx = 37.193
EPOCH 21: Cost = 226.334, gradx = 29.755
EPOCH 22: Cost = 146.654, gradx = 23.804
EPOCH 23: Cost = 95.658, gradx = 19.043
EPOCH 24: Cost = 63.021, gradx = 15.234
EPOCH 25: Cost = 42.134, gradx = 12.187
EPOCH 26: Cost = 28.766, gradx = 9.750
EPOCH 27: Cost = 20.210, gradx = 7.800
EPOCH 28: Cost = 14.734, gradx = 6.240
EPOCH 29: Cost = 11.230, gradx = 4.992
EPOCH 30: Cost = 8.987, gradx = 3.994
EPOCH 31: Cost = 7.552, gradx = 3.195
EPOCH 32: Cost = 6.633, gradx = 2.556
EPOCH 33: Cost = 6.045, gradx = 2.045
EPOCH 34: Cost = 5.669, gradx = 1.636
EPOCH 35: Cost = 5.428, gradx = 1.309
EPOCH 36: Cost = 5.274, gradx = 1.047
EPOCH 37: Cost = 5.175, gradx = 0.838
EPOCH 38: Cost = 5.112, gradx = 0.670
EPOCH 39: Cost = 5.072, gradx = 0.536
EPOCH 40: Cost = 5.046, gradx = 0.429
...
Stochastic Gradient Descent
- Stochastic Gradient Descent (SGD) is a version of Gradient Descent
- where on each forward pass a batch of data is randomly sampled from the total dataset.
- Ideally, the entire dataset would be fed into the neural network on each forward pass,
- but in practice this is infeasible due to memory constraints.
- SGD is an approximation of Gradient Descent.
- The more batches processed by the neural network, the better the approximation.
A naïve implementation of SGD involves four steps (a sketch follows this list):
1. Randomly sample a batch of data from the total dataset.
2. Run the network forward and backward to calculate the gradient (with data from step 1).
3. Apply the gradient descent update.
4. Repeat steps 1-3 until convergence, or until the loop is stopped by another mechanism (e.g. the number of epochs).
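A minimal numpy sketch of that loop; grad_fn and update_fn are hypothetical stand-ins for the forward/backward pass and the parameter update that miniflow.py implements below:

import numpy as np

def sgd_loop(X, y, grad_fn, update_fn, batch_size=32, epochs=10):
    m = X.shape[0]
    for _ in range(epochs):
        for _ in range(m // batch_size):
            # 1. Randomly sample a batch of data from the total dataset.
            idx = np.random.choice(m, size=batch_size, replace=False)
            # 2. Forward and backward pass to calculate the gradient.
            grad = grad_fn(X[idx], y[idx])
            # 3. Apply the gradient descent update.
            update_fn(grad)
        # 4. The outer loop repeats steps 1-3 for a fixed number of epochs.

# Usage would pass real data plus callbacks, e.g.
# sgd_loop(X_train, y_train, grad_fn, update_fn).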
Code: miniflow.py
import numpy as np


class Node:
    """
    Base class for nodes in the network.

    Arguments:

        `inbound_nodes`: A list of nodes with edges into this node.
    """
    def __init__(self, inbound_nodes=[]):
        """
        Node's constructor (runs when the object is instantiated). Sets
        properties that all nodes need.
        """
        # A list of nodes with edges into this node.
        self.inbound_nodes = inbound_nodes
        # The eventual value of this node. Set by running
        # the forward() method.
        self.value = None
        # A list of nodes that this node outputs to.
        self.outbound_nodes = []
        # Keys are the inputs to this node and their values are
        # the partials of this node with respect to that input.
        self.gradients = {}
        # Sets this node as an outbound node for all of
        # this node's inputs.
        for node in inbound_nodes:
            node.outbound_nodes.append(self)

    def forward(self):
        """
        Every node that uses this class as a base class will
        need to define its own `forward` method.
        """
        raise NotImplementedError

    def backward(self):
        """
        Every node that uses this class as a base class will
        need to define its own `backward` method.
        """
        raise NotImplementedError
class Input(Node):
    """
    A generic input into the network.
    """
    def __init__(self):
        # The base class constructor has to run to set all
        # the properties here.
        #
        # The most important property on an Input is value.
        # self.value is set during `topological_sort` later.
        Node.__init__(self)

    def forward(self):
        # Do nothing because nothing is calculated.
        pass

    def backward(self):
        # An Input node has no inputs, so the gradient (derivative)
        # is initialized to zero.
        # The key, `self`, is a reference to this object.
        self.gradients = {self: 0}
        # Weights and bias may be inputs, so you need to sum
        # the gradient from output gradients.
        for n in self.outbound_nodes:
            self.gradients[self] += n.gradients[self]
class Linear(Node):
    """
    Represents a node that performs a linear transform.
    """
    def __init__(self, X, W, b):
        # The base class (Node) constructor. Weights and bias
        # are treated like inbound nodes.
        Node.__init__(self, [X, W, b])

    def forward(self):
        """
        Performs the math behind a linear transform.
        """
        X = self.inbound_nodes[0].value
        W = self.inbound_nodes[1].value
        b = self.inbound_nodes[2].value
        self.value = np.dot(X, W) + b

    def backward(self):
        """
        Calculates the gradient based on the output values.
        """
        # Initialize a partial for each of the inbound_nodes.
        self.gradients = {n: np.zeros_like(n.value) for n in self.inbound_nodes}
        # Cycle through the outputs. The gradient will change depending
        # on each output, so the gradients are summed over all outputs.
        for n in self.outbound_nodes:
            # Get the partial of the cost with respect to this node.
            grad_cost = n.gradients[self]
            # Set the partial of the loss with respect to this node's inputs.
            self.gradients[self.inbound_nodes[0]] += np.dot(grad_cost, self.inbound_nodes[1].value.T)
            # Set the partial of the loss with respect to this node's weights.
            self.gradients[self.inbound_nodes[1]] += np.dot(self.inbound_nodes[0].value.T, grad_cost)
            # Set the partial of the loss with respect to this node's bias.
            self.gradients[self.inbound_nodes[2]] += np.sum(grad_cost, axis=0, keepdims=False)
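# Shape bookkeeping for Linear.backward (illustrative): if X is (m, n_in),
# W is (n_in, n_out), and grad_cost is (m, n_out), then
#   np.dot(grad_cost, W.T) -> (m, n_in),     matching X,
#   np.dot(X.T, grad_cost) -> (n_in, n_out), matching W,
#   np.sum(grad_cost, 0)   -> (n_out,),      matching b.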
class Sigmoid(Node):
    """
    Represents a node that performs the sigmoid activation function.
    """
    def __init__(self, node):
        # The base class constructor.
        Node.__init__(self, [node])

    def _sigmoid(self, x):
        """
        This method is separate from `forward` because it
        will be used with `backward` as well.

        `x`: A numpy array-like object.
        """
        return 1. / (1. + np.exp(-x))

    def forward(self):
        """
        Perform the sigmoid function and set the value.
        """
        input_value = self.inbound_nodes[0].value
        self.value = self._sigmoid(input_value)

    def backward(self):
        """
        Calculates the gradient using the derivative of
        the sigmoid function.
        """
        # Initialize the gradients to 0.
        self.gradients = {n: np.zeros_like(n.value) for n in self.inbound_nodes}
        # Sum the partial with respect to the input over all the outputs.
        for n in self.outbound_nodes:
            grad_cost = n.gradients[self]
            sigmoid = self.value
            self.gradients[self.inbound_nodes[0]] += sigmoid * (1 - sigmoid) * grad_cost
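# Derivative used above (illustrative): with s = sigmoid(x),
# ds/dx = s * (1 - s), so the chain rule gives s * (1 - s) * grad_cost.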
class MSE(Node):
    def __init__(self, y, a):
        """
        The mean squared error cost function.
        Should be used as the last node for a network.
        """
        # Call the base class' constructor.
        Node.__init__(self, [y, a])

    def forward(self):
        """
        Calculates the mean squared error.
        """
        # NOTE: We reshape these to avoid possible matrix/vector broadcast
        # errors.
        #
        # For example, if we subtract an array of shape (3,) from an array
        # of shape (3,1) we get an array of shape (3,3) as the result when
        # we want an array of shape (3,1) instead.
        #
        # Making both arrays (3,1) ensures the result is (3,1) and does
        # an elementwise subtraction as expected.
        y = self.inbound_nodes[0].value.reshape(-1, 1)
        a = self.inbound_nodes[1].value.reshape(-1, 1)

        self.m = self.inbound_nodes[0].value.shape[0]
        # Save the computed output for backward.
        self.diff = y - a
        self.value = np.mean(self.diff**2)

    def backward(self):
        """
        Calculates the gradient of the cost.
        """
        self.gradients[self.inbound_nodes[0]] = (2 / self.m) * self.diff
        self.gradients[self.inbound_nodes[1]] = (-2 / self.m) * self.diff
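# Derivative check for MSE.backward (illustrative): with C = mean((y - a)**2),
# dC/dy = (2/m) * (y - a) and dC/da = -(2/m) * (y - a), matching the code above.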
def topological_sort(feed_dict):
    """
    Sort the nodes in topological order using Kahn's Algorithm.

    `feed_dict`: A dictionary where the key is an `Input` Node and the
    value is the respective value fed to that Node.

    Returns a list of sorted nodes.
    """
    input_nodes = [n for n in feed_dict.keys()]

    G = {}
    nodes = [n for n in input_nodes]
    while len(nodes) > 0:
        n = nodes.pop(0)
        if n not in G:
            G[n] = {'in': set(), 'out': set()}
        for m in n.outbound_nodes:
            if m not in G:
                G[m] = {'in': set(), 'out': set()}
            G[n]['out'].add(m)
            G[m]['in'].add(n)
            nodes.append(m)

    L = []
    S = set(input_nodes)
    while len(S) > 0:
        n = S.pop()

        if isinstance(n, Input):
            n.value = feed_dict[n]

        L.append(n)
        for m in n.outbound_nodes:
            G[n]['out'].remove(m)
            G[m]['in'].remove(n)
            # If there are no other incoming edges, add to S.
            if len(G[m]['in']) == 0:
                S.add(m)
    return L
def forward_and_backward(graph):
    """
    Performs a forward pass and a backward pass through a list of sorted Nodes.

    Arguments:

        `graph`: The result of calling `topological_sort`.
    """
    # Forward pass
    for n in graph:
        n.forward()

    # Backward pass
    # see: https://docs.python.org/2.3/whatsnew/section-slices.html
    for n in graph[::-1]:
        n.backward()
def sgd_update(trainables, learning_rate=1e-2):
    """
    Updates the value of each trainable with SGD.

    Arguments:

        `trainables`: A list of `Input` Nodes representing weights/biases.
        `learning_rate`: The learning rate.
    """
    # Performs SGD
    #
    # Loop over the trainables
    for t in trainables:
        # Change the trainable's value by subtracting the learning rate
        # multiplied by the partial of the cost with respect to this
        # trainable.
        partial = t.gradients[t]
        t.value -= learning_rate * partial
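Before nn.py, here is a quick smoke test of the classes above (a sketch; the toy arrays are made up for illustration):

import numpy as np
from miniflow import *

X, y = Input(), Input()
W, b = Input(), Input()
cost = MSE(y, Linear(X, W, b))

feed_dict = {X: np.array([[1., 2.]]), y: np.array([3.]),
             W: np.array([[0.5], [0.5]]), b: np.array([0.])}
graph = topological_sort(feed_dict)
forward_and_backward(graph)
print(cost.value)      # 2.25, the MSE between y and np.dot(X, W) + b
print(W.gradients[W])  # [[-3.], [-6.]], partial of the cost w.r.t. W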
Code: nn.py
"""
Have fun with the number of epochs!
Be warned that if you increase them too much,
the VM will time out :)
"""
import numpy as np
from sklearn.datasets import load_boston
from sklearn.utils import shuffle, resample
from miniflow import *
# Load data
data = load_boston()
X_ = data['data']
y_ = data['target']
# Normalize data
X_ = (X_ - np.mean(X_, axis=0)) / np.std(X_, axis=0)
n_features = X_.shape[1]
n_hidden = 10
W1_ = np.random.randn(n_features, n_hidden)
b1_ = np.zeros(n_hidden)
W2_ = np.random.randn(n_hidden, 1)
b2_ = np.zeros(1)
# Neural network
X, y = Input(), Input()
W1, b1 = Input(), Input()
W2, b2 = Input(), Input()
l1 = Linear(X, W1, b1)
s1 = Sigmoid(l1)
l2 = Linear(s1, W2, b2)
cost = MSE(y, l2)
feed_dict = {
X: X_,
y: y_,
W1: W1_,
b1: b1_,
W2: W2_,
b2: b2_
}
epochs = 10
# Total number of examples
m = X_.shape[0]
batch_size = 11
steps_per_epoch = m // batch_size
graph = topological_sort(feed_dict)
trainables = [W1, b1, W2, b2]
print("Total number of examples = {}".format(m))
# Step 4
for i in range(epochs):
loss = 0
for j in range(steps_per_epoch):
# Step 1
# Randomly sample a batch of examples
X_batch, y_batch = resample(X_, y_, n_samples=batch_size)
# Reset value of X and y Inputs
X.value = X_batch
y.value = y_batch
# Step 2
forward_and_backward(graph)
# Step 3
sgd_update(trainables)
loss += graph[-1].value
print("Epoch: {}, Loss: {:.3f}".format(i+1, loss/steps_per_epoch))
Output:
Total number of examples = 506
Epoch: 1, Loss: 137.353
Epoch: 2, Loss: 38.041
Epoch: 3, Loss: 30.666
Epoch: 4, Loss: 24.717
Epoch: 5, Loss: 23.817
...
Epoch: 994, Loss: 3.993
Epoch: 995, Loss: 3.549
Epoch: 996, Loss: 4.550
Epoch: 997, Loss: 3.830
Epoch: 998, Loss: 3.975
Epoch: 999, Loss: 3.160
Epoch: 1000, Loss: 3.711
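Once training finishes, predictions only need the forward half of forward_and_backward. A sketch, reusing the graph and nodes from nn.py above:

# Run only the forward pass to get predictions for a few examples.
X.value = X_[:5]
y.value = y_[:5]
for n in graph:
    n.forward()
print(l2.value)  # predicted values for the first five examples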