<div style="text-align: right"> 03 July, 2017 </div>

<img src="https://wiki.tum.de/download/attachments/25009442/tensor-flow_opengraph_h.png?version=1&modificationDate=1485888308193&api=v2" style="float: center; width: 50%; margin-bottom: 0.5em;">

# TensorFlow Tutorial

### by Anne Peter (anne.peter@uni-weimar.de)

---

## Addition to the VGG16 Network

<img src="http://www.cc.gatech.edu/~hays/compvision/proj6/deepNetVis.png">

<img src="https://www.cs.toronto.edu/~frossard/post/vgg16/vgg16.png">

### In Keras

```python
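# Keras 1.x API; input_shape=(3,224,224) assumes channels-first image ordering
from keras.models import Sequential
from keras.layers import (ZeroPadding2D, Convolution2D, MaxPooling2D,
                          Flatten, Dense, Dropout)

model = Sequential()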

model.add(ZeroPadding2D((1,1),input_shape=(3,224,224)))
model.add(Convolution2D(64, 3, 3, activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(64, 3, 3, activation='relu'))
model.add(MaxPooling2D((2,2), strides=(2,2)))

model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(128, 3, 3, activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(128, 3, 3, activation='relu'))
model.add(MaxPooling2D((2,2), strides=(2,2)))

model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(256, 3, 3, activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(256, 3, 3, activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(256, 3, 3, activation='relu'))
model.add(MaxPooling2D((2,2), strides=(2,2)))

model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(512, 3, 3, activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(512, 3, 3, activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(512, 3, 3, activation='relu'))
model.add(MaxPooling2D((2,2), strides=(2,2)))

model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(512, 3, 3, activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(512, 3, 3, activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(512, 3, 3, activation='relu'))
model.add(MaxPooling2D((2,2), strides=(2,2)))

model.add(Flatten())
model.add(Dense(4096, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(4096, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1000, activation='softmax'))
```
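
To see why the first `Dense` layer receives the number of inputs it does, the spatial size can be traced through the five blocks by hand. This is only a sketch of the arithmetic, assuming a 224x224 input, 3x3 convolutions with one pixel of zero padding, and 2x2 pooling with stride 2, as in the listing above; the variable names are made up for illustration.

```python
# Trace the spatial size of the feature maps through the five VGG16 blocks.
# Assumption: 3x3 convolutions with padding 1 keep the size,
# 2x2 max pooling with stride 2 halves it.
size = 224
filters_per_block = [64, 128, 256, 512, 512]

for filters in filters_per_block:
    conv_size = size + 2 * 1 - 3 + 1   # padding 1, kernel 3, stride 1
    size = conv_size // 2              # 2x2 pooling, stride 2
    print("filters=%d  after conv: %d  after pool: %d" % (filters, conv_size, size))

# Flatten() turns the last feature maps into one long vector.
flattened = filters_per_block[-1] * size * size
print("inputs to the first Dense layer:", flattened)
```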

# What is TensorFlow?

"TensorFlow is an open source software library for machine intelligence and neural networks." - https://www.tensorflow.org/

## 1. Installing TensorFlow

Follow the installation guide on the official TensorFlow site: https://www.tensorflow.org/install/

- available on Ubuntu, Windows and Mac OS X
- Note: <a href="https://www.tensorflow.org/install/install_mac">As of version 1.2, TensorFlow no longer provides GPU support on Mac OS X.</a>

TensorFlow with CPU support:
- always possible, but slow

TensorFlow with GPU support:
- you must have a supported NVIDIA graphics card with CUDA Compute Capability 3.0 or higher
- list of supported NVIDIA GPUs: https://developer.nvidia.com/cuda-gpus
- requires CUDA Toolkit 8.0
- requires cuDNN (CUDA Deep Neural Network library) v5.1 (you must sign up for that)
- typically faster

Don't forget to set the environment variables LD_LIBRARY_PATH and CUDA_HOME.
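
A quick way to check whether the two variables are visible to Python is sketched below. The helper name `missing_cuda_vars` is made up for this example, and the check only tells you whether the variables are set, not whether they point to a working CUDA installation.

```python
import os

def missing_cuda_vars(env=os.environ):
    """Return the names of the two variables that are unset or empty."""
    required = ["LD_LIBRARY_PATH", "CUDA_HOME"]
    return [name for name in required if not env.get(name)]

missing = missing_cuda_vars()
if missing:
    print("Not set:", ", ".join(missing))
else:
    print("LD_LIBRARY_PATH and CUDA_HOME are both set.")
```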

## 2. Running TensorFlow

### Short Test Program

If you have installed TensorFlow in a virtual environment, you need to activate that environment first.<br>
A closer look: https://www.tensorflow.org/versions/r0.10/get_started/os_setup

#### In case of Anaconda

Type the following in your terminal:
```
source activate tensorflow
```

Your prompt will change to:
```
(tensorflow) yourName@yourPc:~$
```

#### Run Python

Type in your terminal (or in your virtual environment):
```
python3
```

Then:
```python
>>> import tensorflow as tf
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
>>> print(sess.run(hello))
```

The output should be the following (under Python 3 the string is shown as a bytes literal, `b'Hello, TensorFlow!'`):
```
Hello, TensorFlow!
```

## 3. ANNs again

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Colored_neural_network.svg/300px-Colored_neural_network.svg.png">

### Bias

A neuron receives input from other units or from an external source and computes an output. Each input has an associated weight (w), which describes how strong its influence is. The node applies a function f (the activation function) to the weighted sum of its inputs as shown below:

<img src="https://ujwlkarn.files.wordpress.com/2016/08/screen-shot-2016-08-09-at-3-42-21-am.png?w=568&h=303">

The above network takes numerical inputs X1 and X2 and has weights w1 and w2 associated with those inputs. Additionally, there is another input 1 with weight b (called the Bias) associated with it.
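
As a minimal sketch, the computation of this single neuron can be written in a few lines of plain Python. The sigmoid is assumed as the activation function f, and the function name and the example numbers are chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(x1, x2, w1, w2, b, f=sigmoid):
    """Apply the activation f to the weighted sum w1*x1 + w2*x2 + b*1."""
    return f(w1 * x1 + w2 * x2 + b * 1.0)

# Example values, chosen only for illustration.
print(neuron_output(x1=1.0, x2=2.0, w1=0.5, w2=-0.25, b=0.1))
```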

Importance of Bias: The main function of Bias is to provide every node with a trainable constant value (in addition to the normal inputs that the node receives).

In effect, a bias value allows you to shift the activation function to the left or right, which may be critical for successful learning.

It might help to look at a simple example. Consider this 1-input, 1-output network that has no bias:

<img src="https://i.stack.imgur.com/bI2Tm.gif">

The output of the network is computed by multiplying the input (x) by the weight (w0) and passing the result through some kind of activation function (e.g. a sigmoid function.)

Here is the function that this network computes for various values of w0:

<img src="https://i.stack.imgur.com/ddyfr.png">

Changing the weight w0 essentially changes the "steepness" of the sigmoid. That's useful, but what if you wanted the network to output 0 when x is 2? Just changing the steepness of the sigmoid won't really work -- you want to be able to shift the entire curve to the right.

That's exactly what the bias allows you to do. If we add a bias to that network, like so:

<img src="https://i.stack.imgur.com/oapHD.gif">

Then the output of the network becomes sig(w0*x + w1*1.0). Here is what the output of the network looks like for various values of w1:

<img src="https://i.stack.imgur.com/t2mC3.png">

Having a weight of -5 for w1 shifts the curve to the right, which allows us to have a network that outputs 0 when x is 2.
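
The shift can be checked numerically. The sketch below assumes an input weight w0 of 1.0 (the text does not state it) and evaluates sig(w0*x + w1*1.0) at x = 2 for three bias weights. With w1 = -5 the output drops below 0.05, while without a bias (w1 = 0) it stays close to 0.9.

```python
import math

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

x = 2.0
w0 = 1.0  # assumed input weight, not given in the text

for w1 in (-5.0, 0.0, 5.0):
    print("w1 = %+.0f  ->  output at x=2: %.3f" % (w1, sig(w0 * x + w1 * 1.0)))
```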

## 4. Deeper into TensorFlow

### Tensors

The central unit of data in TensorFlow is the tensor. A tensor consists of a set of primitive values shaped into an array of any number of dimensions. A tensor's rank is its number of dimensions. Here are some examples of tensors:

```python
[1, 2, 3] # rank = 1; this is a vector with 3 elements
[[1, 2, 3], [4, 5, 6]] # rank = 2; a matrix with shape [2, 3] 
[[[1, 2, 3]], [[7, 8, 9]]] # rank = 3; tensor with shape [2, 1, 3]
```

$$
\begin{pmatrix}
  1 \\
  2 \\
  3
\end{pmatrix}
$$

$$
\begin{pmatrix}
  1 & 2 & 3 \\
  4 & 5 & 6
\end{pmatrix}
$$
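
Rank and shape can also be worked out for plain nested Python lists. The helpers `shape` and `rank` below are hypothetical names, not TensorFlow functions, and they assume regular, non-empty nesting.

```python
def shape(t):
    """Shape of a regular nested list, e.g. [[1, 2, 3], [4, 5, 6]] -> [2, 3]."""
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0]          # assumes every level is non-empty and regular
    return dims

def rank(t):
    """The rank is the number of dimensions, i.e. the length of the shape."""
    return len(shape(t))

for t in ([1, 2, 3], [[1, 2, 3], [4, 5, 6]], [[[1, 2, 3]], [[7, 8, 9]]]):
    print("rank", rank(t), "shape", shape(t))
```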

<img src="https://upload.wikimedia.org/wikipedia/commons/7/71/Epsilontensor.svg" style="width: 30%;">

<br>
<img src="https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/Images/Tensor_2.png">

### The Computational Graph (Linear Regression)

You might think of TensorFlow Core programs as consisting of two discrete sections:

  1. Building the computational graph.
  2. Running the computational graph.

A computational graph is a series of TensorFlow operations arranged into a graph of nodes. Let's build a simple computational graph. Each node takes zero or more tensors as inputs and produces a tensor as an output. One type of node is a constant. Like all TensorFlow constants, it takes no inputs, and it outputs a value it stores internally. We can create two floating point Tensors node1 and node2 as follows:

```python
import tensorflow as tf

node1 = tf.constant(3.0, dtype=tf.float32)
node2 = tf.constant(4.0) # also tf.float32 implicitly
print(node1, node2)
```

```
Tensor("Const:0", shape=(), dtype=float32) Tensor("Const_1:0", shape=(), dtype=float32)
```


Notice that printing the nodes does not output the values 3.0 and 4.0 as you might expect. Instead, they are nodes that, when evaluated, would produce 3.0 and 4.0, respectively. To actually evaluate the nodes, we must run the computational graph within a session. A session encapsulates the control and state of the TensorFlow runtime.

The following code creates a Session object and then invokes its run method to run enough of the computational graph to evaluate node1 and node2:

```python
sess = tf.Session()
print(sess.run([node1, node2]))
```

```
[3.0, 4.0]
```
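
The split between building and running can be mimicked in plain Python. The toy `Node` class below is only an analogy with made-up names and says nothing about how TensorFlow is implemented internally. It shows that printing a node gives a description, while a separate `run` call produces the value.

```python
class Node:
    """A deferred computation: stores how to compute a value, not the value."""
    def __init__(self, op, inputs=(), value=None):
        self.op, self.inputs, self.value = op, inputs, value

    def __repr__(self):
        return "Node(op=%r)" % self.op

def constant(value):
    return Node("const", value=value)

def add(a, b):
    return Node("add", inputs=(a, b))

def run(node):
    """Evaluate a node by first evaluating the nodes it depends on."""
    if node.op == "const":
        return node.value
    if node.op == "add":
        return run(node.inputs[0]) + run(node.inputs[1])
    raise ValueError("unknown op: %r" % node.op)

# Phase 1: build the graph. Nothing is computed yet.
n1 = constant(3.0)
n2 = constant(4.0)
n3 = add(n1, n2)
print(n3)        # a node description, not a number

# Phase 2: run the graph.
print(run(n3))
```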


We can build more complicated computations by combining Tensor nodes with operations (operations are also nodes). For example, we can add our two constant nodes and produce a new graph as follows:

```python
node3 = tf.add(node1, node2)
print("node3: ", node3)
print("sess.run(node3): ",sess.run(node3))
```

```
node3:  Tensor("Add:0", shape=(), dtype=float32)
sess.run(node3):  7.0
```


TensorFlow provides a utility called TensorBoard that can display a picture of the computational graph. Here is a screenshot showing how TensorBoard visualizes the graph:

<img src="https://www.tensorflow.org/images/getting_started_add.png">

As it stands, this graph is not especially interesting because it always produces a constant result. A graph can be parameterized to accept external inputs, known as placeholders. A placeholder is a promise to provide a value later.

```python
a = tf.placeholder(tf.float32)
b = tf.placeholder(tf.float32)
adder_node = a + b  # + provides a shortcut for tf.add(a, b)
```

The preceding three lines are a bit like a function or a lambda in which we define two input parameters (a and b) and then an operation on them. We can evaluate this graph with multiple inputs by passing a feed dictionary to sess.run that maps each placeholder to a concrete value:

```python
print(sess.run(adder_node, {a: 3, b: 4.5}))
print(sess.run(adder_node, {a: [1,3], b: [2, 4]}))
```

```
7.5
[ 3.  7.]
```


In TensorBoard, the graph looks like this:

<img src="https://www.tensorflow.org/images/getting_started_adder.png">

We can make the computational graph more complex by adding another operation.

```python
add_and_triple = adder_node * 3.
print(sess.run(add_and_triple, {a: 3, b: 4.5}))
```

```
22.5
```


The computational graph would look as follows in TensorBoard:

<img src="https://www.tensorflow.org/images/getting_started_triple.png">

In machine learning we will typically want a model that can take arbitrary inputs, such as the one above. To make the model trainable, we need to be able to modify the graph to get new outputs with the same input. Variables allow us to add trainable parameters to a graph. They are constructed with a type and initial value:

```python
W = tf.Variable([.3], dtype=tf.float32)
b = tf.Variable([-.3], dtype=tf.float32)
x = tf.placeholder(tf.float32)
linear_model = W * x + b
```

Constants are initialized when you call tf.constant, and their value can never change. By contrast, variables are not initialized when you call tf.Variable. To initialize all the variables in a TensorFlow program, you must explicitly call a special operation as follows:

```python
init = tf.global_variables_initializer()
sess.run(init)
```

It is important to realize that `init` is only a handle to the operation that initializes the variables.<br>
Until we call sess.run(init), the variables are uninitialized.

Since x is a placeholder, we can evaluate linear_model for several values of x simultaneously as follows:

```python
print(sess.run(linear_model, {x:[1,2,3,4]}))
```

```
[ 0.          0.30000001  0.60000002  0.90000004]
```


We've created a model, but we don't know how good it is yet. To evaluate the model on training data, we need a y placeholder to provide the desired values, and we need to write a loss function.

A loss function measures how far apart the current model is from the provided data. We'll use a standard loss model for linear regression, which sums the squares of the deltas between the current model and the provided data. linear_model - y creates a vector where each element is the corresponding example's error delta. We call tf.square to square that error. Then, we sum all the squared errors to create a single scalar that abstracts the error of all examples using tf.reduce_sum:

```python
# desired output
y = tf.placeholder(tf.float32)

# squared difference between real output and desired output (vector)
squared_deltas = tf.square(linear_model - y)

# sum up all squared errors to a single scalar
loss = tf.reduce_sum(squared_deltas)

print(sess.run(loss, {x:[1,2,3,4], y:[0,-1,-2,-3]}))
```

```
23.66
```
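
The value 23.66 can be reproduced by hand. The sketch below repeats the same computation in plain Python with W = 0.3 and b = -0.3. The deltas come out as roughly 0, 1.3, 2.6 and 3.9, and their squares add up to about 23.66. The name `loss_value` is used only here.

```python
W, b = 0.3, -0.3
xs = [1, 2, 3, 4]
ys = [0, -1, -2, -3]

predictions = [W * x + b for x in xs]               # what linear_model computes
deltas = [p - y for p, y in zip(predictions, ys)]   # linear_model - y
squared = [d * d for d in deltas]                   # tf.square
loss_value = sum(squared)                           # tf.reduce_sum

print(deltas)
print(loss_value)
```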


### Training with tf.train

TensorFlow provides optimizers that slowly change each variable in order to minimize the loss function. The simplest optimizer is gradient descent. It modifies each variable according to the magnitude of the derivative of loss with respect to that variable.<br>
In general, computing symbolic derivatives manually is tedious and error-prone. Consequently, TensorFlow can automatically produce derivatives given only a description of the model using the function tf.gradients. For simplicity, optimizers typically do this for you. For example:

```python
# optimizer changes variables to minimize loss function
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)

sess.run(init) # reset values to incorrect defaults.

# easier in Keras, right?
for i in range(1000):
  sess.run(train, {x:[1,2,3,4], y:[0,-1,-2,-3]})

print(sess.run([W, b]))
```

```
[array([-0.9999969], dtype=float32), array([ 0.99999082], dtype=float32)]
```
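
To make the update rule concrete, here is a hand-written version of plain gradient descent for the same data, with the derivatives of the sum-of-squares loss worked out manually. It is a sketch of the idea and not what GradientDescentOptimizer does internally. With the same starting values, learning rate and number of steps, W and b should end up close to -1 and 1.

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [0.0, -1.0, -2.0, -3.0]

W, b = 0.3, -0.3        # same starting values as above
learning_rate = 0.01

for step in range(1000):
    errors = [W * x + b - y for x, y in zip(xs, ys)]
    # derivatives of sum((W*x + b - y)**2) with respect to W and b
    grad_W = sum(2.0 * e * x for e, x in zip(errors, xs))
    grad_b = sum(2.0 * e for e in errors)
    # move each parameter a small step against its gradient
    W -= learning_rate * grad_W
    b -= learning_rate * grad_b

print(W, b)
```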


Now we have done actual machine learning! Although doing this simple linear regression doesn't require much TensorFlow core code, more complicated models and methods to feed data into your model necessitate more code. Thus TensorFlow provides higher level abstractions for common patterns, structures, and functionality. We will learn how to use some of these abstractions in the next section.

### Complete Program

Let's put it all together:

```python
import numpy as np
import tensorflow as tf

# model parameters
W = tf.Variable([.3], dtype=tf.float32)
b = tf.Variable([-.3], dtype=tf.float32)

# model input and output
x = tf.placeholder(tf.float32)
linear_model = W * x + b
y = tf.placeholder(tf.float32)

# loss
loss = tf.reduce_sum(tf.square(linear_model - y)) # sum of the squares

# optimizer
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)

# training data
x_train = [1,2,3,4]
y_train = [0,-1,-2,-3]

# training loop
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init) # set the variables to their (incorrect) initial values
for i in range(1000):
  sess.run(train, {x:x_train, y:y_train})

# evaluate training accuracy
curr_W, curr_b, curr_loss = sess.run([W, b, loss], {x:x_train, y:y_train})
print("W: %s b: %s loss: %s"%(curr_W, curr_b, curr_loss))
```

```
W: [-0.9999969] b: [ 0.99999082] loss: 5.69997e-11
```


Notice that the loss is a very small number (close to zero). Because W and b start from fixed values (0.3 and -0.3) rather than random ones, repeated runs should give essentially the same result; small differences can still come from floating-point arithmetic on different hardware.

This more complicated program can still be visualized in TensorBoard:

<img src="https://www.tensorflow.org/images/getting_started_final.png">

### Training with tf.contrib.learn

tf.contrib.learn is a high-level TensorFlow library that simplifies the mechanics of machine learning, including the following:

 - running training loops
 - running evaluation loops
 - managing data sets
 - managing feeding

tf.contrib.learn defines many common models.

Notice how much simpler the linear regression program becomes with tf.contrib.learn.

We declare a list of features in order to get a predefined TensorFlow model. We only have one real-valued feature. Feature columns provide a mechanism to map data to a model. There are many other types of columns that are more complicated and useful.<br>
The Tensor representing the RealValuedColumn will have the shape of [batch_size, dimension].

```python
features = [tf.contrib.layers.real_valued_column("x", dimension=1)]
```

An estimator is the front end to invoke training (fitting) and evaluation (inference). There are many predefined types like linear regression, logistic regression, linear classification, logistic classification, and many neural network regressors and classifiers. The following code provides an estimator that does linear regression.

```python
estimator = tf.contrib.learn.LinearRegressor(feature_columns=features)
```

```
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f6f9db1ad30>, '_master': '', '_num_ps_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000}
```


TensorFlow provides many helper methods to read and set up data sets. Here we use two data sets: one for training and one for evaluation. We have to tell the function how many epochs we want and how big each batch should be.

```python
x_train = np.array([1., 2., 3., 4.])
y_train = np.array([0., -1., -2., -3.])

x_eval = np.array([2., 5., 8., 1.])
y_eval = np.array([-1.01, -4.1, -7, 0.])

input_fn = tf.contrib.learn.io.numpy_input_fn({"x":x_train}, y_train,
                                              batch_size=4,
                                              num_epochs=1000)

eval_input_fn = tf.contrib.learn.io.numpy_input_fn(
    {"x":x_eval}, y_eval, batch_size=4, num_epochs=1000)
```
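
How the batch size and the number of epochs translate into training batches can be sketched with a small generator. The function `batches` is a made-up stand-in and not the TensorFlow input function; it ignores details such as shuffling and only shows the counting. With 4 examples and a batch size of 4, each epoch yields exactly one batch, so 1000 epochs yield 1000 batches.

```python
def batches(xs, ys, batch_size, num_epochs):
    """Yield (x_batch, y_batch) pairs, going over the data num_epochs times."""
    for _ in range(num_epochs):
        for start in range(0, len(xs), batch_size):
            yield xs[start:start + batch_size], ys[start:start + batch_size]

x_data = [1., 2., 3., 4.]
y_data = [0., -1., -2., -3.]

# 4 examples, batch size 4: one batch per epoch.
print(sum(1 for _ in batches(x_data, y_data, batch_size=4, num_epochs=1000)))

# Batch size 2: two batches per epoch.
for x_batch, y_batch in batches(x_data, y_data, batch_size=2, num_epochs=1):
    print(x_batch, y_batch)
```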

We can run 1000 training steps by invoking the fit method and passing the training data set.

```python
estimator.fit(input_fn=input_fn, steps=1000)
```

```
Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /tmp/tmprfalltl7/model.ckpt.
INFO:tensorflow:loss = 2.25, step = 1
INFO:tensorflow:global_step/sec: 1192.67
INFO:tensorflow:loss = 0.019772, step = 101
INFO:tensorflow:global_step/sec: 1391.22
INFO:tensorflow:loss = 0.0136342, step = 201
INFO:tensorflow:global_step/sec: 1053.04
INFO:tensorflow:loss = 0.00141048, step = 301
INFO:tensorflow:global_step/sec: 1078.9
INFO:tensorflow:loss = 0.000165774, step = 401
INFO:tensorflow:global_step/sec: 1115.74
INFO:tensorflow:loss = 8.03088e-05, step = 501
INFO:tensorflow:global_step/sec: 1141.63
INFO:tensorflow:loss = 7.62421e-06, step

LinearRegressor(params={'head': <tensorflow.contrib.learn.python.learn.estimators.head._RegressionHead object at 0x7f6f9db1add8>, 'feature_columns': [_RealValuedColumn(column_name='x', dimension=1, default_value=None, dtype=tf.float32, normalizer=None)], 'optimizer': None, 'gradient_clip_norm': None, 'joint_weights': False})
```

Here we evaluate how well our model did.

```python
train_loss = estimator.evaluate(input_fn=input_fn)
eval_loss = estimator.evaluate(input_fn=eval_input_fn)
print()
print("Train loss: %r"% train_loss)
print("Eval loss: %r"% eval_loss)
# print(estimator.evaluate(input_fn=input_fn)
```

```
Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Starting evaluation at 2017-07-03-07:50:21
INFO:tensorflow:Finished evaluation at 2017-07-03-07:50:22
INFO:tensorflow:Saving dict for global step 1000: global_step = 1000, loss = 9.31061e-09
Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Starting evaluation at 2017-07-03-07:50:22
INFO:tensorflow:Finished evaluation at 2017-07-03-07:50:23
INFO:tensorflow:S
```

Notice how our eval data has a higher loss, but it is still close to zero. That means we are learning properly.

### Own Model

tf.contrib.learn does not lock you into its predefined models. Suppose we wanted to create a custom model that is not built into TensorFlow. We can still retain the high level abstraction of data set, feeding, training, etc. of tf.contrib.learn. For illustration, we will show how to implement our own equivalent model to LinearRegressor using our knowledge of the lower level TensorFlow API.

To define a custom model that works with tf.contrib.learn, we need to use tf.contrib.learn.Estimator. tf.contrib.learn.LinearRegressor is actually a sub-class of tf.contrib.learn.Estimator. Instead of sub-classing Estimator, we simply provide Estimator a function model_fn that tells tf.contrib.learn how it can evaluate predictions, training steps, and loss. The code is as follows:

```python
def model(features, labels, mode):

  # build a linear model and predict values
  w = tf.get_variable("w", [1], dtype=tf.float64)
  b = tf.get_variable("b", [1], dtype=tf.float64)
  y = w*features['x'] + b # weighted inputs plus bias

  # loss sub-graph
  loss = tf.reduce_sum(tf.square(y - labels))

  # training sub-graph
  global_step = tf.train.get_global_step()
  optimizer = tf.train.GradientDescentOptimizer(0.01)
  train = tf.group(optimizer.minimize(loss),
                   tf.assign_add(global_step, 1))

  # ModelFnOps connects the sub-graphs we built to the appropriate functionality
  return tf.contrib.learn.ModelFnOps(
      mode=mode, predictions=y,
      loss=loss,
      train_op=train)

# estimator
estimator = tf.contrib.learn.Estimator(model_fn=model)

# define our data sets
x_train = np.array([1., 2., 3., 4.])
y_train = np.array([0., -1., -2., -3.])
x_eval = np.array([2., 5., 8., 1.])
y_eval = np.array([-1.01, -4.1, -7, 0.])
input_fn = tf.contrib.learn.io.numpy_input_fn({"x": x_train}, y_train, 4, num_epochs=1000)
# same evaluation input as in the LinearRegressor example, repeated so this cell runs on its own
eval_input_fn = tf.contrib.learn.io.numpy_input_fn({"x": x_eval}, y_eval, 4, num_epochs=1000)

# train
estimator.fit(input_fn=input_fn, steps=1000)

# evaluate how well our model did
train_loss = estimator.evaluate(input_fn=input_fn)
eval_loss = estimator.evaluate(input_fn=eval_input_fn)
print()
print("train loss: %r"% train_loss)
print("eval loss: %r"% eval_loss)
```

```
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f6f7159ef98>, '_master': '', '_num_ps_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000}
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /tmp/tmp4odj4wxe/model.ckpt.
INFO:tensorflow:loss = 105.780138798, step = 1
INFO:tensorflow:global_step/sec: 1177.37
INFO:tensorflow:loss = 0.26134146266, step = 101
INFO:tensorflow:global_step/sec: 1230.48
INFO:tensorflow:loss = 0.0125601875681, step = 201
INFO:tensorflow:global_step/sec: 1173.68
INFO:tensorflow:loss = 0.000522341980613, step = 3
```

Notice how the contents of the custom model() function are very similar to our manual model training loop from the lower level API.

### XOR with TensorFlow

To start with, we need to load in the TensorFlow library:

```python
import tensorflow as tf
```

The next step is to set up placeholders to hold the input data. TensorFlow will automatically fill them with the data when we run the network. In our XOR problem, we have four different training examples and each example has two features. There are also four expected outputs, each with just one value (either a 0 or 1). In TensorFlow, this looks like this:

```python
# 4 training examples, each has 2 features (00, 01, 10, 11)
x_ = tf.placeholder(tf.float32, shape=[4,2], name="x-input")

# 4 expected outputs, each 1 value (0 or 1)
y_ = tf.placeholder(tf.float32, shape=[4,1], name="y-input")
```

I’ve set up the inputs to be floating point numbers rather than the more natural integers to avoid having to cast them to floating point when multiplying by the weights later on. The shape parameter tells the placeholder the dimensions of the data we’ll be passing in.

The next step is to set up the parameters for the network. These are called Variables in TensorFlow.  Variables will be modified by TensorFlow during the training steps.

```python
# the first argument is the shape of the generated tensor, e.g. [2,2] for a 2x2 matrix
Theta1 = tf.Variable(tf.random_uniform([2,2], -1, 1), name="Theta1")
Theta2 = tf.Variable(tf.random_uniform([2,1], -1, 1), name="Theta2")
```

For our Theta matrices, we want them initialized to random values between -1 and +1, so we use the built-in random_uniform function to do that.

In TensorFlow, we set up the bias nodes separately, but still as Variables. This lets the algorithms modify the values of the bias node. This is mathematically equivalent to having a signal value of 1 and initial weights of 0 on the links from the bias nodes.

```python
Bias1 = tf.Variable(tf.zeros([2]), name="Bias1")
Bias2 = tf.Variable(tf.zeros([1]), name="Bias2")
```
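
The equivalence between a separate bias vector and an extra input fixed at 1 can be checked with a small plain-Python sketch. The function names and the numbers are made up for illustration; both functions should return the same values.

```python
def layer_with_bias(x, theta, bias):
    """x: list of inputs, theta: [n_in][n_out] weights, bias: [n_out]."""
    n_out = len(bias)
    return [sum(x[i] * theta[i][j] for i in range(len(x))) + bias[j]
            for j in range(n_out)]

def layer_with_bias_input(x, theta_augmented):
    """Same layer, but the bias is the last row of the weight matrix
    and a constant 1 is appended to the inputs."""
    x1 = x + [1.0]
    n_out = len(theta_augmented[0])
    return [sum(x1[i] * theta_augmented[i][j] for i in range(len(x1)))
            for j in range(n_out)]

theta = [[0.2, -0.4], [0.7, 0.1]]   # illustrative values
bias = [0.5, -0.3]
x = [1.0, 0.0]

print(layer_with_bias(x, theta, bias))
print(layer_with_bias_input(x, theta + [bias]))
```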

Now we set up the model.

```python
A2 = tf.sigmoid(tf.matmul(x_, Theta1) + Bias1)
Hypothesis = tf.sigmoid(tf.matmul(A2, Theta2) + Bias2)
```

Here, matmul is TensorFlow’s matrix multiplication function, and sigmoid naturally is the sigmoid calculation function.

Our cost function is the cross-entropy between the expected outputs and the hypothesis, averaged over all the training examples:

```python
cost = tf.reduce_mean(( (y_ * tf.log(Hypothesis))
       + ((1 - y_) * tf.log(1.0 - Hypothesis)) ) * -1)
```
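
The same cost can be computed for a handful of numbers in plain Python, which gives a feeling for its scale. The function name `cross_entropy` is made up for this sketch. An undecided network that outputs 0.5 for every example has a cost of ln 2 (about 0.693), and the cost approaches 0 as the outputs approach the targets.

```python
import math

def cross_entropy(targets, outputs):
    """Mean of -(y*log(h) + (1-y)*log(1-h)) over all examples."""
    terms = [-(y * math.log(h) + (1 - y) * math.log(1.0 - h))
             for y, h in zip(targets, outputs)]
    return sum(terms) / len(terms)

xor_targets = [0, 1, 1, 0]

# An undecided network that outputs 0.5 everywhere.
print(cross_entropy(xor_targets, [0.5, 0.5, 0.5, 0.5]))

# A network that is almost right.
print(cross_entropy(xor_targets, [0.01, 0.99, 0.99, 0.01]))
```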

So far, that has been relatively straightforward. Let’s look at training the network.

TensorFlow ships with several different training algorithms. We’re going to use the gradient descent algorithm:

```python
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cost)
```

What this statement says is that we’re going to use GradientDescentOptimizer as our training algorithm, the learning rate is going to be 0.01 and we want to minimize the cost function above. This means that we don’t have to implement our own algorithm for minimizing the cost.

That’s all there is to setting up the network. Now we just have to go through a few initialization steps before running the examples through the network.

As I mentioned above, TensorFlow runs a model inside a session, which it uses to maintain the state of the variables as they are passed through the network we’ve set up. So the first step in that session is to initialise all the Variables from above. This step allocates values to the various Variables in accordance with how we set them up (i.e. random numbers for Theta and zeros for Bias).

```python
XOR_X = [[0,0],[0,1],[1,0],[1,1]]
XOR_Y = [[0],[1],[1],[0]]

init = tf.global_variables_initializer()
sess = tf.Session()

# for visualization
writer = tf.summary.FileWriter("./logs/xor_logs", sess.graph)

sess.run(init)
```

The next step is to run some epochs.

Each time the training step is executed, the values in the dictionary feed_dict are loaded into the placeholders that we set up at the beginning. As the XOR problem is relatively simple, each epoch will contain the entire training set. To see what’s going on inside the loop, we print out the values of the Variables every 10,000 epochs.

```python
for i in range(1000000):
    sess.run(train_step, feed_dict={x_: XOR_X, y_: XOR_Y})
    if i % 10000 == 0:
        print('Epoch:', i)
        print('Hypothesis:')
        print(sess.run(Hypothesis, feed_dict={x_: XOR_X, y_: XOR_Y}))
        print('Theta1:')
        print(sess.run(Theta1))
        print('Bias1:')
        print(sess.run(Bias1))
        print('Theta2:')
        print(sess.run(Theta2))
        print('Bias2:')
        print(sess.run(Bias2))
        print('cost:', sess.run(cost, feed_dict={x_: XOR_X, y_: XOR_Y}))
        print()

print()
print('Input:')
print(XOR_X)
print('Hypothesis:')
print(sess.run(Hypothesis, feed_dict={x_: XOR_X, y_: XOR_Y}))
print('cost:', sess.run(cost, feed_dict={x_: XOR_X, y_: XOR_Y}))
```

```
Epoch: 0
Hypothesis:
[[ 0.50156796]
 [ 0.5046078 ]
 [ 0.49542457]
 [ 0.49822351]]
Theta1:
[[ 0.24304973  0.11544408]
 [-0.18594487 -0.16864645]]
Bias1:
[ 0.25040069 -0.02451689]
Theta2:
[[-0.60212612]
 [ 0.37223211]]
Bias2:
[ 0.16099767]
cost: 0.693051

Epoch: 10000
Hypothesis:
[[ 0.50182217]
 [ 0.50578773]
 [ 0.49426284]
 [ 0.49780619]]
Theta1:
[[ 0.29708269  0.13173406]
 [-0.2242807  -0.186643  ]]
Bias1:
[ 0.29762152 -0.03670175]
Theta2:
[[-0.60845363]
 [ 0.38608986]]
Bias2:
[ 0.16695368]
cost: 0.692973

Epoch: 20000
Hypothesis:
[[ 0.50217646]
 [ 0.50779343]
 [ 0.49229765]
 [ 0.49705204]]
Theta1:
[[ 0.38396621  0.15452576]
 [-0.28574663 -0.21174367]]
Bias1:
[ 0.36882213 -0.05270659]
Theta2:
[[-0.62387395]
 [ 0.40834585]]
Bias2:
[ 0.17873059]
cost: 0.692783

Epoch: 30000
Hypothesis:
[[ 0.50257367]
 [ 0.51181   ]
 [ 0.48842731]
 [ 0.49539754]]
Theta1:
[[ 0.5411706   0.19040614]
 [-0.39851186 -0.25034413]]
Bias1:
[ 0.48937076 -0.07512942]
Theta2:
[[-0.66627246]
 [ 0.44998714]]
Bias2:
[ 

Epoch: 320000
Hypothesis:
[[ 0.00360153]
 [ 0.9951036 ]
 [ 0.9966684 ]
 [ 0.00308914]]
Theta1:
[[ 7.20160913  6.52006388]
 [-6.99017572 -6.78816557]]
Bias1:
[ 3.5028584  -3.45751309]
Theta2:
[[-12.03102398]
 [ 12.6248436 ]]
Bias2:
[ 5.67092419]
cost: 0.00373689

Epoch: 330000
Hypothesis:
[[ 0.00345702]
 [ 0.99529088]
 [ 0.99680293]
 [ 0.00296679]]
Theta1:
[[ 7.22113705  6.54373932]
 [-7.00924921 -6.81149626]]
Bias1:
[ 3.51239514 -3.46887708]
Theta2:
[[-12.10731792]
 [ 12.70113754]]
Bias2:
[ 5.70907116]
cost: 0.00358917

Epoch: 340000
Hypothesis:
[[ 0.00333322]
 [ 0.99546659]
 [ 0.99691808]
 [ 0.00286202]]
Theta1:
[[ 7.24021053  6.56281281]
 [-7.0283227  -6.83056974]]
Bias1:
[ 3.52193189 -3.47841382]
Theta2:
[[-12.17824459]
 [ 12.76974583]]
Bias2:
[ 5.74608517]
cost: 0.00345883

Epoch: 350000
Hypothesis:
[[ 0.00321496]
 [ 0.99562067]
 [ 0.99702531]
 [ 0.00276199]]
Theta1:
[[ 7.25928402  6.58188629]
 [-7.04739618 -6.84964323]]
Bias1:
[ 3.53146863 -3.48795056]
Theta2:
[[-12.24500179]
 [ 1

Epoch: 640000
Hypothesis:
[[ 0.00155851]
 [ 0.99785006]
 [ 0.99855083]
 [ 0.00134736]]
Theta1:
[[ 7.62320042  6.95526266]
 [-7.39861822 -7.22294855]]
Bias1:
[ 3.69936681 -3.67309928]
Theta2:
[[-13.61110401]
 [ 14.19131565]]
Bias2:
[ 6.46854258]
cost: 0.00162761

Epoch: 650000
Hypothesis:
[[ 0.00153746]
 [ 0.99789029]
 [ 0.99857676]
 [ 0.00132963]]
Theta1:
[[ 7.63273716  6.9647994 ]
 [-7.40815496 -7.23248529]]
Bias1:
[ 3.70413518 -3.67786765]
Theta2:
[[-13.64113045]
 [ 14.21992588]]
Bias2:
[ 6.48667622]
cost: 0.00160134

Epoch: 660000
Hypothesis:
[[ 0.00151299]
 [ 0.997922  ]
 [ 0.99859875]
 [ 0.00130895]]
Theta1:
[[ 7.6422739   6.97433615]
 [-7.41769171 -7.24202204]]
Bias1:
[ 3.70890355 -3.68263602]
Theta2:
[[-13.66974068]
 [ 14.24853611]]
Bias2:
[ 6.50098133]
cost: 0.00157658

Epoch: 670000
Hypothesis:
[[ 0.00148891]
 [ 0.99795306]
 [ 0.99862051]
 [ 0.00128858]]
Theta1:
[[ 7.65181065  6.98387289]
 [-7.42722845 -7.25155878]]
Bias1:
[ 3.71367192 -3.68740439]
Theta2:
[[-13.69835091]
 [ 1

Epoch: 950000
Hypothesis:
[[  1.01944304e-03]
 [  9.98595178e-01]
 [  9.99054492e-01]
 [  8.83262139e-04]]
Theta1:
[[ 7.82308817  7.1610961 ]
 [-7.58781958 -7.42960024]]
Bias1:
[ 3.78917003 -3.7749033 ]
Theta2:
[[-14.42505741]
 [ 14.99634838]]
Bias2:
[ 6.88228798]
cost: 0.00106384

Epoch: 960000
Hypothesis:
[[  1.00909313e-03]
 [  9.98609006e-01]
 [  9.99064028e-01]
 [  8.74418474e-04]]
Theta1:
[[ 7.82785654  7.16586447]
 [-7.59258795 -7.43436861]]
Bias1:
[ 3.79155421 -3.77728748]
Theta2:
[[-14.4441309 ]
 [ 15.01542187]]
Bias2:
[ 6.89182472]
cost: 0.00105319

Epoch: 970000
Hypothesis:
[[  9.98848118e-04]
 [  9.98622537e-01]
 [  9.99073267e-01]
 [  8.65663344e-04]]
Theta1:
[[ 7.83262491  7.17063284]
 [-7.59735632 -7.43913698]]
Bias1:
[ 3.7939384  -3.77967167]
Theta2:
[[-14.46320438]
 [ 15.03449535]]
Bias2:
[ 6.90136147]
cost: 0.00104274

Epoch: 980000
Hypothesis:
[[  9.88706131e-04]
 [  9.98636067e-01]
 [  9.99082565e-01]
 [  8.56995001e-04]]
Theta1:
[[ 7.83739328  7.17540121]
 [-7.6021
```

As you can see in the display for the Hypothesis variable, the network has learned to output nearly correct values for the inputs.
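
To check this by hand, here is a minimal sketch in plain Python that recomputes the Hypothesis from the parameters printed at epoch 650000. It assumes the two-layer sigmoid network from the XOR section (input times Theta1 plus Bias1, sigmoid, then times Theta2 plus Bias2, sigmoid) and that the four inputs are ordered [0,0], [0,1], [1,0], [1,1]; the helper names `sigmoid` and `forward` are ours, not part of TensorFlow.

```python
import math

def sigmoid(z):
    """Logistic function for a single number."""
    return 1.0 / (1.0 + math.exp(-z))

# Parameters copied from the "Epoch: 650000" printout above.
theta1 = [[7.63273716, 6.9647994], [-7.40815496, -7.23248529]]
bias1 = [3.70413518, -3.67786765]
theta2 = [[-13.64113045], [14.21992588]]
bias2 = [6.48667622]

# Assumed order of the four XOR input cases.
xor_inputs = [[0, 0], [0, 1], [1, 0], [1, 1]]

def forward(x):
    """Two-layer sigmoid network: 2 inputs -> 2 hidden units -> 1 output."""
    hidden = [
        sigmoid(sum(x[i] * theta1[i][j] for i in range(2)) + bias1[j])
        for j in range(2)
    ]
    return sigmoid(sum(hidden[j] * theta2[j][0] for j in range(2)) + bias2[0])

hypothesis = [forward(x) for x in xor_inputs]
for x, h in zip(xor_inputs, hypothesis):
    print(x, round(h, 5))
```

The printed values should land close to 0, 1, 1, 0, which is the XOR truth table.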

To see the graph of the model, TensorFlow includes a utility called TensorBoard.

<img src="https://aimatters.files.wordpress.com/2016/01/tf_graph.png">

We can see that our inputs x-input and y-input are the starts of the graph, and that they flow through the processes at layer2 and layer3, ultimately being used in the cost function.

### MNIST Handwritten Digits Recognition

MNIST is a simple computer vision dataset. It consists of images of handwritten digits like these:

<img src="https://www.tensorflow.org/images/MNIST.png">

It also includes labels for each image, telling us which digit it is.

Let's get some pictures of digits!<br>
These two lines of code will download and read in the data automatically:

In [32]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


The MNIST data is split into three parts: 55,000 data points of training data (mnist.train), 10,000 points of test data (mnist.test), and 5,000 points of validation data (mnist.validation). This split is very important: it's essential in machine learning that we have separate data which we don't learn from so that we can make sure that what we've learned actually generalizes!

As mentioned earlier, every MNIST data point has two parts: an image of a handwritten digit and a corresponding label. We'll call the images "x" and the labels "y". Both the training set and test set contain images and their corresponding labels.

Each image is 28 pixels by 28 pixels, which can be thought of as an array of numbers describing how dark each pixel is. For example:

<img src="https://www.tensorflow.org/images/MNIST-Matrix.png">

We can flatten this array into a vector of 28x28 = 784 numbers. From this perspective, the MNIST images are just a bunch of points in a 784-dimensional vector space.
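
As a small illustration of flattening, the sketch below turns a made-up 3x3 "image" into a 9-element vector. The row-by-row order used here is an assumption for the example; any order works as long as it is the same for every image.

```python
# A tiny 3x3 "image" stands in for a 28x28 MNIST digit (values are made up).
image = [
    [0.0, 0.5, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.5, 0.0],
]

def flatten(rows):
    """Concatenate the rows of a 2-D list into one flat list."""
    return [pixel for row in rows for pixel in row]

vector = flatten(image)
print(len(vector))  # 3 * 3 = 9 entries; a 28x28 image gives 784
```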

Flattening the data throws away information about the 2D structure of the image. Isn't that bad? Well, the best computer vision methods do exploit this structure, and we will in later tutorials. But the simple method we will be using here, a softmax regression (defined below), won't.

The result is that mnist.train.images is a tensor (an n-dimensional array) with a shape of [55000, 784]. The first dimension is an index into the list of images and the second dimension is the index for each pixel in each image. Each entry in the tensor is a pixel intensity between 0 and 1, for a particular pixel in a particular image.

<img src="https://www.tensorflow.org/images/mnist-train-xs.png">

Each image in MNIST has a corresponding label, a number between 0 and 9 representing the digit drawn in the image.

For the purposes of this tutorial, we're going to want our labels as "one-hot vectors". A one-hot vector is a vector which is 0 in most dimensions, and 1 in a single dimension. In this case, the $n$th digit will be represented as a vector which is 1 in the $n$th dimension.<br>
For example, 3 would be [0, 0, 0, 1, 0, 0, 0, 0, 0, 0].<br>
Consequently, mnist.train.labels is a [55000, 10] array of floats.
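
A one-hot encoding is easy to sketch in plain Python. The helper name `one_hot` is ours for illustration; the MNIST loader does this for us when we pass one_hot=True.

```python
def one_hot(digit, num_classes=10):
    """Return a list that is 1.0 at position `digit` and 0.0 elsewhere."""
    vec = [0.0] * num_classes
    vec[digit] = 1.0
    return vec

label = one_hot(3)
print(label)

# Going back from a one-hot vector to the digit: position of the largest entry.
decoded = label.index(max(label))
```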

<img src="https://www.tensorflow.org/images/mnist-train-ys.png">

We know that every image in MNIST is of a handwritten digit between zero and nine. So there are only ten possible things that a given image can be. We want to be able to look at an image and give the probabilities for it being each digit. For example, our model might look at a picture of a nine and be 80% sure it's a nine, but give a 5% chance to it being an eight (because of the top loop) and a bit of probability to all the others because it isn't 100% sure.

This is a classic case where a softmax regression is a natural, simple model. If you want to assign probabilities to an object being one of several different things, softmax is the thing to do, because softmax gives us a list of values between 0 and 1 that add up to 1.

A softmax regression has two steps: first we add up the evidence of our input being in certain classes, and then we convert that evidence into probabilities.

To sum up the evidence that a given image is in a particular class, we do a weighted sum of the pixel intensities. The weight is negative if that pixel having a high intensity is evidence against the image being in that class, and positive if it is evidence in favor.

The following diagram shows the weights one model learned for each of these classes. <font color="red">Red</font> represents <font color="red">negative weights</font>, while <font color="blue">blue</font> represents <font color="blue">positive weights</font>.

<img src="https://www.tensorflow.org/images/softmax-weights.png">

<img src="http://colah.github.io/posts/2014-10-Visualizing-MNIST/img/mnist_pca/MNIST-PCA1-4.png"/>

We also add some extra evidence called a bias. Basically, we want to be able to say that some things are more likely independent of the input. The result is that the evidence for a class given an input $x$ is:

<center>
$\text{evidence}_i = \sum_j W_{i,~ j} x_j + b_i$
</center>

Where $W_i$ is the weights and $b_i$ is the bias for class $i$, and $j$ is an index for summing over the pixels in our input image $x$. We then convert the evidence tallies into our predicted probabilities $y$ using the "softmax" function:

<center>
$y = \text{softmax}(\text{evidence})$
</center>

Here softmax is serving as an activation function, shaping the output of our linear function into the form we want -- in this case, a probability distribution over 10 cases. You can think of it as converting units of evidence into probabilities of our input being in each class. It's defined as:

<center>
$\text{softmax}(x) = \text{normalize}(\exp(x))$
</center>

It's often helpful to think of softmax this way: exponentiating its inputs and then normalizing them. The exponentiation means that one more unit of evidence increases the weight given to any hypothesis multiplicatively. And conversely, having one less unit of evidence means that a hypothesis gets a fraction of its earlier weight. No hypothesis ever has zero or negative weight. Softmax then normalizes these weights, so that they add up to one, forming a valid probability distribution.
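
The "exponentiate, then normalize" view can be sketched directly in plain Python. This is only an illustration with made-up evidence values, not the TensorFlow implementation.

```python
import math

def softmax(evidence):
    """Exponentiate each entry, then divide by the total so the result sums to 1."""
    exps = [math.exp(e) for e in evidence]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print(probs)

# One more unit of evidence multiplies the weight by e (about 2.718).
ratio = probs[1] / probs[0]
```

Note that every output is strictly positive, even for strongly negative evidence, which matches the remark that no hypothesis ever gets zero weight.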

You can picture our softmax regression as looking something like the following, although with a lot more $x$s. For each output, we compute a weighted sum of the $x$s, add a bias, and then apply softmax.

<img src="https://www.tensorflow.org/images/softmax-regression-scalargraph.png">

If we write that out as equations, we get:

<img src="https://www.tensorflow.org/images/softmax-regression-scalarequation.png">

We can "vectorize" this procedure, turning it into a matrix multiplication and vector addition. This is helpful for computational efficiency. (It's also a useful way to think.)

<img src="https://www.tensorflow.org/images/softmax-regression-vectorequation.png">

More compactly, we can just write:

<center>
$y = \text{softmax}(Wx + b)$
</center>

Now let's turn that into something that TensorFlow can use.

### Implementing the Regression

To do efficient numerical computing in Python, we typically use libraries like NumPy that do expensive operations such as matrix multiplication outside Python, using highly efficient code implemented in another language. Unfortunately, there can still be a lot of overhead from switching back to Python after every operation. This overhead is especially bad if you want to run computations on GPUs or in a distributed manner, where there can be a high cost to transferring data.

TensorFlow also does its heavy lifting outside Python, but it takes things a step further to avoid this overhead. Instead of running a single expensive operation independently from Python, TensorFlow lets us describe a graph of interacting operations that run entirely outside Python.

We describe these interacting operations by manipulating symbolic variables. Let's create one:

In [33]:
x = tf.placeholder(tf.float32, [None, 784])

x isn't a specific value. It's a placeholder, a value that we'll input when we ask TensorFlow to run a computation. We want to be able to input any number of MNIST images, each flattened into a 784-dimensional vector. We represent this as a 2-D tensor of floating-point numbers, with a shape [None, 784]. (Here None means that a dimension can be of any length.)

We also need the weights and biases for our model. We could imagine treating these like additional inputs, but TensorFlow has an even better way to handle it: Variable. A Variable is a modifiable tensor that lives in TensorFlow's graph of interacting operations. It can be used and even modified by the computation. For machine learning applications, one generally has the model parameters be Variables.

In [34]:
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

We create these Variables by giving tf.Variable the initial value of the Variable: in this case, we initialize both W and b as tensors full of zeros. Since we are going to learn W and b, it doesn't matter very much what they initially are.

Notice that W has a shape of [784, 10] because we want to multiply the 784-dimensional image vectors by it to produce 10-dimensional vectors of evidence for the different classes. b has a shape of [10] so we can add it to the output.

We can now implement our model. It only takes one line to define it!

In [35]:
y = tf.nn.softmax(tf.matmul(x, W) + b)

First, we multiply x by W with the expression tf.matmul(x, W). This is flipped from when we multiplied them in our equation, where we had $Wx$, as a small trick to deal with x being a 2D tensor with multiple inputs. We then add b, and finally apply tf.nn.softmax.
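
To see why the order is flipped, here is a plain-Python sketch with toy sizes (3 pixels and 2 classes instead of 784 and 10; all numbers are made up). With one image per row of x, multiplying x by W gives one row of evidence per image, and the bias is added to every row. The helper `matmul` is ours and stands in for tf.matmul.

```python
def matmul(a, b):
    """Multiply an [n, k] matrix by a [k, m] matrix, both given as nested lists."""
    return [
        [sum(a[r][i] * b[i][c] for i in range(len(b))) for c in range(len(b[0]))]
        for r in range(len(a))
    ]

x = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 1.0]]   # shape [2, 3]: one image per row
W = [[1.0, -1.0],
     [0.5, 0.5],
     [0.0, 2.0]]        # shape [3, 2]: one column per class
b = [0.1, -0.1]         # shape [2]: one bias per class

# Add the bias of class c to column c of every row.
evidence = [[v + b[c] for c, v in enumerate(row)] for row in matmul(x, W)]
print(evidence)         # shape [2, 2]: one row of evidence per image
```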

That's it. It only took us one line to define our model, after a couple short lines of setup. That isn't because TensorFlow is designed to make a softmax regression particularly easy: it's just a very flexible way to describe many kinds of numerical computations, from machine learning models to physics simulations. And once defined, our model can be run on different devices: your computer's CPU, GPUs, and even phones!

### Training

In order to train our model, we need to define what it means for the model to be good. Well, actually, in machine learning we typically define what it means for a model to be bad. We call this the cost, or the loss, and it represents how far off our model is from our desired outcome. We try to minimize that error, and the smaller the error margin, the better our model is.

One very common, very nice function to determine the loss of a model is called "<b>cross-entropy</b>." Cross-entropy arises from thinking about information-compressing codes in information theory, but it winds up being an important idea in lots of areas, from gambling to machine learning. It's defined as:

<center>
$H_{y'}(y) = -\sum_i y'_i \log(y_i)$
</center>

Where $y$ is our predicted probability distribution, and $y'$ is the true distribution (the one-hot vector with the digit labels). In some rough sense, the cross-entropy is measuring how inefficient our predictions are for describing the truth.

To implement cross-entropy we need to first add a new placeholder to input the correct answers:

In [36]:
y_ = tf.placeholder(tf.float32, [None, 10])

Then we can implement the cross-entropy function, $-\sum y'\log(y)$:

In [37]:
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))

First, tf.log computes the logarithm of each element of y. Next, we multiply each element of y_ with the corresponding element of tf.log(y). Then tf.reduce_sum adds the elements in the second dimension of y, due to the reduction_indices=[1] parameter. Finally, tf.reduce_mean computes the mean over all the examples in the batch.
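
The same steps can be followed in plain Python on a made-up batch of two examples with three classes. The function name `cross_entropy` and the numbers are ours for illustration.

```python
import math

def cross_entropy(y_true, y_pred):
    """Mean over the batch of -sum_i y'_i * log(y_i)."""
    per_example = [
        -sum(t * math.log(p) for t, p in zip(true_row, pred_row))
        for true_row, pred_row in zip(y_true, y_pred)
    ]
    return sum(per_example) / len(per_example)

y_true = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]      # one-hot labels
y_pred = [[0.1, 0.8, 0.1], [0.5, 0.25, 0.25]]    # predicted probabilities
loss = cross_entropy(y_true, y_pred)
print(loss)
```

Because the labels are one-hot, only the predicted probability of the correct class contributes: here the loss is the mean of -log(0.8) and -log(0.5).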

Note that in the source code, we don't use this formulation, because it is numerically unstable. Instead, we apply tf.nn.softmax_cross_entropy_with_logits on the unnormalized logits (e.g., we call softmax_cross_entropy_with_logits on tf.matmul(x, W) + b), because this more numerically stable function internally computes the softmax activation. In your code, consider using tf.nn.softmax_cross_entropy_with_logits instead.
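
To get a feeling for the instability, the sketch below compares the naive formula with the log-sum-exp trick, a common way to compute the same quantity from the logits without overflow. We are not claiming this is exactly what TensorFlow does internally; the function names and logits are made up for illustration.

```python
import math

def naive_cross_entropy(logits, label):
    """-log(softmax(logits)[label]), computed the direct way."""
    exps = [math.exp(z) for z in logits]
    return -math.log(exps[label] / sum(exps))

def stable_cross_entropy(logits, label):
    """Same quantity via log-sum-exp: subtract the max logit before exponentiating."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# For moderate logits both versions agree.
small = [1.0, 2.0, 3.0]
print(naive_cross_entropy(small, 2), stable_cross_entropy(small, 2))

# For large logits the naive version overflows, the stable one does not.
large = [1000.0, 0.0]
try:
    naive_cross_entropy(large, 1)
    naive_failed = False
except OverflowError:
    naive_failed = True
stable_loss = stable_cross_entropy(large, 1)
```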

Now that we know what we want our model to do, it's very easy to have TensorFlow train it to do so. Because TensorFlow knows the entire graph of your computations, it can automatically use the backpropagation algorithm to efficiently determine how your variables affect the loss you ask it to minimize. Then it can apply your choice of optimization algorithm to modify the variables and reduce the loss.

In [38]:
train_step = tf.train.GradientDescentOptimizer(0.05).minimize(cross_entropy)

In this case, we ask TensorFlow to minimize cross_entropy using the gradient descent algorithm with a learning rate of 0.05. Gradient descent is a simple procedure, where TensorFlow simply shifts each variable a little bit in the direction that reduces the cost. But TensorFlow also provides many other optimization algorithms: using one is as simple as tweaking one line.

What TensorFlow actually does here, behind the scenes, is to add new operations to your graph which implement backpropagation and gradient descent. Then it gives you back a single operation which, when run, does a step of gradient descent training, slightly tweaking your variables to reduce the loss.
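
To make the idea of a gradient descent step concrete, here is a minimal sketch on a toy one-variable cost, $(w - 3)^2$, with the gradient derived by hand. TensorFlow computes the gradients for us via backpropagation; the cost function, starting value, and number of steps here are assumptions chosen only for illustration.

```python
def cost(w):
    """Toy cost with its minimum at w = 3."""
    return (w - 3.0) ** 2

def gradient(w):
    """Derivative of the toy cost with respect to w."""
    return 2.0 * (w - 3.0)

learning_rate = 0.05
w = 0.0
history = [cost(w)]
for step in range(100):
    # Shift w a little in the direction that reduces the cost.
    w -= learning_rate * gradient(w)
    history.append(cost(w))

print(w, history[-1])
```

Each step shrinks the distance to the minimum by a constant factor, so the cost goes down at every step.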

We can now launch the model in an InteractiveSession:

In [39]:
sess = tf.InteractiveSession()

We first have to create an operation to initialize the variables we created:

In [40]:
tf.global_variables_initializer().run()

Let's train -- we'll run the training step 1000 times!

In [41]:
for i in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

In each step of the loop, we get a "batch" of one hundred random data points from our training set. We run train_step, feeding in the batch data to replace the placeholders.

Using small batches of random data is called stochastic training -- in this case, stochastic gradient descent. Ideally, we'd like to use all our data for every step of training because that would give us a better sense of what we should be doing, but that's expensive. So, instead, we use a different subset every time. Doing this is cheap and has much of the same benefit.
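
Drawing a random batch can be sketched in a few lines of plain Python. This `next_batch` is a hypothetical stand-in written for this example and is not how mnist.train.next_batch is implemented; the data is made up so that we can check that inputs and labels stay paired.

```python
import random

def next_batch(xs, ys, batch_size, rng):
    """Pick `batch_size` random (x, y) pairs without replacement."""
    indices = rng.sample(range(len(xs)), batch_size)
    return [xs[i] for i in indices], [ys[i] for i in indices]

# Toy data set: the "label" of x is x squared, so pairs are easy to verify.
xs = list(range(10))
ys = [x * x for x in xs]

rng = random.Random(0)  # fixed seed so the run is repeatable
batch_xs, batch_ys = next_batch(xs, ys, 4, rng)
print(batch_xs, batch_ys)
```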

### Evaluating Our Model

How well does our model do?

Well, first let's figure out where we predicted the correct label. tf.argmax is an extremely useful function which gives you the index of the highest entry in a tensor along some axis. For example, tf.argmax(y,1) is the label our model thinks is most likely for each input, while tf.argmax(y_,1) is the correct label. We can use tf.equal to check if our prediction matches the truth.

In [42]:
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))

That gives us a list of booleans. To determine what fraction are correct, we cast to floating point numbers and then take the mean. For example, [True, False, True, True] would become [1,0,1,1] which would become 0.75.
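
The same computation in plain Python, on four made-up predictions with two classes, reproduces the [True, False, True, True] example. The helper `argmax` is ours and plays the role of tf.argmax along axis 1.

```python
def argmax(values):
    """Index of the largest entry in a list."""
    return max(range(len(values)), key=lambda i: values[i])

predictions = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]  # model outputs
labels = [[0, 1], [0, 1], [0, 1], [1, 0]]                       # one-hot truth

# Compare predicted and true class for each example.
correct = [argmax(p) == argmax(t) for p, t in zip(predictions, labels)]

# Cast booleans to floats and take the mean.
accuracy = sum(1.0 if c else 0.0 for c in correct) / len(correct)
print(correct, accuracy)
```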

In [43]:
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

Finally, we ask for our accuracy on our test data.

In [44]:
print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))

0.9013


Is that good? Well, not really. This is because we're using a very simple model. With some changes, we can get to 99%. The best models can get to over 99.7% accuracy! (For more information, have a look at this <a href="http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html">list of results</a>.)

### References:

VGG16 in Keras: https://gist.github.com/baraldilorenzo/07d7802847aaad0a35d3<br>
Bias: https://ujjwalkarn.me/2016/08/09/quick-intro-neural-networks/ and https://stackoverflow.com/questions/2480650/role-of-bias-in-neural-networks<br>
Getting Started with TensorFlow: https://www.tensorflow.org/get_started/get_started<br>
XOR with Tensorflow: https://aimatters.wordpress.com/2016/01/16/solving-xor-with-a-neural-network-in-tensorflow/<br>
MNIST Handwritten Digits Recognition: https://www.tensorflow.org/get_started/mnist/beginners