Deep Learning with Python

PART 1 - FUNDAMENTALS OF DEEP LEARNING

1. What is deep learning?

Artificial intelligence, machine learning, and deep learning

Before deep learning: a brief history of machine learning

Why deep learning? Why now?

2. Before we begin: the mathematical building blocks of neural networks

A first look at a neural network

Data representations for neural networks

A tensor is a container for numbers. You may already be familiar with matrices, which are 2D tensors: tensors are a generalization of matrices to an arbitrary number of dimensions (note that in the context of tensors, a dimension is often called an axis).

A tensor is defined by three key attributes: Number of axes (rank), Shape, Data type.

Because the contents of the tensors manipulated by tensor operations can be interpreted as coordinates of points in some geometric space, all tensor operations have a geometric interpretation.

The gears of neural networks: tensor operations

Neural networks consist entirely of chains of tensor operations, and all of these tensor operations are just geometric transformations of the input data. It follows that you can interpret a neural network as a very complex geometric transformation in a high-dimensional space, implemented via a long series of simple steps.

Uncrumpling paper balls is what machine learning is about: finding neat representations for complex, highly folded data manifolds.

The engine of neural networks: gradient-based optimization

training loop:

  1. Draw a batch of training samples x and corresponding targets y.
  2. Run the network on x (a step called the forward pass) to obtain predictions y_pred .
  3. Compute the loss of the network on the batch, a measure of the mismatch between y_pred and y .
  4. Update all weights of the network in a way that slightly reduces the loss on this batch.

One naive solution would be to freeze all weights in the network except the one scalar coefficient being considered, and try different values for this coefficient. But such an approach would be horribly inefficient, because you’d need to compute two forward passes (which are expensive) for every individual coefficient (of which there are many, usually thousands and sometimes up to millions).

A much better approach is to take advantage of the fact that all operations used in the network are differentiable, and compute the gradient of the loss with regard to the network’s coefficients. You can then move the coefficients in the opposite direction from the gradient, thus decreasing the loss.

For every differentiable function f(x) (differentiable means “can be derived”: smooth, continuous functions can be derived, for example), there exists a derivative function f’(x) that maps any value of x to the slope of the local linear approximation of f at that point.

A gradient is the derivative of a tensor operation. It’s the generalization of the concept of derivatives to functions of multidimensional inputs: that is, to functions that take tensors as inputs.

You saw earlier that the derivative of a function f(x) of a single coefficient can be interpreted as the slope of the curve of f. Likewise, gradient(f)(W0) can be interpreted as the tensor describing the curvature of f(W) around W0.

For this reason, in much the same way that, for a function f(x), you can reduce the value of f(x) by moving x a little in the opposite direction from the derivative, with a function f(W) of a tensor, you can reduce f(W) by moving W in the opposite direction from the gradient: for example, W1 = W0 - step * gradient(f)(W0)
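
To make this concrete, here is a minimal NumPy sketch (an illustration, not from the book) of one gradient-descent step on f(W) = sum(W ** 2), whose gradient is simply 2 * W:

import numpy as np

W0 = np.array([3.0, -2.0, 1.5])    # current value of the parameter tensor
step = 0.1                         # small scaling factor (learning rate)

def gradient_f(W):
    return 2 * W                   # analytic gradient of f(W) = sum(W ** 2)

W1 = W0 - step * gradient_f(W0)    # move against the gradient
# f(W1) is now 9.76, down from f(W0) = 15.25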

mini-batch stochastic gradient descent (mini-batch SGD):

  1. Draw a batch of training samples x and corresponding targets y.
  2. Run the network on x to obtain predictions y_pred.
  3. Compute the loss of the network on the batch, a measure of the mismatch between y_pred and y .
  4. Compute the gradient of the loss with regard to the network’s parameters (a backward pass).
  5. Move the parameters a little in the opposite direction from the gradient (for example, W -= step * gradient), thus reducing the loss on the batch a bit.

Note that a variant of the mini-batch SGD algorithm would be to draw a single sample and target at each iteration, rather than drawing a batch of data. This would be true SGD (as opposed to mini-batch SGD). Alternatively, going to the opposite extreme, you could run every step on all data available, which is called batch SGD. Each update would then be more accurate, but far more expensive. The efficient compromise between these two extremes is to use mini-batches of reasonable size.
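
Here is a minimal NumPy sketch of mini-batch SGD, applied to a toy linear-regression model trained with mean squared error (the data, batch size, and learning rate are illustrative assumptions, not from the book):

import numpy as np

X = np.random.randn(1000, 20)                  # 1,000 samples, 20 features
true_w = np.random.randn(20)
y = X @ true_w + 0.1 * np.random.randn(1000)   # linear targets plus noise

w = np.zeros(20)                               # parameters to learn
step, batch_size = 0.01, 32

for epoch in range(10):
    order = np.random.permutation(len(X))      # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        x_b, y_b = X[batch], y[batch]          # 1. draw a batch
        y_pred = x_b @ w                       # 2. forward pass
        loss = np.mean((y_pred - y_b) ** 2)    # 3. compute the loss (MSE)
        grad = 2 * x_b.T @ (y_pred - y_b) / len(batch)  # 4. backward pass
        w -= step * grad                       # 5. move against the gradient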

Additionally, there exist multiple variants of SGD that differ by taking into account previous weight updates when computing the next weight update, rather than just looking at the current value of the gradients. There is, for instance, SGD with momentum, as well as Adagrad, RMSProp, and several others. Such variants are known as optimization methods or optimizers. In particular, the concept of momentum, which is used in many of these variants, deserves your attention. Momentum addresses two issues with SGD: convergence speed and local minima.

Chaining derivatives: the Backpropagation algorithm

Calculus tells us that such a chain of functions can be derived using the following identity, called the chain rule: (f(g(x)))’ = f’(g(x)) * g’(x). Applying the chain rule to the computation of the gradient values of a neural network gives rise to an algorithm called Backpropagation (also sometimes called reverse-mode differentiation). Backpropagation starts with the final loss value and works backward from the top layers to the bottom layers, applying the chain rule to compute the contribution that each parameter had in the loss value.
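
A quick numeric sanity check of the chain rule (a self-contained sketch; the two functions are arbitrary examples):

import numpy as np

g = lambda x: x ** 2                 # inner function, g'(x) = 2 * x
f = lambda u: np.sin(u)              # outer function, f'(u) = cos(u)

x = 1.3
analytic = np.cos(g(x)) * (2 * x)    # chain rule: f'(g(x)) * g'(x)
eps = 1e-6
numeric = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)  # finite difference
# analytic and numeric agree to several decimal places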

Looking back at our first example

3. Getting started with neural networks

Anatomy of a neural network

Figure 3.1 Relationship between the network, layers, loss function, and optimizer


Different layers are appropriate for different tensor formats and different types of data processing. For instance, simple vector data, stored in 2D tensors of shape (samples, features), is often processed by densely connected layers, also called fully connected or dense layers (the Dense class in Keras). Sequence data, stored in 3D tensors of shape (samples, timesteps, features), is typically processed by recurrent layers such as an LSTM layer. Image data, stored in 4D tensors, is usually processed by 2D convolution layers (Conv2D).

from keras import layers

# A dense layer with 32 output units
layer = layers.Dense(32, input_shape=(784,))

Once the network architecture is defined, you still have to choose two more things:

  • Loss function (objective function)—The quantity that will be minimized during training. It represents a measure of success for the task at hand.

  • Optimizer—Determines how the network will be updated based on the loss function. It implements a specific variant of stochastic gradient descent (SGD). A neural network that has multiple outputs may have multiple loss functions (one per output).

Choosing the right objective function for the right problem is extremely important: your network will take any shortcut it can to minimize the loss; so if the objective doesn’t fully correlate with success for the task at hand, your network will end up doing things you may not have wanted.

Fortunately, when it comes to common problems such as classification, regression, and sequence prediction, there are simple guidelines you can follow to choose the correct loss. For instance, you’ll use binary crossentropy for a two-class classification problem, categorical crossentropy for a many-class classification problem, mean-squared error for a regression problem, connectionist temporal classification (CTC) for a sequence-learning problem, and so on. Only when you’re working on truly new research problems will you have to develop your own objective functions. In the next few chapters, we’ll detail explicitly which loss functions to choose for a wide range of common tasks.
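
In Keras, both choices are made in a single compile call. A sketch for two of the cases above (assuming a model object has already been built; the string identifiers are standard Keras names):

# Two-class classification: binary crossentropy with the rmsprop optimizer
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])

# Regression: mean squared error, monitoring mean absolute error
model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])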

Introduction to Keras

There are two ways to define a model: using the Sequential class (only for linear stacks of layers, which is the most common network architecture by far) or the functional API (for directed acyclic graphs of layers, which lets you build completely arbitrary architectures).
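
The same small model expressed both ways (a minimal sketch; the layer sizes are illustrative):

from keras import models, layers

# Sequential class: a linear stack of layers
model = models.Sequential()
model.add(layers.Dense(32, activation='relu', input_shape=(784,)))
model.add(layers.Dense(10, activation='softmax'))

# Functional API: layers are called on tensors, like functions
input_tensor = layers.Input(shape=(784,))
x = layers.Dense(32, activation='relu')(input_tensor)
output_tensor = layers.Dense(10, activation='softmax')(x)
model = models.Model(inputs=input_tensor, outputs=output_tensor)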

Later chapters cover what types of network architectures work for different kinds of problems, how to pick the right learning configuration, and how to tweak a model until it gives the results you want to see.

Setting up a deep-learning workstation

Jupyter notebooks: the preferred way to run deep-learning experiments

With notebooks, you don’t have to rerun all of your previous code if something goes wrong late in an experiment.

Classifying movie reviews: a binary classification example

There are two key architecture decisions to be made about such a stack of Dense layers:

  • How many layers to use
  • How many hidden units to choose for each layer

In chapter 4, you’ll learn formal principles to guide you in making these choices.

A relu (rectified linear unit) is a function meant to zero out negative values.

Figure 3.4 The rectified linear unit function

A sigmoid, in contrast, “squashes” arbitrary values into the [0, 1] interval, outputting something that can be interpreted as a probability.

Figure 3.5 The sigmoid function
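
Both functions are one-liners (a NumPy sketch of their definitions):

import numpy as np

def relu(x):
    return np.maximum(x, 0.)        # zeros out negative values

def sigmoid(x):
    return 1. / (1. + np.exp(-x))   # squashes any value into [0, 1]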

What are activation functions, and why are they necessary?

Without an activation function like relu (also called a non-linearity), the Dense layer would consist of two linear operations—a dot product and an addition:

output = dot(W, input) + b

In order to get access to a much richer hypothesis space that would benefit from deep representations, you need a non-linearity, or activation function. relu is the most popular activation function in deep learning, but there are many other candidates, which all come with similarly strange names: prelu, elu, and so on.
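
In the same pseudocode notation as above, adding relu gives the layer its non-linearity:

output = relu(dot(W, input) + b)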

Crossentropy is usually the best choice when you’re dealing with models that output probabilities. Crossentropy is a quantity from the field of Information Theory that measures the distance between probability distributions or, in this case, between the ground-truth distribution and your predictions.

Wrapping up
  • In a binary classification problem (two output classes), your network should end with a Dense layer with one unit and a sigmoid activation: the output of your network should be a scalar between 0 and 1, encoding a probability.
  • With such a scalar sigmoid output on a binary classification problem, the loss function you should use is binary_crossentropy.
  • The rmsprop optimizer is generally a good enough choice, whatever your problem. That’s one less thing for you to worry about.
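
A sketch that ties these three points together, in the style of the chapter’s IMDB example (the 10,000-dimensional input and 16-unit layers are that example’s choices):

from keras import models, layers

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))    # scalar probability output
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])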

Classifying newswires: a multiclass classification example

Because you have many classes, this problem is an instance of multiclass classification; and because each data point should be classified into only one category, the problem is more specifically an instance of single-label, multiclass classification. If each data point could belong to multiple categories (in this case, topics), you’d be facing a multilabel, multiclass classification problem.

In the previous example, you used 16-dimensional intermediate layers, but a 16-dimensional space may be too limited to learn to separate 46 different classes: such small layers may act as information bottlenecks, permanently dropping relevant information.

from keras import models
from keras import layers
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))

There are two other things you should note about this architecture:

  • You end the network with a Dense layer of size 46. This means for each input sample, the network will output a 46-dimensional vector. Each entry in this vector (each dimension) will encode a different output class.

  • The last layer uses a softmax activation. You saw this pattern in the MNIST example. It means the network will output a probability distribution over the 46 different output classes—for every input sample, the network will produce a 46-dimensional output vector, where output[i] is the probability that the sample belongs to class i. The 46 scores will sum to 1.

The best loss function to use in this case is categorical_crossentropy. It measures the distance between two probability distributions: here, between the probability distribution output by the network and the true distribution of the labels. By minimizing the distance between these two distributions, you train the network to output something as close as possible to the true labels.
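
Compiling the model above with this loss (a sketch using the book-era Keras API):

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])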

Plotting the training and validation loss:

import matplotlib.pyplot as plt
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(loss) + 1)
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

Plotting the training and validation accuracy:

plt.clf()  # Clears the figure
acc = history.history['acc']
val_acc = history.history['val_acc']
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

The only thing this approach would change is the choice of the loss function. The loss function used in listing 3.21, categorical_crossentropy, expects the labels to follow a categorical encoding. With integer labels, you should use sparse_categorical_crossentropy.
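
A sketch of the integer-label alternative (assuming train_labels holds integer class indices such as 3, rather than one-hot vectors):

model.compile(optimizer='rmsprop',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])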

The importance of having sufficiently large intermediate layers

We mentioned earlier that because the final outputs are 46-dimensional, you should avoid intermediate layers with many fewer than 46 hidden units.

Wrapping up
  • If you’re trying to classify data points among N classes, your network should end with a Dense layer of size N.

  • In a single-label, multiclass classification problem, your network should end with a softmax activation so that it will output a probability distribution over the N output classes.

  • Categorical crossentropy is almost always the loss function you should use for such problems. It minimizes the distance between the probability distributions output by the network and the true distribution of the targets.

  • There are two ways to handle labels in multiclass classification:

    • Encoding the labels via categorical encoding (also known as one-hot encoding) and using categorical_crossentropy as a loss function
    • Encoding the labels as integers and using the sparse_categorical_crossentropy loss function
  • If you need to classify data into a large number of categories, you should avoid creating information bottlenecks in your network due to intermediate layers that are too small.

Predicting house prices: a regression example

Each feature in the input data (for example, the crime rate) has a different scale. For instance, some values are proportions, which take values between 0 and 1; others take values between 1 and 12, others between 0 and 100, and so on.

It would be problematic to feed into a neural network values that all take wildly different ranges. The network might be able to automatically adapt to such heterogeneous data, but it would definitely make learning more difficult. A widespread best practice to deal with such data is to do feature-wise normalization: for each feature in the input data (a column in the input data matrix), you subtract the mean of the feature and divide by the standard deviation, so that the feature is centered around 0 and has a unit standard deviation.
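
In code, with the statistics computed on the training data only (a sketch assuming NumPy arrays train_data and test_data, as in the chapter’s example):

mean = train_data.mean(axis=0)
train_data -= mean
std = train_data.std(axis=0)
train_data /= std

test_data -= mean    # never use statistics computed on the test set
test_data /= std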

In general, the less training data you have, the worse overfitting will be, and using a small network is one way to mitigate overfitting.

The mse loss function (mean squared error) computes the square of the difference between the predictions and the targets; it is a widely used loss function for regression problems.

You’re also monitoring a new metric during training: mean absolute error (MAE). It’s the absolute value of the difference between the predictions and the targets.
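
The corresponding compilation step (a sketch; 'mae' is the book-era Keras name for the mean absolute error metric):

model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])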

K-fold validation
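
A minimal sketch of the K-fold loop (assuming NumPy arrays train_data and train_targets and a build_model() helper that returns a freshly compiled model, as in the chapter’s example; the epoch count and batch size are illustrative):

import numpy as np

k = 4
num_val = len(train_data) // k
scores = []
for i in range(k):
    # Fold i becomes the validation split
    val_data = train_data[i * num_val: (i + 1) * num_val]
    val_targets = train_targets[i * num_val: (i + 1) * num_val]
    # Everything else becomes the training split
    partial_train_data = np.concatenate(
        [train_data[:i * num_val], train_data[(i + 1) * num_val:]], axis=0)
    partial_train_targets = np.concatenate(
        [train_targets[:i * num_val], train_targets[(i + 1) * num_val:]], axis=0)
    model = build_model()    # a fresh, untrained model for each fold
    model.fit(partial_train_data, partial_train_targets,
              epochs=100, batch_size=16, verbose=0)
    val_mse, val_mae = model.evaluate(val_data, val_targets, verbose=0)
    scores.append(val_mae)

print(np.mean(scores))    # the average fold score estimates generalization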

Wrapping up
  • Regression is done using different loss functions than what we used for classification. Mean squared error (MSE) is a loss function commonly used for regression.

  • Similarly, evaluation metrics to be used for regression differ from those used for classification; naturally, the concept of accuracy doesn’t apply for regression. A common regression metric is mean absolute error (MAE).

  • When features in the input data have values in different ranges, each feature should be scaled independently as a preprocessing step.

  • When there is little data available, using K-fold validation is a great way to reliably evaluate a model.

  • When little training data is available, it’s preferable to use a small network with few hidden layers (typically only one or two), in order to avoid severe overfitting.

4. Fundamentals of machine learning

Four branches of machine learning

Mini-batch or batch—A small set of samples (typically between 8 and 128) that are processed simultaneously by the model. The number of samples is often a power of 2, to facilitate memory allocation on GPU. When training, a mini-batch is used to compute a single gradient-descent update applied to the weights of the model.

Evaluating machine-learning models

In machine learning, the goal is to achieve models that generalize—that perform well on never-before-seen data—and overfitting is the central obstacle.

three classic evaluation recipes:

  • simple hold-out validation (sketched after this list)
  • K-fold validation
  • iterated K-fold validation with shuffling
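
A sketch of simple hold-out validation (the array name and split size are illustrative):

import numpy as np

np.random.shuffle(data)                    # shuffle first, for representativeness
num_validation_samples = 10000
validation_data = data[:num_validation_samples]   # held out for tuning
training_data = data[num_validation_samples:]     # used for training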

Data representativeness: you usually should randomly shuffle your data before splitting it into training and test sets.

The arrow of time: If you’re trying to predict the future given the past (for example, tomorrow’s weather, stock movements, and so on), you should not randomly shuffle your data before splitting it, because doing so will create a temporal leak.

Redundancy in your data: Make sure your training set and validation set are disjoint.

Data preprocessing, feature engineering, and feature learning

Data preprocessing for neural networks

Data preprocessing aims at making the raw data at hand more amenable to neural networks. This includes vectorization, normalization, handling missing values, and feature extraction.

VALUE NORMALIZATION

In general, it isn’t safe to feed into a neural network data that takes relatively large values (for example, multidigit integers, which are much larger than the initial values taken by the weights of a network) or data that is heterogeneous (for example, data where one feature is in the range 0–1 and another is in the range 100–200). Doing so can trigger large gradient updates that will prevent the network from converging.

To make learning easier for your network, your data should have the following characteristics:

  • Take small values—Typically, most values should be in the 0–1 range.

  • Be homogeneous—That is, all features should take values in roughly the same range.

Additionally, the following stricter normalization practice is common and can help, although it isn’t always necessary:

  • Normalize each feature independently to have a mean of 0.
  • Normalize each feature independently to have a standard deviation of 1.

HANDLING MISSING VALUES

In general, with neural networks, it’s safe to input missing values as 0, with the condition that 0 isn’t already a meaningful value. The network will learn from exposure to the data that the value 0 means missing data and will start ignoring the value.
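
That convention is a one-liner with NumPy (a sketch assuming missing entries are stored as NaN):

import numpy as np

x = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan]])
x[np.isnan(x)] = 0.    # encode "missing" as 0, since 0 isn't otherwise meaningful here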

Feature engineering

Feature engineering is the process of using your own knowledge about the data and about the machine-learning algorithm at hand (in this case, a neural network) to make the algorithm work better by applying hardcoded (nonlearned) transformations to the data before it goes into the model. In many cases, it isn’t reasonable to expect a machine learning model to be able to learn from completely arbitrary data. The data needs to be presented to the model in a way that will make the model’s job easier.

That’s the essence of feature engineering: making a problem easier by expressing it in a simpler way. It usually requires understanding the problem in depth.

Before deep learning, feature engineering used to be critical, because classical shallow algorithms didn’t have hypothesis spaces rich enough to learn useful features by themselves.

Fortunately, modern deep learning removes the need for most feature engineering, because neural networks are capable of automatically extracting useful features from raw data.

Good features let you solve a problem with far less data.

Overfitting and underfitting

You must evaluate an array of different architectures (on your validation set, not on your test set, of course) in order to find the correct model size for your data. The general workflow to find an appropriate model size is to start with relatively few layers and parameters, and increase the size of the layers or add new layers until you see diminishing returns with regard to validation loss.

Adding weight regularization

You may be familiar with the principle of Occam’s razor: given two explanations for something, the explanation most likely to be correct is the simplest one—the one that makes fewer assumptions.

Thus a common way to mitigate overfitting is to put constraints on the complexity of a network by forcing its weights to take only small values, which makes the distribution of weight values more regular. This is called weight regularization, and it’s done by adding to the loss function of the network a cost associated with having large weights.

This cost comes in two flavors:

  • L1 regularization—The cost added is proportional to the absolute value of the weight coefficients (the L1 norm of the weights).

  • L2 regularization—The cost added is proportional to the square of the value of the weight coefficients (the L2 norm of the weights).

Note that because this penalty is only added at training time, the loss for this network will be much higher at training than at test time.
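
In Keras, the penalty is added per layer via a kernel_regularizer argument (a sketch mirroring the chapter’s IMDB example; the 0.001 factor means every weight coefficient adds 0.001 * weight_coefficient_value ** 2 to the total loss):

from keras import models, layers, regularizers

model = models.Sequential()
model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                       activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                       activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

# regularizers.l1(0.001) and regularizers.l1_l2(l1=0.001, l2=0.001) are the L1
# and combined variants.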

Adding dropout

Let’s say a given layer would normally return a vector [0.2, 0.5, 1.3, 0.8, 1.1] for a given input sample during training. After applying dropout, this vector will have a few zero entries distributed at random: for example, [0, 0.5, 1.3, 0, 1.1]. The dropout rate is the fraction of the features that are zeroed out; it’s usually set between 0.2 and 0.5. At test time, no units are dropped out; instead, the layer’s output values are scaled down by a factor equal to the dropout rate, to balance for the fact that more units are active than at training time.
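
In Keras, dropout is introduced via the Dropout layer, applied to the output of the layer right before it (a sketch; the 0.5 rate is a common choice):

model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dropout(0.5))    # zeros out half of the features during training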

To recap, these are the most common ways to prevent overfitting in neural networks:

  • Get more training data.
  • Reduce the capacity of the network.
  • Add weight regularization.
  • Add dropout.

The universal workflow of machine learning

For balanced-classification problems, where every class is equally likely, accuracy and area under the receiver operating characteristic curve (ROC AUC) are common metrics. For class-imbalanced problems, you can use precision and recall. For ranking problems or multilabel classification, you can use mean average precision. And it isn’t uncommon to have to define your own custom metric by which to measure success. To get a sense of the diversity of machine-learning success metrics and how they relate to different problem domains, it’s helpful to browse the data science competitions on Kaggle (https://kaggle.com); they showcase a wide range of problems and evaluation metrics.

Preparing your data
  • As you saw previously, your data should be formatted as tensors.
  • The values taken by these tensors should usually be scaled to small values: for example, in the [-1, 1] range or [0, 1] range.
  • If different features take values in different ranges (heterogeneous data), then the data should be normalized.
  • You may want to do some feature engineering, especially for small-data problems.
Developing a model that does better than a baseline

Assuming that things go well, you need to make three key choices to build your first working model:

  • Last-layer activation
  • Loss function
  • Optimization configuration—What optimizer will you use? What will its learning rate be? In most cases, it’s safe to go with rmsprop and its default learning rate.

Problem type                             Last-layer activation   Loss function
Binary classification                    sigmoid                 binary_crossentropy
Multiclass, single-label classification  softmax                 categorical_crossentropy
Multiclass, multilabel classification    sigmoid                 binary_crossentropy
Regression to arbitrary values           None                    mse
Regression to values between 0 and 1     sigmoid                 mse or binary_crossentropy

Scaling up: developing a model that overfits
  • Add layers.
  • Make the layers bigger.
  • Train for more epochs.
Regularizing your model and tuning your hyperparameters
  • Add dropout.
  • Try different architectures: add or remove layers.
  • Add L1 and/or L2 regularization.
  • Try different hyperparameters (such as the number of units per layer or the learning rate of the optimizer) to find the optimal configuration.
  • Optionally, iterate on feature engineering: add new features, or remove features that don’t seem to be informative.

PART 2 - DEEP LEARNING IN PRACTICE

5. Deep learning for computer vision

Introduction to convnets

Training a convnet from scratch on a small dataset

The relevance of deep learning for small-data problems

Using a pretrained convnet

Visualizing what convnets learn

6. Deep learning for text and sequences

Working with text data

Understanding recurrent neural networks

Advanced use of recurrent neural networks

Sequence processing with convnets

7. Advanced deep-learning best practices

Going beyond the Sequential model: the Keras functional API

Inspecting and monitoring deep-learning models using Keras callbacks and TensorBoard

Getting the most out of your models

8. Generative deep learning

Text generation with LSTM

DeepDream

Neural style transfer

Generating images with variational autoencoders

Introduction to generative adversarial networks

9. Conclusions

Key concepts in review

The limitations of deep learning

The future of deep learning

Staying up to date in a fast-moving field

Final words