Machine Learning Crash Course
ML Concepts
Introduction to ML
Why learn machine learning? First, it gives you a tool to reduce the time you spend programming.
Second, it will allow you to customize your products, making them better for specific groups of people.
Third, machine learning lets you solve problems that you, as a programmer, have no idea how to solve by hand.
Finally, machine learning changes the way you think about a problem. Software engineers are trained to think logically and mathematically; we use assertions to prove that properties of our programs are correct. With machine learning, the focus shifts from a mathematical science to a natural science: we’re making observations about an uncertain world, running experiments, and using statistics, not logic, to analyze the results of those experiments. The ability to think like a scientist will expand your horizons and open up new areas that you couldn’t explore without it.
Framing (框架处理)
What is (supervised) machine learning? Concisely put, it is the following:
- ML systems learn how to combine input to produce useful predictions on never-before-seen data.
Labels (标签)
A label is the thing we’re predicting—the y variable in simple linear regression.
Features (特征)
A feature is an input variable—the x variable in simple linear regression. A simple machine learning project might use a single feature, while a more sophisticated machine learning project could use millions of features.
Examples (样本)
An example is a particular instance of data, x. (We put x in boldface to indicate that it is a vector.) We break examples into two categories:
- labeled examples: A labeled example includes both feature(s) and the label.
- unlabeled examples: An unlabeled example contains features but not the label.
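As a concrete sketch (the housing feature names and values below are made up for illustration), a labeled example pairs features with a known label, while an unlabeled one has features only:

```python
# Made-up housing data, purely illustrative: feature names and values are assumptions.
labeled_example = {
    "features": {"sq_ft": 1250.0, "bedrooms": 3},
    "label": 450000.0,   # the known y value (e.g., sale price)
}
unlabeled_example = {
    "features": {"sq_ft": 980.0, "bedrooms": 2},
    # no label: the trained model will predict y' for this example
}
```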
Models (模型)
A model defines the relationship between features and label.
- Training means creating or learning the model. That is, you show the model labeled examples and enable the model to gradually learn the relationships between features and label.
- Inference means applying the trained model to unlabeled examples. That is, you use the trained model to make useful predictions ($y'$).
Regression vs. classification (回归与分类)
A regression model predicts continuous values, such as a price or a probability.
A classification model predicts discrete values (离散值), such as whether a given email is spam or not spam.
Descending into ML
Linear Regression
$$ y' = b + w_1x_1 + w_2x_2 + w_3x_3 $$
where:
- $y'$ is the predicted label (a desired output).
- $b$ is the bias (the y-intercept), sometimes referred to as $w_0$.
- $w_1$ is the weight of feature 1. Weight is the same concept as the “slope” m in the traditional equation of a line.
- $x_1$ is a feature (a known input).
- A more sophisticated model might rely on multiple features, each having a separate weight ($w_1$, $w_2$, etc.).
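As a minimal sketch (the weights, bias, and feature values below are made up), the prediction for one example with three features is just the weighted sum plus the bias:

```python
import numpy as np

# Hypothetical learned parameters (made-up values).
b = 0.5                          # bias, sometimes written w_0
w = np.array([1.2, -0.4, 3.0])   # weights w_1, w_2, w_3

# One example with features x_1, x_2, x_3.
x = np.array([2.0, 1.0, 0.5])

# y' = b + w_1*x_1 + w_2*x_2 + w_3*x_3
y_prime = b + np.dot(w, x)
print(y_prime)  # 0.5 + 2.4 - 0.4 + 1.5 = 4.0
```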
Training and Loss
Training a model simply means learning (determining) good values for all the weights and the bias from labeled examples. In supervised learning, a machine learning algorithm builds a model by examining many examples and attempting to find a model that minimizes loss; this process is called empirical risk minimization (经验风险最小化).
Loss is the penalty for a bad prediction. That is, loss is a number indicating how bad the model’s prediction was on a single example. If the model’s prediction is perfect, the loss is zero; otherwise, the loss is greater. The goal of training a model is to find a set of weights and biases that have low loss, on average, across all examples.
You might be wondering whether you could create a mathematical function—a loss function—that would aggregate the individual losses in a meaningful fashion.
Squared loss: a popular loss function
The linear regression models we’ll examine here use a loss function called squared loss (also known as L2 loss).
Mean square error (MSE) is the average squared loss per example over the whole dataset. To calculate MSE, sum up all the squared losses for individual examples and then divide by the number of examples: $$ MSE = \frac{1}{N}\sum_{(x,y)\in{D}} (y - prediction(x))^2 $$
- $(x,y)$ is an example in which
  - $x$ is the set of features (for example, chirps/minute, age, gender) that the model uses to make predictions.
  - $y$ is the example’s label (for example, temperature).
- $prediction(x)$ is a function of the weights and bias in combination with the set of features x.
- $D$ is a data set containing many labeled examples, which are $(x,y)$ pairs.
- $N$ is the number of examples in $D$.
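In code, the same formula is a one-liner; the labels and predictions below are made-up numbers used only to exercise the formula:

```python
import numpy as np

# Made-up labels y and model outputs prediction(x) for N = 4 examples.
y = np.array([3.0, -0.5, 2.0, 7.0])
predictions = np.array([2.5, 0.0, 2.0, 8.0])

# MSE = (1/N) * sum of (y - prediction(x))^2 over the dataset D.
mse = np.mean((y - predictions) ** 2)
print(mse)  # (0.25 + 0.25 + 0.0 + 1.0) / 4 = 0.375
```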
Reducing Loss (降低损失)
An Iterative Approach (迭代方法)
A machine learning model is trained by starting with an initial guess for the weights and bias and iteratively adjusting those guesses until learning the weights and bias with the lowest possible loss.
Usually, you iterate until overall loss stops changing or at least changes extremely slowly. When that happens, we say that the model has converged (收敛).
Gradient Descent (梯度下降法)
For the kind of regression problems we’ve been examining, the resulting plot of loss vs. $w_1$ will always be convex (凸形). In other words, the plot will always be bowl-shaped. Convex problems have only one minimum; that is, only one place where the slope is exactly 0. That minimum is where the loss function converges.
Calculating the loss function for every conceivable value of $w_{1}$ over the entire data set would be an inefficient way of finding the convergence point. Let’s examine a better mechanism—very popular in machine learning—called gradient descent.
The first stage in gradient descent is to pick a starting value (a starting point) for $w_{1}$. The starting point doesn’t matter much; therefore, many algorithms simply set $w_{1}$ to 0 or pick a random value.
The gradient descent algorithm then calculates the gradient of the loss curve at the starting point. The gradient of the loss is equal to the derivative (slope) of the curve and tells you which way is “warmer” or “colder.” When there are multiple weights, the gradient is a vector of partial derivatives with respect to the weights.
Note that a gradient is a vector, so it has both of the following characteristics:
- a direction
- a magnitude (大小)
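For the squared-loss, one-feature model above, the gradient is the vector of partial derivatives of the MSE with respect to $w_1$ and $b$ (a standard calculus result, written out here for reference):

$$ \nabla L(w_1, b) = \left( \frac{\partial L}{\partial w_1},\ \frac{\partial L}{\partial b} \right) = \left( \frac{2}{N}\sum_{(x,y)\in D} \big(prediction(x) - y\big)\,x,\ \ \frac{2}{N}\sum_{(x,y)\in D} \big(prediction(x) - y\big) \right) $$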
Learning Rate (学习速率)
As noted, the gradient vector has both a direction and a magnitude. Gradient descent algorithms multiply the gradient by a scalar known as the learning rate (also sometimes called step size) to determine the next point. For example, if the gradient magnitude is 2.5 and the learning rate is 0.01, then the gradient descent algorithm will pick the next point 0.025 away from the previous point.
Hyperparameters are the knobs that programmers tweak in machine learning algorithms. Most machine learning programmers spend a fair amount of time tuning the learning rate.
If you pick a learning rate that is too small, learning will take too long. Conversely, if you specify a learning rate that is too large, the next point will perpetually bounce haphazardly across the bottom of the well like a quantum mechanics experiment gone horribly wrong.
The ideal learning rate:
- The ideal learning rate in one dimension is $\frac{1}{f''(x)}$ (the inverse of the second derivative of $f(x)$ at $x$).
- The ideal learning rate for 2 or more dimensions is the inverse of the Hessian (matrix of second partial derivatives).
- The story for general convex functions is more complex.
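Putting this together, here is a minimal gradient descent sketch for the one-feature model under squared loss; the data, the starting point of 0, the learning rate, and the convergence threshold are all assumptions made for illustration:

```python
import numpy as np

# Tiny made-up dataset: the labels follow y = 1 + 2 * x exactly.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

w1, b = 0.0, 0.0        # starting point; many algorithms simply start at 0
learning_rate = 0.05    # step size (a hyperparameter)

prev_loss = float("inf")
for step in range(10_000):
    error = (b + w1 * x) - y
    loss = np.mean(error ** 2)            # MSE on the whole dataset
    if abs(prev_loss - loss) < 1e-12:     # loss has stopped changing: converged
        break
    prev_loss = loss
    # Gradient = vector of partial derivatives of the loss w.r.t. w1 and b.
    grad_w1 = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    # Step in the direction of the negative gradient, scaled by the learning rate.
    w1 -= learning_rate * grad_w1
    b -= learning_rate * grad_b

print(w1, b)  # approaches 2 and 1
```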
Stochastic Gradient Descent (随机梯度下降法)
batch
In gradient descent, a batch is the total number of examples you use to calculate the gradient in a single iteration.
A large data set with randomly sampled examples probably contains redundant data. In fact, redundancy becomes more likely as the batch size grows. Some redundancy can be useful to smooth out noisy gradients, but enormous batches tend not to carry much more predictive value than large batches.
By choosing examples at random from our data set, we could estimate (albeit, noisily) a big average from a much smaller one. Stochastic gradient descent (SGD) takes this idea to the extreme: it uses only a single example (a batch size of 1) per iteration. Given enough iterations, SGD works but is very noisy. The term “stochastic” indicates that the one example comprising each batch is chosen at random.
Mini-batch stochastic gradient descent (mini-batch SGD) is a compromise between full-batch iteration and SGD. A mini-batch is typically between 10 and 1,000 examples, chosen at random. Mini-batch SGD reduces the amount of noise in SGD but is still more efficient than full-batch.
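A sketch of the same update using mini-batch SGD; the dataset, batch size, and learning rate below are assumptions (a batch size of 1 would be plain SGD):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Made-up dataset of 1,000 examples following y ≈ 1 + 2x plus a little noise.
x = rng.uniform(0.0, 4.0, size=1000)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, size=1000)

w1, b = 0.0, 0.0
learning_rate = 0.05
batch_size = 32          # batch_size = 1 would be stochastic gradient descent

for step in range(2000):
    # Choose a mini-batch of examples at random from the dataset.
    idx = rng.integers(0, len(x), size=batch_size)
    xb, yb = x[idx], y[idx]
    error = (b + w1 * xb) - yb
    w1 -= learning_rate * 2 * np.mean(error * xb)
    b -= learning_rate * 2 * np.mean(error)

print(w1, b)  # noisy estimates that settle near 2 and 1
```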
First Steps with TF
Programming Exercises
If you are unfamiliar with NumPy or pandas, please begin by doing the following two Colab exercises:
- NumPy UltraQuick Tutorial Colab exercise, which provides all the NumPy information you need for this course.
- pandas UltraQuick Tutorial Colab exercise, which provides all the pandas information you need for this course.
After gaining competency in NumPy and pandas, do the following two Colab exercises to explore linear regression and hyperparameter tuning in tf.keras:
- Linear Regression with Synthetic Data Colab exercise, which explores linear regression with a toy dataset.
- Linear Regression with a Real Dataset Colab exercise, which guides you through the kinds of analysis you should do on a real dataset.
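For reference, here is a minimal tf.keras sketch in the same spirit as those exercises; this is my own toy example, not the Colab itself, and the synthetic data, optimizer, learning rate, epochs, and batch size are all assumptions:

```python
import numpy as np
import tensorflow as tf

# Synthetic one-feature dataset: label is roughly 1 + 2 * feature.
features = np.arange(0.0, 10.0).reshape(-1, 1).astype("float32")
labels = (1.0 + 2.0 * features + np.random.normal(0.0, 0.2, features.shape)).astype("float32")

# A single Dense unit on one input is exactly the linear model y' = b + w_1 * x_1.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(units=1),
])
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),  # learning rate is a hyperparameter to tune
    loss="mean_squared_error",
)
model.fit(features, labels, epochs=500, batch_size=4, verbose=0)

print(model.get_weights())  # kernel ≈ [[2.0]], bias ≈ [1.0]
```

Tuning the learning rate, number of epochs, and batch size while watching the loss curve is the kind of experimentation those Colab exercises walk through.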
Generalization (泛化)
- Video Lecture
- Peril of Overfitting
Training and Test Sets
- Video Lecture
- Splitting Data
- Playground Exercise
Validation Set
- Check Your Intuition
- Video Lecture
- Another Partition
- Programming Exercise
Representation (表示法)
- Video Lecture
- Feature Engineering
- Qualities of Good Features
- Cleaning Data
Feature Crosses (特征组合)
- Video Lecture
- Encoding Nonlinearity
- Crossing One-Hot Vectors
- Playground Exercises
- Programming Exercise
- Check Your Understanding
Regularization: Simplicity (正则化:简单性)
- Playground Exercise: Overcrossing?
- Video Lecture
- L2 Regularization
- Lambda
- Playground Exercise: L2 Regularization
- Check Your Understanding
Logistic Regression
- Video Lecture
- Calculating a Probability
- Loss and Regularization
Classification
- Video Lecture
- Thresholding
- True vs. False; Positive vs. Negative
- Accuracy
- Precision and Recall
- Check Your Understanding: Accuracy, Precision, Recall
- ROC Curve and AUC
- Check Your Understanding: ROC and AUC
- Prediction Bias
- Programming Exercise
Regularization: Sparsity (正则化:稀疏性)
- Video Lecture
- L1 Regularization
- Playground Exercise
- Check Your Understanding
Neural Networks
- Video Lecture
- Structure
- Playground Exercises
- Programming Exercise
Training Neural Nets
- Video Lecture
- Best Practices
Multi-Class Neural Nets
- Video Lecture
- One vs. All
- Softmax
- Programming Exercise
Embeddings (嵌入)
- Video Lecture
- Motivation from Collaborative Filtering
- Categorical Input Data
- Translating to a Lower-Dimensional Space
- Obtaining Embeddings
ML Engineering
Production ML Systems
Static vs. Dynamic Training
- Video Lecture
- Check Your Understanding
Static vs. Dynamic Inference
- Video Lecture
- Check Your Understanding
Data Dependencies
- Video Lecture
- Check Your Understanding
Fairness
- Video Lecture
- Types of Bias
- Identifying Bias
- Evaluating for Bias
- Programming Exercise
- Check Your Understanding