Vishal Ramesh | Foundationalml

Gradient Descent

Many functions have minima and maxima. Gradient Descent is one way to find a minimum.

Consider a function \(J(\theta)\). To minimize it, we start at some random \(\theta\), compute the gradient at that point, then update \(\theta\) by taking a step in the direction opposite to the gradient, with the step size controlled by \(\alpha\).

This is represented as \(\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)\)

Performing this operation repeatedly will bring us toward a minimum.

It is crucial to choose the right step size. Gradient descent with a low \(\alpha\) will take a long time to reach the minimum, while a high \(\alpha\) might overshoot it.
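The effect of the step size is easy to see on a toy problem. The sketch below is only illustrative: the function \(J(\theta) = (\theta - 3)^2\), the starting point, the number of steps and the three values of \(\alpha\) are all assumptions chosen for this example.

```python
def gradient_descent_1d(alpha, theta=0.0, steps=25):
    """Minimize the toy function J(theta) = (theta - 3)^2, whose minimum is at theta = 3."""
    for _ in range(steps):
        grad = 2.0 * (theta - 3.0)    # dJ/dtheta at the current theta
        theta = theta - alpha * grad  # step in the direction opposite to the gradient
    return theta


small = gradient_descent_1d(alpha=0.01)  # moves slowly, still far from 3
good = gradient_descent_1d(alpha=0.1)    # ends up close to 3
large = gradient_descent_1d(alpha=1.1)   # overshoots further on every step

print(small, good, large)
```

With the small \(\alpha\), 25 steps are not enough to get near the minimum, and with the large \(\alpha\) every step lands further away than the previous one.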

When we process every sample in the training set to perform one step in the descent, it is called Batch Gradient Descent. This has one disadvantage. For large sets of data, the number of computations for every step or update is very large since we need to compute the gradient for the entire data set.

Linear Models

Linear Regression

Linear Regression is a method used to model the relationship between a dependent variable and one or more independent variables by fitting a Linear Equation to the data. We basically try to find the best line (or plane or hyperplane depending on the dimensionality) that represents our data.

How Does it Work?

A linear equation with one independent variable (or feature) looks like this. \(Y = \theta_0 + \theta_1\cdot X\)

When you have more than one feature, it becomes

\[Y = \theta_0 +\theta_1 \cdot X_1 + \theta_2 \cdot X_2 + ...\]

which can be simplified into

\[Y = \sum_{j=0}^{n} \theta_j \cdot X_j \quad \text{where } X_0 = 1\]

Also written as \(Y = h(X)\).

The above summation can be represented using matrices (for \(n=2\)) as \(Y = \theta^T X\), where

\[\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \end{bmatrix} \quad \text{and} \quad X = \begin{bmatrix} X_0 \\ X_1 \\ X_2 \end{bmatrix}\]

\(\theta\) holds the parameters (or weights) of the learning algorithm. The objective of the learning algorithm is to choose parameters \(\theta\) that allow us to make good predictions for \(Y\), i.e., choose \(\theta\) such that \(h(x) \approx y\) for the training samples.

To achieve this, we need to minimize the difference between \(h(x)\) and \(y\) over the training samples. This difference is called the loss (or cost) and for linear regression, it is defined using the Mean Square Error (MSE). So our goal is to minimize the loss by adjusting \(\theta\).

\[\underset{\theta}{\text{minimize}}\ J(\theta)\]

where \(J(\theta) = \frac{1}{2}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2\), \(m\) is the number of training samples, and \((x^{(i)}, y^{(i)})\) is the \(i\)-th training sample.

The \(\frac{1}{2}\) is present just to make the gradient computation easier. When you differentiate the squared term, the \(\frac{1}{2}\) cancels the factor of \(2\) in the result.
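To make the cost concrete, here is a minimal sketch that computes half the sum of squared errors for a given \(\theta\). The function names `predict` and `cost` and the tiny dataset are assumptions made for illustration. Plain Python lists are used to keep the example dependency-free.

```python
def predict(theta, x):
    """h_theta(x) = sum_j theta_j * x_j, where x already includes x_0 = 1."""
    return sum(t * xj for t, xj in zip(theta, x))


def cost(theta, X, y):
    """Half the sum of squared errors over all training samples."""
    return 0.5 * sum((predict(theta, xi) - yi) ** 2 for xi, yi in zip(X, y))


# Hypothetical data generated from y = 1 + 2 * x; each row is [x_0, x_1].
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]

print(cost([1.0, 2.0], X, y))  # parameters that fit the data exactly
print(cost([0.0, 0.0], X, y))  # parameters that fit poorly, so the cost is larger
```

The closer the predictions are to the targets, the smaller the cost, which is why minimizing it over \(\theta\) gives a good fit.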

Optimizing using Gradient Descent

There are a lot of optimizers that can be used to minimize the cost function. In this case, let’s look at Gradient Descent.

For the cost function \(J(\theta)\) and model parameter \(\theta_j\), Gradient Descent is written as

\[\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)\]

where \(\alpha\) is the learning rate.

So, continuing with the computation of the new value for \(\theta_j\),

\[\frac{\partial}{\partial\theta_j} J(\theta) = \frac{\partial}{\partial\theta_j} \frac{1}{2}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2\]

By the sum rule of differentiation, each term of the sum can be differentiated separately. So, for the sake of simplicity, we drop the \(\Sigma\) and compute the derivative for a single training sample,

\[= \frac{\partial}{\partial\theta_j} \frac{1}{2}(h_\theta(x) - y)^2\]
\[= 2\cdot\frac{1}{2}(h_\theta(x)-y)\cdot\frac{\partial}{\partial\theta_j}(h_\theta(x)-y)\]
\[= (h_\theta(x)-y)\cdot\frac{\partial}{\partial\theta_j}(\theta_0x_0+\theta_1x_1+\dots+\theta_nx_n-y)\]

Inside the partial derivative, only the term \(\theta_jx_j\) depends on \(\theta_j\). The partial derivative of every other term is \(0\), and for \(\theta_jx_j\) it is \(x_j\). Therefore, the above expression simplifies into

\[= (h_\theta(x)-y)\cdot x_j\]

That gives us \(\theta_j := \theta_j - \alpha(h_\theta(x)-y)\cdot x_j\).
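One way to sanity-check a derivation like this is to compare it against a numerical estimate. The sketch below does that for a single training sample. The helper names and the sample values are assumptions made for illustration, and the finite-difference estimate is only an approximation.

```python
def predict(theta, x):
    """h_theta(x) for one sample; x already includes x_0 = 1."""
    return sum(t * xj for t, xj in zip(theta, x))


def sample_cost(theta, x, y):
    """Half the squared error for a single training sample."""
    return 0.5 * (predict(theta, x) - y) ** 2


def analytic_grad(theta, x, y, j):
    """The derived result: (h_theta(x) - y) * x_j."""
    return (predict(theta, x) - y) * x[j]


def numeric_grad(theta, x, y, j, eps=1e-6):
    """Central finite-difference estimate of the same partial derivative."""
    up = list(theta)
    down = list(theta)
    up[j] += eps
    down[j] -= eps
    return (sample_cost(up, x, y) - sample_cost(down, x, y)) / (2 * eps)


# Hypothetical parameters and a single hypothetical sample.
theta = [0.5, -1.0, 2.0]
x = [1.0, 3.0, -2.0]
y = 4.0

for j in range(len(theta)):
    print(j, analytic_grad(theta, x, y, j), numeric_grad(theta, x, y, j))
```

For each \(j\), the two printed values should agree to several decimal places.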

The above is for just one training sample. For the entire training set, we get

\[\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})\cdot x^{(i)}_j \quad \text{(Eq. 1)}\]

and the derivative of the cost function when defined using all the training samples is

\[\frac{\partial}{\partial\theta_j} J(\theta) = \sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})\cdot x^{(i)}_j\]

We’re including a \(\frac{1}{m}\) in the update to keep the step size from growing with the dataset. When we add the gradients of all the training samples, the step we take gets larger as the number of samples increases. To avoid this, we’re averaging the gradients using \(\frac{1}{m}\).

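Eq. 1 optimizes one parameter (the weight of one feature in the input). For training samples with \(n\) features, one step of Gradient Descent applies it to every parameter, with all the updates computed from the values of \(\theta\) before the step:

for j = 0, 1, ..., n
  update theta_j using Eq. 1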

This operation is also called Batch Gradient Descent because we process the entire dataset for every step in the descent.

Performing Gradient Descent steps repeatedly will eventually minimize the cost and give us a \(\theta\) that defines the best-fitting linear equation for the training data.

Additional Info

The cost function (MSE) is a convex quadratic function of \(\theta\). This means every local minimum is also the global minimum.
For Linear Regression, you can find the optimal \(\theta\) (the global minimum) in a single step using Normal Equations.
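The normal equations are commonly written as \(X^TX\theta = X^Ty\). As a rough sketch, for a model with a single feature this is a \(2 \times 2\) system that can be solved by hand. The function name `normal_equation_1d` and the data below are assumptions made for illustration, and the sketch assumes the \(x\) values are not all identical (otherwise the determinant is zero).

```python
def normal_equation_1d(xs, ys):
    """Solve (X^T X) theta = X^T y for the model y = theta_0 + theta_1 * x."""
    m = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # X^T X = [[m, sx], [sx, sxx]] and X^T y = [sy, sxy]; solve with Cramer's rule.
    det = m * sxx - sx * sx
    theta_0 = (sy * sxx - sx * sxy) / det
    theta_1 = (m * sxy - sx * sy) / det
    return theta_0, theta_1


# Hypothetical data generated from y = 1 + 2 * x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

print(normal_equation_1d(xs, ys))  # (1.0, 2.0)
```

No learning rate or iteration is involved here. The trade-off is that with many features, solving the system becomes the expensive part.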

Simple Implementation in Python

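What follows is a minimal sketch of batch gradient descent for linear regression under a few assumptions; it is not necessarily the implementation used in this series. It uses plain Python lists to stay dependency-free. The function names, the dataset, the learning rate and the number of steps are all chosen for illustration.

```python
def predict(theta, x):
    """h_theta(x) = sum_j theta_j * x_j, where x already includes x_0 = 1."""
    return sum(t * xj for t, xj in zip(theta, x))


def batch_gradient_descent(X, y, alpha=0.1, steps=2000):
    """Fit theta by repeatedly applying the averaged update to every parameter."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(steps):
        # Prediction error on every training sample, using the current theta.
        errors = [predict(theta, xi) - yi for xi, yi in zip(X, y)]
        # Averaged gradient for each parameter theta_j.
        grads = [sum(e * xi[j] for e, xi in zip(errors, X)) / m for j in range(n)]
        # Update all parameters at once, stepping against the gradient.
        theta = [t - alpha * g for t, g in zip(theta, grads)]
    return theta


# Hypothetical data generated from y = 1 + 2 * x; each row is [x_0, x_1].
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

theta = batch_gradient_descent(X, y)
print(theta)  # approaches [1.0, 2.0]
```

Every step loops over the whole dataset to build `errors`, which is the cost of the batch approach that grows with the size of the training set.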

Some Math for Machine Learning

NOTE: This is not a thorough coverage of all the math you’ll need for machine learning. It only includes concepts that will help you better understand the algorithms and their mathematical breakdown covered in the Foundational Machine Learning Series.

Mathematical Notations and Terminologies

A lot of the notations and terminologies are explained using programming analogies wherever possible.

Summation

Represented using \(\Sigma\), it is used to denote an iterative addition operation.

For example, \(\sum_{i=0}^{n} x_i\) is equivalent to

total = 0
for i in range(n + 1):  # i runs from 0 to n, inclusive
    total += x[i]

Derivative

The derivative of a function \(f(x)\) w.r.t. \(x\) is represented as \(\frac{d}{dx}f(x)\) or \(f'(x)\). For more info on derivatives, check here.

Maxima and Minima

In calculus, maxima and minima (collectively called extrema) are the “peaks” and “valleys” of a function.

Local Extrema: These are the peaks or valleys within a specific neighborhood. A function can have many of these.
Global (Absolute) Extrema: These are the single highest or lowest points over the entire domain of the function.

Normal Equations

What are Normal Equations?

Matrix Multiplication

What are Matrices?

A matrix is a rectangular arrangement of data. The entries can be numbers, variables, symbols or expressions, and they are arranged in rows and columns. Here’s an example: \(\text{numbers} = \begin{bmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{bmatrix}\) or \(\text{fruits} = \begin{bmatrix} \text{banana} & \text{apple} & \text{mango} \\ \text{jackfruit} & \text{tomato} & \text{papaya} \\ \end{bmatrix}\)

The shape of a matrix is represented as the number of rows \(\times\) number of columns. The shape of the matrix numbers is \(3 \times 3\) and the shape of the matrix fruits is \(2 \times 3\).

Matrix Multiplication

Consider two matrices \(A = \begin{bmatrix} a_{00} & a_{01} & a_{02} \\ a_{10} & a_{11} & a_{12} \\ a_{20} & a_{21} & a_{22} \end{bmatrix} , B = \begin{bmatrix} b_{00} & b_{01} & b_{02} \\ b_{10} & b_{11} & b_{12} \\ b_{20} & b_{21} & b_{22} \end{bmatrix}\)

To multiply two matrices, you take the first row of the first matrix and compute its dot product with the first column of the second matrix: multiply the corresponding entries and add up the results. This gives the entry in the first row and first column of the result. Repeat the process for every row and column pair to build out the entire result matrix.

The product of these matrices \(A \cdot B\) or just \(AB\) is as follows. \(AB = \begin{bmatrix} a_{00}b_{00} + a_{01}b_{10} + a_{02}b_{20} & a_{00}b_{01} + a_{01}b_{11} + a_{02}b_{21} & a_{00}b_{02} + a_{01}b_{12} + a_{02}b_{22} \\ a_{10}b_{00} + a_{11}b_{10} + a_{12}b_{20} & a_{10}b_{01} + a_{11}b_{11} + a_{12}b_{21} & a_{10}b_{02} + a_{11}b_{12} + a_{12}b_{22} \\ a_{20}b_{00} + a_{21}b_{10} + a_{22}b_{20} & a_{20}b_{01} + a_{21}b_{11} + a_{22}b_{21} & a_{20}b_{02} + a_{21}b_{12} + a_{22}b_{22} \end{bmatrix}\)

Check this out if you want a more visual explanation or just scroll to the end of this page for a visualizer.

Matrix multiplication is not commutative. So, in general, \(AB \neq BA\).
To be able to multiply two matrices, the number of columns of the first matrix should be equal to the number of rows of the second matrix.
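The row-by-column rule and the two properties above can be sketched in a few lines of plain Python. The function name `matmul` and the example matrices are assumptions made for illustration; in practice you would normally use a numerical library instead of nested lists.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    if len(A[0]) != len(B):
        raise ValueError("columns of A must equal rows of B")
    # Entry (i, j) is the dot product of row i of A with column j of B.
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]

print(matmul(A, B))  # [[2, 1], [4, 3]]
print(matmul(B, A))  # [[3, 4], [1, 2]]
```

The two products differ, which shows that the order of the operands matters.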

Why is this important in ML?

Machine Learning has a lot of linear algebra. We can represent these operations as matrices.

For example, consider the linear equation \(y = ax_{0} + bx_{1} + cx_{2}\). We can represent this using matrices as \(y = \begin{bmatrix} a & b & c \end{bmatrix} \times \begin{bmatrix} x_0\\ x_1\\ x_2 \end{bmatrix}\) and there are a lot of matrix “features” that make solving linear equations much easier and faster. You’ll see them as you go.

Another advantage of using matrices in machine learning is that it lets us speed up our processing. GPUs are very good at executing instructions in parallel, so if you can represent a large number of operations as matrix operations, you can run them in parallel on a GPU.

Matrix Multiplication: CPU vs GPU

There are more uses of matrices in machine learning. You’ll learn about them as you understand the algorithms and their implementations.

Differential Calculus

Differential Calculus is a branch of calculus that deals with derivatives, which measure the rate of change of a function with respect to a variable.

Derivatives

Let’s look at an example. Consider an object whose velocity at time \(t\) is given by the function \(f(t)\). Meaning, at \(t=0\), the velocity is \(f(0)\), at \(t=1\), the velocity is \(f(1)\) and so on. For the sake of simplicity, let’s have \(f(t)\) be a linear function which will look something like this:

\(f(t) = a\cdot t + c\), where \(a\) and \(c\) are constants.

The derivative in this case tells us the change in velocity per unit of time (for velocity, this is what acceleration refers to). Basically, the derivative in this case is the change \(\Delta f = f(t+1) - f(t)\). This, when computed, gives us:

\[f(t+1) - f(t) = a\cdot(t+1) + c - (a\cdot t + c)\]
\[= a\cdot t + a + c - a\cdot t - c\]
\[= a\]

This was simple for a straight line where the rate of change stays the same. Now consider this curve:

In the linear function, we could determine the derivative by just doing \(f(t+1) - f(t)\). But in the second, non-linear scenario, we can’t do that because the rate of change is also changing with time. Using the approach we used for the linear function would only give us the average rate of change over a unit of time. It won’t work to find the rate of change at a specific point in time. So, instead of determining the change in a function (\(\Delta f\)) for a change in a variable (\(\Delta x\)), we try to identify the infinitesimally small change in the function (\(df(x)\)) for an infinitesimally small change in a variable (\(dx\)), i.e., \(\frac{df(x)}{dx}\).
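The idea of shrinking the interval can be tried out numerically. The sketch below is only an illustration: the curve \(f(t) = t^2\), the point \(t = 2\) and the interval sizes are assumptions chosen for the example.

```python
def f(t):
    """Hypothetical non-linear velocity curve: f(t) = t^2."""
    return t ** 2


def rate_of_change(f, t, dt):
    """Average rate of change of f over the interval [t, t + dt]."""
    return (f(t + dt) - f(t)) / dt


# Shrinking the interval moves the average rate toward the rate at t = 2.
for dt in (1.0, 0.1, 0.001):
    print(dt, rate_of_change(f, 2.0, dt))
```

As the interval shrinks, the printed values get closer to \(4\), which matches the derivative \(f'(t) = 2t\) evaluated at \(t = 2\).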

A derivative of a function \(f(x)\) can be represented in several ways, some of them being \(f'(x)\), \(\frac{d}{dx}f(x)\) or \(\frac{df}{dx}\).

For some rules about computing derivatives and the derivatives of some common functions, check here.

Partial Derivatives

In the above section we saw an example where we had a function that had only one variable. Now, consider a function that has multiple variables \(f(x,y,z,...)\).

Partial derivatives tell us how a function like this changes when we tweak only one variable while keeping the rest constant.

Partial derivative of \(f\) w.r.t \(x\) is represented as \(\frac{\partial f}{\partial x}\).

Computing a partial derivative follows the same rules as a regular derivative; you just treat every variable other than the one you’re differentiating with respect to as a constant. So in the above function \(f(x,y,z,...)\), for \(\frac{\partial f}{\partial x}\), you’ll treat \(x\) as the only variable.

Why is this useful?

Gradients are a large part of training machine learning algorithms and optimization functions. The gradient of a function \(f\) is just a vector that bundles all the partial derivatives of the function. \(\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z}, ... \end{bmatrix}\)

The gradient acts as a compass which always points in the direction of the steepest ascent.
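Here is a small worked example of a gradient, as a sketch. The function \(f(x, y) = x^2y + 3y\) is an assumption chosen for illustration. Treating \(y\) as a constant gives a partial derivative of \(2xy\) w.r.t. \(x\), and treating \(x\) as a constant gives \(x^2 + 3\) w.r.t. \(y\). The code compares these with a numerical estimate that nudges one variable at a time.

```python
def f(x, y):
    """Example function: f(x, y) = x^2 * y + 3 * y."""
    return x ** 2 * y + 3 * y


def analytic_gradient(x, y):
    """Partial derivatives worked out by hand: [2xy, x^2 + 3]."""
    return [2 * x * y, x ** 2 + 3]


def numeric_gradient(f, point, eps=1e-6):
    """Estimate each partial derivative by nudging one variable at a time."""
    grad = []
    for i in range(len(point)):
        up = list(point)
        down = list(point)
        up[i] += eps
        down[i] -= eps
        grad.append((f(*up) - f(*down)) / (2 * eps))
    return grad


print(analytic_gradient(1.0, 2.0))      # [4.0, 4.0]
print(numeric_gradient(f, [1.0, 2.0]))  # approximately the same values
```

Gradient Descent uses exactly this vector, stepping in the opposite direction to move toward a minimum.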