Figure 1: Trajectory towards a local minimum

Optimization refers to the task of minimizing/maximizing an objective function f(x) parameterized by x. In machine/deep learning terminology, it’s the task of minimizing the cost/loss function J(w) parameterized by the model’s parameters w ∈ R^d. Optimization algorithms (in the case of minimization) have one of the following goals:

- Find the global minimum of the objective function. This is feasible if the objective function is convex, i.e. any local minimum is a global minimum.
- Find the lowest possible value of the objective function within its neighborhood. This is usually the case when the objective function is not convex, as in most deep learning problems.
There are three kinds of optimization algorithms:

- An optimization algorithm that is not iterative and simply solves for one point.
- An optimization algorithm that is iterative in nature and converges to an acceptable solution regardless of the parameters’ initialization, such as gradient descent applied to logistic regression.
- An optimization algorithm that is iterative in nature and applied to problems with non-convex cost functions, such as neural networks, where the parameters’ initialization plays a critical role in speeding up convergence and achieving lower error rates.
Gradient Descent is the most common optimization algorithm in machine learning and deep learning. It is a first-order optimization algorithm, meaning it only takes the first derivative into account when performing the parameter updates. On each iteration, we update the parameters in the opposite direction of the gradient of the objective function J(w) w.r.t. the parameters, where the gradient gives the direction of steepest ascent. The size of the step we take on each iteration towards the local minimum is determined by the learning rate α. We therefore follow the direction of the slope downhill until we reach a local minimum.

In this article, we’ll cover the gradient descent algorithm and its variants: Batch Gradient Descent, Mini-batch Gradient Descent, and Stochastic Gradient Descent. Let’s first see how gradient descent works on logistic regression before going into the details of its variants. For the sake of simplicity, let’s assume that the logistic regression model has only two parameters: a weight w and a bias b.

1. Initialize the weight w and bias b to any random numbers.
2. Pick a value for the learning rate α. The learning rate determines how big a step we take on each iteration. If α is very small, gradient descent is slow and takes a long time to converge; if α is too large, the updates can overshoot the minimum, so the cost may oscillate around it or even diverge.
Therefore, plot the cost function against different values of α and pick the value of α right before the first value that fails to converge, so that we get a fast learning algorithm that still converges (see figure 2).

Figure 2: Gradient descent with different learning rates.
3. Make sure to scale the data if the features are on very different scales. If we don’t scale the data, the level curves (contours) would be narrower and taller, which means it would take a longer time to converge (see figure 3).

Figure 3: Gradient descent: normalized versus unnormalized level curves.

Scale the data to have μ = 0 and σ = 1. Below is the formula for scaling each example:

x′ = (x − μ) / σ

4. On each iteration, take the partial derivative of the cost function J(w) w.r.t. each parameter (the gradient), ∂J/∂w and ∂J/∂b, and move each parameter in the opposite direction at a step size of α. The update equations are:

w = w − α ∂J/∂w
b = b − α ∂J/∂b
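Putting steps 1–4 together, below is a minimal NumPy sketch of gradient descent on the two-parameter logistic regression model described above. This is an illustration, not the article’s original code; the names (sigmoid, gradient_descent, X, y) and the toy data are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.5, num_epochs=1000):
    """Gradient descent for a one-feature logistic regression,
    so the only parameters are a weight w and a bias b."""
    X = (X - X.mean()) / X.std()        # step 3: scale to mu = 0, sigma = 1
    w, b = 0.0, 0.0                     # step 1: initialize the parameters
    m = len(y)
    costs = []
    for _ in range(num_epochs):
        y_hat = sigmoid(w * X + b)      # predictions on all m examples
        dw = np.dot(X, y_hat - y) / m   # step 4: dJ/dw of the cross-entropy cost
        db = np.sum(y_hat - y) / m      #         dJ/db
        w -= alpha * dw                 # move opposite to the gradient
        b -= alpha * db                 # (step size = learning rate alpha)
        costs.append(-np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))
    return w, b, costs

# Example usage on toy data
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])
w, b, costs = gradient_descent(X, y)
```

To pick α as described in step 2, one could rerun the sketch with several candidate rates and compare the cost curves, keeping the largest α whose cost still decreases steadily:

```python
for alpha in (0.01, 0.1, 0.5, 1.0):
    _, _, costs = gradient_descent(X, y, alpha=alpha)
    print(f"alpha={alpha}: final cost={costs[-1]:.4f}")
```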
Now let’s discuss the three variants of the gradient descent algorithm. The main difference between them is the amount of data we use when computing the gradients for each learning step. The trade-off is between the accuracy of the gradient estimate and the time it takes to perform each parameter update (learning step).

Batch Gradient Descent

Batch Gradient Descent sums over all examples on each iteration when performing the parameter updates. Therefore, for each update, we have to sum over all examples (compute_gradient stands for an illustrative helper that returns the gradient of the cost over all m training examples):

for i in range(num_epochs):
    grad = compute_gradient(X, y, params)
    params = params - learning_rate * grad

The main advantages:

- We can use a fixed learning rate during training without worrying about learning rate decay.
- It has a straight trajectory towards the minimum and is guaranteed, in theory, to converge to the global minimum if the loss function is convex, and to a local minimum otherwise.
- It gives an unbiased estimate of the gradient; the more examples we use, the lower the standard error.
The main disadvantages:

- Even with a vectorized implementation, it can be slow to go over all examples on every update, especially with large datasets.
- Each learning step happens only after processing all examples, even though some of them may be redundant and contribute little to the update.
Mini-batch Gradient Descent

Instead of going over all examples, Mini-batch Gradient Descent sums over a smaller number of examples given by the batch size b. Therefore, learning happens on each mini-batch of b examples:
for i in range(num_epochs):
    for X_batch, y_batch in get_mini_batches(X, y, batch_size):
        grad = compute_gradient(X_batch, y_batch, params)
        params = params - learning_rate * grad

(Here get_mini_batches is an illustrative helper that shuffles the data and yields batches of b examples; a sketch of it follows the lists below.)

The batch size is something we can tune. It is usually chosen as a power of 2, such as 32, 64, 128, 256, or 512. The reason is that some hardware, such as GPUs, achieves better run times with batch sizes that are powers of 2.

The main advantages:

- It is faster than the batch version because each update goes through far fewer examples.
- Randomly selecting examples helps avoid redundant or very similar examples that contribute little to the learning.
- With batch size < size of the training set, it adds noise to the learning process that can help improve generalization error.
The main disadvantages:

- It won’t converge exactly. On each iteration the learning step may go back and forth due to the noise, so it wanders around the minimum region without settling on it.
- Because of that noise, the learning steps oscillate more, and we need to add learning rate decay to shrink the learning rate as we get closer to the minimum.
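For completeness, here is a minimal sketch of the get_mini_batches helper assumed in the loop above; the name is hypothetical, used only for illustration, and it is not a library function:

```python
import numpy as np

def get_mini_batches(X, y, batch_size=64):
    """Reshuffle the examples, then yield consecutive chunks of
    batch_size examples (the last chunk may be smaller)."""
    idx = np.random.permutation(len(y))  # fresh random order on every call
    for start in range(0, len(y), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]
```

Calling it once per epoch, as in the loop above, gives a new random ordering on each pass through the data.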
With large training datasets, we usually don’t need more than 2–10 passes over all training examples (epochs). Note: with batch size b = m (the number of training examples), we get Batch Gradient Descent.

Stochastic Gradient Descent

Instead of going through all examples, Stochastic Gradient Descent (SGD) performs the parameter update on each example (x^(i), y^(i)). Therefore, learning happens on every example:
for i in range(num_epochs):
    np.random.shuffle(training_examples)   # visit examples in random order each epoch
    for x_i, y_i in training_examples:
        params = params - learning_rate * compute_gradient(x_i, y_i, params)

It shares most of its advantages and disadvantages with the mini-batch version. Below are the ones that are specific to SGD:

- It adds even more noise to the learning process than mini-batch gradient descent, which can help improve generalization error, at the cost of a longer run time.
- It can’t exploit vectorization, since each update uses a single example, so computation is slow and the variance of the gradient estimate is high.
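To make this noise trade-off concrete, here is a toy sketch (not from the article) that estimates the gradient of a simple least-squares cost with batch sizes 1, 32, and m, and measures how much each estimate varies:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10_000
X = rng.normal(size=m)
y = 3.0 * X + rng.normal(scale=0.5, size=m)
w = 0.0                                    # current parameter guess

def grad(batch_idx):
    # dJ/dw for J = mean((w * X - y)^2) / 2 over the given batch
    return np.mean((w * X[batch_idx] - y[batch_idx]) * X[batch_idx])

for b in (1, 32, m):                       # SGD, mini-batch, batch
    estimates = [grad(rng.choice(m, size=b, replace=False)) for _ in range(200)]
    print(f"batch size {b:>5}: std of gradient estimate = {np.std(estimates):.3f}")
```

The standard deviation shrinks as the batch grows: SGD’s single-example gradient is the noisiest, batch gradient descent’s is exact, and mini-batch sits in between.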
Below is a graph that shows the gradient descent variants and their direction towards the minimum:

Figure 6: Gradient descent variants’ trajectory towards the minimum.

As the figure above shows, the SGD direction is very noisy compared to mini-batch.

Challenges

Below are some challenges regarding the gradient descent algorithm in general, as well as its variants, mainly batch and mini-batch:

- Picking a proper learning rate is difficult: too small and convergence is painfully slow; too large and the cost may oscillate around the minimum or diverge.
- For the non-convex cost functions common in deep learning, the algorithm can get stuck near saddle points or poor local minima.
- Gradient descent is a first-order algorithm: it uses only the gradient, which measures the steepness of the curve, and ignores the curvature measured by the second derivative, even though the curvature affects how large each step should be.
As a result, the direction that looks promising to the gradient may not actually be so, and following it may slow the learning process or even cause it to diverge.
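A tiny sketch (again illustrative, not from the article) of this curvature problem: on the ill-conditioned quadratic J(w1, w2) = (w1² + 100·w2²)/2, the gradient points mostly along the steep w2 axis, and a step size that is fine for w1 makes w2 oscillate and diverge:

```python
import numpy as np

w = np.array([10.0, 1.0])
alpha = 0.021                  # safe for curvature 1, too large for curvature 100
for step in range(5):
    grad = np.array([w[0], 100.0 * w[1]])   # dJ/dw1 = w1, dJ/dw2 = 100 * w2
    w = w - alpha * grad
    print(step, w)
```

Here w1 shrinks slowly (factor 0.979 per step) while w2 flips sign and grows (factor -1.1 per step), so the cost diverges along w2 even though every step followed the gradient.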
Originally published at imaddabbura.github.io on December 21, 2017.