Stochastic gradient descent is an optimization algorithm often used in machine learning applications to find the model parameters that correspond to the best fit between predicted and actual outputs. It’s an inexact but powerful technique.
Stochastic gradient descent is widely used in machine learning applications. Combined with backpropagation, it’s dominant in neural network training applications.
In this tutorial, you’ll learn:
- How gradient descent and stochastic gradient descent algorithms work
- How to apply gradient descent and stochastic gradient descent to minimize the loss function in machine learning
- What the learning rate is, why it’s important, and how it impacts results
- How to write your own function for stochastic gradient descent
Basic Gradient Descent Algorithm
The gradient descent algorithm is an approximate and iterative method for mathematical optimization. You can use it to approach the minimum of any differentiable function.
Note: There are many optimization methods and subfields of mathematical programming. If you want to learn how to use some of them with Python, then check out Scientific Python: Using SciPy for Optimization and Hands-On Linear Programming: Optimization With Python.
Although gradient descent sometimes gets stuck in a local minimum or a saddle point instead of finding the global minimum, it’s widely used in practice. Data science and machine learning methods often apply it internally to optimize model parameters. For example, neural networks find weights and biases with gradient descent.
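As a rough preview of how this looks in code, here's a minimal sketch of the basic update rule. The function name `gradient_descent`, the example cost 𝐶(𝑣) = 𝑣², and the chosen learning rate are illustrative assumptions, not the article's final implementation:

```python
# Minimal gradient descent sketch: repeatedly step against the gradient.
def gradient_descent(gradient, start, learn_rate, n_iter=50, tolerance=1e-06):
    vector = start
    for _ in range(n_iter):
        # Step in the direction of the negative gradient.
        diff = -learn_rate * gradient(vector)
        if abs(diff) <= tolerance:
            break
        vector += diff
    return vector

# Example: minimize C(v) = v**2, whose gradient is 2 * v.
print(gradient_descent(gradient=lambda v: 2 * v, start=10.0, learn_rate=0.2))
# Prints a value very close to 0.0, the true minimum
```

The learning rate controls how large each step is; you'll see later in the tutorial how it affects convergence.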
Cost Function: The Goal of Optimization
The cost function, or loss function, is the function to be minimized (or maximized) by varying the decision variables. Many machine learning methods solve optimization problems under the surface. They tend to minimize the difference between actual and predicted outputs by adjusting the model parameters (like weights and biases for neural networks, decision rules for random forest or gradient boosting, and so on).
In a regression problem, you typically have the vectors of input variables 𝐱 = (𝑥₁, …, 𝑥ᵣ) and the actual outputs 𝑦. You want to find a model that maps 𝐱 to a predicted response 𝑓(𝐱) so that 𝑓(𝐱) is as close as possible to 𝑦. For example, you might want to predict an output such as a person’s salary given inputs like the person’s number of years at the company or level of education.
Your goal is to minimize the difference between the prediction 𝑓(𝐱) and the actual data 𝑦. This difference is called the residual.
In this type of problem, you want to minimize the sum of squared residuals (SSR), where SSR = Σᵢ(𝑦ᵢ − 𝑓(𝐱ᵢ))² for all observations 𝑖 = 1, …, 𝑛, where 𝑛 is the total number of observations. Alternatively, you could use the mean squared error (MSE = SSR / 𝑛) instead of SSR.
Both SSR and MSE use the square of the difference between the actual and predicted outputs. The lower the difference, the more accurate the prediction. A difference of zero indicates that the prediction is equal to the actual data.
SSR or MSE is minimized by adjusting the model parameters. For example, in linear regression, you want to find the function 𝑓(𝐱) = 𝑏₀ + 𝑏₁𝑥₁ + ⋯ + 𝑏ᵣ𝑥ᵣ, so you need to determine the weights 𝑏₀, 𝑏₁, …, 𝑏ᵣ that minimize SSR or MSE.
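To make the idea concrete, here's a small NumPy example that evaluates SSR and MSE for a one-variable linear model 𝑓(𝑥) = 𝑏₀ + 𝑏₁𝑥. The data and candidate weights are made up purely for illustration:

```python
import numpy as np

# Made-up observations and candidate weights for f(x) = b0 + b1 * x
x = np.array([5, 15, 25, 35, 45, 55])
y = np.array([5, 20, 14, 32, 22, 38])
b0, b1 = 5.0, 0.5

predictions = b0 + b1 * x      # f(x_i) for each observation
residuals = y - predictions    # differences between actual and predicted outputs
ssr = np.sum(residuals ** 2)   # sum of squared residuals
mse = ssr / x.size             # mean squared error

print(ssr, mse)
```

Gradient descent would then adjust `b0` and `b1` to make `ssr` (or `mse`) as small as possible.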
In a classification problem, the outputs 𝑦 are categorical, often either 0 or 1. For example, you might try to predict whether an email is spam or not. In the case of binary outputs, it’s convenient to minimize the cross-entropy function that also depends on the actual outputs 𝑦ᵢ and the corresponding predictions 𝑝(𝐱ᵢ): 𝐻 = −Σᵢ(𝑦ᵢ log(𝑝(𝐱ᵢ)) + (1 − 𝑦ᵢ) log(1 − 𝑝(𝐱ᵢ))).
In logistic regression, which is often used to solve classification problems, the functions 𝑝(𝐱) and 𝑓(𝐱) are defined as follows: 𝑝(𝐱) = 1 / (1 + exp(−𝑓(𝐱))), where 𝑓(𝐱) = 𝑏₀ + 𝑏₁𝑥₁ + ⋯ + 𝑏ᵣ𝑥ᵣ.
Again, you need to find the weights 𝑏₀, 𝑏₁, …, 𝑏ᵣ, but this time they should minimize the cross-entropy function.
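For a sense of how this looks in code, here's a NumPy sketch that computes the sigmoid predictions 𝑝(𝐱) and the cross-entropy for a one-feature logistic model. The labels and weights are invented for illustration only:

```python
import numpy as np

# Invented binary labels and a single input feature
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0, 0, 1, 1, 1])
b0, b1 = -3.0, 1.0

f = b0 + b1 * x            # linear part f(x)
p = 1 / (1 + np.exp(-f))   # sigmoid gives p(x), the predicted probability of class 1

# Binary cross-entropy: lower values mean predictions closer to the labels
cross_entropy = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(cross_entropy)
```

Just as with SSR and MSE, gradient descent would adjust `b0` and `b1` to drive `cross_entropy` down.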
Gradient of a Function: Calculus Refresher
In calculus, the derivative of a function shows you how much its value changes when you change its argument (or arguments). Derivatives are important for optimization because zero derivatives might indicate a minimum, maximum, or saddle point.
The gradient of a function 𝐶 of several independent variables 𝑣₁, …, 𝑣ᵣ is denoted with ∇𝐶(𝑣₁, …, 𝑣ᵣ) and defined as the vector function of the partial derivatives of 𝐶 with respect to each independent variable: ∇𝐶 = (∂𝐶/∂𝑣₁, …, ∂𝐶/∂𝑣ᵣ). The symbol ∇ is called nabla.
The nonzero value of the gradient of a function 𝐶 at a given point defines the direction and rate of the fastest increase of 𝐶. When working with gradient descent, you’re interested in the direction of the fastest decrease in the cost function. This direction is determined by the negative gradient, −∇𝐶.
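As a quick numerical illustration, the sketch below evaluates the gradient of the assumed cost 𝐶(𝑣₁, 𝑣₂) = 𝑣₁² + 𝑣₂² at a point and takes one small step along −∇𝐶. The function and step size are chosen just for demonstration:

```python
# C(v1, v2) = v1**2 + v2**2 has the gradient (2 * v1, 2 * v2).
def gradient_c(v1, v2):
    return 2 * v1, 2 * v2

v1, v2 = 3.0, -2.0
g1, g2 = gradient_c(v1, v2)

# Moving against the gradient decreases C fastest.
learn_rate = 0.1
v1_new, v2_new = v1 - learn_rate * g1, v2 - learn_rate * g2

print(v1**2 + v2**2, v1_new**2 + v2_new**2)  # the cost drops after the step
```

Repeating this step is exactly what the gradient descent algorithm does until it approaches a minimum.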
Read the full article at https://realpython.com/gradient-descent-algorithm-python/ »