Basic Concepts
Gradient descent is a basic algorithm for finding the value of X that minimizes a function f(X) of k variables.
Assumptions:
The function is differentiable, which means that its graph is smooth, i.e. without gaps (discontinuities) or cusps (corners)
The function is convex, which means that the line segment connecting any two points on the graph never passes below the graph. For a twice-differentiable function, this is equivalent to saying that the second derivative is nonnegative at all points (for k > 1, that the Hessian matrix is positive semidefinite).
For a function with these two properties, any local minimum is a global minimum.
Note that a stationary point (one where the gradient is zero) that is not a local extremum is called a saddle point (an inflection point when X is one-dimensional).
Algorithm
Start with a guess X0 for the value of X that minimizes f(X). The better the initial guess, the faster the algorithm converges to a solution. At step n+1, set
Xn+1 = Xn – α⋅∇f(Xn)
where ∇f(Xn) is the gradient of f (i.e. the column vector of partial first derivatives) at Xn and α > 0 is a pre-assigned parameter called the learning rate. The lower the value of alpha, the slower the rate of convergence. Values of alpha that are too big can lead to oscillations that either slow down convergence or prevent it from happening altogether.
Convergence is achieved when
||Xn+1 – Xn|| < ε
for some preassigned small positive value ε, where ||X||² = the sum of the squared elements in X. Alternatively, we can use the criterion ||∇f(Xn)|| < ε.
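The algorithm above can be sketched in a few lines of Python. The quadratic test function and its gradient below are illustrative choices, not part of the original description.

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, eps=1e-7, max_iter=500):
    """Fixed-learning-rate gradient descent.

    Stops when the step size ||Xn+1 - Xn|| falls below eps, or after
    max_iter iterations.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = x - alpha * grad_f(x)          # Xn+1 = Xn - alpha * grad f(Xn)
        if np.linalg.norm(x_new - x) < eps:    # convergence test
            return x_new
        x = x_new
    return x

# Illustrative example: f(x, y) = (x - 1)^2 + 2(y + 3)^2, minimized at (1, -3)
grad = lambda x: np.array([2 * (x[0] - 1), 4 * (x[1] + 3)])
x_min = gradient_descent(grad, [0.0, 0.0])
```

With α = 0.1 each coordinate contracts geometrically toward the minimum, so the step-size test is met well within the 500-iteration cap.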
Observations
Note that this algorithm is a special case of the damped Newton-Raphson algorithm
Xn+1 = Xn – αH⁻¹∇f(Xn)
where the Hessian matrix H is replaced by the identity matrix (pure Newton-Raphson uses the actual Hessian with α = 1).
The gradient descent algorithm generally converges quite slowly, especially compared to the Newton-Raphson method, but it is simpler to implement since only the first derivatives of the function need to be calculated and not the second derivatives (which make up the Hessian matrix).
One of the reasons why the gradient descent algorithm, as described above, converges so slowly (if at all) is that a fixed learning rate is quite inefficient. There are a number of versions of the gradient descent whereby the learning rate is optimized at each iteration. We consider two such approaches: steepest descent and backtracking line search (BTLS).
Steepest Descent
At each step of the algorithm, we want ||∇f(Xn)||², the sum of the squares of the elements in the gradient, to get smaller and converge to zero. One way to achieve this is to find the value of αn that minimizes f(Xn – αn∇f(Xn)). We then set Xn+1 = Xn – αn∇f(Xn). This is the method of steepest descent. The downside of this approach is that we need to perform a minimization within a minimization, although the inner minimization is one-dimensional.
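A minimal sketch of steepest descent, using SciPy's bounded one-dimensional minimizer for the inner minimization over αn (the test function, its gradient, and the bound on αn are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def steepest_descent(f, grad_f, x0, eps=1e-7, max_iter=500):
    """At each step, choose alpha_n to minimize f along the negative gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < eps:          # gradient-norm convergence test
            return x
        # One-dimensional minimization: alpha_n = argmin f(Xn - alpha * grad f(Xn))
        res = minimize_scalar(lambda a: f(x - a * g), bounds=(0.0, 1.0),
                              method="bounded")
        x = x - res.x * g
    return x

# Illustrative example: f(x, y) = (x - 1)^2 + 2(y + 3)^2, minimized at (1, -3)
f = lambda x: (x[0] - 1) ** 2 + 2 * (x[1] + 3) ** 2
grad = lambda x: np.array([2 * (x[0] - 1), 4 * (x[1] + 3)])
x_min = steepest_descent(f, grad, [0.0, 0.0])
```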
Instead of trying to find the optimal value of αn, we can settle for a good-enough value of alpha. This is the objective of the backtracking line search.
Backtracking Line Search
In backtracking line search, at each iteration the learning rate is revised. We are willing to accept a less-than-ideal learning rate as long as it still leads in the correct direction with reasonable convergence. The benefit of this approach is that we don’t need to optimize the value of αn.
In this approach, we set a constant c to a value between 0 and 1, and set the learning rate reduction constant rrate to a value between 0 and 1. For our purposes, c = .001 and rrate = .5 are reasonable settings. At each iteration we initially set the learning rate αn to 1 and then test
f(Xn – αn∇f(Xn)) ≤ f(Xn) – cαn||∇f(Xn)||2
If the test is successful, we set Xn+1 = Xn – αn∇f(Xn) and move on to the next iteration. If the test is not successful, we set αn to rrate ⋅ αn, repeating this as many times as necessary until the test is successful.
Worksheet Functions
MGRADIENT(R1, Rx, learn, iter, prec, incr): returns a column array with the value of X that minimizes the function f(X) using gradient descent with a fixed learning rate learn (default .1) based on an initial guess of X in the column array Rx where R1 is a cell that contains a formula that represents the function f(X).
See Lambda Functions using Cell Formulas for more details about R1.
iter (default 500) = the maximum number of iterations. The algorithm will terminate prior to iter iterations if the error is less than some value based on prec (default .0000001). Since this approach uses the Real Statistics function DERIV to estimate the partial derivatives of the function f(X), the incr parameter used by DERIV needs to be specified (default .000001). See Numerical Differentiation.
MGRADIENTX(R1, Rx, learn, iter, prec, incr, rrate, c): similar to MGRADIENT, except that the initial learning rate learn (default 1) is not fixed but varies based on the backtracking line search approach with rrate set to the learning rate reduction constant (default .5). The constant c is as described above (default .5).
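Both worksheet functions estimate the partial derivatives numerically rather than symbolically. The sketch below shows one common way to do this (a central-difference approximation with step incr); it only illustrates the idea and is not a reproduction of the actual DERIV implementation.

```python
import numpy as np

def num_gradient(f, x, incr=1e-6):
    """Central-difference estimate of the gradient of f at x.

    Each partial derivative is approximated by
    (f(x + incr*e_i) - f(x - incr*e_i)) / (2*incr).
    """
    x = np.asarray(x, dtype=float)
    g = np.empty_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = incr
        g[i] = (f(x + e) - f(x - e)) / (2 * incr)
    return g

# Illustrative example: f(x, y) = (x - 1)^2 + 2(y + 3)^2
f = lambda x: (x[0] - 1) ** 2 + 2 * (x[1] + 3) ** 2
g = num_gradient(f, [0.0, 0.0])   # exact gradient at (0, 0) is (-2, 12)
```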
Examples
Click here to see three examples of the use of these worksheet functions.
References
Khan Academy (2024) Gradient descent
https://www.khanacademy.org/math/multivariable-calculus/applications-of-multivariable-derivatives/optimizing-multivariable-functions/a/what-is-gradient-descent
Wikipedia (2024) Gradient descent
https://en.wikipedia.org/wiki/Gradient_descent
Cao, L., Sui, Z., Zhang, J., Yan, Y., Gu, Y. (2021) Line search methods
https://optimization.cbe.cornell.edu/index.php?title=Line_search_methods
Kochenderfer, M, J., Wheeler, T. A. (2019) Algorithms for optimization
https://algorithmsbook.com/optimization/files/optimization.pdf