LogSumExp

Smooth approximation to the maximum function

The LogSumExp (LSE) (also called RealSoftMax[1] or multivariable softplus) function is a smooth maximum – a smooth approximation to the maximum function, mainly used by machine learning algorithms.[2] It is defined as the logarithm of the sum of the exponentials of the arguments:

$$\mathrm{LSE}(x_1,\dots,x_n) = \log\left(\exp(x_1)+\cdots+\exp(x_n)\right).$$
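
As a concrete illustration, here is a minimal NumPy sketch of this definition (the function name is illustrative, not taken from any particular library):

```python
import numpy as np

def logsumexp_naive(x):
    """Direct evaluation of LSE(x_1, ..., x_n) = log(exp(x_1) + ... + exp(x_n))."""
    x = np.asarray(x, dtype=float)
    return np.log(np.sum(np.exp(x)))

# Example: LSE(1, 2, 3) is slightly above max(1, 2, 3) = 3.
print(logsumexp_naive([1.0, 2.0, 3.0]))  # ≈ 3.4076
```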

Properties

The domain of the LogSumExp function is $\mathbb{R}^n$, the real coordinate space, and its codomain is $\mathbb{R}$, the real line. It is an approximation to the maximum $\max_i x_i$ with the following bounds:

$$\max\{x_1,\dots,x_n\} \leq \mathrm{LSE}(x_1,\dots,x_n) \leq \max\{x_1,\dots,x_n\} + \log(n).$$
The first inequality is strict unless $n = 1$. The second inequality is strict unless all arguments are equal. (Proof: Let $m = \max_i x_i$. Then $\exp(m) \leq \sum_{i=1}^{n}\exp(x_i) \leq n\exp(m)$. Applying the logarithm to the inequality gives the result.)
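
These bounds can be checked numerically; the following sketch assumes NumPy, with arbitrary example values:

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0, 2.0])
lse = np.log(np.sum(np.exp(x)))
m, n = np.max(x), len(x)

# max(x) <= LSE(x) <= max(x) + log(n)
assert m <= lse <= m + np.log(n)
print(m, lse, m + np.log(n))  # 2.0, ~2.82, ~3.39
```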

In addition, we can scale the function to make the bounds tighter. Consider the function $\frac{1}{t}\mathrm{LSE}(tx_1,\dots,tx_n)$ for $t > 0$. Then

$$\max\{x_1,\dots,x_n\} < \frac{1}{t}\mathrm{LSE}(tx_1,\dots,tx_n) \leq \max\{x_1,\dots,x_n\} + \frac{\log(n)}{t}.$$
(Proof: Replace each $x_i$ with $tx_i$ for some $t > 0$ in the inequalities above to obtain
$$\max\{tx_1,\dots,tx_n\} < \mathrm{LSE}(tx_1,\dots,tx_n) \leq \max\{tx_1,\dots,tx_n\} + \log(n)$$
and, since $t > 0$,
$$t\max\{x_1,\dots,x_n\} < \mathrm{LSE}(tx_1,\dots,tx_n) \leq t\max\{x_1,\dots,x_n\} + \log(n).$$
Finally, dividing by $t$ gives the result.)
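
The effect of the scaling parameter can be illustrated with a short NumPy sketch (the helper name is hypothetical): as $t$ grows, $\frac{1}{t}\mathrm{LSE}(tx_1,\dots,tx_n)$ converges to the maximum.

```python
import numpy as np

def scaled_lse(x, t):
    """(1/t) * LSE(t * x); approaches max(x) as t increases."""
    x = np.asarray(x, dtype=float)
    return np.log(np.sum(np.exp(t * x))) / t

x = [1.0, 2.0, 3.0]
for t in [1.0, 10.0, 100.0]:
    print(t, scaled_lse(x, t))  # 3.41..., 3.000004..., 3.0 (max(x) = 3)
```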

Also, if we scale by a negative number instead, we obtain a corresponding bound on the $\min$ function:

$$\min\{x_1,\dots,x_n\} - \frac{\log(n)}{t} \leq \frac{1}{-t}\mathrm{LSE}(-tx_1,\dots,-tx_n) < \min\{x_1,\dots,x_n\}.$$

The LogSumExp function is convex, and is strictly increasing in each argument.[3] It is not strictly convex, since it is affine (linear plus a constant) along the diagonal direction and lines parallel to it:[4]

$$\mathrm{LSE}(x_1+c,\dots,x_n+c) = \mathrm{LSE}(x_1,\dots,x_n) + c.$$

Other than along this direction, it is strictly convex (the Hessian has rank $n-1$), so, for example, restricting to a hyperplane that is transverse to the diagonal results in a strictly convex function. See $\mathrm{LSE}_0^+$, below.

Writing $\mathbf{x} = (x_1,\dots,x_n)$, the partial derivatives are:

$$\frac{\partial}{\partial x_i}\mathrm{LSE}(\mathbf{x}) = \frac{\exp x_i}{\sum_j \exp x_j},$$
which means the gradient of LogSumExp is the softmax function.
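
This identity can be checked numerically; the sketch below compares a finite-difference gradient of LSE against the softmax function (NumPy, with illustrative names):

```python
import numpy as np

def lse(x):
    return np.log(np.sum(np.exp(x)))

def softmax(x):
    e = np.exp(x - np.max(x))  # shift by the max for numerical stability
    return e / e.sum()

x = np.array([0.2, -1.3, 0.7])
eps = 1e-6
unit = np.eye(len(x))
# Central finite-difference approximation of the gradient of LSE.
grad_fd = np.array([(lse(x + eps * unit[i]) - lse(x - eps * unit[i])) / (2 * eps)
                    for i in range(len(x))])
print(np.allclose(grad_fd, softmax(x), atol=1e-6))  # True
```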

The convex conjugate of LogSumExp is the negative entropy.

Log-sum-exp trick for log-domain calculations

The LSE function is often encountered when the usual arithmetic computations are performed on a logarithmic scale, as in log probability.[5]

Just as multiplication in the linear domain becomes simple addition in the log domain, addition in the linear domain becomes the LSE in the log domain:

$$\mathrm{LSE}(\log(x_1),\dots,\log(x_n)) = \log(x_1+\cdots+x_n).$$
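
For instance, this identity can be verified with a few lines of NumPy (the example values are arbitrary):

```python
import numpy as np

p = np.array([0.2, 0.3, 0.5])            # quantities in the linear domain
log_p = np.log(p)                         # the same quantities in the log domain

lse = np.log(np.sum(np.exp(log_p)))       # "addition" carried out in the log domain
print(np.isclose(lse, np.log(p.sum())))   # True: LSE(log p) == log(sum p)
```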
A common purpose of using log-domain computations is to increase accuracy and avoid underflow and overflow problems when very small or very large numbers are represented directly (i.e. in a linear domain) using limited-precision floating point numbers.[6]

Unfortunately, computing LSE directly in this way can itself cause overflow or underflow. Therefore, the following mathematically equivalent formula must be used instead (especially when the accuracy of the above 'max' approximation is not sufficient):

$$\mathrm{LSE}(x_1,\dots,x_n) = x^* + \log\left(\exp(x_1-x^*)+\cdots+\exp(x_n-x^*)\right),$$
where $x^* = \max\{x_1,\dots,x_n\}$.
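
A minimal NumPy sketch of this shifted formula follows (the function name is illustrative; SciPy's scipy.special.logsumexp implements the same idea):

```python
import numpy as np

def logsumexp_stable(x):
    """LSE computed with the max-shift trick to avoid overflow/underflow."""
    x = np.asarray(x, dtype=float)
    x_star = np.max(x)
    return x_star + np.log(np.sum(np.exp(x - x_star)))

# Direct evaluation would overflow, since exp(1000) is inf in double precision,
# but the shifted form returns the correct finite value.
print(logsumexp_stable([1000.0, 1000.0]))  # 1000 + log(2) ≈ 1000.6931
```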

Many math libraries, such as IT++, provide a default routine for LSE and use this formula internally.

A strictly convex log-sum-exp type function

LSE is convex but not strictly convex. We can define a strictly convex log-sum-exp type function[7] by adding an extra argument set to zero:

$$\mathrm{LSE}_0^+(x_1,\dots,x_n) = \mathrm{LSE}(0,x_1,\dots,x_n).$$
This function is a proper Bregman generator (strictly convex and differentiable). It is encountered in machine learning, for example, as the cumulant of the multinomial/binomial family.
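
A brief sketch of this variant, reusing the max-shift trick from above (NumPy, illustrative name); note that for a single argument it reduces to the softplus function $\log(1+\exp(x))$:

```python
import numpy as np

def lse0_plus(x):
    """LSE_0^+(x) = LSE(0, x_1, ..., x_n), a strictly convex log-sum-exp variant."""
    x = np.concatenate(([0.0], np.asarray(x, dtype=float)))
    x_star = np.max(x)
    return x_star + np.log(np.sum(np.exp(x - x_star)))

# With one argument, LSE_0^+ is the softplus function log(1 + exp(x)).
print(lse0_plus([2.0]))  # ≈ 2.1269
```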

In tropical analysis, this is the sum in the log semiring.

References

  1. ^ Zhang, Aston; Lipton, Zack; Li, Mu; Smola, Alex. "Dive into Deep Learning, Chapter 3 Exercises". www.d2l.ai. Retrieved 27 June 2020.
  2. ^ Nielsen, Frank; Sun, Ke (2016). "Guaranteed bounds on the Kullback-Leibler divergence of univariate mixtures using piecewise log-sum-exp inequalities". Entropy. 18 (12): 442. arXiv:1606.05850. Bibcode:2016Entrp..18..442N. doi:10.3390/e18120442. S2CID 17259055.
  3. ^ El Ghaoui, Laurent (2017). Optimization Models and Applications.
  4. ^ "convex analysis - About the strictly convexity of log-sum-exp function - Mathematics Stack Exchange". stackexchange.com.
  5. ^ McElreath, Richard. Statistical Rethinking. OCLC 1107423386.
  6. ^ "Practical issues: Numeric stability". CS231n Convolutional Neural Networks for Visual Recognition.
  7. ^ Nielsen, Frank; Hadjeres, Gaetan (2018). "Monte Carlo Information Geometry: The dually flat case". arXiv:1803.07225.