Optimizer¶

Momentum¶

class paddle.v2.optimizer.Momentum(momentum=None, sparse=False, **kwargs)

Momentum Optimizer.

When sparse=False, the momentum update formula is as follows:

$\begin{split}v_{t} &= k * v_{t-1} - \gamma_t (g_{t} + \lambda w_{t-1}) \\ w_{t} &= w_{t-1} + v_{t} \\\end{split}$

where $k$ is the momentum factor, $\lambda$ is the decay rate, and $\gamma_t$ is the learning rate at the t’th iteration. $w_{t}$ is the weight at the t’th iteration, and $v_{t}$ is the momentum history variable.

When sparse=True, the update scheme:

$\begin{split}\alpha_t &= \alpha_{t-1} / k \\ \beta_t &= \beta_{t-1} / (1 + \lambda \gamma_t) \\ u_t &= u_{t-1} - \alpha_t \gamma_t g_t \\ v_t &= v_{t-1} + \tau_{t-1} \alpha_t \gamma_t g_t \\ \tau_t &= \tau_{t-1} + \beta_t / \alpha_t\end{split}$

where $k$ is the momentum factor, $\lambda$ is the decay rate, and $\gamma_t$ is the learning rate at the t’th iteration.

Parameters:

- momentum (float) – the momentum factor.
- sparse (bool) – whether to enable sparse support; False by default.
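The dense (sparse=False) update rule above can be sketched for a single scalar weight in plain Python. This is illustrative only; `momentum_step` and its scalar arguments are not part of the paddle.v2 API:

```python
def momentum_step(w, v, g, lr, momentum=0.9, decay=0.0):
    """One dense-momentum update for a scalar weight:

    v_t = momentum * v_{t-1} - lr * (g + decay * w_{t-1})
    w_t = w_{t-1} + v_t
    """
    v = momentum * v - lr * (g + decay * w)  # velocity update
    w = w + v                                # weight update
    return w, v

# One step from w=1.0 with gradient 0.5 and learning rate 0.1:
w, v = momentum_step(1.0, 0.0, 0.5, lr=0.1)
```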

class paddle.v2.optimizer.Adam(beta1=0.9, beta2=0.999, epsilon=1e-08, **kwargs)

Adam optimizer. For details, please refer to Adam: A Method for Stochastic Optimization.

$\begin{split}m(w, t) & = \beta_1 m(w, t-1) + (1 - \beta_1) \nabla Q_i(w) \\ v(w, t) & = \beta_2 v(w, t-1) + (1 - \beta_2)(\nabla Q_i(w)) ^2 \\ w & = w - \frac{\eta m(w, t)}{\sqrt{v(w,t) + \epsilon}}\end{split}$
Parameters:

- beta1 (float) – the $\beta_1$ in the equation.
- beta2 (float) – the $\beta_2$ in the equation.
- epsilon (float) – the $\epsilon$ in the equation; it prevents division by zero.
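The update can be sketched for a scalar weight in plain Python, following the equations exactly as written above (note they include no bias correction). The function name is illustrative, not paddle.v2 API:

```python
import math

def adam_step(w, m, v, g, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar weight (no bias correction)."""
    m = beta1 * m + (1 - beta1) * g        # first-moment estimate m(w, t)
    v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate v(w, t)
    w = w - lr * m / math.sqrt(v + eps)    # parameter update
    return w, m, v

# One step from w=1.0 with gradient 1.0:
w, m, v = adam_step(1.0, 0.0, 0.0, 1.0)
```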

class paddle.v2.optimizer.Adamax(beta1=0.9, beta2=0.999, **kwargs)

Adamax optimizer. For details, please refer to Adam: A Method for Stochastic Optimization.

$\begin{split}m_t & = \beta_1 * m_{t-1} + (1-\beta_1)* \nabla Q_i(w) \\ u_t & = max(\beta_2*u_{t-1}, abs(\nabla Q_i(w))) \\ w_t & = w_{t-1} - (\eta/(1-\beta_1^t))*m_t/u_t\end{split}$
Parameters:

- beta1 (float) – the $\beta_1$ in the equation.
- beta2 (float) – the $\beta_2$ in the equation.
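As a plain-Python sketch of the Adamax update above for a scalar weight at iteration $t \ge 1$ (illustrative only; not the paddle.v2 API):

```python
def adamax_step(w, m, u, g, t, lr=0.002, beta1=0.9, beta2=0.999):
    """One Adamax update at iteration t (t >= 1) for a scalar weight."""
    m = beta1 * m + (1 - beta1) * g           # first-moment estimate m_t
    u = max(beta2 * u, abs(g))                # infinity-norm accumulator u_t
    w = w - (lr / (1 - beta1 ** t)) * m / u   # bias-corrected step
    return w, m, u

# First step (t=1) from w=1.0 with gradient 1.0:
w, m, u = adamax_step(1.0, 0.0, 0.0, 1.0, t=1)
```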

class paddle.v2.optimizer.AdaGrad(**kwargs)

AdaGrad optimizer. For details, please refer to Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.

$\begin{split}G &= \sum_{\tau=1}^{t} g_{\tau} g_{\tau}^T \\ w & = w - \eta diag(G)^{-\frac{1}{2}} \circ g\end{split}$
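For a single scalar weight, the diagonal of $G$ reduces to a running sum of squared gradients, and the update can be sketched in plain Python. The small `eps` term is an assumption added here to avoid division by zero; it does not appear in the formula above:

```python
import math

def adagrad_step(w, G, g, lr=0.01, eps=1e-6):
    """One AdaGrad update for a scalar weight.

    G accumulates squared gradients (the diagonal of the G matrix above);
    eps is a small stabilizer, not part of the formula as written.
    """
    G = G + g * g                        # accumulate squared gradient
    w = w - lr * g / math.sqrt(G + eps)  # scale step by diag(G)^(-1/2)
    return w, G

# One step from w=1.0 with gradient 1.0 and learning rate 0.1:
w, G = adagrad_step(1.0, 0.0, 1.0, lr=0.1)
```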

class paddle.v2.optimizer.DecayedAdaGrad(rho=0.95, epsilon=1e-06, **kwargs)

AdaGrad method with a decayed sum of gradients. The update equations are as follows:

$\begin{split}E(g_t^2) &= \rho * E(g_{t-1}^2) + (1-\rho) * g^2 \\ learning\_rate &= 1/\sqrt{E(g_t^2) + \epsilon}\end{split}$
Parameters:

- rho (float) – the $\rho$ in the equation.
- epsilon (float) – the $\epsilon$ in the equation.
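A plain-Python sketch of one DecayedAdaGrad step for a scalar weight (illustrative only; the base learning rate `lr` multiplies the $1/\sqrt{E(g_t^2)+\epsilon}$ scaling from the equation):

```python
import math

def decayed_adagrad_step(w, E, g, lr=0.01, rho=0.95, eps=1e-6):
    """One DecayedAdaGrad update; E is the decayed sum of squared gradients."""
    E = rho * E + (1 - rho) * g * g      # E(g_t^2)
    w = w - lr * g / math.sqrt(E + eps)  # step scaled by 1/sqrt(E + eps)
    return w, E

# One step from w=1.0 with gradient 1.0 and base learning rate 0.1:
w, E = decayed_adagrad_step(1.0, 0.0, 1.0, lr=0.1)
```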

class paddle.v2.optimizer.AdaDelta(rho=0.95, epsilon=1e-06, **kwargs)

AdaDelta optimizer. The update equations are as follows:
$\begin{split}E(g_t^2) &= \rho * E(g_{t-1}^2) + (1-\rho) * g^2 \\ learning\_rate &= \sqrt{( E(dx_{t-1}^2) + \epsilon ) / ( E(g_t^2) + \epsilon )} \\ E(dx_t^2) &= \rho * E(dx_{t-1}^2) + (1-\rho) * (-g*learning\_rate)^2\end{split}$
Parameters:

- rho (float) – the $\rho$ in the equation.
- epsilon (float) – the $\epsilon$ in the equation.
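A plain-Python sketch of one AdaDelta step for a scalar weight, following the three equations above (illustrative only; not the paddle.v2 API):

```python
import math

def adadelta_step(w, Eg, Edx, g, rho=0.95, eps=1e-6):
    """One AdaDelta update for a scalar weight."""
    Eg = rho * Eg + (1 - rho) * g * g          # E(g_t^2)
    lr = math.sqrt((Edx + eps) / (Eg + eps))   # per-step learning rate
    dx = -g * lr                               # parameter change
    Edx = rho * Edx + (1 - rho) * dx * dx      # E(dx_t^2)
    return w + dx, Eg, Edx

# One step from w=1.0 with gradient 1.0:
w, Eg, Edx = adadelta_step(1.0, 0.0, 0.0, 1.0)
```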

RMSProp¶

class paddle.v2.optimizer.RMSProp(rho=0.95, epsilon=1e-06, **kwargs)

RMSProp (Root Mean Square Propagation) optimizer. For details, please refer to this slide.

The equations of this method are as follows:

$\begin{split}v(w, t) & = \rho v(w, t-1) + (1 - \rho)(\nabla Q_{i}(w))^2 \\ w & = w - \frac{\eta} {\sqrt{v(w,t) + \epsilon}} \nabla Q_{i}(w)\end{split}$
Parameters:

- rho (float) – the $\rho$ in the equation; the forgetting factor.
- epsilon (float) – the $\epsilon$ in the equation.
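A plain-Python sketch of one RMSProp step for a scalar weight, following the equations above (illustrative only; not the paddle.v2 API):

```python
import math

def rmsprop_step(w, v, g, lr=0.01, rho=0.95, eps=1e-6):
    """One RMSProp update; v is the decayed average of squared gradients."""
    v = rho * v + (1 - rho) * g * g      # v(w, t)
    w = w - lr * g / math.sqrt(v + eps)  # scale step by 1/sqrt(v + eps)
    return w, v

# One step from w=1.0 with gradient 1.0 and learning rate 0.1:
w, v = rmsprop_step(1.0, 0.0, 1.0, lr=0.1)
```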