fluid.optimizer

SGD

paddle.fluid.optimizer.SGD

alias of SGDOptimizer

Momentum

paddle.fluid.optimizer.Momentum

alias of MomentumOptimizer

Adagrad

paddle.fluid.optimizer.Adagrad

alias of AdagradOptimizer

Adam

paddle.fluid.optimizer.Adam

alias of AdamOptimizer

Adamax

paddle.fluid.optimizer.Adamax

alias of AdamaxOptimizer

DecayedAdagrad

paddle.fluid.optimizer.DecayedAdagrad

alias of DecayedAdagradOptimizer

Ftrl

paddle.fluid.optimizer.Ftrl

alias of FtrlOptimizer

SGDOptimizer

class paddle.fluid.optimizer.SGDOptimizer(learning_rate, regularization=None, name=None)

Optimizer of the stochastic gradient descent algorithm.

\[param\_out = param - learning\_rate * grad\]
Parameters:
  • learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
  • regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
  • name – An optional name prefix.

Examples

sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.2)
sgd_optimizer.minimize(cost)
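
A fuller usage sketch is shown below. It assumes the fluid.layers API for building the cost; the layer calls, names, and shapes are illustrative and not part of the SGD interface.

import paddle.fluid as fluid

# Build a small linear-regression cost (illustrative layers/shapes),
# then let SGD minimize it.
x = fluid.layers.data(name='x', shape=[13], dtype='float32')
y = fluid.layers.data(name='y', shape=[1], dtype='float32')
y_predict = fluid.layers.fc(input=x, size=1)
cost = fluid.layers.square_error_cost(input=y_predict, label=y)
avg_cost = fluid.layers.mean(x=cost)

sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.2)
sgd_optimizer.minimize(avg_cost)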

MomentumOptimizer

class paddle.fluid.optimizer.MomentumOptimizer(learning_rate, momentum, use_nesterov=False, regularization=None, name=None)

Simple Momentum optimizer with velocity state

This optimizer has a flag for Nesterov momentum.

The update equations are as follows:

\[ \begin{align}\begin{aligned}& velocity = mu * velocity + gradient\\& if (use\_nesterov):\\&\quad param = param - (gradient + mu * velocity) * learning\_rate\\& else:\\&\quad param = param - learning\_rate * velocity\end{aligned}\end{align} \]
Parameters:
  • learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
  • momentum (float) – momentum factor
  • use_nesterov (bool) – enables Nesterov momentum
  • regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
  • name – An optional name prefix.

Examples

optimizer = fluid.optimizer.Momentum(learning_rate=0.2, momentum=0.1)
optimizer.minimize(cost)
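
As a reading aid, a minimal NumPy transcription of the update equations above (an illustration, not the underlying operator implementation):

import numpy as np

def momentum_step(param, grad, velocity, lr=0.2, mu=0.1, use_nesterov=False):
    # velocity = mu * velocity + gradient
    velocity = mu * velocity + grad
    if use_nesterov:
        # param = param - (gradient + mu * velocity) * learning_rate
        param = param - (grad + mu * velocity) * lr
    else:
        # param = param - learning_rate * velocity
        param = param - lr * velocity
    return param, velocity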

AdagradOptimizer

class paddle.fluid.optimizer.AdagradOptimizer(learning_rate, epsilon=1e-06, regularization=None, name=None)

Adaptive Gradient Algorithm (Adagrad)

The update is done as follows:

\[ \begin{align}\begin{aligned}moment\_out &= moment + grad * grad\\param\_out &= param - \frac{learning\_rate * grad}{\sqrt{moment\_out} + \epsilon}\end{aligned}\end{align} \]

The original paper (http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf) does not have an epsilon attribute. It is added to our implementation, as also proposed at http://cs231n.github.io/neural-networks-3/#ada, for numerical stability to avoid division by zero.

Parameters:
  • learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
  • epsilon (float) – a small float value for numerical stability.
  • regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
  • name – An optional name prefix.

Examples

optimizer = fluid.optimizer.Adagrad(learning_rate=0.2)
optimizer.minimize(cost)
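
A minimal NumPy transcription of the update above, for illustration only:

import numpy as np

def adagrad_step(param, grad, moment, lr=0.2, epsilon=1e-06):
    # Accumulate squared gradients, then scale each step element-wise.
    moment = moment + grad * grad
    param = param - lr * grad / (np.sqrt(moment) + epsilon)
    return param, moment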

AdamOptimizer

class paddle.fluid.optimizer.AdamOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, regularization=None, name=None)

This implements the Adam optimizer from Section 2 of the Adam paper: https://arxiv.org/abs/1412.6980. Adam is a first-order gradient-based optimization method based on adaptive estimates of lower-order moments.

Adam updates:

\[ \begin{align}\begin{aligned}t & = t + 1\\moment\_1\_out & = {\beta}_1 * moment\_1 + (1 - {\beta}_1) * grad\\moment\_2\_out & = {\beta}_2 * moment\_2 + (1 - {\beta}_2) * grad * grad\\learning\_rate & = learning\_rate * \frac{\sqrt{1 - {\beta}_2^t}}{1 - {\beta}_1^t}\\param\_out & = param - learning\_rate * \frac{moment\_1\_out}{\sqrt{moment\_2\_out} + \epsilon}\end{aligned}\end{align} \]
Parameters:
  • learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
  • beta1 (float) – The exponential decay rate for the 1st moment estimates.
  • beta2 (float) – The exponential decay rate for the 2nd moment estimates.
  • epsilon (float) – a small float value for numerical stability.
  • regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
  • name – An optional name prefix.

Examples

optimizer = fluid.optimizer.Adam(learning_rate=0.2)
optimizer.minimize(cost)
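
A minimal NumPy transcription of the update above (the bias correction is folded into the effective learning rate), for illustration only:

import numpy as np

def adam_step(param, grad, m1, m2, t, lr=0.001, beta1=0.9, beta2=0.999,
              epsilon=1e-08):
    t += 1
    m1 = beta1 * m1 + (1 - beta1) * grad         # 1st moment estimate
    m2 = beta2 * m2 + (1 - beta2) * grad * grad  # 2nd moment estimate
    lr_t = lr * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    param = param - lr_t * m1 / (np.sqrt(m2) + epsilon)
    return param, m1, m2, t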

AdamaxOptimizer

class paddle.fluid.optimizer.AdamaxOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, regularization=None, name=None)

We implement the Adamax optimizer from Section 7 of the Adam paper: https://arxiv.org/abs/1412.6980. Adamax is a variant of the Adam algorithm based on the infinity norm.

Adamax updates:

\[ \begin{align}\begin{aligned}t & = t + 1\\moment\_out & = {\beta}_1 * moment + (1 - {\beta}_1) * grad\\inf\_norm\_out & = max({\beta}_2 * inf\_norm + \epsilon, |grad|)\\learning\_rate & = \frac{learning\_rate}{1 - {\beta}_1^t}\\param\_out & = param - learning\_rate * \frac{moment\_out}{inf\_norm\_out}\end{aligned}\end{align} \]

The original paper does not have an epsilon attribute. However, it is added here for numerical stability to prevent the division by 0 error.

Parameters:
  • learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
  • beta1 (float) – The exponential decay rate for the 1st moment estimates.
  • beta2 (float) – The exponential decay rate for the 2nd moment estimates.
  • epsilon (float) – a small float value for numerical stability.
  • regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
  • name – An optional name prefix.

Examples

optimizer = fluid.optimizer.Adamax(learning_rate=0.2)
optimizer.minimize(cost)
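
A minimal NumPy transcription of the update above, for illustration only:

import numpy as np

def adamax_step(param, grad, moment, inf_norm, t, lr=0.001, beta1=0.9,
                beta2=0.999, epsilon=1e-08):
    t += 1
    moment = beta1 * moment + (1 - beta1) * grad
    # Exponentially weighted infinity norm of past gradients.
    inf_norm = np.maximum(beta2 * inf_norm + epsilon, np.abs(grad))
    lr_t = lr / (1 - beta1 ** t)
    param = param - lr_t * moment / inf_norm
    return param, moment, inf_norm, t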

Notes

Currently, AdamaxOptimizer doesn’t support sparse gradient.

DecayedAdagradOptimizer

class paddle.fluid.optimizer.DecayedAdagradOptimizer(learning_rate, decay=0.95, epsilon=1e-06, regularization=None, name=None)

Decayed Adagrad Optimizer

The original Adagrad paper: http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf

The update is done as follows:

\[ \begin{align}\begin{aligned}moment\_out & = decay * moment + (1 - decay) * grad * grad\\param\_out & = param - \frac{learning\_rate * grad}{\sqrt{moment\_out} + \epsilon}\end{aligned}\end{align} \]

The original paper does not have an epsilon attribute. It is added here for numerical stability to avoid division by zero.

Parameters:
  • learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
  • decay (float) – decay rate.
  • epsilon (float) – a small float value for numerical stability.
  • regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
  • name – An optional name prefix.

Examples

optimizer = fluid.optimizer.DecayedAdagrad(learning_rate=0.2)
optimizer.minimize(cost)
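
A minimal NumPy transcription of the update above, for illustration only:

import numpy as np

def decayed_adagrad_step(param, grad, moment, lr=0.2, decay=0.95, epsilon=1e-06):
    # Exponentially decayed accumulation instead of Adagrad's plain sum.
    moment = decay * moment + (1 - decay) * grad * grad
    param = param - lr * grad / (np.sqrt(moment) + epsilon)
    return param, moment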

Notes

Currently, DecayedAdagradOptimizer doesn’t support sparse gradient.

RMSPropOptimizer

class paddle.fluid.optimizer.RMSPropOptimizer(learning_rate, rho=0.95, epsilon=1e-06, momentum=0.0, centered=False, regularization=None, name=None)

Root Mean Squared Propagation (RMSProp) is an unpublished adaptive learning rate method, originally proposed on slide 29 of http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf.

The original equation is as follows:

\[ \begin{align}\begin{aligned}r(w, t) & = \rho r(w, t-1) + (1 - \rho)(\nabla Q_{i}(w))^2\\w & = w - \frac{\eta} {\sqrt{r(w,t) + \epsilon}} \nabla Q_{i}(w)\end{aligned}\end{align} \]

The first equation calculates a moving average of the squared gradient for each weight; the gradient is then divided by \(\sqrt{r(w,t)}\).

In some cases, adding a momentum term \(\beta\) is beneficial. In our implementation, Nesterov momentum is used:

\[ \begin{align}\begin{aligned}r(w, t) & = \rho r(w, t-1) + (1 - \rho)(\nabla Q_{i}(w))^2\\v(w, t) & = \beta v(w, t-1) + \frac{\eta} {\sqrt{r(w,t) + \epsilon}} \nabla Q_{i}(w)\\w & = w - v(w, t)\end{aligned}\end{align} \]

if centered is True:

\[ \begin{align}\begin{aligned}r(w, t) & = \rho r(w, t-1) + (1 - \rho)(\nabla Q_{i}(w))^2\\g(w, t) & = \rho g(w, t-1) + (1 - \rho)\nabla Q_{i}(w)\\v(w, t) & = \beta v(w, t-1) + \frac{\eta} {\sqrt{r(w,t) - (g(w, t))^2 + \epsilon}} \nabla Q_{i}(w)\\w & = w - v(w, t)\end{aligned}\end{align} \]

where \(\rho\) is a hyperparameter with typical values such as 0.9 or 0.95, \(\beta\) is the momentum term, and \(\epsilon\) is a smoothing term to avoid division by zero, usually set in the range from 1e-4 to 1e-8.

Parameters:
  • learning_rate (float) – global learning rate.
  • rho (float) – \(\rho\) in the equations above; 0.95 by default.
  • epsilon (float) – \(\epsilon\) in the equations above, a smoothing term to avoid division by zero; 1e-6 by default.
  • momentum (float) – \(\beta\) in the equations above, the momentum term; 0.0 by default.
  • centered (bool) – If True, gradients are normalized by the estimated variance of the gradient; if False, by the uncentered second moment. Setting this to True may help with training, but is slightly more expensive in terms of computation and memory. Defaults to False.
  • regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
  • name – An optional name prefix.
Raises:

ValueError – If learning_rate, rho, epsilon, or momentum is None.

Examples

optimizer = fluid.optimizer.RMSProp(0.0001)
_, params_grads = optimizer.minimize(cost)
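
A minimal NumPy transcription of the update above, covering both the plain and the centered variant, for illustration only:

import numpy as np

def rmsprop_step(w, grad, r, v, g_avg, lr=0.0001, rho=0.95, epsilon=1e-06,
                 momentum=0.0, centered=False):
    r = rho * r + (1 - rho) * grad ** 2          # moving average of squared gradient
    if centered:
        g_avg = rho * g_avg + (1 - rho) * grad   # moving average of the gradient
        v = momentum * v + lr / np.sqrt(r - g_avg ** 2 + epsilon) * grad
    else:
        v = momentum * v + lr / np.sqrt(r + epsilon) * grad
    w = w - v
    return w, r, v, g_avg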

FtrlOptimizer

class paddle.fluid.optimizer.FtrlOptimizer(learning_rate, l1=0.0, l2=0.0, lr_power=-0.5, regularization=None, name=None)

FTRL (Follow The Regularized Leader) Optimizer.

The paper that proposed Follow The Regularized Leader (FTRL): https://www.eecs.tufts.edu/~dsculley/papers/ad-click-prediction.pdf

\[ \begin{align}\begin{aligned}&new\_accum = squared\_accum + grad^2\\&if (lr\_power == -0.5):\\&\quad linear\_accum += grad - \frac{\sqrt{new\_accum} - \sqrt{squared\_accum}}{learning\_rate * param}\\&else:\\&\quad linear\_accum += grad - \frac{new\_accum^{-lr\_power} - accum^{-lr\_power}}{learning\_rate * param}\\ &x = l1 * sign(linear\_accum) - linear\_accum\\&if (lr\_power == -0.5):\\&\quad y = \frac{\sqrt{new\_accum}}{learning\_rate} + (2 * l2)\\&\quad pre\_shrink = \frac{x}{y}\\&\quad param = (abs(linear\_accum) > l1).select(pre\_shrink, 0.0)\\&else:\\&\quad y = \frac{new\_accum^{-lr\_power}}{learning\_rate} + (2 * l2)\\&\quad pre\_shrink = \frac{x}{y}\\&\quad param = (abs(linear\_accum) > l1).select(pre\_shrink, 0.0)\\&squared\_accum += grad^2\end{aligned}\end{align} \]
Parameters:
  • learning_rate (float|Variable) – global learning rate.
  • l1 (float) – L1 regularization strength; 0.0 by default.
  • l2 (float) – L2 regularization strength; 0.0 by default.
  • lr_power (float) – learning rate power; -0.5 by default.
  • regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
  • name – An optional name prefix.
Raises:

ValueError – If learning_rate is None.

Examples

optimizer = fluid.optimizer.Ftrl(0.0001)
_, params_grads = optimizer.minimize(cost)

Notes

Currently, FtrlOptimizer doesn’t support sparse gradient.

Adadelta

paddle.fluid.optimizer.Adadelta

alias of AdadeltaOptimizer

ModelAverage

class paddle.fluid.optimizer.ModelAverage(average_window_rate, min_average_window=10000, max_average_window=10000, regularization=None, name=None)

Accumulates the average of parameters within a sliding window. The averaged values are saved in temporary variables that can be applied to the parameter variables of the current model by calling the ‘apply()’ method, and the ‘restore()’ method restores the original parameter values of the current model.

The size of average window is determined by average_window_rate, min_average_window, max_average_window and current update times.

Parameters:
  • average_window_rate – The rate of average window.
  • min_average_window – The minimum size of average window.
  • max_average_window – The maximum size of average window.
  • regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
  • name – An optional name prefix.

Examples

optimizer = fluid.optimizer.Momentum()
optimizer.minimize(cost)
model_average = fluid.optimizer.ModelAverage(0.15,
                                             min_average_window=10000,
                                             max_average_window=20000)
for pass_id in range(args.pass_num):
    for data in train_reader():
        exe.run(fluid.default_main_program()...)

    with model_average.apply(exe):
        for data in test_reader():
            exe.run(inference_program...)
apply(*args, **kwds)

Apply the averaged values to the parameters of the current model. As the example above shows, it is typically used as a context manager.

restore(executor)

Restore the parameter values of the current model.
