fluid.optimizer

SGD

paddle.fluid.optimizer.SGD

alias of SGDOptimizer

Momentum

paddle.fluid.optimizer.Momentum

alias of MomentumOptimizer

Adagrad

paddle.fluid.optimizer.Adagrad

alias of AdagradOptimizer

Adam

paddle.fluid.optimizer.Adam

alias of AdamOptimizer

Adamax

paddle.fluid.optimizer.Adamax

alias of AdamaxOptimizer

DecayedAdagrad

paddle.fluid.optimizer.DecayedAdagrad

alias of DecayedAdagradOptimizer

Ftrl

paddle.fluid.optimizer.Ftrl

alias of FtrlOptimizer

SGDOptimizer

class paddle.fluid.optimizer.SGDOptimizer(learning_rate, **kwargs)

Optimizer of the stochastic gradient descent algorithm.

\[param\_out = param - learning\_rate * grad\]
Parameters:
  • learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.

Examples

sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.2)
sgd_optimizer.minimize(cost)
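
For illustration, the update rule above can be transcribed directly into a few lines of NumPy. This is a minimal sketch of one SGD step, not the fluid kernel; the helper name and array shapes are assumptions.

import numpy as np

def sgd_step(param, grad, learning_rate=0.2):
    # param_out = param - learning_rate * grad
    return param - learning_rate * grad

param = np.array([1.0, 2.0, 3.0])
grad = np.array([0.1, 0.1, 0.1])
param = sgd_step(param, grad)  # array([0.98, 1.98, 2.98])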

MomentumOptimizer

class paddle.fluid.optimizer.MomentumOptimizer(learning_rate, momentum, use_nesterov=False, **kwargs)

Simple Momentum optimizer with velocity state

This optimizer has a flag for Nesterov momentum.

The update equations are as follows:

\[
\begin{aligned}
& velocity = mu * velocity + gradient \\
& if \ (use\_nesterov): \\
& \quad param = param - (gradient + mu * velocity) * learning\_rate \\
& else: \\
& \quad param = param - learning\_rate * velocity
\end{aligned}
\]
Parameters:
  • learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
  • momentum (float) – momentum factor
  • use_nesterov (bool) – enables Nesterov momentum

Examples

optimizer = fluid.optimizer.Momentum(learning_rate=0.2, momentum=0.1)
optimizer.minimize(cost)
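
As a reference for the equations above, here is a minimal NumPy sketch of one momentum step with the optional Nesterov branch. It is an illustration only, not the fluid momentum op; the helper name is an assumption.

import numpy as np

def momentum_step(param, grad, velocity, learning_rate=0.2, mu=0.1,
                  use_nesterov=False):
    # velocity = mu * velocity + gradient
    velocity = mu * velocity + grad
    if use_nesterov:
        # param = param - (gradient + mu * velocity) * learning_rate
        param = param - (grad + mu * velocity) * learning_rate
    else:
        # param = param - learning_rate * velocity
        param = param - learning_rate * velocity
    return param, velocity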

AdagradOptimizer

class paddle.fluid.optimizer.AdagradOptimizer(learning_rate, epsilon=1e-06, **kwargs)

Adaptive Gradient Algorithm (Adagrad)

The update is done as follows:

\[
\begin{aligned}
moment\_out &= moment + grad * grad \\
param\_out &= param - \frac{learning\_rate * grad}{\sqrt{moment\_out} + \epsilon}
\end{aligned}
\]

The original paper (http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf) does not have the epsilon attribute. It is added in our implementation, as also proposed in http://cs231n.github.io/neural-networks-3/#ada, for numerical stability to avoid division by zero.

Parameters:
  • learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
  • epsilon (float) – a small float value for numerical stability.

Examples

optimizer = fluid.optimizer.Adagrad(learning_rate=0.2)
optimizer.minimize(cost)
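
The accumulator-based update above can be sketched in NumPy as follows; this is an illustration of the documented equations, not the fluid kernel.

import numpy as np

def adagrad_step(param, grad, moment, learning_rate=0.2, epsilon=1e-06):
    # moment_out = moment + grad * grad
    moment = moment + grad * grad
    # param_out = param - learning_rate * grad / (sqrt(moment_out) + epsilon)
    param = param - learning_rate * grad / (np.sqrt(moment) + epsilon)
    return param, moment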

AdamOptimizer

class paddle.fluid.optimizer.AdamOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, **kwargs)

This implements the Adam optimizer from Section 2 of the Adam paper: https://arxiv.org/abs/1412.6980. Adam is a first-order gradient-based optimization method based on adaptive estimates of lower-order moments.

Adam updates:

\[
\begin{aligned}
t & = t + 1 \\
moment\_1\_out & = {\beta}_1 * moment\_1 + (1 - {\beta}_1) * grad \\
moment\_2\_out & = {\beta}_2 * moment\_2 + (1 - {\beta}_2) * grad * grad \\
learning\_rate & = learning\_rate * \frac{\sqrt{1 - {\beta}_2^t}}{1 - {\beta}_1^t} \\
param\_out & = param - learning\_rate * \frac{moment\_1\_out}{\sqrt{moment\_2\_out} + \epsilon}
\end{aligned}
\]
Parameters:
  • learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
  • beta1 (float) – The exponential decay rate for the 1st moment estimates.
  • beta2 (float) – The exponential decay rate for the 2nd moment estimates.
  • epsilon (float) – a small float value for numerical stability.

Examples

optimizer = fluid.optimizer.Adam(learning_rate=0.2)
optimizer.minimize(cost)
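
The Adam update, including the bias-corrected step size from the equations above, can be sketched in NumPy as follows (illustration only, not the fluid kernel; the helper name is an assumption).

import numpy as np

def adam_step(param, grad, moment_1, moment_2, t,
              learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08):
    t += 1
    moment_1 = beta1 * moment_1 + (1 - beta1) * grad
    moment_2 = beta2 * moment_2 + (1 - beta2) * grad * grad
    # bias-corrected step size
    lr_t = learning_rate * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    param = param - lr_t * moment_1 / (np.sqrt(moment_2) + epsilon)
    return param, moment_1, moment_2, t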

AdamaxOptimizer

class paddle.fluid.optimizer.AdamaxOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, **kwargs)

We implement the Adamax optimizer from Section 7 of the Adam paper: https://arxiv.org/abs/1412.6980. Adamax is a variant of the Adam algorithm based on the infinity norm.

Adamax updates:

\[
\begin{aligned}
t & = t + 1 \\
moment\_out & = {\beta}_1 * moment + (1 - {\beta}_1) * grad \\
inf\_norm\_out & = max({\beta}_2 * inf\_norm + \epsilon, |grad|) \\
learning\_rate & = \frac{learning\_rate}{1 - {\beta}_1^t} \\
param\_out & = param - learning\_rate * \frac{moment\_out}{inf\_norm\_out}
\end{aligned}
\]

The original paper does not have an epsilon attribute. However, it is added here for numerical stability to prevent the division by 0 error.

Parameters:
  • learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
  • beta1 (float) – The exponential decay rate for the 1st moment estimates.
  • beta2 (float) – The exponential decay rate for the 2nd moment estimates.
  • epsilon (float) – a small float value for numerical stability.

Examples

optimizer = fluid.optimizer.Adamax(learning_rate=0.2)
optimizer.minimize(cost)
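
A NumPy sketch of the infinity-norm update above (illustration only, not the fluid kernel; the helper name is an assumption):

import numpy as np

def adamax_step(param, grad, moment, inf_norm, t,
                learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08):
    t += 1
    moment = beta1 * moment + (1 - beta1) * grad
    # infinity-norm accumulator; epsilon keeps it away from zero
    inf_norm = np.maximum(beta2 * inf_norm + epsilon, np.abs(grad))
    lr_t = learning_rate / (1 - beta1 ** t)
    param = param - lr_t * moment / inf_norm
    return param, moment, inf_norm, t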

DecayedAdagradOptimizer

class paddle.fluid.optimizer.DecayedAdagradOptimizer(learning_rate, decay=0.95, epsilon=1e-06, **kwargs)

Decayed Adagrad Optimizer

The original paper is http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf.

The update is done as follows:

\[
\begin{aligned}
moment\_out & = decay * moment + (1 - decay) * grad * grad \\
param\_out & = param - \frac{learning\_rate * grad}{\sqrt{moment\_out} + \epsilon}
\end{aligned}
\]

The original paper does not have an epsilon attribute. It is added here for numerical stability to avoid division by zero.

Parameters:
  • learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
  • decay (float) – decay rate.
  • epsilon (float) – a small float value for numerical stability.

Examples

optimizer = fluid.optimizer.DecayedAdagrad(learning_rate=0.2)
optimizer.minimize(cost)
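
The decayed accumulator distinguishes this optimizer from plain Adagrad; the following NumPy sketch of the documented equations makes that explicit (illustration only, not the fluid kernel).

import numpy as np

def decayed_adagrad_step(param, grad, moment, learning_rate=0.2,
                         decay=0.95, epsilon=1e-06):
    # moment_out = decay * moment + (1 - decay) * grad * grad
    moment = decay * moment + (1 - decay) * grad * grad
    param = param - learning_rate * grad / (np.sqrt(moment) + epsilon)
    return param, moment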

RMSPropOptimizer

class paddle.fluid.optimizer.RMSPropOptimizer(learning_rate, rho=0.95, epsilon=1e-06, momentum=0.0, **kwargs)

Root Mean Squared Propagation (RMSProp) is an unpublished, adaptive learning-rate method. It was originally proposed on slide 29 of http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf.

The original equation is as follows:

\[
\begin{aligned}
r(w, t) & = \rho r(w, t-1) + (1 - \rho)(\nabla Q_{i}(w))^2 \\
w & = w - \frac{\eta}{\sqrt{r(w,t) + \epsilon}} \nabla Q_{i}(w)
\end{aligned}
\]

The first equation calculates a moving average of the squared gradients for each weight; the second then divides the gradient by \(\sqrt{r(w,t)}\).

In some cases, adding a momentum term \(\beta\) is beneficial. In our implementation, Nesterov momentum is used:

\[
\begin{aligned}
r(w, t) & = \rho r(w, t-1) + (1 - \rho)(\nabla Q_{i}(w))^2 \\
v(w, t) & = \beta v(w, t-1) + \frac{\eta}{\sqrt{r(w,t) + \epsilon}} \nabla Q_{i}(w) \\
w & = w - v(w, t)
\end{aligned}
\]

where \(\rho\) is a hyperparameter with typical values such as 0.9 or 0.95, \(\beta\) is the momentum term, and \(\epsilon\) is a smoothing term that avoids division by zero, usually set somewhere in the range from 1e-4 to 1e-8.

Parameters:
  • learning_rate (float) – global learning rate.
  • rho (float) – \(\rho\) in the equation, set to 0.95 by default.
  • epsilon (float) – \(\epsilon\) in the equation, a smoothing term to avoid division by zero, set to 1e-6 by default.
  • momentum (float) – \(\beta\) in the equation, the momentum term, set to 0.0 by default.
Raises:

ValueError – If any of learning_rate, rho, epsilon, or momentum is None.

Examples

optimizer = fluid.optimizer.RMSProp(0.0001)
_, params_grads = optimizer.minimize(cost)
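
The momentum-augmented update above can be sketched in NumPy as follows; this is an illustration of the documented equations, not the fluid kernel.

import numpy as np

def rmsprop_step(param, grad, r, v, learning_rate=0.0001,
                 rho=0.95, epsilon=1e-06, momentum=0.0):
    # r(w, t) = rho * r(w, t-1) + (1 - rho) * grad^2
    r = rho * r + (1 - rho) * grad * grad
    # v(w, t) = momentum * v(w, t-1) + lr / sqrt(r + epsilon) * grad
    v = momentum * v + learning_rate / np.sqrt(r + epsilon) * grad
    # w = w - v(w, t)
    param = param - v
    return param, r, v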

FtrlOptimizer

class paddle.fluid.optimizer.FtrlOptimizer(learning_rate, l1=0.0, l2=0.0, lr_power=-0.5, **kwargs)

FTRL (Follow The Regularized Leader) Optimizer.

Follow The Regularized Leader (FTRL) was proposed in https://www.eecs.tufts.edu/~dsculley/papers/ad-click-prediction.pdf.

\[
\begin{aligned}
& new\_accum = squared\_accum + grad^2 \\
& if \ (lr\_power == -0.5): \\
& \quad linear\_accum += grad - \frac{\sqrt{new\_accum} - \sqrt{squared\_accum}}{learning\_rate * param} \\
& else: \\
& \quad linear\_accum += grad - \frac{new\_accum^{-lr\_power} - accum^{-lr\_power}}{learning\_rate * param} \\
& x = l1 * sign(linear\_accum) - linear\_accum \\
& if \ (lr\_power == -0.5): \\
& \quad y = \frac{\sqrt{new\_accum}}{learning\_rate} + (2 * l2) \\
& \quad pre\_shrink = \frac{x}{y} \\
& \quad param = (abs(linear\_accum) > l1).select(pre\_shrink, 0.0) \\
& else: \\
& \quad y = \frac{new\_accum^{-lr\_power}}{learning\_rate} + (2 * l2) \\
& \quad pre\_shrink = \frac{x}{y} \\
& \quad param = (abs(linear\_accum) > l1).select(pre\_shrink, 0.0) \\
& squared\_accum += grad^2
\end{aligned}
\]
Parameters:
  • learning_rate (float|Variable) – global learning rate.
  • l1 (float) – L1 regularization strength, set to 0.0 by default.
  • l2 (float) – L2 regularization strength, set to 0.0 by default.
  • lr_power (float) – learning rate power, set to -0.5 by default.
Raises:

ValueError – If any of learning_rate, l1, l2, or lr_power is None.

Examples

optimizer = fluid.optimizer.Ftrl(0.0001)
_, params_grads = optimizer.minimize(cost)
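
Because the update above is involved, the following NumPy sketch transcribes the documented equations literally, covering both lr_power branches. It is purely an illustration under the assumption of elementwise arrays with nonzero param; it is not the fluid kernel.

import numpy as np

def ftrl_step(param, grad, squared_accum, linear_accum,
              learning_rate=0.0001, l1=0.0, l2=0.0, lr_power=-0.5):
    new_accum = squared_accum + grad ** 2
    if lr_power == -0.5:
        linear_accum = linear_accum + grad - (
            np.sqrt(new_accum) - np.sqrt(squared_accum)) / (learning_rate * param)
        y = np.sqrt(new_accum) / learning_rate + 2 * l2
    else:
        linear_accum = linear_accum + grad - (
            new_accum ** -lr_power - squared_accum ** -lr_power) / (learning_rate * param)
        y = new_accum ** -lr_power / learning_rate + 2 * l2
    x = l1 * np.sign(linear_accum) - linear_accum
    pre_shrink = x / y
    # param is zeroed wherever |linear_accum| <= l1
    param = np.where(np.abs(linear_accum) > l1, pre_shrink, 0.0)
    squared_accum = squared_accum + grad ** 2
    return param, squared_accum, linear_accum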

Adadelta

paddle.fluid.optimizer.Adadelta

alias of AdadeltaOptimizer

ModelAverage

class paddle.fluid.optimizer.ModelAverage(average_window_rate, min_average_window=10000, max_average_window=10000, **kwargs)

Accumulate the average of parameters within a sliding window. The averaged results are saved in temporary variables, which can be applied to the parameter variables of the current model by calling the ‘apply()’ method. The ‘restore()’ method is then used to restore the parameter values of the current model.

The size of the average window is determined by average_window_rate, min_average_window, max_average_window and the current number of update steps.

Parameters:
  • average_window_rate – The rate of average window.
  • min_average_window – The minimum size of average window.
  • max_average_window – The maximum size of average window.

Examples

optimizer = fluid.optimizer.Momentum(learning_rate=0.2, momentum=0.1)
optimizer.minimize(cost)
model_average = fluid.optimizer.ModelAverage(0.15,
                                             min_average_window=10000,
                                             max_average_window=20000)
for pass_id in range(args.pass_num):
    for data in train_reader():
        exe.run(fluid.default_main_program()...)

    with model_average.apply(exe):
        for data in test_reader():
            exe.run(inference_program...)
apply(*args, **kwds)

Apply average values to parameters of current model.

restore(executor)

Restore parameter values of current model.
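
To make the sliding-window idea concrete, here is a toy, framework-free sketch in plain Python/NumPy. It keeps recent parameter snapshots, swaps in their average inside apply(), and restores the original values afterwards. The class name and bookkeeping are illustrative assumptions; the fluid implementation additionally sizes the window from average_window_rate and the min/max bounds.

import numpy as np
from contextlib import contextmanager

class ToyModelAverage(object):
    """Illustrative sliding-window average over parameter snapshots."""

    def __init__(self, max_average_window=20000):
        self.max_average_window = max_average_window
        self.snapshots = []

    def record(self, param):
        # Called once per update step with the current parameter values.
        self.snapshots.append(np.array(param, copy=True))
        if len(self.snapshots) > self.max_average_window:
            self.snapshots.pop(0)

    @contextmanager
    def apply(self, param):
        # Swap in the windowed average for evaluation, then restore it.
        assert self.snapshots, "record() must be called before apply()"
        backup = np.array(param, copy=True)
        param[...] = np.mean(self.snapshots, axis=0)
        try:
            yield
        finally:
            param[...] = backup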

Optimizer

class paddle.fluid.optimizer.Optimizer(learning_rate, regularization=None, LARS_weight_decay=0.0)

Optimizer Base class.

Define the common interface of an optimizer. Users should not use this class directly, but should instead use one of its implementations.

global_learning_rate(program=None)

Get the global decayed learning rate.

create_optimization_pass(parameters_and_grads, loss, startup_program=None)

Add optimization operators to update gradients to variables.

Parameters:
  • loss (Variable) – the target that this optimization is for.
  • parameters_and_grads (list(tuple(Variable, Variable))) – a list of (variable, gradient) pairs to update.
Returns:

a list of operators that will complete one step of optimization. This will include parameter update ops, global step update ops and any other custom ops required by subclasses to manage their internal state.

Return type:

return_op_list

minimize(loss, startup_program=None, parameter_list=None, no_grad_set=None)

Add operations to minimize loss by updating parameter_list.

This method combines the interfaces append_backward() and create_optimization_pass() into one.
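
As a usage sketch, minimize() appends the backward pass and the parameter-update ops in one call and returns the optimization ops together with the (parameter, gradient) pairs. The tiny network below (built with fluid.layers.data, fluid.layers.fc and fluid.layers.mean) is only an assumed stand-in for a real cost.

import paddle.fluid as fluid

# Build a tiny program whose mean output serves as the cost to minimize.
x = fluid.layers.data(name='x', shape=[1], dtype='float32')
y = fluid.layers.fc(input=x, size=1)
cost = fluid.layers.mean(y)

sgd = fluid.optimizer.SGD(learning_rate=0.2)
# Equivalent to append_backward() followed by create_optimization_pass().
optimize_ops, params_grads = sgd.minimize(cost)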
