# fluid.optimizer

## SGD

alias of SGDOptimizer

## Momentum

alias of MomentumOptimizer

## Ftrl

alias of FtrlOptimizer

## SGDOptimizer

Optimizer of the stochastic gradient descent algorithm.

$param\_out = param - learning\_rate * grad$

Parameters:
• learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.

Examples

sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.2)
sgd_optimizer.minimize(cost)
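
The update rule above can be sketched in plain Python, independent of Paddle. This is a minimal illustration rather than the library's implementation: `sgd_step` is a hypothetical helper name, and parameters are modeled as a list of floats.

```python
def sgd_step(param, grad, learning_rate):
    """Apply param_out = param - learning_rate * grad element by element."""
    return [p - learning_rate * g for p, g in zip(param, grad)]


# Illustrative values only.
params = [1.0, -2.0]
grads = [0.5, -1.0]
params = sgd_step(params, grads, learning_rate=0.2)
print(params)
```

Each parameter moves against its gradient, scaled by the learning rate; no state is kept between steps.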

## MomentumOptimizer

Simple momentum optimizer with velocity state.

This optimizer has a flag for Nesterov momentum.

The update equations are as follows:

\begin{align}\begin{aligned}& velocity = mu * velocity + gradient\\& if (use\_nesterov):\\&\quad param = param - (gradient + mu * velocity) * learning\_rate\\& else:\\&\quad param = param - learning\_rate * velocity\end{aligned}\end{align}

Parameters:
• learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
• momentum (float) – momentum factor.
• use_nesterov (bool) – enables Nesterov momentum.

Examples

optimizer = fluid.optimizer.Momentum(learning_rate=0.2, momentum=0.1)
optimizer.minimize(cost)
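
The two branches of the update equations can be compared with a small scalar sketch. It is written in plain Python under the assumption that a single float stands in for a parameter tensor; `momentum_step` is a hypothetical name, not a Paddle function.

```python
def momentum_step(param, grad, velocity, learning_rate, mu, use_nesterov=False):
    """One momentum update on scalars, following the equations above."""
    velocity = mu * velocity + grad
    if use_nesterov:
        # Nesterov branch: gradient plus a look-ahead along the velocity.
        param = param - (grad + mu * velocity) * learning_rate
    else:
        param = param - learning_rate * velocity
    return param, velocity


# Same starting point, both variants (illustrative values only).
plain = momentum_step(1.0, 1.0, 0.0, learning_rate=0.2, mu=0.1)
nesterov = momentum_step(1.0, 1.0, 0.0, learning_rate=0.2, mu=0.1, use_nesterov=True)
print(plain, nesterov)
```

The velocity must be carried from one call to the next; it is the only state this optimizer keeps per parameter.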

## AdagradOptimizer

The update is done as follows:

\begin{align}\begin{aligned}moment\_out &= moment + grad * grad\\param\_out &= param - \frac{learning\_rate * grad}{\sqrt{moment\_out} + \epsilon}\end{aligned}\end{align}

The original paper (http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf) does not have the epsilon attribute. It is added in this implementation for numerical stability, to avoid division by zero, as also proposed at http://cs231n.github.io/neural-networks-3/#ada.

Parameters:
• learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
• epsilon (float) – a small float value for numerical stability.

Examples

optimizer = fluid.optimizer.Adagrad(learning_rate=0.2)
optimizer.minimize(cost)
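
A scalar sketch of the update may help show why epsilon matters. It is plain Python, not the Paddle kernel; `adagrad_step` is a hypothetical name and the default `epsilon=1e-6` is chosen only for illustration.

```python
import math


def adagrad_step(param, grad, moment, learning_rate, epsilon=1e-6):
    """One update on scalars: accumulate grad^2, then scale the step by its root."""
    moment_out = moment + grad * grad
    param_out = param - learning_rate * grad / (math.sqrt(moment_out) + epsilon)
    return param_out, moment_out


# With a zero gradient and an empty accumulator the denominator is just epsilon,
# so the division is still well defined.
print(adagrad_step(1.0, 0.0, 0.0, learning_rate=0.2))
```

Because `moment_out` only grows, the effective step size for a parameter shrinks over time.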

## AdamOptimizer

This implements the Adam optimizer from Section 2 of the Adam paper: https://arxiv.org/abs/1412.6980. Adam is a first-order gradient-based optimization method based on adaptive estimates of lower-order moments.

\begin{align}\begin{aligned}t & = t + 1\\moment\_1\_out & = {\beta}_1 * moment\_1 + (1 - {\beta}_1) * grad\\moment\_2\_out & = {\beta}_2 * moment\_2 + (1 - {\beta}_2) * grad * grad\\learning\_rate & = learning\_rate * \frac{\sqrt{1 - {\beta}_2^t}}{1 - {\beta}_1^t}\\param\_out & = param - learning\_rate * \frac{moment\_1\_out}{\sqrt{moment\_2\_out} + \epsilon}\end{aligned}\end{align}

Parameters:
• learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
• beta1 (float) – the exponential decay rate for the 1st moment estimates.
• beta2 (float) – the exponential decay rate for the 2nd moment estimates.
• epsilon (float) – a small float value for numerical stability.

Examples

optimizer = fluid.optimizer.Adam(learning_rate=0.2)
optimizer.minimize(cost)
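
The equations can be traced step by step with a scalar sketch. This is a plain-Python illustration, not Paddle's kernel: `adam_step` is a hypothetical name, the default hyperparameters are the ones suggested in the Adam paper, and the last line is assumed to use the freshly updated moments (`moment_1_out`, `moment_2_out`).

```python
import math


def adam_step(param, grad, moment_1, moment_2, t, learning_rate,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update on scalars; returns the new parameter and state."""
    t = t + 1
    moment_1_out = beta1 * moment_1 + (1 - beta1) * grad
    moment_2_out = beta2 * moment_2 + (1 - beta2) * grad * grad
    # Bias-corrected step size for this iteration.
    lr_t = learning_rate * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    param_out = param - lr_t * moment_1_out / (math.sqrt(moment_2_out) + epsilon)
    return param_out, moment_1_out, moment_2_out, t


# First step from zero-initialized moments (illustrative values only).
print(adam_step(1.0, 3.0, 0.0, 0.0, 0, learning_rate=0.2))
```

On the first step from zero moments the bias correction cancels the gradient's magnitude, so the parameter moves by roughly `learning_rate` in the direction opposite to the gradient's sign.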

## AdamaxOptimizer

We implement the Adamax optimizer from Section 7 of the Adam paper: https://arxiv.org/abs/1412.6980. Adamax is a variant of the Adam algorithm based on the infinity norm.

\begin{align}\begin{aligned}t & = t + 1\\moment\_out & = {\beta}_1 * moment + (1 - {\beta}_1) * grad\\inf\_norm\_out & = max({\beta}_2 * inf\_norm + \epsilon, |grad|)\\learning\_rate & = \frac{learning\_rate}{1 - {\beta}_1^t}\\param\_out & = param - learning\_rate * \frac{moment\_out}{inf\_norm\_out}\end{aligned}\end{align}

The original paper does not have an epsilon attribute. It is added here for numerical stability, to prevent division by zero.

Parameters:
• learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
• beta1 (float) – the exponential decay rate for the 1st moment estimates.
• beta2 (float) – the exponential decay rate for the 2nd moment estimates.
• epsilon (float) – a small float value for numerical stability.

Examples

optimizer = fluid.optimizer.Adamax(learning_rate=0.2)
optimizer.minimize(cost)
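
A scalar sketch of the Adamax equations, again in plain Python and not taken from Paddle's source. `adamax_step` is a hypothetical name and the default hyperparameters are illustrative.

```python
def adamax_step(param, grad, moment, inf_norm, t, learning_rate,
                beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adamax update on scalars; returns the new parameter and state."""
    t = t + 1
    moment_out = beta1 * moment + (1 - beta1) * grad
    # Decayed running maximum of |grad|; epsilon keeps it strictly positive.
    inf_norm_out = max(beta2 * inf_norm + epsilon, abs(grad))
    lr_t = learning_rate / (1 - beta1 ** t)
    param_out = param - lr_t * moment_out / inf_norm_out
    return param_out, moment_out, inf_norm_out, t


# A zero gradient on the first step still yields a valid division.
print(adamax_step(1.0, 0.0, 0.0, 0.0, 0, learning_rate=0.2))
```

Unlike Adam, the denominator is a running maximum rather than a root-mean-square estimate, so only the first moment needs bias correction.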

## DecayedAdagradOptimizer

Original paper: http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf

The update is done as follows:

\begin{align}\begin{aligned}moment\_out & = decay * moment + (1 - decay) * grad * grad\\param\_out & = param - \frac{learning\_rate * grad}{\sqrt{moment\_out} + \epsilon}\end{aligned}\end{align}

The original paper (http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf) does not have an epsilon attribute. It is added here for numerical stability, to avoid division by zero.

Parameters:
• learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
• decay (float) – decay rate.
• epsilon (float) – a small float value for numerical stability.

Examples

optimizer = fluid.optimizer.DecayedAdagrad(learning_rate=0.2)
optimizer.minimize(cost)
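
The effect of the decay term is easiest to see by running the update repeatedly with a constant gradient. The sketch below is plain Python with a hypothetical `decayed_adagrad_step` helper; the defaults `decay=0.95` and `epsilon=1e-6` are illustrative.

```python
import math


def decayed_adagrad_step(param, grad, moment, learning_rate,
                         decay=0.95, epsilon=1e-6):
    """One update on scalars: moment is a moving average of grad^2."""
    moment_out = decay * moment + (1 - decay) * grad * grad
    param_out = param - learning_rate * grad / (math.sqrt(moment_out) + epsilon)
    return param_out, moment_out


# With a constant gradient the accumulator settles near grad * grad
# instead of growing without bound.
param, moment = 1.0, 0.0
for _ in range(200):
    param, moment = decayed_adagrad_step(param, 2.0, moment, learning_rate=0.01)
print(moment)
```

Because the accumulator is a weighted average rather than a running sum, the effective step size does not shrink toward zero over long training runs.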

## RMSPropOptimizer

class paddle.fluid.optimizer.RMSPropOptimizer(learning_rate, rho=0.95, epsilon=1e-06, momentum=0.0, **kwargs)

Root Mean Squared Propagation (RMSProp) is an unpublished, adaptive learning rate method. It was originally proposed on slide 29 of http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf.

The original equation is as follows:

\begin{align}\begin{aligned}r(w, t) & = \rho r(w, t-1) + (1 - \rho)(\nabla Q_{i}(w))^2\\w & = w - \frac{\eta} {\sqrt{r(w,t) + \epsilon}} \nabla Q_{i}(w)\end{aligned}\end{align}

The first equation computes a moving average of the squared gradient for each weight. The gradient is then divided by $\sqrt{r(w,t)}$.

In some cases, adding a momentum term $\beta$ is beneficial. In our implementation, Nesterov momentum is used:

\begin{align}\begin{aligned}r(w, t) & = \rho r(w, t-1) + (1 - \rho)(\nabla Q_{i}(w))^2\\v(w, t) & = \beta v(w, t-1) + \frac{\eta} {\sqrt{r(w,t) + \epsilon}} \nabla Q_{i}(w)\\w & = w - v(w, t)\end{aligned}\end{align}

where $\rho$ is a hyperparameter with typical values such as 0.9 or 0.95, $\beta$ is the momentum term, and $\epsilon$ is a smoothing term to avoid division by zero, usually set somewhere in the range from 1e-4 to 1e-8.

Parameters:
• learning_rate (float) – global learning rate.
• rho (float) – $\rho$ in the equation, 0.95 by default.
• epsilon (float) – $\epsilon$ in the equation, a smoothing term to avoid division by zero, 1e-6 by default.
• momentum (float) – $\beta$ in the equation, the momentum term, 0.0 by default.
Raises:

ValueError – If learning_rate, rho, epsilon, momentum are None.

Examples

optimizer = fluid.optimizer.RMSProp(0.0001)
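
The momentum form of the update can be sketched on scalars as follows. This is a plain-Python illustration, not Paddle's kernel: `rmsprop_step` is a hypothetical name, the defaults mirror the constructor signature above, and the denominator is taken to be the moving average r(w, t), as in the first pair of equations.

```python
import math


def rmsprop_step(w, grad, r, v, learning_rate,
                 rho=0.95, epsilon=1e-6, momentum=0.0):
    """One RMSProp update on scalars; returns the new weight and state."""
    # Moving average of the squared gradient.
    r = rho * r + (1 - rho) * grad * grad
    # Velocity: previous velocity scaled by momentum, plus the normalized gradient step.
    v = momentum * v + learning_rate / math.sqrt(r + epsilon) * grad
    w = w - v
    return w, r, v


# With momentum=0.0 this reduces to the original two-line form.
print(rmsprop_step(1.0, 2.0, 0.0, 0.0, learning_rate=0.0001))
```

Both `r` and `v` have to be stored per weight and passed back in on the next step.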

## FtrlOptimizer

class paddle.fluid.optimizer.FtrlOptimizer(learning_rate, l1=0.0, l2=0.0, lr_power=-0.5, **kwargs)

\begin{align}\begin{aligned}&new\_accum = squared\_accum + grad^2\\&if (lr\_power == -0.5):\\&\quad linear\_accum += grad - \frac{\sqrt{new\_accum} - \sqrt{squared\_accum}}{learning\_rate} * param\\&else:\\&\quad linear\_accum += grad - \frac{new\_accum^{-lr\_power} - squared\_accum^{-lr\_power}}{learning\_rate} * param\\ &x = l1 * sign(linear\_accum) - linear\_accum\\&if (lr\_power == -0.5):\\&\quad y = \frac{\sqrt{new\_accum}}{learning\_rate} + (2 * l2)\\&\quad pre\_shrink = \frac{x}{y}\\&\quad param = (abs(linear\_accum) > l1).select(pre\_shrink, 0.0)\\&else:\\&\quad y = \frac{new\_accum^{-lr\_power}}{learning\_rate} + (2 * l2)\\&\quad pre\_shrink = \frac{x}{y}\\&\quad param = (abs(linear\_accum) > l1).select(pre\_shrink, 0.0)\\&squared\_accum += grad^2\end{aligned}\end{align}
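
Parameters:
• learning_rate (float|Variable) – global learning rate.
• l1 (float) – the l1 term in the update above, 0.0 by default.
• l2 (float) – the l2 term in the update above, 0.0 by default.
• lr_power (float) – the lr_power exponent in the update above, -0.5 by default.
Raises:

ValueError – If learning_rate is None.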

Examples

optimizer = fluid.optimizer.Ftrl(0.0001)
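
The FTRL update is easier to follow as straight-line code. The sketch below is plain Python on scalars with a hypothetical `ftrl_step` helper; it is not Paddle's kernel. It reads the correction term as `(difference / learning_rate) * param`, which is an assumption that matches the usual FTRL-Proximal formulation, and it does not guard against a zero denominator when both the gradient and `l2` are zero.

```python
import math


def _sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def ftrl_step(param, grad, squared_accum, linear_accum,
              learning_rate, l1=0.0, l2=0.0, lr_power=-0.5):
    """One FTRL update on scalars; returns (param, squared_accum, linear_accum)."""
    new_accum = squared_accum + grad * grad
    if lr_power == -0.5:
        new_pow, old_pow = math.sqrt(new_accum), math.sqrt(squared_accum)
    else:
        new_pow, old_pow = new_accum ** -lr_power, squared_accum ** -lr_power
    linear_accum += grad - (new_pow - old_pow) / learning_rate * param
    x = l1 * _sign(linear_accum) - linear_accum
    y = new_pow / learning_rate + 2 * l2
    pre_shrink = x / y
    # Weights whose accumulated signal stays within l1 are set exactly to zero.
    param = pre_shrink if abs(linear_accum) > l1 else 0.0
    return param, new_accum, linear_accum


print(ftrl_step(0.0, 1.0, 0.0, 0.0, learning_rate=0.5))
print(ftrl_step(0.0, 1.0, 0.0, 0.0, learning_rate=0.5, l1=2.0))
```

The second call shows the thresholding step: with a large enough `l1`, the parameter stays at zero, which is how FTRL produces sparse models.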

## ModelAverage

Accumulates the average of parameters within a sliding window. The averaged result is saved in temporary variables, which can be applied to the parameter variables of the current model by calling the ‘apply()’ method. The ‘restore()’ method restores the parameter values of the current model.

The size of the averaging window is determined by average_window_rate, min_average_window, max_average_window and the current number of updates.

Parameters:
• average_window_rate – The rate of average window.
• min_average_window – The minimum size of average window.
• max_average_window – The maximum size of average window.

Examples

optimizer = fluid.optimizer.Momentum(learning_rate=0.2, momentum=0.1)
optimizer.minimize(cost)
model_average = fluid.optimizer.ModelAverage(0.15,
                                             min_average_window=10000,
                                             max_average_window=20000)
for pass_id in range(args.pass_num):
    exe.run(fluid.default_main_program()...)

with model_average.apply(exe):
    exe.run(inference_program...)

apply(*args, **kwds)

Apply average values to parameters of current model.

restore(executor)

Restore parameter values of current model.
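
The apply/restore pattern can be illustrated with a small stand-alone sketch. `AveragedParams` is a hypothetical class, not the Paddle implementation: it keeps a plain running mean over all recorded updates and ignores the sliding-window sizing entirely, and it assumes the original values are restored when the `with` block exits.

```python
from contextlib import contextmanager


class AveragedParams:
    """Hypothetical stand-in that averages parameters stored in a dict."""

    def __init__(self, params):
        self.params = params                     # name -> float, updated by training
        self._sum = {name: 0.0 for name in params}
        self._count = 0
        self._backup = None

    def accumulate(self):
        """Record the current parameter values; call once per update."""
        for name, value in self.params.items():
            self._sum[name] += value
        self._count += 1

    @contextmanager
    def apply(self):
        """Temporarily replace parameters with their average."""
        self._backup = dict(self.params)
        if self._count > 0:
            for name in self.params:
                self.params[name] = self._sum[name] / self._count
        try:
            yield
        finally:
            self.restore()

    def restore(self):
        """Put back the values saved by apply()."""
        self.params.update(self._backup)


params = {"w": 1.0}
avg = AveragedParams(params)
avg.accumulate()
params["w"] = 3.0
avg.accumulate()
with avg.apply():
    print(params["w"])   # averaged value is in effect inside the block
print(params["w"])       # training value is back after the block
```

Evaluating under the averaged weights and then continuing training from the unaveraged ones is the point of the two-method design.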
