Operators
sgd
SGD operator
This operator implements one step of the stochastic gradient descent algorithm.
$$param\_out = param - learning\_rate * grad$$
Inputs:  Param : (Tensor) Input parameter
 LearningRate : (Tensor) Learning rate of SGD
 Grad : (Tensor) Input gradient
Outputs:  ParamOut : (Tensor) Output parameter
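The update rule above can be sketched in a few lines of numpy (an illustrative sketch of the formula, not the operator's kernel; the helper name `sgd_step` is made up here):

```python
import numpy as np

def sgd_step(param, learning_rate, grad):
    # param_out = param - learning_rate * grad
    return param - learning_rate * grad

param = np.array([1.0, 2.0, 3.0])
grad = np.array([0.5, 0.5, 0.5])
param_out = sgd_step(param, 0.1, grad)
# param_out -> [0.95, 1.95, 2.95]
```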
adagrad
Adaptive Gradient Algorithm (Adagrad).
The update is done as follows:
$$moment\_out = moment + grad * grad \\ param\_out = param - \frac{learning\_rate * grad}{\sqrt{moment\_out} + \epsilon} $$
The original paper (http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf) does not have the epsilon attribute. It is added here in our implementation, as also proposed at http://cs231n.github.io/neural-networks-3/#ada, for numerical stability to avoid the division-by-zero error.
Inputs:  Param : (Tensor) Input parameter
 Grad : (Tensor) Input gradient
 Moment : (Tensor) Second moment
 LearningRate : (Tensor) Learning rate
Outputs:  ParamOut : (Tensor) Output parameter
 MomentOut : (Tensor) Output second moment
Attributes:  epsilon (Duplicable): (float, default 1.0e-6) Constant for numerical stability
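The two update formulas above can be sketched together in numpy (illustrative only; `adagrad_step` is a hypothetical helper name):

```python
import numpy as np

def adagrad_step(param, grad, moment, learning_rate, epsilon=1.0e-6):
    # moment_out = moment + grad * grad
    moment_out = moment + grad * grad
    # param_out = param - learning_rate * grad / (sqrt(moment_out) + epsilon)
    param_out = param - learning_rate * grad / (np.sqrt(moment_out) + epsilon)
    return param_out, moment_out

p, g, m = np.array([1.0]), np.array([2.0]), np.array([0.0])
p_out, m_out = adagrad_step(p, g, m, learning_rate=0.1)
# m_out -> [4.0]; p_out is approximately [0.9]
```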
conv3d
Convolution3D Operator.
The convolution operation calculates the output based on the input, filter and strides, paddings, dilations, groups parameters. The size of each dimension of the parameters is checked in the infershape. Input(Input) and output(Output) are in NCDHW format, where N is batch size, C is the number of channels, D is the depth of the feature, H is the height of the feature, and W is the width of the feature. Filters(Input) is in MCDHW format, where M is the number of output image channels, C is the number of input image channels, D is the depth of the filter, H is the height of the filter, and W is the width of the filter. Parameters(strides, paddings, dilations) are three elements. These three elements represent depth, height and width, respectively. The input(X) size and output(Out) size may be different.
Example: Input: Input shape: $(N, C_{in}, D_{in}, H_{in}, W_{in})$ Filter shape: $(C_{out}, C_{in}, D_f, H_f, W_f)$ Output: Output shape: $(N, C_{out}, D_{out}, H_{out}, W_{out})$ Where $$ D_{out}= \frac{(D_{in} + 2 * paddings[0] - (dilations[0] * (D_f - 1) + 1))}{strides[0]}+ 1 \\ H_{out}= \frac{(H_{in} + 2 * paddings[1] - (dilations[1] * (H_f - 1) + 1))}{strides[1]}+ 1 \\ W_{out}= \frac{(W_{in} + 2 * paddings[2] - (dilations[2] * (W_f - 1) + 1))}{strides[2]}+ 1 $$
Inputs:  Input : (Tensor) The input tensor of convolution operator. The format of input tensor is NCDHW. Where N is batch size, C is the number of channels, D is the depth of the feature, H is the height of the feature, and W is the width of the feature.
 Filter : (Tensor) The filter tensor of convolution operator. The format of the filter tensor is MCDHW, where M is the number of output image channels, C is the number of input image channels, D is the depth of the filter, H is the height of the filter, and W is the width of the filter. If the groups attribute is greater than 1, C equals the number of input image channels divided by the groups.
Outputs:  Output : (Tensor) The output tensor of convolution operator. The format of output tensor is also NCDHW.
Attributes:  strides (Duplicable): (vector<int>, default:{1, 1, 1}), the strides(d_stride, h_stride, w_stride) of convolution operator.
 paddings (Duplicable): (vector<int>, default:{0, 0, 0}), the paddings(d_pad, h_pad, w_pad) of convolution operator.
 groups (Duplicable): (int default:1), the groups number of the convolution operator. According to grouped convolution in Alex Krizhevsky's Deep CNN paper: when group=2, the first half of the filters is only connected to the first half of the input channels, while the second half of the filters is only connected to the second half of the input channels.
 dilations (Duplicable): (vector<int> default:{1, 1, 1}), the dilations(d_dilation, h_dilation, w_dilation) of convolution operator.
conv2d
Convolution Operator.
The convolution operation calculates the output based on the input, filter and strides, paddings, dilations, groups parameters. The size of each dimension of the parameters is checked in the infershape. Input(Input) and Output(Output) are in NCHW format. Where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature. Filters(Input) is MCHW format. Where M is the number of output image channels, C is the number of input image channels, H is the height of the filter, and W is the width of the filter. Parameters(strides, paddings, dilations) are two elements. These two elements represent height and width, respectively. The input(X) size and output(Out) size may be different.
Example: Input: Input shape: $(N, C_{in}, H_{in}, W_{in})$ Filter shape: $(C_{out}, C_{in}, H_f, W_f)$ Output: Output shape: $(N, C_{out}, H_{out}, W_{out})$ Where $$ H_{out}= \frac{(H_{in} + 2 * paddings[0] - (dilations[0] * (H_f - 1) + 1))}{strides[0]}+ 1 \\ W_{out}= \frac{(W_{in} + 2 * paddings[1] - (dilations[1] * (W_f - 1) + 1))}{strides[1]}+ 1 $$
Inputs:  Input : (Tensor) The input tensor of convolution operator. The format of input tensor is NCHW, where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature.
 Filter : (Tensor) The filter tensor of convolution operator. The format of the filter tensor is MCHW, where M is the number of output image channels, C is the number of input image channels, H is the height of the filter, and W is the width of the filter. If the groups attribute is greater than 1, C equals the number of input image channels divided by the groups.
Outputs:  Output : (Tensor) The output tensor of convolution operator. The format of output tensor is also NCHW.
Attributes:  strides (Duplicable): (vector<int> default:{1, 1}), the strides(h_stride, w_stride) of convolution operator.
 paddings (Duplicable): (vector<int> default:{0, 0}), the paddings(h_pad, w_pad) of convolution operator.
 groups (Duplicable): (int default:1), the groups number of the convolution operator. According to grouped convolution in Alex Krizhevsky's Deep CNN paper: when group=2, the first half of the filters is only connected to the first half of the input channels, while the second half of the filters is only connected to the second half of the input channels.
 dilations (Duplicable): (vector<int> default:{1, 1}), the dilations(h_dilation, w_dilation) of convolution operator.
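The output-shape formula above can be checked with a small helper (a sketch; the name `conv2d_output_size` is made up, and integer floor division is assumed, as is conventional for convolution shape inference):

```python
def conv2d_output_size(in_size, filter_size, padding, dilation, stride):
    # One spatial dimension of the H_out/W_out formula above.
    # The dilated filter covers dilation * (filter_size - 1) + 1 positions.
    return (in_size + 2 * padding - (dilation * (filter_size - 1) + 1)) // stride + 1

# a 3x3 filter with padding 1, dilation 1, stride 1 preserves the size
h_out = conv2d_output_size(32, 3, padding=1, dilation=1, stride=1)
# h_out -> 32
```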
pool3d
Pool3d Operator.
The pooling3d operation calculates the output based on the input, pooling_type, ksize, strides, and paddings parameters. Input(X) and output(Out) are in NCDHW format, where N is batch size, C is the number of channels, and D, H and W are the depth, height and width of the feature, respectively. Parameters(ksize, strides, paddings) are three elements. These three elements represent depth, height and width, respectively. The input(X) size and output(Out) size may be different.
Example: Input: X shape: $(N, C, D_{in}, H_{in}, W_{in})$ Output: Out shape: $(N, C, D_{out}, H_{out}, W_{out})$ Where $$ D_{out} = \frac{(D_{in} - ksize[0] + 2 * paddings[0])}{strides[0]} + 1 \\ H_{out} = \frac{(H_{in} - ksize[1] + 2 * paddings[1])}{strides[1]} + 1 \\ W_{out} = \frac{(W_{in} - ksize[2] + 2 * paddings[2])}{strides[2]} + 1 $$
Inputs:  X : (Tensor) The input tensor of pooling operator. The format of input tensor is NCDHW, where N is batch size, C is the number of channels, and D, H and W are the depth, height and width of the feature, respectively.
Outputs:  Out : (Tensor) The output tensor of pooling operator. The format of output tensor is also NCDHW, where N is batch size, C is the number of channels, and D, H and W are the depth, height and width of the feature, respectively.
Attributes:  pooling_type (Duplicable): (string) Pooling type, can be "max" for maxpooling and "avg" for averagepooling.
 ksize (Duplicable): (vector<int>) The pooling window size(depth, height, width) of pooling operator. If global_pooling = true, ksize and paddings will be ignored.
 global_pooling (Duplicable): (bool, default false) Whether to use the global pooling. If global_pooling = true, ksize and paddings will be ignored.
 strides (Duplicable): (vector<int>, default {1,1,1}) Strides(depth, height, width) of the pooling operator.
 paddings (Duplicable): (vector<int>, default {0,0,0}), paddings(depth, height, width) of pooling operator. If global_pooling = true, ksize and paddings will be ignored.
pool2d
Pool2d Operator.
The pooling2d operation calculates the output based on the input, pooling_type and ksize, strides, paddings parameters. Input(X) and output(Out) are in NCHW format, where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature. Parameters(ksize, strides, paddings) are two elements. These two elements represent height and width, respectively. The input(X) size and output(Out) size may be different.
Example:
Input: X shape: $(N, C, H_{in}, W_{in})$ Output: Out shape: $(N, C, H_{out}, W_{out})$ Where $$ H_{out} = \frac{(H_{in} - ksize[0] + 2 * paddings[0])}{strides[0]} + 1 \\ W_{out} = \frac{(W_{in} - ksize[1] + 2 * paddings[1])}{strides[1]} + 1 $$
Inputs:  X : (Tensor) The input tensor of pooling operator. The format of input tensor is NCHW, where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature.
Outputs:  Out : (Tensor) The output tensor of pooling operator. The format of output tensor is also NCHW, where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature.
Attributes:  pooling_type (Duplicable): (string), pooling type, can be "max" for maxpooling and "avg" for averagepooling.
 ksize (Duplicable): (vector<int>) The pooling window size(height, width) of the pooling operator. If global_pooling = true, ksize and paddings will be ignored.
 global_pooling (Duplicable): (bool, default false) Whether to use the global pooling. If global_pooling = true, ksize and paddings will be ignored.
 strides (Duplicable): (vector<int>, default {1, 1}), strides(height, width) of pooling operator.
 paddings (Duplicable): (vector<int>, default {0,0}), paddings(height, width) of pooling operator. If global_pooling = true, paddings and ksize will be ignored.
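The pooling output-shape formula is the same arithmetic in one dimension; a small sketch (hypothetical helper name, floor division assumed):

```python
def pool2d_output_size(in_size, ksize, padding, stride):
    # One spatial dimension of the pooling H_out/W_out formula above.
    return (in_size - ksize + 2 * padding) // stride + 1

# a 2x2 window with stride 2 halves an 8x8 feature map
h_out = pool2d_output_size(8, 2, padding=0, stride=2)
# h_out -> 4
```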
max_pool3d_with_index
MaxPool3d Operator.
The maxpooling3d with index operation calculates the output and the mask based on the input and ksize, strides, paddings parameters. Input(X) and output(Out, Mask) are in NCDHW format, where N is batch size, C is the number of channels, and D, H and W are the depth, height and width of the feature, respectively. Parameters(ksize, strides, paddings) are three elements. These three elements represent depth, height and width, respectively. The input(X) size and output(Out, Mask) size may be different.
Example: Input: X shape: $(N, C, D_{in}, H_{in}, W_{in})$ Output: Out shape: $(N, C, D_{out}, H_{out}, W_{out})$ Mask shape: $(N, C, D_{out}, H_{out}, W_{out})$ Where $$ D_{out} = \frac{(D_{in} - ksize[0] + 2 * paddings[0])}{strides[0]} + 1 \\ H_{out} = \frac{(H_{in} - ksize[1] + 2 * paddings[1])}{strides[1]} + 1 \\ W_{out} = \frac{(W_{in} - ksize[2] + 2 * paddings[2])}{strides[2]} + 1 $$
Inputs:  X : (Tensor) The input tensor of pooling operator. The format of input tensor is NCDHW, where N is batch size, C is the number of channels, and D, H and W are the depth, height and width of the image, respectively
Outputs:  Out : (Tensor) The output tensor of pooling operator. The format of output tensor is also NCDHW, where N is the batch size, C is the number of channels, and D, H and W are the depth, height and width of the image, respectively.
 Mask : (Tensor) The Mask tensor of pooling operator. The format of output tensor is also NCDHW, where N is the batch size, C is the number of channels, and D, H and W are the depth, height and width of the image, respectively. It represents the index in the current feature map.
Attributes:  ksize (Duplicable): (vector<int>) The pooling window size(depth, height, width) of pooling operator. If global_pooling = true, ksize and paddings will be ignored.
 global_pooling (Duplicable): (bool, default false) Whether to use the global pooling. If global_pooling = true, ksize and paddings will be ignored.
 strides (Duplicable): (vector<int>, default {1,1,1}), strides(depth, height, width) of pooling operator.
 paddings (Duplicable): (vector<int>, default {0,0,0}), paddings(depth, height, width) of pooling operator. If global_pooling = true, paddings and ksize will be ignored.
lod_rank_table
Create LoDRankTable by LoDTensor
LoD Rank Table stores the level of lod, which is ordered by sequence length in descending order. It is useful when implementing dynamic RNN and is shared by the dynamic RNN memory, dynamic RNN slice input and dynamic RNN slice output operators.
Inputs:  X : (LoDTensor) input lod tensor, must contain lod information.
Outputs:  Out : (LoDRankTable) The rank table of specific level.
Attributes:  level (Duplicable): (int) the specific lod level to rank.
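What the rank table contains can be sketched in Python (the real LoDRankTable is a C++ structure; this only shows the ordering by descending sequence length, with ties keeping their original order):

```python
def lod_rank_table(lod_level):
    # lod offsets -> per-sequence lengths, e.g. [0, 2, 5, 7] -> [2, 3, 2]
    lengths = [e - s for s, e in zip(lod_level[:-1], lod_level[1:])]
    # (original index, length) pairs sorted by length, descending;
    # Python's sort is stable, so equal lengths keep input order
    return sorted(enumerate(lengths), key=lambda t: t[1], reverse=True)

table = lod_rank_table([0, 2, 5, 7])
# table -> [(1, 3), (0, 2), (2, 2)]
```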
array_to_lod_tensor
This Op builds a big LoDTensor from a std::vector<LoDTensor> and a LoDRankTable. It is supposed to be used to get dynamic RNN's outputs back to a normal LoDTensor. The std::vector<LoDTensor> would be the output of the RNN Op and the LoDRankTable would be built with the RNN's input.
Inputs:  X : (std::vector<LodTensor>) A vector of tensors that is going to be cast to a big LoDTensor.
 RankTable : (LoDRankTable) RankTable provides the coarse lod information to build the output LoDTensor. See 'paddle/framework/lod_rank_table.h' for more details.
Outputs:  Out : (LoDTensor) The LoDTensor formed by input tensor array.
sequence_conv
Sequence Conv Operator.
SequenceConvOp performs convolution operation on features of contextLength timesteps of each instance. The convolution operation calculates the output based on the input, filter, strides and paddings parameters. The size of each dimension of the parameters is checked during infershape. In order to ensure the equal length of sequence before and after convolution, it is necessary to fill the top and bottom of each sequence based on context_length, context_stride and context_start.
Inputs:  X : (LoDTensor) the input(X) is a LodTensor, which supports variable-time length input sequence. The underlying tensor in this LoDTensor is a matrix with shape (T, N), where T is the total time steps in this mini-batch and N is the input_hidden_size.
 PaddingData : (Tensor, optional) the input(PaddingData) is an optional parameter, and it is learnable. This is a tensor with shape (P, N), where P is the top_pad + bottom_pad, N is the input_hidden_size. In order to ensure the equal length of sequence before and after convolution, it is necessary to fill the top and bottom of each sequence according to context_length, context_stride and context_start
 Filter : (Tensor) the input(Filter) is a learnable parameter. This is a tensor with shape (K, M), where K is context_length * input_hidden_size and M is the output feature size.
Outputs:  Out : (LoDTensor) the output(Out) is a LodTensor, which supports variable-time length output sequence. The underlying tensor in this LoDTensor is a matrix with shape (T, M), where T is the total time steps in this mini-batch and M is the output feature size.
Attributes:  paddingTrainable (Duplicable): (bool, default:false) the padding data of SequenceConvOp is trainable or not.
 contextLength (Duplicable): (int) the contextLength of SequenceConvOp is the height of the convolution kernel.
 contextStart (Duplicable): (int, default:0) the contextStart of SequenceConvOp represents the beginning of the convolution of the number of rows of sequence, which can be negative. The negative number means to pad contextStart timesteps of zeros or learnable parameters at the beginning of each instance. The positive number means to skip contextStart timesteps of each instance.
 contextStride (Duplicable): (int, default:1) the contextStride of SequenceConvOp represents the stride length of the convolution kernel. Currently, SequenceConvOp only supports contextStride=1.
sequence_pool
Sequence Pool Operator.
The SequencePoolOp pools features of all timesteps of each instance. It supports six pooling types:
1. AVERAGE: $$Out[i] = \frac{\sum_j X_{ij}}{N}$$
2. SUM: $$Out[i] = \sum_j X_{ij}$$
3. SQRT: $$Out[i] = \frac{\sum_j X_{ij}}{\sqrt{len(X_i)}}$$
4. LAST: Out[i] = last instance in i-th sequence X[i]
5. FIRST: Out[i] = first instance in i-th sequence X[i]
6. MAX: $$Out[i] = max(X_i)$$
The following example explains how this works: For a mini-batch of 3 variable-length sentences, containing 2, 3, and 2 timesteps:
Assume X is a [7,M,N] LoDTensor, and X->lod()[0] = [0, 2, 5, 7], 7=2+3+2. Besides, for the sake of simplicity, we assume M=1 and N=1, and the value of X = [[1, 3], [2, 4, 6], [5, 1]].
Thus, Out is a [3,1,1] Tensor without LoD information. And for different pooltype, the value of Out is as follows:
 AVERAGE: [2, 4, 3], where 2=(1+3)/2, 4=(2+4+6)/3, 3=(5+1)/2
 SUM: [4, 12, 6], where 4=1+3, 12=2+4+6, 6=5+1
 SQRT: [2.82, 6.93, 4.24], where 2.82=(1+3)/sqrt(2), 6.93=(2+4+6)/sqrt(3), 4.24=(5+1)/sqrt(2)
 MAX: [3, 6, 5], where 3=max(1,3), 6=max(2,4,6), 5=max(5,1)
 LAST: [3, 6, 1], where 3=last(1,3), 6=last(2,4,6), 1=last(5,1)
 FIRST: [1, 2, 5], where 1=first(1,3), 2=first(2,4,6), 5=first(5,1)
Inputs:  X : (LoDTensor) The variable-length input of SequencePoolOp
Outputs:  Out : (Tensor) The output of SequencePoolOp does not contain LoD information.
 MaxIndex (Intermediate) : (Tensor<int>) This tensor is used for the sequence maxpooling to record the max indexes.
Attributes:  pooltype (Duplicable): (string, default AVERAGE) the pooling pooltype of SequencePoolOp.
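The worked example above can be reproduced with a short numpy sketch (illustrative; `sequence_pool` here is a plain function over lod offsets, not the operator itself):

```python
import numpy as np

def sequence_pool(x, lod, pooltype):
    # Pool each sequence delimited by consecutive lod offsets.
    outs = []
    for s, e in zip(lod[:-1], lod[1:]):
        seq = x[s:e]
        if pooltype == "AVERAGE":
            outs.append(seq.mean())
        elif pooltype == "SUM":
            outs.append(seq.sum())
        elif pooltype == "SQRT":
            # sum scaled by sqrt of the sequence length
            outs.append(seq.sum() / np.sqrt(len(seq)))
        elif pooltype == "MAX":
            outs.append(seq.max())
        elif pooltype == "LAST":
            outs.append(seq[-1])
        elif pooltype == "FIRST":
            outs.append(seq[0])
    return np.array(outs)

x = np.array([1, 3, 2, 4, 6, 5, 1], dtype=float)
lod = [0, 2, 5, 7]
# AVERAGE -> [2, 4, 3], SUM -> [4, 12, 6], MAX -> [3, 6, 5]
```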
lstm
Long-Short Term Memory (LSTM) Operator.
The default implementation is the diagonal/peephole connection (https://arxiv.org/pdf/1402.1128.pdf); the formula is as follows:
$$ i_t = \sigma(W_{ix}x_{t} + W_{ih}h_{t-1} + W_{ic}c_{t-1} + b_i) \\ f_t = \sigma(W_{fx}x_{t} + W_{fh}h_{t-1} + W_{fc}c_{t-1} + b_f) \\ \tilde{c_t} = act_g(W_{cx}x_t + W_{ch}h_{t-1} + b_c) \\ o_t = \sigma(W_{ox}x_{t} + W_{oh}h_{t-1} + W_{oc}c_t + b_o) \\ c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c_t} \\ h_t = o_t \odot act_h(c_t) $$
where the W terms denote weight matrices (e.g. $W_{xi}$ is the matrix of weights from the input gate to the input), and $W_{ic}, W_{fc}, W_{oc}$ are diagonal weight matrices for peephole connections. In our implementation, we use vectors to represent these diagonal weight matrices. The b terms denote bias vectors ($b_i$ is the input gate bias vector), $\sigma$ is the non-linear activation, such as the logistic sigmoid function, and $i, f, o$ and $c$ are the input gate, forget gate, output gate, and cell activation vectors, respectively, all of which have the same size as the cell output activation vector $h$.
$\odot$ is the element-wise product of the vectors. $act_g$ and $act_h$ are the cell input and cell output activation functions; tanh is usually used for them. $\tilde{c_t}$ is also called the candidate hidden state, which is computed based on the current input and the previous hidden state. Set use_peepholes to False to disable the peephole connection. The formula is omitted here; please refer to the paper http://www.bioinf.jku.at/publications/older/2604.pdf for details. Note that the $W_{xi}x_{t}, W_{xf}x_{t}, W_{xc}x_{t}, W_{xo}x_{t}$ operations on the input $x_{t}$ are NOT included in this operator. Users can choose to use a fully-connected operator before the LSTM operator.
Inputs:  Input : (LoDTensor) the first input is a LodTensor, which supports variable-time length input sequence. The underlying tensor in this LoDTensor is a matrix with shape (T x 4D), where T is the total time steps in this mini-batch and D is the hidden size.
 H0 : (Tensor, optional) the initial hidden state is an optional input. This is a tensor with shape (N x D), where N is the batch size and D is the hidden size.
 C0 : (Tensor, optional) the initial cell state is an optional input. This is a tensor with shape (N x D), where N is the batch size. `H0` and `C0` can be NULL but only at the same time
 Weight : (Tensor) the learnable hiddenhidden weights.  The shape is (D x 4D), where D is the hidden size.  Weight = {W_ch, W_ih, W_fh, W_oh}
 Bias : (Tensor) the learnable weights, which contains two parts: input-hidden bias weight and, if `use_peepholes` is set to True, peephole connections weight. 1. `use_peepholes = False`: the shape is (1 x 4D), Bias = {b_c, b_i, b_f, b_o}. 2. `use_peepholes = True`: the shape is (1 x 7D), Bias = {b_c, b_i, b_f, b_o, W_ic, W_fc, W_oc}.
Outputs:  Hidden : (LoDTensor) the hidden state of LSTM operator. The shape is (T x D), and lod is the same with the `Input`.
 Cell : (LoDTensor) the cell state of LSTM operator. The shape is (T x D), and lod is the same with the `Input`.
 BatchGate (Intermediate) : (LoDTensor) This LoDTensor contains input gate, forget gate and output gate after the nonlinear computation. This LoDTensor has the same shape as the reorganized input, which is also be called batch input. The LoD size is 2. The first LoD is the batch offsets and the second LoD contains the indexes, which denote the position of reorganized sequence in the raw input.
 BatchCellPreAct (Intermediate) : (LoDTensor) This LoDTensor is obtained in the forward and used in the backward.
Attributes:  use_peepholes (Duplicable): (bool, default: True) whether to enable diagonal/peephole connections.
 is_reverse (Duplicable): (bool, default: False) whether to compute reversed LSTM.
 gate_activation (Duplicable): (string, default: sigmoid) The activation for input gate, forget gate and output gate, `sigmoid` by default.
 cell_activation (Duplicable): (string, default: tanh) The activation for cell output, `tanh` by default.
 candidate_activation (Duplicable): (string, default: tanh) The activation for candidate hidden state, `tanh` by default.
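A single timestep of the peephole formulas can be sketched in numpy (illustrative only; the [i, f, c~, o] gate layout and the separate peephole vectors are assumptions of this sketch, and the input projections $W_{xi}x_t$ etc. are taken as precomputed, as the note above says):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(gates, h_prev, c_prev, w_ic, w_fc, w_oc):
    # gates: precomputed input + hidden projections of size 4D,
    # assumed laid out as [i, f, c~, o] chunks of hidden size D
    D = h_prev.shape[0]
    zi, zf, zc, zo = gates[:D], gates[D:2*D], gates[2*D:3*D], gates[3*D:]
    i = sigmoid(zi + w_ic * c_prev)        # input gate with peephole
    f = sigmoid(zf + w_fc * c_prev)        # forget gate with peephole
    c_tilde = np.tanh(zc)                  # candidate cell state
    c = f * c_prev + i * c_tilde           # new cell state
    o = sigmoid(zo + w_oc * c)             # output gate peeks at new cell
    h = o * np.tanh(c)                     # new hidden state
    return h, c

h, c = lstm_step(np.zeros(8), np.zeros(2), np.ones(2),
                 np.zeros(2), np.zeros(2), np.zeros(2))
# c -> [0.5, 0.5], since f = i = sigmoid(0) = 0.5 and c~ = tanh(0) = 0
```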
conv3d_transpose
Convolution3D Transpose Operator.
The convolution transpose operation calculates the output based on the input, filter and strides, paddings, groups parameters. The size of each dimension of the parameters is checked in the infershape. Input(Input) and output(Output) are in NCDHW format, where N is batch size, C is the number of channels, D is the depth of the feature, H is the height of the feature, and W is the width of the feature. Filter(Input) is in MCDHW format, where M is the number of input feature channels, C is the number of output feature channels, D is the depth of the filter, H is the height of the filter, and W is the width of the filter. Parameters(strides, paddings) are three elements. These three elements represent depth, height and width, respectively. The input(X) size and output(Out) size may be different.
Example:
Input: Input shape: $(N, C_{in}, D_{in}, H_{in}, W_{in})$ Filter shape: $(C_{in}, C_{out}, D_f, H_f, W_f)$ Output: Output shape: $(N, C_{out}, D_{out}, H_{out}, W_{out})$ Where $$ D_{out} = (D_{in} - 1) * strides[0] - 2 * paddings[0] + D_f \\ H_{out} = (H_{in} - 1) * strides[1] - 2 * paddings[1] + H_f \\ W_{out} = (W_{in} - 1) * strides[2] - 2 * paddings[2] + W_f $$
Inputs:  Input : (Tensor) The input tensor of convolution transpose operator. The format of input tensor is NCDHW, where N is batch size, C is the number of channels, D is the depth of the feature, H is the height of the feature, and W is the width of the feature.
 Filter : (Tensor) The filter tensor of convolution transpose operator. The format of the filter tensor is MCDHW, where M is the number of input feature channels, C is the number of output feature channels, D is the depth of the filter, H is the height of the filter, and W is the width of the filter. We enforce groups number == 1 and padding == 0 in the convolution3d transpose scenario.
Outputs:  Output : (Tensor) The output tensor of convolution transpose operator. The format of output tensor is also NCDHW, where N is batch size, C is the number of channels, D is the depth of the feature, H is the height of the feature, and W is the width of the feature.
Attributes:  strides (Duplicable): (vector<int> default:{1, 1, 1}), the strides{d_stride, h_stride, w_stride} of convolution transpose operator.
 paddings (Duplicable): (vector<int> default:{0, 0, 0}), paddings(d_pad, h_pad, w_pad) of convolution transpose operator.
conv2d_transpose
Convolution2D Transpose Operator.
The convolution transpose operation calculates the output based on the input, filter and strides, paddings, groups parameters. The size of each dimension of the parameters is checked in the infershape. Input(Input) and output(Output) are in NCHW format. Where N is batchsize, C is the number of channels, H is the height of the feature, and W is the width of the feature. Filter(Input) is in MCHW format. Where M is the number of input feature channels, C is the number of output feature channels, H is the height of the filter, and W is the width of the filter. Parameters(strides, paddings) are two elements. These two elements represent height and width, respectively. The input(X) size and output(Out) size may be different.
Example: Input: Input shape: $(N, C_{in}, H_{in}, W_{in})$ Filter shape: $(C_{in}, C_{out}, H_f, W_f)$ Output: Output shape: $(N, C_{out}, H_{out}, W_{out})$ Where $$ H_{out} = (H_{in} - 1) * strides[0] - 2 * paddings[0] + H_f \\ W_{out} = (W_{in} - 1) * strides[1] - 2 * paddings[1] + W_f $$
Inputs:  Input : (Tensor) The input tensor of convolution transpose operator. The format of input tensor is NCHW. Where N is batch size, C is the number of input channels, H is the height of the feature, and W is the width of the feature.
 Filter : (Tensor) The filter tensor of convolution transpose operator. The format of the filter tensor is MCHW, where M is the number of input feature channels, C is the number of output feature channels,H is the height of the filter, and W is the width of the filter. We enforce groups number == 1 in the convolution transpose scenario.
Outputs:  Output : (Tensor) The output tensor of convolution transpose operator. The format of output tensor is also NCHW.
Attributes:  strides (Duplicable): (vector<int> default:{1, 1}), the strides(h_stride, w_stride) of convolution transpose operator.
 paddings (Duplicable): (vector<int> default:{0, 0}), the paddings(h_pad, w_pad) of convolution transpose operator.
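The transposed-convolution shape formula above can be written as a checkable helper (hypothetical name, one spatial dimension):

```python
def conv2d_transpose_output_size(in_size, filter_size, padding, stride):
    # out = (in - 1) * stride - 2 * padding + filter, per the formula above
    return (in_size - 1) * stride - 2 * padding + filter_size

h_out = conv2d_transpose_output_size(4, 3, padding=0, stride=2)
# h_out -> 9
```

Note this inverts the forward convolution shape: a 9-wide input convolved with a 3-wide filter at stride 2 and no padding yields 4, and the transpose maps 4 back to 9.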
gru
GRU Operator implements part calculations of the complete GRU as following:
$$ update\ gate:\ u_t = actGate(xu_t + W_u * h_{t-1} + b_u) \\ reset\ gate:\ r_t = actGate(xr_t + W_r * h_{t-1} + b_r) \\ output\ candidate:\ \tilde{h}_t = actNode(xc_t + W_c * dot(r_t, h_{t-1}) + b_c) \\ output:\ h_t = dot((1 - u_t), h_{t-1}) + dot(u_t, \tilde{h}_t) $$
@note To implement the complete GRU, a fully-connected operator must be used before to feed xu, xr and xc as the Input of the GRU operator.
Inputs:  Input : (LoDTensor) The first input is a LodTensor, which supports variable-time length input sequence. The underlying tensor in this LoDTensor is a matrix with shape (T x 3D), where T is the total time steps in this mini-batch and D is the hidden size.
 H0 : (Tensor, optional) The initial hidden state is an optional input. This is a tensor with shape (N x D), where N is the batch size, D is the hidden size.
 Weight : (Tensor) The learnable hiddenhidden weight matrix with shape (D x 3D), where D is the hidden size. The elements continuous in memory can be divided into two parts. The first part are weights of the update gate and reset gate with shape (D x 2D), and the second part are weights of output candidate with shape (D x D).
 Bias : (Tensor, optional) Bias vector with shape (1 x 3D) concatenating bias of the update gate, reset gate and output candidate.
Outputs:  BatchGate (Intermediate) : (LoDTensor) To compute with batches, sequence data will be reorganized into several successive batches each containing data from the same time step. The LoDTensor BatchGate contains the update gate, reset gate and output candidate values organized in batches. The LoD size is 2. The first LoD contains the batch offsets and the second LoD contains the indexes in the raw sequence data.
 BatchResetHiddenPrev (Intermediate) : (LoDTensor) The reset hidden state LoDTensor organized in batches. This LoDTensor is a matrix with shape (T X D) and has the same LoD with `BatchGate`.
 BatchHidden (Intermediate) : (LoDTensor) The hidden state LoDTensor organized in batches. This LoDTensor is a matrix with shape (T X D) and has the same LoD with `BatchGate`.
 Hidden : (LoDTensor) the hidden state LoDTensor organized in sequences. This LoDTensor is a matrix with shape (T X D) and has the same LoD with `BatchGate`.
Attributes:  activation (Duplicable): (string, default tanh) The activation type used for the output candidate $\tilde{h}_t$.
 gate_activation (Duplicable): (string, default sigmoid) The activation type used in update gate and reset gate.
 is_reverse (Duplicable): (bool, default: False) whether to compute reversed GRU.
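A single GRU timestep following the formulas above, as a numpy sketch (illustrative; the operator's single (D x 3D) Weight is split into three (D x D) matrices here for clarity, and xu, xr, xc are the precomputed input projections):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(xu, xr, xc, h_prev, w_u, w_r, w_c):
    u = sigmoid(xu + w_u @ h_prev)               # update gate
    r = sigmoid(xr + w_r @ h_prev)               # reset gate
    h_tilde = np.tanh(xc + w_c @ (r * h_prev))   # output candidate
    # elementwise interpolation between previous state and candidate
    return (1.0 - u) * h_prev + u * h_tilde

h_prev = np.ones(2)
h = gru_step(np.zeros(2), np.zeros(2), np.zeros(2), h_prev,
             np.zeros((2, 2)), np.zeros((2, 2)), np.zeros((2, 2)))
# with zero inputs and weights: u = 0.5, h_tilde = 0, so h = 0.5 * h_prev
```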
recurrent
Static Length Recurrent Operator.
The static length recurrent operator can only operate on fixed size sequence data, i.e. in each minibatch, the sequence length of all inputs are the same.
Inputs:  inputs (Duplicable) : rnn inputs
 initial_states (Duplicable) : rnn initial states
 parameters (Duplicable) : Parameters are used by step block as its input. However, the input is not a sequence tensor. Every time step, each operator in step block just use the parameter directly.
Outputs:  outputs (Duplicable) : The output sequence of RNN. The sequence length must be the same.
 step_scopes : StepScopes contain all local variables in each time step.
Attributes:  ex_states (Duplicable): The ex-state variable names. The ex-state means the state value in the ex-timestep, i.e. the previous time step. [ex_states, states, initial_states@GRAD] must be in the same order.
 states (Duplicable): The state variable names. [ex_states, states, initial_states@GRAD] must be in the same order.
 step_block (Duplicable): The step block inside RNN
 reverse (Duplicable): Calculate RNN reversely or not. By default reverse=False. Assume the input data is [A, B, C, D]. If reverse is False, the computation of the RNN is:
    A      B      C      D
    v      v      v      v
  rnn -> rnn -> rnn -> rnn
    v      v      v      v
    o      o      o      o
If reverse is True, the computation of the RNN is:
    A      B      C      D
    v      v      v      v
  rnn <- rnn <- rnn <- rnn
    v      v      v      v
    o      o      o      o
 is_train (Duplicable):
save
Save operator
This operator will serialize and write a tensor variable to file on disk.
Inputs:  X : (Tensor ) Input tensor to be saved
Outputs: Attributes:  overwrite (Duplicable): (boolean, default true) Overwrite the output file if it exists
 file_path (Duplicable): (string)The "file_path" where the variable will be saved.
load
Load Operator.
Load operator will load a tensor variable from disk file.
Inputs: Outputs:  Out : (Tensor) The tensor to be loaded
Attributes:  file_path (Duplicable): (string) Variable will be loaded from "file_path".
auc
Area Under The Curve (AUC) Operator.
This implementation computes the AUC according to forward output and label. It is used very widely in binary classification evaluation. As a note: If input label contains values other than 0 and 1, it will be cast to bool. You can find the relevant definitions here: https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve
There are two types of possible curves: 1. ROC: Receiver operating characteristic 2. PR: Precision Recall
Inputs:  Out : A floating point 2D tensor, values are in the range [0, 1]. Each row is sorted in descending order. This input should be the output of topk. Typically, this tensor indicates the probability of each label.
 Indices : An int 2D tensor, indicating the indices of the original tensor before sorting. Typically, this tensor indicates which label the probability stands for.
 Label : A 2D int tensor indicating the label of the training data.The height is batch size and width is always 1.
Outputs:  AUC : A scalar representing the current area-under-the-curve.
Attributes:  curve (Duplicable): Curve type, can be 'ROC' or 'PR'.
 num_thresholds (Duplicable): The number of thresholds to use when discretizing the ROC curve.
hard_sigmoid
HardSigmoid Activation Operator.
Segment-wise linear approximation of sigmoid (https://arxiv.org/abs/1603.00391), which is much faster than sigmoid.
$y = max(0, min(1, slope * x + offset))$
The slope should be positive. The offset can be either positive or negative. The default slope and offset are set according to the above reference. It is recommended to use the defaults for this activation.
Inputs:  X : Input of HardSigmoid operator
Outputs:  Y : Output of HardSigmoid operator
Attributes:  slope (Duplicable): Slope for linear approximation of sigmoid
 offset (Duplicable): Offset for linear approximation of sigmoid
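As a plain-Python sketch of the computation above (the helper name and the slope=0.2, offset=0.5 defaults are illustrative assumptions, not necessarily the operator's authoritative defaults):

```python
def hard_sigmoid(x, slope=0.2, offset=0.5):
    # Piecewise-linear approximation: clamp slope * x + offset into [0, 1].
    return max(0.0, min(1.0, slope * x + offset))

# Saturates to 0 for very negative inputs and to 1 for very positive ones.
print([hard_sigmoid(x) for x in (-10.0, 0.0, 10.0)])  # → [0.0, 0.5, 1.0]
```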
cond
Sample Dependent Conditional Operator.
Given Cond[i] as a 1/0 vector to indicate true/false:
Out[i] = subnet_true[i], if Cond[i] == true
Out[i] = subnet_false[i], if Cond[i] == false
Inputs:  Cond : The condition, which is a bool vector
 Xs (Duplicable) : Inputs of Subnets
Outputs:  Outs (Duplicable) : Outputs of Cond_Op after merge
 SubScopes : sub scopes for true and false branches
 IndexTensors : Index Tensors contains indices for true/false
max_pool2d_with_index
MaxPool2d Operator.
The maxPooling2d with index operation calculates the output and the mask based on the input, ksize, strides, and paddings parameters. Input(X) and output(Out, Mask) are in NCHW format, where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature. Parameters(ksize, strides, paddings) are two elements. These two elements represent height and width, respectively. The input(X) size and output(Out, Mask) size may be different.
Example: Input: X shape: $(N, C, H_{in}, W_{in})$ Output: Out shape: $(N, C, H_{out}, W_{out})$ Mask shape: $(N, C, H_{out}, W_{out})$ Where $$ H_{out} = \frac{(H_{in} - ksize[0] + 2 * paddings[0])}{strides[0]} + 1 \\ W_{out} = \frac{(W_{in} - ksize[1] + 2 * paddings[1])}{strides[1]} + 1 $$
Inputs:  X : (Tensor) The input tensor of pooling operator. The format of input tensor is NCHW, where N is batch size, C is the number of channels, H is the height of the image, and W is the width of the image.
Outputs:  Out : (Tensor) The output tensor of pooling operator. The format of output tensor is also NCHW, where N is batch size, C is the number of channels, H is the height of the image and W is the width of the image.
 Mask : (Tensor) The Mask tensor of pooling operator.The format of output tensor is also NCHW, where N is batch size, C is the number of channels, H is the height of the image, and W is the width of the image. It represents the index in the current feature map.
Attributes:  ksize (Duplicable): (vector<int>) The pooling window size(height, width) of pooling operator. If global_pooling = true, ksize and paddings will be ignored.
 global_pooling (Duplicable): (bool, default:false) Whether to use the global pooling. If global_pooling = true, ksize and paddings will be ignored.
 strides (Duplicable): (vector<int>, default {1, 1}), strides(height, width) of pooling operator.
 paddings (Duplicable): (vector<int>, default:{0, 0}), paddings(height, width) of pooling operator. If global_pooling = true, paddings and ksize will be ignored.
thresholded_relu
ThresholdedRelu Activation Operator.
$$ y = \begin{cases} x, \text{if } x > threshold \\ 0, \text{otherwise} \end{cases} $$
Inputs:  X : Input of ThresholdedRelu operator
Outputs:  Y : Output of ThresholdedRelu operator
Attributes:  threshold (Duplicable): The threshold location of activation
hard_shrink
HardShrink Activation Operator.
$$ y = \begin{cases} x, \text{if } x > \lambda \\ x, \text{if } x < -\lambda \\ 0, \text{otherwise} \end{cases} $$
Inputs:  X : Input of HardShrink operator
Outputs:  Y : Output of HardShrink operator
Attributes:  threshold (Duplicable): The value of threshold for HardShrink
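A minimal sketch of the piecewise definition above; the helper name and the threshold default of 0.5 are illustrative assumptions:

```python
def hard_shrink(x, threshold=0.5):
    # Pass x through unchanged outside [-threshold, threshold]; zero it inside.
    return x if (x > threshold or x < -threshold) else 0.0
```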
relu6
Relu6 Activation Operator.
$y = min(max(0, x), 6)$
Inputs:  X : Input of Relu6 operator
Outputs:  Y : Output of Relu6 operator
Attributes:  threshold (Duplicable): The threshold value of Relu6
elu
ELU Activation Operator.
Applies the following elementwise computation on the input according to https://arxiv.org/abs/1511.07289.
$y = max(0, x) + min(0, alpha * (e^x - 1))$
Inputs:  X : Input of ELU operator
Outputs:  Y : Output of ELU operator
Attributes:  alpha (Duplicable): The alpha value of ELU
leaky_relu
LeakyRelu Activation Operator.
$y = max(x, alpha * x)$
Inputs:  X : Input of LeakyRelu operator
Outputs:  Y : Output of LeakyRelu operator
Attributes:  alpha (Duplicable): The small negative slope
top_k
Top K operator
If the input is a vector (1-D tensor), this operator finds the k largest entries in the vector and outputs their values and indices as vectors. Thus values[j] is the j-th largest entry in the input, and its index is indices[j].
For matrices, this operator computes the top k entries in each row.
Inputs:  X : (Tensor) The input of Topk op
Outputs:  Out : (Tensor) The output tensor of Topk op
 Indices : (Tensor) The indices of Topk elements of input
Attributes:  k (Duplicable): (int, default 1) Number of top elements to look for along the last dimension (along each row for matrices).
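The per-row behavior can be sketched as follows; the helper name is hypothetical, and tie-breaking order is not something this sketch specifies:

```python
def top_k(row, k=1):
    # Indices of the k largest entries, values sorted in descending order.
    indices = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
    values = [row[i] for i in indices]
    return values, indices

values, indices = top_k([3.0, 1.0, 4.0, 1.5], k=2)  # values [4.0, 3.0], indices [2, 0]
```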
sequence_softmax
Sequence Softmax Operator.
SequenceSoftmaxOp computes the softmax activation among all timesteps for each sequence. The dimension of each timestep should be 1. Thus, the shape of input Tensor can be either [N, 1] or [N], where N is the sum of the length of all sequences.
The algorithm works as follows: for the i-th sequence in a mini-batch: $$Out(X[lod[i]:lod[i+1]], :) = \frac{\exp(X[lod[i]:lod[i+1], :])} {\sum(\exp(X[lod[i]:lod[i+1], :]))}$$
For example, for a mini-batch of 3 variable-length sequences containing 2, 3 and 2 timesteps, the lod of which is [0, 2, 5, 7], softmax will be computed among X[0:2, :], X[2:5, :], X[5:7, :], and N turns out to be 7.
Inputs:  X : (LoDTensor) 1D or 2D input LoDTensor with the 2nd dimension of length 1.
Outputs:  Out : (LoDTensor) 1D or 2D output LoDTensor with the 2nd dimension of length 1.
decayed_adagrad
Decayed Adagrad Optimizer.
The update is done as follows:
$$ moment\_out = decay * moment + (1 - decay) * grad * grad \\ param\_out = param - \frac{learning\_rate * grad}{\sqrt{moment\_out} + epsilon} $$
The original paper(http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf) does not have an epsilon attribute. It is added here for numerical stability to avoid the division by zero error.
Inputs:  Param : (Tensor) Input parameter
 Grad : (Tensor) Input gradient
 Moment : (Tensor) Second moment
 LearningRate : (Tensor) Learning rate
Outputs:  ParamOut : (Tensor) Output parameter
 MomentOut : (Tensor) Output second moment
Attributes:  decay (Duplicable): (float, default 0.95) Discounting factor for coming gradient
 epsilon (Duplicable): (float, default 1.0e-6) Constant for numerical stability
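A single scalar update step, transcribed from the equations above; the helper name is illustrative:

```python
import math

def decayed_adagrad_step(param, grad, moment, lr, decay=0.95, epsilon=1e-6):
    # Accumulate a decayed sum of squared gradients, then scale the step.
    moment_out = decay * moment + (1 - decay) * grad * grad
    param_out = param - lr * grad / (math.sqrt(moment_out) + epsilon)
    return param_out, moment_out
```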
scale
Scale operator
$$Out = scale*X$$
Inputs:  X : (Tensor) Input tensor of scale operator.
Outputs:  Out : (Tensor) Output tensor of scale operator.
Attributes:  scale (Duplicable): (float, default 0) The scaling factor of the scale operator.
increment
Increment Operator.
The equation is: $$Out = X + step$$
Inputs:  X : (Tensor) The input tensor of increment operator
Outputs:  Out : (Tensor) The output tensor of increment operator.
Attributes:  step (Duplicable): (float, default 1.0) The step size by which the input tensor will be incremented.
expand
Expand operator tiles the input by the given number of times. You should set the times number for each dimension by providing the attribute 'expand_times'. The rank of X should be in [1, 6]. Note that the size of 'expand_times' must be the same as X's rank. The following is a use case:
Input(X) is a 3D tensor with shape [2, 3, 1]:
[ [[1], [2], [3]], [[4], [5], [6]] ]
Attr(expand_times): [1, 2, 2]
Output(Out) is a 3D tensor with shape [2, 6, 2]:
[ [[1, 1], [2, 2], [3, 3], [1, 1], [2, 2], [3, 3]], [[4, 4], [5, 5], [6, 6], [4, 4], [5, 5], [6, 6]] ]
Inputs:  X : (Tensor, default Tensor<float>) A tensor with rank in [1, 6]. X is the input tensor to be expanded.
Outputs:  Out : (Tensor, default Tensor<float>) A tensor with rank in [1, 6]. The rank of Output(Out) is the same as Input(X), except that each dimension size of Output(Out) equals the corresponding dimension size of Input(X) multiplied by the corresponding value of Attr(expand_times).
Attributes:  expand_times (Duplicable): Expand times number for each dimension.
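The example above can be reproduced with a small sketch: tiling is equivalent to indexing the input modulo its original dimension sizes. The helper name `expand3d` is hypothetical and handles only rank-3 inputs:

```python
def expand3d(x, times):
    # Tile a 3-D nested list: Out[i][j][k] = X[i % d0][j % d1][k % d2].
    d0, d1, d2 = len(x), len(x[0]), len(x[0][0])
    return [[[x[i % d0][j % d1][k % d2]
              for k in range(d2 * times[2])]
             for j in range(d1 * times[1])]
            for i in range(d0 * times[0])]

x = [[[1], [2], [3]], [[4], [5], [6]]]
out = expand3d(x, [1, 2, 2])  # shape [2, 6, 2], matching the example above
```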
lod_array_length
LoDArrayLength Operator.
This operator obtains the length of lod tensor array:
$$Out = len(X)$$
NOTE: The output is a CPU Tensor since the control variable should live only on the CPU, and the length of the LoDTensorArray should be used as a control variable.
Inputs:  X : (LoDTensorArray) The input tensor array.
Outputs:  Out : (Tensor) 1x1 CPU Tensor of length, int64_t
reduce_sum
{ReduceOp} Operator.
This operator computes the sum of input tensor along the given dimension. The result tensor has 1 fewer dimension than the input unless keep_dim is true.
Inputs:  X : (Tensor) The input tensor. Tensors with rank at most 6 are supported.
Outputs:  Out : (Tensor) The result tensor.
Attributes:  dim (Duplicable): (int, default 0) The dimension to reduce. Must be in the range [-rank(input), rank(input)). If `dim < 0`, the dimension to reduce is `rank + dim`. Note that reducing on the first dimension will make the LoD info lost.
 keep_dim (Duplicable): (bool, default false) If true, retain the reduced dimension with length 1.
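A 2-D sketch of the dim/keep_dim semantics described above, using plain Python lists; the helper name is hypothetical:

```python
def reduce_sum_2d(x, dim=0, keep_dim=False):
    # Sum a 2-D nested list along `dim`; a negative dim wraps to rank + dim.
    if dim < 0:
        dim += 2
    if dim == 0:
        out = [sum(col) for col in zip(*x)]
        return [out] if keep_dim else out
    out = [sum(row) for row in x]
    return [[v] for v in out] if keep_dim else out
```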
tanh_shrink
TanhShrink Activation Operator.
$$y = x - \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
Inputs:  X : Input of TanhShrink operator
Outputs:  Y : Output of TanhShrink operator
adam
Adam Optimizer.
This implements the Adam optimizer from Section 2 of the Adam paper: https://arxiv.org/abs/1412.6980. Adam is a first-order gradient-based optimization method based on adaptive estimates of lower-order moments.
Adam updates:
$$ moment\_1\_out = \beta_1 * moment\_1 + (1 - \beta_1) * grad \\ moment\_2\_out = \beta_2 * moment\_2 + (1 - \beta_2) * grad * grad \\ learning\_rate = learning\_rate * \frac{\sqrt{1 - \beta_{2\_pow}}}{1 - \beta_{1\_pow}} \\ param\_out = param - learning\_rate * \frac{moment\_1}{\sqrt{moment\_2} + \epsilon} $$
Inputs:  Param : (Tensor) Input parameter
 Grad : (Tensor) Input gradient
 LearningRate : (Tensor) Learning rate
 Moment1 : (Tensor) Input first moment
 Moment2 : (Tensor) Input second moment
 Beta1Pow : (Tensor) Input beta1 power accumulator
 Beta2Pow : (Tensor) Input beta2 power accumulator
Outputs:  ParamOut : (Tensor) Output parameter
 Moment1Out : (Tensor) Output first moment
 Moment2Out : (Tensor) Output second moment
Attributes:  beta1 (Duplicable): (float, default 0.9) Exponential decay rate for the first moment estimates.
 beta2 (Duplicable): (float, default 0.999) Exponential decay rate for the second moment estimates.
 epsilon (Duplicable): (float, default 1.0e-8) Constant for numerical stability
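The four update equations transcribed for a scalar parameter; the helper name is hypothetical, and the updated moments are used in the parameter step, as in standard Adam:

```python
import math

def adam_step(param, grad, lr, m1, m2, beta1_pow, beta2_pow,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    # Update biased first and second moment estimates, correct the learning
    # rate for bias, then take the step.
    m1_out = beta1 * m1 + (1 - beta1) * grad
    m2_out = beta2 * m2 + (1 - beta2) * grad * grad
    lr_t = lr * math.sqrt(1 - beta2_pow) / (1 - beta1_pow)
    param_out = param - lr_t * m1_out / (math.sqrt(m2_out) + epsilon)
    return param_out, m1_out, m2_out
```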
reduce_min
{ReduceOp} Operator.
This operator computes the min of input tensor along the given dimension. The result tensor has 1 fewer dimension than the input unless keep_dim is true.
Inputs:  X : (Tensor) The input tensor. Tensors with rank at most 6 are supported.
Outputs:  Out : (Tensor) The result tensor.
Attributes:  dim (Duplicable): (int, default 0) The dimension to reduce. Must be in the range [-rank(input), rank(input)). If `dim < 0`, the dimension to reduce is `rank + dim`. Note that reducing on the first dimension will make the LoD info lost.
 keep_dim (Duplicable): (bool, default false) If true, retain the reduced dimension with length 1.
lod_reset
LoDReset operator
Reset LoD of Input(X) into a new one specified by Input(TargetLoD) or Attr(target_lod), or set LoD for Input(X) if it doesn't have one. Currently the lod_reset operator only supports the reset of level 0 LoD. At least one of Input(TargetLoD) and Attr(target_lod) must be set, and if both of them are set, Input(TargetLoD) will be chosen as the target LoD.
An example: Given a float LoDTensor X with shape (6, 1), its transpose form represents
[1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
with LoD = [[0, 2, 5, 6]] and the three (transposed) sequences look like
[1.0, 2.0], [3.0, 4.0, 5.0], [6.0].
If target LoD = [0, 4, 6], the lod_reset operator will reset the LoD, and the sequences that the LoDTensor Output(Out) contains become:
[1.0, 2.0, 3.0, 4.0], [5.0, 6.0].
Inputs:  X : (LoDTensor) The input tensor of lod_reset operator.
 TargetLoD : (Tensor, optional) The target level 0 LoD from Input().
Outputs:  Out : (LoDTensor) The output tensor of lod_reset operator.
Attributes:  target_lod (Duplicable): The target level 0 LoD from Attr().
write_to_array
WriteToArray Operator.
This operator writes a LoDTensor to a LoDTensor array.
Assume $T$ is LoDTensor, $i$ is the subscript of the array, and $A$ is the array. The equation is
$$A[i] = T$$
Inputs:  X : (LoDTensor) the tensor to be written to the tensor array
 I : (Tensor) the subscript index in the tensor array. The number of elements should be 1
Outputs:  Out : (TensorArray) the tensor array to be written
reshape
Reshape Operator.
Reshape Input(X) into the shape specified by Attr(shape).
An example: Given a 2D tensor X with 2 rows and 2 columns
[[1, 2], [3, 4]]
and target shape = [1, 4], the reshape operator will transform the tensor X into a 1D tensor:
[1, 2, 3, 4]
Inputs:  X : The input tensor of reshape operator.
Outputs:  Out : The output tensor of reshape operator.
Attributes:  shape (Duplicable): (vector<int>) Target shape of reshape operator.
fill_constant
FillConstant Operator.
Fill up a variable with specified constant value.
Inputs: Outputs:  Out : (Tensor) Tensor of specified shape will be filled with the specified value
Attributes:  dtype (Duplicable): (int, default 5 (FP32)) Output data type
 shape (Duplicable): (vector<int>) The shape of the output
 value (Duplicable): (float, default 0) The value to be filled
 force_cpu (Duplicable): (bool, default false) Force fill output variable to cpu memory. Otherwise, fill output variable to the running device
elementwise_div
Limited Elementwise Div Operator.
The equation is:
$Out = X / Y$
X is a tensor of any dimension, and the dimensions of tensor Y must be smaller than or equal to the dimensions of X.
There are two cases for this operator: 1. The shape of Y is the same as X; 2. The shape of Y is a subset of X.
For case 2: Y will be broadcast to match the shape of X, and axis should be the starting dimension index for broadcasting Y onto X.
Examples:
shape(X) = (2, 3, 4, 5), shape(Y) = (2, 3, 4, 5)
shape(X) = (2, 3, 4, 5), shape(Y) = (5,)
shape(X) = (2, 3, 4, 5), shape(Y) = (4, 5)
shape(X) = (2, 3, 4, 5), shape(Y) = (3, 4), with axis=1
shape(X) = (2, 3, 4, 5), shape(Y) = (2), with axis=0
Both the input X and Y can carry the LoD (Level of Details) information, or not. But the output only shares the LoD information with input X.
Inputs:  X : (Tensor) The first input tensor of elementwise op
 Y : (Tensor) The second input tensor of elementwise op
Outputs:  Out : The output of elementwise op
Attributes:  axis (Duplicable): (int, default -1) The starting dimension index for broadcasting Y onto X
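A 2-D sketch of case 2 broadcasting with an explicit axis; the helper name is hypothetical, and real tensors generalize this to higher ranks:

```python
def elementwise_div_axis(x, y, axis):
    # x: 2-D nested list; y: 1-D list broadcast along the dimension `axis`.
    rows, cols = len(x), len(x[0])
    if axis == 0:  # y matches the row dimension of x
        return [[x[i][j] / y[i] for j in range(cols)] for i in range(rows)]
    return [[x[i][j] / y[j] for j in range(cols)] for i in range(rows)]
```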
conv2d_cudnn
Convolution Operator.
The convolution operation calculates the output based on the input, filter and strides, paddings, dilations, groups parameters. The size of each dimension of the parameters is checked in the infershape. Input(Input) and Output(Output) are in NCHW format. Where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature. Filters(Input) is MCHW format. Where M is the number of output image channels, C is the number of input image channels, H is the height of the filter, and W is the width of the filter. Parameters(strides, paddings, dilations) are two elements. These two elements represent height and width, respectively. The input(X) size and output(Out) size may be different.
Example: Input: Input shape: $(N, C_{in}, H_{in}, W_{in})$ Filter shape: $(C_{out}, C_{in}, H_f, W_f)$ Output: Output shape: $(N, C_{out}, H_{out}, W_{out})$ Where $$ H_{out}= \frac{(H_{in} + 2 * paddings[0] - (dilations[0] * (H_f - 1) + 1))}{strides[0]}+ 1 \\ W_{out}= \frac{(W_{in} + 2 * paddings[1] - (dilations[1] * (W_f - 1) + 1))}{strides[1]}+ 1 $$
Inputs:  Input : (Tensor) The input tensor of convolution operator. The format of input tensor is NCHW, where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature.
 Filter : (Tensor) The filter tensor of convolution operator. The format of the filter tensor is MCHW, where M is the number of output image channels, C is the number of input image channels, H is the height of the filter, and W is the width of the filter. If the groups attribute is greater than 1, C equals the number of input image channels divided by the groups.
Outputs:  Output : (Tensor) The output tensor of convolution operator. The format of output tensor is also NCHW.
Attributes:  strides (Duplicable): (vector<int> default:{1, 1}), the strides(h_stride, w_stride) of convolution operator.
 paddings (Duplicable): (vector<int> default:{0, 0}), the paddings(h_pad, w_pad) of convolution operator.
 groups (Duplicable): (int default:1), the groups number of the convolution operator. According to grouped convolution in Alex Krizhevsky's Deep CNN paper: when group=2, the first half of the filters is only connected to the first half of the input channels, while the second half of the filters is only connected to the second half of the input channels.
 dilations (Duplicable): (vector<int> default:{1, 1}), the dilations(h_dilation, w_dilation) of convolution operator.
 workspace_size_MB (Duplicable): workspace size for cudnn, in MB, workspace is a section of GPU memory which will be allocated/freed each time the operator runs, larger workspace size can increase performance but also requires better hardware. This size should be chosen carefully.
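The output-size formula from the example above can be checked with a one-line helper (the name is illustrative):

```python
def conv2d_out_size(in_size, filter_size, padding, stride, dilation=1):
    # One spatial dimension of the output, per the formula in the example.
    return (in_size + 2 * padding - (dilation * (filter_size - 1) + 1)) // stride + 1

# A 3x3 filter with padding 1 and stride 1 preserves the spatial size.
print(conv2d_out_size(5, 3, 1, 1))  # → 5
```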
mul
Mul Operator.
This operator is used to perform matrix multiplication for input X and Y.
The equation is:
The equation is:
$$Out = X * Y$$
Both the inputs X and Y can carry the LoD (Level of Details) information, or not. But the output only shares the LoD information with input X.
Inputs:  X : The first input of mul op
 Y : The second input of mul op
Outputs:  Out : The output of mul op
Attributes:  x_num_col_dims (Duplicable): (int, default 1) mul_op can take tensors with more than two dimensions as input `X`, in that case, tensors will be reshaped to a matrix. The matrix's first dimension(column length) will be the product of tensor's last `num_col_dims` dimensions, and the matrix's second dimension(row length) will be the product of tensor's first `rank  num_col_dims` dimensions.
 y_num_col_dims (Duplicable): (int, default 1) mul_op can take tensors with more than two dimensions as input `Y`, in that case, tensors will be reshaped to a matrix. Just like input `X`.
margin_rank_loss
MarginRankLoss Operator.
This operator measures the loss given a pair of training samples {X1, X2} and the Label with the attribute margin, where Label = +1 indicates that X1 is ranked higher than X2, and Label = -1 otherwise. The loss is calculated as:
$loss(X1, X2, Label) = max(0, -Label * (X1 - X2) + margin)$
The attribute margin here helps make the predictions more robust. Denote the item ranked higher as the positive sample, and the other as the negative sample. If the scores of the two samples satisfy
$positive sample - negative sample < margin$
the pair of samples will contribute to the final loss, which will backpropagate and train the ranking model to enlarge the difference between the two scores.
For batch input with size batch_size, X1, X2 and Label all have the same shape [batch_size x 1].
Inputs:  X1 : (2D tensor with shape [batch_size x 1]) The score for one item X1 to be ranked, from the pairwise ranking model.
 X2 : (2D tensor with shape [batch_size x 1]) The score for another item X2 to be ranked, from pairwise ranking model.
 Label : (2D tensor with shape [batch_size x 1]) The label indicating X1 ranked higher than X2 or not, can only be +1 or 1.
Outputs:  Activated (Intermediate) : (2D tensor with shape [batch_size x 1]) Intermediate tensor to indicate whether each element of Output(Out) is activated.
 Out : (2D tensor with shape [batch_size x 1]) The output loss of MarginRankLoss operator.
Attributes:  margin (Duplicable): (scalar, default 0) Margin for MarginRankLossOp.
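The per-sample loss transcribed directly; the helper name is hypothetical:

```python
def margin_rank_loss(x1, x2, label, margin=0.0):
    # loss = max(0, -label * (x1 - x2) + margin), per sample.
    return max(0.0, -label * (x1 - x2) + margin)
```

With margin = 0 the loss is zero exactly when the pair is already ranked consistently with the label.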
greater_equal
greater_equal Operator
It operates element-wise on X and Y, and returns Out. Each of them is an N-dim tensor. X and Y could be any type. Each element of the Out tensor is calculated by Out = X >= Y
Inputs:  X : (LoDTensor) the left hand operand of greater_equal operator
 Y : (LoDTensor) the right hand operand of greater_equal operator
Outputs:  Out : (LoDTensor) ndim bool tensor. Each element is Out = X >= Y
reciprocal
Reciprocal Activation Operator.
$$y = \frac{1}{x}$$
Inputs:  X : Input of Reciprocal operator
Outputs:  Y : Output of Reciprocal operator
squared_l2_norm
SquaredL2Norm Operator.
Computes the squared L2 norm of a tensor.
$$Out = \sum_{i} X_{i}^2$$
Inputs:  X : (Tensor) The input of squared_l2_norm op.
Outputs:  Out : (Scalar) The output of squared_l2_norm op.
shrink_rnn_memory
In dynamic RNN, we are able to handle sequences of different lengths. Because of the varying lengths, the size of each step's input can be different, which may lead to a mismatch between the input of the current step and the memory generated by the previous one. This operator shrinks the memory according to the size of the next step's input, to make sure that they can match each other.
Inputs:  X : (LoDTensor) The RNN step memory to be shrunk.
 RankTable : (LoDRankTable) The lod_rank_table of dynamic RNN.
 I : (LoDTensor) The step index. The RNN step memory 'X' will be shrunk to match the size of the input of the index'th step.
Outputs:  Out : (LoDTensor) The shrunk RNN step memory.
conditional_block
Conditional block operator
Run the subblock if X is not empty. Params is the other inputs and Out is the outputs of the subblock.
Inputs:  X (Duplicable) : The conditional variable of this operator. If X is empty, the whole subblock will not be executed.
 Params (Duplicable) : The input variables of the subblock.
Outputs:  Out (Duplicable) : The output variables of the subblock.
 Scope : (std::vector<Scope*>) The step scope of conditional block. To unify the conditional block, rnn and while op, the type of scope is std::vector<Scope*>
Attributes:  block (Duplicable): The step block of conditional block operator
lookup_table
Lookup Table Operator.
This operator is used to perform lookups on the parameter W; the looked-up rows are then concatenated into a dense tensor.
The input Ids can carry the LoD (Level of Details) information, or not. And the output only shares the LoD information with input Ids.
Inputs:  W : An input represents embedding tensors, which is a learnable parameter.
 Ids : An input with type int32 or int64 contains the ids to be looked up in W. Ids must be a column vector with rank = 2. The 2nd dimension size must be 1.
Outputs:  Out : The lookup results, which have the same type as W.
Attributes:  is_sparse (Duplicable): (boolean, default false) Sparse update
pad
Pad Operator.
Pad the input into the output, as specified by paddings and pad_value. The input should be a k-D tensor (k > 0 and k < 7). As an example:
Given:
X = [[1, 2], [3, 4]],
paddings = [0, 1, 1, 2],
and
pad_value = 0,
we have:
Out = [[0, 1, 2, 0, 0],
       [0, 3, 4, 0, 0],
       [0, 0, 0, 0, 0]]
Inputs:  X : The input of pad op. The input should be a k-D tensor (k > 0 and k < 7)
Outputs:  Out : The output of pad op. A tensor with the same shape as X.
Attributes:  paddings (Duplicable): (vector<int>) A list<int> to describe the padding rules for each dimension. For 2D image tensor, paddings=[0, 1, 2, 3] means padding 0 row to top, 1 row to bottom, 2 columns to left and 3 columns to right. Size of paddings should be equal to 2 * dimension size of the input tensor.
 pad_value (Duplicable): (float, default 0.0) The value to fill the padded areas.
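The example above can be reproduced with a 2-D sketch; the helper name is hypothetical, with paddings ordered [top, bottom, left, right] as in the attribute description:

```python
def pad2d(x, paddings, pad_value=0.0):
    # Surround a 2-D nested list with rows/columns of pad_value.
    top, bottom, left, right = paddings
    out_cols = left + len(x[0]) + right
    out = [[pad_value] * out_cols for _ in range(top)]
    for row in x:
        out.append([pad_value] * left + list(row) + [pad_value] * right)
    out.extend([[pad_value] * out_cols for _ in range(bottom)])
    return out

out = pad2d([[1, 2], [3, 4]], [0, 1, 1, 2], 0)  # reproduces the doc example
```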
split_lod_tensor
Split a LoDTensor with a Mask at a certain LoD level. Suppose the input LoDTensor has 3 sequences at that level, and the Mask is a bool column vector, such as [0, 1, 0], at the same level. The first and third sequences will be sent to the False Output LoDTensor, whereas the second sequence will be sent to the True Output LoDTensor. Please refer to MergeLoDTensorOp.
Inputs:  X : The input LoDTensor
 Mask : A bool column vector which mask the input
Outputs:  OutTrue : True branch of input LoDTensor
 OutFalse : False branch of input LoDTensor
Attributes:  level (Duplicable): (int) the specific lod level to split.
max_sequence_len
Calculate the max sequence length through lod_rank_table.
Inputs:  RankTable : The lod_rank_table.
Outputs:  Out : The max sequence length.
multiplex
Multiplex Operator.
Multiplex multiple tensors according to the index provided by the index tensor.
Ids: the index tensor. X[0 : N - 1]: the candidate tensors for output (N >= 2). For each index i from 0 to batchSize - 1, the output is the i-th row of the (Ids[i])-th tensor.
For the i-th row of the output tensor:
$$y[i] = x_{k}[i]$$
where y is the output tensor, x_{k} is the k-th input tensor, and k = Ids[i].
Inputs:  Ids : The index tensor of multiplex operator.
 X (Duplicable) : The candidate tensors of multiplex operator.
Outputs:  Out : The output tensor of multiplex operator.
stanh
STanh Activation Operator.
$$y = b * \frac{e^{a * x} - e^{-a * x}}{e^{a * x} + e^{-a * x}}$$
Inputs:  X : Input of STanh operator
Outputs:  Y : Output of STanh operator
Attributes:  scale_a (Duplicable): The scale parameter of a for the input
 scale_b (Duplicable): The scale parameter of b for the input
adamax
Adamax Optimizer.
We implement the Adamax optimizer from Section 7 of the Adam paper: https://arxiv.org/abs/1412.6980. Adamax is a variant of the Adam algorithm based on the infinity norm.
Adamax updates:
$$ moment\_out = \beta_1 * moment + (1 - \beta_1) * grad \\ inf\_norm\_out = max(\beta_2 * inf\_norm + \epsilon, grad) \\ learning\_rate = \frac{learning\_rate}{1 - \beta_{1\_pow}} \\ param\_out = param - learning\_rate * \frac{moment\_out}{inf\_norm\_out} $$
The original paper does not have an epsilon attribute. However, it is added here for numerical stability to prevent the division by 0 error.
Inputs:  Param : (Tensor) Input parameter
 Grad : (Tensor) Input gradient
 LearningRate : (Tensor) Learning rate
 Moment : (Tensor) First moment
 InfNorm : (Tensor) Input exponentially weighted infinity norm
 Beta1Pow : (Tensor) Input beta1 power accumulator
Outputs:  ParamOut : (Tensor) Output parameter
 MomentOut : (Tensor) Output first moment
 InfNormOut : (Tensor) Output exponentially weighted infinity norm
Attributes:  beta1 (Duplicable): (float, default 0.9) Exponential decay rate for the 1st moment estimates.
 beta2 (Duplicable): (float, default 0.999) Exponential decay rate for the weighted infinity norm estimates.
 epsilon (Duplicable): (float, default 1.0e-8) Constant for numerical stability
l1_norm
L1 Norm Operator.
Computes the L1 norm of a tensor.
$$Out = \sum{|X|}$$
Inputs:  X : (Tensor) The input of l1_norm op.
Outputs:  Out : (Scalar) The output of l1_norm op.
dropout
Dropout Operator.
Dropout refers to randomly dropping out units in a neural network. It is a regularization technique for reducing overfitting by preventing neuron co-adaptation during training. The dropout operator randomly sets (according to the given dropout probability) the outputs of some units to zero, while others are set equal to their corresponding inputs.
Inputs:  X : The input of dropout op.
Outputs:  Out : The output of dropout op.
 Mask (Intermediate) : The random sampled dropout mask.
Attributes:  dropout_prob (Duplicable): Probability of setting units to zero.
 is_test (Duplicable): True if in test phase.
 seed (Duplicable): Dropout random seed.
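A training-phase sketch of the masking described above; the helper name is hypothetical, and test-phase behavior is omitted since this entry does not specify it:

```python
import random

def dropout_train(xs, dropout_prob, seed=0):
    # Zero each unit with probability dropout_prob; keep the mask so the
    # backward pass can drop the same units.
    rng = random.Random(seed)
    mask = [0.0 if rng.random() < dropout_prob else 1.0 for _ in xs]
    out = [x * m for x, m in zip(xs, mask)]
    return out, mask
```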
lod_tensor_to_array
Inputs:  X :
 RankTable :
Outputs:  Out :
pool2d_cudnn
Pool2d Operator.
The pooling2d operation calculates the output based on the input, pooling_type and ksize, strides, paddings parameters. Input(X) and output(Out) are in NCHW format, where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature. Parameters(ksize, strides, paddings) are two elements. These two elements represent height and width, respectively. The input(X) size and output(Out) size may be different.
Example:
Input: X shape: $(N, C, H_{in}, W_{in})$ Output: Out shape: $(N, C, H_{out}, W_{out})$ Where $$ H_{out} = \frac{(H_{in} - ksize[0] + 2 * paddings[0])}{strides[0]} + 1 \\ W_{out} = \frac{(W_{in} - ksize[1] + 2 * paddings[1])}{strides[1]} + 1 $$
Inputs:  X : (Tensor) The input tensor of pooling operator. The format of input tensor is NCHW, where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature.
Outputs:  Out : (Tensor) The output tensor of pooling operator. The format of output tensor is also NCHW, where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature.
Attributes:  pooling_type (Duplicable): (string), pooling type, can be "max" for maxpooling and "avg" for averagepooling.
 ksize (Duplicable): (vector<int>) The pooling window size(height, width) of the pooling operator. If global_pooling = true, ksize and paddings will be ignored.
 global_pooling (Duplicable): (bool, default false) Whether to use the global pooling. If global_pooling = true, ksize and paddings will be ignored.
 strides (Duplicable): (vector<int>, default {1, 1}), strides(height, width) of pooling operator.
 paddings (Duplicable): (vector<int>, default {0, 0}), paddings(height, width) of pooling operator. If global_pooling = true, paddings and ksize will be ignored.
conv2d_transpose_cudnn
Convolution2D Transpose Operator.
The convolution transpose operation calculates the output based on the input, filter and strides, paddings, groups parameters. The size of each dimension of the parameters is checked in the infershape. Input(Input) and output(Output) are in NCHW format, where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature. Filter(Input) is in MCHW format, where M is the number of input feature channels, C is the number of output feature channels, H is the height of the filter, and W is the width of the filter. Parameters(strides, paddings) are two elements. These two elements represent height and width, respectively. The input(X) size and output(Out) size may be different.
Example: Input: Input shape: $(N, C_{in}, H_{in}, W_{in})$ Filter shape: $(C_{in}, C_{out}, H_f, W_f)$ Output: Output shape: $(N, C_{out}, H_{out}, W_{out})$ Where $$ H_{out} = (H_{in}  1) * strides[0]  2 * paddings[0] + H_f \\ W_{out} = (W_{in}  1) * strides[1]  2 * paddings[1] + W_f $$
Inputs:  Input : (Tensor) The input tensor of convolution transpose operator. The format of input tensor is NCHW. Where N is batch size, C is the number of input channels, H is the height of the feature, and W is the width of the feature.
 Filter : (Tensor) The filter tensor of convolution transpose operator. The format of the filter tensor is MCHW, where M is the number of input feature channels, C is the number of output feature channels,H is the height of the filter, and W is the width of the filter. We enforce groups number == 1 in the convolution transpose scenario.
Outputs:  Output : (Tensor) The output tensor of convolution transpose operator. The format of output tensor is also NCHW.
Attributes:  strides (Duplicable): (vector<int> default:{1, 1}), the strides(h_stride, w_stride) of convolution transpose operator.
 paddings (Duplicable): (vector<int> default:{0, 0}), the paddings(h_pad, w_pad) of convolution transpose operator.
 dilations (Duplicable): dilations of convolution operator.
 workspace_size_MB (Duplicable): workspace size for cudnn, in MB. Workspace is a section of GPU memory which will be allocated/freed each time the operator runs; a larger workspace size can increase performance but also requires better hardware. This size should be set carefully.
gaussian_random
GaussianRandom Operator.
Used to initialize tensors with gaussian random generator.
Inputs: Outputs:  Out : Output matrix of gaussian random op
Attributes:  shape (Duplicable): (vector<int>) The dimension of random tensor.
 mean (Duplicable): (float, default 0.0) mean of random tensor.
 std (Duplicable): (float, default 1.0) std of random tensor.
 seed (Duplicable): (int, default 0) Random seed of generator. 0 means use the system-wide seed.
 dtype (Duplicable): (int, default 5(FP32)) Output data type.
lstm_unit
Lstm Unit Operator
Equation:
$$ i, f, o, j = split(X) \\ C = C_{prev} * sigm(f + forget\_bias) + sigm(i) * tanh(j) \\ H = C * sigm(o) $$
Inputs:  X : FC input before the nonlinear activation.
 C_prev : The cell state tensor of last timestep in the Lstm Unit operator.
Outputs:  C : The cell tensor of Lstm Unit operator.
 H : The hidden state tensor of Lstm Unit operator.
Attributes:  forget_bias (Duplicable): (float, default 0.0) The forget bias of Lstm Unit.
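The equations above can be sketched in numpy (a minimal illustration; `lstm_unit` and `sigm` are stand-in names, assuming X holds the pre-activation gates concatenated along the second axis):

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_unit(x, c_prev, forget_bias=0.0):
    # x has shape [batch, 4*D]; split it into the i, f, o, j gate blocks.
    i, f, o, j = np.split(x, 4, axis=1)
    c = c_prev * sigm(f + forget_bias) + sigm(i) * np.tanh(j)
    h = c * sigm(o)
    return c, h
```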
sign
Sign operator
$$Out = X.sign()$$
Inputs:  X : (Tensor) Input tensor of sign operator.
Outputs:  Out : (Tensor) Output tensor of sign operator.
pow
Pow Activation Operator.
$y = x^{factor}$
Inputs:  X : Input of Pow operator
Outputs:  Y : Output of Pow operator
Attributes:  factor (Duplicable): The exponential factor of Pow
clip
Clip Operator.
The clip operator limits the value of given input within an interval. The interval is specified with arguments 'min' and 'max':
$$ Out = \min(\max(X, min), max) $$
Inputs:  X : (Tensor) The input of clip op. The number of dimensions must be between [1, 9].
Outputs:  Out : (Tensor) The output of clip op with shape as input(X).
Attributes:  min (Duplicable): (float) Minimum value, under which element is replaced by min.
 max (Duplicable): (float) Maximum value, above which element is replaced by max.
huber_loss
HuberLoss Operator.
Huber loss is a loss function used in robust regression. We define X as the input value and Y as the target value. Huber loss can evaluate the fitness of X to Y. Different from MSE loss, Huber loss is more robust for outliers. The shape of X and Y are [batch_size, 1]. The equation is:
$$ Out_{\delta}(X, Y)_i = \begin{cases} 0.5 * (Y_i - X_i)^2, \quad |Y_i - X_i| \leq \delta \\ \delta * (|Y_i - X_i| - 0.5 * \delta), \quad otherwise \end{cases} $$
In the above equation, $Out_{\delta}(X, Y)_i$, $X_i$ and $Y_i$ represent the i-th element of Out, X and Y, respectively.
Inputs:  X : The input value of huber loss op. X is a 2D tensor with shape [batch_size, 1].
 Y : The target value of huber loss op. Y is a 2D tensor with shape [batch_size, 1].
Outputs:  Residual (Intermediate) : Intermediate tensor to cache residual value between Y and X.The shape is same as Input(X) and will be reused in backward.
 Out : The output tensor with shape [batch_size, 1] which represents the huber loss.
Attributes:  delta (Duplicable): Hyper parameter in huber loss.
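The piecewise definition above can be sketched in numpy (a minimal illustration; `huber_loss` is a stand-in name, not the operator's API):

```python
import numpy as np

def huber_loss(x, y, delta):
    # Quadratic for small residuals, linear for large ones.
    r = np.abs(y - x)
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))
```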
smooth_l1_loss
Smooth L1 Loss Operator.
This operator computes the smooth l1 loss for X and Y. The operator takes the first dimension of X and Y as batch size. For each instance, it computes the smooth l1 loss element by element first and then sums all the losses. So the shape of Out is [batch_size, 1].
The equation is: $$ Out_{\sigma}(X, Y)_i = \begin{cases} 0.5 * (\sigma * (X_i - Y_i))^2, \quad |X_i - Y_i| \lt \frac{1}{{\sigma}^2} \\ |X_i - Y_i| - \frac{0.5}{{\sigma}^2}, \quad otherwise \end{cases} $$
In the above equation, $Out_{\sigma}(X, Y)_i$, $X_i$ and $Y_i$ represent the i-th element of Out, X and Y, respectively.
Inputs:  X : (Tensor, default Tensor<float>) A tensor with rank at least 2. The input value of smooth l1 loss op with shape [batch_size, dim1, ..., dimN].
 Y : (Tensor, default Tensor<float>) A tensor with rank at least 2. The target value of smooth l1 loss op with same shape as X.
 InsideWeight : (Tensor, default Tensor<float>) A tensor with rank at least 2. This input is optional and should have the same shape as X. If provided, the result of (X - Y) will be multiplied by this tensor element by element.
 OutsideWeight : (Tensor, default Tensor<float>) A tensor with rank at least 2. This input is optional and should have the same shape as X. If provided, the output smooth l1 loss will be multiplied by this tensor element by element.
Outputs:  Diff (Intermediate) : Intermediate variable to cache InsideWeight * (X - Y).
 Out : (Tensor, default Tensor<float>) A tensor with rank be 2. The output smooth l1 loss with shape [batch_size, 1].
Attributes:  sigma (Duplicable): Hyper parameter of smooth l1 loss op. A float scalar with default value 3.0.
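A minimal numpy sketch of the per-instance computation described above (`smooth_l1_loss` is a stand-in name; InsideWeight/OutsideWeight are omitted for brevity):

```python
import numpy as np

def smooth_l1_loss(x, y, sigma=3.0):
    # Element-wise smooth L1, then summed per instance -> shape [batch_size, 1].
    d = np.abs(x - y)
    elem = np.where(d < 1.0 / sigma ** 2,
                    0.5 * (sigma * d) ** 2,
                    d - 0.5 / sigma ** 2)
    return elem.sum(axis=1, keepdims=True)
```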
beam_search
This is a beam search operator that helps generate sequences.
Inputs:  pre_ids : ids in previous step
 ids : a LoDTensor of shape [None, k]
 scores : a LoDTensor that has the same shape and LoD with `ids`
Outputs:  selected_ids : a LoDTensor that stores the IDs selected by beam search
 selected_scores : a LoDTensor that has the same shape and LoD with `selected_ids`
Attributes:  level (Duplicable): the level of LoDTensor
 beam_size (Duplicable): beam size for beam search
 end_id (Duplicable): the token id which indicates the end of a sequence
sum
Sum operator.
This operator sums the input tensors. All the inputs can carry the LoD (Level of Details) information. However, the output only shares the LoD information with the first input.
Inputs:  X (Duplicable) : (vector<Tensor>) The input tensors of sum operator.
Outputs:  Out : (Tensor) The output tensor of sum operator.
concat
Concat Operator.
Concatenate the input tensors along dimension axis. Examples: Input[0] = [[1,2],[3,4]] Input[1] = [[5,6]] axis = 0 Output = [[1,2], [3,4], [5,6]]
Inputs:  X (Duplicable) : Input tensors of concat operator.
Outputs:  Out : Output tensor of concat operator.
Attributes:  axis (Duplicable): The axis along which the input tensors will be concatenated.
softmax_with_cross_entropy
Softmax With Cross Entropy Operator.
Cross-entropy loss with softmax is used extensively as the output layer. This operator computes the softmax normalized values for each row of the input tensor, after which cross-entropy loss is computed. This provides a more numerically stable gradient.
Because this operator performs a softmax on logits internally, it expects unscaled logits. This operator should not be used with the output of softmax operator since that would produce incorrect results.
When the attribute soft_label is set to false, this operator expects mutually exclusive hard labels: each sample in a batch is in exactly one class with a probability of 1.0. Each sample in the batch will have a single label.
The equation is as follows:
1) Hard label (one-hot label, so every sample has exactly one class)
$$Loss_j = -\text{Logit}_{Label_j} + \log\left(\sum_{i=0}^{K}\exp(\text{Logit}_i)\right), j = 1,..., K$$
2) Soft label (each sample can have a distribution over all classes)
$$Loss_j = -\sum_{i=0}^{K}\text{Label}_i \left(\text{Logit}_i - \log\left(\sum_{i=0}^{K}\exp(\text{Logit}_i)\right)\right), j = 1,...,K$$
Inputs:  Logits : (Tensor, default: Tensor<float>), The unscaled log probabilities which is a 2D tensor with shape [N x K]. N is the batch_size, and K is the class number.
 Label : (Tensor) The ground truth which is a 2D tensor. If soft_label is set to false, Label is a Tensor<int64> with shape [N x 1]. If soft_label is set to true, Label is a Tensor<float/double> with shape [N x K].
Outputs:  Softmax (Intermediate) : (Tensor, default: Tensor<float>), A 2D tensor with shape [N x K]. The outputs value of softmax activation by given the input batch, which will be used in backward calculation.
 Loss : (Tensor, default: Tensor<float>), A 2D tensor. The cross entropy loss with shape [N x 1].
Attributes:  soft_label (Duplicable): (bool, default: false), A flag to indicate whether to interpret the given labels as soft labels.
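The two loss forms above can be sketched in numpy (a minimal illustration; `softmax_with_cross_entropy` is a stand-in name, not the operator's API):

```python
import numpy as np

def softmax_with_cross_entropy(logits, label, soft_label=False):
    # Shift by the row max for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_z = np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    log_prob = shifted - log_z                  # log softmax
    softmax = np.exp(log_prob)
    if soft_label:                              # Label: [N x K] distribution
        loss = -(label * log_prob).sum(axis=1, keepdims=True)
    else:                                       # Label: [N x 1] class indices
        loss = -log_prob[np.arange(logits.shape[0]), label.ravel()].reshape(-1, 1)
    return softmax, loss
```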
fill_constant_batch_size_like
FillConstantBatchSizeLike Operator.
Fill up a variable with specified constant value.
Inputs:  Input : (Tensor) Tensor whose dim_idx th dimension is used to specify the batch_size
Outputs:  Out : (Tensor) Tensor of specified shape will be filled with the specified value
Attributes:  dtype (Duplicable): (int, default 5 (FP32)) Output data type
 shape (Duplicable): (vector<int>) The shape of the output
 input_dim_idx (Duplicable): (int, default 0) The index of input's batch size dimension
 output_dim_idx (Duplicable): (int, default 0) The index of output's batch size dimension
 value (Duplicable): (float, default 0) The value to be filled
adadelta
Adadelta Optimizer.
Adadelta optimizer is implemented as explained in: https://arxiv.org/abs/1212.5701 Adadelta is a per-dimension adaptive learning rate method used for gradient descent.
Adadelta updates are as follows:
$$ avg\_squared\_grad\_out = \rho * avg\_squared\_grad + (1 - \rho) * grad * grad \\ param\_update = -\sqrt{\frac{avg\_squared\_update + \epsilon}{avg\_squared\_grad\_out + \epsilon}} * grad \\ avg\_squared\_update\_out = \rho * avg\_squared\_update + (1 - \rho) * {param\_update}^2 \\ param\_out = param + param\_update $$
Inputs:  Param : (Tensor) Input parameter
 Grad : (Tensor) Input gradient
 AvgSquaredGrad : (Tensor) Input average of squared gradient
 AvgSquaredUpdate : (Tensor) Input average of squared parameter updates
Outputs:  ParamOut : (Tensor) Output parameter
 AvgSquaredGradOut : (Tensor) Output average of squared gradient
 AvgSquaredUpdateOut : (Tensor) Output average of squared parameter updates
Attributes:  rho (Duplicable): (float, default 0.95) Exponential decay rate for squared gradients.
 epsilon (Duplicable): (float, default 1.0e-6) Constant for numerical stability
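The update formulas above can be sketched in numpy (a minimal illustration; `adadelta_step` is a stand-in name, not the operator's API):

```python
import numpy as np

def adadelta_step(param, grad, avg_sq_grad, avg_sq_update, rho=0.95, epsilon=1e-6):
    # Running average of squared gradients.
    avg_sq_grad_out = rho * avg_sq_grad + (1 - rho) * grad * grad
    # Update scaled by the ratio of the two running averages.
    update = -np.sqrt((avg_sq_update + epsilon) / (avg_sq_grad_out + epsilon)) * grad
    avg_sq_update_out = rho * avg_sq_update + (1 - rho) * update ** 2
    return param + update, avg_sq_grad_out, avg_sq_update_out
```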
log
Log Activation Operator.
$y = \ln(x)$
Natural logarithm of x.
Inputs:  X : Input of Log operator
Outputs:  Y : Output of Log operator
conv3d_cudnn
Convolution3D Operator.
The convolution operation calculates the output based on the input, filter and strides, paddings, dilations, groups parameters. The size of each dimension of the parameters is checked in the infershape. Input(Input) and output(Output) are in NCDHW format, where N is batch size, C is the number of channels, D is the depth of the feature, H is the height of the feature, and W is the width of the feature. Filters(Input) is in MCDHW format, where M is the number of output image channels, C is the number of input image channels, D is the depth of the filter, H is the height of the filter, and W is the width of the filter. Parameters(strides, paddings, dilations) are three elements. These three elements represent depth, height and width, respectively. The input(X) size and output(Out) size may be different.
Example: Input: Input shape: $(N, C_{in}, D_{in}, H_{in}, W_{in})$ Filter shape: $(C_{out}, C_{in}, D_f, H_f, W_f)$ Output: Output shape: $(N, C_{out}, D_{out}, H_{out}, W_{out})$ Where $$ D_{out}= \frac{(D_{in} + 2 * paddings[0]  (dilations[0] * (D_f  1) + 1))}{ strides[0]}+ 1 \\ H_{out}= \frac{(H_{in} + 2 * paddings[1]  (dilations[1] * (H_f  1) + 1))}{ strides[1]}+ 1 \\ W_{out}= \frac{(W_{in} + 2 * paddings[2]  (dilations[2] * (W_f  1) + 1))}{ strides[2]}+ 1 $$
Inputs:  Input : (Tensor) The input tensor of convolution operator. The format of input tensor is NCDHW. Where N is batch size, C is the number of channels, D is the depth of the feature, H is the height of the feature, and W is the width of the feature.
 Filter : (Tensor) The filter tensor of convolution operator. The format of the filter tensor is MCDHW, where M is the number of output image channels, C is the number of input image channels, D is the depth of the filter, H is the height of the filter, and W is the width of the filter. If the groups attribute is greater than 1, C equals the number of input image channels divided by the groups.
Outputs:  Output : (Tensor) The output tensor of convolution operator. The format of output tensor is also NCDHW.
Attributes:  strides (Duplicable): (vector<int>, default:{1, 1, 1}), the strides(d_stride, h_stride, w_stride) of convolution operator.
 paddings (Duplicable): (vector<int>, default:{0, 0, 0}), the paddings(d_pad, h_pad, w_pad) of convolution operator.
 groups (Duplicable): (int default:1), the groups number of the convolution operator. According to grouped convolution in Alex Krizhevsky's Deep CNN paper: when group=2, the first half of the filters is only connected to the first half of the input channels, while the second half of the filters is only connected to the second half of the input channels.
 dilations (Duplicable): (vector<int> default:{1, 1, 1}), the dilations(d_dilation, h_dilation, w_dilation) of convolution operator.
 workspace_size_MB (Duplicable): workspace size for cudnn, in MB, workspace is a section of GPU memory which will be allocated/freed each time the operator runs, larger workspace size can increase performance but also requires better hardware. This size should be chosen carefully.
conv3d_transpose_cudnn
Convolution3D Transpose Operator.
The convolution transpose operation calculates the output based on the input, filter and strides, paddings, groups parameters. The size of each dimension of the parameters is checked in the infershape. Input(Input) and output(Output) are in NCDHW format, where N is batch size, C is the number of channels, D is the depth of the feature, H is the height of the feature, and W is the width of the feature. Filter(Input) is in MCDHW format, where M is the number of input feature channels, C is the number of output feature channels, D is the depth of the filter, H is the height of the filter, and W is the width of the filter. Parameters(strides, paddings) are three elements. These three elements represent depth, height and width, respectively. The input(X) size and output(Out) size may be different.
Example:
Input: Input shape: $(N, C_{in}, D_{in}, H_{in}, W_{in})$ Filter shape: $(C_{in}, C_{out}, D_f, H_f, W_f)$ Output: Output shape: $(N, C_{out}, D_{out}, H_{out}, W_{out})$ Where $$ D_{out} = (D_{in} - 1) * strides[0] - 2 * paddings[0] + D_f \\ H_{out} = (H_{in} - 1) * strides[1] - 2 * paddings[1] + H_f \\ W_{out} = (W_{in} - 1) * strides[2] - 2 * paddings[2] + W_f $$
Inputs:  Input : (Tensor) The input tensor of convolution transpose operator. The format of input tensor is NCDHW, where N is batch size, C is the number of channels, D is the depth of the feature, H is the height of the feature, and W is the width of the feature.
 Filter : (Tensor) The filter tensor of convolution transpose operator. The format of the filter tensor is MCDHW, where M is the number of input feature channels, C is the number of output feature channels, D is the depth of the filter, H is the height of the filter, and W is the width of the filter. We enforce groups number == 1 and padding == 0 in the convolution3d transpose scenario.
Outputs:  Output : (Tensor) The output tensor of convolution transpose operator. The format of output tensor is also NCDHW, where N is batch size, C is the number of channels, D is the depth of the feature, H is the height of the feature, and W is the width of the feature.
Attributes:  strides (Duplicable): (vector<int> default:{1, 1, 1}), the strides{d_stride, h_stride, w_stride} of convolution transpose operator.
 paddings (Duplicable): (vector<int> default:{0, 0, 0}), paddings(d_pad, h_pad, w_pad) of convolution transpose operator.
 dilations (Duplicable): dilations of convolution operator.
 workspace_size_MB (Duplicable): workspace size for cudnn, in MB. The workspace is a section of GPU memory that is allocated/freed each time the operator runs; a larger workspace size can increase performance but also requires better hardware. This size should be set carefully.
cross_entropy
CrossEntropy Operator.
It supports both standard cross-entropy and soft-label cross-entropy loss computation. 1) One-hot cross-entropy: soft_label = false, Label[i, 0] indicates the class index for sample i:
$Y[i] = -\log(X[i, Label[i]])$
2) Soft-label cross-entropy: soft_label = true, Label[i, j] indicates the soft label of class j for sample i:
$Y[i] = -\sum_j{Label[i, j] * \log(X[i, j])}$
Please make sure that in this case the summation of each row of Label equals one.
3) One-hot cross-entropy with vectorized Input(Label): As a special case of 2), when each row of Input(Label) has only one nonzero element (equal to 1), soft-label cross-entropy degenerates to a one-hot cross-entropy with one-hot label representation.
Both the input X and Label can carry the LoD (Level of Details) information, or not. But the output only shares the LoD information with input X.
Inputs:  X : (Tensor, default Tensor<float>), a 2D tensor with shape N x D, where N is the batch size and D is the number of classes. This input is a probability computed by the previous operator, which is almost always the result of a softmax operator.
 Label : (Tensor), the ground truth which is a 2D tensor. When soft_label is set to false, Label is a Tensor<int64> with shape [N x 1]. When soft_label is set to true, Label is a Tensor<float/double> with shape [N x K].
Outputs:  Y : (Tensor, default Tensor<float>), a 2D tensor with shape [N x 1]. The cross entropy loss.
Attributes:  soft_label (Duplicable): (bool, default false), a flag indicating whether to interpret the given labels as soft labels.
matmul
MatMul Operator.
This operator is used to perform (batched) matrix multiplication over the last two dimensions of the input tensors X and Y. If a transpose flag is specified, the last two dimensions of the tensor are transposed. If the tensor is rank-1 of shape [D], then for X it is treated as [1, D] in non-transposed form and as [D, 1] in transposed form, whereas for Y it is the opposite: it is treated as [D, 1] in non-transposed form and as [1, D] in transposed form.
Examples without transpose:  X: [K], Y: [K] => Out: [1]  X: [K], Y: [K, N] => Out: [N]  X: [B, M, K], Y: [K] => Out: [B, M]  X: [M, K], Y: [B, K, N] => Out: [B, M, N]  X: [B, M, K], Y: [B, K, N] => Out: [B, M, N]
The behavior is designed to be similar to the numpy.matmul function. The differences are:  Currently only rank 1 to rank 3 input tensors are supported.  We add transpose_X and transpose_Y flags.
Both the input X and Y can carry the LoD (Level of Details) information, or not. But the output only shares the LoD information with input X.
Inputs:  X : The first input of MatMul op
 Y : The second input of MatMul op
Outputs:  Out : The output of MatMul op
Attributes:  transpose_X (Duplicable): If true, use the transpose of `X`.
 transpose_Y (Duplicable): If true, use the transpose of `Y`.
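The shape rules listed above can be checked directly against numpy.matmul, which this operator is modeled on (note: where numpy returns a 0-d scalar for the vector-vector case, the operator returns a tensor of shape [1]):

```python
import numpy as np

# Shape examples from the list above, verified against numpy.matmul.
B, M, K, N = 2, 3, 4, 5
assert np.matmul(np.ones(K), np.ones(K)).shape == ()            # op: [1]
assert np.matmul(np.ones(K), np.ones((K, N))).shape == (N,)
assert np.matmul(np.ones((B, M, K)), np.ones(K)).shape == (B, M)
assert np.matmul(np.ones((M, K)), np.ones((B, K, N))).shape == (B, M, N)
assert np.matmul(np.ones((B, M, K)), np.ones((B, K, N))).shape == (B, M, N)
```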
brelu
BRelu Activation Operator.
$y = max(min(x, t_{max}), t_{min})$
Inputs:  X : Input of BRelu operator
Outputs:  Y : Output of BRelu operator
Attributes:  t_min (Duplicable): The min marginal value of BRelu
 t_max (Duplicable): The max marginal value of BRelu
crf_decoding
The crf_decoding operator reads the emission feature weights and the transition feature weights learned by the linear_chain_crf operator. It implements the Viterbi algorithm which is a dynamic programming algorithm for finding the most likely sequence of hidden states, called the Viterbi path, that results in a sequence of observed tags.
The output of this operator changes according to whether Input(Label) is given:
 Input(Label) is given:
This happens in training. This operator is used to work together with the chunk_eval operator.
When Input(Label) is given, the crf_decoding operator returns a row vector with shape [N x 1] whose values are fixed to be 0, indicating an incorrect prediction, or 1 indicating a tag is correctly predicted. Such an output is the input to chunk_eval operator.
 Input(Label) is not given:
This is the standard decoding process.
The crf_decoding operator returns a row vector with shape [N x 1] whose values range from 0 to the maximum tag number - 1. Each element indicates an index of a predicted tag.
Inputs:  Emission : (LoDTensor, default: LoDTensor<float>). A LoDTensor with shape [N x D] where N is the size of the minibatch and D is the total tag number. This input is the unscaled emission weight matrix of the linear_chain_crf operator.
 Transition : (Tensor, default: Tensor<float>). A Tensor with shape [(D + 2) x D]. This input is the transition weights learned by the linear_chain_crf operator, denoted as w. The 1st row of w are transition weights for the start mask. The 2nd row of w are transition weights for the end mask. Transition weights between other tags begin from the 3rd row of w. See more details in comments of the linear_chain_crf operator.
 Label : (LoDTensor, LoDTensor<int64_t>). The ground truth with shape [N x 1]. This input is optional. See more details in the operator's comments.
Outputs:  ViterbiPath : (LoDTensor, LoDTensor<int64_t>). The decoding results. What to return changes depending on whether the Input(Label) (the ground truth) is given. See more details in the operator's comment.
clip_by_norm
ClipByNorm Operator.
This operator limits the L2 norm of the input $X$ within $max\_norm$. If the L2 norm of $X$ is less than or equal to $max\_norm$, $Out$ will be the same as $X$. If the L2 norm of $X$ is greater than $max\_norm$, $X$ will be linearly scaled to make the L2 norm of $Out$ equal to $max\_norm$, as shown in the following formula:
$$ Out = \frac{max\_norm * X}{norm(X)}, $$
where $norm(X)$ represents the L2 norm of $X$.
Inputs:  X : (Tensor) The input of clip_by_norm op. The number of dimensions must be between [1, 9].
Outputs:  Out : (Tensor) The output of clip_by_norm op with shape as input(X).
Attributes:  max_norm (Duplicable): (float) The maximum norm value.
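The formula above can be sketched in numpy (a minimal illustration; `clip_by_norm` is a stand-in name, not the operator's API):

```python
import numpy as np

def clip_by_norm(x, max_norm):
    # Rescale only when the L2 norm exceeds max_norm.
    norm = np.sqrt((x ** 2).sum())
    return x if norm <= max_norm else max_norm * x / norm
```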
gather
Gather Operator.
$Out = X[Index]$
Out is obtained by gathering entries of the outermost dimension of X indexed by Index and concatenating them together.
Example:
X = [[1, 2], [3, 4], [5, 6]]
Index = [[1, 2]]
Then:
Out = [[3, 4], [5, 6]]
Inputs:  X : The source input of gather op
 Index : The index input of gather op
Outputs:  Out : The output of gather op
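The example above maps directly onto numpy integer indexing (a minimal illustration of the gather semantics, not the operator's API):

```python
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6]])
Index = np.array([1, 2])
Out = X[Index]  # gather rows of the outermost dimension
```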
pool3d_cudnn
Pool3d Operator.
The pooling3d operation calculates the output based on the input, pooling_type, ksize, strides, and paddings parameters. Input(X) and output(Out) are in NCDHW format, where N is batch size, C is the number of channels, and D, H and W are the depth, height and width of the feature, respectively. Parameters(ksize, strides, paddings) are three elements. These three elements represent depth, height and width, respectively. The input(X) size and output(Out) size may be different.
Example: Input: X shape: $(N, C, D_{in}, H_{in}, W_{in})$ Output: Out shape: $(N, C, D_{out}, H_{out}, W_{out})$ Where $$ D_{out} = \frac{(D_{in}  ksize[0] + 2 * paddings[0])}{strides[0]} + 1 \\ H_{out} = \frac{(H_{in}  ksize[1] + 2 * paddings[1])}{strides[1]} + 1 \\ W_{out} = \frac{(W_{in}  ksize[2] + 2 * paddings[2])}{strides[2]} + 1 $$
Inputs:  X : (Tensor) The input tensor of pooling operator. The format of input tensor is NCDHW, where N is batch size, C is the number of channels, and D, H and W are the depth, height and width of the feature, respectively.
Outputs:  Out : (Tensor) The output tensor of pooling operator. The format of output tensor is also NCDHW, where N is batch size, C is the number of channels, and D, H and W are the depth, height and width of the feature, respectively.
Attributes:  pooling_type (Duplicable): (string) Pooling type, can be "max" for maxpooling and "avg" for averagepooling.
 ksize (Duplicable): (vector<int>) The pooling window size(depth, height, width) of pooling operator. If global_pooling = true, ksize and paddings will be ignored.
 global_pooling (Duplicable): (bool, default false) Whether to use the global pooling. If global_pooling = true, ksize and paddings will be ignored.
 strides (Duplicable): (vector<int>, default {1,1,1}) Strides(depth, height, width) of the pooling operator.
 paddings (Duplicable): (vector<int>, default {0,0,0}), paddings(depth, height, width) of pooling operator. If global_pooling = true, ksize and paddings will be ignored.
crop
Crop Operator.
Crop input into output, as specified by offsets and shape.
There are two ways to set shape: 1. reference input: crop input X into the same shape as reference input. The dimension of reference input should be the same as the dimension of input X. 2. shape list: crop input X into the shape described by a list<int>. The size of the shape list should be the same as the dimension size of input X. The input should be a k-D tensor (k > 0 and k < 7). As an example:
Given:
X = [[0, 1, 2, 0, 0] [0, 3, 4, 0, 0] [0, 0, 0, 0, 0]],
and
offsets = [0, 1],
and
shape = [2, 2],
we get:
Out = [[1, 2], [3, 4]].
Inputs:  X : The input of crop op. The input should be a k-D tensor (k > 0 and k < 7).
 Y : The input used as reference for cropping, which is of the same dimensions as X.
Outputs:  Out : The output of crop op, which is of the same dimensions as X.
Attributes:  offsets (Duplicable): A list<int> describing offsets to be cropped. The size of offsets list should be the same as the dimension size of input X.
 shape (Duplicable): A list<int> describing the shape of output. The size of shape list should be the same as the dimension size of input X.
merge_lod_tensor
Merge the True and False branches of a LoDTensor into a single output, with a mask at a certain lod level. X is used to obtain complete lod information. Please refer to SplitLoDTensorOp.
Inputs:  X : The input LoDTensor, contains complete lod information to construct the output
 Mask : A bool column vector which mask the input
 InTrue : The True branch to be merged
 InFalse : The False branch to be merged
Outputs:  Out : The merged output LoDTensor
Attributes:  level (Duplicable): (int) the specific lod level to rank.
elementwise_mul
Limited Elementwise Mul Operator.
The equation is:
$Out = X \odot Y$
X is a tensor of any dimension and the dimensions of tensor Y must be smaller than or equal to the dimensions of X.
There are two cases for this operator: 1. The shape of Y is same with X; 2. The shape of Y is a subset of X.
For case 2: Y will be broadcasted to match the shape of X and axis should be the starting dimension index for broadcasting Y onto X.
example: shape(X) = (2, 3, 4, 5), shape(Y) = (,) shape(X) = (2, 3, 4, 5), shape(Y) = (5,) shape(X) = (2, 3, 4, 5), shape(Y) = (4, 5) shape(X) = (2, 3, 4, 5), shape(Y) = (3, 4), with axis=1 shape(X) = (2, 3, 4, 5), shape(Y) = (2), with axis=0
Both the input X and Y can carry the LoD (Level of Details) information, or not. But the output only shares the LoD information with input X.
Inputs:  X : (Tensor) The first input tensor of elementwise op
 Y : (Tensor) The second input tensor of elementwise op
Outputs:  Out : The output of elementwise op
Attributes:  axis (Duplicable): (int, default -1) The starting dimension index for broadcasting Y onto X
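The axis-based broadcasting described above can be sketched in numpy (a minimal illustration; `elementwise_mul` is a stand-in name, and axis = -1 is taken to mean "align Y with the trailing dimensions of X"):

```python
import numpy as np

def elementwise_mul(x, y, axis=-1):
    # axis = -1 means y aligns with the trailing dimensions of x.
    if axis == -1:
        axis = x.ndim - y.ndim
    # Append singleton dims so numpy broadcasting starts at `axis`.
    y = y.reshape(y.shape + (1,) * (x.ndim - axis - y.ndim))
    return x * y
```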
rmsprop
Rmsprop Optimizer.
$$ MeanSquareOut = decay * MeanSquare + (1 - decay) * Grad * Grad \\ MomentOut = momentum * Moment + \frac{LearningRate * Grad}{\sqrt{MeanSquareOut + epsilon}} \\ ParamOut = Param - MomentOut $$
The original slides that proposed Rmsprop: Slide 29 of http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
Inputs:  Param : (Tensor, default Tensor<float>) Input parameter value that has to be updated.
 MeanSquare : (Tensor, default Tensor<float>) The mean square value that gets updated.
 LearningRate : (Tensor, default Tensor<float>) The learning rate should be a tensor of size 1.
 Grad : (Tensor, default Tensor<float>) Input gradient of the parameter.
 Moment : (Tensor, default Tensor<float>) The moment that gets updated.
Outputs:  ParamOut : (Tensor) Output updated parameter value.
 MomentOut : (Tensor) Output updated moment.
 MeanSquareOut : (Tensor) Output Mean squared updated value.
Attributes:  epsilon (Duplicable): (float, default 1e-10) Constant for numerical stability.
 decay (Duplicable): (float, default 0.9) Discounting factor for coming gradient.
 momentum (Duplicable): (float, default 0.0) Constant value.
proximal_gd
ProximalGD Operator.
Optimizer that implements the proximal gradient descent algorithm:
$$ prox\_param = param - learning\_rate * grad \\ param = sign(prox\_param) / (1 + learning\_rate * l2) * \max(|prox\_param| - learning\_rate * l1, 0) $$
The paper that proposed Proximal Gradient Descent: (http://papers.nips.cc/paper/3793efficientlearningusingforwardbackwardsplitting.pdf)
Inputs:  Param : (Tensor, default Tensor<float>) Input parameter value that has to be updated.
 Grad : (Tensor, default Tensor<float>) Input gradient of the parameter.
 LearningRate : (Tensor, default Tensor<float>) The learning rate should be a tensor of size 1.
Outputs:  ParamOut : (Tensor) Output updated parameter value.
Attributes:  l1 (Duplicable): (float, default 0.0) L1 regularization strength.
 l2 (Duplicable): (float, default 0.0) L2 regularization strength.
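The two-step update above can be sketched in numpy (a minimal illustration; `proximal_gd_step` is a stand-in name, not the operator's API):

```python
import numpy as np

def proximal_gd_step(param, grad, lr, l1=0.0, l2=0.0):
    prox = param - lr * grad  # plain gradient step
    # Soft-thresholding (l1) followed by shrinkage (l2).
    return np.sign(prox) / (1 + lr * l2) * np.maximum(np.abs(prox) - lr * l1, 0.0)
```

With l1 = l2 = 0 this reduces to plain SGD; a large enough l1 drives small parameters exactly to zero.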
positive_negative_pair
PositiveNegativePairOp can be used to evaluate Learning To Rank (LTR) model performance. Within some context, e.g. the "query", an LTR model generates scores for a list of items, which gives a partial order of the items. PositiveNegativePairOp takes a list of reference rank orders (Input("Label")) and the model-generated scores (Input(Score)) as inputs and counts the pairs that are ranked correctly and incorrectly.
Inputs:  Score : (Tensor, float) Model Score on an item (with respect to QueryID). It's a 2D tensor with shape [batch_size, depth], where the column specified by the attribute "column" is used as item score.
 Label : (Tensor, float) Label of an item (with respect to QueryId). It's a 2D tensor with shape [batch_size, 1].
 QueryID : (Tensor, int64) Query ID that indicates the context. Its shape should be the same as Label.
 AccumulatePositivePair : (float) Optional. The accumulated number of positive pairs over a stream of data. If provided, the output PositivePair will be initialized with this number rather than 0. It will not be modified in place.
 AccumulateNegativePair : (float) Optional. The accumulated number of negative pairs over a stream of data. If provided, the output NegativePair will be initialized with this number rather than 0. It will not be modified in place.
 AccumulateNeutralPair : (float) Optional. The accumulated number of neutral pairs over a stream of data. If provided, the output NeutralPair will be initialized with this number rather than 0. It will not be modified in place.
 Weight : (float) Optional. Weight of current item. If specified, its shape should be the same as Label, and the meaning of the output changes from numbers of pairs to the total sum of pairs' weights. Weight of a pair of items is the average of their weights.
Outputs:  PositivePair : (float) Number of positive pairs, i.e. the pairs of items that are ranked correctly.
 NegativePair : (float) Number of negative pairs, i.e. the pairs of items that are ranked incorrectly.
 NeutralPair : (float) Number of neutral pairs, i.e. the pairs of items that have the same score.
Attributes:  column (Duplicable): (int, default 1) The column position of Score used to rank items in descending order. It must be in the range of [-rank(Score), rank(Score)). If `dim < 0`, the dim to reduce is `rank + dim`. Noting that reducing on the first dim will make the LoD info lost.
log_loss
LogLoss Operator.
Log loss is a loss function used for binary classification. Log Loss quantifies the accuracy of a classifier by penalising false classifications. Minimising the Log Loss is equivalent to maximising the accuracy of the classifier. We define Predicted as the values predicted by our model and Labels as the target ground truth value. Log loss can evaluate how close the predicted values are to the target. The shapes of Predicted and Labels are both [batch_size, 1]. The equation is:
$$ Loss = -Labels * \log(Predicted + \epsilon) - (1 - Labels) * \log(1 - Predicted + \epsilon) $$
Inputs:  Predicted : The input value (Predicted) of Log loss op. Predicted is a 2D tensor with shape [batch_size, 1].
 Labels : The target value (Labels) of Log loss op. Labels is a 2D tensor with shape [batch_size, 1].
Outputs:  Loss : The output tensor with shape [batch_size, 1] which represents the log loss.
Attributes:  epsilon (Duplicable): Epsilon in log loss.
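As a concrete reference for the equation above, here is a minimal NumPy sketch of the log loss computation (the function name `log_loss` and the epsilon default are illustrative, not part of the operator's API):

```python
import numpy as np

def log_loss(predicted, labels, epsilon=1e-7):
    # Element-wise binary log loss; Predicted and Labels have shape [batch_size, 1].
    return -labels * np.log(predicted + epsilon) \
           - (1 - labels) * np.log(1 - predicted + epsilon)

predicted = np.array([[0.9], [0.1]])
labels = np.array([[1.0], [0.0]])
loss = log_loss(predicted, labels)
```

A confident correct prediction (0.9 vs. label 1) yields a small loss of about 0.105, matching -log(0.9).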
mean
Mean Operator.
Out is a scalar which is the mean of all elements in X.
Inputs:  X : The input of mean op
Outputs:  Out : The output of mean op
elementwise_add
Limited Elementwise Add Operator.
The equation is:
$Out = X + Y$
X is a tensor of any dimension and the dimensions of tensor Y must be smaller than or equal to the dimensions of X.
There are two cases for this operator: 1. The shape of Y is same with X; 2. The shape of Y is a subset of X.
For case 2: Y will be broadcasted to match the shape of X and axis should be the starting dimension index for broadcasting Y onto X.
example:
shape(X) = (2, 3, 4, 5), shape(Y) = (,)
shape(X) = (2, 3, 4, 5), shape(Y) = (5,)
shape(X) = (2, 3, 4, 5), shape(Y) = (4, 5)
shape(X) = (2, 3, 4, 5), shape(Y) = (3, 4), with axis=1
shape(X) = (2, 3, 4, 5), shape(Y) = (2), with axis=0
Both the input X and Y can carry the LoD (Level of Details) information, or not. But the output only shares the LoD information with input X.
Inputs:  X : (Tensor) The first input tensor of elementwise op
 Y : (Tensor) The second input tensor of elementwise op
Outputs:  Out : The output of elementwise op
Attributes:  axis (Duplicable): (int, default -1) The starting dimension index for broadcasting Y onto X
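The axis-controlled broadcasting described above can be sketched in NumPy as follows (the helper `elementwise_add` is illustrative; it reshapes Y with singleton dimensions so that it aligns with X starting at `axis`):

```python
import numpy as np

def elementwise_add(x, y, axis=-1):
    # axis = -1 means align y with the trailing dimensions of x.
    if axis == -1:
        axis = x.ndim - y.ndim
    # Pad y's shape with 1s on both sides so it broadcasts against x.
    shape = [1] * axis + list(y.shape) + [1] * (x.ndim - axis - y.ndim)
    return x + y.reshape(shape)

x = np.zeros((2, 3, 4, 5))
y = np.arange(12.0).reshape(3, 4)      # shape (3, 4), broadcast with axis=1
out = elementwise_add(x, y, axis=1)    # y is expanded to shape (1, 3, 4, 1)
```

This reproduces the `shape(Y) = (3, 4), with axis=1` case from the example list.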
fill_zeros_like
FillZerosLike Operator.
Fill up a variable with zeros. The output will have the same size as the input.
Inputs:  X : The input of fillzeroslike op.
Outputs:  Y : The variable will be filled up with zeros.
prelu
PRelu Operator.
The equation is:
$$ f(x) = \begin{cases} \alpha * x, \quad \text{if} \ x < 0 \\ x, \qquad \text{if} \ x >= 0 \end{cases} $$
The input X can carry the LoD (Level of Details) information, or not. And the output shares the LoD information with input X.
Inputs:  X : The input tensor of prelu operator.
 Alpha : The alpha weight of prelu operator.
Outputs:  Out : The output tensor of prelu operator.
fill
Fill operator
Fill a tensor with `value` and `shape`. The type of the tensor is specified by `dtype`.
Inputs: Outputs:  Out : (LoDTensor) The output tensor.
Attributes:  value (Duplicable): The float values of tensor, which are flatten in row major
 shape (Duplicable): The shape of output tensor
 dtype (Duplicable): The data type of output tensor, Default is float
 force_cpu (Duplicable): Whether the output tensor must be at CPU memory or not. Default is false.
sigmoid_cross_entropy_with_logits
SigmoidCrossEntropyWithLogits Operator.
This measures the elementwise probability error in classification tasks in which each class is independent. This can be thought of as predicting labels for a datapoint, where labels are not mutually exclusive. For example, a news article can be about politics, technology or sports at the same time or none of these.
The logistic loss is given as follows:
$$loss = -Labels * \log(\sigma(X)) - (1 - Labels) * \log(1 - \sigma(X))$$
We know that $$\sigma(X) = \frac{1}{1 + \exp(-X)}$$. By substituting this we get:
$$loss = X - X * Labels + \log(1 + \exp(-X))$$
For stability and to prevent overflow of $$\exp(-X)$$ when X < 0, we reformulate the loss as follows:
$$loss = \max(X, 0) - X * Labels + \log(1 + \exp(-|X|))$$
Both the input X and Labels can carry the LoD (Level of Details) information. However the output only shares the LoD with input X.
Inputs:  X : (Tensor, default Tensor<float>), a 2D tensor with shape N x D, where N is the batch size and D is the number of classes. This input is a tensor of logits computed by the previous operator. Logits are unscaled log probabilities given as log(p/(1-p)).
 Label : (Tensor, default Tensor<float>), a 2D tensor of the same type and shape as X. This input is a tensor of probabilistic labels for each logit
Outputs:  Out : (Tensor, default Tensor<float>), a 2D tensor with shape N x D of elementwise logistic losses.
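The numerically stable reformulation above translates directly into NumPy (a minimal sketch; the function name is illustrative):

```python
import numpy as np

def sigmoid_cross_entropy_with_logits(x, labels):
    # Stable form: max(x, 0) - x * labels + log(1 + exp(-|x|)).
    # Using |x| keeps the exp() argument non-positive, so it never overflows.
    return np.maximum(x, 0) - x * labels + np.log1p(np.exp(-np.abs(x)))

x = np.array([[2.0, -3.0]])
labels = np.array([[1.0, 0.0]])
loss = sigmoid_cross_entropy_with_logits(x, labels)
```

The results agree with the naive formula -labels*log(sigmoid(x)) - (1-labels)*log(1-sigmoid(x)), but remain finite for large-magnitude logits.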
modified_huber_loss
Modified Huber Loss Operator.
This operator is used in binary classification problem. The shape of input X and target Y are both [N, 1] and so is the shape of the output loss. Since target Y is not differentiable, calculating gradient for Y is illegal. The formula of modified huber loss is:
$$ L(y, f(x)) = \begin{cases} (\max(0, 1 - yf(x)))^2, \quad \text{if} \ yf(x) \geq -1 \\ -4yf(x), \qquad \text{otherwise} \end{cases} $$
Make sure the values of target label Y are in {0, 1} here. This operator will scale values of Y to {-1, +1} when computing losses and gradients.
Inputs:  X : The input tensor of modified huber loss op. X is 2D tensor with shape [batch_size, 1].
 Y : The target labels of modified huber loss op. The shape of Y is the same as X. Values of Y must be 0 or 1.
Outputs:  IntermediateVal (Intermediate) : Variable to save intermediate result which will be reused in backward processing.
 Out : Classification loss for X.
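A minimal NumPy sketch of the piecewise loss above, including the {0, 1} to {-1, +1} rescaling of Y (the function name is illustrative):

```python
import numpy as np

def modified_huber_loss(x, y):
    # Labels y in {0, 1} are rescaled to {-1, +1} before computing the loss.
    y = 2.0 * y - 1.0
    yf = y * x
    # Quadratic branch for yf >= -1, linear branch otherwise.
    return np.where(yf >= -1.0, np.maximum(0.0, 1.0 - yf) ** 2, -4.0 * yf)

x = np.array([[0.5], [-2.0]])   # predictions f(x)
y = np.array([[1.0], [1.0]])    # labels in {0, 1}
loss = modified_huber_loss(x, y)
```

The first sample falls in the quadratic branch ((1 - 0.5)^2 = 0.25); the second, badly misclassified, falls in the linear branch (-4 * (-2) = 8).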
elementwise_sub
Limited Elementwise Sub Operator.
The equation is:
$Out = X - Y$
X is a tensor of any dimension and the dimensions of tensor Y must be smaller than or equal to the dimensions of X.
There are two cases for this operator: 1. The shape of Y is same with X; 2. The shape of Y is a subset of X.
For case 2: Y will be broadcasted to match the shape of X and axis should be the starting dimension index for broadcasting Y onto X.
example:
shape(X) = (2, 3, 4, 5), shape(Y) = (,)
shape(X) = (2, 3, 4, 5), shape(Y) = (5,)
shape(X) = (2, 3, 4, 5), shape(Y) = (4, 5)
shape(X) = (2, 3, 4, 5), shape(Y) = (3, 4), with axis=1
shape(X) = (2, 3, 4, 5), shape(Y) = (2), with axis=0
Both the input X and Y can carry the LoD (Level of Details) information, or not. But the output only shares the LoD information with input X.
Inputs:  X : (Tensor) The first input tensor of elementwise op
 Y : (Tensor) The second input tensor of elementwise op
Outputs:  Out : The output of elementwise op
Attributes:  axis (Duplicable): (int, default -1) The starting dimension index for broadcasting Y onto X
reduce_mean
{ReduceOp} Operator.
This operator computes the mean of input tensor along the given dimension. The result tensor has 1 fewer dimension than the input unless keep_dim is true.
Inputs:  X : (Tensor) The input tensor. Tensors with rank at most 6 are supported.
Outputs:  Out : (Tensor) The result tensor.
Attributes:  dim (Duplicable): (int, default 0) The dimension to reduce. Must be in the range [-rank(input), rank(input)). If `dim < 0`, the dim to reduce is `rank + dim`. Note that reducing on the first dim will make the LoD info lost.
 keep_dim (Duplicable): (bool, default false) If true, retain the reduced dimension with length 1.
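The `dim` and `keep_dim` semantics above map directly onto NumPy (a minimal sketch; the helper name is illustrative):

```python
import numpy as np

def reduce_mean(x, dim=0, keep_dim=False):
    # A negative dim counts from the end, matching `rank + dim`.
    if dim < 0:
        dim += x.ndim
    return x.mean(axis=dim, keepdims=keep_dim)

x = np.array([[1.0, 2.0], [3.0, 4.0]])
out = reduce_mean(x, dim=-1)                  # mean over the last dimension
out_kept = reduce_mean(x, dim=0, keep_dim=True)  # reduced dim retained as length 1
```

With `keep_dim=False` the result drops the reduced axis (shape (2,)); with `keep_dim=True` it stays as a length-1 axis (shape (1, 2)).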
square
Square Activation Operator.
$y = x^2$
Inputs:  X : Input of Square operator
Outputs:  Y : Output of Square operator
reduce_max
{ReduceOp} Operator.
This operator computes the max of input tensor along the given dimension. The result tensor has 1 fewer dimension than the input unless keep_dim is true.
Inputs:  X : (Tensor) The input tensor. Tensors with rank at most 6 are supported.
Outputs:  Out : (Tensor) The result tensor.
Attributes:  dim (Duplicable): (int, default 0) The dimension to reduce. Must be in the range [-rank(input), rank(input)). If `dim < 0`, the dim to reduce is `rank + dim`. Note that reducing on the first dim will make the LoD info lost.
 keep_dim (Duplicable): (bool, default false) If true, retain the reduced dimension with length 1.
logical_or
logical_or Operator
It operates elementwise on X and Y, and returns the Out. X, Y and Out are N-dim boolean tensors. Each element of Out is calculated by $$Out = X \, || \, Y$$
Inputs:  X : (LoDTensor) Left hand operand of logical_or operator
 Y : (LoDTensor) Right hand operand of logical_or operator
Outputs:  Out : (LoDTensor) N-dim bool tensor. Each element is $$Out = X \, || \, Y$$
less_than
less_than Operator
It operates elementwise on X and Y, and returns the Out. Each of them is an N-dim tensor. X and Y could be of any type. Each element of the Out tensor is calculated by Out = X < Y
Inputs:  X : (LoDTensor) the left hand operand of less_than operator
 Y : (LoDTensor) the right hand operand of less_than operator
Outputs:  Out : (LoDTensor) N-dim bool tensor. Each element is Out = X < Y
gru_unit
GRUUnit Operator implements partial calculations of the GRU unit as following:
$$ update \ gate: u_t = actGate(xu_t + W_u * h_{t-1} + b_u) \\ reset \ gate: r_t = actGate(xr_t + W_r * h_{t-1} + b_r) \\ output \ candidate: \tilde{h}_t = actNode(xc_t + W_c * dot(r_t, h_{t-1}) + b_c) \\ output: h_t = dot((1 - u_t), h_{t-1}) + dot(u_t, \tilde{h}_t) $$
which is the same as one time step of the GRU Operator.
@note To implement the complete GRU unit, a fully-connected operator must be used beforehand to feed xu, xr and xc as the Input of the GRUUnit operator.
Inputs:  Input : (Tensor) Matrix with shape [batch_size, frame_size * 3] for the input.
 HiddenPrev : (Tensor) Matrix with shape [batch_size, frame_size] for the states of previous time step.
 Weight : (Tensor) Weight matrix with shape [frame_size, frame_size * 3]. The elements continuous in memory can be divided into two parts. The first part are weights of the update gate and reset gate with shape [frame_size, frame_size * 2], and the second part are weights of output candidate with shape [frame_size, frame_size].
 Bias : (Tensor) Bias vector with shape [1, frame_size * 3] concatenating bias of the update gate, reset gate and output candidate.
Outputs:  Gate (Intermediate) : (Tensor) Matrix with shape [batch_size, frame_size * 3] for the output of update gate, reset gate and output candidate.
 ResetHiddenPrev (Intermediate) : (Tensor) Matrix with shape [batch_size, frame_size] for the reset hidden state of previous time step.
 Hidden : (Tensor) The GRU hidden state of the current time step with shape [batch_size, frame_size].
Attributes:  activation (Duplicable): (enum int, default tanh) The activation type used for output candidate {h}_t.
 gate_activation (Duplicable): (enum int, default sigmoid) The activation type used in update gate and reset gate.
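A minimal NumPy sketch of one GRU unit step following the equations above, with sigmoid gates and a tanh candidate. It assumes the update-gate columns come before the reset-gate columns inside the first 2*frame_size weight block; the exact in-memory ordering is an assumption here, not stated by the doc:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_unit(x, h_prev, weight, bias):
    # x: [batch, 3*frame] -- precomputed xu, xr, xc from a prior FC layer.
    # weight: [frame, 3*frame]; first 2*frame columns hold the update/reset
    # gate weights, the remaining frame columns hold the candidate weights.
    frame = h_prev.shape[1]
    g = x + bias
    w_gates, w_cand = weight[:, :2 * frame], weight[:, 2 * frame:]
    gates = sigmoid(g[:, :2 * frame] + h_prev @ w_gates)
    u, r = gates[:, :frame], gates[:, frame:]          # update, reset gates
    h_cand = np.tanh(g[:, 2 * frame:] + (r * h_prev) @ w_cand)
    # Element-wise interpolation between previous state and candidate.
    return (1.0 - u) * h_prev + u * h_cand

batch, frame = 2, 3
rng = np.random.default_rng(0)
x = rng.standard_normal((batch, 3 * frame))
h_prev = rng.standard_normal((batch, frame))
weight = rng.standard_normal((frame, 3 * frame))
bias = np.zeros((1, 3 * frame))
h = gru_unit(x, h_prev, weight, bias)
```

Since the output is a convex combination of h_prev and a tanh value, each element of h is bounded by max(|h_prev|, 1).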
swish
Swish Activation Operator.
$$y = \frac{x}{1 + e^{-\beta x}}$$
Inputs:  X : Input of Swish operator
Outputs:  Y : Output of Swish operator
Attributes:  beta (Duplicable): Constant beta of swish operator
is_empty
IsEmpty Operator which checks whether a tensor is empty.
It simply returns whether product(tensor.dims()) == 0.
Inputs:  X : (Tensor) Tensor which is to be checked.
Outputs:  Out : (Tensor) a boolean Tensor that indicate empty or not.
sequence_concat
The sequence_concat operator concatenates multiple LoDTensors. It only supports sequence (LoD Tensor with level number is 1) or a nested sequence (LoD tensor with level number is 2) as its input.  Case1: If the axis is other than 0(here, axis is 1 and level is 1), each input should have the same LoD information and the LoD information of the output keeps the same as the input.
LoD(x0) = {{0,2,4}, {0,1,2,3,4}}; Dims(x0) = (4,3,4) LoD(x1) = {{0,2,4}, {0,1,2,3,4}}; Dims(x1) = (4,4,4) LoD(Out) = {{0,2,4}, {0,1,2,3,4}}; Dims(Out) = (4,7,4)
 Case2: If the axis is 0 (here, level is 0), the inputs are concatenated along time steps, and the LoD information of the output needs to be recomputed. The LoD information of level-1 should be the same.
LoD(x0) = {{0,2,4}, {0,1,2,3,4}}; Dims(x0) = (4,3,4) LoD(x1) = {{0,2,4}, {0,1,3,5,7}}; Dims(x1) = (7,3,4) LoD(Out) = {{0,2,4}, {0,2,5,8,11}}; Dims(Out) = (11,3,4)
 Case3: If the axis is 0(here, level is 1).
LoD(x0) = {{0,2,4}, {0,1,2,3,4}}; Dims(x0) = (4,3,4) LoD(x1) = {{0,3,4}, {0,1,3,5,7}}; Dims(x1) = (7,3,4) LoD(Out) = {{0,5,8}, {0,1,2,3,5,7,8,9,11}}; Dims(Out) = (11,3,4)
 Case4: If the LoD number is 1, axis is 0, level is 0
LoD(x0) = {{0,1,2,3,4}}; Dims(x0) = (4,3,4) LoD(x1) = {{0,1,3,5,7}}; Dims(x1) = (7,3,4) LoD(Out) = {{0,2,5,8,11}}; Dims(Out) = (11,3,4)
NOTE: The levels of all the inputs should be the same.
Inputs:  X (Duplicable) : (LodTensorArray) Input is a vector of LoDTensor, each of which is a variable-length sequence or nested sequence.
Outputs:  Out : (LoDTensor), Variable-length output of sequence_concat Op.
Attributes:  axis (Duplicable): (int, default 0) The axis along which the inputs will be joined. If axis is 0, the inputs will be joined with LoD index.
 level (Duplicable): (int, default 0) The level at which the inputs will be joined. If the level is 0, the inputs will be joined at the nested sequence level. If the level is 1, the inputs will be joined at the sequence level. The level should be less than the level number of inputs.
floor
Floor Activation Operator.
$y = floor(x)$
Inputs:  X : Input of Floor operator
Outputs:  Y : Output of Floor operator
cast
Cast Operator.
This Operator casts the input tensor to another data type and returns the output tensor.
Inputs:  X : The input tensor of cast op
Outputs:  Out : The output tensor of cast op
Attributes:  out_dtype (Duplicable): output data type
 in_dtype (Duplicable): input data type
ceil
Ceil Activation Operator.
$y = ceil(x)$
Inputs:  X : Input of Ceil operator
Outputs:  Y : Output of Ceil operator
tanh
Tanh Activation Operator.
$$y = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
Inputs:  X : Input of Tanh operator
Outputs:  Y : Output of Tanh operator
feed
Feed Operator.
It should not be configured by users directly.
Inputs:  X : The input of feed op
Outputs:  Out : The output of feed op
Attributes:  col (Duplicable): (int) The column of feed
rnn_memory_helper
Inputs:  X :
Outputs:  Out :
Attributes:  dtype (Duplicable): (int, default 5 (FP32)) Output data type
unpool
"Input shape: $(N, C_{in}, H_{in}, W_{in})$ Output shape: $(N, C_{out}, H_{out}, W_{out})$ Where <span class="markdownequation" id="equation0"></span> Paper: http://www.matthewzeiler.com/wpcontent/uploads/2017 /07/iccv2011.pdf
Inputs:  X : (Tensor) The input tensor of unpool operator. The format of input tensor is NCHW. Where N is batch size, C is the number of channels, H and W is the height and width of feature.
 Indices : (Tensor) The input tensor of the indices given out by MaxPool2d. The format of input tensor is NCHW. Where N is batch size, C is the number of channels, H and W is the height and width of feature.
Outputs:  Out : (Tensor) The output tensor of unpool operator.The format of output tensor is also NCHW.Where N is batch size, C is the number of channels, H and W is the height and width of feature.
Attributes:  ksize (Duplicable): (vector), the unpooling window size(height, width) of unpooling operator.
 strides (Duplicable): (vector, default:{1, 1}), strides (height, width) of unpooling operator.
 paddings (Duplicable): (vector, default: {0, 0}), paddings (height, width) of unpooling operator.
 unpooling_type (Duplicable): (string), unpooling type, can be "max" for maxunpooling
transpose
Transpose Operator.
The input tensor will be permuted according to the axis values given. The op functions similar to how numpy.transpose works in python. For example:
input = numpy.arange(6).reshape((2, 3))
input: array([[0, 1, 2], [3, 4, 5]])
axis = [1, 0]
output = input.transpose(axis)
output: array([[0, 3], [1, 4], [2, 5]])
So, given an input tensor of shape (N, C, H, W) and axis {0, 2, 3, 1}, the output tensor shape will be (N, H, W, C).
Inputs:  X : (Tensor)The input tensor, tensors with rank at most 6 are supported
Outputs:  Out : (Tensor)The output tensor
Attributes:  axis (Duplicable): (vector<int>) A list of values, and the size of the list should be the same as the input tensor rank; the tensor will permute the axes according to the values given
rnn_memory_helper_grad
Inputs:  Out@GRAD :
 X :
 Out :
Outputs:  X@GRAD :
Attributes:  dtype (Duplicable): (int, default 5 (FP32)) Output data type
momentum
Momentum Optimizer.
This optimizer has a flag for Nesterov Momentum. The update equations are as follows:
$$ velocity = mu * velocity + gradient \\ if (use\_nesterov): \\ param = param - gradient * learning\_rate + mu * velocity * learning\_rate \\ else: \\ param = param - learning\_rate * velocity \\ $$
Inputs:  Param : (Tensor, default Tensor<float>) Input parameter that has to be updated
 Grad : (Tensor, default Tensor<float>) Input gradient of the parameter
 Velocity : (Tensor, default Tensor<float>) Input velocity (corresponding to the parameter) that has to be updated
 LearningRate : (Tensor, default Tensor<float>) Input learning rate
Outputs:  ParamOut : (Tensor) This output is updated parameter. It shared memory with Input(Param).
 VelocityOut : (Tensor) This output is updated velocity. It shared memory with Input(Velocity).
Attributes:  mu (Duplicable): (float) Momentum coefficient
 use_nesterov (Duplicable): (bool, default false) Use Nesterov Momentum
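A minimal NumPy sketch of the update equations above, implemented exactly as written (the function name is illustrative):

```python
import numpy as np

def momentum_update(param, grad, velocity, lr, mu, use_nesterov=False):
    # velocity = mu * velocity + gradient
    velocity = mu * velocity + grad
    if use_nesterov:
        param = param - grad * lr + mu * velocity * lr
    else:
        param = param - lr * velocity
    return param, velocity

param, velocity = np.array([1.0]), np.array([0.0])
grad = np.array([0.1])
param, velocity = momentum_update(param, grad, velocity, lr=0.1, mu=0.9)
```

With mu = 0.9 and zero initial velocity, the first plain-momentum step reduces to vanilla SGD: velocity becomes 0.1 and param moves from 1.0 to 0.99.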
scatter
Scatter Operator.
This operator obtains output by updating the input on selected indices on the first axis:
$$ Out = Ref \\ Out[Index] = Ref[Index] + Updates $$
Inputs:  Ref : The source input of scatter op
 Index : The index input of scatter op where Ref will be updated
 Updates : The update values of scatter op
Outputs:  Out : The output of scatter op
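The equation above (copy Ref, then add Updates at the indexed rows along the first axis) can be sketched in NumPy as follows (the function name is illustrative):

```python
import numpy as np

def scatter(ref, index, updates):
    # Out = Ref; Out[Index] = Ref[Index] + Updates
    out = ref.copy()
    out[index] = ref[index] + updates
    return out

ref = np.zeros((4, 2))
index = np.array([1, 3])
updates = np.array([[1.0, 2.0], [3.0, 4.0]])
out = scatter(ref, index, updates)
```

Only rows 1 and 3 of the output receive the updates; the other rows are passed through from Ref unchanged.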
less_equal
less_equal Operator
It operates elementwise on X and Y, and returns the Out. Each of them is an N-dim tensor. X and Y could be of any type. Each element of the Out tensor is calculated by Out = X <= Y
Inputs:  X : (LoDTensor) the left hand operand of less_equal operator
 Y : (LoDTensor) the right hand operand of less_equal operator
Outputs:  Out : (LoDTensor) N-dim bool tensor. Each element is Out = X <= Y
rank_loss
RankLoss Operator.
RankLoss operator for RankNet (http://icml.cc/2015/wp-content/uploads/2015/06/icml_ranking.pdf). RankNet is a pairwise ranking model with one training sample consisting of a pair of docs A and B, and the label P indicating whether A is ranked higher than B:
P = {0, 1} or {0, 0.5, 1}, where 0.5 means no information about the rank of the input pair.
The RankLoss operator takes three inputs: Left (o_i), Right (o_j) and Label (P_{i,j}), which represent the output score of RankNet for the two docs and the label respectively, and yields the rank loss C_{i,j} using the following equation:
$$ C_{i,j} = -\tilde{P_{i,j}} * o_{i,j} + \log(1 + e^{o_{i,j}}) \\ o_{i,j} = o_i - o_j \\ \tilde{P_{i,j}} = \left \{0, 0.5, 1 \right \} \ or \ \left \{0, 1 \right \} $$
The operator can take batch inputs with size batch_size (batch_size >= 1).
Inputs:  Label : (2D Tensor with shape [batch_size x 1]) The label indicating A ranked higher than B or not.
 Left : (2D Tensor with shape [batch_size x 1]) The output of RankNet for doc A.
 Right : (2D Tensor with shape [batch_size x 1]) The output of RankNet for doc B.
Outputs:  Out : (2D Tensor with shape [batch_size x 1]) The output loss of RankLoss operator.
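The rank loss equation above translates directly into NumPy (a minimal sketch; the function name is illustrative):

```python
import numpy as np

def rank_loss(label, left, right):
    # o_{i,j} = o_i - o_j; loss = -P~ * o + log(1 + exp(o)).
    o = left - right
    return -label * o + np.log1p(np.exp(o))

label = np.array([[1.0]])   # doc A truly ranked higher than doc B
left = np.array([[2.0]])    # RankNet score for doc A
right = np.array([[0.5]])   # RankNet score for doc B
loss = rank_loss(label, left, right)
```

Since the model already ranks A above B (o = 1.5) and the label agrees, the loss is small, about 0.201.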
greater_than
greater_than Operator
It operates elementwise on X and Y, and returns the Out. Each of them is an N-dim tensor. X and Y could be of any type. Each element of the Out tensor is calculated by Out = X > Y
Inputs:  X : (LoDTensor) the left hand operand of greater_than operator
 Y : (LoDTensor) the right hand operand of greater_than operator
Outputs:  Out : (LoDTensor) N-dim bool tensor. Each element is Out = X > Y
equal
equal Operator
It operates elementwise on X and Y, and returns the Out. Each of them is an N-dim tensor. X and Y could be of any type. Each element of the Out tensor is calculated by Out = X == Y
Inputs:  X : (LoDTensor) the left hand operand of equal operator
 Y : (LoDTensor) the right hand operand of equal operator
Outputs:  Out : (LoDTensor) N-dim bool tensor. Each element is Out = X == Y
uniform_random
Uniform random operator.
This operator initializes a tensor with random values sampled from a uniform distribution.
Inputs: Outputs:  Out : (Tensor) The output tensor of uniform random op
Attributes:  shape (Duplicable): (vector<int>) The shape of the output tensor
 min (Duplicable): (float, default -1.0) Minimum value of uniform random
 max (Duplicable): (float, default 1.0) Maximum value of uniform random
 seed (Duplicable): (int, default 0) Random seed used for generating samples. 0 means use a seed generated by the system.
 dtype (Duplicable): (int, default 5(FP32)) Output tensor data type
roi_pool
ROIPool operator
ROI Pooling for Faster R-CNN. The link below is a further introduction: https://stackoverflow.com/questions/43430056/what-is-roi-layer-in-fast-rcnn
Inputs:  X : (Tensor), the input of ROIPoolOp. The format of input tensor is NCHW. Where N is batch size, C is the number of input channels, H is the height of the feature, and W is the width of the feature.
 ROIs : (Tensor), ROIs (Regions of Interest) to pool over. Should be a 2D tensor of shape (num_rois, 5) given as [[batch_id, x1, y1, x2, y2], …], where batch_id is the id of the data, (x1, y1) is the top left coordinates, and (x2, y2) is the bottom right coordinates.
Outputs:  Out : (Tensor), The output of ROIPoolOp is a 4D tensor with shape (num_rois, channels, pooled_h, pooled_w).
 Argmax (Intermediate) : (Tensor), Argmaxes corresponding to indices in X used for gradient computation. Only output if arg “is_test” is false.
Attributes:  spatial_scale (Duplicable): (float, default 1.0), Multiplicative spatial scale factor to translate ROI coords from their input scale to the scale used when pooling.
 pooled_height (Duplicable): (int, default 1), The pooled output height.
 pooled_width (Duplicable): (int, default 1), The pooled output width.
softmax
Softmax Operator.
The input of the softmax operator is a 2D tensor with shape N x K (N is the batch_size, K is the dimension of input feature). The output tensor has the same shape as the input tensor.
For each row of the input tensor, the softmax operator squashes the Kdimensional vector of arbitrary real values to a Kdimensional vector of real values in the range [0, 1] that add up to 1. It computes the exponential of the given dimension and the sum of exponential values of all the other dimensions in the Kdimensional vector input. Then the ratio of the exponential of the given dimension and the sum of exponential values of all the other dimensions is the output of the softmax operator.
For each row $i$ and each column $j$ in Input(X), we have: $$Y[i, j] = \frac{\exp(X[i, j])}{\sum_j \exp(X[i, j])}$$
Inputs:  X : The input tensor of softmax. 2D with shape [batch_size, input_feature_dimensions].
Outputs:  Y : The normalized values with the same shape as X.
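A minimal NumPy sketch of the row-wise softmax above; subtracting the per-row maximum before exponentiating is a standard stability trick and does not change the result:

```python
import numpy as np

def softmax(x):
    # Row-wise softmax over a 2D tensor [batch_size, num_classes].
    e = np.exp(x - x.max(axis=1, keepdims=True))  # keeps exp() from overflowing
    return e / e.sum(axis=1, keepdims=True)

x = np.array([[1.0, 2.0, 3.0]])
y = softmax(x)
```

Each output row sums to 1, and larger logits receive proportionally larger probabilities.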
seq_expand
Seq Expand Operator.
This operator expands input(X) according to LOD of input(Y). Following are cases to better explain how this works: Case 1:
Given a 2-level LoDTensor input(X) X.lod = [[0, 2, 3], [0, 1, 3, 4]] X.data = [a, b, c, d] X.dims = [4, 1] and input(Y) Y.lod = [[0, 2, 4], [0, 3, 6, 7, 8]] with condition len(Y.lod[-1]) - 1 == X.dims[0] then we get a 2-level LoDTensor Out.lod = [[0, 2, 4], [0, 3, 6, 7, 8]] Out.data = [a, a, a, b, b, b, c, d] Out.dims = [8, 1]
Case 2:
Given a 0-level LoDTensor input(X) X.data = [a, b, c] X.lod = NULL X.dims = [3, 1] and input(Y) Y.lod = [[0, 2, 3, 6]] with condition len(Y.lod[-1]) - 1 == X.dims[0] then we get a 1-level LoDTensor Out.lod = [[0, 2, 3, 6]] Out.data = [a, a, b, c, c, c] Out.dims = [6, 1]
Case 3:
Given a 0-level LoDTensor input(X) X.data = [[a, b], [c, d], [e, f]] X.lod = NULL X.dims = [3, 2] and input(Y) Y.lod = [[0, 2, 3, 6]] with condition len(Y.lod[-1]) - 1 == X.dims[0] then we get a 1-level LoDTensor Out.lod = [[0, 2, 3, 6]] Out.data = [[a, b], [a, b], [c, d], [e, f], [e, f], [e, f]] Out.dims = [6, 2]
Case 4:
Given a 2-level LoDTensor input(X) X.lod = [[0, 2, 3], [0, 1, 3, 4]] X.data = [a, b, c, d] X.dims = [4, 1] and input(Y) Y.lod = [[0, 2, 4], [0, 3, 6, 6, 8]] with condition len(Y.lod[-1]) - 1 == X.dims[0] then we get a 2-level LoDTensor Out.lod = [[0, 2, 4], [0, 3, 6, 6, 8]] Out.data = [a, a, a, b, b, b, d, d] Out.dims = [8, 1]
Inputs:  X : (Tensor or LoDTensor) The input(X) of this operator can be a LoDTensor or a base Tensor.
 Y : (LoDTensor) The reference input(Y) of seq_expand op. It must be a LoDTensor with k-level (k > 0). The input(X) will be expanded according to the LoD of input(Y). The element numbers of the last level in input(Y) must be equal to dims[0] of input(X).
Outputs:  Out : (LodTensor) The output of seq_expand op. The lod of the output will be the same as input(Y)'s lod.
sqrt
Sqrt Activation Operator.
$y = \sqrt{x}$
Inputs:  X : Input of Sqrt operator
Outputs:  Y : Output of Sqrt operator
logical_and
logical_and Operator
It operates elementwise on X and Y, and returns the Out. X, Y and Out are N-dim boolean tensors. Each element of Out is calculated by $$Out = X \&\& Y$$
Inputs:  X : (LoDTensor) Left hand operand of logical_and operator
 Y : (LoDTensor) Right hand operand of logical_and operator
Outputs:  Out : (LoDTensor) N-dim bool tensor. Each element is $$Out = X \&\& Y$$
logical_not
logical_not Operator
It operates elementwise on X, and returns the Out. X and Out are N-dim boolean tensors. Each element of Out is calculated by $$Out = !X$$
Inputs:  X : (LoDTensor) Operand of logical_not operator
Outputs:  Out : (LoDTensor) N-dim bool tensor. Each element is $$Out = !X$$
abs
Abs Activation Operator.
$y = |x|$
Inputs:  X : Input of Abs operator
Outputs:  Y : Output of Abs operator
logical_xor
logical_xor Operator
It operates elementwise on X and Y, and returns the Out. X, Y and Out are N-dim boolean tensors. Each element of Out is calculated by $$Out = (X \, || \, Y) \, \&\& \, !(X \&\& Y)$$
Inputs:  X : (LoDTensor) Left hand operand of logical_xor operator
 Y : (LoDTensor) Right hand operand of logical_xor operator
Outputs:  Out : (LoDTensor) N-dim bool tensor. Each element is $$Out = (X \, || \, Y) \, \&\& \, !(X \&\& Y)$$
sequence_slice
Sequence slice operator
The operator crops a subsequence from given sequence with given start offset and subsequence length. It only supports sequence (LoD Tensor with level number is 1).  Case: X = [[a1, a2; b1, b2; c1, c2] [d1, d2; e1, e2]] LoD(X) = {{0, 3, 5}}; Dims(X) = (5, 2) Offset = [[0], [1]]; Length = [[2], [1]]
Out = [[a1, a2; b1, b2] [e1, e2]] LoD(Out) = {{0, 2, 3}}; Dims(Out) = (3, 2)
NOTE: The first dimension size of the input, the size of Offset and the size of Length should be equal. Offsets start from 0.
Inputs:  X : (LoDTensor), the input of SequenceSliceOp.
 Offset : (Tensor), a vector<int> to describe the offset of every input sequence for sub sequence item.
 Length : (Tensor), a vector<int> to describe the length of every input sequence for sub sequence item.
Outputs:  Out : (LoDTensor), the output of SequenceSliceOp.
hinge_loss
HingeLoss Operator.
Let x be a logit (prediction) and y be the actual label. The logit can take any values from (-inf, inf), but the labels should be either -1 or 1. Then, the hinge loss is computed as follows:
$$ L(x, y) = \max(1 - y \cdot x, 0) $$
Note that the labels passed as input will have values as either 0 or 1.
Inputs:  Logits : The input value (Logits) of Hinge loss op.Logits is a 2D tensor with shape [batch_size, 1].
 Labels : The target value (Labels) of Hinge loss op.Labels is a 2D tensor with shape [batch_size, 1].
Outputs:  Loss : The output tensor with shape [batch_size, 1] which represents the hinge loss.
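A minimal NumPy sketch of the hinge loss above, including the rescaling of {0, 1} input labels to the {-1, +1} values the formula expects (the function name is illustrative):

```python
import numpy as np

def hinge_loss(logits, labels):
    # Input labels are {0, 1}; rescale to {-1, +1} before applying the hinge.
    y = 2.0 * labels - 1.0
    return np.maximum(1.0 - y * logits, 0.0)

logits = np.array([[2.0], [0.3]])
labels = np.array([[1.0], [0.0]])
loss = hinge_loss(logits, labels)
```

The first sample is classified correctly with a margin beyond 1, so its loss is 0; the second sample (label 0, positive logit) incurs a loss of 1.3.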
bilinear_tensor_product
Bilinear Tensor Product operator. Given input X and Y, a 3D tensor Weight and a Bias. Each column of the Output is computed by one slice $i = 1, . . . , k$ of the tensor:
$$ M = (X W_i) * Y \\ Out_i = \sum_j {M_j} + Bias_i $$
Where $W_i$ is the $i$th slice of Input(Weight); $M_j$ is the $j$th column of $M$; $Out_i$ is the $i$th column of Output(Out); $Bias_i$ is a column vector, each element of it is equal to the $i$th element of $Bias$;
Inputs:  X : The first input of bilinear_tensor_product operator.
 Y : The second input of bilinear_tensor_product operator.
 Weight : The learnable parameters of bilinear_tensor_product operator.
 Bias : The learnable bias of bilinear_tensor_product operator.
Outputs:  Out : The output of bilinear_tensor_product operator.
lrn
Local Response Normalization Operator.
This operator comes from the paper "ImageNet Classification with Deep Convolutional Neural Networks". The original formula is:
$$ Output(i, x, y) = Input(i, x, y) / \left( k + \alpha \sum\limits^{\min(C, c + n/2)}_{j = \max(0, c - n/2)} (Input(j, x, y))^2 \right)^{\beta} $$
Function implementation:
Inputs and outputs are in NCHW format, while input.shape.ndims() equals 4. Dimensions 0 ~ 3 represent batch size, feature maps, rows, and columns, respectively.
Input and Output in the formula above is for each map(i) of one image, and Input(i, x, y), Output(i, x, y) represents an element in an image.
C is the number of feature maps of one image. n is a hyperparameter configured when operator is initialized. The sum in the denominator is the sum of the same positions in the neighboring maps.
Inputs:  X : (Tensor) The input of LRN operator. It must be a 4D tenor with NCHW format.
Outputs:  Out : (Tensor) The output of LRN operator, which is also the 4D tensor with NCHW format.
 MidOut : (Tensor) Middle result of LRN operator. It's computed in forward process and also used in backward process.
Attributes:  n (Duplicable): (int default 5) n is the "adjacent" kernel that maps at the same spatial position.
 k (Duplicable): (float, default 2.0) k is the bias.
 alpha (Duplicable): (float, default 0.0001) alpha is the scale number.
 beta (Duplicable): (float, default 0.75) beta is the power number.
beam_search_decode
Pack the result of Beam search op into SentenceIds and SentenceScores.
Inputs:  Ids : (LodTensorArray) ids of the candidate words in each step
 Scores : (LodTensorArray)score of the candidate words in each step
Outputs:  SentenceIds : (LodTensor)All possible result sentences of word ids
 SentenceScores : (LodTensor)All possible result sentences of word scores
assign
Assign Operator
Out = X, when type in [LoDTensor/SelectedRows/LoDTensorArray]; raise an error if the type is not listed above.
Inputs:  X : (LoDTensor, SelectedRows or LoDTensorArray) The input variable could be LoDTensor, SelectedRows or LoDTensorArray.
Outputs:  Out : (LoDTensor, SelectedRows or LoDTensorArray) The type of output is the same as input X.
split
Split operator
This operator splits the input tensor into multiple subtensors.
Example: Input = [[1,2], [3,4], [5,6]] sections = [2,1] axis = 0 Output[0] = [[1,2], [3,4]] Output[1] = [[5,6]]
Inputs:  X : (Tensor) Input tensor of the split operator.
Outputs:  Out (Duplicable) : (Tensor) Output tensors of the split operator.
Attributes:  sections (Duplicable): (vector<int>) the length of each output along the specified axis.
 num (Duplicable): (int, default 0)Number of subtensors. This must evenly divide Input.dims()[axis]
 axis (Duplicable): (int, default 0) The axis along which the input will be split.
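The split example above can be reproduced with NumPy (a minimal sketch; `np.split` takes cumulative split points, so `sections = [2, 1]` becomes the split index `[2]`):

```python
import numpy as np

x = np.array([[1, 2], [3, 4], [5, 6]])
# sections = [2, 1] along axis 0: first sub-tensor gets 2 rows, second gets 1.
out0, out1 = np.split(x, [2], axis=0)
```

This matches the documented example: Output[0] = [[1, 2], [3, 4]] and Output[1] = [[5, 6]].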
chunk_eval
For some basics of chunking, please refer to ‘Chunking with Support Vector Machines https://aclanthology.info/pdf/N/N01/N011025.pdf’.
ChunkEvalOp computes the precision, recall, and F1-score of chunk detection, and supports IOB, IOE, IOBES and IO (also known as plain) tagging schemes. Here is an NER example of labeling for these tagging schemes:
Li Ming works at Agricultural Bank of China in Beijing.
IO: I-PER I-PER O O I-ORG I-ORG I-ORG I-ORG O I-LOC
IOB: B-PER I-PER O O B-ORG I-ORG I-ORG I-ORG O B-LOC
IOE: I-PER E-PER O O I-ORG I-ORG I-ORG E-ORG O E-LOC
IOBES: B-PER E-PER O O B-ORG I-ORG I-ORG E-ORG O S-LOC
There are three chunk types (named entity types) including PER (person), ORG (organization) and LOC (location), and we can see that the labels have the form <tag type>-<chunk type>. Since the calculations actually use label ids rather than labels, extra attention should be paid when mapping labels to ids to make ChunkEvalOp work. The key point is that the listed equations are satisfied by ids.
tag_type = label % num_tag_type chunk_type = label / num_tag_type
where num_tag_type is the num of tag types in the tagging scheme, num_chunk_type is the num of chunk types, and tag_type gets its value from the following table:
Scheme Begin Inside End Single
plain 0 - - -
IOB 0 1 - -
IOE - 0 1 -
IOBES 0 1 2 3
Still use NER as example, assuming the tagging scheme is IOB while chunk types are ORG, PER and LOC. To satisfy the above equations, the label map can be like this:
B-ORG 0 I-ORG 1 B-PER 2 I-PER 3 B-LOC 4 I-LOC 5 O 6
It’s not hard to verify the equations noting that the num of chunk types is 3 and the num of tag types in the IOB scheme is 2. For example, the label id of I-LOC is 5, the tag type id of I-LOC is 1, and the chunk type id of I-LOC is 2, which is consistent with the results from the equations.
Inputs:  Inference : (Tensor, default: Tensor<int64_t>). Predictions from the network.
 Label : (Tensor, default: Tensor<int64_t>). The true tag sequences.
Outputs:  Precision : (float). The evaluated precision (called positive predictive value) of chunks on the given minibatch.
 Recall : (float). The evaluated recall (true positive rate or sensitivity) of chunks on the given minibatch.
 F1-Score : (float). The evaluated F1-score on the given minibatch.
Attributes:  num_chunk_types (Duplicable): (int). The number of chunk types. See below for details.
 chunk_scheme (Duplicable): (string, default IOB). The labeling scheme indicating how to encode the chunks. Must be IOB, IOE, IOBES or plain. See below for details.
 excluded_chunk_types (Duplicable): (list<int>) A list including chunk type ids indicating chunk types that are not counted. See below for details.
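The id arithmetic above can be sketched in a few lines. `decode_label` below is a hypothetical helper (not part of the operator), assuming the IOB label map listed above with num_tag_type = 2:

```python
# Hypothetical helper illustrating the label-id equations for the IOB scheme
# (num_tag_type = 2: Begin = 0, Inside = 1) and the example label map above.
def decode_label(label, num_tag_type=2):
    tag_type = label % num_tag_type      # position of the tag within the scheme
    chunk_type = label // num_tag_type   # which chunk type (ORG=0, PER=1, LOC=2)
    return tag_type, chunk_type

# I-LOC has label id 5: tag type 1 (Inside) and chunk type 2 (LOC)
print(decode_label(5))
```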
sigmoid
Sigmoid Activation Operator
$$y = \frac{1}{1 + e^{-x}}$$
Inputs:  X : Input of Sigmoid operator
Outputs:  Y : Output of Sigmoid operator
squared_l2_distance
SquaredL2Distance operator
This operator calculates the squared L2 distance between the input and the target. The number of distance values equals the first dimension of the input. The first dimension of the target can be equal to that of the input, or to 1; if it is 1, the operator broadcasts the target's first dimension to match the input's. During backward propagation, the user can decide whether to calculate the gradient of the input, the target, or both.
Both the input X and Y can carry the LoD (Level of Details) information. However, the output only shares the LoD information with input X.
Inputs:  X : (Tensor) Input of SquaredL2DistanceOp.
 Y : (Tensor) Target of SquaredL2DistanceOp.
Outputs:  sub_result (Intermediate) : (Tensor) Buffered subtraction result, which will be reused in backward computation.
 Out : (Tensor) Squared l2 distance between input and target.
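A minimal numpy sketch of the forward computation, assuming X has shape (N, D) and Y has shape (N, D) or (1, D); the function name is illustrative, not the operator's kernel:

```python
import numpy as np

def squared_l2_distance(x, y):
    sub = x - y                    # the buffered subtraction result (sub_result)
    return (sub ** 2).sum(axis=1)  # one distance value per row of x

x = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([[1.0, 1.0]])         # first dimension 1: broadcast over the batch
out = squared_l2_distance(x, y)
```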
relu
Relu Activation Operator.
$y = max(x, 0)$
Inputs:  X : Input of Relu operator
Outputs:  Y : Output of Relu operator
fetch
Fetch Operator.
It should not be configured by users directly.
Inputs:  X : The input of fetch op
Outputs:  Out : The output of fetch op
Attributes:  col (Duplicable): (int) The column of fetch
while
Inputs:  X (Duplicable) : A set of variables, which are required by operators inside the block of While Op.
 Condition (Duplicable) : (Bool) A scalar. When it's False, the While Op will be terminated.
Outputs:  Out (Duplicable) : A set of variables, which will be assigned with values generated by the operators inside the block of While Op.
 StepScopes : (StepScopeVar) A vector of local scopes, whose size equals the step number of the While Op. The i'th scope stores temporary variables generated in the i'th step.
Attributes:  step_block (Duplicable): The step block inside WhileOp
proximal_adagrad
Proximal Adagrad Optimizer.
Optimizer that implements the proximal adagrad algorithm:
$$ moment = moment + grad * grad \\ prox\_param = param - learning\_rate * grad * (1 / \sqrt{moment}) \\ param = sign(prox\_param) / (1 + learning\_rate * l2) * \max(|prox\_param| - learning\_rate * l1, 0) $$
The paper that proposed Proximal GD: (http://papers.nips.cc/paper/3793-efficient-learning-using-forward-backward-splitting.pdf). Here, we use the Adagrad learning rate as specified in: (http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf)
Inputs:  Param : (Tensor, default Tensor<float>) Input parameter that has to be updated.
 Moment : (Tensor, default Tensor<float>) Moment parameter that has to be updated.
 Grad : (Tensor, default Tensor<float>) Input gradient of the parameter.
 LearningRate : (Tensor, default Tensor<float>) The learning rate should be a tensor of size 1.
Outputs:  ParamOut : (Tensor) Output updated parameter value.
 MomentOut : (Tensor) Output updated moment value.
Attributes:  l1 (Duplicable): (float, default 0.0) L1 regularization strength.
 l2 (Duplicable): (float, default 0.0) L2 regularization strength.
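A numpy sketch of one update step per the equations above (names and shapes are illustrative; lr is a scalar):

```python
import numpy as np

def proximal_adagrad_step(param, moment, grad, lr, l1=0.0, l2=0.0):
    moment = moment + grad * grad                  # accumulate squared gradients
    prox = param - lr * grad / np.sqrt(moment)
    # soft-thresholding handles the L1 term; the (1 + lr * l2) factor the L2 term
    param = np.sign(prox) / (1.0 + lr * l2) * np.maximum(np.abs(prox) - lr * l1, 0.0)
    return param, moment

p, m = proximal_adagrad_step(np.array([1.0]), np.array([0.0]),
                             np.array([1.0]), lr=0.1)
```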
minus
Minus Operator.
Equation:
$Out = X - Y$
Both the input X and Y can carry the LoD (Level of Details) information, or not. But the output only shares the LoD information with input X.
Inputs:  X : The left tensor of minus operator.
 Y : The right tensor of minus operator.
Outputs:  Out : The output tensor of minus operator.
cos_sim
Cosine Similarity Operator.
$Out = X^T * Y / (\sqrt{X^T * X} * \sqrt{Y^T * Y})$
The input X and Y must have the same shape, except that the 1st dimension of input Y could be just 1 (different from input X), which will be broadcasted to match the shape of input X before computing their cosine similarity.
Both the input X and Y can carry the LoD (Level of Details) information, or not. But the output only shares the LoD information with input X.
Inputs:  X : The 1st input of cos_sim op.
 Y : The 2nd input of cos_sim op.
Outputs:  Out : The output of cos_sim op.
 XNorm (Intermediate) : Norm of the first input, reduced along the 1st dimension.
 YNorm (Intermediate) : Norm of the second input, reduced along the 1st dimension.
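A numpy sketch of the forward pass, assuming X is (N, D) and Y is (N, D) or (1, D); this is an illustration, not the operator kernel:

```python
import numpy as np

def cos_sim(x, y):
    x_norm = np.sqrt((x * x).sum(axis=1))   # XNorm, reduced along dimension 1
    y_norm = np.sqrt((y * y).sum(axis=1))   # YNorm
    out = (x * y).sum(axis=1) / (x_norm * y_norm)
    return out, x_norm, y_norm

x = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([[1.0, 0.0]])                  # first dimension 1: broadcast over x
out, xn, yn = cos_sim(x, y)
```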
precision_recall
Precision Recall Operator.
When given Input(Indices) and Input(Labels), this operator can be used to compute various metrics including:
1. macro average precision
2. macro average recall
3. macro f1 score
4. micro average precision
5. micro average recall
6. micro f1 score
To compute the above metrics, we need to do statistics for true positives, false positives and false negatives. Here the count of true negatives is not necessary, but counting it may provide potential usage and the cost is trivial, so the operator also provides the count of true negatives.
We define state as a 2D tensor with shape [class_number, 4]. Each row of a state contains statistic variables for corresponding class. Layout of each row is: TP(true positives), FP(false positives), TN(true negatives), FN(false negatives). If Input(Weights) is provided, TP, FP, TN, FN will be calculated by given weight instead of the instance count.
This operator also supports metrics computing for crossbatch situation. To achieve this, Input(StatesInfo) should be provided. State of current batch data will be accumulated to Input(StatesInfo) and Output(AccumStatesInfo) is the accumulation state.
Output(BatchMetrics) is metrics of current batch data while Output(AccumStatesInfo) is metrics of accumulation data.
Inputs:  MaxProbs : (Tensor, default Tensor<float>) A 2D tensor with shape N x 1, where N is the batch size. Each row contains the maximum probability of an instance, computed by the preceding top_k (k=1) operator.
 Indices : (Tensor, default Tensor<int>) A 2D tensor with shape N x 1, where N is the batch size. Each row contains the corresponding index, computed by the preceding top_k (k=1) operator.
 Labels : (Tensor, default Tensor<int>) A 2D tensor with shape N x 1, where N is the batch size. Each element is a label and the value should be in [0, class_number  1].
 Weights : (Tensor, default Tensor<float>) A 2D tensor with shape N x 1, where N is the batch size. This input is optional. If provided, weight of instance would be considered when computing metrics.
 StatesInfo : (Tensor, default Tensor<int>) A 2D tensor with shape D x 4, where D is the number of classes. This input is optional. If provided, current state will be accumulated to this state and the accumulation state will be the output state.
Outputs:  BatchMetrics : (Tensor, default Tensor<float>) A 1D tensor with shape {6}. This output tensor contains metrics for current batch data. The layout is [macro average precision, macro average recall, macro f1 score, micro average precision, micro average recall, micro f1 score].
 AccumMetrics : (Tensor, default Tensor<float>) A 1D tensor with shape {6}. This output tensor contains metrics for accumulated data. The layout is [macro average precision, macro average recall, macro f1 score, micro average precision, micro average recall, micro f1 score].
 AccumStatesInfo : (Tensor, default Tensor<float>) A 2D tensor with shape D x 4, where D is equal to class number. This output tensor contains accumulated state variables used to compute metrics. The layout for each class is [true positives, false positives, true negatives, false negatives].
Attributes:  class_number (Duplicable): (int) Number of classes to be evaluated.
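To make the macro/micro distinction concrete, here is a sketch of how the six metrics follow from the per-class state described above (assumed row layout [TP, FP, TN, FN]); the helper name is hypothetical:

```python
import numpy as np

def metrics_from_state(states):
    s = np.asarray(states, dtype=float)
    tp, fp, fn = s[:, 0], s[:, 1], s[:, 3]
    # macro metrics: compute per class, then average
    prec = tp / np.maximum(tp + fp, 1e-12)
    rec = tp / np.maximum(tp + fn, 1e-12)
    f1 = 2 * prec * rec / np.maximum(prec + rec, 1e-12)
    # micro metrics: pool the counts over all classes first
    micro_p = tp.sum() / max(tp.sum() + fp.sum(), 1e-12)
    micro_r = tp.sum() / max(tp.sum() + fn.sum(), 1e-12)
    micro_f1 = 2 * micro_p * micro_r / max(micro_p + micro_r, 1e-12)
    return [prec.mean(), rec.mean(), f1.mean(), micro_p, micro_r, micro_f1]

# two classes: [TP, FP, TN, FN] per row
m = metrics_from_state([[1, 1, 0, 0], [2, 0, 0, 2]])
```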
batch_norm
Batch Normalization.
Batch Norm has been implemented as discussed in the paper: https://arxiv.org/pdf/1502.03167.pdf. It can be used as a normalizer function for conv2d and fully_connected operations. The required data format for this layer is one of the following:
1. NHWC: [batch, in_height, in_width, in_channels]
2. NCHW: [batch, in_channels, in_height, in_width]
Inputs:  X : The input tensor
 Scale : Scale is a 1-dimensional tensor of size C that is applied to the output
 Bias : Bias is a 1-dimensional tensor of size C that is applied to the output
 Mean : The global mean (for training) or estimated mean (for testing)
 Variance : The global variance (for training) or estimated Variance (for testing)
Outputs:  Y : result after normalization
 MeanOut : Share memory with Mean. Store the global mean when training
 VarianceOut : Share memory with Variance. Store the global Variance when training
 SavedMean (Intermediate) : Mean of the current mini batch, will apply to output when training
 SavedVariance (Intermediate) : Variance of the current mini batch, will apply to output when training
Attributes:  is_test (Duplicable):
 momentum (Duplicable):
 epsilon (Duplicable):
 tensor_format (Duplicable):
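A numpy sketch of the training-time forward pass for the NCHW format, assuming Scale and Bias are 1-D tensors of size C (an illustration, not the operator kernel):

```python
import numpy as np

np.random.seed(0)

def batch_norm_train(x, scale, bias, eps=1e-5):
    # normalize over N, H, W separately for each channel C
    mean = x.mean(axis=(0, 2, 3), keepdims=True)   # SavedMean per channel
    var = x.var(axis=(0, 2, 3), keepdims=True)     # SavedVariance per channel
    x_hat = (x - mean) / np.sqrt(var + eps)
    return scale.reshape(1, -1, 1, 1) * x_hat + bias.reshape(1, -1, 1, 1)

x = np.random.rand(4, 3, 2, 2)
y = batch_norm_train(x, np.ones(3), np.zeros(3))
# each channel of y is approximately zero-mean and unit-variance
```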
read_from_array
ReadFromArray Operator.
Read a LoDTensor from a LoDTensor Array.
Assume $T$ is LoDTensor, $i$ is the subscript of the array, and $A$ is the array. The equation is
$$T = A[i]$$
Inputs:  X : (TensorArray) the array to be read from.
 I : (Tensor) the subscript index in the tensor array. The number of elements should be 1
Outputs:  Out : (LoDTensor) the tensor read from the array.
softplus
Softplus Activation Operator.
$y = ln(1 + e^{x})$
Inputs:  X : Input of Softplus operator
Outputs:  Y : Output of Softplus operator
accuracy
Accuracy Operator.
It computes the accuracy rate for classification. The accuracy is calculated as follows:
$$accuracy = \frac{NumOfCorrectPredicts}{NumOfAllSamples}$$
Both the input Out and Label can carry the LoD (Level of Details) information, or not. But the output only shares the LoD information with the input Out(Inference).
Inputs:  Out : The network output of topk (inferences)
 Indices : The network output of topk (indices)
 Label : Label of the training data
Outputs:  Accuracy : The accuracy of current batch
 Correct : The correct samples count of current batch
 Total : The samples count of current batch
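A numpy sketch, assuming Indices is the (N, k) output of a top-k operator and Label has shape (N, 1); the helper is illustrative:

```python
import numpy as np

def accuracy(indices, label):
    # a sample counts as correct if its label appears among its top-k indices
    correct = int((indices == label).any(axis=1).sum())
    total = len(label)
    return correct / total, correct, total

# top-2 indices for two samples; only the first sample's label is in its top-2
acc, correct, total = accuracy(np.array([[0, 1], [2, 3]]),
                               np.array([[1], [0]]))
```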
conv_shift
ConvShift Operator.
A layer for circular convolution of two vectors, as used in the Neural Turing Machine: https://arxiv.org/abs/1410.5401
The equation is:
$$Out[i] = \sum_{j=-(N-1)/2}^{(N-1)/2} X_{i+j} * Y_{j}$$
where X's index is computed modulo M, and Y's index is computed modulo N.
Both inputs X and Y can carry LoD (Level of Details) information. However, the output only shares the LoD information with input X.
Inputs:  X : (Tensor, default Tensor<float>), a 2D tensor with shape B x M, where B is the batch size and M is the data dimension.
 Y : (Tensor, default Tensor<float>), a 2D tensor with shape B x N, where B is the batch size and N is the data dimension. N must be odd.
Outputs:  Out : (Tensor, default Tensor<float>), a 2D tensor with shape B x M, i.e., the same shape as X.
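A numpy sketch of the circular convolution above, assuming X is (B, M) and Y is (B, N) with N odd:

```python
import numpy as np

def conv_shift(x, y):
    n = y.shape[1]
    half = (n - 1) // 2
    out = np.zeros_like(x)
    for j in range(-half, half + 1):
        # X's index is taken modulo M via np.roll; Y's index modulo N
        out += np.roll(x, -j, axis=1) * y[:, [j % n]]
    return out

x = np.array([[1.0, 2.0, 3.0, 4.0]])
shifted = conv_shift(x, np.array([[0.0, 1.0, 0.0]]))  # kernel that picks X_{i+1}
```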
nce
Compute and return the noise-contrastive estimation training loss. See "Noise-contrastive estimation: A new estimation principle for unnormalized statistical models". By default this operator uses a uniform distribution for sampling.
Inputs:  Input : (Tensor) A tensor of shape [batch_size, dim].
 Label : (Tensor) A tensor of shape [batch_size, num_true_class]. 'num_true_class' is the number of target classes in each sample. The number of target classes per sample should be the same. If you have a variable number of target classes, you can pad them out to a constant number by either repeating them or by padding with an otherwise unused class.
 Weight : (Tensor) A tensor of shape [num_class, dim]. 'num_class' is the total number of classes.
 Bias : (Tensor) A tensor of shape [num_class, 1]. 'num_class' is the total number of classes. It is a dispensable input.
 SampleWeight : (Tensor) A tensor of shape [batch_size, 1] storing a weight for each sample. It is a dispensable input. The default weight of each sample is 1.
Outputs:  Cost : (Tensor) A tensor of shape [batch_size, 1]. Cost of samples.
 SampleLogits (Intermediate) : An intermediate tensor of shape [batch_size, num_neg_samples + num_pos_samples]. This tensor is the output of the forward kernel and is used in the backward kernel to compute gradients. Given that X is the dot product of the input tensor and the sampled labels' weights, 'SampleLogits' is sigmoid(X).
 SampleLabels (Intermediate) : An intermediate tensor of shape [batch_size, num_neg_samples + num_pos_samples]. This tensor is the output of the forward kernel and is used in the backward kernel to compute gradients.
Attributes:  num_total_classes (Duplicable): Total number of classes in all samples.
 num_neg_samples (Duplicable): The number of negative classes. The default value is 10.
 custom_neg_classes (Duplicable): This attribute is only used in unit tests. Classes in this list will be used as negative classes for every sample. Under normal conditions, users should avoid setting this attribute.
linear_chain_crf
LinearChainCRF Operator.
Conditional Random Field defines an undirected probabilistic graph with nodes denoting random variables and edges denoting dependencies between these variables. CRF learns the conditional probability $P(Y|X)$, where $X = (x_1, x_2, ... , x_n)$ are structured inputs and $Y = (y_1, y_2, ... , y_n)$ are labels for the inputs.
Linear chain CRF is a special case of CRF that is useful for sequence labeling task. Sequence labeling tasks do not assume a lot of conditional independences among inputs. The only constraint they impose is that the input and output must be linear sequences. Thus, the graph of such a CRF is a simple chain or a line, which results in the linear chain CRF.
This operator implements the Forward-Backward algorithm for the linear chain CRF. Please refer to http://www.cs.columbia.edu/~mcollins/fb.pdf and http://cseweb.ucsd.edu/~elkan/250Bwinter2012/loglinearCRFs.pdf for details.
Equation: 1. Denote Input(Emission) to this operator as $x$ here. 2. The first D values of Input(Transition) to this operator are for starting weights, denoted as $a$ here. 3. The next D values of Input(Transition) of this operator are for ending weights, denoted as $b$ here. 4. The remaining values of Input(Transition) are for transition weights, denoted as $w$ here. 5. Denote Input(Label) as $s$ here.
The probability of a sequence $s$ of length $L$ is defined as: $$P(s) = (1/Z) \exp(a_{s_1} + b_{s_L} + \sum_{l=1}^L x_{s_l} + \sum_{l=2}^L w_{s_{l1},s_l})$$
where $Z$ is a normalization value so that the sum of $P(s)$ over all possible sequences is 1, and $x$ is the emission feature weight to the linear chain CRF.
Finally, the linear chain CRF operator outputs the logarithm of the conditional likelihood of each training sample in a minibatch.
NOTE:
1. The feature function for a CRF is made up of the emission features and the transition features. The emission feature weights are NOT computed in this operator. They MUST be computed first before this operator is called.
2. Because this operator performs global normalization over all possible sequences internally, it expects UNSCALED emission feature weights. Please do not call this op with the emission feature being output of any nonlinear activation.
3. The 2nd dimension of Input(Emission) MUST be equal to the tag number.
Inputs:  Emission : (LoDTensor, default LoDTensor<float>) A 2D LoDTensor with shape [N x D], where N is the size of the minibatch and D is the total tag number. The unscaled emission weight matrix for the linear chain CRF.
 Transition : (Tensor, default Tensor<float>) A 2D Tensor with shape [(D + 2) x D]. The learnable parameter for the linear_chain_crf operator. See more details in the operator's comments.
 Label : (LoDTensor, default LoDTensor<int64_t>) A LoDTensor with shape [N x 1], where N is the total element number in a minibatch. The ground truth.
Outputs:  Alpha (Intermediate) : (Tensor, default Tensor<float>) A 2D Tensor with shape [N x D]. The forward vectors for the entire batch. Denote it as $\alpha$. $\alpha$ is a memo table used to calculate the normalization factor in CRF. $\alpha[k, v]$ stores the unnormalized probabilities of all possible unfinished sequences of tags that end at position $k$ with tag $v$. For each $k$, $\alpha[k]$ is a vector of length $D$ with a component for each tag value $v$. This vector is called a forward vector and will also be used in backward computations.
 EmissionExps (Intermediate) : (Tensor, default Tensor<float>) A 2D Tensor with shape [N x D]. The exponentials of Input(Emission). This is an intermediate computational result in forward computation, and will be reused in backward computation.
 TransitionExps (Intermediate) : (Tensor, default Tensor<float>) A 2D Tensor with shape [(D + 2) x D]. The exponentials of Input(Transition). This is an intermediate computational result in forward computation, and will be reused in backward computation.
 LogLikelihood : (Tensor, default Tensor<float>) The logarithm of the conditional likelihood of each training sample in a minibatch. This is a 2D tensor with shape [S x 1], where S is the sequence number in a minibatch. Note: the output is no longer a LoDTensor.
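The unnormalized log-score in the $P(s)$ equation above can be sketched directly, reading $x_{s_l}$ as the emission at position $l$ for tag $s_l$. The helper below is a hypothetical illustration of the scoring term, not the operator's forward-backward kernel:

```python
import numpy as np

def sequence_score(x, a, b, w, s):
    # a_{s_1} + b_{s_L} + sum_l x_{l, s_l} + sum_l w_{s_{l-1}, s_l}
    L = len(s)
    score = a[s[0]] + b[s[-1]] + x[np.arange(L), s].sum()
    score += w[s[:-1], s[1:]].sum()   # transition terms
    return score

x = np.array([[1.0, 2.0], [3.0, 4.0]])  # emissions: L = 2 positions, D = 2 tags
a = np.array([0.5, 0.0])                # starting weights
b = np.array([0.0, 0.5])                # ending weights
w = np.array([[0.0, 1.0], [0.0, 0.0]])  # transition weights
score = sequence_score(x, a, b, w, np.array([0, 1]))
```

Exponentiating this score for every possible tag sequence and normalizing by their sum gives $P(s)$.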

logsigmoid
Logsigmoid Activation Operator
$$y = \log \frac{1}{1 + e^{-x}}$$
Inputs:  X : Input of LogSigmoid operator
Outputs:  Y : Output of LogSigmoid operator
row_conv
Row-convolution Operator.
The row convolution is also called lookahead convolution. This operator was introduced in the following paper for DeepSpeech2: http://www.cs.cmu.edu/~dyogatam/papers/wang+etal.iclrworkshop2016.pdf
The main motivation is that a bidirectional RNN, useful in DeepSpeech-like speech models, learns representation for a sequence by performing a forward and a backward pass through the entire sequence. However, unlike unidirectional RNNs, bidirectional RNNs are challenging to deploy in an online and low-latency setting. The lookahead convolution incorporates information from future subsequences in a computationally efficient manner to improve unidirectional recurrent neural networks. The row convolution operator is different from the 1D sequence convolution, and is computed as follows:
Given an input sequence $in$ of length $t$ and input dimension $d$, and a filter ($W$) of size $context \times d$, the output sequence is convolved as:
$$ out_{i, :} = \sum_{j=i}^{i + context} in_{j,:} \cdot W_{j-i, :} $$
Inputs:  X : (LoDTensor), the input(X) is a LodTensor, which supports variable timelength input sequences. The underlying tensor in this LoDTensor is a matrix with shape (T x N), where T is the total time steps in this minibatch and N is the input data dimension.
 Filter : (Tensor), the input(Filter) is a learnable parameter. It is a 2D tensor with shape (future_context x N), where, future_context is the future context length and N is the data dimension.
Outputs:  Out : (LoDTensor), the output(Out) is a LodTensor, which supports variable timelength input sequences. The underlying tensor in this LodTensor is a matrix with shape T x N, i.e., the same shape as X.
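For a single sequence the lookahead sum can be sketched as follows, assuming X is (T, N) and Filter is (future_context, N); position i accumulates contributions from rows i through i + future_context - 1 (truncated at the sequence end):

```python
import numpy as np

def row_conv(x, filt):
    T = x.shape[0]
    context = filt.shape[0]
    out = np.zeros_like(x)
    for k in range(context):
        # row i gains x[i + k] * filt[k]; positions past T are simply dropped
        out[: T - k] += x[k:] * filt[k]
    return out

x = np.array([[1.0], [2.0], [3.0]])
filt = np.ones((2, 1))   # future_context = 2
out = row_conv(x, filt)
```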
exp
Exp Activation Operator.
$y = e^x$
Inputs:  X : Input of Exp operator
Outputs:  Y : Output of Exp operator
soft_relu
SoftRelu Activation Operator.
$y = \ln(1 + \exp(\max(\min(x, threshold), -threshold)))$
Inputs:  X : Input of SoftRelu operator
Outputs:  Y : Output of SoftRelu operator
Attributes:  threshold (Duplicable): The threshold value of SoftRelu
softshrink
Softshrink Activation Operator.
$$ y = \begin{cases} x - \lambda, & \text{if } x > \lambda \\ x + \lambda, & \text{if } x < -\lambda \\ 0, & \text{otherwise} \end{cases} $$
Inputs:  X : Input of Softshrink operator
Outputs:  Y : Output of Softshrink operator
Attributes:  lambda (Duplicable): nonnegative offset
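The piecewise definition maps directly to a numpy one-liner (an illustrative sketch):

```python
import numpy as np

def softshrink(x, lam):
    # shrink values toward zero by lambda; zero out the band [-lambda, lambda]
    return np.where(x > lam, x - lam, np.where(x < -lam, x + lam, 0.0))

y = softshrink(np.array([-2.0, 0.3, 2.0]), 0.5)
```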
maxout
MaxOut Operator.
Assume the input shape is (N, Ci, H, W) and the output shape is (N, Co, H, W), where $Co = Ci / groups$. The operator formula is as follows:
$$ y_{si+j} = \max_k x_{gsi + sk + j} \\ g = groups \\ s = \frac{input.size}{num\_channels} \\ 0 \le i < \frac{num\_channels}{groups} \\ 0 \le j < s \\ 0 \le k < groups $$
Please refer to Paper:  Maxout Networks: http://www.jmlr.org/proceedings/papers/v28/goodfellow13.pdf  Multidigit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks: https://arxiv.org/pdf/1312.6082v4.pdf
Inputs:  X : (Tensor) The input tensor of maxout operator. The format of input tensor is NCHW, where N is batch size, C is the number of channels, and H and W are the height and width of the feature.
Outputs:  Out : (Tensor) The output tensor of maxout operator. The format of output tensor is also NCHW, where N is batch size, C is the number of channels, and H and W are the height and width of the feature.
Attributes:  groups (Duplicable): Specifies how many groups the input tensor will be split into in the channel dimension. The number of output channels is the number of input channels divided by groups.
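The indexing in the formula amounts to taking a max over each group of `groups` consecutive input channels; a numpy sketch:

```python
import numpy as np

def maxout(x, groups):
    n, c, h, w = x.shape
    # output channel i takes the max over input channels i*groups .. (i+1)*groups - 1
    return x.reshape(n, c // groups, groups, h, w).max(axis=2)

x = np.arange(4.0).reshape(1, 4, 1, 1)   # channel values [0, 1, 2, 3]
y = maxout(x, groups=2)                  # -> channels [max(0,1), max(2,3)]
```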
ftrl
FTRL (Follow The Regularized Leader) Operator.
Optimizer that implements the FTRL algorithm:
$$ new\_accum = squared\_accum + grad^2 \\ if (lr\_power == -0.5) { linear\_accum += grad - (\surd(new\_accum) - \surd(squared\_accum)) / learning\_rate * param \\ } else { linear\_accum += grad - (new\_accum^{-lr\_power} - accum^{-lr\_power}) / learning\_rate * param \\ } x = (l1 * sign(linear\_accum) - linear\_accum) \\ if (lr\_power == -0.5) { y = \frac{\surd(new\_accum)}{learning\_rate} + (2 * l2) \\ pre\_shrink = \frac{x}{y} \\ param = (abs(linear\_accum) > l1).select(pre\_shrink, 0.0) \\ } else { y = \frac{new\_accum^{-lr\_power}}{learning\_rate} + (2 * l2) \\ pre\_shrink = \frac{x}{y} \\ param = (abs(linear\_accum) > l1).select(pre\_shrink, 0.0) \\ } squared\_accum += grad^2; $$
The paper that proposed Follow The Regularized Leader (FTRL): (https://www.eecs.tufts.edu/~dsculley/papers/ad-click-prediction.pdf)
Inputs:  Param : (Tensor, default Tensor<float>) Input parameter value that has to be updated.
 SquaredAccumulator : (Tensor, default Tensor<float>) Accumulator that accumulates squared gradients.
 LinearAccumulator : (Tensor, default Tensor<float>) Accumulator that accumulates linear gradients.
 Grad : (Tensor, default Tensor<float>) Input gradient of the parameter.
 LearningRate : (Tensor, default Tensor<float>) The learning rate should be a tensor of size 1.
Outputs:  ParamOut : (Tensor) Output updated parameter value.
 SquaredAccumOut : (Tensor) Output accumulated squared gradients.
 LinearAccumOut : (Tensor) Output accumulated linear gradients.
Attributes:  l1 (Duplicable): (float, default 0.0) L1 regularization strength.
 l2 (Duplicable): (float, default 0.0) L2 regularization strength.
 lr_power (Duplicable): (float, default -0.5f) Learning Rate Power.
round
Round Activation Operator.
$y = [x]$
Inputs:  X : Input of Round operator
Outputs:  Y : Output of Round operator
softsign
Softsign Activation Operator.
$$y = \frac{x}{1 + |x|}$$
Inputs:  X : Input of Softsign operator
Outputs:  Y : Output of Softsign operator