Operators
sgd
SGD operator
This operator implements one step of the stochastic gradient descent algorithm.
$$param\_out = param - learning\_rate * grad$$
Inputs:  Param : (Tensor) Input parameter
 LearningRate : (Tensor) Learning rate of SGD
 Grad : (Tensor) Input gradient
Outputs:  ParamOut : (Tensor) Output parameter
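The update rule above can be sketched in NumPy (a minimal illustration, not the operator's actual implementation; the names `sgd_step`, `param`, `grad` are illustrative):

```python
import numpy as np

def sgd_step(param, learning_rate, grad):
    # One SGD update: param_out = param - learning_rate * grad
    return param - learning_rate * grad

param = np.array([1.0, 2.0, 3.0])
grad = np.array([0.5, -0.5, 1.0])
param_out = sgd_step(param, 0.1, grad)
```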
print
Print operator
Creates a print op that will print when a tensor is accessed.
Wraps the tensor passed in so that whenever that tensor is accessed, the attribute `message` is printed, along with the current value of the tensor.
Inputs:  In : Input tensor to be displayed.
Outputs:  Out : Output tensor with same data as input tensor.
Attributes:  first_n (Duplicable): Only log `first_n` number of times.
 message (Duplicable): A string message to print as a prefix.
 summarize (Duplicable): Number of elements printed.
 print_tensor_name (Duplicable): Whether to print the tensor name.
 print_tensor_type (Duplicable): Whether to print the tensor's dtype.
 print_tensor_shape (Duplicable): Whether to print the tensor's shape.
 print_tensor_lod (Duplicable): Whether to print the tensor's lod.
 print_phase (Duplicable): (string, default 'BOTH') Which phase to display including 'FORWARD' 'BACKWARD' and 'BOTH'.
adagrad
Adaptive Gradient Algorithm (Adagrad).
The update is done as follows:
$$moment\_out = moment + grad * grad \\ param\_out = param - \frac{learning\_rate * grad}{\sqrt{moment\_out} + \epsilon} $$
The original paper (http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf) does not have the epsilon attribute. It is added here in our implementation, as also proposed at http://cs231n.github.io/neural-networks-3/#ada, for numerical stability to avoid division by zero.
Inputs:  Param : (Tensor) Input parameter
 Grad : (Tensor) Input gradient
 Moment : (Tensor) Second moment
 LearningRate : (Tensor) Learning rate
Outputs:  ParamOut : (Tensor) Output parameter
 MomentOut : (Tensor) Output second moment
Attributes:  epsilon (Duplicable): (float, default 1.0e-6) Constant for numerical stability
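The Adagrad update can be sketched in NumPy (illustrative only; `adagrad_step` and the sample values are assumptions, not the operator's code):

```python
import numpy as np

def adagrad_step(param, grad, moment, learning_rate, epsilon=1e-6):
    # Accumulate squared gradients: moment_out = moment + grad * grad
    moment_out = moment + grad * grad
    # Scale the step per-element by the accumulated history:
    # param_out = param - learning_rate * grad / (sqrt(moment_out) + epsilon)
    param_out = param - learning_rate * grad / (np.sqrt(moment_out) + epsilon)
    return param_out, moment_out

param = np.array([1.0, 2.0])
grad = np.array([0.3, -0.4])
moment = np.zeros(2)
param_out, moment_out = adagrad_step(param, grad, moment, learning_rate=0.1)
```

On the first step the effective update is close to the plain learning rate, since `grad / sqrt(grad * grad)` is just the sign of the gradient.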
max_pool3d_with_index
MaxPool3d Operator.
The maxpooling3d with index operation calculates the output and the mask based on the input and ksize, strides, paddings parameters. Input(X) and output(Out, Mask) are in NCDHW format, where N is batch size, C is the number of channels, and D, H and W are the depth, height and width of the feature, respectively. Parameters(ksize, strides, paddings) are three elements. These three elements represent depth, height and width, respectively. The input(X) size and output(Out, Mask) size may be different.
Example: Input: X shape: $(N, C, D_{in}, H_{in}, W_{in})$ Output: Out shape: $(N, C, D_{out}, H_{out}, W_{out})$ Mask shape: $(N, C, D_{out}, H_{out}, W_{out})$ Where $$ D_{out} = \frac{(D_{in} - ksize[0] + 2 * paddings[0])}{strides[0]} + 1 \\ H_{out} = \frac{(H_{in} - ksize[1] + 2 * paddings[1])}{strides[1]} + 1 \\ W_{out} = \frac{(W_{in} - ksize[2] + 2 * paddings[2])}{strides[2]} + 1 $$
Inputs:  X : (Tensor) The input tensor of pooling operator. The format of input tensor is NCDHW, where N is batch size, C is the number of channels, and D, H and W are the depth, height and width of the image, respectively
Outputs:  Out : (Tensor) The output tensor of pooling operator. The format of output tensor is also NCDHW, where N is the batch size, C is the number of channels, and D, H and W are the depth, height and width of the image, respectively.
 Mask : (Tensor) The Mask tensor of pooling operator. The format of output tensor is also NCDHW, where N is the batch size, C is the number of channels, and D, H and W are the depth, height and width of the image, respectively. It represents the index in the current feature map.
Attributes:  ksize (Duplicable): (vector<int>) The pooling window size(depth, height, width) of pooling operator. If global_pooling = true, ksize and paddings will be ignored.
 global_pooling (Duplicable): (bool, default false) Whether to use the global pooling. If global_pooling = true, ksize and paddings will be ignored.
 strides (Duplicable): (vector<int>, default {1,1,1}), strides(depth, height, width) of pooling operator.
 paddings (Duplicable): (vector<int>, default {0,0,0}), paddings(depth, height, width) of pooling operator. If global_pooling = true, paddings and ksize will be ignored.
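The output-size formula above is easy to check directly (a sketch; `pool_out_size` is an illustrative helper, not part of the API):

```python
def pool_out_size(in_size, ksize, padding, stride):
    # out = (in - ksize + 2 * padding) // stride + 1
    return (in_size - ksize + 2 * padding) // stride + 1

# A (N, C, 8, 8, 8) input pooled with ksize 2, stride 2, padding 0
# halves each of depth, height and width:
out_shape = [pool_out_size(8, 2, 0, 2) for _ in range(3)]
```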
lod_rank_table
Create LoDRankTable by LoDTensor
LoD Rank Table stores the `level` of `lod` which is ordered by sequence length in descending order. It is useful when implementing dynamic RNN, and is shared by dynamic RNN memory, dynamic RNN slice input and dynamic RNN slice output operators.
Inputs:  X : (LoDTensor) input lod tensor, must contain lod information.
Outputs:  Out : (LoDRankTable) The rank table of specific level.
Attributes:  level (Duplicable): (int) the specific lod level to rank.
array_to_lod_tensor
This op builds a big LoDTensor from a std::vector<LoDTensor> and a LoDRankTable. It is supposed to be used in getting dynamic RNN's outputs back to a normal LoDTensor. The std::vector would be the output of the RNN op and the LoDRankTable would be built with the RNN's input.
Inputs:  X : (std::vector<LoDTensor>) A vector of tensors that is going to be cast to a big LoDTensor.
 RankTable : (LoDRankTable) RankTable provides the coarse lod information to build the output LoDTensor. See 'paddle/framework/lod_rank_table.h' for more details.
Outputs:  Out : (LoDTensor) The LoDTensor formed by input tensor array.
sequence_conv
Sequence Conv Operator.
SequenceConvOp performs convolution operation on features of contextLength timesteps of each instance. The convolution operation calculates the output based on the input, filter, strides and paddings parameters. The size of each dimension of the parameters is checked during infershape. In order to ensure the equal length of sequence before and after convolution, it is necessary to fill the top and bottom of each sequence based on context_length, context_stride and context_start.
Inputs:  X : (LoDTensor) the input(X) is a LoDTensor, which supports variable-time length input sequence. The underlying tensor in this LoDTensor is a matrix with shape (T, N), where T is the total time steps in this minibatch and N is the input_hidden_size.
 PaddingData : (Tensor, optional) the input(PaddingData) is an optional parameter, and it is learnable. This is a tensor with shape (P, N), where P is the top_pad + bottom_pad, N is the input_hidden_size. In order to ensure the equal length of sequence before and after convolution, it is necessary to fill the top and bottom of each sequence according to context_length, context_stride and context_start
 Filter : (Tensor) the input(Filter) is a learnable parameter. This is a tensor with shape (K, M), where K is the context_length * input_hidden_size, and M is the output feature size.
Outputs:  Out : (LoDTensor) the output(Out) is a LoDTensor, which supports variable-time length output sequence. The underlying tensor in this LoDTensor is a matrix with shape (T, M), where T is the total time steps in this minibatch and M is the output feature size.
Attributes:  paddingTrainable (Duplicable): (bool, default:false) the padding data of SequenceConvOp is trainable or not.
 contextLength (Duplicable): (int) the contextLength of SequenceConvOp is the height of the convolution kernel.
 contextStart (Duplicable): (int, default:0) the contextStart of SequenceConvOp represents the beginning of the convolution of the number of rows of sequence, which can be negative. The negative number means to pad contextStart timesteps of zeros or learnable parameters at the beginning of each instance. The positive number means to skip contextStart timesteps of each instance.
 contextStride (Duplicable): (int, default:1) the contextStride of SequenceConvOp represents the stride length of the convolution kernel. Currently, SequenceConvOp only supports contextStride=1.
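The context-window construction described above can be sketched in NumPy (an illustration assuming contextStride=1, which is the only supported value; `context_project` is a hypothetical helper name):

```python
import numpy as np

def context_project(seq, context_length, context_start):
    # For each timestep t, concatenate the rows
    # seq[t + context_start : t + context_start + context_length],
    # zero-padding positions that fall outside the sequence.
    T, N = seq.shape
    out = np.zeros((T, context_length * N))
    for t in range(T):
        for j in range(context_length):
            src = t + context_start + j
            if 0 <= src < T:
                out[t, j * N:(j + 1) * N] = seq[src]
    return out

seq = np.arange(8, dtype=float).reshape(4, 2)   # T=4 timesteps, hidden size N=2
ctx = context_project(seq, context_length=3, context_start=-1)
# Each row of ctx then multiplies a (context_length * N, M) Filter
# to produce one output row, so sequence length is preserved.
```

With `paddingTrainable=True`, the zero rows would instead come from the learnable PaddingData tensor.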
lstm
Long-Short Term Memory (LSTM) Operator.
The default implementation is diagonal/peephole connection (https://arxiv.org/pdf/1402.1128.pdf), and the formula is as follows:
$$ i_t = \sigma(W_{ix}x_{t} + W_{ih}h_{t-1} + W_{ic}c_{t-1} + b_i) \\ f_t = \sigma(W_{fx}x_{t} + W_{fh}h_{t-1} + W_{fc}c_{t-1} + b_f) \\ \tilde{c_t} = act_g(W_{cx}x_t + W_{ch}h_{t-1} + b_c) \\ o_t = \sigma(W_{ox}x_{t} + W_{oh}h_{t-1} + W_{oc}c_t + b_o) \\ c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c_t} \\ h_t = o_t \odot act_h(c_t) $$
where the W terms denote weight matrices (e.g. $W_{xi}$ is the matrix of weights from the input gate to the input), and $W_{ic}, W_{fc}, W_{oc}$ are diagonal weight matrices for peephole connections. In our implementation, we use vectors to represent these diagonal weight matrices. The b terms denote bias vectors ($b_i$ is the input gate bias vector), $\sigma$ is the non-linear activation, such as the logistic sigmoid function, and $i, f, o$ and $c$ are the input gate, forget gate, output gate, and cell activation vectors, respectively, all of which have the same size as the cell output activation vector $h$.
The $\odot$ is the element-wise product of the vectors. $act_g$ and $act_h$ are the cell input and cell output activation functions, and $tanh$ is usually used for them. $\tilde{c_t}$ is also called the candidate hidden state, which is computed based on the current input and the previous hidden state.
Set `use_peepholes` to False to disable peephole connection. The formula is omitted here; please refer to the paper http://www.bioinf.jku.at/publications/older/2604.pdf for details. Note that these $W_{xi}x_{t}, W_{xf}x_{t}, W_{xc}x_{t}, W_{xo}x_{t}$ operations on the input $x_{t}$ are NOT included in this operator. Users can choose to use a fully-connected operator before the LSTM operator.
Inputs:  Input : (LoDTensor) the first input is a LoDTensor, which supports variable-time length input sequence. The underlying tensor in this LoDTensor is a matrix with shape (T x 4D), where T is the total time steps in this minibatch and D is the hidden size.
 H0 : (Tensor, optional) the initial hidden state is an optional input. This is a tensor with shape (N x D), where N is the batch size and D is the hidden size.
 C0 : (Tensor, optional) the initial cell state is an optional input. This is a tensor with shape (N x D), where N is the batch size. `H0` and `C0` can be NULL but only at the same time.
 Weight : (Tensor) the learnable hiddenhidden weights.  The shape is (D x 4D), where D is the hidden size.  Weight = {W_ch, W_ih, W_fh, W_oh}
 Bias : (Tensor) the learnable weights, which contains two parts: input-hidden bias weight, and peephole connections weight if `use_peepholes` is set to True. 1. `use_peepholes = False`: the shape is (1 x 4D), Bias = {b_c, b_i, b_f, b_o}. 2. `use_peepholes = True`: the shape is (1 x 7D), Bias = {b_c, b_i, b_f, b_o, W_ic, W_fc, W_oc}.
Outputs:  Hidden : (LoDTensor) the hidden state of LSTM operator. The shape is (T x D), and lod is the same with the `Input`.
 Cell : (LoDTensor) the cell state of LSTM operator. The shape is (T x D), and lod is the same with the `Input`.
 BatchGate (Intermediate) : (LoDTensor) This LoDTensor contains input gate, forget gate and output gate after the nonlinear computation. This LoDTensor has the same shape as the reorganized input, which is also be called batch input. The LoD size is 2. The first LoD is the batch offsets and the second LoD contains the indexes, which denote the position of reorganized sequence in the raw input.
 BatchCellPreAct (Intermediate) : (LoDTensor) This LoDTensor is obtained in the forward and used in the backward.
Attributes:  use_peepholes (Duplicable): (bool, default: True) whether to enable diagonal/peephole connections.
 is_reverse (Duplicable): (bool, default: False) whether to compute reversed LSTM.
 gate_activation (Duplicable): (string, default: sigmoid) The activation for input gate, forget gate and output gate, `sigmoid` by default.
 cell_activation (Duplicable): (string, default: tanh) The activation for cell output, `tanh` by default.
 candidate_activation (Duplicable): (string, default: tanh) The activation for candidate hidden state, `tanh` by default.
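A single timestep of the formula above can be sketched in NumPy (a minimal sketch, not the operator's implementation; `lstm_step` and its arguments are illustrative names, and the input projection `x_proj` stands for the fully-connected step the op expects to be done beforehand):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_proj, h_prev, c_prev, W_h, w_ic, w_fc, w_oc, b):
    # x_proj: x_t already multiplied by the input-hidden weights, shape (4D,)
    # W_h: hidden-hidden weights, shape (D, 4D), gate order {c, i, f, o}
    # matching Weight = {W_ch, W_ih, W_fh, W_oh} in the docs above.
    D = h_prev.shape[0]
    gates = x_proj + h_prev @ W_h + b            # (4D,) pre-activations
    c_tilde = np.tanh(gates[0:D])                # candidate hidden state
    i = sigmoid(gates[D:2*D] + w_ic * c_prev)    # input gate with peephole
    f = sigmoid(gates[2*D:3*D] + w_fc * c_prev)  # forget gate with peephole
    c = f * c_prev + i * c_tilde                 # new cell state
    o = sigmoid(gates[3*D:4*D] + w_oc * c)       # output gate peeks at new cell
    h = o * np.tanh(c)
    return h, c

D = 2
h, c = lstm_step(np.zeros(4 * D), np.zeros(D), np.ones(D),
                 np.zeros((D, 4 * D)), np.zeros(D), np.zeros(D), np.zeros(D),
                 np.zeros(4 * D))
```

With all weights and inputs zero, the gates sit at sigmoid(0) = 0.5, so the cell state is simply halved each step.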
warpctc
An operator integrating the open-source warp-ctc library, which is used in Deep Speech 2: End-to-End Speech Recognition in English and Mandarin, to compute Connectionist Temporal Classification (CTC) loss. It can be aliased as softmax with CTC, since a native softmax activation is integrated into the warp-ctc library to normalize values for each row of the input tensor.
More detail of CTC loss can be found by referring to Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks.
Inputs:  Logits : (LoDTensor, default: LoDTensor<float>), the unscaled probabilities of variable-length sequences, which is a 2-D Tensor with LoD information. Its shape is [Lp, num_classes + 1], where Lp is the sum of all input sequences' lengths and num_classes is the true number of classes (not including the blank label).
 Label : (LoDTensor, default: LoDTensor<int>), the ground truth of variable-length sequence, which is a 2-D Tensor with LoD information. It is of the shape [Lg, 1], where Lg is the sum of all labels' lengths.
Outputs:  WarpCTCGrad (Intermediate) : (Tensor, default: Tensor<float>), a temporary output Tensor to store the gradients of warpctc, which is computed with loss together in one call. It is a 3D Tensor of the shape [max_sequence_length, batch_size, num_classes + 1].
 Loss : (Tensor, default: Tensor<float>), the Connectionist Temporal Classification (CTC) loss, which is a 2D Tensor of the shape [batch_size, 1]
Attributes:  blank (Duplicable): (int, default: 0), the blank label of Connectionist Temporal Classification (CTC) loss, which is in the half-open interval [0, num_classes + 1).
 norm_by_times (Duplicable): (bool, default: false), whether to normalize the gradients by the number of timestep, which is also the sequence's length.
cos_sim
Cosine Similarity Operator.
$$Out = \frac{X^T * Y}{\sqrt{X^T * X} * \sqrt{Y^T * Y}}$$
The input X and Y must have the same shape, except that the 1st dimension of input Y could be just 1 (different from input X), which will be broadcasted to match the shape of input X before computing their cosine similarity.
Both the input X and Y can carry the LoD (Level of Details) information, or not. But the output only shares the LoD information with input X.
Inputs:  X : The 1st input of cos_sim op.
 Y : The 2nd input of cos_sim op.
Outputs:  Out : The output of cos_sim op.
 XNorm (Intermediate) : Norm of the first input, reduced along the 1st dimension.
 YNorm (Intermediate) : Norm of the second input, reduced along the 1st dimension.
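The formula is a row-wise normalized dot product; a NumPy sketch (illustrative only; the `eps` guard against zero norms is an addition for the sketch, not part of the operator's documented behavior):

```python
import numpy as np

def cos_sim(X, Y, eps=1e-12):
    # Row-wise cosine similarity. If Y has a first dimension of 1,
    # it broadcasts against every row of X, as described above.
    x_norm = np.sqrt((X * X).sum(axis=1))   # the XNorm intermediate output
    y_norm = np.sqrt((Y * Y).sum(axis=1))   # the YNorm intermediate output
    dot = (X * Y).sum(axis=1)               # broadcasts when Y has one row
    return dot / (x_norm * y_norm + eps)

X = np.array([[1.0, 0.0], [0.0, 2.0]])
Y = np.array([[1.0, 0.0]])                  # one row, broadcast to both rows of X
out = cos_sim(X, Y)
```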
conv3d
Convolution3D Operator.
The convolution operation calculates the output based on the input, filter and strides, paddings, dilations, groups parameters. The size of each dimension of the parameters is checked in the infershape. Input(Input) and output(Output) are in NCDHW format, where N is batch size, C is the number of channels, D is the depth of the feature, H is the height of the feature, and W is the width of the feature. Filters(Input) is in MCDHW format, where M is the number of output image channels, C is the number of input image channels, D is the depth of the filter, H is the height of the filter, and W is the width of the filter. Parameters(strides, paddings, dilations) are three elements. These three elements represent depth, height and width, respectively. The input(X) size and output(Out) size may be different.
Example: Input: Input shape: $(N, C_{in}, D_{in}, H_{in}, W_{in})$ Filter shape: $(C_{out}, C_{in}, D_f, H_f, W_f)$ Output: Output shape: $(N, C_{out}, D_{out}, H_{out}, W_{out})$ Where $$ D_{out}= \frac{(D_{in} + 2 * paddings[0] - (dilations[0] * (D_f - 1) + 1))}{strides[0]} + 1 \\ H_{out}= \frac{(H_{in} + 2 * paddings[1] - (dilations[1] * (H_f - 1) + 1))}{strides[1]} + 1 \\ W_{out}= \frac{(W_{in} + 2 * paddings[2] - (dilations[2] * (W_f - 1) + 1))}{strides[2]} + 1 $$
Inputs:  Input : (Tensor) The input tensor of convolution operator. The format of input tensor is NCDHW. Where N is batch size, C is the number of channels, D is the depth of the feature, H is the height of the feature, and W is the width of the feature.
 Filter : (Tensor) The filter tensor of convolution operator. The format of the filter tensor is MCDHW, where M is the number of output image channels, C is the number of input image channels, D is the depth of the filter, H is the height of the filter, and W is the width of the filter.If the groups attribute is greater than 1, C equals the number of input image channels divided by the groups.
Outputs:  Output : (Tensor) The output tensor of convolution operator.The format of output tensor is also NCDHW.
Attributes:  strides (Duplicable): (vector<int>, default:{1, 1, 1}), the strides(d_stride, h_stride, w_stride) of convolution operator.
 paddings (Duplicable): (vector<int>, default:{0, 0, 0}), the paddings(d_pad, h_pad, w_pad) of convolution operator.
 groups (Duplicable): (int default:1), the groups number of the convolution operator. According to grouped convolution in Alex Krizhevsky's Deep CNN paper: when group=2, the first half of the filters is only connected to the first half of the input channels, while the second half of the filters is only connected to the second half of the input channels.
 dilations (Duplicable): (vector<int> default:{1, 1, 1}), the dilations(d_dilation, h_dilation, w_dilation) of convolution operator.
 use_cudnn (Duplicable): (bool, default false) Only used in cudnn kernel; requires cuDNN to be installed.
 data_format (Duplicable): (string, default NCHW) Only used in cudnn kernel. An optional string from: "NHWC", "NCHW". Specify the data format of the output data; the input will be transformed automatically.
 workspace_size_MB (Duplicable): Only used in cudnn kernel. Workspace size for cudnn, in MB. The workspace is a section of GPU memory which will be allocated/freed each time the operator runs; a larger workspace size can increase performance but also requires better hardware. This size should be chosen carefully.
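The output-size formula above (shared by the 2-D convolution operators below) can be checked with a small helper (`conv_out_size` is an illustrative name, not part of the API):

```python
def conv_out_size(in_size, filter_size, padding, stride, dilation=1):
    # out = (in + 2*padding - (dilation*(filter - 1) + 1)) // stride + 1
    return (in_size + 2 * padding - (dilation * (filter_size - 1) + 1)) // stride + 1

# A (N, C, 8, 32, 32) input with a 3x3x3 filter, padding 1, stride 1
# keeps its spatial size ("same" convolution):
shape = [conv_out_size(s, 3, 1, 1) for s in (8, 32, 32)]
```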
depthwise_conv2d
Convolution Operator.
The convolution operation calculates the output based on the input, filter and strides, paddings, dilations, groups parameters. The size of each dimension of the parameters is checked in the infershape. Input(Input) and Output(Output) are in NCHW format, where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature. Filters(Input) is in MCHW format, where M is the number of output image channels, C is the number of input image channels, H is the height of the filter, and W is the width of the filter. Parameters(strides, paddings, dilations) are two elements. These two elements represent height and width, respectively. The input(X) size and output(Out) size may be different.
Example: Input: Input shape: $(N, C_{in}, H_{in}, W_{in})$ Filter shape: $(C_{out}, C_{in}, H_f, W_f)$ Output: Output shape: $(N, C_{out}, H_{out}, W_{out})$ Where $$ H_{out}= \frac{(H_{in} + 2 * paddings[0] - (dilations[0] * (H_f - 1) + 1))}{strides[0]} + 1 \\ W_{out}= \frac{(W_{in} + 2 * paddings[1] - (dilations[1] * (W_f - 1) + 1))}{strides[1]} + 1 $$
Inputs:  Input : (Tensor) The input tensor of convolution operator. The format of input tensor is NCHW, where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature.
 Filter : (Tensor) The filter tensor of convolution operator. The format of the filter tensor is MCHW, where M is the number of output image channels, C is the number of input image channels, H is the height of the filter, and W is the width of the filter. If the groups attribute is greater than 1, C equals the number of input image channels divided by the groups.
Outputs:  Output : (Tensor) The output tensor of convolution operator. The format of output tensor is also NCHW.
Attributes:  strides (Duplicable): (vector<int> default:{1, 1}), the strides(h_stride, w_stride) of convolution operator.
 paddings (Duplicable): (vector<int> default:{0, 0}), the paddings(h_pad, w_pad) of convolution operator.
 groups (Duplicable): (int default:1), the groups number of the convolution operator. According to grouped convolution in Alex Krizhevsky's Deep CNN paper: when group=2, the first half of the filters is only connected to the first half of the input channels, while the second half of the filters is only connected to the second half of the input channels.
 dilations (Duplicable): (vector<int> default:{1, 1}), the dilations(h_dilation, w_dilation) of convolution operator.
 use_cudnn (Duplicable): (bool, default false) Only used in cudnn kernel; requires cuDNN to be installed.
 data_format (Duplicable): (string, default NCHW) Only used in cudnn kernel. An optional string from: "NHWC", "NCHW". Specify the data format of the output data; the input will be transformed automatically.
 workspace_size_MB (Duplicable): Only used in cudnn kernel; requires use_cudnn to be set to true. Workspace size for cudnn, in MB. The workspace is a section of GPU memory which will be allocated/freed each time the operator runs; a larger workspace size can increase performance but also requires better hardware. This size should be chosen carefully.
conv2d
Convolution Operator.
The convolution operation calculates the output based on the input, filter and strides, paddings, dilations, groups parameters. The size of each dimension of the parameters is checked in the infershape. Input(Input) and Output(Output) are in NCHW format, where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature. Filters(Input) is in MCHW format, where M is the number of output image channels, C is the number of input image channels, H is the height of the filter, and W is the width of the filter. Parameters(strides, paddings, dilations) are two elements. These two elements represent height and width, respectively. The input(X) size and output(Out) size may be different.
Example: Input: Input shape: $(N, C_{in}, H_{in}, W_{in})$ Filter shape: $(C_{out}, C_{in}, H_f, W_f)$ Output: Output shape: $(N, C_{out}, H_{out}, W_{out})$ Where $$ H_{out}= \frac{(H_{in} + 2 * paddings[0] - (dilations[0] * (H_f - 1) + 1))}{strides[0]} + 1 \\ W_{out}= \frac{(W_{in} + 2 * paddings[1] - (dilations[1] * (W_f - 1) + 1))}{strides[1]} + 1 $$
Inputs:  Input : (Tensor) The input tensor of convolution operator. The format of input tensor is NCHW, where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature.
 Filter : (Tensor) The filter tensor of convolution operator. The format of the filter tensor is MCHW, where M is the number of output image channels, C is the number of input image channels, H is the height of the filter, and W is the width of the filter. If the groups attribute is greater than 1, C equals the number of input image channels divided by the groups.
Outputs:  Output : (Tensor) The output tensor of convolution operator. The format of output tensor is also NCHW.
Attributes:  strides (Duplicable): (vector<int> default:{1, 1}), the strides(h_stride, w_stride) of convolution operator.
 paddings (Duplicable): (vector<int> default:{0, 0}), the paddings(h_pad, w_pad) of convolution operator.
 groups (Duplicable): (int default:1), the groups number of the convolution operator. According to grouped convolution in Alex Krizhevsky's Deep CNN paper: when group=2, the first half of the filters is only connected to the first half of the input channels, while the second half of the filters is only connected to the second half of the input channels.
 dilations (Duplicable): (vector<int> default:{1, 1}), the dilations(h_dilation, w_dilation) of convolution operator.
 use_cudnn (Duplicable): (bool, default false) Only used in cudnn kernel; requires cuDNN to be installed.
 data_format (Duplicable): (string, default NCHW) Only used in cudnn kernel. An optional string from: "NHWC", "NCHW". Specify the data format of the output data; the input will be transformed automatically.
 workspace_size_MB (Duplicable): Only used in cudnn kernel; requires use_cudnn to be set to true. Workspace size for cudnn, in MB. The workspace is a section of GPU memory which will be allocated/freed each time the operator runs; a larger workspace size can increase performance but also requires better hardware. This size should be chosen carefully.
pool3d
Pool3d Operator.
The pooling3d operation calculates the output based on the input, pooling_type, ksize, strides, and paddings parameters. Input(X) and output(Out) are in NCDHW format, where N is batch size, C is the number of channels, and D, H and W are the depth, height and width of the feature, respectively. Parameters(ksize, strides, paddings) are three elements. These three elements represent depth, height and width, respectively. The input(X) size and output(Out) size may be different.
Example: Input: X shape: $(N, C, D_{in}, H_{in}, W_{in})$ Output: Out shape: $(N, C, D_{out}, H_{out}, W_{out})$ Where $$ D_{out} = \frac{(D_{in} - ksize[0] + 2 * paddings[0])}{strides[0]} + 1 \\ H_{out} = \frac{(H_{in} - ksize[1] + 2 * paddings[1])}{strides[1]} + 1 \\ W_{out} = \frac{(W_{in} - ksize[2] + 2 * paddings[2])}{strides[2]} + 1 $$
Inputs:  X : (Tensor) The input tensor of pooling operator. The format of input tensor is NCDHW, where N is batch size, C is the number of channels, and D, H and W are the depth, height and width of the feature, respectively.
Outputs:  Out : (Tensor) The output tensor of pooling operator. The format of output tensor is also NCDHW, where N is batch size, C is the number of channels, and D, H and W are the depth, height and width of the feature, respectively.
Attributes:  pooling_type (Duplicable): (string) Pooling type, can be "max" for maxpooling and "avg" for averagepooling.
 ksize (Duplicable): (vector<int>) The pooling window size(depth, height, width) of pooling operator. If global_pooling = true, ksize and paddings will be ignored.
 global_pooling (Duplicable): (bool, default false) Whether to use the global pooling. If global_pooling = true, ksize and paddings will be ignored.
 strides (Duplicable): (vector<int>, default {1,1,1}) Strides(depth, height, width) of the pooling operator.
 paddings (Duplicable): (vector<int>, default {0,0,0}), paddings(depth, height, width) of pooling operator. If global_pooling = true, ksize and paddings will be ignored.
 use_cudnn (Duplicable): (bool, default false) Only used in cudnn kernel; requires cuDNN to be installed.
 data_format (Duplicable): (string, default NCHW) Only used in cudnn kernel. An optional string from: "NHWC", "NCHW". Specify the data format of the output data; the input will be transformed automatically.
pool2d
Pool2d Operator.
The pooling2d operation calculates the output based on the input, pooling_type and ksize, strides, paddings parameters. Input(X) and output(Out) are in NCHW format, where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature. Parameters(ksize, strides, paddings) are two elements. These two elements represent height and width, respectively. The input(X) size and output(Out) size may be different.
Example:
Input: X shape: $(N, C, H_{in}, W_{in})$ Output: Out shape: $(N, C, H_{out}, W_{out})$ Where $$ H_{out} = \frac{(H_{in} - ksize[0] + 2 * paddings[0])}{strides[0]} + 1 \\ W_{out} = \frac{(W_{in} - ksize[1] + 2 * paddings[1])}{strides[1]} + 1 $$
Inputs:  X : (Tensor) The input tensor of pooling operator. The format of input tensor is NCHW, where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature.
Outputs:  Out : (Tensor) The output tensor of pooling operator. The format of output tensor is also NCHW, where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature.
Attributes:  pooling_type (Duplicable): (string), pooling type, can be "max" for maxpooling and "avg" for averagepooling.
 ksize (Duplicable): (vector<int>) The pooling window size(height, width) of the pooling operator. If global_pooling = true, ksize and paddings will be ignored.
 global_pooling (Duplicable): (bool, default false) Whether to use the global pooling. If global_pooling = true, ksize and paddings will be ignored.
 strides (Duplicable): (vector<int>, default {1, 1}), strides(height, width) of pooling operator.
 paddings (Duplicable): (vector<int>, default {0,0}), paddings(height, width) of pooling operator. If global_pooling = true, paddings and ksize will be ignored.
 use_cudnn (Duplicable): (bool, default false) Only used in cudnn kernel; requires cuDNN to be installed.
 data_format (Duplicable): (string, default NCHW) Only used in cudnn kernel. An optional string from: "NHWC", "NCHW". Specify the data format of the output data; the input will be transformed automatically.
conv3d_transpose
Convolution3D Transpose Operator.
The convolution transpose operation calculates the output based on the input, filter and dilations, strides, paddings, groups parameters. The size of each dimension of the parameters is checked in the infershape. Input(Input) and output(Output) are in NCDHW format, where N is batch size, C is the number of channels, D is the depth of the feature, H is the height of the feature, and W is the width of the feature. Filter(Input) is in MCDHW format, where M is the number of input feature channels, C is the number of output feature channels, D is the depth of the filter, H is the height of the filter, and W is the width of the filter. Parameters(strides, paddings) are three elements. These three elements represent depth, height and width, respectively. The input(X) size and output(Out) size may be different.
Example:
Input: Input shape: $(N, C_{in}, D_{in}, H_{in}, W_{in})$ Filter shape: $(C_{in}, C_{out}, D_f, H_f, W_f)$ Output: Output shape: $(N, C_{out}, D_{out}, H_{out}, W_{out})$ Where $$ D_{out} = (D_{in} - 1) * strides[0] - 2 * paddings[0] + dilations[0] * (D_f - 1) + 1 \\ H_{out} = (H_{in} - 1) * strides[1] - 2 * paddings[1] + dilations[1] * (H_f - 1) + 1 \\ W_{out} = (W_{in} - 1) * strides[2] - 2 * paddings[2] + dilations[2] * (W_f - 1) + 1 $$
Inputs:  Input : (Tensor) The input tensor of convolution transpose operator. The format of input tensor is NCDHW, where N is batch size, C is the number of channels, D is the depth of the feature, H is the height of the feature, and W is the width of the feature.
 Filter : (Tensor) The filter tensor of convolution transpose operator. The format of the filter tensor is MCDHW, where M is the number of input feature channels, C is the number of output feature channels, D is the depth of the filter, H is the height of the filter, and W is the width of the filter. We enforce groups number == 1 and padding == 0 in the convolution3d transpose scenario.
Outputs:  Output : (Tensor) The output tensor of convolution transpose operator. The format of output tensor is also NCDHW, where N is batch size, C is the number of channels, D is the depth of the feature, H is the height of the feature, and W is the width of the feature.
Attributes:  dilations (Duplicable): (vector<int> default:{1, 1, 1}), the dilations(d_dilation,h_dilation, w_dilation) of convolution transpose operator.
 strides (Duplicable): (vector<int> default:{1, 1, 1}), the strides{d_stride, h_stride, w_stride} of convolution transpose operator.
 paddings (Duplicable): (vector<int> default:{0, 0, 0}), paddings(d_pad, h_pad, w_pad) of convolution transpose operator.
 use_cudnn (Duplicable): (bool, default false) Only used in cudnn kernel; requires cuDNN to be installed.
 data_format (Duplicable): (string, default NCHW) Only used in cudnn kernel. An optional string from: "NHWC", "NCHW". Specify the data format of the output data; the input will be transformed automatically.
 workspace_size_MB (Duplicable): Used in cudnn kernel only. Workspace size for cudnn, in MB. The workspace is a section of GPU memory which will be allocated/freed each time the operator runs; a larger workspace size can increase performance but also requires better hardware. This size should be set carefully.
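The transpose formula above inverts the forward convolution's size arithmetic; a quick check (`conv_transpose_out_size` is an illustrative helper, not part of the API):

```python
def conv_transpose_out_size(in_size, filter_size, padding, stride, dilation=1):
    # out = (in - 1)*stride - 2*padding + dilation*(filter - 1) + 1
    return (in_size - 1) * stride - 2 * padding + dilation * (filter_size - 1) + 1

# Upsampling a (N, C, 4, 4, 4) feature with a 2x2x2 filter and stride 2
# doubles each spatial dimension:
shape = [conv_transpose_out_size(4, 2, 0, 2) for _ in range(3)]
```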
parallel_do
ParallelDo Operator.
Inputs:  inputs (Duplicable) :
 parameters (Duplicable) :
 places :
Outputs:  outputs (Duplicable) :
 parallel_scopes :
Attributes:  sub_block (Duplicable):
recurrent
Static Length Recurrent Operator.
The static length recurrent operator can only operate on fixed-size sequence data, i.e. in each mini-batch, the sequence length of all inputs is the same.
Inputs:  inputs (Duplicable) : rnn inputs
 initial_states (Duplicable) : rnn initial states
 parameters (Duplicable) : Parameters are used by the step block as its input. However, the input is not a sequence tensor. At every time step, each operator in the step block uses the parameter directly.
Outputs:  outputs (Duplicable) : The output sequence of RNN. All output sequences must have the same length.
 step_scopes : StepScopes contain all local variables in each time step.
Attributes:  ex_states (Duplicable): The ex-state variable names. The ex-state means the state value in the ex-timestep, i.e. the previous time step. [ex_states, states, initial_states@GRAD] must be in the same order.
 states (Duplicable): The state variable names. [ex_states, states, initial_states@GRAD] must be in the same order.
 sub_block (Duplicable): The step block inside RNN
 reverse (Duplicable): Calculate the RNN reversely or not. By default reverse=False. Assume the input data is [A, B, C, D]. If reverse is False, the computation of the RNN is like:
A        B        C        D
|        |        |        |
v        v        v        v
rnn ---> rnn ---> rnn ---> rnn
|        |        |        |
v        v        v        v
o        o        o        o
If reverse is True, the computation of the RNN is like:
A        B        C        D
|        |        |        |
v        v        v        v
rnn <--- rnn <--- rnn <--- rnn
|        |        |        |
v        v        v        v
o        o        o        o
 is_train (Duplicable):
create_shuffle_reader
CreateShuffleReader Operator A shuffle reader takes another reader as its 'underlying reader' and yields the underlying reader's outputs in a shuffled order.
Inputs:  UnderlyingReader : (ReaderHolder) The underlying reader for creating a shuffle reader.
Outputs:  Out : (ReaderHolder) The created shuffle reader.
Attributes:  buffer_size (Duplicable): The shuffle buffer size.
save
Save operator
This operator will serialize and write a tensor variable to file on disk.
Inputs:  X : (Tensor ) Input tensor to be saved
Outputs: Attributes:  overwrite (Duplicable): (boolean, default true) Overwrite the output file if it exists.
 file_path (Duplicable): (string)The "file_path" where the variable will be saved.
load
Load Operator.
Load operator will load a tensor variable from disk file.
Inputs: Outputs:  Out : (Tensor) The tensor need to be loaded
Attributes:  file_path (Duplicable): (string) Variable will be loaded from "file_path".
load_combine
LoadCombine Operator.
LoadCombine operator loads LoDTensor variables from a file. The file should contain one or more LoDTensors serialized using the SaveCombine operator. The LoadCombine operator applies a deserialization strategy to appropriately load the LoDTensors, and this strategy complements the serialization strategy used in the SaveCombine operator. Hence, the LoadCombine operator is tightly coupled with the SaveCombine operator, and can only deserialize one or more LoDTensors that were saved using the SaveCombine operator.
Inputs: Outputs:  Out (Duplicable) : (vector) The output LoDTensors that will be read from the input file.
Attributes:  file_path (Duplicable): (string) LoDTensors will be loaded from "file_path".
accuracy
Accuracy Operator.
It computes the accuracy rate for classification. The accuracy is calculated as follows:
$$accuracy = \frac{NumOfCorrectPredicts}{NumOfAllSamples}$$
Both the input Out and Label can carry the LoD (Level of Details) information, or not. But the output only shares the LoD information with the input Out(Inference).
Inputs:  Out : The network output of topk (inferences)
 Indices : The network output of topk (indices)
 Label : Label of the training data
Outputs:  Accuracy : The accuracy of current batch
 Correct : The correct samples count of current batch
 Total : The samples count of current batch
hard_sigmoid
HardSigmoid Activation Operator.
Segment-wise linear approximation of sigmoid (https://arxiv.org/abs/1603.00391), which is much faster than sigmoid.
$out = max(0, min(1, slope * x + shift))$
The slope should be positive. The offset can be either positive or negative. The default slope and shift are set according to the above reference. It is recommended to use the defaults for this activation.
Inputs:  X : Input of HardSigmoid operator
Outputs:  Out : Output of HardSigmoid operator
Attributes:  slope (Duplicable): Slope for linear approximation of sigmoid
 offset (Duplicable): Offset for linear approximation of sigmoid
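A minimal NumPy sketch of the formula above, not the operator's actual kernel. The default slope and offset values shown here are assumptions for illustration (the document does not state the defaults):

```python
import numpy as np

def hard_sigmoid(x, slope=0.2, offset=0.5):
    # out = max(0, min(1, slope * x + offset))
    return np.clip(slope * x + offset, 0.0, 1.0)

out = hard_sigmoid(np.array([-10.0, 0.0, 10.0]))
```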
cond
Sample Dependent Conditional Operator.
Given Cond[i] as a 1/0 vector to indicate true/false: Out[i] = subnet_true[i], if Cond[i] == true Out[i] = subnet_false[i], if Cond[i] == false
Inputs:  Cond : The condition, which is a bool vector
 Xs (Duplicable) : Inputs of Subnets
Outputs:  Outs (Duplicable) : Outputs of Cond_Op after merge
 SubScopes : sub scopes for true and false branches
 IndexTensors : Index Tensors contains indices for true/false
max_pool2d_with_index
MaxPool2d Operator.
The maxPooling2d with index operation calculates the output and the mask based on the input, ksize, strides, and paddings parameters. Input(X) and output(Out, Mask) are in NCHW format, where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature. Parameters (ksize, strides, paddings) are two-element vectors; the two elements represent height and width, respectively. The input(X) size and output(Out, Mask) size may be different.
Example: Input: X shape: $(N, C, H_{in}, W_{in})$ Output: Out shape: $(N, C, H_{out}, W_{out})$ Mask shape: $(N, C, H_{out}, W_{out})$ Where $$ H_{out} = \frac{(H_{in} - ksize[0] + 2 * paddings[0])}{strides[0]} + 1 \\ W_{out} = \frac{(W_{in} - ksize[1] + 2 * paddings[1])}{strides[1]} + 1 $$
Inputs:  X : (Tensor) The input tensor of pooling operator. The format of input tensor is NCHW, where N is batch size, C is the number of channels, H is the height of the image, and W is the width of the image.
Outputs:  Out : (Tensor) The output tensor of pooling operator. The format of output tensor is also NCHW, where N is batch size, C is the number of channels, H is the height of the image and W is the width of the image.
 Mask : (Tensor) The Mask tensor of pooling operator.The format of output tensor is also NCHW, where N is batch size, C is the number of channels, H is the height of the image, and W is the width of the image. It represents the index in the current feature map.
Attributes:  ksize (Duplicable): (vector<int>) The pooling window size(height, width) of pooling operator. If global_pooling = true, ksize and paddings will be ignored.
 global_pooling (Duplicable): (bool, default:false) Whether to use the global pooling. If global_pooling = true, ksize and paddings will be ignored.
 strides (Duplicable): (vector<int>, default {1, 1}), strides(height, width) of pooling operator.
 paddings (Duplicable): (vector<int>, default:{0, 0}), paddings(height, width) of pooling operator. If global_pooling = true, paddings will be ignored.
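The output-size formula above can be checked with a small helper (the function name is illustrative, not part of the operator API):

```python
def pool_out_size(in_size, ksize, padding, stride):
    # out = (in - ksize + 2 * padding) / stride + 1
    return (in_size - ksize + 2 * padding) // stride + 1

# A 6x6 feature map with a 2x2 window, stride 2, no padding pools to 3x3.
h_out = pool_out_size(6, 2, 0, 2)
w_out = pool_out_size(6, 2, 0, 2)
```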
thresholded_relu
ThresholdedRelu Activation Operator.
$$ out = \begin{cases} x, \text{if } x > threshold \\ 0, \text{otherwise} \end{cases} $$
Inputs:  X : Input of ThresholdedRelu operator
Outputs:  Out : Output of ThresholdedRelu operator
Attributes:  threshold (Duplicable): The threshold location of activation
hard_shrink
HardShrink Activation Operator.
$$ out = \begin{cases} x, \text{if } x > \lambda \\ x, \text{if } x < -\lambda \\ 0, \text{otherwise} \end{cases} $$
Inputs:  X : Input of HardShrink operator
Outputs:  Out : Output of HardShrink operator
Attributes:  threshold (Duplicable): The value of threshold for HardShrink
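A minimal NumPy sketch of the piecewise definition above, not the operator's kernel; the default threshold is an assumption for illustration:

```python
import numpy as np

def hard_shrink(x, threshold=0.5):
    # Pass values whose magnitude exceeds the threshold; zero out the rest.
    return np.where(np.abs(x) > threshold, x, 0.0)

out = hard_shrink(np.array([-1.0, -0.3, 0.3, 1.0]))
```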
create_batch_reader
CreateBatchReader Operator A batch reader takes another reader as its 'underlying reader', gathers the underlying reader's outputs and then yields them in batches.
Inputs:  UnderlyingReader : (ReaderHolder) The underlying reader for creating a batch reader.
Outputs:  Out : (ReaderHolder) The created batch reader.
Attributes:  batch_size (Duplicable): How many instances the batch reader yields each time.
relu6
Relu6 Activation Operator.
$out = min(max(0, x), 6)$
Inputs:  X : Input of Relu6 operator
Outputs:  Out : Output of Relu6 operator
Attributes:  threshold (Duplicable): The threshold value of Relu6
elu
ELU Activation Operator.
Applies the following elementwise computation on the input according to https://arxiv.org/abs/1511.07289.
$out = max(0, x) + min(0, alpha * (e^x - 1))$
Inputs:  X : Input of ELU operator
Outputs:  Out : Output of ELU operator
Attributes:  alpha (Duplicable): The alpha value of ELU
save_combine
SaveCombine operator
This operator will serialize and write a list of input LoDTensor variables to a file on disk.
Inputs:  X (Duplicable) : (vector) Input LoDTensors that need to be saved together in a file.
Outputs: Attributes:  overwrite (Duplicable): (boolean, default true) Overwrite the output file if it exists.
 file_path (Duplicable): (string)The "file_path" where the LoDTensor variables will be saved.
leaky_relu
LeakyRelu Activation Operator.
$out = max(x, alpha * x)$
Inputs:  X : Input of LeakyRelu operator
Outputs:  Out : Output of LeakyRelu operator
Attributes:  alpha (Duplicable): The small negative slope
softsign
Softsign Activation Operator.
$$out = \frac{x}{1 + |x|}$$
Inputs:  X : Input of Softsign operator
Outputs:  Out : Output of Softsign operator
square
Square Activation Operator.
$out = x^2$
Inputs:  X : Input of Square operator
Outputs:  Out : Output of Square operator
log
Log Activation Operator.
$out = ln(x)$
Natural logarithm of x.
Inputs:  X : Input of Log operator
Outputs:  Out : Output of Log operator
reciprocal
Reciprocal Activation Operator.
$$out = \frac{1}{x}$$
Inputs:  X : Input of Reciprocal operator
Outputs:  Out : Output of Reciprocal operator
ceil
Ceil Activation Operator.
$out = ceil(x)$
Inputs:  X : Input of Ceil operator
Outputs:  Out : Output of Ceil operator
abs
Abs Activation Operator.
$out = |x|$
Inputs:  X : Input of Abs operator
Outputs:  Out : Output of Abs operator
soft_relu
SoftRelu Activation Operator.
$out = ln(1 + exp(max(min(x, threshold), -threshold)))$
Inputs:  X : Input of SoftRelu operator
Outputs:  Out : Output of SoftRelu operator
Attributes:  threshold (Duplicable): The threshold value of SoftRelu
softshrink
Softshrink Activation Operator.
$$ out = \begin{cases} x - \lambda, \text{if } x > \lambda \\ x + \lambda, \text{if } x < -\lambda \\ 0, \text{otherwise} \end{cases} $$
Inputs:  X : Input of Softshrink operator
Outputs:  Out : Output of Softshrink operator
Attributes:  lambda (Duplicable): nonnegative offset
softmax
Softmax Operator.
The input of the softmax operator is a 2D tensor with shape N x K (N is the batch_size, K is the dimension of input feature). The output tensor has the same shape as the input tensor.
For each row of the input tensor, the softmax operator squashes the K-dimensional vector of arbitrary real values to a K-dimensional vector of real values in the range [0, 1] that add up to 1. It computes the exponential of the given dimension and the sum of exponential values of all the other dimensions in the K-dimensional vector input. Then the ratio of the exponential of the given dimension and the sum of exponential values of all the other dimensions is the output of the softmax operator.
For each row $i$ and each column $j$ in Input(X), we have: $$Out[i, j] = \frac{\exp(X[i, j])}{\sum_j \exp(X[i, j])}$$
Inputs:  X : The input tensor of softmax. 2D with shape [batch_size, input_feature_dimensions].
Outputs:  Out : The normalized values with the same shape as X.
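A minimal NumPy sketch of the row-wise formula above. The max-subtraction is an extra numerical-stability step beyond the documented formula (it cancels out mathematically); this is an illustration, not the operator's kernel:

```python
import numpy as np

def softmax(x):
    # Subtracting the row max leaves the result unchanged but avoids overflow.
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

out = softmax(np.array([[1.0, 1.0, 1.0, 1.0]]))  # uniform input -> uniform output
```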
top_k
Top K operator
If the input is a vector (1-D tensor), this operator finds the k largest entries in the vector and outputs their values and indices as vectors. Thus values[j] is the j-th largest entry in the input, and its index is indices[j].
For matrices, this operator computes the top k entries in each row.
Inputs:  X : (Tensor) The input of Topk op
Outputs:  Out : (Tensor) The output tensor of Topk op
 Indices : (Tensor) The indices of Topk elements of input
Attributes:  k (Duplicable): (int, default 1) Number of top elements to look for along the last dimension (along each row for matrices).
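The per-row behavior can be sketched with NumPy; this is an illustration under the assumption of descending order with no particular tie-breaking, not the operator's kernel:

```python
import numpy as np

def top_k(x, k=1):
    # Sort descending along the last axis and keep the first k entries.
    idx = np.argsort(-x, axis=-1)[..., :k]
    return np.take_along_axis(x, idx, axis=-1), idx

values, indices = top_k(np.array([[1.0, 5.0, 3.0]]), k=2)
```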
clip
Clip Operator.
The clip operator limits the value of given input within an interval. The interval is specified with arguments 'min' and 'max':
$$ Out = \min(\max(X, min), max) $$
Inputs:  X : (Tensor)The input of clip op.The number of dimensions must be between [1, 9].
Outputs:  Out : (Tensor)The output of clip op with shape as input(X)
Attributes:  min (Duplicable): (float)Minimum value, under which element is replaced by min.
 max (Duplicable): (float)Maximum value, above which element is replaced by max
margin_rank_loss
MarginRankLoss Operator.
This operator measures the loss given a pair of training samples {`X1`, `X2`} and the `Label` with attribute `margin`, where `Label = +1` indicates that X1 is ranked higher than X2, and `Label = -1` otherwise. The loss is calculated as:
$loss(X1, X2, Label) = \max(0, -Label * (X1 - X2) + margin)$
The attribute `margin` here helps make the predictions more robust. Denote the item ranked higher as the positive sample, and the other as the negative sample. If the scores of the two samples satisfy
$positive sample - negative sample < margin$
the pair of samples will contribute to the final loss, which will backpropagate and train the ranking model to enlarge the difference between the two scores.
For a batch input with size `batch_size`, `X1`, `X2` and `Label` all have the same shape [batch_size x 1].
Inputs:  X1 : (2D tensor with shape [batch_size x 1]) The score for one item X1 to be ranked, from pairwise ranking model.
 X2 : (2D tensor with shape [batch_size x 1]) The score for another item X2 to be ranked, from pairwise ranking model.
 Label : (2D tensor with shape [batch_size x 1]) The label indicating X1 ranked higher than X2 or not; can only be +1 or -1.
Outputs:  Activated (Intermediate) : (2D tensor with shape [batch_size x 1]) Intermediate tensor to indicate whether each element of Output(Out) is activated.
 Out : (2D tensor with shape [batch_size x 1]) The output loss of MarginRankLoss operator.
Attributes:  margin (Duplicable): (scalar, default 0) Margin for MarginRankLossOp.
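A minimal NumPy sketch of the hinge formula above, as an illustration rather than the operator's kernel:

```python
import numpy as np

def margin_rank_loss(x1, x2, label, margin=0.0):
    # loss = max(0, -Label * (X1 - X2) + margin)
    return np.maximum(0.0, -label * (x1 - x2) + margin)

# X1 outscores X2 by more than the margin, so the pair contributes no loss.
no_loss = margin_rank_loss(np.array([0.9]), np.array([0.2]), np.array([1.0]), margin=0.5)
# With the label flipped the pair is mis-ordered and is penalized.
loss = margin_rank_loss(np.array([0.9]), np.array([0.2]), np.array([-1.0]), margin=0.5)
```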
mul
Mul Operator.
This operator is used to perform matrix multiplication for input $X$ and $Y$.
The equation is:
$$Out = X * Y$$
Both the input $X$ and $Y$ can carry the LoD (Level of Details) information, or not. But the output only shares the LoD information with input $X$.
Inputs:  X : (Tensor), The first input tensor of mul op.
 Y : (Tensor), The second input tensor of mul op.
Outputs:  Out : (Tensor), The output tensor of mul op.
Attributes:  x_num_col_dims (Duplicable): (int, default 1), The mul_op can take tensors with more than two dimensions as its inputs. If the input $X$ is a tensor with more than two dimensions, $X$ will be flattened into a two-dimensional matrix first. The flattening rule is: the first `num_col_dims` dimensions will be flattened to form the first dimension of the final matrix (the height of the matrix), and the rest `rank(X) - num_col_dims` dimensions are flattened to form the second dimension of the final matrix (the width of the matrix). As a result, the height of the flattened matrix is equal to the product of $X$'s first `x_num_col_dims` dimensions' sizes, and the width of the flattened matrix is equal to the product of $X$'s last `rank(X) - num_col_dims` dimensions' sizes. For example, suppose $X$ is a 5-dimensional tensor with the shape [2, 3, 4, 5, 6], and `x_num_col_dims` = 3. Thus, the flattened matrix will have a shape [2 x 3 x 4, 5 x 6] = [24, 30].
 y_num_col_dims (Duplicable): (int, default 1), The mul_op can take tensors with more than two dimensions as its inputs. If the input $Y$ is a tensor with more than two dimensions, $Y$ will be flattened into a two-dimensional matrix first. The attribute `y_num_col_dims` determines how $Y$ is flattened. See the comments of `x_num_col_dims` for more details.
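The flattening rule can be reproduced with plain NumPy reshapes; this is an illustration of the rule, not the operator's implementation, and the shapes are the example from the attribute description:

```python
import numpy as np

x = np.ones((2, 3, 4, 5, 6))
# x_num_col_dims = 3: the first three dims form the height, the rest the width.
x2d = x.reshape(2 * 3 * 4, 5 * 6)    # shape [24, 30]
y = np.ones((5, 6, 7))
y2d = y.reshape(5 * 6, 7)            # y_num_col_dims = 2 -> [30, 7]
out = x2d @ y2d                      # Out = X * Y
```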
mine_hard_examples
Mine hard examples Operator. This operator implements hard example mining to select a subset of negative box indices. For each image, it selects the boxes with the highest losses, subject to the condition that a box cannot have MatchDist > neg_dist_threshold when mining_type is max_negative. The selected number is min(sample_size, max_negative_box_number) when mining_type is hard_example, or min(neg_pos_ratio * positive_box_number, max_negative_box_number) when mining_type is max_negative, where max_negative_box_number is the count of MatchIndices elements with value -1.
Inputs:  ClsLoss : (Tensor, default Tensor<float>), The classification loss with shape [N, Np], N is the batch size and Np is the number of prior box.
 LocLoss : (Tensor, optional, default Tensor<float>), The localization loss with shape [N, Np], N is the batch size and Np is the number of prior box.
 MatchIndices : (Tensor, Tensor<int>), Matched indices with shape [N, Np], N is the batch size and Np is the number of prior boxes. MatchIndices[i][j] == -1 means the j-th prior box in the i-th instance does not match any entity; otherwise it is the row the box is matched to.
 MatchDist : (Tensor, default Tensor<float>) Matched distances with shape [N, Np], N is the batch size and Np is the number of prior boxes.
Outputs:  NegIndices : (LoDTensor<int>) The output of negative example indices, a LoDTensor with shape [Neg, 1]. The size of lod[0] minus 1 is the batch size, and each element is the prior box index. For example, if the batch size is 2 and the lod is [[0, 1, 2]], then sample 0's box 1 (MatchIndices[0][1]) is selected, and sample 1's box 0 is selected. The output NegIndices is [[1], [0]].
 UpdatedMatchIndices : (Tensor<int>) The output of updated MatchIndices, a tensor with shape [N, Np]. Only updated when mining_type is hard_example. The input MatchIndices elements will be updated to -1 when they are not in the candidate high-loss list of negative examples.
Attributes:  neg_pos_ratio (Duplicable): (float) The ratio of the negative box to the positive box. Use only when mining_type is max_negative.
 neg_dist_threshold (Duplicable): (float) The negative overlap upper bound for the unmatched predictions. Use only when mining_type is max_negative.
 sample_size (Duplicable): (int) The max sample size of negative boxes. Use only when mining_type is hard_example.
 mining_type (Duplicable): (string) The mining algorithm name; the value is hard_example or max_negative.
swish
Swish Activation Operator.
$$out = \frac{x}{1 + e^{-\beta x}}$$
Inputs:  X : Input of Swish operator
Outputs:  Out : Output of Swish operator
Attributes:  beta (Duplicable): Constant beta of swish operator
is_empty
IsEmpty Operator, which checks whether a tensor is empty.
It simply returns whether product(tensor.dims()) == 0.
Inputs:  X : (Tensor) Tensor which is to be checked.
Outputs:  Out : (Tensor) a boolean Tensor that indicate empty or not.
minus
Minus Operator.
Equation:
$Out = X - Y$
Both the inputs `X` and `Y` can carry the LoD (Level of Details) information, or not. But the output only shares the LoD information with the input `X`.
Inputs:  X : The left tensor of minus operator.
 Y : The right tensor of minus operator.
Outputs:  Out : The output tensor of minus operator.
scatter
Scatter Operator.
This operator obtains output by updating the input on selected indices on the first axis:
$$ Out = Ref \\ Out[Index] = Ref[Index] + Updates $$
Inputs:  Ref : The source input of scatter op
 Index : The index input of scatter op where Ref will be updated
 Updates : The update values of scatter op
Outputs:  Out : The output of scatter op
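The update rule above can be reproduced directly with NumPy fancy indexing; a sketch for illustration, not the operator's kernel:

```python
import numpy as np

ref = np.zeros((4, 2))
index = np.array([0, 2])
updates = np.ones((2, 2))

# Out = Ref, then Out[Index] = Ref[Index] + Updates along the first axis.
out = ref.copy()
out[index] = ref[index] + updates
```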
max_sequence_len
Calculate the max sequence length through lod_rank_table.
Inputs:  RankTable : The lod_rank_table.
Outputs:  Out : The max sequence length.
multiplex
Multiplex Operator.
Multiplex multiple tensors according to the index provided by the index tensor.
Ids: the index tensor. X[0 : N - 1]: the candidate tensors for output (N >= 2). For each index i from 0 to batchSize - 1, the output is the i-th row of the (Ids[i])-th tensor.
For the i-th row of the output tensor:
$$y[i] = x_{k}[i]$$
where `y` is the output tensor, `x_{k}` is the k-th input tensor, and `k = Ids[i]`.
Inputs:  Ids : The index tensor of multiplex operator.
 X (Duplicable) : The candidate tensors of multiplex operator.
Outputs:  Out : The output tensor of multiplex operator.
elementwise_pow
Limited Elementwise Pow Operator.
The equation is:
$$Out = X ^ Y$$
$X$ is a tensor of any dimension and the dimensions of tensor $Y$ must be smaller than or equal to the dimensions of $X$.
There are two cases for this operator: 1. The shape of $Y$ is same with $X$; 2. The shape of $Y$ is a subset of $X$.
For case 2: $Y$ will be broadcasted to match the shape of $X$ and axis should be set to index of the start dimension to broadcast $Y$ onto $X$.
For example .. code-block:: python

   shape(X) = (2, 3, 4, 5), shape(Y) = (,)
   shape(X) = (2, 3, 4, 5), shape(Y) = (5,)
   shape(X) = (2, 3, 4, 5), shape(Y) = (4, 5)
   shape(X) = (2, 3, 4, 5), shape(Y) = (3, 4), with axis=1
   shape(X) = (2, 3, 4, 5), shape(Y) = (2), with axis=0
Either of the inputs $X$ and $Y$ or none can carry the LoD (Level of Details) information. However, the output only shares the LoD information with input $X$.
Inputs:  X : (Tensor), The first input tensor of elementwise op.
 Y : (Tensor), The second input tensor of elementwise op.
Outputs:  Out : The output of elementwise op.
Attributes:  axis (Duplicable): (int, default -1). The start dimension index for broadcasting Y onto X.
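The axis-based broadcast (case 2 above) can be sketched in NumPy by padding Y's shape with size-1 dimensions so it lines up with X starting at `axis`; an illustration, not the operator's kernel:

```python
import numpy as np

x = np.full((2, 3, 4, 5), 2.0)
y = np.full((3, 4), 3.0)
axis = 1
# Reshape Y to (1, 3, 4, 1) so its dims align with X's dims starting at `axis`.
y_b = y.reshape((1,) * axis + y.shape + (1,) * (x.ndim - axis - y.ndim))
out = x ** y_b  # element-wise pow with Y broadcast onto X
```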
proximal_gd
ProximalGD Operator.
Optimizer that implements the proximal gradient descent algorithm:
$$ prox\_param = param - learning\_rate * grad \\ param = sign(prox\_param) / (1 + learning\_rate * l2) * \max(|prox\_param| - learning\_rate * l1, 0) $$
The paper that proposed Proximal Gradient Descent: (http://papers.nips.cc/paper/3793efficientlearningusingforwardbackwardsplitting.pdf)
Inputs:  Param : (Tensor, default Tensor<float>) Input parameter value that has to be updated.
 Grad : (Tensor, default Tensor<float>) Input gradient of the parameter.
 LearningRate : (Tensor, default Tensor<float>) The learning rate should be a tensor of size 1.
Outputs:  ParamOut : (Tensor) Output updated parameter value.
Attributes:  l1 (Duplicable): (float, default 0.0) L1 regularization strength.
 l2 (Duplicable): (float, default 0.0) L2 regularization strength.
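One update step can be sketched in NumPy from the two-line formula above; an illustration under those equations, not the operator's kernel:

```python
import numpy as np

def proximal_gd_step(param, grad, lr, l1=0.0, l2=0.0):
    prox = param - lr * grad
    # Soft-threshold by lr * l1, then shrink by the L2 factor.
    return np.sign(prox) / (1.0 + lr * l2) * np.maximum(np.abs(prox) - lr * l1, 0.0)

new_param = proximal_gd_step(np.array([1.0]), np.array([2.0]), lr=0.1, l1=0.5)
```

Note how a large `l1` drives small parameters exactly to zero, which is the point of the proximal step.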
prelu
PRelu Operator.
The equation is:
$$ f(x) = \begin{cases} \alpha * x, \quad \text{if} \ x < 0 \\ x, \qquad \text{if} \ x >= 0 \end{cases} $$
The input `X` can carry the LoD (Level of Details) information, or not. And the output shares the LoD information with the input `X`.
Inputs:  X : The input tensor of prelu operator.
 Alpha : The alpha weight of prelu operator.
Outputs:  Out : The output tensor of prelu operator.
prior_box
Prior box operator Generate prior boxes for the SSD (Single Shot MultiBox Detector) algorithm. Each position of the input produces N prior boxes, where N is determined by the count of min_sizes, max_sizes and aspect_ratios. The size of each box is in the range (min_size, max_size), generated in sequence according to the aspect_ratios.
Please refer to the following paper for more information: https://arxiv.org/abs/1512.02325.
Inputs:  Input : (Tensor, default Tensor<float>), the input feature data of PriorBoxOp, The layout is NCHW.
 Image : (Tensor, default Tensor<float>), the input image data of PriorBoxOp, The layout is NCHW.
Outputs:  Boxes : (Tensor, default Tensor<float>), the output prior boxes of PriorBoxOp. The layout is [H, W, num_priors, 4]. H is the height of input, W is the width of input, num_priors is the box count of each position.
 Variances : (Tensor, default Tensor<float>), the expanded variances of PriorBoxOp. The layout is [H, W, num_priors, 4]. H is the height of input, W is the width of input, num_priors is the box count of each position.
Attributes:  min_sizes (Duplicable): (vector<int>) List of min sizes of generated prior boxes.
 max_sizes (Duplicable): (vector<int>) List of max sizes of generated prior boxes.
 aspect_ratios (Duplicable): (vector<float>) List of aspect ratios of generated prior boxes.
 variances (Duplicable): (vector<float>) List of variances to be encoded in prior boxes.
 flip (Duplicable): (bool) Whether to flip aspect ratios.
 clip (Duplicable): (bool) Whether to clip outofboundary boxes.
 step_w (Duplicable): Prior boxes step across width, 0 for auto calculation.
 step_h (Duplicable): Prior boxes step across height, 0 for auto calculation.
 offset (Duplicable): (float) Prior boxes center offset.
proximal_adagrad
Proximal Adagrad Optimizer.
Optimizer that implements the proximal adagrad algorithm:
$$ moment = moment + grad * grad \\ prox\_param = param - learning\_rate * grad * (1 / \sqrt{moment}) \\ param = sign(prox\_param) / (1 + learning\_rate * l2) * \max(|prox\_param| - learning\_rate * l1, 0) $$
The paper that proposed Proximal GD: (http://papers.nips.cc/paper/3793efficientlearningusingforwardbackwardsplitting.pdf) Here, we use the adagrad learning rate as specified here: (http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf)
Inputs:  Param : (Tensor, default Tensor<float>) Input parameter that has to be updated.
 Moment : (Tensor, default Tensor<float>) Moment parameter that has to be updated.
 Grad : (Tensor, default Tensor<float>) Input gradient of the parameter.
 LearningRate : (Tensor, default Tensor<float>) The learning rate should be a tensor of size 1.
Outputs:  ParamOut : (Tensor) Output updated parameter value.
 MomentOut : (Tensor) Output updated moment value.
Attributes:  l1 (Duplicable): (float, default 0.0) L1 regularization strength.
 l2 (Duplicable): (float, default 0.0) L2 regularization strength.
rank_loss
RankLoss Operator.
RankLoss operator for RankNet (http://icml.cc/2015/wpcontent/uploads/2015/06/icml_ranking.pdf). RankNet is a pairwise ranking model with one training sample consisting of a pair of doc A and B, and the label P indicating that A is ranked higher than B or not:
P = {0, 1} or {0, 0.5, 1}, where 0.5 means no information about the rank of the input pair.
The RankLoss operator takes three inputs: Left (o_i), Right (o_j) and Label (P_{i,j}), which represent the output score of RankNet for the two docs and the label respectively, and yields the rank loss C_{i,j} using the following equation:
$$ C_{i,j} = -\tilde{P_{i,j}} * o_{i,j} + \log(1 + e^{o_{i,j}}) \\ o_{i,j} = o_i - o_j \\ \tilde{P_{i,j}} = \left \{0, 0.5, 1 \right \} \ or \ \left \{0, 1 \right \} $$
The operator can take batch inputs with size batch_size (batch_size >= 1).
Inputs:  Label : (2D Tensor with shape [batch_size x 1]) The label indicating A ranked higher than B or not.
 Left : (2D Tensor with shape [batch_size x 1]) The output of RankNet for doc A.
 Right : (2D Tensor with shape [batch_size x 1]) The output of RankNet for doc B.
Outputs:  Out : (2D Tensor with shape [batch_size x 1]) The output loss of RankLoss operator.
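Assuming the standard RankNet cross-entropy formulation ($C = -\tilde{P} o + \log(1 + e^{o})$), one evaluation can be sketched in NumPy; an illustration, not the operator's kernel:

```python
import numpy as np

def rank_loss(label, left, right):
    o = left - right           # o_{i,j} = o_i - o_j
    # C = -P * o + log(1 + e^o); log1p is used for numerical accuracy.
    return -label * o + np.log1p(np.exp(o))

# Equal scores with label 1 give the maximum-uncertainty loss log(2).
c = rank_loss(np.array([1.0]), np.array([3.0]), np.array([3.0]))
```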
reduce_min
ReduceMin Operator.
This operator computes the min of input tensor along the given dimension. The result tensor has 1 fewer dimension than the input unless keep_dim is true. If reduce_all is true, just reduce along all dimensions and output a scalar.
Inputs:  X : (Tensor) The input tensor. Tensors with rank at most 6 are supported.
Outputs:  Out : (Tensor) The result tensor.
Attributes:  dim (Duplicable): (int, default 0) The dimension to reduce. Must be in the range [-rank(input), rank(input)). If `dim < 0`, the dim to reduce is `rank + dim`. Note that reducing along the first dim will lose the LoD information.
 keep_dim (Duplicable): (bool, default false) If true, retain the reduced dimension with length 1.
 reduce_all (Duplicable): (bool, default false) If true, output a scalar reduced along all dimensions.
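The effect of `dim`, `keep_dim`, and `reduce_all` maps directly onto NumPy's reduction arguments; a sketch for illustration, not the operator's kernel:

```python
import numpy as np

x = np.array([[3.0, 1.0],
              [4.0, 2.0]])

out = x.min(axis=1)                      # dim = 1 -> shape (2,)
out_keep = x.min(axis=1, keepdims=True)  # keep_dim = true -> shape (2, 1)
out_all = x.min()                        # reduce_all = true -> scalar
```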
reduce_max
ReduceMax Operator.
This operator computes the max of input tensor along the given dimension. The result tensor has 1 fewer dimension than the input unless keep_dim is true. If reduce_all is true, just reduce along all dimensions and output a scalar.
Inputs:  X : (Tensor) The input tensor. Tensors with rank at most 6 are supported.
Outputs:  Out : (Tensor) The result tensor.
Attributes:  dim (Duplicable): (int, default 0) The dimension to reduce. Must be in the range [-rank(input), rank(input)). If `dim < 0`, the dim to reduce is `rank + dim`. Note that reducing along the first dim will lose the LoD information.
 keep_dim (Duplicable): (bool, default false) If true, retain the reduced dimension with length 1.
 reduce_all (Duplicable): (bool, default false) If true, output a scalar reduced along all dimensions.
reduce_mean
ReduceMean Operator.
This operator computes the mean of input tensor along the given dimension. The result tensor has 1 fewer dimension than the input unless keep_dim is true. If reduce_all is true, just reduce along all dimensions and output a scalar.
Inputs:  X : (Tensor) The input tensor. Tensors with rank at most 6 are supported.
Outputs:  Out : (Tensor) The result tensor.
Attributes:  dim (Duplicable): (int, default 0) The dimension to reduce. Must be in the range [-rank(input), rank(input)). If `dim < 0`, the dim to reduce is `rank + dim`. Note that reducing along the first dim will lose the LoD information.
 keep_dim (Duplicable): (bool, default false) If true, retain the reduced dimension with length 1.
 reduce_all (Duplicable): (bool, default false) If true, output a scalar reduced along all dimensions.
round
Round Activation Operator.
$out = [x]$
Inputs:  X : Input of Round operator
Outputs:  Out : Output of Round operator
norm
Input shape: $(N, C, H, W)$ Scale shape: $(C, 1)$ Output shape: $(N, C, H, W)$
Inputs:  X : (Tensor) The input tensor of norm operator. The format of the input tensor is NCHW, where N is batch size, C is the number of channels, and H and W are the height and width of the feature.
 Scale : (Tensor) The scale tensor of norm operator. The format of the tensor is C * 1.
Outputs:  Out : (Tensor) The output tensor of norm operator, with shape N * M, where M = C * H * W.
Attributes:  epsilon (Duplicable): (float, default 1e10) Constant for numerical stability.
modified_huber_loss
Modified Huber Loss Operator.
This operator is used in binary classification problem. The shape of input X and target Y are both [N, 1] and so is the shape of the output loss. Since target Y is not differentiable, calculating gradient for Y is illegal. The formula of modified huber loss is:
$$ L(y, f(x)) = \begin{cases} (\max(0, 1 - yf(x)))^2, \text{if} \ yf(x) \geq -1 \\ -4yf(x), \quad \text{otherwise} \end{cases} $$
Make sure the values of target label Y are in {0, 1} here. This operator will scale values of Y to {-1, +1} when computing losses and gradients.
Inputs:  X : The input tensor of modified huber loss op. X is 2D tensor with shape [batch_size, 1].
 Y : The target labels of modified huber loss op. The shape of Y is the same as X. Values of Y must be 0 or 1.
Outputs:  IntermediateVal (Intermediate) : Variable to save intermediate result which will be reused in backward processing.
 Out : Classification loss for X.
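The piecewise loss above, including the {0, 1} to {-1, +1} label scaling, can be sketched in NumPy; an illustration, not the operator's kernel:

```python
import numpy as np

def modified_huber_loss(y, f):
    ys = 2.0 * y - 1.0   # scale labels {0, 1} -> {-1, +1}
    m = ys * f           # the margin y * f(x)
    return np.where(m >= -1.0, np.maximum(0.0, 1.0 - m) ** 2, -4.0 * m)

# A confident correct prediction incurs zero loss ...
correct = modified_huber_loss(np.array([1.0]), np.array([2.0]))
# ... while a confident wrong one is penalized linearly, not quadratically.
wrong = modified_huber_loss(np.array([1.0]), np.array([-2.0]))
```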
elementwise_sub
Limited Elementwise Sub Operator.
The equation is:
$$Out = X  Y$$
$X$ is a tensor of any dimension and the dimensions of tensor $Y$ must be smaller than or equal to the dimensions of $X$.
There are two cases for this operator: 1. The shape of $Y$ is same with $X$; 2. The shape of $Y$ is a subset of $X$.
For case 2: $Y$ will be broadcasted to match the shape of $X$ and axis should be set to index of the start dimension to broadcast $Y$ onto $X$.
For example .. code-block:: python

   shape(X) = (2, 3, 4, 5), shape(Y) = (,)
   shape(X) = (2, 3, 4, 5), shape(Y) = (5,)
   shape(X) = (2, 3, 4, 5), shape(Y) = (4, 5)
   shape(X) = (2, 3, 4, 5), shape(Y) = (3, 4), with axis=1
   shape(X) = (2, 3, 4, 5), shape(Y) = (2), with axis=0
Either of the inputs $X$ and $Y$ or none can carry the LoD (Level of Details) information. However, the output only shares the LoD information with input $X$.
Inputs:  X : (Tensor), The first input tensor of elementwise op.
 Y : (Tensor), The second input tensor of elementwise op.
Outputs:  Out : The output of elementwise op.
Attributes:  axis (Duplicable): (int, default -1). The start dimension index for broadcasting Y onto X.
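The case-2 broadcasting rule can be sketched in NumPy as follows; the helper name and the reshape trick are illustrative assumptions, not the operator's implementation.

```python
import numpy as np

def elementwise_sub(x, y, axis=-1):
    """Subtract y from x, broadcasting y onto x starting at `axis`."""
    if axis == -1:
        axis = x.ndim - y.ndim             # default: align trailing dims
    # append singleton dims so NumPy broadcasting matches the axis rule
    y = y.reshape(y.shape + (1,) * (x.ndim - axis - y.ndim))
    return x - y

x = np.ones((2, 3, 4, 5))
out = elementwise_sub(x, np.ones((3, 4)), axis=1)   # y spans dims 1 and 2
```

With axis=1, shape (3, 4) is matched against dims 1 and 2 of x, mirroring the shape(Y) = (3, 4) example above.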
conv2d_transpose
Convolution2D Transpose Operator.
The convolution transpose operation calculates the output based on the input, filter and dilations, strides, paddings, groups parameters. The size of each dimension of the parameters is checked in the infershape. Input(Input) and output(Output) are in NCHW format. Where N is batchsize, C is the number of channels, H is the height of the feature, and W is the width of the feature. Filter(Input) is in MCHW format. Where M is the number of input feature channels, C is the number of output feature channels, H is the height of the filter, and W is the width of the filter. Parameters(strides, paddings) are two elements. These two elements represent height and width, respectively. The input(X) size and output(Out) size may be different.
Example: Input: Input shape: $(N, C_{in}, H_{in}, W_{in})$ Filter shape: $(C_{in}, C_{out}, H_f, W_f)$ Output: Output shape: $(N, C_{out}, H_{out}, W_{out})$ Where $$ H_{out} = (H_{in} - 1) * strides[0] - 2 * paddings[0] + dilations[0] * (H_f - 1) + 1 \\ W_{out} = (W_{in} - 1) * strides[1] - 2 * paddings[1] + dilations[1] * (W_f - 1) + 1 $$
Inputs:  Input : (Tensor) The input tensor of convolution transpose operator. The format of input tensor is NCHW. Where N is batch size, C is the number of input channels, H is the height of the feature, and W is the width of the feature.
 Filter : (Tensor) The filter tensor of convolution transpose operator. The format of the filter tensor is MCHW, where M is the number of input feature channels, C is the number of output feature channels,H is the height of the filter, and W is the width of the filter. We enforce groups number == 1 in the convolution transpose scenario.
Outputs:  Output : (Tensor) The output tensor of convolution transpose operator. The format of output tensor is also NCHW.
Attributes:  dilations (Duplicable): (vector<int> default:{1, 1}), the dilations(h_dilation, w_dilation) of convolution transpose operator.
 strides (Duplicable): (vector<int> default:{1, 1}), the strides(h_stride, w_stride) of convolution transpose operator.
 paddings (Duplicable): (vector<int> default:{0, 0}), the paddings(h_pad, w_pad) of convolution transpose operator.
 use_cudnn (Duplicable): (bool, default false) Only used in cudnn kernel, need install cudnn
 data_format (Duplicable): (string, default "NCHW") Only used in cudnn kernel. An optional string from: "NHWC", "NCHW". Specify the data format of the output data; the input will be transformed automatically.
 workspace_size_MB (Duplicable): Used in cudnn kernel only. Workspace size for cudnn, in MB. The workspace is a section of GPU memory that will be allocated/freed each time the operator runs; a larger workspace size can increase performance but also requires better hardware. This size should be set carefully.
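The output-size formula above can be checked with a small helper, applied once per spatial dimension; the function name is illustrative.

```python
def conv2d_transpose_out_size(in_size, filter_size, stride, padding, dilation=1):
    """H_out / W_out from the conv2d_transpose formula, per spatial dim."""
    return (in_size - 1) * stride - 2 * padding + dilation * (filter_size - 1) + 1

# e.g. a 5-pixel input, 3x3 filter, stride 2, padding 1 -> 9-pixel output
```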
elementwise_max
Limited Elementwise Max Operator.
The equation is:
$$Out = max(X, Y)$$
$X$ is a tensor of any dimension and the dimensions of tensor $Y$ must be smaller than or equal to the dimensions of $X$.
There are two cases for this operator: 1. The shape of $Y$ is same with $X$; 2. The shape of $Y$ is a subset of $X$.
For case 2: $Y$ will be broadcasted to match the shape of $X$ and axis should be set to index of the start dimension to broadcast $Y$ onto $X$.
For example:

.. code-block:: python

    shape(X) = (2, 3, 4, 5), shape(Y) = (2, 3, 4, 5)
    shape(X) = (2, 3, 4, 5), shape(Y) = (5,)
    shape(X) = (2, 3, 4, 5), shape(Y) = (4, 5)
    shape(X) = (2, 3, 4, 5), shape(Y) = (3, 4), with axis=1
    shape(X) = (2, 3, 4, 5), shape(Y) = (2), with axis=0
Either of the inputs $X$ and $Y$ or none can carry the LoD (Level of Details) information. However, the output only shares the LoD information with input $X$.
Inputs:  X : (Tensor), The first input tensor of elementwise op.
 Y : (Tensor), The second input tensor of elementwise op.
Outputs:  Out : The output of elementwise op.
Attributes:  axis (Duplicable): (int, default -1). The start dimension index for broadcasting Y onto X.
smooth_l1_loss
Smooth L1 Loss Operator.
This operator computes the smooth l1 loss for X and Y. The operator takes the first dimension of X and Y as batch size. For each instance, it computes the smooth l1 loss element by element first and then sums all the losses. So the shape of Out is [batch_size, 1].
The equation is: $$ Out_{\sigma}(X, Y)_i = \begin{cases} 0.5 * (\sigma * (X_i - Y_i))^2, & \text{if} \ |X_i - Y_i| \lt \frac{1}{{\sigma}^2} \\ |X_i - Y_i| - \frac{0.5}{{\sigma}^2}, & \text{otherwise} \end{cases} $$
In the above equation, $Out_{\sigma}(X, Y)_i$, $X_i$ and $Y_i$ represent the i-th element of Out, X and Y.
Inputs:  X : (Tensor, default Tensor<float>) A tensor with rank at least 2. The input value of smooth l1 loss op with shape [batch_size, dim1, ..., dimN].
 Y : (Tensor, default Tensor<float>) A tensor with rank at least 2. The target value of smooth l1 loss op with same shape as X.
 InsideWeight : (Tensor, default Tensor<float>) A tensor with rank at least 2. This input is optional and should have the same shape as X. If provided, the result of (X - Y) will be multiplied by this tensor element by element.
 OutsideWeight : (Tensor, default Tensor<float>) A tensor with rank at least 2. This input is optional and should have same shape with X. If provided, the out smooth l1 loss will be multiplied by this tensor element by element.
Outputs:  Diff (Intermediate) : Intermediate variable to cache InsideWeight * (X - Y).
 Out : (Tensor, default Tensor<float>) A tensor with rank be 2. The output smooth l1 loss with shape [batch_size, 1].
Attributes:  sigma (Duplicable): Hyper parameter of smooth l1 loss op. A float scalar with default value 3.0.
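A minimal NumPy sketch of the per-instance loss above (without the optional weights); the function name and shapes are illustrative assumptions.

```python
import numpy as np

def smooth_l1_loss(x, y, sigma=3.0):
    """Smooth L1 loss per instance, summed over all trailing dims."""
    diff = np.abs(x - y)
    elem = np.where(diff < 1.0 / sigma ** 2,
                    0.5 * (sigma * (x - y)) ** 2,      # quadratic branch
                    diff - 0.5 / sigma ** 2)           # linear branch
    # sum each instance's element-wise losses -> shape [batch_size, 1]
    return elem.reshape(elem.shape[0], -1).sum(axis=1, keepdims=True)
```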
reorder_lod_tensor_by_rank
ReorderLoDTensorByRankTable operator.
Input(X) is a batch of sequences. Input(RankTable) stores new orders of the input sequence batch. The reorder_lod_tensor_by_rank operator reorders the Input(X) according to the information provided by Input(RankTable).
For example:
If the indices stored in the Input(RankTable) are [3, 0, 2, 1], the Input(X) will be reordered so that the fourth sequence in Input(X) becomes the first one, followed by the original first, third, and second ones.
This is: X = [Seq0, Seq1, Seq2, Seq3]. The indices in RankTable are [3, 0, 2, 1]. Out = [Seq3, Seq0, Seq2, Seq1] with a new LoD information.
If the LoD information of Input(X) is empty, this means Input(X) is not sequence data. This is also identical to a batch of sequences where each sequence has a fixed length 1. In this case, the reorder_lod_tensor_by_rank operator reorders each slice of Input(X) along the first axis according to Input(RankTable).
This is: X = [Slice0, Slice1, Slice2, Slice3] and its LoD information is empty. The indices in RankTable are [3, 0, 2, 1]. Out = [Slice3, Slice0, Slice2, Slice1] with no LoD information appended.
NOTE: This operator sorts Input(X) according to a given LoDRankTable which does not need to be calculated according to Input(X). It can be calculated according to another different sequence, and then this operator sorts Input(X) according to the given LoDRankTable.
Inputs:  X : (LoDTensor), the input lod tensor to be reordered according to Input(RankTable).
 RankTable : (LoDRankTable), the rank table according to which Input(X) is reordered.
Outputs:  Out : (LoDTensor), the reordered lod tensor.
pad
Pad Operator.
Pad input into output, as specified by paddings and pad_value. The input should be a k-D tensor (k > 0 and k < 7). As an example:
Given:
X = [[1, 2], [3, 4]],
paddings = [0, 1, 1, 2],
and
pad_value = 0,
we have:
Out = [[0, 1, 2, 0, 0],
 [0, 3, 4, 0, 0],
 [0, 0, 0, 0, 0]]
Inputs:  X : The input of pad op. The input should be a k-D tensor (k > 0 and k < 7)
Outputs:  Out : The output of pad op. A tensor with the same shape as X.
Attributes:  paddings (Duplicable): (vector<int>) A list<int> to describe the padding rules for each dimension. For 2D image tensor, paddings=[0, 1, 2, 3] means padding 0 row to top, 1 row to bottom, 2 columns to left and 3 columns to right. Size of paddings should be equal to 2 * dimension size of the input tensor.
 pad_value (Duplicable): (float, default 0.0) The value to fill the padded areas.
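The example above can be reproduced with NumPy; the helper merely regroups the flat paddings list into (before, after) pairs per dimension, and its name is illustrative.

```python
import numpy as np

def pad(x, paddings, pad_value=0.0):
    """paddings = [before_0, after_0, before_1, after_1, ...], as in the op."""
    pairs = [(paddings[2 * i], paddings[2 * i + 1]) for i in range(x.ndim)]
    return np.pad(x, pairs, mode='constant', constant_values=pad_value)

out = pad(np.array([[1, 2], [3, 4]]), [0, 1, 1, 2])
```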
lstm_unit
Lstm Unit Operator
Equation:
$$ i, f, o, j = split(X) \\ C = C_{prev} * sigm(f + forget\_bias) + sigm(i) * tanh(j) \\ H = C * sigm(o) $$
Inputs:  X : Lstm unit only applies nonlinear activations; please make sure that linear transformation has already been applied to `X`. Linear transformation can be applied by adding a `fc` layer
 C_prev : The cell state tensor of last timestep in the Lstm Unit operator.
Outputs:  C : The cell tensor of Lstm Unit operator.
 H : The hidden state tensor of Lstm Unit operator.
Attributes:  forget_bias (Duplicable): (float, default 0.0) The forget bias of Lstm Unit.
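The equation above can be sketched directly in NumPy; splitting X into the four pre-activation blocks [i, f, o, j] is an assumption about the packing order taken from the formula, and the names are illustrative.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_unit(x, c_prev, forget_bias=0.0):
    """One LSTM unit step; x packs the pre-activations [i, f, o, j]."""
    i, f, o, j = np.split(x, 4, axis=-1)
    c = c_prev * sigmoid(f + forget_bias) + sigmoid(i) * np.tanh(j)
    h = c * sigmoid(o)
    return c, h
```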
squared_l2_norm
SquaredL2Norm Operator.
Computes the squared L2 norm of a tensor.
$$Out = \sum_{i} X_{i}^2$$
Inputs:  X : (Tensor) The input of squared_l2_norm op.
Outputs:  Out : (Scalar) The output of squared_l2_norm op.
sequence_expand
Sequence Expand Operator.
This operator expands input(X) according to the LoD of input(Y). The following cases explain how this works:
Case 1:
Given a 2-level LoDTensor input(X)
    X.lod = [[0, 2, 3], [0, 1, 3, 4]]
    X.data = [a, b, c, d]
    X.dims = [4, 1]
and input(Y)
    Y.lod = [[0, 2, 4], [0, 3, 6, 7, 8]]
with condition len(Y.lod[-1]) - 1 == X.dims[0], then we get a 2-level LoDTensor
    Out.lod = [[0, 2, 4], [0, 3, 6, 7, 8]]
    Out.data = [a, a, a, b, b, b, c, d]
    Out.dims = [8, 1]
Case 2:
Given a common Tensor input(X)
    X.data = [a, b, c]
    X.dims = [3, 1]
and input(Y)
    Y.lod = [[0, 2, 3, 6]]
with condition len(Y.lod[-1]) - 1 == X.dims[0], then we get a 1-level LoDTensor
    Out.lod = [[0, 2, 3, 6]]
    Out.data = [a, a, b, c, c, c]
    Out.dims = [6, 1]
Case 3:
Given a common Tensor input(X)
    X.data = [[a, b], [c, d], [e, f]]
    X.dims = [3, 2]
and input(Y)
    Y.lod = [[0, 2, 3, 6]]
with condition len(Y.lod[-1]) - 1 == X.dims[0], then we get a 1-level LoDTensor
    Out.lod = [[0, 2, 3, 6]]
    Out.data = [[a, b], [a, b], [c, d], [e, f], [e, f], [e, f]]
    Out.dims = [6, 2]
Case 4:
Given a 2-level LoDTensor input(X)
    X.lod = [[0, 2, 3], [0, 1, 3, 4]]
    X.data = [a, b, c, d]
    X.dims = [4, 1]
and input(Y)
    Y.lod = [[0, 2, 4], [0, 3, 6, 6, 8]]
with condition len(Y.lod[-1]) - 1 == X.dims[0], then we get a 2-level LoDTensor
    Out.lod = [[0, 2, 4], [0, 3, 6, 6, 8]]
    Out.data = [a, a, a, b, b, b, d, d]
    Out.dims = [8, 1]
Inputs:  X : (Tensor or LoDTensor) The input(X) of this operator can be a LoDTensor or a base Tensor.
 Y : (LoDTensor) The reference input(Y) of sequence_expand op. It must be a LoDTensor with k-level (k > 0). The input(X) will be expanded according to the LoD of input(Y). The element number of the last level in input(Y) must be equal to dims[0] of input(X).
Outputs:  Out : (LoDTensor) The output of sequence_expand op. The LoD of the output will be the same as input(Y)'s LoD.
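Cases 2 and 3 (a plain Tensor expanded by the last-level LoD of Y) reduce to a row-wise repeat, sketched here with NumPy; the function name is illustrative.

```python
import numpy as np

def sequence_expand(x, y_lod_last_level):
    """Expand a plain tensor x row-wise per the last-level LoD of Y."""
    lod = y_lod_last_level
    assert len(lod) - 1 == x.shape[0]
    # each row i is repeated as many times as sequence i of Y is long
    reps = [lod[i + 1] - lod[i] for i in range(len(lod) - 1)]
    return np.repeat(x, reps, axis=0)

out = sequence_expand(np.array([['a'], ['b'], ['c']]), [0, 2, 3, 6])
```

This reproduces Case 2: rows a, b, c are repeated 2, 1, and 3 times respectively.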
momentum
Momentum Optimizer.
This optimizer has a flag for Nestrov Momentum. The update equations are as follows:
$$ velocity = mu * velocity + gradient \\ if (use\_nesterov): \\ param = param - (gradient + mu * velocity) * learning\_rate \\ else: \\ param = param - learning\_rate * velocity \\ $$
Inputs:  Param : (Tensor, default Tensor<float>) Input parameter that has to be updated
 Grad : (Tensor, default Tensor<float>) Input gradient of the parameter
 Velocity : (Tensor, default Tensor<float>) Input velocity (corresponding to the parameter) that has to be updated
 LearningRate : (Tensor, default Tensor<float>) Input learning rate
Outputs:  ParamOut : (Tensor) This output is the updated parameter. It shares memory with Input(Param).
 VelocityOut : (Tensor) This output is the updated velocity. It shares memory with Input(Velocity).
Attributes:  mu (Duplicable): (float) Momentum coefficient
 use_nesterov (Duplicable): (bool, default false) Use Nesterov Momentum
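The two update branches can be written out as a scalar sketch; the Nesterov branch uses the common form param - (gradient + mu * velocity) * learning_rate, and the function name is illustrative.

```python
def momentum_step(param, grad, velocity, lr, mu, use_nesterov=False):
    """One momentum update step (scalar sketch of the equations above)."""
    velocity = mu * velocity + grad
    if use_nesterov:
        param = param - (grad + mu * velocity) * lr
    else:
        param = param - lr * velocity
    return param, velocity
```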
uniform_random
Uniform random operator.
This operator initializes a tensor with random values sampled from a uniform distribution.
Inputs: Outputs:  Out : (Tensor) The output tensor of uniform random op
Attributes:  shape (Duplicable): (vector<int>) The shape of the output tensor
 min (Duplicable): (float, default -1.0) Minimum value of uniform random
 max (Duplicable): (float, default 1.0) Maximum value of uniform random
 seed (Duplicable): (int, default 0) Random seed used for generating samples. 0 means use a seed generated by the system.
 dtype (Duplicable): (int, default 5(FP32)) Output tensor data type
split_selected_rows
Split a SelectedRows with a specified rows section. height_sections is only needed when splitting the dims of the original tensor.
Example:
    Input: X.rows = {7, 5} X.height = 12
    Attr: height_sections = {4, 8}
    Out:
        out0.rows = {}, out0.height = 4
        out1.rows = {5, 7}, out1.height = 8
Inputs:  X : The input SelectedRows.
Outputs:  Out (Duplicable) : The outputs of input SelectedRows.
Attributes:  height_sections (Duplicable): Height for each output SelectedRows.
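The example above amounts to binning row ids by the cumulative height ranges, sketched here; the helper name is illustrative and row-id re-offsetting is deliberately omitted to match the example.

```python
def split_selected_rows(rows, height_sections):
    """Assign each row id to the section whose height range contains it."""
    outs, offset = [], 0
    for h in height_sections:
        outs.append(sorted(r for r in rows if offset <= r < offset + h))
        offset += h
    return outs
```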
adam
Adam Optimizer.
This implements the Adam optimizer from Section 2 of the Adam paper: https://arxiv.org/abs/1412.6980. Adam is a first-order gradient-based optimization method based on adaptive estimates of lower-order moments.
Adam updates:
$$ moment\_1\_out = \beta_1 * moment\_1 + (1 - \beta_1) * grad \\ moment\_2\_out = \beta_2 * moment\_2 + (1 - \beta_2) * grad * grad \\ learning\_rate = learning\_rate * \frac{\sqrt{1 - \beta_{2\_pow}}}{1 - \beta_{1\_pow}} \\ param\_out = param - learning\_rate * \frac{moment\_1\_out}{\sqrt{moment\_2\_out} + \epsilon} $$
Inputs:  Param : (Tensor) Input parameter
 Grad : (Tensor) Input gradient
 LearningRate : (Tensor) Learning rate
 Moment1 : (Tensor) Input first moment
 Moment2 : (Tensor) Input second moment
 Beta1Pow : (Tensor) Input beta1 power accumulator
 Beta2Pow : (Tensor) Input beta2 power accumulator
Outputs:  ParamOut : (Tensor) Output parameter
 Moment1Out : (Tensor) Output first moment
 Moment2Out : (Tensor) Output second moment
Attributes:  beta1 (Duplicable): (float, default 0.9) Exponential decay rate for the first moment estimates.
 beta2 (Duplicable): (float, default 0.999) exponential decay rate for the second moment estimates.
 epsilon (Duplicable): (float, default 1.0e8) Constant for numerical stability
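The update equations map directly onto a NumPy sketch; scalar inputs are used for brevity and the function name is illustrative.

```python
import numpy as np

def adam_step(param, grad, m1, m2, beta1_pow, beta2_pow,
              lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update following the equations above."""
    m1 = beta1 * m1 + (1 - beta1) * grad              # first moment
    m2 = beta2 * m2 + (1 - beta2) * grad * grad       # second moment
    lr_t = lr * np.sqrt(1 - beta2_pow) / (1 - beta1_pow)  # bias correction
    param = param - lr_t * m1 / (np.sqrt(m2) + eps)
    return param, m1, m2

p, m1, m2 = adam_step(1.0, 1.0, 0.0, 0.0, beta1_pow=0.9, beta2_pow=0.999)
```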
increment
Increment Operator.
The equation is: $$Out = X + step$$
Inputs:  X : (Tensor) The input tensor of increment operator
Outputs:  Out : (Tensor) The output tensor of increment operator.
Attributes:  step (Duplicable): (float, default 1.0) The step size by which the input tensor will be incremented.
gru_unit
GRUUnit Operator implements partial calculations of the GRU unit as following:
$$ update \ gate: u_t = actGate(xu_t + W_u * h_{t-1} + b_u) \\ reset \ gate: r_t = actGate(xr_t + W_r * h_{t-1} + b_r) \\ output \ candidate: \tilde{h}_t = actNode(xc_t + W_c * dot(r_t, h_{t-1}) + b_c) \\ output: h_t = dot((1 - u_t), h_{t-1}) + dot(u_t, \tilde{h}_t) $$
which is the same as one time step of the GRU Operator.
@note To implement the complete GRU unit, a fully-connected operator must be used before to feed xu, xr and xc as the Input of the GRUUnit operator.
Inputs:  Input : (Tensor) Matrix with shape [batch_size, frame_size * 3] for the input.
 HiddenPrev : (Tensor) Matrix with shape [batch_size, frame_size] for the states of previous time step.
 Weight : (Tensor) Weight matrix with shape [frame_size, frame_size * 3]. The elements continuous in memory can be divided into two parts. The first part are weights of the update gate and reset gate with shape [frame_size, frame_size * 2], and the second part are weights of output candidate with shape [frame_size, frame_size].
 Bias : (Tensor) Bias vector with shape [1, frame_size * 3] concatenating bias of the update gate, reset gate and output candidate.
Outputs:  Gate (Intermediate) : (Tensor) Matrix with shape [batch_size, frame_size * 3] for the output of update gate, reset gate and output candidate.
 ResetHiddenPrev (Intermediate) : (Tensor) Matrix with shape [batch_size, frame_size] for the reset hidden state of the previous time step.
 Hidden : (Tensor) The GRU hidden state of the current time step with shape [batch_size, frame_size].
Attributes:  activation (Duplicable): (enum int, default tanh) The activation type used for the output candidate \tilde{h}_t.
 gate_activation (Duplicable): (enum int, default sigmoid) The activation type used in update gate and reset gate.
less_than
less_than Operator
It operates element-wise on X and Y, and returns Out. Each of them is an N-dim tensor. X and Y could be any type. Each element of the Out tensor is calculated by Out = X < Y
Inputs:  X : (LoDTensor) the left hand operand of less_than operator
 Y : (LoDTensor) the right hand operand of less_than operator
Outputs:  Out : (LoDTensor) ndim bool tensor. Each element is Out = X < Y
Attributes:  axis (Duplicable): (int, default -1). The start dimension index for broadcasting Y onto X.
sequence_pool
Sequence Pool Operator.
The SequencePoolOp pools features of all time-steps of each instance. It supports six pooling types:
1. AVERAGE: $$Out[i] = \frac{\sum_j X_{ij}}{N}$$
2. SUM: $$Out[i] = \sum_j X_{ij}$$
3. SQRT: $$Out[i] = \frac{\sum_j X_{ij}}{\sqrt{len(X_i)}}$$
4. LAST: Out[i] = last instance in i-th sequence X[i]
5. FIRST: Out[i] = first instance in i-th sequence X[i]
6. MAX: $$Out[i] = max(X_i)$$
The following example explains how this works: For a minibatch of 3 variablelength sentences, containing 2, 3, and 2 timesteps:
Assume X is a [7,M,N] LoDTensor, and X->lod()[0] = [0, 2, 5, 7], where 7 = 2+3+2. Besides, for the sake of simplicity, we assume M=1 and N=1, and the value of X = [[1, 3], [2, 4, 6], [5, 1]].
Thus, Out is a [3,1,1] Tensor without LoD information. For different pooltype values, the value of Out is as follows:
 AVERAGE: [2, 4, 3], where 2=(1+3)/2, 4=(2+4+6)/3, 3=(5+1)/2
 SUM: [4, 12, 6], where 4=1+3, 12=2+4+6, 6=5+1
 SQRT: [2.82, 6.93, 4.24], where 2.82=(1+3)/sqrt(2), 6.93=(2+4+6)/sqrt(3), 4.24=(5+1)/sqrt(2)
 MAX: [3, 6, 5], where 3=max(1,3), 6=max(2,4,6), 5=max(5,1)
 LAST: [3, 6, 1], where 3=last(1,3), 6=last(2,4,6), 1=last(5,1)
 FIRST: [1, 2, 5], where 1=first(1,3), 2=first(2,4,6), 5=first(5,1)
Inputs:  X : (LoDTensor) The variablelength input of SequencePoolOp
Outputs:  Out : (Tensor) The output of SequencePoolOp does not contain LoD information.
 MaxIndex (Intermediate) : (Tensor<int>) This tensor is used for the sequence maxpooling to record the max indexes.
Attributes:  pooltype (Duplicable): (string, default 'AVERAGE') the pooling pooltype of SequencePoolOp.
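The worked example above can be reproduced with a small NumPy sketch that slices X by its LoD offsets and applies the chosen reduction; names are illustrative.

```python
import numpy as np

POOL = {
    'AVERAGE': np.mean,
    'SUM': np.sum,
    'MAX': np.max,
    'SQRT': lambda s: np.sum(s) / np.sqrt(len(s)),
    'FIRST': lambda s: s[0],
    'LAST': lambda s: s[-1],
}

def sequence_pool(x, lod, pooltype='AVERAGE'):
    """Pool each LoD-delimited sequence of x down to a single value."""
    return [float(POOL[pooltype](x[lod[i]:lod[i + 1]]))
            for i in range(len(lod) - 1)]

x = np.array([1, 3, 2, 4, 6, 5, 1], dtype=float)   # the three sequences above
```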
spp
With spatial pyramid pooling, the input image can be of any size. This not only allows arbitrary aspect ratios, but also allows arbitrary scales. We can resize the input image to any scale (e.g., min(w, h)=180, 224, ...) and apply the same deep network. When the input image is at different scales, the network (with the same filter sizes) will extract features at different scales. The scales play important roles in traditional methods. Input shape: $(N, C_{in}, H_{in}, W_{in})$ Output shape: $(H_{out}, W_{out})$ Where $$ H_{out} = N \\ W_{out} = \frac{4^{pyramid\_height} - 1}{4 - 1} * C_{in} $$ Paper: https://arxiv.org/pdf/1406.4729v4.pdf
Inputs:  X : (Tensor) The input tensor of spp operator. The format of input tensor is NCHW. Where N is batch size, C is the number of channels, H and W is the height and width of feature.
Outputs:  Out : (Tensor) The output tensor of spp operator, with shape N * M, where M = C * H * W.
Attributes:  pyramid_height (Duplicable): (int), multi level pooling
 pooling_type (Duplicable): (string), pooling type, can be "max" for maxpooling and "avg" for averagepooling.
sign
Sign operator
$$Out = X.sign()$$
Inputs:  X : (Tensor) Input tensor of sign operator.
Outputs:  Out : (Tensor) Output tensor of sign operator.
reduce_sum
ReduceSum Operator.
This operator computes the sum of input tensor along the given dimension. The result tensor has 1 fewer dimension than the input unless keep_dim is true. If reduce_all is true, just reduce along all dimensions and output a scalar.
Inputs:  X : (Tensor) The input tensor. Tensors with rank at most 6 are supported.
Outputs:  Out : (Tensor) The result tensor.
Attributes:  dim (Duplicable): (int, default 0) The dimension to reduce. Must be in the range [-rank(input), rank(input)). If `dim < 0`, the dim to reduce is `rank + dim`. Note that reducing on the first dim will make the LoD info lost.
 keep_dim (Duplicable): (bool, default false) If true, retain the reduced dimension with length 1.
 reduce_all (Duplicable): (bool, default false) If true, output a scalar reduced along all dimensions.
im2sequence
This op uses kernels to scan images and converts these images to sequences. After expanding, the number of time steps is output_height * output_width and the dimension of each time step is kernel_height * kernel_width * channels, in which:
output_height = 1 + (padding_up + padding_down + img_height - kernel_height + stride_height - 1) / stride_height; output_width = 1 + (padding_left + padding_right + img_width - kernel_width + stride_width - 1) / stride_width;
This op can be used after a convolutional neural network, and before a recurrent neural network.
Given:
x = [[[[ 6.  2.  1.]
   [ 8.  3.  5.]
   [ 0.  2.  6.]]
  [[ 2.  4.  4.]
   [ 6.  3.  0.]
   [ 6.  4.  7.]]]
 [[[ 6.  7.  1.]
   [ 5.  7.  9.]
   [ 2.  4.  8.]]
  [[ 1.  2.  1.]
   [ 1.  3.  5.]
   [ 9.  0.  8.]]]]

x.dims = {2, 2, 3, 3}
And:
kernels = [2, 2] strides = [1, 1] paddings = [0, 0, 0, 0]
Then:
output.data = [[ 6.  2.  8.  3.  2.  4.  6.  3.]
 [ 2.  1.  3.  5.  4.  4.  3.  0.]
 [ 8.  3.  0.  2.  6.  3.  6.  4.]
 [ 3.  5.  2.  6.  3.  0.  4.  7.]
 [ 6.  7.  5.  7.  1.  2.  1.  3.]
 [ 7.  1.  7.  9.  2.  1.  3.  5.]
 [ 5.  7.  2.  4.  1.  3.  9.  0.]
 [ 7.  9.  4.  8.  3.  5.  0.  8.]]
output.dims = {8, 9}
output.lod = [[0, 4, 8]]
Inputs:  X : (Tensor) The input tensor has NCHW format.N: batch sizeC: channelsH: heightW: width
Outputs:  Out : (LoDTensor) The output data of im2sequence op.
Attributes:  kernels (Duplicable): (vector<int>), the kernels(kernel_height, kernel_width)
 strides (Duplicable): (vector<int> default:{1, 1}), the strides(h_stride, w_stride)
 paddings (Duplicable): (vector<int> default:{0, 0, 0, 0}), the paddings(up_pad, left_pad, down_pad, right_pad)
stanh
STanh Activation Operator.
$$out = b * \frac{e^{a * x} - e^{-a * x}}{e^{a * x} + e^{-a * x}}$$
Inputs:  X : Input of STanh operator
Outputs:  Out : Output of STanh operator
Attributes:  scale_a (Duplicable): The scale parameter of a for the input
 scale_b (Duplicable): The scale parameter of b for the input
adamax
Adamax Optimizer.
We implement the Adamax optimizer from Section 7 of the Adam paper: https://arxiv.org/abs/1412.6980. Adamax is a variant of the Adam algorithm based on the infinity norm.
Adamax updates:
$$ moment\_out = \beta_1 * moment + (1 - \beta_1) * grad \\ inf\_norm\_out = max(\beta_2 * inf\_norm + \epsilon, |grad|) \\ learning\_rate = \frac{learning\_rate}{1 - \beta_{1\_pow}} \\ param\_out = param - learning\_rate * \frac{moment\_out}{inf\_norm\_out} $$
The original paper does not have an epsilon attribute. However, it is added here for numerical stability to prevent the division by 0 error.
Inputs:  Param : (Tensor) Input parameter
 Grad : (Tensor) Input gradient
 LearningRate : (Tensor) Learning rate
 Moment : (Tensor) First moment
 InfNorm : (Tensor) Input exponentially weighted infinity norm
 Beta1Pow : (Tensor) Input beta1 power accumulator
Outputs:  ParamOut : (Tensor) Output parameter
 MomentOut : (Tensor) Output first moment
 InfNormOut : (Tensor) Output exponentially weighted infinity norm
Attributes:  beta1 (Duplicable): (float, default 0.9) Exponential decay rate for the 1st moment estimates.
 beta2 (Duplicable): (float, default 0.999) exponential decay rate for the weighted infinity norm estimates.
 epsilon (Duplicable): (float, default 1.0e8) Constant for numerical stability
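The Adamax equations above can also be written out as a scalar NumPy sketch; the function name is illustrative.

```python
import numpy as np

def adamax_step(param, grad, moment, inf_norm, beta1_pow,
                lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adamax update following the equations above."""
    moment = beta1 * moment + (1 - beta1) * grad
    inf_norm = np.maximum(beta2 * inf_norm + eps, np.abs(grad))
    lr_t = lr / (1 - beta1_pow)                 # bias correction
    param = param - lr_t * moment / inf_norm
    return param, moment, inf_norm

p, m, n = adamax_step(1.0, 1.0, 0.0, 0.0, beta1_pow=0.9)
```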
tanh_shrink
TanhShrink Activation Operator.
$$out = x - \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
Inputs:  X : Input of TanhShrink operator
Outputs:  Out : Output of TanhShrink operator
positive_negative_pair
PositiveNegativePairOp can be used to evaluate Learning To Rank(LTR) model's performance.
Within some context, e.g. the "query", an LTR model generates scores for a list of items, which gives a partial order of the items. PositiveNegativePairOp takes a list of reference rank orders (Input(Label)) and the model-generated scores (Input(Score)) as inputs and counts the pairs that are ranked correctly and incorrectly.
Inputs:  Score : (Tensor, float) Model Score on an item (with respect to QueryID). It's a 2D tensor with shape [batch_size, depth], where the column specified by the attribute "column" is used as item score.
 Label : (Tensor, float) Label of an item (with respect to QueryID). It's a 2-D tensor with shape [batch_size, 1].
 QueryID : (Tensor, int64) Query ID that indicates the context. Its shape should be the same as Label.
 AccumulatePositivePair : (float) Optional. The accumulated number of positive pairs over a stream of data. If provided, the output PositivePair will be initialized with this number rather than 0. It won't be modified in place.
 AccumulateNegativePair : (float) Optional. The accumulated number of negative pairs over a stream of data. If provided, the output NegativePair will be initialized with this number rather than 0. It won't be modified in place.
 AccumulateNeutralPair : (float) Optional. The accumulated number of neutral pairs over a stream of data. If provided, the output NeutralPair will be initialized with this number rather than 0. It won't be modified in place.
 Weight : (float) Optional. Weight of current item. If specified, its shape should be the same as Label, and the meaning of the output changes from numbers of pairs to the total sum of pairs' weights. Weight of a pair of items is the average of their weights.
Outputs:  PositivePair : (float) Number of positive pairs, i.e. the pairs of items that are ranked correctly.
 NegativePair : (float) Number of negative pairs, i.e. the pairs of items that are ranked incorrectly.
 NeutralPair : (float) Number of neutral pairs, i.e. the pairs of items that have the same score.
Attributes:  column (Duplicable): (int, default -1) The column position of Score used to rank items in descending order. It must be in the range of [-rank(Score), rank(Score)). If `column < 0`, the actual column to use is `rank + column`.
one_hot
One Hot Operator. This operator creates the onehot representations for input index values. The following example will help to explain the function of this operator:
X is a LoDTensor:
    X.lod = [[0, 1, 4]]
    X.shape = [4, 1]
    X.data = [[1], [1], [3], [0]]
set depth = 4
Out is a LoDTensor:
    Out.lod = [[0, 1, 4]]
    Out.shape = [4, 4]
    Out.data = [[0., 1., 0., 0.],
                [0., 1., 0., 0.],
                [0., 0., 0., 1.],
                [1., 0., 0., 0.]]
Inputs:  X : (LoDTensor, LoDTensor<int>) Input variable with rank at least 2. The last dimension of X should be 1. Each value of X is an index to indicate the position.
Outputs:  Out : (Tensor, Tensor<float>) Output tensor with same rank as X. The tensor consists of onehot representations of values in X.
Attributes:  depth (Duplicable): A positive integer to specify the length of onehot vector.
 dtype (Duplicable): An integer to specify the data type of onehot vector. The default value is FP32.
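The example above can be reproduced with NumPy fancy indexing; the helper name is illustrative.

```python
import numpy as np

def one_hot(x, depth):
    """x: integer indices with last dim 1; returns one-hot rows of length depth."""
    flat = x.reshape(-1)
    out = np.zeros((flat.size, depth), dtype=np.float32)
    out[np.arange(flat.size), flat] = 1.0   # set one position per row
    return out

out = one_hot(np.array([[1], [1], [3], [0]]), depth=4)
```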
l1_norm
L1 Norm Operator.
Computes the L1 norm of a tensor.
$$Out = \sum{|X|}$$
Inputs:  X : (Tensor) The input of l1_norm op.
Outputs:  Out : (Scalar) The output of l1_norm op.
create_random_data_generator
CreateRandomDataGenerator Operator. This Op creates a random reader. The reader generates random data instead of really reading from files. Generated data follow a uniform distribution between 'min' and 'max'.
Inputs: Outputs:  Out : (ReaderHolder) The created random reader.
Attributes:  shape_concat (Duplicable): The concat of all data's shapes.
 ranks (Duplicable): The ranks of each data, e.g. shape_concat = [2,3,4,5,6], ranks = [3,2]. It means the reader will generate two data items each time, whose shapes are [2,3,4] and [5,6] respectively.
 min (Duplicable): The lower bound of reader's uniform distribution.
 max (Duplicable): The upper bound of reader's uniform distribution.
roi_pool
ROIPool operator
ROI Pooling for Faster-RCNN. The link below is a further introduction: https://stackoverflow.com/questions/43430056/what-is-roi-layer-in-fast-rcnn
Inputs:  X : (Tensor), the input of ROIPoolOp. The format of input tensor is NCHW. Where N is batch size, C is the number of input channels, H is the height of the feature, and W is the width of the feature.
 ROIs : (Tensor), ROIs (Regions of Interest) to pool over. Should be a 2-D tensor of shape (num_rois, 5), given as [[batch_id, x1, y1, x2, y2], …]. Where batch_id is the id of the data, (x1, y1) is the top left coordinates, and (x2, y2) is the bottom right coordinates.
Outputs:  Out : (Tensor), The output of ROIPoolOp is a 4D tensor with shape (num_rois, channels, pooled_h, pooled_w).
 Argmax (Intermediate) : (Tensor), Argmaxes corresponding to indices in X used for gradient computation. Only output if arg “is_test” is false.
Attributes:  spatial_scale (Duplicable): (float, default 1.0), Multiplicative spatial scale factor to translate ROI coords from their input scale to the scale used when pooling.
 pooled_height (Duplicable): (int, default 1), The pooled output height.
 pooled_width (Duplicable): (int, default 1), The pooled output width.
pow
Pow Activation Operator.
$out = x^{factor}$
Inputs:  X : Input of Pow operator
Outputs:  Out : Output of Pow operator
Attributes:  factor (Duplicable): The exponential factor of Pow
unpool
Input shape is: $(N, C_{in}, H_{in}, W_{in})$, Output shape is: $(N, C_{out}, H_{out}, W_{out})$, where $$ H_{out} = (H_{in} - 1) * strides[0] - 2 * paddings[0] + ksize[0] \\ W_{out} = (W_{in} - 1) * strides[1] - 2 * paddings[1] + ksize[1] $$ Paper: http://www.matthewzeiler.com/wp-content/uploads/2017/07/iccv2011.pdf
Inputs:  X : (Tensor) The input tensor of unpool operator. The format of input tensor is NCHW. Where N is batch size, C is the number of channels, H and W is the height and width of feature.
 Indices : (Tensor) The input tensor of the indices given out by MaxPool2d. The format of input tensor is NCHW. Where N is batch size, C is the number of channels, H and W is the height and width of feature.
Outputs:  Out : (Tensor) The output tensor of unpool operator.The format of output tensor is also NCHW.Where N is batch size, C is the number of channels, H and W is the height and width of feature.
Attributes:  ksize (Duplicable): (vector), the unpooling window size(height, width) of unpooling operator.
 strides (Duplicable): (vector, default:{1, 1}), strides (height, width) of unpooling operator.
 paddings (Duplicable): (vector default:{0,0}), paddings (height, width) of unpooling operator.
 unpooling_type (Duplicable): (string), unpooling type, can be "max" for maxunpooling
transpose
Transpose Operator.
The input tensor will be permuted according to the axes given. The behavior of this operator is similar to how numpy.transpose works.
Suppose the input X is a 2-D tensor: $$ X = \begin{pmatrix} 0 &1 &2 \\ 3 &4 &5 \end{pmatrix}$$ the given axes is: $[1, 0]$, and $Y$ = transpose($X$, axis), then the output $Y$ is: $$ Y = \begin{pmatrix} 0 &3 \\ 1 &4 \\ 2 &5 \end{pmatrix}$$
Given an input tensor with shape $(N, C, H, W)$ and the axes is $[0, 2, 3, 1]$, then the shape of the output tensor will be: $(N, H, W, C)$.
Inputs:  X : (Tensor) The input tensor, tensors with rank up to 6 are supported.
Outputs:  Out : (Tensor)The output tensor.
Attributes:  axis (Duplicable): (vector<int>) A list of values, and the size of the list should be the same with the input tensor rank. This operator permutes the input tensor's axes according to the values given.
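Both examples above map directly onto numpy.transpose, which this operator mimics:

```python
import numpy as np

x = np.array([[0, 1, 2], [3, 4, 5]])
y = np.transpose(x, (1, 0))               # the axes = [1, 0] example above

nchw = np.zeros((2, 3, 4, 5))
nhwc = np.transpose(nchw, (0, 2, 3, 1))   # NCHW -> NHWC layout change
```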

rnn_memory_helper_grad
Inputs:  Out@GRAD :
 X :
 Out :
Outputs:  X@GRAD :
Attributes:  dtype (Duplicable): (int, default 5 (FP32)) Output data type
lstmp
Long Short-Term Memory with recurrent Projection layer (LSTMP) Operator.
LSTMP has a separate projection layer after the LSTM layer, projecting the original hidden state to a lower-dimensional one. This is proposed to reduce the number of total parameters, and furthermore the computational complexity of the LSTM, especially when the number of output units is relatively large (https://research.google.com/pubs/archive/43905.pdf).
The formula is as follows:
$$ i_t = \sigma(W_{ix}x_{t} + W_{ir}r_{t-1} + W_{ic}c_{t-1} + b_i) \\ f_t = \sigma(W_{fx}x_{t} + W_{fr}r_{t-1} + W_{fc}c_{t-1} + b_f) \\ \tilde{c_t} = act_g(W_{cx}x_t + W_{cr}r_{t-1} + b_c) \\ o_t = \sigma(W_{ox}x_{t} + W_{or}r_{t-1} + W_{oc}c_t + b_o) \\ c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c_t} \\ h_t = o_t \odot act_h(c_t) \\ r_t = \overline{act_h}(W_{rh}h_t) $$
where the W terms denote weight matrices (e.g. $W_{ix}$ is the matrix of weights from the input to the input gate), and $W_{ic}, W_{fc}, W_{oc}$ are diagonal weight matrices for peephole connections. In our implementation, we use vectors to represent these diagonal weight matrices. The b terms denote bias vectors ($b_i$ is the input gate bias vector), $\sigma$ is the activation, such as the logistic sigmoid function, and $i, f, o$ and $c$ are the input gate, forget gate, output gate, and cell activation vectors, respectively, all of which have the same size as the cell output activation vector $h$. Here $h$ is usually called the hidden state and $r$ denotes its recurrent projection. $\tilde{c_t}$ is also called the candidate hidden state, whose computation is based on the current input and the previous hidden state.
$\odot$ is the element-wise product of vectors. $act_g$ and $act_h$ are the cell input and cell output activation functions, and `tanh` is usually used for them. $\overline{act_h}$ is the activation function for the projection output, usually `identity` or the same as $act_h$.
Note that the $W_{ix}x_{t}, W_{fx}x_{t}, W_{cx}x_{t}, W_{ox}x_{t}$ operations on the input $x_{t}$ are NOT included in this operator. Users can choose to use a fully-connected operator before the LSTMP operator.
Inputs:  Input : (LoDTensor) the input for sequence data, which supports variable-length input sequences. The underlying tensor in this LoDTensor is a matrix with shape (T X 4D), where T is the total number of time steps in this minibatch and D is the hidden size.
 H0 : (Tensor, optional) the initial hidden state is an optional input. This is a tensor with shape (N x D), where N is the batch size and D is the hidden size.
 C0 : (Tensor, optional) the initial cell state is an optional input. This is a tensor with shape (N x D), where N is the batch size. `C0` should not be null if `H0` provided.
 Weight : (Tensor) the learnable hidden-hidden weights.  The shape is (P x 4D), where P is the projection layer size and D is the hidden size.  Weight = {W_cr, W_ir, W_fr, W_or}
 ProjWeight : (Tensor) the learnable weight of the projection layer.  The shape is (D x P), where P is the recurrent projection layer size and D is the hidden size.  ProjWeight = {W_rh}
 Bias : (Tensor) the learnable biases, which contains two parts: input-hidden biases and peephole connection weights if `use_peepholes` is set to `True`. 1. `use_peepholes = False`  The shape is (1 x 4D).  Bias = {b_c, b_i, b_f, b_o}. 2. `use_peepholes = True`  The shape is (1 x 7D).  Bias = {b_c, b_i, b_f, b_o, W_ic, W_fc, W_oc}.
Outputs:  Projection : (LoDTensor) the projection of the hidden state of LSTMP operator. The shape is (T x P), and LoD is the same with the `Input`.
 Cell : (LoDTensor) the cell state of LSTMP operator. The shape is (T x D), and lod is the same with the `Input`.
 BatchGate (Intermediate) : (LoDTensor) This LoDTensor contains input gate, forget gate and output gate after the activations. This LoDTensor has the same shape as the reorganized input, which is also be called batch input. The LoD size is 2. The firstlevel LoD is the batch offsets and the second contains the indices, which denotes the position of reorganized sequence in the raw input.
 BatchCellPreAct (Intermediate) : (LoDTensor) the preactivation cell state reorganized in batch. This LoDTensor is obtained in the forward and used in the backward.
 BatchHidden (Intermediate) : (LoDTensor) the hidden state reorganized in batch. This LoDTensor is obtained in the forward and used in the backward.
 OrderedP0 (Intermediate) : (Tensor) the projection of the initial hidden state H0. This is a tensor with shape (N x P), where N is the batch size and P is the projection size.
Attributes:  use_peepholes (Duplicable): (bool, default: True) whether to enable diagonal/peephole connections.
 is_reverse (Duplicable): (bool, default: False) whether to compute reversed LSTMP.
 gate_activation (Duplicable): (string, default: sigmoid) The activation for input gate, forget gate and output gate, `sigmoid` by default.
 cell_activation (Duplicable): (string, default: tanh) The activation for cell output, `tanh` by default.
 candidate_activation (Duplicable): (string, default: tanh) The activation for candidate hidden state, `tanh` by default.
 proj_activation (Duplicable): (string, default: tanh) The activation for projection output, `tanh` by default.
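The equations above can be sketched for a single time step with NumPy. This is an illustrative sketch only: the function name, argument layout and the precomputed input projection `x_proj` are assumptions, and the real operator fuses the gates and processes whole sequences.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstmp_step(x_proj, r_prev, c_prev, W_r, W_peep, b, W_rh):
    """One LSTMP step. x_proj: (4D,) precomputed W_{*x} x_t (the doc notes
    these input projections are NOT part of the operator); W_r: (P, 4D)
    recurrent weights; W_peep: (3, D) peephole vectors [W_ic, W_fc, W_oc];
    b: (4D,) biases; W_rh: (D, P) projection weights."""
    z = x_proj + r_prev @ W_r + b          # fused pre-activations, (4D,)
    zi, zf, zc, zo = np.split(z, 4)
    i = sigmoid(zi + W_peep[0] * c_prev)   # input gate
    f = sigmoid(zf + W_peep[1] * c_prev)   # forget gate
    c_tilde = np.tanh(zc)                  # candidate cell state
    c = f * c_prev + i * c_tilde           # new cell state
    o = sigmoid(zo + W_peep[2] * c)        # output gate peeks at new c
    h = o * np.tanh(c)                     # hidden state
    r = np.tanh(h @ W_rh)                  # recurrent projection
    return r, c

rng = np.random.default_rng(0)
D, P = 4, 3
r, c = lstmp_step(rng.standard_normal(4 * D), np.zeros(P), np.zeros(D),
                  rng.standard_normal((P, 4 * D)),
                  rng.standard_normal((3, D)),
                  np.zeros(4 * D), rng.standard_normal((D, P)))
assert r.shape == (P,) and c.shape == (D,)
```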
target_assign
This operator, given the encoded boxes between prior boxes and ground-truth boxes and the ground-truth class labels, assigns classification and regression targets to each prior box as well as a weight to each prior box. The weights are used to specify which prior boxes should not contribute to the training loss.
For each instance, the outputs `PredBBoxLabel`, `PredBBoxWeight`, `PredScoreLabel` and `PredScoreWeight` are assigned based on `MatchIndices`. Assuming that the row offset for each instance in `EncodedGTBBox` is called lod, this operator assigns classification/regression targets by performing the following steps:
1. Assigning all outputs based on `MatchIndices`:
If id = MatchIndices[i][j] > -1,
PredBBoxLabel[i][j] = EncodedGTBBox[lod[i] + id][j] PredBBoxWeight[i][j] = 1. PredScoreLabel[i][j] = GTScoreLabel[lod[i] + id] PredScoreWeight[i][j] = 1.
Otherwise,
PredBBoxLabel[i][j] = [0., 0., 0., 0.] PredBBoxWeight[i][j] = 0. PredScoreLabel[i][j] = background_label PredScoreWeight[i][j] = 0.
2. Assigning PredScoreWeight based on `NegIndices`:
Assuming that the row offset for each instance in `NegIndices` is called neg_lod, for the i-th instance and all ids of NegIndices in this instance:
PredScoreLabel[i][id] = background_label PredScoreWeight[i][id] = 1.0
Inputs:  EncodedGTBBox : (LoDTensor), The encoded groundtruth bounding boxes with shape [Ng, Np, 4], where Ng is the total number of groundtruth boxes in this minibatch, Np the number of predictions, 4 is the number of coordinate in [xmin, ymin, xmax, ymax] layout.
 GTScoreLabel : (LoDTensor, default LoDTensor<int>), The input groundtruth labels with shape [Ng, 1], where the Ng is the same as it in the input of EncodedGTBBox.
 MatchIndices : (Tensor, default Tensor<int>), The input matched indices with shape [N, Np], where N is the batch size, Np is the same as it in the input of EncodedGTBBox. If MatchIndices[i][j] is -1, the j-th prior box is not matched to any ground-truth box in the i-th instance.
 NegIndices : (LoDTensor, default LoDTensor<int>), The input negative example indices with shape [Neg, 1], where Neg is the total number of negative example indices.
Outputs:  PredBBoxLabel : (Tensor), The output encoded ground-truth labels with shape [N, Np, 4], N is the batch size and Np, 4 are the same as they are in the input of EncodedGTBBox. If MatchIndices[i][j] is -1, PredBBoxLabel[i][j][:] is the encoded ground-truth box for background_label in the i-th instance.
 PredBBoxWeight : (Tensor), The weight for PredBBoxLabel with the shape of [N, Np, 1]
 PredScoreLabel : (Tensor, default Tensor<int>), The output score labels for each prediction with shape [N, Np, 1]. If MatchIndices[i][j] is -1, PredScoreLabel[i][j] = background_label.
 PredScoreWeight : (Tensor), The weight for PredScoreLabel with the shape of [N, Np, 1]
Attributes:  background_label (Duplicable): (int, default 0), Label index of background class.
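The two assignment steps above can be sketched for a single instance with NumPy. All shapes and variable names here are illustrative, not the operator's API: one instance with Np = 3 prior boxes and Ng = 2 ground-truth boxes.

```python
import numpy as np

enc_gt_bbox = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)  # [Ng, Np, 4]
gt_score_label = np.array([5, 7])        # [Ng] ground-truth class labels
match_indices = np.array([1, -1, 0])     # [Np]; -1 means unmatched
background_label = 0
lod_i = 0                                # row offset of this instance

Np = match_indices.shape[0]
bbox_label = np.zeros((Np, 4))
bbox_weight = np.zeros(Np)
score_label = np.full(Np, background_label)
score_weight = np.zeros(Np)

# Step 1: assign all outputs based on MatchIndices.
for j, idx in enumerate(match_indices):
    if idx > -1:                         # matched: copy targets, weight 1
        bbox_label[j] = enc_gt_bbox[lod_i + idx, j]
        bbox_weight[j] = 1.0
        score_label[j] = gt_score_label[lod_i + idx]
        score_weight[j] = 1.0

# Step 2: hard negatives (NegIndices) get background label with weight 1.
neg_indices = [1]
for nid in neg_indices:
    score_label[nid] = background_label
    score_weight[nid] = 1.0
```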
mean
Mean Operator.
Out is a scalar which is the mean of all elements in X.
Inputs:  X : The input of mean op
Outputs:  Out : The output of mean op
precision_recall
Precision Recall Operator.
When given Input(Indices) and Input(Labels), this operator can be used to compute various metrics including: 1. macro average precision 2. macro average recall 3. macro f1 score 4. micro average precision 5. micro average recall 6. micro f1 score
To compute the above metrics, we need to do statistics for true positives, false positives and false negatives. Here the count of true negatives is not strictly necessary, but counting it is cheap and may prove useful, so the operator also provides the count of true negatives.
We define state as a 2D tensor with shape [class_number, 4]. Each row of a state contains statistic variables for corresponding class. Layout of each row is: TP(true positives), FP(false positives), TN(true negatives), FN(false negatives). If Input(Weights) is provided, TP, FP, TN, FN will be calculated by given weight instead of the instance count.
This operator also supports metrics computing for crossbatch situation. To achieve this, Input(StatesInfo) should be provided. State of current batch data will be accumulated to Input(StatesInfo) and Output(AccumStatesInfo) is the accumulation state.
Output(BatchMetrics) is metrics of current batch data while Output(AccumStatesInfo) is metrics of accumulation data.
Inputs:  MaxProbs : (Tensor, default Tensor<float>) A 2D tensor with shape N x 1, where N is the batch size. Each row contains the max probability of an instance which computed by the previous top_k (k=1) operator.
 Indices : (Tensor, default Tensor<int>) A 2D tensor with shape N x 1, where N is the batch size. Each row contains the corresponding index which computed by the previous top_k (k=1) operator.
 Labels : (Tensor, default Tensor<int>) A 2D tensor with shape N x 1, where N is the batch size. Each element is a label and the value should be in [0, class_number - 1].
 Weights : (Tensor, default Tensor<float>) A 2D tensor with shape N x 1, where N is the batch size. This input is optional. If provided, weight of instance would be considered when computing metrics.
 StatesInfo : (Tensor, default Tensor<int>) A 2D tensor with shape D x 4, where D is the number of classes. This input is optional. If provided, current state will be accumulated to this state and the accumulation state will be the output state.
Outputs:  BatchMetrics : (Tensor, default Tensor<float>) A 1D tensor with shape {6}. This output tensor contains metrics for current batch data. The layout is [macro average precision, macro average recall, macro f1 score, micro average precision, micro average recall, micro f1 score].
 AccumMetrics : (Tensor, default Tensor<float>) A 1D tensor with shape {6}. This output tensor contains metrics for accumulated data. The layout is [macro average precision, macro average recall, macro f1 score, micro average precision, micro average recall, micro f1 score].
 AccumStatesInfo : (Tensor, default Tensor<float>) A 2D tensor with shape D x 4, where D is equal to class number. This output tensor contains accumulated state variables used to compute metrics. The layout for each class is [true positives, false positives, true negatives, false negatives].
Attributes:  class_number (Duplicable): (int) Number of classes to be evaluated.
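The six metrics can be derived from the [TP, FP, TN, FN] state tensor described above. The sketch below is illustrative (function and variable names are assumptions): macro metrics average per-class ratios, micro metrics pool the counts first.

```python
import numpy as np

def metrics_from_state(state):
    """state: float array [class_number, 4], columns TP, FP, TN, FN.
    Returns the 6-element layout used by BatchMetrics/AccumMetrics."""
    tp, fp, tn, fn = state[:, 0], state[:, 1], state[:, 2], state[:, 3]

    def safe_div(a, b):
        return np.where(b > 0, a / np.maximum(b, 1e-12), 0.0)

    # Macro: average the per-class ratios.
    prec_c = safe_div(tp, tp + fp)
    rec_c = safe_div(tp, tp + fn)
    f1_c = safe_div(2 * prec_c * rec_c, prec_c + rec_c)
    macro_p, macro_r, macro_f1 = prec_c.mean(), rec_c.mean(), f1_c.mean()

    # Micro: pool the counts over all classes first.
    micro_p = tp.sum() / max(tp.sum() + fp.sum(), 1e-12)
    micro_r = tp.sum() / max(tp.sum() + fn.sum(), 1e-12)
    micro_f1 = 2 * micro_p * micro_r / max(micro_p + micro_r, 1e-12)
    return [macro_p, macro_r, macro_f1, micro_p, micro_r, micro_f1]

# Two classes: class 0 -> TP=2, FP=1, FN=0; class 1 -> TP=1, FP=0, FN=1.
state = np.array([[2., 1., 0., 0.],
                  [1., 0., 0., 1.]])
m = metrics_from_state(state)
```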
softplus
Softplus Activation Operator.
$out = ln(1 + e^{x})$
Inputs:  X : Input of Softplus operator
Outputs:  Out : Output of Softplus operator
get_places
Returns a list of places based on flags. The list will be used for parallel execution.
Inputs: Outputs:  Out : vector of Place
Attributes:  device_count (Duplicable): device count
 device_type (Duplicable): device type
read_from_array
ReadFromArray Operator.
Read a LoDTensor from a LoDTensor Array.
Assume $T$ is LoDTensor, $i$ is the subscript of the array, and $A$ is the array. The equation is
$$T = A[i]$$
Inputs:  X : (TensorArray) the array the tensor will be read from.
 I : (Tensor) the subscript index in the tensor array. The number of elements should be 1
Outputs:  Out : (LoDTensor) the tensor read from the array.
rnn_memory_helper
Inputs:  X :
Outputs:  Out :
Attributes:  dtype (Duplicable): (int, default 5 (FP32)) Output data type
shrink_rnn_memory
This operator is used to shrink the output batch of a memory defined in dynamic RNN.
Dynamic RNN is able to handle variable-length sequences, in which the sequences in a minibatch are sorted by their lengths first. After that, the longest sequence becomes the first one in the sorted batch, followed by the second longest, the third longest, and so on. Dynamic RNN then slices a batch input timestep by timestep from the sorted input. Once any sequence in the input batch reaches its end, memory defined in dynamic RNN has to shrink its outputs to adapt to the input batch size for the next time step.
Inputs:  X : (LoDTensor) The RNN step memory to be shrunk.
 RankTable : (LoDRankTable) The lod_rank_table of dynamic RNN.
 I : (LoDTensor) The step index. The RNN step memory 'X' will be shrunk to match the size of the input of the index'th step.
Outputs:  Out : (LoDTensor) The shrunk RNN step memory.
merge_lod_tensor
Merge True and False branches of LoDTensor into a single Output, with a mask at certain lod level. X is used to obtain complete lod information. Please refer to SplitLoDTensorOp.
Inputs:  X : The input LoDTensor, contains complete lod information to construct the output
 Mask : A bool column vector which mask the input
 InTrue : The True branch to be merged
 InFalse : The False branch to be merged
Outputs:  Out : The merged output LoDTensor
Attributes:  level (Duplicable): (int) the specific lod level to rank.
reshape
Reshape Operator.
Reshape Input(X) into the shape specified by Attr(shape).
An example: Given a 2D tensor X with 2 rows and 2 columns : [[1, 2], [3, 4]]
and target shape = [1, 4], the reshape operator will transform the tensor X into a 2D tensor: [[1, 2, 3, 4]]
One dimension in the target shape can be set to -1, representing that its size is unknown. In this case, the real dimension will be inferred from the original shape of Input(X) and the other dimensions in the target shape.
Inputs:  X : The input tensor of reshape operator.
Outputs:  Out : The output tensor of reshape operator.
Attributes:  shape (Duplicable): (vector<int>) Target shape of reshape operator.
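NumPy's reshape has the same semantics as described above, including the -1 inference rule (an illustrative sketch):

```python
import numpy as np

# The example from the text: [[1, 2], [3, 4]] reshaped to [1, 4].
X = np.array([[1, 2], [3, 4]])
out = X.reshape(1, 4)           # explicit target shape
inferred = X.reshape(-1, 4)     # the -1 dimension is inferred as 1
assert out.tolist() == [[1, 2, 3, 4]]
assert inferred.shape == (1, 4)
```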
sigmoid_cross_entropy_with_logits
SigmoidCrossEntropyWithLogits Operator.
This measures the elementwise probability error in classification tasks in which each class is independent. This can be thought of as predicting labels for a datapoint, where labels are not mutually exclusive. For example, a news article can be about politics, technology or sports at the same time or none of these.
The logistic loss is given as follows:
$$loss = -Labels * \log(\sigma(X)) - (1 - Labels) * \log(1 - \sigma(X))$$
We know that $$\sigma(X) = (1 / (1 + \exp(-X)))$$. By substituting this we get:
$$loss = X - X * Labels + \log(1 + \exp(-X))$$
For stability and to prevent overflow of $$\exp(-X)$$ when X < 0, we reformulate the loss as follows:
$$loss = \max(X, 0) - X * Labels + \log(1 + \exp(-|X|))$$
Both the input `X` and `Labels` can carry the LoD (Level of Details) information. However, the output only shares the LoD with input `X`.
Inputs:  X : (Tensor, default Tensor<float>), a 2D tensor with shape N x D, where N is the batch size and D is the number of classes. This input is a tensor of logits computed by the previous operator. Logits are unscaled log probabilities given as log(p/(1-p)).
 Label : (Tensor, default Tensor<float>), a 2D tensor of the same type and shape as X. This input is a tensor of probabilistic labels for each logit
Outputs:  Out : (Tensor, default Tensor<float>), a 2D tensor with shape N x D of elementwise logistic losses.
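The stable reformulation above can be checked against the naive definition with NumPy (an illustrative sketch; the function name is an assumption):

```python
import numpy as np

# Numerically stable form: loss = max(X, 0) - X * Labels + log(1 + exp(-|X|)).
def sigmoid_ce_with_logits(x, labels):
    return np.maximum(x, 0) - x * labels + np.log1p(np.exp(-np.abs(x)))

x = np.array([[-100.0, -2.0, 0.0, 3.0]])
labels = np.array([[0.0, 1.0, 0.5, 1.0]])
loss = sigmoid_ce_with_logits(x, labels)

# Naive form for the moderate logits; the stable form must agree with it.
p = 1.0 / (1.0 + np.exp(-x[:, 1:]))
naive = -labels[:, 1:] * np.log(p) - (1 - labels[:, 1:]) * np.log(1 - p)
assert np.allclose(loss[:, 1:], naive)
assert np.isfinite(loss).all()   # no overflow even at x = -100
```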
fill
Fill operator
Fill a tensor with `value` and `shape`. The dtype of the tensor is specified by `dtype`.
Inputs: Outputs:  Out : (LoDTensor) The output tensor.
Attributes:  value (Duplicable): The float values of tensor, which are flatten in row major
 shape (Duplicable): The shape of output tensor
 dtype (Duplicable): The data type of output tensor, Default is float
 force_cpu (Duplicable): Whether the output tensor must be at CPU memory or not. Default is false.
sequence_reshape
Sequence Reshape Operator.
This operator will rearrange the input sequences. The new dimension is set by the attribute, and the length of each sequence may become longer or shorter, which is determined by the original length, the original dimension, and the new dimension. The following example will help to illustrate the function of this operator:
x is a LoDTensor: x.lod = [[0, 2, 6]] x.data = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]] x.dims = [6, 2]
set new_dim = 4
then out is a LoDTensor: out.lod = [[0, 1, 3]] out.data = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]] out.dims = [3, 4]
Currently, only 1level LoDTensor is supported and please make sure (original length * original dimension) can be divided by new_dim with no remainder for each sequence.
Inputs:  X : (LoDTensor, default LoDTensor<float>) A 2D LoDTensor with shape being [N, M].
Outputs:  Out : (LoDTensor, default LoDTensor<float>) A 2D LoDTensor with shape [T, new_dim] where T is calculated based on X.lod, M and new_dim.
Attributes:  new_dim (Duplicable): Sequence dimension of the output LoDTensor.
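The example above can be sketched with NumPy: each sequence's payload is flattened and re-chunked to new_dim, and the LoD offsets are rebuilt accordingly. This is an illustrative helper, not the operator's API.

```python
import numpy as np

def sequence_reshape(x, lod, new_dim):
    out_rows, new_lod = [], [0]
    for start, end in zip(lod[:-1], lod[1:]):
        flat = x[start:end].reshape(-1)       # seq_len * old_dim values
        assert flat.size % new_dim == 0, "must divide with no remainder"
        chunk = flat.reshape(-1, new_dim)
        out_rows.append(chunk)
        new_lod.append(new_lod[-1] + chunk.shape[0])
    return np.concatenate(out_rows), new_lod

x = np.arange(1, 13).reshape(6, 2)            # x.data from the example
out, lod = sequence_reshape(x, [0, 2, 6], new_dim=4)
assert lod == [0, 1, 3]
assert out.tolist() == [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
```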
huber_loss
HuberLoss Operator.
Huber loss is a loss function used in robust regression. We define X as the input value and Y as the target value. Huber loss can evaluate the fitness of X to Y. Different from MSE loss, Huber loss is more robust for outliers. The shape of X and Y are [batch_size, 1]. The equation is:
$$ Out_{\delta}(X, Y)_i = \begin{cases} 0.5 * (Y_i - X_i)^2, \quad |Y_i - X_i| \leq \delta \\ \delta * (|Y_i - X_i| - 0.5 * \delta), \quad otherwise \end{cases} $$
In the above equation, $Out_{\delta}(X, Y)_i$, $X_i$ and $Y_i$ represent the $i$-th element of Out, X and Y, respectively.
Inputs:  X : The input value of huber loss op.X is a 2D tensor with shape [batch_size, 1].
 Y : The target value of huber loss op.Y is a 2D tensor with shape [batch_size, 1].
Outputs:  Residual (Intermediate) : Intermediate tensor to cache residual value between Y and X.The shape is same as Input(X) and will be reused in backward.
 Out : The output tensor with shape [batch_size, 1] which represents the huber loss.
Attributes:  delta (Duplicable): Hyper parameter in huber loss.
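The piecewise definition above translates directly to NumPy (an illustrative sketch; `delta` corresponds to the `delta` attribute):

```python
import numpy as np

def huber_loss(x, y, delta):
    r = np.abs(y - x)                       # residual magnitude
    quadratic = 0.5 * (y - x) ** 2          # small-residual branch
    linear = delta * (r - 0.5 * delta)      # outlier branch
    return np.where(r <= delta, quadratic, linear)

x = np.array([[0.0], [0.0]])
y = np.array([[0.5], [3.0]])
loss = huber_loss(x, y, delta=1.0)
# |0.5| <= 1 -> 0.5 * 0.25 = 0.125; |3| > 1 -> 1 * (3 - 0.5) = 2.5
assert np.allclose(loss.ravel(), [0.125, 2.5])
```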
sequence_softmax
Sequence Softmax Operator.
SequenceSoftmaxOp computes the softmax activation among all timesteps for each sequence. The dimension of each timestep should be 1. Thus, the shape of input Tensor can be either [N, 1] or [N], where N is the sum of the length of all sequences.
The algorithm works as follows:
for ith sequence in a minibatch:
$$ Out(X[lod[i]:lod[i+1], :]) = \frac{\exp(X[lod[i]:lod[i+1], :])}{\sum(\exp(X[lod[i]:lod[i+1], :]))} $$
For example, for a minibatch of 3 variable-length sequences, containing 2, 3, and 2 timesteps respectively, the lod is [0, 2, 5, 7]; then softmax will be computed among X[0:2, :], X[2:5, :], X[5:7, :], and N turns out to be 7.
Inputs:  X : (LoDTensor) 1D or 2D input LoDTensor with the 2nd dimension of length 1.
Outputs:  Out : (LoDTensor) 1D or 2D output LoDTensor with the 2nd dimension of length 1.
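The per-sequence softmax above can be sketched with NumPy: softmax is taken over each LoD segment independently (names are illustrative):

```python
import numpy as np

def sequence_softmax(x, lod):
    out = np.empty_like(x, dtype=float)
    for start, end in zip(lod[:-1], lod[1:]):
        seg = x[start:end] - x[start:end].max()   # shift for stability
        e = np.exp(seg)
        out[start:end] = e / e.sum()
    return out

x = np.array([1.0, 1.0, 2.0, 2.0, 2.0, 0.0, 0.0])
out = sequence_softmax(x, lod=[0, 2, 5, 7])
# Each segment sums to 1 on its own.
assert np.allclose([out[0:2].sum(), out[2:5].sum(), out[5:7].sum()], 1.0)
assert np.allclose(out[0:2], 0.5)
```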
multiclass_nms
This operator performs multi-class non maximum suppression (NMS) on a batch of boxes and scores.
In the NMS step, this operator greedily selects a subset of detection bounding boxes whose scores are larger than score_threshold, if this threshold is provided, and then selects the nms_top_k boxes with the largest confidence scores if nms_top_k is larger than -1. Then this operator prunes away boxes that have a high IOU (intersection over union) overlap with already selected boxes, using adaptive-threshold NMS based on the parameters nms_threshold and nms_eta.
After the NMS step, at most keep_top_k bboxes in total are kept per image if keep_top_k is larger than -1.
This operator supports multi-class and batched inputs. It applies NMS independently for each class. The output is a 2D LoDTensor; for each image, the offsets in the first dimension of the LoDTensor are called LoD, and the number of offsets is N + 1, where N is the batch size. If LoD[i + 1] - LoD[i] == 0, there is no detected bbox for this image. If there are no detected boxes for any image, all the elements in LoD are 0, and Out only contains one value, which is -1.
Inputs:  BBoxes : (Tensor) A 2D Tensor with shape [M, 4] represents the predicted locations of M bounding bboxes. Each bounding box has four coordinate values and the layout is [xmin, ymin, xmax, ymax].
 Scores : (Tensor) A 3D Tensor with shape [N, C, M] represents the predicted confidence predictions. N is the batch size, C is the class number, M is number of bounding boxes. For each category there are total M scores which corresponding M bounding boxes. Please note, M is equal to the 1st dimension of BBoxes.
Outputs:  Out : (LoDTensor) A 2D LoDTensor with shape [No, 6] represents the detections. Each row has 6 values: [label, confidence, xmin, ymin, xmax, ymax], No is the total number of detections in this minibatch. For each instance, the offsets in first dimension are called LoD, the number of offset is N + 1, if LoD[i + 1]  LoD[i] == 0, means there is no detected bbox.
Attributes:  background_label (Duplicable): (int64_t, default: 0) The index of the background label; the background label will be ignored. If set to -1, then all categories will be considered.
 score_threshold (Duplicable): (float) Threshold to filter out bounding boxes with low confidence score. If not provided, consider all boxes.
 nms_top_k (Duplicable): (int64_t) Maximum number of detections to be kept according to the confidences after filtering detections based on score_threshold
 nms_threshold (Duplicable): (float, default: 0.3) The threshold to be used in NMS.
 nms_eta (Duplicable): (float) The parameter for adaptive NMS.
 keep_top_k (Duplicable): (int64_t) Number of total bboxes to be kept per image after the NMS step. -1 means keeping all bboxes after the NMS step.
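The core greedy pruning rule can be sketched for a single class with NumPy. This is an illustrative sketch with a fixed nms_threshold; the adaptive nms_eta, per-class batching and keep_top_k handling of the real operator are omitted, and all names are assumptions.

```python
import numpy as np

def iou(a, b):
    """a, b: boxes in [xmin, ymin, xmax, ymax] layout."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, score_threshold, nms_threshold):
    order = np.argsort(-scores)                    # highest score first
    keep = []
    for i in order:
        if scores[i] < score_threshold:
            continue                               # score filtering step
        # Keep a box only if it does not overlap a kept box too much.
        if all(iou(boxes[i], boxes[j]) <= nms_threshold for j in keep):
            keep.append(int(i))
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30.]])
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores, score_threshold=0.5, nms_threshold=0.5)
assert kept == [0, 2]   # box 1 overlaps box 0 too much and is pruned
```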
sequence_erase
Sequence Erase Operator.
Sequence erase operator erases tokens specified by Attr(tokens) from the input sequences Input(X), and outputs the remaining data and modifies the LoD information at the same time. For example, given a 2D LoDTensor
X = [[2, 2, 6, 1, 3, 9, 6, 1, 0, 1]]^T
with lod = [[0, 3, 6, 10]], there are three sequences in the input:
X1 = [[2, 2, 6]]^T, X2 = [[1, 3, 9]]^T and X3 = [[6, 1, 0, 1]]^T.
If the tokens to be erased are Attr(tokens) = [2, 3, 5], after the erasing operation, the three sequences become
X1' = [[6]]^T, X2' = [[1, 9]]^T and X3' = [[6, 1, 0, 1]]^T.
Hence the LoDTensor Output(Out) should be
Out = [[6, 1, 9, 6, 1, 0, 1]]^T,
with lod = [[0, 1, 3, 7]].
An example usage for this operator is to remove the special tokens when computing the edit distance between two strings, such as blank, start token, and end token.
Inputs:  X : (2D LoDTensor with the 2nd dim. equal to 1) Input LoDTensor of SequenceEraseOp.
Outputs:  Out : (2D LoDTensor with the 2nd dim. equal to 1) Output LoDTensor of SequenceEraseOp.
Attributes:  tokens (Duplicable): (vector<int>) Tokens need to be erased from input sequences.
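The erase rule and LoD update above can be sketched with NumPy, reproducing the worked example (an illustrative helper, not the operator's API):

```python
import numpy as np

def sequence_erase(x, lod, tokens):
    banned = set(tokens)
    out, new_lod = [], [0]
    for start, end in zip(lod[:-1], lod[1:]):
        kept = [v for v in x[start:end] if v not in banned]  # drop tokens
        out.extend(kept)
        new_lod.append(new_lod[-1] + len(kept))              # rebuild LoD
    return np.array(out), new_lod

x = np.array([2, 2, 6, 1, 3, 9, 6, 1, 0, 1])
out, lod = sequence_erase(x, [0, 3, 6, 10], tokens=[2, 3, 5])
assert out.tolist() == [6, 1, 9, 6, 1, 0, 1]
assert lod == [0, 1, 3, 7]
```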
scale
Scale operator
$$Out = scale*X$$
Inputs:  X : (Tensor) Input tensor of scale operator.
Outputs:  Out : (Tensor) Output tensor of scale operator.
Attributes:  scale (Duplicable): (float, default 1.0)The scaling factor of the scale operator.
lookup_table
Lookup Table Operator.
This operator is used to perform lookups on the parameter W, and then concatenate the results into a dense tensor.
The input Ids can carry the LoD (Level of Details) information, or not. And the output only shares the LoD information with input Ids.
Inputs:  W : An input represents embedding tensors, which is a learnable parameter.
 Ids : An input with type int32 or int64 contains the ids to be looked up in W. Ids must be a column vector with rank = 2. The 2nd dimension size must be 1.
Outputs:  Out : The lookup results, which have the same type as W.
Attributes:  is_sparse (Duplicable): (boolean, default false) Sparse update
 padding_idx (Duplicable): (int64, default -1) If the value is -1, it has no effect on the lookup. Otherwise the given value indicates padding the output with zeros whenever the lookup encounters it in Ids.
lod_tensor_to_array
Inputs:  X :
 RankTable :
Outputs:  Out :
logical_not
logical_not Operator
It operates elementwise on X, and returns the Out. X and Out are Ndim boolean tensors. Each element of Out is calculated by $$Out = !X$$
Inputs:  X : (LoDTensor) Operand of logical_not operator
Outputs:  Out : (LoDTensor) ndim bool tensor. Each element is $$Out = !X$$
logical_and
logical_and Operator
It operates elementwise on X and Y, and returns the Out. X, Y and Out are Ndim boolean tensors. Each element of Out is calculated by $$Out = X \&\& Y$$
Inputs:  X : (LoDTensor) Left hand operand of logical_and operator
 Y : (LoDTensor) Right hand operand of logical_and operator
Outputs:  Out : (LoDTensor) ndim bool tensor. Each element is $$Out = X \&\& Y$$
logical_or
logical_or Operator
It operates elementwise on X and Y, and returns the Out. X, Y and Out are Ndim boolean tensors. Each element of Out is calculated by $$Out = X  Y$$
Inputs:  X : (LoDTensor) Left hand operand of logical_or operator
 Y : (LoDTensor) Right hand operand of logical_or operator
Outputs:  Out : (LoDTensor) ndim bool tensor. Each element is $$Out = X  Y$$
logical_xor
logical_xor Operator
It operates elementwise on X and Y, and returns the Out. X, Y and Out are Ndim boolean tensors. Each element of Out is calculated by $$Out = (X  Y) \, \&\& \, !(X \&\& Y)$$
Inputs:  X : (LoDTensor) Left hand operand of logical_xor operator
 Y : (LoDTensor) Right hand operand of logical_xor operator
Outputs:  Out : (LoDTensor) ndim bool tensor. Each element is $$Out = (X  Y) \, \&\& \, !(X \&\& Y)$$
log_loss
LogLoss Operator.
Log loss is a loss function used for binary classification. Log Loss quantifies the accuracy of a classifier by penalising false classifications. Minimising the Log Loss is equivalent to maximising the accuracy of the classifier. We define Predicted as the values predicted by our model and Labels as the target ground truth value. Log loss can evaluate how close the predicted values are to the target. The shapes of Predicted and Labels are both [batch_size, 1]. The equation is:
$$ Loss = -Labels * \log(Predicted + \epsilon) - (1 - Labels) * \log(1 - Predicted + \epsilon) $$
Inputs:  Predicted : The input value (Predicted) of Log loss op.Predicted is a 2D tensor with shape [batch_size, 1].
 Labels : The target value (Labels) of Log loss op.Labels is a 2D tensor with shape [batch_size, 1].
Outputs:  Loss : The output tensor with shape [batch_size, 1] which represents the log loss.
Attributes:  epsilon (Duplicable): Epsilon in log loss.
sqrt
Sqrt Activation Operator.
$out = \sqrt{x}$
Inputs:  X : Input of Sqrt operator
Outputs:  Out : Output of Sqrt operator
lod_reset
LoDReset operator
Reset LoD of Input(X) into a new one specified by Input(TargetLoD) or Attr(target_lod), or set LoD for Input(X) if it doesn't have one. Currently the lod_reset operator only supports the reset of level 0 LoD. At least one of Input(TargetLoD) and Attr(target_lod) must be set, and if both of them are set, Input(TargetLoD) will be chosen as the target LoD.
An example: Given a float LoDTensor X with shape (6, 1), its transpose form represents
[1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
with LoD = [[0, 2, 5, 6]] and the three (transposed) sequences look like
[1.0, 2.0], [3.0, 4.0, 5.0], [6.0].
If target LoD = [0, 4, 6], the lod_reset operator will reset the LoD, and the sequences contained in the LoDTensor Output(Out) become:
[1.0, 2.0, 3.0, 4.0], [5.0, 6.0].
Inputs:  X : (LoDTensor) The input tensor of lod_reset operator.
 TargetLoD : (Tensor, optional) The target level 0 LoD from Input().
Outputs:  Out : (LoDTensor) The output tensor of lod_reset operator.
Attributes:  target_lod (Duplicable): The target level 0 LoD from Attr().
write_to_array
WriteToArray Operator.
This operator writes a LoDTensor to a LoDTensor array.
Assume $T$ is LoDTensor, $i$ is the subscript of the array, and $A$ is the array. The equation is
$$A[i] = T$$
Inputs:  X : (LoDTensor) the tensor to be written to the tensor array
 I : (Tensor) the subscript index in the tensor array. The number of elements should be 1
Outputs:  Out : (TensorArray) the tensor array to be written to
lod_array_length
LoDArrayLength Operator.
This operator obtains the length of lod tensor array:
$$Out = len(X)$$
NOTE: The output is a CPU Tensor since the control variable should be only in CPU and the length of LoDTensorArray should be used as control variables.
Inputs:  X : (LoDTensorArray) The input tensor array.
Outputs:  Out : (Tensor) 1x1 CPU Tensor of length, int64_t
edit_distance
EditDistance operator computes the edit distances between a batch of hypothesis strings and their references.
Edit distance, also called Levenshtein distance, measures how dissimilar two strings are by counting the minimum number of operations needed to transform one string into another. Here the operations include insertion, deletion, and substitution. For example, given hypothesis string A = "kitten" and reference B = "sitting", the edit distance is 3, since A can be transformed into B with at least two substitutions and one insertion:
"kitten" -> "sitten" -> "sittin" -> "sitting"
Input(Hyps) is a LoDTensor consisting of all the hypothesis strings, with the total number denoted by `batch_size` and the separation specified by the LoD information. The `batch_size` reference strings are arranged in the same order in the LoDTensor Input(Refs).
Output(Out) contains the `batch_size` results, each of which stands for the edit distance for a pair of strings. If Attr(normalized) is true, the edit distance will be divided by the length of the reference string.
Inputs:  Hyps : (2D LoDTensor<int64_t>, 2nd dim. equal to 1) The indices for hypothesis strings.
 Refs : (2D LoDTensor<int64_t>, 2nd dim. equal to 1) The indices for reference strings.
Outputs:  SequenceNum : The sequence count of current batch
 Out : (2D Tensor with shape [`batch_size` x 1]) The output edit distances of EditDistance operator.
Attributes:  normalized (Duplicable): (bool, default false) Indicated whether to normalize the edit distance by the length of reference string.
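The Levenshtein distance named above is the classic dynamic-programming recurrence; a sketch over plain token sequences (rather than the operator's LoDTensor batching; names are illustrative):

```python
import numpy as np

def edit_distance(hyp, ref, normalized=False):
    m, n = len(hyp), len(ref)
    d = np.zeros((m + 1, n + 1), dtype=np.int64)
    d[:, 0] = np.arange(m + 1)            # delete all of hyp
    d[0, :] = np.arange(n + 1)            # insert all of ref
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,         # deletion
                          d[i, j - 1] + 1,         # insertion
                          d[i - 1, j - 1] + cost)  # substitution/match
    dist = float(d[m, n])
    return dist / n if normalized else dist

# The example from the text: "kitten" -> "sitting" has distance 3.
assert edit_distance("kitten", "sitting") == 3.0
assert edit_distance("kitten", "sitting", normalized=True) == 3.0 / 7
```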
layer_norm
Layer Normalization.
Layer Norm has been implemented as discussed in the paper: https://arxiv.org/abs/1607.06450 ...
Inputs:  X : (LoDTensor) The input tensor.
 Scale : (Tensor, optional) Scale is a 1-dimensional tensor of size H (`begin_norm_axis` splits the tensor `X` into a matrix [N, H]). It is applied to the output.
 Bias : (Tensor, optional) Bias is a 1-dimensional tensor of size H (`begin_norm_axis` splits the tensor `X` into a matrix [N, H]). It is applied to the output.
Outputs:  Y : (LoDTensor) Result after normalization.
 Mean (Intermediate) : (Tensor) Mean of the current mini batch.
 Variance (Intermediate) : (Tensor) Variance of the current mini batch.
Attributes:  epsilon (Duplicable): (float, default 1e5) Constant for numerical stability
 begin_norm_axis (Duplicable): (int, default: 1) the axes `begin_norm_axis ... Rank(X) - 1` will be normalized. `begin_norm_axis` splits the tensor `X` into a matrix [N, H].
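The normalization described above can be sketched with NumPy: flatten X to [N, H] at begin_norm_axis, normalize each row, then apply Scale and Bias. Shapes and names here are illustrative assumptions.

```python
import numpy as np

def layer_norm(x, scale, bias, begin_norm_axis=1, epsilon=1e-5):
    n = int(np.prod(x.shape[:begin_norm_axis]))
    mat = x.reshape(n, -1)                          # [N, H]
    mean = mat.mean(axis=1, keepdims=True)
    var = mat.var(axis=1, keepdims=True)
    y = (mat - mean) / np.sqrt(var + epsilon)       # normalize each row
    return (y * scale + bias).reshape(x.shape), mean.ravel(), var.ravel()

x = np.random.default_rng(0).standard_normal((2, 3, 4))
H = 12                                              # 3 * 4 with axis 1
y, mean, var = layer_norm(x, np.ones(H), np.zeros(H))
# Each normalized row has (approximately) zero mean and unit variance.
assert np.allclose(y.reshape(2, -1).mean(axis=1), 0.0, atol=1e-7)
assert np.allclose(y.reshape(2, -1).var(axis=1), 1.0, atol=1e-3)
```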
gaussian_random
GaussianRandom Operator.
Used to initialize tensors with gaussian random generator.
Inputs: Outputs:  Out : Output matrix of gaussian random op
Attributes:  shape (Duplicable): (vector<int>) The dimension of random tensor.
 mean (Duplicable): (float, default 0.0) mean of random tensor.
 std (Duplicable): (float, default 1.0) std of random tensor.
 seed (Duplicable): (int, default 0) Random seed of generator.0 means use system wide seed.
 dtype (Duplicable): (int, default 5(FP32)) Output data type.
lrn
Local Response Normalization Operator.
This operator comes from the paper "ImageNet Classification with Deep Convolutional Neural Networks" (Krizhevsky et al., 2012). The original formula is:
$$ Output(i, x, y) = Input(i, x, y) / \left( k + \alpha \sum\limits^{\min(C, c + n/2)}_{j = \max(0, c - n/2)} (Input(j, x, y))^2 \right)^{\beta} $$
Function implementation:
Inputs and outputs are in NCHW format, and input.shape.ndims() equals 4. Dimensions 0 ~ 3 represent batch size, feature maps, rows, and columns, respectively.
Input and Output in the formula above is for each map(i) of one image, and Input(i, x, y), Output(i, x, y) represents an element in an image.
C is the number of feature maps of one image. n is a hyperparameter configured when operator is initialized. The sum in the denominator is the sum of the same positions in the neighboring maps.
Inputs:  X : (Tensor) The input of LRN operator. It must be a 4-D tensor with NCHW format.
Outputs:  Out : (Tensor) The output of LRN operator, which is also the 4D tensor with NCHW format.
 MidOut : (Tensor) Middle result of LRN operator. It's computed in forward process and also used in backward process.
Attributes:  n (Duplicable): (int, default 5) n is the number of adjacent kernel maps summed over at the same spatial position.
 k (Duplicable): (float, default 2.0) k is the bias.
 alpha (Duplicable): (float, default 0.0001) alpha is the scale number.
 beta (Duplicable): (float, default 0.75) beta is the power number.
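The formula above can be rendered roughly in NumPy as follows. The exact channel-window convention (here `n // 2` on each side) is an assumption of this sketch, and the parameter names mirror the attributes:

```python
# Illustrative LRN over NCHW input: for each channel c, sum the
# squares of neighboring channels, form MidOut = k + alpha * sum,
# then divide the input by MidOut ** beta.
import numpy as np

def lrn(x, n=5, k=2.0, alpha=1e-4, beta=0.75):
    N, C, H, W = x.shape
    sq = x ** 2
    mid = np.full_like(x, k)  # the MidOut tensor
    for c in range(C):
        lo, hi = max(0, c - n // 2), min(C, c + n // 2 + 1)
        mid[:, c] += alpha * sq[:, lo:hi].sum(axis=1)
    return x / (mid ** beta), mid
```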
bilinear_tensor_product
Bilinear Tensor Product operator. Given inputs X and Y, a 3-D tensor Weight, and a Bias, each column of the Output is computed by one slice $i = 1, \ldots, k$ of the tensor:
$$ M = (X W_i) * Y \\ Out_i = \sum_j {M_j} + Bias_i $$
where $W_i$ is the $i$-th slice of Input(Weight); $M_j$ is the $j$-th column of $M$; $Out_i$ is the $i$-th column of Output(Out); and $Bias_i$ is a column vector, each element of which equals the $i$-th element of $Bias$.
Inputs:  X : The first input of bilinear_tensor_product operator.
 Y : The second input of bilinear_tensor_product operator.
 Weight : The learnable parameters of bilinear_tensor_product operator.
 Bias : The learnable bias of bilinear_tensor_product operator.
Outputs:  Out : The output of bilinear_tensor_product operator.
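The two-line formula can be transcribed directly. This sketch assumes shapes X [B, M], Y [B, N], Weight [K, M, N], Bias [K]; the helper name is ours:

```python
# Illustrative bilinear tensor product: for each weight slice W_i,
# M = (X @ W_i) * Y elementwise, then Out[:, i] = sum_j M_j + Bias_i.
import numpy as np

def bilinear_tensor_product(x, y, weight, bias=None):
    B, K = x.shape[0], weight.shape[0]
    out = np.empty((B, K))
    for i in range(K):
        m = (x @ weight[i]) * y    # M = (X W_i) * Y, elementwise
        out[:, i] = m.sum(axis=1)  # Out_i = sum_j M_j
    if bias is not None:
        out += bias
    return out
```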
iou_similarity
IOU Similarity Operator. Computes the intersection-over-union (IOU) between two box lists. Box list 'X' should be a LoDTensor and 'Y' is a common Tensor; boxes in 'Y' are shared by all instances of the batched inputs of X. Given two boxes A and B, the IOU is calculated as follows:
$$ IOU(A, B) = \frac{area(A\cap B)}{area(A)+area(B)-area(A\cap B)} $$
Inputs:  X : (LoDTensor, default LoDTensor<float>) Box list X is a 2-D LoDTensor with shape [N, 4] that holds N boxes; each box is represented as [xmin, ymin, xmax, ymax]. [xmin, ymin] is the top-left coordinate of the box; if the input is an image feature map, it is close to the origin of the coordinate system. [xmax, ymax] is the bottom-right coordinate of the box. This tensor can contain LoD information to represent a batch of inputs. One instance of this batch can contain a different number of entities.
 Y : (Tensor, default Tensor<float>) Box list Y holds M boxes, each represented as [xmin, ymin, xmax, ymax]; the shape of Y is [M, 4]. [xmin, ymin] is the top-left coordinate of the box if the input is an image feature map, and [xmax, ymax] is the bottom-right coordinate of the box.
Outputs:  Out : (LoDTensor, the lod is same as input X) The output of iou_similarity op, a tensor with shape [N, M] representing pairwise iou scores.
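The pairwise computation is straightforward to sketch with NumPy broadcasting (the helper name is ours):

```python
# Illustrative pairwise IOU between box lists X [N, 4] and Y [M, 4],
# both in [xmin, ymin, xmax, ymax] format; returns scores [N, M].
import numpy as np

def iou_similarity(x, y):
    # Intersection rectangle corners, broadcast to [N, M]
    ixmin = np.maximum(x[:, None, 0], y[None, :, 0])
    iymin = np.maximum(x[:, None, 1], y[None, :, 1])
    ixmax = np.minimum(x[:, None, 2], y[None, :, 2])
    iymax = np.minimum(x[:, None, 3], y[None, :, 3])
    inter = np.clip(ixmax - ixmin, 0, None) * np.clip(iymax - iymin, 0, None)
    area_x = (x[:, 2] - x[:, 0]) * (x[:, 3] - x[:, 1])
    area_y = (y[:, 2] - y[:, 0]) * (y[:, 3] - y[:, 1])
    # IOU = intersection / union
    return inter / (area_x[:, None] + area_y[None, :] - inter)
```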
conditional_block
Conditional block operator
Run the sub-block if X is not empty. Params are the other inputs, and Out the outputs, of the sub-block.
Inputs:  X (Duplicable) : The conditional variable of this operator. If X is empty, the whole sub-block will not be executed.
 Params (Duplicable) : The input variables of the sub-block.
Outputs:  Out (Duplicable) : The output variables of the sub-block.
 Scope : (std::vector<Scope*>) The step scope of conditional block. To unify the conditional block, rnn and while op, the type of scope is std::vector<Scope*>
Attributes:  sub_block (Duplicable): The step block of conditional block operator
 is_scalar_condition (Duplicable): the input X is used as scalar condition
rmsprop
Rmsprop Optimizer.
$$ MeanSquareOut = decay * MeanSquare + (1 - decay) * Grad * Grad \\ MomentOut = momentum * Moment + \frac{LearningRate * Grad}{\sqrt{MeanSquareOut + epsilon}} \\ ParamOut = Param - MomentOut $$
The original slides that proposed RMSProp: slide 29 of http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
Inputs:  Param : (Tensor, default Tensor<float>) Input parameter value that has to be updated.
 MeanSquare : (Tensor, default Tensor<float>) The mean square value that gets updated.
 LearningRate : (Tensor, default Tensor<float>) The learning rate should be a tensor of size 1.
 Grad : (Tensor, default Tensor<float>) Input gradient of the parameter.
 Moment : (Tensor, default Tensor<float>) The moment that gets updated.
Outputs:  ParamOut : (Tensor) Output updated parameter value.
 MomentOut : (Tensor) Output updated moment.
 MeanSquareOut : (Tensor) Output Mean squared updated value.
Attributes:  epsilon (Duplicable): (float, default 1e10) Constant for numerical stability.
 decay (Duplicable): (float, default 0.9) Discounting factor for coming gradient.
 momentum (Duplicable): (float, default 0.0) Constant value.
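A single update step can be transcribed from the equations above (the function name and return order are ours):

```python
# One RMSProp step in NumPy, following the update equations:
# accumulate a decayed mean of squared gradients, form a momentum
# term scaled by lr / sqrt(mean_square + eps), subtract from param.
import numpy as np

def rmsprop_step(param, mean_square, moment, grad, lr,
                 epsilon=1e-10, decay=0.9, momentum=0.0):
    mean_square_out = decay * mean_square + (1 - decay) * grad * grad
    moment_out = momentum * moment + lr * grad / np.sqrt(mean_square_out + epsilon)
    param_out = param - moment_out
    return param_out, moment_out, mean_square_out
```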
elementwise_mul
Limited Elementwise Mul Operator.
The equation is:
$$Out = X \odot Y$$
$X$ is a tensor of any dimension and the dimensions of tensor $Y$ must be smaller than or equal to the dimensions of $X$.
There are two cases for this operator: 1. The shape of $Y$ is the same as $X$; 2. The shape of $Y$ is a subset of $X$.
For case 2: $Y$ will be broadcast to match the shape of $X$, and axis should be set to the index of the start dimension for broadcasting $Y$ onto $X$.
For example:
.. code-block:: python

  shape(X) = (2, 3, 4, 5), shape(Y) = (,)
  shape(X) = (2, 3, 4, 5), shape(Y) = (5,)
  shape(X) = (2, 3, 4, 5), shape(Y) = (4, 5)
  shape(X) = (2, 3, 4, 5), shape(Y) = (3, 4), with axis=1
  shape(X) = (2, 3, 4, 5), shape(Y) = (2,), with axis=0
Either of the inputs $X$ and $Y$ or none can carry the LoD (Level of Details) information. However, the output only shares the LoD information with input $X$.
Inputs:  X : (Tensor), The first input tensor of elementwise op.
 Y : (Tensor), The second input tensor of elementwise op.
Outputs:  Out : The output of elementwise op.
Attributes:  axis (Duplicable): (int, default -1). The start dimension index for broadcasting Y onto X.
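The `axis` broadcast rule (shared by all the elementwise ops in this section) can be emulated in NumPy by padding Y's shape with trailing 1s starting at `axis`. The helper below is an illustrative sketch of that rule, not the op itself:

```python
# Illustrative "case 2" broadcast: align Y's dimensions with X's
# starting at `axis` (default -1 aligns with X's trailing dims),
# pad with size-1 dims, then let NumPy broadcasting do the rest.
import numpy as np

def elementwise_mul(x, y, axis=-1):
    if x.shape == y.shape:
        return x * y
    if axis == -1:  # align Y with X's trailing dimensions
        axis = x.ndim - y.ndim
    # pad Y's shape with trailing 1s so it broadcasts along X
    shape = y.shape + (1,) * (x.ndim - axis - y.ndim)
    return x * y.reshape(shape)
```

For example, `shape(Y) = (3, 4)` with `axis=1` is reshaped to `(3, 4, 1)` so it multiplies the middle dimensions of a `(2, 3, 4, 5)` input.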
sequence_slice
Sequence slice operator
The operator crops a subsequence from a given sequence, with a given start offset and subsequence length. It only supports sequences (LoDTensors with LoD level 1).

Case:
  X = [[a1, a2; b1, b2; c1, c2]
       [d1, d2; e1, e2]]
  LoD(X) = {{0, 3, 5}}; Dims(X) = (5, 2)
  Offset = [[0], [1]]; Length = [[2], [1]]

  Out = [[a1, a2; b1, b2]
         [e1, e2]]
  LoD(Out) = {{0, 2, 3}}; Dims(Out) = (3, 2)

NOTE: The first dimension size of the input and the sizes of Offset and Length should be equal. Offsets start from 0.
Inputs:  X : (LoDTensor), the input of SequenceSliceOp.
 Offset : (Tensor), a vector<int> to describe the offset of every input sequence for sub sequence item.
 Length : (Tensor), a vector<int> to describe the length of every input sequence for sub sequence item.
Outputs:  Out : (LoDTensor), the output of SequenceSliceOp.
hinge_loss
HingeLoss Operator.
Let x be a logit (prediction) and y be the actual label. The logit can take any value in (-inf, inf), but the labels should be either -1 or 1. Then, the hinge loss is computed as follows:
$$ L(x, y) = \max(1 - y \cdot x, 0) $$
Note that the labels passed as input will have values of either 0 or 1.
Inputs:  Logits : The input value (Logits) of Hinge loss op.Logits is a 2D tensor with shape [batch_size, 1].
 Labels : The target value (Labels) of Hinge loss op.Labels is a 2D tensor with shape [batch_size, 1].
Outputs:  Loss : The output tensor with shape [batch_size, 1] which represents the hinge loss.
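Since the input labels are {0, 1} but the formula uses {-1, +1}, a sketch of the computation needs a remapping step first. This remapping is our reading of the two notes above, not a confirmed implementation detail:

```python
# Illustrative hinge loss: map labels {0, 1} -> {-1, +1}, then
# apply max(1 - y * x, 0) elementwise over [batch_size, 1] tensors.
import numpy as np

def hinge_loss(logits, labels):
    y = 2.0 * labels - 1.0  # {0, 1} -> {-1, +1}
    return np.maximum(1.0 - y * logits, 0.0)
```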
fill_constant
FillConstant Operator.
Fill up a variable with a specified constant value.
Inputs: Outputs:  Out : (Tensor) Tensor of specified shape will be filled with the specified value
Attributes:  dtype (Duplicable): (int, default 5 (FP32)) Output data type
 shape (Duplicable): (vector<int>) The shape of the output
 value (Duplicable): (float, default 0) The value to be filled
 force_cpu (Duplicable): (bool, default false) Force fill output variable to cpu memory. Otherwise, fill output variable to the running device
detection_output
Detection output for SSD (single shot multibox detector). Applies NMS to the output of the network and computes the predicted bounding box locations. The shape of this layer's output can be zero if there is no valid bounding box.
Inputs:  Loc : (Tensor) The input tensor of the detection_output operator: the predicted locations. The format of the input tensor is kNCHW, where K is the number of priorbox points, N is the number of boxes on each point, C is 4, and H and W are both 1.
 Conf : (Tensor) The input tensor of the detection_output operator: the priorbox confidences. The format of the input tensor is kNCHW, where K is the number of priorbox points, N is the number of boxes on each point, C is the number of classes, and H and W are both 1.
 PriorBox : (Tensor) The input tensor of the detection_output operator: the positions and variances of the boxes.
Outputs:  Out : (Tensor) The output tensor of detection_output operator.
Attributes:  background_label_id (Duplicable): (int), The background class index.
 num_classes (Duplicable): (int), The number of the classification.
 nms_threshold (Duplicable): (float), The Nonmaximum suppression threshold.
 confidence_threshold (Duplicable): (float), The classification confidence threshold.
 top_k (Duplicable): (int), The number of bounding boxes kept in the layer's output.
 nms_top_k (Duplicable): (int), The number of bounding boxes kept in the NMS's output.
fill_zeros_like
FillZerosLike Operator.
Fill up a variable with zeros. The output will have the same size as the input.
Inputs:  X : The input of fill_zeros_like op.
Outputs:  Out : The variable will be filled up with zeros.
softmax_with_cross_entropy
Softmax With Cross Entropy Operator.
Cross-entropy loss with softmax is widely used as the output layer. This operator computes the softmax normalized values for each row of the input tensor, after which the cross-entropy loss is computed. This provides a more numerically stable gradient.
Because this operator performs a softmax on the logits internally, it expects unscaled logits. This operator should not be used with the output of a softmax operator, since that would produce incorrect results.
When the attribute soft_label is set to false, this operator expects mutually exclusive hard labels: each sample in a batch is in exactly one class with a probability of 1.0, so each sample in the batch has a single label.
The equation is as follows:
1) Hard label (onehot label, so every sample has exactly one class)
$$Loss_j = -\text{Logit}_{Label_j} + \log\left(\sum_{i=0}^{K}\exp(\text{Logit}_i)\right), \quad j = 1, ..., K$$
2) Soft label (each sample can have a distribution over all classes)
$$Loss_j = -\sum_{i=0}^{K}\text{Label}_i \left(\text{Logit}_i - \log\left(\sum_{i=0}^{K}\exp(\text{Logit}_i)\right)\right), \quad j = 1, ..., K$$
Inputs:  Logits : (Tensor, default: Tensor<float>), The unscaled log probabilities which is a 2D tensor with shape [N x K]. N is the batch_size, and K is the class number.
 Label : (Tensor) The ground truth which is a 2D tensor. If soft_label is set to false, Label is a Tensor<int64> with shape [N x 1]. If soft_label is set to true, Label is a Tensor<float/double> with shape [N x K].
Outputs:  Softmax (Intermediate) : (Tensor, default: Tensor<float>), A 2D tensor with shape [N x K]. The outputs value of softmax activation by given the input batch, which will be used in backward calculation.
 Loss : (Tensor, default: Tensor<float>), A 2D tensor. The cross entropy loss with shape [N x 1].
Attributes:  soft_label (Duplicable): (bool, default: false), A flag to indicate whether to interpret the given labels as soft labels.
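The "more numerically stable gradient" comes from fusing softmax and cross-entropy in log space. A NumPy sketch of the hard-label path (the helper name is ours):

```python
# Illustrative fused softmax + hard-label cross-entropy. Subtracting
# the row max before exponentiating keeps the log-sum-exp stable;
# the loss is then -log_softmax at the label index.
import numpy as np

def softmax_with_cross_entropy(logits, label):
    shifted = logits - logits.max(axis=1, keepdims=True)  # stability shift
    log_z = np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    log_softmax = shifted - log_z
    softmax = np.exp(log_softmax)
    loss = -log_softmax[np.arange(len(label)), label.ravel()]
    return softmax, loss.reshape(-1, 1)
```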
fill_constant_batch_size_like
FillConstantBatchSizeLike Operator.
Fill up a variable with specified constant value.
Inputs:  Input : (Tensor) Tensor whose dim_idx th dimension is used to specify the batch_size
Outputs:  Out : (Tensor) Tensor of specified shape will be filled with the specified value
Attributes:  dtype (Duplicable): (int, default 5 (FP32)) Output data type
 shape (Duplicable): (vector<int>) The shape of the output
 input_dim_idx (Duplicable): (int, default 0) The index of input's batch size dimension
 output_dim_idx (Duplicable): (int, default 0) The index of output's batch size dimension
 value (Duplicable): (float, default 0) The value to be filled
tanh
Tanh Activation Operator.
$$out = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
Inputs:  X : Input of Tanh operator
Outputs:  Out : Output of Tanh operator
feed
Feed Operator.
It should not be configured by users directly.
Inputs:  X : The input of feed op
Outputs:  Out : The output of feed op
Attributes:  col (Duplicable): (int) The column of feed
label_smooth
LabelSmooth Operator.
Label smoothing is a mechanism to regularize the classifier layer. In machine learning, optimizing the log-likelihood of the correct label directly may cause two problems. First, it may result in overfitting: if the model learns to assign full probability to the ground-truth label for each training example, it is not guaranteed to generalize. Second, it encourages the differences between the largest logit and all others to become large, reducing the ability of the model to adapt. Label smoothing is proposed to encourage the model to be less confident; it replaces the ground-truth label $y$ with the weighted sum of itself and some fixed distribution $\mu$, i.e.
$$ \tilde{y} = (1 - \epsilon) * y + \epsilon * \mu, $$
where $(1 - \epsilon)$ and $\epsilon$ are the weights, respectively, and $\tilde{y}$ is the smoothed label. Usually a uniform distribution is used for $\mu$. This change in the ground-truth label is called label-smoothing regularization (LSR).
See more details about label smoothing in https://arxiv.org/abs/1512.00567.
Inputs:  X : (LoDTensor) The input labels of LabelSmooth operator. This input can be batched labels in onehot encoding or output from softmax, with shape [N x K], where N is the batch size and K is the number of classes
 PriorDist : (Tensor, optional)The prior distribution to be added to the smoothed label. It is fixed during training and the number of elements should be equal to the dimension K of each label. Default is uniform distribution and each element will be set to 1/K if not provided in input.
Outputs:  Out : (LoDTensor) The smoothed label of the LabelSmooth operator. It has the same shape and LoD as the input.
Attributes:  epsilon (Duplicable): (float, default 0.0f)The smoothing parameter of LabelSmooth operator.
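The formula is a one-liner in NumPy; this sketch defaults to the uniform prior described above (the helper name is ours):

```python
# Illustrative label smoothing: mix the one-hot labels with a prior
# distribution mu (uniform 1/K by default, or a user-supplied one).
import numpy as np

def label_smooth(x, epsilon=0.1, prior_dist=None):
    k = x.shape[-1]
    mu = prior_dist if prior_dist is not None else np.full(k, 1.0 / k)
    return (1.0 - epsilon) * x + epsilon * mu
```

Note the output rows still sum to 1, since both y and mu are distributions.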
expand
Expand operator tiles the input by the given number of times. You set the times number for each dimension by providing the attribute 'expand_times'. The rank of X should be in [1, 6]. Please note that the size of 'expand_times' must be the same as X's rank. The following is a use case:
Input(X) is a 3D tensor with shape [2, 3, 1]:
[ [[1], [2], [3]], [[4], [5], [6]] ]
Attr(expand_times): [1, 2, 2]
Output(Out) is a 3D tensor with shape [2, 6, 2]:
[ [[1, 1], [2, 2], [3, 3], [1, 1], [2, 2], [3, 3]], [[4, 4], [5, 5], [6, 6], [4, 4], [5, 5], [6, 6]] ]
Inputs:  X : (Tensor, default Tensor<float>). A tensor with rank in [1, 6].X is the input to be expanded.
Outputs:  Out : (Tensor, default Tensor<float>). A tensor with rank in [1, 6].The rank of Output(Out) have the same with Input(X). After expanding, size of each dimension of Output(Out) is equal to size of the corresponding dimension of Input(X) multiplying the corresponding value given by Attr(expand_times).
Attributes:  expand_times (Duplicable): Expand times number for each dimension.
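The documented example maps directly onto NumPy's `tile`, since `expand_times` has one entry per dimension of X:

```python
# The expand example above, reproduced with np.tile:
# X has shape [2, 3, 1] and expand_times = [1, 2, 2].
import numpy as np

x = np.array([[[1], [2], [3]],
              [[4], [5], [6]]])  # shape [2, 3, 1]
out = np.tile(x, (1, 2, 2))     # shape becomes [2, 6, 2]
```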
elementwise_min
Limited Elementwise Min Operator.
The equation is:
$$Out = min(X, Y)$$
$X$ is a tensor of any dimension and the dimensions of tensor $Y$ must be smaller than or equal to the dimensions of $X$.
There are two cases for this operator: 1. The shape of $Y$ is the same as $X$; 2. The shape of $Y$ is a subset of $X$.
For case 2: $Y$ will be broadcast to match the shape of $X$, and axis should be set to the index of the start dimension for broadcasting $Y$ onto $X$.
For example:
.. code-block:: python

  shape(X) = (2, 3, 4, 5), shape(Y) = (,)
  shape(X) = (2, 3, 4, 5), shape(Y) = (5,)
  shape(X) = (2, 3, 4, 5), shape(Y) = (4, 5)
  shape(X) = (2, 3, 4, 5), shape(Y) = (3, 4), with axis=1
  shape(X) = (2, 3, 4, 5), shape(Y) = (2,), with axis=0
Either of the inputs $X$ and $Y$ or none can carry the LoD (Level of Details) information. However, the output only shares the LoD information with input $X$.
Inputs:  X : (Tensor), The first input tensor of elementwise op.
 Y : (Tensor), The second input tensor of elementwise op.
Outputs:  Out : The output of elementwise op.
Attributes:  axis (Duplicable): (int, default -1). The start dimension index for broadcasting Y onto X.
elementwise_div
Limited Elementwise Div Operator.
The equation is:
$$Out = X / Y$$
$X$ is a tensor of any dimension and the dimensions of tensor $Y$ must be smaller than or equal to the dimensions of $X$.
There are two cases for this operator: 1. The shape of $Y$ is the same as $X$; 2. The shape of $Y$ is a subset of $X$.
For case 2: $Y$ will be broadcast to match the shape of $X$, and axis should be set to the index of the start dimension for broadcasting $Y$ onto $X$.
For example:
.. code-block:: python

  shape(X) = (2, 3, 4, 5), shape(Y) = (,)
  shape(X) = (2, 3, 4, 5), shape(Y) = (5,)
  shape(X) = (2, 3, 4, 5), shape(Y) = (4, 5)
  shape(X) = (2, 3, 4, 5), shape(Y) = (3, 4), with axis=1
  shape(X) = (2, 3, 4, 5), shape(Y) = (2,), with axis=0
Either of the inputs $X$ and $Y$ or none can carry the LoD (Level of Details) information. However, the output only shares the LoD information with input $X$.
Inputs:  X : (Tensor), The first input tensor of elementwise op.
 Y : (Tensor), The second input tensor of elementwise op.
Outputs:  Out : The output of elementwise op.
Attributes:  axis (Duplicable): (int, default -1). The start dimension index for broadcasting Y onto X.
elementwise_add
Limited Elementwise Add Operator.
The equation is:
$$Out = X + Y$$
$X$ is a tensor of any dimension and the dimensions of tensor $Y$ must be smaller than or equal to the dimensions of $X$.
There are two cases for this operator: 1. The shape of $Y$ is the same as $X$; 2. The shape of $Y$ is a subset of $X$.
For case 2: $Y$ will be broadcast to match the shape of $X$, and axis should be set to the index of the start dimension for broadcasting $Y$ onto $X$.
For example:
.. code-block:: python

  shape(X) = (2, 3, 4, 5), shape(Y) = (,)
  shape(X) = (2, 3, 4, 5), shape(Y) = (5,)
  shape(X) = (2, 3, 4, 5), shape(Y) = (4, 5)
  shape(X) = (2, 3, 4, 5), shape(Y) = (3, 4), with axis=1
  shape(X) = (2, 3, 4, 5), shape(Y) = (2,), with axis=0
Either of the inputs $X$ and $Y$ or none can carry the LoD (Level of Details) information. However, the output only shares the LoD information with input $X$.
Inputs:  X : (Tensor), The first input tensor of elementwise op.
 Y : (Tensor), The second input tensor of elementwise op.
Outputs:  Out : The output of elementwise op.
Attributes:  axis (Duplicable): (int, default -1). The start dimension index for broadcasting Y onto X.
cross_entropy
CrossEntropy Operator.
It supports both standard cross-entropy and soft-label cross-entropy loss computation. 1) One-hot cross-entropy: soft_label = false, Label[i, 0] indicates the class index for sample i:
$Y[i] = -\log(X[i, Label[i]])$
2) Soft-label cross-entropy: soft_label = true, Label[i, j] indicates the soft label of class j for sample i:
$Y[i] = -\sum_j{Label[i, j] * \log(X[i, j])}$
Please make sure that in this case the summation of each row of Label equals one.
3) One-hot cross-entropy with vectorized Input(Label): As a special case of 2), when each row of Input(Label) has only one nonzero element (equal to 1), soft-label cross-entropy degenerates to one-hot cross-entropy with a one-hot label representation.
Both the input X and Label can carry the LoD (Level of Details) information, or not. But the output only shares the LoD information with input X.
Inputs:  X : (Tensor, default Tensor<float>), a 2D tensor with shape [N x D], where N is the batch size and D is the number of classes. This input is a probability computed by the previous operator, which is almost always the result of a softmax operator.
 Label : (Tensor), the ground truth which is a 2D tensor. When soft_label is set to false, Label is a Tensor<int64> with shape [N x 1]. When soft_label is set to true, Label is a Tensor<float/double> with shape [N x D].
Outputs:  Y : (Tensor, default Tensor<float>), a 2D tensor with shape [N x 1]. The cross entropy loss.
Attributes:  soft_label (Duplicable): (bool, default false), a flag indicating whether to interpret the given labels as soft labels.
matmul
MatMul Operator.
This operator is used to perform (batched) matrix multiplication over the last two dimensions of the input tensors X and Y.
If a transpose flag is specified, the last two dimensions of the tensor are transposed. If the tensor is rank-1 of shape [D], then for X it is treated as [1, D] in non-transposed form and as [D, 1] in transposed form, whereas for Y it is the opposite: it is treated as [D, 1] in non-transposed form and as [1, D] in transposed form.
Examples without transpose:
 X: [K], Y: [K] => Out: [1]
 X: [K], Y: [K, N] => Out: [N]
 X: [B, M, K], Y: [K] => Out: [B, M]
 X: [M, K], Y: [B, K, N] => Out: [B, M, N]
 X: [B, M, K], Y: [B, K, N] => Out: [B, M, N]
 X: [B, ..., M, K], Y: [B, ..., K, N] => Out: [B, ..., M, N]
The behavior is designed to be similar to the numpy.matmul function. The differences are:
 When the rank of the input data is less than or equal to 3, it behaves like numpy.matmul.
 When the rank of the input is greater than 3, the ranks of X and Y must be equal, and the first rank - 2 dimensions must be equal.
 We add transpose_X and transpose_Y flags.
Both inputs X and Y can carry the LoD (Level of Details) information, or not. However, the output only shares the LoD information with input X.
Inputs:  X : The first input of MatMul op
 Y : The second input of MatMul op
Outputs:  Out : The output of MatMul op
Attributes:  transpose_X (Duplicable): If true, use the transpose of `X`.
 transpose_Y (Duplicable): If true, use the transpose of `Y`.
dropout
Dropout Operator.
Dropout refers to randomly dropping out units in a neural network. It is a regularization technique for reducing overfitting by preventing neuron co-adaptation during training. The dropout operator randomly sets (according to the given dropout probability) the outputs of some units to zero, while the others are set equal to their corresponding inputs.
Inputs:  X : The input of dropout op.
Outputs:  Out : The output of dropout op.
 Mask (Intermediate) : The random sampled dropout mask.
Attributes:  dropout_prob (Duplicable): Probability of setting units to zero.
 is_test (Duplicable): True if in test phase.
 fix_seed (Duplicable): A flag indicating whether to use a fixed seed to generate random mask. NOTE: DO NOT set this flag to true in training. Setting this flag to true is only useful in unittest or for debug that always the same output units will be dropped.
 seed (Duplicable): Dropout random seed.
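The training-time behavior can be sketched as follows. The inference-time scaling convention varies between frameworks and is not spelled out above, so this sketch simply passes the input through when `is_test` is set:

```python
# Illustrative dropout: during training, zero each unit with
# probability dropout_prob (mask sampled from a seeded RNG) and
# return both the output and the mask; at test time, pass through.
import numpy as np

def dropout(x, dropout_prob=0.5, is_test=False, seed=None):
    if is_test:
        return x, np.ones_like(x)
    rng = np.random.default_rng(seed)
    mask = (rng.random(x.shape) >= dropout_prob).astype(x.dtype)
    return x * mask, mask
```

Passing a fixed `seed` mirrors the `fix_seed`/`seed` attributes, which make the sampled mask reproducible for tests.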
fetch
Fetch Operator.
It should not be configured by users directly.
Inputs:  X : The input of fetch op
Outputs:  Out : The output of fetch op
Attributes:  col (Duplicable): (int) The column of fetch
squared_l2_distance
SquaredL2Distance operator
This operator calculates the squared L2 distance between the input and the target. The number of distance values equals the first dimension of the input. The first dimension of the target can equal the first dimension of the input, or be 1. If it is 1, the operator will broadcast the target's first dimension to the input's first dimension. During backward propagation, the user can decide whether to calculate the gradient of the input, the target, or both.
Both the input X and Y can carry the LoD (Level of Details) information. However, the output only shares the LoD information with input X.
Inputs:  X : (Tensor) Input of SquaredL2DistanceOp.
 Y : (Tensor) Target of SquaredL2DistanceOp.
Outputs:  sub_result (Intermediate) : (Tensor) Buffering subtraction result which will be reused in backward.
 Out : (Tensor) Squared l2 distance between input and target.
while
Inputs:  X (Duplicable) : A set of variables, which are required by operators inside the block of While Op.
 Condition (Duplicable) : (Bool) A scalar. When it is False, the While Op will be terminated.
Outputs:  Out (Duplicable) : A set of variables, which will be assigned with values generated by the operators inside the block of While Op.
 StepScopes : (StepScopeVar) A vector of local scopes whose size equals the step number of the While Op. The i-th scope stores temporary variables generated in the i-th step.
Attributes:  sub_block (Duplicable): The step block inside WhileOp
relu
Relu Activation Operator.
$out = max(x, 0)$
Inputs:  X : Input of Relu operator
Outputs:  Out : Output of Relu operator
decayed_adagrad
Decayed Adagrad Optimizer.
The update is done as follows:
$$ moment\_out = decay * moment + (1 - decay) * grad * grad \\ param\_out = param - \frac{learning\_rate * grad}{\sqrt{moment\_out} + \epsilon} $$
The original paper(http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf) does not have an epsilon attribute. It is added here for numerical stability to avoid the division by zero error.
Inputs:  Param : (Tensor) Input parameter
 Grad : (Tensor) Input gradient
 Moment : (Tensor) Second moment
 LearningRate : (Tensor) Learning rate
Outputs:  ParamOut : (Tensor) Output parameter
 MomentOut : (Tensor) Output second moment
Attributes:  decay (Duplicable): (float, default 0.95) Discounting factor for coming gradient
 epsilon (Duplicable): (float, default 1.0e6) Constant for numerical stability
gru
GRU Operator implements part of the calculations of the complete GRU as follows:
$$ update\_gate: u_t = actGate(xu_t + W_u * h_{t-1} + b_u) \\ reset\_gate: r_t = actGate(xr_t + W_r * h_{t-1} + b_r) \\ output\_candidate: \tilde{h}_t = actNode(xc_t + W_c * dot(r_t, h_{t-1}) + b_c) \\ output: h_t = dot((1 - u_t), h_{t-1}) + dot(u_t, \tilde{h}_t) $$
@note To implement the complete GRU, a fully-connected operator must be used beforehand to feed xu, xr and xc as the Input of the GRU operator.
Inputs:  Input : (LoDTensor) The first input is a LoDTensor, which supports variable-length input sequences. The underlying tensor in this LoDTensor is a matrix with shape (T X 3D), where T is the total number of time steps in this mini-batch and D is the hidden size.
 H0 : (Tensor, optional) The initial hidden state is an optional input. This is a tensor with shape (N x D), where N is the batch size, D is the hidden size.
 Weight : (Tensor) The learnable hiddenhidden weight matrix with shape (D x 3D), where D is the hidden size. The elements continuous in memory can be divided into two parts. The first part are weights of the update gate and reset gate with shape (D x 2D), and the second part are weights of output candidate with shape (D x D).
 Bias : (Tensor, optional) Bias vector with shape (1 x 3D) concatenating the biases of the update gate, reset gate and output candidate.
Outputs:  BatchGate (Intermediate) : (LoDTensor) To compute with batches, sequence data will be reorganized into several successive batches each containing data from the same time step. The LoDTensor BatchGate contains the update gate, reset gate and output candidate values organized in batches. The LoD size is 2. The first LoD contains the batch offsets and the second LoD contains the indexes in the raw sequence data.
 BatchResetHiddenPrev (Intermediate) : (LoDTensor) The reset hidden state LoDTensor organized in batches. This LoDTensor is a matrix with shape (T X D) and has the same LoD as `BatchGate`.
 BatchHidden (Intermediate) : (LoDTensor) The hidden state LoDTensor organized in batches. This LoDTensor is a matrix with shape (T X D) and has the same LoD with `BatchGate`.
 Hidden : (LoDTensor) the hidden state LoDTensor organized in sequences. This LoDTensor is a matrix with shape (T X D) and has the same LoD with `BatchGate`.
Attributes:  activation (Duplicable): (string, default tanh) The activation type used for the output candidate $\tilde{h}_t$.
 gate_activation (Duplicable): (string, default sigmoid) The activation type used in update gate and reset gate.
 is_reverse (Duplicable): (bool, default: False) whether to compute reversed GRU.
ctc_align
CTCAlign op is used to merge repeated elements between two blanks and then delete all blanks in a sequence.
Given:
  Input.data = [0, 1, 2, 2, 0, 4, 0, 4, 5, 0, 6, 6, 0, 0, 7, 7, 7, 0]
  Input.dims = {18, 1}
  Input.LoD = [[0, 11, 18]]
And:
  blank = 0
  merge_repeated = True
Then:
  Output.data = [1, 2, 4, 4, 5, 6, 6, 7]
  Output.dims = {8, 1}
  Output.LoD = [[0, 6, 8]]
Inputs:  Input : (LodTensor, default: LoDTensor<int>), Its shape is [Lp, 1], where Lp is the sum of all input sequences' length.
Outputs:  Output : (Tensor, default: Tensor<int>), The align result.
Attributes:  blank (Duplicable): (int, default: 0), the blank label set in the Connectionist Temporal Classification (CTC) op.
 merge_repeated (Duplicable): (bool, default: true), whether to merge repeated elements between two blanks.
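The merge-then-delete rule is simple to express per sequence (the batched op applies it to each LoD segment independently; the helper name is ours):

```python
# Illustrative CTC alignment for one sequence: drop a token equal
# to its predecessor (merge repeats), then drop the blank label.
def ctc_align(seq, blank=0, merge_repeated=True):
    out = []
    prev = None
    for tok in seq:
        if merge_repeated and tok == prev:
            continue  # merge repeated element
        prev = tok
        if tok != blank:
            out.append(tok)
    return out
```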
beam_search
This is a beam search operator that helps to generate sequences.
Inputs:  pre_ids : ids in previous step
 ids : a LoDTensor of shape [None, k]
 scores : a LoDTensor that has the same shape and LoD with `ids`
Outputs:  selected_ids : a LoDTensor that stores the IDs selected by beam search
 selected_scores : a LoDTensor that has the same shape and LoD with `selected_ids`
Attributes:  level (Duplicable): the level of LoDTensor
 beam_size (Duplicable): beam size for beam search
 end_id (Duplicable): the token id which indicates the end of a sequence
split_lod_tensor
Split a LoDTensor with a Mask at a certain LoD level. Suppose the input LoDTensor has 3 sequences at that level, and the Mask is a bool column vector, such as [0, 1, 0], at the same level. The first and third sequences will be sent to the False Output LoDTensor, whereas the second sequence will be sent to the True Output LoDTensor. Please refer to MergeLoDTensorOp.
Inputs:  X : The input LoDTensor
 Mask : A bool column vector which mask the input
Outputs:  OutTrue : True branch of input LoDTensor
 OutFalse : False branch of input LoDTensor
Attributes:  level (Duplicable): (int) the specific lod level to split.
read
Read Operator. Executes a given reader once and outputs data.
Inputs:  Reader : (ReaderHolder) The executed reader.
Outputs:  Out (Duplicable) : (LoDTensor) The output data.
crop
Crop Operator.
Crop the input into the output, as specified by offsets and shape.
There are two ways to set the shape: 1. Reference input: crop input X into the same shape as the reference input. The dimension of the reference input should be the same as the dimension of input X. 2. Shape list: crop input X into the shape described by a list of integers. The size of the shape list should be the same as the dimension size of input X. The input should be a k-D tensor (0 < k < 7). As an example:
Case 1: Given
X = [[0, 1, 2, 0, 0] [0, 3, 4, 0, 0] [0, 0, 0, 0, 0]],
and
offsets = [0, 1],
and
shape = [2, 2],
we get:
Out = [[1, 2], [3, 4]].
Case 2: Given
X = [[0, 1, 2, 5, 0] [0, 3, 4, 6, 0] [0, 0, 0, 0, 0]],
and
offsets = [0, 1],
and
Y = [[0, 0, 0] [0, 0, 0]],
we get:
Out = [[1, 2, 5], [3, 4, 6]].
Inputs:  X : The input of crop op. The input should be a k-D tensor (0 < k < 7).
 Y : The input used as reference for cropping, which is of the same dimensions as X.
Outputs:  Out : The output of crop op, which is of the same dimensions as X.
Attributes:  offsets (Duplicable): A list<int> describing offsets to be cropped. The size of offsets list should be the same as the dimension size of input X.
 shape (Duplicable): A list<int> describing the shape of output. The size of shape list should be the same as the dimension size of input X.
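The shape-list variant above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the operator's actual kernel; the function name `crop` is chosen here for convenience.

```python
import numpy as np

def crop(x, offsets, shape):
    # Slice x starting at `offsets` and keep a window of size `shape`
    # along each dimension, mirroring the shape-list mode described above.
    slices = tuple(slice(o, o + s) for o, s in zip(offsets, shape))
    return x[slices]

# Case 1 from the text:
X = np.array([[0, 1, 2, 0, 0],
              [0, 3, 4, 0, 0],
              [0, 0, 0, 0, 0]])
out = crop(X, offsets=[0, 1], shape=[2, 2])
```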
brelu
BRelu Activation Operator.
$out = \min(\max(x, t_{min}), t_{max})$
Inputs:  X : Input of BRelu operator
Outputs:  Out : Output of BRelu operator
Attributes:  t_min (Duplicable): The min marginal value of BRelu
 t_max (Duplicable): The max marginal value of BRelu
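The bounded-ReLU formula above is a clip into the interval [t_min, t_max], which NumPy expresses directly. The default values below are illustrative, not the operator's documented defaults.

```python
import numpy as np

def brelu(x, t_min=1.0, t_max=24.0):
    # Bounded ReLU: clamp x into [t_min, t_max].
    return np.clip(x, t_min, t_max)
```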
crf_decoding
The crf_decoding operator reads the emission feature weights and the transition feature weights learned by the linear_chain_crf operator. It implements the Viterbi algorithm which is a dynamic programming algorithm for finding the most likely sequence of hidden states, called the Viterbi path, that results in a sequence of observed tags.
The output of this operator changes according to whether Input(Label) is given:
 Input(Label) is given:
This happens in training. This operator is meant to work together with the chunk_eval operator.
When Input(Label) is given, the crf_decoding operator returns a row vector with shape [N x 1] whose values are fixed to be 0, indicating an incorrect prediction, or 1 indicating a tag is correctly predicted. Such an output is the input to chunk_eval operator.
 Input(Label) is not given:
This is the standard decoding process.
The crf_decoding operator returns a row vector with shape [N x 1] whose values range from 0 to the maximum tag number - 1. Each element indicates an index of a predicted tag.
Inputs:  Emission : (LoDTensor, default: LoDTensor<float>). A LoDTensor with shape [N x D] where N is the size of the minibatch and D is the total tag number. This input is the unscaled emission weight matrix of the linear_chain_crf operator.
 Transition : (Tensor, default: Tensor<float>). A Tensor with shape [(D + 2) x D]. This input is the transition weights learned by the linear_chain_crf operator, denoted as w. The 1st row of w are transition weights for the start mask. The 2nd row of w are transition weights for the end mask. Transition weights between other tags begin from the 3rd row of w. See more details in comments of the linear_chain_crf operator.
 Label : (LoDTensor, LoDTensor<int64_t>). The ground truth with shape [N x 1]. This input is optional. See more details in the operator's comments.
Outputs:  ViterbiPath : (LoDTensor, LoDTensor<int64_t>). The decoding results. What to return changes depending on whether the Input(Label) (the ground truth) is given. See more details in the operator's comment.
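The Viterbi decoding described above can be sketched with NumPy, assuming the transition layout documented for Input(Transition): row 0 holds start weights, row 1 holds end weights, and the remaining D rows hold tag-to-tag transition weights. This is a minimal sketch of the standard algorithm, not the operator's kernel.

```python
import numpy as np

def viterbi_decode(emission, transition):
    # emission: (N, D) unscaled emission weights for one sequence.
    # transition: (D + 2, D); row 0 = start, row 1 = end, rows 2.. = transitions.
    start, end, trans = transition[0], transition[1], transition[2:]
    n, d = emission.shape
    score = start + emission[0]            # best score ending in each tag
    back = np.zeros((n, d), dtype=int)     # backpointers for path recovery
    for t in range(1, n):
        cand = score[:, None] + trans + emission[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    score = score + end
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Tiny 2-step, 2-tag example with zero transition weights:
em = np.array([[1.0, 0.0], [0.0, 1.0]])
tr = np.zeros((4, 2))
path = viterbi_decode(em, tr)
```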
maxout
MaxOut Operator.
Assume the input shape is (N, Ci, H, W) and the output shape is (N, Co, H, W). Then $Co = Ci / groups$ and the operator formula is as follows:
$$ y_{si+j} = \max_k x_{gsi + sk + j} \\ g = groups \\ s = \frac{input.size}{num\_channels} \\ 0 \le i < \frac{num\_channels}{groups} \\ 0 \le j < s \\ 0 \le k < groups $$
Please refer to the papers:  Maxout Networks: http://www.jmlr.org/proceedings/papers/v28/goodfellow13.pdf  Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks: https://arxiv.org/pdf/1312.6082v4.pdf
Inputs:  X : (Tensor) The input tensor of maxout operator. The format of input tensor is NCHW. Where N is batch size, C is the number of channels, H and W is the height and width of feature.
Outputs:  Out : (Tensor) The output tensor of maxout operator.The format of output tensor is also NCHW.Where N is batch size, C is the number of channels, H and W is the height and width of feature.
Attributes:  groups (Duplicable): Specifies how many groups the input tensor will be split into in the channel dimension. The number of output channels is the number of input channels divided by groups.
ftrl
FTRL (Follow The Regularized Leader) Operator.
Optimizer that implements the FTRL algorithm:
$$ new\_accum = squared\_accum + grad^2 \\ if (lr\_power == -0.5) { linear\_accum += grad - (\surd(new\_accum) - \surd(squared\_accum)) / (learning\_rate * param) \\ } else { linear\_accum += grad - (new\_accum^{-lr\_power} - accum^{-lr\_power}) / (learning\_rate * param) \\ } x = (l1 * sign(linear\_accum) - linear\_accum) \\ if (lr\_power == -0.5) { y = \frac{\surd(new\_accum)}{learning\_rate} + (2 * l2) \\ pre\_shrink = \frac{x}{y} \\ param = (abs(linear\_accum) > l1).select(pre\_shrink, 0.0) \\ } else { y = \frac{new\_accum^{-lr\_power}}{learning\_rate} + (2 * l2) \\ pre\_shrink = \frac{x}{y} \\ param = (abs(linear\_accum) > l1).select(pre\_shrink, 0.0) \\ } squared\_accum += grad^2; $$
The paper that proposed Follow The Regularized Leader (FTRL): (https://www.eecs.tufts.edu/~dsculley/papers/adclickprediction.pdf)
Inputs:  Param : (Tensor, default Tensor<float>) Input parameter value that has to be updated.
 SquaredAccumulator : (Tensor, default Tensor<float>) Accumulator that accumulates squared gradients.
 LinearAccumulator : (Tensor, default Tensor<float>) Accumulator that accumulates linear gradients.
 Grad : (Tensor, default Tensor<float>) Input gradient of the parameter.
 LearningRate : (Tensor, default Tensor<float>) The learning rate should be a tensor of size 1.
Outputs:  ParamOut : (Tensor) Output updated parameter value.
 SquaredAccumOut : (Tensor) Output accumulated squared gradients.
 LinearAccumOut : (Tensor) Output accumulated linear gradients.
Attributes:  l1 (Duplicable): (float, default 0.0) L1 regularization strength.
 l2 (Duplicable): (float, default 0.0) L2 regularization strength.
 lr_power (Duplicable): (float, default -0.5f) Learning Rate Power.
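A minimal NumPy sketch of one FTRL step, covering only the lr_power == -0.5 branch of the update above. It follows the standard FTRL-Proximal form (the accumulator correction is sigma * param with sigma = (sqrt(new) - sqrt(old)) / lr); the function name and signature are illustrative.

```python
import numpy as np

def ftrl_step(param, sq_acc, lin_acc, grad, lr, l1=0.0, l2=0.0):
    # Only the lr_power == -0.5 branch is sketched here.
    new_acc = sq_acc + grad ** 2
    lin_acc = lin_acc + grad - (np.sqrt(new_acc) - np.sqrt(sq_acc)) / lr * param
    x = l1 * np.sign(lin_acc) - lin_acc
    y = np.sqrt(new_acc) / lr + 2.0 * l2
    pre_shrink = x / y
    # Weights whose linear accumulator stays within the L1 ball are zeroed.
    param = np.where(np.abs(lin_acc) > l1, pre_shrink, 0.0)
    return param, new_acc, lin_acc
```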
conv_shift
ConvShift Operator.
A layer for circular convolution of two vectors, as used in the Neural Turing Machine: https://arxiv.org/abs/1410.5401
The equation is:
$$Out[i] = \sum_{j=-(N-1)/2}^{(N-1)/2} X_{i+j} * Y_{j}$$
where X's index is computed modulo M, and Y's index is computed modulo N.
Both inputs X and Y can carry LoD (Level of Details) information. However, the output only shares the LoD information with input X.
Inputs:  X : (Tensor, default Tensor<float>), a 2D tensor with shape B x M, where B is the batch size and M is the data dimension.
 Y : (Tensor, default Tensor<float>), a 2D tensor with shape B x N, where B is the batch size and N is the data dimension. N must be odd.
Outputs:  Out : (Tensor, default Tensor<float>), a 2D tensor with shape B x M, i.e., the same shape as X.
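The circular convolution above can be sketched directly from the equation: Y's index j runs over [-(N-1)/2, (N-1)/2] and X's index wraps modulo M. A slow but explicit NumPy sketch (the loop form is for clarity, not speed):

```python
import numpy as np

def conv_shift(x, y):
    # x: (B, M), y: (B, N) with N odd.
    # out[:, i] = sum_j x[:, (i + j) mod M] * y[:, j], j in [-(N-1)/2, (N-1)/2].
    b, m = x.shape
    _, n = y.shape
    half = (n - 1) // 2
    out = np.zeros_like(x, dtype=float)
    for i in range(m):
        for j in range(-half, half + 1):
            out[:, i] += x[:, (i + j) % m] * y[:, j + half]
    return out

# A centered identity kernel [0, 1, 0] leaves x unchanged.
x = np.array([[1.0, 2.0, 3.0, 4.0]])
y = np.array([[0.0, 1.0, 0.0]])
out = conv_shift(x, y)
```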
sum
Sum operator.
This operator sums the input tensors. All the inputs can carry the LoD (Level of Details) information. However, the output only shares the LoD information with the first input.
Inputs:  X (Duplicable) : (vector<Tensor>) The input tensors of sum operator.
Outputs:  Out : (Tensor) The output tensor of sum operator.
concat
Concat Operator.
Concatenate the input tensors along dimension axis. Examples: Input[0] = [[1,2],[3,4]] Input[1] = [[5,6]] axis = 0 Output = [[1,2], [3,4], [5,6]]
Inputs:  X (Duplicable) : Input tensors of concat operator.
Outputs:  Out : Output tensor of concat operator.
Attributes:  axis (Duplicable): The axis along which the input tensors will be concatenated.
less_equal
less_equal Operator
It operates elementwise on X and Y, and returns Out. Each of them is an N-dim tensor. X and Y could be of any type. Each element of the Out tensor is calculated by Out = X <= Y
Inputs:  X : (LoDTensor) the left hand operand of less_equal operator
 Y : (LoDTensor) the right hand operand of less_equal operator
Outputs:  Out : (LoDTensor) ndim bool tensor. Each element is Out = X <= Y
Attributes:  axis (Duplicable): (int, default -1). The start dimension index for broadcasting Y onto X.
equal
equal Operator
It operates elementwise on X and Y, and returns Out. Each of them is an N-dim tensor. X and Y could be of any type. Each element of the Out tensor is calculated by Out = X == Y
Inputs:  X : (LoDTensor) the left hand operand of equal operator
 Y : (LoDTensor) the right hand operand of equal operator
Outputs:  Out : (LoDTensor) ndim bool tensor. Each element is Out = X == Y
Attributes:  axis (Duplicable): (int, default -1). The start dimension index for broadcasting Y onto X.
gather
Gather Operator.
$Out = X[Index]$
Out is obtained by gathering entries of the outermost dimension of X indexed by Index and concatenating them together.
Example:
X = [[1, 2], [3, 4], [5, 6]]
Index = [[1, 2]]
Then:
Out = [[3, 4], [5, 6]]
Inputs:  X : The source input of gather op
 Index : The index input of gather op
Outputs:  Out : The output of gather op
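Gathering along the outermost dimension is just row indexing in NumPy; a sketch reproducing the example above (the index is flattened first, matching how the example's [[1, 2]] selects rows 1 and 2):

```python
import numpy as np

def gather(x, index):
    # Select rows of x along the outermost dimension by the flattened index.
    return x[np.asarray(index).ravel()]

X = np.array([[1, 2], [3, 4], [5, 6]])
out = gather(X, [[1, 2]])
```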
clip_by_norm
ClipByNorm Operator.
This operator limits the L2 norm of the input $X$ within $max\_norm$. If the L2 norm of $X$ is less than or equal to $max\_norm$, $Out$ will be the same as $X$. If the L2 norm of $X$ is greater than $max\_norm$, $X$ will be linearly scaled to make the L2 norm of $Out$ equal to $max\_norm$, as shown in the following formula:
$$ Out = \frac{max\_norm * X}{norm(X)}, $$
where $norm(X)$ represents the L2 norm of $X$.
Inputs:  X : (Tensor) The input of clip_by_norm op. The number of dimensions must be between [1, 9].
Outputs:  Out : (Tensor) The output of clip_by_norm op with shape as input(X)
Attributes:  max_norm (Duplicable): (float) The maximum norm value.
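The two-case rule above (pass through, or rescale so the norm equals max_norm) can be sketched as:

```python
import numpy as np

def clip_by_norm(x, max_norm):
    # If ||x||_2 <= max_norm, return x unchanged;
    # otherwise scale x so that ||out||_2 == max_norm.
    norm = np.sqrt((x ** 2).sum())
    if norm <= max_norm:
        return x
    return max_norm * x / norm
```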
sigmoid
Sigmoid Activation Operator
$$out = \frac{1}{1 + e^{-x}}$$
Inputs:  X : Input of Sigmoid operator
Outputs:  Out : Output of Sigmoid operator
floor
Floor Activation Operator.
$out = floor(x)$
Inputs:  X : Input of Floor operator
Outputs:  Out : Output of Floor operator
sequence_concat
The sequence_concat operator concatenates multiple LoDTensors. It only supports a sequence (LoDTensor with LoD level number 1) or a nested sequence (LoDTensor with LoD level number 2) as its input.  Case1: If the axis is other than 0 (here, axis is 1 and level is 1), each input should have the same LoD information, and the LoD information of the output stays the same as that of the inputs.
LoD(x0) = {{0,2,4}, {0,1,2,3,4}}; Dims(x0) = (4,3,4) LoD(x1) = {{0,2,4}, {0,1,2,3,4}}; Dims(x1) = (4,4,4) LoD(Out) = {{0,2,4}, {0,1,2,3,4}}; Dims(Out) = (4,7,4)
 Case2: If the axis is 0 (here, level is 0), the inputs are concatenated along time steps, and the LoD information of the output needs to be recomputed. The level-1 LoD information should be the same across inputs.
LoD(x0) = {{0,2,4}, {0,1,2,3,4}}; Dims(x0) = (4,3,4) LoD(x1) = {{0,2,4}, {0,1,3,5,7}}; Dims(x1) = (7,3,4) LoD(Out) = {{0,2,4}, {0,2,5,8,11}}; Dims(Out) = (11,3,4)
 Case3: If the axis is 0 (here, level is 1).
LoD(x0) = {{0,2,4}, {0,1,2,3,4}}; Dims(x0) = (4,3,4) LoD(x1) = {{0,3,4}, {0,1,3,5,7}}; Dims(x1) = (7,3,4) LoD(Out) = {{0,5,8}, {0,1,2,3,5,7,8,9,11}}; Dims(Out) = (11,3,4)
 Case4: If the LoD number is 1, axis is 0, level is 0
LoD(x0) = {{0,1,2,3,4}}; Dims(x0) = (4,3,4) LoD(x1) = {{0,1,3,5,7}}; Dims(x1) = (7,3,4) LoD(Out) = {{0,2,5,8,11}}; Dims(Out) = (11,3,4)
NOTE: The levels of all the inputs should be the same.
Inputs:  X (Duplicable) : (LoDTensorArray) Input is a vector of LoDTensor, each of which is a variable-length sequence or nested sequence.
Outputs:  Out : (LoDTensor), Variable-length output of sequence_concat Op.
Attributes:  axis (Duplicable): (int, default 0) The axis along which the inputs will be joined. If axis is 0, the inputs will be joined with LoD index.
 level (Duplicable): (int, default 0) The level at which the inputs will be joined. If the level is 0, the inputs will be joined at the nested sequence level. If the level is 1, the inputs will be joined at the sequence level. The level should be less than the level number of inputs.
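For Case 4 above (one LoD level, axis 0, level 0), the output LoD offsets are the elementwise sums of the input offsets, and the i-th sequences of all inputs are concatenated along time steps. A minimal NumPy sketch of that case only (function name and signature are illustrative):

```python
import numpy as np

def seq_concat_level0(xs, lods):
    # xs: list of (T_i, ...) arrays; lods: list of offset lists, one per input.
    # Sequences with the same index are concatenated along time steps.
    n_seq = len(lods[0]) - 1
    pieces = []
    for i in range(n_seq):
        for x, lod in zip(xs, lods):
            pieces.append(x[lod[i]:lod[i + 1]])
    # Output offsets are the elementwise sums of the input offsets.
    out_lod = [sum(l[i] for l in lods) for i in range(n_seq + 1)]
    return np.concatenate(pieces, axis=0), out_lod

# Case 4 offsets from the text: {0,1,2,3,4} and {0,1,3,5,7} -> {0,2,5,8,11}.
x0 = np.arange(4).reshape(4, 1)
x1 = np.arange(7).reshape(7, 1)
out, out_lod = seq_concat_level0([x0, x1], [[0, 1, 2, 3, 4], [0, 1, 3, 5, 7]])
```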
cast
Cast Operator.
This operator casts the input tensor to another data type and returns the output tensor.
Inputs:  X : The input tensor of cast op
Outputs:  Out : The output tensor of cast op
Attributes:  out_dtype (Duplicable): output data type
 in_dtype (Duplicable): input data type
chunk_eval
For some basics of chunking, please refer to "Chunking with Support Vector Machines" (https://aclanthology.info/pdf/N/N01/N01-1025.pdf).
The chunk_eval operator computes the precision, recall, and F1-score of chunk detection, and supports IOB, IOE, IOBES and IO (also known as plain) tagging schemes. Here is an NER example of labeling under these tagging schemes:
Li Ming works at Agricultural Bank of China in Beijing.
IO:    I-PER I-PER O O I-ORG I-ORG I-ORG I-ORG O I-LOC
IOB:   B-PER I-PER O O B-ORG I-ORG I-ORG I-ORG O B-LOC
IOE:   I-PER E-PER O O I-ORG I-ORG I-ORG E-ORG O E-LOC
IOBES: B-PER E-PER O O I-ORG I-ORG I-ORG E-ORG O S-LOC
There are three chunk types (named entity types) here, including PER (person), ORG (organization) and LOC (location), and the labels have the form <tag type>-<chunk type>. Since the calculations actually use label ids rather than labels, extra attention should be paid when mapping labels to ids so that the chunk_eval operator works. The key point is that the listed equations are satisfied by the ids:
tag_type = label % num_tag_type
chunk_type = label / num_tag_type
where num_tag_type is the number of tag types in the tagging scheme, num_chunk_type is the number of chunk types, and tag_type gets its value from the following table:
Scheme  Begin  Inside  End  Single
plain     0      -      -     -
IOB       0      1      -     -
IOE       -      0      1     -
IOBES     0      1      2     3
Still use NER as example, assuming the tagging scheme is IOB while chunk types are ORG, PER and LOC. To satisfy the above equations, the label map can be like this:
B-ORG 0, I-ORG 1, B-PER 2, I-PER 3, B-LOC 4, I-LOC 5, O 6
It's not hard to verify the equations, noting that the number of chunk types is 3 and the number of tag types in the IOB scheme is 2. For example, the label id of I-LOC is 5, so its tag type id is 1 and its chunk type id is 2, which is consistent with the results from the equations.
Inputs:  Inference : (Tensor, default: Tensor<int64_t>). Predictions from the network.
 Label : (Tensor, default: Tensor<int64_t>). The true tag sequences.
Outputs:  Precision : (float). The evaluated precision (called positive predictive value) of chunks on the given minibatch.
 Recall : (float). The evaluated recall (true positive rate or sensitivity) of chunks on the given minibatch.
 F1Score : (float). The evaluated F1Score on the given minibatch.
 NumInferChunks : (int64_t). The number of chunks in Inference on the given minibatch.
 NumLabelChunks : (int64_t). The number of chunks in Label on the given minibatch.
 NumCorrectChunks : (int64_t). The number of chunks both in Inference and Label on the given minibatch.
Attributes:  num_chunk_types (Duplicable): (int). The number of chunk type. See below for details.
 chunk_scheme (Duplicable): (string, default IOB). The labeling scheme indicating how to encode the chunks. Must be IOB, IOE, IOBES or plain. See below for details.
 excluded_chunk_types (Duplicable): (list<int>) A list including chunk type ids indicating chunk types that are not counted. See below for details.
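The label-id equations above are a two-line computation; a sketch that reproduces the I-LOC worked example (IOB scheme, so 2 tag types; chunk types ORG=0, PER=1, LOC=2):

```python
def label_to_tag_chunk(label, num_tag_types):
    # tag_type = label % num_tag_types; chunk_type = label // num_tag_types
    return label % num_tag_types, label // num_tag_types

# I-LOC has label id 5 under the mapping in the text:
# tag type 1 (Inside), chunk type 2 (LOC).
tag_type, chunk_type = label_to_tag_chunk(5, 2)
```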
box_coder
Bounding Box Coder Operator. Encode/decode the target bounding box with the prior-box information.
The encoding schema is described below:
ox = (tx - px) / pw / pxv
oy = (ty - py) / ph / pyv
ow = log(abs(tw / pw)) / pwv
oh = log(abs(th / ph)) / phv
The decoding schema is described below:
ox = (pw * pxv * tx + px) - tw / 2
oy = (ph * pyv * ty + py) - th / 2
ow = exp(pwv * tw) * pw + tw / 2
oh = exp(phv * th) * ph + th / 2
where tx, ty, tw, th denote the target box's center coordinates, width and height respectively. Similarly, px, py, pw, ph denote the prior box's (anchor) center coordinates, width and height. pxv, pyv, pwv, phv denote the variance of the prior box, and ox, oy, ow, oh denote the encoded/decoded coordinates, width and height.
Inputs:  PriorBox : (Tensor, default Tensor<float>) Box list. PriorBox is a 2-D Tensor with shape [M, 4] holding M boxes, each represented as [xmin, ymin, xmax, ymax]. [xmin, ymin] is the left top coordinate of the anchor box; if the input is an image feature map, it is close to the origin of the coordinate system. [xmax, ymax] is the right bottom coordinate of the anchor box.
 PriorBoxVar : (Tensor, default Tensor<float>) PriorBoxVar is a 2-D Tensor with shape [M, 4] holding M groups of variances.
 TargetBox : (LoDTensor or Tensor) this input is a 2-D LoDTensor with shape [N, 4], each box represented as [xmin, ymin, xmax, ymax]. [xmin, ymin] is the left top coordinate of the box; if the input is an image feature map, it is close to the origin of the coordinate system. [xmax, ymax] is the right bottom coordinate of the box. This tensor can contain LoD information to represent a batch of inputs. One instance of this batch can contain different numbers of entities.
Outputs:  OutputBox : (LoDTensor or Tensor) (Tensor) The output of box_coder_op, a tensor with shape [N, M, 4] representing the result of N target boxes encoded/decoded with M Prior boxes and variances.
Attributes:  code_type (Duplicable): (string, default encode_center_size) the code type used with the target box
bipartite_match
This operator is a greedy bipartite matching algorithm, which is used to obtain the matching with the maximum distance based on the input distance matrix. For an input 2-D matrix, the bipartite matching algorithm can find the matched column for each row, and can also find the matched row for each column; this operator only calculates matched indices from column to row. For each instance, the number of matched indices is the number of columns of the input distance matrix.
There are two outputs, to save matched indices and distances. In short, this algorithm matches the best (maximum distance) row entity to each column entity, and the matched indices are not duplicated in each row of ColToRowMatchIndices. If a column entity is not matched to any row entity, -1 is set in ColToRowMatchIndices.
Please note that the input DistMat can be LoDTensor (with LoD) or Tensor. If LoDTensor with LoD, the height of ColToRowMatchIndices is batch size. If Tensor, the height of ColToRowMatchIndices is 1.
Inputs:  DistMat : (LoDTensor or Tensor) this input is a 2D LoDTensor with shape [K, M]. It is pairwise distance matrix between the entities represented by each row and each column. For example, assumed one entity is A with shape [K], another entity is B with shape [M]. The DistMat[i][j] is the distance between A[i] and B[j]. The bigger the distance is, the better macthing the pairs are. Please note, This tensor can contain LoD information to represent a batch of inputs. One instance of this batch can contain different numbers of entities.
Outputs:  ColToRowMatchIndices : (Tensor) A 2-D Tensor with shape [N, M] in int type. N is the batch size. If ColToRowMatchIndices[i][j] is -1, it means B[j] does not match any entity in the i-th instance. Otherwise, it means B[j] is matched to row ColToRowMatchIndices[i][j] in the i-th instance. The row number of the i-th instance is saved in ColToRowMatchIndices[i][j].
 ColToRowMatchDist : (Tensor) A 2-D Tensor with shape [N, M] in float type. N is the batch size. If ColToRowMatchIndices[i][j] is -1, ColToRowMatchDist[i][j] is also -1.0. Otherwise, assume ColToRowMatchIndices[i][j] = d, and let the row offsets of each instance be called LoD. Then ColToRowMatchDist[i][j] = DistMat[d+LoD[i]][j]
batch_norm
Batch Normalization.
Batch Norm has been implemented as discussed in the paper: https://arxiv.org/pdf/1502.03167.pdf It can be used as a normalizer function for conv2d and fully_connected operations. The required data format for this layer is one of the following: 1. NHWC: [batch, in_height, in_width, in_channels] 2. NCHW: [batch, in_channels, in_height, in_width]
Inputs:  X : The input tensor
 Scale : Scale is a 1dimensional tensor of size C that is applied to the output
 Bias : Bias is a 1dimensional tensor of size C that is applied to the output
 Mean : The global mean (for training) or estimated mean (for testing)
 Variance : The global variance (for training) or estimated Variance (for testing)
Outputs:  Y : result after normalization
 MeanOut : Share memory with Mean. Store the global mean when training
 VarianceOut : Share memory with Variance. Store the global Variance when training
 SavedMean (Intermediate) : Mean of the current mini batch, will apply to output when training
 SavedVariance (Intermediate) : Variance of the current mini batch, will apply to output when training
Attributes:  is_test (Duplicable):
 momentum (Duplicable):
 epsilon (Duplicable):
 data_layout (Duplicable):
auc
Area Under The Curve (AUC) Operator.
This implementation computes the AUC according to forward output and label. It is used very widely in binary classification evaluation. As a note: If input label contains values other than 0 and 1, it will be cast to bool. You can find the relevant definitions here: https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve
There are two types of possible curves: 1. ROC: Receiver operating characteristic 2. PR: Precision Recall
Inputs:  Out : A floating point 2-D tensor, values are in the range [0, 1]. Each row is sorted in descending order. This input should be the output of topk. Typically, this tensor indicates the probability of each label.
 Indices : An int 2-D tensor, indicating the indices of the original tensor before sorting. Typically, this tensor indicates which label the probability stands for.
 Label : A 2-D int tensor indicating the label of the training data. The height is batch size and the width is always 1.
Outputs:  AUC : A scalar representing the current areaunderthecurve.
Attributes:  curve (Duplicable): Curve type, can be 'ROC' or 'PR'.
 num_thresholds (Duplicable): The number of thresholds to use when discretizing the roc curve.
assign_value
AssignValue operator
$$Out = values$$
Inputs: (none) Outputs:  Out : (Tensor) Output tensor of assign_value operator.
Attributes:  shape (Duplicable): (vector<int>) Shape of values.
 dtype (Duplicable): data type of values
 fp32_values (Duplicable): store the float values
 int32_values (Duplicable): store the int values
split
Split operator
This operator splits the input tensor into multiple sub-tensors.
Example: Input = [[1,2], [3,4], [5,6]] sections = [2,1] axis = 0 Output[0] = [[1,2], [3,4]] Output[1] = [[5,6]]
Inputs:  X : (Tensor) Input tensor of the split operator.
Outputs:  Out (Duplicable) : (Tensor) Output tensors of the split operator.
Attributes:  sections (Duplicable): (vector<int>) the length of each output along the specified axis.
 num (Duplicable): (int, default 0) Number of sub-tensors. This must evenly divide Input.dims()[axis]
 axis (Duplicable): (int, default 0) The axis along which the input will be split.
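The sections-based split in the example above maps onto NumPy, with one wrinkle: np.split takes cut points rather than section sizes, so the sizes are converted to cumulative offsets first.

```python
import numpy as np

x = np.array([[1, 2], [3, 4], [5, 6]])
sections = [2, 1]
# Convert section sizes [2, 1] into cut points [2] along axis 0.
outs = np.split(x, np.cumsum(sections)[:-1], axis=0)
```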
beam_search_decode
Pack the result of Beam search op into SentenceIds and SentenceScores.
Inputs:  Ids : (LodTensorArray) ids of the candidate words in each step
 Scores : (LodTensorArray)score of the candidate words in each step
Outputs:  SentenceIds : (LodTensor)All possible result sentences of word ids
 SentenceScores : (LodTensor)All possible result sentences of word scores
assign
Assign Operator
Out = X when the type of X is LoDTensor, SelectedRows, or LoDTensorArray; an error is raised if the type is not listed above.
Inputs:  X : (LoDTensor, SelectedRows or LoDTensorArray) The input variable could be LoDTensor, SelectedRows or LoDTensorArray.
Outputs:  Out : (LoDTensor, SelectedRows or LoDTensorArray) The type of output is the same as input X.
adadelta
Adadelta Optimizer.
The Adadelta optimizer is implemented as explained in: https://arxiv.org/abs/1212.5701 Adadelta is a per-dimension adaptive learning rate method used for gradient descent.
Adadelta updates are as follows:
$$ avg\_squared\_grad\_out = \rho * avg\_squared\_grad + (1 - \rho) * grad * grad \\ param\_update = -\sqrt{\frac{avg\_squared\_update + \epsilon}{avg\_squared\_grad\_out + \epsilon}} * grad \\ avg\_squared\_update\_out = \rho * avg\_squared\_update + (1 - \rho) * {param\_update}^2 \\ param\_out = param + param\_update $$
Inputs:  Param : (Tensor) Input parameter
 Grad : (Tensor) Input gradient
 AvgSquaredGrad : (Tensor) Input average of squared gradient
 AvgSquaredUpdate : (Tensor) Input average of squared parameter updates
Outputs:  ParamOut : (Tensor) Output parameter
 AvgSquaredGradOut : (Tensor) Output average of squared gradient
 AvgSquaredUpdateOut : (Tensor) Output average of squared parameter updates
Attributes:  rho (Duplicable): (float, default 0.95) Exponential decay rate for squared gradients.
 epsilon (Duplicable): (float, default 1.0e-6) Constant for numerical stability
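The four update equations above translate line-for-line into NumPy; this sketch performs one Adadelta step (function name and signature are illustrative):

```python
import numpy as np

def adadelta_step(param, grad, avg_sq_grad, avg_sq_update, rho=0.95, eps=1e-6):
    # Accumulate squared gradients, compute the scaled update,
    # then accumulate squared updates, per the equations above.
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad ** 2
    update = -np.sqrt((avg_sq_update + eps) / (avg_sq_grad + eps)) * grad
    avg_sq_update = rho * avg_sq_update + (1 - rho) * update ** 2
    return param + update, avg_sq_grad, avg_sq_update
```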
nce
Compute and return the noise-contrastive estimation training loss. See "Noise-contrastive estimation: A new estimation principle for unnormalized statistical models". By default this operator uses a uniform distribution for sampling.
Inputs:  Input : (Tensor) A tensor of shape [batch_size, dim].
 Label : (Tensor) A tensor of shape [batch_size, num_true_class]. 'num_true_class' is the number of target classes in each sample. The number of target classes per sample should be the same. If you have a variable number of target classes, you can pad them out to a constant number by either repeating them or by padding with an otherwise unused class.
 Weight : (Tensor) A tensor of shape [num_class, dim]. 'num_class' is the total number of classes.
 Bias : (Tensor) A tensor of shape [num_class, 1]. 'num_class' is the total number of classes. It is a dispensable input.
 SampleWeight : (Tensor) A tensor of shape [batch_size, 1] storing a weight for each sample; it is a dispensable input. The default weight of each sample is 1.
Outputs:  Cost : (Tensor) A tensor of shape [batch_size, 1]. Cost of samples.
 SampleLogits (Intermediate) : An intermediate tensor of shape [batch_size, num_neg_samples + num_pos_samples]. This tensor is an output of the forward kernel and is used in the backward kernel to compute grads. Given X is the dot product of the input tensor and the sampled labels' weights, 'SampleLogits' is sigmoid(X).
 SampleLabels (Intermediate) : An intermediate tensor of shape [batch_size, num_neg_samples + num_pos_samples]. This tensor is an output of the forward kernel and is used in the backward kernel to compute grads.
Attributes:  num_total_classes (Duplicable): Total number of classes in all samples.
 num_neg_samples (Duplicable): The number of negative classes. The default value is 10.
 custom_neg_classes (Duplicable): This attribute should only be used in unit tests. Classes in this list will be used as negative classes for every sample. Under normal conditions, users should avoid setting this attribute.
linear_chain_crf
LinearChainCRF Operator.
Conditional Random Field defines an undirected probabilistic graph with nodes denoting random variables and edges denoting dependencies between these variables. CRF learns the conditional probability $P(Y|X)$, where $X = (x_1, x_2, ... , x_n)$ are structured inputs and $Y = (y_1, y_2, ... , y_n)$ are labels for the inputs.
Linear chain CRF is a special case of CRF that is useful for sequence labeling task. Sequence labeling tasks do not assume a lot of conditional independences among inputs. The only constraint they impose is that the input and output must be linear sequences. Thus, the graph of such a CRF is a simple chain or a line, which results in the linear chain CRF.
This operator implements the ForwardBackward algorithm for the linear chain CRF. Please refer to http://www.cs.columbia.edu/~mcollins/fb.pdf and http://cseweb.ucsd.edu/~elkan/250Bwinter2012/loglinearCRFs.pdf for details.
Equation: 1. Denote Input(Emission) to this operator as $x$ here. 2. The first D values of Input(Transition) to this operator are for starting weights, denoted as $a$ here. 3. The next D values of Input(Transition) of this operator are for ending weights, denoted as $b$ here. 4. The remaining values of Input(Transition) are for transition weights, denoted as $w$ here. 5. Denote Input(Label) as $s$ here.
The probability of a sequence $s$ of length $L$ is defined as: $$P(s) = (1/Z) \exp(a_{s_1} + b_{s_L} + \sum_{l=1}^L x_{s_l} + \sum_{l=2}^L w_{s_{l-1},s_l})$$
where $Z$ is a normalization value so that the sum of $P(s)$ over all possible sequences is 1, and $x$ is the emission feature weight to the linear chain CRF.
Finally, the linear chain CRF operator outputs the logarithm of the conditional likelihood of each training sample in a minibatch.
NOTE: 1. The feature function for a CRF is made up of the emission features and the transition features. The emission feature weights are NOT computed in this operator. They MUST be computed first before this operator is called.
2. Because this operator performs global normalization over all possible sequences internally, it expects UNSCALED emission feature weights. Please do not call this op with the emission feature being the output of any nonlinear activation.
3. The 2nd dimension of Input(Emission) MUST be equal to the tag number.
Inputs:  Emission : (LoDTensor, default LoDTensor<float>) A 2D LoDTensor with shape [N x D], where N is the size of the minibatch and D is the total tag number. The unscaled emission weight matrix for the linear chain CRF.
 Transition : (Tensor, default Tensor<float>) A 2D Tensor with shape [(D + 2) x D]. The learnable parameter for the linear_chain_crf operator. See more details in the operator's comments.
 Label : (LoDTensor, default LoDTensor<int64_t>) A LoDTensor with shape [N x 1], where N is the total element number in a minibatch. The ground truth.
Outputs:  Alpha (Intermediate) : (Tensor, default Tensor<float>) A 2-D Tensor with shape [N x D]. The forward vectors for the entire batch. Denote it as $\alpha$. $\alpha$ is a memo table used to calculate the normalization factor in CRF. $\alpha[k, v]$ stores the unnormalized probabilities of all possible unfinished sequences of tags that end at position $k$ with tag $v$. For each $k$, $\alpha[k, v]$ is a vector of length $D$ with a component for each tag value $v$. This vector is called a forward vector and will also be used in backward computations.
 EmissionExps (Intermediate) : (Tensor, default Tensor<float>) A 2D Tensor with shape [N x D]. The exponentials of Input(Emission). This is an intermediate computational result in forward computation, and will be reused in backward computation.
 TransitionExps (Intermediate) : (Tensor, default Tensor<float>) A 2D Tensor with shape [(D + 2) x D]. The exponentials of Input(Transition). This is an intermediate computational result in forward computation, and will be reused in backward computation.
 LogLikelihood : (Tensor, default Tensor<float>) The logarithm of the conditional likelihood of each training sample in a minibatch. This is a 2D tensor with shape [S x 1], where S is the sequence number in a minibatch. Note: S is equal to the sequence number in a minibatch. The output is no longer a LoDTensor.
logsigmoid
Logsigmoid Activation Operator
$$out = \log \frac{1}{1 + e^{-x}}$$
Inputs:  X : Input of LogSigmoid operator
Outputs:  Out : Output of LogSigmoid operator
row_conv
Rowconvolution Operator.
The row convolution is called lookahead convolution. This operator was introduced in the following paper for DeepSpeech2: http://www.cs.cmu.edu/~dyogatam/papers/wang+etal.iclrworkshop2016.pdf
The main motivation is that a bidirectional RNN, useful in DeepSpeech-like speech models, learns representations for a sequence by performing a forward and a backward pass through the entire sequence. However, unlike unidirectional RNNs, bidirectional RNNs are challenging to deploy in an online and low-latency setting. The lookahead convolution incorporates information from future subsequences in a computationally efficient manner to improve unidirectional recurrent neural networks. The row convolution operator is different from the 1D sequence convolution, and is computed as follows:
Given an input sequence $in$ of length $t$ and input dimension $d$, and a filter ($W$) of size $context \times d$, the output sequence is convolved as:
$$ out_{i, :} = \sum_{j=i}^{i + context} in_{j,:} \cdot W_{j-i, :} $$
Inputs:  X : (LoDTensor), the input(X) is a LodTensor, which supports variable timelength input sequences. The underlying tensor in this LoDTensor is a matrix with shape (T x N), where T is the total time steps in this minibatch and N is the input data dimension.
 Filter : (Tensor), the input(Filter) is a learnable parameter. It is a 2D tensor with shape (future_context x N), where, future_context is the future context length and N is the data dimension.
Outputs:  Out : (LoDTensor), the output(Out) is a LodTensor, which supports variable timelength input sequences. The underlying tensor in this LodTensor is a matrix with shape T x N, i.e., the same shape as X.
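The lookahead convolution above, for a single sequence, is a short double loop: each output step mixes the current step with up to `context - 1` future steps through the filter rows. A minimal NumPy sketch (one sequence, no LoD batching; the truncation at the sequence end is an assumption of this sketch):

```python
import numpy as np

def row_conv(x, w):
    # x: (T, N) one sequence; w: (context, N) lookahead filter.
    # out[i] = sum over j in [i, i + context) of x[j] * w[j - i],
    # truncated at the end of the sequence.
    t, n = x.shape
    context = w.shape[0]
    out = np.zeros_like(x, dtype=float)
    for i in range(t):
        for j in range(i, min(i + context, t)):
            out[i] += x[j] * w[j - i]
    return out

# With all-ones input and filter of context 2, interior steps sum two rows.
x = np.ones((3, 1))
w = np.ones((2, 1))
out = row_conv(x, w)
```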
exp
Exp Activation Operator.
$out = e^x$
Inputs:  X : Input of Exp operator
Outputs:  Out : Output of Exp operator