# Operators

## sgd

SGD operator

This operator implements one step of the stochastic gradient descent algorithm.

$$param\_out = param - learning\_rate * grad$$

Inputs:
- Param : (Tensor) Input parameter
- LearningRate : (Tensor) Learning rate of SGD
- Grad : (Tensor) Input gradient

Outputs:
- ParamOut : (Tensor) Output parameter
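The update above is a plain element-wise expression, so it can be sketched directly in NumPy (all values here are hypothetical, purely for illustration):

```python
import numpy as np

# Hypothetical parameter, gradient and learning rate.
param = np.array([1.0, 2.0, 3.0])
grad = np.array([0.5, -0.5, 1.0])
learning_rate = 0.1

# param_out = param - learning_rate * grad
param_out = param - learning_rate * grad
# -> approximately [0.95, 2.05, 2.9]
```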

## print

Creates a print op that will print when a tensor is accessed.

Wraps the tensor passed in so that whenever the tensor is accessed, the given message is printed, along with the current value of the tensor.

Inputs:
- In : Input tensor to be displayed.

Outputs:
- Out : Output tensor with the same data as the input tensor.

Attributes:
- first_n : Only log first_n number of times.
- message : A string message to print as a prefix.
- summarize : Number of elements printed.
- print_tensor_name : Whether to print the tensor name.
- print_tensor_type : Whether to print the tensor's dtype.
- print_tensor_shape : Whether to print the tensor's shape.
- print_tensor_lod : Whether to print the tensor's LoD.
- print_phase : (string, default 'BOTH') Which phase to display, one of 'FORWARD', 'BACKWARD' and 'BOTH'.

## adagrad

Adaptive Gradient (Adagrad) Optimizer.

The update is done as follows:

$$moment\_out = moment + grad * grad \\ param\_out = param - \frac{learning\_rate * grad}{\sqrt{moment\_out} + \epsilon}$$

The original paper (http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf) does not have the epsilon attribute. It is added in our implementation, as also proposed in http://cs231n.github.io/neural-networks-3/#ada, for numerical stability to avoid division by zero.

Inputs:
- Param : (Tensor) Input parameter
- Grad : (Tensor) Input gradient
- Moment : (Tensor) Second moment
- LearningRate : (Tensor) Learning rate

Outputs:
- ParamOut : (Tensor) Output parameter
- MomentOut : (Tensor) Output second moment

Attributes:
- epsilon : (float, default 1.0e-6) Constant for numerical stability
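A minimal NumPy sketch of the two update equations above (all values hypothetical):

```python
import numpy as np

# Hypothetical parameter, gradient and accumulated second moment.
param = np.array([1.0, 2.0])
grad = np.array([0.5, -1.0])
moment = np.array([0.25, 1.0])
learning_rate = 0.1
epsilon = 1.0e-6

# moment_out = moment + grad * grad
moment_out = moment + grad * grad
# param_out = param - learning_rate * grad / (sqrt(moment_out) + epsilon)
param_out = param - learning_rate * grad / (np.sqrt(moment_out) + epsilon)
```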

## max_pool3d_with_index

MaxPool3d Operator.

The maxpooling3d with index operation calculates the output and the mask based on the input and ksize, strides, paddings parameters. Input(X) and output(Out, Mask) are in NCDHW format, where N is batch size, C is the number of channels, and D, H and W are the depth, height and width of the feature, respectively. Parameters(ksize, strides, paddings) are three elements. These three elements represent depth, height and width, respectively. The input(X) size and output(Out, Mask) size may be different.

Example:
Input: X shape: $(N, C, D_{in}, H_{in}, W_{in})$
Output: Out shape: $(N, C, D_{out}, H_{out}, W_{out})$, Mask shape: $(N, C, D_{out}, H_{out}, W_{out})$
Where
$$D_{out} = \frac{(D_{in} - ksize[0] + 2 * paddings[0])}{strides[0]} + 1 \\ H_{out} = \frac{(H_{in} - ksize[1] + 2 * paddings[1])}{strides[1]} + 1 \\ W_{out} = \frac{(W_{in} - ksize[2] + 2 * paddings[2])}{strides[2]} + 1$$
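The output-shape formula above can be checked with a small helper; the concrete sizes here are hypothetical:

```python
def pool_out_dim(in_dim, ksize, padding, stride):
    # out = (in - ksize + 2 * padding) / stride + 1
    return (in_dim - ksize + 2 * padding) // stride + 1

# e.g. a depth of 13 pooled with ksize 3, padding 1, stride 2 gives 7
d_out = pool_out_dim(13, 3, 1, 2)
```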

Inputs:
- X : (Tensor) The input tensor of the pooling operator. The format of the input tensor is NCDHW, where N is the batch size, C is the number of channels, and D, H and W are the depth, height and width of the image, respectively.

Outputs:
- Out : (Tensor) The output tensor of the pooling operator. The format of the output tensor is also NCDHW, where N is the batch size, C is the number of channels, and D, H and W are the depth, height and width of the image, respectively.
- Mask : (Tensor) The mask tensor of the pooling operator. The format of the output tensor is also NCDHW, where N is the batch size, C is the number of channels, and D, H and W are the depth, height and width of the image, respectively. It represents the index in the current feature map.

Attributes:
- ksize : (vector) The pooling window size (depth, height, width) of the pooling operator. If global_pooling = true, ksize and paddings will be ignored.
- global_pooling : (bool, default false) Whether to use global pooling. If global_pooling = true, ksize and paddings will be ignored.
- strides : (vector, default {1,1,1}) The strides (depth, height, width) of the pooling operator.
- paddings : (vector, default {0,0,0}) The paddings (depth, height, width) of the pooling operator. If global_pooling = true, ksize and paddings will be ignored.

## lod_rank_table

Create a LoDRankTable from a LoDTensor.

A LoD rank table stores one LoD level, ordered by sequence length in descending order. It is useful when implementing dynamic RNN, and is shared by the dynamic RNN memory, dynamic RNN slice input and dynamic RNN slice output operators.

Inputs:
- X : (LoDTensor) The input LoD tensor; it must contain LoD information.

Outputs:
- Out : (LoDRankTable) The rank table of the specific level.

Attributes:
- level : (int) The specific LoD level to rank.

## array_to_lod_tensor

This op builds a big LoDTensor from a std::vector and a LoDRankTable. It is intended to be used for converting a dynamic RNN's outputs back to a normal LoDTensor. The std::vector would be the output of the RNN op, and the LoDRankTable would be built from the RNN's input.

Inputs:
- X : (std::vector) A vector of tensors that is going to be cast to a big LoDTensor.
- RankTable : (LoDRankTable) The RankTable provides the coarse LoD information to build the output LoDTensor. See 'paddle/framework/lod_rank_table.h' for more details.

Outputs:
- Out : (LoDTensor) The LoDTensor formed by the input tensor array.

## sequence_conv

Sequence Conv Operator.

SequenceConvOp performs convolution operation on features of contextLength time-steps of each instance. The convolution operation calculates the output based on the input, filter, strides and paddings parameters. The size of each dimension of the parameters is checked during infer-shape. In order to ensure the equal length of sequence before and after convolution, it is necessary to fill the top and bottom of each sequence based on context_length, context_stride and context_start.

Inputs:
- X : (LoDTensor) The input(X) is a LoDTensor, which supports variable-length input sequences. The underlying tensor in this LoDTensor is a matrix with shape (T, N), where T is the total number of time steps in this mini-batch and N is the input_hidden_size.
- PaddingData : (Tensor, optional) The input(PaddingData) is an optional, learnable parameter. This is a tensor with shape (P, N), where P is top_pad + bottom_pad and N is the input_hidden_size. In order to ensure equal sequence length before and after convolution, it is necessary to fill the top and bottom of each sequence according to context_length, context_stride and context_start.
- Filter : (Tensor) The input(Filter) is a learnable parameter. This is a tensor with shape (K, M), where K is context_length * input_hidden_size and M is the output feature size.

Outputs:
- Out : (LoDTensor) The output(Out) is a LoDTensor, which supports variable-length output sequences. The underlying tensor in this LoDTensor is a matrix with shape (T, M), where T is the total number of time steps in this mini-batch and M is the output feature size.

Attributes:
- paddingTrainable : (bool, default: false) Whether the padding data of SequenceConvOp is trainable.
- contextLength : (int) The contextLength of SequenceConvOp is the height of the convolution kernel.
- contextStart : (int, default: 0) The contextStart of SequenceConvOp represents the beginning of the convolution over the rows of the sequence, which can be negative. A negative number means to pad contextStart time-steps of zeros or learnable parameters at the beginning of each instance. A positive number means to skip contextStart time-steps of each instance.
- contextStride : (int, default: 1) The contextStride of SequenceConvOp represents the stride length of the convolution kernel. Currently, SequenceConvOp only supports contextStride = 1.

## lstm

Long-Short Term Memory (LSTM) Operator.

The default implementation uses diagonal/peephole connections (https://arxiv.org/pdf/1402.1128.pdf); the formula is as follows:

$$i_t = \sigma(W_{ix}x_{t} + W_{ih}h_{t-1} + W_{ic}c_{t-1} + b_i) \\ f_t = \sigma(W_{fx}x_{t} + W_{fh}h_{t-1} + W_{fc}c_{t-1} + b_f) \\ \tilde{c_t} = act_g(W_{cx}x_t + W_{ch}h_{t-1} + b_c) \\ o_t = \sigma(W_{ox}x_{t} + W_{oh}h_{t-1} + W_{oc}c_t + b_o) \\ c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c_t} \\ h_t = o_t \odot act_h(c_t)$$

where the W terms denote weight matrices (e.g. $W_{xi}$ is the matrix of weights from the input to the input gate), and $W_{ic}, W_{fc}, W_{oc}$ are diagonal weight matrices for the peephole connections. In our implementation, we use vectors to represent these diagonal weight matrices. The b terms denote bias vectors ($b_i$ is the input gate bias vector), $\sigma$ is a non-linear activation, such as the logistic sigmoid function, and $i, f, o$ and $c$ are the input gate, forget gate, output gate, and cell activation vectors, respectively, all of which have the same size as the cell output activation vector $h$.

The $\odot$ is the element-wise product of vectors. $act_g$ and $act_h$ are the cell input and cell output activation functions; tanh is usually used for them. $\tilde{c_t}$ is also called the candidate hidden state, which is computed based on the current input and the previous hidden state.

Set use_peepholes to False to disable the peephole connections. The formula is omitted here; please refer to the paper http://www.bioinf.jku.at/publications/older/2604.pdf for details.

Note that the $W_{xi}x_{t}, W_{xf}x_{t}, W_{xc}x_{t}, W_{xo}x_{t}$ operations on the input $x_{t}$ are NOT included in this operator. Users can apply a fully-connected operator before the LSTM operator.
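As a rough illustration of the formulas above, here is a single peephole-LSTM step in NumPy. The gate ordering, the argument names, and the precomputed input projection `xW` are assumptions made for this sketch, not the operator's actual internal layout:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(xW, h_prev, c_prev, Wh, b, w_ic, w_fc, w_oc):
    # xW: precomputed input projection (the W_{x*} x_t terms, shape (4D,)),
    # which this operator expects to come from a preceding fully-connected op.
    # Wh: hidden-hidden weights, shape (D, 4D). The gate order (i, f, c~, o)
    # is an illustrative choice, not necessarily the operator's layout.
    z = xW + Wh.T @ h_prev + b
    zi, zf, zc, zo = np.split(z, 4)
    i = sigmoid(zi + w_ic * c_prev)        # input gate with peephole
    f = sigmoid(zf + w_fc * c_prev)        # forget gate with peephole
    c_tilde = np.tanh(zc)                  # candidate cell state
    c = f * c_prev + i * c_tilde           # new cell state
    o = sigmoid(zo + w_oc * c)             # output gate peeks at the new cell
    h = o * np.tanh(c)                     # new hidden state
    return h, c
```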

Inputs:
- Input : (LoDTensor) The first input is a LoDTensor, which supports variable-length input sequences. The underlying tensor in this LoDTensor is a matrix with shape (T x 4D), where T is the total number of time steps in this mini-batch and D is the hidden size.
- H0 : (Tensor, optional) The initial hidden state is an optional input. This is a tensor with shape (N x D), where N is the batch size and D is the hidden size.
- C0 : (Tensor, optional) The initial cell state is an optional input. This is a tensor with shape (N x D), where N is the batch size. H0 and C0 can be NULL, but only at the same time.
- Weight : (Tensor) The learnable hidden-hidden weights. The shape is (D x 4D), where D is the hidden size. Weight = {W_ch, W_ih, W_fh, W_oh}.
- Bias : (Tensor) The learnable weights, which contain the input-hidden bias weights and, if use_peepholes is set to True, the peephole connection weights. 1. use_peepholes = False: the shape is (1 x 4D), Bias = {b_c, b_i, b_f, b_o}. 2. use_peepholes = True: the shape is (1 x 7D), Bias = {b_c, b_i, b_f, b_o, W_ic, W_fc, W_oc}.

Outputs:
- Hidden : (LoDTensor) The hidden state of the LSTM operator. The shape is (T x D), and the LoD is the same as that of the Input.
- Cell : (LoDTensor) The cell state of the LSTM operator. The shape is (T x D), and the LoD is the same as that of the Input.
- BatchGate (Intermediate) : (LoDTensor) This LoDTensor contains the input gate, forget gate and output gate after the nonlinear computation. It has the same shape as the reorganized input, which is also called the batch input. The LoD size is 2: the first LoD contains the batch offsets and the second LoD contains the indexes, which denote the positions of the reorganized sequences in the raw input.
- BatchCellPreAct (Intermediate) : (LoDTensor) This LoDTensor is obtained in the forward pass and used in the backward pass.

Attributes:
- use_peepholes : (bool, default: True) Whether to enable diagonal/peephole connections.
- is_reverse : (bool, default: False) Whether to compute the reversed LSTM.
- gate_activation : (string, default: sigmoid) The activation for the input gate, forget gate and output gate; sigmoid by default.
- cell_activation : (string, default: tanh) The activation for the cell output; tanh by default.
- candidate_activation : (string, default: tanh) The activation for the candidate hidden state; tanh by default.

## warpctc

An operator integrating the open-source warp-ctc library, which is used in Deep Speech 2: End-to-End Speech Recognition in English and Mandarin, to compute the Connectionist Temporal Classification (CTC) loss. It can be aliased as softmax with CTC, since a native softmax activation is integrated into the warp-ctc library to normalize the values in each row of the input tensor.

More details of the CTC loss can be found by referring to Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks.

Inputs:
- Logits : (LoDTensor, default: LoDTensor) The unscaled probabilities of variable-length sequences, which is a 2-D tensor with LoD information. Its shape is [Lp, num_classes + 1], where Lp is the sum of all input sequences' lengths and num_classes is the true number of classes (not including the blank label).
- Label : (LoDTensor, default: LoDTensor) The ground truth of variable-length sequences, which is a 2-D tensor with LoD information. It has the shape [Lg, 1], where Lg is the sum of all labels' lengths.

Outputs:
- WarpCTCGrad (Intermediate) : (Tensor, default: Tensor) A temporary output tensor to store the gradients of warp-ctc, which are computed together with the loss in one call. It is a 3-D tensor of the shape [max_sequence_length, batch_size, num_classes + 1].
- Loss : (Tensor, default: Tensor) The Connectionist Temporal Classification (CTC) loss, which is a 2-D tensor of the shape [batch_size, 1].

Attributes:
- blank : (int, default: 0) The blank label of the Connectionist Temporal Classification (CTC) loss, which is in the half-open interval [0, num_classes + 1).
- norm_by_times : (bool, default: false) Whether to normalize the gradients by the number of time-steps, which is also the sequence's length.

## cos_sim

Cosine Similarity Operator.

$$Out = \frac{X^T Y}{\sqrt{X^T X} \cdot \sqrt{Y^T Y}}$$

The input X and Y must have the same shape, except that the 1st dimension of input Y could be just 1 (different from input X), which will be broadcasted to match the shape of input X before computing their cosine similarity.

Both the input X and Y can carry the LoD (Level of Details) information, or not. But the output only shares the LoD information with input X.

Inputs:
- X : The 1st input of the cos_sim op.
- Y : The 2nd input of the cos_sim op.

Outputs:
- Out : The output of the cos_sim op.
- XNorm (Intermediate) : Norm of the first input, reduced along the 1st dimension.
- YNorm (Intermediate) : Norm of the second input, reduced along the 1st dimension.
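The formula, including the broadcast of a first-dimension-1 input Y, can be sketched in NumPy (all values hypothetical):

```python
import numpy as np

# Hypothetical inputs; Y's first dimension is 1 and is broadcast against X.
X = np.array([[1.0, 2.0, 2.0],
              [0.0, 3.0, 4.0]])
Y = np.array([[2.0, 4.0, 4.0]])

x_norm = np.linalg.norm(X, axis=1, keepdims=True)   # XNorm
y_norm = np.linalg.norm(Y, axis=1, keepdims=True)   # YNorm
out = np.sum(X * Y, axis=1, keepdims=True) / (x_norm * y_norm)
# the first row of X is parallel to Y, so its similarity is exactly 1
```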

## conv3d

Convolution3D Operator.

The convolution operation calculates the output based on the input, filter and the strides, paddings, dilations, groups parameters. The size of each dimension of the parameters is checked in the infer-shape. Input(Input) and output(Output) are in NCDHW format, where N is the batch size, C is the number of channels, D is the depth of the feature, H is the height of the feature, and W is the width of the feature. Filter(Input) is in MCDHW format, where M is the number of output image channels, C is the number of input image channels, D is the depth of the filter, H is the height of the filter, and W is the width of the filter. Parameters(strides, paddings, dilations) are three elements. These three elements represent depth, height and width, respectively. The input(X) size and output(Out) size may be different.

Example:
Input: Input shape: $(N, C_{in}, D_{in}, H_{in}, W_{in})$, Filter shape: $(C_{out}, C_{in}, D_f, H_f, W_f)$
Output: Output shape: $(N, C_{out}, D_{out}, H_{out}, W_{out})$
Where
$$D_{out}= \frac{(D_{in} + 2 * paddings[0] - (dilations[0] * (D_f - 1) + 1))}{ strides[0]}+ 1 \\ H_{out}= \frac{(H_{in} + 2 * paddings[1] - (dilations[1] * (H_f - 1) + 1))}{ strides[1]}+ 1 \\ W_{out}= \frac{(W_{in} + 2 * paddings[2] - (dilations[2] * (W_f - 1) + 1))}{ strides[2]}+ 1$$
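The output-size formula above can be checked with a small helper; the concrete sizes are hypothetical:

```python
def conv_out_dim(in_dim, k, padding, stride, dilation):
    # out = (in + 2 * padding - (dilation * (k - 1) + 1)) / stride + 1
    return (in_dim + 2 * padding - (dilation * (k - 1) + 1)) // stride + 1

# e.g. D_in = 8, D_f = 3, padding 1, stride 2, dilation 1 -> D_out = 4
d_out = conv_out_dim(8, 3, 1, 2, 1)
```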

Inputs:
- Input : (Tensor) The input tensor of the convolution operator. The format of the input tensor is NCDHW, where N is the batch size, C is the number of channels, D is the depth of the feature, H is the height of the feature, and W is the width of the feature.
- Filter : (Tensor) The filter tensor of the convolution operator. The format of the filter tensor is MCDHW, where M is the number of output image channels, C is the number of input image channels, D is the depth of the filter, H is the height of the filter, and W is the width of the filter. If the groups attribute is greater than 1, C equals the number of input image channels divided by groups.

Outputs:
- Output : (Tensor) The output tensor of the convolution operator. The format of the output tensor is also NCDHW.

Attributes:
- strides : (vector, default: {1, 1, 1}) The strides (d_stride, h_stride, w_stride) of the convolution operator.
- paddings : (vector, default: {0, 0, 0}) The paddings (d_pad, h_pad, w_pad) of the convolution operator.
- groups : (int, default: 1) The group number of the convolution operator. According to grouped convolution in Alex Krizhevsky's Deep CNN paper: when group = 2, the first half of the filters is only connected to the first half of the input channels, while the second half of the filters is only connected to the second half of the input channels.
- dilations : (vector, default: {1, 1, 1}) The dilations (d_dilation, h_dilation, w_dilation) of the convolution operator.
- use_cudnn : (bool, default: false) Only used in the cudnn kernel; cudnn needs to be installed.
- data_format : (string, default: NCHW) An optional string from: "NHWC", "NCHW". Specify the data format of the output data; the input will be transformed automatically.
- workspace_size_MB : Only used in the cudnn kernel. The workspace size for cudnn, in MB. The workspace is a section of GPU memory which will be allocated/freed each time the operator runs; a larger workspace size can increase performance but also requires better hardware. This size should be chosen carefully.

## depthwise_conv2d

Convolution Operator.

The convolution operation calculates the output based on the input, filter and strides, paddings, dilations, groups parameters. The size of each dimension of the parameters is checked in the infer-shape. Input(Input) and Output(Output) are in NCHW format. Where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature. Filters(Input) is MCHW format. Where M is the number of output image channels, C is the number of input image channels, H is the height of the filter, and W is the width of the filter. Parameters(strides, paddings, dilations) are two elements. These two elements represent height and width, respectively. The input(X) size and output(Out) size may be different.

Example:
Input: Input shape: $(N, C_{in}, H_{in}, W_{in})$, Filter shape: $(C_{out}, C_{in}, H_f, W_f)$
Output: Output shape: $(N, C_{out}, H_{out}, W_{out})$
Where
$$H_{out}= \frac{(H_{in} + 2 * paddings[0] - (dilations[0] * (H_f - 1) + 1))}{strides[0]}+ 1 \\ W_{out}= \frac{(W_{in} + 2 * paddings[1] - (dilations[1] * (W_f - 1) + 1))}{strides[1]}+ 1$$

Inputs:
- Input : (Tensor) The input tensor of the convolution operator. The format of the input tensor is NCHW, where N is the batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature.
- Filter : (Tensor) The filter tensor of the convolution operator. The format of the filter tensor is MCHW, where M is the number of output image channels, C is the number of input image channels, H is the height of the filter, and W is the width of the filter. If the groups attribute is greater than 1, C equals the number of input image channels divided by groups.

Outputs:
- Output : (Tensor) The output tensor of the convolution operator. The format of the output tensor is also NCHW.

Attributes:
- strides : (vector, default: {1, 1}) The strides (h_stride, w_stride) of the convolution operator.
- paddings : (vector, default: {0, 0}) The paddings (h_pad, w_pad) of the convolution operator.
- groups : (int, default: 1) The group number of the convolution operator. According to grouped convolution in Alex Krizhevsky's Deep CNN paper: when group = 2, the first half of the filters is only connected to the first half of the input channels, while the second half of the filters is only connected to the second half of the input channels.
- dilations : (vector, default: {1, 1}) The dilations (h_dilation, w_dilation) of the convolution operator.
- use_cudnn : (bool, default: false) Only used in the cudnn kernel; cudnn needs to be installed.
- data_format : (string, default: NCHW) An optional string from: "NHWC", "NCHW". Specify the data format of the output data; the input will be transformed automatically.
- workspace_size_MB : Only used in the cudnn kernel; use_cudnn needs to be set to true. The workspace size for cudnn, in MB. The workspace is a section of GPU memory which will be allocated/freed each time the operator runs; a larger workspace size can increase performance but also requires better hardware. This size should be chosen carefully.

## conv2d

Convolution Operator.

The convolution operation calculates the output based on the input, filter and strides, paddings, dilations, groups parameters. The size of each dimension of the parameters is checked in the infer-shape. Input(Input) and Output(Output) are in NCHW format. Where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature. Filters(Input) is MCHW format. Where M is the number of output image channels, C is the number of input image channels, H is the height of the filter, and W is the width of the filter. Parameters(strides, paddings, dilations) are two elements. These two elements represent height and width, respectively. The input(X) size and output(Out) size may be different.

Example:
Input: Input shape: $(N, C_{in}, H_{in}, W_{in})$, Filter shape: $(C_{out}, C_{in}, H_f, W_f)$
Output: Output shape: $(N, C_{out}, H_{out}, W_{out})$
Where
$$H_{out}= \frac{(H_{in} + 2 * paddings[0] - (dilations[0] * (H_f - 1) + 1))}{strides[0]}+ 1 \\ W_{out}= \frac{(W_{in} + 2 * paddings[1] - (dilations[1] * (W_f - 1) + 1))}{strides[1]}+ 1$$

Inputs:
- Input : (Tensor) The input tensor of the convolution operator. The format of the input tensor is NCHW, where N is the batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature.
- Filter : (Tensor) The filter tensor of the convolution operator. The format of the filter tensor is MCHW, where M is the number of output image channels, C is the number of input image channels, H is the height of the filter, and W is the width of the filter. If the groups attribute is greater than 1, C equals the number of input image channels divided by groups.

Outputs:
- Output : (Tensor) The output tensor of the convolution operator. The format of the output tensor is also NCHW.

Attributes:
- strides : (vector, default: {1, 1}) The strides (h_stride, w_stride) of the convolution operator.
- paddings : (vector, default: {0, 0}) The paddings (h_pad, w_pad) of the convolution operator.
- groups : (int, default: 1) The group number of the convolution operator. According to grouped convolution in Alex Krizhevsky's Deep CNN paper: when group = 2, the first half of the filters is only connected to the first half of the input channels, while the second half of the filters is only connected to the second half of the input channels.
- dilations : (vector, default: {1, 1}) The dilations (h_dilation, w_dilation) of the convolution operator.
- use_cudnn : (bool, default: false) Only used in the cudnn kernel; cudnn needs to be installed.
- data_format : (string, default: NCHW) An optional string from: "NHWC", "NCHW". Specify the data format of the output data; the input will be transformed automatically.
- workspace_size_MB : Only used in the cudnn kernel; use_cudnn needs to be set to true. The workspace size for cudnn, in MB. The workspace is a section of GPU memory which will be allocated/freed each time the operator runs; a larger workspace size can increase performance but also requires better hardware. This size should be chosen carefully.

## pool3d

Pool3d Operator.

The pooling3d operation calculates the output based on the input, pooling_type, ksize, strides, and paddings parameters. Input(X) and output(Out) are in NCDHW format, where N is batch size, C is the number of channels, and D, H and W are the depth, height and width of the feature, respectively. Parameters(ksize, strides, paddings) are three elements. These three elements represent depth, height and width, respectively. The input(X) size and output(Out) size may be different.

Example:
Input: X shape: $(N, C, D_{in}, H_{in}, W_{in})$
Output: Out shape: $(N, C, D_{out}, H_{out}, W_{out})$
Where
$$D_{out} = \frac{(D_{in} - ksize[0] + 2 * paddings[0])}{strides[0]} + 1 \\ H_{out} = \frac{(H_{in} - ksize[1] + 2 * paddings[1])}{strides[1]} + 1 \\ W_{out} = \frac{(W_{in} - ksize[2] + 2 * paddings[2])}{strides[2]} + 1$$

Inputs:
- X : (Tensor) The input tensor of the pooling operator. The format of the input tensor is NCDHW, where N is the batch size, C is the number of channels, and D, H and W are the depth, height and width of the feature, respectively.

Outputs:
- Out : (Tensor) The output tensor of the pooling operator. The format of the output tensor is also NCDHW, where N is the batch size, C is the number of channels, and D, H and W are the depth, height and width of the feature, respectively.

Attributes:
- pooling_type : (string) Pooling type; can be "max" for max-pooling and "avg" for average-pooling.
- ksize : (vector) The pooling window size (depth, height, width) of the pooling operator. If global_pooling = true, ksize and paddings will be ignored.
- global_pooling : (bool, default false) Whether to use global pooling. If global_pooling = true, ksize and paddings will be ignored.
- strides : (vector, default {1,1,1}) The strides (depth, height, width) of the pooling operator.
- paddings : (vector, default {0,0,0}) The paddings (depth, height, width) of the pooling operator. If global_pooling = true, ksize and paddings will be ignored.
- use_cudnn : (bool, default false) Only used in the cudnn kernel; cudnn needs to be installed.
- data_format : (string, default NCHW) An optional string from: "NHWC", "NCHW". Specify the data format of the output data; the input will be transformed automatically.

## pool2d

Pool2d Operator.

The pooling2d operation calculates the output based on the input, pooling_type and ksize, strides, paddings parameters. Input(X) and output(Out) are in NCHW format, where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature. Parameters(ksize, strides, paddings) are two elements. These two elements represent height and width, respectively. The input(X) size and output(Out) size may be different.

Example:
Input: X shape: $(N, C, H_{in}, W_{in})$
Output: Out shape: $(N, C, H_{out}, W_{out})$
Where
$$H_{out} = \frac{(H_{in} - ksize[0] + 2 * paddings[0])}{strides[0]} + 1 \\ W_{out} = \frac{(W_{in} - ksize[1] + 2 * paddings[1])}{strides[1]} + 1$$

Inputs:
- X : (Tensor) The input tensor of the pooling operator. The format of the input tensor is NCHW, where N is the batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature.

Outputs:
- Out : (Tensor) The output tensor of the pooling operator. The format of the output tensor is also NCHW, where N is the batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature.

Attributes:
- pooling_type : (string) Pooling type; can be "max" for max-pooling and "avg" for average-pooling.
- ksize : (vector) The pooling window size (height, width) of the pooling operator. If global_pooling = true, ksize and paddings will be ignored.
- global_pooling : (bool, default false) Whether to use global pooling. If global_pooling = true, ksize and paddings will be ignored.
- strides : (vector, default {1, 1}) The strides (height, width) of the pooling operator.
- paddings : (vector, default {0,0}) The paddings (height, width) of the pooling operator. If global_pooling = true, ksize and paddings will be ignored.
- use_cudnn : (bool, default false) Only used in the cudnn kernel; cudnn needs to be installed.
- data_format : (string, default NCHW) An optional string from: "NHWC", "NCHW". Specify the data format of the output data; the input will be transformed automatically.

## conv3d_transpose

Convolution3D Transpose Operator.

The convolution transpose operation calculates the output based on the input, filter and the dilations, strides, paddings, groups parameters. The size of each dimension of the parameters is checked in the infer-shape. Input(Input) and output(Output) are in NCDHW format, where N is the batch size, C is the number of channels, D is the depth of the feature, H is the height of the feature, and W is the width of the feature. Filter(Input) is in MCDHW format, where M is the number of input feature channels, C is the number of output feature channels, D is the depth of the filter, H is the height of the filter, and W is the width of the filter. Parameters(strides, paddings) are three elements. These three elements represent depth, height and width, respectively. The input(X) size and output(Out) size may be different.

Example:
Input: Input shape: $(N, C_{in}, D_{in}, H_{in}, W_{in})$, Filter shape: $(C_{in}, C_{out}, D_f, H_f, W_f)$
Output: Output shape: $(N, C_{out}, D_{out}, H_{out}, W_{out})$
Where
$$D_{out} = (D_{in} - 1) * strides[0] - 2 * paddings[0] + dilations[0] * (D_f - 1) + 1 \\ H_{out} = (H_{in} - 1) * strides[1] - 2 * paddings[1] + dilations[1] * (H_f - 1) + 1 \\ W_{out} = (W_{in} - 1) * strides[2] - 2 * paddings[2] + dilations[2] * (W_f - 1) + 1$$
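Since the transposed convolution inverts the forward output-size formula (when the forward division is exact), a small sketch can check the round trip; the concrete sizes are hypothetical:

```python
def conv_out_dim(in_dim, k, padding, stride, dilation):
    # forward convolution output size (from the conv3d section)
    return (in_dim + 2 * padding - (dilation * (k - 1) + 1)) // stride + 1

def conv_transpose_out_dim(in_dim, k, padding, stride, dilation):
    # out = (in - 1) * stride - 2 * padding + dilation * (k - 1) + 1
    return (in_dim - 1) * stride - 2 * padding + dilation * (k - 1) + 1

# A forward conv maps 9 -> 5 with k=3, padding=1, stride=2; the transpose
# maps 5 back to 9 with the same hyperparameters.
forward = conv_out_dim(9, 3, 1, 2, 1)
restored = conv_transpose_out_dim(forward, 3, 1, 2, 1)
```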

Inputs:
- Input : (Tensor) The input tensor of the convolution transpose operator. The format of the input tensor is NCDHW, where N is the batch size, C is the number of channels, D is the depth of the feature, H is the height of the feature, and W is the width of the feature.
- Filter : (Tensor) The filter tensor of the convolution transpose operator. The format of the filter tensor is MCDHW, where M is the number of input feature channels, C is the number of output feature channels, D is the depth of the filter, H is the height of the filter, and W is the width of the filter. We enforce groups number == 1 and padding == 0 in the convolution3d transpose scenario.

Outputs:
- Output : (Tensor) The output tensor of the convolution transpose operator. The format of the output tensor is also NCDHW, where N is the batch size, C is the number of channels, D is the depth of the feature, H is the height of the feature, and W is the width of the feature.

Attributes:
- dilations : (vector, default: {1, 1, 1}) The dilations (d_dilation, h_dilation, w_dilation) of the convolution transpose operator.
- strides : (vector, default: {1, 1, 1}) The strides (d_stride, h_stride, w_stride) of the convolution transpose operator.
- paddings : (vector, default: {0, 0, 0}) The paddings (d_pad, h_pad, w_pad) of the convolution transpose operator.
- use_cudnn : (bool, default: false) Only used in the cudnn kernel; cudnn needs to be installed.
- data_format : (string, default: NCHW) An optional string from: "NHWC", "NCHW". Specify the data format of the output data; the input will be transformed automatically.
- workspace_size_MB : Only used in the cudnn kernel. The workspace size for cudnn, in MB. The workspace is a section of GPU memory which will be allocated/freed each time the operator runs; a larger workspace size can increase performance but also requires better hardware. This size should be set carefully.

## parallel_do

ParallelDo Operator.

Inputs:
- inputs (Duplicable)
- parameters (Duplicable)
- places

Outputs:
- outputs (Duplicable)
- parallel_scopes

Attributes:
- sub_block

## recurrent

Static Length Recurrent Operator.

The static length recurrent operator can only operate on fixed-size sequence data, i.e. in each mini-batch, the sequence lengths of all inputs are the same.

Inputs:
- inputs (Duplicable) : RNN inputs.
- initial_states (Duplicable) : RNN initial states.
- parameters (Duplicable) : Parameters are used by the step block as its input. However, the input is not a sequence tensor. At every time step, each operator in the step block just uses the parameter directly.

Outputs:
- outputs (Duplicable) : The output sequences of the RNN. The sequence lengths must be the same.
- step_scopes : StepScopes contain all local variables in each time step.

Attributes:
- ex_states : The ex-state variable names. The ex-state means the state value in the previous time step. [ex_states, states, initial_states@GRAD] must be in the same order.
- states : The state variable names. [ex_states, states, initial_states@GRAD] must be in the same order.
- sub_block : The step block inside the RNN.
- reverse : Whether to calculate the RNN reversely; by default reverse = False. Assume the input data is [A, B, C, D]. If reverse is False, the computation of the RNN is:

        A          B          C          D
        |          |          |          |
        v          v          v          v
        rnn -----> rnn -----> rnn -----> rnn
        |          |          |          |
        v          v          v          v
        o          o          o          o

  If reverse is True, the computation of the RNN is:

        A          B          C          D
        |          |          |          |
        v          v          v          v
        rnn <----- rnn <----- rnn <----- rnn
        |          |          |          |
        v          v          v          v
        o          o          o          o

- is_train :

## create_shuffle_reader

CreateShuffleReader Operator.

A shuffle reader takes another reader as its underlying reader, and yields the underlying reader's outputs in a shuffled order.


## save

Save operator

This operator will serialize and write a tensor variable to file on disk.

Inputs: X : (Tensor) Input tensor to be saved. overwrite (Duplicable): (boolean, default true) Overwrite the output file if it exists. file_path (Duplicable): (string) The "file_path" where the variable will be saved.

## load

Load operator

This operator will load a tensor variable from a file on disk, as saved by the save operator.

Inputs: Out : (Tensor) The tensor to be loaded. file_path (Duplicable): (string) Variable will be loaded from "file_path".

## load_combine

LoadCombine operator

LoadCombine operator loads LoDTensor variables from a file. The file should contain one or more LoDTensors serialized using the SaveCombine operator. The LoadCombine operator applies a deserialization strategy to appropriately load the LoDTensors, and this strategy complements the serialization strategy used in the SaveCombine operator. Hence, the LoadCombine operator is tightly coupled with the SaveCombine operator, and can only deserialize one or more LoDTensors that were saved using the SaveCombine operator.

Inputs: Out (Duplicable) : (vector) The output LoDTensors that will be read from the input file. file_path (Duplicable): (string) LoDTensors will be loaded from "file_path".

## accuracy

Accuracy Operator.

It computes the accuracy for classification tasks. The accuracy is calculated as follows:

$$accuracy = \frac{NumOfCorrectPredicts}{NumOfAllSamples}$$

Both the input Out and Label can carry the LoD (Level of Details) information, or not. But the output only shares the LoD information with the input Out(Inference).

Inputs: Out : The network output of topk (inferences). Indices : The network output of topk (indices). Label : Label of the training data. Accuracy : The accuracy of current batch. Correct : The correct samples count of current batch. Total : The samples count of current batch.
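As a rough illustration (a pure-Python sketch, not the operator's actual kernel), a sample counts as correct when its label appears among its top-k indices:

```python
def accuracy(indices, label):
    """indices: [batch][k] top-k predicted class ids; label: [batch] ground-truth ids.
    Returns (accuracy, correct_count, total_count), mirroring the op's three outputs."""
    correct = sum(1 for row, y in zip(indices, label) if y in row)
    total = len(label)
    return correct / total, correct, total
```

For example, with top-2 indices [[2, 1], [0, 3], [4, 2]] and labels [1, 3, 0], two of the three samples are correct.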

## hard_sigmoid

HardSigmoid Activation Operator.

Segment-wise linear approximation of sigmoid (https://arxiv.org/abs/1603.00391), which is much faster than sigmoid.

$out = max(0, min(1, slope * x + shift))$

The slope should be positive. The offset can be either positive or negative. The default slope and shift are set according to the above reference. It is recommended to use the defaults for this activation.

Inputs: X : Input of HardSigmoid operator Out : Output of HardSigmoid operator slope (Duplicable): Slope for linear approximation of sigmoidoffset (Duplicable): Offset for linear approximation of sigmoid
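A minimal element-wise sketch of the formula above; the slope and offset defaults of 0.2 and 0.5 below are illustrative assumptions, not values taken from this document:

```python
def hard_sigmoid(x, slope=0.2, offset=0.5):
    # out = max(0, min(1, slope * x + offset))
    return max(0.0, min(1.0, slope * x + offset))
```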

## cond

Sample Dependent Conditional Operator.

Given Cond[i] as a 1/0 vector to indicate true/false:

    Out[i] = subnet_true[i],  if Cond[i] == true
    Out[i] = subnet_false[i], if Cond[i] == false

Inputs: Cond : The condition, which is a bool vectorXs (Duplicable) : Inputs of Subnets Outs (Duplicable) : Outputs of Cond_Op after mergeSubScopes : sub scopes for true and false branchesIndexTensors : Index Tensors contains indices for true/false

## max_pool2d_with_index

MaxPool2d Operator.

The maxPooling2d with index operation calculates the output and the mask based on the input, ksize, strides, and paddings parameters. Input(X) and output(Out, Mask) are in NCHW format, where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature. Parameters(ksize, strides, paddings) are two elements. These two elements represent height and width, respectively. The input(X) size and output(Out, Mask) size may be different.

Example: Input: X shape: $(N, C, H_{in}, W_{in})$ Output: Out shape: $(N, C, H_{out}, W_{out})$ Mask shape: $(N, C, H_{out}, W_{out})$ Where $$H_{out} = \frac{(H_{in} - ksize[0] + 2 * paddings[0])}{strides[0]} + 1 \\ W_{out} = \frac{(W_{in} - ksize[1] + 2 * paddings[1])}{strides[1]} + 1$$
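The output-size formula above can be checked with a small helper (a sketch, not the operator's code; integer division reflects the floor in the formula):

```python
def pool_out_dim(in_dim, ksize, padding, stride):
    # H_out = (H_in - ksize + 2 * padding) // stride + 1
    return (in_dim - ksize + 2 * padding) // stride + 1
```

For example, a 32x32 feature map with ksize 2, padding 0 and stride 2 pools to 16x16.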

Inputs: X : (Tensor) The input tensor of pooling operator. The format of input tensor is NCHW, where N is batch size, C is the number of channels, H is the height of the image, and W is the width of the image. Out : (Tensor) The output tensor of pooling operator. The format of output tensor is also NCHW, where N is batch size, C is the number of channels, H is the height of the image and W is the width of the image.Mask : (Tensor) The Mask tensor of pooling operator.The format of output tensor is also NCHW, where N is batch size, C is the number of channels, H is the height of the image, and W is the width of the image. It represents the index in the current feature map. ksize (Duplicable): (vector) The pooling window size(height, width) of pooling operator. If global_pooling = true, ksize and paddings will be ignored.global_pooling (Duplicable): (bool, default:false) Whether to use the global pooling. If global_pooling = true, ksize and paddings will be ignored.strides (Duplicable): (vector, default {1, 1}), strides(height, width) of pooling operator.paddings (Duplicable): (vector, default:{0, 0}), paddings(height, width) of pooling operator. If global_pooling = true, paddings and will be ignored.

## thresholded_relu

ThresholdedRelu Activation Operator.

$$out = \begin{cases} x, \text{if } x > threshold \\ 0, \text{otherwise} \end{cases}$$

Inputs: X : Input of ThresholdedRelu operator Out : Output of ThresholdedRelu operator threshold (Duplicable): The threshold location of activation

## hard_shrink

HardShrink Activation Operator.

$$out = \begin{cases} x, \text{if } x > \lambda \\ x, \text{if } x < -\lambda \\ 0, \text{otherwise} \end{cases}$$

Inputs: X : Input of HardShrink operator Out : Output of HardShrink operator threshold (Duplicable): The value of threshold for HardShrink

## create_batch_reader

CreateBatchReader Operator.

A batch reader takes another reader as its underlying reader, gathers the underlying reader's outputs, and then yields them in batches.


## relu6

Relu6 Activation Operator.

$out = min(max(0, x), 6)$

Inputs: X : Input of Relu6 operator Out : Output of Relu6 operator threshold (Duplicable): The threshold value of Relu6

## elu

ELU Activation Operator.

Applies the following element-wise computation on the input according to https://arxiv.org/abs/1511.07289.

$out = max(0, x) + min(0, alpha * (e^x - 1))$

Inputs: X : Input of ELU operator Out : Output of ELU operator alpha (Duplicable): The alpha value of ELU

## save_combine

SaveCombine operator

This operator will serialize and write a list of input LoDTensor variables to a file on disk.

Inputs: X (Duplicable) : (vector) Input LoDTensors that need to be saved together in a file. overwrite (Duplicable): (boolean, default true)Overwrite the output file if it exists.file_path (Duplicable): (string)The "file_path" where the LoDTensor variables will be saved.

## leaky_relu

LeakyRelu Activation Operator.

$out = max(x, alpha * x)$

Inputs: X : Input of LeakyRelu operator Out : Output of LeakyRelu operator alpha (Duplicable): The small negative slope

## softsign

Softsign Activation Operator.

$$out = \frac{x}{1 + |x|}$$

Inputs: X : Input of Softsign operator Out : Output of Softsign operator

## square

Square Activation Operator.

$out = x^2$

Inputs: X : Input of Square operator Out : Output of Square operator

## log

Log Activation Operator.

$out = ln(x)$

Natural logarithm of x.

Inputs: X : Input of Log operator Out : Output of Log operator

## reciprocal

Reciprocal Activation Operator.

$$out = \frac{1}{x}$$

Inputs: X : Input of Reciprocal operator Out : Output of Reciprocal operator

## ceil

Ceil Activation Operator.

$out = ceil(x)$

Inputs: X : Input of Ceil operator Out : Output of Ceil operator

## abs

Abs Activation Operator.

$out = |x|$

Inputs: X : Input of Abs operator Out : Output of Abs operator

## soft_relu

SoftRelu Activation Operator.

$out = \ln(1 + \exp(\max(\min(x, threshold), -threshold)))$

Inputs: X : Input of SoftRelu operator Out : Output of SoftRelu operator threshold (Duplicable): The threshold value of SoftRelu

## softshrink

Softshrink Activation Operator.

$$out = \begin{cases} x - \lambda, \text{if } x > \lambda \\ x + \lambda, \text{if } x < -\lambda \\ 0, \text{otherwise} \end{cases}$$

Inputs: X : Input of Softshrink operator Out : Output of Softshrink operator lambda (Duplicable): non-negative offset

## softmax

Softmax Operator.

The input of the softmax operator is a 2-D tensor with shape N x K (N is the batch_size, K is the dimension of input feature). The output tensor has the same shape as the input tensor.

For each row of the input tensor, the softmax operator squashes the K-dimensional vector of arbitrary real values to a K-dimensional vector of real values in the range [0, 1] that add up to 1. It computes the exponential of each element in the K-dimensional vector input and the sum of these exponential values over the whole vector. The output of the softmax operator for each element is then the ratio of its exponential to that sum.

For each row $i$ and each column $j$ in Input(X), we have: $$Out[i, j] = \frac{\exp(X[i, j])}{\sum_j \exp(X[i, j])}$$

Inputs: X : The input tensor of softmax. 2-D with shape [batch_size, input_feature_dimensions]. Out : The normalized values with the same shape as X.
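A per-row sketch of the computation; subtracting the row maximum is a standard numerical-stability trick and does not change the result:

```python
import math

def softmax_row(row):
    m = max(row)                      # subtract max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]  # each entry in [0, 1], summing to 1
```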

## top_k

Top K operator

If the input is a vector (1d tensor), this operator finds the k largest entries in the vector and outputs their values and indices as vectors. Thus values[j] is the j-th largest entry in input, and its index is indices[j].

For matrices, this operator computes the top k entries in each row.

Inputs: X : (Tensor) The input of Topk op Out : (Tensor) The output tensor of Topk opIndices : (Tensor) The indices of Topk elements of input k (Duplicable): (int, default 1) Number of top elements to look for along the last dimension (along each row for matrices).
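For a single row, the behavior can be sketched as sorting indices by value (a simple reference implementation, not the kernel):

```python
def top_k(row, k=1):
    # indices of the k largest entries; values[j] is paired with indices[j]
    order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
    return [row[i] for i in order], order
```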

## clip

Clip Operator.

The clip operator limits the value of given input within an interval. The interval is specified with arguments 'min' and 'max':

$$Out = \min(\max(X, min), max)$$

Inputs: X : (Tensor)The input of clip op.The number of dimensions must be between [1, 9]. Out : (Tensor)The output of clip op with shape as input(X) min (Duplicable): (float)Minimum value, under which element is replaced by min.max (Duplicable): (float)Maximum value, above which element is replaced by max
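Element-wise, the clip equation reduces to the following sketch:

```python
def clip(x, min_v, max_v):
    # Out = min(max(X, min), max)
    return min(max(x, min_v), max_v)
```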

## margin_rank_loss

MarginRankLoss Operator.

This operator measures the loss given a pair of training sample {X1, X2} and the Label with attribute margin, where Label = +1 indicating X1 is ranked higher than X2 and Label = -1 otherwise. The loss is calculated as:

$loss(X1, X2, Label) = max(0, -Label * (X1 - X2) + margin)$

The attribute margin here helps make the predictions more robust. Denote the item ranked higher as the positive sample, otherwise the negative sample. If the score of the two samples satisfies

$positive sample - negative sample < margin$

the pair of samples will contribute to the final loss, which will backpropagate and train the ranking model to enlarge the difference between the two scores.

For batch input with size batch_size, X1, X2 and Label all have the same shape [batch_size x 1].

Inputs: X1 : (2-D tensor with shape [batch_size x 1]) The score for one item X1 to be ranked, from pairwise ranking model.X2 : (2-D tensor with shape [batch_size x 1]) The score for another item X2 to be ranked, from pairwise ranking model.Label : (2-D tensor with shape [batch_size x 1]) The label indicating X1 ranked higher than X2 or not, can only be +1 or -1. Activated (Intermediate) : (2-D tensor with shape [batch_size x 1]) Intermediate tensor to indicate whether each element of Output(Out) is activated.Out : (2-D tensor with shape [batch_size x 1]) The output loss of MarginRankLoss operator. margin (Duplicable): (scalar, default 0) Margin for MarginRankLossOp.
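Per sample, the loss formula above can be sketched as:

```python
def margin_rank_loss(x1, x2, label, margin=0.0):
    # loss = max(0, -label * (x1 - x2) + margin); label is +1 or -1
    return max(0.0, -label * (x1 - x2) + margin)
```

With label = +1 and x1 well above x2 the loss is zero; shrinking the gap below the margin makes it positive.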

## mul

Mul Operator.

This operator is used to perform matrix multiplication for input $X$ and $Y$.

The equation is:

$$Out = X * Y$$

Both the input $X$ and $Y$ can carry the LoD (Level of Details) information, or not. But the output only shares the LoD information with input $X$.

Inputs: X : (Tensor), The first input tensor of mul op.Y : (Tensor), The second input tensor of mul op. Out : (Tensor), The output tensor of mul op. x_num_col_dims (Duplicable): (int, default 1), The mul_op can take tensors with more than two dimensions as its inputs. If the input $X$ is a tensor with more than two dimensions, $X$ will be flattened into a two-dimensional matrix first. The flattening rule is: the first num_col_dims will be flattened to form the first dimension of the final matrix (the height of the matrix), and the rest rank(X) - num_col_dims dimensions are flattened to form the second dimension of the final matrix (the width of the matrix). As a result, height of the flattened matrix is equal to the product of $X$'s first x_num_col_dims dimensions' sizes, and width of the flattened matrix is equal to the product of $X$'s last rank(x) - num_col_dims dimensions' size. For example, suppose $X$ is a 6-dimensional tensor with the shape [2, 3, 4, 5, 6], and x_num_col_dims = 3. Thus, the flattened matrix will have a shape [2 x 3 x 4, 5 x 6] = [24, 30]. y_num_col_dims (Duplicable): (int, default 1), The mul_op can take tensors with more than two, dimensions as its inputs. If the input $Y$ is a tensor with more than two dimensions, $Y$ will be flattened into a two-dimensional matrix first. The attribute y_num_col_dims determines how $Y$ is flattened. See comments of x_num_col_dims for more details.
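The flattening rule for x_num_col_dims can be sketched as a shape computation, mirroring the [2, 3, 4, 5, 6] example above:

```python
def flatten_to_2d(shape, num_col_dims):
    # height = product of the first num_col_dims dims,
    # width  = product of the remaining dims
    height = 1
    for d in shape[:num_col_dims]:
        height *= d
    width = 1
    for d in shape[num_col_dims:]:
        width *= d
    return height, width
```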

## mine_hard_examples

Mine hard examples Operator. This operator implements hard example mining to select a subset of negative box indices. For each image, it selects the boxes with the highest losses, subject to the condition that a box cannot have a MatchDist > neg_dist_threshold when mining_type is max_negative. The selected number is min(sample_size, max_negative_box_number) when mining_type is hard_example, or min(neg_pos_ratio * positive_box_number, max_negative_box_number) when mining_type is max_negative, where max_negative_box_number is the count of MatchIndices elements with value -1.

Inputs: ClsLoss : (Tensor, default Tensor), The classification loss with shape [N, Np], N is the batch size and Np is the number of prior box.LocLoss : (Tensor, optional, default Tensor), The localization loss with shape [N, Np], N is the batch size and Np is the number of prior box.MatchIndices : (Tensor, Tensor), Matched indices with shape [N, Np], N is the batch size and Np is the number of prior box. MatchIndices[i][j] equal -1 means the j-th prior box in i-th instance does not match any entity, otherwise means it is matched to row.MatchDist : (Tensor, default Tensor) Matched indices with shape [N, Np], N is the batch size and Np is the number of prior box. NegIndices : (LoDTensor) The output of negative example indices. a LoDTensor with shape [Neg, 1]. The size of lod[0] minus 1 is batch size, and each element is the prior box index. For example, the batch size is 2, the lod is [[0, 1, 2]], the sample 0's box 1(MatchIndices[0][1]) is selected, and sample 1's box 0 is selected. The output NegIndices is [[1], [0]].UpdatedMatchIndices : (Tensor) The output of updated MatchIndices, a tensor with shape [N, Np]. Only update when mining_type is hard_example. The input MatchIndices elements will be update to -1 when it is not in the candidate high loss list of negative examples. neg_pos_ratio (Duplicable): (float) The ratio of the negative box to the positive box. Use only when mining_type is max_negative.neg_dist_threshold (Duplicable): (float) The negative overlap upper bound for the unmatched predictions. Use only when mining_type is max_negative.sample_size (Duplicable): (float) The max sample size of negative box. Use only when mining_type is hard_example.mining_type (Duplicable): (float) The mining algorithm name, the value is hard_example or max_negative.

## swish

Swish Activation Operator.

$$out = \frac{x}{1 + e^{- \beta x}}$$

Inputs: X : Input of Swish operator Out : Output of Swish operator beta (Duplicable): Constant beta of swish operator

## is_empty

IsEmpty Operator which checks whether a tensor is empty.

It simply returns whether product(tensor.dims()) == 0.

Inputs: X : (Tensor) Tensor which is to be checked. Out : (Tensor) a boolean Tensor that indicate empty or not.

## minus

Minus Operator.

Equation:

$Out = X - Y$


Both the input X and Y can carry the LoD (Level of Details) information, or not. But the output only shares the LoD information with input X.

Inputs: X : The left tensor of minus operator.Y : The right tensor of minus operator. Out : The output tensor of minus operator.

## scatter

Scatter Operator.

This operator obtains output by updating the input on selected indices on the first axis:

$$Out = Ref \\ Out[Index] = Ref[Index] + Updates$$

Inputs: Ref : The source input of scatter op. Index : The index input of scatter op where Ref will be updated. Updates : The update values of scatter op. Out : The output of scatter op.
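A list-based sketch of the update rule (Out starts as a copy of Ref, then the indexed rows accumulate Updates):

```python
def scatter_add(ref, index, updates):
    out = [row[:] for row in ref]    # Out = Ref
    for i, idx in enumerate(index):  # Out[Index[i]] = Ref[Index[i]] + Updates[i]
        out[idx] = [a + b for a, b in zip(out[idx], updates[i])]
    return out
```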

## max_sequence_len

Calculate the max sequence length through lod_rank_table.

Inputs: RankTable : The lod_rank_table. Out : The max sequence length.

## multiplex

Multiplex Operator.

Multiplex multiple tensors according to the index provided by the index tensor.

Ids: the index tensor. X[0 : N - 1]: the candidate tensors for output (N >= 2). For each index i from 0 to batchSize - 1, the output is the i-th row of the (Ids[i])-th tensor.

For i-th row of the output tensor:

$$y[i] = x_{k}[i]$$

where y is the output tensor, x_{k} is the k-th input tensor, and k = Ids[i].

Inputs: Ids : The index tensor of multiplex operator.X (Duplicable) : The candidate tensors of multiplex operator. Out : The output tensor of multiplex operator.
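Treating each candidate tensor as a list of rows, the selection rule $y[i] = x_{Ids[i]}[i]$ can be sketched as:

```python
def multiplex(ids, xs):
    # y[i] is the i-th row of the (ids[i])-th candidate tensor
    return [xs[k][i] for i, k in enumerate(ids)]
```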

## elementwise_pow

Limited Elementwise Pow Operator.

The equation is:

$$Out = X ^ Y$$

$X$ is a tensor of any dimension and the dimensions of tensor $Y$ must be smaller than or equal to the dimensions of $X$.

There are two cases for this operator: 1. The shape of $Y$ is the same as $X$; 2. The shape of $Y$ is a subset of $X$.

For case 2: $Y$ will be broadcasted to match the shape of $X$ and axis should be set to index of the start dimension to broadcast $Y$ onto $X$.

For example:

    shape(X) = (2, 3, 4, 5), shape(Y) = (,)
    shape(X) = (2, 3, 4, 5), shape(Y) = (5,)
    shape(X) = (2, 3, 4, 5), shape(Y) = (4, 5)
    shape(X) = (2, 3, 4, 5), shape(Y) = (3, 4), with axis=1
    shape(X) = (2, 3, 4, 5), shape(Y) = (2,), with axis=0


Either of the inputs $X$ and $Y$ or none can carry the LoD (Level of Details) information. However, the output only shares the LoD information with input $X$.

Inputs: X : (Tensor), The first input tensor of elementwise op.Y : (Tensor), The second input tensor of elementwise op. Out : The output of elementwise op. axis (Duplicable): (int, default -1). The start dimension index for broadcasting Y onto X.

## proximal_gd

ProximalGD Operator.

Optimizer that implements the proximal gradient descent algorithm:

$$prox\_param = param - learning\_rate * grad \\ param = sign(prox\_param) / (1 + learning\_rate * l2) * \max(|prox\_param| - learning\_rate * l1, 0)$$

The paper that proposed Proximal Gradient Descent: (http://papers.nips.cc/paper/3793-efficient-learning-using-forward-backward-splitting.pdf)

Inputs: Param : (Tensor, default Tensor) Input parameter value that has to be updated.Grad : (Tensor, default Tensor) Input gradient of the parameter.LearningRate : (Tensor, default Tensor) The learning rate should be a tensor of size 1. ParamOut : (Tensor) Output updated parameter value. l1 (Duplicable): (float, default 0.0) L1 regularization strength.l2 (Duplicable): (float, default 0.0) L2 regularization strength.
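A scalar sketch of one update step (the real op applies this element-wise to tensors):

```python
def proximal_gd_step(param, grad, lr, l1=0.0, l2=0.0):
    prox = param - lr * grad
    sign = (prox > 0) - (prox < 0)  # sign(prox_param)
    return sign / (1.0 + lr * l2) * max(abs(prox) - lr * l1, 0.0)
```

With l1 = l2 = 0 this reduces to plain gradient descent; a large l1 shrinks the parameter toward zero.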

## prelu

PRelu Operator.

The equation is:

$$f(x) = \begin{cases} \alpha * x, \quad \text{if} \ x < 0 \\ x, \qquad \text{if} \ x >= 0 \end{cases}$$

The input X can carry the LoD (Level of Details) information, or not. And the output shares the LoD information with input X.

Inputs: X : The input tensor of prelu operator.Alpha : The alpha weight of prelu operator. Out : The output tensor of prelu operator.

## prior_box

Prior box operator. Generates prior boxes for the SSD (Single Shot MultiBox Detector) algorithm. Each position of the input produces N prior boxes, where N is determined by the count of min_sizes, max_sizes and aspect_ratios. The size of each box is in the (min_size, max_size) interval, and the boxes are generated in sequence according to the aspect_ratios.

Inputs: Input : (Tensor, default Tensor), the input feature data of PriorBoxOp, The layout is NCHW.Image : (Tensor, default Tensor), the input image data of PriorBoxOp, The layout is NCHW. Boxes : (Tensor, default Tensor), the output prior boxes of PriorBoxOp. The layout is [H, W, num_priors, 4]. H is the height of input, W is the width of input, num_priors is the box count of each position.Variances : (Tensor, default Tensor), the expanded variances of PriorBoxOp. The layout is [H, W, num_priors, 4]. H is the height of input, W is the width of input, num_priors is the box count of each position. min_sizes (Duplicable): (vector) List of min sizes of generated prior boxes.max_sizes (Duplicable): (vector) List of max sizes of generated prior boxes.aspect_ratios (Duplicable): (vector) List of aspect ratios of generated prior boxes.variances (Duplicable): (vector) List of variances to be encoded in prior boxes.flip (Duplicable): (bool) Whether to flip aspect ratios.clip (Duplicable): (bool) Whether to clip out-of-boundary boxes.step_w (Duplicable): Prior boxes step across width, 0 for auto calculation.step_h (Duplicable): Prior boxes step across height, 0 for auto calculation.offset (Duplicable): (float) Prior boxes center offset.

## proximal_adagrad

ProximalAdagrad Operator.

Optimizer that implements the proximal adagrad algorithm:

$$moment = moment + grad * grad \\ prox\_param = param - learning\_rate * grad * (1 / \sqrt{moment}) \\ param = sign(prox\_param) / (1 + learning\_rate * l2) * \max(|prox\_param| - learning\_rate * l1 , 0)$$

The paper that proposed Proximal GD: (http://papers.nips.cc/paper/3793-efficient-learning-using-forward-backward-splitting.pdf) Here, we use the adagrad learning rate as specified here: (http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf)

Inputs: Param : (Tensor, default Tensor) Input parameter that has to be updated.Moment : (Tensor, default Tensor) Moment parameter that has to be updated.Grad : (Tensor, default Tensor) Input gradient of the parameter.LearningRate : (Tensor, default Tensor) The learning rate should be a tensor of size 1. ParamOut : (Tensor) Output updated parameter value.MomentOut : (Tensor) Output updated moment value. l1 (Duplicable): (float, default 0.0) L1 regularization strength.l2 (Duplicable): (float, default 0.0) L2 regularization strength.

## rank_loss

RankLoss Operator.

RankLoss operator for RankNet (http://icml.cc/2015/wp-content/uploads/2015/06/icml_ranking.pdf). RankNet is a pairwise ranking model with one training sample consisting of a pair of doc A and B, and the label P indicating that A is ranked higher than B or not:

P = {0, 1} or {0, 0.5, 1}, where 0.5 means no information about the rank of the input pair.

The RankLoss operator takes three inputs: Left (o_i), Right (o_j) and Label (P_{i,j}), which represent the output score of RankNet for the two docs and the label respectively, and yields the rank loss C_{i,j} using the following equation:

$$C_{i,j} = -\tilde{P_{i,j}} * o_{i,j} + \log(1 + e^{o_{i,j}}) \\ o_{i,j} = o_i - o_j \\ \tilde{P_{i,j}} = \left \{0, 0.5, 1 \right \} \ \text{or} \ \left \{0, 1 \right \}$$

The operator can take batch inputs with size batch_size (batch_size >= 1).

Inputs: Label : (2-D Tensor with shape [batch_size x 1]) The label indicating A ranked higher than B or not.Left : (2-D Tensor with shape [batch_size x 1]) The output of RankNet for doc A.Right : (2-D Tensor with shape [batch_size x 1]) The output of RankNet for doc B. Out : (2-D Tensor with shape [batch_size x 1]) The output loss of RankLoss operator.
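Per pair of docs, the loss can be sketched as:

```python
import math

def rank_loss(label, left, right):
    o = left - right                             # o_{i,j} = o_i - o_j
    return -label * o + math.log1p(math.exp(o))  # C = -P * o + log(1 + e^o)
```

When label = 1 and the left score is much larger than the right one, the loss approaches zero.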

## reduce_min

ReduceMin Operator.

This operator computes the min of input tensor along the given dimension. The result tensor has 1 fewer dimension than the input unless keep_dim is true. If reduce_all is true, just reduce along all dimensions and output a scalar.

Inputs: X : (Tensor) The input tensor. Tensors with rank at most 6 are supported. Out : (Tensor) The result tensor. dim (Duplicable): (int, default 0) The dimension to reduce. Must be in the range [-rank(input), rank(input)). If dim < 0, the dim to reduce is rank + dim. Note that reducing on the first dim will make the LoD info lost.keep_dim (Duplicable): (bool, default false) If true, retain the reduced dimension with length 1.reduce_all (Duplicable): (bool, default false) If true, output a scalar reduced along all dimensions.

## reduce_max

ReduceMax Operator.

This operator computes the max of input tensor along the given dimension. The result tensor has 1 fewer dimension than the input unless keep_dim is true. If reduce_all is true, just reduce along all dimensions and output a scalar.

Inputs: X : (Tensor) The input tensor. Tensors with rank at most 6 are supported. Out : (Tensor) The result tensor. dim (Duplicable): (int, default 0) The dimension to reduce. Must be in the range [-rank(input), rank(input)). If dim < 0, the dim to reduce is rank + dim. Note that reducing on the first dim will make the LoD info lost.keep_dim (Duplicable): (bool, default false) If true, retain the reduced dimension with length 1.reduce_all (Duplicable): (bool, default false) If true, output a scalar reduced along all dimensions.

## reduce_mean

ReduceMean Operator.

This operator computes the mean of input tensor along the given dimension. The result tensor has 1 fewer dimension than the input unless keep_dim is true. If reduce_all is true, just reduce along all dimensions and output a scalar.

Inputs: X : (Tensor) The input tensor. Tensors with rank at most 6 are supported. Out : (Tensor) The result tensor. dim (Duplicable): (int, default 0) The dimension to reduce. Must be in the range [-rank(input), rank(input)). If dim < 0, the dim to reduce is rank + dim. Note that reducing on the first dim will make the LoD info lost.keep_dim (Duplicable): (bool, default false) If true, retain the reduced dimension with length 1.reduce_all (Duplicable): (bool, default false) If true, output a scalar reduced along all dimensions.

## round

Round Activation Operator.

$out = [x]$, where $[x]$ denotes rounding $x$ to the nearest integer.

Inputs: X : Input of Round operator Out : Output of Round operator

## norm

   "Input shape: $(N, C, H, W)$
Scale shape: $(C, 1)$
Output shape: $(N, C, H, W)$
Where
forward
<span class="markdown-equation" id="equation-0"></span>
backward
<span class="markdown-equation" id="equation-1"></span>

Inputs: X : (Tensor) The input tensor of norm operator. The format of input tensor is NCHW, where N is batch size, C is the number of channels, and H and W are the height and width of the feature. Scale : (Tensor) The input tensor of norm operator. The format of input tensor is C * 1. Out : (Tensor) The output tensor of norm operator, N * M, where M = C * H * W. epsilon (Duplicable): (float, default 1e-10) Constant for numerical stability.

## modified_huber_loss

Modified Huber Loss Operator.

This operator is used in binary classification problems. The shape of input X and target Y are both [N, 1] and so is the shape of the output loss. Since target Y is not differentiable, calculating the gradient for Y is illegal. The formula of modified huber loss is:

$$L(y, f(x)) = \begin{cases} (\max(0, 1 - yf(x)))^2, \text{if} \ yf(x) >= -1 \\ -4yf(x), \quad \text{otherwise} \end{cases}$$

Make sure the values of target label Y are in {0, 1} here. This operator will scale values of Y to {-1, +1} when computing losses and gradients.

Inputs: X : The input tensor of modified huber loss op. X is 2-D tensor with shape [batch_size, 1].Y : The target labels of modified huber loss op. The shape of Y is the same as X. Values of Y must be 0 or 1. IntermediateVal (Intermediate) : Variable to save intermediate result which will be reused in backward processing.Out : Classification loss for X.

## elementwise_sub

Limited Elementwise Sub Operator.

The equation is:

$$Out = X - Y$$

$X$ is a tensor of any dimension and the dimensions of tensor $Y$ must be smaller than or equal to the dimensions of $X$.

There are two cases for this operator: 1. The shape of $Y$ is the same as $X$; 2. The shape of $Y$ is a subset of $X$.

For case 2: $Y$ will be broadcasted to match the shape of $X$ and axis should be set to index of the start dimension to broadcast $Y$ onto $X$.

For example:

    shape(X) = (2, 3, 4, 5), shape(Y) = (,)
    shape(X) = (2, 3, 4, 5), shape(Y) = (5,)
    shape(X) = (2, 3, 4, 5), shape(Y) = (4, 5)
    shape(X) = (2, 3, 4, 5), shape(Y) = (3, 4), with axis=1
    shape(X) = (2, 3, 4, 5), shape(Y) = (2,), with axis=0


Either of the inputs $X$ and $Y$ or none can carry the LoD (Level of Details) information. However, the output only shares the LoD information with input $X$.

Inputs: X : (Tensor), The first input tensor of elementwise op.Y : (Tensor), The second input tensor of elementwise op. Out : The output of elementwise op. axis (Duplicable): (int, default -1). The start dimension index for broadcasting Y onto X.

## conv2d_transpose

Convolution2D Transpose Operator.

The convolution transpose operation calculates the output based on the input, filter, and the dilations, strides, paddings, groups parameters. The size of each dimension of the parameters is checked in the infer-shape. Input(Input) and output(Output) are in NCHW format, where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature. Filter(Input) is in MCHW format, where M is the number of input feature channels, C is the number of output feature channels, H is the height of the filter, and W is the width of the filter. Parameters(strides, paddings) are two elements. These two elements represent height and width, respectively. The input(X) size and output(Out) size may be different.

Example: Input: Input shape: $(N, C_{in}, H_{in}, W_{in})$ Filter shape: $(C_{in}, C_{out}, H_f, W_f)$ Output: Output shape: $(N, C_{out}, H_{out}, W_{out})$ Where $$H_{out} = (H_{in} - 1) * strides[0] - 2 * paddings[0] + dilations[0] * (H_f - 1) + 1 \\ W_{out} = (W_{in} - 1) * strides[1] - 2 * paddings[1] + dilations[1] * (W_f - 1) + 1$$
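The output-size formula above can be checked with a small helper (a sketch, not the operator's code):

```python
def conv_transpose_out_dim(in_dim, stride, padding, dilation, ksize):
    # H_out = (H_in - 1) * stride - 2 * padding + dilation * (ksize - 1) + 1
    return (in_dim - 1) * stride - 2 * padding + dilation * (ksize - 1) + 1
```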

Inputs: Input : (Tensor) The input tensor of convolution transpose operator. The format of input tensor is NCHW, where N is batch size, C is the number of input channels, H is the height of the feature, and W is the width of the feature. Filter : (Tensor) The filter tensor of convolution transpose operator. The format of the filter tensor is MCHW, where M is the number of input feature channels, C is the number of output feature channels, H is the height of the filter, and W is the width of the filter. We enforce groups number == 1 in the convolution transpose scenario. Output : (Tensor) The output tensor of convolution transpose operator. The format of output tensor is also NCHW. dilations (Duplicable): (vector, default:{1, 1}), the dilations(h_dilation, w_dilation) of convolution transpose operator. strides (Duplicable): (vector, default:{1, 1}), the strides(h_stride, w_stride) of convolution transpose operator. paddings (Duplicable): (vector, default:{0, 0}), the paddings(h_pad, w_pad) of convolution transpose operator. use_cudnn (Duplicable): (bool, default false) Only used in cudnn kernel; requires cuDNN to be installed. data_format (Duplicable): (string, default NCHW) An optional string from: "NHWC", "NCHW". Specify the data format of the output data; the input will be transformed automatically. workspace_size_MB (Duplicable): Used in cudnn kernel only. Workspace size for cuDNN, in MB. The workspace is a section of GPU memory that will be allocated/freed each time the operator runs; a larger workspace size can increase performance but also requires better hardware. This size should be set carefully.

## elementwise_max

Limited Elementwise Max Operator.

The equation is:

$$Out = max(X, Y)$$

$X$ is a tensor of any dimension and the dimensions of tensor $Y$ must be smaller than or equal to the dimensions of $X$.

There are two cases for this operator: 1. The shape of $Y$ is same with $X$; 2. The shape of $Y$ is a subset of $X$.

For case 2: $Y$ will be broadcasted to match the shape of $X$ and axis should be set to index of the start dimension to broadcast $Y$ onto $X$.

For example:

```text
shape(X) = (2, 3, 4, 5), shape(Y) = (,)
shape(X) = (2, 3, 4, 5), shape(Y) = (5,)
shape(X) = (2, 3, 4, 5), shape(Y) = (4, 5)
shape(X) = (2, 3, 4, 5), shape(Y) = (3, 4), with axis=1
shape(X) = (2, 3, 4, 5), shape(Y) = (2,), with axis=0
```


Either input ($X$ or $Y$), or neither, can carry the LoD (Level of Details) information. However, the output only shares the LoD information with input $X$.
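
The broadcasting rule for case 2 can be sketched with NumPy (the helper name and the trailing-ones reshape are our illustration, not the operator's implementation):

```python
import numpy as np

def elementwise_max(x, y, axis=-1):
    # Align y's dims with x starting at `axis` by appending size-1 dims,
    # then let NumPy broadcasting do the rest.
    if axis == -1:
        axis = x.ndim - y.ndim
    y = y.reshape(y.shape + (1,) * (x.ndim - axis - y.ndim))
    return np.maximum(x, y)

x = np.arange(24).reshape(2, 3, 4)
y = np.array([10.0, 5.0, 0.0])          # shape (3,), broadcast with axis=1
out = elementwise_max(x, y, axis=1)
print(out.shape)  # (2, 3, 4)
```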

Inputs: X : (Tensor), The first input tensor of elementwise op.Y : (Tensor), The second input tensor of elementwise op. Out : The output of elementwise op. axis (Duplicable): (int, default -1). The start dimension index for broadcasting Y onto X.

## smooth_l1_loss

Smooth L1 Loss Operator.

This operator computes the smooth l1 loss for X and Y. The operator takes the first dimension of X and Y as batch size. For each instance, it computes the smooth l1 loss element by element first and then sums all the losses. So the shape of Out is [batch_size, 1].

The equation is: $$Out_{\sigma}(X, Y)_i = \begin{cases} 0.5 * (\sigma * (X_i - Y_i))^2, & |X_i - Y_i| \lt \frac{1}{\sigma^2} \\ |X_i - Y_i| - \frac{0.5}{\sigma^2}, & otherwise \end{cases}$$

In the above equation, $Out_{\sigma}(X, Y)_i$, $X_i$ and $Y_i$ represent the i-th element of Out, X and Y, respectively.
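
A minimal NumPy sketch of the per-instance loss (the function name and the flattening step are ours):

```python
import numpy as np

def smooth_l1_loss(x, y, sigma=3.0):
    # Elementwise smooth L1, then sum per instance -> shape [batch_size, 1].
    d = np.abs(x - y)
    loss = np.where(d < 1.0 / sigma**2,
                    0.5 * (sigma * d) ** 2,
                    d - 0.5 / sigma**2)
    return loss.reshape(loss.shape[0], -1).sum(axis=1, keepdims=True)

x = np.array([[0.0, 2.0], [1.0, 1.0]])
y = np.array([[0.05, 0.0], [1.0, 1.0]])
print(smooth_l1_loss(x, y).shape)  # (2, 1)
```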

Inputs: X : (Tensor, default Tensor) A tensor with rank at least 2. The input value of smooth l1 loss op with shape [batch_size, dim1, ..., dimN].Y : (Tensor, default Tensor) A tensor with rank at least 2. The target value of smooth l1 loss op with same shape as X.InsideWeight : (Tensor, default Tensor) A tensor with rank at least 2. This input is optional and should have same shape with X. If provided, the result of (X - Y) will be multiplied by this tensor element by element.OutsideWeight : (Tensor, default Tensor) A tensor with rank at least 2. This input is optional and should have same shape with X. If provided, the out smooth l1 loss will be multiplied by this tensor element by element. Diff (Intermediate) : Intermediate variable to cache InsideWeight * (X - Y).Out : (Tensor, default Tensor) A tensor with rank be 2. The output smooth l1 loss with shape [batch_size, 1]. sigma (Duplicable): Hyper parameter of smooth l1 loss op.A float scalar with default value 3.0.

## reorder_lod_tensor_by_rank

ReorderLoDTensorByRankTable operator.

Input(X) is a batch of sequences. Input(RankTable) stores new orders of the input sequence batch. The reorder_lod_tensor_by_rank operator reorders the Input(X) according to the information provided by Input(RankTable).

For example:

If the indices stored in the Input(RankTable) are [3, 0, 2, 1], the Input(X) will be reordered such that the fourth sequence in Input(X) becomes the first one, followed by the original first, third, and second ones.

This is: X = [Seq0, Seq1, Seq2, Seq3]. The indices in RankTable are [3, 0, 2, 1]. Out = [Seq3, Seq0, Seq2, Seq1] with a new LoD information.

If the LoD information of Input(X) is empty, this means Input(X) is not sequence data. This is also identical to a batch of sequences where each sequence has a fixed length 1. In this case, the reorder_lod_tensor_by_rank operator reorders each slice of Input(X) along the first axis according to Input(RankTable).

This is: X = [Slice0, Slice1, Slice2, Slice3] and its LoD information is empty. The indices in RankTable are [3, 0, 2, 1]. Out = [Slice3, Slice0, Slice2, Slice1], and no LoD information is appended.

NOTE: This operator sorts Input(X) according to a given LoDRankTable which does not need to be calculated according to Input(X). It can be calculated according to another different sequence, and then this operator sorts Input(X) according to the given LoDRankTable.
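
For the empty-LoD case described above, the reorder reduces to a gather along the first axis, which can be sketched as:

```python
import numpy as np

x = np.array([[0.], [1.], [2.], [3.]])   # [Slice0, Slice1, Slice2, Slice3]
rank_table = [3, 0, 2, 1]                # indices from the LoDRankTable
out = x[rank_table]                      # gather rows in the new order
print(out.ravel().tolist())  # [3.0, 0.0, 2.0, 1.0]
```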

Inputs: X : (LoDTensor), the input lod tensor to be reordered according to Input(RankTable).RankTable : (LoDRankTable), the rank table according to which Input(X) is reordered. Out : (LoDTensor), the reordered lod tensor.

## pad

Pad input into output, as specified by paddings and pad_value. The input should be a k-D tensor (k > 0 and k < 7). As an example:

Given:

Given:

```text
X = [[1, 2],
     [3, 4]],
paddings = [0, 1, 1, 2],
```

and pad_value = 0, we have:

```text
Out = [[0, 1, 2, 0, 0],
       [0, 3, 4, 0, 0],
       [0, 0, 0, 0, 0]]
```
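
The example above maps directly onto NumPy's padding (the (top, bottom, left, right) interpretation of paddings follows the attribute description below):

```python
import numpy as np

# paddings = [0, 1, 1, 2] means 0 rows on top, 1 row on bottom,
# 1 column on the left, 2 columns on the right.
x = np.array([[1, 2], [3, 4]])
paddings = [0, 1, 1, 2]
out = np.pad(x, ((paddings[0], paddings[1]), (paddings[2], paddings[3])),
             constant_values=0)
print(out)
```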

Inputs: X : The input of pad op. The input should be a k-D tensor(k > 0 and k < 7) Out : The output of pad op. A tensor with the same shape as X. paddings (Duplicable): (vector) A list to describe the padding rules for each dimension. For 2-D image tensor, paddings=[0, 1, 2, 3] means padding 0 row to top, 1 row to bottom, 2 columns to left and 3 columns to right. Size of paddings should be equal to 2 * dimension size of the input tensor.pad_value (Duplicable): (float, default 0.0) The value to fill the padded areas.

## lstm_unit

Lstm Unit Operator

Equation:

$$i, f, o, j = split(X) \\ C = C_{prev} * sigm(f + forget\_bias) + sigm(i) * tanh(j) \\ H = C * sigm(o)$$
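
The equation can be sketched in NumPy as a single step (names and shapes are illustrative; x packs the four pre-activations along the last axis, as the Inputs note below requires):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_unit(x, c_prev, forget_bias=0.0):
    # x packs the four pre-activations [i, f, o, j] along the last axis.
    i, f, o, j = np.split(x, 4, axis=-1)
    c = c_prev * sigmoid(f + forget_bias) + sigmoid(i) * np.tanh(j)
    h = c * sigmoid(o)
    return c, h

x = np.zeros((2, 8))          # batch 2, hidden size 2 -> 4 * 2 columns
c_prev = np.ones((2, 2))
c, h = lstm_unit(x, c_prev)
print(c.shape, h.shape)  # (2, 2) (2, 2)
```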

Inputs: X : Lstm unit only applies non-linear activations, please make sure that a linear transformation has already been applied to X. A linear transformation can be applied by adding a fc layer.C_prev : The cell state tensor of the last time-step in the Lstm Unit operator. C : The cell tensor of Lstm Unit operator.H : The hidden state tensor of Lstm Unit operator. forget_bias (Duplicable): (float, default 0.0) The forget bias of Lstm Unit.

## squared_l2_norm

SquaredL2Norm Operator.

Computes the squared L2 norm of a tensor.

$$Out = \sum_{i} X_{i}^2$$

Inputs: X : (Tensor) The input of squared_l2_norm op. Out : (Scalar) The output of squared_l2_norm op.

## sequence_expand

Sequence Expand Operator.

This operator expands input(X) according to the LoD of input(Y). The following cases explain how this works:

Case 1:

Given a 2-level LoDTensor input(X)

```text
X.lod  = [[0, 2, 3], [0, 1, 3, 4]]
X.data = [a, b, c, d]
X.dims = [4, 1]
```

and input(Y) with Y.lod = [[0, 2, 4], [0, 3, 6, 7, 8]], with condition len(Y.lod[-1]) - 1 == X.dims[0], then we get the 2-level LoDTensor

```text
Out.lod  = [[0, 2, 4], [0, 3, 6, 7, 8]]
Out.data = [a, a, a, b, b, b, c, d]
Out.dims = [8, 1]
```

Case 2:

Given a common Tensor input(X)

```text
X.data = [a, b, c]
X.dims = [3, 1]
```

and input(Y) with Y.lod = [[0, 2, 3, 6]], with condition len(Y.lod[-1]) - 1 == X.dims[0], then we get the 1-level LoDTensor

```text
Out.lod  = [[0, 2, 3, 6]]
Out.data = [a, a, b, c, c, c]
Out.dims = [6, 1]
```

Case 3:

Given a common Tensor input(X)

```text
X.data = [[a, b], [c, d], [e, f]]
X.dims = [3, 2]
```

and input(Y) with Y.lod = [[0, 2, 3, 6]], with condition len(Y.lod[-1]) - 1 == X.dims[0], then we get the 1-level LoDTensor

```text
Out.lod  = [[0, 2, 3, 6]]
Out.data = [[a, b], [a, b], [c, d], [e, f], [e, f], [e, f]]
Out.dims = [6, 2]
```

Case 4:

Given a 2-level LoDTensor input(X)

```text
X.lod  = [[0, 2, 3], [0, 1, 3, 4]]
X.data = [a, b, c, d]
X.dims = [4, 1]
```

and input(Y) with Y.lod = [[0, 2, 4], [0, 3, 6, 6, 8]], with condition len(Y.lod[-1]) - 1 == X.dims[0], then we get the 2-level LoDTensor

```text
Out.lod  = [[0, 2, 4], [0, 3, 6, 6, 8]]
Out.data = [a, a, a, b, b, b, d, d]
Out.dims = [8, 1]
```
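
For the common-Tensor cases (2 and 3), the expansion is a row-wise repeat driven by the sequence lengths in Y's last-level LoD, which can be sketched as:

```python
import numpy as np

def sequence_expand(x, y_lod):
    # Repeat x[i] by the length of the i-th sequence in y_lod.
    repeats = np.diff(y_lod)             # [2, 1, 3] for lod [0, 2, 3, 6]
    return np.repeat(x, repeats, axis=0)

x = np.array([['a'], ['b'], ['c']])
out = sequence_expand(x, [0, 2, 3, 6])
print(out.ravel().tolist())  # ['a', 'a', 'b', 'c', 'c', 'c']
```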

Inputs: X : (Tensor or LoDTensor) The input(X) of this operator can be a LoDTensor or a base Tensor.Y : (LoDTensor)The reference input(Y) of sequence_expand op.It must be a LoDTensor with k-level(k>0).The input(X) will be expanded according to LOD of input(Y).The element numbers of last level in input(Y) must be equal to dims[0] of input(X). Out : (LodTensor)The output of sequence_expand op.The lod of output will be as same as input(Y)'s lod.

## momentum

Momentum Optimizer.

This optimizer has a flag for Nesterov Momentum. The update equations are as follows:

$$velocity = mu * velocity + gradient \\ if (use\_nesterov): \\ param = param - gradient * learning\_rate + mu * velocity * learning\_rate \\ else: \\ param = param - learning\_rate * velocity. \\$$
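
The plain (non-Nesterov) branch of the update can be sketched as follows; the function name and argument order are illustrative:

```python
import numpy as np

def momentum_step(param, grad, velocity, lr, mu):
    # velocity = mu * velocity + gradient; param = param - lr * velocity
    velocity_out = mu * velocity + grad
    param_out = param - lr * velocity_out
    return param_out, velocity_out

p, v = momentum_step(np.array([1.0]), np.array([0.5]), np.array([0.0]),
                     lr=0.1, mu=0.9)
print(p, v)  # [0.95] [0.5]
```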

Inputs: Param : (Tensor, default Tensor) Input parameter that has to be updatedGrad : (Tensor, default Tensor) Input gradient of the parameterVelocity : (Tensor, default Tensor) Input velocity (corresponding to the parameter) that has to be updatedLearningRate : (Tensor, default Tensor) Input learning rate ParamOut : (Tensor) This output is the updated parameter. It shares memory with Input(Param).VelocityOut : (Tensor) This output is the updated velocity. It shares memory with Input(Velocity). mu (Duplicable): (float) Momentum coefficientuse_nesterov (Duplicable): (bool, default false) Use Nesterov Momentum

## uniform_random

Uniform random operator.

This operator initializes a tensor with random values sampled from a uniform distribution.

Inputs: Out : (Tensor) The output tensor of uniform random op shape (Duplicable): (vector) The shape of the output tensormin (Duplicable): (float, default -1.0) Minimum value of uniform randommax (Duplicable): (float, default 1.0) Maximum value of uniform randomseed (Duplicable): (int, default 0) Random seed used for generating samples. 0 means use a seed generated by the system.dtype (Duplicable): (int, default 5(FP32)) Output tensor data type

## split_selected_rows

Split a SelectedRows with a specified rows section. height_sections is only needed when the dims of the original tensor need to be split.

Example:

```text
Input:
  X.rows   = {7, 5}
  X.height = 12
Attr:
  height_sections = {4, 8}
Out:
  out0.rows   = {}
  out0.height = 4
  out1.rows   = {5, 7}
  out1.height = 8
```

Inputs: X : The input SelectedRows. Out (Duplicable) : The outputs of input SelectedRows. height_sections (Duplicable): Height for each output SelectedRows.

## adam

This implements the Adam optimizer from Section 2 of the Adam paper: https://arxiv.org/abs/1412.6980. Adam is a first-order gradient-based optimization method based on adaptive estimates of lower-order moments.

$$moment\_1\_out = \beta_1 * moment\_1 + (1 - \beta_1) * grad \\ moment\_2\_out = \beta_2 * moment\_2 + (1 - \beta_2) * grad * grad \\ learning\_rate = learning\_rate * \frac{\sqrt{1 - \beta_{2\_pow}}}{1 - \beta_{1\_pow}} \\ param\_out = param - learning\_rate * \frac{moment\_1\_out}{\sqrt{moment\_2\_out} + \epsilon}$$
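
The update above can be sketched as a single NumPy step (the function name and argument order are ours):

```python
import numpy as np

def adam_step(param, grad, m1, m2, beta1_pow, beta2_pow,
              lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # First and second moment estimates, then bias-corrected step.
    m1 = beta1 * m1 + (1 - beta1) * grad
    m2 = beta2 * m2 + (1 - beta2) * grad * grad
    lr_t = lr * np.sqrt(1 - beta2_pow) / (1 - beta1_pow)
    param = param - lr_t * m1 / (np.sqrt(m2) + eps)
    return param, m1, m2

p, m1, m2 = adam_step(np.array([1.0]), np.array([0.5]),
                      np.zeros(1), np.zeros(1),
                      beta1_pow=0.9, beta2_pow=0.999)
print(p, m1, m2)
```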

Inputs: Param : (Tensor) Input parameterGrad : (Tensor) Input gradientLearningRate : (Tensor) Learning rateMoment1 : (Tensor) Input first momentMoment2 : (Tensor) Input second momentBeta1Pow : (Tensor) Input beta1 power accumulatorBeta2Pow : (Tensor) Input beta2 power accumulator ParamOut : (Tensor) Output parameterMoment1Out : (Tensor) Output first momentMoment2Out : (Tensor) Output second moment beta1 (Duplicable): (float, default 0.9) Exponential decay rate for the first moment estimates.beta2 (Duplicable): (float, default 0.999) exponential decay rate for the second moment estimates.epsilon (Duplicable): (float, default 1.0e-8) Constant for numerical stability

## increment

Increment Operator.

The equation is: $$Out = X + step$$

Inputs: X : (Tensor) The input tensor of increment operator Out : (Tensor) The output tensor of increment operator. step (Duplicable): (float, default 1.0) The step size by which the input tensor will be incremented.

## gru_unit

GRUUnit Operator implements partial calculations of the GRU unit as following:

$$update \ gate: u_t = actGate(xu_t + W_u * h_{t-1} + b_u) \\ reset \ gate: r_t = actGate(xr_t + W_r * h_{t-1} + b_r) \\ output \ candidate: \tilde{h}_t = actNode(xc_t + W_c * dot(r_t, h_{t-1}) + b_c) \\ output: h_t = dot((1 - u_t), h_{t-1}) + dot(u_t, \tilde{h}_t)$$

which is the same as one time step of the GRU Operator.
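
A NumPy sketch of these equations, assuming the weight layout described under Inputs below (gate weights first, candidate weights second); bias terms are omitted for brevity:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_unit(x, h_prev, w, frame_size):
    # x packs [xu, xr, xc]; w packs gate weights then candidate weights.
    xu, xr, xc = np.split(x, 3, axis=-1)
    w_gate = w[:, :2 * frame_size]         # update + reset gate weights
    w_cand = w[:, 2 * frame_size:]         # output-candidate weights
    gu, gr = np.split(h_prev @ w_gate, 2, axis=-1)
    u = sigmoid(xu + gu)                   # update gate
    r = sigmoid(xr + gr)                   # reset gate
    h_tilde = np.tanh(xc + (r * h_prev) @ w_cand)
    return (1 - u) * h_prev + u * h_tilde

batch, frame = 2, 3
h = gru_unit(np.zeros((batch, 3 * frame)), np.zeros((batch, frame)),
             np.zeros((frame, 3 * frame)), frame)
print(h.shape)  # (2, 3)
```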

@note To implement the complete GRU unit, fully-connected operator must be used before to feed xu, xr and xc as the Input of GRUUnit operator.

Inputs: Input : (Tensor) Matrix with shape [batch_size, frame_size * 3] for the input.HiddenPrev : (Tensor) Matrix with shape [batch_size, frame_size] for the states of previous time step.Weight : (Tensor) Weight matrix with shape [frame_size, frame_size * 3]. The elements continuous in memory can be divided into two parts. The first part are weights of the update gate and reset gate with shape [frame_size, frame_size * 2], and the second part are weights of output candidate with shape [frame_size, frame_size].Bias : (Tensor) Bias vector with shape [1, frame_size * 3] concatenating bias of the update gate, reset gate and output candidate. Gate (Intermediate) : (Tensor) Matrix with shape [batch_size, frame_size * 3] for the output of update gate, reset gate and output candidate.ResetHiddenPrev (Intermediate) : (Tensor) Matrix with shape [batch_size, frame_size] for the reseted hidden state of previous time step.Hidden : (Tensor) The GRU hidden state of the current time step with shape [batch_size, frame_size]. activation (Duplicable): (enum int, default tanh) The activation type used for output candidate {h}_t.gate_activation (Duplicable): (enum int, default sigmoid) The activation type used in update gate and reset gate.

## less_than

less_than Operator

It operates element-wise on X and Y and returns Out. Each of them is an N-dim tensor. X and Y can be of any type. Each element of the Out tensor is calculated by $Out = X < Y$.

Inputs: X : (LoDTensor) the left hand operand of less_than operatorY : (LoDTensor) the right hand operand of less_than operator Out : (LoDTensor) n-dim bool tensor. Each element is Out = X < Y axis (Duplicable): (int, default -1). The start dimension index for broadcasting Y onto X.

## sequence_pool

Sequence Pool Operator.

The SequencePoolOp pools features of all time-steps of each instance. It supports six pooling types:

1. AVERAGE: $$Out[i] = \frac{\sum_j X_{ij}}{len(X_i)}$$
2. SUM: $$Out[i] = \sum_j X_{ij}$$
3. SQRT: $$Out[i] = \frac{\sum_j X_{ij}}{\sqrt{len(X_i)}}$$
4. LAST: Out[i] = the last instance in the i-th sequence X[i]
5. FIRST: Out[i] = the first instance in the i-th sequence X[i]
6. MAX: $$Out[i] = max(X_i)$$

The following example explains how this works: For a mini-batch of 3 variable-length sentences, containing 2, 3, and 2 time-steps:

Assume X is a [7,M,N] LoDTensor, and X->lod()[0] = [0, 2, 5, 7], 7=2+3+2. Besides, for the sake of simplicity, we assume M=1 and N=1, and the value of X = [[1, 3], [2, 4, 6], [5, 1]].

Thus, Out is a [3,1,1] Tensor without LoD information. For different pooltypes, the value of Out is as follows:

• AVERAGE: [2, 4, 3], where 2=(1+3)/2, 4=(2+4+6)/3, 3=(5+1)/2
• SUM: [4, 12, 6], where 4=1+3, 12=2+4+6, 6=5+1
• SQRT: [2.82, 6.93, 4.24], where 2.82=(1+3)/sqrt(2), 6.93=(2+4+6)/sqrt(3), 4.24=(5+1)/sqrt(2)
• MAX: [3, 6, 5], where 3=max(1,3), 6=max(2,4,6), 5=max(5,1)
• LAST: [3, 6, 1], where 3=last(1,3), 6=last(2,4,6), 1=last(5,1)
• FIRST: [1, 2, 5], where 1=first(1,3), 2=first(2,4,6), 5=first(5,1)
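
The pooled values above can be reproduced with a small sketch over the LoD offsets (the function name is ours; for simplicity the features are scalars, i.e. M=N=1 as in the example):

```python
import numpy as np

def sequence_pool(x, lod, pooltype='AVERAGE'):
    out = []
    for s, e in zip(lod[:-1], lod[1:]):    # each (s, e) is one sequence
        seq = x[s:e]
        if pooltype == 'AVERAGE':
            out.append(seq.mean(axis=0))
        elif pooltype == 'SUM':
            out.append(seq.sum(axis=0))
        elif pooltype == 'SQRT':
            out.append(seq.sum(axis=0) / np.sqrt(len(seq)))
        elif pooltype == 'MAX':
            out.append(seq.max(axis=0))
        elif pooltype == 'LAST':
            out.append(seq[-1])
        elif pooltype == 'FIRST':
            out.append(seq[0])
    return np.stack(out)

x = np.array([1., 3., 2., 4., 6., 5., 1.])
lod = [0, 2, 5, 7]
print(sequence_pool(x, lod, 'SUM').tolist())  # [4.0, 12.0, 6.0]
```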
Inputs: X : (LoDTensor) The variable-length input of SequencePoolOp Out : (Tensor) The output of SequencePoolOp does not contain LoD information.MaxIndex (Intermediate) : (Tensor) This tensor is used for the sequence max-pooling to record the max indexes. pooltype (Duplicable): (string, default 'AVERAGE') the pooling pooltype of SequencePoolOp.

## spp

> "With spatial pyramid pooling, the input image can be of any sizes. This not only allows arbitrary aspect ratios, but also allows arbitrary scales. We can resize the input image to any scale (e.g., min(w, h)=180, 224, ...) and apply the same deep network. When the input image is at different scales, the network (with the same filter sizes) will extract features at different scales. The scales play important roles in traditional methods."

Input shape: $(N, C_{in}, H_{in}, W_{in})$
Output shape: $(H_{out}, W_{out})$
Where
$$H_{out} = N \\ W_{out} = \frac{4^{pyramid\_height} - 1}{4 - 1} * C_{in}$$
Paper: https://arxiv.org/pdf/1406.4729v4.pdf
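
The $W_{out}$ formula is a geometric sum of the $4^l$ bins per pyramid level; a small helper (name ours) makes this concrete:

```python
def spp_out_width(pyramid_height, channels):
    # Total pooled features per image: sum of 4**l bins over the levels,
    # i.e. (4**pyramid_height - 1) / (4 - 1), times the channel count.
    return (4 ** pyramid_height - 1) // 3 * channels

print(spp_out_width(3, 2))  # (1 + 4 + 16) * 2 = 42
```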

Inputs: X : (Tensor) The input tensor of spp operator. The format of input tensor is NCHW, where N is batch size, C is the number of channels, and H and W are the height and width of the feature. Out : (Tensor) The output tensor of spp operator, with shape N * M, where M = C * H * W. pyramid_height (Duplicable): (int), multi level poolingpooling_type (Duplicable): (string), pooling type, can be "max" for max-pooling and "avg" for average-pooling.

## sign

Sign operator

$$Out = X.sign()$$

Inputs: X : (Tensor) Input tensor of sign operator. Out : (Tensor) Output tensor of sign operator.

## reduce_sum

ReduceSum Operator.

This operator computes the sum of input tensor along the given dimension. The result tensor has 1 fewer dimension than the input unless keep_dim is true. If reduce_all is true, just reduce along all dimensions and output a scalar.

Inputs: X : (Tensor) The input tensor. Tensors with rank at most 6 are supported. Out : (Tensor) The result tensor. dim (Duplicable): (int, default 0) The dimension to reduce. Must be in the range [-rank(input), rank(input)). If dim < 0, the dim to reduce is rank + dim. Note that reducing on the first dim will make the LoD info lost.keep_dim (Duplicable): (bool, default false) If true, retain the reduced dimension with length 1.reduce_all (Duplicable): (bool, default false) If true, output a scalar reduced along all dimensions.

## im2sequence

This op uses kernels to scan images and converts these images to sequences. After expanding, the number of time steps is output_height * output_width and the dimension of each time step is kernel_height * kernel_width * channels, in which:

output_height = 1 + (padding_up + padding_down + img_height - kernel_height + stride_height - 1) / stride_height; output_width = 1 + (padding_left + padding_right + img_width - kernel_width + stride_width - 1) / stride_width;
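
The size formula can be sketched as a helper (name ours; the `+ stride - 1` term makes the division round up):

```python
def im2sequence_out_size(img_size, kernel, stride, pad_before, pad_after):
    # Matches the formula above; integer (floor) division after the
    # "+ stride - 1" term yields a ceiling division overall.
    return 1 + (pad_before + pad_after + img_size - kernel + stride - 1) // stride

# For the 3x3 example below: kernel 2, stride 1, no padding -> 2 steps per axis.
print(im2sequence_out_size(3, 2, 1, 0, 0))  # -> 2
```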

This op can be used after a convolutional neural network, and before a recurrent neural network.

Given:

```text
x = [[[[ 6.  2.  1.]
       [ 8.  3.  5.]
       [ 0.  2.  6.]]

      [[ 2.  4.  4.]
       [ 6.  3.  0.]
       [ 6.  4.  7.]]]

     [[[ 6.  7.  1.]
       [ 5.  7.  9.]
       [ 2.  4.  8.]]

      [[ 1.  2.  1.]
       [ 1.  3.  5.]
       [ 9.  0.  8.]]]]

x.dims = {2, 2, 3, 3}
```

And:

```text
kernels  = [2, 2]
strides  = [1, 1]
paddings = [0, 0, 0, 0]
```

Then:

```text
output.data = [[ 6.  2.  8.  3.  2.  4.  6.  3.]
               [ 2.  1.  3.  5.  4.  4.  3.  0.]
               [ 8.  3.  0.  2.  6.  3.  6.  4.]
               [ 3.  5.  2.  6.  3.  0.  4.  7.]
               [ 6.  7.  5.  7.  1.  2.  1.  3.]
               [ 7.  1.  7.  9.  2.  1.  3.  5.]
               [ 5.  7.  2.  4.  1.  3.  9.  0.]
               [ 7.  9.  4.  8.  3.  5.  0.  8.]]
output.dims = {8, 8}
output.lod  = [[0, 4, 8]]
```

Inputs: X : (Tensor) The input tensor has NCHW format.N: batch sizeC: channelsH: heightW: width Out : (LodTensor) The output data of im2sequence op, kernels (Duplicable): (vector), the kernels(kernel_height, kernel_width)strides (Duplicable): (vector default:{1, 1}), the strides(h_stride, w_stride)paddings (Duplicable): (vector default:{0, 0, 0, 0}), the paddings(up_pad, left_pad, down_pad, right_pad)

## stanh

STanh Activation Operator.

$$out = b * \frac{e^{a * x} - e^{-a * x}}{e^{a * x} + e^{-a * x}}$$

Inputs: X : Input of STanh operator Out : Output of STanh operator scale_a (Duplicable): The scale parameter of a for the inputscale_b (Duplicable): The scale parameter of b for the input

## adamax

We implement the Adamax optimizer from Section 7 of the Adam paper: https://arxiv.org/abs/1412.6980. Adamax is a variant of the Adam algorithm based on the infinity norm.

$$moment\_out = \beta_1 * moment + (1 - \beta_1) * grad \\ inf\_norm\_out = max(\beta_2 * inf\_norm + \epsilon, |grad|) \\ learning\_rate = \frac{learning\_rate}{1 - \beta_{1\_pow}} \\ param\_out = param - learning\_rate * \frac{moment\_out}{inf\_norm\_out}$$

The original paper does not have an epsilon attribute. However, it is added here for numerical stability to prevent the division by 0 error.
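
A NumPy sketch of the update above (function name and argument order are ours):

```python
import numpy as np

def adamax_step(param, grad, moment, inf_norm, beta1_pow,
                lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment and the exponentially weighted infinity norm.
    moment = beta1 * moment + (1 - beta1) * grad
    inf_norm = np.maximum(beta2 * inf_norm + eps, np.abs(grad))
    lr_t = lr / (1 - beta1_pow)
    param = param - lr_t * moment / inf_norm
    return param, moment, inf_norm

p, m, n = adamax_step(np.array([1.0]), np.array([0.5]),
                      np.zeros(1), np.zeros(1), beta1_pow=0.9)
print(p, m, n)
```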

Inputs: Param : (Tensor) Input parameterGrad : (Tensor) Input gradientLearningRate : (Tensor) Learning rateMoment : (Tensor) First momentInfNorm : (Tensor) Input exponentially weighted infinity normBeta1Pow : (Tensor) Input beta1 power accumulator ParamOut : (Tensor) Output parameterMomentOut : (Tensor) Output first momentInfNormOut : (Tensor) Output exponentially weighted infinity norm beta1 (Duplicable): (float, default 0.9) Exponential decay rate for the 1st moment estimates.beta2 (Duplicable): (float, default 0.999) exponential decay rate for the weighted infinity norm estimates.epsilon (Duplicable): (float, default 1.0e-8) Constant for numerical stability

## tanh_shrink

TanhShrink Activation Operator.

$$out = x - \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

Inputs: X : Input of TanhShrink operator Out : Output of TanhShrink operator

## positive_negative_pair

PositiveNegativePairOp can be used to evaluate Learning To Rank(LTR) model's performance.

Within some context, e.g. the "query", an LTR model generates scores for a list of items, which gives a partial order of the items. PositiveNegativePairOp takes a list of reference rank orders (Input("Label")) and the model-generated scores (Input("Score")) as inputs and counts the pairs that are ranked correctly and incorrectly.

Inputs: Score : (Tensor, float) Model Score on an item (with respect to QueryID). It's a 2-D tensor with shape [batch_size, depth], where the column specified by the attribute "column" is used as item score.Label : (Tensor, float) Label of an item (with respect to QueryID). It's a 2-D tensor with shape [batch_size, 1].QueryID : (Tensor, int64) Query ID that indicates the context. Its shape should be the same as Label.AccumulatePositivePair : (float) Optional. The accumulated number of positive pairs over a stream of data. If provided, the output PositivePair will be initialized with this number rather than 0. It won't be modified in place.AccumulateNegativePair : (float) Optional. The accumulated number of negative pairs over a stream of data. If provided, the output NegativePair will be initialized with this number rather than 0. It won't be modified in place.AccumulateNeutralPair : (float) Optional. The accumulated number of neutral pairs over a stream of data. If provided, the output NeutralPair will be initialized with this number rather than 0. It won't be modified in place.Weight : (float) Optional. Weight of current item. If specified, its shape should be the same as Label, and the meaning of the output changes from numbers of pairs to the total sum of pairs' weights. The weight of a pair of items is the average of their weights. PositivePair : (float) Number of positive pairs, i.e. the pairs of items that are ranked correctly.NegativePair : (float) Number of negative pairs, i.e. the pairs of items that are ranked incorrectly.NeutralPair : (float) Number of neutral pairs, i.e. the pairs of items that have the same score. column (Duplicable): (int, default -1) The column position of Score used to rank items in descending order. It must be in the range of [-rank(Score), rank(Score)). If dim < 0, the dim to reduce is rank + dim. Note that reducing on the first dim will make the LoD info lost.

## one_hot

One Hot Operator. This operator creates the one-hot representations for input index values. The following example will help to explain the function of this operator:

X is a LoDTensor:

```text
X.lod   = [[0, 1, 4]]
X.shape = [4, 1]
X.data  = [[1], [1], [3], [0]]
```

set depth = 4

Out is a LoDTensor:

```text
Out.lod   = [[0, 1, 4]]
Out.shape = [4, 4]
Out.data  = [[0., 1., 0., 0.],
             [0., 1., 0., 0.],
             [0., 0., 0., 1.],
             [1., 0., 0., 0.]]
```
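
The example above can be reproduced with a few lines of NumPy (the helper name is ours):

```python
import numpy as np

def one_hot(x, depth):
    # x: index column of shape [N, 1] -> output of shape [N, depth].
    out = np.zeros((x.shape[0], depth), dtype=np.float32)
    out[np.arange(x.shape[0]), x.ravel()] = 1.0
    return out

x = np.array([[1], [1], [3], [0]])
print(one_hot(x, 4))
```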

Inputs: X : (LoDTensor, LoDTensor) Input variable with rank at least 2. The last dimension of X should be 1. Each value of X is an index to indicate the position. Out : (Tensor, Tensor) Output tensor with same rank as X. The tensor consists of one-hot representations of values in X. depth (Duplicable): A positive integer to specify the length of one-hot vector.dtype (Duplicable): An integer to specify the data type of one-hot vector. The default value is FP32.

## l1_norm

L1 Norm Operator.

Computes the L1 norm of a tensor.

$$Out = \sum_{i} |X_{i}|$$

Inputs: X : (Tensor) The input of l1_norm op. Out : (Scalar) The output of l1_norm op.

## create_random_data_generator

CreateRandomDataGenerator Operator

This Op creates a random reader. Generated data follow a uniform distribution between 'min' and 'max'.

Inputs: Out : (ReaderHolder) The created random reader. shape_concat (Duplicable): The concat of all data's shapes.ranks (Duplicable): The ranks of each data.e.g.shape_concat = [2,3,4,5,6]ranks = [3,2]It means the reader will generate two data each time,whose shapes are [2,3,4] and [5,6] respectively.min (Duplicable): The lower bound of reader's uniform distribution.max (Duplicable): The upper bound of reader's uniform distribution.

## roi_pool

ROIPool operator

ROI Pooling for Faster-RCNN. The link below is a further introduction: https://stackoverflow.com/questions/43430056/what-is-roi-layer-in-fast-rcnn

Inputs: X : (Tensor), the input of ROIPoolOp. The format of input tensor is NCHW. Where N is batch size, C is the number of input channels, H is the height of the feature, and W is the width of the feature.ROIs : (Tensor), ROIs (Regions of Interest) to pool over. should be a 2-D tensor of shape (num_rois, 5)given as [[batch_id, x1, y1, x2, y2], …]. Where batch_id is the id of the data, (x1, y1) is the top left coordinates, and (x2, y2) is the bottom right coordinates. Out : (Tensor), The output of ROIPoolOp is a 4-D tensor with shape (num_rois, channels, pooled_h, pooled_w).Argmax (Intermediate) : (Tensor), Argmaxes corresponding to indices in X used for gradient computation. Only output if arg “is_test” is false. spatial_scale (Duplicable): (float, default 1.0), Multiplicative spatial scale factor to translate ROI coords from their input scale to the scale used when pooling.pooled_height (Duplicable): (int, default 1), The pooled output height.pooled_width (Duplicable): (int, default 1), The pooled output width.

## pow

Pow Activation Operator.

$out = x^{factor}$

Inputs: X : Input of Pow operator Out : Output of Pow operator factor (Duplicable): The exponential factor of Pow

## unpool

Input shape is: $(N, C_{in}, H_{in}, W_{in})$, Output shape is: $(N, C_{out}, H_{out}, W_{out})$, where $$H_{out} = (H_{in} - 1) * strides[0] - 2 * paddings[0] + ksize[0] \\ W_{out} = (W_{in} - 1) * strides[1] - 2 * paddings[1] + ksize[1]$$ Paper: http://www.matthewzeiler.com/wp-content/uploads/2017/07/iccv2011.pdf

Inputs: X : (Tensor) The input tensor of unpool operator. The format of input tensor is NCHW. Where N is batch size, C is the number of channels, H and W is the height and width of feature.Indices : (Tensor) The input tensor of the indices given out by MaxPool2d. The format of input tensor is NCHW. Where N is batch size, C is the number of channels, H and W is the height and width of feature. Out : (Tensor) The output tensor of unpool operator.The format of output tensor is also NCHW.Where N is batch size, C is the number of channels, H and W is the height and width of feature. ksize (Duplicable): (vector), the unpooling window size(height, width) of unpooling operator.strides (Duplicable): (vector, default:{1, 1}), strides (height, width) of unpooling operator.paddings (Duplicable): (vector defalut:{0,0}), paddings (height, width) of unpooling operator.unpooling_type (Duplicable): (string), unpooling type, can be "max" for max-unpooling

## transpose

Transpose Operator.

The input tensor will be permuted according to the axes given. The behavior of this operator is similar to how numpy.transpose works.

• suppose the input X is a 2-D tensor: $$X = \begin{pmatrix} 0 &1 &2 \\ 3 &4 &5 \end{pmatrix}$$

the given axes is: $[1, 0]$, and $Y$ = transpose($X$, axis)

then the output $Y$ is:

$$Y = \begin{pmatrix} 0 &3 \\ 1 &4 \\ 2 &5 \end{pmatrix}$$

• Given an input tensor with shape $(N, C, H, W)$ and axes $[0, 2, 3, 1]$, the shape of the output tensor will be $(N, H, W, C)$.
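
Both examples map directly onto NumPy's transpose:

```python
import numpy as np

x = np.array([[0, 1, 2],
              [3, 4, 5]])
y = np.transpose(x, axes=(1, 0))
print(y.tolist())  # [[0, 3], [1, 4], [2, 5]]

# NCHW -> NHWC
nchw = np.zeros((8, 3, 32, 32))
print(np.transpose(nchw, (0, 2, 3, 1)).shape)  # (8, 32, 32, 3)
```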

Inputs: X : (Tensor) The input tensor, tensors with rank up to 6 are supported. Out : (Tensor)The output tensor. axis (Duplicable): (vector) A list of values, and the size of the list should be the same with the input tensor rank. This operator permutes the input tensor's axes according to the values given.


## lstmp

Long-Short Term Memory with recurrent Projection layer (LSTMP) Operator.

LSTMP has a separate projection layer after the LSTM layer, projecting the original hidden state to a lower-dimensional one, which is proposed to reduce the total number of parameters and, furthermore, the computational complexity of the LSTM, especially for the case that the size of the output units is relatively large (https://research.google.com/pubs/archive/43905.pdf).

The formula is as follows:

$$i_t = \sigma(W_{ix}x_{t} + W_{ir}r_{t-1} + W_{ic}c_{t-1} + b_i) \\ f_t = \sigma(W_{fx}x_{t} + W_{fr}r_{t-1} + W_{fc}c_{t-1} + b_f) \\ \tilde{c_t} = act_g(W_{cx}x_t + W_{cr}r_{t-1} + b_c) \\ o_t = \sigma(W_{ox}x_{t} + W_{or}r_{t-1} + W_{oc}c_t + b_o) \\ c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c_t} \\ h_t = o_t \odot act_h(c_t) \\ r_t = \overline{act_h}(W_{rh}h_t)$$

where the W terms denote weight matrices (e.g. $W_{xi}$ is the matrix of weights from the input gate to the input), $W_{ic}, W_{fc}, W_{oc}$ are diagonal weight matrices for peephole connections. In our implementation, we use vectors to represent these diagonal weight matrices. The b terms denote bias vectors ($b_i$ is the input gate bias vector), $\sigma$ is the activation, such as the logistic sigmoid function, and $i, f, o$ and $c$ are the input gate, forget gate, output gate, and cell activation vectors, respectively, all of which have the same size as the cell output activation vector $h$. Here $h$ is usually called the hidden state and $r$ denotes its recurrent projection. And $\tilde{c_t}$ is also called the candidate hidden state, whose computation is based on the current input and the previous hidden state.

The $\odot$ is the element-wise product of the vectors. $act_g$ and $act_h$ are the cell input and cell output activation functions, and tanh is usually used for them. $\overline{act_h}$ is the activation function for the projection output, usually using identity or the same as $act_h$.

Note that these $W_{xi}x_{t}, W_{xf}x_{t}, W_{xc}x_{t}, W_{xo}x_{t}$ operations on the input $x_{t}$ are NOT included in this operator. Users can choose to use fully-connected operator before LSTMP operator.

Inputs: Input : (LoDTensor) the input for sequence data, which supports variable-time length input sequence. The underlying tensor in this LoDTensor is a matrix with shape (T X 4D), where T is the total time steps in this mini-batch, D is the hidden size.H0 : (Tensor, optional) the initial hidden state is an optional input. This is a tensor with shape (N x D), where N is the batch size and D is the hidden size.C0 : (Tensor, optional) the initial cell state is an optional input. This is a tensor with shape (N x D), where N is the batch size. C0 should not be null if H0 provided.Weight : (Tensor) the learnable hidden-hidden weights. - The shape is (P x 4D), where P is the projection layer size and D is the hidden size. - Weight = {W_cr, W_ir, W_fr, W_or}ProjWeight : (Tensor) the learnable weight of the projection layer. - The shape is (D x P), where P is the recurrent projection layer size and D is the hidden size. - ProjWeight = {W_rh}Bias : (Tensor) the learnable biases, which contains two parts: input-hidden biases and peephole connections weights if setting use_peepholes to True. 1. use_peepholes = False - The shape is (1 x 4D). - Bias = {b_c, b_i, b_f, b_o}.2. use_peepholes = True - The shape is (1 x 7D). - Bias = {b_c, b_i, b_f, b_o, W_ic, W_fc, W_oc}. Projection : (LoDTensor) the projection of the hidden state of LSTMP operator. The shape is (T x P), and LoD is the same with the Input.Cell : (LoDTensor) the cell state of LSTMP operator. The shape is (T x D), and lod is the same with the Input.BatchGate (Intermediate) : (LoDTensor) This LoDTensor contains input gate, forget gate and output gate after the activations. This LoDTensor has the same shape as the reorganized input, which is also be called batch input. The LoD size is 2. 
The first-level LoD is the batch offsets and the second contains the indices, which denote the position of the reorganized sequence in the raw input.BatchCellPreAct (Intermediate) : (LoDTensor) the pre-activation cell state reorganized in batch. This LoDTensor is obtained in the forward pass and used in the backward pass.BatchHidden (Intermediate) : (LoDTensor) the hidden state reorganized in batch. This LoDTensor is obtained in the forward pass and used in the backward pass.OrderedP0 (Intermediate) : (Tensor) the projection of the initial hidden state H0. This is a tensor with shape (N x P), where N is the batch size and P is the hidden size. use_peepholes (Duplicable): (bool, default: True) whether to enable diagonal/peephole connections.is_reverse (Duplicable): (bool, default: False) whether to compute reversed LSTMP.gate_activation (Duplicable): (string, default: sigmoid) The activation for input gate, forget gate and output gate, sigmoid by default.cell_activation (Duplicable): (string, default: tanh) The activation for cell output, tanh by default.candidate_activation (Duplicable): (string, default: tanh) The activation for candidate hidden state, tanh by default.proj_activation (Duplicable): (string, default: tanh) The activation for projection output, tanh by default.

## target_assign

Given the encoded boxes between prior boxes and ground-truth boxes, together with the ground-truth class labels, this operator assigns classification and regression targets to each prior box, as well as a weight for each prior box. The weights are used to specify which prior boxes do not contribute to the training loss.

For each instance, the outputs PredBBoxLabel, PredBBoxWeight, PredScoreLabel and PredScoreWeight are assigned based on MatchIndices. Assume that the row offset for each instance in EncodedGTBBox is called lod; this operator assigns classification/regression targets by performing the following steps:

1. Assigning all outputs based on MatchIndices:

If id = MatchIndices[i][j] > 0,

PredBBoxLabel[i][j] = EncodedGTBBox[lod[i] + id][j]
PredBBoxWeight[i][j] = 1.
PredScoreLabel[i][j] = GTScoreLabel[lod[i] + id]
PredScoreWeight[i][j] = 1.


Otherwise,

PredBBoxLabel[i][j] = [0., 0., 0., 0.]
PredBBoxWeight[i][j] = 0.
PredScoreLabel[i][j] = background_label
PredScoreWeight[i][j] = 0.

2. Assigning PredScoreWeight based on NegIndices:

Assume that the row offset for each instance in NegIndices is called neg_lod; for the i-th instance and all ids of NegIndices in this instance:

PredScoreLabel[i][id] = background_label
PredScoreWeight[i][id] = 1.0

Inputs: EncodedGTBBox : (LoDTensor), The encoded ground-truth bounding boxes with shape [Ng, Np, 4], where Ng is the total number of ground-truth boxes in this mini-batch, Np is the number of predictions, and 4 is the number of coordinates in [xmin, ymin, xmax, ymax] layout.GTScoreLabel : (LoDTensor, default LoDTensor), The input ground-truth labels with shape [Ng, 1], where Ng is the same as in the input EncodedGTBBox.MatchIndices : (Tensor, default Tensor), The input matched indices with shape [N, Np], where N is the batch size and Np is the same as in the input EncodedGTBBox. If MatchIndices[i][j] is -1, the j-th prior box is not matched to any ground-truth box in the i-th instance.NegIndices : (LoDTensor, default LoDTensor), The input negative example indices with shape [Neg, 1], where Neg is the total number of negative example indices. PredBBoxLabel : (Tensor), The output encoded ground-truth labels with shape [N, Np, 4], where N is the batch size and Np, 4 are the same as in the input EncodedGTBBox. If MatchIndices[i][j] is -1, PredBBoxLabel[i][j][:] is the encoded ground-truth box for background_label in the i-th instance.PredBBoxWeight : (Tensor), The weight for PredBBoxLabel with the shape of [N, Np, 1].PredScoreLabel : (Tensor, default Tensor), The output score labels for each prediction with shape [N, Np, 1]. If MatchIndices[i][j] is -1, PredScoreLabel[i][j] = background_label.PredScoreWeight : (Tensor), The weight for PredScoreLabel with the shape of [N, Np, 1]. background_label (Duplicable): (int, default 0), Label index of the background class.

## mean

Mean Operator.

Out is a scalar which is the mean of all elements in X.

Inputs: X : The input of mean op Out : The output of mean op

## precision_recall

Precision Recall Operator.

When given Input(Indices) and Input(Labels), this operator can be used to compute various metrics including:

1. macro average precision
2. macro average recall
3. macro f1 score
4. micro average precision
5. micro average recall
6. micro f1 score

To compute the above metrics, we need to do statistics for true positives, false positives and false negatives. Here the count of true negatives is not necessary, but counting it may provide potential usage and the cost is trivial, so the operator also provides the count of true negatives.

We define state as a 2-D tensor with shape [class_number, 4]. Each row of a state contains statistic variables for corresponding class. Layout of each row is: TP(true positives), FP(false positives), TN(true negatives), FN(false negatives). If Input(Weights) is provided, TP, FP, TN, FN will be calculated by given weight instead of the instance count.

This operator also supports metrics computing for cross-batch situation. To achieve this, Input(StatesInfo) should be provided. State of current batch data will be accumulated to Input(StatesInfo) and Output(AccumStatesInfo) is the accumulation state.

Output(BatchMetrics) is metrics of current batch data while Output(AccumStatesInfo) is metrics of accumulation data.
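The metric computation from a [class_number, 4] state tensor can be sketched in NumPy; the helper name `metrics_from_state` is illustrative, not part of the operator API:

```python
import numpy as np

def metrics_from_state(state):
    """Compute the six BatchMetrics values from a [class_number, 4] state
    tensor whose rows are [TP, FP, TN, FN] per class."""
    tp, fp, tn, fn = state[:, 0], state[:, 1], state[:, 2], state[:, 3]

    def safe_div(a, b):
        # Return 0 where the denominator is 0, as the metrics are undefined there.
        return np.where(b > 0, a / np.maximum(b, 1e-12), 0.0)

    # Per-class precision/recall/f1, then macro-average over classes.
    prec = safe_div(tp, tp + fp)
    rec = safe_div(tp, tp + fn)
    f1 = safe_div(2 * prec * rec, prec + rec)
    macro_p, macro_r, macro_f1 = prec.mean(), rec.mean(), f1.mean()

    # Micro metrics pool the raw counts over all classes first.
    tp_s, fp_s, fn_s = tp.sum(), fp.sum(), fn.sum()
    micro_p = tp_s / max(tp_s + fp_s, 1e-12)
    micro_r = tp_s / max(tp_s + fn_s, 1e-12)
    micro_f1 = 2 * micro_p * micro_r / max(micro_p + micro_r, 1e-12)
    return np.array([macro_p, macro_r, macro_f1, micro_p, micro_r, micro_f1])

state = np.array([[2.0, 1.0, 5.0, 0.0],   # class 0: TP FP TN FN
                  [1.0, 0.0, 6.0, 1.0]])  # class 1
batch_metrics = metrics_from_state(state)
```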

Inputs: MaxProbs : (Tensor, default Tensor) A 2-D tensor with shape N x 1, where N is the batch size. Each row contains the max probability of an instance which computed by the previous top_k (k=1) operator.Indices : (Tensor, default Tensor) A 2-D tensor with shape N x 1, where N is the batch size. Each row contains the corresponding index which computed by the previous top_k (k=1) operator.Labels : (Tensor, default Tensor) A 2-D tensor with shape N x 1, where N is the batch size. Each element is a label and the value should be in [0, class_number - 1].Weights : (Tensor, default Tensor) A 2-D tensor with shape N x 1, where N is the batch size. This input is optional. If provided, weight of instance would be considered when computing metrics.StatesInfo : (Tensor, default Tensor) A 2-D tensor with shape D x 4, where D is the number of classes. This input is optional. If provided, current state will be accumulated to this state and the accumulation state will be the output state. BatchMetrics : (Tensor, default Tensor) A 1-D tensor with shape {6}. This output tensor contains metrics for current batch data. The layout is [macro average precision, macro average recall, macro f1 score, micro average precision, micro average recall, micro f1 score].AccumMetrics : (Tensor, default Tensor) A 1-D tensor with shape {6}. This output tensor contains metrics for accumulated data. The layout is [macro average precision, macro average recall, macro f1 score, micro average precision, micro average recall, micro f1 score].AccumStatesInfo : (Tensor, default Tensor) A 2-D tensor with shape D x 4, where D is equal to class number. This output tensor contains accumulated state variables used to compute metrics. The layout for each class is [true positives, false positives, true negatives, false negatives]. class_number (Duplicable): (int) Number of classes to be evaluated.

## softplus

Softplus Activation Operator.

$out = ln(1 + e^{x})$

Inputs: X : Input of Softplus operator Out : Output of Softplus operator

## get_places

Returns a list of places based on flags. The list will be used for parallel execution.

Inputs: Out : vector of Place device_count (Duplicable): device countdevice_type (Duplicable): device type

## read_from_array

Read a LoDTensor from a LoDTensor array.

Assume $T$ is LoDTensor, $i$ is the subscript of the array, and $A$ is the array. The equation is

$$T = A[i]$$

Inputs: X : (TensorArray) the array to be read from.I : (Tensor) the subscript index in the tensor array. The number of elements should be 1. Out : (LoDTensor) the tensor read from the array.

## rnn_memory_helper

Inputs: X : Out : dtype (Duplicable): (int, default 5 (FP32)) Output data type

## shrink_rnn_memory

This operator is used to shrink the output batch of a memory defined in dynamic RNN.

Dynamic RNN is able to handle variable-length sequences, in which, sequences in a mini-batch are sorted by their lengths first. After that, the longest sequence becomes the first one in the sorted batch, followed by the second longest, the third longest, and so on. Dynamic RNN then slices a batch input timestep by timestep from the sorted input. Once any sequence in the input batch reaches its end, memory defined in dynamic RNN has to shrink its outputs to adapt to the input batch size for the next time step.

Inputs: X : (LoDTensor) The RNN step memory to be shrunk.RankTable : (LoDRankTable) The lod_rank_table of dynamic RNN.I : (LoDTensor) The step index. The RNN step memory 'X' will be shrunk to match the size of the input of the index'th step. Out : (LoDTensor) The shrunk RNN step memory.

## merge_lod_tensor

Merge the True and False branches of a LoDTensor into a single output, with a mask at a certain LoD level. X is used to obtain complete LoD information. Please refer to SplitLoDTensorOp.

Inputs: X : The input LoDTensor, contains complete lod information to construct the outputMask : A bool column vector which mask the inputInTrue : The True branch to be mergedInFalse : The False branch to be merged Out : The merged output LoDTensor level (Duplicable): (int) the specific lod level to rank.

## reshape

Reshape Operator.

Reshape Input(X) into the shape specified by Attr(shape).

An example: Given a 2-D tensor X with 2 rows and 2 columns : [[1, 2], [3, 4]]

and target shape = [1, 4], the reshape operator will transform the tensor X into a 2-D tensor: [[1, 2, 3, 4]]

One dimension in the target shape can be set to -1, indicating that its size is unknown. In this case, the real dimension will be inferred from the original shape of Input(X) and the other dimensions in the target shape.
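The -1 inference rule can be demonstrated with NumPy's equivalent reshape, which follows the same convention:

```python
import numpy as np

# The unknown dimension is the total element count divided by the product
# of the known target dimensions.
x = np.array([[1, 2], [3, 4]])
out = x.reshape(1, -1)   # target shape [1, 4]; -1 is inferred as 4
```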

Inputs: X : The input tensor of reshape operator. Out : The output tensor of reshape operator. shape (Duplicable): (vector) Target shape of reshape operator.

## sigmoid_cross_entropy_with_logits

SigmoidCrossEntropyWithLogits Operator.

This measures the element-wise probability error in classification tasks in which each class is independent. This can be thought of as predicting labels for a data-point, where labels are not mutually exclusive. For example, a news article can be about politics, technology or sports at the same time or none of these.

The logistic loss is given as follows:

   $$loss = -Labels * \log(\sigma(X)) - (1 - Labels) * \log(1 - \sigma(X))$$


We know that $$\sigma(X) = (1 / (1 + \exp(-X)))$$. By substituting this we get:

   $$loss = X - X * Labels + \log(1 + \exp(-X))$$


For stability and to prevent overflow of $$\exp(-X)$$ when X < 0, we reformulate the loss as follows:

   $$loss = \max(X, 0) - X * Labels + \log(1 + \exp(-|X|))$$


Both the input X and Labels can carry the LoD (Level of Details) information. However the output only shares the LoD with input X.
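A minimal NumPy sketch of the numerically stable form, checked against the direct definition (the function name is illustrative):

```python
import numpy as np

def sigmoid_ce_with_logits(x, labels):
    """Numerically stable elementwise logistic loss:
    max(x, 0) - x * labels + log(1 + exp(-|x|))."""
    return np.maximum(x, 0) - x * labels + np.log1p(np.exp(-np.abs(x)))

x = np.array([[2.0, -1.5]])
labels = np.array([[1.0, 0.0]])
loss = sigmoid_ce_with_logits(x, labels)
```

For large negative x, the naive form `log(1 + exp(-x))` overflows, while `log1p(exp(-|x|))` never does.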

Inputs: X : (Tensor, default Tensor), a 2-D tensor with shape N x D, where N is the batch size and D is the number of classes. This input is a tensor of logits computed by the previous operator. Logits are unscaled log probabilities given as log(p/(1-p)).Label : (Tensor, default Tensor), a 2-D tensor of the same type and shape as X. This input is a tensor of probabilistic labels for each logit Out : (Tensor, default Tensor), a 2-D tensor with shape N x D of elementwise logistic losses.

## fill

Fill operator

Fill a tensor with the given value and shape. The data type of the tensor is specified by dtype.

Inputs: Out : (LoDTensor) The output tensor. value (Duplicable): The float values of the tensor, which are flattened in row-major order.shape (Duplicable): The shape of the output tensor.dtype (Duplicable): The data type of the output tensor. Default is float.force_cpu (Duplicable): Whether the output tensor must reside in CPU memory or not. Default is false.

## sequence_reshape

Sequence Reshape Operator.

This operator rearranges the input sequences. The new dimension is set by the attribute, and the length of each sequence may become longer or shorter, which is determined by the original length, the original dimension and the new dimension. The following example will help to illustrate the function of this operator:

x is a LoDTensor:

x.lod  = [[0, 2, 6]]
x.data = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]]
x.dims = [6, 2]

set new_dim = 4

then out is a LoDTensor:

out.lod  = [[0, 1, 3]]
out.data = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
out.dims = [3, 4]

Currently, only 1-level LoDTensor is supported. Please make sure that (original length * original dimension) is divisible by new_dim for each sequence.
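The example above can be reproduced with a short NumPy sketch that reshapes each sequence independently and rebuilds the LoD (the function name is illustrative):

```python
import numpy as np

def sequence_reshape(x, lod, new_dim):
    """Reshape each sequence so its rows have width new_dim, and
    recompute the level-0 LoD offsets accordingly."""
    out_rows, out_lod = [], [0]
    for start, end in zip(lod[:-1], lod[1:]):
        total = (end - start) * x.shape[1]
        assert total % new_dim == 0, "sequence size must divide by new_dim"
        seq = x[start:end].reshape(-1, new_dim)
        out_rows.append(seq)
        out_lod.append(out_lod[-1] + seq.shape[0])
    return np.concatenate(out_rows), out_lod

x = np.arange(1, 13).reshape(6, 2)      # the example from the text
out, out_lod = sequence_reshape(x, [0, 2, 6], new_dim=4)
```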

Inputs: X : (LoDTensor, default LoDTensor) A 2-D LoDTensor with shape being [N, M]. Out : (LoDTensor, default LoDTensor) A 2-D LoDTensor with shape [T, new_dim] where T is calculated based on X.lod, M and new_dim. new_dim (Duplicable): Sequence dimension of the output LoDTensor.

## huber_loss

HuberLoss Operator.

Huber loss is a loss function used in robust regression. We define X as the input value and Y as the target value. Huber loss can evaluate the fitness of X to Y. Different from MSE loss, Huber loss is more robust to outliers. The shapes of X and Y are both [batch_size, 1]. The equation is:

$$Out_{\delta}(X, Y)_i = \begin{cases} 0.5 * (Y_i - X_i)^2, \quad |Y_i - X_i| \leq \delta \\ \delta * (|Y_i - X_i| - 0.5 * \delta), \quad otherwise \end{cases}$$

In the above equation, $Out_{\delta}(X, Y)_i$, $X_i$ and $Y_i$ represent the $i$-th elements of Out, X and Y, respectively.
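The two branches of the equation translate directly into NumPy (the function name is illustrative):

```python
import numpy as np

def huber_loss(x, y, delta):
    """Elementwise Huber loss between prediction x and target y."""
    r = np.abs(y - x)                      # residual
    quad = 0.5 * r ** 2                    # branch for |r| <= delta
    lin = delta * (r - 0.5 * delta)        # branch otherwise
    return np.where(r <= delta, quad, lin)

x = np.array([[0.0], [0.0]])
y = np.array([[0.5], [3.0]])
out = huber_loss(x, y, delta=1.0)
```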

Inputs: X : The input value of huber loss op.X is a 2-D tensor with shape [batch_size, 1].Y : The target value of huber loss op.Y is a 2-D tensor with shape [batch_size, 1]. Residual (Intermediate) : Intermediate tensor to cache residual value between Y and X.The shape is same as Input(X) and will be reused in backward.Out : The output tensor with shape [batch_size, 1] which represents the huber loss. delta (Duplicable): Hyper parameter in huber loss.

## sequence_softmax

Sequence Softmax Operator.

SequenceSoftmaxOp computes the softmax activation among all time-steps for each sequence. The dimension of each time-step should be 1. Thus, the shape of input Tensor can be either [N, 1] or [N], where N is the sum of the length of all sequences.

The algorithm works as follows:

for i-th sequence in a mini-batch:


$$Out(X[lod[i]:lod[i+1]], :) = \frac{\exp(X[lod[i]:lod[i+1], :])}{\sum(\exp(X[lod[i]:lod[i+1], :]))}$$

For example, for a mini-batch of 3 sequences with variable-length, each containing 2, 3, 2 time-steps, the lod of which is [0, 2, 5, 7], then softmax will be computed among X[0:2, :], X[2:5, :], X[5:7, :] and N turns out to be 7.
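The per-sequence computation can be sketched with NumPy using the lod offsets directly (the function name is illustrative; the max-subtraction is a standard stability trick, not stated in the text):

```python
import numpy as np

def sequence_softmax(x, lod):
    """Softmax computed independently over each [lod[i], lod[i+1]) slice."""
    out = np.empty_like(x, dtype=float)
    for start, end in zip(lod[:-1], lod[1:]):
        e = np.exp(x[start:end] - x[start:end].max())  # shift for stability
        out[start:end] = e / e.sum()
    return out

x = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 0.5, 0.5])
out = sequence_softmax(x, [0, 2, 5, 7])   # 3 sequences, lengths 2, 3, 2
```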

Inputs: X : (LoDTensor) 1-D or 2-D input LoDTensor with the 2-nd dimension of length 1. Out : (LoDTensor) 1-D or 2-D output LoDTensor with the 2-nd dimension of length 1.

## multiclass_nms

This operator performs multi-class non maximum suppression (NMS) on a batch of boxes and scores.

In the NMS step, this operator greedily selects a subset of detection bounding boxes whose scores are larger than score_threshold, if that threshold is provided, and then keeps the nms_top_k boxes with the largest confidence scores if nms_top_k is larger than -1. After that, it prunes away boxes that have high IOU (intersection over union) overlap with already selected boxes, using adaptive-threshold NMS based on the nms_threshold and nms_eta parameters.

After the NMS step, at most keep_top_k bboxes in total are kept per image if keep_top_k is larger than -1.

This operator supports multi-class and batched inputs. It applies NMS independently to each class. The output is a 2-D LoDTensor; for each image, the offsets in the first dimension of the LoDTensor are called the LoD, and the number of offsets is N + 1, where N is the batch size. If LoD[i + 1] - LoD[i] == 0, there is no detected bbox for that image. If no boxes are detected for any image, all elements in the LoD are 0, and Out contains only one value, which is -1.

Inputs: BBoxes : (Tensor) A 2-D Tensor with shape [M, 4] represents the predicted locations of M bounding bboxes. Each bounding box has four coordinate values and the layout is [xmin, ymin, xmax, ymax].Scores : (Tensor) A 3-D Tensor with shape [N, C, M] represents the predicted confidence predictions. N is the batch size, C is the class number, M is the number of bounding boxes. For each category there are in total M scores which correspond to the M bounding boxes. Please note, M is equal to the 1st dimension of BBoxes. Out : (LoDTensor) A 2-D LoDTensor with shape [No, 6] represents the detections. Each row has 6 values: [label, confidence, xmin, ymin, xmax, ymax], No is the total number of detections in this mini-batch. For each instance, the offsets in first dimension are called LoD, the number of offset is N + 1, if LoD[i + 1] - LoD[i] == 0, means there is no detected bbox. background_label (Duplicable): (int64_t, default: 0) The index of background label, the background label will be ignored. If set to -1, then all categories will be considered.score_threshold (Duplicable): (float) Threshold to filter out bounding boxes with low confidence score. If not provided, consider all boxes.nms_top_k (Duplicable): (int64_t) Maximum number of detections to be kept according to the confidences after filtering detections based on score_threshold.nms_threshold (Duplicable): (float, default: 0.3) The threshold to be used in NMS.nms_eta (Duplicable): (float) The parameter for adaptive NMS.keep_top_k (Duplicable): (int64_t) Number of total bboxes to be kept per image after NMS step. -1 means keeping all bboxes after NMS step.

## sequence_erase

Sequence Erase Operator.

Sequence erase operator erases tokens specified by Attr(tokens) from the input sequences Input(X), and outputs the remaining data and modifies the LoD information at the same time. For example, given a 2-D LoDTensor

X = [[2, 2, 6, 1, 3, 9, 6, 1, 0, 1]]^T


with lod = [[0, 3, 6, 10]], there are three sequences in the input:

 X1 = [[2, 2, 6]]^T, X2 = [[1, 3, 9]]^T and X3 = [[6, 1, 0, 1]]^T.


If the tokens to be erased are Attr(tokens) = [2, 3, 5], after the erasing operation, the three sequences become

X1' = [[6]]^T, X2' = [[1, 9]]^T and X3' = [[6, 1, 0, 1]]^T.


Hence the LoDTensor Output(Out) should be

Out = [[6, 1, 9, 6, 1, 0, 1]]^T,


with lod = [[0, 1, 3, 7]].

An example usage for this operator is to remove the special tokens when computing the edit distance between two strings, such as blank, start token, and end token.
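The erase-and-rebuild-LoD behavior in the example above can be sketched in plain Python (the function name is illustrative):

```python
def sequence_erase(x, lod, tokens):
    """Drop the listed tokens from each sequence and rebuild the lod."""
    banned = set(tokens)
    out, new_lod = [], [0]
    for start, end in zip(lod[:-1], lod[1:]):
        kept = [t for t in x[start:end] if t not in banned]
        out.extend(kept)
        new_lod.append(new_lod[-1] + len(kept))
    return out, new_lod

# The example from the text: three sequences, erase tokens 2, 3 and 5.
out, new_lod = sequence_erase([2, 2, 6, 1, 3, 9, 6, 1, 0, 1],
                              [0, 3, 6, 10], tokens=[2, 3, 5])
```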

Inputs: X : (2-D LoDTensor with the 2nd dim. equal to 1) Input LoDTensor of SequenceEraseOp. Out : (2-D LoDTensor with the 2nd dim. equal to 1) Output LoDTensor of SequenceEraseOp. tokens (Duplicable): (vector) Tokens need to be erased from input sequences.

## scale

Scale operator

$$Out = scale*X$$

Inputs: X : (Tensor) Input tensor of scale operator. Out : (Tensor) Output tensor of scale operator. scale (Duplicable): (float, default 1.0)The scaling factor of the scale operator.

## lookup_table

Lookup Table Operator.

This operator is used to perform lookups on the parameter W; the looked-up results are then concatenated into a dense tensor.

The input Ids can carry the LoD (Level of Details) information, or not. And the output only shares the LoD information with input Ids.
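The core of the lookup is plain row indexing into W, which a NumPy sketch makes concrete:

```python
import numpy as np

# W holds one embedding row per id; the lookup gathers rows in id order.
W = np.array([[0.1, 0.2],
              [0.3, 0.4],
              [0.5, 0.6]])        # 3 embeddings of size 2
ids = np.array([[2], [0], [2]])   # column vector of ids, rank 2
out = W[ids.flatten()]            # shape (3, 2), same dtype as W
```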

Inputs: W : An input represents embedding tensors, which is a learnable parameter.Ids : An input with type int32 or int64 contains the ids to be looked up in W. Ids must be a column vector with rank = 2. The 2nd dimension size must be 1. Out : The lookup results, which have the same type as W. is_sparse (Duplicable): (boolean, default false) Sparse updatepadding_idx (Duplicable): (int64, default -1) If the value is -1, it makes no effect to lookup. Otherwise the given value indicates padding the output with zeros whenever lookup encounters it in Ids.

## lod_tensor_to_array

Inputs: X : RankTable : Out :

## logical_not

logical_not Operator

It operates element-wise on X, and returns the Out. X and Out are N-dim boolean tensors. Each element of Out is calculated by $$Out = !X$$

Inputs: X : (LoDTensor) Operand of logical_not operator Out : (LoDTensor) n-dim bool tensor. Each element is $$Out = !X$$

## logical_and

logical_and Operator

It operates element-wise on X and Y, and returns the Out. X, Y and Out are N-dim boolean tensors. Each element of Out is calculated by $$Out = X \&\& Y$$

Inputs: X : (LoDTensor) Left hand operand of logical_and operatorY : (LoDTensor) Right hand operand of logical_and operator Out : (LoDTensor) n-dim bool tensor. Each element is $$Out = X \&\& Y$$

## logical_or

logical_or Operator

It operates element-wise on X and Y, and returns the Out. X, Y and Out are N-dim boolean tensors. Each element of Out is calculated by $$Out = X || Y$$

Inputs: X : (LoDTensor) Left hand operand of logical_or operatorY : (LoDTensor) Right hand operand of logical_or operator Out : (LoDTensor) n-dim bool tensor. Each element is $$Out = X || Y$$

## logical_xor

logical_xor Operator

It operates element-wise on X and Y, and returns the Out. X, Y and Out are N-dim boolean tensors. Each element of Out is calculated by $$Out = (X || Y) \, \&\& \, !(X \&\& Y)$$

Inputs: X : (LoDTensor) Left hand operand of logical_xor operatorY : (LoDTensor) Right hand operand of logical_xor operator Out : (LoDTensor) n-dim bool tensor. Each element is $$Out = (X || Y) \, \&\& \, !(X \&\& Y)$$

## log_loss

LogLoss Operator.

Log loss is a loss function used for binary classification. Log Loss quantifies the accuracy of a classifier by penalising false classifications. Minimising the Log Loss is equivalent to maximising the accuracy of the classifier. We define Predicted as the values predicted by our model and Labels as the target ground truth value. Log loss can evaluate how close the predicted values are to the target. The shapes of Predicted and Labels are both [batch_size, 1]. The equation is:

$$Loss = - Labels * log(Predicted + \epsilon) - (1 - Labels) * log(1 - Predicted + \epsilon)$$
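The equation maps directly onto NumPy (the function name is illustrative; epsilon guards both logarithms against log(0)):

```python
import numpy as np

def log_loss(predicted, labels, epsilon=1e-7):
    """Binary log loss with epsilon added inside both logs for stability."""
    return (-labels * np.log(predicted + epsilon)
            - (1 - labels) * np.log(1 - predicted + epsilon))

pred = np.array([[0.9], [0.2]])
labels = np.array([[1.0], [0.0]])
loss = log_loss(pred, labels)
```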

Inputs: Predicted : The input value (Predicted) of Log loss op.Predicted is a 2-D tensor with shape [batch_size, 1].Labels : The target value (Labels) of Log loss op.Labels is a 2-D tensor with shape [batch_size, 1]. Loss : The output tensor with shape [batch_size, 1] which represents the log loss. epsilon (Duplicable): Epsilon in log loss.

## sqrt

Sqrt Activation Operator.

$out = \sqrt{x}$

Inputs: X : Input of Sqrt operator Out : Output of Sqrt operator

## lod_reset

LoDReset operator

Reset LoD of Input(X) into a new one specified by Input(TargetLoD) or Attr(target_lod), or set LoD for Input(X) if it doesn't have one. Currently the lod_reset operator only supports the reset of level 0 LoD. At least one of Input(TargetLoD) and Attr(target_lod) must be set, and if both of them are set, Input(TargetLoD) will be chosen as the target LoD.

An example: Given a float LoDTensor X with shape (6, 1), its transpose form represents

[1.0, 2.0, 3.0, 4.0, 5.0, 6.0],


with LoD = [[0, 2, 5, 6]] and the three (transposed) sequences look like

[1.0, 2.0], [3.0, 4.0, 5.0], [6.0].


If the target LoD = [0, 4, 6], the lod_reset operator will reset the LoD, and the sequences contained in the LoDTensor Output(Out) become:

[1.0, 2.0, 3.0, 4.0], [5.0, 6.0].

Inputs: X : (LoDTensor) The input tensor of lod_reset operator.TargetLoD : (Tensor, optional) The target level 0 LoD from Input(). Out : (LoDTensor) The output tensor of lod_reset operator. target_lod (Duplicable): The target level 0 LoD from Attr().

## write_to_array

WriteToArray Operator.

This operator writes a LoDTensor to a LoDTensor array.

Assume $T$ is LoDTensor, $i$ is the subscript of the array, and $A$ is the array. The equation is

$$A[i] = T$$

Inputs: X : (LoDTensor) the tensor will be written to tensor arrayI : (Tensor) the subscript index in tensor array. The number of element should be 1 Out : (TensorArray) the tensor array will be written

## lod_array_length

LoDArrayLength Operator.

This operator obtains the length of lod tensor array:

$$Out = len(X)$$

NOTE: The output is a CPU Tensor since the control variable should be kept on the CPU, and the length of the LoDTensorArray is used as a control variable.

Inputs: X : (LoDTensorArray) The input tensor array. Out : (Tensor) 1x1 CPU Tensor of length, int64_t

## edit_distance

EditDistance operator computes the edit distances between a batch of hypothesis strings and their references.

Edit distance, also called Levenshtein distance, measures how dissimilar two strings are by counting the minimum number of operations needed to transform one string into another. Here the operations include insertion, deletion, and substitution. For example, given hypothesis string A = "kitten" and reference B = "sitting", the edit distance is 3, since transforming A into B requires at least two substitutions and one insertion:

"kitten" -> "sitten" -> "sittin" -> "sitting"

Input(Hyps) is a LoDTensor consisting of all the hypothesis strings with the total number denoted by batch_size, and the separation is specified by the LoD information. And the batch_size reference strings are arranged in order in the same way in the LoDTensor Input(Refs).

Output(Out) contains the batch_size results, each of which stands for the edit distance of a pair of strings. If Attr(normalized) is true, the edit distance will be divided by the length of the reference string.
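The distance for one hypothesis/reference pair is the classic Levenshtein dynamic program, sketched here in plain Python (the function name is illustrative):

```python
def edit_distance(hyp, ref, normalized=False):
    """Levenshtein distance between two token sequences via DP."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # i deletions from hyp
    for j in range(n + 1):
        d[0][j] = j                      # j insertions into hyp
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    dist = d[m][n]
    return dist / n if normalized else dist

dist = edit_distance("kitten", "sitting")
```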

Inputs: Hyps : (2-D LoDTensor, 2nd dim. equal to 1) The indices for hypothesis strings.Refs : (2-D LoDTensor, 2nd dim. equal to 1) The indices for reference strings. SequenceNum : The sequence count of current batchOut : (2-D Tensor with shape [batch_size x 1]) The output edit distances of EditDistance operator. normalized (Duplicable): (bool, default false) Indicated whether to normalize the edit distance by the length of reference string.

## layer_norm

Layer Normalization.

Layer Norm has been implemented as discussed in the paper: https://arxiv.org/abs/1607.06450 ...
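The normalization can be sketched in NumPy: X is viewed as a matrix [N, H] split at begin_norm_axis, each row is standardized, and Scale/Bias are applied elementwise (the function name is illustrative):

```python
import numpy as np

def layer_norm(x, scale, bias, begin_norm_axis=1, epsilon=1e-5):
    """Normalize over the trailing axes; x is viewed as a matrix [N, H]."""
    n = int(np.prod(x.shape[:begin_norm_axis]))
    mat = x.reshape(n, -1)
    mean = mat.mean(axis=1, keepdims=True)
    var = mat.var(axis=1, keepdims=True)
    y = (mat - mean) / np.sqrt(var + epsilon)      # standardize each row
    return (y * scale + bias).reshape(x.shape), mean.ravel(), var.ravel()

x = np.array([[1.0, 2.0, 3.0], [2.0, 2.0, 2.0]])
y, mean, var = layer_norm(x, scale=np.ones(3), bias=np.zeros(3))
```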

Inputs: X : (LoDTensor) The input tensor.Scale : (Tensor, optional) Scale is a 1-dimensional tensor of size H(begin_norm_axis splits the tensor(X) to a matrix [N,H]).It is applied to the output.Bias : (Tensor, optional) Bias is a 1-dimensional tensor of size H(begin_norm_axis splits the tensor(X) to a matrix [N,H]).It is applied to the output. Y : (LoDTensor) Result after normalization.Mean (Intermediate) : (Tensor) Mean of the current mini batch.Variance (Intermediate) : (Tensor) Variance of the current mini batch. epsilon (Duplicable): (float, default 1e-5) Constant for numerical stabilitybegin_norm_axis (Duplicable): (int default:1), the axis of begin_norm_axis ... Rank(X) - 1 will be normalized. begin_norm_axis splits the tensor(X) to a matrix [N,H].

## gaussian_random

GaussianRandom Operator.

Used to initialize tensors with gaussian random generator.

Inputs: Out : Output matrix of gaussian random op shape (Duplicable): (vector) The dimension of random tensor.mean (Duplicable): (float, default 0.0) mean of random tensor.std (Duplicable): (float, default 1.0) std of random tensor.seed (Duplicable): (int, default 0) Random seed of generator.0 means use system wide seed.dtype (Duplicable): (int, default 5(FP32)) Output data type.

## lrn

Local Response Normalization Operator.

This operator comes from the paper "ImageNet Classification with Deep Convolutional Neural Networks".

The original formula is:

$$Output(i, x, y) = Input(i, x, y) / \left( k + \alpha \sum\limits^{\min(C, c + n/2)}_{j = \max(0, c - n/2)} (Input(j, x, y))^2 \right)^{\beta}$$

Function implementation:

Inputs and outputs are in NCHW format, while input.shape.ndims() equals 4. Dimensions 0 ~ 3 represent batch size, feature maps, rows, and columns, respectively.

Input and Output in the formula above is for each map(i) of one image, and Input(i, x, y), Output(i, x, y) represents an element in an image.

C is the number of feature maps of one image. n is a hyper-parameter configured when operator is initialized. The sum in the denominator is the sum of the same positions in the neighboring maps.
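A straightforward NumPy sketch of the formula over an NCHW tensor (the function name is illustrative; the loop over channels mirrors the per-map sum in the denominator):

```python
import numpy as np

def lrn(x, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Local response normalization across channels of an NCHW tensor."""
    N, C, H, W = x.shape
    out = np.empty_like(x)
    for c in range(C):
        # Window of channels [max(0, c - n/2), min(C, c + n/2)] inclusive.
        lo, hi = max(0, c - n // 2), min(C, c + n // 2 + 1)
        denom = (k + alpha * (x[:, lo:hi] ** 2).sum(axis=1)) ** beta
        out[:, c] = x[:, c] / denom
    return out

x = np.ones((1, 3, 2, 2))
out = lrn(x)
```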

Inputs: X : (Tensor) The input of LRN operator. It must be a 4D tensor with NCHW format. Out : (Tensor) The output of LRN operator, which is also a 4D tensor with NCHW format.MidOut : (Tensor) Middle result of LRN operator. It's computed in forward process and also used in backward process. n (Duplicable): (int, default 5) n is the "adjacent" kernel that maps at the same spatial position.k (Duplicable): (float, default 2.0) k is the bias.alpha (Duplicable): (float, default 0.0001) alpha is the scale number.beta (Duplicable): (float, default 0.75) beta is the power number.

## bilinear_tensor_product

Bilinear Tensor Product operator. Given inputs X and Y, a 3-D tensor Weight, and a Bias, each column of the Output is computed by one slice $i = 1, \ldots, k$ of the tensor:

$$M = (X W_i) * Y \\ Out_i = \sum_j {M_j} + Bias_i$$

Where $W_i$ is the $i$-th slice of Input(Weight); $M_j$ is the $j$-th column of $M$; $Out_i$ is the $i$-th column of Output(Out); and $Bias_i$ is a column vector, each element of which is equal to the $i$-th element of $Bias$.

Inputs: X : The first input of bilinear_tensor_product operator.Y : The second input of bilinear_tensor_product operator.Weight : The learnable parameters of bilinear_tensor_product operator.Bias : The learnable bias of bilinear_tensor_product operator. Out : The output of bilinear_tensor_product operator.

## iou_similarity

IOU Similarity Operator. Computes intersection-over-union (IOU) between two box lists. Box list 'X' should be a LoDTensor and 'Y' is a common Tensor, boxes in 'Y' are shared by all instance of the batched inputs of X. Given two boxes A and B, the calculation of IOU is as follows:

$$IOU(A, B) = \frac{area(A\cap B)}{area(A)+area(B)-area(A\cap B)}$$
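The pairwise computation over an [N, 4] and an [M, 4] box list broadcasts to an [N, M] output, as this NumPy sketch shows (the function name is illustrative):

```python
import numpy as np

def iou_similarity(x, y):
    """Pairwise IOU between boxes in x ([N, 4]) and y ([M, 4])."""
    # Intersection corners, broadcast to shape [N, M].
    xmin = np.maximum(x[:, None, 0], y[None, :, 0])
    ymin = np.maximum(x[:, None, 1], y[None, :, 1])
    xmax = np.minimum(x[:, None, 2], y[None, :, 2])
    ymax = np.minimum(x[:, None, 3], y[None, :, 3])
    inter = np.clip(xmax - xmin, 0, None) * np.clip(ymax - ymin, 0, None)
    area_x = (x[:, 2] - x[:, 0]) * (x[:, 3] - x[:, 1])
    area_y = (y[:, 2] - y[:, 0]) * (y[:, 3] - y[:, 1])
    return inter / (area_x[:, None] + area_y[None, :] - inter)

x = np.array([[0.0, 0.0, 2.0, 2.0]])
y = np.array([[1.0, 1.0, 3.0, 3.0], [0.0, 0.0, 2.0, 2.0]])
iou = iou_similarity(x, y)
```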

Inputs: X : (LoDTensor, default LoDTensor) Box list X is a 2-D LoDTensor with shape [N, 4] holds N boxes, each box is represented as [xmin, ymin, xmax, ymax], the shape of X is [N, 4]. [xmin, ymin] is the left top coordinate of the box if the input is image feature map, they are close to the origin of the coordinate system. [xmax, ymax] is the right bottom coordinate of the box. This tensor can contain LoD information to represent a batch of inputs. One instance of this batch can contain different numbers of entities.Y : (Tensor, default Tensor) Box list Y holds M boxes, each box is represented as [xmin, ymin, xmax, ymax], the shape of X is [N, 4]. [xmin, ymin] is the left top coordinate of the box if the input is image feature map, and [xmax, ymax] is the right bottom coordinate of the box. Out : (LoDTensor, the lod is same as input X) The output of iou_similarity op, a tensor with shape [N, M] representing pairwise iou scores.

## conditional_block

Conditional block operator

Run the sub-block if X is not empty. Params holds the other inputs and Out holds the outputs of the sub-block.

Inputs: X (Duplicable) : The conditional variable of this operator. If X is empty, the whole sub-block will not be executed.Params (Duplicable) : The input variables of the sub-block. Out (Duplicable) : The output variables of the sub-block.Scope : (std::vector) The step scope of conditional block. To unify the conditional block, rnn and while op, the type of scope is std::vector sub_block (Duplicable): The step block of conditional block operatoris_scalar_condition (Duplicable): the input X is used as scalar condition

## rmsprop

Rmsprop Optimizer.

$$MeanSquareOut = decay * MeanSquare + (1 - decay) * Grad * Grad \\ MomentOut = momentum * Moment + \frac{LearningRate * Grad}{\sqrt{MeanSquareOut + epsilon}} \\ ParamOut = Param - MomentOut$$

The original slides that proposed Rmsprop: Slide 29 of http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
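The three update equations translate into a small NumPy step function (the function name is illustrative):

```python
import numpy as np

def rmsprop_step(param, mean_square, moment, grad, lr,
                 epsilon=1e-10, decay=0.9, momentum=0.0):
    """One RMSProp update following the three equations above."""
    mean_square_out = decay * mean_square + (1 - decay) * grad * grad
    moment_out = (momentum * moment
                  + lr * grad / np.sqrt(mean_square_out + epsilon))
    param_out = param - moment_out
    return param_out, mean_square_out, moment_out

p, ms, m = np.array([1.0]), np.array([0.0]), np.array([0.0])
g = np.array([0.5])
p, ms, m = rmsprop_step(p, ms, m, g, lr=0.1)
```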

Inputs: Param : (Tensor, default Tensor) Input parameter value that has to be updated.MeanSquare : (Tensor, default Tensor) The mean square value that gets updated.LearningRate : (Tensor, default Tensor) The learning rate should be a tensor of size 1.Grad : (Tensor, default Tensor) Input gradient of the parameter.Moment : (Tensor, default Tensor) The moment that gets updated. ParamOut : (Tensor) Output updated parameter value.MomentOut : (Tensor) Output updated moment.MeanSquareOut : (Tensor) Output Mean squared updated value. epsilon (Duplicable): (float, default 1e-10) Constant for numerical stability.decay (Duplicable): (float, default 0.9) Discounting factor for coming gradient.momentum (Duplicable): (float, default 0.0) Constant value.

## elementwise_mul

Limited Elementwise Mul Operator.

The equation is:

$$Out = X \odot\ Y$$

$X$ is a tensor of any dimension and the dimensions of tensor $Y$ must be smaller than or equal to the dimensions of $X$.

There are two cases for this operator: 1. The shape of $Y$ is same with $X$; 2. The shape of $Y$ is a subset of $X$.

For case 2: $Y$ will be broadcasted to match the shape of $X$ and axis should be set to index of the start dimension to broadcast $Y$ onto $X$.

For example:

    shape(X) = (2, 3, 4, 5), shape(Y) = (,)
    shape(X) = (2, 3, 4, 5), shape(Y) = (5,)
    shape(X) = (2, 3, 4, 5), shape(Y) = (4, 5)
    shape(X) = (2, 3, 4, 5), shape(Y) = (3, 4), with axis=1
    shape(X) = (2, 3, 4, 5), shape(Y) = (2), with axis=0


Both of the inputs $X$ and $Y$, or neither, can carry the LoD (Level of Details) information. However, the output only shares the LoD information with input $X$.

Inputs:
- X : (Tensor) The first input tensor of the elementwise op.
- Y : (Tensor) The second input tensor of the elementwise op.

Outputs:
- Out : The output of the elementwise op.

Attributes:
- axis : (int, default -1) The start dimension index for broadcasting Y onto X.
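The axis-based broadcast of case 2 can be emulated in NumPy by reshaping Y so its dimensions line up with X starting at axis. This is an illustrative sketch of the semantics, not Paddle's kernel; the helper name `elementwise_mul` is made up:

```python
import numpy as np

def elementwise_mul(x, y, axis=-1):
    """Broadcast y onto x starting at `axis`, then multiply element-wise."""
    if axis == -1:
        axis = x.ndim - y.ndim
    # Append trailing singleton dims so y aligns with x at `axis`.
    shape = y.shape + (1,) * (x.ndim - axis - y.ndim)
    return x * y.reshape(shape)

x = np.ones((2, 3, 4, 5))
y = np.arange(3.0)                     # shape (3,), broadcast with axis=1
out = elementwise_mul(x, y, axis=1)
```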

## sequence_slice

Sequence slice operator

The operator crops a subsequence from a given sequence with a given start offset and subsequence length. It only supports sequences (LoD tensors with LoD level 1).

- Case:

    X = [[a1, a2; b1, b2; c1, c2]
         [d1, d2; e1, e2]]
    LoD(X) = {{0, 3, 5}}; Dims(X) = (5, 2)
    Offset = [[0], [1]]; Length = [[2], [1]]

    Out = [[a1, a2;
            b1, b2]
           [e1, e2]]
    LoD(Out) = {{0, 2, 3}}; Dims(Out) = (3, 2)


NOTE: The first dimension of the input, the size of Offset, and the size of Length should all be equal. Offsets start from 0.

Inputs:
- X : (LoDTensor) The input of SequenceSliceOp.
- Offset : (Tensor) A vector describing the offset of every input sequence for the subsequence item.
- Length : (Tensor) A vector describing the length of every input sequence for the subsequence item.

Outputs:
- Out : (LoDTensor) The output of SequenceSliceOp.

## hinge_loss

HingeLoss Operator.

Let x be a logit (prediction) and y be the actual label. The logit can take any values from (-inf, inf), but the labels should be either -1 or 1. Then, the hinge loss is computed as follows:

$$L_(x, y) = max(1 - y.x, 0)$$

Note that the labels passed as input will have values as either 0 or 1.

Inputs:
- Logits : The input value (Logits) of the hinge loss op. Logits is a 2-D tensor with shape [batch_size, 1].
- Labels : The target value (Labels) of the hinge loss op. Labels is a 2-D tensor with shape [batch_size, 1].

Outputs:
- Loss : The output tensor with shape [batch_size, 1] which represents the hinge loss.
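The loss can be sketched in NumPy, mapping the 0/1 input labels to -1/+1 before applying the formula. A minimal illustration only; the function name and sample values are made up:

```python
import numpy as np

def hinge_loss(logits, labels01):
    """Hinge loss L(x, y) = max(1 - y*x, 0); input labels 0/1 map to -1/+1."""
    y = 2.0 * labels01 - 1.0
    return np.maximum(1.0 - y * logits, 0.0)

logits = np.array([[2.0], [0.3], [-0.5]])
labels = np.array([[1.0], [1.0], [0.0]])
loss = hinge_loss(logits, labels)
```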

## fill_constant

FillConstant Operator.

Fill a variable with the specified constant value.

Outputs:
- Out : (Tensor) A tensor of the specified shape, filled with the specified value.

Attributes:
- dtype : (int, default 5 (FP32)) Output data type.
- shape : (vector) The shape of the output.
- value : (float, default 0) The value to be filled.
- force_cpu : (bool, default false) Force filling the output variable in CPU memory. Otherwise, fill the output variable on the running device.

## detection_output

Detection output for SSD (single shot multibox detector). Apply NMS to the output of the network and compute the predicted bounding box locations. The output shape of this layer could be zero if there is no valid bounding box.

Inputs:
- Loc : (Tensor) The input tensor of the detection_output operator: the predicted locations. The format of the input tensor is kNCHW, where K is the number of priorbox points, N is the number of boxes on each point, C is 4, and H and W are both 1.
- Conf : (Tensor) The input tensor of the detection_output operator: the priorbox confidence. The format of the input tensor is kNCHW, where K is the number of priorbox points, N is the number of boxes on each point, C is the number of classes, and H and W are both 1.
- PriorBox : (Tensor) The input tensor of the detection_output operator: the positions and variances of the boxes.

Outputs:
- Out : (Tensor) The output tensor of the detection_output operator.

Attributes:
- background_label_id : (int) The background class index.
- num_classes : (int) The number of classes.
- nms_threshold : (float) The non-maximum suppression threshold.
- confidence_threshold : (float) The classification confidence threshold.
- top_k : (int) The number of bboxes kept in the layer's output.
- nms_top_k : (int) The number of bboxes kept in the NMS output.

## fill_zeros_like

FillZerosLike Operator.

Fill up a variable with zeros. The output will have the same size as the input.

Inputs:
- X : The input of the fill-zeros-like op.

Outputs:
- Out : The variable that will be filled with zeros.

## softmax_with_cross_entropy

Softmax With Cross Entropy Operator.

Cross entropy loss with softmax is used as the output layer extensively. This operator computes the softmax normalized values for each row of the input tensor, after which cross-entropy loss is computed. This provides a more numerically stable gradient.

Because this operator performs a softmax on logits internally, it expects unscaled logits. This operator should not be used with the output of softmax operator since that would produce incorrect results.

When the attribute soft_label is set to false, this operator expects mutually exclusive hard labels: each sample in a batch is in exactly one class with a probability of 1.0. Each sample in the batch will have a single label.

The equation is as follows:

1) Hard label (one-hot label, so every sample has exactly one class)

$$Loss_j = -\text{Logit}_{Label_j} + \log\left(\sum_{i=0}^{K}\exp(\text{Logit}_i)\right), j = 1,..., K$$

2) Soft label (each sample can have a distribution over all classes)

$$Loss_j = -\sum_{i=0}^{K}\text{Label}_i \left(\text{Logit}_i - \log\left(\sum_{i=0}^{K}\exp(\text{Logit}_i)\right)\right), j = 1,...,K$$

Inputs:
- Logits : (Tensor, default: Tensor) The unscaled log probabilities, a 2-D tensor with shape [N x K]. N is the batch size and K is the number of classes.
- Label : (Tensor) The ground truth, a 2-D tensor. If soft_label is set to false, Label is a tensor with shape [N x 1]. If soft_label is set to true, Label is a tensor with shape [N x K].

Outputs:
- Softmax (Intermediate) : (Tensor, default: Tensor) A 2-D tensor with shape [N x K]: the softmax activation of the given input batch, which will be used in the backward calculation.
- Loss : (Tensor, default: Tensor) A 2-D tensor with shape [N x 1]: the cross-entropy loss.

Attributes:
- soft_label : (bool, default: false) A flag indicating whether to interpret the given labels as soft labels.
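The numerically stable hard-label computation (equation 1 above) can be sketched in NumPy: shifting by the row maximum before exponentiating avoids overflow. A minimal sketch, not Paddle's implementation; the function name is made up:

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Stable hard-label loss: -logit[label] + log(sum_i exp(logit_i))."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # avoid overflow
    log_z = np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    log_probs = shifted - log_z
    n = logits.shape[0]
    return -log_probs[np.arange(n), labels.ravel()].reshape(n, 1)

logits = np.array([[2.0, 1.0, 0.1]])
labels = np.array([[0]])
loss = softmax_cross_entropy(logits, labels)
```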

## fill_constant_batch_size_like

FillConstantBatchSizeLike Operator.

Fill a variable with the specified constant value.

Inputs:
- Input : (Tensor) Tensor whose dim_idx-th dimension is used to specify the batch size.

Outputs:
- Out : (Tensor) A tensor of the specified shape, filled with the specified value.

Attributes:
- dtype : (int, default 5 (FP32)) Output data type.
- shape : (vector) The shape of the output.
- input_dim_idx : (int, default 0) The index of the input's batch size dimension.
- output_dim_idx : (int, default 0) The index of the output's batch size dimension.
- value : (float, default 0) The value to be filled.

## tanh

Tanh Activation Operator.

$$out = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

Inputs:
- X : Input of the Tanh operator.

Outputs:
- Out : Output of the Tanh operator.

## feed

Feed Operator.

It should not be configured by users directly.

Inputs:
- X : The input of the feed op.

Outputs:
- Out : The output of the feed op.

Attributes:
- col : (int) The column of feed.

## label_smooth

LabelSmooth Operator.

Label smoothing is a mechanism to regularize the classifier layer. In machine learning, optimizing the log-likelihood of the correct label directly may cause two problems. First, it may result in overfitting: if the model learns to assign full probability to the ground-truth label for each training example, it is not guaranteed to generalize. Second, it encourages the differences between the largest logit and all others to become large, reducing the ability of the model to adapt. Label smoothing is proposed to encourage the model to be less confident, which replaces the ground-truth label $y$ with the weighted sum of itself and some fixed distribution $\mu$, i.e.

$$\tilde{y} = (1 - \epsilon) * y + \epsilon * \mu,$$

where $(1 - \epsilon)$ and $\epsilon$ are the weights respectively, and $\tilde{y}$ is the smoothed label. Usually a uniform distribution is used for $\mu$. This change in the ground-truth label is called label-smoothing regularization or LSR.

See more details about label smoothing in https://arxiv.org/abs/1512.00567.

Inputs:
- X : (LoDTensor) The input labels of the LabelSmooth operator. This input can be batched labels in one-hot encoding or output from softmax, with shape [N x K], where N is the batch size and K is the number of classes.
- PriorDist : (Tensor, optional) The prior distribution to be added to the smoothed label. It is fixed during training, and the number of elements should be equal to the dimension K of each label. It defaults to the uniform distribution; each element will be set to 1/K if not provided.

Outputs:
- Out : (LoDTensor) The smoothed label of the LabelSmooth operator. It has the same shape and LoD as the input LoDTensor.

Attributes:
- epsilon : (float, default 0.0f) The smoothing parameter of the LabelSmooth operator.
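The smoothing formula above is a one-liner; a NumPy sketch with a uniform prior (the function name and sample values are made up):

```python
import numpy as np

def label_smooth(y, epsilon=0.1, prior=None):
    """Smooth labels: (1 - eps) * y + eps * mu, mu uniform if not given."""
    k = y.shape[-1]
    mu = np.full(k, 1.0 / k) if prior is None else prior
    return (1.0 - epsilon) * y + epsilon * mu

y = np.array([[0.0, 1.0, 0.0, 0.0]])   # one-hot label, K = 4
smoothed = label_smooth(y, epsilon=0.1)
```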

## expand

The expand operator tiles the input by the given number of times. You set the times number for each dimension by providing the attribute 'expand_times'. The rank of X should be in [1, 6]. Please note that the size of 'expand_times' must be the same as the rank of X. The following is a use case:

Input(X) is a 3-D tensor with shape [2, 3, 1]:

    [
[[1], [2], [3]],
[[4], [5], [6]]
]


Attr(expand_times): [1, 2, 2]

Output(Out) is a 3-D tensor with shape [2, 6, 2]:

    [
[[1, 1], [2, 2], [3, 3], [1, 1], [2, 2], [3, 3]],
[[4, 4], [5, 5], [6, 6], [4, 4], [5, 5], [6, 6]]
]

Inputs:
- X : (Tensor, default Tensor) A tensor with rank in [1, 6]. X is the input to be expanded.

Outputs:
- Out : (Tensor, default Tensor) A tensor with rank in [1, 6]. The rank of Output(Out) is the same as that of Input(X). After expanding, the size of each dimension of Output(Out) equals the size of the corresponding dimension of Input(X) multiplied by the corresponding value given by Attr(expand_times).

Attributes:
- expand_times : Expand times number for each dimension.
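The worked example above corresponds to NumPy's `np.tile` with the same times vector (a semantic sketch, not Paddle's implementation):

```python
import numpy as np

x = np.array([[[1], [2], [3]],
              [[4], [5], [6]]])        # shape (2, 3, 1)
out = np.tile(x, (1, 2, 2))            # expand_times = [1, 2, 2]
```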

## elementwise_min

Limited Elementwise Min Operator.

The equation is:

$$Out = min(X, Y)$$

$X$ is a tensor of any dimension and the dimensions of tensor $Y$ must be smaller than or equal to the dimensions of $X$.

There are two cases for this operator: 1. The shape of $Y$ is the same as that of $X$; 2. The shape of $Y$ is a subset of $X$.

For case 2: $Y$ will be broadcast to match the shape of $X$, and axis should be set to the index of the start dimension for broadcasting $Y$ onto $X$.

For example:

    shape(X) = (2, 3, 4, 5), shape(Y) = (,)
    shape(X) = (2, 3, 4, 5), shape(Y) = (5,)
    shape(X) = (2, 3, 4, 5), shape(Y) = (4, 5)
    shape(X) = (2, 3, 4, 5), shape(Y) = (3, 4), with axis=1
    shape(X) = (2, 3, 4, 5), shape(Y) = (2), with axis=0


Both of the inputs $X$ and $Y$, or neither, can carry the LoD (Level of Details) information. However, the output only shares the LoD information with input $X$.

Inputs:
- X : (Tensor) The first input tensor of the elementwise op.
- Y : (Tensor) The second input tensor of the elementwise op.

Outputs:
- Out : The output of the elementwise op.

Attributes:
- axis : (int, default -1) The start dimension index for broadcasting Y onto X.

## elementwise_div

Limited Elementwise Div Operator.

The equation is:

$$Out = X / Y$$

$X$ is a tensor of any dimension and the dimensions of tensor $Y$ must be smaller than or equal to the dimensions of $X$.

There are two cases for this operator: 1. The shape of $Y$ is the same as that of $X$; 2. The shape of $Y$ is a subset of $X$.

For case 2: $Y$ will be broadcast to match the shape of $X$, and axis should be set to the index of the start dimension for broadcasting $Y$ onto $X$.

For example:

    shape(X) = (2, 3, 4, 5), shape(Y) = (,)
    shape(X) = (2, 3, 4, 5), shape(Y) = (5,)
    shape(X) = (2, 3, 4, 5), shape(Y) = (4, 5)
    shape(X) = (2, 3, 4, 5), shape(Y) = (3, 4), with axis=1
    shape(X) = (2, 3, 4, 5), shape(Y) = (2), with axis=0


Both of the inputs $X$ and $Y$, or neither, can carry the LoD (Level of Details) information. However, the output only shares the LoD information with input $X$.

Inputs:
- X : (Tensor) The first input tensor of the elementwise op.
- Y : (Tensor) The second input tensor of the elementwise op.

Outputs:
- Out : The output of the elementwise op.

Attributes:
- axis : (int, default -1) The start dimension index for broadcasting Y onto X.

## elementwise_add

Limited Elementwise Add Operator.

The equation is:

$$Out = X + Y$$

$X$ is a tensor of any dimension and the dimensions of tensor $Y$ must be smaller than or equal to the dimensions of $X$.

There are two cases for this operator: 1. The shape of $Y$ is the same as that of $X$; 2. The shape of $Y$ is a subset of $X$.

For case 2: $Y$ will be broadcast to match the shape of $X$, and axis should be set to the index of the start dimension for broadcasting $Y$ onto $X$.

For example:

    shape(X) = (2, 3, 4, 5), shape(Y) = (,)
    shape(X) = (2, 3, 4, 5), shape(Y) = (5,)
    shape(X) = (2, 3, 4, 5), shape(Y) = (4, 5)
    shape(X) = (2, 3, 4, 5), shape(Y) = (3, 4), with axis=1
    shape(X) = (2, 3, 4, 5), shape(Y) = (2), with axis=0


Both of the inputs $X$ and $Y$, or neither, can carry the LoD (Level of Details) information. However, the output only shares the LoD information with input $X$.

Inputs:
- X : (Tensor) The first input tensor of the elementwise op.
- Y : (Tensor) The second input tensor of the elementwise op.

Outputs:
- Out : The output of the elementwise op.

Attributes:
- axis : (int, default -1) The start dimension index for broadcasting Y onto X.

## cross_entropy

CrossEntropy Operator.

It supports both standard cross-entropy and soft-label cross-entropy loss computation.

1) One-hot cross-entropy: soft_label = false, Label[i, 0] indicates the class index for sample i:

            $Y[i] = -\log(X[i, Label[i]])$


2) Soft-label cross-entropy: soft_label = true, Label[i, j] indicates the soft label of class j for sample i:

            $Y[i] = \sum_j{-Label[i, j] * log(X[i, j])}$


Please make sure that in this case the summation of each row of Label equals one.

3) One-hot cross-entropy with vectorized Input(Label): As a special case of 2), when each row of Input(Label) has only one non-zero element (equal to 1), soft-label cross-entropy degenerates to a one-hot cross-entropy with one-hot label representation.

Both the input X and Label can carry the LoD (Level of Details) information, or not. But the output only shares the LoD information with input X.

Inputs:
- X : (Tensor, default Tensor) A 2-D tensor with shape [N x D], where N is the batch size and D is the number of classes. This input is a probability computed by the previous operator, which is almost always the result of a softmax operator.
- Label : (Tensor) The ground truth, a 2-D tensor. When soft_label is set to false, Label is a tensor with shape [N x 1]. When soft_label is set to true, Label is a tensor with shape [N x D].

Outputs:
- Y : (Tensor, default Tensor) A 2-D tensor with shape [N x 1]: the cross-entropy loss.

Attributes:
- soft_label : (bool, default false) A flag indicating whether to interpret the given labels as soft labels.
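Both label modes can be sketched in NumPy over precomputed probabilities (the input X here is already a softmax output, as the text states). A minimal sketch with made-up names; note case 3) is visible in that a one-hot soft label gives the same loss as the hard index:

```python
import numpy as np

def cross_entropy(x, label, soft_label=False):
    """x holds probabilities; hard labels are class indices, soft labels are rows."""
    if soft_label:
        return -(label * np.log(x)).sum(axis=1, keepdims=True)
    n = x.shape[0]
    return -np.log(x[np.arange(n), label.ravel()]).reshape(n, 1)

probs = np.array([[0.2, 0.7, 0.1]])
hard = cross_entropy(probs, np.array([[1]]))
soft = cross_entropy(probs, np.array([[0.0, 1.0, 0.0]]), soft_label=True)
```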

## matmul

MatMul Operator.

This operator is used to perform (batched) matrix multiplication over the last two dimensions of the input tensors X and Y.

If a transpose flag is specified, the last two dimensions of the tensor are transposed. If the tensor is rank-1 of shape [D], then for X it is treated as [1, D] in nontransposed form and as [D, 1] in transposed form, whereas for Y it is the opposite: It is treated as [D, 1] in nontransposed form and as [1, D] in transposed form.

Examples without transpose:
- X: [K], Y: [K] => Out: [1]
- X: [K], Y: [K, N] => Out: [N]
- X: [B, M, K], Y: [K] => Out: [B, M]
- X: [M, K], Y: [B, K, N] => Out: [B, M, N]
- X: [B, M, K], Y: [B, K, N] => Out: [B, M, N]
- X: [B, ..., M, K], Y: [B, ..., K, N] => Out: [B, ..., M, N]

The behavior is designed to be similar to the numpy.matmul function. The differences are:
- When the rank of the input data is less than or equal to 3, it is similar to the numpy.matmul function.
- When the rank of the input is greater than 3, the ranks of X and Y must be equal, and the first rank - 2 dimensions must be equal.
- We add transpose_X and transpose_Y flags.

Both the input X and Y can carry the LoD (Level of Details) information, or not. But the output only shares the LoD information with input X.

Inputs:
- X : The first input of the MatMul op.
- Y : The second input of the MatMul op.

Outputs:
- Out : The output of the MatMul op.

Attributes:
- transpose_X : If true, use the transpose of X.
- transpose_Y : If true, use the transpose of Y.

## dropout

Dropout Operator.

Dropout refers to randomly dropping out units in a neural network. It is a regularization technique for reducing overfitting by preventing neuron co-adaption during training. The dropout operator randomly sets (according to the given dropout probability) the outputs of some units to zero, while others are set equal to their corresponding inputs.

Inputs:
- X : The input of the dropout op.

Outputs:
- Out : The output of the dropout op.
- Mask (Intermediate) : The randomly sampled dropout mask.

Attributes:
- dropout_prob : Probability of setting units to zero.
- is_test : True if in the test phase.
- fix_seed : A flag indicating whether to use a fixed seed to generate the random mask. NOTE: DO NOT set this flag to true in training. Setting it to true is only useful in unit tests or for debugging, so that the same output units are always dropped.
- seed : Dropout random seed.
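The description above (zero some units, pass the rest through unscaled, identity at test time) can be sketched in NumPy. The function name is made up, and this sketch deliberately does not rescale activations, matching the text:

```python
import numpy as np

def dropout(x, dropout_prob=0.5, is_test=False, seed=None):
    """Zero each unit with probability dropout_prob; identity at test time."""
    if is_test:
        return x, np.ones_like(x)
    rng = np.random.default_rng(seed)
    mask = (rng.random(x.shape) >= dropout_prob).astype(x.dtype)
    return x * mask, mask

x = np.ones((4, 5))
out, mask = dropout(x, dropout_prob=0.3, seed=0)
```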

## fetch

Fetch Operator.

It should not be configured by users directly.

Inputs:
- X : The input of the fetch op.

Outputs:
- Out : The output of the fetch op.

Attributes:
- col : (int) The column of fetch.

## squared_l2_distance

SquaredL2Distance operator

This operator calculates the squared L2 distance between the input and the target. The number of distance values equals the first dimension of the input. The first dimension of the target can be equal to that of the input or to 1. If the first dimension of the target is 1, the operator will broadcast the target's first dimension to the input's first dimension. During backward propagation, the user can decide whether to calculate the gradient of the input, the target, or both.

Both the input X and Y can carry the LoD (Level of Details) information. However, the output only shares the LoD information with input X.

Inputs:
- X : (Tensor) Input of SquaredL2DistanceOp.
- Y : (Tensor) Target of SquaredL2DistanceOp.

Outputs:
- sub_result (Intermediate) : (Tensor) Buffered subtraction result which will be reused in backward.
- Out : (Tensor) Squared L2 distance between input and target.

## while

Inputs:
- X (Duplicable) : A set of variables required by operators inside the block of the While op.
- Condition (Duplicable) : (Bool) A scalar. When it is false, the While op terminates.

Outputs:
- Out (Duplicable) : A set of variables that will be assigned the values generated by the operators inside the block of the While op.
- StepScopes : (StepScopeVar) A vector of local scopes whose size equals the step number of the While op. The i-th scope stores temporary variables generated in the i-th step.

Attributes:
- sub_block : The step block inside the While op.

## relu

Relu Activation Operator.

$out = max(x, 0)$

Inputs:
- X : Input of the Relu operator.

Outputs:
- Out : Output of the Relu operator.

## decayed_adagrad

Decayed Adagrad Optimizer.

The update is done as follows:

$$moment\_out = decay * moment + (1 - decay) * grad * grad \\ param\_out = param - \frac{learning\_rate * grad}{\sqrt{moment\_out} + \epsilon}$$

The original paper(http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf) does not have an epsilon attribute. It is added here for numerical stability to avoid the division by zero error.

Inputs:
- Param : (Tensor) Input parameter.
- Grad : (Tensor) Input gradient.
- Moment : (Tensor) Second moment.
- LearningRate : (Tensor) Learning rate.

Outputs:
- ParamOut : (Tensor) Output parameter.
- MomentOut : (Tensor) Output second moment.

Attributes:
- decay : (float, default 0.95) Discounting factor for coming gradient.
- epsilon : (float, default 1.0e-6) Constant for numerical stability.

## gru

The GRU operator implements part of the calculations of the complete GRU as follows:

$$update\_gate: u_t = actGate(xu_t + W_u * h_{t-1} + b_u) \\ reset\_gate: r_t = actGate(xr_t + W_r * h_{t-1} + b_r) \\ output\_candidate: {h}_t = actNode(xc_t + W_c * dot(r_t, h_{t-1}) + b_c) \\ output: h_t = dot((1 - u_t), h_{t-1}) + dot(u_t, {h}_t)$$

@note To implement the complete GRU, a fully-connected operator must be used beforehand to feed xu, xr and xc as the Input of the GRU operator.

Inputs:
- Input : (LoDTensor) The first input is a LoDTensor, which supports variable-length input sequences. The underlying tensor in this LoDTensor is a matrix with shape (T x 3D), where T is the total number of time steps in this mini-batch and D is the hidden size.
- H0 : (Tensor, optional) The initial hidden state, a tensor with shape (N x D), where N is the batch size and D is the hidden size.
- Weight : (Tensor) The learnable hidden-hidden weight matrix with shape (D x 3D), where D is the hidden size. The elements continuous in memory can be divided into two parts: the first part holds the weights of the update gate and reset gate with shape (D x 2D), and the second part holds the weights of the output candidate with shape (D x D).
- Bias : (Tensor, optional) Bias vector with shape (1 x 3D), concatenating the biases of the update gate, reset gate and output candidate.

Outputs:
- BatchGate (Intermediate) : (LoDTensor) To compute with batches, sequence data is reorganized into several successive batches, each containing data from the same time step. The LoDTensor BatchGate contains the update gate, reset gate and output candidate values organized in batches. The LoD size is 2: the first LoD contains the batch offsets and the second LoD contains the indexes into the raw sequence data.
- BatchResetHiddenPrev (Intermediate) : (LoDTensor) The reset hidden state LoDTensor organized in batches. This LoDTensor is a matrix with shape (T x D) and has the same LoD as BatchGate.
- BatchHidden (Intermediate) : (LoDTensor) The hidden state LoDTensor organized in batches. This LoDTensor is a matrix with shape (T x D) and has the same LoD as BatchGate.
- Hidden : (LoDTensor) The hidden state LoDTensor organized in sequences. This LoDTensor is a matrix with shape (T x D) and has the same LoD as BatchGate.

Attributes:
- activation : (string, default tanh) The activation type used for the output candidate {h}_t.
- gate_activation : (string, default sigmoid) The activation type used in the update gate and reset gate.
- is_reverse : (bool, default False) Whether to compute the reversed GRU.

## ctc_align

The CTCAlign op merges repeated elements between two blanks and then deletes all blanks in the sequence.

Given:

    Input.data = [0, 1, 2, 2, 0, 4, 0, 4, 5, 0, 6, 6, 0, 0, 7, 7, 7, 0]
    Input.dims = {18, 1}
    Input.LoD = [[0, 11, 18]]

And:

    blank = 0
    merge_repeated = True

Then:

    Output.data = [1, 2, 4, 4, 5, 6, 6, 7]
    Output.dims = {8, 1}
    Output.LoD = [[0, 6, 8]]

Inputs:
- Input : (LoDTensor, default: LoDTensor) Its shape is [Lp, 1], where Lp is the sum of all input sequences' lengths.

Outputs:
- Output : (Tensor, default: Tensor) The alignment result.

Attributes:
- blank : (int, default: 0) The blank label set in the Connectionist Temporal Classification (CTC) op.
- merge_repeated : (bool, default: true) Whether to merge repeated elements between two blanks.
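The worked example above can be reproduced with a small Python sketch that applies the merge-then-remove-blanks rule per LoD sequence (the function name is made up; note that processing each sequence separately is what keeps the two 6s across the sequence boundary from merging):

```python
def ctc_align(seq, blank=0, merge_repeated=True):
    """Merge repeated tokens, then drop blanks, within one sequence."""
    out, prev = [], None
    for token in seq:
        if not (merge_repeated and token == prev) and token != blank:
            out.append(token)
        prev = token
    return out

data = [0, 1, 2, 2, 0, 4, 0, 4, 5, 0, 6, 6, 0, 0, 7, 7, 7, 0]
lod = [0, 11, 18]                      # two sequences: [0, 11) and [11, 18)
result, new_lod = [], [0]
for start, end in zip(lod, lod[1:]):
    result += ctc_align(data[start:end])
    new_lod.append(len(result))
```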

## split_lod_tensor

Split a LoDTensor with a Mask at a certain level. The input LoDTensor has 3 sequences at a certain LoD level. The Mask is a bool column vector, such as [0, 1, 0], at the same level. The first and third sequences will be sent to the False output LoDTensor, whereas the second sequence will be sent to the True output LoDTensor. Please refer to MergeLoDTensorOp.

Inputs:
- X : The input LoDTensor.
- Mask : A bool column vector that masks the input.

Outputs:
- OutTrue : True branch of the input LoDTensor.
- OutFalse : False branch of the input LoDTensor.

Attributes:
- level : (int) The specific LoD level to split.

## read

Read Operator.

Execute a given reader once and output data.


## crop

Crop Operator.

Crop input into output, as specified by offsets and shape.

There are two ways to set the shape:
1. Reference input: crop input X into the same shape as the reference input. The dimension of the reference input should be the same as the dimension of input X.
2. Shape list: crop input X into the shape described by a list. The size of the shape list should be the same as the dimension size of input X.

The input should be a k-D tensor (0 < k < 7). As an example:

Case 1: Given

X = [[0, 1, 2, 0, 0]
[0, 3, 4, 0, 0]
[0, 0, 0, 0, 0]],


and

offsets = [0, 1],


and

shape = [2, 2],


we get:

Out = [[1, 2],
[3, 4]].


Case 2: Given

X = [[0, 1, 2, 5, 0]
[0, 3, 4, 6, 0]
[0, 0, 0, 0, 0]],


and

offsets = [0, 1],


and

Y = [[0, 0, 0]
[0, 0, 0]],


we get:

Out = [[1, 2, 5],
[3, 4, 6]].

Inputs:
- X : The input of the crop op. The input should be a k-D tensor (0 < k < 7).
- Y : The input used as a reference for cropping, which has the same dimensions as X.

Outputs:
- Out : The output of the crop op, which has the same dimensions as X.

Attributes:
- offsets : A list describing the offsets to be cropped. The size of the offsets list should be the same as the dimension size of input X.
- shape : A list describing the shape of the output. The size of the shape list should be the same as the dimension size of input X.
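Case 2 above can be reproduced with plain NumPy slicing: each dimension is sliced from its offset for the length given by the reference shape (a semantic sketch, not Paddle's implementation):

```python
import numpy as np

x = np.array([[0, 1, 2, 5, 0],
              [0, 3, 4, 6, 0],
              [0, 0, 0, 0, 0]])
offsets, shape = [0, 1], [2, 3]        # shape taken from Y's dims in Case 2
out = x[tuple(slice(o, o + s) for o, s in zip(offsets, shape))]
```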

## brelu

BRelu Activation Operator.

$out = \min(\max(x, t_{min}), t_{max})$

Inputs:
- X : Input of the BRelu operator.

Outputs:
- Out : Output of the BRelu operator.

Attributes:
- t_min : The min marginal value of BRelu.
- t_max : The max marginal value of BRelu.

## crf_decoding

The crf_decoding operator reads the emission feature weights and the transition feature weights learned by the linear_chain_crf operator. It implements the Viterbi algorithm which is a dynamic programming algorithm for finding the most likely sequence of hidden states, called the Viterbi path, that results in a sequence of observed tags.

The output of this operator changes according to whether Input(Label) is given:

1. Input(Label) is given:

This happens in training. This operator is used to co-work with the chunk_eval operator.

When Input(Label) is given, the crf_decoding operator returns a row vector with shape [N x 1] whose values are fixed to be 0, indicating an incorrect prediction, or 1 indicating a tag is correctly predicted. Such an output is the input to chunk_eval operator.

2. Input(Label) is not given:

This is the standard decoding process.

The crf_decoding operator returns a row vector with shape [N x 1] whose values range from 0 to maximum tag number - 1. Each element indicates an index of a predicted tag.

Inputs:
- Emission : (LoDTensor, default: LoDTensor) A LoDTensor with shape [N x D], where N is the size of the mini-batch and D is the total tag number. This input is the unscaled emission weight matrix of the linear_chain_crf operator.
- Transition : (Tensor, default: Tensor) A tensor with shape [(D + 2) x D]. This input is the transition weights learned by the linear_chain_crf operator, denoted as w. The 1st row of w holds transition weights for the start mask, the 2nd row holds transition weights for the end mask, and transition weights between other tags begin from the 3rd row. See more details in the comments of the linear_chain_crf operator.
- Label : (LoDTensor) The ground truth with shape [N x 1]. This input is optional. See more details in the operator's comments.

Outputs:
- ViterbiPath : (LoDTensor) The decoding results. What is returned changes depending on whether Input(Label) (the ground truth) is given. See more details in the operator's comments.

## maxout

MaxOut Operator.

Assume the input shape is (N, Ci, H, W) and the output shape is (N, Co, H, W). Then $Co = Ci / groups$ and the operator formula is as follows:

$$y_{si+j} = \max_k x_{gsi + sk + j} \\ g = groups \\ s = \frac{input.size}{num\_channels} \\ 0 \le i < \frac{num\_channels}{groups} \\ 0 \le j < s \\ 0 \le k < groups$$

Please refer to Paper: - Maxout Networks: http://www.jmlr.org/proceedings/papers/v28/goodfellow13.pdf - Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks: https://arxiv.org/pdf/1312.6082v4.pdf

Inputs:
- X : (Tensor) The input tensor of the maxout operator. The format of the input tensor is NCHW, where N is batch size, C is the number of channels, and H and W are the height and width of the feature.

Outputs:
- Out : (Tensor) The output tensor of the maxout operator, also in NCHW format, where N is batch size, C is the number of channels, and H and W are the height and width of the feature.

Attributes:
- groups : Specifies how many groups the input tensor will be split into in the channel dimension. The number of output channels is the number of input channels divided by groups.
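The formula above (for each output channel, take the maximum over a group of consecutive input channels) can be sketched with a reshape in NumPy. A minimal illustration with made-up names, not Paddle's kernel:

```python
import numpy as np

def maxout(x, groups):
    """Max over `groups` consecutive channels of an NCHW input; Co = Ci / groups."""
    n, c, h, w = x.shape
    return x.reshape(n, c // groups, groups, h, w).max(axis=2)

x = np.arange(12, dtype=float).reshape(2, 6, 1, 1)   # Ci = 6
out = maxout(x, groups=2)                            # Co = 3
```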

## ftrl

Optimizer that implements the FTRL algorithm:

$$new\_accum = squared\_accum + grad^2 \\ if (lr\_power == -0.5) { linear\_accum += grad - (\surd(new\_accum) - \surd(squared\_accum)) / (learning\_rate * param) \\ } else { linear\_accum += grad - (new\_accum^{-lr\_power} - accum^{-lr\_power}) / (learning\_rate * param) \\ } x = (l1 * sign(linear\_accum) - linear\_accum) if (lr\_power == -0.5) { y = \frac{\surd(new\_accum)}{learning\_rate} + (2 * l2) \\ pre\_shrink = \frac{x}{y} \\ param = (abs(linear\_accum) > l1).select(pre\_shrink, 0.0) \\ } else { y = \frac{new\_accum^{-lr\_power}}{learning\_rate} + (2 * l2) \\ pre\_shrink = \frac{x}{y} \\ param = (abs(linear\_accum) > l1).select(pre\_shrink, 0.0) \\ } squared\_accum += grad^2;$$

Inputs:
- Param : (Tensor, default Tensor) Input parameter value that has to be updated.
- SquaredAccumulator : (Tensor, default Tensor) Accumulator that accumulates squared gradients.
- LinearAccumulator : (Tensor, default Tensor) Accumulator that accumulates linear gradients.
- Grad : (Tensor, default Tensor) Input gradient of the parameter.
- LearningRate : (Tensor, default Tensor) The learning rate should be a tensor of size 1.

Outputs:
- ParamOut : (Tensor) Output updated parameter value.
- SquaredAccumOut : (Tensor) Output accumulated squared gradients.
- LinearAccumOut : (Tensor) Output accumulated linear gradients.

Attributes:
- l1 : (float, default 0.0) L1 regularization strength.
- l2 : (float, default 0.0) L2 regularization strength.
- lr_power : (float, default -0.5f) Learning rate power.

## conv_shift

ConvShift Operator.

A layer for circular convolution of two vectors, as used in the Neural Turing Machine: https://arxiv.org/abs/1410.5401

The equation is:

$$Out[i] = \sum_{j=-(N-1)/2}^{(N-1)/2} X_{i+j} * Y_{j}$$

where X's index is computed modulo M, and Y's index is computed modulo N.

Both inputs X and Y can carry LoD (Level of Details) information. However, the output only shares the LoD information with input X.

Inputs:
- X : (Tensor, default Tensor) A 2-D tensor with shape B x M, where B is the batch size and M is the data dimension.
- Y : (Tensor, default Tensor) A 2-D tensor with shape B x N, where B is the batch size and N is the data dimension. N must be odd.

Outputs:
- Out : (Tensor, default Tensor) A 2-D tensor with shape B x M, i.e., the same shape as X.
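The circular convolution above can be sketched directly from the equation, with X's index taken modulo M. A naive reference loop for illustration only (the function name and kernels are made up):

```python
import numpy as np

def conv_shift(x, y):
    """Circular conv per row: Out[:, i] = sum_j x[:, (i+j) % M] * y[:, j+half]."""
    _, m = x.shape
    _, n = y.shape                     # n must be odd
    half = (n - 1) // 2
    out = np.zeros_like(x)
    for i in range(m):
        for j in range(-half, half + 1):
            out[:, i] += x[:, (i + j) % m] * y[:, j + half]
    return out

x = np.array([[1.0, 2.0, 3.0, 4.0]])
identity = np.array([[0.0, 1.0, 0.0]])   # unit kernel centered at j = 0
out = conv_shift(x, identity)
```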

## sum

Sum operator.

This operator sums the input tensors. All the inputs can carry the LoD (Level of Details) information. However, the output only shares the LoD information with the first input.

Inputs:
- X (Duplicable) : (vector) The input tensors of the sum operator.

Outputs:
- Out : (Tensor) The output tensor of the sum operator.

## concat

Concat Operator.

Concatenate the input tensors along dimension axis. Examples:

    Input[0] = [[1, 2], [3, 4]]
    Input[1] = [[5, 6]]
    axis = 0
    Output = [[1, 2],
              [3, 4],
              [5, 6]]

Inputs:
- X (Duplicable) : Input tensors of the concat operator.

Outputs:
- Out : Output tensor of the concat operator.

Attributes:
- axis : The axis along which the input tensors will be concatenated.

## less_equal

less_equal Operator

It operates element-wise on X and Y and returns Out. Each of them is an N-dim tensor. X and Y can be of any type. Each element of the Out tensor is calculated by Out = X <= Y.

Inputs:
- X : (LoDTensor) The left-hand operand of the less_equal operator.
- Y : (LoDTensor) The right-hand operand of the less_equal operator.

Outputs:
- Out : (LoDTensor) An n-dim bool tensor. Each element is Out = X <= Y.

Attributes:
- axis : (int, default -1) The start dimension index for broadcasting Y onto X.

## equal

equal Operator

It operates element-wise on X and Y and returns Out. Each of them is an N-dim tensor. X and Y can be of any type. Each element of the Out tensor is calculated as Out = X == Y.

Inputs:
- X : (LoDTensor) the left hand operand of the equal operator.
- Y : (LoDTensor) the right hand operand of the equal operator.

Outputs:
- Out : (LoDTensor) N-dim bool tensor. Each element is Out = X == Y.

Attributes:
- axis : (int, default -1) The start dimension index for broadcasting Y onto X.

## gather

Gather Operator.

$Out = X[Index]$

Out is obtained by gathering entries of the outer-most dimension of X indexed by Index and concatenating them together.

Example:

X = [[1, 2], [3, 4], [5, 6]]

Index = [[1, 2]]

Then:

Out = [[3, 4], [5, 6]]

Inputs:
- X : The source input of the gather op.
- Index : The index input of the gather op.

Outputs:
- Out : The output of the gather op.
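The worked example above maps directly onto NumPy fancy indexing, which can serve as a hedged reference for the gather semantics:

```python
import numpy as np

# Out = X[Index]: entries of X are gathered along the outer-most dimension.
x = np.array([[1, 2], [3, 4], [5, 6]])
index = np.array([1, 2])
out = x[index]  # equivalent to np.take(x, index, axis=0)
```

Row 1 and row 2 of X are selected and stacked, reproducing the Out shown in the example.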

## clip_by_norm

ClipByNorm Operator.

This operator limits the L2 norm of the input $X$ within $max_norm$. If the L2 norm of $X$ is less than or equal to $max_norm$, $Out$ will be the same as $X$. If the L2 norm of $X$ is greater than $max_norm$, $X$ will be linearly scaled to make the L2 norm of $Out$ equal to $max_norm$, as shown in the following formula:

$$Out = \frac{max\_norm * X}{norm(X)},$$

where $norm(X)$ represents the L2 norm of $X$.

Inputs:
- X : (Tensor) The input of the clip_by_norm op. The number of dimensions must be between [1, 9].

Outputs:
- Out : (Tensor) The output of the clip_by_norm op with the same shape as input X.

Attributes:
- max_norm : (float) The maximum norm value.
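The formula can be sketched directly (a minimal illustration of the math, not the operator's kernel):

```python
import numpy as np

def clip_by_norm(x, max_norm):
    """Return x unchanged if its L2 norm is within max_norm;
    otherwise linearly scale x so the output's L2 norm equals max_norm."""
    norm = np.linalg.norm(x)
    if norm <= max_norm:
        return x
    return max_norm * x / norm
```

For instance, the vector [3, 4] has L2 norm 5, so clipping to max_norm = 1 rescales it to norm 1, while clipping to max_norm = 10 leaves it untouched.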

## sigmoid

Sigmoid Activation Operator

$$out = \frac{1}{1 + e^{-x}}$$

Inputs:
- X : Input of the Sigmoid operator.

Outputs:
- Out : Output of the Sigmoid operator.

## floor

Floor Activation Operator.

$out = floor(x)$

Inputs:
- X : Input of the Floor operator.

Outputs:
- Out : Output of the Floor operator.

## sequence_concat

The sequence_concat operator concatenates multiple LoDTensors. It only supports a sequence (LoD tensor with level number 1) or a nested sequence (LoD tensor with level number 2) as its input.

• Case1: If the axis is other than 0 (here, axis is 1 and level is 1), each input should have the same LoD information, and the LoD information of the output stays the same as that of the inputs.

LoD(x0) = {{0,2,4}, {0,1,2,3,4}}; Dims(x0) = (4,3,4) LoD(x1) = {{0,2,4}, {0,1,2,3,4}}; Dims(x1) = (4,4,4) LoD(Out) = {{0,2,4}, {0,1,2,3,4}}; Dims(Out) = (4,7,4)

• Case2: If the axis is 0 (here, level is 0), the inputs are concatenated along time steps, and the LoD information of the output needs to be re-computed. The level-1 LoD information should be the same.

LoD(x0) = {{0,2,4}, {0,1,2,3,4}}; Dims(x0) = (4,3,4) LoD(x1) = {{0,2,4}, {0,1,3,5,7}}; Dims(x1) = (7,3,4) LoD(Out) = {{0,2,4}, {0,2,5,8,11}}; Dims(Out) = (11,3,4)

• Case3: If the axis is 0(here, level is 1).

LoD(x0) = {{0,2,4}, {0,1,2,3,4}}; Dims(x0) = (4,3,4) LoD(x1) = {{0,3,4}, {0,1,3,5,7}}; Dims(x1) = (7,3,4) LoD(Out) = {{0,5,8}, {0,1,2,3,5,7,8,9,11}}; Dims(Out) = (11,3,4)

• Case4: If the LoD number is 1, axis is 0, level is 0

LoD(x0) = {{0,1,2,3,4}}; Dims(x0) = (4,3,4) LoD(x1) = {{0,1,3,5,7}}; Dims(x1) = (7,3,4) LoD(Out) = {{0,2,5,8,11}}; Dims(Out) = (11,3,4)

NOTE: The levels of all the inputs should be the same.
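The LoD re-computation in Case 4 can be sketched as follows (a hedged illustration of the bookkeeping, not the operator's implementation): the output offsets are the cumulative sums of the element-wise summed per-sequence lengths of the inputs.

```python
import numpy as np

def concat_lod(lod0, lod1):
    """Merge two level-0 LoD offset vectors when concatenating along time
    steps. Each output sequence length is the sum of the corresponding
    input sequence lengths."""
    len0 = np.diff(lod0)          # per-sequence lengths of input 0
    len1 = np.diff(lod1)          # per-sequence lengths of input 1
    merged = np.concatenate([[0], np.cumsum(len0 + len1)])
    return merged.tolist()
```

Applied to Case 4's inputs, lengths [1,1,1,1] and [1,2,2,2] sum to [2,3,3,3], whose running totals give the output LoD {0,2,5,8,11}.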

Inputs:
- X (Duplicable) : (LoDTensorArray) Input is a vector of LoDTensor, each of which is a variable-length sequence or nested sequence.

Outputs:
- Out : (LoDTensor) Variable-length output of the sequence_concat op.

Attributes:
- axis : (int, default 0) The axis along which the inputs will be joined. If axis is 0, the inputs will be joined with the LoD index.
- level : (int, default 0) The level at which the inputs will be joined. If the level is 0, the inputs will be joined at the nested sequence level. If the level is 1, the inputs will be joined at the sequence level. The level should be less than the level number of the inputs.

## cast

Cast Operator.

This operator casts the input tensor to another data type and returns the output tensor.

Inputs:
- X : The input tensor of the cast op.

Outputs:
- Out : The output tensor of the cast op.

Attributes:
- out_dtype : output data type
- in_dtype : input data type

## chunk_eval

For some basics of chunking, please refer to 'Chunking with Support Vector Machines' (https://aclanthology.info/pdf/N/N01/N01-1025.pdf).

ChunkEvalOp computes the precision, recall, and F1-score of chunk detection, and supports IOB, IOE, IOBES and IO (also known as plain) tagging schemes. Here is an NER example of labeling for these tagging schemes:

     Li     Ming    works  at  Agricultural   Bank   of    China  in  Beijing.


    IO:    I-PER I-PER O O I-ORG I-ORG I-ORG I-ORG O I-LOC
    IOB:   B-PER I-PER O O B-ORG I-ORG I-ORG I-ORG O B-LOC
    IOE:   I-PER E-PER O O I-ORG I-ORG I-ORG E-ORG O E-LOC
    IOBES: B-PER E-PER O O I-ORG I-ORG I-ORG E-ORG O S-LOC

There are three chunk types (named entity types): PER (person), ORG (organization) and LOC (location). We can see that the labels have the form `<tag type>-<chunk type>`.

Since the calculations actually use label ids rather than labels, extra attention should be paid when mapping labels to ids to make ChunkEvalOp work. The key point is that the listed equations are satisfied by the ids.

tag_type = label % num_tag_type
chunk_type = label / num_tag_type


where num_tag_type is the number of tag types in the tagging scheme, num_chunk_type is the number of chunk types, and tag_type gets its value from the following table.

Scheme Begin Inside End   Single
plain   0     -      -     -
IOB     0     1      -     -
IOE     -     0      1     -
IOBES   0     1      2     3


Still using NER as an example, assume the tagging scheme is IOB and the chunk types are ORG, PER and LOC. To satisfy the above equations, the label map can look like this:

B-ORG  0
I-ORG  1
B-PER  2
I-PER  3
B-LOC  4
I-LOC  5
O      6


It's not hard to verify the equations, noting that the number of chunk types is 3 and the number of tag types in the IOB scheme is 2. For example, the label id of I-LOC is 5, the tag type id of I-LOC is 1, and the chunk type id of I-LOC is 2, which is consistent with the results from the equations.
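The id decomposition can be checked mechanically (a small sketch of the two equations above, using the IOB label map from the table):

```python
# IOB scheme: num_tag_types = 2 (Begin = 0, Inside = 1);
# chunk types: ORG = 0, PER = 1, LOC = 2.
NUM_TAG_TYPES = 2

def decompose(label):
    """Split a label id into its (tag_type, chunk_type) pair."""
    tag_type = label % NUM_TAG_TYPES
    chunk_type = label // NUM_TAG_TYPES
    return tag_type, chunk_type
```

For I-LOC (label id 5) this yields tag type 1 (Inside) and chunk type 2 (LOC), matching the worked example in the text.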

Inputs:
- Inference : (Tensor) Predictions from the network.
- Label : (Tensor) The true tag sequences.

Outputs:
- Precision : (float) The evaluated precision (also called positive predictive value) of chunks on the given mini-batch.
- Recall : (float) The evaluated recall (true positive rate or sensitivity) of chunks on the given mini-batch.
- F1-Score : (float) The evaluated F1-score on the given mini-batch.
- NumInferChunks : (int64_t) The number of chunks in Inference on the given mini-batch.
- NumLabelChunks : (int64_t) The number of chunks in Label on the given mini-batch.
- NumCorrectChunks : (int64_t) The number of chunks both in Inference and Label on the given mini-batch.

Attributes:
- num_chunk_types : (int) The number of chunk types. See below for details.
- chunk_scheme : (string, default IOB) The labeling scheme indicating how to encode the chunks. Must be IOB, IOE, IOBES or plain. See below for details.
- excluded_chunk_types : (list) A list of chunk type ids indicating chunk types that are not counted. See below for details.

## box_coder

Bounding Box Coder Operator.

Encode/decode the target bounding box with the prior-box information.

The encoding schema is described below:

    ox = (tx - px) / pw / pxv
    oy = (ty - py) / ph / pyv
    ow = log(abs(tw / pw)) / pwv
    oh = log(abs(th / ph)) / phv

The decoding schema is described below:

    ox = (pw * pxv * tx + px) - tw / 2
    oy = (ph * pyv * ty + py) - th / 2
    ow = exp(pwv * tw) * pw + tw / 2
    oh = exp(phv * th) * ph + th / 2

where tx, ty, tw, th denote the target box's center coordinates, width and height respectively. Similarly, px, py, pw, ph denote the prior box's (anchor) center coordinates, width and height. pxv, pyv, pwv, phv denote the variances of the prior box, and ox, oy, ow, oh denote the encoded/decoded coordinates, width and height.

Inputs:
- PriorBox : (Tensor) Box list. PriorBox is a 2-D Tensor with shape [M, 4] holding M boxes; each box is represented as [xmin, ymin, xmax, ymax]. [xmin, ymin] is the left-top coordinate of the anchor box; if the input is an image feature map, it is close to the origin of the coordinate system. [xmax, ymax] is the right-bottom coordinate of the anchor box.
- PriorBoxVar : (Tensor) PriorBoxVar is a 2-D Tensor with shape [M, 4] holding M groups of variances.
- TargetBox : (LoDTensor or Tensor) A 2-D LoDTensor with shape [N, 4]; each box is represented as [xmin, ymin, xmax, ymax]. [xmin, ymin] is the left-top coordinate of the box; if the input is an image feature map, it is close to the origin of the coordinate system. [xmax, ymax] is the right-bottom coordinate of the box. This tensor can contain LoD information to represent a batch of inputs. One instance of this batch can contain different numbers of entities.

Outputs:
- OutputBox : (LoDTensor or Tensor) The output of box_coder_op, a tensor with shape [N, M, 4] representing the result of N target boxes encoded/decoded with M prior boxes and variances.

Attributes:
- code_type : (string, default encode_center_size) the code type used with the target box
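The center-size part of the code can be sketched as an encode/decode pair that invert each other (a hedged illustration of the schema's center/size terms only; it omits the tw/2 corner-offset terms and is not the operator's implementation):

```python
import numpy as np

def encode(target, prior, prior_var):
    """Encode a (cx, cy, w, h) target box against one prior box."""
    tx, ty, tw, th = target
    px, py, pw, ph = prior
    pxv, pyv, pwv, phv = prior_var
    return np.array([(tx - px) / pw / pxv,
                     (ty - py) / ph / pyv,
                     np.log(tw / pw) / pwv,
                     np.log(th / ph) / phv])

def decode(code, prior, prior_var):
    """Invert encode(): recover the (cx, cy, w, h) target box."""
    ox, oy, ow, oh = code
    px, py, pw, ph = prior
    pxv, pyv, pwv, phv = prior_var
    return np.array([pw * pxv * ox + px,
                     ph * pyv * oy + py,
                     np.exp(pwv * ow) * pw,
                     np.exp(phv * oh) * ph])
```

A quick sanity check is the round trip: decoding an encoded box should reproduce the original target box.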

## bipartite_match

This operator implements a greedy bipartite matching algorithm, which is used to obtain the matching with the maximum distance based on the input distance matrix. For an input 2-D matrix, the bipartite matching algorithm can find the matched column for each row, and can also find the matched row for each column. This operator only calculates matched indices from column to row. For each instance, the number of matched indices is the number of columns of the input distance matrix.

There are two outputs, to save matched indices and distances. In brief, this algorithm matches the best (maximum-distance) row entity to each column entity, and the matched indices are not duplicated in each row of ColToRowMatchIndices. If a column entity does not match any row entity, its entry in ColToRowMatchIndices is set to -1.

Please note that the input DistMat can be a LoDTensor (with LoD) or a Tensor. If it is a LoDTensor with LoD, the height of ColToRowMatchIndices is the batch size. If it is a Tensor, the height of ColToRowMatchIndices is 1.

Inputs:
- DistMat : (LoDTensor or Tensor) A 2-D LoDTensor with shape [K, M]. It is the pair-wise distance matrix between the entities represented by each row and each column. For example, assume one entity A has shape [K] and another entity B has shape [M]; DistMat[i][j] is the distance between A[i] and B[j]. The bigger the distance, the better the matching. Note: this tensor can contain LoD information to represent a batch of inputs. One instance of this batch can contain different numbers of entities.

Outputs:
- ColToRowMatchIndices : (Tensor) A 2-D Tensor with shape [N, M] of int type. N is the batch size. If ColToRowMatchIndices[i][j] is -1, it means B[j] does not match any entity in the i-th instance. Otherwise, it means B[j] is matched to row ColToRowMatchIndices[i][j] in the i-th instance.
- ColToRowMatchDist : (Tensor) A 2-D Tensor with shape [N, M] of float type. N is the batch size. If ColToRowMatchIndices[i][j] is -1, ColToRowMatchDist[i][j] is also -1.0. Otherwise, assume ColToRowMatchIndices[i][j] = d, and let the row offsets of each instance be LoD; then ColToRowMatchDist[i][j] = DistMat[d + LoD[i]][j].
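One common way to realize such a greedy matching for a single instance is sketched below (an illustrative reading of the description, not the operator's kernel; it assumes non-negative distances so -1 can mark exhausted rows/columns):

```python
import numpy as np

def greedy_bipartite_match(dist):
    """Greedy matching: repeatedly take the largest remaining distance,
    match that column to that row, then exclude both from further matching.
    Returns (col_to_row indices, matched distances); unmatched columns
    keep -1 / -1.0."""
    d = dist.astype(float).copy()
    K, M = d.shape
    col_to_row = -np.ones(M, dtype=int)
    col_dist = -np.ones(M)
    for _ in range(min(K, M)):
        r, c = np.unravel_index(np.argmax(d), d.shape)
        if d[r, c] < 0:  # everything remaining is already excluded
            break
        col_to_row[c] = r
        col_dist[c] = dist[r, c]
        d[r, :] = -1.0   # row r can no longer be matched
        d[:, c] = -1.0   # column c is done
    return col_to_row, col_dist
```

With dist = [[0.9, 0.1], [0.2, 0.8]], the largest entry 0.9 matches column 0 to row 0 first; row 0 and column 0 are then excluded, so column 1 matches row 1 with distance 0.8.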

## batch_norm

Batch Normalization.

Batch Norm has been implemented as discussed in the paper: https://arxiv.org/pdf/1502.03167.pdf. It can be used as a normalizer function for conv2d and fully_connected operations. The required data format for this layer is one of the following: 1. NHWC [batch, in_height, in_width, in_channels] 2. NCHW [batch, in_channels, in_height, in_width]

Inputs:
- X : The input tensor.
- Scale : A 1-dimensional tensor of size C that is applied to the output.
- Bias : A 1-dimensional tensor of size C that is applied to the output.
- Mean : The global mean (for training) or estimated mean (for testing).
- Variance : The global variance (for training) or estimated variance (for testing).

Outputs:
- Y : Result after normalization.
- MeanOut : Shares memory with Mean. Stores the global mean when training.
- VarianceOut : Shares memory with Variance. Stores the global variance when training.
- SavedMean (Intermediate) : Mean of the current mini-batch; applied to the output when training.
- SavedVariance (Intermediate) : Variance of the current mini-batch; applied to the output when training.

Attributes:
- is_test
- momentum
- epsilon
- data_layout
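The training-time computation from the paper can be sketched for the NCHW layout (a minimal per-channel illustration; `eps` plays the role of the epsilon attribute and this is not the operator's kernel):

```python
import numpy as np

def batch_norm_nchw(x, scale, bias, eps=1e-5):
    """Training-time batch norm on an NCHW tensor: normalize each channel
    with the mini-batch mean/variance, then apply scale and bias."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return scale.reshape(1, -1, 1, 1) * x_hat + bias.reshape(1, -1, 1, 1)
```

With scale = 1 and bias = 0, each channel of the output has (approximately) zero mean and unit variance over the mini-batch.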

## auc

Area Under The Curve (AUC) Operator.

This implementation computes the AUC according to the forward output and label. It is widely used in binary classification evaluation. As a note: if the input label contains values other than 0 and 1, it will be cast to bool. You can find the relevant definitions here: https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve

There are two types of possible curves: 1. ROC: Receiver operating characteristic 2. PR: Precision Recall

Inputs:
- Out : A floating point 2-D tensor with values in the range [0, 1]. Each row is sorted in descending order. This input should be the output of topk. Typically, this tensor indicates the probability of each label.
- Indices : An int 2-D tensor indicating the indices of the original tensor before sorting. Typically, this tensor indicates which label each probability stands for.
- Label : A 2-D int tensor indicating the label of the training data. The height is the batch size and the width is always 1.

Outputs:
- AUC : A scalar representing the current area-under-the-curve.

Attributes:
- curve : Curve type, can be 'ROC' or 'PR'.
- num_thresholds : The number of thresholds to use when discretizing the ROC curve.

## assign_value

AssignValue operator

$$Out = values$$

Outputs:
- Out : (Tensor) Output tensor of the assign_value operator.

Attributes:
- shape : (vector) Shape of values.
- dtype : data type of values.
- fp32_values : store the float values.
- int32_values : store the int values.

## split

Split operator

This operator splits the input tensor into multiple sub-tensors.

Example:

    Input = [[1,2], [3,4], [5,6]]
    sections = [2,1]
    axis = 0
    Output[0] = [[1,2], [3,4]]
    Output[1] = [[5,6]]

Inputs:
- X : (Tensor) Input tensor of the split operator.

Outputs:
- Out (Duplicable) : (Tensor) Output tensors of the split operator.

Attributes:
- sections : (vector) the length of each output along the specified axis.
- num : (int, default 0) Number of sub-tensors. This must evenly divide Input.dims()[axis].
- axis : (int, default 0) The axis along which the input will be split.
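The example above has a direct NumPy analogue (a hedged reference, not the operator itself; note that `np.split` takes cut points rather than section lengths):

```python
import numpy as np

x = np.array([[1, 2], [3, 4], [5, 6]])
# sections = [2, 1] along axis 0 corresponds to a single cut after row 2.
out0, out1 = np.split(x, [2], axis=0)
```

out0 holds the first two rows and out1 the remaining row, matching Output[0] and Output[1] in the example.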

## beam_search_decode

Pack the result of Beam search op into SentenceIds and SentenceScores.

Inputs:
- Ids : (LoDTensorArray) ids of the candidate words in each step.
- Scores : (LoDTensorArray) scores of the candidate words in each step.

Outputs:
- SentenceIds : (LoDTensor) All possible result sentences of word ids.
- SentenceScores : (LoDTensor) All possible result sentences of word scores.

## assign

Assign Operator

Out = X, when the type of X is one of [LoDTensor/SelectedRows/LoDTensorArray]; an error is raised if the type is not listed above.

Inputs:
- X : (LoDTensor, SelectedRows or LoDTensorArray) The input variable can be a LoDTensor, SelectedRows or LoDTensorArray.

Outputs:
- Out : (LoDTensor, SelectedRows or LoDTensorArray) The type of the output is the same as that of input X.

## adadelta

AdaDelta Operator.

The update is done as follows:

$$avg\_squared\_grad\_out = \rho * avg\_squared\_grad + (1 - \rho) * grad * grad \\ param\_update = - \sqrt{\frac{avg\_squared\_update + \epsilon}{avg\_squared\_grad\_out + \epsilon}} * grad \\ avg\_squared\_update\_out = \rho * avg\_squared\_update + (1 - \rho) * {param\_update}^2 \\ param\_out = param + param\_update$$

Inputs:
- Param : (Tensor) Input parameter.
- Grad : (Tensor) Input gradient.
- AvgSquaredGrad : (Tensor) Input average of squared gradient.
- AvgSquaredUpdate : (Tensor) Input average of squared parameter updates.

Outputs:
- ParamOut : (Tensor) Output parameter.
- AvgSquaredGradOut : (Tensor) Output average of squared gradient.
- AvgSquaredUpdateOut : (Tensor) Output average of squared parameter updates.

Attributes:
- rho : (float, default 0.95) Exponential decay rate for squared gradients.
- epsilon : (float, default 1.0e-6) Constant for numerical stability.

## nce

Compute and return the noise-contrastive estimation training loss. See 'Noise-contrastive estimation: A new estimation principle for unnormalized statistical models'. By default this operator uses a uniform distribution for sampling.

Inputs:
- Input : (Tensor) A tensor of shape [batch_size, dim].
- Label : (Tensor) A tensor of shape [batch_size, num_true_class]. 'num_true_class' is the number of target classes in each sample. The number of target classes per sample should be the same. If you have a variable number of target classes, you can pad them out to a constant number by either repeating them or padding with an otherwise unused class.
- Weight : (Tensor) A tensor of shape [num_class, dim]. 'num_class' is the total number of classes.
- Bias : (Tensor) A tensor of shape [num_class, 1]. 'num_class' is the total number of classes. It is a dispensable input.
- SampleWeight : (Tensor) A tensor of shape [batch_size, 1] storing a weight for each sample. It is a dispensable input. The default weight of each sample is 1.

Outputs:
- Cost : (Tensor) A tensor of shape [batch_size, 1]. Cost of samples.
- SampleLogits (Intermediate) : An intermediate tensor of shape [batch_size, num_neg_samples + num_pos_samples]. This tensor is an output of the forward kernel and is used in the backward kernel to compute gradients. Given that X is the dot product of the input tensor and the sampled labels' weights, 'SampleLogits' is sigmoid(X).
- SampleLabels (Intermediate) : An intermediate tensor of shape [batch_size, num_neg_samples + num_pos_samples]. This tensor is an output of the forward kernel and is used in the backward kernel to compute gradients.

Attributes:
- num_total_classes : Total number of classes in all samples.
- num_neg_samples : The number of negative classes. The default value is 10.
- custom_neg_classes : This attribute should only be used in unit tests. Classes in this list will be used as negative classes for every sample. Under normal conditions, users should avoid setting this attribute.

## linear_chain_crf

LinearChainCRF Operator.

Conditional Random Field defines an undirected probabilistic graph with nodes denoting random variables and edges denoting dependencies between these variables. CRF learns the conditional probability $P(Y|X)$, where $X = (x_1, x_2, ... , x_n)$ are structured inputs and $Y = (y_1, y_2, ... , y_n)$ are labels for the inputs.

Linear chain CRF is a special case of CRF that is useful for sequence labeling task. Sequence labeling tasks do not assume a lot of conditional independences among inputs. The only constraint they impose is that the input and output must be linear sequences. Thus, the graph of such a CRF is a simple chain or a line, which results in the linear chain CRF.

This operator implements the Forward-Backward algorithm for the linear chain CRF. Please refer to http://www.cs.columbia.edu/~mcollins/fb.pdf and http://cseweb.ucsd.edu/~elkan/250Bwinter2012/loglinearCRFs.pdf for details.

Equation:
1. Denote Input(Emission) to this operator as $x$.
2. The first D values of Input(Transition) are the starting weights, denoted as $a$ here.
3. The next D values of Input(Transition) are the ending weights, denoted as $b$ here.
4. The remaining values of Input(Transition) are the transition weights, denoted as $w$ here.
5. Denote Input(Label) as $s$ here.

The probability of a sequence $s$ of length $L$ is defined as: $$P(s) = (1/Z) \exp(a_{s_1} + b_{s_L} + \sum_{l=1}^L x_{s_l} + \sum_{l=2}^L w_{s_{l-1},s_l})$$

where $Z$ is a normalization value so that the sum of $P(s)$ over all possible sequences is 1, and $x$ is the emission feature weight to the linear chain CRF.
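The unnormalized log score inside the exponential can be computed directly (a hedged sketch of the formula above; computing $Z$ itself requires the forward algorithm and is omitted here):

```python
import numpy as np

def sequence_score(x, a, b, w, s):
    """Unnormalized log score of tag sequence s of length L:
    start weight a[s_1] + end weight b[s_L]
    + emission terms x[l, s_l] + transition terms w[s_{l-1}, s_l].
    P(s) = exp(score) / Z for a normalizer Z."""
    score = a[s[0]] + b[s[-1]] + x[np.arange(len(s)), s].sum()
    score += sum(w[s[l - 1], s[l]] for l in range(1, len(s)))
    return score
```

For a toy case with D = 2 tags and L = 2 positions, each of the four terms can be checked by hand against the formula.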

Finally, the linear chain CRF operator outputs the logarithm of the conditional likelihood of each training sample in a mini-batch.

NOTE: 1. The feature function for a CRF is made up of the emission features and the transition features. The emission feature weights are NOT computed in this operator. They MUST be computed first before this operator is called.

2. Because this operator performs global normalization over all possible sequences internally, it expects UNSCALED emission feature weights. Please do not call this op with the emission feature being the output of any nonlinear activation.

3. The 2nd dimension of Input(Emission) MUST be equal to the tag number.

Inputs:
- Emission : (LoDTensor) A 2-D LoDTensor with shape [N x D], where N is the size of the mini-batch and D is the total tag number. The unscaled emission weight matrix for the linear chain CRF.
- Transition : (Tensor) A 2-D Tensor with shape [(D + 2) x D]. The learnable parameter for the linear_chain_crf operator. See more details in the operator's comments.
- Label : (LoDTensor) A LoDTensor with shape [N x 1], where N is the total element number in a mini-batch. The ground truth.

Outputs:
- Alpha (Intermediate) : (Tensor) A 2-D Tensor with shape [N x D]. The forward vectors for the entire batch. Denote it as $\alpha$. $\alpha$ is a memo table used to calculate the normalization factor in the CRF. $\alpha[k, v]$ stores the unnormalized probabilities of all possible unfinished sequences of tags that end at position $k$ with tag $v$. For each $k$, $\alpha[k, v]$ is a vector of length $D$ with one component for each tag value $v$. This vector is called a forward vector and will also be used in backward computations.
- EmissionExps (Intermediate) : (Tensor) A 2-D Tensor with shape [N x D]. The exponentials of Input(Emission). This is an intermediate computational result in forward computation, and will be reused in backward computation.
- TransitionExps (Intermediate) : (Tensor) A 2-D Tensor with shape [(D + 2) x D]. The exponentials of Input(Transition). This is an intermediate computational result in forward computation, and will be reused in backward computation.
- LogLikelihood : (Tensor) The logarithm of the conditional likelihood of each training sample in a mini-batch. This is a 2-D tensor with shape [S x 1], where S is the sequence number in a mini-batch. The output is no longer a LoDTensor.

## logsigmoid

Logsigmoid Activation Operator

$$out = \log \frac{1}{1 + e^{-x}}$$

Inputs:
- X : Input of the LogSigmoid operator.

Outputs:
- Out : Output of the LogSigmoid operator.

## row_conv

Row-convolution Operator.

The row convolution is called lookahead convolution. This operator was introduced in the following paper for DeepSpeech2: http://www.cs.cmu.edu/~dyogatam/papers/wang+etal.iclrworkshop2016.pdf

The main motivation is that a bidirectional RNN, useful in DeepSpeech like speech models, learns representation for a sequence by performing a forward and a backward pass through the entire sequence. However, unlike unidirectional RNNs, bidirectional RNNs are challenging to deploy in an online and low-latency setting. The lookahead convolution incorporates information from future subsequences in a computationally efficient manner to improve unidirectional recurrent neural networks. The row convolution operator is different from the 1D sequence convolution, and is computed as follows:

Given an input sequence $in$ of length $t$ and input dimension $d$, and a filter ($W$) of size $context \times d$, the output sequence is convolved as:

$$out_{i, :} = \sum_{j=i}^{i + context} in_{j,:} \cdot W_{j-i, :}$$
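The lookahead mixing can be sketched as follows (an illustrative reading of the formula, not the operator's kernel; it assumes the filter row is indexed by the lookahead offset j - i and that the sum is truncated at the end of the sequence):

```python
import numpy as np

def row_conv(x, w):
    """Lookahead (row) convolution: each output step is an elementwise-
    weighted sum of the current and the following input steps.
    x: (T x D) input sequence; w: (context x D) filter."""
    T, D = x.shape
    context = w.shape[0]
    out = np.zeros_like(x)
    for i in range(T):
        for j in range(i, min(i + context, T)):
            out[i] += x[j] * w[j - i]
    return out
```

With a length-2 sequence and an all-ones filter of context 2, the first output step sums both input steps, while the last step (with no future context left) only sees itself.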

Inputs:
- X : (LoDTensor), the input is a LoDTensor, which supports variable time-length input sequences. The underlying tensor in this LoDTensor is a matrix with shape (T x N), where T is the total number of time steps in this mini-batch and N is the input data dimension.
- Filter : (Tensor), a learnable parameter. It is a 2-D tensor with shape (future_context x N), where future_context is the future context length and N is the data dimension.

Outputs:
- Out : (LoDTensor), the output is a LoDTensor, which supports variable time-length input sequences. The underlying tensor is a matrix with shape T x N, i.e., the same shape as X.

## exp

Exp Activation Operator.

$out = e^x$

Inputs:
- X : Input of the Exp operator.

Outputs:
- Out : Output of the Exp operator.