Operators

sgd

SGD operator

This operator implements one step of the stochastic gradient descent algorithm.

$$param\_out = param - learning\_rate * grad$$

Inputs:
  • Param : (Tensor) Input parameter
  • LearningRate : (Tensor) Learning rate of SGD
  • Grad : (Tensor) Input gradient
Outputs:
  • ParamOut : (Tensor) Output parameter
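
The update rule above can be sketched in NumPy as follows. This is only an illustration of the formula, not the operator's implementation; the function name `sgd_step` and the sample values are assumptions made for this example.

```python
import numpy as np


def sgd_step(param, grad, learning_rate):
    """One SGD step: param_out = param - learning_rate * grad."""
    return param - learning_rate * grad


# Hypothetical sample values, chosen only for illustration.
param = np.array([1.0, -2.0, 0.5])
grad = np.array([0.5, 0.5, -1.0])
param_out = sgd_step(param, grad, learning_rate=0.1)
print(param_out)
```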

adagrad

Adaptive Gradient Algorithm (Adagrad).

The update is done as follows:

$$moment\_out = moment + grad * grad \\ param\_out = param - \frac{learning\_rate * grad}{\sqrt{moment\_out} + \epsilon} $$

The original paper (http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf) does not include the epsilon term. It is added in this implementation, as also proposed at http://cs231n.github.io/neural-networks-3/#ada, for numerical stability: it avoids division by zero.

Inputs:
  • Param : (Tensor) Input parameter
  • Grad : (Tensor) Input gradient
  • Moment : (Tensor) Second moment
  • LearningRate : (Tensor) Learning rate
Outputs:
  • ParamOut : (Tensor) Output parameter
  • MomentOut : (Tensor) Output second moment
Attributes:
  • epsilon (Duplicable): (float, default 1.0e-6) Constant for numerical stability
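
A minimal NumPy sketch of the Adagrad update above, assuming element-wise operations on arrays of equal shape. The function name `adagrad_step` and the sample values are hypothetical; only the formula and the default epsilon come from this section.

```python
import numpy as np


def adagrad_step(param, grad, moment, learning_rate, epsilon=1.0e-6):
    """Apply one Adagrad update and return (param_out, moment_out)."""
    # Accumulate the squared gradient.
    moment_out = moment + grad * grad
    # Scale the step by the root of the accumulated moment; epsilon
    # keeps the denominator non-zero when moment_out is zero.
    param_out = param - learning_rate * grad / (np.sqrt(moment_out) + epsilon)
    return param_out, moment_out


param = np.array([1.0, 2.0])
grad = np.array([3.0, 0.0])
moment = np.zeros(2)
param_out, moment_out = adagrad_step(param, grad, moment, learning_rate=0.1)
```

The second element has a zero gradient and a zero moment; without epsilon the update would evaluate 0/0.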

conv3d

Convolution3D Operator.

The convolution operation calculates the output based on the input, the filter, and the strides, paddings, dilations and groups parameters. The size of each dimension of the parameters is checked during infer-shape. Input(Input) and Output(Output) are in NCDHW format, where N is batch size, C is the number of channels, D is the depth of the feature, H is the height of the feature, and W is the width of the feature. Input(Filter) is in MCDHW format, where M is the number of output image channels, C is the number of input image channels, D is the depth of the filter, H is the height of the filter, and W is the width of the filter. The parameters (strides, paddings, dilations) each have three elements, which represent depth, height and width, respectively. The Input(Input) size and Output(Output) size may be different.

Example:
  Input:
    Input shape: $(N, C_{in}, D_{in}, H_{in}, W_{in})$
    Filter shape: $(C_{out}, C_{in}, D_f, H_f, W_f)$
  Output:
    Output shape: $(N, C_{out}, D_{out}, H_{out}, W_{out})$
  Where
  $$ D_{out}= \frac{(D_{in} + 2 * paddings[0] - (dilations[0] * (D_f - 1) + 1))}{ strides[0]}+ 1 \\ H_{out}= \frac{(H_{in} + 2 * paddings[1] - (dilations[1] * (H_f - 1) + 1))}{ strides[1]}+ 1 \\ W_{out}= \frac{(W_{in} + 2 * paddings[2] - (dilations[2] * (W_f - 1) + 1))}{ strides[2]}+ 1 $$
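
The output-shape formulas can be checked with a short Python sketch. The helper names are hypothetical, and the sketch assumes the division is an integer (floor) division, which the formula above does not state explicitly.

```python
def conv_output_size(in_size, filter_size, stride=1, padding=0, dilation=1):
    """Output extent along one spatial dimension."""
    # Extent covered by a dilated filter.
    effective = dilation * (filter_size - 1) + 1
    # Assumption: the fraction in the formula is a floor division.
    return (in_size + 2 * padding - effective) // stride + 1


def conv3d_output_shape(input_shape, filter_shape,
                        strides=(1, 1, 1), paddings=(0, 0, 0),
                        dilations=(1, 1, 1)):
    """Map an NCDHW input shape and an MCDHW filter shape to the output shape."""
    n = input_shape[0]
    c_out = filter_shape[0]
    spatial = tuple(
        conv_output_size(i, f, s, p, d)
        for i, f, s, p, d in zip(input_shape[2:], filter_shape[2:],
                                 strides, paddings, dilations)
    )
    return (n, c_out) + spatial


shape = conv3d_output_shape((2, 3, 8, 16, 16), (4, 3, 3, 3, 3),
                            strides=(1, 2, 2), paddings=(1, 1, 1))
print(shape)
```

The same per-dimension helper applies to the two-dimensional case by passing only the height and width entries.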

Inputs:
  • Input : (Tensor) The input tensor of convolution operator. The format of input tensor is NCDHW, where N is batch size, C is the number of channels, D is the depth of the feature, H is the height of the feature, and W is the width of the feature.
  • Filter : (Tensor) The filter tensor of convolution operator. The format of the filter tensor is MCDHW, where M is the number of output image channels, C is the number of input image channels, D is the depth of the filter, H is the height of the filter, and W is the width of the filter. If the groups attribute is greater than 1, C equals the number of input image channels divided by the groups.
Outputs:
  • Output : (Tensor) The output tensor of convolution operator. The format of output tensor is also NCDHW.
Attributes:
  • strides (Duplicable): (vector<int>, default: {1, 1, 1}) The strides (d_stride, h_stride, w_stride) of the convolution operator.
  • paddings (Duplicable): (vector<int>, default: {0, 0, 0}) The paddings (d_pad, h_pad, w_pad) of the convolution operator.
  • groups (Duplicable): (int, default: 1) The groups number of the convolution operator. According to grouped convolution in Alex Krizhevsky's Deep CNN paper: when groups=2, the first half of the filters is only connected to the first half of the input channels, while the second half of the filters is only connected to the second half of the input channels.
  • dilations (Duplicable): (vector<int>, default: {1, 1, 1}) The dilations (d_dilation, h_dilation, w_dilation) of the convolution operator.

conv2d

Convolution Operator.

The convolution operation calculates the output based on the input, the filter, and the strides, paddings, dilations and groups parameters. The size of each dimension of the parameters is checked during infer-shape. Input(Input) and Output(Output) are in NCHW format, where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature. Input(Filter) is in MCHW format, where M is the number of output image channels, C is the number of input image channels, H is the height of the filter, and W is the width of the filter. The parameters (strides, paddings, dilations) each have two elements, which represent height and width, respectively. The Input(Input) size and Output(Output) size may be different.

Example:
  Input:
    Input shape: $(N, C_{in}, H_{in}, W_{in})$
    Filter shape: $(C_{out}, C_{in}, H_f, W_f)$
  Output:
    Output shape: $(N, C_{out}, H_{out}, W_{out})$
  Where
  $$ H_{out}= \frac{(H_{in} + 2 * paddings[0] - (dilations[0] * (H_f - 1) + 1))}{strides[0]}+ 1 \\ W_{out}= \frac{(W_{in} + 2 * paddings[1] - (dilations[1] * (W_f - 1) + 1))}{strides[1]}+ 1 $$

Inputs:
  • Input : (Tensor) The input tensor of convolution operator. The format of input tensor is NCHW, where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature.
  • Filter : (Tensor) The filter tensor of convolution operator. The format of the filter tensor is MCHW, where M is the number of output image channels, C is the number of input image channels, H is the height of the filter, and W is the width of the filter. If the groups attribute is greater than 1, C equals the number of input image channels divided by the groups.
Outputs:
  • Output : (Tensor) The output tensor of convolution operator. The format of output tensor is also NCHW.
Attributes:
  • strides (Duplicable): (vector<int>, default: {1, 1}) The strides (h_stride, w_stride) of the convolution operator.
  • paddings (Duplicable): (vector<int>, default: {0, 0}) The paddings (h_pad, w_pad) of the convolution operator.
  • groups (Duplicable): (int, default: 1) The groups number of the convolution operator. According to grouped convolution in Alex Krizhevsky's Deep CNN paper: when groups=2, the first half of the filters is only connected to the first half of the input channels, while the second half of the filters is only connected to the second half of the input channels.
  • dilations (Duplicable): (vector<int>, default: {1, 1}) The dilations (h_dilation, w_dilation) of the convolution operator.

pool3d

Pool3d Operator.

The pooling3d operation calculates the output based on the input, pooling_type, ksize, strides, and paddings parameters. Input(X) and output(Out) are in NCDHW format, where N is batch size, C is the number of channels, and D, H and W are the depth, height and width of the feature, respectively. Parameters(ksize, strides, paddings) are three elements. These three elements represent depth, height and width, respectively. The input(X) size and output(Out) size may be different.

Example:
  Input:
    X shape: $(N, C, D_{in}, H_{in}, W_{in})$
  Output:
    Out shape: $(N, C, D_{out}, H_{out}, W_{out})$
  Where
  $$ D_{out} = \frac{(D_{in} - ksize[0] + 2 * paddings[0])}{strides[0]} + 1 \\ H_{out} = \frac{(H_{in} - ksize[1] + 2 * paddings[1])}{strides[1]} + 1 \\ W_{out} = \frac{(W_{in} - ksize[2] + 2 * paddings[2])}{strides[2]} + 1 $$

Inputs:
  • X : (Tensor) The input tensor of pooling operator. The format of input tensor is NCDHW, where N is batch size, C is the number of channels, and D, H and W are the depth, height and width of the feature, respectively.
Outputs:
  • Out : (Tensor) The output tensor of pooling operator. The format of output tensor is also NCDHW, where N is batch size, C is the number of channels, and D, H and W are the depth, height and width of the feature, respectively.
Attributes:
  • pooling_type (Duplicable): (string) Pooling type, can be "max" for max-pooling and "avg" for average-pooling.
  • ksize (Duplicable): (vector<int>) The pooling window size (depth, height, width) of the pooling operator. If global_pooling = true, ksize and paddings will be ignored.
  • global_pooling (Duplicable): (bool, default false) Whether to use the global pooling. If global_pooling = true, ksize and paddings will be ignored.
  • strides (Duplicable): (vector<int>, default {1,1,1}) Strides (depth, height, width) of the pooling operator.
  • paddings (Duplicable): (vector<int>, default {0,0,0}) Paddings (depth, height, width) of the pooling operator. If global_pooling = true, ksize and paddings will be ignored.

pool2d

Pool2d Operator.

The pooling2d operation calculates the output based on the input, pooling_type and ksize, strides, paddings parameters. Input(X) and output(Out) are in NCHW format, where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature. Parameters(ksize, strides, paddings) are two elements. These two elements represent height and width, respectively. The input(X) size and output(Out) size may be different.

Example:
  Input:
    X shape: $(N, C, H_{in}, W_{in})$
  Output:
    Out shape: $(N, C, H_{out}, W_{out})$
  Where
  $$ H_{out} = \frac{(H_{in} - ksize[0] + 2 * paddings[0])}{strides[0]} + 1 \\ W_{out} = \frac{(W_{in} - ksize[1] + 2 * paddings[1])}{strides[1]} + 1 $$
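
The following NumPy sketch is a naive reference for the two-dimensional pooling described above. It is not the operator's implementation. Two assumptions are made here: the division in the shape formula is a floor division, and padded positions are excluded from each window (so an average is taken over valid cells only). The function name `pool2d` mirrors the operator name for readability.

```python
import numpy as np


def pool2d(x, ksize, strides=(1, 1), paddings=(0, 0), pooling_type="max"):
    """Naive max/avg pooling over an NCHW array."""
    n, c, h_in, w_in = x.shape
    h_out = (h_in - ksize[0] + 2 * paddings[0]) // strides[0] + 1
    w_out = (w_in - ksize[1] + 2 * paddings[1]) // strides[1] + 1
    reduce_fn = np.max if pooling_type == "max" else np.mean
    out = np.empty((n, c, h_out, w_out), dtype=float)
    for i in range(h_out):
        h0 = i * strides[0] - paddings[0]
        # Clip the window to the valid input region.
        h_lo, h_hi = max(h0, 0), min(h0 + ksize[0], h_in)
        for j in range(w_out):
            w0 = j * strides[1] - paddings[1]
            w_lo, w_hi = max(w0, 0), min(w0 + ksize[1], w_in)
            window = x[:, :, h_lo:h_hi, w_lo:w_hi]
            out[:, :, i, j] = reduce_fn(window, axis=(2, 3))
    return out


x = np.arange(16, dtype=float).reshape(1, 1, 4, 4)
max_out = pool2d(x, ksize=(2, 2), strides=(2, 2), pooling_type="max")
avg_out = pool2d(x, ksize=(2, 2), strides=(2, 2), pooling_type="avg")
```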

Inputs:
  • X : (Tensor) The input tensor of pooling operator. The format of input tensor is NCHW, where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature.
Outputs:
  • Out : (Tensor) The output tensor of pooling operator. The format of output tensor is also NCHW, where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature.
Attributes:
  • pooling_type (Duplicable): (string), pooling type, can be "max" for max-pooling and "avg" for average-pooling.
  • ksize (Duplicable): (vector<int>) The pooling window size(height, width) of the pooling operator. If global_pooling = true, ksize and paddings will be ignored.
  • global_pooling (Duplicable): (bool, default false) Whether to use the global pooling. If global_pooling = true, ksize and paddings will be ignored.
  • strides (Duplicable): (vector<int>, default {1, 1}), strides(height, width) of pooling operator.
  • paddings (Duplicable): (vector<int>, default {0,0}), paddings(height, width) of pooling operator. If global_pooling = true, paddings and ksize will be ignored.

max_pool3d_with_index

MaxPool3d Operator.

The maxpooling3d with index operation calculates the output and the mask based on the input and ksize, strides, paddings parameters. Input(X) and output(Out, Mask) are in NCDHW format, where N is batch size, C is the number of channels, and D, H and W are the depth, height and width of the feature, respectively. Parameters(ksize, strides, paddings) are three elements. These three elements represent depth, height and width, respectively. The input(X) size and output(Out, Mask) size may be different.

Example:
  Input:
    X shape: $(N, C, D_{in}, H_{in}, W_{in})$
  Output:
    Out shape: $(N, C, D_{out}, H_{out}, W_{out})$
    Mask shape: $(N, C, D_{out}, H_{out}, W_{out})$
  Where
  $$ D_{out} = \frac{(D_{in} - ksize[0] + 2 * paddings[0])}{strides[0]} + 1 \\ H_{out} = \frac{(H_{in} - ksize[1] + 2 * paddings[1])}{strides[1]} + 1 \\ W_{out} = \frac{(W_{in} - ksize[2] + 2 * paddings[2])}{strides[2]} + 1 $$

Inputs:
  • X : (Tensor) The input tensor of pooling operator. The format of input tensor is NCDHW, where N is batch size, C is the number of channels, and D, H and W are the depth, height and width of the image, respectively.
Outputs:
  • Out : (Tensor) The output tensor of pooling operator. The format of output tensor is also NCDHW, where N is the batch size, C is the number of channels, and D, H and W are the depth, height and width of the image, respectively.
  • Mask : (Tensor) The Mask tensor of pooling operator. The format of output tensor is also NCDHW, where N is the batch size, C is the number of channels, and D, H and W are the depth, height and width of the image, respectively. It represents the index in the current feature map.
Attributes:
  • ksize (Duplicable): (vector<int>) The pooling window size(depth, height, width) of pooling operator. If global_pooling = true, ksize and paddings will be ignored.
  • global_pooling (Duplicable): (bool, default false) Whether to use the global pooling. If global_pooling = true, ksize and paddings will be ignored.
  • strides (Duplicable): (vector<int>, default {1,1,1}), strides(depth, height, width) of pooling operator.
  • paddings (Duplicable): (vector<int>, default {0,0,0}), paddings(depth, height, width) of pooling operator. If global_pooling = true, paddings and ksize will be ignored.
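
To make the relation between Out and Mask concrete, here is a naive NumPy sketch of max pooling that also records where each maximum came from. It is an illustration only. Padding is omitted for brevity, and the Mask encoding used here (a flattened index `(d * H_in + h) * W_in + w` into the D x H x W feature map) is an assumption about how "the index in the current feature map" is stored.

```python
import numpy as np


def max_pool3d_with_index(x, ksize, strides):
    """Naive NCDHW max pooling without padding; returns (out, mask)."""
    n, c, d_in, h_in, w_in = x.shape
    d_out = (d_in - ksize[0]) // strides[0] + 1
    h_out = (h_in - ksize[1]) // strides[1] + 1
    w_out = (w_in - ksize[2]) // strides[2] + 1
    out = np.empty((n, c, d_out, h_out, w_out), dtype=x.dtype)
    mask = np.empty(out.shape, dtype=np.int64)
    for d in range(d_out):
        for h in range(h_out):
            for w in range(w_out):
                d0, h0, w0 = d * strides[0], h * strides[1], w * strides[2]
                window = x[:, :, d0:d0 + ksize[0],
                           h0:h0 + ksize[1], w0:w0 + ksize[2]]
                flat = window.reshape(n, c, -1)
                arg = flat.argmax(axis=2)
                out[:, :, d, h, w] = flat.max(axis=2)
                # Convert the in-window position to feature-map coordinates.
                kd, kh, kw = np.unravel_index(arg, window.shape[2:])
                mask[:, :, d, h, w] = (
                    ((d0 + kd) * h_in + (h0 + kh)) * w_in + (w0 + kw)
                )
    return out, mask


# Each value equals its own flattened index, so mask should equal out.
x = np.arange(2 * 4 * 4, dtype=float).reshape(1, 1, 2, 4, 4)
out, mask = max_pool3d_with_index(x, ksize=(2, 2, 2), strides=(2, 2, 2))
```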

lod_rank_table

Creates a LoDRankTable from a LoDTensor.

The LoD rank table stores a level of the LoD, ordered by sequence length in descending order. It is useful when implementing dynamic RNN and is shared by the dynamic RNN memory, dynamic RNN slice input and dynamic RNN slice output operators.

Inputs:
  • X : (LoDTensor) input lod tensor, must contain lod information.
Outputs:
  • Out : (LoDRankTable) The rank table of specific level.
Attributes:
  • level (Duplicable): (int) the specific lod level to rank.
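
The ordering described above can be sketched in plain Python. The representation chosen here, a list of (sequence index, sequence length) pairs built from LoD offsets, is an assumption for illustration; see the LoDRankTable source for the actual data structure.

```python
def lod_rank_table(lod, level=0):
    """Rank the sequences of one LoD level by length, longest first.

    `lod` is a list of offset lists, e.g. [[0, 2, 5, 7]] describes three
    sequences of lengths 2, 3 and 2.
    """
    offsets = lod[level]
    lengths = [end - start for start, end in zip(offsets[:-1], offsets[1:])]
    # sorted() is stable, so sequences of equal length keep their
    # original relative order.
    return sorted(enumerate(lengths), key=lambda item: item[1], reverse=True)


table = lod_rank_table([[0, 2, 5, 7]])
print(table)
```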

array_to_lod_tensor

This Op builds a big LoDTensor from a std::vector and a LoDRankTable. It is intended for converting a dynamic RNN's outputs back to a normal LoDTensor. The std::vector would be the output of the RNN Op and the LoDRankTable would be built from the RNN's input.

Inputs:
  • X : (std::vector<LoDTensor>) A vector of tensors that is going to be cast to a big LoDTensor.
  • RankTable : (LoDRankTable) RankTable provides the coarse LoD information to build the output LoDTensor. See 'paddle/framework/lod_rank_table.h' for more details.
Outputs:
  • Out : (LoDTensor) The LoDTensor formed by input tensor array.

sequence_conv

Sequence Conv Operator.

SequenceConvOp performs convolution operation on features of contextLength time-steps of each instance. The convolution operation calculates the output based on the input, filter, strides and paddings parameters. The size of each dimension of the parameters is checked during infer-shape. In order to ensure the equal length of sequence before and after convolution, it is necessary to fill the top and bottom of each sequence based on context_length, context_stride and context_start.

Inputs:
  • X : (LoDTensor) the input(X) is a LoDTensor, which supports variable-time length input sequence. The underlying tensor in this LoDTensor is a matrix with shape (T, N), where T is the total time steps in this mini-batch and N is the input_hidden_size.
  • PaddingData : (Tensor, optional) the input(PaddingData) is an optional parameter, and it is learnable. This is a tensor with shape (P, N), where P is the top_pad + bottom_pad, N is the input_hidden_size. In order to ensure the equal length of sequence before and after convolution, it is necessary to fill the top and bottom of each sequence according to context_length, context_stride and context_start.
  • Filter : (Tensor) the input(Filter) is a learnable parameter. This is a tensor with shape (K, M), where K is the context_length * input_hidden_size, M is the output feature size.
Outputs:
  • Out : (LoDTensor) the output(Out) is a LoDTensor, which supports variable-time length output sequence. The underlying tensor in this LoDTensor is a matrix with shape (T, M), where T is the total time steps in this mini-batch, M is the output feature size.
Attributes:
  • paddingTrainable (Duplicable): (bool, default:false) the padding data of SequenceConvOp is trainable or not.
  • contextLength (Duplicable): (int) the contextLength of SequenceConvOp is the height of the convolution kernel.
  • contextStart (Duplicable): (int, default:0) the contextStart of SequenceConvOp represents the beginning of the convolution of the number of rows of sequence, which can be negative. The negative number means to pad contextStart time-steps of zeros or learnable parameters at the beginning of each instance. The positive number means to skip contextStart time-steps of each instance.
  • contextStride (Duplicable): (int, default:1) the contextStride of SequenceConvOp represents the stride length of convolution kernel. Currently, SequenceConvOp only supports contextStride=1.
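
As a rough illustration of how contextLength and contextStart interact with the (T, N) input and the (K, M) filter, the sketch below gathers a context window for every time step and multiplies it by the filter. It assumes contextStride=1 and zero padding (no learnable PaddingData); the function name and the single-level offset list `lod` are hypothetical.

```python
import numpy as np


def sequence_conv(x, lod, filt, context_length, context_start=0):
    """Context-window convolution over variable-length sequences.

    x: (T, N) matrix, lod: offsets such as [0, 2, 5],
    filt: (context_length * N, M) matrix. Returns a (T, M) matrix.
    """
    t, n = x.shape
    cols = np.zeros((t, context_length * n), dtype=x.dtype)
    for start, end in zip(lod[:-1], lod[1:]):
        for row in range(start, end):
            for k in range(context_length):
                src = row + context_start + k
                # Rows outside the current sequence stay zero (padding).
                if start <= src < end:
                    cols[row, k * n:(k + 1) * n] = x[src]
    return cols @ filt


x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # two sequences: [1, 2] and [3, 4, 5]
filt = np.ones((3, 1))
out = sequence_conv(x, [0, 2, 5], filt, context_length=3, context_start=-1)
```

With an all-ones filter each output row is the sum of its window, which makes the padding at sequence boundaries easy to see.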

sequence_pool

Sequence Pool Operator.

The SequencePoolOp pools features of all time-steps of each instance. It supports six pooling types:

  1. AVERAGE: $$Out[i] = \frac{\sum_j X_{ij}}{len(X_i)}$$
  2. SUM: $$Out[i] = \sum_j X_{ij}$$
  3. SQRT: $$Out[i] = \frac{\sum_j X_{ij}}{\sqrt{len(X_i)}}$$
  4. LAST: Out[i] = last instance in i-th sequence X[i]
  5. FIRST: Out[i] = first instance in i-th sequence X[i]
  6. MAX: $$Out[i] = max(X_i)$$

The following example explains how this works: For a mini-batch of 3 variable-length sentences, containing 2, 3, and 2 time-steps:

Assume X is a [7,M,N] LoDTensor, and X->lod()[0] = [0, 2, 5, 7], 7=2+3+2. Besides, for the sake of simplicity, we assume M=1 and N=1, and the value of X = [[1, 3], [2, 4, 6], [5, 1]].

Thus, Out is a [3,1,1] Tensor without LoD information. For each pooltype, the value of Out is as follows:

  • AVERAGE: [2, 4, 3], where 2=(1+3)/2, 4=(2+4+6)/3, 3=(5+1)/2
  • SUM: [4, 12, 6], where 4=1+3, 12=2+4+6, 6=5+1
  • SQRT: [2.83, 6.93, 4.24], where 2.83=(1+3)/sqrt(2), 6.93=(2+4+6)/sqrt(3), 4.24=(5+1)/sqrt(2)
  • MAX: [3, 6, 5], where 3=max(1,3), 6=max(2,4,6), 5=max(5,1)
  • LAST: [3, 6, 1], where 3=last(1,3), 6=last(2,4,6), 1=last(5,1)
  • FIRST: [1, 2, 5], where 1=first(1,3), 2=first(2,4,6), 5=first(5,1)
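
The worked example above can be reproduced with a few lines of NumPy. This is a sketch for the M=1, N=1 case only; the function name, the string pooltype keys and the flat offset list are assumptions for illustration (the operator itself takes pooltype as an attribute).

```python
import numpy as np

POOL_FNS = {
    "AVERAGE": lambda seq: seq.mean(),
    "SUM": lambda seq: seq.sum(),
    "SQRT": lambda seq: seq.sum() / np.sqrt(len(seq)),
    "MAX": lambda seq: seq.max(),
    "LAST": lambda seq: seq[-1],
    "FIRST": lambda seq: seq[0],
}


def sequence_pool(x, offsets, pooltype):
    """Pool each sequence of a flat array delimited by LoD offsets."""
    fn = POOL_FNS[pooltype]
    return np.array([fn(x[s:e]) for s, e in zip(offsets[:-1], offsets[1:])])


x = np.array([1, 3, 2, 4, 6, 5, 1], dtype=float)
offsets = [0, 2, 5, 7]  # X->lod()[0]
for name in POOL_FNS:
    print(name, sequence_pool(x, offsets, name))
```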
Inputs:
  • X : (LoDTensor) The variable-length input of SequencePoolOp
Outputs:
  • Out : (Tensor) The output of SequencePoolOp does not contain LoD information.
  • MaxIndex (Intermediate) : (Tensor<int>) This tensor is used for the sequence max-pooling to record the max indexes.
Attributes:
  • pooltype (Duplicable): (int, default AVERAGE) the pooling type of SequencePoolOp.

lstm

Long-Short Term Memory (LSTM) Operator.

The default implementation uses diagonal/peephole connections (https://arxiv.org/pdf/1402.1128.pdf). The formula is as follows:

$$ i_t = \sigma(W_{ix}x_{t} + W_{ih}h_{t-1} + W_{ic}c_{t-1} + b_i) \\ f_t = \sigma(W_{fx}x_{t} + W_{fh}h_{t-1} + W_{fc}c_{t-1} + b_f) \\ \tilde{c_t} = act_g(W_{cx}x_t + W_{ch}h_{t-1} + b_c) \\ o_t = \sigma(W_{ox}x_{t} + W_{oh}h_{t-1} + W_{oc}c_t + b_o) \\ c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c_t} \\ h_t = o_t \odot act_h(c_t) $$

where the W terms denote weight matrices (e.g. $W_{ix}$ is the matrix of weights from the input to the input gate), and $W_{ic}, W_{fc}, W_{oc}$ are diagonal weight matrices for peephole connections. In our implementation, we use vectors to represent these diagonal weight matrices. The b terms denote bias vectors ($b_i$ is the input gate bias vector), $\sigma$ is a non-linear activation, such as the logistic sigmoid function, and $i, f, o$ and $c$ are the input gate, forget gate, output gate, and cell activation vectors, respectively, all of which have the same size as the cell output activation vector $h$.

The $\odot$ is the element-wise product of the vectors. $act_g$ and $act_h$ are the cell input and cell output activation functions; tanh is usually used for them. $\tilde{c_t}$ is also called the candidate hidden state, which is computed based on the current input and the previous hidden state.

Set use_peepholes to False to disable the peephole connections. The formula is omitted here; please refer to the paper http://www.bioinf.jku.at/publications/older/2604.pdf for details.

Note that the $W_{ix}x_{t}, W_{fx}x_{t}, W_{cx}x_{t}, W_{ox}x_{t}$ operations on the input $x_{t}$ are NOT included in this operator. Users can apply a fully-connected operator before the LSTM operator.
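
A single time step of the formulas above can be sketched in NumPy as follows. This is an illustration, not the operator's batched implementation. It assumes sigmoid gates and tanh for $act_g$ and $act_h$, row-vector convention (`h_prev @ weight`), and that the precomputed input projections in `x_proj` are ordered like the Weight and Bias blocks described in this section ({c, i, f, o}). The function and argument names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_proj, h_prev, c_prev, weight, bias, use_peepholes=True):
    """One LSTM step for a single sample.

    x_proj: (4D,) input projections, assumed ordered {c, i, f, o}.
    weight: (D, 4D) hidden-hidden weights {W_ch, W_ih, W_fh, W_oh}.
    bias:   (4D,) or, with peepholes, (7D,) = biases + {W_ic, W_fc, W_oc}.
    """
    d = h_prev.shape[0]
    gates = x_proj + h_prev @ weight + bias[:4 * d]
    g_c, g_i, g_f, g_o = np.split(gates, 4)
    if use_peepholes:
        # Diagonal peephole matrices are stored as vectors.
        w_ic, w_fc, w_oc = np.split(bias[4 * d:], 3)
        g_i = g_i + w_ic * c_prev
        g_f = g_f + w_fc * c_prev
    i_t = sigmoid(g_i)
    f_t = sigmoid(g_f)
    c_tilde = np.tanh(g_c)
    c_t = f_t * c_prev + i_t * c_tilde
    if use_peepholes:
        # The output gate peeks at the updated cell state c_t.
        g_o = g_o + w_oc * c_t
    o_t = sigmoid(g_o)
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t


d = 2
h_t, c_t = lstm_step(np.zeros(4 * d), np.zeros(d), np.array([1.0, -1.0]),
                     np.zeros((d, 4 * d)), np.zeros(7 * d))
```

With all weights at zero every gate evaluates to 0.5, so the cell state is simply halved, which gives an easy sanity check.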

Inputs:
  • Input : (LoDTensor) the first input is a LoDTensor, which supports variable-time length input sequence. The underlying tensor in this LoDTensor is a matrix with shape (T x 4D), where T is the total time steps in this mini-batch, D is the hidden size.
  • H0 : (Tensor, optional) the initial hidden state is an optional input. This is a tensor with shape (N x D), where N is the batch size and D is the hidden size.
  • C0 : (Tensor, optional) the initial cell state is an optional input. This is a tensor with shape (N x D), where N is the batch size. `H0` and `C0` can be NULL, but only at the same time.
  • Weight : (Tensor) the learnable hidden-hidden weights. - The shape is (D x 4D), where D is the hidden size. - Weight = {W_ch, W_ih, W_fh, W_oh}
  • Bias : (Tensor) the learnable weights, which contain two parts: the input-hidden bias weight and, if `use_peepholes` is True, the peephole connection weights. 1. `use_peepholes = False` - The shape is (1 x 4D). - Bias = {b_c, b_i, b_f, b_o}. 2. `use_peepholes = True` - The shape is (1 x 7D). - Bias = {b_c, b_i, b_f, b_o, W_ic, W_fc, W_oc}.
Outputs:
  • Hidden : (LoDTensor) the hidden state of LSTM operator. The shape is (T x D), and lod is the same with the `Input`.
  • Cell : (LoDTensor) the cell state of LSTM operator. The shape is (T x D), and lod is the same with the `Input`.
  • BatchGate (Intermediate) : (LoDTensor) This LoDTensor contains input gate, forget gate and output gate after the nonlinear computation. This LoDTensor has the same shape as the reorganized input, which is also called batch input. The LoD size is 2. The first LoD is the batch offsets and the second LoD contains the indexes, which denote the position of reorganized sequence in the raw input.
  • BatchCellPreAct (Intermediate) : (LoDTensor) This LoDTensor is obtained in the forward and used in the backward.
Attributes:
  • use_peepholes (Duplicable): (bool, default: True) whether to enable diagonal/peephole connections.
  • is_reverse (Duplicable): (bool, default: False) whether to compute reversed LSTM.
  • gate_activation (Duplicable): (string, default: sigmoid) The activation for input gate, forget gate and output gate, `sigmoid` by default.
  • cell_activation (Duplicable): (string, default: tanh) The activation for cell output, `tanh` by default.
  • candidate_activation (Duplicable): (string, default: tanh) The activation for candidate hidden state, `tanh` by default.

conv3d_transpose

Convolution3D Transpose Operator.

The convolution transpose operation calculates the output based on the input, the filter, and the strides, paddings and groups parameters. The size of each dimension of the parameters is checked during infer-shape. Input(Input) and Output(Output) are in NCDHW format, where N is batch size, C is the number of channels, D is the depth of the feature, H is the height of the feature, and W is the width of the feature. Input(Filter) is in MCDHW format, where M is the number of input feature channels, C is the number of output feature channels, D is the depth of the filter, H is the height of the filter, and W is the width of the filter. The parameters (strides, paddings) each have three elements, which represent depth, height and width, respectively. The Input(Input) size and Output(Output) size may be different.

Example:
  Input:
    Input shape: $(N, C_{in}, D_{in}, H_{in}, W_{in})$
    Filter shape: $(C_{in}, C_{out}, D_f, H_f, W_f)$
  Output:
    Output shape: $(N, C_{out}, D_{out}, H_{out}, W_{out})$
  Where
  $$ D_{out} = (D_{in} - 1) * strides[0] - 2 * paddings[0] + D_f \\ H_{out} = (H_{in} - 1) * strides[1] - 2 * paddings[1] + H_f \\ W_{out} = (W_{in} - 1) * strides[2] - 2 * paddings[2] + W_f $$

Inputs:
  • Input : (Tensor) The input tensor of convolution transpose operator. The format of input tensor is NCDHW, where N is batch size, C is the number of channels, D is the depth of the feature, H is the height of the feature, and W is the width of the feature.
  • Filter : (Tensor) The filter tensor of convolution transpose operator. The format of the filter tensor is MCDHW, where M is the number of input feature channels, C is the number of output feature channels, D is the depth of the filter, H is the height of the filter, and W is the width of the filter. We enforce groups number == 1 and padding == 0 in the convolution3d transpose scenario.
Outputs:
  • Output : (Tensor) The output tensor of convolution transpose operator. The format of output tensor is also NCDHW, where N is batch size, C is the number of channels, D is the depth of the feature, H is the height of the feature, and W is the width of the feature.
Attributes:
  • strides (Duplicable): (vector<int>, default: {1, 1, 1}) The strides (d_stride, h_stride, w_stride) of the convolution transpose operator.
  • paddings (Duplicable): (vector<int>, default: {0, 0, 0}) The paddings (d_pad, h_pad, w_pad) of the convolution transpose operator.

conv2d_transpose

Convolution2D Transpose Operator.

The convolution transpose operation calculates the output based on the input, the filter, and the strides, paddings and groups parameters. The size of each dimension of the parameters is checked during infer-shape. Input(Input) and Output(Output) are in NCHW format, where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature. Input(Filter) is in MCHW format, where M is the number of input feature channels, C is the number of output feature channels, H is the height of the filter, and W is the width of the filter. The parameters (strides, paddings) each have two elements, which represent height and width, respectively. The Input(Input) size and Output(Output) size may be different.

Example:
  Input:
    Input shape: $(N, C_{in}, H_{in}, W_{in})$
    Filter shape: $(C_{in}, C_{out}, H_f, W_f)$
  Output:
    Output shape: $(N, C_{out}, H_{out}, W_{out})$
  Where
  $$ H_{out} = (H_{in} - 1) * strides[0] - 2 * paddings[0] + H_f \\ W_{out} = (W_{in} - 1) * strides[1] - 2 * paddings[1] + W_f $$
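
The transpose output-shape formulas translate directly into Python. The helper names below are hypothetical and the shapes are sample values chosen for illustration.

```python
def conv_transpose_output_size(in_size, filter_size, stride=1, padding=0):
    """Output extent along one spatial dimension of a transposed convolution."""
    return (in_size - 1) * stride - 2 * padding + filter_size


def conv2d_transpose_output_shape(input_shape, filter_shape,
                                  strides=(1, 1), paddings=(0, 0)):
    """Map an NCHW input shape and an MCHW filter shape to the output shape."""
    n = input_shape[0]
    # For the transposed filter, dimension 1 holds the output channels.
    c_out = filter_shape[1]
    spatial = tuple(
        conv_transpose_output_size(i, f, s, p)
        for i, f, s, p in zip(input_shape[2:], filter_shape[2:],
                              strides, paddings)
    )
    return (n, c_out) + spatial


shape = conv2d_transpose_output_shape((1, 8, 7, 7), (8, 4, 3, 3),
                                      strides=(2, 2), paddings=(1, 1))
print(shape)
```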

Inputs:
  • Input : (Tensor) The input tensor of convolution transpose operator. The format of input tensor is NCHW. Where N is batch size, C is the number of input channels, H is the height of the feature, and W is the width of the feature.
  • Filter : (Tensor) The filter tensor of convolution transpose operator. The format of the filter tensor is MCHW, where M is the number of input feature channels, C is the number of output feature channels, H is the height of the filter, and W is the width of the filter. We enforce groups number == 1 in the convolution transpose scenario.
Outputs:
  • Output : (Tensor) The output tensor of convolution transpose operator. The format of output tensor is also NCHW.
Attributes:
  • strides (Duplicable): (vector<int> default:{1, 1}), the strides(h_stride, w_stride) of convolution transpose operator.
  • paddings (Duplicable): (vector<int> default:{0, 0}), the paddings(h_pad, w_pad) of convolution transpose operator.

gru

The GRU operator implements part of the computation of the complete GRU, as follows:

$$ \text{update gate: } u_t = actGate(xu_t + W_u * h_{t-1} + b_u) \\ \text{reset gate: } r_t = actGate(xr_t + W_r * h_{t-1} + b_r) \\ \text{output candidate: } \tilde{h}_t = actNode(xc_t + W_c * dot(r_t, h_{t-1}) + b_c) \\ \text{output: } h_t = dot((1 - u_t), h_{t-1}) + dot(u_t, \tilde{h}_t) $$

Note: To implement the complete GRU, a fully-connected operator must be applied beforehand to produce xu, xr and xc, which are fed as the Input of the GRU operator.
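
The equations above can be sketched for a single time step in NumPy. This is an illustration under several assumptions: `dot` is read as the element-wise product, actGate is sigmoid and actNode is tanh (the attribute defaults), vectors are multiplied as row vectors (`h_prev @ W`), and the (D x 3D) weight buffer is split in memory into a (D x 2D) gate part followed by a (D x D) candidate part, following the Weight description in this section. The function and argument names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gru_step(x_proj, h_prev, weight, bias=None):
    """One GRU step for a single sample.

    x_proj: (3D,) precomputed projections [xu, xr, xc].
    weight: (D, 3D) hidden-hidden weights.
    bias:   optional (3D,) vector [b_u, b_r, b_c].
    """
    d = h_prev.shape[0]
    if bias is None:
        bias = np.zeros(3 * d)
    xu, xr, xc = np.split(x_proj, 3)
    b_u, b_r, b_c = np.split(bias, 3)
    # Split the contiguous buffer: gate weights first, candidate weights last.
    flat = np.ravel(weight)
    w_gates = flat[:2 * d * d].reshape(d, 2 * d)
    w_c = flat[2 * d * d:].reshape(d, d)
    hu, hr = np.split(h_prev @ w_gates, 2)
    u_t = sigmoid(xu + hu + b_u)
    r_t = sigmoid(xr + hr + b_r)
    h_cand = np.tanh(xc + (r_t * h_prev) @ w_c + b_c)
    return (1.0 - u_t) * h_prev + u_t * h_cand


d = 2
h_t = gru_step(np.zeros(3 * d), np.array([1.0, -1.0]), np.zeros((d, 3 * d)))
```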

Inputs:
  • Input : (LoDTensor) The first input is a LoDTensor, which supports variable-time length input sequence. The underlying tensor in this LoDTensor is a matrix with shape (T x 3D), where T is the total time steps in this mini-batch, D is the hidden size.
  • H0 : (Tensor, optional) The initial hidden state is an optional input. This is a tensor with shape (N x D), where N is the batch size, D is the hidden size.
  • Weight : (Tensor) The learnable hidden-hidden weight matrix with shape (D x 3D), where D is the hidden size. The elements, which are contiguous in memory, can be divided into two parts. The first part holds the weights of the update gate and reset gate with shape (D x 2D), and the second part holds the weights of the output candidate with shape (D x D).
  • Bias : (Tensor, optional) Bias vector with shape (1 x 3D), concatenating the biases of the update gate, reset gate and output candidate.
Outputs:
  • BatchGate (Intermediate) : (LoDTensor) To compute with batches, sequence data will be reorganized into several successive batches each containing data from the same time step. The LoDTensor BatchGate contains the update gate, reset gate and output candidate values organized in batches. The LoD size is 2. The first LoD contains the batch offsets and the second LoD contains the indexes in the raw sequence data.
  • BatchResetHiddenPrev (Intermediate) : (LoDTensor) The reset hidden state LoDTensor organized in batches. This LoDTensor is a matrix with shape (T x D) and has the same LoD as `BatchGate`.
  • BatchHidden (Intermediate) : (LoDTensor) The hidden state LoDTensor organized in batches. This LoDTensor is a matrix with shape (T x D) and has the same LoD as `BatchGate`.
  • Hidden : (LoDTensor) The hidden state LoDTensor organized in sequences. This LoDTensor is a matrix with shape (T x D) and has the same LoD as `BatchGate`.
Attributes:
  • activation (Duplicable): (string, default tanh) The activation type used for output candidate $\tilde{h}_t$.
  • gate_activation (Duplicable): (string, default sigmoid) The activation type used in update gate and reset gate.
  • is_reverse (Duplicable): (bool, default: False) whether to compute reversed GRU.

recurrent

Static Length Recurrent Operator.

The static length recurrent operator can only operate on fixed-size sequence data, i.e. in each mini-batch, the sequence lengths of all inputs are the same.

Inputs:
  • inputs (Duplicable) : rnn inputs
  • initial_states (Duplicable) : rnn initial states
  • parameters (Duplicable) : Parameters are used by the step block as its input. However, the input is not a sequence tensor. At every time step, each operator in the step block just uses the parameter directly.
Outputs:
  • outputs (Duplicable) : The output sequence of RNN. The sequence length must be the same.
  • step_scopes : StepScopes contain all local variables in each time step.
Attributes:
  • ex_states (Duplicable): The ex-state variable names. The ex-state means the state value in the ex-timestep, i.e. the previous time step. [ex_states, states, initial_states@GRAD] must be in the same order.
  • states (Duplicable): The state variable names. [ex_states, states, initial_states@GRAD] must be in the same order.
  • step_block (Duplicable): The step block inside RNN
  • reverse (Duplicable): Calculate RNN reversely or not. By default reverse=False. Assume the input data is [A, B, C, D].
    If reverse is False, the computation of RNN is like
          A          B          C         D
          |          |          |         |
          v          v          v         v
         rnn -----> rnn -----> rnn ----> rnn
          |          |          |         |
          v          v          v         v
          o          o          o         o
    If reverse is True, the computation of RNN is like
          A          B          C         D
          |          |          |         |
          v          v          v         v
         rnn <----- rnn <----- rnn <---- rnn
          |          |          |         |
          v          v          v         v
          o          o          o         o
  • is_train (Duplicable):

save

Save operator

This operator serializes a tensor variable and writes it to a file on disk.

Inputs:
  • X : (Tensor) Input tensor to be saved
Outputs:
Attributes:
  • overwrite (Duplicable): (boolean, default true) Overwrite the output file if it exists.
  • file_path (Duplicable): (string) The "file_path" where the variable will be saved.

load

Load Operator.

The load operator loads a tensor variable from a file on disk.

Inputs:
Outputs:
  • Out : (Tensor) The tensor to be loaded
Attributes:
  • file_path (Duplicable): (string) Variable will be loaded from "file_path".

      auc

      Area Under The Curve (AUC) Operator.

      This implementation computes the AUC according to forward output and label. It is used very widely in binary classification evaluation. As a note: If input label contains values other than 0 and 1, it will be cast to bool. You can find the relevant definitions here: https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve

      There are two types of possible curves: 1. ROC: Receiver operating characteristic 2. PR: Precision Recall

      Inputs:
      • Out : A floating point 2D tensor with values in the range [0, 1]. Each row is sorted in descending order. This input should be the output of topk. Typically, this tensor indicates the probability of each label.
      • Indices : An int 2D tensor indicating the indices of the original tensor before sorting. Typically, this tensor indicates which label each probability stands for.
      • Label : A 2D int tensor indicating the label of the training data. The height is the batch size and the width is always 1.
      Outputs:
      • AUC : A scalar representing the current area-under-the-curve.
      Attributes:
      • curve (Duplicable): Curve type, can be 'ROC' or 'PR'.
      • num_thresholds (Duplicable): The number of thresholds to use when discretizing the roc curve.

      hard_sigmoid

      HardSigmoid Activation Operator.

      Segment-wise linear approximation of sigmoid(https://arxiv.org/abs/1603.00391), which is much faster than sigmoid.

      $y = max(0, min(1, slope * x + offset))$

      The slope should be positive. The offset can be either positive or negative. The default slope and offset are set according to the above reference. It is recommended to use the defaults for this activation.

      Inputs:
      • X : Input of HardSigmoid operator
      Outputs:
      • Y : Output of HardSigmoid operator
      Attributes:
      • slope (Duplicable): Slope for linear approximation of sigmoid
      • offset (Duplicable): Offset for linear approximation of sigmoid
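
The clamp formula above can be sketched in plain Python. This is only an illustration, not the operator's implementation; the function name and the default values 0.2 and 0.5 used here are assumptions chosen for the example.

```python
def hard_sigmoid(x, slope=0.2, offset=0.5):
    """Piecewise-linear sigmoid approximation.

    The defaults (0.2, 0.5) are assumed for illustration only.
    """
    # Clamp the linear function slope * x + offset to the range [0, 1].
    return max(0.0, min(1.0, slope * x + offset))


print([hard_sigmoid(x) for x in (-5.0, 0.0, 5.0)])  # [0.0, 0.5, 1.0]
```

Inputs far below zero saturate at 0 and inputs far above zero saturate at 1, so only the narrow middle band behaves linearly.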

      cond

      Sample Dependent Conditional Operator.

      Given Cond[i] as a 1/0 vector to indicate true/false:

          Out[i] = subnet_true[i],  if Cond[i] == true
          Out[i] = subnet_false[i], if Cond[i] == false

      Inputs:
      • Cond : The condition, which is a bool vector
      • Xs (Duplicable) : Inputs of Subnets
      Outputs:
      • Outs (Duplicable) : Outputs of Cond_Op after merge
      • SubScopes : sub scopes for true and false branches
      • IndexTensors : Index tensors that contain the indices for the true/false branches

      max_pool2d_with_index

      MaxPool2d Operator.

      The maxPooling2d with index operation calculates the output and the mask based on the input and the ksize, strides, and paddings parameters. Input(X) and output(Out, Mask) are in NCHW format, where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature. Each of the parameters (ksize, strides, paddings) has two elements, which represent height and width, respectively. The input(X) size and output(Out, Mask) size may be different.

      Example:

        Input:
          X shape: $(N, C, H_{in}, W_{in})$
        Output:
          Out shape: $(N, C, H_{out}, W_{out})$
          Mask shape: $(N, C, H_{out}, W_{out})$
        Where
          $$ H_{out} = \frac{(H_{in} - ksize[0] + 2 * paddings[0])}{strides[0]} + 1 \\ W_{out} = \frac{(W_{in} - ksize[1] + 2 * paddings[1])}{strides[1]} + 1 $$

      Inputs:
      • X : (Tensor) The input tensor of pooling operator. The format of input tensor is NCHW, where N is batch size, C is the number of channels, H is the height of the image, and W is the width of the image.
      Outputs:
      • Out : (Tensor) The output tensor of pooling operator. The format of output tensor is also NCHW, where N is batch size, C is the number of channels, H is the height of the image and W is the width of the image.
      • Mask : (Tensor) The Mask tensor of pooling operator. The format of the Mask tensor is also NCHW, where N is batch size, C is the number of channels, H is the height of the image, and W is the width of the image. It represents the index in the current feature map.
      Attributes:
      • ksize (Duplicable): (vector<int>) The pooling window size(height, width) of pooling operator. If global_pooling = true, ksize and paddings will be ignored.
      • global_pooling (Duplicable): (bool, default:false) Whether to use the global pooling. If global_pooling = true, ksize and paddings will be ignored.
      • strides (Duplicable): (vector<int>, default {1, 1}), strides(height, width) of pooling operator.
      • paddings (Duplicable): (vector<int>, default:{0, 0}), paddings(height, width) of pooling operator. If global_pooling = true, ksize and paddings will be ignored.
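
As a rough illustration of how Out and Mask relate, the sketch below pools a single 2-D feature map stored as a list of rows. It rests on several assumptions that the text above does not state: the division in the output-size formula is a floor division, padded positions are skipped rather than compared, and each Mask entry is the flat index h * W_in + w of the selected element. The function name is hypothetical.

```python
def max_pool2d_with_index(x, ksize, strides=(1, 1), paddings=(0, 0)):
    """Max-pool one 2-D feature map and return (out, mask)."""
    h_in, w_in = len(x), len(x[0])
    # Output size from the formula above, assuming floor division.
    h_out = (h_in - ksize[0] + 2 * paddings[0]) // strides[0] + 1
    w_out = (w_in - ksize[1] + 2 * paddings[1]) // strides[1] + 1
    out, mask = [], []
    for i in range(h_out):
        out_row, mask_row = [], []
        for j in range(w_out):
            h0 = i * strides[0] - paddings[0]
            w0 = j * strides[1] - paddings[1]
            best, best_idx = None, -1
            # Visit only the part of the window that lies inside the input.
            for h in range(max(h0, 0), min(h0 + ksize[0], h_in)):
                for w in range(max(w0, 0), min(w0 + ksize[1], w_in)):
                    if best is None or x[h][w] > best:
                        # Assumed mask convention: flat index in the feature map.
                        best, best_idx = x[h][w], h * w_in + w
            out_row.append(best)
            mask_row.append(best_idx)
        out.append(out_row)
        mask.append(mask_row)
    return out, mask


x = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
out, mask = max_pool2d_with_index(x, ksize=(2, 2), strides=(2, 2))
print(out)   # [[6, 8], [14, 16]]
print(mask)  # [[5, 7], [13, 15]]
```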

      thresholded_relu

      ThresholdedRelu Activation Operator.

      $$ y = \begin{cases} x, \text{if } x > threshold \\ 0, \text{otherwise} \end{cases} $$

      Inputs:
      • X : Input of ThresholdedRelu operator
      Outputs:
      • Y : Output of ThresholdedRelu operator
      Attributes:
      • threshold (Duplicable): The threshold location of activation

      hard_shrink

      HardShrink Activation Operator.

      $$ y = \begin{cases} x, \text{if } x > \lambda \\ x, \text{if } x < -\lambda \\ 0, \text{otherwise} \end{cases} $$

      Inputs:
      • X : Input of HardShrink operator
      Outputs:
      • Y : Output of HardShrink operator
      Attributes:
      • threshold (Duplicable): The value of threshold for HardShrink

      relu6

      Relu6 Activation Operator.

      $y = min(max(0, x), 6)$

      Inputs:
      • X : Input of Relu6 operator
      Outputs:
      • Y : Output of Relu6 operator
      Attributes:
      • threshold (Duplicable): The threshold value of Relu6

      elu

      ELU Activation Operator.

      Applies the following element-wise computation on the input according to https://arxiv.org/abs/1511.07289.

      $y = max(0, x) + min(0, alpha * (e^x - 1))$

      Inputs:
      • X : Input of ELU operator
      Outputs:
      • Y : Output of ELU operator
      Attributes:
      • alpha (Duplicable): The alpha value of ELU

      leaky_relu

      LeakyRelu Activation Operator.

      $y = max(x, alpha * x)$

      Inputs:
      • X : Input of LeakyRelu operator
      Outputs:
      • Y : Output of LeakyRelu operator
      Attributes:
      • alpha (Duplicable): The small negative slope

      top_k

      Top K operator

      If the input is a vector (1d tensor), this operator finds the k largest entries in the vector and outputs their values and indices as vectors. Thus values[j] is the j-th largest entry in input, and its index is indices[j].

      For matrices, this operator computes the top k entries in each row.

      Inputs:
      • X : (Tensor) The input of Topk op
      Outputs:
      • Out : (Tensor) The output tensor of Topk op
      • Indices : (Tensor) The indices of Topk elements of input
      Attributes:
      • k (Duplicable): (int, default 1) Number of top elements to look for along the last dimension (along each row for matrices).
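
A minimal sketch of the behavior described above, using plain Python lists. The helper names are hypothetical, and the tie-breaking rule (the earlier index wins among equal values) is an assumption of this sketch rather than something the text specifies.

```python
def top_k(row, k=1):
    """Return (values, indices) of the k largest entries of a 1-D list."""
    # A stable descending sort keeps the earlier index first among ties.
    order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
    return [row[i] for i in order], order


def top_k_rows(matrix, k=1):
    """Apply top_k independently to every row of a 2-D list."""
    results = [top_k(row, k) for row in matrix]
    return [values for values, _ in results], [idx for _, idx in results]


values, indices = top_k([1, 5, 3, 9, 7], k=2)
print(values, indices)  # [9, 7] [3, 4]
```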

      sequence_softmax

      Sequence Softmax Operator.

      SequenceSoftmaxOp computes the softmax activation among all time-steps for each sequence. The dimension of each time-step should be 1. Thus, the shape of input Tensor can be either [N, 1] or [N], where N is the sum of the length of all sequences.

      The algorithm works as follows: for the i-th sequence in a mini-batch: $$Out(X[lod[i]:lod[i+1], :]) = \frac{\exp(X[lod[i]:lod[i+1], :])} {\sum(\exp(X[lod[i]:lod[i+1], :]))}$$

      For example, for a mini-batch of 3 sequences with variable-length, each containing 2, 3, 2 time-steps, the lod of which is [0, 2, 5, 7], then softmax will be computed among X[0:2, :], X[2:5, :], X[5:7, :] and N turns out to be 7.

      Inputs:
      • X : (LoDTensor) 1-D or 2-D input LoDTensor with the 2-nd dimension of length 1.
      Outputs:
      • Out : (LoDTensor) 1-D or 2-D output LoDTensor with the 2-nd dimension of length 1.
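
The per-sequence softmax can be sketched as follows, treating the input as a flat list of N values and the LoD as a list of offsets. The function name is hypothetical, and subtracting the per-sequence maximum before exponentiating is a common stability trick added here as an assumption; it does not change the result.

```python
import math


def sequence_softmax(x, lod):
    """Apply softmax independently to each slice x[lod[i]:lod[i + 1]]."""
    out = []
    for start, end in zip(lod[:-1], lod[1:]):
        seq = x[start:end]
        peak = max(seq)  # subtracted for numerical stability only
        exps = [math.exp(v - peak) for v in seq]
        total = sum(exps)
        out.extend(e / total for e in exps)
    return out


# Three sequences of lengths 2, 3 and 2, so N = 7.
x = [1.0, 1.0, 0.0, 0.0, 0.0, 2.0, 2.0]
out = sequence_softmax(x, [0, 2, 5, 7])
```

Each slice of `out` sums to 1 on its own, which is what distinguishes this operator from a softmax over the whole batch.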

      decayed_adagrad

      Decayed Adagrad Optimizer.

      The update is done as follows:

      $$ moment\_out = decay * moment + (1 - decay) * grad * grad \\ param\_out = param - \frac{learning\_rate * grad}{\sqrt{moment\_out} + epsilon} $$

      The original paper(http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf) does not have an epsilon attribute. It is added here for numerical stability to avoid the division by zero error.

      Inputs:
      • Param : (Tensor) Input parameter
      • Grad : (Tensor) Input gradient
      • Moment : (Tensor) Second moment
      • LearningRate : (Tensor) Learning rate
      Outputs:
      • ParamOut : (Tensor) Output parameter
      • MomentOut : (Tensor) Output second moment
      Attributes:
      • decay (Duplicable): (float, default 0.95) Discounting factor for coming gradient
      • epsilon (Duplicable): (float, default 1.0e-6) Constant for numerical stability
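
The update rule above, written out element by element as a sketch. The function name is hypothetical, and plain lists stand in for tensors.

```python
import math


def decayed_adagrad_step(param, grad, moment, learning_rate,
                         decay=0.95, epsilon=1.0e-6):
    """One Decayed Adagrad step; returns (param_out, moment_out)."""
    # Exponentially decayed running average of squared gradients.
    moment_out = [decay * m + (1.0 - decay) * g * g
                  for m, g in zip(moment, grad)]
    # Scale each gradient by the root of its running average.
    param_out = [p - learning_rate * g / (math.sqrt(m) + epsilon)
                 for p, g, m in zip(param, grad, moment_out)]
    return param_out, moment_out


param_out, moment_out = decayed_adagrad_step(
    param=[1.0], grad=[2.0], moment=[0.0], learning_rate=0.1)
```

Unlike plain Adagrad, the accumulator can shrink again when recent gradients are small, so the effective step size does not decay monotonically.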

      scale

      Scale operator

      $$Out = scale*X$$

      Inputs:
      • X : (Tensor) Input tensor of scale operator.
      Outputs:
      • Out : (Tensor) Output tensor of scale operator.
      Attributes:
      • scale (Duplicable): (float, default 0) The scaling factor of the scale operator.

      increment

      Increment Operator.

      The equation is: $$Out = X + step$$

      Inputs:
      • X : (Tensor) The input tensor of increment operator
      Outputs:
      • Out : (Tensor) The output tensor of increment operator.
      Attributes:
      • step (Duplicable): (float, default 1.0) The step size by which the input tensor will be incremented.

      expand

      The expand operator tiles the input the given number of times. Set the number of times for each dimension through the attribute 'expand_times'. The rank of X should be in [1, 6]. Note that the size of 'expand_times' must be the same as X's rank. The following is a use case:

      Input(X) is a 3-D tensor with shape [2, 3, 1]:

          [
             [[1], [2], [3]],
             [[4], [5], [6]]
          ]
      

      Attr(expand_times): [1, 2, 2]

      Output(Out) is a 3-D tensor with shape [2, 6, 2]:

          [
              [[1, 1], [2, 2], [3, 3], [1, 1], [2, 2], [3, 3]],
              [[4, 4], [5, 5], [6, 6], [4, 4], [5, 5], [6, 6]]
          ]
      
      Inputs:
      • X : (Tensor, default Tensor<float>) A tensor with rank in [1, 6]. X is the input tensor to be expanded.
      Outputs:
      • Out : (Tensor, default Tensor<float>) A tensor with rank in [1, 6]. The rank of Output(Out) is the same as that of Input(X), and each dimension size of Output(Out) equals the corresponding dimension size of Input(X) multiplied by the corresponding value of Attr(expand_times).
      Attributes:
      • expand_times (Duplicable): Expand times number for each dimension.
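
The tiling in the use case above can be reproduced with a short recursive sketch over nested lists. The function name is hypothetical, and nested Python lists stand in for the tensor.

```python
def expand(x, expand_times):
    """Tile a nested-list tensor; expand_times[d] repeats dimension d."""
    if not expand_times:
        return x  # reached a scalar element
    times, rest = expand_times[0], expand_times[1:]
    # Repeat the whole run of items along this dimension `times` times,
    # expanding the inner dimensions of every item first.
    return [expand(item, rest) for _ in range(times) for item in x]


x = [[[1], [2], [3]],
     [[4], [5], [6]]]
out = expand(x, [1, 2, 2])
print(out[0])  # [[1, 1], [2, 2], [3, 3], [1, 1], [2, 2], [3, 3]]
```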

      lod_array_length

      LoDArrayLength Operator.

      This operator obtains the length of lod tensor array:

      $$Out = len(X)$$

      NOTE: The output is a CPU Tensor since the control variable should be only in CPU and the length of LoDTensorArray should be used as control variables.

      Inputs:
      • X : (LoDTensorArray) The input tensor array.
      Outputs:
      • Out : (Tensor) 1x1 CPU Tensor of length, int64_t

      reduce_sum

      ReduceSum Operator.

      This operator computes the sum of input tensor along the given dimension. The result tensor has 1 fewer dimension than the input unless keep_dim is true.

      Inputs:
      • X : (Tensor) The input tensor. Tensors with rank at most 6 are supported.
      Outputs:
      • Out : (Tensor) The result tensor.
      Attributes:
      • dim (Duplicable): (int, default 0) The dimension to reduce. Must be in the range [-rank(input), rank(input)). If `dim < 0`, the dim to reduce is `rank + dim`. Note that reducing on the first dim causes the LoD information to be lost.
      • keep_dim (Duplicable): (bool, default false) If true, retain the reduced dimension with length 1.

      tanh_shrink

      TanhShrink Activation Operator.

      $$y = x - \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

      Inputs:
      • X : Input of TanhShrink operator
      Outputs:
      • Y : Output of TanhShrink operator

      adam

      Adam Optimizer.

      This implements the Adam optimizer from Section 2 of the Adam paper : https://arxiv.org/abs/1412.6980. Adam is a first-order gradient-based optimization method based on adaptive estimates of lower-order moments.

      Adam updates:

      $$ moment\_1\_out = \beta_1 * moment\_1 + (1 - \beta_1) * grad \\ moment\_2\_out = \beta_2 * moment\_2 + (1 - \beta_2) * grad * grad \\ learning\_rate = learning\_rate * \frac{\sqrt{1 - \beta_{2\_pow}}}{1 - \beta_{1\_pow}} \\ param\_out = param - learning\_rate * \frac{moment\_1\_out}{\sqrt{moment\_2\_out} + \epsilon} $$

      Inputs:
      • Param : (Tensor) Input parameter
      • Grad : (Tensor) Input gradient
      • LearningRate : (Tensor) Learning rate
      • Moment1 : (Tensor) Input first moment
      • Moment2 : (Tensor) Input second moment
      • Beta1Pow : (Tensor) Input beta1 power accumulator
      • Beta2Pow : (Tensor) Input beta2 power accumulator
      Outputs:
      • ParamOut : (Tensor) Output parameter
      • Moment1Out : (Tensor) Output first moment
      • Moment2Out : (Tensor) Output second moment
      Attributes:
      • beta1 (Duplicable): (float, default 0.9) Exponential decay rate for the first moment estimates.
      • beta2 (Duplicable): (float, default 0.999) exponential decay rate for the second moment estimates.
      • epsilon (Duplicable): (float, default 1.0e-8) Constant for numerical stability
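
A scalar sketch of one Adam step, following the update equations above. The function name is hypothetical. The sketch assumes that the parameter update uses the freshly computed moments, as in the Adam paper, and that Beta1Pow and Beta2Pow already hold beta1 and beta2 raised to the current step count.

```python
import math


def adam_step(param, grad, moment1, moment2, learning_rate,
              beta1_pow, beta2_pow, beta1=0.9, beta2=0.999, epsilon=1.0e-8):
    """One Adam step on scalars; returns (param_out, moment1_out, moment2_out)."""
    moment1_out = beta1 * moment1 + (1.0 - beta1) * grad
    moment2_out = beta2 * moment2 + (1.0 - beta2) * grad * grad
    # Bias correction folded into the learning rate.
    lr_t = learning_rate * math.sqrt(1.0 - beta2_pow) / (1.0 - beta1_pow)
    # Assumption: the update uses the new moments computed just above.
    param_out = param - lr_t * moment1_out / (math.sqrt(moment2_out) + epsilon)
    return param_out, moment1_out, moment2_out


# First step: the power accumulators equal beta1 and beta2 themselves.
p, m1, m2 = adam_step(param=1.0, grad=0.5, moment1=0.0, moment2=0.0,
                      learning_rate=0.1, beta1_pow=0.9, beta2_pow=0.999)
```

On the first step the bias correction makes the update magnitude roughly equal to the learning rate, regardless of the gradient's scale.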

      reduce_min

      ReduceMin Operator.

      This operator computes the min of input tensor along the given dimension. The result tensor has 1 fewer dimension than the input unless keep_dim is true.

      Inputs:
      • X : (Tensor) The input tensor. Tensors with rank at most 6 are supported.
      Outputs:
      • Out : (Tensor) The result tensor.
      Attributes:
      • dim (Duplicable): (int, default 0) The dimension to reduce. Must be in the range [-rank(input), rank(input)). If `dim < 0`, the dim to reduce is `rank + dim`. Note that reducing on the first dim causes the LoD information to be lost.
      • keep_dim (Duplicable): (bool, default false) If true, retain the reduced dimension with length 1.

      lod_reset

      LoDReset operator

      Reset LoD of Input(X) into a new one specified by Input(TargetLoD) or Attr(target_lod), or set LoD for Input(X) if it doesn't have one. Currently the lod_reset operator only supports the reset of level 0 LoD. At least one of Input(TargetLoD) and Attr(target_lod) must be set, and if both of them are set, Input(TargetLoD) will be chosen as the target LoD.

      An example: Given a float LoDTensor X with shape (6, 1), its transpose form represents

      [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
      

      with LoD = [[0, 2, 5, 6]] and the three (transposed) sequences look like

      [1.0, 2.0], [3.0, 4.0, 5.0], [6.0].
      

      If target LoD = [0, 4, 6], the lod_reset operator will reset the LoD, and the sequences that the LoDTensor Output(Out) contains become:

      [1.0, 2.0, 3.0, 4.0], [5.0, 6.0].
      
      Inputs:
      • X : (LoDTensor) The input tensor of lod_reset operator.
      • TargetLoD : (Tensor, optional) The target level 0 LoD from Input().
      Outputs:
      • Out : (LoDTensor) The output tensor of lod_reset operator.
      Attributes:
      • target_lod (Duplicable): The target level 0 LoD from Attr().
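
The example above can be replayed with a small sketch in which a flat list plays the role of the tensor data and a list of offsets plays the role of the level 0 LoD. Both helper names are hypothetical, and the validity check on the target LoD is an assumption of this sketch.

```python
def split_by_lod(data, lod):
    """Slice flat data into sequences using level 0 LoD offsets."""
    return [data[start:end] for start, end in zip(lod[:-1], lod[1:])]


def lod_reset(data, target_lod):
    """Return (data, new LoD); only the LoD changes, the data is untouched."""
    # Assumed sanity check: the offsets must cover the data exactly.
    if target_lod[0] != 0 or target_lod[-1] != len(data):
        raise ValueError("target LoD must start at 0 and end at len(data)")
    return data, [list(target_lod)]


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
before = split_by_lod(x, [0, 2, 5, 6])
out, out_lod = lod_reset(x, [0, 4, 6])
after = split_by_lod(out, out_lod[0])
print(before)  # [[1.0, 2.0], [3.0, 4.0, 5.0], [6.0]]
print(after)   # [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0]]
```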

      write_to_array

      WriteToArray Operator.

      This operator writes a LoDTensor to a LoDTensor array.

      Assume $T$ is LoDTensor, $i$ is the subscript of the array, and $A$ is the array. The equation is

      $$A[i] = T$$

      Inputs:
      • X : (LoDTensor) The tensor to be written to the tensor array.
      • I : (Tensor) The subscript index in the tensor array. The number of elements should be 1.
      Outputs:
      • Out : (TensorArray) The tensor array to be written to.

      reshape

      Reshape Operator.

      Reshape Input(X) into the shape specified by Attr(shape).

      An example: Given a 2-D tensor X with 2 rows and 2 columns

      [[1, 2], [3, 4]]
      

      and target shape = [1, 4], the reshape operator will transform the tensor X into a 2-D tensor:

      [[1, 2, 3, 4]]
      
      Inputs:
      • X : The input tensor of reshape operator.
      Outputs:
      • Out : The output tensor of reshape operator.
      Attributes:
      • shape (Duplicable): (vector<int>) Target shape of reshape operator.

      fill_constant

      FillConstant Operator.

      Fill up a variable with specified constant value.

      Inputs:
        Outputs:
        • Out : (Tensor) Tensor of specified shape will be filled with the specified value
        Attributes:
        • dtype (Duplicable): (int, default 5 (FP32)) Output data type
        • shape (Duplicable): (vector<int>) The shape of the output
        • value (Duplicable): (float, default 0) The value to be filled
        • force_cpu (Duplicable): (bool, default false) Force the output variable to be filled in CPU memory. Otherwise, the output variable is filled on the running device.

        elementwise_div

        Limited Elementwise Div Operator.

        The equation is:

        $Out = X / Y$

        X is a tensor of any dimension, and the number of dimensions of tensor Y must be less than or equal to the number of dimensions of X.

        There are two cases for this operator:

        1. The shape of Y is the same as the shape of X;
        2. The shape of Y is a contiguous subset of the shape of X.

        For case 2: Y will be broadcast to match the shape of X, and axis should be the starting dimension index for broadcasting Y onto X.

        For example:

          shape(X) = (2, 3, 4, 5), shape(Y) = (,)
          shape(X) = (2, 3, 4, 5), shape(Y) = (5,)
          shape(X) = (2, 3, 4, 5), shape(Y) = (4, 5)
          shape(X) = (2, 3, 4, 5), shape(Y) = (3, 4), with axis=1
          shape(X) = (2, 3, 4, 5), shape(Y) = (2), with axis=0

        Both the input X and Y can carry the LoD (Level of Details) information, or not. But the output only shares the LoD information with input X.

        Inputs:
        • X : (Tensor) The first input tensor of elementwise op
        • Y : (Tensor) The second input tensor of elementwise op
        Outputs:
        • Out : The output of elementwise op
        Attributes:
        • axis (Duplicable): (int, default -1) The starting dimension index for broadcasting Y onto X
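
One way to picture the broadcasting rule is to pad the shape of Y with 1s until it has the same rank as X. The helper below is a hypothetical sketch of that shape alignment only, not of the division itself. It assumes that the default axis of -1 aligns Y with the trailing dimensions of X, which matches the examples above but is not stated explicitly.

```python
def align_y_shape(x_shape, y_shape, axis=-1):
    """Pad y_shape with 1s so it lines up with x_shape starting at `axis`."""
    if axis == -1:
        # Assumed default: align Y with the trailing dimensions of X.
        axis = len(x_shape) - len(y_shape)
    if tuple(x_shape[axis:axis + len(y_shape)]) != tuple(y_shape):
        raise ValueError("shape(Y) must match a contiguous slice of shape(X)")
    trailing = len(x_shape) - axis - len(y_shape)
    return (1,) * axis + tuple(y_shape) + (1,) * trailing


print(align_y_shape((2, 3, 4, 5), (4, 5)))          # (1, 1, 4, 5)
print(align_y_shape((2, 3, 4, 5), (3, 4), axis=1))  # (1, 3, 4, 1)
```

Once Y has the padded shape, every dimension of size 1 is repeated to match X, and the division proceeds element by element.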

        conv2d_cudnn

        Convolution Operator.

        The convolution operation calculates the output based on the input, the filter, and the strides, paddings, dilations, and groups parameters. The size of each dimension of the parameters is checked in the infer-shape. Input(Input) and Output(Output) are in NCHW format, where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature. Filters(Input) is in MCHW format, where M is the number of output image channels, C is the number of input image channels, H is the height of the filter, and W is the width of the filter. Each of the parameters (strides, paddings, dilations) has two elements, which represent height and width, respectively. The input(X) size and output(Out) size may be different.

        Example:

          Input:
            Input shape: $(N, C_{in}, H_{in}, W_{in})$
            Filter shape: $(C_{out}, C_{in}, H_f, W_f)$
          Output:
            Output shape: $(N, C_{out}, H_{out}, W_{out})$
          Where
            $$ H_{out}= \frac{(H_{in} + 2 * paddings[0] - (dilations[0] * (H_f - 1) + 1))}{strides[0]}+ 1 \\ W_{out}= \frac{(W_{in} + 2 * paddings[1] - (dilations[1] * (W_f - 1) + 1))}{strides[1]}+ 1 $$

        Inputs:
        • Input : (Tensor) The input tensor of convolution operator. The format of input tensor is NCHW, where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature.
        • Filter : (Tensor) The filter tensor of convolution operator. The format of the filter tensor is MCHW, where M is the number of output image channels, C is the number of input image channels, H is the height of the filter, and W is the width of the filter. If the groups attribute is greater than 1, C equals the number of input image channels divided by the groups.
        Outputs:
        • Output : (Tensor) The output tensor of convolution operator. The format of output tensor is also NCHW.
        Attributes:
        • strides (Duplicable): (vector<int> default:{1, 1}), the strides(h_stride, w_stride) of convolution operator.
        • paddings (Duplicable): (vector<int> default:{0, 0}), the paddings(h_pad, w_pad) of convolution operator.
        • groups (Duplicable): (int default:1), the groups number of the convolution operator. According to grouped convolution in Alex Krizhevsky's Deep CNN paper: when group=2, the first half of the filters is only connected to the first half of the input channels, while the second half of the filters is only connected to the second half of the input channels.
        • dilations (Duplicable): (vector<int> default:{1, 1}), the dilations(h_dilation, w_dilation) of convolution operator.
        • workspace_size_MB (Duplicable): Workspace size for cudnn, in MB. The workspace is a section of GPU memory that is allocated and freed each time the operator runs. A larger workspace size can increase performance but also requires better hardware, so this size should be chosen carefully.
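
The output-size formula in the example can be checked with a few lines of Python. The helper name is hypothetical, and the sketch assumes the fraction is evaluated with floor division when the stride does not divide evenly.

```python
def conv_output_size(in_size, filter_size, stride=1, padding=0, dilation=1):
    """Output length along one spatial dimension (height or width)."""
    # A dilated filter spans dilation * (filter_size - 1) + 1 input positions.
    effective = dilation * (filter_size - 1) + 1
    # Assumption: the division in the formula is a floor division.
    return (in_size + 2 * padding - effective) // stride + 1


print(conv_output_size(32, 3, stride=1, padding=1))   # 32
print(conv_output_size(32, 3, stride=2))              # 15
print(conv_output_size(32, 3, dilation=2))            # 28
```

The same function is applied once with the height values and once with the width values to obtain $H_{out}$ and $W_{out}$.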

        mul

        Mul Operator.

        This operator is used to perform matrix multiplication for input X and Y.

        The equation is:

        $$Out = X * Y$$
        

        Both the input X and Y can carry the LoD (Level of Details) information, or not. But the output only shares the LoD information with input X.

        Inputs:
        • X : The first input of mul op
        • Y : The second input of mul op
        Outputs:
        • Out : The output of mul op
        Attributes:
        • x_num_col_dims (Duplicable): (int, default 1) mul_op can take tensors with more than two dimensions as input `X`; in that case, the tensor is first flattened into a matrix. The matrix's first dimension (its height) is the product of the tensor's first `x_num_col_dims` dimensions, and the matrix's second dimension (its width) is the product of the tensor's remaining `rank - x_num_col_dims` dimensions.
        • y_num_col_dims (Duplicable): (int, default 1) mul_op can take tensors with more than two dimensions as input `Y`; in that case, the tensor is flattened into a matrix in the same way as input `X`.

        margin_rank_loss

        MarginRankLoss Operator.

        This operator measures the loss given a pair of training samples {X1, X2} and the Label with attribute margin, where Label = +1 indicates that X1 is ranked higher than X2 and Label = -1 indicates otherwise. The loss is calculated as:

        $loss(X1, X2, Label) = max(0, -Label * (X1 - X2) + margin)$

        The attribute margin here helps make the predictions more robust. Denote the item ranked higher as the positive sample, otherwise the negative sample. If the score of the two samples satisfies

        $positive sample - negative sample < margin$

        the pair of samples will contribute to the final loss, which will backpropagate and train the ranking model to enlarge the difference between the two scores.

        For batch input with size batch_size, X1, X2 and Label all have the same shape [batch_size x 1].

        Inputs:
        • X1 : (2-D tensor with shape [batch_size x 1]) The score for one item X1 to be ranked, from pairwise ranking model.
        • X2 : (2-D tensor with shape [batch_size x 1]) The score for another item X2 to be ranked, from pairwise ranking model.
        • Label : (2-D tensor with shape [batch_size x 1]) The label indicating X1 ranked higher than X2 or not, can only be +1 or -1.
        Outputs:
        • Activated (Intermediate) : (2-D tensor with shape [batch_size x 1]) Intermediate tensor to indicate whether each element of Output(Out) is activated.
        • Out : (2-D tensor with shape [batch_size x 1]) The output loss of MarginRankLoss operator.
        Attributes:
        • margin (Duplicable): (scalar, default 0) Margin for MarginRankLossOp.
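
The loss formula above, applied row by row to a batch, can be sketched as follows. The function name is hypothetical, and flat lists stand in for the [batch_size x 1] tensors.

```python
def margin_rank_loss(x1, x2, label, margin=0.0):
    """Per-sample loss max(0, -label * (x1 - x2) + margin)."""
    return [max(0.0, -l * (a - b) + margin)
            for a, b, l in zip(x1, x2, label)]


# The first pair is ranked correctly by more than the margin, so its loss is 0.
# The second pair is ranked the wrong way round and is penalized.
loss = margin_rank_loss(x1=[0.8, 0.2], x2=[0.3, 0.6], label=[1, 1], margin=0.1)
```

A pair contributes to the loss only while the score gap in the labeled direction is smaller than the margin.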

        greater_equal

        greater_equal Operator

        It operates element-wise on X and Y, and returns Out. Each of them is an N-dim tensor. X and Y can be of any type. Each element of the Out tensor is calculated by Out = X >= Y.

        Inputs:
        • X : (LoDTensor) the left hand operand of greater_equal operator
        • Y : (LoDTensor) the right hand operand of greater_equal operator
        Outputs:
        • Out : (LoDTensor) n-dim bool tensor. Each element is Out = X >= Y

        reciprocal

        Reciprocal Activation Operator.

        $$y = \frac{1}{x}$$

        Inputs:
        • X : Input of Reciprocal operator
        Outputs:
        • Y : Output of Reciprocal operator

        squared_l2_norm

        SquaredL2Norm Operator.

        Computes the squared L2 norm of a tensor.

        $$Out = \sum_{i} X_{i}^2$$

        Inputs:
        • X : (Tensor) The input of squared_l2_norm op.
        Outputs:
        • Out : (Scalar) The output of squared_l2_norm op.

        shrink_rnn_memory

            In dynamic RNN, we are able to handle sequences of different lengths. 
            Because of the multiple lengths, the size of each step input can be 
            different, which may lead to a mismatch between the input of
            the current step and the memory generated by the previous one. This 
            operator shrinks memory according to the size of the next step input, 
            to make sure that they can match each other.
        
        Inputs:
        • X : (LoDTensor) The RNN step memory to be shrunk.
        • RankTable : (LoDRankTable) The lod_rank_table of dynamic RNN.
        • I : (LoDTensor) The step index. The RNN step memory 'X' will be shrunk to match the size of the input of the index'th step.
        Outputs:
        • Out : (LoDTensor) The shrunk RNN step memory.

        conditional_block

        Conditional block operator

        Run the sub-block if X is not empty. Params holds the other inputs of the sub-block, and Out holds its outputs.

        Inputs:
        • X (Duplicable) : The conditional variable of this operator. If X is empty, the whole sub-block will not be executed.
        • Params (Duplicable) : The input variables of the sub-block.
        Outputs:
        • Out (Duplicable) : The output variables of the sub-block.
        • Scope : (std::vector<Scope*>) The step scope of conditional block. To unify the conditional block, rnn and while op, the type of scope is std::vector<Scope*>
        Attributes:
        • block (Duplicable): The step block of conditional block operator

        lookup_table

        Lookup Table Operator.

        This operator performs lookups on the parameter W and then concatenates the results into a dense tensor.

        The input Ids can carry the LoD (Level of Details) information, or not. And the output only shares the LoD information with input Ids.

        Inputs:
        • W : An input that represents the embedding tensor, which is a learnable parameter.
        • Ids : An input of type int32 or int64 that contains the ids to be looked up in W. Ids must be a column vector with rank = 2. The 2nd dimension size must be 1.
        Outputs:
        • Out : The lookup results, which have the same type as W.
        Attributes:
        • is_sparse (Duplicable): (boolean, default false) Sparse update

        pad

        Pad Operator.

        Pad input into output, as specified by paddings and pad_value. The input should be a k-D tensor(k > 0 and k < 7). As an example:

        Given:

        X = [[1, 2], [3, 4]],

        paddings = [0, 1, 1, 2],

        and

        pad_value = 0,

        we have:

        Out = [[0, 1, 2, 0, 0], [0, 3, 4, 0, 0], [0, 0, 0, 0, 0]]

        Inputs:
        • X : The input of pad op. The input should be a k-D tensor(k > 0 and k < 7)
        Outputs:
        • Out : The output of pad op. A tensor with the same rank as X, in which each dimension is enlarged by the corresponding paddings.
        Attributes:
        • paddings (Duplicable): (vector<int>) A list<int> to describe the padding rules for each dimension. For 2-D image tensor, paddings=[0, 1, 2, 3] means padding 0 row to top, 1 row to bottom, 2 columns to left and 3 columns to right. Size of paddings should be equal to 2 * dimension size of the input tensor.
        • pad_value (Duplicable): (float, default 0.0) The value to fill the padded areas.
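
For the 2-D case, the padding rule can be sketched with nested lists. The function name is hypothetical and the sketch handles only rank-2 input, reading paddings as [top, bottom, left, right] in line with the description of the paddings attribute.

```python
def pad2d(x, paddings, pad_value=0):
    """Pad a 2-D list; paddings = [top, bottom, left, right]."""
    top, bottom, left, right = paddings
    width = left + len(x[0]) + right
    rows = [[pad_value] * width for _ in range(top)]
    # Existing rows gain `left` and `right` padded columns.
    rows += [[pad_value] * left + list(row) + [pad_value] * right for row in x]
    rows += [[pad_value] * width for _ in range(bottom)]
    return rows


out = pad2d([[1, 2], [3, 4]], paddings=[0, 1, 1, 2], pad_value=0)
print(out)  # [[0, 1, 2, 0, 0], [0, 3, 4, 0, 0], [0, 0, 0, 0, 0]]
```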

        split_lod_tensor

            Split a LoDTensor with a Mask at a certain level. Suppose the input
            LoDTensor has 3 sequences at a certain lod level and the Mask is a bool
            column vector, such as [0, 1, 0], at the same level. The first and third
            sequences will be sent to the False Output LoDTensor, whereas the second
            sequence will be sent to the True Output LoDTensor. Please refer to MergeLoDTensorOp.
        
        Inputs:
        • X : The input LoDTensor
        • Mask : A bool column vector which masks the input
        Outputs:
        • OutTrue : True branch of input LoDTensor
        • OutFalse : False branch of input LoDTensor
        Attributes:
        • level (Duplicable): (int) the specific lod level to split.
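
The routing described above can be sketched with a list of sequences standing in for one LoD level. The function name is hypothetical, and the sketch ignores how the LoD offsets of the two outputs are rebuilt.

```python
def split_lod_tensor(sequences, mask):
    """Route each sequence to the true or false output according to mask."""
    out_true = [seq for seq, m in zip(sequences, mask) if m]
    out_false = [seq for seq, m in zip(sequences, mask) if not m]
    return out_true, out_false


sequences = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
out_true, out_false = split_lod_tensor(sequences, mask=[0, 1, 0])
print(out_true)   # [[3.0]]
print(out_false)  # [[1.0, 2.0], [4.0, 5.0, 6.0]]
```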

        max_sequence_len

        Calculate the max sequence length through lod_rank_table.

        Inputs:
        • RankTable : The lod_rank_table.
        Outputs:
        • Out : The max sequence length.

        multiplex

        Multiplex Operator.

        Multiplex multiple tensors according to the index provided by the index tensor.

        Ids: the index tensor. X[0 : N - 1]: the candidate tensors for output (N >= 2). For each index i from 0 to batchSize - 1, the output is the i-th row of the (Ids[i])-th tensor.

        For i-th row of the output tensor:

        $$y[i] = x_{k}[i]$$

        where y is the output tensor, x_{k} is the k-th input tensor, and k = Ids[i].

        Inputs:
        • Ids : The index tensor of multiplex operator.
        • X (Duplicable) : The candidate tensors of multiplex operator.
        Outputs:
        • Out : The output tensor of multiplex operator.
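
The row selection rule $y[i] = x_{k}[i]$ with k = Ids[i] can be sketched as follows. The function name is hypothetical, and each candidate tensor is represented as a list of rows.

```python
def multiplex(ids, candidates):
    """Row i of the output is row i of candidates[ids[i]]."""
    return [list(candidates[k][i]) for i, k in enumerate(ids)]


x0 = [[1, 2], [3, 4]]
x1 = [[5, 6], [7, 8]]
out = multiplex(ids=[1, 0], candidates=[x0, x1])
print(out)  # [[5, 6], [3, 4]]
```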

        stanh

        STanh Activation Operator.

        $$y = b * \frac{e^{a * x} - e^{-a * x}}{e^{a * x} + e^{-a * x}}$$

        Inputs:
        • X : Input of STanh operator
        Outputs:
        • Y : Output of STanh operator
        Attributes:
        • scale_a (Duplicable): The scale parameter of a for the input
        • scale_b (Duplicable): The scale parameter of b for the input

        adamax

        Adamax Optimizer.

        We implement the Adamax optimizer from Section 7 of the Adam paper: https://arxiv.org/abs/1412.6980. Adamax is a variant of the Adam algorithm based on the infinity norm.

        Adamax updates:

        $$ moment\_out = \beta_1 * moment + (1 - \beta_1) * grad \\ inf\_norm\_out = max(\beta_2 * inf\_norm + \epsilon, |grad|) \\ learning\_rate = \frac{learning\_rate}{1 - \beta_{1\_pow}} \\ param\_out = param - learning\_rate * \frac{moment\_out}{inf\_norm\_out} $$

        The original paper does not have an epsilon attribute. However, it is added here for numerical stability to prevent the division by 0 error.

        Inputs:
        • Param : (Tensor) Input parameter
        • Grad : (Tensor) Input gradient
        • LearningRate : (Tensor) Learning rate
        • Moment : (Tensor) First moment
        • InfNorm : (Tensor) Input exponentially weighted infinity norm
        • Beta1Pow : (Tensor) Input beta1 power accumulator
        Outputs:
        • ParamOut : (Tensor) Output parameter
        • MomentOut : (Tensor) Output first moment
        • InfNormOut : (Tensor) Output exponentially weighted infinity norm
        Attributes:
        • beta1 (Duplicable): (float, default 0.9) Exponential decay rate for the 1st moment estimates.
        • beta2 (Duplicable): (float, default 0.999) exponential decay rate for the weighted infinity norm estimates.
        • epsilon (Duplicable): (float, default 1.0e-8) Constant for numerical stability

        l1_norm

        L1 Norm Operator.

        Computes the L1 norm of a tensor.

        $$Out = \sum{|X|}$$

        Inputs:
        • X : (Tensor) The input of l1_norm op.
        Outputs:
        • Out : (Scalar) The output of l1_norm op.

        dropout

        Dropout Operator.

        Dropout refers to randomly dropping out units in a neural network. It is a regularization technique for reducing overfitting by preventing neuron co-adaptation during training. The dropout operator randomly sets (according to the given dropout probability) the outputs of some units to zero, while the others are set equal to their corresponding inputs.

        Inputs:
        • X : The input of dropout op.
        Outputs:
        • Out : The output of dropout op.
        • Mask (Intermediate) : The random sampled dropout mask.
        Attributes:
        • dropout_prob (Duplicable): Probability of setting units to zero.
        • is_test (Duplicable): True if in test phase.
        • seed (Duplicable): Dropout random seed.
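A minimal sketch of the training-phase behavior described above, using Python's standard `random` module. The sampling scheme and the helper name `dropout_train` are assumptions made for illustration; the operator's actual random generator and its test-phase behavior are not reproduced here.

```python
import random


def dropout_train(x, dropout_prob, seed=0):
    """Zero each unit with probability dropout_prob; pass the rest through."""
    rng = random.Random(seed)
    # mask[i] is 0.0 where the unit is dropped and 1.0 where it is kept.
    mask = [0.0 if rng.random() < dropout_prob else 1.0 for _ in x]
    out = [v * m for v, m in zip(x, mask)]
    return out, mask


out, mask = dropout_train([1.0, 2.0, 3.0, 4.0], dropout_prob=0.5, seed=1)
print(out, mask)
```

Returning the mask alongside the output mirrors the intermediate Mask output listed above, which a backward pass can reuse.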

        lod_tensor_to_array

        Inputs:
        • X :
        • RankTable :
        Outputs:
        • Out :

        pool2d_cudnn

        Pool2d Operator.

        The pooling2d operation calculates the output based on the input, pooling_type and ksize, strides, paddings parameters. Input(X) and output(Out) are in NCHW format, where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature. Parameters(ksize, strides, paddings) are two elements. These two elements represent height and width, respectively. The input(X) size and output(Out) size may be different.

        Example:
        Input: X shape: $(N, C, H_{in}, W_{in})$ Output: Out shape: $(N, C, H_{out}, W_{out})$ Where $$ H_{out} = \frac{(H_{in} - ksize[0] + 2 * paddings[0])}{strides[0]} + 1 \\ W_{out} = \frac{(W_{in} - ksize[1] + 2 * paddings[1])}{strides[1]} + 1 $$

        Inputs:
        • X : (Tensor) The input tensor of pooling operator. The format of input tensor is NCHW, where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature.
        Outputs:
        • Out : (Tensor) The output tensor of pooling operator. The format of output tensor is also NCHW, where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature.
        Attributes:
        • pooling_type (Duplicable): (string), pooling type, can be "max" for max-pooling and "avg" for average-pooling.
        • ksize (Duplicable): (vector<int>) The pooling window size(height, width) of the pooling operator. If global_pooling = true, ksize and paddings will be ignored.
        • global_pooling (Duplicable): (bool, default false) Whether to use the global pooling. If global_pooling = true, ksize and paddings will be ignored.
        • strides (Duplicable): (vector<int>, default {1, 1}), strides(height, width) of pooling operator.
        • paddings (Duplicable): (vector<int>, default {0, 0}), paddings(height, width) of pooling operator. If global_pooling = true, paddings and ksize will be ignored.
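The output-shape formula in the example above can be checked with a few lines of Python. The helper name `pool2d_output_shape` is hypothetical, and the use of floor division for the fraction is an assumption; global_pooling is not modeled.

```python
def pool2d_output_shape(in_shape, ksize, strides=(1, 1), paddings=(0, 0)):
    """Return (N, C, H_out, W_out) for an NCHW input (illustrative sketch)."""
    n, c, h_in, w_in = in_shape
    # Assumption: the division in the formula is rounded down.
    h_out = (h_in - ksize[0] + 2 * paddings[0]) // strides[0] + 1
    w_out = (w_in - ksize[1] + 2 * paddings[1]) // strides[1] + 1
    return (n, c, h_out, w_out)


print(pool2d_output_shape((1, 3, 32, 32), ksize=(2, 2), strides=(2, 2)))
```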

        conv2d_transpose_cudnn

        Convolution2D Transpose Operator.

        The convolution transpose operation calculates the output based on the input, filter and strides, paddings, groups parameters. The size of each dimension of the parameters is checked in the infer-shape. Input(Input) and output(Output) are in NCHW format, where N is batch size, C is the number of channels, H is the height of the feature, and W is the width of the feature. Filter(Input) is in MCHW format, where M is the number of input feature channels, C is the number of output feature channels, H is the height of the filter, and W is the width of the filter. Parameters(strides, paddings) are two elements. These two elements represent height and width, respectively. The input(X) size and output(Out) size may be different.

        Example: Input: Input shape: $(N, C_{in}, H_{in}, W_{in})$ Filter shape: $(C_{in}, C_{out}, H_f, W_f)$ Output: Output shape: $(N, C_{out}, H_{out}, W_{out})$ Where $$ H_{out} = (H_{in} - 1) * strides[0] - 2 * paddings[0] + H_f \\ W_{out} = (W_{in} - 1) * strides[1] - 2 * paddings[1] + W_f $$

        Inputs:
        • Input : (Tensor) The input tensor of convolution transpose operator. The format of input tensor is NCHW, where N is batch size, C is the number of input channels, H is the height of the feature, and W is the width of the feature.
        • Filter : (Tensor) The filter tensor of convolution transpose operator. The format of the filter tensor is MCHW, where M is the number of input feature channels, C is the number of output feature channels, H is the height of the filter, and W is the width of the filter. We enforce groups number == 1 in the convolution transpose scenario.
        Outputs:
        • Output : (Tensor) The output tensor of convolution transpose operator. The format of output tensor is also NCHW.
        Attributes:
        • strides (Duplicable): (vector<int> default:{1, 1}), the strides(h_stride, w_stride) of convolution transpose operator.
        • paddings (Duplicable): (vector<int> default:{0, 0}), the paddings(h_pad, w_pad) of convolution transpose operator.
        • dilations (Duplicable): dilations of convolution operator.
        • workspace_size_MB (Duplicable): workspace size for cudnn, in MB. The workspace is a section of GPU memory which will be allocated/freed each time the operator runs. A larger workspace size can increase performance but also requires better hardware. This size should be chosen carefully.
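As a quick check of the transpose-convolution shape formula above, the following sketch computes the output shape from the input and filter shapes. The helper name `conv2d_transpose_output_shape` is hypothetical, and dilations are not modeled.

```python
def conv2d_transpose_output_shape(in_shape, filter_shape,
                                  strides=(1, 1), paddings=(0, 0)):
    """Return (N, C_out, H_out, W_out) following the formula above."""
    n, c_in, h_in, w_in = in_shape
    f_c_in, c_out, h_f, w_f = filter_shape
    if f_c_in != c_in:
        raise ValueError("filter dim 0 must equal the input channel count")
    h_out = (h_in - 1) * strides[0] - 2 * paddings[0] + h_f
    w_out = (w_in - 1) * strides[1] - 2 * paddings[1] + w_f
    return (n, c_out, h_out, w_out)


print(conv2d_transpose_output_shape((1, 8, 4, 4), (8, 16, 3, 3),
                                    strides=(2, 2), paddings=(1, 1)))
```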

        gaussian_random

        GaussianRandom Operator.

        Used to initialize tensors with gaussian random generator.

        Inputs:
          Outputs:
          • Out : Output matrix of gaussian random op
          Attributes:
          • shape (Duplicable): (vector<int>) The dimension of random tensor.
          • mean (Duplicable): (float, default 0.0) mean of random tensor.
          • std (Duplicable): (float, default 1.0) std of random tensor.
          • seed (Duplicable): (int, default 0) Random seed of generator. 0 means use the system-wide seed.
          • dtype (Duplicable): (int, default 5(FP32)) Output data type.

          lstm_unit

          Lstm Unit Operator

          Equation:

          $$ i, f, o, j = split(X) \\ C = C_{prev} * sigm(f + forget\_bias) + sigm(i) * tanh(j) \\ H = C * sigm(o) $$

          Inputs:
          • X : FC input before the non-linear activation.
          • C_prev : The cell state tensor of last time-step in the Lstm Unit operator.
          Outputs:
          • C : The cell tensor of Lstm Unit operator.
          • H : The hidden state tensor of Lstm Unit operator.
          Attributes:
          • forget_bias (Duplicable): (float, default 0.0) The forget bias of Lstm Unit.
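The LSTM unit equation can be traced for a single sample with the sketch below. It assumes that split(X) cuts X into four equal contiguous parts in the order i, f, o, j; that layout and the helper names are assumptions for illustration.

```python
import math


def sigm(v):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-v))


def lstm_unit(x, c_prev, forget_bias=0.0):
    """Compute (C, H) for one sample; x has 4 * len(c_prev) entries."""
    d = len(c_prev)
    if len(x) != 4 * d:
        raise ValueError("x must hold the four gates i, f, o, j")
    # Assumed layout: four equal contiguous chunks in the order i, f, o, j.
    i, f, o, j = (x[k * d:(k + 1) * d] for k in range(4))
    c = [cp * sigm(fv + forget_bias) + sigm(iv) * math.tanh(jv)
         for cp, fv, iv, jv in zip(c_prev, f, i, j)]
    # As in the equation above, H is C gated by sigm(o).
    h = [cv * sigm(ov) for cv, ov in zip(c, o)]
    return c, h


print(lstm_unit([0.0, 0.0, 0.0, 0.0], [1.0]))
```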

          sign

          Sign operator

          $$Out = X.sign()$$

          Inputs:
          • X : (Tensor) Input tensor of sign operator.
          Outputs:
          • Out : (Tensor) Output tensor of sign operator.

          pow

          Pow Activation Operator.

          $y = x^{factor}$

          Inputs:
          • X : Input of Pow operator
          Outputs:
          • Y : Output of Pow operator
          Attributes:
          • factor (Duplicable): The exponential factor of Pow

          clip

          Clip Operator.

          The clip operator limits the value of given input within an interval. The interval is specified with arguments 'min' and 'max':

          $$ Out = \min(\max(X, min), max) $$

          Inputs:
          • X : (Tensor) The input of clip op. The number of dimensions must be between [1, 9].
          Outputs:
          • Out : (Tensor) The output of clip op with shape as input(X)
          Attributes:
          • min (Duplicable): (float) Minimum value, under which element is replaced by min.
          • max (Duplicable): (float) Maximum value, above which element is replaced by max

          huber_loss

          HuberLoss Operator.

          Huber loss is a loss function used in robust regression. We define X as the input value and Y as the target value. Huber loss can evaluate the fitness of X to Y. Different from MSE loss, Huber loss is more robust for outliers. The shape of X and Y are [batch_size, 1]. The equation is:

          $$ Out_{\delta}(X, Y)_i = \begin{cases} 0.5 * (Y_i - X_i)^2, \quad |Y_i - X_i| \leq \delta \\ \delta * (|Y_i - X_i| - 0.5 * \delta), \quad otherwise \end{cases} $$

          In the above equation, $Out_{\delta}(X, Y)_i$, $X_i$ and $Y_i$ represent the ith element of Out, X and Y.

          Inputs:
          • X : The input value of huber loss op. X is a 2-D tensor with shape [batch_size, 1].
          • Y : The target value of huber loss op. Y is a 2-D tensor with shape [batch_size, 1].
          Outputs:
          • Residual (Intermediate) : Intermediate tensor to cache residual value between Y and X. The shape is the same as Input(X) and will be reused in backward.
          • Out : The output tensor with shape [batch_size, 1] which represents the huber loss.
          Attributes:
          • delta (Duplicable): Hyper parameter in huber loss.
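The piecewise Huber formula above can be sketched on flat lists, one entry per batch element. The helper name `huber_loss` is hypothetical; the [batch_size, 1] tensors are represented as plain lists for brevity.

```python
def huber_loss(x, y, delta):
    """Element-wise Huber loss for inputs x and targets y."""
    out = []
    for xi, yi in zip(x, y):
        residual = yi - xi
        if abs(residual) <= delta:
            # Quadratic region: behaves like half the squared error.
            out.append(0.5 * residual * residual)
        else:
            # Linear region: grows with |residual|, so outliers weigh less.
            out.append(delta * (abs(residual) - 0.5 * delta))
    return out


print(huber_loss([0.0, 0.0], [0.5, 3.0], delta=1.0))
```

The two branches agree at |Y_i - X_i| = delta, which is what keeps the loss continuous.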

          smooth_l1_loss

          Smooth L1 Loss Operator.

          This operator computes the smooth l1 loss for X and Y. The operator takes the first dimension of X and Y as batch size. For each instance, it computes the smooth l1 loss element by element first and then sums all the losses. So the shape of Out is [batch_size, 1].

          The equation is: $$ Out_{\sigma}(X, Y)_i = \begin{cases} 0.5 * (\sigma * (X_i - Y_i)) ^ 2, \quad |X_i - Y_i| \lt \frac{1} {{\sigma} ^ 2} \\ |X_i - Y_i| - \frac{0.5}{{\sigma}^2}, \quad otherwise \end{cases} $$

          In the above equation, $Out_{\sigma}(X, Y)_i$, $X_i$ and $Y_i$ represent the ith element of Out, X and Y.

          Inputs:
          • X : (Tensor, default Tensor<float>) A tensor with rank at least 2. The input value of smooth l1 loss op with shape [batch_size, dim1, ..., dimN].
          • Y : (Tensor, default Tensor<float>) A tensor with rank at least 2. The target value of smooth l1 loss op with same shape as X.
          • InsideWeight : (Tensor, default Tensor<float>) A tensor with rank at least 2. This input is optional and should have same shape with X. If provided, the result of (X - Y) will be multiplied by this tensor element by element.
          • OutsideWeight : (Tensor, default Tensor<float>) A tensor with rank at least 2. This input is optional and should have same shape with X. If provided, the out smooth l1 loss will be multiplied by this tensor element by element.
          Outputs:
          • Diff (Intermediate) : Intermediate variable to cache InsideWeight * (X - Y).
          • Out : (Tensor, default Tensor<float>) A tensor of rank 2. The output smooth l1 loss with shape [batch_size, 1].
          Attributes:
          • sigma (Duplicable): Hyper parameter of smooth l1 loss op. A float scalar with default value 3.0.

          sum

          Sum operator.

          This operator sums the input tensors. All the inputs can carry the LoD (Level of Details) information. However, the output only shares the LoD information with the first input.

          Inputs:
          • X (Duplicable) : (vector<Tensor>) The input tensors of sum operator.
          Outputs:
          • Out : (Tensor) The output tensor of sum operator.

          concat

          Concat Operator.

          Concatenate the input tensors along dimension axis.

          Examples:

          Input[0] = [[1,2],[3,4]]

          Input[1] = [[5,6]]

          axis = 0

          Output = [[1,2], [3,4], [5,6]]

          Inputs:
          • X (Duplicable) : Input tensors of concat operator.
          Outputs:
          • Out : Output tensor of concat operator.
          Attributes:
          • axis (Duplicable): The axis along which the input tensors will be concatenated.

          softmax_with_cross_entropy

          Softmax With Cross Entropy Operator.

          Cross entropy loss with softmax is used as the output layer extensively. This operator computes the softmax normalized values for each row of the input tensor, after which cross-entropy loss is computed. This provides a more numerically stable gradient.

          Because this operator performs a softmax on logits internally, it expects unscaled logits. This operator should not be used with the output of softmax operator since that would produce incorrect results.

          When the attribute soft_label is set to false, this operator expects mutually exclusive hard labels: each sample in a batch belongs to exactly one class with a probability of 1.0, so each sample has a single label.

          The equation is as follows:

          1) Hard label (one-hot label, so every sample has exactly one class)

          $$Loss_j = -\text{Logit}_{Label_j} + \log\left(\sum_{i=1}^{K}\exp(\text{Logit}_i)\right), j = 1,..., N$$

          2) Soft label (each sample can have a distribution over all classes)

          $$Loss_j = -\sum_{i=1}^{K}\text{Label}_i \left(\text{Logit}_i - \log\left(\sum_{k=1}^{K}\exp(\text{Logit}_k)\right)\right), j = 1,...,N$$

          Inputs:
          • Logits : (Tensor, default: Tensor<float>), The unscaled log probabilities which is a 2-D tensor with shape [N x K]. N is the batch_size, and K is the class number.
          • Label : (Tensor) The ground truth which is a 2-D tensor. If soft_label is set to false, Label is a Tensor<int64> with shape [N x 1]. If soft_label is set to true, Label is a Tensor<float/double> with shape [N x K].
          Outputs:
          • Softmax (Intermediate) : (Tensor, default: Tensor<float>), A 2-D tensor with shape [N x K]. The softmax activation values for the given input batch, which will be used in the backward calculation.
          • Loss : (Tensor, default: Tensor<float>), A 2-D tensor. The cross entropy loss with shape [N x 1].
          Attributes:
          • soft_label (Duplicable): (bool, default: false), A flag to indicate whether to interpret the given labels as soft labels.
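The two loss variants above can be sketched with a log-sum-exp that subtracts the row maximum first, a common way to get the numerical stability mentioned above. Whether the operator uses exactly this trick is an assumption; the helper names are hypothetical, and hard labels are passed as a flat list of class indices rather than an [N x 1] tensor.

```python
import math


def log_softmax(logits):
    """Log of the softmax of one row, computed without overflow."""
    m = max(logits)
    # Shifting by the maximum keeps every exp() argument <= 0.
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def softmax_with_cross_entropy(logits, label, soft_label=False):
    """Return one loss per row of logits (illustrative sketch)."""
    losses = []
    for row, lab in zip(logits, label):
        log_probs = log_softmax(row)
        if soft_label:
            # lab is a distribution over the K classes.
            losses.append(-sum(p * lp for p, lp in zip(lab, log_probs)))
        else:
            # lab is the index of the single correct class.
            losses.append(-log_probs[lab])
    return losses


print(softmax_with_cross_entropy([[2.0, 1.0, 0.1]], [0]))
print(softmax_with_cross_entropy([[2.0, 1.0, 0.1]], [[1.0, 0.0, 0.0]],
                                 soft_label=True))
```

A one-hot soft label gives the same loss as the corresponding hard label, which the two calls above illustrate.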

          fill_constant_batch_size_like

          FillConstantBatchSizeLike Operator.

          Fill up a variable with specified constant value.

          Inputs:
          • Input : (Tensor) Tensor whose dim_idx th dimension is used to specify the batch_size
          Outputs:
          • Out : (Tensor) Tensor of specified shape will be filled with the specified value
          Attributes:
          • dtype (Duplicable): (int, default 5 (FP32)) Output data type
          • shape (Duplicable): (vector<int>) The shape of the output
          • input_dim_idx (Duplicable): (int, default 0) The index of input's batch size dimension
          • output_dim_idx (Duplicable): (int, default 0) The index of output's batch size dimension
          • value (Duplicable): (float, default 0) The value to be filled

          adadelta

          Adadelta Optimizer.

          Adadelta optimizer is implemented as explained in: https://arxiv.org/abs/1212.5701. Adadelta is a per-dimension adaptive learning rate method used for gradient descent.

          Adadelta updates are as follows:

          $$ avg\_squared\_grad\_out = \rho * avg\_squared\_grad + (1 - \rho) * grad * grad \\ param\_update = - \sqrt{\frac{avg\_squared\_update + \epsilon}{avg\_squared\_grad\_out + \epsilon}} * grad \\ avg\_squared\_update\_out = \rho * avg\_squared\_update + (1 - \rho) * {param\_update}^2 \\ param\_out = param + param\_update $$

          Inputs:
          • Param : (Tensor) Input parameter
          • Grad : (Tensor) Input gradient
          • AvgSquaredGrad : (Tensor) Input average of squared gradient
          • AvgSquaredUpdate : (Tensor) Input average of squared parameter updates
          Outputs:
          • ParamOut : (Tensor) Output parameter
          • AvgSquaredGradOut : (Tensor) Output average of squared gradient
          • AvgSquaredUpdateOut : (Tensor) Output average of squared parameter updates
          Attributes:
          • rho (Duplicable): (float, default 0.95) Exponential decay rate for squared gradients.
          • epsilon (Duplicable): (float, default 1.0e-6) Constant for numerical stability
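A plain-Python sketch of one Adadelta step on flat lists, following the four update equations above in order. The function and argument names are hypothetical and the code is not the operator's kernel.

```python
import math


def adadelta_step(param, grad, avg_squared_grad, avg_squared_update,
                  rho=0.95, epsilon=1e-6):
    """One Adadelta update on flat lists of floats (illustrative sketch)."""
    asg_out = [rho * a + (1.0 - rho) * g * g
               for a, g in zip(avg_squared_grad, grad)]
    # The step size is the ratio of the two running RMS values.
    update = [-math.sqrt((u + epsilon) / (a + epsilon)) * g
              for u, a, g in zip(avg_squared_update, asg_out, grad)]
    asu_out = [rho * u + (1.0 - rho) * d * d
               for u, d in zip(avg_squared_update, update)]
    param_out = [p + d for p, d in zip(param, update)]
    return param_out, asg_out, asu_out


print(adadelta_step([1.0], [1.0], [0.0], [0.0]))
```

Note that no learning rate appears among the inputs: the scale of each step comes entirely from the two running averages.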

          log

          Log Activation Operator.

          $y = ln(x)$

          Natural logarithm of x.

          Inputs:
          • X : Input of Log operator
          Outputs:
          • Y : Output of Log operator

          conv3d_cudnn

          Convolution3D Operator.

          The convolution operation calculates the output based on the input, filter and strides, paddings, dilations, groups parameters. The size of each dimension of the parameters is checked in the infer-shape. Input(Input) and output(Output) are in NCDHW format, where N is batch size, C is the number of channels, D is the depth of the feature, H is the height of the feature, and W is the width of the feature. Filters(Input) is in MCDHW format, where M is the number of output image channels, C is the number of input image channels, D is the depth of the filter, H is the height of the filter, and W is the width of the filter. Parameters(strides, paddings, dilations) are three elements. These three elements represent depth, height and width, respectively. The input(X) size and output(Out) size may be different.

          Example: Input: Input shape: $(N, C_{in}, D_{in}, H_{in}, W_{in})$ Filter shape: $(C_{out}, C_{in}, D_f, H_f, W_f)$ Output: Output shape: $(N, C_{out}, D_{out}, H_{out}, W_{out})$ Where $$ D_{out}= \frac{(D_{in} + 2 * paddings[0] - (dilations[0] * (D_f - 1) + 1))}{ strides[0]}+ 1 \\ H_{out}= \frac{(H_{in} + 2 * paddings[1] - (dilations[1] * (H_f - 1) + 1))}{ strides[1]}+ 1 \\ W_{out}= \frac{(W_{in} + 2 * paddings[2] - (dilations[2] * (W_f - 1) + 1))}{ strides[2]}+ 1 $$

          Inputs:
          • Input : (Tensor) The input tensor of convolution operator. The format of input tensor is NCDHW, where N is batch size, C is the number of channels, D is the depth of the feature, H is the height of the feature, and W is the width of the feature.
          • Filter : (Tensor) The filter tensor of convolution operator. The format of the filter tensor is MCDHW, where M is the number of output image channels, C is the number of input image channels, D is the depth of the filter, H is the height of the filter, and W is the width of the filter. If the groups attribute is greater than 1, C equals the number of input image channels divided by the groups.
          Outputs:
          • Output : (Tensor) The output tensor of convolution operator. The format of output tensor is also NCDHW.
          Attributes:
          • strides (Duplicable): (vector<int>, default:{1, 1, 1}), the strides(d_stride, h_stride, w_stride) of convolution operator.
          • paddings (Duplicable): (vector<int>, default:{0, 0, 0}), the paddings(d_pad, h_pad, w_pad) of convolution operator.
          • groups (Duplicable): (int default:1), the groups number of the convolution operator. According to grouped convolution in Alex Krizhevsky's Deep CNN paper: when group=2, the first half of the filters is only connected to the first half of the input channels, while the second half of the filters is only connected to the second half of the input channels.
          • dilations (Duplicable): (vector<int> default:{1, 1, 1}), the dilations(d_dilation, h_dilation, w_dilation) of convolution operator.
          • workspace_size_MB (Duplicable): workspace size for cudnn, in MB, workspace is a section of GPU memory which will be allocated/freed each time the operator runs, larger workspace size can increase performance but also requires better hardware. This size should be chosen carefully.

          conv3d_transpose_cudnn

          Convolution3D Transpose Operator.

          The convolution transpose operation calculates the output based on the input, filter and strides, paddings, groups parameters. The size of each dimension of the parameters is checked in the infer-shape. Input(Input) and output(Output) are in NCDHW format, where N is batch size, C is the number of channels, D is the depth of the feature, H is the height of the feature, and W is the width of the feature. Filter(Input) is in MCDHW format, where M is the number of input feature channels, C is the number of output feature channels, D is the depth of the filter, H is the height of the filter, and W is the width of the filter. Parameters(strides, paddings) are three elements. These three elements represent depth, height and width, respectively. The input(X) size and output(Out) size may be different.

          Example:
          Input: Input shape: $(N, C_{in}, D_{in}, H_{in}, W_{in})$ Filter shape: $(C_{in}, C_{out}, D_f, H_f, W_f)$ Output: Output shape: $(N, C_{out}, D_{out}, H_{out}, W_{out})$ Where $$ D_{out} = (D_{in} - 1) * strides[0] - 2 * paddings[0] + D_f \\ H_{out} = (H_{in} - 1) * strides[1] - 2 * paddings[1] + H_f \\ W_{out} = (W_{in} - 1) * strides[2] - 2 * paddings[2] + W_f $$

          Inputs:
          • Input : (Tensor) The input tensor of convolution transpose operator. The format of input tensor is NCDHW, where N is batch size, C is the number of channels, D is the depth of the feature, H is the height of the feature, and W is the width of the feature.
          • Filter : (Tensor) The filter tensor of convolution transpose operator. The format of the filter tensor is MCDHW, where M is the number of input feature channels, C is the number of output feature channels, D is the depth of the filter, H is the height of the filter, and W is the width of the filter. We enforce groups number == 1 and padding == 0 in the convolution3d transpose scenario.
          Outputs:
          • Output : (Tensor) The output tensor of convolution transpose operator. The format of output tensor is also NCDHW, where N is batch size, C is the number of channels, D is the depth of the feature, H is the height of the feature, and W is the width of the feature.
          Attributes:
          • strides (Duplicable): (vector<int> default:{1, 1, 1}), the strides(d_stride, h_stride, w_stride) of convolution transpose operator.
          • paddings (Duplicable): (vector<int> default:{0, 0, 0}), paddings(d_pad, h_pad, w_pad) of convolution transpose operator.
          • dilations (Duplicable): dilations of convolution operator.
          • workspace_size_MB (Duplicable): workspace size for cudnn, in MB. The workspace is a section of GPU memory which will be allocated/freed each time the operator runs. A larger workspace size can increase performance but also requires better hardware. This size should be chosen carefully.

          cross_entropy

          CrossEntropy Operator.

          It supports both standard cross-entropy and soft-label cross-entropy loss computation.

          1) One-hot cross-entropy: soft_label = false, Label[i, 0] indicates the class index for sample i:

                      $Y[i] = -\log(X[i, Label[i]])$
          

          2) Soft-label cross-entropy: soft_label = true, Label[i, j] indicates the soft label of class j for sample i:

                      $Y[i] = \sum_j{-Label[i, j] * log(X[i, j])}$
          

          Please make sure that in this case each row of Label sums to one.

          3) One-hot cross-entropy with vectorized Input(Label): As a special case of 2), when each row of Input(Label) has only one non-zero element (equal to 1), soft-label cross-entropy degenerates to a one-hot cross-entropy with one-hot label representation.

          Both the input X and Label can carry the LoD (Level of Details) information, or not. But the output only shares the LoD information with input X.

          Inputs:
          • X : (Tensor, default Tensor<float>), a 2-D tensor with shape N x D, where N is the batch size and D is the number of classes. This input is a probability computed by the previous operator, which is almost always the result of a softmax operator.
          • Label : (Tensor), the ground truth which is a 2-D tensor. When soft_label is set to false, Label is a Tensor<int64> with shape [N x 1]. When soft_label is set to true, Label is a Tensor<float/double> with shape [N x D].
          Outputs:
          • Y : (Tensor, default Tensor<float>), a 2-D tensor with shape [N x 1]. The cross entropy loss.
          Attributes:
          • soft_label (Duplicable): (bool, default false), a flag indicating whether to interpret the given labels as soft labels.
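The three cases above can be sketched as follows, with X given as rows of probabilities. The helper name `cross_entropy` is hypothetical, hard labels are passed as a flat list of class indices, and skipping zero-weight terms in the soft-label sum is an assumption made here to avoid evaluating log(0).

```python
import math


def cross_entropy(x, label, soft_label=False):
    """Return one loss per row of x, where each row is a probability vector."""
    losses = []
    for row, lab in zip(x, label):
        if soft_label:
            # lab is a distribution over classes; terms with zero weight
            # are skipped (an assumption of this sketch).
            losses.append(-sum(w * math.log(p)
                               for w, p in zip(lab, row) if w > 0.0))
        else:
            # lab is the class index of the sample.
            losses.append(-math.log(row[lab]))
    return losses


probs = [[0.25, 0.75]]
print(cross_entropy(probs, [1]))
print(cross_entropy(probs, [[0.0, 1.0]], soft_label=True))
```

The second call is case 3): a one-hot row used as a soft label reproduces the hard-label result.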

          matmul

          MatMul Operator.

          This operator is used to perform (batched) matrix multiplication over the last two dimensions of the input tensors X and Y.

          If a transpose flag is specified, the last two dimensions of the tensor are transposed. If the tensor is rank-1 of shape [D], then for X it is treated as [1, D] in nontransposed form and as [D, 1] in transposed form, whereas for Y it is the opposite: It is treated as [D, 1] in nontransposed form and as [1, D] in transposed form.

          Examples without transpose:
          - X: [K], Y: [K] => Out: [1]
          - X: [K], Y: [K, N] => Out: [N]
          - X: [B, M, K], Y: [K] => Out: [B, M]
          - X: [M, K], Y: [B, K, N] => Out: [B, M, N]
          - X: [B, M, K], Y: [B, K, N] => Out: [B, M, N]

          The behavior is designed to be similar to the numpy.matmul function. The differences are:
          - Currently only rank 1 to rank 3 input tensors are supported.
          - We add transpose_X and transpose_Y flags.

          Both the input X and Y can carry the LoD (Level of Details) information, or not. But the output only shares the LoD information with input X.

          Inputs:
          • X : The first input of MatMul op
          • Y : The second input of MatMul op
          Outputs:
          • Out : The output of MatMul op
          Attributes:
          • transpose_X (Duplicable): If true, use the transpose of `X`.
          • transpose_Y (Duplicable): If true, use the transpose of `Y`.
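The shape rules above can be made concrete with a small shape-inference sketch that reproduces the listed examples. The helper name `matmul_out_shape` is hypothetical. The source only shows examples without transpose, so how the promoted unit dimension of a transposed rank-1 input is treated in the output (kept, in this sketch) is an assumption.

```python
def matmul_out_shape(x_shape, y_shape, transpose_x=False, transpose_y=False):
    """Infer the output shape of a (batched) matrix product."""
    x, y = list(x_shape), list(y_shape)
    if not (1 <= len(x) <= 3 and 1 <= len(y) <= 3):
        raise ValueError("only rank 1 to rank 3 inputs are supported")
    x_vec, y_vec = len(x) == 1, len(y) == 1
    # Rank-1 X: [D] -> [1, D], or [D, 1] when transposed.
    if x_vec:
        x = [x[0], 1] if transpose_x else [1, x[0]]
    elif transpose_x:
        x[-2], x[-1] = x[-1], x[-2]
    # Rank-1 Y: [D] -> [D, 1], or [1, D] when transposed.
    if y_vec:
        y = [1, y[0]] if transpose_y else [y[0], 1]
    elif transpose_y:
        y[-2], y[-1] = y[-1], y[-2]
    if x[-1] != y[-2]:
        raise ValueError("inner dimensions do not match")
    if x[:-2] and y[:-2] and x[:-2] != y[:-2]:
        raise ValueError("batch dimensions do not match")
    batch = x[:-2] or y[:-2]
    out = batch + [x[-2], y[-1]]
    # Drop the unit dimensions that were added for non-transposed vectors.
    if y_vec and not transpose_y:
        out.pop()
    if x_vec and not transpose_x:
        del out[len(batch)]
    return tuple(out) or (1,)


B, M, K, N = 2, 3, 4, 5
print(matmul_out_shape((K,), (K,)))
print(matmul_out_shape((M, K), (B, K, N)))
```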

          brelu

          BRelu Activation Operator.

          $y = \min(\max(x, t_{min}), t_{max})$

          Inputs:
          • X : Input of BRelu operator
          Outputs:
          • Y : Output of BRelu operator
          Attributes:
          • t_min (Duplicable): The min marginal value of BRelu
          • t_max (Duplicable): The max marginal value of BRelu

          crf_decoding

          The crf_decoding operator reads the emission feature weights and the transition feature weights learned by the linear_chain_crf operator. It implements the Viterbi algorithm which is a dynamic programming algorithm for finding the most likely sequence of hidden states, called the Viterbi path, that results in a sequence of observed tags.

          The output of this operator changes according to whether Input(Label) is given:

          1. Input(Label) is given:

          This happens in training. This operator is used to co-work with the chunk_eval operator.

          When Input(Label) is given, the crf_decoding operator returns a column vector with shape [N x 1] whose values are either 0, indicating an incorrectly predicted tag, or 1, indicating a correctly predicted tag. Such an output is the input to the chunk_eval operator.

          2. Input(Label) is not given:

          This is the standard decoding process.

          The crf_decoding operator returns a column vector with shape [N x 1] whose values range from 0 to maximum tag number - 1. Each element indicates an index of a predicted tag.

          Inputs:
          • Emission : (LoDTensor, default: LoDTensor<float>). A LoDTensor with shape [N x D] where N is the size of the mini-batch and D is the total tag number. This input is the unscaled emission weight matrix of the linear_chain_crf operator.
          • Transition : (Tensor, default: Tensor<float>). A Tensor with shape [(D + 2) x D]. This input is the transition weights learned by the linear_chain_crf operator, denoted as w. The 1st row of w are transition weights for the start mask. The 2nd row of w are transition weights for the end mask. Transition weights between other tags begin from the 3rd row of w. See more details in comments of the linear_chain_crf operator.
          • Label : (LoDTensor, LoDTensor<int64_t>). The ground truth with shape [N x 1]. This input is optional. See more details in the operator's comments.
          Outputs:
          • ViterbiPath : (LoDTensor, LoDTensor<int64_t>). The decoding results. What to return changes depending on whether the Input(Label) (the ground truth) is given. See more details in the operator's comment.
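A minimal Viterbi sketch for a single sequence, using the Transition layout described above (row 0 for start weights, row 1 for end weights, remaining rows for tag-to-tag weights). Reading entry [2 + s][t] as the weight of moving from tag s to tag t is an assumption, as are the helper names; LoD handling for batches of sequences is not modeled.

```python
def viterbi_decode(emission, transition):
    """Return the highest-scoring tag path for one sequence.

    emission: list of L rows, each with D scores.
    transition: (D + 2) rows of D weights: start, end, then tag-to-tag.
    """
    d = len(emission[0])
    start, end, trans = transition[0], transition[1], transition[2:]
    score = [start[t] + emission[0][t] for t in range(d)]
    backpointers = []
    for step in emission[1:]:
        prev, score, ptr = score, [], []
        for t in range(d):
            # Best previous tag for landing on tag t at this step.
            best = max(range(d), key=lambda s: prev[s] + trans[s][t])
            ptr.append(best)
            score.append(prev[best] + trans[best][t] + step[t])
        backpointers.append(ptr)
    last = max(range(d), key=lambda t: score[t] + end[t])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


def compare_with_label(path, label):
    """Case 1 above: 1 where the predicted tag matches the label, else 0."""
    return [1 if p == l else 0 for p, l in zip(path, label)]


emission = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
zero_transition = [[0.0, 0.0] for _ in range(4)]
path = viterbi_decode(emission, zero_transition)
print(path, compare_with_label(path, [0, 0, 0]))
```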

          clip_by_norm

          ClipByNorm Operator.

          This operator limits the L2 norm of the input $X$ within $max\_norm$. If the L2 norm of $X$ is less than or equal to $max\_norm$, $Out$ will be the same as $X$. If the L2 norm of $X$ is greater than $max\_norm$, $X$ will be linearly scaled to make the L2 norm of $Out$ equal to $max\_norm$, as shown in the following formula:

          $$ Out = \frac{max\_norm * X}{norm(X)}, $$

          where $norm(X)$ represents the L2 norm of $X$.

          Inputs:
          • X : (Tensor) The input of clip_by_norm op. The number of dimensions must be between [1, 9].
          Outputs:
          • Out : (Tensor) The output of clip_by_norm op with shape as input(X)
          Attributes:
          • max_norm (Duplicable): (float) The maximum norm value.
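The rule above can be sketched on a flattened tensor as follows; the helper name `clip_by_norm` is hypothetical and the input is a flat list of floats.

```python
import math


def clip_by_norm(x, max_norm):
    """Rescale x so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= max_norm:
        # Already within the limit: the output equals the input.
        return list(x)
    return [max_norm * v / norm for v in x]


print(clip_by_norm([3.0, 4.0], max_norm=1.0))
print(clip_by_norm([3.0, 4.0], max_norm=10.0))
```

Scaling every element by the same factor preserves the direction of X and only shortens its length.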

          gather

          Gather Operator.

          $Out = X[Index]$

          Out is obtained by gathering entries of the outer-most dimension of X indexed by Index and concatenating them together.

          Example:

          X = [[1, 2], [3, 4], [5, 6]]

          Index = [[1, 2]]

          Then:

          Out = [[3, 4], [5, 6]]

          Inputs:
          • X : The source input of gather op
          • Index : The index input of gather op
          Outputs:
          • Out : The output of gather op
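The gather example above corresponds to the following sketch on nested lists. Here Index is taken as a flat list of row indices, which is an assumption about how the index tensor is laid out; the helper name `gather` is hypothetical.

```python
def gather(x, index):
    """Pick the rows of x (outer-most dimension) listed in index."""
    return [x[i] for i in index]


x = [[1, 2], [3, 4], [5, 6]]
print(gather(x, [1, 2]))
```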

          pool3d_cudnn

          Pool3d Operator.

          The pooling3d operation calculates the output based on the input, pooling_type, ksize, strides, and paddings parameters. Input(X) and output(Out) are in NCDHW format, where N is batch size, C is the number of channels, and D, H and W are the depth, height and width of the feature, respectively. Parameters(ksize, strides, paddings) are three elements. These three elements represent depth, height and width, respectively. The input(X) size and output(Out) size may be different.

          Example: Input: X shape: $(N, C, D_{in}, H_{in}, W_{in})$ Output: Out shape: $(N, C, D_{out}, H_{out}, W_{out})$ Where $$ D_{out} = \frac{(D_{in} - ksize[0] + 2 * paddings[0])}{strides[0]} + 1 \\ H_{out} = \frac{(H_{in} - ksize[1] + 2 * paddings[1])}{strides[1]} + 1 \\ W_{out} = \frac{(W_{in} - ksize[2] + 2 * paddings[2])}{strides[2]} + 1 $$

          Inputs:
          • X : (Tensor) The input tensor of pooling operator. The format of input tensor is NCDHW, where N is batch size, C is the number of channels, and D, H and W are the depth, height and width of the feature, respectively.
          Outputs:
          • Out : (Tensor) The output tensor of pooling operator. The format of output tensor is also NCDHW, where N is batch size, C is the number of channels, and D, H and W are the depth, height and width of the feature, respectively.
          Attributes:
          • pooling_type (Duplicable): (string) Pooling type, can be "max" for max-pooling and "avg" for average-pooling.
          • ksize (Duplicable): (vector<int>) The pooling window size(depth, height, width) of pooling operator. If global_pooling = true, ksize and paddings will be ignored.
          • global_pooling (Duplicable): (bool, default false) Whether to use the global pooling. If global_pooling = true, ksize and paddings will be ignored.
          • strides (Duplicable): (vector<int>, default {1,1,1}) Strides(depth, height, width) of the pooling operator.
          • paddings (Duplicable): (vector<int>, default {0,0,0}), paddings(depth, height, width) of pooling operator. If global_pooling = true, ksize and paddings will be ignored.

          crop

          Crop Operator.

          Crop input into output, as specified by offsets and shape.

          There are two ways to set shape:
          1. reference input: crop input X into the same shape as the reference input. The dimension of the reference input should be the same as the dimension of input X.
          2. shape list: crop input X into the shape described by a list. The size of the shape list should be the same as the dimension size of input X.

          The input should be a k-D tensor(k > 0 and k < 7). As an example:

          Given:

          X = [[0, 1, 2, 0, 0]
               [0, 3, 4, 0, 0]
               [0, 0, 0, 0, 0]],
          

          and

          offsets = [0, 1],
          

          and

          shape = [2, 2],
          

          we get:

          Out = [[1, 2],
                 [3, 4]].
          
          Inputs:
          • X : The input of crop op. The input should be a k-D tensor(k > 0 and k < 7).
          • Y : The input used as reference for cropping, which is of the same dimensions as X.
          Outputs:
          • Out : The output of crop op, which is of the same dimensions as X.
          Attributes:
          • offsets (Duplicable): A list<int> describing offsets to be cropped. The size of offsets list should be the same as the dimension size of input X.
          • shape (Duplicable): A list<int> describing the shape of output. The size of shape list should be the same as the dimension size of input X.
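The 2-D example above can be reproduced with plain list slicing. This is an illustrative sketch only; `crop_2d` is a hypothetical helper that handles the 2-D case with nested lists, not the operator itself.

```python
def crop_2d(x, offsets, shape):
    """Crop a 2-D nested list starting at `offsets` to the given `shape`."""
    row0, col0 = offsets
    rows, cols = shape
    return [row[col0:col0 + cols] for row in x[row0:row0 + rows]]


x = [[0, 1, 2, 0, 0],
     [0, 3, 4, 0, 0],
     [0, 0, 0, 0, 0]]
out = crop_2d(x, offsets=[0, 1], shape=[2, 2])
print(out)  # [[1, 2], [3, 4]]
```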

          merge_lod_tensor

              Merge True and False branches of LoDTensor into a single Output,
              with a mask at certain lod level. X is used to obtain complete
              lod information. Please refer to SplitLoDTensorOp.
          
          Inputs:
          • X : The input LoDTensor, contains complete lod information to construct the output
          • Mask : A bool column vector which masks the input
          • InTrue : The True branch to be merged
          • InFalse : The False branch to be merged
          Outputs:
          • Out : The merged output LoDTensor
          Attributes:
          • level (Duplicable): (int) the specific lod level to rank.

          elementwise_mul

          Limited Elementwise Mul Operator.

          The equation is:

          $Out = X \odot Y$

          X is a tensor of any dimension and the dimensions of tensor Y must be smaller than or equal to the dimensions of X.

          There are two cases for this operator: 1. The shape of Y is the same as that of X; 2. The shape of Y is a contiguous subsequence of the shape of X.

          For case 2: Y will be broadcast to match the shape of X, and axis should be the starting dimension index for broadcasting Y onto X.

          For example:

            shape(X) = (2, 3, 4, 5), shape(Y) = (,)
            shape(X) = (2, 3, 4, 5), shape(Y) = (5,)
            shape(X) = (2, 3, 4, 5), shape(Y) = (4, 5)
            shape(X) = (2, 3, 4, 5), shape(Y) = (3, 4), with axis=1
            shape(X) = (2, 3, 4, 5), shape(Y) = (2,), with axis=0

          Both the input X and Y can carry the LoD (Level of Details) information, or not. But the output only shares the LoD information with input X.

          Inputs:
          • X : (Tensor) The first input tensor of elementwise op
          • Y : (Tensor) The second input tensor of elementwise op
          Outputs:
          • Out : The output of elementwise op
          Attributes:
          • axis (Duplicable): (int, default -1) The starting dimension index for broadcasting Y onto X
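One way to picture the role of axis is to pad the shape of Y with 1s until it has as many dimensions as X. The sketch below does only that shape arithmetic. The helper name `aligned_y_shape` is hypothetical, and treating the default axis=-1 as "align Y with the trailing dimensions of X" is an assumption inferred from the examples above.

```python
def aligned_y_shape(x_shape, y_shape, axis=-1):
    """Pad y_shape with 1s so it lines up with x_shape starting at `axis`."""
    if axis == -1:
        # Assumed meaning of the default: align Y with the trailing dims of X.
        axis = len(x_shape) - len(y_shape)
    if tuple(x_shape[axis:axis + len(y_shape)]) != tuple(y_shape):
        raise ValueError("y_shape must match a contiguous run of x_shape")
    trailing = len(x_shape) - axis - len(y_shape)
    return (1,) * axis + tuple(y_shape) + (1,) * trailing


print(aligned_y_shape((2, 3, 4, 5), (3, 4), axis=1))  # (1, 3, 4, 1)
```

Once Y has the padded shape, every dimension of size 1 is repeated to match X before the element-wise operation is applied.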

          rmsprop

          Rmsprop Optimizer.

          $$ MeanSquareOut = decay * MeanSquare + (1 - decay) * Grad * Grad \\ MomentOut = momentum * Moment + \frac{LearningRate * Grad}{\sqrt{MeanSquareOut + epsilon}} \\ ParamOut = Param - MomentOut $$

          The original slides that proposed Rmsprop: Slide 29 of http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf

          Inputs:
          • Param : (Tensor, default Tensor<float>) Input parameter value that has to be updated.
          • MeanSquare : (Tensor, default Tensor<float>) The mean square value that gets updated.
          • LearningRate : (Tensor, default Tensor<float>) The learning rate should be a tensor of size 1.
          • Grad : (Tensor, default Tensor<float>) Input gradient of the parameter.
          • Moment : (Tensor, default Tensor<float>) The moment that gets updated.
          Outputs:
          • ParamOut : (Tensor) Output updated parameter value.
          • MomentOut : (Tensor) Output updated moment.
          • MeanSquareOut : (Tensor) Output Mean squared updated value.
          Attributes:
          • epsilon (Duplicable): (float, default 1e-10) Constant for numerical stability.
          • decay (Duplicable): (float, default 0.9) Discounting factor for coming gradient.
          • momentum (Duplicable): (float, default 0.0) Constant value.
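The three update equations above translate directly into a scalar update step. This is a minimal sketch for a single parameter value; the function name `rmsprop_step` is hypothetical, and a real operator would apply the same arithmetic element-wise to tensors.

```python
import math


def rmsprop_step(param, grad, mean_square, moment, learning_rate,
                 epsilon=1e-10, decay=0.9, momentum=0.0):
    """Apply one Rmsprop update to a scalar parameter."""
    mean_square_out = decay * mean_square + (1 - decay) * grad * grad
    moment_out = (momentum * moment
                  + learning_rate * grad / math.sqrt(mean_square_out + epsilon))
    param_out = param - moment_out
    return param_out, moment_out, mean_square_out


print(rmsprop_step(param=1.0, grad=2.0, mean_square=0.0, moment=0.0,
                   learning_rate=0.1))
```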

          proximal_gd

          ProximalGD Operator.

          Optimizer that implements the proximal gradient descent algorithm:

          $$ prox\_param = param - learning\_rate * grad \\ param = sign(prox\_param) / (1 + learning\_rate * l2) * \max(|prox\_param| - learning\_rate * l1, 0) $$

          The paper that proposed Proximal Gradient Descent: (http://papers.nips.cc/paper/3793-efficient-learning-using-forward-backward-splitting.pdf)

          Inputs:
          • Param : (Tensor, default Tensor<float>) Input parameter value that has to be updated.
          • Grad : (Tensor, default Tensor<float>) Input gradient of the parameter.
          • LearningRate : (Tensor, default Tensor<float>) The learning rate should be a tensor of size 1.
          Outputs:
          • ParamOut : (Tensor) Output updated parameter value.
          Attributes:
          • l1 (Duplicable): (float, default 0.0) L1 regularization strength.
          • l2 (Duplicable): (float, default 0.0) L2 regularization strength.
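As a rough illustration of the update above, the following scalar sketch applies a gradient step followed by the L1 soft-threshold and the L2 shrinkage. The function name `proximal_gd_step` is hypothetical; tensors would be handled element-wise in the same way.

```python
def proximal_gd_step(param, grad, learning_rate, l1=0.0, l2=0.0):
    """Apply one proximal gradient descent update to a scalar parameter."""
    prox_param = param - learning_rate * grad
    sign = (prox_param > 0) - (prox_param < 0)  # -1, 0 or +1
    shrunk = max(abs(prox_param) - learning_rate * l1, 0.0)
    return sign / (1.0 + learning_rate * l2) * shrunk


print(proximal_gd_step(param=1.0, grad=0.5, learning_rate=0.1))
```

With l1 = l2 = 0 the step reduces to plain gradient descent. A large enough l1 drives small parameters to exactly zero.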

          positive_negative_pair

              PositiveNegativePairOp can be used to evaluate Learning To Rank (LTR)
              model performance.
              Within some context, e.g. the "query", an LTR model generates scores
              for a list of items, which gives a partial order of the items.
              PositiveNegativePairOp takes a list of reference rank orders
              (Input(Label)) and the model-generated scores (Input(Score)) as
              inputs and counts the pairs that are ranked correctly and incorrectly.
          
          Inputs:
          • Score : (Tensor, float) Model Score on an item (with respect to QueryID). It's a 2-D tensor with shape [batch_size, depth], where the column specified by the attribute "column" is used as item score.
          • Label : (Tensor, float) Label of an item (with respect to QueryID). It's a 2-D tensor with shape [batch_size, 1].
          • QueryID : (Tensor, int64) Query ID that indicates the context. Its shape should be the same as Label.
          • AccumulatePositivePair : (float) Optional. The accumulated number of positive pairs over a stream of data. If provided, the output PositivePair will be initialized with this number rather than 0. It won't be modified in place.
          • AccumulateNegativePair : (float) Optional. The accumulated number of negative pairs over a stream of data. If provided, the output NegativePair will be initialized with this number rather than 0. It won't be modified in place.
          • AccumulateNeutralPair : (float) Optional. The accumulated number of neutral pairs over a stream of data. If provided, the output NeutralPair will be initialized with this number rather than 0. It won't be modified in place.
          • Weight : (float) Optional. Weight of current item. If specified, its shape should be the same as Label, and the meaning of the output changes from numbers of pairs to the total sum of pairs' weights. Weight of a pair of items is the average of their weights.
          Outputs:
          • PositivePair : (float) Number of positive pairs, i.e. the pairs of items that are ranked correctly.
          • NegativePair : (float) Number of negative pairs, i.e. the pairs of items that are ranked incorrectly.
          • NeutralPair : (float) Number of neutral pairs, i.e. the pairs of items that have the same score.
          Attributes:
          • column (Duplicable): (int, default -1) The column position of Score used to rank items in descending order. It must be in the range of [-rank(Score), rank(Score)). If `dim < 0`, the dim to reduce is `rank + dim`. Note that reducing on the first dim will discard the LoD information.
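The counting described above can be sketched in a few lines. The function `count_pairs` is hypothetical and makes two assumptions that the text does not state: pairs whose labels are equal are skipped because they carry no reference order, and a pair with equal scores is counted as neutral only. Weights and accumulated inputs are left out.

```python
from collections import defaultdict
from itertools import combinations


def count_pairs(scores, labels, query_ids):
    """Return (positive, negative, neutral) pair counts grouped by query."""
    by_query = defaultdict(list)
    for score, label, query in zip(scores, labels, query_ids):
        by_query[query].append((score, label))

    pos = neg = neu = 0
    for items in by_query.values():
        for (s1, l1), (s2, l2) in combinations(items, 2):
            if l1 == l2:
                continue  # assumption: no reference order for this pair
            if s1 == s2:
                neu += 1  # same score: neutral pair
            elif (s1 - s2) * (l1 - l2) > 0:
                pos += 1  # score order agrees with label order
            else:
                neg += 1  # score order contradicts label order
    return pos, neg, neu


print(count_pairs([0.9, 0.2, 0.5, 0.5], [1, 0, 1, 0], [1, 1, 2, 2]))
```

Only items that share a query ID are compared, which is why the two queries in the usage line contribute one pair each.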

          log_loss

          LogLoss Operator.

          Log loss is a loss function used for binary classification. Log Loss quantifies the accuracy of a classifier by penalising false classifications. Minimising the Log Loss is equivalent to maximising the accuracy of the classifier. We define Predicted as the values predicted by our model and Labels as the target ground truth value. Log loss can evaluate how close the predicted values are to the target. The shapes of Predicted and Labels are both [batch_size, 1]. The equation is:

          $$ Loss = - Labels * log(Predicted + \epsilon) - (1 - Labels) * log(1 - Predicted + \epsilon) $$

          Inputs:
          • Predicted : The input value (Predicted) of Log loss op. Predicted is a 2-D tensor with shape [batch_size, 1].
          • Labels : The target value (Labels) of Log loss op. Labels is a 2-D tensor with shape [batch_size, 1].
          Outputs:
          • Loss : The output tensor with shape [batch_size, 1] which represents the log loss.
          Attributes:
          • epsilon (Duplicable): Epsilon in log loss.
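The equation above can be evaluated per sample as follows. This is a sketch over flat Python lists rather than [batch_size, 1] tensors; the function name `log_loss` and the default value chosen for `epsilon` are assumptions, since the attribute's default is not given here.

```python
import math


def log_loss(predicted, labels, epsilon=1e-4):
    """Return the per-sample log loss; epsilon guards against log(0)."""
    return [
        -y * math.log(p + epsilon) - (1 - y) * math.log(1 - p + epsilon)
        for p, y in zip(predicted, labels)
    ]


losses = log_loss([0.9, 0.1], [1, 1])
print(losses)
```

A confident correct prediction (0.9 for label 1) yields a much smaller loss than a confident wrong one (0.1 for label 1).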

          mean

          Mean Operator.

          Out is a scalar which is the mean of all elements in X.

          Inputs:
          • X : The input of mean op
          Outputs:
          • Out : The output of mean op

          elementwise_add

          Limited Elementwise Add Operator.

          The equation is:

          $Out = X + Y$

          X is a tensor of any dimension and the dimensions of tensor Y must be smaller than or equal to the dimensions of X.

          There are two cases for this operator: 1. The shape of Y is the same as that of X; 2. The shape of Y is a contiguous subsequence of the shape of X.

          For case 2: Y will be broadcast to match the shape of X, and axis should be the starting dimension index for broadcasting Y onto X.

          For example:

            shape(X) = (2, 3, 4, 5), shape(Y) = (,)
            shape(X) = (2, 3, 4, 5), shape(Y) = (5,)
            shape(X) = (2, 3, 4, 5), shape(Y) = (4, 5)
            shape(X) = (2, 3, 4, 5), shape(Y) = (3, 4), with axis=1
            shape(X) = (2, 3, 4, 5), shape(Y) = (2,), with axis=0

          Both the input X and Y can carry the LoD (Level of Details) information, or not. But the output only shares the LoD information with input X.

          Inputs:
          • X : (Tensor) The first input tensor of elementwise op
          • Y : (Tensor) The second input tensor of elementwise op
          Outputs:
          • Out : The output of elementwise op
          Attributes:
          • axis (Duplicable): (int, default -1) The starting dimension index for broadcasting Y onto X

          fill_zeros_like

          FillZerosLike Operator.

          Fill up a variable with zeros. The output will have the same size as the input.

          Inputs:
          • X : The input of fill-zeros-like op.
          Outputs:
          • Y : The variable will be filled up with zeros.

          prelu

          PRelu Operator.

          The equation is:

          $$ f(x) = \begin{cases} \alpha * x, \quad \text{if} \ x < 0 \\ x, \qquad \text{if} \ x >= 0 \end{cases} $$

          The input X can carry the LoD (Level of Details) information, or not. And the output shares the LoD information with input X.

          Inputs:
          • X : The input tensor of prelu operator.
          • Alpha : The alpha weight of prelu operator.
          Outputs:
          • Out : The output tensor of prelu operator.

          fill

          Fill operator

          Fill a tensor with the given value and shape. The type of the tensor is specified by dtype.

          Inputs:
            Outputs:
            • Out : (LoDTensor) The output tensor.
            Attributes:
            • value (Duplicable): The float values of the tensor, flattened in row-major order
            • shape (Duplicable): The shape of output tensor
            • dtype (Duplicable): The data type of output tensor, Default is float
            • force_cpu (Duplicable): Whether the output tensor must be at CPU memory or not. Default is false.

            sigmoid_cross_entropy_with_logits

            SigmoidCrossEntropyWithLogits Operator.

            This measures the element-wise probability error in classification tasks in which each class is independent. This can be thought of as predicting labels for a data-point, where labels are not mutually exclusive. For example, a news article can be about politics, technology or sports at the same time or none of these.

            The logistic loss is given as follows:

               $$loss = -Labels * \log(\sigma(X)) - (1 - Labels) * \log(1 - \sigma(X))$$
            

            We know that $$\sigma(X) = (1 / (1 + \exp(-X)))$$. By substituting this we get:

               $$loss = X - X * Labels + \log(1 + \exp(-X))$$
            

            For stability and to prevent overflow of $$\exp(-X)$$ when X < 0, we reformulate the loss as follows:

               $$loss = \max(X, 0) - X * Labels + \log(1 + \exp(-|X|))$$
            

            Both the input X and Labels can carry the LoD (Level of Details) information. However the output only shares the LoD with input X.

            Inputs:
            • X : (Tensor, default Tensor<float>), a 2-D tensor with shape N x D, where N is the batch size and D is the number of classes. This input is a tensor of logits computed by the previous operator. Logits are unscaled log probabilities given as log(p/(1-p)).
            • Label : (Tensor, default Tensor<float>), a 2-D tensor of the same type and shape as X. This input is a tensor of probabilistic labels for each logit.
            Outputs:
            • Out : (Tensor, default Tensor<float>), a 2-D tensor with shape N x D of elementwise logistic losses.
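To see why the reformulation matters, the sketch below compares a naive evaluation of the logistic loss with the commonly used overflow-safe form max(x, 0) - x * label + log(1 + exp(-|x|)). Both function names are hypothetical, and the stable form is assumed to be the reformulation the text refers to.

```python
import math


def naive_loss(x, label):
    """Logistic loss computed directly from sigmoid(x); overflows for very negative x."""
    s = 1.0 / (1.0 + math.exp(-x))
    return -label * math.log(s) - (1 - label) * math.log(1 - s)


def stable_loss(x, label):
    """Equivalent form that never exponentiates a positive number."""
    return max(x, 0.0) - x * label + math.log1p(math.exp(-abs(x)))


print(naive_loss(0.3, 1.0), stable_loss(0.3, 1.0))
print(stable_loss(-1000.0, 1.0))
```

For moderate logits both functions agree. For x = -1000 the naive version raises OverflowError inside math.exp, while the stable version still returns a finite loss.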

            modified_huber_loss

            Modified Huber Loss Operator.

            This operator is used in binary classification problems. The shape of input X and target Y are both [N, 1] and so is the shape of the output loss. Since target Y is not differentiable, calculating gradient for Y is illegal. The formula of modified huber loss is:

            $$ L(y, f(x)) = \begin{cases} (\max(0, 1 - yf(x)))^2, \text{if} \ yf(x) >= -1 \\ -4yf(x), \quad \text{otherwise} \end{cases} $$

            Make sure the values of target label Y are in {0, 1} here. This operator will scale values of Y to {-1, +1} when computing losses and gradients.

            Inputs:
            • X : The input tensor of modified huber loss op. X is 2-D tensor with shape [batch_size, 1].
            • Y : The target labels of modified huber loss op. The shape of Y is the same as X. Values of Y must be 0 or 1.
            Outputs:
            • IntermediateVal (Intermediate) : Variable to save intermediate result which will be reused in backward processing.
            • Out : Classification loss for X.
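The piecewise formula and the label rescaling described above can be sketched for a single sample as follows. The function name `modified_huber_loss` is hypothetical, and mapping labels with 2 * y - 1 is the assumed way of scaling {0, 1} to {-1, +1}.

```python
def modified_huber_loss(x, y):
    """x: raw prediction f(x); y: target label in {0, 1}."""
    z = (2 * y - 1) * x  # scale the label to {-1, +1}, then form y * f(x)
    if z >= -1:
        return max(0.0, 1.0 - z) ** 2
    return -4.0 * z


for x, y in [(2.0, 1), (0.0, 1), (-2.0, 1), (0.5, 0)]:
    print(x, y, modified_huber_loss(x, y))
```

The two branches meet at y * f(x) = -1, where both give a loss of 4, so the loss is continuous.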

            elementwise_sub

            Limited Elementwise Sub Operator.

            The equation is:

            $Out = X - Y$

            X is a tensor of any dimension and the dimensions of tensor Y must be smaller than or equal to the dimensions of X.

            There are two cases for this operator: 1. The shape of Y is the same as that of X; 2. The shape of Y is a contiguous subsequence of the shape of X.

            For case 2: Y will be broadcast to match the shape of X, and axis should be the starting dimension index for broadcasting Y onto X.

            For example:

              shape(X) = (2, 3, 4, 5), shape(Y) = (,)
              shape(X) = (2, 3, 4, 5), shape(Y) = (5,)
              shape(X) = (2, 3, 4, 5), shape(Y) = (4, 5)
              shape(X) = (2, 3, 4, 5), shape(Y) = (3, 4), with axis=1
              shape(X) = (2, 3, 4, 5), shape(Y) = (2,), with axis=0

            Both the input X and Y can carry the LoD (Level of Details) information, or not. But the output only shares the LoD information with input X.

            Inputs:
            • X : (Tensor) The first input tensor of elementwise op
            • Y : (Tensor) The second input tensor of elementwise op
            Outputs:
            • Out : The output of elementwise op
            Attributes:
            • axis (Duplicable): (int, default -1) The starting dimension index for broadcasting Y onto X

            reduce_mean

            ReduceMean Operator.

            This operator computes the mean of input tensor along the given dimension. The result tensor has 1 fewer dimension than the input unless keep_dim is true.

            Inputs:
            • X : (Tensor) The input tensor. Tensors with rank at most 6 are supported.
            Outputs:
            • Out : (Tensor) The result tensor.
            Attributes:
            • dim (Duplicable): (int, default 0) The dimension to reduce. Must be in the range [-rank(input), rank(input)). If `dim < 0`, the dim to reduce is `rank + dim`. Note that reducing on the first dim will discard the LoD information.
            • keep_dim (Duplicable): (bool, default false) If true, retain the reduced dimension with length 1.
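The interaction of dim and keep_dim is easiest to see on a small matrix. The sketch below is restricted to rank-2 nested lists; `reduce_mean_2d` is a hypothetical helper, not the operator.

```python
def reduce_mean_2d(x, dim=0, keep_dim=False):
    """Mean of a rank-2 nested list along `dim`, optionally keeping that dim."""
    rank = 2
    if not -rank <= dim < rank:
        raise ValueError("dim must be in [-rank, rank)")
    if dim < 0:
        dim += rank  # negative dims count from the end
    if dim == 0:
        out = [sum(col) / len(x) for col in zip(*x)]
        return [out] if keep_dim else out
    out = [sum(row) / len(row) for row in x]
    return [[v] for v in out] if keep_dim else out


x = [[1, 2, 3],
     [3, 4, 5]]
print(reduce_mean_2d(x, dim=0))                  # [2.0, 3.0, 4.0]
print(reduce_mean_2d(x, dim=-1, keep_dim=True))  # [[2.0], [4.0]]
```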

            square

            Square Activation Operator.

            $y = x^2$

            Inputs:
            • X : Input of Square operator
            Outputs:
            • Y : Output of Square operator

            reduce_max

            ReduceMax Operator.

            This operator computes the max of input tensor along the given dimension. The result tensor has 1 fewer dimension than the input unless keep_dim is true.

            Inputs:
            • X : (Tensor) The input tensor. Tensors with rank at most 6 are supported.
            Outputs:
            • Out : (Tensor) The result tensor.
            Attributes:
            • dim (Duplicable): (int, default 0) The dimension to reduce. Must be in the range [-rank(input), rank(input)). If `dim < 0`, the dim to reduce is `rank + dim`. Note that reducing on the first dim will discard the LoD information.
            • keep_dim (Duplicable): (bool, default false) If true, retain the reduced dimension with length 1.

            logical_or

            logical_or Operator

            It operates element-wise on X and Y, and returns the Out. X, Y and Out are N-dim boolean tensors. Each element of Out is calculated by $$Out = X || Y$$

            Inputs:
            • X : (LoDTensor) Left hand operand of logical_or operator
            • Y : (LoDTensor) Right hand operand of logical_or operator
            Outputs:
            • Out : (LoDTensor) n-dim bool tensor. Each element is $$Out = X || Y$$

            less_than

            less_than Operator

            It operates element-wise on X and Y, and returns the Out. Each of them is an N-dim tensor. X and Y could be any type. Each element of the Out tensor is calculated by Out = X < Y

            Inputs:
            • X : (LoDTensor) the left hand operand of less_than operator
            • Y : (LoDTensor) the right hand operand of less_than operator
            Outputs:
            • Out : (LoDTensor) n-dim bool tensor. Each element is Out = X < Y

            gru_unit

            GRUUnit Operator implements partial calculations of the GRU unit as following:

            $$ update \ gate: u_t = actGate(xu_t + W_u * h_{t-1} + b_u) \\ reset \ gate: r_t = actGate(xr_t + W_r * h_{t-1} + b_r) \\ output \ candidate: \tilde{h}_t = actNode(xc_t + W_c * dot(r_t, h_{t-1}) + b_c) \\ output: h_t = dot((1 - u_t), h_{t-1}) + dot(u_t, \tilde{h}_t) $$

            which is the same as one time step of the GRU operator.

            Note: To implement the complete GRU unit, a fully-connected operator must be used beforehand to feed xu, xr and xc as the Input of the GRUUnit operator.

            Inputs:
            • Input : (Tensor) Matrix with shape [batch_size, frame_size * 3] for the input.
            • HiddenPrev : (Tensor) Matrix with shape [batch_size, frame_size] for the states of previous time step.
            • Weight : (Tensor) Weight matrix with shape [frame_size, frame_size * 3]. The elements that are contiguous in memory can be divided into two parts. The first part holds the weights of the update gate and reset gate with shape [frame_size, frame_size * 2], and the second part holds the weights of the output candidate with shape [frame_size, frame_size].
            • Bias : (Tensor) Bias vector with shape [1, frame_size * 3] concatenating bias of the update gate, reset gate and output candidate.
            Outputs:
            • Gate (Intermediate) : (Tensor) Matrix with shape [batch_size, frame_size * 3] for the output of update gate, reset gate and output candidate.
            • ResetHiddenPrev (Intermediate) : (Tensor) Matrix with shape [batch_size, frame_size] for the reset hidden state of the previous time step.
            • Hidden : (Tensor) The GRU hidden state of the current time step with shape [batch_size, frame_size].
            Attributes:
            • activation (Duplicable): (enum int, default tanh) The activation type used for the output candidate.
            • gate_activation (Duplicable): (enum int, default sigmoid) The activation type used in update gate and reset gate.
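The gate equations above can be traced for a single sample with plain lists. This is a sketch under several assumptions: the three weight blocks are passed as separate square matrices `w_u`, `w_r` and `w_c` (hypothetical names) rather than one packed Weight tensor, `W * h` is taken as a matrix-vector product, `dot` is taken as element-wise multiplication, and the default sigmoid and tanh activations are used.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(w, h):
    """Multiply a [frame_size][frame_size] matrix by a vector."""
    return [sum(w_ij * h_j for w_ij, h_j in zip(row, h)) for row in w]


def gru_unit_step(xu, xr, xc, h_prev, w_u, w_r, w_c, b_u, b_r, b_c):
    """One GRU step for a single sample; xu, xr, xc come from a prior projection."""
    u = [sigmoid(a + b + c) for a, b, c in zip(xu, matvec(w_u, h_prev), b_u)]
    r = [sigmoid(a + b + c) for a, b, c in zip(xr, matvec(w_r, h_prev), b_r)]
    reset_h = [r_i * h_i for r_i, h_i in zip(r, h_prev)]  # dot(r_t, h_{t-1})
    cand = [math.tanh(a + b + c)
            for a, b, c in zip(xc, matvec(w_c, reset_h), b_c)]
    return [(1 - u_i) * h_i + u_i * c_i
            for u_i, h_i, c_i in zip(u, h_prev, cand)]


zeros = [[0.0, 0.0], [0.0, 0.0]]
h = gru_unit_step([0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [2.0, -4.0],
                  zeros, zeros, zeros, [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])
print(h)  # [1.0, -2.0]
```

With all inputs, weights and biases set to zero, the update gate is 0.5 and the candidate is 0, so the new hidden state is half of the previous one.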

            swish

            Swish Activation Operator.

            $$y = \frac{x}{1 + e^{- \beta x}}$$

            Inputs:
            • X : Input of Swish operator
            Outputs:
            • Y : Output of Swish operator
            Attributes:
            • beta (Duplicable): Constant beta of swish operator

            is_empty

            IsEmpty Operator which checks whether a tensor is empty.

            It returns true when product(tensor.ddims()) == 0, i.e. when the tensor has no elements.

            Inputs:
            • X : (Tensor) Tensor which is to be checked.
            Outputs:
            • Out : (Tensor) a boolean Tensor that indicates whether X is empty.

            sequence_concat

            The sequence_concat operator concatenates multiple LoDTensors. It only supports a sequence (LoD tensor with level number 1) or a nested sequence (LoD tensor with level number 2) as its input.

            • Case1: If the axis is other than 0 (here, axis is 1 and level is 1), each input should have the same LoD information, and the LoD information of the output stays the same as that of the input.

            LoD(x0) = {{0,2,4}, {0,1,2,3,4}}; Dims(x0) = (4,3,4) LoD(x1) = {{0,2,4}, {0,1,2,3,4}}; Dims(x1) = (4,4,4) LoD(Out) = {{0,2,4}, {0,1,2,3,4}}; Dims(Out) = (4,7,4)

            • Case2: If the axis is 0 (here, level is 0), the inputs are concatenated along time steps, and the LoD information of the output needs to be recomputed. The LoD information of level-1 should be the same.

            LoD(x0) = {{0,2,4}, {0,1,2,3,4}}; Dims(x0) = (4,3,4) LoD(x1) = {{0,2,4}, {0,1,3,5,7}}; Dims(x1) = (7,3,4) LoD(Out) = {{0,2,4}, {0,2,5,8,11}}; Dims(Out) = (11,3,4)

            • Case3: If the axis is 0(here, level is 1).

            LoD(x0) = {{0,2,4}, {0,1,2,3,4}}; Dims(x0) = (4,3,4) LoD(x1) = {{0,3,4}, {0,1,3,5,7}}; Dims(x1) = (7,3,4) LoD(Out) = {{0,5,8}, {0,1,2,3,5,7,8,9,11}}; Dims(Out) = (11,3,4)

            • Case4: If the LoD number is 1, axis is 0, level is 0

            LoD(x0) = {{0,1,2,3,4}}; Dims(x0) = (4,3,4) LoD(x1) = {{0,1,3,5,7}}; Dims(x1) = (7,3,4) LoD(Out) = {{0,2,5,8,11}}; Dims(Out) = (11,3,4)

            NOTE: The levels of all the inputs should be the same.

            Inputs:
            • X (Duplicable) : (LodTensorArray) Input is a vector of LoDTensor, each of which is a variable-length sequence or nested sequence.
            Outputs:
            • Out : (LoDTensor), Variable-length output of sequence_concat Op.
            Attributes:
            • axis (Duplicable): (int, default 0) The axis along which the inputs will be joined. If axis is 0, the inputs will be joined with LoD index.
            • level (Duplicable): (int, default 0) The level at which the inputs will be joined. If the level is 0, the inputs will be joined at the nested sequence level. If the level is 1, the inputs will be joined at the sequence level. The level should be less than the level number of inputs.
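Case4 above (a single LoD level, axis 0) can be mimicked with offset lists. The sketch assumes that the i-th sequence of every input is joined into the i-th output sequence, which is consistent with the offsets shown in that case. `concat_sequences` is a hypothetical helper that uses lists of rows in place of tensors.

```python
def concat_sequences(lods, datas):
    """Join the i-th sequence of every input; return (out_lod, out_rows).

    lods:  one offset list per input, e.g. [0, 1, 3, 5, 7]
    datas: one list of rows per input, indexed by those offsets
    """
    num_seqs = len(lods[0]) - 1
    out_lod, out_rows = [0], []
    for i in range(num_seqs):
        for lod, rows in zip(lods, datas):
            out_rows.extend(rows[lod[i]:lod[i + 1]])
        out_lod.append(len(out_rows))
    return out_lod, out_rows


x0 = ["a0", "a1", "a2", "a3"]
x1 = ["b0", "b1", "b2", "b3", "b4", "b5", "b6"]
out_lod, out_rows = concat_sequences([[0, 1, 2, 3, 4], [0, 1, 3, 5, 7]], [x0, x1])
print(out_lod)  # [0, 2, 5, 8, 11]
```

Each output offset is the sum of the corresponding input offsets, which is why the output has 11 rows.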

            floor

            Floor Activation Operator.

            $y = floor(x)$

            Inputs:
            • X : Input of Floor operator
            Outputs:
            • Y : Output of Floor operator

            cast

            Cast Operator.

            This operator casts the input tensor to another data type and returns the output tensor.

            Inputs:
            • X : The input tensor of cast op
            Outputs:
            • Out : The output tensor of cast op
            Attributes:
            • out_dtype (Duplicable): output data type
            • in_dtype (Duplicable): input data type

            ceil

            Ceil Activation Operator.

            $y = ceil(x)$

            Inputs:
            • X : Input of Ceil operator
            Outputs:
            • Y : Output of Ceil operator

            tanh

            Tanh Activation Operator.

            $$y = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

            Inputs:
            • X : Input of Tanh operator
            Outputs:
            • Y : Output of Tanh operator

            feed

            Feed Operator.

            It should not be configured by users directly.

            Inputs:
            • X : The input of feed op
            Outputs:
            • Out : The output of feed op
            Attributes:
            • col (Duplicable): (int) The column of feed

            rnn_memory_helper

            Inputs:
            • X :
            Outputs:
            • Out :
            Attributes:
            • dtype (Duplicable): (int, default 5 (FP32)) Output data type

            unpool

                Input shape: $(N, C_{in}, H_{in}, W_{in})$
                Output shape: $(N, C_{out}, H_{out}, W_{out})$
                Where
                  $$ H_{out} = (H_{in} - 1) * strides[0] - 2 * paddings[0] + ksize[0] \\ W_{out} = (W_{in} - 1) * strides[1] - 2 * paddings[1] + ksize[1] $$
                Paper: http://www.matthewzeiler.com/wp-content/uploads/2017/07/iccv2011.pdf
            
            Inputs:
            • X : (Tensor) The input tensor of unpool operator. The format of input tensor is NCHW, where N is batch size, C is the number of channels, and H and W are the height and width of the feature.
            • Indices : (Tensor) The input tensor of the indices given out by MaxPool2d. The format of input tensor is NCHW, where N is batch size, C is the number of channels, and H and W are the height and width of the feature.
            Outputs:
            • Out : (Tensor) The output tensor of unpool operator. The format of output tensor is also NCHW, where N is batch size, C is the number of channels, and H and W are the height and width of the feature.
            Attributes:
            • ksize (Duplicable): (vector) The unpooling window size (height, width) of unpooling operator.
            • strides (Duplicable): (vector, default {1, 1}) Strides (height, width) of unpooling operator.
            • paddings (Duplicable): (vector, default {0, 0}) Paddings (height, width) of unpooling operator.
            • unpooling_type (Duplicable): (string) Unpooling type, can be "max" for max-unpooling.

            transpose

            Transpose Operator.

            The input tensor will be permuted according to the axis values given. The op works similarly to numpy.transpose in Python. For example:

              >>> input = numpy.arange(6).reshape((2, 3))
              >>> input
              array([[0, 1, 2],
                     [3, 4, 5]])
              >>> axis = [1, 0]
              >>> output = input.transpose(axis)
              >>> output
              array([[0, 3],
                     [1, 4],
                     [2, 5]])

            So, given an input tensor of shape (N, C, H, W) and axis {0, 2, 3, 1}, the output tensor shape will be (N, H, W, C).

            Inputs:
            • X : (Tensor) The input tensor. Tensors with rank at most 6 are supported.
            Outputs:
            • Out : (Tensor) The output tensor.
            Attributes:
            • axis (Duplicable): (vector<int>) A list of values whose size should be the same as the input tensor rank. The tensor's axes are permuted according to the values given.
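The effect of axis on the output shape can be computed without touching any data. The sketch below uses two hypothetical helpers, `transposed_shape` for the general shape rule and `transpose_2d` for the 2-D case on nested lists.

```python
def transposed_shape(shape, axis):
    """Return the shape after permuting dimensions according to `axis`."""
    if sorted(axis) != list(range(len(shape))):
        raise ValueError("axis must be a permutation of the input dimensions")
    return tuple(shape[a] for a in axis)


def transpose_2d(x):
    """Swap rows and columns of a 2-D nested list."""
    return [list(col) for col in zip(*x)]


print(transposed_shape((8, 3, 32, 64), [0, 2, 3, 1]))  # (8, 32, 64, 3)
print(transpose_2d([[0, 1, 2], [3, 4, 5]]))            # [[0, 3], [1, 4], [2, 5]]
```

Output dimension i takes its size from input dimension axis[i], which is how NCHW becomes NHWC with axis {0, 2, 3, 1}.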

            rnn_memory_helper_grad

            Inputs:
            • Out@GRAD :
            • X :
            • Out :
            Outputs:
            • X@GRAD :
            Attributes:
            • dtype (Duplicable): (int, default 5 (FP32)) Output data type

            momentum

            Momentum Optimizer.

            This optimizer has a flag for Nesterov Momentum. The update equations are as follows:

            $$ velocity = mu * velocity + gradient \\ if (use\_nesterov): \\ param = param - (gradient + mu * velocity) * learning\_rate \\ else: \\ param = param - learning\_rate * velocity $$

            Inputs:
            • Param : (Tensor, default Tensor<float>) Input parameter that has to be updated
            • Grad : (Tensor, default Tensor<float>) Input gradient of the parameter
            • Velocity : (Tensor, default Tensor<float>) Input velocity (corresponding to the parameter) that has to be updated
            • LearningRate : (Tensor, default Tensor<float>) Input learning rate
            Outputs:
            • ParamOut : (Tensor) The updated parameter. It shares memory with Input(Param).
            • VelocityOut : (Tensor) The updated velocity. It shares memory with Input(Velocity).
            Attributes:
            • mu (Duplicable): (float) Momentum coefficient
            • use_nesterov (Duplicable): (bool, default false) Use Nesterov Momentum

            scatter

            Scatter Operator.

            This operator obtains output by updating the input on selected indices on the first axis:

            $$ Out = Ref \\ Out[Index] = Ref[Index] + Updates $$

            Inputs:
            • Ref : The source input of scatter op
            • Index : The index input of scatter op where Ref will be updated
            • Updates : The update values of scatter op
            Outputs:
            • Out : The output of scatter op
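The two equations above can be followed on nested lists, where each entry of Index selects a row along the first axis. This is a sketch of the formula as written, not of the operator's implementation; the helper name `scatter` is hypothetical.

```python
def scatter(ref, index, updates):
    """Return a copy of ref with updates[k] added to row index[k]."""
    out = [list(row) for row in ref]  # Out = Ref (copied, ref stays unchanged)
    for i, update_row in zip(index, updates):
        out[i] = [a + b for a, b in zip(ref[i], update_row)]
    return out


ref = [[1, 1], [2, 2], [3, 3]]
out = scatter(ref, index=[2, 0], updates=[[10, 10], [20, 20]])
print(out)  # [[21, 21], [2, 2], [13, 13]]
```

Rows that are not listed in Index are copied through unchanged.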

            less_equal

            less_equal Operator

            It operates element-wise on X and Y, and returns the Out. Each of them is an N-dim tensor. X and Y could be any type. Each element of the Out tensor is calculated by Out = X <= Y

            Inputs:
            • X : (LoDTensor) the left hand operand of less_equal operator
            • Y : (LoDTensor) the right hand operand of less_equal operator
            Outputs:
            • Out : (LoDTensor) n-dim bool tensor. Each element is Out = X <= Y

            rank_loss

            RankLoss Operator.

            RankLoss operator for RankNet (http://icml.cc/2015/wp-content/uploads/2015/06/icml_ranking.pdf). RankNet is a pairwise ranking model in which one training sample consists of a pair of docs A and B, and a label P indicating whether A is ranked higher than B:

            P = {0, 1} or {0, 0.5, 1}, where 0.5 means no information about the rank of the input pair.

            The RankLoss operator takes three inputs: Left (o_i), Right (o_j) and Label (P_{i,j}), which represent the output score of RankNet for the two docs and the label respectively, and yields the rank loss C_{i,j} using the following equation:

            $$ C_{i,j} = -\tilde{P_{i,j}} * o_{i,j} + \log(1 + e^{o_{i,j}}) \\ o_{i,j} = o_i - o_j \\ \tilde{P_{i,j}} = \left \{0, 0.5, 1 \right \} \ or \ \left \{0, 1 \right \} $$

            The operator can take batch inputs with size batch_size (batch_size >= 1).

            Inputs:
            • Label : (2-D Tensor with shape [batch_size x 1]) The label indicating whether A is ranked higher than B.
            • Left : (2-D Tensor with shape [batch_size x 1]) The output of RankNet for doc A.
            • Right : (2-D Tensor with shape [batch_size x 1]) The output of RankNet for doc B.
            Outputs:
            • Out : (2-D Tensor with shape [batch_size x 1]) The output loss of RankLoss operator.
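The rank loss equation above is easy to evaluate for a single pair. This is a scalar sketch; the function name `rank_loss` is hypothetical, and a batch would be handled by applying it to each row.

```python
import math


def rank_loss(label, left, right):
    """Rank loss for one pair: label is P, left and right are the two scores."""
    o = left - right
    return -label * o + math.log(1.0 + math.exp(o))


print(rank_loss(1.0, 3.0, 0.0))  # small: A scored higher and the label agrees
print(rank_loss(0.0, 3.0, 0.0))  # large: A scored higher but the label disagrees
```

When both scores are equal the loss is log(2) regardless of the label.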

            greater_than

            greater_than Operator

            It operates element-wise on X and Y, and returns the Out. Each of them is an N-dim tensor. X and Y could be any type. Each element of the Out tensor is calculated by Out = X > Y

            Inputs:
            • X : (LoDTensor) the left hand operand of greater_than operator
            • Y : (LoDTensor) the right hand operand of greater_than operator
            Outputs:
            • Out : (LoDTensor) n-dim bool tensor. Each element is Out = X > Y

            equal

            equal Operator

            It operates element-wise on X and Y and returns Out. Each of them is an N-dim tensor. X and Y can be of any type. Each element of the Out tensor is calculated by Out = X == Y

            Inputs:
            • X : (LoDTensor) the left hand operand of equal operator
            • Y : (LoDTensor) the right hand operand of equal operator
            Outputs:
            • Out : (LoDTensor) n-dim bool tensor. Each element is Out = X == Y

            uniform_random

            Uniform random operator.

            This operator initializes a tensor with random values sampled from a uniform distribution.

            Inputs:
              Outputs:
              • Out : (Tensor) The output tensor of uniform random op
              Attributes:
              • shape (Duplicable): (vector<int>) The shape of the output tensor
              • min (Duplicable): (float, default -1.0) Minimum value of uniform random
              • max (Duplicable): (float, default 1.0) Maximum value of uniform random
              • seed (Duplicable): (int, default 0) Random seed used for generating samples. 0 means use a seed generated by the system.
              • dtype (Duplicable): (int, default 5(FP32)) Output tensor data type

              roi_pool

              ROIPool operator

              ROI Pooling for Faster-RCNN. The link below is a further introduction: https://stackoverflow.com/questions/43430056/what-is-roi-layer-in-fast-rcnn

              Inputs:
              • X : (Tensor), the input of ROIPoolOp. The format of input tensor is NCHW. Where N is batch size, C is the number of input channels, H is the height of the feature, and W is the width of the feature.
              • ROIs : (Tensor), ROIs (Regions of Interest) to pool over. It should be a 2-D tensor of shape (num_rois, 5) given as [[batch_id, x1, y1, x2, y2], …], where batch_id is the id of the data, (x1, y1) are the top-left coordinates, and (x2, y2) are the bottom-right coordinates.
              Outputs:
              • Out : (Tensor), The output of ROIPoolOp is a 4-D tensor with shape (num_rois, channels, pooled_h, pooled_w).
              • Argmax (Intermediate) : (Tensor), Argmaxes corresponding to indices in X used for gradient computation. Only output if arg “is_test” is false.
              Attributes:
              • spatial_scale (Duplicable): (float, default 1.0), Multiplicative spatial scale factor to translate ROI coords from their input scale to the scale used when pooling.
              • pooled_height (Duplicable): (int, default 1), The pooled output height.
              • pooled_width (Duplicable): (int, default 1), The pooled output width.

              softmax

              Softmax Operator.

              The input of the softmax operator is a 2-D tensor with shape N x K (N is the batch_size, K is the dimension of input feature). The output tensor has the same shape as the input tensor.

              For each row of the input tensor, the softmax operator squashes the K-dimensional vector of arbitrary real values to a K-dimensional vector of real values in the range [0, 1] that add up to 1. It computes the exponential of each dimension and the sum of the exponential values of all dimensions in the K-dimensional input vector. The output of the softmax operator for a given dimension is the ratio of its exponential to that sum.

              For each row $i$ and each column $j$ in Input(X), we have: $$Y[i, j] = \frac{\exp(X[i, j])}{\sum_k \exp(X[i, k])}$$

              Inputs:
              • X : The input tensor of softmax. 2-D with shape [batch_size, input_feature_dimensions].
              Outputs:
              • Y : The normalized values with the same shape as X.
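A row-wise softmax can be sketched in plain Python as follows. The function name `softmax_rows` and the list-of-lists input are assumptions for the example; subtracting the row maximum is a common numerical-stability trick that is not stated in the text above and does not change the result.

```python
import math


def softmax_rows(x):
    """Apply softmax independently to every row of a 2-D list (N x K)."""
    out = []
    for row in x:
        row_max = max(row)  # shifting by the max avoids overflow in exp()
        exps = [math.exp(v - row_max) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


y = softmax_rows([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
print(y)  # every row sums to 1; the second row is uniform
```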

              seq_expand

              Seq Expand Operator.

              This operator expands input(X) according to the LoD of input(Y). The following cases explain how this works:

              Case 1:

              Given a 2-level LoDTensor input(X)
                  X.lod = [[0, 2, 3], [0, 1, 3, 4]]
                  X.data = [a, b, c, d]
                  X.dims = [4, 1]
              and input(Y)
                  Y.lod = [[0, 2, 4], [0, 3, 6, 7, 8]]
              with condition len(Y.lod[-1]) - 1 == X.dims[0],
              then we get a 2-level LoDTensor
                  Out.lod = [[0, 2, 4], [0, 3, 6, 7, 8]]
                  Out.data = [a, a, a, b, b, b, c, d]
                  Out.dims = [8, 1]

              Case 2:

              Given a 0-level LoDTensor input(X)
                  X.data = [a, b, c]
                  X.lod = NULL
                  X.dims = [3, 1]
              and input(Y)
                  Y.lod = [[0, 2, 3, 6]]
              with condition len(Y.lod[-1]) - 1 == X.dims[0],
              then we get a 1-level LoDTensor
                  Out.lod = [[0, 2, 3, 6]]
                  Out.data = [a, a, b, c, c, c]
                  Out.dims = [6, 1]

              Case 3:

              Given a 0-level LoDTensor input(X)
                  X.data = [[a, b], [c, d], [e, f]]
                  X.lod = NULL
                  X.dims = [3, 2]
              and input(Y)
                  Y.lod = [[0, 2, 3, 6]]
              with condition len(Y.lod[-1]) - 1 == X.dims[0],
              then we get a 1-level LoDTensor
                  Out.lod = [[0, 2, 3, 6]]
                  Out.data = [[a, b], [a, b], [c, d], [e, f], [e, f], [e, f]]
                  Out.dims = [6, 2]

              Case 4:

              Given a 2-level LoDTensor input(X)
                  X.lod = [[0, 2, 3], [0, 1, 3, 4]]
                  X.data = [a, b, c, d]
                  X.dims = [4, 1]
              and input(Y)
                  Y.lod = [[0, 2, 4], [0, 3, 6, 6, 8]]
              with condition len(Y.lod[-1]) - 1 == X.dims[0],
              then we get a 2-level LoDTensor
                  Out.lod = [[0, 2, 4], [0, 3, 6, 6, 8]]
                  Out.data = [a, a, a, b, b, b, d, d]
                  Out.dims = [8, 1]

              Inputs:
              • X : (Tensor or LoDTensor) The input(X) of this operator can be a LoDTensor or a base Tensor.
              • Y : (LoDTensor) The reference input(Y) of seq_expand op. It must be a LoDTensor with k levels (k > 0). The input(X) will be expanded according to the LoD of input(Y). The number of elements in the last level of input(Y) must be equal to dims[0] of input(X).
              Outputs:
              • Out : (LoDTensor) The output of seq_expand op. The LoD of the output is the same as the LoD of input(Y).
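The four cases are all consistent with one rule: row i of X is repeated as many times as the length of the i-th sequence in the last LoD level of Y. The sketch below implements that reading in plain Python. The rule is inferred from the cases rather than taken from the operator's source, and the function name `seq_expand` and the list-based data layout are assumptions.

```python
def seq_expand(x_rows, y_lod):
    """Repeat each row of X by the length of the matching last-level sequence of Y.

    x_rows: list of rows (X.data); y_lod: list of offset lists (Y.lod).
    Returns (Out.data, Out.lod).
    """
    last_level = y_lod[-1]
    if len(last_level) - 1 != len(x_rows):
        raise ValueError("len(Y.lod[-1]) - 1 must equal X.dims[0]")
    out_rows = []
    for i, row in enumerate(x_rows):
        repeat = last_level[i + 1] - last_level[i]  # may be 0, as in Case 4
        out_rows.extend([row] * repeat)
    return out_rows, y_lod


# Case 2 from the text above.
print(seq_expand(["a", "b", "c"], [[0, 2, 3, 6]])[0])
```

A zero-length sequence in Y simply drops the corresponding row of X, which is how `c` disappears in Case 4.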

              sqrt

              Sqrt Activation Operator.

              $y = \sqrt{x}$

              Inputs:
              • X : Input of Sqrt operator
              Outputs:
              • Y : Output of Sqrt operator

              logical_and

              logical_and Operator

              It operates element-wise on X and Y, and returns the Out. X, Y and Out are N-dim boolean tensors. Each element of Out is calculated by $$Out = X \&\& Y$$

              Inputs:
              • X : (LoDTensor) Left hand operand of logical_and operator
              • Y : (LoDTensor) Right hand operand of logical_and operator
              Outputs:
              • Out : (LoDTensor) n-dim bool tensor. Each element is $$Out = X \&\& Y$$

              logical_not

              logical_not Operator

              It operates element-wise on X, and returns the Out. X and Out are N-dim boolean tensors. Each element of Out is calculated by $$Out = !X$$

              Inputs:
              • X : (LoDTensor) Operand of logical_not operator
              Outputs:
              • Out : (LoDTensor) n-dim bool tensor. Each element is $$Out = !X$$

              abs

              Abs Activation Operator.

              $y = |x|$

              Inputs:
              • X : Input of Abs operator
              Outputs:
              • Y : Output of Abs operator

              logical_xor

              logical_xor Operator

              It operates element-wise on X and Y, and returns the Out. X, Y and Out are N-dim boolean tensors. Each element of Out is calculated by $$Out = (X || Y) \, \&\& \, !(X \&\& Y)$$

              Inputs:
              • X : (LoDTensor) Left hand operand of logical_xor operator
              • Y : (LoDTensor) Right hand operand of logical_xor operator
              Outputs:
              • Out : (LoDTensor) n-dim bool tensor. Each element is $$Out = (X || Y) \, \&\& \, !(X \&\& Y)$$

              sequence_slice

              Sequence slice operator

              The operator crops a subsequence from each given sequence using a given start offset and subsequence length. It only supports sequences (LoD Tensors whose level number is 1).

              - Case:
              X = [[a1, a2;
                    b1, b2;
                    c1, c2]
                   [d1, d2;
                    e1, e2]]
              LoD(X) = {{0, 3, 5}}; Dims(X) = (5, 2)
              Offset = [[0], [1]]; Length = [[2], [1]]

              Out = [[a1, a2;
                      b1, b2]
                     [e1, e2]]
              LoD(Out) = {{0, 2, 3}}; Dims(Out) = (3, 2)

              NOTE: The number of sequences in the input, the size of Offset, and the size of Length should be equal. Offsets start from 0.

              Inputs:
              • X : (LoDTensor), the input of SequenceSliceOp.
              • Offset : (Tensor), a vector<int> to describe the offset of every input sequence for sub sequence item.
              • Length : (Tensor), a vector<int> to describe the length of every input sequence for sub sequence item.
              Outputs:
              • Out : (LoDTensor), the output of SequenceSliceOp.
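The slicing rule in the case above can be sketched in plain Python. The function name `sequence_slice`, the flat offset and length lists (the text gives them as [[0], [1]] and [[2], [1]]), and the bounds check are assumptions for the example.

```python
def sequence_slice(x_rows, lod_level, offsets, lengths):
    """Crop one subsequence from every sequence of a 1-level LoD input.

    x_rows: list of rows; lod_level: offsets such as [0, 3, 5];
    offsets / lengths: one entry per sequence.
    Returns (out_rows, out_lod_level).
    """
    num_seqs = len(lod_level) - 1
    if not (len(offsets) == len(lengths) == num_seqs):
        raise ValueError("Offset and Length need one entry per sequence")
    out_rows, out_lod = [], [0]
    for s in range(num_seqs):
        begin = lod_level[s] + offsets[s]
        end = begin + lengths[s]
        if end > lod_level[s + 1]:
            raise ValueError("slice runs past the end of sequence %d" % s)
        out_rows.extend(x_rows[begin:end])
        out_lod.append(out_lod[-1] + lengths[s])
    return out_rows, out_lod


x = [["a1", "a2"], ["b1", "b2"], ["c1", "c2"], ["d1", "d2"], ["e1", "e2"]]
print(sequence_slice(x, [0, 3, 5], [0, 1], [2, 1]))
```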

              hinge_loss

              HingeLoss Operator.

              Let x be a logit (prediction) and y be the actual label. The logit can take any value in (-inf, inf), but the labels should be either -1 or 1. Then, the hinge loss is computed as follows:

              $$ L(x, y) = \max(1 - y \cdot x, 0) $$

              Note that the labels passed as input take the values 0 or 1, whereas the formula above is written for labels in {-1, 1}.

              Inputs:
              • Logits : The input value (Logits) of Hinge loss op. Logits is a 2-D tensor with shape [batch_size, 1].
              • Labels : The target value (Labels) of Hinge loss op. Labels is a 2-D tensor with shape [batch_size, 1].
              Outputs:
              • Loss : The output tensor with shape [batch_size, 1] which represents the hinge loss.
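A minimal sketch of the hinge loss in plain Python follows. It assumes that input labels in {0, 1} are mapped to {-1, +1} as 2 * y - 1 before the formula is applied; that mapping is an assumption made to reconcile the two label conventions mentioned above, and the function name `hinge_loss` is hypothetical.

```python
def hinge_loss(logits, labels01):
    """Hinge loss per sample; labels are given as 0 or 1."""
    losses = []
    for x, y01 in zip(logits, labels01):
        y = 2.0 * y01 - 1.0  # assumed mapping from {0, 1} to {-1, +1}
        losses.append(max(1.0 - y * x, 0.0))
    return losses


print(hinge_loss([2.0, 0.5, -1.0], [1, 1, 0]))
```

Samples classified correctly with a margin of at least 1 contribute zero loss; all others are penalized linearly.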

              bilinear_tensor_product

              Bilinear Tensor Product operator. Given inputs X and Y, a 3-D tensor Weight, and a Bias, each column of the Output is computed from one slice $i = 1, \ldots, k$ of the tensor:

              $$ M = (X W_i) * Y \\ Out_i = \sum_j {M_j} + Bias_i $$

              where $W_i$ is the $i$-th slice of Input(Weight), $M_j$ is the $j$-th column of $M$, $Out_i$ is the $i$-th column of Output(Out), and $Bias_i$ is a column vector whose elements are all equal to the $i$-th element of $Bias$.

              Inputs:
              • X : The first input of bilinear_tensor_product operator.
              • Y : The second input of bilinear_tensor_product operator.
              • Weight : The learnable parameters of bilinear_tensor_product operator.
              • Bias : The learnable bias of bilinear_tensor_product operator.
              Outputs:
              • Out : The output of bilinear_tensor_product operator.
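The formula can be sketched in plain Python as follows. The shapes are assumptions not stated in the text: X is [batch, dx], Y is [batch, dy], Weight is [k, dx, dy], and Bias is a list of k values. The function name is hypothetical.

```python
def bilinear_tensor_product(x, y, weight, bias):
    """Out[b][i] = sum_q ((x_b W_i)[q] * y_b[q]) + bias[i]."""
    out = []
    for x_row, y_row in zip(x, y):
        out_row = []
        for w_i, b_i in zip(weight, bias):
            # (X W_i) for this sample: a vector of length dy.
            xw = [
                sum(x_row[p] * w_i[p][q] for p in range(len(x_row)))
                for q in range(len(y_row))
            ]
            # Element-wise product with Y, summed over columns, plus bias.
            out_row.append(sum(a * b for a, b in zip(xw, y_row)) + b_i)
        out.append(out_row)
    return out


identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]
print(bilinear_tensor_product([[1.0, 2.0]], [[3.0, 4.0]], [identity, swap], [0.5, -1.0]))
```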

              lrn

              Local Response Normalization Operator.

              This operator comes from the paper: <>.

              The original formula is:

              $$ Output(i, x, y) = Input(i, x, y) / \left( k + \alpha \sum\limits^{\min(C, i + n/2)}_{j = \max(0, i - n/2)} (Input(j, x, y))^2 \right)^{\beta} $$

              Function implementation:

              Inputs and outputs are in NCHW format, and input.shape.ndims() equals 4. Dimensions 0 ~ 3 represent batch size, feature maps, rows, and columns, respectively.

              Input and Output in the formula above refer to each map (i) of one image, and Input(i, x, y) and Output(i, x, y) each represent an element in an image.

              C is the number of feature maps of one image. n is a hyper-parameter configured when the operator is initialized. The sum in the denominator is the sum over the same positions in the neighboring maps.

              Inputs:
              • X : (Tensor) The input of LRN operator. It must be a 4D tensor with NCHW format.
              Outputs:
              • Out : (Tensor) The output of LRN operator, which is also the 4D tensor with NCHW format.
              • MidOut : (Tensor) Middle result of LRN operator. It's computed in forward process and also used in backward process.
              Attributes:
              • n (Duplicable): (int, default 5) n is the number of "adjacent" kernel maps at the same spatial position.
              • k (Duplicable): (float, default 2.0) k is the bias.
              • alpha (Duplicable): (float, default 0.0001) alpha is the scale number.
              • beta (Duplicable): (float, default 0.75) beta is the power number.
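For a single spatial position, the normalization across feature maps can be sketched in plain Python as below. The window is centred on the map being normalized, and the upper bound is clipped to the last valid map index (C - 1); that clipping is an assumption made so the index stays in range. The function name `lrn_channels` is hypothetical, and the defaults follow the attribute list above.

```python
def lrn_channels(values, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Normalize one (x, y) position across its C feature maps.

    values: list with one activation per feature map at that position.
    """
    num_maps = len(values)
    half = n // 2
    out = []
    for i, v in enumerate(values):
        lo = max(0, i - half)
        hi = min(num_maps - 1, i + half)  # assumed clipping to a valid index
        sq_sum = sum(values[j] ** 2 for j in range(lo, hi + 1))
        out.append(v / (k + alpha * sq_sum) ** beta)
    return out


print(lrn_channels([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]))
```

Applying the same function to every (x, y) position of every image in the batch would cover a full NCHW tensor.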

              beam_search_decode

              Pack the result of Beam search op into SentenceIds and SentenceScores.

              Inputs:
              • Ids : (LodTensorArray) ids of the candidate words in each step
              • Scores : (LodTensorArray) scores of the candidate words in each step
              Outputs:
              • SentenceIds : (LodTensor) All possible result sentences of word ids
              • SentenceScores : (LodTensor) All possible result sentences of word scores

              assign

              Assign Operator

              Out = X when the type of X is one of [LoDTensor/SelectedRows/LoDTensorArray]. An error is raised if the type is not listed above.

              Inputs:
              • X : (LoDTensor, SelectedRows or LoDTensorArray) The input variable could be LoDTensor, SelectedRows or LoDTensorArray.
              Outputs:
              • Out : (LoDTensor, SelectedRows or LoDTensorArray) The type of output is the same as input X.

              split

              Split operator

              This operator splits the input tensor into multiple sub-tensors.

              Example:
                Input = [[1,2],
                         [3,4],
                         [5,6]]
                sections = [2,1]
                axis = 0
                Output[0] = [[1,2],
                             [3,4]]
                Output[1] = [[5,6]]

              Inputs:
              • X : (Tensor) Input tensor of the split operator.
              Outputs:
              • Out (Duplicable) : (Tensor) Output tensors of the split operator.
              Attributes:
              • sections (Duplicable): (vector<int>) the length of each output along the specified axis.
              • num (Duplicable): (int, default 0) Number of sub-tensors. This must evenly divide Input.dims()[axis]
              • axis (Duplicable): (int, default 0) The axis along which the input will be split.
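A minimal sketch of the two ways to specify a split (by `sections` or by `num`), restricted to axis 0 of a list of rows. The function name `split_rows` and the rule that `sections` takes precedence when both are given are assumptions for the example.

```python
def split_rows(x, sections=None, num=0):
    """Split a list of rows along axis 0, by explicit sections or into num equal parts."""
    rows = len(x)
    if sections is None:
        if num <= 0 or rows % num != 0:
            raise ValueError("num must evenly divide the size of the split axis")
        sections = [rows // num] * num
    if sum(sections) != rows:
        raise ValueError("sections must add up to the size of the split axis")
    outputs, start = [], 0
    for size in sections:
        outputs.append(x[start:start + size])
        start += size
    return outputs


print(split_rows([[1, 2], [3, 4], [5, 6]], sections=[2, 1]))
```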

              chunk_eval

              For some basics of chunking, please refer to ‘Chunking with Support Vector Machines https://aclanthology.info/pdf/N/N01/N01-1025.pdf’.

              ChunkEvalOp computes the precision, recall, and F1-score of chunk detection, and supports the IOB, IOE, IOBES and IO (also known as plain) tagging schemes. Here is an NER example of labeling for these tagging schemes:

                     Li     Ming    works  at  Agricultural   Bank   of    China  in  Beijing.
              IO:    I-PER  I-PER   O      O   I-ORG          I-ORG  I-ORG I-ORG  O   I-LOC
              IOB:   B-PER  I-PER   O      O   B-ORG          I-ORG  I-ORG I-ORG  O   B-LOC
              IOE:   I-PER  E-PER   O      O   I-ORG          I-ORG  I-ORG E-ORG  O   E-LOC
              IOBES: B-PER  E-PER   O      O   B-ORG          I-ORG  I-ORG E-ORG  O   S-LOC

              There are three chunk types (named entity types): PER (person), ORG (organization) and LOC (location). The labels have the form <tag type>-<chunk type>.

              Since the calculations actually use label ids rather than labels, extra attention should be paid when mapping labels to ids to make ChunkEvalOp work. The key point is that the ids must satisfy the following equations.

              tag_type = label % num_tag_type
              chunk_type = label / num_tag_type
              

              where num_tag_type is the number of tag types in the tagging scheme, num_chunk_type is the number of chunk types, and tag_type gets its value from the following table.

              Scheme Begin Inside End   Single
               plain   0     -      -     -
               IOB     0     1      -     -
               IOE     -     0      1     -
               IOBES   0     1      2     3
              

              Continuing with the NER example, assume the tagging scheme is IOB and the chunk types are ORG, PER and LOC. To satisfy the above equations, the label map can look like this:

              B-ORG  0
              I-ORG  1
              B-PER  2
              I-PER  3
              B-LOC  4
              I-LOC  5
              O      6
              

              It is not hard to verify the equations, noting that the number of chunk types is 3 and the number of tag types in the IOB scheme is 2. For example, the label id of I-LOC is 5, the tag type id of I-LOC is 1, and the chunk type id of I-LOC is 2, which is consistent with the results from the equations.

              Inputs:
              • Inference : (Tensor, default: Tensor<int64_t>). Predictions from the network.
              • Label : (Tensor, default: Tensor<int64_t>). The true tag sequences.
              Outputs:
              • Precision : (float). The evaluated precision (also called positive predictive value) of chunks on the given mini-batch.
              • Recall : (float). The evaluated recall (true positive rate or sensitivity) of chunks on the given mini-batch.
              • F1-Score : (float). The evaluated F1-Score on the given mini-batch.
              Attributes:
              • num_chunk_types (Duplicable): (int). The number of chunk types. See above for details.
              • chunk_scheme (Duplicable): (string, default IOB). The labeling scheme indicating how to encode the chunks. Must be IOB, IOE, IOBES or plain. See above for details.
              • excluded_chunk_types (Duplicable): (list<int>) A list of chunk type ids indicating chunk types that are not counted. See above for details.
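The two label-id equations from the description can be checked with a few lines of Python. The table `NUM_TAG_TYPES` is inferred from the scheme table (the text only states that IOB has 2 tag types), and the function name `decode_label` is hypothetical.

```python
# Number of tag types per scheme, read off the scheme table (inferred, not stated).
NUM_TAG_TYPES = {"plain": 1, "IOB": 2, "IOE": 2, "IOBES": 4}


def decode_label(label, scheme="IOB"):
    """Return (tag_type, chunk_type) for an integer label id."""
    num_tag_type = NUM_TAG_TYPES[scheme]
    return label % num_tag_type, label // num_tag_type


# Label map from the text: B-ORG 0, I-ORG 1, B-PER 2, I-PER 3, B-LOC 4, I-LOC 5, O 6.
for name, label_id in [("B-ORG", 0), ("I-PER", 3), ("I-LOC", 5), ("O", 6)]:
    print(name, decode_label(label_id))
```

With this label map, `O` decodes to chunk type id 3, which lies outside the three real chunk types (0 to 2).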

              sigmoid

              Sigmoid Activation Operator

              $$y = \frac{1}{1 + e^{-x}}$$

              Inputs:
              • X : Input of Sigmoid operator
              Outputs:
              • Y : Output of Sigmoid operator

              squared_l2_distance

              SquaredL2Distance operator

              This operator calculates the squared L2 distance between the input and the target. The number of distance values is equal to the first dimension of the input. The first dimension of the target can be equal to that of the input or to 1. If the first dimension of the target is 1, the operator will broadcast the target's first dimension to the input's first dimension. During backward propagation, the user can decide whether to calculate the gradient of the input, the target, or both.

              Both the input X and Y can carry the LoD (Level of Details) information. However, the output only shares the LoD information with input X.

              Inputs:
              • X : (Tensor) Input of SquaredL2DistanceOp.
              • Y : (Tensor) Target of SquaredL2DistanceOp.
              Outputs:
              • sub_result (Intermediate) : (Tensor) Buffering subtraction result which will be reused in backward.
              • Out : (Tensor) Squared l2 distance between input and target.
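A plain-Python sketch of the distance and the broadcasting rule described above. The function name `squared_l2_distance` and the [N, 1] output layout are assumptions for the example; the intermediate `sub_result` is not returned here.

```python
def squared_l2_distance(x, y):
    """Row-wise squared L2 distance; y may have a single row that is broadcast."""
    if len(y) not in (1, len(x)):
        raise ValueError("first dimension of target must be 1 or match the input")
    out = []
    for i, x_row in enumerate(x):
        y_row = y[0] if len(y) == 1 else y[i]
        out.append([sum((a - b) ** 2 for a, b in zip(x_row, y_row))])
    return out


print(squared_l2_distance([[1.0, 2.0], [3.0, 4.0]], [[1.0, 0.0]]))
```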

              relu

              Relu Activation Operator.

              $y = \max(x, 0)$

              Inputs:
              • X : Input of Relu operator
              Outputs:
              • Y : Output of Relu operator

              fetch

              Fetch Operator.

              It should not be configured by users directly.

              Inputs:
              • X : The input of fetch op
              Outputs:
              • Out : The output of fetch op
              Attributes:
              • col (Duplicable): (int) The column of fetch

              while

              Inputs:
              • X (Duplicable) : A set of variables, which are required by operators inside the block of While Op.
              • Condition (Duplicable) : (Bool) A scalar. When it's False, the While Op will be terminated.
              Outputs:
              • Out (Duplicable) : A set of variables, which will be assigned with values generated by the operators inside the block of While Op.
              • StepScopes : (StepScopeVar) A vector of local scopes, whose size equals the step number of While Op. The i'th scope stores temporary variables generated in the i'th step.
              Attributes:
              • step_block (Duplicable): The step block inside WhileOp

              proximal_adagrad

              Proximal Adagrad Optimizer.

              Optimizer that implements the proximal adagrad algorithm:

              $$ moment = moment + grad * grad \\ prox\_param = param - learning\_rate * grad * (1 / \sqrt{moment}) \\ param = sign(prox\_param) / (1 + learning\_rate * l2) * \max(|prox\_param| - learning\_rate * l1 , 0) $$

              The paper that proposed Proximal GD: (http://papers.nips.cc/paper/3793-efficient-learning-using-forward-backward-splitting.pdf). Here, we use the adagrad learning rate as specified in: (http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf)

              Inputs:
              • Param : (Tensor, default Tensor<float>) Input parameter that has to be updated.
              • Moment : (Tensor, default Tensor<float>) Moment parameter that has to be updated.
              • Grad : (Tensor, default Tensor<float>) Input gradient of the parameter.
              • LearningRate : (Tensor, default Tensor<float>) The learning rate should be a tensor of size 1.
              Outputs:
              • ParamOut : (Tensor) Output updated parameter value.
              • MomentOut : (Tensor) Output updated moment value.
              Attributes:
              • l1 (Duplicable): (float, default 0.0) L1 regularization strength.
              • l2 (Duplicable): (float, default 0.0) L2 regularization strength.
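One update step of the three equations above can be sketched element by element in plain Python. The function name `proximal_adagrad_step` and the flat-list layout are assumptions. The sketch divides by sqrt(moment) exactly as written, so it fails if the updated moment is zero.

```python
import math


def proximal_adagrad_step(param, moment, grad, lr, l1=0.0, l2=0.0):
    """Return (param_out, moment_out) after one proximal adagrad step."""
    new_param, new_moment = [], []
    for p, m, g in zip(param, moment, grad):
        m_out = m + g * g
        prox = p - lr * g / math.sqrt(m_out)
        # Soft-threshold by lr * l1, then shrink by the L2 term.
        shrunk = max(abs(prox) - lr * l1, 0.0)
        sign = (prox > 0) - (prox < 0)
        new_param.append(sign * shrunk / (1.0 + lr * l2))
        new_moment.append(m_out)
    return new_param, new_moment


print(proximal_adagrad_step([1.0], [0.0], [2.0], lr=0.5, l1=0.5, l2=1.0))
```

With l1 = l2 = 0 the step reduces to a plain adagrad update without epsilon.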

              minus

              Minus Operator.

              Equation:

              $Out = X - Y$
              

              Both the input X and Y can carry the LoD (Level of Details) information, or not. But the output only shares the LoD information with input X.

              Inputs:
              • X : The left tensor of minus operator.
              • Y : The right tensor of minus operator.
              Outputs:
              • Out : The output tensor of minus operator.

              cos_sim

              Cosine Similarity Operator.

              $Out = X^T * Y / (\sqrt{X^T * X} * \sqrt{Y^T * Y})$

              The inputs X and Y must have the same shape, except that the 1st dimension of input Y can be just 1 (different from input X), in which case Y is broadcast to match the shape of input X before their cosine similarity is computed.

              Both the input X and Y can carry the LoD (Level of Details) information, or not. But the output only shares the LoD information with input X.

              Inputs:
              • X : The 1st input of cos_sim op.
              • Y : The 2nd input of cos_sim op.
              Outputs:
              • Out : The output of cos_sim op.
              • XNorm (Intermediate) : Norm of the first input, reduced along the 1st dimension.
              • YNorm (Intermediate) : Norm of the second input, reduced along the 1st dimension.
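A plain-Python sketch of the row-wise cosine similarity with the broadcasting rule described above. The function name `cos_sim` and the list-of-rows layout are assumptions; zero-norm rows are not handled and would raise a division error.

```python
import math


def cos_sim(x, y):
    """Cosine similarity of each row of x with the matching (or single) row of y."""
    if len(y) not in (1, len(x)):
        raise ValueError("first dimension of Y must be 1 or match X")
    out = []
    for i, x_row in enumerate(x):
        y_row = y[0] if len(y) == 1 else y[i]
        dot = sum(a * b for a, b in zip(x_row, y_row))
        x_norm = math.sqrt(sum(a * a for a in x_row))
        y_norm = math.sqrt(sum(b * b for b in y_row))
        out.append(dot / (x_norm * y_norm))
    return out


print(cos_sim([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]], [[1.0, 0.0]]))
```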

              precision_recall

              Precision Recall Operator.

              When given Input(Indices) and Input(Labels), this operator can be used to compute various metrics including:
              1. macro average precision
              2. macro average recall
              3. macro f1 score
              4. micro average precision
              5. micro average recall
              6. micro f1 score

              To compute the above metrics, we need to count true positives, false positives and false negatives. The count of true negatives is not required, but it may be useful and is cheap to compute, so the operator also provides it.

              We define state as a 2-D tensor with shape [class_number, 4]. Each row of a state contains statistic variables for corresponding class. Layout of each row is: TP(true positives), FP(false positives), TN(true negatives), FN(false negatives). If Input(Weights) is provided, TP, FP, TN, FN will be calculated by given weight instead of the instance count.

              This operator also supports computing metrics across batches. To achieve this, Input(StatesInfo) should be provided. The state of the current batch will be accumulated into Input(StatesInfo), and Output(AccumStatesInfo) is the accumulated state.

              Output(BatchMetrics) holds the metrics of the current batch, while Output(AccumMetrics) holds the metrics of the accumulated data.

              Inputs:
              • MaxProbs : (Tensor, default Tensor<float>) A 2-D tensor with shape N x 1, where N is the batch size. Each row contains the max probability of an instance, which is computed by the previous top_k (k=1) operator.
              • Indices : (Tensor, default Tensor<int>) A 2-D tensor with shape N x 1, where N is the batch size. Each row contains the corresponding index, which is computed by the previous top_k (k=1) operator.
              • Labels : (Tensor, default Tensor<int>) A 2-D tensor with shape N x 1, where N is the batch size. Each element is a label and the value should be in [0, class_number - 1].
              • Weights : (Tensor, default Tensor<float>) A 2-D tensor with shape N x 1, where N is the batch size. This input is optional. If provided, weight of instance would be considered when computing metrics.
              • StatesInfo : (Tensor, default Tensor<int>) A 2-D tensor with shape D x 4, where D is the number of classes. This input is optional. If provided, current state will be accumulated to this state and the accumulation state will be the output state.
              Outputs:
              • BatchMetrics : (Tensor, default Tensor<float>) A 1-D tensor with shape {6}. This output tensor contains metrics for current batch data. The layout is [macro average precision, macro average recall, macro f1 score, micro average precision, micro average recall, micro f1 score].
              • AccumMetrics : (Tensor, default Tensor<float>) A 1-D tensor with shape {6}. This output tensor contains metrics for accumulated data. The layout is [macro average precision, macro average recall, macro f1 score, micro average precision, micro average recall, micro f1 score].
              • AccumStatesInfo : (Tensor, default Tensor<float>) A 2-D tensor with shape D x 4, where D is equal to class number. This output tensor contains accumulated state variables used to compute metrics. The layout for each class is [true positives, false positives, true negatives, false negatives].
              Attributes:
              • class_number (Duplicable): (int) Number of classes to be evaluated.
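The six metrics can be derived from a state tensor laid out as [TP, FP, TN, FN] per class, as sketched below in plain Python. Two choices here are assumptions rather than documented behaviour: a ratio with a zero denominator is reported as 0.0, and the macro F1 score is computed from the macro-averaged precision and recall. The function name is hypothetical.

```python
def metrics_from_states(states):
    """states: list of [TP, FP, TN, FN] rows, one per class.

    Returns [macro P, macro R, macro F1, micro P, micro R, micro F1].
    """
    def safe_div(a, b):
        return a / b if b else 0.0  # assumed convention for empty denominators

    def f1(p, r):
        return safe_div(2.0 * p * r, p + r)

    precisions = [safe_div(tp, tp + fp) for tp, fp, _tn, _fn in states]
    recalls = [safe_div(tp, tp + fn) for tp, _fp, _tn, fn in states]
    macro_p = sum(precisions) / len(states)
    macro_r = sum(recalls) / len(states)

    total_tp = sum(s[0] for s in states)
    total_fp = sum(s[1] for s in states)
    total_fn = sum(s[3] for s in states)
    micro_p = safe_div(total_tp, total_tp + total_fp)
    micro_r = safe_div(total_tp, total_tp + total_fn)

    return [macro_p, macro_r, f1(macro_p, macro_r),
            micro_p, micro_r, f1(micro_p, micro_r)]


print(metrics_from_states([[2, 1, 5, 0], [1, 0, 6, 1]]))
```

Accumulating across batches then amounts to adding the per-batch state rows element-wise before calling the same function.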

              batch_norm

              Batch Normalization.

              Batch Norm has been implemented as discussed in the paper https://arxiv.org/pdf/1502.03167.pdf and can be used as a normalizer function for conv2d and fully_connected operations. The required data format for this layer is one of the following:
              1. NHWC [batch, in_height, in_width, in_channels]
              2. NCHW [batch, in_channels, in_height, in_width]

              Inputs:
              • X : The input tensor
              • Scale : Scale is a 1-dimensional tensor of size C that is applied to the output
              • Bias : Bias is a 1-dimensional tensor of size C that is applied to the output
              • Mean : The global mean (for training) or estimated mean (for testing)
              • Variance : The global variance (for training) or estimated Variance (for testing)
              Outputs:
              • Y : result after normalization
              • MeanOut : Share memory with Mean. Store the global mean when training
              • VarianceOut : Share memory with Variance. Store the global Variance when training
              • SavedMean (Intermediate) : Mean of the current mini batch, will apply to output when training
              • SavedVariance (Intermediate) : Variance of the current mini batch, will apply to output when training
              Attributes:
              • is_test (Duplicable):
              • momentum (Duplicable):
              • epsilon (Duplicable):
              • tensor_format (Duplicable):

              read_from_array

              ReadFromArray Operator.

              Read a LoDTensor from a LoDTensor Array.

              Assume $T$ is LoDTensor, $i$ is the subscript of the array, and $A$ is the array. The equation is

              $$T = A[i]$$

              Inputs:
              • X : (TensorArray) the array to read from.
              • I : (Tensor) the subscript index in the tensor array. The number of elements should be 1.
              Outputs:
              • Out : (LoDTensor) the tensor read from the array.

              softplus

              Softplus Activation Operator.

              $y = \ln(1 + e^{x})$

              Inputs:
              • X : Input of Softplus operator
              Outputs:
              • Y : Output of Softplus operator

              accuracy

              Accuracy Operator.

              It computes the accuracy rate for classification. The accuracy is calculated as follows:

              $$accuracy = \frac{NumOfCorrectPredicts}{NumOfAllSamples}$$

              Both the input Out and Label can carry the LoD (Level of Details) information, or not. But the output only shares the LoD information with the input Out(Inference).

              Inputs:
              • Out : The network output of topk (inferences)
              • Indices : The network output of topk (indices)
              • Label : Label of the training data
              Outputs:
              • Accuracy : The accuracy of current batch
              • Correct : The correct samples count of current batch
              • Total : The samples count of current batch
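The accuracy formula and the three outputs can be sketched in plain Python. The sketch assumes that a sample counts as correct when its label appears anywhere among its top-k indices; that criterion is an assumption, as is the function name `accuracy`. An empty batch is not handled.

```python
def accuracy(indices, labels):
    """indices: N rows of top-k class indices; labels: N true class ids.

    Returns (accuracy, correct, total).
    """
    correct = sum(1 for row, label in zip(indices, labels) if label in row)
    total = len(labels)
    return correct / total, correct, total


print(accuracy([[2, 0], [1, 3], [0, 1]], [0, 2, 1]))
```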

              conv_shift

              ConvShift Operator.

              A layer for circular convolution of two vectors, as used in the Neural Turing Machine: https://arxiv.org/abs/1410.5401

              The equation is:

              $$Out[i] = \sum_{j=-(N-1)/2}^{(N-1)/2} X_{i+j} * Y_{j}$$

              where X's index is computed modulo M, and Y's index is computed modulo N.

              Both inputs X and Y can carry LoD (Level of Details) information. However, the output only shares the LoD information with input X.

              Inputs:
              • X : (Tensor, default Tensor<float>), a 2-D tensor with shape B x M, where B is the batch size and M is the data dimension.
              • Y : (Tensor, default Tensor<float>), a 2-D tensor with shape B x N, where B is the batch size and N is the data dimension. N must be odd.
              Outputs:
              • Out : (Tensor, default Tensor<float>), a 2-D tensor with shape B x M, i.e., the same shape as X.
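The circular convolution can be sketched in plain Python by following the stated equation literally: X is indexed modulo M and Y modulo N, so a negative j wraps to the end of Y. Whether the operator's kernel lays out Y in exactly this way is not confirmed by the text, so treat the index convention as an assumption. The function name is hypothetical.

```python
def conv_shift(x, y):
    """Circular convolution per batch row: Out[i] = sum_j X[(i+j) % M] * Y[j % N]."""
    out = []
    for x_row, y_row in zip(x, y):
        m, n = len(x_row), len(y_row)
        if n % 2 == 0:
            raise ValueError("N must be odd")
        half = (n - 1) // 2
        out.append([
            sum(x_row[(i + j) % m] * y_row[j % n] for j in range(-half, half + 1))
            for i in range(m)
        ])
    return out


# A one-hot Y at index 1 (j = +1) rotates X left by one position.
print(conv_shift([[1, 2, 3, 4]], [[0, 1, 0]]))
```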

              nce

              Compute and return the noise-contrastive estimation training loss. See Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. By default this operator uses a uniform distribution for sampling.

              Inputs:
              • Input : (Tensor) A tensor of shape [batch_size, dim].
              • Label : (Tensor) A tensor of shape [batch_size, num_true_class]. 'num_true_class' is the number of target classes in each sample. The number of target classes per sample should be the same. If you have a variable number of target classes, you can pad them out to a constant number by either repeating them or by padding with an otherwise unused class.
              • Weight : (Tensor) A tensor of shape [num_class, dim]. 'num_class' is the total number of classes.
              • Bias : (Tensor) A tensor of shape [num_class, 1]. 'num_class' is the total number of classes. It is a dispensable input.
              • SampleWeight : (Tensor) A tensor of shape [batch_size, 1] storing a weight for each sample. It is a dispensable input. The default weight of each sample is 1.
              Outputs:
              • Cost : (Tensor) A tensor of shape [batch_size, 1]. Cost of samples.
              • SampleLogits (Intermediate) : An intermediate tensor of shape [batch_size, num_neg_samples + num_pos_samples]. This tensor is the output of the forward kernel and is used in the backward kernel to compute gradients. Given X, the dot product of the input tensor and the sampled labels' weights, 'SampleLogits' is sigmoid(X).
              • SampleLabels (Intermediate) : An intermediate tensor of shape [batch_size, num_neg_samples + num_pos_samples]. This tensor is the output of the forward kernel and is used in the backward kernel to compute gradients.
              Attributes:
              • num_total_classes (Duplicable): Total number of classes in all samples.
              • num_neg_samples (Duplicable): The number of negative classes. The default value is 10.
              • custom_neg_classes (Duplicable): This attribute is only used in unit tests. Classes in this list will be used as negative classes for every sample. Under normal conditions, users should avoid setting this attribute.
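
The relationship between Input, Weight, Bias and SampleLogits can be sketched for a single sample as follows. This is only an illustration, not the operator's kernel: the function name `sampled_sigmoid_logits`, the ordering (true classes first, then negatives), the addition of the bias to the dot product, and the use of Python's `random` module for the uniform sampling are all assumptions.

```python
import math
import random


def sampled_sigmoid_logits(input_row, true_classes, weight, bias, num_neg_samples, rng):
    """Return (sampled class ids, sigmoid of their scores) for one sample.

    input_row: list of length dim.
    weight: num_class x dim list of rows; bias: list of length num_class.
    Assumption: negatives are drawn uniformly from all classes and are
    placed after the true classes.
    """
    num_total_classes = len(weight)
    negatives = [rng.randrange(num_total_classes) for _ in range(num_neg_samples)]
    sampled = list(true_classes) + negatives
    probs = []
    for c in sampled:
        # Dot product of the input with the sampled class's weight row.
        score = sum(a * b for a, b in zip(input_row, weight[c])) + bias[c]
        probs.append(1.0 / (1.0 + math.exp(-score)))
    return sampled, probs
```

With one true class and `num_neg_samples = 3`, the returned lists have length 4, matching the second dimension `num_neg_samples + num_pos_samples` of SampleLogits.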

              linear_chain_crf

              LinearChainCRF Operator.

              Conditional Random Field defines an undirected probabilistic graph with nodes denoting random variables and edges denoting dependencies between these variables. CRF learns the conditional probability $P(Y|X)$, where $X = (x_1, x_2, ... , x_n)$ are structured inputs and $Y = (y_1, y_2, ... , y_n)$ are labels for the inputs.

              Linear chain CRF is a special case of CRF that is useful for sequence labeling tasks. Sequence labeling tasks do not assume many conditional independences among inputs. The only constraint they impose is that the input and output must be linear sequences. Thus, the graph of such a CRF is a simple chain or a line, which results in the linear chain CRF.

              This operator implements the Forward-Backward algorithm for the linear chain CRF. Please refer to http://www.cs.columbia.edu/~mcollins/fb.pdf and http://cseweb.ucsd.edu/~elkan/250Bwinter2012/loglinearCRFs.pdf for details.

              Equation:

              1. Denote Input(Emission) to this operator as $x$ here.

              2. The first D values of Input(Transition) to this operator are for starting weights, denoted as $a$ here.

              3. The next D values of Input(Transition) of this operator are for ending weights, denoted as $b$ here.

              4. The remaining values of Input(Transition) are for transition weights, denoted as $w$ here.

              5. Denote Input(Label) as $s$ here.

              The probability of a sequence $s$ of length $L$ is defined as: $$P(s) = (1/Z) \exp(a_{s_1} + b_{s_L} + \sum_{l=1}^L x_{s_l} + \sum_{l=2}^L w_{s_{l-1},s_l})$$

              where $Z$ is a normalization value so that the sum of $P(s)$ over all possible sequences is 1, and $x$ is the emission feature weight to the linear chain CRF.

              Finally, the linear chain CRF operator outputs the logarithm of the conditional likelihood of each training sample in a mini-batch.

              NOTE:

              1. The feature function for a CRF is made up of the emission features and the transition features. The emission feature weights are NOT computed in this operator. They MUST be computed first before this operator is called.

              2. Because this operator performs global normalization over all possible sequences internally, it expects UNSCALED emission feature weights. Please do not call this op with the emission feature being output of any nonlinear activation.

              3. The 2nd dimension of Input(Emission) MUST be equal to the tag number.

              Inputs:
              • Emission : (LoDTensor, default LoDTensor<float>) A 2-D LoDTensor with shape [N x D], where N is the size of the mini-batch and D is the total tag number. The unscaled emission weight matrix for the linear chain CRF.
              • Transition : (Tensor, default Tensor<float>) A 2-D Tensor with shape [(D + 2) x D]. The learnable parameter for the linear_chain_crf operator. See more details in the operator's comments.
              • Label : (LoDTensor, default LoDTensor<int64_t>) A LoDTensor with shape [N x 1], where N is the total element number in a mini-batch. The ground truth.
              Outputs:
              • Alpha (Intermediate) : (Tensor, default Tensor<float>) A 2-D Tensor with shape [N x D]. The forward vectors for the entire batch. Denote it as $\alpha$. $\alpha$ is a memo table used to calculate the normalization factor in CRF. $\alpha[k, v]$ stores the unnormalized probabilities of all possible unfinished sequences of tags that end at position $k$ with tag $v$. For each $k$, the values $\alpha[k, v]$ form a vector of length $D$ with a component for each tag value $v$. This vector is called a forward vector and will also be used in backward computations.
              • EmissionExps (Intermediate) : (Tensor, default Tensor<float>) A 2-D Tensor with shape [N x D]. The exponentials of Input(Emission). This is an intermediate computational result in forward computation, and will be reused in backward computation.
              • TransitionExps (Intermediate) : (Tensor, default Tensor<float>) A 2-D Tensor with shape [(D + 2) x D]. The exponentials of Input(Transition). This is an intermediate computational result in forward computation, and will be reused in backward computation.
              • LogLikelihood : (Tensor, default Tensor<float>) The logarithm of the conditional likelihood of each training sample in a mini-batch. This is a 2-D tensor with shape [S x 1], where S is the number of sequences in a mini-batch. Note that the output is no longer a LoDTensor.
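
The definition of $P(s)$ can be made concrete with a small sketch that computes $\log P(s)$ for a single sequence. It assumes the Transition layout described above (row 0 holds the starting weights $a$, row 1 the ending weights $b$, the remaining D rows the transition weights $w$). The function names are hypothetical, and the forward recursion is done in log space for readability, which is not necessarily how the operator stores Alpha.

```python
import math


def _logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def crf_log_likelihood(emission, transition, labels):
    """Log-probability of one labelled sequence under a linear chain CRF.

    emission: L x D list of rows (unscaled emission weights x).
    transition: (D + 2) x D list of rows: start a, end b, then w.
    labels: list of L tag ids (the sequence s).
    """
    start, end, trans = transition[0], transition[1], transition[2:]
    num_tags = len(start)

    # Unnormalized log-score of the labelled path.
    score = start[labels[0]] + end[labels[-1]]
    score += sum(emission[l][tag] for l, tag in enumerate(labels))
    score += sum(trans[p][q] for p, q in zip(labels, labels[1:]))

    # Forward recursion: alpha[v] is the log-sum of the scores of all
    # partial paths that end at the current position with tag v.
    alpha = [start[v] + emission[0][v] for v in range(num_tags)]
    for row in emission[1:]:
        alpha = [
            row[v] + _logsumexp([alpha[u] + trans[u][v] for u in range(num_tags)])
            for v in range(num_tags)
        ]
    log_z = _logsumexp([alpha[v] + end[v] for v in range(num_tags)])
    return score - log_z
```

A quick sanity check on such a sketch is that the exponentiated results, summed over every possible tag sequence of a given length, add up to 1.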

              logsigmoid

              Logsigmoid Activation Operator

              $$y = \log \frac{1}{1 + e^{-x}}$$

              Inputs:
              • X : Input of LogSigmoid operator
              Outputs:
              • Y : Output of LogSigmoid operator
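
Evaluating the formula literally can overflow for large negative inputs, because $e^{-x}$ grows without bound. A minimal scalar sketch of a numerically stable evaluation follows; the function name `log_sigmoid` is hypothetical and this is not claimed to be how the operator's kernel is written.

```python
import math


def log_sigmoid(x: float) -> float:
    """Compute log(1 / (1 + exp(-x))) without overflow.

    Uses the identity log(sigmoid(x)) = min(x, 0) - log(1 + exp(-|x|)),
    so the argument of exp is never positive.
    """
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))
```

For example, `log_sigmoid(-1000.0)` evaluates to `-1000.0`, whereas computing `math.exp(1000.0)` directly would raise an `OverflowError`.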

              row_conv

              Row-convolution Operator.

              The row convolution is also called lookahead convolution. This operator was introduced in the following paper for DeepSpeech2: http://www.cs.cmu.edu/~dyogatam/papers/wang+etal.iclrworkshop2016.pdf

              The main motivation is that a bidirectional RNN, useful in DeepSpeech like speech models, learns representation for a sequence by performing a forward and a backward pass through the entire sequence. However, unlike unidirectional RNNs, bidirectional RNNs are challenging to deploy in an online and low-latency setting. The lookahead convolution incorporates information from future subsequences in a computationally efficient manner to improve unidirectional recurrent neural networks. The row convolution operator is different from the 1D sequence convolution, and is computed as follows:

              Given an input sequence $in$ of length $t$ and input dimension $d$, and a filter ($W$) of size $context \times d$, the output sequence is convolved as:

              $$ out_{i, :} = \sum_{j=i}^{i + context} in_{j,:} \cdot W_{j-i, :} $$

              Inputs:
              • X : (LoDTensor), the input(X) is a LoDTensor, which supports variable time-length input sequences. The underlying tensor in this LoDTensor is a matrix with shape (T x N), where T is the total time steps in this mini-batch and N is the input data dimension.
              • Filter : (Tensor), the input(Filter) is a learnable parameter. It is a 2-D tensor with shape (future_context x N), where, future_context is the future context length and N is the data dimension.
              Outputs:
              • Out : (LoDTensor), the output(Out) is a LoDTensor, which supports variable time-length input sequences. The underlying tensor in this LoDTensor is a matrix with shape T x N, i.e., the same shape as X.
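
The lookahead computation can be sketched for a single sequence as follows. The sketch assumes that the filter row is selected by the offset into the future (row 0 applies to the current step), that the product is element-wise per feature, that the sum covers exactly the rows of the filter, and that steps beyond the end of the sequence contribute nothing. The function name `row_conv` and the nested-list representation are illustrative only.

```python
def row_conv(seq, filt):
    """Lookahead (row) convolution over one sequence.

    seq: T x N list of rows (time steps by features).
    filt: future_context x N list of rows.
    Returns a T x N list of rows.
    """
    t_len = len(seq)
    n = len(seq[0]) if seq else 0
    out = [[0.0] * n for _ in range(t_len)]
    for i in range(t_len):
        for k, w_row in enumerate(filt):
            j = i + k  # time step k positions into the future
            if j >= t_len:
                break  # no more future context in this sequence
            for d in range(n):
                out[i][d] += seq[j][d] * w_row[d]
    return out
```

Because each output row only depends on a fixed number of future steps, the result for step `i` can be emitted as soon as those steps have arrived, which is what makes the operator suitable for low-latency use.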

              exp

              Exp Activation Operator.

              $y = e^x$

              Inputs:
              • X : Input of Exp operator
              Outputs:
              • Y : Output of Exp operator

              soft_relu

              SoftRelu Activation Operator.

              $y = \ln(1 + \exp(\max(\min(x, threshold), -threshold)))$

              Inputs:
              • X : Input of SoftRelu operator
              Outputs:
              • Y : Output of SoftRelu operator
              Attributes:
              • threshold (Duplicable): The threshold value of SoftRelu
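
A minimal scalar sketch of this activation, assuming the input is clipped to the range [-threshold, threshold] before the softplus function is applied. The function name `soft_relu` is hypothetical, and the threshold values used with it are arbitrary.

```python
import math


def soft_relu(x: float, threshold: float) -> float:
    """Softplus applied to x clipped to [-threshold, threshold]."""
    clipped = max(min(x, threshold), -threshold)
    return math.log1p(math.exp(clipped))
```

Clipping keeps the argument of `exp` bounded, so the result saturates for inputs outside the threshold range instead of overflowing.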

              softshrink

              Softshrink Activation Operator.

              $$ y = \begin{cases} x - \lambda, & \text{if } x > \lambda \\ x + \lambda, & \text{if } x < -\lambda \\ 0, & \text{otherwise} \end{cases} $$

              Inputs:
              • X : Input of Softshrink operator
              Outputs:
              • Y : Output of Softshrink operator
              Attributes:
              • lambda (Duplicable): non-negative offset
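
The three cases translate directly into a scalar function. This is a sketch only; the names `softshrink` and `lam` are hypothetical (`lambda` is a reserved word in Python).

```python
def softshrink(x: float, lam: float) -> float:
    """Shrink x towards zero by lam; values in [-lam, lam] become 0."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0
```

For instance, with `lam = 0.5` the inputs `1.5`, `-1.5` and `0.3` map to `1.0`, `-1.0` and `0.0` respectively.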

              maxout

              MaxOut Operator.

              Assume the input shape is (N, Ci, H, W) and the output shape is (N, Co, H, W). Then $Co = Ci / groups$ and the operator formula is as follows:

              $$ y_{si+j} = \max_k x_{gsi + sk + j} \\ g = groups \\ s = \frac{input.size}{num\_channels} \\ 0 \le i < \frac{num\_channels}{groups} \\ 0 \le j < s \\ 0 \le k < groups $$

              Please refer to the following papers:
              • Maxout Networks: http://www.jmlr.org/proceedings/papers/v28/goodfellow13.pdf
              • Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks: https://arxiv.org/pdf/1312.6082v4.pdf

              Inputs:
              • X : (Tensor) The input tensor of the maxout operator. The format of the input tensor is NCHW, where N is the batch size, C is the number of channels, and H and W are the height and width of the feature.
              Outputs:
              • Out : (Tensor) The output tensor of the maxout operator. The format of the output tensor is also NCHW, where N is the batch size, C is the number of channels, and H and W are the height and width of the feature.
              Attributes:
              • groups (Duplicable): Specifies how many groups the input tensor will be split into along the channel dimension. The number of output channels is the number of input channels divided by groups.
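
The index arithmetic in the formula can be followed in a small sketch that operates on one sample stored as a flat, channel-major list (all H x W values of channel 0, then channel 1, and so on). The function name `maxout` and the flat-list layout are assumptions made for illustration.

```python
def maxout(x, num_channels, groups):
    """Maxout over consecutive groups of channels for a single sample.

    x: flat list of length num_channels * s, channel-major.
    Returns a flat list of length (num_channels // groups) * s.
    """
    if num_channels % groups != 0:
        raise ValueError("num_channels must be divisible by groups")
    s = len(x) // num_channels  # number of spatial positions per channel
    out_channels = num_channels // groups
    y = [0.0] * (out_channels * s)
    for i in range(out_channels):
        for j in range(s):
            # Output channel i takes the maximum over input channels
            # groups * i, ..., groups * i + groups - 1 at position j.
            y[s * i + j] = max(x[groups * s * i + s * k + j] for k in range(groups))
    return y
```

With 4 input channels of 2 positions each and `groups = 2`, the first output channel is the element-wise maximum of input channels 0 and 1, and the second is the maximum of channels 2 and 3.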

              ftrl

              FTRL (Follow The Regularized Leader) Operator.

              Optimizer that implements the FTRL algorithm:

              $$ new\_accum = squared\_accum + grad^2 \\ linear\_accum \mathrel{+}= grad - \frac{new\_accum^{-lr\_power} - squared\_accum^{-lr\_power}}{learning\_rate} * param \\ x = l1 * sign(linear\_accum) - linear\_accum \\ y = \frac{new\_accum^{-lr\_power}}{learning\_rate} + 2 * l2 \\ param = \begin{cases} \frac{x}{y}, & \text{if } |linear\_accum| > l1 \\ 0, & \text{otherwise} \end{cases} \\ squared\_accum = new\_accum $$

              When $lr\_power = -0.5$, $new\_accum^{-lr\_power}$ equals $\sqrt{new\_accum}$, so the operator uses square roots directly in that case.

              The paper that proposed Follow The Regularized Leader (FTRL): (https://www.eecs.tufts.edu/~dsculley/papers/ad-click-prediction.pdf)

              Inputs:
              • Param : (Tensor, default Tensor<float>) Input parameter value that has to be updated.
              • SquaredAccumulator : (Tensor, default Tensor<float>) Accumulator that accumulates squared gradients.
              • LinearAccumulator : (Tensor, default Tensor<float>) Accumulator that accumulates linear gradients.
              • Grad : (Tensor, default Tensor<float>) Input gradient of the parameter.
              • LearningRate : (Tensor, default Tensor<float>) The learning rate should be a tensor of size 1.
              Outputs:
              • ParamOut : (Tensor) Output updated parameter value.
              • SquaredAccumOut : (Tensor) Output accumulated squared gradients.
              • LinearAccumOut : (Tensor) Output accumulated linear gradients.
              Attributes:
              • l1 (Duplicable): (float, default 0.0) L1 regularization strength.
              • l2 (Duplicable): (float, default 0.0) L2 regularization strength.
              • lr_power (Duplicable): (float, default -0.5) Learning rate power.
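
One update step can be sketched for a single scalar parameter as follows. The sketch assumes the FTRL-Proximal form in which the accumulator difference is divided by the learning rate and then multiplied by the parameter. The function name `ftrl_step` is hypothetical, and the operator itself works element-wise on tensors.

```python
import math


def ftrl_step(param, squared_accum, linear_accum, grad, learning_rate,
              l1=0.0, l2=0.0, lr_power=-0.5):
    """Return (param_out, squared_accum_out, linear_accum_out)."""
    new_accum = squared_accum + grad * grad
    if lr_power == -0.5:
        # Special case: accum ** (-lr_power) is a square root.
        new_pow = math.sqrt(new_accum)
        old_pow = math.sqrt(squared_accum)
    else:
        new_pow = new_accum ** (-lr_power)
        old_pow = squared_accum ** (-lr_power)

    linear_out = linear_accum + grad - (new_pow - old_pow) / learning_rate * param
    x = l1 * math.copysign(1.0, linear_out) - linear_out
    y = new_pow / learning_rate + 2.0 * l2
    # L1 regularization zeroes the parameter when the accumulated
    # linear term is small in magnitude.
    param_out = x / y if abs(linear_out) > l1 else 0.0
    return param_out, new_accum, linear_out
```

With the default attributes, `ftrl_step(1.0, 0.0, 0.0, 2.0, 0.5)` returns `(0.5, 4.0, -2.0)`. Passing `l1=5.0` instead returns a parameter of `0.0`, because the magnitude of the linear accumulator does not exceed `l1`.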

              round

              Round Activation Operator.

              $y = [x]$

              Inputs:
              • X : Input of Round operator
              Outputs:
              • Y : Output of Round operator

              softsign

              Softsign Activation Operator.

              $$y = \frac{x}{1 + |x|}$$

              Inputs:
              • X : Input of Softsign operator
              Outputs:
              • Y : Output of Softsign operator