# Networks

The v2.networks module contains prebuilt neural network components, each of which combines multiple layers.

## NLP

### sequence_conv_pool

paddle.v2.networks.sequence_conv_pool(*args, **kwargs)

Text convolution pooling group.

Text input => Context Projection => FC Layer => Pooling => Output.

Parameters:

• name (basestring) – group name.
• input (LayerOutput) – input layer.
• context_len (int) – context projection length. See the documentation of context_projection.
• hidden_size (int) – FC layer size.
• context_start (int|None) – context start position. See context_start in context_projection.
• pool_type (BasePoolingType) – pooling layer type. See the documentation of pooling_layer.
• context_proj_layer_name (basestring) – context projection layer name. None if the user does not care.
• context_proj_param_attr (ParameterAttribute|None) – padding parameter attribute of the context projection layer. If False, the padding is always zero.
• fc_layer_name (basestring) – FC layer name. None if the user does not care.
• fc_param_attr (ParameterAttribute|None) – FC layer parameter attribute. None if the user does not care.
• fc_bias_attr (ParameterAttribute|False|None) – FC bias parameter attribute. False means no bias; None if the user does not care.
• fc_act (BaseActivation) – FC layer activation type. None means tanh.
• pool_bias_attr (ParameterAttribute|False|None) – pooling layer bias attribute. False means no bias; None if the user does not care.
• fc_attr (ExtraLayerAttribute) – FC layer extra attribute.
• context_attr (ExtraLayerAttribute) – context projection layer extra attribute.
• pool_attr (ExtraLayerAttribute) – pooling layer extra attribute.

Returns: layer’s output.

Return type: LayerOutput
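The pipeline above (context projection, then a fully connected layer, then pooling over the sequence) can be sketched in plain Python. This is a minimal illustration, not PaddlePaddle's implementation: the helper names, the explicitly passed context_start, and the choice of max pooling are assumptions made for the example.

```python
import math


def context_projection(seq, context_len, context_start):
    """Concatenate context_len neighbouring vectors for each position.

    Positions that fall outside the sequence are zero-padded (assumed here).
    """
    dim = len(seq[0])
    zero = [0.0] * dim
    projected = []
    for t in range(len(seq)):
        window = []
        for k in range(context_len):
            pos = t + context_start + k
            window.extend(seq[pos] if 0 <= pos < len(seq) else zero)
        projected.append(window)
    return projected


def fc_tanh(vec, weights, bias):
    """Fully connected layer with tanh; weights has one row per output unit."""
    return [math.tanh(sum(w * x for w, x in zip(row, vec)) + b)
            for row, b in zip(weights, bias)]


def max_pool(seq):
    """Max over time steps, taken separately for each feature."""
    return [max(column) for column in zip(*seq)]


def sequence_conv_pool_sketch(seq, context_len, context_start, weights, bias):
    projected = context_projection(seq, context_len, context_start)
    hidden = [fc_tanh(vec, weights, bias) for vec in projected]
    return max_pool(hidden)


# Three time steps with two features each; a context of 3 centred on each step.
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [[0.1] * 6, [-0.1] * 6]  # hidden_size x (context_len * feature_dim)
bias = [0.0, 0.0]
pooled = sequence_conv_pool_sketch(sequence, 3, -1, weights, bias)
```

The result has one value per hidden unit, regardless of the sequence length, which is what makes the group usable as a fixed-size text representation.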

### text_conv_pool

paddle.v2.networks.text_conv_pool(*args, **kwargs)

Text convolution pooling group.

Text input => Context Projection => FC Layer => Pooling => Output.

Parameters:

• name (basestring) – group name.
• input (LayerOutput) – input layer.
• context_len (int) – context projection length. See the documentation of context_projection.
• hidden_size (int) – FC layer size.
• context_start (int|None) – context start position. See context_start in context_projection.
• pool_type (BasePoolingType) – pooling layer type. See the documentation of pooling_layer.
• context_proj_layer_name (basestring) – context projection layer name. None if the user does not care.
• context_proj_param_attr (ParameterAttribute|None) – padding parameter attribute of the context projection layer. If False, the padding is always zero.
• fc_layer_name (basestring) – FC layer name. None if the user does not care.
• fc_param_attr (ParameterAttribute|None) – FC layer parameter attribute. None if the user does not care.
• fc_bias_attr (ParameterAttribute|False|None) – FC bias parameter attribute. False means no bias; None if the user does not care.
• fc_act (BaseActivation) – FC layer activation type. None means tanh.
• pool_bias_attr (ParameterAttribute|False|None) – pooling layer bias attribute. False means no bias; None if the user does not care.
• fc_attr (ExtraLayerAttribute) – FC layer extra attribute.
• context_attr (ExtraLayerAttribute) – context projection layer extra attribute.
• pool_attr (ExtraLayerAttribute) – pooling layer extra attribute.

Returns: layer’s output.

Return type: LayerOutput

## Images

### img_conv_bn_pool

paddle.v2.networks.img_conv_bn_pool(*args, **kwargs)

Convolution, batch normalization, pooling group.

Img input => Conv => BN => Pooling => Output.

Parameters:

• name (basestring) – group name.
• input (LayerOutput) – input layer.
• filter_size (int) – see img_conv_layer for details.
• num_filters (int) – see img_conv_layer for details.
• pool_size (int) – see img_pool_layer for details.
• pool_type (BasePoolingType) – see img_pool_layer for details.
• act (BaseActivation) – see batch_norm_layer for details.
• groups (int) – see img_conv_layer for details.
• conv_stride (int) – see img_conv_layer for details.
• conv_padding (int) – see img_conv_layer for details.
• conv_bias_attr (ParameterAttribute) – see img_conv_layer for details.
• num_channel (int) – see img_conv_layer for details.
• conv_param_attr (ParameterAttribute) – see img_conv_layer for details.
• shared_bias (bool) – see img_conv_layer for details.
• conv_layer_attr (ExtraLayerAttribute) – see img_conv_layer for details.
• bn_param_attr (ParameterAttribute) – see batch_norm_layer for details.
• bn_bias_attr (ParameterAttribute) – see batch_norm_layer for details.
• bn_layer_attr (ExtraLayerAttribute) – see batch_norm_layer for details.
• pool_stride (int) – see img_pool_layer for details.
• pool_padding (int) – see img_pool_layer for details.
• pool_layer_attr (ExtraLayerAttribute) – see img_pool_layer for details.

Returns: layer’s output.

Return type: LayerOutput

### img_conv_group

paddle.v2.networks.img_conv_group(*args, **kwargs)

Image convolution group, used for VGG nets.

Parameters:

• conv_batchnorm_drop_rate (list) – if conv_with_batchnorm[i] is true, conv_batchnorm_drop_rate[i] is the drop rate of the corresponding batch norm.
• input (LayerOutput) – input layer.
• conv_num_filter (list|tuple) – list of output channel numbers.
• pool_size (int) – pooling filter size.
• num_channels (int) – number of input channels.
• conv_padding (int) – convolution padding size.
• conv_filter_size (int) – convolution filter size.
• conv_act (BaseActivation) – activation function after convolution.
• conv_with_batchnorm (list) – if conv_with_batchnorm[i] is true, a batch normalization operation follows the i-th convolution.
• pool_stride (int) – pooling stride size.
• pool_type (BasePoolingType) – pooling type.
• param_attr (ParameterAttribute) – parameter attribute of the convolution layer. None means the default attribute.

Returns: layer’s output.

Return type: LayerOutput

### simple_img_conv_pool

paddle.v2.networks.simple_img_conv_pool(*args, **kwargs)

Simple image convolution and pooling group.

Img input => Conv => Pooling => Output.

Parameters:

• name (basestring) – group name.
• input (LayerOutput) – input layer.
• filter_size (int) – see img_conv_layer for details.
• num_filters (int) – see img_conv_layer for details.
• pool_size (int) – see img_pool_layer for details.
• pool_type (BasePoolingType) – see img_pool_layer for details.
• act (BaseActivation) – see img_conv_layer for details.
• groups (int) – see img_conv_layer for details.
• conv_stride (int) – see img_conv_layer for details.
• conv_padding (int) – see img_conv_layer for details.
• bias_attr (ParameterAttribute) – see img_conv_layer for details.
• num_channel (int) – see img_conv_layer for details.
• param_attr (ParameterAttribute) – see img_conv_layer for details.
• shared_bias (bool) – see img_conv_layer for details.
• conv_layer_attr (ExtraLayerAttribute) – see img_conv_layer for details.
• pool_stride (int) – see img_pool_layer for details.
• pool_padding (int) – see img_pool_layer for details.
• pool_layer_attr (ExtraLayerAttribute) – see img_pool_layer for details.

Returns: layer’s output.

Return type: LayerOutput
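To make the "Conv => Pooling" flow concrete, here is a small single-channel sketch in plain Python. It is only an illustration of the data flow: the function names are hypothetical, and the stride-1 unpadded convolution, the ReLU activation and the max pooling are choices assumed for the example, not defaults taken from PaddlePaddle.

```python
def conv2d_valid(image, kernel):
    """Single-channel 2-D convolution (cross-correlation), stride 1, no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def relu(feature_map):
    """Element-wise activation applied to the convolution output."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool2d(feature_map, pool_size, stride):
    """Max pooling over pool_size x pool_size windows."""
    out_h = (len(feature_map) - pool_size) // stride + 1
    out_w = (len(feature_map[0]) - pool_size) // stride + 1
    return [[max(feature_map[i * stride + a][j * stride + b]
                 for a in range(pool_size) for b in range(pool_size))
             for j in range(out_w)]
            for i in range(out_h)]


image = [[1.0, 2.0, 0.0, 1.0],
         [0.0, 1.0, 3.0, 0.0],
         [2.0, 0.0, 1.0, 1.0],
         [1.0, 2.0, 0.0, 4.0]]
kernel = [[1.0, 0.0],
          [0.0, 1.0]]  # one 2x2 filter

conv_out = relu(conv2d_valid(image, kernel))          # 3x3 feature map
pooled = max_pool2d(conv_out, pool_size=2, stride=1)  # 2x2 output
```

A 4x4 input passes through a 2x2 filter to give a 3x3 feature map, which the 2x2 pooling window then reduces to 2x2.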

### vgg_16_network

paddle.v2.networks.vgg_16_network(input_image, num_channels, num_classes=1000)

The same model as https://gist.github.com/ksimonyan/211839e770f7b538e2d8

Parameters:

• num_classes (int) – number of classes.
• input_image (LayerOutput) – input layer.
• num_channels (int) – number of input channels.

Returns: layer’s output.

Return type: LayerOutput

## Recurrent

### LSTM

#### lstmemory_unit

paddle.v2.networks.lstmemory_unit(*args, **kwargs)

lstmemory_unit defines the calculation process of an LSTM unit during a single time step. This function is not a recurrent layer, so it cannot be used directly to process sequence input. It is used inside recurrent_group (see layers.py for more details), for example to implement an attention mechanism.

For more details about LSTM, please refer to Generating Sequences With Recurrent Neural Networks: https://arxiv.org/abs/1308.0850

$$
\begin{aligned}
i_t &= \sigma(W_{x_i} x_t + W_{h_i} h_{t-1} + W_{c_i} c_{t-1} + b_i) \\
f_t &= \sigma(W_{x_f} x_t + W_{h_f} h_{t-1} + W_{c_f} c_{t-1} + b_f) \\
c_t &= f_t c_{t-1} + i_t \tanh(W_{x_c} x_t + W_{h_c} h_{t-1} + b_c) \\
o_t &= \sigma(W_{x_o} x_t + W_{h_o} h_{t-1} + W_{c_o} c_t + b_o) \\
h_t &= o_t \tanh(c_t)
\end{aligned}
$$
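As a reading aid, the equations above can be written out as one time step in plain Python. This is a sketch of the mathematics only, not of how lstmemory_unit is implemented: the parameter dictionary, its key names (such as "W_xi") and the helper functions are hypothetical, and products between gates and states are taken element-wise.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step; p maps illustrative names to weights and biases."""
    def gate(name, c):
        # sigma(W_x x_t + W_h h_{t-1} + W_c c + b) for gate i, f or o
        pre = zip(matvec(p["W_x" + name], x),
                  matvec(p["W_h" + name], h_prev),
                  matvec(p["W_c" + name], c),
                  p["b_" + name])
        return [sigmoid(a + b + d + e) for a, b, d, e in pre]

    i = gate("i", c_prev)
    f = gate("f", c_prev)
    candidate = [math.tanh(a + b + d) for a, b, d in
                 zip(matvec(p["W_xc"], x), matvec(p["W_hc"], h_prev), p["b_c"])]
    c = [fv * cp + iv * cv for fv, cp, iv, cv in zip(f, c_prev, i, candidate)]
    o = gate("o", c)  # the output gate sees the new cell state c_t
    h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
    return h, c


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


size, in_dim = 2, 3
params = {}
for g in "ifco":
    params["W_x" + g] = zeros(size, in_dim)
    params["W_h" + g] = zeros(size, size)
    params["b_" + g] = [0.0] * size
for g in "ifo":
    params["W_c" + g] = zeros(size, size)

# With all-zero parameters every gate equals 0.5 and the candidate is 0,
# so the cell state is simply halved.
h, c = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -2.0], params)
```

Note that the output gate uses $c_t$ while the input and forget gates use $c_{t-1}$, matching the fourth equation.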

The example usage is:

lstm_step = lstmemory_unit(input=[layer1],
                           size=256,
                           act=TanhActivation(),
                           gate_act=SigmoidActivation(),
                           state_act=TanhActivation())

Parameters:

• input (LayerOutput) – Input layer.
• out_memory (LayerOutput | None) – The output of the previous time step.
• name (basestring) – The lstmemory unit name.
• size (int) – The lstmemory unit size.
• param_attr (ParameterAttribute) – The parameter attribute for the weights in the input-to-hidden projection. None means the default attribute.
• act (BaseActivation) – The last activation type of the LSTM.
• gate_act (BaseActivation) – The gate activation type of the LSTM.
• state_act (BaseActivation) – The state activation type of the LSTM.
• input_proj_bias_attr (ParameterAttribute|bool|None) – The parameter attribute for the bias in the input-to-hidden projection. False or None means no bias. If this parameter is set to True, the bias is initialized to zero.
• input_proj_layer_attr (ExtraLayerAttribute) – The extra layer attribute for the input-to-hidden projection of the LSTM unit, such as dropout or error clipping.
• lstm_bias_attr (ParameterAttribute|bool|None) – The parameter attribute for the bias in the LSTM layer. False or None means no bias. If this parameter is set to True, the bias is initialized to zero.
• lstm_layer_attr (ExtraLayerAttribute) – The extra attribute of the LSTM layer.

Returns: The lstmemory unit.

Return type: LayerOutput

#### lstmemory_group

paddle.v2.networks.lstmemory_group(*args, **kwargs)

lstmemory_group is a recurrent_group version of Long Short-Term Memory. It does exactly the same calculation as the lstmemory layer (see lstmemory in layers.py for the maths). Its benefit is that the LSTM memory cell states (or hidden states) at every time step are accessible to the user, which is especially useful in attention models. If you do not need to access the internal states of the LSTM and only use its outputs, lstmemory is recommended, as it is relatively faster than lstmemory_group.

NOTE: In PaddlePaddle’s implementation, the input-to-hidden multiplications $W_{x_i}x_{t}$, $W_{x_f}x_{t}$, $W_{x_c}x_t$ and $W_{x_o}x_{t}$ are not done in lstmemory_unit, in order to speed up the calculations. Consequently, an additional mixed_layer with full_matrix_projection must be included before lstmemory_unit is called.

The example usage is:

lstm_step = lstmemory_group(input=[layer1],
                            size=256,
                            act=TanhActivation(),
                            gate_act=SigmoidActivation(),
                            state_act=TanhActivation())

Parameters:

• input (LayerOutput) – Input layer.
• size (int) – The lstmemory group size.
• name (basestring) – The name of the lstmemory group.
• out_memory (LayerOutput | None) – The output of the previous time step.
• reverse (bool) – Whether to process the input in reverse order.
• param_attr (ParameterAttribute) – The parameter attribute for the weights in the input-to-hidden projection. None means the default attribute.
• act (BaseActivation) – The last activation type of the LSTM.
• gate_act (BaseActivation) – The gate activation type of the LSTM.
• state_act (BaseActivation) – The state activation type of the LSTM.
• input_proj_bias_attr (ParameterAttribute|bool|None) – The parameter attribute for the bias in the input-to-hidden projection. False or None means no bias. If this parameter is set to True, the bias is initialized to zero.
• input_proj_layer_attr (ExtraLayerAttribute) – The extra layer attribute for the input-to-hidden projection of the LSTM unit, such as dropout or error clipping.
• lstm_bias_attr (ParameterAttribute|bool|None) – The parameter attribute for the bias in the LSTM layer. False or None means no bias. If this parameter is set to True, the bias is initialized to zero.
• lstm_layer_attr (ExtraLayerAttribute) – The extra attribute of the LSTM layer.

Returns: The lstmemory group.

Return type: LayerOutput

#### simple_lstm

paddle.v2.networks.simple_lstm(*args, **kwargs)

Simple LSTM Cell.

It simply combines a mixed layer with full_matrix_projection and an lstmemory layer. The simple LSTM cell implements the following equations:

$$
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f) \\
c_t &= f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o) \\
h_t &= o_t \tanh(c_t)
\end{aligned}
$$

For more details about LSTM, please refer to Generating Sequences With Recurrent Neural Networks: https://arxiv.org/abs/1308.0850

Parameters:

• name (basestring) – lstm layer name.
• input (LayerOutput) – layer’s input.
• size (int) – lstm layer size.
• reverse (bool) – whether to process the input in reverse order.
• mat_param_attr (ParameterAttribute) – parameter attribute of the matrix projection in the mixed layer.
• bias_param_attr (ParameterAttribute|False) – bias parameter attribute. False means no bias, None means default bias.
• inner_param_attr (ParameterAttribute) – parameter attribute of the lstm cell.
• act (BaseActivation) – last activation type of lstm.
• gate_act (BaseActivation) – gate activation type of lstm.
• state_act (BaseActivation) – state activation type of lstm.
• mixed_layer_attr (ExtraLayerAttribute) – extra attribute of the mixed layer.
• lstm_cell_attr (ExtraLayerAttribute) – extra attribute of lstm.

Returns: layer’s output.

Return type: LayerOutput

#### bidirectional_lstm

paddle.v2.networks.bidirectional_lstm(*args, **kwargs)

A bidirectional_lstm is a recurrent unit that iterates over the input sequence in both forward and backward order, and then concatenates the two outputs to form the final output. Concatenation is not the only way to form the final output; you can also, for example, add the two outputs together.

For more details about the bidirectional LSTM, please refer to Neural Machine Translation by Jointly Learning to Align and Translate: https://arxiv.org/pdf/1409.0473v3.pdf

The example usage is:

bi_lstm = bidirectional_lstm(input=[input1], size=512)

Parameters:

• name (basestring) – bidirectional lstm layer name.
• input (LayerOutput) – input layer.
• size (int) – lstm layer size.
• return_seq (bool) – If False, the last time steps of the forward and backward outputs are concatenated and returned. If True, the entire output sequences in the forward and backward directions are concatenated and returned.

Returns: LayerOutput object.

Return type: LayerOutput
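The forward/backward combination and the effect of return_seq can be illustrated with a toy recurrent step in plain Python. The sketch makes several assumptions that are not taken from PaddlePaddle: the step function is a simple decaying sum rather than an LSTM, all names are hypothetical, and for return_seq=False the backward direction is assumed to contribute its final state, which is aligned with the first input position.

```python
def run_rnn(seq, step, h0, reverse=False):
    """Run a recurrent step over seq; outputs are returned in input order."""
    order = reversed(seq) if reverse else seq
    h, outputs = h0, []
    for x in order:
        h = step(x, h)
        outputs.append(h)
    if reverse:
        outputs.reverse()
    return outputs


def bidirectional(seq, fw_step, bw_step, h0, return_seq=False):
    fw = run_rnn(seq, fw_step, h0)
    bw = run_rnn(seq, bw_step, h0, reverse=True)
    if return_seq:
        # Concatenate the two directions at every time step.
        return [f + b for f, b in zip(fw, bw)]
    # Concatenate only the final state of each direction.
    return fw[-1] + bw[0]


def toy_step(x, h):
    """Stand-in for an LSTM step: a decaying running sum (size 1)."""
    return [0.5 * h[0] + x[0]]


inputs = [[1.0], [2.0], [3.0]]
full = bidirectional(inputs, toy_step, toy_step, [0.0], return_seq=True)
last = bidirectional(inputs, toy_step, toy_step, [0.0], return_seq=False)
```

Replacing the concatenation `f + b` with an element-wise sum would give the "add them together" alternative mentioned in the description.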

### GRU

#### gru_unit

paddle.v2.networks.gru_unit(*args, **kwargs)

gru_unit defines the calculation process of a gated recurrent unit during a single time step. This function is not a recurrent layer, so it cannot be used directly to process sequence input. It is used inside recurrent_group (see layers.py for more details), for example to implement an attention mechanism.

Please see grumemory in layers.py for the details about the maths.

Parameters:

• input (LayerOutput) – input layer.
• memory_boot (LayerOutput | None) – the initialization state of the GRU cell.
• name (basestring) – name of the gru group.
• size (int) – hidden size of the gru.
• act (BaseActivation) – activation type of gru.
• gate_act (BaseActivation) – gate activation type of gru.
• gru_layer_attr (ExtraLayerAttribute) – extra attribute of the gru layer.

Returns: the gru output layer.

Return type: LayerOutput

#### gru_group

paddle.v2.networks.gru_group(*args, **kwargs)

gru_group is a recurrent_group version of the Gated Recurrent Unit. It does exactly the same calculation as the grumemory layer. Its benefit is that the GRU hidden states are accessible to the user, which is especially useful in attention models. If you do not need to access any internal state and only use the outputs of a GRU, grumemory is recommended, as it is relatively faster.

Please see grumemory in layers.py for more detail about the maths.

The example usage is:

gru = gru_group(input=[layer1],
                size=256,
                act=TanhActivation(),
                gate_act=SigmoidActivation())

Parameters:

• input (LayerOutput) – input layer.
• memory_boot (LayerOutput | None) – the initialization state of the GRU cell.
• name (basestring) – name of the gru group.
• size (int) – hidden size of the gru.
• reverse (bool) – whether to process the input in reverse order.
• act (BaseActivation) – activation type of gru.
• gate_act (BaseActivation) – gate activation type of gru.
• gru_bias_attr (ParameterAttribute|False|None) – bias parameter attribute of the gru layer. False means no bias, None means default bias.
• gru_layer_attr (ExtraLayerAttribute) – extra attribute of the gru layer.

Returns: the gru group.

Return type: LayerOutput

#### simple_gru

paddle.v2.networks.simple_gru(*args, **kwargs)

You may see gru_step_layer and grumemory in layers.py, and gru_unit, gru_group and simple_gru in networks.py. There are so many interfaces because there are two ways to implement a recurrent neural network. One way is to use one complete layer to implement the RNN (including simple RNN, GRU and LSTM) over multiple time steps, such as recurrent_layer, lstmemory and grumemory. However, the multiplication $W x_t$ is not computed in these layers; see their interfaces in layers.py for details. The other way is to use a recurrent group, which can assemble a series of layers to compute the RNN step by step. This way is flexible for attention mechanisms or other complex connections.

• gru_step_layer: computes only one step of the RNN. It needs a memory as input and can be used in a recurrent group.
• gru_unit: a wrapper of gru_step_layer with memory.
• gru_group: a GRU cell implemented by a combination of multiple layers in a recurrent group. $W x_t$ is not computed in the group.
• grumemory: a GRU cell implemented by one layer, which does the same calculation as gru_group and is faster than gru_group.
• simple_gru: a complete GRU implementation including $W x_t$ and gru_group. $W$ contains $W_r$, $W_z$ and $W$; see the formula in grumemory.

In terms of computational speed, grumemory is relatively faster than gru_group, and gru_group is relatively faster than simple_gru.
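The division of labour described above, where $W x_t$ is computed outside the recurrent step, can be shown with a scalar GRU in plain Python. This is a sketch under stated assumptions: it uses one common GRU formulation (the exact formula is the one documented for grumemory), scalar weights for brevity, and hypothetical function and key names. It only demonstrates that projecting the inputs once for the whole sequence gives the same result as projecting inside every step.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def project(x, w):
    """The input-to-hidden multiplication W x_t for the three weight blocks."""
    return (w["z"] * x, w["r"] * x, w["c"] * x)


def gru_step(proj, h_prev, u):
    """One GRU step that receives the already projected input."""
    xz, xr, xc = proj
    z = sigmoid(xz + u["z"] * h_prev)             # update gate
    r = sigmoid(xr + u["r"] * h_prev)             # reset gate
    cand = math.tanh(xc + u["c"] * (r * h_prev))  # candidate state
    return (1.0 - z) * h_prev + z * cand


def gru_projected_outside(xs, w, u, h0=0.0):
    projections = [project(x, w) for x in xs]  # done once for the sequence
    h, out = h0, []
    for proj in projections:
        h = gru_step(proj, h, u)
        out.append(h)
    return out


def gru_projected_inside(xs, w, u, h0=0.0):
    h, out = h0, []
    for x in xs:
        h = gru_step(project(x, w), h, u)  # projection repeated in each step
        out.append(h)
    return out


w = {"z": 0.5, "r": -0.3, "c": 0.8}  # input-to-hidden weights
u = {"z": 0.2, "r": 0.1, "c": -0.4}  # hidden-to-hidden weights
xs = [1.0, -2.0, 0.5]
outside = gru_projected_outside(xs, w, u)
inside = gru_projected_inside(xs, w, u)
```

Because the projection does not depend on the hidden state, it can be batched over all time steps, which is why the step-level interfaces leave it to the caller.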

The example usage is:

gru = simple_gru(input=[layer1], size=256)

Parameters:

• input (LayerOutput) – input layer.
• name (basestring) – name of the gru group.
• size (int) – hidden size of the gru.
• reverse (bool) – whether to process the input in reverse order.
• act (BaseActivation) – activation type of gru.
• gate_act (BaseActivation) – gate activation type of gru.
• gru_bias_attr (ParameterAttribute|False|None) – bias parameter attribute of the gru layer. False means no bias, None means default bias.
• gru_layer_attr (ExtraLayerAttribute) – extra attribute of the gru layer.

Returns: the gru group.

Return type: LayerOutput

#### simple_gru2

paddle.v2.networks.simple_gru2(*args, **kwargs)

simple_gru2 is the same as simple_gru, but uses grumemory instead. Please refer to grumemory in layers.py for more details about the maths. simple_gru2 is faster than simple_gru.

The example usage is:

gru = simple_gru2(input=[layer1], size=256)

Parameters:

• input (LayerOutput) – input layer.
• name (basestring) – name of the gru group.
• size (int) – hidden size of the gru.
• reverse (bool) – whether to process the input in reverse order.
• act (BaseActivation) – activation type of gru.
• gate_act (BaseActivation) – gate activation type of gru.
• gru_bias_attr (ParameterAttribute|False|None) – bias parameter attribute of the gru layer. False means no bias, None means default bias.
• gru_param_attr (ParameterAttribute|None) – parameter attribute of the gru layer. None means the default parameter attribute.

Returns: the gru group.

Return type: LayerOutput

#### bidirectional_gru

paddle.v2.networks.bidirectional_gru(*args, **kwargs)

A bidirectional_gru is a recurrent unit that iterates over the input sequence in both forward and backward order, and then concatenates the two outputs to form the final output. Concatenation is not the only way to form the final output; you can also, for example, add the two outputs together.

The example usage is:

bi_gru = bidirectional_gru(input=[input1], size=512)

Parameters:

• name (basestring) – bidirectional gru layer name.
• input (LayerOutput) – input layer.
• size (int) – gru layer size.
• return_seq (bool) – If False, the last time steps of the forward and backward outputs are concatenated and returned. If True, the entire output sequences in the forward and backward directions are concatenated and returned.

Returns: LayerOutput object.

Return type: LayerOutput

### simple_attention

paddle.v2.networks.simple_attention(*args, **kwargs)

Calculate and return a context vector using an attention mechanism. The size of the context vector equals the size of encoded_sequence.

$$
\begin{aligned}
a(s_{i-1}, h_j) &= v_a f(W_a s_{i-1} + U_a h_j) \\
e_{i,j} &= a(s_{i-1}, h_j) \\
a_{i,j} &= \frac{\exp(e_{i,j})}{\sum_{k=1}^{T_x} \exp(e_{i,k})} \\
c_i &= \sum_{j=1}^{T_x} a_{i,j} h_j
\end{aligned}
$$

where $h_{j}$ is the jth element of encoded_sequence, $U_{a}h_{j}$ is the jth element of encoded_proj, $s_{i-1}$ is decoder_state, and $f$ is weight_act, which is set to tanh by default.

Please refer to Neural Machine Translation by Jointly Learning to Align and Translate for more details. The link is as follows: https://arxiv.org/abs/1409.0473.
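The four equations above can be followed step by step in a plain-Python sketch. It is an illustration of the formulas rather than of the layer's implementation: the function names and the explicit w_a and v_a arguments are hypothetical, and encoded_proj is passed in already computed, mirroring the pre-computation described for this function.

```python
import math


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def simple_attention_sketch(encoded_sequence, encoded_proj, decoder_state,
                            w_a, v_a):
    """Return (context vector c_i, attention weights a_ij)."""
    state_proj = matvec(w_a, decoder_state)  # W_a s_{i-1}
    # e_ij = v_a . tanh(W_a s_{i-1} + U_a h_j)
    scores = [sum(v * math.tanh(sp + ep)
                  for v, sp, ep in zip(v_a, state_proj, proj_j))
              for proj_j in encoded_proj]
    weights = softmax(scores)                # a_ij
    dim = len(encoded_sequence[0])
    context = [sum(w * h_j[d] for w, h_j in zip(weights, encoded_sequence))
               for d in range(dim)]          # c_i = sum_j a_ij h_j
    return context, weights


enc_seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
enc_proj = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # U_a taken as the identity
identity = [[1.0, 0.0], [0.0, 1.0]]
context, weights = simple_attention_sketch(
    enc_seq, enc_proj, [0.5, -0.5], identity, [1.0, 1.0])
```

The context vector has the same size as each element of the encoded sequence, and the attention weights always sum to one.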

The example usage is:

context = simple_attention(encoded_sequence=enc_seq,
                           encoded_proj=enc_proj,
                           decoder_state=decoder_prev)

Parameters:

• name (basestring) – name of the attention model.
• softmax_param_attr (ParameterAttribute) – parameter attribute of the sequence softmax that is used to produce the attention weight.
• weight_act (BaseActivation) – activation of the attention model.
• encoded_sequence (LayerOutput) – output of the encoder.
• encoded_proj (LayerOutput) – the attention weight is computed by a feed-forward neural network which has two inputs: the decoder’s hidden state of the previous time step and the encoder’s output. encoded_proj is the output of the feed-forward network for the encoder’s output. It is pre-computed outside simple_attention for speed.
• decoder_state (LayerOutput) – hidden state of the decoder in the previous time step.
• transform_param_attr (ParameterAttribute) – parameter attribute of the feed-forward network that takes decoder_state as input to compute the attention weight.

Returns: a context vector.

Return type: LayerOutput

### dot_product_attention

paddle.v2.networks.dot_product_attention(*args, **kwargs)

Calculate and return a context vector using a dot-product attention mechanism. The dimension of the context vector equals that of attended_sequence.

$$
\begin{aligned}
a(s_{i-1}, h_j) &= s_{i-1}^\mathrm{T} h_j \\
e_{i,j} &= a(s_{i-1}, h_j) \\
a_{i,j} &= \frac{\exp(e_{i,j})}{\sum_{k=1}^{T_x} \exp(e_{i,k})} \\
c_i &= \sum_{j=1}^{T_x} a_{i,j} z_j
\end{aligned}
$$

where $h_{j}$ is the jth element of encoded_sequence, $z_{j}$ is the jth element of attended_sequence, $s_{i-1}$ is transformed_state.
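A plain-Python sketch of these equations follows. It is meant only to make the roles of the three inputs concrete; the function names are hypothetical and the code is not PaddlePaddle's implementation. The scores come from encoded_sequence, while the weighted sum is taken over attended_sequence, so the two sequences may have different dimensions.

```python
import math


def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product_attention_sketch(encoded_sequence, attended_sequence,
                                 transformed_state):
    """Return (context vector c_i, attention weights a_ij)."""
    # e_ij = s_{i-1}^T h_j, so the state and h_j must have equal dimensions.
    scores = [sum(s * h for s, h in zip(transformed_state, h_j))
              for h_j in encoded_sequence]
    weights = softmax(scores)                # a_ij
    dim = len(attended_sequence[0])
    context = [sum(w * z_j[d] for w, z_j in zip(weights, attended_sequence))
               for d in range(dim)]          # c_i = sum_j a_ij z_j
    return context, weights


enc_seq = [[1.0, 0.0], [0.0, 1.0]]            # h_j, dimension 2
att_seq = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]  # z_j, dimension 3
context, weights = dot_product_attention_sketch(enc_seq, att_seq, [10.0, 0.0])
```

With a state that aligns strongly with the first encoded vector, almost all of the weight goes to the first element, and the context vector is close to the first element of attended_sequence.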

The example usage is:

context = dot_product_attention(encoded_sequence=enc_seq,
                                attended_sequence=att_seq,
                                transformed_state=state)

Parameters:

• name (basestring) – A prefix attached to the name of each layer defined inside dot_product_attention.
• softmax_param_attr (ParameterAttribute) – The parameter attribute of the sequence softmax that is used to produce the attention weight.
• encoded_sequence (LayerOutput) – The output hidden vectors of the encoder.
• attended_sequence (LayerOutput) – The attention weight is computed by a feed-forward neural network which has two inputs: the decoder’s transformed hidden state of the previous time step and the encoder’s output. attended_sequence is the sequence to be attended.
• transformed_state (LayerOutput) – The transformed hidden state of the decoder in the previous time step. Since the dot-product operation is performed on it and encoded_sequence, their dimensions must be equal. For flexibility, transformations of the decoder’s hidden state are assumed to have been done outside dot_product_attention, and no more are performed inside. Users can then pass either the original or the transformed state.

Returns: The context vector.

Return type: LayerOutput