API comparison between RNN and hierarchical RNN

This article takes PaddlePaddle’s hierarchical RNN unit tests as an example. We will use several examples to illustrate the usage of single-layer and hierarchical RNNs. Each example has two model configurations, one for a single-layer RNN and the other for a hierarchical RNN. Although the implementations differ, the two model configurations are semantically equivalent. All of the examples in this article only describe the API of the hierarchical RNN; we do not use the hierarchical RNN to solve practical problems. If you want to understand the use of hierarchical RNNs on specific problems, please refer to algo_hrnn_demo. The unit test file used in this article’s examples is test_RecurrentGradientMachine.cpp.

Example 1: Hierarchical RNN without Memory between subsequences

The classical use case of the hierarchical RNN is to perform sequence operations on each inner-layer time series separately. The sequence operations in the inner layers are independent of each other, that is, they do not need to use Memory.

In this example, the network configurations of the single-layer and hierarchical RNNs both use an LSTM as an encoder to compress a word-segmented sentence into a vector. The difference is that the hierarchical RNN treats multiple sentences as a whole, encoding them simultaneously. The two configurations are completely consistent in their semantic meanings. This pair of semantically identical example configurations is as follows:

Reading hierarchical sequence data

Firstly, the original data in this example is as follows:

  • The original data in this example has 10 samples. Each sample includes two components: a label (all 2 here) and a word-segmented sentence. This data is used by the single-layer RNN as well.
2  	酒店 有 很 舒适 的 床垫 子 , 床上用品 也 应该 是 一人 一 换 , 感觉 很 利落 对 卫生 很 放心 呀 。
2  	很 温馨 , 也 挺 干净 的 * 地段 不错 , 出来 就 有 全家 , 离 地铁站 也 近 , 交通 很方便 * 就是 都 不 给 刷牙 的 杯子 啊 , 就 第一天 给 了 一次性杯子 *
2  	位置 方便 , 强烈推荐 , 十一 出去玩 的 时候 选 的 , 对面 就是 华润万家 , 周围 吃饭 的 也 不少 。
2  	交通便利 , 吃 很 便利 , 乾 浄 、 安静 , 商务 房 有 电脑 、 上网 快 , 价格 可以 , 就 早餐 不 好吃 。 整体 是 不错 的 。 適 合 出差 來 住 。
2  	本来 准备 住 两 晚 , 第 2 天 一早 居然 停电 , 且 无 通知 , 只有 口头 道歉 。 总体来说 性价比 尚可 , 房间 较 新 , 还是 推荐 .
2  	这个 酒店 去过 很多 次 了 , 选择 的 主要原因 是 离 客户 最 便宜 相对 又 近 的 酒店
2  	挺好 的 汉庭 , 前台 服务 很 热情 , 卫生 很 整洁 , 房间 安静 , 水温 适中 , 挺好 !
2  	HowardJohnson 的 品质 , 服务 相当 好 的 一 家 五星级 。 房间 不错 、 泳池 不错 、 楼层 安排 很 合理 。 还有 就是 地理位置 , 简直 一 流 。 就 在 天一阁 、 月湖 旁边 , 离 天一广场 也 不远 。 下次 来 宁波 还会 住 。
2  	酒店 很干净 , 很安静 , 很 温馨 , 服务员 服务 好 , 各方面 都 不错 *
2  	挺好 的 , 就是 没 窗户 , 不过 对 得 起 这 价格
  • The data for the hierarchical RNN has 4 samples. The samples are separated by blank lines, while the content of the data is the same as the original data. As for the hierarchical LSTM, the first sample will simultaneously encode two sentences into two vectors. The numbers of sentences processed simultaneously by these 4 samples are [2, 3, 2, 3].
2  	酒店 有 很 舒适 的 床垫 子 , 床上用品 也 应该 是 一人 一 换 , 感觉 很 利落 对 卫生 很 放心 呀 。
2  	很 温馨 , 也 挺 干净 的 * 地段 不错 , 出来 就 有 全家 , 离 地铁站 也 近 , 交通 很方便 * 就是 都 不 给 刷牙 的 杯子 啊 , 就 第一天 给 了 一次性杯子 *

2  	位置 方便 , 强烈推荐 , 十一 出去玩 的 时候 选 的 , 对面 就是 华润万家 , 周围 吃饭 的 也 不少 。
2  	交通便利 , 吃 很 便利 , 乾 浄 、 安静 , 商务 房 有 电脑 、 上网 快 , 价格 可以 , 就 早餐 不 好吃 。 整体 是 不错 的 。 適 合 出差 來 住 。
2  	本来 准备 住 两 晚 , 第 2 天 一早 居然 停电 , 且 无 通知 , 只有 口头 道歉 。 总体来说 性价比 尚可 , 房间 较 新 , 还是 推荐 .

2  	这个 酒店 去过 很多 次 了 , 选择 的 主要原因 是 离 客户 最 便宜 相对 又 近 的 酒店
2  	挺好 的 汉庭 , 前台 服务 很 热情 , 卫生 很 整洁 , 房间 安静 , 水温 适中 , 挺好 !

2  	HowardJohnson 的 品质 , 服务 相当 好 的 一 家 五星级 。 房间 不错 、 泳池 不错 、 楼层 安排 很 合理 。 还有 就是 地理位置 , 简直 一 流 。 就 在 天一阁 、 月湖 旁边 , 离 天一广场 也 不远 。 下次 来 宁波 还会 住 。
2  	酒店 很干净 , 很安静 , 很 温馨 , 服务员 服务 好 , 各方面 都 不错 *
2  	挺好 的 , 就是 没 窗户 , 不过 对 得 起 这 价格
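The grouping convention above (samples separated by blank lines) can be parsed with a few lines of plain Python. The following standalone sketch uses placeholder sentences rather than the real data file:

```python
def read_groups(lines):
    """Split blank-line-separated lines into groups of (label, words) pairs."""
    groups, current = [], []
    for line in lines:
        if line.strip():
            label, sentence = line.split('\t')
            current.append((int(label), sentence.split()))
        elif current:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups

# Placeholder data with the same [2, 3, 2, 3] grouping as above.
lines = ["2\ta b", "2\tc d e", "",
         "2\tf", "2\tg h", "2\ti", "",
         "2\tj k", "2\tl", "",
         "2\tm", "2\tn o", "2\tp"]
groups = read_groups(lines)
sizes = [len(g) for g in groups]   # -> [2, 3, 2, 3]
```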

Secondly, for these two different input data formats, the corresponding DataProviders are compared below (sequenceGen.py):

def hook(settings, dict_file, **kwargs):
    settings.word_dict = dict_file
    settings.input_types = [
        integer_value_sequence(len(settings.word_dict)), integer_value(3)
    ]
    settings.logger.info('dict len : %d' % (len(settings.word_dict)))


@provider(init_hook=hook, should_shuffle=False)
def process(settings, file_name):
    with open(file_name, 'r') as fdata:
        for line in fdata:
            label, comment = line.strip().split('\t')
            label = int(''.join(label.split()))
            words = comment.split()
            words = [
                settings.word_dict[w] for w in words if w in settings.word_dict
            ]
            yield words, label

  • This is the DataProvider code for an ordinary single-layer time series. Its description is as follows:
    • DataProvider returns two parts, “words” and “label”, as in the final yield statement of the above code.
      • “words” is the list of word-table indices corresponding to each word of the sentence in the original data. Its data type is integer_value_sequence, that is, a list of integers. So “words” is a single-layer time series in the data.
      • “label” is the categorical label of each sentence, whose data type is integer_value.
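The conversion performed by process() can be reproduced standalone. In the following sketch, the tiny word table is a placeholder for the real dictionary loaded from dict_file:

```python
# Placeholder word table; the real one comes from dict_file.
word_dict = {"hotel": 0, "very": 1, "clean": 2, "quiet": 3}

def to_sample(line):
    """Mirror process(): one tab-separated line -> (word indices, label)."""
    label, comment = line.strip().split('\t')
    # Out-of-vocabulary tokens (e.g. punctuation) are silently dropped.
    words = [word_dict[w] for w in comment.split() if w in word_dict]
    return words, int(label)

words, label = to_sample("2\thotel very clean , very quiet")
# words -> [0, 1, 2, 1, 3]  (',' is dropped); label -> 2
```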
def hook2(settings, dict_file, **kwargs):
    settings.word_dict = dict_file
    settings.input_types = [
        integer_value_sub_sequence(len(settings.word_dict)),
        integer_value_sequence(3)
    ]
    settings.logger.info('dict len : %d' % (len(settings.word_dict)))


@provider(init_hook=hook2, should_shuffle=False)
def process2(settings, file_name):
    with open(file_name) as fdata:
        labels = []
        sentences = []
        for line in fdata:
            if len(line) > 1:
                label, comment = line.strip().split('\t')
                label = int(''.join(label.split()))
                words = comment.split()
                words = [
                    settings.word_dict[w] for w in words
                    if w in settings.word_dict
                ]
                labels.append(label)
                sentences.append(words)
            else:
                yield sentences, labels
                labels = []
                sentences = []
  • This is the DataProvider code for the same data, organized as a hierarchical time series. Its description is as follows:
    • DataProvider returns two lists, “sentences” and “labels”, corresponding to the sentences and labels of each group in the original hierarchical time series data.
    • “sentences” comes from the hierarchical time series original data. It contains every sentence of each group, and each sentence is represented by a list of word-table indices, so its data type is integer_value_sub_sequence, which is a hierarchical time series.
    • “labels” is the list of categorical labels of the sentences, so it is a single-layer time series.
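One yielded hierarchical sample therefore pairs a list of index lists with a list of labels. A minimal sketch with placeholder indices:

```python
# One group of the hierarchical data: two sentences, as index lists.
sentences = [[0, 1, 2], [3, 1, 4, 2]]   # integer_value_sub_sequence
labels = [2, 2]                          # integer_value_sequence

# One label per sentence in the group.
assert len(sentences) == len(labels)

# Each inner list is exactly one single-layer sample's "words" field.
single_layer_samples = [(words, label) for words, label in zip(sentences, labels)]
```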

Model configuration

Firstly, let’s look at the configuration of the single-layer RNN. The highlighted part, from the mixed_layer block through the lstmemory_group call, is the usage of the single-layer RNN. Here we use a pre-defined RNN step function of PaddlePaddle: at each time step, the RNN passes the input through an LSTM network.

data = data_layer(name="word", size=dict_dim)

emb = embedding_layer(input=data, size=word_dim)

# (lstm_input + lstm) is equal to lstmemory 
with mixed_layer(size=hidden_dim * 4) as lstm_input:
    lstm_input += full_matrix_projection(input=emb)

lstm = lstmemory_group(
    input=lstm_input,
    size=hidden_dim,
    act=TanhActivation(),
    gate_act=SigmoidActivation(),
    state_act=TanhActivation())

lstm_last = last_seq(input=lstm)

with mixed_layer(
        size=label_dim, act=SoftmaxActivation(), bias_attr=True) as output:
    output += full_matrix_projection(input=lstm_last)

outputs(
    classification_cost(
        input=output, label=data_layer(
            name="label", size=1)))

Secondly, let’s look at the model configuration of the hierarchical RNN, which has the same semantic meaning:

  • Most layers in PaddlePaddle do not care whether the input is a time series or not, e.g. embedding_layer. In these layers, every operation is processed per time step.
  • In the highlighted part of this configuration, from the recurrent_group call through last_seq, we transform the hierarchical time series data into single-layer time series data, then process each single-layer time series.
    • We use the function recurrent_group for the transformation. Input sequences need to be passed in when transforming. As we want to transform hierarchical time series into single-layer sequences, we need to mark the input data as SubsequenceInput.
    • In this example, we disassemble every group of the original data into sentences using recurrent_group. Each of the disassembled sentences passes through an LSTM network. This is equivalent to the single-layer RNN configuration.
  • Similar to the single-layer RNN configuration, we only use the last vector of the LSTM encoding. So we apply the last_seq operation to the output of recurrent_group. But unlike the single-layer RNN, we use the last element of every subsequence, so we need to set agg_level=AggregateLevel.TO_SEQUENCE.
  • Till now, lstm_last has the same result as lstm_last in the single-layer RNN configuration.
data = data_layer(name="word", size=dict_dim)

emb_group = embedding_layer(input=data, size=word_dim)


# (lstm_input + lstm) is equal to lstmemory 
def lstm_group(lstm_group_input):
    with mixed_layer(size=hidden_dim * 4) as group_input:
        group_input += full_matrix_projection(input=lstm_group_input)

    lstm_output = lstmemory_group(
        input=group_input,
        name="lstm_group",
        size=hidden_dim,
        act=TanhActivation(),
        gate_act=SigmoidActivation(),
        state_act=TanhActivation())
    return lstm_output


lstm_nest_group = recurrent_group(
    input=SubsequenceInput(emb_group), step=lstm_group, name="lstm_nest_group")
# hasSubseq ->(seqlastins) seq
lstm_last = last_seq(
    input=lstm_nest_group, agg_level=AggregateLevel.TO_SEQUENCE)

# seq ->(expand) hasSubseq
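The effect of last_seq with agg_level=AggregateLevel.TO_SEQUENCE can be sketched in plain Python. This is an illustrative stand-in, not the PaddlePaddle API:

```python
def last_per_subsequence(nested):
    """For a hierarchical sequence (list of subsequences), take the last
    element of each subsequence, producing a single-layer sequence."""
    return [subseq[-1] for subseq in nested]

# Each subsequence is one sentence's per-step LSTM outputs (toy labels here).
nested_outputs = [["h_a1", "h_a2"], ["h_b1", "h_b2", "h_b3"]]
lstm_last = last_per_subsequence(nested_outputs)   # -> ["h_a2", "h_b3"]
```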

Example 2: Hierarchical RNN with Memory between subsequences

This example is intended to implement two fully-equivalent fully-connected RNNs using single-layer RNN and hierarchical RNN.

  • As for single-layer RNN, input is a full time series, e.g. [4, 5, 2, 0, 9, 8, 1, 4].
  • As for the hierarchical RNN, the input is a hierarchical time series whose elements are an arbitrary partition of the single-layer RNN’s data, e.g. [[4, 5, 2], [0, 9], [8, 1, 4]].

Model configuration

We select the parts that differ between the single-layer and hierarchical RNN configurations, to compare and analyze why they have the same semantic meaning.

  • Single-layer RNN: the input passes through a simple recurrent_group. At each time step, the current input y and the last time step’s output rnn_state pass through a fully-connected layer.
def step(y):
    mem = memory(name="rnn_state", size=hidden_dim)
    out = fc_layer(input=[y, mem],
                    size=hidden_dim,
                    act=TanhActivation(),
                    bias_attr=True,
                    name="rnn_state")
    return out

out = recurrent_group(
    name="rnn",
    step=step,
    input=emb)
  • Hierarchical RNN: the outer layer’s memory is a single element.
    • The inner layer’s recurrent_group in inner_step is nearly the same as in the single-layer configuration, except for boot_layer=outer_mem, which means using the outer layer’s outer_mem as the initial state of the inner layer’s memory. In the outer layer’s outer_step, outer_mem is the last vector of a subsequence; that is, the whole hierarchical group uses the last vector of the previous subsequence as the initial state of the next subsequence’s memory.
    • From the aspect of the input data, the sentences for the single-layer and hierarchical RNNs are the same. The only difference is that the hierarchical RNN disassembles the sequence into subsequences. So in the hierarchical RNN configuration, we must use the last element of the previous subsequence as the boot_layer of the next subsequence’s memory, so that it is equivalent to “every time step uses the output of the last time step” in the single-layer RNN configuration.
def outer_step(x):
    outer_mem = memory(name="outer_rnn_state", size=hidden_dim)
    def inner_step(y):
        inner_mem = memory(name="inner_rnn_state",
                           size=hidden_dim,
                           boot_layer=outer_mem)
        out = fc_layer(input=[y, inner_mem],
                        size=hidden_dim,
                        act=TanhActivation(),
                        bias_attr=True,
                        name="inner_rnn_state")
        return out

    inner_rnn_output = recurrent_group(
        step=inner_step,
        name="inner",
        input=x)
    last = last_seq(input=inner_rnn_output, name="outer_rnn_state")

    # "return last" won't work, because recurrent_group requires the output
    # sequence type to be the same as the input sequence type.
    return inner_rnn_output

out = recurrent_group(
    name="outer",
    step=outer_step,
    input=SubsequenceInput(emb))
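The boot_layer trick above can be illustrated with a toy recurrence in plain Python (an illustrative sketch, not the PaddlePaddle API): running a simple stateful update over the flat sequence gives the same outputs as running it per subsequence while booting each subsequence’s state from the previous subsequence’s last state.

```python
def toy_step(state, x):
    # Stand-in for the fully-connected RNN step: any deterministic update works.
    return state * 3 + x

def run_flat(seq, init=0):
    """Single-layer RNN: one state carried across the whole sequence."""
    state, outputs = init, []
    for x in seq:
        state = toy_step(state, x)
        outputs.append(state)
    return outputs

def run_nested(nested, init=0):
    """Hierarchical RNN: each subsequence's memory is booted from the
    last state of the previous subsequence (boot_layer=outer_mem)."""
    boot, outputs = init, []
    for subseq in nested:
        state = boot
        for x in subseq:
            state = toy_step(state, x)
            outputs.append(state)
        boot = state     # last vector of this subsequence boots the next one
    return outputs

flat = [4, 5, 2, 0, 9, 8, 1, 4]
nested = [[4, 5, 2], [0, 9], [8, 1, 4]]
assert run_flat(flat) == run_nested(nested)
```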

Warning

Currently PaddlePaddle only supports the case where the time series used as Memory have the same length at each time step.

Example 3: Hierarchical RNN with unequal-length inputs

Unequal-length inputs means that, among the multiple input sequences of a recurrent_group, the lengths of the subsequences can differ. But the output sequence needs to be consistent with one of the input sequences. The parameter targetInlink lets you specify which input sequence the output should be consistent with; by default it is the first input.

The configurations of Example 3 are sequence_rnn_multi_unequalength_inputs and sequence_nest_rnn_multi_unequalength_inputs.

The data for the configurations of Example 3’s single-layer RNN and hierarchical RNN are exactly the same.

  • For the single-layer RNN, the data has two samples, which are ([1, 2, 4, 5, 2], [5, 4, 1, 3, 1]) and ([0, 2, 2, 5, 0, 1, 2], [1, 5, 4, 2, 3, 6, 1]). Each sample for the single-layer RNN has two groups of features.
  • On the basis of the single-layer data, the hierarchical RNN’s data randomly adds some partitions. For example, the sample ([0, 2, 2, 5, 0, 1, 2], [1, 5, 4, 2, 3, 6, 1]) is transformed to ([[0, 2], [2, 5], [0, 1, 2]], [[1, 5], [4], [2, 3, 6, 1]]).
  • Note that PaddlePaddle currently only supports multi-input hierarchical RNNs whose inputs have the same number of subsequences. In this example, both features have 3 subsequences. Although the length of each subsequence can differ, the numbers of subsequences must be the same.
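The constraint above can be checked directly on the partitioned sample (a plain-Python sketch of the data shapes, not PaddlePaddle code):

```python
# Hierarchical versions of the two feature groups from one sample, as above.
feature1 = [[0, 2], [2, 5], [0, 1, 2]]
feature2 = [[1, 5], [4], [2, 3, 6, 1]]

# PaddlePaddle requires the same number of subsequences across inputs...
assert len(feature1) == len(feature2) == 3

# ...while the subsequence lengths may differ between the two inputs.
lengths1 = [len(s) for s in feature1]   # -> [2, 2, 3]
lengths2 = [len(s) for s in feature2]   # -> [2, 1, 4]

# Flattening the partitions recovers the single-layer data.
flat1 = [x for s in feature1 for x in s]
assert flat1 == [0, 2, 2, 5, 0, 1, 2]
```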

Model configuration

Similar to Example 2’s configuration, Example 3’s configuration uses single-layer and hierarchical RNNs to implement two fully-equivalent fully-connected RNNs.

  • Single-layer RNN:
def step(x1, x2):
    def calrnn(y):
        mem = memory(name='rnn_state_' + y.name, size=hidden_dim)
        out = fc_layer(
            input=[y, mem],
            size=hidden_dim,
            act=TanhActivation(),
            bias_attr=True,
            name='rnn_state_' + y.name)
        return out

    encoder1 = calrnn(x1)
    encoder2 = calrnn(x2)
    return [encoder1, encoder2]


encoder1_rep, encoder2_rep = recurrent_group(
    name="stepout", step=step, input=[emb1, emb2])

  • Hierarchical RNN:
def outer_step(x1, x2):
    index = [0]

    def inner_step(ipt):
        index[0] += 1
        i = index[0]
        outer_mem = memory(name="outer_rnn_state_%d" % i, size=hidden_dim)

        def inner_step_impl(y):
            inner_mem = memory(
                name="inner_rnn_state_" + y.name,
                size=hidden_dim,
                boot_layer=outer_mem)
            out = fc_layer(
                input=[y, inner_mem],
                size=hidden_dim,
                act=TanhActivation(),
                bias_attr=True,
                name='inner_rnn_state_' + y.name)
            return out

        encoder = recurrent_group(
            step=inner_step_impl, name='inner_%d' % i, input=ipt)
        last = last_seq(name="outer_rnn_state_%d" % i, input=encoder)
        return encoder, last

    encoder1, sentence_last_state1 = inner_step(ipt=x1)
    encoder2, sentence_last_state2 = inner_step(ipt=x2)

    encoder1_expand = expand_layer(
        input=sentence_last_state1, expand_as=encoder2)

    return [encoder1_expand, encoder2]


encoder1_rep, encoder2_rep = recurrent_group(
    name="outer",
    step=outer_step,
    input=[SubsequenceInput(emb1), SubsequenceInput(emb2)],
    targetInlink=emb2)

encoder1_last = last_seq(input=encoder1_rep)

In the above code, the usage of the single-layer and hierarchical RNNs is similar to Example 2; the difference is that it processes 2 inputs simultaneously. As for the hierarchical RNN, the lengths of the two inputs’ subsequences are not equal. But we use the parameter targetInlink to set the output format of the outer layer’s recurrent_group, so the shape of the outer layer’s output is the same as the shape of emb2.
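The expand_layer used inside outer_step, which repeats each subsequence’s last state across another sequence’s time steps, can be sketched in plain Python. This is an illustrative stand-in for the real layer, and encoder2_shape is a placeholder name for input 2’s subsequence structure:

```python
def expand_as(per_subseq_values, target_nested):
    """Repeat each per-subsequence value across the corresponding
    subsequence of the target, so both sequences line up step by step."""
    assert len(per_subseq_values) == len(target_nested)
    return [[v] * len(subseq)
            for v, subseq in zip(per_subseq_values, target_nested)]

# One last-state value per subsequence of input 1 (toy labels here)...
sentence_last_state1 = ["s1", "s2", "s3"]
# ...expanded to match the subsequence lengths of input 2.
encoder2_shape = [[1, 5], [4], [2, 3, 6, 1]]
encoder1_expand = expand_as(sentence_last_state1, encoder2_shape)
```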

Glossary

Memory

Memory is a concept in PaddlePaddle’s RNN implementation. An RNN (recurrent neural network) usually requires some dependency between time steps; that is, the neural network of the current time step depends on one of the neurons of the neural network of a previous time step, as the following figure shows:

The dotted connections in the figure are the network connections across time steps. When implementing an RNN, PaddlePaddle realizes this connection across time steps with a special neural network unit called Memory. Memory can cache the output of one of the neurons of the previous time step and pass it to another neuron in the next time step. The implementation of an RNN using Memory is as follows:

With this method, PaddlePaddle can easily determine which outputs should cross time steps, and which should not.
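A minimal sketch of the Memory idea in plain Python (illustrative only; the real Memory is a layer used inside recurrent_group):

```python
def rnn_with_memory(inputs, step, init_state=0):
    """Run `step` over the sequence; `memory` caches the previous
    time step's output and feeds it into the next time step."""
    memory = init_state
    outputs = []
    for x in inputs:
        out = step(x, memory)   # current input + cached previous output
        memory = out            # cache for the next time step
        outputs.append(out)
    return outputs

# A toy step function standing in for the fully-connected layer.
outs = rnn_with_memory([1, 2, 3], step=lambda x, m: x + 2 * m)  # -> [1, 4, 11]
```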

time step

See time series below.

time series

A time series is a series of featured data whose order is meaningful. So it is a list of features, not a set of features. Each element of this list, i.e. each featured datum in the series, is called a time step. It must be noted that the concepts of time series and time step are not necessarily related to “time”. As long as the “order” among a series of featured data is meaningful, it can be the input of a time series.

For example, in a text classification task, we regard a sentence as a time series. Each word in the sentence is mapped to its index in the word table, so the sentence can be represented as a list of these indices, e.g. [9, 2, 3, 5, 3].

For a more detailed and accurate definition of a time series, please refer to the Wikipedia article on Time series or the Chinese Wikipedia article on time series.

In addition, Paddle always refers to a time series as a Sequence. They are the same concept in Paddle’s documentation and APIs.

RNN

In PaddlePaddle’s documentation, RNN usually stands for Recurrent Neural Network. For more information, please refer to the Wikipedia article Recurrent neural network or the Chinese Wikipedia article.

In PaddlePaddle, an RNN usually means that, for time series input data, the neural networks of successive time steps are connected. For example, the input of a certain neuron is the output of a certain neuron of the previous time step’s network. Equivalently, viewed per time step, the network structure contains a directed cycle.

hierarchical RNN

Hierarchical RNN, as the name suggests, means there is a nesting relationship in the RNN. The input data is a time series, but each of its elements is also a time series, namely a two-dimensional array, or an array of arrays. A hierarchical RNN is a neural network that can process this type of input data.

For example, consider the task of paragraph classification, i.e. classifying a paragraph of sentences. We can treat a paragraph as an array of sentences, and each sentence as an array of words. This is a typical input for a hierarchical RNN. We encode each sentence of the paragraph into a vector using an LSTM, then encode the resulting vectors into a paragraph vector using another LSTM. Finally we use this paragraph vector to perform classification. This is the network structure of this hierarchical RNN.
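This two-level encoding can be sketched in plain Python, with sum as a toy stand-in for the LSTM encoder (illustrative only):

```python
def toy_encode(seq):
    # Stand-in for an LSTM encoder: compress a sequence into one value.
    return sum(seq)

def encode_paragraph(paragraph):
    """Inner level: encode each sentence (array of word values) into a vector.
    Outer level: encode the sentence vectors into one paragraph vector."""
    sentence_vectors = [toy_encode(sentence) for sentence in paragraph]
    return toy_encode(sentence_vectors)

# A paragraph = array of sentences; a sentence = array of word indices.
paragraph = [[9, 2, 3], [5, 3], [1, 1, 4]]
paragraph_vector = encode_paragraph(paragraph)  # -> 28
```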