Use PyReader to read training and test data

Paddle Fluid supports PyReader, which feeds data from Python to C++. Unlike Take Numpy Array as Training Data , when PyReader is in use, the process of loading data in Python runs asynchronously with the process of Executor::Run() consuming that data. In addition, PyReader can work with double_buffer_reader to further improve the performance of reading data.

Create PyReader Object

You can create a PyReader object as follows:

import paddle.fluid as fluid

py_reader = fluid.layers.py_reader(capacity=64,
                                   shapes=[(-1,3,224,224), (-1,1)],
                                   dtypes=['float32', 'int64'],
                                   name='py_reader',
                                   use_double_buffer=True)

In the code above, capacity is the buffer size of the PyReader; shapes are the shapes of the data fields in a batch (such as image and label in an image classification task); dtypes are the data types of those fields; name is the name of the PyReader instance; use_double_buffer is True by default, which means double_buffer_reader is used.
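For reference, the following is a minimal sketch of what one batch matching the shapes and dtypes above looks like when built with Numpy. The batch size of 8 is an arbitrary assumption for illustration; -1 in shapes stands for the variable batch dimension:

```python
import numpy as np

# A hypothetical batch of 8 samples matching shapes=[(-1,3,224,224), (-1,1)]
# and dtypes=['float32', 'int64'].
batch_size = 8
image = np.random.random((batch_size, 3, 224, 224)).astype('float32')
label = np.random.randint(0, 10, (batch_size, 1)).astype('int64')

print(image.shape, image.dtype)  # (8, 3, 224, 224) float32
print(label.shape, label.dtype)  # (8, 1) int64
```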

To create multiple PyReader objects (usually you have to create two different PyReader objects, one for the training phase and one for the testing phase), the objects must have different names. For example, in the same task, the PyReader objects for the training and testing phases are created as follows:

import paddle.fluid as fluid

train_py_reader = fluid.layers.py_reader(capacity=64,
                                         shapes=[(-1,3,224,224), (-1,1)],
                                         dtypes=['float32', 'int64'],
                                         name='train',
                                         use_double_buffer=True)

test_py_reader = fluid.layers.py_reader(capacity=64,
                                        shapes=[(-1,3,224,224), (-1,1)],
                                        dtypes=['float32', 'int64'],
                                        name='test',
                                        use_double_buffer=True)

Note: A PyReader object cannot be copied with Program.clone(), so you have to create separate PyReader objects for the training and testing phases with the method mentioned above.

Because a PyReader object cannot be copied with Program.clone(), you have to share the parameters of the training phase with the testing phase through fluid.unique_name.guard() .

Details are as follows:

import paddle
import paddle.fluid as fluid
import paddle.dataset.mnist as mnist

import numpy

def network(is_train):
    reader = fluid.layers.py_reader(
        capacity=10,
        shapes=((-1, 784), (-1, 1)),
        dtypes=('float32', 'int64'),
        name="train_reader" if is_train else "test_reader",
        use_double_buffer=True)
    img, label = fluid.layers.read_file(reader)
    ...
    # The definition of the model and its loss is omitted here
    return loss, reader

train_prog = fluid.Program()
train_startup = fluid.Program()

with fluid.program_guard(train_prog, train_startup):
    with fluid.unique_name.guard():
        train_loss, train_reader = network(True)
        adam = fluid.optimizer.Adam(learning_rate=0.01)
        adam.minimize(train_loss)

test_prog = fluid.Program()
test_startup = fluid.Program()
with fluid.program_guard(test_prog, test_startup):
    with fluid.unique_name.guard():
        test_loss, test_reader = network(False)

Configure data source of PyReader objects

PyReader provides decorate_tensor_provider and decorate_paddle_reader , both of which receive a Python generator as the data source. The difference is:

  1. decorate_tensor_provider : the generator yields a list or tuple each time, in which each element is a LoDTensor or a Numpy array, and the shape of each LoDTensor or Numpy array must be the same as the shapes specified when the PyReader was created.
  2. decorate_paddle_reader : the generator yields a list or tuple each time, in which each element is a Numpy array, but the shape of the Numpy array does not have to be the same as the shapes specified when the PyReader was created. decorate_paddle_reader will reshape the Numpy arrays internally.
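As a sketch of this difference, the two generators below build data sources for a hypothetical PyReader created with shapes=((-1, 784), (-1, 1)). The batch size of 32, the number of batches, and the generator names are illustrative assumptions, not part of the PyReader API:

```python
import numpy as np

BATCH_SIZE = 32  # hypothetical batch size

def tensor_provider():
    # For decorate_tensor_provider: each yielded tuple is already a whole
    # batch whose shapes match the shapes given when creating the PyReader.
    for _ in range(10):
        img = np.random.random((BATCH_SIZE, 784)).astype('float32')
        label = np.random.randint(0, 10, (BATCH_SIZE, 1)).astype('int64')
        yield img, label

def sample_reader():
    # For decorate_paddle_reader: each yielded tuple is a single sample;
    # the shapes need not match exactly, since PyReader reshapes internally.
    # It is batched with paddle.batch before being handed to the PyReader.
    for _ in range(10 * BATCH_SIZE):
        img = np.random.random((784,)).astype('float32')
        label = np.random.randint(0, 10, (1,)).astype('int64')
        yield img, label

# Hypothetical usage (requires a py_reader object and paddle installed):
# py_reader.decorate_tensor_provider(tensor_provider)
# py_reader.decorate_paddle_reader(paddle.batch(sample_reader, BATCH_SIZE))
```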

Train and test model with PyReader

Details are as follows (the remaining part of the code above):

place = fluid.CUDAPlace(0)
startup_exe = fluid.Executor(place)
startup_exe.run(train_startup)
startup_exe.run(test_startup)

trainer = fluid.ParallelExecutor(
    use_cuda=True, loss_name=train_loss.name, main_program=train_prog)

tester = fluid.ParallelExecutor(
    use_cuda=True, share_vars_from=trainer, main_program=test_prog)

train_reader.decorate_paddle_reader(
    paddle.reader.shuffle(paddle.batch(mnist.train(), 512), buf_size=8192))

test_reader.decorate_paddle_reader(paddle.batch(mnist.test(), 512))

for epoch_id in range(10):
    train_reader.start()
    try:
        while True:
            print('train_loss', numpy.array(
                trainer.run(fetch_list=[train_loss.name])))
    except fluid.core.EOFException:
        print('End of epoch', epoch_id)
        train_reader.reset()

    test_reader.start()
    try:
        while True:
            print('test loss', numpy.array(
                tester.run(fetch_list=[test_loss.name])))
    except fluid.core.EOFException:
        print('End of testing')
        test_reader.reset()

The specific steps are as follows:

  1. Before the start of every epoch, call start() to start the PyReader;
  2. At the end of every epoch, read_file throws a fluid.core.EOFException . Call reset() after catching the exception to reset the state of the PyReader and start the next epoch.