Data Reader Interface

DataTypes

paddle.v2.data_type.dense_array(dim, seq_type=0)

Dense array. The input feature is a dense array of floats. For example, if the input is an image with 28*28 pixels, the input of the Paddle neural network could be a dense vector with dimension 784 or a numpy array with shape (28, 28).

For the 2-D convolution operation, every sample in a mini-batch must currently have the same size in PaddlePaddle. The feature dimension may, however, vary across mini-batches. In the variable-dimension case, the parameter dim is not used; the data reader must yield numpy arrays, and the data feeder will set the data shape correctly.

Parameters:
  • dim (int) – dimension of this vector.
  • seq_type (int) – sequence type of input.
Returns:

An input type object.

Return type:

InputType
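
As a small illustration of the 28*28 example above, the following sketch flattens a 2-D image into the 784-element dense vector form. It uses plain Python lists instead of numpy and calls no PaddlePaddle function; the helper name flatten_image is hypothetical.

```python
def flatten_image(image_rows):
    """Flatten a 2-D image (a list of rows) into a 1-D list of floats."""
    return [float(pixel) for row in image_rows for pixel in row]

# A dummy 28x28 image whose pixels are all 0.5.
image = [[0.5] * 28 for _ in range(28)]
dense_vector = flatten_image(image)
print(len(dense_vector))  # 784
```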

paddle.v2.data_type.integer_value(value_range, seq_type=0)

Data type of integer.

Parameters:
  • seq_type (int) – sequence type of this input.
  • value_range (int) – range of this integer.
Returns:

An input type object

Return type:

InputType

paddle.v2.data_type.integer_value_sequence(value_range)

Data type of a sequence of integers.

Parameters:value_range (int) – range of each element.
paddle.v2.data_type.integer_value_sub_sequence(dim)
paddle.v2.data_type.sparse_binary_vector(dim, seq_type=0)

Sparse binary vector. The input feature is a sparse vector in which every element is either zero or one.

Parameters:
  • dim (int) – dimension of this vector.
  • seq_type (int) – sequence type of this input.
Returns:

An input type object.

Return type:

InputType

paddle.v2.data_type.sparse_binary_vector_sequence(dim)
Data type of a sequence of sparse vectors, in which every element is either zero or one.
Parameters:dim (int) – dimension of sparse vector.
Returns:An input type object
Return type:InputType
paddle.v2.data_type.sparse_binary_vector_sub_sequence(dim)
paddle.v2.data_type.sparse_float_vector(dim, seq_type=0)

Sparse vector. The input feature is a sparse vector: most of its elements are zero, and the others can be any float value.

Parameters:
  • dim (int) – dimension of this vector.
  • seq_type (int) – sequence type of this input.
Returns:

An input type object.

Return type:

InputType
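
The two sparse types above can be pictured with the following sketch. It assumes the sample convention commonly used with PaddlePaddle's Python data providers, where a sparse binary sample is a list of non-zero indices and a sparse float sample is a list of (index, value) pairs; verify this against your PaddlePaddle version. The helper names are hypothetical and no PaddlePaddle function is called.

```python
def to_sparse_binary(dense):
    """Return the indices of the non-zero entries of a 0/1 vector."""
    return [i for i, v in enumerate(dense) if v != 0]

def to_sparse_float(dense):
    """Return (index, value) pairs for the non-zero entries of a float vector."""
    return [(i, float(v)) for i, v in enumerate(dense) if v != 0]

print(to_sparse_binary([0, 1, 0, 0, 1]))       # [1, 4]
print(to_sparse_float([0.0, 2.5, 0.0, -1.0]))  # [(1, 2.5), (3, -1.0)]
```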

paddle.v2.data_type.sparse_float_vector_sequence(dim)

Data type of a sequence of sparse vectors, in which most elements are zero and the others can be any float value.

Parameters:dim (int) – dimension of sparse vector.
Returns:An input type object
Return type:InputType
paddle.v2.data_type.sparse_float_vector_sub_sequence(dim)
paddle.v2.data_type.sparse_non_value_slot(dim, seq_type=0)

Sparse binary vector. The input feature is a sparse vector in which every element is either zero or one.

Parameters:
  • dim (int) – dimension of this vector.
  • seq_type (int) – sequence type of this input.
Returns:

An input type object.

Return type:

InputType

paddle.v2.data_type.sparse_value_slot(dim, seq_type=0)

Sparse vector. The input feature is a sparse vector: most of its elements are zero, and the others can be any float value.

Parameters:
  • dim (int) – dimension of this vector.
  • seq_type (int) – sequence type of this input.
Returns:

An input type object.

Return type:

InputType

class paddle.v2.data_type.InputType(dim, seq_type, tp)

InputType is the base class for paddle input types.

Note

This is a base class and should never be used by users.

Parameters:
  • dim (int) – dimension of the input. If the input is an integer, it is the value range; otherwise, it is the size of the layer.
  • seq_type (int) – sequence type of the input. 0 means it is not a sequence, 1 means it is a variable-length sequence, and 2 means it is a nested sequence.
  • tp (int) – data type of the input.
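
The three seq_type values can be pictured with the sketch below. The constant names and the helper nesting_depth are hypothetical, and the sample shapes (a single value, a list, a list of lists) are an assumption about how integer samples are usually laid out for each sequence type, not something stated above.

```python
# Values taken from the seq_type description; the names are illustrative only.
NO_SEQUENCE, SEQUENCE, SUB_SEQUENCE = 0, 1, 2

# Hypothetical integer samples (e.g. word ids) for each seq_type.
samples = {
    NO_SEQUENCE: 7,                     # a single value
    SEQUENCE: [3, 1, 4, 1, 5],          # a variable-length sequence
    SUB_SEQUENCE: [[3, 1], [4, 1, 5]],  # a sequence of sequences
}

def nesting_depth(sample):
    """Count how many list levels wrap the innermost integer."""
    depth = 0
    while isinstance(sample, list):
        depth += 1
        sample = sample[0]
    return depth

for seq_type, sample in samples.items():
    assert nesting_depth(sample) == seq_type
```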

DataFeeder

class paddle.v2.data_feeder.DataFeeder(data_types, feeding=None)

DataFeeder converts the data returned by paddle.reader into the Arguments data structure defined in the API. A paddle.reader usually returns a list of mini-batch data entries. Each entry in the list is one sample, and each sample is a list or tuple of one or more features. DataFeeder converts these mini-batch data entries into Arguments in order to feed them to the C++ interface.

A simple usage example is shown below:

feeding = ['image', 'label']
data_types = enumerate_data_types_of_data_layers(topology)
feeder = DataFeeder(data_types=data_types, feeding=feeding)

minibatch_data = [([1.0, 2.0, 3.0, ...], 5)]

arg = feeder(minibatch_data)

If the mini-batch data and the data layers do not map one to one, a dictionary can be passed as the feeding parameter to describe the mapping:

data_types = [('image', paddle.data_type.dense_vector(784)),
              ('label', paddle.data_type.integer_value(10))]
feeding = {'image':0, 'label':1}
feeder = DataFeeder(data_types=data_types, feeding=feeding)
minibatch_data = [
                   ( [1.0,2.0,3.0,4.0], 5, [6,7,8] ),  # first sample
                   ( [1.0,2.0,3.0,4.0], 5, [6,7,8] )   # second sample
                 ]
# or minibatch_data = [
#                       [ [1.0,2.0,3.0,4.0], 5, [6,7,8] ],  # first sample
#                       [ [1.0,2.0,3.0,4.0], 5, [6,7,8] ]   # second sample
#                     ]
arg = feeder.convert(minibatch_data)

Note

This module is for internal use only. Users should use the reader interface.

Parameters:
  • data_types (list) – A list to specify data name and type. Each item is a tuple of (data_name, data_type).
  • feeding (dict|collections.Sequence|None) – A dictionary or a sequence to specify the position of each data in the input data.
convert(dat, argument=None)
Parameters:
  • dat (list) – A list of mini-batch data. Each sample is a list or tuple of one or more features.
  • argument (py_paddle.swig_paddle.Arguments) – An Arguments object that contains this mini-batch data with one or multiple features. The Arguments definition is in the API.
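
The position-based mapping expressed by the feeding dictionary can be sketched in plain Python as follows. This does not call DataFeeder and says nothing about how Arguments are built; select_features is a hypothetical helper that only shows which fields of each sample a mapping such as {'image': 0, 'label': 1} would pick.

```python
def select_features(minibatch, feeding, names):
    """For each sample, pick the fields listed in `names`, using the
    name -> position mapping given by `feeding`."""
    return [tuple(sample[feeding[name]] for name in names)
            for sample in minibatch]

feeding = {'image': 0, 'label': 1}
minibatch = [
    ([1.0, 2.0, 3.0, 4.0], 5, [6, 7, 8]),  # first sample
    ([0.5, 0.5, 0.5, 0.5], 2, [9]),        # second sample
]
# The third field of each sample is not mapped, so it is ignored.
selected = select_features(minibatch, feeding, ['image', 'label'])
```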

Reader

At training and testing time, PaddlePaddle programs need to read data. To make it easier for users to write data-reading code, we define the following:

  • A reader is a function that reads data (from file, network, random number generator, etc) and yields data items.
  • A reader creator is a function that returns a reader function.
  • A reader decorator is a function, which accepts one or more readers, and returns a reader.
  • A batch reader is a function that reads data (from reader, file, network, random number generator, etc) and yields a batch of data items.
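
The definitions above can be made concrete with a short sketch. All names here are hypothetical, the random module from the standard library stands in for a real data source, and no PaddlePaddle function is used.

```python
import random

def random_number_creator(count, seed=0):
    """Reader creator: returns a reader function."""
    def reader():
        rng = random.Random(seed)
        for _ in range(count):
            yield rng.random()  # one data item per step
    return reader

def scaled(reader, factor):
    """Reader decorator: accepts a reader and returns a new reader."""
    def new_reader():
        for item in reader():
            yield item * factor
    return new_reader

def pair_batch_reader():
    """Batch reader: yields a batch (list) of data items per step."""
    yield [1, 2]
    yield [3, 4]

reader = scaled(random_number_creator(3), 10.0)
items = list(reader())
batches = list(pair_batch_reader())
```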

Data Reader Interface

In fact, a data reader does not have to be a function that reads and yields data items. It can be any function with no parameters that creates an iterable (anything that can be used in for x in iterable):

iterable = data_reader()

Each element produced by the iterable should be a single entry of data, not a mini-batch. That entry could be a single item or a tuple of items. Each item should be of a supported type (e.g., numpy 1d array of float32, int, list of int).
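
To illustrate that a reader only needs to return an iterable, the sketch below uses a function that returns a plain list rather than yielding. The function name and the data are made up for the example; each entry is a tuple of items (a feature list and an integer label).

```python
def list_backed_reader():
    """A reader that returns a list instead of yielding from a generator."""
    return [
        ([0.1, 0.2, 0.3], 0),  # one entry: (features, label)
        ([0.4, 0.5, 0.6], 1),
    ]

entries = []
for features, label in list_backed_reader():
    # Each iteration sees a single entry, never a mini-batch.
    entries.append((len(features), label))
```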

An example implementation of a single-item data reader creator:

import numpy

def reader_creator_random_image(width, height):
    def reader():
        while True:
            # Yield one flattened image with values drawn uniformly from [-1, 1).
            yield numpy.random.uniform(-1, 1, size=width*height)
    return reader

An example implementation of a multiple-item data reader creator:

import numpy

def reader_creator_random_image_and_label(width, height, label):
    def reader():
        while True:
            # Yield a (flattened image, label) tuple as one data entry.
            yield numpy.random.uniform(-1, 1, size=width*height), label
    return reader


paddle.reader.map_readers(func, *readers)

Creates a data reader that outputs the return value of func, called with the output of each data reader as arguments.

Parameters:
  • func (callable) – function to use. The type of func should be (Sample) => Sample.
  • readers – readers whose outputs will be used as arguments of func.

Returns:

the created data reader.

Return type:

callable
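
The behavior described for map_readers can be sketched in plain Python as follows. This is an illustrative stand-in written for this page, not the library's implementation; map_readers_sketch and the two toy readers are hypothetical names.

```python
def map_readers_sketch(func, *readers):
    """Return a reader that applies func to one entry from each reader."""
    def reader():
        # zip pulls one entry from every input reader per step.
        for entries in zip(*[r() for r in readers]):
            yield func(*entries)
    return reader

def numbers():
    return iter([1, 2, 3])

def weights():
    return iter([10, 20, 30])

products = list(map_readers_sketch(lambda x, w: x * w, numbers, weights)())
print(products)  # [10, 40, 90]
```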

paddle.reader.buffered(reader, size)

Creates a buffered data reader.

The buffered data reader reads data entries and saves them into a buffer. Reading from the buffered data reader proceeds as long as the buffer is not empty.

Parameters:
  • reader (callable) – the data reader to read from.
  • size (int) – max buffer size.
Returns:

the buffered data reader.
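
One way to picture a buffered reader is a background thread that fills a bounded queue while the consumer drains it. The sketch below is a stand-in under that assumption, built only on the standard library; it is not claimed to match how paddle.reader.buffered is implemented, and buffered_sketch is a hypothetical name.

```python
import queue
import threading

def buffered_sketch(reader, size):
    """Return a reader that prefetches up to `size` entries in a thread."""
    end = object()  # sentinel marking the end of the data

    def buffered_reader():
        q = queue.Queue(maxsize=size)

        def fill():
            for entry in reader():
                q.put(entry)  # blocks while the buffer is full
            q.put(end)

        worker = threading.Thread(target=fill)
        worker.daemon = True
        worker.start()

        while True:
            entry = q.get()  # blocks while the buffer is empty
            if entry is end:
                break
            yield entry

    return buffered_reader

result = list(buffered_sketch(lambda: iter(range(10)), size=3)())
```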

paddle.reader.compose(*readers, **kwargs)

Creates a data reader whose output is the combination of input readers.

If the input readers output the following data entries:

(1, 2)    3    (4, 5)

the composed reader will output:

(1, 2, 3, 4, 5)

Parameters:
  • readers – readers that will be composed together.
  • check_alignment (bool) – if True, will check if input readers are aligned correctly. If False, will not check alignment and trailing outputs will be discarded. Defaults to True.
Returns:

the new data reader.

Raises:

ComposeNotAligned – outputs of readers are not aligned. Will not raise when check_alignment is set to False.
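
The composition rule, including the check_alignment flag, can be sketched as below. This is a plain-Python approximation of the documented behavior; the exception class and function names are hypothetical stand-ins and are not the library's own.

```python
class ComposeNotAlignedSketch(ValueError):
    """Stand-in for the ComposeNotAligned error described above."""

def compose_sketch(*readers, check_alignment=True):
    def as_tuple(entry):
        return entry if isinstance(entry, tuple) else (entry,)

    def reader():
        iterators = [iter(r()) for r in readers]
        done = object()  # marks an exhausted reader
        while True:
            entries = [next(it, done) for it in iterators]
            ended = [e is done for e in entries]
            if any(ended):
                if check_alignment and not all(ended):
                    raise ComposeNotAlignedSketch(
                        "outputs of readers are not aligned")
                return  # trailing outputs are discarded
            # Concatenate the entries of all readers into one flat tuple.
            yield sum((as_tuple(e) for e in entries), ())
    return reader

a = lambda: iter([(1, 2)])
b = lambda: iter([3])
c = lambda: iter([(4, 5)])
composed = list(compose_sketch(a, b, c)())
print(composed)  # [(1, 2, 3, 4, 5)]
```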

paddle.reader.chain(*readers)

Creates a data reader whose output is the outputs of input data readers chained together.

If the input readers output the following data entries:

[0, 0, 0]    [1, 1, 1]    [2, 2, 2]

the chained reader will output:

[0, 0, 0, 1, 1, 1, 2, 2, 2]

Parameters:readers – input readers.
Returns:the new data reader.
Return type:callable
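
Chaining can be sketched with itertools.chain from the standard library. The function chain_sketch is a hypothetical stand-in that reproduces the example above, where each bracketed list is read as the sequence of entries produced by one reader.

```python
import itertools

def chain_sketch(*readers):
    """Return a reader that exhausts each input reader in turn."""
    def reader():
        return itertools.chain(*[r() for r in readers])
    return reader

zeros = lambda: iter([0, 0, 0])
ones = lambda: iter([1, 1, 1])
twos = lambda: iter([2, 2, 2])
chained = list(chain_sketch(zeros, ones, twos)())
print(chained)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```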
paddle.reader.shuffle(reader, buf_size)

Creates a data reader whose data output is shuffled.

The output of the iterator created by the original reader is buffered into a shuffle buffer and then shuffled. The size of the shuffle buffer is determined by the argument buf_size.

Parameters:
  • reader (callable) – the original reader whose output will be shuffled.
  • buf_size (int) – shuffle buffer size.
Returns:

the new reader whose output is shuffled.

Return type:

callable
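
A buffer-based shuffle along the lines described above might look like the following sketch. It is an approximation for illustration only: shuffle_sketch and its seed parameter are hypothetical, and the library's exact buffering policy may differ.

```python
import random

def shuffle_sketch(reader, buf_size, seed=None):
    """Return a reader that shuffles entries within blocks of buf_size."""
    rng = random.Random(seed)

    def shuffled_reader():
        buf = []
        for entry in reader():
            buf.append(entry)
            if len(buf) >= buf_size:
                rng.shuffle(buf)  # shuffle the full buffer, then emit it
                for item in buf:
                    yield item
                buf = []
        rng.shuffle(buf)  # emit whatever is left at the end
        for item in buf:
            yield item

    return shuffled_reader

shuffled = list(shuffle_sketch(lambda: iter(range(10)), buf_size=4, seed=1)())
```

Note that with this scheme an entry can only move within its own buffer-sized block, so a larger buf_size gives a more thorough shuffle at the cost of more memory.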

paddle.reader.firstn(reader, n)

Limits the maximum number of samples that the reader can return.

Parameters:
  • reader (callable) – the data reader to read from.
  • n (int) – the maximum number of samples to return.
Returns:

the decorated reader.

Return type:

callable
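
Limiting a reader to its first n samples is useful for readers that never terminate. A minimal sketch using itertools.islice, with hypothetical names and no PaddlePaddle dependency:

```python
import itertools

def firstn_sketch(reader, n):
    """Return a reader that stops after at most n samples."""
    def limited_reader():
        return itertools.islice(reader(), n)
    return limited_reader

def endless():
    """A reader that never stops on its own."""
    i = 0
    while True:
        yield i
        i += 1

first_three = list(firstn_sketch(endless, 3)())
print(first_three)  # [0, 1, 2]
```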

paddle.reader.xmap_readers(mapper, reader, process_num, buffer_size, order=False)

Use multiprocess to map samples from reader by a mapper defined by user. This function also contains a buffered decorator.

Parameters:
  • mapper (callable) – a function to map a sample.
  • reader (callable) – the data reader to read from.
  • process_num (int) – number of processes used to handle the original samples.
  • buffer_size (int) – max buffer size.
  • order (bool) – whether to keep the order of the reader.
Returns:

the decorated reader.

Return type:

callable
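
The idea of mapping samples with several workers can be sketched with concurrent.futures from the standard library. This is only a rough stand-in with loudly different details: it uses threads rather than processes, it always preserves input order (comparable to order=True), it has no bounded buffer, and it submits the whole input up front. The name xmap_sketch and the parameter worker_num are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def xmap_sketch(mapper, reader, worker_num):
    """Return a reader whose samples are mapped by a pool of threads."""
    def mapped_reader():
        with ThreadPoolExecutor(max_workers=worker_num) as pool:
            # Executor.map yields results in the order of the inputs.
            for result in pool.map(mapper, reader()):
                yield result
    return mapped_reader

squares = list(xmap_sketch(lambda x: x * x, lambda: iter(range(5)), 2)())
print(squares)  # [0, 1, 4, 9, 16]
```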

class paddle.reader.PipeReader(command, bufsize=8192, file_type='plain')

PipeReader reads data as a stream from a command: it takes the command's stdout into a pipe buffer, redirects it to a parser, and then yields the data in the desired format.

You can use a standard Linux command or call another program to read data from HDFS, Ceph, a URL, AWS S3, etc.

An example:

def example_reader():
    for f in myfiles:
        pr = PipeReader("cat %s"%f)
        for l in pr.get_line():
            sample = l.split(" ")
            yield sample

get_line(cut_lines=True, line_break='\n')
Parameters:
  • cut_lines (bool) – cut buffer to lines.
  • line_break (string) – line break of the file, like '\n' or '\r'.
Returns:

one line or a buffer of bytes.

Return type:

string
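
The general pattern of reading samples from a command's stdout can be tried without PaddlePaddle using the subprocess module. The sketch below is not PipeReader itself; command_lines is a hypothetical helper, and the command simply asks the current Python interpreter to print two lines so that the example runs on any platform.

```python
import subprocess
import sys

def command_lines(command_args):
    """Run a command and yield its stdout one line at a time."""
    proc = subprocess.Popen(command_args, stdout=subprocess.PIPE,
                            universal_newlines=True)
    try:
        for line in proc.stdout:
            yield line.rstrip("\n")  # drop the trailing line break
    finally:
        proc.stdout.close()
        proc.wait()

cmd = [sys.executable, "-c", "print('1 2 3'); print('4 5 6')"]
samples = [line.split(" ") for line in command_lines(cmd)]
```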

The creator package contains some simple reader creators that can be used in user programs.

paddle.reader.creator.np_array(x)

Creates a reader that yields the elements of x if it is a numpy vector, the rows of x if it is a numpy matrix, or, in general, the sub-hyperplanes indexed by the highest dimension.

Parameters:x – the numpy array to create reader from.
Returns:data reader created from x.
paddle.reader.creator.text_file(path)

Creates a data reader that outputs text line by line from given text file. Trailing new line (‘\n’) of each line will be removed.

Parameters:path – path of the text file.
Returns:data reader of text file
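
A line-by-line text reader with the trailing newline removed can be sketched as follows. The function text_file_sketch is a hypothetical stand-in for illustration, and the temporary file exists only to make the example self-contained.

```python
import os
import tempfile

def text_file_sketch(path):
    """Return a reader that yields the lines of a text file."""
    def reader():
        with open(path, "r") as f:
            for line in f:
                yield line.rstrip("\n")  # remove the trailing new line
    return reader

# Write a small file to read back.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\n")
    path = tmp.name

lines = list(text_file_sketch(path)())
os.remove(path)
```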
paddle.reader.creator.recordio(paths, buf_size=100)
Creates a data reader from the given RecordIO file paths, separated by ",". Glob patterns are supported.
Parameters:paths – path(s) of the RecordIO files; can be a string or a list of strings.
Returns:data reader of recordio files.

minibatch

paddle.v2.minibatch.batch(reader, batch_size, drop_last=True)

Create a batched reader.

Parameters:
  • reader (callable) – the data reader to read from.
  • batch_size (int) – size of each mini-batch
  • drop_last (bool) – drop the last batch, if the size of last batch is not equal to batch_size.
Returns:

the batched reader.

Return type:

callable
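
The effect of drop_last can be seen in a plain-Python sketch of batching. The function batch_sketch is a hypothetical stand-in that mirrors the documented parameters; it is not the library's implementation.

```python
def batch_sketch(reader, batch_size, drop_last=True):
    """Return a batch reader that groups entries into lists of batch_size."""
    def batch_reader():
        batch = []
        for entry in reader():
            batch.append(entry)
            if len(batch) == batch_size:
                yield batch
                batch = []
        # A final, smaller batch is kept only when drop_last is False.
        if batch and not drop_last:
            yield batch
    return batch_reader

five = lambda: iter(range(5))
print(list(batch_sketch(five, 2)()))                   # [[0, 1], [2, 3]]
print(list(batch_sketch(five, 2, drop_last=False)()))  # [[0, 1], [2, 3], [4]]
```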