Dataset

Dataset package.

mnist

MNIST dataset.

This module downloads the MNIST dataset from http://yann.lecun.com/exdb/mnist/ and parses the training set and test set into paddle reader creators.

paddle.dataset.mnist.train()

MNIST training set creator.

It returns a reader creator; each sample in the reader is an image (pixel values in [0, 1]) and a label in [0, 9].

Returns: Training reader creator
Return type: callable
paddle.dataset.mnist.test()

MNIST test set creator.

It returns a reader creator; each sample in the reader is an image (pixel values in [0, 1]) and a label in [0, 9].

Returns: Test reader creator
Return type: callable
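
Every creator in this package follows the same contract: a zero-argument callable that returns a generator of samples. The sketch below illustrates that contract with synthetic data (it is a self-contained stand-in, not the real MNIST download):

```python
import random

def make_mnist_like_reader(num_samples=4, seed=0):
    """Return a reader creator: a zero-argument callable yielding
    (image, label) samples, mirroring paddle.dataset.mnist.train()."""
    def reader():
        rng = random.Random(seed)
        for _ in range(num_samples):
            image = [rng.random() for _ in range(784)]  # pixel values in [0, 1]
            label = rng.randint(0, 9)                   # label in [0, 9]
            yield image, label
    return reader

reader = make_mnist_like_reader()
for image, label in reader():
    assert all(0.0 <= p <= 1.0 for p in image)
    assert 0 <= label <= 9
```

Because the creator returns a fresh generator on each call, the same reader can be iterated once per training epoch.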
paddle.dataset.mnist.convert(path)

Converts the dataset to RecordIO format.

cifar

CIFAR dataset.

This module downloads the CIFAR dataset from https://www.cs.toronto.edu/~kriz/cifar.html and parses the train/test sets into paddle reader creators.

The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

The CIFAR-100 dataset is just like CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class.

paddle.dataset.cifar.train100()

CIFAR-100 training set creator.

It returns a reader creator; each sample in the reader is an image (pixel values in [0, 1]) and a label in [0, 99].

Returns: Training reader creator
Return type: callable
paddle.dataset.cifar.test100()

CIFAR-100 test set creator.

It returns a reader creator; each sample in the reader is an image (pixel values in [0, 1]) and a label in [0, 99].

Returns: Test reader creator
Return type: callable
paddle.dataset.cifar.train10()

CIFAR-10 training set creator.

It returns a reader creator; each sample in the reader is an image (pixel values in [0, 1]) and a label in [0, 9].

Returns: Training reader creator
Return type: callable
paddle.dataset.cifar.test10()

CIFAR-10 test set creator.

It returns a reader creator; each sample in the reader is an image (pixel values in [0, 1]) and a label in [0, 9].

Returns: Test reader creator
Return type: callable
paddle.dataset.cifar.convert(path)

Converts the dataset to RecordIO format.

conll05

Conll05 dataset. Paddle's semantic role labeling book and demo use this dataset as an example. Because the full Conll05 dataset is not freely available, the default download URL points to the Conll05 test set (which is public). Users can change the URL and MD5 to point to their own Conll dataset. A word vector model pre-trained on a Wikipedia corpus is used to initialize the SRL model.

paddle.dataset.conll05.get_dict()

Get the word, verb and label dictionary of Wikipedia corpus.

paddle.dataset.conll05.get_embedding()

Get the trained word vector based on Wikipedia corpus.

paddle.dataset.conll05.test()

Conll05 test set creator.

Because the training dataset is not freely available, the test dataset is used for training. It returns a reader creator; each sample in the reader consists of nine features, including the sentence sequence, predicate, predicate context, predicate context flag, and the tagged sequence.

Returns: Training reader creator
Return type: callable

imdb

IMDB dataset.

This module downloads the IMDB dataset from http://ai.stanford.edu/%7Eamaas/data/sentiment/. The dataset contains 25,000 highly polar movie reviews for training and 25,000 for testing. This module also provides an API for building a word dictionary.

paddle.dataset.imdb.build_dict(pattern, cutoff)

Build a word dictionary from the corpus. Keys of the dictionary are words, and values are zero-based IDs of these words.
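
A minimal sketch of such a cutoff-based dictionary builder (the helper below and its exact cutoff semantics are illustrative assumptions, not the library's implementation):

```python
from collections import Counter

def build_dict(docs, cutoff):
    """Sketch of a frequency-cutoff dictionary builder: words occurring
    more than `cutoff` times get zero-based IDs, ordered by frequency."""
    freq = Counter(word for doc in docs for word in doc.split())
    kept = [w for w, c in freq.most_common() if c > cutoff]
    return {word: idx for idx, word in enumerate(kept)}

docs = ["great great movie", "bad movie", "great plot"]
word_idx = build_dict(docs, cutoff=1)
# "great" (3 occurrences) and "movie" (2) survive a cutoff of 1
```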

paddle.dataset.imdb.train(word_idx)

IMDB training set creator.

It returns a reader creator; each sample in the reader is a zero-based ID sequence and a label in [0, 1].

Parameters: word_idx (dict) – word dictionary
Returns: Training reader creator
Return type: callable
paddle.dataset.imdb.test(word_idx)

IMDB test set creator.

It returns a reader creator; each sample in the reader is a zero-based ID sequence and a label in [0, 1].

Parameters: word_idx (dict) – word dictionary
Returns: Test reader creator
Return type: callable
paddle.dataset.imdb.convert(path)

Converts the dataset to RecordIO format.

imikolov

imikolov’s simple dataset.

This module downloads the dataset from http://www.fit.vutbr.cz/~imikolov/rnnlm/ and parses the training set and test set into paddle reader creators.

paddle.dataset.imikolov.build_dict(min_word_freq=50)

Build a word dictionary from the corpus. Keys of the dictionary are words, and values are zero-based IDs of these words.

paddle.dataset.imikolov.train(word_idx, n, data_type=1)

imikolov training set creator.

It returns a reader creator, each sample in the reader is a word ID tuple.

Parameters:
  • word_idx (dict) – word dictionary
  • n (int) – sliding window size if data_type is NGRAM, otherwise the maximum sequence length
  • data_type (member of DataType, NGRAM or SEQ) – data type (n-gram or sequence)
Returns: Training reader creator
Return type: callable
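
The NGRAM case can be sketched as a sliding window over the word-ID sequence (`ngram_samples` below is a hypothetical helper for illustration, not part of paddle):

```python
def ngram_samples(words, word_idx, n):
    """Sketch of the NGRAM data type: slide a window of size n over the
    word-ID sequence and yield each window as a tuple."""
    # Words missing from the dictionary map to one ID past the last entry.
    ids = [word_idx.get(w, len(word_idx)) for w in words]
    for i in range(len(ids) - n + 1):
        yield tuple(ids[i:i + n])

word_idx = {"the": 0, "cat": 1, "sat": 2, "down": 3}
list(ngram_samples(["the", "cat", "sat", "down"], word_idx, n=3))
# -> [(0, 1, 2), (1, 2, 3)]
```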

paddle.dataset.imikolov.test(word_idx, n, data_type=1)

imikolov test set creator.

It returns a reader creator, each sample in the reader is a word ID tuple.

Parameters:
  • word_idx (dict) – word dictionary
  • n (int) – sliding window size if data_type is NGRAM, otherwise the maximum sequence length
  • data_type (member of DataType, NGRAM or SEQ) – data type (n-gram or sequence)
Returns: Test reader creator
Return type: callable

paddle.dataset.imikolov.convert(path)

Converts the dataset to RecordIO format.

movielens

Movielens 1-M dataset.

The Movielens 1-M dataset contains 1 million ratings from 6,000 users on 4,000 movies, collected by GroupLens Research. This module downloads the dataset from http://files.grouplens.org/datasets/movielens/ml-1m.zip and parses the training set and test set into paddle reader creators.

paddle.dataset.movielens.get_movie_title_dict()

Get movie title dictionary.

paddle.dataset.movielens.max_movie_id()

Get the maximum value of movie id.

paddle.dataset.movielens.max_user_id()

Get the maximum value of user id.

paddle.dataset.movielens.max_job_id()

Get the maximum value of job id.

paddle.dataset.movielens.movie_categories()

Get the movie categories dictionary.

paddle.dataset.movielens.user_info()

Get user info dictionary.

paddle.dataset.movielens.movie_info()

Get movie info dictionary.

paddle.dataset.movielens.convert(path)

Converts the dataset to RecordIO format.

class paddle.dataset.movielens.MovieInfo(index, categories, title)

Movie id, title and categories information are stored in MovieInfo.

class paddle.dataset.movielens.UserInfo(index, gender, age, job_id)

User id, gender, age, and job information are stored in UserInfo.
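
A minimal sketch of these two record types (the field names come from the signatures above; the constructor bodies below are illustrative assumptions about their internals, not the library's code):

```python
class MovieInfo:
    """Sketch: stores a movie's id, categories, and title."""
    def __init__(self, index, categories, title):
        self.index = int(index)
        self.categories = categories
        self.title = title

class UserInfo:
    """Sketch: stores a user's id, gender, age, and job id."""
    def __init__(self, index, gender, age, job_id):
        self.index = int(index)
        self.gender = gender
        self.age = int(age)
        self.job_id = int(job_id)

m = MovieInfo(1, ["Animation", "Comedy"], "Toy Story")
u = UserInfo(7, "F", 25, 10)
```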

sentiment

This script fetches and preprocesses the movie_reviews dataset provided by NLTK.

TODO(yuyang18): Complete dataset.

paddle.dataset.sentiment.get_word_dict()

Sort the words by the frequency with which they occur in the samples.

Returns: words_freq_sorted
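
A minimal sketch of such a frequency sort (a self-contained illustration; the real function reads the NLTK corpus, which is not reproduced here):

```python
from collections import Counter

def get_word_dict(samples):
    """Sketch: count word occurrences across (words, label) samples and
    return (word, count) pairs sorted by descending frequency."""
    freq = Counter(word for words, _label in samples for word in words)
    # Break ties alphabetically so the ordering is deterministic.
    return sorted(freq.items(), key=lambda item: (-item[1], item[0]))

samples = [(["fun", "fun", "plot"], "pos"), (["plot", "dull"], "neg")]
get_word_dict(samples)
# -> [('fun', 2), ('plot', 2), ('dull', 1)]
```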
paddle.dataset.sentiment.train()

Default training set reader creator

paddle.dataset.sentiment.test()

Default test set reader creator

paddle.dataset.sentiment.convert(path)

Converts the dataset to RecordIO format.

uci_housing

UCI Housing dataset.

This module downloads the dataset from https://archive.ics.uci.edu/ml/machine-learning-databases/housing/ and parses the training set and test set into paddle reader creators.

paddle.dataset.uci_housing.train()

UCI_HOUSING training set creator.

It returns a reader creator; each sample in the reader consists of the normalized features and the house price.

Returns: Training reader creator
Return type: callable
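
The normalization step can be sketched as per-column feature scaling. The exact scheme the library uses is an assumption here; mean-centering divided by the column range is one plausible choice:

```python
def normalize(features):
    """Sketch of per-column scaling: subtract each column's mean and
    divide by its range (max - min), guarding against zero ranges."""
    cols = list(zip(*features))
    means = [sum(c) / len(c) for c in cols]
    ranges = [max(c) - min(c) or 1.0 for c in cols]  # 0 range -> divide by 1
    return [
        [(x - m) / r for x, m, r in zip(row, means, ranges)]
        for row in features
    ]

raw = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]
normalize(raw)
# -> [[-0.5, -0.5], [0.0, 0.0], [0.5, 0.5]]
```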
paddle.dataset.uci_housing.test()

UCI_HOUSING test set creator.

It returns a reader creator; each sample in the reader consists of the normalized features and the house price.

Returns: Test reader creator
Return type: callable

wmt14

WMT14 dataset. The original WMT14 dataset is too large for a demo, so a shrunken subset is provided. This module downloads the dataset from http://paddlepaddle.cdn.bcebos.com/demo/wmt_shrinked_data/wmt14.tgz and parses the training set and test set into paddle reader creators.

paddle.dataset.wmt14.train(dict_size)

WMT14 training set creator.

It returns a reader creator; each sample in the reader is a source-language word ID sequence, a target-language word ID sequence, and a next-word ID sequence.

Returns: Training reader creator
Return type: callable
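
One common way to derive the three sequences of such a sample is sketched below (an illustration, not the library's code: the next-word sequence is the target shifted by one position, bracketed with assumed start/end marker IDs):

```python
def make_triple(src_ids, trg_ids, start_id=0, end_id=1):
    """Sketch of a translation sample: the decoder input is the target
    prefixed with a start marker; the 'next word' sequence is the target
    suffixed with an end marker, so they align position by position."""
    target = [start_id] + trg_ids
    next_word = trg_ids + [end_id]
    return src_ids, target, next_word

src, trg, nxt = make_triple([5, 6, 7], [8, 9])
# src=[5, 6, 7], trg=[0, 8, 9], nxt=[8, 9, 1]
```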
paddle.dataset.wmt14.test(dict_size)

WMT14 test set creator.

It returns a reader creator; each sample in the reader is a source-language word ID sequence, a target-language word ID sequence, and a next-word ID sequence.

Returns: Test reader creator
Return type: callable
paddle.dataset.wmt14.convert(path)

Converts the dataset to RecordIO format.

wmt16

ACL2016 Multimodal Machine Translation. Please see this website for more details: http://www.statmt.org/wmt16/multimodal-task.html#task1

If you use the dataset created for your task, please cite the following paper: Multi30K: Multilingual English-German Image Descriptions.

@inproceedings{elliott-EtAl:2016:VL16,
  author    = {{Elliott}, D. and {Frank}, S. and {Sima'an}, K. and {Specia}, L.},
  title     = {Multi30K: Multilingual English-German Image Descriptions},
  booktitle = {Proceedings of the 6th Workshop on Vision and Language},
  year      = {2016},
  pages     = {70--74}
}

paddle.dataset.wmt16.train(src_dict_size, trg_dict_size, src_lang='en')

WMT16 train set reader.

This function returns the reader for train data. Each sample the reader returns is made up of three fields: the source language word index sequence, target language word index sequence and next word index sequence.

NOTE: The original link for the training data is: http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz

paddle.dataset.wmt16 provides a version of the original dataset tokenized with the Moses tokenizer script: https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl

Parameters:
  • src_dict_size (int) – Size of the source language dictionary. Three special tokens are added to the dictionary: <s> for the start mark, <e> for the end mark, and <unk> for unknown words.
  • trg_dict_size (int) – Size of the target language dictionary. Three special tokens are added to the dictionary: <s> for the start mark, <e> for the end mark, and <unk> for unknown words.
  • src_lang (string) – A string indicating the source language. Available options are: “en” for English and “de” for German.
Returns: The train reader.
Return type: callable

paddle.dataset.wmt16.test(src_dict_size, trg_dict_size, src_lang='en')

WMT16 test set reader.

This function returns the reader for test data. Each sample the reader returns is made up of three fields: the source language word index sequence, target language word index sequence and next word index sequence.

NOTE: The original link for the test data is: http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/mmt16_task1_test.tar.gz

paddle.dataset.wmt16 provides a version of the original dataset tokenized with the Moses tokenizer script: https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl

Parameters:
  • src_dict_size (int) – Size of the source language dictionary. Three special tokens are added to the dictionary: <s> for the start mark, <e> for the end mark, and <unk> for unknown words.
  • trg_dict_size (int) – Size of the target language dictionary. Three special tokens are added to the dictionary: <s> for the start mark, <e> for the end mark, and <unk> for unknown words.
  • src_lang (string) – A string indicating the source language. Available options are: “en” for English and “de” for German.
Returns: The test reader.
Return type: callable

paddle.dataset.wmt16.validation(src_dict_size, trg_dict_size, src_lang='en')

WMT16 validation set reader.

This function returns the reader for validation data. Each sample the reader returns is made up of three fields: the source language word index sequence, target language word index sequence and next word index sequence.

NOTE: The original link for the validation data is: http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/validation.tar.gz

paddle.dataset.wmt16 provides a version of the original dataset tokenized with the Moses tokenizer script: https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl

Parameters:
  • src_dict_size (int) – Size of the source language dictionary. Three special tokens are added to the dictionary: <s> for the start mark, <e> for the end mark, and <unk> for unknown words.
  • trg_dict_size (int) – Size of the target language dictionary. Three special tokens are added to the dictionary: <s> for the start mark, <e> for the end mark, and <unk> for unknown words.
  • src_lang (string) – A string indicating the source language. Available options are: “en” for English and “de” for German.
Returns: The validation reader.
Return type: callable

paddle.dataset.wmt16.get_dict(lang, dict_size, reverse=False)

Return the word dictionary for the specified language.

Parameters:
  • lang (string) – A string indicating the language. Available options are: “en” for English and “de” for German.
  • dict_size (int) – Size of the dictionary for the specified language.
  • reverse (bool) – If reverse is False, the returned dictionary maps each word to its index; if reverse is True, it maps each index to its word.
Returns: The word dictionary for the specified language.
Return type: dict
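
The effect of the reverse flag can be sketched with a tiny self-contained helper (`as_dict` below is hypothetical; only the mapping direction mirrors the documented behavior):

```python
def as_dict(words, reverse=False):
    """Sketch of the reverse flag: word -> index by default,
    index -> word when reverse is True."""
    if reverse:
        return {idx: word for idx, word in enumerate(words)}
    return {word: idx for idx, word in enumerate(words)}

as_dict(["<s>", "<e>", "<unk>", "ein"])
# -> {'<s>': 0, '<e>': 1, '<unk>': 2, 'ein': 3}
as_dict(["<s>", "<e>"], reverse=True)
# -> {0: '<s>', 1: '<e>'}
```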

paddle.dataset.wmt16.fetch()

Download the entire dataset.

paddle.dataset.wmt16.convert(path, src_dict_size, trg_dict_size, src_lang)

Converts the dataset to RecordIO format.