podium.datasets package

Submodules

podium.datasets.dataset module

Module contains base classes for datasets.

class podium.datasets.dataset.Dataset(examples, fields, sort_key=None)

Bases: abc.ABC

General purpose container for datasets defining some common methods.

A dataset is a list of Example instances, along with the corresponding Field instances, which process the columns of each example.

examples

A list of Example objects.

Type

list

fields

A list of Field objects that were used to create examples.

Type

list

__getattr__(attr)

Returns an Iterator iterating over values of the field with the given name for every example in the dataset.

Parameters

attr (str) – The name of the field whose values are to be returned.

Returns

An Iterator iterating over values of the field with the given name for every example in the dataset.

Raises

AttributeError – If there is no Field with the given name.
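
For instance, if the dataset was created with a field named text (a hypothetical field name, used here only for illustration), its values can be iterated over as an attribute:

# `dataset` is assumed to be a Dataset with a field named "text"
for value in dataset.text:
    print(value)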

__getitem__(i)

Returns an example or a new dataset containing the indexed examples.

If indexed with an int, only the example at that position will be returned. If indexed with a slice or iterable, all examples indexed by the object will be collected and a new dataset containing only those examples will be returned. The new dataset will contain copies of the old dataset’s fields and will be identical to the original dataset, except for the number and ordering of examples. See the wiki for detailed examples.

Examples in the returned Dataset are the same ones present in the original dataset. If a complete deep-copy of the dataset, or its slice, is needed please refer to the get method.

Usage example:

example = dataset[1]         # Indexing by a single integer returns a single example

new_dataset = dataset[1:10]  # Indexing by a slice returns a new dataset
                             # containing the indexed examples

Parameters

i (int or slice or iterable) – Index used to index examples.

Returns

If i is an int, a single example will be returned. If i is a slice or iterable, a copy of this dataset containing only the indexed examples will be returned.

Return type

single example or Dataset

__getstate__()

Method obtains dataset state. It is used for pickling dataset data to file.

Returns

state – dataset state dictionary

Return type

dict

__iter__()

Iterates over all examples in the dataset in order.

Yields

example – Yields examples in the dataset.

__len__()

Returns the number of examples in the dataset.

Returns

The number of examples in the dataset.

Return type

int

__setstate__(state)

Method sets dataset state. It is used for unpickling dataset data from file.

Parameters

state (dict) – dataset state dictionary

batch()

Creates an input and target batch containing the whole dataset. The format of the batch is the same as the batches returned by the Iterator class.

Returns

Two objects containing the input and target batches over the whole dataset.

Return type

input_batch, target_batch
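
A minimal sketch, assuming dataset is a Dataset whose fields have already been finalized:

input_batch, target_batch = dataset.batch()
# Each returned object has one attribute per input/target field,
# containing the numericalized values for the whole dataset.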

filter(predicate, inplace=False)

Method filters examples with given predicate.

Parameters
  • predicate (callable) – A callable that accepts an example as input and returns True if the example should be kept, otherwise False.

  • inplace (bool, default False) – If True, perform the filtering in place and return None.
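
For example, to drop examples that fail a user-defined check (is_valid is a hypothetical callable):

filtered_dataset = dataset.filter(lambda example: is_valid(example))

# Or filter in place, modifying this dataset and returning None:
dataset.filter(lambda example: is_valid(example), inplace=True)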

finalize_fields(*args)

Builds vocabularies of all the non-eager fields in the dataset from the Dataset objects given as *args, and then finalizes all the fields.

Parameters

*args – A variable number of Dataset objects from which to build the vocabularies for non-eager fields. If none provided, the vocabularies are built from this Dataset (self).
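
A typical pattern, sketched under the assumption that train_set and valid_set are Dataset objects sharing the same fields:

# Build vocabularies from the training set only, then finalize all fields:
train_set.finalize_fields()

# Alternatively, build the vocabularies from several datasets:
train_set.finalize_fields(train_set, valid_set)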

get(i, deep_copy=False)

Returns an example or a new dataset containing the indexed examples.

If indexed with an int, only the example at that position will be returned. If indexed with a slice or iterable, all examples indexed by the object will be collected and a new dataset containing only those examples will be returned. The new dataset will contain copies of the old dataset’s fields and will be identical to the original dataset, except for the number and ordering of examples. See the wiki for detailed examples.

Example

# Indexing by a single integer returns a single example
example = dataset.get(1)

# Same as above, but returns a deep copy of the example
example_copy = dataset.get(1, deep_copy=True)

# Multi-indexing returns a new dataset containing the indexed examples
s = slice(1, 10)
new_dataset = dataset.get(s)

new_dataset_copy = dataset.get(s, deep_copy=True)

Parameters
  • i (int or slice or iterable) – Index used to index examples.

  • deep_copy (bool) – If true, the returned dataset will contain deep-copies of this dataset’s examples and fields. If false, existing examples and fields will be reused.

Returns

If i is an int, a single example will be returned. If i is a slice or iterable, a copy of this dataset containing only the indexed examples will be returned.

Return type

single example or Dataset

numericalize_examples()

Generates and caches numericalized data for every example in the dataset. Call before using the dataset to avoid lazy numericalization during iteration.

shuffle_examples(random_state=None)

Shuffles the examples in this dataset.

Parameters

random_state (int) – The random seed used for shuffling.

split(split_ratio=0.7, stratified=False, strata_field_name=None, random_state=None, shuffle=True)

Creates train-(validation)-test splits from this dataset.

The splits are new Dataset objects, each containing a part of this one’s examples.

Parameters
  • split_ratio ((float | list[float] | tuple[float])) – If type is float, a number in the interval (0.0, 1.0) denoting the amount of data to be used for the train split (the rest is used for test). If type is list or tuple, it should be of length 2 (or 3) and the numbers should denote the relative sizes of train, (valid) and test splits respectively. If the relative size for valid is missing (length is 2), only the train-test split is returned (valid is taken to be 0.0). Also, the relative sizes don’t have to sum up to 1.0 (they are normalized automatically). The ratio must not be so unbalanced that it would result in either of the splits being empty (having zero elements). Default is 0.7 (for the train set).

  • stratified (bool) – Whether the split should be stratified. A stratified split means that for each concrete value of the strata field, the given train-val-test ratio is preserved. Usually used on fields representing labels / classes, so that every class is present in each of our splits with the same percentage as in the entire dataset. Default is False.

  • strata_field_name (str) – Name of the field that is to be used to do the stratified split. Only relevant when ‘stratified’ is true. If the name of the strata field is not provided (the default behaviour), the stratified split will be done over the first field that is a target (its ‘is_target’ attribute is True). Note that the values of the strata field have to be hashable. Default is None.

  • random_state (int) – The random seed used for shuffling.

  • shuffle (bool) – Whether to shuffle the examples before splitting. Default is True.

Returns

Datasets for train, (validation) and test splits in that order, depending on the split ratios that were provided.

Return type

tuple[Dataset]

Raises

ValueError – If the given split ratio is not in one of the valid forms; if the given split ratio is in a valid form, but would result in at least one empty split; or if stratified is True and the field with the given strata_field_name doesn’t exist.
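
A usage sketch (the ratios are illustrative):

# 70% train / 30% test:
train_set, test_set = dataset.split(split_ratio=0.7)

# 70% train / 10% validation / 20% test, stratified over the target field:
train_set, valid_set, test_set = dataset.split(
    split_ratio=(0.7, 0.1, 0.2), stratified=True)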

podium.datasets.dataset.check_split_ratio(split_ratio)

Checks that the split ratio argument is not malformed, transforms it to a tuple of (train_size, valid_size, test_size) and normalizes it if necessary so that all elements sum to 1.

(See Dataset.split docs for more info).

Parameters

split_ratio ((float | list[float] | tuple[float])) – The split_ratio should either be a float in the interval (0.0, 1.0) (size of train) or a list / tuple of floats of length 2 (or 3) that are all larger than 0 and that represent the relative sizes of train, (val), test splits. If given as a list / tuple, the relative sizes don’t have to sum up to 1.0 (they are normalized automatically).

Returns

A tuple of (train_size, valid_size, test_size) whose elements sum to 1.0.

Return type

tuple[float]

Raises

ValueError – If the ratio doesn’t obey any of the expected formats described above.

podium.datasets.dataset.rationed_split(examples, train_ratio, val_ratio, test_ratio, shuffle)

Splits a list of examples according to the given ratios and returns the splits as a tuple of lists (train_examples, valid_examples, test_examples).

The list can also be randomly shuffled before splitting.

Parameters
  • examples (list) – A list of examples that is to be split according to the ratios.

  • train_ratio (float) – The fraction of examples that should be put into the train split.

  • val_ratio (float) – The fraction of examples that should be put into the valid split.

  • test_ratio (float) – The fraction of examples that should be put into the test split.

  • shuffle (bool) – Whether to shuffle the list before splitting.

Returns

The train, valid and test splits, each as a list of examples.

Return type

tuple

Raises

ValueError – If the given split ratio is wrong in the sense that it would result in at least one empty split.

podium.datasets.dataset.stratified_split(examples, train_ratio, val_ratio, test_ratio, strata_field_name, shuffle)

Performs a stratified split on a list of examples according to the given ratios and the given strata field.

Returns the splits as a tuple of lists (train_examples, valid_examples, test_examples).

The list can also be randomly shuffled before splitting.

Parameters
  • examples (list) – A list of examples that is to be split according to the ratios.

  • train_ratio (float) – The fraction of examples that should be put into the train split.

  • val_ratio (float) – The fraction of examples that should be put into the valid split.

  • test_ratio (float) – The fraction of examples that should be put into the test split.

  • strata_field_name (str) – Name of the field that the examples should be stratified over. The values of the strata field have to be hashable. Default is ‘label’ for the conventional label field.

  • shuffle (bool) – Whether to shuffle the list before splitting.

Returns

The stratified train, valid and test splits, each as a list of examples.

Return type

tuple
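
A minimal sketch, assuming examples is a list of examples whose label field holds hashable values:

from podium.datasets.dataset import stratified_split

train, valid, test = stratified_split(
    examples, 0.7, 0.1, 0.2, strata_field_name="label", shuffle=True)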

podium.datasets.hierarhical_dataset module

class podium.datasets.hierarhical_dataset.HierarchicalDataset(parser, fields)

Bases: object

Container for datasets with a hierarchical structure of examples which have the same structure on every level of the hierarchy.

class Node(example, index, parent)

Bases: object

Class defines a node in a hierarchical dataset.

example

example instance containing node data

Type

Example

index

index in current hierarchy level

Type

int

parent

parent node

Type

Node

children

children nodes

Type

tuple(Node)

__getstate__()

Method obtains dataset state. It is used for pickling dataset data to file.

Returns

state – dataset state dictionary

Return type

dict

__setstate__(state)

Method sets dataset state. It is used for unpickling dataset data from file.

Parameters

state (dict) – dataset state dictionary

as_flat_dataset()

Returns a standard Dataset containing the examples in the order defined by flatten().

Returns

a standard Dataset

Return type

Dataset

property depth

Returns

The maximum depth of a node in the hierarchy.

Return type

int

finalize_fields()

Finalizes all fields in this dataset.

flatten()

Returns an iterable iterating through the examples in the dataset as if it were a standard Dataset. The iteration is done in pre-order.

Returns

iterable iterating through examples in the dataset.

Return type

iterable

static from_json(dataset, fields, parser)

Creates a HierarchicalDataset from a JSON-formatted string.

Parameters
  • dataset (str) – Dataset in JSON format. The root element of the JSON string must be a list of root examples.

  • fields (dict(str, Field)) – a dict mapping keys in the raw_example to corresponding fields in the dataset.

  • parser (callable(raw_example, fields, depth) returning (example, raw_children)) – Callable taking (raw_example, fields, depth) and returning a tuple containing (example, raw_children).

Returns

dataset containing the data

Return type

HierarchicalDataset

Raises

If the base element in the JSON string is not a list of root elements.

get_context(index, levels=None)

Returns an Iterator iterating through the context of the Example with the passed index.

Parameters
  • index (int) – Index of the Example the context should be retrieved for.

  • levels (int) – the maximum number of levels of the hierarchy the context should contain. If None, the context will contain all levels up to the root node of the dataset.

Returns

an Iterator iterating through the context of the Example with the passed index.

Return type

Iterator(Node)

Raises

If levels is less than 0.

static get_default_dict_parser(child_attribute_name)

Returns a callable instance that can be used for parsing datasets in which examples on all levels in the hierarchy have children under the same key.

Parameters

child_attribute_name (str) – key used for accessing children in the examples

Return type

Callable(raw_example, fields, depth) returning (example, raw_children)
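
A construction sketch combining get_default_dict_parser and from_json; the JSON structure and the fields dict are assumptions made for illustration:

from podium.datasets.hierarhical_dataset import HierarchicalDataset

# Hypothetical JSON in which every example keeps its children under "replies"
json_str = '[{"text": "root post", "replies": [{"text": "a reply", "replies": []}]}]'

parser = HierarchicalDataset.get_default_dict_parser("replies")
# `fields` is assumed to be a dict mapping raw-example keys to Field objects
dataset = HierarchicalDataset.from_json(json_str, fields, parser)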

podium.datasets.iris_dataset module

class podium.datasets.iris_dataset.IrisDataset

Bases: podium.datasets.dataset.Dataset

This is the classic Iris dataset, perhaps the best-known dataset in the pattern recognition literature.

The fields of this dataset are:

  • sepal_length - float

  • sepal_width - float

  • petal_length - float

  • petal_width - float

  • species - int, specifying the iris species
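
A short usage sketch:

from podium.datasets.iris_dataset import IrisDataset

dataset = IrisDataset()
print(len(dataset))        # number of examples
for example in dataset:    # iterate over examples in order
    pass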

podium.datasets.iterator module

Module contains classes for iterating over datasets.

class podium.datasets.iterator.BucketIterator(batch_size, dataset=None, sort_key=None, shuffle=True, seed=42, look_ahead_multiplier=100, bucket_sort_key=None)

Bases: podium.datasets.iterator.Iterator

Creates a bucket iterator that uses a look-ahead heuristic to try and batch examples in a way that minimizes the amount of necessary padding.

It creates a bucket of size N x batch_size (where N is the look_ahead_multiplier), and sorts that bucket before splitting it into batches, so less padding is necessary.

__iter__()

Returns an iterator object that knows how to iterate over the batches of the given dataset.

Returns

Iterator that iterates over batches of examples in the dataset.

Return type

iter
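
A sketch of bucketing by text length; the field name text and the assumption that its value is a (raw, tokenized) pair are illustrative:

from podium.datasets.iterator import BucketIterator

bucket_iter = BucketIterator(
    batch_size=32,
    dataset=dataset,
    look_ahead_multiplier=100,
    # sort the look-ahead bucket by the (assumed) tokenized text length
    bucket_sort_key=lambda example: len(example.text[1]),
)
for input_batch, target_batch in bucket_iter:
    pass  # train on the batch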

class podium.datasets.iterator.HierarchicalDatasetIterator(batch_size, dataset=None, sort_key=None, shuffle=False, seed=1, internal_random_state=None, context_max_length=None, context_max_depth=None)

Bases: podium.datasets.iterator.Iterator

Iterator used to create batches for Hierarchical Datasets.

It creates batches in the form of lists of matrices. In the batch namedtuple that gets returned, every attribute corresponds to a field in the dataset. For every field in the dataset, the namedtuple contains a list of matrices, where every matrix represents the context of an example in the batch. The rows of a matrix contain numericalized representations of the examples that make up the context of an example in the batch with the representation of the example itself being in the last row of its own context matrix.
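
A usage sketch, assuming hier_dataset is a HierarchicalDataset (see the podium.datasets.hierarhical_dataset module above):

from podium.datasets.iterator import HierarchicalDatasetIterator

hier_iter = HierarchicalDatasetIterator(batch_size=16, dataset=hier_dataset)
for input_batch, target_batch in hier_iter:
    pass  # every batch attribute is a list of context matrices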

class podium.datasets.iterator.Iterator(dataset=None, batch_size=32, sort_key=None, shuffle=True, seed=1, internal_random_state=None)

Bases: object

An iterator that batches data from a dataset after numericalization.

epoch

The number of epochs elapsed up to this point.

Type

int

iterations

The number of iterations elapsed in the current epoch.

Type

int

__call__(dataset: podium.datasets.dataset.Dataset)

Sets the dataset for this Iterator and returns an iterable over the batches of that Dataset. Same as calling iterator.set_dataset() followed by iter(iterator).

Parameters

dataset (Dataset) – Dataset to iterate over.

Return type

Iterable over batches in the Dataset.

__iter__()

Returns an iterator object that knows how to iterate over the given dataset. The iterator yields tuples in the form (input_batch, target_batch). The input_batch and target_batch objects have attributes that correspond to the names of input fields and target fields (respectively) of the dataset. The values of those attributes are numpy matrices, whose rows are the numericalized values of that field in the examples that are in the batch. Rows of sequential fields (that are of variable length) are all padded to a common length. The common length is either the fixed_length attribute of the field or, if that is not given, the maximum length of all examples in the batch.

Returns

Iterator that iterates over batches of examples in the dataset.

Return type

iter
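
A minimal iteration sketch based on the behaviour described above:

from podium.datasets.iterator import Iterator

iterator = Iterator(dataset=dataset, batch_size=32)
for input_batch, target_batch in iterator:
    # each attribute of the batch objects is a numpy matrix of
    # numericalized, padded values for the corresponding field
    pass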

__len__()

Returns the number of batches this iterator provides in one epoch.

Returns

The number of batches provided in one epoch.

Return type

int

get_internal_random_state()

Returns the internal random state of the iterator.

Useful when we want to stop iteration and later continue where we left off. We can store the random state obtained with this method and later initialize another iterator with the same random state and continue iterating.

Only to be called if shuffle is True, otherwise a RuntimeError is raised.

Returns

The internal random state of the iterator.

Return type

tuple

Raises

RuntimeError – If shuffle is False.

set_dataset(dataset: podium.datasets.dataset.Dataset)

Sets the dataset for this Iterator to iterate over. Resets the epoch count.

Parameters

dataset (Dataset) – Dataset to iterate over.

set_internal_random_state(state)

Sets the internal random state of the iterator.

Useful when we want to stop iteration and later continue where we left off. We can take the random state previously obtained from another iterator to initialize this iterator with the same state and continue iterating where the previous iterator stopped.

Only to be called if shuffle is True, otherwise a RuntimeError is raised.

Raises

RuntimeError – If shuffle is False.
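
A pause-and-resume sketch, assuming iterator is an Iterator created with shuffle=True:

state = iterator.get_internal_random_state()

# ... later, possibly after recreating the iterator:
new_iterator = Iterator(dataset=dataset, batch_size=32, shuffle=True)
new_iterator.set_internal_random_state(state)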

class podium.datasets.iterator.SingleBatchIterator(dataset: podium.datasets.dataset.Dataset = None, shuffle=True)

Bases: podium.datasets.iterator.Iterator

Iterator that creates one batch per epoch containing all examples in the dataset.

set_dataset(dataset: podium.datasets.dataset.Dataset)

Sets the dataset for this Iterator to iterate over. Resets the epoch count.

Parameters

dataset (Dataset) – Dataset to iterate over.
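
A sketch; in effect similar to Dataset.batch(), but usable wherever an Iterator is expected:

from podium.datasets.iterator import SingleBatchIterator

single_iter = SingleBatchIterator(dataset=dataset)
input_batch, target_batch = next(iter(single_iter))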

podium.datasets.tabular_dataset module

class podium.datasets.tabular_dataset.TabularDataset(path, format, fields, skip_header=False, csv_reader_params={}, **kwargs)

Bases: podium.datasets.dataset.Dataset

A dataset type for data stored in a single CSV, TSV or JSON file, where each row of the file is a single example.
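
A creation sketch; the file path and the fields dict are assumptions made for illustration:

from podium.datasets.tabular_dataset import TabularDataset

# `fields` is assumed to map column names to Field objects,
# e.g. {"text": text_field, "label": label_field}
dataset = TabularDataset("data/train.csv", "csv", fields, skip_header=True)
dataset.finalize_fields()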

podium.datasets.tabular_dataset.create_examples(reader, format, fields, skip_header)

Creates a list of examples from the given line reader and fields (see TabularDataset.__init__ docs for more info on the fields).

Parameters
  • reader – A reader object that reads one line at a time. Yields either strings (when format is JSON) or lists of values (when format is CSV/TSV).

  • format (str) – Format of the data file that is being read. Can be either CSV, TSV or JSON.

  • fields ((list | dict)) – A list or dict of fields (see TabularDataset.__init__ docs for more info).

  • skip_header (bool) – Whether to skip the first line of the input file. (see TabularDataset.__init__ docs for more info).

Returns

A list of created examples.

Return type

list

Raises

ValueError – If format is JSON and skip_header is True, or if format is CSV/TSV, the fields are given as a dict and skip_header is True.

Module contents

Package contains datasets.

class podium.datasets.Dataset(examples, fields, sort_key=None)

Bases: abc.ABC

Alias of podium.datasets.dataset.Dataset; see the podium.datasets.dataset module above for full documentation.

class podium.datasets.TabularDataset(path, format, fields, skip_header=False, csv_reader_params={}, **kwargs)

Bases: podium.datasets.dataset.Dataset

A dataset type for data stored in a single CSV, TSV or JSON file, where each row of the file is a single example.

class podium.datasets.HierarchicalDataset(parser, fields)

Bases: object

Alias of podium.datasets.hierarhical_dataset.HierarchicalDataset; see the podium.datasets.hierarhical_dataset module above for full documentation.

podium.datasets.stratified_split(examples, train_ratio, val_ratio, test_ratio, strata_field_name, shuffle)

Alias of podium.datasets.dataset.stratified_split; see the podium.datasets.dataset module above for full documentation.

podium.datasets.rationed_split(examples, train_ratio, val_ratio, test_ratio, shuffle)

Alias of podium.datasets.dataset.rationed_split; see the podium.datasets.dataset module above for full documentation.

class podium.datasets.IMDB(dir_path, fields)

Bases: podium.datasets.dataset.Dataset

Simple IMDB dataset with only supervised data, using unprocessed (raw) text.

NAME

dataset name

Type

str

URL

url to the imdb dataset

Type

str

DATASET_DIR

name of the folder in the dataset containing train and test directories

Type

str

ARCHIVE_TYPE

string that defines archive type, used for unpacking dataset

Type

str

TRAIN_DIR

name of the training directory

Type

str

TEST_DIR

name of the directory containing test examples

Type

str

POSITIVE_LABEL_DIR

name of the subdirectory containing examples with positive sentiment

Type

str

NEGATIVE_LABEL_DIR

name of the subdirectory containing examples with negative sentiment

Type

str

TEXT_FIELD_NAME

name of the field containing comment text

Type

str

LABEL_FIELD_NAME

name of the field containing label value

Type

str

POSITIVE_LABEL

positive sentiment label

Type

int

NEGATIVE_LABEL

negative sentiment label

Type

int

static get_dataset_splits(fields=None)

Method creates train and test datasets for the IMDB dataset.

Parameters

fields (dict(str, Field), optional) – Dictionary mapping field name to field. If not given, the method will use `get_default_fields`. Users should use the default field names defined in the class attributes.

Returns

(train_dataset, test_dataset) – tuple containing train dataset and test dataset

Return type

(Dataset, Dataset)
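
A loading sketch using the default fields:

from podium.datasets import IMDB

train_set, test_set = IMDB.get_dataset_splits()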

static get_default_fields()

Method returns default IMDB fields: text and label.

Returns

fields – Dictionary mapping field name to field.

Return type

dict(str, Field)

class podium.datasets.CatacxDataset(dir_path, fields=None)

Bases: podium.datasets.hierarhical_dataset.HierarchicalDataset

Catacx dataset.

static get_dataset(fields=None)

Downloads (if necessary) and loads the dataset. Not supported yet. Raises NotImplementedError if called.

Parameters

fields (dict(str, Field)) – Dictionary that maps field name to field. If None is passed, the default set of fields will be used.

Returns

The loaded dataset.

Return type

CatacxDataset

static get_default_fields()

Method returns a dict of default Catacx fields.

Returns

fields – dict containing all default Catacx fields

Return type

dict(str, Field)

class podium.datasets.CoNLLUDataset(file_path, fields=None)

Bases: podium.datasets.dataset.Dataset

A CoNLL-U dataset class. This class uses all default CoNLL-U fields.

static get_default_fields()

Method returns a dict of the default CoNLL-U fields: id, form, lemma, upos, xpos, feats, head, deprel, deps, misc.

Returns

fields – Dict containing all default CoNLL-U fields.

Return type

dict(str, Field)
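
A loading sketch; the file path is an assumption:

from podium.datasets import CoNLLUDataset

dataset = CoNLLUDataset("data/train.conllu")  # uses the default CoNLL-U fields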

class podium.datasets.SST(file_path, fields, fine_grained=False, subtrees=False)

Bases: podium.datasets.dataset.Dataset

The Stanford Sentiment Treebank dataset.

NAME

dataset name

Type

str

URL

url to the SST dataset

Type

str

DATASET_DIR

name of the folder in the dataset containing train and test directories

Type

str

ARCHIVE_TYPE

string that defines archive type, used for unpacking dataset

Type

str

TEXT_FIELD_NAME

name of the field containing comment text

Type

str

LABEL_FIELD_NAME

name of the field containing label value

Type

str

POSITIVE_LABEL

positive sentiment label

Type

int

NEGATIVE_LABEL

negative sentiment label

Type

int

static get_dataset_splits(fields=None, fine_grained=False, subtrees=False)

Method loads and creates dataset splits for the SST dataset.

Parameters
  • fields (dict(str, Field), optional) – Dictionary mapping field name to field. If not given, the method will use `get_default_fields`. Users should use the default field names defined in the class attributes.

  • fine_grained (bool) – If False, returns the binary (positive/negative) SST dataset, filtering out neutral examples. In that case, please set your Fields not to be eager.

  • subtrees (bool) – If True, also returns the subtrees of each input instance as separate instances. This causes the dataset to become much larger.

Returns

(train_dataset, valid_dataset, test_dataset) – tuple containing train, valid and test dataset

Return type

(Dataset, Dataset, Dataset)
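
A loading sketch using the default fields:

from podium.datasets import SST

train_set, valid_set, test_set = SST.get_dataset_splits(fine_grained=False)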

static get_default_fields()

Method returns default SST fields: text and label.

Returns

fields – Dictionary mapping field name to field.

Return type

dict(str, Field)

class podium.datasets.CornellMovieDialogsConversationalDataset(data, fields=None)

Bases: podium.datasets.dataset.Dataset

Cornell Movie Dialogs Conversational dataset which contains sentences and replies from movies.

static get_default_fields()

Method returns default Cornell Movie Dialogs fields: sentence and reply. Fields share same vocabulary.

Returns

fields – Dictionary mapping field name to field.

Return type

dict(str, Field)

class podium.datasets.EuroVocDataset(eurovoc_labels, crovoc_labels, documents, mappings, fields=None)

Bases: podium.datasets.dataset.Dataset

EuroVoc dataset class that contains labeled documents and the label hierarchy.

get_all_ancestors(label_id)

Returns ids of all ancestors of the label with the given label id.

Parameters

label_id (int) – id of the label

Returns

list of label_ids of all ancestors of the given label or None if the label is not present in the dataset label hierarchies

Return type

list(int)

get_crovoc_label_hierarchy()

Returns CroVoc label hierarchy.

Returns

dictionary that maps label id to label

Return type

dict(int, Label)

static get_default_fields()

Method returns default EuroVoc fields: title, text, eurovoc and crovoc labels.

Returns

fields – Dictionary mapping field name to field.

Return type

dict(str, Field)

get_direct_parents(label_id)

Returns ids of direct parents of the label with the given label id.

Parameters

label_id (int) – id of the label

Returns

list of label_ids of all direct parents of the given label or None if the label is not present in the dataset label hierarchies

Return type

list(int)

get_eurovoc_label_hierarchy()

Returns the EuroVoc label hierarchy.

Returns

dictionary that maps label id to label

Return type

dict(int, Label)

is_ancestor(label_id, example)

Checks if the given label_id is an ancestor of any labels of the example.

Parameters
  • label_id (int) – id of the label

  • example (Example) – example from dataset

Returns

True if the label is an ancestor of any of the example’s labels, False otherwise

Return type

boolean

class podium.datasets.PauzaHRDataset(dir_path, fields)

Bases: podium.datasets.dataset.Dataset

Simple PauzaHR dataset class which uses original reviews.

URL

url to the PauzaHR dataset

Type

str

NAME

dataset name

Type

str

DATASET_DIR

name of the folder in the dataset containing train and test directories

Type

str

ARCHIVE_TYPE

string that defines archive type, used for unpacking dataset

Type

str

TRAIN_DIR

name of the training directory

Type

str

TEST_DIR

name of the directory containing test examples

Type

str

static get_default_fields()

Method returns default PauzaHR fields: rating, source and text.

Returns

fields – Dictionary mapping field name to field.

Return type

dict(str, Field)

static get_train_test_dataset(fields=None)

Method creates train and test datasets for the PauzaHR dataset.

Parameters

fields (dict(str, Field), optional) – Dictionary mapping field name to field. If not given, the method will use `get_default_fields`.

Returns

(train_dataset, test_dataset) – tuple containing train dataset and test dataset

Return type

(Dataset, Dataset)
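
A loading sketch using the default fields:

from podium.datasets import PauzaHRDataset

train_set, test_set = PauzaHRDataset.get_train_test_dataset()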

class podium.datasets.Iterator(dataset=None, batch_size=32, sort_key=None, shuffle=True, seed=1, internal_random_state=None)

Bases: object

Alias of podium.datasets.iterator.Iterator; see the podium.datasets.iterator module above for full documentation.

class podium.datasets.SingleBatchIterator(dataset: podium.datasets.dataset.Dataset = None, shuffle=True)

Bases: podium.datasets.iterator.Iterator

Alias of podium.datasets.iterator.SingleBatchIterator; see the podium.datasets.iterator module above for full documentation.

class podium.datasets.BucketIterator(batch_size, dataset=None, sort_key=None, shuffle=True, seed=42, look_ahead_multiplier=100, bucket_sort_key=None)

Bases: podium.datasets.iterator.Iterator

Alias of podium.datasets.iterator.BucketIterator; see the podium.datasets.iterator module above for full documentation.

class podium.datasets.HierarchicalDatasetIterator(batch_size, dataset=None, sort_key=None, shuffle=False, seed=1, internal_random_state=None, context_max_length=None, context_max_depth=None)

Bases: podium.datasets.iterator.Iterator

Alias of podium.datasets.iterator.HierarchicalDatasetIterator; see the podium.datasets.iterator module above for full documentation.

class podium.datasets.SNLIDataset(file_path, fields)

Bases: podium.datasets.impl.snli_dataset.SNLISimple

An SNLI Dataset class. Unlike SNLISimple, this class includes all the fields of the SNLI dataset by default.

NAME

Name of the Dataset.

Type

str

URL

URL to the SNLI dataset.

Type

str

DATASET_DIR

Name of the directory in which the dataset files are stored.

Type

str

ARCHIVE_TYPE

Archive type, i.e. compression method used for archiving the downloaded dataset file.

Type

str

TRAIN_FILE_NAME

Name of the file in which the train dataset is stored.

Type

str

TEST_FILE_NAME

Name of the file in which the test dataset is stored.

Type

str

DEV_FILE_NAME

Name of the file in which the dev (validation) dataset is stored.

Type

str

ANNOTATOR_LABELS_FIELD_NAME

Name of the field containing annotator labels

Type

str

CAPTION_ID_FIELD_NAME

Name of the field containing caption ID

Type

str

GOLD_LABEL_FIELD_NAME

Name of the field containing gold label

Type

str

PAIR_ID_FIELD_NAME

Name of the field containing pair ID

Type

str

SENTENCE1_FIELD_NAME

Name of the field containing sentence1

Type

str

SENTENCE1_PARSE_FIELD_NAME

Name of the field containing sentence1 parse

Type

str

SENTENCE1_BINARY_PARSE_FIELD_NAME

Name of the field containing sentence1 binary parse

Type

str

SENTENCE2_FIELD_NAME

Name of the field containing sentence2

Type

str

SENTENCE2_PARSE_FIELD_NAME

Name of the field containing sentence2 parse

Type

str

SENTENCE2_BINARY_PARSE_FIELD_NAME

Name of the field containing sentence2 binary parse

Type

str

static get_default_fields()

Method returns all SNLI fields in the following order: annotator_labels, captionID, gold_label, pairID, sentence1, sentence1_parse, sentence1_binary_parse, sentence2, sentence2_parse, sentence2_binary_parse

Returns

fields – Dictionary mapping field names to respective Fields.

Return type

dict(str, Field)

Notes

This dataset includes both parses for every sentence.

static get_train_test_dev_dataset(fields=None)

Method creates train, test and dev (validation) Datasets for the SNLI dataset. If the snli_1.0 directory is not present in the current/working directory, it will be downloaded automatically.

Parameters

fields (dict(str, Field), optional) – A dictionary that maps field names to Field objects. If not supplied, `get_default_fields` is used.

Returns

(train_dataset, test_dataset, dev_dataset) – A tuple containing train, test and dev Datasets respectively.

Return type

(Dataset, Dataset, Dataset)
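
A loading sketch using the default fields (note the train, test, dev return order):

from podium.datasets import SNLIDataset

train_set, test_set, dev_set = SNLIDataset.get_train_test_dev_dataset()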

class podium.datasets.SNLISimple(file_path, fields)

Bases: podium.datasets.dataset.Dataset

A Simple SNLI Dataset class. This class only uses three fields by default: gold_label, sentence1, sentence2.

NAME

Name of the Dataset.

Type

str

URL

URL to the SNLI dataset.

Type

str

DATASET_DIR

Name of the directory in which the dataset files are stored.

Type

str

ARCHIVE_TYPE

Archive type, i.e. compression method used for archiving the downloaded dataset file.

Type

str

TRAIN_FILE_NAME

Name of the file in which the train dataset is stored.

Type

str

TEST_FILE_NAME

Name of the file in which the test dataset is stored.

Type

str

DEV_FILE_NAME

Name of the file in which the dev (validation) dataset is stored.

Type

str

GOLD_LABEL_FIELD_NAME

Name of the field containing gold label

Type

str

SENTENCE1_FIELD_NAME

Name of the field containing sentence1

Type

str

SENTENCE2_FIELD_NAME

Name of the field containing sentence2

Type

str

static get_default_fields()

Method returns the three main SNLI fields in the following order: gold_label, sentence1, sentence2

Returns

fields – Dictionary mapping field names to respective Fields.

Return type

dict(str, Field)

static get_train_test_dev_dataset(fields=None)

Method creates train, test and dev (validation) Datasets for the SNLI dataset. If the snli_1.0 directory is not present in the current/working directory, it will be downloaded automatically.

Parameters

fields (dict(str, Field), optional) – A dictionary that maps field names to Field objects. If not supplied, `get_default_fields` is used.

Returns

(train_dataset, test_dataset, dev_dataset) – A tuple containing train, test and dev Datasets respectively.

Return type

(Dataset, Dataset, Dataset)