podium.datasets package

Submodules

podium.datasets.dataset module

Module contains base classes for datasets.

class podium.datasets.dataset.Dataset(examples, fields, sort_key=None)

Bases: abc.ABC

General purpose container for datasets defining some common methods.

A dataset is a list of Example instances, along with the corresponding Field instances, which process the columns of each example.

examples

A list of Example objects.

Type

list

fields

A list of Field objects that were used to create examples.

Type

list

__getattr__(attr)

Returns an Iterator iterating over values of the field with the given name for every example in the dataset.

Parameters

attr (str) – The name of the field whose values are to be returned.

Returns

An Iterator iterating over values of the field with the given name for every example in the dataset.

Raises

AttributeError – If there is no Field with the given name.
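
For instance, if the dataset was created with a field named text (a hypothetical field name, used here only for illustration), its values can be iterated over as an attribute:

# `dataset` is assumed to be a Dataset with a field named "text"
for value in dataset.text:
    print(value)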

__getitem__(i)

Returns an example or a new dataset containing the indexed examples.

If indexed with an int, only the example at that position will be returned. If indexed with a slice or iterable, all examples indexed by the object will be collected and a new dataset containing only those examples will be returned. The new dataset will contain copies of the old dataset’s fields and will be identical to the original dataset, except for the number and ordering of examples. See the wiki for detailed examples.

Examples in the returned Dataset are the same ones present in the original dataset. If a complete deep-copy of the dataset, or its slice, is needed please refer to the get method.

Usage example:

example = dataset[1]         # Indexing by a single integer returns a single example

new_dataset = dataset[1:10]  # Indexing by a slice returns a new dataset
                             # containing the indexed examples

Parameters

i (int or slice or iterable) – Index used to index examples.

Returns

If i is an int, a single example will be returned. If i is a slice or iterable, a copy of this dataset containing only the indexed examples will be returned.

Return type

single example or Dataset

__getstate__()

Method obtains dataset state. It is used for pickling dataset data to file.

Returns

state – dataset state dictionary

Return type

dict

__iter__()

Iterates over all examples in the dataset in order.

Yields

example – Yields examples in the dataset.

__len__()

Returns the number of examples in the dataset.

Returns

The number of examples in the dataset.

Return type

int

__setstate__(state)

Method sets dataset state. It is used for unpickling dataset data from file.

Parameters

state (dict) – dataset state dictionary

batch()

Creates an input and target batch containing the whole dataset. The format of the batch is the same as the batches returned by the Iterator class.

Returns

Two objects containing the input and target batches over the whole dataset.

Return type

input_batch, target_batch
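
A minimal sketch, assuming dataset is a Dataset whose fields have already been finalized:

input_batch, target_batch = dataset.batch()
# Each returned object has one attribute per input/target field,
# containing the numericalized values for the whole dataset.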

filter(predicate, inplace=False)

Method filters examples with given predicate.

Parameters
  • predicate (callable) – A callable that accepts an example as input and returns True if the example should be kept, otherwise False.

  • inplace (bool, default False) – If True, perform the filtering in place and return None.
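
For example, to drop examples that fail a user-defined check (is_valid is a hypothetical callable):

filtered_dataset = dataset.filter(lambda example: is_valid(example))

# Or filter in place, modifying this dataset and returning None:
dataset.filter(lambda example: is_valid(example), inplace=True)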

finalize_fields(*args)

Builds vocabularies of all the non-eager fields in the dataset from the Dataset objects given as *args, and then finalizes all the fields.

Parameters

*args – A variable number of Dataset objects from which to build the vocabularies for non-eager fields. If none provided, the vocabularies are built from this Dataset (self).
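
A typical pattern, sketched under the assumption that train_set and valid_set are Dataset objects sharing the same fields:

# Build vocabularies from the training set only, then finalize all fields:
train_set.finalize_fields()

# Alternatively, build the vocabularies from several datasets:
train_set.finalize_fields(train_set, valid_set)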

get(i, deep_copy=False)

Returns an example or a new dataset containing the indexed examples.

If indexed with an int, only the example at that position will be returned. If indexed with a slice or iterable, all examples indexed by the object will be collected and a new dataset containing only those examples will be returned. The new dataset will contain copies of the old dataset’s fields and will be identical to the original dataset, except for the number and ordering of examples. See the wiki for detailed examples.

Example

# Indexing by a single integer returns a single example
example = dataset.get(1)

# Same as above, but returns a deep copy of the example
example_copy = dataset.get(1, deep_copy=True)

# Multi-indexing returns a new dataset containing the indexed examples
s = slice(1, 10)
new_dataset = dataset.get(s)

new_dataset_copy = dataset.get(s, deep_copy=True)

Parameters
  • i (int or slice or iterable) – Index used to index examples.

  • deep_copy (bool) – If true, the returned dataset will contain deep-copies of this dataset’s examples and fields. If false, existing examples and fields will be reused.

Returns

If i is an int, a single example will be returned. If i is a slice or iterable, a copy of this dataset containing only the indexed examples will be returned.

Return type

single example or Dataset

numericalize_examples()

Generates and caches numericalized data for every example in the dataset. Call before using the dataset to avoid lazy numericalization during iteration.

shuffle_examples(random_state=None)

Shuffles the examples in this dataset.

Parameters

random_state (int) – The random seed used for shuffling.

split(split_ratio=0.7, stratified=False, strata_field_name=None, random_state=None, shuffle=True)

Creates train-(validation)-test splits from this dataset.

The splits are new Dataset objects, each containing a part of this one’s examples.

Parameters
  • split_ratio ((float | list[float] | tuple[float])) – If type is float, a number in the interval (0.0, 1.0) denoting the amount of data to be used for the train split (the rest is used for test). If type is list or tuple, it should be of length 2 (or 3) and the numbers should denote the relative sizes of train, (valid) and test splits respectively. If the relative size for valid is missing (length is 2), only the train-test split is returned (valid is taken to be 0.0). Also, the relative sizes don’t have to sum up to 1.0 (they are normalized automatically). The ratio must not be so unbalanced that it would result in either of the splits being empty (having zero elements). Default is 0.7 (for the train set).

  • stratified (bool) – Whether the split should be stratified. A stratified split means that for each concrete value of the strata field, the given train-val-test ratio is preserved. Usually used on fields representing labels / classes, so that every class is present in each of our splits with the same percentage as in the entire dataset. Default is False.

  • strata_field_name (str) – Name of the field that is to be used to do the stratified split. Only relevant when ‘stratified’ is true. If the name of the strata field is not provided (the default behaviour), the stratified split will be done over the first field that is a target (its ‘is_target’ attribute is True). Note that the values of the strata field have to be hashable. Default is None.

  • random_state (int) – The random seed used for shuffling.

  • shuffle (bool) – Whether to shuffle the examples before splitting. Default is True.

Returns

Datasets for train, (validation) and test splits in that order, depending on the split ratios that were provided.

Return type

tuple[Dataset]

Raises

ValueError – If the given split ratio is not in one of the valid forms; if the given split ratio is in a valid form, but would result in at least one empty split; or if stratified is True and the field with the given strata_field_name doesn’t exist.
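
A usage sketch (the ratios are illustrative):

# 70% train / 30% test:
train_set, test_set = dataset.split(split_ratio=0.7)

# 70% train / 10% validation / 20% test, stratified over the target field:
train_set, valid_set, test_set = dataset.split(
    split_ratio=(0.7, 0.1, 0.2), stratified=True)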

podium.datasets.dataset.check_split_ratio(split_ratio)

Checks that the split ratio argument is not malformed, transforms it to a tuple of (train_size, valid_size, test_size) and normalizes it if necessary so that all elements sum to 1.

(See Dataset.split docs for more info).

Parameters

split_ratio ((float | list[float] | tuple[float])) – The split_ratio should either be a float in the interval (0.0, 1.0) (size of train) or a list / tuple of floats of length 2 (or 3) that are all larger than 0 and that represent the relative sizes of train, (val), test splits. If given as a list / tuple, the relative sizes don’t have to sum up to 1.0 (they are normalized automatically).

Returns

A tuple of (train_size, valid_size, test_size) whose elements sum to 1.0.

Return type

tuple[float]

Raises

ValueError – If the ratio doesn’t obey any of the expected formats described above.

podium.datasets.dataset.rationed_split(examples, train_ratio, val_ratio, test_ratio, shuffle)

Splits a list of examples according to the given ratios and returns the splits as a tuple of lists (train_examples, valid_examples, test_examples).

The list can also be randomly shuffled before splitting.

Parameters
  • examples (list) – A list of examples that is to be split according to the ratios.

  • train_ratio (float) – The fraction of examples that should be put into the train split.

  • val_ratio (float) – The fraction of examples that should be put into the valid split.

  • test_ratio (float) – The fraction of examples that should be put into the test split.

  • shuffle (bool) – Whether to shuffle the list before splitting.

Returns

The train, valid and test splits, each as a list of examples.

Return type

tuple

Raises

ValueError – If the given split ratio is wrong in the sense that it would result in at least one empty split.

podium.datasets.dataset.stratified_split(examples, train_ratio, val_ratio, test_ratio, strata_field_name, shuffle)

Performs a stratified split on a list of examples according to the given ratios and the given strata field.

Returns the splits as a tuple of lists (train_examples, valid_examples, test_examples).

The list can also be randomly shuffled before splitting.

Parameters
  • examples (list) – A list of examples that is to be split according to the ratios.

  • train_ratio (float) – The fraction of examples that should be put into the train split.

  • val_ratio (float) – The fraction of examples that should be put into the valid split.

  • test_ratio (float) – The fraction of examples that should be put into the test split.

  • strata_field_name (str) – Name of the field that the examples should be stratified over. The values of the strata field have to be hashable. Default is ‘label’ for the conventional label field.

  • shuffle (bool) – Whether to shuffle the list before splitting.

Returns

The stratified train, valid and test splits, each as a list of examples.

Return type

tuple
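
A minimal sketch, assuming examples is a list of examples whose label field holds hashable values:

from podium.datasets.dataset import stratified_split

train, valid, test = stratified_split(
    examples, 0.7, 0.1, 0.2, strata_field_name="label", shuffle=True)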

podium.datasets.hierarhical_dataset module

class podium.datasets.hierarhical_dataset.HierarchicalDataset(parser, fields)

Bases: object

Container for datasets with a hierarchical structure of examples which have the same structure on every level of the hierarchy.

class Node(example, index, parent)

Bases: object

Class defines a node in a hierarchical dataset.

example

example instance containing node data

Type

Example

index

index in current hierarchy level

Type

int

parent

parent node

Type

Node

children

children nodes

Type

tuple(Node)

__getstate__()

Method obtains dataset state. It is used for pickling dataset data to file.

Returns

state – dataset state dictionary

Return type

dict

__setstate__(state)

Method sets dataset state. It is used for unpickling dataset data from file.

Parameters

state (dict) – dataset state dictionary

as_flat_dataset()

Returns a standard Dataset containing the examples in the order defined by flatten().

Returns

a standard Dataset

Return type

Dataset

property depth

Returns

The maximum depth of a node in the hierarchy.

Return type

int

finalize_fields()

Finalizes all fields in this dataset.

flatten()

Returns an iterable iterating through the examples in the dataset as if it were a standard Dataset. The iteration is done in pre-order.

Returns

iterable iterating through examples in the dataset.

Return type

iterable

static from_json(dataset, fields, parser)

Creates a HierarchicalDataset from a JSON-formatted string.

Parameters
  • dataset (str) – Dataset in JSON format. The root element of the JSON string must be a list of root examples.

  • fields (dict(str, Field)) – a dict mapping keys in the raw_example to corresponding fields in the dataset.

  • parser (callable(raw_example, fields, depth) returning (example, raw_children)) – Callable taking (raw_example, fields, depth) and returning a tuple containing (example, raw_children).

Returns

dataset containing the data

Return type

HierarchicalDataset

Raises

If the base element in the JSON string is not a list of root elements.

get_context(index, levels=None)

Returns an Iterator iterating through the context of the Example with the passed index.

Parameters
  • index (int) – Index of the Example the context should be retrieved for.

  • levels (int) – the maximum number of levels of the hierarchy the context should contain. If None, the context will contain all levels up to the root node of the dataset.

Returns

an Iterator iterating through the context of the Example with the passed index.

Return type

Iterator(Node)

Raises

If levels is less than 0.

static get_default_dict_parser(child_attribute_name)

Returns a callable instance that can be used for parsing datasets in which examples on all levels in the hierarchy have children under the same key.

Parameters

child_attribute_name (str) – key used for accessing children in the examples

Return type

Callable(raw_example, fields, depth) returning (example, raw_children)
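
A construction sketch combining get_default_dict_parser and from_json; the JSON structure and the fields dict are assumptions made for illustration:

from podium.datasets.hierarhical_dataset import HierarchicalDataset

# Hypothetical JSON in which every example keeps its children under "replies"
json_str = '[{"text": "root post", "replies": [{"text": "a reply", "replies": []}]}]'

parser = HierarchicalDataset.get_default_dict_parser("replies")
# `fields` is assumed to be a dict mapping raw-example keys to Field objects
dataset = HierarchicalDataset.from_json(json_str, fields, parser)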

podium.datasets.iris_dataset module

class podium.datasets.iris_dataset.IrisDataset

Bases: podium.datasets.dataset.Dataset

This is the classic Iris dataset, perhaps the best-known dataset in the pattern recognition literature.

The fields of this dataset are:

  • sepal_length - float

  • sepal_width - float

  • petal_length - float

  • petal_width - float

  • species - int, specifying the iris species
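
A short usage sketch:

from podium.datasets.iris_dataset import IrisDataset

dataset = IrisDataset()
print(len(dataset))        # number of examples
for example in dataset:    # iterate over examples in order
    pass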

podium.datasets.iterator module

Module contains classes for iterating over datasets.

class podium.datasets.iterator.BucketIterator(batch_size, dataset=None, sort_key=None, shuffle=True, seed=42, look_ahead_multiplier=100, bucket_sort_key=None)

Bases: podium.datasets.iterator.Iterator

Creates a bucket iterator that uses a look-ahead heuristic to try and batch examples in a way that minimizes the amount of necessary padding.

It creates a bucket of size N x batch_size (where N is the look_ahead_multiplier), and sorts that bucket before splitting it into batches, so less padding is necessary.

__iter__()

Returns an iterator object that knows how to iterate over the batches of the given dataset.

Returns

Iterator that iterates over batches of examples in the dataset.

Return type

iter
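
A sketch of bucketing by text length; the field name text and the assumption that its value is a (raw, tokenized) pair are illustrative:

from podium.datasets.iterator import BucketIterator

bucket_iter = BucketIterator(
    batch_size=32,
    dataset=dataset,
    look_ahead_multiplier=100,
    # sort the look-ahead bucket by the (assumed) tokenized text length
    bucket_sort_key=lambda example: len(example.text[1]),
)
for input_batch, target_batch in bucket_iter:
    pass  # train on the batch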

class podium.datasets.iterator.HierarchicalDatasetIterator(batch_size, dataset=None, sort_key=None, shuffle=False, seed=1, internal_random_state=None, context_max_length=None, context_max_depth=None)

Bases: podium.datasets.iterator.Iterator

Iterator used to create batches for Hierarchical Datasets.

It creates batches in the form of lists of matrices. In the batch namedtuple that gets returned, every attribute corresponds to a field in the dataset. For every field in the dataset, the namedtuple contains a list of matrices, where every matrix represents the context of an example in the batch. The rows of a matrix contain numericalized representations of the examples that make up the context of an example in the batch with the representation of the example itself being in the last row of its own context matrix.
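
A usage sketch, assuming hier_dataset is a HierarchicalDataset (see the podium.datasets.hierarhical_dataset module above):

from podium.datasets.iterator import HierarchicalDatasetIterator

hier_iter = HierarchicalDatasetIterator(batch_size=16, dataset=hier_dataset)
for input_batch, target_batch in hier_iter:
    pass  # every batch attribute is a list of context matrices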

class podium.datasets.iterator.Iterator(dataset=None, batch_size=32, sort_key=None, shuffle=True, seed=1, internal_random_state=None)

Bases: object

An iterator that batches data from a dataset after numericalization.

epoch

The number of epochs elapsed up to this point.

Type

int

iterations

The number of iterations elapsed in the current epoch.

Type

int

__call__(dataset: podium.datasets.dataset.Dataset)

Sets the dataset for this Iterator and returns an iterable over the batches of that Dataset. Same as calling iterator.set_dataset() followed by iter(iterator).

Parameters

dataset (Dataset) – Dataset to iterate over.

Return type

Iterable over batches in the Dataset.

__iter__()

Returns an iterator object that knows how to iterate over the given dataset. The iterator yields tuples in the form (input_batch, target_batch). The input_batch and target_batch objects have attributes that correspond to the names of input fields and target fields (respectively) of the dataset. The values of those attributes are numpy matrices, whose rows are the numericalized values of that field in the examples that are in the batch. Rows of sequential fields (that are of variable length) are all padded to a common length. The common length is either the fixed_length attribute of the field or, if that is not given, the maximum length of all examples in the batch.

Returns

Iterator that iterates over batches of examples in the dataset.

Return type

iter
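
A minimal iteration sketch based on the behaviour described above:

from podium.datasets.iterator import Iterator

iterator = Iterator(dataset=dataset, batch_size=32)
for input_batch, target_batch in iterator:
    # each attribute of the batch objects is a numpy matrix of
    # numericalized, padded values for the corresponding field
    pass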

__len__()

Returns the number of batches this iterator provides in one epoch.

Returns

The number of batches provided in one epoch.

Return type

int

get_internal_random_state()

Returns the internal random state of the iterator.

Useful when we want to stop iteration and later continue where we left off. We can store the random state obtained with this method and later initialize another iterator with the same random state and continue iterating.

Only to be called if shuffle is True, otherwise a RuntimeError is raised.

Returns

The internal random state of the iterator.

Return type

tuple

Raises

RuntimeError – If shuffle is False.

set_dataset(dataset: podium.datasets.dataset.Dataset)

Sets the dataset for this Iterator to iterate over. Resets the epoch count.

Parameters

dataset (Dataset) – Dataset to iterate over.

set_internal_random_state(state)

Sets the internal random state of the iterator.

Useful when we want to stop iteration and later continue where we left off. We can take the random state previously obtained from another iterator to initialize this iterator with the same state and continue iterating where the previous iterator stopped.

Only to be called if shuffle is True, otherwise a RuntimeError is raised.

Raises

RuntimeError – If shuffle is False.
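
A pause-and-resume sketch, assuming iterator is an Iterator created with shuffle=True:

state = iterator.get_internal_random_state()

# ... later, possibly after recreating the iterator:
new_iterator = Iterator(dataset=dataset, batch_size=32, shuffle=True)
new_iterator.set_internal_random_state(state)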

class podium.datasets.iterator.SingleBatchIterator(dataset: podium.datasets.dataset.Dataset = None, shuffle=True)

Bases: podium.datasets.iterator.Iterator

Iterator that creates one batch per epoch containing all examples in the dataset.

set_dataset(dataset: podium.datasets.dataset.Dataset)

Sets the dataset for this Iterator to iterate over. Resets the epoch count.

Parameters

dataset (Dataset) – Dataset to iterate over.
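
A sketch; in effect similar to Dataset.batch(), but usable wherever an Iterator is expected:

from podium.datasets.iterator import SingleBatchIterator

single_iter = SingleBatchIterator(dataset=dataset)
input_batch, target_batch = next(iter(single_iter))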

podium.datasets.tabular_dataset module

class podium.datasets.tabular_dataset.TabularDataset(path, format, fields, skip_header=False, csv_reader_params={}, **kwargs)

Bases: podium.datasets.dataset.Dataset

A dataset type for data stored in a single CSV, TSV or JSON file, where each row of the file is a single example.
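
A creation sketch; the file path and the fields dict are assumptions made for illustration:

from podium.datasets.tabular_dataset import TabularDataset

# `fields` is assumed to map column names to Field objects,
# e.g. {"text": text_field, "label": label_field}
dataset = TabularDataset("data/train.csv", "csv", fields, skip_header=True)
dataset.finalize_fields()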

podium.datasets.tabular_dataset.create_examples(reader, format, fields, skip_header)

Creates a list of examples from the given line reader and fields (see TabularDataset.__init__ docs for more info on the fields).

Parameters
  • reader – A reader object that reads one line at a time. Yields either strings (when format is JSON) or lists of values (when format is CSV/TSV).

  • format (str) – Format of the data file that is being read. Can be either CSV, TSV or JSON.

  • fields ((list | dict)) – A list or dict of fields (see TabularDataset.__init__ docs for more info).

  • skip_header (bool) – Whether to skip the first line of the input file. (see TabularDataset.__init__ docs for more info).

Returns

A list of created examples.

Return type

list

Raises

ValueError – If format is JSON and skip_header is True, or if format is CSV/TSV, the fields are given as a dict and skip_header is True.

Module contents

Package contains datasets.

class podium.datasets.Dataset(examples, fields, sort_key=None)

Bases: abc.ABC

Alias of podium.datasets.dataset.Dataset; see the podium.datasets.dataset module above for full documentation.

class podium.datasets.TabularDataset(path, format, fields, skip_header=False, csv_reader_params={}, **kwargs)

Bases: podium.datasets.dataset.Dataset

A dataset type for data stored in a single CSV, TSV or JSON file, where each row of the file is a single example.

class podium.datasets.HierarchicalDataset(parser, fields)

Bases: object

Alias of podium.datasets.hierarhical_dataset.HierarchicalDataset; see the podium.datasets.hierarhical_dataset module above for full documentation.

podium.datasets.stratified_split(examples, train_ratio, val_ratio, test_ratio, strata_field_name, shuffle)

Alias of podium.datasets.dataset.stratified_split; see the podium.datasets.dataset module above for full documentation.

podium.datasets.rationed_split(examples, train_ratio, val_ratio, test_ratio, shuffle)

Alias of podium.datasets.dataset.rationed_split; see the podium.datasets.dataset module above for full documentation.

class podium.datasets.IMDB(dir_path, fields)

Bases: podium.datasets.dataset.Dataset

Simple IMDB dataset with only supervised data, using unprocessed (raw) text.

NAME

dataset name

Type

str

URL

url to the imdb dataset

Type

str

DATASET_DIR

name of the folder in the dataset containing train and test directories

Type

str

ARCHIVE_TYPE

string that defines archive type, used for unpacking dataset

Type

str

TRAIN_DIR

name of the training directory

Type

str

TEST_DIR

name of the directory containing test examples

Type

str

POSITIVE_LABEL_DIR

name of the subdirectory containing examples with positive sentiment

Type

str

NEGATIVE_LABEL_DIR

name of the subdirectory containing examples with negative sentiment

Type

str

TEXT_FIELD_NAME

name of the field containing comment text

Type

str

LABEL_FIELD_NAME

name of the field containing label value

Type

str

POSITIVE_LABEL

positive sentiment label

Type

int

NEGATIVE_LABEL

negative sentiment label

Type

int

static get_dataset_splits(fields=None)

Method creates train and test datasets for the IMDB dataset.

Parameters

fields (dict(str, Field), optional) – Dictionary mapping field name to field. If not given, the method will use `get_default_fields`. Users should use the default field names defined in the class attributes.

Returns

(train_dataset, test_dataset) – tuple containing train dataset and test dataset

Return type

(Dataset, Dataset)
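
A loading sketch using the default fields:

from podium.datasets import IMDB

train_set, test_set = IMDB.get_dataset_splits()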

static get_default_fields()

Method returns default IMDB fields: text and label.

Returns

fields – Dictionary mapping field name to field.

Return type

dict(str, Field)

class podium.datasets.CatacxDataset(dir_path, fields=None)

Bases: podium.datasets.hierarhical_dataset.HierarchicalDataset

Catacx dataset.

static get_dataset(fields=None)

Downloads (if necessary) and loads the dataset. Not supported yet. Raises NotImplementedError if called.

Parameters

fields (dict(str, Field)) – Dictionary that maps field name to field. If None is passed, the default set of fields will be used.

Returns

The loaded dataset.

Return type

CatacxDataset

static get_default_fields()

Method returns a dict of default Catacx fields.

Returns

fields – dict containing all default Catacx fields

Return type

dict(str, Field)

class podium.datasets.CoNLLUDataset(file_path, fields=None)

Bases: podium.datasets.dataset.Dataset

A CoNLL-U dataset class. This class uses all default CoNLL-U fields.

static get_default_fields()

Method returns a dict of the default CoNLL-U fields: id, form, lemma, upos, xpos, feats, head, deprel, deps, misc.

Returns

fields – Dict containing all default CoNLL-U fields.

Return type

dict(str, Field)
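
A loading sketch; the file path is an assumption:

from podium.datasets import CoNLLUDataset

dataset = CoNLLUDataset("data/train.conllu")  # uses the default CoNLL-U fields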

class podium.datasets.SST(file_path, fields, fine_grained=False, subtrees=False)

Bases: podium.datasets.dataset.Dataset

The Stanford Sentiment Treebank dataset.

NAME

dataset name

Type

str

URL

url to the SST dataset

Type

str

DATASET_DIR

name of the folder in the dataset containing train and test directories

Type

str

ARCHIVE_TYPE

string that defines archive type, used for unpacking dataset

Type

str

TEXT_FIELD_NAME

name of the field containing comment text

Type

str

LABEL_FIELD_NAME

name of the field containing label value

Type

str

POSITIVE_LABEL

positive sentiment label

Type

int

NEGATIVE_LABEL

negative sentiment label

Type

int

static get_dataset_splits(fields=None, fine_grained=False, subtrees=False)

Method loads and creates dataset splits for the SST dataset.

Parameters
  • fields (dict(str, Field), optional) – Dictionary mapping field name to field. If not given, the method will use `get_default_fields`. Users should use the default field names defined in the class attributes.

  • fine_grained (bool) – If False, returns the binary (positive/negative) SST dataset, filtering out neutral examples. In that case, please set your Fields not to be eager.

  • subtrees (bool) – If True, also returns the subtrees of each input instance as separate instances. This causes the dataset to become much larger.

Returns

(train_dataset, valid_dataset, test_dataset) – tuple containing train, valid and test dataset

Return type

(Dataset, Dataset, Dataset)
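
A loading sketch using the default fields:

from podium.datasets import SST

train_set, valid_set, test_set = SST.get_dataset_splits(fine_grained=False)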

static get_default_fields()

Method returns default SST fields: text and label.

Returns

fields – Dictionary mapping field name to field.

Return type

dict(str, Field)

class podium.datasets.CornellMovieDialogsConversationalDataset(data, fields=None)

Bases: podium.datasets.dataset.Dataset

Cornell Movie Dialogs Conversational dataset which contains sentences and replies from movies.

static get_default_fields()

Method returns default Cornell Movie Dialogs fields: sentence and reply. Fields share same vocabulary.

Returns

fields – Dictionary mapping field name to field.

Return type

dict(str, Field)

class podium.datasets.EuroVocDataset(eurovoc_labels, crovoc_labels, documents, mappings, fields=None)

Bases: podium.datasets.dataset.Dataset

EuroVoc dataset class that contains labeled documents and the label hierarchy.

get_all_ancestors(label_id)

Returns ids of all ancestors of the label with the given label id.

Parameters

label_id (int) – id of the label

Returns

list of label_ids of all ancestors of the given label or None if the label is not present in the dataset label hierarchies

Return type

list(int)

get_crovoc_label_hierarchy()

Returns CroVoc label hierarchy.

Returns

dictionary that maps label id to label

Return type

dict(int, Label)

static get_default_fields()

Method returns default EuroVoc fields: title, text, eurovoc and crovoc labels.

Returns

fields – Dictionary mapping field name to field.

Return type

dict(str, Field)

get_direct_parents(label_id)

Returns ids of direct parents of the label with the given label id.

Parameters

label_id (int) – id of the label

Returns

list of label_ids of all direct parents of the given label or None if the label is not present in the dataset label hierarchies

Return type

list(int)

get_eurovoc_label_hierarchy()

Returns the EuroVoc label hierarchy.

Returns

dictionary that maps label id to label

Return type

dict(int, Label)

is_ancestor(label_id, example)

Checks if the given label_id is an ancestor of any labels of the example.

Parameters
  • label_id (int) – id of the label

  • example (Example) – example from dataset

Returns

True if the label is an ancestor of any of the example’s labels, False otherwise

Return type

boolean

class podium.datasets.PauzaHRDataset(dir_path, fields)

Bases: podium.datasets.dataset.Dataset

Simple PauzaHR dataset class which uses original reviews.

URL

url to the PauzaHR dataset

Type

str

NAME

dataset name

Type

str

DATASET_DIR

name of the folder in the dataset containing train and test directories

Type

str

ARCHIVE_TYPE

string that defines archive type, used for unpacking dataset

Type

str

TRAIN_DIR

name of the training directory

Type

str

TEST_DIR

name of the directory containing test examples

Type

str

static get_default_fields()

Method returns default PauzaHR fields: rating, source and text.

Returns

fields – Dictionary mapping field name to field.

Return type

dict(str, Field)

static get_train_test_dataset(fields=None)

Method creates train and test datasets for the PauzaHR dataset.

Parameters

fields (dict(str, Field), optional) – Dictionary mapping field name to field. If not given, the method will use `get_default_fields`.

Returns

(train_dataset, test_dataset) – tuple containing train dataset and test dataset

Return type

(Dataset, Dataset)
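
A loading sketch using the default fields:

from podium.datasets import PauzaHRDataset

train_set, test_set = PauzaHRDataset.get_train_test_dataset()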

class podium.datasets.Iterator(dataset=None, batch_size=32, sort_key=None, shuffle=True, seed=1, internal_random_state=None)

Bases: object

Alias of podium.datasets.iterator.Iterator; see the podium.datasets.iterator module above for full documentation.

class podium.datasets.SingleBatchIterator(dataset: podium.datasets.dataset.Dataset = None, shuffle=True)

Bases: podium.datasets.iterator.Iterator

Alias of podium.datasets.iterator.SingleBatchIterator; see the podium.datasets.iterator module above for full documentation.

class podium.datasets.BucketIterator(batch_size, dataset=None, sort_key=None, shuffle=True, seed=42, look_ahead_multiplier=100, bucket_sort_key=None)

Bases: podium.datasets.iterator.Iterator

Alias of podium.datasets.iterator.BucketIterator; see the podium.datasets.iterator module above for full documentation.

class podium.datasets.HierarchicalDatasetIterator(batch_size, dataset=None, sort_key=None, shuffle=False, seed=1, internal_random_state=None, context_max_length=None, context_max_depth=None)

Bases: podium.datasets.iterator.Iterator

Alias of podium.datasets.iterator.HierarchicalDatasetIterator; see the podium.datasets.iterator module above for full documentation.

class podium.datasets.SNLIDataset(file_path, fields)

Bases: podium.datasets.impl.snli_dataset.SNLISimple

An SNLI Dataset class. Unlike SNLISimple, this class includes all the fields of the SNLI dataset by default.

NAME

Name of the Dataset.

Type

str

URL

URL to the SNLI dataset.

Type

str

DATASET_DIR

Name of the directory in which the dataset files are stored.

Type

str

ARCHIVE_TYPE

Archive type, i.e. compression method used for archiving the downloaded dataset file.

Type

str

TRAIN_FILE_NAME

Name of the file in which the train dataset is stored.

Type

str

TEST_FILE_NAME

Name of the file in which the test dataset is stored.

Type

str

DEV_FILE_NAME

Name of the file in which the dev (validation) dataset is stored.

Type

str

ANNOTATOR_LABELS_FIELD_NAME

Name of the field containing annotator labels

Type

str

CAPTION_ID_FIELD_NAME

Name of the field containing caption ID

Type

str

GOLD_LABEL_FIELD_NAME

Name of the field containing gold label

Type

str

PAIR_ID_FIELD_NAME

Name of the field containing pair ID

Type

str

SENTENCE1_FIELD_NAME

Name of the field containing sentence1

Type

str

SENTENCE1_PARSE_FIELD_NAME

Name of the field containing sentence1 parse

Type

str

SENTENCE1_BINARY_PARSE_FIELD_NAME

Name of the field containing sentence1 binary parse

Type

str

SENTENCE2_FIELD_NAME

Name of the field containing sentence2

Type

str

SENTENCE2_PARSE_FIELD_NAME

Name of the field containing sentence2 parse

Type

str

SENTENCE2_BINARY_PARSE_FIELD_NAME

Name of the field containing sentence2 binary parse

Type

str

static get_default_fields()

Method returns all SNLI fields in the following order: annotator_labels, captionID, gold_label, pairID, sentence1, sentence1_parse, sentence1_binary_parse, sentence2, sentence2_parse, sentence2_binary_parse

Returns

fields – Dictionary mapping field names to respective Fields.

Return type

dict(str, Field)

Notes

This dataset includes both parses for every sentence.

static get_train_test_dev_dataset(fields=None)

Method creates train, test and dev (validation) Datasets for the SNLI dataset. If the snli_1.0 directory is not present in the current/working directory, it will be downloaded automatically.

Parameters

fields (dict(str, Field), optional) – A dictionary that maps field names to Field objects. If not supplied, `get_default_fields` is used.

Returns

(train_dataset, test_dataset, dev_dataset) – A tuple containing train, test and dev Datasets respectively.

Return type

(Dataset, Dataset, Dataset)
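
A loading sketch using the default fields (note the train, test, dev return order):

from podium.datasets import SNLIDataset

train_set, test_set, dev_set = SNLIDataset.get_train_test_dev_dataset()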

class podium.datasets.SNLISimple(file_path, fields)

Bases: podium.datasets.dataset.Dataset

A Simple SNLI Dataset class. This class only uses three fields by default: gold_label, sentence1, sentence2.

NAME

Name of the Dataset.

Type

str

URL

URL to the SNLI dataset.

Type

str

DATASET_DIR

Name of the directory in which the dataset files are stored.

Type

str

ARCHIVE_TYPE

Archive type, i.e. compression method used for archiving the downloaded dataset file.

Type

str

TRAIN_FILE_NAME

Name of the file in which the train dataset is stored.

Type

str

TEST_FILE_NAME

Name of the file in which the test dataset is stored.

Type

str

DEV_FILE_NAME

Name of the file in which the dev (validation) dataset is stored.

Type

str

GOLD_LABEL_FIELD_NAME

Name of the field containing gold label

Type

str

SENTENCE1_FIELD_NAME

Name of the field containing sentence1

Type

str

SENTENCE2_FIELD_NAME

Name of the field containing sentence2

Type

str

static get_default_fields()

Method returns the three main SNLI fields in the following order: gold_label, sentence1, sentence2

Returns

fields – Dictionary mapping field names to respective Fields.

Return type

dict(str, Field)

static get_train_test_dev_dataset(fields=None)

Method creates train, test and dev (validation) Datasets for the SNLI dataset. If the snli_1.0 directory is not present in the current/working directory, it will be downloaded automatically.

Parameters

fields (dict(str, Field), optional) – A dictionary that maps field names to Field objects. If not supplied, `get_default_fields` is used.

Returns

(train_dataset, test_dataset, dev_dataset) – A tuple containing train, test and dev Datasets respectively.

Return type

(Dataset, Dataset, Dataset)