podium.datasets package¶
Subpackages¶
- podium.datasets.impl package
- Submodules
- podium.datasets.impl.catacx_comments_dataset module
- podium.datasets.impl.catacx_dataset module
- podium.datasets.impl.conllu_dataset module
- podium.datasets.impl.cornell_movie_dialogs_dataset module
- podium.datasets.impl.croatian_ner_dataset module
- podium.datasets.impl.eurovoc_dataset module
- podium.datasets.impl.imdb_sentiment_dataset module
- podium.datasets.impl.pauza_dataset module
- Module contents
Submodules¶
podium.datasets.dataset module¶
Module contains base classes for datasets.
-
class podium.datasets.dataset.Dataset(examples, fields, sort_key=None)¶
Bases: abc.ABC
General purpose container for datasets defining some common methods.
- A dataset is a list of Example instances, along with the corresponding Field instances, which process the columns of each example.
-
examples
¶ A list of Example objects.
- Type
list
-
fields
¶ A list of Field objects that were used to create examples.
- Type
list
-
__getattr__
(attr)¶ Returns an Iterator iterating over values of the field with the given name for every example in the dataset.
- Parameters
attr (str) – The name of the field whose values are to be returned.
- Returns
an Iterator iterating over values of the field with the given name
for every example in the dataset.
- Raises
AttributeError – If there is no Field with the given name.
-
__getitem__
(i)¶ Returns an example or a new dataset containing the indexed examples.
If indexed with an int, only the example at that position will be returned. If indexed with a slice or iterable, all examples indexed by the object will be collected and a new dataset containing only those examples will be returned. The new dataset will contain copies of the old dataset’s fields and will be identical to the original dataset, except for the number and ordering of examples. See the wiki for detailed examples.
Examples in the returned Dataset are the same ones present in the original dataset. If a complete deep copy of the dataset or its slice is needed, please refer to the get method.
Usage example:
example = dataset[1]         # Indexing by a single integer returns a single example
new_dataset = dataset[1:10]  # Indexing by a slice returns a new dataset
                             # containing the indexed examples.
- Parameters
i (int or slice or iterable) – Index used to index examples.
- Returns
If i is an int, a single example will be returned. If i is a slice or iterable, a copy of this dataset containing only the indexed examples will be returned.
- Return type
single example or Dataset
-
__getstate__
()¶ Obtains the dataset state. Used for pickling the dataset to a file.
- Returns
state – dataset state dictionary
- Return type
dict
-
__iter__
()¶ Iterates over all examples in the dataset in order.
- Yields
example – Yields examples in the dataset.
-
__len__
()¶ Returns the number of examples in the dataset.
- Returns
The number of examples in the dataset.
- Return type
int
-
__setstate__
(state)¶ Sets the dataset state. Used for unpickling the dataset from a file.
- Parameters
state (dict) – dataset state dictionary
-
batch
()¶ Creates an input and target batch containing the whole dataset. The format of the batch is the same as the batches returned by the Iterator class.
- Returns
Two objects containing the input and target batches over the whole dataset.
- Return type
input_batch, target_batch
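A minimal usage sketch (`dataset` is assumed to be an already-built Dataset with finalized fields; the attribute names text and label mirror hypothetical field names and are assumptions):
input_batch, target_batch = dataset.batch()
print(input_batch.text)    # one attribute per input field (name assumed)
print(target_batch.label)  # one attribute per target field (name assumed)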
-
filter
(predicate, inplace=False)¶ Filters examples with the given predicate.
- Parameters
predicate (callable) – A callable that accepts an example and returns True if the example should be kept, False otherwise.
inplace (bool, default False) – If True, perform the operation in place and return None.
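A minimal sketch, assuming `dataset` is an already-built Dataset whose examples have a field named label (the field name is an assumption):
# With inplace=False (the default) the original dataset is left unchanged.
filtered = dataset.filter(lambda ex: ex.label is not None)
# The same filtering done in place; returns None.
dataset.filter(lambda ex: ex.label is not None, inplace=True)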
-
finalize_fields
(*args)¶ Builds vocabularies of all the non-eager fields in the dataset from the Dataset objects given as *args, then finalizes all the fields.
- Parameters
*args – A variable number of Dataset objects from which to build the vocabularies for non-eager fields. If none provided, the vocabularies are built from this Dataset (self).
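A typical usage sketch, assuming `train` and `valid` are Dataset objects that share the same Field instances:
train.finalize_fields()         # build the vocabularies from train itself
# or build the vocabularies from several datasets at once:
train.finalize_fields(train, valid)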
-
get
(i, deep_copy=False)¶ Returns an example or a new dataset containing the indexed examples.
If indexed with an int, only the example at that position will be returned. If indexed with a slice or iterable, all examples indexed by the object will be collected and a new dataset containing only those examples will be returned. The new dataset will contain copies of the old dataset’s fields and will be identical to the original dataset, except for the number and ordering of examples. See the wiki for detailed examples.
Example
# Indexing by a single integer returns a single example
example = dataset.get(1)
# Same as above, but returns a deep copy of the example
example_copy = dataset.get(1, deep_copy=True)
# Multi-indexing returns a new dataset containing the indexed examples
s = slice(1, 10)
new_dataset = dataset.get(s)
new_dataset_copy = dataset.get(s, deep_copy=True)
- Parameters
i (int or slice or iterable) – Index used to index examples.
deep_copy (bool) – If true, the returned dataset will contain deep-copies of this dataset’s examples and fields. If false, existing examples and fields will be reused.
- Returns
If i is an int, a single example will be returned. If i is a slice or iterable, a copy of this dataset containing only the indexed examples will be returned.
- Return type
single example or Dataset
-
numericalize_examples
()¶ Generates and caches numericalized data for every example in the dataset. Call before using the dataset to avoid lazy numericalization during iteration.
-
shuffle_examples
(random_state=None)¶ Shuffles the examples in this dataset.
- Parameters
random_state (int) – The random seed used for shuffling.
-
split
(split_ratio=0.7, stratified=False, strata_field_name=None, random_state=None, shuffle=True)¶ Creates train-(validation)-test splits from this dataset.
The splits are new Dataset objects, each containing a part of this one’s examples.
- Parameters
split_ratio ((float | list[float] | tuple[float])) – If type is float, a number in the interval (0.0, 1.0) denoting the amount of data to be used for the train split (the rest is used for test). If type is list or tuple, it should be of length 2 (or 3) and the numbers should denote the relative sizes of train, (valid) and test splits respectively. If the relative size for valid is missing (length is 2), only the train-test split is returned (valid is taken to be 0.0). Also, the relative sizes don’t have to sum up to 1.0 (they are normalized automatically). The ratio must not be so unbalanced that it would result in either of the splits being empty (having zero elements). Default is 0.7 (for the train set).
stratified (bool) – Whether the split should be stratified. A stratified split means that for each concrete value of the strata field, the given train-val-test ratio is preserved. Usually used on fields representing labels / classes, so that every class is present in each of our splits with the same percentage as in the entire dataset. Default is False.
strata_field_name (str) – Name of the field that is to be used to do the stratified split. Only relevant when ‘stratified’ is true. If the name of the strata field is not provided (the default behaviour), the stratified split will be done over the first field that is a target (its ‘is_target’ attribute is True). Note that the values of the strata field have to be hashable. Default is None.
random_state (int) – The random seed used for shuffling.
- Returns
Datasets for train, (validation) and test splits in that order, depending on the split ratios that were provided.
- Return type
tuple[Dataset]
- Raises
ValueError – If the given split ratio is not in one of the valid forms. If the given split ratio is in a valid form, but wrong in the sense that it would result with at least one empty split. If stratified is True and the field with the given strata_field_name doesn’t exist.
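A sketch of the documented split_ratio forms (`dataset` is assumed to be an already-built Dataset; the strata field name label is an assumption):
train, test = dataset.split(split_ratio=0.8)                 # float form
train, valid, test = dataset.split(split_ratio=(0.7, 0.15, 0.15))
train, valid, test = dataset.split(
    split_ratio=[70, 15, 15],  # relative sizes are normalized automatically
    stratified=True,
    strata_field_name="label",
    random_state=42,
)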
-
podium.datasets.dataset.check_split_ratio(split_ratio)¶ Checks that the split ratio argument is not malformed; if it is valid, transforms it into a tuple of (train_size, valid_size, test_size), normalized if necessary so that all elements sum to 1.
(See Dataset.split docs for more info).
- Parameters
split_ratio ((float | list[float] | tuple[float])) – The split_ratio should either be a float in the interval (0.0, 1.0) (size of train) or a list / tuple of floats of length 2 (or 3) that are all larger than 0 and that represent the relative sizes of train, (val), test splits. If given as a list / tuple, the relative sizes don’t have to sum up to 1.0 (they are normalized automatically).
- Returns
A tuple of (train_size, valid_size, test_size) whose elements sum to 1.0.
- Return type
tuple[float]
- Raises
ValueError – If the ratio doesn’t obey any of the expected formats described above.
-
podium.datasets.dataset.rationed_split(examples, train_ratio, val_ratio, test_ratio, shuffle)¶ Splits a list of examples according to the given ratios and returns the splits as a tuple of lists (train_examples, valid_examples, test_examples).
The list can also be randomly shuffled before splitting.
- Parameters
examples (list) – A list of examples that is to be split according to the ratios.
train_ratio (float) – The fraction of examples that should be put into the train split.
val_ratio (float) – The fraction of examples that should be put into the valid split.
test_ratio (float) – The fraction of examples that should be put into the test split.
shuffle (bool) – Whether to shuffle the list before splitting.
- Returns
The train, valid and test splits, each as a list of examples.
- Return type
tuple
- Raises
ValueError – If the given split ratio is wrong in the sense that it would result with at least one empty split.
-
podium.datasets.dataset.stratified_split(examples, train_ratio, val_ratio, test_ratio, strata_field_name, shuffle)¶ Performs a stratified split on a list of examples according to the given ratios and the given strata field.
Returns the splits as a tuple of lists (train_examples, valid_examples, test_examples).
The list can also be randomly shuffled before splitting.
- Parameters
examples (list) – A list of examples that is to be split according to the ratios.
train_ratio (float) – The fraction of examples that should be put into the train split.
val_ratio (float) – The fraction of examples that should be put into the valid split.
test_ratio (float) – The fraction of examples that should be put into the test split.
strata_field_name (str) – Name of the field that the examples should be stratified over. The values of the strata field have to be hashable. Default is ‘label’ for the conventional label field.
shuffle (bool) – Whether to shuffle the list before splitting.
- Returns
The stratified train, valid and test splits, each as a list of examples.
- Return type
tuple
podium.datasets.hierarhical_dataset module¶
-
class podium.datasets.hierarhical_dataset.HierarchicalDataset(parser, fields)¶
Bases: object
Container for datasets with a hierarchical structure of examples which have the same structure on every level of the hierarchy.
-
class Node(example, index, parent)¶
Bases: object
Defines a node in a hierarchical dataset.
-
example
¶ example instance containing node data
- Type
Example
-
index
¶ index in current hierarchy level
- Type
int
-
parent
¶ parent node
- Type
Node
-
children
¶ children nodes
- Type
tuple(Node)
-
-
__getstate__
()¶ Obtains the dataset state. Used for pickling the dataset to a file.
- Returns
state – dataset state dictionary
- Return type
dict
-
__setstate__
(state)¶ Sets the dataset state. Used for unpickling the dataset from a file.
- Parameters
state (dict) – dataset state dictionary
-
as_flat_dataset
()¶ Returns a standard Dataset containing the examples in order as defined in ‘flatten’.
- Returns
a standard Dataset
- Return type
Dataset
-
property depth¶ The maximum depth of a node in the hierarchy.
- Return type
int
-
finalize_fields
()¶ Finalizes all fields in this dataset.
-
flatten
()¶ Returns an iterable over the examples in the dataset, as if it were a standard Dataset. The iteration is done in pre-order.
- Returns
iterable iterating through examples in the dataset.
- Return type
iterable
-
static
from_json(dataset, fields, parser)¶ Creates a HierarchicalDataset from a JSON-formatted string.
- Parameters
dataset (str) – Dataset in JSON format. The root element of the JSON string must be a list of root examples.
fields (dict(str, Field)) – a dict mapping keys in the raw_example to corresponding fields in the dataset.
parser (callable(raw_example, fields, depth) returning (example, raw_children)) – Callable taking (raw_example, fields, depth) and returning a tuple containing (example, raw_children).
- Returns
dataset containing the data
- Return type
HierarchicalDataset
- Raises
ValueError – If the root element of the JSON string is not a list of root examples.
-
get_context
(index, levels=None)¶ Returns an Iterator iterating through the context of the Example with the passed index.
- Parameters
index (int) – Index of the Example the context should be retrieved for.
levels (int) – the maximum number of levels of the hierarchy the context should contain. If None, the context will contain all levels up to the root node of the dataset.
- Returns
an Iterator iterating through the context of the Example with the passed index.
- Return type
Iterator(Node)
- Raises
ValueError – If levels is less than 0.
-
static
get_default_dict_parser
(child_attribute_name)¶ Returns a callable instance that can be used for parsing datasets in which examples on all levels in the hierarchy have children under the same key.
- Parameters
child_attribute_name (str) – key used for accessing children in the examples
- Return type
Callable(raw_example, fields, depth) returning (example, raw_children)
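A minimal sketch combining get_default_dict_parser and from_json (the children key, the text key and the `fields` dict are assumptions; Field construction is documented elsewhere):
import json
from podium.datasets.hierarhical_dataset import HierarchicalDataset

raw = json.dumps([
    {"text": "root example", "children": [
        {"text": "child example", "children": []},
    ]},
])
parser = HierarchicalDataset.get_default_dict_parser("children")
# `fields` is assumed to map the "text" key to a corresponding Field.
dataset = HierarchicalDataset.from_json(raw, fields, parser)
flat = dataset.as_flat_dataset()  # pre-order traversal as a plain Dataset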
-
podium.datasets.iris_dataset module¶
-
class podium.datasets.iris_dataset.IrisDataset¶
Bases: podium.datasets.dataset.Dataset
The classic Iris dataset, perhaps the best-known database in the pattern recognition literature.
- The fields of this dataset are:
sepal_length - float
sepal_width - float
petal_length - float
petal_width - float
species - int, specifying the iris species
podium.datasets.iterator module¶
Module contains classes for iterating over datasets.
-
class podium.datasets.iterator.BucketIterator(batch_size, dataset=None, sort_key=None, shuffle=True, seed=42, look_ahead_multiplier=100, bucket_sort_key=None)¶
Bases: podium.datasets.iterator.Iterator
Creates a bucket iterator that uses a look-ahead heuristic to try and batch examples in a way that minimizes the amount of necessary padding.
It creates a bucket of size N x batch_size, and sorts that bucket before splitting it into batches, so there is less padding necessary.
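A usage sketch (`dataset` is assumed to be an already-built, finalized Dataset; the bucket sort key peeks at a field named text whose value layout is an assumption not specified in this section):
from podium.datasets.iterator import BucketIterator

bucket_iter = BucketIterator(
    batch_size=32,
    dataset=dataset,
    look_ahead_multiplier=100,  # the bucket holds 100 * 32 examples
    bucket_sort_key=lambda ex: len(ex.text[1]),  # sort the bucket by text length
)
for input_batch, target_batch in bucket_iter:
    pass  # train on the batch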
-
__iter__
()¶ Returns an iterator object that knows how to iterate over the batches of the given dataset.
- Returns
Iterator that iterates over batches of examples in the dataset.
- Return type
iter
-
-
class podium.datasets.iterator.HierarchicalDatasetIterator(batch_size, dataset=None, sort_key=None, shuffle=False, seed=1, internal_random_state=None, context_max_length=None, context_max_depth=None)¶
Bases: podium.datasets.iterator.Iterator
Iterator used to create batches for Hierarchical Datasets.
It creates batches in the form of lists of matrices. In the batch namedtuple that gets returned, every attribute corresponds to a field in the dataset. For every field in the dataset, the namedtuple contains a list of matrices, where every matrix represents the context of an example in the batch. The rows of a matrix contain numericalized representations of the examples that make up the context of an example in the batch with the representation of the example itself being in the last row of its own context matrix.
-
class podium.datasets.iterator.Iterator(dataset=None, batch_size=32, sort_key=None, shuffle=True, seed=1, internal_random_state=None)¶
Bases: object
An iterator that batches data from a dataset after numericalization.
-
epoch
¶ The number of epochs elapsed up to this point.
- Type
int
-
iterations
¶ The number of iterations elapsed in the current epoch.
- Type
int
-
__call__(dataset: podium.datasets.dataset.Dataset)¶ Sets the dataset for this Iterator and returns an iterable over the batches of that Dataset. Same as calling iterator.set_dataset() followed by iter(iterator).
- Parameters
dataset (Dataset) – Dataset to iterate over.
- Return type
Iterable over the batches in the Dataset.
-
__iter__
()¶ Returns an iterator object that knows how to iterate over the given dataset. The iterator yields tuples in the form (input_batch, target_batch). The input_batch and target_batch objects have attributes that correspond to the names of input fields and target fields (respectively) of the dataset. The values of those attributes are numpy matrices, whose rows are the numericalized values of that field in the examples that are in the batch. Rows of sequential fields (that are of variable length) are all padded to a common length. The common length is either the fixed_length attribute of the field or, if that is not given, the maximum length of all examples in the batch.
- Returns
Iterator that iterates over batches of examples in the dataset.
- Return type
iter
-
__len__
()¶ Returns the number of batches this iterator provides in one epoch.
- Returns
The number of batches provided in one epoch.
- Return type
int
-
get_internal_random_state
()¶ Returns the internal random state of the iterator.
Useful when we want to stop iteration and later continue where we left off. We can store the random state obtained with this method and later initialize another iterator with the same random state and continue iterating.
Only to be called if shuffle is True, otherwise a RuntimeError is raised.
- Returns
The internal random state of the iterator.
- Return type
tuple
- Raises
RuntimeError – If shuffle is False.
-
set_dataset
(dataset: podium.datasets.dataset.Dataset)¶ Sets the dataset for this Iterator to iterate over. Resets the epoch count.
- Parameters
dataset (Dataset) – Dataset to iterate over.
-
set_internal_random_state
(state)¶ Sets the internal random state of the iterator.
Useful when we want to stop iteration and later continue where we left off. We can take the random state previously obtained from another iterator to initialize this iterator with the same state and continue iterating where the previous iterator stopped.
Only to be called if shuffle is True, otherwise a RuntimeError is raised.
- Raises
RuntimeError – If shuffle is False.
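A sketch of stopping and resuming iteration with an identical shuffling order (`iterator` and `dataset` are assumed to exist; shuffle must be True on both iterators):
from podium.datasets.iterator import Iterator

state = iterator.get_internal_random_state()  # save mid-iteration
resumed = Iterator(dataset=dataset, batch_size=32, shuffle=True,
                   internal_random_state=state)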
-
-
class podium.datasets.iterator.SingleBatchIterator(dataset: podium.datasets.dataset.Dataset = None, shuffle=True)¶
Bases: podium.datasets.iterator.Iterator
Iterator that creates one batch per epoch containing all examples in the dataset.
-
set_dataset
(dataset: podium.datasets.dataset.Dataset)¶ Sets the dataset for this Iterator to iterate over. Resets the epoch count.
- Parameters
dataset (Dataset) – Dataset to iterate over.
-
podium.datasets.tabular_dataset module¶
-
class podium.datasets.tabular_dataset.TabularDataset(path, format, fields, skip_header=False, csv_reader_params={}, **kwargs)¶
Bases: podium.datasets.dataset.Dataset
A dataset type for data stored in a single CSV, TSV or JSON file, where each row of the file is a single example.
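A construction sketch (the path is hypothetical; `fields` is assumed to be a list of Fields, one per column, or a dict mapping column names to Fields, as described in TabularDataset.__init__):
from podium.datasets.tabular_dataset import TabularDataset

dataset = TabularDataset(
    "data/train.csv",  # hypothetical path
    format="csv",
    fields=fields,
    skip_header=True,  # do not parse the header row as an example
)
dataset.finalize_fields()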
-
podium.datasets.tabular_dataset.create_examples(reader, format, fields, skip_header)¶ Creates a list of examples from the given line reader and fields (see TabularDataset.__init__ docs for more info on the fields).
- Parameters
reader – A reader object that reads one line at a time. Yields either strings (when format is JSON) or lists of values (when format is CSV/TSV).
format (str) – Format of the data file that is being read. Can be either CSV, TSV or JSON.
fields ((list | dict)) – A list or dict of fields (see TabularDataset.__init__ docs for more info).
skip_header (bool) – Whether to skip the first line of the input file. (see TabularDataset.__init__ docs for more info).
- Returns
A list of created examples.
- Return type
list
- Raises
ValueError – If format is JSON and skip_header is True. If format is CSV/TSV, the fields are given as a dict and skip_header is True.
Module contents¶
This package contains datasets.
-
class podium.datasets.Dataset(examples, fields, sort_key=None)¶
Bases: abc.ABC
General purpose container for datasets defining some common methods.
- A dataset is a list of Example instances, along with the corresponding Field instances, which process the columns of each example.
-
examples
¶ A list of Example objects.
- Type
list
-
fields
¶ A list of Field objects that were used to create examples.
- Type
list
-
__getattr__
(attr)¶ Returns an Iterator iterating over values of the field with the given name for every example in the dataset.
- Parameters
attr (str) – The name of the field whose values are to be returned.
- Returns
an Iterator iterating over values of the field with the given name
for every example in the dataset.
- Raises
AttributeError – If there is no Field with the given name.
-
__getitem__
(i)¶ Returns an example or a new dataset containing the indexed examples.
If indexed with an int, only the example at that position will be returned. If indexed with a slice or iterable, all examples indexed by the object will be collected and a new dataset containing only those examples will be returned. The new dataset will contain copies of the old dataset’s fields and will be identical to the original dataset, except for the number and ordering of examples. See the wiki for detailed examples.
Examples in the returned Dataset are the same ones present in the original dataset. If a complete deep copy of the dataset or its slice is needed, please refer to the get method.
Usage example:
example = dataset[1]         # Indexing by a single integer returns a single example
new_dataset = dataset[1:10]  # Indexing by a slice returns a new dataset
                             # containing the indexed examples.
- Parameters
i (int or slice or iterable) – Index used to index examples.
- Returns
If i is an int, a single example will be returned. If i is a slice or iterable, a copy of this dataset containing only the indexed examples will be returned.
- Return type
single example or Dataset
-
__getstate__
()¶ Obtains the dataset state. Used for pickling the dataset to a file.
- Returns
state – dataset state dictionary
- Return type
dict
-
__iter__
()¶ Iterates over all examples in the dataset in order.
- Yields
example – Yields examples in the dataset.
-
__len__
()¶ Returns the number of examples in the dataset.
- Returns
The number of examples in the dataset.
- Return type
int
-
__setstate__
(state)¶ Sets the dataset state. Used for unpickling the dataset from a file.
- Parameters
state (dict) – dataset state dictionary
-
batch
()¶ Creates an input and target batch containing the whole dataset. The format of the batch is the same as the batches returned by the Iterator class.
- Returns
Two objects containing the input and target batches over the whole dataset.
- Return type
input_batch, target_batch
-
filter
(predicate, inplace=False)¶ Filters examples with the given predicate.
- Parameters
predicate (callable) – A callable that accepts an example and returns True if the example should be kept, False otherwise.
inplace (bool, default False) – If True, perform the operation in place and return None.
-
finalize_fields
(*args)¶ Builds vocabularies of all the non-eager fields in the dataset from the Dataset objects given as *args, then finalizes all the fields.
- Parameters
*args – A variable number of Dataset objects from which to build the vocabularies for non-eager fields. If none provided, the vocabularies are built from this Dataset (self).
-
get
(i, deep_copy=False)¶ Returns an example or a new dataset containing the indexed examples.
If indexed with an int, only the example at that position will be returned. If indexed with a slice or iterable, all examples indexed by the object will be collected and a new dataset containing only those examples will be returned. The new dataset will contain copies of the old dataset’s fields and will be identical to the original dataset, except for the number and ordering of examples. See the wiki for detailed examples.
Example
# Indexing by a single integer returns a single example
example = dataset.get(1)
# Same as above, but returns a deep copy of the example
example_copy = dataset.get(1, deep_copy=True)
# Multi-indexing returns a new dataset containing the indexed examples
s = slice(1, 10)
new_dataset = dataset.get(s)
new_dataset_copy = dataset.get(s, deep_copy=True)
- Parameters
i (int or slice or iterable) – Index used to index examples.
deep_copy (bool) – If true, the returned dataset will contain deep-copies of this dataset’s examples and fields. If false, existing examples and fields will be reused.
- Returns
If i is an int, a single example will be returned. If i is a slice or iterable, a copy of this dataset containing only the indexed examples will be returned.
- Return type
single example or Dataset
-
numericalize_examples
()¶ Generates and caches numericalized data for every example in the dataset. Call before using the dataset to avoid lazy numericalization during iteration.
-
shuffle_examples
(random_state=None)¶ Shuffles the examples in this dataset.
- Parameters
random_state (int) – The random seed used for shuffling.
-
split
(split_ratio=0.7, stratified=False, strata_field_name=None, random_state=None, shuffle=True)¶ Creates train-(validation)-test splits from this dataset.
The splits are new Dataset objects, each containing a part of this one’s examples.
- Parameters
split_ratio ((float | list[float] | tuple[float])) – If type is float, a number in the interval (0.0, 1.0) denoting the amount of data to be used for the train split (the rest is used for test). If type is list or tuple, it should be of length 2 (or 3) and the numbers should denote the relative sizes of train, (valid) and test splits respectively. If the relative size for valid is missing (length is 2), only the train-test split is returned (valid is taken to be 0.0). Also, the relative sizes don’t have to sum up to 1.0 (they are normalized automatically). The ratio must not be so unbalanced that it would result in either of the splits being empty (having zero elements). Default is 0.7 (for the train set).
stratified (bool) – Whether the split should be stratified. A stratified split means that for each concrete value of the strata field, the given train-val-test ratio is preserved. Usually used on fields representing labels / classes, so that every class is present in each of our splits with the same percentage as in the entire dataset. Default is False.
strata_field_name (str) – Name of the field that is to be used to do the stratified split. Only relevant when ‘stratified’ is true. If the name of the strata field is not provided (the default behaviour), the stratified split will be done over the first field that is a target (its ‘is_target’ attribute is True). Note that the values of the strata field have to be hashable. Default is None.
random_state (int) – The random seed used for shuffling.
- Returns
Datasets for train, (validation) and test splits in that order, depending on the split ratios that were provided.
- Return type
tuple[Dataset]
- Raises
ValueError – If the given split ratio is not in one of the valid forms. If the given split ratio is in a valid form, but wrong in the sense that it would result with at least one empty split. If stratified is True and the field with the given strata_field_name doesn’t exist.
-
class podium.datasets.TabularDataset(path, format, fields, skip_header=False, csv_reader_params={}, **kwargs)¶
Bases: podium.datasets.dataset.Dataset
A dataset type for data stored in a single CSV, TSV or JSON file, where each row of the file is a single example.
-
class podium.datasets.HierarchicalDataset(parser, fields)¶
Bases: object
Container for datasets with a hierarchical structure of examples which have the same structure on every level of the hierarchy.
-
class Node(example, index, parent)¶
Bases: object
Defines a node in a hierarchical dataset.
-
example
¶ example instance containing node data
- Type
Example
-
index
¶ index in current hierarchy level
- Type
int
-
parent
¶ parent node
- Type
Node
-
children
¶ children nodes
- Type
tuple(Node)
-
-
__getstate__
()¶ Obtains the dataset state. Used for pickling the dataset to a file.
- Returns
state – dataset state dictionary
- Return type
dict
-
__setstate__
(state)¶ Sets the dataset state. Used for unpickling the dataset from a file.
- Parameters
state (dict) – dataset state dictionary
-
as_flat_dataset
()¶ Returns a standard Dataset containing the examples in order as defined in ‘flatten’.
- Returns
a standard Dataset
- Return type
Dataset
-
property depth¶ The maximum depth of a node in the hierarchy.
- Return type
int
-
finalize_fields
()¶ Finalizes all fields in this dataset.
-
flatten
()¶ Returns an iterable over the examples in the dataset, as if it were a standard Dataset. The iteration is done in pre-order.
- Returns
iterable iterating through examples in the dataset.
- Return type
iterable
-
static
from_json(dataset, fields, parser)¶ Creates a HierarchicalDataset from a JSON-formatted string.
- Parameters
dataset (str) – Dataset in JSON format. The root element of the JSON string must be a list of root examples.
fields (dict(str, Field)) – a dict mapping keys in the raw_example to corresponding fields in the dataset.
parser (callable(raw_example, fields, depth) returning (example, raw_children)) – Callable taking (raw_example, fields, depth) and returning a tuple containing (example, raw_children).
- Returns
dataset containing the data
- Return type
HierarchicalDataset
- Raises
ValueError – If the root element of the JSON string is not a list of root examples.
-
get_context
(index, levels=None)¶ Returns an Iterator iterating through the context of the Example with the passed index.
- Parameters
index (int) – Index of the Example the context should be retrieved for.
levels (int) – the maximum number of levels of the hierarchy the context should contain. If None, the context will contain all levels up to the root node of the dataset.
- Returns
an Iterator iterating through the context of the Example with the passed index.
- Return type
Iterator(Node)
- Raises
ValueError – If levels is less than 0.
-
static
get_default_dict_parser
(child_attribute_name)¶ Returns a callable instance that can be used for parsing datasets in which examples on all levels in the hierarchy have children under the same key.
- Parameters
child_attribute_name (str) – key used for accessing children in the examples
- Return type
Callable(raw_example, fields, depth) returning (example, raw_children)
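A usage sketch (`dataset` is assumed to be a loaded HierarchicalDataset):
print(dataset.depth)  # the maximum depth of a node in the hierarchy

# Iterate over the context of the example at flat index 5, limited to
# two hierarchy levels; each item is a Node.
for node in dataset.get_context(5, levels=2):
    pass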
-
podium.datasets.stratified_split(examples, train_ratio, val_ratio, test_ratio, strata_field_name, shuffle)¶ Performs a stratified split on a list of examples according to the given ratios and the given strata field.
Returns the splits as a tuple of lists (train_examples, valid_examples, test_examples).
The list can also be randomly shuffled before splitting.
- Parameters
examples (list) – A list of examples that is to be split according to the ratios.
train_ratio (float) – The fraction of examples that should be put into the train split.
val_ratio (float) – The fraction of examples that should be put into the valid split.
test_ratio (float) – The fraction of examples that should be put into the test split.
strata_field_name (str) – Name of the field that the examples should be stratified over. The values of the strata field have to be hashable. Default is ‘label’ for the conventional label field.
shuffle (bool) – Whether to shuffle the list before splitting.
- Returns
The stratified train, valid and test splits, each as a list of examples.
- Return type
tuple
-
podium.datasets.rationed_split(examples, train_ratio, val_ratio, test_ratio, shuffle)¶ Splits a list of examples according to the given ratios and returns the splits as a tuple of lists (train_examples, valid_examples, test_examples).
The list can also be randomly shuffled before splitting.
- Parameters
examples (list) – A list of examples that is to be split according to the ratios.
train_ratio (float) – The fraction of examples that should be put into the train split.
val_ratio (float) – The fraction of examples that should be put into the valid split.
test_ratio (float) – The fraction of examples that should be put into the test split.
shuffle (bool) – Whether to shuffle the list before splitting.
- Returns
The train, valid and test splits, each as a list of examples.
- Return type
tuple
- Raises
ValueError – If the given split ratio is wrong in the sense that it would result with at least one empty split.
-
class podium.datasets.IMDB(dir_path, fields)¶
Bases: podium.datasets.dataset.Dataset
A simple IMDB dataset class containing only the supervised portion of the data, using unprocessed reviews.
-
NAME
¶ dataset name
- Type
str
-
URL
¶ url to the imdb dataset
- Type
str
-
DATASET_DIR
¶ name of the folder in the dataset containing train and test directories
- Type
str
-
ARCHIVE_TYPE
¶ string that defines archive type, used for unpacking dataset
- Type
str
-
TRAIN_DIR
¶ name of the training directory
- Type
str
-
TEST_DIR
¶ name of the directory containing test examples
- Type
str
-
POSITIVE_LABEL_DIR
¶ name of the subdirectory containing examples with positive sentiment
- Type
str
-
NEGATIVE_LABEL_DIR
¶ name of the subdirectory containing examples with negative sentiment
- Type
str
-
TEXT_FIELD_NAME
¶ name of the field containing comment text
- Type
str
-
LABEL_FIELD_NAME
¶ name of the field containing label value
- Type
str
-
POSITIVE_LABEL
¶ positive sentiment label
- Type
int
-
NEGATIVE_LABEL
¶ negative sentiment label
- Type
int
-
static
get_dataset_splits
(fields=None)¶ Creates the train and test datasets for the IMDB dataset.
- Parameters
fields (dict(str, Field), optional) – A dictionary mapping field names to fields. If not given, the method will use `get_default_fields`. Users should use the default field names defined in the class attributes.
- Returns
(train_dataset, test_dataset) – tuple containing train dataset and test dataset
- Return type
(Dataset, Dataset)
-
static
get_default_fields
()¶ Returns the default IMDB fields: text and label.
- Returns
fields – Dictionary mapping field name to field.
- Return type
dict(str, Field)
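A loading sketch using the default fields:
from podium.datasets import IMDB

train, test = IMDB.get_dataset_splits()
train.finalize_fields()  # build the vocabularies before iterating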
-
-
class podium.datasets.CatacxDataset(dir_path, fields=None)¶
Bases: podium.datasets.hierarhical_dataset.HierarchicalDataset
Catacx dataset.
-
static
get_dataset
(fields=None)¶ Downloads (if necessary) and loads the dataset. Not supported yet. Raises NotImplementedError if called.
- Parameters
fields (dict(str, Field)) – A dictionary that maps field names to fields. If None is passed, the default set of fields will be used.
- Returns
The loaded dataset.
- Return type
CatacxDataset
-
static
get_default_fields
()¶ Method returns a dict of default Catacx fields.
- Returns
fields – dict containing all default Catacx fields
- Return type
dict(str, Field)
-
class podium.datasets.CoNLLUDataset(file_path, fields=None)¶
Bases: podium.datasets.dataset.Dataset
A CoNLL-U dataset class. This class uses all default CoNLL-U fields.
-
static
get_default_fields
()¶ Returns a dict of the default CoNLL-U fields: id, form, lemma, upos, xpos, feats, head, deprel, deps, misc.
- Returns
fields – Dict containing all default CoNLL-U fields.
- Return type
dict(str, Field)
-
class podium.datasets.SST(file_path, fields, fine_grained=False, subtrees=False)¶
Bases: podium.datasets.dataset.Dataset
The Stanford Sentiment Treebank dataset.
-
NAME
¶ dataset name
- Type
str
-
URL
¶ url to the SST dataset
- Type
str
-
DATASET_DIR
¶ name of the folder in the dataset containing train and test directories
- Type
str
-
ARCHIVE_TYPE
¶ string that defines archive type, used for unpacking dataset
- Type
str
-
TEXT_FIELD_NAME
¶ name of the field containing comment text
- Type
str
-
LABEL_FIELD_NAME
¶ name of the field containing label value
- Type
str
-
POSITIVE_LABEL
¶ positive sentiment label
- Type
int
-
NEGATIVE_LABEL
¶ negative sentiment label
- Type
int
-
static
get_dataset_splits
(fields=None, fine_grained=False, subtrees=False)¶ Loads and creates dataset splits for the SST dataset.
- Parameters
fields (dict(str, Field), optional) – A dictionary mapping field names to fields. If not given, the method will use `get_default_fields`. Users should use the default field names defined in the class attributes.
fine_grained (bool) – If False, returns the binary (positive/negative) SST dataset, filtering out neutral examples. In that case, please set your Fields not to be eager.
subtrees (bool) – If True, also returns the subtrees of each input instance as separate instances. This causes the dataset to become much larger.
- Returns
(train_dataset, valid_dataset, test_dataset) – tuple containing train, valid and test dataset
- Return type
(Dataset, Dataset, Dataset)
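A loading sketch for the binary setting (neutral examples are filtered out, so per the note above the Fields should not be eager):
from podium.datasets import SST

train, valid, test = SST.get_dataset_splits(fine_grained=False, subtrees=False)
train.finalize_fields()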
-
static
get_default_fields
()¶ Returns the default SST fields: text and label.
- Returns
fields – Dictionary mapping field name to field.
- Return type
dict(str, Field)
-
-
class podium.datasets.CornellMovieDialogsConversationalDataset(data, fields=None)¶
Bases: podium.datasets.dataset.Dataset
Cornell Movie Dialogs conversational dataset, containing sentences and replies from movies.
-
static
get_default_fields
()¶ Returns the default Cornell Movie Dialogs fields: sentence and reply. Both fields share the same vocabulary.
- Returns
fields – Dictionary mapping field name to field.
- Return type
dict(str, Field)
-
class podium.datasets.EuroVocDataset(eurovoc_labels, crovoc_labels, documents, mappings, fields=None)¶
Bases: podium.datasets.dataset.Dataset
EuroVoc dataset class that contains labeled documents and the label hierarchy.
-
get_all_ancestors
(label_id)¶ Returns ids of all ancestors of the label with the given label id.
- Parameters
label_id (int) – id of the label
- Returns
list of label_ids of all ancestors of the given label or None if the label is not present in the dataset label hierarchies
- Return type
list(int)
-
get_crovoc_label_hierarchy
()¶ Returns CroVoc label hierarchy.
- Returns
A dictionary that maps label ids to labels.
- Return type
dict(int, Label)
-
static
get_default_fields
()¶ Method returns default EuroVoc fields: title, text, eurovoc and crovoc labels.
- Returns
fields – Dictionary mapping field name to field.
- Return type
dict(str, Field)
-
get_direct_parents
(label_id)¶ Returns ids of direct parents of the label with the given label id.
- Parameters
label_id (int) – id of the label
- Returns
list of label_ids of all direct parents of the given label or None if the label is not present in the dataset label hierarchies
- Return type
list(int)
-
get_eurovoc_label_hierarchy
()¶ Returns the EuroVoc label hierarchy.
- Returns
A dictionary that maps label ids to labels.
- Return type
dict(int, Label)
-
is_ancestor
(label_id, example)¶ Checks if the given label_id is an ancestor of any labels of the example.
- Parameters
label_id (int) – id of the label
example (Example) – example from dataset
- Returns
True if label is ancestor to any of the example labels, False otherwise
- Return type
boolean
-
-
class podium.datasets.PauzaHRDataset(dir_path, fields)¶
Bases: podium.datasets.dataset.Dataset
Simple PauzaHR dataset class which uses original reviews.
-
URL
¶ url to the PauzaHR dataset
- Type
str
-
NAME
¶ dataset name
- Type
str
-
DATASET_DIR
¶ name of the folder in the dataset containing train and test directories
- Type
str
-
ARCHIVE_TYPE
¶ string that defines archive type, used for unpacking dataset
- Type
str
-
TRAIN_DIR
¶ name of the training directory
- Type
str
-
TEST_DIR
¶ name of the directory containing test examples
- Type
str
-
static
get_default_fields
()¶ Method returns default PauzaHR fields: rating, source and text.
- Returns
fields – Dictionary mapping field name to field.
- Return type
dict(str, Field)
-
static
get_train_test_dataset
(fields=None)¶ Creates the train and test datasets for the PauzaHR dataset.
- Parameters
fields (dict(str, Field), optional) – A dictionary mapping field names to fields. If not given, the method will use `get_default_fields`.
- Returns
(train_dataset, test_dataset) – tuple containing train dataset and test dataset
- Return type
(Dataset, Dataset)
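A loading sketch using the default rating, source and text fields:
from podium.datasets import PauzaHRDataset

train, test = PauzaHRDataset.get_train_test_dataset()
train.finalize_fields()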
-
-
class podium.datasets.Iterator(dataset=None, batch_size=32, sort_key=None, shuffle=True, seed=1, internal_random_state=None)¶
Bases: object
An iterator that batches data from a dataset after numericalization.
-
epoch
¶ The number of epochs elapsed up to this point.
- Type
int
-
iterations
¶ The number of iterations elapsed in the current epoch.
- Type
int
-
__call__(dataset: podium.datasets.dataset.Dataset)¶ Sets the dataset for this Iterator and returns an iterable over the batches of that Dataset. Same as calling iterator.set_dataset() followed by iter(iterator).
- Parameters
dataset (Dataset) – Dataset to iterate over.
- Return type
Iterable over the batches in the Dataset.
-
__iter__
()¶ Returns an iterator object that knows how to iterate over the given dataset. The iterator yields tuples in the form (input_batch, target_batch). The input_batch and target_batch objects have attributes that correspond to the names of input fields and target fields (respectively) of the dataset. The values of those attributes are numpy matrices, whose rows are the numericalized values of that field in the examples that are in the batch. Rows of sequential fields (that are of variable length) are all padded to a common length. The common length is either the fixed_length attribute of the field or, if that is not given, the maximum length of all examples in the batch.
- Returns
Iterator that iterates over batches of examples in the dataset.
- Return type
iter
-
__len__
()¶ Returns the number of batches this iterator provides in one epoch.
- Returns
The number of batches provided in one epoch.
- Return type
int
-
get_internal_random_state
()¶ Returns the internal random state of the iterator.
Useful when we want to stop iteration and later continue where we left off. We can store the random state obtained with this method and later initialize another iterator with the same random state and continue iterating.
Only to be called if shuffle is True, otherwise a RuntimeError is raised.
- Returns
The internal random state of the iterator.
- Return type
tuple
- Raises
RuntimeError – If shuffle is False.
-
set_dataset
(dataset: podium.datasets.dataset.Dataset)¶ Sets the dataset for this Iterator to iterate over. Resets the epoch count.
- Parameters
dataset (Dataset) – Dataset to iterate over.
-
set_internal_random_state
(state)¶ Sets the internal random state of the iterator.
Useful when we want to stop iteration and later continue where we left off. We can take the random state previously obtained from another iterator to initialize this iterator with the same state and continue iterating where the previous iterator stopped.
Only to be called if shuffle is True, otherwise a RuntimeError is raised.
- Raises
RuntimeError – If shuffle is False.
-
-
class podium.datasets.SingleBatchIterator(dataset: podium.datasets.dataset.Dataset = None, shuffle=True)¶
Bases: podium.datasets.iterator.Iterator
Iterator that creates one batch per epoch containing all examples in the dataset.
-
set_dataset
(dataset: podium.datasets.dataset.Dataset)¶ Sets the dataset for this Iterator to iterate over. Resets the epoch count.
- Parameters
dataset (Dataset) – Dataset to iterate over.
-
-
class podium.datasets.BucketIterator(batch_size, dataset=None, sort_key=None, shuffle=True, seed=42, look_ahead_multiplier=100, bucket_sort_key=None)¶
Bases: podium.datasets.iterator.Iterator
Creates a bucket iterator that uses a look-ahead heuristic to try and batch examples in a way that minimizes the amount of necessary padding.
It creates a bucket of size N x batch_size, and sorts that bucket before splitting it into batches, so there is less padding necessary.
-
__iter__
()¶ Returns an iterator object that knows how to iterate over the batches of the given dataset.
- Returns
Iterator that iterates over batches of examples in the dataset.
- Return type
iter
-
-
class podium.datasets.HierarchicalDatasetIterator(batch_size, dataset=None, sort_key=None, shuffle=False, seed=1, internal_random_state=None, context_max_length=None, context_max_depth=None)¶
Bases: podium.datasets.iterator.Iterator
Iterator used to create batches for Hierarchical Datasets.
It creates batches in the form of lists of matrices. In the batch namedtuple that gets returned, every attribute corresponds to a field in the dataset. For every field in the dataset, the namedtuple contains a list of matrices, where every matrix represents the context of an example in the batch. The rows of a matrix contain numericalized representations of the examples that make up the context of an example in the batch with the representation of the example itself being in the last row of its own context matrix.
-
class podium.datasets.SNLIDataset(file_path, fields)¶
Bases: podium.datasets.impl.snli_dataset.SNLISimple
An SNLI dataset class. Unlike SNLISimple, this class includes all the fields of the SNLI dataset by default.
-
NAME
¶ Name of the Dataset.
- Type
str
-
URL
¶ URL to the SNLI dataset.
- Type
str
-
DATASET_DIR
¶ Name of the directory in which the dataset files are stored.
- Type
str
-
ARCHIVE_TYPE
¶ Archive type, i.e. compression method used for archiving the downloaded dataset file.
- Type
str
-
TRAIN_FILE_NAME
¶ Name of the file in which the train dataset is stored.
- Type
str
-
TEST_FILE_NAME
¶ Name of the file in which the test dataset is stored.
- Type
str
-
DEV_FILE_NAME
¶ Name of the file in which the dev (validation) dataset is stored.
- Type
str
-
ANNOTATOR_LABELS_FIELD_NAME
¶ Name of the field containing annotator labels
- Type
str
-
CAPTION_ID_FIELD_NAME
¶ Name of the field containing caption ID
- Type
str
-
GOLD_LABEL_FIELD_NAME
¶ Name of the field containing gold label
- Type
str
-
PAIR_ID_FIELD_NAME
¶ Name of the field containing pair ID
- Type
str
-
SENTENCE1_FIELD_NAME
¶ Name of the field containing sentence1
- Type
str
-
SENTENCE1_PARSE_FIELD_NAME
¶ Name of the field containing sentence1 parse
- Type
str
-
SENTENCE1_BINARY_PARSE_FIELD_NAME
¶ Name of the field containing sentence1 binary parse
- Type
str
-
SENTENCE2_FIELD_NAME
¶ Name of the field containing sentence2
- Type
str
-
SENTENCE2_PARSE_FIELD_NAME
¶ Name of the field containing sentence2 parse
- Type
str
-
SENTENCE2_BINARY_PARSE_FIELD_NAME
¶ Name of the field containing sentence2 binary parse
- Type
str
-
static
get_default_fields
()¶ Method returns all SNLI fields in the following order: annotator_labels, captionID, gold_label, pairID, sentence1, sentence1_parse, sentence1_binary_parse, sentence2, sentence2_parse, sentence2_binary_parse
- Returns
fields – Dictionary mapping field names to respective Fields.
- Return type
dict(str, Field)
Notes
This dataset includes both the regular and the binary parse for every sentence.
-
static
get_train_test_dev_dataset
(fields=None)¶ Creates train, test and dev (validation) Datasets for the SNLI dataset. If the snli_1.0 directory is not present in the current working directory, it will be downloaded automatically.
- Parameters
fields (dict(str, Field), optional) – A dictionary that maps field names to Field objects. If not supplied, `get_default_fields` is used.
- Returns
(train_dataset, test_dataset, dev_dataset) – A tuple containing train, test and dev Datasets respectively.
- Return type
(Dataset, Dataset, Dataset)
-
-
class podium.datasets.SNLISimple(file_path, fields)¶
Bases: podium.datasets.dataset.Dataset
A Simple SNLI Dataset class. This class only uses three fields by default: gold_label, sentence1, sentence2.
-
NAME
¶ Name of the Dataset.
- Type
str
-
URL
¶ URL to the SNLI dataset.
- Type
str
-
DATASET_DIR
¶ Name of the directory in which the dataset files are stored.
- Type
str
-
ARCHIVE_TYPE
¶ Archive type, i.e. compression method used for archiving the downloaded dataset file.
- Type
str
-
TRAIN_FILE_NAME
¶ Name of the file in which the train dataset is stored.
- Type
str
-
TEST_FILE_NAME
¶ Name of the file in which the test dataset is stored.
- Type
str
-
DEV_FILE_NAME
¶ Name of the file in which the dev (validation) dataset is stored.
- Type
str
-
GOLD_LABEL_FIELD_NAME
¶ Name of the field containing gold label
- Type
str
-
SENTENCE1_FIELD_NAME
¶ Name of the field containing sentence1
- Type
str
-
SENTENCE2_FIELD_NAME
¶ Name of the field containing sentence2
- Type
str
-
static
get_default_fields
()¶ Method returns the three main SNLI fields in the following order: gold_label, sentence1, sentence2
- Returns
fields – Dictionary mapping field names to respective Fields.
- Return type
dict(str, Field)
-
static
get_train_test_dev_dataset
(fields=None)¶ Creates train, test and dev (validation) Datasets for the SNLI dataset. If the snli_1.0 directory is not present in the current working directory, it will be downloaded automatically.
- Parameters
fields (dict(str, Field), optional) – A dictionary that maps field names to Field objects. If not supplied, `get_default_fields` is used.
- Returns
(train_dataset, test_dataset, dev_dataset) – A tuple containing train, test and dev Datasets respectively.
- Return type
(Dataset, Dataset, Dataset)
-
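A loading sketch for the simple three-field variant (the archive is downloaded automatically if the snli_1.0 directory is not present):
from podium.datasets import SNLISimple

train, test, dev = SNLISimple.get_train_test_dev_dataset()
train.finalize_fields()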