podium.storage package

Submodules

podium.storage.example_factory module

Module containing the ExampleFactory, used to dynamically create example classes used for storage in Dataset classes.

class podium.storage.example_factory.Example(fieldnames)

Bases: object

Class models one example with fields that hold (raw, tokenized) values, and special fields suffixed with “_” that can cache numericalized values

class podium.storage.example_factory.ExampleFactory(fields)

Bases: object

Class used to create Example instances. Every ExampleFactory dynamically creates its own example class definition optimised for the fields provided in __init__.

create_empty_example()

Method creates empty example with field names stored in example factory.

Returns

example – empty Example instance with initialized field names

Return type

Example

from_csv(data, field_to_index=None, delimiter=', ')

Creates an Example from a CSV line and a corresponding list or dict of Fields.

Parameters
  • data (str) – A string containing a single row of values separated by the given delimiter.

  • field_to_index (dict) – A dict that maps column names to their indices in the line of data. Only needed if fields is a dict, otherwise ignored.

  • delimiter (str) – The delimiter that separates the values in the line of data.

Returns

An Example whose attributes are the given Fields created with the given column values. These Fields can be accessed by their names.

Return type

Example
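A minimal usage sketch, assuming podium is importable; the field names and settings below are illustrative, not taken from the library’s own examples. Note that the default delimiter is ', ':

    from podium.storage import ExampleFactory, Field, Vocab

    # Fields listed in CSV column order; configuration is illustrative.
    text = Field("text", vocab=Vocab(), store_as_raw=True)
    label = Field("label", tokenize=False, store_as_raw=True)

    factory = ExampleFactory([text, label])
    example = factory.from_csv("a very good movie, positive")
    # Parsed values are accessible as attributes named after the fields:
    # example.text, example.label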

from_dict(data)

Method creates example from data in dictionary format.

Parameters

data (dict(str, object)) – dictionary that maps field name to field value

Returns

example – example instance with given data saved to fields

Return type

Example

from_fields_tree(data, subtrees=False, label_transform=None)

Creates an Example (or multiple Examples) from a string representing an nltk tree and a list of corresponding values.

Parameters
  • data (str) – A string containing an nltk tree whose values are to be mapped to Fields.

  • subtrees (bool) – A flag denoting whether an example will be created from every subtree in the tree (when set to True), or just from the whole tree (when set to False).

  • label_transform (callable) – A function which converts the tree labels to a string representation, if wished. Useful for converting multiclass tasks to binary (SST) and making labels verbose. If None, the labels are not changed.

Returns

If subtrees was False, returns an Example whose attributes are the given Fields created with the given column values. These Fields can be accessed by their names.

If subtrees was True, returns a list of such Examples for every subtree in the given tree.

Return type

(Example | list)
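A sketch of tree parsing, reusing the `factory` from the from_csv sketch above; the SST-style tree string, and the assumption that the factory’s two fields line up with the flattened (tokens, label) representation of the tree, are illustrative:

    # Hypothetical SST-style tree: inner node labels are sentiment classes,
    # leaves are the sentence tokens.
    tree_str = "(3 (2 It) (4 (3 works) (2 well)))"

    example = factory.from_fields_tree(tree_str)                  # one Example
    examples = factory.from_fields_tree(tree_str, subtrees=True)  # one per subtree

    # label_transform can e.g. binarize the fine-grained labels:
    example = factory.from_fields_tree(
        tree_str,
        label_transform=lambda lbl: "positive" if int(lbl) > 2 else "negative",
    )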

from_json(data)

Creates an Example from a JSON object and the corresponding fields.

Parameters

data (str) – A string containing a single JSON object (key-value pairs surrounded by curly braces).

Returns

An Example whose attributes are the given Fields created with the given column values. These Fields can be accessed by their names.

Return type

Example

Raises

ValueError – If JSON doesn’t contain key name.
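A sketch assuming the factory was constructed from a dict of Fields whose keys match the JSON keys:

    import json
    from podium.storage import ExampleFactory, Field, Vocab

    fields = {"text": Field("text", vocab=Vocab()),
              "label": Field("label", tokenize=False, store_as_raw=True)}
    factory = ExampleFactory(fields)

    example = factory.from_json(json.dumps({"text": "a very good movie",
                                            "label": "positive"}))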

from_list(data)

Method creates example from data in list format.

Parameters

data (list) – list containing values for the fields, in the order that the fields were given to the example factory

Returns

example – example instance with given data saved to fields

Return type

Example

from_xml_str(data)

Method creates an Example from an XML string.

Parameters

data (str) – XML-formatted string that contains the values of a single data instance that are to be mapped to Fields.

Returns

An Example whose attributes are the given Fields created with the given column values. These Fields can be accessed by their names.

Return type

Example

Raises
  • ValueError – If a field name is not contained in the XML string.

  • ParseError – If there was a problem while parsing the XML string (invalid XML).

class podium.storage.example_factory.ExampleFormat

Bases: enum.Enum

An enumeration.

podium.storage.example_factory.set_example_attributes(example, field, val)

Method sets example attributes with given values.

Parameters
  • example (Example) – example instance to which we are setting attributes

  • field ((Field|tuple(Field))) – field instance or instances that we are mapping

  • val (str) – field value

podium.storage.example_factory.tree_to_list(tree)

Method joins the tree leaves and the label into one list.

Parameters

tree (tree) – nltk tree instance

Returns

tree_list – tree represented as list with its label

Return type

list

podium.storage.field module

Module contains dataset’s field definition and methods for construction.

class podium.storage.field.Field(name, tokenizer='split', language='en', vocab=None, tokenize=True, store_as_raw=False, store_as_tokenized=False, eager=True, is_numericalizable=True, custom_numericalize=None, batch_as_matrix=True, padding_token=-999, is_target=False, fixed_length=None, allow_missing_data=False, missing_data_token=-1)

Bases: object

Holds the preprocessing and numericalization logic for a single field of a dataset.

__getstate__()

Method obtains field state. It is used for pickling dataset data to file.

Returns

state – dataset state dictionary

Return type

dict

__setstate__(state)

Method sets field state. It is used for unpickling dataset data from file.

Parameters

state (dict) – dataset state dictionary

add_posttokenize_hook(hook)

Add a post-tokenization hook to the Field. If multiple hooks are added to the field, the order of their execution will be the same as the order in which they were added to the field, each subsequent hook taking the output of the previous hook as its input. If the same function is added to the Field as a hook multiple times, it will be executed that many times. Post-tokenization hooks are called only if the Field is sequential (in non-sequential fields there is no tokenization and only pre-tokenization hooks are called). The output of the final post-tokenization hook are the raw and tokenized data that the preprocess function will use to produce its result.

Posttokenize hooks have the following outline:

    def post_tok_hook(raw_data, tokenized_data):
        raw_out, tokenized_out = do_stuff(raw_data, tokenized_data)
        return raw_out, tokenized_out

where ‘tokenized_data’ is, and ‘tokenized_out’ should be, an iterable.

Parameters

hook (callable) – The post-tokenization hook that we want to add to the field.

Raises

If the field is declared as non-numericalizable.
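For instance, a hook that lowercases tokens might look like this (a sketch; the field configuration is illustrative):

    from podium.storage import Field, Vocab

    text = Field("text", vocab=Vocab())

    def lowercase_hook(raw_data, tokenized_data):
        # Receives the (raw, tokenized) pair; the tokenized output
        # must remain an iterable.
        return raw_data, [token.lower() for token in tokenized_data]

    text.add_posttokenize_hook(lowercase_hook)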

add_pretokenize_hook(hook)

Add a pre-tokenization hook to the Field. If multiple hooks are added to the field, the order of their execution will be the same as the order in which they were added to the field, each subsequent hook taking the output of the previous hook as its input. If the same function is added to the Field as a hook multiple times, it will be executed that many times. The output of the final pre-tokenization hook is the raw data that the tokenizer will get as its input.

Pretokenize hooks have the following signature:

    def pre_tok_hook(raw_data):
        raw_data_out = do_stuff(raw_data)
        return raw_data_out

This can be used to eliminate encoding errors in data, replace numbers and names, etc.

Parameters

hook (callable) – The pre-tokenization hook that we want to add to the field.
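For instance, continuing the sketch above, a hook that strips hypothetical HTML line breaks from the raw string before tokenization:

    def strip_breaks_hook(raw_data):
        # Maps raw data to cleaned raw data before the tokenizer runs.
        return raw_data.replace("<br />", " ")

    text.add_pretokenize_hook(strip_breaks_hook)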

finalize()

Signals that this field’s vocab can be built.

property finalized

Returns whether the field’s Vocab was finalized. If the field has no vocab, returns True.

Returns

Whether the field’s Vocab was finalized. If the field has no vocab, returns True.

Return type

bool

get_default_value()

Method obtains default field value for missing data.

Returns

The index of the missing data token, if this field is numericalizable. None value otherwise.

Return type

missing_symbol index or None

Raises

ValueError – If missing data is not allowed in this field.

get_numericalization_for_example(example, cache=True)

Returns the numericalized data of this field for the provided example. The numericalized data is generated and cached in the example if ‘cache’ is true and the cached data is not already present. If already cached, the cached data is returned.

Parameters
  • example (Example) – example to get numericalized data for.

  • cache (bool) – whether to cache the calculated numericalization if it is not already cached

Returns

numericalized data – The numericalized data.

Return type

numpy array

get_output_fields()

Returns an Iterable of the contained output fields.

Returns

an Iterable of the contained output fields.

Return type

Iterable

numericalize(data)

Numericalize the already preprocessed data point based either on the vocab that was previously built, or on a custom numericalization function, if the field doesn’t use a vocab.

Parameters

data ((hashable, iterable(hashable))) – Tuple of (raw, tokenized) preprocessed input data. If the field is sequential, ‘raw’ is ignored and can be None. Otherwise, ‘tokenized’ is ignored and can be None.

Returns

Array of stoi indexes of the tokens, if data exists. None, if data is missing and missing data is allowed.

Return type

numpy array

Raises

ValueError – If data is None and missing data is not allowed in this field.

pad_to_length(row, length, custom_pad_symbol=None, pad_left=False, truncate_left=False)

Either pads the given row with pad symbols, or truncates the row to be of given length. The vocab provides the pad symbol for all fields that have vocabs, otherwise the pad symbol has to be given as a parameter.

Parameters
  • row (np.ndarray) – The row of numericalized data that is to be padded / truncated.

  • length (int) – The desired length of the row.

  • custom_pad_symbol (int) – The pad symbol that is to be used if the field doesn’t have a vocab. If the field has a vocab, this parameter is ignored and can be None.

  • pad_left (bool) – If True, padding will be done on the left side, otherwise on the right side. Default: False.

  • truncate_left (bool) – If True, the field will be truncated on the left side, otherwise on the right side. Default: False.

Raises

ValueError – If the field doesn’t use a vocab and no custom pad symbol was given.
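A sketch for a vocab-less field, where the custom pad symbol must be supplied (the field configuration is illustrative):

    import numpy as np
    from podium.storage import Field

    field = Field("nums", custom_numericalize=int, store_as_raw=True)

    padded = field.pad_to_length(np.array([4, 11, 7]), 5, custom_pad_symbol=0)
    # -> array([ 4, 11,  7,  0,  0])
    truncated = field.pad_to_length(np.array([4, 11, 7]), 2, truncate_left=True)
    # -> array([11,  7])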

preprocess(data)

Preprocesses raw data, tokenizing it if the field is sequential, updating the vocab if the field is eager and preserving the raw data if the field’s ‘store_as_raw’ is true.

Parameters

data (str or iterable(hashable)) – The raw data that needs to be preprocessed. String if ‘store_as_raw’ and/or ‘tokenize’ attributes are True. iterable(hashable) if store_as_tokenized attribute is True.

Returns

A tuple of (raw, tokenized). If the field’s ‘store_as_raw’ attribute is False, then ‘raw’ will be None (we don’t preserve the raw data). If field’s ‘tokenize’ and ‘store_as_tokenized’ attributes are False then ‘tokenized’ will be None. The attributes ‘store_as_raw’, ‘store_as_tokenized’ and ‘tokenize’ will never all be False, so the function will never return (None, None).

Return type

(str, Iterable(hashable))

Raises

If data is None and missing data is not allowed.

remove_posttokenize_hooks()

Remove all the post-tokenization hooks that were added to the Field.

remove_pretokenize_hooks()

Remove all the pre-tokenization hooks that were added to the Field.

update_vocab(raw, tokenized)

Updates the vocab with a data point in its raw and tokenized form. If the field is sequential, the vocab is updated with the tokenized form (and ‘raw’ can be None), otherwise the raw form is used to update (and ‘tokenized’ can be None).

Parameters
  • raw (hashable) – The raw form of the data point that the vocab is to be updated with. If the field is sequential, this parameter is ignored and can be None.

  • tokenized (iterable(hashable)) – The tokenized form of the data point that the vocab is to be updated with. If the field is NOT sequential (‘store_as_tokenized’ and ‘tokenize’ attributes are False), this parameter is ignored and can be None.

property use_vocab

A flag that tells whether the field uses a vocab or not.

Returns

Whether the field uses a vocab or not.

Return type

bool

class podium.storage.field.LabelField(name, vocab=None, eager=True, custom_numericalize=None, batch_as_matrix=True, allow_missing_data=False, missing_data_token=-1)

Bases: podium.storage.field.Field

class podium.storage.field.MultilabelField(name, num_of_classes=None, vocab=None, eager=True, custom_numericalize=None, batch_as_matrix=True, allow_missing_data=False, missing_data_token=-1)

Bases: podium.storage.field.TokenizedField

Class used for storing pre-tokenized labels. Used for multilabeled datasets.

finalize()

Signals that this field’s vocab can be built.

class podium.storage.field.MultioutputField(output_fields, tokenizer='split', language='en')

Bases: object

Field that does pretokenization and tokenization once and passes it to its output fields. Output fields are any type of field. The output fields are used only for posttokenization processing (posttokenization hooks and vocab updating).

add_output_field(field)

Adds the passed field to this field’s output fields.

Parameters

field (Field) – Field to add to output fields.

add_pretokenize_hook(hook)

Add a pre-tokenization hook to the MultioutputField. If multiple hooks are added to the field, the order of their execution will be the same as the order in which they were added to the field, each subsequent hook taking the output of the previous hook as its input. If the same function is added to the Field as a hook multiple times, it will be executed that many times. The output of the final pre-tokenization hook is the raw data that the tokenizer will get as its input.

Pretokenize hooks have the following signature:

    def pre_tok_hook(raw_data):
        raw_data_out = do_stuff(raw_data)
        return raw_data_out

This can be used to eliminate encoding errors in data, replace numbers and names, etc.

Parameters

hook (callable) – The pre-tokenization hook that we want to add to the field.

get_output_fields()

Returns an Iterable of the contained output fields.

Returns

an Iterable of the contained output fields.

Return type

Iterable

remove_pretokenize_hooks()

Remove all the pre-tokenization hooks that were added to the MultioutputField.
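A sketch of running one tokenization pass and feeding the tokens to two output fields (the configuration is illustrative):

    from podium.storage import Field, MultioutputField, Vocab

    # Both output fields receive the same tokenized data and do their own
    # post-tokenization processing and vocab updating.
    surface = Field("surface", vocab=Vocab(), store_as_tokenized=True)
    shape = Field("shape", vocab=Vocab(), store_as_tokenized=True)

    mo_field = MultioutputField([surface, shape], tokenizer="split")
    list(mo_field.get_output_fields())   # -> the two fields above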

class podium.storage.field.TokenizedField(name, vocab=None, eager=True, custom_numericalize=None, batch_as_matrix=True, padding_token=-999, is_target=False, fixed_length=None, allow_missing_data=False, missing_data_token=-1)

Bases: podium.storage.field.Field

Tokenized version of the Field. Holds the preprocessing and numericalization logic for the pre-tokenized dataset fields.

podium.storage.field.unpack_fields(fields)

Flattens the given fields object into a flat list of fields.

Parameters

fields ((list | dict)) – List or dict that can contain nested tuples and None as values and column names as keys (dict).

Returns

A flat list of Fields found in the given ‘fields’ object.

Return type

list[Field]
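A sketch of flattening a mixed fields dict (the Field construction is illustrative):

    from podium.storage import Field, Vocab, unpack_fields

    fields = {
        "text": Field("text", vocab=Vocab()),
        "id": None,                                   # ignored column
        "meta": (Field("author"), Field("title")),    # nested tuple is flattened
    }
    flat = unpack_fields(fields)   # -> flat list of the three Field instances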

podium.storage.vocab module

Module contains classes related to the vocabulary.

class podium.storage.vocab.SpecialVocabSymbols

Bases: enum.Enum

Class for special vocabulary symbols

UNK

Tag for unknown word

Type

str

PAD

Tag for padding symbol

Type

str

class podium.storage.vocab.Vocab(max_size=None, min_freq=1, specials=(<SpecialVocabSymbols.UNK: '<unk>'>, <SpecialVocabSymbols.PAD: '<pad>'>), keep_freqs=False)

Bases: object

Class for storing vocabulary. It supports frequency counting and size limiting.

finalized

true if the vocab is finalized, false otherwise

Type

bool

itos

list of words

Type

list

stoi

mapping from word string to index

Type

dict

__add__(values: Union[Vocab, Iterable])

Method allows another vocabulary or a set of values to be added to this vocabulary, producing a new Vocab.

If max_size is None for either of the two Vocabs, the max_size of the resulting Vocab will also be None. If both are defined, the max_size of the resulting Vocab will be the sum of the max_sizes.

Parameters

values (Iterable or Vocab) – If Vocab, a new Vocab will be created containing all of the special symbols and tokens from both Vocabs. If Iterable, a new Vocab will be returned containing a copy of this Vocab with the iterables’ tokens added.

Returns

Returns a new Vocab

Return type

Vocab

Raises

RuntimeError – If this vocab is finalized and an attempt is made to add values to it, or if one Vocab is finalized and the other is not.
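A sketch of both flavours of addition (token values are illustrative):

    from podium.storage import Vocab

    v1 = Vocab(max_size=100)
    v1 += ["the", "quick", "fox"]        # __iadd__: updates frequencies in place

    v2 = Vocab(max_size=50)
    v2 += ["lazy", "dog", "the"]

    merged = v1 + v2             # new Vocab; resulting max_size is 100 + 50 = 150
    extended = v1 + ["brown"]    # copy of v1 with the new token's frequency added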

__eq__(other)

Two vocabs are equal if they have the same finalization status, the same stoi and itos mappings, and the same frequency counters.

Parameters

other (object) – the object to compare this Vocab with for equality

Returns

equal – true if two vocabs are same, false otherwise

Return type

bool

__getitem__(token)

Returns the token index of the passed token. If the passed token has no index and the vocab contains the UNK special token, the UNK token index is returned; otherwise, an exception is raised.

Parameters

token (str) – token whose index is to be returned.

Returns

stoi index of the token.

Return type

int

Raises

KeyError – If the passed token has no index and vocab has no UNK special token.

__iadd__(values: Union[Vocab, Iterable])

Adds additional values or another Vocab to this Vocab.

Parameters

values (Iterable or Vocab) –

Values to be added to this Vocab. If Vocab, all of the token frequencies and specials from that Vocab will be added to this Vocab.

If Iterable, all of the tokens from the Iterable will be added to this Vocab, increasing the frequencies of those tokens.

Returns

vocab – Returns current Vocab instance to enable chaining

Return type

Vocab

Raises
  • RuntimeError – If the current vocab is finalized, if ‘values’ is a string or if the RHS Vocab doesn’t contain token frequencies.

  • TypeError – If the values cannot be iterated over.

__iter__()

Method returns an iterator over the vocabulary. If the vocabulary is not finalized, iteration is done over the frequency counter and special symbols are not included; otherwise, it is performed over itos and special symbols are included.

Returns

iterator over vocab tokens

Return type

iter

__len__()

Method calculates the vocab length, including special symbols.

Returns

length – vocab size including special symbols

Return type

int

finalize()

Method finalizes vocab building. It also releases the frequency counter if the user chose not to keep frequencies.

Raises

RuntimeError – If the vocab is already finalized.

get_freqs()

Method obtains vocabulary frequencies.

Returns

freq – mapping frequency for every word

Return type

Counter

Raises

RuntimeError – If the user chose not to keep frequencies and the vocab is finalized.

property has_specials

Property that checks if the vocabulary contains special symbols.

Returns

flag – true if the vocabulary has special symbols, false otherwise.

Return type

bool

numericalize(data)

Method numericalizes given tokens.

Parameters

data (iter(str)) – iterable collection of tokens

Returns

numericalized_vector – numpy array of numericalized tokens

Return type

array-like

Raises

RuntimeError – If the vocabulary is not finalized.
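A minimal end-to-end sketch of building and using a Vocab (token values are illustrative):

    from podium.storage import Vocab

    vocab = Vocab()
    vocab += ["a", "cat", "sat", "a"]
    vocab.finalize()                    # builds itos/stoi; required before use

    indices = vocab.numericalize(["a", "cat"])     # numpy array of stoi indices
    tokens = vocab.reverse_numericalize(indices)   # -> ["a", "cat"]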

padding_index()

Method returns padding symbol index.

Returns

pad_symbol_index – padding symbol index in the vocabulary

Return type

int

Raises

ValueError – If the padding symbol is not present in the vocabulary.

reverse_numericalize(numericalized_data: Iterable)

Transforms an iterable containing numericalized data into a list of tokens. The tokens are read from this Vocab’s itos and no additional processing is done.

Parameters

numericalized_data (Iterable) – data to be reverse numericalized

Returns

a list of tokens

Return type

list

Raises

RuntimeError – If the vocabulary is not finalized.

class podium.storage.vocab.VocabDict(default_factory=None, *args, **kwargs)

Bases: dict

Vocab dictionary class that behaves like a defaultdict, but does not add a missing key to the dictionary.

podium.storage.vocab.unique(values: Iterable)

Generator that iterates over the first occurrence of every value in values, preserving original order.

Parameters

values (Iterable) – Iterable of values

Yields

the first occurrence of every value in values, preserving order.
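A tiny illustration of the generator’s behaviour:

    from podium.storage.vocab import unique

    list(unique([3, 1, 3, 2, 1]))   # -> [3, 1, 2]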

Module contents

Package contains modules for storing and loading datasets and vectors.

class podium.storage.BaseDownloader

Bases: abc.ABC

BaseDownloader interface for downloader classes.

abstract classmethod download(uri, path, overwrite=False, **kwargs)

Function downloads a file from the given URI to the given path. If overwrite is true and the given path already exists, it will be overwritten with the new file.

Parameters
  • uri (str) – URI of file that needs to be downloaded

  • path (str) – destination path where to save downloaded file

  • overwrite (bool) – if true and the given path exists, the downloaded file will overwrite the existing file

Returns

rewrite_status – True if the download was successful, or False if the file already exists and the given overwrite value was False.

Return type

bool

Raises
  • ValueError – if given uri or path are None

  • RuntimeError – if there was an error while obtaining resource from uri

class podium.storage.SCPDownloader

Bases: podium.storage.resources.downloader.BaseDownloader

Class for downloading files from a server using SFTP over the SSH protocol.

USER_NAME_KEY

key for defining keyword argument for username

Type

str

PASSWORD_KEY

key for defining the keyword argument for the password; if the private key file uses a passphrase, the user should define it here

Type

str, optional

HOST_ADDR_KEY

key for defining keyword argument for remote host address

Type

str

PRIVATE_KEY_FILE_KEY

key for defining the keyword argument for the private key location; if the user uses the default Linux private key location, this argument can be set to None

Type

str, optional

classmethod download(uri, path, overwrite=False, **kwargs)

Method downloads a file from the remote machine and saves it to the local path. If overwrite is true and the given path already exists, it will be overwritten with the new file.

Parameters
  • uri (str) – URI of the file on remote machine

  • path (str) – path of the file on local machine

  • overwrite (bool) – if true and the given path exists, the downloaded file will overwrite the existing file

  • kwargs (dict(str, str)) – keyword arguments, described in the class attributes, used for connecting to the remote machine

Returns

rewrite_status – True if the download was successful, or False if the file already exists and the given overwrite value was False.

Return type

bool

Raises
  • ValueError – If given uri or path are None, or if the host is not defined.

  • RuntimeError – If there was an error while obtaining resource from uri.

class podium.storage.HttpDownloader

Bases: podium.storage.resources.downloader.BaseDownloader

Interface for downloader that uses http protocol for data transfer.

class podium.storage.SimpleHttpDownloader

Bases: podium.storage.resources.downloader.HttpDownloader

Downloader that uses the HTTP protocol for downloading. It doesn’t offer content confirmation (as needed, for example, by Google Drive) or any kind of authentication.

classmethod download(uri, path, overwrite=False, **kwargs)

Function downloads a file from the given URI to the given path. If overwrite is true and the given path already exists, it will be overwritten with the new file.

Parameters
  • uri (str) – URI of file that needs to be downloaded

  • path (str) – destination path where to save downloaded file

  • overwrite (bool) – if true and the given path exists, the downloaded file will overwrite the existing file

Returns

rewrite_status – True if the download was successful, or False if the file already exists and the given overwrite value was False.

Return type

bool

Raises
  • ValueError – if given uri or path are None

  • RuntimeError – if there was an error while obtaining resource from uri
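A usage sketch; the URL and destination path are placeholders:

    from podium.storage import SimpleHttpDownloader

    success = SimpleHttpDownloader.download(
        uri="https://example.com/dataset.csv",
        path="/tmp/dataset.csv",
        overwrite=False,
    )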

class podium.storage.Field(name, tokenizer='split', language='en', vocab=None, tokenize=True, store_as_raw=False, store_as_tokenized=False, eager=True, is_numericalizable=True, custom_numericalize=None, batch_as_matrix=True, padding_token=-999, is_target=False, fixed_length=None, allow_missing_data=False, missing_data_token=-1)

Bases: object

Holds the preprocessing and numericalization logic for a single field of a dataset.

__getstate__()

Method obtains field state. It is used for pickling dataset data to file.

Returns

state – dataset state dictionary

Return type

dict

__setstate__(state)

Method sets field state. It is used for unpickling dataset data from file.

Parameters

state (dict) – dataset state dictionary

add_posttokenize_hook(hook)

Add a post-tokenization hook to the Field. If multiple hooks are added to the field, the order of their execution will be the same as the order in which they were added to the field, each subsequent hook taking the output of the previous hook as its input. If the same function is added to the Field as a hook multiple times, it will be executed that many times. Post-tokenization hooks are called only if the Field is sequential (in non-sequential fields there is no tokenization and only pre-tokenization hooks are called). The output of the final post-tokenization hook are the raw and tokenized data that the preprocess function will use to produce its result.

Posttokenize hooks have the following outline:

    def post_tok_hook(raw_data, tokenized_data):
        raw_out, tokenized_out = do_stuff(raw_data, tokenized_data)
        return raw_out, tokenized_out

where ‘tokenized_data’ is, and ‘tokenized_out’ should be, an iterable.

Parameters

hook (callable) – The post-tokenization hook that we want to add to the field.

Raises

If the field is declared as non-numericalizable.

add_pretokenize_hook(hook)

Add a pre-tokenization hook to the Field. If multiple hooks are added to the field, the order of their execution will be the same as the order in which they were added to the field, each subsequent hook taking the output of the previous hook as its input. If the same function is added to the Field as a hook multiple times, it will be executed that many times. The output of the final pre-tokenization hook is the raw data that the tokenizer will get as its input.

Pretokenize hooks have the following signature:

    def pre_tok_hook(raw_data):
        raw_data_out = do_stuff(raw_data)
        return raw_data_out

This can be used to eliminate encoding errors in data, replace numbers and names, etc.

Parameters

hook (callable) – The pre-tokenization hook that we want to add to the field.

finalize()

Signals that this field’s vocab can be built.

property finalized

Returns whether the field’s Vocab was finalized. If the field has no vocab, returns True.

Returns

Whether the field’s Vocab was finalized. If the field has no vocab, returns True.

Return type

bool

get_default_value()

Method obtains default field value for missing data.

Returns

The index of the missing data token, if this field is numericalizable. None value otherwise.

Return type

missing_symbol index or None

Raises

ValueError – If missing data is not allowed in this field.

get_numericalization_for_example(example, cache=True)

Returns the numericalized data of this field for the provided example. The numericalized data is generated and cached in the example if ‘cache’ is true and the cached data is not already present. If already cached, the cached data is returned.

Parameters
  • example (Example) – example to get numericalized data for.

  • cache (bool) – whether to cache the calculated numericalization if it is not already cached

Returns

numericalized data – The numericalized data.

Return type

numpy array

get_output_fields()

Returns an Iterable of the contained output fields.

Returns

an Iterable of the contained output fields.

Return type

Iterable

numericalize(data)

Numericalize the already preprocessed data point based either on the vocab that was previously built, or on a custom numericalization function, if the field doesn’t use a vocab.

Parameters

data ((hashable, iterable(hashable))) – Tuple of (raw, tokenized) preprocessed input data. If the field is sequential, ‘raw’ is ignored and can be None. Otherwise, ‘tokenized’ is ignored and can be None.

Returns

Array of stoi indexes of the tokens, if data exists. None, if data is missing and missing data is allowed.

Return type

numpy array

Raises

ValueError – If data is None and missing data is not allowed in this field.

pad_to_length(row, length, custom_pad_symbol=None, pad_left=False, truncate_left=False)

Either pads the given row with pad symbols, or truncates the row to be of given length. The vocab provides the pad symbol for all fields that have vocabs, otherwise the pad symbol has to be given as a parameter.

Parameters
  • row (np.ndarray) – The row of numericalized data that is to be padded / truncated.

  • length (int) – The desired length of the row.

  • custom_pad_symbol (int) – The pad symbol that is to be used if the field doesn’t have a vocab. If the field has a vocab, this parameter is ignored and can be None.

  • pad_left (bool) – If True, padding will be done on the left side, otherwise on the right side. Default: False.

  • truncate_left (bool) – If True, the field will be truncated on the left side, otherwise on the right side. Default: False.

Raises

ValueError – If the field doesn’t use a vocab and no custom pad symbol was given.

preprocess(data)

Preprocesses raw data, tokenizing it if the field is sequential, updating the vocab if the field is eager and preserving the raw data if the field’s ‘store_as_raw’ is true.

Parameters

data (str or iterable(hashable)) – The raw data that needs to be preprocessed. String if ‘store_as_raw’ and/or ‘tokenize’ attributes are True. iterable(hashable) if store_as_tokenized attribute is True.

Returns

A tuple of (raw, tokenized). If the field’s ‘store_as_raw’ attribute is False, then ‘raw’ will be None (we don’t preserve the raw data). If field’s ‘tokenize’ and ‘store_as_tokenized’ attributes are False then ‘tokenized’ will be None. The attributes ‘store_as_raw’, ‘store_as_tokenized’ and ‘tokenize’ will never all be False, so the function will never return (None, None).

Return type

(str, Iterable(hashable))

Raises

If data is None and missing data is not allowed.

remove_posttokenize_hooks()

Remove all the post-tokenization hooks that were added to the Field.

remove_pretokenize_hooks()

Remove all the pre-tokenization hooks that were added to the Field.

update_vocab(raw, tokenized)

Updates the vocab with a data point in its raw and tokenized form. If the field is sequential, the vocab is updated with the tokenized form (and ‘raw’ can be None), otherwise the raw form is used to update (and ‘tokenized’ can be None).

Parameters
  • raw (hashable) – The raw form of the data point that the vocab is to be updated with. If the field is sequential, this parameter is ignored and can be None.

  • tokenized (iterable(hashable)) – The tokenized form of the data point that the vocab is to be updated with. If the field is NOT sequential (‘store_as_tokenized’ and ‘tokenize’ attributes are False), this parameter is ignored and can be None.

property use_vocab

A flag that tells whether the field uses a vocab or not.

Returns

Whether the field uses a vocab or not.

Return type

bool

class podium.storage.TokenizedField(name, vocab=None, eager=True, custom_numericalize=None, batch_as_matrix=True, padding_token=-999, is_target=False, fixed_length=None, allow_missing_data=False, missing_data_token=-1)

Bases: podium.storage.field.Field

Tokenized version of the Field. Holds the preprocessing and numericalization logic for the pre-tokenized dataset fields.

class podium.storage.LabelField(name, vocab=None, eager=True, custom_numericalize=None, batch_as_matrix=True, allow_missing_data=False, missing_data_token=-1)

Bases: podium.storage.field.Field

class podium.storage.MultilabelField(name, num_of_classes=None, vocab=None, eager=True, custom_numericalize=None, batch_as_matrix=True, allow_missing_data=False, missing_data_token=-1)

Bases: podium.storage.field.TokenizedField

Class used for storing pre-tokenized labels. Used for multilabeled datasets.

finalize()

Signals that this field’s vocab can be built.

class podium.storage.MultioutputField(output_fields, tokenizer='split', language='en')

Bases: object

Field that does pretokenization and tokenization once and passes it to its output fields. Output fields are any type of field. The output fields are used only for posttokenization processing (posttokenization hooks and vocab updating).

add_output_field(field)

Adds the passed field to this field’s output fields.

Parameters

field (Field) – Field to add to output fields.

add_pretokenize_hook(hook)

Add a pre-tokenization hook to the MultioutputField. If multiple hooks are added to the field, the order of their execution will be the same as the order in which they were added to the field, each subsequent hook taking the output of the previous hook as its input. If the same function is added to the Field as a hook multiple times, it will be executed that many times. The output of the final pre-tokenization hook is the raw data that the tokenizer will get as its input.

Pretokenize hooks have the following signature:

    def pre_tok_hook(raw_data):
        raw_data_out = do_stuff(raw_data)
        return raw_data_out

This can be used to eliminate encoding errors in data, replace numbers and names, etc.

Parameters

hook (callable) – The pre-tokenization hook that we want to add to the field.

get_output_fields()

Returns an Iterable of the contained output fields.

Returns

an Iterable of the contained output fields.

Return type

Iterable

remove_pretokenize_hooks()

Remove all the pre-tokenization hooks that were added to the MultioutputField.

podium.storage.unpack_fields(fields)

Flattens the given fields object into a flat list of fields.

Parameters

fields ((list | dict)) – List or dict that can contain nested tuples and None as values and column names as keys (dict).

Returns

A flat list of Fields found in the given ‘fields’ object.

Return type

list[Field]

class podium.storage.LargeResource(**kwargs)

Bases: object

Large resource that needs to download files from a URL. The class also supports archive decompression.

BASE_RESOURCE_DIR

base large files directory path

Type

str

RESOURCE_NAME

key for defining resource directory name parameter

Type

str

URL

key for defining resource url parameter

Type

str

ARCHIVE

key for defining the archiving method parameter

Type

str

SUPPORTED_ARCHIVE

list of supported archive file types

Type

list(str)
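A construction sketch; the kwargs keys come from the class attributes documented above, while the resource name, URL and archive type are placeholders:

    from podium.storage import LargeResource

    resource = LargeResource(**{
        LargeResource.RESOURCE_NAME: "my_dataset",
        LargeResource.URL: "https://example.com/my_dataset.zip",
        LargeResource.ARCHIVE: "zip",
    })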

class podium.storage.SCPLargeResource(**kwargs)

Bases: podium.storage.resources.large_resource.LargeResource

Large resource that downloads files from a URI using the SCP protocol. For other functionality it relies on the LargeResource base class.

SCP_HOST_KEY

key for keyword argument that defines remote host address

Type

str

SCP_USER_KEY

key for keyword argument that defines remote host username

Type

str

SCP_PASS_KEY

key for the keyword argument that defines the remote host password, or the passphrase used by the private key

Type

str, optional

SCP_PRIVATE_KEY

key for the keyword argument that defines the private key location; on Linux it can be omitted if the key is in the default location

Type

str, optional

class podium.storage.VectorStorage(path, default_vector_function=None, cache_path=None, max_vectors=None)

Bases: abc.ABC

Interface for classes that can vectorize tokens. One example of such a vectorizer is word2vec.

abstract __len__()

Method returns number of vectors in vector storage.

Returns

len – number of loaded vectors in vector storage

Return type

int

get_embedding_matrix(vocab=None)

Method constructs embedding matrix.

Note: From python 3.6 dictionaries preserve insertion order https://docs.python.org/3.6/whatsnew/3.6.html#other-language-changes

Parameters

vocab (iter(token)) – collection of tokens for creating the embedding matrix. The typical use case is to pass a vocab or an itos list, or None if you wish to retrieve all loaded vectors. If None is passed, the order of vectors is the insertion order of the loaded vectors in the VectorStorage.

Raises

RuntimeError – If vector storage is not initialized.

abstract get_vector_dim()

Method returns the vector dimension.

Returns

dim – vector dimension

Return type

int

Raises

RuntimeError – if vector storage is not initialized

abstract load_all()

Method loads all vectors stored at the instance path into the vector storage.

Raises
  • IOError – If there was a problem while reading vectors from instance path.

  • ValueError – If instance path is not a valid path.

  • RuntimeError – If different vector size is detected while loading vectors.

abstract load_vocab(vocab)

Method loads vectors for the tokens in the given vocab from the instance path.

Parameters

vocab (iterable object) – vocabulary with unique words

Raises
  • IOError – If there was a problem while reading vectors from instance path.

  • ValueError – If the given path is not a valid path, if the given vocab is None, or if the vector values in the vector storage cannot be cast to float.

  • RuntimeError – If different vector size is detected while loading vectors.

abstract token_to_vector(token)

Method obtains vector for given token.

Parameters

token (str) – token from vocabulary

Returns

vector – vector representation of given token

Return type

array_like

Raises
  • KeyError – If given token doesn’t have vector representation and default vector function is not defined (None).

  • ValueError – If given token is None.

  • RuntimeError – If vector storage is not initialized.

class podium.storage.BasicVectorStorage(path, default_vector_function=<function random_normal_default_vector>, cache_path=None, max_vectors=None, encoding='utf-8', binary=True)

Bases: podium.storage.vectorizers.vectorizer.VectorStorage

Basic implementation of VectorStorage that handles loading vectors from system storage.

_vectors

dictionary offering word to vector mapping

Type

dict

_dim

vector dimension

Type

int

_initialized

has the vector storage been initialized by loading vectors

Type

bool

_binary

if True, the file is read as a binary file. Else, it’s read as a plain utf-8 text file.

Type

bool

get_vector_dim()

Method returns the vector dimension.

Returns

dim – vector dimension

Return type

int

Raises

RuntimeError – if vector storage is not initialized

load_all()

Method loads all vectors stored at the instance path into the vector storage.

Raises
  • IOError – If there was a problem while reading vectors from instance path.

  • ValueError – If instance path is not a valid path.

  • RuntimeError – If different vector size is detected while loading vectors.

load_vocab(vocab)

Method loads vectors for the tokens in the given vocab from the instance path.

Parameters

vocab (iterable object) – vocabulary with unique words

Raises
  • IOError – If there was a problem while reading vectors from instance path.

  • ValueError – If the given path is not a valid path, if the given vocab is None, or if the vector values in the vector storage cannot be cast to float.

  • RuntimeError – If different vector size is detected while loading vectors.

token_to_vector(token)

Method obtains vector for given token.

Parameters

token (str) – token from vocabulary

Returns

vector – vector representation of given token

Return type

array_like

Raises
  • KeyError – If given token doesn’t have vector representation and default vector function is not defined (None).

  • ValueError – If given token is None.

  • RuntimeError – If vector storage is not initialized.
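A loading sketch; the path to a word2vec/GloVe-style vector file is a placeholder:

    from podium.storage import BasicVectorStorage

    storage = BasicVectorStorage(path="vectors/glove.txt", binary=False)
    storage.load_all()              # or storage.load_vocab(vocab) for a subset

    vector = storage.token_to_vector("house")
    dim = storage.get_vector_dim()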

class podium.storage.SpecialVocabSymbols

Bases: enum.Enum

Class for special vocabulary symbols

UNK

Tag for unknown word

Type

str

PAD

Tag for padding symbol

Type

str

class podium.storage.Vocab(max_size=None, min_freq=1, specials=(<SpecialVocabSymbols.UNK: '<unk>'>, <SpecialVocabSymbols.PAD: '<pad>'>), keep_freqs=False)

Bases: object

Class for storing vocabulary. It supports frequency counting and size limiting.

finalized

true if the vocab is finalized, false otherwise

Type

bool

itos

list of words

Type

list

stoi

mapping from word string to index

Type

dict

__add__(values: Union[Vocab, Iterable])

Method allows another vocabulary or a set of values to be added to this vocabulary, producing a new Vocab.

If max_size is None for either of the two Vocabs, the max_size of the resulting Vocab will also be None. If both are defined, the max_size of the resulting Vocab will be the sum of the max_sizes.

Parameters

values (Iterable or Vocab) – If Vocab, a new Vocab will be created containing all of the special symbols and tokens from both Vocabs. If Iterable, a new Vocab will be returned containing a copy of this Vocab with the iterables’ tokens added.

Returns

Returns a new Vocab

Return type

Vocab

Raises

RuntimeError – If this vocab is finalized and an attempt is made to add values to it, or if one Vocab is finalized and the other is not.

__eq__(other)

Two vocabs are equal if they have the same finalization status, the same stoi and itos mappings, and the same frequency counters.

Parameters

other (object) – the object to compare this Vocab with for equality

Returns

equal – true if two vocabs are same, false otherwise

Return type

bool

__getitem__(token)

Returns the token index of the passed token. If the passed token has no index and the vocab contains the UNK special token, the UNK token index is returned; otherwise, an exception is raised.

Parameters

token (str) – token whose index is to be returned.

Returns

stoi index of the token.

Return type

int

Raises

KeyError – If the passed token has no index and vocab has no UNK special token.

__iadd__(values: Union[Vocab, Iterable])

Adds additional values or another Vocab to this Vocab.

Parameters

values (Iterable or Vocab) –

Values to be added to this Vocab. If Vocab, all of the token frequencies and specials from that Vocab will be added to this Vocab.

If Iterable, all of the tokens from the Iterable will be added to this Vocab, increasing the frequencies of those tokens.

Returns

vocab – Returns current Vocab instance to enable chaining

Return type

Vocab

Raises
  • RuntimeError – If the current vocab is finalized, if ‘values’ is a string or if the RHS Vocab doesn’t contain token frequencies.

  • TypeError – If the values cannot be iterated over.

__iter__()

Method returns an iterator over the vocabulary. If the vocabulary is not finalized, iteration is done over the frequency counter and special symbols are not included; otherwise, it is performed over itos and special symbols are included.

Returns

iterator over vocab tokens

Return type

iter

__len__()

Method calculates the vocab length, including special symbols.

Returns

length – vocab size including special symbols

Return type

int

finalize()

Method finalizes vocab building. It also releases the frequency counter if the user chose not to keep frequencies.

Raises

RuntimeError – If the vocab is already finalized.

get_freqs()

Method obtains vocabulary frequencies.

Returns

freq – mapping frequency for every word

Return type

Counter

Raises

RuntimeError – If the user chose not to keep frequencies and the vocab is finalized.

property has_specials

Property that checks if the vocabulary contains special symbols.

Returns

flag – true if the vocabulary has special symbols, false otherwise.

Return type

bool

numericalize(data)

Method numericalizes given tokens.

Parameters

data (iter(str)) – iterable collection of tokens

Returns

numericalized_vector – numpy array of numericalized tokens

Return type

array-like

Raises

RuntimeError – If the vocabulary is not finalized.

padding_index()

Method returns padding symbol index.

Returns

pad_symbol_index – padding symbol index in the vocabulary

Return type

int

Raises

ValueError – If the padding symbol is not present in the vocabulary.

reverse_numericalize(numericalized_data: Iterable)

Transforms an iterable containing numericalized data into a list of tokens. The tokens are read from this Vocab’s itos and no additional processing is done.

Parameters

numericalized_data (Iterable) – data to be reverse numericalized

Returns

a list of tokens

Return type

list

Raises

RuntimeError – If the vocabulary is not finalized.

class podium.storage.ExampleFactory(fields)

Bases: object

Class used to create Example instances. Every ExampleFactory dynamically creates its own example class definition optimised for the fields provided in __init__.

create_empty_example()

Method creates empty example with field names stored in example factory.

Returns

example – empty Example instance with initialized field names

Return type

Example

from_csv(data, field_to_index=None, delimiter=', ')

Creates an Example from a CSV line and a corresponding list or dict of Fields.

Parameters
  • data (str) – A string containing a single row of values separated by the given delimiter.

  • field_to_index (dict) – A dict that maps column names to their indices in the line of data. Only needed if fields is a dict, otherwise ignored.

  • delimiter (str) – The delimiter that separates the values in the line of data.

Returns

An Example whose attributes are the given Fields created with the given column values. These Fields can be accessed by their names.

Return type

Example

from_dict(data)

Method creates example from data in dictionary format.

Parameters

data (dict(str, object)) – dictionary that maps field name to field value

Returns

example – example instance with given data saved to fields

Return type

Example

from_fields_tree(data, subtrees=False, label_transform=None)

Creates an Example (or multiple Examples) from a string representing an nltk tree and a list of corresponding values.

Parameters
  • data (str) – A string containing an nltk tree whose values are to be mapped to Fields.

  • subtrees (bool) – A flag denoting whether an example will be created from every subtree in the tree (when set to True), or just from the whole tree (when set to False).

  • label_transform (callable) – A function which converts the tree labels to a string representation, if wished. Useful for converting multiclass tasks to binary (SST) and making labels verbose. If None, the labels are not changed.

Returns

If subtrees was False, returns an Example whose attributes are the given Fields created with the given column values. These Fields can be accessed by their names.

If subtrees was True, returns a list of such Examples for every subtree in the given tree.

Return type

(Example | list)

from_json(data)

Creates an Example from a JSON object and the corresponding fields.

Parameters

data (str) – A string containing a single JSON object (key-value pairs surrounded by curly braces).

Returns

An Example whose attributes are the given Fields created with the given column values. These Fields can be accessed by their names.

Return type

Example

Raises

ValueError – If JSON doesn’t contain key name.

from_list(data)

Method creates example from data in list format.

Parameters

data (list) – list containing values for the fields, in the order that the fields were given to the example factory

Returns

example – example instance with given data saved to fields

Return type

Example

from_xml_str(data)

Method creates an Example from an XML string.

Parameters

data (str) – XML-formatted string that contains the values of a single data instance that are to be mapped to Fields.

Returns

An Example whose attributes are the given Fields created with the given column values. These Fields can be accessed by their names.

Return type

Example

Raises
  • ValueError – If a field name is not contained in the XML string.

  • ParseError – If there was a problem while parsing the XML string (invalid XML).

class podium.storage.ExampleFormat

Bases: enum.Enum

An enumeration.

class podium.storage.TfIdfVectorizer(vocab=None, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False, specials=None)

Bases: podium.storage.vectorizers.tfidf.CountVectorizer

Class converts data from one field in examples to a matrix of tf-idf features. It is equivalent to the scikit-learn TfidfVectorizer available at https://scikit-learn.org. The class is dependent on the TfidfTransformer defined in the scikit-learn library.

fit(dataset, field)

Learn idf from dataset on data in given field.

Parameters
  • dataset (Dataset) – dataset instance containing the data on which to build the idf matrix

  • field (Field) – which field in dataset to use for tfidf

Returns

self

Return type

TfIdfVectorizer

Raises

ValueError – If dataset or field is None, or if the name of the given field is not in the dataset.

transform(examples, **kwargs)

Transforms examples to an example-term matrix, using the vocabulary given in the constructor.

Parameters

examples (iterable) – an iterable which yields arrays of numericalized tokens

Returns

X – Tf-idf weighted document-term matrix

Return type

sparse matrix, [n_samples, n_features]

Raises
  • ValueError – If examples are None.

  • RuntimeError – If vectorizer is not fitted yet.
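A fit/transform sketch; ‘dataset’ is assumed to be a podium Dataset and ‘text_field’ one of its Fields, and iterating the dataset over its Examples is an assumption as well:

    from podium.storage import TfIdfVectorizer

    vectorizer = TfIdfVectorizer()
    vectorizer.fit(dataset, field=text_field)

    # transform expects an iterable yielding arrays of numericalized tokens;
    # here each example is numericalized through the field itself.
    rows = (text_field.get_numericalization_for_example(ex) for ex in dataset)
    tfidf_matrix = vectorizer.transform(rows)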