podium.storage package

Submodules

podium.storage.example_factory module

Module containing the ExampleFactory, used to dynamically create example classes used for storage in Dataset classes.

class podium.storage.example_factory.Example(fieldnames)

Bases: object

Class models one example with fields that hold (raw, tokenized) values, and special fields suffixed with “_” that can cache numericalized values

class podium.storage.example_factory.ExampleFactory(fields)

Bases: object

Class used to create Example instances. Every ExampleFactory dynamically creates its own example class definition optimised for the fields provided in __init__.

create_empty_example()

Method creates empty example with field names stored in example factory.

Returns

example – empty Example instance with initialized field names

Return type

Example

from_csv(data, field_to_index=None, delimiter=', ')

Creates an Example from a CSV line and a corresponding list or dict of Fields.

Parameters
  • data (str) – A string containing a single row of values separated by the given delimiter.

  • field_to_index (dict) – A dict that maps column names to their indices in the line of data. Only needed if fields is a dict, otherwise ignored.

  • delimiter (str) – The delimiter that separates the values in the line of data.

Returns

An Example whose attributes are the given Fields created with the given column values. These Fields can be accessed by their names.

Return type

Example
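A minimal usage sketch, assuming podium is importable; the field names and settings below are illustrative, not taken from the library’s own examples. Note that the default delimiter is ', ':

    from podium.storage import ExampleFactory, Field, Vocab

    # Fields listed in CSV column order; configuration is illustrative.
    text = Field("text", vocab=Vocab(), store_as_raw=True)
    label = Field("label", tokenize=False, store_as_raw=True)

    factory = ExampleFactory([text, label])
    example = factory.from_csv("a very good movie, positive")
    # Parsed values are accessible as attributes named after the fields:
    # example.text, example.label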

from_dict(data)

Method creates example from data in dictionary format.

Parameters

data (dict(str, object)) – dictionary that maps field name to field value

Returns

example – example instance with given data saved to fields

Return type

Example

from_fields_tree(data, subtrees=False, label_transform=None)

Creates an Example (or multiple Examples) from a string representing an nltk tree and a list of corresponding values.

Parameters
  • data (str) – A string containing an nltk tree whose values are to be mapped to Fields.

  • subtrees (bool) – A flag denoting whether an example will be created from every subtree in the tree (when set to True), or just from the whole tree (when set to False).

  • label_transform (callable) – A function which converts the tree labels to a string representation, if wished. Useful for converting multiclass tasks to binary (SST) and making labels verbose. If None, the labels are not changed.

Returns

If subtrees was False, returns an Example whose attributes are the given Fields created with the given column values. These Fields can be accessed by their names.

If subtrees was True, returns a list of such Examples for every subtree in the given tree.

Return type

(Example | list)
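A sketch of tree parsing, reusing the `factory` from the from_csv sketch above; the SST-style tree string, and the assumption that the factory’s two fields line up with the flattened (tokens, label) representation of the tree, are illustrative:

    # Hypothetical SST-style tree: inner node labels are sentiment classes,
    # leaves are the sentence tokens.
    tree_str = "(3 (2 It) (4 (3 works) (2 well)))"

    example = factory.from_fields_tree(tree_str)                  # one Example
    examples = factory.from_fields_tree(tree_str, subtrees=True)  # one per subtree

    # label_transform can e.g. binarize the fine-grained labels:
    example = factory.from_fields_tree(
        tree_str,
        label_transform=lambda lbl: "positive" if int(lbl) > 2 else "negative",
    )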

from_json(data)

Creates an Example from a JSON object and the corresponding fields.

Parameters

data (str) – A string containing a single JSON object (key-value pairs surrounded by curly braces).

Returns

An Example whose attributes are the given Fields created with the given column values. These Fields can be accessed by their names.

Return type

Example

Raises

ValueError – If JSON doesn’t contain key name.
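A sketch assuming the factory was constructed from a dict of Fields whose keys match the JSON keys:

    import json
    from podium.storage import ExampleFactory, Field, Vocab

    fields = {"text": Field("text", vocab=Vocab()),
              "label": Field("label", tokenize=False, store_as_raw=True)}
    factory = ExampleFactory(fields)

    example = factory.from_json(json.dumps({"text": "a very good movie",
                                            "label": "positive"}))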

from_list(data)

Method creates example from data in list format.

Parameters

data (list) – list containing values for the fields, in the order that the fields were given to the example factory

Returns

example – example instance with given data saved to fields

Return type

Example

from_xml_str(data)

Method creates an Example from an XML string.

Parameters

data (str) – XML-formatted string that contains the values of a single data instance that are to be mapped to Fields.

Returns

An Example whose attributes are the given Fields created with the given column values. These Fields can be accessed by their names.

Return type

Example

Raises
  • ValueError – If a field name is not contained in the XML string.

  • ParseError – If there was a problem while parsing the XML string (invalid XML).

class podium.storage.example_factory.ExampleFormat

Bases: enum.Enum

An enumeration.

podium.storage.example_factory.set_example_attributes(example, field, val)

Method sets example attributes with given values.

Parameters
  • example (Example) – example instance to which we are setting attributes

  • field ((Field|tuple(Field))) – field instance or instances that we are mapping

  • val (str) – field value

podium.storage.example_factory.tree_to_list(tree)

Method joins the tree leaves and the label into one list.

Parameters

tree (tree) – nltk tree instance

Returns

tree_list – tree represented as list with its label

Return type

list

podium.storage.field module

Module contains dataset’s field definition and methods for construction.

class podium.storage.field.Field(name, tokenizer='split', language='en', vocab=None, tokenize=True, store_as_raw=False, store_as_tokenized=False, eager=True, is_numericalizable=True, custom_numericalize=None, batch_as_matrix=True, padding_token=-999, is_target=False, fixed_length=None, allow_missing_data=False, missing_data_token=-1)

Bases: object

Holds the preprocessing and numericalization logic for a single field of a dataset.

__getstate__()

Method obtains field state. It is used for pickling dataset data to file.

Returns

state – dataset state dictionary

Return type

dict

__setstate__(state)

Method sets field state. It is used for unpickling dataset data from file.

Parameters

state (dict) – dataset state dictionary

add_posttokenize_hook(hook)

Add a post-tokenization hook to the Field. If multiple hooks are added to the field, the order of their execution will be the same as the order in which they were added to the field, each subsequent hook taking the output of the previous hook as its input. If the same function is added to the Field as a hook multiple times, it will be executed that many times. Post-tokenization hooks are called only if the Field is sequential (in non-sequential fields there is no tokenization and only pre-tokenization hooks are called). The output of the final post-tokenization hook are the raw and tokenized data that the preprocess function will use to produce its result.

Posttokenize hooks have the following outline:

    def post_tok_hook(raw_data, tokenized_data):
        raw_out, tokenized_out = do_stuff(raw_data, tokenized_data)
        return raw_out, tokenized_out

where ‘tokenized_data’ is, and ‘tokenized_out’ should be, an iterable.

Parameters

hook (callable) – The post-tokenization hook that we want to add to the field.

Raises

If the field is declared as non-numericalizable.
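For instance, a hook that lowercases tokens might look like this (a sketch; the field configuration is illustrative):

    from podium.storage import Field, Vocab

    text = Field("text", vocab=Vocab())

    def lowercase_hook(raw_data, tokenized_data):
        # Receives the (raw, tokenized) pair; the tokenized output
        # must remain an iterable.
        return raw_data, [token.lower() for token in tokenized_data]

    text.add_posttokenize_hook(lowercase_hook)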

add_pretokenize_hook(hook)

Add a pre-tokenization hook to the Field. If multiple hooks are added to the field, the order of their execution will be the same as the order in which they were added to the field, each subsequent hook taking the output of the previous hook as its input. If the same function is added to the Field as a hook multiple times, it will be executed that many times. The output of the final pre-tokenization hook is the raw data that the tokenizer will get as its input.

Pretokenize hooks have the following signature:

    def pre_tok_hook(raw_data):
        raw_data_out = do_stuff(raw_data)
        return raw_data_out

This can be used to eliminate encoding errors in data, replace numbers and names, etc.

Parameters

hook (callable) – The pre-tokenization hook that we want to add to the field.
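For instance, continuing the sketch above, a hook that strips hypothetical HTML line breaks from the raw string before tokenization:

    def strip_breaks_hook(raw_data):
        # Maps raw data to cleaned raw data before the tokenizer runs.
        return raw_data.replace("<br />", " ")

    text.add_pretokenize_hook(strip_breaks_hook)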

finalize()

Signals that this field’s vocab can be built.

property finalized

Returns whether the field’s Vocab was finalized. If the field has no vocab, returns True.

Returns

Whether the field’s Vocab was finalized. If the field has no vocab, returns True.

Return type

bool

get_default_value()

Method obtains default field value for missing data.

Returns

The index of the missing data token, if this field is numericalizable. None value otherwise.

Return type

missing_symbol index or None

Raises

ValueError – If missing data is not allowed in this field.

get_numericalization_for_example(example, cache=True)

Returns the numericalized data of this field for the provided example. The numericalized data is generated and cached in the example if ‘cache’ is true and the cached data is not already present. If already cached, the cached data is returned.

Parameters
  • example (Example) – example to get numericalized data for.

  • cache (bool) – whether to cache the calculated numericalization if it is not already cached

Returns

numericalized data – The numericalized data.

Return type

numpy array

get_output_fields()

Returns an Iterable of the contained output fields.

Returns

an Iterable of the contained output fields.

Return type

Iterable

numericalize(data)

Numericalize the already preprocessed data point based either on the vocab that was previously built, or on a custom numericalization function, if the field doesn’t use a vocab.

Parameters

data ((hashable, iterable(hashable))) – Tuple of (raw, tokenized) preprocessed input data. If the field is sequential, ‘raw’ is ignored and can be None. Otherwise, ‘tokenized’ is ignored and can be None.

Returns

Array of stoi indexes of the tokens, if data exists. None, if data is missing and missing data is allowed.

Return type

numpy array

Raises

ValueError – If data is None and missing data is not allowed in this field.

pad_to_length(row, length, custom_pad_symbol=None, pad_left=False, truncate_left=False)

Either pads the given row with pad symbols, or truncates the row to be of given length. The vocab provides the pad symbol for all fields that have vocabs, otherwise the pad symbol has to be given as a parameter.

Parameters
  • row (np.ndarray) – The row of numericalized data that is to be padded / truncated.

  • length (int) – The desired length of the row.

  • custom_pad_symbol (int) – The pad symbol that is to be used if the field doesn’t have a vocab. If the field has a vocab, this parameter is ignored and can be None.

  • pad_left (bool) – If True, padding will be done on the left side, otherwise on the right side. Default: False.

  • truncate_left (bool) – If True, the field will be truncated on the left side, otherwise on the right side. Default: False.

Raises

ValueError – If the field doesn’t use a vocab and no custom pad symbol was given.
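A sketch for a vocab-less field, where the custom pad symbol must be supplied (the field configuration is illustrative):

    import numpy as np
    from podium.storage import Field

    field = Field("nums", custom_numericalize=int, store_as_raw=True)

    padded = field.pad_to_length(np.array([4, 11, 7]), 5, custom_pad_symbol=0)
    # -> array([ 4, 11,  7,  0,  0])
    truncated = field.pad_to_length(np.array([4, 11, 7]), 2, truncate_left=True)
    # -> array([11,  7])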

preprocess(data)

Preprocesses raw data, tokenizing it if the field is sequential, updating the vocab if the field is eager and preserving the raw data if the field’s ‘store_as_raw’ is true.

Parameters

data (str or iterable(hashable)) – The raw data that needs to be preprocessed. String if ‘store_as_raw’ and/or ‘tokenize’ attributes are True. iterable(hashable) if store_as_tokenized attribute is True.

Returns

A tuple of (raw, tokenized). If the field’s ‘store_as_raw’ attribute is False, then ‘raw’ will be None (we don’t preserve the raw data). If field’s ‘tokenize’ and ‘store_as_tokenized’ attributes are False then ‘tokenized’ will be None. The attributes ‘store_as_raw’, ‘store_as_tokenized’ and ‘tokenize’ will never all be False, so the function will never return (None, None).

Return type

(str, Iterable(hashable))

Raises

If data is None and missing data is not allowed.

remove_posttokenize_hooks()

Remove all the post-tokenization hooks that were added to the Field.

remove_pretokenize_hooks()

Remove all the pre-tokenization hooks that were added to the Field.

update_vocab(raw, tokenized)

Updates the vocab with a data point in its raw and tokenized form. If the field is sequential, the vocab is updated with the tokenized form (and ‘raw’ can be None), otherwise the raw form is used to update (and ‘tokenized’ can be None).

Parameters
  • raw (hashable) – The raw form of the data point that the vocab is to be updated with. If the field is sequential, this parameter is ignored and can be None.

  • tokenized (iterable(hashable)) – The tokenized form of the data point that the vocab is to be updated with. If the field is NOT sequential (‘store_as_tokenized’ and ‘tokenize’ attributes are False), this parameter is ignored and can be None.

property use_vocab

A flag that tells whether the field uses a vocab or not.

Returns

Whether the field uses a vocab or not.

Return type

bool

class podium.storage.field.LabelField(name, vocab=None, eager=True, custom_numericalize=None, batch_as_matrix=True, allow_missing_data=False, missing_data_token=-1)

Bases: podium.storage.field.Field

class podium.storage.field.MultilabelField(name, num_of_classes=None, vocab=None, eager=True, custom_numericalize=None, batch_as_matrix=True, allow_missing_data=False, missing_data_token=-1)

Bases: podium.storage.field.TokenizedField

Class used for storing pre-tokenized labels. Used for multilabeled datasets.

finalize()

Signals that this field’s vocab can be built.

class podium.storage.field.MultioutputField(output_fields, tokenizer='split', language='en')

Bases: object

Field that does pretokenization and tokenization once and passes it to its output fields. Output fields are any type of field. The output fields are used only for posttokenization processing (posttokenization hooks and vocab updating).

add_output_field(field)

Adds the passed field to this field’s output fields.

Parameters

field (Field) – Field to add to output fields.

add_pretokenize_hook(hook)

Add a pre-tokenization hook to the MultioutputField. If multiple hooks are added to the field, the order of their execution will be the same as the order in which they were added to the field, each subsequent hook taking the output of the previous hook as its input. If the same function is added to the Field as a hook multiple times, it will be executed that many times. The output of the final pre-tokenization hook is the raw data that the tokenizer will get as its input.

Pretokenize hooks have the following signature:

    def pre_tok_hook(raw_data):
        raw_data_out = do_stuff(raw_data)
        return raw_data_out

This can be used to eliminate encoding errors in data, replace numbers and names, etc.

Parameters

hook (callable) – The pre-tokenization hook that we want to add to the field.

get_output_fields()

Returns an Iterable of the contained output fields.

Returns

an Iterable of the contained output fields.

Return type

Iterable

remove_pretokenize_hooks()

Remove all the pre-tokenization hooks that were added to the MultioutputField.
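A sketch of running one tokenization pass and feeding the tokens to two output fields (the configuration is illustrative):

    from podium.storage import Field, MultioutputField, Vocab

    # Both output fields receive the same tokenized data and do their own
    # post-tokenization processing and vocab updating.
    surface = Field("surface", vocab=Vocab(), store_as_tokenized=True)
    shape = Field("shape", vocab=Vocab(), store_as_tokenized=True)

    mo_field = MultioutputField([surface, shape], tokenizer="split")
    list(mo_field.get_output_fields())   # -> the two fields above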

class podium.storage.field.TokenizedField(name, vocab=None, eager=True, custom_numericalize=None, batch_as_matrix=True, padding_token=-999, is_target=False, fixed_length=None, allow_missing_data=False, missing_data_token=-1)

Bases: podium.storage.field.Field

Tokenized version of the Field. Holds the preprocessing and numericalization logic for the pre-tokenized dataset fields.

podium.storage.field.unpack_fields(fields)

Flattens the given fields object into a flat list of fields.

Parameters

fields ((list | dict)) – List or dict that can contain nested tuples and None as values and column names as keys (dict).

Returns

A flat list of Fields found in the given ‘fields’ object.

Return type

list[Field]
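A sketch of flattening a mixed fields dict (the Field construction is illustrative):

    from podium.storage import Field, Vocab, unpack_fields

    fields = {
        "text": Field("text", vocab=Vocab()),
        "id": None,                                   # ignored column
        "meta": (Field("author"), Field("title")),    # nested tuple is flattened
    }
    flat = unpack_fields(fields)   # -> flat list of the three Field instances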

podium.storage.vocab module

Module contains classes related to the vocabulary.

class podium.storage.vocab.SpecialVocabSymbols

Bases: enum.Enum

Class for special vocabulary symbols

UNK

Tag for unknown word

Type

str

PAD

Tag for padding symbol

Type

str

class podium.storage.vocab.Vocab(max_size=None, min_freq=1, specials=(<SpecialVocabSymbols.UNK: '<unk>'>, <SpecialVocabSymbols.PAD: '<pad>'>), keep_freqs=False)

Bases: object

Class for storing vocabulary. It supports frequency counting and size limiting.

finalized

true if the vocab is finalized, false otherwise

Type

bool

itos

list of words

Type

list

stoi

mapping from word string to index

Type

dict

__add__(values: Union[Vocab, Iterable])

Method allows another vocabulary or a set of values to be added to this vocabulary, producing a new Vocab.

If max_size is None for either of the two Vocabs, the max_size of the resulting Vocab will also be None. If both are defined, the max_size of the resulting Vocab will be the sum of the max_sizes.

Parameters

values (Iterable or Vocab) – If Vocab, a new Vocab will be created containing all of the special symbols and tokens from both Vocabs. If Iterable, a new Vocab will be returned containing a copy of this Vocab with the iterables’ tokens added.

Returns

Returns a new Vocab

Return type

Vocab

Raises

RuntimeError – If this vocab is finalized and an attempt is made to add values to it, or if one Vocab is finalized and the other is not.
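A sketch of both flavours of addition (token values are illustrative):

    from podium.storage import Vocab

    v1 = Vocab(max_size=100)
    v1 += ["the", "quick", "fox"]        # __iadd__: updates frequencies in place

    v2 = Vocab(max_size=50)
    v2 += ["lazy", "dog", "the"]

    merged = v1 + v2             # new Vocab; resulting max_size is 100 + 50 = 150
    extended = v1 + ["brown"]    # copy of v1 with the new token's frequency added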

__eq__(other)

Two vocabs are equal if they have the same finalization status, the same stoi and itos mappings, and the same frequency counters.

Parameters

other (object) – the object to compare this Vocab with for equality

Returns

equal – true if two vocabs are same, false otherwise

Return type

bool

__getitem__(token)

Returns the token index of the passed token. If the passed token has no index and the vocab contains the UNK special token, the UNK token index is returned; otherwise, an exception is raised.

Parameters

token (str) – token whose index is to be returned.

Returns

stoi index of the token.

Return type

int

Raises

KeyError – If the passed token has no index and vocab has no UNK special token.

__iadd__(values: Union[Vocab, Iterable])

Adds additional values or another Vocab to this Vocab.

Parameters

values (Iterable or Vocab) –

Values to be added to this Vocab. If Vocab, all of the token frequencies and specials from that Vocab will be added to this Vocab.

If Iterable, all of the tokens from the Iterable will be added to this Vocab, increasing the frequencies of those tokens.

Returns

vocab – Returns current Vocab instance to enable chaining

Return type

Vocab

Raises
  • RuntimeError – If the current vocab is finalized, if ‘values’ is a string or if the RHS Vocab doesn’t contain token frequencies.

  • TypeError – If the values cannot be iterated over.

__iter__()

Method returns an iterator over the vocabulary. If the vocabulary is not finalized, iteration is done over the frequency counter and special symbols are not included; otherwise, it is performed over itos and special symbols are included.

Returns

iterator over vocab tokens

Return type

iter

__len__()

Method calculates the vocab length, including special symbols.

Returns

length – vocab size including special symbols

Return type

int

finalize()

Method finalizes vocab building. It also releases the frequency counter if the user chose not to keep frequencies.

Raises

RuntimeError – If the vocab is already finalized.

get_freqs()

Method obtains vocabulary frequencies.

Returns

freq – mapping frequency for every word

Return type

Counter

Raises

RuntimeError – If the user chose not to keep frequencies and the vocab is finalized.

property has_specials

Property that checks if the vocabulary contains special symbols.

Returns

flag – true if the vocabulary has special symbols, false otherwise.

Return type

bool

numericalize(data)

Method numericalizes given tokens.

Parameters

data (iter(str)) – iterable collection of tokens

Returns

numericalized_vector – numpy array of numericalized tokens

Return type

array-like

Raises

RuntimeError – If the vocabulary is not finalized.
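A minimal end-to-end sketch of building and using a Vocab (token values are illustrative):

    from podium.storage import Vocab

    vocab = Vocab()
    vocab += ["a", "cat", "sat", "a"]
    vocab.finalize()                    # builds itos/stoi; required before use

    indices = vocab.numericalize(["a", "cat"])     # numpy array of stoi indices
    tokens = vocab.reverse_numericalize(indices)   # -> ["a", "cat"]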

padding_index()

Method returns padding symbol index.

Returns

pad_symbol_index – padding symbol index in the vocabulary

Return type

int

Raises

ValueError – If the padding symbol is not present in the vocabulary.

reverse_numericalize(numericalized_data: Iterable)

Transforms an iterable containing numericalized data into a list of tokens. The tokens are read from this Vocab’s itos and no additional processing is done.

Parameters

numericalized_data (Iterable) – data to be reverse numericalized

Returns

a list of tokens

Return type

list

Raises

RuntimeError – If the vocabulary is not finalized.

class podium.storage.vocab.VocabDict(default_factory=None, *args, **kwargs)

Bases: dict

Vocab dictionary class that behaves like a defaultdict, but does not add a missing key to the dictionary.

podium.storage.vocab.unique(values: Iterable)

Generator that iterates over the first occurrence of every value in values, preserving original order.

Parameters

values (Iterable) – Iterable of values

Yields

the first occurrence of every value in values, preserving order.
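A tiny illustration of the generator’s behaviour:

    from podium.storage.vocab import unique

    list(unique([3, 1, 3, 2, 1]))   # -> [3, 1, 2]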

Module contents

Package contains modules for storing and loading datasets and vectors.

class podium.storage.BaseDownloader

Bases: abc.ABC

BaseDownloader interface for downloader classes.

abstract classmethod download(uri, path, overwrite=False, **kwargs)

Function downloads a file from the given URI to the given path. If overwrite is true and the given path already exists, it will be overwritten with the new file.

Parameters
  • uri (str) – URI of file that needs to be downloaded

  • path (str) – destination path where to save downloaded file

  • overwrite (bool) – if true and the given path exists, the downloaded file will overwrite the existing file

Returns

rewrite_status – True if the download was successful, or False if the file already exists and the given overwrite value was False.

Return type

bool

Raises
  • ValueError – if given uri or path are None

  • RuntimeError – if there was an error while obtaining resource from uri

class podium.storage.SCPDownloader

Bases: podium.storage.resources.downloader.BaseDownloader

Class for downloading files from a server using SFTP over the SSH protocol.

USER_NAME_KEY

key for defining keyword argument for username

Type

str

PASSWORD_KEY

key for defining the keyword argument for the password; if the private key file uses a passphrase, the user should define it here

Type

str, optional

HOST_ADDR_KEY

key for defining keyword argument for remote host address

Type

str

PRIVATE_KEY_FILE_KEY

key for defining the keyword argument for the private key location; if the user uses the default Linux private key location, this argument can be set to None

Type

str, optional

classmethod download(uri, path, overwrite=False, **kwargs)

Method downloads a file from the remote machine and saves it to the local path. If overwrite is true and the given path already exists, it will be overwritten with the new file.

Parameters
  • uri (str) – URI of the file on remote machine

  • path (str) – path of the file on local machine

  • overwrite (bool) – if true and the given path exists, the downloaded file will overwrite the existing file

  • kwargs (dict(str, str)) – keyword arguments, described in the class attributes, used for connecting to the remote machine

Returns

rewrite_status – True if the download was successful, or False if the file already exists and the given overwrite value was False.

Return type

bool

Raises
  • ValueError – If given uri or path are None, or if the host is not defined.

  • RuntimeError – If there was an error while obtaining resource from uri.

class podium.storage.HttpDownloader

Bases: podium.storage.resources.downloader.BaseDownloader

Interface for downloader that uses http protocol for data transfer.

class podium.storage.SimpleHttpDownloader

Bases: podium.storage.resources.downloader.HttpDownloader

Downloader that uses the HTTP protocol for downloading. It doesn’t offer content confirmation (as needed, for example, by Google Drive) or any kind of authentication.

classmethod download(uri, path, overwrite=False, **kwargs)

Function downloads a file from the given URI to the given path. If overwrite is true and the given path already exists, it will be overwritten with the new file.

Parameters
  • uri (str) – URI of file that needs to be downloaded

  • path (str) – destination path where to save downloaded file

  • overwrite (bool) – if true and the given path exists, the downloaded file will overwrite the existing file

Returns

rewrite_status – True if the download was successful, or False if the file already exists and the given overwrite value was False.

Return type

bool

Raises
  • ValueError – if given uri or path are None

  • RuntimeError – if there was an error while obtaining resource from uri
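A usage sketch; the URL and destination path are placeholders:

    from podium.storage import SimpleHttpDownloader

    success = SimpleHttpDownloader.download(
        uri="https://example.com/dataset.csv",
        path="/tmp/dataset.csv",
        overwrite=False,
    )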

class podium.storage.Field(name, tokenizer='split', language='en', vocab=None, tokenize=True, store_as_raw=False, store_as_tokenized=False, eager=True, is_numericalizable=True, custom_numericalize=None, batch_as_matrix=True, padding_token=-999, is_target=False, fixed_length=None, allow_missing_data=False, missing_data_token=-1)

Bases: object

Holds the preprocessing and numericalization logic for a single field of a dataset.

__getstate__()

Method obtains field state. It is used for pickling dataset data to file.

Returns

state – dataset state dictionary

Return type

dict

__setstate__(state)

Method sets field state. It is used for unpickling dataset data from file.

Parameters

state (dict) – dataset state dictionary

add_posttokenize_hook(hook)

Add a post-tokenization hook to the Field. If multiple hooks are added to the field, the order of their execution will be the same as the order in which they were added to the field, each subsequent hook taking the output of the previous hook as its input. If the same function is added to the Field as a hook multiple times, it will be executed that many times. Post-tokenization hooks are called only if the Field is sequential (in non-sequential fields there is no tokenization and only pre-tokenization hooks are called). The output of the final post-tokenization hook are the raw and tokenized data that the preprocess function will use to produce its result.

Posttokenize hooks have the following outline:

    def post_tok_hook(raw_data, tokenized_data):
        raw_out, tokenized_out = do_stuff(raw_data, tokenized_data)
        return raw_out, tokenized_out

where ‘tokenized_data’ is, and ‘tokenized_out’ should be, an iterable.

Parameters

hook (callable) – The post-tokenization hook that we want to add to the field.

Raises

If the field is declared as non-numericalizable.

add_pretokenize_hook(hook)

Add a pre-tokenization hook to the Field. If multiple hooks are added to the field, the order of their execution will be the same as the order in which they were added to the field, each subsequent hook taking the output of the previous hook as its input. If the same function is added to the Field as a hook multiple times, it will be executed that many times. The output of the final pre-tokenization hook is the raw data that the tokenizer will get as its input.

Pretokenize hooks have the following signature:

    def pre_tok_hook(raw_data):
        raw_data_out = do_stuff(raw_data)
        return raw_data_out

This can be used to eliminate encoding errors in data, replace numbers and names, etc.

Parameters

hook (callable) – The pre-tokenization hook that we want to add to the field.

finalize()

Signals that this field’s vocab can be built.

property finalized

Returns whether the field’s Vocab was finalized. If the field has no vocab, returns True.

Returns

Whether the field’s Vocab was finalized. If the field has no vocab, returns True.

Return type

bool

get_default_value()

Method obtains default field value for missing data.

Returns

The index of the missing data token, if this field is numericalizable. None value otherwise.

Return type

missing_symbol index or None

Raises

ValueError – If missing data is not allowed in this field.

get_numericalization_for_example(example, cache=True)

Returns the numericalized data of this field for the provided example. The numericalized data is generated and cached in the example if ‘cache’ is true and the cached data is not already present. If already cached, the cached data is returned.

Parameters
  • example (Example) – example to get numericalized data for.

  • cache (bool) – whether to cache the calculated numericalization if it is not already cached

Returns

numericalized data – The numericalized data.

Return type

numpy array

get_output_fields()

Returns an Iterable of the contained output fields.

Returns

an Iterable of the contained output fields.

Return type

Iterable

numericalize(data)

Numericalize the already preprocessed data point based either on the vocab that was previously built, or on a custom numericalization function, if the field doesn’t use a vocab.

Parameters

data ((hashable, iterable(hashable))) – Tuple of (raw, tokenized) preprocessed input data. If the field is sequential, ‘raw’ is ignored and can be None. Otherwise, ‘tokenized’ is ignored and can be None.

Returns

Array of stoi indexes of the tokens, if data exists. None, if data is missing and missing data is allowed.

Return type

numpy array

Raises

ValueError – If data is None and missing data is not allowed in this field.

pad_to_length(row, length, custom_pad_symbol=None, pad_left=False, truncate_left=False)

Either pads the given row with pad symbols, or truncates the row to be of given length. The vocab provides the pad symbol for all fields that have vocabs, otherwise the pad symbol has to be given as a parameter.

Parameters
  • row (np.ndarray) – The row of numericalized data that is to be padded / truncated.

  • length (int) – The desired length of the row.

  • custom_pad_symbol (int) – The pad symbol that is to be used if the field doesn’t have a vocab. If the field has a vocab, this parameter is ignored and can be None.

  • pad_left (bool) – If True, padding will be done on the left side, otherwise on the right side. Default: False.

  • truncate_left (bool) – If True, the field will be truncated on the left side, otherwise on the right side. Default: False.

Raises

ValueError – If the field doesn’t use a vocab and no custom pad symbol was given.

preprocess(data)

Preprocesses raw data, tokenizing it if the field is sequential, updating the vocab if the field is eager and preserving the raw data if the field’s ‘store_as_raw’ is true.

Parameters

data (str or iterable(hashable)) – The raw data that needs to be preprocessed. String if ‘store_as_raw’ and/or ‘tokenize’ attributes are True. iterable(hashable) if store_as_tokenized attribute is True.

Returns

A tuple of (raw, tokenized). If the field’s ‘store_as_raw’ attribute is False, then ‘raw’ will be None (we don’t preserve the raw data). If field’s ‘tokenize’ and ‘store_as_tokenized’ attributes are False then ‘tokenized’ will be None. The attributes ‘store_as_raw’, ‘store_as_tokenized’ and ‘tokenize’ will never all be False, so the function will never return (None, None).

Return type

(str, Iterable(hashable))

Raises

If data is None and missing data is not allowed.

remove_posttokenize_hooks()

Remove all the post-tokenization hooks that were added to the Field.

remove_pretokenize_hooks()

Remove all the pre-tokenization hooks that were added to the Field.

update_vocab(raw, tokenized)

Updates the vocab with a data point in its raw and tokenized form. If the field is sequential, the vocab is updated with the tokenized form (and ‘raw’ can be None), otherwise the raw form is used to update (and ‘tokenized’ can be None).

Parameters
  • raw (hashable) – The raw form of the data point that the vocab is to be updated with. If the field is sequential, this parameter is ignored and can be None.

  • tokenized (iterable(hashable)) – The tokenized form of the data point that the vocab is to be updated with. If the field is NOT sequential (‘store_as_tokenized’ and ‘tokenize’ attributes are False), this parameter is ignored and can be None.

property use_vocab

A flag that tells whether the field uses a vocab or not.

Returns

Whether the field uses a vocab or not.

Return type

bool

class podium.storage.TokenizedField(name, vocab=None, eager=True, custom_numericalize=None, batch_as_matrix=True, padding_token=-999, is_target=False, fixed_length=None, allow_missing_data=False, missing_data_token=-1)

Bases: podium.storage.field.Field

Tokenized version of the Field. Holds the preprocessing and numericalization logic for the pre-tokenized dataset fields.

class podium.storage.LabelField(name, vocab=None, eager=True, custom_numericalize=None, batch_as_matrix=True, allow_missing_data=False, missing_data_token=-1)

Bases: podium.storage.field.Field

class podium.storage.MultilabelField(name, num_of_classes=None, vocab=None, eager=True, custom_numericalize=None, batch_as_matrix=True, allow_missing_data=False, missing_data_token=-1)

Bases: podium.storage.field.TokenizedField

Class used for storing pre-tokenized labels. Used for multilabeled datasets.

finalize()

Signals that this field’s vocab can be built.

class podium.storage.MultioutputField(output_fields, tokenizer='split', language='en')

Bases: object

Field that does pretokenization and tokenization once and passes it to its output fields. Output fields are any type of field. The output fields are used only for posttokenization processing (posttokenization hooks and vocab updating).

add_output_field(field)

Adds the passed field to this field’s output fields.

Parameters

field (Field) – Field to add to output fields.

add_pretokenize_hook(hook)

Add a pre-tokenization hook to the MultioutputField. If multiple hooks are added to the field, the order of their execution will be the same as the order in which they were added to the field, each subsequent hook taking the output of the previous hook as its input. If the same function is added to the Field as a hook multiple times, it will be executed that many times. The output of the final pre-tokenization hook is the raw data that the tokenizer will get as its input.

Pretokenize hooks have the following signature:

    def pre_tok_hook(raw_data):
        raw_data_out = do_stuff(raw_data)
        return raw_data_out

This can be used to eliminate encoding errors in data, replace numbers and names, etc.

Parameters

hook (callable) – The pre-tokenization hook that we want to add to the field.

get_output_fields()

Returns an Iterable of the contained output fields.

Returns

an Iterable of the contained output fields.

Return type

Iterable

remove_pretokenize_hooks()

Remove all the pre-tokenization hooks that were added to the MultioutputField.

podium.storage.unpack_fields(fields)

Flattens the given fields object into a flat list of fields.

Parameters

fields ((list | dict)) – List or dict that can contain nested tuples and None as values and column names as keys (dict).

Returns

A flat list of Fields found in the given ‘fields’ object.

Return type

list[Field]

class podium.storage.LargeResource(**kwargs)

Bases: object

Large resource that needs to download files from a URL. The class also supports archive decompression.

BASE_RESOURCE_DIR

base large files directory path

Type

str

RESOURCE_NAME

key for defining resource directory name parameter

Type

str

URL

key for defining resource url parameter

Type

str

ARCHIVE

key for defining the archiving method parameter

Type

str

SUPPORTED_ARCHIVE

list of supported archive file types

Type

list(str)
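A construction sketch; the kwargs keys come from the class attributes documented above, while the resource name, URL and archive type are placeholders:

    from podium.storage import LargeResource

    resource = LargeResource(**{
        LargeResource.RESOURCE_NAME: "my_dataset",
        LargeResource.URL: "https://example.com/my_dataset.zip",
        LargeResource.ARCHIVE: "zip",
    })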

class podium.storage.SCPLargeResource(**kwargs)

Bases: podium.storage.resources.large_resource.LargeResource

Large resource that downloads files from a URI using the SCP protocol. For other functionality it relies on the LargeResource base class.

SCP_HOST_KEY

key for keyword argument that defines remote host address

Type

str

SCP_USER_KEY

key for keyword argument that defines remote host username

Type

str

SCP_PASS_KEY

key for the keyword argument that defines the remote host password, or the passphrase used by the private key

Type

str, optional

SCP_PRIVATE_KEY

key for the keyword argument that defines the private key location; on Linux it can be omitted if the key is in the default location

Type

str, optional

class podium.storage.VectorStorage(path, default_vector_function=None, cache_path=None, max_vectors=None)

Bases: abc.ABC

Interface for classes that can vectorize tokens. One example of such a vectorizer is word2vec.

abstract __len__()

Method returns number of vectors in vector storage.

Returns

len – number of loaded vectors in vector storage

Return type

int

get_embedding_matrix(vocab=None)

Method constructs embedding matrix.

Note: From python 3.6 dictionaries preserve insertion order https://docs.python.org/3.6/whatsnew/3.6.html#other-language-changes

Parameters

vocab (iter(token)) – collection of tokens for creating the embedding matrix. The typical use case is to pass a vocab or an itos list, or None if you wish to retrieve all loaded vectors. If None is passed, the order of vectors is the insertion order of the loaded vectors in the VectorStorage.

Raises

RuntimeError – If vector storage is not initialized.

abstract get_vector_dim()

Method returns the vector dimension.

Returns

dim – vector dimension

Return type

int

Raises

RuntimeError – if vector storage is not initialized

abstract load_all()

Method loads all vectors stored at the instance path into the vector storage.

Raises
  • IOError – If there was a problem while reading vectors from instance path.

  • ValueError – If instance path is not a valid path.

  • RuntimeError – If different vector size is detected while loading vectors.

abstract load_vocab(vocab)

Method loads vectors for the tokens in the given vocab from the instance path.

Parameters

vocab (iterable object) – vocabulary with unique words

Raises
  • IOError – If there was a problem while reading vectors from instance path.

  • ValueError – If the given path is not a valid path, if the given vocab is None, or if the vector values in the vector storage cannot be cast to float.

  • RuntimeError – If different vector size is detected while loading vectors.

abstract token_to_vector(token)

Method obtains vector for given token.

Parameters

token (str) – token from vocabulary

Returns

vector – vector representation of given token

Return type

array_like

Raises
  • KeyError – If given token doesn’t have vector representation and default vector function is not defined (None).

  • ValueError – If given token is None.

  • RuntimeError – If vector storage is not initialized.

class podium.storage.BasicVectorStorage(path, default_vector_function=<function random_normal_default_vector>, cache_path=None, max_vectors=None, encoding='utf-8', binary=True)

Bases: podium.storage.vectorizers.vectorizer.VectorStorage

Basic implementation of VectorStorage that handles loading vectors from system storage.

_vectors

dictionary offering word to vector mapping

Type

dict

_dim

vector dimension

Type

int

_initialized

has the vector storage been initialized by loading vectors

Type

bool

_binary

if True, the file is read as a binary file. Else, it’s read as a plain utf-8 text file.

Type

bool

get_vector_dim()

Method returns the vector dimension.

Returns

dim – vector dimension

Return type

int

Raises

RuntimeError – if vector storage is not initialized

load_all()

Method loads all vectors stored at the instance path into the vector storage.

Raises
  • IOError – If there was a problem while reading vectors from instance path.

  • ValueError – If instance path is not a valid path.

  • RuntimeError – If different vector size is detected while loading vectors.

load_vocab(vocab)

Method loads vectors for the tokens in the given vocab from the instance path.

Parameters

vocab (iterable object) – vocabulary with unique words

Raises
  • IOError – If there was a problem while reading vectors from instance path.

  • ValueError – If the given path is not a valid path, if the given vocab is None, or if the vector values in the vector storage cannot be cast to float.

  • RuntimeError – If different vector size is detected while loading vectors.

token_to_vector(token)

Method obtains vector for given token.

Parameters

token (str) – token from vocabulary

Returns

vector – vector representation of given token

Return type

array_like

Raises
  • KeyError – If given token doesn’t have vector representation and default vector function is not defined (None).

  • ValueError – If given token is None.

  • RuntimeError – If vector storage is not initialized.
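A loading sketch; the path to a word2vec/GloVe-style vector file is a placeholder:

    from podium.storage import BasicVectorStorage

    storage = BasicVectorStorage(path="vectors/glove.txt", binary=False)
    storage.load_all()              # or storage.load_vocab(vocab) for a subset

    vector = storage.token_to_vector("house")
    dim = storage.get_vector_dim()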

class podium.storage.SpecialVocabSymbols

Bases: enum.Enum

Class for special vocabulary symbols

UNK

Tag for unknown word

Type

str

PAD

Tag for padding symbol

Type

str

class podium.storage.Vocab(max_size=None, min_freq=1, specials=(<SpecialVocabSymbols.UNK: '<unk>'>, <SpecialVocabSymbols.PAD: '<pad>'>), keep_freqs=False)

Bases: object

Class for storing vocabulary. It supports frequency counting and size limiting.

finalized

true if the vocab is finalized, false otherwise

Type

bool

itos

list of words

Type

list

stoi

mapping from word string to index

Type

dict

__add__(values: Union[Vocab, Iterable])

Method allows another vocabulary or a set of values to be added to this vocabulary, producing a new Vocab.

If max_size is None for either of the two Vocabs, the max_size of the resulting Vocab will also be None. If both are defined, the max_size of the resulting Vocab will be the sum of the max_sizes.

Parameters

values (Iterable or Vocab) – If Vocab, a new Vocab will be created containing all of the special symbols and tokens from both Vocabs. If Iterable, a new Vocab will be returned containing a copy of this Vocab with the iterables’ tokens added.

Returns

Returns a new Vocab

Return type

Vocab

Raises

RuntimeError – If this vocab is finalized and an attempt is made to add values to it, or if one Vocab is finalized and the other is not.

__eq__(other)

Two vocabs are equal if they have the same finalization status, the same stoi and itos mappings, and the same frequency counters.

Parameters

other (object) – the object to compare this Vocab with for equality

Returns

equal – true if two vocabs are same, false otherwise

Return type

bool

__getitem__(token)

Returns the token index of the passed token. If the passed token has no index and the vocab contains the UNK special token, the UNK token index is returned; otherwise, an exception is raised.

Parameters

token (str) – token whose index is to be returned.

Returns

stoi index of the token.

Return type

int

Raises

KeyError – If the passed token has no index and vocab has no UNK special token.

__iadd__(values: Union[Vocab, Iterable])

Adds additional values or another Vocab to this Vocab.

Parameters

values (Iterable or Vocab) –

Values to be added to this Vocab. If Vocab, all of the token frequencies and specials from that Vocab will be added to this Vocab.

If Iterable, all of the tokens from the Iterable will be added to this Vocab, increasing the frequencies of those tokens.

Returns

vocab – Returns current Vocab instance to enable chaining

Return type

Vocab

Raises
  • RuntimeError – If the current vocab is finalized, if ‘values’ is a string or if the RHS Vocab doesn’t contain token frequencies.

  • TypeError – If the values cannot be iterated over.

__iter__()

Method returns an iterator over the vocabulary. If the vocabulary is not finalized, iteration is done over the frequency counter and special symbols are not included; otherwise, it is performed over itos and special symbols are included.

Returns

iterator over vocab tokens

Return type

iter

__len__()

Method calculates the vocab length, including special symbols.

Returns

length – vocab size including special symbols

Return type

int

finalize()

Method finalizes vocab building. It also releases the frequency counter if the user chose not to keep frequencies.

Raises

RuntimeError – If the vocab is already finalized.

get_freqs()

Method obtains vocabulary frequencies.

Returns

freq – mapping frequency for every word

Return type

Counter

Raises

RuntimeError – If the user chose not to keep frequencies and the vocab is finalized.

property has_specials

Property that checks if the vocabulary contains special symbols.

Returns

flag – true if the vocabulary has special symbols, false otherwise.

Return type

bool

numericalize(data)

Method numericalizes given tokens.

Parameters

data (iter(str)) – iterable collection of tokens

Returns

numericalized_vector – numpy array of numericalized tokens

Return type

array-like

Raises

RuntimeError – If the vocabulary is not finalized.

padding_index()

Method returns padding symbol index.

Returns

pad_symbol_index – padding symbol index in the vocabulary

Return type

int

Raises

ValueError – If the padding symbol is not present in the vocabulary.

reverse_numericalize(numericalized_data: Iterable)

Transforms an iterable containing numericalized data into a list of tokens. The tokens are read from this Vocab’s itos and no additional processing is done.

Parameters

numericalized_data (Iterable) – data to be reverse numericalized

Returns

a list of tokens

Return type

list

Raises

RuntimeError – If the vocabulary is not finalized.

class podium.storage.ExampleFactory(fields)

Bases: object

Class used to create Example instances. Every ExampleFactory dynamically creates its own example class definition optimised for the fields provided in __init__.

create_empty_example()

Method creates empty example with field names stored in example factory.

Returns

example – empty Example instance with initialized field names

Return type

Example

from_csv(data, field_to_index=None, delimiter=', ')

Creates an Example from a CSV line and a corresponding list or dict of Fields.

Parameters
  • data (str) – A string containing a single row of values separated by the given delimiter.

  • field_to_index (dict) – A dict that maps column names to their indices in the line of data. Only needed if fields is a dict, otherwise ignored.

  • delimiter (str) – The delimiter that separates the values in the line of data.

Returns

An Example whose attributes are the given Fields created with the given column values. These Fields can be accessed by their names.

Return type

Example

from_dict(data)

Method creates example from data in dictionary format.

Parameters

data (dict(str, object)) – dictionary that maps field name to field value

Returns

example – example instance with given data saved to fields

Return type

Example

from_fields_tree(data, subtrees=False, label_transform=None)

Creates an Example (or multiple Examples) from a string representing an nltk tree and a list of corresponding values.

Parameters
  • data (str) – A string containing an nltk tree whose values are to be mapped to Fields.

  • subtrees (bool) – A flag denoting whether an example will be created from every subtree in the tree (when set to True), or just from the whole tree (when set to False).

  • label_transform (callable) – A function which converts the tree labels to a string representation, if wished. Useful for converting multiclass tasks to binary (SST) and making labels verbose. If None, the labels are not changed.

Returns

If subtrees was False, returns an Example whose attributes are the given Fields created with the given column values. These Fields can be accessed by their names.

If subtrees was True, returns a list of such Examples for every subtree in the given tree.

Return type

(Example | list)

from_json(data)

Creates an Example from a JSON object and the corresponding fields.

Parameters

data (str) – A string containing a single JSON object (key-value pairs surrounded by curly braces).

Returns

An Example whose attributes are the given Fields created with the given column values. These Fields can be accessed by their names.

Return type

Example

Raises

ValueError – If JSON doesn’t contain key name.

from_list(data)

Method creates example from data in list format.

Parameters

data (list) – list containing values for the fields, in the order that the fields were given to the example factory

Returns

example – example instance with given data saved to fields

Return type

Example

from_xml_str(data)

Method creates an Example from an XML string.

Parameters

data (str) – XML-formatted string that contains the values of a single data instance that are to be mapped to Fields.

Returns

An Example whose attributes are the given Fields created with the given column values. These Fields can be accessed by their names.

Return type

Example

Raises
  • ValueError – If a field name is not contained in the XML string.

  • ParseError – If there was a problem while parsing the XML string (invalid XML).

class podium.storage.ExampleFormat

Bases: enum.Enum

An enumeration.

class podium.storage.TfIdfVectorizer(vocab=None, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False, specials=None)

Bases: podium.storage.vectorizers.tfidf.CountVectorizer

Class converts data from one field in examples to a matrix of tf-idf features. It is equivalent to the scikit-learn TfidfVectorizer available at https://scikit-learn.org. The class is dependent on the TfidfTransformer defined in the scikit-learn library.

fit(dataset, field)

Learn idf from dataset on data in given field.

Parameters
  • dataset (Dataset) – dataset instance containing the data on which to build the idf matrix

  • field (Field) – which field in dataset to use for tfidf

Returns

self

Return type

TfIdfVectorizer

Raises

ValueError – If dataset or field is None, or if the name of the given field is not in the dataset.

transform(examples, **kwargs)

Transforms examples to an example-term matrix, using the vocabulary given in the constructor.

Parameters

examples (iterable) – an iterable which yields arrays of numericalized tokens

Returns

X – Tf-idf weighted document-term matrix

Return type

sparse matrix, [n_samples, n_features]

Raises
  • ValueError – If examples are None.

  • RuntimeError – If vectorizer is not fitted yet.
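A fit/transform sketch; ‘dataset’ is assumed to be a podium Dataset and ‘text_field’ one of its Fields, and iterating the dataset over its Examples is an assumption as well:

    from podium.storage import TfIdfVectorizer

    vectorizer = TfIdfVectorizer()
    vectorizer.fit(dataset, field=text_field)

    # transform expects an iterable yielding arrays of numericalized tokens;
    # here each example is numericalized through the field itself.
    rows = (text_field.get_numericalization_for_example(ex) for ex in dataset)
    tfidf_matrix = vectorizer.transform(rows)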