podium.storage package¶
Submodules¶
podium.storage.example_factory module¶
Module containing the Example Factory method used to dynamically create example classes used for storage in Dataset classes
-
class
podium.storage.example_factory.
Example
(fieldnames)¶ Bases:
object
Models a single example. Regular fields hold (raw, tokenized) values, and special fields whose names end with “_” can cache numericalized values.
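The attribute layout described above can be pictured with a small stand-in class. This is only a sketch: `SketchExample` is a hypothetical name, the index values are made up, and the real Example class is generated dynamically by ExampleFactory.

```python
class SketchExample:
    """Hypothetical stand-in for Example; not podium's implementation."""

    def __init__(self, fieldnames):
        for name in fieldnames:
            # Regular attribute: will hold a (raw, tokenized) tuple.
            setattr(self, name, None)
            # Attribute with a trailing underscore: cache for numericalized values.
            setattr(self, name + "_", None)


example = SketchExample(["text", "label"])
example.text = ("a fine film", ["a", "fine", "film"])
example.text_ = [4, 17, 9]  # illustrative indices only
```

Fields that have not been filled in yet simply stay None in this sketch.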
-
class
podium.storage.example_factory.
ExampleFactory
(fields)¶ Bases:
object
Class used to create Example instances. Every ExampleFactory dynamically creates its own example class definition optimised for the fields provided in __init__.
-
create_empty_example
()¶ Method creates empty example with field names stored in example factory.
- Returns
example – empty Example instance with initialized field names
- Return type
Example
-
from_csv
(data, field_to_index=None, delimiter=', ')¶ Creates an Example from a CSV line and a corresponding list or dict of Fields.
- Parameters
data (str) – A string containing a single row of values separated by the given delimiter.
field_to_index (dict) – A dict that maps column names to their indices in the line of data. Only needed if fields is a dict, otherwise ignored.
delimiter (str) – The delimiter that separates the values in the line of data.
- Returns
An Example whose attributes are the given Fields created with the given column values. These Fields can be accessed by their names.
- Return type
Example
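To illustrate how field_to_index relates column names to positions in a line, the sketch below splits one CSV row and picks values by name. The helper `select_columns` is hypothetical and not part of podium, and the single-character delimiter is an assumption required by Python's csv module. The real method additionally passes each value through the corresponding Field.

```python
import csv


def select_columns(line, field_to_index, delimiter=","):
    """Hypothetical helper: map column names to the values of one CSV line."""
    # csv.reader handles quoted values and needs a one-character delimiter.
    row = next(csv.reader([line], delimiter=delimiter))
    return {name: row[index] for name, index in field_to_index.items()}


values = select_columns("good movie,positive", {"text": 0, "label": 1})
```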
-
from_dict
(data)¶ Method creates example from data in dictionary format.
- Parameters
data (dict(str, object)) – dictionary that maps field name to field value
- Returns
example – example instance with given data saved to fields
- Return type
Example
-
from_fields_tree
(data, subtrees=False, label_transform=None)¶ Creates an Example (or multiple Examples) from a string representing an nltk tree and a list of corresponding values.
- Parameters
data (str) – A string containing an nltk tree whose values are to be mapped to Fields.
subtrees (bool) – A flag denoting whether an example will be created from every subtree in the tree (when set to True), or just from the whole tree (when set to False).
label_transform (callable) – A function which converts the tree labels to a string representation, if wished. Useful for converting multiclass tasks to binary (SST) and making labels verbose. If None, the labels are not changed.
- Returns
If subtrees was False, returns an Example whose attributes are the given Fields created with the given column values. These Fields can be accessed by their names.
If subtrees was True, returns a list of such Examples for every subtree in the given tree.
- Return type
(Example | list)
-
from_json
(data)¶ Creates an Example from a JSON object and the corresponding fields.
- Parameters
data (str) – A string containing a single JSON object (key-value pairs surrounded by curly braces).
- Returns
An Example whose attributes are the given Fields created with the given column values. These Fields can be accessed by their names.
- Return type
Example
- Raises
ValueError – If JSON doesn’t contain key name.
-
from_list
(data)¶ Method creates example from data in list format.
- Parameters
data (list) – list containing values for fields in order that the fields were given to example factory
- Returns
example – example instance with given data saved to fields
- Return type
Example
-
from_xml_str
(data)¶ Creates an Example from an XML string.
- Parameters
data (str) – XML-formatted string that contains the values of a single data instance that are to be mapped to Fields.
- Returns
An Example whose attributes are the given Fields created with the given column values. These Fields can be accessed by their names.
- Return type
Example
- Raises
ValueError – If the name is not contained in the xml string.
ParseError – If the XML string is invalid and cannot be parsed.
-
-
class
podium.storage.example_factory.
ExampleFormat
¶ Bases:
enum.Enum
An enumeration.
-
podium.storage.example_factory.
set_example_attributes
(example, field, val)¶ Method sets example attributes with given values.
- Parameters
example (Example) – example instance to which we are setting attributes
field ((Field|tuple(Field))) – field instance or instances that we are mapping
val (str) – field value
-
podium.storage.example_factory.
tree_to_list
(tree)¶ Method joins tree leaves and label in one list.
- Parameters
tree (tree) – nltk tree instance
- Returns
tree_list – tree represented as list with its label
- Return type
list
podium.storage.field module¶
Module contains dataset’s field definition and methods for construction.
-
class
podium.storage.field.
Field
(name, tokenizer='split', language='en', vocab=None, tokenize=True, store_as_raw=False, store_as_tokenized=False, eager=True, is_numericalizable=True, custom_numericalize=None, batch_as_matrix=True, padding_token=-999, is_target=False, fixed_length=None, allow_missing_data=False, missing_data_token=-1)¶ Bases:
object
Holds the preprocessing and numericalization logic for a single field of a dataset.
-
__getstate__
()¶ Method obtains field state. It is used for pickling dataset data to file.
- Returns
state – dataset state dictionary
- Return type
dict
-
__setstate__
(state)¶ Method sets field state. It is used for unpickling dataset data from file.
- Parameters
state (dict) – dataset state dictionary
-
add_posttokenize_hook
(hook)¶ Add a post-tokenization hook to the Field. If multiple hooks are added to the field, the order of their execution will be the same as the order in which they were added to the field, each subsequent hook taking the output of the previous hook as its input. If the same function is added to the Field as a hook multiple times, it will be executed that many times. Post-tokenization hooks are called only if the Field is sequential (in non-sequential fields there is no tokenization and only pre-tokenization hooks are called). The output of the final post-tokenization hook are the raw and tokenized data that the preprocess function will use to produce its result.
- Posttokenize hooks have the following outline:
- def post_tok_hook(raw_data, tokenized_data):
      raw_out, tokenized_out = do_stuff(raw_data, tokenized_data)
      return raw_out, tokenized_out
where ‘tokenized_data’ is, and ‘tokenized_out’ should be, an iterable.
- Parameters
hook (callable) – The post-tokenization hook that we want to add to the field.
- Raises
If field is declared as non numericalizable. –
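The chaining behaviour described above can be sketched with two small hooks. The hook functions and the runner `run_posttokenize_hooks` are hypothetical names written for this illustration. The runner only mimics the documented execution order and is not podium's code.

```python
def lowercase_hook(raw_data, tokenized_data):
    """Post-tokenization hook: lowercases every token, keeps raw data as is."""
    return raw_data, [token.lower() for token in tokenized_data]


def drop_punctuation_hook(raw_data, tokenized_data):
    """Post-tokenization hook: removes tokens without any alphanumeric character."""
    kept = [t for t in tokenized_data if any(ch.isalnum() for ch in t)]
    return raw_data, kept


def run_posttokenize_hooks(hooks, raw_data, tokenized_data):
    """Hypothetical runner: each hook receives the previous hook's output."""
    for hook in hooks:
        raw_data, tokenized_data = hook(raw_data, tokenized_data)
    return raw_data, tokenized_data


raw, tokens = run_posttokenize_hooks(
    [lowercase_hook, drop_punctuation_hook],
    "Hello , World !",
    ["Hello", ",", "World", "!"],
)
```

With a real Field, functions of this shape would be registered one by one through add_posttokenize_hook.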
-
add_pretokenize_hook
(hook)¶ Add a pre-tokenization hook to the Field. If multiple hooks are added to the field, the order of their execution will be the same as the order in which they were added to the field, each subsequent hook taking the output of the previous hook as its input. If the same function is added to the Field as a hook multiple times, it will be executed that many times. The output of the final pre-tokenization hook is the raw data that the tokenizer will get as its input.
- Pretokenize hooks have the following signature:
- def pre_tok_hook(raw_data):
      raw_data_out = do_stuff(raw_data)
      return raw_data_out
This can be used to eliminate encoding errors in data, replace numbers and names, etc.
- Parameters
hook (callable) – The pre-tokenization hook that we want to add to the field.
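As an illustration of the kind of clean-up mentioned above, the sketch below defines two pre-tokenization hooks and applies them in order. The names `strip_hook`, `replace_numbers_hook` and `run_pretokenize_hooks` are hypothetical, and the `<num>` placeholder is an arbitrary choice.

```python
import re


def strip_hook(raw_data):
    """Pre-tokenization hook: removes leading and trailing whitespace."""
    return raw_data.strip()


def replace_numbers_hook(raw_data):
    """Pre-tokenization hook: replaces each run of digits with a placeholder."""
    return re.sub(r"\d+", "<num>", raw_data)


def run_pretokenize_hooks(hooks, raw_data):
    """Hypothetical runner: feeds each hook the output of the previous one."""
    for hook in hooks:
        raw_data = hook(raw_data)
    return raw_data


cleaned = run_pretokenize_hooks(
    [strip_hook, replace_numbers_hook], "  Released in 1999  "
)
```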
-
finalize
()¶ Signals that this field’s vocab can be built.
-
property
finalized
¶ Returns whether the field’s Vocab was finalized. If the field has no vocab, returns True.
- Returns
Whether the field’s Vocab was finalized. If the field has no vocab, returns True.
- Return type
bool
-
get_default_value
()¶ Method obtains default field value for missing data.
- Returns
The index of the missing data token, if this field is numericalizable. None value otherwise.
- Return type
missing_symbol index or None
- Raises
ValueError – If missing data is not allowed in this field.
-
get_numericalization_for_example
(example, cache=True)¶ Returns the numericalized data of this field for the provided example. The numericalized data is generated and cached in the example if ‘cache’ is true and the cached data is not already present. If already cached, the cached data is returned.
- Parameters
example (Example) – example to get numericalized data for.
cache (bool) – whether to cache the calculated numericalization if it is not already cached
- Returns
numericalized data – The numericalized data.
- Return type
numpy array
-
get_output_fields
()¶ Returns an Iterable of the contained output fields.
- Returns
an Iterable of the contained output fields.
- Return type
Iterable
-
numericalize
(data)¶ Numericalize the already preprocessed data point based either on the vocab that was previously built, or on a custom numericalization function, if the field doesn’t use a vocab.
- Parameters
data ((hashable, iterable(hashable))) – Tuple of (raw, tokenized) of preprocessed input data. If the field is sequential, ‘raw’ is ignored and can be None. Otherwise, ‘tokenized’ is ignored and can be None.
- Returns
Array of stoi indexes of the tokens, if data exists. None, if data is missing and missing data is allowed.
- Return type
numpy array
- Raises
ValueError – If data is None and missing data is not allowed in this field.
-
pad_to_length
(row, length, custom_pad_symbol=None, pad_left=False, truncate_left=False)¶ Either pads the given row with pad symbols, or truncates the row to be of given length. The vocab provides the pad symbol for all fields that have vocabs, otherwise the pad symbol has to be given as a parameter.
- Parameters
row (np.ndarray) – The row of numericalized data that is to be padded / truncated.
length (int) – The desired length of the row.
custom_pad_symbol (int) – The pad symbol that is to be used if the field doesn’t have a vocab. If the field has a vocab, this parameter is ignored and can be None.
pad_left (bool) – If True padding will be done on the left side, otherwise on the right side. Default: False.
truncate_left (bool) – If True, the field will be truncated on the left side, otherwise on the right side. Default: False.
- Raises
ValueError – If the field doesn’t use a vocab and no custom pad symbol was given.
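The padding and truncation rules above can be sketched on plain Python lists. This is an assumption-laden illustration: `pad_to_length_sketch` is a hypothetical function, it always takes the pad symbol explicitly, and it works on lists rather than the np.ndarray rows the real method expects.

```python
def pad_to_length_sketch(row, length, pad_symbol,
                         pad_left=False, truncate_left=False):
    """Hypothetical list-based sketch of padding or truncating one row."""
    row = list(row)
    if len(row) > length:
        # Truncating on the left keeps the last `length` items.
        return row[len(row) - length:] if truncate_left else row[:length]
    padding = [pad_symbol] * (length - len(row))
    # Padding goes before the data when pad_left is set, after it otherwise.
    return padding + row if pad_left else row + padding


padded = pad_to_length_sketch([5, 6, 7], 5, pad_symbol=0)
truncated = pad_to_length_sketch([5, 6, 7, 8, 9], 3, pad_symbol=0,
                                 truncate_left=True)
```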
-
preprocess
(data)¶ Preprocesses raw data: tokenizes it if the field is sequential, updates the vocab if the field is eager, and preserves the raw data if the field’s ‘store_as_raw’ attribute is True.
- Parameters
data (str or iterable(hashable)) – The raw data that needs to be preprocessed. String if ‘store_as_raw’ and/or ‘tokenize’ attributes are True. iterable(hashable) if store_as_tokenized attribute is True.
- Returns
A tuple of (raw, tokenized). If the field’s ‘store_as_raw’ attribute is False, then ‘raw’ will be None (we don’t preserve the raw data). If field’s ‘tokenize’ and ‘store_as_tokenized’ attributes are False then ‘tokenized’ will be None. The attributes ‘store_as_raw’, ‘store_as_tokenized’ and ‘tokenize’ will never all be False, so the function will never return (None, None).
- Return type
(str, Iterable(hashable))
- Raises
If data is None and missing data is not allowed. –
-
remove_posttokenize_hooks
()¶ Remove all the post-tokenization hooks that were added to the Field.
-
remove_pretokenize_hooks
()¶ Remove all the pre-tokenization hooks that were added to the Field.
-
update_vocab
(raw, tokenized)¶ Updates the vocab with a data point in its raw and tokenized form. If the field is sequential, the vocab is updated with the tokenized form (and ‘raw’ can be None), otherwise the raw form is used to update (and ‘tokenized’ can be None).
- Parameters
raw (hashable) – The raw form of the data point that the vocab is to be updated with. If the field is sequential, this parameter is ignored and can be None.
tokenized (iterable(hashable)) – The tokenized form of the data point that the vocab is to be updated with. If the field is NOT sequential (‘store_as_tokenized’ and ‘tokenize’ attributes are False), this parameter is ignored and can be None.
-
property
use_vocab
¶ A flag that tells whether the field uses a vocab or not.
- Returns
Whether the field uses a vocab or not.
- Return type
bool
-
-
class
podium.storage.field.
LabelField
(name, vocab=None, eager=True, custom_numericalize=None, batch_as_matrix=True, allow_missing_data=False, missing_data_token=-1)¶ Bases:
podium.storage.field.Field
-
class
podium.storage.field.
MultilabelField
(name, num_of_classes=None, vocab=None, eager=True, custom_numericalize=None, batch_as_matrix=True, allow_missing_data=False, missing_data_token=-1)¶ Bases:
podium.storage.field.TokenizedField
Class used for storing pre-tokenized labels. Used for multilabeled datasets.
-
finalize
()¶ Signals that this field’s vocab can be built.
-
-
class
podium.storage.field.
MultioutputField
(output_fields, tokenizer='split', language='en')¶ Bases:
object
Field that does pretokenization and tokenization once and passes it to its output fields. Output fields are any type of field. The output fields are used only for posttokenization processing (posttokenization hooks and vocab updating).
-
add_output_field
(field)¶ Adds the passed field to this field’s output fields.
- Parameters
field (Field) – Field to add to output fields.
-
add_pretokenize_hook
(hook)¶ Add a pre-tokenization hook to the MultioutputField. If multiple hooks are added to the field, the order of their execution will be the same as the order in which they were added to the field, each subsequent hook taking the output of the previous hook as its input. If the same function is added to the Field as a hook multiple times, it will be executed that many times. The output of the final pre-tokenization hook is the raw data that the tokenizer will get as its input.
- Pretokenize hooks have the following signature:
- def pre_tok_hook(raw_data):
      raw_data_out = do_stuff(raw_data)
      return raw_data_out
This can be used to eliminate encoding errors in data, replace numbers and names, etc.
- Parameters
hook (callable) – The pre-tokenization hook that we want to add to the field.
-
get_output_fields
()¶ Returns an Iterable of the contained output fields.
- Returns
an Iterable of the contained output fields.
- Return type
Iterable
-
remove_pretokenize_hooks
()¶ Remove all the pre-tokenization hooks that were added to the MultioutputField.
-
-
class
podium.storage.field.
TokenizedField
(name, vocab=None, eager=True, custom_numericalize=None, batch_as_matrix=True, padding_token=-999, is_target=False, fixed_length=None, allow_missing_data=False, missing_data_token=-1)¶ Bases:
podium.storage.field.Field
Tokenized version of the Field. Holds the preprocessing and numericalization logic for the pre-tokenized dataset fields.
-
podium.storage.field.
unpack_fields
(fields)¶ Flattens the given fields object into a flat list of fields.
- Parameters
fields ((list | dict)) – List or dict that can contain nested tuples and None as values and column names as keys (dict).
- Returns
A flat list of Fields found in the given ‘fields’ object.
- Return type
list[Field]
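A minimal sketch of the flattening described above, assuming only the documented input shapes (a list or dict whose values may be fields, nested tuples or None). The function name `unpack_fields_sketch` is hypothetical, and plain strings stand in for Field instances.

```python
def unpack_fields_sketch(fields):
    """Hypothetical flattening of a list or dict of fields, skipping None."""
    # For a dict, only the values matter; the keys are column names.
    values = fields.values() if isinstance(fields, dict) else fields
    flat = []
    for item in values:
        if item is None:
            continue
        if isinstance(item, tuple):
            # Tuples may be nested, so flatten them recursively.
            flat.extend(unpack_fields_sketch(item))
        else:
            flat.append(item)
    return flat


from_dict = unpack_fields_sketch({"a": "f1", "b": ("f2", "f3"), "c": None})
from_list = unpack_fields_sketch(["f1", None, ("f2", ("f3",))])
```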
podium.storage.vocab module¶
Module contains classes related to the vocabulary.
-
class
podium.storage.vocab.
SpecialVocabSymbols
¶ Bases:
enum.Enum
Class for special vocabulary symbols
-
UNK
¶ Tag for unknown word
- Type
str
-
PAD
¶ Tag for padding symbol
- Type
str
-
-
class
podium.storage.vocab.
Vocab
(max_size=None, min_freq=1, specials=(<SpecialVocabSymbols.UNK: '<unk>'>, <SpecialVocabSymbols.PAD: '<pad>'>), keep_freqs=False)¶ Bases:
object
Class for storing vocabulary. It supports frequency counting and size limiting.
-
finalized
¶ true if the vocab is finalized, false otherwise
- Type
bool
-
itos
¶ list of words
- Type
list
-
stoi
¶ mapping from word string to index
- Type
dict
-
__add__
(values: Union[Vocab, Iterable])¶ Adds another vocabulary or a collection of values to this vocabulary and returns the result as a new Vocab.
If max_size is None for either of the two Vocabs, the max_size of the resulting Vocab will also be None. If both are defined, the max_size of the resulting Vocab will be the sum of the two max_sizes.
- Parameters
values (Iterable or Vocab) – If Vocab, a new Vocab will be created containing all of the special symbols and tokens from both Vocabs. If Iterable, a new Vocab will be returned containing a copy of this Vocab with the iterables’ tokens added.
- Returns
Returns a new Vocab
- Return type
Vocab
- Raises
RuntimeError – If this vocab is finalized and values are added to it, or if one of the two Vocabs is finalized and the other is not.
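The max_size rule for adding two Vocabs can be written out as a tiny function. `combined_max_size` is a hypothetical helper that only restates the rule given above.

```python
def combined_max_size(first_max_size, second_max_size):
    """Hypothetical helper restating the documented max_size rule."""
    # An unlimited vocab (None) on either side makes the result unlimited.
    if first_max_size is None or second_max_size is None:
        return None
    return first_max_size + second_max_size


limited = combined_max_size(1000, 500)
unlimited = combined_max_size(1000, None)
```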
-
__eq__
(other)¶ Two vocabs are equal if they have the same finalization status, the same stoi and itos mappings, and the same frequency counters.
- Parameters
other (object) – object to compare this vocab with
- Returns
equal – true if two vocabs are same, false otherwise
- Return type
bool
-
__getitem__
(token)¶ Returns the index of the passed token. If the passed token has no index, the index of the UNK token is returned, provided the vocab contains the UNK special token. Otherwise, a KeyError is raised.
- Parameters
token (str) – token whose index is to be returned.
- Returns
stoi index of the token.
- Return type
int
- Raises
KeyError – If the passed token has no index and vocab has no UNK special token.
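The lookup rule can be sketched over a plain dict. The function `lookup` is hypothetical. The string '<unk>' is taken from the SpecialVocabSymbols.UNK value shown in the Vocab signature, and the dict stands in for the vocab's stoi mapping.

```python
UNK = "<unk>"  # value of SpecialVocabSymbols.UNK per the Vocab signature


def lookup(stoi, token):
    """Hypothetical sketch of token lookup with an UNK fallback."""
    if token in stoi:
        return stoi[token]
    if UNK in stoi:
        # Unknown tokens map to the UNK index when that special exists.
        return stoi[UNK]
    raise KeyError(token)


with_unk = {"<unk>": 0, "<pad>": 1, "cat": 2}
known_index = lookup(with_unk, "cat")
fallback_index = lookup(with_unk, "dog")
```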
-
__iadd__
(values: Union[Vocab, Iterable])¶ Adds additional values or another Vocab to this Vocab.
- Parameters
values (Iterable or Vocab) –
Values to be added to this Vocab. If Vocab, all of the token frequencies and specials from that Vocab will be added to this Vocab.
If Iterable, all of the tokens from the Iterable will be added to this Vocab, increasing the frequencies of those tokens.
- Returns
vocab – Returns current Vocab instance to enable chaining
- Return type
Vocab
- Raises
RuntimeError – If the current vocab is finalized, if ‘values’ is a string or if the RHS Vocab doesn’t contain token frequencies.
TypeError – If the values cannot be iterated over.
-
__iter__
()¶ Returns an iterator over the vocabulary. If the vocabulary is not finalized, iteration is done over the frequency counter and special symbols are not included; otherwise, it is done over itos and special symbols are included.
- Returns
iterator over vocab tokens
- Return type
iter
-
__len__
()¶ Method calculates the vocab length including special symbols.
- Returns
length – vocab size including special symbols
- Return type
int
-
finalize
()¶ Method finalizes vocab building. It also releases the frequency counter unless the user chose to keep frequencies.
- Raises
RuntimeError – If the vocab is already finalized.
-
get_freqs
()¶ Method obtains vocabulary frequencies.
- Returns
freq – mapping from every word to its frequency
- Return type
Counter
- Raises
RuntimeError – If the vocab is finalized and the user chose not to keep frequencies.
-
property
has_specials
¶ Property that checks if the vocabulary contains special symbols.
- Returns
flag – true if the vocabulary has special symbols, false otherwise.
- Return type
bool
-
numericalize
(data)¶ Method numericalizes given tokens.
- Parameters
data (iter(str)) – iterable collection of tokens
- Returns
numericalized_vector – numpy array of numericalized tokens
- Return type
array-like
- Raises
RuntimeError – If the vocabulary is not finalized.
-
padding_index
()¶ Method returns padding symbol index.
- Returns
pad_symbol_index – padding symbol index in the vocabulary
- Return type
int
- Raises
ValueError – If the padding symbol is not present in the vocabulary.
-
reverse_numericalize
(numericalized_data: Iterable)¶ Transforms an iterable containing numericalized data into a list of tokens. The tokens are read from this Vocab’s itos and no additional processing is done.
- Parameters
numericalized_data (Iterable) – data to be reverse numericalized
- Returns
a list of tokens
- Return type
list
- Raises
RuntimeError – If the vocabulary is not finalized.
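The relationship between numericalize and reverse_numericalize can be sketched with a hand-built itos list and stoi dict. Everything here is illustrative: the token list is made up, both functions are hypothetical, and the sketch returns plain lists instead of numpy arrays.

```python
# Hand-built mappings standing in for a finalized vocab.
itos = ["<unk>", "<pad>", "film", "good"]
stoi = {token: index for index, token in enumerate(itos)}


def numericalize_sketch(tokens):
    """Hypothetical sketch: map tokens to indices, unknown tokens to UNK."""
    unk_index = stoi["<unk>"]
    return [stoi.get(token, unk_index) for token in tokens]


def reverse_numericalize_sketch(indices):
    """Hypothetical sketch: read tokens back from itos, no extra processing."""
    return [itos[index] for index in indices]


indices = numericalize_sketch(["good", "film", "today"])
tokens = reverse_numericalize_sketch(indices)
```

The round trip is lossy for tokens outside the vocabulary, which come back as the UNK symbol.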
-
-
class
podium.storage.vocab.
VocabDict
(default_factory=None, *args, **kwargs)¶ Bases:
dict
Vocab dictionary class that is used like default dict but without adding missing key to the dictionary.
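One way to get default-dict behaviour without storing missing keys is to override `__missing__` on a dict subclass. The class below is a sketch under that assumption: `NonStoringDefaultDict` is a hypothetical name and may differ from how VocabDict is actually written.

```python
class NonStoringDefaultDict(dict):
    """Hypothetical sketch: returns a default for missing keys, stores nothing."""

    def __init__(self, default_factory=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._default_factory = default_factory

    def __missing__(self, key):
        # Called by dict.__getitem__ when the key is absent.
        if self._default_factory is None:
            raise KeyError(key)
        # Unlike collections.defaultdict, the value is not inserted.
        return self._default_factory()


mapping = NonStoringDefaultDict(lambda: 0, {"cat": 2})
missing_value = mapping["dog"]
```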
-
podium.storage.vocab.
unique
(values: Iterable)¶ Generator that iterates over the first occurrence of every value in values, preserving original order.
- Parameters
values (Iterable) – Iterable of values
- Yields
the first occurrence of every value in values, preserving order.
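A generator with this behaviour can be sketched in a few lines. `unique_sketch` is a hypothetical name, and the sketch assumes that the values are hashable.

```python
def unique_sketch(values):
    """Hypothetical sketch: yield each value once, in first-seen order."""
    seen = set()
    for value in values:
        if value not in seen:
            seen.add(value)
            yield value


first_occurrences = list(unique_sketch([3, 1, 3, 2, 1]))
```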
Module contents¶
Package contains modules for storing and loading datasets and vectors.
-
class
podium.storage.
BaseDownloader
¶ Bases:
abc.ABC
BaseDownloader interface for downloader classes.
-
abstract classmethod
download
(uri, path, overwrite=False, **kwargs)¶ Downloads a file from the given URI to the given path. If overwrite is True and the given path already exists, it will be overwritten with the new file.
- Parameters
uri (str) – URI of file that needs to be downloaded
path (str) – destination path where to save downloaded file
overwrite (bool) – if true and given path exists downloaded file will overwrite existing files
- Returns
rewrite_status – True if download was successful or False if the file already exists and given overwrite value was False.
- Return type
bool
- Raises
ValueError – if given uri or path are None
RuntimeError – if there was an error while obtaining resource from uri
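The contract above (return value, overwrite handling and raised errors) can be sketched independently of any transport. In the sketch below, `download_sketch` and its `fetch` parameter are hypothetical: `fetch` stands in for whatever a concrete downloader uses to obtain the bytes, and treating OSError as the transport failure is an assumption.

```python
import os
import tempfile


def download_sketch(uri, path, fetch, overwrite=False):
    """Hypothetical sketch of the documented download contract."""
    if uri is None or path is None:
        raise ValueError("uri and path must not be None")
    if os.path.exists(path) and not overwrite:
        # Existing file is kept; signal that nothing was written.
        return False
    try:
        content = fetch(uri)  # assumed to return the resource as bytes
    except OSError as error:
        raise RuntimeError(f"could not obtain resource from {uri}") from error
    with open(path, "wb") as out_file:
        out_file.write(content)
    return True


with tempfile.TemporaryDirectory() as directory:
    target = os.path.join(directory, "data.bin")
    first = download_sketch("some-uri", target, lambda uri: b"v1")
    second = download_sketch("some-uri", target, lambda uri: b"v2")
```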
-
-
class
podium.storage.
SCPDownloader
¶ Bases:
podium.storage.resources.downloader.BaseDownloader
Class for downloading a file from a server using SFTP on top of the SSH protocol.
-
USER_NAME_KEY
¶ key for defining keyword argument for username
- Type
str
-
PASSWORD_KEY
¶ key for the keyword argument that defines the password. If the private key file uses a passphrase, the user should define it here
- Type
str, optional
-
HOST_ADDR_KEY
¶ key for defining keyword argument for remote host address
- Type
str
-
PRIVATE_KEY_FILE_KEY
¶ key for the keyword argument that defines the private key location. If the user uses the default Linux private key location, this argument can be set to None
- Type
str, optional
-
classmethod
download
(uri, path, overwrite=False, **kwargs)¶ Method downloads a file from the remote machine and saves it to the local path. If overwrite is True and the given path already exists, it will be overwritten with the new file.
- Parameters
uri (str) – URI of the file on remote machine
path (str) – path of the file on local machine
overwrite (bool) – if true and given path exists downloaded file will overwrite existing files
kwargs (dict(str, str)) – key word arguments that are described in class attributes used for connecting to the remote machine
- Returns
rewrite_status – True if download was successful or False if the file already exists and given overwrite value was False.
- Return type
bool
- Raises
ValueError – If given uri or path are None, or if the host is not defined.
RuntimeError – If there was an error while obtaining resource from uri.
-
-
class
podium.storage.
HttpDownloader
¶ Bases:
podium.storage.resources.downloader.BaseDownloader
Interface for downloader that uses http protocol for data transfer.
-
class
podium.storage.
SimpleHttpDownloader
¶ Bases:
podium.storage.resources.downloader.HttpDownloader
Downloader that uses the HTTP protocol for downloading. It doesn’t offer content confirmation (as needed, for example, by Google Drive) or any kind of authentication.
-
classmethod
download
(uri, path, overwrite=False, **kwargs)¶ Downloads a file from the given URI to the given path. If overwrite is True and the given path already exists, it will be overwritten with the new file.
- Parameters
uri (str) – URI of file that needs to be downloaded
path (str) – destination path where to save downloaded file
overwrite (bool) – if true and given path exists downloaded file will overwrite existing files
- Returns
rewrite_status – True if download was successful or False if the file already exists and given overwrite value was False.
- Return type
bool
- Raises
ValueError – if given uri or path are None
RuntimeError – if there was an error while obtaining resource from uri
-
-
class
podium.storage.
Field
(name, tokenizer='split', language='en', vocab=None, tokenize=True, store_as_raw=False, store_as_tokenized=False, eager=True, is_numericalizable=True, custom_numericalize=None, batch_as_matrix=True, padding_token=-999, is_target=False, fixed_length=None, allow_missing_data=False, missing_data_token=-1)¶ Bases:
object
Holds the preprocessing and numericalization logic for a single field of a dataset.
-
__getstate__
()¶ Method obtains field state. It is used for pickling dataset data to file.
- Returns
state – dataset state dictionary
- Return type
dict
-
__setstate__
(state)¶ Method sets field state. It is used for unpickling dataset data from file.
- Parameters
state (dict) – dataset state dictionary
-
add_posttokenize_hook
(hook)¶ Add a post-tokenization hook to the Field. If multiple hooks are added to the field, the order of their execution will be the same as the order in which they were added to the field, each subsequent hook taking the output of the previous hook as its input. If the same function is added to the Field as a hook multiple times, it will be executed that many times. Post-tokenization hooks are called only if the Field is sequential (in non-sequential fields there is no tokenization and only pre-tokenization hooks are called). The output of the final post-tokenization hook are the raw and tokenized data that the preprocess function will use to produce its result.
- Posttokenize hooks have the following outline:
- def post_tok_hook(raw_data, tokenized_data):
      raw_out, tokenized_out = do_stuff(raw_data, tokenized_data)
      return raw_out, tokenized_out
where ‘tokenized_data’ is, and ‘tokenized_out’ should be, an iterable.
- Parameters
hook (callable) – The post-tokenization hook that we want to add to the field.
- Raises
If field is declared as non numericalizable. –
-
add_pretokenize_hook
(hook)¶ Add a pre-tokenization hook to the Field. If multiple hooks are added to the field, the order of their execution will be the same as the order in which they were added to the field, each subsequent hook taking the output of the previous hook as its input. If the same function is added to the Field as a hook multiple times, it will be executed that many times. The output of the final pre-tokenization hook is the raw data that the tokenizer will get as its input.
- Pretokenize hooks have the following signature:
- def pre_tok_hook(raw_data):
      raw_data_out = do_stuff(raw_data)
      return raw_data_out
This can be used to eliminate encoding errors in data, replace numbers and names, etc.
- Parameters
hook (callable) – The pre-tokenization hook that we want to add to the field.
-
finalize
()¶ Signals that this field’s vocab can be built.
-
property
finalized
¶ Returns whether the field’s Vocab was finalized. If the field has no vocab, returns True.
- Returns
Whether the field’s Vocab was finalized. If the field has no vocab, returns True.
- Return type
bool
-
get_default_value
()¶ Method obtains default field value for missing data.
- Returns
The index of the missing data token, if this field is numericalizable. None value otherwise.
- Return type
missing_symbol index or None
- Raises
ValueError – If missing data is not allowed in this field.
-
get_numericalization_for_example
(example, cache=True)¶ Returns the numericalized data of this field for the provided example. The numericalized data is generated and cached in the example if ‘cache’ is true and the cached data is not already present. If already cached, the cached data is returned.
- Parameters
example (Example) – example to get numericalized data for.
cache (bool) – whether to cache the calculated numericalization if it is not already cached
- Returns
numericalized data – The numericalized data.
- Return type
numpy array
-
get_output_fields
()¶ Returns an Iterable of the contained output fields.
- Returns
an Iterable of the contained output fields.
- Return type
Iterable
-
numericalize
(data)¶ Numericalize the already preprocessed data point based either on the vocab that was previously built, or on a custom numericalization function, if the field doesn’t use a vocab.
- Parameters
data ((hashable, iterable(hashable))) – Tuple of (raw, tokenized) of preprocessed input data. If the field is sequential, ‘raw’ is ignored and can be None. Otherwise, ‘tokenized’ is ignored and can be None.
- Returns
Array of stoi indexes of the tokens, if data exists. None, if data is missing and missing data is allowed.
- Return type
numpy array
- Raises
ValueError – If data is None and missing data is not allowed in this field.
-
pad_to_length
(row, length, custom_pad_symbol=None, pad_left=False, truncate_left=False)¶ Either pads the given row with pad symbols, or truncates the row to be of given length. The vocab provides the pad symbol for all fields that have vocabs, otherwise the pad symbol has to be given as a parameter.
- Parameters
row (np.ndarray) – The row of numericalized data that is to be padded / truncated.
length (int) – The desired length of the row.
custom_pad_symbol (int) – The pad symbol that is to be used if the field doesn’t have a vocab. If the field has a vocab, this parameter is ignored and can be None.
pad_left (bool) – If True padding will be done on the left side, otherwise on the right side. Default: False.
truncate_left (bool) – If True, the field will be truncated on the left side, otherwise on the right side. Default: False.
- Raises
ValueError – If the field doesn’t use a vocab and no custom pad symbol was given.
-
preprocess
(data)¶ Preprocesses raw data: tokenizes it if the field is sequential, updates the vocab if the field is eager, and preserves the raw data if the field’s ‘store_as_raw’ attribute is True.
- Parameters
data (str or iterable(hashable)) – The raw data that needs to be preprocessed. String if ‘store_as_raw’ and/or ‘tokenize’ attributes are True. iterable(hashable) if store_as_tokenized attribute is True.
- Returns
A tuple of (raw, tokenized). If the field’s ‘store_as_raw’ attribute is False, then ‘raw’ will be None (we don’t preserve the raw data). If field’s ‘tokenize’ and ‘store_as_tokenized’ attributes are False then ‘tokenized’ will be None. The attributes ‘store_as_raw’, ‘store_as_tokenized’ and ‘tokenize’ will never all be False, so the function will never return (None, None).
- Return type
(str, Iterable(hashable))
- Raises
If data is None and missing data is not allowed. –
-
remove_posttokenize_hooks
()¶ Remove all the post-tokenization hooks that were added to the Field.
-
remove_pretokenize_hooks
()¶ Remove all the pre-tokenization hooks that were added to the Field.
-
update_vocab
(raw, tokenized)¶ Updates the vocab with a data point in its raw and tokenized form. If the field is sequential, the vocab is updated with the tokenized form (and ‘raw’ can be None), otherwise the raw form is used to update (and ‘tokenized’ can be None).
- Parameters
raw (hashable) – The raw form of the data point that the vocab is to be updated with. If the field is sequential, this parameter is ignored and can be None.
tokenized (iterable(hashable)) – The tokenized form of the data point that the vocab is to be updated with. If the field is NOT sequential (‘store_as_tokenized’ and ‘tokenize’ attributes are False), this parameter is ignored and can be None.
-
property
use_vocab
¶ A flag that tells whether the field uses a vocab or not.
- Returns
Whether the field uses a vocab or not.
- Return type
bool
-
-
class
podium.storage.
TokenizedField
(name, vocab=None, eager=True, custom_numericalize=None, batch_as_matrix=True, padding_token=-999, is_target=False, fixed_length=None, allow_missing_data=False, missing_data_token=-1)¶ Bases:
podium.storage.field.Field
Tokenized version of the Field. Holds the preprocessing and numericalization logic for the pre-tokenized dataset fields.
-
class
podium.storage.
LabelField
(name, vocab=None, eager=True, custom_numericalize=None, batch_as_matrix=True, allow_missing_data=False, missing_data_token=-1)¶ Bases:
podium.storage.field.Field
-
class
podium.storage.
MultilabelField
(name, num_of_classes=None, vocab=None, eager=True, custom_numericalize=None, batch_as_matrix=True, allow_missing_data=False, missing_data_token=-1)¶ Bases:
podium.storage.field.TokenizedField
Class used for storing pre-tokenized labels. Used for multilabeled datasets.
-
finalize
()¶ Signals that this field’s vocab can be built.
-
-
class
podium.storage.
MultioutputField
(output_fields, tokenizer='split', language='en')¶ Bases:
object
Field that does pretokenization and tokenization once and passes it to its output fields. Output fields are any type of field. The output fields are used only for posttokenization processing (posttokenization hooks and vocab updating).
-
add_output_field
(field)¶ Adds the passed field to this field’s output fields.
- Parameters
field (Field) – Field to add to output fields.
-
add_pretokenize_hook
(hook)¶ Add a pre-tokenization hook to the MultioutputField. If multiple hooks are added to the field, the order of their execution will be the same as the order in which they were added to the field, each subsequent hook taking the output of the previous hook as its input. If the same function is added to the Field as a hook multiple times, it will be executed that many times. The output of the final pre-tokenization hook is the raw data that the tokenizer will get as its input.
- Pretokenize hooks have the following signature:
    def pre_tok_hook(raw_data):
        raw_data_out = do_stuff(raw_data)
        return raw_data_out
This can be used to eliminate encoding errors in data, replace numbers and names, etc.
- Parameters
hook (callable) – The pre-tokenization hook that we want to add to the field.
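The chaining behaviour described above can be sketched without the library. This is a minimal illustration, not the library's implementation: `run_pretokenize_hooks`, `lowercase` and `mask_digits` are hypothetical names introduced here.

```python
from typing import Callable, List


def run_pretokenize_hooks(raw_data: str, hooks: List[Callable[[str], str]]) -> str:
    """Apply hooks in insertion order; each hook receives the previous hook's output."""
    for hook in hooks:
        raw_data = hook(raw_data)
    return raw_data


def lowercase(raw_data: str) -> str:
    # Hypothetical hook: normalise casing before tokenization.
    return raw_data.lower()


def mask_digits(raw_data: str) -> str:
    # Hypothetical hook: replace every digit with '0'.
    return "".join("0" if ch.isdigit() else ch for ch in raw_data)


# Adding the same hook twice means it is executed twice.
hooks = [lowercase, mask_digits, mask_digits]
result = run_pretokenize_hooks("Call 555 NOW", hooks)
print(result)
```

The value returned by the last hook is what the tokenizer would receive as its input.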
-
get_output_fields
()¶ Returns an Iterable of the contained output fields.
- Returns
an Iterable of the contained output fields.
- Return type
Iterable
-
remove_pretokenize_hooks
()¶ Remove all the pre-tokenization hooks that were added to the MultioutputField.
-
-
podium.storage.
unpack_fields
(fields)¶ Flattens the given fields object into a flat list of fields.
- Parameters
fields ((list | dict)) – List or dict that can contain nested tuples and None as values and column names as keys (dict).
- Returns
A flat list of Fields found in the given ‘fields’ object.
- Return type
list[Field]
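A rough sketch of the flattening idea, assuming that None entries are skipped and that nested tuples are expanded in order. `flatten_fields` is a hypothetical stand-in, and plain strings stand in for Field objects; the library's own function may handle more cases.

```python
def flatten_fields(fields):
    """Flatten a list or dict of (possibly nested) field entries into a flat list."""
    # For a dict, only the values are fields; the keys are column names.
    values = fields.values() if isinstance(fields, dict) else fields
    flat = []
    for item in values:
        if item is None:
            # Assumption: None marks a column that has no field and is skipped.
            continue
        if isinstance(item, (tuple, list)):
            flat.extend(flatten_fields(item))
        else:
            flat.append(item)
    return flat


# Strings are placeholders for Field instances.
fields = {"text": ("text_field", "chars_field"), "id": None, "label": "label_field"}
flat = flatten_fields(fields)
print(flat)
```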
-
class
podium.storage.
LargeResource
(**kwargs)¶ Bases:
object
Large resource that needs to download files from URL. Class also supports archive decompression.
-
BASE_RESOURCE_DIR
¶ base large files directory path
- Type
str
-
RESOURCE_NAME
¶ key for defining resource directory name parameter
- Type
str
-
URL
¶ key for defining resource url parameter
- Type
str
-
ARCHIVE
¶ key for defining archiving method parameter
- Type
str
-
SUPPORTED_ARCHIVE
¶ list of supported archive file types
- Type
list(str)
-
-
class
podium.storage.
SCPLargeResource
(**kwargs)¶ Bases:
podium.storage.resources.large_resource.LargeResource
Large resource that downloads files from a URI using the scp protocol. All other functionality is inherited from the LargeResource class.
-
SCP_HOST_KEY
¶ key for keyword argument that defines remote host address
- Type
str
-
SCP_USER_KEY
¶ key for keyword argument that defines remote host username
- Type
str
-
SCP_PASS_KEY
¶ key for keyword argument that defines remote host password or passphrase used in private key
- Type
str, optional
-
SCP_PRIVATE_KEY
¶ key for keyword argument that defines the location of the private key. On Linux it can be omitted if the key is in the default location.
- Type
str, optional
-
-
class
podium.storage.
VectorStorage
(path, default_vector_function=None, cache_path=None, max_vectors=None)¶ Bases:
abc.ABC
Interface for classes that can vectorize tokens. One example of such a vectorizer is word2vec.
-
abstract
__len__
()¶ Method returns number of vectors in vector storage.
- Returns
len – number of loaded vectors in vector storage
- Return type
int
-
get_embedding_matrix
(vocab=None)¶ Method constructs embedding matrix.
Note: Since Python 3.6 (CPython) dictionaries preserve insertion order, and from Python 3.7 this is guaranteed by the language: https://docs.python.org/3.6/whatsnew/3.6.html#other-language-changes
- Parameters
vocab (iter(token)) – collection of tokens for which the embedding matrix is created. The default use case is to pass a vocab or its itos list, or None to retrieve all loaded vectors. If None is passed, the order of vectors is the same as the insertion order of loaded vectors in VectorStorage.
- Raises
RuntimeError – If vector storage is not initialized.
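The ordering rule can be illustrated with a plain dict standing in for the loaded vectors. `build_embedding_matrix` is a hypothetical helper written for this sketch; it returns nested lists rather than whatever array type the library produces.

```python
def build_embedding_matrix(vectors, vocab=None):
    """Return one row per token, ordered by vocab, or by insertion order if vocab is None."""
    tokens = list(vectors) if vocab is None else list(vocab)
    return [vectors[token] for token in tokens]


# A dict preserves insertion order, which mimics the order in which vectors were loaded.
loaded = {"cat": [0.1, 0.2], "dog": [0.3, 0.4], "fish": [0.5, 0.6]}

all_rows = build_embedding_matrix(loaded)                      # insertion order
vocab_rows = build_embedding_matrix(loaded, ["fish", "cat"])   # vocab order
print(all_rows)
print(vocab_rows)
```

Row i of the result corresponds to token i of the vocab, so the matrix can be indexed directly with numericalized tokens.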
-
abstract
get_vector_dim
()¶ Method returns vector dimension.
- Returns
dim – vector dimension
- Return type
int
- Raises
RuntimeError – if vector storage is not initialized
-
abstract
load_all
()¶ Method loads all vectors stored in instance path to the vectors.
- Raises
IOError – If there was a problem while reading vectors from instance path.
ValueError – If instance path is not a valid path.
RuntimeError – If different vector size is detected while loading vectors.
-
abstract
load_vocab
(vocab)¶ Method loads vectors for tokens in vocab stored in given path to the instance.
- Parameters
vocab (iterable object) – vocabulary with unique words
- Raises
IOError – If there was a problem while reading vectors from instance path.
ValueError – If given path is not a valid path, if given vocab is None, or if the vector values in vector storage cannot be cast to float.
RuntimeError – If different vector size is detected while loading vectors.
-
abstract
token_to_vector
(token)¶ Method obtains vector for given token.
- Parameters
token (str) – token from vocabulary
- Returns
vector – vector representation of given token
- Return type
array_like
- Raises
KeyError – If given token doesn’t have vector representation and default vector function is not defined (None).
ValueError – If given token is None.
RuntimeError – If vector storage is not initialized.
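The lookup and fallback rules listed under Raises can be sketched as follows. This is an illustration under assumptions: `lookup_vector` and `zeros_default_vector` are hypothetical, the `(token, dim)` signature of the default vector function is assumed, and the initialization check is left out.

```python
def zeros_default_vector(token, dim):
    # Hypothetical default vector function: an all-zero vector for unknown tokens.
    return [0.0] * dim


def lookup_vector(vectors, token, dim, default_vector_function=None):
    """Return the stored vector, or fall back to the default vector function."""
    if token is None:
        raise ValueError("token must not be None")
    if token in vectors:
        return vectors[token]
    if default_vector_function is None:
        # No stored vector and no fallback: the token cannot be vectorized.
        raise KeyError(token)
    return default_vector_function(token, dim)


vectors = {"cat": [1.0, 2.0]}
known = lookup_vector(vectors, "cat", 2)
unknown = lookup_vector(vectors, "dog", 2, zeros_default_vector)
print(known, unknown)
```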
-
abstract
-
class
podium.storage.
BasicVectorStorage
(path, default_vector_function=<function random_normal_default_vector>, cache_path=None, max_vectors=None, encoding='utf-8', binary=True)¶ Bases:
podium.storage.vectorizers.vectorizer.VectorStorage
Basic implementation of VectorStorage that handles loading vectors from system storage.
-
_vectors
¶ dictionary offering word to vector mapping
- Type
dict
-
_dim
¶ vector dimension
- Type
int
-
_initialized
¶ has the vector storage been initialized by loading vectors
- Type
bool
-
_binary
¶ if True, the file is read as a binary file. Else, it’s read as a plain utf-8 text file.
- Type
bool
-
get_vector_dim
()¶ Method returns vector dimension.
- Returns
dim – vector dimension
- Return type
int
- Raises
RuntimeError – if vector storage is not initialized
-
load_all
()¶ Method loads all vectors stored in instance path to the vectors.
- Raises
IOError – If there was a problem while reading vectors from instance path.
ValueError – If instance path is not a valid path.
RuntimeError – If different vector size is detected while loading vectors.
-
load_vocab
(vocab)¶ Method loads vectors for tokens in vocab stored in given path to the instance.
- Parameters
vocab (iterable object) – vocabulary with unique words
- Raises
IOError – If there was a problem while reading vectors from instance path.
ValueError – If given path is not a valid path, if given vocab is None, or if the vector values in vector storage cannot be cast to float.
RuntimeError – If different vector size is detected while loading vectors.
-
token_to_vector
(token)¶ Method obtains vector for given token.
- Parameters
token (str) – token from vocabulary
- Returns
vector – vector representation of given token
- Return type
array_like
- Raises
KeyError – If given token doesn’t have vector representation and default vector function is not defined (None).
ValueError – If given token is None.
RuntimeError – If vector storage is not initialized.
-
-
class
podium.storage.
SpecialVocabSymbols
¶ Bases:
enum.Enum
Class for special vocabulary symbols
-
UNK
¶ Tag for unknown word
- Type
str
-
PAD
¶ Tag for padding symbol
- Type
str
-
-
class
podium.storage.
Vocab
(max_size=None, min_freq=1, specials=(<SpecialVocabSymbols.UNK: '<unk>'>, <SpecialVocabSymbols.PAD: '<pad>'>), keep_freqs=False)¶ Bases:
object
Class for storing vocabulary. It supports frequency counting and size limiting.
-
finalized
¶ true if the vocab is finalized, false otherwise
- Type
bool
-
itos
¶ list of words
- Type
list
-
stoi
¶ mapping from word string to index
- Type
dict
-
__add__
(values: Union[Vocab, Iterable])¶ Adds another vocabulary or a set of values to this vocabulary and returns the result as a new Vocab.
If max_size is None for either of the two Vocabs, the max_size of the resulting Vocab will also be None. If they are both defined, the max_size of the resulting Vocab will be the sum of max_sizes.
- Parameters
values (Iterable or Vocab) – If Vocab, a new Vocab will be created containing all of the special symbols and tokens from both Vocabs. If Iterable, a new Vocab will be returned containing a copy of this Vocab with the iterables’ tokens added.
- Returns
Returns a new Vocab
- Return type
Vocab
- Raises
RuntimeError – If this vocab is finalized and an attempt is made to add values to it, or if one of the Vocabs is finalized and the other is not.
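The max_size rule for adding two Vocabs is small enough to state as a function. `combined_max_size` is a hypothetical helper used only to make the rule concrete.

```python
def combined_max_size(left_max_size, right_max_size):
    """max_size of the sum of two vocabs: None if either is None, else the sum."""
    if left_max_size is None or right_max_size is None:
        return None
    return left_max_size + right_max_size


print(combined_max_size(100, 50))
print(combined_max_size(None, 50))
```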
-
__eq__
(other)¶ Two vocabs are equal if they have the same finalization status, the same stoi and itos mappings, and the same frequency counters.
- Parameters
other (object) – object to compare this vocab with for equality
- Returns
equal – true if the two vocabs are equal, false otherwise
- Return type
bool
-
__getitem__
(token)¶ Returns the token index of the passed token. If the passed token has no index and the vocab has an UNK special token, the UNK token index is returned. Otherwise, an exception is raised.
- Parameters
token (str) – token whose index is to be returned.
- Returns
stoi index of the token.
- Return type
int
- Raises
KeyError – If the passed token has no index and vocab has no UNK special token.
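A sketch of this lookup rule, using a plain dict as the stoi mapping. `token_index` is a hypothetical function, and representing the UNK symbol by the string '<unk>' is an assumption made for the example.

```python
def token_index(stoi, token, unk_token=None):
    """Return the index of token, falling back to the UNK index when one exists."""
    if token in stoi:
        return stoi[token]
    if unk_token is not None and unk_token in stoi:
        return stoi[unk_token]
    # No index for the token and no UNK symbol to fall back to.
    raise KeyError(token)


stoi_with_unk = {"<unk>": 0, "<pad>": 1, "cat": 2}
stoi_without_unk = {"cat": 0}

print(token_index(stoi_with_unk, "cat", "<unk>"))
print(token_index(stoi_with_unk, "dog", "<unk>"))
```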
-
__iadd__
(values: Union[Vocab, Iterable])¶ Adds additional values or another Vocab to this Vocab.
- Parameters
values (Iterable or Vocab) –
Values to be added to this Vocab. If Vocab, all of the token frequencies and specials from that Vocab will be added to this Vocab.
If Iterable, all of the tokens from the Iterable will be added to this Vocab, increasing the frequencies of those tokens.
- Returns
vocab – Returns current Vocab instance to enable chaining
- Return type
Vocab
- Raises
RuntimeError – If the current vocab is finalized, if ‘values’ is a string or if the RHS Vocab doesn’t contain token frequencies.
TypeError – If the values cannot be iterated over.
-
__iter__
()¶ Method returns an iterator over the vocabulary. If the vocabulary is not finalized, iteration is done over the frequency counter and special symbols are not included. Otherwise, iteration is performed over itos and special symbols are included.
- Returns
iterator over vocab tokens
- Return type
iter
-
__len__
()¶ Method calculates vocab length including special symbols.
- Returns
length – vocab size including special symbols
- Return type
int
-
finalize
()¶ Method finalizes vocab building. It also releases the frequency counter unless the vocab was set to keep frequencies.
- Raises
RuntimeError – If the vocab is already finalized.
-
get_freqs
()¶ Method obtains vocabulary frequencies.
- Returns
freq – mapping from each word to its frequency
- Return type
Counter
- Raises
RuntimeError – If the vocab is finalized and was set not to keep frequencies.
-
property
has_specials
¶ Property that checks if the vocabulary contains special symbols.
- Returns
flag – true if the vocabulary has special symbols, false otherwise.
- Return type
bool
-
numericalize
(data)¶ Method numericalizes given tokens.
- Parameters
data (iter(str)) – iterable collection of tokens
- Returns
numericalized_vector – numpy array of numericalized tokens
- Return type
array-like
- Raises
RuntimeError – If the vocabulary is not finalized.
-
padding_index
()¶ Method returns padding symbol index.
- Returns
pad_symbol_index – padding symbol index in the vocabulary
- Return type
int
- Raises
ValueError – If the padding symbol is not present in the vocabulary.
-
reverse_numericalize
(numericalized_data: Iterable)¶ Transforms an iterable containing numericalized data into a list of tokens. The tokens are read from this Vocab’s itos and no additional processing is done.
- Parameters
numericalized_data (Iterable) – data to be reverse numericalized
- Returns
a list of tokens
- Return type
list
- Raises
RuntimeError – If the vocabulary is not finalized.
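The round trip between tokens and indices can be sketched with an itos list and the stoi dict derived from it. The functions below are simplified stand-ins: they return plain lists rather than numpy arrays, and mapping unknown tokens to the '<unk>' index is assumed from the lookup behaviour of __getitem__.

```python
# Stand-in for a finalized vocab: itos is a list of tokens, stoi its inverse mapping.
itos = ["<unk>", "<pad>", "the", "cat"]
stoi = {token: index for index, token in enumerate(itos)}


def numericalize(tokens):
    """Map tokens to indices; unknown tokens fall back to the '<unk>' index."""
    return [stoi.get(token, stoi["<unk>"]) for token in tokens]


def reverse_numericalize(indices):
    """Map indices back to tokens by reading them from itos."""
    return [itos[index] for index in indices]


indices = numericalize(["the", "dog", "cat"])
tokens = reverse_numericalize(indices)
print(indices)
print(tokens)
```

The round trip is lossy for out-of-vocabulary tokens: 'dog' comes back as '<unk>'.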
-
-
class
podium.storage.
ExampleFactory
(fields)¶ Bases:
object
Class used to create Example instances. Every ExampleFactory dynamically creates its own example class definition optimised for the fields provided in __init__.
-
create_empty_example
()¶ Method creates empty example with field names stored in example factory.
- Returns
example – empty Example instance with initialized field names
- Return type
Example
-
from_csv
(data, field_to_index=None, delimiter=', ')¶ Creates an Example from a CSV line and a corresponding list or dict of Fields.
- Parameters
data (str) – A string containing a single row of values separated by the given delimiter.
field_to_index (dict) – A dict that maps column names to their indices in the line of data. Only needed if fields is a dict, otherwise ignored.
delimiter (str) – The delimiter that separates the values in the line of data.
- Returns
An Example whose attributes are the given Fields created with the given column values. These Fields can be accessed by their names.
- Return type
Example
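The role of field_to_index can be shown with a small stand-alone helper. `row_to_named_values` is hypothetical and only covers splitting a row and picking columns by name; it does not create Example instances or handle quoted values.

```python
def row_to_named_values(data, field_to_index, delimiter=","):
    """Split one CSV row and pick the column for each named field."""
    columns = data.split(delimiter)
    return {name: columns[index] for name, index in field_to_index.items()}


row = "42,great movie,positive"
# Column 0 (an id) has no field, so it is simply not listed in the mapping.
named = row_to_named_values(row, {"text": 1, "label": 2})
print(named)
```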
-
from_dict
(data)¶ Method creates example from data in dictionary format.
- Parameters
data (dict(str, object)) – dictionary that maps field name to field value
- Returns
example – example instance with given data saved to fields
- Return type
Example
-
from_fields_tree
(data, subtrees=False, label_transform=None)¶ Creates an Example (or multiple Examples) from a string representing an nltk tree and a list of corresponding values.
- Parameters
data (str) – A string containing an nltk tree whose values are to be mapped to Fields.
subtrees (bool) – A flag denoting whether an example will be created from every subtree in the tree (when set to True), or just from the whole tree (when set to False).
label_transform (callable) – A function which converts the tree labels to a string representation, if wished. Useful for converting multiclass tasks to binary (SST) and making labels verbose. If None, the labels are not changed.
- Returns
If subtrees was False, returns an Example whose attributes are the given Fields created with the given column values. These Fields can be accessed by their names.
If subtrees was True, returns a list of such Examples for every subtree in the given tree.
- Return type
(Example | list)
-
from_json
(data)¶ Creates an Example from a JSON object and the corresponding fields.
- Parameters
data (str) – A string containing a single JSON object (key-value pairs surrounded by curly braces).
- Returns
An Example whose attributes are the given Fields created with the given column values. These Fields can be accessed by their names.
- Return type
Example
- Raises
ValueError – If JSON doesn’t contain key name.
-
from_list
(data)¶ Method creates example from data in list format.
- Parameters
data (list) – list containing values for fields in order that the fields were given to example factory
- Returns
example – example instance with given data saved to fields
- Return type
Example
-
from_xml_str
(data)¶ Method creates an Example from an XML string.
- Parameters
data (str) – XML formatted string that contains the values of a single data instance that are to be mapped to Fields.
- Returns
An Example whose attributes are the given Fields created with the given column values. These Fields can be accessed by their names.
- Return type
Example
- Raises
ValueError – If the name is not contained in the xml string.
ParseError – If there was a problem while parsing the XML string (invalid XML).
-
-
class
podium.storage.
ExampleFormat
¶ Bases:
enum.Enum
An enumeration.
-
class
podium.storage.
TfIdfVectorizer
(vocab=None, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False, specials=None)¶ Bases:
podium.storage.vectorizers.tfidf.CountVectorizer
Class converts data from one field in examples to a matrix of tf-idf features. It is equivalent to the scikit-learn TfidfVectorizer available at https://scikit-learn.org. The class is dependent on the TfidfTransformer defined in the scikit-learn library.
-
fit
(dataset, field)¶ Learn idf from dataset on data in given field.
- Parameters
dataset (Dataset) – dataset instance containing data on which to build idf matrix
field (Field) – which field in dataset to use for tfidf
- Returns
self
- Return type
TfIdfVectorizer
- Raises
ValueError – If dataset or field are None, or if the name of the given field is not in the dataset.
-
transform
(examples, **kwargs)¶ Transforms examples to example-term matrix. Uses vocabulary that is given in constructor.
- Parameters
examples (iterable) – an iterable which yields arrays with numericalized tokens
- Returns
X – Tf-idf weighted document-term matrix
- Return type
sparse matrix, [n_samples, n_features]
- Raises
ValueError – If examples are None.
RuntimeError – If vectorizer is not fitted yet.
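Since the class is described as equivalent to scikit-learn's TfidfVectorizer, the weighting for the default settings shown in the signature (norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False) can be sketched in pure Python. `tfidf_rows` is a hypothetical helper that works on dense lists of term counts rather than on a dataset or a sparse matrix.

```python
import math


def tfidf_rows(count_rows):
    """Weight term counts by smoothed idf and L2-normalise each row."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: in how many rows each term occurs at least once.
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    # Smoothed idf as used by scikit-learn: ln((1 + n) / (1 + df)) + 1.
    idf = [math.log((1 + n_docs) / (1 + df_j)) + 1.0 for df_j in df]
    weighted = []
    for row in count_rows:
        scaled = [count * weight for count, weight in zip(row, idf)]
        norm = math.sqrt(sum(value * value for value in scaled))
        weighted.append([value / norm if norm > 0 else 0.0 for value in scaled])
    return weighted


# Two documents over a three-term vocabulary; each row holds raw term counts.
counts = [[2, 1, 0], [1, 0, 1]]
matrix = tfidf_rows(counts)
for row in matrix:
    print(row)
```

Each output row has unit Euclidean length, and terms that occur in fewer documents receive a larger idf weight.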
-