podium.storage.vectorizers package¶
Submodules¶
podium.storage.vectorizers.tfidf module¶
Module contains classes related to creating tfidf vectors from examples.
-
class
podium.storage.vectorizers.tfidf.
CountVectorizer
(vocab=None, specials=None)¶ Bases:
object
Converts data from one field of the examples into a matrix of bag-of-words features. It is equivalent to scikit-learn's CountVectorizer, available at https://scikit-learn.org.
-
fit
(dataset, field)¶ Method initializes count vectorizer.
- Parameters
dataset (Dataset, optional) – dataset instance which contains field
field (Field, optional) – the field in the dataset whose vocab is used; if None, the vocab given in the constructor is used
- Returns
self
- Return type
CountVectorizer
- Raises
ValueError – If both the vocab and the field's vocab are None
-
transform
(examples, **kwargs)¶ Method transforms given examples to count matrix where rows are examples and columns represent token counts.
- Parameters
examples (iterable) – an iterable which yields array with numericalized tokens or list of examples
tokens_tensor (bool, optional) – if True, the method expects examples to be a tensor of numericalized values; otherwise it expects a list of examples (which can in fact be a dataset) and a field for numericalization
field (Field, optional) – if tokens_tensor is False, the method expects a reference to the field that is used for numericalization
- Raises
ValueError – If the arguments are invalid: examples is None, or the field is not provided and the given examples are not in token tensor format.
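The count matrix described above can be illustrated with a minimal pure-Python sketch. It does not use the podium API; the function name `count_matrix` and the `vocab_size` parameter are hypothetical, and the input is assumed to be already numericalized (lists of token indices).

```python
from collections import Counter


def count_matrix(numericalized_examples, vocab_size):
    """Return a dense count matrix: one row per example, one column per token index.

    Hypothetical helper for illustration only, not part of podium.
    """
    matrix = []
    for token_ids in numericalized_examples:
        counts = Counter(token_ids)
        # Column j holds how often token index j occurs in this example.
        matrix.append([counts.get(index, 0) for index in range(vocab_size)])
    return matrix


# Two numericalized examples over a vocabulary of four tokens.
examples = [[0, 2, 2], [1, 1, 1, 3]]
matrix = count_matrix(examples, vocab_size=4)
```

Here the first row counts token 0 once and token 2 twice. A real implementation would typically return a sparse matrix rather than nested lists.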
-
-
class
podium.storage.vectorizers.tfidf.
TfIdfVectorizer
(vocab=None, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False, specials=None)¶ Bases:
podium.storage.vectorizers.tfidf.CountVectorizer
Converts data from one field of the examples into a matrix of tf-idf features. It is equivalent to scikit-learn's TfidfVectorizer, available at https://scikit-learn.org. The class depends on the TfidfTransformer defined in the scikit-learn library.
-
fit
(dataset, field)¶ Learn idf from dataset on data in given field.
- Parameters
dataset (Dataset) – dataset instance containing the data on which to build the idf matrix
field (Field) – which field in dataset to use for tfidf
- Returns
self
- Return type
TfIdfVectorizer
- Raises
ValueError – If dataset or field is None, or if the name of the given field is not in the dataset.
-
transform
(examples, **kwargs)¶ Transforms examples to example-term matrix. Uses vocabulary that is given in constructor.
- Parameters
examples (iterable) – an iterable which yields arrays of numericalized tokens
- Returns
X – Tf-idf weighted document-term matrix
- Return type
sparse matrix, [n_samples, n_features]
- Raises
ValueError – If examples are None.
RuntimeError – If vectorizer is not fitted yet.
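As a rough illustration of the weighting behind the constructor defaults (`norm='l2'`, `use_idf=True`, `smooth_idf=True`), the sketch below applies scikit-learn's smoothed idf formula, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 row normalization. It is a pure-Python stand-in operating on a dense count matrix, not the podium or scikit-learn implementation; the name `tfidf_rows` is hypothetical.

```python
import math


def tfidf_rows(count_rows, sublinear_tf=False):
    """Weight a dense count matrix with smoothed idf and L2-normalize each row.

    Hypothetical helper for illustration only.
    """
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: number of rows in which each term occurs.
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    # Smoothed idf, as documented for scikit-learn's TfidfTransformer.
    idf = [math.log((1 + n_docs) / (1 + d)) + 1.0 for d in df]

    result = []
    for row in count_rows:
        # Optional sublinear scaling replaces tf with 1 + ln(tf).
        tf = [(1.0 + math.log(c)) if (sublinear_tf and c > 0) else float(c) for c in row]
        weighted = [t * w for t, w in zip(tf, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm for v in weighted] if norm > 0 else weighted)
    return result


counts = [[1, 0, 2, 0], [0, 3, 0, 1]]
rows = tfidf_rows(counts)
```

Each non-empty row of the result has unit Euclidean length, so rows can be compared with a plain dot product.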
-
podium.storage.vectorizers.vectorizer module¶
Module vectorizer offers classes for vectorizing tokens. The interface that concrete vectorizers implement is defined by the VectorStorage class.
-
class
podium.storage.vectorizers.vectorizer.
BasicVectorStorage
(path, default_vector_function=<function random_normal_default_vector>, cache_path=None, max_vectors=None, encoding='utf-8', binary=True)¶ Bases:
podium.storage.vectorizers.vectorizer.VectorStorage
Basic implementation of VectorStorage that handles loading vectors from system storage.
-
_vectors
¶ dictionary offering word to vector mapping
- Type
dict
-
_dim
¶ vector dimension
- Type
int
-
_initialized
¶ has the vector storage been initialized by loading vectors
- Type
bool
-
_binary
¶ if True, the file is read as a binary file. Else, it’s read as a plain utf-8 text file.
- Type
bool
-
get_vector_dim
()¶ Method returns vector dimension.
- Returns
dim – vector dimension
- Return type
int
- Raises
RuntimeError – if vector storage is not initialized
-
load_all
()¶ Method loads all vectors stored at the instance path into the vector storage.
- Raises
IOError – If there was a problem while reading vectors from instance path.
ValueError – If instance path is not a valid path.
RuntimeError – If different vector size is detected while loading vectors.
-
load_vocab
(vocab)¶ Method loads, from the instance path, only the vectors for tokens in the given vocab.
- Parameters
vocab (iterable object) – vocabulary with unique words
- Raises
IOError – If there was a problem while reading vectors from instance path.
ValueError – If the given path is not a valid path, if the given vocab is None, or if the vector values in the vector storage cannot be cast to float.
RuntimeError – If different vector size is detected while loading vectors.
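The sketch below illustrates the general idea of loading only the vectors needed for a vocabulary, with the same error conditions as listed above. It assumes a plain-text format with one `token v1 v2 ...` entry per line, which is an assumption and not necessarily the format BasicVectorStorage reads; the function name `load_vectors_for_vocab` is hypothetical.

```python
import io


def load_vectors_for_vocab(lines, vocab):
    """Collect vectors for tokens in vocab from whitespace-separated text lines.

    Hypothetical helper for illustration only, not the podium implementation.
    """
    if vocab is None:
        raise ValueError("vocab must not be None")
    wanted = set(vocab)
    vectors = {}
    dim = None
    for line in lines:
        parts = line.rstrip().split(" ")
        token, values = parts[0], parts[1:]
        if token not in wanted:
            continue
        # float() raises ValueError if a value cannot be cast to float.
        vector = [float(value) for value in values]
        if dim is None:
            dim = len(vector)
        elif len(vector) != dim:
            raise RuntimeError(f"expected dimension {dim}, got {len(vector)} for {token!r}")
        vectors[token] = vector
    return vectors, dim


# An in-memory file stands in for the instance path.
source = io.StringIO("cat 0.1 0.2\ndog 0.3 0.4\nfish 0.5 0.6\n")
vectors, dim = load_vectors_for_vocab(source, vocab=["cat", "fish"])
```

Skipping tokens outside the vocabulary keeps memory use proportional to the vocabulary size rather than to the size of the vector file.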
-
token_to_vector
(token)¶ Method obtains vector for given token.
- Parameters
token (str) – token from vocabulary
- Returns
vector – vector representation of given token
- Return type
array_like
- Raises
KeyError – If given token doesn’t have vector representation and default vector function is not defined (None).
ValueError – If given token is None.
RuntimeError – If vector storage is not initialized.
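The lookup and fallback behavior documented for token_to_vector can be sketched with a small in-memory class. This is an illustration under assumptions, not the BasicVectorStorage source: the class name `InMemoryVectorStorage` and its constructor arguments are hypothetical, and only the error conditions follow the description above.

```python
class InMemoryVectorStorage:
    """Hypothetical stand-in that mimics the documented token_to_vector contract."""

    def __init__(self, vectors, dim, default_vector_function=None, initialized=True):
        self._vectors = dict(vectors)
        self._dim = dim
        self._default_vector_function = default_vector_function
        self._initialized = initialized

    def token_to_vector(self, token):
        if not self._initialized:
            raise RuntimeError("vector storage is not initialized")
        if token is None:
            raise ValueError("token must not be None")
        if token in self._vectors:
            return self._vectors[token]
        if self._default_vector_function is None:
            # No stored vector and no fallback: the token cannot be vectorized.
            raise KeyError(token)
        # The fallback receives the token and the vector dimension.
        return self._default_vector_function(token, self._dim)


storage = InMemoryVectorStorage(
    {"cat": [0.1, 0.2]},
    dim=2,
    default_vector_function=lambda token, dim: [0.0] * dim,
)
known = storage.token_to_vector("cat")
unknown = storage.token_to_vector("dog")
```

With a default vector function set, out-of-vocabulary tokens receive a generated vector; with None, the same lookup raises KeyError.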
-
-
class
podium.storage.vectorizers.vectorizer.
VectorStorage
(path, default_vector_function=None, cache_path=None, max_vectors=None)¶ Bases:
abc.ABC
Interface for classes that can vectorize tokens. One example of such a vectorizer is word2vec.
-
abstract
__len__
()¶ Method returns number of vectors in vector storage.
- Returns
len – number of loaded vectors in vector storage
- Return type
int
-
get_embedding_matrix
(vocab=None)¶ Method constructs embedding matrix.
Note: From python 3.6 dictionaries preserve insertion order https://docs.python.org/3.6/whatsnew/3.6.html#other-language-changes
- Parameters
vocab (iter(token)) – collection of tokens for which to create the embedding matrix. The default use case is to pass a vocab or an itos list, or None to retrieve all loaded vectors. If None is passed, the order of vectors is the same as the insertion order of loaded vectors in VectorStorage.
- Raises
RuntimeError – If vector storage is not initialized.
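A minimal sketch of the row ordering described above: row i of the matrix holds the vector of the i-th token in vocab, and with vocab=None the rows follow the insertion order of the loaded vectors. The function `build_embedding_matrix` is hypothetical and operates on a plain dict. Unlike a storage with a default vector function, this sketch raises KeyError for tokens that have no vector.

```python
def build_embedding_matrix(vectors, vocab=None):
    """Stack vectors into a list of rows, ordered by vocab or by insertion order.

    Hypothetical helper for illustration only.
    """
    # Dictionaries preserve insertion order, so keys() yields load order.
    tokens = vectors.keys() if vocab is None else vocab
    return [list(vectors[token]) for token in tokens]


loaded = {"the": [1.0, 0.0], "cat": [0.0, 1.0]}
all_rows = build_embedding_matrix(loaded)
reordered = build_embedding_matrix(loaded, vocab=["cat", "the"])
```

Passing the itos list keeps matrix rows aligned with the token indices produced during numericalization.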
-
abstract
get_vector_dim
()¶ Method returns vector dimension.
- Returns
dim – vector dimension
- Return type
int
- Raises
RuntimeError – if vector storage is not initialized
-
abstract
load_all
()¶ Method loads all vectors stored at the instance path into the vector storage.
- Raises
IOError – If there was a problem while reading vectors from instance path.
ValueError – If instance path is not a valid path.
RuntimeError – If different vector size is detected while loading vectors.
-
abstract
load_vocab
(vocab)¶ Method loads, from the instance path, only the vectors for tokens in the given vocab.
- Parameters
vocab (iterable object) – vocabulary with unique words
- Raises
IOError – If there was a problem while reading vectors from instance path.
ValueError – If the given path is not a valid path, if the given vocab is None, or if the vector values in the vector storage cannot be cast to float.
RuntimeError – If different vector size is detected while loading vectors.
-
abstract
token_to_vector
(token)¶ Method obtains vector for given token.
- Parameters
token (str) – token from vocabulary
- Returns
vector – vector representation of given token
- Return type
array_like
- Raises
KeyError – If given token doesn’t have vector representation and default vector function is not defined (None).
ValueError – If given token is None.
RuntimeError – If vector storage is not initialized.
-
podium.storage.vectorizers.vectorizer.
random_normal_default_vector
(token, dim)¶ Draw a random vector from a standard normal distribution. Dimension of returned array is equal to given dim.
- Parameters
token (str) – string token from vocabulary
dim (int) – vector dimension
- Returns
vector – sampled from normal distribution with given dimension
- Return type
array-like
-
podium.storage.vectorizers.vectorizer.
zeros_default_vector
(token, dim)¶ Creates a default vector of zeros for the given token. Dimension of returned array is equal to given dim.
- Parameters
token (str) – string token from vocabulary
dim (int) – vector dimension
- Returns
vector – zeros vector with given dimension
- Return type
array-like
- Raises
If dim is None.
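Both default vector functions share the `(token, dim)` signature, so any callable with that shape can serve as a fallback. The sketch below shows standard-library stand-ins that return plain lists instead of arrays. The names `random_normal_vector` and `zeros_vector` are hypothetical, and raising ValueError for a missing dim is an assumption, since the exception type is not stated above.

```python
import random


def random_normal_vector(token, dim):
    """Sample each component from a standard normal distribution.

    The token is unused; it is accepted only to match the expected signature.
    """
    return [random.gauss(0.0, 1.0) for _ in range(dim)]


def zeros_vector(token, dim):
    """Return a vector of zeros of length dim."""
    if dim is None:
        # Assumed exception type; the original docs do not name it.
        raise ValueError("dim must not be None")
    return [0.0] * dim


sampled = random_normal_vector("unknown-token", 5)
zeros = zeros_vector("unknown-token", 3)
```

A zeros fallback maps every unknown token to the same point, while a random fallback gives each lookup a distinct vector.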