podium.storage.vectorizers package¶

Submodules¶

podium.storage.vectorizers.tfidf module¶

Module contains classes related to creating tfidf vectors from examples.

class podium.storage.vectorizers.tfidf.CountVectorizer(vocab=None, specials=None)¶

Bases: object

Class converts data from one field in examples to a matrix of bag-of-words features. It is equivalent to the scikit-learn CountVectorizer available at https://scikit-learn.org.

fit(dataset, field)¶

Method initializes count vectorizer.

Parameters

dataset (Dataset, optional) – dataset instance which contains the field
field (Field, optional) – field in the dataset whose vocab is used; if None, the vocab given in the constructor is used

Returns

self

Return type

CountVectorizer

Raises

ValueError – If the vocab or the field's vocab is None

transform(examples, **kwargs)¶

Method transforms the given examples to a count matrix where rows are examples and columns hold token counts.

Parameters

examples (iterable) – an iterable which yields arrays of numericalized tokens, or a list of examples
tokens_tensor (bool, optional) – if True, the method expects examples to be a tensor of numericalized values; otherwise it expects a list of examples (which can in fact be a dataset) and a field for numericalization
field (Field, optional) – if tokens_tensor is False, the method expects a reference to the field that is used for numericalization

Raises

ValueError – If the arguments are invalid: examples are None, or the field is not provided and the given examples are not in token tensor format.
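As a rough illustration of the count matrix described above, the following pure-Python sketch builds a dense matrix from already numericalized examples. It does not use podium: the function name count_matrix and the vocab_size parameter are hypothetical, and the real class may return a different (for example sparse) matrix type.

```python
def count_matrix(examples, vocab_size):
    """Return a dense count matrix: one row per example, one column per token index."""
    matrix = []
    for tokens in examples:
        row = [0] * vocab_size
        for index in tokens:
            row[index] += 1  # count each occurrence of the token index
        matrix.append(row)
    return matrix


# Two numericalized examples over a hypothetical 4-token vocabulary.
counts = count_matrix([[0, 2, 2], [1, 3, 3, 3]], vocab_size=4)
```

Each row of counts corresponds to one example, and the value in column j is the number of times token index j occurs in that example.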

class podium.storage.vectorizers.tfidf.TfIdfVectorizer(vocab=None, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False, specials=None)¶

Bases: podium.storage.vectorizers.tfidf.CountVectorizer

Class converts data from one field in examples to a matrix of tf-idf features. It is equivalent to the scikit-learn TfidfVectorizer available at https://scikit-learn.org. The class depends on the TfidfTransformer defined in the scikit-learn library.

fit(dataset, field)¶

Learn idf from dataset on data in given field.

Parameters

dataset (Dataset) – dataset instance containing data on which to build idf matrix
field (Field) – which field in dataset to use for tfidf

Returns

self

Return type

TfIdfVectorizer

Raises

ValueError – If dataset or field is None, or if the name of the given field is not in the dataset.

transform(examples, **kwargs)¶

Transforms examples to example-term matrix. Uses vocabulary that is given in constructor.

Parameters

examples (iterable) – an iterable which yields arrays of numericalized tokens

Returns

X – Tf-idf weighted document-term matrix

Return type

sparse matrix, [n_samples, n_features]

Raises

ValueError – If examples are None.
RuntimeError – If vectorizer is not fitted yet.
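To make the weighting concrete, here is a minimal pure-Python sketch of tf-idf with the constructor defaults shown above (norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False). It follows the smoothed idf formula documented for scikit-learn, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and works on a dense list of lists rather than a sparse matrix. The function name tfidf_matrix is hypothetical and this is not podium's implementation.

```python
import math


def tfidf_matrix(counts):
    """Weight a dense count matrix with smoothed idf and l2-normalize each row."""
    n_samples = len(counts)
    n_features = len(counts[0])
    # Document frequency: number of examples in which each token occurs.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_features)]
    # Smoothed idf, as documented for scikit-learn with smooth_idf=True.
    idf = [math.log((1 + n_samples) / (1 + d)) + 1 for d in df]
    result = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm > 0 else 0.0 for v in weighted])
    return result


# Token 2 occurs in both examples, so it receives the lowest idf.
weights = tfidf_matrix([[1, 0, 2], [0, 1, 1]])
```

After weighting, every non-empty row has unit l2 norm, and tokens that occur in fewer examples receive a larger idf factor.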

podium.storage.vectorizers.vectorizer module¶

Module vectorizer offers classes for vectorizing tokens. Interface of implemented concrete vectorizers is given in Vectorizer class.

class podium.storage.vectorizers.vectorizer.BasicVectorStorage(path, default_vector_function=<function random_normal_default_vector>, cache_path=None, max_vectors=None, encoding='utf-8', binary=True)¶

Bases: podium.storage.vectorizers.vectorizer.VectorStorage

Basic implementation of VectorStorage that handles loading vectors from system storage.

_vectors¶

dictionary offering word to vector mapping

Type: dict

_dim¶

vector dimension

Type: int

_initialized¶

whether the vector storage has been initialized by loading vectors

Type: bool

_binary¶

if True, the file is read as a binary file. Else, it’s read as a plain utf-8 text file.

Type: bool

get_vector_dim()¶

Method returns vector dimension.

Returns: dim – vector dimension
Return type: int
Raises: RuntimeError – if vector storage is not initialized

load_all()¶

Method loads all vectors stored at the instance path into the vector storage.

Raises

IOError – If there was a problem while reading vectors from instance path.
ValueError – If instance path is not a valid path.
RuntimeError – If different vector size is detected while loading vectors.

load_vocab(vocab)¶

Method loads into the instance the vectors for the tokens in vocab, read from the instance path.

Parameters

vocab (iterable object) – vocabulary with unique words

Raises

IOError – If there was a problem while reading vectors from instance path.
ValueError – If the given path is not a valid path, if the given vocab is None, or if the vector values in vector storage cannot be cast to float.
RuntimeError – If different vector size is detected while loading vectors.
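The loading behaviour and error conditions listed above can be sketched for a plain text format in which each line holds a token followed by its vector values. This is only an illustration under that assumption: the function name load_vectors_for_vocab is hypothetical, and the actual file format, caching and binary handling of BasicVectorStorage are not shown.

```python
import io


def load_vectors_for_vocab(lines, vocab):
    """Read 'token v1 v2 ...' lines and keep vectors only for tokens in vocab."""
    if vocab is None:
        raise ValueError("vocab must not be None")
    wanted = set(vocab)
    vectors = {}
    dim = None
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        token, values = parts[0], parts[1:]
        if dim is None:
            dim = len(values)  # first vector fixes the expected dimension
        elif len(values) != dim:
            raise RuntimeError("different vector size detected")
        if token in wanted:
            # float() raises ValueError when a value is not numeric.
            vectors[token] = [float(v) for v in values]
    return vectors, dim


source = io.StringIO("cat 0.1 0.2\ndog 0.3 0.4\nfish 0.5 0.6\n")
vectors, dim = load_vectors_for_vocab(source, vocab=["cat", "fish"])
```

Only the vectors for "cat" and "fish" are kept, while "dog" is read but discarded because it is not in the vocabulary.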

token_to_vector(token)¶

Method obtains vector for given token.

Parameters

token (str) – token from vocabulary

Returns

vector – vector representation of given token

Return type

array_like

Raises

KeyError – If given token doesn’t have vector representation and default vector function is not defined (None).
ValueError – If given token is None.
RuntimeError – If vector storage is not initialized.
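The lookup rules above (known token, fallback to the default vector function, and the three error cases) can be sketched with a small dictionary-backed class. The class DictVectorStorage is hypothetical and only mimics the documented behaviour; the order in which the real method performs its checks is an assumption.

```python
class DictVectorStorage:
    """Hypothetical dict-backed storage mimicking the documented lookup rules."""

    def __init__(self, vectors, dim, default_vector_function=None):
        self._vectors = vectors
        self._dim = dim
        self._default_vector_function = default_vector_function
        self._initialized = True

    def token_to_vector(self, token):
        if not self._initialized:
            raise RuntimeError("vector storage is not initialized")
        if token is None:
            raise ValueError("token must not be None")
        if token in self._vectors:
            return self._vectors[token]
        if self._default_vector_function is None:
            raise KeyError(token)  # no stored vector and no fallback
        return self._default_vector_function(token, self._dim)


storage = DictVectorStorage(
    {"cat": [0.1, 0.2]},
    dim=2,
    default_vector_function=lambda token, dim: [0.0] * dim,
)
known = storage.token_to_vector("cat")
unknown = storage.token_to_vector("unicorn")
```

Here "cat" resolves to its stored vector, while "unicorn" falls back to the default function and receives a zeros vector of the storage dimension.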

class podium.storage.vectorizers.vectorizer.VectorStorage(path, default_vector_function=None, cache_path=None, max_vectors=None)¶

Bases: abc.ABC

Interface for classes that can vectorize tokens. One example of such a vectorizer is word2vec.

abstract __len__()¶

Method returns number of vectors in vector storage.

Returns: len – number of loaded vectors in vector storage
Return type: int

get_embedding_matrix(vocab=None)¶

Method constructs embedding matrix.

Note: Dictionaries preserve insertion order as a CPython implementation detail in Python 3.6 and as a language guarantee from Python 3.7; see https://docs.python.org/3.6/whatsnew/3.6.html#other-language-changes

Parameters: vocab (iter(token)) – collection of tokens for which the embedding matrix is created. The default use case is to pass a vocab or an itos list, or None to retrieve all loaded vectors. If None is passed, the order of vectors is the same as the insertion order of loaded vectors in VectorStorage.
Raises: RuntimeError – If vector storage is not initialized.
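The row ordering described above can be illustrated with a short sketch over a plain dictionary of loaded vectors. The function embedding_matrix is hypothetical, returns a list of lists rather than an array, and simply raises KeyError for tokens without a vector, which may differ from the real method.

```python
def embedding_matrix(vectors, vocab=None):
    """Stack vectors row by row, in vocab order or in insertion order."""
    if vocab is None:
        # No vocab given: rows follow the insertion order of the dict.
        return [list(vector) for vector in vectors.values()]
    # Vocab given: row i holds the vector of the i-th vocab token.
    return [list(vectors[token]) for token in vocab]


loaded = {"dog": [0.3, 0.4], "cat": [0.1, 0.2]}
all_rows = embedding_matrix(loaded)
vocab_rows = embedding_matrix(loaded, vocab=["cat", "dog"])
```

With no vocab the rows follow the order in which the vectors were loaded; with a vocab the rows follow the vocab order, so row indices line up with token indices.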

abstract get_vector_dim()¶

Method returns vector dimension.

Returns: dim – vector dimension
Return type: int
Raises: RuntimeError – if vector storage is not initialized

abstract load_all()¶

Method loads all vectors stored at the instance path into the vector storage.

Raises

IOError – If there was a problem while reading vectors from instance path.
ValueError – If instance path is not a valid path.
RuntimeError – If different vector size is detected while loading vectors.

abstract load_vocab(vocab)¶

Method loads into the instance the vectors for the tokens in vocab, read from the instance path.

Parameters

vocab (iterable object) – vocabulary with unique words

Raises

IOError – If there was a problem while reading vectors from instance path.
ValueError – If the given path is not a valid path, if the given vocab is None, or if the vector values in vector storage cannot be cast to float.
RuntimeError – If different vector size is detected while loading vectors.

abstract token_to_vector(token)¶

Method obtains vector for given token.

Parameters

token (str) – token from vocabulary

Returns

vector – vector representation of given token

Return type

array_like

Raises

KeyError – If given token doesn’t have vector representation and default vector function is not defined (None).
ValueError – If given token is None.
RuntimeError – If vector storage is not initialized.
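Because the interface is an abstract base class, a concrete storage has to implement every abstract method before it can be instantiated. The sketch below shows that mechanism with a reduced, hypothetical interface (TokenVectorizer) and a toy implementation; it does not subclass podium's VectorStorage and omits the loading methods.

```python
from abc import ABC, abstractmethod


class TokenVectorizer(ABC):
    """Hypothetical stand-in for an abstract vector storage interface."""

    @abstractmethod
    def token_to_vector(self, token):
        """Return the vector for a token."""

    @abstractmethod
    def get_vector_dim(self):
        """Return the dimension of stored vectors."""

    @abstractmethod
    def __len__(self):
        """Return the number of stored vectors."""


class ConstantVectorizer(TokenVectorizer):
    """Toy implementation that maps every token to the same vector."""

    def __init__(self, vector):
        self._vector = list(vector)

    def token_to_vector(self, token):
        return self._vector

    def get_vector_dim(self):
        return len(self._vector)

    def __len__(self):
        return 1


vectorizer = ConstantVectorizer([1.0, 2.0, 3.0])
```

Instantiating TokenVectorizer directly raises TypeError, which is how the abc module enforces that subclasses provide the full interface.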

podium.storage.vectorizers.vectorizer.random_normal_default_vector(token, dim)¶

Draw a random vector from a standard normal distribution. Dimension of returned array is equal to given dim.

Parameters

token (str) – string token from vocabulary
dim (int) – vector dimension

Returns

vector – sampled from normal distribution with given dimension

Return type

array-like
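A default vector function with this (token, dim) signature can be sketched with the standard library alone. The name random_normal_vector and the rng parameter are hypothetical additions for reproducibility, and the result is a plain list rather than an array.

```python
import random


def random_normal_vector(token, dim, rng=random):
    """Return dim samples from a standard normal distribution; token is unused."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


# A seeded generator makes the sampled vector reproducible.
vector = random_normal_vector("unicorn", dim=5, rng=random.Random(0))
```

The token argument is accepted only so that the function matches the signature expected of a default vector function.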

podium.storage.vectorizers.vectorizer.zeros_default_vector(token, dim)¶

Function for creating default vector for given token in form of zeros array. Dimension of returned array is equal to given dim.

Parameters

token (str) – string token from vocabulary
dim (int) – vector dimension

Returns

vector – zeros vector with given dimension

Return type

array-like

Raises

If dim is None.
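For comparison, a zeros default vector can be sketched in the same style. The name zeros_vector is hypothetical, and because the documentation above does not name the exception type raised when dim is None, the use of ValueError here is an assumption.

```python
def zeros_vector(token, dim):
    """Return a list of dim zeros; token is accepted only to match the signature."""
    if dim is None:
        # Assumption: the exception type is not stated in the documentation above.
        raise ValueError("dim must not be None")
    return [0.0] * dim


vector = zeros_vector("unicorn", dim=3)
```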
