podium.storage.vectorizers package

Submodules

podium.storage.vectorizers.tfidf module

Module containing classes for creating tf-idf vectors from examples.

class podium.storage.vectorizers.tfidf.CountVectorizer(vocab=None, specials=None)

Bases: object

Converts data from one field of the examples into a matrix of bag-of-words features. It is equivalent to scikit-learn's CountVectorizer, available at https://scikit-learn.org.

fit(dataset, field)

Initializes the count vectorizer with the vocab of the given field or, if no field is given, with the vocab passed to the constructor.

Parameters
  • dataset (Dataset, optional) – dataset instance that contains the field

  • field (Field, optional) – field in the dataset whose vocab is used; if None, the vocab given in the constructor is used

Returns

self

Return type

CountVectorizer

Raises

ValueError – If the vocab or the field's vocab is None.

transform(examples, **kwargs)

Transforms the given examples into a count matrix in which each row is an example and each column holds the count of one token.

Parameters
  • examples (iterable) – an iterable that yields arrays of numericalized tokens, or a list of examples

  • tokens_tensor (bool, optional) – if True, the method expects examples to be a tensor of numericalized values; otherwise it expects a list of examples (which can in fact be a dataset) and a field for numericalization

  • field (Field, optional) – if tokens_tensor is False, the field that is used for numericalization

Raises

ValueError – If the arguments are invalid: examples is None, or no field is provided and the given examples are not in token tensor format.
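
The count matrix described above can be sketched in plain Python. This is an illustration only, not podium's implementation: the helper count_matrix, its vocab_size argument, and the dense list-of-lists layout are assumptions made for this example, and the examples are assumed to be already numericalized into lists of integer token indices.

```python
def count_matrix(numericalized_examples, vocab_size):
    """Return a dense count matrix: one row per example, one column per token index."""
    matrix = []
    for token_ids in numericalized_examples:
        row = [0] * vocab_size
        for token_id in token_ids:
            # Each occurrence of a token index increments its column.
            row[token_id] += 1
        matrix.append(row)
    return matrix


# Two already-numericalized examples over a 4-token vocabulary.
examples = [[0, 2, 2], [1, 1, 3, 0]]
counts = count_matrix(examples, vocab_size=4)
```

Here counts has two rows (one per example) and four columns (one per token index).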

class podium.storage.vectorizers.tfidf.TfIdfVectorizer(vocab=None, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False, specials=None)

Bases: podium.storage.vectorizers.tfidf.CountVectorizer

Converts data from one field of the examples into a matrix of tf-idf features. It is equivalent to scikit-learn's TfidfVectorizer, available at https://scikit-learn.org. The class depends on the TfidfTransformer defined in the scikit-learn library.

fit(dataset, field)

Learns idf from the data in the given field of the dataset.

Parameters
  • dataset (Dataset) – dataset instance containing the data on which to build the idf matrix

  • field (Field) – field in the dataset to use for tf-idf

Returns

self

Return type

TfIdfVectorizer

Raises

ValueError – If dataset or field is None, or if the name of the given field is not in the dataset.

transform(examples, **kwargs)

Transforms the examples into an example-term matrix. Uses the vocabulary given in the constructor.

Parameters

examples (iterable) – an iterable that yields arrays of numericalized tokens

Returns

X – Tf-idf weighted document-term matrix

Return type

sparse matrix, [n_samples, n_features]

Raises
  • ValueError – If examples are None.

  • RuntimeError – If vectorizer is not fitted yet.
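
As a rough illustration of the weighting behind the constructor defaults (norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False), the sketch below applies the smoothed idf formula documented for scikit-learn's TfidfTransformer, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and then L2-normalizes each row. The helper tfidf_rows and the dense list-of-lists input are assumptions for this example; the class itself returns a sparse matrix and delegates to scikit-learn.

```python
import math


def tfidf_rows(count_rows):
    """Weight a dense count matrix by smoothed idf, then L2-normalize each row."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: in how many rows each term occurs at least once.
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1.
    idf = [math.log((1 + n_docs) / (1 + d)) + 1.0 for d in df]
    weighted_rows = []
    for row in count_rows:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        # All-zero rows are left as zeros to avoid dividing by zero.
        weighted_rows.append([v / norm if norm > 0 else 0.0 for v in weighted])
    return weighted_rows


counts = [[1, 0, 2, 0], [1, 2, 0, 1]]
weights = tfidf_rows(counts)
```

Terms that occur in every example (column 0 here) receive the minimum idf of 1, so rarer terms dominate each normalized row.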

podium.storage.vectorizers.vectorizer module

The vectorizer module offers classes for vectorizing tokens. The interface implemented by the concrete vectorizers is defined in the VectorStorage class.

class podium.storage.vectorizers.vectorizer.BasicVectorStorage(path, default_vector_function=<function random_normal_default_vector>, cache_path=None, max_vectors=None, encoding='utf-8', binary=True)

Bases: podium.storage.vectorizers.vectorizer.VectorStorage

Basic implementation of VectorStorage that handles loading vectors from system storage.

_vectors

dictionary offering word to vector mapping

Type

dict

_dim

vector dimension

Type

int

_initialized

has the vector storage been initialized by loading vectors

Type

bool

_binary

if True, the file is read as a binary file. Else, it’s read as a plain utf-8 text file.

Type

bool

get_vector_dim()

Returns the vector dimension.

Returns

dim – vector dimension

Return type

int

Raises

RuntimeError – if vector storage is not initialized

load_all()

Loads all vectors stored at the instance path into the vectors dictionary.

Raises
  • IOError – If there was a problem while reading vectors from instance path.

  • ValueError – If instance path is not a valid path.

  • RuntimeError – If different vector size is detected while loading vectors.

load_vocab(vocab)

Loads into the instance the vectors, stored at the instance path, for the tokens in vocab.

Parameters

vocab (iterable object) – vocabulary with unique words

Raises
  • IOError – If there was a problem while reading vectors from instance path.

  • ValueError – If the path is not a valid path, if the given vocab is None, or if the vector values in the vector storage cannot be cast to float.

  • RuntimeError – If different vector size is detected while loading vectors.
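
The loading behaviour described above can be sketched as follows. This is not podium's loader: the function load_vectors_for_vocab and the assumed file layout (one token per line followed by space-separated float values) are illustrative assumptions, and binary files are not handled.

```python
import io


def load_vectors_for_vocab(lines, vocab):
    """Parse 'token v1 v2 ...' lines, keeping only tokens that are in vocab."""
    if vocab is None:
        raise ValueError("vocab must not be None")
    wanted = set(vocab)
    vectors = {}
    dim = None
    for line in lines:
        parts = line.rstrip().split(" ")
        token, values = parts[0], parts[1:]
        if token not in wanted:
            continue
        # float() raises ValueError if a value cannot be converted.
        vector = [float(v) for v in values]
        if dim is None:
            dim = len(vector)
        elif len(vector) != dim:
            raise RuntimeError(
                f"Expected dimension {dim}, got {len(vector)} for {token!r}"
            )
        vectors[token] = vector
    return vectors, dim


# An in-memory stand-in for a text vector file.
text = io.StringIO("the 0.1 0.2\ncat 0.3 0.4\nsat 0.5 0.6\n")
vectors, dim = load_vectors_for_vocab(text, vocab=["cat", "sat", "dog"])
```

Tokens in vocab that have no line in the file (such as "dog" here) are simply absent from the result; a storage with a default vector function would fill them in at lookup time.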

token_to_vector(token)

Method obtains vector for given token.

Parameters

token (str) – token from vocabulary

Returns

vector – vector representation of given token

Return type

array_like

Raises
  • KeyError – If given token doesn’t have vector representation and default vector function is not defined (None).

  • ValueError – If given token is None.

  • RuntimeError – If vector storage is not initialized.
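
The lookup rules listed above (stored vector first, then the default vector function, otherwise KeyError) can be sketched with a small in-memory stand-in. The class DictVectorStorage and the helper zeros are hypothetical names invented for this example and are not part of podium.

```python
class DictVectorStorage:
    """Hypothetical in-memory stand-in for a vector storage (not podium's class)."""

    def __init__(self, vectors, dim, default_vector_function=None):
        self._vectors = dict(vectors)
        self._dim = dim
        self._default_vector_function = default_vector_function
        self._initialized = True

    def token_to_vector(self, token):
        if not self._initialized:
            raise RuntimeError("Vector storage is not initialized.")
        if token is None:
            raise ValueError("Token must not be None.")
        if token in self._vectors:
            return self._vectors[token]
        if self._default_vector_function is None:
            # No stored vector and no fallback: the token cannot be vectorized.
            raise KeyError(token)
        return self._default_vector_function(token, self._dim)


def zeros(token, dim):
    """Illustrative fallback that ignores the token and returns a zero vector."""
    return [0.0] * dim


storage = DictVectorStorage({"cat": [0.1, 0.2]}, dim=2, default_vector_function=zeros)
known = storage.token_to_vector("cat")
unknown = storage.token_to_vector("dog")
```

With default_vector_function left as None, the same lookup of "dog" would raise KeyError instead of returning a fallback vector.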

class podium.storage.vectorizers.vectorizer.VectorStorage(path, default_vector_function=None, cache_path=None, max_vectors=None)

Bases: abc.ABC

Interface for classes that can vectorize tokens. One example of such a vectorizer is word2vec.

abstract __len__()

Method returns number of vectors in vector storage.

Returns

len – number of loaded vectors in vector storage

Return type

int

get_embedding_matrix(vocab=None)

Constructs the embedding matrix.

Note: Dictionaries preserve insertion order from Python 3.6 onward (an implementation detail of CPython 3.6, guaranteed by the language from Python 3.7): https://docs.python.org/3.6/whatsnew/3.6.html#other-language-changes

Parameters

vocab (iter(token)) – collection of tokens for which to create the embedding matrix. The default use case is to pass a vocab or an itos list, or None to retrieve all loaded vectors. If None is passed, the order of the vectors is the same as the insertion order of the loaded vectors in VectorStorage.

Raises

RuntimeError – If vector storage is not initialized.
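
The row ordering described for the vocab parameter can be illustrated with a plain dictionary. The function embedding_matrix below is a simplified sketch written for this example, not the method itself: it returns a list of rows rather than an array, and it raises KeyError for tokens without a loaded vector instead of consulting a default vector function.

```python
def embedding_matrix(vectors, vocab=None):
    """Stack vectors row by row, in vocab order or in dict insertion order."""
    # With vocab=None, iterate over the dict, which preserves insertion order.
    tokens = vectors.keys() if vocab is None else vocab
    return [vectors[token] for token in tokens]


# Insertion order here stands in for the order in which vectors were loaded.
vectors = {"the": [0.1, 0.2], "cat": [0.3, 0.4], "sat": [0.5, 0.6]}

all_rows = embedding_matrix(vectors)
subset = embedding_matrix(vectors, vocab=["sat", "the"])
```

Row i of the result corresponds to the i-th token of vocab, which is why passing an itos list keeps the matrix aligned with the vocabulary indices.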

abstract get_vector_dim()

Returns the vector dimension.

Returns

dim – vector dimension

Return type

int

Raises

RuntimeError – if vector storage is not initialized

abstract load_all()

Loads all vectors stored at the instance path into the vectors dictionary.

Raises
  • IOError – If there was a problem while reading vectors from instance path.

  • ValueError – If instance path is not a valid path.

  • RuntimeError – If different vector size is detected while loading vectors.

abstract load_vocab(vocab)

Loads into the instance the vectors, stored at the instance path, for the tokens in vocab.

Parameters

vocab (iterable object) – vocabulary with unique words

Raises
  • IOError – If there was a problem while reading vectors from instance path.

  • ValueError – If the path is not a valid path, if the given vocab is None, or if the vector values in the vector storage cannot be cast to float.

  • RuntimeError – If different vector size is detected while loading vectors.

abstract token_to_vector(token)

Method obtains vector for given token.

Parameters

token (str) – token from vocabulary

Returns

vector – vector representation of given token

Return type

array_like

Raises
  • KeyError – If given token doesn’t have vector representation and default vector function is not defined (None).

  • ValueError – If given token is None.

  • RuntimeError – If vector storage is not initialized.

podium.storage.vectorizers.vectorizer.random_normal_default_vector(token, dim)

Draw a random vector from a standard normal distribution. Dimension of returned array is equal to given dim.

Parameters
  • token (str) – string token from vocabulary

  • dim (int) – vector dimension

Returns

vector – vector of the given dimension sampled from a standard normal distribution

Return type

array-like

podium.storage.vectorizers.vectorizer.zeros_default_vector(token, dim)

Creates a default vector for the given token as an array of zeros. The dimension of the returned array is equal to the given dim.

Parameters
  • token (str) – string token from vocabulary

  • dim (int) – vector dimension

Returns

vector – zeros vector with given dimension

Return type

array-like

Raises

If dim is None.
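
Both default vector functions share the (token, dim) signature, so either can be passed as default_vector_function. The sketch below mirrors that contract using only the standard library. The names random_normal_vector and zeros_vector, the rng parameter, the list return type, and the choice of ValueError for a missing dim are all assumptions for this example; the source does not name the exception type.

```python
import random


def random_normal_vector(token, dim, rng=random):
    """Draw dim independent samples from a standard normal distribution."""
    # The token is ignored; it is accepted only to match the shared signature.
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def zeros_vector(token, dim):
    """Return a vector of dim zeros for any token."""
    if dim is None:
        # Assumption: the exception type is not stated in the documentation.
        raise ValueError("dim must not be None")
    return [0.0] * dim


# A seeded generator makes the random fallback reproducible.
sampled = random_normal_vector("unknown", 3, rng=random.Random(0))
zeros = zeros_vector("unknown", 3)
```

A zero fallback keeps unknown tokens neutral, while a random fallback gives each unknown token a distinct vector; which is preferable depends on the downstream model.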

Module contents