podium.preproc package

Submodules

podium.preproc.stop_words module

Module contains sets of stop words and a stop-word removal hook.

podium.preproc.stop_words.get_croatian_stop_words_removal_hook(stop_words_set)

Returns a hook that removes stop words.

Parameters

stop_words_set (set) – set of lowercased stopwords
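
As a rough illustration of what such a hook could look like, the sketch below builds a closure that drops every token found in the stop-word set. The name make_stop_words_removal_hook and the (raw, tokenized) hook signature are assumptions made for this example and are not taken from the documentation above; the three stop words are only a small sample.

```python
def make_stop_words_removal_hook(stop_words_set):
    """Hypothetical stand-in for a stop-word removal hook factory."""

    def remove_stop_words_hook(raw, tokenized):
        # Tokens are lowercased before the lookup because the set is
        # documented as holding lowercased stop words.
        kept = [token for token in tokenized if token.lower() not in stop_words_set]
        return raw, kept

    return remove_stop_words_hook


hook = make_stop_words_removal_hook({"i", "u", "na"})
raw, tokens = hook("Pas i mačka u kući", ["Pas", "i", "mačka", "u", "kući"])
print(tokens)
```

Returning the raw text unchanged alongside the filtered tokens is a design choice of this sketch; the real hook may follow a different convention.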

podium.preproc.tokenizers module

Module contains text tokenizers.

podium.preproc.tokenizers.get_tokenizer(tokenizer, language='en')

Returns a tokenizer according to the parameters given.

Parameters
  • tokenizer (str | callable) –

    If a callable object is given, it will just be returned. Otherwise, a string can be given to create one of the premade tokenizers.

    The available premade tokenizers are:
    • ’split’ - default str.split()

    • ’spacy’ - the spacy tokenizer, using the ‘en’ language model by default (unless the user provides a different ‘language’ parameter). If the spacy model is being used for the first time, download it first with a command similar to python -m spacy download en. More details can be found in the spacy documentation: https://spacy.io/usage/models

  • language (str) – The language argument for the tokenizer (if necessary, e. g. for spacy). Default is ‘en’.

Returns

The created (or given) tokenizer.

Raises

ValueError – If the given tokenizer is not a callable or a string, or is a string that doesn’t correspond to any of the premade tokenizers.
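
The dispatch rules described above can be sketched as follows. This is a simplified stand-in written for illustration, not the library's implementation: get_tokenizer_sketch is a hypothetical name, and the ’spacy’ branch is deliberately left out so the example runs without spaCy or a downloaded language model (so the language argument is accepted but unused here).

```python
def get_tokenizer_sketch(tokenizer, language="en"):
    """Simplified stand-in that mirrors the documented dispatch rules."""
    if callable(tokenizer):
        # A callable is returned unchanged.
        return tokenizer
    if tokenizer == "split":
        # The 'split' option maps to plain str.split.
        return str.split
    # Anything else (including non-string values) is rejected. The real
    # function also accepts 'spacy', which this sketch omits.
    raise ValueError(f"Unknown tokenizer: {tokenizer!r}")


split_tokenizer = get_tokenizer_sketch("split")
print(split_tokenizer("a quick  brown fox"))

custom = get_tokenizer_sketch(lambda text: text.split(","))
print(custom("a,b,c"))
```

Passing a callable is the usual way to plug in a tokenizer that is not among the premade options.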

podium.preproc.util module

Module contains utility functions for preprocessing text data.

podium.preproc.util.capitalize_target_like_source(func)

Capitalization decorator for a method that processes a word. The decorator invokes the wrapped function with a lowercased input, then capitalizes the return value so that its capitalization corresponds to that of the original input.

Parameters

func (function) – function to be called; MUST be a class member with one positional argument (like def func(self, word)), but may take additional keyword arguments (like func(self, word, my_arg=’my_value’))

Returns

wrapper – decorator function to decorate func with

Return type

function
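
A minimal sketch of the decorator pattern described above, assuming that "capitalization corresponds to the original input" means position-by-position uppercasing. The names capitalize_like_source_sketch and SuffixStripper are hypothetical and exist only for this example.

```python
import functools


def capitalize_like_source_sketch(func):
    """Hypothetical re-implementation of the documented behaviour."""

    @functools.wraps(func)
    def wrapper(self, word, **kwargs):
        # Call the wrapped method with the lowercased word.
        result = func(self, word.lower(), **kwargs)
        # Uppercase each result character whose position was uppercase
        # in the original input; extra characters are left untouched.
        chars = [
            ch.upper() if i < len(word) and word[i].isupper() else ch
            for i, ch in enumerate(result)
        ]
        return "".join(chars)

    return wrapper


class SuffixStripper:
    @capitalize_like_source_sketch
    def strip_s(self, word):
        # Receives the lowercased word and removes a trailing "s".
        return word[:-1] if word.endswith("s") else word


stripper = SuffixStripper()
print(stripper.strip_s("Cats"))
print(stripper.strip_s("CATS"))
```

The wrapped method only ever has to handle lowercase input, which keeps lookup tables and suffix rules case-insensitive without extra work.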

podium.preproc.util.find_word_by_prefix(trie, word)

Searches through a trie data structure and returns an element of the trie if the word is a prefix or exact match of one of the trie elements. Otherwise returns None.

Parameters
  • trie (dict) – Nested dict trie data structure

  • word (str) – String being searched for in the trie data structure

Returns

found_word – The string found, which is either the exact word or its prefix, or None if not found in the trie

Return type

str

podium.preproc.util.make_trie(words)

Creates a prefix trie data structure from a list of strings. Each string is split into characters, and a nested dict trie keyed by character is returned.

Parameters

words (list(str)) – List of strings to create a trie structure from

Returns

trie – Nested dict trie data structure

Return type

dict
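
To make the nested dict layout concrete, here is a small sketch of how such a trie could be built and walked. The end-of-word marker "_end_", the function names, and the walk helper are assumptions chosen for this example; the library's actual layout and lookup rules may differ.

```python
END = "_end_"  # hypothetical end-of-word marker, chosen for this sketch


def make_trie_sketch(words):
    """Build a nested dict trie with one level per character."""
    trie = {}
    for word in words:
        node = trie
        for char in word:
            # Descend into the child for this character, creating it if needed.
            node = node.setdefault(char, {})
        # Mark that a complete word ends at this node.
        node[END] = True
    return trie


def is_prefix_or_word(trie, word):
    """Return True if word is a prefix or exact match of a stored string."""
    node = trie
    for char in word:
        if char not in node:
            return False
        node = node[char]
    return True


trie = make_trie_sketch(["car", "cat"])
print(trie)
print(is_prefix_or_word(trie, "ca"))
print(is_prefix_or_word(trie, "dog"))
```

Because the marker is longer than one character, it cannot collide with any single-character key in the trie.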

podium.preproc.util.uppercase_target_like_source(source, target)

Uppercases target at the same positions at which source is uppercased.

Parameters
  • source (str) – source string from which uppercasing is transferred

  • target (str) – target string that needs to be uppercased

Returns

uppercased_target – uppercased target string

Return type

str
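
One possible reading of this behaviour, shown as a sketch: each character of target is uppercased when the character at the same index in source is uppercase. How the real function treats strings of different lengths is not stated above, so leaving the extra target characters unchanged is an assumption, as is the name uppercase_like_source_sketch.

```python
def uppercase_like_source_sketch(source, target):
    """Transfer uppercase positions from source onto target."""
    chars = [
        # Positions beyond the end of source are left as they are.
        t.upper() if i < len(source) and source[i].isupper() else t
        for i, t in enumerate(target)
    ]
    return "".join(chars)


print(uppercase_like_source_sketch("ZaGreb", "zagreba"))
```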

podium.preproc.yake module

Module contents

Package contains modules for preprocessing.

class podium.preproc.CroatianLemmatizer(**kwargs)

Bases: object

Class for lemmatizing words and fetching word inflections for a given lemma

BASE_FOLDER

folder to download lemmatizer resources

Type

str

MOLEX14_LEMMA2WORD

dictionary file path containing lemma to words mappings

Type

str

MOLEX14_WORD2LEMMA

dictionary file path containing word to lemma mappings

Type

str

get_words_for_lemma(lemma)

Returns a list of words that share the provided lemma.

Parameters

lemma (str) – Lemma for which to find the words that share it

Returns

List of words that share the provided lemma, uppercased at the same character positions as the provided lemma

Return type

list(str)

Raises

ValueError – If no words for the provided lemma are found.
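
The lookup contract above (case transfer from the lemma, ValueError when nothing is found) can be illustrated with a toy in-memory class. ToyLemmatizer, its constructor, and the three-word mapping are hypothetical and stand in for the real class, which reads its mappings from the dictionary files listed above.

```python
class ToyLemmatizer:
    """Hypothetical in-memory stand-in for a lemma-to-words lookup."""

    def __init__(self, lemma2words):
        # Maps a lowercased lemma to the list of its lowercased word forms.
        self._lemma2words = lemma2words

    def get_words_for_lemma(self, lemma):
        words = self._lemma2words.get(lemma.lower())
        if not words:
            raise ValueError(f"No words found for lemma {lemma!r}")
        return [self._match_case(lemma, word) for word in words]

    @staticmethod
    def _match_case(source, target):
        # Uppercase target wherever source is uppercase at the same index.
        return "".join(
            t.upper() if i < len(source) and source[i].isupper() else t
            for i, t in enumerate(target)
        )


lemmatizer = ToyLemmatizer({"kuća": ["kuća", "kuće", "kući"]})
print(lemmatizer.get_words_for_lemma("Kuća"))
```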

class podium.preproc.CroatianStemmer

Bases: object

Simple stemmer for Croatian language

root_word(word)

Method returns root of a word.

Parameters

word (str) – word string

Returns

root – root of a word

Return type

str

transform(word)

Method transforms the given word using a dict of transformation mappings, provided the word ends with a specific suffix.

Parameters

word (str) – word

Returns

transformed_word – transformed word according to transformation mappings

Return type

str

podium.preproc.get_tokenizer(tokenizer, language='en')

Returns a tokenizer according to the parameters given.

Parameters
  • tokenizer (str | callable) –

    If a callable object is given, it will just be returned. Otherwise, a string can be given to create one of the premade tokenizers.

    The available premade tokenizers are:
    • ’split’ - default str.split()

    • ’spacy’ - the spacy tokenizer, using the ‘en’ language model by default (unless the user provides a different ‘language’ parameter). If the spacy model is being used for the first time, download it first with a command similar to python -m spacy download en. More details can be found in the spacy documentation: https://spacy.io/usage/models

  • language (str) – The language argument for the tokenizer (if necessary, e. g. for spacy). Default is ‘en’.

Returns

The created (or given) tokenizer.

Raises

ValueError – If the given tokenizer is not a callable or a string, or is a string that doesn’t correspond to any of the premade tokenizers.