podium.preproc package¶

Subpackages¶

Submodules¶

podium.preproc.stop_words module¶

Module contains sets of stop words and a stop-word removal hook.

podium.preproc.stop_words.get_croatian_stop_words_removal_hook(stop_words_set)¶

Returns a stop-word removal hook built from the given set of stop words.

Parameters: stop_words_set (set) – set of lowercased stopwords
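The idea of a removal hook can be illustrated with a minimal standalone sketch. It does not import podium. The hook signature `(raw, tokenized)`, the lowercasing of each token before the lookup, and the name `make_stop_words_removal_hook` are assumptions made for illustration only, and the two stop words in the usage lines are arbitrary examples.

```python
def make_stop_words_removal_hook(stop_words_set):
    """Hypothetical stand-in for a stop-word removal hook factory."""

    def remove_stop_words(raw, tokenized):
        # Assumption: the hook receives the raw text and its tokens and
        # returns both, with stop words dropped from the token list.
        # Tokens are lowercased before the lookup because the set is
        # documented as containing lowercased stop words.
        kept = [token for token in tokenized if token.lower() not in stop_words_set]
        return raw, kept

    return remove_stop_words


# Illustrative usage with a tiny, made-up stop-word set.
hook = make_stop_words_removal_hook({"i", "u"})
raw, tokens = hook("Bio sam u Zagrebu", ["Bio", "sam", "u", "Zagrebu"])
```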

podium.preproc.tokenizers module¶

Module contains text tokenizers.

podium.preproc.tokenizers.get_tokenizer(tokenizer, language='en')¶

Returns a tokenizer according to the parameters given.

Parameters

tokenizer (str | callable) –
If a callable object is given, it will just be returned. Otherwise, a string can be given to create one of the premade tokenizers.
The available premade tokenizers are:
- 'split' - the default str.split()
- 'spacy' - the spaCy tokenizer, which uses the 'en' language model by default (unless a different 'language' argument is given). If the spaCy model is being used for the first time, download it first with a command similar to: python -m spacy download en. More details can be found in the spaCy documentation: https://spacy.io/usage/models
language (str) – The language argument for the tokenizer (if necessary, e.g. for spaCy). Default is 'en'.

Returns

The created (or given) tokenizer.

Raises

ValueError – If the given tokenizer is not a callable or a string, or is a string that doesn’t correspond to any of the premade tokenizers.
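The dispatch described above can be sketched roughly as follows. This is a simplified standalone illustration, not the library's code: `get_tokenizer_sketch` is a hypothetical name, the 'spacy' branch is left out so that the example runs without extra dependencies, and the `language` argument is therefore accepted but unused.

```python
def get_tokenizer_sketch(tokenizer, language="en"):
    """Simplified illustration of the documented dispatch rules."""
    # A callable is returned unchanged.
    if callable(tokenizer):
        return tokenizer
    # 'split' maps to plain whitespace splitting.
    if tokenizer == "split":
        return str.split
    # Anything else (unknown string, non-string, non-callable) is rejected.
    # The 'spacy' option is deliberately omitted from this sketch.
    raise ValueError(f"Unsupported tokenizer: {tokenizer!r}")


split_tokenizer = get_tokenizer_sketch("split")
tokens = split_tokenizer("a simple  sentence")
```

Passing a custom callable, such as a function that splits on commas, would return that same callable untouched.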

podium.preproc.util module¶

Module contains utility functions for preprocessing text data.

podium.preproc.util.capitalize_target_like_source(func)¶

Capitalization decorator for a method that processes a word. The wrapper calls the decorated function with a lowercased input, then capitalizes the return value so that its capitalization corresponds to the original input.

Parameters: func (function) – function which gets called; MUST be a class member with one positional argument (like def func(self, word)), but may take additional keyword arguments (like func(self, word, my_arg='my_value'))
Returns: wrapper – decorator function to decorate func with
Return type: function
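A decorator with this behaviour might look like the following standalone sketch. The names `capitalize_like_source_sketch`, `_uppercase_like` and the `Echo` class are hypothetical, and the per-position uppercasing helper is an assumption about how the capitalization is restored.

```python
import functools


def _uppercase_like(source, target):
    # Uppercase target at every position where source is uppercase.
    return "".join(
        ch.upper() if i < len(source) and source[i].isupper() else ch
        for i, ch in enumerate(target)
    )


def capitalize_like_source_sketch(func):
    """Call func with a lowercased word, then restore the input's casing."""

    @functools.wraps(func)
    def wrapper(self, word, **kwargs):
        result = func(self, word.lower(), **kwargs)
        return _uppercase_like(word, result)

    return wrapper


class Echo:
    """Toy class whose method just appends a suffix to the word."""

    @capitalize_like_source_sketch
    def process(self, word, suffix=""):
        return word + suffix


processed = Echo().process("ZAgreb", suffix="a")
```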

podium.preproc.util.find_word_by_prefix(trie, word)¶

Searches through a trie data structure and returns an element of the trie if the word is a prefix or an exact match of one of the trie elements. Otherwise returns None.

Parameters

trie (dict) – Nested dict trie data structure
word (str) – String being searched for in the trie data structure

Returns

found_word – The string found, which is either the exact word or its prefix, or None if nothing is found in the trie

Return type

str

podium.preproc.util.make_trie(words)¶

Creates a prefix trie data structure from a list of strings. Each string is split into characters, and a nested dict trie keyed by character is returned.

Parameters: words (list(str)) – List of strings to create a trie structure from
Returns: trie – Nested dict trie data structure
Return type: dict
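To make the nested dict layout concrete, here is a minimal sketch of building such a trie and looking a word up in it. It does not reproduce the library's code. The end-of-word marker `END`, the function names, and the choice to complete a prefix by following the first stored branch are all assumptions made for illustration.

```python
END = "*end*"  # hypothetical end-of-word marker key


def make_trie_sketch(words):
    """Build a nested dict trie with one level per character."""
    trie = {}
    for word in words:
        node = trie
        for char in word:
            node = node.setdefault(char, {})
        node[END] = True  # marks that a stored word ends here
    return trie


def find_word_by_prefix_sketch(trie, word):
    """Return a stored word that `word` equals or is a prefix of, else None."""
    if not word:
        return None
    node = trie
    for char in word:
        if char not in node:
            return None
        node = node[char]
    # Follow the first branch until a stored word ends.
    found = list(word)
    while END not in node:
        char = next(iter(node))
        found.append(char)
        node = node[char]
    return "".join(found)


trie = make_trie_sketch(["token", "tokenizer"])
match = find_word_by_prefix_sketch(trie, "tok")
```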

podium.preproc.util.uppercase_target_like_source(source, target)¶

Uppercases target at the same positions where source is uppercased.

Parameters

source (str) – source string from which uppercasing is transferred
target (str) – target string that needs to be uppercased

Returns

uppercased_target – uppercased target string

Return type

str
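One plausible reading of this behaviour is a position-by-position transfer of uppercase letters, sketched below. The function name `uppercase_like_sketch` is hypothetical, and leaving target characters beyond the length of source unchanged is an assumption, since the description does not say how strings of different length are handled.

```python
def uppercase_like_sketch(source, target):
    """Uppercase target wherever source has an uppercase character."""
    chars = []
    for i, ch in enumerate(target):
        if i < len(source) and source[i].isupper():
            chars.append(ch.upper())
        else:
            # Positions past the end of source are left as they are.
            chars.append(ch)
    return "".join(chars)


result = uppercase_like_sketch("ZAgreb", "zagreba")
```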

podium.preproc.yake module¶

Module contents¶

Package contains modules for preprocessing.

class podium.preproc.CroatianLemmatizer(**kwargs)¶

Bases: object

Class for lemmatizing words and fetching word inflections for a given lemma

BASE_FOLDER¶

folder to download lemmatizer resources

Type: str

MOLEX14_LEMMA2WORD¶

dictionary file path containing lemma to words mappings

Type: str

MOLEX14_WORD2LEMMA¶

dictionary file path containing word to lemma mappings

Type: str

get_words_for_lemma(lemma)¶

Returns a list of words that share the provided lemma.

Parameters: lemma (str) – Lemma for which to find the words that share it
Returns: List of words that share the provided lemma, uppercased at the same character positions as the provided lemma
Return type: list(str)
Raises: ValueError – If no words for the provided lemma are found.

class podium.preproc.CroatianStemmer¶

Bases: object

Simple stemmer for the Croatian language.

root_word(word)¶

Method returns root of a word.

Parameters: word (str) – word string
Returns: root – root of a word
Return type: str

transform(word)¶

Transforms the given word according to a dict of transformation mappings, provided the word ends with a specific suffix.

Parameters: word (str) – word
Returns: transformed_word – transformed word according to transformation mappings
Return type: str

podium.preproc.get_tokenizer(tokenizer, language='en')¶

Returns a tokenizer according to the parameters given.

Parameters

tokenizer (str | callable) –
If a callable object is given, it will just be returned. Otherwise, a string can be given to create one of the premade tokenizers.
The available premade tokenizers are:
- 'split' - the default str.split()
- 'spacy' - the spaCy tokenizer, which uses the 'en' language model by default (unless a different 'language' argument is given). If the spaCy model is being used for the first time, download it first with a command similar to: python -m spacy download en. More details can be found in the spaCy documentation: https://spacy.io/usage/models
language (str) – The language argument for the tokenizer (if necessary, e.g. for spaCy). Default is 'en'.

Returns

The created (or given) tokenizer.

Raises

ValueError – If the given tokenizer is not a callable or a string, or is a string that doesn’t correspond to any of the premade tokenizers.
