podium.preproc package
Subpackages
Submodules
podium.preproc.stop_words module
Module contains sets of stop words and a stop-word removal hook.
podium.preproc.stop_words.get_croatian_stop_words_removal_hook(stop_words_set)
Obtains a stop-word removal hook.
- Parameters
stop_words_set (set) – set of lowercased stopwords
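The idea of a stop-word removal hook can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper name make_stop_words_removal_hook is hypothetical, and the hook signature (raw text and token list in, raw text and filtered token list out) is an assumption that this page does not confirm.

```python
def make_stop_words_removal_hook(stop_words_set):
    """Build a hook that drops tokens found in stop_words_set (hypothetical helper)."""

    def remove_stop_words_hook(raw, tokenized):
        # The set holds lowercased stop words, so compare lowercased tokens.
        kept = [token for token in tokenized if token.lower() not in stop_words_set]
        # Assumption: the hook passes the raw text through unchanged.
        return raw, kept

    return remove_stop_words_hook


hook = make_stop_words_removal_hook({"i", "u", "na"})
raw, tokens = hook("Idem u grad", ["Idem", "u", "grad"])
print(tokens)
```

Lowercasing each token before the membership test is what makes a set of lowercased stop words sufficient for mixed-case input.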
podium.preproc.tokenizers module
Module contains text tokenizers.
podium.preproc.tokenizers.get_tokenizer(tokenizer, language='en')
Returns a tokenizer according to the parameters given.
- Parameters
tokenizer (str | callable) – If a callable object is given, it is returned unchanged. Otherwise, a string selects one of the premade tokenizers.
- The available premade tokenizers are:
'split' - the default str.split()
'spacy' - the spaCy tokenizer, using the 'en' language model by default (unless a different language parameter is provided). Before a spaCy model is used for the first time, it should be downloaded with a command similar to python -m spacy download en. More details can be found in the spaCy documentation: https://spacy.io/usage/models
language (str) – The language argument for the tokenizer (if necessary, e.g. for spaCy). Default is 'en'.
- Returns
The created (or given) tokenizer.
- Raises
ValueError – If the given tokenizer is not a callable or a string, or is a string that doesn’t correspond to any of the premade tokenizers.
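The dispatch behaviour described above can be sketched without importing the library. The function get_tokenizer_sketch below is a hypothetical stand-in, not podium's code; it covers only the callable and 'split' cases and deliberately leaves out the 'spacy' branch, so the language argument is accepted but unused.

```python
def get_tokenizer_sketch(tokenizer, language="en"):
    """Illustrative stand-in for the documented dispatch (not podium's code)."""
    # A callable is handed back untouched.
    if callable(tokenizer):
        return tokenizer
    # A string selects a premade tokenizer; only 'split' is sketched here.
    # The 'spacy' branch, which would use `language`, is omitted.
    if isinstance(tokenizer, str) and tokenizer == "split":
        return str.split
    # Anything else is rejected, as the Raises section describes.
    raise ValueError("Unknown tokenizer specification: {!r}".format(tokenizer))


split_tokenizer = get_tokenizer_sketch("split")
print(split_tokenizer("a quick  example"))
```

Passing a custom callable, for example `get_tokenizer_sketch(str.lower)`, returns that same object.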
podium.preproc.util module
Module contains utility functions to preprocess text data.
podium.preproc.util.capitalize_target_like_source(func)
Capitalization decorator for a method that processes a word. The wrapper invokes the decorated function with a lowercased input, then capitalizes the return value so that its capitalization corresponds to the original input.
- Parameters
func (function) – function which gets called; MUST be a class member with one positional argument (like def func(self, word)), but may take additional keyword arguments (like func(self, word, my_arg='my_value'))
- Returns
wrapper – decorator function to decorate func with
- Return type
function
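A decorator with this behaviour might look roughly like the sketch below. All names (capitalize_like_source_sketch, _uppercase_like, ToyStemmer) are hypothetical, and the position-by-position casing rule is an assumption about how capitalization is transferred.

```python
import functools


def _uppercase_like(source, target):
    # Assumption: casing is copied position by position from source to target.
    chars = [t.upper() if s.isupper() else t for s, t in zip(source, target)]
    # Characters of target beyond the length of source are left unchanged.
    return "".join(chars) + target[len(source):]


def capitalize_like_source_sketch(func):
    @functools.wraps(func)
    def wrapper(self, word, **kwargs):
        # Call the wrapped method with a lowercased word...
        result = func(self, word.lower(), **kwargs)
        # ...then restore the casing pattern of the original input.
        return _uppercase_like(word, result)

    return wrapper


class ToyStemmer:
    @capitalize_like_source_sketch
    def strip_last_char(self, word):
        return word[:-1]


print(ToyStemmer().strip_last_char("Zagreba"))
```

The wrapped method only ever sees lowercased input, which keeps its own logic case-insensitive.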
podium.preproc.util.find_word_by_prefix(trie, word)
Searches through a trie data structure and returns an element of the trie if the word is a prefix or exact match of one of the trie elements. Otherwise returns None.
- Parameters
trie (dict) – Nested dict trie data structure
word (str) – String being searched for in the trie data structure
- Returns
found_word – String found, which is either the exact word or its prefix, or None if not found in the trie
- Return type
str
podium.preproc.util.make_trie(words)
Creates a prefix trie data structure from a list of strings. Strings are split into characters and a nested dict trie keyed by character is returned.
- Parameters
words (list(str)) – List of strings to create a trie structure from
- Returns
trie – Nested dict trie data structure
- Return type
dict
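A nested-dict trie and a prefix lookup over it can be sketched as below. This is an illustration under assumptions: the END marker key, the two function names, and the choice to return the first stored word below the matched prefix are all hypothetical and may differ from the library's actual layout and semantics.

```python
END = "<end>"  # hypothetical marker key; cannot collide with one-character keys


def make_trie_sketch(words):
    trie = {}
    for word in words:
        node = trie
        for char in word:
            # Descend one level per character, creating nodes as needed.
            node = node.setdefault(char, {})
        # Mark the node where a complete word ends.
        node[END] = word
    return trie


def find_word_by_prefix_sketch(trie, word):
    node = trie
    for char in word:
        if char not in node:
            return None  # no stored word starts with `word`
        node = node[char]
    # Follow the first child at each level until a stored word is reached.
    while END not in node:
        if not node:
            return None  # only happens for an empty trie
        node = next(iter(node.values()))
    return node[END]


trie = make_trie_sketch(["grad", "gradovi", "more"])
print(find_word_by_prefix_sketch(trie, "gra"))
```

Each lookup costs time proportional to the length of the query plus the depth walked to reach a stored word, independent of how many words the trie holds.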
podium.preproc.util.uppercase_target_like_source(source, target)
Uppercases target at the same positions where source is uppercased.
- Parameters
source (str) – source string from which uppercasing is transferred
target (str) – target string that needs to be uppercased
- Returns
uppercased_target – uppercased target string
- Return type
str
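One plausible reading of this behaviour is a position-by-position transfer of casing, sketched below. The function name is hypothetical, and the handling of a target longer than the source (the extra characters are left as they are) is an assumption rather than documented behaviour.

```python
def uppercase_like_source_sketch(source, target):
    # Uppercase target[i] wherever source[i] is uppercase.
    chars = [t.upper() if s.isupper() else t for s, t in zip(source, target)]
    # Assumption: the part of target beyond len(source) is left unchanged.
    return "".join(chars) + target[len(source):]


print(uppercase_like_source_sketch("Zagreb", "zagreba"))
```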
podium.preproc.yake module
Module contents
Package contains modules for preprocessing.
class podium.preproc.CroatianLemmatizer(**kwargs)
Bases: object
Class for lemmatizing words and fetching word inflections for a given lemma.
BASE_FOLDER
Folder into which lemmatizer resources are downloaded.
- Type
str
MOLEX14_LEMMA2WORD
Path of the dictionary file containing lemma-to-words mappings.
- Type
str
MOLEX14_WORD2LEMMA
Path of the dictionary file containing word-to-lemma mappings.
- Type
str
get_words_for_lemma(lemma)
Returns a list of words that share the provided lemma.
- Parameters
lemma (str) – Lemma for which to find the words that share it
- Returns
List of words that share the provided lemma, uppercased at the same character positions as the lemma
- Return type
list(str)
- Raises
ValueError – If no words for the provided lemma are found.
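The documented contract (look up inflections, mirror the casing of the lemma, raise ValueError when nothing is found) can be illustrated with a toy in-memory mapping. The dictionary LEMMA_TO_WORDS and the function name are hypothetical; the real class reads its mappings from the downloaded resource files, and its casing rule may differ from the position-wise one assumed here.

```python
# Hypothetical toy data standing in for the lemma-to-words resource file.
LEMMA_TO_WORDS = {"grad": ["grad", "grada", "gradu", "gradovi"]}


def get_words_for_lemma_sketch(lemma):
    words = LEMMA_TO_WORDS.get(lemma.lower())
    if not words:
        raise ValueError("No words found for lemma {!r}".format(lemma))

    def like_lemma(word):
        # Assumption: casing is copied position by position from the lemma.
        head = [w.upper() if c.isupper() else w for c, w in zip(lemma, word)]
        return "".join(head) + word[len(lemma):]

    return [like_lemma(word) for word in words]


print(get_words_for_lemma_sketch("Grad"))
```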
class podium.preproc.CroatianStemmer
Bases: object
Simple stemmer for the Croatian language.
root_word(word)
Returns the root of a word.
- Parameters
word (str) – word string
- Returns
root – root of a word
- Return type
str
transform(word)
Transforms the given word according to a dict of transformation mappings, provided the word ends with one of the mapped suffixes.
- Parameters
word (str) – word
- Returns
transformed_word – transformed word according to transformation mappings
- Return type
str
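A suffix-driven transformation of this kind might be sketched as follows. The mapping below is purely illustrative and is not claimed to match the stemmer's actual transformation table; the function name is hypothetical as well.

```python
# Hypothetical suffix -> replacement mapping, for illustration only.
TRANSFORMATIONS = {"lozi": "loga", "lozima": "loga"}


def transform_sketch(word):
    # Try longer suffixes first so the most specific rule wins.
    for suffix in sorted(TRANSFORMATIONS, key=len, reverse=True):
        if word.endswith(suffix):
            return word[: -len(suffix)] + TRANSFORMATIONS[suffix]
    # Words that match no suffix are returned unchanged.
    return word


print(transform_sketch("psiholozi"))
```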
podium.preproc.get_tokenizer(tokenizer, language='en')
Returns a tokenizer according to the parameters given.
- Parameters
tokenizer (str | callable) – If a callable object is given, it is returned unchanged. Otherwise, a string selects one of the premade tokenizers.
- The available premade tokenizers are:
'split' - the default str.split()
'spacy' - the spaCy tokenizer, using the 'en' language model by default (unless a different language parameter is provided). Before a spaCy model is used for the first time, it should be downloaded with a command similar to python -m spacy download en. More details can be found in the spaCy documentation: https://spacy.io/usage/models
language (str) – The language argument for the tokenizer (if necessary, e.g. for spaCy). Default is 'en'.
- Returns
The created (or given) tokenizer.
- Raises
ValueError – If the given tokenizer is not a callable or a string, or is a string that doesn’t correspond to any of the premade tokenizers.