podium.dataload package¶
Submodules¶
podium.dataload.cornell_movie_dialogs module¶
Dataloader for Cornell Movie-Dialogs Corpus, available at http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html
class podium.dataload.cornell_movie_dialogs.CornellMovieDialogsLoader¶
Bases: object
Class for downloading and parsing the Cornell Movie-Dialogs dataset.
This class is used for downloading the dataset (if it is not already downloaded) and parsing the files in the dataset. If the dataset is not already present in LargeResource.BASE_RESOURCE_DIR, it is automatically downloaded when an instance of the loader is created. The downloaded resources can be parsed using the load_dataset method; a short usage sketch is given at the end of this module section.
load_characters()¶
Loads the file containing movie characters.
load_conversations()¶
Loads the file containing movie conversations.
load_dataset()¶
Loads and parses all the necessary files from the dataset folder.
- Returns
data – tuple that contains dictionaries for 5 types of Cornell movie dialogs data: titles, conversations, lines, characters and script urls. Fields for every type are defined in class constants.
- Return type
CornellMovieDialogsNamedTuple
load_lines()¶
Loads the file containing movie lines.
load_titles()¶
Loads the file containing movie titles.
load_urls()¶
Loads the file containing movie script URLs.
class podium.dataload.cornell_movie_dialogs.CornellMovieDialogsNamedTuple(titles, conversations, lines, characters, url)¶
Bases: tuple
property characters¶
Alias for field number 3
property conversations¶
Alias for field number 1
property lines¶
Alias for field number 2
property titles¶
Alias for field number 0
property url¶
Alias for field number 4
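Based on the loader and named-tuple reference above, here is a minimal usage sketch; it only assumes the documented no-argument constructor and the documented field names, and the inspection at the end is purely illustrative.

```python
from podium.dataload.cornell_movie_dialogs import CornellMovieDialogsLoader

# Creating the loader downloads the corpus if it is not already present
# in LargeResource.BASE_RESOURCE_DIR.
loader = CornellMovieDialogsLoader()

# load_dataset() returns a CornellMovieDialogsNamedTuple whose fields are
# dictionaries for titles, conversations, lines, characters and script urls.
data = loader.load_dataset()

print(type(data.titles))         # dictionary with movie title data
print(type(data.conversations))  # dictionary with conversation data
print(type(data.url))            # dictionary with script url data
```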
podium.dataload.eurovoc module¶
Module for loading the raw EuroVoc dataset.
class podium.dataload.eurovoc.Document(filename, title, text)¶
Bases: tuple
property filename¶
Alias for field number 0
property text¶
Alias for field number 2
property title¶
Alias for field number 1
class podium.dataload.eurovoc.EuroVocLoader(**kwargs)¶
Bases: object
Class for downloading and parsing the EuroVoc dataset.
This class is used for downloading the EuroVoc dataset (if it is not already downloaded) and parsing the files in the dataset. If the dataset is not already present in LargeResource.BASE_RESOURCE_DIR, it is automatically downloaded when an instance of EuroVocLoader is created. The downloaded resources can be parsed using the load_dataset method (see the usage sketch below).
load_dataset()¶
Loads and parses all the necessary files from the dataset folder.
- Returns
(EuroVoc label hierarchy, CroVoc label hierarchy, document mapping, documents)
EuroVoc label hierarchy : dict(label_id : Label)
CroVoc label hierarchy : dict(label_id : Label)
document mapping : dict(document_id : list of label ids)
documents : list(Document)
- Return type
tuple
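A minimal usage sketch of the loader, following the return structure documented above; the variable names are illustrative.

```python
from podium.dataload.eurovoc import EuroVocLoader

# Creating the loader downloads the EuroVoc dataset if it is not already
# present in LargeResource.BASE_RESOURCE_DIR.
loader = EuroVocLoader()

# The returned tuple unpacks into the four documented components.
eurovoc_labels, crovoc_labels, doc_to_label_ids, documents = loader.load_dataset()

# documents is a list of Document named tuples (filename, title, text).
first = documents[0]
print(first.filename, first.title)
```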
class podium.dataload.eurovoc.Label(name, id, direct_parents, similar_terms, rank, thesaurus=None, micro_thesaurus=None, all_ancestors=None)¶
Bases: object
Label in EuroVoc dataset.
Labels are assigned to documents. One document has multiple labels. Labels have a hierarchy in which one label can have one or more parents (broader terms). All labels apart from thesaurus rank labels have at least one parent. Apart from parents, labels can also have similar labels which describe related areas, but aren’t connected by the label hierarchy.
class podium.dataload.eurovoc.LabelRank¶
Bases: enum.Enum
Levels of labels in EuroVoc.
podium.dataload.eurovoc.dill_dataset(output_path)¶
Downloads the EuroVoc dataset (if not already present) and stores the dataset in a dill file.
- Parameters
output_path (str) – Path to the file where the dataset instance will be stored.
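A one-line usage sketch; the output filename is illustrative.

```python
from podium.dataload.eurovoc import dill_dataset

# Downloads EuroVoc if needed and serializes the dataset instance to a dill file.
dill_dataset("eurovoc_dataset.dill")
```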
podium.dataload.huggingface_dataset_converter module¶
podium.dataload.ner_croatian module¶
Simple NERCroatian dataset module.
class podium.dataload.ner_croatian.NERCroatianXMLLoader(path='downloaded_datasets/', tokenizer='split', tag_schema='IOB', **kwargs)¶
Bases: object
Simple Croatian NER loader class.
load_dataset()¶
Loads the dataset and returns tokenized NER documents.
- Returns
tokenized_documents – List of tokenized documents. Each document is represented as a list of (token, label) tuples. Sentences within a document are delimited by the tuple (None, None).
- Return type
list of lists of tuples
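A minimal usage sketch that walks the documented (token, label) structure; the constructor arguments are the documented defaults.

```python
from podium.dataload.ner_croatian import NERCroatianXMLLoader

# Defaults as documented in the constructor signature above.
loader = NERCroatianXMLLoader(path='downloaded_datasets/',
                              tokenizer='split',
                              tag_schema='IOB')

documents = loader.load_dataset()

for document in documents:
    for token, label in document:
        if token is None and label is None:
            # (None, None) tuples delimit sentences within a document.
            continue
        print(token, label)
```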
podium.dataload.ner_croatian.convert_sequence_to_entities(sequence, text, delimiter='-')¶
Converts sequences of the BIO tagging schema to entities.
- Parameters
sequence (list(string)) – Sequence of tags that start with either B, I, or O.
text (list(string)) – Tokenized text that corresponds to the tag sequence.
- Returns
entities – List of entities. Each entity is a dict that has four attributes: name, type, start, and end. Name is a list of tokens from text that belong to that entity, start denotes the index at which the entity starts, and end is the end index of the entity, so that `text[entity['start'] : entity['end']]` retrieves the entity text. This means that each entity has the following form:
{ 'name': list(str), 'type': str, 'start': int, 'end': int }
- Return type
list(dict)
- Raises
ValueError – If the given sequence and text are not of the same length.
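A minimal sketch of a call to the function, assuming BIO tags where '-' separates the tag prefix from the entity type; the example tokens and tags are illustrative.

```python
from podium.dataload.ner_croatian import convert_sequence_to_entities

tokens = ['Ana', 'Horvat', 'lives', 'in', 'Zagreb']
tags = ['B-PER', 'I-PER', 'O', 'O', 'B-LOC']

entities = convert_sequence_to_entities(tags, tokens)

for entity in entities:
    # Each entity is a dict with 'name', 'type', 'start' and 'end' keys;
    # tokens[entity['start']:entity['end']] recovers the entity tokens.
    print(entity['type'], entity['name'])
```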
Module contents¶
Package with concrete dataset dataloaders.