podium.dataload package

Submodules

podium.dataload.cornell_movie_dialogs module

Data loader for the Cornell Movie-Dialogs Corpus, available at http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html

class podium.dataload.cornell_movie_dialogs.CornellMovieDialogsLoader

Bases: object

Class for downloading and parsing the Cornell Movie-Dialogs dataset.

This class downloads the dataset (if it is not already downloaded) and parses its files. If the dataset is not already present in LargeResource.BASE_RESOURCE_DIR, it is downloaded automatically when an instance of the loader is created. The downloaded resources can be parsed using the load_dataset method.

load_characters()

Loads the file containing movie characters.

load_conversations()

Loads the file containing movie conversations.

load_dataset()

Loads and parses all the necessary files from the dataset folder.

Returns

data – a tuple containing dictionaries for the five types of Cornell Movie-Dialogs data: titles, conversations, lines, characters and script URLs. The fields for every type are defined in class constants.

Return type

CornellMovieDialogsNamedTuple

load_lines()

Loads the file containing movie lines.

load_titles()

Loads the file containing movie titles.

load_urls()

Loads the file containing movie script URLs.

class podium.dataload.cornell_movie_dialogs.CornellMovieDialogsNamedTuple(titles, conversations, lines, characters, url)

Bases: tuple

property characters

Alias for field number 3

property conversations

Alias for field number 1

property lines

Alias for field number 2

property titles

Alias for field number 0

property url

Alias for field number 4
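
The following is a minimal usage sketch, not taken from the library documentation; it assumes the corpus can be fetched into the default LargeResource.BASE_RESOURCE_DIR and relies only on the loader and named tuple interfaces described above.

    from podium.dataload.cornell_movie_dialogs import CornellMovieDialogsLoader

    # Creating the loader downloads the corpus if it is not already present.
    loader = CornellMovieDialogsLoader()

    # load_dataset parses the downloaded files into a CornellMovieDialogsNamedTuple.
    data = loader.load_dataset()

    # Fields can be accessed through the property aliases or by index.
    titles, conversations, lines, characters, url = data
    assert data.titles is data[0]
    assert data.url is data[4]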

podium.dataload.eurovoc module

Module for loading the raw EuroVoc dataset.

class podium.dataload.eurovoc.Document(filename, title, text)

Bases: tuple

property filename

Alias for field number 0

property text

Alias for field number 2

property title

Alias for field number 1

class podium.dataload.eurovoc.EuroVocLoader(**kwargs)

Bases: object

Class for downloading and parsing the EuroVoc dataset.

This class downloads the EuroVoc dataset (if it is not already downloaded) and parses its files. If the dataset is not already present in LargeResource.BASE_RESOURCE_DIR, it is downloaded automatically when an instance of EuroVocLoader is created. The downloaded resources can be parsed using the load_dataset method.

load_dataset()

Loads and parses all the necessary files from the dataset folder.

Returns

(EuroVoc label hierarchy, CroVoc label hierarchy, document mapping, documents)

  • EuroVoc label hierarchy : dict(label_id : Label)

  • CroVoc label hierarchy : dict(label_id : Label)

  • document mapping : dict(document_id : list of label ids)

  • documents : list(Document)

Return type

tuple
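
A minimal usage sketch (not taken from the library documentation) relying only on the return structure described above; it assumes the dataset can be downloaded into LargeResource.BASE_RESOURCE_DIR.

    from podium.dataload.eurovoc import EuroVocLoader

    loader = EuroVocLoader()

    # The returned tuple follows the order documented above.
    eurovoc_labels, crovoc_labels, doc_to_labels, documents = loader.load_dataset()

    # eurovoc_labels and crovoc_labels map label_id -> Label,
    # doc_to_labels maps document_id -> list of label ids,
    # documents is a list of Document named tuples.
    first = documents[0]
    print(first.filename, first.title, len(first.text))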

class podium.dataload.eurovoc.Label(name, id, direct_parents, similar_terms, rank, thesaurus=None, micro_thesaurus=None, all_ancestors=None)

Bases: object

A label in the EuroVoc dataset.

Labels are assigned to documents; one document can have multiple labels. Labels form a hierarchy in which one label can have one or more parents (broader terms). All labels apart from thesaurus rank labels have at least one parent. Besides parents, labels can also have similar labels that describe related areas but are not connected through the label hierarchy.

class podium.dataload.eurovoc.LabelRank

Bases: enum.Enum

Levels of labels in EuroVoc.
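
As an illustrative sketch only (not taken from the library documentation), the snippet below assumes that the constructor arguments of Label are exposed as same-named attributes and that direct_parents holds parent label ids; the names, ids and the arbitrarily picked LabelRank member are all made up.

    from podium.dataload.eurovoc import Label, LabelRank

    some_rank = next(iter(LabelRank))  # arbitrary rank level, for illustration only

    parent = Label(name="broader term", id=1, direct_parents=[],
                   similar_terms=[], rank=some_rank)
    child = Label(name="narrower term", id=2, direct_parents=[parent.id],
                  similar_terms=[], rank=some_rank)

    # Walking up the hierarchy given a mapping of label id -> Label.
    labels = {parent.id: parent, child.id: child}
    for parent_id in child.direct_parents:
        print(labels[parent_id].name)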

podium.dataload.eurovoc.dill_dataset(output_path)

Downloads the EuroVoc dataset (if not already present) and stores it in a dill file.

Parameters

output_path (str) – Path to the file where the dataset instance will be stored.
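
A brief sketch of producing and reloading the dill file (not taken from the library documentation); the exact object stored in the file is whatever dill_dataset serializes, and the file name below is only an example.

    import dill

    from podium.dataload.eurovoc import dill_dataset

    # Downloads the dataset if needed and serializes it to the given path.
    dill_dataset("eurovoc_dataset.dill")

    with open("eurovoc_dataset.dill", "rb") as f:
        dataset = dill.load(f)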

podium.dataload.huggingface_dataset_converter module

podium.dataload.ner_croatian module

Simple NERCroatian dataset module.

class podium.dataload.ner_croatian.NERCroatianXMLLoader(path='downloaded_datasets/', tokenizer='split', tag_schema='IOB', **kwargs)

Bases: object

Simple Croatian NER loader class.

load_dataset()

Loads the dataset and returns tokenized NER documents.

Returns

tokenized_documents – List of tokenized documents. Each document is represented as a list of (token, label) tuples. Sentences within a document are delimited by the tuple (None, None).

Return type

list of lists of tuples
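
A minimal usage sketch (not taken from the library documentation) that relies only on the (token, label) / (None, None) structure described above; the constructor arguments shown are the documented defaults.

    from podium.dataload.ner_croatian import NERCroatianXMLLoader

    loader = NERCroatianXMLLoader()  # default path, 'split' tokenizer, 'IOB' tag schema
    documents = loader.load_dataset()

    # Split each document back into sentences at the (None, None) delimiters.
    for document in documents:
        sentence = []
        for token, label in document:
            if token is None:
                if sentence:
                    print(sentence)
                sentence = []
            else:
                sentence.append((token, label))
        if sentence:
            print(sentence)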

podium.dataload.ner_croatian.convert_sequence_to_entities(sequence, text, delimiter='-')

Converts a sequence of tags in the BIO tagging schema into entities.

Parameters
  • sequence (list(string)) – Sequence of tags, each starting with B, I, or O.

  • text (list(string)) – Tokenized text that corresponds to the tag sequence.

Returns

entities – List of entities. Each entity is a dict with four keys: name, type, start, and end. name is the list of tokens from text that belong to the entity, start is the index at which the entity starts, and end is the index one past its last token.

`text[entity['start'] : entity['end']]` retrieves the entity text

This means that the entity has the following form:

{ 'name': list(str), 'type': str, 'start': int, 'end': int }

Return type

list(dict)

Raises

ValueError – If the given sequence and text are not of the same length.
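
A worked sketch (not taken from the library documentation); the tag strings and tokens below are made up, assuming the usual prefix-delimiter-type form such as 'B-Person' that matches the default '-' delimiter.

    from podium.dataload.ner_croatian import convert_sequence_to_entities

    tags = ["B-Person", "I-Person", "O", "B-Location"]
    tokens = ["Ivana", "Ivic", "visits", "Zagreb"]

    entities = convert_sequence_to_entities(tags, tokens)
    for entity in entities:
        # Each entity is a dict with 'name', 'type', 'start' and 'end';
        # tokens[entity['start']:entity['end']] recovers the entity tokens.
        print(entity["type"], entity["name"], entity["start"], entity["end"])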

Module contents

Package with concrete dataset data loaders.