podium.dataload package

Submodules

podium.dataload.cornell_movie_dialogs module

Data loader for the Cornell Movie-Dialogs Corpus, available at http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html

class podium.dataload.cornell_movie_dialogs.CornellMovieDialogsLoader

Bases: object

Class for downloading and parsing the Cornell Movie-Dialogs dataset.

This class downloads the dataset (if it is not already downloaded) and parses its files. If the dataset is not already present in LargeResource.BASE_RESOURCE_DIR, it is downloaded automatically when an instance of the loader is created. The downloaded resources can be parsed using the load_dataset method.

load_characters()

Loads the file containing movie characters.

load_conversations()

Loads the file containing movie conversations.

load_dataset()

Loads and parses all the necessary files from the dataset folder.

Returns

data – a tuple containing dictionaries for the five types of Cornell Movie-Dialogs data: titles, conversations, lines, characters and script URLs. The fields for every type are defined in class constants.

Return type

CornellMovieDialogsNamedTuple

load_lines()

Loads the file containing movie lines.

load_titles()

Loads the file containing movie titles.

load_urls()

Loads the file containing movie script URLs.

class podium.dataload.cornell_movie_dialogs.CornellMovieDialogsNamedTuple(titles, conversations, lines, characters, url)

Bases: tuple

property characters

Alias for field number 3

property conversations

Alias for field number 1

property lines

Alias for field number 2

property titles

Alias for field number 0

property url

Alias for field number 4
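
The following is a minimal usage sketch, not taken from the library documentation; it assumes the corpus can be fetched into the default LargeResource.BASE_RESOURCE_DIR and relies only on the loader and named tuple interfaces described above.

    from podium.dataload.cornell_movie_dialogs import CornellMovieDialogsLoader

    # Creating the loader downloads the corpus if it is not already present.
    loader = CornellMovieDialogsLoader()

    # load_dataset parses the downloaded files into a CornellMovieDialogsNamedTuple.
    data = loader.load_dataset()

    # Fields can be accessed through the property aliases or by index.
    titles, conversations, lines, characters, url = data
    assert data.titles is data[0]
    assert data.url is data[4]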

podium.dataload.eurovoc module

Module for loading the raw EuroVoc dataset.

class podium.dataload.eurovoc.Document(filename, title, text)

Bases: tuple

property filename

Alias for field number 0

property text

Alias for field number 2

property title

Alias for field number 1

class podium.dataload.eurovoc.EuroVocLoader(**kwargs)

Bases: object

Class for downloading and parsing the EuroVoc dataset.

This class downloads the EuroVoc dataset (if it is not already downloaded) and parses its files. If the dataset is not already present in LargeResource.BASE_RESOURCE_DIR, it is downloaded automatically when an instance of EuroVocLoader is created. The downloaded resources can be parsed using the load_dataset method.

load_dataset()

Loads and parses all the necessary files from the dataset folder.

Returns

(EuroVoc label hierarchy, CroVoc label hierarchy, document mapping, documents)

  • EuroVoc label hierarchy : dict(label_id : Label)

  • CroVoc label hierarchy : dict(label_id : Label)

  • document mapping : dict(document_id : list of label ids)

  • documents : list(Document)

Return type

tuple
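
A minimal usage sketch (not taken from the library documentation) relying only on the return structure described above; it assumes the dataset can be downloaded into LargeResource.BASE_RESOURCE_DIR.

    from podium.dataload.eurovoc import EuroVocLoader

    loader = EuroVocLoader()

    # The returned tuple follows the order documented above.
    eurovoc_labels, crovoc_labels, doc_to_labels, documents = loader.load_dataset()

    # eurovoc_labels and crovoc_labels map label_id -> Label,
    # doc_to_labels maps document_id -> list of label ids,
    # documents is a list of Document named tuples.
    first = documents[0]
    print(first.filename, first.title, len(first.text))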

class podium.dataload.eurovoc.Label(name, id, direct_parents, similar_terms, rank, thesaurus=None, micro_thesaurus=None, all_ancestors=None)

Bases: object

A label in the EuroVoc dataset.

Labels are assigned to documents; one document can have multiple labels. Labels form a hierarchy in which one label can have one or more parents (broader terms). All labels apart from thesaurus rank labels have at least one parent. Besides parents, labels can also have similar labels that describe related areas but are not connected through the label hierarchy.

class podium.dataload.eurovoc.LabelRank

Bases: enum.Enum

Levels of labels in EuroVoc.
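
As an illustrative sketch only (not taken from the library documentation), the snippet below assumes that the constructor arguments of Label are exposed as same-named attributes and that direct_parents holds parent label ids; the names, ids and the arbitrarily picked LabelRank member are all made up.

    from podium.dataload.eurovoc import Label, LabelRank

    some_rank = next(iter(LabelRank))  # arbitrary rank level, for illustration only

    parent = Label(name="broader term", id=1, direct_parents=[],
                   similar_terms=[], rank=some_rank)
    child = Label(name="narrower term", id=2, direct_parents=[parent.id],
                  similar_terms=[], rank=some_rank)

    # Walking up the hierarchy given a mapping of label id -> Label.
    labels = {parent.id: parent, child.id: child}
    for parent_id in child.direct_parents:
        print(labels[parent_id].name)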

podium.dataload.eurovoc.dill_dataset(output_path)

Downloads the EuroVoc dataset (if not already present) and stores it in a dill file.

Parameters

output_path (str) – Path to the file where the dataset instance will be stored.
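
A brief sketch of producing and reloading the dill file (not taken from the library documentation); the exact object stored in the file is whatever dill_dataset serializes, and the file name below is only an example.

    import dill

    from podium.dataload.eurovoc import dill_dataset

    # Downloads the dataset if needed and serializes it to the given path.
    dill_dataset("eurovoc_dataset.dill")

    with open("eurovoc_dataset.dill", "rb") as f:
        dataset = dill.load(f)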

podium.dataload.huggingface_dataset_converter module

podium.dataload.ner_croatian module

Simple NERCroatian dataset module.

class podium.dataload.ner_croatian.NERCroatianXMLLoader(path='downloaded_datasets/', tokenizer='split', tag_schema='IOB', **kwargs)

Bases: object

Simple Croatian NER loader class.

load_dataset()

Loads the dataset and returns tokenized NER documents.

Returns

tokenized_documents – List of tokenized documents. Each document is represented as a list of (token, label) tuples. Sentences within a document are delimited by the tuple (None, None).

Return type

list of lists of tuples
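
A minimal usage sketch (not taken from the library documentation) that relies only on the (token, label) / (None, None) structure described above; the constructor arguments shown are the documented defaults.

    from podium.dataload.ner_croatian import NERCroatianXMLLoader

    loader = NERCroatianXMLLoader()  # default path, 'split' tokenizer, 'IOB' tag schema
    documents = loader.load_dataset()

    # Split each document back into sentences at the (None, None) delimiters.
    for document in documents:
        sentence = []
        for token, label in document:
            if token is None:
                if sentence:
                    print(sentence)
                sentence = []
            else:
                sentence.append((token, label))
        if sentence:
            print(sentence)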

podium.dataload.ner_croatian.convert_sequence_to_entities(sequence, text, delimiter='-')

Converts a sequence of tags in the BIO tagging schema into entities.

Parameters
  • sequence (list(string)) – Sequence of tags, each starting with B, I, or O.

  • text (list(string)) – Tokenized text that corresponds to the tag sequence.

Returns

entities – List of entities. Each entity is a dict with four keys: name, type, start, and end. name is the list of tokens from text that belong to the entity, start is the index at which the entity starts, and end is the index one past its last token.

`text[entity['start'] : entity['end']]` retrieves the entity text

This means that the entity has the following form:

{ 'name': list(str), 'type': str, 'start': int, 'end': int }

Return type

list(dict)

Raises

ValueError – If the given sequence and text are not of the same length.
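
A worked sketch (not taken from the library documentation); the tag strings and tokens below are made up, assuming the usual prefix-delimiter-type form such as 'B-Person' that matches the default '-' delimiter.

    from podium.dataload.ner_croatian import convert_sequence_to_entities

    tags = ["B-Person", "I-Person", "O", "B-Location"]
    tokens = ["Ivana", "Ivic", "visits", "Zagreb"]

    entities = convert_sequence_to_entities(tags, tokens)
    for entity in entities:
        # Each entity is a dict with 'name', 'type', 'start' and 'end';
        # tokens[entity['start']:entity['end']] recovers the entity tokens.
        print(entity["type"], entity["name"], entity["start"], entity["end"])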

Module contents

Package with concrete dataset data loaders.