podium.datasets.impl package¶

Submodules¶

podium.datasets.impl.catacx_comments_dataset module¶

Module contains the Catacx comments dataset.

class podium.datasets.impl.catacx_comments_dataset.CatacxCommentsDataset(dir_path, fields=None)¶

Bases: podium.datasets.dataset.Dataset

Simple Catacx dataset. Contains only the comments.

static get_dataset(fields=None)¶

Downloads (if necessary) and loads the dataset. Not supported yet. Raises NotImplementedError if called.

Parameters: fields (dict(str, Field)) – dictionary that maps field names to fields. If None is passed, the default set of fields is used.
Returns: The loaded dataset.
Return type: CatacxCommentsDataset

static get_default_fields()¶

Method returns a dict of default Catacx comment fields: author_name, author_id, id, likes_cnt, message.

Returns: fields – dict containing all default Catacx fields
Return type: dict(str, Field)

podium.datasets.impl.catacx_dataset module¶

class podium.datasets.impl.catacx_dataset.CatacxDataset(dir_path, fields=None)¶

Bases: podium.datasets.hierarhical_dataset.HierarchicalDataset

Catacx dataset.

static get_dataset(fields=None)¶

Downloads (if necessary) and loads the dataset. Not supported yet. Raises NotImplementedError if called.

Parameters: fields (dict(str, Field)) – dictionary that maps field names to fields. If None is passed, the default set of fields is used.
Returns: The loaded dataset.
Return type: CatacxDataset

static get_default_fields()¶

Method returns a dict of default Catacx fields.

Returns: fields – dict containing all default Catacx fields
Return type: dict(str, Field)

podium.datasets.impl.conllu_dataset module¶

Module contains the CoNLL-U dataset.

class podium.datasets.impl.conllu_dataset.CoNLLUDataset(file_path, fields=None)¶

Bases: podium.datasets.dataset.Dataset

A CoNLL-U dataset class. This class uses all default CoNLL-U fields.

static get_default_fields()¶

Method returns a dict of default CoNLL-U fields: id, form, lemma, upos, xpos, feats, head, deprel, deps, misc.

Returns: fields – Dict containing all default CoNLL-U fields.
Return type: dict(str, Field)
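The ten default fields correspond to the ten tab-separated columns of a CoNLL-U token line. The following is a minimal, library-independent sketch of that mapping; the helper name `parse_conllu_sentence` and the sample sentence are assumptions made for illustration and are not part of podium.

```python
# Column names in the order listed for the default CoNLL-U fields.
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def parse_conllu_sentence(block):
    """Parse one CoNLL-U sentence (a block of lines) into a list of dicts.

    Hypothetical helper for illustration; not the podium implementation.
    """
    tokens = []
    for line in block.splitlines():
        # Skip blank lines and sentence-level comments such as "# text = ...".
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(
                f"expected {len(CONLLU_FIELDS)} columns, got {len(columns)}"
            )
        tokens.append(dict(zip(CONLLU_FIELDS, columns)))
    return tokens


sample = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_\n"
)
tokens = parse_conllu_sentence(sample)
print([token["form"] for token in tokens])
```

Each resulting dict is keyed by the same names as the default fields, so a value such as `tokens[1]["deprel"]` can be read directly.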

podium.datasets.impl.cornell_movie_dialogs_dataset module¶

Module contains Cornell Movie Dialogs datasets.

class podium.datasets.impl.cornell_movie_dialogs_dataset.CornellMovieDialogsConversationalDataset(data, fields=None)¶

Bases: podium.datasets.dataset.Dataset

Cornell Movie Dialogs Conversational dataset which contains sentences and replies from movies.

static get_default_fields()¶

Method returns default Cornell Movie Dialogs fields: sentence and reply. Both fields share the same vocabulary.

Returns: fields – Dictionary mapping field name to field.
Return type: dict(str, Field)

podium.datasets.impl.croatian_ner_dataset module¶

Module contains Croatian NER dataset.

class podium.datasets.impl.croatian_ner_dataset.CroatianNERDataset(tokenized_documents, fields)¶

Bases: podium.datasets.dataset.Dataset

Croatian NER dataset.

A single example in the dataset represents a single sentence in the input data.

classmethod get_dataset(tokenizer='split', tag_schema='IOB', fields=None, **kwargs)¶

Method downloads (if necessary) and loads the dataset.

Parameters

tokenizer (str | callable) – Word-level tokenizer used to tokenize the input text
tag_schema (str) –
Tag schema used for constructing the token labels

supported tag schemas:
'IOB': the label of the beginning token of the entity is prefixed with 'B-', the remaining tokens that belong to the same entity are prefixed with 'I-'. The tokens that don't belong to any named entity are labeled 'O'.
fields (dict(str, Field)) – dictionary mapping field names to fields. If set to None, the default fields are used.
**kwargs –

SCPLargeResource.SCP_USER_KEY:
User on the host machine. Not required if the user on the local machine matches the user on the host machine.

SCPLargeResource.SCP_PRIVATE_KEY:
Path to the ssh private key eligible to access the host machine. Not required on Unix if the private key is in the default location.

SCPLargeResource.SCP_PASS_KEY:
Password for the ssh private key (optional). Can be omitted if the private key is not encrypted.

Returns

The loaded dataset.

Return type

CroatianNERDataset
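The 'IOB' schema described above can be illustrated with a short, self-contained sketch. The function name `iob_tags`, the `(start, end, entity_type)` span format, and the entity type names `PER` and `LOC` are assumptions chosen for the example; the dataset's actual label set is not specified here.

```python
def iob_tags(tokens, entities):
    """Label tokens with IOB tags.

    `entities` is a list of (start, end, entity_type) token spans with
    `end` exclusive. This span format is an assumption for the example.
    """
    tags = ["O"] * len(tokens)  # tokens outside any entity stay 'O'
    for start, end, entity_type in entities:
        tags[start] = f"B-{entity_type}"      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{entity_type}"      # remaining tokens of the entity
    return tags


tokens = "Ivan Horvat živi u Zagrebu".split()
tags = iob_tags(tokens, [(0, 2, "PER"), (4, 5, "LOC")])
print(list(zip(tokens, tags)))
```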

static get_default_fields()¶

Method returns default Croatian NER dataset fields.

Returns: fields – Dictionary mapping field name to field.
Return type: dict(str, Field)

podium.datasets.impl.eurovoc_dataset module¶

Module contains EuroVoc dataset.

class podium.datasets.impl.eurovoc_dataset.EuroVocDataset(eurovoc_labels, crovoc_labels, documents, mappings, fields=None)¶

Bases: podium.datasets.dataset.Dataset

EuroVoc dataset class that contains labeled documents and the label hierarchy.

get_all_ancestors(label_id)¶

Returns ids of all ancestors of the label with the given label id.

Parameters: label_id (int) – id of the label
Returns: list of label_ids of all ancestors of the given label or None if the label is not present in the dataset label hierarchies
Return type: list(int)

get_crovoc_label_hierarchy()¶

Returns CroVoc label hierarchy.

Returns: dictionary that maps label id to label
Return type: dict(int, Label)

static get_default_fields()¶

Method returns default EuroVoc fields: title, text, eurovoc and crovoc labels.

Returns: fields – Dictionary mapping field name to field.
Return type: dict(str, Field)

get_direct_parents(label_id)¶

Returns ids of direct parents of the label with the given label id.

Parameters: label_id (int) – id of the label
Returns: list of label_ids of all direct parents of the given label or None if the label is not present in the dataset label hierarchies
Return type: list(int)

get_eurovoc_label_hierarchy()¶

Returns the EuroVoc label hierarchy.

Returns: dictionary that maps label id to label
Return type: dict(int, Label)

is_ancestor(label_id, example)¶

Checks if the given label_id is an ancestor of any labels of the example.

Parameters

label_id (int) – id of the label
example (Example) – example from dataset

Returns

True if label is ancestor to any of the example labels, False otherwise

Return type

boolean
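The hierarchy queries above (direct parents, all ancestors, ancestor check) can be sketched on a plain dictionary. This is an illustrative sketch only: it assumes the hierarchy is available as a mapping from a label id to the ids of its direct parents, which is not necessarily how `EuroVocDataset` stores it, and the function names are hypothetical.

```python
def all_ancestors(parents, label_id):
    """Return ids of all ancestors of `label_id`, or None if the id is unknown.

    `parents` maps a label id to the list of its direct parent ids
    (an assumed representation for this sketch).
    """
    if label_id not in parents:
        return None
    seen = []
    stack = list(parents[label_id])
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # a label reachable through several parents is visited once
        seen.append(current)
        stack.extend(parents.get(current, []))
    return seen


def is_ancestor(parents, label_id, example_labels):
    """True if `label_id` is an ancestor of any label in `example_labels`."""
    return any(
        label_id in (all_ancestors(parents, label) or [])
        for label in example_labels
    )


# Hypothetical hierarchy: 1 is the root, 2 and 3 are its children,
# and 4 has two direct parents (2 and 3).
parents = {1: [], 2: [1], 3: [1], 4: [2, 3]}
print(sorted(all_ancestors(parents, 4)))
print(is_ancestor(parents, 1, [4]))
```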

podium.datasets.impl.eurovoc_dataset.remove_nonalpha_and_stopwords(raw, tokenized, stop_words)¶

Removes all non-alphabetical characters and stop words from tokens.

Parameters

raw (string) – raw text
tokenized (list(str)) – tokenized text
stop_words – stop words to remove from the tokens

Returns

Return type

tuple(str, list(str))
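A function with this signature could behave roughly as sketched below. This is not the podium implementation: the name `strip_nonalpha_and_stopwords`, the case-insensitive stop-word comparison, and the choice to return the raw text unchanged are all assumptions made for illustration.

```python
def strip_nonalpha_and_stopwords(raw, tokenized, stop_words):
    """Drop non-letter characters from each token, then drop stop words.

    Hypothetical sketch; returns the raw text unchanged together with
    the filtered tokens.
    """
    stop_words = {word.lower() for word in stop_words}
    cleaned = []
    for token in tokenized:
        # Keep only alphabetic characters (Unicode-aware).
        letters = "".join(ch for ch in token if ch.isalpha())
        # Skip tokens that became empty or that are stop words.
        if letters and letters.lower() not in stop_words:
            cleaned.append(letters)
    return raw, cleaned


raw = "U 2019. godini, EU"
raw_out, tokens_out = strip_nonalpha_and_stopwords(
    raw, raw.split(), stop_words={"u"}
)
print(tokens_out)
```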

podium.datasets.impl.imdb_sentiment_dataset module¶

Module contains the IMDB Large Movie Review Dataset.

Dataset webpage: http://ai.stanford.edu/~amaas/data/sentiment/

When using this dataset, please cite:

    @InProceedings{maas-EtAl:2011:ACL-HLT2011,
      author    = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
      title     = {Learning Word Vectors for Sentiment Analysis},
      booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
      month     = {June},
      year      = {2011},
      address   = {Portland, Oregon, USA},
      publisher = {Association for Computational Linguistics},
      pages     = {142–150},
      url       = {http://www.aclweb.org/anthology/P11-1015}
    }

class podium.datasets.impl.imdb_sentiment_dataset.IMDB(dir_path, fields)¶

Bases: podium.datasets.dataset.Dataset

Simple IMDB dataset that contains only the supervised data, in unprocessed form.

NAME¶

dataset name

Type: str

URL¶

url to the imdb dataset

Type: str

DATASET_DIR¶

name of the folder in the dataset containing train and test directories

Type: str

ARCHIVE_TYPE¶

string that defines archive type, used for unpacking dataset

Type: str

TRAIN_DIR¶

name of the training directory

Type: str

TEST_DIR¶

name of the directory containing test examples

Type: str

POSITIVE_LABEL_DIR¶

name of the subdirectory containing examples with positive sentiment

Type: str

NEGATIVE_LABEL_DIR¶

name of the subdirectory containing examples with negative sentiment

Type: str

TEXT_FIELD_NAME¶

name of the field containing comment text

Type: str

LABEL_FIELD_NAME¶

name of the field containing label value

Type: str

POSITIVE_LABEL¶

positive sentiment label

Type: int

NEGATIVE_LABEL¶

negative sentiment label

Type: int

static get_dataset_splits(fields=None)¶

Method creates the train and test datasets for the IMDB dataset.

Parameters: fields (dict(str, Field), optional) – dictionary mapping field names to fields. If not given, the method uses `get_default_fields`. Users should use the default field names defined in the class attributes.
Returns: (train_dataset, test_dataset) – tuple containing train dataset and test dataset
Return type: (Dataset, Dataset)
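The class attributes above suggest a layout in which each split directory holds one subdirectory per sentiment. The sketch below loads such a layout with the standard library only. The concrete values (`"pos"`, `"neg"`, `1`, `0`), the `.txt` file extension, and the helper `load_split` are assumptions for illustration; the actual attribute values are not stated in this reference.

```python
import tempfile
from pathlib import Path

# Assumed values; the real class attributes may differ.
POSITIVE_LABEL_DIR = "pos"
NEGATIVE_LABEL_DIR = "neg"
POSITIVE_LABEL = 1
NEGATIVE_LABEL = 0


def load_split(split_dir):
    """Read (text, label) pairs from the sentiment subdirectories of a split."""
    examples = []
    for sub_dir, label in (
        (POSITIVE_LABEL_DIR, POSITIVE_LABEL),
        (NEGATIVE_LABEL_DIR, NEGATIVE_LABEL),
    ):
        # Sort for a deterministic example order.
        for path in sorted(Path(split_dir, sub_dir).glob("*.txt")):
            examples.append((path.read_text(encoding="utf-8"), label))
    return examples


# Build a tiny throwaway directory tree to demonstrate the loader.
with tempfile.TemporaryDirectory() as root:
    for sub_dir, text in (("pos", "Great film."), ("neg", "Dull plot.")):
        folder = Path(root, "train", sub_dir)
        folder.mkdir(parents=True)
        (folder / "0.txt").write_text(text, encoding="utf-8")
    train = load_split(Path(root, "train"))

print(train)
```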

static get_default_fields()¶

Method returns default IMDB fields: text and label.

Returns: fields – Dictionary mapping field name to field.
Return type: dict(str, Field)

podium.datasets.impl.pauza_dataset module¶

Module contains PauzaHR datasets.

class podium.datasets.impl.pauza_dataset.PauzaHRDataset(dir_path, fields)¶

Bases: podium.datasets.dataset.Dataset

Simple PauzaHR dataset class which uses original reviews.

URL¶

url to the PauzaHR dataset

Type: str

NAME¶

dataset name

Type: str

DATASET_DIR¶

name of the folder in the dataset containing train and test directories

Type: str

ARCHIVE_TYPE¶

string that defines archive type, used for unpacking dataset

Type: str

TRAIN_DIR¶

name of the training directory

Type: str

TEST_DIR¶

name of the directory containing test examples

Type: str

static get_default_fields()¶

Method returns default PauzaHR fields: rating, source and text.

Returns: fields – Dictionary mapping field name to field.
Return type: dict(str, Field)

static get_train_test_dataset(fields=None)¶

Method creates the train and test datasets for the PauzaHR dataset.

Parameters: fields (dict(str, Field), optional) – dictionary mapping field names to fields. If not given, the method uses `get_default_fields`.
Returns: (train_dataset, test_dataset) – tuple containing train dataset and test dataset
Return type: (Dataset, Dataset)

Module contents¶

Package contains concrete datasets

class podium.datasets.impl.IMDB(dir_path, fields)¶

Bases: podium.datasets.dataset.Dataset

Simple IMDB dataset that contains only the supervised data, in unprocessed form.

NAME¶

dataset name

Type: str

URL¶

url to the imdb dataset

Type: str

DATASET_DIR¶

name of the folder in the dataset containing train and test directories

Type: str

ARCHIVE_TYPE¶

string that defines archive type, used for unpacking dataset

Type: str

TRAIN_DIR¶

name of the training directory

Type: str

TEST_DIR¶

name of the directory containing test examples

Type: str

POSITIVE_LABEL_DIR¶

name of the subdirectory containing examples with positive sentiment

Type: str

NEGATIVE_LABEL_DIR¶

name of the subdirectory containing examples with negative sentiment

Type: str

TEXT_FIELD_NAME¶

name of the field containing comment text

Type: str

LABEL_FIELD_NAME¶

name of the field containing label value

Type: str

POSITIVE_LABEL¶

positive sentiment label

Type: int

NEGATIVE_LABEL¶

negative sentiment label

Type: int

static get_dataset_splits(fields=None)¶

Method creates the train and test datasets for the IMDB dataset.

Parameters: fields (dict(str, Field), optional) – dictionary mapping field names to fields. If not given, the method uses `get_default_fields`. Users should use the default field names defined in the class attributes.
Returns: (train_dataset, test_dataset) – tuple containing train dataset and test dataset
Return type: (Dataset, Dataset)

static get_default_fields()¶

Method returns default IMDB fields: text and label.

Returns: fields – Dictionary mapping field name to field.
Return type: dict(str, Field)

class podium.datasets.impl.CatacxDataset(dir_path, fields=None)¶

Bases: podium.datasets.hierarhical_dataset.HierarchicalDataset

Catacx dataset.

static get_dataset(fields=None)¶

Downloads (if necessary) and loads the dataset. Not supported yet. Raises NotImplementedError if called.

Parameters: fields (dict(str, Field)) – dictionary that maps field names to fields. If None is passed, the default set of fields is used.
Returns: The loaded dataset.
Return type: CatacxDataset

static get_default_fields()¶

Method returns a dict of default Catacx fields.

Returns: fields – dict containing all default Catacx fields
Return type: dict(str, Field)

class podium.datasets.impl.CoNLLUDataset(file_path, fields=None)¶

Bases: podium.datasets.dataset.Dataset

A CoNLL-U dataset class. This class uses all default CoNLL-U fields.

static get_default_fields()¶

Method returns a dict of default CoNLL-U fields: id, form, lemma, upos, xpos, feats, head, deprel, deps, misc.

Returns: fields – Dict containing all default CoNLL-U fields.
Return type: dict(str, Field)

class podium.datasets.impl.CornellMovieDialogsConversationalDataset(data, fields=None)¶

Bases: podium.datasets.dataset.Dataset

Cornell Movie Dialogs Conversational dataset which contains sentences and replies from movies.

static get_default_fields()¶

Method returns default Cornell Movie Dialogs fields: sentence and reply. Both fields share the same vocabulary.

Returns: fields – Dictionary mapping field name to field.
Return type: dict(str, Field)

class podium.datasets.impl.EuroVocDataset(eurovoc_labels, crovoc_labels, documents, mappings, fields=None)¶

Bases: podium.datasets.dataset.Dataset

EuroVoc dataset class that contains labeled documents and the label hierarchy.

get_all_ancestors(label_id)¶

Returns ids of all ancestors of the label with the given label id.

Parameters: label_id (int) – id of the label
Returns: list of label_ids of all ancestors of the given label or None if the label is not present in the dataset label hierarchies
Return type: list(int)

get_crovoc_label_hierarchy()¶

Returns CroVoc label hierarchy.

Returns: dictionary that maps label id to label
Return type: dict(int, Label)

static get_default_fields()¶

Method returns default EuroVoc fields: title, text, eurovoc and crovoc labels.

Returns: fields – Dictionary mapping field name to field.
Return type: dict(str, Field)

get_direct_parents(label_id)¶

Returns ids of direct parents of the label with the given label id.

Parameters: label_id (int) – id of the label
Returns: list of label_ids of all direct parents of the given label or None if the label is not present in the dataset label hierarchies
Return type: list(int)

get_eurovoc_label_hierarchy()¶

Returns the EuroVoc label hierarchy.

Returns: dictionary that maps label id to label
Return type: dict(int, Label)

is_ancestor(label_id, example)¶

Checks if the given label_id is an ancestor of any labels of the example.

Parameters

label_id (int) – id of the label
example (Example) – example from dataset

Returns

True if label is ancestor to any of the example labels, False otherwise

Return type

boolean

class podium.datasets.impl.PauzaHRDataset(dir_path, fields)¶

Bases: podium.datasets.dataset.Dataset

Simple PauzaHR dataset class which uses original reviews.

URL¶

url to the PauzaHR dataset

Type: str

NAME¶

dataset name

Type: str

DATASET_DIR¶

name of the folder in the dataset containing train and test directories

Type: str

ARCHIVE_TYPE¶

string that defines archive type, used for unpacking dataset

Type: str

TRAIN_DIR¶

name of the training directory

Type: str

TEST_DIR¶

name of the directory containing test examples

Type: str

static get_default_fields()¶

Method returns default PauzaHR fields: rating, source and text.

Returns: fields – Dictionary mapping field name to field.
Return type: dict(str, Field)

static get_train_test_dataset(fields=None)¶

Method creates the train and test datasets for the PauzaHR dataset.

Parameters: fields (dict(str, Field), optional) – dictionary mapping field names to fields. If not given, the method uses `get_default_fields`.
Returns: (train_dataset, test_dataset) – tuple containing train dataset and test dataset
Return type: (Dataset, Dataset)

class podium.datasets.impl.SNLIDataset(file_path, fields)¶

Bases: podium.datasets.impl.snli_dataset.SNLISimple

An SNLI Dataset class. Unlike SNLISimple, this class by default includes all the fields present in the SNLI dataset.

NAME¶

Name of the Dataset.

Type: str

URL¶

URL to the SNLI dataset.

Type: str

DATASET_DIR¶

Name of the directory in which the dataset files are stored.

Type: str

ARCHIVE_TYPE¶

Archive type, i.e. compression method used for archiving the downloaded dataset file.

Type: str

TRAIN_FILE_NAME¶

Name of the file in which the train dataset is stored.

Type: str

TEST_FILE_NAME¶

Name of the file in which the test dataset is stored.

Type: str

DEV_FILE_NAME¶

Name of the file in which the dev (validation) dataset is stored.

Type: str

ANNOTATOR_LABELS_FIELD_NAME¶

Name of the field containing annotator labels

Type: str

CAPTION_ID_FIELD_NAME¶

Name of the field containing caption ID

Type: str

GOLD_LABEL_FIELD_NAME¶

Name of the field containing gold label

Type: str

PAIR_ID_FIELD_NAME¶

Name of the field containing pair ID

Type: str

SENTENCE1_FIELD_NAME¶

Name of the field containing sentence1

Type: str

SENTENCE1_PARSE_FIELD_NAME¶

Name of the field containing sentence1 parse

Type: str

SENTENCE1_BINARY_PARSE_FIELD_NAME¶

Name of the field containing sentence1 binary parse

Type: str

SENTENCE2_FIELD_NAME¶

Name of the field containing sentence2

Type: str

SENTENCE2_PARSE_FIELD_NAME¶

Name of the field containing sentence2 parse

Type: str

SENTENCE2_BINARY_PARSE_FIELD_NAME¶

Name of the field containing sentence2 binary parse

Type: str

static get_default_fields()¶

Method returns all SNLI fields in the following order: annotator_labels, captionID, gold_label, pairID, sentence1, sentence1_parse, sentence1_binary_parse, sentence2, sentence2_parse, sentence2_binary_parse

Returns: fields – Dictionary mapping field names to respective Fields.
Return type: dict(str, Field)

Notes

This dataset includes both parses for every sentence.

static get_train_test_dev_dataset(fields=None)¶

Method creates train, test and dev (validation) Datasets for the SNLI dataset. If the snli_1.0 directory is not present in the current/working directory, it will be downloaded automatically.

Parameters: fields (dict(str, Field), optional) – A dictionary that maps field names to Field objects. If not supplied, `get_default_fields` is used.
Returns: (train_dataset, test_dataset, dev_dataset) – A tuple containing train, test and dev Datasets respectively.
Return type: (Dataset, Dataset, Dataset)

class podium.datasets.impl.SNLISimple(file_path, fields)¶

Bases: podium.datasets.dataset.Dataset

A Simple SNLI Dataset class. This class only uses three fields by default: gold_label, sentence1, sentence2.

NAME¶

Name of the Dataset.

Type: str

URL¶

URL to the SNLI dataset.

Type: str

DATASET_DIR¶

Name of the directory in which the dataset files are stored.

Type: str

ARCHIVE_TYPE¶

Archive type, i.e. compression method used for archiving the downloaded dataset file.

Type: str

TRAIN_FILE_NAME¶

Name of the file in which the train dataset is stored.

Type: str

TEST_FILE_NAME¶

Name of the file in which the test dataset is stored.

Type: str

DEV_FILE_NAME¶

Name of the file in which the dev (validation) dataset is stored.

Type: str

GOLD_LABEL_FIELD_NAME¶

Name of the field containing gold label

Type: str

SENTENCE1_FIELD_NAME¶

Name of the field containing sentence1

Type: str

SENTENCE2_FIELD_NAME¶

Name of the field containing sentence2

Type: str

static get_default_fields()¶

Method returns the three main SNLI fields in the following order: gold_label, sentence1, sentence2

Returns: fields – Dictionary mapping field names to respective Fields.
Return type: dict(str, Field)

static get_train_test_dev_dataset(fields=None)¶

Method creates train, test and dev (validation) Datasets for the SNLI dataset. If the snli_1.0 directory is not present in the current/working directory, it will be downloaded automatically.

Parameters: fields (dict(str, Field), optional) – A dictionary that maps field names to Field objects. If not supplied, `get_default_fields` is used.
Returns: (train_dataset, test_dataset, dev_dataset) – A tuple containing train, test and dev Datasets respectively.
Return type: (Dataset, Dataset, Dataset)
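To make the difference between the three-field and the full field set concrete, the sketch below selects fields from SNLI-style records. It assumes the split files are JSON Lines, with one JSON object per line keyed by the field names listed above; the helper `read_snli_lines` and the sample record are hypothetical and not part of podium.

```python
import json

# The three fields used by default in the simple variant.
SIMPLE_FIELDS = ("gold_label", "sentence1", "sentence2")


def read_snli_lines(lines, field_names=SIMPLE_FIELDS):
    """Keep only `field_names` from each JSON Lines record.

    Hypothetical helper; assumes one JSON object per line.
    """
    examples = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        examples.append({name: record[name] for name in field_names})
    return examples


# A made-up record for illustration, not taken from the dataset.
sample_lines = [
    json.dumps({
        "gold_label": "entailment",
        "sentence1": "A man plays a guitar.",
        "sentence2": "A person makes music.",
        "pairID": "example-1",
    }),
    "",
]
examples = read_snli_lines(sample_lines)
print(examples[0]["gold_label"])
```

Passing a longer `field_names` tuple would keep more columns, which mirrors the relationship between the simple and the full dataset class.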

class podium.datasets.impl.SST(file_path, fields, fine_grained=False, subtrees=False)¶

Bases: podium.datasets.dataset.Dataset

The Stanford sentiment treebank dataset.

NAME¶

dataset name

Type: str

URL¶

url to the SST dataset

Type: str

DATASET_DIR¶

name of the folder in the dataset containing train and test directories

Type: str

ARCHIVE_TYPE¶

string that defines archive type, used for unpacking dataset

Type: str

TEXT_FIELD_NAME¶

name of the field containing comment text

Type: str

LABEL_FIELD_NAME¶

name of the field containing label value

Type: str

POSITIVE_LABEL¶

positive sentiment label

Type: int

NEGATIVE_LABEL¶

negative sentiment label

Type: int

static get_dataset_splits(fields=None, fine_grained=False, subtrees=False)¶

Method loads and creates dataset splits for the SST dataset.

Parameters

fields (dict(str, Field), optional) – dictionary mapping field name to field, if not given method will use `get_default_fields`. User should use default field names defined in class attributes.
fine_grained (bool) – if False, returns the binary (positive/negative) SST dataset and filters out neutral examples. In that case, please set your Fields not to be eager.
subtrees (bool) – also return the subtrees of each input instance as separate instances. This causes the dataset to become much larger.

Returns

(train_dataset, valid_dataset, test_dataset) – tuple containing train, valid and test dataset

Return type

(Dataset, Dataset, Dataset)
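The effect of `fine_grained=False` can be sketched as a label mapping followed by a filter. The sketch assumes the usual five-point SST scale (0 = very negative, 2 = neutral, 4 = very positive) and assumes `1` and `0` as the positive and negative label values; neither is stated in this reference, and `binarize` is a hypothetical helper.

```python
# Assumed values; the real class attributes may differ.
POSITIVE_LABEL = 1
NEGATIVE_LABEL = 0
NEUTRAL = 2  # middle of the assumed 0..4 fine-grained scale


def binarize(examples):
    """Map fine-grained labels to binary ones and drop neutral examples."""
    binary = []
    for text, label in examples:
        if label == NEUTRAL:
            continue  # neutral examples are filtered out
        binary.append(
            (text, POSITIVE_LABEL if label > NEUTRAL else NEGATIVE_LABEL)
        )
    return binary


fine_grained = [
    ("a masterpiece", 4),
    ("fairly good", 3),
    ("it exists", 2),
    ("rather dull", 1),
    ("unwatchable", 0),
]
binary = binarize(fine_grained)
print(binary)
```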

static get_default_fields()¶

Method returns default SST fields: text and label.

Returns: fields – Dictionary mapping field name to field.
Return type: dict(str, Field)
