podium.datasets.impl package

Submodules

podium.datasets.impl.catacx_comments_dataset module

Module contains the Catacx comments dataset.

class podium.datasets.impl.catacx_comments_dataset.CatacxCommentsDataset(dir_path, fields=None)

Bases: podium.datasets.dataset.Dataset

Simple Catacx dataset. Contains only the comments.

static get_dataset(fields=None)

Downloads (if necessary) and loads the dataset. Not supported yet. Raises NotImplementedError if called.

Parameters

fields (dict(str, Field)) – dictionary that maps field names to fields. If None is passed, the default set of fields is used.

Returns

The loaded dataset.

Return type

CatacxCommentsDataset

static get_default_fields()

Method returns a dict of the default Catacx comment fields: author_name, author_id, id, likes_cnt and message.

Returns

fields – dict containing all default Catacx fields

Return type

dict(str, Field)

podium.datasets.impl.catacx_dataset module

class podium.datasets.impl.catacx_dataset.CatacxDataset(dir_path, fields=None)

Bases: podium.datasets.hierarhical_dataset.HierarchicalDataset

Catacx dataset.

static get_dataset(fields=None)

Downloads (if necessary) and loads the dataset. Not supported yet. Raises NotImplementedError if called.

Parameters

fields (dict(str, Field)) – dictionary that maps field names to fields. If None is passed, the default set of fields is used.

Returns

The loaded dataset.

Return type

CatacxDataset

static get_default_fields()

Method returns a dict of default Catacx fields.

Returns

fields – dict containing all default Catacx fields

Return type

dict(str, Field)

podium.datasets.impl.conllu_dataset module

Module contains the CoNLL-U dataset.

class podium.datasets.impl.conllu_dataset.CoNLLUDataset(file_path, fields=None)

Bases: podium.datasets.dataset.Dataset

A CoNLL-U dataset class. This class uses all default CoNLL-U fields.

static get_default_fields()

Method returns a dict of the default CoNLL-U fields: id, form, lemma, upos, xpos, feats, head, deprel, deps and misc.

Returns

fields – Dict containing all default CoNLL-U fields.

Return type

dict(str, Field)
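
The ten default field names correspond to the ten tab-separated columns of a CoNLL-U token line. The following is a minimal sketch of that mapping using only the standard library; `parse_conllu_token` is a hypothetical helper written for illustration and is not part of podium.

```python
# Field names in the order the CoNLL-U format defines its ten columns.
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def parse_conllu_token(line):
    """Split one tab-separated token line into a dict (hypothetical helper)."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != len(CONLLU_FIELDS):
        raise ValueError(
            f"expected {len(CONLLU_FIELDS)} columns, got {len(columns)}")
    return dict(zip(CONLLU_FIELDS, columns))


# An illustrative token line; "_" marks an unspecified value in CoNLL-U.
token = parse_conllu_token(
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_")
print(token["form"], token["upos"], token["head"])
```

Each value stays a string here; how podium converts or numericalizes the columns is up to the Field objects and is not shown.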

podium.datasets.impl.cornell_movie_dialogs_dataset module

Module contains Cornell Movie Dialogs datasets.

class podium.datasets.impl.cornell_movie_dialogs_dataset.CornellMovieDialogsConversationalDataset(data, fields=None)

Bases: podium.datasets.dataset.Dataset

Cornell Movie Dialogs Conversational dataset which contains sentences and replies from movies.

static get_default_fields()

Method returns the default Cornell Movie Dialogs fields: sentence and reply. The two fields share the same vocabulary.

Returns

fields – Dictionary mapping field name to field.

Return type

dict(str, Field)
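
To illustrate what sentence and reply examples with a shared vocabulary can look like, here is a small sketch. It assumes that each utterance is paired with the one that follows it in the same conversation, which this page does not state; `conversation_to_pairs` and `shared_vocabulary` are hypothetical helpers, not podium functions.

```python
def conversation_to_pairs(utterances):
    """Pair each utterance with the one that follows it (assumed pairing rule)."""
    return [{"sentence": first, "reply": second}
            for first, second in zip(utterances, utterances[1:])]


def shared_vocabulary(pairs):
    """Collect one token set from both the sentence and the reply texts."""
    vocab = set()
    for pair in pairs:
        for text in (pair["sentence"], pair["reply"]):
            vocab.update(text.lower().split())
    return vocab


# A made-up, pre-tokenized conversation.
conversation = ["Where are you going ?", "To the station .", "Why the station ?"]
pairs = conversation_to_pairs(conversation)
vocab = shared_vocabulary(pairs)
print(pairs[0])
print(sorted(vocab))
```

Because both fields draw on the same token set, a word seen only in replies still has an entry when it later appears in a sentence.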

podium.datasets.impl.croatian_ner_dataset module

Module contains the Croatian NER dataset.

class podium.datasets.impl.croatian_ner_dataset.CroatianNERDataset(tokenized_documents, fields)

Bases: podium.datasets.dataset.Dataset

Croatian NER dataset.

A single example in the dataset represents a single sentence in the input data.

classmethod get_dataset(tokenizer='split', tag_schema='IOB', fields=None, **kwargs)

Method downloads (if necessary) and loads the dataset.

Parameters
  • tokenizer (str | callable) – Word-level tokenizer used to tokenize the input text

  • tag_schema (str) –

    Tag schema used for constructing the token labels.

    Supported tag schemas:

    'IOB': the label of the first token of an entity is prefixed with 'B-'; the remaining tokens that belong to the same entity are prefixed with 'I-'. Tokens that do not belong to any named entity are labeled 'O'.

  • fields (dict(str, Field)) – dictionary mapping field names to fields. If set to None, the default fields are used.

  • **kwargs

    SCPLargeResource.SCP_USER_KEY:

    User on the host machine. Not required if the user on the local machine matches the user on the host machine.

    SCPLargeResource.SCP_PRIVATE_KEY:

    Path to the ssh private key eligible to access the host machine. Not required on Unix if the private key is in the default location.

    SCPLargeResource.SCP_PASS_KEY:

    Password for the ssh private key (optional). Can be omitted if the private key is not encrypted.

Returns

The loaded dataset.

Return type

CroatianNERDataset
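
The 'IOB' schema described for tag_schema can be sketched in a few lines. This is an illustration of the labeling rule only, not podium's implementation; the `iob_labels` helper, its (start, end, type) span format and the PER/LOC entity types are assumptions made for the example.

```python
def iob_labels(tokens, entities):
    """Label tokens with IOB tags.

    `entities` is a list of (start, end, type) token spans with an exclusive
    end index. Both the helper and the span format are hypothetical.
    """
    labels = ["O"] * len(tokens)              # default: outside any entity
    for start, end, entity_type in entities:
        labels[start] = "B-" + entity_type    # first token of the entity
        for i in range(start + 1, end):
            labels[i] = "I-" + entity_type    # remaining tokens of the entity
    return labels


tokens = "Ivan Horvat živi u Zagrebu".split()
labels = iob_labels(tokens, [(0, 2, "PER"), (4, 5, "LOC")])
print(list(zip(tokens, labels)))
```

A single-token entity receives only a 'B-' label, since there are no remaining tokens to mark with 'I-'.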

static get_default_fields()

Method returns default Croatian NER dataset fields.

Returns

fields – Dictionary mapping field name to field.

Return type

dict(str, Field)

podium.datasets.impl.eurovoc_dataset module

Module contains the EuroVoc dataset.

class podium.datasets.impl.eurovoc_dataset.EuroVocDataset(eurovoc_labels, crovoc_labels, documents, mappings, fields=None)

Bases: podium.datasets.dataset.Dataset

EuroVoc dataset class that contains labeled documents and the label hierarchy.

get_all_ancestors(label_id)

Returns ids of all ancestors of the label with the given label id.

Parameters

label_id (int) – id of the label

Returns

list of label_ids of all ancestors of the given label or None if the label is not present in the dataset label hierarchies

Return type

list(int)

get_crovoc_label_hierarchy()

Returns CroVoc label hierarchy.

Returns

dictionary that maps label id to label

Return type

dict(int, Label)

static get_default_fields()

Method returns default EuroVoc fields: title, text, eurovoc and crovoc labels.

Returns

fields – Dictionary mapping field name to field.

Return type

dict(str, Field)

get_direct_parents(label_id)

Returns ids of direct parents of the label with the given label id.

Parameters

label_id (int) – id of the label

Returns

list of label_ids of all direct parents of the given label or None if the label is not present in the dataset label hierarchies

Return type

list(int)

get_eurovoc_label_hierarchy()

Returns the EuroVoc label hierarchy.

Returns

dictionary that maps label id to label

Return type

dict(int, Label)

is_ancestor(label_id, example)

Checks if the given label_id is an ancestor of any labels of the example.

Parameters
  • label_id (int) – id of the label

  • example (Example) – example from dataset

Returns

True if label is ancestor to any of the example labels, False otherwise

Return type

boolean
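
The relationship between get_direct_parents, get_all_ancestors and is_ancestor can be sketched on a toy hierarchy. The code below is not the EuroVocDataset implementation: the `PARENTS` mapping is made up, the functions are free functions rather than methods, and `is_ancestor` takes a plain list of label ids instead of an Example.

```python
# Hypothetical hierarchy: label id -> list of direct parent ids.
PARENTS = {
    1: [],
    2: [1],
    3: [1],
    4: [2, 3],
}


def get_direct_parents(label_id):
    """Return the direct parents, or None for an unknown label."""
    return PARENTS.get(label_id)


def get_all_ancestors(label_id):
    """Return every ancestor id (parents, grandparents, ...), or None."""
    if label_id not in PARENTS:
        return None
    ancestors, stack = [], list(PARENTS[label_id])
    while stack:
        current = stack.pop()
        if current not in ancestors:          # a label may be reachable twice
            ancestors.append(current)
            stack.extend(PARENTS.get(current, []))
    return ancestors


def is_ancestor(label_id, example_labels):
    """True if label_id is an ancestor of any label in example_labels."""
    return any(label_id in (get_all_ancestors(label) or [])
               for label in example_labels)


print(get_direct_parents(4), sorted(get_all_ancestors(4)))
print(is_ancestor(1, [4]), is_ancestor(4, [2]))
```

Label 4 has two direct parents that share the parent 1, so the traversal has to skip labels it has already collected.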

podium.datasets.impl.eurovoc_dataset.remove_nonalpha_and_stopwords(raw, tokenized, stop_words)

Removes all non-alphabetical characters and stop words from tokens.

Parameters
  • raw (string) – raw text

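  • tokenized (list(str)) – tokenized text

  • stop_words – stop words to remove from the tokens

Returns

a tuple of the raw text and the list of filtered tokens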

Return type

tuple(str, list(str))
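
A function with this signature and return type could look roughly like the sketch below. It is an assumption about the behavior based on the one-line description above, not the library's code, and it is named `remove_nonalpha_and_stopwords_sketch` to keep the two apart.

```python
def remove_nonalpha_and_stopwords_sketch(raw, tokenized, stop_words):
    """Strip non-alphabetical characters, then drop empty tokens and stop words.

    Returns the raw text unchanged together with the filtered tokens.
    """
    filtered = []
    for token in tokenized:
        cleaned = "".join(ch for ch in token if ch.isalpha())
        if cleaned and cleaned not in stop_words:
            filtered.append(cleaned)
    return raw, filtered


raw_text = "Hello, the world 42!"
raw_out, tokens_out = remove_nonalpha_and_stopwords_sketch(
    raw_text, raw_text.split(), stop_words={"the"})
print(tokens_out)
```

Whether the real function also lowercases tokens or compares stop words case-insensitively is not stated on this page, so the sketch does neither.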

podium.datasets.impl.imdb_sentiment_dataset module

Module contains the IMDB Large Movie Review Dataset.

Dataset webpage: http://ai.stanford.edu/~amaas/data/sentiment/

When using this dataset, please cite:

@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}

class podium.datasets.impl.imdb_sentiment_dataset.IMDB(dir_path, fields)

Bases: podium.datasets.dataset.Dataset

Simple IMDB dataset that uses only the supervised data, without any preprocessing.

NAME

dataset name

Type

str

URL

url to the imdb dataset

Type

str

DATASET_DIR

name of the folder in the dataset containing train and test directories

Type

str

ARCHIVE_TYPE

string that defines archive type, used for unpacking dataset

Type

str

TRAIN_DIR

name of the training directory

Type

str

TEST_DIR

name of the directory containing test examples

Type

str

POSITIVE_LABEL_DIR

name of the subdirectory containing examples with positive sentiment

Type

str

NEGATIVE_LABEL_DIR

name of the subdirectory containing examples with negative sentiment

Type

str

TEXT_FIELD_NAME

name of the field containing comment text

Type

str

LABEL_FIELD_NAME

name of the field containing label value

Type

str

POSITIVE_LABEL

positive sentiment label

Type

int

NEGATIVE_LABEL

negative sentiment label

Type

int

static get_dataset_splits(fields=None)

Method creates the train and test datasets for the IMDB dataset.

Parameters

fields (dict(str, Field), optional) – dictionary mapping field names to fields. If not given, the method uses `get_default_fields`. Users should use the default field names defined in the class attributes.

Returns

(train_dataset, test_dataset) – tuple containing train dataset and test dataset

Return type

(Dataset, Dataset)
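
The class attributes above describe a directory layout: a train and a test directory, each with one subdirectory per sentiment. The following sketch reads such a layout into (text, label) pairs using only the standard library. The directory names "pos" and "neg", the label values 1 and 0, and the `load_split` helper are assumptions for illustration; the real values are the class attributes, and this is not how podium builds its Dataset objects.

```python
import os
import tempfile

# Hypothetical stand-ins for the class attributes listed above.
POSITIVE_LABEL_DIR, NEGATIVE_LABEL_DIR = "pos", "neg"
POSITIVE_LABEL, NEGATIVE_LABEL = 1, 0


def load_split(split_dir):
    """Read every file under the positive and negative subdirectories."""
    examples = []
    for subdir, label in ((POSITIVE_LABEL_DIR, POSITIVE_LABEL),
                          (NEGATIVE_LABEL_DIR, NEGATIVE_LABEL)):
        folder = os.path.join(split_dir, subdir)
        for name in sorted(os.listdir(folder)):
            with open(os.path.join(folder, name), encoding="utf-8") as handle:
                examples.append((handle.read(), label))
    return examples


# Build a tiny throwaway directory tree to exercise the loader.
with tempfile.TemporaryDirectory() as root:
    for subdir, text in (("pos", "Great film."), ("neg", "Dull plot.")):
        folder = os.path.join(root, "train", subdir)
        os.makedirs(folder)
        with open(os.path.join(folder, "0.txt"), "w", encoding="utf-8") as handle:
            handle.write(text)
    train_examples = load_split(os.path.join(root, "train"))

print(train_examples)
```

The same loader would be called once for the training directory and once for the test directory to obtain the two splits.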

static get_default_fields()

Method returns default Imdb fields: text and label.

Returns

fields – Dictionary mapping field name to field.

Return type

dict(str, Field)

podium.datasets.impl.pauza_dataset module

Module contains PauzaHR datasets.

class podium.datasets.impl.pauza_dataset.PauzaHRDataset(dir_path, fields)

Bases: podium.datasets.dataset.Dataset

Simple PauzaHR dataset class which uses original reviews.

URL

url to the PauzaHR dataset

Type

str

NAME

dataset name

Type

str

DATASET_DIR

name of the folder in the dataset containing train and test directories

Type

str

ARCHIVE_TYPE

string that defines archive type, used for unpacking dataset

Type

str

TRAIN_DIR

name of the training directory

Type

str

TEST_DIR

name of the directory containing test examples

Type

str

static get_default_fields()

Method returns default PauzaHR fields: rating, source and text.

Returns

fields – Dictionary mapping field name to field.

Return type

dict(str, Field)

static get_train_test_dataset(fields=None)

Method creates the train and test datasets for the PauzaHR dataset.

Parameters

fields (dict(str, Field), optional) – dictionary mapping field names to fields. If not given, the method uses `get_default_fields`.

Returns

(train_dataset, test_dataset) – tuple containing train dataset and test dataset

Return type

(Dataset, Dataset)

Module contents

Package contains concrete dataset implementations.

class podium.datasets.impl.IMDB(dir_path, fields)

Bases: podium.datasets.dataset.Dataset

Simple IMDB dataset that uses only the supervised data, without any preprocessing.

NAME

dataset name

Type

str

URL

url to the imdb dataset

Type

str

DATASET_DIR

name of the folder in the dataset containing train and test directories

Type

str

ARCHIVE_TYPE

string that defines archive type, used for unpacking dataset

Type

str

TRAIN_DIR

name of the training directory

Type

str

TEST_DIR

name of the directory containing test examples

Type

str

POSITIVE_LABEL_DIR

name of the subdirectory containing examples with positive sentiment

Type

str

NEGATIVE_LABEL_DIR

name of the subdirectory containing examples with negative sentiment

Type

str

TEXT_FIELD_NAME

name of the field containing comment text

Type

str

LABEL_FIELD_NAME

name of the field containing label value

Type

str

POSITIVE_LABEL

positive sentiment label

Type

int

NEGATIVE_LABEL

negative sentiment label

Type

int

static get_dataset_splits(fields=None)

Method creates the train and test datasets for the IMDB dataset.

Parameters

fields (dict(str, Field), optional) – dictionary mapping field names to fields. If not given, the method uses `get_default_fields`. Users should use the default field names defined in the class attributes.

Returns

(train_dataset, test_dataset) – tuple containing train dataset and test dataset

Return type

(Dataset, Dataset)

static get_default_fields()

Method returns default Imdb fields: text and label.

Returns

fields – Dictionary mapping field name to field.

Return type

dict(str, Field)

class podium.datasets.impl.CatacxDataset(dir_path, fields=None)

Bases: podium.datasets.hierarhical_dataset.HierarchicalDataset

Catacx dataset.

static get_dataset(fields=None)

Downloads (if necessary) and loads the dataset. Not supported yet. Raises NotImplementedError if called.

Parameters

fields (dict(str, Field)) – dictionary that maps field names to fields. If None is passed, the default set of fields is used.

Returns

The loaded dataset.

Return type

CatacxDataset

static get_default_fields()

Method returns a dict of default Catacx fields.

Returns

fields – dict containing all default Catacx fields

Return type

dict(str, Field)

class podium.datasets.impl.CoNLLUDataset(file_path, fields=None)

Bases: podium.datasets.dataset.Dataset

A CoNLL-U dataset class. This class uses all default CoNLL-U fields.

static get_default_fields()

Method returns a dict of the default CoNLL-U fields: id, form, lemma, upos, xpos, feats, head, deprel, deps and misc.

Returns

fields – Dict containing all default CoNLL-U fields.

Return type

dict(str, Field)

class podium.datasets.impl.CornellMovieDialogsConversationalDataset(data, fields=None)

Bases: podium.datasets.dataset.Dataset

Cornell Movie Dialogs Conversational dataset which contains sentences and replies from movies.

static get_default_fields()

Method returns the default Cornell Movie Dialogs fields: sentence and reply. The two fields share the same vocabulary.

Returns

fields – Dictionary mapping field name to field.

Return type

dict(str, Field)

class podium.datasets.impl.EuroVocDataset(eurovoc_labels, crovoc_labels, documents, mappings, fields=None)

Bases: podium.datasets.dataset.Dataset

EuroVoc dataset class that contains labeled documents and the label hierarchy.

get_all_ancestors(label_id)

Returns ids of all ancestors of the label with the given label id.

Parameters

label_id (int) – id of the label

Returns

list of label_ids of all ancestors of the given label or None if the label is not present in the dataset label hierarchies

Return type

list(int)

get_crovoc_label_hierarchy()

Returns CroVoc label hierarchy.

Returns

dictionary that maps label id to label

Return type

dict(int, Label)

static get_default_fields()

Method returns default EuroVoc fields: title, text, eurovoc and crovoc labels.

Returns

fields – Dictionary mapping field name to field.

Return type

dict(str, Field)

get_direct_parents(label_id)

Returns ids of direct parents of the label with the given label id.

Parameters

label_id (int) – id of the label

Returns

list of label_ids of all direct parents of the given label or None if the label is not present in the dataset label hierarchies

Return type

list(int)

get_eurovoc_label_hierarchy()

Returns the EuroVoc label hierarchy.

Returns

dictionary that maps label id to label

Return type

dict(int, Label)

is_ancestor(label_id, example)

Checks if the given label_id is an ancestor of any labels of the example.

Parameters
  • label_id (int) – id of the label

  • example (Example) – example from dataset

Returns

True if label is ancestor to any of the example labels, False otherwise

Return type

boolean

class podium.datasets.impl.PauzaHRDataset(dir_path, fields)

Bases: podium.datasets.dataset.Dataset

Simple PauzaHR dataset class which uses original reviews.

URL

url to the PauzaHR dataset

Type

str

NAME

dataset name

Type

str

DATASET_DIR

name of the folder in the dataset containing train and test directories

Type

str

ARCHIVE_TYPE

string that defines archive type, used for unpacking dataset

Type

str

TRAIN_DIR

name of the training directory

Type

str

TEST_DIR

name of the directory containing test examples

Type

str

static get_default_fields()

Method returns default PauzaHR fields: rating, source and text.

Returns

fields – Dictionary mapping field name to field.

Return type

dict(str, Field)

static get_train_test_dataset(fields=None)

Method creates the train and test datasets for the PauzaHR dataset.

Parameters

fields (dict(str, Field), optional) – dictionary mapping field names to fields. If not given, the method uses `get_default_fields`.

Returns

(train_dataset, test_dataset) – tuple containing train dataset and test dataset

Return type

(Dataset, Dataset)

class podium.datasets.impl.SNLIDataset(file_path, fields)

Bases: podium.datasets.impl.snli_dataset.SNLISimple

A SNLI Dataset class. Unlike SNLISimple, this class includes all the fields included in the SNLI dataset by default.

NAME

Name of the Dataset.

Type

str

URL

URL to the SNLI dataset.

Type

str

DATASET_DIR

Name of the directory in which the dataset files are stored.

Type

str

ARCHIVE_TYPE

Archive type, i.e. compression method used for archiving the downloaded dataset file.

Type

str

TRAIN_FILE_NAME

Name of the file in which the train dataset is stored.

Type

str

TEST_FILE_NAME

Name of the file in which the test dataset is stored.

Type

str

DEV_FILE_NAME

Name of the file in which the dev (validation) dataset is stored.

Type

str

ANNOTATOR_LABELS_FIELD_NAME

Name of the field containing annotator labels

Type

str

CAPTION_ID_FIELD_NAME

Name of the field containing caption ID

Type

str

GOLD_LABEL_FIELD_NAME

Name of the field containing gold label

Type

str

PAIR_ID_FIELD_NAME

Name of the field containing pair ID

Type

str

SENTENCE1_FIELD_NAME

Name of the field containing sentence1

Type

str

SENTENCE1_PARSE_FIELD_NAME

Name of the field containing sentence1 parse

Type

str

SENTENCE1_BINARY_PARSE_FIELD_NAME

Name of the field containing sentence1 binary parse

Type

str

SENTENCE2_FIELD_NAME

Name of the field containing sentence2

Type

str

SENTENCE2_PARSE_FIELD_NAME

Name of the field containing sentence2 parse

Type

str

SENTENCE2_BINARY_PARSE_FIELD_NAME

Name of the field containing sentence2 binary parse

Type

str

static get_default_fields()

Method returns all SNLI fields in the following order: annotator_labels, captionID, gold_label, pairID, sentence1, sentence1_parse, sentence1_binary_parse, sentence2, sentence2_parse, sentence2_binary_parse

Returns

fields – Dictionary mapping field names to respective Fields.

Return type

dict(str, Field)

Notes

This dataset includes both parses for every sentence.

static get_train_test_dev_dataset(fields=None)

Method creates train, test and dev (validation) Datasets for the SNLI dataset. If the snli_1.0 directory is not present in the current/working directory, it will be downloaded automatically.

Parameters

fields (dict(str, Field), optional) – A dictionary that maps field names to Field objects. If not supplied, `get_default_fields` is used.

Returns

(train_dataset, test_dataset, dev_dataset) – A tuple containing train, test and dev Datasets respectively.

Return type

(Dataset, Dataset, Dataset)

class podium.datasets.impl.SNLISimple(file_path, fields)

Bases: podium.datasets.dataset.Dataset

A Simple SNLI Dataset class. This class only uses three fields by default: gold_label, sentence1, sentence2.

NAME

Name of the Dataset.

Type

str

URL

URL to the SNLI dataset.

Type

str

DATASET_DIR

Name of the directory in which the dataset files are stored.

Type

str

ARCHIVE_TYPE

Archive type, i.e. compression method used for archiving the downloaded dataset file.

Type

str

TRAIN_FILE_NAME

Name of the file in which the train dataset is stored.

Type

str

TEST_FILE_NAME

Name of the file in which the test dataset is stored.

Type

str

DEV_FILE_NAME

Name of the file in which the dev (validation) dataset is stored.

Type

str

GOLD_LABEL_FIELD_NAME

Name of the field containing gold label

Type

str

SENTENCE1_FIELD_NAME

Name of the field containing sentence1

Type

str

SENTENCE2_FIELD_NAME

Name of the field containing sentence2

Type

str

static get_default_fields()

Method returns the three main SNLI fields in the following order: gold_label, sentence1, sentence2

Returns

fields – Dictionary mapping field names to respective Fields.

Return type

dict(str, Field)

static get_train_test_dev_dataset(fields=None)

Method creates train, test and dev (validation) Datasets for the SNLI dataset. If the snli_1.0 directory is not present in the current/working directory, it will be downloaded automatically.

Parameters

fields (dict(str, Field), optional) – A dictionary that maps field names to Field objects. If not supplied, `get_default_fields` is used.

Returns

(train_dataset, test_dataset, dev_dataset) – A tuple containing train, test and dev Datasets respectively.

Return type

(Dataset, Dataset, Dataset)
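
The difference between the three SNLISimple fields and the full SNLIDataset field set can be illustrated on a single record. The sketch assumes the data is stored as one JSON object per line with the keys listed above; `parse_snli_line` is a hypothetical helper, the sample record is made up, and none of this is podium's loading code.

```python
import json

# Field subsets named in the class descriptions above.
SIMPLE_FIELDS = ("gold_label", "sentence1", "sentence2")
ALL_FIELDS = ("annotator_labels", "captionID", "gold_label", "pairID",
              "sentence1", "sentence1_parse", "sentence1_binary_parse",
              "sentence2", "sentence2_parse", "sentence2_binary_parse")


def parse_snli_line(line, field_names=SIMPLE_FIELDS):
    """Decode one JSON line and keep only the requested fields."""
    record = json.loads(line)
    return {name: record[name] for name in field_names}


# A made-up record in the assumed layout; the values are illustrative only.
sample_line = json.dumps({
    "annotator_labels": ["entailment"],
    "captionID": "example-caption",
    "gold_label": "entailment",
    "pairID": "example-pair",
    "sentence1": "A dog runs.",
    "sentence1_parse": "(ROOT (S (NP (DT A) (NN dog)) (VP (VBZ runs)) (. .)))",
    "sentence1_binary_parse": "( ( A dog ) ( runs . ) )",
    "sentence2": "An animal moves.",
    "sentence2_parse": "(ROOT (S (NP (DT An) (NN animal)) (VP (VBZ moves)) (. .)))",
    "sentence2_binary_parse": "( ( An animal ) ( moves . ) )",
})

simple = parse_snli_line(sample_line)
full = parse_snli_line(sample_line, ALL_FIELDS)
print(simple)
print(list(full))
```

Selecting fields by name keeps the result in the order of the requested field tuple, which matches the order given for each class's default fields.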

class podium.datasets.impl.SST(file_path, fields, fine_grained=False, subtrees=False)

Bases: podium.datasets.dataset.Dataset

The Stanford sentiment treebank dataset.

NAME

dataset name

Type

str

URL

url to the SST dataset

Type

str

DATASET_DIR

name of the folder in the dataset containing train and test directories

Type

str

ARCHIVE_TYPE

string that defines archive type, used for unpacking dataset

Type

str

TEXT_FIELD_NAME

name of the field containing comment text

Type

str

LABEL_FIELD_NAME

name of the field containing label value

Type

str

POSITIVE_LABEL

positive sentiment label

Type

int

NEGATIVE_LABEL

negative sentiment label

Type

int

static get_dataset_splits(fields=None, fine_grained=False, subtrees=False)

Method loads and creates dataset splits for the SST dataset.

Parameters
  • fields (dict(str, Field), optional) – dictionary mapping field name to field, if not given method will use `get_default_fields`. User should use default field names defined in class attributes.

  • fine_grained (bool) – if False, returns the binary (positive/negative) SST dataset and filters out neutral examples. In that case, set your Fields not to be eager.

  • subtrees (bool) – also return the subtrees of each input instance as separate instances. This causes the dataset to become much larger.

Returns

(train_dataset, valid_dataset, test_dataset) – tuple containing train, valid and test dataset

Return type

(Dataset, Dataset, Dataset)
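
The effect of fine_grained can be sketched as a label conversion over the treebank's 0 to 4 sentiment scores. The integer values chosen for the positive and negative labels and the helpers `convert_score` and `build_examples` are assumptions for illustration; the actual label values are defined by the class attributes and the Fields in use, and the subtrees option is not covered here.

```python
# Assumed integer labels for the binary setting (hypothetical values).
POSITIVE_LABEL, NEGATIVE_LABEL = 1, 0


def convert_score(score, fine_grained=False):
    """Map a 0-4 treebank score to a label; None means 'drop this example'."""
    if fine_grained:
        return score            # keep all five classes
    if score == 2:
        return None             # neutral examples are filtered out
    return POSITIVE_LABEL if score > 2 else NEGATIVE_LABEL


def build_examples(scored_sentences, fine_grained=False):
    """Turn (score, text) pairs into (text, label) pairs, skipping dropped ones."""
    examples = []
    for score, text in scored_sentences:
        label = convert_score(score, fine_grained)
        if label is not None:
            examples.append((text, label))
    return examples


# Made-up scored sentences.
data = [(4, "A wonderful film."), (2, "It is a film."), (0, "Unwatchable.")]
binary = build_examples(data)
fine = build_examples(data, fine_grained=True)
print(binary)
print(fine)
```

In the binary setting the neutral sentence disappears from the result, which is why the number of examples differs between the two settings.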

static get_default_fields()

Method returns default SST fields: text and label.

Returns

fields – Dictionary mapping field name to field.

Return type

dict(str, Field)