podium.datasets.impl package¶
Submodules¶
podium.datasets.impl.catacx_comments_dataset module¶
Module contains the catacx dataset.
-
class
podium.datasets.impl.catacx_comments_dataset.
CatacxCommentsDataset
(dir_path, fields=None)¶ Bases:
podium.datasets.dataset.Dataset
Simple Catacx dataset. Contains only the comments.
-
static
get_dataset
(fields=None)¶ Downloads (if necessary) and loads the dataset. Not supported yet. Raises NotImplementedError if called.
- Parameters
fields (dict(str, Field)) – dictionary that maps field names to fields. If None is passed, the default set of fields is used.
- Returns
The loaded dataset.
- Return type
CatacxCommentsDataset
-
static
get_default_fields
()¶ Method returns a dict of default Catacx comment fields: author_name, author_id, id, likes_cnt, message.
- Returns
fields – dict containing all default Catacx fields
- Return type
dict(str, Field)
podium.datasets.impl.catacx_dataset module¶
-
class
podium.datasets.impl.catacx_dataset.
CatacxDataset
(dir_path, fields=None)¶ Bases:
podium.datasets.hierarhical_dataset.HierarchicalDataset
Catacx dataset.
-
static
get_dataset
(fields=None)¶ Downloads (if necessary) and loads the dataset. Not supported yet. Raises NotImplementedError if called.
- Parameters
fields (dict(str, Field)) – dictionary that maps field names to fields. If None is passed, the default set of fields is used.
- Returns
The loaded dataset.
- Return type
CatacxDataset
-
static
get_default_fields
()¶ Method returns a dict of default Catacx fields.
- Returns
fields – dict containing all default Catacx fields
- Return type
dict(str, Field)
podium.datasets.impl.conllu_dataset module¶
Module contains the CoNLL-U dataset.
-
class
podium.datasets.impl.conllu_dataset.
CoNLLUDataset
(file_path, fields=None)¶ Bases:
podium.datasets.dataset.Dataset
A CoNLL-U dataset class. This class uses all default CoNLL-U fields.
-
static
get_default_fields
()¶ Method returns a dict of default CoNLL-U fields: id, form, lemma, upos, xpos, feats, head, deprel, deps, misc.
- Returns
fields – Dict containing all default CoNLL-U fields.
- Return type
dict(str, Field)
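The ten default field names match the ten tab-separated columns of a CoNLL-U token line. The following is a minimal sketch, independent of podium, that splits one such line into a dict keyed by those names. The helper `parse_conllu_token` is hypothetical and is not part of the library.

```python
# Field names as listed for get_default_fields, in column order.
CONLLU_FIELD_NAMES = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def parse_conllu_token(line):
    """Split one tab-separated CoNLL-U token line into a field dict.

    Hypothetical helper for illustration; not a podium function.
    """
    columns = line.rstrip("\n").split("\t")
    if len(columns) != len(CONLLU_FIELD_NAMES):
        raise ValueError(f"expected 10 columns, got {len(columns)}")
    return dict(zip(CONLLU_FIELD_NAMES, columns))


token = parse_conllu_token(
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_"
)
print(token["form"], token["upos"], token["deprel"])
```

An underscore in a column marks an unspecified value, so `token["deps"]` and `token["misc"]` are both `"_"` here.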
podium.datasets.impl.cornell_movie_dialogs_dataset module¶
Module contains Cornell Movie Dialogs datasets.
-
class
podium.datasets.impl.cornell_movie_dialogs_dataset.
CornellMovieDialogsConversationalDataset
(data, fields=None)¶ Bases:
podium.datasets.dataset.Dataset
Cornell Movie Dialogs Conversational dataset which contains sentences and replies from movies.
-
static
get_default_fields
()¶ Method returns default Cornell Movie Dialogs fields: sentence and reply. Both fields share the same vocabulary.
- Returns
fields – Dictionary mapping field name to field.
- Return type
dict(str, Field)
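One plausible way to obtain sentence and reply values is to pair each utterance in a conversation with the utterance that follows it. The sketch below illustrates that idea only; it is an assumption about the pairing, and `make_sentence_reply_pairs` is a hypothetical helper, not the library's implementation.

```python
def make_sentence_reply_pairs(conversation):
    """Pair every utterance with the one that follows it.

    Hypothetical helper: `conversation` is a list of utterance strings
    in the order they were spoken.
    """
    return list(zip(conversation[:-1], conversation[1:]))


conversation = ["Hi.", "Hello.", "How are you?"]
pairs = make_sentence_reply_pairs(conversation)
for sentence, reply in pairs:
    print(f"sentence={sentence!r} reply={reply!r}")
```

A conversation with a single utterance yields no pairs, because there is nothing to reply to.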
podium.datasets.impl.croatian_ner_dataset module¶
Module contains Croatian NER dataset.
-
class
podium.datasets.impl.croatian_ner_dataset.
CroatianNERDataset
(tokenized_documents, fields)¶ Bases:
podium.datasets.dataset.Dataset
Croatian NER dataset.
A single example in the dataset represents a single sentence in the input data.
-
classmethod
get_dataset
(tokenizer='split', tag_schema='IOB', fields=None, **kwargs)¶ Method downloads (if necessary) and loads the dataset.
- Parameters
tokenizer (str | callable) – Word-level tokenizer used to tokenize the input text
tag_schema (str) –
Tag schema used for constructing the token labels.
- supported tag schemas:
'IOB': the label of the first token of an entity is prefixed with 'B-', and the remaining tokens of the same entity are prefixed with 'I-'. Tokens that do not belong to any named entity are labeled 'O'.
fields (dict(str, Field)) – dictionary mapping field names to fields. If set to None, the default fields are used.
**kwargs –
- SCPLargeResource.SCP_USER_KEY:
User on the host machine. Not required if the user on the local machine matches the user on the host machine.
- SCPLargeResource.SCP_PRIVATE_KEY:
Path to the ssh private key eligible to access the host machine. Not required on Unix if the private key is in the default location.
- SCPLargeResource.SCP_PASS_KEY:
Password for the ssh private key (optional). Can be omitted if the private key is not encrypted.
- Returns
The loaded dataset.
- Return type
CroatianNERDataset
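The IOB schema described in the parameters can be illustrated with a small sketch that labels tokens from entity spans. Everything here is an assumption made for illustration: the helper `iob_labels`, the span format, and the entity type names `PER` and `LOC` are not taken from the library.

```python
def iob_labels(tokens, entities):
    """Return one IOB label per token.

    `entities` is a list of (start, end, entity_type) token spans with an
    exclusive end index. Hypothetical helper, not a podium function.
    """
    labels = ["O"] * len(tokens)
    for start, end, entity_type in entities:
        labels[start] = f"B-{entity_type}"      # first token of the entity
        for i in range(start + 1, end):
            labels[i] = f"I-{entity_type}"      # remaining entity tokens
    return labels


tokens = ["Ivan", "Horvat", "živi", "u", "Zagrebu"]
labels = iob_labels(tokens, [(0, 2, "PER"), (4, 5, "LOC")])
print(list(zip(tokens, labels)))
```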
-
static
get_default_fields
()¶ Method returns default Croatian NER dataset fields.
- Returns
fields – Dictionary mapping field name to field.
- Return type
dict(str, Field)
podium.datasets.impl.eurovoc_dataset module¶
Module contains EuroVoc dataset.
-
class
podium.datasets.impl.eurovoc_dataset.
EuroVocDataset
(eurovoc_labels, crovoc_labels, documents, mappings, fields=None)¶ Bases:
podium.datasets.dataset.Dataset
EuroVoc dataset class that contains labeled documents and the label hierarchy.
-
get_all_ancestors
(label_id)¶ Returns ids of all ancestors of the label with the given label id.
- Parameters
label_id (int) – id of the label
- Returns
list of label_ids of all ancestors of the given label or None if the label is not present in the dataset label hierarchies
- Return type
list(int)
-
get_crovoc_label_hierarchy
()¶ Returns CroVoc label hierarchy.
- Returns
dictionary that maps label id to label
- Return type
dict(int, Label)
-
static
get_default_fields
()¶ Method returns default EuroVoc fields: title, text, eurovoc and crovoc labels.
- Returns
fields – Dictionary mapping field name to field.
- Return type
dict(str, Field)
-
get_direct_parents
(label_id)¶ Returns ids of direct parents of the label with the given label id.
- Parameters
label_id (int) – id of the label
- Returns
list of label_ids of all direct parents of the given label or None if the label is not present in the dataset label hierarchies
- Return type
list(int)
-
get_eurovoc_label_hierarchy
()¶ Returns the EuroVoc label hierarchy.
- Returns
dictionary that maps label id to label
- Return type
dict(int, Label)
-
is_ancestor
(label_id, example)¶ Checks if the given label_id is an ancestor of any labels of the example.
- Parameters
label_id (int) – id of the label
example (Example) – example from dataset
- Returns
True if the label is an ancestor of any of the example's labels, False otherwise
- Return type
boolean
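The hierarchy queries above (direct parents, all ancestors, ancestor check) can be sketched on a toy hierarchy. This is an illustration under stated assumptions, not the class's implementation: the hierarchy is represented as a plain dict from label id to a list of parent ids, and the functions take that dict explicitly instead of being methods.

```python
def get_direct_parents(parents, label_id):
    """Return direct parent ids, or None if the label is unknown."""
    if label_id not in parents:
        return None
    return list(parents[label_id])


def get_all_ancestors(parents, label_id):
    """Return ids of all ancestors, or None if the label is unknown."""
    if label_id not in parents:
        return None
    ancestors, seen = [], set()
    stack = list(parents[label_id])
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # a label reachable via two paths is reported once
        seen.add(current)
        ancestors.append(current)
        stack.extend(parents.get(current, []))
    return ancestors


def is_ancestor(parents, label_id, example_label_ids):
    """True if label_id is an ancestor of any of the example's labels."""
    for example_label in example_label_ids:
        ancestors = get_all_ancestors(parents, example_label)
        if ancestors is not None and label_id in ancestors:
            return True
    return False


# Toy hierarchy (assumed data): label 4 has two parents, 2 and 3,
# which both descend from the root label 1.
parents = {1: [], 2: [1], 3: [1], 4: [2, 3]}
print(sorted(get_all_ancestors(parents, 4)))
print(is_ancestor(parents, 1, [4]))
```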
-
-
podium.datasets.impl.eurovoc_dataset.
remove_nonalpha_and_stopwords
(raw, tokenized, stop_words)¶ Removes all non-alphabetical characters and stop words from tokens.
- Parameters
raw (string) – raw text
tokenized (list(str)) – tokenized text
- Returns
- Return type
tuple(str, list(str))
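A minimal sketch of a hook with the same signature and return type is shown below. It is an assumption about the behaviour, not the library's code: it keeps the raw text unchanged, strips non-alphabetical characters from each token with `str.isalpha`, and drops tokens that become empty or appear in `stop_words`. The name `strip_nonalpha_and_stopwords` is hypothetical.

```python
def strip_nonalpha_and_stopwords(raw, tokenized, stop_words):
    """Return (raw, cleaned_tokens).

    Hypothetical stand-in with the documented signature; the exact
    character filtering used by the library may differ.
    """
    cleaned = []
    for token in tokenized:
        letters = "".join(ch for ch in token if ch.isalpha())
        if letters and letters not in stop_words:
            cleaned.append(letters)
    return raw, cleaned


raw = "Vijeće je 2019. donijelo odluku!"
result = strip_nonalpha_and_stopwords(raw, raw.split(), {"je"})
print(result)
```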
podium.datasets.impl.imdb_sentiment_dataset module¶
Module contains the IMDB Large Movie Review Dataset.
Dataset webpage: http://ai.stanford.edu/~amaas/data/sentiment/
- When using this dataset, please cite:
@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
  title = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month = {June},
  year = {2011},
  address = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages = {142–150},
  url = {http://www.aclweb.org/anthology/P11-1015}
}
-
class
podium.datasets.impl.imdb_sentiment_dataset.
IMDB
(dir_path, fields)¶ Bases:
podium.datasets.dataset.Dataset
Simple IMDB dataset that contains only the supervised (labeled) data, in unprocessed form.
-
NAME
¶ dataset name
- Type
str
-
URL
¶ url to the imdb dataset
- Type
str
-
DATASET_DIR
¶ name of the folder in the dataset containing train and test directories
- Type
str
-
ARCHIVE_TYPE
¶ string that defines archive type, used for unpacking dataset
- Type
str
-
TRAIN_DIR
¶ name of the training directory
- Type
str
-
TEST_DIR
¶ name of the directory containing test examples
- Type
str
-
POSITIVE_LABEL_DIR
¶ name of the subdirectory containing examples with positive sentiment
- Type
str
-
NEGATIVE_LABEL_DIR
¶ name of the subdirectory containing examples with negative sentiment
- Type
str
-
TEXT_FIELD_NAME
¶ name of the field containing comment text
- Type
str
-
LABEL_FIELD_NAME
¶ name of the field containing label value
- Type
str
-
POSITIVE_LABEL
¶ positive sentiment label
- Type
int
-
NEGATIVE_LABEL
¶ negative sentiment label
- Type
int
-
static
get_dataset_splits
(fields=None)¶ Method creates train and test dataset for Imdb dataset.
- Parameters
fields (dict(str, Field), optional) – dictionary mapping field names to fields. If not given, the method uses `get_default_fields`. Users should use the default field names defined in the class attributes.
- Returns
(train_dataset, test_dataset) – tuple containing train dataset and test dataset
- Return type
(Dataset, Dataset)
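The class attributes describe a directory layout in which each split directory holds one subdirectory per sentiment. The sketch below reads such a layout into plain dicts. It does not call podium, and all concrete values (`"train"`, `"pos"`, `"neg"`, the labels `1` and `0`, the field names `"text"` and `"label"`) are assumptions chosen for illustration rather than the actual attribute values.

```python
import os
import tempfile

# Assumed values; the real ones live in the IMDB class attributes.
TRAIN_DIR = "train"
POSITIVE_LABEL_DIR = "pos"
NEGATIVE_LABEL_DIR = "neg"
POSITIVE_LABEL = 1
NEGATIVE_LABEL = 0


def load_split(dataset_dir, split_dir):
    """Read every review under <split>/pos and <split>/neg."""
    examples = []
    for label_dir, label in ((POSITIVE_LABEL_DIR, POSITIVE_LABEL),
                             (NEGATIVE_LABEL_DIR, NEGATIVE_LABEL)):
        folder = os.path.join(dataset_dir, split_dir, label_dir)
        for file_name in sorted(os.listdir(folder)):
            path = os.path.join(folder, file_name)
            with open(path, encoding="utf-8") as f:
                examples.append({"text": f.read(), "label": label})
    return examples


# Build a tiny throwaway layout so the sketch runs without the archive.
with tempfile.TemporaryDirectory() as root:
    for label_dir, text in ((POSITIVE_LABEL_DIR, "Great film."),
                            (NEGATIVE_LABEL_DIR, "Dull plot.")):
        folder = os.path.join(root, TRAIN_DIR, label_dir)
        os.makedirs(folder)
        with open(os.path.join(folder, "0.txt"), "w", encoding="utf-8") as f:
            f.write(text)
    train_examples = load_split(root, TRAIN_DIR)

print(train_examples)
```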
-
static
get_default_fields
()¶ Method returns default Imdb fields: text and label.
- Returns
fields – Dictionary mapping field name to field.
- Return type
dict(str, Field)
-
podium.datasets.impl.pauza_dataset module¶
Module contains PauzaHR datasets.
-
class
podium.datasets.impl.pauza_dataset.
PauzaHRDataset
(dir_path, fields)¶ Bases:
podium.datasets.dataset.Dataset
Simple PauzaHR dataset class which uses original reviews.
-
URL
¶ url to the PauzaHR dataset
- Type
str
-
NAME
¶ dataset name
- Type
str
-
DATASET_DIR
¶ name of the folder in the dataset containing train and test directories
- Type
str
-
ARCHIVE_TYPE
¶ string that defines archive type, used for unpacking dataset
- Type
str
-
TRAIN_DIR
¶ name of the training directory
- Type
str
-
TEST_DIR
¶ name of the directory containing test examples
- Type
str
-
static
get_default_fields
()¶ Method returns default PauzaHR fields: rating, source and text.
- Returns
fields – Dictionary mapping field name to field.
- Return type
dict(str, Field)
-
static
get_train_test_dataset
(fields=None)¶ Method creates train and test dataset for PauzaHR dataset.
- Parameters
fields (dict(str, Field), optional) – dictionary mapping field names to fields. If not given, the method uses `get_default_fields`.
- Returns
(train_dataset, test_dataset) – tuple containing train dataset and test dataset
- Return type
(Dataset, Dataset)
-
Module contents¶
Package contains concrete datasets
-
class
podium.datasets.impl.
IMDB
(dir_path, fields)¶ Bases:
podium.datasets.dataset.Dataset
Simple IMDB dataset that contains only the supervised (labeled) data, in unprocessed form.
-
NAME
¶ dataset name
- Type
str
-
URL
¶ url to the imdb dataset
- Type
str
-
DATASET_DIR
¶ name of the folder in the dataset containing train and test directories
- Type
str
-
ARCHIVE_TYPE
¶ string that defines archive type, used for unpacking dataset
- Type
str
-
TRAIN_DIR
¶ name of the training directory
- Type
str
-
TEST_DIR
¶ name of the directory containing test examples
- Type
str
-
POSITIVE_LABEL_DIR
¶ name of the subdirectory containing examples with positive sentiment
- Type
str
-
NEGATIVE_LABEL_DIR
¶ name of the subdirectory containing examples with negative sentiment
- Type
str
-
TEXT_FIELD_NAME
¶ name of the field containing comment text
- Type
str
-
LABEL_FIELD_NAME
¶ name of the field containing label value
- Type
str
-
POSITIVE_LABEL
¶ positive sentiment label
- Type
int
-
NEGATIVE_LABEL
¶ negative sentiment label
- Type
int
-
static
get_dataset_splits
(fields=None)¶ Method creates train and test dataset for Imdb dataset.
- Parameters
fields (dict(str, Field), optional) – dictionary mapping field names to fields. If not given, the method uses `get_default_fields`. Users should use the default field names defined in the class attributes.
- Returns
(train_dataset, test_dataset) – tuple containing train dataset and test dataset
- Return type
(Dataset, Dataset)
-
static
get_default_fields
()¶ Method returns default Imdb fields: text and label.
- Returns
fields – Dictionary mapping field name to field.
- Return type
dict(str, Field)
-
-
class
podium.datasets.impl.
CatacxDataset
(dir_path, fields=None)¶ Bases:
podium.datasets.hierarhical_dataset.HierarchicalDataset
Catacx dataset.
-
static
get_dataset
(fields=None)¶ Downloads (if necessary) and loads the dataset. Not supported yet. Raises NotImplementedError if called.
- Parameters
fields (dict(str, Field)) – dictionary that maps field names to fields. If None is passed, the default set of fields is used.
- Returns
The loaded dataset.
- Return type
CatacxDataset
-
static
get_default_fields
()¶ Method returns a dict of default Catacx fields.
- Returns
fields – dict containing all default Catacx fields
- Return type
dict(str, Field)
-
class
podium.datasets.impl.
CoNLLUDataset
(file_path, fields=None)¶ Bases:
podium.datasets.dataset.Dataset
A CoNLL-U dataset class. This class uses all default CoNLL-U fields.
-
static
get_default_fields
()¶ Method returns a dict of default CoNLL-U fields: id, form, lemma, upos, xpos, feats, head, deprel, deps, misc.
- Returns
fields – Dict containing all default CoNLL-U fields.
- Return type
dict(str, Field)
-
class
podium.datasets.impl.
CornellMovieDialogsConversationalDataset
(data, fields=None)¶ Bases:
podium.datasets.dataset.Dataset
Cornell Movie Dialogs Conversational dataset which contains sentences and replies from movies.
-
static
get_default_fields
()¶ Method returns default Cornell Movie Dialogs fields: sentence and reply. Both fields share the same vocabulary.
- Returns
fields – Dictionary mapping field name to field.
- Return type
dict(str, Field)
-
class
podium.datasets.impl.
EuroVocDataset
(eurovoc_labels, crovoc_labels, documents, mappings, fields=None)¶ Bases:
podium.datasets.dataset.Dataset
EuroVoc dataset class that contains labeled documents and the label hierarchy.
-
get_all_ancestors
(label_id)¶ Returns ids of all ancestors of the label with the given label id.
- Parameters
label_id (int) – id of the label
- Returns
list of label_ids of all ancestors of the given label or None if the label is not present in the dataset label hierarchies
- Return type
list(int)
-
get_crovoc_label_hierarchy
()¶ Returns CroVoc label hierarchy.
- Returns
dictionary that maps label id to label
- Return type
dict(int, Label)
-
static
get_default_fields
()¶ Method returns default EuroVoc fields: title, text, eurovoc and crovoc labels.
- Returns
fields – Dictionary mapping field name to field.
- Return type
dict(str, Field)
-
get_direct_parents
(label_id)¶ Returns ids of direct parents of the label with the given label id.
- Parameters
label_id (int) – id of the label
- Returns
list of label_ids of all direct parents of the given label or None if the label is not present in the dataset label hierarchies
- Return type
list(int)
-
get_eurovoc_label_hierarchy
()¶ Returns the EuroVoc label hierarchy.
- Returns
dictionary that maps label id to label
- Return type
dict(int, Label)
-
is_ancestor
(label_id, example)¶ Checks if the given label_id is an ancestor of any labels of the example.
- Parameters
label_id (int) – id of the label
example (Example) – example from dataset
- Returns
True if the label is an ancestor of any of the example's labels, False otherwise
- Return type
boolean
-
-
class
podium.datasets.impl.
PauzaHRDataset
(dir_path, fields)¶ Bases:
podium.datasets.dataset.Dataset
Simple PauzaHR dataset class which uses original reviews.
-
URL
¶ url to the PauzaHR dataset
- Type
str
-
NAME
¶ dataset name
- Type
str
-
DATASET_DIR
¶ name of the folder in the dataset containing train and test directories
- Type
str
-
ARCHIVE_TYPE
¶ string that defines archive type, used for unpacking dataset
- Type
str
-
TRAIN_DIR
¶ name of the training directory
- Type
str
-
TEST_DIR
¶ name of the directory containing test examples
- Type
str
-
static
get_default_fields
()¶ Method returns default PauzaHR fields: rating, source and text.
- Returns
fields – Dictionary mapping field name to field.
- Return type
dict(str, Field)
-
static
get_train_test_dataset
(fields=None)¶ Method creates train and test dataset for PauzaHR dataset.
- Parameters
fields (dict(str, Field), optional) – dictionary mapping field names to fields. If not given, the method uses `get_default_fields`.
- Returns
(train_dataset, test_dataset) – tuple containing train dataset and test dataset
- Return type
(Dataset, Dataset)
-
-
class
podium.datasets.impl.
SNLIDataset
(file_path, fields)¶ Bases:
podium.datasets.impl.snli_dataset.SNLISimple
A SNLI Dataset class. Unlike SNLISimple, this class includes all the fields included in the SNLI dataset by default.
-
NAME
¶ Name of the Dataset.
- Type
str
-
URL
¶ URL to the SNLI dataset.
- Type
str
-
DATASET_DIR
¶ Name of the directory in which the dataset files are stored.
- Type
str
-
ARCHIVE_TYPE
¶ Archive type, i.e. compression method used for archiving the downloaded dataset file.
- Type
str
-
TRAIN_FILE_NAME
¶ Name of the file in which the train dataset is stored.
- Type
str
-
TEST_FILE_NAME
¶ Name of the file in which the test dataset is stored.
- Type
str
-
DEV_FILE_NAME
¶ Name of the file in which the dev (validation) dataset is stored.
- Type
str
-
ANNOTATOR_LABELS_FIELD_NAME
¶ Name of the field containing annotator labels
- Type
str
-
CAPTION_ID_FIELD_NAME
¶ Name of the field containing caption ID
- Type
str
-
GOLD_LABEL_FIELD_NAME
¶ Name of the field containing gold label
- Type
str
-
PAIR_ID_FIELD_NAME
¶ Name of the field containing pair ID
- Type
str
-
SENTENCE1_FIELD_NAME
¶ Name of the field containing sentence1
- Type
str
-
SENTENCE1_PARSE_FIELD_NAME
¶ Name of the field containing sentence1 parse
- Type
str
-
SENTENCE1_BINARY_PARSE_FIELD_NAME
¶ Name of the field containing sentence1 binary parse
- Type
str
-
SENTENCE2_FIELD_NAME
¶ Name of the field containing sentence2
- Type
str
-
SENTENCE2_PARSE_FIELD_NAME
¶ Name of the field containing sentence2 parse
- Type
str
-
SENTENCE2_BINARY_PARSE_FIELD_NAME
¶ Name of the field containing sentence2 binary parse
- Type
str
-
static
get_default_fields
()¶ Method returns all SNLI fields in the following order: annotator_labels, captionID, gold_label, pairID, sentence1, sentence1_parse, sentence1_binary_parse, sentence2, sentence2_parse, sentence2_binary_parse
- Returns
fields – Dictionary mapping field names to respective Fields.
- Return type
dict(str, Field)
Notes
This dataset includes both parses for every sentence,
-
static
get_train_test_dev_dataset
(fields=None)¶ Method creates train, test and dev (validation) Datasets for the SNLI dataset. If the snli_1.0 directory is not present in the current/working directory, it will be downloaded automatically.
- Parameters
fields (dict(str, Field), optional) – A dictionary that maps field names to Field objects. If not supplied, `get_default_fields` is used.
- Returns
(train_dataset, test_dataset, dev_dataset) – A tuple containing train, test and dev Datasets respectively.
- Return type
(Dataset, Dataset, Dataset)
-
-
class
podium.datasets.impl.
SNLISimple
(file_path, fields)¶ Bases:
podium.datasets.dataset.Dataset
A Simple SNLI Dataset class. This class only uses three fields by default: gold_label, sentence1, sentence2.
-
NAME
¶ Name of the Dataset.
- Type
str
-
URL
¶ URL to the SNLI dataset.
- Type
str
-
DATASET_DIR
¶ Name of the directory in which the dataset files are stored.
- Type
str
-
ARCHIVE_TYPE
¶ Archive type, i.e. compression method used for archiving the downloaded dataset file.
- Type
str
-
TRAIN_FILE_NAME
¶ Name of the file in which the train dataset is stored.
- Type
str
-
TEST_FILE_NAME
¶ Name of the file in which the test dataset is stored.
- Type
str
-
DEV_FILE_NAME
¶ Name of the file in which the dev (validation) dataset is stored.
- Type
str
-
GOLD_LABEL_FIELD_NAME
¶ Name of the field containing gold label
- Type
str
-
SENTENCE1_FIELD_NAME
¶ Name of the field containing sentence1
- Type
str
-
SENTENCE2_FIELD_NAME
¶ Name of the field containing sentence2
- Type
str
-
static
get_default_fields
()¶ Method returns the three main SNLI fields in the following order: gold_label, sentence1, sentence2
- Returns
fields – Dictionary mapping field names to respective Fields.
- Return type
dict(str, Field)
-
static
get_train_test_dev_dataset
(fields=None)¶ Method creates train, test and dev (validation) Datasets for the SNLI dataset. If the snli_1.0 directory is not present in the current/working directory, it will be downloaded automatically.
- Parameters
fields (dict(str, Field), optional) – A dictionary that maps field names to Field objects. If not supplied, `get_default_fields` is used.
- Returns
(train_dataset, test_dataset, dev_dataset) – A tuple containing train, test and dev Datasets respectively.
- Return type
(Dataset, Dataset, Dataset)
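To illustrate what keeping only gold_label, sentence1 and sentence2 means, the sketch below reads JSON Lines records and drops every other key. It does not use podium; the JSON Lines input format, the helper `read_simple_snli` and the sample record are assumptions made for illustration.

```python
import json

# The three fields that SNLISimple uses by default.
SIMPLE_FIELDS = ("gold_label", "sentence1", "sentence2")


def read_simple_snli(lines):
    """Parse JSON Lines records, keeping only the three simple fields.

    Hypothetical helper, not a podium function.
    """
    examples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        examples.append({name: record[name] for name in SIMPLE_FIELDS})
    return examples


# Made-up record with one extra key that the simple view discards.
sample_lines = [json.dumps({
    "gold_label": "entailment",
    "sentence1": "A man plays a guitar.",
    "sentence2": "A person makes music.",
    "pairID": "example-1",
})]
examples = read_simple_snli(sample_lines)
print(examples[0])
```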
-
-
class
podium.datasets.impl.
SST
(file_path, fields, fine_grained=False, subtrees=False)¶ Bases:
podium.datasets.dataset.Dataset
The Stanford sentiment treebank dataset.
-
NAME
¶ dataset name
- Type
str
-
URL
¶ url to the SST dataset
- Type
str
-
DATASET_DIR
¶ name of the folder in the dataset containing train and test directories
- Type
str
-
ARCHIVE_TYPE
¶ string that defines archive type, used for unpacking dataset
- Type
str
-
TEXT_FIELD_NAME
¶ name of the field containing comment text
- Type
str
-
LABEL_FIELD_NAME
¶ name of the field containing label value
- Type
str
-
POSITIVE_LABEL
¶ positive sentiment label
- Type
int
-
NEGATIVE_LABEL
¶ negative sentiment label
- Type
int
-
static
get_dataset_splits
(fields=None, fine_grained=False, subtrees=False)¶ Method loads and creates dataset splits for the SST dataset.
- Parameters
fields (dict(str, Field), optional) – dictionary mapping field names to fields. If not given, the method uses `get_default_fields`. Users should use the default field names defined in the class attributes.
fine_grained (bool) – if False, returns the binary (positive/negative) SST dataset and filters out neutral examples. If this is False, please set your Fields not to be eager.
subtrees (bool) – also return the subtrees of each input instance as separate instances. This causes the dataset to become much larger.
- Returns
(train_dataset, valid_dataset, test_dataset) – tuple containing train, valid and test dataset
- Return type
(Dataset, Dataset, Dataset)
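The effect of `fine_grained=False` can be sketched as a filter that drops neutral examples and collapses the remaining labels to positive or negative. This is an illustration only: the label strings and the helper `to_binary` are assumptions, and the library's own label values (see POSITIVE_LABEL and NEGATIVE_LABEL) may differ.

```python
NEUTRAL = "neutral"  # assumed label string for neutral examples


def to_binary(examples):
    """Drop neutral examples and collapse labels to positive/negative.

    `examples` is a list of (text, label) pairs whose labels are assumed
    to be one of: very negative, negative, neutral, positive, very positive.
    """
    binary = []
    for text, label in examples:
        if label == NEUTRAL:
            continue  # neutral examples are filtered out
        collapsed = "positive" if label.endswith("positive") else "negative"
        binary.append((text, collapsed))
    return binary


fine_grained = [
    ("A triumph.", "very positive"),
    ("It exists.", "neutral"),
    ("Tedious.", "negative"),
]
binary = to_binary(fine_grained)
print(binary)
```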
-
static
get_default_fields
()¶ Method returns default SST fields: text and label.
- Returns
fields – Dictionary mapping field name to field.
- Return type
dict(str, Field)
-