podium.models.impl.eurovoc_models package

Submodules

podium.models.impl.eurovoc_models.multilabel_svm module

Multilabel SVM classifier for EuroVoc dataset.

Example

    import dill

    model_path = ...  # path where the trained model will be stored
    dataset_path = ...  # path where the dilled instance of the EuroVoc dataset
                        # will be stored to and/or loaded from
    LargeResource.BASE_RESOURCE_DIR = ...  # directory where the raw EuroVoc dataset
                                           # is stored or where it should be downloaded

    # This creates and dills the dataset; it should be done only once.
    # The created instance of the dataset can be reused for training the model.
    dill_dataset(dataset_path)

    p_grid = {"C": [1, 10, 100]}
    train_multilabel_svm(dataset_path=dataset_path, n_jobs=20,
                         param_grid=p_grid, cutoff=2)

    # clf is the trained classifier to persist; it is not defined in this snippet.
    with open(model_path, "wb") as output_file:
        dill.dump(obj=clf, file=output_file)

class podium.models.impl.eurovoc_models.multilabel_svm.MultilabelSVM

Bases: podium.models.model.AbstractSupervisedModel

Multilabel SVM with hyperparameter optimization via grid search using K-fold cross-validation.

Multilabel SVM is implemented as a set of binary SVM classifiers, one for each class in the dataset (one vs. rest).

fit(X, y, parameter_grid, n_splits=3, max_iter=10000, cutoff=1, scoring='f1', n_jobs=1)

Fits the model on given data.

For each class present in y (for each column of the y matrix), a separate SVM model is trained. If there are no positive training instances for some label (the entire column is filled with zeros), no model is trained. Upon calling the predict function, a zero vector is returned for that class. The indexes of the columns containing such labels are stored and can be retrieved using the get_indexes_of_missing_models method.

Parameters
  • X (np.array) – input data

  • y (np.array) – data labels, 2D array (number of examples, number of labels)

  • parameter_grid (dict or list(dict)) – Dictionary with parameter names (strings) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grid spanned by each dictionary in the list is explored. This enables searching over any sequence of parameter settings (see https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). The parameter_grid may contain any of the parameters used to train an instance of the LinearSVC model, most notably the regularization parameter 'C' and the penalty norm 'penalty', which can be set to 'l1' or 'l2' (see https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html).

  • n_splits (int) – Number of splits for K-fold cross-validation

  • max_iter (int) – Maximum number of iterations for training a single SVM within the model.

  • cutoff (int >= 1) – If the number of positive training examples for a class is less than the cutoff, no model is trained for that class and the index of its label is added to the missing model indexes.

  • scoring (string, callable, list/tuple, dict or None) – Indicates what scoring function to use in order to determine the best hyperparameters via grid search. For more details, view https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

  • n_jobs (int) – Number of threads to be used.
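
To make the parameter_grid format concrete, the following minimal sketch expands a dict, or a list of dicts, into the individual parameter settings that a grid search would try. The helper expand_grid is hypothetical and written only for illustration; it is not part of podium, and scikit-learn performs this expansion internally.

```python
from itertools import product


def expand_grid(parameter_grid):
    """Hypothetical helper: list every parameter setting spanned by the grid(s)."""
    # Accept a single dict or a list of dicts, as described for parameter_grid.
    if isinstance(parameter_grid, dict):
        grids = [parameter_grid]
    else:
        grids = list(parameter_grid)

    settings = []
    for grid in grids:
        names = sorted(grid)  # fixed key order keeps the output deterministic
        for values in product(*(grid[name] for name in names)):
            settings.append(dict(zip(names, values)))
    return settings


# A single dict spans one grid.
single = expand_grid({"C": [1, 10, 100]})

# A list of dicts spans several grids that are explored one after another.
combined = expand_grid([
    {"C": [1, 10], "penalty": ["l2"]},
    {"C": [100], "penalty": ["l1", "l2"]},
])

print(len(single), len(combined))
```

With a list of dictionaries, only the combinations inside each dictionary are tried, which avoids evaluating settings that are known to be uninteresting.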

Raises

ValueError – If cutoff is not an integer >= 1. If n_jobs is not a positive integer or -1. If n_splits is not an integer >= 1. If max_iter is not an integer >= 1.
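
The cutoff rule described for fit can be sketched with NumPy alone. The helper find_missing_model_indexes below is hypothetical and only mirrors the documented behavior (a label column with fewer than cutoff positive examples gets no model); it is not podium's implementation.

```python
import numpy as np


def find_missing_model_indexes(y, cutoff=1):
    """Hypothetical helper: columns of y with fewer than `cutoff` positives."""
    if cutoff < 1:
        raise ValueError("cutoff must be an integer >= 1")
    # Count the positive examples in every label column.
    positives_per_label = np.asarray(y).sum(axis=0)
    return [int(i) for i in np.flatnonzero(positives_per_label < cutoff)]


# 4 examples, 3 labels: label 1 has no positives, label 2 has a single positive.
y = np.array([
    [1, 0, 0],
    [1, 0, 1],
    [0, 0, 0],
    [1, 0, 0],
])

missing_default = find_missing_model_indexes(y, cutoff=1)
missing_strict = find_missing_model_indexes(y, cutoff=2)
print(missing_default, missing_strict)
```

With cutoff=1 only the all-zero column is skipped; raising the cutoff to 2 also skips the label that has a single positive example.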

get_indexes_of_missing_models()

Returns the indexes of classes for which the models have not been trained due to the lack of positive training examples.

Returns

result – Indexes of models that were not trained.

Return type

list(int)

Raises

RuntimeError – If the model instance is not fitted.

predict(X)

Predict labels for given data.

If no model has been trained for some class (because there were not enough examples for this label in the train set), a zero column is returned. If one wishes to exclude such labels from the evaluation, their indexes can be retrieved through the get_indexes_of_missing_models method.

Parameters

X (np.array) – input data

Returns

result – Predictions of the model for the given examples.

Return type

2D np.array (number of examples, number of classes)

Raises

RuntimeError – If the model instance is not fitted.
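
As an illustration of excluding untrained labels from evaluation, the sketch below drops the corresponding columns before scoring. The arrays are made-up values; in real use, missing would presumably come from get_indexes_of_missing_models() and y_pred from predict(X). Per-label accuracy is used here only as a simple stand-in metric.

```python
import numpy as np

# Assumed values for illustration: predictions for 3 examples and 4 labels,
# where no model was trained for labels 1 and 3 (their columns are all zeros).
missing = [1, 3]
y_pred = np.array([
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 1, 0],
])
y_true = np.array([
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [1, 0, 0, 0],
])

# Keep only the label columns that have a trained model.
trained = np.setdiff1d(np.arange(y_pred.shape[1]), missing)
y_pred_eval = y_pred[:, trained]
y_true_eval = y_true[:, trained]

# Per-label accuracy over the remaining columns.
accuracy_per_label = (y_pred_eval == y_true_eval).mean(axis=0)
print(trained, accuracy_per_label)
```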

reset(**kwargs)

Resets the model to its initial state so it can be re-trained.

Parameters

kwargs – Additional key-value parameters for model

podium.models.impl.eurovoc_models.multilabel_svm.get_label_matrix(Y)

Takes the target fields returned by the EuroVoc iterator and returns the EuroVoc label matrix.

Parameters

Y (dict) – Target returned by the EuroVoc dataset iterator.

Returns

Matrix of labels for each example (number of examples, number of classes).

Return type

np.array

podium.models.impl.eurovoc_models.multilabel_svm.train_multilabel_svm(dataset_path, param_grid, cutoff, n_outer_splits=5, n_inner_splits=3, n_jobs=1, is_verbose=True, include_classes_with_no_train_examples=False, include_classes_with_no_test_examples=False)

Trains the multilabel SVM model on a given instance of dataset.

Parameters
  • dataset_path (str) – Path to the instance of EuroVoc dataset stored as a dill file.

  • param_grid (dict or list(dict)) – Dictionary with parameter names (strings) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grid spanned by each dictionary in the list is explored. This enables searching over any sequence of parameter settings (see https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).

  • cutoff (int) – If the number of positive training examples for a class is less than the cutoff, no model is trained for that class and the index of its label is added to the missing model indexes.

  • n_outer_splits (int) – Number of splits in the outer loop of the nested cross-validation.

  • n_inner_splits (int) – Number of splits in the inner loop of the nested cross-validation.

  • n_jobs (int) – Number of threads to be used.

  • is_verbose (boolean) – If True, scores on the test set are printed for each fold of the outer loop of the nested cross-validation.

  • include_classes_with_no_train_examples (boolean) – If True, scores of the classes with an insufficient number of training examples (fewer than the specified cutoff) are included when calculating general scores. Note that this makes sense when cutoff=1, because in that case the included classes are exactly those with no training examples.

  • include_classes_with_no_test_examples (boolean) – If True, scores for classes with no positive instances in the test set are included in the general score.
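
One way to read the two include_* flags is as a filter on the label columns that contribute to the general score within a fold. The sketch below encodes that reading; select_scored_classes is a hypothetical helper based only on the parameter descriptions above, not on podium's implementation.

```python
import numpy as np


def select_scored_classes(y_train, y_test, cutoff,
                          include_classes_with_no_train_examples=False,
                          include_classes_with_no_test_examples=False):
    """Hypothetical helper: boolean mask of label columns that are scored."""
    # Labels with at least `cutoff` positive training examples get a model.
    enough_train = np.asarray(y_train).sum(axis=0) >= cutoff
    # Labels with at least one positive instance in the test set.
    has_test = np.asarray(y_test).sum(axis=0) > 0

    keep = np.ones(enough_train.shape[0], dtype=bool)
    if not include_classes_with_no_train_examples:
        keep &= enough_train
    if not include_classes_with_no_test_examples:
        keep &= has_test
    return keep


# Positive examples per label: train -> 2, 0, 1; test -> 1, 1, 0.
y_train = np.array([[1, 0, 1], [1, 0, 0], [0, 0, 0]])
y_test = np.array([[1, 1, 0], [0, 0, 0]])

default_mask = select_scored_classes(y_train, y_test, cutoff=1)
train_included_mask = select_scored_classes(
    y_train, y_test, cutoff=1,
    include_classes_with_no_train_examples=True,
)
print(default_mask, train_included_mask)
```

Under this reading, the defaults score only labels that both have a trained model and occur in the test fold, which is the strictest setting.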

Module contents

Package containing models used for EuroVoc classification.

class podium.models.impl.eurovoc_models.MultilabelSVM

Bases: podium.models.model.AbstractSupervisedModel

Multilabel SVM with hyperparameter optimization via grid search using K-fold cross-validation.

Multilabel SVM is implemented as a set of binary SVM classifiers, one for each class in the dataset (one vs. rest).

fit(X, y, parameter_grid, n_splits=3, max_iter=10000, cutoff=1, scoring='f1', n_jobs=1)

Fits the model on given data.

For each class present in y (for each column of the y matrix), a separate SVM model is trained. If there are no positive training instances for some label (the entire column is filled with zeros), no model is trained. Upon calling the predict function, a zero vector is returned for that class. The indexes of the columns containing such labels are stored and can be retrieved using the get_indexes_of_missing_models method.

Parameters
  • X (np.array) – input data

  • y (np.array) – data labels, 2D array (number of examples, number of labels)

  • parameter_grid (dict or list(dict)) – Dictionary with parameter names (strings) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grid spanned by each dictionary in the list is explored. This enables searching over any sequence of parameter settings (see https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). The parameter_grid may contain any of the parameters used to train an instance of the LinearSVC model, most notably the regularization parameter 'C' and the penalty norm 'penalty', which can be set to 'l1' or 'l2' (see https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html).

  • n_splits (int) – Number of splits for K-fold cross-validation

  • max_iter (int) – Maximum number of iterations for training a single SVM within the model.

  • cutoff (int >= 1) – If the number of positive training examples for a class is less than the cutoff, no model is trained for that class and the index of its label is added to the missing model indexes.

  • scoring (string, callable, list/tuple, dict or None) – Indicates what scoring function to use in order to determine the best hyperparameters via grid search. For more details, view https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

  • n_jobs (int) – Number of threads to be used.

Raises

ValueError – If cutoff is not an integer >= 1. If n_jobs is not a positive integer or -1. If n_splits is not an integer >= 1. If max_iter is not an integer >= 1.

get_indexes_of_missing_models()

Returns the indexes of classes for which the models have not been trained due to the lack of positive training examples.

Returns

result – Indexes of models that were not trained.

Return type

list(int)

Raises

RuntimeError – If the model instance is not fitted.

predict(X)

Predict labels for given data.

If no model has been trained for some class (because there were not enough examples for this label in the train set), a zero column is returned. If one wishes to exclude such labels from the evaluation, their indexes can be retrieved through the get_indexes_of_missing_models method.

Parameters

X (np.array) – input data

Returns

result – Predictions of the model for the given examples.

Return type

2D np.array (number of examples, number of classes)

Raises

RuntimeError – If the model instance is not fitted.

reset(**kwargs)

Resets the model to its initial state so it can be re-trained.

Parameters

kwargs – Additional key-value parameters for model