Scikit-Multilearn Notes

  • In multi-label classification, more than one label/class can be assigned to a given object out of the available n_labels.
  • Limitation for the work problem: a multi-label classifier outputs a set of assigned labels, either as a list of labels or as a binary vector in which a 1 or 0 in the i-th position indicates whether the i-th label is assigned (that is, separate columns of 0/1 indicators). The documentation does not mention outputting percentage proportions of the different labels, which was discussed as important for the work problem.
  • Scikit-multilearn offers a problem transformation approach that converts a multi-label problem into one or more single-label problems (binary or multi-class, where each sample belongs to exactly one class in either case). For this approach the dataset is stored in sparse matrices for efficiency. However, not all scikit-learn classifiers support matrix input and sparse representations. For this reason, every scikit-multilearn classifier that follows a problem transformation approach accepts a require_dense parameter in its constructor. Because these classifiers transform the multi-label problem into a set of single-label problems and solve them with scikit-learn base classifiers, require_dense lets the user decide in which format the transformed input and output spaces are passed to the base classifier. Some scikit-learn classifiers support a sparse representation of X (the features), especially for textual data. To forward this representation to the scikit-learn classifier, pass require_dense=[False, None] to the scikit-multilearn classifier's constructor.
  • Scikit-multilearn has 11 classifiers that provide a strong variety of classification scenarios through label partitioning and ensemble classification.
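
The two output formats mentioned above (a list of assigned labels, or a binary indicator vector) carry the same information. A minimal sketch of the conversion in plain Python follows; the label vocabulary and the function names are made up for illustration and are not part of scikit-multilearn.

```python
# Hypothetical label vocabulary; position i in the binary vector is label i.
LABELS = ["sports", "politics", "tech", "health"]


def to_indicator(assigned, labels=LABELS):
    """Turn a collection of assigned label names into a 0/1 vector."""
    assigned = set(assigned)
    unknown = assigned - set(labels)
    if unknown:
        raise ValueError(f"labels not in vocabulary: {sorted(unknown)}")
    return [1 if name in assigned else 0 for name in labels]


def to_label_list(indicator, labels=LABELS):
    """Turn a 0/1 vector back into the list of assigned label names."""
    if len(indicator) != len(labels):
        raise ValueError("indicator length must equal the number of labels")
    return [name for name, flag in zip(labels, indicator) if flag]


row = to_indicator(["tech", "sports"])
print(row)                 # [1, 0, 1, 0]
print(to_label_list(row))  # ['sports', 'tech']
```

In both formats a label is either on or off, so neither has room for a proportion per label.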

  • BRkNN classifiers (BRkNNaClassifier and BRkNNbClassifier) train a k-nearest-neighbours model per label and infer label assignments in one of two variants. PROS: Takes some label relations into account while estimating single-label classifiers. Works well when the distance between samples is a good predictor for label assignment (large disparity between label populations). CONS: This method trains a classifier per label and is less suitable for large label spaces. It requires parameter estimation. Computationally, this method may not be suitable for our work problem.

  • MLTSVM: The documentation for this method is unclear; it contains grammatical errors and text copied from preceding sections. Further investigation into this technique is required and is beyond the scope of this work.

  • [TO USE] MLkNN builds a k-nearest-neighbours model to find the training examples nearest to a test sample, and uses Bayesian inference to select the assigned labels. PROS: This method estimates one multi-class subclassifier. It works when the distance between samples is a good predictor for label assignment. CONS: It requires parameter estimation.
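
The parameter estimation for MLkNN concerns the neighbourhood size k and a smoothing constant s. The rule below is a summary of the original ML-kNN formulation by Zhang and Zhou, not a description of scikit-multilearn's code, so the symbols are assumptions to check against the library. For a test sample t, let C_t(l) be the number of its k nearest training neighbours that carry label l. The label is assigned by a maximum a posteriori rule:

$$
\hat{y}_t(l) = \arg\max_{b \in \{0,1\}} P(H_b^l)\, P\bigl(E_{C_t(l)}^l \mid H_b^l\bigr),
\qquad
P(H_1^l) = \frac{s + \sum_{i=1}^{m} y_i(l)}{2s + m},
\qquad
P(H_0^l) = 1 - P(H_1^l)
$$

Here H_1^l is the event that the sample has label l, E_j^l is the event that exactly j of its k neighbours have label l, m is the number of training samples, and the likelihoods are smoothed frequency counts over the training set. Normalising the two products gives a posterior probability per label. This may be relevant to the proportion output the work problem needs, though that would need checking against the library.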

  • MLARAM is an ART (Adaptive Resonance Theory) classifier that clusters learned prototypes into larger clusters to improve performance. PROS: It is linear in the number of samples, so it scales well. CONS: It requires parameter estimation. Historically, ART techniques have had generalization limits. This may not be a good model to use for our problem until these limitations are resolved.

  • BinaryRelevance is a classifier that transforms a multi-label classification problem with n labels into n separate single-label binary classification problems, one per label. PROS: This model estimates single-label classifiers and can generalize beyond available label combinations. CONS: It ignores label relations and is not suitable when there are large numbers of labels. Given these trade-offs, this model may not suit the work problem.
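
To make the BinaryRelevance transformation concrete, here is a small plain-Python sketch of how an n_samples x n_labels indicator matrix is split into one binary target per label. The function name and toy data are illustrative only; the library itself works on sparse matrices.

```python
def binary_relevance_targets(Y):
    """Split an n_samples x n_labels 0/1 matrix into one target column per label."""
    n_labels = len(Y[0])
    return [[row[j] for row in Y] for j in range(n_labels)]


# Toy label matrix: 4 samples, 3 labels.
Y = [
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
]

targets = binary_relevance_targets(Y)
for j, column in enumerate(targets):
    # Each column would be the y of one independent binary classifier.
    print(f"label {j}: {column}")
```

Each binary classifier sees only its own column, which is why label relations are ignored. Because predictions are made label by label, the combined output can contain a combination (for example all three labels) that never appeared in training.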

  • ClassifierChain is a model that transforms a multi-label problem into a chain of single-label binary classification problems, one per label, where each classifier in the chain also receives the labels that precede it in the chain as additional input features. PROS: This model estimates single-label classifiers and can generalize beyond available label combinations. It also takes label relations into account. CONS: This method is not suitable for a large number of labels. The model quality strongly depends on the label ordering in the chain. This model may therefore not suit the work problem at hand.
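
The dependence on label ordering is easier to see from how the training sets of a chain are built. The following is a plain-Python sketch of the general classifier-chain idea, with hypothetical names and toy data; it is not scikit-multilearn's implementation.

```python
def chain_training_sets(X, Y, order):
    """Build (label, features, target) triples for each link of a classifier chain.

    Link k predicts label order[k] from the original features plus the
    true values of the labels that come earlier in the chain.
    """
    links = []
    for k, label in enumerate(order):
        earlier = order[:k]
        features = [list(x) + [y[e] for e in earlier] for x, y in zip(X, Y)]
        target = [y[label] for y in Y]
        links.append((label, features, target))
    return links


X = [[0.2, 1.0], [0.9, 0.1], [0.5, 0.5]]
Y = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]

for label, features, target in chain_training_sets(X, Y, order=[2, 0, 1]):
    print(f"label {label}: {len(features[0])} features, target {target}")
```

At prediction time the earlier labels are not known, so each link receives the predictions of the links before it. An early mistake is passed down the chain, which is one reason the ordering matters.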

  • LabelPowerset is a model that transforms a multi-label problem to a multi-class problem, where each label combination is a separate class. PROS: The method estimates label dependencies, with only one classifier. It is often the best solution for subset accuracy if the training data contains all relevant label combinations. CONS: Of note is that this model requires all predictable label combinations to be present in the training data. As such, this method is very prone to underfitting with large label spaces.
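
The requirement that every predictable label combination appears in the training data follows directly from the encoding. A minimal sketch in plain Python, with illustrative names and toy data:

```python
def label_powerset_encode(Y):
    """Map each distinct label combination to an integer class id."""
    combo_to_class = {}
    classes = []
    for row in Y:
        key = tuple(row)
        if key not in combo_to_class:
            combo_to_class[key] = len(combo_to_class)
        classes.append(combo_to_class[key])
    return classes, combo_to_class


Y_train = [
    [1, 0, 1],
    [0, 1, 1],
    [1, 0, 1],
    [0, 0, 1],
]

classes, mapping = label_powerset_encode(Y_train)
print(classes)               # [0, 1, 0, 2]
print((1, 1, 0) in mapping)  # False: this combination has no class id
```

The multi-class classifier can only output class ids it was trained on, so a combination such as (1, 1, 0) that is missing from the training data can never be predicted. The number of classes can also grow quickly as the label space grows, leaving few training samples per class.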

  • RakelD is a model that randomly partitions the label space into disjoint subsets and trains a LabelPowerset classifier per partition with a base multi-class classifier. PROS: The method may use fewer classifiers than BinaryRelevance and still generalize label relations while not underfitting like LabelPowerset. CONS: The partition is drawn at random, so it is unlikely to be an optimal division of the label space. This method requires more research into its shortcomings, and using this model is beyond the scope of this work.

  • RakelO is a model that randomly draws label subspaces (possibly overlapping) and trains a LabelPowerset classifier per subspace with a base multi-class classifier. Here, labels are assigned based on voting. PROS: This approach may provide better results with overlapping models. CONS: This model needs a large number of classifiers to generate an improvement and is not scalable. In addition, random subspaces may not be optimal. As such, using this model is beyond the scope of this work.
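
The difference between the two random approaches lies in how the label subsets are drawn. The sketch below uses plain Python with hypothetical function and parameter names (subset_size, n_models); it illustrates the idea and is not the library's code.

```python
import random


def disjoint_partition(n_labels, subset_size, rng):
    """Shuffle the labels and cut them into non-overlapping groups."""
    labels = list(range(n_labels))
    rng.shuffle(labels)
    return [
        sorted(labels[i:i + subset_size])
        for i in range(0, n_labels, subset_size)
    ]


def overlapping_subsets(n_labels, subset_size, n_models, rng):
    """Draw each group independently, so a label can appear in several groups."""
    return [
        sorted(rng.sample(range(n_labels), subset_size))
        for _ in range(n_models)
    ]


rng = random.Random(0)  # fixed seed so repeated runs give the same draw
print(disjoint_partition(n_labels=6, subset_size=3, rng=rng))
print(overlapping_subsets(n_labels=6, subset_size=3, n_models=4, rng=rng))
```

In the disjoint case every label is handled by exactly one LabelPowerset model. In the overlapping case a label may be covered by several models or, with too few draws, by none, which is why many models and a voting step are needed.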

  • [TO USE] LabelSpacePartitioningClassifier is a method that uses clustering methods to divide the label space into subspaces, and trains a base classifier per partition with a base multi-class classifier. PROS: This method accommodates different problem types. It infers whether or not to divide into subproblems and decides when to use fewer classifiers than BinaryRelevance. The method is scalable to datasets with large numbers of labels. It generalizes label relations well while not underfitting like LabelPowerset. In addition, it doesn't require parameter estimation. CONS: This method requires the label relationships present in the training data to be representative of the problem (so all possible labels need to be captured in the training data, including potential future labels!). In addition, partitioning may prevent certain label combinations from being correctly classified. This method appears to be a key approach to the work problem and is worth testing.
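
One way to picture how the training data drives the partitioning is to count label co-occurrences and group the labels that are linked. The sketch below is plain Python with illustrative names. Grouping by connected components is a deliberately crude stand-in for the community detection a real label space clusterer would run, so it only shows the data dependence.

```python
from itertools import combinations


def cooccurrence_edges(Y):
    """Count how often each pair of labels is assigned to the same sample."""
    edges = {}
    for row in Y:
        present = [j for j, flag in enumerate(row) if flag]
        for a, b in combinations(present, 2):
            edges[(a, b)] = edges.get((a, b), 0) + 1
    return edges


def connected_components(n_labels, edges):
    """Group labels that are linked by at least one co-occurrence."""
    parent = list(range(n_labels))

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = {}
    for label in range(n_labels):
        groups.setdefault(find(label), []).append(label)
    return sorted(groups.values())


Y = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1],
]

edges = cooccurrence_edges(Y)
print(edges)                           # {(0, 1): 2, (2, 3): 2}
print(connected_components(5, edges))  # [[0, 1], [2, 3], [4]]
```

Labels 1 and 2 never co-occur in this toy data, so they land in different groups and their relationship is never modelled. If future data pairs them, the partition learned here no longer represents the problem, which is the caveat noted above.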

  • [TO USE] MajorityVotingClassifier is a method where clustering methods are used to divide the label space into subspaces (possibly overlapping) and a base classifier is trained per partition with a base multi-class classifier. Labels are assigned based on voting. PROS: The method accommodates different problem types. It infers whether or not to divide into subproblems and helps decide when to use fewer classifiers than BinaryRelevance. It is scalable to datasets with large numbers of labels and generalizes label relations well while not underfitting like LabelPowerset. Finally, it doesn't require parameter estimation. CONS: The only downside is that this method requires the label relationships present in the training data to be representative of the problem (so all present and future labels need to be captured!).
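
The voting step over overlapping partitions can be sketched as follows. This is plain Python with hypothetical names. The rule used here, a strict majority of the models covering a label with ties counting as not assigned, is an assumption for illustration and may differ from the library's tie handling.

```python
def majority_vote(n_labels, partitions, partition_predictions):
    """Combine per-partition 0/1 predictions for one sample.

    partitions[i] lists the label indices handled by model i, and
    partition_predictions[i] holds that model's 0/1 output for those labels.
    A label is assigned when more than half of the models covering it vote 1.
    """
    votes = [0] * n_labels
    voters = [0] * n_labels
    for labels, preds in zip(partitions, partition_predictions):
        for label, pred in zip(labels, preds):
            votes[label] += pred
            voters[label] += 1
    return [
        1 if voters[j] and votes[j] * 2 > voters[j] else 0
        for j in range(n_labels)
    ]


partitions = [[0, 1], [1, 2], [2, 3], [0, 3]]
predictions = [[1, 1], [0, 1], [1, 0], [1, 1]]
print(majority_vote(4, partitions, predictions))  # [1, 0, 1, 0]
```

Labels 0 and 2 receive two votes out of two and are assigned. Labels 1 and 3 receive one vote out of two and are not.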

  • EmbeddingClassifier is a model that embeds the label space, trains a regressor (or several) to predict the embeddings of unseen samples, and trains a classifier to correct the regression error. PROS: This model improves discriminability and joint label probability distributions. It also provides good results with low-complexity linear embeddings and weak regressors/classifiers. CONS: The model requires parameter estimation, and further work is needed to assess the rule-of-thumb estimates given in the research papers.

Estimating parameters:

Scikit-multilearn supports parameter estimation through scikit-learn's model selection GridSearchCV API. In the simplest case, it searches for the best parameters of a scikit-multilearn classifier. In the more complicated case of problem transformation methods, it can estimate both the method's hyperparameters and the base classifier's parameters [this last part may be useful for a predict_proba implementation (see below)].
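
The sketch below shows the GridSearchCV pattern for tuning a base classifier's parameter through a wrapper, and what per-label probabilities look like. As an assumption made to keep the example dependent on scikit-learn only, scikit-learn's own OneVsRestClassifier stands in for a scikit-multilearn problem transformation classifier, and the data is synthetic. The parameter key for a scikit-multilearn classifier would differ and should be taken from its documentation.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multi-label data: Y is an (n_samples, n_labels) 0/1 indicator matrix.
X, Y = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=4, random_state=0
)

# One binary logistic regression per label (scikit-learn's one-vs-rest wrapper).
model = OneVsRestClassifier(LogisticRegression(max_iter=1000))

# "estimator__C" reaches through the wrapper to the base classifier's C.
param_grid = {"estimator__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(model, param_grid, scoring="f1_macro", cv=3)
search.fit(X, Y)
print(search.best_params_)

# Per-label probabilities for the first three samples. Each label is scored
# independently, so a row does not have to sum to 1.
proba = search.predict_proba(X[:3])
print(proba.shape)  # (3, 4)
```

The double-underscore key is how the search reaches a parameter of the nested base classifier. Whether a given scikit-multilearn classifier exposes a comparable predict_proba still has to be checked per classifier.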

  • The MajorityVotingClassifier selected for the work problem needs further research into which base classifier works best, and into whether its predict_proba function can be used to generate probability predictions for multiple labels. This last point is essential to the work problem.
  • Further research needs to be conducted into the LabelSpaceClustererBase that partitions the output space (e.g. FixedLabelSpaceClusterer).
  • A potential source of further research can be found here: http://scikit.ml/labelrelations.html.
  • The network-based label space partitioning ensemble classification selected for the work problem likewise needs further research into which base classifier works best, and into whether the predict_proba function can be used to generate probability predictions for multiple labels. This last point is essential to the work problem.