research.active_learning.ALSimulation

class research.active_learning.ALSimulation(classifier=None, generator=None, use_sample_weight=False, init_clusterer=None, init_strategy='random', selection_strategy='entropy', param_grid=None, cv=None, max_iter=None, n_initial=0.02, increment=0.02, save_classifiers=False, save_test_scores=True, auto_load=True, test_size=None, evaluation_metric='accuracy', random_state=None)

Class to simulate Active Learning experiments.

This class implements the Active Learning framework presented in [1]. The initialization strategy is still a work in progress (WIP).

Parameters
classifier : classifier object, default=None

Classifier to be used as Chooser and Predictor, or a pipeline containing both the generator and the classifier.

generator : generator estimator, default=None

Generator to be used for artificial data generation within Active Learning iterations.

use_sample_weight : bool, default=False

Pass sample_weights as a fit parameter to the generator object. Used to generate artificial data around samples with high uncertainty. sample_weights is an array-like of shape (n_samples,) containing the probabilities (based on uncertainty) for selecting a sample as a center point.
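As a rough illustration of the idea, uncertainty scores can be normalized into selection probabilities and then used to draw center points. This is a minimal sketch under assumptions: the helper name `uncertainty_to_weights`, the uniform fallback, and the use of `random.choices` are illustrative and are not taken from the library.

```python
import random


def uncertainty_to_weights(uncertainty):
    """Normalize non-negative uncertainty scores so they sum to 1.

    Hypothetical helper: the library's own normalization is not documented here.
    """
    total = sum(uncertainty)
    if total == 0:
        # Assumption: fall back to uniform weights when every score is zero.
        return [1.0 / len(uncertainty)] * len(uncertainty)
    return [u / total for u in uncertainty]


# One uncertainty score per labeled sample, shape (n_samples,).
uncertainty = [0.05, 0.60, 0.10, 0.25]
sample_weights = uncertainty_to_weights(uncertainty)

# Draw five center points; samples with higher uncertainty are picked more often.
rng = random.Random(42)
centers = rng.choices(range(len(uncertainty)), weights=sample_weights, k=5)
```

A generator that accepts such weights would then synthesize new observations around the drawn centers.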

init_clusterer : clusterer estimator, default=None

WIP

init_strategy : WIP, default='random'

WIP

selection_strategy : function or {'entropy', 'breaking_ties', 'random'}, default='entropy'

Method used to quantify the chooser’s uncertainty level. All predefined functions are set up so that a higher value means higher uncertainty (higher likelihood of selection) and vice-versa. The uncertainty estimate is used to select the instances to be added to the labeled/training dataset. Selection strategies may be added or changed in the UNCERTAINTY_FUNCTIONS dictionary.
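The three predefined strategies could be sketched as follows, each taking one row of class probabilities and returning a score where higher means more uncertain. This is an assumption-laden sketch: the function bodies, the `UNCERTAINTY_SKETCH` dictionary, and `select_most_uncertain` are hypothetical and may differ from what the library stores in UNCERTAINTY_FUNCTIONS.

```python
import math
import random


def entropy(probabilities):
    """Shannon entropy of one probability row; maximal for a uniform row."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def breaking_ties(probabilities):
    """One minus the margin between the two most likely classes."""
    top, second = sorted(probabilities, reverse=True)[:2]
    return 1.0 - (top - second)


def random_score(probabilities, rng=random):
    """Ignore the probabilities and return a random score (random sampling)."""
    return rng.random()


# Hypothetical stand-in for the library's UNCERTAINTY_FUNCTIONS dictionary.
UNCERTAINTY_SKETCH = {
    "entropy": entropy,
    "breaking_ties": breaking_ties,
    "random": random_score,
}


def select_most_uncertain(probas, strategy="entropy", n=1):
    """Return the indices of the n rows with the highest uncertainty."""
    score = strategy if callable(strategy) else UNCERTAINTY_SKETCH[strategy]
    ranked = sorted(range(len(probas)), key=lambda i: score(probas[i]), reverse=True)
    return ranked[:n]


probas = [
    [0.90, 0.05, 0.05],  # confident prediction
    [0.40, 0.35, 0.25],  # fairly uncertain
    [0.34, 0.33, 0.33],  # close to uniform
]
selected = select_most_uncertain(probas, strategy="entropy", n=2)
```

Passing a callable instead of a string, as the parameter description allows, works the same way in this sketch because the callable is applied to each probability row.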

param_grid : dict or list of dictionaries, default=None

Dictionary with parameter names (str) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.
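To make the two accepted shapes concrete, the sketch below expands a single dictionary and a list of dictionaries into the individual settings they span. The helper `expand_grid` and the parameter names (`classifier__C`, `classifier__kernel`, `classifier__gamma`) are hypothetical; they only illustrate the grid semantics described above.

```python
from itertools import product


def expand_grid(param_grid):
    """Yield every parameter combination spanned by a dict or a list of dicts."""
    grids = [param_grid] if isinstance(param_grid, dict) else list(param_grid)
    for grid in grids:
        names = sorted(grid)
        # Cartesian product of the candidate values within one dictionary.
        for values in product(*(grid[name] for name in names)):
            yield dict(zip(names, values))


# A single dictionary: the full 2 x 2 grid is explored.
single_grid = {
    "classifier__C": [0.1, 1.0],
    "classifier__kernel": ["linear", "rbf"],
}

# A list of dictionaries: each grid is expanded separately, then concatenated.
multi_grid = [
    {"classifier__kernel": ["linear"], "classifier__C": [0.1, 1.0]},
    {"classifier__kernel": ["rbf"], "classifier__C": [1.0], "classifier__gamma": [0.01, 0.1]},
]

single_settings = list(expand_grid(single_grid))
multi_settings = list(expand_grid(multi_grid))
```

The list form is useful when some parameters only make sense together, as with the hypothetical `gamma` that is searched only for the `rbf` kernel here.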

cv : int, cross-validation generator or an iterable, default=None

Determines the cross-validation splitting strategy. Possible inputs for cv are:

  • None, to use the default 5-fold cross validation,

  • integer, to specify the number of folds in a (Stratified)KFold,

  • CV splitter.

For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False so the splits will be the same across calls.

max_iter : int, default=None

Maximum number of iterations allowed. If None, the experiment will run until 100% of the dataset is added to the training set.

n_initial : int or float, default=0.02

Number of observations to include in the initial training dataset. If n_initial < 1, the value is interpreted as a proportion of the original dataset to use as the initial training set.

increment : int or float, default=0.02

Number of observations to be added to the training dataset at each iteration. If increment < 1, the value is interpreted as a proportion of the original dataset to add at each iteration.
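The interplay between n_initial, increment, and max_iter can be sketched by computing the training set size at each iteration. The rounding rule, the helper names, and the reading of max_iter as a cap on the number of fitted training sets are assumptions made for illustration, not documented behavior.

```python
def resolve_count(value, n_samples):
    """Turn a proportion (< 1) or an absolute count (>= 1) into a sample count."""
    if value < 1:
        # Assumption: round to the nearest integer and keep at least one sample.
        return max(1, round(value * n_samples))
    return int(value)


def training_set_sizes(n_samples, n_initial=0.02, increment=0.02, max_iter=None):
    """List the training set size at each iteration of the simulation."""
    size = resolve_count(n_initial, n_samples)
    step = resolve_count(increment, n_samples)
    sizes = [size]
    # Stop when the whole dataset is labeled or the iteration cap is reached.
    while size < n_samples and (max_iter is None or len(sizes) < max_iter):
        size = min(size + step, n_samples)
        sizes.append(size)
    return sizes


full_run = training_set_sizes(1000)               # defaults: 2% initial, 2% per step
capped_run = training_set_sizes(1000, max_iter=3)
absolute_run = training_set_sizes(100, n_initial=10, increment=40)
```

With the default proportions and 1,000 observations, this sketch labels 20 observations initially and 20 more per iteration, so an uncapped run takes 50 iterations to reach the full dataset.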

save_classifiers : bool, default=False

Save classifiers fit at each iteration. These classifiers are stored in a list self.classifiers_.

save_test_scores : bool, default=True

If True, test scores are saved in the list self.test_scores_. Size of the test set is defined with the test_size parameter.

auto_load : bool, default=True

If True, the classifier with the best training score is saved in the attribute self.classifier_. It is the classifier object used in the predict method.

test_size : float or int, default=None

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to 0.25.
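The three cases can be sketched as a small resolver. The helper name `resolve_test_size` and the ceiling-based rounding (which follows the convention of scikit-learn's `train_test_split`) are assumptions; the source does not state how this class rounds fractional sizes.

```python
import math


def resolve_test_size(test_size, n_samples):
    """Return the number of test samples implied by test_size."""
    if test_size is None:
        test_size = 0.25  # documented default when None is passed
    if isinstance(test_size, float):
        if not 0.0 < test_size < 1.0:
            raise ValueError("a float test_size must lie strictly between 0.0 and 1.0")
        # Assumption: round up, as scikit-learn's train_test_split does.
        return math.ceil(test_size * n_samples)
    return int(test_size)


default_count = resolve_test_size(None, 1000)
fraction_count = resolve_test_size(0.1, 55)
absolute_count = resolve_test_size(30, 100)
```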

evaluation_metric : str, default='accuracy'

Metric used to calculate the test scores. See research.metrics for info on available performance metrics.

random_state : int, RandomState instance or None, default=None

Control the randomization of the algorithm.

  • If int, random_state is the seed used by the random number generator;

  • If RandomState instance, random_state is the random number generator;

  • If None, the random number generator is the RandomState instance used by np.random.

References

[1] Fonseca, J., Douzas, G., Bacao, F. (2021). Increasing the Effectiveness of Active Learning: Introducing Artificial Data Generation in Active Learning for Land Use/Land Cover Classification. Remote Sensing, 13(13), 2619. https://doi.org/10.3390/rs13132619

Methods

fit(X, y)

Run an Active Learning procedure from training set (X, y).

get_params([deep])

Get parameters for this estimator.

load_best_classifier(X, y)

Load the best classifier in the self.classifiers_ list.

predict(X)

Predict class or regression value for X.

score(X, y[, sample_weight])

Return the mean accuracy on the given test data and labels.

set_params(**params)

Set the parameters of this estimator.


fit(X, y)

Run an Active Learning procedure from training set (X, y).

Parameters
X : {array-like, sparse matrix} of shape (n_samples, n_features)

The training input samples.

y : array-like of shape (n_samples,) or (n_samples, n_outputs)

The target values (class labels) as integers or strings.

Returns
self : ALSimulation

Completed Active Learning procedure
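For orientation, a generic pool-based Active Learning loop consistent with the parameters of this class can be sketched as follows. The toy centroid classifier, every helper name, and the omission of the artificial data generation step are assumptions for illustration; this is not the implementation behind fit.

```python
import math
import random


def fit_centroids(X, y):
    """Toy stand-in classifier: one mean per class for 1-D features."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict_proba(centroids, x):
    """Class probabilities from normalized exp(-distance) to each centroid."""
    scores = [math.exp(-abs(x - centroids[label])) for label in sorted(centroids)]
    total = sum(scores)
    return [s / total for s in scores]


def entropy(probabilities):
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def active_learning_loop(X, y, n_initial, increment, max_iter, seed=0):
    """Grow the labeled set by querying the most uncertain observations."""
    rng = random.Random(seed)
    unlabeled = list(range(len(X)))
    rng.shuffle(unlabeled)  # random initialization strategy
    labeled = [unlabeled.pop() for _ in range(n_initial)]
    model, history = None, []
    for _ in range(max_iter):
        # Fit the chooser on everything labeled so far.
        model = fit_centroids([X[i] for i in labeled], [y[i] for i in labeled])
        history.append(len(labeled))
        if not unlabeled:
            break  # the whole dataset has been labeled
        # Rank the unlabeled pool by uncertainty and move the top ones over.
        unlabeled.sort(key=lambda i: entropy(predict_proba(model, X[i])), reverse=True)
        labeled.extend(unlabeled[:increment])
        unlabeled = unlabeled[increment:]
    return model, history


# Two well-separated 1-D classes of ten observations each.
X = [0.1 * i for i in range(10)] + [5 + 0.1 * i for i in range(10)]
y = [0] * 10 + [1] * 10
model, history = active_learning_loop(X, y, n_initial=4, increment=4, max_iter=10)
```

In this sketch `history` records the training set size at each fit, which is the quantity a learning curve of test scores would be plotted against.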

get_params(deep=True)

Get parameters for this estimator.

Parameters
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
params : dict

Parameter names mapped to their values.

load_best_classifier(X, y)

Load the best classifier in the self.classifiers_ list.

The best classifier is used in the predict method according to the performance metric passed.

Parameters
X : {array-like, sparse matrix} of shape (n_samples, n_features)

The test input samples.

y : array-like of shape (n_samples,) or (n_samples, n_outputs)

The target values (class labels) as integers or strings.

Returns
self : ALSimulation

Completed Active Learning procedure
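Conceptually, picking the best classifier amounts to an argmax over per-iteration scores. The sketch below assumes that higher scores are better and that ties go to the earliest iteration; both are assumptions, and `best_classifier_index` is a hypothetical helper rather than part of the library.

```python
def best_classifier_index(scores):
    """Return the index of the highest score; ties go to the earliest index."""
    return max(range(len(scores)), key=scores.__getitem__)


# One score per saved classifier, in iteration order (illustrative values).
scores = [0.71, 0.84, 0.80, 0.84]
best_index = best_classifier_index(scores)
```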

predict(X)

Predict class or regression value for X.

For a classification model, the predicted class for each sample in X is returned. For a regression model, the predicted value based on X is returned.

Parameters
X : {array-like, sparse matrix} of shape (n_samples, n_features)

The test input samples.

Returns
y : array-like of shape (n_samples,) or (n_samples, n_outputs)

The predicted classes, or the predicted values.

score(X, y, sample_weight=None)

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters
X : array-like of shape (n_samples, n_features)

Test samples.

y : array-like of shape (n_samples,) or (n_samples, n_outputs)

True labels for X.

sample_weight : array-like of shape (n_samples,), default=None

Sample weights.

Returns
score : float

Mean accuracy of self.predict(X) with respect to y.
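The difference between subset accuracy and a per-label average can be seen in a small pure-Python sketch. The function names are illustrative and sample weights are ignored for brevity.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose prediction matches exactly.

    With multi-label rows this is the subset accuracy: a row counts only
    if every label in it is correct.
    """
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def per_label_accuracy(y_true, y_pred):
    """Fraction of individual labels that match, shown for contrast."""
    pairs = [(t, p) for row_t, row_p in zip(y_true, y_pred) for t, p in zip(row_t, row_p)]
    return sum(1 for t, p in pairs if t == p) / len(pairs)


# Single-label case: three of four predictions are correct.
single = accuracy(["a", "b", "a", "c"], ["a", "b", "c", "c"])

# Multi-label case: one wrong label invalidates the whole second row.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
subset = accuracy(y_true, y_pred)
label_wise = per_label_accuracy(y_true, y_pred)
```

Here eight of nine individual labels are correct, yet only two of three rows match exactly, which is why subset accuracy is described as a harsh metric.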

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters
**params : dict

Estimator parameters.

Returns
self : estimator instance

Estimator instance.
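The `<component>__<parameter>` naming can be illustrated by splitting keys on the first double underscore, which is how nested parameters are routed to their component. The helper `split_nested_params` and the parameter names used below (`classifier__C`, `generator__k_neighbors`) are hypothetical examples, not parameters confirmed by this page.

```python
def split_nested_params(params):
    """Separate top-level parameters from <component>__<parameter> ones."""
    own, nested = {}, {}
    for key, value in params.items():
        name, delimiter, sub_key = key.partition("__")
        if delimiter:
            # Route the remainder of the key to the named component.
            nested.setdefault(name, {})[sub_key] = value
        else:
            own[key] = value
    return own, nested


own, nested = split_nested_params({
    "max_iter": 10,
    "classifier__C": 1.0,
    "classifier__kernel": "rbf",
    "generator__k_neighbors": 5,
})
```

Because only the first double underscore is split, a deeper key such as a hypothetical `classifier__step__alpha` would be passed to the `classifier` component as `step__alpha` and resolved there in turn.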