`mlresearch.active_learning`.AugmentationAL

class mlresearch.active_learning.AugmentationAL(classifier: BaseEstimator | ClassifierMixin = None, generator: BaseOverSampler = None, param_grid: dict = None, cv=None, acquisition_func=None, n_init: int | float = None, budget: int | float = None, max_iter: int = None, evaluation_metric=None, continue_training: bool = False, random_state: int = None)

Active Learning with pipelined Data Augmentation. This method is implemented and analysed in a working paper.

Parameters:

classifier : classifier object, default=None

Classifier or pipeline to be trained in the iterative process. If None, defaults to sklearn’s RandomForestClassifier with default parameters and uses the random_state passed in the Active Learning model.

generator : generator estimator, default=None

Generator to be used for artificial data generation within Active Learning iterations.

param_grid : dict or list of dictionaries, default=None

Used to optimize the classifier and generator hyperparameters at each iteration via cross-validated grid search. If None, parameter tuning is skipped. Otherwise, either a dictionary with parameter names (str) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.
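To make the two accepted shapes of param_grid concrete, the sketch below expands a dictionary, or a list of dictionaries, into the individual candidate settings using only the standard library. The helper name expand_grid and the parameter names in the example grids are hypothetical and chosen for illustration; the library's own grid search may enumerate candidates differently.

```python
from itertools import product


def expand_grid(param_grid):
    """Yield every parameter combination spanned by one grid or a list of grids."""
    # A single dictionary is treated as a list containing one grid.
    grids = [param_grid] if isinstance(param_grid, dict) else list(param_grid)
    for grid in grids:
        names = sorted(grid)
        # Cartesian product of the value lists within this grid only.
        for values in product(*(grid[name] for name in names)):
            yield dict(zip(names, values))


# Hypothetical parameter names, for illustration only.
single_grid = {
    "classifier__n_estimators": [50, 100],
    "generator__k_neighbors": [3, 5],
}
grid_list = [
    {"classifier__n_estimators": [50, 100]},
    {"generator__k_neighbors": [3, 5, 7]},
]

single_candidates = list(expand_grid(single_grid))
list_candidates = list(expand_grid(grid_list))
print(len(single_candidates), len(list_candidates))
```

A single dictionary spans the full cross product of its value lists, while a list of dictionaries explores each grid separately, so settings from different dictionaries are never combined.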

cv : int, cross-validation generator or an iterable, default=None

Determines the cross-validation splitting strategy. Used to optimize the classifier and generator hyperparameters at each iteration. Possible inputs for cv are:

None, to use the default 5-fold cross validation,
integer, to specify the number of folds in a (Stratified)KFold,
CV splitter.

For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False so the splits will be the same across calls.

acquisition_func : function or {‘entropy’, ‘breaking_ties’, ‘random’}, default=None

Method used to quantify the prediction’s uncertainty level. All predefined functions are set up so that a higher value means higher uncertainty (higher likelihood of selection) and vice-versa. The uncertainty estimate is used to select the instances to be added to the labeled/training dataset. Acquisition functions may be added or changed in the UNCERTAINTY_FUNCTIONS dictionary. If None, defaults to “random”.
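The "higher value means higher uncertainty" convention can be illustrated with a minimal sketch of two common uncertainty scores computed from one predicted class distribution. These are textbook formulations written for illustration, not the library's implementation; the function names and the choice to express breaking ties as one minus the margin are assumptions.

```python
import math


def entropy_score(probabilities):
    """Shannon entropy of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def breaking_ties_score(probabilities):
    """One minus the gap between the two largest class probabilities."""
    top, runner_up = sorted(probabilities, reverse=True)[:2]
    # A small gap means the top two classes are nearly tied, so the
    # score is flipped to make "more uncertain" the larger value.
    return 1.0 - (top - runner_up)


confident = [0.9, 0.05, 0.05]
uncertain = [0.4, 0.35, 0.25]

for name, probs in [("confident", confident), ("uncertain", uncertain)]:
    print(name, entropy_score(probs), breaking_ties_score(probs))
```

Under both scores the near-tied distribution ranks above the confident one, so it would be selected first for labeling.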

n_init : int or float, default=None

Number of observations to include in the initial training dataset. If n_init < 1, it is interpreted as a fraction of the original dataset, and that share of observations is used as the initial training set. If None, defaults to 2% of the size of the original dataset.

budget : int or float, default=None

Number of observations to be added to the training dataset at each iteration. If budget < 1, it is interpreted as a fraction of the original dataset, and that share of observations is added to the training set at each iteration. If None, defaults to 2% of the size of the original dataset.

max_iter : int, default=None

Maximum number of iterations allowed. If None, the experiment will run until 100% of the dataset is added to the training set.
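The interaction between n_init, budget and max_iter can be worked through with a small sketch. The helpers resolve_size and iterations_to_exhaust are hypothetical, and rounding a fraction to the nearest integer is an assumption; the library may round differently.

```python
import math


def resolve_size(value, n_samples, default_fraction=0.02):
    """Turn an int, float or None size into an absolute number of observations."""
    if value is None:
        value = default_fraction
    if value < 1:
        # Values below 1 are read as a fraction of the full dataset.
        # Rounding to the nearest integer is an assumption of this sketch.
        return max(1, round(value * n_samples))
    return int(value)


def iterations_to_exhaust(n_samples, n_init, budget):
    """Iterations needed until every observation is in the training set."""
    remaining = n_samples - n_init
    return math.ceil(remaining / budget)


n_samples = 1000
n_init = resolve_size(None, n_samples)  # default: 2% of the dataset
budget = resolve_size(0.05, n_samples)  # 5% of the dataset per iteration
n_iterations = iterations_to_exhaust(n_samples, n_init, budget)
print(n_init, budget, n_iterations)
```

With 1000 observations, the default n_init gives 20 initial observations, and a budget of 0.05 adds 50 per iteration, so leaving max_iter as None would take 20 iterations to label the whole dataset.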

evaluation_metric : string, default=’accuracy’

Metric used to calculate the test scores. See mlresearch.metrics for info on available performance metrics.

continue_training : bool, default=False

If False, fit a new classifier at each iteration. If True, the classifier fitted in the previous iteration is used for further training in subsequent iterations.

random_state : int, RandomState instance, default=None

Control the randomization of the algorithm.

If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used by np.random.

Attributes:

acquisition_func_ : function: Method used to calculate the classification uncertainty at each iteration.
evaluation_metric_ : scorer: Metric used to estimate the performance of the AL classifier at each iteration.
classifier_ : estimator object: The classifier used in the iterative process. It is the classifier fitted in the last iteration.
metadata_ : dict: Contains the performance estimations, classifiers, labeled pool mask and original dataset.
n_init_ : int: Number of observations included in the initial training dataset.
budget_ : int: Number of observations to be added to the training set per iteration.
max_iter_ : int: Maximum number of iterations allowed.
labeled_pool_ : array-like of shape (n_samples,): Mask that filters the labeled observations from the original dataset.
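To illustrate how a labeled pool mask and per-observation uncertainty scores could combine in one selection step, here is a minimal standard-library sketch. The function select_and_label is hypothetical and is not the library's code; it only shows the general idea of labeling the budget most uncertain observations that are still unlabeled.

```python
def select_and_label(uncertainty, labeled_pool, budget):
    """Return a new mask with the `budget` most uncertain unlabeled rows labeled."""
    unlabeled = [i for i, is_labeled in enumerate(labeled_pool) if not is_labeled]
    # Rank unlabeled observations from most to least uncertain.
    ranked = sorted(unlabeled, key=lambda i: uncertainty[i], reverse=True)
    chosen = set(ranked[:budget])
    return [is_labeled or i in chosen for i, is_labeled in enumerate(labeled_pool)]


labeled_pool = [True, False, False, False, True, False]
uncertainty = [0.1, 0.9, 0.2, 0.7, 0.8, 0.3]
updated_pool = select_and_label(uncertainty, labeled_pool, budget=2)
print(updated_pool)
```

Observations that are already labeled are never re-selected, even when their uncertainty is high, which is why index 4 does not consume part of the budget in this example.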

fit(X, y, **kwargs)

Fit an Active Learning model from training set (X, y).

Parameters:

X : {array-like, sparse matrix} of shape (n_samples, n_features): The training input samples.
y : array-like of shape (n_samples,) or (n_samples, n_outputs): The target values (class labels) as integers or strings.

Returns:

self : Active Learning Classifier: Fitted Active Learning model.

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing : MetadataRequest: A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep : bool, default=True: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params : dict: Parameter names mapped to their values.

initialization(X, y, initial_selection=None, **kwargs)

iteration(X, y, **kwargs)

predict(X)

Predict class or regression value for X.

For a classification model, the predicted class for each sample in X is returned. For a regression model, the predicted value based on X is returned.

Parameters:

X : {array-like, sparse matrix} of shape (n_samples, n_features): The test input samples.

Returns:

y : array-like of shape (n_samples,) or (n_samples, n_outputs): The predicted classes, or the predicted values.

score(X, y, sample_weight=None)

Return accuracy on provided data and labels.

In multi-label classification, this is the subset accuracy, which is a harsh metric because it requires the entire label set of each sample to be predicted correctly.

Parameters:

X : array-like of shape (n_samples, n_features): Test samples.
y : array-like of shape (n_samples,) or (n_samples, n_outputs): True labels for X.
sample_weight : array-like of shape (n_samples,), default=None: Sample weights.

Returns:

score : float: Mean accuracy of self.predict(X) w.r.t. y.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params : dict: Estimator parameters.

Returns:

self : estimator instance: Estimator instance.
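The <component>__<parameter> naming convention used by set_params can be illustrated with a small sketch that separates top-level parameters from nested ones. The helper split_nested_params is hypothetical and only mirrors the naming convention; it is not how scikit-learn implements set_params. The nested names n_estimators and max_depth assume the classifier is a random forest.

```python
def split_nested_params(params):
    """Group `<component>__<parameter>` keys by component name."""
    own, nested = {}, {}
    for key, value in params.items():
        # Split on the first double underscore only, so deeper nesting
        # such as "a__b__c" stays intact in the sub-key ("b__c").
        component, delimiter, sub_key = key.partition("__")
        if delimiter:
            nested.setdefault(component, {})[sub_key] = value
        else:
            own[key] = value
    return own, nested


own, nested = split_nested_params(
    {
        "max_iter": 10,
        "classifier__n_estimators": 200,
        "classifier__max_depth": 5,
    }
)
print(own)
print(nested)
```

Here max_iter would be set on the Active Learning model itself, while the two classifier__ entries would be forwarded to the object held in the classifier parameter.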

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → AugmentationAL

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED: Metadata routing for sample_weight parameter in score.

Returns:

self : object: The updated object.
