.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "auto_examples/plot_introduction_active_learning.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note :ref:`Go to the end ` to download the full example code. .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_examples_plot_introduction_active_learning.py: Introduction to Active Learning =============================== This example explains what Active Learning (AL) is and how ml-research can be used to run AL simulations. We will focus exclusively on the ``StandardAL`` object, which is the typical/classical process used in AL. Let's start by setting up our environment. .. GENERATED FROM PYTHON SOURCE LINES 13-64 .. code-block:: Python import numpy as np from sklearn.datasets import make_classification from sklearn.pipeline import make_pipeline from sklearn.preprocessing import MinMaxScaler from sklearn.neural_network import MLPClassifier from sklearn.inspection import DecisionBoundaryDisplay from sklearn.metrics import f1_score from sklearn.model_selection import train_test_split import matplotlib.pyplot as plt from mlresearch.active_learning import StandardAL # This will apply a nice formatting style to our plots from mlresearch.utils import set_matplotlib_style, feature_to_color set_matplotlib_style(font_size=16, use_latex=False) # Set up some environment variables RNG_SEED = 42 labels = np.array([-1, 0, 1]) color_labels = { label: color for label, color in zip(labels, feature_to_color(labels + 1, cmap="Accent")) } # Define some helper functions def plot_data(X, y, classifier=None, ax=None): if ax is None: fig, ax_ = plt.subplots() else: ax_ = ax if classifier is not None: DecisionBoundaryDisplay.from_estimator(classifier, X=X, alpha=0.2, ax=ax_) for label in labels: mask = y == label ax_.scatter( X[mask, 0], X[mask, 1], c=color_labels[label], label=label, alpha=0.5 ) if ax is None: plt.legend() plt.show() else: return ax_ .. GENERATED FROM PYTHON SOURCE LINES 65-68 We can now generate a simple mock dataset with 2 features and 2 target classes. Our goal is to produce a high-performing classifier that will be able to distinguish the 2 classes: .. GENERATED FROM PYTHON SOURCE LINES 68-74 .. code-block:: Python X, y = make_classification( n_samples=500, n_features=2, n_informative=2, n_redundant=0, random_state=RNG_SEED ) plot_data(X, y) .. image-sg:: /auto_examples/images/sphx_glr_plot_introduction_active_learning_001.png :alt: plot introduction active learning :srcset: /auto_examples/images/sphx_glr_plot_introduction_active_learning_001.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 75-92 What is Active Learning? ------------------------ Now that we have our problem set up, we can discuss what AL actually is. AL is most commonly used when we have a large pool of unlabeled data. In such cases, if we want to train a classifier, we will need to label/annotate some of these data points in order to produce a training dataset. However, randomly selecting data points to form this training dataset is very inefficient; the annotation process can be time consuming or expensive (and sometimes both!). AL allows this process to be much more efficient, since it attempts to retrieve the most informative data points to the learning process. **The goal of AL is to find the smallest possible data subset that will allow a classifier to achieve the best possible performance.** Let's apply this description to our dataset by assuming it is unlabeled at an initial state: .. GENERATED FROM PYTHON SOURCE LINES 92-97 .. code-block:: Python y_known = np.zeros(y.shape) - 1 plot_data(X, y_known) .. image-sg:: /auto_examples/images/sphx_glr_plot_introduction_active_learning_002.png :alt: plot introduction active learning :srcset: /auto_examples/images/sphx_glr_plot_introduction_active_learning_002.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 98-104 Here, the label `-1` means the label of a given point is unknown. Since there is no information this stage, we currently have no way to select points for annotation in an informed way. Let's begin by selecting two random points (one from each class) and use those to train a classifier. To do this, we can use the `StandardAL` class. .. GENERATED FROM PYTHON SOURCE LINES 104-122 .. code-block:: Python classifier = make_pipeline( MinMaxScaler(), MLPClassifier((20, 20), max_iter=3000, random_state=RNG_SEED) ) al = StandardAL( classifier=classifier, acquisition_func="breaking_ties", n_init=2, budget=4, random_state=RNG_SEED, ) al.initialization(X, y) y_known[al.labeled_pool_] = y[al.labeled_pool_] # At this point, we only collected 2 labeled points, no classifier has been trained yet plot_data(X, y_known) .. image-sg:: /auto_examples/images/sphx_glr_plot_introduction_active_learning_003.png :alt: plot introduction active learning :srcset: /auto_examples/images/sphx_glr_plot_introduction_active_learning_003.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 123-134 The acquisition function is defined to ensure the following annotation stages are done based on `breaking ties`, which will quantify the uncertainty of the classifier when predicting the labels for the unlabeled points. The unlabeled points with the highest uncertainty are the ones we expect to be the most valuable to annotate and include in the training dataset. This is the core concept of AL. Usually, the points selected for annotation are the ones closest to the decision boundary. This is because the classifier is most uncertain about these points, and they are therefore the most informative. Let's select 4 additional points for annotation: .. GENERATED FROM PYTHON SOURCE LINES 134-139 .. code-block:: Python al.iteration(X, y) y_known[al.labeled_pool_] = y[al.labeled_pool_] plot_data(X, y_known, al.classifier_) .. image-sg:: /auto_examples/images/sphx_glr_plot_introduction_active_learning_004.png :alt: plot introduction active learning :srcset: /auto_examples/images/sphx_glr_plot_introduction_active_learning_004.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 140-142 Repeat this process until the model reaches a satisfactory performance level. Let's see how the classifier evolves as we annotate more points: .. GENERATED FROM PYTHON SOURCE LINES 142-153 .. code-block:: Python fig, axes = plt.subplots(3, 3, figsize=(15, 10)) for i in range(18): al.iteration(X, y) y_known[al.labeled_pool_] = y[al.labeled_pool_] if i % 2 == 0: ax = axes.flatten()[i // 2] plot_data(X, y_known, al.classifier_, ax=ax) plt.show() .. image-sg:: /auto_examples/images/sphx_glr_plot_introduction_active_learning_005.png :alt: plot introduction active learning :srcset: /auto_examples/images/sphx_glr_plot_introduction_active_learning_005.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 154-157 As we can see, the classifier is able to learn the decision boundary more accurately as we annotate more points. At this point, our model would classify the remaining points as follows: .. GENERATED FROM PYTHON SOURCE LINES 157-162 .. code-block:: Python y_pred = al.classifier_.predict(X) print("F1 Score:", f1_score(y, y_pred, average="weighted")) plot_data(X, y_pred, al.classifier_) .. image-sg:: /auto_examples/images/sphx_glr_plot_introduction_active_learning_006.png :alt: plot introduction active learning :srcset: /auto_examples/images/sphx_glr_plot_introduction_active_learning_006.png :class: sphx-glr-single-img .. rst-class:: sphx-glr-script-out .. code-block:: none F1 Score: 0.9299599329146503 .. GENERATED FROM PYTHON SOURCE LINES 163-167 We can simplify the process of running AL experiments by using the `fit_predict` method, which will run the entire process for us. We will run an experiment with more points being labeled per iteration, until all points are labeled, while keeping track of the test score at each iteration: .. GENERATED FROM PYTHON SOURCE LINES 167-190 .. code-block:: Python X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=RNG_SEED ) classifier = make_pipeline( MinMaxScaler(), MLPClassifier((20, 20), max_iter=3000, random_state=RNG_SEED) ) al = StandardAL( classifier=classifier, acquisition_func="breaking_ties", n_init=10, budget=10, evaluation_metric="f1_weighted", random_state=RNG_SEED, ) al.fit(X, y, X_test=X_test, y_test=y_test) test_scores = [al.metadata_[i]["test_score"] for i in range(1, al.max_iter_)] plt.plot(test_scores) plt.show() .. image-sg:: /auto_examples/images/sphx_glr_plot_introduction_active_learning_007.png :alt: plot introduction active learning :srcset: /auto_examples/images/sphx_glr_plot_introduction_active_learning_007.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 191-197 As we can see, the performance of the classifier improves as we annotate more points, up to a certain point. After that, the performance stabilizes or even decreases, since the classifier is overfitting to the training data. We can visualize the decision boundaries of the classifier at different iterations to show this effect: .. GENERATED FROM PYTHON SOURCE LINES 197-208 .. code-block:: Python fig, axes = plt.subplots(1, 4, figsize=(15, 5)) for iter_, ax in zip([1, 12, 24, al.max_iter_], axes.flatten()): metadata = al.metadata_[iter_] y_known = np.zeros(y.shape) - 1 y_known[metadata["labeled_pool"]] = y[metadata["labeled_pool"]] plot_data(X, y_known, metadata["classifier"], ax=ax) ax.set_title(f"Iteration {iter_}") ax.legend() plt.show() .. image-sg:: /auto_examples/images/sphx_glr_plot_introduction_active_learning_008.png :alt: Iteration 1, Iteration 12, Iteration 24, Iteration 49 :srcset: /auto_examples/images/sphx_glr_plot_introduction_active_learning_008.png :class: sphx-glr-single-img .. rst-class:: sphx-glr-timing **Total running time of the script:** (0 minutes 38.188 seconds) .. _sphx_glr_download_auto_examples_plot_introduction_active_learning.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: plot_introduction_active_learning.ipynb ` .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: plot_introduction_active_learning.py ` .. container:: sphx-glr-download sphx-glr-download-zip :download:`Download zipped: plot_introduction_active_learning.zip ` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_