mlresearch.datasets.BinaryDatasets

class mlresearch.datasets.BinaryDatasets(names: str | list = 'all', data_home: str = None, download_if_missing: bool = True)[source]

Class to download, transform, and save binary classification datasets.


download(keep_index=False)

Download the datasets.

fetch_arcene()[source]

Download and transform the Arcene Data Set.

https://archive.ics.uci.edu/ml/datasets/Arcene

fetch_audit()[source]

Download and transform the Audit Data Set.

https://archive.ics.uci.edu/ml/datasets/Audit+Data

fetch_banknote_authentication()[source]

Download and transform the Banknote Authentication Data Set.

https://archive.ics.uci.edu/ml/datasets/banknote+authentication

fetch_breast_cancer()[source]

Download and transform the Breast Cancer Wisconsin Data Set.

https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)

fetch_ionosphere()[source]

Download and transform the Ionosphere Data Set.

https://archive.ics.uci.edu/ml/datasets/ionosphere

fetch_parkinsons()[source]

Download and transform the Parkinsons Data Set.

https://archive.ics.uci.edu/ml/datasets/parkinsons

fetch_spambase()[source]

Download and transform the Spambase Data Set.

https://archive.ics.uci.edu/ml/datasets/Spambase

imbalance_datasets(imbalance_ratio: float, random_state: int = None)

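Appends imbalanced versions of the datasets, with the requested imbalance ratio, to self.content_. The imbalance ratio (IR) is the number of samples in the majority class divided by the number of samples in the minority class:

\[IR = \frac{|C_{maj}|}{|C_{min}|}\]

Parameters: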
imbalance_ratio : float

Final imbalance ratio expected in the datasets.

random_state : int, RandomState instance, default=None

Controls the randomization of the algorithm.

  • If int, random_state is the seed used by the random number generator;

  • If RandomState instance, random_state is the random number generator;

  • If None, the random number generator is the RandomState instance used by np.random.

Returns:
self : Datasets
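
The imbalance ratio defined above can be illustrated with a small standalone sketch. It is not the mlresearch implementation: the helpers `imbalance_ratio` and `undersample_minority` are hypothetical names, and randomly dropping minority-class samples is only one assumed way of reaching a target IR. It uses the standard library only, with a `seed` argument standing in for `random_state`.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Return IR = |C_maj| / |C_min| for a sequence of binary labels."""
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    return max(counts.values()) / min(counts.values())


def undersample_minority(rows, labels, target_ir, seed=None):
    """Randomly drop minority-class rows until IR is close to target_ir.

    Hypothetical helper for illustration; not part of mlresearch.
    """
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    (maj_label, n_maj), (min_label, n_min) = counts.most_common(2)
    # Number of minority samples to keep for the requested ratio.
    n_keep = max(1, int(n_maj / target_ir))
    if n_keep >= n_min:
        # Data is already at least as imbalanced as requested.
        return list(rows), list(labels)
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    minority_idx = [i for i, y in enumerate(labels) if y == min_label]
    keep = set(rng.sample(minority_idx, n_keep))
    idx = [i for i, y in enumerate(labels) if y == maj_label or i in keep]
    return [rows[i] for i in idx], [labels[i] for i in idx]


# Toy data: 60 majority samples and 40 minority samples.
rows = list(range(100))
labels = [0] * 60 + [1] * 40
print(imbalance_ratio(labels))  # 1.5

new_rows, new_labels = undersample_minority(rows, labels, target_ir=3.0, seed=0)
print(imbalance_ratio(new_labels))  # 3.0
```

In this sketch a target ratio below the current one leaves the data unchanged, since undersampling the minority class can only increase the IR.
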
items()

keys()

save(path, db_name)

Save datasets.

summarize_datasets()

Create a summary of the downloaded datasets.

Returns:
datasets_summary : pd.DataFrame

DataFrame with summary statistics of all datasets.

values()