fetch_datasets

imbens.datasets.fetch_datasets(*, data_home=None, filter_data=None, download_if_missing=True, random_state=None, shuffle=False, verbose=False)

Load the benchmark datasets from Zenodo, downloading them if necessary.

Added in version 0.3.

Parameters:
data_home : str, default=None

Specify another download and cache folder for the datasets. By default all scikit-learn data is stored in ‘~/scikit_learn_data’ subfolders.

filter_data : tuple of str/int, default=None

A tuple containing the IDs or names of the datasets to be returned. Refer to the table in the Notes section below for the ID and name of each dataset.

download_if_missing : bool, default=True

If False, raise an IOError if the data is not locally available instead of trying to download it from the source site.

random_state : int, RandomState instance or None, default=None

Random state for shuffling the dataset. If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

shuffle : bool, default=False

Whether to shuffle the dataset.

verbose : bool, default=False

Whether to show information about the fetching process.
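As a minimal sketch of how download_if_missing can make network access explicit, the helper below first requests a cache-only load and allows a download only after an IOError is raised. The helper name load_cached_or_download and its fetch argument are hypothetical and introduced for illustration; fetch is assumed to be imbens.datasets.fetch_datasets or any callable accepting the same keyword arguments.

```python
def load_cached_or_download(fetch, names, data_home=None, verbose=False):
    """Load datasets from the local cache, downloading only if that fails.

    `fetch` is assumed to accept the keyword arguments documented above
    (for example, `imbens.datasets.fetch_datasets`).
    """
    try:
        # Cache-only attempt: raises IOError if the data is not available locally.
        return fetch(
            data_home=data_home,
            filter_data=names,
            download_if_missing=False,
            verbose=verbose,
        )
    except IOError:
        if verbose:
            print("Datasets not cached locally; downloading.")
        # Second attempt, this time allowing the download.
        return fetch(
            data_home=data_home,
            filter_data=names,
            download_if_missing=True,
            verbose=verbose,
        )
```

With the library installed, a call might look like load_cached_or_download(fetch_datasets, ("ecoli", "satimage")) after importing fetch_datasets from imbens.datasets. Passing the fetch function in as an argument is a design choice of this sketch that keeps the helper easy to exercise without network access.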

Returns:
datasets : OrderedDict of Bunch objects

The order is defined by filter_data. Each Bunch object, referred to as dataset, has the following attributes:

dataset.data : ndarray of shape (n_samples, n_features)
dataset.target : ndarray of shape (n_samples,)
dataset.DESCR : str

Description of each dataset.
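To illustrate the return value, here is a small sketch that walks over the returned mapping and reports the size and class counts of each dataset. The helper name summarize is hypothetical, and it relies only on the data and target attributes listed above.

```python
from collections import Counter


def summarize(datasets):
    """Return one summary dict per entry of the mapping returned by fetch_datasets."""
    rows = []
    for key, dataset in datasets.items():
        n_samples = len(dataset.data)
        # Each row of `data` holds one sample, so its length is n_features.
        n_features = len(dataset.data[0]) if n_samples else 0
        rows.append(
            {
                "key": key,
                "n_samples": n_samples,
                "n_features": n_features,
                # Number of samples per distinct target value.
                "class_counts": dict(Counter(dataset.target)),
            }
        )
    return rows
```

With the library installed, summarize(fetch_datasets(filter_data=("ecoli",))) would produce a single row for the requested dataset; the first call may trigger a download from Zenodo.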

Notes

This collection of datasets was proposed in [1]. The characteristics of the available datasets are presented in the table below.

ID  Name            Repository & Target            Ratio  #S (samples)  #F (features)
--  --------------  -----------------------------  -----  ------------  -------------
1   ecoli           UCI, target: imU               8.6:1  336           7
2   optical_digits  UCI, target: 8                 9.1:1  5,620         64
3   satimage        UCI, target: 4                 9.3:1  6,435         36
4   pen_digits      UCI, target: 5                 9.4:1  10,992        16
5   abalone         UCI, target: 7                 9.7:1  4,177         10
6   sick_euthyroid  UCI, target: sick euthyroid    9.8:1  3,163         42
7   spectrometer    UCI, target: >=44              11:1   531           93
8   car_eval_34     UCI, target: good, v good      12:1   1,728         21
9   isolet          UCI, target: A, B              12:1   7,797         617
10  us_crime        UCI, target: >0.65             12:1   1,994         100
11  yeast_ml8       LIBSVM, target: 8              13:1   2,417         103
12  scene           LIBSVM, target: >one label     13:1   2,407         294
13  libras_move     UCI, target: 1                 14:1   360           90
14  thyroid_sick    UCI, target: sick              15:1   3,772         52
15  coil_2000       KDD, CoIL, target: minority    16:1   9,822         85
16  arrhythmia      UCI, target: 06                17:1   452           278
17  solar_flare_m0  UCI, target: M->0              19:1   1,389         32
18  oil             UCI, target: minority          22:1   937           49
19  car_eval_4      UCI, target: vgood             26:1   1,728         21
20  wine_quality    UCI, wine, target: <=4         26:1   4,898         11
21  letter_img      UCI, target: Z                 26:1   20,000        16
22  yeast_me2       UCI, target: ME2               28:1   1,484         8
23  webpage         LIBSVM, w7a, target: minority  33:1   34,780        300
24  ozone_level     UCI, ozone, data               34:1   2,536         72
25  mammography     UCI, target: minority          42:1   11,183        6
26  protein_homo    KDD CUP 2004, minority         111:1  145,751       74
27  abalone_19      UCI, target: 19                130:1  4,177         10
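The Ratio column can be related to the sample counts with a little arithmetic. The sketch below assumes that the ratio is the majority-class count divided by the minority-class count and that each dataset has two classes; both are assumptions, since the table does not define the column. The helper names are hypothetical.

```python
from collections import Counter


def imbalance_ratio(target):
    """Ratio of the most frequent to the least frequent label in `target`."""
    counts = Counter(target)
    return max(counts.values()) / min(counts.values())


def approx_minority_size(n_samples, ratio):
    """Estimate the minority-class size of a two-class dataset from a 'ratio:1' entry."""
    # With majority = ratio * minority, n_samples = (ratio + 1) * minority.
    return round(n_samples / (ratio + 1))
```

For the ecoli row (336 samples, 8.6:1), approx_minority_size(336, 8.6) gives 35. Because the ratios in the table are rounded, such estimates are approximate; imbalance_ratio(dataset.target) computes the ratio directly from a loaded dataset.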

References

[1] Ding, Zejin, “Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics.” Dissertation, Georgia State University, (2011).