fetch_openml_datasets

imbens.datasets.fetch_openml_datasets(target_type='all', imalance_type='all', cat_preprocess='onehot', data_home=None)

Fetches multiple datasets from OpenML [1] based on the specified criteria and preprocesses them.

Added in version 0.3.0.

Parameters:
target_type : {'all', 'binary', 'multiclass'}, optional (default='all')

The type of target variable:

  • all: No filter on target variable type.

  • binary: Datasets with binary target variables.

  • multiclass: Datasets with multiclass target variables.

imalance_type : {'all', 'low', 'medium', 'high', 'extreme'}, optional (default='all')

The imbalance type of the dataset:

  • all: No filter on imbalance level.

  • low: Datasets with low imbalance ratio (less than 5:1).

  • medium: Datasets with medium imbalance ratio (between 5:1 and 10:1).

  • high: Datasets with high imbalance ratio (between 10:1 and 50:1).

  • extreme: Datasets with extreme imbalance ratio (greater than 50:1).
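The thresholds above can be sketched as a small classifier. This is an illustration only, not the library's code: the helper name `imbalance_level` is hypothetical, and because the documentation does not say which bin the exact boundary values (5:1, 10:1, 50:1) belong to, the sketch assumes each upper bound is exclusive.

```python
def imbalance_level(ratio: float) -> str:
    """Map an imbalance ratio (majority count / minority count) to a level name."""
    if ratio < 1:
        raise ValueError("an imbalance ratio is expected to be >= 1")
    # ASSUMPTION: upper bounds are exclusive, so exactly 5.0 counts as 'medium',
    # exactly 10.0 as 'high', and exactly 50.0 as 'extreme'. The documentation
    # does not specify how boundary values are handled.
    if ratio < 5:
        return "low"
    if ratio < 10:
        return "medium"
    if ratio < 50:
        return "high"
    return "extreme"


# Ratios taken from the dataset table in the Notes section.
for name, ratio in [("bwin_amlb", 2.01), ("ecoli", 7.15), ("wilt", 17.54), ("allbp", 257.79)]:
    print(f"{name}: {imbalance_level(ratio)}")
```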

cat_preprocess : {'drop', 'onehot', 'ordinal'}, optional (default='onehot')

The method for preprocessing categorical features:

  • drop: Drop categorical features.

  • onehot: One-hot encode categorical features.

  • ordinal: Ordinal encode categorical features.
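To make the difference between the three options concrete, here is a minimal pure-Python sketch applied to a single categorical column. It is not how the library implements preprocessing; the helper `encode_column` and the choice to order categories alphabetically are assumptions made for illustration.

```python
def encode_column(values, method="onehot"):
    """Encode one categorical column (a list of labels) into numeric rows.

    Returns one list of numbers per input value.
    """
    # ASSUMPTION: categories are ordered alphabetically; a real encoder may
    # use a different ordering.
    categories = sorted(set(values))
    if method == "drop":
        # The column contributes no features at all.
        return [[] for _ in values]
    if method == "onehot":
        # One 0/1 indicator per category.
        return [[1 if v == c else 0 for c in categories] for v in values]
    if method == "ordinal":
        # A single integer code per value.
        index = {c: i for i, c in enumerate(categories)}
        return [[index[v]] for v in values]
    raise ValueError(f"unknown method: {method!r}")


colors = ["red", "green", "red"]
for method in ("drop", "onehot", "ordinal"):
    print(method, encode_column(colors, method))
```

One-hot encoding adds one column per category, so it widens the feature matrix, while ordinal encoding keeps a single column and dropping removes it.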

data_home : str or Path or None, optional (default=None)

The directory where the datasets should be stored. If None, it will use the default user cache directory.

Returns:
datasets : OrderedDict of Bunch objects

The dictionary is ordered by the imbalance ratio of the datasets. Each Bunch object (referred to as dataset) has the following attributes:

dataset.data : ndarray of shape (n_samples, n_features)
dataset.target : ndarray of shape (n_samples,)
dataset.openml_id : int (OpenML dataset ID)
dataset.IR : float (imbalance ratio of the dataset)
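A minimal sketch of how the returned mapping might be consumed. Several things here are assumptions rather than documented behavior: the helpers `imbalance_ratio`, `summarize`, `make_stand_in`, and `fetch_extreme_binary` are hypothetical, the dictionary keys are treated as opaque (the documentation does not say what they are), and the imbalance ratio is assumed to be the largest class count divided by the smallest. The stand-in objects only mimic the four documented attributes so that the example runs offline. The real call is wrapped in a function because it needs the `imbens` package and network access.

```python
from collections import Counter, OrderedDict
from types import SimpleNamespace


def imbalance_ratio(target):
    """ASSUMPTION: IR = largest class count / smallest class count."""
    counts = Counter(target)
    return max(counts.values()) / min(counts.values())


def summarize(datasets):
    """Return (key, openml_id, IR) tuples in the mapping's own order."""
    return [(key, ds.openml_id, ds.IR) for key, ds in datasets.items()]


def make_stand_in(openml_id, target):
    """Build a toy object exposing the documented attributes (not a real Bunch)."""
    return SimpleNamespace(
        data=[[0.0] for _ in target],  # placeholder feature matrix
        target=target,
        openml_id=openml_id,
        IR=imbalance_ratio(target),
    )


def fetch_extreme_binary():
    """Call the real function; requires imbens and network access."""
    from imbens.datasets import fetch_openml_datasets

    # The keyword is spelled 'imalance_type' in the documented signature.
    return fetch_openml_datasets(target_type="binary", imalance_type="extreme")


# Offline demonstration with toy data; keys and negative IDs are made up.
demo = OrderedDict()
demo["toy_a"] = make_stand_in(-1, [0] * 6 + [1] * 3)
demo["toy_b"] = make_stand_in(-2, [0] * 9 + [1] * 1)
for key, openml_id, ir in summarize(demo):
    print(f"{key}: OpenML ID {openml_id}, IR {ir:.2f}")
```

With the package installed, `summarize(fetch_extreme_binary())` would produce the same kind of listing for the real datasets.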

Notes

This function fetches datasets based on target and imbalance type filters, preprocesses them, and caches the datasets locally. The characteristics of the available datasets are presented in the table below.

==  ===========================================  =========  ========  ========  =========
ID  Name                                         OpenML ID  Ratio     #Samples  #Features
==  ===========================================  =========  ========  ========  =========
1   bwin_amlb                                    45717      2.01:1    530       13
2   mozilla4                                     1046       2.04:1    15,545    5
3   mc2                                          1054       2.10:1    161       39
4   wholesale-customers                          1511       2.10:1    440       8
5   vertebra-column                              1524       2.10:1    310       6
6   law-school-admission-bianry                  43904      2.11:1    20,800    14
7   bank32nh                                     833        2.22:1    8,192     32
8   elevators                                    846        2.24:1    16,599    18
9   cpu_small                                    735        2.31:1    8,192     12
10  Credit_Approval_Classification               46503      2.33:1    1,000     50
11  house_8L                                     843        2.38:1    22,784    8
12  house_16H                                    821        2.38:1    22,784    16
13  phoneme                                      1489       2.41:1    5,404     5
14  ilpd-numeric                                 41945      2.49:1    583       10
15  planning-relax                               1490       2.50:1    182       12
16  MiniBooNE                                    41150      2.56:1    130,064   50
17  machine_cpu                                  733        2.73:1    209       6
18  telco-customer-churn                         42178      2.77:1    7,043     39
19  haberman                                     43         2.78:1    306       3
20  vehicle                                      994        2.88:1    846       18
21  cpu                                          796        2.94:1    209       36
22  ada                                          41156      3.03:1    4,147     48
23  adult                                        45068      3.18:1    48,842    107
24  blood-transfusion-service-center             1464       3.20:1    748       4
25  default-of-credit-card-clients               42477      3.52:1    30,000    23
26  Customer_Churn_Classification                46362      3.74:1    175,028   24
27  SPECTF                                       1600       3.85:1    267       44
28  Medical-Appointment-No-Shows                 43439      3.95:1    110,527   36
29  JapaneseVowels                               976        5.17:1    9,961     14
30  ibm-employee-attrition                       43893      5.20:1    1,470     53
31  first-order-theorem-proving                  1475       5.26:1    6,118     51
32  user-knowledge                               1508       5.38:1    403       5
33  online-shoppers-intention                    45560      5.46:1    12,330    28
34  kc1                                          1067       5.47:1    2,109     21
35  thoracic-surgery                             1506       5.71:1    470       16
36  UCI_churn                                    44232      5.90:1    3,333     18
37  arsenic-female-bladder                       949        5.99:1    559       4
38  okcupid_stem                                 45067      6.83:1    26,677    117
39  ecoli                                        40671      7.15:1    327       7
40  pc4                                          1049       7.19:1    1,458     37
41  bank-marketing                               1558       7.68:1    4,521     48
42  Diabetes-130-Hospitals_(Fairlearn)           43903      7.96:1    101,766   50
43  Otto-Group-Product-Classification-Challenge  45548      8.36:1    61,878    93
44  eucalyptus                                   43925      8.54:1    4,331     26
45  pendigits                                    1019       8.61:1    10,992    16
46  pc3                                          1050       8.77:1    1,563     37
47  page-blocks-bin                              1021       8.77:1    5,473     10
48  optdigits                                    980        8.83:1    5,620     64
49  mfeat-karhunen                               1020       9.00:1    2,000     64
50  mfeat-fourier                                971        9.00:1    2,000     76
51  mfeat-zernike                                995        9.00:1    2,000     47
52  Pulsar-Dataset-HTRU2                         45558      9.92:1    17,898    8
53  vowel                                        1016       10.00:1   990       26
54  heart-h                                      1565       12.53:1   294       13
55  pc1                                          1068       13.40:1   1,109     21
56  seismic-bumps                                45562      14.20:1   2,584     22
57  ozone-level-8hr                              1487       14.84:1   2,534     72
58  microaggregation2                            41671      15.02:1   20,000    20
59  Sick_numeric                                 41946      15.33:1   3,772     29
60  insurance_company                            46281      15.76:1   9,822     85
61  wilt                                         40983      17.54:1   4,839     5
62  Click_prediction_small                       1217       21.37:1   149,639   11
63  jannis                                       41168      22.83:1   83,733    54
64  letter                                       977        23.60:1   20,000    16
65  walking-activity                             1509       24.14:1   149,332   4
66  helena                                       41169      36.08:1   65,196    27
67  mammography                                  310        42.01:1   11,183    6
68  dis                                          40713      64.03:1   3,772     29
69  Satellite                                    40900      67.00:1   5,100     36
70  Employee-Turnover-at-TECHCO                  43551      68.74:1   34,452    9
71  page-blocks                                  30         175.46:1  5,473     10
72  allbp                                        40707      257.79:1  3,772     29
73  CreditCardFraudDetection                     42397      577.88:1  284,807   30
==  ===========================================  =========  ========  ========  =========


References

[1] Vanschoren, J., Van Rijn, J. N., Bischl, B., & Torgo, L. (2014). OpenML: networked science in machine learning. ACM SIGKDD Explorations Newsletter, 15(2), 49-60.