fetch_datasets
- imbens.datasets.fetch_datasets(*, data_home=None, filter_data=None, download_if_missing=True, random_state=None, shuffle=False, verbose=False)
Load the benchmark datasets from Zenodo, downloading them if necessary.
Added in version 0.3.
- Parameters:
- data_home : str, default=None
Alternative download and cache folder for the datasets. By default, all scikit-learn data is stored in subfolders of ‘~/scikit_learn_data’.
- filter_data : tuple of str/int, default=None
A tuple containing the IDs or the names of the datasets to be returned. Refer to the table in the Notes section below for the ID and name of each dataset.
- download_if_missing : bool, default=True
If False, raise an IOError if the data is not locally available instead of trying to download it from the source site.
- random_state : int, RandomState instance or None, default=None
Random state for shuffling the dataset. If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
- shuffle : bool, default=False
Whether to shuffle the dataset.
- verbose : bool, default=False
Show information regarding the fetching.
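A minimal usage sketch of the parameters above, assuming the `imbens` package is installed and importable. The `DATASET_NAMES` tuple and the `resolve_names` and `load_selected` helpers are hypothetical names introduced here for illustration; they are not part of the library. `resolve_names` only mirrors the ID-to-name mapping from the table in the Notes section, and the import is kept inside `load_selected` so that nothing is downloaded until it is called.

```python
# Names in ID order (ID 1 = "ecoli", ..., ID 27 = "abalone_19"), copied from
# the table in the Notes section of this page.
DATASET_NAMES = (
    "ecoli", "optical_digits", "satimage", "pen_digits", "abalone",
    "sick_euthyroid", "spectrometer", "car_eval_34", "isolet", "us_crime",
    "yeast_ml8", "scene", "libras_move", "thyroid_sick", "coil_2000",
    "arrhythmia", "solar_flare_m0", "oil", "car_eval_4", "wine_quality",
    "letter_img", "yeast_me2", "webpage", "ozone_level", "mammography",
    "protein_homo", "abalone_19",
)


def resolve_names(filter_data):
    """Hypothetical helper: map a mix of IDs and names to dataset names."""
    names = []
    for item in filter_data:
        if isinstance(item, int):
            if not 1 <= item <= len(DATASET_NAMES):
                raise ValueError(f"unknown dataset ID: {item}")
            names.append(DATASET_NAMES[item - 1])
        elif item in DATASET_NAMES:
            names.append(item)
        else:
            raise ValueError(f"unknown dataset name: {item!r}")
    return names


def load_selected(filter_data, data_home=None):
    """Hypothetical wrapper: fetch only the requested datasets."""
    # Imported lazily: calling this function may trigger a download.
    from imbens.datasets import fetch_datasets

    return fetch_datasets(
        data_home=data_home,          # None keeps the default cache folder
        filter_data=tuple(filter_data),
        download_if_missing=True,     # set to False to get an IOError instead
        random_state=0,               # fixed seed so the shuffle is repeatable
        shuffle=True,
        verbose=True,
    )


# One dataset requested by name, one by ID.
print(resolve_names(("ecoli", 26)))
```

Calling `load_selected(("ecoli", 26))` would then request those two datasets from the library; the call is left out of the sketch because it needs the package and, on first use, network access.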
- Returns:
- datasets : OrderedDict of Bunch objects
The ordering is defined by filter_data. Each Bunch object, referred to as dataset, has the following attributes:
- dataset.data : ndarray of shape (n_samples, n_features)
- dataset.target : ndarray of shape (n_samples,)
- dataset.DESCR : str
Description of each dataset.
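The return structure described above can be walked as in the following sketch. The `summarize` helper and the `toy` mapping are hypothetical: `toy` is a synthetic stand-in built from `SimpleNamespace` objects so the example runs without downloading anything, and it only imitates the `data`, `target` and `DESCR` attributes listed above.

```python
from collections import OrderedDict
from types import SimpleNamespace


def summarize(datasets):
    """Return (name, n_samples, n_features) for each dataset, in dict order."""
    rows = []
    for name, dataset in datasets.items():
        n_samples = len(dataset.data)
        n_features = len(dataset.data[0])
        # data and target are expected to have one entry per sample
        if len(dataset.target) != n_samples:
            raise ValueError(f"{name}: data and target lengths differ")
        rows.append((name, n_samples, n_features))
    return rows


# Synthetic stand-in for the returned mapping (not real benchmark data).
toy = OrderedDict(
    toy_a=SimpleNamespace(
        data=[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
        target=[0, 0, 1],
        DESCR="toy_a",
    ),
    toy_b=SimpleNamespace(data=[[1.0], [2.0]], target=[0, 1], DESCR="toy_b"),
)

for name, n_samples, n_features in summarize(toy):
    print(f"{name}: {n_samples} samples, {n_features} features")
```

Because `summarize` relies only on `len` and indexing, it should work the same way on the ndarray attributes of the real Bunch objects.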
Notes
This collection of datasets was proposed in [1]. The characteristics of the available datasets are presented in the table below.
| ID | Name | Repository & Target | Ratio | #S (samples) | #F (features) |
|----|------|---------------------|-------|--------------|---------------|
| 1 | ecoli | UCI, target: imU | 8.6:1 | 336 | 7 |
| 2 | optical_digits | UCI, target: 8 | 9.1:1 | 5,620 | 64 |
| 3 | satimage | UCI, target: 4 | 9.3:1 | 6,435 | 36 |
| 4 | pen_digits | UCI, target: 5 | 9.4:1 | 10,992 | 16 |
| 5 | abalone | UCI, target: 7 | 9.7:1 | 4,177 | 10 |
| 6 | sick_euthyroid | UCI, target: sick euthyroid | 9.8:1 | 3,163 | 42 |
| 7 | spectrometer | UCI, target: >=44 | 11:1 | 531 | 93 |
| 8 | car_eval_34 | UCI, target: good, v good | 12:1 | 1,728 | 21 |
| 9 | isolet | UCI, target: A, B | 12:1 | 7,797 | 617 |
| 10 | us_crime | UCI, target: >0.65 | 12:1 | 1,994 | 100 |
| 11 | yeast_ml8 | LIBSVM, target: 8 | 13:1 | 2,417 | 103 |
| 12 | scene | LIBSVM, target: >one label | 13:1 | 2,407 | 294 |
| 13 | libras_move | UCI, target: 1 | 14:1 | 360 | 90 |
| 14 | thyroid_sick | UCI, target: sick | 15:1 | 3,772 | 52 |
| 15 | coil_2000 | KDD, CoIL, target: minority | 16:1 | 9,822 | 85 |
| 16 | arrhythmia | UCI, target: 06 | 17:1 | 452 | 278 |
| 17 | solar_flare_m0 | UCI, target: M->0 | 19:1 | 1,389 | 32 |
| 18 | oil | UCI, target: minority | 22:1 | 937 | 49 |
| 19 | car_eval_4 | UCI, target: vgood | 26:1 | 1,728 | 21 |
| 20 | wine_quality | UCI, wine, target: <=4 | 26:1 | 4,898 | 11 |
| 21 | letter_img | UCI, target: Z | 26:1 | 20,000 | 16 |
| 22 | yeast_me2 | UCI, target: ME2 | 28:1 | 1,484 | 8 |
| 23 | webpage | LIBSVM, w7a, target: minority | 33:1 | 34,780 | 300 |
| 24 | ozone_level | UCI, ozone, data | 34:1 | 2,536 | 72 |
| 25 | mammography | UCI, target: minority | 42:1 | 11,183 | 6 |
| 26 | protein_homo | KDD CUP 2004, minority | 111:1 | 145,751 | 74 |
| 27 | abalone_19 | UCI, target: 19 | 130:1 | 4,177 | 10 |
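As a rough check of the Ratio column, the imbalance ratio of a label vector can be computed as in the sketch below. It assumes the ratio is the majority-class count divided by the minority-class count; `imbalance_ratio` is a hypothetical helper, and the labels are synthetic, with class counts chosen to be consistent with the ecoli row (336 samples, 8.6:1). The 0/1 label encoding is an assumption made only for this example.

```python
from collections import Counter


def imbalance_ratio(target):
    """Majority-class count divided by minority-class count."""
    counts = Counter(target)
    if len(counts) < 2:
        raise ValueError("need at least two classes to compute a ratio")
    return max(counts.values()) / min(counts.values())


# Synthetic labels: 301 majority samples and 35 minority samples (336 total).
labels = [0] * 301 + [1] * 35
print(f"{imbalance_ratio(labels):.1f}:1")
```

The same helper can be applied to `dataset.target` of a fetched dataset, since `Counter` accepts any iterable of hashable labels.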
References
[1] Ding, Zejin, “Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics.” Dissertation, Georgia State University, (2011).