Datasets

Dataset Helper

features_labels_from_data(X, y, train_size=None, test_size=None, n_features=None, *, use_pca=False, return_bunch=False)

Splits a dataset according to the requested train size, test size, and number of features.

Parameters

X – raw data from the dataset
y – labels from the dataset
test_size (float or int, default=None) – If float, should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it is set to 0.25.
train_size (float or int, default=None) – If float, should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.
n_features (default=None) – number of desired features
use_pca (default=False) – whether to use PCA for dimensionality reduction
return_bunch (default=False) – whether to return a sklearn.utils.Bunch (similar to a dictionary)
Returns – the preprocessed dataset, as available in sklearn
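The train_size and test_size rules above read the same as those of sklearn's train_test_split. The following is a minimal sketch of how such values could resolve to sample counts, assuming this helper follows train_test_split's rounding (the source does not confirm this). The function name resolve_split_sizes is hypothetical and not part of this module.

```python
from math import ceil, floor


def resolve_split_sizes(n_samples, train_size=None, test_size=None):
    """Hypothetical helper: turn train_size/test_size into sample counts.

    Follows the documented rules (float = proportion, int = absolute count,
    None = complement) with the rounding used by sklearn's train_test_split.
    Input validation is omitted for brevity.
    """
    if train_size is None and test_size is None:
        test_size = 0.25  # documented default when both are None

    n_train = n_test = None
    if test_size is not None:
        # Float proportions of the test split are rounded up.
        n_test = ceil(test_size * n_samples) if isinstance(test_size, float) else int(test_size)
    if train_size is not None:
        # Float proportions of the train split are rounded down.
        n_train = floor(train_size * n_samples) if isinstance(train_size, float) else int(train_size)

    # Whichever side was left as None becomes the complement of the other.
    if n_train is None:
        n_train = n_samples - n_test
    if n_test is None:
        n_test = n_samples - n_train
    return n_train, n_test


print(resolve_split_sizes(150))                                # (112, 38)
print(resolve_split_sizes(150, test_size=0.2))                 # (120, 30)
print(resolve_split_sizes(150, train_size=100))                # (100, 50)
print(resolve_split_sizes(150, train_size=0.5, test_size=30))  # (75, 30)
```

Passing both sizes is allowed as long as they do not add up to more than the number of samples; any remaining samples are simply unused.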

pca_reduce(X_train, X_test, n_components=2)

Return type: Tuple[ndarray, ndarray]

label_to_class_name(predicted_labels, classes)

Converts numeric labels to class names (strings).

Parameters

predicted_labels (numpy.ndarray) – Nx1 array
classes (dict or list) – a mapping from label (numeric) to class name (str)

Return type

List[str]

Returns

list containing the predicted class name for each datum

Example

>>> classes = ['sepal length (cm)',
...            'sepal width (cm)',
...            'petal length (cm)',
...            'petal width (cm)']
>>> predicted_labels = [0, 2, 1, 2, 0]
>>> print(label_to_class_name(predicted_labels, classes))
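To make the label-to-name mapping concrete, here is a minimal stand-in that accepts either a list or a dict for classes. It is a sketch of the documented behaviour, not the library's implementation: the name label_to_class_name_sketch is hypothetical, it takes a flat sequence of labels rather than an Nx1 array, and the iris species names are used only for illustration.

```python
from typing import Dict, List, Sequence, Union


def label_to_class_name_sketch(
    predicted_labels: Sequence[int],
    classes: Union[Dict[int, str], List[str]],
) -> List[str]:
    """Hypothetical stand-in: look up the class name for each numeric label."""
    # Indexing works for both a list (by position) and a dict (by key).
    return [classes[int(label)] for label in predicted_labels]


class_names = ["setosa", "versicolor", "virginica"]
print(label_to_class_name_sketch([0, 2, 1, 2, 0], class_names))
# ['setosa', 'virginica', 'versicolor', 'virginica', 'setosa']

print(label_to_class_name_sketch([0, 2], {0: "setosa", 2: "virginica"}))
# ['setosa', 'virginica']
```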

Breast Cancer

load_breast_cancer(train_size=None, test_size=None, n_features=None, *, use_pca=False, return_bunch=False)

Loads the breast cancer dataset from sklearn and splits it according to the requested train size, test size, and number of features.

Parameters

test_size (float or int, default=None) – If float, should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it is set to 0.25.
train_size (float or int, default=None) – If float, should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.
n_features (default=None) – number of desired features
use_pca (default=False) – whether to use PCA for dimensionality reduction
return_bunch (default=False) – whether to return a Bunch (similar to a dictionary)
Returns – the breast cancer dataset, as available in sklearn

Iris

load_iris(train_size=None, test_size=None, n_features=None, *, use_pca=False, return_bunch=False)

Loads the iris dataset from sklearn and splits it according to the requested train size, test size, and number of features.

Parameters

test_size (float or int, default=None) – If float, should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it is set to 0.25.
train_size (float or int, default=None) – If float, should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.
n_features (default=None) – number of desired features
use_pca (default=False) – whether to use PCA for dimensionality reduction
return_bunch (default=False) – whether to return a Bunch (similar to a dictionary)
Returns – the iris dataset, as available in sklearn
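For orientation, the steps this loader describes (load, split, optionally reduce with PCA) can be written directly against sklearn. This is only a sketch under assumptions: that the loader wraps sklearn.datasets.load_iris, train_test_split, and PCA, and that n_features=2 with use_pca=True means projecting onto two principal components fitted on the training split. None of this is confirmed by the source, and random_state is added here purely for reproducibility.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Load the raw iris data: 150 samples, 4 features.
X, y = load_iris(return_X_y=True)

# test_size=0.2 holds out 20% of the samples; train_size is its complement.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Assumed meaning of n_features=2 with use_pca=True: fit PCA on the training
# split only, then project both splits onto the first two components.
pca = PCA(n_components=2).fit(X_train)
X_train_2d = pca.transform(X_train)
X_test_2d = pca.transform(X_test)

print(X_train_2d.shape, X_test_2d.shape)  # (120, 2) (30, 2)
```

Fitting PCA on the training split alone keeps information from the test samples out of the projection.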

Wine

load_wine(train_size=None, test_size=None, n_features=None, *, use_pca=False, return_bunch=False)

Loads the wine dataset from sklearn and splits it according to the requested train size, test size, and number of features.

Parameters

test_size (float or int, default=None) – If float, should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it is set to 0.25.
train_size (float or int, default=None) – If float, should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.
n_features (default=None) – number of desired features
use_pca (default=False) – whether to use PCA for dimensionality reduction
return_bunch (default=False) – whether to return a Bunch (similar to a dictionary)
Returns – the wine dataset, as available in sklearn