Datasets

Dataset Helper

features_labels_from_data(X, y, train_size=None, test_size=None, n_features=None, *, use_pca=False, return_bunch=False)[source]

Splits a dataset into training and test sets according to the requested train size, test size and number of features.

Parameters
  • X – raw data from the dataset

  • y – labels from the dataset

  • train_size – float or int, default=None. If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.

  • test_size – float or int, default=None. If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.

  • n_features – number of features to keep

  • use_pca – whether to use PCA for dimensionality reduction, default False

  • return_bunch – whether to return a sklearn Bunch (similar to a dictionary)

Returns

Preprocessed dataset as available in sklearn
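
Example

A minimal usage sketch. The import path below is a placeholder for wherever this helper lives in the installed package, and the exact structure of the returned object depends on return_bunch, as described above.

>>> from your_package.datasets import features_labels_from_data  # hypothetical path
>>> from sklearn.datasets import load_iris as sk_load_iris
>>> X, y = sk_load_iris(return_X_y=True)
>>> data = features_labels_from_data(X, y, train_size=0.8, test_size=0.2,
...                                  n_features=2, use_pca=True)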

pca_reduce(X_train, X_test, n_components=2)[source]

Reduces the dimensionality of the training and test data to n_components features using PCA.

Parameters
  • X_train – training data

  • X_test – test data

  • n_components – number of principal components to keep, default 2

Return type

Tuple[ndarray, ndarray]

Returns

the dimensionality-reduced training and test data
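
Example

The helper's own implementation is not shown here; the sketch below illustrates the standard pattern such a function typically wraps (an assumption: fit PCA on the training data only, then project both splits with the same transform), using sklearn's PCA on random data.

>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> rng = np.random.default_rng(0)
>>> X_train, X_test = rng.normal(size=(100, 10)), rng.normal(size=(20, 10))
>>> pca = PCA(n_components=2)
>>> X_train_red = pca.fit_transform(X_train)  # fit on training data only
>>> X_test_red = pca.transform(X_test)        # reuse the same projection
>>> print(X_train_red.shape, X_test_red.shape)
(100, 2) (20, 2)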

label_to_class_name(predicted_labels, classes)[source]

Helper that converts numeric labels to class names (strings)

Parameters
  • predicted_labels (numpy.ndarray) – Nx1 array

  • classes (dict or list) – a mapping from label (numeric) to class name (str)

Return type

List[str]

Returns

list of predicted class names, one for each datum

Example

>>> classes = ['sepal length (cm)',
...            'sepal width (cm)',
...            'petal length (cm)',
...            'petal width (cm)']
>>> predicted_labels = [0, 2, 1, 2, 0]
>>> print(label_to_class_name(predicted_labels, classes))

Breast Cancer

load_breast_cancer(train_size=None, test_size=None, n_features=None, *, use_pca=False, return_bunch=False)[source]

Loads the breast cancer dataset from sklearn and splits it according to the requested train size, test size and number of features.

Parameters
  • train_size – float or int, default=None. If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.

  • test_size – float or int, default=None. If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.

  • n_features – number of features to keep

  • use_pca – whether to use PCA for dimensionality reduction, default False

  • return_bunch – whether to return a sklearn Bunch (similar to a dictionary)

Returns

Breast Cancer dataset as available in sklearn
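
Example

A usage sketch; the import path is a placeholder and the sizes are illustrative (the dataset has 569 samples).

>>> from your_package.datasets import load_breast_cancer  # hypothetical path
>>> data = load_breast_cancer(train_size=100, test_size=25, n_features=2, use_pca=True)
>>> bunch = load_breast_cancer(train_size=100, test_size=25, n_features=2,
...                            use_pca=True, return_bunch=True)  # sklearn Bunch instead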

Iris

load_iris(train_size=None, test_size=None, n_features=None, *, use_pca=False, return_bunch=False)[source]

Loads the iris dataset from sklearn and splits it according to the requested train size, test size and number of features.

Parameters
  • train_size – float or int, default=None. If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.

  • test_size – float or int, default=None. If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.

  • n_features – number of features to keep

  • use_pca – whether to use PCA for dimensionality reduction, default False

  • return_bunch – whether to return a sklearn Bunch (similar to a dictionary)

Returns

Iris dataset as available in sklearn
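
Example

A usage sketch with a fractional 75/25 split of the 150 iris samples; the import path is a placeholder.

>>> from your_package.datasets import load_iris  # hypothetical path
>>> data = load_iris(train_size=0.75, test_size=0.25)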

Wine

load_wine(train_size=None, test_size=None, n_features=None, *, use_pca=False, return_bunch=False)[source]

Loads the wine dataset from sklearn and splits it according to the requested train size, test size and number of features.

Parameters
  • train_size – float or int, default=None. If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.

  • test_size – float or int, default=None. If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.

  • n_features – number of features to keep

  • use_pca – whether to use PCA for dimensionality reduction, default False

  • return_bunch – whether to return a sklearn Bunch (similar to a dictionary)

Returns

Wine dataset as available in sklearn
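
Example

A usage sketch; the import path is a placeholder. With train_size and test_size left as None, the test split defaults to 0.25, and PCA reduces the 13 wine features down to 3.

>>> from your_package.datasets import load_wine  # hypothetical path
>>> data = load_wine(n_features=3, use_pca=True)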