Early diabetes prediction using Keras and Scikit-learn

In this article, we will demonstrate how to combine Keras, TensorFlow and scikit-learn to perform binary classification with a deep neural network.

We will follow the standard process that should be adopted for any machine learning project:

  • Load, Explore, visualize a dataset
  • Data preprocessing
  • Create training and evaluation datasets
  • Develop the model
  • Execute the training
  • Serving
  • Load, Explore, visualize a dataset

The dataset we will use is “Early Stage diabetes prediption”, available for download from the UCI Machine Learning Repository together with a description of each column.

We use the venerable Pandas to load the dataset and to create a dataframe.

import pandas as pd
CSV_COLUMNS = ['age','gender','polyuria','polydipsia','sudden_weight_loss', 'weakness', 'polyphagia', 'genital_thrush', 'visual_blurring','itching','irritability', 'delayed_healing', 'partial_paresis', 'muscle_stiffness', 'alopecia', 'obesity', 'class']
URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/00529/diabetes_data_upload.csv"
#Load dataset
full_dataset = pd.read_csv(URL,
                           header = 0,
                           names = CSV_COLUMNS)
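Before preprocessing, it can help to check the shape, the missing values and the class balance of the loaded dataframe. The following is a minimal sketch of such a check; the `summarize` helper and the toy rows are hypothetical and only stand in for `full_dataset`.

```python
import pandas as pd

def summarize(df, label_col="class"):
    """Hypothetical helper: collect basic facts worth checking before preprocessing."""
    return {
        "shape": df.shape,
        "missing": int(df.isna().sum().sum()),
        "class_counts": df[label_col].value_counts().to_dict(),
    }

# Toy frame with made-up rows, standing in for full_dataset.
toy = pd.DataFrame({
    "age": [40, 58, 41, 45],
    "gender": ["Male", "Male", "Female", "Male"],
    "class": ["Positive", "Positive", "Negative", "Positive"],
})

summary = summarize(toy)
print(summary)
```

On the real data, the same call would be `summarize(full_dataset)`; a strongly imbalanced `class_counts` would be one reason to prefer stratified splits later on.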
  • Data Preprocessing

Now, we need to prepare the categorical features by encoding them as integer codes. We don’t have to worry about the assigned codes, as Pandas will take care of it.

for colname in ['gender','polyuria','polydipsia','sudden_weight_loss', 'weakness', 'polyphagia', 'genital_thrush', 'visual_blurring','itching','irritability', 'delayed_healing', 'partial_paresis', 'muscle_stiffness', 'alopecia', 'obesity']:
    full_dataset[colname] = pd.Categorical(full_dataset[colname])
    full_dataset[colname] = full_dataset[colname].cat.codes
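To see which code Pandas assigns to which value, the categories can be inspected on a small example. This sketch assumes the columns hold strings such as "Yes" and "No"; the toy values are made up.

```python
import pandas as pd

# Toy column with made-up values (assumption: the real columns hold Yes/No strings).
toy = pd.Series(["Yes", "No", "No", "Yes"])

categorical = pd.Categorical(toy)

# Categories are inferred in sorted order, and each value's code is its
# position in that sorted list of categories.
mapping = dict(enumerate(categorical.categories))
codes = categorical.codes.tolist()

print(mapping)
print(codes)
```

Because the categories are sorted before codes are assigned, the mapping is deterministic for a given set of values, which is why it does not need to be specified by hand.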
  • Create training and evaluation datasets

Now, we split the original dataset into features/label and encode the label using scikit-learn.

from sklearn import preprocessing
import numpy as np

X = full_dataset.iloc[:,0:(len(full_dataset.columns) - 1)]
X = np.asarray(X).astype(np.float32)
Y = full_dataset.iloc[:,len(full_dataset.columns) - 1]
encoder = preprocessing.LabelEncoder()
# The encoder must be fitted before it can transform; fit_transform does both
encoded_Y = encoder.fit_transform(Y)
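As a small illustration of what the label encoder does, the sketch below fits it on a few made-up labels and maps the integers back to the original strings. The label values "Positive" and "Negative" are an assumption about the content of the `class` column.

```python
from sklearn import preprocessing

# Made-up labels (assumption: the class column holds Positive/Negative strings).
labels = ["Positive", "Negative", "Positive", "Negative"]

encoder = preprocessing.LabelEncoder()
encoded = encoder.fit_transform(labels)   # learn the classes, then encode them

classes = list(encoder.classes_)          # classes in sorted order; index = code
decoded = list(encoder.inverse_transform(encoded))

print(classes)
print(encoded)
```

Keeping the fitted `encoder` around is useful later, since `inverse_transform` turns model predictions back into readable labels.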
  • Develop the model

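The next step is to build the model with 2 hidden layers [34, 16] using the ReLU activation function, and a sigmoid for the output. For the loss, BinaryCrossentropy is the appropriate choice for binary classification. Because the output layer already applies a sigmoid, the loss receives probabilities, so from_logits is set to False; from_logits=True is only valid when the last layer emits raw scores without an activation. Adam is chosen as the optimizer with a small learning_rate=0.005 and a higher number of epochs (400) during training.

from tensorflow import keras
from tensorflow.keras import layers

def build_dnn_model():
    # Create the model: 16 input features, two ReLU hidden layers, sigmoid output
    model = keras.Sequential()
    model.add(keras.Input(shape=(16,)))
    model.add(layers.Dense(34, activation='relu'))
    model.add(layers.Dense(16, activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))
    # Compile the model; the sigmoid output means the loss sees probabilities, not logits
    model.compile(loss=keras.losses.BinaryCrossentropy(from_logits=False), optimizer=keras.optimizers.Adam(learning_rate=0.005), metrics=['accuracy'])
    return model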
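The `from_logits` flag tells the loss whether the model output is a raw score (a logit) or already a probability. The sketch below reproduces the two forms of binary cross-entropy in plain Python to show that they agree when used consistently and disagree when a probability is passed where a logit is expected. The function names are illustrative and are not part of Keras.

```python
import math

def sigmoid(z):
    """Map a raw score (logit) to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_probability(y_true, p):
    """Binary cross-entropy when the model outputs a probability."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def bce_from_logit(y_true, z):
    """Binary cross-entropy when the model outputs a logit (numerically stable form)."""
    return max(z, 0.0) - z * y_true + math.log1p(math.exp(-abs(z)))

z = 2.0   # made-up raw score from the last layer
y = 1     # true label

loss_prob = bce_from_probability(y, sigmoid(z))   # sigmoid output, loss on probabilities
loss_logit = bce_from_logit(y, z)                 # linear output, loss on logits
loss_mismatch = bce_from_logit(y, sigmoid(z))     # sigmoid output treated as a logit

print(loss_prob, loss_logit, loss_mismatch)
```

The first two values match, while the third does not: applying a sigmoid in the output layer and also declaring the output to be logits effectively applies the sigmoid twice.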
  • Execute the training

It is now time to run the training with cross-validation. We wrap the Keras model in a scikit-learn classifier and use stratified 10-fold cross-validation, so each fold holds out 10% of the dataset for validation.

import tensorflow as tf
from sklearn import model_selection

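# Note: tf.keras.wrappers.scikit_learn was removed in TensorFlow 2.12;
# on newer versions, the SciKeras package provides a replacement KerasClassifier.
estimator = tf.keras.wrappers.scikit_learn.KerasClassifier(build_fn=build_dnn_model, epochs=400, batch_size=10)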
kfold = model_selection.StratifiedKFold(n_splits=10, shuffle=True)
history = model_selection.cross_val_score(estimator, X, encoded_Y, cv=kfold, verbose = 2)
print("Accuracy: %.2f%% (%.2f%%)" % (history.mean()*100, history.std()*100))
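To make the splitting behaviour concrete, the sketch below runs `StratifiedKFold` on made-up labels, without any model, and records the size and class ratio of each validation fold. The 60/40 class split and the `random_state` value are assumptions chosen for illustration.

```python
import numpy as np
from sklearn import model_selection

# Made-up labels: 60 positives and 40 negatives.
y = np.array([1] * 60 + [0] * 40)
X = np.zeros((len(y), 1))  # feature values do not matter for the split itself

kfold = model_selection.StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

fold_sizes = []
positive_ratios = []
for train_idx, test_idx in kfold.split(X, y):
    fold_sizes.append(len(test_idx))                    # validation fold size
    positive_ratios.append(float(y[test_idx].mean()))   # share of positives in the fold

print(fold_sizes)
print(positive_ratios)
```

Each validation fold contains a tenth of the samples and keeps the overall class ratio, so the per-fold accuracies averaged by `cross_val_score` are measured on comparable subsets.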

As a result, we obtain an accuracy as high as 96.35%.

The last step, Serving, will be covered in another article.

The Jupyter notebook file is available at https://github.com/erasolon/machine_learning/blob/main/Diabete_DNN_Scikit_Keras.ipynb
