In this article, we will again use the dataset from the previous articles for early diabetes prediction, this time in its libsvm format, which is available at this link.

This time, we will use XGBoost, which offers several advantages: parallel processing, faster convergence, the possibility of cross-validation at each boosting round, and better resilience to data scarcity, for both classification and regression.

For the backend, we will start using cloud services, which are becoming increasingly cost-effective thanks to fierce competition between the major players: Amazon, Google and Microsoft. Here, I have chosen Amazon SageMaker, which holds a dominant position at the time of writing. I assume you already have some basic knowledge of Amazon cloud services, especially S3 and SageMaker. Otherwise, you can easily find free tutorials on the AWS website itself.

As always, it helps to have a good understanding of the mathematics behind XGBoost so that you can adjust the hyperparameters with confidence. There are several articles about it on the internet. You don't have to buy expensive books or invest in anything costly. I have been watching this YouTube channel, which explains it in a pleasant way.

I am not going to explain every line of the code, which is available on GitHub. Basically, I went through the usual pre-processing steps: uploading the dataset to an S3 bucket and splitting it into training and testing sets.
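To make the splitting step concrete, here is a minimal sketch that works directly on libsvm-formatted lines. It is not the code from the repository: the function name split_libsvm_lines, the 20% test fraction, the seed and the sample rows are all assumptions chosen for illustration.

```python
import random


def split_libsvm_lines(lines, test_fraction=0.2, seed=42):
    """Shuffle libsvm-formatted lines and split them into train and test lists."""
    # Drop blank lines so that every remaining entry is one labelled sample.
    samples = [line.strip() for line in lines if line.strip()]
    # A seeded generator keeps the split reproducible between runs.
    rng = random.Random(seed)
    rng.shuffle(samples)
    n_test = int(len(samples) * test_fraction)
    # The first n_test shuffled rows become the test set, the rest the train set.
    return samples[n_test:], samples[:n_test]


# Tiny made-up sample in libsvm format: "<label> <index>:<value> ...".
example = [
    "1 1:6 2:148 3:72",
    "0 1:1 2:85 3:66",
    "1 1:8 2:183 3:64",
    "0 1:1 2:89 3:66",
    "1 1:0 2:137 3:40",
]
train, test = split_libsvm_lines(example, test_fraction=0.2)
print(len(train), "training rows,", len(test), "testing rows")
```

The two lists can then be written to separate files and uploaded to the S3 bucket, for instance with the upload_file method of a boto3 S3 client.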

The most important part is the definition of the model parameters. I will use regular parameters for now and cover their tuning in another article. As the objective function, I will use softmax, the multi-class objective, and as the evaluation metric I will use merror, the multi-class classification error rate.
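As an illustration, the parameters could be collected in a dictionary like the one sketched here. I am assuming that "softmax" refers to XGBoost's multi:softmax objective, which needs num_class to be set; the values for num_round, max_depth and eta are placeholders, not the settings used in the repository. The small merror helper only shows what the metric measures: the share of wrong predictions.

```python
# Hypothetical starting values, not the settings from the repository.
hyperparameters = {
    "objective": "multi:softmax",  # softmax objective, predicts a class label
    "num_class": 2,                # required by multi:softmax: diabetic or not
    "eval_metric": "merror",       # multi-class classification error rate
    "num_round": 100,              # number of boosting rounds (assumed)
    "max_depth": 5,                # maximum tree depth (assumed)
    "eta": 0.2,                    # learning rate (assumed)
}


def merror(y_true, y_pred):
    """Multi-class error rate: wrong predictions divided by all predictions."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


# Made-up labels and predictions, only to show the computation.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
error = merror(y_true, y_pred)
print("merror:", error, "accuracy:", 1 - error)
```

With the SageMaker Python SDK, a dictionary of this kind is usually passed to the estimator through set_hyperparameters(**hyperparameters). Accuracy is simply one minus merror, so a lower merror during training means a higher accuracy.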

After running the model, we observed an accuracy of 78%, which is quite poor compared to the results of the algorithms used in previous articles. As promised, hyperparameter tuning will be the subject of another article.

The classification report and the ROC curve, two important outputs to examine for any classification task, are shown below.

Fig 1 : Classification report
Fig 2 : ROC curve
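For readers who want to see what lies behind these two figures, here is a small sketch in plain Python that computes precision, recall and F1 for the positive class, together with the points of a ROC curve and the area under it. The labels and scores are made up, and the function names are mine. Note that a ROC curve needs a score per sample, not just a predicted class, so with XGBoost one would have to use an objective that returns probabilities, such as multi:softprob or binary:logistic.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 score for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def roc_points(y_true, scores):
    """(false positive rate, true positive rate) pairs, one per threshold.

    Assumes 0/1 labels with at least one sample of each class.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    # Lower the threshold step by step, from the highest score to the lowest.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve, using the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Made-up ground truth and predicted probabilities of the positive class.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
points = roc_points(y_true, scores)
print("precision:", precision, "recall:", recall, "f1:", f1)
print("AUC:", auc(points))
```

In practice, a library such as scikit-learn provides these computations ready-made; the sketch is only meant to show what the numbers in the report and the curve represent.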
