H2O is an end-to-end ML platform that enables organizations to rapidly build world-class AI models and applications. It also provides open-source libraries for R, Python and Scala for custom modeling. For parallel computing and orchestration, it can run on top of Spark, Hadoop and Kubernetes. Recently, the company launched a cloud-based enterprise solution called H2O AI Hybrid Cloud.
In this tutorial, you will discover how to build a classification model using H2O AutoML.
The dataset is taken from Kaggle and a copy is made available on this link. The task is a binary classification: predicting death from heart failure (column DEATH_EVENT).
After completing this tutorial, you will know:
- How to install H2O AutoML
- How to perform data ingestion into AutoML
- How to build a model while doing feature engineering
- How to check the metrics
- How to make predictions with the generated model
Let’s enjoy the beauty of H2O AutoML!
Before anything else, you need to download the binary from https://h2o-release.s3.amazonaws.com/h2o/rel-zipf/3/index.html, uncompress the file into a local folder, then run the application with the command java -jar h2o.jar. After that, you can access the AutoML portal (the H2O web interface, called Flow) at http://127.0.0.1:54321.
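If you prefer to work from Python rather than the browser, the same running instance can be reached with the h2o package. This is only a minimal sketch: it assumes the server started with java -jar h2o.jar is still running on the address above and that the h2o Python package is installed (for example with pip install h2o). The helper name connect_to_local_h2o is hypothetical.

```python
# Address quoted in the text above; change it if H2O runs elsewhere.
H2O_URL = "http://127.0.0.1:54321"


def connect_to_local_h2o(url=H2O_URL):
    """Attach to an already running H2O instance and return its cluster handle.

    Hypothetical helper for illustration. It assumes `java -jar h2o.jar`
    has been started beforehand and that the `h2o` package is installed.
    """
    import h2o  # imported lazily so this file loads even without the package

    h2o.connect(url=url)   # attaches to a running server, does not start one
    return h2o.cluster()   # cluster handle, e.g. .version or .show_status()
```

Calling connect_to_local_h2o().show_status() should then print a short summary of the instance you just started.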
The order of the top menus already reflects the process that you need to follow.
First, you need to import the file into AutoML. Go to the menu Data -> Import Files, enter the full path of the file and click on Import.
After importing the file, you already get a preview of the data, but you still need to parse it to check the inferred type of each variable and to specify the column delimiter and the column header.
The next step is very important because it defines whether the task will be a classification or a regression. For classification, you have to change the type of the target variable to Enum.
After parsing, you will get the dataframe.
Now, you can split the dataframe into training and testing sets from the menu Data -> Split Frame: select the dataframe and enter the split ratio.
At this point, you have a machine-learning-ready dataframe and can move on to modeling.
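For readers who want to script the import, parse and split steps instead of clicking through the menus, here is a rough equivalent with the h2o Python package. It is a sketch under assumptions: the file path is whatever you saved the dataset as, the 0.8 split ratio and the seed are illustrative values not taken from this tutorial, and load_and_split is a hypothetical helper name.

```python
TARGET = "DEATH_EVENT"   # target column named in the tutorial
TRAIN_RATIO = 0.8        # assumed ratio for illustration; pick your own


def load_and_split(path, target=TARGET, train_ratio=TRAIN_RATIO, seed=42):
    """Import a CSV, mark the target as categorical, and split the frame.

    Hypothetical helper; requires the `h2o` package and a Java runtime.
    """
    import h2o  # lazy import so the file loads without the package

    h2o.init()  # connects to a local H2O instance, or starts one if needed

    # Import and parse in one call; column types are inferred.
    frame = h2o.import_file(path)

    # Equivalent of switching the target to Enum in the parse screen:
    # a categorical target makes this a classification task.
    frame[target] = frame[target].asfactor()

    # The split is random, so the resulting sizes only approximate the ratio.
    train, test = frame.split_frame(ratios=[train_ratio], seed=seed)
    return train, test
```

Without the asfactor() call, a 0/1 target stays numeric and H2O would treat the problem as a regression, which is the same pitfall described for the parse screen.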
Feature engineering and Modeling
H2O covers the majority of ML/AI algorithms out there, with an additional benefit: it lets you do feature engineering while defining the parameters of the model. This is very helpful because it already guides you toward the type of feature engineering you need, depending on the algorithm you have chosen.
For the case we are trying to model, since the dataset only has records in the order of thousands, a classical machine learning approach is more appropriate than deep learning, which generally needs much larger datasets (often millions of records) for better accuracy. We will use Random Forest, which is known to perform well on small datasets. Go to the menu Model -> Distributed Random Forest and specify the parameters of the model.
For feature engineering, I have checked balance_classes.
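The same model can be sketched with the h2o Python package. The balance_classes option asks H2O to over- or under-sample the training data so that the classes are more evenly represented. The number of trees and the seed below are illustrative assumptions, not the values used in this tutorial, and feature_columns and train_drf are hypothetical helper names.

```python
def feature_columns(columns, target):
    """Return every column name except the target, preserving order."""
    return [name for name in columns if name != target]


def train_drf(train, valid, target="DEATH_EVENT", ntrees=50, seed=42):
    """Train a Distributed Random Forest with class balancing enabled.

    `train` and `valid` are H2OFrames whose target column is already Enum.
    Hypothetical helper; ntrees and seed are illustrative values.
    """
    from h2o.estimators import H2ORandomForestEstimator  # lazy import

    model = H2ORandomForestEstimator(
        ntrees=ntrees,
        balance_classes=True,  # same option as the checkbox mentioned above
        seed=seed,
    )
    model.train(
        x=feature_columns(train.columns, target),
        y=target,
        training_frame=train,
        validation_frame=valid,
    )
    return model
```

For example, feature_columns(["age", "time", "DEATH_EVENT"], "DEATH_EVENT") returns ["age", "time"], which is the predictor list passed to the model.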
After running the model, H2O provides a variety of metrics to assess the quality of the generated model.
- Loss: AutoML plots the loss type you chose in the model definition, for both training and testing. You can see that the loss improves once the number of trees reaches about 30.
- ROC curve: You can also check the AUC and display the confusion matrix at a given point on the curve.
- Variable importance: H2O also reports variable importance. You can see that the variable “time” is the most influential, followed by “ejection_fraction”, while “Sex” has very little influence on the target.
- Confusion matrix: The overall confusion matrix for both the training and validation metrics.
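To make two of these metrics concrete, here is a small plain-Python sketch of how a confusion matrix at a given threshold and the AUC can be computed from labels and predicted scores. It is not H2O's implementation, and the labels and scores are toy values made up for illustration, not results from this tutorial.

```python
def confusion_matrix(y_true, scores, threshold=0.5):
    """Return (tn, fp, fn, tp) for 0/1 labels and positive-class scores."""
    tn = fp = fn = tp = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def roc_auc(y_true, scores):
    """AUC as the chance that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data, made up for illustration only.
labels = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
```

Moving the threshold changes the confusion matrix but not the AUC, which is why the ROC view lets you inspect the matrix at different points on the same curve. In the h2o Python package, comparable figures are available from model.model_performance(frame) (auc() and confusion_matrix()) and from model.varimp().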
Once you are happy with the metrics of the model, you can proceed to prediction.
Go to the menu Score -> Predict and select the model that we have built and the dataframe to score.
After the prediction task is completed, you get the same metrics as in the model evaluation step above.
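In Python, the scoring step could look like the sketch below. It assumes model is a trained H2O model and frame is an H2OFrame with the same columns as the training data. The helper name score_frame is hypothetical.

```python
def score_frame(model, frame):
    """Score a frame and return (predictions, performance).

    Hypothetical helper: `model` is assumed to be a trained H2O model and
    `frame` an H2OFrame with the same columns as the training data.
    """
    # For a binary classifier, the result holds the predicted class
    # and one probability column per class.
    predictions = model.predict(frame)

    # Metrics recomputed on this frame (AUC, confusion matrix, ...);
    # the frame must contain the target column for this to work.
    performance = model.model_performance(frame)
    return predictions, performance
```

The returned performance object exposes the same kind of metrics as the ones reviewed earlier, for example performance.auc() and performance.confusion_matrix().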
You can download the whole flow on GitHub.