While many assume that a machine learning project is all about modeling, the reality is that the biggest task in a machine learning pipeline is data engineering, spanning data collection, data cleaning and organizing, and data mining. The chart below shows how much time data scientists typically spend on each part of a project.

Volume of time spent by data scientists (from the book “Feature Engineering Made Easy”)

Data engineering is the main pain point for data scientists, which raises the question of which techniques can be used to handle this part efficiently.

In this post, you will discover some feature engineering techniques frequently used in machine learning. It builds on the previous article, which covered regression modeling.

After going through this post, you will gain more knowledge on:

  • Exploratory Data Analysis (EDA)
  • Feature understanding
  • Feature improvement
  • Feature selection
  • Feature construction
  • Feature importance assessment

The dataset that we will be using comes from the UCI Machine Learning Repository. The aim is to predict whether a person makes over 50K a year based on 14 features, so it is a multivariate binary classification task. The number of instances is 48,842, which is convenient for a deep learning approach in addition to regular machine learning algorithms.

Let’s get started.

1. Exploratory Data Analysis (EDA)

We have already explained this in the previous article, so we won’t spend too much time on it. Basically, this step consists of loading the dataset into a pandas DataFrame, viewing the head of the dataset, and checking the columns and their assigned data types.
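To make this concrete, here is a minimal sketch of that inspection in Python, assuming the Adult census file has been downloaded locally as adult.csv; the file name and column list are assumptions, not part of the original article.

    import pandas as pd

    # Column names of the UCI Adult census data (local file name is an assumption)
    columns = ["age", "workclass", "fnlwgt", "education", "education-num",
               "marital-status", "occupation", "relationship", "race", "sex",
               "capital-gain", "capital-loss", "hours-per-week", "native-country", "income"]

    df = pd.read_csv("adult.csv", names=columns, na_values="?", skipinitialspace=True)

    print(df.head())    # first rows of the dataset
    print(df.dtypes)    # data type assigned to each column
    print(df.shape)     # number of instances and features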

2. Feature understanding

Before you decide on anything for the dataset, you must understand the data that you are dealing with, its statistical characteristics and attributes.

There are 4 types of structured data, and for each of them there are certain mathematical operations and visualizations that can be carried out on the dataset.

Type       Example       Mathematical operations                               Visualization
Nominal    Job title     Frequencies, mode                                     Bar, pie
Ordinal    Ranks         Frequencies, mode, median                             Bar, pie, stem
Interval   Salary band   Frequencies, mode, median, mean, standard deviation   Bar, pie, stem, box plot, histogram
Ratio      Income        Mean, standard deviation                              Histogram, box plot

Table 1: The 4 types of structured data
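As an illustration, here is a minimal sketch applying some of the operations from Table 1 to one nominal and one ratio column of the Adult dataset; the column names and the local file name are assumptions.

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("adult.csv")  # assumed local copy of the dataset with a header row

    # Nominal column: frequencies and mode, visualized with a bar chart
    print(df["occupation"].value_counts())
    print(df["occupation"].mode())
    df["occupation"].value_counts().plot(kind="bar")
    plt.show()

    # Ratio column: mean, median, standard deviation, visualized with a histogram
    print(df["age"].mean(), df["age"].median(), df["age"].std())
    df["age"].plot(kind="hist")
    plt.show()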

3. Feature improvement

At this step, you already have a good understanding of the data in front of you and can now embark on improvement tasks. Here are the points you will have to look at, both of which are sketched in the code example after this list:

  • Missing values handling: check for any missing values and deal with them, either by dropping them or by replacing them with the mean, median, or mode value, depending on the type of data.
  • Feature scaling: there are 2 types of feature scaling, normalization and standardization. The purpose is to bring the features onto comparable scales so that the model converges faster during fitting. You need to choose which one to use based on the cases below:
    • Normalization, also known as min-max scaling, rescales the data to values between 0 and 1. This is suitable for non-Gaussian data that contains no outliers. It is available as MinMaxScaler in the scikit-learn library.
    • Standardization, also known as z-score scaling, centers the data around a zero mean with unit variance and handles outliers better than normalization. You can use StandardScaler from scikit-learn.
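Below is a minimal sketch of both points, assuming the same local adult.csv file; the column choices and imputation strategies here are illustrative assumptions, not prescriptions from the original article.

    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    df = pd.read_csv("adult.csv")  # assumed local copy with a header row

    # Missing values: mode for a categorical column, median for a numerical one
    df["workclass"] = df["workclass"].fillna(df["workclass"].mode()[0])
    df["age"] = df["age"].fillna(df["age"].median())

    num_cols = ["age", "capital-gain", "capital-loss", "hours-per-week"]

    # Normalization: rescale each numerical column to the [0, 1] range
    df_minmax = pd.DataFrame(MinMaxScaler().fit_transform(df[num_cols]), columns=num_cols)

    # Standardization: zero mean and unit variance, more robust when outliers are present
    df_std = pd.DataFrame(StandardScaler().fit_transform(df[num_cols]), columns=num_cols)

    print(df_minmax.describe())
    print(df_std.describe())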

4. Feature selection

Most of the time, raw data have so many features that even a domain expert would struggle to efficiently assess their relevance to the modeling. Moreover, redundant information will slow down the fitting task. There are some mathematical operations that can help in identifying and removing those features.

For numerical variables, there are 2 tasks you need to perform, both sketched in the example after this list:

  • Remove constant or quasi-constant variables: this can be done with VarianceThreshold from scikit-learn.
  • Remove correlated variables: this is about looking at the correlation between variables, which can be computed with the pandas corr() method.
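Here is a minimal sketch of both tasks, assuming the same local adult.csv file; the variance threshold of 0.01 and the 0.8 correlation cutoff are assumptions chosen for illustration.

    import pandas as pd
    from sklearn.feature_selection import VarianceThreshold

    df = pd.read_csv("adult.csv")  # assumed local copy with a header row
    num_cols = ["age", "fnlwgt", "education-num", "capital-gain",
                "capital-loss", "hours-per-week"]

    # Remove constant or quasi-constant numerical variables
    selector = VarianceThreshold(threshold=0.01)
    selector.fit(df[num_cols])
    kept = [col for col, keep in zip(num_cols, selector.get_support()) if keep]

    # Flag highly correlated pairs with the pandas corr() method
    corr = df[kept].corr().abs()
    to_drop = {corr.columns[j]
               for i in range(len(corr.columns))
               for j in range(i + 1, len(corr.columns))
               if corr.iloc[i, j] > 0.8}

    print("Kept after variance filter:", kept)
    print("Candidates to drop for high correlation:", to_drop)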

For categorical variables, you need to compute the chi-square p-values using the chi2 function from scikit-learn. If a p-value is less than 0.05, there is a strong association between the two variables, and you should drop the dependent one.
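A minimal sketch of that check, assuming the categorical columns are label-encoded first; the column list and the encoding choice are assumptions.

    import pandas as pd
    from sklearn.feature_selection import chi2
    from sklearn.preprocessing import LabelEncoder

    df = pd.read_csv("adult.csv").dropna()  # assumed local copy with a header row
    cat_cols = ["workclass", "education", "marital-status", "occupation", "sex"]

    # chi2 expects non-negative numbers, so label-encode the categorical columns first
    X = df[cat_cols].apply(lambda col: LabelEncoder().fit_transform(col))
    y = LabelEncoder().fit_transform(df["income"])

    scores, p_values = chi2(X, y)
    for col, p in zip(cat_cols, p_values):
        print(f"{col}: p-value = {p:.4f}")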

5. Feature construction

This is the last step before the modeling itself. There are 2 things you need to do, both sketched in the example after this list:

  • Split the dataset into train and test sets while making sure that the class proportions are preserved (a stratified split)
  • Check for class imbalance and use SMOTE to oversample the minority class
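A minimal sketch of both steps, assuming the same local adult.csv file and a simple one-hot encoding of the categorical features; SMOTE comes from the imbalanced-learn package, and the label string ">50K" is an assumption about how the target is stored.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import SMOTE  # from the imbalanced-learn package

    df = pd.read_csv("adult.csv").dropna()           # assumed local copy with a header row
    X = pd.get_dummies(df.drop(columns=["income"]))  # one-hot encode categorical features
    y = (df["income"] == ">50K").astype(int)         # label string is an assumption

    # Stratified split keeps the class proportions identical in train and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    # Oversample only the training set so the test set stays representative
    X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

    print(y_train.value_counts())                  # before SMOTE
    print(pd.Series(y_train_res).value_counts())   # after SMOTE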

6. Feature assessment

After fitting the model, you need to proceed to feature assessment to understand which features have contributed the most to the result. This is very useful, as you will have to explain the outcome of the model to non-technical people.

There are 2 assessment tools that you need to look at: SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations). While many people think they are the same, they actually complement each other: SHAP assesses the importance of each feature to the model's output overall, while LIME assesses the importance of each feature for a single prediction.
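Here is a minimal sketch of both tools, assuming a random forest classifier and the X_train, X_test, y_train matrices produced in the construction step above; the shap and lime packages must be installed, and the model choice and class names are assumptions.

    import shap
    import lime.lime_tabular
    from sklearn.ensemble import RandomForestClassifier

    # X_train, X_test, y_train are assumed to come from the construction step above
    model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

    # SHAP: global view of how much each feature drives the model's predictions
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_test)
    shap.summary_plot(shap_values, X_test)

    # LIME: local explanation of one single prediction
    lime_explainer = lime.lime_tabular.LimeTabularExplainer(
        X_train.values, feature_names=list(X_train.columns),
        class_names=["<=50K", ">50K"], mode="classification")
    explanation = lime_explainer.explain_instance(X_test.values[0], model.predict_proba)
    print(explanation.as_list())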

The whole process explained above is implemented in this Jupyter notebook file.