No modeling technique can produce good results from poor-quality data. “Garbage in, garbage out.”

In this article, we will focus on data preprocessing techniques. In the real world, raw data is collected from various sources, using different methods and for multiple purposes. These circumstances make it incomplete, inconsistent, inaccurate, and sometimes irrelevant to the task at hand. This is why we need to clean, format, and organize raw data into a dataset suitable for machine learning. All of this must be done with one goal in mind: meeting the business requirement.

Steps in Data Preprocessing in Machine Learning

There are 6 steps in data preprocessing in machine learning:

1. Business requirement

Prior to data preprocessing, you need to understand the business requirement so that you can identify the relevance of the data and its attributes for the use cases at hand.

In this step, you need to seek answers to the following questions:

  • What are the pain points or the vision of the company? What do they want to achieve?
  • How would the company benefit from solving the problem?
  • How would the end clients benefit from solving the problem?
  • Does it need to be machine learning, or would a simple automation suffice?

Then, summarize and frame the requirements into a formal statement such as “Problem A has impact I and affects B, so the starting point is S.”

2. Dataset overview

Now you can start the technical process by loading the dataset and reviewing its top rows with the steps below:

  • Load the dataset into a Pandas DataFrame. This can be achieved with pandas.read_csv.
  • Verify the first rows with dataframe.head() and check that the column headers are correctly mapped, as in the sketch below.
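
Here is a minimal sketch of these two steps ("data.csv" is only a placeholder for your own file name):

```python
import pandas as pd

# Load the raw data into a DataFrame ("data.csv" is a placeholder file name).
df = pd.read_csv("data.csv")

# Preview the first rows and check that the column headers are mapped correctly.
print(df.head())
```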

3. Exploratory data analysis (EDA)

Once the data is correctly loaded, you can proceed to EDA, which is a necessary step before embarking on the data transformation process. In particular, look at the following points:

  • Shape and size of the dataset. You can get this information with dataframe.shape
  • Information about the dataset, such as the type of each column, with dataframe.info()
  • Summary statistics such as count, mean, standard deviation, min/max, and percentiles for each column, via dataframe.describe(). This is mainly for continuous variables.
  • Distribution of categorical variables. Just as for continuous variables, it is important to understand and visualize the distribution of the values. As there is no single ready-made function for this, you can adapt my function get_category_distribution() to your needs (an approximate version is sketched after this list).
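
The following is a minimal EDA sketch. The get_category_distribution() helper shown here is only an approximation built on pandas value_counts(); the version in my notebook may differ.

```python
import pandas as pd

df = pd.read_csv("data.csv")  # placeholder file name

print(df.shape)       # number of rows and columns
df.info()             # column types and non-null counts
print(df.describe())  # count, mean, std, min/max, percentiles

# Approximate category distribution helper: print the value counts and their
# relative frequency for each categorical column.
def get_category_distribution(frame):
    for col in frame.select_dtypes(include=["object", "category"]).columns:
        counts = frame[col].value_counts(dropna=False)
        print(f"\n{col}")
        print(pd.concat([counts, counts / len(frame)], axis=1,
                        keys=["count", "ratio"]))

get_category_distribution(df)
```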

4. Data cleaning and preprocessing

At this step, you will handle missing and duplicate values.

Missing values can be detected with dataframe.isnull().sum(). You then have to fill them in with some value or drop them. For continuous variables, you can use the mean or median of the column; for categorical variables, the mode (the most frequent value).

As for duplicates, you can simply drop them to reduce the size of the dataset and thus speed up training later on. You can achieve this with dataframe.drop_duplicates().

You can take inspiration from the class Data_preprocessing_handling that is available in the Jupyter notebook file.
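
As a minimal sketch of these two operations (not the notebook's Data_preprocessing_handling class; "num_col" and "cat_col" are placeholder column names):

```python
# df is the DataFrame loaded in step 2.

# Count missing values per column.
print(df.isnull().sum())

# Fill missing continuous values with the median and categorical values with the mode.
df["num_col"] = df["num_col"].fillna(df["num_col"].median())
df["cat_col"] = df["cat_col"].fillna(df["cat_col"].mode()[0])

# Drop duplicate rows to shrink the dataset and speed up training.
df = df.drop_duplicates()
```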

5. Outlier treatment and normalization

Continuous data is prone to outliers. Outliers can badly affect the quality of the result, especially when the deviation is high. You can visualize them with seaborn.boxplot() and then decide, mindfully, whether to drop them or to cap them using bounds derived from the Interquartile Range (IQR).
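
A minimal sketch of the capping approach, assuming a placeholder column "num_col" and the common 1.5 × IQR rule:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Visualize outliers of a continuous column ("num_col" is a placeholder name).
sns.boxplot(x=df["num_col"])
plt.show()

# Cap values outside the 1.5 * IQR bounds (dropping the rows is the other option).
q1, q3 = df["num_col"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["num_col"] = df["num_col"].clip(lower, upper)
```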

It is also recommended, though not mandatory, to normalize the values so that all columns fall within a comparable range.
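
One way to do this is min-max scaling with scikit-learn (just one option among several, applied here to all numeric columns for illustration):

```python
from sklearn.preprocessing import MinMaxScaler

# Scale every numeric column to the [0, 1] range so they share a comparable scale.
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])
```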

6. Feature selection techniques

This is the last preprocessing step before you move on to modeling itself. You need to identify which columns are relevant to the training.

For continuous variables, you can remove variables with constant or quasi-constant variance. These can be identified with sklearn.feature_selection.VarianceThreshold().
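
For example (the threshold value 0.01 is an arbitrary illustration; tune it to your data):

```python
from sklearn.feature_selection import VarianceThreshold

# Keep only the numeric columns whose variance exceeds a small threshold.
numeric = df.select_dtypes(include="number")
selector = VarianceThreshold(threshold=0.01)
selector.fit(numeric)
kept = numeric.columns[selector.get_support()]
print("Columns kept:", list(kept))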

For categorical variables, you can run dataframe.corr() (after encoding them numerically), plot a heatmap of the result with seaborn.heatmap(), and manually identify and remove correlated columns. Alternatively, for a deeper statistical analysis, you can run a chi-square test.
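
A short sketch of both options, using scipy for the chi-square test ("cat_a" and "cat_b" are placeholder column names):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency

# Correlation heatmap of the numeric (or numerically encoded) columns.
corr = df.select_dtypes(include="number").corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()

# Chi-square test of independence between two categorical columns.
contingency = pd.crosstab(df["cat_a"], df["cat_b"])
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi2={chi2:.2f}, p-value={p_value:.4f}")
```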

The whole process above has been implemented in a Jupyter notebook, which is shared on GitHub.