The data preprocessing phase is the most challenging and time-consuming part of data science, but it's also one of the most important parts. If you fail to clean and prepare the data, it could compromise the model.

"If 80 percent of our work is data preparation, then ensuring data quality is the important work of a machine learning team." - Andrew Ng

When dealing with real-world data, data scientists will always need to apply some preprocessing techniques in order to make the data more usable. These techniques will facilitate its use in machine learning (ML) algorithms, reduce the complexity to prevent overfitting, and result in a better model.

With that said, let's get into an overview of what data preprocessing is, why it's important, and the main techniques to use in this critical phase of data science. Here's everything we'll cover in this guide:

- What is data preprocessing?
- Why you should use data preprocessing
- Important techniques for the preprocessing phase
- The data preprocessing pipeline

What is Data Preprocessing?

After understanding the nuances of your dataset and the main issues in the data through exploratory data analysis, data preprocessing comes into play by preparing your dataset for use in the model. In an ideal world, your dataset would be perfect and without any problems. Unfortunately, real-world data will always present some issues that you'll need to address. Consider, for instance, the data you have in your company. Can you think of any inconsistencies such as typos, missing data, different scales, etc.? These issues happen often in the real world and need to be adjusted to make the data more useful and understandable. This process, where we clean and solve most of the issues in the data, is what we call the data preprocessing step.

Why is Data Preprocessing Important?

If you skip the data preprocessing step, it will affect your work later on when applying the dataset to a machine learning model. Most models can't handle missing values, and some are affected by outliers, high dimensionality and noisy data. By preprocessing the data, you'll make the dataset more complete and accurate. This phase is critical for making the necessary adjustments to the data before feeding the dataset into your machine learning model.

Important Data Preprocessing Techniques

Now that you know more about the data preprocessing phase and why it's important, let's look at the main techniques to apply to the data to make it more usable for our future work. The techniques that we'll explore are:

- Data Cleaning
- Dimensionality Reduction
- Feature Engineering
- Sampling Data
- Data Transformation
- Imbalanced Data

Technique 1: Data Cleaning

One of the most important aspects of the data preprocessing phase is detecting and fixing bad and inaccurate observations in your dataset in order to improve its quality. This technique refers to identifying incomplete, inaccurate, duplicated, irrelevant or null values in the data. After identifying these issues, you will need to either modify or delete them. The strategy you adopt depends on the problem domain and the goal of your project. Let's look at some of the common issues we face when analyzing data and how to handle them.

Noisy Data

Usually, noisy data refers to meaningless data in your dataset, incorrect records, or duplicated observations. For example, imagine there is a column in your database for "age" that has negative values. In this case, the observation doesn't make sense, so you could delete it or set the value as null (we'll cover how to treat this value in the "Missing Data" section). Another case is when you need to remove unwanted or irrelevant data. For example, say you need to predict whether a woman is pregnant or not. You don't need information about hair color, marital status or height, as these attributes are irrelevant for the model.

An outlier can be considered noise, even though it might be a valid record, depending on the outlier. You'll need to determine whether the outlier can be considered noise and whether you can delete it from your dataset or not.

Solution: A common technique for noisy data is the binning approach, where you first sort the values, then divide them into "bins" (buckets of equal size), and then apply the mean or median of each bin to smooth the values, as in the sketch below. If you want to learn more, here is a good article on dealing with noisy data.
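To make the binning idea concrete, here is a minimal sketch using pandas. The "age" column, the number of bins and the use of equal-frequency bins are illustrative assumptions, not the only way to apply the technique.

```python
import pandas as pd

# Illustrative toy column with a few "age" readings to smooth
df = pd.DataFrame({"age": [21, 23, 24, 25, 29, 31, 35, 38, 44, 46, 52, 70]})

# 1) sort the values, 2) split them into equal-frequency bins,
# 3) smooth each value by replacing it with the mean of its bin
df = df.sort_values("age").reset_index(drop=True)
df["bin"] = pd.qcut(df["age"], q=4, labels=False)
df["age_smoothed"] = df.groupby("bin")["age"].transform("mean")

print(df)
```

Smoothing by the bin median works the same way; just swap "mean" for "median" in the transform call.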
Missing Data

Another common issue that we face in real-world data is the absence of data points. Most machine learning models can't handle missing values in the data, so you need to intervene and adjust the data to be properly used inside the model. There are different approaches you can take to handle it (usually called imputation):

Solution 1: The simplest solution is to remove the observations that contain missing values. However, this is only recommended if: 1) you have a large dataset and only a few missing records, so removing them won't impact the distribution of your dataset; or 2) most of the attributes of that observation are null, so the observation itself is meaningless.

Solution 2: Another solution is to use a global constant to fill the gap, like "NA" or 0, but only if it's difficult to predict the missing value. An alternative option is to use the mean or median of that attribute to fill the gap.

Solution 3: Using the backward/forward fill method is another approach, where you take either the previous or the next value to fill the missing one.

Solution 4: A more robust approach is to use machine learning algorithms to fill these missing data points. For example:

- Using k-nearest neighbors (KNN), first find the k instances closest to the instance with the missing value, and then fill the gap with the mean of that attribute across those k neighbors.
- Using regression, for each attribute with missing values, learn a regressor that can predict the missing value based on the other attributes.

It's not easy to choose a specific technique to fill the missing values in our dataset, and the approach you use strongly depends on the problem you are working on and the type of missing value you have. This topic goes beyond the scope of this article, but keep in mind that we can have three different types of missing values, and each has to be treated differently:

- Type 1: Missing Completely at Random (MCAR)
- Type 2: Missing at Random (MAR)
- Type 3: Missing Not at Random (MNAR)

If you are familiar with Python, the sklearn library has helpful tools for this data preprocessing step, including the KNN Imputer I mentioned above.
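As a minimal sketch of a few of these imputation strategies with pandas and scikit-learn, here is an example; the toy DataFrame and its column names are made up purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Toy data with missing values (NaN) in two numeric columns
df = pd.DataFrame({
    "age":    [25, np.nan, 31, 44, 52, np.nan],
    "income": [40_000, 52_000, np.nan, 61_000, 75_000, 58_000],
})

# Solution 2: fill each gap with the mean of its column
mean_filled = SimpleImputer(strategy="mean").fit_transform(df)

# Solution 3: forward fill, carrying the previous value down each column
forward_filled = df.ffill()

# Solution 4: fill each gap using the 2 nearest neighbors of that row
knn_filled = KNNImputer(n_neighbors=2).fit_transform(df)

print(pd.DataFrame(knn_filled, columns=df.columns))
```

In a real project, remember to fit the imputer on the training data only and then apply it to the test data, typically inside a scikit-learn Pipeline, so no information leaks from the test set.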
Structural Errors

Structural errors usually refer to typos and inconsistencies in the values of the data. For example, say we run a marketplace where different sellers list the same shoes, so the data about the same product can be written in different ways. Imagine that one of the attributes we have is the brand of the shoes, and aggregating the brand name for the same shoes we find: Nike, nike, NIKE. We need to fix this issue before giving the data to the model; otherwise, the model may treat them as different things. In this case, it's an easy fix: just transform all the words to lowercase. Other scenarios may require more complex changes to fix inconsistencies and typos, though. This issue generally requires manual intervention rather than automated techniques.

Technique 2: Dimensionality Reduction

Dimensionality reduction is concerned with reducing the number of input features in the training data.

The Curse of Dimensionality in Your Dataset

A real-world dataset usually has a large number of attributes, and if we don't reduce this number, it may affect the model's performance later when we feed it this dataset. Reducing the number of features while keeping as much variation in the dataset as possible will have a positive impact in many ways, such as:

- Requiring fewer computational resources
- Increasing the overall performance of the model
- Preventing overfitting (when the model becomes too complex and memorizes the training data instead of learning, so performance on the test data decreases significantly)
- Avoiding multicollinearity (high correlation between independent variables)

Applying this technique will also reduce the noise in the data. Let's dive into the main types of dimensionality reduction we can apply to our data to make it better for later use.

Feature Selection

Feature selection refers to the process of selecting the most important variables (features) related to your prediction variable; in other words, selecting the attributes that contribute most to your model. Here are some techniques for this approach that you can apply either automatically or manually (a short code sketch follows this section):

- Correlation Between Features: This is the most common approach, which drops features that have a high correlation with others.
- Statistical Tests: Another alternative is to use statistical tests to select features, checking the relationship of each feature individually with the output variable. There are many examples in the scikit-learn library, such as SelectKBest, SelectPercentile, chi2, f_classif and f_regression.
- Recursive Feature Elimination (RFE): Also known as backward elimination, this is an approach where the algorithm trains the model with all features in the dataset, calculates the model's performance, and then drops one feature at a time, stopping when the performance improvement becomes negligible.
- Variance Threshold: Another feature selection method is the variance threshold, which detects features with high variability within the column, selecting those that exceed the threshold. The premise of this approach is that features with low variability within themselves have little influence on the output variable.

Also, some models automatically apply feature selection during training. Decision-tree-based models can provide information about feature importance, giving you a score for each feature of your data. The higher the value, the more relevant that feature is for your model.
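Here is a minimal sketch of a few of the feature-selection techniques above using scikit-learn. The iris dataset, the number of features kept and the variance cut-off are arbitrary choices made just for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, VarianceThreshold, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Statistical test: keep the 2 features most related to the target (ANOVA F-test)
X_best = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Variance threshold: drop features whose variance falls below the cut-off
X_var = VarianceThreshold(threshold=0.2).fit_transform(X)

# Recursive feature elimination: repeatedly drop the weakest feature
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
print("RFE kept:", rfe.support_)

# Tree-based feature importances: one score per feature, higher means more relevant
forest = RandomForestClassifier(random_state=0).fit(X, y)
print("Importances:", forest.feature_importances_)
```

Each selector exposes a get_support() method that tells you which original columns were kept, so you can apply the same selection consistently to new data.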