- When to use Machine Learning?
- The Machine Learning pipeline
- Define the problem
- Collect the data
- Come up with a ‘good’ performance metric
- Set an effective evaluation protocol
- Prepare the data
- Create a benchmark model
- Tune and optimize the model
Machine Learning and related fields have become the backbone of the technology stack of many businesses. Today, ML helps businesses in a plethora of ways, e.g. predicting customer behavior, identifying revenue opportunities, spotting market trends, and supporting decision making as a whole. That said, developing a capable and effective ML application requires you to work through several distinct stages, starting with careful planning.
Clarifying the objective, collecting and analyzing data, preparing the data, and training the model are some of the steps usually followed when developing such applications.
Machine Learning is a statistical subset of Artificial Intelligence, a field concerned with the extraction and mining of patterns. It therefore plays a significant role in extracting useful information from historical data and in decision making. It is concerned with discovering whatever patterns the data may contain and creating mathematical models to support those findings.
These mathematical models are the backbone of the field. Once we find a suitable model for our data, we can use it for inference. Take sales, for example: we could use a model to predict whether a user would like a similar product, given the decisions that user has made in the past.
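To make that concrete, here is a minimal sketch of fitting a model on past purchase decisions and using it for inference. The features, data, and choice of logistic regression are all illustrative assumptions, not a prescription:

```python
# Minimal sketch: learn from past purchase decisions, then predict.
# Features and labels below are made-up toy data.
from sklearn.linear_model import LogisticRegression

# Hypothetical history: [items_viewed, similar_items_bought] per user
X_past = [[3, 0], [8, 2], [1, 0], [12, 4], [5, 1], [2, 0]]
y_past = [0, 1, 0, 1, 1, 0]  # 1 = user bought the recommended product

model = LogisticRegression()
model.fit(X_past, y_past)

# Inference: would a new user with a similar history buy?
print(model.predict([[7, 2]]))        # predicted class (0 or 1)
print(model.predict_proba([[7, 2]]))  # class probabilities
```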
When to use Machine Learning?
Machine Learning is undoubtedly a very powerful tool. It can produce an educated guess about future events based on historical data. But ML also has drawbacks and shortcomings that can prove to be deal-breakers in many cases.
Sometimes, in fact, it's better to rely on conventional, non-learning techniques: to apply domain knowledge and hard-code the rules rather than learn them.
Machine Learning is mainly useful in two scenarios. First, when it is very hard (sometimes nearly impossible) to hand-code the rules. The difficulty lies in identifying and implementing the rules, since in most cases many interdependent rules have to work together.
This scenario covers tasks that cannot be done, or generalized well enough, with hard-coded rules. Second, when the scale of the data is simply too large. You can define rules manually for a handful of samples, but at scale it becomes financially infeasible and time-consuming. It's better to build a mathematical model that automates the whole task while resting on a solid numerical foundation.
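As a contrived illustration of the trade-off (every feature and threshold here is made up), compare a hand-written rule with a model that learns an equivalent decision from labeled examples:

```python
# Hand-written rule: easy for two features, brittle as rules multiply.
def is_spam_rule_based(num_links: int, has_urgent_words: bool) -> bool:
    return num_links > 5 or has_urgent_words

# Learned alternative: instead of maintaining an ever-growing rule set,
# let a decision tree infer the thresholds from labeled examples.
from sklearn.tree import DecisionTreeClassifier

X = [[1, 0], [7, 1], [2, 0], [9, 0], [0, 1], [3, 0]]  # [num_links, has_urgent_words]
y = [0, 1, 0, 1, 1, 0]                                 # 1 = spam
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[6, 1]]))
```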
The Machine Learning pipeline
In this section we take a look at the overall procedure for developing an ML model and application. The whole process can be divided into several smaller parts, which in turn often consist of even smaller sub-parts. Usually, dedicated personnel are assigned to each part, and their work is constrained to that section. Without further ado, let's take a look at the different parts.
1. Define the problem
Defining the problem correctly plays a huge role, yet people often overlook this step. You need to consider many questions when assessing your organization's needs. What is the main objective? What are you trying to predict? What are the features? Will it be offline or online learning (in offline learning you train on a static dataset, whereas in online learning you learn as the data comes in)? What kind of problem is it: clustering, binary classification, multiclass classification?
So, as you can see, this involves a lot of discussion, going back and forth, and revising the spec sheet. It is also worth remembering that not every problem can be solved effectively. A model is only as good as its training data, and it makes the underlying assumption that future behavior will resemble the past, which is not always true.
2. Collect the data
While the previous step is mostly a planning phase, data collection can be considered the first real step towards developing a model. You ought to make sure you have quality data, as the success of the whole project depends heavily on it. You also have to make sure you have enough data for the training phase, since most complex models are notorious for being sample-inefficient.
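An early sanity pass over newly collected data is much cheaper than discovering problems mid-training. A rough sketch of what that might look like with pandas (the file name and columns are placeholders):

```python
# Quick checks on freshly collected data (hypothetical CSV).
import pandas as pd

df = pd.read_csv("collected_data.csv")  # placeholder dataset

print(len(df))                # do we have enough samples for the model class?
print(df.isna().mean())       # fraction of missing values per column
print(df.duplicated().sum())  # duplicate rows inflate apparent data volume
print(df.describe())          # spot obviously broken ranges and outliers
```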
3. Come up with a ‘good’ performance metric
Choosing a good measure of the model's quality is equally important, if not more so. Countless performance metrics have been proposed over the years: some fairly simple and intuitive, some relatively complex, and some curated specifically for certain types of tasks. It's your task to pick, or design, a metric that matches the type of problem you are facing and aligns with the higher-level goals of your business.
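A small sketch of why the choice matters, using made-up labels: on imbalanced data, accuracy can look fine while the metric your business actually cares about (say, recall on the rare class) is poor.

```python
# Same predictions, different stories depending on the metric.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]  # only 2 positives out of 10
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]  # one positive missed

print("accuracy :", accuracy_score(y_true, y_pred))   # 0.9, looks great
print("precision:", precision_score(y_true, y_pred))  # 1.0
print("recall   :", recall_score(y_true, y_pred))     # 0.5, half the positives missed
print("f1       :", f1_score(y_true, y_pred))
```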
4. Set an effective evaluation protocol
Now that you have the prerequisites in place, you need to decide how you will measure progress. Many evaluation protocols are popular among data scientists; keeping a validation set completely separate from both training and testing is among the most common. Other methodologies include K-Fold validation and iterated K-Fold validation with shuffling, as sketched below.
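Here is a rough sketch of the two protocols just mentioned, on synthetic placeholder data: a held-out test split that stays untouched until the end, plus K-Fold cross-validation (with shuffling) on the training portion.

```python
# Hold-out split + K-Fold cross-validation on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)

# Hold-out: keep the test set untouched until the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# K-Fold with shuffling, run on the training portion only.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=kfold)
print(scores.mean(), scores.std())
```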
5. Prepare the data
Data preparation is itself a very big topic involving many intermediate steps. You need to deal with missing data. You need to think about feature scaling, as models often perform better and converge faster when features are on the same scale. You also have to consider dropping or adding features (often called feature engineering) depending on how informative they are. Sometimes you will also find yourself encoding or otherwise transforming categorical features.
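One way to wire these steps together is a preprocessing pipeline; the sketch below covers imputation, scaling, and categorical encoding (the column names are hypothetical):

```python
# Preprocessing sketch: impute missing values, scale numeric features,
# one-hot encode categorical ones. Column names are placeholders.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_features = ["age", "income"]          # placeholders
categorical_features = ["country", "device"]  # placeholders

numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # deal with missing data
    ("scale", StandardScaler()),                   # put features on one scale
])

preprocess = ColumnTransformer([
    ("num", numeric_pipeline, numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])
# preprocess.fit_transform(df) would yield a model-ready feature matrix.
```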
6. Create a benchmark model
The goal of this step is to develop a baseline model against which you will compare every subsequent model. Creating a benchmark model makes the whole process measurable, comparable, and reproducible.
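A minimal baseline sketch, on synthetic data: scikit-learn's DummyClassifier simply predicts the most frequent class, and any real model you train afterwards should beat its score.

```python
# Trivial baseline: always predict the majority class.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("baseline accuracy:", baseline.score(X_test, y_test))
```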
7. Tune and optimize the model
Now that we have a baseline against which to compare our actual models, we move on to finding and developing better ones. Finding a good model can be daunting given the bazillion different models and methodologies available. Discovering the right set of hyperparameters also has a huge impact on a model's performance.
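One common way to search for those hyperparameters is a cross-validated grid search; the model choice and grid values below are illustrative assumptions, not recommendations.

```python
# Hyperparameter search sketch: GridSearchCV cross-validates every
# combination in the grid and keeps the best one.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, None],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```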
Conclusion
We have broken down the whole workflow of developing Machine Learning-based applications and studied each step. Not every organization has to follow this set of steps to a tee. Organizations have different internal structures, so they may want to merge multiple phases into one or decompose some phases even further. It's entirely up to your organization how much emphasis to put on each phase.