Supervised Machine Learning: Types, Overfitting, and Practical Scenarios

Reading Time:

4 minutes

When a defined set of data is provided to the computer to analyse it and make decisions based on it, it is termed supervised learning. Human supervision is needed to train the model. The algorithm is a paradigm of machine learning that is provided with labelled data where it maps the input data to output labels based on the patterns appearing from it.

The provided data consists of examples related to a specific topic and its correct output. The algorithm then uses it to predict correct outcomes. 

Types of Supervised Learning:

Techniques for supervised learning can be classified as either regression or classification. While regression approaches forecast continuous responses, classification techniques forecast discrete responses. The explanation of these two methods that follows provides a real-world illustration of how supervised machine learning functions.

Classification Algorithm

A classification algorithm is employed when a finite set of labelled, or sample data is available. It aims to classify the output data into multiple classes and categories.

These algorithms are applied in email spam detection, image recognition, medical diagnostic tests, customer behaviour prediction etc. A probability score is generated by the classification algorithm to produce a yes or no answer to a query and categorize it. 

Let’s consider the employment of a classification algorithm in categorizing email as spam or not spam.

  1. Data collection takes place in the first step where a dataset of emails that are spam or not spam is obtained.
  2. The collected data is then preprocessed before using it on a model.
  3. The right model is selected such as a decision tree which is a type of classification algorithm.
  4. The dataset is then split into a sample and test dataset that is labelled as spam or otherwise, so that the model can obtain associations and patterns. 
  5. The trained model is then evaluated using different metrics and scores.
  6. Different unknown or experimental parameters are tested on the model to ensure its accuracy.
  7. After training and testing, the model is ready to classify the emails into spam or not. 

This is the procedure followed by the classification algorithm to separate spam emails from useful ones.

Regression Algorithm

A regression algorithm is used when the target variable is continuous, meaning it is continually updated and more suitable for numerical values. It predicts a real value and is mostly employed in engineering and finance facilities.

It shows the relationship between a dependent and an independent variable. Some applications of regression algorithms are the prediction of market trends, the likelihood of an accident on a road, etc. 

An application of regression analysis is Weather forecasting. By analyzing historical data and current situations, supervised learning plays a vital role in weather forecasting by making correct predictions.

  1. Systems gather a large amount of data like humidity, temperature, wind direction, wind speed, atmospheric pressure, and precipitation levels from various sources such as satellites, radar systems and weather stations.
  2. Labeled data is then provided which includes the historical patterns for the algorithm to look through. These historical patterns serve as the sample data which provides the corresponding weather conditions like cloudy, sunny etc.
  3. A paradigm of supervised learning is trained on historical data. 
  4. The model after training is evaluated using validation data to assess its accuracy. Different matrices are used for quantitative measures.
  5. After training and evaluation, the model is used to make predictions on the current weather using real-time data.
  6. These models are continuously updated with new sample data to ensure maximum accuracy in predictions.

These are the steps the weather forecasting systems take to help businesses, governments and individuals extract patterns in the weather and avoid potential risks with adverse conditions.

Overfitting

The sample data is updated regularly, and the machine learning model is confirmed repeatedly to avoid a problem called overfitting, which is the model only works on the patterns of training data, but not on new data. The model is too fit for the sample data and does not recognize new data. This happens when the model is too complex, is provided with a small set of training data with a lot of irrelevant information (noise), or is trained over the same data for too long. 

Considering the same two scenarios as before, if a spam detection model is trained on a dataset that contains mostly emails with certain keywords or phrases, it may learn to classify emails as spam based solely on the presence of those keywords, even if they are not indicative of spam in general and if a weather forecasting model is trained on historical weather data for a specific location and period, it may learn to fit the noise in the data rather than the actual patterns and trends.



Comments

Leave a Reply

Your email address will not be published. Required fields are marked *