Logistic Regression: A Powerful Tool for Classification

Reading time: 5 minutes • May 2, 2024

Logistic regression is a statistical method for classification in supervised learning. Given a set of input features, it predicts the probability that the input belongs to a certain class, and that probability is then turned into a binary outcome: one of two possibilities such as 1 or 0, yes or no, or true or false.

The algorithm looks for a decision boundary, the surface that separates one class from the other. In general, decision boundaries can be simple or complex; in logistic regression, the boundary is assumed to be linear in the input features.

The parameters of logistic regression are called weights. The weighted sum of the input features is mapped to a value between 0 and 1 by the logistic function, and the algorithm tunes the weights so that inputs are assigned to the right category.

Logistic regression is one of the simplest classification models and is often the go-to starting point. When the classes are linearly separable, or close to it, it is efficient to train and can give accurate results. It is used in a wide range of fields such as cybersecurity, email spam filtering, speech recognition, and finance.

Logistic Function (Sigmoid Function)

The logistic or sigmoid function is the core of logistic regression: it converts the model's raw score into a probability, from which the binary outcome is decided. It is given below:

$$\sigma(x) = \frac{1}{1+e^{-x}}$$

Where $x$ is the linear combination of input values and $\sigma(x)$ outputs a value between 0 and 1.
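
To make the formula concrete, here is a minimal Python sketch of the sigmoid function. The function name `sigmoid` and the branching used to avoid overflow for large negative inputs are illustrative choices, not something prescribed by the formula itself.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic (sigmoid) function: maps any real number into the range 0 to 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use the equivalent form e^x / (1 + e^x) so that
    # math.exp never receives a large positive argument.
    z = math.exp(x)
    return z / (1.0 + z)

# Large negative inputs give values near 0, large positive inputs values near 1.
for x in (-5.0, 0.0, 5.0):
    print(x, round(sigmoid(x), 4))
```

An input of 0 sits exactly in the middle and gives 0.5.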

The Formula for Logistic Regression

Consider an input described by $n$ feature values:

$$X = (x_{1}, x_{2}, \dots, x_{n})$$

Logistic regression calculates the probability that the output is 1 given this input:

$$P(Y=1 |X)$$

The following formula is used:

$$P(Y=1 \mid X) = \sigma(\beta_{0}+ \beta_{1}x_{1} + \dots + \beta_{n}x_{n}) = \frac{1}{1+e^{-(\beta_{0}+ \beta_{1}x_{1} + \dots + \beta_{n}x_{n})}}$$

Here, $\beta_{0}$ is the intercept and $\beta_{1}, \dots, \beta_{n}$ are the coefficients (weights) of the input features.
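
As a rough illustration, the probability can be computed in a few lines of Python. The function name `predict_proba` and the coefficient values below are made up for the example; in practice the coefficients come from training.

```python
import math

def predict_proba(x, beta0, betas):
    """Return P(Y=1 | X=x) for intercept beta0 and coefficients betas."""
    # Linear combination: beta0 + beta1*x1 + ... + betan*xn
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    # Sigmoid of the linear combination (fine for moderate values of z).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical intercept and coefficients for a two-feature model.
beta0 = -1.0
betas = [2.0, -0.5]

print(predict_proba([1.0, 2.0], beta0, betas))  # linear part is 0, so 0.5
print(predict_proba([1.5, 1.0], beta0, betas))  # linear part is 1.5, so above 0.5
```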

Decision Boundary

As mentioned earlier, the decision boundary determines whether an input belongs to a class or not. To make that decision, a threshold is applied to the predicted probability: inputs at or above the threshold are assigned to the class, and the rest are not. The threshold is commonly 0.5.
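
The thresholding step can be sketched as follows. The helper names and coefficient values are hypothetical, and treating a probability exactly equal to the threshold as class 1 is a convention chosen for this example. The sketch also shows why the boundary is linear: with a threshold of 0.5, the probability is at least 0.5 exactly when the linear part is at least 0.

```python
import math

def classify(probability, threshold=0.5):
    """Return 1 if the probability is at or above the threshold, else 0."""
    return 1 if probability >= threshold else 0

def linear_score(x, beta0, betas):
    """Linear part of the model: beta0 + beta1*x1 + ... + betan*xn."""
    return beta0 + sum(b * xi for b, xi in zip(betas, x))

# Hypothetical intercept and coefficients.
beta0, betas = -1.0, [2.0, -0.5]

for x in ([1.5, 1.0], [0.0, 1.0]):
    z = linear_score(x, beta0, betas)
    p = 1.0 / (1.0 + math.exp(-z))
    # With threshold 0.5, the predicted class agrees with the sign of z.
    print(x, round(p, 3), classify(p), z >= 0)
```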

Key Properties of Logistic Regression

The key properties of logistic regression include the following:

Logistic regression models the outcome with a Bernoulli distribution, a discrete probability distribution with only two possible outcomes.
The parameters are estimated by maximum likelihood, a statistical technique often used in machine learning: the weights are chosen so that the observed labels in the training data are as probable as possible under the model.
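
These two properties fit together in the Bernoulli log-likelihood, which can be sketched as below. The function name and the probability values are invented for illustration; the point is only that predictions closer to the true labels score a higher log-likelihood.

```python
import math

def log_likelihood(probs, labels):
    """Bernoulli log-likelihood of binary labels given predicted P(Y=1)."""
    total = 0.0
    for p, y in zip(probs, labels):
        # A label of 1 contributes log(p); a label of 0 contributes log(1 - p).
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

labels = [1, 0, 1]
good_probs = [0.9, 0.2, 0.8]  # made-up predictions that agree with the labels
poor_probs = [0.4, 0.7, 0.5]  # made-up predictions that mostly disagree

print(log_likelihood(good_probs, labels))
print(log_likelihood(poor_probs, labels))
```

Maximum likelihood estimation searches for the weights that make this quantity as large as possible.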

Key Assumptions for Implementing Logistic Regression

The output or dependent variable is binary, which means it takes one of two possible values.
The independent variables have a linear relationship with the log odds of the outcome.
The sample size is large enough for the estimates to be stable.
The input or independent variables are not strongly correlated with one another (little or no multicollinearity).
The data contains no extreme outliers that could distort the fit.
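
The log-odds assumption in the list above follows directly from the sigmoid formula. As a short worked derivation, write $p = P(Y=1 \mid X)$ and $z = \beta_{0}+ \beta_{1}x_{1} + \dots + \beta_{n}x_{n}$ (the symbols $p$ and $z$ are shorthand introduced here):

$$p = \frac{1}{1+e^{-z}} \quad\Rightarrow\quad 1-p = \frac{e^{-z}}{1+e^{-z}} \quad\Rightarrow\quad \frac{p}{1-p} = e^{z}$$

$$\log\left(\frac{p}{1-p}\right) = \beta_{0}+ \beta_{1}x_{1} + \dots + \beta_{n}x_{n}$$

So a one-unit increase in $x_{j}$ changes the log odds by $\beta_{j}$, with the other inputs held fixed.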

Types of Logistic Regression

Binary Logistic Regression

It is the simplest and most widely used type of logistic regression. The output belongs to one of two categories. For example, classifying an organism as a vertebrate or an invertebrate (it can only be one of the two).

Multinomial Logistic Regression

Here the output belongs to one of three or more categories with no natural ordering among them. An example is predicting which genre of book a customer is likely to read from several options.
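
One common way to extend the sigmoid to several unordered classes is the softmax function, which turns one score per class into probabilities that sum to 1. The sketch below assumes that formulation; the genre names and scores are invented, and in a real model each score would be its own linear combination of the inputs.

```python
import math

def softmax(scores):
    """Convert one raw score per class into probabilities that sum to 1."""
    # Subtracting the maximum score keeps math.exp from overflowing.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical genres and per-genre scores for one customer.
genres = ["fiction", "history", "science"]
scores = [2.0, 1.0, 0.1]

probs = softmax(scores)
predicted = genres[probs.index(max(probs))]
print([round(p, 3) for p in probs], predicted)
```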

Ordinal Logistic Regression

Here the output belongs to one of several categories that have a natural ordering. For example, a business predicts whether a product’s rating will be low, medium, or high.

Real-life Scenario With Logistic Regression

Credit card fraud detection uses logistic regression to spot unusual transaction patterns so that action can be taken against them. Millions of transactions take place every day, and reviewing them all manually is impossible even with a large number of employees. This is where an automated system comes into the picture. The steps below describe how logistic regression helps:

Data Collection

As with all models, a historical dataset is required for training. In this case, the data includes information about the transactions, such as amount, location, and time, and about the cardholder, such as credit score, location, and spending habits. Because this is supervised learning, each transaction is labelled as either fraudulent or legitimate.

Data Processing

The training data may need processing. For example, time zones may need to be adjusted, or locations may need to be converted into cities.
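
A processing step of this kind might look like the sketch below. Everything in it is an assumption made for illustration: the record fields (`local_time`, `utc_offset_hours`, `postcode`), the postcode-to-city lookup table, and the choice of UTC as the common time zone.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical lookup table; a real system would use a full location database.
CITY_BY_POSTCODE = {"10001": "New York", "94105": "San Francisco"}

def to_utc(local_iso, utc_offset_hours):
    """Interpret a local timestamp with a fixed UTC offset and convert it to UTC."""
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    local_time = datetime.fromisoformat(local_iso).replace(tzinfo=local_tz)
    return local_time.astimezone(timezone.utc)

def preprocess(txn):
    """Turn one raw transaction record into cleaned, model-ready fields."""
    utc_time = to_utc(txn["local_time"], txn["utc_offset_hours"])
    return {
        "amount": float(txn["amount"]),
        "hour_utc": utc_time.hour,
        "city": CITY_BY_POSTCODE.get(txn["postcode"], "unknown"),
    }

# Made-up raw record: 21:30 local time at UTC-5 is 02:30 UTC the next day.
raw = {
    "amount": "129.99",
    "local_time": "2024-05-02T21:30:00",
    "utc_offset_hours": -5,
    "postcode": "10001",
}
clean = preprocess(raw)
print(clean)
```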

Logistic Regression Model

The logistic model takes transaction attributes such as transaction amount and location as input and outputs the probability that the transaction is fraudulent, a value between 0 and 1. As the weighted sum of the inputs becomes more positive, the output approaches 1, which corresponds to a high probability of fraud. Conversely, as it becomes more negative, the output approaches 0, which corresponds to a low probability. The model learns the optimal values of the weights and bias by training on historical data. These weights determine how strongly each feature influences the probability of a transaction being fraudulent, so they can indicate which attributes are most indicative of unusual transaction behaviour.
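
One way to read the learned weights is through odds: raising a feature by one unit multiplies the odds of fraud by $e^{w}$, where $w$ is that feature's weight. The sketch below ranks features by weight magnitude. The feature names and weight values are invented, and the ranking is only meaningful when the features are on comparable scales.

```python
import math

# Hypothetical learned weights for three scaled transaction features.
weights = {"amount_scaled": 1.2, "is_foreign": 0.8, "is_daytime": -0.4}

# Rank features by the absolute size of their weight.
ranked = sorted(weights.items(), key=lambda item: abs(item[1]), reverse=True)

for name, w in ranked:
    # exp(w) > 1 raises the odds of fraud; exp(w) < 1 lowers them.
    print(name, w, round(math.exp(w), 2))
```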

Model Training

The model is trained on the labelled historical data, from which it extracts the patterns and relationships between the features and the probability of fraud. The weights and bias are adjusted so that the cost function is minimized. The cost function measures how far the model's predictions are from the actual labels in the training data; for logistic regression this is typically the log loss (cross-entropy).
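
The training loop can be sketched in plain Python. This is a minimal illustration, not a production implementation: it assumes log loss as the cost function and batch gradient descent as the optimizer, and the tiny dataset, learning rate, and epoch count are all made up.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(x, w, b):
    """Predicted probability of fraud for one feature vector."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

def log_loss(X, y, w, b):
    """Average log loss (cross-entropy) over the dataset."""
    eps = 1e-12  # keeps log() away from 0
    total = 0.0
    for xi, yi in zip(X, y):
        p = min(max(predict(xi, w, b), eps), 1.0 - eps)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(X)

def train(X, y, lr=0.5, epochs=1000):
    """Fit weights and bias with batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict(xi, w, b) - yi  # gradient of the loss w.r.t. the linear part
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Made-up toy data: [scaled amount, is_foreign]; label 1 means fraudulent.
X = [[0.1, 0], [0.2, 0], [0.3, 1], [0.4, 0], [0.8, 1], [0.9, 1], [0.7, 0], [1.0, 1]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

initial_loss = log_loss(X, y, [0.0, 0.0], 0.0)
w, b = train(X, y)
final_loss = log_loss(X, y, w, b)
print(round(initial_loss, 4), round(final_loss, 4))
```

With all weights at zero, every prediction is 0.5, so the starting loss is $\log 2$; training should drive it lower.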

Fraud Detection

Once trained, the model can be applied to new, real-time transaction data. Transactions whose predicted probability of fraud exceeds a certain threshold (explained earlier under Decision Boundary) are flagged for further investigation. This threshold can be adjusted based on the tradeoff between catching fraud and minimizing false positives.
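
The tradeoff can be illustrated by counting outcomes at a few thresholds. The probabilities and labels below are invented, the helper name `flag_counts` is hypothetical, and flagging at or above the threshold is a convention chosen for this sketch.

```python
def flag_counts(probs, labels, threshold):
    """Count caught fraud, false positives, and missed fraud at a threshold."""
    caught = false_positives = missed = 0
    for p, y in zip(probs, labels):
        flagged = p >= threshold
        if flagged and y == 1:
            caught += 1
        elif flagged and y == 0:
            false_positives += 1
        elif not flagged and y == 1:
            missed += 1
    return caught, false_positives, missed

# Made-up predicted fraud probabilities and true labels (1 = fraudulent).
probs = [0.05, 0.20, 0.35, 0.45, 0.55, 0.60, 0.80, 0.95]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

for threshold in (0.3, 0.5, 0.7):
    print(threshold, flag_counts(probs, labels, threshold))
```

On this toy data, lowering the threshold catches more fraud but flags more legitimate transactions, while raising it does the opposite.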

Logistic regression is a powerful tool to combat fraud. By analyzing vast amounts of transaction data, it can help detect suspicious activity.

Written by

Hafsa Qureshi

I am a bioinformatics undergraduate interested in AI, machine learning, and large language models. I aim to contribute to the intersection of AI and bioinformatics, leveraging computational techniques to contribute to biological research and healthcare.