
F1 score and ML - Definition, principles, calculations, limitations, and how to overcome them

How to predict and improve the performance of machine learning models

2023.08.02 by Sangsun Moon

What is F1 Score?

For a long time, accuracy was the only metric used to evaluate and compare machine learning models. However, accuracy simply measures how often a model's predictions are correct across the entire dataset, so it is a meaningful summary only when the classes are similarly represented across that dataset.

Is F1-Score Really Better than Accuracy?

F1 score is a performance metric calculated as the harmonic mean of precision and recall. It is primarily used in binary classification problems to evaluate a model's predictive performance.

F1 score is a machine learning evaluation metric that assesses an AI model's predictive ability by drilling down into class-specific performance rather than the overall performance that Accuracy reports. It combines two competing metrics, Precision and Recall, and has become popular in recent years.

In this article, we'll cover the structure and principles of F1 Score, as well as use cases and ways to improve it.

How F1 Score works and is calculated

Accuracy

Before we get into the specifics of F1 Score, let's first talk about Accuracy. Using a classification model as an example, Accuracy is an evaluation metric that represents the percentage of samples the model predicts correctly out of the total number of samples. In other words, it represents how often the model's predictions are right.

To understand this calculation, we first need to look at the confusion matrix. A confusion matrix summarizes a model's predictions on a dataset. For a binary dataset consisting of a "positive" class and a "negative" class, the confusion matrix has four essential components.

  • True Positive (TP): Number of samples correctly predicted as "positive"
  • False Positive (FP): Number of samples incorrectly predicted as "positive"
  • True Negative (TN): Number of samples correctly predicted as "negative"
  • False Negative (FN): Number of samples incorrectly predicted as "negative"

Accuracy = (TP + TN) / (TP + TN + FP + FN) = (number of correct predictions) / (total number of samples)

To illustrate with a simple example, if a classification model makes predictions for 100 samples and gets 85 of them right, its accuracy is 85%. However, Accuracy is not always an appropriate metric, because real-world datasets are often imbalanced.
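
As a quick sanity check, here is a minimal sketch of that calculation, assuming scikit-learn and a hypothetical set of 100 labels of which 85 are predicted correctly:

    # Reproduce the 85% accuracy example from a confusion matrix (hypothetical labels).
    from sklearn.metrics import accuracy_score, confusion_matrix

    y_true = [1] * 50 + [0] * 50                       # 50 positives, 50 negatives
    y_pred = [1] * 43 + [0] * 7 + [0] * 42 + [1] * 8   # 7 false negatives, 8 false positives

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print((tp + tn) / (tp + tn + fp + fn))   # 0.85
    print(accuracy_score(y_true, y_pred))    # 0.85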

Imbalanced data

Imbalanced data refers to a situation where the number of samples for each class in a classification problem is unbalanced. This is usually caused by a relatively large number of samples for one class and a small number of samples for the other class.

For example, in defect detection, datasets typically contain many normal samples and only a few abnormal ones; in medical diagnostics for rare diseases, the number of positive samples can be very small. Imbalanced data causes problems both when training a model and when evaluating its performance: the model tends to focus on the majority class and fails to predict the minority class accurately, and the low accuracy on the minority class distorts our view of the model's true performance.

Accuracy is therefore not a good metric when classes are imbalanced. In the examples above, a model can score highly simply by always predicting the class it is most likely to get right: it may perform very badly on the class we actually care about, yet the high accuracy makes it look good. Because Accuracy alone cannot evaluate such a model fairly, we use the F1 Score to address the data-imbalance problem.
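
To make the problem concrete, here is a small sketch (again assuming scikit-learn, with made-up labels) in which a model that always predicts the majority class still reports 95% accuracy:

    # 95 negatives, 5 positives; the model always predicts "negative".
    from sklearn.metrics import accuracy_score, f1_score

    y_true = [0] * 95 + [1] * 5
    y_pred = [0] * 100

    print(accuracy_score(y_true, y_pred))             # 0.95 -- looks impressive
    print(f1_score(y_true, y_pred, zero_division=0))  # 0.0  -- every positive was missed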

How F1 Score works

Precision and Recall are the metrics most commonly used with imbalanced data, and they are the basis of the F1 Score we are going to explore. The F1 Score is the harmonic mean of the two, but each can also be used as a metric on its own.

Precision

Precision = TP / (TP + FP)

Precision is one of the metrics for evaluating the performance of a classification model. It measures the percentage of samples predicted as positive by the model that are actually positive.

Here, True Positive (TP) is the number of samples the model predicted as positive that are actually positive. False Positive (FP), on the other hand, is the number of samples the model incorrectly predicted as positive that are actually negative.

To illustrate with a simple example: if a classification model predicted 100 positive samples, 85 of which were actually positive, and the remaining 15 were negative, the TP would be 85 and the FP would be 15. In this case, Precision would be 85 / (85 + 15) = 0.85. Precision is the percentage of "positive" predictions that are actually positive, indicating how accurately the model predicts positives.

High precision means that the model minimizes false positives and finds many cases where what it predicts as positive is actually positive. Therefore, the higher the precision, the better the performance of the model.
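
The worked example above can be reproduced in a few lines, assuming scikit-learn and the same hypothetical counts (TP = 85, FP = 15):

    # Precision from raw counts and from labels.
    from sklearn.metrics import precision_score

    tp, fp = 85, 15
    print(tp / (tp + fp))                    # 0.85

    y_true = [1] * 85 + [0] * 15             # 100 samples all predicted positive, 85 truly positive
    y_pred = [1] * 100
    print(precision_score(y_true, y_pred))   # 0.85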

Recall

Recall = TP / (TP + FN)

Recall is one of the metrics for evaluating the performance of a classification model. It measures the percentage of samples that are actually positive that the model correctly predicts as positive.

Where TP is the number of samples that the model predicted as positive that are actually positive. FN is the number of samples that the model incorrectly predicted as negative that are actually positive.

If a classification model correctly predicted 85 out of 100 true positive samples as positive and incorrectly predicted the remaining 15 as negative, the TP would be 85 and the FN would be 15. In this case, the recall would be 85 / (85 + 15) = 0.85.

Recall indicates how many of the actually positive samples the model predicts as positive, and therefore how many positive samples the model misses. Recall is also known as the model's sensitivity; a high recall means the model is good at finding samples that are actually positive. Therefore, the higher the recall, the better the performance of the model.
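
The same kind of sketch works for recall, assuming the hypothetical counts above (TP = 85, FN = 15):

    # Recall from raw counts and from labels.
    from sklearn.metrics import recall_score

    tp, fn = 85, 15
    print(tp / (tp + fn))                    # 0.85

    y_true = [1] * 100                       # 100 truly positive samples, 85 of them found
    y_pred = [1] * 85 + [0] * 15
    print(recall_score(y_true, y_pred))      # 0.85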

F1 score calculation

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

Precision is the percentage of the model's positive predictions that are actually positive, while Recall is the percentage of actual positives that the model predicts as positive. Because the F1 score considers both metrics at once, it balances accurate positive predictions against coverage of the positive class.

The F1 Score is the harmonic mean of these two metrics. A harmonic mean is used because it is pulled toward the smaller of the two values, so a high F1 Score requires both Precision and Recall to be high. The F1 Score ranges from 0.0 to 1.0 and is a much more robust measure of model performance than Accuracy, which we saw is vulnerable to imbalanced data.

In conclusion, although Accuracy is simple to compute, it may not capture the full predictive power of a binary classification model. The F1 score evaluates a model's predictive performance somewhat more faithfully by combining the two metrics with a harmonic mean.
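
A short sketch (scikit-learn, made-up imbalanced labels) shows how the harmonic mean behaves: the F1 score is dragged toward the weaker of the two metrics, while accuracy stays misleadingly high:

    from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

    # 90 negatives, 10 positives; the model finds only 2 positives and raises 1 false alarm.
    y_true = [0] * 90 + [1] * 10
    y_pred = [0] * 89 + [1] * 1 + [0] * 8 + [1] * 2

    p = precision_score(y_true, y_pred)      # ~0.67
    r = recall_score(y_true, y_pred)         # 0.20
    print(accuracy_score(y_true, y_pred))    # 0.91 -- hides the weak recall
    print(2 * p * r / (p + r))               # ~0.31 -- harmonic mean by hand
    print(f1_score(y_true, y_pred))          # same value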

Precision vs. Recall

Recall and Precision are metrics that evaluate the performance of a classification model, each measuring a different aspect of the model's predictive output. The differences between the two metrics are as follows:

Definition

  • Precision: The percentage of samples that the model predicts to be positive that are actually positive.
  • Recall: The percentage of samples that the model correctly predicts as positive among samples that are actually positive.

Perspectives

  • Recall focuses on how many of the actual positives the model finds.
  • Precision focuses on what fraction of the samples the model predicts as positive are actually positive.

Usage

  • Recall indicates how well the model finds samples that are actually positive, so it focuses on minimizing false negatives (samples predicted as negative that are actually positive).
  • Precision indicates how many of the samples the model classifies as positive are actually positive, so it focuses on minimizing false positives (samples predicted as positive that are actually negative).

Precision-Recall Trade-off

How to measure an AI models performance - F1 score explained

Ideally, a good model would classify as positive only the samples that really are positive, and would also catch every positive sample (perfect precision and perfect recall). Realistically, it is hard to achieve both, so we need to decide which one to prioritize.

Because the two metrics value different things, they also have different use cases. For example, when diagnosing cancer, where it is critical not to miss a real cancer patient, we focus on Recall; when classifying spam, where we want to keep legitimate mail from being misclassified as spam, we focus on Precision.

To summarize, Recall focuses on making sure you don't miss any true positives, while Precision is concerned with making sure that what the model classifies as positive really is positive. The two pull against each other when evaluating a classification model, and you should choose the appropriate emphasis based on the problem or goal you are evaluating.
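
The trade-off is easy to see by sweeping the decision threshold of a probabilistic classifier. The sketch below assumes scikit-learn and a synthetic imbalanced dataset; the exact numbers will vary, but precision typically rises and recall falls as the threshold increases:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_score, recall_score

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
    proba = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

    for threshold in (0.3, 0.5, 0.7):
        pred = (proba >= threshold).astype(int)
        print(threshold,
              round(precision_score(y, pred, zero_division=0), 3),  # tends to rise with the threshold
              round(recall_score(y, pred), 3))                      # tends to fall with the threshold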

What is R-squared?

R-squared (coefficient of determination) is an evaluation metric used in regression models. It is used to measure how well the model explains the given data. R-squared is the ratio of the amount of variation explained by the model to the total amount of variation in the dependent variable, and has a value from 0 to 1. The closer the R-squared value is to 1, the better the model explains the data.

R-squared = 1 - (residual sum of squares / total sum of squares of the dependent variable)

The residual sum of squares is the sum of the squared differences (residuals) between the predicted and actual values. The total sum of squares is the sum of the squared differences between the values of the dependent variable and their mean. R-squared typically falls between 0 and 1, but it can also be negative.

A negative value means that the model does not explain the data at all (it fits worse than simply predicting the mean). R-squared is used to judge the goodness of fit of a model, and a high value can be interpreted as the model explaining the data well. However, because R-squared does not account for the number of independent variables, it provides no information about model complexity, so it is best considered alongside other metrics or information when comparing models.
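
For completeness, here is a minimal sketch of the R-squared calculation with hypothetical regression values, checked against scikit-learn's r2_score:

    from sklearn.metrics import r2_score

    y_true = [3.0, 5.0, 7.0, 9.0]
    y_pred = [2.5, 5.5, 7.0, 8.5]

    mean = sum(y_true) / len(y_true)                            # 6.0
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares = 0.75
    ss_tot = sum((t - mean) ** 2 for t in y_true)               # total sum of squares = 20.0
    print(1 - ss_res / ss_tot)                                  # 0.9625
    print(r2_score(y_true, y_pred))                             # 0.9625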

F1 Score Use cases

Normalized harmonic mean plot (x: precision, y: recall, vertical axis: F1 score)

F1 score is the harmonic mean of precision and recall and is one of the metrics used to evaluate the performance of a binary classification model. Its advantage is that it balances precision and recall, considering accurate positive predictions and coverage of the positive class at the same time.

  • Class imbalanced data:
    If the classes are unbalanced, that is, one class has a significantly larger number of samples than the other, it may not be appropriate to evaluate the model based on Accuracy alone. In this case, you can use F1 score to accurately measure the performance of the model.
  • Importance of true positives:
    Sometimes you need a model that is good at determining true positives, which is important in cancer diagnostics, for example, where missing a sample that is truly positive would be a big problem. In this case, F1 score is a metric that can evaluate the performance of a model by considering both recall and precision.
  • Text classification and information retrieval:
    F1 score is also used in NLP tasks such as named-entity recognition and the evaluation of word segmentation. In text classification or information retrieval it is often necessary to classify documents into two classes, such as positive and negative, or relevant and irrelevant. In these problems, too, F1 score can evaluate performance in terms of precision and recall.

How to improve your F1 Score

Analyzing machine learning model performance

The F1 Score ranges from 0 to 1. A low F1 Score means that precision, recall, or both are low. A low precision indicates that the machine learning model is generating incorrect annotations (false positives), while a low recall indicates that it failed to generate annotations it should have (false negatives). This can be caused by a number of factors, including the complexity of your type system, the adequacy of your training documentation, the skill of your human labelers, and more. Below are common ways to improve your F1 Score.

  • Training data tuning:
    Adding more training data, adding more annotations to the existing data, or improving data-processing quality can all raise your F1 Score.
  • Balancing Sampling:
    In class-imbalanced data, you typically have far more samples of the other classes than of the class you want to predict. In these cases it is important to rebalance the data you sample to train the model. Undersampling reduces the number of samples in the majority class, while oversampling increases the number of samples in the minority class.
  • Class Weights:
    To emphasize the importance of minority classes, you can give them greater weight in the model's training. This helps the model learn minority classes better, and can improve its prediction performance.
  • Prediction Threshold Adjustment:
    You can adjust the model's prediction threshold to balance precision and recall. For example, lowering the threshold can result in more positive classifications, increasing recall and decreasing precision. Increasing the threshold can increase precision and decrease recall.

Paying attention to types that are commonly confused, or that the model identifies with low confidence even though they occur frequently, will improve its ability to recognize minority classes. In addition, if your F1 Score for a particular type is low, review the clarity of the annotation guidelines that apply to that type. You can also increase your F1 Score by carefully reviewing and monitoring your training data, for example by adding dictionaries for types that occur infrequently in it.
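
Two of the techniques above, class weights and prediction-threshold tuning, can be sketched as follows. This assumes scikit-learn and a synthetic imbalanced dataset; the exact scores are illustrative only:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)
    print(f1_score(y_te, plain.predict(X_te)))     # baseline F1 on the minority class
    print(f1_score(y_te, weighted.predict(X_te)))  # often higher once the minority class is upweighted

    # Threshold tuning: pick the threshold that maximizes F1 on held-out data.
    proba = plain.predict_proba(X_te)[:, 1]
    best_f1, best_t = max(
        (f1_score(y_te, (proba >= t).astype(int), zero_division=0), t)
        for t in (i / 100 for i in range(5, 96))
    )
    print(best_f1, best_t)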

F1 Score limitations and alternatives

Limitations

Predictive Coding Performance and the Silly F1 Score

F1 Score is a popular metric for evaluating binary classification models, but it is not perfect.

  1. F1 Score does not provide information about the error distribution.
    The F1 Score summarizes a model's performance as a single value built from Precision and Recall, but it does not tell you how the errors are distributed, which can matter in certain applications.
  2. F1 Score assumes equal importance of Precision and Recall.
    It gives equal weight to Precision and Recall and assumes they are equally important. When the two metrics have different importance in an application, other metrics may be more useful than F1 Score's plain harmonic mean.
  3. F1 Score is not optimized for multi-class classification.
    Because this metric is designed for binary classification, it does not directly apply to multi-class problems. In that case, per-class metrics or a micro-/macro-averaged F1 Score may be more appropriate.
  4. F1 Score may not be sensitive to certain patterns in your data.
    F1 Score is a generic metric that does not take into account specific patterns or characteristics of your data. If you need to capture specific attributes of a problem, you may need a more specialized metric.

So what's the alternative?

In conclusion, F1 score is useful in class-imbalance problems and is a good metric for evaluating the trade-off between recall and precision. Given the limitations above, however, we recommend evaluating a model's performance more comprehensively: use F1 score in conjunction with other evaluation metrics, or choose the metric that best fits the nature of the problem. Below are some alternatives that can complement F1 Score.

ROC Curves, AUC

Understanding AUC - ROC Curve

Receiver operating characteristic (ROC) curves and area under the curve (AUC) are evaluation metrics used to assess the performance of binary classification models. They can measure a model's ability to distinguish between positive and negative instances over a range of classification thresholds. Like F1 Score, they are particularly useful for imbalanced data. More information can be found below.

ROC curves

  • For a binary classification model, the ROC curve plots the true positive rate (recall) on the y-axis against the false positive rate (1 - specificity) on the x-axis as the classification threshold is varied.
  • Visualize how well your model classifies positives and negatives
  • The closer the curve gets to the upper-left corner, the better the model performs.

AUC

  • The area under the ROC curve, aggregating the trade-off between the true positive rate and the false positive rate across all points on the curve.
  • The AUC value ranges from 0 to 1, with the best model having an AUC close to 1.
  • AUC can be interpreted as the probability that a randomly chosen positive sample receives a higher predicted score than a randomly chosen negative sample.
  • When comparing models by their AUC values, the model with the larger AUC tends to perform better (see the sketch below).
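
Here is a minimal sketch with scikit-learn on a synthetic imbalanced dataset; roc_curve returns the points of the curve and roc_auc_score the area under it:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score, roc_curve

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
    proba = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

    fpr, tpr, thresholds = roc_curve(y, proba)  # one (FPR, TPR) point per threshold
    print(roc_auc_score(y, proba))              # 0.5 ~ random guessing, 1.0 ~ perfect ranking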

F-Beta Score

F-Beta Score — PyTorch-Metrics 1.0.1 documentation

Like F1 Score, F-beta score is a metric that evaluates the performance of a binary classification model by calculating the harmonic mean of precision and recall to give a final score. However, you'll notice that it's calculated a little differently than F1 Score.

F-beta score = (1 + beta^2) * (Precision * Recall) / ((beta^2 * Precision) + Recall)

Here, beta is a parameter that controls how much recall is weighted relative to precision: a larger beta gives more weight to recall, and a smaller beta gives more weight to precision. Typical beta values are 0.5, 1, and 2, and with beta = 1 the F-beta score equals the F1 score.

The F-beta score is used to evaluate the predictive performance of the model by balancing precision and recall. For example, if recall (correctly finding true positives) is more important to you, you might set the beta value to 2 to give greater weight to recall. Conversely, if precision is more important to you, you might set a beta value of 0.5 to give greater weight to precision.

Therefore, the F-beta score allows you to manage the predictive performance of your model by balancing precision and recall. You can calculate the F-beta score by choosing an appropriate beta value based on the characteristics of your model and the problem you want to evaluate.
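
A quick sketch (scikit-learn, hypothetical labels with precision 0.6 and recall 0.75) confirms the formula against fbeta_score for a few beta values:

    from sklearn.metrics import fbeta_score, precision_score, recall_score

    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]   # TP=3, FP=2, FN=1

    p = precision_score(y_true, y_pred)        # 0.60
    r = recall_score(y_true, y_pred)           # 0.75
    for beta in (0.5, 1.0, 2.0):
        manual = (1 + beta**2) * p * r / (beta**2 * p + r)
        print(beta, round(manual, 3), round(fbeta_score(y_true, y_pred, beta=beta), 3))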

Geometric Mean

Geometric Mean (G-mean) is one of the evaluation metrics for assessing the performance of a binary classification model. It is calculated as the geometric mean of Recall and Specificity, and helps to accurately evaluate the performance of a model in class imbalance problems.

G-mean = √(Recall * Specificity)

Recall is the percentage of true positives that the model correctly predicts as positive, and specificity is the percentage of true negatives that the model correctly predicts as negative. G-mean is the geometric mean of these two metrics and represents the average prediction performance of the model for positive and negative classes.
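
A minimal sketch of this calculation, assuming scikit-learn and made-up imbalanced labels, computes specificity as the recall of the negative class:

    from math import sqrt
    from sklearn.metrics import recall_score

    y_true = [0] * 90 + [1] * 10
    y_pred = [0] * 85 + [1] * 5 + [0] * 4 + [1] * 6   # TN=85, FP=5, FN=4, TP=6

    recall = recall_score(y_true, y_pred, pos_label=1)       # 0.60
    specificity = recall_score(y_true, y_pred, pos_label=0)  # ~0.94
    print(sqrt(recall * specificity))                        # ~0.75 (the G-mean)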

G-mean is useful for evaluating a model's performance on class-imbalanced data. It compensates for the fact that Accuracy, a common evaluation metric, can be skewed in unbalanced data, and helps measure how well a model performs in predicting both positive and negative classes evenly.

As such, G-mean can be used as a metric to simultaneously evaluate the predictive power and reliability of a model: the closer the value is to 1, the better the performance of the model. However, G-mean can also have limitations depending on the specific problem and data, so it's important to consider it in conjunction with other evaluation metrics and choose one that fits the nature of the problem.

Conclusion: Training data accuracy can improve F1 Score and ensure model completeness.

Because F1 Score assigns equal weight to Precision and Recall, it can mask important differences between them when evaluating a classification model. When using F1 Score, it is important to understand the goals and requirements of the task and to select appropriate metrics for the specific problem and context. In other words, depending on the situation, other metrics may be more effective than F1 Score. To assess your model comprehensively, evaluate other metrics and factors together.

While it depends on your situation, we generally consider a model good when it has an F1 Score of 0.7 or higher. A high score is not a silver bullet, but a model with a high F1 Score is usually a solid model overall. Achieving a high F1 Score depends on the quality of the training data, and improving performance requires carefully reviewing and monitoring that data. This is why DataHunt has put so much research and effort into achieving 99% training data accuracy, so that the final model is more complete.

References

  1. What is F1 score?
  2. Understanding Machine Learning Classification Metrics (Accuracy, Precision, Recall, F1 Score) | by WANGGEON LEE | Medium
  3. Evaluate machine learning classification models (precision, recall, f1-score, etc.)
  4. Is F1-Score Really Better than Accuracy? | by Samuele Mazzanti | Towards Data Science
  5. A Look at Precision, Recall, and F1-Score | by Teemu Kanstrén | Towards Data Science
