F is for F1 Score

From A to I to Z: Jaid’s Guide to Artificial Intelligence

An F1 score, also known as the F-score or F-measure, measures a machine learning model's accuracy. It combines two key elements:

  • Precision: the proportion of the model's positive predictions that are actually correct.
  • Recall: the proportion of the actual positives in a dataset that the model correctly identifies.

Imagine you wanted to train an AI to identify which patients in a dataset have type 2 diabetes. The dataset contains 3,000 patients, of which 1,500 have type 2 diabetes and 1,500 are healthy. 

The AI identifies 1,500 patients as being sick.

1,000 of these predictions are correct. The patients do have type 2 diabetes. These are called true positives.

However, the other 500 patients are actually healthy, so the AI’s predictions were incorrect. These are called false positives.

The AI also predicts that 500 patients in the dataset are healthy, when in fact they’re sick. These are called false negatives.

Precision is worked out by dividing the number of true positives (in this case 1,000) by the sum of true positives and false positives (in this case 1,000 + 500 = 1,500).

This means the model’s precision is 0.67.

To work out recall, on the other hand, you divide the number of true positives by the sum of true positives and false negatives.

In our example, the number of false negatives happens to equal the number of false positives. So, here again, the score is 1,000 divided by 1,500, which is 0.67.
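If you'd like to check the arithmetic yourself, here's a minimal Python sketch using the counts from the example above:

    # Counts from the diabetes example above.
    true_positives = 1000   # sick patients correctly flagged as sick
    false_positives = 500   # healthy patients incorrectly flagged as sick
    false_negatives = 500   # sick patients the model missed

    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)

    print(f"Precision: {precision:.2f}")  # 0.67
    print(f"Recall:    {recall:.2f}")     # 0.67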

The F1 score is the harmonic mean of precision and recall, and it’s worked out using the following formula:

F1 = 2 x (Precision x Recall) / (Precision + Recall)

So, using our example, the machine learning model's F1 score would be 2 x (0.67 x 0.67) / (0.67 + 0.67), which works out to 0.67.

The highest possible F1 score is 1, while the worst possible score is 0.
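The same formula takes one line of Python:

    precision = 0.67  # from the example above
    recall = 0.67

    f1 = 2 * (precision * recall) / (precision + recall)
    print(f"F1 score: {f1:.2f}")  # 0.67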

Some facts:

The harmonic mean is an average that gives more weight to low values and less weight to high values. In the context of F1 scores, this means a low precision or a low recall drags the overall score down sharply, making F1 more sensitive to a model's weak spots than a simple average would be.
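To see this in action, compare the ordinary (arithmetic) average with the harmonic mean for a model with high precision but poor recall:

    # A model with high precision but very low recall.
    precision, recall = 0.90, 0.10

    arithmetic_mean = (precision + recall) / 2
    harmonic_mean = 2 * (precision * recall) / (precision + recall)

    print(f"Arithmetic mean:    {arithmetic_mean:.2f}")  # 0.50
    print(f"Harmonic mean (F1): {harmonic_mean:.2f}")    # 0.18

The simple average makes the model look mediocre; the harmonic mean makes it clear something is badly wrong.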

While F1 scores are generally a good measure of accuracy, they can be misleading when a dataset contains very few possible true positives, for example when you're training AI to identify a rare disease or fraudulent transactions.

Imagine a dataset of 3,000 people where 2,950 are healthy and only 50 are sick.

Here, a model might achieve a high F1 score on the majority "healthy" class simply by predicting that most people in the dataset are healthy, while doing badly at what actually matters: correctly predicting who is sick.
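A short sketch of this pitfall, using scikit-learn's f1_score (the dataset here is made up for illustration):

    from sklearn.metrics import f1_score

    # Hypothetical imbalanced dataset: 2,950 healthy (0), 50 sick (1).
    y_true = [0] * 2950 + [1] * 50
    # A lazy model that predicts "healthy" for everyone.
    y_pred = [0] * 3000

    # F1 for the majority "healthy" class looks excellent...
    print(f1_score(y_true, y_pred, pos_label=0))                   # ~0.99
    # ...but F1 for the rare "sick" class, the one that matters, is 0.
    print(f1_score(y_true, y_pred, pos_label=1, zero_division=0))  # 0.0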

For this reason, scientists generally use more than one metric when comparing models and evaluating their performance.

Alongside precision, recall, and F1 scores, the most common performance metrics used to compare machine learning models are:

  • Accuracy – This measures the proportion of the model's predictions that are correct. So if a model has made 1,000 predictions over its lifetime and 800 of them were right, its accuracy score would be 0.8.
  • ROC-AUC – This measures how good the model is at distinguishing between positives and negatives, by plotting the true positive rate against the false positive rate at every classification threshold and taking the area under that curve.
  • Log loss – This measures how close the model's predicted probabilities are to the actual outcomes, penalising confident predictions that turn out to be wrong. The lower the log loss, the better.
  • AUC-PR – This is the area under the precision-recall curve, which summarises the trade-off between precision and recall and is particularly informative for imbalanced datasets.
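As a quick illustration, here's how these metrics can be computed with scikit-learn (a sketch; the labels and probabilities below are made up):

    from sklearn.metrics import (accuracy_score, roc_auc_score,
                                 log_loss, average_precision_score)

    # Hypothetical ground-truth labels and predicted probabilities.
    y_true = [0, 0, 1, 1, 0, 1, 0, 1]
    y_prob = [0.1, 0.4, 0.8, 0.7, 0.3, 0.9, 0.2, 0.6]  # predicted P(positive)
    y_pred = [1 if p >= 0.5 else 0 for p in y_prob]    # hard 0/1 predictions

    print("Accuracy:", accuracy_score(y_true, y_pred))
    print("ROC-AUC: ", roc_auc_score(y_true, y_prob))
    print("Log loss:", log_loss(y_true, y_prob))
    # average_precision_score is scikit-learn's summary of the
    # precision-recall curve (a close cousin of AUC-PR).
    print("AUC-PR:  ", average_precision_score(y_true, y_prob))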

Want to know more?

This video is a deep dive into F1 scoring, and includes easy-to-understand explanations of key terminology, with examples.

Slightly more advanced, this Machine Learning Mastery tutorial includes code snippets you can use to create your own F1, precision, and recall scoring models.

Jaid’s perspective

F1 scores are useful because they can tell you how accurate a machine learning model is and, therefore, how well it can perform a specific function. From a customer service perspective, the higher a machine learning model's F1 score, the more reliably it can answer queries to a customer's satisfaction. That said, it's important to keep in mind that F1 scores don't always fairly represent a model's capabilities, which is why you should evaluate performance using several metrics.

Learn how Jaid’s AI-powered platform can help streamline your customer queries AND keep customers satisfied. Contact us today!