What Is a Confusion Matrix? (Plus How To Calculate One)
Updated October 22, 2023
Data analysts and engineers perform many kinds of assessment when working on machine learning problems, and classification problems are among the most common. A confusion matrix is a valuable tool for measuring the accuracy and precision of a classification model, or classifier.
In this article, we explore what a confusion matrix is, examine why it's important in data analysis and machine learning, provide steps for how you can calculate a confusion matrix for a two-class classification problem and provide an example to guide you.
What is a confusion matrix?
A confusion matrix is a chart or table that summarizes the performance of a classification model or algorithm for machine learning processes. Confusion matrices help with predictive analysis and can be effective tools for evaluating what functions a machine learning system performs correctly and incorrectly.
When creating a confusion matrix, include both the predicted and the actual values you test in the system. In the layout used in this article, each row corresponds to an actual class and each column corresponds to a predicted class. Depending on how many classes the model can output, a confusion matrix can describe either a two-class or a multiple-class classification problem.
Why a confusion matrix is important
Data scientists who develop machine learning systems rely on confusion matrices to solve classification problems containing two or more classes. The matrix organizes input and output data in a way that allows analysts and programmers to visualize the accuracy, recall and precision of the machine learning algorithms they apply to system designs.
In a two-class, or binary, classification problem, each prediction has one of two outcomes, conventionally labeled positive and negative. When evaluating binary classifiers, you can use confusion matrices to find:
Accuracy rate: This is the percentage of times a classifier is correct.
Misclassification rate: This is the percentage of times a classifier is incorrect.
True positive rate: This figure represents the percentage of actual positive outcomes a classifier identifies correctly. It's also called the recall or sensitivity.
True negative rate: This refers to how often a classifier correctly identifies actual negative outcomes. It's also called the specificity.
False positive rate: This is a type I error rate, representing how often a classifier incorrectly predicts a positive outcome for a case that's actually negative.
False negative rate: This is a type II error rate, representing the percentage of actual positive cases a classifier incorrectly predicts as negative.
Precision rate: This is the percentage of positive predictions that turn out to be correct.
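As a sketch, all of the rates above can be computed directly from the four cell counts of a binary confusion matrix. The counts below are hypothetical, chosen only to illustrate the formulas:

```python
def confusion_rates(tp, fp, fn, tn):
    """Compute common evaluation rates from the four cells of a binary confusion matrix."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "misclassification": (fp + fn) / total,
        "true_positive_rate": tp / (tp + fn),   # recall / sensitivity
        "true_negative_rate": tn / (tn + fp),   # specificity
        "false_positive_rate": fp / (fp + tn),  # type I error rate
        "false_negative_rate": fn / (fn + tp),  # type II error rate
        "precision": tp / (tp + fp),
    }

# Hypothetical counts: 80 true positives, 10 false positives,
# 5 false negatives, 5 true negatives (100 samples in total)
rates = confusion_rates(tp=80, fp=10, fn=5, tn=5)
```

Note that the accuracy and misclassification rates always sum to 1, since every prediction is either correct or incorrect.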
How to calculate a confusion matrix for binary classification
The following steps outline the basic process for calculating confusion matrices for two-class classification problems:
1. Construct your table
Before entering data, you need a table to develop the confusion matrix. Create a table with two rows and two columns, with an additional row and column for labeling your chart. The row labels on the left side of the matrix represent the actual outputs and the column labels across the top represent the predicted outputs.
2. Enter the predicted positive and negative values
In the predictive row and column, list the values you estimate for both positive and negative outcomes. For example, you can predict the number of pass-fail exam scores from a data set containing 120 samples.
This means you can have two outputs, either "pass" or "fail." If you predict 100 passing scores and 20 failing scores, you enter these values as the outputs under the columns for your predictive "pass" and "fail" values.
3. Enter the actual positive and negative values
After analyzing your predictive values to determine whether they're correct, you can enter the actual outputs in your matrix. The actual outputs become the "true" and "false" values in the table. Your "true positive" and "false negative" values represent the actual positive outputs. The "false positive" and "true negative" values represent the actual negative outcomes.
In the example of a pass-fail exam, the passing scores represent the positive outcomes, while the failing scores represent the negative outcomes. Suppose the actual number of passing scores is 110 and the actual number of failing scores is 10, and all 100 of the scores you predicted as passing are correct. Your true positive value is then 100 and your true negative value is 10, because you correctly predict all 10 actual failing scores. Your false negative value is 10, because you predict 10 actual passing scores as failures, and your false positive value is 0, because no failing score is predicted as a pass.
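The counting in steps 2 and 3 can be sketched in code: given paired lists of actual and predicted labels, tally the four cells of the matrix. The six labels below are hypothetical, chosen only to show the tallying logic:

```python
def binary_confusion(actual, predicted, positive="pass"):
    """Tally true/false positives and negatives from paired label lists."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive and a != positive:
            fp += 1  # predicted positive, actually negative
        elif p != positive and a == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn

# Hypothetical labels for six exam results
actual    = ["pass", "pass", "fail", "fail", "pass", "fail"]
predicted = ["pass", "fail", "fail", "pass", "pass", "fail"]
tp, fp, fn, tn = binary_confusion(actual, predicted)
```

With these six samples, the tally gives 2 true positives, 1 false positive, 1 false negative and 2 true negatives.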
4. Determine the accuracy rate
Using the completed matrix, you can determine what the accuracy rate is when predicting desirable outcomes. This metric measures how often you predict outcomes correctly. This can be useful for understanding error rates and identifying where modifications in data systems are necessary.
To find the accuracy rate, add the true positive and negative values together and divide the result by the total number of values in your data set. With the example test scores, correctly predicting 100 passing scores and 10 failing scores gives you a sum of 110 accurate predictions out of 120 total scores, resulting in a 92% accuracy rate.
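The accuracy arithmetic for the exam example, using the counts stated above (100 correctly predicted passing scores, 10 correctly predicted failing scores, 120 scores in total), is a one-liner:

```python
true_positive = 100  # passing scores predicted correctly
true_negative = 10   # failing scores predicted correctly
total_scores = 120

accuracy = (true_positive + true_negative) / total_scores
print(round(accuracy, 2))  # 0.92
```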
5. Calculate the misclassification rate
The misclassification rate shows how often your classifier's predictions of the actual positive and negative outputs are incorrect. Find this value by adding the false positive and false negative values together and dividing the sum by the total number of values in your data set. In the pass-fail exam example, the false positive value is 0 and the false negative value is 10, because you incorrectly predict 10 actual passing scores as failures.
Combining these values results in 10, which you divide by the total of 120 test scores. This results in a misclassification rate of about 0.083, or 8%, which is the complement of the 92% accuracy rate: you predict an outcome incorrectly about 8% of the time.
6. Find the true positive rate
The true positive rate of a classifier is also called the recall, and it represents how often the classifier correctly identifies an actual positive outcome. To find the recall rate, divide the number of positive outcomes you predict correctly by the total number of actual positive outcomes in your analysis.
For example, assume you correctly predict 100 passing scores. This is the true positive value because you correctly predict 100 of the actual 110 passing scores. Divide this true positive value by the 110 passing scores to get a recall rate of 0.91 or 91%.
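The recall calculation from this step, using the same counts from the text (100 correctly predicted passes out of 110 actual passing scores):

```python
true_positive = 100    # passing scores predicted correctly
actual_positive = 110  # all actual passing scores

recall = true_positive / actual_positive
print(round(recall, 2))  # 0.91
```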
7. Determine the true negative rate
The true negative rate of your matrix is the specificity rate, which shows how often your classifier correctly predicts a negative outcome. To determine this rate, divide the total number of negative outcomes you predict correctly by the number of actual negative outcomes you get in your analysis.
Using the previous example exam scores, there are 10 actual failing scores and you predict all 10 of them correctly. This gives you a true negative, or specificity, rate of 10 / 10 = 100%.
Example of a confusion matrix calculation
Below is an example of a confusion matrix calculation:
Environmental scientists want to solve a two-class classification problem: predicting whether individuals in a population carry a specific genetic variant. They can use a confusion matrix to determine how often the machine learning classification model they're analyzing confuses the two classes. Assuming the scientists use 500 samples for their data analysis, they construct a table for their predicted and actual values before calculating the confusion matrix:
|  | Predicted without the variant | Predicted with the variant |
| --- | --- | --- |
| Actual number without the variant |  |  |
| Actual number with the variant |  |  |
| Total predicted value |  |  |
After creating the matrix, the scientists analyze their sample data. Assume the scientists predict that 350 test samples contain the genetic variant and 150 samples don't. If they determine the actual number of samples containing the variant is 305, the actual number of samples without the variant is 195. These values become the "true" values in the matrix and the scientists enter the data in the table:
|  | Predicted without the variant | Predicted with the variant |
| --- | --- | --- |
| Actual number without the variant = 195 | True negative = 45 | False positive = 150 |
| Actual number with the variant = 305 | False negative = 105 | True positive = 200 |
Using the data from the confusion matrix, the scientists can then compute the true positive and negative rates, the accuracy rate and the misclassification rate of their classification model:
Recall rate = (True positive value) / (Actual positive value) = (200) / (305) = 0.66 = 66%
Specificity rate = (True negative value) / (Actual negative value) = (45) / (195) = 0.23 = 23%
Accuracy rate = (True positive value + True negative value) / (Total number of samples) = (200 + 45) / (500) = (245) / (500) = 0.49 = 49%
Misclassification (error) rate = (False positive value + False negative value) / (Total number of samples) = (150 + 105) / (500) = (255) / (500) = 0.51 = 51%
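A quick sketch to verify the scientists' calculations from the four cell values in the table:

```python
# Cell values from the scientists' confusion matrix (500 samples)
tp, tn, fp, fn = 200, 45, 150, 105
total = tp + tn + fp + fn  # 500

recall = tp / (tp + fn)         # 200 / 305
specificity = tn / (tn + fp)    # 45 / 195
accuracy = (tp + tn) / total    # 245 / 500
error_rate = (fp + fn) / total  # 255 / 500

print(round(recall, 2), round(specificity, 2), accuracy, error_rate)
# 0.66 0.23 0.49 0.51
```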
Evaluating this data can help scientists determine how to change or improve the classification algorithm to increase the accuracy rate of predicting genetic variations in an ecosystem's population.