When I train business professionals looking to level up their skills, data quality is a topic that always comes up. They bring it up around the time they start learning the different methods used to clean data for analysis. The more they learn, the more they begin to understand the art of analytics. You see, my friend, even the best statistical analysis can’t solve the problems created by poor data quality. This article evaluates the implications of poor data quality for your analysis.
Characteristics of Good Data Quality
In a previous article, I go into some depth about the characteristics of good data quality. To summarize, high quality data is:
- Accurate: the information contained in the data set is correct.
- Consistent: fields in the data set are captured with the same information and formats across all observations.
- Complete: data fields do not contain missing values where information should have been captured.
- Unique: data tables capture fields unique to those tables, and there are no duplicate records.
- Comprehensive: the fields in the data cover all topics you want to analyze.
- Timely: the data is recent enough to be useful for your analytic purposes.
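To make these characteristics concrete, here is a minimal sketch of how a few of them might be checked with pandas. The DataFrame, column names, and dates are hypothetical and exist purely for illustration.

```python
import pandas as pd

# Hypothetical customer table used only for illustration.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "income": [52000, None, None, 48000, 51000],
    "state": ["NY", "ny", "ny", "CA", "CA"],
    "updated_at": pd.to_datetime(
        ["2024-01-05", "2023-11-20", "2023-11-20", "2021-06-01", "2024-02-14"]
    ),
})

# Completeness: count missing values in each field.
print(df.isna().sum())

# Uniqueness: flag duplicate customer records.
print(df.duplicated(subset="customer_id").sum(), "duplicate customer rows")

# Consistency: the same value captured with different formats.
print(df["state"].unique())  # 'NY' vs. 'ny' signals a formatting issue

# Timeliness: how stale is the most recent record?
print(pd.Timestamp("2024-03-01") - df["updated_at"].max())
```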
How Data Quality Impacts Validity & Reliability of Metrics
Organizations pay a lot of attention to creating and maintaining high quality data. From an analytic perspective, high quality data improves two very important aspects of your metrics: validity and reliability. Validity and reliability refer to how accurately and consistently a metric measures what it is supposed to measure.
- Validity: the accuracy of a metric with respect to the concept it is supposed to measure.
- Reliability: the consistency with which repeated measurements of a metric reflect the correct value of the concept being measured.
To illustrate these concepts, consider an automatic blood pressure monitor at your doctor’s office. When the doctor takes multiple readings from the same patient within a few minutes, the readings are typically the same. The equipment technician, however, has not calibrated the monitor in the last two months. As a result, the monitor produces readings that are five (5) points higher than a person’s true blood pressure. Because the monitor can reproduce its results across repeated measurements of patients, the data are reliable. In contrast, because the results are too high, the data are not a valid metric of patient blood pressure.
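As a rough numerical sketch of that example (the numbers are invented), a monitor with a constant five point upward bias produces readings that barely vary across repeats yet sit about five points above the truth:

```python
import numpy as np

rng = np.random.default_rng(42)

true_bp = 120      # a patient's true systolic pressure
bias = 5           # the miscalibrated monitor adds five points to every reading
noise_sd = 0.5     # repeated readings vary only slightly

readings = true_bp + bias + rng.normal(0, noise_sd, size=10)

print(f"spread of repeated readings: {readings.std():.2f}")             # small -> reliable
print(f"average error vs. the truth: {readings.mean() - true_bp:.2f}")  # about 5 -> not valid
```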
Importantly, for a metric to be valid we typically want it to be reliable as well. If a metric is not a consistent measure of a concept, then it probably isn’t an accurate measure either.
Validity and reliability are important concepts to understand when we discuss the impact of poor data quality on analyses. If you are working with poor quality data, then your analyses will likely suffer poor reliability, poor validity, or both. Let’s dive into the specifics.
Poor Accuracy
If your data are not accurate, then they are not valid; that is the very definition of validity. Your data could still be reliable, however, as in my previous description of the blood pressure monitor. A metric might be consistent, but that does not guarantee accuracy.
If your data are inaccurate, then your analytic results will be inaccurate as well. Researchers use the term biased to refer to metrics or results that are not accurate. Let’s go back to my blood pressure monitor example. Because the monitor always registers five (5) points higher than the true blood pressure, the data are biased upward. If we calculated the percentage of patients with high blood pressure, our result would also be biased upward.
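A small simulation can show how that upward bias carries through to the share of patients flagged with high blood pressure. The population, cutoff, and bias below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of true systolic pressures.
true_bp = rng.normal(125, 15, size=10_000)
measured_bp = true_bp + 5            # every reading is five points too high

threshold = 130                      # hypothetical cutoff for "high blood pressure"
print(f"rate using true values:     {(true_bp >= threshold).mean():.3f}")
print(f"rate using measured values: {(measured_bp >= threshold).mean():.3f}")  # biased upward
```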
Poor Consistency
If your data are not consistent, then they are not reliable. As with the relationship between accuracy and validity, consistency is the very definition of reliability. And as I mentioned previously, unreliable data won’t be valid data.
In my blood pressure monitor example, I originally said the monitor produced consistent readings for a patient. But what if that wasn’t true?
Imagine that our blood pressure monitor produced results that were sometimes biased upward and sometimes biased downward. Imagine further that the pattern of high and low measurements seemed to be unpredictable.
Would you trust that the data could provide you with accurate results? Probably not. Worse yet, because the inconsistency is unpredictable, we can’t even tell the direction of the bias in the analytic results.
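Continuing the invented numbers from above, an unreliable monitor might look like this: the spread across repeated readings is large, and the average error could land on either side of zero:

```python
import numpy as np

rng = np.random.default_rng(7)

true_bp = 120
# The error is large and its sign is unpredictable from one reading to the next.
errors = rng.choice([-8, -4, 4, 8], size=10)
readings = true_bp + errors

print(f"spread of repeated readings: {readings.std():.1f}")   # large -> unreliable
print(f"average error: {readings.mean() - true_bp:+.1f}")     # could be high or low
```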
Poor Completeness
If your data contain missing values, then they are incomplete. When data are missing, you should be concerned about the accuracy of your metrics. If the missing values are scattered randomly across observations, then metrics calculated from the known data should still be accurate. In contrast, if the pattern of missing values is not random, then the calculated metrics will likely be biased.
Metrics of individual or family income are classic examples where missing data can bias results. Many people choose not to report their income on surveys. This is especially true if their income is either very high or very low. When data are missing in this type of pattern, then your metric of income is likely biased. Both the average and the variability of the income metric can be incorrect from this type of missing data pattern.
I need to emphasize here that a metric might still be valid for the observations where values are known. Missing data, however, creates a concern about whether your data are representative of the population you want to study. In the income example, if the missing values sit at the low and high ends of the income spectrum, then the data may only be representative of the middle-income group.
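A quick simulation illustrates the difference between the two missingness patterns. The income distribution and the share of non-respondents below are made up; the point is only that random missingness leaves the mean roughly intact, while dropping the tails shifts the mean and understates the variability:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical incomes with a long right tail.
income = rng.lognormal(mean=11, sigma=0.5, size=50_000)

# Random missingness: 20% of observations drop out at random.
random_missing = income[rng.random(income.size) > 0.20]

# Non-random missingness: the lowest and highest earners tend not to respond.
lo, hi = np.quantile(income, [0.10, 0.90])
nonrandom_missing = income[(income > lo) & (income < hi)]

print(f"true mean:                 {income.mean():,.0f}")
print(f"mean, random missingness:  {random_missing.mean():,.0f}")     # close to the truth
print(f"mean, non-random pattern:  {nonrandom_missing.mean():,.0f}")  # shifted
print(f"true std dev:              {income.std():,.0f}")
print(f"std dev, non-random:       {nonrandom_missing.std():,.0f}")   # understated variability
```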
Poor Uniqueness
If you have duplicate records in a data set, then the data are not unique. You might also find that your data sets have multiple sources for the same information. If these different sources conflict with each other, then you have a more serious type of uniqueness problem. The impact of non-unique data on your analysis happens through its effect on the validity and reliability of your metrics.
If your data have duplicate records, then your metrics are most likely going to be invalid. Averages calculated from data containing duplicates are likely to be biased toward the values of the duplicated records. Duplicate records also hurt the validity of your metrics by biasing the amount of variability between observations.
If your data have multiple sources of the same information, then you need to confirm whether the values for duplicate fields agree with one another across observations. For example, if you have multiple customer address fields in your database, do the addresses agree across those fields? If not, then some of those fields must contain incorrect information. Using incorrect fields in your analysis destroys accuracy and will result in invalid metrics and biased results.
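Both situations can be illustrated with a couple of toy tables (the records and fields are invented): a duplicated order pulls the average toward its value, and a row-by-row comparison flags customers whose address fields disagree:

```python
import pandas as pd

# Hypothetical orders table in which order 102 was loaded twice.
orders = pd.DataFrame({
    "order_id": [101, 102, 102, 103],
    "amount":   [50.0, 400.0, 400.0, 60.0],
})

print(orders["amount"].mean())                                      # 227.5, pulled toward 400
print(orders.drop_duplicates(subset="order_id")["amount"].mean())   # 170.0 after de-duplicating

# Hypothetical customer table with address fields from two different sources.
customers = pd.DataFrame({
    "customer_id":  [1, 2, 3],
    "address_crm":  ["12 Oak St", "9 Elm Ave", "3 Pine Rd"],
    "address_bill": ["12 Oak St", "14 Maple Dr", "3 Pine Rd"],
})

# Flag observations where the two sources disagree.
print(customers[customers["address_crm"] != customers["address_bill"]])
```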
Poor Comprehensiveness
If you are missing metrics that you would like to use in your analysis, then your data are not comprehensive. For example, you may believe that men and women differ in the specific types of products they purchase. If your database does not include a field for customer gender, then the data are not comprehensive for that analysis.
Data with poor comprehensiveness produce a different kind of bias than inaccurate metrics do. When important fields are missing from your analysis, the results suffer from omitted variable bias.
In the case of our gender differences in products purchased, a missing gender field will result in omitted variable bias. You could only identify the purchasing patterns for the typical customer without being able to assess the gender differences.
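A toy example (with invented numbers) shows what gets lost: with a gender field you can see two distinct spending patterns, and without it you only see a blended average that describes neither group well:

```python
import pandas as pd

# Hypothetical purchases in a product line where spending differs by gender.
purchases = pd.DataFrame({
    "gender": ["F", "F", "F", "M", "M", "M"],
    "spend":  [80, 90, 85, 30, 40, 35],
})

# With the gender field, the difference is visible.
print(purchases.groupby("gender")["spend"].mean())   # F: 85.0, M: 35.0

# Without it, only the "typical customer" figure is available.
print(purchases["spend"].mean())                     # 60.0
```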
Poor Timeliness
When data are not timely, then your analysis will be generated using old data. The results will always leave one important question open: has anything changed about these relationships since the data was collected? If you can reasonably assume that all relationships remain the same, then working with older data may still be useful. In contrast, changes in the environment or processes producing the data may notably alter relationships in the data.
When timeliness is poor, we develop questions about the validity of the metrics and results. Specifically, are the old data still representative of the current situation? While older data may be reliable with respect to consistency, the question is whether they represent the current situation accurately. Some processes and industries experience changes very slowly. We are less concerned about using older data when changes are slow, as opposed to more rapid changes.
For example, your customer addresses are not likely to change rapidly in a database. Because addresses are generally stable over time, you would be less concerned about using data that was a year old. In contrast, economic factors can change rapidly. If you wanted to forecast sales for next year, you would want to use the most recent data possible.
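A small sketch (with invented sales figures) shows why stale data can mislead: when the underlying level shifted in the final year, an average computed over the full history understates the current level, while the most recent window reflects it:

```python
import pandas as pd

# Hypothetical monthly unit sales with a step change in the final year.
sales = pd.DataFrame({
    "month": pd.date_range("2021-01-01", periods=36, freq="MS"),
    "units": [100] * 24 + [150] * 12,
})

print(f"average over all history:    {sales['units'].mean():.1f}")   # 116.7, pulled down by old data
recent = sales[sales["month"] >= "2023-01-01"]
print(f"average over last 12 months: {recent['units'].mean():.1f}")  # 150.0, the current level
```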
Conclusion
Bloggers have produced no shortage of online discussion about the importance and characteristics of high-quality data. Most authors, however, don’t discuss what can happen when your data quality is poor. In this article, I unpack that discussion and provide a more comprehensive view of the potential impacts of poor data quality.
The challenges of poor data quality impact the validity and reliability of your metrics. By developing and maintaining high data quality, you can avoid these issues. When these problems are present, however, understanding them leaves you better equipped to adjust your analyses.