5 Proven Steps to Improve Data Quality in Your Organization

An image of data in circular radial patterns, with the words Data Quality underneath.

The quality of data is a constraint on every analytic project. As the saying goes, "garbage in, garbage out." As part of the planning process for an analysis, you should understand the quality of the available data. In this post, I'll discuss the characteristics of good data quality and how your organization can start improving it. By the end, you will have five proven steps to improve data quality in your organization.

Characteristics of Good Data Quality

Data experts generally agree on the characteristics of good data quality regardless of the topics covered by the data. You might find that some sources have slight differences in terminology, but the concepts covered are typically the same regardless of specific terms. Understanding the characteristics of high quality data is also essential to having good data literacy. You can read more about that in my post on Essential Elements to Target for Your Data Literacy. High quality data have the following characteristics.

Accurate

High quality data include correct information. If your database captures customer emails, phone numbers, and addresses, then you expect the information in those fields to be correct when you use it.

Consistent

High quality data are reliable, capturing the same information in the same manner across different observations in the data set. If your database captures customer telephone numbers, then all of the entries should follow the same format, such as (555)555-5555. You should not see entries such as 5555555555 or 555 555 5555. As another example, if your data collect addresses, you expect to see entries such as 10 Main St, not Ten Main Street. Additionally, where the same field appears in different tables across a database, the values will match if the data are consistent.
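To make the phone number example concrete, here is a minimal sketch of a normalization helper. The function name `normalize_phone` and the target format are my own illustrative choices, assuming 10-digit US numbers:

```python
import re

def normalize_phone(raw):
    """Normalize any 10-digit phone entry to the (555)555-5555 format."""
    digits = re.sub(r"\D", "", raw)  # strip everything except digits
    if len(digits) != 10:
        return None  # cannot normalize; flag the entry for manual review
    return f"({digits[:3]}){digits[3:6]}-{digits[6:]}"
```

Running every incoming entry through a normalizer like this, rather than storing whatever the user typed, is one way to enforce consistency at the point of capture.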

Timely

If your organization captures data and makes them available for use in a timely manner, the data are high quality. The definition of timely varies across industries, however. In some industries, timeliness might mean the data are available for analysis within 24 hours of capture. In others, it might mean the data are available within a week or a month of capture. The most important aspect of timely data is that they are available in time for the organization to make relevant data-driven decisions.

Comprehensive

High quality data contain fields for all the variables that you want to capture in your analytics project. A data set that contains only a limited number of fields on a specific topic may not have all the fields you need; that data set is therefore not comprehensive with respect to your analytic project. You may need to combine multiple data sets to create a comprehensive data set for your project.

Unique

High quality data do not contain duplicate observations where they should not. Within a single data table, you should not have duplicate entries. Across tables in a database, you should capture each specific piece of data in only one place, rather than in multiple places where the information could conflict.
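A simple duplicate check can be sketched in a few lines. This assumes records arrive as dictionaries and that a single field (here, hypothetically, `email`) serves as the uniqueness key:

```python
from collections import Counter

def find_duplicates(rows, key="email"):
    """Return the key values that appear in more than one record."""
    counts = Counter(row[key] for row in rows)
    return [value for value, n in counts.items() if n > 1]

customers = [
    {"email": "ana@example.com"},
    {"email": "bob@example.com"},
    {"email": "ana@example.com"},  # duplicate entry
]
```

In practice you might key on a composite of fields (name plus address, say) when no single field is guaranteed unique.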

Complete

High quality data do not contain unexpected missing values. Importantly, this does not mean there are no missing values at all. You might expect some data fields to have missing values: if you include optional fields on data capture forms, such as a customer middle name or birthdate, you should expect that some users will choose not to provide that information. Your data may still be complete with respect to the required fields on the form. Completeness, therefore, is partially a function of what you want to do with the data.
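The distinction between required and optional fields can be encoded directly in a completeness check. This is an illustrative sketch; the field names are hypothetical:

```python
def incomplete_rows(rows, required_fields):
    """Return indices of rows missing any *required* field.

    Optional fields (e.g. middle_name) are deliberately ignored,
    since blanks there do not make the data incomplete.
    """
    return [
        i for i, row in enumerate(rows)
        if any(not row.get(field) for field in required_fields)
    ]

records = [
    {"name": "Ana", "email": "ana@example.com", "middle_name": ""},
    {"name": "",    "email": "bob@example.com", "middle_name": "J"},
]
```

Here the second record is flagged (missing required `name`), while the blank optional `middle_name` in the first record is not treated as a completeness problem.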

When is Data Quality Too Poor?

One question I hear regularly is, “At what point should we consider the data low quality, problematic, or unusable?” This is a great question, and one for which there is no right answer. The point at which your data become unusable for an analysis depends on what you are trying to accomplish. You will need to consider each dimension of data quality with respect to your data and your analytic goal. Then ask yourself this question: How much error are you willing to accept in your results due to poor data quality?

If your analysis does not need to be highly accurate, then you can probably accept a certain level of inaccuracy or missingness in your data. I typically begin to question the usefulness of my data when I get to 10 percent or more missing data values or inaccurate information.
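That rule of thumb is easy to operationalize. The 10 percent cutoff below is the illustrative threshold from the paragraph above, not a universal standard:

```python
def missing_fraction(values):
    """Fraction of values that are missing (None or empty string)."""
    if not values:
        return 0.0
    missing = sum(1 for v in values if v in (None, ""))
    return missing / len(values)

def usable(values, threshold=0.10):
    """Flag a field as usable when its missingness is at or below the threshold."""
    return missing_fraction(values) <= threshold
```

You would typically run this per field, since one sparse optional column should not disqualify an otherwise solid data set.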

If your analysis needs to be timely, you will need to determine whether the vintage of your data is recent enough, or if it is too old to be useful. You will also need to clean your data if it contains duplicate entries or inconsistencies in how the data are entered.

Finally, if your data are not comprehensive with respect to the fields you want for your analysis, you will need to consider how important those fields might be. Are they core concepts in the project? If so, perhaps you should consider waiting until you have more data. If not, perhaps you can proceed with some caution when interpreting the results.

Developing High Quality Data Systems

Your organization’s data quality is a function of two key factors. First, how well do you capture the data initially? Second, how well do your organizational processes and procedures for validating and storing data maintain that quality? Here are five proven steps you can take to promote and maintain good data quality in your organization.

Standardized Data Capture

Develop standardized data systems to capture the data coming into your organization. Where possible, make sure that data are collected using standardized forms in which every field has a predefined structure and format. Create validation rules to double-check entries and alert users when an entry does not conform to the expected format of the field. Finally, where field entries logically fall into a list of specific values, use lookup tables and drop-down menus to ensure users can only select a valid value.
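Both kinds of rule described above, format checks and lookup-table checks, can be expressed as a single rules map. The field names, the state list, and the simple email pattern are all hypothetical placeholders:

```python
import re

# Hypothetical validation rules: field name -> predicate.
# "state" is a lookup-table style rule; "zip" and "email" are format rules.
RULES = {
    "state": lambda v: v in {"CA", "NY", "TX"},
    "zip":   lambda v: re.fullmatch(r"\d{5}", v) is not None,
    "email": lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
}

def validate(record):
    """Return the fields that fail their rule, so the form can alert the user."""
    return [
        field for field, check in RULES.items()
        if field in record and not check(record[field])
    ]
```

A data capture form would call `validate` on submission and refuse the entry (or prompt for correction) when the returned list is non-empty.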

Data Review and Reconciliation

Create a process for regular data review and reconciliation. Much like a physical inventory tracking system, think of this as your data inventory tracking system. Verify the amount of data on hand, and whether the growth of your data since the last review aligns with expectations. Confirm the integrity of the data by assessing whether there are any duplicate entries, whether entries in each database table can be identified in other database tables, and whether the fields of each table align with expected values and formats. Where possible, develop procedures to remedy any issues found during the review and then perform a root-cause analysis to identify and correct the source of the problem.
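One of the checks described above, confirming that entries in one table can be identified in another, is a referential integrity check. Here is a minimal sketch using plain dictionaries, with a hypothetical `customer_id` key linking an orders table to a customers table:

```python
def orphaned_keys(child_rows, parent_rows, key):
    """Return key values in the child table that are absent from the parent table."""
    parent_keys = {row[key] for row in parent_rows}
    return [row[key] for row in child_rows if row[key] not in parent_keys]

customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [{"customer_id": 1}, {"customer_id": 3}]  # order 3 has no customer
```

In a real review this would run inside the database (for example as a join that surfaces unmatched rows), but the logic is the same: any orphaned key is an integrity issue to remedy and trace back to its root cause.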

Removing Bottlenecks

Identify any bottlenecks to the timeliness of your data processing systems. Are there processing steps that analysts perform manually? Are there processing programs that do not run automatically? In an ideal scenario, your data system would work on autopilot, allowing for quick and seamless intake and validation of data. Most organizations are far from that ideal, however. Perform a process analysis that lists each step in the data processing chain, whether it has been automated, how long it takes to run, and any downstream dependencies. This will allow you to see the bottlenecks and prioritize finding solutions to streamline your systems.
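The process inventory described above can be as simple as a list of step records. The step names and runtimes below are made-up examples; the point is sorting the manual steps by cost so you know where to automate first:

```python
# Illustrative process inventory: each step, whether it is automated,
# and its typical runtime in minutes.
steps = [
    {"name": "intake",     "automated": True,  "minutes": 5},
    {"name": "dedupe",     "automated": False, "minutes": 45},
    {"name": "validation", "automated": True,  "minutes": 10},
    {"name": "load",       "automated": False, "minutes": 30},
]

# Surface the slowest manual steps first: these are the bottlenecks
# to prioritize for automation.
bottlenecks = sorted(
    (s for s in steps if not s["automated"]),
    key=lambda s: s["minutes"],
    reverse=True,
)
```

Even this toy version makes the priorities obvious: the manual dedupe step costs the most time, so automating it yields the biggest timeliness gain.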

Assigning Data Owners

The more complex your data systems become, the more challenging it becomes for a single person or group to manage the data quality infrastructure. This is especially true when your organization has many functional divisions, such as sales, marketing, research & development, operations, customer support, HR, IT, etc., each with their own systems for collecting and managing the data they need. As your organizational complexity grows, assigning a single data owner to each data source or database can help ensure the data within that group are in good shape. The data owners hold responsibility for working within their team, as well as across teams, to ensure that data quality and integrity are maintained across the organization.

Data Quality Committee

With data owners assigned across teams, you want to prevent data silos from forming and becoming disconnected from the rest of the organization. To combat this, convene a committee of data owners on a periodic basis as part of the data review and reconciliation process to confirm that all systems are integrated and working properly, and address any concerns as a group.

The data quality committee can include all data owners as well as leadership, analysts, and other stakeholders to provide voices across all of the data touch points in the organization. The data quality committee can set the strategic direction for your data platform, identify steps to achieve the strategic vision, root out challenges and address them, and ensure smooth operations across the organization.

Conclusion

Some of the most important assets in your organization are the data used across your business functions to track performance and make decisions. The quality of your data system can make all the difference between you making timely, efficient, data-driven decisions, or having to rely on gut-feelings and guesswork.

Fortunately, the key dimensions of high data quality are easy for you to assess. Similarly, there are five proven steps you can take on your team and across your organization to improve areas where your data quality is lacking. The approaches in this article will help get you moving down the path. For additional information about improving your analytic governance model, see my post on the 7 Mistakes Your Organization is Making with Its Analytic Governance Model, and How to Address Them.
