As data is collected daily through the heavy use of computers and the internet, we continue to be plagued by the problem of dirty data. Data that is error prone or inconsistent leads to bad outcomes, such as misguided decision making (e.g. inaccurate contact information leads to increased costs from having to re-send mail or from lost opportunities.) As data analysts continue to spend large amounts of time preparing data prior to analysis or reporting, it becomes critical to clean your data.
Data cleaning, aka data cleansing, is the process of analyzing your database records and identifying data errors, inconsistencies, and duplication. After screening the data, it’s corrected through data transformations using database tools or scripting. The keys to efficient data cleaning is to be organized with your processes and to understand your data systems.
Prepare Resources and Data Analysis
The first step to efficient data cleaning is to create a well-organized plan and gather resources for analyzing your data. Focus on coordinating key stakeholders of the data, from those that prepare it to those that digest it. It’s best to be thorough when involving others to help analyze the data sources to be cleaned. All too often, issues occur when data cleaning or transformations are done in a closed environment where groups of people are unaware of changes occurring to their data.
With a coordinated group, discover how your data “looks” by inventorying the data sources and their tables. Be sure to include their records, purposes, and characteristics. Analyze the data thoroughly so that there is an understanding of the underlying structures of the sources and how well they fit for their intended purposes. This will be helpful when you move on to screening the dirty data.
Determine Data Quality to Identify Errors
To identify data errors, it’s important to define clean data as it relates to 1) the data source you’re working with, and 2) the overall data scheme at your institution. Use the work done in the preparation stage to follow table definitions and see how it fits in the overall database design. Check if records are formatted for their intended use and if they make sense. For example, when looking at a table of addresses, it’s a problem to have various formats for ZIP code or street names that don’t follow a standard. Expect some instances where you won’t know the correct data format until confirming with the data owners.
Once you’ve defined what clean data is, create a set of guidelines. These guidelines should be agreed upon by all stakeholders. This will make it easier to analyze and identify data that does not fit in the defined criteria. Data that does not follow the guidelines are to be noted for correction. The guidelines will also be useful for future data cleanings and for configuring your systems around data error prevention.
Data transformations, such as ETL processes, and data cleaning tools are used to correct the errors identified in the previous step. Technical staff are required to create these scripts and tools. Generally, data sources are moved into warehouses or cloned so that live data is not affected during this process.
The types of data correction can be obvious or can require time to sort out. It can involve anything from incorrect date formats, record duplication, typos from data entries, or misrepresentation of data. To avoid correcting the same error more than once, document all corrections and look for patterns. Database scripts can be recycled for other tasks, and measures can be taken to reduce bad data entry.
If you’re considering data cleaning tools, keep in mind that it can be costly and may not save much time – as assistance is still required to determine data quality and screening. If any are needed, weigh the pros and cons, and choose ones that you’re more comfortable with and make sense for your goals.
It’s important to only correct data when the data is complete and when definitions have been set. There’s no point in cleaning data when you know you’ll have to correct it again later.
Verification of Data Changes
Inspect and validate the changes made to the data sources with the stakeholders in the process. Using the work done previously, you should have an idea of the data quality of the updated records and their accuracy. Use this stage in the process to also think about enriching the data that has been cleaned. You can supplement the cleaned data with additional data for enhanced reporting, or there may be better ways to represent that data moving forward.
The verification process either finishes with changes being confirmed or it repeats itself to audit the data again. Repeated cycles may be needed when certain corrections were not completed in the time allotted. Shorter, more focused cycles of data cleaning can result in quicker corrections to the data but must occur more often. As more cycles occur, the processes should become more streamlined and efficient.
Remember, the result should be clean data that is consistent, uniform, and accurate.
Through data cleaning, you can see how well your data systems handle data entry errors from applications or other sources. The process can be used to improve these systems by identifying and correcting the causes of the errors.
One of the biggest causes of data errors are the users themselves. So, an improvement may be to update data entry applications to prevent users from entering certain data formats, to run error checking, or to enforce better integrity constraints. While it’s impossible to handle every situation, well-designed systems can at least ensure the data cleaning process is executed more efficiently by making it easier to identify data errors. (NOTE: Although your systems may be designed for data error prevention, users may still find ways to go around them!)
For those serious about data governance, data cleaning is a regular and important process to ensure better data quality. This philosophy originates at the top of your institution and requires commitment from those that utilize data at every level. Use your resources to improve the data entry processes and data application integration. With a well-organized plan for data cleanliness, you will be well on your way to high-quality data.
Like this blog?
Check out other blogs in this series and sign up for the webinar*:
*The June 27 webinar will expand on the content of these blogs.
Latest posts by Jonathan Kim (see all)
- Data Cleaning: A Technical Perspective - 05/17/2018
- The 5 Most Commonly Found Metrics for Student Success - 02/16/2017
- Restructuring the Report Tree - 06/15/2016