In its simplest terms, data profiling means learning the basics about the data that you are working with and discerning information from that data. Profiling data will help you to identify any data quality issues so that you can correct them. It will also help inform what data cleansing needs to occur in order to improve the data for reporting and analysis.
A data analyst should always start by profiling any data set that they plan to work with.
Steps of Data Profiling
Profiling data involves forming an understanding of the data by learning about aspects such as the following:
- The source of the data
- Keys of the data
- Relationships within the data
- Record counts
Basic Steps to Follow for Profiling Data
- Identify and document the source of the data and its integrity. For example, was this data collected in a system with a defined process, or was it manually keyed from a paper document?
- Identify the field names and data types, and determine whether they are appropriate for what you are reporting. For example, are the date fields set as dates, or are they text? Depending on what you discover, you may need to convert the data types for some fields.
- Determine the main fields needed for your reports and review the column profile of each. For example, if you are reporting on products, what products are represented in the data?
- Check whether a key field (primary, natural, or foreign) is represented where expected. For example, if working with product sale data, do all the sales have an associated product code?
- Establish the overall total of the data in the data set so that you have a baseline to check reports against. For example, if the total of every record combined is $500,000 and a report suddenly displays $1 million, there is an obvious issue to investigate.
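As a rough illustration, several of the checks above (data types, key fields, and the overall total) can be sketched in a few lines of Python using only the standard library. The sample records, the field names (sale_id, product_code, sale_date, amount), and the assumption that valid dates use the YYYY-MM-DD format are all hypothetical and chosen only for this example.

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical product sale records. Every value is text, as it would be
# after reading a CSV export.
sales = [
    {"sale_id": "1", "product_code": "P-100", "sale_date": "2023-01-05", "amount": "250.00"},
    {"sale_id": "2", "product_code": "", "sale_date": "2023-01-06", "amount": "100.00"},
    {"sale_id": "3", "product_code": "P-200", "sale_date": "06/01/2023", "amount": "150.00"},
]


def is_iso_date(text):
    """Return True if text parses as a YYYY-MM-DD date (assumed format)."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False


# Data type check: which sales have a date field that is not a valid date?
bad_dates = [row["sale_id"] for row in sales if not is_iso_date(row["sale_date"])]

# Key field check: which sales have no associated product code?
missing_keys = [row["sale_id"] for row in sales if not row["product_code"].strip()]

# Record count and overall total, kept as a baseline for later reports.
record_count = len(sales)
control_total = sum(Decimal(row["amount"]) for row in sales)

print("Sales with invalid dates:", bad_dates)
print("Sales missing a product code:", missing_keys)
print("Record count:", record_count)
print("Control total:", control_total)
```

Decimal is used for the total so that currency amounts are not affected by floating-point rounding. If a later report shows a different total, the baseline makes the discrepancy easy to spot.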
Data Profiling Tools and Techniques
Data analysts can use either manual techniques or advanced software to profile data. However, basic manual techniques become a challenge when working with large amounts of data. Most of the popular data tools (such as Power Query in Excel, Power BI, and Tableau) contain built-in features to help you profile data. In Power Query, the process is as simple as checking the column profile checkbox. In Tableau, profiling data requires just a right-click and the selection of the Describe option.
Column profiling gives basic insight into a single column of data. For numeric data types, you can see the maximum, minimum, and average of the column. For text data types, you can obtain distinct or unique counts of the values in the column.
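To make the idea concrete, here is a minimal sketch of a manual column profile in Python. The function name profile_column and the sample columns are hypothetical. The sketch also assumes one common convention for the two counts: "distinct" is the number of different values in the column, and "unique" is the number of values that appear exactly once. A given tool may define these terms differently.

```python
from collections import Counter
from statistics import mean


def profile_column(values):
    """Return a basic profile of one column; None marks a missing value."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    profile = {
        "count": len(values),
        "missing": len(values) - len(present),
        # Number of different values in the column.
        "distinct": len(counts),
        # Number of values that occur exactly once (assumed convention).
        "unique": sum(1 for n in counts.values() if n == 1),
    }
    # Minimum, maximum, and average only make sense for numeric columns.
    is_numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if is_numeric:
        profile.update(min=min(present), max=max(present), mean=mean(present))
    return profile


# Hypothetical sample columns.
prices = [10, 20, 30, 20, None]
products = ["Widget", "Gadget", "Widget", "Gizmo"]

price_profile = profile_column(prices)
product_profile = profile_column(products)

print(price_profile)
print(product_profile)
```

A hand-rolled profile like this is workable for small data sets. For larger ones, the built-in profiling features of the tools mentioned above are usually the more practical choice.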
Conclusion
We have learned why it is important for the data analyst to profile data and which elements of the data are assessed when profiling. We can also now name a few popular tools that can be used to profile data.