Power BI offers a number of data profiling and data exploration features that help developers understand the data that they are modeling by automatically sampling the data and providing high level detail about it. This high-level analysis is a great way to identify null values, duplicates, and understand the distribution of values across a dataset.
While not strictly necessary, we consider it a best practice to run an analysis on your data prior to building a dashboard. Null and duplicate values can at times cause incorrect calculations or at worst, cause calculation errors. In small datasets you may be able to visually explore the data, but in large datasets it wouldn’t be practical.
Luckily Power BI gives users a set of features to investigate their data.
Let’s dive in!
Table of Contents
What is Data Profiling?
Data profiling is the process of analyzing your data at a high level. Rather than reviewing each individual row in a dataset which is either an inefficient use of time or impossible with a large enough dataset, we allow the analysis software to summarize the data and provide statistical information.
The process is crucial in understanding the quality, structure, and distribution of data. In the context of Power BI, it’s a quick and efficient way to avoid potential calculation errors and detect problems with data prior to finalizing a data model and building a report that may display inaccurate information.
The following example is a screenshot of the Power BI Column Quality for a column of values related to unit sales. You can quickly see the range and distribution of values.
This is just one of the features available.
Why is Data Profiling Important in Power BI
When using the data profiling features in Power BI, they not only help you avoid errors for future calculations or data analysis but they also help you understand the data. It’s fairly common for a business analyst or a dashboard developer to receive an assignment to work with a dataset that they are unfamiliar with. At times, you may even find that the process owner themselves don’t fully understand a dataset.
By utilizing the data profiling features available in Power BI, it helps you make an informed data model and ask some questions up front before going too far into dashboard development.
Questions to ask a process owner:
- What is the relevant range of sales?
- Do Null values make sense?
- Does it make sense to have X number of rows?
- Should specific values be unique?
Of course these will all vary depending on the specific use case of your data, but by asking informed questions you can determine if a data feed is incorrect or if source data is inaccurate before taking the time to build out a report based on bad data.
Power BI Data Profiling Features Explained
The data profiling features of Power BI are all available in Power Query, the data modeling and transformation side of Power BI. This is the area that data is imported, cleaned up, and processed prior to becoming available in the report view to add to visuals and create your report.
Data profiling and analysis features are available in Power Query, in the Data Preview section of the View Tab.
There are three major features, Column Quality, Column Profile, and Column Distribution and two minor features, Monospaced, and Show Whitespace.
Let’s look at each one, but first we need to enable profiling of the full dataset instead of the first 1,000 rows.
Enable Profiling of a Full Dataset in Power BI
The default setting in Power Query is to only sample the first 1,000 rows of data when exploring Column Quality, To enable profiling of a full dataset, click on the text, “Column profiling based on top 1,000 rows” at the bottom of the Power Query Window. It turns into a menu and select profiling based on entire dataset.
It is not immediately clear that this is a clickable option, but you will see the text in the bottom left corner of Power Query.
After enabling full profiling on the entire dataset statistics will be provided on all available data. The only drawback to this method is that profiling will take significantly longer when working with large data sets.
The Column Quality feature in Power BI provides a very small, high-level overview of each column. It’s primary purpose is identifying Errors and Empty values in a dataset for each column. Once enabled it adds a preview of the data analysis at the top of each column.
You can move your mouse over the column quality information to view the count of values that make up the percentage of distribution.
An added feature is that you can also click the three dots on the bottom of the analysis, or right click on any of the boxes at the top of each column to quickly remove or replace errors and keep duplicate values if you want to explore them further.
As a top level view for data cleanliness Column Quality can be a great starting point, and is useful to keep turned on during the data modeling process.
Column Distribution in Power Query can either replace or compliment the Column Quality view. In the example below we have added Column Distribution to Column Quality. The top of each column in Power Query will display the percentage of each value, then also show how many distinct and unique values are available.
By being able to view how many distinct and unique values are available in a dataset you may be able to identify potential errors. For example, if there are more than 12 unique values in a column that displays the month of a year, or if a single value repeats many different times in time-series data where each date should be unique.
Column Profile provides detailed statistics of data in each column. After turning it on, you can click on a column to highlight it and Power BI will analyze the values and provide detailed statistics. It includes metrics like min/max values, average, and distinct count.
This feature is particularly useful for financial analysis, where understanding metrics like average transaction value or the range of prices can provide insights into business performance and act as a gut check to ensure that the data you are working with is reasonable.
Column Profile can take up a decent amount of screen space. You can adjust the panel size by moving your mouse over the edge of the panel and drag it up, down to make it smaller or larger. Because it’s so useful to see as much information as possible on a single screen, this is one of the reasons we recommend using Power BI with a high resolution ultra-wide monitor.
Additional Options Available
Power BI also offers additional options that we don’t necessarily consider data profiling but can be very useful when exploring data cleanliness that you should be aware of. They are also available in the Power Query View section under Data Preview.
- Monospaced: This feature displays data in a monospaced font, which forces each character and letter to be the same width. For example, io and ii typically take up different widths making it difficult to eyeball that you have the same number of characters in a column.
- Show Whitespace: Enabling this option reveals any whitespace characters in the data. A common use case is identifying leading or trailing white space when a merge looks like it should have worked, but left values remaining.
We don’t use these two features super often, but are useful to know that they are there when you need them. Luckily, Power BI makes it easy to toggle all of these features on and off as needed.
Using the data profiling and quality features in Power BI are a quick and efficient way to gain a high level understanding of the data available. It provides a meta data analysis, or data about your data that would otherwise be difficult to get. This is especially true with larger datasets.
All of the data exploration features in Power BI are available in the Power Query View. As an alternative, we have at times also built out a hidden tab in the Power BI Report view and add calculations to a Matrix visual, such as average value, row count, min/max as a way to check that data pulling into the report is as expected without having to launch Power Query or edit the semantic layer..