Data duplication can cause serious problems with a Power BI data model and can lead to bad information being presented to end users. We’ll look at how to identify and remove duplicates in Power BI, along with tips and tricks to detect and avoid them in the future.
This article will use the following example dataset that has two sets of duplicated rows.
Power BI provides a number of different ways to remove duplicates.
We’ll start with the easiest fastest way to detect and remove duplicates, and then look at some best practices to find them in the future.
Table of Contents
While this specific dataset is small, identification of duplicates is equally important when working in large datasets that are not as easy to manually review.
Why You Should Pay Attention Duplicates in Power BI
Note, that we are not wholesale saying that you should remove duplicates from your dataset. Each dataset is unique and there can be legitimate reasons to keep duplicate values in your dataset.
There are a number of issues that can arise when there are duplicate values in your dataset. A few of them are as follows.
- Primary Key Errors – Power BI can give you an error when creating relationships between tables with non-unique primary keys between multiple tables.
- Incorrect Totals and Calculations – It’s possible to accidentally include duplicate values in a dataset causing sums, totals, averages and other calculations to be mis-represented causing inaccurate information to make it to the report viewer.
- Excessive Data Size – Duplication of data can cause file sizes to grow needlessly. While Power BI is highly data efficient, it’s possible to slow the performance of it especially when working with large calculations.
- Slow Calculations – Some calculations in Power BI require fully sampling all rows of a dataset. The more rows in the dataset the slower Power BI will perform the calculations creating a bad user experience.
The first step to fixing duplicate values is to identify them.
How to Quickly Identify Duplicate Rows in Power BI
Power BI provides users with a number of Data Preview tools. These allow you to view the metadata, also known as data about your data. They are descriptive values that are meant to give users a quick overview of what their data looks like without having to manually review it.
To view the meta data, first launch the Power Query Editor. The fastest way is to right click on a dataset and click on “Edit Query”
Once you’re in the Power Query Editor in Power BI, navigate to:
View > Column Distribution and Column Profile
This will place a small chart at the top of each column of your dataset. The bars tell you how many times a value appears within a column.
At the bottom of your data preview, Power BI will provide you with the total count of rows along with how many distinct values there are. If the number of distinct values is below the count of total rows it means that there are duplicates within your dataset that should be explored further.
There are also a number of additional data preview options to view data samples. We recommend testing them out and seeing which ones you find most valuable.
You can also mouse over value distributions to see the exact count for that specific value.
How to Remove Duplicate Values in Power BI
The fastest way to remove duplicates in Power BI is to right click on a column header in Power Query and select Remove Duplicates. Power Query will add a step to remove any duplicate values for that specific column. A drawback with this method is that the values are no longer visible in the dataset going forward, and Power BI does not give you a summary of the rows that were removed.
A second option for removing duplicate rows is by navigating to Home > Remove Rows in the Power Query ribbon of Power BI.
It works the same as right clicking on a column header. There are some additional options to remove blanks, errors, as well.
Both of the methods mentioned will remove duplicate rows automatically.
However, there are times that you want to view the duplicate rows before removing them. We’ll look at how to keep duplicate rows next.
How to View Duplicate Rows in Power BI Using Power Query
Duplicate rows can be viewed as part of a Power Query data model by following these steps.
Step 1.) Launch the Power Query Editor
Step 2.) Right Click on a Query and Select Reference
A reference query in Power BI creates a link to another query or table within your data model. It means that you can apply additional steps without impacting the original dataset.
Another option would be to duplicate your dataset, but doing so will fork the two queries. Duplicating a table can also cause an increase in file size of the .pbix file and may not be practical in all cases.
Step 3.) On the new reference table, go to Home > Keep Rows > Keep Duplicates
The results are a second data table that you could re-name and keep in your data model as a reference.
Once you remove the duplicates from your primary data source, this table will no longer show the duplicates if it is a reference table. A reference table starts where the previous query ends.
Using Group By to Consolidate Duplicate Rows
When you want to aggregate rows instead of filtering them, use the Power BI Group By function to consolidate duplicate rows. From the Power Query editor, go to Home > Group By
The following screen will appear with a Basic and an Advanced mode. In our example below, we will tell Power BI to keep all of our columns as groups within our dataset.
Then we add an aggregation method to Sum our Sales Column.
Power BI will go through all of the rows in a dataset, and remove the duplicate rows while consolidating the values into a single row.
By using the Power BI Group By function, you can consolidate rows. Ultimately this will take duplicates out of your dataset, but also ensures that no data is lost in the process.
Creating a Check Tab in a Power BI Report to Prevent Duplication Errors
Creating a hidden check tab in a Power BI report is another way that you can identify and prevent duplication errors. It can be easier to use than building checks directly into Power Query and has additional benefits of being able to show multiple data checks on a single page of a reporting dashboard.
In the example below, we created a Matrix visualization that consists of Item ID and the Count of Item ID on our sales table.
This setup will show how many times a value appears in a dataset. We sorted our visual to show rows that have a value count from high to low. You could also setup a filter on the visual to only show counts > 1 to identify duplicates.
Note: Once you’ve added the duplicate row check visual in Power BI, you can right click on the tab and select Hide Page. This will make it visible to anybody that opens the .pbix file such as a Power BI developer and hides it from the end user.
Try adding multiple checks onto a tab to ensure that the data is clean. We commonly use this practice when building out financial statements and ensure that all accounts are mapped and that the trial balance will zero out once it’s being summed.
Preventing Duplication of Incoming Data
The best way to handle duplicate data in Power BI is to ensure that it isn’t duplicated in the first place. Everyone’s situation will vary, but here are some considerations to keep in mind.
- Import Data Closes to the Source System – A common cause of duplicate data is using a dataset that has already been modified. An example is an Excel file that comes from a user who makes manual adjustments and calculations. Doing so increases the risk of importing bad information that was caused by user error. By importing data that is directly exported from a system you can mitigate these risks.
- Use Unique Keys – Applying unique keys to data helps in identifying and preventing duplicate entries. If a dataset requires a unique identifier, enforcing this constraint ensures that duplicates cannot be inserted in the source system. Many SQL table structures will have primary keys.
- Automate Duplicate Checks – Even if your dataset is clean at the time that a report is published, it does not mean that the incoming data won’t change over time. Setup a Report Checks tab or table to automatically e-mail you once a month or week with the data to ensure that new errors have not occurred.
By following these best practices you will be able to avoid duplicate data from entering into your Power BI data model or at the very least identify when incoming data changes so that it can be addressed quickly.
You can quickly identify duplicates in Power BI by using some of the sampling functions available within Power Query. Power BI will look at the metadata of each column and provide a graph of how often values occur within the dataset.
Removing duplicates in Power BI can be as simple as right clicking on a column name and choosing Remove Duplicates. Developers can also create separate reference tables in Power Query to see which records have been duplicated.
Outside of Power Query, users can also build out a Check tab and create a matrix visual that counts the number of times that a value appears. Sort of filter the count of row values, and either hide the tab or schedule it to automatically e-mail you to ensure that new errors do not occur over time.