How to Find and Remove Duplicates in Power BI

Data duplication can cause serious problems with a Power BI data model and can lead to bad information being presented to end users. We’ll look at how to identify and remove duplicates in Power BI, along with tips and tricks to detect and avoid them in the future.

This article will use the following example dataset that has two sets of duplicated rows.

Screenshot of duplicate rows and values in Power BI

Power BI provides a number of different ways to remove duplicates.

We’ll start with the easiest fastest way to detect and remove duplicates, and then look at some best practices to find them in the future.

While this specific dataset is small, identification of duplicates is equally important when working in large datasets that are not as easy to manually review.

Why You Should Pay Attention Duplicates in Power BI

Note, that we are not wholesale saying that you should remove duplicates from your dataset. Each dataset is unique and there can be legitimate reasons to keep duplicate values in your dataset.

There are a number of issues that can arise when there are duplicate values in your dataset. A few of them are as follows.

  • Primary Key Errors – Power BI can give you an error when creating relationships between tables with non-unique primary keys between multiple tables.
  • Incorrect Totals and Calculations – It’s possible to accidentally include duplicate values in a dataset causing sums, totals, averages and other calculations to be mis-represented causing inaccurate information to make it to the report viewer.
  • Excessive Data Size – Duplication of data can cause file sizes to grow needlessly. While Power BI is highly data efficient, it’s possible to slow the performance of it especially when working with large calculations.
  • Slow Calculations – Some calculations in Power BI require fully sampling all rows of a dataset. The more rows in the dataset the slower Power BI will perform the calculations creating a bad user experience.

The first step to fixing duplicate values is to identify them.

How to Quickly Identify Duplicate Rows in Power BI

Power BI provides users with a number of Data Preview tools. These allow you to view the metadata, also known as data about your data. They are descriptive values that are meant to give users a quick overview of what their data looks like without having to manually review it.

To view the meta data, first launch the Power Query Editor. The fastest way is to right click on a dataset and click on “Edit Query”

Right click a data table to launch the Power BI Power Query Editor

Once you’re in the Power Query Editor in Power BI, navigate to:

View > Column Distribution and Column Profile

This will place a small chart at the top of each column of your dataset. The bars tell you how many times a value appears within a column.

At the bottom of your data preview, Power BI will provide you with the total count of rows along with how many distinct values there are. If the number of distinct values is below the count of total rows it means that there are duplicates within your dataset that should be explored further.

Enable Power BI Column Distribution and Column Profile to see how many duplicate values are in a column

There are also a number of additional data preview options to view data samples. We recommend testing them out and seeing which ones you find most valuable.

You can also mouse over value distributions to see the exact count for that specific value.

How to Remove Duplicate Values in Power BI

The fastest way to remove duplicates in Power BI is to right click on a column header in Power Query and select Remove Duplicates. Power Query will add a step to remove any duplicate values for that specific column. A drawback with this method is that the values are no longer visible in the dataset going forward, and Power BI does not give you a summary of the rows that were removed.

Right click a column name in Power BI and choose Remove Duplicates to add a step to remove duplicate values in a column

A second option for removing duplicate rows is by navigating to Home > Remove Rows in the Power Query ribbon of Power BI.

It works the same as right clicking on a column header. There are some additional options to remove blanks, errors, as well.

Power BI developers can also use the remove duplicates option under Remove rows on the Power BI ribbon

Both of the methods mentioned will remove duplicate rows automatically.

However, there are times that you want to view the duplicate rows before removing them. We’ll look at how to keep duplicate rows next.

How to View Duplicate Rows in Power BI Using Power Query

Duplicate rows can be viewed as part of a Power Query data model by following these steps.

Step 1.) Launch the Power Query Editor

Step 2.) Right Click on a Query and Select Reference

Create a reference table in Power BI Power Query to view duplicate rows

A reference query in Power BI creates a link to another query or table within your data model. It means that you can apply additional steps without impacting the original dataset.

Another option would be to duplicate your dataset, but doing so will fork the two queries. Duplicating a table can also cause an increase in file size of the .pbix file and may not be practical in all cases.

Step 3.) On the new reference table, go to Home > Keep Rows > Keep Duplicates

Use Power BI Keep Duplicates option to filter on only duplicated values.  This will let you preview duplicates before removing them for a dataset

The results are a second data table that you could re-name and keep in your data model as a reference.

Screenshot of a reference column created in Power Query

Once you remove the duplicates from your primary data source, this table will no longer show the duplicates if it is a reference table. A reference table starts where the previous query ends.

Using Group By to Consolidate Duplicate Rows

When you want to aggregate rows instead of filtering them, use the Power BI Group By function to consolidate duplicate rows. From the Power Query editor, go to Home > Group By

Screenshot of the Power BI Group By button in Power Query

The following screen will appear with a Basic and an Advanced mode. In our example below, we will tell Power BI to keep all of our columns as groups within our dataset.

Then we add an aggregation method to Sum our Sales Column.

Screenshot of Power BI Advanced Group BY Settings used to add duplicate rows together

Power BI will go through all of the rows in a dataset, and remove the duplicate rows while consolidating the values into a single row.

Final result of grouped columns in Power BI that add duplicate rows together instead of filtering them from a dataset.

By using the Power BI Group By function, you can consolidate rows. Ultimately this will take duplicates out of your dataset, but also ensures that no data is lost in the process.

Creating a Check Tab in a Power BI Report to Prevent Duplication Errors

Creating a hidden check tab in a Power BI report is another way that you can identify and prevent duplication errors. It can be easier to use than building checks directly into Power Query and has additional benefits of being able to show multiple data checks on a single page of a reporting dashboard.

In the example below, we created a Matrix visualization that consists of Item ID and the Count of Item ID on our sales table.

This setup will show how many times a value appears in a dataset. We sorted our visual to show rows that have a value count from high to low. You could also setup a filter on the visual to only show counts > 1 to identify duplicates.

Use a Matrix Visualization in Power BI with a column name and count of column values to identify duplicate rows in a report tab

Note: Once you’ve added the duplicate row check visual in Power BI, you can right click on the tab and select Hide Page. This will make it visible to anybody that opens the .pbix file such as a Power BI developer and hides it from the end user.

Hide a reporting check tab to provide insights of reporting errors without launching Power Query or automate it with an e-mail alert.

Try adding multiple checks onto a tab to ensure that the data is clean. We commonly use this practice when building out financial statements and ensure that all accounts are mapped and that the trial balance will zero out once it’s being summed.

Preventing Duplication of Incoming Data

The best way to handle duplicate data in Power BI is to ensure that it isn’t duplicated in the first place. Everyone’s situation will vary, but here are some considerations to keep in mind.

  • Import Data Closes to the Source System – A common cause of duplicate data is using a dataset that has already been modified. An example is an Excel file that comes from a user who makes manual adjustments and calculations. Doing so increases the risk of importing bad information that was caused by user error. By importing data that is directly exported from a system you can mitigate these risks.
  • Use Unique Keys – Applying unique keys to data helps in identifying and preventing duplicate entries. If a dataset requires a unique identifier, enforcing this constraint ensures that duplicates cannot be inserted in the source system. Many SQL table structures will have primary keys.
  • Automate Duplicate Checks – Even if your dataset is clean at the time that a report is published, it does not mean that the incoming data won’t change over time. Setup a Report Checks tab or table to automatically e-mail you once a month or week with the data to ensure that new errors have not occurred.

By following these best practices you will be able to avoid duplicate data from entering into your Power BI data model or at the very least identify when incoming data changes so that it can be addressed quickly.

Conclusion

You can quickly identify duplicates in Power BI by using some of the sampling functions available within Power Query. Power BI will look at the metadata of each column and provide a graph of how often values occur within the dataset.

Removing duplicates in Power BI can be as simple as right clicking on a column name and choosing Remove Duplicates. Developers can also create separate reference tables in Power Query to see which records have been duplicated.

Outside of Power Query, users can also build out a Check tab and create a matrix visual that counts the number of times that a value appears. Sort of filter the count of row values, and either hide the tab or schedule it to automatically e-mail you to ensure that new errors do not occur over time.

Scroll to Top