How to Remove Duplicates in Excel Power Query

Power Query offers multiple ways to remove duplicates from datasets ranging from using the remove duplicate function, applying custom filters, and handling duplicate values by aggregating them by grouping them. The method used largely depends on the style of data that you are working with.

Removing duplicates in Power Query can be as quick as using the remove duplicates button, duplicates can be identified in data profiling mode or you can group and filter for more control over the process.

Power Query is a great tool for removing duplicates, it’s much more efficient that using other methods in the standard Excel worksheet interface due to its ability to work with larger datasets, data profiling capabilities, and ease of use to perform operations that would normally take several steps in Excel.

Let’s jump in!

Why Remove Duplicates in Excel Power Query

Removing duplicates in Excel datasets is an important step in preparing data for analysis. Without removing them you run the risk of providing incorrect analysis, or if you’re using Power Pivot and merging your data with multiple tables you run the risk of duplicate entire rows of data making your resulting dataset misleading at best and wildly incorrect at worst.

There are also a number of benefits when using Power Query over standard Excel, such as the ability to work with larger datasets, more advanced data transformation features, and the ability to automate data refreshes.

Some additional benefits include:

  • Ensure Data Accuracy – End analysis can only be as good as the beginning data.
  • Reduce Processing Times – Removing duplicates will reduce the total row count and reduce the speed that Power Query takes to process additional data transformation steps.
  • Streamline Data Analysis – Power Query can be setup for automatic data refreshes, removing duplicates protects you from changes in base data from one period to the next.

While it can be easy to agree that removing duplicates is an important step in data analysis, it can be more difficult recognizing when columns of data contain duplicate values to begin with.

Luckily, Excel’s Power Query feature gives you several ways to sample data and identify them.

How to Identify Duplicates in Excel Power Query

To identify duplicates in Power Query, launch the Power Query Editor, go to the View section of the Power Query Ribbon and enable Column Distribution and Column Profile options under the Data Preview section. The Data Preview features provide a count of total records along with the distinct count of values that help identify duplicates.

The below screenshot highlights where the options are to enable, along with column statistics that change based on the selected column.

Screenshot of data profiling in Excel Power Query using the Data Preview options for Column Distribution and Column Profile

Note: By default, Power Query only analyzes the first 100 rows. If you click on “Colum Profiling Based on top 100 Rows” at the bottom of the Power Query window you can tell Power Query to sample data based on the entire dataset. It will take longer to process but provide a more robust data profile.

Before you begin transforming and analyzing data, it is a best practice to review the structure of the data. Having a high level understanding of whether the data includes duplicates, errors, null values, blanks and ensuring that the range of data makes sense for the dataset can all provide clues as to whether or not the data came in correctly and how it needs to be prepared for analysis.

After reviewing your dataset, you can uncheck the Data Preview options to provide more screen space to work with your dataset.

How to Remove Duplicates in Excel Power Query

To quickly remove duplicates using Excel Power Query, launch the Power Query editor, select a column or multi-select several columns then right click on a column header and select Remove Duplicates. Duplicates will be removed based on the selected fields. Alternatively, you can use the Group by function or utilize custom filters.

Here are the different methods in more detail.

Option 1.) The “Remove Duplicates” Button

The fastest way to remove duplicates is to right click on a column name in the Power Query Editor and from the drop down menu, select Remove Duplicates.

Screenshot of the Remove Duplicates option in Power Query for Excel

Tip: You can select multiple columns by holding down CONTROL + Left Click, or SHIFT + Left Click to select a range of columns. Then you can right click any of the selected column names and Remove Duplicates. Power Query will use all columns as criteria for removing duplicates.

To keep track of how many rows are being removed, Power Query shows a row count at the bottom left of the Power Query Editor. Toggle between applied steps to see the row count before and after removing duplicates.

Option 2.) Custom Filtering

When you specifically know that your dataset includes duplicated values, such as the screenshot below where we can see the Value Distribution using the Column Profile option, you can click the triangle at the top of a column to filter the value out of the data.

How to filter out duplicates in Power Query for Excel with data profile view enabled.

This method provides you with more control over removing specific duplicated values. The data profiling method lets you view the record count, and then selectively remove values. In Power Query removing and filtering values are similar as the data that is loaded back into your Excel workbook will be the final data after all Power Query applied steps.

Option 3.) Power Query Group By Feature

The Group By feature of Power Query can be used to identify duplicate values, as well as to remove them by aggregating the duplicated values. This method differs from the Remove Duplicates option because the data will be consolidated instead of being dropped from your dataset.

To use the Group By feature of Excel Power Query, you can either select Group by from the Home tab of the Power Query Ribbon, or you can right click on a column name and select Group By.

Screenshot of the Group By button in Excel Power Query

After launching the Group By function, the grouping editor will pop up. You can either work with Group by in Basic mode for consolidating data based on a single column or set of values, or you can toggle to Advanced mode which lets you use multiple columns and aggregate multiple columns.

In the example below we are telling Power Query to calculate the Total by Transaction ID taking the Sum of the Quantity field.

Screenshot of the Advanced Group By functions in Power Query Editor for Excel

The result of the grouping is as follows. The Quantity field is aggregated using a SUM function and a unique list of Transaction ID’s are left over.

Example of data that has been grouped showing the resulting grouped and aggregated values

Another way to think of grouping is that it is similar to adding rows and values to an Excel Pivot Table. The rows on the left side can be set with a specific grouping order, and the values on the right side are aggregated at those different levels.

Power Query will not allow you to add subtotals, but it will prep the data to limit your dataset and aggregate the rows based on selected, effectively removing duplicates while maintaining the total value of your columns.

Conclusion

Removing duplicates in Excel Power Query ensures the accuracy and integrity of your analysis. Compared to the standard Excel interface, Power Query performs these operations much faster and has advanced capabilities such as built in data profiling features to identify duplicates, and even a remove duplicates button that you can access by right clicking on a column title.

For more advanced duplicate removal, you can choose to filter out specific values, effectively removing them from your dataset when you press Close & Load to exit Power Query or you can use the Group By function to aggregate data ensuring that duplicates are removed without deleting necessary counts or totals.

Scroll to Top