How to Remove or Group Duplicate Values in Alteryx

Duplicate values in Alteryx can be a real pain and are easy to remove using the Alteryx Unique tool. However, what can be more difficult is determining which duplicate you want to keep and sometimes determining why your workflow is duplicating rows in the first place.

Duplicate values in Alteryx are often removed with a unique tool or consolidated with the summarize tool

We’ll explain how to remove duplicates from your data set along with some tips and tricks to help you identify duplicates to make sure that the data coming into a workflow is in line with the data that’s going out of a workflow. We’ll also go over how to avoid duplication of values in the first place.

Removing Duplicates with the Alteryx Unique Tool

To remove duplicate values from an Alteryx workflow, go to the Preparation Toolbar and drag and drop the Unique Tool into your workflow. Once connected to incoming data, there is an output for Unique Values (U) and Duplicate Values (D). The tool selects the first unique record from the top down when duplicates exist.

The Unique tool is located on the far right of the Preparation Toolbar of Alteryx. Tools are ordered alphabetically from left to right.

Screenshot of the Unique tool in the Alteryx Toolbar

Once selected, you can configure the Unique tool to tell it which columns to take into account when determining duplicate values. When multiple columns are selected, the combination of values across all selections must be unique. Alternatively, you can select a single column and the Unique tool will split the data into Unique and Duplicate outputs.

Configuring a Unique tool to remove duplicates based on a single column in Alteryx

When using the Unique tool, the first value that appears from the top of the dataset will be included in the output.

Using a Unique Tool to Select the Most Recent Record

If your data contains multiple records of data across different dates, you can use the Unique Tool combined with a Sort Tool to select the most recent record. Simply sort Descending by Date with a Sort Tool. Then use a Unique tool to filter on an applicable category. The Unique values will be for the most recent date.

The example below shows the setup of first sorting data using the Sort tool. Then attaching it to a Unique tool to select the first record of the selected column.

Using a sort tool combined with a Unique tool to select the first record or the most recent date

After it’s sorted, you can apply a Unique Tool to tell Alteryx which records to keep going out of the Unique (U) output. The remaining records will fall off as duplicate records (D Output)

Example output of a sort and unique tool combined to remove specific duplicate values

This setup is fairly common, Another way to achieve a similar result is by using a Summarize Tool which can be set to Max or First values as an aggregation method.

Consolidate Duplicates with the Summarize Tool

To remove duplicate values without losing data, you can use a Summarize tool to aggregate them. The Summarize tool groups data based on category and can aggregate values by summing, averaging, selecting the first record etc. Once grouped all values will be unique unless a sub-category exists at a lower level.

Here’s how to set it up in more detail.

Connect a Summarize tool from the Transform section of the Alteryx Toolbar to your incoming data stream.

The Summarize tool in Alteryx from the Transform Toolbar

The Summarize tool is configured by selecting columns at the top of the configuration panel on the left, and clicking the “Add” button half-way down the screen. The fields that are added at the bottom become the aggregation hierarchy. You can add one or many different levels of grouping and Alteryx will automatically create a hierarchy similar to adding multiple rows to an Excel Pivot Table.

The Summarize tool in Alteryx being configured to remove and consolidate duplicates

When you run the workflow the results will be the columns you have added with the column at the top being the highest tier of grouping. Values can be assigned different actions or methods of aggregation by using the dropdown menu next to each field name after it’s added to the section at the bottom.

Setting aggregation methods in an Alteryx Summarize tool

This will generate an output of unique categories but the values will be aggregated instead of being dropped off like would happen when using a Unique tool.

Identify Duplicates with the Count Records Tool

A best practice when working in Alteryx is to keep track of your record count, or the number of rows that exist within a data set. This concept is especially important when working with complex data transformations to ensure that records are not unintentionally dropped off or duplicated.

The number of rows or records is shown in the workflow results panel at the bottom of a page when clicking on the output of a tool.

You can also add a Count Records Tool from the Transformation section of the Alteryx Toolbar to publish the number of records as part of a check figure tab or calculation.

How to identify the number of records in a dataset in Alteryx using the workflow results pane

The Count Records Tool will only show a single value, the number of rows in an incoming data set which limits its usability, it’s often used as an offshoot of a larger workflow to act as a check figure.

Using a Count Records tool in Alteryx to output the number of records available at a point in a workflow

An alternative to the Count Records tool is using a regular Summarize tool set to count the number of values in incoming data. The Summarize tool provides more flexibility and the option to Sum or total incoming data as well as outgoing data for more advanced check figures.

Avoiding Duplication of Records

The most common reasons for records to duplicate in Alteryx are bad Join tool setup. Pay attention to the record count of each data set going into a Join tool and the number of records leaving the Join (J) output of the Join tool. Records will duplicate where there is a one to many or many to many relationship.

In the example below, the values of each table are not unique which causes duplication. It becomes apparent when there are 736 rows of data going into the left and right side of a join, but the output is a total of 135,424 records.

Example of records being duplicated coming out of an Alteryx Join tool

To avoid duplication you may want to examine the incoming data of both sides, or add a Unique or Summarize tool prior to the join to ensure that the data going into the Join is unique and will not be duplicated.

The append tool is also likely to duplicate records when there are non-unique values being appended to an existing table. Be cautious of the record count using either of these tools to avoid potential calculation errors further down stream.

Conclusion

Duplicate values can quickly be removed from an Alteryx workflow using the Unique tool available in the Preparation section of the Alteryx Toolbar. The Unique tool accepts one input and outputs both Unique and Duplicate records based on the evaluation criteria selected.

The drawback with the Unique Tool is that it drops values that you may otherwise need. To remove duplicate values by aggregating and grouping data use a Unique Tool from the Summarize section of the Alteryx Toolbar. Summarize works more similarly to an Excel Pivot Table or SQL Group By function.

Related Articles

Scroll to Top