Part of Microsoft’s revamp of the Power BI workspace interface was integrating it with a number of capabilities that were historically only available within Microsoft Azure. One of Microsoft’s strategies over the last several years has been to offer products on its Azure cloud service at wholesale prices, targeted at professional developers, then re-brand them with a friendlier interface and sell them to less technical business users at retail prices.
A couple of examples of this are Azure Logic Apps and Power Automate, and Azure Cognitive Services and Microsoft AI Builder. The products in each pair are almost identical in functionality, but the ones targeted at business users are generally more expensive and easier to use.
This brings us to the Data Factory section of the new Microsoft Fabric experience. If you’ve looked around the Fabric trial, you’ll have seen that it’s now available in the same location as your Power BI workspace.
We’ll briefly explain what each of these components is and when to use them.
What is Microsoft Data Factory?
Microsoft Azure Data Factory is a cloud-based data integration service that allows individuals and organizations to create, schedule, and manage data pipelines to move and transform data at scale. Pipelines are essentially workflows that run on a schedule and tell the cloud service when and what data to begin copying from one location to another.
What are Power BI Dataflows?
You can think of Power BI Dataflows as Power Query in the cloud. You can connect to data, transform it, clean it, and prep it for use elsewhere. We wrote a guide to Power BI dataflows Gen2 that highlights the differences from Gen1 dataflows and the features that are currently available.
What are Data Pipelines?
Data pipelines are a series of processes and workflows that facilitate movement, orchestration, and transformation of data from one or more sources to destinations. They can be scheduled to automate the flow of data and ensure that it’s accessible and usable for analysis, machine learning, or reporting.
Some of the key stages of data pipelines include:
- Data Ingestion – Bringing data into a pipeline.
- Data Transformation – Modifying, cleaning, and prepping the data; in Fabric, this can be performed with Dataflows.
- Data Orchestration – Sequencing data pipelines. For example, one data source may need to finish updating before starting a second stage of data ingestion.
- Data Output – Placing the new or updated data into its final location.
- Data Monitoring – Notifications if automated data pipelines fail, or summary information about your data such as number of null rows, number of copied records, and information about data quality.
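The stages above can be sketched conceptually in a few lines of Python. This is purely illustrative: the function names and sample data are hypothetical, and a real Fabric pipeline would use built-in activities rather than hand-written code.

```python
def ingest():
    # Data Ingestion: bring raw records into the pipeline (hypothetical sample data)
    return [{"id": 1, "name": "Ada"}, {"id": 2, "name": None}, {"id": 3, "name": "Grace"}]

def transform(rows):
    # Data Transformation: drop rows with missing values (a dataflow-style cleanup)
    return [r for r in rows if all(v is not None for v in r.values())]

def output(rows, destination):
    # Data Output: place the cleaned data in its final location
    destination.extend(rows)

def run_pipeline(destination):
    # Data Orchestration: run the stages in sequence
    raw = ingest()
    clean = transform(raw)
    output(clean, destination)
    # Data Monitoring: summary information about the run
    return {"ingested": len(raw), "copied": len(clean), "null_rows": len(raw) - len(clean)}

lake = []
summary = run_pipeline(lake)
print(summary)  # {'ingested': 3, 'copied': 2, 'null_rows': 1}
```

The monitoring summary at the end mirrors the kind of run statistics (copied record counts, null rows) that Data Factory surfaces after each pipeline run.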
As you can see, the use case for a data pipeline overlaps with, but differs from, that of a dataflow. While dataflows are designed mostly for transforming and preparing data, data pipelines are a professional tool for moving data between systems.
Simple Example of a Data Factory Data Pipeline in Microsoft Fabric
The following screenshot is a very simple example of a data pipeline setup in Microsoft Fabric. The pipeline editor gives you a friendly user interface and a number of templates that make common data-movement tasks easier. For the first step of our pipeline, we’ll bring in a dataflow.
The dataflow becomes a block on a canvas where you can add additional steps, not unlike Alteryx or Tableau Prep. You’ll notice on the right side of the block there is a check mark, an arrow, and an X, allowing you to branch into different workflows depending on whether the step succeeds, completes, or fails.
In comparison, if a data source refresh fails in Power BI, you can set it up to send you an e-mail, but it won’t retry until the next scheduled refresh time. With data pipelines, you can ask Data Factory to automatically try again.
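To make the retry idea concrete, here is a minimal sketch of the retry-on-failure pattern that a pipeline’s failure branch automates for you. The `flaky_refresh` function is hypothetical, standing in for a refresh that fails a couple of times before succeeding.

```python
import time

def retry(task, attempts=3, delay_seconds=5):
    # Re-run a failing step a few times before giving up, the way a
    # pipeline's failure branch can re-invoke an activity.
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == attempts:
                raise  # out of retries: surface the failure (e.g. send an alert)
            time.sleep(delay_seconds)  # wait before trying again

# Hypothetical refresh that fails twice, then succeeds on the third call.
calls = {"count": 0}

def flaky_refresh():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("source not ready")
    return "refreshed"

print(retry(flaky_refresh, attempts=3, delay_seconds=0))  # refreshed
```

In Data Factory you get this behavior declaratively, by setting an activity’s retry count and interval, rather than writing the loop yourself.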
The next part of our example takes the dataflow’s output and writes it to a data lake on a schedule. In reality, you would often have more steps than these two and would likely build in logic to handle things not going as planned.
There’s a series of steps within the dataflow to prep and clean the data. Then, connected to the successful output of that step, we add another step that copies the data to a data lake.
The final part is the scheduling feature, which gives you the ability to control how often the process runs.
Another benefit of this method compared to working with Power BI data sources is that you have a lot more flexibility over where data comes from, where it goes, and when it’s scheduled.
Since this is a professional-grade data integration tool, most of the data sources you’ll be interacting with are databases, data warehouses, and data lakes. The product definitely skews toward promoting Microsoft Azure technologies, but you can also work with other tools like SFTP and Snowflake.
Who Should Use Microsoft Fabric Data Pipelines?
Most casual users of Power BI and business intelligence analysts will never set up data pipelines. They’re usually set up by a professional data engineering team at large corporations, where data is pulled in from various sources and consolidated into a central data lake or data warehouse, which business users then connect to as the starting point for a Power BI dashboard.
Professional data engineers who already work in the Microsoft Azure ecosystem will likely be familiar with the Data Factory interface. Microsoft even publishes training on making the transition from Azure to Fabric. The decision to include it as part of the Microsoft Fabric experience means you won’t have to log into the Azure Portal, and it helps keep every part of a data stream, from ingestion to end reporting, in one place.
Microsoft Fabric Data Pipelines vs Power BI Refresh Costs
If you’re a frequent user of Power BI who creates lots of dashboards with scheduled refreshes, you’re probably aware that a Power BI Pro license, which costs $10 USD per user per month, is limited to 8 refreshes a day, and that the limit bumps up to 48 scheduled refreshes with Power BI Premium workspace licensing. You’ll likely also know that you can use Power Automate to dynamically refresh Power BI datasets within those limits.
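Those dynamic refreshes go through the Power BI REST API, which is also what Power Automate calls behind the scenes. Below is a minimal sketch of how such a refresh request is constructed; the workspace ID, dataset ID, and access token are placeholders, and the request is built but deliberately not sent.

```python
import json
import urllib.request

def build_refresh_request(group_id, dataset_id, access_token):
    # POST .../groups/{groupId}/datasets/{datasetId}/refreshes queues a dataset
    # refresh; each call counts against the daily scheduled-refresh limit.
    url = (
        "https://api.powerbi.com/v1.0/myorg/"
        f"groups/{group_id}/datasets/{dataset_id}/refreshes"
    )
    body = json.dumps({"notifyOption": "MailOnFailure"}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )

# Placeholder IDs and token; pass the request to urllib.request.urlopen to send it.
req = build_refresh_request("my-workspace-id", "my-dataset-id", "token-goes-here")
print(req.full_url)
```

Data pipelines sidestep this pattern entirely, since their schedules are managed by Data Factory rather than by calling the refresh endpoint within Power BI’s limits.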
Data pipelines get rid of those limitations and have the benefit of centralizing refresh schedules for datasets that can be used across multiple parts of Power BI. However, you would have to switch your data sources from scheduled import mode to a direct connection to a central data source, which would be a fairly large project to take on if you’ve already published a number of reports.
However, the pricing for Dataflows and data pipelines is fairly complicated. It moves away from a per-user, per-month license and becomes a usage-based model. Depending on the number of refreshes and the amount of data being moved around, this could work out either for or against you.
Microsoft publishes a guide to help you understand Microsoft Fabric capacity pricing that is worth a read.
Even though Microsoft has combined Data Factory into the same Microsoft Fabric interface that encompasses Power BI, they’re really completely different products designed for different audiences. Data Factory is a professional-grade data integration tool that moves data in bulk from one location to another, sometimes transforming it in between.
Data engineers can take advantage of the Power Query-style interface of Dataflows, and Power BI developers can use them to create central datasets for use across multiple reports. Data pipelines, on the other hand, are meant more for copying, deleting, or refreshing data as it moves between bulk storage systems, such as from a data warehouse to a data lake.
Power BI can connect to data sources at the output stage of a pipeline and leverage direct connections to many of them for a more expansive data transformation process. However, flipping between several software solutions is more complex than importing data directly into Power BI Desktop, which keeps Data Factory features outside the scope of many smaller Power BI deployments.