1. Introduction
In this blog, you will learn how to use mapping data flows in Azure Data Factory to transform data with a graphical interface and no code. Mapping data flows are a powerful feature that lets you design and execute data transformation logic with a drag-and-drop interface. You can use them for tasks such as cleansing, filtering, aggregating, joining, and reshaping data.
By the end of this blog, you will be able to:
- Create a mapping data flow in Azure Data Factory
- Configure the source and sink of the data flow
- Add and edit transformations using the expression builder and the debug mode
- Publish and execute the data flow
To follow along with this blog, you will need:
- An Azure subscription
- An Azure Data Factory resource
- An Azure Storage account
- A basic understanding of data transformation concepts
Are you ready to transform your data with mapping data flows? Let’s get started!
2. Prerequisites
Before you can create a mapping data flow in Azure Data Factory, you need to have some prerequisites in place. These include:
- An Azure subscription. If you don’t have one, you can create a free account here.
- An Azure Data Factory resource. This is where you will create and manage your data flows. You can follow this quickstart to create one.
- An Azure Storage account. This is where you will store your source and sink data. Blob storage and ADLS Gen2 both work with mapping data flows; this tutorial uses blob storage. You can follow this tutorial to create one.
- A basic understanding of data transformation concepts. You should be familiar with terms such as source, sink, transformation, schema, and expression. You can learn more about them here.
Once you have these prerequisites ready, you can proceed to the next section, where you will learn how to create a mapping data flow in Azure Data Factory.
3. Creating a Mapping Data Flow
To create a mapping data flow in Azure Data Factory, you need to follow these steps:
- Open your Azure Data Factory resource in the Azure portal, launch Azure Data Factory Studio, and go to the Author tab.
- In the Factory Resources pane on the left, click the + button (or the ... actions menu next to Data flows) and select Data flow. If you are prompted to choose a type, select Mapping Data Flow.
- Give your data flow a name in the Properties pane. For this tutorial, we will name it MappingDataFlowDemo.
- You will see a blank canvas with a toolbar at the top, a configuration panel at the bottom, and the Properties pane on the right.
- The toolbar is where you turn debug mode on and off and open the script view (a sketch of what the script looks like follows this list); the expression builder opens from any expression field in the configuration panel. You can also zoom in and out, fit the canvas to the screen, and undo and redo your actions.
- The configuration panel is where you configure the data flow and whichever transformation is currently selected, including its settings, schema, optimization options, and data preview.
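For orientation, the script view is simply a textual form of whatever you build on the canvas. A data flow with one source, one transformation, and one sink has roughly this shape (the names are placeholders and this is a sketch, not something to paste in verbatim):

```
source(allowSchemaDrift: true) ~> MySource
MySource filter(true()) ~> MyFilter
MyFilter sink(allowSchemaDrift: true) ~> MySink
```

Each line defines a transformation; the name after ~> is the stream name that downstream transformations use to refer to it, which is exactly what the arrows on the canvas represent.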
You have now created a mapping data flow in Azure Data Factory. In the next section, you will learn how to configure the source and sink of the data flow.
4. Configuring the Source and Sink
The source and sink are the essential components of a mapping data flow. The source defines where the data comes from, and the sink defines where the data goes to. You can use various types of sources and sinks, such as Azure Blob Storage, Azure SQL Database, Azure Synapse Analytics, and more. You can also use multiple sources and sinks in the same data flow.
In this section, you will learn how to configure the source and sink of your mapping data flow. You will use Azure Blob Storage as both the source and the sink. You will use a sample CSV file as the source data and write the transformed data to a new CSV file in the sink. You will also learn how to create linked services and datasets to connect to your storage account.
To configure the source and sink of your mapping data flow, you need to follow these steps:
- Create a linked service to connect to your Azure Storage account. A linked service defines the connection information for a data store, much like a connection string. In your Azure Data Factory resource, go to the Manage tab, select Linked services, and click New. Choose Azure Blob Storage as the type, provide a name, and point it at your subscription, resource group, and storage account. Test the connection, then create the linked service.
- Create two datasets to reference your source and sink files. A dataset is a named view of the data that you want to use in your data flow. On the Author tab, click the + button in the Factory Resources pane and select Dataset, choose Azure Blob Storage as the data store and DelimitedText (CSV) as the format, then provide a name, the linked service you just created, and the file path. For this tutorial, we will use a sample CSV file called customers.csv as the source file and a new CSV file called customers_transformed.csv as the sink file. You can download the sample CSV file from here and upload it to your Azure Storage account.
- Add the source and sink to your mapping data flow. On the data flow canvas, click Add Source and select the customers.csv dataset; then click the + next to the source, choose Sink from the list of transformations, and select the customers_transformed.csv dataset. You can also configure properties of the source and sink in the configuration panel, such as the schema projection, partitioning, and other settings. The sketch below shows roughly what the resulting script looks like.
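To give you a feel for what this configuration produces, here is a rough sketch of the script behind a data flow that simply reads customers.csv and writes customers_transformed.csv. The column names and types are assumptions about the sample file, and the dataset bindings themselves live in the data flow's JSON definition rather than in the script:

```
source(output(
        FirstName as string,
        LastName as string,
        Country as string,
        State as string,
        Age as integer
    ),
    allowSchemaDrift: true,
    validateSchema: false) ~> CustomersSource
CustomersSource sink(allowSchemaDrift: true,
    validateSchema: false) ~> TransformedSink
```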
You have now configured the source and sink of your mapping data flow. In the next section, you will learn how to add and edit transformations to transform your data.
5. Adding and Editing Transformations
Transformations are the core of a mapping data flow. They allow you to manipulate and transform your data in various ways, such as filtering, aggregating, joining, pivoting, and more. You can add and edit transformations using the graphical interface of the data flow canvas. You can also use the expression builder to write complex expressions and the debug mode to preview and validate your data.
In this section, you will learn how to add and edit transformations to your mapping data flow. You will perform some common data transformation tasks, such as:
- Filtering the data based on a condition
- Adding a derived column with a calculated value
- Aggregating the data by grouping and summarizing
- Joining the data with another dataset
- Sorting the data by a column
To add and edit transformations to your mapping data flow, you need to follow these steps:
- Add a Filter transformation. Click the + next to the source on the canvas and select Filter, then set its name, description, and filter condition in the configuration panel. For this tutorial, we will filter on the Country column and keep only the rows where the country is USA.
- Add a Derived Column transformation. Click the + after the filter and select Derived Column, then define its columns. For this tutorial, we will add a derived column called FullName that concatenates the FirstName and LastName columns with a space in between.
- Add an Aggregate transformation. Click the + after the derived column and select Aggregate, then set the group-by columns and the aggregate expressions. For this tutorial, we will group the data by the State column and calculate the average of the Age column and the count of the FullName column.
- Add a Join transformation. Click the + after the aggregate and select Join; the aggregate becomes the left stream. For the right stream, add another source that reads a second CSV file called states.csv, which contains the state names and abbreviations. You can download the sample CSV file from here and upload it to your Azure Storage account, then create a dataset and a source for it just as you did for customers.csv. In the join settings, match the State column from the left stream with the Abbreviation column from the right stream, and select the Inner join type to keep only the matching rows.
- Add a Sort transformation. Click the + after the join and select Sort, then set the sort order. For this tutorial, we will sort the data by the Average_Age column in descending order. A sketch of the script produced by this chain of transformations follows this list.
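Putting the whole chain together, the script view for this data flow would look roughly like the sketch below. The column names and types are assumptions about the two sample CSV files, and the exact script that Azure Data Factory generates may differ in its details:

```
source(output(
        FirstName as string,
        LastName as string,
        Country as string,
        State as string,
        Age as integer
    ),
    allowSchemaDrift: true) ~> CustomersSource
source(output(
        StateName as string,
        Abbreviation as string
    ),
    allowSchemaDrift: true) ~> StatesSource
CustomersSource filter(Country == 'USA' /* keep only US customers */) ~> FilterUSA
FilterUSA derive(FullName = concat(FirstName, ' ', LastName)) ~> AddFullName
AddFullName aggregate(groupBy(State),
    Average_Age = avg(Age),
    Customer_Count = count(FullName)) ~> AggregateByState
AggregateByState, StatesSource join(State == Abbreviation,
    joinType: 'inner') ~> JoinStates
JoinStates sort(desc(Average_Age, true)) ~> SortByAverageAge
SortByAverageAge sink(allowSchemaDrift: true) ~> TransformedSink
```

Reading top to bottom, each transformation names its input stream before the parentheses and its output stream after ~>, which mirrors the arrows on the canvas.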
You have now added and edited transformations to your mapping data flow. In the next section, you will learn how to use the expression builder and the debug mode to enhance your data transformation logic.
5.1. Using the Expression Builder
The expression builder is a feature of the mapping data flow that allows you to write complex expressions to manipulate and transform your data. You can use the expression builder to create derived columns, filter conditions, join conditions, aggregate functions, and more. You can also use various operators, functions, variables, and parameters in your expressions.
In this section, you will learn how to use the expression builder to enhance your data transformation logic. You will use some examples of expressions that you can apply to your data flow. You will also learn some tips and tricks to write effective expressions.
To use the expression builder in your mapping data flow, you need to follow these steps:
- Select the component that you want to edit, such as a derived column, a filter, or a join.
- In the configuration panel, click Open expression builder next to the expression field you want to edit.
- Write your expression in the expression editor. You can use the IntelliSense feature to autocomplete your syntax and the Function Reference panel to browse and insert functions. You can also use the Parameters and Variables tabs to access and insert parameters and variables.
- The expression builder validates your expression as you type and flags errors and warnings, and you can refresh its data preview to check the result against live data while a debug session is running.
- Click Save and finish to save your expression and close the expression builder.
Here are some examples of expressions that you can use in your data flow:
- To create a derived column that converts the Age column to an integer, you can use the expression:
toInteger(Age)
- To filter the data by the Gender column and keep only the rows where the gender is Female, you can use the expression:
Gender == 'Female'
- To join the data on the State column from the left stream and the Abbreviation column from the right stream, pick those columns in the join settings; the resulting condition is equivalent to the expression:
State == Abbreviation
- To aggregate the data by the State column and calculate the sum of the Age column and the count of the FullName column, you can use the expressions:
sum(Age)
and
count(FullName)
- To sort the data by the Sum_Age column in descending order, you don't write a free-form expression; select the Sum_Age column in the Sort transformation and set its order to descending.
Here are some tips and tricks to write effective expressions:
- Use parentheses to group your expressions and control the order of operations.
- Use comments to document your expressions and explain your logic. You can use the // symbol to start a comment.
- Use the coalesce function to handle null values and replace them with a default value.
- Use the case function to create conditional expressions and return different values based on different conditions.
- Use the currentTimestamp function to get the current date and time, or currentUTC to get it as UTC.
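As a quick illustration of the comment, coalesce, and case tips together, here is the kind of expression you might type into the expression builder for a new derived column (the age-group categories are made up for this example):

```
/* Bucket customers by age; coalesce() replaces a missing Age with 0
   so the comparisons never have to handle a null. */
case(
    coalesce(Age, 0) >= 65, 'Senior',
    coalesce(Age, 0) >= 18, 'Adult',
    'Minor'  // default value when no condition matches
)
```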
You have now learned how to use the expression builder in your mapping data flow. In the next section, you will learn how to use the debug mode to preview and validate your data.
5.2. Using the Debug Mode
The debug mode is a feature of the mapping data flow that allows you to preview and validate your data at any point of the data flow. You can use the debug mode to test your data transformation logic, troubleshoot errors, and optimize performance. You can also compare the input and output data of each component and view the data profile and statistics.
In this section, you will learn how to use the debug mode in your mapping data flow. You will use some examples of how to debug your data flow and check the results. You will also learn some tips and tricks to use the debug mode effectively.
To use the debug mode in your mapping data flow, you need to follow these steps:
- Turn on the Data flow debug toggle on the toolbar to start a debug session. A pop-up lets you confirm the integration runtime to use and the time to live for the debug cluster, and the Debug Settings button lets you adjust row limits and data flow parameter values for the session.
- Wait for the debug session to become active (the indicator next to the toggle turns green); it takes a few minutes to warm up. You can then preview data on any transformation without publishing the data flow or running a pipeline.
- Select the transformation that you want to inspect and open the Data Preview tab in the configuration panel, then click Refresh to fetch a sample of its output. You can select a column in the preview to see basic statistics, and use the Inspect tab to review the incoming and outgoing schema.
- Repeat the previous step for each component that you want to debug. You can also compare the data of different components and see how the data changes as it flows through the data flow.
- Turn the Data flow debug toggle off when you are done to stop the debug session and avoid unnecessary cluster charges; the session also shuts down automatically when its time to live expires.
Here are some examples of how to debug your data flow and check the results:
- To preview the data after the filter transformation, you can select the filter component and click on the Data Preview tab. You can then see the input and output data of the filter component. You can also see that the output data only contains the rows where the country is USA.
- To validate the expression of the derived column transformation, you can select the derived column component and click on the Data Preview tab. You can then see the input and output data of the derived column component. You can also see that the output data contains a new column called FullName that has the concatenated values of the FirstName and LastName columns.
- To check the performance of the join transformation, run the data flow from a pipeline (or a pipeline debug run) and open the run details on the Monitor tab. Selecting the join transformation in the run's data flow graph shows statistics such as processing time, rows processed, partition distribution, and data skew.
Here are some tips and tricks to use the debug mode effectively:
- Use the Debug Settings button to customize your debug session, such as limiting the number of rows read from each source and supplying values for data flow parameters.
- Use the Sampling option to limit the amount of data that you want to debug. You can use sampling to reduce the execution time and the cost of the debug session.
- Use the Inspect tab to see the metadata of each transformation's input and output streams, including column names and data types.
- Use the Refresh button on the Data Preview tab to re-fetch the preview after you change a transformation.
- Use the export option on the Data Preview tab to download the previewed data as a CSV file.
You have now learned how to use the debug mode in your mapping data flow. In the next section, you will learn how to publish and execute your data flow.
6. Publishing and Executing the Data Flow
After you have created and debugged your mapping data flow in Azure Data Factory, you need to publish and execute it to run it on your data. You can publish your data flow to save it to your data factory and make it available for execution. You can also execute your data flow either manually or as part of a pipeline.
In this section, you will learn how to publish and execute your mapping data flow in Azure Data Factory. You will use some examples of how to run your data flow and check the results. You will also learn some tips and tricks to optimize your data flow execution.
To publish and execute your mapping data flow in Azure Data Factory, you need to follow these steps:
- Click on the Publish All button on the toolbar to publish your data flow. You will see a pop-up window that shows the status of the publishing process. You can also see the changes that you have made to your data flow and the data factory.
- Wait for the publishing process to complete. You will see a message that says Publish succeeded. You can also see the published data flow under the Data Flows folder in the left pane.
- Create a pipeline to run your data flow. Data flows always execute as an activity inside a pipeline, so click the + button in the Factory Resources pane, select Pipeline, and drag the Data flow activity from the Move and transform section of the Activities pane onto the pipeline canvas. In the activity's settings, select MappingDataFlowDemo as the data flow and supply any data flow parameters and run options (a sketch of how parameters are declared and referenced inside the data flow follows this list).
- Run the pipeline. Click Debug to run it immediately against the debug cluster, or click Add Trigger and choose Trigger Now to run it right away on the integration runtime, or choose New/Edit to create a schedule or event-based trigger for it.
- Wait for the execution to complete. You can monitor it on the Monitor tab of your Azure Data Factory resource under Pipeline runs; selecting the run and then the data flow activity shows details such as the start time, the end time, the duration, and per-transformation statistics.
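For reference, the data flow parameters mentioned above are declared in the data flow itself and referenced in expressions with a $ prefix; the values you set on the Data flow activity (possibly fed from pipeline parameters when you trigger the pipeline) override whatever defaults the data flow declares. Here is a rough sketch, with a made-up parameter name:

```
parameters{
    targetCountry as string
}
source(output(
        FirstName as string,
        LastName as string,
        Country as string
    ),
    allowSchemaDrift: true) ~> CustomersSource
CustomersSource filter(Country == $targetCountry /* value supplied at run time */) ~> FilterByCountry
```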
Here are some examples of how to publish and execute your data flow and check the results:
- To publish your work and run it immediately, click Publish All, open the pipeline that contains your data flow activity, click Add Trigger, and select Trigger Now. You can then monitor the pipeline run and check the results in your sink file.
- To run your data flow on a schedule, open the pipeline, click Add Trigger, select New/Edit, and create a trigger that defines the schedule and recurrence. Publish the trigger, then monitor the triggered pipeline runs and check the results in your sink file.
Here are some tips and tricks to optimize your data flow execution:
- Use the settings of the Data flow activity, together with the Azure integration runtime it runs on, to control execution options such as the compute size, core count, and time to live of the Spark cluster that runs your data flow.
- Use the Optimize tab on each transformation to control partitioning, which is the main lever for tuning data flow performance and cost.
- Use the Script button on the toolbar to view and edit the data flow script, the textual representation of your data flow behind the canvas, and modify it as needed.
- Use the Inspect tab on each transformation to understand how columns flow through the data flow and which upstream columns each output depends on.
- Use descriptions on each transformation, and annotations on the data flow, to document your logic and make it easier to understand and maintain.
You have now learned how to publish and execute your mapping data flow in Azure Data Factory. The next and final section wraps things up and points to some additional resources.
7. Conclusion
In this blog, you have learned how to use mapping data flows in Azure Data Factory to transform data with a graphical interface and no code. You have seen how to create and configure a mapping data flow, how to add and edit transformations using the expression builder and the debug mode, and how to publish and execute your data flow. You have also learned some tips and tricks to optimize your data flow performance and cost.
Mapping data flows are a powerful and flexible way to transform your data in Azure Data Factory. You can use them to perform data transformation tasks such as cleansing, filtering, aggregating, joining, and reshaping data; build complex logic with the expression builder; preview and validate your data with the debug mode; and integrate them with pipelines and triggers to automate and schedule execution.
We hope you have enjoyed this blog and found it useful. If you want to learn more about mapping data flows and Azure Data Factory, you can check out these additional resources:
- Mapping Data Flow Overview
- Mapping Data Flow Expression Functions
- Mapping Data Flow Best Practices
- Mapping Data Flow Performance
- Mapping Data Flow Scenarios
Thank you for reading this blog. We would love to hear your feedback and suggestions. Please leave a comment below or contact us at azuredatafactory@microsoft.com.