Azure Data Factory: Creating and Configuring Data Pipelines

This blog teaches you how to create and configure data pipelines using the Azure Data Factory UI or code. You will learn what data pipelines are, what Azure Data Factory is, and which features it offers.

1. Introduction

Data pipelines are essential for moving and transforming data from various sources to various destinations. They enable you to automate the data processing and analysis tasks that are crucial for your business or project.

In this tutorial, you will learn how to create and configure data pipelines using Azure Data Factory, a cloud-based data integration service that lets you orchestrate and automate data movement and transformation. You will compare the different ways of creating data pipelines: the Azure Data Factory UI, the Azure Data Factory SDK, and the Azure Resource Manager template. Finally, you will learn how to configure and monitor your data pipelines using the Azure portal or the Azure Monitor service.

By the end of this tutorial, you will have a solid understanding of how to use Azure Data Factory to create and configure data pipelines for your data integration needs.

Before you start, you will need the following:

  • An Azure subscription. If you don’t have one, you can create a free account on the Azure website.
  • An Azure storage account. You will use it to store the data that you will move and transform with Azure Data Factory. You can create one from the Azure portal or with the Azure CLI.
  • A basic knowledge of data pipelines and Azure Data Factory. The next two sections provide a short overview of both.

Ready to create and configure your data pipelines? Let’s get started!

2. What is a Data Pipeline?

A data pipeline is a set of steps that move and transform data from one or more sources to one or more destinations. A data pipeline can perform various tasks, such as copying, filtering, aggregating, joining, cleansing, or enriching data. A data pipeline can also handle different types of data, such as structured, semi-structured, or unstructured data.

Data pipelines are useful for many scenarios, such as:

  • Data integration: You can use data pipelines to integrate data from different sources, such as databases, files, web services, or APIs. For example, you can use a data pipeline to copy data from an on-premises SQL Server database to an Azure SQL Database.
  • Data analysis: You can use data pipelines to prepare data for analysis, such as data warehousing, business intelligence, or machine learning. For example, you can use a data pipeline to transform data from various sources into a common format and load it into an Azure Synapse Analytics data warehouse.
  • Data processing: You can use data pipelines to process data in real-time or batch mode, such as streaming, event-driven, or scheduled processing. For example, you can use a data pipeline to ingest data from an IoT device, apply some logic, and send it to an Azure Event Hub.

How do you create and configure a data pipeline? There are many tools and services that can help you with that, but one of the most powerful and flexible ones is Azure Data Factory. Let’s see what Azure Data Factory is and how it works.

3. Azure Data Factory Overview

Azure Data Factory is a cloud-based data integration service that allows you to create and manage data pipelines. A pipeline in Azure Data Factory is a logical grouping of activities that together perform a task. The main building blocks are: activities, the tasks that your pipeline performs, such as copying, transforming, or processing data; datasets, the data structures that describe the data an activity reads from (its source) and writes to (its sink); linked services, the connection definitions that let your pipeline reach those data stores, such as files, databases, web services, or APIs; and triggers, the events that start or schedule a pipeline run. In a copy scenario, the source is where your data comes from (for example, a file or a database) and the sink is where it goes (for example, a data warehouse or a data lake).

Azure Data Factory provides you with three different ways of creating and configuring your data pipelines: using the Azure Data Factory UI, using the Azure Data Factory SDK, or using the Azure Resource Manager template. Each of these methods has its own advantages and disadvantages, depending on your preferences and requirements. In the next sections, you will learn how to use each of these methods to create and configure your data pipelines.

But before you do that, you need to create an Azure Data Factory instance. An Azure Data Factory instance is a logical container that holds your data pipelines and other related resources. You can create one using the Azure portal, the Azure CLI, or Azure PowerShell.

Once you have created your Azure Data Factory instance, you are ready to create and configure your data pipelines. Let’s start with the Azure Data Factory UI method.

4. Creating a Data Pipeline using Azure Data Factory UI

The Azure Data Factory UI is a graphical user interface that allows you to create and configure your data pipelines using drag-and-drop components. You can access the Azure Data Factory UI from the Azure portal, by selecting your Azure Data Factory instance and clicking on the Author & Monitor button.

The Azure Data Factory UI consists of four main sections: the Home page, the Author page, the Monitor page, and the Manage page. The Home page provides you with an overview of your Azure Data Factory instance and some quick links to common tasks. The Author page is where you can create and edit your data pipelines, datasets, linked services, and triggers. The Monitor page is where you can view and manage the status and performance of your data pipeline runs. The Manage page is where you can configure and manage your Azure Data Factory resources, such as integration runtimes, connections, and access policies.

In this section, you will learn how to use the Author page to create a simple data pipeline that copies data from a CSV file in your Azure storage account to a table in your Azure SQL Database. Along the way, you will configure the data pipeline properties, such as the name, description, parameters, and annotations, and then test and debug your data pipeline using the Debug mode and the Output window.

To create a data pipeline using the Azure Data Factory UI, follow these steps:

  1. On the Author page, click on the + (plus) button and select Pipeline from the drop-down menu. This will create a new pipeline with a default name and an empty canvas.
  2. On the Properties window, enter a name and a description for your data pipeline. For example, you can name it CopyCSVtoSQL and describe it as “A data pipeline that copies data from a CSV file to a SQL table”. You can also add parameters and annotations to your data pipeline, if needed.
  3. On the Activities panel, expand the Move & Transform category and drag and drop the Copy Data activity to the canvas. This will add a copy data activity to your data pipeline with a default name and settings.
  4. On the Properties window, enter a name and a description for your copy data activity. For example, you can name it CopyCSVtoSQLActivity and describe it as “A copy data activity that copies data from a CSV file to a SQL table”. You can also configure the settings of your copy data activity, such as the fault tolerance, the parallelism, and the performance.
  5. On the Source tab, click on the + (plus) button next to the Source dataset field. This will open a new window where you can create a new dataset or select an existing one. A dataset is a data structure that defines the input and output of your copy data activity.
  6. On the New Dataset window, select the Azure Blob Storage option and click Continue. This will create a new dataset that connects to your Azure storage account where your CSV file is stored.
  7. On the Set Properties window, enter a name and a description for your dataset. For example, you can name it CSVFileDataset and describe it as “A dataset that connects to a CSV file in Azure storage”. You can also configure the properties of your dataset, such as the file path, the file format, and the compression type.
  8. On the Linked service field, click on the + (plus) button. This will open a new window where you can create a new linked service or select an existing one. A linked service is a connection that enables your data pipeline to access your data source or sink.
  9. On the New Linked Service window, select the Azure Blob Storage option and click Continue. This will create a new linked service that connects to your Azure storage account where your CSV file is stored.
  10. On the Set Properties window, enter a name and a description for your linked service. For example, you can name it AzureStorageLinkedService and describe it as “A linked service that connects to Azure storage”. You can also configure the properties of your linked service, such as the authentication method, the account name, and the account key.
  11. Click on the Test connection button to verify that your linked service can connect to your Azure storage account. If the test is successful, click on the Create button to create your linked service and close the window.
  12. On the Set Properties window of your dataset, click on the Preview data button to preview the data in your CSV file. If the data looks correct, click on the OK button to create your dataset and close the window.
  13. On the Source tab of your copy data activity, select the CSVFileDataset from the Source dataset drop-down menu. This will set your dataset as the source of your copy data activity.
  14. On the Sink tab, click on the + (plus) button next to the Sink dataset field. This will open a new window where you can create a new dataset or select an existing one.
  15. On the New Dataset window, select the Azure SQL Database option and click Continue. This will create a new dataset that connects to your Azure SQL Database where you want to copy the data to.
  16. On the Set Properties window, enter a name and a description for your dataset. For example, you can name it SQLTableDataset and describe it as “A dataset that connects to a SQL table in Azure SQL Database”. You can also configure the properties of your dataset, such as the table name, the schema name, and the column mapping.
  17. On the Linked service field, click on the + (plus) button. This will open a new window where you can create a new linked service or select an existing one.
  18. On the New Linked Service window, select the Azure SQL Database option and click Continue. This will create a new linked service that connects to your Azure SQL Database where you want to copy the data to.
  19. On the Set Properties window, enter a name and a description for your linked service. For example, you can name it AzureSQLLinkedService and describe it as “A linked service that connects to Azure SQL Database”. You can also configure the properties of your linked service, such as the authentication method, the server name, the database name, and the user name and password.
  20. Click on the Test connection button to verify that your linked service can connect to your Azure SQL Database. If the test is successful, click on the Create button to create your linked service and close the window.
  21. On the Set Properties window of your dataset, click on the OK button to create your dataset and close the window.
  22. On the Sink tab of your copy data activity, select the SQLTableDataset from the Sink dataset drop-down menu. This will set your dataset as the sink of your copy data activity.
  23. On the Mapping tab, click on the Import schemas button to import the schemas from your source and sink datasets. This will automatically map the columns from your CSV file to the columns in your SQL table. You can also manually edit the mapping, if needed.
  24. On the Settings tab, you can configure the settings of your copy data activity, such as the write behavior, the batch size, and the staging settings.
  25. To test and debug your data pipeline, click on the Debug button on the toolbar. This will run your data pipeline and show the output and the status of your copy data activity in the Output window. You can also view the details of your data pipeline run on the Monitor page.
  26. If your data pipeline runs successfully, you can publish your data pipeline by clicking on the Publish all button on the toolbar. This will save your data pipeline and make it ready for execution.

Congratulations! You have created and configured a data pipeline using the Azure Data Factory UI. You can now use this data pipeline to copy data from your CSV file to your SQL table whenever you want. You can also create more data pipelines using the Azure Data Factory UI, or try the other methods of creating data pipelines, such as using the Azure Data Factory SDK or the Azure Resource Manager template.

5. Creating a Data Pipeline using Azure Data Factory SDK

The Azure Data Factory SDK is a software development kit that allows you to create and configure your data pipelines using code. SDKs are available for several programming languages, such as Python, .NET (C#), and Java, and there is also an Azure PowerShell module. You can use the Azure Data Factory SDK with various development tools, such as Visual Studio, Visual Studio Code, or Azure Cloud Shell.

The Azure Data Factory SDK provides you with classes and methods that correspond to the components of your data pipeline, such as activities, datasets, linked services, and triggers. You can use these classes and methods to define and manipulate your data pipeline objects using code. You can also use the Azure Data Factory SDK to perform various operations on your data pipeline, such as creating, updating, deleting, running, or monitoring.

In this section, you will learn how to use the Azure Data Factory SDK for Python to create and configure the same data pipeline that you built with the Azure Data Factory UI in the previous section, and then run and monitor it from code.

To create a data pipeline using the Azure Data Factory SDK with Python, follow these steps:

  1. Install the Azure Data Factory SDK for Python using the pip command:
    pip install azure-mgmt-datafactory azure-identity
  2. Import the Azure Data Factory SDK modules and other dependencies in your Python script:
    from azure.identity import ClientSecretCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import *
  3. Authenticate to your Azure subscription using the ClientSecretCredential class. You will need your subscription ID, plus the tenant ID, client ID, and client secret of an Azure Active Directory service principal that has access to your subscription. You can obtain these values from the Azure portal or the Azure CLI. For example:
    subscription_id = "your-subscription-id"
    tenant_id = "your-tenant-id"
    client_id = "your-client-id"
    client_secret = "your-client-secret"
    
    credential = ClientSecretCredential(tenant_id, client_id, client_secret)
  4. Create an instance of the DataFactoryManagementClient class using your subscription ID and credential. This will allow you to access and manage your Azure Data Factory resources using code. For example:
    adf_client = DataFactoryManagementClient(credential, subscription_id)
  5. Create an instance of the Factory class using the location of your Azure Data Factory instance. This will represent your Azure Data Factory instance as a Python object. You will also need the name of the data factory and of the resource group that contains it, because the management client scopes every call to a resource group. For example:
    resource_group_name = "your-resource-group-name"
    adf_name = "your-adf-name"
    adf_location = "your-adf-location"
    
    adf_resource = Factory(location=adf_location)
  6. Create or update your Azure Data Factory instance using the create_or_update method of the adf_client object. This will create or update your Azure Data Factory instance in your Azure subscription using the adf_resource object. For example:
    adf_client.factories.create_or_update(resource_group_name, adf_name, adf_resource)
  7. Create an instance of the LinkedServiceResource class using the name and properties of your Azure storage linked service. This will represent your Azure storage linked service as a Python object. You will need to provide the account name and key of your Azure storage account. For example:
    storage_name = "your-storage-name"
    storage_key = "your-storage-key"
    
    storage_ls_name = "AzureStorageLinkedService"
    storage_ls_properties = AzureStorageLinkedService(connection_string=f"DefaultEndpointsProtocol=https;AccountName={storage_name};AccountKey={storage_key}")
    
    storage_ls_resource = LinkedServiceResource(properties=storage_ls_properties)
  8. Create or update your Azure storage linked service using the create_or_update method of the adf_client object. This will create or update your Azure storage linked service in your Azure Data Factory instance using the storage_ls_resource object. For example:
    adf_client.linked_services.create_or_update(resource_group_name, adf_name, storage_ls_name, storage_ls_resource)
  9. Create an instance of the LinkedServiceResource class using the name and properties of your Azure SQL Database linked service. This will represent your Azure SQL Database linked service as a Python object. You will need to provide the server name, database name, user name, and password of your Azure SQL Database. For example:
    sql_server_name = "your-sql-server-name"
    sql_database_name = "your-sql-database-name"
    sql_user_name = "your-sql-user-name"
    sql_password = "your-sql-password"
    
    sql_ls_name = "AzureSQLLinkedService"
    sql_ls_properties = AzureSqlDatabaseLinkedService(connection_string=f"Data Source=tcp:{sql_server_name}.database.windows.net,1433;Initial Catalog={sql_database_name};User ID={sql_user_name};Password={sql_password};Integrated Security=False;Encrypt=True;TrustServerCertificate=False;Connection Timeout=30")
    
    sql_ls_resource = LinkedServiceResource(properties=sql_ls_properties)
  10. Create or update your Azure SQL Database linked service using the create_or_update method of the adf_client object. This will create or update your Azure SQL Database linked service in your Azure Data Factory instance using the sql_ls_resource object. For example:
    adf_client.linked_services.create_or_update(resource_group_name, adf_name, sql_ls_name, sql_ls_resource)
  11. Create an instance of the DatasetResource class using the name and properties of your CSV file dataset. This will represent your CSV file dataset as a Python object. You will need to provide the file path, the file format, and the linked service reference of your CSV file dataset. For example:
    csv_file_path = "your-csv-file-path"
    csv_file_format = TextFormat(column_delimiter=",", first_row_as_header=True)
    csv_file_ls = LinkedServiceReference(reference_name=storage_ls_name)
    
    csv_file_ds_name = "CSVFileDataset"
    csv_file_ds_properties = AzureBlobDataset(folder_path=csv_file_path, format=csv_file_format, linked_service_name=csv_file_ls)
    
    csv_file_ds_resource = DatasetResource(properties=csv_file_ds_properties)
  12. Create or update your CSV file dataset using the create_or_update method of the adf_client object. This will create or update your CSV file dataset in your Azure Data Factory instance using the csv_file_ds_resource object. For example:
    adf_client.datasets.create_or_update(resource_group_name, adf_name, csv_file_ds_name, csv_file_ds_resource)
  13. Create an instance of the DatasetResource class using the name and properties of your SQL table dataset. This will represent your SQL table dataset as a Python object. You will need to provide the table name, the schema name, and the linked service reference of your SQL table dataset. For example:
    sql_table_name = "your-sql-table-name"
    sql_schema_name = "your-sql-schema-name"
    sql_table_ls = LinkedServiceReference(reference_name=sql_ls_name)
    
    sql_table_ds_name = "SQLTableDataset"
    sql_table_ds_properties = AzureSqlTableDataset(linked_service_name=sql_table_ls, table_name=f"[{sql_schema_name}].[{sql_table_name}]")
    
    sql_table_ds_resource = DatasetResource(properties=sql_table_ds_properties)
  14. Create or update your SQL table dataset using the create_or_update method of the adf_client object. This will create or update your SQL table dataset in your Azure Data Factory instance using the sql_table_ds_resource object. For example:
    adf_client.datasets.create_or_update(resource_group_name, adf_name, sql_table_ds_name, sql_table_ds_resource)
  15. Create an instance of the PipelineResource class using the name and properties of your data pipeline. This will represent your data pipeline as a Python object. You will need to provide the activities, the parameters, and the annotations of your data pipeline. For example:
    pipeline_name = "CopyCSVtoSQL"
    pipeline_description = "A data pipeline that copies data from a CSV file to a SQL table"
    pipeline_parameters = {}
    pipeline_annotations = []
    
    pipeline_properties = PipelineResource(activities=[], parameters=pipeline_parameters, annotations=pipeline_annotations, description=pipeline_description)
  16. Create an instance of the CopyActivity class using the name and properties of your copy data activity. This will represent your copy data activity as a Python object. You will need to provide the source, the sink, the inputs, and the outputs of your copy data activity; optional tuning settings, such as data_integration_units, can be passed as additional keyword arguments. For example:
    copy_activity_name = "CopyCSVtoSQLActivity"
    copy_activity_description = "A copy data activity that copies data from a CSV file to a SQL table"
    copy_activity_source = BlobSource()
    copy_activity_sink = SqlSink()
    copy_activity_input = DatasetReference(reference_name=csv_file_ds_name)
    copy_activity_output = DatasetReference(reference_name=sql_table_ds_name)
    
    copy_activity_properties = CopyActivity(name=copy_activity_name, description=copy_activity_description, source=copy_activity_source, sink=copy_activity_sink, inputs=[copy_activity_input], outputs=[copy_activity_output])
  17. Add the copy data activity to the activities list of your data pipeline object. For example:
    pipeline_properties.activities.append(copy_activity_properties)
  18. Create or update your data pipeline using the create_or_update method of the adf_client object. This will create or update your data pipeline in your Azure Data Factory instance using the pipeline_properties object. For example:
    adf_client.pipelines.create_or_update(resource_group_name, adf_name, pipeline_name, pipeline_properties)
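
Finally, to run and monitor the pipeline from code, you can trigger a pipeline run and poll its status. The following is a minimal sketch that reuses the adf_client, resource_group_name, adf_name, and pipeline_name variables defined above; the 15-second polling interval is an arbitrary choice, not an SDK requirement:

    import time
    
    # Trigger a pipeline run and capture its run ID
    run_response = adf_client.pipelines.create_run(resource_group_name, adf_name, pipeline_name, parameters={})
    run_id = run_response.run_id
    
    # Poll the run until it leaves the Queued/InProgress states
    pipeline_run = adf_client.pipeline_runs.get(resource_group_name, adf_name, run_id)
    while pipeline_run.status in ("Queued", "InProgress"):
        time.sleep(15)
        pipeline_run = adf_client.pipeline_runs.get(resource_group_name, adf_name, run_id)
    
    print(f"Pipeline run finished with status: {pipeline_run.status}")

If the run fails, the error details for each activity are available through adf_client.activity_runs (the exact query method name varies between SDK versions), or on the Monitor page of the Azure Data Factory UI.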

6. Creating a Data Pipeline using Azure Resource Manager Template

The Azure Resource Manager template is a JSON file that lets you create and configure your data pipelines using declarative syntax. In the template you define the resources, properties, dependencies, and parameters of your data pipeline, and you deploy it to your Azure subscription using the Azure portal, the Azure CLI, or Azure PowerShell.

The Azure Resource Manager template gives you a consistent and reusable way of creating and configuring data pipelines: you can create multiple pipelines with similar or different settings from the same template, share it with others, or store it in a version control system.

In this section, you will learn how to use the Azure Resource Manager template to create and configure the same data pipeline that you built with the Azure Data Factory UI and the Azure Data Factory SDK in the previous sections, and how to deploy and monitor it.

To create a data pipeline using the Azure Resource Manager template, follow these steps:

  1. Create a JSON file that contains the Azure Resource Manager template for your data pipeline. You can use any text editor or IDE, or use the Export ARM Template option in the Azure Data Factory UI to generate a template for an existing data pipeline.
  2. In your JSON file, define the schema, contentVersion, parameters, variables, resources, and outputs of the template. The schema and contentVersion identify the template format, the parameters are values you pass in at deployment time, the variables are values reused within the template, the resources are the Azure Data Factory resources to create or update (the data factory, linked services, datasets, pipelines, and triggers), and the outputs are values returned after the deployment.
  3. For each resource, define its type, apiVersion, name, properties, and dependsOn. The type is the resource type, such as Microsoft.DataFactory/factories, Microsoft.DataFactory/factories/linkedServices, Microsoft.DataFactory/factories/datasets, Microsoft.DataFactory/factories/pipelines, or Microsoft.DataFactory/factories/triggers. The apiVersion is the API version of that resource type, such as 2018-06-01. The name is the resource name, such as your-adf-name or your-adf-name/your-pipeline-name. The properties hold the resource settings, such as the location, the connection string, the folder path, the activities, or the schedule. The dependsOn array lists the resources that this resource depends on, such as the data factory, a linked service, or a dataset. A minimal template skeleton is sketched after this list.
  4. Save your JSON file with a name and location of your choice. For example, you can name it CopyCSVtoSQL.json and save it in a local folder.
  5. Deploy your data pipeline using the Azure Resource Manager template. You can use the Azure portal, the Azure CLI, or Azure PowerShell to deploy it. You will need your subscription ID, your resource group name, your template file path, and your parameter values. For example, using the Azure CLI, you can run the following command:
    az deployment group create --subscription your-subscription-id --resource-group your-resource-group-name --template-file CopyCSVtoSQL.json --parameters your-parameter-values
  6. Monitor your data pipeline using the Azure portal or the Azure Monitor service. You can view and manage the status and performance of your data pipeline runs, and set up alerts and notifications for pipeline events.
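
To make the structure concrete, here is a minimal skeleton of such a template. The factoryName parameter and the empty activities array are placeholders that you would replace with your own linked service, dataset, and copy activity definitions:

    {
      "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
      "contentVersion": "1.0.0.0",
      "parameters": {
        "factoryName": { "type": "string" }
      },
      "resources": [
        {
          "type": "Microsoft.DataFactory/factories/pipelines",
          "apiVersion": "2018-06-01",
          "name": "[concat(parameters('factoryName'), '/CopyCSVtoSQL')]",
          "dependsOn": [],
          "properties": {
            "description": "A data pipeline that copies data from a CSV file to a SQL table",
            "activities": []
          }
        }
      ]
    }

A template exported from the Azure Data Factory UI will be considerably longer, because it also contains the linked service and dataset resources and their dependsOn relationships.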

Congratulations! You have created and configured a data pipeline using the Azure Resource Manager template. You can now use this data pipeline to copy data from your CSV file to your SQL table whenever you want. You can also create more data pipelines using the Azure Resource Manager template, or try the other methods of creating data pipelines, such as using the Azure Data Factory UI or the Azure Data Factory SDK.

7. Configuring and Monitoring a Data Pipeline

Once you have created your data pipeline, you need to configure it to run according to your requirements. You also need to monitor its performance and status to ensure that it is working as expected. In this section, you will learn how to configure and monitor your data pipeline using Azure Data Factory.

To configure your data pipeline, you need to define the following components:

  • Trigger: A trigger defines when and how often your data pipeline runs. You can use different types of triggers, such as schedule, tumbling window, event-based, or manual triggers. For example, you can use a schedule trigger to run your data pipeline every day at a specific time.
  • Linked service: A linked service defines the connection information to your data source or destination. You can use different types of linked services, such as Azure Storage, Azure SQL Database, Azure Synapse Analytics, or Azure Data Lake Storage. For example, you can use an Azure Storage linked service to connect to your Azure Blob Storage account.
  • Dataset: A dataset defines the structure and format of your data. You can use different types of datasets, such as delimited text, JSON, Parquet, or Avro. For example, you can use a delimited text dataset to represent your CSV file.
  • Activity: An activity defines the action that your data pipeline performs on your data. You can use different types of activities, such as Copy, Lookup, Stored Procedure, or Data Flow activities. For example, you can use a copy activity to copy data from one location to another.

You can configure your data pipeline using the Azure Data Factory UI, the Azure Data Factory SDK, or the Azure Resource Manager template. You can also use parameters, expressions, and variables to make your data pipeline dynamic and reusable. As an example of the first component, a sketch of defining a schedule trigger with the Python SDK follows.
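
The following minimal sketch reuses the adf_client, resource_group_name, and adf_name variables from section 5; the trigger name, recurrence, and start time are illustrative values, not defaults of the service:

    from datetime import datetime, timedelta
    from azure.mgmt.datafactory.models import (
        ScheduleTrigger, ScheduleTriggerRecurrence, TriggerResource,
        TriggerPipelineReference, PipelineReference,
    )
    
    trigger_name = "DailyCopyTrigger"  # hypothetical name
    
    # Run the CopyCSVtoSQL pipeline once a day, starting a few minutes from now
    recurrence = ScheduleTriggerRecurrence(
        frequency="Day",
        interval=1,
        start_time=datetime.utcnow() + timedelta(minutes=5),
        time_zone="UTC",
    )
    trigger_properties = ScheduleTrigger(
        recurrence=recurrence,
        pipelines=[TriggerPipelineReference(pipeline_reference=PipelineReference(reference_name="CopyCSVtoSQL"))],
    )
    
    adf_client.triggers.create_or_update(resource_group_name, adf_name, trigger_name, TriggerResource(properties=trigger_properties))
    # The trigger must also be started before it fires
    # (triggers.start or triggers.begin_start, depending on the SDK version).

Event-based and tumbling window triggers are defined the same way, with BlobEventsTrigger or TumblingWindowTrigger properties instead of ScheduleTrigger.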

To monitor your data pipeline, you can use the following tools and services:

  • Azure Data Factory UI: You can use the Azure Data Factory UI to view the status, duration, and output of your data pipeline runs, as well as the details of each activity, such as its input, output, and error messages. You can also troubleshoot and debug your data pipeline using the debug mode or a data flow debug session.
  • Azure Monitor: You can use Azure Monitor to collect and analyze the metrics and logs of your data pipeline, create alerts and notifications for issues or failures, and build dashboards and reports to visualize your data pipeline performance and health. If you prefer a programmatic view, you can also query run history with the SDK, as sketched after this list.
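
The following is a minimal sketch of querying run history with the Python SDK, again reusing the adf_client, resource_group_name, and adf_name variables from section 5; the 24-hour window is just an example:

    from datetime import datetime, timedelta
    from azure.mgmt.datafactory.models import RunFilterParameters
    
    # List the pipeline runs from the last 24 hours
    filter_params = RunFilterParameters(
        last_updated_after=datetime.utcnow() - timedelta(days=1),
        last_updated_before=datetime.utcnow(),
    )
    runs = adf_client.pipeline_runs.query_by_factory(resource_group_name, adf_name, filter_params)
    for run in runs.value:
        print(run.pipeline_name, run.run_id, run.status, run.duration_in_ms)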

By configuring and monitoring your data pipeline, you can ensure that it meets your data integration needs and expectations. You can also optimize and improve your data pipeline over time by analyzing its performance and feedback.

8. Conclusion

In this tutorial, you learned how to create and configure data pipelines using Azure Data Factory, a cloud-based data integration service that allows you to orchestrate and automate data movement and transformation. You explored the different ways of creating data pipelines: the Azure Data Factory UI, the Azure Data Factory SDK, and the Azure Resource Manager template. You also learned how to configure and monitor your data pipelines using the Azure portal or the Azure Monitor service.

By following this tutorial, you gained a solid understanding of how to use Azure Data Factory for your data integration needs, including its key components, such as triggers, linked services, datasets, and activities, and features such as parameters, expressions, and variables that help you handle different types of data and scenarios.

Azure Data Factory is a powerful and flexible tool that can help you with various data integration challenges and scenarios. You can use it to integrate data from different sources, prepare data for analysis, process data in real-time or batch mode, and more. You can also use it to create dynamic and reusable data pipelines that can scale and adapt to your changing data needs.

We hope you enjoyed this tutorial and found it useful. If you want to learn more about Azure Data Factory, the official Azure Data Factory documentation is a good place to continue.

Thank you for reading this tutorial and happy data engineering!
