Creating a data flow to join CSV files for an ETL pipeline in Azure Data Factory.

Azure Data Factory Data Flow

As seen in my post Building a data pipeline with Azure Data Factory, one of the main steps of the ETL pipeline is transforming and joining the data sources into a master dataset.


Setup

The goal is to transform and join CSV files from a Blob Storage account into an Azure SQL Database.

See Creating Azure Resources for details on creating many of the following:

  • Create a resource group
  • Create a security group
  • Create a Blob Storage account
    • Create an input container
      • Load CSV files to the container (a scripted sketch follows this list)
    • Create an output container
  • Create an Azure SQL Server
    • Design the schema for the joined CSV file
    • Create a database
    • Create a table with the schema
  • Create a Data Factory
    • Create a Linked Service to Blob storage
    • Create a Dataset to point to the Blob storage
    • Create a Linked Service to Azure SQL Database
    • Create a Dataset to point to the SQL Database
    • Create a Pipeline
    • Create a Data Flow (details below)
      • Add the Data Flow to the Pipeline canvas
  • Save All (if using GitHub repo) and Publish
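For reference, the "Load CSV files to container" and "Create a table with the schema" steps can also be scripted rather than done through the portal. A minimal Python sketch, assuming the azure-storage-blob and pyodbc packages, placeholder connection strings, a container named input, and an illustrative dbo.JoinedInput schema (match the columns to your joined CSV):

```python
# Sketch of two setup steps: uploading CSVs to the input container and creating
# the target SQL table. Resource names and the schema below are illustrative.
from pathlib import Path

import pyodbc
from azure.storage.blob import BlobServiceClient

# Upload every local CSV into the "input" container.
blob_service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = blob_service.get_container_client("input")
for csv_path in Path("./csv").glob("*.csv"):
    with open(csv_path, "rb") as data:
        container.upload_blob(name=csv_path.name, data=data, overwrite=True)

# Create the table the joined output will be written to (columns are examples).
conn = pyodbc.connect("<azure-sql-odbc-connection-string>")
conn.execute("""
    IF OBJECT_ID('dbo.JoinedInput') IS NULL
    CREATE TABLE dbo.JoinedInput (
        Id INT NOT NULL PRIMARY KEY,
        Name NVARCHAR(256),
        Amount DECIMAL(18, 2),
        LoadedAt DATETIME2 DEFAULT SYSUTCDATETIME()
    );
""")
conn.commit()
conn.close()
```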

Save All and Publish

  • When creating the Data Factory, you should have configured a GitHub repo
  • Use Save All to commit your changes to the repo
  • Whenever you reach a point where you know everything runs without issues, click Publish

Create a Data Flow

  • In the Data Factory
  • On the left menu, open the Author/Edit interface
  • Under Factory Resources, select Data Flow
  • Create a new data flow
  • Name it JoinInputCSVs
  • On the graph canvas, an Add Source box appears; use it to add your first source
  • Save all to commit to the repository

Data Flow Workflow

  1. Create a source for every CSV file
  2. Add a transformation for each source
  3. Join the transformed outputs and sink the result to the OutputDataflowCSV dataset created above (see the sketch after this list)
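Conceptually, the data flow performs the same join you could express locally. A minimal pandas sketch of the logic, assuming two hypothetical source files (customers.csv and orders.csv) joined on a CustomerId column:

```python
# Local sketch of what the JoinInputCSVs data flow does; file names, the rename,
# and the join key are placeholders for your own sources.
import pandas as pd

customers = pd.read_csv("customers.csv")
orders = pd.read_csv("orders.csv")

# Per-source transformations (renaming/deriving columns) happen before the join,
# mirroring the "Add a transformation for each source" step.
customers = customers.rename(columns={"id": "CustomerId"})

# Join the transformed sources into the master dataset.
joined = customers.merge(orders, on="CustomerId", how="inner")

# Write the result where the OutputDataflowCSV dataset would point in the pipeline.
joined.to_csv("joined_output.csv", index=False)
```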

Optimization

Instead of creating a source for every CSV file, an optimization strategy could be to group CSV files that share a common format in a single Blob container (this isn’t covered in this blog post):

  • In the Data Flow, create a parameterized component that gets the list of CSV names from this container
  • Then create another component that loops through each of these files (a sketch of this pattern follows this list).
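A minimal sketch of the listing-and-looping idea, assuming the azure-storage-blob Python package and a hypothetical grouped-input container; inside Data Factory this would map to something like a Get Metadata activity feeding a ForEach loop, or to a parameterized data flow:

```python
# List every CSV in a container and loop over the names; the container name and
# the processing step are placeholders.
from azure.storage.blob import BlobServiceClient

blob_service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = blob_service.get_container_client("grouped-input")

csv_names = [b.name for b in container.list_blobs() if b.name.endswith(".csv")]
for name in csv_names:
    # Each iteration would feed one file name into the parameterized component.
    print(f"Would process {name}")
```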

Using ‘Data flow debug’

Debug mode lets you preview the data as you transform it. However, it spins up a Spark cluster for the period you specify when enabling Debug.

Enable Data flow debug

  • Click on Data flow debug
  • You can change the number of rows to preview.
  • Stop debug mode when you are done transforming the data.