Azure Data Factory Data Flow
Creating a data flow to join CSV files for an ETL pipeline in Azure Data Factory.
As seen in my post Building a data pipeline with Azure Data Factory, one of the main components of the ETL pipeline is to transform and join the data sources into a master dataset.
Setup
The goal is to transform and join CSV files from a Storage Blob into a SQL Database.
See Creating Azure Resources for how to create many of the following:
- Create a resource group
- Create a security group
- Create a Storage Blob account
  - Create an input container
  - Load CSV files to the container (see the first sketch after this list)
  - Create an output container
- Create an Azure SQL Server
  - Design the schema for the joined CSV file
  - Create a database
  - Create a table with the schema (see the second sketch after this list)
- Create a Data Factory
  - Create a Linked Service to Blob storage
  - Create a Dataset that points to the Blob storage
  - Create a Linked Service to the Azure SQL Database
  - Create a Dataset that points to the SQL Database
- Create a Pipeline
  - Create a Data Flow (details below)
  - Add the Data Flow to the Pipeline canvas
- Save All (if using a GitHub repo) and Publish
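For the Blob storage steps, a minimal Python sketch of creating the input container and loading local CSV files into it could look like the following (the connection string environment variable, local folder, and container name are assumptions, not values from this post):

```python
# Sketch: create the input container and upload local CSV files to it.
# Assumes AZURE_STORAGE_CONNECTION_STRING is set for the Storage Blob account.
import os
from pathlib import Path

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"]
)

# Create the input container if it does not already exist.
container = service.get_container_client("input")
if not container.exists():
    container.create_container()

# Upload every CSV file in a local folder to the container.
for path in Path("./csv").glob("*.csv"):
    with path.open("rb") as data:
        container.upload_blob(name=path.name, data=data, overwrite=True)
```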
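Similarly, the destination table can be created with a short script. This is only a sketch with a made-up schema and placeholder connection details; the actual columns depend on the CSV files being joined:

```python
# Sketch: create the destination table for the joined output in Azure SQL.
# Server, database, credentials, and columns below are placeholders.
import pyodbc

conn_str = (
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:<your-server>.database.windows.net,1433;"
    "Database=<your-database>;"
    "Uid=<your-user>;Pwd=<your-password>;"
    "Encrypt=yes;TrustServerCertificate=no;"
)

create_table = """
IF OBJECT_ID('dbo.JoinedInput', 'U') IS NULL
CREATE TABLE dbo.JoinedInput (
    Id       INT           NOT NULL PRIMARY KEY,
    Name     NVARCHAR(100) NULL,
    Amount   DECIMAL(18,2) NULL,
    LoadDate DATETIME2     NULL
);
"""

with pyodbc.connect(conn_str) as conn:
    conn.execute(create_table)
    conn.commit()
```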
Save All and Publish
- When creating the Data Factory, you should have created a GitHub repo
- Use `Save All` to commit to the repo
- Whenever you reach a step where you know everything runs without issues, click `Publish`
Create a Data Flow
- In the Data Factory, open the Author/Edit interface from the left menu
- Under `Factory Resources`, then `Data Flow`
- Create a `New data flow`
- Name it `JoinInputCSVs`
- On the graph canvas section, an icon shows `Add source`
- `Save all` to commit to the repository
Data Flow Workflow
- Create a source for every CSV file
- Add a transformation for each source
- Join the transformed outputs and sink the result to the dataset created above, `OutputDataflowCSV` (a local sketch of this workflow follows)
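Outside of ADF, the same workflow can be illustrated with a few lines of pandas. The file and column names below are hypothetical; the snippet only shows the shape of the transform-then-join step that the Data Flow performs:

```python
# Local illustration of the Data Flow: read each CSV source, apply a small
# transformation, then join on a shared key. Column names are made up.
import pandas as pd

customers = pd.read_csv("customers.csv")
orders = pd.read_csv("orders.csv")

# Example per-source transformations: normalize a text column, parse a date.
customers["Name"] = customers["Name"].str.strip().str.title()
orders["OrderDate"] = pd.to_datetime(orders["OrderDate"])

# Join the transformed sources into a single output dataset.
# In the Data Flow, this result would go to the sink dataset.
joined = customers.merge(orders, on="CustomerId", how="inner")
joined.to_csv("joined_output.csv", index=False)
```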
Optimization
Instead of creating a source for every CSV file, an optimization strategy could be to group CSV files with a common format in a Blob container (this isn't covered in this blog post):
- In the Data Flow, create a parameterized component that gets the list of CSV names from this container (a local sketch of this listing step follows)
- Then create another component that loops through each of these files
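As a rough illustration of the listing step, here is how the CSV names could be pulled from a container with the Python Blob SDK. Inside ADF itself this role would be played by a pipeline activity or a Data Flow parameter; the container name and environment variable are assumptions:

```python
# Sketch: list every CSV blob in the grouped container, then iterate over
# the names, as the parameterized component and loop would do in ADF.
import os

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"]
)
container = service.get_container_client("input")

csv_names = [b.name for b in container.list_blobs() if b.name.endswith(".csv")]
for name in csv_names:
    print(name)  # each name would be passed to the parameterized component
```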
Using ‘Data flow debug’
This lets you preview the data as you transform it. However, it creates a Spark cluster for the period you specify when enabling Debug.
Enable Data flow debug
- Click on `Data flow debug`
- Stop debug mode when you are done transforming the data.
- You can change the number of rows to preview.