Building a pipeline with Azure Data Factory.

Azure Data Factory Pipeline

ETL process

  • Collect data sources
  • Clean/join them with Dataflow and PowerQuery
  • Create a master dataset
  • Move the dataset to SQL Server

Data Sources

Underrepresented and minority-owned companies found through and the DOT website, also known as DBE or Disadvantaged Business Enteprise. There is a DBE website for every US state.

The US Department of Transportation has a website with all US states and their corresponding DBE websites here. However, most of those links didn’t work.

Using the google query unified certification program, leads you to most states DBE websites. Many of them use the same platform that lets you export the data.

This is a sample of one of the files with a sea of rows:

Data Sources and Seas of Rows

Create an Azure Data factory

Follow my tutorial Create an Azure Data Factory

Review the CSV files to build a schema

I did an initial processing of the data exports. They were in Excel and CSV with a lot of extra header/footer rows. A few didn’t export correctly, and decided not to include them, as it would take me hours trying to clean them.

Many of them had these columns, while others had different column names, and more or fewer columns.

"Company Name","DBA Name","Owner First","Owner Last","Physical Address","City","State","Zip","Mailing Address","City","State","Zip","Phone","Fax","Email","Website","Agency","Certification Type","Capability","County","Work Districts/Regions","SBE Certification"

Schema columns

I decided to use these columns. The insights could be to find the number of DBE companies by city or state. Or place them on a US map dashboard.

CompanyName, DBAName, Address, City, State, Zip, Website, Agency, CertificationType, Capability, County

Upload the data to the storage container

I created a container named dbeapp with two directories:

  • inputCSV: I uploaded 12 CSV files
  • outputJoinedCSV: I will use this after joining all files.

Follow my tutorial Create an Azure Blob Storage and Container

Connect to the SQL Server Database

See Create an Azure SQL Database.

After the SQL Server and Database were created.

  • Go to the database
  • Query editor

Create a table to hold the data:

CREATE TABLE Directory (
    CompanyID int PRIMARY KEY IDENTITY(1,1),
    CompanyName nvarchar(100) NOT NULL,
    DBAName nvarchar(100),
    Address nvarchar(100),
    City nvarchar(50),
    State nvarchar(20),
    Zip nvarchar(10),
    Website nvarchar(50),