Building a pipeline with Azure Data Factory.
- Collect data sources
- Clean/join them with Dataflow and PowerQuery
- Create a master dataset
- Move the dataset to SQL Server
Underrepresented and minority-owned companies found through Data.gov and the DOT website, also known as
Disadvantaged Business Enteprise. There is a DBE website for every US state.
The US Department of Transportation has a website with all US states and their corresponding DBE websites here. However, most of those links didn’t work.
Using the google query
unified certification program, leads you to most states DBE websites. Many of them use the same platform that lets you export the data.
This is a sample of one of the files with a sea of rows:
Create an Azure Data factory
Follow my tutorial Create an Azure Data Factory
Review the CSV files to build a schema
I did an initial processing of the data exports. They were in Excel and CSV with a lot of extra header/footer rows. A few didn’t export correctly, and decided not to include them, as it would take me hours trying to clean them.
Many of them had these columns, while others had different column names, and more or fewer columns.
"Company Name","DBA Name","Owner First","Owner Last","Physical Address","City","State","Zip","Mailing Address","City","State","Zip","Phone","Fax","Email","Website","Agency","Certification Type","Capability","County","Work Districts/Regions","SBE Certification"
I decided to use these columns. The insights could be to find the number of DBE companies by city or state. Or place them on a US map dashboard.
CompanyName, DBAName, Address, City, State, Zip, Website, Agency, CertificationType, Capability, County
Upload the data to the storage container
I created a container named
dbeapp with two directories:
inputCSV: I uploaded
outputJoinedCSV: I will use this after joining all files.
Follow my tutorial Create an Azure Blob Storage and Container
Connect to the SQL Server Database
After the SQL Server and Database were created.
- Go to the database
- Query editor
Create a table to hold the data:
CREATE TABLE Directory ( CompanyID int PRIMARY KEY IDENTITY(1,1), CompanyName nvarchar(100) NOT NULL, DBAName nvarchar(100), Address nvarchar(100), City nvarchar(50), State nvarchar(20), Zip nvarchar(10), Website nvarchar(50),