As we wrote earlier this month, the Texas Railroad Commission (RRC) released a treasure trove of data that is freely available to the public on its site. It was like Christmas in the Mi4 office. After we sang some carols and drank some hot chocolate, we realized there was so much data that we didn’t know where to start.
Christmas in September
As my colleague @Talal wrote last week, we decided to get Lat/Long coordinates for every Texas well. His post explained that there are many use cases for this data, so it seemed like an excellent place to start.
In this post I will go over my contribution to the exercise: creating a serverless function to process data in blob storage. It is the step in the Azure Data Factory pipeline highlighted below:
What about Blob?
Before we get started, I realize the terms “blob”, “blob container” and “blob storage” did not appear in Talal’s earlier post. A pretty close, but not perfect, analogy for the concept of blobs is:
- Blob = File
- Blob Container = Folder/Directory
- Blob Storage = A place files are stored
The previous post mentioned moving files from the RRC’s FTP site to a “folder in Azure Data Lake Storage.” Azure Data Lake Storage is a specific type of blob storage, and from an Azure Functions perspective, there is no difference.
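To make the analogy a little more concrete, here is a small Python sketch that splits a blob URL into its storage account, container, and blob name. The helper name `split_blob_url`, the account name, and the container name are all made up for illustration. It also hints at why the analogy is not perfect: in plain Blob Storage, what looks like a subfolder is just part of the blob's name.

```python
from urllib.parse import urlparse


def split_blob_url(url):
    """Split a blob URL into (account, container, blob_name).

    Assumes the usual https://<account>.blob.core.windows.net/<container>/<blob> shape.
    """
    parsed = urlparse(url)
    # The storage account is the first label of the host name.
    account = parsed.netloc.split(".")[0]
    # The first path segment is the container; everything after it is the blob name.
    container, _, blob_name = parsed.path.lstrip("/").partition("/")
    return account, container, blob_name


# Hypothetical URL: the "wells/" part belongs to the blob name, not to a nested container.
account, container, blob_name = split_blob_url(
    "https://exampleaccount.blob.core.windows.net/rrc-raw/wells/well_surface.dbf"
)
```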
With that out of the way, a word on Azure Functions:
OK, a few more words on Azure Functions, and specifically on the Azure Function that converts the DBF files from the Texas Railroad Commission into CSV files that can be loaded into a database:
According to Microsoft, Azure Functions are “a serverless compute service that enables you to run code on-demand without having to explicitly provision or manage infrastructure.” This means that we were able to convert the DBF files in Blob Storage without adding a server, VM, or even a container in Azure or anywhere else. All of the infrastructure needed for the function to work is baked into Azure.
For this exercise, we implemented a “blob-triggered” Azure Function. Each time Azure Data Factory added a new DBF file/blob to the Azure Data Lake folder, the Azure Function we developed would execute automatically. That part did not even require any code! The “blob trigger” Azure Function template already includes the binding that hands the new blob to the function as input.
The next thing I added to the function was a blob output binding. In our Azure Function, the output binding specifies the blob container and filename of the CSV file produced from the DBF blob that triggered the function. While I did have to write this code, adding a blob storage output binding to an Azure Function is a very common practice and is very well documented.
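In a real function, the Azure Functions runtime works out the output path for you through binding expressions such as `{name}`. The Python sketch below only imitates that mapping so you can see what the bindings express. It is not our C# code, and the container names `dbf-input` and `csv-output` and the helper `resolve_output_path` are hypothetical.

```python
import re


def resolve_output_path(trigger_pattern, output_pattern, blob_path):
    """Imitate how a {placeholder} captured by the trigger path is reused in the output path.

    Returns None when blob_path does not match the trigger pattern.
    """
    # re.split with a capture group alternates: literal, placeholder name, literal, ...
    parts = re.split(r"\{(\w+)\}", trigger_pattern)
    regex = "".join(
        re.escape(part) if index % 2 == 0 else f"(?P<{part}>.+)"
        for index, part in enumerate(parts)
    )
    match = re.fullmatch(regex, blob_path)
    if match is None:
        return None  # e.g. a non-DBF blob would not trigger the function
    return output_pattern.format(**match.groupdict())


# Hypothetical containers: a DBF landing in dbf-input yields a CSV path in csv-output.
csv_path = resolve_output_path(
    "dbf-input/{name}.dbf", "csv-output/{name}.csv", "dbf-input/wells.dbf"
)
```

The point of the design is that the function body never builds a path itself: the trigger pattern decides which blobs start the function, and the output pattern decides where the result lands.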
Not having to reinvent the wheel
The one piece of code I had to write that is not so common in the Azure Functions universe was the C# code that converts DBF to CSV. Luckily, these conversions have been done before for traditional local files, so I was able to adapt previous approaches to work with blobs.
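To give a feel for what such a conversion involves, here is a minimal sketch in Python rather than the C# we deployed. It works on bytes in memory, which is how a blob arrives, instead of on a local file. It assumes a plain dBASE III style layout, treats every field as text, ignores memo fields, and guesses `latin-1` as the encoding. The function name `dbf_to_csv` is made up for this example.

```python
import csv
import io
import struct


def dbf_to_csv(dbf_bytes, encoding="latin-1"):
    """Convert the bytes of a simple DBF file to CSV text (all fields kept as strings)."""
    stream = io.BytesIO(dbf_bytes)

    # 32-byte file header: version, last-update date, record count, header size, record size.
    _version, _yy, _mm, _dd, num_records, header_len, record_len = struct.unpack(
        "<4BIHH20x", stream.read(32)
    )

    # 32-byte field descriptors follow, terminated by a single 0x0D byte.
    fields = []
    while True:
        first = stream.read(1)
        if first in (b"\r", b""):
            break
        name, _ftype, length, _decimals = struct.unpack(
            "<11sc4xBB14x", first + stream.read(31)
        )
        fields.append((name.split(b"\x00")[0].decode(encoding), length))

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([name for name, _ in fields])

    # Records are fixed width; the first byte of each is a deletion flag.
    stream.seek(header_len)
    for _ in range(num_records):
        record = stream.read(record_len)
        if len(record) < record_len:
            break  # truncated file
        if record[:1] == b"*":
            continue  # skip records marked as deleted
        row, offset = [], 1
        for _name, length in fields:
            row.append(record[offset:offset + length].decode(encoding).strip())
            offset += length
        writer.writerow(row)
    return out.getvalue()
```

In a blob-triggered function, the bytes read from the input binding would go in and the returned text would be written to the output binding, so nothing ever touches a local disk.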
After I successfully tested the Azure Function locally in Visual Studio against blob storage containers in Azure, I deployed the function to Azure and everything has been running smoothly ever since.
The Azure ecosystem enabled us to take data from a public FTP site, perform conversions and logical operations on it, load it into a database in the cloud, and build data visualizations on top of it, all in under a week.