Building an ETL pipeline from device to cloud (part 6)
In this, the almost-final part, we build the last component: it ingests the transformed data into Azure Data Explorer and creates exports to our data lake.
In this series of blog posts, I cover building an ETL data pipeline using Python. The data comes from field devices installed on mobile equipment. Each post focuses on a specific part of the pipeline that moves the data from the field device to the cloud.
- Part 1. Building a simple, secure web server
- Part 2. Scaling up the web server with FastAPI
- Part 3. Creating the ETL module as a background worker
- Part 4. Moving data to the Azure Blob storage
- Part 5. Creating a function app to send data to the vendor
- Part 6. Ingesting data into Azure Data Explorer (you are here)
- BONUS. Improving our data pipeline
Our big data platform
We are now nearing the end of our data pipeline. We started off by getting raw JSON data from field devices installed on mining trucks, transforming the data, and uploading it to the cloud. In the previous part, we sent data to our external end-user, our vendor. For this final part in the series, we need to make data available to our internal end-users.
For this task, we chose Azure Data Explorer (ADX), which its website describes as a “fast and highly scalable data exploration service”. It is optimised for large data volumes, and you can ingest data from several sources, e.g., IoT Hub and Function Apps (via the API). You can also create external tables that point to files in storage accounts and to SQL databases, as well as link to BI tools.
Utilising the power of the ADX cluster, we can compute things like 10-minute averages and summarise the data automatically using materialised views. With the continuous export functions, we can export data in several formats to the required locations. Given that we plan to add more data sources in future, and the power and flexibility ADX offers, it is a good big data platform for our project.
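As a rough sketch of what such a summary looks like, here is a materialised view computing 10-minute averages. The table and column names (`raw_oil_data`, `TagName`, `Timestamp`, `Value`) are illustrative, not necessarily the exact schema used in this project:

```kusto
// Illustrative: 10-minute averages per tag, kept up to date automatically.
// Table and column names are assumptions -- adjust to your own schema.
.create materialized-view OilData10mAvg on table raw_oil_data
{
    raw_oil_data
    | summarize AvgValue = avg(Value) by TagName, bin(Timestamp, 10m)
}
```

ADX maintains the view incrementally as new data is ingested, so queries against it stay fast even as the raw table grows.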
Azure Data Explorer Training
Microsoft offers free access to three Azure Data Explorer courses to get up to speed on this fast, fully managed data analytics service. Azure Data Explorer (pluralsight.com) (not an affiliate link)
Ingesting our transformed data
The first step is to create our ADX cluster. For the compute specification we will use the Dev/test workload, which is more than adequate for our current project. As the platform grows, we can change our compute SKU and scale out with multiple instances.
Once the cluster is created, we create our dev database for our site. We use the default retention period of 365 days, as our data is also stored in our storage account should historic data be needed later. The default cache period is also suitable, as most of the analysis will be on recent data. Both may vary depending on your requirements.
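If you prefer to set these policies explicitly (or change them later), it can be done with control commands. The database name below is a placeholder:

```kusto
// Sketch: pin retention and hot-cache policies on the dev database.
// "SiteDev" is an assumed database name.
.alter-merge database SiteDev policy retention softdelete = 365d
.alter database SiteDev policy caching hot = 31d
```

The retention policy controls how long data is kept at all; the caching policy controls how much of it is held on local SSD for fast queries.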
ADX uses the Kusto Query Language (KQL), and a cheat sheet is available for those familiar with SQL. For our project, we interact with the cluster through the built-in Web UI, which provides a quick way to start ingesting data using the Ingest from blob option. Below are some of the important screenshots.
From our compressed CSV files, ADX will try to infer the schema for us. Always check the data types, especially when working with schemaless formats such as CSV.
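If the inferred schema needs correcting, you can define the table and a CSV ingestion mapping yourself. The table name, columns, and mapping below are illustrative assumptions:

```kusto
// Sketch: declare the schema explicitly instead of relying on inference.
// Table and column names are assumptions.
.create table raw_oil_data (Timestamp: datetime, TagName: string, Value: real)

// A CSV ingestion mapping pins each column to its ordinal position.
.create table raw_oil_data ingestion csv mapping "OilCsvMapping"
'[{"column":"Timestamp","Properties":{"Ordinal":"0"}},{"column":"TagName","Properties":{"Ordinal":"1"}},{"column":"Value","Properties":{"Ordinal":"2"}}]'
```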
The ingestion happens in the background, and after a few minutes we can start querying our data. Ingestion takes time depending on the amount of data, and it can be monitored through the various tools in Azure. Once ingestion has completed, we see we have 52 field devices sending data and about 38.6 million data points already, from the end of February to the end of April.
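A quick sanity-check query gives these kinds of totals. The names follow the assumed schema above; adapt them to your own tables:

```kusto
// Sketch: verify row counts and the time range covered after ingestion.
// Table and column names are assumptions.
raw_oil_data
| summarize DataPoints = count(),
            Tags = dcount(TagName),
            From = min(Timestamp),
            To = max(Timestamp)
```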
Automate data ingestion
We want to ensure that our data is automatically ingested into ADX when our web server uploads it to our storage account. To achieve this, we add a data connection to ADX using Event Grid for Blob Storage. We link it to our storage account and choose the Blob created event type. In the ingestion properties, we select our ADX database and the table we created above.
Now, as soon as we upload a file to our storage account, Event Grid will pick it up and ADX will ingest the new data into our database. We can now leverage the full power of ADX and make data available to our internal end-users. By default, the maximum ingestion batching values are 5 minutes, 1 000 items, or a total size of 1 GB, whichever is reached first.
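Those batching defaults can be tuned per table if lower ingestion latency matters more than throughput. A sketch, with an assumed table name and the default values written out:

```kusto
// Sketch: the ingestion batching policy; these are the default values.
// Lower MaximumBatchingTimeSpan for lower latency at the cost of
// more, smaller extents. Table name is an assumption.
.alter table raw_oil_data policy ingestionbatching
'{"MaximumBatchingTimeSpan":"00:05:00","MaximumNumberOfItems":1000,"MaximumRawDataSizeMB":1024}'
```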
Another benefit here is that our end-users access data from the cloud, without interfering with or adding load to the on-prem networks and source systems. This ensures that on-prem systems do what they need to do and there is only one data pipeline going out. No more multiple connections to these systems and networks.
Creating a data model
Our job is not done just yet: we need to add context to our transformed data for our end-users. We have the tag names, timestamps, and values in our raw data table, but we also have metadata relating to the site, equipment, and tags (such as engineering units, limits, etc.).
From our raw_oil_data table we can create a factOilData table and parse the TagName to create additional keys by splitting it on the ".". With the additional keys we can create two new dimension tables, which we manually populate and load, including meta_system_tag. Remember that ADX is append-only, so we need to recreate the dimension tables to update them.
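The parse step can be sketched as below. The key names and the dot-separated tag convention are assumptions for illustration; a one-off `.set-or-append` is shown, though an update policy could automate this on ingestion:

```kusto
// Sketch: populate factOilData by splitting TagName on "." into keys.
// SiteKey/EquipmentKey/TagKey names and the three-part convention
// are assumptions.
.set-or-append factOilData <|
    raw_oil_data
    | parse TagName with SiteKey "." EquipmentKey "." TagKey
    | project Timestamp, SiteKey, EquipmentKey, TagKey, Value
```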
With our data model created, we can set up exports to our data lake for our end-users. The first step is to create the three external tables pointing to the data lake storage account. During creation we define the data format (Parquet in our case), the partitions and path, as well as the storage account.
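An external table definition looks roughly like this. The storage account, container, schema, and partitioning scheme are placeholders, and the connection string here uses impersonation rather than an embedded secret:

```kusto
// Sketch: a Parquet external table over the data lake, partitioned by day.
// Account, container, and column names are assumptions.
.create external table extFactOilData
    (Timestamp: datetime, SiteKey: string, EquipmentKey: string,
     TagKey: string, Value: real)
kind = storage
partition by (Date: datetime = bin(Timestamp, 1d))
pathformat = (datetime_pattern("yyyy/MM/dd", Date))
dataformat = parquet
(
    h@'https://<storageaccount>.blob.core.windows.net/oildata;impersonate'
)
```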
The last step is to export the data. As our dimension tables don’t change often, we export them manually when needed. Our fact data, on the other hand, changes all the time, so for it we set up a continuous export at one-hour intervals. The export query is where we add the parse operation that creates the additional keys used in the dimension tables.
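Putting it together, a continuous export of the parsed fact data might look like this. The export, table, and key names are the same illustrative assumptions as above:

```kusto
// Sketch: hourly continuous export of parsed fact data to the data lake.
// Names are assumptions; the parse mirrors the data-model step.
.create-or-alter continuous-export ExportFactOilData
over (raw_oil_data)
to table extFactOilData
with (intervalBetweenRuns = 1h)
<| raw_oil_data
   | parse TagName with SiteKey "." EquipmentKey "." TagKey
   | project Timestamp, SiteKey, EquipmentKey, TagKey, Value
```

ADX tracks the export cursor itself, so each run only writes rows ingested since the previous run.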
We set up our big data platform, Azure Data Explorer, and created ingestion events to continually ingest our transformed data. Next, we created our data model and exported our dimension tables. As the last step we created a continuous export of our raw data to the data lake for our end-users.
Yes, there will be a bonus post! I am creating a blog post that will look at improvements in our pipeline to make it more robust and maintainable.