Building an ETL pipeline from device to cloud (part 5)
In this part, we will discuss data management and sharing, and create a triggered function app in Azure to send raw JSON data to the vendor.
Series breakdown
In this series of blogs I cover building an ETL data pipeline using Python. The data comes from field devices installed on mobile equipment. Each blog focuses on a specific part of the pipeline that moves the data from the field devices to the cloud.
- Part 1. Building a simple, secure web server
- Part 2. Scaling up the web server with FastAPI
- Part 3. Creating the ETL module as a background worker
- Part 4. Moving data to the Azure Blob storage
- Part 5. Creating a function app to send data to the vendor (you are here)
- Part 6. Ingesting data into Azure Data Explorer
- BONUS. Improving our data pipeline
Managing and sharing data in our blob storage account
With our data now in the cloud, in Azure Blob Storage, we need to make the data available for our end-users. In this part, we will be creating an Azure function app to automatically send data to our external end-user, the device vendor. In part six, we will make the data available to our internal end-users.
Data is the new oil or gold or something like that. Most of us understand the vital role data plays, both in our personal and our business lives. With the advances in artificial intelligence, we can infer many things from data – rightly or wrongly.
In our project we are not dealing with personal data, but rather with data related to several mining trucks. The question is: what can be inferred from the data that may have an impact on the business? For this reason, we need to ensure we manage all our data appropriately.
In the first three parts we securely extracted the device data from the OT network layer (all internal), and in part four we moved it to a central blob storage account in the cloud. So, we have a one-way upload to the cloud. Our cloud resources need to be secured as well, and as mentioned in part four, we need to ensure that we implement network security and use proper authentication methods.
The storage account we used may contain many other data sources as well, so we need to separate the data clearly. With that, we also need to use appropriate SAS keys (managed at container level) or service principals that have access to only the data required. We uploaded our data to specific blob containers with raw and processed directories. Depending on how we plan to share data with end-users and any potential enhancements we implement in the ETL process, we may need to split the raw and processed directories into separate containers.
As part of the agreement with the device vendor, we will provide only the raw device data to them, i.e. the data in the JSON files. No other plant data is included, hence the need for the clear separation discussed above. How we provide the data depends on which methods are available. In all cases, we want to remain in control of how the data is provided, whether by push and/or pull.
For our project, we will push the raw device data automatically once it is uploaded to the storage account using an Azure Function App. The function app will post the JSON data to the vendor’s secured API. In this manner, we control the data being sent.
Creating the function app
The first step is creating our function app and setting it up. There are several options for creating a function app in terms of the runtime stack and hosting. We use Python 3.8 as our runtime stack and publish our code directly.
For hosting we use a Linux OS and an App Service plan, as we will have some other apps running on it as well. (Alternatively, the Consumption plan provides serverless, event-driven scaling at the lowest cost.) During creation we can also link the function app to our storage account.
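For reference, a function app with similar settings could also be created with the Azure CLI. This is only a sketch: the resource group, app, and plan names below are placeholders, and the storage account name is assumed to be the one from part four.
# Create the function app on an existing Linux App Service plan
# and link it to the storage account (names are placeholders).
az functionapp create \
  --resource-group etl-rg \
  --name device-data-push \
  --plan etl-linux-plan \
  --storage-account sah3xagn \
  --runtime python \
  --runtime-version 3.8 \
  --functions-version 3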
Coding the function app
Now that the function app resource is ready, we can start coding. In the Azure Portal there are instructions for creating this function in VS Code (as per the screenshot below).
We want to create a trigger in the function app so that data is sent to the vendor automatically as soon as a blob is created in the raw directory. In VS Code, when creating the project, we select the Azure Blob Storage trigger.
Once the project is created, we can define the exact details of the blob trigger in the function.json file. We change the scriptFile to main.py. For the bindings, name is the parameter name we use in our method and direction is set to in. The path is the container, directory, and file extension that we used in part four to upload the data. Note that connection is a key pointing to the connection string for our storage account, stored in local.settings.json for development.
{
  "scriptFile": "main.py",
  "bindings": [{
    "name": "blobin",
    "type": "blobTrigger",
    "direction": "in",
    "path": "device-data/raw/{name}.json",
    "connection": "sah3xagn_STORAGE"
  }]
}
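For local development, the connection string lives in local.settings.json next to the function. A minimal version could look like the sketch below; the connection string values are placeholders, and this file is not deployed to Azure.
{
  "IsEncrypted": false,
  "Values": {
    "AzureWebJobsStorage": "<storage-connection-string>",
    "FUNCTIONS_WORKER_RUNTIME": "python",
    "sah3xagn_STORAGE": "<storage-connection-string>"
  }
}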
Next, we create a main.py file in the same directory as our function.json file. Our main method receives blobin as an input stream. We need to read that input stream to be able to create a JSON object; for that we use BytesIO from the standard io library.
To send the data via a POST request, we need to install the requests library. We create a try/except block that makes the POST request to the vendor’s provided URL and includes the raw JSON file in the request. We log the actions and the response in case we need to investigate them in the future.
After a few local tests to ensure everything is working as it should, we deploy the app to Azure. Before we deploy, we must update our requirements.txt. This is essential to ensure the libraries we use locally are also installed during deployment. We also need to check that our storage account connection string is part of the application settings, i.e. the connection key in our function.json file above.
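As a sketch, the requirements.txt for this function only needs a couple of entries (exact version pins are omitted here):
# requirements.txt - installed during deployment to Azure
azure-functions
requests
The app can then be published with func azure functionapp publish <function-app-name> using the Azure Functions Core Tools, or directly from the VS Code Azure extension.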
Triggering the function app
The function app is triggered automatically every time a new raw JSON file is uploaded. So, each time our web server’s processing task completes, the app is triggered and the data is sent to the vendor. Depending on the frequency of data received from the field devices, several instances of the function app may be needed to handle the load. The vendor’s system should also be able to cope with the number of requests.
Another benefit of the blob trigger is that it also fires if we upload data manually to our storage account. If uploads from our web server fail, we just need to manually upload the JSON files and the function will automatically send the data to the vendor.
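As an illustration, a manual re-upload along the lines below (a sketch assuming the azure-storage-blob package, with a placeholder connection string and file name) fires the same trigger:
from azure.storage.blob import BlobServiceClient

# Placeholder connection string and file name; uploading into the monitored
# raw/ path of the device-data container fires the same blob trigger.
service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob = service.get_blob_client(container="device-data", blob="raw/device-42.json")

with open("device-42.json", "rb") as handle:
    blob.upload_blob(handle, overwrite=True)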
One consideration is that once we move to a production environment and upload the existing JSON files to a new storage account, we will trigger the function app again, resending all the historic data to the vendor. There should be a process in place to communicate these changes and to handle potential duplicate data.
Conclusion
We discussed the importance of proper data management and of always being in control of data sharing. We now provide the appropriate data to our vendor automatically via an Azure Function App.
Next up
In the next part we will be ingesting our processed data into our big data platform, Azure Data Explorer.