Building an ETL pipeline from device to cloud (part 1)

Published by Coenraad Pretorius on

In this first part, we will be building a simple web server using the standard Python library. This is for initial testing, and we will secure the web server to use HTTPS.

Image adapted from pch.vector & Freepik

Series breakdown

In these series of blogs I will be covering building an ETL data pipeline using Python. The data is coming from field devices, installed on mobile equipment. Each blog will focus on a specific part of the data pipeline. Note that I will add links as soon as the other blogs are posted.

Introduction

The requirement is to securely make data available in the cloud for data analysis and machine learning. The data needs to be sent to the vendor. The data is collected from several field devices that are installed on mobile equipment, e.g., large mining trucks. The device, acting as a gateway on each piece of equipment, has several sensors connected to it. It measures parameters such as oil temperature, density, and viscosity.

The device has two methods to extract the data: using Modbus or using a JSON post request. Modbus is a widely used industrial protocol and fits into existing industrial data stacks. However, it can be tricky to configure. The challenges are to know which registers to read, what the data types are, scaling for decimal values and the real-time nature that does not offer buffer capacity if there are connection issues.

The JSON post request is a software engineering solution with all the required data correctly structured in the JSON body. The files are buffered on the device which ensures data is not lost. The drawbacks are that there can be a larger delay between readings, depending on the setup, and JSON files don’t easily fit into an industrial data stack.

Challenge

With existing knowledge on Modbus and it’s fit into the industrial data stack, which was the go-to protocol to use. The data engineering pipeline was built the tradition way, although the cloud portion is fairly new. Data was read via Modbus, loaded into an on-site historian with the appropriate tags and then streamed to Azure. Once in Azure, another data pipeline needed to be built to get the data to the vendor and that needed to be modified on their side as it is no longer in the JSON format they expected.

The main concern with this solution was there was poor connectivity via Modbus, and this resulted in large data gaps. And due to Modbus being real-time, data could not be backfilled. This led to investigating the use of the JSON post request as a backup to backfill or even a potential alternative. That is what we will be focussing on in this series.

The first challenge is that the traditional industrial data stack does no offer all the required tools for this purpose, especially because JSON files can differ from device to device. We need to ensure we can securely get data from the device, process it, and send it to the cloud. Industrial systems are locked down for security and allow only certain software and secure ports to be opened. Due to limited internet connectivity, there is some manual setup required.

Create a basic server for testing

The approach I choose was to use Python as standard throughout the solution as it is quick and easy to get started, make changes, and requires minimal overhead. I downloaded Python 3.8, copied it over to our server and installed it. I requested the required ports to be opened, using HTTP/80 for initial testing, and then moving the HTTPS/8443. HTTPS is normally over 443, but other services were using that port on the server.

Now, we can start building our server, with the exception that I can only use the Python standard library, due to the firewall restricting access. Luckily, this was straightforward using SimleHTTPRequestHandler. An HTTP server is created listening on port 80. It will receive several requests, but I only catered for the post request with the data as the JSON body. The server accepts the request, send HTTP 200 status code and saved the files locally.

It was critical to save the file locally as the device deletes the buffered JSON files once it receives the HTTP 200 status code. If the HTTP 200 status code was not sent, the device will try and upload the data once again. However, this can cause duplicates on the local server if the JSON object is saved locally and the correct response it not sent.

View this gist on GitHub

After some trial and error, testing with Postman and checking logs in the device I had success. Once issue that needed to be corrected was to ensure the devices can connect to the local NTP server to ensure correct timestamps on the readings.

Create a secured web server with minor improvements

The next step is to switch to a secure channel using HTTPS. For this, we need an SSL key and certificate. The easiest way to get started is to use openssl and Windows Subsystem for Linux. One of our admins assisted me with the config file and commands to use:

# file: san.cnf
[ req ]
default_bits       = 2048
distinguished_name = req_distinguished_name
req_extensions     = req_ext
prompt = no
[ req_distinguished_name ]
countryName                = Country HERE <<<<<<<<
stateOrProvinceName        = State/Province HERE <<<<<<<<
localityName               = City HERE <<<<<<<<
organizationName           = OrgName HERE <<<<<<<<
commonName                 = FQDN HERE <<<<<<<<
[ req_ext ]
subjectAltName = @alt_names
[alt_names]
DNS.1   = FQDN HERE <<<<<<<<
IP.1    = IP ADDRESS HERE <<<<<<<<<
# bash command
$ openssl req -new -newkey rsa:2048 -nodes -keyout key.pem -out cert.csr -config san.cnf

After generating the certificate request and getting back the actual certificate, I can now enable secure communications. We need to have both the key and the certificate files. Some conversion may be required using openssl depending on the format of the certificate you receive. I also added a logging file and captured the client IP address in the JSON data file name.

View this gist on GitHub

Conclusion

In this blog I introduced you to the challenges of getting data reliably to the cloud from several field devices. We started to build an alternative secure data pipeline to enable backfilling of the data, using the Python standard library.

Next up

In the next part, we will scale up our solution to allow for multiple devices to send data and prepare for our ETL tasks.


1 Comment

Zohaib Yolo · June 30, 2022 at 21:48

Excellent Stuff! Glad you took the time for this

Leave a Reply

Avatar placeholder

Your email address will not be published. Required fields are marked *