Building an ETL pipeline from device to cloud (part 2)

Published by Coenraad Pretorius on

In this part, we will rebuild the secure server from part one using FastAPI. This allows for multiple workers and for background tasks, which we will use in part three.

Image adapted from pch.vector & Freepik

Series breakdown

In this series of blog posts I cover building an ETL data pipeline using Python. The data comes from field devices installed on mobile equipment. Each post focuses on a specific part of the pipeline that moves the data from the field device to the cloud.

Scaling up data acquisition

In part one, we created a basic web server using the Python standard library and secured it with SSL. This works well in principle, but we need to scale this server to handle POST requests from about 80 devices.

Our basic server runs synchronously, meaning it handles requests one at a time and can only move on once the previous task has been completed. We want a server that can run asynchronously, meaning it can handle multiple requests concurrently and does not need to wait for another task to finish. After some research, FastAPI appeared to be one of the solutions that would work for scaling up our project: it can run asynchronously, it supports background tasks and, as the name suggests, it is fast.
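The difference is easy to see with a toy asyncio sketch (illustrative only, not the actual server): ten simulated requests that each wait 0.1 s for I/O complete together in roughly 0.1 s, where a synchronous server would take about 1 s.

```python
import asyncio
import time

async def handle_request(i: int) -> int:
    await asyncio.sleep(0.1)  # stands in for slow network I/O
    return i

async def main() -> float:
    start = time.perf_counter()
    # all ten "requests" wait concurrently instead of one after another
    await asyncio.gather(*(handle_request(i) for i in range(10)))
    return time.perf_counter() - start

elapsed = asyncio.run(main())
print(f"10 requests handled in {elapsed:.2f} s")  # ~0.1 s, not ~1 s
```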

As we are scaling up, we also need to apply best practices, and thus we need to use virtual environments. Virtual environments allow us to separate our project libraries and dependencies from the globally installed Python environment. This makes the code more portable and reproducible on another system.

There are a few choices out there, and I have been using Pipenv for many years. I find it quick and easy to use. On Windows, all the virtual environments are created in the user's folder, i.e. C:\Users\<username>\.virtualenvs.
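For reference, a typical Pipenv workflow looks like the following. The package names are the ones we need later in this post; adjust them for your own project.

```shell
# install pipenv into the global interpreter (once)
pip install pipenv

# from the project folder: create the virtual environment and
# install dependencies (recorded in Pipfile / Pipfile.lock)
pipenv install fastapi uvicorn

# activate the environment, or run a one-off command inside it
pipenv shell
pipenv run python main.py
```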

Getting our packages on the web server

In part one we used only the standard Python libraries due to security restrictions on installing packages from PyPI. There are two ways around this: we can request that the proxy rules be updated to allow downloading packages, or we can download the required packages elsewhere and install them offline.

Using a proxy

We need to install pipenv, and we need to tell the global Python interpreter to use our proxy server. Below is a PowerShell script I found that you can run; your IT admins should be able to provide you with the proxy details.

# replace proxy_server and port with your server details

$name, $value = (python -c 'print(''http_proxy=http://<proxy_server>:<port>'')') -split '=', 2
Set-Item "Env:$Name" $value
$env:http_proxy

$name, $value = (python -c 'print(''https_proxy=http://<proxy_server>:<port>'')') -split '=', 2
Set-Item "Env:$Name" $value
$env:https_proxy

Now we can install pipenv and create our virtual environment. Next, we need to tell our new virtual environment to use the proxy. This is achieved by creating a .env file with the details of our proxy server. Note that after defining the proxy server details, you need to reload the virtual environment for them to take effect.

# .env
# replace proxy_server and port with your server details

HTTPS_PROXY=http://<proxy_server>:<port>
HTTP_PROXY=http://<proxy_server>:<port>

Using offline packages

This method is a bit more challenging. We need to create a virtual environment on a PC that has access to the internet and download all the required packages into it. We then copy the packages folder over to our web server and install the packages from there. A great guide by Prateek Rungta can be found on GitHub.

Setting up FastAPI

With our virtual environment set up and packages installed, we can finally get to creating our FastAPI web server. The setup is similar to what we did in part one, and we can re-use the SSL key and certificate for this web server.

The first difference is the use of asynchronous methods for GET and POST requests, which start with async. In the current implementation we are not yet utilising the full benefits of asynchronous methods, but they will become important later.

View this gist on GitHub

For the POST request, we needed to include the exact path that the field devices post to, i.e. @app.post("/upload/data"); our first server used more of a catch-all approach. The server runs on the Uvicorn web server with four workers, meaning four worker processes can handle POST requests in parallel.

I noticed that this server initially generated exponentially more data files. On investigating the field device logs, I saw that the data was sent to the web server but the web server did not send back an HTTP 200 response. The field device treats this as a failed send and tries to resend the file a few minutes later, resulting in many duplicate files. The solution was to include status_code=200 in the response.

Conclusion

In part one, we created a proof-of-concept web server. In this part, we scaled the solution up for a production environment to handle multiple POST requests, and discussed how to get Python packages onto our web server.

Next up

In the next post, we will be using the background tasks in FastAPI for our ETL process.
