Extract dark data from SQL Server

Published by Coenraad Pretorius

Many systems exist today that were simply not optimised for time-series data analysis and machine learning. People may not even know the data is being collected regularly as part of day-to-day operations. This is often referred to as dark data. As the hunger for data increases, data engineers are tasked to make this dark data available.

Adapted from repository vector created by fullvector.

What are the requirements?

It is essential to clearly understand what data the team needs: at what frequency, at what sampling interval, with what data quality checks, and so on. In practice, the normal (and lazy) answer is everything, everywhere, all at once (great movie btw). To me, this usually indicates that the team does not understand the problem they are trying to solve clearly enough.

Extracting all the data can put additional load on source systems, which may cause them to fall over. This is because many older systems were not designed to provide real-time data to external systems. In addition, there are also connectivity and bandwidth considerations, especially for remote sites.

In most cases, the task is to build a scalable data pipeline that ingests data as it is required and moves it to the cloud. Monitoring the impact on the source system and on bandwidth may be critical, as is designing for intermittent connectivity.

Understanding the source system

Initial investigation is needed to understand the source system: where is the system located, how do you get access to it, what data is captured and at what frequency, and what data privacy steps are needed.

In our example case, the source system is located on-premises and captures process data at intervals anywhere from 5 seconds to an hour. Data is stored in several SQL databases, one database per week, each with two separate tables: one for analogue and one for discrete data.

For short-term trending of a few tags, this works well. However, for long-term trends of a few hundred tags, the system slows down as it must run queries across multiple databases and tables, mostly using views. This has in the past overloaded the source system, causing it to become unresponsive.

Design principle

Having several different teams extract data directly from a constrained, on-prem source system every time they need it cannot be allowed. The design principle is to build one data pipeline that extracts data from the source system and never reads the same data twice from the source. This data is then moved to a cloud platform where it can be consumed multiple times. This is much more efficient and secure at the end of the day.

Within the cloud environment, we have better processing capabilities, and we have better control of the data from a security point of view.

Extracting data within a constrained environment

To minimise the impact on the source system and bandwidth of the site, we restore the weekly database backups to another non-critical SQL Server on-prem. From there, we use the SQL Server Command Line Utility, sqlcmd, to extract CSV data from the two tables directly. Finally, the CSV files are converted into parquet files for uploading to Azure.

On our non-critical server, we first install the SQL Server Command Line Utility. You may also need to install Python if it is not already installed. The first portion of the Python script reads the metadata table, which contains the TagID, the TagName, and other valuable information. The query is a normal SQL query to which you can add some logic, such as handling commas in the Description column in this case.

We execute sqlcmd with some basic parameters such as the server, database, username and password. Now we add arguments to the command:

  • -Q where we pass our SQL query.
  • -s to set the column separator to ,.
  • -h -1 to remove the header row. That is why we added the PRINT statement in the query, to supply our own header.
  • -W to remove trailing spaces.
  • -o to output to our file.
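Here is a minimal sketch of this step in Python. The connection details and the metadata table and column names (TagTable, TagID, TagName, Description) are placeholders, so adjust them to your system.

```python
import subprocess

# Placeholder connection details -- replace with your own.
SERVER = "localhost"
DATABASE = "HistorianConfig"
USER = "extract_user"
PASSWORD = "secret"

# PRINT supplies our own header row (sqlcmd's header is suppressed with -h -1)
# and REPLACE strips commas from the Description column so they do not break
# the CSV output.
query = (
    "SET NOCOUNT ON; "
    "PRINT 'TagID,TagName,Description'; "
    "SELECT TagID, TagName, REPLACE(Description, ',', ' ') AS Description "
    "FROM TagTable;"
)

subprocess.run(
    [
        "sqlcmd",
        "-S", SERVER,              # server to connect to
        "-d", DATABASE,            # database holding the metadata table
        "-U", USER,                # SQL Server login
        "-P", PASSWORD,            # password
        "-Q", query,               # the query to run
        "-s", ",",                 # comma as the column separator
        "-h", "-1",                # remove the header row
        "-W",                      # remove trailing spaces
        "-o", "tag_metadata.csv",  # output file
    ],
    check=True,
)
```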

Secondly, we get a list of all the historic databases available on the server. Note that you may have to grant additional permissions to your SQL Server account for this step.

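A sketch of this step, continuing from the snippet above and assuming the weekly history databases share a common name prefix (here History_):

```python
# List the weekly history databases on the server. The 'History_%' name
# pattern is an assumption -- match it to your naming convention.
list_query = (
    "SET NOCOUNT ON; "
    "SELECT name FROM sys.databases WHERE name LIKE 'History_%' ORDER BY name;"
)

result = subprocess.run(
    ["sqlcmd", "-S", SERVER, "-U", USER, "-P", PASSWORD,
     "-Q", list_query, "-h", "-1", "-W"],
    capture_output=True, text=True, check=True,
)

# Each non-empty line of stdout is a database name.
databases = [line.strip() for line in result.stdout.splitlines() if line.strip()]
print(f"Found {len(databases)} history databases")
```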

Now that we have a list of databases, we iterate through each one, extracting CSV data for the analogue and discrete tables. CSV files are uncompressed and take up a lot of space on the hard drive, so we convert the two table CSV files to parquet format and then delete the CSVs. This ensures that we do not run out of disk space. Still, make sure you have enough free space 🙂

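A sketch of the loop, assuming the two tables are called AnalogHistory and DiscreteHistory and that pandas with pyarrow is installed for the parquet conversion:

```python
import os

import pandas as pd  # requires pyarrow (or fastparquet) for to_parquet

TABLES = ["AnalogHistory", "DiscreteHistory"]  # assumed table names

for db in databases:
    for table in TABLES:
        csv_path = f"{db}_{table}.csv"
        parquet_path = f"{db}_{table}.parquet"

        # Extract the whole table to CSV with sqlcmd.
        subprocess.run(
            ["sqlcmd", "-S", SERVER, "-d", db, "-U", USER, "-P", PASSWORD,
             "-Q", f"SET NOCOUNT ON; SELECT * FROM {table};",
             "-s", ",", "-h", "-1", "-W", "-o", csv_path],
            check=True,
        )

        # Convert the CSV to parquet, then delete the CSV to free disk space.
        df = pd.read_csv(csv_path, header=None)
        df.columns = [f"col_{i}" for i in df.columns]  # parquet needs string column names
        df.to_parquet(parquet_path, index=False)
        os.remove(csv_path)
```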

The last step is to upload the parquet files to Azure and process them further in the cloud.
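As a sketch, the upload can be done with the azure-storage-blob package; the connection string and container name below are placeholders:

```python
import glob
import os

from azure.storage.blob import BlobServiceClient

# Placeholder connection string and container name -- replace with your own.
CONNECTION_STRING = "<your storage account connection string>"
CONTAINER = "historian-raw"

service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
container = service.get_container_client(CONTAINER)

# Upload every parquet file in the working folder.
for path in glob.glob("*.parquet"):
    with open(path, "rb") as data:
        container.upload_blob(name=os.path.basename(path), data=data, overwrite=True)
```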

While this was done on a secondary, non-critical server, I have also tested it on the live server. Judging from the CPU and memory load, it appears to have minimal impact; however, it is best to avoid this approach.

Conclusion

In this post we discussed extracting dark data from a constrained on-prem SQL Server and moving that data to the cloud. Most importantly, we covered an important design principle: one connection to the source system moves data to the cloud, where it can then be controlled and consumed multiple times.

Updated code and full repo available on GitHub.

Next step

In this post we established a batch data pipeline, one that is also useful for backfilling historic or missing data. In the next post, we will look at streaming data to the cloud as it becomes available.
