
Today’s organizations aim to be data-driven, that is, to make decisions based on data. To achieve this, companies typically implement a Data Governance (DG) program that integrates all internal and external data into a single database while ensuring accuracy, timeliness, completeness, and consistency.

A DG program is commonly structured around five elements: 1) engagement, 2) strategy, 3) architecture and design, 4) implementation, and 5) operations and changes. Engagement refers to clearly establishing what the DG program will serve and the commitment that stakeholders will make to it. The strategy creates the plan and requirements needed to achieve the organization’s objectives, aligned with DG. Architecture and design covers the description and design of the new capabilities the organization needs to support the data initiative. Implementation is the deployment and sustainment of the program. Finally, operations and changes is the ongoing execution and refinement of DG [1].

Data engineering cuts across architecture, design, and implementation through a process called Extraction, Transformation, and Load (ETL), whose purpose is to integrate data from various sources [2]. In a cloud architecture, the ETL process relies on a storage layer called a data lake, where a copy of the source data is stored (extraction). The transformation step then explores, cleans, and converts this data into a suitable format to ensure its quality. Finally, the load step moves the transformed data into a single, structured data warehouse (which is distinct from the data lake).
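A minimal sketch of this flow in Python with pandas is shown below. The client database, the "studies" table, and the local folders standing in for the data lake and warehouse are hypothetical placeholders, not the actual IMEXHS pipeline:

```python
import os
import sqlite3
import pandas as pd

# --- Extraction: copy raw source data into the data lake (here, a local folder) ---
# "client.db" and the "studies" table are hypothetical stand-ins for a source system.
os.makedirs("data_lake", exist_ok=True)
with sqlite3.connect("client.db") as source:
    raw = pd.read_sql_query("SELECT * FROM studies", source)
raw.to_parquet("data_lake/studies_raw.parquet")

# --- Transformation: clean and convert the raw copy into an analysis-ready format ---
df = pd.read_parquet("data_lake/studies_raw.parquet")
df = df.dropna(subset=["study_id"])                  # drop incomplete records
df["created_at"] = pd.to_datetime(df["created_at"])  # normalize types
df = df.drop_duplicates(subset=["study_id"])         # enforce consistency

# --- Load: write the transformed data into the structured data warehouse ---
with sqlite3.connect("warehouse.db") as dwh:
    df.to_sql("studies", dwh, if_exists="append", index=False)
```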

The ETL process can be implemented with a cloud service (e.g., Azure Synapse) or with Python code and a database system running on a virtual machine. When the integrated data comes from hundreds of federated sources, each holding a large volume of data, the storage and processing requirements are necessarily large. IMEXHS, for instance, hosts hundreds of clients on its cloud servers: with 200 clients each storing 500 GB of data, the data lake must hold at least 100 TB. Under these conditions, the cost of an ETL service can be significant. One strategy for a cost-effective ETL is to automate queries against each client’s database so that the data lake receives only pre-filtered, usable data, which is then transformed and loaded into the final database; this substantially reduces storage costs. To optimize processing time, the large tables can be partitioned and the transformation run on each partition, with the load executed sequentially over the partitions. In short, building a cost-effective ETL combines several optimizations in how the large dataset is extracted and organized, as illustrated in the sketch below.
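The sketch below illustrates these two optimizations together: a pre-filtering query run against each client’s database, and chunked reading as a simple partitioning mechanism so each partition is transformed and loaded sequentially. The client database files, the "studies" schema, the filter conditions, and the chunk size are all assumptions for illustration; a production pipeline would use the real client schemas and cloud storage:

```python
import sqlite3
import pandas as pd

# Hypothetical client databases; in practice, connections to each tenant's system.
CLIENT_DBS = ["client_001.db", "client_002.db"]

# Pre-filtering query: only usable, recent records land in the data lake,
# which keeps storage (and therefore cost) down.
FILTER_QUERY = """
    SELECT study_id, client_id, modality, created_at
    FROM studies
    WHERE status = 'completed' AND created_at >= date('now', '-1 year')
"""

CHUNK_SIZE = 50_000  # partition size for the transformation step


def transform(chunk: pd.DataFrame) -> pd.DataFrame:
    """Clean one partition: drop incomplete rows and normalize types."""
    chunk = chunk.dropna(subset=["study_id"])
    chunk["created_at"] = pd.to_datetime(chunk["created_at"])
    return chunk.drop_duplicates(subset=["study_id"])


with sqlite3.connect("warehouse.db") as dwh:
    for db_path in CLIENT_DBS:
        with sqlite3.connect(db_path) as source:
            # The automated, pre-filtered query runs against each client, and each
            # partition is transformed and loaded sequentially, so only one chunk
            # needs to be held in memory (or paid for in compute) at a time.
            for chunk in pd.read_sql_query(FILTER_QUERY, source, chunksize=CHUNK_SIZE):
                transform(chunk).to_sql("studies", dwh, if_exists="append", index=False)
```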

References:

[1] John Ladley, Data Governance: How to Design, Deploy, and Sustain an Effective Data Governance Program. Elsevier Inc., 2nd edition, 2020, pp. 64-80.

[2] Paul Crickard, Data Engineering with Python. Packt Publishing, 1st edition, 2022, pp. 4-13.


Leonardo Pérez
Head of BI/BA and big data at IMEXHS
leonardo.perez@imexhs.com
