Evolution of Data Pipelines
In the past, when data had to be updated, operators manually entered it into a data table. This led to data-entry errors and delays. Since updates were mostly done in batches, typically as a daily job, there was substantial lead time between the moment an event occurred and the moment it was reported. Decision makers had to live with this lag and often made decisions on stale data.
Fast forward to the present, and real-time updates and insights are now commonplace requirements. Data pipelines were essentially built to move data from one layer (transactional or event sources) to data warehouses or lakes where insights were derived.
The question is: with these advancements in requirements to support real-time insights, along with other quality requirements, are we still efficient using traditional architectures or the popular ETL approaches? Let’s find out!
Current state of Data Pipeline Architectures and Challenges
Data pipelines are important to any Product Digitization program. In the latter half of this decade we have witnessed immense focus on digital architecture and technologies; the strong growth trajectory of microservices and containerization adoption establishes this fact. We also see these tech advancements being applied, but mostly limited to the traditional “OLTP” or core service/business-logic side.
However, the story is a bit different when one inspects the patterns involved in data pipelines, or the “OLAP” side of things. Here we observe limited adoption of the tech evolution seen in the core services space. Most data pipelines are built using either traditional ETL or ELTL architectures, the popular de-facto approaches in the industry. Though these do solve the larger problem at hand, i.e. deriving actionable insights, they also come with certain limitations. Let’s explore some of these challenges:
Siloed Teams: The ETL process requires expertise or skills in data extraction or migration. This could mean the technical team is layered or structured around the technical nuances of the process. E.g.: an ETL engineer is often oblivious to the insights being derived and how they are consumed by end users.
Limited Manifestation: The implementation team ends up trying to fit every desired use case into the set structure or pattern. Though this is not always a problem or a wrong thing to do, there are times when it is inefficient. E.g.: how does one extract from an unstructured source and model the intermediate persistence schema?
Latency: The time taken to extract, transform and load the data often introduces lag. This lag can be attributed to data being processed in batches, or to the intermediate load steps needed to persist interim results. In some business scenarios this is not acceptable. E.g.: data streams emanating from an IoT service are stored and batch processed at a later scheduled time, thereby introducing a lag between data generation and updated insights on dashboards.
Future state of Data Pipeline Architecture and Key considerations
As we see advancements in general software architecture, like microservices, service mesh and so on, there is a need for similar modernization on the data side. One key approach emerging is to distribute the data pipeline across domains instead of building one centralized pipeline; each domain contributes its own pipeline and data products, and together these form a Data Mesh. Data Mesh aims to address these challenges by adopting a different approach:
- Teams or pods that are aligned on functional feature delivery
- Treat Data as a Product (discoverable, self-contained and secure)
- Polyglot storage and communication facilitated via the Mesh
An initial read on Data Mesh can be found here.
Data Mesh can be implemented in various ways. One effective pattern is to use an event-driven approach and event storming to form Data Products. A domain can comprise one or more Data Products. This also means that data can be redundant and persisted in more than one store, which is referred to as polyglot storage. Finally, these Data Products are consumed via Mesh APIs designed along the lines of each domain’s requirements.
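To make this concrete, here is a minimal Python sketch of one hypothetical Data Product in an “orders” domain: it applies domain events to two purpose-built stores (polyglot persistence) and exposes a small read API of the kind a mesh could publish to consumers. The event shape, store choices and method names are illustrative assumptions, not a reference implementation.

```python
# Minimal sketch of an event-driven "orders" Data Product (illustrative only).
# Events arrive from the domain's event stream; the product keeps two
# redundant, purpose-built stores (polyglot persistence) and exposes a
# small read API that a mesh layer could publish for consumers.
import sqlite3
from collections import defaultdict

class OrdersDataProduct:
    def __init__(self):
        # Store 1: relational store for ad-hoc analytical queries.
        self.analytics_db = sqlite3.connect(":memory:")
        self.analytics_db.execute(
            "CREATE TABLE orders (order_id TEXT, customer TEXT, amount REAL)"
        )
        # Store 2: in-memory aggregate optimized for low-latency lookups.
        self.revenue_by_customer = defaultdict(float)

    def on_event(self, event: dict) -> None:
        """Apply a single OrderPlaced event to both stores."""
        if event.get("type") != "OrderPlaced":
            return
        self.analytics_db.execute(
            "INSERT INTO orders VALUES (?, ?, ?)",
            (event["order_id"], event["customer"], event["amount"]),
        )
        self.revenue_by_customer[event["customer"]] += event["amount"]

    # --- read API exposed through the mesh ---
    def total_revenue(self, customer: str) -> float:
        return self.revenue_by_customer[customer]

    def orders_for(self, customer: str) -> list:
        cur = self.analytics_db.execute(
            "SELECT order_id, amount FROM orders WHERE customer = ?", (customer,)
        )
        return cur.fetchall()

# Usage: replay a couple of domain events and query the product.
product = OrdersDataProduct()
product.on_event({"type": "OrderPlaced", "order_id": "o-1", "customer": "acme", "amount": 120.0})
product.on_event({"type": "OrderPlaced", "order_id": "o-2", "customer": "acme", "amount": 80.0})
print(product.total_revenue("acme"))   # 200.0
print(product.orders_for("acme"))      # [('o-1', 120.0), ('o-2', 80.0)]
```

The point of the sketch is the shape, not the technology: each domain team owns the events it consumes, the stores it chooses and the API it serves, and the mesh stitches these products together.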
Other architectural styles include Data Lake, Data Hub and Data Virtualization. A brief comparison of these can be found here.
Some other considerations that one should evaluate:
- Facilitate easy data access at any time using standard interfaces like SQL. Tech like Snowflake, DBT and Materialize enables such real-time joins, which not only enables BI but also helps with the low-level plumbing of the pipeline
- Design data pipelines to be robust and fault tolerant, e.g. checkpoint intermediate results where required for further analysis (see the sketch after this list)
- Leverage distributed, loosely-coupled processing units that can scale and use polyglot technologies, e.g. Spark jobs or Python models
- Use Data Virtualization to mitigate bottlenecks, e.g. to shorten the lead time for data availability
- Use DataOps effectively to track and evaluate your data pipeline’s performance
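As a concrete illustration of the fault-tolerance point above, here is a minimal Python sketch that checkpoints an intermediate result so a pipeline stage can resume after a failure instead of recomputing from scratch. The stage functions, checkpoint file name and sample data are illustrative assumptions, not a prescribed implementation.

```python
# Minimal sketch of checkpointing an intermediate result so a pipeline stage
# can resume after a failure instead of recomputing from scratch.
import json
import os

CHECKPOINT = "cleaned_events.checkpoint.json"  # illustrative path

def extract() -> list:
    # Stand-in for reading from the real source (files, queue, API, ...).
    return [{"id": 1, "value": " 42 "}, {"id": 2, "value": "oops"}, {"id": 3, "value": "7"}]

def clean(records: list) -> list:
    # Intermediate transform whose output is worth keeping for later analysis.
    cleaned = []
    for r in records:
        try:
            cleaned.append({"id": r["id"], "value": int(r["value"].strip())})
        except ValueError:
            pass  # a real pipeline would route bad rows to a dead-letter store
    return cleaned

def load(records: list) -> None:
    # Stand-in for loading into the serving store / warehouse.
    print(f"loaded {len(records)} records")

def run():
    if os.path.exists(CHECKPOINT):
        # Resume: reuse the checkpointed intermediate result from a failed run.
        with open(CHECKPOINT) as f:
            cleaned = json.load(f)
    else:
        cleaned = clean(extract())
        with open(CHECKPOINT, "w") as f:
            json.dump(cleaned, f)  # persist before the (possibly flaky) load step
    load(cleaned)
    os.remove(CHECKPOINT)  # clear the checkpoint once the run completes

run()
```

The same idea scales up in distributed engines (for example, Spark exposes checkpointing of intermediate datasets); what matters is persisting interim results at the points where recomputation would be expensive or the data is worth inspecting later.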
Conclusion
Finally, I would like to conclude with a disclaimer. This article is not meant to discard the current architectures associated with ETL. In fact, for certain use cases like batch jobs, ETL is still a very good option to adopt. The intent here is rather to recognize that requirements vary and to explore further architectures that could suit the need well. In this article, we looked at a few such architectures, like Data Mesh, and the associated areas one needs to consider.
Feel free to drop your comments, feedback and queries on this article; I will try to answer each of them at my earliest convenience.