Data Processing → Stream Processing

Rethink Analytical Platforms by exposing (true) Data APIs

Data Processing (a.k.a. ETL) has existed since the early days of IT, serving downstream consumption, end-of-day reporting, reconciliation, housekeeping and the like. In the last decade, the amount of data being processed, the technology available, and the data producers and consumers have all evolved to the point where the magnitude of the 'needs' is finally matched by what technology can provide. With that said, reengineering old legacy systems should not be looked at merely as a technology change, but also as a rethink of the non-functional flow of data. Let's go over what it means to rethink the ever-lasting components of Data Management (Data Intake, Data Processing and Data Consumption) given the technology landscape we are gifted with.

"In the last decade, the amount of data being processed, technology being used, data producers and the data consumers have evolved so much so, the reengineering of old legacy systems should not just be looked from a technology change, but also the non-functional flow of data."

Need of the hour

By and large, there is a clear demarcation between operational and analytical systems. Operational systems are Tier 0 (RPO: 0 minutes, RTO: < 30 minutes, with hot-standby in some scenarios), end-user-facing systems, whereas analytical systems are reporting/batch processing systems at Tier 1+ (RPO: ~30 minutes, RTO: ~72 hours, with active-passive in some scenarios). As end user requirements evolved to the point where more information and actions need to be delivered faster, operational systems were stepped up to meet the processing needs. To state some use cases: the intelligence in queuing systems, workflow logic, and fraud models executed within transaction processing systems. On the other end, analytical systems are also being held to higher tiering standards, needing to provide input to the operational systems with a reduced turnaround time. The scenarios include micro-batch processing, stream processing etc.

"As the end user requirements evolved to a point where more and more information and actions needs to be done quicker; the operational systems are stepped up to meet the processing needs and the analytical systems are held to a higher tiering standard"

Evolution of real-time processing platforms

Meanwhile, to address async processing and to decouple processors from storage, queuing platforms have evolved within the scope of operational systems. ESB platforms provided not just the message communication framework but also the processes attached to consumers/producers performing minor transformations of data, defining standards around the data payload being communicated, rules for message channeling, and so on. In parallel, with the enterprise adoption of Big Data systems, redundancy in data storage and high availability/scale at low cost are becoming the norm, though these systems aren't generally accepted as able to cater to the needs of operational systems in real time.
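For illustration, a minimal sketch of that ESB-style pattern: a consumer attached to a queue applies a minor transformation to the payload and routes the message onward by a simple channeling rule. The topic names, payload fields and kafka-python client are assumptions for the example, not a prescribed stack.

```python
# Sketch: ESB-style consume-transform-route (assumed topics and payload fields).
import json
from kafka import KafkaConsumer, KafkaProducer  # kafka-python client

consumer = KafkaConsumer(
    "payments.raw",                                   # hypothetical inbound topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    payment = message.value
    # Minor transformation: normalize the payload to an agreed standard.
    enriched = {
        "payment_id": payment["id"],
        "amount_cents": int(round(payment["amount"] * 100)),
        "currency": payment.get("currency", "USD"),
    }
    # Message channeling rule: route high-value payments to a separate topic.
    topic = "payments.high_value" if enriched["amount_cents"] > 1_000_000 else "payments.standard"
    producer.send(topic, enriched)
```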

Data Producers

The source systems (a.k.a. System Of Record (SOR), System Of Origination (SOO)) traditionally create a data dump from the SOR and ship it downstream for processing and distribution. As the need to process and compute information in real time (across various SORs) kept growing, stepping up the compute of the SOR/operational systems wasn't an option. This triggered the SORs to emit changes as they occur in their systems. Even legacy mainframe systems (where some of the core customer/account data resides to this date) are remediated (with a plug-in from the software vendor) to emit a data payload as a change occurs in their system.

"As the need to processes and compute information real-time (from across various SOR) was ever growing; stepping up the compute of the SOR/Operational systems isn't an option."

Data Processing

Mainly with the popularity of Big Data/open source frameworks like Spark Streaming, Storm, Flink etc., the stream processing and decisioning once reserved for products with a high price tag is now available to all. Beyond the technical design of processing data over streams, this enabled the platform to process data as it arrives, roll up data over a window period, and perform complex transformations and process chaining. This not only addressed processing the data as it arrives; the event chaining it enables can handle a whole lot of ESB-type workloads at enormous scale. As a byproduct, it also accelerated the overall processing time compared to accumulated batch processing. This kept the load on the analytical systems low, as they need less compute to process data spread over a period of time.

"Stream processing not only addressed the fact that the data is processed as they came; it enables the event chaining and accelerated the overall processing time compared to accumulated batch processing."

Managed Services (Cloud)

Besides the technology, cost and agility, what cloud brought to the table in this context is 'Managed Services'. People who landed their careers directly on cloud rarely appreciate the plethora of benefits managed services bring. To quote one: even now, organizations running on-premise spend a great amount of effort and resources managing their Disaster Recovery (DR) platforms. While DR is a necessity for any IT platform these days, recovery from a batch processing failure is traditionally designed in 'Kill-and-Fill' mode: no effort is made to restore from the point of processing failure; instead, the entire load is re-run from the last temp file/table. On the face of it, this critically limits stream processing, as certain computations like rolling windows need the data up until the last message; hence COTS products are architected to store/checkpoint state with very high accuracy, which makes such a tool essential for monetary transaction workloads. In the cloud, this is made possible by the ability of the processors to auto-checkpoint and recover from failure, and by the ability of the queuing system to replay the lost events into the processing systems. In a nutshell, what it means to a developer is to largely focus on the functional coding and leave all the other 'rainy-day non-functional' scenarios for the managed services to tackle.

"In nutshell, what managed services in cloud means to a developer is to largely focus on the functional coding and leave all other 'rainy-day non-functional' scenarios to the managed services to tackle."

Data Consumers

None of the capabilities discussed above make sense unless the end consumer can harness the benefit of the data being received, processed and made available in real time. The enablement of the data to the user happens via the Data API. Traditionally, these are the same message protocols which transform and provide data to various consumers. The way it is conceived in the data world is that the incoming stream data is processed, and the hot data is stored in queues, warm data in data stores and cold data in storage/warehouses (Data Federation). A Data API is a pre-defined, finite data payload made available to the user as either pull or push. A PUSH Data API is an event stream with consumers attached to it, listening for messages. A PULL Data API is an endpoint that provides data to the user upon request. The intricacies of engineering the data retrieval (from queue vs. data store vs. storage), security (authorization, fine-grained data security), throttling (ensuring that no one system/person drains the system throughput) and monitoring (health of the system, recovering a lost processor, visibility into the processes) are managed internally, and to the end user it is as simple as an endpoint call. Often organizations knowingly or unknowingly brand all of their traditional ETL extracts as Data APIs. Apart from the fact that these are NOT Data APIs, it sets a wrong precedent across the organization by masking what benefit a true Data API can bring.

"Data API is a pre-defined, finite, data payload, made available to the user either as pull/push. Often organizations knowingly/unknowingly brands all of their traditional ETL extracts as Data APIs. Beside the fact that they are NOT Data APIs; it sets a wrong precedence across the organization by masking what benefit truly a Data APIs can bring in."

In Summary

The move from traditional to newer data platforms is gaining prominence, often forced by regulatory needs. It is worth it for organizations to think about not just reengineering the data platforms, but rethinking the entire non-functional data flow to see what true business benefit it can bring. It isn't easy to get a sponsor's attention for funding a data platform, as many still think of data platforms as cost centers. But when articulated correctly, built incrementally and associated with a definite business outcome, funding and building these platforms will stay beneficial and relevant for years to come.