Making Sense of Unbounded Data & Real-Time Processing Systems
Unbounded data refers to continuous, never-ending data streams with no beginning or end
Unbounded data refers to continuous, never-ending data streams with no beginning or end. They are made available over time. Anyone who wishes to act upon them can do without downloading them first.
As Martin Kleppmann stated in his famous book, unbounded data will never “complete” in any meaningful way.
"In reality, a lot of data is unbounded because it arrives gradually over time: your users produced data yesterday and today, and they will continue to produce more data tomorrow. Unless you go out of business, this process never ends, and so the dataset is never “complete” in any meaningful way."
— Martin Kleppmann, Designing Data-Intensive Applications
Processing unbounded data requires an entirely different approach than its counterpart, batch processing. This article summarises the value of unbounded data and how you can build systems to harness the power of real-time data.
Bounded and unbounded data
Unlike its “bounded” counterpart — batch data, unbounded data has no boundaries defined in terms of time. The data may have been arriving from the past, continuing today, and expected to arrive in the future.
Events, messages, and streams
Streaming data, real-time data, and event streams are some similar terms for unbounded data. A stream of events consists of a sequence of immutable records that carry information about state changes that happened in a system. These records are called "events", rather than "messages" as they merely transmit what just happened.
Conversely, "messages" are a way of transferring the control or telling someone what to do next. I’ve written about this difference sometime back.
A continuous flow of related events makes an event stream. Each event in a stream has a timestamp. Event sources continuously generate streams of events and transmit them over a network. Hardware sensors, servers, mobile devices, applications, web browsers, and microservices are examples of event sources.
Benefits of real-time data and practical examples
Today’s organizations are abandoning batch processing systems and moving towards real-time processing systems. Their main goal is to gain business insights on time and act upon them without delays.
Batch processing is better in terms of providing the most accurate analysis of data. But it takes a long time to deliver and requires careful planning and complex architecture. In a world full of competition, businesses tend to trade accuracy for quick but reliable information. Real-time processing systems starting to shine there.
Following are some practical examples that demonstrate the power of real-time event processing.
- Real-time customer 360 view implementations
- Recommendation engines
- Fraud/anomaly detection
- Predictive maintenance using IoT
- Streaming ETL systems
Making sense of streaming data
Moving data alone doesn’t make sense to any organization. It has to be ingested and processed to gain more value out of it.
The best analogy I can provide is a small stream ( a real stream where water flows). If you consider the stream in isolation, it is not that useful except for supporting the life forms that depend on it. But when multiple streams are combined, they carry enough power. When stored in a reservoir, they can even generate electricity to power an entire village.
Likewise, event streams flowing into an organization have to undergo several stages before they become useful. Usually, an architecture that processes streams employ two layers to facilitate that; the ingestion layer and the processing layer.
Event Ingestion And Processing Is Challenging
The ingestion layer needs to support fast, inexpensive reads and writes of large streams of data. The ordering of incoming data and strong consistency is a concern as well. The processing layer is responsible for consuming data from the ingestion layer, running computations on that data, and then notifying the storage layer to delete data no longer needed. Also, the processing should be done on time, allowing organizations to act on data quickly.
An organization planning to build a real-time data processing system should plan for scalability, data durability, and fault tolerance in both the ingestion and processing layers.
A Reference Architecture For Stream Processing
A real-time processing architecture should have the following logical components to address the challenges mentioned above.
1. The Event Sources
Event sources can be almost any source of data: system or weblog data, social network data, financial trading information, geospatial data, mobile app data, or telemetry from connected IoT devices. They can be structured or unstructured in format.
2. The Ingestion System
The ingestion system captures and stores real-time messages until they are processed. A distributed file system can be a good fit here. But a messaging system would do a better job as it acts as a buffer for the messages and supports reliable delivery of messages to consumers.
The stored messages can be processed by many consumers multiple times. Hence, the ingestion system must support the message replay and non-destructive message consumption. That goes beyond the capabilities of traditional messaging systems and opens the door for a distributed log.
Distributed and append-only logs provide the foundation for ingestion systems architecture. Such a system appends received messages into a distributed log while preserving their order of arrival. This log is partitioned across multiple machines to support scalable message consumption and fault-tolerance.
The following figure illustrates the structure of a Kafka topic.
The architecture and behavior of an ingestion system go beyond the scope of this post. Hence, I’ll save it for a separate article.
Technology Choices: Apache Kafka, AWS Kinesis Streams, Azure Event Hubs, and IoT Hubs.
3. The Stream Processing System
After ingestion, the messages go through one or more stream processors that can route the data, or perform analytics and other processing.
The stream processor can run perpetual queries against an unbounded stream of data. These queries consume streams of data from the ingestion system, analyze them in real-time to detect anomalies, recognize patterns over rolling time windows, or trigger alerts when a specific condition occurs in the stream.
After processing, the stream processor writes the result into event sinks such as storage, databases, or directly to real-time dashboards.
Technology choices: Apache Flink, Apache Storm, Apache Spark Streaming, Apache Samza, Kafka Streams, Siddhi, AWS Kinesis Analytics, and Azure Stream Analytics.
4. The Cold Storage
In some cases, the stream processor writes the ingested message into cold storage for archiving or batch analysis. This data serves as the source of truth for the organization and doesn’t expect frequent reads. Later, batch analytic systems can process them to produce new reports and views required by the organization.
Cold storage systems can be in the form of object storage or more sophisticated Data Lakes.
Technology choices: Amazon S3, Azure Blob Containers, Azure Data Lake Store, and HDFS
5. The Analytical Datastore
Processed real-time data is transformed into a structured format and stored in a relational database to enable queried by analytical tools. This data store is often called the serving layer and feeds multiple applications that require analytics.
The serving layer can be a relational database, a NoSQL database, or distributed over a file system. Data warehouses are a common choice among many organizations to be used as such a store.
However, the serving layer requires strong support for random reads with low latency. Sometimes, the batch processing systems use this as their final destination. Hence, the serving layer should also support the random writes as the loading of processed data can introduce delays.
Technology choices: Amazon Redshift, Azure Synapse Analytics, Apache Hive, Apache Spark, Apache HBase
6. Monitoring and Notification
The stream processor can trigger alerts on different systems when it detects anomalies and some conditions. Publishing an event into a pub/sub topic, triggering an event-driven workflow, or invoking a serverless function are few examples.
The ultimate goal is to notify the relevant party that has the authority to act upon the alert.
Technology choices: Pub/sub messaging systems and AWS Lambda
7. Reporting and Visualization
Reporting applications and dashboards use processed data in the analytical store for historical reporting and visualizations. Additionally, the stream processor can directly update real-time reports and dashboards if a low-latency answer is needed.
Technology choices: Microsoft Power BI, Amazon Quicksight, Kibana, Reactive web, and mobile applications
8. Machine Learning
Real-time data can be passed through machine learning systems to train and score ML models in real-time.
Real-time processing can be beneficial to organizations in terms of getting actionable insights on time. A real-time processing architecture should cater to high volumes of streaming data, scale-out processing, and fault-tolerance.
But real-time processing can fall short when it comes to accuracy. Often, real-time systems trade accuracy for low latency. For a better level of accuracy, both real-time and batch processing need to be combined.