Kafka
Distributed, durable, and reliable message broker that can handle a high volume of real-time messages from real-time producers.
Storage for real-time streaming data
Kafka has evolved considerably in recent years with the addition of Kafka Streams, which provides stream computation capabilities.
Kafka Connect offers plug-and-play connections to many real-time sources.
From an architecture standpoint, a Kafka cluster is made up of broker nodes and uses ZooKeeper for coordination tasks.
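One way to picture Kafka's role as storage: a topic behaves like an append-only log, and each consumer group keeps its own read position (offset), so the same messages can be read independently by several consumers. The sketch below is a toy in-memory model, not the Kafka client API. `ToyTopic`, `produce`, and `poll` are hypothetical names chosen for illustration, and the model ignores partitions, replication, and on-disk persistence.

```python
from collections import defaultdict


class ToyTopic:
    """In-memory stand-in for a single-partition topic (hypothetical, not the Kafka API)."""

    def __init__(self):
        self._log = []                    # append-only list of messages
        self._offsets = defaultdict(int)  # next offset to read, per consumer group

    def produce(self, message):
        """Append a message and return the offset it was stored at."""
        self._log.append(message)
        return len(self._log) - 1

    def poll(self, group, max_records=10):
        """Return the next unread messages for a group and advance its offset."""
        start = self._offsets[group]
        batch = self._log[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch


topic = ToyTopic()
for reading in ("t=20.1", "t=20.4", "t=21.0"):
    topic.produce(reading)

# Two consumer groups read the same stored messages independently.
first_batch = topic.poll("dashboard", max_records=2)
alerts_batch = topic.poll("alerts")
rest = topic.poll("dashboard")
```

Because messages stay in the log after being read, the "alerts" group still sees all three readings even though "dashboard" has already consumed two of them.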
Storm
Scalable, fault-tolerant, real-time analytics system.
Computation on real-time streaming data
In Storm, a spout is a source of real-time streams and a bolt performs some computation on a stream. A set of spouts and bolts is connected together to form a Storm topology, which is capable of performing complex real-time computation.
From an architecture standpoint, a Storm cluster is made up of a Nimbus (master) node and supervisor nodes, and uses ZooKeeper for coordination tasks.
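The spout and bolt roles can be sketched with plain Python generators. This is only an illustration of the dataflow idea, assuming a single process and a finite input. It does not use the Storm API, and the names `sentence_spout`, `split_bolt`, `count_bolt`, and `run_topology` are made up for this example.

```python
from collections import Counter


def sentence_spout():
    """Spout: the source of the stream (here, a fixed list of sentences)."""
    for sentence in ("storm reads streams", "kafka stores streams"):
        yield sentence


def split_bolt(stream):
    """Bolt: split each incoming sentence into words."""
    for sentence in stream:
        for word in sentence.split():
            yield word


def count_bolt(stream):
    """Bolt: keep a running count per word and emit (word, count) updates."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


def run_topology():
    """Wire spout -> split bolt -> count bolt and collect the latest counts."""
    final = {}
    for word, count in count_bolt(split_bolt(sentence_spout())):
        final[word] = count
    return final


word_counts = run_topology()
```

In a real cluster the spout and each bolt would run as parallel tasks spread across supervisor nodes. Here they are chained in one process purely to show how data moves from one stage to the next.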
Using Kafka and Storm together
The high-level architecture below is very common in real-world, real-time stream processing applications.
Real-time stream producer => Kafka => Storm => NoSQL or Files
A real-time stream producer emits streaming records, which are fed to Kafka. There the messages are stored, and they can even be enriched with a few computations or joined with other streams.
Storm then picks up the messages from Kafka for more custom and elaborate computations by passing the data through Storm topologies.
Processed data can be sent to a NoSQL database or persisted in files.
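The flow above can be mimicked end to end in a few lines of Python. This is a rough sketch under loud assumptions: a `deque` stands in for the Kafka topic, a plain function stands in for the Storm topology, an in-memory `StringIO` stands in for the output file, and the sensor records and the 30-degree threshold are invented sample values. None of the names come from Kafka or Storm.

```python
import io
import json
from collections import deque


def producer():
    """Real-time stream producer: emits raw sensor records (made-up sample data)."""
    yield {"sensor": "a", "celsius": 20.0}
    yield {"sensor": "b", "celsius": 35.5}
    yield {"sensor": "a", "celsius": 41.2}


def topology(record, threshold=30.0):
    """Stand-in for a Storm topology: convert units, then flag hot readings."""
    fahrenheit = record["celsius"] * 9 / 5 + 32
    return {
        **record,
        "fahrenheit": round(fahrenheit, 1),
        "hot": record["celsius"] > threshold,
    }


def run_pipeline(sink):
    """Move records through producer => buffer => topology => sink."""
    buffer = deque()                   # stand-in for a Kafka topic
    for record in producer():          # producer => Kafka
        buffer.append(record)
    processed = 0
    while buffer:                      # Kafka => Storm
        result = topology(buffer.popleft())
        sink.write(json.dumps(result) + "\n")  # Storm => file (one JSON object per line)
        processed += 1
    return processed


sink = io.StringIO()                   # in-memory file; a real file object works the same way
count = run_pipeline(sink)
rows = [json.loads(line) for line in sink.getvalue().splitlines()]
```

The buffer between the producer and the topology is what decouples the two: the producer can keep writing even when the computation stage is slower, which is the role Kafka plays in this architecture.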