- What is the retention policy for Kafka records in a Kafka cluster?
- What are the core APIs provided in Kafka platform?
- Compare: RabbitMQ vs Apache Kafka
- What is the role of the offset in Kafka as a data integration tool?
- What is the difference between Apache Kafka and Apache Storm?
- What do you know about a partition key?
- Explain the role of Streams API?
- How do you balance load in Kafka when one server fails?
- Within the producer, when will a “queue fullness” situation come into play?
- Explain the term “Log Anatomy”.
- What is multi-tenancy?
- What do you mean by Stream Processing in Kafka?
- If the replica stays out of the ISR for a very long time, then what does it tell us?
- Do you know how to improve the throughput of a remote consumer?
- When do you call the cleanup method?
- Why is replication considered critical in Kafka?
- State Disadvantages of Apache Kafka.
- How to balance loads in Kafka when one server fails?
- How to start a Kafka server?
- What ensures load balancing of the server in Kafka?
- What roles do Replicas and the ISR play?
- What is the way to send large messages with Kafka?
- How is Kafka used for stream processing?
- What benefits does Kafka provide that other messaging services such as JMS and RabbitMQ do not?
- Where is the meta information about Topics stored in a Kafka Cluster?
- Describe scalability in the context of Apache Kafka.
- What is the main difference between Kafka and Flume?
- Would it be possible to use Kafka without the zookeeper?
- Is message duplication necessary or unnecessary in Apache Kafka?
- What are Kafka Topics?
- Describe high-throughput in the context of Apache Kafka.
- Explain the functionality of the Connector API in Kafka?
- What are the real-world use cases of Kafka that make it different from other messaging frameworks?
- What are the main features of Kafka that make it suitable for data integration and real-time processing?
- Explain what geo-replication is within Apache Kafka.
- Explain the term “Topic Replication Factor”.
- What are the three main system tools within Apache Kafka?
- What is the maximum message size that can be handled and received by Apache Kafka?
- What does it indicate if replica stays out of ISR for a long time?
- What is multi-tenancy?
- Within the producer, can you explain when QueueFullException occurs?
- What are the key components of Kafka?
- When does the queue full exception emerge inside the producer?
- In the Producer, when does QueueFullException occur?
- When not to use Apache Kafka?
- What is the role of the ZooKeeper in Kafka?
- Explain the role of the offset.
- Describe durability in the context of Apache Kafka.
- Describe low latency in the context of Apache Kafka.
- Explain the role of the Kafka Producer API.
- Is Apache Kafka a distributed streaming platform? If yes, what can you do with it?
- Is replication critical or simply a waste of time in Kafka?
- Which components are used for stream flow of data?
- How are Kafka Topic partitions distributed in a Kafka cluster?
- Describe fault-tolerance in the context of Apache Kafka.
- Elaborate the architecture of Kafka.
- What are the key benefits of using Storm for real-time processing?
- How is Kafka used as a storage system?
- What is a Broker, and how does Kafka utilize brokers for communication?
- What Is ZeroMQ?
- How do you send messages to a Kafka topic using Kafka command line client?
- How are the messages consumed by a consumer in Kafka?
- Explain how you can reduce churn in ISR? When does broker leave the ISR?
- What happens if the preferred replica is not in the ISR?
- How would you describe the Kafka architecture?
- What’s a consumer group in Kafka?
- What is the replica? What does it do?
- Explain the concept of Leader and Follower.
- How can you get exactly-once messaging from Kafka during data production?
- Why is Kafka preferred over traditional message transfer techniques?
- You have tested that a Kafka cluster with five nodes is able to handle ten million messages per minute. Your input is likely to increase to twenty five million messages per minute. How many more nodes should be added to the cluster?
- Which of the following is guaranteed by Kafka?
- When messages pass from producer to broker to consumer, the data modification is minimized by using:
- Which is the configuration file for setting up ZooKeeper properties in Kafka?
- Which of the following best describes the relationship between ZooKeeper and partial failures?
- The znodes that continue to exist even after the creator of the znode dies are called:
- Why is replication necessary in Kafka? Because it ensures that...
- A Kafka topic is set up with a replication factor of 5. Out of these, 2 nodes in the cluster have failed. Business users are concerned that they may lose messages. What do you tell them?
- How many brokers will be marked as leaders for a partition?
- Which server should be started before starting Kafka server?
- Kafka maintains feeds of messages in categories called
A Kafka cluster retains all data records for a configurable retention period. The data records are retained even if they have already been consumed. For example, if the retention period is set to one week, the data records are stored for one week after their creation before they are deleted, so consumers can access this data for one week after its creation.
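The retention rule above can be sketched in a few lines of Python. This is a simplified in-memory model, not Kafka code: the `Record` class and the `apply_retention` function are hypothetical names, and a real broker deletes whole log segments rather than individual records.

```python
from dataclasses import dataclass

ONE_WEEK_MS = 7 * 24 * 60 * 60 * 1000


@dataclass
class Record:
    offset: int
    timestamp_ms: int  # creation time of the record
    value: str


def apply_retention(log, now_ms, retention_ms):
    """Keep records younger than the retention period, consumed or not."""
    return [r for r in log if now_ms - r.timestamp_ms < retention_ms]


# One record created at time 0 and one created a week later.
log = [Record(0, 0, "old"), Record(1, ONE_WEEK_MS, "recent")]

# Just over a week after time 0, only the newer record survives.
kept = apply_retention(log, now_ms=ONE_WEEK_MS + 1, retention_ms=ONE_WEEK_MS)
```

Note that nothing in the rule depends on whether a consumer has read the record; only its age matters.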
Kafka provides the following core APIs:
- Producer API - An application uses the Kafka producer API to publish a stream of records to one or more Kafka topics.
- Consumer API - An application uses the Kafka consumer API to subscribe to one or more Kafka topics and consume streams of records.
- Streams API - An application uses the Kafka Streams API to consume input streams from one or more Kafka topics, process and transform the input data, and produce output streams to one or more Kafka topics.
- Connect API - An application uses the Kafka connect API to create producers and consumers that connect Kafka topics to existing applications or data systems.
One of Apache Kafka’s alternatives is RabbitMQ. So, let’s compare both:
i. Features:
Apache Kafka– Kafka is distributed, durable and highly available; the data is shared as well as replicated.
RabbitMQ– These are not core design features of RabbitMQ in the same way, although it does support clustering and replicated queues.
ii. Performance rate:
Apache Kafka– On the order of 100,000 messages/second.
RabbitMQ– Around 20,000 messages/second. These are commonly quoted ballpark figures; actual throughput depends on hardware, message size, and configuration.
Messages are stored in partitions, and each message is assigned a unique ID for fast and easy access. That unique number is called the offset, and it identifies each message within the partition.
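As a rough illustration of how offsets work, here is a minimal in-memory sketch of a single partition. The `PartitionLog` class is hypothetical and only mimics the append-only behaviour; it is not part of any Kafka client library.

```python
class PartitionLog:
    """Toy append-only log for one partition; the list index is the offset."""

    def __init__(self):
        self._records = []

    def append(self, value):
        # The next offset is simply the current length of the log.
        offset = len(self._records)
        self._records.append(value)
        return offset

    def read(self, from_offset, max_records=10):
        # Serve records in order, starting at the requested offset.
        end = min(len(self._records), from_offset + max_records)
        return [(o, self._records[o]) for o in range(from_offset, end)]


partition = PartitionLog()
first = partition.append("order-created")
second = partition.append("order-paid")
from_second = partition.read(from_offset=second)
```

Because offsets only ever grow, a reader that remembers the last offset it saw can resume from exactly that position.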
- Apache Kafka: It is a distributed and robust messaging system that can handle huge amount of data and allows passage of messages from one end-point to another.
- Apache Storm: It is a real time message processing system, and you can edit or manipulate data in real time. Apache storm pulls the data from Kafka and applies some required manipulation.
A partition key is used to indicate the destination partition of a message in the Kafka producer. Usually, a hash-based partitioner determines the partition ID from the key, and people also use custom partitioners.
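A hash-based partitioner can be sketched as follows. This is an illustration only: the function name `choose_partition` is hypothetical, and CRC32 from the Python standard library stands in for whatever hash function the real client uses.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition ID with a deterministic hash (CRC32 here)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return zlib.crc32(key) % num_partitions


# The same key always maps to the same partition, which preserves
# per-key ordering as long as the partition count does not change.
p1 = choose_partition(b"user-42", 6)
p2 = choose_partition(b"user-42", 6)
```

A custom partitioner would replace the body of this function with application-specific logic, for example routing by customer region.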
The Streams API permits an application to act as a stream processor: it consumes an input stream from one or more topics, transforms it, and produces an output stream to one or more output topics.
Every partition in Kafka has one main server that plays the role of the leader and one or more other servers that act as followers. The leader handles requests for the partition and the rest of the servers simply follow it. If the leader fails, one of the followers takes over the leader’s responsibility.
Queue fullness occurs when the producer sends messages faster than the brokers can handle them; it usually indicates that there are not enough brokers to share the load.
We view the log as the partitions. Basically, a data source writes messages to the log, and at any time one or more consumers can read from the log at the position they select.
Kafka can be deployed easily as a multi-tenant solution. Multi-tenancy is enabled by configuring which topics can produce or consume data. Kafka also provides operational support for quotas.
Stream processing in Kafka means processing data continuously, in real time, concurrently, and record by record.
If a replica stays out of the ISR for a very long time, or is not in sync with the leader, it means that the follower is not able to fetch data as fast as the leader receives it. In other words, the follower cannot keep up with the leader.
This is an interesting and advanced concept in Kafka. If the consumer is located in a distant location, you need to tune the socket buffer size to improve the overall throughput of the remote consumer.
The cleanup method is called when a Bolt is being shut down and should clean up any resources that were opened. There’s no guarantee that this method will be called on the cluster: for instance, if the machine the task is running on blows up, there’s no way to invoke the method. The cleanup method is intended for when you run topologies in local mode (where a Storm cluster is simulated in process), and you want to be able to run and kill many topologies without suffering any resource leaks.
Replication ensures that published messages are not lost and can still be consumed in the event of a machine error, a program fault, or frequent software upgrades.
Limitations of Kafka are:
- No Complete Set of Monitoring Tools.
- Issues with Message Tweaking.
- No support for wildcard topic selection.
- Lack of Pace.
Every partition in Kafka has one main server that plays the role of the leader and one or more other servers that act as followers. The leader handles requests for the partition and the rest of the servers simply follow it. If the leader fails, one of the followers takes over the leader’s responsibility.
Since Kafka uses ZooKeeper, we have to start the ZooKeeper server first. One can use the convenience script packaged with Kafka to get a crude but effective single-node ZooKeeper instance:
> bin/zookeeper-server-start.sh config/zookeeper.properties
Now the Kafka server can start:
> bin/kafka-server-start.sh config/server.properties
The Leader handles all read and write requests for the partition, whereas Followers passively replicate the leader. Hence, when the Leader fails, one of the Followers takes over the role of the Leader. This entire process ensures load balancing of the servers.
Replicas are the list of nodes that replicate the log for a particular partition, irrespective of whether they play the role of the Leader. ISR stands for In-Sync Replicas: the set of message replicas that are synced to the leader.
In order to send large messages using Kafka, you must adjust a few properties. With these changes, messages up to the new limit can be sent without size-related exceptions. The properties to adjust are:
- At the Consumer end – fetch.message.max.bytes
- At the Broker end, for replication – replica.fetch.max.bytes
- At the Broker end, to accept a message – message.max.bytes
- At the Broker end, for every topic – max.message.bytes
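These four settings have to agree with each other: if the broker accepts a message that replicas or consumers cannot fetch, the message gets stuck. The sketch below is a hypothetical sanity check (the function `check_large_message_settings` is not part of Kafka) that encodes this relationship, assuming a per-topic limit overrides the broker-wide one.

```python
def check_large_message_settings(message_max_bytes,
                                 replica_fetch_max_bytes,
                                 fetch_message_max_bytes,
                                 topic_max_message_bytes=None):
    """Return a list of problems found in the size-related settings."""
    # Assumption: a per-topic max.message.bytes, when set, overrides
    # the broker-wide message.max.bytes for that topic.
    if topic_max_message_bytes is not None:
        limit = topic_max_message_bytes
    else:
        limit = message_max_bytes

    problems = []
    if replica_fetch_max_bytes < limit:
        problems.append("replica.fetch.max.bytes is below the message size limit")
    if fetch_message_max_bytes < limit:
        problems.append("fetch.message.max.bytes is below the message size limit")
    return problems


TEN_MB = 10 * 1024 * 1024
ONE_MB = 1024 * 1024

consistent = check_large_message_settings(TEN_MB, TEN_MB, TEN_MB)
inconsistent = check_large_message_settings(TEN_MB, ONE_MB, ONE_MB)
```

Running a check like this before rolling out a configuration change is cheaper than discovering the mismatch through stalled replication.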
Kafka can be used to consume continuous streams of live data from input Kafka topics, perform processing on this live data, and then output the continuous stream of processed data to output Kafka topics. For performing complex transformations on the live data, Kafka provides a fully integrated Streams API.
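The consume, process, and produce loop described above can be sketched without any Kafka dependency. In this toy version, plain Python lists stand in for the input and output topics, and `process_stream` is a hypothetical helper rather than a Streams API call.

```python
def process_stream(input_records, transform):
    """Apply a transformation record by record; None means 'drop the record'."""
    for record in input_records:
        result = transform(record)
        if result is not None:
            yield result


def clicks_only(record):
    # Keep only click events and emit the page name in upper case.
    kind, page = record.split(":")
    return page.upper() if kind == "click" else None


input_topic = ["click:home", "view:cart", "click:checkout"]
output_topic = list(process_stream(input_topic, clicks_only))
```

The generator processes one record at a time, which mirrors the record-by-record style of stream processing rather than batch processing.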
Nowadays Kafka is a key messaging framework, not only because of its features but also because of its reliable transmission of messages from sender to receiver. The key points to consider are:
- Reliability − Kafka provides reliable delivery from publisher to subscriber with zero message loss.
- Scalability − Kafka achieves this by using clustering along with the ZooKeeper coordination server.
- Durability − By using a distributed log, the messages are persisted on disk.
- Performance − Kafka provides high throughput and low latency across the publish and subscribe application.
Considering the above features, Kafka is one of the best options in Big Data technologies for handling a large volume of messages with smooth delivery.
ZooKeeper stores the information about Topics: the number of partitions in a Topic, which node is the master of which partition, which nodes hold the replicas of each partition, etc.
Apache Kafka can be scaled out by adding nodes without causing any downtime.
Even though both are used for real-time processing, Kafka is scalable and ensures message durability.
With ZooKeeper-based versions of Kafka, no: it is not possible to bypass ZooKeeper and connect directly to the Kafka server, and if ZooKeeper is down for some reason, client requests cannot be served. Newer Kafka releases can also run in KRaft mode, which does not require ZooKeeper.
Duplicating or replicating messages in Apache Kafka is a good practice. It ensures that published messages are not lost even if the main server suffers a failure.
Kafka Topics are categories or feeds to which data streams or data records are published to. Kafka producers publish data records to the Kafka topics and Kafka consumers consume the data records from the Kafka topics.
There is no need for substantially large hardware in Apache Kafka. This is because Apache Kafka is capable of taking on very high-velocity and very high-volume data. It can also take care of message throughput of thousands of messages per second. In summary, Apache Kafka is very fast and efficient.
The Connector API allows building and running reusable producers and consumers that connect Kafka topics to existing applications or data systems, so that changes happening in those systems can be tracked.
There is a plethora of use cases where Kafka fits into real-world applications. Listed below are the ones seen most frequently.
- Metrics: Used for monitoring operational data, gathering data from distributed systems for analysis or statistical operations.
- Log Aggregation solution: Can be used across an organization to collect logs from multiple services, which are then consumed by consumer services for analysis.
- Stream Processing: Kafka’s strong durability is also very useful in the context of stream processing.
- Asynchronous communication: In microservices, keeping a huge system synchronous is not desirable, because it can render the entire application unresponsive and defeat the purpose of dividing it into microservices in the first place. Kafka makes the data flow easier because it is distributed, highly fault-tolerant, and has constant monitoring of broker nodes through services like ZooKeeper.
- Chat bots: Chat bots are one of the popular use cases where a reliable messaging service is required for smooth delivery.
- Multi-tenant solution: Multi-tenancy is enabled by configuring which topics can produce or consume data. There is also operational support for quotas.
The above are the use cases that predominantly call for Kafka; apart from these, there are other cases that depend on the requirements and design.
Some of the most notable features of Kafka that make it popular worldwide include data partitioning, scalability, low latency, and high throughput. These features are the reason Kafka has become the most suitable choice for data integration and real-time processing.
For the Apache Kafka cluster, Apache Kafka MirrorMaker allows for geo-replication. Through this, messages are duplicated across various data centers or cloud regions. Geo-replication can be used in active or passive scenarios for the purpose of backup and recovery. It is also used to get data closer to users and support data locality needs.
The topic replication factor is the number of copies of each partition kept across brokers, and it is very important to factor it in while designing a Kafka system. If a broker goes down, the replicas of its topics on other brokers can take over.
The three main system tools in Apache Kafka include Apache Kafka Migration Tool, Consumer Offset Checker, and Mirror Maker. Apache Kafka Migration Tool is used to move a broker from a specific version to another version. Consumer Offset Checker is used to show topics, partitions, and owners within a specific set of topics or consumer group. Mirror maker is used to mirror an Apache Kafka cluster to another Apache Kafka cluster.
The maximum message size that Apache Kafka can receive and process is approximately one million bytes, or one megabyte.
If a replica remains out of ISR for an extended time, it indicates that the follower is unable to fetch data as fast as data accumulated at the leader.
Apache Kafka can definitely be used as a multi-tenant product. Through configuring what topics can create or consume data, multi-tenancy is enabled and provides operational support for meeting quotas.
If the producer sends messages faster than the broker can handle them, we experience QueueFullException. The producer has no built-in limit, so it does not know when to stop sending. To overcome this problem, add more brokers so that the message flow can be handled and the exception does not recur.
Kafka consists of the following key components:
- Kafka Cluster - Kafka cluster contains one or more Kafka brokers (servers) and balances the load across these brokers.
- Kafka Broker - Kafka broker contains one or more Kafka topics. Kafka brokers are stateless and can handle TBs of messages and thousands of reads and writes without impacting performance.
- Kafka Topics - Kafka topics are categories or feeds to which streams of messages are published to. Every topic has an associated log on disk where the message streams are stored.
- Kafka Partitions - A Kafka topic can be split into multiple partitions. Kafka partitions enable the scaling of topics to multiple servers. Kafka partitions also enable parallel consumption of messages from a topic.
- Kafka Offsets - Messages in Kafka partitions are assigned a sequential ID number called the offset. The offset identifies each record’s location within the partition. Messages can be retrieved from a partition based on their offsets.
- Kafka Producers - Kafka producers are client applications or programs that post messages to a Kafka topic.
- Kafka Consumers - Kafka consumers are client applications or programs that read messages from a Kafka topic.
QueueFullException naturally happens when the producer tries to send messages at a speed the Broker can’t handle. Users need to add sufficient brokers to collectively handle the increased load, since the Producer doesn’t block.
QueueFullException typically occurs whenever the Kafka Producer attempts to send messages at a pace that the Broker cannot handle at that time. To collaboratively handle the increased load, users will need to add enough brokers, since the Producer doesn’t block.
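The failure mode behind QueueFullException can be illustrated with a bounded, non-blocking queue from the Python standard library. This is only an analogy, assuming a producer-side buffer of fixed size; it does not use the Kafka producer itself.

```python
import queue

# A bounded buffer that nothing is draining, standing in for a
# producer queue whose broker cannot keep up.
buffer = queue.Queue(maxsize=3)
rejected = []

for i in range(5):
    message = f"message-{i}"
    try:
        # put_nowait never blocks; it raises queue.Full instead,
        # much like a producer that does not block when the queue is full.
        buffer.put_nowait(message)
    except queue.Full:
        rejected.append(message)
```

Draining the buffer faster (more brokers, in Kafka terms) or slowing the sender down are the two ways to stop the rejections.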
- Kafka doesn't number the messages. It has a notion of “offset” inside the log which identifies the messages.
- Consumers consume the data from topics but Kafka does not keep track of the message consumption. Kafka does not know which consumer consumed which message from the topic. The consumer or consumer group has to keep a track of the consumption.
- There are no random reads from Kafka. Consumer has to mention the offset for the topic and Kafka starts serving the messages in order from the given offset.
- Kafka does not offer the ability to delete. The message stays via logs in Kafka till it expires (until the retention time defined).
Apache Kafka is a distributed system built to use ZooKeeper. ZooKeeper’s main role here is to coordinate between the different nodes in a cluster. ZooKeeper is also used to recover from the previously committed offset if any node fails, because offsets are committed to it periodically.
Each message in a partition is given a sequential ID number, which we call an offset. We use these offsets to identify each message in the partition uniquely.
Messages are durable because Apache Kafka persists them to disk and replicates them.
Apache Kafka is able to take on all these messages with very low latency, usually in the range of milliseconds.
The role of Kafka’s Producer API is to wrap the two producers – kafka.producer.SyncProducer and the kafka.producer.async.AsyncProducer. The goal is to expose all the producer functionality through a single API to the client.
Yes, Apache Kafka is a distributed streaming platform. A streaming platform has three vital capabilities:
- It helps you publish records easily.
- It helps you store a large number of records without storage problems.
- It helps you process the records as they come in.
Replicating messages is a good practice in Kafka; it ensures that messages are not lost even if the main server fails.
- Bolt:- Bolts represent the processing logic unit in Storm. One can utilize bolts to do any kind of processing such as filtering, aggregating, joining, interacting with data stores, talking to external systems etc. Bolts can also emit tuples (data messages) for the subsequent bolts to process. Additionally, bolts are responsible to acknowledge the processing of tuples after they are done processing.
- Spout:- Spouts represent the source of data in Storm. You can write spouts to read data from data sources such as database, distributed file systems, messaging frameworks etc. Spouts can broadly be classified into following –
  - Reliable:- These spouts have the capability to replay the tuples (a unit of data in data stream). This helps applications achieve ‘at least once message processing’ semantic as in case of failures, tuples can be replayed and processed again. Spouts for fetching the data from messaging frameworks are generally reliable as these frameworks provide the mechanism to replay the messages.
  - Unreliable:- These spouts don’t have the capability to replay the tuples. Once a tuple is emitted, it cannot be replayed irrespective of whether it was processed successfully or not. This type of spout follows ‘at most once message processing’ semantic.
- Tuple:- The tuple is the main data structure in Storm. A tuple is a named list of values, where each value can be any type. Tuples are dynamically typed — the types of the fields do not need to be declared. Tuples have helper methods like getInteger and getString to get field values without having to cast the result. Storm needs to know how to serialize all the values in a tuple. By default, Storm knows how to serialize the primitive types, strings, and byte arrays. If you want to use another type, you’ll need to implement and register a serializer for that type.
Partitions of the Kafka Topic logs are distributed over multiple servers in the Kafka cluster. Each partition is replicated across a configurable number of servers for fault tolerance.
Every partition has one server that acts as the 'leader' and zero or more servers that act as 'followers'. The leader handles the reads and writes to a partition, and the followers passively replicate the data from the leader.
If the leader fails, then one of the followers automatically takes over the role of 'leader'.
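Partition placement and leader failover can be sketched together. The code below is a simplified model with hypothetical function names: replicas are placed round-robin, the first live replica in the list becomes leader, and every live replica is assumed to be in sync. Real Kafka placement and election are more involved.

```python
def assign_replicas(num_partitions, brokers, replication_factor):
    """Place each partition's replicas on consecutive brokers, round-robin."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    return {
        p: [brokers[(p + i) % len(brokers)] for i in range(replication_factor)]
        for p in range(num_partitions)
    }


def elect_leader(replicas, live_brokers):
    """Pick the first replica that is still alive; None if all are down."""
    for broker in replicas:
        if broker in live_brokers:
            return broker
    return None


assignment = assign_replicas(num_partitions=3, brokers=[1, 2, 3], replication_factor=2)

# All brokers up: the first replica of each partition leads.
leaders_before = {p: elect_leader(r, {1, 2, 3}) for p, r in assignment.items()}

# Broker 1 fails: its partition fails over to the surviving follower.
leaders_after = {p: elect_leader(r, {2, 3}) for p, r in assignment.items()}
```

With a replication factor of 2, every partition survives the loss of any single broker in this model.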
Probably one of the biggest benefits of Apache Kafka that make the platform so attractive to tech companies is its ability to keep data safe in the event of a total system failure, major update, or component malfunction. This is known as fault-tolerance. Apache Kafka is fault-tolerant because it replicates every message within the system to store in case of malfunction.
In Kafka, a cluster contains multiple brokers since it is a distributed system. Each topic in the system is divided into multiple partitions, and each broker stores one or more of those partitions so that multiple producers and consumers can publish and retrieve messages at the same time.
- Easy to operate: Operating Storm is quite easy
- Real fast: It can process 100 messages per second per node
- Fault Tolerant: It detects the fault automatically and re-starts the functional attributes
- Reliable: It guarantees that each unit of data will be executed at least once or exactly once
- Scalable: It runs across a cluster of machines.
Kafka has the following data storage capabilities, which make it a good distributed data storage system:
- Replication - Data written to Kafka topics are by design partitioned and replicated across servers for fault-tolerance.
- Guaranteed - Kafka sends acknowledgment to Kafka producers after data is fully replicated across all the servers, hence guaranteeing that the data is persisted to the servers.
- Scalability - The way Kafka uses disk structures enables them to scale well. Kafka performs the same irrespective of the size of the persistent data on the server.
- Flexible reads - Kafka enables different consumers to read from different positions on the Kafka topics, hence making Kafka a high-performance, low-latency distributed file system.
- Brokers are the systems responsible for maintaining the published data.
- Each broker may have one or more partitions.
- Kafka contains multiple brokers to maintain load balance.
- Kafka brokers are stateless.
- E.g.: Let’s say there are N partitions in a topic and N brokers; then each broker has 1 partition.
ZeroMQ is “a library which extends the standard socket interfaces with features traditionally provided by specialized messaging middleware products”. Storm relies on ZeroMQ primarily for task-to-task communication in running Storm topologies.
Kafka comes with a command line client and a producer script, kafka-console-producer.sh, that can be used to take messages from standard input on the console and post them as messages to a Kafka topic.
Kafka transfers messages using the sendfile API. It moves bytes from disk to the socket within kernel space, saving the extra copies and the calls from kernel space to user space and back to the kernel.
ISR is a set of message replicas that are completely synced up with the leader; in other words, the ISR has all messages that are committed. The ISR should always include all replicas until there is a real failure. A replica will be dropped out of the ISR if it falls behind the leader.
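One way to picture ISR membership is as a time-based check on how recently each follower caught up with the leader. The sketch below assumes such a rule, similar in spirit to the broker setting `replica.lag.time.max.ms`; the function name and the timestamps are hypothetical.

```python
def in_sync_replicas(last_caught_up_ms, now_ms, max_lag_ms):
    """Return the replicas that caught up with the leader recently enough."""
    return {
        replica
        for replica, caught_up_at in last_caught_up_ms.items()
        if now_ms - caught_up_at <= max_lag_ms
    }


# Time (in ms) at which each follower last caught up with the leader.
last_caught_up = {"broker-1": 10_000, "broker-2": 9_500, "broker-3": 2_000}

# broker-3 has been behind for 8 seconds, so it drops out of the ISR.
isr = in_sync_replicas(last_caught_up, now_ms=10_000, max_lag_ms=3_000)
```

Under this model, a threshold that is too tight makes slow followers bounce in and out of the ISR, which is the churn the question refers to.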
If the preferred replica is not in the ISR, the controller will fail to move leadership to the preferred replica.
Kafka is based on a distributed design where one cluster has multiple brokers/servers associated with it. A topic is divided into many partitions to store the messages, and there is one consumer group to fetch the messages from the brokers.
A consumer group is made up of one or more consumers that together subscribe to the various topics and fetch data from the brokers.
Replicas are the nodes that replicate the log for a particular partition, regardless of whether they actually play the role of leader.
Every partition in Kafka has one server which plays the role of a Leader, and none or more servers that act as Followers. The Leader performs the task of all read and write requests for the partition, while the role of the Followers is to passively replicate the leader. In the event of the Leader failing, one of the Followers will take on the role of the Leader. This ensures load balancing of the server.
To get exactly-once messaging from Kafka, you have to avoid duplicates both during data consumption and during data production. Here are two ways to get exactly-once semantics during data production:
- Use a single writer per partition, and every time you get a network error, check the last message in that partition to see if your last write succeeded.
- Include a primary key (a UUID or something similar) in the message and de-duplicate on the consumer.
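The second approach, a primary key in each message plus de-duplication on the consumer, can be sketched as follows. The `DedupingConsumer` class and the message layout are hypothetical, and the set of seen IDs is kept in memory here, whereas a real consumer would need durable storage for it.

```python
import uuid


class DedupingConsumer:
    """Processes each message ID at most once, ignoring redelivered copies."""

    def __init__(self):
        self.seen_ids = set()
        self.processed = []

    def handle(self, message):
        # A retried send carries the same ID, so it is skipped here.
        if message["id"] in self.seen_ids:
            return False
        self.seen_ids.add(message["id"])
        self.processed.append(message["payload"])
        return True


consumer = DedupingConsumer()
message = {"id": str(uuid.uuid4()), "payload": "charge card"}

first_delivery = consumer.handle(message)
duplicate_delivery = consumer.handle(message)  # e.g. producer retried after a timeout
```

The producer can then retry freely on network errors, because duplicates are harmless once the consumer filters them.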
Kafka is more scalable, faster, more robust, and distributed by design.
- A: 15
- B: 13
- C: 8
- D: 5
Answer: C Explanation: Since Kafka is horizontally scalable, handling 25 million messages per minute will need 13 machines or 8 more machines.
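The arithmetic behind this answer can be checked with a short calculation. It assumes throughput scales linearly with the number of nodes, which is the premise of the question; `nodes_needed` is a hypothetical helper name.

```python
import math


def nodes_needed(current_nodes, current_rate, target_rate):
    """Return (total nodes required, extra nodes to add) under linear scaling."""
    per_node_rate = current_rate / current_nodes
    total = math.ceil(target_rate / per_node_rate)
    return total, total - current_nodes


# 5 nodes handle 10 million messages per minute, so each handles 2 million.
# 25 million / 2 million = 12.5, rounded up to 13 nodes, i.e. 8 more.
total, extra = nodes_needed(5, 10_000_000, 25_000_000)
```

Rounding up matters here: 12 nodes would cover only 24 million messages per minute.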
- A: A consumer instance gets the messages in the same order as they are produced.
- B: A consumer instance is guaranteed to get all the messages produced.
- C: No two consumer instances will get the same message
- D: All consumer instances will get all the messages
- A: Message compression
- B: Message sets
- C: Binary message format
- D: Partitions
Answer: C Explanation: Binary message format ensures that consistent format is used by all three processes
- A: zookeeper.xml
- B: zookeeper.properties
- C: zk.yaml
- D: kafka.zk.properties
- A: ZooKeeper eliminates partial failures
- B: ZooKeeper causes partial failures
- C: ZooKeeper detects partial failures
- D: ZooKeeper provides a mechanism for handling partial failures
Answer: D Explanation: ZooKeeper only provides a mechanism to handle partial failures
- A: ephemeral nodes
- B: persistent nodes
- C: sequential nodes
- D: pure nodes
Answer: B Explanation: Unlike ephemeral nodes, persistent znodes continue to exist unless explicitly deleted
- A: A published message will not be lost
- B: A published message will not be saved
- C: A published message will not be deleted
- D: A published message will not be sent
- A: They need to stop sending messages till you bring up the 2 servers
- B: They need to stop sending messages till you bring up at least one server
- C: They can continue to send messages as there is fault tolerance of 4 server failures.
- D: They can continue to send messages as you are keeping a tape back up of all the messages
Answer: C Explanation: Fault tolerance is n - 1, so they don't have to worry about losing messages
- A: Zero
- B: One
- C: Five
- D: All running brokers
- A: ZooKeeper server
- B: Kafka Producer
- C: Kafka Consumer
- D: Kafka Topic
- A: Topics
- B: Chunks
- C: Domains
- D: Messages