Kafka
91 cards · 🟢 21 easy | 🟡 42 medium | 🔴 21 hard
🟢 Easy (21)
1. What is Apache Kafka and what problems does it solve?
Show answer
[kafka.apache.org](https://kafka.apache.org): "Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications." In other words, Kafka is a distributed log where you can store events, read them, and distribute them to different services, at high scale and in real time.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
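The "distributed commit log" idea can be sketched as a toy model in Python (an illustration only, not Kafka's actual implementation): records are only ever appended, and each one gets a sequential offset that consumers read from.

```python
class PartitionLog:
    """Toy model of one Kafka partition: an append-only, offset-indexed log."""

    def __init__(self):
        self._records = []  # records are never modified in place, only appended

    def append(self, value):
        offset = len(self._records)  # offsets are sequential within a partition
        self._records.append(value)
        return offset

    def read_from(self, offset):
        # Consumers read sequentially, starting at a given offset.
        return self._records[offset:]


log = PartitionLog()
log.append("order-created")   # offset 0
log.append("order-paid")      # offset 1
print(log.read_from(0))       # ['order-created', 'order-paid']
```

Sequential appends like this are why Kafka gets high throughput from ordinary disks: the write pattern is purely sequential I/O.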
2. What are Kafka Streams and its use cases?
Show answer
Kafka Streams is a lightweight stream processing library in Kafka that allows developers to build applications that process and analyze real-time data streams. It provides a high-level DSL (Domain-Specific Language) for writing stream processing applications directly against Kafka topics. Use cases for Kafka Streams include real-time analytics, fraud detection, monitoring, and ETL (Extract, Transform, Load) operations.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
3. What are the challenges and best practices for upgrading Kafka versions in a production environment?
Show answer
Upgrading Kafka versions in a production environment poses challenges that need careful consideration. Best practices include rolling upgrades (one broker at a time), verifying the upgrade in a staging environment first, and keeping inter.broker.protocol.version pinned to the old version until every broker runs the new release.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
4. What is Apache Kafka?
Show answer
Apache Kafka is a distributed, scalable, and fault-tolerant streaming platform designed to handle real-time data feeds. Developed by the Apache Software Foundation, Kafka is widely used for building real-time data pipelines and streaming applications. It provides a publish-subscribe messaging system, high throughput, fault tolerance, and durability, making it suitable for use cases such as log aggregation, event sourcing, and data integration.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
5. What is the role of the Kafka Log Cleaner?
Show answer
The Kafka Log Cleaner is a background process responsible for managing disk space and maintaining optimal storage efficiency in Kafka brokers. Key aspects of the Kafka Log Cleaner include:
* Log Segments: Over time, log segments in Kafka can accumulate obsolete and deleted records, consuming unnecessary disk space.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
6. What is the role of Kafka's transactional producer API, and how does it differ from the non-transactional API?
Show answer
Kafka's transactional producer API provides exactly-once semantics for producing messages: writes to multiple partitions are committed or aborted atomically, and consumers configured with read_committed isolation never see aborted records. The non-transactional API, by contrast, offers only at-least-once (or, with acks=0, at-most-once) delivery.
Remember: acks: 0=fire-and-forget, 1=leader, all=ISR. "0=YOLO, 1=Leader, all=Safe."
Gotcha: acks=all adds latency. Pair with retries + enable.idempotence for reliability.
7. What is the role of the Kafka Metrics API?
Show answer
The Kafka Metrics API provides a comprehensive set of metrics and monitoring capabilities to track the performance and health of a Kafka cluster. Key aspects include broker, producer, and consumer metrics such as request rates, byte rates, latencies, and consumer lag, typically exposed via JMX or pluggable metrics reporters.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
8. What is a "Producer" in regards to Kafka?
Show answer
An application that publishes data to the Kafka cluster.
Remember: acks: 0=fire-and-forget, 1=leader, all=ISR. "0=YOLO, 1=Leader, all=Safe."
Gotcha: acks=all adds latency. Pair with retries + enable.idempotence for reliability.
9. Define a producer in Kafka.
Show answer
A producer in Kafka is a component or application responsible for publishing messages to Kafka topics. Producers create and send messages to specific topics, making the messages available for consumption by one or more consumers. Producers are typically designed to be highly scalable and fault-tolerant, ensuring reliable and efficient delivery of messages to Kafka brokers.
Remember: acks: 0=fire-and-forget, 1=leader, all=ISR. "0=YOLO, 1=Leader, all=Safe."
Gotcha: acks=all adds latency. Pair with retries + enable.idempotence for reliability.
10. What is the role of interceptors in Kafka producers and consumers?
Show answer
Interceptors in Kafka allow developers to intercept and modify records before they are sent by producers or received by consumers. Producer interceptors run before a record is serialized and sent (onSend), consumer interceptors run before records are returned to the application (onConsume), and common uses include auditing, metrics collection, and tracing.
Remember: acks: 0=fire-and-forget, 1=leader, all=ISR. "0=YOLO, 1=Leader, all=Safe."
Gotcha: acks=all adds latency. Pair with retries + enable.idempotence for reliability.
11. What are the potential issues and solutions when dealing with out-of-order messages in Kafka?
Show answer
Dealing with out-of-order messages in Kafka is essential for maintaining data consistency. Potential issues and solutions include:
Causes of Out-of-Order Messages:
* Network Delays: Variability in network latencies can result in messages arriving out of order.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
12. What are the key considerations for Kafka deployment in a cloud environment?
Show answer
Deploying Kafka in a cloud environment involves several key considerations:
* Resource Scaling: Cloud platforms allow for dynamic scaling of resources, enabling Kafka clusters to adapt to varying workloads. Consider using auto-scaling features to adjust the number of broker instances based on demand.
Remember: Same key → same partition (hash). Guarantees per-key ordering.
Gotcha: Changing partition count redistributes keys → ordering breaks.
13. What is a Kafka record or message?
Show answer
In Kafka, a record or message is the basic unit of data that is produced and consumed. A record typically consists of two components:
* Key: An optional field that can be used for partitioning and indexing. The key is used to determine the partition to which the message will be sent.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
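The key-to-partition mapping can be illustrated with a short sketch. Kafka's default partitioner actually hashes keys with murmur2; `crc32` below is a stand-in, used only to show the "hash the key, mod the partition count" idea.

```python
from zlib import crc32


def partition_for(key: bytes, num_partitions: int) -> int:
    # Kafka's default partitioner really uses murmur2; crc32 stands in
    # here to illustrate hashing the key modulo the partition count.
    return crc32(key) % num_partitions


# The same key always maps to the same partition, which is what
# gives Kafka its per-key ordering guarantee.
p1 = partition_for(b"user-42", 6)
p2 = partition_for(b"user-42", 6)
print(p1 == p2)  # True
```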
14. What is the purpose of Kafka Zookeeper?
Show answer
Kafka relies on Apache ZooKeeper for distributed coordination and management of its cluster. The main purposes of Kafka ZooKeeper include:
* Cluster Coordination: ZooKeeper helps Kafka brokers coordinate and elect a leader for each partition, facilitating fault tolerance and load balancing.
Fun fact: KRaft replaces ZooKeeper from Kafka 3.3+. One less system to manage.
Remember: ZK managed brokers, configs, elections. KRaft moves all into Kafka itself.
15. What is the purpose of the Kafka Controller?
Show answer
The Kafka Controller is a crucial component within the Kafka cluster responsible for managing partitions, leaders, and replicas. Its main purposes include:
Partition Leader Election: The Controller ensures the election of a leader for each partition. The leader is responsible for handling read and write operations for that partition.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
16. What is a Kafka transaction and when is it used?
Show answer
A Kafka transaction is a mechanism that allows producers to send messages to multiple partitions within a transactional context. Kafka transactions ensure atomicity, consistency, and isolation of message writes across partitions. Producers can either commit or abort a transaction, ensuring that messages are either all successfully written to partitions or none at all.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
17. What is the role of Kafka Streams DSL?
Show answer
Kafka Streams DSL (Domain-Specific Language) is a high-level API provided by Kafka Streams for building stream processing applications. It allows developers to define complex data processing operations using a fluent and expressive API. Key aspects of the Kafka Streams DSL include stateless operations (map, filter), stateful operations (aggregations, joins), windowing, and the KStream/KTable abstractions.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
18. What is the role of the Kafka AdminClient API, and how is it used?
Show answer
The Kafka AdminClient API is a Java client that provides administrative functionality to interact with and manage Kafka clusters programmatically. Its role includes:
* Cluster Metadata Retrieval: The AdminClient allows users to retrieve metadata about the Kafka cluster, such as broker information, topic details, and partition assignments.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
19. What is the purpose of Kafka's Exactly-Once Semantics and how is it implemented?
Show answer
Kafka's Exactly-Once Semantics ensures that messages are processed and delivered exactly once, without duplicates or message loss. This is achieved through idempotent producers (broker-side deduplication of retried sends) and the transactional API (atomic writes across partitions, read by consumers in read_committed mode).
Under the hood: Requires enable.idempotence=true + transactional API. Adds latency.
20. What is a consumer group in Kafka?
Show answer
A consumer group in Kafka is a logical grouping of consumers that work together to consume messages from one or more topics. Each consumer group has one or more consumers, and each message within a topic partition is consumed by only one consumer within the group. Consumer groups enable parallel processing of messages, as different partitions can be consumed concurrently by different consumers. This architecture supports scalable and fault-tolerant message consumption.
Remember: Consumer group = team sharing work. One partition→one consumer. Max parallel = partitions.
Gotcha: More consumers than partitions = idle consumers.
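How a group divides partitions can be sketched as follows. This is a simplification: real Kafka uses pluggable assignors (range, round-robin, cooperative-sticky), but the idle-consumer effect is the same.

```python
def assign_partitions(partitions, consumers):
    """Round-robin partition assignment across group members (simplified)."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment


# 6 partitions, 3 consumers: each member gets 2 partitions.
balanced = assign_partitions(list(range(6)), ["c1", "c2", "c3"])
print(balanced)   # {'c1': [0, 3], 'c2': [1, 4], 'c3': [2, 5]}

# 2 partitions, 3 consumers: c3 gets nothing -- the idle-consumer gotcha.
starved = assign_partitions(list(range(2)), ["c1", "c2", "c3"])
print(starved)    # {'c1': [0], 'c2': [1], 'c3': []}
```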
21. What is a consumer in Kafka?
Show answer
A consumer in Kafka is a component or application responsible for subscribing to and consuming messages from Kafka topics. Consumers process the messages produced by the producers. Kafka supports both parallel and distributed consumption, allowing multiple consumers to work together to process messages from a shared topic. Consumers can be part of a consumer group, providing scalability and fault tolerance.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
🟡 Medium (42)
1. Discuss the scenarios where Kafka is a better choice than traditional messaging systems.
Show answer
Kafka is a preferred choice in several scenarios compared to traditional messaging systems:
* Scalability: Kafka's distributed architecture allows for horizontal scaling, accommodating large volumes of data and high message throughput.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
2. Explain the role of Kafka in a distributed system.
Show answer
In a distributed system, Kafka serves as a distributed messaging system that enables communication and data exchange between different components or services. Its key roles include:
* Data Streaming: Kafka facilitates the streaming of real-time data between distributed components, allowing seamless communication in a decoupled manner.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
3. Discuss the impact of changing the Kafka replication factor.
Show answer
Changing the Kafka replication factor has several impacts on the Kafka cluster:
* Fault Tolerance: Increasing the replication factor improves fault tolerance. Each partition has multiple replicas, and if a broker fails, one of the replicas can be promoted to leader, ensuring continuity of data availability.
Remember: RF=3 survives 2 broker failures. `min.insync.replicas=2` protects writes.
Gotcha: min.insync.replicas + acks=all = data safety but can reduce availability.
4. Describe the purpose of a Kafka broker.
Show answer
A Kafka broker is a server instance within the Kafka cluster that stores and manages the distribution of messages. The primary purposes of a Kafka broker include:
* Message Storage: Brokers store the messages published by producers to topics. Messages are stored in partitions within the broker.
Remember: Broker = Kafka server. Cluster = multiple brokers. Each stores partition replicas.
Under the hood: Each partition has one leader. Reads/writes go to leader (pre-KIP-392).
5. Explain the role of the Apache Kafka Consumer API.
Show answer
The Apache Kafka Consumer API is a set of classes and methods that allow developers to create and configure Kafka consumers for subscribing to and processing messages from Kafka topics. Key aspects of the Consumer API include subscribing to topics, polling for records, committing offsets (automatically or manually), and participating in consumer groups for parallel consumption.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
6. Discuss Kafka's support for multi-datacenter replication.
Show answer
Kafka supports multi-datacenter replication to enhance fault tolerance and ensure data availability across geographically distributed locations. Key aspects include:
* Replica Placement: Kafka allows the placement of replicas across multiple data centers. Each partition can have replicas distributed across different geographical regions.
Remember: RF=3 survives 2 broker failures. `min.insync.replicas=2` protects writes.
Gotcha: min.insync.replicas + acks=all = data safety but can reduce availability.
7. Discuss the considerations for selecting the appropriate storage infrastructure for Kafka.
Show answer
Choosing the right storage infrastructure for Kafka involves several considerations:
* Disk Speed and Type: Opt for high-speed disks, such as SSDs, to ensure optimal disk I/O performance. The choice of disk type impacts the overall throughput and latency of Kafka.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
8. What is a topic in Kafka?
Show answer
In Kafka, a topic is a logical channel or category to which messages are published by producers and from which messages are consumed by consumers. Topics serve as the primary means of organizing and categorizing data within the Kafka cluster. Producers publish messages to specific topics, and consumers subscribe to topics to receive and process the messages. Topics enable the decoupling of data producers and consumers, allowing for flexible and scalable data processing.
Remember: Topic = named feed. "TV channel." Publishers send, subscribers receive.
Under the hood: Topics split into partitions. Each partition = ordered, immutable log.
9. Discuss the role of Kafka MirrorMaker in data replication across clusters.
Show answer
Kafka MirrorMaker is a tool designed for replicating data between Kafka clusters. It consumes from a source cluster and produces to a target cluster, supporting cross-datacenter replication and disaster-recovery setups; MirrorMaker 2 builds on Kafka Connect and can also replicate topic configurations and consumer offsets.
Remember: RF=3 survives 2 broker failures. `min.insync.replicas=2` protects writes.
Gotcha: min.insync.replicas + acks=all = data safety but can reduce availability.
10. Explain how Kafka handles message deduplication.
Show answer
Kafka addresses message deduplication through a combination of producer and broker mechanisms:
* Producer Idempotence: Kafka introduced the concept of idempotent producers. When a producer is configured as idempotent, it ensures that messages are sent exactly once. This helps prevent duplicates caused by retries during transient failures.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
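Broker-side deduplication for idempotent producers can be sketched like this. It is a toy model: the real broker tracks sequence numbers per producer ID and per partition, but the drop-the-retry logic is the same idea.

```python
class DedupLog:
    """Toy broker log that drops retried duplicates from idempotent producers."""

    def __init__(self):
        self.records = []
        self.last_seq = {}  # producer_id -> highest sequence number appended

    def append(self, producer_id, seq, value):
        if self.last_seq.get(producer_id, -1) >= seq:
            return False  # a retry resent an already-appended sequence; drop it
        self.last_seq[producer_id] = seq
        self.records.append(value)
        return True


broker = DedupLog()
broker.append("p1", 0, "msg-a")
broker.append("p1", 0, "msg-a")  # network retry resends sequence 0
broker.append("p1", 1, "msg-b")
print(broker.records)  # ['msg-a', 'msg-b'] -- the duplicate was dropped
```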
11. Discuss the role of Apache Avro in Kafka.
Show answer
Apache Avro is a binary serialization format used in Kafka for efficient and compact data serialization. Key aspects of Avro in Kafka include compact binary encoding, schemas stored alongside or referenced by the data (commonly via a Schema Registry), and support for schema evolution.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
12. Explain Kafka's protocol for inter-broker communication.
Show answer
Kafka uses a binary protocol for inter-broker communication. Key aspects include:
* Message Format: Inter-broker communication involves the exchange of messages between Kafka brokers. Messages are sent in a binary format for efficiency.
Remember: Broker = Kafka server. Cluster = multiple brokers. Each stores partition replicas.
Under the hood: Each partition has one leader. Reads/writes go to leader (pre-KIP-392).
13. Discuss the considerations for achieving low-latency in Kafka.
Show answer
Achieving low-latency in Kafka involves careful consideration of several factors:
* Partition and Replica Placement: Distribute partitions and their replicas across brokers and network locations to minimize data transfer latency.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
14. Discuss Kafka's support for end-to-end security using SSL/TLS.
Show answer
Kafka provides robust support for end-to-end security through SSL/TLS. Key aspects include:
* Encryption: SSL/TLS ensures that data transferred between producers, brokers, and consumers is encrypted, preventing unauthorized access to sensitive information.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
15. Explain the use of Kafka quotas and rate limiting.
Show answer
Kafka quotas and rate limiting are mechanisms to control and manage resource usage within a Kafka cluster. Key aspects include produce and fetch byte-rate quotas and request-time quotas, configurable per user or client ID; clients that exceed a quota are throttled rather than rejected.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
16. Discuss the impact of message size on Kafka performance.
Show answer
The impact of message size on Kafka performance is a crucial consideration: larger messages increase broker memory pressure, network transfer time, and end-to-end latency. The message.max.bytes (broker) and max.message.bytes (topic) settings cap record batch size, and very large payloads are often better stored externally with a reference published to Kafka.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
17. Explain the role of Kafka ACLs (Access Control Lists).
Show answer
Kafka ACLs (Access Control Lists) play a crucial role in securing Kafka clusters by defining fine-grained access permissions for users and applications. Key aspects of Kafka ACLs include:
* Topic-Level Permissions: ACLs can be set at the topic level, specifying which users or groups have read or write access to particular topics.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
18. Explain the process of upgrading a Kafka cluster.
Show answer
The process of upgrading a Kafka cluster involves the following steps:
* Backup: Before upgrading, ensure a comprehensive backup of the Kafka data and configurations. This provides a safety net in case of unforeseen issues during the upgrade.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
19. Discuss the use of Kafka Connect transforms and the available transformation types.
Show answer
Kafka Connect transforms are operations applied to data during the ETL (Extract, Transform, Load) process. They allow modification, filtering, or enrichment of data as it flows through Kafka Connect. Key transformation types are the built-in single message transforms (SMTs) such as InsertField, ReplaceField, MaskField, and TimestampConverter.
Remember: Kafka Connect = pre-built source/sink connectors. DB→Kafka without coding.
20. How does Kafka handle dynamic partition assignment in consumer groups?
Show answer
Kafka handles dynamic partition assignment in consumer groups through the following process:
* Group Coordinator: Each consumer group has a designated group coordinator responsible for managing group membership and partition assignments.
Remember: Consumer group = team sharing work. One partition→one consumer. Max parallel = partitions.
Gotcha: More consumers than partitions = idle consumers.
21. Explain the considerations for securing a Kafka cluster.
Show answer
Securing a Kafka cluster involves implementing measures to protect data, ensure authentication and authorization, and prevent unauthorized access. Key considerations include TLS encryption in transit, SASL authentication, ACL-based authorization, and network isolation of brokers.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
22. Explain the impact of broker properties like min.insync.replicas on Kafka's reliability.
Show answer
The min.insync.replicas broker property in Kafka determines the minimum number of in-sync replicas (ISRs) required to acknowledge a write operation as successful. With acks=all, a write succeeds only when at least that many replicas have it; if the ISR shrinks below the threshold, producers receive NotEnoughReplicas errors, trading availability for durability.
Remember: Broker = Kafka server. Cluster = multiple brokers. Each stores partition replicas.
Under the hood: Each partition has one leader. Reads/writes go to leader (pre-KIP-392).
23. Explain the mechanics of Kafka rebalancing.
Show answer
Kafka rebalancing is a process that occurs when the membership of consumer group instances changes. It involves redistributing the partitions among the consumers to ensure a balanced workload. The mechanics include: the group coordinator detects a membership change, consumers revoke their partitions, the group leader computes a new assignment, and the coordinator distributes it; cooperative (incremental) rebalancing avoids stopping the entire group.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
24. Discuss the significance of the ISR (In-Sync Replicas) list.
Show answer
The ISR (In-Sync Replicas) list is a subset of replicas for a partition that are considered in sync with the leader. The significance of the ISR list includes:
* Fault Tolerance: The ISR list ensures fault tolerance by only promoting replicas within the ISR list to leaders in case of a leader failure.
Remember: RF=3 survives 2 broker failures. `min.insync.replicas=2` protects writes.
Gotcha: min.insync.replicas + acks=all = data safety but can reduce availability.
25. How does Kafka handle data compaction in detail?
Show answer
Kafka handles data compaction through a feature known as log compaction. Here's a detailed explanation:
* Log Segments: Kafka maintains data in log segments, each representing a sequential and immutable portion of a partition's commit log.
Remember: delete=TTL-based removal. compact=keep latest per key. "delete=TTL, compact=upsert."
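The effect of compaction can be sketched in a few lines. This illustrates only the "keep the latest value per key" rule, not the segment-based mechanics the broker actually uses.

```python
def compact(records):
    """Keep only the most recent (key, value) pair for each key,
    preserving the order in which the surviving records were written."""
    latest = {}  # key -> (offset, value)
    for offset, (key, value) in enumerate(records):
        latest[key] = (offset, value)
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (offset, value) in survivors]


history = [("user-1", "addr-A"), ("user-2", "addr-B"), ("user-1", "addr-C")]
print(compact(history))  # [('user-2', 'addr-B'), ('user-1', 'addr-C')]
```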
26. Discuss the considerations for choosing the appropriate Kafka storage format (log, compacted log, etc.).
Show answer
Choosing the appropriate Kafka storage format involves considering factors such as use case, data retention, and access patterns. Common storage formats include:
Log Format (Append-Only):
* Suitable for scenarios where the entire history of events is critical.
* Well-suited for event sourcing architectures.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
27. Discuss the concept of Kafka log appenders.
Show answer
Kafka log appenders are components responsible for appending log entries to Kafka logs in an efficient and reliable manner. Key aspects of Kafka log appenders include:
* Batching: Log appenders often batch multiple log entries into a single write operation to improve write efficiency and reduce disk I/O.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
28. Explain the role of the Apache Kafka Producer API.
Show answer
The Apache Kafka Producer API is a set of classes and methods that enable developers to create and configure Kafka producers for publishing messages to Kafka topics. The key aspects of the Producer API include send() with optional callbacks, batching (batch.size, linger.ms), pluggable serializers, and delivery guarantees controlled by acks and retries.
Remember: acks: 0=fire-and-forget, 1=leader, all=ISR. "0=YOLO, 1=Leader, all=Safe."
Gotcha: acks=all adds latency. Pair with retries + enable.idempotence for reliability.
29. Explain the role of log segments in Kafka storage.
Show answer
In Kafka, the commit log is divided into log segments, each representing a sequential and immutable portion of a partition's log. Key aspects of log segments include: each segment is an append-only file with accompanying index files for offset lookup; only the active segment is written to, and retention and compaction operate on closed segments.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
30. What is the significance of the offset in Kafka?
Show answer
The offset is a unique identifier assigned to each message within a partition. It represents the position of a message in the partition's log. Consumers use offsets to keep track of the messages they have already consumed. Kafka ensures that each message has a unique offset within a partition, enabling consumers to resume processing from a specific point in the log. Offsets are stored in Kafka topics, providing a reliable way to maintain the state of message consumption.
Remember: Offset = partition position. "Bookmark." Consumers track where they left off.
Gotcha: Without committing offsets, consumers replay messages after crash.
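The "offset as bookmark" behavior, including the replay-after-crash gotcha, can be sketched as a toy consumer (an illustration, not the real client API):

```python
class ToyConsumer:
    """Toy consumer: the committed offset is a bookmark into the partition log."""

    def __init__(self, log):
        self.log = log
        self.position = 0   # in-memory read position
        self.committed = 0  # last offset durably committed

    def poll(self):
        records = self.log[self.position:]
        self.position = len(self.log)
        return records

    def commit(self):
        self.committed = self.position

    def crash_and_restart(self):
        self.position = self.committed  # resume from the last commit


consumer = ToyConsumer(["e1", "e2", "e3"])
consumer.poll()               # reads e1..e3 but never commits
consumer.crash_and_restart()
print(consumer.poll())        # ['e1', 'e2', 'e3'] -- replayed after the crash
```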
31. What is the role of the Kafka Connect API?
Show answer
The Kafka Connect API is used for building and running connectors that integrate Kafka with external data sources or sinks. Connectors facilitate the movement of data in and out of Kafka, allowing seamless integration with databases, file systems, messaging systems, and other data storage or processing systems. Kafka Connect simplifies the development and deployment of data pipelines, enabling the transfer of data between Kafka topics and external systems in a scalable and fault-tolerant manner.
Remember: Kafka Connect = pre-built source/sink connectors. DB→Kafka without coding.
32. Explain the considerations for scaling a Kafka cluster horizontally.
Show answer
Scaling a Kafka cluster horizontally involves adding more broker instances to distribute the workload and increase capacity. Considerations for horizontal scaling include:
* Broker Addition: New broker instances can be added to the Kafka cluster to increase the overall capacity for handling more producers, consumers, and partitions.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
33. How does Kafka support multi-tenancy?
Show answer
Kafka supports multi-tenancy, allowing multiple independent applications or business units (tenants) to share a single Kafka cluster. Key aspects of multi-tenancy in Kafka include per-tenant quotas, ACLs for isolation, and topic naming conventions or prefixes per tenant.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
34. Explain the role of the Kafka Raft metadata mode.
Show answer
Kafka Raft metadata mode (KRaft) is an enhancement to Kafka's metadata storage system, replacing the traditional ZooKeeper-based metadata storage. The Raft consensus algorithm is used to achieve distributed consensus among a quorum of controller nodes, which replicate cluster metadata in an internal metadata log, providing better reliability and simplicity compared to ZooKeeper.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
35. Explain the process of Kafka producer acknowledgment.
Show answer
Kafka producer acknowledgment refers to the confirmation received by a producer after successfully publishing a message to a Kafka broker. Producers can configure the level of acknowledgment they require using the acks parameter: acks=0 (no acknowledgment), acks=1 (leader has written the record), or acks=all (all in-sync replicas have it).
Remember: acks: 0=fire-and-forget, 1=leader, all=ISR. "0=YOLO, 1=Leader, all=Safe."
Gotcha: acks=all adds latency. Pair with retries + enable.idempotence for reliability.
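The three levels map to producer configuration like this. The keys are real Kafka producer settings; the dicts are illustrative fragments, not a working client.

```python
# acks=0: the producer does not wait for any broker acknowledgment.
fire_and_forget = {"acks": 0}

# acks=1: the partition leader has written the record to its own log.
leader_only = {"acks": 1}

# acks=all: every in-sync replica has the record; pair with idempotence
# and retries for the strongest delivery guarantee.
durable = {
    "acks": "all",
    "enable.idempotence": True,
    "retries": 2147483647,
}
print(durable["acks"])  # all
```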
36. Discuss the impact of increasing the number of partitions on consumer parallelism.
Show answer
Increasing the number of partitions in Kafka has a direct impact on consumer parallelism: the partition count caps the number of consumers in a group that can do useful work, since each partition is consumed by at most one group member. More partitions therefore allow more parallel consumers, while consumers beyond the partition count sit idle.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
37. Describe the impact of increasing the number of partitions in Kafka.
Show answer
Increasing the number of partitions in Kafka has several impacts:
* Increased Parallelism: More partitions allow for more parallelism in data processing. Multiple consumers can concurrently consume messages from different partitions, providing improved throughput.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
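The ordering gotcha from repartitioning can be demonstrated directly: the same hash-mod-N scheme maps keys differently once N changes (crc32 below stands in for Kafka's murmur2 hash).

```python
from zlib import crc32


def partition_for(key: bytes, num_partitions: int) -> int:
    return crc32(key) % num_partitions  # stand-in for Kafka's murmur2 hashing


keys = [b"user-1", b"user-2", b"user-3", b"user-4"]
before = {k: partition_for(k, 3) for k in keys}  # topic with 3 partitions
after = {k: partition_for(k, 4) for k in keys}   # topic grown to 4 partitions
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys now map to a different partition")
```

Any key that moves loses its per-key ordering across the resize, which is why partition counts are usually chosen generously up front.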
38. What Kafka is used for?
Show answer
- Real-time e-commerce
- Banking
- Health Care
- Automotive (traffic alerts, hazard alerts, ...)
- Real-time Fraud Detection
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
39. How does Kafka ensure data durability?
Show answer
Kafka ensures data durability through various mechanisms:
* Replication: Kafka replicates partitions across multiple brokers. This means that even if one or more brokers fail, data remains available from the replicas.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
40. How does Kafka handle data compression, and what are the available compression codecs?
Show answer
Kafka handles data compression to optimize storage and network transfer. Producers can compress messages before sending them to brokers, and consumers decompress received messages. Available compression codecs in Kafka are gzip, snappy, lz4, and zstd:
Gzip: Offers a good balance between compression ratio and speed.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
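The trade-off is easy to see with Python's standard library. Gzip here demonstrates the size win on a repetitive JSON-like payload; Kafka itself applies the chosen codec to whole record batches.

```python
import gzip

payload = b'{"event": "page_view", "user": "u-1"}' * 100  # repetitive batch
compressed = gzip.compress(payload)
print(len(payload), "->", len(compressed), "bytes")
assert gzip.decompress(compressed) == payload  # compression is lossless
```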
41. What's in a Kafka cluster?
Show answer
- Broker: a server running the Kafka process, with its own local storage. A single Kafka cluster usually contains multiple brokers.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
42. Explain the difference between a queue and a topic in Kafka.
Show answer
In Kafka there is no separate "queue" primitive; everything is a topic. The queue-vs-topic distinction from traditional messaging systems maps onto consumer groups: consumers in the same group share a topic's partitions (queue semantics, each message processed once per group), while separate groups each receive the full stream (publish-subscribe semantics).
Remember: Topic = named feed. "TV channel." Publishers send, subscribers receive.
Under the hood: Topics split into partitions. Each partition = ordered, immutable log.
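The two behaviors can be modeled in a few lines of Python (a toy model, not the client API): members of one group split the stream between them, while every group sees all of it.

```python
from itertools import cycle

def deliver(messages, groups):
    """Each group sees every message (pub-sub across groups), but inside a
    group each message goes to only one member (queue semantics)."""
    received = {g: {m: [] for m in members} for g, members in groups.items()}
    pickers = {g: cycle(members) for g, members in groups.items()}
    for msg in messages:
        for g in groups:
            received[g][next(pickers[g])].append(msg)
    return received

out = deliver(["m1", "m2", "m3", "m4"],
              {"billing": ["b1", "b2"], "audit": ["a1"]})
# "billing" splits the stream queue-style; "audit" alone receives everything
```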
🔴 Hard (21)¶
1. How does Kafka handle message ordering within a partition?
Show answer
Kafka ensures strict ordering of messages within a partition. Each partition maintains a sequential log of messages, and each message is assigned a unique offset. Producers sequentially append messages to the end of the log, and consumers read messages in the order of their offsets. The ordering is guaranteed within a partition, but across partitions there is no guaranteed global order.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
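A toy model of the offset mechanics described above (illustrative only, not Kafka's storage format):

```python
class Partition:
    """Toy partition: an append-only list where a record's index is its offset."""
    def __init__(self):
        self.log = []

    def append(self, record):
        self.log.append(record)
        return len(self.log) - 1          # offset is assigned at append time

    def read_from(self, offset):
        """Consumers always see (offset, record) pairs in offset order."""
        return list(enumerate(self.log))[offset:]

p = Partition()
for r in ["a", "b", "c"]:
    p.append(r)
print(p.read_from(1))   # [(1, 'b'), (2, 'c')]
```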
2. Discuss Kafka's support for different message delivery semantics.
Show answer
Kafka supports different message delivery semantics to cater to various application requirements. The main delivery semantics are:
- At-most-once: messages may be lost but are never redelivered (e.g. commit offsets before processing).
- At-least-once: messages are never lost but may be redelivered, so consumers must handle duplicates (commit offsets after processing).
- Exactly-once: each message is processed exactly once, via idempotent producers and the transactional API.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
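At-most-once and at-least-once behavior mostly falls out of producer and consumer settings; an illustrative sketch (the keys are real Kafka configs, values are examples):

```properties
# At-most-once leaning producer: fire and forget, no broker acknowledgment
acks=0

# At-least-once: wait for all in-sync replicas and retry on failure...
acks=all
# ...and on the consumer, commit offsets manually *after* processing
enable.auto.commit=false
```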
3. Discuss the challenges and solutions for ensuring exactly-once semantics in Kafka.
Show answer
Ensuring exactly-once semantics in Kafka is challenging but achievable. The main building blocks are:
- Producer Idempotence: producers can be configured to send messages idempotently, so broker-side retries do not create duplicates.
- Transactions: the transactional API lets a producer write to multiple partitions atomically, and consumers reading with read_committed isolation see only committed results.
Under the hood: Requires enable.idempotence=true + transactional API. Adds latency.
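A hedged sketch of the configuration involved (the transactional.id value is illustrative; the keys are real producer/consumer configs):

```properties
# Producer: idempotent, transactional writes
enable.idempotence=true
transactional.id=orders-processor-1   # illustrative id, must be unique per producer instance

# Consumer: only see messages from committed transactions
isolation.level=read_committed
```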
4. How does Kafka ensure fault tolerance?
Show answer
Kafka ensures fault tolerance through partition replication. Each partition has multiple replicas distributed across different brokers. If a broker fails, one of the replicas can be promoted to serve as the new leader, ensuring uninterrupted data availability. This replication strategy, combined with ZooKeeper for broker coordination and leader election, makes Kafka resilient to individual broker failures and contributes to the overall fault tolerance of the system.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
5. Explain the role of the Kafka Schema Registry.
Show answer
The Kafka Schema Registry is a centralized service that manages the schemas for messages produced and consumed in a Kafka environment. It ensures that producers and consumers agree on the structure of the data by enforcing schema compatibility. This is crucial in evolving systems where data formats may change over time. The Schema Registry supports various serialization formats like Avro, JSON, and Protobuf.
Remember: Schema Registry ensures format agreement. Avro/Protobuf/JSON Schema. Port 8081.
6. How does Kafka handle message retention?
Show answer
Message retention in Kafka is managed through configurable retention policies. Kafka allows users to define retention based on time or size constraints. Messages that exceed the specified retention period or size are eligible for deletion. This feature ensures that Kafka does not indefinitely store all messages, helping manage storage costs and preventing the system from becoming overloaded with outdated data. Retention policies can be set at both the topic and broker levels.
Remember: delete=TTL-based removal. compact=keep latest per key. "delete=TTL, compact=upsert."
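As a sketch, the retention knobs are per-topic configs (keys are real, values illustrative):

```properties
# Time-based: segments older than 7 days become eligible for deletion
retention.ms=604800000
# Size-based: cap each partition at roughly 1 GiB
retention.bytes=1073741824
# Policy: "delete" (the default), "compact", or both combined
cleanup.policy=delete
```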
7. Explain the concept of replication in Kafka.
Show answer
Replication in Kafka involves creating redundant copies (replicas) of each partition across multiple brokers. This provides fault tolerance, ensuring that data remains available even if some brokers fail. Replicas include a leader and follower(s). The leader handles read and write operations, while followers replicate the data. If the leader fails, one of the followers is promoted to be the new leader.
Remember: RF=3 survives 2 broker failures. `min.insync.replicas=2` protects writes.
Gotcha: min.insync.replicas + acks=all = data safety but can reduce availability.
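Replication factor is fixed per topic at creation time; a hedged CLI sketch (topic name and address are illustrative):

```shell
# Create a topic with 3 replicas per partition
kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic orders --partitions 6 --replication-factor 3

# Inspect which broker leads each partition and which replicas are in sync
kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic orders
```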
8. Discuss the importance of log compaction in Kafka.
Show answer
Log compaction is an important feature in Kafka that helps retain the latest value for each key in a log, while older values are periodically compacted and removed. This is particularly useful in scenarios where it is essential to maintain the latest state of each record, such as maintaining the current state of a database. Log compaction ensures that even if there are multiple writes for the same key, only the latest value is retained, reducing storage overhead and improving query efficiency.
9. How does Kafka handle backpressure?
Show answer
Kafka handles backpressure through its flow control mechanism. Consumers can control the rate at which they consume messages by adjusting parameters like max.poll.records and fetch.min.bytes. Producers, on the other hand, can use settings such as acks and linger.ms to control the rate at which they send messages. If a consumer is overwhelmed, it can reduce the number of records fetched per poll or pause partitions until it catches up.
Remember: 0=fastest(lossy), 1=leader(balanced), all=safest(slowest). Choose by importance.
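The settings mentioned above are consumer configs; an illustrative sketch (real keys, example values):

```properties
# Consumer-side flow control
max.poll.records=100        # fewer records handed back per poll()
fetch.min.bytes=1024        # wait until at least this much data is available...
fetch.max.wait.ms=500       # ...but no longer than this
max.poll.interval.ms=300000 # how long processing may take before a rebalance
```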
10. Explain Kafka's architecture in terms of leader and follower replicas.
Show answer
Kafka's architecture involves leader and follower replicas for each partition. Key points include:
- Partition Replication: each Kafka topic is divided into partitions, and each partition has multiple replicas. Replication provides fault tolerance and high availability.
- Leader: handles all reads and writes for the partition.
- Followers: replicate the leader's log and take over via leader election if the leader fails.
Remember: RF=3 survives 2 broker failures. `min.insync.replicas=2` protects writes.
Gotcha: min.insync.replicas + acks=all = data safety but can reduce availability.
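A toy model of the failover described above (the real controller logic is far more involved, and unclean leader election is ignored here):

```python
def elect_leader(replicas, isr, failed_broker):
    """If the current leader dies, promote the first surviving in-sync replica.
    `replicas` lists broker ids, with the leader first; `isr` is the in-sync set."""
    leader = replicas[0]
    if leader != failed_broker:
        return leader                      # leader unaffected, nothing to do
    survivors = [b for b in isr if b != failed_broker]
    if not survivors:
        raise RuntimeError("no in-sync replica left (unclean election disabled)")
    return survivors[0]

# Partition replicated on brokers 1, 2, 3; broker 1 was the leader and failed
print(elect_leader(replicas=[1, 2, 3], isr=[1, 2, 3], failed_broker=1))  # -> 2
```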
11. Discuss the internal architecture of a Kafka broker.
Show answer
The internal architecture of a Kafka broker includes several components:
- Log Segment: the fundamental storage unit containing committed messages. Log segments are immutable and represent a portion of a partition's commit log.
- Log Manager: manages the creation, deletion, and rolling of log segments. It also handles log indexing and compaction.
- Replica Manager: keeps follower replicas in sync with partition leaders.
- Network Layer: accepts client connections and dispatches produce/fetch requests to handler threads.
Remember: Broker = Kafka server. Cluster = multiple brokers. Each stores partition replicas.
Under the hood: Each partition has one leader. Reads/writes go to leader (pre-KIP-392).
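The segment-rolling behavior can be sketched as a toy model (the real broker rolls on log.segment.bytes, time, and index size; only the size trigger is modeled here):

```python
class PartitionLog:
    """Toy log manager: append records, roll a new segment when the active
    one would exceed max_segment_bytes."""
    def __init__(self, max_segment_bytes):
        self.max = max_segment_bytes
        self.segments = [[]]    # list of segments; only the last one is active
        self.size = 0           # bytes in the active segment

    def append(self, record):
        if self.size + len(record) > self.max and self.segments[-1]:
            self.segments.append([])   # roll: the old segment is now immutable
            self.size = 0
        self.segments[-1].append(record)
        self.size += len(record)

log = PartitionLog(max_segment_bytes=10)
for r in [b"aaaa", b"bbbb", b"cccc"]:
    log.append(r)
print(len(log.segments))  # -> 2: appending b"cccc" forced a roll
```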
12. Discuss the role of Kafka Connect converters.
Show answer
Kafka Connect converters are components responsible for translating data between Kafka Connect and external systems. They handle the serialization and deserialization of data, allowing seamless integration between Kafka topics and various data storage systems. Converters are crucial for ensuring that data can be efficiently and accurately transferred between Kafka and external systems with different data formats.
Remember: Kafka Connect = pre-built source/sink connectors. DB→Kafka without coding.
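A hedged sketch of a Connect worker's converter settings (StringConverter and JsonConverter ship with Kafka Connect; whether they fit depends on your data):

```properties
# How records are (de)serialized at the Kafka boundary
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=false
```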
13. Explain the role of the Kafka commit log.
Show answer
The Kafka commit log is the fundamental data structure that underlies the storage of messages in Kafka. It is a distributed, fault-tolerant, and durable log that records all messages published to Kafka topics. The commit log ensures the ordering, persistence, and fault tolerance of messages. Each partition has its own commit log, and messages are written sequentially to the log. Consumers read from the log, ensuring a consistent and ordered view of the data.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
14. How does Kafka address the challenge of maintaining order across multiple partitions?
Show answer
Maintaining order across multiple partitions in Kafka is addressed through the following mechanisms:
- Partition Ordering: within each partition, Kafka maintains the order of records as they are produced. Consumers can rely on the order of records within a partition.
- Key-based Routing: records that must stay ordered relative to each other are given the same key, so they land in the same partition; Kafka deliberately does not impose a global order across partitions.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
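The standard way to keep related records ordered is to give them the same key so they land in the same partition. In miniature (Kafka's default partitioner uses murmur2; crc32 here is just a deterministic stand-in):

```python
import zlib

def partition_for(key, num_partitions):
    """Same key -> same partition, which is how per-key ordering is preserved.
    crc32 is an illustrative stand-in for Kafka's murmur2 hash."""
    return zlib.crc32(key.encode()) % num_partitions

events = ["user-7", "user-42", "user-7", "user-7"]
# Every event keyed "user-7" maps to one partition, so their order is kept
assert len({partition_for(k, 12) for k in events if k == "user-7"}) == 1
```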
15. How is data stored in Kafka?
Show answer
Data in Kafka is stored in the form of logs. Each topic is divided into partitions, and each partition is a linear, ordered sequence of messages. Messages within a partition are assigned a unique offset that represents their position in the partition. Kafka ensures durability by persisting messages to disk, making the data resilient to node failures.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
16. Discuss the publish-subscribe model in Kafka.
Show answer
The publish-subscribe model in Kafka involves producers publishing messages to topics, and consumers subscribing to those topics to receive and process the messages. Multiple consumers can subscribe to the same topic, forming consumer groups. Within a group, each message is delivered to exactly one consumer, enabling parallel and distributed processing; each separate group receives its own full copy of the stream. This model decouples producers from consumers, supporting real-time data streaming and event-driven architectures.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
17. Discuss the role of Kafka in event sourcing architectures.
Show answer
Kafka plays a crucial role in event sourcing architectures by serving as a distributed, fault-tolerant event log. In event sourcing:
- Event Log: Kafka acts as the central event log where all changes to the state of an application are captured as immutable events. These events represent state transitions and serve as a reliable source of truth.
- Replay: downstream services and rebuilt read models recover state by replaying the log from the beginning (or from a snapshot offset).
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
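The "source of truth" idea in miniature: current state is a fold over the immutable event log (a toy example, not tied to any Kafka API):

```python
def rebuild_balance(events):
    """Event sourcing: current state = replay of the event log from the start."""
    balance = 0
    for kind, amount in events:
        balance += amount if kind == "deposit" else -amount
    return balance

history = [("deposit", 100), ("withdraw", 30), ("deposit", 5)]
print(rebuild_balance(history))  # -> 75
```

Because the log is append-only, any consumer can independently rebuild the same state by replaying it, which is exactly what Kafka's retained, ordered partitions provide.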
18. How can you monitor and optimize Kafka cluster performance?
Show answer
Monitoring and optimizing Kafka cluster performance involve several key practices:
- Metrics Monitoring: regularly monitor Kafka metrics related to broker health, disk usage, network throughput, and consumer lag. Utilize tools like JMX, Prometheus, or Confluent Control Center for real-time and historical metrics.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
19. Discuss the use of partitions in Kafka.
Show answer
Partitions are fundamental units of parallelism and scalability in Kafka. They allow Kafka to horizontally scale by distributing the data across multiple brokers. Each partition is an ordered, immutable sequence of messages, and topics are divided into partitions. Partitions enable parallel processing, as multiple consumers can simultaneously consume different partitions of a topic.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.
20. Explain how Kafka handles the scenario of broker failures.
Show answer
Kafka is designed to handle broker failures seamlessly, ensuring high availability and fault tolerance. Key mechanisms include:
- Replication: each partition's replicas live on different brokers, so data survives the loss of any single broker.
- Leader election: when a broker fails, the controller promotes an in-sync follower to leader for each affected partition.
- Client failover: producers and consumers refresh cluster metadata and transparently switch to the new leaders.
Remember: Broker = Kafka server. Cluster = multiple brokers. Each stores partition replicas.
Under the hood: Each partition has one leader. Reads/writes go to leader (pre-KIP-392).
21. Explain the scenarios where partitioning becomes a critical factor in Kafka.
Show answer
Partitioning is a critical factor in Kafka and becomes essential in various scenarios:
- Scalability: partitioning allows Kafka to scale horizontally by distributing data across multiple partitions. Each partition can be processed independently, enabling Kafka to handle a high volume of data and support a large number of producers and consumers.
Remember: Kafka = distributed commit log. Topics→partitions→consumers. Sequential I/O + zero-copy.
Under the hood: Kafka persists everything to disk. High throughput from sequential writes.