In Apache Kafka®, partitions are the key to scalability and distributed processing. Kafka is designed as a distributed system, meaning it runs across multiple machines but appears as a single, unified service. If a topic were stored entirely on one machine, it would be limited by that machine's storage and processing power, capping its scalability. Partitioning solves this by splitting a topic's log into multiple, smaller logs called partitions. These partitions can be distributed across different nodes, allowing Kafka to handle far larger amounts of data than any single machine could hold.
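To make that concrete, here is a minimal sketch using Kafka's Java AdminClient to create a topic split into several partitions. The topic name, partition count, replication factor, and broker address are placeholder values chosen for illustration, not anything prescribed by Kafka.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Properties;

public class CreatePartitionedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumes a broker is reachable at localhost:9092.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Split the hypothetical "thermostat-readings" topic's log into six
            // partitions, each of which can live on a different broker.
            // Replication factor 1 keeps the example runnable on a single broker.
            NewTopic topic = new NewTopic("thermostat-readings", 6, (short) 1);
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```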
This partitioning introduces some changes to message ordering. Within a single partition, message order is strictly maintained—messages are read in the exact sequence they were written. However, global ordering across all partitions is not guaranteed. Messages with the same key are always written to the same partition, ensuring order for that key. This is managed through a hashing function: Kafka hashes the key, applies modulo with the number of partitions, and assigns it accordingly. For example, all events from a thermostat with ID 42 would consistently go to the same partition, preserving their order.
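As a rough sketch of that hashing idea (not Kafka's exact implementation, which applies a murmur2 hash to the serialized key bytes), the assignment boils down to something like this:

```java
public class KeyPartitioningSketch {
    // Simplified, deterministic key-to-partition mapping: hash the key,
    // then take it modulo the number of partitions.
    static int partitionFor(String key, int numPartitions) {
        // Mask off the sign bit so the result is never negative.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 6;
        // Every event keyed by thermostat ID "42" maps to the same partition,
        // so those events keep their relative order.
        System.out.println(partitionFor("42", numPartitions)); // same value...
        System.out.println(partitionFor("42", numPartitions)); // ...every time
        System.out.println(partitionFor("7", numPartitions));  // a different key may land elsewhere
    }
}
```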
If a message has no key, Kafka distributes it using a round-robin method, cycling through the partitions evenly. While this balances the load, it also means that messages from the same thermostat may end up in different partitions, losing any guarantee of order. This trade-off is important to understand when designing your data streams.
Kafka's ability to support millions of partitions across a cluster, thanks to KRaft, makes it incredibly scalable. More partitions allow for greater concurrency and parallelism, optimizing throughput and resilience. Understanding how Kafka distributes messages to partitions is crucial for designing efficient, high-throughput applications that preserve order where it matters.
In short, partitions are what enable Kafka to scale massively while maintaining efficient, reliable, and ordered message processing.
This is the introduction to partitions in Apache Kafka. Now, you probably know Kafka is a distributed system, so it's designed to operate across a number of computers and look to the outside world like it's just one big computer, right? Well, given that, if a topic were constrained to live entirely on one machine, just one node in that cluster, it would be a bit of a bummer. It would put a radical limit on the ability of Kafka to scale, because then a topic could never get any bigger than the biggest node you could deploy and the storage on that node, and you know, that's not so great. So partitioning is how Kafka handles that.
Partitioning takes that single topic log that we talked about in the last module and breaks it up into multiple logs. This topic is partitioned, split into pieces. This is gonna have some effects on the concept of message ordering. We'll see how this works out, but in short, we don't have global ordering, at least not a strict guarantee of global order, within a partitioned topic. We only have that within each partition. I'll show you an animation of that in just a little bit. Of course, you're not limited to just three partitions in a topic. You can have hundreds, you can have thousands. Topics can get very, very large. Current open source Apache Kafka supports, in round numbers, about two million partitions per cluster; that's as of Kafka 4.0 at the time of this recording. That's how scalable Kafka can be.
Of course, once we break a topic up into these partitions, now we need a way of deciding which messages get written to which partition, right? You've got somebody writing messages to this topic, and now that topic is split into pieces, I need to know which partition a given message goes to. Now, if a message has no key, and remember, we think of messages as key-value pairs in the simplest form, that key is null, it's empty, it's just not there, then the messages you write will be distributed round-robin among the partitions. Each message you write just gets assigned to the next partition, load is distributed evenly, and it's a wonderful world.
Well, mostly wonderful. Notice those messages from thermostat 42. You see, they all went to different partitions. They're not in order. Well, they're kind of in order. Roughly speaking, stuff earlier on in the topic is gonna be older than stuff later on in the topic, but I don't have a guarantee of order. When I read things out of this topic, I don't know that I'm gonna get those in the same order they went into it.
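Producing those keyless records looks something like this with the Java client; the broker address, topic name, and payloads are just placeholder assumptions for the sketch. Because no key is set, the producer spreads the records across the topic's partitions.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class KeylessProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                // The key is null, so these records get spread across the
                // topic's partitions to balance load -- which also means records
                // from the same thermostat may land in different partitions.
                producer.send(new ProducerRecord<>("thermostat-readings", null,
                        "{\"thermostat\":42,\"reading\":" + i + "}"));
            }
        }
    }
}
```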
If you want ordering by something like sensor ID, in this case, there's a way to do that. If the sensor ID is the message key, then what happens is that key gets run through a hash function, mod the number of partitions, and that just tells us the partition number to write to. So for this message, that key hashes to partition two, and the message gets written to partition two. Likewise, the next message with key 42 gets written to the same partition. You see, we have strict ordering for any messages with the same partition key. Now, different keys, anything goes, right? They could be assigned to whatever partition happens to win the hash function lottery, but we get an ordering guarantee on messages of a common key, and that can be very, very useful in some use cases.
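Here's a small keyed-producer sketch along those lines; again, the broker address, topic name, and key/value contents are placeholder assumptions. Using the thermostat ID as the key means every one of its readings hashes to the same partition, so they stay in order.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class KeyedProducerSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key "42" is the sensor ID: every record with this key is hashed
            // to the same partition, preserving per-key order.
            for (String reading : new String[]{"21.5", "21.7", "22.0"}) {
                RecordMetadata meta = producer.send(
                        new ProducerRecord<>("thermostat-readings", "42", reading)).get();
                System.out.printf("key=42 value=%s -> partition %d, offset %d%n",
                        reading, meta.partition(), meta.offset());
            }
        }
    }
}
```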
To sum it up, Kafka partitions are what make Kafka scalable, efficient, and reliable. By dividing topics into partitions, Kafka can distribute workloads across a cluster and handle massive amounts of data, massive amounts of IO, pub-sub activity happening, and preserve message order when you need it. Understanding how partitions work, especially how they're assigned and balanced, helps you get the most out of Kafka.