Tuesday, March 21, 2023

Cassandra: evaluate table size without reading the data

🕥 3 min.

Introduction


While developing the Cassandra source connector for Flink, I needed a way to ensure that the data I was reading fit in memory. For that, I had to evaluate how big the source table was in order to know how to divide it. But, of course, it had to be done without reading the data itself. Here is how I did it.

Cassandra size estimates statistics


Cassandra partitioning is based on tokens arranged into a ring. A Cassandra node provides statistical information about table sizes in a system table called system.size_estimates. For each table, it stores one row per token range containing: the number of partitions the table occupies in that range, the mean partition size, and the start and end tokens. These elements can be used to get a rough estimation of the table size.

To get this information, we need to issue this request:

SELECT range_start, range_end, partitions_count, mean_partition_size FROM system.size_estimates WHERE keyspace_name = ? AND table_name = ?
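
If you are querying from Java, this can be done with the DataStax driver. Here is a minimal sketch, assuming driver 4.x; the SizeEstimate record is just an illustrative holder of my own, not part of any Cassandra API. Note that the tokens are stored as text in system.size_estimates, hence the parsing:

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.Row;
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// illustrative holder for one row of system.size_estimates
record SizeEstimate(BigInteger rangeStart, BigInteger rangeEnd,
                    long partitionsCount, long meanPartitionSize) {}

List<SizeEstimate> fetchSizeEstimates(CqlSession session, String keyspace, String table) {
  PreparedStatement stmt = session.prepare(
      "SELECT range_start, range_end, partitions_count, mean_partition_size "
          + "FROM system.size_estimates WHERE keyspace_name = ? AND table_name = ?");
  ResultSet rows = session.execute(stmt.bind(keyspace, table));
  List<SizeEstimate> estimates = new ArrayList<>();
  for (Row row : rows) {
    // range_start and range_end are stored as text in this system table
    estimates.add(new SizeEstimate(
        new BigInteger(row.getString("range_start")),
        new BigInteger(row.getString("range_end")),
        row.getLong("partitions_count"),
        row.getLong("mean_partition_size")));
  }
  return estimates;
}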

We will receive the token ranges that the table occupies. To get the size of the table on this node, we sum over these ranges:

table_size_on_this_node = sum(mean_partition_size * partitions_count)

You can see in the formula above that the calculated size covers only one Cassandra node, because system.size_estimates is local to each node. To avoid requesting all the nodes of the cluster, we extrapolate from this node to the whole cluster. For that we calculate the ring fraction this node represents in the cluster. The ring fraction is a value between 0 and 1 obtained like this:

ring_fraction = token_ranges_size / ring_size

ring_size is a constant that depends on the configured Cassandra cluster partitioner: for example 2^64 for Murmur3Partitioner and 2^127 for RandomPartitioner.

token_ranges_size = sum(distance(range_start, range_end))

A token range can wrap around the ring (its end token can be smaller than its start token), so the distance method needs to be a little more complex:

private BigInteger distance(BigInteger token1, BigInteger token2) {
  if (token2.compareTo(token1) > 0) {
    // the range does not wrap around the ring: plain subtraction
    return token2.subtract(token1);
  } else {
    // the range wraps around the ring: add the ring size to the (negative) difference
    return token2.subtract(token1).add(partitioner.ringSize);
  }
}
So, now that we have our ring fraction, we can extrapolate the node table size to get the total table size: 

table_size = table_size_on_this_node / ring_fraction
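
To make the whole computation concrete, here is a sketch that combines the formulas above. It assumes the fetchSizeEstimates helper and the distance method shown earlier, plus a ringSize constant matching the partitioner. Doing the arithmetic in BigInteger, and multiplying by ring_size before dividing, avoids floating-point precision issues:

BigInteger estimateTableSize(List<SizeEstimate> estimates, BigInteger ringSize) {
  BigInteger sizeOnThisNode = BigInteger.ZERO;
  BigInteger tokenRangesSize = BigInteger.ZERO;
  for (SizeEstimate estimate : estimates) {
    // table_size_on_this_node = sum(mean_partition_size * partitions_count)
    sizeOnThisNode = sizeOnThisNode.add(
        BigInteger.valueOf(estimate.meanPartitionSize())
            .multiply(BigInteger.valueOf(estimate.partitionsCount())));
    // token_ranges_size = sum(distance(range_start, range_end))
    tokenRangesSize = tokenRangesSize.add(distance(estimate.rangeStart(), estimate.rangeEnd()));
  }
  if (tokenRangesSize.signum() == 0) {
    // no estimates yet, e.g. before the first flush
    return BigInteger.ZERO;
  }
  // table_size = table_size_on_this_node / ring_fraction
  //            = table_size_on_this_node * ring_size / token_ranges_size
  return sizeOnThisNode.multiply(ringSize).divide(tokenRangesSize);
}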

And here we are!

Updating the statistics


These size estimate statistics are updated when the Cassandra cluster flushes its Memtables (in-memory tables) into SSTables (on-disk tables). In particular, the flush updates the partitions information. The flush of a table can be triggered through the nodetool command:

nodetool flush keyspace table

In integration tests, we often want to read the test data just after writing it. In that case, the cluster has not flushed yet, so the size estimates are not updated. It is worth mentioning that the official Cassandra Docker image contains the nodetool binary, so the flush can be done from within the container using Testcontainers with the code below:

cassandraContainer.execInContainer("nodetool", "flush", keyspace, table);
Under the hood, nodetool issues a local JMX call, and local JMX is enabled by default on a Cassandra cluster.
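
For robustness, it can be worth checking the command result. Here is a small sketch of my own (not part of the connector), relying on the ExecResult that Testcontainers' execInContainer returns:

import org.testcontainers.containers.CassandraContainer;
import org.testcontainers.containers.Container.ExecResult;

void flushTable(CassandraContainer<?> cassandraContainer, String keyspace, String table)
    throws Exception {
  // triggers the Memtable flush so that system.size_estimates gets updated
  ExecResult result = cassandraContainer.execInContainer("nodetool", "flush", keyspace, table);
  if (result.getExitCode() != 0) {
    throw new IllegalStateException("nodetool flush failed: " + result.getStderr());
  }
}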

Conclusion


I guess this article is mostly useful for Cassandra connector authors or DevOps people. I hope you enjoyed reading.