How to find the total space occupied by a cassandra keyspace?
I use nodetool status <keyspace>. The Load column value is roughly the same as the value I get from df -h (my Cassandra installations are on different partitions than the system).
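If you want a per-keyspace number rather than a per-partition one, you can also add up the files in the keyspace's data directory yourself. Below is a minimal Python sketch of that idea. The function names are my own, and it skips directories named snapshots and backups on the assumption that you only want live SSTables (snapshot files are hard links and would otherwise be counted a second time).

```python
import os


def directory_size_bytes(path, skip_dirs=("snapshots", "backups")):
    """Return the total size in bytes of all regular files under path.

    Directories whose name is in skip_dirs are not descended into.
    """
    total = 0
    for root, dirs, files in os.walk(path):
        # Prune skipped directories in place so os.walk does not enter them.
        dirs[:] = [d for d in dirs if d not in skip_dirs]
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def human_readable(num_bytes):
    """Format a byte count using 1024-based units."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if num_bytes < 1024 or unit == "TB":
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1024
```

You would call it with the path to your keyspace directory, for example human_readable(directory_size_bytes("/var/lib/cassandra/data/my_keyspace")). Both the path and the keyspace name here are placeholders; use whatever data_file_directories points to in your cassandra.yaml.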
What is Compaction?
SSTables are immutable: once a memtable is flushed to disk, the resulting SSTable remains unchanged until it is deleted (expired) or compacted. Compaction is the process of merging SSTables together. This matters when your workload is update-heavy and several versions of a CQL row may be stored across your SSTables (see the SSTables-per-read figures in nodetool cfhistograms). When you read that row, Cassandra may have to scan multiple SSTables to find the latest version of the data (in C*, last write wins). Compaction can temporarily take up additional space on disk (especially size-tiered compaction, which may need up to 50% of your data size while compacting; this is a theoretical maximum), so it is important to keep free disk space. However, compaction will not take data away from your keyspace directory, so it is not the reason your data seems to be missing.
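To make the "last write wins" point concrete, here is a toy Python model, not Cassandra's actual implementation. It assumes each SSTable is just a dict mapping a row key to a (write_timestamp, value) pair, and it ignores tombstones, per-cell timestamps and tie-breaking. The names compact and read are made up for illustration.

```python
def read(sstables, key):
    """Return the newest value for key, checking every SSTable that has it."""
    versions = [sstable[key] for sstable in sstables if key in sstable]
    if not versions:
        return None
    # Last write wins: pick the version with the highest timestamp.
    return max(versions, key=lambda version: version[0])[1]


def compact(sstables):
    """Merge SSTables into one, keeping only the newest version of each row."""
    merged = {}
    for sstable in sstables:
        for key, (timestamp, value) in sstable.items():
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    return merged
```

Before compaction, a read of an updated row has to look at every SSTable containing that key. After compaction there is a single SSTable holding one version per row, and the read returns the same answer from fewer files.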
Then where did my data go?
You're right to suspect that data which has not yet been flushed is sitting in memtables. It will be written out as SSTables as soon as your commit log fills up (default 1 GB in 2.0 or 8 GB in 2.1) or as soon as your memtables exceed memtable_total_space_in_mb.
If you want to see your data in sstables, you can flush it manually:
nodetool flush
and your memtables will be written to your keyspace directory as SSTables. Or just be patient and wait until you hit either the commit log or the memtable threshold.
But aren't cassandra writes durable?
Yes, your memtable data is also stored in the commit log. If your machine loses power, etc., the data that has been written is still persisted on disk, and the commit log will be replayed on startup!
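If it helps, the whole write path described above can be pictured with a small Python toy. This is only a sketch under simplifying assumptions: ToyNode and its methods are invented names, the flush threshold is a row count instead of a size in MB, and the commit log is a plain list that is cleared on flush, whereas real Cassandra manages commit log segments on disk.

```python
class ToyNode:
    """Toy model of a write path: commit log + memtable + SSTables."""

    def __init__(self, memtable_limit=3):
        self.commitlog = []   # stands in for the durable on-disk commit log
        self.memtable = {}    # in memory only, lost on a crash
        self.sstables = []    # stands in for immutable on-disk SSTables
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        # Every write goes to the commit log first, then to the memtable.
        self.commitlog.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Turn the memtable into a new SSTable; the flushed writes no
        # longer need to be replayed, so the toy commit log is cleared.
        if self.memtable:
            self.sstables.append(dict(self.memtable))
            self.memtable = {}
            self.commitlog = []

    def crash_and_restart(self):
        # The memtable is lost, then rebuilt by replaying the commit log.
        self.memtable = {}
        for key, value in self.commitlog:
            self.memtable[key] = value
```

Writing two rows to a ToyNode leaves sstables empty, which matches what you observed on disk. Calling crash_and_restart() brings both rows back from the commit log, and a third write crosses the threshold and produces the first SSTable.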