Could Cassandra be optimized to store time-series data? This is a question that has become a common topic for discussion. I recently came across a great read that compares Cassandra performance to another dedicated time-series database, TimescaleDB.
TimescaleDB is an open-source SQL database that has been engineered from PostgreSQL. While this solution will still require constant database management (see previous articles comparing SQL time-series databases to Canary), it does highlight some serious considerations before going down a Cassandra path. On a side note, please recognize this article was written by a team member of TimescaleDB and is based on their internal testing only.
Excerpts from the original article:
How a 5 node TimescaleDB cluster outperforms 30 Cassandra nodes, with higher inserts, up to 5800x faster queries, 10% the cost, a more flexible data model, and of course, full SQL.
Cassandra does not work well with large batch inserts. In fact, batching as a performance optimization is explicitly discouraged due to bottlenecks on the coordinator node if the transaction hits many partitions. Cassandra’s default maximum batch size setting is very small at 5KB.
Even a 30 node Cassandra cluster performs nearly 27% slower for inserts... we now see that Cassandra needs well over 10x (probably closer to 15x) the resources to achieve similar write rates.
Bringing multiple rollups (across both time and device) into the mix starts to make both databases sweat, but TimescaleDB has a huge advantage over Cassandra, especially when it comes to rolling up multiple metrics.
In conclusion, the TimescaleDB team found:
Cassandra’s turnkey write scalability comes at a steep cost. For all but the simplest rollup queries, our benchmarks show TimescaleDB with a large advantage, with average query times anywhere from 10 to 5,873 times faster for common time-series queries. While Cassandra’s clustered wide rows provide good performance for querying data for a single key, it quickly degrades for complex queries involving multiple rollups across many rows.
Additionally, while Cassandra makes it easy to add nodes to increase write throughput, it turns out you often just don’t need to do that for TimescaleDB. With 10–15x the write throughput of Cassandra, a single TimescaleDB node with a couple of replicas for high availability is more than adequate for dealing with workloads that would require a 30+ node fleet of Cassandra instances to handle.
However, Cassandra’s scaling model does offer nearly limitless storage since adding more storage capacity is as simple as adding another node to the cluster.
What are your thoughts? Have you been able to successfully use Cassandra for time-series data storage with strong insert and query performance?
If you would like to learn more about the Canary data historian and how we handle both the writing and reading of time-series data, including how you can perform SQL queries on that data, please let us know!
An additional read that you might find interesting that profiles the performance of a Canary data historian or a single node that could handle a million inserts per second without a performance decline.