Testing Cassandra
My goal with this testing was to use cassandra-stress to first write data to the cluster with QUORUM consistency while simultaneously deleting one of the pods in each datacenter, and see that Kubernetes restarted the deleted pods and that the cluster came back to health. Afterwards, I would re-run cassandra-stress in read mode to show that no data was lost.
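For context on why one deleted pod per datacenter should be survivable: with a replication factor of 3 in each of the two datacenters, every partition has six replicas, and a cluster-wide QUORUM needs a majority of all of them. A quick sketch of the arithmetic (the numbers mirror the keyspace definition used below):

```shell
# Illustrative arithmetic only: quorum size for RF=3 in each of two DCs.
rf_total=$((3 + 3))              # 6 replicas per partition in total
quorum=$(( rf_total / 2 + 1 ))   # QUORUM = floor(total_rf / 2) + 1
echo "quorum=$quorum"            # prints quorum=4
```

So with one pod deleted in each datacenter, four of the six replicas remain available, and QUORUM reads and writes can still succeed.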
I first created a keyspace using NetworkTopologyStrategy:
$ kubectl exec -it cassandra-0 --namespace=azure-1 -- /bin/bash
root@cassandra-0:/# cqlsh
Connected to Cassandra at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.4 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh> CREATE KEYSPACE "keyspace1" WITH replication =
   ... { 'class': 'NetworkTopologyStrategy', 'AWS-1': 3, 'AZURE-1': 3 };
cqlsh> exit
Then I ran cassandra-stress against the cluster, first writing 100K keys:
root@cassandra-0:/# cassandra-stress write n=100000 cl=quorum -mode native cql3
******************** Stress Settings ********************
Command:
  Type: write
  Count: 100,000
  No Warmup: false
  Consistency Level: QUORUM
  Target Uncertainty: not applicable
Key Size (bytes): 10
Counter Increment Distibution: add=fixed(1)
...
Connected to cluster: Cassandra, max pending requests per connection 128, max connections per host 8
Datatacenter: AWS-1; Host: /10.9.1.22; Rack: Rack1
Datatacenter: AZURE-1; Host: localhost/127.0.0.1; Rack: Rack1
Datatacenter: AZURE-1; Host: /10.19.1.30; Rack: Rack1
Datatacenter: AWS-1; Host: /10.9.1.141; Rack: Rack1
Datatacenter: AZURE-1; Host: /10.19.1.24; Rack: Rack1
Datatacenter: AWS-1; Host: /10.9.2.213; Rack: Rack1
Created keyspaces. Sleeping 1s for propagation.
Sleeping 2s...
Warming up WRITE with 25000 iterations...
Failed to connect over JMX; not collecting these stats
Running WRITE with 200 threads for 100000 iteration
Failed to connect over JMX; not collecting these stats
type   total ops,  op/s,  pk/s, row/s,  mean,   med,   .95,   .99,  .999,   max,  time, stderr, errors, gc: #, max ms, sum ms, sdv ms,  mb
total,       387,   387,   387,   387, 128.8, 126.5, 215.6, 275.5, 407.1, 407.1,   1.0, 0.00000,     0,     0,      0,      0,      0,   0
total,      1487,  1100,  1100,  1100, 182.9, 159.9, 334.0, 446.2, 494.4, 498.9,   2.0, 0.38295,     0,     0,      0,      0,      0,   0
total,      3316,  1829,  1829,  1829, 118.1, 119.4, 189.9, 255.1, 300.4, 300.7,   3.0, 0.26751,     0,     0,      0,      0,      0,   0
...
Results:
Op rate                   :    1,855 op/s  [WRITE: 1,855 op/s]
Partition rate            :    1,855 pk/s  [WRITE: 1,855 pk/s]
Row rate                  :    1,855 row/s [WRITE: 1,855 row/s]
Latency mean              :    106.4 ms    [WRITE: 106.4 ms]
Latency median            :     91.0 ms    [WRITE: 91.0 ms]
Latency 95th percentile   :    233.2 ms    [WRITE: 233.2 ms]
Latency 99th percentile   :    386.9 ms    [WRITE: 386.9 ms]
Latency 99.9th percentile :    619.2 ms    [WRITE: 619.2 ms]
Latency max               :    855.6 ms    [WRITE: 855.6 ms]
Total partitions          :  100,000       [WRITE: 100,000]
Total errors              :        0       [WRITE: 0]
Total GC count            : 0
Total GC memory           : 0.000 KiB
Total GC time             :    0.0 seconds
Avg GC time               :    NaN ms
StdDev GC time            :    0.0 ms
Total operation time      : 00:00:53
END
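The "Total errors" line is the figure that matters for this test, and when scripting runs like this it can be pulled out of the summary rather than eyeballed. A minimal sketch (the sample line is copied from the results above; the parsing is my own and not part of cassandra-stress):

```shell
# Extract the error count from a cassandra-stress summary line.
line='Total errors              :        0       [WRITE: 0]'
# Take the number after the last colon-and-whitespace in the line.
errors=$(echo "$line" | sed -E 's/.*:[[:space:]]*([0-9]+).*/\1/')
echo "errors=$errors"   # prints errors=0
```

Exiting non-zero when the count is greater than zero would make this usable as a gate in an automated soak test.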
During this test, I opened another terminal and deleted one of the pods from each datacenter:
$ kubectl config use-context kube-aws
Switched to context "kube-aws".
$ kubectl delete pod cassandra-1 --namespace=aws-1
pod "cassandra-1" deleted
$ kubectl config use-context kube-azure
Switched to context "kube-azure".
$ kubectl delete pod cassandra-1 --namespace=azure-1
pod "cassandra-1" deleted
Kubernetes restarted the deleted pods automatically. One was up and running during the test, and the other took a couple of minutes to come up after the test completed. nodetool status showed that the new pods seamlessly took the place of the deleted ones.
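Rather than reading nodetool status by hand, the state codes in its first column can be checked mechanically: nodetool prints U/D for up/down followed by N/L/J/M for normal/leaving/joining/moving. A minimal sketch, using fabricated sample output in place of a live `kubectl exec cassandra-0 -- nodetool status` call:

```shell
# Count nodes reporting Down (status lines starting with "D").
# The sample output below is made up for illustration only.
status='Datacenter: AWS-1
==================
--  Address     Load      Tokens  Owns  Rack
UN  10.9.1.22   104 KiB   256     ?     Rack1
UN  10.9.1.141  104 KiB   256     ?     Rack1
UN  10.9.2.213  104 KiB   256     ?     Rack1'
down=$(printf '%s\n' "$status" | grep -cE '^D[NLJM] ' || true)
echo "down=$down"   # prints down=0 when every node is Up
```

Looping on this until the count hits zero is one way to wait for a restarted pod to finish rejoining before starting the next test run.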
I then ran a read test of 100K keys, which completed with no errors, showing that no data was lost. During this test I also killed one of the nodes, and it came back up and rejoined the cluster before the test finished.
root@cassandra-0:/# cassandra-stress read n=100000 cl=quorum -mode native cql3
******************** Stress Settings ********************
Command:
  Type: read
  Count: 100,000
  No Warmup: false
  Consistency Level: QUORUM
  Target Uncertainty: not applicable
Key Size (bytes): 10
Counter Increment Distibution: add=fixed(1)
...
id, type         total ops,  op/s,  pk/s, row/s,  mean,   med,   .95,   .99,  .999,    max,  time,  stderr, errors, gc: #, max ms, sum ms, sdv ms,  mb
4 threadCount,   READ,  100000,  690,  690,  690,   5.8,   4.5,  12.8,  24.4,  55.9,  152.4, 144.9, 0.02011, 0, 0, 0, 0, 0, 0
4 threadCount,   total, 100000,  690,  690,  690,   5.8,   4.5,  12.8,  24.4,  55.9,  152.4, 144.9, 0.02011, 0, 0, 0, 0, 0, 0
8 threadCount,   READ,  100000, 1196, 1196, 1196,   6.6,   5.6,  13.1,  26.3,  55.3,  156.1,  83.6, 0.01719, 0, 0, 0, 0, 0, 0
8 threadCount,   total, 100000, 1196, 1196, 1196,   6.6,   5.6,  13.1,  26.3,  55.3,  156.1,  83.6, 0.01719, 0, 0, 0, 0, 0, 0
16 threadCount,  READ,  100000, 1632, 1632, 1632,   9.7,   8.3,  19.1,  35.5,  70.6,  173.7,  61.3, 0.01601, 0, 0, 0, 0, 0, 0
16 threadCount,  total, 100000, 1632, 1632, 1632,   9.7,   8.3,  19.1,  35.5,  70.6,  173.7,  61.3, 0.01601, 0, 0, 0, 0, 0, 0
24 threadCount,  READ,  100000, 1951, 1951, 1951,  12.2,  10.5,  23.5,  46.5,  88.6,  165.2,  51.3, 0.01996, 0, 0, 0, 0, 0, 0
24 threadCount,  total, 100000, 1951, 1951, 1951,  12.2,  10.5,  23.5,  46.5,  88.6,  165.2,  51.3, 0.01996, 0, 0, 0, 0, 0, 0
36 threadCount,  READ,  100000, 1898, 1898, 1898,  18.8,  15.3,  40.5,  71.7, 153.9,  300.7,  52.7, 0.03235, 0, 0, 0, 0, 0, 0
36 threadCount,  total, 100000, 1898, 1898, 1898,  18.8,  15.3,  40.5,  71.7, 153.9,  300.7,  52.7, 0.03235, 0, 0, 0, 0, 0, 0
54 threadCount,  READ,  100000, 1779, 1779, 1779,  29.9,  25.0,  62.3, 121.8, 192.4,  258.1,  56.2, 0.02635, 0, 0, 0, 0, 0, 0
54 threadCount,  total, 100000, 1779, 1779, 1779,  29.9,  25.0,  62.3, 121.8, 192.4,  258.1,  56.2, 0.02635, 0, 0, 0, 0, 0, 0
81 threadCount,  READ,  100000, 2539, 2539, 2539,  31.6,  27.5,  62.8, 104.2, 170.1,  239.3,  39.4, 0.01962, 0, 0, 0, 0, 0, 0
81 threadCount,  total, 100000, 2539, 2539, 2539,  31.6,  27.5,  62.8, 104.2, 170.1,  239.3,  39.4, 0.01962, 0, 0, 0, 0, 0, 0
121 threadCount, READ,  100000, 3035, 3035, 3035,  39.2,  34.6,  75.2, 131.8, 231.9,  359.1,  32.9, 0.02576, 0, 0, 0, 0, 0, 0
121 threadCount, total, 100000, 3035, 3035, 3035,  39.2,  34.6,  75.2, 131.8, 231.9,  359.1,  32.9, 0.02576, 0, 0, 0, 0, 0, 0
181 threadCount, READ,  100000, 3307, 3307, 3307,  54.6,  48.3, 106.4, 188.0, 266.9,  359.1,  30.2, 0.02941, 0, 0, 0, 0, 0, 0
181 threadCount, total, 100000, 3307, 3307, 3307,  54.6,  48.3, 106.4, 188.0, 266.9,  359.1,  30.2, 0.02941, 0, 0, 0, 0, 0, 0
271 threadCount, READ,  100000, 3688, 3688, 3688,  72.2,  65.8, 140.2, 206.6, 299.4,  391.9,  27.1, 0.08923, 0, 0, 0, 0, 0, 0
271 threadCount, total, 100000, 3688, 3688, 3688,  72.2,  65.8, 140.2, 206.6, 299.4,  391.9,  27.1, 0.08923, 0, 0, 0, 0, 0, 0
406 threadCount, READ,  100000, 3690, 3690, 3690, 108.6,  98.6, 218.8, 338.2, 498.3,  676.9,  27.1, 0.09736, 0, 0, 0, 0, 0, 0
406 threadCount, total, 100000, 3690, 3690, 3690, 108.6,  98.6, 218.8, 338.2, 498.3,  676.9,  27.1, 0.09736, 0, 0, 0, 0, 0, 0
609 threadCount, READ,  100000, 4339, 4339, 4339, 137.7, 127.7, 265.0, 370.9, 515.1,  608.7,  23.0, 0.05923, 0, 0, 0, 0, 0, 0
609 threadCount, total, 100000, 4339, 4339, 4339, 137.7, 127.7, 265.0, 370.9, 515.1,  608.7,  23.0, 0.05923, 0, 0, 0, 0, 0, 0
913 threadCount, READ,  100000, 4269, 4269, 4269, 209.3, 188.1, 431.8, 577.2, 751.8, 1044.9,  23.4, 0.50210, 0, 0, 0, 0, 0, 0
913 threadCount, total, 100000, 4269, 4269, 4269, 209.3, 188.1, 431.8, 577.2, 751.8, 1044.9,  23.4, 0.50210, 0, 0, 0, 0, 0, 0
END
While this is not any kind of definitive test of production readiness, it shows that the Cassandra cluster does function normally as multiple StatefulSets across independent Kubernetes clusters and cloud providers, and can recover from node failures.
It would be great to hear if anyone else has deployed Cassandra across cloud providers in this or any other way, and (as I am relatively new to Cassandra) if there are any important considerations I have missed.