Friday, August 29, 2014

How to estimate Cassandra database processing capacity


- Use cases
  - Select group_id,ts_upd from my_table where pk_col = 'xxxxxxxxxxxxx';
  - Select group_id,ts_upd from my_table where index1_col = 1234;
  - Select group_id,ts_upd from my_table where index2_col = 1234;
  - Select group_id,ts_upd from my_table where index3_col = 1234;
  - All queries return 1 or 0 rows.
  - 80% of the time, only group_id is returned,
  - update ?
- Create a data model
  - pk_col VARCHAR(20),
  - index1_col VARCHAR(30),
  - index2_col VARCHAR(30),
  - index3_col VARCHAR(30),
  - group_id NUMBER(10),
  - ts_upd TIMESTAMP (8 bytes),
  - record size: 128 bytes
- Replication, high availability
  - Data distribution and replication
    - Strategy 1: one data center, 3 nodes, replication_factor = 3. Write Consistency Levels = 2
    - Strategy 2: two data centers, 3 nodes on each data center,
  - Murmur3Partitioner
  - Round((read 2 copies, or write 3 copies) / 3 nodes) = 1. The redundant work is spread across the 3 nodes.
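
The Round(...) line above can be read as a per-node workload estimate: a read touches 2 replicas, a write touches all 3, and that work is spread over 3 nodes. A small Python sketch of that arithmetic, assuming Strategy 1 and perfectly even distribution (the names are illustrative, not from any Cassandra tool):

```python
NODES = 3               # Strategy 1: one data center, 3 nodes
REPLICATION_FACTOR = 3  # every row is stored on all 3 nodes
READ_REPLICAS = 2       # assumption: a read contacts 2 replicas ("read 2 copy")

def replica_ops_per_node(replicas_touched: int, nodes: int = NODES) -> int:
    """Replica operations each node performs per client operation, rounded."""
    return round(replicas_touched / nodes)

read_load = replica_ops_per_node(READ_REPLICAS)        # round(2 / 3)
write_load = replica_ops_per_node(REPLICATION_FACTOR)  # round(3 / 3)
print(read_load, write_load)
```

Both values round to 1, which is why the later formulas do not add a separate replication factor term.
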
- Estimate Cassandra processing power with current price-performance sweet-spot hardware.
  - Vary the read/write ratio (100:0, 90:10, 0:100) and measure:
    - Transaction volume
    - Response time
  - Memory: 64GB, to ensure data is always in cache.
  - CPU: 8-core CPU processors
  - SSD: can provide P99.999 latency under 5 *milliseconds* regardless of RAM usage.
    - SATA spinning disks: hard drives give a wide range of latencies, easily up to 5 *seconds* at P99.999
- Basic operation time,
  - average read latency < 0.16 ms,  or 6250 reads/sec
  - average write latency < 0.025 ms, or 40,000 writes/sec
  - max latency < 5 ms at P99.999
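
The per-second figures above are the reciprocal of the average latency, i.e. they treat operations as strictly serial. A minimal Python sketch of that conversion (the function name `ops_per_second` is illustrative):

```python
def ops_per_second(avg_latency_ms: float) -> float:
    """Serial throughput implied by an average latency given in milliseconds."""
    return 1000.0 / avg_latency_ms

reads_per_sec = ops_per_second(0.16)    # average read latency from the list above
writes_per_sec = ops_per_second(0.025)  # average write latency from the list above

print(round(reads_per_sec), round(writes_per_sec))
```

Note the assumption baked in: one operation at a time, with no concurrency.
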
- Assumptions
  - 1/4 of queries go to each index (the primary key and the 3 secondary indexes).
  - Turn off key cache and row cache.
  - Distributed index and materialized view (MV) data models mean more code to maintain.
  - Sizing overhead
    - Column size = 15 + size of name (10) + size of value; use short column names.
    - row overhead = 23
    - primary key index size = 32 + average_key_size
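
The sizing rules above can be combined into a per-row byte count. The Python sketch below does this and reproduces the Storage Size formulas quoted under index options. It assumes 3 base columns and 2 columns per index row, 10-byte column names, a 20-byte primary key and 30-byte index keys, 60 million rows, and 1 GB = 10^9 bytes. The final division by 3 is copied from those formulas as written. Function and constant names are illustrative.

```python
COLUMN_OVERHEAD = 15    # bytes per column, before name and value
COLUMN_NAME_SIZE = 10   # assumed column-name length
ROW_OVERHEAD = 23       # bytes per row
PK_INDEX_OVERHEAD = 32  # bytes per primary key index entry, plus the key itself

def row_bytes(n_columns: int, value_bytes: int, key_bytes: int) -> int:
    """Column overheads + values + row overhead + primary key index entry."""
    columns = (COLUMN_OVERHEAD + COLUMN_NAME_SIZE) * n_columns
    return columns + value_bytes + ROW_OVERHEAD + (PK_INDEX_OVERHEAD + key_bytes)

def storage_gb(rows: int, index_value_bytes: int,
               n_indexes: int = 3, divisor: int = 3) -> float:
    """Total size in GB; `divisor` mirrors the '/ 3' in the outline's formula."""
    base = row_bytes(n_columns=3, value_bytes=128, key_bytes=20)
    index = row_bytes(n_columns=2, value_bytes=index_value_bytes, key_bytes=30)
    return rows * (base + n_indexes * index) / divisor / 1e9

print(f"{storage_gb(60_000_000, 50):.1f} GB")  # index rows with 50-byte values
print(f"{storage_gb(60_000_000, 68):.1f} GB")  # materialized view rows, 68-byte values
```
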
- index options (see "Cassandra internal": http://www.wentnet.com/blog/?p=77)
  - Primary Key
    - Logical reads = 1,
  - Secondary index
    - (index column, primary key column), size of value: 50.
    - Logical reads = O(n) + 1 = 3 + 1 = 4;  n is number of nodes
    - Logical writes = 1 + 1 = 2;
    - 100% read : 6250 / (1 + 3 * 4) / 4 = 120 queries / second
    - 100% write : 40000 / (1 + 3 * 2) = 5714 rows / second
    - 90% read, 10% write:
      - 120 * 90% = 108 queries / second
      - 5714 * 10% = 571 rows / second
    - Storage Size: 60M * ((15+10)*3 + 128 + 23 + (32+20) + ((15+10)*2 + 50 + 23 + (32+30))*3) / 3 = 16.7GB
  - Distributed index.
    - (index column, primary key column), size of value: 50.
    - Logical reads = 1 + 1 = 2;
    - Logical writes = 1 + 3 = 4;
    - 100% read : 6250 / (1 + 3 * 2) / 4 = 223 queries/second
    - 100% write : 40000 / (1 + 3 * 4) = 3077 rows/second
    - 90% read, 10% write:
      - 223 * 90% = 201 queries/second
      - 3077 * 10% = 308 rows/second
    - Storage Size: 60M * ((15+10)*3 + 128 + 23 + (32+20) + ((15+10)*2 + 50 + 23 + (32+30))*3) / 3 = 16.7GB
  - Materialized View.
    - (index column, primary key column, group_id, ts_upd), size of value: 68.
    - Logical reads = 1
    - Logical writes = 1 + 3 = 4;
    - Row size = (140 + 32) * 4 = 688
    - 100% read : 6250 / (1 + 3 * 1) / 4 = 391 queries/second
    - 100% write : 40000 / (1 + 3 * 4) = 3077 rows/second
    - 90% read, 10% write:
      - 391 * 90% = 352 queries/second
      - 3077 * 10% = 308 rows/second
    - Storage Size: 60M * ((15+10)*3 + 128 + 23 + (32+20) + ((15+10)*2 + 68 + 23 + (32+30))*3) / 3 = 17.7GB
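
The throughput lines for the three index options all follow the same two formulas, so they can be checked with a short script. This Python sketch reproduces the outline's arithmetic term by term. It takes the base rates of 6250 reads/second and 40,000 writes/second, 3 indexes, and 4 query types, and rounds at the same points the outline does. It is a check of the arithmetic, not a benchmark, and all names are illustrative.

```python
READS_PER_SEC = 6250     # basic read rate from the outline
WRITES_PER_SEC = 40_000  # basic write rate from the outline
N_INDEXES = 3
N_QUERY_TYPES = 4        # primary key lookup plus one query per index

# option -> (logical reads per index query, logical writes per index)
OPTIONS = {
    "secondary index": (4, 2),
    "distributed index": (2, 4),
    "materialized view": (1, 4),
}

def read_qps(logical_reads: int) -> float:
    """100% read throughput, following the outline's formula as written."""
    return READS_PER_SEC / (1 + N_INDEXES * logical_reads) / N_QUERY_TYPES

def write_rps(logical_writes: int) -> float:
    """100% write throughput, following the outline's formula as written."""
    return WRITES_PER_SEC / (1 + N_INDEXES * logical_writes)

def estimate(option: str, read_share: float, write_share: float):
    """Scale the rounded 100% figures by the given read and write shares."""
    logical_reads, logical_writes = OPTIONS[option]
    reads = round(read_qps(logical_reads))
    writes = round(write_rps(logical_writes))
    return round(reads * read_share), round(writes * write_share)

for name in OPTIONS:
    # (1.0, 1.0) yields the separate 100% read and 100% write figures
    print(name, estimate(name, 1.0, 1.0), estimate(name, 0.9, 0.1))
```
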

- Oracle database processing power
  - Up to 2000 queries per second; updates 20 million rows a day.
  - query latency:
    - 99% < 0.01 second
    - 99.99% < 0.2 second
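
To compare the daily update volume with per-second figures, it can be converted to an average rate. A small Python sketch, assuming updates are spread evenly over 24 hours (if the load is bursty, peak rates would be higher):

```python
SECONDS_PER_DAY = 24 * 60 * 60

def per_second(rows_per_day: int) -> float:
    """Average rows per second for a given daily volume."""
    return rows_per_day / SECONDS_PER_DAY

print(f"{per_second(20_000_000):.0f} rows/second on average")
```
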

- Reference
  - http://planetcassandra.org/nosql-performance-benchmarks/#EndPoint
  - http://www.datastax.com/dev/blog/datastax-sandbox-2-0-now-available
  - http://www.stackdriver.com/cassandra-aws-gce-rackspace/
  - http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningHardware_c.html


Thanks,
Charlie
木匠 | Database Architect Developer

Tuesday, August 26, 2014

Estimate Cassandra processing power with current price-performance sweet-spot hardware

Hi Cassandra database experts,

Can you help us estimate Cassandra's processing power based on the hardware and cluster configuration below?
This is current price-performance sweet-spot hardware.

Data Model:  Only one table

create table table_1
(
 key_1 varchar(30),
 key_2 varchar(30),
 key_3 varchar(30),
 key_4 varchar(30),
 col_1 varchar(30),
 col_2 varchar(30),
 col_3 varchar(500),
 primary key (key_1)
);

Besides the primary key index, there are 3 indexes, on columns key_2, key_3 and key_4 respectively.

There are 60 million rows.
The average row length is 500 bytes.

Memory: 16GB to 64GB
CPU: 8-core CPU processors
Disk:
- SSD (solid state drives): Size ?
- SATA spinning disks: Size ?

  • Data Model 1: One base table with 3 indexes
  • Data Model 2: One base table and 3 Materialized View tables.
  • Data distribution and replication
    • Strategy 1: one data center, 3 nodes, replication_factor = 3. Write Consistency Levels = 2
    • Strategy 2: two data centers, 3 nodes on each data center,

The final matrix will look like this:

| Read/Write operation pattern | Max throughput | Response time, 99% of reads | Response time, 99.99% of reads | Response time, 99% of writes | Response time, 99.99% of writes |
|---|---|---|---|---|---|
| 100% read | ? reads/second | < ? seconds | < ? seconds | | |
| 99% read, 1% write | ? reads/second, ? writes/second | < ? seconds | < ? seconds | < ? seconds | < ? seconds |
| 90% read, 10% write | ? reads/second, ? writes/second | < ? seconds | < ? seconds | < ? seconds | < ? seconds |
| 50% read, 50% write | ? reads/second, ? writes/second | < ? seconds | < ? seconds | < ? seconds | < ? seconds |

Disk storage size: ? GB.

Please help fill in each ? with an estimated number.
If you could tell us how you calculated these numbers, that would be even better.


Thanks,
Charlie 木匠 | Database Architect Developer