Monday, September 24, 2012

Riak Cluster, Sanity Check, Part Two

In this post, we continue the sanity check of our Riak cluster.

Setting up Riak Cluster

I have used HBase, Cassandra, and MongoDB in the past, so I was bracing for a long, laborious effort to set up a five-node Riak cluster, but I was pleasantly surprised by how easy it was. I also found the command-line tools extremely helpful for diagnosing cluster status.

Sanity Check

Our sanity check consists of loading 20 million objects into the cluster and then executing a series of GET, link-walking, and free-text search operations across 10 concurrent threads, while physical nodes are brought down and back up in the middle of the performance test.
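For concreteness, here is a minimal sketch of the kind of query harness we mean. It hits Riak's stock HTTP resources (object fetch, link walking, and the Solr-style search endpoint) rather than going through the Java client we actually used, and the host name, bucket, search index, link tag, and key scheme are placeholders made up for illustration:

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ThreadLocalRandom;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicInteger;

    public class RiakSanityCheck {

        // Assumption: the cluster's HTTP interface (default port 8098), or a proxy in front of it.
        private static final String BASE = "http://riak.example.com:8098";
        private static final AtomicInteger FAILURES = new AtomicInteger();

        // Issue one HTTP GET; anything other than a 2xx response counts as a failure.
        private static void get(String path) {
            try {
                HttpURLConnection conn = (HttpURLConnection) new URL(BASE + path).openConnection();
                conn.setConnectTimeout(5000);
                conn.setReadTimeout(10000);
                int status = conn.getResponseCode();
                if (status < 200 || status >= 300) {
                    FAILURES.incrementAndGet();
                }
            } catch (IOException e) {
                FAILURES.incrementAndGet();   // e.g. the node we hit was just shut down
            }
        }

        public static void main(String[] args) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(10);   // 10 concurrent test threads
            for (int t = 0; t < 10; t++) {
                pool.submit(() -> {
                    for (int i = 0; i < 100_000; i++) {
                        String key = "user" + ThreadLocalRandom.current().nextInt(20_000_000);
                        switch (i % 3) {
                            case 0:   // plain GET of a single object
                                get("/riak/users/" + key);
                                break;
                            case 1:   // link walking: follow links tagged "friend", one hop
                                get("/riak/users/" + key + "/_,friend,1");
                                break;
                            default:  // free-text search via the Solr-style search endpoint
                                get("/solr/users/select?q=name_s:joe");
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            System.out.println("failed requests: " + FAILURES.get());
        }
    }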

Using HAProxy for load balancing and failover

We couldn't get the Java client's cluster client to work for us, so we switched to HAProxy for failover and load balancing, and it has worked out really well for us.
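For context, the setup is conceptually just one HAProxy listener fronting the five nodes' HTTP interface, with Riak's /ping resource as the health check. The snippet below is an illustrative sketch, not our production configuration; host names, the bind port, and tuning are all placeholders:

    # Illustrative HAProxy listener fronting the five Riak nodes' HTTP interface (default port 8098)
    listen riak_http
        bind *:8098
        mode http
        balance roundrobin
        option httpchk GET /ping           # Riak answers this once its HTTP listener is up
        server riak1 riak1.example.com:8098 check
        server riak2 riak2.example.com:8098 check
        server riak3 riak3.example.com:8098 check
        server riak4 riak4.example.com:8098 check
        server riak5 riak5.example.com:8098 check

A similar listener in mode tcp can front the protocol buffers port for PB clients. Note that, as Brian's explanation below makes clear, a /ping-based check passes as soon as the HTTP listener is up, which is earlier than riak_kv is actually ready to serve requests.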

When we brought down a Riak node during the performance testing, some of the in-flight queries failed but succeeded immediately when we retried the same method call.
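The retry on our side is nothing more elaborate than a small wrapper around the query call; here is a sketch, with a made-up class name and illustrative attempt and back-off values:

    import java.util.concurrent.Callable;

    // Illustrative helper: re-run a failed query a few times before giving up.
    public final class Retry {

        public static <T> T withRetry(Callable<T> query, int attempts, long backoffMillis)
                throws Exception {
            for (int attempt = 1; ; attempt++) {
                try {
                    return query.call();              // e.g. a Riak GET issued through HAProxy
                } catch (Exception e) {
                    if (attempt >= attempts) {
                        throw e;                      // still failing after all attempts
                    }
                    Thread.sleep(backoffMillis);      // give HAProxy a moment to mark the node down
                }
            }
        }
    }

In our runs a single retry was always enough, presumably because HAProxy had already taken the dead node out of rotation by the time the retry went out.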

Test Result

Our sanity check performed flawlessly when physical nodes were brought down during testing. In-flight queries failed but recovered as soon as we retried the same query, and there was little degradation from losing a physical node.

However, a big surprise came when we brought a physical node back up during the performance test: the test ran roughly six times slower. Baffled, I reached out to Brian at Basho, and here is his explanation:
"The reason I ask is the http/pb API listener will start up before Riak KV has finished starting. So, while these nodes will be available for requests they will be slow and build up queues as they start up needed resources. You can use the riak-admin wait-for-service riak_kv <nodename> command from any other node to check if riak_kv is up."

My follow-up question:
"Yes, I have my performance test running, through a HA proxy that maps to 5 nodes. I understand that the initial response from the starting up node will be slow, but it is impacting the entire performance, which consists of 10 testing threads. I would rather have the node not accepting request until it is ready than accepting requests but queuing them up to drag down the cluster performance."

And Brian's response:
"I completely agree, there is currently work being done on separating the startup into two separate processes. One which will reply with 503 to all requests and then dies when KV is started and ready to take requests. Currently, even if a request is not sent to a node directly, in the case of a PUT, the data will be forwarded to the node even though KV is not ready to accept it. Depending on your W quorum this could result in a long response time for requests as the coordinating node waits for a response from the starting up node.  Currently there is no method to make a node 'invisible' to other nodes in the cluster without bringing it down or changing the cookie in vm.args."

Summary

We are pleased with our overall sanity check results, but there are definitely kinks that need to be worked out, and I am glad the Riak team is on top of things.

Riak Sanity Check, Take One

Before we could officially recommend Riak to the management team, we executed a series of sanity checks to make sure we were making a sound recommendation that wouldn't come back to bite us.

Testing Scenario

Our sanity testing environment consists of 20 million objects running on a single physical node. After the 20 million objects were loaded into Riak 1.1, we executed a series of GET, link-walking, and free-text search operations with 10 concurrent threads.

We were not interested in the actual performance numbers; we were looking for obvious bottlenecks and abnormal behavior.
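For reference, the load itself is conceptually just a long PUT loop against the node's HTTP interface. The sketch below is illustrative only; the bucket, key scheme, and JSON payload are made up, and we actually loaded the data through the Java client:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class RiakBulkLoad {

        public static void main(String[] args) throws Exception {
            // Assumption: Riak 1.1's HTTP interface on the single test node, default port 8098.
            String base = "http://riak-test.example.com:8098/riak/users/";

            for (int i = 0; i < 20_000_000; i++) {
                byte[] body = ("{\"id\":" + i + ",\"name\":\"user" + i + "\"}").getBytes("UTF-8");

                HttpURLConnection conn =
                        (HttpURLConnection) new URL(base + "user" + i).openConnection();
                conn.setRequestMethod("PUT");
                conn.setDoOutput(true);
                conn.setRequestProperty("Content-Type", "application/json");
                try (OutputStream out = conn.getOutputStream()) {
                    out.write(body);                  // store one JSON object per key
                }
                conn.getResponseCode();               // read the status so the request completes
            }
        }
    }

As the diagnosis below explains, an uninterrupted write burst like this is exactly what left LevelDB with a large compaction backlog once the load finished.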

Test Result

We were surprised at the poor results exhibited by our sanity checks: Riak consumed 100% of CPU and 75% of memory, and queries did not return in any reasonable time.

Needless to say, we were somewhat concerned. This is where Riak's excellent support came into play. Riak's Developer Advocate Brian worked with us and came up with an excellent diagnosis:

"LevelDB holds data in a "young level" before compacting this data into numbered levels (Level0, Level1, Level2….) who's maximum space grow as their number iterates. In pre Riak 1.2 levelDB, the compaction operation from the young level to sorted string table's was a blocking operation. Compaction and read operations used the same scheduler thread to do their work so if a compaction operation was occurring that partition would be effectively locked. Riak 1.2 moves compaction off onto it's own scheduler thread so the vnode_worker (who is responsible for GET/PUT operations) will not be waiting for compaction to complete before read requested can be serviced by the levelDB backend (write requests are dropped in the write buffer, independent of compaction).

In your scenario, the bulk load operation caused a massive amount of compaction. Most likely, because of the large amount of objects you loaded, there was compaction occurring on all 64 partitions for a while after your write operation completed. The input data sat in the write buffer and young level and eventually compaction moved them to their appropriate levels but during this time read operations will timeout.  "

So our bulk loading led to contention on a single scheduler thread, which was responsible for both compaction and read requests at the same time.

We can't upgrade to 1.2 just yet because of a bug in Search, so we will have to wait for the next release.

Summary

Overall, we are not pleased with our sanity check results, but we are satisfied with the explanation.