RethinkDB 2.0 Is Amazing

April 17, 2015 · RethinkDB · NoSQL

I just turned in my next course for Pluralsight: RethinkDB Fundamentals - almost in time for their announcement that 2.0 has been released. This is very exciting - it's a "production-ready" release which means that all the work they've done over the last 3+ years is ready to go, and it's mind-blowing.

Dabbling

I have been using RethinkDB (in a hobby-ish way) for the last two or so years - in fact I featured it in a TekpubTV video and used it for a few small things, including one of my many tekpub.com rewrites.

The web-based administrative screen was impressive, as was its "ubiquitous" functional query language (ReQL) and its "durable by default" setting (meaning it writes your data to disk before acknowledging the write). Sharding and replication blew my mind - but a few small things were missing. Things like date/time support, regex/fuzzy text searches and, for lack of a better phrase: a track record.

I'm more than willing to help with the track record, but I was a bit gun-shy after MongoDB soiled the bed with tekpub.com (early days, lost data, crashes, etc.), so I decided not to bet my business on any storage technology that wasn't somewhat battle-tested. That was two years ago.

I hadn't done much with it since then - devoting most of my time to working with Postgres and seeing what could be done with the new jsonb data type. And then Nate Kohari tweeted this:

I knew betting on @rethinkdb was a good move. http://t.co/RP8nvcKXaz

— Nate Kohari (@nkohari) January 27, 2015

Nate was excited about change feeds in RethinkDB - essentially managed pub/sub from your database. Not revolutionary, sure, but a very compelling addition to what RethinkDB offers you.

And then they announced a "Production Ready" release candidate: 2.0 - and I was all in. I plugged it into the project I'm working on and I'm not looking back.

RethinkDB 2.0 Is Impressive, Immediately

The features from 2 years ago are essentially the same, with the addition of much-needed things (date/time support, regex matching, etc) as well as some sweetness (change feeds). These additions aren't what impressed me the most.

It was the polish and speed. I can't tell you how much this impressed me: they didn't go off and geek out on features, they focused squarely on stabilizing and "getting it right".

I was able to optimize a ridiculous query using a join on two tables (did I mention RethinkDB supports joins... across your cluster?) with 100,000 records, filtering on a single text field. In a relational engine (like Postgres) an unindexed query like this would be horrible, and with RethinkDB it was horrible as well - coming in at over 16 seconds.

UPDATE: above I mentioned that I filtered on a "fuzzy text match" and I was asked about this. In looking over the demo that I put together, the filter *was not a fuzzy match*. It was, instead, a straight text match. Egg on face. I have a followup post detailing this.

I then applied two secondary indexes and changed the join type (from innerJoin to eqJoin) and the result came back in 9ms. That's faster than you can blink... a fuzzy text search over a joined result set of over 200,000 records.
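To see why swapping innerJoin for eqJoin makes such a difference, here's a rough in-memory sketch (hypothetical data and helper names, not RethinkDB internals): innerJoin compares every row on one side against every row on the other, while eqJoin resolves each row through a keyed index lookup.

```javascript
// Hypothetical sketch of why eqJoin beats innerJoin:
// innerJoin compares every left row against every right row (O(n*m)),
// while eqJoin looks each left row up in an index (roughly O(n)).
const albums = Array.from({ length: 1000 }, (_, i) => ({ id: i, vendor_id: i % 100 }));
const vendors = Array.from({ length: 100 }, (_, i) => ({ id: i, name: "vendor-" + i }));

// innerJoin-style: nested loops with an arbitrary predicate
function innerJoin(left, right, pred) {
  const out = [];
  for (const l of left)
    for (const r of right)
      if (pred(l, r)) out.push({ left: l, right: r });
  return out;
}

// eqJoin-style: build a map on the join key once, then do O(1) lookups
function eqJoin(left, keyFn, right) {
  const index = new Map(right.map(r => [r.id, r]));
  return left
    .map(l => ({ left: l, right: index.get(keyFn(l)) }))
    .filter(pair => pair.right !== undefined);
}

const slow = innerJoin(albums, vendors, (a, v) => a.vendor_id === v.id);
const fast = eqJoin(albums, a => a.vendor_id, vendors);
console.log(slow.length === fast.length); // same joined rows, far fewer comparisons
```

Both produce identical results; the nested-loop version just did 100,000 comparisons to get there, which is the shape of the 16-second-to-9ms jump.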

What About Postgres?

So I went all-in, and with each day I found something more to love - I'll share some of it with you now. But first, if you read my blog you're probably wondering about Postgres, and whether I'm just "chasing something shiny and new". Which is a reasonable thing to wonder.

Yes, I am chasing something shiny and new - that's not really new. As I mentioned, I've been dabbling with RethinkDB (and a few other DBs too!) for a few years and, generally, I have the ability to like more than one thing. Hopefully you'll see why in a second, but I'll get right to it: as far as document storage goes, nothing I've used can touch RethinkDB.

Usability, administration, platform support, reliability and speed - what I've seen so far is, simply, the best document storage option available. Maybe you'll agree, maybe you won't; either way, let's see some details.

Reason 1: A Functional Query Language

RethinkDB's query language isn't a set of arbitrary commands put together into a string. It's functional:

It’s no secret that ReQL, the RethinkDB query language, is modeled after functional languages like Lisp and Haskell. The functional paradigm is particularly well suited to the needs of a distributed database while being more easily embeddable as a DSL than SQL’s ad hoc syntax...

While at first seeming a tad verbose, once you get your eyes used to it you start to see the power. Consider the Chinook Database - a sample database with albums, recording artists and invoice data. I imported some of it into a table called "catalog":

(Screenshot: the Chinook data imported into the "catalog" table)

Each record is an album, each album has "details" which are the tracks. Let's say I want to filter all albums that contain a track with a media_type_id equal to 2:

r.db("music").table("catalog").filter(function(album){
  return album("details").contains(function(track){
    return track("media_type_id").eq(2)
  })
})

This reads just like Javascript, doesn't it? At first this seems a little off-putting, at least it was for me. It seemed like an awful lot of noise just to run a simple query.
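If it helps to see why it reads like JavaScript, here's the same shape run over a plain in-memory array (hypothetical sample data; no RethinkDB required):

```javascript
// Hypothetical in-memory version of the catalog query above:
// keep albums that contain at least one track with media_type_id === 2.
const catalog = [
  { title: "Album A", details: [{ media_type_id: 1 }, { media_type_id: 2 }] },
  { title: "Album B", details: [{ media_type_id: 1 }] },
  { title: "Album C", details: [{ media_type_id: 2 }] }
];

const matches = catalog.filter(album =>
  album.details.some(track => track.media_type_id === 2)
);

console.log(matches.map(a => a.title)); // [ 'Album A', 'Album C' ]
```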

Here's the SQL if you're using Postgres, assuming three separate tables:

select catalog.* from catalog
inner join details on details.catalog_id = catalog.id
inner join vendor on vendor.id = catalog.vendor_id
where details.media_type_id = 2;

We have two joins here - not really that bad, but if you're running this query in a high-read environment it becomes a bottleneck. We can get over that by adding an index, so it's not terribly critical. As far as "noise" goes, this, to me, is just as noisy as the RethinkDB query. But "noise" isn't actually my point; I'm after something a bit different.

Let's add the requirement that you only want the artist's name from the query:

select vendor.name from vendor
inner join catalog on vendor.id = catalog.vendor_id
inner join details on details.catalog_id = catalog.id
where details.media_type_id = 2;

I had to rewrite the first line of the query and (for clarity only) I moved the first join statement up. Not earth-shattering, but with ReQL all I need to do is attach a function:

r.db("music").table("catalog").filter(function(album){
  return album("details").contains(function(track){
    return track("media_type_id").eq(2)
  })
}).map(function(album){
  return {
    artist : album("vendor")("name")
  }
})

This is the big leap - at least it was for me. When you query with ReQL you tack on functions to "compose" your data; with SQL you tell the query engine the steps necessary (you prescribe them) to return your data:

To grok ReQL, it helps to understand functional programming. Functional programming falls into the declarative paradigm in which the programmer aims to describe the value he wishes to compute rather than prescribe the steps necessary to compute this value.

This has an upside and a downside. The upside is that, armed with a few functions, you can filter and shape your data exactly as you need - mapping, filtering, grouping and joining tables. It's pretty easy to "think your way" through the toughest query.
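The quoted distinction can be sketched with plain JavaScript arrays (hypothetical invoice data): the prescriptive version spells out the steps and mutates an accumulator, while the declarative version describes the value you want as a chain of functions, the way a ReQL pipeline does.

```javascript
// Hypothetical sketch of "describe vs prescribe" over in-memory data.
const invoices = [
  { country: "USA", total: 10 },
  { country: "Canada", total: 8 },
  { country: "USA", total: 5 }
];

// Prescriptive: spell out each step, mutate an accumulator.
let usaTotal = 0;
for (let i = 0; i < invoices.length; i++) {
  if (invoices[i].country === "USA") usaTotal += invoices[i].total;
}

// Declarative: describe the value as a pipeline of functions,
// much like chaining filter/map/reduce in a ReQL query.
const usaTotal2 = invoices
  .filter(inv => inv.country === "USA")
  .map(inv => inv.total)
  .reduce((sum, t) => sum + t, 0);

console.log(usaTotal, usaTotal2); // 15 15
```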

The downside is that your queries end up quite long and, for some, rather intimidating. Like Vim, there's almost always a better way. The syntax is forgiving and you can drop optimizations in easily (in the form of secondary indexes). For instance, in the query above I can both shorten it and make it more performant by creating a secondary index:

r.db("music").table("catalog")
  .indexCreate("media_type", r.row("details")("media_type_id"), {multi : true})

This statement creates a secondary index over the embedded details array. Since the index expression returns an array of values - each of which should become an index entry - I have to pass in {multi : true}.
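As a rough in-memory sketch of the multi-index idea (hypothetical data, not RethinkDB internals): the index expression yields an array per document, and every element of that array becomes a separate index entry pointing back at the document.

```javascript
// Hypothetical sketch of what a multi index does under the covers.
const catalog = [
  { id: 1, details: [{ media_type_id: 1 }, { media_type_id: 2 }] },
  { id: 2, details: [{ media_type_id: 2 }] },
  { id: 3, details: [{ media_type_id: 3 }] }
];

const mediaTypeIndex = new Map();
for (const album of catalog) {
  // index expression: r.row("details")("media_type_id") -> an array of ids;
  // with {multi: true}, each id becomes its own index entry.
  for (const typeId of album.details.map(t => t.media_type_id)) {
    if (!mediaTypeIndex.has(typeId)) mediaTypeIndex.set(typeId, []);
    mediaTypeIndex.get(typeId).push(album.id);
  }
}

console.log(mediaTypeIndex.get(2)); // [ 1, 2 ]
```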

Now that I have a secondary index, I can use some sugary syntax to pull out the data I want - filtering for albums with a track whose media_type_id is 2:

r.db("music").table("catalog")
  .getAll(2, {index : "media_type"})
  .map(function(album){
    return {
      artist : album("vendor")("name")
    }
  })

Shorter, faster, sweeter. Here I'm using getAll(), a specific method for querying secondary indexes: I give it a value and the index to use, and off we go. I can also pass in multiple values - say a 3 or 4 - and the query returns documents matching any of the given values (an inclusive OR).
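Here's a hypothetical in-memory sketch of that multi-key lookup: getAll with several keys returns the union of the index entries for each key. (Note that with a multi index the real getAll can hand back a document once per matching key, so you might chain .distinct(); this sketch dedupes with a Set.)

```javascript
// Hypothetical sketch of getAll(key1, key2, {index: ...}) as a union lookup.
// The index maps a key to the ids of matching documents.
const mediaTypeIndex = new Map([
  [2, [1, 5]],
  [3, [5, 7]],
  [4, [9]]
]);

function getAll(index, ...keys) {
  const seen = new Set();
  for (const key of keys)
    for (const id of index.get(key) || []) seen.add(id); // union, deduped
  return [...seen];
}

console.log(getAll(mediaTypeIndex, 2, 3)); // [ 1, 5, 7 ]
```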

Reason 2: Replication and Failure

Setting up a cluster is pretty simple with RethinkDB. With a simple set of commands you can have a complete cluster up and running in very little time. Once the servers are running, you can also use the web interface to see realtime statistics on what's going on. In addition, you can shard, replicate and index your tables individually right through the web interface:

(Screenshot: the web interface showing the cluster's servers and per-table sharding, replication and index settings)

When a server fails in your cluster - let's say Zayn decides to leave our Boy Band Cluster above - an issue is raised and things get interesting. If Zayn held the primary replica for any shard, that table becomes unavailable for writes, and RethinkDB won't automatically do anything about it. That, to me, is great.

This gives you time to consider what should happen next, as opposed to RethinkDB making a decision for you and screwing things up.

Thoughtful and intelligent. I love it.

Reason 3: Driver Support

There are three official drivers: Python, Ruby and Node. Using these drivers is just like using the Data Explorer in the web interface: ReQL remains ubiquitous.

If you're using the Node driver, for instance, here's how you would write the query above:

var r = require("rethinkdb");

r.connect({db : "music"}, function(err,conn){
  r.table("catalog")
    .getAll(2, {index : "media_type"})
    .map(function(album){
      return {
        artist : album("vendor")("name")
      }
    })
    .run(conn, function(err, cursor){
      cursor.each(function(err, row){
        console.log(row); // or use cursor.toArray()
      });
    });
});

The query itself didn't change at all - I could copy and paste it right in. I had to wrap it with connection info and a run() function, but that's it.

I like this because it allows me to play with, prove out and optimize queries in the Data Explorer before I use them in code! I also like it because my DB query is code, not a composition of strings.

Summary

There is so much more I can show you! If you're a Pluralsight subscriber, I'll have a video out in the coming weeks - but if you're not, the documentation on the site is fun to read and easy to understand - go have a play!

As you can tell, I really like RethinkDB and for good reason. It's fun as hell to use and it took very little time to get up to speed with ReQL. Administration is simple and I feel like I have complete command over what's going on.

It's another tool in the box, and I'm really happy about what it can do for me. Many congratulations to the RethinkDB team on the 2.0 launch!